What’s new in 4.0.0 (2024-11-18)
=================================

New features
------------

Transition graph
~~~~~~~~~~~~~~~~

- Added a main threshold on the canvas for easier node filtering.
- Added two-sided thresholds for filtering nodes and edges in the left sidebar.
- Improved the visualization of the incoming and outgoing edges when clicking on a node.
- Added import and export of the whole graph, not just the node layout as previously. See the :ref:`transition graph user guide` for more details.
- Thresholds in the left sidebar are synchronized with the chosen weight column.
- Removed the eye icon for hiding nodes; the switcher icon is now the only way to hide nodes from the canvas.

Segments
~~~~~~~~

Segments is a brand new feature that allows you to divide an eventstream into segments and compare them.

- Segments can be either static (e.g. user gender, marketing source, A/B-test group) or dynamic (e.g. first user date, before or after a release date, or an arbitrary user state). Segments are effectively stored in the eventstream as synthetic events.
- Some visualization tools support segment comparison. See :py:meth:`Eventstream.step_matrix()` and :py:meth:`Eventstream.transition_matrix()`.
- Segments can also be compared with the new visualization tools :py:meth:`Eventstream.segment_overview()` and :py:meth:`Eventstream.segment_diff()`.
- The :py:meth:`Eventstream.filter_events()` data processor now supports filtering by a segment value.
- The :py:meth:`Eventstream.projection()` visualization tool has been refactored. It now supports sampling to speed up the visualization of large datasets, and it makes segment comparison easier by switching the colors assigned to different segments.

Here is how you can quickly display the difference between two segment values, ``US`` vs. ``UK``, or between ``US`` and its complement ``_OUTER_`` (assuming that your eventstream has a ``country`` column):

.. code:: python

    stream = stream.add_segment('country')
    stream.step_matrix(groups=['country', 'US', 'UK'])
    stream.transition_matrix(groups=['country', 'US', '_OUTER_'])
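
The new dedicated tools mentioned above can be applied to the same segment. The snippet below is only a minimal sketch: the exact signatures of ``segment_overview()`` and ``segment_diff()`` shown here are assumptions modeled on the cluster examples later in these notes, so refer to the segments user guide for the definitive API.

.. code:: python

    # Assumed signatures -- check the segments user guide for the exact API.
    # Per-value overview of the ``country`` segment
    stream.segment_overview('country')

    # Numeric and visual comparison of two segment values
    stream.segment_diff(['country', 'US', 'UK'])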

See the :doc:`segments user guide <../user_guides/segments_and_clusters>` for more details.

Clusters
~~~~~~~~

The Clusters module has been fully reworked. The path-to-cluster mapping is stored as a segment, so all segment comparison tools are valid for cluster analysis as well. Working with clusters is now designed as a set of Eventstream class methods instead of a separate Clusters class.

- A new clustering algorithm, HDBSCAN, has been added.
- A special tool, :py:meth:`Eventstream.clusters_overview()`, has been tailored for cluster analysis.
- :py:meth:`Eventstream.segment_diff()` now allows comparing feature distributions between clusters: numerically (using the Wasserstein distance) and visually (using density plots).
- For the K-Means clustering algorithm, an elbow curve visualization has been added to help choose the optimal number of clusters.
- The :py:meth:`extract_features` method has been moved to the Eventstream methods.

Below is an example of how you can split user behavior into clusters and analyze them with a few lines of code:

.. code:: python

    features = stream.extract_features(ngram_range=(1, 1), feature_type='count')
    stream = stream.get_clusters(
        features,
        method='kmeans',
        n_clusters=8,
        random_state=42,
        segment_name='kmeans_clusters'
    )

You can then get an overview of the clusters with the features and additional custom metrics:

.. code:: python

    import pandas as pd

    # Each custom metric is defined as (metric, aggregation, display name)
    custom_metrics = [
        ('segment_size', 'mean', 'cluster size'),
        ('len', 'mean', 'path_len, mean'),
        ('has:payment_done', 'mean', 'CR: payment_done'),
        (lambda _df: (_df['event'] == 'catalog').sum(), 'median', 'catalog, median'),
        (pd.NamedAgg('timestamp', lambda s: len(s.dt.date.unique())), 'mean', 'Active days, mean')
    ]
    stream.clusters_overview('kmeans_clusters', features, aggfunc='mean', metrics=custom_metrics)

.. figure:: /_static/user_guides/segments_and_clusters/clusters_overview.png
    :width: 400

or compare the feature distributions between particular clusters:

.. code:: python

    stream.segment_diff(['kmeans_clusters', '2', '4'], features)

.. figure:: /_static/user_guides/segments_and_clusters/segment_diff.png
    :width: 600

See the :doc:`clusters user guide <../user_guides/segments_and_clusters>` for more details.

Transition matrix
~~~~~~~~~~~~~~~~~

- The transition matrix has been redesigned as a separate visualization tool.
- Group comparison is now supported.
- The default value of the ``weight_col`` argument has been changed to ``user_id``, so the values represented in the matrix are the numbers of unique users who experienced a given transition.

This is how you can look at the difference between the two segment values of a binary segment ``Apr 2020``. See the :doc:`transition matrix <../user_guides/transition_matrix>` user guide for more details.

.. code:: python

    stream.transition_matrix(groups='Apr 2020', norm_type='node')

.. figure:: /_static/user_guides/segments_and_clusters/diff_transition_matrix.png
    :width: 500

Other features
~~~~~~~~~~~~~~

- Added a new method, :py:meth:`Eventstream.path_metrics()`, for calculating arbitrary metrics for paths. Special shortcuts such as ``len``, ``has:TARGET_EVENT``, and ``time_to:TARGET_EVENT`` have been added for common metrics.

Here is a simple example of how to use it, applied to the ``cart`` event:

.. code:: python

    import pandas as pd

    metrics = [
        # Path length
        ('len', 'path_length'),
        # True if there is a cart event in the path, otherwise False
        ('has:cart', 'has_cart'),
        # Time from the path start to the first occurrence of the cart event
        ('time_to:cart', 'time_to_cart'),
        # The number of cart events in the path
        (lambda _df: (_df['event'] == 'cart').sum(), 'cart_count'),
        # The number of unique days in the path
        (pd.NamedAgg('timestamp', lambda s: len(s.dt.date.unique())), 'active_days')
    ]
    stream.path_metrics(metrics).head()

.. code:: text

             path_length  has_cart             time_to_cart  cart_count  active_days
    122915            34      True   6 days 01:22:39.090422           1            2
    463458            12     False                      NaT           0            1
    1475907           16      True  23 days 13:03:45.213509           1            2
    1576626             3     False                      NaT           0            1
    2112338             7     False                      NaT           0            1
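
Since the result of :py:meth:`Eventstream.path_metrics()` appears to be a regular pandas DataFrame (note the ``.head()`` call and the tabular output above), the computed columns can be aggregated directly. A minimal sketch reusing the ``metrics`` list defined above:

.. code:: python

    path_df = stream.path_metrics(metrics)

    # Share of paths that contain at least one cart event
    cart_rate = path_df['has_cart'].mean()

    # Median path length among the paths that do contain a cart event
    median_len_with_cart = path_df.loc[path_df['has_cart'], 'path_length'].median()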

Improvements
------------

- Python 3.12 is now supported; Python 3.8 is no longer supported.
- Many of the libraries used in the project have been updated to their latest versions. In particular, pandas 2.2, numpy 2.0, and scikit-learn 1.4 are now supported.
- Improved the performance of some tools and data processors: :py:meth:`CollapseLoops`, the :doc:`centered step matrix <../user_guides/step_matrix>`, and the :doc:`Eventstream constructor <../user_guides/eventstream>`.