What’s new in 4.0.0 (2024-11-18)
=================================

New features
------------

Transition graph
~~~~~~~~~~~~~~~~

- Added a main threshold on the canvas for easier node filtering.
- Added two-sided thresholds for filtering nodes and edges in the left sidebar.
- Improved the visualization of the incoming and outgoing edges when clicking on a node.
- Added import and export of the whole graph, not just the node layout as previously. See the :ref:`transition graph user guide` for more details.
- Thresholds in the left sidebar are synchronized with the chosen weight column.
- Removed the eye icon for hiding nodes; the switcher icon is now the only way to hide nodes from the canvas.

Segments
~~~~~~~~

Segments is a brand new feature that allows you to divide an eventstream into segments and compare them.

- Segments can be either static (e.g. user gender, marketing source, A/B-test group) or dynamic (e.g. first user date, before or after a release date, or an arbitrary user state). Segments are effectively stored in the eventstream as synthetic events.
- Some visualization tools support segment comparison. See :py:meth:`Eventstream.step_matrix()` and :py:meth:`Eventstream.transition_matrix()`.
- Segments can also be compared with the new visualization tools :py:meth:`Eventstream.segment_overview()` and :py:meth:`Eventstream.segment_diff()`.
- The :py:meth:`Eventstream.filter_events()` data processor now supports filtering by a segment value.
- The :py:meth:`Eventstream.projection()` visualization tool has been refactored. It now supports sampling to speed up the visualization of large datasets, and it makes segment comparison easier by switching the colors assigned to different segments.

Here is how you can quickly display the difference between two segment values, ``US`` vs. ``UK``, or between ``US`` and its complement ``_OUTER_`` (assuming that your eventstream has a ``country`` column):

.. code:: python

    stream = stream.add_segment('country')
    stream.step_matrix(groups=['country', 'US', 'UK'])
    stream.transition_matrix(groups=['country', 'US', '_OUTER_'])
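
The new dedicated tools mentioned above can be applied to the same segment. The snippet below is only a minimal sketch: the exact signatures of ``segment_overview()`` and ``segment_diff()`` shown here are assumptions modeled on the cluster examples later in these notes, so refer to the segments user guide for the definitive API.

.. code:: python

    # Assumed signatures -- check the segments user guide for the exact API.
    # Per-value overview of the ``country`` segment
    stream.segment_overview('country')

    # Numeric and visual comparison of two segment values
    stream.segment_diff(['country', 'US', 'UK'])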

See the :doc:`segments user guide <../user_guides/segments_and_clusters>` for more details.

Clusters
~~~~~~~~

The Clusters module has been fully reworked. The path-to-cluster mapping is stored as a segment, so all segment comparison tools are valid for cluster analysis as well. Working with clusters is now designed as a set of Eventstream class methods instead of a separate Clusters class.

- A new clustering algorithm, HDBSCAN, has been added.
- A special tool, :py:meth:`Eventstream.clusters_overview()`, has been tailored for cluster analysis.
- :py:meth:`Eventstream.segment_diff()` now allows comparing feature distributions between clusters: numerically (using the Wasserstein distance) and visually (using density plots).
- For the K-Means clustering algorithm, an elbow curve visualization has been added to help choose the optimal number of clusters.
- The :py:meth:`extract_features` method has been moved to the Eventstream methods.

Below is an example of how you can split user behavior into clusters and analyze them with a few lines of code:

.. code:: python

    features = stream.extract_features(ngram_range=(1, 1), feature_type='count')
    stream = stream.get_clusters(
        features,
        method='kmeans',
        n_clusters=8,
        random_state=42,
        segment_name='kmeans_clusters'
    )

You can then get an overview of the clusters with the features and additional custom metrics:

.. code:: python

    import pandas as pd

    # Each custom metric is defined as (metric, aggregation, display name)
    custom_metrics = [
        ('segment_size', 'mean', 'cluster size'),
        ('len', 'mean', 'path_len, mean'),
        ('has:payment_done', 'mean', 'CR: payment_done'),
        (lambda _df: (_df['event'] == 'catalog').sum(), 'median', 'catalog, median'),
        (pd.NamedAgg('timestamp', lambda s: len(s.dt.date.unique())), 'mean', 'Active days, mean')
    ]
    stream.clusters_overview('kmeans_clusters', features, aggfunc='mean', metrics=custom_metrics)

.. figure:: /_static/user_guides/segments_and_clusters/clusters_overview.png
    :width: 400

or compare the feature distributions between particular clusters:

.. code:: python

    stream.segment_diff(['kmeans_clusters', '2', '4'], features)

.. figure:: /_static/user_guides/segments_and_clusters/segment_diff.png
    :width: 600

See the :doc:`clusters user guide <../user_guides/segments_and_clusters>` for more details.

Transition matrix
~~~~~~~~~~~~~~~~~

- The transition matrix has been redesigned as a separate visualization tool.
- Group comparison is now supported.
- The default value of the ``weight_col`` argument has been changed to ``user_id``, so the values represented in the matrix are the numbers of unique users who experienced a given transition.

This is how you can look at the difference between the two segment values of a binary segment ``Apr 2020``. See the :doc:`transition matrix <../user_guides/transition_matrix>` user guide for more details.

.. code:: python

    stream.transition_matrix(groups='Apr 2020', norm_type='node')

.. figure:: /_static/user_guides/segments_and_clusters/diff_transition_matrix.png
    :width: 500

Other features
~~~~~~~~~~~~~~

- Added a new method, :py:meth:`Eventstream.path_metrics()`, for calculating arbitrary metrics for paths. Special shortcuts such as ``len``, ``has:TARGET_EVENT``, and ``time_to:TARGET_EVENT`` have been added for common metrics.

Here is a simple example of how to use it, applied to the ``cart`` event:

.. code:: python

    import pandas as pd

    metrics = [
        # Path length
        ('len', 'path_length'),
        # True if there is a cart event in the path, otherwise False
        ('has:cart', 'has_cart'),
        # Time from the path start to the first occurrence of the cart event
        ('time_to:cart', 'time_to_cart'),
        # The number of cart events in the path
        (lambda _df: (_df['event'] == 'cart').sum(), 'cart_count'),
        # The number of unique days in the path
        (pd.NamedAgg('timestamp', lambda s: len(s.dt.date.unique())), 'active_days')
    ]
    stream.path_metrics(metrics).head()

.. code:: text

             path_length  has_cart             time_to_cart  cart_count  active_days
    122915            34      True   6 days 01:22:39.090422           1            2
    463458            12     False                      NaT           0            1
    1475907           16      True  23 days 13:03:45.213509           1            2
    1576626             3     False                      NaT           0            1
    2112338             7     False                      NaT           0            1
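
Since the result of :py:meth:`Eventstream.path_metrics()` appears to be a regular pandas DataFrame (note the ``.head()`` call and the tabular output above), the computed columns can be aggregated directly. A minimal sketch reusing the ``metrics`` list defined above:

.. code:: python

    path_df = stream.path_metrics(metrics)

    # Share of paths that contain at least one cart event
    cart_rate = path_df['has_cart'].mean()

    # Median path length among the paths that do contain a cart event
    median_len_with_cart = path_df.loc[path_df['has_cart'], 'path_length'].median()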

Improvements
------------

- Python 3.12 is now supported; Python 3.8 is no longer supported.
- Many of the libraries used in the project have been updated to their latest versions. In particular, pandas 2.2, numpy 2.0, and scikit-learn 1.4 are now supported.
- Improved the performance of some tools and data processors: :py:meth:`CollapseLoops`, the :doc:`centered step matrix <../user_guides/step_matrix>`, and the :doc:`Eventstream constructor <../user_guides/eventstream>`.