What’s new in 4.0.0 (2024-11-18)

New features

Transition graph

  • Added a main threshold on the canvas for easier node filtering.

  • Added two-sided thresholds for filtering nodes and edges in the left-side panel.

  • Improved visualization of the incoming and outgoing edges when clicking on a node.

  • Added import and export of the whole graph, not just the node layout as before. See the transition graph user guide for more details.

  • Thresholds on the left-side panel are synchronized with the weight column chosen.

  • Removed the eye icon for hiding nodes. The switcher icon is now the only way to hide nodes from the canvas.
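
As a rough illustration of what a two-sided threshold does, the sketch below keeps only the edges whose weight falls inside a closed interval [low, high]. The function name filter_edges, the closed-interval behavior, and the sample weights are assumptions made for this example and are not part of the library; the actual filtering is done interactively in the transition graph interface.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def filter_edges(weights: Dict[Edge, float], low: float, high: float) -> Dict[Edge, float]:
    """Keep edges whose weight lies in the closed interval [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return {edge: w for edge, w in weights.items() if low <= w <= high}


# Hypothetical edge weights, e.g. the share of users who made each transition.
edge_weights = {
    ("main", "catalog"): 0.62,
    ("catalog", "cart"): 0.18,
    ("cart", "payment_done"): 0.04,
    ("catalog", "main"): 0.35,
}

# Hide both the rarest and the most common transitions.
visible = filter_edges(edge_weights, low=0.1, high=0.5)
print(sorted(visible))
```

Using both bounds at once makes it possible to hide dominant transitions as well as rare ones, which a single lower threshold cannot do.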

Segments

Segments is a brand new feature that allows you to divide an eventstream into segments and compare them.

Here is how you can quickly show the difference between two segments, US vs. UK, or between US and its complement _OUTER_ (assuming that your eventstream has a country column):

stream = stream.add_segment('country')
stream.step_matrix(groups=['country', 'US', 'UK'])
stream.transition_matrix(groups=['country', 'US', '_OUTER_'])

See the segments user guide for more details.
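
To make the idea of a segment value and its complement more concrete, here is a minimal plain-Python sketch that splits user paths into US and everything else, labelled _OUTER_, and compares the mean path length of the two groups. The helper split_by_segment and the sample events are hypothetical; this is not how the library implements segments, only an illustration of the grouping.

```python
from collections import defaultdict

# Hypothetical events: (user_id, country, event), ordered by time per user.
events = [
    (1, "US", "main"), (1, "US", "catalog"), (1, "US", "cart"),
    (2, "UK", "main"), (2, "UK", "catalog"),
    (3, "DE", "main"),
    (4, "US", "main"), (4, "US", "cart"),
]


def split_by_segment(rows, value):
    """Group user paths into `value` and its complement, labelled '_OUTER_'."""
    groups = {value: defaultdict(list), "_OUTER_": defaultdict(list)}
    for user_id, country, event in rows:
        key = value if country == value else "_OUTER_"
        groups[key][user_id].append(event)
    return {name: dict(paths) for name, paths in groups.items()}


groups = split_by_segment(events, "US")

# Compare one simple metric, the mean path length, between the two groups.
mean_len = {
    name: sum(len(path) for path in paths.values()) / len(paths)
    for name, paths in groups.items()
}
print(mean_len)
```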

Clusters

The Clusters module has been fully reworked. The path-cluster mapping is now stored as a segment, so all segment comparison tools are valid for cluster analysis as well. Working with clusters is now designed as a set of Eventstream class methods instead of a separate Clusters class.

  • A new clustering algorithm, HDBSCAN, has been added.

  • A special tool Eventstream.clusters_overview() has been tailored for cluster analysis.

  • Eventstream.segment_diff() now allows comparing feature distributions between clusters: numerically (using the Wasserstein distance) and visually (using density plots).

  • For the K-Means clustering algorithm, the elbow curve visualization has been added to help choose the optimal number of clusters.

  • The extract_features method has been moved to Eventstream methods.

Below is an example of how you can split user behavior into clusters and analyze them with a few lines of code:

features = stream.extract_features(ngram_range=(1, 1), feature_type='count')
stream = stream.get_clusters(
    features,
    method='kmeans',
    n_clusters=8,
    random_state=42,
    segment_name='kmeans_clusters'
)

and get an overview of the clusters with the features and additional custom metrics:

import pandas as pd

custom_metrics = [
    ('segment_size', 'mean', 'cluster size'),
    ('len', 'mean', 'path_len, mean'),
    ('has:payment_done', 'mean', 'CR: payment_done'),
    (lambda _df: (_df['event'] == 'catalog').sum(), 'median', 'catalog, median'),
    (pd.NamedAgg('timestamp', lambda s: len(s.dt.date.unique())), 'mean', 'Active days, mean')
]

stream.clusters_overview('kmeans_clusters', features, aggfunc='mean', metrics=custom_metrics)
[Image: clusters_overview.png]

or compare the feature distributions between particular clusters:

stream.segment_diff(['kmeans_clusters', '2', '4'], features)
[Image: segment_diff.png]
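
For intuition about the numeric comparison, the following sketch computes the one-dimensional Wasserstein distance between two empirical samples as the area between their cumulative distribution functions. The function wasserstein_1d and the sample values are hypothetical and written in plain Python for illustration; how segment_diff computes the distance internally is not shown here.

```python
from bisect import bisect_right
from typing import Sequence


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """First-order Wasserstein distance between two empirical 1-D samples.

    Computed as the integral of |F_a(x) - F_b(x)|, where F_a and F_b are
    the empirical CDFs. Both CDFs are constant between observed points.
    """
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a_sorted, left) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, left) / len(b_sorted)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


# Hypothetical values of one feature in two clusters.
cluster_2 = [0, 1, 3]
cluster_4 = [5, 6, 8]  # the same values shifted by 5

distance = wasserstein_1d(cluster_2, cluster_4)
print(distance)
```

A larger distance means the feature is distributed more differently in the two clusters, so sorting features by this value is one way to find what distinguishes the clusters most.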

See the clusters user guide for more details.

Transition matrix

  • Transition matrix has been redesigned as a separate visualization tool.

  • Group comparison is now supported.

  • The default value of the weight_col argument has changed to user_id, so the values in the matrix are the numbers of unique users who made a given transition.

This is how you can look at the difference between the two values of the binary segment Apr 2020. See the transition matrix user guide for more details.

stream.transition_matrix(groups='Apr 2020', norm_type='node')
[Image: diff_transition_matrix.png]
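
To illustrate what weighting by user_id means, the sketch below counts, for each transition, the number of unique users who made it at least once. The helper transition_users and the sample events are hypothetical and written in plain Python; it is not the library's implementation and it ignores normalization options such as norm_type.

```python
from collections import defaultdict

# Hypothetical events, already ordered by time within each user: (user_id, event)
events = [
    (1, "main"), (1, "catalog"), (1, "main"), (1, "catalog"),
    (2, "main"), (2, "catalog"), (2, "cart"),
]


def transition_users(rows):
    """Count unique users per (source, target) transition."""
    users_per_transition = defaultdict(set)
    previous = {}  # last event seen for each user
    for user_id, event in rows:
        if user_id in previous:
            users_per_transition[(previous[user_id], event)].add(user_id)
        previous[user_id] = event
    return {pair: len(users) for pair, users in users_per_transition.items()}


counts = transition_users(events)
print(counts)
```

User 1 makes the main to catalog transition twice, but contributes only once to its count, which is the difference between counting users and counting events.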

Other features

  • Added a new method Eventstream.path_metrics() for calculating arbitrary metrics for paths. Added special shortcuts such as len, has:TARGET_EVENT, and time_to:TARGET_EVENT for common metrics. Here is a simple example applied to the cart event:

import pandas as pd

metrics = [
    # path length
    ('len', 'path_length'),
    # True if there's a cart event in a path, otherwise False
    ('has:cart', 'has_cart'),
    # Time from the path start to the first occurrence of the cart event.
    ('time_to:cart', 'time_to_cart'),
    # The number of cart events in a path
    (lambda _df: (_df['event'] == 'cart').sum(), 'cart_count'),
    # The number of unique days in a path
    (pd.NamedAgg('timestamp', lambda s: len(s.dt.date.unique())), 'active_days')
]

stream.path_metrics(metrics).head()
         path_length  has_cart             time_to_cart  cart_count  active_days
122915            34      True   6 days 01:22:39.090422           1            2
463458            12     False                      NaT           0            1
1475907           16      True  23 days 13:03:45.213509           1            2
1576626            3     False                      NaT           0            1
2112338            7     False                      NaT           0            1
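
As a reading aid for the shortcuts, here is a plain-Python sketch of what len, has:TARGET_EVENT, and time_to:TARGET_EVENT are described as computing in the comments of the example, applied to a single path. The helper names and the sample path are hypothetical, and None stands in for the NaT value shown in the output; the library's own implementation may differ.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Path = List[Tuple[datetime, str]]

# Hypothetical single-user path: (timestamp, event), ordered by time.
path: Path = [
    (datetime(2020, 4, 1, 10, 0), "main"),
    (datetime(2020, 4, 1, 10, 5), "catalog"),
    (datetime(2020, 4, 2, 9, 30), "cart"),
]


def path_length(p: Path) -> int:
    """Number of events in the path (the 'len' shortcut)."""
    return len(p)


def has_event(p: Path, target: str) -> bool:
    """True if the target event occurs in the path (the 'has:' shortcut)."""
    return any(event == target for _, event in p)


def time_to(p: Path, target: str) -> Optional[timedelta]:
    """Time from the path start to the first target event, or None if absent."""
    start = p[0][0]
    for timestamp, event in p:
        if event == target:
            return timestamp - start
    return None


print(path_length(path), has_event(path, "cart"), time_to(path, "cart"))
```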

Improvements

  • Python 3.12 is now supported. Python 3.8 is no longer supported.

  • Many libraries used in the project have been updated to the latest versions. In particular, pandas 2.2, numpy 2.0, and scikit-learn 1.4 are now supported.

  • Improved the performance of some tools and data processors: CollapseLoops, centered step matrix, Eventstream constructor.