Segments & clusters =================== |colab| |jupyter| .. |jupyter| raw:: html Download - Jupyter Notebook .. |colab| raw:: html Google Colab Segments -------- A segment is a group of paths or subpaths united by a common feature. In practice, we often want to explore individual segments or compare them. For example, we might be interested in how the users from a particular age segment behave or what the difference in behavior between users from different countries, AB-test groups, time-based cohorts, etc. is. Retentioneering tools simplify the process of filtering and comparing segments. Once created, the paths related to a particular segment value can be easily filtered .. code:: python stream.filter_events(segment=['country', 'US']) or compared with other segment values .. code:: python # Get an overview comparing custom metric values over all countries stream.segment_overview('country', metrics=custom_metrics) # Get the difference in feature distributions between US and UK stream.segment_diff(['country' 'US', 'UK'], features) # Plot the differential step matrix for US vs its complement stream.step_matrix(groups=['country', 'US', '_OUTER_']) Tools that support segment comparison are: - :doc:`Step matrix`, - :doc:`Transition matrix`, - :doc:`Funnel`. Segment definition ~~~~~~~~~~~~~~~~~~ Each segment has its name and includes segment values. For example, a segment called ``country`` can encompass such values as ``US``, ``CA``, ``UK``, etc. Segments can be static, semi-static, or dynamic. - Static. A segment value is valid for the entire path. For example, a segment can be a user gender or an explicit marker that a user experienced a specific event (e.g. purchase or getting into an AB-test group). - Semi-static. Technically a segment is not static while it may appear so. For example, user country. In most cases a user has the same country in the entire path, but sometimes it can change (due to trips or VPN usage). Semi-static segments are often coerced to static segments. For example, we can consider a user belonging to her prevailing country for the entire path. - Dynamic. A segment value can naturally change during the path so it can be associated with a user's state. For example, we can consider a segment ``user_experience`` with 3 values: ``newbie``, ``advanced``, ``experienced`` according to how much a user has interacted with the product. This state can evolve during the path. Dynamic segments can also indicate some changes in the whole eventstream. For example, if we roll out a new product feature we can create a segment ``release_date`` with values ``before`` and ``after`` and compare the paths in these segement values. In this case the paths can belong to a segment value entirely or partially. Segment creation ~~~~~~~~~~~~~~~~ You can create a segment from a column, a pandas Series, or a custom function. In most typical scenarios a segment is created from a custom column in the eventstream constructor. The column should contain segment values for each event in the eventstream. The segment name is inherited from the column name. For example, having an eventstream as follows, we can create a segment from the ``country`` column passing it to the ``segment_cols`` argument. .. code-block:: python import pandas as pd from retentioneering.eventstream import Eventstream df = pd.DataFrame( [ [1, 'main', 'US', 'ios', '2021-01-01 00:00:00'], [1, 'catalog', 'US', 'ios', '2021-01-01 00:01:00'], [1, 'cart', 'US', 'ios', '2021-01-01 00:02:00'], [1, 'purchase', 'US', 'ios', '2021-01-01 00:03:00'], [1, 'main', 'US', 'android', '2021-01-02 00:00:00'], [2, 'main', 'UK', 'web', '2021-01-01 00:00:00'], [2, 'catalog', 'UK', 'web', '2021-01-01 00:01:00'], [2, 'main', 'UK', 'web', '2021-01-02 00:00:00'], ], columns=['user_id', 'event', 'country', 'platform', 'timestamp'] ) stream = Eventstream(df, add_start_end_events=False, segment_cols=['country']) stream.to_dataframe(drop_segment_events=False) .. raw:: html
user_id event timestamp event_type platform
0 1 country::US 2021-01-01 00:00:00 segment ios
1 1 main 2021-01-01 00:00:00 raw ios
2 1 catalog 2021-01-01 00:01:00 raw ios
3 1 cart 2021-01-01 00:02:00 raw ios
4 1 purchase 2021-01-01 00:03:00 raw ios
5 1 main 2021-01-02 00:00:00 raw android
6 2 country::UK 2021-01-01 00:00:00 segment web
7 2 main 2021-01-01 00:00:00 raw web
8 2 catalog 2021-01-01 00:01:00 raw web
9 2 main 2021-01-02 00:00:00 raw web
Eventstream stores segment information as synthetic events of a special ``segment`` event type. As we can see from the output above, a couple of such synthetic events appeared: ``country::US`` for user 1 and ``country::UK`` for user 2. This segment is static since the user's country doesn't change during the path in this particular example. We also notice that by default the segment events are hidden in the :py:meth:`Eventstream.to_dataframe()` output. To make them visible use ``drop_segment_events=False`` flag. .. note:: The sourcing column is removed from the eventstream after creating a segment since this information becomes redundant. To turn it back, you should use the :py:meth:`Eventstream.materialize_segment()` method. .. note:: For any segement created two special segment values are available for comparison: ``_OUTER_``, ``_ALL_``. They are useful when you need to comapre a particular segment value with it complement or with the whole eventstream correspondingly. These values are technical and they are not represented in eventstream. From column ^^^^^^^^^^^ Another similar option is to create a segment from a custom column explicitly using the :py:meth:`Eventstream.add_segment()` data processor. .. code:: python stream = stream.add_segment('platform') stream.to_dataframe(drop_segment_events=False) .. raw:: html
user_id event timestamp event_type
0 1 country::US 2021-01-01 00:00:00 segment
1 1 platform::ios 2021-01-01 00:00:00 segment
2 1 main 2021-01-01 00:00:00 raw
3 1 catalog 2021-01-01 00:01:00 raw
4 1 cart 2021-01-01 00:02:00 raw
5 1 purchase 2021-01-01 00:03:00 raw
6 1 platform::android 2021-01-02 00:00:00 segment
7 1 main 2021-01-02 00:00:00 raw
8 2 country::UK 2021-01-01 00:00:00 segment
9 2 platform::web 2021-01-01 00:00:00 segment
10 2 main 2021-01-01 00:00:00 raw
11 2 catalog 2021-01-01 00:01:00 raw
12 2 main 2021-01-02 00:00:00 raw
We notice that the segment ``platform`` is dynamic since user 1 changed the platform from ``ios`` to ``android`` at some point while user 2 used the same platform ``web`` within the entire path. The corresponding synthetic events ``platform::ios``, ``platform::amdroid``, and ``platform::web`` have been added to the eventstream. From Series ^^^^^^^^^^^ You can create a static segment from a pandas Series. The series should contain a segment value for each path_id in the eventstream. The segment name is inherired from the series name so it is obligatory to specify it. .. code:: python user_sources = pd.Series({1: 'facebook', 2: 'organic'}, name='source') stream.add_segment(user_sources)\ .to_dataframe(drop_segment_events=False) .. raw:: html
user_id event timestamp event_type
0 1 country::US 2021-01-01 00:00:00 segment
1 1 platform::ios 2021-01-01 00:00:00 segment
2 1 source::facebook 2021-01-01 00:00:00 segment
3 1 main 2021-01-01 00:00:00 raw
4 1 catalog 2021-01-01 00:01:00 raw
5 1 cart 2021-01-01 00:02:00 raw
6 1 purchase 2021-01-01 00:03:00 raw
7 1 platform::android 2021-01-02 00:00:00 segment
8 1 main 2021-01-02 00:00:00 raw
9 2 country::UK 2021-01-01 00:00:00 segment
10 2 platform::web 2021-01-01 00:00:00 segment
11 2 source::organic 2021-01-01 00:00:00 segment
12 2 main 2021-01-01 00:00:00 raw
13 2 catalog 2021-01-01 00:01:00 raw
14 2 main 2021-01-02 00:00:00 raw
From function ^^^^^^^^^^^^^ Besides adding a segment from a custom column, you can add a dynamic segment using an arbitrary function. The function should accept DataFrame representation of an eventstream and return a vector of segment values attributed to each event. Below we provide two examples: how to create a static and a dynamic segment from a function. Let us create a static segment ``has_purchase`` that indicates a user who purchased at least once. Besides the main argument ``segment`` that can accept a callable function, we can pass the ``name`` argument to specify the segment name. .. code:: python def add_purchased_segment(df): purchased_users = df[df['event'] == 'purchase']['user_id'].unique() has_purchase = df['user_id'].isin(purchased_users) return has_purchase stream.add_segment(segment=add_purchased_segment, name='has_purchase')\ .to_dataframe(drop_segment_events=False) .. raw:: html
user_id event timestamp event_type
0 1 country::US 2021-01-01 00:00:00 segment
1 1 platform::ios 2021-01-01 00:00:00 segment
2 1 has_purchase::True 2021-01-01 00:00:00 segment
3 1 main 2021-01-01 00:00:00 raw
4 1 catalog 2021-01-01 00:01:00 raw
5 1 cart 2021-01-01 00:02:00 raw
6 1 purchase 2021-01-01 00:03:00 raw
7 1 platform::android 2021-01-02 00:00:00 segment
8 1 main 2021-01-02 00:00:00 raw
9 2 country::UK 2021-01-01 00:00:00 segment
10 2 platform::web 2021-01-01 00:00:00 segment
11 2 has_purchase::False 2021-01-01 00:00:00 segment
12 2 main 2021-01-01 00:00:00 raw
13 2 catalog 2021-01-01 00:01:00 raw
14 2 main 2021-01-02 00:00:00 raw
As we see, ``has_purchase::True`` and ``has_purchase::False`` events have been prepended for user 1 and 2 paths correspondingly. Next, let us add a truly dynamic segment. Suppose we want to separate the first user day from the other days. .. code:: python def first_day(df): df['date'] = df['timestamp'].dt.date df['first_day'] = df.groupby('user_id')['date'].transform('min') segment_values = df['date'] == df['first_day'] return segment_values stream = stream.add_segment(first_day, name='is_first_day') stream.to_dataframe(drop_segment_events=False) .. raw:: html
user_id event timestamp event_type
0 1 country::US 2021-01-01 00:00:00 segment
1 1 platform::ios 2021-01-01 00:00:00 segment
2 1 is_first_day::True 2021-01-01 00:00:00 segment
3 1 main 2021-01-01 00:00:00 raw
4 1 catalog 2021-01-01 00:01:00 raw
5 1 cart 2021-01-01 00:02:00 raw
6 1 purchase 2021-01-01 00:03:00 raw
7 1 platform::android 2021-01-02 00:00:00 segment
8 1 is_first_day::False 2021-01-02 00:00:00 segment
9 1 main 2021-01-02 00:00:00 raw
10 2 country::UK 2021-01-01 00:00:00 segment
11 2 platform::web 2021-01-01 00:00:00 segment
12 2 is_first_day::True 2021-01-01 00:00:00 segment
13 2 main 2021-01-01 00:00:00 raw
14 2 catalog 2021-01-01 00:01:00 raw
15 2 is_first_day::False 2021-01-02 00:00:00 segment
16 2 main 2021-01-02 00:00:00 raw
As a result, two new segment events appeared for each user: ``is_first_day::True`` for the first day and ``is_first_day::False`` for the second day. Segment materizlization ~~~~~~~~~~~~~~~~~~~~~~~ Sometimes it is convenient to keep a segment not as a set of synthetic events but as an explicit column that will contain the segment values. Such a transformation can be done with the :py:meth:`Eventstream.materialize_segment()` data processor. .. code:: python stream.materialize_segment('platform')\ .to_dataframe(drop_segment_events=False) .. raw:: html
user_id event timestamp event_type platform
0 1 country::US 2021-01-01 00:00:00 segment ios
1 1 platform::ios 2021-01-01 00:00:00 segment ios
2 1 main 2021-01-01 00:00:00 raw ios
3 1 catalog 2021-01-01 00:01:00 raw ios
4 1 cart 2021-01-01 00:02:00 raw ios
5 1 purchase 2021-01-01 00:03:00 raw ios
6 1 platform::android 2021-01-02 00:00:00 segment android
7 1 main 2021-01-02 00:00:00 raw android
8 2 country::UK 2021-01-01 00:00:00 segment web
9 2 platform::web 2021-01-01 00:00:00 segment web
10 2 main 2021-01-01 00:00:00 raw web
11 2 catalog 2021-01-01 00:01:00 raw web
12 2 main 2021-01-02 00:00:00 raw web
We see that the ``platform`` column has appeared in the output. It indicates what platform each event is attributed to. The corresponding segment events are kept in the eventstream. Segment removal ~~~~~~~~~~~~~~~ To remove all synthetic events related to a segment, use the :py:meth:`Eventstream.drop_segment()` data processor. .. code:: python stream.drop_segment('platform')\ .to_dataframe(drop_segment_events=False) .. raw:: html
user_id event timestamp event_type
0 1 country::US 2021-01-01 00:00:00 segment
1 1 main 2021-01-01 00:00:00 raw
2 1 catalog 2021-01-01 00:01:00 raw
3 1 cart 2021-01-01 00:02:00 raw
4 1 purchase 2021-01-01 00:03:00 raw
5 1 main 2021-01-02 00:00:00 raw
6 2 country::UK 2021-01-01 00:00:00 segment
7 2 main 2021-01-01 00:00:00 raw
8 2 catalog 2021-01-01 00:01:00 raw
9 2 main 2021-01-02 00:00:00 raw
Segment filtering ~~~~~~~~~~~~~~~~~ To filter all the events related to a specific segment value, use the :py:meth:`Eventstream.filter_events()` data processor with the ``segment`` argument. This argument must be a list of two elements: segment name and segment value. .. code:: python stream.filter_events(segment=['country', 'UK'])\ .to_dataframe(drop_segment_events=False) .. raw:: html
user_id event event_type timestamp
0 2 country::UK segment 2021-01-01 00:00:00
1 2 main raw 2021-01-01 00:00:00
2 2 catalog raw 2021-01-01 00:01:00
3 2 main raw 2021-01-02 00:00:00
In this output we can see only the events related to the UK segment (i.e. to user 2). Segment renaming ~~~~~~~~~~~~~~~~ To rename a segment name use the :py:meth:`Eventstream.rename_segment()` data processor. Below we rename the ``country`` segment to ``user_country``. .. code:: python stream2.rename_segment(old_label='country', new_label='user_country')\ .to_dataframe(drop_segment_events=False) .. raw:: html
user_id event timestamp event_type
0 1 user_country::US 2021-01-01 00:00:00 segment
1 1 platform::ios 2021-01-01 00:00:00 segment
2 1 main 2021-01-01 00:00:00 raw
3 1 catalog 2021-01-01 00:01:00 raw
4 1 cart 2021-01-01 00:02:00 raw
5 1 purchase 2021-01-01 00:03:00 raw
6 1 platform::android 2021-01-02 00:00:00 segment
7 1 main 2021-01-02 00:00:00 raw
8 2 user_country::UK 2021-01-01 00:00:00 segment
9 2 platform::web 2021-01-01 00:00:00 segment
10 2 main 2021-01-01 00:00:00 raw
11 2 catalog 2021-01-01 00:01:00 raw
12 2 main 2021-01-02 00:00:00 raw
Segment values renaming ~~~~~~~~~~~~~~~~~~~~~~~ To rename segment values use the :py:meth:`Eventstream.remap_segment()` data processor passing a dictionary with mapping old values to new ones. .. code:: python mapping_dict = { 'US': 'United States', 'UK': 'United Kingdom' } stream.remap_segment('country', mapping_dict)\ .to_dataframe(drop_segment_events=False) .. raw:: html
user_id event timestamp event_type
0 1 country::United States 2021-01-01 00:00:00 segment
1 1 platform::ios 2021-01-01 00:00:00 segment
2 1 main 2021-01-01 00:00:00 raw
3 1 catalog 2021-01-01 00:01:00 raw
4 1 cart 2021-01-01 00:02:00 raw
5 1 purchase 2021-01-01 00:03:00 raw
6 1 platform::android 2021-01-02 00:00:00 segment
7 1 main 2021-01-02 00:00:00 raw
8 2 country::United Kingdom 2021-01-01 00:00:00 segment
9 2 platform::web 2021-01-01 00:00:00 segment
10 2 main 2021-01-01 00:00:00 raw
11 2 catalog 2021-01-01 00:01:00 raw
12 2 main 2021-01-02 00:00:00 raw
Segment mapping ~~~~~~~~~~~~~~~ The :py:meth:`Eventstream.segment_map()` method is used to get mapping between segment values and path ids. Besides the ``name`` argument representing the segment name, the method has the ``index`` argument that specifies the index of the resulting Series: either ``path_id`` (default) or ``segment_value``. .. code:: python stream.segment_map(name='country', index='path_id') .. parsed-literal:: user_id 1 US 2 UK Name: segment_value, dtype: object .. code:: python stream.segment_map(name='country', index='segment_value') .. parsed-literal:: segment_value UK 2 US 1 Name: user_id, dtype: int64 If segment is not static the index or the values of the series are not unique. .. code:: python stream.segment_map('platform') .. parsed-literal:: user_id 1 ios 1 android 2 web Name: segment_value, dtype: object In case of semi-static segments the ``resolve_collision`` argument can be used for coersing a path attribution to the most frequent segment value (``majority``) or to the last one (``last``). .. code:: python stream.segment_map(name='platform', index='path_id', resolve_collision='majority') .. parsed-literal:: user_id 1 ios 2 web Name: segment_value, dtype: object Now user 1 is associated with her dominant platform ``ios``. Finally, if ``None`` segment name is passed, the DataFrame with all segments mapping is returned. .. code:: python stream.segment_map(name=None) .. raw:: html
user_id segment_name segment_value
0 1 country US
1 1 platform ios
2 1 platform android
3 2 country UK
4 2 platform web
Segment usage ~~~~~~~~~~~~~ In this section we will use :doc:`the simple_shop dataset` to demonstrate how to use segments in practice. Let us load the dataset first. .. code:: python from retentioneering import datasets stream2 = datasets.load_simple_shop() With the :py:meth:`Eventstream.event_timestamp_hist()` histogram we can notice that the distribution of the new users is not uniform across time: starting from April 2020 the new users number has been surged. .. code:: python stream2.event_timestamp_hist(event_list=['path_start']) .. figure:: /_static/user_guides/segments_and_clusters/event_timestamp_hist.png :width: 350 Let us check whether the new users and the old users have different behavior. First of all, we need to create a segment that will distinguish them. We will consider the new users as those who started their paths after 2020-04-01. .. code:: python def add_segment_by_date(df): first_day = df.groupby('user_id')['timestamp'].min() target_index = first_day[first_day < '2020-04-01'].index segment_values = df['user_id'].isin(target_index) segment_values = segment_values.map({True: 'Before 2020-04', False: 'After 2020-04'}) return segment_values stream2 = stream2.add_segment(add_segment_by_date, label='Apr 2020') Now we can compare the behavior of the new and old users. Let us start from a very basic summary comparing the segment sizes along with a couple of conversion rates: to ``cart`` and ``payment_done`` events. The :py:meth:`Eventstream.segment_overview()` method can do this. The definitions of the metrics are similar to the ones in the :py:meth:`Eventstream.path_metrics()` method. The only difference is that a tuple defining a metric should have 3 elements instead of 2: path metric definition, a function to aggregate the path metric values over a segment, and a metric name. The same string definitions can be used extended with the ``segment_size`` literal. .. code:: python custom_metrics = [ ('segment_size', 'mean', 'segment size'), ('has:cart', 'mean', 'Conversion rate: cart'), ('has:payment_done', 'mean', 'Conversion rate: payment_done') ] stream2.segment_overview('Apr 2020', metrics=custom_metrics) .. figure:: /_static/user_guides/segments_and_clusters/segment_overview.png :width: 500 The output shows that from the convertion rates point of view the users are almost identical. The different segment sizes make no sense since the sizes could be of arbitrary values in this case. The bar chart can be used only if all the metrics are of the same scale. Otherwise it is better to use a heatmap table that can be enabled with the ``kind='heatmap'`` argument. Here we use the same metric set extended with the ``len`` and ``time_to:`` metrics. .. code::python custom_metrics = [ ('segment_size', 'mean', 'segment size'), ('len', 'mean', 'Path length, mean'), ('has:cart', 'mean', 'Conversion rate: cart'), ('has:payment_done', 'mean', 'Conversion rate: payment_done'), ('time_to:payment_done', pd.Series.median, 'Time to payment_done, mean') ] stream2.segment_overview('Apr 2020', metrics=custom_metrics, kind='heatmap') .. figure:: /_static/user_guides/segments_and_clusters/segment_overview_heatmap.png :width: 450 The default ``axis=1`` argument colorize each row separately: the minimum value in each row is deep blue, the maximum value is deep red. ``axis=0`` colorizes the table in column-wise manner. However, for segments of low cardinality the heatmap might be excessive. If we are interested in the numerical values of the table only, we can disable the plot with the ``show_plot=False`` argument and access the ``values`` property. .. code:: python stream2.segment_overview( segment_name='Apr 2020', metrics=custom_metrics, kind='heatmap', show_plot=False ).values .. raw:: html
Apr 2020 After Before
segment size 0.637 0.363
Path length, mean 9.419 12.69
Time to payment_done, mean 1.4m 2.9m
Conversion rate: cart 0.504 0.528
Conversion rate: payment_done 0.174 0.174

Next, we want to compare the behavior of the new and old users deeper. We will use the :doc:`step matrix` and :doc:`transition matrix` tools for this. Since the segment is binary, when calling these methods we can just pass its name to the ``groups`` argument not specifying the segment values. .. code:: python stream2.step_matrix(groups='Apr 2020', threshold=0) .. figure:: /_static/user_guides/segments_and_clusters/diff_step_matrix.png :width: 600 The step matrix reveals that the new users are less likely to surf the catalog and the main page: the grey values in ``catalog`` and ``main`` rows indicate that the users from ``Before`` segment visit these pages more often. Also, the paths of the new users are shorter which is shown by the brown values in the ``path_end`` row. But since the conversion rates to ``cart`` and ``payment_done`` are almost the same, it looks like the new users are more decisive. .. code:: python stream2.transition_matrix(groups='Apr 2020') .. figure:: /_static/user_guides/segments_and_clusters/diff_transition_matrix.png :width: 500 The transitioin matrix exhibits the differences in more detail. The old users tend to transit to the main page from any other page much more often than new users. On the other hand we see many differences in ``* → path_end`` transitions. It looks like the old users end up their paths primarily on the ``main`` page, while the last event for the new users is distributed more evenly. Finally, we report that the old users prefer cash payment much more than the new users: the difference in ``payment_choice → payment_cash`` transition is very high: 0.25. Clusters -------- Retentioneering provides a set of clustering tools that could automatically group users based on their behavior. In a nutshell, the clustering process consists of the following steps: - Path vectorization. Represent the event sequences as a matrix where each row corresponds to a particular path and each column corresponds to a path feature. - Clustering. Apply a clustering algorithm using the feature matrix from the previous step. - Cluster analysis. Analyze the clusters to understand the differences between them. Vectorization ~~~~~~~~~~~~~ First, we calculate the feature set using the :py:meth:`Eventstream.extract_features()` method. Here we use the simplest configuration: unigrams with the ``count`` feature type. .. code:: python features = stream2.extract_features(ngram_range=(1, 1), feature_type='count') .. raw:: html
main catalog ... cart payment_done
user_id
122915 7 18 ... 1 0
463458 1 8 ... 0 0
1475907 2 5 ... 1 0
1576626 1 0 ... 0 0
2112338 2 3 ... 0 0
Now each path is represented as "bag of unigrams": a vector of event counts. More feature types such as ``tfidf``, ``binary``, or time-related features are available. See the :py:meth:`Eventstream.extract_features()` method documentation for more details. Clustering algorithms ~~~~~~~~~~~~~~~~~~~~~ Next, we apply a clustering algorithm to the feature matrix. The :py:meth:`Eventstream.get_clusters()` method supports three algorithms: KMeans, HDBSCAN, and GMM. Here we will use KMeans as the most common one. Actually, this algorithm requires the number of clusters to be specified. However, :py:meth:`Eventstream.get_clusters()` shows the elbow curve plot to help you choose the optimal number of clusters in case you didn't specify any. .. code:: python stream2.get_clusters(features, method='kmeans') .. figure:: /_static/user_guides/segments_and_clusters/elbow_curve.png :width: 400 This elbow curve appeared to be too smooth to determine the optimal number of clusters as it often happens in practice. In this case, the number of clusters can be chosen basing on the cluster sizes (they should not be too small). We assume that 8 clusters would be enough and call :py:meth:`Eventstream.get_clusters()` again with the ``n_clusters`` argument set to 8. The cluster partitioning is stored as a regular segment and we can set its name using the ``segment_name`` argument. .. code:: python stream2 = stream2.get_clusters( features, method='kmeans', n_clusters=8, random_state=42, segment_name='kmeans_clusters' ) Cluster analysis ~~~~~~~~~~~~~~~~ Now ``kmeans_clusters`` segment is available for any segment analysis described above. However, since cluster analysis is strongly relate to the feature space that induces the clustering, we can use the :py:meth:`Eventstream.clusters_overview()` method that better suits this analysis. .. code:: python stream2.clusters_overview('kmeans_clusters', features, aggfunc='mean', metrics=custom_metrics) .. figure:: /_static/user_guides/segments_and_clusters/clusters_overview.png :width: 400 The columns of this table describe the aggregated feature and custom metric values for each cluster. The heatmap (with the default ``axis=1`` parameter) shows how the features values vary across the clusters. For example, the users from the cluster 0 have the lowest activity (all the values in the column 0 are colored in deep blue) while the size of the cluster is the highest (36.4%). In case we can not see the difference between some clusters clearly looking at the aggregated values only, we can use the :py:meth:`Eventstream.segment_diff()` method to compare a pair of clusters directly. For example, clusters 2 and 4 look very similar. Let us compare them. .. code:: python stream2.segment_diff(['kmeans_clusters', '2', '4'], features) .. figure:: /_static/user_guides/segments_and_clusters/segment_diff.png :width: 600 Now we clearly see that the biggest difference between the clusters 2 and 4 is in the ``catalog`` and ``main`` events distribution. With a help of the special ``_OUTER_`` literal you can explore the difference of a cluster with the rest of the clusters. .. code:: python stream2.segment_diff(['kmeans_clusters', '2', '_OUTER_'], features) .. figure:: /_static/user_guides/segments_and_clusters/segment_diff_outer.png :width: 600 Finally, we can label the clusters with meaningful names using the :py:meth:`Eventstream.remap_segment()` method and make the same overview visualization for a publication. .. code:: python cluster_labels = { '0': 'passers_by', '1': 'aimless_1', '2': 'somewhat_interested_1', '3': 'mildly_purchasing', '4': 'somewhat_interested_2', '5': 'aimless_2', '6': 'active', '7': 'super_active' } stream2 = stream2.remap_segment('kmeans_clusters', cluster_labels) .. figure:: /_static/user_guides/segments_and_clusters/clusters_overview_2.png :width: 400 In case we need to get 2D representation of the clusters we can use the :py:meth:`Eventstream.projection()` method. t-SNE and UMAP algorithms are supported. A fancy dropdown menu allows you to switch between segments. This might be useful when you want to compare multiple clustering versions treating them as different segments. Since the projection is a computationally expensive operation, it is recommended to use the ``sample_size`` argument to reduce the number of paths along with the ``random_state`` argument to make the results reproducible. .. code:: python stream2.projection(features=features, sample_size=3000, random_state=42) .. figure:: /_static/user_guides/segments_and_clusters/projection.png :width: 500