Segments & clusters
===================
Segments
--------
A segment is a group of paths or subpaths united by a common feature. In practice, we often want to explore individual segments or compare them: for example, how the users from a particular age group behave, or how behavior differs between countries, A/B-test groups, time-based cohorts, etc. Retentioneering tools simplify filtering and comparing segments. Once a segment is created, the paths related to a particular segment value can be easily filtered

.. code:: python

    stream.filter_events(segment=['country', 'US'])
or compared with other segment values

.. code:: python

    # Get an overview comparing custom metric values over all countries
    stream.segment_overview('country', metrics=custom_metrics)

    # Get the difference in feature distributions between US and UK
    stream.segment_diff(['country', 'US', 'UK'], features)

    # Plot the differential step matrix for US vs its complement
    stream.step_matrix(groups=['country', 'US', '_OUTER_'])
Tools that support segment comparison are:

- :doc:`Step matrix`,
- :doc:`Transition matrix`,
- :doc:`Funnel`.
Segment definition
~~~~~~~~~~~~~~~~~~
Each segment has a name and comprises segment values. For example, a segment called ``country`` can encompass such values as ``US``, ``CA``, ``UK``, etc. Segments can be static, semi-static, or dynamic.

- Static. A segment value is valid for the entire path. For example, a segment can be a user's gender or an explicit marker that a user experienced a specific event (e.g. a purchase or getting into an A/B-test group).
- Semi-static. Technically the segment is not static, although it may appear so. For example, user country: in most cases a user keeps the same country for the entire path, but sometimes it changes (due to trips or VPN usage). Semi-static segments are often coerced to static ones, e.g. by attributing a user to their prevailing country for the entire path.
- Dynamic. A segment value can naturally change during the path, so it can be associated with a user's state. For example, consider a segment ``user_experience`` with 3 values: ``newbie``, ``advanced``, ``experienced``, according to how much a user has interacted with the product. This state can evolve during the path. Dynamic segments can also indicate changes in the whole eventstream. For example, if we roll out a new product feature, we can create a segment ``release_date`` with values ``before`` and ``after`` and compare the paths in these segment values. In this case a path can belong to a segment value entirely or partially.
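For illustration, coercing a semi-static segment to a static one amounts to a per-path aggregation. Below is a minimal plain-pandas sketch with hypothetical data (the library handles this internally; the column names here are illustrative only):

.. code:: python

    import pandas as pd

    # Hypothetical event log slice: user 1's country flips once (trip or VPN).
    df = pd.DataFrame({
        'user_id': [1, 1, 1, 2, 2],
        'country': ['US', 'US', 'DE', 'UK', 'UK'],
    })

    # Coerce the semi-static segment to a static one by attributing each
    # user to their prevailing (most frequent) country.
    prevailing = df.groupby('user_id')['country'].agg(lambda s: s.mode().iloc[0])
    df['country_static'] = df['user_id'].map(prevailing)
    print(df['country_static'].tolist())  # ['US', 'US', 'US', 'UK', 'UK']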
Segment creation
~~~~~~~~~~~~~~~~
You can create a segment from a column, a pandas Series, or a custom function. In the most typical scenario, a segment is created from a custom column in the eventstream constructor. The column should contain a segment value for each event in the eventstream, and the segment name is inherited from the column name. For example, given the following eventstream, we can create a segment from the ``country`` column by passing it to the ``segment_cols`` argument.

.. code-block:: python

    import pandas as pd
    from retentioneering.eventstream import Eventstream

    df = pd.DataFrame(
        [
            [1, 'main', 'US', 'ios', '2021-01-01 00:00:00'],
            [1, 'catalog', 'US', 'ios', '2021-01-01 00:01:00'],
            [1, 'cart', 'US', 'ios', '2021-01-01 00:02:00'],
            [1, 'purchase', 'US', 'ios', '2021-01-01 00:03:00'],
            [1, 'main', 'US', 'android', '2021-01-02 00:00:00'],
            [2, 'main', 'UK', 'web', '2021-01-01 00:00:00'],
            [2, 'catalog', 'UK', 'web', '2021-01-01 00:01:00'],
            [2, 'main', 'UK', 'web', '2021-01-02 00:00:00'],
        ],
        columns=['user_id', 'event', 'country', 'platform', 'timestamp']
    )
    stream = Eventstream(df, add_start_end_events=False, segment_cols=['country'])
    stream.to_dataframe(drop_segment_events=False)
.. parsed-literal::

       user_id        event            timestamp event_type platform
    0        1  country::US  2021-01-01 00:00:00    segment      ios
    1        1         main  2021-01-01 00:00:00        raw      ios
    2        1      catalog  2021-01-01 00:01:00        raw      ios
    3        1         cart  2021-01-01 00:02:00        raw      ios
    4        1     purchase  2021-01-01 00:03:00        raw      ios
    5        1         main  2021-01-02 00:00:00        raw  android
    6        2  country::UK  2021-01-01 00:00:00    segment      web
    7        2         main  2021-01-01 00:00:00        raw      web
    8        2      catalog  2021-01-01 00:01:00        raw      web
    9        2         main  2021-01-02 00:00:00        raw      web
The eventstream stores segment information as synthetic events of a special ``segment`` event type. As we can see from the output above, two such synthetic events appeared: ``country::US`` for user 1 and ``country::UK`` for user 2. This segment is static since the users' countries do not change during their paths in this particular example.

Note that by default segment events are hidden in the :py:meth:`Eventstream.to_dataframe()` output. To make them visible, use the ``drop_segment_events=False`` flag.
.. note::

    The sourcing column is removed from the eventstream after creating a segment since this information becomes redundant. To bring it back, use the :py:meth:`Eventstream.materialize_segment()` method.
.. note::

    For any segment created, two special segment values are available for comparison: ``_OUTER_`` and ``_ALL_``. They are useful when you need to compare a particular segment value with its complement or with the whole eventstream, respectively. These values are technical and are not represented in the eventstream.
From column
^^^^^^^^^^^
Another similar option is to create a segment from a custom column explicitly using the :py:meth:`Eventstream.add_segment()` data processor.
.. code:: python

    stream = stream.add_segment('platform')
    stream.to_dataframe(drop_segment_events=False)
.. parsed-literal::

       user_id              event            timestamp event_type
    0        1        country::US  2021-01-01 00:00:00    segment
    1        1      platform::ios  2021-01-01 00:00:00    segment
    2        1               main  2021-01-01 00:00:00        raw
    3        1            catalog  2021-01-01 00:01:00        raw
    4        1               cart  2021-01-01 00:02:00        raw
    5        1           purchase  2021-01-01 00:03:00        raw
    6        1  platform::android  2021-01-02 00:00:00    segment
    7        1               main  2021-01-02 00:00:00        raw
    8        2        country::UK  2021-01-01 00:00:00    segment
    9        2      platform::web  2021-01-01 00:00:00    segment
    10       2               main  2021-01-01 00:00:00        raw
    11       2            catalog  2021-01-01 00:01:00        raw
    12       2               main  2021-01-02 00:00:00        raw
We notice that the segment ``platform`` is dynamic since user 1 changed the platform from ``ios`` to ``android`` at some point, while user 2 used the same platform ``web`` within the entire path. The corresponding synthetic events ``platform::ios``, ``platform::android``, and ``platform::web`` have been added to the eventstream.
From Series
^^^^^^^^^^^
You can create a static segment from a pandas Series. The Series should contain a segment value for each path_id in the eventstream. The segment name is inherited from the Series name, so specifying the name is obligatory.

.. code:: python

    user_sources = pd.Series({1: 'facebook', 2: 'organic'}, name='source')
    stream.add_segment(user_sources)\
        .to_dataframe(drop_segment_events=False)
.. parsed-literal::

       user_id              event            timestamp event_type
    0        1        country::US  2021-01-01 00:00:00    segment
    1        1      platform::ios  2021-01-01 00:00:00    segment
    2        1   source::facebook  2021-01-01 00:00:00    segment
    3        1               main  2021-01-01 00:00:00        raw
    4        1            catalog  2021-01-01 00:01:00        raw
    5        1               cart  2021-01-01 00:02:00        raw
    6        1           purchase  2021-01-01 00:03:00        raw
    7        1  platform::android  2021-01-02 00:00:00    segment
    8        1               main  2021-01-02 00:00:00        raw
    9        2        country::UK  2021-01-01 00:00:00    segment
    10       2      platform::web  2021-01-01 00:00:00    segment
    11       2    source::organic  2021-01-01 00:00:00    segment
    12       2               main  2021-01-01 00:00:00        raw
    13       2            catalog  2021-01-01 00:01:00        raw
    14       2               main  2021-01-02 00:00:00        raw
From function
^^^^^^^^^^^^^
Besides adding a segment from a custom column, you can add a dynamic segment using an arbitrary function. The function should accept a DataFrame representation of an eventstream and return a vector of segment values attributed to each event. Below we provide two examples: creating a static and a dynamic segment from a function.

Let us create a static segment ``has_purchase`` that indicates a user who purchased at least once. Besides the main ``segment`` argument, which can accept a callable, we can pass the ``name`` argument to specify the segment name.

.. code:: python

    def add_purchased_segment(df):
        purchased_users = df[df['event'] == 'purchase']['user_id'].unique()
        has_purchase = df['user_id'].isin(purchased_users)
        return has_purchase

    stream.add_segment(segment=add_purchased_segment, name='has_purchase')\
        .to_dataframe(drop_segment_events=False)
.. parsed-literal::

       user_id                event            timestamp event_type
    0        1          country::US  2021-01-01 00:00:00    segment
    1        1        platform::ios  2021-01-01 00:00:00    segment
    2        1   has_purchase::True  2021-01-01 00:00:00    segment
    3        1                 main  2021-01-01 00:00:00        raw
    4        1              catalog  2021-01-01 00:01:00        raw
    5        1                 cart  2021-01-01 00:02:00        raw
    6        1             purchase  2021-01-01 00:03:00        raw
    7        1    platform::android  2021-01-02 00:00:00    segment
    8        1                 main  2021-01-02 00:00:00        raw
    9        2          country::UK  2021-01-01 00:00:00    segment
    10       2        platform::web  2021-01-01 00:00:00    segment
    11       2  has_purchase::False  2021-01-01 00:00:00    segment
    12       2                 main  2021-01-01 00:00:00        raw
    13       2              catalog  2021-01-01 00:01:00        raw
    14       2                 main  2021-01-02 00:00:00        raw
As we see, the ``has_purchase::True`` and ``has_purchase::False`` events have been prepended to the paths of users 1 and 2, respectively.

Next, let us add a truly dynamic segment. Suppose we want to separate each user's first day from the other days.

.. code:: python

    def first_day(df):
        df['date'] = df['timestamp'].dt.date
        df['first_day'] = df.groupby('user_id')['date'].transform('min')
        segment_values = df['date'] == df['first_day']
        return segment_values

    stream = stream.add_segment(first_day, name='is_first_day')
    stream.to_dataframe(drop_segment_events=False)
.. parsed-literal::

       user_id                event            timestamp event_type
    0        1          country::US  2021-01-01 00:00:00    segment
    1        1        platform::ios  2021-01-01 00:00:00    segment
    2        1   is_first_day::True  2021-01-01 00:00:00    segment
    3        1                 main  2021-01-01 00:00:00        raw
    4        1              catalog  2021-01-01 00:01:00        raw
    5        1                 cart  2021-01-01 00:02:00        raw
    6        1             purchase  2021-01-01 00:03:00        raw
    7        1    platform::android  2021-01-02 00:00:00    segment
    8        1  is_first_day::False  2021-01-02 00:00:00    segment
    9        1                 main  2021-01-02 00:00:00        raw
    10       2          country::UK  2021-01-01 00:00:00    segment
    11       2        platform::web  2021-01-01 00:00:00    segment
    12       2   is_first_day::True  2021-01-01 00:00:00    segment
    13       2                 main  2021-01-01 00:00:00        raw
    14       2              catalog  2021-01-01 00:01:00        raw
    15       2  is_first_day::False  2021-01-02 00:00:00    segment
    16       2                 main  2021-01-02 00:00:00        raw
As a result, two new segment events appeared for each user: ``is_first_day::True`` for the first day and ``is_first_day::False`` for the second day.
Segment materialization
~~~~~~~~~~~~~~~~~~~~~~~
Sometimes it is convenient to keep a segment not as a set of synthetic events but as an explicit column that will contain the segment values. Such a transformation can be done with the :py:meth:`Eventstream.materialize_segment()` data processor.
.. code:: python

    stream.materialize_segment('platform')\
        .to_dataframe(drop_segment_events=False)
.. parsed-literal::

       user_id              event            timestamp event_type platform
    0        1        country::US  2021-01-01 00:00:00    segment      ios
    1        1      platform::ios  2021-01-01 00:00:00    segment      ios
    2        1               main  2021-01-01 00:00:00        raw      ios
    3        1            catalog  2021-01-01 00:01:00        raw      ios
    4        1               cart  2021-01-01 00:02:00        raw      ios
    5        1           purchase  2021-01-01 00:03:00        raw      ios
    6        1  platform::android  2021-01-02 00:00:00    segment  android
    7        1               main  2021-01-02 00:00:00        raw  android
    8        2        country::UK  2021-01-01 00:00:00    segment      web
    9        2      platform::web  2021-01-01 00:00:00    segment      web
    10       2               main  2021-01-01 00:00:00        raw      web
    11       2            catalog  2021-01-01 00:01:00        raw      web
    12       2               main  2021-01-02 00:00:00        raw      web
We see that the ``platform`` column has appeared in the output. It indicates what platform each event is attributed to. The corresponding segment events are kept in the eventstream.
Segment removal
~~~~~~~~~~~~~~~
To remove all synthetic events related to a segment, use the :py:meth:`Eventstream.drop_segment()` data processor.
.. code:: python

    stream.drop_segment('platform')\
        .to_dataframe(drop_segment_events=False)
.. parsed-literal::

       user_id        event            timestamp event_type
    0        1  country::US  2021-01-01 00:00:00    segment
    1        1         main  2021-01-01 00:00:00        raw
    2        1      catalog  2021-01-01 00:01:00        raw
    3        1         cart  2021-01-01 00:02:00        raw
    4        1     purchase  2021-01-01 00:03:00        raw
    5        1         main  2021-01-02 00:00:00        raw
    6        2  country::UK  2021-01-01 00:00:00    segment
    7        2         main  2021-01-01 00:00:00        raw
    8        2      catalog  2021-01-01 00:01:00        raw
    9        2         main  2021-01-02 00:00:00        raw
Segment filtering
~~~~~~~~~~~~~~~~~
To filter all the events related to a specific segment value, use the :py:meth:`Eventstream.filter_events()` data processor with the ``segment`` argument. This argument must be a list of two elements: segment name and segment value.
.. code:: python

    stream.filter_events(segment=['country', 'UK'])\
        .to_dataframe(drop_segment_events=False)
.. parsed-literal::

       user_id        event event_type            timestamp
    0        2  country::UK    segment  2021-01-01 00:00:00
    1        2         main        raw  2021-01-01 00:00:00
    2        2      catalog        raw  2021-01-01 00:01:00
    3        2         main        raw  2021-01-02 00:00:00
In this output we can see only the events related to the UK segment (i.e. to user 2).
Segment renaming
~~~~~~~~~~~~~~~~
To rename a segment, use the :py:meth:`Eventstream.rename_segment()` data processor. Below we rename the ``country`` segment to ``user_country``.

.. code:: python

    stream.rename_segment(old_label='country', new_label='user_country')\
        .to_dataframe(drop_segment_events=False)
.. parsed-literal::

       user_id              event            timestamp event_type
    0        1   user_country::US  2021-01-01 00:00:00    segment
    1        1      platform::ios  2021-01-01 00:00:00    segment
    2        1               main  2021-01-01 00:00:00        raw
    3        1            catalog  2021-01-01 00:01:00        raw
    4        1               cart  2021-01-01 00:02:00        raw
    5        1           purchase  2021-01-01 00:03:00        raw
    6        1  platform::android  2021-01-02 00:00:00    segment
    7        1               main  2021-01-02 00:00:00        raw
    8        2   user_country::UK  2021-01-01 00:00:00    segment
    9        2      platform::web  2021-01-01 00:00:00    segment
    10       2               main  2021-01-01 00:00:00        raw
    11       2            catalog  2021-01-01 00:01:00        raw
    12       2               main  2021-01-02 00:00:00        raw
Segment values renaming
~~~~~~~~~~~~~~~~~~~~~~~
To rename segment values, use the :py:meth:`Eventstream.remap_segment()` data processor, passing a dictionary that maps old values to new ones.

.. code:: python

    mapping_dict = {
        'US': 'United States',
        'UK': 'United Kingdom'
    }
    stream.remap_segment('country', mapping_dict)\
        .to_dataframe(drop_segment_events=False)
.. parsed-literal::

       user_id                    event            timestamp event_type
    0        1   country::United States  2021-01-01 00:00:00    segment
    1        1            platform::ios  2021-01-01 00:00:00    segment
    2        1                     main  2021-01-01 00:00:00        raw
    3        1                  catalog  2021-01-01 00:01:00        raw
    4        1                     cart  2021-01-01 00:02:00        raw
    5        1                 purchase  2021-01-01 00:03:00        raw
    6        1        platform::android  2021-01-02 00:00:00    segment
    7        1                     main  2021-01-02 00:00:00        raw
    8        2  country::United Kingdom  2021-01-01 00:00:00    segment
    9        2            platform::web  2021-01-01 00:00:00    segment
    10       2                     main  2021-01-01 00:00:00        raw
    11       2                  catalog  2021-01-01 00:01:00        raw
    12       2                     main  2021-01-02 00:00:00        raw
Segment mapping
~~~~~~~~~~~~~~~
The :py:meth:`Eventstream.segment_map()` method is used to get the mapping between segment values and path ids. Besides the ``name`` argument representing the segment name, the method has the ``index`` argument that specifies the index of the resulting Series: either ``path_id`` (default) or ``segment_value``.

.. code:: python

    stream.segment_map(name='country', index='path_id')

.. parsed-literal::

    user_id
    1    US
    2    UK
    Name: segment_value, dtype: object
.. code:: python

    stream.segment_map(name='country', index='segment_value')

.. parsed-literal::

    segment_value
    UK    2
    US    1
    Name: user_id, dtype: int64
If a segment is not static, the index or the values of the resulting Series are not unique.

.. code:: python

    stream.segment_map('platform')

.. parsed-literal::

    user_id
    1        ios
    1    android
    2        web
    Name: segment_value, dtype: object
For semi-static segments, the ``resolve_collision`` argument can be used to coerce the path attribution to the most frequent segment value (``majority``) or to the last one (``last``).

.. code:: python

    stream.segment_map(name='platform', index='path_id', resolve_collision='majority')

.. parsed-literal::

    user_id
    1    ios
    2    web
    Name: segment_value, dtype: object
Now user 1 is associated with her dominant platform ``ios``.
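Assuming these semantics, the two collision-resolution strategies amount to per-path aggregations. A plain-pandas sketch with hypothetical data (not the library's internals):

.. code:: python

    import pandas as pd

    # Hypothetical per-event segment values mirroring the example above.
    seg = pd.DataFrame({
        'user_id':  [1, 1, 1, 2],
        'platform': ['ios', 'ios', 'android', 'web'],
    })

    # 'majority': attribute each path to its most frequent segment value.
    majority = seg.groupby('user_id')['platform'].agg(lambda s: s.value_counts().idxmax())

    # 'last': attribute each path to the last observed segment value.
    last = seg.groupby('user_id')['platform'].last()

    print(majority.to_dict())  # {1: 'ios', 2: 'web'}
    print(last.to_dict())      # {1: 'android', 2: 'web'}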
Finally, if ``None`` is passed as the segment name, a DataFrame with the mapping for all segments is returned.

.. code:: python

    stream.segment_map(name=None)
.. parsed-literal::

       user_id segment_name segment_value
    0        1      country            US
    1        1     platform           ios
    2        1     platform       android
    3        2      country            UK
    4        2     platform           web
Segment usage
~~~~~~~~~~~~~
In this section we will use :doc:`the simple_shop dataset` to demonstrate how to use segments in practice. Let us load the dataset first.
.. code:: python

    from retentioneering import datasets

    stream2 = datasets.load_simple_shop()
With the :py:meth:`Eventstream.event_timestamp_hist()` histogram we notice that the distribution of new users is not uniform over time: starting from April 2020, the number of new users surges.

.. code:: python

    stream2.event_timestamp_hist(event_list=['path_start'])
.. figure:: /_static/user_guides/segments_and_clusters/event_timestamp_hist.png
    :width: 350
Let us check whether the new and old users behave differently. First of all, we need a segment that distinguishes them. We will consider new users as those who started their paths after 2020-04-01.

.. code:: python

    def add_segment_by_date(df):
        first_day = df.groupby('user_id')['timestamp'].min()
        target_index = first_day[first_day < '2020-04-01'].index
        segment_values = df['user_id'].isin(target_index)
        segment_values = segment_values.map({True: 'Before 2020-04', False: 'After 2020-04'})
        return segment_values

    stream2 = stream2.add_segment(add_segment_by_date, name='Apr 2020')
Now we can compare the behavior of the new and old users. Let us start with a basic summary comparing the segment sizes along with a couple of conversion rates: to the ``cart`` and ``payment_done`` events. The :py:meth:`Eventstream.segment_overview()` method does this. The metric definitions are similar to those in the :py:meth:`Eventstream.path_metrics()` method, with one difference: a tuple defining a metric has 3 elements instead of 2: a path metric definition, a function that aggregates the path metric values over a segment, and a metric name. The same string definitions can be used, extended with the ``segment_size`` literal.
.. code:: python

    custom_metrics = [
        ('segment_size', 'mean', 'segment size'),
        ('has:cart', 'mean', 'Conversion rate: cart'),
        ('has:payment_done', 'mean', 'Conversion rate: payment_done')
    ]
    stream2.segment_overview('Apr 2020', metrics=custom_metrics)
.. figure:: /_static/user_guides/segments_and_clusters/segment_overview.png
    :width: 500
The output shows that, in terms of conversion rates, the two segments are almost identical. The difference in segment sizes carries no meaning here since the split date was chosen arbitrarily.

A bar chart is suitable only if all the metrics are on the same scale. Otherwise it is better to use a heatmap table, enabled with the ``kind='heatmap'`` argument. Here we use the same metric set extended with the ``len`` and ``time_to:`` metrics.

.. code:: python

    custom_metrics = [
        ('segment_size', 'mean', 'segment size'),
        ('len', 'mean', 'Path length, mean'),
        ('has:cart', 'mean', 'Conversion rate: cart'),
        ('has:payment_done', 'mean', 'Conversion rate: payment_done'),
        ('time_to:payment_done', pd.Series.median, 'Time to payment_done, mean')
    ]
    stream2.segment_overview('Apr 2020', metrics=custom_metrics, kind='heatmap')
.. figure:: /_static/user_guides/segments_and_clusters/segment_overview_heatmap.png
    :width: 450
The default ``axis=1`` argument colorizes each row separately: the minimum value in each row is deep blue, the maximum value is deep red; ``axis=0`` colorizes the table column-wise. However, for segments of low cardinality the heatmap might be excessive. If we are interested in the numerical values only, we can disable the plot with the ``show_plot=False`` argument and access the ``values`` property.

.. code:: python

    stream2.segment_overview(
        segment_name='Apr 2020',
        metrics=custom_metrics,
        kind='heatmap',
        show_plot=False
    ).values
.. parsed-literal::

    Apr 2020                       After  Before
    segment size                   0.637   0.363
    Path length, mean              9.419   12.69
    Time to payment_done, mean      1.4m    2.9m
    Conversion rate: cart          0.504   0.528
    Conversion rate: payment_done  0.174   0.174
Next, we want to compare the behavior of the new and old users in more depth. We will use the :doc:`step matrix` and :doc:`transition matrix` tools for this. Since the segment is binary, we can simply pass its name to the ``groups`` argument without specifying the segment values.

.. code:: python

    stream2.step_matrix(groups='Apr 2020', threshold=0)
.. figure:: /_static/user_guides/segments_and_clusters/diff_step_matrix.png
    :width: 600
The step matrix reveals that the new users are less likely to surf the catalog and the main page: the grey values in the ``catalog`` and ``main`` rows indicate that the users from the ``Before`` segment visit these pages more often. Also, the paths of the new users are shorter, as shown by the brown values in the ``path_end`` row. But since the conversion rates to ``cart`` and ``payment_done`` are almost the same, it looks like the new users are more decisive.

.. code:: python

    stream2.transition_matrix(groups='Apr 2020')
.. figure:: /_static/user_guides/segments_and_clusters/diff_transition_matrix.png
    :width: 500
The transition matrix exhibits the differences in more detail. The old users tend to transit to the main page from any other page much more often than the new users. On the other hand, we see many differences in the ``* → path_end`` transitions: it looks like the old users end their paths primarily on the ``main`` page, while the last event for the new users is distributed more evenly. Finally, we note that the old users prefer cash payment much more than the new users: the difference in the ``payment_choice → payment_cash`` transition is as high as 0.25.
Clusters
--------
Retentioneering provides a set of clustering tools that can automatically group users based on their behavior. In a nutshell, the clustering process consists of the following steps:

- Path vectorization. Represent the event sequences as a matrix where each row corresponds to a particular path and each column corresponds to a path feature.
- Clustering. Apply a clustering algorithm to the feature matrix from the previous step.
- Cluster analysis. Analyze the clusters to understand the differences between them.
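The three steps above can be sketched end-to-end with plain pandas and scikit-learn on toy data (this illustrates the idea only, not the Retentioneering implementation):

.. code:: python

    import pandas as pd
    from sklearn.cluster import KMeans

    # Toy event log: users 1 and 3 behave identically, user 2 differently.
    events = pd.DataFrame({
        'user_id': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
        'event': ['main', 'catalog', 'cart', 'purchase',
                  'main', 'main',
                  'main', 'catalog', 'cart', 'purchase'],
    })

    # 1. Path vectorization: one row per path, one column per event (unigram counts).
    features = pd.crosstab(events['user_id'], events['event'])

    # 2. Clustering on the feature matrix.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

    # 3. Cluster analysis: e.g. mean feature values per cluster.
    print(features.groupby(labels).mean())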
Vectorization
~~~~~~~~~~~~~
First, we calculate the feature set using the :py:meth:`Eventstream.extract_features()` method. Here we use the simplest configuration: unigrams with the ``count`` feature type.
.. code:: python

    features = stream2.extract_features(ngram_range=(1, 1), feature_type='count')
.. parsed-literal::

             main  catalog  ...  cart  payment_done
    user_id
    122915      7       18  ...     1             0
    463458      1        8  ...     0             0
    1475907     2        5  ...     1             0
    1576626     1        0  ...     0             0
    2112338     2        3  ...     0             0
Now each path is represented as a "bag of unigrams": a vector of event counts. More feature types, such as ``tfidf``, ``binary``, or time-related features, are available. See the :py:meth:`Eventstream.extract_features()` method documentation for details.
Clustering algorithms
~~~~~~~~~~~~~~~~~~~~~
Next, we apply a clustering algorithm to the feature matrix. The :py:meth:`Eventstream.get_clusters()` method supports three algorithms: KMeans, HDBSCAN, and GMM. Here we will use KMeans as the most common one. This algorithm requires the number of clusters to be specified; however, if you do not specify it, :py:meth:`Eventstream.get_clusters()` shows an elbow curve plot to help you choose the optimal number of clusters.

.. code:: python

    stream2.get_clusters(features, method='kmeans')
.. figure:: /_static/user_guides/segments_and_clusters/elbow_curve.png
    :width: 400
As often happens in practice, this elbow curve is too smooth to determine the optimal number of clusters. In such a case, the number of clusters can be chosen based on the cluster sizes (they should not be too small). We assume that 8 clusters is enough and call :py:meth:`Eventstream.get_clusters()` again with the ``n_clusters`` argument set to 8. The cluster partitioning is stored as a regular segment, and we can set its name using the ``segment_name`` argument.

.. code:: python

    stream2 = stream2.get_clusters(
        features,
        method='kmeans',
        n_clusters=8,
        random_state=42,
        segment_name='kmeans_clusters'
    )
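For intuition, the elbow heuristic behind the curve above boils down to tracking KMeans inertia (within-cluster sum of squares) over a range of cluster counts. A scikit-learn sketch on a toy stand-in feature matrix (not the library's internals):

.. code:: python

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy stand-in for the path feature matrix.
    rng = np.random.default_rng(42)
    features = rng.normal(size=(200, 5))

    # Inertia decreases as k grows; the "elbow" is where the decrease slows down.
    inertias = {
        k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(features).inertia_
        for k in range(2, 9)
    }
    print(inertias)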
Cluster analysis
~~~~~~~~~~~~~~~~
Now the ``kmeans_clusters`` segment is available for any segment analysis described above. However, since cluster analysis is strongly related to the feature space that induced the clustering, we can use the :py:meth:`Eventstream.clusters_overview()` method, which better suits this analysis.

.. code:: python

    stream2.clusters_overview('kmeans_clusters', features, aggfunc='mean', metrics=custom_metrics)
.. figure:: /_static/user_guides/segments_and_clusters/clusters_overview.png
    :width: 400
The columns of this table describe the aggregated feature and custom metric values for each cluster. The heatmap (with the default ``axis=1`` parameter) shows how the feature values vary across the clusters. For example, the users from cluster 0 have the lowest activity (all the values in column 0 are colored deep blue), while the cluster size is the largest (36.4%).

If the aggregated values alone do not show a clear difference between some clusters, we can use the :py:meth:`Eventstream.segment_diff()` method to compare a pair of clusters directly. For example, clusters 2 and 4 look very similar. Let us compare them.

.. code:: python

    stream2.segment_diff(['kmeans_clusters', '2', '4'], features)
.. figure:: /_static/user_guides/segments_and_clusters/segment_diff.png
    :width: 600
Now we clearly see that the biggest difference between clusters 2 and 4 lies in the distribution of the ``catalog`` and ``main`` events.

With the help of the special ``_OUTER_`` literal you can explore the difference between a cluster and the rest of the clusters.

.. code:: python

    stream2.segment_diff(['kmeans_clusters', '2', '_OUTER_'], features)
.. figure:: /_static/user_guides/segments_and_clusters/segment_diff_outer.png
    :width: 600
Finally, we can label the clusters with meaningful names using the :py:meth:`Eventstream.remap_segment()` method and make the same overview visualization for a publication.
.. code:: python

    cluster_labels = {
        '0': 'passers_by',
        '1': 'aimless_1',
        '2': 'somewhat_interested_1',
        '3': 'mildly_purchasing',
        '4': 'somewhat_interested_2',
        '5': 'aimless_2',
        '6': 'active',
        '7': 'super_active'
    }
    stream2 = stream2.remap_segment('kmeans_clusters', cluster_labels)
.. figure:: /_static/user_guides/segments_and_clusters/clusters_overview_2.png
    :width: 400
In case we need a 2D representation of the clusters, we can use the :py:meth:`Eventstream.projection()` method; the t-SNE and UMAP algorithms are supported. A dropdown menu allows you to switch between segments, which is useful when you want to compare multiple clustering versions treated as different segments. Since the projection is a computationally expensive operation, it is recommended to use the ``sample_size`` argument to reduce the number of paths, along with the ``random_state`` argument to make the results reproducible.

.. code:: python

    stream2.projection(features=features, sample_size=3000, random_state=42)
.. figure:: /_static/user_guides/segments_and_clusters/projection.png
    :width: 500
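Conceptually, the projection amounts to embedding the path feature matrix into 2D, one point per path. A scikit-learn t-SNE sketch on a toy stand-in matrix (not the library's own call):

.. code:: python

    import numpy as np
    from sklearn.manifold import TSNE

    # Toy stand-in for the path feature matrix: 100 paths, 5 features.
    rng = np.random.default_rng(42)
    features = rng.normal(size=(100, 5))

    # Embed the feature vectors into 2D; each path becomes an (x, y) point.
    coords = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(features)
    print(coords.shape)  # (100, 2)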