Sequences
=========
Basic example
-------------
Sequences is a tool that displays the frequency of n-grams. An n-gram is a sequence of n consecutive events in a path. For example, the path ``A`` → ``B`` → ``C`` → ``D`` contains two 3-grams: ``A`` → ``B`` → ``C`` and ``B`` → ``C`` → ``D``.
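
As a minimal sketch of the definition (plain Python, independent of the library; the helper name ``ngrams`` is hypothetical), the n-grams of a path can be listed with a sliding window:

.. code-block:: python

    def ngrams(path, n):
        """Return all n-grams (as tuples) of consecutive events in ``path``."""
        # A window of width n starts at every index i with i + n <= len(path).
        return [tuple(path[i:i + n]) for i in range(len(path) - n + 1)]

    print(ngrams(["A", "B", "C", "D"], 3))
    # [('A', 'B', 'C'), ('B', 'C', 'D')]

A path shorter than ``n`` simply yields an empty list.
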
Hereafter we use the :doc:`simple_shop ` dataset, which has already been converted to an :doc:`Eventstream` and assigned to the ``stream`` variable. If you want to use your own dataset, upload it following :ref:`this instruction`.

.. code-block:: python

    from retentioneering import datasets

    stream = datasets.load_simple_shop()

To run the Sequences tool, use the :py:meth:`Eventstream.sequences()` method.

.. code-block:: python

    stream.sequences()

.. figure:: /_static/user_guides/sequences/basic_example.png
Let us explore the output. The output is a pandas DataFrame colored with a heatmap. Each sequence is an entry of the DataFrame index. By default, only single events (1-grams) are treated as sequences. To adjust this behavior, use the ``ngram_range`` argument.
The columns mostly display metrics that reflect the frequency of a particular n-gram. The possible metrics are:

- ``paths``: The number of unique paths that contain a particular event sequence.
- ``paths_share``: The ratio of paths containing a sequence to the total number of paths.
- ``count``: The number of occurrences of a particular sequence (might occur multiple times within a path).
- ``count_share``: The ratio of a particular count to the sum of counts over all sequences.
- ``avg_count``: The average number of occurrences per path.
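
The definitions above can be sketched in plain Python. This is an illustration on hypothetical toy data, not the library's implementation; it covers ``paths``, ``paths_share``, ``count`` and ``count_share`` for 1-grams and leaves ``avg_count`` out.

.. code-block:: python

    from collections import Counter

    def ngrams(path, n):
        """Return all n-grams (as tuples) of consecutive events in ``path``."""
        return [tuple(path[i:i + n]) for i in range(len(path) - n + 1)]

    # Hypothetical toy data: path id -> ordered list of events.
    paths = {
        "u1": ["main", "catalog", "catalog", "cart"],
        "u2": ["main", "catalog"],
        "u3": ["catalog", "cart"],
    }

    count = Counter()       # total occurrences of each 1-gram
    paths_with = Counter()  # number of paths containing each 1-gram
    for events in paths.values():
        grams = ngrams(events, 1)
        count.update(grams)
        paths_with.update(set(grams))  # each path is counted at most once

    total_count = sum(count.values())
    metrics = {
        gram: {
            "paths": paths_with[gram],
            "paths_share": paths_with[gram] / len(paths),
            "count": count[gram],
            "count_share": count[gram] / total_count,
        }
        for gram in count
    }

    print(metrics[("catalog",)])
    # {'paths': 3, 'paths_share': 1.0, 'count': 4, 'count_share': 0.5}

Note that ``count`` can exceed ``paths`` because a sequence may occur several times within one path, as ``catalog`` does in path ``u1``.
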
The ``sequence_type`` column distinguishes two important types of sequences: *loops* and *cycles*. A sequence of length >= 2 is a ``loop`` if it consists of a single repeated event. A sequence of length >= 3 is a ``cycle`` if its first and last events are the same.
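
The two rules can be written as a small classifier. This is a sketch under two assumptions that the text does not state: the ``loop`` rule is checked first (so ``A`` → ``A`` → ``A`` counts as a loop rather than a cycle), and everything else is labeled ``other``. The function name is hypothetical.

.. code-block:: python

    def sequence_type(seq):
        """Classify a sequence of events as 'loop', 'cycle' or 'other'."""
        # Loop: at least two events, all of them identical.
        if len(seq) >= 2 and len(set(seq)) == 1:
            return "loop"
        # Cycle: at least three events, starting and ending with the same event.
        if len(seq) >= 3 and seq[0] == seq[-1]:
            return "cycle"
        # Assumption: every remaining sequence is labeled 'other'.
        return "other"

    print(sequence_type(("catalog", "catalog")))          # loop
    print(sequence_type(("catalog", "cart", "catalog")))  # cycle
    print(sequence_type(("main", "catalog")))             # other
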
Finally, the ``path_id_sample`` column contains a random sample of ids of the paths that contain the given sequence. These samples are useful when you need to investigate why a particular sequence occurs.

.. note::

    ``paths`` and ``paths_share`` metric names are replaced with the corresponding ``weight_col`` values in the output. Namely, for the default ``weight_col='user_id'`` value, ``user_id`` and ``user_id_share`` are used as the column titles. Also, ``path_id_sample`` is replaced with ``user_id_sample``.

Tuning the arguments
--------------------
Now let us consider another example to demonstrate how the arguments can be tuned. Here we also use the :py:meth:`SplitSessions` data processor to split the eventstream into sessions and obtain an additional ``session_id`` column.

.. code-block:: python

    stream\
        .split_sessions(timeout=(30, 'm'))\
        .sequences(
            ngram_range=(2, 3),
            weight_col='session_id',
            metrics=['count', 'count_share', 'paths_share'],
            threshold=['count', 1200],
            sorting=['count_share', False],
            heatmap_cols=['session_id_share'],
            sample_size=3
        )

.. figure:: /_static/user_guides/sequences/tuning_the_arguments.png
To set the range of n-gram lengths (i.e. the values of n), use the ``ngram_range`` argument. This is a very important parameter because it limits the number of n-grams to be discovered. If the upper bound is set too high, the number of n-grams can be immense, and computing them all takes a long time. In practice, it is rarely reasonable to compute n-grams longer than 6-7 events, so choose the upper bound carefully.
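
To get a feeling for how the range affects the amount of work, here is a plain-Python sketch that enumerates the n-grams of one path for a given range. It assumes that both ends of ``ngram_range`` are inclusive, which the text does not state explicitly; the helper name and the path are hypothetical.

.. code-block:: python

    def ngrams_in_range(path, ngram_range):
        """All n-grams of ``path`` for min_n <= n <= max_n (assumed inclusive)."""
        min_n, max_n = ngram_range
        return [
            tuple(path[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(path) - n + 1)
        ]

    path = ["main", "catalog", "product1", "cart", "catalog"]
    grams = ngrams_in_range(path, (2, 3))
    print(len(grams))  # 4 two-grams + 3 three-grams = 7

With ``k`` distinct events there can be up to ``k ** n`` distinct n-grams of length ``n``, which is why a large upper bound quickly becomes expensive.
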
``weight_col`` sets the eventstream column that contains path identifiers. Similar to :ref:`transition graph` and :ref:`step matrix`, you can calculate the sequence statistics within a whole path (by ``user_id``) or within its subpaths (for example, by ``session_id``). In this example we switch it to ``weight_col='session_id'``.
The ``metrics`` parameter defines the metrics to be included in the output columns. The metric names were defined in the previous section.
Since the number of sequences is often large, we usually want the output to include only the most valuable ones. With the ``threshold`` parameter you can define a column to be used as a filter and the corresponding threshold value. Only sequences whose value in that column is above the given threshold are included in the output. In the example, ``threshold=['count', 1200]`` means that the filtering column is ``count`` and the threshold value is 1200.
Sorting of the output table is controlled by the ``sorting`` parameter. The columns colored by the heatmap are defined by the ``heatmap_cols`` parameter. Note that instead of ``heatmap_cols=['session_id_share']`` we could use ``heatmap_cols=['paths_share']``, which is an alias when ``weight_col='session_id'``.
Finally, the ``sample_size`` parameter defines the length of the list of sampled path ids.
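
The effect of ``threshold`` and ``sorting`` can be approximated with ordinary pandas operations. The sketch below uses hypothetical values and rests on two assumptions: the comparison is strict ("above" the threshold), and the second element of ``sorting`` acts as an ascending flag. It illustrates the idea and is not the library's code.

.. code-block:: python

    import pandas as pd

    # Hypothetical values, for illustration only.
    df = pd.DataFrame(
        {"count": [1500, 900, 2100], "count_share": [0.30, 0.18, 0.42]},
        index=pd.Index(["a -> b", "b -> c", "c -> d"], name="Sequence"),
    )

    threshold_col, threshold_value = "count", 1200
    sort_col, ascending = "count_share", False

    # Keep rows above the threshold (assumed strict), then sort.
    filtered = df[df[threshold_col] > threshold_value]
    result = filtered.sort_values(sort_col, ascending=ascending)

    print(result.index.tolist())
    # ['c -> d', 'a -> b']
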
.. _sequences_segments:
Comparing groups
----------------
One of the most powerful applications of the Sequences tool is comparing sequence frequencies between two groups of users. Here we use a random split of the users purely for demonstration purposes.
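
.. code-block:: python

    import numpy as np

    np.random.seed(111)
    users = set(stream.to_dataframe()['user_id'])
    # np.random.choice samples with replacement by default, so after
    # deduplication group1 may hold fewer than half of the users.
    group1 = set(np.random.choice(list(users), size=len(users)//2))
    group2 = users - group1
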
.. code-block:: python

    stream.sequences(
        groups=[group1, group2],
        group_names=['A', 'B'],
        metrics=['paths_share', 'count_share'],
        threshold=[('user_id_share', 'delta_abs'), 0],
        sorting=[('count_share', 'delta'), False]
    )

.. figure:: /_static/user_guides/sequences/groups.png
To activate group mode for Sequences, set the ``groups`` parameter, which defines the two sets of users to be compared. Optionally, you can name these groups with the ``group_names`` parameter so that the output columns are labeled with the corresponding titles.
The metric columns are organized as follows. Each metric is represented by four columns:

- the metric value for the first group (A),
- the metric value for the second group (B),
- ``delta_abs``: the difference between the first and the second group, A - B,
- ``delta_rel``: the delta relative to the value for the second group, (A - B) / B.

Unlike the regular output, the Sequences output for groups has a `pandas.MultiIndex `_ in the columns. So when using ``threshold``, ``sorting``, and ``heatmap_cols``, you need to refer to a column as an element of a 2-level multiindex, i.e. as a tuple such as ``('user_id_share', 'delta_abs')``.
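
The four columns of a single metric, and the way a 2-level column is addressed, can be sketched in plain pandas. The numbers below are hypothetical and the layout is only an approximation of the real output.

.. code-block:: python

    import pandas as pd

    # Hypothetical per-group values of one metric, for illustration only.
    a = pd.Series({"catalog": 0.50, "cart": 0.20})
    b = pd.Series({"catalog": 0.40, "cart": 0.25})

    block = pd.DataFrame({
        "A": a,
        "B": b,
        "delta_abs": a - b,        # A - B
        "delta_rel": (a - b) / b,  # (A - B) / B
    })
    # Put the metric name on the first level to obtain 2-level columns.
    block.columns = pd.MultiIndex.from_product([["count_share"], block.columns])

    # A column is addressed by a (metric, sub-column) tuple.
    print(block[("count_share", "delta_abs")].round(2))

Selecting with the tuple ``("count_share", "delta_abs")`` mirrors how columns are referenced in ``threshold`` and ``sorting`` in group mode.
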
Common tooling properties
-------------------------
values
~~~~~~
If you want to get the underlying pandas DataFrame, use the :py:meth:`Sequences.values` property. The additional flag ``show_plot=False`` suppresses the output.

.. code-block:: python

    seq_df = stream.sequences(show_plot=False).values
    seq_df

================  =======  =============  =====  ===========  =============  ==============
Sequence          user_id  user_id_share  count  count_share  sequence_type  user_id_sample
================  =======  =============  =====  ===========  =============  ==============
path_end          3751     1.00           3751   0.09         other          [696492792]
path_start        3751     1.00           3751   0.09         other          [807066609]
catalog           3611     0.96           14518  0.36         other          [969637876]
main              2385     0.64           5635   0.14         other          [274091445]
cart              1924     0.51           2842   0.07         other          [712986878]
product2          1430     0.38           2172   0.05         other          [196471324]
delivery_choice   1356     0.36           1686   0.04         other          [162041520]
product1          1122     0.30           1515   0.04         other          [368983170]
payment_choice    958      0.26           1107   0.03         other          [418845606]
delivery_courier  748      0.20           834    0.02         other          [397948421]
payment_done      653      0.17           706    0.02         other          [827859068]
payment_card      521      0.14           565    0.01         other          [204780950]
delivery_pickup   469      0.13           506    0.01         other          [470581033]
payment_cash      190      0.05           197    0.00         other          [766327250]
================  =======  =============  =====  ===========  =============  ==============