Eventstream =========== |colab| |jupyter| .. |jupyter| raw:: html Download - Jupyter Notebook .. |colab| raw:: html Google Colab What is Eventstream? -------------------- Eventstream is the core class in the retentioneering library. This data structure is designed around three following purposes: .. todo:: Set a link to eventstream concept as soon as it is ready. Vladimir Kukushkin - **Data container**. Eventstream class implements a convenient approach to storing clickstream data. - **Preprocessing**. Eventstream allows to efficiently implement a data preparation process. See :doc:`Preprocessing user guide <../user_guides/preprocessing>` for more details. - **Applying analytical tools**. Eventstream integrates with retentioneering tools and allows you to seamlessly apply them. See :ref:`user guides on the path analysis tools`. .. _eventstream_creation: Eventstream creation -------------------- Default field names ~~~~~~~~~~~~~~~~~~~ An ``Eventstream`` is a container for clickstream data, that is initialized from a ``pandas.DataFrame``. The class constructor expects the DataFrame to have at least 3 columns: ``user_id``, ``event``, ``timestamp``. Let us create a dummy DataFrame to illustrate Eventstream init process: .. code-block:: python import pandas as pd df1 = pd.DataFrame( [ ['user_1', 'A', '2023-01-01 00:00:00'], ['user_1', 'B', '2023-01-01 00:00:01'], ['user_2', 'B', '2023-01-01 00:00:02'], ['user_2', 'A', '2023-01-01 00:00:03'], ['user_2', 'A', '2023-01-01 00:00:04'], ], columns=['user_id', 'event', 'timestamp'] ) Having such a dataframe, you can create an eventstream simply as follows: .. code-block:: python from retentioneering.eventstream import Eventstream stream1 = Eventstream(df1) To do the inverse transformation (i.e. obtain a DataFrame from an eventstream object), :py:meth:`to_dataframe()` method can be used. However, the method is not just a converter. Using it, we can display the internal ``Eventstream`` structure: .. _eventstream_stream1: .. code-block:: python stream1.to_dataframe() .. raw:: html
event timestamp user_id event_type event_index event_id
0 A 2023-01-01 00:00:00 user_1 raw 0 52c63aed-b1ff-4fea-806d-8484a8978443
1 B 2023-01-01 00:00:01 user_1 raw 1 e9537f88-f776-4047-ae16-66f8b64c4076
2 B 2023-01-01 00:00:02 user_2 raw 2 bb6f1cd3-a630-4d48-94c2-8a3cf5a3d22f
3 A 2023-01-01 00:00:03 user_2 raw 3 3680e70b-166d-475d-8ac1-166ea213f5f4
4 A 2023-01-01 00:00:04 user_2 raw 4 c22d97a7-63a1-4d57-b738-16863107dfb7

We will describe the columns of the resulting DataFrame later in `Displaying eventstream`_ section. .. _eventstream_custom_fields: Custom field names ~~~~~~~~~~~~~~~~~~ For custom DataFrame column names you can either rename them using pandas, or set a mapping rule that would tell the Eventstream constructor the mapping to the correct column names. This can be done with Eventstream attribute ``raw_data_schema`` with uses :py:meth:`RawDataSchema` class under the hood. Let us illustrate its usage with the following example with the same dataframe containing the same data but with different column names (``client_id``, ``action`` and ``datetime``): .. code-block:: python df2 = pd.DataFrame( [ ['user_1', 'A', '2023-01-01 00:00:00'], ['user_1', 'B', '2023-01-01 00:00:01'], ['user_2', 'B', '2023-01-01 00:00:02'], ['user_2', 'A', '2023-01-01 00:00:03'], ['user_2', 'A', '2023-01-01 00:00:04'] ], columns=['client_id', 'action', 'datetime'] ) raw_data_schema = { 'user_id': 'client_id', 'event_name': 'action', 'event_timestamp': 'datetime' } stream2 = Eventstream(df2, raw_data_schema=raw_data_schema) stream2.to_dataframe().head(3) .. raw:: html
event timestamp user_id event_type event_index event_id
0 A 2023-01-01 00:00:00 user_1 raw 0 0fd741dd-8140-4182-8046-cf23906208c6
1 B 2023-01-01 00:00:01 user_1 raw 1 3a4db2d7-a4b8-4845-ab21-0950f6a2bfc0
2 B 2023-01-01 00:00:02 user_2 raw 2 fd4552d9-db28-47cc-b7a1-4408a895cff9

As we see, ``raw_data_schema`` argument maps fields ``client_id``, ``action``, and ``datetime`` so that they are imported to the eventstream correctly. Another common case is when your DataFrame has some additional columns that you want to be included in the eventstream. ``raw_data_schema`` argument supports this scenario too with the help of ``custom_cols`` key value. The value for this key is a list of dictionaries, one dict per one custom field. A single dict must contain two fields: ``raw_data_col`` and ``custom_col``. The former stands for a field name from the sourcing dataframe, the latter stands for the corresponding field name to be set at the resulting eventstream. Suppose the initial DataFrame now also contains a session identifier: ``session_id`` column. In that case, ``raw_data_schema`` supports the following way to handle ``session_id`` support: .. code-block:: python df3 = pd.DataFrame( [ ['user_1', 'A', '2023-01-01 00:00:00', 'session_1'], ['user_1', 'B', '2023-01-01 00:00:01', 'session_1'], ['user_2', 'B', '2023-01-01 00:00:02', 'session_2'], ['user_2', 'A', '2023-01-01 00:00:03', 'session_3'], ['user_2', 'A', '2023-01-01 00:00:04', 'session_3'] ], columns=['client_id', 'action', 'datetime', 'session'] ) raw_data_schema = { 'user_id': 'client_id', 'event_name': 'action', 'event_timestamp': 'datetime', 'custom_cols': [ { 'raw_data_col': 'session', 'custom_col': 'session_id' } ] } stream3 = Eventstream(df3, raw_data_schema=raw_data_schema) stream3.to_dataframe().head(3) .. raw:: html
event timestamp user_id event_type session_id event_index event_id
0 A 2023-01-01 00:00:00 user_1 raw session_1 0 c15ff01a-6822-464f-a4cc-dc6118e44e6d
1 B 2023-01-01 00:00:01 user_1 raw session_1 1 359c5e0e-d533-4101-8f9c-86247cb78590
2 B 2023-01-01 00:00:02 user_2 raw session_2 2 ce03482c-0b47-42eb-9947-8ee3cc39ecd9

Here we see that the original ``session`` column is stored in ``session_id`` column, according to the defined ``raw_data_schema`` If the core triple columns of the DataFrame were titled with the default names ``user_id``, ``event``, ``timestamp`` (instead of ``client_id``, ``action``, ``datetime``) then you could just ignore their mapping in setting ``raw_data_schema`` and pass ``custom_cols`` key only. .. _eventstream_field_names: Eventstream field names ~~~~~~~~~~~~~~~~~~~~~~~ Using the :py:meth:`EventstreamSchema` attribute you can: 1. Regulate how ``Eventstream`` column names will be displayed as an output of :py:meth:`to_dataframe()` method. For example, it can be useful if it is more common and important to operate with custom column names; 2. Get access to the eventstream columns which is used for such preprocessing tools as: - :py:meth:`AddPositiveEvents `, - :py:meth:`AddNegativeEvents `, - :py:meth:`FilterEvents `, - :py:meth:`GroupEvents `. To demonstrate how eventstream schema works we use the same ``stream1`` that we have already used :ref:`above`. Let us set the names of the core triple columns as ``client_id``, ``action``, and ``datetime`` with the help of ``schema`` argument: .. code-block:: python from retentioneering.eventstream import EventstreamSchema new_eventstream_schema = EventstreamSchema( user_id='client_id', event_name='action', event_timestamp='datetime' ) stream1_new_schema = Eventstream(df1, schema=new_eventstream_schema) stream1_new_schema.to_dataframe().head(3) .. raw:: html
action datetime client_id event_type event_index event_id
0 A 2023-01-01 00:00:00 user_1 raw 0 884a38f1-dc62-4567-a10b-5c20a690a173
1 B 2023-01-01 00:00:01 user_1 raw 1 8ed1d3fb-8026-413a-a426-9f8858cd9d73
2 B 2023-01-01 00:00:02 user_2 raw 2 965de90a-6a68-42a5-8d8a-8e23534bfd72

As we can see, the names of the main columns have changed. It happened because an ``Eventstream`` object stores an instance of the :py:meth:`EventstreamSchema` class with the mapping to custom column names. If you want to get the full list of the fields supported by EventstreamSchema, get ``EventstreamSchema.schema`` property. Each of these fields can be modified with EventstreamSchema. .. code-block:: python stream1_new_schema.schema .. parsed-literal:: EventstreamSchema( event_id='event_id', event_type='event_type', event_index='event_index', event_name='action', event_timestamp='datetime', user_id='client_id', custom_cols=[] ) User sampling ~~~~~~~~~~~~~ Contemporary data analysis usually involve working with large datasets. Using retentioneering to work with such datasets might cause the following undesirable effects: - High computational costs. - The messy big picture (especially in case of applying such tools as :doc:`Transition Graph`, :doc:`StepMatrix` and :doc:`StepSankey`). Insufficient user paths or large number of almost identical paths (especially short paths) often add no value to the analysis. It might be reasonable to get rid of them. - Due to Eventstream design, all the data uploaded to an Eventstream instance is kept immutable. Even if you remove some eventstream rows while preprocessing, the data stays untouched: it just becomes hidden and is marked as removed. Thus, the only chance to tailor the dataset to a reasonable size is to sample the user paths at entry point - while applying Eventstream constructor. .. todo:: Set a link to eventstream concept as soon as it is ready. Vladimir Kukushkin The size of the original dataset can be reduced by path sampling. In theory, this procedure could affect the eventstream analysis, especially in case you have rare but important events and behavioral patterns. Nevertheless, the sampling is less likely to distort the big picture, so we recommend to use it when it is needed. We also highlight that user path sampling means that we remove some random paths entirely. We guarantee that the sampled paths contain all the events from the original dataset, and they are not truncated. There are a couple sampling parameters in the Eventstream constructor: ``user_sample_size`` and ``user_sample_seed``. There are two ways of setting the sample size: - A float number. For example, ``user_sample_size=0.1`` means that we want to leave 10% ot the paths and remove 90% of them. - An integer sample size is also possible. In this case a specified number of events will be left. ``user_sample_seed`` is a standard way to make random sampling reproducible (see `this Stack Overflow explanation `_). You can set it to any integer number. Below is a sampling example for :doc:`simple_shop ` dataset. .. code-block:: python from retentioneering import datasets simple_shop_df = datasets.load_simple_shop(as_dataframe=True) sampled_stream = Eventstream( simple_shop_df, user_sample_size=0.1, user_sample_seed=42 ) print('Original number of the events:', len(simple_shop_df)) print('Sampled number of the events:', len(sampled_stream.to_dataframe())) unique_users_original = simple_shop_df['user_id'].nunique() unique_users_sampled = sampled_stream.to_dataframe()['user_id'].nunique() print('Original unique users number: ', unique_users_original) print('Sampled unique users number: ', unique_users_sampled) .. parsed-literal:: Original number of the events: 32283 Sampled number of the events: 3298 Original unique users number: 3751 Sampled unique users number: 375 We see that the number of the users has been reduced from 3751 to 375 (10% exactly). The number of the events has been reduced from 32283 to 3298 (10.2%), but we didn't expect to see exact 10% here. .. _to_dataframe explanation: Displaying eventstream ---------------------- Now let us look at columns represented in an eventstream and discuss :py:meth:`to_dataframe()` method using the example of ``stream3`` eventstream. .. code-block:: python stream3.to_dataframe() .. raw:: html
event timestamp user_id session_id event_type event_index event_id
0 A 2023-01-01 00:00:00 user_1 session_1 raw 0 7427d9f5-8666-4821-b0a9-f74a962f6d72
1 B 2023-01-01 00:00:01 user_1 session_1 raw 1 6c9fef69-a176-45d1-bb13-628796e68602
2 B 2023-01-01 00:00:02 user_2 session_2 raw 2 7aee8104-b1cc-4df4-8a8d-f569395ffad9
3 A 2023-01-01 00:00:03 user_2 session_3 raw 3 3b3610b2-8016-4259-bf68-6daf34518e34
4 A 2023-01-01 00:00:04 user_2 session_3 raw 4 945e6514-2f41-457c-ba70-2ac35150b41e

Besides the standard triple ``user_id``, ``event``, ``timestamp`` and custom column ``session_id`` we see the columns ``event_id``, ``event_type``, ``event_index``. These are some technical columns, containing the following: .. _event_type_explanation: - ``event_type`` - all the events that come from the sourcing DataFrame are of ``raw`` event type. However, preprocessing methods can add some synthetic events that have various event types. See the details in :ref:`data processors user guide`. - ``event_index`` - an integer which is associated with the event order. By default, an eventstream is sorted by timestamp, and optionally by ``event`` column. As for the synthetic events which are often placed at the beginning or in the end of a user's path, special sorting is applied. See explanation of :ref:`index and reindex logic` for the details and also :ref:`data processors user guide`. Please note that the event index might has duplicated values. It is ok due to its design. - ``event_id`` - a string identifier of an eventstream row. .. todo:: Set a link to eventstream concept as soon as it is ready. Vladimir Kukushkin see :ref:`Eventstream concept` for the details. There are additional arguments that may be useful. - ``show_deleted``. Eventstream is immutable data container. It means that all the events once uploaded to an eventstream are kept. Even if we remove some events, they are just marked as removed. By default, ``show_deleted=False`` so these events are hidden in the output DataFrame. If ``show_deleted=True``, all the events from the original state of the eventstream and all the in-between preprocessing states are displayed. .. todo:: Set a link to eventstream concept as soon as it is ready. Vladimir Kukushkin see :ref:`Eventstream concept` for the details. - ``copy`` - when this flag is ``True`` (by default it is ``False``) then an explicit copy of the DataFrame is created. See details in `pandas documentation `_. .. _index_explanation: Eventstream index and reindex ----------------------------- In the previous section, we have already mentioned the sorting algorithm when we described special ``event_type`` and ``event_index`` eventstream columns. Now, let us take a closer look at the sorting logic and illustrate it with several examples. By default, raw events are sorted by timestamp column. So the events with the same timestamps are kept in the same order as they are represented in the sourcing DataFrame. In case you need some custom ordering for the events with the same timestamps use the ``events_order`` parameter. For example, due to some technical reasons events can arrive at the server with the same timestamp and randomly change the order and create unnecessary variability. After initial sorting, the relative order of raw events is fixed, and re-sorting is not possible. In the dummy dataframe below, there are two users, each with a pair of events ("A" and "B") that have equal timestamps. If we create an eventstream with default parameters, the order will be preserved, exactly the same as in the input dataframe. .. code-block:: python df4 = pd.DataFrame( [ ['user_1', 'A', '2023-01-01 00:00:00'], ['user_1', 'B', '2023-01-01 00:00:00'], ['user_2', 'B', '2023-01-01 00:00:03'], ['user_2', 'A', '2023-01-01 00:00:03'], ['user_2', 'A', '2023-01-01 00:00:04'] ], columns=['user_id', 'event', 'timestamp'] ) stream4 = Eventstream(df4) stream4.to_dataframe() .. raw:: html
event timestamp user_id event_type event_index
0 A 2023-01-01 00:00:00 user_1 raw 0
1 B 2023-01-01 00:00:00 user_1 raw 1
2 B 2023-01-01 00:00:03 user_2 raw 2
3 A 2023-01-01 00:00:03 user_2 raw 3
4 A 2023-01-01 00:00:04 user_2 raw 4

Now we create a new Eventstream from our dummy DataFrame, but with the specified ``events_order=["B", "A"]`` parameter. As we can see, the first two events have swapped places. .. code-block:: python Eventstream(df4, events_order=["B", "A"]).to_dataframe() .. raw:: html
event timestamp user_id event_type event_index
0 B 2023-01-01 00:00:00 user_1 raw 0
1 A 2023-01-01 00:00:00 user_1 raw 1
2 B 2023-01-01 00:00:03 user_2 raw 2
3 A 2023-01-01 00:00:03 user_2 raw 3
4 A 2023-01-01 00:00:04 user_2 raw 4

As for the synthetic events which are often placed at the beginning or in the end of a user's path, special sorting is applied. There is a set of pre-defined event types, that are arranged in the following default order: .. code-block:: python IndexOrder = [ "profile", "path_start", "new_user", "existing_user", "cropped_left", "session_start", "session_start_cropped", "group_alias", "raw", "raw_sleep", None, "synthetic", "synthetic_sleep", "positive_target", "negative_target", "session_end_cropped", "session_end", "session_sleep", "cropped_right", "absent_user", "lost_user", "path_end" ] Most of these types are created by build-in :ref:`data processors`. Note that some of the types are not used right now and were created for future development. To see full explanation about which data processor creates which ``event_type`` you can explore the :ref:`data processors user guide`. If needed, you can pass a custom sorting list to the Eventstream constructor as the ``index_order`` argument. In case you already have an eventstream instance, you can assign a custom sorting list to ``Eventstream.index_order`` attribute. Afterwards, you should use :py:meth:`index_events()` method to apply this new sorting. For demonstration purposes we use here a :py:meth:`AddPositiveEvents` data processor, which adds new event with prefix ``positive_target_``. .. code-block:: python add_events_stream = stream4.add_positive_events(targets=['B']) add_events_stream.to_dataframe() .. raw:: html
event timestamp user_id event_type event_index
0 A 2023-01-01 00:00:00 user_1 raw 0
1 B 2023-01-01 00:00:00 user_1 raw 1
2 positive_target_B 2023-01-01 00:00:00 user_1 positive_target 1
3 B 2023-01-01 00:00:03 user_2 raw 2
4 positive_target_B 2023-01-01 00:00:03 user_2 positive_target 2
5 A 2023-01-01 00:00:03 user_2 raw 3
6 A 2023-01-01 00:00:04 user_2 raw 4

We see that ``positive_target_B`` events with type ``positive_target`` follow their ``raw`` parent event ``B``. Assume we would like to change their order. .. code-block:: python custom_sorting = [ 'profile', 'path_start', 'new_user', 'existing_user', 'cropped_left', 'session_start', 'session_start_cropped', 'group_alias', 'positive_target', 'raw', 'raw_sleep', None, 'synthetic', 'synthetic_sleep', 'negative_target', 'session_end_cropped', 'session_end', 'session_sleep', 'cropped_right', 'absent_user', 'lost_user', 'path_end' ] add_events_stream.index_order = custom_sorting add_events_stream.index_events() add_events_stream.to_dataframe() .. raw:: html
event timestamp user_id event_type event_index
0 A 2023-01-01 00:00:00 user_1 raw 0
1 positive_target_B 2023-01-01 00:00:00 user_1 positive_target 1
2 B 2023-01-01 00:00:00 user_1 raw 1
3 positive_target_B 2023-01-01 00:00:03 user_2 positive_target 2
4 B 2023-01-01 00:00:03 user_2 raw 2
5 A 2023-01-01 00:00:03 user_2 raw 3
6 A 2023-01-01 00:00:04 user_2 raw 4

As we can see, the order of the events changed, and now ``raw`` events ``B`` follow ``positive_target_B`` events. .. _eventstream_descriptive_methods: Descriptive methods ------------------- Eventstream provides a set of methods for a first touch data exploration. To showcase how these methods work, we need a larger dataset, so we will use our :doc:`simple_shop` dataset. For demonstration purposes, we add ``session_id`` column by applying :py:meth:`SplitSessions` data processor. .. code-block:: python from retentioneering import datasets stream_with_sessions = datasets\ .load_simple_shop()\ .split_sessions(timeout=(30, 'm')) stream_with_sessions.to_dataframe().head() .. raw:: html
event timestamp user_id session_id event_type event_index event_id
0 session_start 2019-11-01 17:59:13.273932 219483890 219483890_1 session_start 0 92aa043e-02ac-4a4d-9f37-4bfc9dd101dc
1 catalog 2019-11-01 17:59:13.273932 219483890 219483890_1 raw 1 c1368d21-85fe-4ed0-864b-87b79eca8076
3 product1 2019-11-01 17:59:28.459271 219483890 219483890_1 raw 3 4f437751-b117-4ef2-ba23-e91fe3a022fc
5 cart 2019-11-01 17:59:29.502214 219483890 219483890_1 raw 5 740eeca9-db1f-4279-a0aa-ceb07de60638
7 catalog 2019-11-01 17:59:32.557029 219483890 219483890_1 raw 7 4f0d2ba7-396f-492e-bbf0-887ef14211e6

General statistics ~~~~~~~~~~~~~~~~~~ .. _eventstream_describe: Describe ^^^^^^^^ In a similar fashion to pandas, we use :py:meth:`describe()` for getting a general description of an eventstream. .. code-block:: python stream_with_sessions.describe() .. raw:: html
value
category metric
overall unique_users 3751
unique_events 14
unique_sessions 6454
eventstream_start 2019-11-01 17:59:13
eventstream_end 2020-04-29 12:48:07
eventstream_length 179 days 18:48:53
path_length_time mean 9 days 11:15:18
std 23 days 02:52:25
median 0 days 00:01:21
min 0 days 00:00:00
max 149 days 04:51:05
path_length_steps mean 12.05
std 11.43
median 9.0
min 3
max 122
session_length_time mean 0 days 00:00:52
std 0 days 00:01:08
median 0 days 00:00:30
min 0 days 00:00:00
max 0 days 00:23:44
session_length_steps mean 7.0
std 4.18
median 6.0
min 3
max 55

The output consists of three main categories: - **overall statistics** - full user-path statistics - time distribution - steps (events) distribution - sessions statistics - time distribution - steps (events) distribution .. _explain_describe_params: ``session_col`` parameter is optional and points to the eventstream column that contains session ids (``session_id`` is the default value). If such a column is defined, session statistics are also included. Otherwise, the values related to sessions are not displayed. There is one more parameter - ``raw_events_only`` (default False) that could be useful if some synthetic events have already been added by :ref:`adding data processors `. Note that those events affect all "\*_steps" categories. Now let us go through the main categories and take a closer look at some of the metrics: **overall** By ``eventstream start`` and ``eventstream end`` in the "Overall" block we denote timestamps of the first event and the last event in the eventstream correspondingly. ``eventstream_length`` is the time distance between event stream start and end. **path/session length time** and **path/session length steps** These two blocks show some time-based statistics over user paths and sessions. Categories "path/session_length_time" and "path/session length steps" provide similar information on the length of users paths and sessions correspondingly. The former is calculated in days and the latter in the number of events. It is important to mention that all the values in "\*_steps" categories are rounded to the 2nd decimal digit, and in "\*_time" categories - to seconds. This is also true for the next method. .. _eventstream_describe_events: Describe events ^^^^^^^^^^^^^^^ The :py:meth:`describe_events()` method provides event-wise statistics about an eventstream. Its output consists of three main blocks: #. **basic statistics** #. full user-path statistics, - time to first occurrence (FO) of each event, - steps to first occurrence (FO) of each event, #. sessions statistics (if this column exists), - time to first occurrence (FO) of each event, - steps to first occurrence (FO) of each event. You can find detailed explanations of each metric in :py:meth:`api documentation`. The default parameters are ``session_col='session_id'``, ``raw_events_only=False``. With them, we will get statistics for each event present in our data. These two arguments work exactly the same way as in the :ref:`describe()` method. .. code-block:: python stream = datasets.load_simple_shop() stream.describe_events() .. raw:: html
basic_statistics time_to_FO_user_wise steps_to_FO_user_wise
number_of_occurrences unique_users number_of_occurrences_shared unique_users_shared mean std median min max mean std median min max
event
cart 2842 1924 0.09 0.51 3 days 08:59:14 11 days 19:28:46 0 days 00:00:56 0 days 00:00:01 118 days 16:11:36 4.51 4.09 3.0 1 41
catalog 14518 3611 0.45 0.96 0 days 05:44:21 3 days 03:22:32 0 days 00:00:00 0 days 00:00:00 100 days 08:19:51 0.30 0.57 0.0 0 7
delivery_choice 1686 1356 0.05 0.36 5 days 09:18:08 15 days 03:19:15 0 days 00:01:12 0 days 00:00:03 118 days 16:11:37 6.78 5.56 5.0 2 49
delivery_courier 834 748 0.03 0.20 6 days 18:14:55 16 days 17:51:39 0 days 00:01:28 0 days 00:00:06 118 days 16:11:38 8.96 6.84 7.0 3 45
delivery_pickup 506 469 0.02 0.13 7 days 21:12:17 18 days 22:51:54 0 days 00:01:34 0 days 00:00:06 114 days 01:24:06 9.51 8.06 7.0 3 71
main 5635 2385 0.17 0.64 3 days 20:15:36 9 days 02:58:23 0 days 00:00:07 0 days 00:00:00 97 days 21:24:23 2.00 2.94 1.0 0 20
payment_card 565 521 0.02 0.14 6 days 21:42:26 17 days 18:52:33 0 days 00:01:40 0 days 00:00:08 138 days 04:51:25 11.14 7.34 9.0 5 65
payment_cash 197 190 0.01 0.05 13 days 23:17:25 24 days 00:00:02 0 days 00:02:18 0 days 00:00:10 118 days 16:11:39 14.15 11.10 9.5 5 73
payment_choice 1107 958 0.03 0.26 6 days 12:49:38 17 days 02:54:51 0 days 00:01:24 0 days 00:00:06 118 days 16:11:39 9.42 6.37 7.0 4 52
payment_done 706 653 0.02 0.17 7 days 01:37:54 17 days 09:10:00 0 days 00:01:34 0 days 00:00:08 115 days 09:18:59 12.21 8.29 10.0 5 84
product1 1515 1122 0.05 0.30 5 days 23:49:43 16 days 04:36:13 0 days 00:00:50 0 days 00:00:00 118 days 19:38:40 5.46 6.04 3.0 1 61
product2 2172 1430 0.07 0.38 4 days 06:13:24 13 days 03:26:17 0 days 00:00:34 0 days 00:00:00 126 days 23:36:45 4.32 4.51 3.0 1 36

If the number of unique events in an eventstream is high, we can leave events only from the list defined in ``event_list`` parameter. In the example below we leave the ``cart`` and ``payment_done`` events only as the events of high importance. We also transpose the output DataFrame for a nicer view. .. code-block:: python stream.describe_events() stream.describe_events(event_list=['payment_done', 'cart']).T .. raw:: html
event cart payment_done
basic_statistics number_of_occurrences 2842 706
unique_users 1924 653
number_of_occurrences_shared 0.09 0.02
unique_users_shared 0.51 0.17
time_to_FO_user_wise mean 3 days 08:59:14 7 days 01:37:54
std 11 days 19:28:46 17 days 09:10:00
median 0 days 00:00:56 0 days 00:01:34
min 0 days 00:00:01 0 days 00:00:08
max 118 days 16:11:36 115 days 09:18:59
steps_to_FO_user_wise mean 4.51 12.21
std 4.09 8.29
median 3.0 10.0
min 1 5
max 41 84

Often, such simple descriptive statistics are not enough to deeply understand the time-related values, so we want to see their distribution. For these purposes the following group of methods has been implemented. Time-based histograms ~~~~~~~~~~~~~~~~~~~~~ .. _eventstream_user_lifetime: User lifetime ^^^^^^^^^^^^^ One of the most important time-related statistics is user lifetime. By lifetime we mean the time distance between the first and the last event represented in a user's trajectory. The histogram for this variable is plotted by :py:meth:`user_lifetime_hist()` method. .. code-block:: python stream.user_lifetime_hist() .. figure:: /_static/user_guides/eventstream/01_user_lifetime_hist_simple.png :width: 500 The method has multiple parameters: .. _common_hist_params: - ``timedelta_unit`` defines a `datetime unit `_ that is used for the lifetime measuring; - ``log_scale`` sets logarithmic scale for the bins; - ``lower_cutoff_quantile``, ``upper_cutoff_quantile`` indicate the lower and upper quantiles (as floats between 0 and 1), the values between the quantiles only are considered for the histogram; - ``bins`` defines the number of histogram bins. Also can be the name of a reference rule or number of bins. See details in `numpy documentation `_; - ``width`` and ``height`` set figure width and height in inches. .. note:: The method is especially useful for selecting parameters to :py:meth:`DropPaths`. See :doc:`the user guide on preprocessing` for details. .. _eventstream_timedelta_hist: Timedelta between two events ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Previously, we have defined user lifetime as the timedelta between the beginning and the end of a user's path. This can be generalized. :py:meth:`timedelta_hist()` method shows a histogram for the distribution of timedeltas between a couple of specified events. The method supports similar formatting arguments (``timedelta_unit``, ``log_scale``, ``lower_cutoff_quantile``, ``upper_cutoff_quantile``, ``bins``, ``width``, ``height``) as we have already mentioned in :ref:`user_lifetime_hist` method. If no arguments are passed (except formatting arguments), timedeltas between all adjacent events are calculated within each user path. For example, this tiny eventstream .. figure:: /_static/user_guides/eventstream/02_timedelta_trivial_example.png :width: 500 generates 4 timedeltas :math:`\Delta_1, \Delta_2, \Delta_3, \Delta_4` as shown in the diagram. The timedeltas between events B and D, D and C, C and E are not taken into account because two events from each pair belong to different users. Here is how the histogram looks for the ``simple_shop`` dataset with ``log_scale=True`` and ``timedelta_unit='m'``: .. code-block:: python stream.timedelta_hist(log_scale=True, timedelta_unit='m') .. figure:: /_static/user_guides/eventstream/03_timedelta_log_scale.png :width: 500 This distribution of the adjacent events fairly common. It looks like a bimodal (which is not true: remember we use log-scale here), but these two bells help us to estimate a timeout for splitting sessions. From this charts we can see that it is reasonable to set it to some value between 10 and 100 minutes. Be careful if there are some synthetic events in the data. Usually those events are assigned with the same timestamp as their "parent" raw events. Thus, the distribution of the timedeltas between events will be heavily skewed to 0. Parameter ``raw_events_only=True`` can help in such a situation. Let us add to our dataset some common synthetic events using :ref:`AddStartEndEvents` and :ref:`SplitSessions` data processors. .. code-block:: python stream_with_synthetic = datasets\ .load_simple_shop()\ .add_start_end_events()\ .split_sessions(timeout=(30, 'm')) stream_with_synthetic.timedelta_hist(log_scale=True, timedelta_unit='m') stream_with_synthetic.timedelta_hist( raw_events_only=True, log_scale=True, timedelta_unit='m' ) .. figure:: /_static/user_guides/eventstream/04_timedelta_raw_events_only_false.png :width: 500 .. figure:: /_static/user_guides/eventstream/05_timedelta_raw_events_only_true.png :width: 500 You can see that on the second plot there is no high histogram bar located at :math:`\approx 10^{-3}`, so that the second histogram looks more natural. Another use case for :py:meth:`timedelta_hist()` is visualizing the distribution of timedeltas between two specific events. Assume we want to know how much time it takes for a user to go from ``product1`` to ``cart``. Then we set ``event_pair=('product1', 'cart')`` and pass it to ``timedelta_hist``: .. code-block:: python stream.timedelta_hist(event_pair=('product1', 'cart'), timedelta_unit='m') .. figure:: /_static/user_guides/eventstream/06_timedelta_pair_of_events.png :width: 500 From the Y scale, we see that such occurrences are not very numerous. This is because the method still works with only adjacent pairs of events (in this case ``product1`` and ``cart`` are assumed to go one right after another in a user's path). That is why the histogram is skewed to 0. ``adjacent_events_only`` parameter allows us to work with any cases when a user goes from ``product1`` to ``cart`` non-directly but passing through some other events: .. code-block:: python stream.timedelta_hist( event_pair=('product1', 'cart'), timedelta_unit='m', adjacent_events_only=False ) .. figure:: /_static/user_guides/eventstream/07_timedelta_adjacent_events_only.png :width: 500 We see that the number of observations has increased, especially around 0. In other words, for the vast majority of the users transition ``product1 → cart`` takes less than 1 day. On the other hands, we observe a "long tail" of the users whose journey from ``product1`` to ``cart`` takes multiple days. We can interpret this as there are two behavioral clusters: the users who are open for purchases, and the users who are picky. However, we also notice that adding a product to a cart does not necessarily mean that a user intends to make a purchase. Sometimes users adds an item to a cart just to check its final price, delivery options, etc. Here we should make a stop and explain how timedeltas between event pairs are calculated. Below you can see the picture of one user path and timedeltas that will be displayed in a ``timedelta_hist`` with the parameters ``event_pair=('A', 'B')`` and ``adjacent_events_only=False``. Let us consider each time delta calculation: - :math:`\Delta_1` is calculated between 'A' and 'B' events. 'D' and 'F' are ignored because of ``adjacent_events_only=False``. - The next 'A' event is colored grey and is skipped because there is one more 'A' event closer to the 'B' event. In such cases, we pick the 'A' event, that is closer to the next 'B' and calculate :math:`\Delta_2`. .. figure:: /_static/user_guides/eventstream/08_event_pair_explanation.png :width: 500 Single user path Now let us get back to our example. Due to the fact we have a lot of users with short trajectories and a few users with very long paths our histogram is unreadable. To make entire plot more comprehensible - the ``log_scale`` parameter can be used. We have already used that parameter for the ``x axis``, but it is also available fot the ``y axis``. For example: ``log_scale=(False, True)``. Another way to resolve that problem, is to look separately on different parts of our plot. For that purpose we can use parameters ``lower_cutoff_quantile`` and ``upper_cutoff_quantile``. These parameters specify boundaries for the histogram and will be applied last. In the example below, firstly, we keep users with ``event_pair=('product1', 'cart')`` and ``adjacent_events_only=False``, and after it we truncate 90% of users with the shortest trajectories and keep 10% of the longest. .. code-block:: python stream.timedelta_hist( event_pair=('product1', 'cart'), timedelta_unit='m', adjacent_events_only=False, lower_cutoff_quantile=0.9 ) .. figure:: /_static/user_guides/eventstream/timedelta_lower_cutoff_quantile.png Here it is the same algorithm, but 10% of users with the shortest trajectories will be kept. .. code-block:: python stream.timedelta_hist( event_pair=('product1', 'cart'), timedelta_unit='m', adjacent_events_only=False, upper_cutoff_quantile=0.1 ) .. figure:: /_static/user_guides/eventstream/timedelta_upper_cutoff_quantile.png If we set both parameters, boundaries will be calculated simultaneously and truncated afterward. Let us turn to another case. Sometimes we are interested in looking only at events within a user session. If we have already split the paths into sessions, we can use ``weight_col='session_id'``: .. code-block:: python stream_with_synthetic\ .timedelta_hist( event_pair=('product1', 'cart'), timedelta_unit='m', adjacent_events_only=False, weight_col='session_id' ) .. figure:: /_static/user_guides/eventstream/09_timedelta_sessions.png :width: 500 It is clear now that within a session the users walk from ``product1`` to ``cart`` event in less than 3 minutes. For frequently occurring events we might be interested in aggregating the timedeltas over sessions or users. For example, transition ``main -> catalog`` is quite frequent. Some users do these transitions quickly, some of them do not. It might be reasonable to aggregate the timedeltas over each user path first (we would get one value per one user at this step), and then visualize the distribution of these aggregated values. This can be done by passing an additional argument ``time_agg='mean'`` or ``time_agg='median'``. .. code-block:: python stream\ .timedelta_hist( event_pair=('main', 'catalog'), timedelta_unit='m', adjacent_events_only=False, weight_col='user_id', time_agg='mean' ) .. figure:: /_static/user_guides/eventstream/10_timedelta_time_agg_mean.png :width: 500 Eventstream global events ^^^^^^^^^^^^^^^^^^^^^^^^^ ``event_pair`` argument can accept a couple of auxiliary events: ``eventstream_start`` and ``eventstream_end``. They indicate the first and the last events in an evenstream. It is especially useful for choosing ``left_cutoff`` and ``right_cutoff`` parameters for :py:meth:`LabelCroppedPaths` data processor. Before you choose it, you can explore how a path's beginning/end margin from the right/left edge of an eventstream. In the histogram below, :math:`\Delta_1` illustrates such a margin for ``event_pair=('eventstream_start', 'B')``. Note that here only one timedelta is calculated - from the 'eventstream_start' to the first occurrence of specified event. .. figure:: /_static/user_guides/eventstream/11_timedelta_event_pair_with_global.png :width: 500 :math:`\Delta_1` in the following example illustrates a margin for ``event_pair=('B', 'eventstream_end')``. And again, only one timedelta per userpath is calculated - from the 'B' event (its last occurrence) to the 'eventstream_end'. .. figure:: /_static/user_guides/eventstream/11_timedelta_event_pair_with_global_end.png :width: 500 .. code-block:: python stream_with_synthetic\ .timedelta_hist( event_pair=('eventstream_start', 'path_end'), timedelta_unit='h', adjacent_events_only=False ) .. figure:: /_static/user_guides/eventstream/12_timedelta_eventstream_start_path_end.png :width: 500 For more details on how this histogram helps to define ``left_cutoff`` and ``right_cutoff`` parameters see :ref:`LabelCroppedPaths section` in the data processors user guide. .. _eventstream_events_timestamp: Event intensity ^^^^^^^^^^^^^^^ There is another helpful diagram that can be used for eventstream overview. Sometimes we want to know how the events are distributed over time. The histogram for this distribution is plotted by :py:meth:`event_timestamp_hist()` method. .. code-block:: python stream.event_timestamp_hist() .. figure:: /_static/user_guides/eventstream/13_event_timestamp_hist.png :width: 500 We can notice the heavy skew in the data towards the period between April and May of 2020. One of the possible interpretations of this fact is that the product worked in beta version until April 2020, and afterwards a stable were released so that new users started to arrive much more intense. ``event_timestamp_hist`` has ``event_list`` argument, so we can check this hypothesis by choosing ``path_start`` in the event list . .. code-block:: python stream\ .add_start_end_events()\ .event_timestamp_hist(event_list=['path_start']) .. figure:: /_static/user_guides/eventstream/14_event_timestamp_hist_event_list.png :width: 500 From this histogram we see that our hypothesis is true. New users started to arrive much more intense in April 2020. Similar to :py:meth:`timedelta_hist()`, ``event_timestamp_hist`` also has parameters ``raw_events_only``, ``upper_cutoff_quantile``, ``lower_cutoff_quantile``, ``bins``, ``width`` and ``height`` that work with the same logic.