Eventstream#


What is Eventstream?#

Eventstream is the core class in the retentioneering library. This data structure is designed around the following three purposes:

  • Data container. Eventstream class implements a convenient approach to storing clickstream data.

  • Preprocessing. Eventstream lets you implement a data preparation process efficiently. See the Preprocessing user guide for more details.

  • Applying analytical tools. Eventstream integrates with retentioneering tools and allows you to seamlessly apply them. See user guides on the path analysis tools.

Eventstream creation#

Default field names#

An Eventstream is a container for clickstream data that is initialized from a pandas.DataFrame. The class constructor expects the DataFrame to have at least 3 columns: user_id, event, timestamp.

Let us create a dummy DataFrame to illustrate the Eventstream initialization process:

import pandas as pd

df1 = pd.DataFrame(
    [
        ['user_1', 'A', '2023-01-01 00:00:00'],
        ['user_1', 'B', '2023-01-01 00:00:01'],
        ['user_2', 'B', '2023-01-01 00:00:02'],
        ['user_2', 'A', '2023-01-01 00:00:03'],
        ['user_2', 'A', '2023-01-01 00:00:04'],
    ],
    columns=['user_id', 'event', 'timestamp']
)

Having such a dataframe, you can create an eventstream simply as follows:

from retentioneering.eventstream import Eventstream
stream1 = Eventstream(df1)

To do the inverse transformation (i.e. obtain a DataFrame from an eventstream object), use the to_dataframe() method. However, the method is not just a converter. Using it, we can display the internal Eventstream structure:

stream1.to_dataframe()
event_type event_index event timestamp user_id event_id
0 path_start 0 path_start 2023-01-01 00:00:00 user_1 f7c892ef-87a7-40a8-be56-7694b7220f57
1 raw 0 A 2023-01-01 00:00:00 user_1 f7c892ef-87a7-40a8-be56-7694b7220f57
2 raw 1 B 2023-01-01 00:00:01 user_1 2e08e88d-26f1-4441-ba9b-8ffdc75d11d2
3 path_end 1 path_end 2023-01-01 00:00:01 user_1 2e08e88d-26f1-4441-ba9b-8ffdc75d11d2
4 path_start 2 path_start 2023-01-01 00:00:02 user_2 31ea0be2-4b63-45b9-a35b-b85822d60fc7
5 raw 2 B 2023-01-01 00:00:02 user_2 31ea0be2-4b63-45b9-a35b-b85822d60fc7
6 raw 3 A 2023-01-01 00:00:03 user_2 0d188292-d1cb-48d3-8917-de271a4ac62d
7 raw 4 A 2023-01-01 00:00:04 user_2 28bdab09-5a54-4400-b8f3-c039c61b4ca1
8 path_end 4 path_end 2023-01-01 00:00:04 user_2 28bdab09-5a54-4400-b8f3-c039c61b4ca1

We will describe the columns of the resulting DataFrame later in the Displaying eventstream section.

Changing default field names#

If your DataFrame uses custom column names, you can either rename the columns using pandas, or pass a mapping that tells the Eventstream constructor which source columns correspond to the expected fields. This is done with the raw_data_schema argument, which uses the RawDataSchema class under the hood.

Let us illustrate its usage with a DataFrame that contains the same data as before but has different column names (client_id, action and datetime):

df2 = pd.DataFrame(
    [
        ['user_1', 'A', '2023-01-01 00:00:00'],
        ['user_1', 'B', '2023-01-01 00:00:01'],
        ['user_2', 'B', '2023-01-01 00:00:02'],
        ['user_2', 'A', '2023-01-01 00:00:03'],
        ['user_2', 'A', '2023-01-01 00:00:04']
    ],
    columns=['client_id', 'action', 'datetime']
)

raw_data_schema = {
    'user_id': 'client_id',
    'event_name': 'action',
    'event_timestamp': 'datetime'
}

stream2 = Eventstream(df2, raw_data_schema=raw_data_schema)
stream2.to_dataframe().head(3)
event_type event_index event timestamp user_id event_id
0 path_start 0 path_start 2023-01-01 00:00:00 user_1 aef2b160-e554-42eb-8251-6f7937162d86
1 raw 0 A 2023-01-01 00:00:00 user_1 aef2b160-e554-42eb-8251-6f7937162d86
2 raw 1 B 2023-01-01 00:00:01 user_1 2f6561f5-57cc-4690-9e8f-69d34888ace8

As we see, the raw_data_schema argument maps the client_id, action, and datetime fields so that they are imported to the eventstream correctly.

Custom columns#

Another common case is when your DataFrame has some additional columns that you want to add to the eventstream. By default, these columns are included automatically. If you want to control explicitly which columns are included, list them in the custom_cols argument.

Suppose the initial DataFrame now also contains two columns, session and device, and you want to keep only the former. Then you can use the custom_cols parameter, which works as a whitelist.

df3 = pd.DataFrame(
    [
        ['user_1', 'A', '2023-01-01 00:00:00', 'session_1', 'mobile'],
        ['user_1', 'B', '2023-01-01 00:00:01', 'session_1', 'mobile'],
        ['user_2', 'B', '2023-01-01 00:00:02', 'session_2', 'desktop'],
        ['user_2', 'A', '2023-01-01 00:00:03', 'session_3', 'desktop'],
        ['user_2', 'A', '2023-01-01 00:00:04', 'session_3', 'desktop']
    ],
    columns=['client_id', 'action', 'datetime', 'session', 'device']
)

raw_data_schema = {
    'user_id': 'client_id',
    'event_name': 'action',
    'event_timestamp': 'datetime',
}

stream3 = Eventstream(df3, raw_data_schema=raw_data_schema, custom_cols=['session'])
stream3.to_dataframe().head(3)
event_type event_index event timestamp user_id session event_id
0 path_start 0 path_start 2023-01-01 00:00:00 user_1 session_1 ab6c0351-e5a9-4f87-9621-a3f8fd62e9d4
1 raw 0 A 2023-01-01 00:00:00 user_1 session_1 ab6c0351-e5a9-4f87-9621-a3f8fd62e9d4
2 raw 1 B 2023-01-01 00:00:01 user_1 session_1 74f15970-b16c-46c4-a37f-e989b1eaa58e

As we see from above, the session column was included in the eventstream while device was not.

The same result can be achieved by passing custom_cols as a key in the raw_data_schema dictionary. In this case, the value must be a list of dictionaries, one per custom field. Each dictionary must contain two fields: raw_data_col and custom_col. The former is a column name from the source DataFrame; the latter is the corresponding column name to be set in the resulting eventstream.

raw_data_schema = {
    'user_id': 'client_id',
    'event_name': 'action',
    'event_timestamp': 'datetime',
    'custom_cols': [
        {
            'raw_data_col': 'session',
            'custom_col': 'session_id'
        }
    ]
}

stream3 = Eventstream(df3, raw_data_schema=raw_data_schema)
stream3.to_dataframe().head(3)

Here the original session column is stored in the session_id column, according to the defined raw_data_schema.

If the core triple columns of the DataFrame are titled with the default names user_id, event, timestamp (instead of client_id, action, datetime) then you can ignore their mapping in the raw_data_schema and pass custom_cols argument only.
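When several custom columns need renaming, writing the list of dictionaries by hand gets repetitive. The sketch below assembles a dictionary in the format described above from a plain mapping. The helper name build_raw_data_schema is hypothetical and is not part of retentioneering; only the layout of the resulting dictionary comes from this guide.

```python
def build_raw_data_schema(user_id, event_name, event_timestamp, custom_cols=None):
    """Hypothetical helper: assemble a raw_data_schema dictionary.

    custom_cols maps a source DataFrame column name to the column name
    it should get in the eventstream.
    """
    schema = {
        'user_id': user_id,
        'event_name': event_name,
        'event_timestamp': event_timestamp,
    }
    if custom_cols:
        # One dictionary per custom field, as the schema format requires.
        schema['custom_cols'] = [
            {'raw_data_col': raw, 'custom_col': new}
            for raw, new in custom_cols.items()
        ]
    return schema


raw_data_schema = build_raw_data_schema(
    user_id='client_id',
    event_name='action',
    event_timestamp='datetime',
    custom_cols={'session': 'session_id'},
)
```

The resulting dictionary has the same content as the hand-written raw_data_schema with custom_cols shown earlier in this section, so it could be passed to the constructor in the same way.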

Eventstream field names#

You can also set the names of the eventstream columns (the default names are user_id, event, timestamp) by passing the schema argument. It is a dictionary whose keys are schema field names (user_id, event_name, and event_timestamp in the example below) and whose values are the desired column names. In the example below we build an eventstream from the same df1 DataFrame and name the basic columns client_id, action, and datetime:

new_eventstream_schema = {
    'user_id': 'client_id',
    'event_name': 'action',
    'event_timestamp': 'datetime'
}

stream1_new_schema = Eventstream(df1, schema=new_eventstream_schema)
stream1_new_schema.to_dataframe().head(3)
event_type event_index action datetime client_id event_id
0 path_start 0 path_start 2023-01-01 00:00:00 user_1 939ba582-06fd-4200-8f17-94fccc0ca13d
1 raw 0 A 2023-01-01 00:00:00 user_1 939ba582-06fd-4200-8f17-94fccc0ca13d
2 raw 1 B 2023-01-01 00:00:01 user_1 ebcf18dc-e87e-4550-a74e-a86ca3968bc7

The full list of eventstream fields and the corresponding column names is available in the Eventstream.schema attribute:

stream1_new_schema.schema
EventstreamSchema(
    event_id='event_id',
    event_type='event_type',
    event_index='event_index',
    event_name='action',
    event_timestamp='datetime',
    user_id='client_id',
    custom_cols=[]
)

Timestamp column format#

A few words about the time format. Timezone-aware timestamp columns are not supported in the eventstream and must be converted to a timezone-naive format. You can do this with the convert_tz parameter. By default it is None, so if the timestamps carry timezone information, a TypeError is raised and you have to specify how the data should be converted:

  • If 'local', the timestamp column is converted to local time, and the timezone part is truncated.

  • If 'UTC', the timestamp column is converted to UTC, and the timezone part is truncated.

Note that if your timestamp column contains mixed time offsets, the conversion to local time can be slow because the operation is applied to each row.

Here is an example with mixed time offsets:

df1_1 = pd.DataFrame(
    [
        ['user_1', 'A', '2023-01-01 00:00:00+02:00'],
        ['user_1', 'B', '2023-01-01 00:00:01+02:00'],
        ['user_2', 'B', '2023-01-01 00:00:02+04:00'],
        ['user_2', 'A', '2023-01-01 00:00:03+04:00'],
        ['user_2', 'A', '2023-01-01 00:00:04+02:00'],
    ],
    columns=['user_id', 'event', 'timestamp']
)

With convert_tz='local', you can see that the time was not changed and the timezone part was truncated:

stream_local = Eventstream(df1_1, convert_tz='local')
stream_local.to_dataframe()
event_type event_index event timestamp user_id event_id
0 path_start 0 path_start 2023-01-01 00:00:00 user_1 97498112-3365-4383-82df-718532c26e84
1 raw 0 A 2023-01-01 00:00:00 user_1 97498112-3365-4383-82df-718532c26e84
2 raw 1 B 2023-01-01 00:00:01 user_1 468cc136-544c-45df-b3a0-a35e38d40334
3 path_end 1 path_end 2023-01-01 00:00:01 user_1 468cc136-544c-45df-b3a0-a35e38d40334
4 path_start 2 path_start 2023-01-01 00:00:02 user_2 b534a1f7-d0d8-4cfb-b282-4059bfb38e4f
5 raw 2 B 2023-01-01 00:00:02 user_2 b534a1f7-d0d8-4cfb-b282-4059bfb38e4f
6 raw 3 A 2023-01-01 00:00:03 user_2 d4eea3b5-f8ec-4a55-a94c-25f022b08d33
7 raw 4 A 2023-01-01 00:00:04 user_2 568b221c-1d23-4a43-8147-8a8b6ffab193
8 path_end 4 path_end 2023-01-01 00:00:04 user_2 568b221c-1d23-4a43-8147-8a8b6ffab193

With convert_tz='UTC', we can see that the time and date were changed due to the conversion to UTC, and the timezone part was truncated:

stream_utc = Eventstream(df1_1, convert_tz='UTC')
stream_utc.to_dataframe()
event_type event_index event timestamp user_id event_id
0 path_start 0 path_start 2022-12-31 20:00:02 user_2 ae67608b-83f8-4633-9daa-445d4672d247
1 raw 0 B 2022-12-31 20:00:02 user_2 ae67608b-83f8-4633-9daa-445d4672d247
2 raw 1 A 2022-12-31 20:00:03 user_2 65ff2cab-c11f-4c13-ad29-e924df6dde4b
3 path_start 2 path_start 2022-12-31 22:00:00 user_1 107d1ea6-e644-483b-979a-a923f7b5f597
4 raw 2 A 2022-12-31 22:00:00 user_1 107d1ea6-e644-483b-979a-a923f7b5f597
5 raw 3 B 2022-12-31 22:00:01 user_1 cf7ef191-1bf2-4a32-85ec-f2e3e2521ca6
6 path_end 3 path_end 2022-12-31 22:00:01 user_1 cf7ef191-1bf2-4a32-85ec-f2e3e2521ca6
7 raw 4 A 2022-12-31 22:00:04 user_2 eed83ec4-7340-4179-bfad-40549a66ec4e
8 path_end 4 path_end 2022-12-31 22:00:04 user_2 eed83ec4-7340-4179-bfad-40549a66ec4e
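To make the UTC arithmetic concrete, the following standard-library sketch converts two of the timestamps from df1_1 by hand. It only illustrates what "convert to UTC and truncate the timezone part" means; it is not the library's implementation, and the function name to_naive_utc is made up for this example.

```python
from datetime import datetime, timezone

raw_timestamps = [
    '2023-01-01 00:00:00+02:00',
    '2023-01-01 00:00:02+04:00',
]


def to_naive_utc(value):
    """Parse an ISO timestamp with an offset, shift it to UTC, drop the tzinfo."""
    aware = datetime.fromisoformat(value)
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


naive_utc = [to_naive_utc(ts) for ts in raw_timestamps]
for source, converted in zip(raw_timestamps, naive_utc):
    print(source, '->', converted)
```

The +02:00 timestamp moves back two hours to 2022-12-31 22:00:00, and the +04:00 timestamp moves back four hours to 2022-12-31 20:00:02, which matches the rows in the table above.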

Path start and end#

For many practical reasons it is useful to keep synthetic events indicating path start and path end. Eventstream constructor adds path_start and path_end events explicitly. See also add_start_end_events data processor.

User sampling#

Contemporary data analysis usually involves working with large datasets. Using retentioneering to work with such datasets might cause the following undesirable effects:

  • High computational costs.

  • A messy big picture (especially when applying such tools as Transition Graph, StepMatrix and StepSankey). Insufficient user paths or a large number of almost identical paths (especially short paths) often add no value to the analysis. It might be reasonable to get rid of them.

  • Due to Eventstream design, all the data uploaded to an Eventstream instance is kept immutable. Even if you remove some eventstream rows while preprocessing, the data stays untouched: it just becomes hidden and is marked as removed. Thus, the only chance to tailor the dataset to a reasonable size is to sample the user paths at entry point - while applying Eventstream constructor.

The size of the original dataset can be reduced by path sampling. In theory, this procedure could affect the eventstream analysis, especially if you have rare but important events and behavioral patterns. Nevertheless, sampling is unlikely to distort the big picture, so we recommend using it when needed.

We also highlight that user path sampling means that we remove some random paths entirely. We guarantee that each sampled path keeps all of its events from the original dataset and is not truncated.

There are two sampling parameters in the Eventstream constructor: user_sample_size and user_sample_seed. There are two ways of setting the sample size:

  • A float number. For example, user_sample_size=0.1 means that we want to keep 10% of the paths and remove 90% of them.

  • An integer sample size is also possible. In this case the specified number of user paths will be kept.

user_sample_seed is a standard way to make random sampling reproducible (see this Stack Overflow explanation). You can set it to any integer number.

Below is a sampling example for simple_shop dataset.

from retentioneering import datasets

simple_shop_df = datasets.load_simple_shop(as_dataframe=True)
sampled_stream = Eventstream(
    simple_shop_df,
    user_sample_size=0.1,
    user_sample_seed=42
)

print('Original number of the events:', len(simple_shop_df))
print('Sampled number of the events:', len(sampled_stream.to_dataframe()))

unique_users_original = simple_shop_df['user_id'].nunique()
unique_users_sampled = sampled_stream.to_dataframe()['user_id'].nunique()

print('Original unique users number: ', unique_users_original)
print('Sampled unique users number: ', unique_users_sampled)
Original number of the events: 32283
Sampled number of the events: 4048
Original unique users number:  3751
Sampled unique users number:  375

We see that the number of users has been reduced from 3751 to 375 (10% exactly). The number of events has been reduced from 32283 to 4048 (12.5%), but we didn’t expect to see exactly 10% here, because whole user paths are sampled rather than individual events.
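The idea of sampling whole paths can be sketched in plain Python. This is a simplified model, not the retentioneering implementation: the function name sample_user_paths, the toy data, and the rule for rounding a float share down to a user count are all assumptions made for illustration.

```python
import random

# Toy (user_id, event) rows; hypothetical data for illustration only.
events = [
    ('user_1', 'A'), ('user_1', 'B'),
    ('user_2', 'B'), ('user_2', 'A'), ('user_2', 'A'),
    ('user_3', 'A'),
    ('user_4', 'B'), ('user_4', 'B'),
]


def sample_user_paths(events, sample_size, seed=None):
    """Keep every event of a random subset of users.

    A float sample_size is read as a share of users, an int as a user count.
    """
    users = sorted({user for user, _ in events})
    if isinstance(sample_size, float):
        k = int(len(users) * sample_size)  # assumed rounding rule
    else:
        k = sample_size
    # A dedicated Random instance makes the choice reproducible for a given seed.
    kept = set(random.Random(seed).sample(users, k))
    return [row for row in events if row[0] in kept]


sampled = sample_user_paths(events, sample_size=0.5, seed=42)
```

Because rows are filtered by user, a kept user always retains the full path, while the share of remaining events depends on how long the kept paths happen to be.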

Displaying eventstream#

Now let us look at columns represented in an eventstream and discuss to_dataframe() method using the example of stream3 eventstream.

stream3.to_dataframe()
event_type event_index event timestamp user_id session_id event_id
0 path_start 0 path_start 2023-01-01 00:00:00 user_1 session_1 ab6c0351-e5a9-4f87-9621-a3f8fd62e9d4
1 raw 0 A 2023-01-01 00:00:00 user_1 session_1 ab6c0351-e5a9-4f87-9621-a3f8fd62e9d4
2 raw 1 B 2023-01-01 00:00:01 user_1 session_1 74f15970-b16c-46c4-a37f-e989b1eaa58e
3 path_end 1 path_end 2023-01-01 00:00:01 user_1 session_1 74f15970-b16c-46c4-a37f-e989b1eaa58e
4 path_start 2 path_start 2023-01-01 00:00:02 user_2 session_2 493ba02b-32a9-4732-b02a-228076530f91
5 raw 2 B 2023-01-01 00:00:02 user_2 session_2 493ba02b-32a9-4732-b02a-228076530f91
6 raw 3 A 2023-01-01 00:00:03 user_2 session_3 a20b60e4-d7ae-4f6a-a0d0-ee9cc3d68c7e
7 raw 4 A 2023-01-01 00:00:04 user_2 session_3 7a343bdf-08f1-4b63-8b9c-f7868bba17f5
8 path_end 4 path_end 2023-01-01 00:00:04 user_2 session_3 7a343bdf-08f1-4b63-8b9c-f7868bba17f5

Besides the standard triple user_id, event, timestamp and custom column session_id we see the columns event_id, event_type, event_index. These are some technical columns, containing the following:

  • event_type - all the events that come from the sourcing DataFrame are of raw event type. However, preprocessing methods can add some synthetic events that have various event types. See the details in data processors user guide.

  • event_index - an integer which is associated with the event order. By default, an eventstream is sorted by timestamp, and optionally by event column. As for the synthetic events which are often placed at the beginning or at the end of a user’s path, special sorting is applied. See explanation of index and reindex logic for the details and also data processors user guide. Please note that the event index might have duplicate values. This is by design.

  • event_id - a string identifier of an eventstream row.

The to_dataframe() method accepts a single argument, copy. When this flag is True (by default it is False), an explicit copy of the DataFrame is created. See details in pandas documentation.

Eventstream index and reindex#

In the previous section, we have already mentioned the sorting algorithm when we described special event_type and event_index eventstream columns. Now, let us take a closer look at the sorting logic and illustrate it with several examples.

By default, raw events are sorted by the timestamp column, so events with the same timestamp are kept in the same order as in the source DataFrame. If you need a custom ordering for events with the same timestamp, use the events_order parameter. This is useful when, for technical reasons, events arrive at the server with the same timestamp in a random order, which creates unnecessary variability.

After initial sorting, the relative order of raw events is fixed, and re-sorting is not possible.

In the dummy dataframe below, there are two users, each with a pair of events (“A” and “B”) that have equal timestamps. If we create an eventstream with default parameters, the order will be preserved, exactly the same as in the input dataframe.

df4 = pd.DataFrame(
    [
        ['user_1', 'A', '2023-01-01 00:00:00'],
        ['user_1', 'B', '2023-01-01 00:00:00'],
        ['user_2', 'B', '2023-01-01 00:00:03'],
        ['user_2', 'A', '2023-01-01 00:00:03'],
        ['user_2', 'A', '2023-01-01 00:00:04']
    ],
    columns=['user_id', 'event', 'timestamp']
)

stream4 = Eventstream(df4)
stream4.to_dataframe()
event_type event_index event timestamp user_id event_id
0 path_start 0 path_start 2023-01-01 00:00:00 user_1 48c9e40c-b889-4ac4-92cd-b077cf3a621a
1 raw 0 A 2023-01-01 00:00:00 user_1 48c9e40c-b889-4ac4-92cd-b077cf3a621a
2 raw 1 B 2023-01-01 00:00:00 user_1 ec9f6658-b124-48c3-b067-6262d4bc3ceb
3 path_end 1 path_end 2023-01-01 00:00:00 user_1 ec9f6658-b124-48c3-b067-6262d4bc3ceb
4 path_start 2 path_start 2023-01-01 00:00:03 user_2 68bb195a-c2bf-41a2-94e9-9d05f5eb2d4f
5 raw 2 B 2023-01-01 00:00:03 user_2 68bb195a-c2bf-41a2-94e9-9d05f5eb2d4f
6 raw 3 A 2023-01-01 00:00:03 user_2 69a0fe6c-d6c2-4485-b527-0587abbbdca6
7 raw 4 A 2023-01-01 00:00:04 user_2 01813acd-de1a-4197-a825-7cce8c386739
8 path_end 4 path_end 2023-01-01 00:00:04 user_2 01813acd-de1a-4197-a825-7cce8c386739

Now we create a new Eventstream from our dummy DataFrame, but with the specified events_order=["B", "A"] parameter. As we can see, the first two events have swapped places.

Eventstream(df4, events_order=["B", "A"]).to_dataframe()
event_type event_index event timestamp user_id event_id
0 path_start 0 path_start 2023-01-01 00:00:00 user_1 3812e204-99dd-411b-9521-5ac45a7f2f64
1 raw 0 B 2023-01-01 00:00:00 user_1 3812e204-99dd-411b-9521-5ac45a7f2f64
2 raw 1 A 2023-01-01 00:00:00 user_1 474000df-16ad-44bb-8be3-db95cdb82917
3 path_end 1 path_end 2023-01-01 00:00:00 user_1 474000df-16ad-44bb-8be3-db95cdb82917
4 path_start 2 path_start 2023-01-01 00:00:03 user_2 eb1cb4b2-1db3-4f69-8037-9f3e5bc6bda8
5 raw 2 B 2023-01-01 00:00:03 user_2 eb1cb4b2-1db3-4f69-8037-9f3e5bc6bda8
6 raw 3 A 2023-01-01 00:00:03 user_2 e674ca48-2b96-4f18-a711-5c19a5afaf22
7 raw 4 A 2023-01-01 00:00:04 user_2 8ad543d7-3f70-4701-b3e2-dcca70adf7d5
8 path_end 4 path_end 2023-01-01 00:00:04 user_2 8ad543d7-3f70-4701-b3e2-dcca70adf7d5
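The tie-breaking rule can be modelled with a stable sort in plain Python. The sketch below is an illustration of the described behaviour on the df4 rows, not the library's code; the function name sort_events is an assumption.

```python
from datetime import datetime

rows = [
    ('user_1', 'A', '2023-01-01 00:00:00'),
    ('user_1', 'B', '2023-01-01 00:00:00'),
    ('user_2', 'B', '2023-01-01 00:00:03'),
    ('user_2', 'A', '2023-01-01 00:00:03'),
    ('user_2', 'A', '2023-01-01 00:00:04'),
]


def sort_events(rows, events_order=None):
    """Sort by timestamp; break ties by events_order when it is given.

    sorted() is stable, so without events_order tied rows keep input order.
    An event missing from events_order raises ValueError in this sketch.
    """
    def key(row):
        _, event, timestamp = row
        parsed = datetime.fromisoformat(timestamp)
        if events_order is None:
            return (parsed,)
        return (parsed, events_order.index(event))

    return sorted(rows, key=key)


default_order = [event for _, event, _ in sort_events(rows)]
custom_order = [event for _, event, _ in sort_events(rows, events_order=['B', 'A'])]
```

Here default_order keeps the input sequence A, B, B, A, A, while custom_order becomes B, A, B, A, A: only the first pair, which shares a timestamp and was in A, B order, is swapped.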

As for the synthetic events which are often placed at the beginning or at the end of a user’s path, special sorting is applied. There is a set of pre-defined event types that are arranged in the following default order:

IndexOrder = [
    "profile",
    "path_start",
    "new_user",
    "existing_user",
    "cropped_left",
    "session_start",
    "session_start_cropped",
    "group_alias",
    "raw",
    "raw_sleep",
    None,
    "synthetic",
    "synthetic_sleep",
    "positive_target",
    "negative_target",
    "session_end_cropped",
    "session_end",
    "session_sleep",
    "cropped_right",
    "absent_user",
    "lost_user",
    "path_end"
]

Most of these types are created by built-in data processors. Note that some of the types are not used right now and were created for future development.

For a full explanation of which data processor creates which event_type, see the data processors user guide.

If needed, you can pass a custom sorting list to the Eventstream constructor as the index_order argument.

If you already have an eventstream instance, you can assign a custom sorting list to the Eventstream.index_order attribute. Afterwards, you should use the index_events() method to apply the new sorting. For demonstration purposes we use the AddPositiveEvents data processor here, which adds new events with the prefix positive_target_.

add_events_stream = stream4.add_positive_events(targets=['B'])
add_events_stream.to_dataframe()
event_type event_index event timestamp user_id event_id
0 path_start 0 path_start 2023-01-01 00:00:00 user_1 1960fe37-d6b6-456d-b4a5-b7849c6bbe40
1 raw 0 A 2023-01-01 00:00:00 user_1 81ef2592-27f3-41b9-bc73-3f46ef5a602e
2 raw 1 B 2023-01-01 00:00:00 user_1 566df377-369f-4648-8a04-4b5b9928efa0
3 positive_target 1 positive_target_B 2023-01-01 00:00:00 user_1 566df377-369f-4648-8a04-4b5b9928efa0
4 path_end 1 path_end 2023-01-01 00:00:00 user_1 ae2842a8-5f9a-4188-bb90-a0238356f073
5 path_start 2 path_start 2023-01-01 00:00:03 user_2 156496ef-635a-497d-9ea4-dedefd6da778
6 raw 2 B 2023-01-01 00:00:03 user_2 7c01a1ab-c51d-41a8-84b5-71e6f8b1d2bc
7 positive_target 2 positive_target_B 2023-01-01 00:00:03 user_2 7c01a1ab-c51d-41a8-84b5-71e6f8b1d2bc
8 raw 3 A 2023-01-01 00:00:03 user_2 4e444914-2cba-48f1-8c59-6007cfe31c12
9 raw 4 A 2023-01-01 00:00:04 user_2 468c9bd6-3b7d-45fa-afe6-4ec75bfcbd51
10 path_end 4 path_end 2023-01-01 00:00:04 user_2 fd25c127-f573-4dc8-9ba4-19a0ef9969da

We see that positive_target_B events with type positive_target follow their raw parent event B. Assume we would like to change their order.

custom_sorting = [
    'profile',
    'path_start',
    'new_user',
    'existing_user',
    'cropped_left',
    'session_start',
    'session_start_cropped',
    'group_alias',
    'positive_target',
    'raw',
    'raw_sleep',
    None,
    'synthetic',
    'synthetic_sleep',
    'negative_target',
    'session_end_cropped',
    'session_end',
    'session_sleep',
    'cropped_right',
    'absent_user',
    'lost_user',
    'path_end'
]

add_events_stream.index_order = custom_sorting
add_events_stream.index_events()
add_events_stream.to_dataframe()
event_type event_index event timestamp user_id event_id
0 path_start 0 path_start 2023-01-01 00:00:00 user_1 1960fe37-d6b6-456d-b4a5-b7849c6bbe40
1 raw 0 A 2023-01-01 00:00:00 user_1 81ef2592-27f3-41b9-bc73-3f46ef5a602e
2 positive_target 1 positive_target_B 2023-01-01 00:00:00 user_1 566df377-369f-4648-8a04-4b5b9928efa0
3 raw 1 B 2023-01-01 00:00:00 user_1 566df377-369f-4648-8a04-4b5b9928efa0
4 path_end 1 path_end 2023-01-01 00:00:00 user_1 ae2842a8-5f9a-4188-bb90-a0238356f073
5 path_start 2 path_start 2023-01-01 00:00:03 user_2 156496ef-635a-497d-9ea4-dedefd6da778
6 positive_target 2 positive_target_B 2023-01-01 00:00:03 user_2 7c01a1ab-c51d-41a8-84b5-71e6f8b1d2bc
7 raw 2 B 2023-01-01 00:00:03 user_2 7c01a1ab-c51d-41a8-84b5-71e6f8b1d2bc
8 raw 3 A 2023-01-01 00:00:03 user_2 4e444914-2cba-48f1-8c59-6007cfe31c12
9 raw 4 A 2023-01-01 00:00:04 user_2 468c9bd6-3b7d-45fa-afe6-4ec75bfcbd51
10 path_end 4 path_end 2023-01-01 00:00:04 user_2 fd25c127-f573-4dc8-9ba4-19a0ef9969da

As we can see, the order of the events changed, and now raw events B follow positive_target_B events.
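The effect of the type order can be modelled as a rank lookup used to break ties. The sketch below is a simplified model inferred from the tables above, not the library's implementation: it assumes rows that share an event_index are ordered by the position of their event_type in the order list. The function name reorder_by_type is hypothetical, and the order lists are abbreviated to the four types that occur in the data.

```python
# (event_index, event_type, event) rows for user_1 from the example above.
events = [
    (0, 'path_start', 'path_start'),
    (0, 'raw', 'A'),
    (1, 'raw', 'B'),
    (1, 'positive_target', 'positive_target_B'),
    (1, 'path_end', 'path_end'),
]

# Abbreviated order lists: only the event types present in the toy data.
default_order = ['path_start', 'raw', 'positive_target', 'path_end']
custom_order = ['path_start', 'positive_target', 'raw', 'path_end']


def reorder_by_type(events, index_order):
    """Order rows by event_index, then by the rank of their event_type."""
    rank = {event_type: pos for pos, event_type in enumerate(index_order)}
    return sorted(events, key=lambda row: (row[0], rank[row[1]]))


default_names = [event for _, _, event in reorder_by_type(events, default_order)]
custom_names = [event for _, _, event in reorder_by_type(events, custom_order)]
```

With default_order the sequence is path_start, A, B, positive_target_B, path_end; with custom_order, positive_target_B moves in front of B while everything else stays in place.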

Descriptive methods#

Eventstream provides a set of methods for first-touch data exploration. To showcase how these methods work, we need a larger dataset, so we will use our simple_shop dataset. For demonstration purposes, we add a session_id column by applying the SplitSessions data processor.

from retentioneering import datasets

stream_with_sessions = datasets\
    .load_simple_shop()\
    .split_sessions(timeout=(30, 'm'))

stream_with_sessions.to_dataframe().head()
event_type event_index event timestamp user_id session_id event_id
0 path_start 0 path_start 2019-11-01 17:59:13.273932 219483890 219483890_1 a0f483d1-4707-4816-809c-60e455d923e2
1 session_start 0 session_start 2019-11-01 17:59:13.273932 219483890 219483890_1 68c3d0df-fabd-4d11-8109-fd50adba317f
2 raw 0 catalog 2019-11-01 17:59:13.273932 219483890 219483890_1 a0f483d1-4707-4816-809c-60e455d923e2
3 raw 1 product1 2019-11-01 17:59:28.459271 219483890 219483890_1 163e81cc-45bf-473a-a91e-086fa4ef83d4
4 raw 2 cart 2019-11-01 17:59:29.502214 219483890 219483890_1 590d1426-6f8c-4f7b-b9d0-4880c2afb3d7

General statistics#

Describe#

In a similar fashion to pandas, we use describe() for getting a general description of an eventstream.

stream_with_sessions.describe()
value
category metric
overall unique_users 3751
unique_events 16
unique_sessions 6454
eventstream_start 2019-11-01 17:59:13
eventstream_end 2020-04-29 12:48:07
eventstream_length 179 days 18:48:53
path_length_time mean 9 days 11:15:18
std 23 days 02:52:25
median 0 days 00:01:21
min 0 days 00:00:00
max 149 days 04:51:05
path_length_steps mean 14.05
std 11.43
median 11.0
min 5
max 124
session_length_time mean 0 days 00:00:52
std 0 days 00:01:08
median 0 days 00:00:30
min 0 days 00:00:00
max 0 days 00:23:44
session_length_steps mean 8.16
std 4.28
median 7.0
min 3
max 55

The output consists of three main categories:

  • overall statistics

  • full user-path statistics
    • time distribution

    • steps (events) distribution

  • sessions statistics
    • time distribution

    • steps (events) distribution

session_col parameter is optional and points to the eventstream column that contains session ids (session_id is the default value). If such a column is defined, session statistics are also included. Otherwise, the values related to sessions are not displayed.

There is one more parameter, raw_events_only (default False), that can be useful if some synthetic events have already been added by data processors. Note that those events affect all “*_steps” categories.

Now let us go through the main categories and take a closer look at some of the metrics:

overall

By eventstream start and eventstream end in the “Overall” block we denote the timestamps of the first and the last event in the eventstream, respectively. eventstream_length is the time distance between the eventstream start and end.

path/session length time and path/session length steps

These two blocks show statistics on the length of user paths and sessions. The “path/session_length_time” categories measure the length as a time duration, and the “path/session_length_steps” categories measure it as a number of events.

It is important to mention that all the values in “*_steps” categories are rounded to the 2nd decimal digit, and in “*_time” categories - to seconds. This is also true for the next method.

Describe events#

The describe_events() method provides event-wise statistics about an eventstream. Its output consists of three main blocks:

  1. basic statistics

  2. full user-path statistics,
    • time to first occurrence (FO) of each event,

    • steps to first occurrence (FO) of each event,

  3. sessions statistics (if the session column exists),
    • time to first occurrence (FO) of each event,

    • steps to first occurrence (FO) of each event.

You can find detailed explanations of each metric in the API documentation.

The default parameters are session_col='session_id', raw_events_only=False. With them, we will get statistics for each event present in our data. These two arguments work exactly the same way as in the describe() method.

stream = datasets.load_simple_shop()
stream.describe_events()
basic_statistics time_to_FO_user_wise steps_to_FO_user_wise
number_of_occurrences unique_users number_of_occurrences_shared unique_users_shared mean std median min max mean std median min max
event
cart 2842 1924 0.07 0.51 3 days 08:59:14 11 days 19:28:46 0 days 00:00:56 0 days 00:00:01 118 days 16:11:36 5.51 4.09 4.0 2 42
catalog 14518 3611 0.36 0.96 0 days 05:44:21 3 days 03:22:32 0 days 00:00:00 0 days 00:00:00 100 days 08:19:51 1.30 0.57 1.0 1 8
delivery_choice 1686 1356 0.04 0.36 5 days 09:18:08 15 days 03:19:15 0 days 00:01:12 0 days 00:00:03 118 days 16:11:37 7.78 5.56 6.0 3 50
delivery_courier 834 748 0.02 0.20 6 days 18:14:55 16 days 17:51:39 0 days 00:01:28 0 days 00:00:06 118 days 16:11:38 9.96 6.84 8.0 4 46
delivery_pickup 506 469 0.01 0.13 7 days 21:12:17 18 days 22:51:54 0 days 00:01:34 0 days 00:00:06 114 days 01:24:06 10.51 8.06 8.0 4 72
main 5635 2385 0.14 0.64 3 days 20:15:36 9 days 02:58:23 0 days 00:00:07 0 days 00:00:00 97 days 21:24:23 3.00 2.94 2.0 1 21
path_end 3751 3751 0.09 1.00 9 days 11:15:18 23 days 02:52:25 0 days 00:01:21 0 days 00:00:00 149 days 04:51:05 9.61 9.10 7.0 2 101
path_start 3751 3751 0.09 1.00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0 days 00:00:00 0.00 0.00 0.0 0 0
payment_card 565 521 0.01 0.14 6 days 21:42:26 17 days 18:52:33 0 days 00:01:40 0 days 00:00:08 138 days 04:51:25 12.14 7.34 10.0 6 66
payment_cash 197 190 0.00 0.05 13 days 23:17:25 24 days 00:00:02 0 days 00:02:18 0 days 00:00:10 118 days 16:11:39 15.15 11.10 10.5 6 74
payment_choice 1107 958 0.03 0.26 6 days 12:49:38 17 days 02:54:51 0 days 00:01:24 0 days 00:00:06 118 days 16:11:39 10.42 6.37 8.0 5 53
payment_done 706 653 0.02 0.17 7 days 01:37:54 17 days 09:10:00 0 days 00:01:34 0 days 00:00:08 115 days 09:18:59 13.21 8.29 11.0 6 85
product1 1515 1122 0.04 0.30 5 days 23:49:43 16 days 04:36:13 0 days 00:00:50 0 days 00:00:00 118 days 19:38:40 6.46 6.04 4.0 2 62
product2 2172 1430 0.05 0.38 4 days 06:13:24 13 days 03:26:17 0 days 00:00:34 0 days 00:00:00 126 days 23:36:45 5.32 4.51 4.0 2 37

If the number of unique events in an eventstream is high, we can keep only the events listed in the event_list parameter. In the example below we keep only the cart and payment_done events as the events of high importance. We also transpose the output DataFrame for a nicer view.

stream.describe_events(event_list=['payment_done', 'cart']).T
event cart payment_done
basic_statistics number_of_occurrences 2842 706
unique_users 1924 653
number_of_occurrences_shared 0.07 0.02
unique_users_shared 0.51 0.17
time_to_FO_user_wise mean 3 days 08:59:14 7 days 01:37:54
std 11 days 19:28:46 17 days 09:10:00
median 0 days 00:00:56 0 days 00:01:34
min 0 days 00:00:01 0 days 00:00:08
max 118 days 16:11:36 115 days 09:18:59
steps_to_FO_user_wise mean 5.51 13.21
std 4.09 8.29
median 4.0 11.0
min 2 6
max 42 85

Often, such simple descriptive statistics are not enough to deeply understand the time-related values, so we want to see their distribution. For these purposes the following group of methods has been implemented.

Time-based histograms#

User lifetime#

One of the most important time-related statistics is user lifetime. By lifetime we mean the time distance between the first and the last event represented in a user’s trajectory. The histogram for this variable is plotted by user_lifetime_hist() method.

stream.user_lifetime_hist()
../_images/01_user_lifetime_hist_simple.png

The method has multiple parameters:

  • timedelta_unit defines a datetime unit that is used for the lifetime measuring;

  • log_scale sets logarithmic scale for the bins;

  • lower_cutoff_quantile, upper_cutoff_quantile indicate the lower and upper quantiles (as floats between 0 and 1), the values between the quantiles only are considered for the histogram;

  • bins defines the number of histogram bins. It can also be the name of a reference rule for choosing the bins. See details in numpy documentation;

  • width and height set figure width and height in inches.

Note

The method is especially useful for selecting parameters to DropPaths. See the user guide on preprocessing for details.

Timedelta between two events#

Previously, we defined user lifetime as the timedelta between the beginning and the end of a user’s path. This can be generalized. The timedelta_hist() method shows a histogram of the distribution of timedeltas between a pair of specified events.

The method supports formatting arguments (timedelta_unit, log_scale, lower_cutoff_quantile, upper_cutoff_quantile, bins, width, height) similar to those already described for the user_lifetime_hist method.

If no arguments are passed (except formatting arguments), timedeltas between all adjacent events are calculated within each user path. For example, this tiny eventstream

../_images/02_timedelta_trivial_example.png

generates 4 timedeltas \(\Delta_1, \Delta_2, \Delta_3, \Delta_4\) as shown in the diagram. The timedeltas between events B and D, D and C, C and E are not taken into account because two events from each pair belong to different users.
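
The rule "adjacent events within each user path" can be sketched in plain pandas as follows. The toy data below is hypothetical and does not reproduce the diagram above; the snippet only illustrates why pairs of events that belong to different users produce no timedelta.

```python
import pandas as pd

# Hypothetical toy clickstream with two users.
df = pd.DataFrame(
    [
        ['user_1', 'A', '2023-01-01 00:00:00'],
        ['user_1', 'B', '2023-01-01 00:00:05'],
        ['user_2', 'D', '2023-01-01 00:00:06'],
        ['user_2', 'C', '2023-01-01 00:01:06'],
        ['user_2', 'E', '2023-01-01 00:01:36'],
    ],
    columns=['user_id', 'event', 'timestamp'],
)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values(['user_id', 'timestamp'])

# diff() is applied per user, so the first event of each path yields NaT
# and no timedelta is ever computed across two different users.
deltas = df.groupby('user_id')['timestamp'].diff().dropna()
print(deltas)
```

Five events from two users give three timedeltas here: one for user_1 and two for user_2. The B to D transition is skipped because the two events belong to different paths.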

Here is how the histogram looks for the simple_shop dataset with log_scale=True and timedelta_unit='m':

stream.timedelta_hist(log_scale=True, timedelta_unit='m')
../_images/03_timedelta_log_scale.png

This distribution of timedeltas between adjacent events is fairly common. It looks bimodal (which is not true: remember we use log-scale here), but these two bells help us estimate a timeout for splitting sessions. From this chart we can see that it is reasonable to set it to some value between 10 and 100 minutes.

Be careful if there are synthetic events in the data. Usually those events are assigned the same timestamp as their “parent” raw events. Thus, the distribution of the timedeltas between events will be heavily skewed to 0. The raw_events_only=True parameter can help in such a situation. Let us add some common synthetic events to our dataset using the AddStartEndEvents and SplitSessions data processors.

stream_with_synthetic = datasets\
    .load_simple_shop()\
    .add_start_end_events()\
    .split_sessions(timeout=(30, 'm'))

stream_with_synthetic.timedelta_hist(log_scale=True, timedelta_unit='m')
stream_with_synthetic.timedelta_hist(
    raw_events_only=True,
    log_scale=True,
    timedelta_unit='m'
)
../_images/04_timedelta_raw_events_only_false.png
../_images/05_timedelta_raw_events_only_true.png

You can see that on the second plot there is no high histogram bar located at \(\approx 10^{-3}\), so that the second histogram looks more natural.
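
The effect of synthetic events can be reproduced on a tiny hand-made example. The sketch below assumes a DataFrame shaped like the to_dataframe() output, with an event_type column that marks raw events as 'raw'; the data and the helper adjacent_deltas are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical single path with synthetic path_start / path_end events
# that share timestamps with their "parent" raw events.
df = pd.DataFrame(
    [
        ['path_start', 'path_start', '2023-01-01 00:00:00', 'user_1'],
        ['raw', 'A', '2023-01-01 00:00:00', 'user_1'],
        ['raw', 'B', '2023-01-01 00:02:00', 'user_1'],
        ['path_end', 'path_end', '2023-01-01 00:02:00', 'user_1'],
    ],
    columns=['event_type', 'event', 'timestamp', 'user_id'],
)
df['timestamp'] = pd.to_datetime(df['timestamp'])


def adjacent_deltas(frame):
    """Timedeltas between adjacent events within each user path."""
    return frame.groupby('user_id')['timestamp'].diff().dropna()


all_deltas = adjacent_deltas(df)                            # includes synthetic events
raw_deltas = adjacent_deltas(df[df['event_type'] == 'raw'])  # raw events only

zero_count = (all_deltas == pd.Timedelta(0)).sum()
print(zero_count, list(raw_deltas))
```

With synthetic events included, two of the three timedeltas are exactly zero; after keeping raw events only, the single meaningful 2-minute timedelta remains.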

Another use case for timedelta_hist() is visualizing the distribution of timedeltas between two specific events. Assume we want to know how much time it takes for a user to go from product1 to cart. Then we set event_pair=('product1', 'cart') and pass it to timedelta_hist:

stream.timedelta_hist(event_pair=('product1', 'cart'), timedelta_unit='m')
../_images/06_timedelta_pair_of_events.png

From the Y scale, we see that such occurrences are not very numerous. This is because the method still works only with adjacent pairs of events (in this case product1 and cart are assumed to go one right after another in a user’s path). That is why the histogram is skewed to 0. The adjacent_events_only parameter allows us to include the cases when a user goes from product1 to cart indirectly, passing through some other events:

stream.timedelta_hist(
    event_pair=('product1', 'cart'),
    timedelta_unit='m',
    adjacent_events_only=False
)
../_images/07_timedelta_adjacent_events_only.png

We see that the number of observations has increased, especially around 0. In other words, for the vast majority of the users the transition product1 -> cart takes less than 1 day. On the other hand, we observe a “long tail” of users whose journey from product1 to cart takes multiple days. We can interpret this as two behavioral clusters: the users who are open to purchases, and the users who are picky. However, we should also keep in mind that adding a product to a cart does not necessarily mean that a user intends to make a purchase. Sometimes users add an item to a cart just to check its final price, delivery options, etc.

Let us pause here and explain how timedeltas between event pairs are calculated. Below you can see a picture of one user path and the timedeltas that will be displayed in a timedelta_hist with the parameters event_pair=('A', 'B') and adjacent_events_only=False.

Let us consider each time delta calculation:

  • \(\Delta_1\) is calculated between ‘A’ and ‘B’ events. ‘D’ and ‘F’ are ignored because of adjacent_events_only=False.

  • The next ‘A’ event is colored grey and is skipped because there is another ‘A’ event closer to the ‘B’ event. In such cases, we pick the ‘A’ event that is closest to the next ‘B’ and calculate \(\Delta_2\).

../_images/08_event_pair_explanation.png

Single user path#
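
The pairing rule described above can be sketched in pure Python. This is a simplified illustration under stated assumptions, not the library's implementation: the function pair_timedeltas and the sample path are hypothetical, and we assume that each ‘B’ is matched with the latest preceding ‘A’ that has not been matched yet.

```python
from datetime import datetime, timedelta


def pair_timedeltas(path, first, second):
    """Timedeltas between `first` and `second` events in ONE user's path.

    `path` is a list of (event, timestamp) tuples sorted by timestamp.
    Events other than `first` and `second` are ignored, which mimics
    adjacent_events_only=False.
    """
    deltas = []
    last_first_ts = None
    for event, ts in path:
        if event == first:
            # A later `first` event overrides an earlier unmatched one.
            last_first_ts = ts
        elif event == second and last_first_ts is not None:
            deltas.append(ts - last_first_ts)
            last_first_ts = None  # assumption: each `first` is used at most once
    return deltas


t0 = datetime(2023, 1, 1)
path = [
    ('A', t0),
    ('D', t0 + timedelta(minutes=1)),
    ('F', t0 + timedelta(minutes=2)),
    ('B', t0 + timedelta(minutes=3)),
    ('A', t0 + timedelta(minutes=4)),  # skipped: a closer 'A' follows
    ('A', t0 + timedelta(minutes=6)),
    ('B', t0 + timedelta(minutes=7)),
]
deltas = pair_timedeltas(path, 'A', 'B')
print(deltas)
```

For this path the sketch yields two timedeltas: 3 minutes (from the first ‘A’ to the first ‘B’, ignoring ‘D’ and ‘F’) and 1 minute (from the last ‘A’ to the second ‘B’).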

Now let us get back to our example. Because we have a lot of users with short trajectories and a few users with very long paths, our histogram is hard to read.

To make the entire plot more comprehensible, the log_scale parameter can be used. We have already used that parameter for the x axis, but it is also available for the y axis. For example: log_scale=(False, True).

Another way to resolve this problem is to look at different parts of the plot separately. For that purpose we can use the lower_cutoff_quantile and upper_cutoff_quantile parameters. These parameters specify the boundaries for the histogram and are applied last.

In the example below, we first select the timedeltas with event_pair=('product1', 'cart') and adjacent_events_only=False, and then we cut off the 90% of users with the shortest trajectories and keep the 10% with the longest.

stream.timedelta_hist(
    event_pair=('product1', 'cart'),
    timedelta_unit='m',
    adjacent_events_only=False,
    lower_cutoff_quantile=0.9
)
../_images/timedelta_lower_cutoff_quantile.png

The algorithm here is the same, but the 10% of users with the shortest trajectories are kept.

stream.timedelta_hist(
    event_pair=('product1', 'cart'),
    timedelta_unit='m',
    adjacent_events_only=False,
    upper_cutoff_quantile=0.1
)
../_images/timedelta_upper_cutoff_quantile.png

If we set both parameters, both boundaries are calculated first on the same data, and the truncation is applied afterward.
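
The order of operations for the two cutoffs can be illustrated with a short pandas sketch. The values below are hypothetical, and treating the boundaries as inclusive is an assumption made for the illustration; the snippet is not the library's implementation.

```python
import pandas as pd

# Hypothetical timedeltas in minutes: 1, 2, ..., 100.
values = pd.Series(range(1, 101), dtype='float64')

lower_q, upper_q = 0.1, 0.9

# Both boundaries are computed on the full, untruncated data...
lower_bound = values.quantile(lower_q)
upper_bound = values.quantile(upper_q)

# ...and only then is the data truncated (inclusive bounds are an assumption).
kept = values[(values >= lower_bound) & (values <= upper_bound)]
print(len(kept), kept.min(), kept.max())
```

Computing the upper boundary after applying the lower cutoff would give a different result, because the 0.9 quantile would then be taken over an already reduced sample.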

Let us turn to another case. Sometimes we are interested in looking only at events within a user session. If we have already split the paths into sessions, we can use weight_col='session_id':

stream_with_synthetic\
    .timedelta_hist(
        event_pair=('product1', 'cart'),
        timedelta_unit='m',
        adjacent_events_only=False,
        weight_col='session_id'
    )
../_images/09_timedelta_sessions.png

It is clear now that within a session the users walk from product1 to cart event in less than 3 minutes.

For frequently occurring events we might be interested in aggregating the timedeltas over sessions or users. For example, transition main -> catalog is quite frequent. Some users do these transitions quickly, some of them do not. It might be reasonable to aggregate the timedeltas over each user path first (we would get one value per one user at this step), and then visualize the distribution of these aggregated values. This can be done by passing an additional argument time_agg='mean' or time_agg='median'.

stream\
    .timedelta_hist(
        event_pair=('main', 'catalog'),
        timedelta_unit='m',
        adjacent_events_only=False,
        weight_col='user_id',
        time_agg='mean'
    )
../_images/10_timedelta_time_agg_mean.png
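
The aggregation step can be sketched as an ordinary pandas groupby. The per-transition timedeltas below are hypothetical; the snippet only shows how several values per user collapse into a single value before the histogram is built.

```python
import pandas as pd

# Hypothetical timedeltas (in minutes) of main -> catalog transitions.
deltas = pd.DataFrame(
    {
        'user_id': ['user_1', 'user_1', 'user_1', 'user_2'],
        'delta_minutes': [1.0, 2.0, 9.0, 30.0],
    }
)

# One aggregated value per user, analogous to time_agg='mean' / 'median'
# combined with weight_col='user_id'.
per_user_mean = deltas.groupby('user_id')['delta_minutes'].mean()
per_user_median = deltas.groupby('user_id')['delta_minutes'].median()
print(per_user_mean)
print(per_user_median)
```

Note how the mean and the median differ for user_1 (4.0 versus 2.0): the choice of the aggregation function matters when a user has a few unusually long transitions.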

Eventstream global events#

The event_pair argument can also accept two auxiliary events: eventstream_start and eventstream_end. They indicate the first and the last events in an eventstream.

This is especially useful for choosing the left_cutoff and right_cutoff parameters of the LabelCroppedPaths data processor. Before choosing them, you can explore how far a path’s beginning/end lies from the right/left edge of an eventstream. In the diagram below, \(\Delta_1\) illustrates such a margin for event_pair=('eventstream_start', 'B'). Note that only one timedelta is calculated here: from ‘eventstream_start’ to the first occurrence of the specified event.

../_images/11_timedelta_event_pair_with_global.png

\(\Delta_1\) in the following example illustrates a margin for event_pair=('B', 'eventstream_end'). Again, only one timedelta per user path is calculated: from the ‘B’ event (its last occurrence) to ‘eventstream_end’.

../_images/11_timedelta_event_pair_with_global_end.png
stream_with_synthetic\
    .timedelta_hist(
        event_pair=('eventstream_start', 'path_end'),
        timedelta_unit='h',
        adjacent_events_only=False
    )
../_images/12_timedelta_eventstream_start_path_end.png

For more details on how this histogram helps to define left_cutoff and right_cutoff parameters see LabelCroppedPaths section in the data processors user guide.
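
The margins from the eventstream edges can be sketched by hand in pandas. The toy data is hypothetical; the snippet follows the description above (the first occurrence of ‘B’ for eventstream_start, the last occurrence for eventstream_end) and is not the library's implementation.

```python
import pandas as pd

# Hypothetical toy clickstream.
df = pd.DataFrame(
    [
        ['user_1', 'A', '2023-01-01 00:00:00'],
        ['user_1', 'B', '2023-01-02 00:00:00'],
        ['user_1', 'B', '2023-01-04 00:00:00'],
        ['user_2', 'B', '2023-01-05 00:00:00'],
        ['user_2', 'A', '2023-01-10 00:00:00'],
    ],
    columns=['user_id', 'event', 'timestamp'],
)
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Global edges of the eventstream.
stream_start = df['timestamp'].min()
stream_end = df['timestamp'].max()

# One timedelta per user path for each edge.
b_times = df[df['event'] == 'B'].groupby('user_id')['timestamp']
start_to_first_b = b_times.min() - stream_start  # ('eventstream_start', 'B')
last_b_to_end = stream_end - b_times.max()       # ('B', 'eventstream_end')
print(start_to_first_b)
print(last_b_to_end)
```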

Event intensity#

There is another helpful diagram that can be used for eventstream overview. Sometimes we want to know how the events are distributed over time. The histogram for this distribution is plotted by event_timestamp_hist() method.

stream.event_timestamp_hist()
../_images/13_event_timestamp_hist.png

We can notice a heavy skew in the data towards the period between April and May of 2020. One possible interpretation is that the product worked in beta version until April 2020, and afterwards a stable version was released, so that new users started to arrive much more intensively. event_timestamp_hist has the event_list argument, so we can check this hypothesis by including path_start in the event list.

stream\
    .add_start_end_events()\
    .event_timestamp_hist(event_list=['path_start'])
../_images/14_event_timestamp_hist_event_list.png

From this histogram we see that our hypothesis is supported: new users started to arrive much more intensively in April 2020.

Similar to timedelta_hist(), event_timestamp_hist also has parameters raw_events_only, upper_cutoff_quantile, lower_cutoff_quantile, bins, width and height that work with the same logic.
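
The same check can be approximated numerically in pandas by counting events and first-time users per month. The toy data below is hypothetical, and taking a user's earliest timestamp as the moment of arrival is an assumption that plays the role of the path_start event.

```python
import pandas as pd

# Hypothetical toy clickstream.
df = pd.DataFrame(
    [
        ['user_1', 'main', '2020-01-15'],
        ['user_1', 'cart', '2020-04-02'],
        ['user_2', 'main', '2020-04-10'],
        ['user_3', 'main', '2020-04-20'],
        ['user_3', 'cart', '2020-05-01'],
    ],
    columns=['user_id', 'event', 'timestamp'],
)
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Overall event intensity: number of events per calendar month.
events_per_month = df['timestamp'].dt.to_period('M').value_counts().sort_index()

# Arrival of new users: the month of each user's first event.
first_seen = df.groupby('user_id')['timestamp'].min()
new_users_per_month = first_seen.dt.to_period('M').value_counts().sort_index()

print(events_per_month)
print(new_users_per_month)
```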

Path metrics#

Besides the pre-defined statistics, you can calculate any custom metric over paths using the path_metrics() method. In case of multiple metrics, the method accepts a list of tuples, each representing a metric and its name, and returns a DataFrame. A metric can be defined as an arbitrary callable or a pandas.NamedAgg to be applied to the grouped eventstream data. Also, some string shortcuts such as len, has:TARGET_EVENT, time_to:TARGET_EVENT are available.

metrics = [
    ('len', 'path_length'),
    ('has:cart', 'has_cart'),
    ('time_to:cart', 'time_to_cart'),
    (lambda _df: (_df['event'] == 'cart').sum(), 'cart_count'),
    (pd.NamedAgg('timestamp', lambda s: len(s.dt.date.unique())), 'active_days')
]

stream.path_metrics(metrics).head()
path_length has_cart time_to_cart cart_count active_days
122915 34 True 6 days 01:22:39.090422 1 2
463458 12 False NaT 0 1
1475907 16 True 23 days 13:03:45.213509 1 2
1576626 3 False NaT 0 1
2112338 7 False NaT 0 1
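
To clarify what "applied to the grouped eventstream data" means for a callable and for a pandas.NamedAgg, here is a rough plain-pandas analogue of the cart_count and active_days metrics. The toy data is hypothetical, and the snippet only mirrors the idea with standard groupby calls; it is not how path_metrics() is implemented.

```python
import pandas as pd

# Hypothetical toy clickstream.
df = pd.DataFrame(
    [
        ['user_1', 'main', '2023-01-01 10:00:00'],
        ['user_1', 'cart', '2023-01-01 10:05:00'],
        ['user_1', 'cart', '2023-01-02 09:00:00'],
        ['user_2', 'main', '2023-01-01 12:00:00'],
    ],
    columns=['user_id', 'event', 'timestamp'],
)
df['timestamp'] = pd.to_datetime(df['timestamp'])

paths = df.groupby('user_id')

# A callable metric receives the sub-DataFrame of a single path.
cart_count = paths[['event', 'timestamp']].apply(
    lambda _df: (_df['event'] == 'cart').sum()
)

# A NamedAgg metric aggregates one column of each path.
active_days = paths.agg(
    active_days=pd.NamedAgg(
        column='timestamp',
        aggfunc=lambda s: s.dt.normalize().nunique(),
    )
)['active_days']

metrics = pd.DataFrame({'cart_count': cart_count, 'active_days': active_days})
print(metrics)
```

Each row of the resulting DataFrame corresponds to one path and each column to one metric, which matches the shape of the path_metrics() output shown above.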