Eventstream#
What is Eventstream?#
Eventstream is the core class in the retentioneering library. This data structure is designed around three following purposes:
Data container. Eventstream class implements a convenient approach to storing clickstream data.
Preprocessing. Eventstream allows to efficiently implement a data preparation process. See Preprocessing user guide for more details.
Applying analytical tools. Eventstream integrates with retentioneering tools and allows you to seamlessly apply them. See user guides on the path analysis tools.
Eventstream creation#
Default field names#
An Eventstream
is a container for clickstream data, that is initialized from a pandas.DataFrame
.
The class constructor expects the DataFrame to have at least 3 columns:
user_id
, event
, timestamp
.
Let us create a dummy DataFrame to illustrate Eventstream init process:
import pandas as pd
df1 = pd.DataFrame(
[
['user_1', 'A', '2023-01-01 00:00:00'],
['user_1', 'B', '2023-01-01 00:00:01'],
['user_2', 'B', '2023-01-01 00:00:02'],
['user_2', 'A', '2023-01-01 00:00:03'],
['user_2', 'A', '2023-01-01 00:00:04'],
],
columns=['user_id', 'event', 'timestamp']
)
Having such a dataframe, you can create an eventstream simply as follows:
from retentioneering.eventstream import Eventstream
stream1 = Eventstream(df1)
To do the inverse transformation (i.e. obtain a DataFrame from an eventstream object),
to_dataframe()
method can be used.
However, the method is not just a converter. Using it, we can display the internal Eventstream
structure:
stream1.to_dataframe()
event_type | event_index | event | timestamp | user_id | event_id | |
---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2023-01-01 00:00:00 | user_1 | f7c892ef-87a7-40a8-be56-7694b7220f57 |
1 | raw | 0 | A | 2023-01-01 00:00:00 | user_1 | f7c892ef-87a7-40a8-be56-7694b7220f57 |
2 | raw | 1 | B | 2023-01-01 00:00:01 | user_1 | 2e08e88d-26f1-4441-ba9b-8ffdc75d11d2 |
3 | path_end | 1 | path_end | 2023-01-01 00:00:01 | user_1 | 2e08e88d-26f1-4441-ba9b-8ffdc75d11d2 |
4 | path_start | 2 | path_start | 2023-01-01 00:00:02 | user_2 | 31ea0be2-4b63-45b9-a35b-b85822d60fc7 |
5 | raw | 2 | B | 2023-01-01 00:00:02 | user_2 | 31ea0be2-4b63-45b9-a35b-b85822d60fc7 |
6 | raw | 3 | A | 2023-01-01 00:00:03 | user_2 | 0d188292-d1cb-48d3-8917-de271a4ac62d |
7 | raw | 4 | A | 2023-01-01 00:00:04 | user_2 | 28bdab09-5a54-4400-b8f3-c039c61b4ca1 |
8 | path_end | 4 | path_end | 2023-01-01 00:00:04 | user_2 | 28bdab09-5a54-4400-b8f3-c039c61b4ca1 |
We will describe the columns of the resulting DataFrame later in Displaying eventstream section.
Changing default field names#
For custom DataFrame column names you can either rename them
using pandas, or set a mapping rule that would tell the Eventstream constructor
the mapping to the correct column names.
This can be done with Eventstream attribute raw_data_schema
with uses
RawDataSchema
class under the hood.
Let us illustrate its usage with the following example with the same dataframe
containing the same data but with different column names
(client_id
, action
and datetime
):
df2 = pd.DataFrame(
[
['user_1', 'A', '2023-01-01 00:00:00'],
['user_1', 'B', '2023-01-01 00:00:01'],
['user_2', 'B', '2023-01-01 00:00:02'],
['user_2', 'A', '2023-01-01 00:00:03'],
['user_2', 'A', '2023-01-01 00:00:04']
],
columns=['client_id', 'action', 'datetime']
)
raw_data_schema = {
'user_id': 'client_id',
'event_name': 'action',
'event_timestamp': 'datetime'
}
stream2 = Eventstream(df2, raw_data_schema=raw_data_schema)
stream2.to_dataframe().head(3)
event_type | event_index | event | timestamp | user_id | event_id | |
---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2023-01-01 00:00:00 | user_1 | aef2b160-e554-42eb-8251-6f7937162d86 |
1 | raw | 0 | A | 2023-01-01 00:00:00 | user_1 | aef2b160-e554-42eb-8251-6f7937162d86 |
2 | raw | 1 | B | 2023-01-01 00:00:01 | user_1 | 2f6561f5-57cc-4690-9e8f-69d34888ace8 |
As we see, raw_data_schema
argument maps fields client_id
, action
,
and datetime
so that they are imported to the eventstream correctly.
Custom columns#
Another common case is when your DataFrame has some additional columns
that you want to add to the eventstream. By default, these columns are included automatically. In case you want to explicitly control what columns should be included, you can add them in the custom_cols
argument.
Suppose the initial DataFrame now also contains two columns: session
and device
, and you want to leave the former column only. Then you can use custom_cols
parameter that works as a whitelist.
df3 = pd.DataFrame(
[
['user_1', 'A', '2023-01-01 00:00:00', 'session_1', 'mobile'],
['user_1', 'B', '2023-01-01 00:00:01', 'session_1', 'mobile'],
['user_2', 'B', '2023-01-01 00:00:02', 'session_2', 'desktop'],
['user_2', 'A', '2023-01-01 00:00:03', 'session_3', 'desktop'],
['user_2', 'A', '2023-01-01 00:00:04', 'session_3', 'desktop']
],
columns=['client_id', 'action', 'datetime', 'session', 'device']
)
raw_data_schema = {
'user_id': 'client_id',
'event_name': 'action',
'event_timestamp': 'datetime',
}
stream3 = Eventstream(df3, raw_data_schema=raw_data_schema, custom_cols=['session'])
stream3.to_dataframe().head(3)
event_type | event_index | event | timestamp | user_id | session_id | event_id | |
---|---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2023-01-01 00:00:00 | user_1 | session_1 | ab6c0351-e5a9-4f87-9621-a3f8fd62e9d4 |
1 | raw | 0 | A | 2023-01-01 00:00:00 | user_1 | session_1 | ab6c0351-e5a9-4f87-9621-a3f8fd62e9d4 |
2 | raw | 1 | B | 2023-01-01 00:00:01 | user_1 | session_1 | 74f15970-b16c-46c4-a37f-e989b1eaa58e |
As we see from above, the session
column was included in the eventstream while device
was not.
The same results could be achieved if you pass custom_cols
as a key in the RawDataSchema
. In this case, the value must be a list of dictionaries, one dict per one custom field. A single dict must contain two fields: raw_data_col
and custom_col
. The former stands for a field name from the sourcing dataframe, the latter stands for the corresponding field name to be set in the resulting eventstream.
raw_data_schema = {
'user_id': 'client_id',
'event_name': 'action',
'event_timestamp': 'datetime',
'custom_cols': [
{
'raw_data_col': 'session',
'custom_col': 'session_id'
}
]
}
stream3 = Eventstream(df3, raw_data_schema=raw_data_schema)
stream3.to_dataframe().head(3)
Here we see that the original session
column is stored in session_id
column,
according to the defined raw_data_schema
If the core triple columns of the DataFrame are titled with the default names
user_id
, event
, timestamp
(instead of client_id
, action
, datetime
)
then you can ignore their mapping in the raw_data_schema
and pass custom_cols
argument only.
Eventstream field names#
You can also set the names of the eventstream columns (the default names are user_id
, event
, timestamp
) by defining schema
argument. It is a dictionary with the following possible keys
, the values are the desired column names. In the example below we set a schema for the same stream1 and define the names of the basic columns as client_id
, action
, and datetime
:
new_eventstream_schema = {
'user_id': 'client_id',
'event_name': 'action',
'event_timestamp': 'datetime'
}
stream1_new_schema = Eventstream(df1, schema=new_eventstream_schema)
stream1_new_schema.to_dataframe().head(3)
event_type | event_index | action | datetime | client_id | event_id | |
---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2023-01-01 00:00:00 | user_1 | 939ba582-06fd-4200-8f17-94fccc0ca13d |
1 | raw | 0 | A | 2023-01-01 00:00:00 | user_1 | 939ba582-06fd-4200-8f17-94fccc0ca13d |
2 | raw | 1 | B | 2023-01-01 00:00:01 | user_1 | ebcf18dc-e87e-4550-a74e-a86ca3968bc7 |
The full list of an eventstream fields and the corresponding values is available in Eventstream.schema
attribute:
stream1_new_schema.schema
EventstreamSchema(
event_id='event_id',
event_type='event_type',
event_index='event_index',
event_name='action',
event_timestamp='datetime',
user_id='client_id',
custom_cols=[]
)
Timestamp column format#
A few words about the time format. The work with timezone-aware timestamp column are not supported in the eventstream and should be converted to timezone-naive format.
You can do it using convert_tz
parameter. By default it is None
, so if you got the timezone information the TypeError will be raised and you should specify how you would like to convert your data:
If
local
, the timestamp column will be converted to local time, and the timezone part will be truncated.If
UTC
, the timestamp column will be converted to utc time, and the timezone part will be truncated.
Note, that if your timestamp column contains mixed time offsets, the conversion to local time could be slow due to the operation being applied to each row.
Here it is the example, where we have mixed time offsets:
df1_1 = pd.DataFrame(
[
['user_1', 'A', '2023-01-01 00:00:00+02:00'],
['user_1', 'B', '2023-01-01 00:00:01+02:00'],
['user_2', 'B', '2023-01-01 00:00:02+04:00'],
['user_2', 'A', '2023-01-01 00:00:03+04:00'],
['user_2', 'A', '2023-01-01 00:00:04+02:00'],
],
columns=['user_id', 'event', 'timestamp']
)
You can see, that time was not changed, and timezone part was truncated:
stream_local = Eventstream(df1_1, convert_tz='local')
stream_local.to_dataframe()
event_type | event_index | action | datetime | client_id | event_id | |
---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2023-01-01 00:00:00 | user_1 | 97498112-3365-4383-82df-718532c26e84 |
1 | raw | 0 | A | 2023-01-01 00:00:00 | user_1 | 97498112-3365-4383-82df-718532c26e84 |
2 | raw | 1 | B | 2023-01-01 00:00:01 | user_1 | 468cc136-544c-45df-b3a0-a35e38d40334 |
3 | path_end | 1 | path_end | 2023-01-01 00:00:01 | user_1 | 468cc136-544c-45df-b3a0-a35e38d40334 |
4 | path_start | 2 | path_start | 2023-01-01 00:00:02 | user_2 | b534a1f7-d0d8-4cfb-b282-4059bfb38e4f |
5 | raw | 2 | B | 2023-01-01 00:00:02 | user_2 | b534a1f7-d0d8-4cfb-b282-4059bfb38e4f |
6 | raw | 3 | A | 2023-01-01 00:00:03 | user_2 | d4eea3b5-f8ec-4a55-a94c-25f022b08d33 |
7 | raw | 4 | A | 2023-01-01 00:00:04 | user_2 | 568b221c-1d23-4a43-8147-8a8b6ffab193 |
8 | path_end | 4 | path_end | 2023-01-01 00:00:04 | user_2 | 568b221c-1d23-4a43-8147-8a8b6ffab193 |
Specifying convert_tz='UTC'
we can see, that time and date were changed due to convertion to utc time.
was truncated:
stream_utc = Eventstream(df1_1, convert_tz='UTC')
stream_utc.to_dataframe()
event_type | event_index | action | datetime | client_id | event_id | |
---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2022-12-31 20:00:02 | user_2 | ae67608b-83f8-4633-9daa-445d4672d247 |
1 | raw | 0 | B | 2022-12-31 20:00:02 | user_2 | ae67608b-83f8-4633-9daa-445d4672d247 |
2 | raw | 1 | A | 2022-12-31 20:00:03 | user_2 | 65ff2cab-c11f-4c13-ad29-e924df6dde4b |
3 | path_start | 2 | path_start | 2022-12-31 22:00:00 | user_1 | 107d1ea6-e644-483b-979a-a923f7b5f597 |
4 | raw | 2 | A | 2022-12-31 22:00:00 | user_1 | 107d1ea6-e644-483b-979a-a923f7b5f597 |
5 | raw | 3 | B | 2022-12-31 22:00:01 | user_1 | cf7ef191-1bf2-4a32-85ec-f2e3e2521ca6 |
6 | path_end | 3 | path_end | 2022-12-31 22:00:01 | user_1 | cf7ef191-1bf2-4a32-85ec-f2e3e2521ca6 |
7 | raw | 4 | A | 2022-12-31 22:00:04 | user_2 | eed83ec4-7340-4179-bfad-40549a66ec4e |
8 | path_end | 4 | path_end | 2022-12-31 22:00:04 | user_2 | eed83ec4-7340-4179-bfad-40549a66ec4e |
Path start and end#
For many practical reasons it is useful to keep synthetic events indicating path start and path end. Eventstream constructor adds path_start
and path_end
events explicitly. See also add_start_end_events data processor.
User sampling#
Contemporary data analysis usually involve working with large datasets. Using retentioneering to work with such datasets might cause the following undesirable effects:
High computational costs.
The messy big picture (especially in case of applying such tools as Transition Graph, StepMatrix and StepSankey). Insufficient user paths or large number of almost identical paths (especially short paths) often add no value to the analysis. It might be reasonable to get rid of them.
Due to Eventstream design, all the data uploaded to an Eventstream instance is kept immutable. Even if you remove some eventstream rows while preprocessing, the data stays untouched: it just becomes hidden and is marked as removed. Thus, the only chance to tailor the dataset to a reasonable size is to sample the user paths at entry point - while applying Eventstream constructor.
The size of the original dataset can be reduced by path sampling. In theory, this procedure could affect the eventstream analysis, especially in case you have rare but important events and behavioral patterns. Nevertheless, the sampling is less likely to distort the big picture, so we recommend to use it when it is needed.
We also highlight that user path sampling means that we remove some random paths entirely. We guarantee that the sampled paths contain all the events from the original dataset, and they are not truncated.
There are a couple sampling parameters in the Eventstream constructor: user_sample_size
and user_sample_seed
. There are two ways of setting the sample size:
A float number. For example,
user_sample_size=0.1
means that we want to leave 10% ot the paths and remove 90% of them.An integer sample size is also possible. In this case a specified number of events will be left.
user_sample_seed
is a standard way to make random sampling reproducible
(see this Stack Overflow explanation).
You can set it to any integer number.
Below is a sampling example for simple_shop dataset.
from retentioneering import datasets
simple_shop_df = datasets.load_simple_shop(as_dataframe=True)
sampled_stream = Eventstream(
simple_shop_df,
user_sample_size=0.1,
user_sample_seed=42
)
print('Original number of the events:', len(simple_shop_df))
print('Sampled number of the events:', len(sampled_stream.to_dataframe()))
unique_users_original = simple_shop_df['user_id'].nunique()
unique_users_sampled = sampled_stream.to_dataframe()['user_id'].nunique()
print('Original unique users number: ', unique_users_original)
print('Sampled unique users number: ', unique_users_sampled)
Original number of the events: 32283
Sampled number of the events: 4048
Original unique users number: 3751
Sampled unique users number: 375
We see that the number of the users has been reduced from 3751 to 375 (10% exactly). The number of the events has been reduced from 32283 to 4048 (12.5%), but we didn’t expect to see exact 10% here.
Displaying eventstream#
Now let us look at columns represented in an eventstream and discuss
to_dataframe()
method using the example of stream3
eventstream.
stream3.to_dataframe()
event_type | event_index | event | timestamp | user_id | session_id | event_id | |
---|---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2023-01-01 00:00:00 | user_1 | session_1 | ab6c0351-e5a9-4f87-9621-a3f8fd62e9d4 |
1 | raw | 0 | A | 2023-01-01 00:00:00 | user_1 | session_1 | ab6c0351-e5a9-4f87-9621-a3f8fd62e9d4 |
2 | raw | 1 | B | 2023-01-01 00:00:01 | user_1 | session_1 | 74f15970-b16c-46c4-a37f-e989b1eaa58e |
3 | path_end | 1 | path_end | 2023-01-01 00:00:01 | user_1 | session_1 | 74f15970-b16c-46c4-a37f-e989b1eaa58e |
4 | path_start | 2 | path_start | 2023-01-01 00:00:02 | user_2 | session_2 | 493ba02b-32a9-4732-b02a-228076530f91 |
5 | raw | 2 | B | 2023-01-01 00:00:02 | user_2 | session_2 | 493ba02b-32a9-4732-b02a-228076530f91 |
6 | raw | 3 | A | 2023-01-01 00:00:03 | user_2 | session_3 | a20b60e4-d7ae-4f6a-a0d0-ee9cc3d68c7e |
7 | raw | 4 | A | 2023-01-01 00:00:04 | user_2 | session_3 | 7a343bdf-08f1-4b63-8b9c-f7868bba17f5 |
8 | path_end | 4 | path_end | 2023-01-01 00:00:04 | user_2 | session_3 | 7a343bdf-08f1-4b63-8b9c-f7868bba17f5 |
Besides the standard triple user_id
, event
, timestamp
and custom column session_id
we see the columns event_id
, event_type
, event_index
.
These are some technical columns, containing the following:
event_type
- all the events that come from the sourcing DataFrame are ofraw
event type. However, preprocessing methods can add some synthetic events that have various event types. See the details in data processors user guide.event_index
- an integer which is associated with the event order. By default, an eventstream is sorted by timestamp, and optionally byevent
column. As for the synthetic events which are often placed at the beginning or in the end of a user’s path, special sorting is applied. See explanation of index and reindex logic for the details and also data processors user guide. Please note that the event index might has duplicated values. It is ok due to its design.event_id
- a string identifier of an eventstream row.copy
- the only argument of to_dataframe method. When this flag isTrue
(by default it isFalse
) then an explicit copy of the DataFrame is created. See details in pandas documentation.
Eventstream index and reindex#
In the previous section, we have already mentioned the sorting algorithm when we described special
event_type
and event_index
eventstream columns.
Now, let us take a closer look at the sorting logic and illustrate it with several examples.
By default, raw events are sorted by timestamp column. So the events with the same timestamps are kept in the same
order as they are represented in the sourcing DataFrame.
In case you need some custom ordering for the events with the same timestamps use the events_order
parameter.
For example, due to some technical reasons events can arrive at the server with the same timestamp and
randomly change the order and create unnecessary variability.
After initial sorting, the relative order of raw events is fixed, and re-sorting is not possible.
In the dummy dataframe below, there are two users, each with a pair of events (“A” and “B”) that have equal timestamps. If we create an eventstream with default parameters, the order will be preserved, exactly the same as in the input dataframe.
df4 = pd.DataFrame(
[
['user_1', 'A', '2023-01-01 00:00:00'],
['user_1', 'B', '2023-01-01 00:00:00'],
['user_2', 'B', '2023-01-01 00:00:03'],
['user_2', 'A', '2023-01-01 00:00:03'],
['user_2', 'A', '2023-01-01 00:00:04']
],
columns=['user_id', 'event', 'timestamp']
)
stream4 = Eventstream(df4)
stream4.to_dataframe()
event_type | event_index | event | timestamp | user_id | event_id | |
---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2023-01-01 00:00:00 | user_1 | 48c9e40c-b889-4ac4-92cd-b077cf3a621a |
1 | raw | 0 | A | 2023-01-01 00:00:00 | user_1 | 48c9e40c-b889-4ac4-92cd-b077cf3a621a |
2 | raw | 1 | B | 2023-01-01 00:00:00 | user_1 | ec9f6658-b124-48c3-b067-6262d4bc3ceb |
3 | path_end | 1 | path_end | 2023-01-01 00:00:00 | user_1 | ec9f6658-b124-48c3-b067-6262d4bc3ceb |
4 | path_start | 2 | path_start | 2023-01-01 00:00:03 | user_2 | 68bb195a-c2bf-41a2-94e9-9d05f5eb2d4f |
5 | raw | 2 | B | 2023-01-01 00:00:03 | user_2 | 68bb195a-c2bf-41a2-94e9-9d05f5eb2d4f |
6 | raw | 3 | A | 2023-01-01 00:00:03 | user_2 | 69a0fe6c-d6c2-4485-b527-0587abbbdca6 |
7 | raw | 4 | A | 2023-01-01 00:00:04 | user_2 | 01813acd-de1a-4197-a825-7cce8c386739 |
8 | path_end | 4 | path_end | 2023-01-01 00:00:04 | user_2 | 01813acd-de1a-4197-a825-7cce8c386739 |
Now we create a new Eventstream from our dummy DataFrame, but with the specified events_order=["B", "A"]
parameter.
As we can see, the first two events have swapped places.
Eventstream(df4, events_order=["B", "A"]).to_dataframe()
event_type | event_index | event | timestamp | user_id | event_id | |
---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2023-01-01 00:00:00 | user_1 | 3812e204-99dd-411b-9521-5ac45a7f2f64 |
1 | raw | 0 | B | 2023-01-01 00:00:00 | user_1 | 3812e204-99dd-411b-9521-5ac45a7f2f64 |
2 | raw | 1 | A | 2023-01-01 00:00:00 | user_1 | 474000df-16ad-44bb-8be3-db95cdb82917 |
3 | path_end | 1 | path_end | 2023-01-01 00:00:00 | user_1 | 474000df-16ad-44bb-8be3-db95cdb82917 |
4 | path_start | 2 | path_start | 2023-01-01 00:00:03 | user_2 | eb1cb4b2-1db3-4f69-8037-9f3e5bc6bda8 |
5 | raw | 2 | B | 2023-01-01 00:00:03 | user_2 | eb1cb4b2-1db3-4f69-8037-9f3e5bc6bda8 |
6 | raw | 3 | A | 2023-01-01 00:00:03 | user_2 | e674ca48-2b96-4f18-a711-5c19a5afaf22 |
7 | raw | 4 | A | 2023-01-01 00:00:04 | user_2 | 8ad543d7-3f70-4701-b3e2-dcca70adf7d5 |
8 | path_end | 4 | path_end | 2023-01-01 00:00:04 | user_2 | 8ad543d7-3f70-4701-b3e2-dcca70adf7d5 |
As for the synthetic events which are often placed at the beginning or in the end of a user’s path, special sorting is applied. There is a set of pre-defined event types, that are arranged in the following default order:
IndexOrder = [
"profile",
"path_start",
"new_user",
"existing_user",
"cropped_left",
"session_start",
"session_start_cropped",
"group_alias",
"raw",
"raw_sleep",
None,
"synthetic",
"synthetic_sleep",
"positive_target",
"negative_target",
"session_end_cropped",
"session_end",
"session_sleep",
"cropped_right",
"absent_user",
"lost_user",
"path_end"
]
Most of these types are created by build-in data processors. Note that some of the types are not used right now and were created for future development.
To see full explanation about which data processor creates which event_type
you can explore
the data processors user guide.
If needed, you can pass a custom sorting list to the Eventstream constructor as
the index_order
argument.
In case you already have an eventstream instance, you can assign a custom sorting list
to Eventstream.index_order
attribute. Afterwards, you should use
index_events()
method to
apply this new sorting. For demonstration purposes we use here a
AddPositiveEvents
data processor, which adds new event with prefix positive_target_
.
add_events_stream = stream4.add_positive_events(targets=['B'])
add_events_stream.to_dataframe()
event_type | event_index | event | timestamp | user_id | event_id | |
---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2023-01-01 00:00:00 | user_1 | 1960fe37-d6b6-456d-b4a5-b7849c6bbe40 |
1 | raw | 0 | A | 2023-01-01 00:00:00 | user_1 | 81ef2592-27f3-41b9-bc73-3f46ef5a602e |
2 | raw | 1 | B | 2023-01-01 00:00:00 | user_1 | 566df377-369f-4648-8a04-4b5b9928efa0 |
3 | positive_target | 1 | positive_target_B | 2023-01-01 00:00:00 | user_1 | 566df377-369f-4648-8a04-4b5b9928efa0 |
4 | path_end | 1 | path_end | 2023-01-01 00:00:00 | user_1 | ae2842a8-5f9a-4188-bb90-a0238356f073 |
5 | path_start | 2 | path_start | 2023-01-01 00:00:03 | user_2 | 156496ef-635a-497d-9ea4-dedefd6da778 |
6 | raw | 2 | B | 2023-01-01 00:00:03 | user_2 | 7c01a1ab-c51d-41a8-84b5-71e6f8b1d2bc |
7 | positive_target | 2 | positive_target_B | 2023-01-01 00:00:03 | user_2 | 7c01a1ab-c51d-41a8-84b5-71e6f8b1d2bc |
8 | raw | 3 | A | 2023-01-01 00:00:03 | user_2 | 4e444914-2cba-48f1-8c59-6007cfe31c12 |
9 | raw | 4 | A | 2023-01-01 00:00:04 | user_2 | 468c9bd6-3b7d-45fa-afe6-4ec75bfcbd51 |
10 | path_end | 4 | path_end | 2023-01-01 00:00:04 | user_2 | fd25c127-f573-4dc8-9ba4-19a0ef9969da |
We see that positive_target_B
events with type positive_target
follow their raw
parent event B
. Assume we would like to change their order.
custom_sorting = [
'profile',
'path_start',
'new_user',
'existing_user',
'cropped_left',
'session_start',
'session_start_cropped',
'group_alias',
'positive_target',
'raw',
'raw_sleep',
None,
'synthetic',
'synthetic_sleep',
'negative_target',
'session_end_cropped',
'session_end',
'session_sleep',
'cropped_right',
'absent_user',
'lost_user',
'path_end'
]
add_events_stream.index_order = custom_sorting
add_events_stream.index_events()
add_events_stream.to_dataframe()
event_type | event_index | event | timestamp | user_id | event_id | |
---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2023-01-01 00:00:00 | user_1 | 1960fe37-d6b6-456d-b4a5-b7849c6bbe40 |
1 | raw | 0 | A | 2023-01-01 00:00:00 | user_1 | 81ef2592-27f3-41b9-bc73-3f46ef5a602e |
2 | positive_target | 1 | positive_target_B | 2023-01-01 00:00:00 | user_1 | 566df377-369f-4648-8a04-4b5b9928efa0 |
3 | raw | 1 | B | 2023-01-01 00:00:00 | user_1 | 566df377-369f-4648-8a04-4b5b9928efa0 |
4 | path_end | 1 | path_end | 2023-01-01 00:00:00 | user_1 | ae2842a8-5f9a-4188-bb90-a0238356f073 |
5 | path_start | 2 | path_start | 2023-01-01 00:00:03 | user_2 | 156496ef-635a-497d-9ea4-dedefd6da778 |
6 | positive_target | 2 | positive_target_B | 2023-01-01 00:00:03 | user_2 | 7c01a1ab-c51d-41a8-84b5-71e6f8b1d2bc |
7 | raw | 2 | B | 2023-01-01 00:00:03 | user_2 | 7c01a1ab-c51d-41a8-84b5-71e6f8b1d2bc |
8 | raw | 3 | A | 2023-01-01 00:00:03 | user_2 | 4e444914-2cba-48f1-8c59-6007cfe31c12 |
9 | raw | 4 | A | 2023-01-01 00:00:04 | user_2 | 468c9bd6-3b7d-45fa-afe6-4ec75bfcbd51 |
10 | path_end | 4 | path_end | 2023-01-01 00:00:04 | user_2 | fd25c127-f573-4dc8-9ba4-19a0ef9969da |
As we can see, the order of the events changed, and now raw
events B
follow positive_target_B
events.
Descriptive methods#
Eventstream provides a set of methods for a first touch data
exploration. To showcase how these methods work, we
need a larger dataset, so we will use our simple_shop
dataset.
For demonstration purposes, we add session_id
column by applying
SplitSessions
data processor.
from retentioneering import datasets
stream_with_sessions = datasets\
.load_simple_shop()\
.split_sessions(timeout=(30, 'm'))
stream_with_sessions.to_dataframe().head()
event_type | event_index | event | timestamp | user_id | session_id | event_id | |
---|---|---|---|---|---|---|---|
0 | path_start | 0 | path_start | 2019-11-01 17:59:13.273932 | 219483890 | 219483890_1 | a0f483d1-4707-4816-809c-60e455d923e2 |
1 | session_start | 0 | session_start | 2019-11-01 17:59:13.273932 | 219483890 | 219483890_1 | 68c3d0df-fabd-4d11-8109-fd50adba317f |
2 | raw | 0 | catalog | 2019-11-01 17:59:13.273932 | 219483890 | 219483890_1 | a0f483d1-4707-4816-809c-60e455d923e2 |
3 | raw | 1 | product1 | 2019-11-01 17:59:28.459271 | 219483890 | 219483890_1 | 163e81cc-45bf-473a-a91e-086fa4ef83d4 |
4 | raw | 2 | cart | 2019-11-01 17:59:29.502214 | 219483890 | 219483890_1 | 590d1426-6f8c-4f7b-b9d0-4880c2afb3d7 |
General statistics#
Describe#
In a similar fashion to pandas, we use describe()
for getting a general description of an eventstream.
stream_with_sessions.describe()
value | ||
---|---|---|
category | metric | |
overall | unique_users | 3751 |
unique_events | 16 | |
unique_sessions | 6454 | |
eventstream_start | 2019-11-01 17:59:13 | |
eventstream_end | 2020-04-29 12:48:07 | |
eventstream_length | 179 days 18:48:53 | |
path_length_time | mean | 9 days 11:15:18 |
std | 23 days 02:52:25 | |
median | 0 days 00:01:21 | |
min | 0 days 00:00:00 | |
max | 149 days 04:51:05 | |
path_length_steps | mean | 14.05 |
std | 11.43 | |
median | 11.0 | |
min | 5 | |
max | 124 | |
session_length_time | mean | 0 days 00:00:52 |
std | 0 days 00:01:08 | |
median | 0 days 00:00:30 | |
min | 0 days 00:00:00 | |
max | 0 days 00:23:44 | |
session_length_steps | mean | 8.16 |
std | 4.28 | |
median | 7.0 | |
min | 3 | |
max | 55 |
The output consists of three main categories:
overall statistics
- full user-path statistics
time distribution
steps (events) distribution
- sessions statistics
time distribution
steps (events) distribution
session_col
parameter is optional and points to the eventstream column that contains session ids
(session_id
is the default value). If such a column is defined, session statistics are also included.
Otherwise, the values related to sessions are not displayed.
There is one more parameter - raw_events_only
(default False) that could be useful if some synthetic
events have already been added by adding data processors.
Note that those events affect all “*_steps” categories.
Now let us go through the main categories and take a closer look at some of the metrics:
overall
By eventstream start
and eventstream end
in the “Overall” block we denote timestamps of the
first event and the last event in the eventstream correspondingly. eventstream_length
is the time distance between event stream start and end.
path/session length time and path/session length steps
These two blocks show some time-based statistics over user paths and sessions. Categories “path/session_length_time” and “path/session length steps” provide similar information on the length of users paths and sessions correspondingly. The former is calculated in days and the latter in the number of events.
It is important to mention that all the values in “*_steps” categories are rounded to the 2nd decimal digit, and in “*_time” categories - to seconds. This is also true for the next method.
Describe events#
The describe_events()
method provides event-wise statistics about an eventstream. Its output consists of three main blocks:
basic statistics
- full user-path statistics,
time to first occurrence (FO) of each event,
steps to first occurrence (FO) of each event,
- sessions statistics (if this column exists),
time to first occurrence (FO) of each event,
steps to first occurrence (FO) of each event.
You can find detailed explanations of each metric in
api documentation
.
The default parameters are session_col='session_id'
, raw_events_only=False
.
With them, we will get statistics for each event present in our data. These two arguments
work exactly the same way as in the describe() method.
stream = datasets.load_simple_shop()
stream.describe_events()
basic_statistics | time_to_FO_user_wise | steps_to_FO_user_wise | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
number_of_occurrences | unique_users | number_of_occurrences_shared | unique_users_shared | mean | std | median | min | max | mean | std | median | min | max | |
event | ||||||||||||||
cart | 2842 | 1924 | 0.07 | 0.51 | 3 days 08:59:14 | 11 days 19:28:46 | 0 days 00:00:56 | 0 days 00:00:01 | 118 days 16:11:36 | 5.51 | 4.09 | 4.0 | 2 | 42 |
catalog | 14518 | 3611 | 0.36 | 0.96 | 0 days 05:44:21 | 3 days 03:22:32 | 0 days 00:00:00 | 0 days 00:00:00 | 100 days 08:19:51 | 1.30 | 0.57 | 1.0 | 1 | 8 |
delivery_choice | 1686 | 1356 | 0.04 | 0.36 | 5 days 09:18:08 | 15 days 03:19:15 | 0 days 00:01:12 | 0 days 00:00:03 | 118 days 16:11:37 | 7.78 | 5.56 | 6.0 | 3 | 50 |
delivery_courier | 834 | 748 | 0.02 | 0.20 | 6 days 18:14:55 | 16 days 17:51:39 | 0 days 00:01:28 | 0 days 00:00:06 | 118 days 16:11:38 | 9.96 | 6.84 | 8.0 | 4 | 46 |
delivery_pickup | 506 | 469 | 0.01 | 0.13 | 7 days 21:12:17 | 18 days 22:51:54 | 0 days 00:01:34 | 0 days 00:00:06 | 114 days 01:24:06 | 10.51 | 8.06 | 8.0 | 4 | 72 |
main | 5635 | 2385 | 0.14 | 0.64 | 3 days 20:15:36 | 9 days 02:58:23 | 0 days 00:00:07 | 0 days 00:00:00 | 97 days 21:24:23 | 3.00 | 2.94 | 2.0 | 1 | 21 |
path_end | 3751 | 3751 | 0.09 | 1.00 | 9 days 11:15:18 | 23 days 02:52:25 | 0 days 00:01:21 | 0 days 00:00:00 | 149 days 04:51:05 | 9.61 | 9.10 | 7.0 | 2 | 101 |
path_start | 3751 | 3751 | 0.09 | 1.00 | 0 days 00:00:00 | 0 days 00:00:00 | 0 days 00:00:00 | 0 days 00:00:00 | 0 days 00:00:00 | 0.00 | 0.00 | 0.0 | 0 | 0 |
payment_card | 565 | 521 | 0.01 | 0.14 | 6 days 21:42:26 | 17 days 18:52:33 | 0 days 00:01:40 | 0 days 00:00:08 | 138 days 04:51:25 | 12.14 | 7.34 | 10.0 | 6 | 66 |
payment_cash | 197 | 190 | 0.00 | 0.05 | 13 days 23:17:25 | 24 days 00:00:02 | 0 days 00:02:18 | 0 days 00:00:10 | 118 days 16:11:39 | 15.15 | 11.10 | 10.5 | 6 | 74 |
payment_choice | 1107 | 958 | 0.03 | 0.26 | 6 days 12:49:38 | 17 days 02:54:51 | 0 days 00:01:24 | 0 days 00:00:06 | 118 days 16:11:39 | 10.42 | 6.37 | 8.0 | 5 | 53 |
payment_done | 706 | 653 | 0.02 | 0.17 | 7 days 01:37:54 | 17 days 09:10:00 | 0 days 00:01:34 | 0 days 00:00:08 | 115 days 09:18:59 | 13.21 | 8.29 | 11.0 | 6 | 85 |
product1 | 1515 | 1122 | 0.04 | 0.30 | 5 days 23:49:43 | 16 days 04:36:13 | 0 days 00:00:50 | 0 days 00:00:00 | 118 days 19:38:40 | 6.46 | 6.04 | 4.0 | 2 | 62 |
product2 | 2172 | 1430 | 0.05 | 0.38 | 4 days 06:13:24 | 13 days 03:26:17 | 0 days 00:00:34 | 0 days 00:00:00 | 126 days 23:36:45 | 5.32 | 4.51 | 4.0 | 2 | 37 |
If the number of unique events in an eventstream is high,
we can leave events only from the list defined in event_list
parameter.
In the example below we leave the cart
and payment_done
events only as the events of high importance.
We also transpose the output DataFrame for a nicer view.
stream.describe_events()
stream.describe_events(event_list=['payment_done', 'cart']).T
event | cart | payment_done | |
---|---|---|---|
basic_statistics | number_of_occurrences | 2842 | 706 |
unique_users | 1924 | 653 | |
number_of_occurrences_shared | 0.07 | 0.02 | |
unique_users_shared | 0.51 | 0.17 | |
time_to_FO_user_wise | mean | 3 days 08:59:14 | 7 days 01:37:54 |
std | 11 days 19:28:46 | 17 days 09:10:00 | |
median | 0 days 00:00:56 | 0 days 00:01:34 | |
min | 0 days 00:00:01 | 0 days 00:00:08 | |
max | 118 days 16:11:36 | 115 days 09:18:59 | |
steps_to_FO_user_wise | mean | 5.51 | 13.21 |
std | 4.09 | 8.29 | |
median | 4.0 | 11.0 | |
min | 2 | 6 | |
max | 42 | 85 |
Often, such simple descriptive statistics are not enough to deeply understand the time-related values, so we want to see their distribution. For these purposes the following group of methods has been implemented.
Time-based histograms#
User lifetime#
One of the most important time-related statistics is user lifetime. By lifetime we
mean the time distance between the first and the last event represented
in a user’s trajectory. The histogram for this variable is plotted by
user_lifetime_hist()
method.
stream.user_lifetime_hist()
The method has multiple parameters:
timedelta_unit
defines a datetime unit that is used for the lifetime measuring;log_scale
sets logarithmic scale for the bins;lower_cutoff_quantile
,upper_cutoff_quantile
indicate the lower and upper quantiles (as floats between 0 and 1), the values between the quantiles only are considered for the histogram;bins
defines the number of histogram bins. Also can be the name of a reference rule or number of bins. See details in numpy documentation;width
andheight
set figure width and height in inches.
Note
The method is especially useful for selecting parameters to
DropPaths
.
See the user guide on preprocessing for details.
Timedelta between two events#
Previously, we have defined user lifetime as the timedelta between the beginning and the end of a user’s path.
This can be generalized.
timedelta_hist()
method shows a histogram for the distribution of timedeltas between a couple of specified events.
The method supports similar formatting arguments (timedelta_unit
, log_scale
,
lower_cutoff_quantile
, upper_cutoff_quantile
, bins
, width
, height
) as we have already mentioned
in user_lifetime_hist method.
If no arguments are passed (except formatting arguments), timedeltas between all adjacent events are calculated within each user path. For example, this tiny eventstream
generates 4 timedeltas \(\Delta_1, \Delta_2, \Delta_3, \Delta_4\) as shown in the diagram. The timedeltas between events B and D, D and C, C and E are not taken into account because two events from each pair belong to different users.
Here is how the histogram looks for the simple_shop
dataset with log_scale=True
and timedelta_unit='m'
:
stream.timedelta_hist(log_scale=True, timedelta_unit='m')
This distribution of the adjacent events fairly common. It looks like a bimodal (which is not true: remember we use log-scale here), but these two bells help us to estimate a timeout for splitting sessions. From this charts we can see that it is reasonable to set it to some value between 10 and 100 minutes.
Be careful if there are some synthetic events in the data. Usually those events are assigned with the same
timestamp as their “parent” raw events. Thus, the distribution of the timedeltas between
events will be heavily skewed to 0. Parameter raw_events_only=True
can help in such a situation.
Let us add to our dataset some common synthetic events using AddStartEndEvents and
SplitSessions data processors.
stream_with_synthetic = datasets\
.load_simple_shop()\
.add_start_end_events()\
.split_sessions(timeout=(30, 'm'))
stream_with_synthetic.timedelta_hist(log_scale=True, timedelta_unit='m')
stream_with_synthetic.timedelta_hist(
raw_events_only=True,
log_scale=True,
timedelta_unit='m'
)
You can see that on the second plot there is no high histogram bar located at \(\approx 10^{-3}\), so that the second histogram looks more natural.
Another use case for timedelta_hist()
is visualizing the distribution of timedeltas between two specific events. Assume we want to
know how much time it takes for a user to go from product1
to cart
.
Then we set event_pair=('product1', 'cart')
and pass it to timedelta_hist
:
stream.timedelta_hist(event_pair=('product1', 'cart'), timedelta_unit='m')
From the Y scale, we see that such occurrences are not very numerous. This is because the method still works with only
adjacent pairs of events (in this case product1
and cart
are assumed to go one right after
another in a user’s path). That is why the histogram is skewed to 0.
adjacent_events_only
parameter allows us to work with any cases when a user goes from
product1
to cart
non-directly but passing through some other events:
stream.timedelta_hist(
event_pair=('product1', 'cart'),
timedelta_unit='m',
adjacent_events_only=False
)
We see that the number of observations has increased, especially around 0. In other words,
for the vast majority of the users transition product1 → cart
takes less than 1 day.
On the other hands, we observe a “long tail” of the users whose journey from product1
to cart
takes multiple days. We can interpret this as there are two behavioral clusters:
the users who are open for purchases, and the users who are picky. However, we also notice
that adding a product to a cart does not necessarily mean that a user intends to make a
purchase. Sometimes users adds an item to a cart just to check its final price, delivery
options, etc.
Here we should make a stop and explain how timedeltas between event pairs are calculated.
Below you can see the picture of one user path and timedeltas that will be displayed in a timedelta_hist
with the parameters event_pair=('A', 'B')
and adjacent_events_only=False
.
Let us consider each time delta calculation:
\(\Delta_1\) is calculated between ‘A’ and ‘B’ events. ‘D’ and ‘F’ are ignored because of
adjacent_events_only=False
.The next ‘A’ event is colored grey and is skipped because there is one more ‘A’ event closer to the ‘B’ event. In such cases, we pick the ‘A’ event, that is closer to the next ‘B’ and calculate \(\Delta_2\).
Now let us get back to our example. Due to the fact we have a lot of users with short trajectories and a few users with very long paths our histogram is unreadable.
To make entire plot more comprehensible - the log_scale
parameter can be used.
We have already used that parameter for the x axis
, but it is also available fot the y axis
.
For example: log_scale=(False, True)
.
Another way to resolve that problem, is to look separately on different parts of our plot.
For that purpose we can use parameters lower_cutoff_quantile
and upper_cutoff_quantile
.
These parameters specify boundaries for the histogram and will be applied last.
In the example below, firstly, we keep users with event_pair=('product1', 'cart')
and adjacent_events_only=False
, and after it we truncate 90% of users with the shortest
trajectories and keep 10% of the longest.
stream.timedelta_hist(
event_pair=('product1', 'cart'),
timedelta_unit='m',
adjacent_events_only=False,
lower_cutoff_quantile=0.9
)
Here it is the same algorithm, but 10% of users with the shortest trajectories will be kept.
stream.timedelta_hist(
event_pair=('product1', 'cart'),
timedelta_unit='m',
adjacent_events_only=False,
upper_cutoff_quantile=0.1
)
If we set both parameters, boundaries will be calculated simultaneously and truncated afterward.
Let us turn to another case. Sometimes we are interested in looking only at events
within a user session. If we have already split the paths into sessions, we can use weight_col='session_id'
:
stream_with_synthetic\
.timedelta_hist(
event_pair=('product1', 'cart'),
timedelta_unit='m',
adjacent_events_only=False,
weight_col='session_id'
)
It is clear now that within a session the users walk from product1
to cart
event in less than 3 minutes.
For frequently occurring events we might be interested in aggregating the timedeltas over sessions or users.
For example, transition main -> catalog
is quite frequent. Some users do these transitions quickly,
some of them do not. It might be reasonable to aggregate the timedeltas over each user path first
(we would get one value per one user at this step), and then visualize the distribution of
these aggregated values. This can be done by passing an additional argument
time_agg='mean'
or time_agg='median'
.
stream\
.timedelta_hist(
event_pair=('main', 'catalog'),
timedelta_unit='m',
adjacent_events_only=False,
weight_col='user_id',
time_agg='mean'
)
Eventstream global events#
event_pair
argument can accept a couple of auxiliary events: eventstream_start
and eventstream_end
.
They indicate the first and the last events in an evenstream.
It is especially useful for choosing left_cutoff
and right_cutoff
parameters for
LabelCroppedPaths
data processor.
Before you choose it, you can explore how a path’s beginning/end margin from the right/left edge of an eventstream.
In the histogram below, \(\Delta_1\) illustrates such a margin for event_pair=('eventstream_start', 'B')
.
Note that here only one timedelta is calculated - from the ‘eventstream_start’ to the first occurrence of specified
event.
\(\Delta_1\) in the following example illustrates a margin for event_pair=('B', 'eventstream_end')
.
And again, only one timedelta per userpath is calculated - from the ‘B’ event (its last occurrence) to the
‘eventstream_end’.
stream_with_synthetic\
.timedelta_hist(
event_pair=('eventstream_start', 'path_end'),
timedelta_unit='h',
adjacent_events_only=False
)
For more details on how this histogram helps to define left_cutoff
and right_cutoff
parameters see
LabelCroppedPaths section in the data processors user guide.
Event intensity#
There is another helpful diagram that can be used for eventstream overview.
Sometimes we want to know how the events are distributed over time. The histogram for this distribution is plotted by
event_timestamp_hist()
method.
stream.event_timestamp_hist()
We can notice the heavy skew in the data towards the period between April and May of 2020.
One of the possible interpretations of this fact is that the product worked in beta version until April 2020,
and afterwards a stable were released so that new users started to arrive much more intense.
event_timestamp_hist
has event_list
argument, so we can check this hypothesis
by choosing path_start
in the event list .
stream\
.add_start_end_events()\
.event_timestamp_hist(event_list=['path_start'])
From this histogram we see that our hypothesis is true. New users started to arrive much more intense in April 2020.
Similar to timedelta_hist()
,
event_timestamp_hist
also has parameters raw_events_only
, upper_cutoff_quantile
,
lower_cutoff_quantile
, bins
, width
and height
that work with the same logic.
Path metrics#
Besides the pre-defined statistics you can calculate any custom metric over paths using the path_metrics()
method. In case of multiple metrics, the method accepts a list of tuples representing a metric and its name, and returns a DataFrame. A metric can be defined as an arbitrary callable function or pandas.NamedAgg to be applied to the grouped evenstream data. Also, some string shortcuts such as len
, has:TARGET_EVENT
, time_to:TARGET_EVENT
are available.
metrics = [
('len', 'path_length'),
('has:cart', 'has_cart'),
('time_to:cart', 'time_to_cart'),
(lambda _df: (_df['event'] == 'cart').sum(), 'cart_count'),
(pd.NamedAgg('timestamp', lambda s: len(s.dt.date.unique())), 'active_days')
]
stream.path_metrics(metrics).head()
path_length | has_cart | time_to_cart | cart_count | active_days | |
---|---|---|---|---|---|
122915 | 34 | True | 6 days 01:22:39.090422 | 1 | 2 |
463458 | 12 | False | NaT | 0 | 1 |
1475907 | 16 | True | 23 days 13:03:45.213509 | 1 | 2 |
1576626 | 3 | False | NaT | 0 | 1 |
2112338 | 7 | False | NaT | 0 | 1 |