Quick start with Retentioneering#
Retentioneering is a Python library for in-depth analysis of what is commonly called user clickstream. We find the traditional term clickstream to be too constrictive, as user actions may not just be clicks; instead, we use the term event to mean any user action, and eventstream to refer to a set of actions performed by the user. A set of events belonging to a particular user is called user path or user trajectory, and sometimes customer journey map (CJM) is used as a synonym for eventstream.
Each event is tied to the user who experienced it, and a timestamp. Hence, at a basic level, eventstream comprises a set of triples like these:
('user_1', 'login', '2019-01-01 00:00:00'),
('user_1', 'main_page_visit', '2019-01-01 00:00:00'),
('user_1', 'cart_button_click', '2019-01-01 00:00:00'),
...
Any eventstream research consists of three fundamental steps:
Loading data
Preparing the data
Applying Retentioneering tools
This document is a brief overview of how to follow these steps. For more detail, see the User Guides.
Loading data#
This is the introduction to our core class Eventstream, which stores eventstream events and enables you to work with them efficiently.
We have provided a small simple_shop dataset for you to use for demo purposes here, and throughout the documentation.
from retentioneering import datasets
# load sample user behavior data:
stream = datasets.load_simple_shop()
In the shell of eventstream object there is a regular pandas.DataFrame which can be revealed by calling to_dataframe()
method:
stream.to_dataframe().head()
user_id | event | timestamp | |
---|---|---|---|
0 | 219483890 | catalog | 2019-11-01 17:59:13.273932 |
1 | 219483890 | product1 | 2019-11-01 17:59:28.459271 |
2 | 219483890 | cart | 2019-11-01 17:59:29.502214 |
3 | 219483890 | catalog | 2019-11-01 17:59:32.557029 |
4 | 964964743 | catalog | 2019-11-01 21:38:19.283663 |
In this fragment of the dataset, user 219483890
has 4 events with timestamps on the website on 2019-11-01
.
If you are OK with the simple_shop dataset, you can proceed to the next section. Alternatively, you can create an eventstream by uploading your own dataset. It must be represented as a csv-table with at least three columns (user_id
, event
, and timestamp
). Upload your table as a pandas.DataFrame and create the eventstream as follows:
import pandas as pd
from retentioneering.eventstream import Eventstream
# load your own csv
data = pd.read_csv("your_own_data_file.csv")
stream = Eventstream(data)
If the input table columns have different names, either rename them in the DataFrame, or explicitly set data schema (see Eventstream user guide for the instructions). Likewise, if the table has additional custom columns, setting the data schema is also required.
Getting a CSV file with data#
If you use Google Analytics, raw data in the form of {user, event, timestamp} triples can be streamed via Google Analytics 360 or free Google Analytics App+Web into BigQuery. From the BigQuery console, you can run an SQL query and export data into a csv file. Alternatively, you can use the Python BigQuery connector to get directly into the DataFrame. For large datasets, we suggest sampling the users in an SQL query, filtering by the user_id (just add this condition to SQL WHERE statement to get 10% of your users:
and ABS(MOD(FARM_FINGERPRINT(fullVisitorId), 10)) = 0)
Preparing the data#
Raw data often needs to be prepared before analytical techniques are applied. Retentioneering provides a wide range of preprocessing tools that are comprised of elementary parts called “data processors.” With the help of data processors, a product analyst can easily add, delete, or group events, flexibly truncate an eventstream, split the trajectories into sessions, and much more. See the Data processors user guide for a comprehensive description of this Swiss army knife for data processors.
Below is a brief example of how the data processors work.
Suppose you wanted to analyze only the first session of each user, rather than their whole trajectory. Here is how you can do that with just a few lines of code:
# eventstream preprocessing example
stream = stream.split_sessions(timeout=(30, 'm'))
stream\
.filter_events(func=lambda df_, schema: df_['session_id'].str.endswith('_1'))
.to_dataframe()\
.head()
user_id | event | timestamp | session_id | |
---|---|---|---|---|
0 | 219483890 | session_start | 2019-11-01 17:59:13.273932 | 219483890_1 |
1 | 219483890 | catalog | 2019-11-01 17:59:13.273932 | 219483890_1 |
3 | 219483890 | product1 | 2019-11-01 17:59:28.459271 | 219483890_1 |
5 | 219483890 | cart | 2019-11-01 17:59:29.502214 | 219483890_1 |
7 | 219483890 | catalog | 2019-11-01 17:59:32.557029 | 219483890_1 |
At the beginning, we take a stream
variable that contains the eventstream instance created in the previous section. The split_sessions method creates a new column called session_id
, in which values ending with the suffix _<int>
indicate the ordinal number of each user’s session. In the end, we need to leave only those records where session_id
ends with _1
(meaning the first session). This is exactly what the filter method does. We also apply the to_dataframe()
method, which you are already familiar with.
In real life, analytical eventstream research is likely to be branchy. You might want to wrangle an initial eventstream’s data in many ways, check multiple hypotheses, and look at different parts of the eventstream. All of this is easily and efficiently managed using the preprocessing graph. It enables you to keep all the records and code related to the research in a calculation graph. This tool is especially recommended for those who need to share parts of the analytical code with team members. See the Preprocessing user guide for more details.
Applying path analysis tools#
Retentioneering offers many powerful tools for exploring the behavior of your users, including transition graphs, step matrices, step Sankey diagrams, funnels, cluster, and cohort analysis. A brief demo of each is presented below. For more details, see the user guides.
Transition graph#
Transition graph is an interactive tool that shows how many users jump from one event to another. It represents user paths as a Markov random walk model. The graph is interactive: you can drag the graph nodes, zoom in and out of the graph layout, or use a control panel on the left edge of the graph. The transition graph also allows you to highlight the most valuable nodes, and hide noisy nodes and edges.
stream.transition_graph()
See Transition graph user guide for a deeper understanding of this tool.
Step matrix#
The step matrix provides a stepwise look at CJM. It shows the event distribution with respect to a step ordinal number.
stream.step_matrix(
max_steps=16,
threshold=0.2,
centered={
'event': 'cart',
'left_gap': 5,
'occurrence': 1
},
targets=['payment_done']
)
The step matrix above is centered by cart
event. For example, it shows (see column -1
) that the events in the user trajectories one step before cart
event are distributed as follows: 60% of the users have catalog
event right before cart
, 24% of the users have product2
event, and 16% of the users are distributed among 5 events which are folded to an artificial THRESHOLDED_5
event.
See Step matrix user guide user guide for a deeper understanding of this tool.
Transition matrix#
Transition matrix is similar to transition graph, but it displays the edge weights as a table. It is especially useful for comparing two groups of users. For example, below we compare how the first user session differs from the other sessions. We create a session_count
segment indicating the session number and then compare 1
and _OUTER_
segment values in the transition matrix. See the Transition matrix user guide and the Segments user guide for more details.
def session_count_segment(df):
df['session_count'] = df['session_id'].str.split('_').str[1]
return df['session_count']
stream = stream.add_segment(session_count_segment, 'session_count')
stream.transition_matrix(norm_type='node', groups=['session_count', '1', '_OUTER_'])
For example, we see from the diagram that in the second and later sessions the users start their paths from the main
event a way more often: the difference in session_start -> main
transition is -0.46.
Step Sankey diagram#
The step Sankey diagram is similar to the step matrix. It also shows the event distribution with respect to step number. However, it has some more advanced features:
it explicitly shows the user flow from one step to another; and
it is interactive.
stream.step_sankey(max_steps=6, threshold=0.05)
See step Sankey user guide for a deeper understanding of this tool.
Cluster analysis#
features = stream.extract_features(feature_type='count', ngram_range=(1, 1))
stream = stream.get_clusters(
method='kmeans',
n_clusters=8,
X=features,
segment_name='kmeans_clusters'
)
Users with similar behavior are grouped into clusters. Once the clusters are obtained, you can explore them looking at a heatmap table that highlights the most valuable features or custom metrics that make a cluster unique.
custom_metrics = [
('segment_size', 'mean', 'segment size'),
('has:cart', 'mean', 'Conversion rate: cart'),
('has:payment_done', 'mean', 'Conversion rate: payment_done'),
]
stream.clusters_overview(
'kmeans_clusters',
features,
aggfunc='mean',
metrics=custom_metrics
)
From this overview we immediately see that cluster 7 has the highest feature usage rates and conversion to purchase
event over all clusters yet the smallest size (0.3% of all users). On the other hand, we see that super low activity clusters 0, 1, 4 take 74% of all users which might indicate a systematic problem in the product.
See the clusters user guide for a deeper understanding of this tool.
Funnel analysis#
Building a conversion funnel is a basic part of much analytical research. Funnel is a diagram that shows how many users sequentially walk through specific events (funnel stages) in their paths. For each stage event, the following values are calculated:
absolute unique number of users who reached this stage at least once;
conversion rate from the first stage (% of initial); and
conversion rate from the previous stage (% of previous).
stream.funnel(stages=['catalog', 'cart', 'payment_done'])
See Funnel user guide for a deeper understanding of this tool.
Cohort analysis#
Cohorts is a powerful tool that shows trends of user behavior over time. It helps to isolate the impact of different marketing activities, or changes in a product for different groups of users.
Here is an outline of the cohort matrix calculation:
Users are split into groups (
CohortGroups
) depending on the time of their first appearance in the eventstream; andThe retention rate of the active users is calculated in each period (
CohortPeriod
) of the observation.
stream.cohorts(
cohort_start_unit='M',
cohort_period=(1, 'M'),
average=False,
)
See Cohorts user guide for a deeper understanding of this tool.
Sequence analysis#
Sequences tool aims to calculate frequency statistics regarding each particular n-gram represented in an eventstream. It supports group comparison.
stream.sequences(
ngram_range=(2, 3),
threshold=['count', 1200],
sample_size=3
)
See Sequences user guide for the details.