Step matrix#
The following user guide is also available as Google Colab notebook.
Step matrix definition#
Step matrix is a powerful tool in the Retentioneering arsenal. It allows and getting a quick high-level understanding of user behavior. The step matrix features powerful customization options to tailor the output depending on the goal of the analysis.
To better understand how the step matrix works, let us first consider an intuitive example. Assume we have an eventstream as follows:
We can visualize this dataset as a step-wise heatmap, indicating the distribution of the events appeared at a specific step:
import pandas as pd
from retentioneering.eventstream import Eventstream
simple_example = pd.DataFrame(
[
['user1', 'main', 0],
['user2', 'main', 0],
['user3', 'main', 0],
['user4', 'catalog', 0],
['user1', 'catalog', 1],
['user3', 'catalog', 1],
['user4', 'product', 1],
['user1', 'product', 2],
['user3', 'catalog', 2],
['user4', 'main', 2],
['user1', 'cart', 3],
['user3', 'product', 3],
['user4', 'catalog', 3],
['user1', 'catalog', 5],
['user3', 'cart', 5],
['user3', 'order', 6]
],
columns=['user_id', 'event', 'timestamp']
)
Eventstream(simple_example)\
.step_matrix(max_steps=7);
The matrix rows correspond to the unique events, and the columns correspond to the steps in the user
trajectories. That is, (i, j)
matrix element shows the share of the users with event i
appeared at step j
.
Hereafter we use simple_shop dataset, which has already been converted to Eventstream and assigned to stream
variable. If you want to use your own dataset, upload it following this instruction.
from retentioneering import datasets
stream = datasets.load_simple_shop()
Basic example#
The primary way to visualize a step matrix is to call Eventstream.step_matrix()
method. Here is how it is applied to simple_shop
eventstream:
stream.step_matrix(max_steps=12)
As we can see, the sum of the values in the matrix columns is 1 (i.e. at each step). Looking at the first column we can say that the users start their sessions from events catalog
(72%) and main
(28%). Also, we notice that payment_done
event, which might be considered as an event of interest, appears in the trajectories no earlier than at the 7th step (row payment_done
has zeros until step 7).
Terminating event#
As you may have noticed, the step matrix above has ENDED
event which is located in the last row, whereas this event is not represented in the simple_shop
eventstream. ENDED
is a special synthetic event that explicitly indicates a trajectory’s end. If a user’s path is shorter than max_steps
parameter, then ENDED
event is padded to the path so that it becomes exactly of length max_steps
. With this behavior, the sum of the user fractions over each column (i.e each step) is exactly 1. Essentially, ENDED
row represents the cumulative share of leaving users. The event exists in scope of step matrix only, so that it does not affect the sourcing eventstream at all.
Collapsing rare events#
In a typical scenario, it can be useful to hide rare events in a step matrix, not removing them from the step matrix calculation. If we remove them, the matrix values will be distorted. This behaviour is controlled by the threshold
argument. An event is considered as rare if its maximum frequency over all the steps is less than threshold
. All such rare events are not removed from the matrix, but instead collapsed to thresholded_N
artificial event, where N
stands for the number of the collapsed events. The thresholded_N
event appears in step matrix only, and is not added to the sourcing eventstream.
Let us look how the events are collapsed if we set threshold=0.05
, and compare the result with the previous step matrix (which had the default threshold=0
parameter).
stream.step_matrix(max_steps=16, threshold=0.05)
Now, we see that all 6 rare events are hidden and grouped together in the THRESHOLDED_6
row. We also notice that THRESHOLDED_6
event contains delivery_courier
, delivery_pickup
, payment_cash
, payment_card
, payment_done
, and payment_choice
events. Let us check why, say, the payment_choice
event has been collapsed. In the previous step matrix we see that at step 5 this event contains 3% of the users, 4% at step 6, and 3% at step 7, etc. Since the maximum value (4%) is less than
threshold=0.05
, the event is collapsed.
Note that the number _6
in THRESHOLDED_6
event name contains no information on specific steps. For example, from the matrix with threshold=0
we see that at step 4 only one event among these 6 is represented (delivery_courier
), so it is the only event that is collapsed at this step. On the other hand, at step 5 delivery_pickup
and payment_choice
appear, so they are collapsed to the THRESHOLDED_6
event. Finally, at step 7, all these 6 events are collapsed.
You can use the target
parameter if you want to prevent some events from the collapsing.
Target events analysis#
It is common that some events are more important than the others, so we want to pay attention to them.
This includes such events as adding an item to the cart, order confirmation, payment, etc. Such events often have much lower occurrence rate compared to other events (like visiting main page or catalog). As a result, they are collapsed to the THRESHOLDED_N
event. Even if they are not, it would be worth highlighting them in the diagram at putting them in separate place. This can be done with the targets
parameter:
stream.step_matrix(
max_steps=16,
threshold=0.05,
targets=['payment_done']
)
Specified target events are always shown at the bottom of step matrix regardless of the selected threshold. As we have chosen the payment_done
event as a target, the row with payment_done
has been moved at the bottom of the matrix and now has its own color palette.
Multiple targets are also supported:
stream.step_matrix(
max_steps=16,
threshold=0.05,
targets=['product1', 'cart', 'payment_done']
)
Now we have selected three target events: product1
, cart
, payment_done
, so we can see them at the bottom of the diagram. Each of them has its own palette and color scaling.
If we want to compare some target events and plot them using the same color scaling, we can combine them in a sub-list inside the targets list:
stream.step_matrix(
max_steps=16,
threshold=0.05,
targets=['product1', ['cart', 'payment_done']]
)
With the colors defined in this way, we can compare how many users reached cart
vs payment_done
at particular step in their trajectories.
Targets can be presented as accumulated values with accumulated
parameter. Meaning that, we can display the cumulative shares of the users having this event at each step. The corresponding row names start with ACC_
prefix. There are two options for displaying these rows:
accumulated='only'
display rows with accumulated values only;accumulated='both'
display rows with both accumulated and not accumulated values.
The step matrix below demonstrates accumulated='only'
option:
stream.step_matrix(
max_steps=16,
threshold=0.05,
targets=['product1', ['cart', 'payment_done']],
accumulated='only'
)
In comparison with the previous step matrix, at the bottom we see three rows ACC_product1
, ACC_cart
, ACC_payment_done
instead of product1
, cart
, and payment_done
. Now, let us show how accumulated='both'
option works.
stream.step_matrix(
max_steps=16,
threshold=0.05,
targets=['product1', ['cart', 'payment_done']],
accumulated='both'
)
Above, we see two target blocks: one is with accumulated values, another one with the original values.
Centered step matrix#
Sometimes we are interested in the flow of users through a specific event to answer such questions as how do users reach a specific event and what do they do afterwards? This information can be visualized with the centered
parameter:
stream.step_matrix(
max_steps=16,
threshold=0.2,
centered={
'event': 'cart',
'left_gap': 5,
'occurrence': 1
}
)
The centered
parameter is a dictionary that requires three keys:
event
: name of the event we focus on. Reaching this event is associated with step 0. Negative step numbers correspond to the events occurred before the selected event. Positive step numbers correspond to the events occurred after the selected event;left_gap
: integer number that indicates how many steps before the centered event we want to show in the step matrix;occurrence
: the occurrence number of the target event to trigger thecentered
parameter. For example, in the coding example above, all the trajectories will be aligned to have the firstcart
occurrence as step 0.
Importantly, when the centered
parameter is used, only the users who have centered['event']
occurred at list centered['occurrence']
times are considered. The share of such users with respect to all the users from the eventstream is represented in the diagram’s title. In the example above, 51.3% of the users reached the event cart
at least once.
Another property of step matrix is that at step 0 column we always have zeros at any row except the row that relates to the centering event: at that row there is always 1.
To better understand the meaning of the occurrence
parameter, let us calculate another step matrix. This time with occurrence=2
:
stream.step_matrix(
max_steps=16,
threshold=0.2,
centered={
'event': 'cart',
'left_gap': 5,
'occurrence': 2
}
)
Here we can see that the proportion of the users whose steps are considered in our matrix has noticeably decreased. Now it is 15.2%, because we are evaluating the second occurrence of the cart
event, which
means we are considering the users who had this event at least twice.
A combination of targets
and centered
parameters is also possible:
stream.step_matrix(
max_steps=16,
threshold=0.2,
centered={
'event': 'cart',
'left_gap': 5,
'occurrence': 1
},
targets=['payment_done']
)
From the step matrix above, we see that the maximum in the target row appear at step 5 (with the value of 0.22). We can interpret this as follows: if a user reaches the cart
event and makes a purchase afterwards, it is likely that it took them 5 steps.
Events sorting#
By default, rows of the step matrix are sorted in the following order:
Original events by the order of their first appearance in the eventstream;
ENDED
event;THRESHOLDED
events;target events.
Sometimes, it is needed to obtain a step matrix with events ranked in a specific order - for example, when you compare two step matrices. This can be done with the sorting
parameter that accepts a list of event names in the required order to show up in the step matrix. Here is an example:
stream.step_matrix(max_steps=16, threshold=0.07)
We pass the following list ofr the events to the sorting
parameter:
custom_order = [
'main',
'catalog',
'product1',
'product2',
'cart',
'THRESHOLDED_7',
'ENDED'
]
stream.step_matrix(
max_steps=16,
threshold=0.07,
sorting=custom_order
)
Note
It is convenient to modify the order of the event list with the help of StepMatrix.values
property. See here for the details.
Note
The custom ordering affects non-target events only. Target events are always located at the bottom, and they are sorted in same order as they are specified in the targets
parameter.
Differential step matrix#
Definition and general usage#
Sometimes we would like to compare behaviors of multiple groups of users - for example, the users who had a target event versus those who had not, or test and control groups in an A/B test.
Suppose we have two abstract groups of the users: g1
and g2
. Let g1
consists of the users who had the payment_done
event, and g2
- who had not. Suppose also that M1
and M2
are the step matrices that calculated for the groups g1
and g2
correspondingly. So we want to compare behaviours of the users from g1
and g2
groups. In this case, it is reasonable to calculate a new step matrix as difference between M1
and M2
.
groups
parameter is responsible for differential step matrix plotting. It requires a collection of two user lists related to two user groups. Each list should be represented as a collection of user ids.
In the example below we demonstrate how the groups
parameter works. We also choose cart
as a central event, because usually it is closely followed by a purchase or user disappearance.
stream_df = stream.to_dataframe()
g1 = set(stream_df[stream_df['event'] == 'payment_done']['user_id'])
g2 = set(stream_df['user_id']) - g1
stream.step_matrix(
max_steps=16,
threshold=0.05,
centered={
'event': 'cart',
'left_gap': 5,
'occurrence': 1
},
groups=(g1, g2)
)
According to the step matrix definition, the values that are close to 0 mean that the corresponding values in the original matrices M1
and M2
are roughly equal. Large positive/negative value indicates that the corresponding value in M1
matrix is much greater/less than the corresponding number in M2
matrix. As a result, the step matrix heatmap highlights the cells where the difference is big.
For example, from the step matrix above we see that the values to the left from the central event cart
are close to zero. It means that the behavior of users in the two groups is roughly the same. However, to the right of the cart
event large positive and negative values appear. The positive values relate to such events as payment_done
, payment_choice
, or payment_choice
. The users from g2
must have no payment_done
event at all due to the group definition. As for the last two mentioned events, they relate to the payment process, so it is not a surprise that the users from the g2
experiences these events group less often than the users from g1
group.
Note that the values in each column of a differential step matrix are always sum up to 0, since the columns in both M1
and M2
matrices always sum up to 1. That is its fundamental property.
Cluster analysis#
Consider another example of differential step matrix usage. Now we will compare behaviors within two user clusters which are obtained by applying the Clusters tool. As before, we focus the analysis on payment_done
and event cart
events.
from retentioneering.tooling.clusters import Clusters
clusters = Clusters(eventstream=stream)
features = clusters.extract_features(feature_type='count', ngram_range=(1, 1))
clusters.fit(method='kmeans', n_clusters=8, X=features, random_state=42)
clusters.plot(targets=['payment_done', 'cart']);
So we have defined 8 clusters. The diagram above shows the distribution of the conversion rate to the target events (payment_done
and cart
) among the clusters. Suppose we are interested in how clusters #1 and #3 differ.
All we need is to get user_id
collections from the cluster_mapping attribute and pass it to the groups
parameter of step matrix:
g1 = clusters.cluster_mapping[1]
g2 = clusters.cluster_mapping[3]
stream.step_matrix(
max_steps=16,
threshold=0.05,
centered={
'event': 'cart',
'left_gap': 5,
'occurrence': 1
},
groups=(g1,g2)
)
The differential step matrix shows the difference between clusters #1 and #3. Users from cluster #1, after adding a product to the cart tend to return to the catalog and continue shopping more often or, on the opposite, finish their trajectory. On the other hand, users from cluster #3 tend to fall into payment flow and eventually make purchase. They can also return to the catalog, but in later steps.
Weighting step matrix values#
So far, we have been defining step matrix values as the shares of users appearing in an eventstream at a certain step. However, sometimes it is reasonable to calculate similar fractions not over users, but over some other entities as well - typically, over user sessions.
To demonstrate how to do this, we need to split the eventstream into the sessions at first with the help of SplitSessions data processor
. Let session timeout be 30 minutes.
stream_with_sessions = stream.split_sessions((30, 'm'))
Step matrix shares the same mechanism of weighting that is used in transition graph. weight_col
parameter accepts a name of the weighting column in the eventstream. In our case, we pass session_id
value.
stream_with_sessions.step_matrix(max_steps=16, weight_col='session_id')
For example, cart
value at step 3 is 0.05 which means that at step 3 only 5% of the sessions had cart
event.
Let us compare the result with the user-weighted matrix:
stream_with_sessions.step_matrix(max_steps=16, weight_col='user_id'))
Now, we can see the difference between these two types of weighting. The number of unique sessions is greater than the number of unique users, so the proportion of the cart
event at the third step when
normalizing by users is higher than for sessions (0.09 vs 0.05).
Using a separate instance#
By design, Eventstream.step_matrix()
is a shortcut method that uses StepMatrix
class under the hood. This method creates an instance of StepMatrix class and embeds it into the eventstream object. Eventually, Eventstream.step_matrix()
returns exactly this instance.
Sometimes it is reasonable to work with a separate instance of StepMatrix class. An alternative way to get the same visualization that Eventstream.step_matrix()
produces is to call StepMatrix.fit()
and StepMatrix.plot()
methods explicitly.
Here is an example how you can manage it:
from retentioneering.tooling.step_matrix import StepMatrix
step_matrix = StepMatrix(stream)
step_matrix.fit(max_steps=12, targets=['payment_done'])
step_matrix.plot()
Common tooling properties#
values#
StepMatrix.values
property returns the values underlying recent StepMatrix.plot()
call. The property is common for many retentioneering tools. It allows you to avoid unnecessary calculations if the tool object has already been fitted.
Two pandas.DataFrame objects are returned: one for the step matrix, another one for the additional targets block.
stream.step_matrix(
max_steps=6,
targets=['product1', ['cart', 'payment_done']],
show_plot=False
).values[0]
1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|
catalog | 0.716076 | 0.445215 | 0.384164 | 0.310051 | 0.251400 | 0.211677 |
main | 0.283924 | 0.162357 | 0.121834 | 0.094108 | 0.085311 | 0.079712 |
cart | 0.000000 | 0.089843 | 0.109571 | 0.080778 | 0.064783 | 0.047454 |
delivery_choice | 0.000000 | 0.000000 | 0.054119 | 0.061584 | 0.049054 | 0.034391 |
payment_choice | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.033591 | 0.043455 |
product1 | 0.000000 | 0.070115 | 0.045055 | 0.042655 | 0.031991 | 0.025860 |
payment_done | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.003999 |
payment_card | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.017595 |
delivery_pickup | 0.000000 | 0.000000 | 0.000000 | 0.014396 | 0.016796 | 0.015463 |
delivery_courier | 0.000000 | 0.000000 | 0.000000 | 0.025327 | 0.032791 | 0.024793 |
payment_cash | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.004799 |
product2 | 0.000000 | 0.114370 | 0.065849 | 0.057851 | 0.045854 | 0.035724 |
ENDED | 0.000000 | 0.118102 | 0.219408 | 0.313250 | 0.388430 | 0.455079 |
stream.step_matrix(
max_steps=6,
targets=['product1', ['cart', 'payment_done']],
show_plot=False
).values[1]
1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|
product1 | 0.0 | 0.070115 | 0.045055 | 0.042655 | 0.031991 | 0.025860 |
cart | 0.0 | 0.089843 | 0.109571 | 0.080778 | 0.064783 | 0.047454 |
payment_done | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.003999 |
params#
StepMatrix.params
property returns the StepMatrix parameters that was used in the last StepMatrix.fit()
call.
stream.step_matrix(
max_steps=6,
targets=['product1', ['cart', 'payment_done']],
show_plot=False
).params
{'max_steps': 6,
'weight_col': 'user_id',
'precision': 2,
'targets': ['product1', ['cart', 'payment_done']],
'accumulated': None,
'sorting': None,
'threshold': 0,
'centered': None,
'groups': None}