Transition graph#
Loading data#
Throughout this guide we use our demonstration simple_shop dataset. It has already been converted to an Eventstream and assigned to the stream
variable. If you want to use your own dataset, upload it following this instruction.
from retentioneering import datasets
stream = datasets.load_simple_shop()
A basic example#
The transition graph is a weighted directed graph that illustrates how often the users from an eventstream move from one event to another. The nodes stand for the unique events. A pair of nodes (say, A
and B
) is connected with a directed edge if the transition A → B
appeared at least once in the eventstream. Transition means that event B
appeared in a user path right after event A
. For example, in path A, C, B
there is no transition A → B
since event C
stands between A
and B
.
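The definition of a transition can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not Retentioneering's implementation; the helper name transitions is hypothetical.

```python
def transitions(path):
    """Return (source, target) pairs for events that are adjacent in a path."""
    return list(zip(path, path[1:]))


path = ["A", "C", "B"]
pairs = transitions(path)

print(pairs)                # [('A', 'C'), ('C', 'B')]
print(("A", "B") in pairs)  # False: event C stands between A and B
```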
Each node and edge is associated with its weight. Roughly speaking, the weights are the numbers that reflect a node or an edge frequency. They might be calculated in multiple ways (see Weights section).
The primary way to build a transition graph is to call Eventstream.transition_graph()
method:
stream.transition_graph()
According to the transition graph definition, we see here the events represented as nodes connected with the edges. By default, the nodes and edges weights are the number of unique users who experienced the corresponding event or transition. All the edges are labeled with these numbers in the graph. For example, among the others, we can see that there are 1324 unique users who had catalog → cart
transitions, 603 users with main → main
self-transitions, and there were none with product1 → payment_done
transitions. The thickness of the edges and the size of the nodes are proportional to their weights.
The graph is interactive. You can move the nodes, zoom in/out the chart, and finally reveal or hide a control panel by clicking on the left edge of the chart. You can check the interactive features out even in the transition graphs embedded in this document.
Transition graph parameters#
Weights#
Edge weights calculation#
The edge weight values are controlled by edges_norm_type
and edges_weight_col
parameters of Eventstream.transition_graph()
method.
Let us start from the explanation of the configuration edges_norm_type=None
and edges_weight_col='event_id'
which means that no normalization is needed and event_id
column is used as a weighting column (we will explain the concept of weighting columns below). This combination defines edge weight as the number of the transitions associated with the edge in the entire eventstream.
By weight normalization we mean dividing the transition counts (calculated for edges_norm_type=None
case) by some denominator, so we get rational weights instead of integer. Except None
, two normalization types are possible: full
and node
. Full normalization defines the denominator as the overall number of the transitions in the eventstream. Node normalization works as follows. Consider a hypothetical A → B
transition. To normalize the weight of this edge we need to divide the number of A → B
transitions by the total number of the transitions coming out of A
node. In other words, the node-normalized weight is essentially the probability that a user who is at event A moves next to event B.
Now, let us move to the weighting column definition. In many cases it is reasonable to count the number of unique users or sessions instead of the number of transitions. This behavior is controlled by the edges_weight_col
parameter. By default, edges_weight_col='event_id'
that is associated with the number of the transitions. You can also pass the names of the columns related to users or sessions in the eventstream. Typically they are user_id
and session_id
, but to be sure, check your eventstream data schema and session_col
parameter in the SplitSessions data processor
if you used it.
Having edges_weight_col
defined allows you to calculate the weights as the number of unique values represented in edges_weight_col
column. This also relates to full
and node
normalization types. For example, edges_norm_type='full'
and edges_weight_col='user_id'
configuration means that we divide the number of the unique users who had a specific transition by the number of the unique users in the entire eventstream.
A simplified example#
In order to check whether you understand these definitions correctly, let us consider a simplified example and look into the matter of the edge weights calculation. Suppose we have the following eventstream:
user1: A, B, A, C, A, B
user2: A, B, C, C, C
user3: C, D, C, D, C, D
This eventstream consists of 3 unique users and 4 unique events. The event colors denote sessions (there are 6 sessions). We ignore the timestamps since the edge weights calculation does not take them into account. Note that throughout this example we will suppress the edges_
prefix for the edges_norm_type
and edges_weight_col
.
Table 1 describes how the edge weights are calculated in case of weight_col='event_id'
.
So we have 8 unique edges in total. At first, we calculate for each edge the total number of such transitions occurred in the eventstream. As a result, we get the values in norm_type=None
column. Next, we estimate the total number of the transitions in the eventstream: 14. To get the weights in norm_type='full'
column, we divide the weights in norm_type=None
column by 14. Finally, we estimate that we have 4, 2, 6, 2 transitions starting from event A
, B
, C
, and D
correspondingly. Those are the denominators for norm_type='node'
column. To calculate the weights for this option, we divide the values in norm_type=None
by these denominators.
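The event_id calculations described above can be reproduced with a short plain-Python sketch. It assumes the three paths of the example eventstream and does not use Retentioneering itself; the variable names are illustrative only.

```python
from collections import Counter

# The example eventstream from this section (timestamps and sessions omitted).
paths = {
    "user1": ["A", "B", "A", "C", "A", "B"],
    "user2": ["A", "B", "C", "C", "C"],
    "user3": ["C", "D", "C", "D", "C", "D"],
}

# norm_type=None: raw transition counts per edge.
counts = Counter(
    pair for path in paths.values() for pair in zip(path, path[1:])
)

# norm_type='full': divide by the total number of transitions.
total = sum(counts.values())
full = {edge: n / total for edge, n in counts.items()}

# norm_type='node': divide by the number of transitions leaving the source node.
outgoing = Counter()
for (src, _dst), n in counts.items():
    outgoing[src] += n
node = {edge: n / outgoing[edge[0]] for edge, n in counts.items()}

print(len(counts), total)  # 8 14
print(dict(outgoing))      # {'A': 4, 'B': 2, 'C': 6, 'D': 2}
print(node[("A", "B")])    # 0.75
```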
The calculation of the edge weights for weight_col='user_id'
is described in Table 2.
Now, for norm_type=None
option we calculate the number of unique users who had a specific transition. For norm_type='full'
the denominator is 3 as the total number of users in the eventstream. As for norm_type='node'
option, we have 2, 2, 3, 1 unique users who experienced A → *
, B → *
, C → *
, D → *
transitions. These values comprise the denominators. Again, to get the weights in the column norm_type='node'
, we divide the values from norm_type=None
column by these corresponding denominators.
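As a cross-check of the user_id case, the sketch below counts unique users per edge and per source node for the same example eventstream. It is plain Python under the assumptions of this example, not the library's code.

```python
from collections import defaultdict

paths = {
    "user1": ["A", "B", "A", "C", "A", "B"],
    "user2": ["A", "B", "C", "C", "C"],
    "user3": ["C", "D", "C", "D", "C", "D"],
}

# Unique users behind each edge, and unique users with any transition
# leaving each node (A → *, B → *, ...).
edge_users = defaultdict(set)
source_users = defaultdict(set)
for user, path in paths.items():
    for src, dst in zip(path, path[1:]):
        edge_users[(src, dst)].add(user)
        source_users[src].add(user)

# norm_type=None, 'full' and 'node' for weight_col='user_id'.
none_w = {edge: len(users) for edge, users in edge_users.items()}
full_w = {edge: n / len(paths) for edge, n in none_w.items()}
node_w = {edge: n / len(source_users[edge[0]]) for edge, n in none_w.items()}

print({src: len(users) for src, users in sorted(source_users.items())})
# {'A': 2, 'B': 2, 'C': 3, 'D': 1}
print(none_w[("A", "B")], node_w[("A", "B")])  # 2 1.0
```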
Finally, in Table 3 we demonstrate the calculations for weight_col='session_id'
.
In comparison with the case for user_id
weight column, there are some important differences. Transitions B → A
, C → A
, B → C
are excluded since they are terminated by the session endings (their weights are zeros). As for the other transitions, we calculate the number of unique sessions they belong to. This is how we get norm_type=None
column. The total number of the sessions in the eventstream is 6. This is the denominator for norm_type='full'
column. The denominators for norm_type='node'
column are calculated as the number of the unique sessions with A → *
, B → *
, C → *
, and D → *
transitions. They are 4, 0, 2, and 1 correspondingly. Note that for B → A
and B → C
edges we have indeterminate form 0/0, since we have excluded all the transitions starting from B
. We define the corresponding weights as 0. Also, the denominator for C → *
edges is 2, not 3 since we have excluded one C → A
transition.
Node weights#
Besides edge weights, a transition graph also has node weights that control the diameters of the nodes. Unfortunately, so far only one option is supported: norm_type=None
along with weighting columns. By default, weight_col='user_id'
.
If you want to know how the node weights for norm_type='full'
are calculated, expand the following text snippet:
Obviously, node weights do not support norm_type='node'
since it involves edges by design. However, norm_type=None
and norm_type='full'
options might be calculated. They leverage the same calculation logic as we used for the edge weights calculation.
We explain this logic using the same example eventstream.
So for norm_type=None
option the node weights are simply the counters of the events over the entire eventstream (in case of weight_col='event_id'
) or the number of unique users or sessions (in case of weight_col='user_id'
or weight_col='session_id'
) that had a specific event. For norm_type='full'
we divide the non-normalized weights by either the overall number of events (17), or the number of unique users (3), or the number of unique sessions (6). See the calculations for each of the described cases in Table 4, Table 5, and Table 6 below:
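The node weights for the event_id and user_id weighting columns can be sketched in plain Python as follows. The session_id case is omitted because it needs the session boundaries, which this sketch does not model.

```python
from collections import Counter

paths = {
    "user1": ["A", "B", "A", "C", "A", "B"],
    "user2": ["A", "B", "C", "C", "C"],
    "user3": ["C", "D", "C", "D", "C", "D"],
}

# weight_col='event_id', norm_type=None: how many times each event occurs.
event_counts = Counter(event for path in paths.values() for event in path)
# weight_col='user_id', norm_type=None: how many unique users had each event.
user_counts = Counter(event for path in paths.values() for event in set(path))

total_events = sum(event_counts.values())
total_users = len(paths)

# norm_type='full': divide by the overall number of events or users.
event_full = {event: n / total_events for event, n in event_counts.items()}
user_full = {event: n / total_users for event, n in user_counts.items()}

print(total_events, total_users)          # 17 3
print(sorted(event_counts.items()))       # [('A', 4), ('B', 3), ('C', 7), ('D', 3)]
print(sorted(user_counts.items()))        # [('A', 2), ('B', 2), ('C', 3), ('D', 1)]
```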
Setting the weight options#
Finally, we demonstrate how to set the weighting options for a graph. As it has been discussed, edges_norm_type
argument accepts None
, full
or node
values. A weighting column is set by edges_weight_col
argument. Below is a table that summarizes the definitions of edge weights when these two arguments are used jointly.
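edges_norm_type → edges_weight_col ↓ | None | full | node |
---|---|---|---|
event_id | The total number of the A → B transitions. | The total number of the A → B transitions divided by the total number of transitions in the eventstream. | The total number of the A → B transitions divided by the total number of transitions starting from A. |
None or user_id | The total number of the unique users who had the A → B transition. | The total number of the unique users who had the A → B transition divided by the total number of unique users in the eventstream. | The total number of the unique users who had the A → B transition divided by the number of unique users who had any A → * transition. |
session_id | The total number of the unique sessions where the A → B transition occurred. | The total number of the unique sessions where the A → B transition occurred divided by the total number of unique sessions in the eventstream. | The total number of the unique sessions where the A → B transition occurred divided by the number of unique sessions where any A → * transition occurred. |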
Here is an example of using these arguments:
stream.transition_graph(
    edges_norm_type='node',
    edges_weight_col='user_id'
)
From this graph we see, for example, that being at product1
event, 62.3% of the users transit to catalog
event, 43.3% - to cart
event, and 11.4% - to main
event. As you can see, when normalization is used, the values do not necessarily sum to 1. This happens because a user can be at the product1
state multiple times, so they can jump to multiple of these three events.
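The same effect shows up on the toy eventstream from the simplified example. The sketch below is plain Python with a hypothetical helper, outgoing_weights, and compares node-normalized weights of the edges leaving A when counting transitions versus unique users.

```python
paths = {
    "user1": ["A", "B", "A", "C", "A", "B"],
    "user2": ["A", "B", "C", "C", "C"],
    "user3": ["C", "D", "C", "D", "C", "D"],
}


def outgoing_weights(source, unique_users):
    """Node-normalized weights of the edges leaving `source`."""
    hits = [
        (user, dst)
        for user, path in paths.items()
        for src, dst in zip(path, path[1:])
        if src == source
    ]
    if unique_users:
        hits = set(hits)  # count each user at most once per edge
        denominator = len({user for user, _dst in hits})
    else:
        denominator = len(hits)
    weights = {}
    for _user, dst in hits:
        weights[dst] = weights.get(dst, 0) + 1 / denominator
    return weights


# Counting transitions: A → B is 3/4 and A → C is 1/4.
print(sum(outgoing_weights("A", unique_users=False).values()))  # 1.0
# Counting users: A → B is 2/2 and A → C is 1/2, since user1 did both.
print(sum(outgoing_weights("A", unique_users=True).values()))   # 1.5
```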
Thresholds#
The weights that we have discussed above reflect the importance of the edges and the nodes. In practice, a transition graph often contains an enormous number of nodes and edges. The threshold mechanism sets the minimal weights for nodes and edges to be displayed on the canvas.
Note that the thresholds may use their own weighting columns both for nodes and for edges independently of those weighting columns defined in edges_weight_col
arguments. So the weights displayed on a graph might be different from the weights that the thresholds use in making their decision for hiding the nodes/edges. Moreover, multiple weighting columns might be used. In this case, the decision whether an item (a node or an edge) should be hidden is made applying logical OR: an item is hidden if it does not meet any threshold condition.
Also note that, by default, if all the edges connected to a node are hidden, the node becomes hidden as well. You can turn this option off with the show_nodes_without_links flag described in the Graph settings section.
The thresholds are set with the nodes_threshold
and edges_threshold
parameters. Each parameter is a dictionary. The keys are weighting column names, the values are the threshold values. A scalar value defines a minimal weight for the corresponding item to be displayed. Also, you can define a two-sided threshold with a dictionary containing min
and max
keys.
stream.transition_graph(
    edges_norm_type='node',
    edges_weight_col='user_id',
    edges_threshold={'user_id': 0.12},
    nodes_threshold={'event_id': {'min': 500, 'max': 50000}}
)
This example is an extension of the previous one. We use the same normalization configuration as before. Since we have added an edges threshold of 0.12
for user_id
weighting column, the edge product1
→ main
that we observed in the previous example is hidden now (its weight is 11.4%). As for the nodes threshold, note that event payment_cash
is hidden now (as we can see from the Nodes block in the Control panel, its weight is 197).
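The way a single threshold value is read can be sketched as follows. The helper is_displayed is hypothetical, and treating the boundaries as inclusive is an assumption of this sketch rather than documented behavior.

```python
def is_displayed(weight, threshold):
    """Check a weight against a scalar (minimum) or a {'min', 'max'} threshold.

    Assumption: both boundaries are inclusive.
    """
    if isinstance(threshold, dict):
        low = threshold.get("min", float("-inf"))
        high = threshold.get("max", float("inf"))
        return low <= weight <= high
    return weight >= threshold


# product1 → main has weight 11.4%, below the 0.12 edge threshold.
print(is_displayed(0.114, 0.12))                      # False
# product1 → cart has weight 43.3%, so it stays visible.
print(is_displayed(0.433, 0.12))                      # True
# payment_cash has node weight 197, below the minimum of 500.
print(is_displayed(197, {"min": 500, "max": 50000}))  # False
```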
Color settings#
As we have already mentioned, the graph nodes and edges are often of different importance. Sometimes we need not only to hide unimportant graph elements, but also to highlight the important ones. There are two ways to set the colors of nodes and edges.
The first one is setting targets
parameter. It associates given nodes with one of three target types and the corresponding colors: positive
(green), negative
(red), and source
(orange). As a result, the nodes and all their incoming and outgoing edges are colored. Below is an example of the targets
usage.
stream\
    .transition_graph(
        targets={
            'positive': ['payment_done', 'cart'],
            'negative': 'path_end',
            'source': 'path_start'
        }
    )
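To make the mapping between target types and colors concrete, here is a small plain-Python sketch. The helper node_colors and the constant TARGET_COLORS are hypothetical; they only mirror the color scheme described above and the fact that a target value may be a single event name or a list of names.

```python
# Color scheme described above: positive is green, negative is red, source is orange.
TARGET_COLORS = {"positive": "green", "negative": "red", "source": "orange"}


def node_colors(targets):
    """Map each target node to the color of its target type."""
    colors = {}
    for target_type, nodes in targets.items():
        if isinstance(nodes, str):  # a single event name is allowed
            nodes = [nodes]
        for node in nodes:
            colors[node] = TARGET_COLORS[target_type]
    return colors


targets = {
    "positive": ["payment_done", "cart"],
    "negative": "path_end",
    "source": "path_start",
}
print(node_colors(targets))
# {'payment_done': 'green', 'cart': 'green', 'path_end': 'red', 'path_start': 'orange'}
```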
The second option is to set a color for each node or edge explicitly. Use nodes_custom_colors
and edges_custom_colors
parameters for this. The colors may be set using standard HTML color names or HEX codes. The targets
parameter can be combined with custom colors as well.
nodes_custom_colors = {
    'product1': 'gold',
    'product2': 'gold',
    'cart': 'green'
}
edges_custom_colors = {
    ('path_start', 'catalog'): '#cc29c4',
    ('path_start', 'main'): '#cc29c4',
}
stream\
    .transition_graph(
        nodes_custom_colors=nodes_custom_colors,
        edges_custom_colors=edges_custom_colors,
        targets={'negative': 'path_end'}
    )
Graph settings#
You can set up the following boolean flags:
show_weights. Hide/display the edge weight labels. Default value is True.
show_percents. Display edge weights as percents. Available only if an edge normalization type is chosen. Default value is False.
show_nodes_names. Hide/display the node names. Default value is True.
show_all_edges_for_targets. By default, the threshold filters hide the edges disregarding the node types. In case you have defined target nodes, you usually want to carefully analyze them. Hence, all the edges connected to these nodes are important. This displaying option allows to ignore the threshold filters and always display any edge connected to a target node. Default value is True.
show_nodes_without_links. Setting a threshold filter might remove all the edges connected to a node. Such isolated nodes might be considered as useless. This displaying option hides them in the canvas as well. Default value is True.
show_edge_info_on_hover. By default, a tooltip with an edge info pops up when you mouse over the edge. It might be disturbing for large graphs, so this option suppresses the tooltips. Default value is False.
These flags can be specified as separate arguments as follows:
stream.transition_graph(
    edges_norm_type='node',
    show_weights=True,
    show_percents=True,
    show_nodes_names=True,
    show_all_edges_for_targets=False,
    show_nodes_without_links=False,
    show_edge_info_on_hover=True
)
Control panel#
The control panel is a visual interface that lets you interactively control the transition graph behavior. In some scenarios it even lets you control the underlying eventstream (grouping events, renaming events, including/excluding events).
The control panel consists of 5 blocks: Weights, Nodes, Thresholds, Export, and Settings. By default, all these blocks are expanded. To collapse a block, click the minus sign at its top right corner. To expand it again, click the plus sign.
Warning
All the settings that are tweaked in the Control panel are available only in scope of the current transition graph displayed in the current Jupyter cell. As soon as you run Eventstream.transition_graph()
again, all the settings will be reset to the defaults unless you call the method with particular parameters.
Weights block#
The Weights block contains selectors that choose weighting columns separately for nodes and edges. Unfortunately, so far you cannot choose the normalization type in this interface. The only way to set the normalization type is using the edges_norm_type
argument in Eventstream.transition_graph()
method, as shown in the Setting the weight options section. event_id
weighting column refers to edges_norm_type=None
.
For the nodes only event_id
and user_id
weighting columns are available. The same columns are available for the edges, but additionally the columns that are passed as the edges_weight_col
and custom_weight_cols
arguments of the Eventstream.transition_graph()
are also available.
Nodes block#
The Nodes block enumerates all the unique events represented in the transition graph and lets you perform such operations as grouping, renaming, and switching events.
Note
Nodes switcher requires graph recalculation.
Node item actions#
Each node list item contains the following 4 elements:
Focus icon. If you click it, the graph changes its position in the canvas so the selected node is placed in the center.
Event name. Double click it if you want to rename the node.
The number of the event occurrences in the eventstream.
This switcher hides the node and all the edges connected to the node from the canvas.
Grouping events#
The Control panel interface supports easy and intuitive event grouping. Suppose you want to group product1
and product2
events into one. There are two ways to do this:
Drag & drop method. Drag one node (say, product2) and drop it to product1 node. product1_group event appears which contains events product1 and product2.
Add group method. Click “+ Add group” button, untitled_group appears. Drag & drop all the nodes to be grouped to this group.
A grouping node has a folder icon that triggers the aggregation action. Once you click it, the grouped nodes are merged and the changes are displayed in the transition graph. Recalculation is required to update the node and edge weights.
Note
By recalculation we mean that some additional calculations are required in the backend in order to display the graph state according to the selected options. To recalculate the values, click yellow ⚠️ icon and request the recalculation. Sometimes it is reasonable to do multiple modifications in the control panel, and then call the recalculation at once.
Grouping nodes using drag & drop method.
Grouping nodes using + Add group method.
To rename a grouping node, double click its name and enter a new one. To ungroup the grouped nodes drag & drop the nodes out of the grouping node (or drop it right on the grouping node). As soon as the last event is out, the grouping node disappears.
Note
All the grouping and renaming actions do not affect the initial eventstream due to eventstream immutability property. However, it is possible to export the modified eventstream using the TransitionGraph.recalculation_result attribute.
Thresholds block#
The Thresholds block contains two sliders: one is associated with the nodes, the other with the edges. You can set up a threshold value either by moving a slider or by entering a value explicitly. Also, you can set up a weighting column for each slider independently of the weighting column defined in the Weights block (as mentioned in the Thresholds section above). A single slider is shared between multiple weighting columns. As soon as you select a weighting column in the dropdown menu, the threshold slider attaches to it. If you switch to another weighting column, the slider saves the previously entered threshold value and associates it with the previous weighting column.
Normalization type block#
Along with the Weights block, the Normalization type block carries the information on the node and edge weights. However, so far this block does not allow changing the normalization type.
Export block#
Transition graph export supports four formats: HTML, JSON, SVG, and PNG.
PNG. Save graph as an image.
HTML. A graph saved as HTML may be inserted in different environments, dashboards, slides, etc. All the interactive actions that do not require recalculation will be available even in HTML format.
JSON. This option saves the graph completely and allows restoring it later if applied to the same eventstream. All the interactive settings you have set will be saved. To get the modified state of the eventstream, use the recalculation_result property.
SVG. Save graph as a vector graphics image.
Settings block#
The control panel also contains a block with checkbox interface for the already mentioned settings.
Import and export graph#
To restore a previously saved graph, you can pass the JSON file path to the import_file
argument. This will restore all the elements positions, colors, thresholds, and other settings that were tuned previously.
path_to_file = '/path/to/dump_file.json'
stream.transition_graph(import_file=path_to_file)
You can also save the graph from code:
tg = stream.transition_graph()
tg.save_to_file(path_to_file)
Sometimes it is reasonable to save only the node layout, without the state of the other tweaks. In this case you can export/import the node layout as follows:
# export layout
tg = stream.transition_graph()
tg.save_layout(path_to_file)
# import layout
stream.transition_graph(layout_dump=path_to_file)
Graph properties#
A summary of all the important graph settings that have been chosen is available by clicking the ⓘ icon in the bottom right corner.
Saving the modified eventstream#
When you perform GUI actions that affect eventstream (like grouping events), the original eventstream is not changed.
To work with the updated data, you can export the modified eventstream using the TransitionGraph.recalculation_result
property.
Suppose you have built a transition graph and obtained the following grouped events delivery_choice_group
, payment_choice_group
, product1_group
.
tg = stream.transition_graph()
As shown below, the recalculation_result
property contains these grouped events:
tg.recalculation_result.to_dataframe()
 | event_id | event_type | event_index | event | timestamp | user_id |
---|---|---|---|---|---|---|
0 | c209a540-5bdf-45ef-84da-917b0751fb3f | raw | 0 | catalog | 2019-11-01 17:59:13.273932 | 219483890 |
1 | ab361d87-ed46-45d1-950b-9e58c57a017c | raw | 1 | product1_group | 2019-11-01 17:59:28.459271 | 219483890 |
2 | fa45b90d-5c03-4d09-b977-5c8a3d2fc32f | raw | 2 | cart | 2019-11-01 17:59:29.502214 | 219483890 |
3 | 1b423d0c-a611-48d1-a20f-a264dd51b4b2 | raw | 3 | catalog | 2019-11-01 17:59:32.557029 | 219483890 |
4 | c0fcfaf6-2899-4f07-93e8-776a5147f8d4 | raw | 4 | catalog | 2019-11-01 21:38:19.283663 | 964964743 |
5 | bde282ab-5153-492e-83c1-8903ccf78a6b | raw | 5 | cart | 2019-11-01 21:38:36.761221 | 964964743 |
6 | 82c5197e-9a5e-4c16-b6be-92c3aa0c2c08 | raw | 6 | delivery_choice_group | 2019-11-01 21:38:37.564693 | 964964743 |