Transition Graph#
Loading data#
Throughout this guide we use our demonstration simple_shop dataset. It has already been converted to Eventstream and assigned to stream
variable. If you want to use your own dataset, upload it following this instruction.
from retentioneering import datasets
stream = datasets.load_simple_shop()
A basic example#
The transition graph is a weighted directed graph that illustrates how often the users from an eventstream move from one event to another. The nodes stand for the unique events. A pair of nodes (say, A
and B
) is connected with a directed edge if the transition A → B
appeared at least once in the eventstream. Transition means that event B
appeared in a user path right after event B
. For example, in path A, C, B
there is no transition A → B
since event C
stands between A
and B
.
Each node and edge is associated with its weight. Roughly speaking, the weights are the numbers that reflect a node or an edge frequency. They might be calculated in multiple ways (see Weights section).
The primary way to build a transition graph is to call Eventstream.transition_graph()
method:
stream.transition_graph()
According to the transition graph definition, we see here the events represented as nodes connected with the edges. By default, the nodes and edges weights are simply the total numbers of the events/transitions occurred in the sourcing eventstream. All the edges are labeled with these numbers in the graph. For example, among the others, we can see that there were 1709 catalog → cart
transitions, 1042 main → main
self-transitions, and there were no product1 → payment_done
transitions. The thickness of the edges and the size of the nodes are proportional to their weights.
The graph is interactive. You can move the nodes, zoom in/out the chart, and finally reveal or hide a control panel by clicking on the left edge of the chart. You can check the interactive features out even in the transition graphs embedded in this document.
Transition graph parameters#
Weights#
Edge weights calculation#
The edge weight values are controlled by edges_norm_type
and edges_weight_col
parameters of Eventstream.transition_graph()
method.
As we mentioned earlier, the most straightforward way to assign an edge weight is to calculate the number of the transitions associating with the edge in the entire eventstream. In this case we use edges_norm_type=None
and edges_weight_col='event_id'
, meaning that no normalization is needed and event_id
column is used as a weighting column (we will explain the concept of weighting columns below).
By weight normalization we mean dividing the transition counts (calculated for edges_norm_type=None
case) by some denominator, so we get rational weights instead of integer. Except None
, two normalization types are possible: full
and node
. Full normalization defines the denominator as the overall number of the transitions in the eventstream. Node normalization works as follows. Consider a hypothetical A → B
transition. To normalize the weight of this edge we need to divide the number of A → B
transitions by the total number of the transitions coming out of A
node. In other words, node-normalized weight is essentially the probability of a user to transit to event B
standing on event A
.
Now, let us move to weighting column definition. In many cases it is reasonable to count the number of unique users or sessions instead of the number of transitions. This behavior is controlled by edges_norm_type
parameter. By default, edges_weight_col='event_id'
that is associated with the number of the transitions. You can also pass the names of the columns related to users or sessions in the eventstream. Typically they are user_id
and session_id
, but to be sure, check your eventstream data schema and session_col
parameter in the SplitSessions data processor
if you used it.
Having edges_weight_col
defined allows you to calculate the weighs as the unique values represented in edges_weight_col
column. This also relates to full
and node
normalization types. For example, edges_norm_type='full'
and edges_weight_col='user_id'
configuration means that we divide the number of the unique users who had a specific transition by the number of the unique users in the entire eventstream.
A simplified example#
In order to check whether you understand these definitions correctly, let us consider a simplified example and look into the matter of the edge weights calculation. Suppose we have the following eventstream:
user1: A, B, A, C, A, Buser2: A, B, C, C, C
user3: C, D, C, D, C, D
This eventstream consists of 3 unique users and 4 unique events. The event colors denote sessions (there are 6 sessions). We ignore the timestamps since the edge weights calculation does not take them into account. Note that throughout this example we will suppress edge_
prefix for the edges_norm_type
and edges_weight_col
.
Table 1 describes how the edge weights are calculated in case of weight_col='event_id'
.
So we have 8 unique edges in total. At first, we calculate for each edge the total number of such transitions occurred in the eventstream. As a result, we get the values in norm_type=None
column. Next, we estimate the total number of the transitions in the eventstream: 14. To get the weights in norm_type='full'
column, we divide the weights in norm_type=None
column by 14. Finally, we estimate that we have 4, 2, 6, 1 transitions starting from event A
, B
, C
, and D
correspondingly. Those are the denominators for norm_type='node'
column. To calculate the weights for this option, we divide the values in norm_type=None
by these denominators.
The calculation of the edge weights for weight_col='user_id'
is described in Table 2.
Now, for norm_type=None
option we calculate the number of unique users who had a specific transition. For norm_type='full'
the denominator is 3 as the total number of users in the eventstream. As for norm_type='node'
option, we have 2, 2, 3, 1 unique users who experienced A → *
, B → *
, C → *
, D → *
transitions. These values comprise the denominators. Again, to get the weights in norm_type='column'
, we divide the values from norm_type=None
column by these corresponding denominators.
Finally, in Table 3 we demonstrate the calculations for weight_col='session_id'
.
In comparison with the case for user_id
weight column, there are some important differences. Transitions B → A
, C → A
, B → C
are excluded since they are terminated by the session endings (their weights are zeros). As for the other transitions, we calculate the number of unique sessions they belong to. This is how we get norm_type=None
column. The total number of the sessions in the eventstream is 6. This is the denominator for norm_type='full'
column. The denominators for norm_type='node'
column are calculated as the number of the unique sessions with A → *
, B → *
, C → *
, and D → *
transitions. They are 4, 0, 2, and 1 correspondingly. Note that for B → A
and B → C
edges we have indeterminate form 0/0, since we have excluded all the transitions starting from B
. We define the corresponding weights as 0. Also, the denominator for C → *
edges is 2, not 3 since we have excluded one C → A
transition.
Node weights#
Besides edge weights, a transition graph also have node weights that control the diameters of the nodes. Unfortunately, so far only one option is supported: norm_type=None
along with weighting columns. However, if you want to know how the node weights for norm_type='full'
are calculated, expand the following text snippet:
Obviously, node weights do not support norm_type='node'
since it involves edges by design. However, node_norm_type=None
and norm_type='full'
options might be calculated. They leverage the same calculation logic as we used for the edge weights calculation.
We explain this logic using the same example eventstream.
So for norm_type=None
option the node weights are simply the counters of the events over the entire eventstream (in case of weight_col='event_id'
) or the number of unique users or sessions (in case of weight_col='user_id'
or weight_col='session_id'
) that had a specific event. For norm_type='full'
we divide the non-normalized weights by either the overall number of events (17), or the number of unique users (3), or the number of unique sessions (6). See the calculations for each of the described cases in Table 4, Table 5, and Table 6 below:
Setting the weight options#
Finally, we demonstrate how to set the weighting options for a graph. As it has been discussed, edges_norm_type
argument accepts None
, full
or node
values. A weighting column is set by edges_weight_col
argument. Below is a table that summarizes the definitions of edge weights when these two arguments are used jointly.
edges_norm_type → edges_weight_col ↓ |
None |
full |
node |
---|---|---|---|
None or event_id |
The total number of the |
The total number of the |
The total number of the |
user_id |
The total number of the unique users who had the |
The total number of the unique users who had the |
The total number of the unique users who had the |
session_id |
The total number of the unique sessions who had the |
The total number of the unique sessions who had the |
The total number of the unique sessions where the |
Here is an example of the using these arguments:
stream.transition_graph(
edges_norm_type='node',
edges_weight_col='user_id'
)
From this graph we see, for example, that being at product1
event, 62.3% of the users transit to catalog
event, 43.3% - to cart
event, and 11.4% - to main
event. As you can notice, when you use some normalization, the values are not necessarily sum up to 1. This happens because a user can be at product1
state multiple times, so they can jump to multiple of these three events.
Thresholds#
The weights that we have discussed above are associated with importance of the edges and the nodes. In practice, a transition graph often contains enormous number of the nodes and the edges. The threshold mechanism sets the minimal weights for nodes and edges to be displayed in the canvas.
Note that the thresholds may use their own weighting columns both for nodes and for edges independently of those weighting columns defined in edges_weight_col
arguments. So the weights displayed on a graph might be different from the weights that the thresholds use in making their decision for hiding the nodes/edges. Moreover, multiple weighting columns might be used. In this case, the decision whether an item (a node or an edge) should be hidden is made applying logical OR: an item is hidden if it does not meet any threshold condition.
Also note that, by default, if all the edges connected to a node are hidden, the node becomes hidden as well. You can turn this option off here.
The thresholds are set with a couple of nodes_threshold
, edges_threshold
parameters. Each parameter is a dictionary. The keys are weighting column names, the values are the threshold values.
stream.transition_graph(
edges_norm_type='node',
edges_weight_col='user_id',
edges_threshold={'user_id': 0.12},
nodes_threshold={'event_id': 500}
)
This example is an extension of the previous one. We use the same normalization configuration as before. Since we have added an edges threshold of 0.12
for user_id
weighting column, the edge product1
→ main
that we observed in the previous example is hidden now (its weight is 11.4%). As for the nodes threshold, note that event payment_cash
is hidden now (as we can see from the Nodes block in the Control panel, its weight is 197).
Targets#
As we have already mentioned, the graph nodes are often of different importance. Sometimes we need not just to hide unimportant nodes, but to highlight important nodes instead. Transition graph identifies three types of the nodes: positive, negative, and sourcing. Three colors relate to these node types: green, ren and orange correspondingly. You can color the nodes with these colors by defining their types in the targets
parameter:
stream\
.add_start_end_events()\
.transition_graph(
targets={
'positive': ['payment_done', 'cart'],
'negative': 'path_end',
'source': 'path_start'
}
)
In the example above we additionally use Eventstream.add_start_end_events()
data processor helper to add path_start
and path_end
events.
Graph settings#
You can set up the following boolean flags:
show_weights
. Hide/display the edge weight labels. Default value is True.show_percents
. Display edge weights as percents. Available only if an edge normalization type is chosen. Default value is False.show_nodes_names
. Hide/display the node names. Default value is True.show_all_edges_for_targets
. By default, the threshold filters hide the edges disregarding the node types. In case you have defined target nodes, you usually want to carefully analyze them. Hence, all the edges connected to these nodes are important. This displaying option allows to ignore the threshold filters and always display any edge connected to a target node. Default value is True.show_nodes_without_links
. Setting a threshold filter might remove all the edges connected to a node. Such isolated nodes might be considered as useless. This displaying option hides them in the canvas as well. Default value is True.show_edge_info_on_hover
. By default, a tooltip with an edge info pops up when you mouse over the edge. It might be disturbing for large graphs, so this option suppresses the tooltips. Default value is False.
These flags could be specified as separate arguments as follows:
stream.transition_graph(
edges_norm_type='node',
show_weights=True,
show_percents=True,
show_nodes_names=True,
show_all_edges_for_targets=False,
show_nodes_without_links=False,
show_edge_info_on_hover=True
)
Control panel#
The control panel is a visual interface allowing you to interactively control transition graph behavior. It also allows even to control the underlying eventstream in some scenarios (grouping events, renaming events, including/excluding events).
The control panel consists of 5 blocks: Weights, Nodes, Thresholds, Export, and Settings. By default, all these blocks are expanded. You can collapse them by clicking minus sign located at the top right corner of each block.
Click the minus sign to collapse the blocks. |
Click the plus sign to expand the blocks. |
Warning
All the settings that are tweaked in the Control panel are available only in scope of the current transition graph displayed in the current Jupyter cell. As soon as you run Eventstream.transition_graph()
again, all the settings will be reset to the defaults unless you call the method with particular parameters.
Weights block#
The Weights block contains selectors that choose weighting columns separately for nodes and edges. Unfortunately, so far you can not choose normalization type in this interface. The only way to set the normalization type is using edge_norm_type
argument in Eventstream.transition_graph()
method as it has been shown here. event_id
weighting column refers to edge_norm_type=None
.
For the nodes only event_id
and user_id
weighting columns are available. The same columns are available for the edges, but additionally the columns that are passed as the edges_weight_col
and custom_weight_cols
arguments of the Eventstream.transition_graph()
are also available.
Nodes block#
The Nodes block enumerates all the unique events represented in the transition graph and allows to perform such operations as grouping, renaming, and switching events.
Note
Nodes switcher requires graph recalculation.
Node item actions#
Each node list item contains the following 4 elements:
Focus icon. If you click it, the graph changes its position in the canvas so the selected node is placed in the center.
Event name. Double click it if you want to rename the node.
The number of the event occurrences in the eventstream.
This switcher hides the node and all the edges connected to the node from the canvas.
Grouping events#
The Control panel interface supports easy and intuitive event grouping. Suppose you want to group product1
and product2
events into one. There are two ways to do this:
Drag & drop method. Drag one node (say,
product2
) and drop it toproduct1
node.product1_group
event appears which contains eventsproduct1
andproduct2
.Add group method. Click “+ Add group” button,
untitled_group
appears. Drag & drop all the nodes to be grouped to this group.
Grouping node has a folder icon that triggers aggregation action. Once you click it, the grouped nodes are merged and the changes are displayed in the transition graph. Recalculation is required to update the node and edge weights.
Note
By recalculation we mean that some additional calculations are required in the backend in order to display the graph state according to the selected options. To recalculate the values, click yellow ⚠️ icon and request the recalculation. Sometimes it is reasonable to do multiple modifications in the control panel, and then call the recalculation at once.
Grouping nodes using drag & drop method. |
Grouping nodes using + Add group method. |
To rename a grouping node, double click its name and enter a new one. To ungroup the grouped nodes drag & drop the nodes out of the grouping node (or drop it right on the grouping node). As soon as the last event is out, the grouping node disappears.
Note
All the grouping and renaming actions do not affect the initial eventstream due to eventstream immutability property. However, it is possible to export the modified eventstream using the TransitionGraph.recalculation_result attribute.
Thresholds block#
The Thresholds block contains two sliders: one is associated with the nodes, another one - with the edges. You can set up a threshold value either by moving a slider or by entering a value explicitly. Also, you can set up a weighting column for each slider independently of the weighting column defined in the Weights block (we have already mentioned this feature here). A single slider is shared between multiple weighting columns. As soon as you select a weighting column in the dropdown menu, the threshold slider attaches to it. If you change another weighting column, the slider saves the previously entered threshold value and associate it with the previous weighting column.
Normalization type block#
Along with the Weights block, the Normalization type block carries the information on the nodes and edges weights. However, so far this block does not allow to change the normalization type.
Export block#
Transition graph export supports four formats: HTML, JSON, SVG, and PNG.
PNG is a general-purpose format. It is used when you need to save a graph as an image.
HTML format is useful when you want to embed the resulting graph into different environments, reports, etc. Such a file supports all the interactive actions as if you treated the graph in Jupyter environment. For example, the graphs that are embedded in this user guide were exported right in this way, so that they are still interactive.
JSON format might be useful when you need to get the nodes coordinates. However, it is important to note that when using this option, only the node coordinates are saved. Any other adjustments or modifications you made in the graph GUI, such as thresholds, weights, settings will be lost and not included in the JSON file. To get the modified state of the eventstream, use recalculation_result property.
SVG is a commonly used format for vector graphics.
Settings block#
The control panel also contains a block with checkbox interface for the already mentioned settings.
Import and export graph layout#
To restore previously saved node positions, you need to pass the file path of the JSON file to the layout_dump
parameter.
path_to_file = '/path/to/node_params.json'
stream.transition_graph(layout_dump=path_to_file)
In general, layout dump mechanism is applied to the same eventstream that was a source of the JSON file. Technically, you can apply it to another eventstream. In this case, the saved node positions are assigned only for those nodes which are enlisted in the JSON file. Other nodes are ignored and placed at some default positions. This trick is convenient when you need to visualize transition graphs for multiple clusters of the same eventstream. You can use the same JSON file for all these graphs.
Note
In order to reproduce your graph state as closely as possible you can pass all the adjustments you have made in the transition graph GUI as parameters and their corresponding values to the new transition graph.
Graph properties#
A summary with all the important chosen graph settings is available by clicking ⓘ icon in the bottom right corner.
Saving the modified eventstream#
When you perform GUI actions that affect eventstream (like grouping events), the original eventstream is not changed.
To work with the updated data, you can export the modified eventstream using the TransitionGraph.recalculation_result
property.
Suppose you have built a transition graph and obtained the following grouped events delivery_choice_group
, payment_choice_group
, product1_group
.
tg = stream.transition_graph()
As we see from below, recalculation_result
property contains these grouped events:
tg.recalculation_result.to_dataframe()
event_id | event_type | event_index | event | timestamp | user_id | |
---|---|---|---|---|---|---|
0 | c209a540-5bdf-45ef-84da-917b0751fb3f | raw | 0 | catalog | 2019-11-01 17:59:13.273932 | 219483890 |
1 | ab361d87-ed46-45d1-950b-9e58c57a017c | raw | 1 | product1_group | 2019-11-01 17:59:28.459271 | 219483890 |
2 | fa45b90d-5c03-4d09-b977-5c8a3d2fc32f | raw | 2 | cart | 2019-11-01 17:59:29.502214 | 219483890 |
3 | 1b423d0c-a611-48d1-a20f-a264dd51b4b2 | raw | 3 | catalog | 2019-11-01 17:59:32.557029 | 219483890 |
4 | c0fcfaf6-2899-4f07-93e8-776a5147f8d4 | raw | 4 | catalog | 2019-11-01 21:38:19.283663 | 964964743 |
5 | bde282ab-5153-492e-83c1-8903ccf78a6b | raw | 5 | cart | 2019-11-01 21:38:36.761221 | 964964743 |
6 | 82c5197e-9a5e-4c16-b6be-92c3aa0c2c08 | raw | 6 | delivery_choice_group | 2019-11-01 21:38:37.564693 | 964964743 |
Transition matrix#
Transition matrix
is a sub-part of transition graph. It contains edge weights only so that the weight of, say, A → B
transition is located at A
row and B
column of the transition matrix. The calculation logic is exactly the same as we have described here for transition graphs, and the arguments are similar to weights-related arguments of transition graph. Use norm_type
instead of edges_norm_type
and weight_col
instead of edges_weight_col
.
stream.transition_matrix(norm_type='node', weight_col='user_id')
cart | catalog | ... | payment_done | payment_cash | |
---|---|---|---|---|---|
cart | 0.000594 | 0.283848 | ... | 0.000000 | 0.0 |
catalog | 0.418458 | 0.633375 | ... | 0.000000 | 0.0 |
... | ... | ... | ... | ... | ... |
payment_done | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 |
payment_cash | 0.000000 | 0.000000 | ... | 0.708333 | 0.0 |
For example, from this matrix we can see that the weight of the edge cart → catalog
is ~0.28 with respect to given weights configuration: norm_type='node'
and weight_col='user_id'
.
Using a separate instance#
By design, Eventstream.transition_graph()
is a shortcut method that uses TransitionGraph
class under the hood. This method creates an instance of TransitionGraph class and embeds it into the eventstream object. Eventually, Eventstream.transition_graph()
returns exactly this instance.
Sometimes it is reasonable to work with a separate instance of TransitionGraph class. An alternative way to get the same visualization that Eventstream.transition_graph()
produces is to call TransitionGraph.plot()
method explicitly.
Here is an example how you can manage it:
from retentioneering.tooling.transition_graph import TransitionGraph
tg = TransitionGraph(stream)
tg.plot(
edges_norm_type='node',
edges_weight_col='user_id',
edges_threshold={'user_id': 0.12},
nodes_threshold={'event_id': 500},
targets={'positive': ['payment_done', 'cart']}
)