StepSankey#

The following user guide is also available as a Google Colab notebook.

Loading data#

Throughout this guide we use our demonstration simple_shop dataset. It has already been converted to an Eventstream and assigned to the stream variable. If you want to use your own dataset, upload it by following this instruction.

from retentioneering import datasets

stream = datasets.load_simple_shop()

Basic example#

The step Sankey diagram represents an eventstream as a stepwise directed graph. Each node corresponds to an event that appears at a particular step of a user's trajectory. The nodes are ordered from left to right by step number (1, 2, etc.). Each edge shows how often a transition occurred from event A at the i-th step to event B at the (i+1)-th step. The sizes of the nodes and edges reflect the number of unique users involved.

In a sense, the step Sankey diagram extends the step matrix diagram. The step matrix shows the distribution of events over the ordinal steps; the step Sankey chart additionally shows the connections between adjacent steps, which the step matrix lacks. Step Sankey therefore inherits many features of the step matrix, so we recommend reading the Step matrix user guide before this document.

The implementation is based on the Plotly Sankey diagram and inherits all of its benefits. In particular, the diagram is interactive: you can hover over the nodes and edges to see detailed info, move the nodes, and even merge them (to merge, use the Box Select or Lasso Select tools that appear at the top-right corner on hover).

The primary way to build a step Sankey diagram is to call the Eventstream.step_sankey() method. Here is how it visualizes the simple_shop eventstream:

stream.step_sankey(max_steps=5)

Here we can see the user flow. The nodes are grouped into columns in a stepwise manner: the first column corresponds to the events that occurred at the users' first step, the second column to the second step, and so on. The height of the rectangle representing a node is proportional to how often this event occurred at this step. Hovering over the nodes shows that at the first step the catalog event appeared 2.69K times (71.61% of the users), whereas the main event appeared 1.07K times (28.39% of the users). That is why the rectangle for the catalog event is ~2.5 times taller than the rectangle for the main event. The percentage of users is calculated with respect to all the users in the parent eventstream.
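The percentages and the height ratio quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the exact counts (2686 and 1065 out of 3751 users) are taken from the values table shown later in this guide, and the variable names are ours rather than part of the library.

```python
# Exact node counts at step 1, copied from the values table later in this guide.
usr_cnt = {"catalog": 2686, "main": 1065}
usr_cnt_total = 3751  # all users in the parent eventstream

# Share of users per node, as a percentage rounded to two decimals.
perc = {event: round(100 * cnt / usr_cnt_total, 2) for event, cnt in usr_cnt.items()}

# Ratio of the node heights (catalog vs main).
height_ratio = round(usr_cnt["catalog"] / usr_cnt["main"], 2)

print(perc)          # {'catalog': 71.61, 'main': 28.39}
print(height_ratio)  # 2.52
```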

An edge's width is proportional to the frequency of the transition in the eventstream. Hovering over an edge reveals not only this frequency but also how long the transition took the users on average. For example, the transition catalog (1st step) -> catalog (2nd step) appeared in 869 paths and took 29 seconds on average.

Figure: screenshots of the tooltips shown on mouse hover (hover_node1, hover_node2, hover_edge).

Finally, the max_steps argument used above sets the number of steps displayed in the diagram (starting from the 1st step).

Note

The charts in this user guide are usually quite wide, so do not forget to use the horizontal scroll bar to see the entire visualization.

Terminating event#

Similar to the step matrix, the step Sankey diagram uses a synthetic ENDED event. This event is appended to short paths (those shorter than max_steps) so that their length becomes exactly max_steps. See the Step matrix user guide for details.
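The padding idea can be sketched in plain Python. This is not the library's implementation: the pad_path helper and the toy paths are hypothetical and exist only to show how padding makes every path exactly max_steps long, so that the user shares at each step add up to 1.

```python
from collections import Counter


def pad_path(path, max_steps, terminator="ENDED"):
    """Cut a path to max_steps events and pad a shorter path with the terminator."""
    trimmed = path[:max_steps]
    return trimmed + [terminator] * (max_steps - len(trimmed))


# Hypothetical toy paths, one list of events per user.
paths = [
    ["catalog", "cart", "payment_done"],
    ["main"],
    ["catalog", "main", "catalog", "product1", "cart", "catalog"],
]
max_steps = 5
padded = [pad_path(p, max_steps) for p in paths]

# Share of users per event at each step; every column sums to 1.
step_shares = []
for step in range(max_steps):
    counts = Counter(p[step] for p in padded)
    step_shares.append({event: cnt / len(padded) for event, cnt in counts.items()})

print(padded[1])  # ['main', 'ENDED', 'ENDED', 'ENDED', 'ENDED']
```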

Having the ENDED event guarantees that the user shares in each column (i.e. at each step) sum to exactly 1. ENDED is always placed at the bottom of the diagram. The following example demonstrates this (we temporarily set threshold=0 for comparison purposes; see the next section).

stream.step_sankey(max_steps=5, threshold=0)

In this diagram we see that ENDED first appears at the 2nd step and involves 443 users. At the 3rd step the ENDED event contains 823 users; for 443 of them the event has been propagated from the previous step.

Collapsing rare events#

As with the step matrix, it is reasonable to collapse rare events in the step Sankey diagram, since they make the diagram excessively noisy. This behaviour is controlled by the threshold argument. An event is considered rare if its maximum frequency over all the steps shown in the diagram is less than threshold. The threshold can be either an int or a float: an int sets a limit on the absolute number of users, while a float sets a limit on the share of users (for example, 0.1 means 10%). Rare events are not removed from the diagram; instead, they are collapsed into the artificial thresholded_N event, where N is the number of collapsed events. The thresholded_N event appears in the step Sankey diagram only and is not added to the parent eventstream.

The default value for threshold is 0.05. Let us see how the events are collapsed when we set threshold=0.1, and compare the result with the previous diagram (built with threshold=0).

stream.step_sankey(max_steps=5, threshold=0.1)

We see that the thresholded_5 event has appeared. As you might have noticed, it contains product1, payment_choice, delivery_choice, delivery_courier, and delivery_pickup. Let us explain why, for example, the product1 event has been collapsed. Look at the first chart with threshold=0. The product1 event contains 7.01%, 4.51%, 4.27%, and 3.2% of the users at steps 2, 3, 4, and 5 respectively. Since the maximum value (7.01%) is less than threshold=0.1, the event has been collapsed.
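The rule described above can be written down as a short sketch. The is_rare helper is hypothetical (it is not a function of the library) and simply restates the documented condition: an event is rare when its maximum value over the displayed steps is below the threshold. The product1 shares are the ones quoted above, with 0 at step 1 where only catalog and main occur; the catalog shares for steps 1 and 2 come from the values table later in this guide.

```python
def is_rare(per_step_values, threshold):
    """True if the event's maximum value over the displayed steps is below threshold.

    With a float threshold the values are user shares (fractions);
    with an int threshold they would be absolute user counts.
    """
    return max(per_step_values) < threshold


# Shares of users at steps 1..5 for product1 (from the threshold=0 chart).
product1_shares = [0.0, 0.0701, 0.0451, 0.0427, 0.032]
# Shares of users at steps 1..2 for catalog.
catalog_shares = [0.7161, 0.4452]

print(is_rare(product1_shares, 0.1))   # True: 7.01% < 10%, so product1 is collapsed
print(is_rare(product1_shares, 0.05))  # False: 7.01% >= 5%
print(is_rare(catalog_shares, 0.1))    # False
```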

Please also note that the suffix _5 in the thresholded_5 event name carries no information about a specific step. For example, from the chart with threshold=0 we see that at step 2 only one of these 5 events is present (product1), so it is the only event collapsed at this step. At step 3, on the other hand, product1 and delivery_choice appear, so both are collapsed into the thresholded_5 event. Finally, at step 5 all 5 events are collapsed.

If you want to prevent some events from collapsing, use the targets parameter. We extend the previous example, this time pulling the product1 and delivery_choice events out of the thresholded_5 event by putting them into the targets list.

stream\
    .step_sankey(
        max_steps=5,
        threshold=0.1,
        targets=['product1', 'delivery_choice']
    )

Look at step 2. The thresholded_5 event has disappeared completely, and product1 has been revealed instead. At step 3 there is no thresholded_5 event either: it has been replaced by product1 and delivery_choice. Finally, at step 4 we see the two target events product1 and delivery_choice, but a thresholded_3 event is also present. It contains the remaining 3 events: delivery_courier, delivery_pickup, and payment_choice.
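The effect of targets on the name of the collapsed event can be illustrated with a small sketch. The collapse_label helper is hypothetical and only mirrors the behaviour described in this section: target events are excluded from the set of rare events, and the remaining ones are counted to form the thresholded_N name.

```python
def collapse_label(rare_events, targets=None):
    """Return the thresholded_N label and the events it hides, skipping targets."""
    kept = set(targets or [])
    collapsed = sorted(e for e in rare_events if e not in kept)
    label = f"thresholded_{len(collapsed)}" if collapsed else None
    return label, collapsed


# The five rare events found with threshold=0.1.
rare = [
    "product1",
    "payment_choice",
    "delivery_choice",
    "delivery_courier",
    "delivery_pickup",
]

print(collapse_label(rare)[0])
# thresholded_5
print(collapse_label(rare, targets=["product1", "delivery_choice"]))
# ('thresholded_3', ['delivery_courier', 'delivery_pickup', 'payment_choice'])
```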

Events sorting#

Intuitively, the order of events within a column depends on the corresponding user shares at that step. This is true in many cases, but it is not the only logic involved. The sorting algorithm also takes into account the step at which an event appears in the diagram for the first time. Events that appear earlier are ranked higher, even if their frequency at a particular step is low.

To illustrate this logic, consider a dummy eventstream:

import pandas as pd
from retentioneering.eventstream import Eventstream

dummy_stream = Eventstream(
    pd.DataFrame(
        [
            [1, 'event1', '2023-01-01 00:00:00'],
            [1, 'event1', '2023-01-01 00:00:00'],
            [2, 'event1', '2023-01-01 00:00:00'],
            [2, 'event2', '2023-01-01 00:00:00'],
            [3, 'event1', '2023-01-01 00:00:00'],
            [3, 'event2', '2023-01-01 00:00:00'],
            [4, 'event1', '2023-01-01 00:00:00'],
            [4, 'event2', '2023-01-01 00:00:00'],
        ],
        columns=['user_id', 'event', 'timestamp']
    )
)
dummy_stream.step_sankey(max_steps=4)

From this chart we see that event2 does not occur at step 1. At step 2, however, although event2 dominates (75% of the users have event2 at step 2), event1 is placed higher because it is considered “older” than event2.
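One way to picture this ordering is a two-part sort key: first the step at which an event was first seen, then the number of users in descending order. The sketch below is a simplified interpretation of the behaviour described above, not the library's actual algorithm; the counts are those of the dummy eventstream, with ENDED left out for brevity.

```python
# Users per event at each step of the dummy eventstream (ENDED omitted).
step_counts = {
    1: {"event1": 4},
    2: {"event1": 1, "event2": 3},
}

# Step at which each event shows up for the first time.
first_seen = {}
for step in sorted(step_counts):
    for event in step_counts[step]:
        first_seen.setdefault(event, step)


def column_order(step):
    """Order the events of one column: older events first, then by user count."""
    counts = step_counts[step]
    return sorted(counts, key=lambda e: (first_seen[e], -counts[e]))


print(column_order(2))  # ['event1', 'event2']
```

Even though event2 covers 3 of the 4 users at step 2, event1 comes first because it was first seen at step 1.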

Using a separate instance#

By design, Eventstream.step_sankey() is a shortcut method that uses the StepSankey class under the hood. This method creates an instance of the StepSankey class, embeds it into the eventstream object, and returns exactly this instance.

Sometimes it is reasonable to work with a separate instance of the StepSankey class. An alternative way to get the same visualization that Eventstream.step_sankey() produces is to call the StepSankey.fit() and StepSankey.plot() methods explicitly. The former calculates all the values needed for the visualization; the latter displays these values as a Sankey diagram.

from retentioneering.tooling.step_sankey import StepSankey

step_sankey = StepSankey(stream)
step_sankey.fit(max_steps=5, threshold=0.1)
step_sankey.plot()

Common tooling properties#

values#

Since the StepSankey object is essentially a graph, it is natural to expose the underlying values as data on the graph's nodes and edges. Accordingly, the StepSankey.values property returns two pandas.DataFrame objects: the first describes the nodes, the second describes the edges. The show_plot=False argument in the examples below suppresses displaying the diagram.

# StepSankey graph nodes
stream\
    .step_sankey(show_plot=False)\
    .values[0]
   step     event  usr_cnt  usr_cnt_total   perc           color  index  sorting  order_by
0     1   catalog     2686           3751  71.61  (80, 190, 151)      0      100       100
1     1      main     1065           3751  28.39  (228, 101, 92)      1      100       100
2     2   catalog     1670           3751  44.52  (80, 190, 151)      2      100         0
3     2      main      609           3751  16.24  (228, 101, 92)      3      100         1
4     2  product2      429           3751  11.44    (53, 58, 62)      4      100       100
# StepSankey graph edges
stream\
    .step_sankey(show_plot=False)\
    .values[1]
   step    event  next_event  usr_cnt           time_to_next_sum  index  next_step  next_index
0     1  catalog     catalog      869     0 days 07:05:31.308030      0          2           2
1     1  catalog        main      452  2228 days 01:07:48.656824      0          2           3
2     1  catalog    product2      429     0 days 01:12:27.870236      0          2           4
3     1  catalog        cart      337     0 days 02:31:57.294871      0          2           5
4     1  catalog       ENDED      336            0 days 00:00:00      0          2           7
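The average transition time shown on hover appears to follow from these columns. As a rough check, assuming the average is simply time_to_next_sum divided by usr_cnt (our assumption, not something the table states), the first row reproduces the 29 seconds quoted earlier for the catalog -> catalog transition:

```python
from datetime import timedelta

# Values copied from the first row of the edges table (catalog -> catalog, step 1 -> 2).
time_to_next_sum = timedelta(hours=7, minutes=5, seconds=31, microseconds=308030)
usr_cnt = 869

# Assumed definition of the average: total transition time divided by the number of users.
avg_seconds = time_to_next_sum.total_seconds() / usr_cnt

print(round(avg_seconds, 2))  # 29.38
```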

params#

The StepSankey.params property returns a dictionary containing all the parameters (including the defaults) that define the current state of the StepSankey object:

# StepSankey parameters
stream\
    .step_sankey(show_plot=False)\
    .params
{'max_steps': 10,
 'threshold': 0.05,
 'sorting': None,
 'targets': None,
 'autosize': True,
 'width': None,
 'height': None}