Quick start with Retentioneering
================================

|colab| |jupyter|

.. |jupyter| raw:: html

   Download - Jupyter Notebook

.. |colab| raw:: html

   Google Colab

Retentioneering is a Python library for in-depth analysis of what is commonly called a user clickstream. We find the traditional term clickstream too restrictive, since user actions are not limited to clicks. Instead, we use the term *event* for any user action, and *eventstream* for a set of such actions. The set of events belonging to a particular user is called a *user path* or *user trajectory*; sometimes *customer journey map* (CJM) is used as a synonym for eventstream. Each event is tied to the user who experienced it and to a timestamp, so at a basic level an eventstream comprises a set of triples like these:

.. parsed-literal::

    ('user_1', 'login', '2019-01-01 00:00:00'),
    ('user_1', 'main_page_visit', '2019-01-01 00:00:00'),
    ('user_1', 'cart_button_click', '2019-01-01 00:00:00'),
    ...

Any eventstream research consists of three fundamental steps:

- Loading data
- Preparing the data
- Applying Retentioneering tools

This document is a brief overview of how to follow these steps. For more detail, see the :doc:`User Guides <../user_guide>`.

Loading data
------------

Here we introduce our core class, :doc:`Eventstream <../user_guides/eventstream>`, which stores eventstream events and enables you to work with them efficiently. We provide a small :doc:`simple_shop <../datasets/simple_shop>` dataset for demo purposes here and throughout the documentation.

.. code-block:: python

    from retentioneering import datasets

    # load sample user behavior data:
    stream = datasets.load_simple_shop()

Under the hood, an eventstream object wraps a regular pandas.DataFrame, which can be retrieved by calling the :py:meth:`to_dataframe()` method:

.. code-block:: python

    stream.to_dataframe().head()

.. parsed-literal::

        user_id    event     timestamp
    0   219483890  catalog   2019-11-01 17:59:13.273932
    1   219483890  product1  2019-11-01 17:59:28.459271
    2   219483890  cart      2019-11-01 17:59:29.502214
    3   219483890  catalog   2019-11-01 17:59:32.557029
    4   964964743  catalog   2019-11-01 21:38:19.283663
In this fragment of the dataset, user ``219483890`` generated 4 events on the website on ``2019-11-01``.

If the simple_shop dataset works for you, proceed to the next section. Alternatively, you can create an eventstream from your own dataset. It must be represented as a CSV table with at least three columns (``user_id``, ``event``, and ``timestamp``). Load your table into a pandas.DataFrame and create the eventstream as follows:

.. code-block:: python

    import pandas as pd
    from retentioneering.eventstream import Eventstream

    # load your own csv
    data = pd.read_csv("your_own_data_file.csv")
    stream = Eventstream(data)

If the input table columns have different names, either rename them in the DataFrame or explicitly set the data schema (see the :ref:`Eventstream user guide ` for instructions). Likewise, if the table has additional custom columns, setting the data schema is also required.

Getting a CSV file with data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you use Google Analytics, raw data in the form of {user, event, timestamp} triples can be streamed via Google Analytics 360 or the free Google Analytics App+Web into BigQuery. From the BigQuery console, you can run an SQL query and export the data into a CSV file. Alternatively, you can use the Python BigQuery connector to load the data directly into a DataFrame. For large datasets, we suggest sampling users in the SQL query by filtering on the user id; just add this condition to the SQL ``WHERE`` clause to get 10% of your users:

.. parsed-literal::

    and ABS(MOD(FARM_FINGERPRINT(fullVisitorId), 10)) = 0

.. _quick_start_preprocessing:

Preparing the data
------------------

Raw data often needs to be prepared before analytical techniques are applied.
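One common preparation step happens even before any Retentioneering tooling is involved: aligning your raw table's column names with the ``user_id``, ``event``, ``timestamp`` naming described above. A minimal pandas sketch (the source column names ``client_id``, ``action``, and ``ts`` are hypothetical):

.. code-block:: python

    import pandas as pd

    # Hypothetical raw table whose columns do not match the expected
    # user_id / event / timestamp naming.
    data = pd.DataFrame(
        {
            "client_id": [219483890, 219483890],
            "action": ["catalog", "cart"],
            "ts": ["2019-11-01 17:59:13", "2019-11-01 17:59:29"],
        }
    )

    # Rename the columns to the names Eventstream expects by default.
    data = data.rename(
        columns={"client_id": "user_id", "action": "event", "ts": "timestamp"}
    )

    print(list(data.columns))  # ['user_id', 'event', 'timestamp']

After this, the frame can be passed to ``Eventstream(data)`` exactly as shown above; explicitly setting the data schema is only needed when you prefer to keep the original column names or carry extra custom columns.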
Retentioneering provides a wide range of preprocessing tools composed of elementary parts called "data processors". With their help, a product analyst can easily add, delete, or group events, flexibly truncate an eventstream, split trajectories into sessions, and much more. See the :doc:`Data processors user guide <../user_guides/dataprocessors>` for a comprehensive description of this Swiss army knife of data processors.

Below is a brief example of how the data processors work. Suppose you want to analyze only the first session of each user, rather than their whole trajectory. Here is how you can do that with just a few lines of code:

.. code-block:: python

    # eventstream preprocessing example
    stream = stream.split_sessions(timeout=(30, 'm'))
    stream\
        .filter_events(func=lambda df_, schema: df_['session_id'].str.endswith('_1'))\
        .to_dataframe()\
        .head()

.. parsed-literal::

        user_id    event          timestamp                   session_id
    0   219483890  session_start  2019-11-01 17:59:13.273932  219483890_1
    1   219483890  catalog        2019-11-01 17:59:13.273932  219483890_1
    3   219483890  product1       2019-11-01 17:59:28.459271  219483890_1
    5   219483890  cart           2019-11-01 17:59:29.502214  219483890_1
    7   219483890  catalog        2019-11-01 17:59:32.557029  219483890_1
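As a quick sanity check on the ``split_sessions`` output above, you can count sessions per user with plain pandas on the exported DataFrame. The sketch below uses a toy frame with the same ``session_id`` format (``<user_id>_<session ordinal>``):

.. code-block:: python

    import pandas as pd

    # Toy frame mimicking the split_sessions output format shown above.
    df = pd.DataFrame(
        {
            "user_id": [1, 1, 1, 2, 2],
            "event": ["catalog", "cart", "catalog", "main", "catalog"],
            "session_id": ["1_1", "1_1", "1_2", "2_1", "2_1"],
        }
    )

    # Number of distinct sessions per user.
    sessions_per_user = df.groupby("user_id")["session_id"].nunique()
    print(sessions_per_user.to_dict())  # {1: 2, 2: 1}

Filtering to ``session_id`` values ending with ``_1``, as in the example above, would keep exactly one session per user.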

First, we take the ``stream`` variable that contains the eventstream instance created in the previous section. The :ref:`split_sessions` method creates a new column called ``session_id``, whose values end with a suffix indicating the ordinal number of each user's session. Then we keep only those records where ``session_id`` ends with ``_1`` (meaning the first session); this is exactly what the ``filter_events`` method does. Finally, we apply the ``to_dataframe()`` method, which you are already familiar with.

In real life, analytical eventstream research is likely to be branchy. You might want to wrangle an initial eventstream's data in many ways, check multiple hypotheses, and look at different parts of the eventstream. All of this is easily and efficiently managed with the preprocessing graph, which keeps all the records and code related to the research in a calculation graph. This tool is especially recommended for those who need to share parts of their analytical code with team members. See the :doc:`Preprocessing user guide <../user_guides/preprocessing>` for more details.

.. _quick_start_rete_tools:

Applying path analysis tools
----------------------------

Retentioneering offers many powerful tools for exploring the behavior of your users, including transition graphs, step matrices, step Sankey diagrams, funnels, and cluster and cohort analysis. A brief demo of each is presented below. For more details, see :ref:`the user guides `.

.. _quick_start_transition_graph:

Transition graph
~~~~~~~~~~~~~~~~

The transition graph is an interactive tool that shows how many users jump from one event to another. It represents user paths as a Markov random walk model. The graph is interactive: you can drag the graph nodes, zoom in and out of the graph layout, or use the control panel on the left edge of the graph. The transition graph also allows you to highlight the most valuable nodes and hide noisy nodes and edges.

.. code-block:: python

    stream.transition_graph()

.. raw:: html

See the :doc:`Transition graph user guide <../user_guides/transition_graph>` for a deeper understanding of this tool.

.. _quick_start_step_matrix:

Step matrix
~~~~~~~~~~~

The step matrix provides a stepwise look at CJM. It shows the event distribution with respect to the step's ordinal number.

.. code-block:: python

    stream.step_matrix(
        max_steps=16,
        threshold=0.2,
        centered={
            'event': 'cart',
            'left_gap': 5,
            'occurrence': 1
        },
        targets=['payment_done']
    )

.. figure:: /_static/getting_started/quick_start/step_matrix.png
    :width: 900

The step matrix above is centered on the ``cart`` event. For example, it shows (see column ``-1``) that the events one step before ``cart`` in the user trajectories are distributed as follows: 60% of the users have the ``catalog`` event right before ``cart``, 24% have the ``product2`` event, and the remaining 16% are distributed among 5 events that are folded into an artificial ``THRESHOLDED_5`` event.

See the :doc:`Step matrix user guide <../user_guides/step_matrix>` for a deeper understanding of this tool.

Transition matrix
~~~~~~~~~~~~~~~~~

The transition matrix is similar to the transition graph, but it displays the edge weights as a table. It is especially useful for comparing two groups of users. For example, below we compare how the first user session differs from the other sessions. We create a ``session_count`` segment indicating the session number and then compare the ``1`` and ``_OUTER_`` segment values in the transition matrix. See the :doc:`Transition matrix user guide <../user_guides/transition_matrix>` and the :doc:`Segments user guide <../user_guides/segments_and_clusters>` for more details.

.. code-block:: python

    def session_count_segment(df):
        df['session_count'] = df['session_id'].str.split('_').str[1]
        return df['session_count']

    stream = stream.add_segment(session_count_segment, 'session_count')
    stream.transition_matrix(norm_type='node', groups=['session_count', '1', '_OUTER_'])

.. figure:: /_static/getting_started/quick_start/transition_matrix.png
    :width: 500

For example, we see from the diagram that in the second and later sessions users start their paths from the ``main`` event far more often: the difference in the ``session_start -> main`` transition is -0.46.

Step Sankey diagram
~~~~~~~~~~~~~~~~~~~

The step Sankey diagram is similar to the step matrix. It also shows the event distribution with respect to the step number. However, it has some more advanced features:

- it explicitly shows the user flow from one step to another; and
- it is interactive.

.. code-block:: python

    stream.step_sankey(max_steps=6, threshold=0.05)

.. raw:: html
See the :doc:`step Sankey user guide <../user_guides/step_sankey>` for a deeper understanding of this tool.

.. _quick_start_cluster_analysis:

Cluster analysis
~~~~~~~~~~~~~~~~

.. code-block:: python

    features = stream.extract_features(feature_type='count', ngram_range=(1, 1))
    stream = stream.get_clusters(
        method='kmeans',
        n_clusters=8,
        X=features,
        segment_name='kmeans_clusters'
    )

Users with similar behavior are grouped into clusters. Once the clusters are obtained, you can explore them through a heatmap table that highlights the most valuable features, or through custom metrics that make a cluster unique.

.. code-block:: python

    custom_metrics = [
        ('segment_size', 'mean', 'segment size'),
        ('has:cart', 'mean', 'Conversion rate: cart'),
        ('has:payment_done', 'mean', 'Conversion rate: payment_done'),
    ]

    stream.clusters_overview(
        'kmeans_clusters',
        features,
        aggfunc='mean',
        metrics=custom_metrics
    )

.. figure:: /_static/getting_started/quick_start/clusters.png
    :width: 500

From this overview we immediately see that cluster 7 has the highest feature usage and conversion rates of all the clusters, yet the smallest size (0.3% of all users). On the other hand, the very-low-activity clusters 0, 1, and 4 comprise 74% of all users, which might indicate a systematic problem in the product.

See :doc:`the clusters user guide <../user_guides/segments_and_clusters>` for a deeper understanding of this tool.

.. _quick_start_funnels:

Funnel analysis
~~~~~~~~~~~~~~~

Building a conversion funnel is a basic part of much analytical research. A funnel is a diagram that shows how many users sequentially walk through specific events (funnel stages) in their paths. For each stage event, the following values are calculated:

- the absolute number of unique users who reached this stage at least once;
- the conversion rate from the first stage (% of initial); and
- the conversion rate from the previous stage (% of previous).

.. code-block:: python

    stream.funnel(stages=['catalog', 'cart', 'payment_done'])

.. raw:: html

See the :doc:`Funnel user guide <../user_guides/funnel>` for a deeper understanding of this tool.

Cohort analysis
~~~~~~~~~~~~~~~

Cohort analysis is a powerful tool that shows trends in user behavior over time. It helps isolate the impact of different marketing activities, or of product changes, for different groups of users. Here is an outline of the *cohort matrix* calculation:

- users are split into groups (``CohortGroups``) depending on the time of their first appearance in the eventstream; and
- the retention rate of the active users is calculated for each period (``CohortPeriod``) of the observation.

.. code-block:: python

    stream.cohorts(
        cohort_start_unit='M',
        cohort_period=(1, 'M'),
        average=False,
    )

.. figure:: /_static/getting_started/quick_start/cohorts.png
    :width: 500
    :height: 500

See the :doc:`Cohorts user guide <../user_guides/cohorts>` for a deeper understanding of this tool.

Sequence analysis
~~~~~~~~~~~~~~~~~

The Sequences tool calculates frequency statistics for each particular n-gram represented in an eventstream. It supports group comparison.

.. code-block:: python

    stream.sequences(
        ngram_range=(2, 3),
        threshold=['count', 1200],
        sample_size=3
    )

.. figure:: /_static/getting_started/quick_start/sequences.png

See the :doc:`Sequences user guide <../user_guides/sequences>` for the details.
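As a closing illustration, the edge weights behind the transition graph and transition matrix tools are, at their core, counts of consecutive event pairs within each user path. The plain-pandas sketch below shows this idea on synthetic data; it is a simplified illustration, not the library's actual implementation:

.. code-block:: python

    import pandas as pd

    # Synthetic event log, already sorted by user and time.
    df = pd.DataFrame(
        {
            "user_id": [1, 1, 1, 2, 2, 2],
            "event": ["catalog", "cart", "payment_done",
                      "catalog", "cart", "catalog"],
        }
    )

    # The next event within each user's path (no pairs across users).
    df["next_event"] = df.groupby("user_id")["event"].shift(-1)

    # Count transitions, then normalize each row so it sums to 1 --
    # analogous in spirit to norm_type='node' in transition_matrix().
    counts = pd.crosstab(df["event"], df["next_event"])
    probs = counts.div(counts.sum(axis=1), axis=0)
    print(probs)

Here both users move ``catalog -> cart``, so that transition gets probability 1.0 from ``catalog``, while ``cart`` splits evenly between ``payment_done`` and a return to ``catalog``.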