retentioneering.core package
step_matrix
retentioneering.core.core_functions.step_matrix.step_matrix(self, *, max_steps=20, weight_col=None, precision=2, targets=None, accumulated=None, sorting=None, thresh=0, centered=None, groups=None, show_plot=True)
Plots a heatmap of the distribution of users over trajectory steps, ordered by event name. Matrix rows are event names, columns are aligned user trajectory step numbers, and the values are shares of users. An entry X in the row for event j and column i means that a fraction X of users have event j at their i-th step.
- max_steps: int (optional, default 20)
Maximum number of steps in trajectory to include.
- weight_col: str (optional, default None)
Aggregation column for edge weighting. If None, specified index_col from retentioneering.config will be used as column name. For example, can be specified as session_id if dataframe has such column.
- precision: int (optional, default 2)
Number of digits after the decimal point to show for fractions in the heatmap.
- thresh: float (optional, default 0)
Used to remove rare events. Aggregates all rows where all values are less than the specified threshold.
- targets: list (optional, default None)
List of event names (as str) to include at the bottom of step_matrix as individual rows. Each specified target has a separate color-coding scale for clear visualization. Example: ['product_page', 'cart', 'payment']. If multiple targets need to be compared and plotted on the same color-coding scale, combine them in a sub-list. Example: ['product_page', ['cart', 'payment']]
- accumulated: string (optional, default None)
Option to include accumulated values for targets. Valid values are None (do not show accumulated targets), 'both' (show step values and accumulated values), 'only' (show targets only as accumulated).
- centered: dict (optional, default None)
Aligns user trajectories so that a specific event falls on a specific step. Must contain three keys:
'event': str, name of the event to align on.
'left_gap': int, number of events to include before the specified event.
'occurrence': int, which occurrence of the event to align on (typically 1).
When this parameter is not None, only users whose trajectories contain the specified occurrence of the selected event are included. The fraction of remaining users is shown in the title of the centered step_matrix. Example: {'event': 'cart', 'left_gap': 8, 'occurrence': 1}
- sorting: list (optional, default None)
List of event names (as str) used to plot step_matrix with a specified ordering of events. If None, rows are ordered by the position of their maximum value (first the row whose 1st element is its maximum, then the row whose 2nd element is its maximum, etc.).
- groups: tuple (optional, default None)
Can be specified to plot a differential step_matrix. Must be a tuple of two elements (g_1, g_2), where g_1 and g_2 are collections of user IDs (list, tuple or set). Two separate step matrices M1 and M2 are calculated for users from g_1 and g_2, respectively. The resulting matrix is M = M1 - M2. Note that the values in each column of a differential step matrix sum to 0 (since columns in both M1 and M2 always sum to 1).
- show_plot: bool (optional, default True)
Whether to show the resulting heatmap.
DataFrame with max_steps columns and at most len(event_col.unique) rows, or fewer if thresh > 0.
pd.DataFrame
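The computation described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the toy trajectories, the helper names `step_matrix_sketch` and `differential_sketch`, and the `ENDED` padding label for users with short trajectories are all assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: user_id -> ordered list of event names.
trajectories = {
    "u1": ["main", "catalog", "cart"],
    "u2": ["main", "cart"],
    "u3": ["main", "catalog", "catalog"],
    "u4": ["catalog", "main", "cart"],
}


def step_matrix_sketch(trajectories, max_steps=3, users=None):
    """Return {event: [share of users at step 1, ..., step max_steps]}."""
    selected = [t for u, t in trajectories.items() if users is None or u in users]
    counts = defaultdict(Counter)  # step index -> Counter of event names
    for events in selected:
        # Pad short trajectories so that every column still sums to 1
        # (the "ENDED" label is an assumption of this sketch).
        padded = (events + ["ENDED"] * max_steps)[:max_steps]
        for step, event in enumerate(padded):
            counts[step][event] += 1
    all_events = sorted({e for c in counts.values() for e in c})
    n = len(selected)
    return {e: [counts[s][e] / n for s in range(max_steps)] for e in all_events}


def differential_sketch(trajectories, g_1, g_2, max_steps=3):
    """Element-wise difference M1 - M2 of the step matrices of two user groups."""
    m1 = step_matrix_sketch(trajectories, max_steps, users=set(g_1))
    m2 = step_matrix_sketch(trajectories, max_steps, users=set(g_2))
    zero = [0.0] * max_steps
    return {
        e: [a - b for a, b in zip(m1.get(e, zero), m2.get(e, zero))]
        for e in sorted(set(m1) | set(m2))
    }


matrix = step_matrix_sketch(trajectories)
diff = differential_sketch(trajectories, ["u1", "u2"], ["u3", "u4"])
print(matrix["main"])  # shares of users with event "main" at steps 1..3
print(diff["main"])    # difference of those shares between the two groups
```

Each column of `matrix` sums to 1 and each column of `diff` sums to 0, which matches the property stated for the groups parameter.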
plot_graph
retentioneering.core.core_functions.plot_graph.plot_graph(self, *, targets={}, weight_col=None, norm_type='full', layout_dump=None, width=800, height=500, thresh=0)
Creates an interactive graph visualization. Each node is a unique event_col value, edges are transitions between events, and edge weights are calculated metrics. By default, the weight is the percentage of unique users who have passed through a particular edge, visualized as edge thickness. Node sizes are ... A graph loop is a transition to the same node, which may happen if users encountered multiple errors or performed an action at least twice. Graph nodes can be moved on the canvas, which helps to visualize user trajectories, but placing all the nodes so that they tell a story is a cumbersome process.
That is why the IFrame object also has a download button. Pressing it downloads a JSON configuration file with all the node parameters: node names, their positions, relative sizes and types. This file can be passed as the layout_dump parameter to configure the layout. Finally, the show weights toggle shows and hides edge weights.
- norm_type: str (optional, default ‘full’)
Type of normalization used to calculate weights for graph edges. Possible values are:
None
‘full’
‘node’
- weight_col: str (optional, default None)
Aggregation column for edge weighting. If None, number of events will be calculated. For example, can be specified as client_id or session_id if dataframe has such columns.
- targets: dict (optional, default {})
Event mapping describing which nodes or edges should be highlighted in different colors for better visualization. Dictionary keys are event_col values; dictionary values specify the highlight for each. Example: {'lost': 'red', 'purchased': 'green', 'main': 'source'}
- thresh: float (optional, default 0)
Minimal edge weight value to be rendered on the graph. If a node has no edges with weight >= thresh, it is not shown on the graph. This filters out rare events so that they do not clutter the visualization. Nodes specified in the targets parameter are always shown regardless of the selected threshold.
- layout_dump: str (optional, default None)
Path to layout configuration file relative to current directory. If defined, uses configuration file as a graph layout.
- width: int (optional, default 800)
Width of plot in pixels.
- height: int (optional, default 500)
Height of plot in pixels.
Plots IFrame graph of width and height size. Saves webpage with JS graph visualization to retention_config.experiments_folder.
Renders IFrame object and saves graph visualization as HTML in experiments_folder of retention_config.
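The interaction between thresh and targets described above can be sketched as follows. The helper `visible_nodes` and the toy edge weights are hypothetical and only illustrate the filtering rule; they are not part of the library.

```python
def visible_nodes(edges, thresh=0.0, targets=()):
    """Return the set of nodes to render.

    edges: {(source, target): weight}
    targets: iterable of node names that are always kept
             (iterating a dict yields its keys).
    """
    shown = set(targets)
    for (source, target), weight in edges.items():
        if weight >= thresh:
            shown.update((source, target))
    return shown


# Hypothetical edge weights (shares of users per transition).
edges = {
    ("main", "catalog"): 0.40,
    ("catalog", "cart"): 0.15,
    ("cart", "lost"): 0.005,
    ("main", "help"): 0.002,
}

nodes = visible_nodes(edges, thresh=0.01, targets={"lost": "red"})
print(sorted(nodes))
```

Here `help` is dropped because its only edge is below the threshold, while `lost` is kept because it is listed in targets.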
get_clusters
retentioneering.core.core_functions.get_clusters.cluster_event_dist(self, cl1, cl2=None, *, top_n=8, weight_col=None, targets=[])
Plots the distribution of the top_n events in cluster cl1 compared with the entire dataset or with cluster cl2.
- cl1: int
ID of the cluster to compare.
- cl2: int, (optional, default None)
ID of the second cluster to compare with top events from first cluster. If None, then compares with entire dataset.
- top_n: int, (optional, default 8)
Number of top events.
- weight_col: str (optional, default None)
If None, the distribution is compared based on event occurrences in the datasets. If weight_col is specified, the percentages of users (identified by the column given in weight_col) who have particular events are plotted.
- targets: list of str (optional, default [])
List of event names to always include in the comparison regardless of the top_n value. Target events appear in the order specified.
Plots distribution barchart
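As a rough illustration of what is being compared, the sketch below computes event shares for one cluster and for a reference set (the entire dataset or a second cluster), keeps the top_n events of the first cluster, and appends the targets in the given order. The helper names and the toy event lists are assumptions; the sketch covers only the weight_col=None case (shares of event occurrences).

```python
from collections import Counter


def event_shares(events):
    """Share of each event among all event occurrences."""
    counts = Counter(events)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}


def top_event_comparison(cluster_events, reference_events, top_n=8, targets=()):
    """Rows of (event, share in cluster, share in reference)."""
    cluster = event_shares(cluster_events)
    reference = event_shares(reference_events)
    top = sorted(cluster, key=cluster.get, reverse=True)[:top_n]
    # Targets are always included, in the order they were specified.
    names = top + [t for t in targets if t not in top]
    return [(e, cluster.get(e, 0.0), reference.get(e, 0.0)) for e in names]


# Hypothetical data: events of users in one cluster vs. the entire dataset.
cluster_1 = ["main", "main", "cart", "cart", "cart", "payment"]
dataset = cluster_1 + ["main", "catalog", "catalog", "lost"]

rows = top_event_comparison(cluster_1, dataset, top_n=2, targets=["lost"])
for event, in_cluster, in_dataset in rows:
    print(f"{event}: {in_cluster:.2f} vs {in_dataset:.2f}")
```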
retentioneering.core.core_functions.get_clusters.filter_cluster(self, cluster_name)
Filters the dataset against one or several clusters.
- cluster_name: int or list
Cluster ID or list of cluster IDs for filtering.
Filtered dataset as a pandas DataFrame
pd.DataFrame
retentioneering.core.core_functions.get_clusters.get_clusters(self, *, feature_type='tfidf', ngram_range=(1, 1), n_clusters=8, method='kmeans', plot_type=None, refit_cluster=True, targets=None, **kwargs)
Clusters users in the dataset according to their behavior.
- feature_type: str (optional, default ‘tfidf’)
Type of vectorizer to use to convert sequences of events to numerical vectors. Currently supports: {'tfidf', 'count', 'frequency', 'binary'}.
- ngram_range: tuple (optional, default (1, 1))
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- n_clusters: int (optional, default 8)
Number of clusters to be identified.
- method: str (optional, default ‘kmeans’)
Clustering method to use. Currently supports: ‘kmeans’ and ‘gmm’.
- plot_type: str (optional, default None)
Type of cluster statistics overview graph to plot after clustering. Currently supports: ‘cluster_bar’
- targets: list (optional, default None)
List of target events. Only applies if plot_type = 'cluster_bar'.
- refit_cluster: bool (optional, default True)
If False, cached results of the previous clustering are used (from the .cluster_mapping attribute). If True, clustering is recalculated.
Array of clusters as .cluster_mapping attribute
np.array
project
retentioneering.core.core_functions.project.project(self, *, method='tsne', targets=(), ngram_range=(1, 1), feature_type='tfidf', plot_type=None, **kwargs)
Performs dimensionality reduction of user trajectories and draws the projection plane.
- method: {‘umap’, ‘tsne’} (optional, default ‘tsne’)
Type of manifold transformation.
- plot_type: {'targets', 'clusters', None} (optional, default None)
Type of color-coding used for projection visualization:
'clusters': colors trajectories with different colors depending on cluster number. IMPORTANT: .rete.get_clusters() must be called beforehand to obtain the cluster mapping.
'targets': colors trajectories based on whether they reach any event provided in the 'targets' parameter. The 'targets' parameter must be provided in this case.
If None, only calculates the projection without visualization.
- targets: list or tuple of str (optional, default ())
Vector of event names as str. If a user reaches any of the specified events, the dot corresponding to this user is highlighted as converted on the resulting projection plot.
- feature_type: str, (optional, default ‘tfidf’)
Type of vectorizer to use before dimension-reduction. Available vectorization methods: {‘tfidf’, ‘count’, ‘binary’, ‘frequency’}
- ngram_range: tuple, (optional, default (1,1))
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted before dimension-reduction. For example ngram_range=(1, 1) means only single events, (1, 2) means single events and bigrams.
Dataframe with data in the low-dimensional space for user trajectories indexed by user IDs.
pd.DataFrame
extract_features
retentioneering.core.core_functions.extract_features.extract_features(self, *, feature_type='tfidf', ngram_range=(1, 1))
User trajectories vectorizer.
- feature_type: str, (optional, default ‘tfidf’)
Type of vectorizer. Available vectorization methods: {‘tfidf’, ‘count’, ‘binary’, ‘frequency’}
- ngram_range: tuple, (optional, default (1,1))
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. For example ngram_range=(1, 1) means only single events, (1, 2) means single events and bigrams.
Encoded user trajectories
pd.DataFrame of (number of users, number of unique events | event n-grams)
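To make ngram_range concrete, the following sketch extracts event n-grams from toy trajectories and encodes each user as a vector. The helpers `ngrams` and `vectorize` are hypothetical and cover only the 'count' and 'binary' encodings, whose meaning is unambiguous; 'tfidf' and 'frequency' are left out, and the way n-grams are named here (events joined by a space) is an assumption.

```python
from collections import Counter


def ngrams(events, ngram_range=(1, 1)):
    """All event n-grams with min_n <= n <= max_n, joined by a space."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        grams += [" ".join(events[i:i + n]) for i in range(len(events) - n + 1)]
    return grams


def vectorize(trajectories, ngram_range=(1, 1), feature_type="count"):
    """Return (vocabulary, {user_id: feature vector})."""
    per_user = {u: Counter(ngrams(ev, ngram_range)) for u, ev in trajectories.items()}
    vocab = sorted({g for counts in per_user.values() for g in counts})
    rows = {}
    for user, counts in per_user.items():
        if feature_type == "count":
            rows[user] = [counts[g] for g in vocab]
        elif feature_type == "binary":
            rows[user] = [int(counts[g] > 0) for g in vocab]
        else:
            raise ValueError("this sketch supports only 'count' and 'binary'")
    return vocab, rows


# Hypothetical toy data: user_id -> ordered list of event names.
trajectories = {
    "u1": ["main", "cart", "main"],
    "u2": ["main", "catalog"],
}

vocab, counts = vectorize(trajectories, ngram_range=(1, 2), feature_type="count")
_, binary = vectorize(trajectories, ngram_range=(1, 2), feature_type="binary")
print(vocab)
print(counts["u1"], binary["u1"])
```

The result has one row per user and one column per unique event or event n-gram, which mirrors the shape given for the returned DataFrame.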
compare
retentioneering.core.core_functions.compare.compare(self, *, groups, function, test, group_names=('group_1', 'group_2'), alpha=0.05)
Tests a selected metric between two groups of users.
- groups: tuple
Must be a tuple of two elements (g_1, g_2), where g_1 and g_2 are collections of user IDs (list, tuple or set).
- function: function(x) -> number
Selected metric. Must be a function which takes the dataset for a single user trajectory as its argument and returns a single numerical value.
- group_names: tuple (optional, default: (‘group_1’, ‘group_2’))
Names for selected groups g_1 and g_2.
- test: {‘mannwhitneyu’, ‘ttest’, ‘ks_2samp’}
Tests the null hypothesis that 2 independent samples are drawn from the same distribution. One-sided tests are used, meaning that distributions are compared as 'less' or 'greater'. Rule of thumb: for discrete variables (like conversions or number of purchases) use the Mann-Whitney test ('mannwhitneyu') or the t-test ('ttest').
For continuous variables (like average_check) use the Kolmogorov-Smirnov test ('ks_2samp').
- alpha: float (optional, default 0.05)
Selected level of significance.
Prints statistical comparison between two groups over selected metric and test
Plots a distribution for selected metrics for two groups
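The sketch below illustrates the data flow implied by the parameters: the metric function is applied to each user's trajectory to build one sample per group, and the two samples are then compared. Everything here is hypothetical: trajectories are plain lists rather than DataFrames, and only the Mann-Whitney U statistic is computed, by direct pair counting. A p-value would in practice come from a statistics library such as SciPy.

```python
def metric_samples(trajectories, groups, function):
    """Apply the metric to every user's trajectory, separately per group."""
    g_1, g_2 = groups
    return (
        [function(trajectories[u]) for u in g_1],
        [function(trajectories[u]) for u in g_2],
    )


def mann_whitney_u(xs, ys):
    """U statistic of xs: number of (x, y) pairs with x > y, ties count 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical toy data: user_id -> ordered list of event names.
trajectories = {
    "u1": ["main", "purchase", "purchase"],
    "u2": ["main", "purchase"],
    "u3": ["main"],
    "u4": ["main", "cart", "purchase"],
}


def number_of_purchases(events):
    """Example metric: a single number per user trajectory."""
    return events.count("purchase")


sample_1, sample_2 = metric_samples(
    trajectories, (["u1", "u2"], ["u3", "u4"]), number_of_purchases
)
u_stat = mann_whitney_u(sample_1, sample_2)
print(sample_1, sample_2, u_stat)
```

A U statistic close to len(sample_1) * len(sample_2) suggests that values in the first group tend to be greater than those in the second.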
funnel
retentioneering.core.core_functions.funnel.funnel(self, *, targets, groups=None, group_names=None)
Plots a simple conversion funnel with stages as specified in the targets parameter.
- targets: list of str
List of events used as stages for the funnel. The absolute and relative numbers of users who reached the specified events at least once are plotted. Multiple events can be grouped together as a single stage by combining them in a sub-list.
- groups: list of collections (optional, default None)
List of user_id collections. A funnel is plotted for each user_id collection. If None, all users from the dataset are plotted.
- group_names: list of strings (optional, default None)
Names for the specified user groups to place in the legend. If specified, len(group_names) must be equal to len(groups).
Funnel plot
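The counting rule described for targets can be sketched as follows. The helper `funnel_counts` and the toy trajectories are hypothetical, and expressing the relative number as a share of all users in the group is an assumption of this sketch.

```python
def funnel_counts(trajectories, targets, users=None):
    """Rows of (stage, users who reached it at least once, share of the group)."""
    selected = [set(ev) for u, ev in trajectories.items() if users is None or u in users]
    rows = []
    for stage in targets:
        # A sub-list groups several events into one stage.
        stage_events = set(stage) if isinstance(stage, (list, tuple)) else {stage}
        reached = sum(1 for seen in selected if seen & stage_events)
        rows.append((stage, reached, reached / len(selected)))
    return rows


# Hypothetical toy data: user_id -> ordered list of event names.
trajectories = {
    "u1": ["main", "catalog", "cart", "payment"],
    "u2": ["main", "catalog"],
    "u3": ["main", "product_page", "cart"],
    "u4": ["catalog"],
}
targets = ["main", ["catalog", "product_page"], "cart", "payment"]

all_users = funnel_counts(trajectories, targets)
group = funnel_counts(trajectories, targets, users={"u1", "u2"})
print([reached for _, reached, _ in all_users])
print([reached for _, reached, _ in group])
```

Because a stage counts every user who reached its events at least once, regardless of order, a later stage can have more users than an earlier one, as the second stage does here.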
get_edgelist
retentioneering.core.core_functions.get_edgelist.get_edgelist(self, *, weight_col=None, norm_type=None, edge_attributes='edge_weight')
Creates a weighted table of the transitions between events.
- weight_col: str (optional, default=None)
Aggregation column for transition weighting. To calculate weights as the number of transition events, use None. To calculate the number of unique users who passed through a given transition, use 'user_id'.
For any other aggregation, such as number of sessions, pass the column name.
- norm_type: {None, 'full', 'node'} (optional, default=None)
Type of normalization. If None, returns the raw number of transitions or the other selected aggregation. 'full': normalized over the entire dataset. 'node': weight for edge A --> B normalized over users in A.
- edge_attributes: str (optional, default 'edge_weight')
Name for the edge weight column.
DataFrame with one row per transition with non-zero weight.
pd.DataFrame
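The three norm_type options can be illustrated with a small pure-Python sketch for the weight_col=None case (counting transition events). The helper `edgelist_sketch` is hypothetical, and the denominators are assumptions about what the normalizations mean: the total number of transitions for 'full', and the number of transitions leaving the source node for 'node'.

```python
from collections import Counter


def edgelist_sketch(trajectories, norm_type=None):
    """Return {(source, target): weight} for consecutive event pairs."""
    counts = Counter()
    for events in trajectories.values():
        counts.update(zip(events, events[1:]))  # consecutive event pairs
    if norm_type is None:
        return dict(counts)
    if norm_type == "full":
        total = sum(counts.values())  # assumed denominator: all transitions
        return {edge: c / total for edge, c in counts.items()}
    if norm_type == "node":
        outgoing = Counter()  # assumed denominator: transitions leaving the source
        for (source, _), c in counts.items():
            outgoing[source] += c
        return {edge: c / outgoing[edge[0]] for edge, c in counts.items()}
    raise ValueError("norm_type must be None, 'full' or 'node'")


# Hypothetical toy data: user_id -> ordered list of event names.
trajectories = {
    "u1": ["main", "catalog", "cart"],
    "u2": ["main", "cart"],
    "u3": ["main", "catalog", "main"],
}

raw = edgelist_sketch(trajectories)
full = edgelist_sketch(trajectories, norm_type="full")
node = edgelist_sketch(trajectories, norm_type="node")
print(raw[("main", "catalog")], full[("main", "catalog")], node[("main", "catalog")])
```

Only transitions that actually occur appear in the result, which corresponds to the returned table containing rows with non-zero weight only.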
get_adjacency
retentioneering.core.core_functions.get_adjacency.get_adjacency(self, *, weight_col=None, norm_type=None)
Creates the edge graph in matrix format. Row indices are event_col values from which the transition occurred, and columns are events to which the transition occurred. The values are edge weights defined by the weight_col and norm_type parameters.
- weight_col: str (optional, default=None)
Aggregation column for transition weighting. To calculate weights as the number of transition events, use None. To calculate the number of unique users who passed through a given transition, use 'user_id'.
For any other aggregation, such as number of sessions, pass the column name.
- norm_type: {None, 'full', 'node'} (optional, default=None)
Type of normalization. If None, returns the raw number of transitions or the other selected aggregation. 'full': normalized over the entire dataset. 'node': weight for edge A --> B normalized over users in A.
Dataframe with number of columns and rows equal to unique number of event_col values.
pd.DataFrame
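The relationship between an edge list and the matrix format can be sketched as below. The helper `adjacency_sketch` and the toy edge weights are hypothetical; the sketch only shows the layout (source events as rows, destination events as columns, zero where no transition exists).

```python
def adjacency_sketch(edge_weights):
    """Return (nodes, matrix) where matrix[source][target] is the edge weight."""
    nodes = sorted({node for edge in edge_weights for node in edge})
    matrix = {src: {dst: 0 for dst in nodes} for src in nodes}
    for (src, dst), weight in edge_weights.items():
        matrix[src][dst] = weight
    return nodes, matrix


# Hypothetical edge weights: number of transition events per edge.
edge_weights = {
    ("main", "catalog"): 2,
    ("catalog", "cart"): 1,
    ("main", "cart"): 1,
    ("catalog", "main"): 1,
}

nodes, matrix = adjacency_sketch(edge_weights)
print(nodes)
for src in nodes:
    print(src, [matrix[src][dst] for dst in nodes])
```

The matrix is square, with as many rows and columns as there are unique events.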