Clusters Class

class retentioneering.tooling.clusters.clusters.Clusters(eventstream)

A class that holds methods for the cluster analysis.


See also


Call Clusters tool as an eventstream method.


See Clusters user guide for the details.

diff(cluster_id1, cluster_id2=None, top_n_events=8, weight_col=None, targets=None)

Plots a bar plot illustrating the distribution of top_n_events in cluster cluster_id1 compared with the entire dataset or the cluster cluster_id2 if specified. Should be used after fit() or set_clusters().

cluster_id1 : int or str

ID of the cluster to compare.

cluster_id2 : int or str, optional

ID of the second cluster to compare with the first cluster. If None, then compares with the entire dataset.

top_n_events : int, default 8

Number of top events.

weight_col : str, optional

If None, the distribution is compared by event occurrences in the datasets. If weight_col is specified, the percentages of users (the column name is given by weight_col) who have particular events are plotted.

targets : str or list of str, optional

Event names that are always included in the comparison, regardless of the top_n_events value. Target events appear in the order specified.


Plots the distribution bar chart.
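
The comparison described for diff() can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the toy data, the helper names event_shares and diff_table, ranking the top events by their share in the first cluster, and appending targets after the top events are all assumptions made for this example.

```python
from collections import Counter

# Toy data: user -> ordered list of event names (hypothetical, not from the library).
paths = {
    "u1": ["main", "catalog", "cart", "purchase"],
    "u2": ["main", "catalog", "catalog"],
    "u3": ["main", "cart"],
    "u4": ["main", "catalog", "cart"],
}
# Hypothetical user -> cluster assignment.
user_clusters = {"u1": 0, "u2": 1, "u3": 0, "u4": 1}


def event_shares(user_ids):
    """Share of each event among all event occurrences of the given users."""
    counts = Counter(event for uid in user_ids for event in paths[uid])
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def diff_table(cluster_id1, cluster_id2=None, top_n_events=8, targets=None):
    """Rows of (event, share in cluster_id1, share in the reference group)."""
    group1 = [u for u, c in user_clusters.items() if c == cluster_id1]
    # With no second cluster, the reference group is the entire dataset.
    if cluster_id2 is None:
        group2 = list(paths)
    else:
        group2 = [u for u, c in user_clusters.items() if c == cluster_id2]
    shares1, shares2 = event_shares(group1), event_shares(group2)
    # Assumption: top events are ranked by their share in the first cluster.
    top = sorted(shares1, key=shares1.get, reverse=True)[:top_n_events]
    # Targets are always included, in the order given.
    events = top + [t for t in (targets or []) if t not in top]
    return [(e, shares1.get(e, 0.0), shares2.get(e, 0.0)) for e in events]


rows = diff_table(0, top_n_events=2, targets=["purchase"])
for event, share_in_cluster, share_in_reference in rows:
    print(f"{event:10s} {share_in_cluster:.3f} {share_in_reference:.3f}")
```

Each row pairs an event with its share in the first cluster and in the reference group, which is the kind of data a side-by-side bar plot would display.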

extract_features(feature_type, ngram_range=None)

Calculate vectorized user paths.

feature_type : {“tfidf”, “count”, “frequency”, “binary”, “markov”, “time”, “time_fraction”}

Algorithms for converting text sequences to numerical vectors:

  • tfidf: see details in sklearn documentation.

  • count: see details in sklearn documentation.

  • frequency: similar to count, but normalized to the total number of the events in the user’s trajectory.

  • binary: 1 if a user had the given n-gram at least once and 0 otherwise.

  • markov: available for bigrams only. For a given bigram (A, B) the vectorized values are the user’s transition probabilities from A to B.

  • time: associated with unigrams only. The total number of the seconds spent from the beginning of a user’s path until the given event.

  • time_fraction: the same as time but divided by the total length of the user’s trajectory (in seconds).

ngram_range : Tuple(int, int)

The lower and upper bounds of the range of n-values for the n-grams to be extracted. For example, ngram_range=(1, 1) means only single events, (1, 2) means single events and bigrams. Ignored for the markov, time, and time_fraction feature types.
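
To make the count, frequency, and binary feature types and the effect of ngram_range concrete, here is a minimal plain-Python sketch for a single user path. The helper names ngrams and vectorize are hypothetical and the code does not reproduce the library's implementation (tfidf and count are delegated to sklearn according to the descriptions above); it only follows the definitions given on this page.

```python
from collections import Counter


def ngrams(path, ngram_range=(1, 1)):
    """All n-grams of the path for every n in the inclusive range."""
    low, high = ngram_range
    return [
        tuple(path[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(path) - n + 1)
    ]


def vectorize(path, feature_type, ngram_range=(1, 1)):
    """Map each n-gram of one user path to its feature value."""
    counts = Counter(ngrams(path, ngram_range))
    if feature_type == "count":
        return dict(counts)
    if feature_type == "frequency":
        # Normalized to the total number of events in the trajectory.
        return {gram: n / len(path) for gram, n in counts.items()}
    if feature_type == "binary":
        # 1 if the n-gram occurred at least once; absent n-grams count as 0.
        return {gram: 1 for gram in counts}
    raise ValueError(f"unsupported feature_type in this sketch: {feature_type}")


path = ["main", "catalog", "main", "cart"]
counts = vectorize(path, "count", ngram_range=(1, 2))
frequency = vectorize(path, "frequency")
binary = vectorize(path, "binary")
```

With ngram_range=(1, 2) the result holds both single events, such as ("main",), and bigrams, such as ("main", "catalog").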


A DataFrame with the vectorized values. Index contains user_ids, columns contain n-grams.
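
The markov feature type can be illustrated the same way. The sketch below assumes the usual reading of a transition probability, namely the number of A to B transitions divided by the number of all transitions leaving A; the helper name markov_features and the toy paths are hypothetical. The nested dictionary mimics the described layout, with user_ids as the index and n-grams as the columns.

```python
from collections import Counter


def markov_features(path):
    """Map each bigram (A, B) to the share of transitions leaving A that go to B."""
    transitions = Counter(zip(path, path[1:]))
    # Every event except the last one starts exactly one transition.
    outgoing = Counter(path[:-1])
    return {(a, b): n / outgoing[a] for (a, b), n in transitions.items()}


# Hypothetical user paths.
paths = {
    "u1": ["main", "catalog", "main", "cart", "main", "catalog"],
    "u2": ["main", "cart"],
}
# Rows are user_ids, columns are bigrams.
features = {user_id: markov_features(path) for user_id, path in paths.items()}
```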


filter_cluster(cluster_id)

Truncate the eventstream, leaving the trajectories of the users who belong to the selected cluster. Should be used after fit() or set_clusters().

cluster_id : int or str

Cluster identifier to be selected. If create_clusters() was used for cluster generation, then 0, 1, … values are possible.


Eventstream with the users belonging to the selected cluster only.
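
Conceptually, the truncation keeps only the rows of users assigned to the chosen cluster. A minimal sketch on a toy list of rows, assuming a hypothetical helper keep_cluster_rows and made-up data (the real method operates on an eventstream object, not a list):

```python
# Toy eventstream as (user_id, event, timestamp) rows; all values are hypothetical.
events = [
    ("u1", "main", "2023-01-01 10:00:00"),
    ("u2", "main", "2023-01-01 10:01:00"),
    ("u1", "cart", "2023-01-01 10:02:00"),
    ("u3", "catalog", "2023-01-01 10:03:00"),
    ("u2", "catalog", "2023-01-01 10:04:00"),
]
# Hypothetical user -> cluster assignment.
user_clusters = {"u1": 0, "u2": 1, "u3": 0}


def keep_cluster_rows(events, user_clusters, cluster_id):
    """Keep only the rows of users that belong to the given cluster."""
    members = {u for u, c in user_clusters.items() if c == cluster_id}
    return [row for row in events if row[0] in members]


selected = keep_cluster_rows(events, user_clusters, cluster_id=0)
```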

fit(method, n_clusters, X, random_state=None)

Prepare features and compute clusters for the input eventstream data.

method : {“kmeans”, “gmm”}

Clustering algorithm to apply: kmeans for K-means, gmm for a Gaussian mixture model.

n_clusters : int

The expected number of clusters to be passed to a clustering algorithm.

X : pd.DataFrame

pd.DataFrame representing a custom vectorization of the user paths. The index corresponds to user_ids, the columns are vectorized values of the path. See extract_features().

random_state : int, optional

Use an int to make the randomness deterministic. Calling fit multiple times with the same random_state leads to the same clustering results.


A fitted Clusters instance.
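
The role of random_state can be shown with a deliberately tiny K-means written in plain Python. This is a sketch under stated assumptions, not what fit() runs internally: the function kmeans, its n_iter parameter, and the toy feature vectors are hypothetical, and the input is a plain dictionary rather than a pd.DataFrame.

```python
import random


def kmeans(X, n_clusters, random_state=None, n_iter=20):
    """Tiny K-means over a {user_id: feature vector} mapping."""
    rng = random.Random(random_state)
    users = sorted(X)
    # Initial centroids are randomly chosen feature vectors.
    centroids = [list(X[u]) for u in rng.sample(users, n_clusters)]
    labels = {}
    for _ in range(n_iter):
        # Assignment step: each user goes to the nearest centroid.
        for u in users:
            dists = [
                sum((a - b) ** 2 for a, b in zip(X[u], c)) for c in centroids
            ]
            labels[u] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its members.
        for k in range(n_clusters):
            members = [X[u] for u in users if labels[u] == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical vectorized paths: user_id -> feature values.
X = {
    "u1": [5.0, 0.0], "u2": [4.0, 1.0],
    "u3": [0.0, 6.0], "u4": [1.0, 5.0],
}
labels_a = kmeans(X, n_clusters=2, random_state=42)
labels_b = kmeans(X, n_clusters=2, random_state=42)
```

Because the only source of randomness is the seeded generator, both calls return identical labels, which mirrors the determinism described for random_state.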


plot(targets=None)

Plot a bar plot illustrating the cluster sizes and the conversion rates of the target events within the clusters. Should be used after fit() or set_clusters().

targets : str or list of str, optional

List of the target events.
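
The quantities behind such a plot can be sketched as follows. The assumptions here are that a cluster's size is expressed as its share of all users and that a user counts as converted to a target if the event occurs anywhere in the path; the helper cluster_summary and the data are hypothetical.

```python
# Toy data: user -> ordered list of event names (hypothetical).
paths = {
    "u1": ["main", "cart", "purchase"],
    "u2": ["main", "catalog"],
    "u3": ["main", "cart"],
    "u4": ["main"],
    "u5": ["main", "cart", "purchase"],
}
# Hypothetical user -> cluster assignment.
user_clusters = {"u1": 0, "u2": 1, "u3": 0, "u4": 1, "u5": 0}


def cluster_summary(targets):
    """Per cluster: relative size and conversion rate for each target event."""
    targets = [targets] if isinstance(targets, str) else list(targets)
    summary = {}
    for cluster_id in sorted(set(user_clusters.values())):
        members = [u for u, c in user_clusters.items() if c == cluster_id]
        row = {"size": len(members) / len(user_clusters)}
        for target in targets:
            # A user is converted if the target occurs anywhere in the path.
            converted = sum(target in paths[u] for u in members)
            row[target] = converted / len(members)
        summary[cluster_id] = row
    return summary


summary = cluster_summary(["cart", "purchase"])
```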

projection(method='tsne', targets=None, color_type='clusters', **kwargs)

Show the clusters’ projection on a plane, applying dimension reduction techniques. Should be used after fit() or set_clusters().

method : {‘umap’, ‘tsne’}, default ‘tsne’

Type of manifold transformation.

color_type : {‘targets’, ‘clusters’}, default ‘clusters’

Type of color-coding used for projection visualization:

  • clusters: colors trajectories by cluster number.

  • targets: colors trajectories by whether the user reached any event listed in the targets parameter. The targets parameter is required in this case.

targets : str or list of str, optional

List of event names. If a user reaches any of the specified events, the dot corresponding to this user is highlighted as converted on the resulting projection plot.


**kwargs : optional

Parameters for sklearn.manifold.TSNE() and umap.UMAP().


Plot in the low-dimensional space for user trajectories indexed by user IDs.


set_clusters(user_clusters)

Set custom user-cluster mapping.

user_clusters : pd.Series

Series index corresponds to user_ids. Values are cluster_ids. The values must be integers. For example, with 3 clusters the possible cluster_ids are 0, 1, 2.


A fitted Clusters instance.

property cluster_mapping

Return the previously calculated cluster_id -> list[user_ids] mapping.


The keys are cluster_ids, and the values are the lists of the user_ids related to the corresponding cluster.
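
The relation between the user -> cluster assignment and this mapping is a simple inversion. A minimal sketch with a plain dictionary standing in for the pd.Series; the helper to_cluster_mapping and the sample assignment are hypothetical:

```python
from collections import defaultdict

# Hypothetical user_id -> cluster_id assignment with integer cluster_ids 0, 1, 2.
user_clusters = {"u1": 0, "u2": 1, "u3": 0, "u4": 2}


def to_cluster_mapping(user_clusters):
    """Invert user_id -> cluster_id into cluster_id -> list of user_ids."""
    mapping = defaultdict(list)
    for user_id, cluster_id in user_clusters.items():
        mapping[cluster_id].append(user_id)
    return dict(mapping)


cluster_mapping = to_cluster_mapping(user_clusters)
```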

property params

Returns the parameters used for the last fitting.

property user_clusters

user_id -> cluster_id mapping represented as a pd.Series. The index corresponds to user_ids, and the values are the corresponding cluster_ids.


property Eventstream.clusters

A blank (not fitted) instance of the Clusters class to be used for cluster analysis.

See also