Clusters

Clusters Class

class retentioneering.tooling.clusters.clusters.Clusters(eventstream)

A class that holds methods for the cluster analysis.

Parameters:

eventstream (EventstreamType)

See also

Eventstream.clusters: Call Clusters tool as an eventstream method.

Notes

See Clusters user guide for the details.

diff(cluster_id1, cluster_id2=None, top_n_events=8, weight_col=None, targets=None)

Plot a bar chart comparing the distribution of the top top_n_events events in cluster cluster_id1 with the entire dataset, or with cluster cluster_id2 if specified. Should be used after fit() or set_clusters().

Parameters:

cluster_id1 (int or str): ID of the cluster to compare.
cluster_id2 (int or str, optional): ID of the second cluster to compare with the first cluster. If None, the first cluster is compared with the entire dataset.
top_n_events (int, default 8): Number of top events to plot.
weight_col (str, optional): If None, the distributions are compared by event occurrences. If specified, the plot shows the percentage of users (identified by the weight_col column) who have each event.
targets (str or list of str, optional): Event names that are always included in the comparison, regardless of the top_n_events value. Target events appear in the order specified.

Returns:

matplotlib.axes.Axes: The distribution bar chart.
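The comparison behind diff() can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the toy data, the helper names event_shares and compare, and the choice to rank events by their share inside the cluster are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data: (user_id, event) pairs and a user -> cluster mapping.
events = [
    ("u1", "main"), ("u1", "catalog"), ("u1", "cart"),
    ("u2", "main"), ("u2", "catalog"),
    ("u3", "main"), ("u3", "cart"), ("u3", "purchase"),
]
user_clusters = {"u1": 0, "u2": 1, "u3": 0}


def event_shares(pairs):
    """Share of each event among all event occurrences in `pairs`."""
    counts = Counter(event for _, event in pairs)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def compare(cluster_id, top_n_events=2):
    """Return (event, share in cluster, share in whole dataset) rows."""
    in_cluster = [(u, e) for u, e in events if user_clusters[u] == cluster_id]
    cluster_shares = event_shares(in_cluster)
    overall_shares = event_shares(events)
    # Assumption: "top" events are those with the largest share in the cluster;
    # ties are broken alphabetically to keep the result deterministic.
    top = sorted(cluster_shares, key=lambda e: (-cluster_shares[e], e))[:top_n_events]
    return [(e, cluster_shares[e], overall_shares.get(e, 0.0)) for e in top]


rows = compare(cluster_id=0)
for event, in_cluster, overall in rows:
    print(f"{event}: cluster={in_cluster:.2f} overall={overall:.2f}")
```

Each row pairs an event's share inside the cluster with its share in the whole dataset, which is the kind of side-by-side comparison the bar chart shows.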

extract_features(feature_type, ngram_range=None)

Calculate vectorized user paths.

Parameters:

feature_type ({“tfidf”, “count”, “frequency”, “binary”, “markov”, “time”, “time_fraction”})

Algorithms for converting text sequences to numerical vectors:

tfidf: see details in sklearn documentation.
count: see details in sklearn documentation.
frequency: similar to count, but normalized by the total number of events in the user’s trajectory.
binary: 1 if a user had the given n-gram at least once and 0 otherwise.
markov: available for bigrams only. For a given bigram (A, B), the vectorized values are the user’s transition probabilities from A to B.
time: associated with unigrams only. The total number of seconds spent from the beginning of a user’s path until the given event.
time_fraction: the same as time, but divided by the total length of the user’s trajectory (in seconds).
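As a rough illustration of the count, frequency, and binary feature types for unigrams, the following sketch vectorizes two toy paths in plain Python. The data and the helper vectorize are hypothetical and are not part of the library, which returns a pd.DataFrame rather than lists.

```python
from collections import Counter

# Hypothetical toy paths: user_id -> ordered list of event names.
paths = {
    "u1": ["main", "catalog", "catalog", "cart"],
    "u2": ["main", "cart"],
}
# Columns of the resulting vectors, in alphabetical order.
vocabulary = sorted({event for path in paths.values() for event in path})


def vectorize(path, feature_type):
    """Vectorize one path over `vocabulary` (unigrams only)."""
    counts = Counter(path)
    if feature_type == "count":
        return [counts[e] for e in vocabulary]
    if feature_type == "frequency":
        # Counts normalized by the number of events in the path.
        return [counts[e] / len(path) for e in vocabulary]
    if feature_type == "binary":
        return [1 if counts[e] > 0 else 0 for e in vocabulary]
    raise ValueError(f"unsupported feature_type: {feature_type}")


features = {
    ft: {user: vectorize(path, ft) for user, path in paths.items()}
    for ft in ("count", "frequency", "binary")
}
print(vocabulary)
print(features["frequency"]["u1"])
```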

ngram_range (Tuple(int, int))

The lower and upper boundaries of the range of n-values for the event n-grams to be extracted. For example, ngram_range=(1, 1) means only single events, and (1, 2) means single events and bigrams. Ignored for the markov, time, and time_fraction feature types.
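To make the meaning of ngram_range concrete, here is a small sketch that lists the n-grams of a single path for a given range. The function name ngrams and the sample path are illustrative assumptions; the library itself delegates this work to its vectorizers.

```python
def ngrams(path, ngram_range):
    """List all n-grams of `path` for every n in the inclusive range."""
    lower, upper = ngram_range
    result = []
    for n in range(lower, upper + 1):
        # Slide a window of length n over the path.
        for i in range(len(path) - n + 1):
            result.append(tuple(path[i:i + n]))
    return result


path = ["main", "catalog", "cart"]
unigrams = ngrams(path, (1, 1))
uni_and_bigrams = ngrams(path, (1, 2))
print(unigrams)
print(uni_and_bigrams)
```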

Returns:

pd.DataFrame: A DataFrame with the vectorized values. Index contains user_ids, columns contain n-grams.
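The markov feature type can be read as a per-user transition probability for each bigram. The sketch below computes those values for one path in plain Python, assuming the probability of (A, B) is the number of A-to-B transitions divided by the number of transitions leaving A. The helper markov_features is hypothetical, and the result is a dict for a single user rather than a row of the returned DataFrame.

```python
from collections import Counter


def markov_features(path):
    """Map each observed bigram (A, B) to the share of transitions from A that go to B."""
    bigram_counts = Counter(zip(path, path[1:]))
    # Number of transitions leaving each event (the last event has none).
    outgoing = Counter(path[:-1])
    return {bigram: n / outgoing[bigram[0]] for bigram, n in bigram_counts.items()}


path = ["main", "catalog", "main", "cart"]
probs = markov_features(path)
print(probs)
```

In this toy path, "main" is followed once by "catalog" and once by "cart", so both bigrams get 0.5, while ("catalog", "main") gets 1.0.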

filter_cluster(cluster_id)

Truncate the eventstream, leaving the trajectories of the users who belong to the selected cluster. Should be used after fit() or set_clusters().

Parameters:

cluster_id (int or str)

Cluster identifier to be selected.

If fit() was used for cluster generation, the possible values are 0, 1, …

Returns:

EventstreamType: Eventstream with the users belonging to the selected cluster only.
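Conceptually, filtering by cluster keeps only the events of users mapped to the chosen cluster. A minimal sketch on plain Python data follows. The helper keep_cluster and the toy event list are assumptions for illustration, and the real method returns an eventstream rather than a list.

```python
# Hypothetical toy data: (user_id, event) pairs and a user -> cluster mapping.
events = [
    ("u1", "main"), ("u1", "cart"),
    ("u2", "main"),
    ("u3", "main"), ("u3", "purchase"),
]
user_clusters = {"u1": 0, "u2": 1, "u3": 0}


def keep_cluster(events, user_clusters, cluster_id):
    """Keep only the events of users assigned to `cluster_id`."""
    selected = {u for u, c in user_clusters.items() if c == cluster_id}
    return [(u, e) for u, e in events if u in selected]


cluster_0_events = keep_cluster(events, user_clusters, cluster_id=0)
print(cluster_0_events)
```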

fit(method, n_clusters, X, random_state=None)

Prepare features and compute clusters for the input eventstream data.

Parameters:

method ({“kmeans”, “gmm”})

kmeans stands for the classic K-means algorithm. See details in sklearn documentation.
gmm stands for Gaussian mixture model. See details in sklearn documentation.

n_clusters (int)

The expected number of clusters to be passed to a clustering algorithm.

X (pd.DataFrame)

pd.DataFrame representing a custom vectorization of the user paths. The index corresponds to user_ids, the columns are vectorized values of the path. See extract_features().

random_state (int, optional)

Use an int to make the randomness deterministic. Calling fit multiple times with the same random_state leads to the same clustering results.

Returns:

Clusters: A fitted Clusters instance.
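The role of random_state can be shown with a deliberately tiny k-means written in plain Python. This is only a sketch of the general technique, not the sklearn implementation that the kmeans method refers to. The function kmeans, its n_iter parameter, and the sample matrix X are assumptions for illustration.

```python
import random


def kmeans(points, n_clusters, random_state=None, n_iter=20):
    """Tiny k-means on lists of floats; returns one cluster label per point."""
    # A seeded generator makes the initial centers, and thus the result, repeatable.
    rng = random.Random(random_state)
    centers = rng.sample(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(
                range(n_clusters),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical vectorized paths: two users near the origin, two far from it.
X = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
labels_a = kmeans(X, n_clusters=2, random_state=42)
labels_b = kmeans(X, n_clusters=2, random_state=42)
print(labels_a, labels_b)
```

Both calls use the same seed, so they pick the same initial centers and return identical labels.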

plot(targets=None)

Plot a bar plot illustrating the cluster sizes and the conversion rates of the target events within the clusters. Should be used after fit() or set_clusters().

Parameters:

targets (list of str, optional): List of target events whose conversion rates are plotted.
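The quantities shown by plot() can be approximated in plain Python as follows. This is a sketch under assumptions: a user counts as converted if the target event occurs anywhere in the path, cluster size is reported as a user count, and the helper cluster_summary and the toy data are hypothetical.

```python
# Hypothetical toy data: user paths and a user -> cluster mapping.
paths = {
    "u1": ["main", "cart", "purchase"],
    "u2": ["main"],
    "u3": ["main", "catalog"],
}
user_clusters = {"u1": 0, "u2": 0, "u3": 1}


def cluster_summary(paths, user_clusters, targets):
    """Per cluster: number of users and conversion rate to each target event."""
    summary = {}
    for cluster_id in sorted(set(user_clusters.values())):
        users = [u for u, c in user_clusters.items() if c == cluster_id]
        row = {"size": len(users)}
        for target in targets:
            # Assumption: converted means the target occurs at least once in the path.
            converted = sum(1 for u in users if target in paths[u])
            row[target] = converted / len(users)
        summary[cluster_id] = row
    return summary


summary = cluster_summary(paths, user_clusters, targets=["purchase"])
print(summary)
```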

projection(method='tsne', targets=None, color_type='clusters', **kwargs)

Show the clusters’ projection on a plane, applying dimension reduction techniques. Should be used after fit() or set_clusters().

Parameters:

method ({‘umap’, ‘tsne’}, default ‘tsne’)

Type of manifold transformation.

color_type ({‘targets’, ‘clusters’}, default ‘clusters’)

Type of color-coding used for projection visualization:

clusters: colors trajectories according to their cluster number.
targets: colors trajectories according to whether the user reached any event listed in the targets parameter. The targets parameter must be provided in this case.

targets (str or list of str, optional)

Event name or list of event names. If a user reaches any of the specified events, the dot corresponding to this user is highlighted as converted on the resulting projection plot.

**kwargs (optional)

Parameters for sklearn.manifold.TSNE() and umap.UMAP().

Returns:

sns.scatterplot: Plot in the low-dimensional space for user trajectories indexed by user IDs.
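The two color_type options differ only in which label each user's dot receives. The sketch below derives those labels in plain Python and leaves the dimension reduction itself out. The helper color_labels and the toy data are hypothetical, and the error raised for missing targets is an assumption about reasonable behavior rather than the library's documented behavior.

```python
# Hypothetical toy data: user paths and a user -> cluster mapping.
paths = {
    "u1": ["main", "cart", "purchase"],
    "u2": ["main"],
    "u3": ["main", "catalog"],
}
user_clusters = {"u1": 0, "u2": 0, "u3": 1}


def color_labels(paths, user_clusters, color_type="clusters", targets=None):
    """Return the label used to color each user's dot."""
    if color_type == "clusters":
        return dict(user_clusters)
    if color_type == "targets":
        if isinstance(targets, str):
            targets = [targets]
        if not targets:
            raise ValueError("targets must be provided when color_type='targets'")
        # A user is marked as converted if any target event occurs in the path.
        return {u: any(e in targets for e in p) for u, p in paths.items()}
    raise ValueError(f"unsupported color_type: {color_type}")


by_cluster = color_labels(paths, user_clusters)
by_target = color_labels(paths, user_clusters, color_type="targets", targets="purchase")
print(by_cluster)
print(by_target)
```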

set_clusters(user_clusters)

Set custom user-cluster mapping.

Parameters:

user_clusters (pd.Series): Series index corresponds to user_ids. Values are cluster_ids.

Returns:

Clusters: A fitted Clusters instance.

property cluster_mapping

Return the previously calculated cluster_id -> list[user_ids] mapping.

Returns:

dict: The keys are cluster_ids, and the values are the lists of the user_ids related to the corresponding cluster.
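The relationship between the user_id -> cluster_id mapping and this dictionary can be sketched by inverting a plain dict. The sample mapping is hypothetical, and a dict stands in for the pd.Series used by the library.

```python
# Hypothetical user_id -> cluster_id mapping (a dict standing in for a pd.Series).
user_clusters = {"u1": 0, "u2": 1, "u3": 0}

# Invert it into cluster_id -> list of user_ids.
cluster_mapping = {}
for user_id, cluster_id in user_clusters.items():
    cluster_mapping.setdefault(cluster_id, []).append(user_id)

print(cluster_mapping)
```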

property params: Returns the parameters used for the last fitting.

property user_clusters

Returns:

pd.Series: user_id -> cluster_id mapping represented as a pd.Series. The index corresponds to user_ids, and the values are the corresponding cluster_ids.

Eventstream

property Eventstream.clusters

Returns:

Clusters: A blank (not fitted) instance of Clusters class to be used for cluster analysis.