Clusters

Clusters Class

class retentioneering.tooling.clusters.clusters.Clusters(eventstream)

A class that provides methods for cluster analysis.

Parameters:
eventstream : EventstreamType

See also

Eventstream.clusters

Call Clusters tool as an eventstream method.

Notes

See the Clusters user guide for details.

diff(cluster_id1, cluster_id2=None, top_n_events=8, weight_col=None, targets=None)

Plots a bar plot illustrating the distribution of top_n_events in cluster cluster_id1 compared with the entire dataset or the cluster cluster_id2 if specified. Should be used after fit() or set_clusters().

Parameters:
cluster_id1 : int or str

ID of the cluster to compare.

cluster_id2 : int or str, optional

ID of the second cluster to compare with the first cluster. If None, then compares with the entire dataset.

top_n_events : int, default 8

Number of top events to display.

weight_col : str, optional

If None, the distributions are compared by event occurrences in the datasets. If weight_col is specified (for example, the user ID column), the plot shows the percentage of users who have each event.

targets : str or list of str, optional

Event names to always include in the comparison, regardless of the top_n_events value. Target events appear in the order specified.

Returns:
matplotlib.axes.Axes

Axes containing the distribution bar chart.
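The comparison that diff() visualizes can be sketched in plain Python. This is an illustration of the idea only, not the library's implementation: the toy events list, the event_shares and diff_table helpers, the by_users flag (standing in for weight_col), and the choice to rank top events by their share inside the cluster are all assumptions.

```python
from collections import Counter

# Toy data (hypothetical): (user_id, event) pairs and a user -> cluster mapping.
events = [
    ("u1", "main"), ("u1", "catalog"), ("u1", "cart"),
    ("u2", "main"), ("u2", "catalog"), ("u2", "catalog"),
    ("u3", "main"), ("u3", "cart"), ("u3", "purchase"),
]
user_clusters = {"u1": 0, "u2": 0, "u3": 1}


def event_shares(rows, by_users=False):
    """Return {event: share}.

    by_users=False: share of all event occurrences (like weight_col=None).
    by_users=True: share of users who had the event at least once
    (like passing a user ID column as weight_col).
    """
    if by_users:
        users = {user for user, _ in rows}
        counts = Counter(event for _, event in set(rows))
        total = len(users)
    else:
        counts = Counter(event for _, event in rows)
        total = len(rows)
    return {event: n / total for event, n in counts.items()}


def diff_table(cluster_id, top_n_events=2, targets=(), by_users=False):
    """Compare event shares in one cluster with the entire dataset."""
    cluster_rows = [r for r in events if user_clusters[r[0]] == cluster_id]
    in_cluster = event_shares(cluster_rows, by_users)
    overall = event_shares(events, by_users)
    # Assumption: top events are ranked by their share inside the cluster.
    top = sorted(in_cluster, key=in_cluster.get, reverse=True)[:top_n_events]
    # Targets are always included, in the order given.
    names = top + [t for t in targets if t not in top]
    return {e: (in_cluster.get(e, 0.0), overall.get(e, 0.0)) for e in names}


table = diff_table(0, top_n_events=1, targets=["purchase"])
```

Each value in table is a pair (share in the cluster, share in the entire dataset), which is the pair of bars the plot would draw for that event.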

extract_features(feature_type, ngram_range=None)

Calculate vectorized user paths.

Parameters:
feature_type : {“tfidf”, “count”, “frequency”, “binary”, “markov”, “time”, “time_fraction”}

Algorithm used to convert event sequences into numerical vectors:

  • tfidf: see the sklearn documentation for details.

  • count: see the sklearn documentation for details.

  • frequency: similar to count, but normalized by the total number of events in the user’s trajectory.

  • binary: 1 if a user had the given n-gram at least once, and 0 otherwise.

  • markov: available for bigrams only. For a given bigram (A, B), the vectorized values are the user’s transition probabilities from A to B.

  • time: associated with unigrams only. The total number of seconds spent from the beginning of a user’s path until the given event.

  • time_fraction: the same as time, but divided by the total length of the user’s trajectory (in seconds).

ngram_range : Tuple(int, int), optional

The lower and upper boundary of the range of n-values for different word n-grams to be extracted. For example, ngram_range=(1, 1) means only single events, (1, 2) means single events and bigrams. Ignored for markov, time, time_fraction feature types.

Returns:
pd.DataFrame

A DataFrame with the vectorized values. Index contains user_ids, columns contain n-grams.
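To make the count, frequency, binary, and markov feature types concrete, here is a minimal plain-Python sketch for a single user path. It is not the library's implementation: the toy paths, all helper names, and the exact markov normalization (count of the bigram divided by the number of transitions leaving A) are assumptions. The real method returns one row per user in a DataFrame rather than per-user dictionaries.

```python
from collections import Counter

# Toy user paths (hypothetical data): user_id -> ordered list of events.
paths = {
    "u1": ["main", "catalog", "catalog", "cart"],
    "u2": ["main", "main"],
}


def ngrams(path, n):
    """All contiguous n-grams of a path, as tuples."""
    return [tuple(path[i:i + n]) for i in range(len(path) - n + 1)]


def count_features(path, ngram_range=(1, 1)):
    """count: how many times each n-gram occurs in the path."""
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        counts.update(ngrams(path, n))
    return counts


def frequency_features(path, ngram_range=(1, 1)):
    """frequency: counts divided by the number of events in the path."""
    counts = count_features(path, ngram_range)
    return {gram: c / len(path) for gram, c in counts.items()}


def binary_features(path, ngram_range=(1, 1)):
    """binary: 1 if the n-gram occurred at least once (absent n-grams are 0)."""
    return {gram: 1 for gram in count_features(path, ngram_range)}


def markov_features(path):
    """markov: per-user transition probability for each observed bigram (A, B).

    Assumption: P(A -> B) = count(A, B) / number of transitions leaving A.
    """
    bigrams = Counter(ngrams(path, 2))
    leaving = Counter(a for a, _ in ngrams(path, 2))
    return {(a, b): c / leaving[a] for (a, b), c in bigrams.items()}


u1 = paths["u1"]
counts = count_features(u1, ngram_range=(1, 2))  # unigrams and bigrams
freqs = frequency_features(u1)                   # unigrams only
markov = markov_features(u1)
```

With ngram_range=(1, 2) the count vector holds both single events such as ("catalog",) and bigrams such as ("main", "catalog"), matching the description of ngram_range above.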

filter_cluster(cluster_id)

Truncate the eventstream, leaving the trajectories of the users who belong to the selected cluster. Should be used after fit() or set_clusters().

Parameters:
cluster_id : int or str

Cluster identifier to be selected.

If fit() was used for cluster generation, the possible values are 0, 1, ….

Returns:
EventstreamType

Eventstream with the users belonging to the selected cluster only.
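The filtering step can be pictured with a short plain-Python sketch. It only illustrates the selection logic on toy data and is not how the library operates on an eventstream; the row layout and the standalone filter_cluster function are hypothetical.

```python
# Toy eventstream (hypothetical): rows of (user_id, event, timestamp).
events = [
    ("u1", "main", 1), ("u2", "main", 2), ("u1", "cart", 3),
    ("u3", "main", 4), ("u2", "catalog", 5),
]
user_clusters = {"u1": 0, "u2": 1, "u3": 0}


def filter_cluster(rows, mapping, cluster_id):
    """Keep only the rows of users assigned to cluster_id, preserving order."""
    members = {user for user, cid in mapping.items() if cid == cluster_id}
    return [row for row in rows if row[0] in members]


cluster_0 = filter_cluster(events, user_clusters, 0)
```

Whole trajectories are kept or dropped together: every event of a user in the selected cluster survives, and no event of any other user does.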

fit(method, n_clusters, X, random_state=None)

Prepare features and compute clusters for the input eventstream data.

Parameters:
method : {“kmeans”, “gmm”}

Clustering algorithm to apply.

n_clusters : int

The expected number of clusters to be passed to a clustering algorithm.

X : pd.DataFrame

pd.DataFrame representing a custom vectorization of the user paths. The index corresponds to user_ids, the columns are vectorized values of the path. See extract_features().

random_state : int, optional

Use an int to make the randomness deterministic. Calling fit multiple times with the same random_state leads to the same clustering results.

Returns:
Clusters

A fitted Clusters instance.
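The effect of random_state can be illustrated with a tiny seeded k-means written in plain Python. This is a teaching sketch under stated assumptions, not the algorithm the library runs: the kmeans function, its n_iter parameter, and the toy feature vectors are all hypothetical, and plain tuples stand in for the rows of X.

```python
import random


def kmeans(points, n_clusters, random_state=None, n_iter=20):
    """A tiny k-means sketch: returns one cluster label per point."""
    rng = random.Random(random_state)  # seeded generator -> reproducible init
    centers = rng.sample(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(n_clusters),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


# Hypothetical feature vectors, one tuple per user.
X = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]

first = kmeans(X, n_clusters=2, random_state=42)
second = kmeans(X, n_clusters=2, random_state=42)
```

Because the only source of randomness is the seeded initial choice of centers, first and second are identical label lists. With random_state=None the initial centers, and therefore the numbering of the clusters, may differ between runs.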

plot(targets=None)

Plot a bar plot illustrating the cluster sizes and the conversion rates of the target events within the clusters. Should be used after fit() or set_clusters().

Parameters:
targets : str or list of str, optional

Target events whose conversion rates are plotted for each cluster.
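The two quantities shown by the bar plot, cluster sizes and target conversion rates, can be sketched in plain Python. The definitions used here are assumptions for illustration: size is taken as the share of all users in the cluster, and the conversion rate as the share of the cluster's users whose path contains the target at least once. The cluster_stats helper and the toy data are hypothetical.

```python
# Toy data (hypothetical): user paths and a user -> cluster mapping.
paths = {
    "u1": ["main", "cart", "purchase"],
    "u2": ["main", "catalog"],
    "u3": ["main", "cart"],
    "u4": ["main", "purchase"],
}
user_clusters = {"u1": 0, "u2": 0, "u3": 1, "u4": 0}


def cluster_stats(paths, mapping, targets):
    """Return {cluster_id: {"size": share of users, target: conversion rate}}."""
    stats = {}
    for cid in sorted(set(mapping.values())):
        members = [user for user, c in mapping.items() if c == cid]
        row = {"size": len(members) / len(mapping)}
        for target in targets:
            # A user counts as converted if the target occurs in their path.
            converted = sum(1 for user in members if target in paths[user])
            row[target] = converted / len(members)
        stats[cid] = row
    return stats


stats = cluster_stats(paths, user_clusters, targets=["purchase"])
```

Here cluster 0 holds three of the four users and two of them reached purchase, while cluster 1 holds one user who did not.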

projection(method='tsne', targets=None, color_type='clusters', **kwargs)

Show the clusters’ projection on a plane, applying dimension reduction techniques. Should be used after fit() or set_clusters().

Parameters:
method : {‘umap’, ‘tsne’}, default ‘tsne’

Type of manifold transformation.

color_type : {‘targets’, ‘clusters’}, default ‘clusters’

Type of color-coding used for projection visualization:

  • clusters: colors trajectories according to their cluster number.

  • targets: colors trajectories by whether the user reached any event listed in the targets parameter, which is required in this case.

targets : str or list of str, optional

Event name or list of event names. If a user reaches any of the specified events, the dot corresponding to this user is highlighted as converted on the resulting projection plot.

**kwargs : optional

Parameters passed to sklearn.manifold.TSNE() or umap.UMAP(), depending on method.

Returns:
sns.scatterplot

Plot in the low-dimensional space for user trajectories indexed by user IDs.

set_clusters(user_clusters)

Set custom user-cluster mapping.

Parameters:
user_clusters : pd.Series

The Series index corresponds to user_ids and the values are cluster_ids. The values must be integers; for example, with 3 clusters the possible cluster_ids are 0, 1, 2.

Returns:
Clusters

A fitted Clusters instance.
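When a custom segmentation uses non-integer labels, it has to be converted to integer cluster_ids first. The sketch below shows one way to do that in plain Python; the to_cluster_ids helper and the plans data are hypothetical, and a dict stands in for the pd.Series that set_clusters() takes.

```python
def to_cluster_ids(labels):
    """Map arbitrary labels to consecutive integers 0, 1, 2, ...

    `labels` is a {user_id: label} dict standing in for the pd.Series.
    """
    # Sort the distinct labels so the numbering is stable between runs.
    codes = {label: i for i, label in enumerate(sorted(set(labels.values())))}
    return {user: codes[label] for user, label in labels.items()}


# Hypothetical custom segmentation by subscription plan.
plans = {"u1": "free", "u2": "pro", "u3": "free", "u4": "trial"}
user_clusters = to_cluster_ids(plans)
```

The resulting mapping uses the ids 0, 1, 2 for the three plans; wrapped in a pd.Series indexed by user_id, it has the shape described for the user_clusters parameter.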

property cluster_mapping

Return the previously calculated cluster_id -> list[user_ids] mapping.

Returns:
dict

The keys are cluster_ids, and the values are the lists of the user_ids related to the corresponding cluster.

property params

Returns the parameters used for the last fitting.

property user_clusters
Returns:
pd.Series

The user_id -> cluster_id mapping represented as a pd.Series. The index corresponds to user_ids, and the values are the corresponding cluster_ids.
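The user_clusters and cluster_mapping properties describe the same assignment from two directions. A minimal sketch of the relationship, using a plain dict in place of the pd.Series (the to_cluster_mapping helper and the sample ids are hypothetical):

```python
from collections import defaultdict

# Hypothetical user_id -> cluster_id mapping (a dict standing in for the Series).
user_clusters = {"u1": 0, "u2": 1, "u3": 0}


def to_cluster_mapping(mapping):
    """Invert user_id -> cluster_id into cluster_id -> list of user_ids."""
    result = defaultdict(list)
    for user, cid in mapping.items():
        result[cid].append(user)
    return dict(result)


cluster_mapping = to_cluster_mapping(user_clusters)
```

Each user appears in exactly one list, so the list lengths sum to the number of users.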

Eventstream

property Eventstream.clusters
Returns:
Clusters

A blank (not fitted) instance of the Clusters class to be used for cluster analysis.

See also

Clusters