Clusters
Clusters Class
- class retentioneering.tooling.clusters.clusters.Clusters(eventstream)
A class that holds methods for the cluster analysis.
- Parameters:
- eventstream : EventstreamType
See also
Eventstream.clusters
Call Clusters tool as an eventstream method.
Notes
See Clusters user guide for the details.
- diff(cluster_id1, cluster_id2=None, top_n_events=8, weight_col=None, targets=None)
Plots a bar plot illustrating the distribution of top_n_events in cluster cluster_id1 compared with the entire dataset or with cluster cluster_id2 if specified. Should be used after fit() or set_clusters().
- Parameters:
- cluster_id1 : int or str
ID of the cluster to compare.
- cluster_id2 : int or str, optional
ID of the second cluster to compare with the first cluster. If None, the first cluster is compared with the entire dataset.
- top_n_events : int, default 8
Number of top events.
- weight_col : str, optional
If None, the distributions are compared based on event occurrences in the datasets. If weight_col is specified, the percentages of users (the column name is given by weight_col) who have particular events are plotted.
- targets : str or list of str, optional
List of event names to always include in the comparison, regardless of the top_n_events value. Target events appear in the same order as specified.
- Returns:
- matplotlib.axes.Axes
Plots the distribution barchart.
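The difference between the two comparison modes can be illustrated without the library. The sketch below is not the retentioneering implementation: the toy `events` list, the `cluster_users` set, and the `event_shares` helper are all hypothetical. It only shows how the compared numbers change when counting event occurrences versus counting the share of users who have an event.

```python
from collections import Counter

# Hypothetical toy data: (user_id, event) pairs; not a real eventstream.
events = [
    ("u1", "main"), ("u1", "catalog"), ("u1", "catalog"), ("u1", "cart"),
    ("u2", "main"), ("u2", "catalog"),
    ("u3", "main"), ("u3", "cart"), ("u3", "purchase"),
]

def event_shares(rows, by_users=False):
    """Share of each event among all occurrences, or share of users having it."""
    if by_users:
        users = {user for user, _ in rows}
        # Count each (user, event) pair once, then divide by the number of users.
        counts = Counter(event for _, event in set(rows))
        return {event: n / len(users) for event, n in counts.items()}
    counts = Counter(event for _, event in rows)
    return {event: n / len(rows) for event, n in counts.items()}

cluster_users = {"u1", "u2"}  # assumed members of the cluster being compared
cluster_rows = [row for row in events if row[0] in cluster_users]

by_occurrence = event_shares(cluster_rows)           # analogous to weight_col=None
by_user = event_shares(cluster_rows, by_users=True)  # analogous to a user-level weight_col
whole_by_user = event_shares(events, by_users=True)  # baseline: the entire dataset
```

In this toy data, "catalog" makes up half of the cluster's event occurrences, yet every user in the cluster has it at least once, so the two modes can rank events differently.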
- extract_features(feature_type, ngram_range=None)
Calculate vectorized user paths.
- Parameters:
- feature_type : {“tfidf”, “count”, “frequency”, “binary”, “markov”, “time”, “time_fraction”}
Algorithms for converting text sequences to numerical vectors:
- tfidf: see details in sklearn documentation.
- count: see details in sklearn documentation.
- frequency: similar to count, but normalized to the total number of the events in the user’s trajectory.
- binary: 1 if a user had the given n-gram at least once and 0 otherwise.
- markov: available for bigrams only. For a given bigram (A, B) the vectorized values are the user’s transition probabilities from A to B.
- time: associated with unigrams only. The total number of the seconds spent from the beginning of a user’s path until the given event.
- time_fraction: the same as time but divided by the total length of the user’s trajectory (in seconds).
- ngram_range : Tuple(int, int)
The lower and upper boundary of the range of n-values for different word n-grams to be extracted. For example, ngram_range=(1, 1) means only single events, (1, 2) means single events and bigrams. Ignored for markov, time, time_fraction feature types.
- Returns:
- pd.DataFrame
A DataFrame with the vectorized values. Index contains user_ids, columns contain n-grams.
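To make the feature types more concrete, here is a minimal pure-Python sketch for a single user path. It is an illustration of the definitions above, not the library's code. The `path` data and all variable names are hypothetical, and the handling of repeated events in the time features (taking the first occurrence) is an assumption that the description above does not confirm.

```python
from collections import Counter

# Hypothetical single-user path: (event, timestamp in seconds).
path = [("main", 0), ("catalog", 10), ("catalog", 25), ("cart", 40)]
names = [event for event, _ in path]

# Unigram features.
count = Counter(names)
frequency = {event: n / len(names) for event, n in count.items()}
binary = {event: 1 for event in count}  # events absent from the path would get 0

# "markov": number of A -> B transitions divided by the number of transitions leaving A.
bigrams = Counter(zip(names, names[1:]))
starts = Counter(names[:-1])
markov = {(a, b): n / starts[a] for (a, b), n in bigrams.items()}

# "time": seconds from the start of the path until the event.
# Assumption: for a repeated event, the first occurrence is used.
start = path[0][1]
time_feature = {}
for event, ts in path:
    time_feature.setdefault(event, ts - start)

# "time_fraction": the same values divided by the total path length in seconds.
total = path[-1][1] - start
time_fraction = {event: t / total for event, t in time_feature.items()}
```

In the DataFrame returned by extract_features, one such set of values would form a single row (one user), with the n-grams as columns.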
- filter_cluster(cluster_id)
Truncate the eventstream, leaving the trajectories of the users who belong to the selected cluster. Should be used after fit() or set_clusters().
- Parameters:
- cluster_id : int or str
Cluster identifier to be selected.
- If create_clusters() was used for cluster generation, then 0, 1, … values are possible.
- If
- Returns:
- EventstreamType
Eventstream with the users belonging to the selected cluster only.
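Conceptually, the filtering keeps only the rows whose user belongs to the chosen cluster. A minimal sketch of that idea follows, using plain Python structures in place of an eventstream. The data and the `filter_cluster_rows` helper are hypothetical and are not part of the library.

```python
# Hypothetical toy data standing in for an eventstream and a user -> cluster mapping.
events = [("u1", "main"), ("u2", "main"), ("u1", "cart"), ("u3", "catalog")]
user_clusters = {"u1": 0, "u2": 1, "u3": 0}

def filter_cluster_rows(rows, mapping, cluster_id):
    """Keep only the events of users assigned to cluster_id, preserving order."""
    return [row for row in rows if mapping.get(row[0]) == cluster_id]

cluster_0 = filter_cluster_rows(events, user_clusters, 0)
```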
- fit(method, n_clusters, X, random_state=None)
Prepare features and compute clusters for the input eventstream data.
- Parameters:
- method : {“kmeans”, “gmm”}
- kmeans stands for the classic K-means algorithm. See details in sklearn documentation.
- gmm stands for Gaussian mixture model. See details in sklearn documentation.
- n_clusters : int
The expected number of clusters to be passed to a clustering algorithm.
- X : pd.DataFrame
pd.DataFrame representing a custom vectorization of the user paths. The index corresponds to user_ids, the columns are vectorized values of the path. See extract_features().
- random_state : int, optional
Use an int to make the randomness deterministic. Calling fit multiple times with the same random_state leads to the same clustering results.
- Returns:
- Clusters
A fitted Clusters instance.
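The effect of random_state can be shown with a toy K-means. The sketch below is a deliberately small version of Lloyd's algorithm written for illustration. It is not the sklearn implementation that the library refers to, and the `kmeans` helper, its `n_iter` argument, and the sample `X` are hypothetical.

```python
import random

def kmeans(points, n_clusters, random_state=None, n_iter=20):
    """Tiny Lloyd's algorithm; returns a cluster label for every point."""
    rng = random.Random(random_state)
    centers = rng.sample(points, n_clusters)  # seeded choice of initial centers
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(n_clusters),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

# Hypothetical vectorized paths: one tuple of feature values per user.
X = [(0.0, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 1.0)]
labels_a = kmeans(X, n_clusters=2, random_state=42)
labels_b = kmeans(X, n_clusters=2, random_state=42)
```

Because the only source of randomness is the seeded choice of initial centers, both calls return identical labels. Without a fixed seed, the numbering of the clusters may differ between runs.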
- plot(targets=None)
Plot a bar plot illustrating the cluster sizes and the conversion rates of the target events within the clusters. Should be used after fit() or set_clusters().
- Parameters:
- targets : str or list of str, optional
List of the target events.
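The quantities behind such a plot can be sketched in a few lines. This is an illustration only, and it makes two assumptions: that cluster size is expressed as a share of all users, and that the conversion rate is the share of a cluster's users who had the target event at least once. The data and names are hypothetical.

```python
# Hypothetical toy data: user -> cluster mapping and the set of events each user had.
user_clusters = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}
user_events = {
    "u1": {"main", "cart", "purchase"},
    "u2": {"main", "catalog"},
    "u3": {"main", "cart"},
    "u4": {"main"},
}
targets = ["cart", "purchase"]

summary = {}
for cluster_id in sorted(set(user_clusters.values())):
    members = [u for u, c in user_clusters.items() if c == cluster_id]
    # Assumption: cluster size is reported as a share of all users.
    row = {"size": len(members) / len(user_clusters)}
    for target in targets:
        # Assumption: a user converts if the target event occurs at least once.
        converted = sum(target in user_events[u] for u in members)
        row[target] = converted / len(members)
    summary[cluster_id] = row
```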
- projection(method='tsne', targets=None, color_type='clusters', **kwargs)
Show the clusters’ projection on a plane, applying dimension reduction techniques. Should be used after fit() or set_clusters().
- Parameters:
- method : {‘umap’, ‘tsne’}, default ‘tsne’
Type of manifold transformation.
- color_type : {‘targets’, ‘clusters’}, default ‘clusters’
Type of color-coding used for projection visualization:
- clusters: trajectories are colored according to their cluster number.
- targets: trajectories are colored according to whether the user reached any event listed in the targets parameter. The targets parameter must be provided in this case.
- targets : str or list of str, optional
Vector of event_names as str. If a user reaches any of the specified events, the dot corresponding to this user will be highlighted as converted on the resulting projection plot.
- **kwargs : optional
Parameters for sklearn.manifold.TSNE() and umap.UMAP().
- Returns:
- sns.scatterplot
Plot in the low-dimensional space for user trajectories indexed by user IDs.
- set_clusters(user_clusters)
Set custom user-cluster mapping.
- Parameters:
- user_clusters : pd.Series
Series index corresponds to user_ids. Values are cluster_ids. The values must be integers. For example, with 3 clusters the cluster_ids must be 0, 1, 2.
- Returns:
- Clusters
A fitted Clusters instance.
- property cluster_mapping
Return the previously calculated cluster_id -> list[user_ids] mapping.
- Returns:
- dict
The keys are cluster_ids, and the values are the lists of the user_ids related to the corresponding cluster.
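The relation between the user_id -> cluster_id and cluster_id -> list[user_ids] views is a simple inversion. A minimal sketch, using a plain dict as a hypothetical stand-in for the pd.Series:

```python
from collections import defaultdict

# Hypothetical user_id -> cluster_id mapping (a plain dict instead of a pd.Series).
user_clusters = {"u1": 0, "u2": 1, "u3": 0}

# Group user_ids under their cluster_id.
cluster_mapping = defaultdict(list)
for user_id, cluster_id in user_clusters.items():
    cluster_mapping[cluster_id].append(user_id)
cluster_mapping = dict(cluster_mapping)
```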
- property params
Returns the parameters used for the last fitting.
- property user_clusters
- Returns:
- pd.Series
user_id -> cluster_id mapping represented as a pd.Series. The index corresponds to user_ids, the values are the corresponding cluster_ids.