Clusters
Clusters Class
- class retentioneering.tooling.clusters.clusters.Clusters(eventstream)
- A class that holds methods for cluster analysis.
- Parameters:
- eventstream : EventstreamType
 
 - See also
- Eventstream.clusters
- Call Clusters tool as an eventstream method.
 - Notes
- See Clusters user guide for the details.
 - diff(cluster_id1, cluster_id2=None, top_n_events=8, weight_col=None, targets=None)
- Plots a bar plot illustrating the distribution of the top top_n_events events in cluster cluster_id1 compared with the entire dataset, or with cluster cluster_id2 if specified. Should be used after fit() or set_clusters().
- Parameters:
- cluster_id1 : int or str
- ID of the cluster to compare.
- cluster_id2 : int or str, optional
- ID of the second cluster to compare with the first cluster. If None, the first cluster is compared with the entire dataset.
- top_n_events : int, default 8
- Number of top events.
- weight_col : str, optional
- If None, the distributions are compared by event occurrences in the datasets. If weight_col is specified, the percentages of users (identified by the column named in weight_col) who have particular events are plotted.
- targets : str or list of str, optional
- List of event names always to include in the comparison, regardless of the top_n_events value. Target events appear in the same order as specified.
 
- Returns:
- matplotlib.axes.Axes
- The axes on which the distribution bar chart is plotted.
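The two comparison modes described for weight_col can be illustrated with a small plain-Python sketch. It is not the library's implementation: the event rows, user IDs, cluster membership, and both helper functions are hypothetical, and the normalization is one reading of the parameter description above.

```python
from collections import Counter

# Hypothetical (user_id, event) rows; the names are made up for this sketch.
events = [
    ("u1", "main"), ("u1", "cart"), ("u1", "cart"),
    ("u2", "main"), ("u2", "catalog"),
    ("u3", "main"), ("u3", "cart"),
]
cluster_users = {"u1", "u3"}  # assumed members of the cluster being compared


def occurrence_share(rows):
    """Share of each event among all event occurrences (the weight_col=None reading)."""
    counts = Counter(event for _, event in rows)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def user_share(rows):
    """Share of users who have each event at least once (the weight_col reading)."""
    users = {user for user, _ in rows}
    # set(rows) drops repeated (user, event) pairs so each user counts once per event.
    users_per_event = Counter(event for _, event in set(rows))
    return {event: n / len(users) for event, n in users_per_event.items()}


cluster_rows = [row for row in events if row[0] in cluster_users]

# Cluster versus entire dataset, under both readings.
print("occurrences:", occurrence_share(cluster_rows), occurrence_share(events))
print("users:", user_share(cluster_rows), user_share(events))
```

In this toy data the "cart" event makes up 3 of the 5 event occurrences in the cluster, while every user in the cluster has it at least once, so the two modes can give noticeably different bars for the same event.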
 
 
 - extract_features(feature_type, ngram_range=None)
- Calculate vectorized user paths.
- Parameters:
- feature_type : {“tfidf”, “count”, “frequency”, “binary”, “markov”, “time”, “time_fraction”}
- Algorithms for converting text sequences to numerical vectors:
- tfidf: see details in sklearn documentation.
- count: see details in sklearn documentation.
- frequency: similar to count, but normalized to the total number of events in the user’s trajectory.
- binary: 1 if a user had the given n-gram at least once and 0 otherwise.
- markov: available for bigrams only. For a given bigram (A, B), the vectorized values are the user’s transition probabilities from A to B.
- time: associated with unigrams only. The total number of seconds spent from the beginning of a user’s path until the given event.
- time_fraction: the same as time, but divided by the total length of the user’s trajectory (in seconds).
 
- ngram_range : Tuple(int, int)
- The lower and upper boundary of the range of n-values for different word n-grams to be extracted. For example, ngram_range=(1, 1) means only single events, (1, 2) means single events and bigrams. Ignored for the markov, time, and time_fraction feature types.
 
- Returns:
- pd.DataFrame
- A DataFrame with the vectorized values. Index contains user_ids, columns contain n-grams. 
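As a rough illustration of the count, frequency, binary, and markov feature types, the sketch below vectorizes two toy paths in plain Python. It is an assumption-laden illustration, not the library's code: the paths, the ngrams and markov helpers, and the dictionary-based output are all hypothetical, and the frequency denominator (number of events in the path) follows one reading of the description above.

```python
from collections import Counter


def ngrams(path, ngram_range):
    """Return every n-gram of the path, as a tuple, for n in [low, high]."""
    low, high = ngram_range
    return [
        tuple(path[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(path) - n + 1)
    ]


# Hypothetical user paths; event names are made up for this sketch.
paths = {
    "user_1": ["main", "catalog", "cart", "catalog", "main"],
    "user_2": ["main", "cart"],
}

# ngram_range=(1, 2): single events and bigrams.
grams = {user: ngrams(path, (1, 2)) for user, path in paths.items()}
vocabulary = sorted({gram for user_grams in grams.values() for gram in user_grams})

count, frequency, binary = {}, {}, {}
for user, user_grams in grams.items():
    counter = Counter(user_grams)
    count[user] = {gram: counter[gram] for gram in vocabulary}
    # Assumed denominator: the number of events in the user's path.
    frequency[user] = {gram: n / len(paths[user]) for gram, n in count[user].items()}
    binary[user] = {gram: int(n > 0) for gram, n in count[user].items()}


def markov(path):
    """Transition probabilities P(B | A) for each bigram (A, B) seen in the path."""
    transitions = Counter(zip(path, path[1:]))
    outgoing = Counter(path[:-1])  # how many times each event is left
    return {(a, b): n / outgoing[a] for (a, b), n in transitions.items()}


markov_user_1 = markov(paths["user_1"])
print(count["user_1"][("catalog",)], frequency["user_1"][("catalog",)])
print(markov_user_1)
```

Each user ends up with one value per n-gram in the shared vocabulary, which mirrors the returned layout of user IDs as rows and n-grams as columns.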
 
 
 - filter_cluster(cluster_id)
- Truncate the eventstream, leaving only the trajectories of the users who belong to the selected cluster. Should be used after fit() or set_clusters().
- Parameters:
- cluster_id : int or str
- Cluster identifier to be selected.
- If create_clusters() was used for cluster generation, then 0, 1, … values are possible.
 
- If 
 
- Returns:
- EventstreamType
- Eventstream with the users belonging to the selected cluster only. 
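Conceptually, the filtering keeps only the rows of users mapped to the chosen cluster. The sketch below shows that idea on plain tuples and a dict. The data and the filter_rows helper are hypothetical stand-ins for an eventstream and the user-cluster mapping, not the method's actual implementation.

```python
# Hypothetical (user_id, event) rows and user -> cluster mapping.
events = [
    ("u1", "main"), ("u2", "main"), ("u1", "cart"),
    ("u3", "main"), ("u2", "catalog"),
]
user_clusters = {"u1": 0, "u2": 1, "u3": 0}


def filter_rows(rows, user_clusters, cluster_id):
    """Keep the rows of users assigned to cluster_id, preserving row order."""
    members = {user for user, cluster in user_clusters.items() if cluster == cluster_id}
    return [row for row in rows if row[0] in members]


cluster_0_rows = filter_rows(events, user_clusters, 0)
print(cluster_0_rows)
```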
 
 
 - fit(method, n_clusters, X, random_state=None)
- Prepare features and compute clusters for the input eventstream data.
- Parameters:
- method : {“kmeans”, “gmm”}
- kmeans stands for the classic K-means algorithm. See details in sklearn documentation.
- gmm stands for Gaussian mixture model. See details in sklearn documentation.
 
- n_clusters : int
- The expected number of clusters to be passed to a clustering algorithm.
- X : pd.DataFrame
- A pd.DataFrame representing a custom vectorization of the user paths. The index corresponds to user_ids, the columns are vectorized values of the path. See extract_features().
- random_state : int, optional
- Use an int to make the randomness deterministic. Calling fit multiple times with the same random_state leads to the same clustering results.
 
- Returns:
- Clusters
- A fitted Clusters instance.
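To make the roles of X, n_clusters, and random_state concrete, here is a toy K-means (Lloyd's algorithm) in plain Python. It is deliberately not the sklearn implementation that the kmeans option refers to: the kmeans function, its n_iter parameter, and the feature vectors are all hypothetical, and the sketch only shows that a fixed seed gives repeatable user-to-cluster assignments.

```python
import random


def kmeans(X, n_clusters, random_state=None, n_iter=20):
    """Toy Lloyd's algorithm. X maps user_id -> feature vector; returns user_id -> cluster_id."""
    rng = random.Random(random_state)  # the seed fixes the initial centers
    users = sorted(X)
    centers = [list(X[user]) for user in rng.sample(users, n_clusters)]
    labels = {}
    for _ in range(n_iter):
        # Assignment step: each user goes to the nearest center (squared Euclidean distance).
        for user in users:
            labels[user] = min(
                range(n_clusters),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(X[user], centers[k])),
            )
        # Update step: each center moves to the mean of its members (empty clusters stay put).
        for k in range(n_clusters):
            members = [X[user] for user in users if labels[user] == k]
            if members:
                centers[k] = [sum(column) / len(members) for column in zip(*members)]
    return labels


# Hypothetical vectorized paths: two users per obvious group.
X = {"u1": [1.0, 0.0], "u2": [0.9, 0.1], "u3": [0.0, 1.0], "u4": [0.1, 0.9]}

labels_a = kmeans(X, n_clusters=2, random_state=42)
labels_b = kmeans(X, n_clusters=2, random_state=42)
print(labels_a)
```

Running the function twice with the same random_state yields identical labels, which is the behaviour the random_state description promises for fit.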
 
 
 - plot(targets=None)
- Plot a bar plot illustrating the cluster sizes and the conversion rates of the target events within the clusters. Should be used after fit() or set_clusters().
- Parameters:
- targets : str or list of str, optional
- Represents the list of the target events.
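The quantities behind such a plot can be sketched as follows, assuming that the conversion rate of a target within a cluster means the share of the cluster's users whose path contains that event. The data and variable names are hypothetical, and the sketch reports sizes as user counts rather than whatever scale the plot uses.

```python
from collections import Counter

# Hypothetical user -> cluster mapping and user paths.
user_clusters = {"u1": 0, "u2": 0, "u3": 1, "u4": 1, "u5": 1}
paths = {
    "u1": ["main", "cart", "purchase"],
    "u2": ["main"],
    "u3": ["main", "cart"],
    "u4": ["main", "cart", "purchase"],
    "u5": ["main", "catalog"],
}
targets = ["cart", "purchase"]

# Cluster sizes as user counts.
sizes = Counter(user_clusters.values())

# Conversion rate per cluster and target: share of members who reached the target.
conversion = {}
for cluster_id, size in sizes.items():
    members = [user for user, cluster in user_clusters.items() if cluster == cluster_id]
    conversion[cluster_id] = {
        target: sum(target in paths[user] for user in members) / size
        for target in targets
    }

print(dict(sizes))
print(conversion)
```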
 
 
 - projection(method='tsne', targets=None, color_type='clusters', **kwargs)
- Show the clusters’ projection on a plane, applying dimension reduction techniques. Should be used after fit() or set_clusters().
- Parameters:
- method : {‘umap’, ‘tsne’}, default ‘tsne’
- Type of manifold transformation.
- color_type : {‘targets’, ‘clusters’}, default ‘clusters’
- Type of color-coding used for projection visualization:
- clusters: colors trajectories according to their cluster number.
- targets: colors trajectories according to whether the user reached any event listed in the targets parameter. The targets parameter must be provided in this case.
 
- targets : str or list of str, optional
- Event names as str. If a user reaches any of the specified events, the dot corresponding to this user is highlighted as converted on the resulting projection plot.
- **kwargs : optional
- Parameters for sklearn.manifold.TSNE() and umap.UMAP().
 
- Returns:
- sns.scatterplot
- Plot in the low-dimensional space for user trajectories indexed by user IDs. 
 
 
 - set_clusters(user_clusters)
- Set custom user-cluster mapping.
- Parameters:
- user_clusters : pd.Series
- The Series index corresponds to user_ids. The values are cluster_ids and must be integers. For example, with 3 clusters the possible cluster_ids are 0, 1, 2.
 
- Returns:
- Clusters
- A fitted Clusters instance.
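If a custom segmentation uses non-integer labels, it has to be converted to consecutive integer cluster_ids before being passed in. A minimal sketch of that relabeling, using a plain dict in place of a pd.Series, might look like this. The labels and user IDs are hypothetical, and wrapping the result in a Series indexed by user ID is left to the caller.

```python
# Hypothetical custom segmentation with string labels.
raw_labels = {"u1": "loyal", "u2": "churned", "u3": "loyal", "u4": "new"}

# Map each distinct label to a consecutive integer id: 0, 1, 2, ...
label_to_id = {label: i for i, label in enumerate(sorted(set(raw_labels.values())))}

# user_id -> integer cluster_id, the shape the mapping is expected to have.
user_clusters = {user: label_to_id[label] for user, label in raw_labels.items()}

print(label_to_id)
print(user_clusters)
```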
 
 
 - property cluster_mapping
- Return the previously calculated cluster_id -> list[user_ids] mapping.
- Returns:
- dict
- The keys are cluster_ids, and the values are the lists of the user_ids related to the corresponding cluster. 
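The relationship between the user-to-cluster and cluster-to-users views can be sketched by inverting a plain dict. This is only an illustration of the two shapes with hypothetical data, not how the property is computed internally.

```python
from collections import defaultdict

# Hypothetical user_id -> cluster_id mapping.
user_clusters = {"u1": 0, "u2": 2, "u3": 0, "u4": 1}

# Invert it into cluster_id -> list of user_ids.
grouped = defaultdict(list)
for user_id, cluster_id in user_clusters.items():
    grouped[cluster_id].append(user_id)
cluster_mapping = dict(grouped)

print(cluster_mapping)
```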
 
 
 - property params
- Returns the parameters used for the last fitting.
 - property user_clusters
- Returns:
- pd.Series
- user_id -> cluster_id mapping represented as a pd.Series. The index corresponds to user_ids, the values are the corresponding cluster_ids.