Segments and clusters
Segments
- Eventstream.segment_overview(segment_name, metrics=None, kind='heatmap', axis=0, show_plot=True)
Show a visualization with aggregated values of custom metrics by segments.
- Parameters:
- segment_name : str
The name of the segment.
- metrics : list of tuples, optional
A list of tuples with custom metrics. Each tuple should contain three elements:
a function to calculate the metric, see Eventstream.path_metrics() for the details;
an aggregation function to be applied to the metric values;
a metric label to be displayed in the resulting table.
- kind : {"heatmap", "bar"}, default "heatmap"
Visualization option.
- axis : {0, 1}, default 0
The axis along which the heatmap is generated.
0 : row-wise heatmap,
1 : column-wise heatmap.
- Returns:
- SegmentOverview
A SegmentOverview class instance fitted with the given parameters.
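The structure of the metrics tuples can be illustrated with a minimal plain-Python sketch. It is not the library's implementation: the toy `paths` and `segment_of` mappings and the `overview` helper are hypothetical, and each metric function here receives a single path as a list of event names, which may differ from the callable signature described in Eventstream.path_metrics().

```python
from statistics import mean, median

# Hypothetical toy data: path_id -> ordered event names.
paths = {
    "u1": ["main", "catalog", "cart", "purchase"],
    "u2": ["main", "catalog"],
    "u3": ["main", "cart", "purchase"],
    "u4": ["main"],
}
# Hypothetical static segment: path_id -> segment value.
segment_of = {"u1": "mobile", "u2": "desktop", "u3": "mobile", "u4": "desktop"}

# Each tuple: (metric function, aggregation function, label).
metrics = [
    (len, mean, "mean_path_length"),
    (lambda path: int("purchase" in path), mean, "conversion_rate"),
    (lambda path: len(set(path)), median, "median_unique_events"),
]


def overview(paths, segment_of, metrics):
    """Return {label: {segment_value: aggregated metric value}}."""
    table = {}
    for metric_fn, agg_fn, label in metrics:
        by_segment = {}
        for path_id, events in paths.items():
            # Compute the metric per path, then group it by segment value.
            by_segment.setdefault(segment_of[path_id], []).append(metric_fn(events))
        table[label] = {seg: agg_fn(vals) for seg, vals in by_segment.items()}
    return table


result = overview(paths, segment_of, metrics)
for label, row in result.items():
    print(label, row)
```

Each label becomes one row of the resulting table and each segment value becomes one column, which is the shape that a heatmap or a bar chart would then visualize.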
- Eventstream.segment_diff(segment_items, features, aggfunc=<function mean>, threshold=0.01, top_n=None, show_plot=True)
Show a table with the differences between a pair of segment items. Each row corresponds to a feature. The Wasserstein distance is used to measure the difference between the feature distributions of the two segment items.
- Parameters:
- segment_items : list
A list of segment values to be compared.
- features : pd.DataFrame
A DataFrame with features to be aggregated and compared between the selected segment items.
- aggfunc : callable, default np.mean
A function to aggregate the features. If a string is passed, it should be a valid function name for the DataFrame's agg method; see the pandas documentation for the details.
- threshold : float, default 0.01
A threshold used to filter out the features with a small difference between the segment items.
- top_n : int, optional
The number of top features to be displayed.
- show_plot : bool, default True
If True, a table with the differences is shown.
- Returns:
- SegmentDiff
A SegmentDiff class instance fitted with the given parameters.
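To make the role of the Wasserstein distance, threshold and top_n concrete, here is a minimal plain-Python sketch. It is an illustration under assumptions, not the library's code: `wasserstein_1d`, `diff_table` and the toy feature values are hypothetical, and both the descending sort and the "keep if distance >= threshold" rule are guesses about how the filtering might behave.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """1-D Wasserstein distance between two empirical samples.

    Computed as the integral of |CDF_u(x) - CDF_v(x)| over x.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(u_sorted + v_sorted)
    distance = 0.0
    for left, right in zip(points, points[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance


# Hypothetical per-path feature values for two segment items.
features_a = {"cart": [0, 1, 3], "search": [2, 2, 2]}
features_b = {"cart": [5, 6, 8], "search": [2, 2, 2]}


def diff_table(features_a, features_b, threshold=0.01, top_n=None):
    """Return (feature, distance) rows, largest distance first."""
    rows = [
        (name, wasserstein_1d(features_a[name], features_b[name]))
        for name in features_a
    ]
    # Assumption: features whose distance is below the threshold are dropped.
    rows = [row for row in rows if row[1] >= threshold]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:top_n]  # top_n=None keeps every remaining row


diff = diff_table(features_a, features_b)
print(diff)
```

In this toy data the "search" feature has identical distributions in both segment items, so its distance is zero and the threshold removes it from the table.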
- Eventstream.projection(features, method='tsne', segments=None, sample_size=None, random_state=None, show_plot=True, **kwargs)
Project a dataset to a 2D space using a manifold transformation.
- Parameters:
- features : pandas.DataFrame
A dataset to be projected. The index should be path ids.
- method : {"umap", "tsne"}, default "tsne"
Type of manifold transformation. See sklearn.manifold.TSNE() and umap.UMAP() for the details.
- sample_size : int, optional, default=1000
The number of elements to sample.
- random_state : int, optional
Use an int number to make the randomness deterministic. Calling the method multiple times with the same random_state yields the same results.
- **kwargs : optional
Additional parameters for sklearn.manifold.TSNE() and umap.UMAP().
- Returns:
- SegmentProjection
A SegmentProjection class instance fitted with the given parameters.
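The interplay of sample_size and random_state can be sketched in plain Python. This only illustrates seeded, reproducible subsampling; `sample_path_ids` is a hypothetical helper, and whether the library samples path ids in exactly this way before projecting is an assumption.

```python
import random


def sample_path_ids(path_ids, sample_size=None, random_state=None):
    """Return a reproducible subsample of path ids.

    All ids are returned when sample_size is None or not smaller
    than the number of available ids.
    """
    path_ids = list(path_ids)
    if sample_size is None or sample_size >= len(path_ids):
        return path_ids
    # A local generator keeps the global random state untouched.
    rng = random.Random(random_state)
    return rng.sample(path_ids, sample_size)


path_ids = [f"user_{i}" for i in range(100)]
first = sample_path_ids(path_ids, sample_size=10, random_state=42)
second = sample_path_ids(path_ids, sample_size=10, random_state=42)
print(first == second)
```

Fixing the seed makes the subsample identical across calls, which is a precondition for getting the same 2D projection twice.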
- Eventstream.segment_map(name, index='path_id', resolve_collision=None)
Return a mapping between segment values and paths. Works with static or roughly static segments.
- Parameters:
- name : str, optional
The name of the segment. If None, the mapping is returned for all segments; this works only for index="path_id".
- index : {"path_id", "segment_value"}, default "path_id"
The index of the resulting Series or DataFrame. If path_id, the index is path_id, and the values are the corresponding segment values. If segment_value, the index is segment values, and the values are lists of path_ids associated with the segment value.
- Returns:
- pd.Series
If name is defined.
- pd.DataFrame
If name=None and index="path_id".
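The two index options describe the same mapping viewed from opposite directions. A minimal plain-Python sketch of that relationship follows; plain dicts stand in for the returned pandas objects, and `invert_mapping` and the toy data are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping in the index="path_id" orientation:
# path_id -> segment value.
path_to_segment = {"u1": "mobile", "u2": "desktop", "u3": "mobile"}


def invert_mapping(path_to_segment):
    """Build the index="segment_value" orientation:
    segment value -> list of path ids."""
    grouped = defaultdict(list)
    for path_id, value in path_to_segment.items():
        grouped[value].append(path_id)
    return dict(grouped)


segment_to_paths = invert_mapping(path_to_segment)
print(segment_to_paths)
```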
Clusters
- Eventstream.extract_features(feature_type, ngram_range=(1, 1), path_id_col=None, col_suffix=None)
Calculate a set of features for each path.
- Parameters:
- feature_type : {"tfidf", "count", "frequency", "binary", "markov", "time", "time_fraction"}
Algorithms for converting event sequences to feature vectors:
tfidf : see details in the sklearn documentation.
count : see details in the sklearn documentation.
frequency : similar to count, but normalized by the total number of events in the user's trajectory.
binary : 1 if a user had the given n-gram at least once and 0 otherwise.
markov : available for bigrams only. For a given bigram (A, B), the vectorized values are the user's transition probabilities from A to B.
time : associated with unigrams only. The total amount of time (in seconds) spent on a given event.
time_fraction : the same as time, but divided by the path duration (in seconds).
- ngram_range : Tuple(int, int), default (1, 1)
The lower and upper boundaries of the range of n for the n-grams to be extracted. For example, ngram_range=(1, 1) means only single events, and (1, 2) means single events and bigrams. Ignored for the markov, time, and time_fraction feature types.
- path_id_col : str, optional
A column name associated with a path identifier. The default value is linked to the user column from the eventstream schema.
- col_suffix : str, optional
A suffix added to the feature names.
- Returns:
- pd.DataFrame
A DataFrame with the vectorized values. The index consists of path ids, and the columns relate to the n-grams.
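The count, frequency, binary and markov feature types can be sketched for a single path in plain Python. This is a hedged illustration rather than the library's vectorizer: `ngrams` and `extract` are hypothetical helpers, frequency is normalized by the number of events in the path as the description states, and markov probabilities are estimated as the share of transitions leaving A that go to B.

```python
from collections import Counter


def ngrams(events, n):
    """All contiguous n-grams of a path, as tuples."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]


def extract(events, feature_type, ngram_range=(1, 1)):
    """Return {n-gram: value} for one path."""
    if feature_type == "markov":
        # Bigrams only: P(A -> B) = count(A, B) / transitions leaving A.
        bigrams = Counter(ngrams(events, 2))
        outgoing = Counter(events[:-1])
        return {(a, b): cnt / outgoing[a] for (a, b), cnt in bigrams.items()}

    low, high = ngram_range
    counts = Counter(g for n in range(low, high + 1) for g in ngrams(events, n))
    if feature_type == "count":
        return dict(counts)
    if feature_type == "binary":
        return {gram: 1 for gram in counts}
    if feature_type == "frequency":
        return {gram: cnt / len(events) for gram, cnt in counts.items()}
    raise ValueError(f"unsupported feature_type in this sketch: {feature_type}")


path = ["main", "catalog", "main", "cart"]
counts = extract(path, "count")
frequency = extract(path, "frequency")
bigram_counts = extract(path, "count", ngram_range=(1, 2))
markov = extract(path, "markov")
print(counts, frequency, markov, sep="\n")
```

Applying such a function to every path and aligning the keys as columns would give a table with path ids as the index and n-grams as columns, similar in shape to the DataFrame described above.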
- Eventstream.get_clusters(X, method, n_clusters=None, scaler=None, random_state=None, segment_name='cluster_id', **kwargs)
Split paths into clusters and save their labels as a segment.
- Parameters:
- X : pd.DataFrame
The input data to cluster.
- method : {"kmeans", "gmm", "hdbscan"}
The clustering method to use.
- n_clusters : int, optional
The number of clusters to form. Relevant for the kmeans and gmm methods. If n_clusters=None and method="kmeans", the elbow curve chart is displayed.
- scaler : {"minmax", "std"}, optional
The scaling method to apply to the data before clustering. If None, no scaling is applied.
- random_state : int, optional
A seed used by the random number generator for reproducibility.
- segment_name : str, default "cluster_id"
The name of the segment that will contain the cluster labels.
- **kwargs
Additional keyword arguments to pass to the clustering methods.
- Returns:
- Eventstream or None
If n_clusters is specified, a new Eventstream object with the clusters integrated as a segment named segment_name is returned; otherwise, None is returned.
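The idea of clustering feature vectors and keeping the labels as a path-level segment can be shown with a minimal plain-Python k-means (Lloyd's algorithm). This is a teaching sketch only, not the library's clustering backend: `kmeans_labels`, the toy `features` and the `cluster_segment` dict are hypothetical names.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_labels(points, n_clusters, random_state=None, max_iter=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(random_state)
    centers = rng.sample(points, n_clusters)  # seeded initialization
    labels = []
    for _ in range(max_iter):
        # Assignment step: the nearest center wins.
        labels = [
            min(range(n_clusters), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels


# Hypothetical feature vectors, keyed by path id.
features = {
    "u1": (0, 0), "u2": (0, 1), "u3": (1, 0),
    "u4": (10, 10), "u5": (10, 11), "u6": (11, 10),
}
labels = kmeans_labels(list(features.values()), n_clusters=2, random_state=0)
# The labels form a path-level segment: path_id -> cluster label.
cluster_segment = dict(zip(features, labels))
print(cluster_segment)
```

The resulting path_id to label mapping plays the role of the segment named by segment_name; the numeric label values themselves are arbitrary.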
- Eventstream.clusters_overview(segment_name, features, aggfunc='mean', scaler='minmax', metrics=None, axis=1, show_plot=True)
Show a heatmap table with aggregated values of features and custom metrics by clusters.
- Parameters:
- segment_name : str
The name of the segment containing cluster labels.
- features : pd.DataFrame
A DataFrame with features to be aggregated. The DataFrame's index should be path ids.
- aggfunc : callable or str, default "mean"
A function to aggregate the features. If a string is passed, it should be a valid function name for the DataFrame's agg method; see the pandas documentation for the details.
- scaler : {"minmax", "std"}, default "minmax"
A scaler to normalize the features before the aggregation. Available scalers:
minmax : MinMaxScaler.
std : StandardScaler.
- metrics : list of tuples, optional
A list of tuples with custom metrics. Each tuple should contain three elements:
a function to calculate the metric, see Eventstream.path_metrics() for the details;
an aggregation function to be applied to the metric values, same as aggfunc;
a metric label to be displayed in the resulting table.
- axis : {0, 1}, default 1
The axis along which the heatmap is generated.
1 : row-wise heatmap,
0 : column-wise heatmap. Custom metrics coloring is ignored in this case.
- show_plot : bool, default True
If True, a heatmap is shown.
- Returns:
- SegmentOverview
A SegmentOverview class instance fitted with the given parameters.
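The "scale first, then aggregate by cluster" order can be sketched in plain Python. This is an illustration under assumptions, not the library's code: `minmax_scale`, `clusters_table` and the toy data are hypothetical, and only the minmax scaler is shown.

```python
from statistics import mean

# Hypothetical features per path and cluster labels per path.
features = {
    "u1": {"cart": 0, "search": 10},
    "u2": {"cart": 2, "search": 20},
    "u3": {"cart": 4, "search": 30},
    "u4": {"cart": 4, "search": 10},
}
cluster_of = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}


def minmax_scale(features):
    """Rescale every feature column to the [0, 1] range."""
    names = list(next(iter(features.values())))
    scaled = {path_id: {} for path_id in features}
    for name in names:
        column = [row[name] for row in features.values()]
        low, high = min(column), max(column)
        span = (high - low) or 1  # constant columns map to 0
        for path_id, row in features.items():
            scaled[path_id][name] = (row[name] - low) / span
    return scaled


def clusters_table(features, cluster_of, aggfunc=mean):
    """Return {feature: {cluster: aggregated scaled value}}."""
    grouped = {}
    for path_id, row in minmax_scale(features).items():
        for name, value in row.items():
            grouped.setdefault(name, {}).setdefault(cluster_of[path_id], []).append(value)
    return {
        name: {cluster: aggfunc(values) for cluster, values in by_cluster.items()}
        for name, by_cluster in grouped.items()
    }


table = clusters_table(features, cluster_of)
print(table)
```

Because every feature is rescaled to a common range before aggregation, the aggregated values of different features become comparable within a single heatmap.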