Segments and clusters

Segments

Eventstream.segment_overview(segment_name, metrics=None, kind='heatmap', axis=0, show_plot=True)

Show a visualization with aggregated values of custom metrics by segments.

Parameters:

segment_name: str

A name of the segment.

metrics: list of tuples, optional

A list of tuples with custom metrics. Each tuple should contain three elements:

a function to calculate the metric, see Eventstream.path_metrics() for the details;
an aggregation function to be applied to the metric values;
a metric label to be displayed in the resulting table.

kind: {“heatmap”, “bar”}, default “heatmap”

Visualization option.

axis: {0, 1}, default 0

The axis for which the heatmap is to be generated.

0: for row-wise heatmap;
1: for column-wise heatmap.

Returns:

SegmentOverview: A SegmentOverview class instance fitted with given parameters.
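
The three-element metric tuples described above can be illustrated without the library. The sketch below is plain Python. Modelling a path as a list of event names and the helper names path_length, has_purchase, and aggregate_by_segment are assumptions made for illustration only; see Eventstream.path_metrics() for what the library actually passes to a metric function.

```python
from statistics import mean, median

# Hypothetical toy data: path id -> (segment value, list of event names).
paths = {
    "u1": ("mobile", ["main", "catalog", "cart"]),
    "u2": ("mobile", ["main"]),
    "u3": ("desktop", ["main", "catalog", "catalog", "cart", "purchase"]),
}


def path_length(events):
    """Metric: number of events in a path."""
    return len(events)


def has_purchase(events):
    """Metric: 1 if the path contains a purchase event, else 0."""
    return int("purchase" in events)


# Each tuple: (metric function, aggregation function, label).
metrics = [
    (path_length, median, "median path length"),
    (has_purchase, mean, "conversion rate"),
]


def aggregate_by_segment(paths, metrics):
    """Apply each metric to every path, then aggregate per segment value."""
    grouped = {}
    for segment_value, events in paths.values():
        grouped.setdefault(segment_value, []).append(events)
    # One row per metric label, one column per segment value.
    table = {}
    for metric_fn, agg_fn, label in metrics:
        table[label] = {
            segment_value: agg_fn([metric_fn(events) for events in group])
            for segment_value, group in grouped.items()
        }
    return table


overview = aggregate_by_segment(paths, metrics)
for label, row in overview.items():
    print(label, row)
```

Each label becomes one row of the resulting table, and each segment value becomes one column.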

Eventstream.segment_diff(segment_items, features, aggfunc=<function mean>, threshold=0.01, top_n=None, show_plot=True)

Show a table with the difference between a pair of segment items. Each row corresponds to a feature. The Wasserstein distance is used to measure the difference between the feature distributions of the two segment items.

Parameters:

segment_items: list: A list with segment values to be compared.
features: pd.DataFrame: A DataFrame with features to be aggregated and compared between selected segment items.
aggfunc: callable or str, default np.mean: A function to aggregate the features. If a string is passed, it should be a valid function name for the DataFrame’s agg method; see the pandas documentation for details.
threshold: float, default 0.01: A threshold to filter out the features with a small difference between the segment items.
top_n: int, optional: A number of top features to be displayed.
show_plot: bool, default True: If True, a table with the difference is shown.

Returns:

SegmentDiff: A SegmentDiff class instance fitted with given parameters.
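
To make the distance used by segment_diff concrete, here is a minimal sketch of the first-order Wasserstein distance between two one-dimensional samples, written in plain Python. It is an illustration of the general technique under the assumption that each feature is compared as a 1-D empirical distribution; this reference does not show how the library itself computes the distance, and the function name wasserstein_1d is hypothetical.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """First-order Wasserstein distance between two 1-D samples.

    Computed as the area between the two empirical CDFs.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(set(u_sorted) | set(v_sorted))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDF values on the interval [left, right).
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


# Hypothetical feature values for the two segment items being compared.
segment_a = [0, 1, 3]
segment_b = [5, 6, 8]
distance = wasserstein_1d(segment_a, segment_b)
print(distance)
```

A feature whose distance falls below threshold would be filtered out of the table, and identical distributions give a distance of zero.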

Eventstream.projection(features, method='tsne', segments=None, sample_size=None, random_state=None, show_plot=True, **kwargs)

Project a dataset to a 2D space using a manifold transformation.

Parameters:

features: pandas.DataFrame: A dataset to be projected. The index should be path ids.
method: {“umap”, “tsne”}, default “tsne”: Type of manifold transformation. See sklearn.manifold.TSNE() and umap.UMAP() for the details.
sample_size: int, optional, default=1000: The number of elements to sample.
random_state: int, optional: Use an int number to make the randomness deterministic. Calling the method multiple times with the same random_state yields the same results.
**kwargs: optional: Additional parameters for sklearn.manifold.TSNE() and umap.UMAP().

Returns:

SegmentProjection: A SegmentProjection class instance fitted with given parameters.

Eventstream.segment_map(name, index='path_id', resolve_collision=None)

Return a mapping between segment values and paths. Works with static or roughly-static segments.

Parameters:

name: str, optional: A name of the segment. If None, a mapping is returned for all segments; this works only for index="path_id".
index: {“path_id”, “segment_value”}, default “path_id”: The index of the resulting Series or DataFrame. If path_id, the index is path_id, and the values are the corresponding segment values. If segment_value, the index is segment values, and the values are lists of path_ids associated with the segment value.

Returns:

pd.Series: If name is defined.
pd.DataFrame: If name=None and index="path_id".
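
The two index options describe the same mapping viewed from opposite directions. The plain-Python sketch below shows both shapes using dictionaries in place of a pandas Series; the segment name "country", the toy data, and the helper invert_mapping are hypothetical and are not part of the library.

```python
# Hypothetical toy data: path id -> value of a static segment named "country".
# This mirrors the shape described for index="path_id".
path_to_value = {"u1": "US", "u2": "DE", "u3": "US"}


def invert_mapping(path_to_value):
    """Build the shape described for index="segment_value":
    segment value -> list of path ids."""
    value_to_paths = {}
    for path_id, value in path_to_value.items():
        value_to_paths.setdefault(value, []).append(path_id)
    return value_to_paths


value_to_paths = invert_mapping(path_to_value)
print(value_to_paths)
```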

Clusters

Eventstream.extract_features(feature_type, ngram_range=(1, 1), path_id_col=None, col_suffix=None)

Calculate a set of features for each path.

Parameters:

feature_type: {“tfidf”, “count”, “frequency”, “binary”, “markov”, “time”, “time_fraction”}

Algorithms for converting event sequences to feature vectors:

tfidf: see details in sklearn documentation.
count: see details in sklearn documentation.
frequency: similar to count, but normalized to the total number of the events in the user’s trajectory.
binary: 1 if a user had the given n-gram at least once and 0 otherwise.
markov: available for bigrams only. For a given bigram (A, B) the vectorized values are the user’s transition probabilities from A to B.
time: associated with unigrams only. The total amount of time (in seconds) spent on a given event.
time_fraction: the same as time but divided by the path duration (in seconds).

ngram_range: Tuple(int, int), default (1, 1)

The lower and upper boundary of the range of n for n-grams to be extracted. For example, ngram_range=(1, 1) means only single events, (1, 2) means single events and bigrams. Ignored for markov, time, time_fraction feature types.

path_id_col: str, optional

A column name associated with a path identifier. By default, the user column from the eventstream schema is used.

col_suffix: str, optional

A suffix added to the feature names.

Returns:

pd.DataFrame: A DataFrame with the vectorized values. The index consists of path ids, the columns relate to the n-grams.
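
The count, frequency, binary, and markov feature types, together with ngram_range, can be worked through by hand for a single path. The sketch below is plain Python and does not call the library. The helper names are hypothetical, and the markov normalization (dividing by the number of transitions leaving A) is an assumption based on the description above.

```python
from collections import Counter


def ngrams(events, n):
    """All contiguous n-grams of a path, as tuples."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]


def count_features(events, ngram_range=(1, 1)):
    """count: how many times each n-gram occurs in the path."""
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        counts.update(ngrams(events, n))
    return counts


def frequency_features(events, ngram_range=(1, 1)):
    """frequency: counts divided by the total number of events in the path."""
    counts = count_features(events, ngram_range)
    return {gram: c / len(events) for gram, c in counts.items()}


def binary_features(events, ngram_range=(1, 1)):
    """binary: 1 for every n-gram that occurs at least once."""
    return {gram: 1 for gram in count_features(events, ngram_range)}


def markov_features(events):
    """markov: transition probability from A to B, for bigrams only.

    Assumption: normalized by the number of transitions leaving A.
    """
    bigram_counts = Counter(ngrams(events, 2))
    outgoing = Counter(events[:-1])
    return {(a, b): c / outgoing[a] for (a, b), c in bigram_counts.items()}


# Hypothetical path of a single user.
path = ["main", "catalog", "main", "cart"]

counts = count_features(path, ngram_range=(1, 2))
frequencies = frequency_features(path)
binary = binary_features(path)
markov = markov_features(path)
print(counts)
print(frequencies)
print(markov)
```

With ngram_range=(1, 2) the count features contain both single events and bigrams, while markov_features ignores ngram_range, in line with the parameter description.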

Eventstream.get_clusters(X, method, n_clusters=None, scaler=None, random_state=None, segment_name='cluster_id', **kwargs)

Split paths into clusters and save their labels as a segment.

Parameters:

X: pd.DataFrame: The input data to cluster.
method: {“kmeans”, “gmm”, “hdbscan”}: The clustering method to use.
n_clusters: int, optional: The number of clusters to form. Applies to the kmeans and gmm methods only. If n_clusters=None and method="kmeans", the elbow curve chart is displayed.
scaler: {“minmax”, “std”}, optional: The scaling method to apply to the data before clustering. If None, no scaling is applied.
random_state: int, optional: A seed used by the random number generator for reproducibility.
segment_name: str, default “cluster_id”: The name of the segment that will contain the cluster labels.
**kwargs: Additional keyword arguments to pass to the clustering methods.

Returns:

Eventstream or None: If n_clusters is specified, a new Eventstream object with clusters integrated as a segment segment_name is returned; otherwise, returns None.

Eventstream.clusters_overview(segment_name, features, aggfunc='mean', scaler='minmax', metrics=None, axis=1, show_plot=True)

Show a heatmap table with aggregated values of features and custom metrics by clusters.

Parameters:

segment_name: str

A name of the segment containing cluster labels.

features: pd.DataFrame

A DataFrame with features to be aggregated. The DataFrame’s index should be path ids.

aggfunc: callable or str, default “mean”

A function to aggregate the features. If a string is passed, it should be a valid function name for the DataFrame’s agg method; see the pandas documentation for details.

scaler: {“minmax”, “std”}, default “minmax”

A scaler to normalize the features before the aggregation. Available scalers:

minmax: MinMaxScaler.
std: StandardScaler.

metrics: list of tuples, optional

A list of tuples with custom metrics. Each tuple should contain three elements:

a function to calculate the metric, see Eventstream.path_metrics() for the details;
an aggregation function to be applied to the metric values, same as aggfunc;
a metric label to be displayed in the resulting table.

axis: {0, 1}, default 1

The axis for which the heatmap is to be generated.

1: for row-wise heatmap;
0: for column-wise heatmap. Custom metrics coloring is ignored in this case.

show_plot: bool, default True

If True, a heatmap is shown.

Returns:

SegmentOverview: A SegmentOverview class instance fitted with given parameters.
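
The effect of the scaler option on the aggregated values can be sketched in plain Python. The example assumes that minmax and std behave like scikit-learn's MinMaxScaler and StandardScaler, as the parameter description names them; the helper functions and the toy data are hypothetical and do not call the library.

```python
from statistics import mean, pstdev


def minmax_scale(values):
    """Rescale to [0, 1], as MinMaxScaler does with default settings."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def std_scale(values):
    """Center to zero mean and unit population standard deviation,
    as StandardScaler does with default settings."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def aggregate_by_cluster(values, labels, aggfunc=mean):
    """Aggregate one scaled feature per cluster label."""
    grouped = {}
    for value, label in zip(values, labels):
        grouped.setdefault(label, []).append(value)
    return {label: aggfunc(group) for label, group in grouped.items()}


# Hypothetical feature values for four paths and their cluster labels.
feature = [2, 4, 6, 10]
cluster_ids = [0, 0, 1, 1]

scaled_minmax = minmax_scale(feature)
scaled_std = std_scale(feature)
by_cluster = aggregate_by_cluster(scaled_minmax, cluster_ids)
print(by_cluster)
```

Scaling before aggregation puts features measured in different units on a comparable range, so that the cells of one heatmap can be compared with each other.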