Segments and clusters

Segments

Eventstream.segment_overview(segment_name, metrics=None, kind='heatmap', axis=0, show_plot=True)

Show a visualization with aggregated values of custom metrics by segments.

Parameters:
segment_name: str

A name of the segment.

metrics: list of tuples, optional

A list of tuples with custom metrics. Each tuple should contain three elements:

  • a function to calculate the metric, see Eventstream.path_metrics() for the details;

  • an aggregation function to be applied to the metric values;

  • a metric label to be displayed in the resulting table.

kind: {“heatmap”, “bar”}, default “heatmap”

Visualization option.

axis: {0, 1}, default 0

The axis for which the heatmap is to be generated.

  • 0: for row-wise heatmap;

  • 1: for column-wise heatmap.

Returns:
SegmentOverview

A SegmentOverview class instance fitted with given parameters.

Eventstream.segment_diff(segment_items, features, aggfunc=<function mean>, threshold=0.01, top_n=None, show_plot=True)

Show a table with the difference between a pair of segment items. Each row corresponds to a feature. The Wasserstein distance is used to measure the difference between the feature distributions of the two segment items.

Parameters:
segment_items: list

A list with segment values to be compared.

features: pd.DataFrame

A DataFrame with features to be aggregated and compared between selected segment items.

aggfunc: callable or str, default np.mean

A function to aggregate the features. If a string is passed, it should be a valid function name for the DataFrame’s agg method; see the pandas documentation for details.

threshold: float, default 0.01

A threshold to filter out the features with a small difference between the segment items.

top_n: int, optional

The number of top features to display.

show_plot: bool, default True

If True, a table with the difference is shown.

Returns:
SegmentDiff

A SegmentDiff class instance fitted with given parameters.
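
To give a feel for the distance mentioned in the description, here is a minimal, library-independent sketch of the one-dimensional Wasserstein distance between two empirical samples. It is not the library's implementation: the helper name `wasserstein_1d` and the sample values are hypothetical, and the sketch only assumes that each feature is compared as a pair of one-dimensional distributions.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """1D Wasserstein-1 distance between two empirical samples.

    Integrates |F_x(t) - F_y(t)| over t, where F_x and F_y are the
    empirical CDFs of the two samples.
    """
    xs, ys = sorted(xs), sorted(ys)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both empirical CDFs are constant on the interval [left, right).
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


# Hypothetical values of one feature for two segment items.
segment_a = [1.0, 2.0, 3.0]
segment_b = [2.0, 3.0, 4.0]
distance = wasserstein_1d(segment_a, segment_b)
```

In this toy case every value in the second sample is shifted by one unit, so the distance is one unit. Under this reading, a feature whose distance falls below `threshold` would be treated as showing only a small difference.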

Eventstream.projection(features, method='tsne', segments=None, sample_size=None, random_state=None, show_plot=True, **kwargs)

Project a dataset to a 2D space using a manifold transformation.

Parameters:
features: pandas.DataFrame

A dataset to be projected. The index should be path ids.

method: {“umap”, “tsne”}, default “tsne”

Type of manifold transformation. See sklearn.manifold.TSNE() and umap.UMAP() for the details.

sample_size: int, optional, default=1000

The number of elements to sample.

random_state: int, optional

Use an int number to make the randomness deterministic. Calling the method multiple times with the same random_state yields the same results.

**kwargs: optional

Additional parameters for sklearn.manifold.TSNE() or umap.UMAP(), depending on the chosen method.

Returns:
SegmentProjection

A SegmentProjection class instance fitted with given parameters.

Eventstream.segment_map(name, index='path_id', resolve_collision=None)

Return a mapping between segment values and paths. Works with static or roughly-static segments.

Parameters:
name: str, optional

A name of the segment. If None, a mapping is returned for all segments; this works only for index="path_id".

index: {“path_id”, “segment_value”}, default “path_id”

The index of the resulting Series or DataFrame. If path_id, the index is path_id, and the values are the corresponding segment values. If segment_value, the index is segment values, and the values are lists of path_ids associated with the segment value.

Returns:
pd.Series

If name is defined.

pd.DataFrame

If name=None and index="path_id".
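
The two index orientations can be pictured with plain dictionaries. This is only an illustrative sketch with hypothetical path ids and segment values; the method itself returns a pandas Series or DataFrame, as described above.

```python
from collections import defaultdict

# Hypothetical static segment: each path id maps to one segment value
# (the orientation described for index="path_id").
by_path_id = {"user_1": "US", "user_2": "DE", "user_3": "US"}

# The same mapping keyed by segment value: each value lists its path ids
# (the orientation described for index="segment_value").
grouped = defaultdict(list)
for path_id, segment_value in by_path_id.items():
    grouped[segment_value].append(path_id)
by_segment_value = dict(grouped)
```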

Clusters

Eventstream.extract_features(feature_type, ngram_range=(1, 1), path_id_col=None, col_suffix=None)

Calculate a set of features for each path.

Parameters:
feature_type: {“tfidf”, “count”, “frequency”, “binary”, “markov”, “time”, “time_fraction”}

Algorithms for converting event sequences to feature vectors:

  • tfidf: see the sklearn documentation for details.

  • count: see the sklearn documentation for details.

  • frequency: similar to count, but normalized by the total number of events in the user’s trajectory.

  • binary: 1 if a user had the given n-gram at least once and 0 otherwise.

  • markov: available for bigrams only. For a given bigram (A, B), the vectorized values are the user’s transition probabilities from A to B.

  • time: associated with unigrams only. The total amount of time (in seconds) spent on a given event.

  • time_fraction: the same as time, but divided by the path duration (in seconds).

ngram_range: Tuple(int, int), default (1, 1)

The lower and upper boundary of the range of n for n-grams to be extracted. For example, ngram_range=(1, 1) means only single events, (1, 2) means single events and bigrams. Ignored for markov, time, time_fraction feature types.

path_id_col: str, optional

A column name associated with a path identifier. By default, the user column from the eventstream schema is used.

col_suffix: str, optional

A suffix added to the feature names.

Returns:
pd.DataFrame

A DataFrame with the vectorized values. The index consists of path ids, the columns relate to the n-grams.
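
As a rough illustration of how several of the feature types above relate to each other, the sketch below vectorizes a single hypothetical path in plain Python. It follows the descriptions literally and is not the library's code: the helper `ngrams`, the event names, and the dictionary-based output are assumptions made for the example, and the tfidf, time and time_fraction types are left out.

```python
from collections import Counter


def ngrams(events, n_min, n_max):
    """Return all n-grams of the event list for n in [n_min, n_max]."""
    return [
        tuple(events[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(events) - n + 1)
    ]


# Hypothetical path of one user.
path = ["main", "catalog", "main", "catalog", "cart"]

# count with ngram_range=(1, 2): occurrences of each unigram and bigram.
count = Counter(ngrams(path, 1, 2))

# frequency: counts normalized by the total number of events in the path.
frequency = {gram: c / len(path) for gram, c in count.items()}

# binary: 1 for every n-gram seen at least once (absent n-grams would be 0).
binary = {gram: 1 for gram in count}

# markov: for a bigram (A, B), the share of transitions leaving A that go to B.
bigrams = Counter(ngrams(path, 2, 2))
outgoing = Counter(a for a, _ in ngrams(path, 2, 2))
markov = {(a, b): c / outgoing[a] for (a, b), c in bigrams.items()}
```

Here the user always moves from main to catalog, so that transition gets probability 1.0, while the transitions out of catalog are split evenly between main and cart.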

Eventstream.get_clusters(X, method, n_clusters=None, scaler=None, random_state=None, segment_name='cluster_id', **kwargs)

Split paths into clusters and save their labels as a segment.

Parameters:
X: pd.DataFrame

The input data to cluster.

method: {“kmeans”, “gmm”, “hdbscan”}

The clustering method to use.

n_clusters: int, optional

The number of clusters to form. Applies to the kmeans and gmm methods only. If n_clusters=None and method="kmeans", the elbow curve chart is displayed.

scaler: {“minmax”, “std”}, optional

The scaling method to apply to the data before clustering. If None, no scaling is applied.

random_state: int, optional

A seed used by the random number generator for reproducibility.

segment_name: str, default “cluster_id”

The name of the segment that will contain the cluster labels.

**kwargs

Additional keyword arguments to pass to the clustering methods.

Returns:
Eventstream or None

If n_clusters is specified, a new Eventstream object with clusters integrated as a segment segment_name is returned; otherwise, returns None.

Eventstream.clusters_overview(segment_name, features, aggfunc='mean', scaler='minmax', metrics=None, axis=1, show_plot=True)

Show a heatmap table with aggregated values of features and custom metrics by clusters.

Parameters:
segment_name: str

A name of the segment containing cluster labels.

features: pd.DataFrame

A DataFrame with features to be aggregated. The DataFrame’s index should be path ids.

aggfunc: callable or str, default “mean”

A function to aggregate the features. If a string is passed, it should be a valid function name for the DataFrame’s agg method; see the pandas documentation for details.

scaler: {“minmax”, “std”}, default “minmax”

A scaler to normalize the features before the aggregation. Available scalers:

  • minmax: MinMaxScaler.

  • std: StandardScaler.

metrics: list of tuples, optional

A list of tuples with custom metrics. Each tuple should contain three elements:

  • a function to calculate the metric, see Eventstream.path_metrics() for the details;

  • an aggregation function to be applied to the metric values, specified in the same way as aggfunc;

  • a metric label to be displayed in the resulting table.

axis: {0, 1}, default 1

The axis for which the heatmap is to be generated.

  • 1: for row-wise heatmap;

  • 0: for column-wise heatmap. Custom metrics coloring is ignored in this case.

show_plot: bool, default True

If True, a heatmap is shown.

Returns:
SegmentOverview

A SegmentOverview class instance fitted with given parameters.
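
The scale-then-aggregate idea behind the overview can be sketched without the library. The example below is an assumption-laden illustration, not the method's implementation: the helpers `minmax` and `standardize`, the path ids, and the feature values are hypothetical, and the scalers are written to mirror the usual definitions of MinMaxScaler and StandardScaler.

```python
from statistics import mean, pstdev


def minmax(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def standardize(values):
    """Center to zero mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


# Hypothetical feature values per path and the cluster label of each path.
feature = {"user_1": 2.0, "user_2": 4.0, "user_3": 10.0, "user_4": 6.0}
cluster = {"user_1": 0, "user_2": 0, "user_3": 1, "user_4": 1}

# scaler="minmax": normalize the feature across all paths first.
scaled = dict(zip(feature, minmax(list(feature.values()))))

# aggfunc="mean": average the scaled feature within each cluster.
overview = {
    label: mean(scaled[p] for p in feature if cluster[p] == label)
    for label in sorted(set(cluster.values()))
}
```

Scaling before aggregation puts features with different units on a comparable footing, so that cell colors in a heatmap can be compared across features.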