Eventstream core

Eventstream

class retentioneering.eventstream.eventstream.Eventstream(raw_data, raw_data_schema=None, schema=None, prepare=True, index_order=None, user_sample_size=None, user_sample_seed=None, events_order=None, custom_cols=None, add_start_end_events=True, convert_tz=None, segment_cols=None)

Collection of tools for storing and processing clickstream data.

Parameters:
raw_data: pd.DataFrame or pd.Series

Raw clickstream data.

raw_data_schema: dict or RawDataSchema, optional

Mapping rules that connect the key eventstream columns with the raw data columns. The keys are defined in RawDataSchema. The values are the corresponding column names in the raw data. The custom_cols key defines additional columns that can be used in the eventstream. See the Eventstream user guide for the details.

schema: dict or EventstreamSchema, optional

The schema of the created eventstream. The keys are defined in EventstreamSchema. The values are the names of the corresponding eventstream columns. See the Eventstream user guide for the details.

custom_cols: list of str, optional

The list of additional columns from the raw data to be included in the eventstream. If not defined, all the columns from the raw data are included.

prepare: bool, default True
  • If True, the input data is transformed as follows:

    • The event_timestamp column is converted to pandas datetime format.

    • The event_type column is added and filled with the value raw. If the column already exists, it remains unchanged.

  • If False, raw_data is left as is.

index_order: list of str, default DEFAULT_INDEX_ORDER

Sorting order for event_type column.

user_sample_size: int or float, optional

Number (int) or share (float) of all users’ trajectories that are randomly chosen and kept in the final sample; all other trajectories are removed. See numpy documentation.

user_sample_seed: int, optional

A seed value that is used to generate user samples. See numpy documentation.

events_order: list of str, optional

Sorting order for the event_name column, applied when events within a user trajectory have equal timestamps. The order of raw events is fixed once, during eventstream initialization.

add_start_end_events: bool, default True

If True, path_start and path_end synthetic events are added to each path explicitly. See also AddStartEndEvents documentation.

convert_tz: ‘local’ or ‘UTC’, optional

A timestamp column with timezones is not supported in the eventstream and must be converted explicitly.

  • If UTC, the timestamp column is converted to UTC time, and the timezone part is truncated.

  • If local, the timezone part is truncated without conversion.
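
The difference between the two convert_tz modes can be illustrated with the standard library alone. This is only a sketch of the behavior described above, not the library's implementation; the helper name strip_timezone is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def strip_timezone(ts: datetime, convert_tz: str) -> datetime:
    """Return a timezone-naive timestamp (hypothetical helper, for illustration)."""
    if convert_tz == "UTC":
        # Shift to UTC first, then drop the timezone information.
        return ts.astimezone(timezone.utc).replace(tzinfo=None)
    if convert_tz == "local":
        # Keep the wall-clock time as recorded and drop the timezone information.
        return ts.replace(tzinfo=None)
    raise ValueError("convert_tz must be 'UTC' or 'local'")


# 12:00 recorded at UTC+03:00
ts = datetime(2023, 1, 1, 12, 0, tzinfo=timezone(timedelta(hours=3)))
as_utc = strip_timezone(ts, "UTC")      # 2023-01-01 09:00:00
as_local = strip_timezone(ts, "local")  # 2023-01-01 12:00:00
```

Both results are timezone-naive; only the UTC mode changes the clock time.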

Notes

See Eventstream user guide for the details.
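
The effect of user_sample_size and user_sample_seed can be sketched as follows. The example uses Python's random module instead of numpy to stay dependency-free, so it will not reproduce the library's actual samples. The function name sample_users and the rounding of the float share are assumptions made for illustration.

```python
import random


def sample_users(events, user_sample_size, user_sample_seed=None):
    """Keep whole trajectories of a random subset of users (illustrative only)."""
    users = sorted({e["user_id"] for e in events})
    if isinstance(user_sample_size, float):
        # A float is read as a share of all users (rounded down here, by assumption).
        n_users = int(len(users) * user_sample_size)
    else:
        # An int is read as an absolute number of users.
        n_users = user_sample_size
    rng = random.Random(user_sample_seed)
    kept = set(rng.sample(users, n_users))
    # Trajectories of users outside the sample are removed entirely.
    return [e for e in events if e["user_id"] in kept]


events = [
    {"user_id": u, "event": name}
    for u in ("u1", "u2", "u3", "u4")
    for name in ("main", "cart")
]
half = sample_users(events, 0.5, user_sample_seed=42)
two = sample_users(events, 2, user_sample_seed=42)
```

With four users, a share of 0.5 and a count of 2 select the same number of users, and a fixed seed makes the selection reproducible.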

add_custom_col(name, data)

Add custom column to an existing eventstream.

Parameters:
name: str

New column name.

data: pd.Series or None
  • If pd.Series, a new column with the given values is added.

  • If None, the new column is filled with np.nan.

Returns:
Eventstream
append_eventstream(eventstream)

Append eventstream with the same schema.

Parameters:
eventstream: Eventstream
Returns:
Eventstream
Raises:
ValueError

If the EventstreamSchema objects of the two eventstreams are not equal.
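
The schema check behind this ValueError can be pictured with a small stand-in. The sketch below does not use the library: SchemaSketch and append_rows are hypothetical names, and treating "equal" as field-by-field equality is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemaSketch:
    """Stand-in for a column-name schema (not the library's EventstreamSchema)."""
    event_name: str = "event"
    event_timestamp: str = "timestamp"
    user_id: str = "user_id"
    custom_cols: List[str] = field(default_factory=list)


def append_rows(rows, schema, other_rows, other_schema):
    """Append other_rows to rows only if both schemas are equal."""
    if schema != other_schema:
        raise ValueError("schemas of the two eventstreams are not equal")
    return rows + other_rows


base = SchemaSketch()
merged = append_rows([{"user_id": 1}], base, [{"user_id": 2}], SchemaSketch())

try:
    append_rows([], base, [], SchemaSketch(event_name="action"))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```

Dataclasses compare field by field, so two schemas built with the same column names are equal even when they are distinct objects.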

copy()

Make a copy of current eventstream.

Returns:
Eventstream
index_events()

Sort and index eventstream using DEFAULT_INDEX_ORDER.

Returns:
None
to_dataframe(copy=False, drop_segment_events=True)

Convert eventstream to pd.DataFrame.

Parameters:
copy: bool, default False

If True, copy the data from the current eventstream. See details in the pandas documentation.

drop_segment_events: bool, default True

If True, remove synthetic segment events.

Returns:
pd.DataFrame

Schema

class retentioneering.eventstream.schema.EventstreamSchema(event_id='event_id', event_type='event_type', event_index='event_index', event_name='event', event_timestamp='timestamp', user_id='user_id', custom_cols=<factory>)

Define a schema for eventstream column names. If the column names differ from the default names, they need to be specified.

Parameters:
event_id: str, default “event_id”
event_type: str, default “event_type”
event_index: str, default “event_index”
event_name: str, default “event”
event_timestamp: str, default “timestamp”
user_id: str, default “user_id”
custom_cols: list of str, optional

Notes

See Eventstream user guide for the details.

class retentioneering.eventstream.schema.RawDataSchema(event_name='event', event_timestamp='timestamp', user_id='user_id', event_index=None, event_type=None, event_id=None, custom_cols=<factory>)

Define a schema for raw_data column names. If the column names differ from the default names, they need to be specified.

Parameters:
event_name: str, default “event”
event_timestamp: str, default “timestamp”
user_id: str, default “user_id”
event_type: str, optional
event_index: str, optional
event_id: str, optional
custom_cols: list, optional

Notes

See Eventstream user guide for the details.
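
As a rough usage sketch, a raw data schema can be passed to the Eventstream constructor as a plain dict whose keys are RawDataSchema fields. The raw column names (client_id, action, ts) and the helper build_eventstream are hypothetical. The imports sit inside the function so the block can be read and loaded without pandas or retentioneering installed; calling the function requires both.

```python
# Hypothetical raw column names; replace them with the ones in your data.
raw_rows = [
    {"client_id": "u1", "action": "main", "ts": "2023-01-01 10:00:00"},
    {"client_id": "u1", "action": "cart", "ts": "2023-01-01 10:01:00"},
    {"client_id": "u2", "action": "main", "ts": "2023-01-01 11:00:00"},
]

# Keys are RawDataSchema fields, values are column names in the raw data.
raw_data_schema = {
    "event_name": "action",
    "event_timestamp": "ts",
    "user_id": "client_id",
}


def build_eventstream(rows, schema):
    """Build an eventstream from raw rows (needs pandas and retentioneering)."""
    import pandas as pd
    from retentioneering.eventstream.eventstream import Eventstream

    return Eventstream(pd.DataFrame(rows), raw_data_schema=schema)
```

Calling build_eventstream(raw_rows, raw_data_schema).to_dataframe() would then return the eventstream as a pd.DataFrame with the column names defined by EventstreamSchema.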