SplitSessions

Data processor

class retentioneering.data_processors_lib.split_sessions.SplitSessions(params)

Create new synthetic events that divide users’ paths into sessions: session_start (or session_start_cropped) and session_end (or session_end_cropped). Also create a new column that contains the session number for each event in the input eventstream. The session number takes the form {user_id}_{session_number through one user path}.

Parameters:
timeout : Tuple(float, DATETIME_UNITS), optional

Threshold value and its unit of measure. session_start and session_end events are always placed before the first and after the last event in each user’s path. Because a user can have more than one session, the timedelta between every two consecutive events in each user’s path is calculated. If that timedelta exceeds the selected timeout, new synthetic session_end and session_start events are created inside the user path, marking where one session ends and the next one starts.
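The timeout rule described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper name assign_session_ids is hypothetical, the events are assumed to be sorted by user and then by timestamp, and session numbering is assumed to start at 1, as in the Examples section.

```python
from datetime import datetime, timedelta


def assign_session_ids(events, timeout):
    """Label each (user_id, timestamp) pair with a session id.

    Hypothetical helper: `events` must be sorted by user, then by timestamp.
    """
    session_ids = []
    prev_user, prev_ts, counter = None, None, 0
    for user_id, ts in events:
        if user_id != prev_user:
            counter = 1          # first session of a new user path
        elif ts - prev_ts > timeout:
            counter += 1         # gap exceeds the timeout: a new session starts
        session_ids.append(f"{user_id}_{counter}")
        prev_user, prev_ts = user_id, ts
    return session_ids


events = [
    (1, datetime(2023, 1, 1, 0, 0)),
    (1, datetime(2023, 1, 1, 0, 10)),
    (1, datetime(2023, 1, 1, 1, 0)),   # 50 minutes after the previous event
    (2, datetime(2023, 1, 1, 0, 5)),
]
session_ids = assign_session_ids(events, timeout=timedelta(minutes=30))
print(session_ids)  # ['1_1', '1_1', '1_2', '2_1']
```

Only gaps strictly greater than the timeout open a new session here, matching the "more than selected timeout" wording above.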

delimiter_events : list of str, optional

Delimiters define special events in the eventstream that indicate the start and the end of a session.

  • If a single delimiter is defined, it marks both the end of the previous session and the start of the next one. Each delimiting event is replaced with a “session_start” event.

  • If a list of two delimiters is defined, the first and second events mark the session start and end respectively. Delimiting events are replaced with “session_start” and “session_end” events.
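The two delimiter modes can be sketched in plain Python for a single user's path. This is an illustration only, not the library's code: both function names are hypothetical, the path is assumed to be ordered and to begin with a delimiter, and placing session_end at the timestamp of the last event before the next delimiter is inferred from the single-delimiter example in the Examples section.

```python
def split_by_single_delimiter(path, delimiter):
    """Rewrite one user's ordered path of (event, timestamp) pairs.

    Hypothetical helper; assumes the path starts with the delimiter.
    """
    result = []
    last_ts = None  # timestamp of the latest event seen so far
    for event, ts in path:
        if event == delimiter:
            if last_ts is not None:
                # Close the previous session at its last event's timestamp.
                result.append(("session_end", last_ts))
            result.append(("session_start", ts))
        else:
            result.append((event, ts))
        last_ts = ts
    if last_ts is not None:
        result.append(("session_end", last_ts))  # close the final session
    return result


def split_by_delimiter_pair(path, start, end):
    """Rename explicit start/end delimiters to the synthetic event names."""
    rename = {start: "session_start", end: "session_end"}
    return [(rename.get(event, event), ts) for event, ts in path]


path = [
    ("session_delimiter", "2023-01-01 00:00:00"),
    ("A", "2023-01-01 00:00:01"),
    ("B", "2023-01-01 00:00:02"),
    ("session_delimiter", "2023-01-01 00:00:04"),
    ("C", "2023-01-01 00:00:04"),
]
single = split_by_single_delimiter(path, "session_delimiter")
for row in single:
    print(row)
```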

delimiter_col : list, optional

Specifies a column that already contains custom session identifiers.

mark_truncated : bool, default False

Works with the timeout argument only. If True, calculates the timedelta between:

  • the first event in each user’s path and the first event in the whole eventstream;

  • the last event in each user’s path and the last event in the whole eventstream.

For users whose timedelta is less than the selected timeout, a synthetic session_start_cropped or session_end_cropped event is added.
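As a rough illustration of the rule above, the following pure-Python sketch decides which boundary marker would apply to each user. It is a sketch under stated assumptions, not the library's implementation: the helper name boundary_event_names and the dict-of-timestamps input are hypothetical, and it assumes the cropped marker is used in place of the regular one.

```python
from datetime import datetime, timedelta


def boundary_event_names(paths, timeout):
    """Pick start/end marker names for each user.

    Hypothetical helper: `paths` maps user_id to a sorted, non-empty
    list of timestamps.
    """
    stream_start = min(ts[0] for ts in paths.values())
    stream_end = max(ts[-1] for ts in paths.values())
    names = {}
    for user_id, ts in paths.items():
        # A path that begins too close to the start of the data may be truncated.
        start = "session_start_cropped" if ts[0] - stream_start < timeout else "session_start"
        # Likewise for a path that ends too close to the end of the data.
        end = "session_end_cropped" if stream_end - ts[-1] < timeout else "session_end"
        names[user_id] = (start, end)
    return names


paths = {
    "a": [datetime(2023, 1, 1, 0, 0), datetime(2023, 1, 1, 0, 20)],
    "b": [datetime(2023, 1, 1, 1, 0), datetime(2023, 1, 1, 3, 0)],
}
names = boundary_event_names(paths, timeout=timedelta(minutes=30))
print(names["a"])  # ('session_start_cropped', 'session_end')
print(names["b"])  # ('session_start', 'session_end_cropped')
```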

session_col : str, default “session_id”

The name of the new column that stores the session identifiers.

Returns:
Eventstream

Eventstream with new synthetic events and session_col.

event_name             event_type             timestamp
session_start          session_start          first_event
session_end            session_end            last_event
session_start_cropped  session_start_cropped  first_event
session_end_cropped    session_end_cropped    last_event

If the delta between the timestamps of two consecutive events (raw_event_n and raw_event_n+1) is greater than the selected timeout, the user will have more than one session:

user_id  event_name     event_type     timestamp      session_col
1        session_start  session_start  first_event    1_0
1        session_end    session_end    raw_event_n    1_0
1        session_start  session_start  raw_event_n+1  1_1
1        session_end    session_end    last_event     1_1

See also

TimedeltaHist

Plot the distribution of the time deltas between two events.

Eventstream.describe

Show general eventstream statistics.

Eventstream.describe_events

Show general eventstream events statistics.

Notes

See Data processors user guide for the details.

Examples

Splitting with a single delimiting event.

import pandas as pd
from retentioneering.eventstream import Eventstream

df = pd.DataFrame(
    [
        [111, "session_delimiter", "2023-01-01 00:00:00"],
        [111, "A", "2023-01-01 00:00:01"],
        [111, "B", "2023-01-01 00:00:02"],
        [111, "session_delimiter", "2023-01-01 00:00:04"],
        [111, "C", "2023-01-01 00:00:04"],
    ],
    columns=["user_id", "event", "timestamp"]
)
Eventstream(df)\
    .split_sessions(delimiter_events=["session_delimiter"])\
    .to_dataframe()\
    .sort_values(["user_id", "event_index"])\
    [["user_id", "event", "timestamp", "session_id"]]

   user_id          event           timestamp session_id
0      111  session_start 2023-01-01 00:00:00      111_1
1      111              A 2023-01-01 00:00:01      111_1
2      111              B 2023-01-01 00:00:02      111_1
3      111    session_end 2023-01-01 00:00:02      111_1
4      111  session_start 2023-01-01 00:00:04      111_2
5      111              C 2023-01-01 00:00:04      111_2
6      111    session_end 2023-01-01 00:00:04      111_2

Splitting with a pair of delimiters indicating session start and session end.

df = pd.DataFrame(
    [
        [111, "custom_start", "2023-01-01 00:00:00"],
        [111, "A", "2023-01-01 00:00:01"],
        [111, "B", "2023-01-01 00:00:02"],
        [111, "custom_end", "2023-01-01 00:00:02"],
        [111, "custom_start", "2023-01-01 00:00:04"],
        [111, "C", "2023-01-01 00:00:04"],
        [111, "custom_end", "2023-01-01 00:00:04"]
    ],
    columns=["user_id", "event", "timestamp"]
)
stream = Eventstream(df)
stream.split_sessions(delimiter_events=["custom_start", "custom_end"])\
    .to_dataframe()\
    .sort_values(["user_id", "event_index"])\
    [["user_id", "event", "timestamp", "session_id"]]

   user_id          event           timestamp session_id
0      111  session_start 2023-01-01 00:00:00      111_1
1      111              A 2023-01-01 00:00:01      111_1
2      111              B 2023-01-01 00:00:02      111_1
3      111    session_end 2023-01-01 00:00:02      111_1
4      111  session_start 2023-01-01 00:00:04      111_2
5      111              C 2023-01-01 00:00:04      111_2
6      111    session_end 2023-01-01 00:00:04      111_2

Splitting by a ‘delimiter_col’.

df = pd.DataFrame(
    [
        [111, "A", "2023-01-01 00:00:01", "session_1"],
        [111, "B", "2023-01-01 00:00:02", "session_1"],
        [111, "C", "2023-01-01 00:00:03", "session_2"],
        [111, "D", "2023-01-01 00:00:04", "session_2"],
    ],
    columns=["user_id", "event", "timestamp", "custom_ses_id"]
)
raw_data_schema = {"custom_cols": [{"raw_data_col": "custom_ses_id", "custom_col": "custom_ses_id"}]}
stream = Eventstream(df, raw_data_schema=raw_data_schema)
stream.split_sessions(delimiter_col="custom_ses_id")\
    .to_dataframe()\
    .sort_values(["user_id", "event_index"])\
    [["user_id", "event", "timestamp", "session_id", "custom_ses_id"]]

   user_id          event           timestamp session_id custom_ses_id
0      111  session_start 2023-01-01 00:00:01      111_1     session_1
1      111              A 2023-01-01 00:00:01      111_1     session_1
2      111              B 2023-01-01 00:00:02      111_1     session_1
3      111    session_end 2023-01-01 00:00:02      111_1     session_1
4      111  session_start 2023-01-01 00:00:03      111_2     session_2
5      111              C 2023-01-01 00:00:03      111_2     session_2
6      111              D 2023-01-01 00:00:04      111_2     session_2
7      111    session_end 2023-01-01 00:00:04      111_2     session_2

class retentioneering.data_processors_lib.split_sessions.SplitSessionsParams(*, timeout=None, delimiter_events=None, delimiter_col=None, mark_truncated=False, session_col='session_id')

A class that holds the parameters for the SplitSessions class.

Eventstream

SplitSessionsHelperMixin.split_sessions(timeout=None, delimiter_events=None, delimiter_col=None, session_col='session_id', mark_truncated=False)

A method of the Eventstream class that creates new synthetic events in each user’s path: session_start (or session_start_cropped) and session_end (or session_end_cropped). The created events divide users’ paths into sessions. It also creates a new column that contains the session number for each event in the input eventstream. The session number takes the form {user_id}_{session_number through one user path}. The created events and column are added to the input eventstream.

Parameters:
timeout : Tuple(float, DATETIME_UNITS), optional

Threshold value and its unit of measure. session_start and session_end events are always placed before the first and after the last event in each user’s path. Because a user can have more than one session, the timedelta between every two consecutive events in each user’s path is calculated. If that timedelta exceeds the selected timeout, new synthetic session_end and session_start events are created inside the user path, marking where one session ends and the next one starts.

delimiter_events : list of str, optional

Delimiters define special events in the eventstream that indicate the start and the end of a session.

  • If a single delimiter is defined, it marks both the end of the previous session and the start of the next one. Each delimiting event is replaced with a “session_start” event.

  • If a list of two delimiters is defined, the first and second events mark the session start and end respectively. Delimiting events are replaced with “session_start” and “session_end” events.

Returns:
Eventstream

Input eventstream with new synthetic events and session_col.