As we can see, this path now starts with the two events preceding the ``cart`` (``event_index=0,1``), followed by the ``cart`` event itself (``event_index=2``). Another ``cart`` event occurred later in the path (``event_index=5827``), but since the default ``occurrence_before='first'`` was applied, the data processor ignored this second ``cart``.

We can also truncate from the right, or set the truncation point to the last (rather than the first) occurrence of ``cart``. To demonstrate both, let us set ``drop_after='cart'`` and ``occurrence_after='last'``:
.. code-block:: python

    res = stream.truncate_paths(
        drop_after='cart',
        occurrence_after='last'
    ).to_dataframe()
Now, any trajectory that includes a ``cart`` event is truncated so that it ends with the last ``cart``:
.. code-block:: python
res[res['user_id'] == 219483890]
+------+------------+-------------+----------+---------------------+-----------+
|      | event_type | event_index | event    | timestamp           | user_id   |
+======+============+=============+==========+=====================+===========+
| 0    | raw        | 0           | catalog  | 2019-11-01 17:59:13 | 219483890 |
+------+------------+-------------+----------+---------------------+-----------+
| 1    | raw        | 1           | product1 | 2019-11-01 17:59:28 | 219483890 |
+------+------------+-------------+----------+---------------------+-----------+
| 2    | raw        | 2           | cart     | 2019-11-01 17:59:29 | 219483890 |
+------+------------+-------------+----------+---------------------+-----------+
| ...  | ...        | ...         | ...      | ...                 | ...       |
+------+------------+-------------+----------+---------------------+-----------+
| 5639 | raw        | 5639        | catalog  | 2020-01-06 22:10:15 | 219483890 |
+------+------------+-------------+----------+---------------------+-----------+
| 5640 | raw        | 5640        | cart     | 2020-01-06 22:10:42 | 219483890 |
+------+------------+-------------+----------+---------------------+-----------+
If a path does not meet the *before* or *after* condition, the entire trajectory is removed from the output eventstream. You can modify this behavior using the ``ignore_before`` and ``ignore_after`` flags. For example, if ``ignore_after`` is set to ``True``, the resulting path starts at the *before* cutting point and is extended to the end of the path. For a given path ``A``, ``B``, ``C``, ``D``, ``E`` and the configuration ``drop_before='B'``, ``drop_after='X'``, ``ignore_after=True``, the result is ``B``, ``C``, ``D``, ``E``. With ``ignore_after=False``, the path is removed from the output eventstream since event ``X`` is not represented in the trajectory.
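To illustrate, here is a minimal sketch of the configuration described above; the event names ``B`` and ``X`` are placeholders, not events from the demo dataset.

.. code-block:: python

    # Sketch only: 'B' and 'X' are placeholder event names.
    # With ignore_after=True, a path lacking the 'X' event is kept
    # from the 'B' cutting point to the end of the path; with the
    # default behavior it would be removed entirely.
    res = stream.truncate_paths(
        drop_before='B',
        drop_after='X',
        ignore_after=True
    ).to_dataframe()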
Another feature that might be useful is the ``keep_synthetic`` flag. If it is ``True``, synthetic events adjacent to the corresponding boundary events are included in the output eventstream. For example, for a given event sequence ``session_start``, ``A``, ``B``, ``C`` and ``drop_before='A'``, ``keep_synthetic=True`` keeps the sequence as is, while ``keep_synthetic=False`` would return ``A``, ``B``, ``C``.
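As a minimal sketch applied to the demo eventstream (assuming the paths already contain synthetic events such as ``session_start``):

.. code-block:: python

    # keep_synthetic=True retains the synthetic events preceding the
    # boundary event (e.g. session_start); keep_synthetic=False drops them.
    res = stream.truncate_paths(
        drop_before='cart',
        keep_synthetic=True
    ).to_dataframe()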
Editing processors
~~~~~~~~~~~~~~~~~~
.. _group_events:
GroupEvents
^^^^^^^^^^^
Given a masking function passed as ``func``, :py:meth:`GroupEvents` replaces all the events marked by ``func`` with newly created synthetic events of the ``event_name`` name and ``event_type`` type (``group_alias`` by default). The timestamps of these synthetic events are the same as their parents'. ``func`` can be any function that returns a series of boolean (``True/False``) values that can be used as a filter for the DataFrame underlying the eventstream.
.. figure:: /_static/user_guides/data_processor/dp_12_group_events.png
With ``GroupEvents``, we can group events based on the event name. Suppose
we need to assign a common name ``product`` to events ``product1`` and
``product2``:
.. code-block:: python

    def group_events(df, schema):
        events_to_group = ['product1', 'product2']
        return df[schema.event_name].isin(events_to_group)

    params = {
        'event_name': 'product',
        'func': group_events
    }

    res = stream.group_events(**params).to_dataframe()
As we can see, user ``456870964`` now has two ``product`` events
(``event_index=160, 164``) with ``event_type='group_alias'``.
.. code-block:: python
res[res['user_id'] == 456870964]
+-----+-------------+-------------+---------+---------------------+-----------+
|     | event_type  | event_index | event   | timestamp           | user_id   |
+=====+=============+=============+=========+=====================+===========+
| 157 | raw         | 157         | catalog | 2019-11-03 11:46:55 | 456870964 |
+-----+-------------+-------------+---------+---------------------+-----------+
| 158 | raw         | 158         | catalog | 2019-11-03 11:47:46 | 456870964 |
+-----+-------------+-------------+---------+---------------------+-----------+
| 159 | raw         | 159         | catalog | 2019-11-03 11:47:58 | 456870964 |
+-----+-------------+-------------+---------+---------------------+-----------+
| 160 | group_alias | 160         | product | 2019-11-03 11:48:43 | 456870964 |
+-----+-------------+-------------+---------+---------------------+-----------+
| 162 | raw         | 162         | cart    | 2019-11-03 11:49:17 | 456870964 |
+-----+-------------+-------------+---------+---------------------+-----------+
| 163 | raw         | 163         | catalog | 2019-11-03 11:49:17 | 456870964 |
+-----+-------------+-------------+---------+---------------------+-----------+
| 164 | group_alias | 164         | product | 2019-11-03 11:49:28 | 456870964 |
+-----+-------------+-------------+---------+---------------------+-----------+
| 166 | raw         | 166         | catalog | 2019-11-03 11:49:30 | 456870964 |
+-----+-------------+-------------+---------+---------------------+-----------+
Previously, these events were named ``product1`` and ``product2`` and had the ``raw`` event type:
.. code-block:: python
stream.to_dataframe().query('user_id == 456870964')
+-----+------------+-------------+----------+---------------------+-----------+
|     | event_type | event_index | event    | timestamp           | user_id   |
+=====+============+=============+==========+=====================+===========+
| 140 | raw        | 140         | catalog  | 2019-11-03 11:46:55 | 456870964 |
+-----+------------+-------------+----------+---------------------+-----------+
| 141 | raw        | 141         | catalog  | 2019-11-03 11:47:46 | 456870964 |
+-----+------------+-------------+----------+---------------------+-----------+
| 142 | raw        | 142         | catalog  | 2019-11-03 11:47:58 | 456870964 |
+-----+------------+-------------+----------+---------------------+-----------+
| 143 | raw        | 143         | product1 | 2019-11-03 11:48:43 | 456870964 |
+-----+------------+-------------+----------+---------------------+-----------+
| 144 | raw        | 144         | cart     | 2019-11-03 11:49:17 | 456870964 |
+-----+------------+-------------+----------+---------------------+-----------+
| 145 | raw        | 145         | catalog  | 2019-11-03 11:49:17 | 456870964 |
+-----+------------+-------------+----------+---------------------+-----------+
| 146 | raw        | 146         | product2 | 2019-11-03 11:49:28 | 456870964 |
+-----+------------+-------------+----------+---------------------+-----------+
| 147 | raw        | 147         | catalog  | 2019-11-03 11:49:30 | 456870964 |
+-----+------------+-------------+----------+---------------------+-----------+
You may also notice that the newly created ``product`` events have ``event_id`` values that differ from their parents'.
.. note::

    The ``schema`` parameter of the grouping function is optional, as it is in the :ref:`FilterEvents <filter_events>` data processor.
.. _group_events_bulk:
GroupEventsBulk
^^^^^^^^^^^^^^^
``GroupEventsBulk`` is a simple extension of the :ref:`GroupEvents <group_events>` data processor that allows applying multiple grouping operations at once. Its only positional argument, ``grouping_rules``, defines the grouping rules to be applied to the eventstream.
One option is to define grouping rules as a list of dictionaries. Each dictionary must contain two keys: ``event_name`` and ``func``; the ``event_type`` key is optional. The meaning of the keys is exactly the same as for the :ref:`GroupEvents <group_events>` data processor.
.. code-block:: python

    stream.group_events_bulk(
        [
            {
                'event_name': 'product',
                'event_type': 'group_product',
                'func': lambda _df: _df['event'].str.startswith('product')
            },
            {
                'event_name': 'delivery',
                'func': lambda _df: _df['event'].str.startswith('delivery')
            }
        ]
    )
An alternative way to set grouping rules is to use a dictionary. The keys and values are treated as ``event_name`` and ``func``, respectively. Setting ``event_type`` is not supported in this case.
.. code-block:: python

    stream.group_events_bulk(
        {
            'product': lambda _df: _df['event'].str.startswith('product'),
            'delivery': lambda _df: _df['event'].str.startswith('delivery')
        }
    )
.. note::

    If two or more grouping rules could apply to the same original event, a ``ValueError`` is raised. This behavior is controlled by the ``ignore_intersections`` flag. If ``ignore_intersections=True``, the first matching grouping rule is applied in case of such conflicts.
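For example, here is a sketch in which both rules match the ``product1`` event; with ``ignore_intersections=True`` the conflict is resolved in favor of the first rule instead of raising ``ValueError`` (the rule names are illustrative only):

.. code-block:: python

    # Both rules match 'product1'; the first one ('product') wins.
    stream.group_events_bulk(
        {
            'product': lambda _df: _df['event'].str.startswith('product'),
            'product_one': lambda _df: _df['event'] == 'product1'
        },
        ignore_intersections=True
    )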
.. _collapse_loops:
CollapseLoops
^^^^^^^^^^^^^
:py:meth:`CollapseLoops` replaces each uninterrupted series of repeating user events (a loop) with a single new loop-like event. The ``suffix`` parameter defines the name of the new event:

- given ``suffix=None``, the new event keeps the old event name, i.e. the name of the repeating event is passed along;
- given ``suffix="loop"``, the new event is named ``event_name_loop``;
- given ``suffix="count"``, the new event is named ``event_name_loop_{number of event repetitions}``.
The ``time_agg`` value determines the new event timestamp:
- given ``time_agg="max"`` (the default option), passes the
timestamp of the last event from the loop;
- given ``time_agg="min"``, passes the timestamp of
the first event from the loop;
- given ``time_agg="mean"``, passes the average loop
timestamp.
.. figure:: /_static/user_guides/data_processor/dp_13_collapse_loops.png
.. code-block:: python
res = stream.collapse_loops(suffix='loop', time_agg='max').to_dataframe()
Consider, for example, user ``2112338``. In the original eventstream, this user had three consecutive ``catalog`` events.
.. code-block:: python
stream.to_dataframe().query('user_id == 2112338')
+------+------------+-------------+---------+---------------------+---------+
|      | event_type | event_index | event   | timestamp           | user_id |
+======+============+=============+=========+=====================+=========+
| 3327 | raw        | 3327        | main    | 2019-12-24 12:58:04 | 2112338 |
+------+------------+-------------+---------+---------------------+---------+
| 3328 | raw        | 3328        | catalog | 2019-12-24 12:58:08 | 2112338 |
+------+------------+-------------+---------+---------------------+---------+
| 3329 | raw        | 3329        | catalog | 2019-12-24 12:58:16 | 2112338 |
+------+------------+-------------+---------+---------------------+---------+
| 3330 | raw        | 3330        | catalog | 2019-12-24 12:58:44 | 2112338 |
+------+------------+-------------+---------+---------------------+---------+
| 3331 | raw        | 3331        | main    | 2019-12-24 12:58:52 | 2112338 |
+------+------------+-------------+---------+---------------------+---------+
In the resulting DataFrame, the repeating ``catalog`` events have been collapsed into a single ``catalog_loop`` event. The timestamp of this synthetic event is the same as the timestamp of the last event in the loop: ``2019-12-24 12:58:44``.
.. code-block:: python
res[res['user_id'] == 2112338]
+------+-------------+-------------+--------------+---------------------+---------+
|      | event_type  | event_index | event        | timestamp           | user_id |
+======+=============+=============+==============+=====================+=========+
| 5061 | raw         | 5061        | main         | 2019-12-24 12:58:04 | 2112338 |
+------+-------------+-------------+--------------+---------------------+---------+
| 5066 | group_alias | 5066        | catalog_loop | 2019-12-24 12:58:44 | 2112338 |
+------+-------------+-------------+--------------+---------------------+---------+
| 5069 | raw         | 5069        | main         | 2019-12-24 12:58:52 | 2112338 |
+------+-------------+-------------+--------------+---------------------+---------+
We can set ``suffix='count'`` to see the length of the collapsed loops. Let us also see how ``time_agg`` works when set to ``mean``.
.. code-block:: python

    params = {
        'suffix': 'count',
        'time_agg': 'mean'
    }

    res = stream.collapse_loops(**params).to_dataframe()
    res[res['user_id'] == 2112338]
+------+-------------+-------------+----------------+---------------------+---------+
|      | event_type  | event_index | event          | timestamp           | user_id |
+======+=============+=============+================+=====================+=========+
| 5071 | raw         | 5071        | main           | 2019-12-24 12:58:04 | 2112338 |
+------+-------------+-------------+----------------+---------------------+---------+
| 5076 | group_alias | 5076        | catalog_loop_3 | 2019-12-24 12:58:23 | 2112338 |
+------+-------------+-------------+----------------+---------------------+---------+
| 5079 | raw         | 5079        | main           | 2019-12-24 12:58:52 | 2112338 |
+------+-------------+-------------+----------------+---------------------+---------+
Now, the synthetic ``catalog_loop_3`` event has the timestamp ``12:58:23``, which is the average of ``12:58:08``, ``12:58:16``, and ``12:58:44``.
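As a quick check of this arithmetic, the mean of the three loop timestamps can be computed directly with pandas:

.. code-block:: python

    import pandas as pd

    # Average of the three looping 'catalog' timestamps.
    loop_timestamps = pd.Series(pd.to_datetime([
        '2019-12-24 12:58:08',
        '2019-12-24 12:58:16',
        '2019-12-24 12:58:44'
    ]))
    loop_timestamps.mean()  # ~2019-12-24 12:58:22.67, displayed as 12:58:23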
The ``CollapseLoops`` data processor can be useful for compressing the data:

- by packing loop information into single events;
- by removing looping events when they are not desirable (a common case in clickstream visualization).
.. _pipe:
Pipe
^^^^
``Pipe`` is a data processor similar to the `pandas pipe <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html>`_ method. It modifies an input eventstream in an arbitrary way by applying a given function. The function must accept the DataFrame associated with the input eventstream and return a new state of the modified eventstream.
.. code-block:: python

    stream.pipe(lambda _df: _df.assign(new_column=100))\
        .to_dataframe()\
        .head(3)
+---+-----------+------------+----------------------------+------------+
|   | user_id   | event      | timestamp                  | new_column |
+===+===========+============+============================+============+
| 0 | 219483890 | path_start | 2019-11-01 17:59:13.273932 | 100        |
+---+-----------+------------+----------------------------+------------+
| 1 | 219483890 | catalog    | 2019-11-01 17:59:13.273932 | 100        |
+---+-----------+------------+----------------------------+------------+
| 2 | 219483890 | product1   | 2019-11-01 17:59:28.459271 | 100        |
+---+-----------+------------+----------------------------+------------+
.. _synthetic_events_order:
Synthetic events order
----------------------
Let us summarize the information about event types and event order in the eventstream, which we have already touched on in the eventstream guide: see the :ref:`event_type column` and the :ref:`reindex method`.

All events that come from a sourcing DataFrame are of the ``raw`` event type. When we apply adding or editing data processors, new synthetic events are created. The general idea is that each synthetic event has a "parent" (or "parents") that defines its timestamp.

When you apply multiple data processors, timestamp collisions might occur, so it is unclear how the colliding events should be ordered. In that case, the following sorting order based on event types is applied (event types listed earlier are placed earlier). The table also shows which data processor is responsible for each event type:
.. table:: Mapping of event_types and data processors.
:widths: 10 40 40
:class: tight-table
+-------+-------------------------+---------------------------------------------------------+
| Order | event_type | helper |
+=======+=========================+=========================================================+
| 1 | profile | |
+-------+-------------------------+---------------------------------------------------------+
| 2 | path_start | :ref:`add_start_end_events` |
+-------+-------------------------+---------------------------------------------------------+
| 3 | new_user | :ref:`label_new_users` |
+-------+-------------------------+---------------------------------------------------------+
| 4 | existing_user | :ref:`label_new_users` |
+-------+-------------------------+---------------------------------------------------------+
| 5 | cropped_left | :ref:`label_cropped_paths` |
+-------+-------------------------+---------------------------------------------------------+
| 6 | session_start | :ref:`split_sessions` |
+-------+-------------------------+---------------------------------------------------------+
| 7 | session_start_cropped | :ref:`split_sessions` |
+-------+-------------------------+---------------------------------------------------------+
| 8 | group_alias | :ref:`group_events` |
+-------+-------------------------+---------------------------------------------------------+
| 9 | raw | |
+-------+-------------------------+---------------------------------------------------------+
| 10 | raw_sleep | |
+-------+-------------------------+---------------------------------------------------------+
| 11 | None | |
+-------+-------------------------+---------------------------------------------------------+
| 12 | synthetic | |
+-------+-------------------------+---------------------------------------------------------+
| 13 | synthetic_sleep | |
+-------+-------------------------+---------------------------------------------------------+
| 14 | add_positive_events | :ref:`add_positive_events` |
+-------+-------------------------+---------------------------------------------------------+
| 15 | add_negative_events | :ref:`add_negative_events` |
+-------+-------------------------+---------------------------------------------------------+
| 16 | session_end_cropped | :ref:`split_sessions` |
+-------+-------------------------+---------------------------------------------------------+
| 17 | session_end | :ref:`split_sessions` |
+-------+-------------------------+---------------------------------------------------------+
| 18 | session_sleep | |
+-------+-------------------------+---------------------------------------------------------+
| 19 | cropped_right | :ref:`label_cropped_paths` |
+-------+-------------------------+---------------------------------------------------------+
| 20 | absent_user | :ref:`label_lost_users` |
+-------+-------------------------+---------------------------------------------------------+
| 21 | lost_user | :ref:`label_lost_users` |
+-------+-------------------------+---------------------------------------------------------+
| 22 | path_end | :ref:`add_start_end_events` |
+-------+-------------------------+---------------------------------------------------------+
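As an illustration of this ordering, one can apply a couple of adding data processors and inspect the head of the resulting DataFrame: the synthetic ``path_start`` and ``session_start`` events share a timestamp with the first raw event of the path, but are placed before it in the order given by the table above. This is a sketch only; in particular, the ``timeout`` argument of ``split_sessions`` is assumed here for demonstration purposes.

.. code-block:: python

    # path_start (order 2) precedes session_start (order 6), which precedes
    # the raw event (order 9) they collide with.
    res = stream\
        .add_start_end_events()\
        .split_sessions(timeout=(30, 'm'))\
        .to_dataframe()

    res[['event', 'event_type', 'timestamp']].head(5)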