Initial Signals Docs #1197
---
title: "Defining Attributes"
sidebar_position: 20
description: "How to define Attributes and use Criteria to filter events."
sidebar_label: "Attributes"
---
Attributes are the building blocks of Snowplow Signals. They represent specific facts about user behavior and are calculated from the events in your Snowplow pipeline. This guide explains how to define an `Attribute` and how to use `Criteria` to filter events for precise aggregation.

For example:

- **Number of page views in the last 7 days:** counts how many pages a user has viewed within the past week.
- **Last product viewed:** identifies the most recent product a user interacted with.
- **Previous purchases:** provides a record of the user's past transactions.

### Basic Usage

An `Attribute` that counts the number of page views in a session can be defined through the Python SDK as follows:

```python
from snowplow_signals import Attribute, Event

page_views_attribute = Attribute(
    name='page_views_count',
    type='int32',
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation='counter'
)
```

### Advanced Usage

Add a `Criteria` to refine the events used to calculate the `Attribute`. For example, you might want to count only the page views on a particular page of your site:

```python
from snowplow_signals import Attribute, Event, Criteria, Criterion

page_views_attribute = Attribute(
    name='page_views_count',
    type='int32',
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation='counter',
    criteria=Criteria(
        all=[
            Criterion(
                property="page_title",
                operator="=",
                value="home_page",
            ),
        ],
    ),
)
```

The table below lists all arguments for an `Attribute`:

| **Argument Name** | **Description** | **Type** |
| --- | --- | --- |
| `name` | The name of the Attribute | `string` |
| `description` | The description of the Attribute | `string` |
| `type` | The type of the aggregation | One of: `bytes`, `string`, `int32`, `int64`, `double`, `float`, `bool`, `unix_timestamp`, `bytes_list`, `string_list`, `int32_list`, `int64_list`, `double_list`, `float_list`, `bool_list`, `unix_timestamp_list` |
| `tags` | Metadata for the Attribute | |
| `events` | List of Snowplow events that the Attribute is calculated on | List of `Event` type; see the next section |
| `aggregation` | The aggregation type of the Attribute | One of: `counter`, `sum`, `min`, `max`, `mean`, `first`, `last`, `unique_list` |
| `property` | The property of the event or entity to use in the aggregation | `string` |
| `criteria` | List of `Criteria` to filter the events | List of `Criteria` type |
| `period` | The time period over which the aggregation is calculated | `timedelta` |

Property paths use the same column names as the atomic events table, and nested fields within self-describing events and entities are referenced using Snowflake-style syntax, as shown in the Batch Engine examples.
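
To illustrate the `property` and `period` arguments, the sketch below keeps the most recent marketing source seen within the past 30 days. It assumes the standard `mkt_source` column from the atomic events table; adapt the names to your own setup:

```python
from datetime import timedelta

from snowplow_signals import Attribute, Event

# Most recent marketing source observed in the last 30 days.
last_mkt_source_30_days = Attribute(
    name="last_mkt_source_30_days",
    type="string",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation="last",          # keep the latest observed value
    property="mkt_source",       # column name as in the atomic events table
    period=timedelta(days=30),   # only consider the last 30 days
)
```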

### Event Type
The `Event` specifies the type of event that the `Attribute` is calculated on. It should reference a Snowplow event that exists in your Snowplow account.

| **Argument Name** | **Description** | **Type** |
| --- | --- | --- |
| `name` | Name of the event (`event_name` column in the `atomic.events` table) | `string` |
| `vendor` | Vendor of the event (`event_vendor` column in the `atomic.events` table) | `string` |
| `version` | Version of the event (`event_version` column in the `atomic.events` table) | `string` |

### Criteria
The `Criteria` filters the events used to calculate an `Attribute`. It is made up of individual `Criterion` conditions.

A `Criteria` accepts one of two parameters, both lists of individual `Criterion`:

- `all`: An array of conditions used to filter the events. All conditions must be met.
- `any`: An array of conditions used to filter the events. At least one of the conditions must be met.

A `Criterion` specifies an individual filter condition for an `Attribute` using the following properties:

| **Argument Name** | **Description** | **Type** |
| --- | --- | --- |
| `property` | The path to the property on the event or entity you wish to filter | `string` |
| `operator` | The operator used to compare the property to the value | One of: `=`, `!=`, `<`, `>`, `<=`, `>=`, `like`, `in` |
| `value` | The value to compare the property to | One of: `str`, `int`, `float`, `bool`, `List[str]`, `List[int]`, `List[float]`, `List[bool]` |
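
For example, a `Criteria` defined with `any` matches events that satisfy at least one of its conditions. A minimal sketch, assuming `page_title` values from your own tracking:

```python
from snowplow_signals import Criteria, Criterion

# Match page views whose title is either "home_page" or "pricing_page".
home_or_pricing = Criteria(
    any=[
        Criterion(property="page_title", operator="=", value="home_page"),
        Criterion(property="page_title", operator="=", value="pricing_page"),
    ],
)
```

The same filter could also be expressed as a single `Criterion` using the `in` operator with a list value.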
---
title: "Batch Engine"
sidebar_position: 40
description: "In-depth explanation of how the Batch Engine works."
sidebar_label: "Batch Engine"
---
While many `Attributes` can be computed in stream, those that have to be calculated over a longer period (e.g. a day or more) can only be created "offline" in the warehouse. We call them `Batch Attributes`.

One advantage of computing batch attributes over stream is that the computation can go back in history, based on the data already available in your atomic events dataset.

The entity here is typically the user, identified by `domain_userid` or another Snowplow identifier field, such as the logged-in `user_id`.

:::info
For now, only `domain_userid` can be used, but we will shortly extend support to all Snowplow identifiers.
:::

Typical examples of `Batch Attributes` include:
- customer lifetime values
- specific transactions that have or have not happened in the last X days
- the first or last events a specific user generated, or any properties associated with those events

You may already have tables in your warehouse that contain such computed values, in which case you only need to register them as a batch source for Signals to use. If you don't, we have developed a tool called the **`Batch Engine`** to generate these `Attributes` for you efficiently, with the help of a few CLI commands. This saves you from building complex data models and from recalculating the values each time over a very large table, which over time may become not just costly but impossible.

## How it works
First, define a set of `Attributes` and register them as a `View` through the Signals Python SDK. Then use the optional CLI functionality of the SDK to generate a dbt project, which will ultimately produce a view-specific attribute table. All that's left is to materialize the table, after which Signals will regularly fetch the values from your warehouse table and send them through the Profiles API.

## Defining batch attributes
Syntactically, defining batch and stream attributes works the same way. For a general overview, please refer to the [attributes](/docs/signals/attributes/index.md) section.

There are 4 main types of attributes that you may want to define for batch processing:
1. `Time Windowed Attributes`: actions that happened in the last X days. The `period` needs to be defined as a `timedelta` in days.

2. `Lifetime Attributes`: calculated over all the available data for the entity. The `period` needs to be left as `None`.

3. `First Touch Attributes`: events (or properties) that happened for the first time for a given entity. The `period` needs to be left as `None`.

4. `Last Touch Attributes`: events (or properties) that happened for the last time for a given entity. The `period` needs to be left as `None`.

Each of these 4 types is illustrated in the example block below:

1. `products_added_to_cart_last_7_days`: collects the unique list of products added to cart (ecommerce `add_to_cart` events) in the last 7 days

2. `total_product_price_clv`: sums the price of products added to cart across the customer lifetime

3. `first_mkt_source`: takes the first `page_view` event and reads the `mkt_source` property for a specific entity (e.g. `domain_userid`)

4. `last_device_class`: takes the last `page_view` event and retrieves the YAUAA `deviceClass` property for a specific entity

<details>
<summary>Example batch attribute definitions</summary>

Each block creates a single attribute definition, including the logic for how it should be calculated (its filters and aggregation).

```python
from snowplow_signals import (
    Attribute,
    Criteria,
    Criterion,
    Event,
)
from datetime import timedelta

products_added_to_cart_last_7_days = Attribute(
    name="products_added_to_cart_last_7_days",
    type="string_list",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="snowplow_ecommerce_action",
            version="1-0-2",
        )
    ],
    aggregation="unique_list",
    property="contexts_com_snowplowanalytics_snowplow_ecommerce_product_1[0].name",
    criteria=Criteria(
        all=[
            Criterion(
                property="unstruct_event_com_snowplowanalytics_snowplow_ecommerce_snowplow_ecommerce_action_1:type",
                operator="=",
                value="add_to_cart",
            ),
        ],
    ),
    period=timedelta(days=7),
)

total_product_price_clv = Attribute(
    name="total_product_price_clv",
    type="float",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="snowplow_ecommerce_action",
            version="1-0-2",
        )
    ],
    aggregation="sum",
    property="contexts_com_snowplowanalytics_snowplow_ecommerce_product_1[0].price",
    criteria=Criteria(
        all=[
            Criterion(
                property="unstruct_event_com_snowplowanalytics_snowplow_ecommerce_snowplow_ecommerce_action_1:type",
                operator="=",
                value="add_to_cart",
            ),
        ],
    ),
)

first_mkt_source = Attribute(
    name="first_mkt_source",
    type="string",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation="first",
    property="mkt_source",
)

last_device_class = Attribute(
    name="last_device_class",
    type="string",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation="last",
    property="contexts_nl_basjes_yauaa_context_1[0]:deviceClass",
)
```
</details>

## Defining a batch view
The key difference between a standard view and one meant for batch processing is the `offline=True` parameter. This flag indicates that the view's attributes are computed in the data warehouse.

```python
from snowplow_signals import View, user_entity

view = View(
    name="batch_ecommerce_attributes",
    version=1,
    entity=user_entity,
    offline=True,
    attributes=[
        products_added_to_cart_last_7_days,
        total_product_price_clv,
        first_mkt_source,
        last_device_class,
    ],
)
```

## Generating the dbt project
Here we assume you have already defined the View(s) containing the custom Batch Attributes that you want the Batch Engine to generate for you.

It is best to follow our step-by-step tutorial; check it out [here](/tutorials/snowplow-batch-engine/start/).

## Understanding the autogenerated data models
The dbt models generated by the Signals Batch Engine process events for Attribute generation in a way that helps you save resources by avoiding unnecessary processing. In each incremental run, only the data loaded since the last time the models ran is processed and deduplicated. Only the relevant events and properties that are part of the Attribute definitions (defined in the same View) are used to create a `filtered_events` table. After a successful run, the `snowplow_incremental_manifest` is updated to keep a record of where each run left off.

:::info
For those familiar with existing Snowplow dbt packages, it is worth noting that the incrementalization follows a completely different logic, based on newly loaded data rather than on reprocessing sessions as a whole.
:::

There is a second layer of incremental processing logic, dictated by the `daily_aggregation_manifest` table. After the `filtered_events` table is created or updated, the `daily_aggregates` table is updated with the help of this manifest. This is needed because of late arriving data, which may mean that some days have to be reprocessed as a whole. For optimization purposes there are variables to fine-tune this behavior, such as `snowplow__late_event_lookback_days` and `snowplow__min_late_events_to_process` (see the Variables section below).

Finally, the `Attributes` table is generated. This is a drop-and-recompute table, fully rebuilt each time an incremental update runs, which remains cost-effective because the data is already pre-aggregated at a daily level.

![Batch Engine data model flow](images/batch_engine_diagram.png)

## Variables
The following variables can be set in your `dbt_project.yml` to configure the generated models:

```yml title="dbt_project.yml"
snowplow__start_date: '2025-01-01' # date from which to start looking for events, based on both load and derived_tstamp
snowplow__app_id: [] # already gets applied in base_events_this_run
snowplow__backfill_limit_days: 1 # limit backfill increments for the filtered_events table
snowplow__late_event_lookback_days: 5 # the number of days to allow for late arriving data to be reprocessed fully in the daily aggregate table
snowplow__min_late_events_to_process: 1 # the threshold number of daily events skipped in previous runs (within late_event_lookback_days); once reached, those events are processed in the daily aggregate model
snowplow__allow_refresh: false # if true, the snowplow_incremental_manifest will be dropped when running with a --full-refresh flag
snowplow__dev_target_name: dev
snowplow__databricks_catalog: "hive_metastore"
snowplow__atomic_schema: 'atomic' # only set if not using the 'atomic' schema for Snowplow events data
snowplow__database: # only set if not using target.database for Snowplow events data -- WILL BE IGNORED FOR DATABRICKS
snowplow__events_table: "events" # only set if not using the 'events' table for Snowplow events data
```
---
title: "Snowplow Signals"
sidebar_position: 8
description: "An overview of Signals concepts."
sidebar_label: "Signals"
---
Snowplow Signals is a personalization engine built on Snowplow’s behavioral data pipeline. The Profile API, hosted in your BDP cloud, allows you to create, manage, and access user attributes using the Signals SDKs.

![Snowplow Signals architecture](images/signals_architecture.png)

Signals allows you to enhance your applications by aggregating historical user attributes and providing near real-time visibility into customer behavior.
Use the aggregated attributes to power in-product personalization and recommendation engines, adaptive UIs, and agentic applications such as AI copilots and chatbots.

### Sources

The `Source` of a view refers to the origin of the data. A source can be one of two types:

- **Batch Source:** Data that is aggregated and stored in a data warehouse, for example by dbt models.
- **Stream Source:** Data that is aggregated in real time, in stream.

### Attributes

The foundation of Signals is the `Attribute`. An attribute represents a specific fact about a user's behavior. For example:

- **Number of page views in the last 7 days:** counts how many pages a user has viewed within the past week.
- **Last product viewed:** identifies the most recent product a user interacted with.
- **Previous purchases:** provides a record of the user's past transactions.
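
As an illustration, the "last product viewed" attribute could be defined with the Python SDK roughly as follows. This is a sketch based on the examples elsewhere in these docs; the `product_view` action type and the ecommerce product entity path are assumptions about your tracking setup:

```python
from snowplow_signals import Attribute, Criteria, Criterion, Event

last_product_viewed = Attribute(
    name="last_product_viewed",
    type="string",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="snowplow_ecommerce_action",
            version="1-0-2",
        )
    ],
    aggregation="last",
    # Name of the most recently viewed product (ecommerce product entity).
    property="contexts_com_snowplowanalytics_snowplow_ecommerce_product_1[0].name",
    criteria=Criteria(
        all=[
            Criterion(
                property="unstruct_event_com_snowplowanalytics_snowplow_ecommerce_snowplow_ecommerce_action_1:type",
                operator="=",
                value="product_view",  # assumed action type for product views
            ),
        ],
    ),
)
```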

Signals calculates user attributes in two ways:

- Stream Processing: Real-time metrics for instant personalization.
- Batch Processing: Historical insights from data stored in your warehouse.

### Views

A `View` is a collection of attributes that share a common aggregation key (i.e. `session_id` or `user_id`) and a data `Source`. You can picture it as a table of attributes, for example:

| user_id | number_of_pageviews | last_product_viewed | previous_purchases |
|---------|---------------------|---------------------|---------------------|
| `abc123`| 5 | Red Shoes | [`Blue Shoes`, `Red Hat`] |
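
In the SDK, a view groups attribute definitions under an entity. A minimal sketch, following the `View` constructor shown on the Batch Engine page (the attribute objects and `user_entity` are assumed to be defined beforehand):

```python
from snowplow_signals import View, user_entity

user_view = View(
    name="user_attributes",
    version=1,
    entity=user_entity,           # the aggregation key, here the user
    attributes=[
        number_of_pageviews,      # attribute definitions created earlier
        last_product_viewed,
        previous_purchases,
    ],
)
```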
### Services

A `Service` is a collection of `Views`, grouped to make the retrieval of attributes simpler.
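
Conceptually, a service bundles several views under one name so that their attributes can be retrieved together. The constructor below is a hypothetical sketch, as the `Service` API is not shown on this page:

```python
from snowplow_signals import Service  # hypothetical import

# Group related views so their attributes can be fetched together.
personalization_service = Service(
    name="personalization",
    views=[user_view, session_view],  # views assumed to be defined elsewhere
)
```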