Initial Signals Docs #1197

Open · wants to merge 17 commits into base: main

108 changes: 108 additions & 0 deletions docs/signals/attributes/index.md
---
title: "Defining Attributes"
sidebar_position: 20
description: "Attributes."
sidebar_label: "Attributes"
---

Attributes are the building blocks of Snowplow Signals. They represent specific facts about user behavior and are calculated from events in your Snowplow pipeline. This guide explains how to define an `Attribute` and how to use `Criteria` to filter events for precise aggregation. Examples of attributes include:

- **Number of page views in the last 7 Days:** counts how many pages a user has viewed within the past week.
- **Last product viewed:** identifies the most recent product a user interacted with.
- **Previous purchases:** provides a record of the user's past transactions.


### Basic Usage

For example, using the Python SDK, you can define an `Attribute` that counts the number of page views in a session as follows:

```python
from snowplow_signals import Attribute, Event

page_views_attribute = Attribute(
    name='page_views_count',
    type='int32',
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation='counter'
)
```

### Advanced Usage

Add a `Criteria` to refine the events used to calculate the `Attribute`: for example, to count only the page views on a particular page of your site.


```python
from snowplow_signals import Attribute, Event, Criteria, Criterion

page_views_attribute = Attribute(
    name='page_views_count',
    type='int32',
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation='counter',
    criteria=Criteria(
        all=[
            Criterion(
                property="page_title",
                operator="=",
                value="home_page",
            ),
        ],
    ),
)
```

The table below lists all arguments for an `Attribute`:

| **Argument Name** | **Description** | **Type** |
| --- | --- | --- |
| `name` | The name of the Attribute | `string` |
| `description` | The description of the Attribute | `string` |
| `type` | The type of the attribute value | One of: `bytes`, `string`, `int32`, `int64`, `double`, `float`, `bool`, `unix_timestamp`, `bytes_list`, `string_list`, `int32_list`, `int64_list`, `double_list`, `float_list`, `bool_list`, `unix_timestamp_list` |
| `tags` | Metadata for the Attribute | |
| `events` | List of Snowplow Events that the Attribute is calculated on | List of `Event` type; see next section |
| `aggregation` | The aggregation type of the Attribute | One of: `counter`, `sum`, `min`, `max`, `mean`, `first`, `last`, `unique_list` |
| `property` | The property of the event or entity you wish to use in the aggregation | `string` |
| `criteria` | List of `Criteria` to filter the events | List of `Criteria` type |
| `period` | The time period over which the aggregation should be calculated | `timedelta` |

> **Contributor:** I think we'll need a bit more documentation on this, as there are complexities about the syntax and naming that should be used here. We decided to use Snowflake syntax for accessing nested properties within events and entities. Also, the column names are the same as in the atomic events table. There was a bit more detail in this doc. One can also access other properties in the atomic events table, like `app_id` and more. It'd be great to provide some examples.

> **Contributor:** I think it deserves a separate paragraph/section, as users will definitely need to figure out how to do that.


### Event Type
The `Event` specifies the type of event that the `Attribute` is calculated on. It should reference a Snowplow event that exists in your Snowplow account.

| **Argument Name** | **Description** | **Type** |
| --- | --- | --- |
| `name` | Name of the event (the `event_name` column in the atomic events table) | `string` |
| `vendor` | Vendor of the event (the `event_vendor` column in the atomic events table) | `string` |
| `version` | Version of the event (the `event_version` column in the atomic events table) | `string` |
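
The same structure works for any event in your pipeline, including custom self-describing events. A minimal sketch (the vendor, name, and version below are hypothetical placeholders; use the values from your own event schema):

```python
from snowplow_signals import Event

# Hypothetical custom event reference: replace vendor, name, and version
# with the details of an event tracked in your own pipeline.
product_search_event = Event(
    vendor="com.acme",       # event_vendor in the atomic events table
    name="product_search",   # event_name in the atomic events table
    version="1-0-0",         # event_version in the atomic events table
)
```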

### Criteria
A `Criteria` filters the events used to calculate an `Attribute`. It is made up of individual `Criterion` conditions.

> **Collaborator:** why are these not called Filters/Filtersets 😒


A `Criteria` accepts one of two parameters, both lists of individual `Criterion`:

- `all`: An array of conditions used to filter the events. All conditions must be met.
- `any`: An array of conditions used to filter the events. At least one of the conditions must be met.

A `Criterion` specifies the individual filter conditions for an `Attribute` using the following properties.

| **Argument Name** | **Description** | **Type** |
| --- | --- | --- |
| `property` | The path to the property on the event or entity you wish to filter. | `string` |
| `operator` | The operator used to compare the property to the value. | One of: `=`, `!=`, `<`, `>`, `<=`, `>=`, `like`, `in` |
| `value` | The value to compare the property to. | One of: `str`, `int`, `float`, `bool`, `List[str]`, `List[int]`, `List[float]`, `List[bool]` |
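
For example, to match page view events on either of two pages, use `any` instead of `all` (the page titles below are illustrative):

```python
from snowplow_signals import Criteria, Criterion

# Match events whose page_title is either "home_page" or "pricing_page".
criteria = Criteria(
    any=[
        Criterion(property="page_title", operator="=", value="home_page"),
        Criterion(property="page_title", operator="=", value="pricing_page"),
    ],
)
```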

191 changes: 191 additions & 0 deletions docs/signals/batch_engine/index.md
---
title: "Batch Engine"
sidebar_position: 40
description: "In depth explanation on how the Batch Engine works."
sidebar_label: "Batch Engine"
---

While many `Attributes` can be computed in stream, those that have to be calculated over a longer period (e.g. a day or more) can only be created "offline" in the warehouse. We call these `Batch Attributes`.

One advantage of computing attributes in batch rather than in stream is that the computation can go back in history, using what is already available in your atomic events dataset.

The entity here is typically the user, identified by the `domain_userid` or another Snowplow identifier field, such as the logged-in `user_id`.

:::info
For now, only the `domain_userid` can be used, but we will shortly extend support to all Snowplow identifiers.
:::

> **Collaborator:** does this also apply to the stream source?

Typical examples of `Batch Attributes` include:
- customer lifetime values
- specific transactions that have or have not happened in the last X days
- the first or last events a specific user generated, or any properties associated with these events

You may already have tables in your warehouse that contain such computed values, in which case you only have to register them as a batch source to be used by Signals. If you don't have them, we have developed a tool called the **`Batch Engine`** to efficiently generate these `Attributes` for you with the help of a few CLI commands. This saves you from building complex data models and from recalculating the values over a very large table each time, which over time becomes costly or even impossible.

## How it works
First, define a set of `Attributes` and register them as a `View` through the Python Signals SDK. Then use the optional CLI functionality of the SDK to generate a dbt project, which will ultimately produce a view-specific attribute table. All that's left is to materialize the table, after which Signals will regularly fetch the values from your warehouse table and serve them through the Profiles API.

## Defining batch attributes
Syntactically, defining batch and stream attributes works the same way. For a general overview, please refer to the [attributes](/docs/signals/attributes/index.md) section.

There are 4 main types of attributes that you will likely want to define for batch processing:
1. `Time Windowed Attributes`: actions that happened in the last X days. The period needs to be defined as a `timedelta` in days.

2. `Lifetime Attributes`: calculated over all the available data for the entity. The period needs to be left as `None`.

3. `First Touch Attributes`: events (or properties) that happened for the first time for a given entity. The period needs to be left as `None`.

4. `Last Touch Attributes`: events (or properties) that happened for the last time for a given entity. The period needs to be left as `None`.

Each of these 4 types is illustrated with an example in the block below.

1. `products_added_to_cart_last_7_days`: collects the unique list of products added to the cart (ecommerce `add_to_cart` events) in the last 7 days

2. `total_product_price_clv`: sums the product prices across the customer lifetime

3. `first_mkt_source`: takes the first page_view event and reads the mkt_source property for a specific entity (e.g. domain_userid)

4. `last_device_class`: takes the last page_view event and retrieves the yauaa `deviceClass` property for a specific entity

<details>
<summary>Example batch attribute definitions</summary>

Each block creates a single attribute definition, including the logic for how it should be calculated (its filters and aggregation).

```python
from snowplow_signals import (
    Attribute,
    Criteria,
    Criterion,
    Event,
)
from datetime import timedelta

products_added_to_cart_last_7_days = Attribute(
    name="products_added_to_cart_last_7_days",
    type="string_list",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="snowplow_ecommerce_action",
            version="1-0-2",
        )
    ],
    aggregation="unique_list",
    property="contexts_com_snowplowanalytics_snowplow_ecommerce_product_1[0].name",
    criteria=Criteria(
        all=[
            Criterion(
                property="unstruct_event_com_snowplowanalytics_snowplow_ecommerce_snowplow_ecommerce_action_1:type",
                operator="=",
                value="add_to_cart",
            ),
        ],
    ),
    period=timedelta(days=7),
)

total_product_price_clv = Attribute(
    name="total_product_price_clv",
    type="float",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="snowplow_ecommerce_action",
            version="1-0-2",
        )
    ],
    aggregation="sum",
    property="contexts_com_snowplowanalytics_snowplow_ecommerce_product_1[0].price",
    criteria=Criteria(
        all=[
            Criterion(
                property="unstruct_event_com_snowplowanalytics_snowplow_ecommerce_snowplow_ecommerce_action_1:type",
                operator="=",
                value="add_to_cart",
            )
        ]
    ),
)

first_mkt_source = Attribute(
    name="first_mkt_source",
    type="string",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation="first",
    property="mkt_source",
)

last_device_class = Attribute(
    name="last_device_class",
    type="string",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation="last",
    property="contexts_nl_basjes_yauaa_context_1[0]:deviceClass",
)
```
</details>

## Defining a batch view
The key difference between a standard view and one meant for batch processing is the `offline=True` parameter. This flag indicates that the view’s attributes are computed in the data warehouse.

```python
from snowplow_signals import View, user_entity

view = View(
    name="batch_ecommerce_attributes",
    version=1,
    entity=user_entity,
    offline=True,
    attributes=[
        products_added_to_cart_last_7_days,
        total_product_price_clv,
        first_mkt_source,
        last_device_class,
    ],
)
```
## Generating the dbt project
Here we assume you have already defined the View(s) containing the custom Batch Attributes that you want the Batch Engine to generate for you.

It is best to follow our step-by-step tutorial; please check it out [here](/tutorials/snowplow-batch-engine/start/).

## Understanding the autogenerated data models
The dbt models generated by the Signals Batch Engine process events for Attribute generation in a way that saves resources by avoiding unnecessary processing. In each incremental run, only the data loaded since the last run of the data models gets processed and deduplicated. Only the relevant events and properties that are part of the Attribute definitions (defined in the same View) are used to create a `filtered_events` table. After a successful run, the `snowplow_incremental_manifest` is updated to keep a record of where each run left off.

:::info
For those familiar with existing Snowplow dbt packages, it is worth noting that the incrementalization follows a completely different logic: it is based on newly loaded data rather than reprocessing sessions as a whole.
:::

There is a second layer of incremental processing logic, dictated by the `daily_aggregation_manifest` table. After the `filtered_events` table is created or updated, the `daily_aggregates` table gets updated with the help of this manifest. It is needed because of late-arriving data, which may mean that some days need to be reprocessed as a whole. For optimization purposes there are variables to fine-tune how this works, such as `snowplow__late_event_lookback_days` and `snowplow__min_late_events_to_process`.

Finally, the `Attributes` table is generated. This is a drop-and-recompute table, fully rebuilt each time an incremental update runs. It remains cost-effective because the data is already pre-aggregated at a daily level.
![Batch Engine data models](../images/batch_engine_data_models.png)

> **Collaborator:** this diagram would work better vertically, then the text wouldn't be tiny

## Variables

> **Collaborator:** what are the defaults?

> **Collaborator:** this section should go higher on the page, under Generating the dbt project


```yml title="dbt_project.yml"
snowplow__start_date: '2025-01-01' # date from which to start looking for events, based on both load and derived_tstamp
snowplow__app_id: [] # list of app_ids to filter on; already applied in base_events_this_run
snowplow__backfill_limit_days: 1 # limit backfill increments for the filtered_events table
snowplow__late_event_lookback_days: 5 # the number of days within which late-arriving data is reprocessed fully in the daily aggregate table
snowplow__min_late_events_to_process: 1 # the number of late events skipped in previous runs (within the lookback window) required before those days are reprocessed in the daily aggregate model
snowplow__allow_refresh: false # if true, the snowplow_incremental_manifest will be dropped when running with a --full-refresh flag
snowplow__dev_target_name: dev
snowplow__databricks_catalog: "hive_metastore"
snowplow__atomic_schema: 'atomic' # only set if not using the 'atomic' schema for Snowplow events data
snowplow__database: # only set if not using target.database for Snowplow events data -- WILL BE IGNORED FOR DATABRICKS
snowplow__events_table: "events" # only set if not using the 'events' table for Snowplow events data
```
Binary file added docs/signals/images/batch_engine_data_models.png
Binary file added docs/signals/images/orgID.png
Binary file added docs/signals/images/signals.png
48 changes: 48 additions & 0 deletions docs/signals/index.md
---
title: "Snowplow Signals"
sidebar_position: 8
description: "An overview of Signals concepts."
sidebar_label: "Signals"
---

Snowplow Signals is a personalization engine built on Snowplow’s behavioral data pipeline. The Profiles API, hosted in your BDP cloud, allows you to create, manage, and access user attributes using the Signals SDKs.

![Signals overview diagram](./images/signals.png)

> **@mscwilson (Collaborator), May 21, 2025:** move this diagram into the new "Signals components" subheading below, replace it with the fancy official one that matches the docs homepage architecture diagram

Signals allows you to enhance your applications by aggregating historical user attributes and providing near real-time visibility into customer behavior.
Use the aggregated attributes to power in-product personalization and recommendation engines, adaptive UIs, and agentic applications like AI copilots and chatbots.

> **Collaborator:** add a short paragraph or section here about how to use Signals

### Sources
> **Collaborator:** this section refers to views, should go below View section

> **@mscwilson (Collaborator), May 21, 2025:** actually put these bits under a subheading


Signals components

The core components are Attributes, Sources, Views, and Services.

You will need to create a new Python project that imports the Signals SDK. Configure Signals by defining these components and deploying them to the Profiles Store. You can then pull aggregated attributes, using the Signals SDK, to use in your applications.

[diagram like this - is this right?]

signals_components


The `Source` of a view refers to the origin of the data. A source can be one of two types:

- **Batch Source:** Data that is aggregated and stored in a data warehouse, for example via dbt models.
- **Stream Source:** Data that is aggregated in real time, in stream.


### Attributes

The foundation of Signals is the `Attribute`. An attribute represents a specific fact about a user's behavior. For example:

- **Number of page views in the last 7 days:** counts how many pages a user has viewed within the past week.
- **Last product viewed:** identifies the most recent product a user interacted with.
- **Previous purchases:** provides a record of the user's past transactions.

Signals calculates user attributes in two ways:

- Stream Processing: Real-time metrics for instant personalization.
- Batch Processing: Historical insights from data stored in your warehouse.

### Views

A `View` is a collection of attributes that share a common aggregation key (e.g. `session_id` or `user_id`) and a data `Source`. You can picture it as a table of attributes, for example:

| user_id | number_of_pageviews | last_product_viewed | previous_purchases |
|---------|---------------------|---------------------|---------------------|
| `abc123`| 5 | Red Shoes |[`Blue Shoes`, `Red Hat`]|
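
In the Python SDK, a `View` groups attribute definitions together with an entity. A minimal sketch, assuming the `page_views_attribute` from the attributes documentation has already been defined (`user_entity` is the built-in user entity used in the Batch Engine examples):

```python
from snowplow_signals import View, user_entity

# Groups previously defined attributes under the user entity.
# page_views_attribute is assumed to be defined as in the attributes docs.
user_engagement_view = View(
    name="user_engagement_attributes",
    version=1,
    entity=user_entity,
    attributes=[page_views_attribute],
)
```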


### Services

A `Service` is a collection of `Views`, grouped to make the retrieval of attributes simpler.
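
As an illustrative sketch only: assuming the SDK exposes a `Service` class that takes a name and a list of views (the import and constructor below are assumptions, so check the SDK reference for the exact signature), grouping views might look like this:

```python
from snowplow_signals import Service  # assumed import; verify against the SDK

# Hypothetical service grouping previously defined views so that their
# attributes can be retrieved together. The view variables are placeholders
# for views you have already defined.
personalization_service = Service(
    name="personalization_service",
    views=[user_engagement_view, batch_ecommerce_view],
)
```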

> **Collaborator:** expand on this - there's more info to copy in the other page. why would they want multiple views?