Initial Signals Docs #1197

Open · wants to merge 17 commits into base: main

108 changes: 108 additions & 0 deletions docs/signals/attributes/index.md
---
title: "Defining Attributes"
sidebar_position: 20
description: "Attributes."
sidebar_label: "Attributes"
---

Attributes are the building blocks of Snowplow Signals. They represent specific facts about user behavior and are calculated from events in your Snowplow pipeline. This guide explains how to define an `Attribute` and how to use `Criteria` to filter events for precise aggregation. Examples of attributes include:

- **Number of page views in the last 7 Days:** counts how many pages a user has viewed within the past week.
- **Last product viewed:** identifies the most recent product a user interacted with.
- **Previous purchases:** provides a record of the user's past transactions.


### Basic Usage

For example, using the Python SDK, you can define an `Attribute` that counts the number of page views in a session as follows:

```python
from snowplow_signals import Attribute, Event

page_views_attribute = Attribute(
    name='page_views_count',
    type='int32',
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation='counter'
)
```

### Advanced Usage

Add a `Criteria` to refine the events used to calculate the `Attribute`: for example, to count only the page views on a particular page of your site.


```python
from snowplow_signals import Attribute, Event, Criteria, Criterion

page_views_attribute = Attribute(
    name='page_views_count',
    type='int32',
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation='counter',
    criteria=Criteria(
        all=[
            Criterion(
                property="page_title",
                operator="=",
                value="home_page",
            ),
        ],
    ),
)
```

The table below lists all arguments for an `Attribute`:

| **Argument Name** | **Description** | **Type** |
| --- | --- | --- |
| `name` | The name of the Attribute | `string` |
| `description` | The description of the Attribute | `string` |
| `type` | The type of the attribute value | One of: `bytes`, `string`, `int32`, `int64`, `double`, `float`, `bool`, `unix_timestamp`, `bytes_list`, `string_list`, `int32_list`, `int64_list`, `double_list`, `float_list`, `bool_list`, `unix_timestamp_list` |
| `tags` | Metadata for the Attribute | |
| `events` | List of Snowplow Events that the Attribute is calculated on | List of `Event` type; see next section |
| `aggregation` | The aggregation type of the Attribute | One of: `counter`, `sum`, `min`, `max`, `mean`, `first`, `last`, `unique_list` |
| `property` | The property of the event or entity you wish to use in the aggregation | `string` |
| `criteria` | List of `Criteria` to filter the events | List of `Criteria` type |
| `period` | The time period over which the aggregation should be calculated | `timedelta` |

> **Contributor:** I think we'll need a bit more documentation on this, as there are complexities about the syntax and naming that should be used here. We decided to use Snowflake syntax for accessing nested properties within events and entities. Also, the column names are the same as in the atomic events table. There was a bit more detail in this doc. One can also access other properties in the atomic events table, like `app_id` and more. It'd be great to provide some examples.

> **Contributor:** I think it deserves a separate paragraph/section, as users will definitely need to figure out how to do that.


### Event Type
The `Event` specifies the type of event that the `Attribute` is calculated on. It should reference a Snowplow event that exists in your Snowplow account.

| **Argument Name** | **Description** | **Type** |
| --- | --- | --- |
| `name` | Name of the event (the `event_name` column in the atomic events table) | `string` |
| `vendor` | Vendor of the event (the `event_vendor` column in the atomic events table) | `string` |
| `version` | Version of the event (the `event_version` column in the atomic events table) | `string` |
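
The same structure works for any event in your pipeline, including custom self-describing events. A minimal sketch (the vendor, name, and version below are hypothetical placeholders; use the values from your own event schema):

```python
from snowplow_signals import Event

# Hypothetical custom event reference: replace vendor, name, and version
# with the details of an event tracked in your own pipeline.
product_search_event = Event(
    vendor="com.acme",       # event_vendor in the atomic events table
    name="product_search",   # event_name in the atomic events table
    version="1-0-0",         # event_version in the atomic events table
)
```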

### Criteria
A `Criteria` filters the events used to calculate an `Attribute`. It is made up of individual `Criterion` conditions.

> **Collaborator:** why are these not called Filters/Filtersets 😒


A `Criteria` accepts one of two parameters, both lists of individual `Criterion`:

- `all`: An array of conditions used to filter the events. All conditions must be met.
- `any`: An array of conditions used to filter the events. At least one of the conditions must be met.

A `Criterion` specifies the individual filter conditions for an `Attribute` using the following properties.

| **Argument Name** | **Description** | **Type** |
| --- | --- | --- |
| `property` | The path to the property on the event or entity you wish to filter. | `string` |
| `operator` | The operator used to compare the property to the value. | One of: `=`, `!=`, `<`, `>`, `<=`, `>=`, `like`, `in` |
| `value` | The value to compare the property to. | One of: `str`, `int`, `float`, `bool`, `List[str]`, `List[int]`, `List[float]`, `List[bool]` |
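
For example, to match page view events on either of two pages, use `any` instead of `all` (the page titles below are illustrative):

```python
from snowplow_signals import Criteria, Criterion

# Match events whose page_title is either "home_page" or "pricing_page".
criteria = Criteria(
    any=[
        Criterion(property="page_title", operator="=", value="home_page"),
        Criterion(property="page_title", operator="=", value="pricing_page"),
    ],
)
```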

191 changes: 191 additions & 0 deletions docs/signals/batch_engine/index.md
---
title: "Batch Engine"
sidebar_position: 40
description: "In depth explanation on how the Batch Engine works."
sidebar_label: "Batch Engine"
---

While many `Attributes` can be computed in stream, those that have to be calculated over a longer period (e.g. a day or more) can only be created "offline" in the warehouse. We call these `Batch Attributes`.

One advantage of computing attributes in batch rather than in stream is that the computation can go back in history, using what is already available in your atomic events dataset.

The entity here is typically the user, identified by the `domain_userid` or another Snowplow identifier field, such as the logged-in `user_id`.

:::info
For now, only the `domain_userid` can be used, but we will shortly extend support to all Snowplow identifiers.
:::

> **Collaborator:** does this also apply to the stream source?

Typical examples of `Batch Attributes` include:
- customer lifetime values
- specific transactions that have or have not happened in the last X days
- the first or last events a specific user generated, or any properties associated with these events

You may already have tables in your warehouse that contain such computed values, in which case you only have to register them as a batch source to be used by Signals. If you don't have them, we have developed a tool called the **`Batch Engine`** to efficiently generate these `Attributes` for you with the help of a few CLI commands. This saves you from building complex data models and from recalculating the values over a very large table each time, which over time becomes costly or even impossible.

## How it works
First, define a set of `Attributes` and register them as a `View` through the Python Signals SDK. Then use the optional CLI functionality of the SDK to generate a dbt project, which will ultimately produce a view-specific attribute table. All that's left is to materialize the table, after which Signals will regularly fetch the values from your warehouse table and serve them through the Profiles API.

## Defining batch attributes
Syntactically, defining batch and stream attributes works the same way. For a general overview, please refer to the [attributes](/docs/signals/attributes/index.md) section.

There are 4 main types of attributes that you will likely want to define for batch processing:
1. `Time Windowed Attributes`: actions that happened in the last X days. The period needs to be defined as a `timedelta` in days.

2. `Lifetime Attributes`: calculated over all the available data for the entity. The period needs to be left as `None`.

3. `First Touch Attributes`: events (or properties) that happened for the first time for a given entity. The period needs to be left as `None`.

4. `Last Touch Attributes`: events (or properties) that happened for the last time for a given entity. The period needs to be left as `None`.

Each of these 4 types is illustrated with an example in the block below.

1. `products_added_to_cart_last_7_days`: collects the unique list of products added to the cart (ecommerce `add_to_cart` events) in the last 7 days

2. `total_product_price_clv`: sums the product prices across the customer lifetime

3. `first_mkt_source`: takes the first page_view event and reads the mkt_source property for a specific entity (e.g. domain_userid)

4. `last_device_class`: takes the last page_view event and retrieves the yauaa `deviceClass` property for a specific entity

<details>
<summary>Example batch attribute definitions</summary>

Each block creates a single attribute definition, including the logic for how it should be calculated (its filters and aggregation).

```python
from snowplow_signals import (
    Attribute,
    Criteria,
    Criterion,
    Event,
)
from datetime import timedelta

products_added_to_cart_last_7_days = Attribute(
    name="products_added_to_cart_last_7_days",
    type="string_list",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="snowplow_ecommerce_action",
            version="1-0-2",
        )
    ],
    aggregation="unique_list",
    property="contexts_com_snowplowanalytics_snowplow_ecommerce_product_1[0].name",
    criteria=Criteria(
        all=[
            Criterion(
                property="unstruct_event_com_snowplowanalytics_snowplow_ecommerce_snowplow_ecommerce_action_1:type",
                operator="=",
                value="add_to_cart",
            ),
        ],
    ),
    period=timedelta(days=7),
)

total_product_price_clv = Attribute(
    name="total_product_price_clv",
    type="float",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="snowplow_ecommerce_action",
            version="1-0-2",
        )
    ],
    aggregation="sum",
    property="contexts_com_snowplowanalytics_snowplow_ecommerce_product_1[0].price",
    criteria=Criteria(
        all=[
            Criterion(
                property="unstruct_event_com_snowplowanalytics_snowplow_ecommerce_snowplow_ecommerce_action_1:type",
                operator="=",
                value="add_to_cart",
            )
        ]
    ),
)

first_mkt_source = Attribute(
    name="first_mkt_source",
    type="string",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation="first",
    property="mkt_source",
)

last_device_class = Attribute(
    name="last_device_class",
    type="string",
    events=[
        Event(
            vendor="com.snowplowanalytics.snowplow",
            name="page_view",
            version="1-0-0",
        )
    ],
    aggregation="last",
    property="contexts_nl_basjes_yauaa_context_1[0]:deviceClass",
)
```
</details>

## Defining a batch view
The key difference between a standard view and one meant for batch processing is the `offline=True` parameter. This flag indicates that the view’s attributes are computed in the data warehouse.

```python
from snowplow_signals import View, user_entity

view = View(
    name="batch_ecommerce_attributes",
    version=1,
    entity=user_entity,
    offline=True,
    attributes=[
        products_added_to_cart_last_7_days,
        total_product_price_clv,
        first_mkt_source,
        last_device_class,
    ],
)
```
## Generating the dbt project
Here we assume you have already defined the View(s) containing the custom Batch Attributes that you want the Batch Engine to generate for you.

It is best to follow our step-by-step tutorial; please check it out [here](/tutorials/snowplow-batch-engine/start/).

## Understanding the autogenerated data models
The dbt models generated by the Signals Batch Engine process events for Attribute generation in a way that saves resources by avoiding unnecessary processing. In each incremental run, only the data loaded since the last run of the data models gets processed and deduplicated. Only the relevant events and properties that are part of the Attribute definitions (defined in the same View) are used to create a `filtered_events` table. After a successful run, the `snowplow_incremental_manifest` is updated to keep a record of where each run left off.

:::info
For those familiar with existing Snowplow dbt packages, it is worth noting that the incrementalization follows a completely different logic: it is based on newly loaded data rather than reprocessing sessions as a whole.
:::

There is a second layer of incremental processing logic, dictated by the `daily_aggregation_manifest` table. After the `filtered_events` table is created or updated, the `daily_aggregates` table gets updated with the help of this manifest. It is needed because of late-arriving data, which may mean that some days need to be reprocessed as a whole. For optimization purposes there are variables to fine-tune how this works, such as `snowplow__late_event_lookback_days` and `snowplow__min_late_events_to_process`.

Finally, the `Attributes` table is generated. This is a drop-and-recompute table, fully rebuilt each time an incremental update runs. It remains cost-effective because the data is already pre-aggregated at a daily level.
![Batch Engine data models](../images/batch_engine_data_models.png)

> **Collaborator:** this diagram would work better vertically, then the text wouldn't be tiny

## Variables

> **Collaborator:** what are the defaults?

> **Collaborator:** this section should go higher on the page, under Generating the dbt project


```yml title="dbt_project.yml"
snowplow__start_date: '2025-01-01' # date from which to start looking for events, based on both load and derived_tstamp
snowplow__app_id: [] # list of app_ids to filter on; already applied in base_events_this_run
snowplow__backfill_limit_days: 1 # limit backfill increments for the filtered_events table
snowplow__late_event_lookback_days: 5 # the number of days within which late-arriving data is reprocessed fully in the daily aggregate table
snowplow__min_late_events_to_process: 1 # the number of late events skipped in previous runs (within the lookback window) required before those days are reprocessed in the daily aggregate model
snowplow__allow_refresh: false # if true, the snowplow_incremental_manifest will be dropped when running with a --full-refresh flag
snowplow__dev_target_name: dev
snowplow__databricks_catalog: "hive_metastore"
snowplow__atomic_schema: 'atomic' # only set if not using the 'atomic' schema for Snowplow events data
snowplow__database: # only set if not using target.database for Snowplow events data -- WILL BE IGNORED FOR DATABRICKS
snowplow__events_table: "events" # only set if not using the 'events' table for Snowplow events data
```
Binary file added docs/signals/images/batch_engine_data_models.png
Binary file added docs/signals/images/orgID.png
Binary file added docs/signals/images/signals.png
48 changes: 48 additions & 0 deletions docs/signals/index.md
---
title: "Snowplow Signals"
sidebar_position: 8
description: "An overview of Signals concepts."
sidebar_label: "Signals"
---

Snowplow Signals is a personalization engine built on Snowplow’s behavioral data pipeline. The Profiles API, hosted in your BDP cloud, allows you to create, manage, and access user attributes using the Signals SDKs.

![Signals overview diagram](./images/signals.png)

> **@mscwilson (Collaborator), May 21, 2025:** move this diagram into the new "Signals components" subheading below, replace it with the fancy official one that matches the docs homepage architecture diagram

Signals allows you to enhance your applications by aggregating historical user attributes and providing near real-time visibility into customer behavior.
Use the aggregated attributes to power in-product personalization and recommendation engines, adaptive UIs, and agentic applications like AI copilots and chatbots.

> **Collaborator:** add a short paragraph or section here about how to use Signals

### Sources
> **Collaborator:** this section refers to views, should go below View section

> **@mscwilson (Collaborator), May 21, 2025:** actually put these bits under a subheading


Signals components

The core components are Attributes, Sources, Views, and Services.

You will need to create a new Python project that imports the Signals SDK. Configure Signals by defining these components and deploying them to the Profiles Store. You can then pull aggregated attributes, using the Signals SDK, to use in your applications.

[diagram like this - is this right?]

signals_components


The `Source` of a view refers to the origin of the data. A source can be one of two types:

- **Batch Source:** Data that is aggregated and stored in a data warehouse, for example via dbt models.
- **Stream Source:** Data that is aggregated in real time, in stream.


### Attributes

The foundation of Signals is the `Attribute`. An attribute represents a specific fact about a user's behavior. For example:

- **Number of page views in the last 7 days:** counts how many pages a user has viewed within the past week.
- **Last product viewed:** identifies the most recent product a user interacted with.
- **Previous purchases:** provides a record of the user's past transactions.

Signals calculates user attributes in two ways:

- Stream Processing: Real-time metrics for instant personalization.
- Batch Processing: Historical insights from data stored in your warehouse.

### Views

A `View` is a collection of attributes that share a common aggregation key (e.g. `session_id` or `user_id`) and a data `Source`. You can picture it as a table of attributes, for example:

| user_id | number_of_pageviews | last_product_viewed | previous_purchases |
|---------|---------------------|---------------------|---------------------|
| `abc123`| 5 | Red Shoes |[`Blue Shoes`, `Red Hat`]|
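
In the Python SDK, a `View` groups attribute definitions together with an entity. A minimal sketch, assuming the `page_views_attribute` from the attributes documentation has already been defined (`user_entity` is the built-in user entity used in the Batch Engine examples):

```python
from snowplow_signals import View, user_entity

# Groups previously defined attributes under the user entity.
# page_views_attribute is assumed to be defined as in the attributes docs.
user_engagement_view = View(
    name="user_engagement_attributes",
    version=1,
    entity=user_entity,
    attributes=[page_views_attribute],
)
```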


### Services

A `Service` is a collection of `Views`, grouped to make the retrieval of attributes simpler.
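
As an illustrative sketch only: assuming the SDK exposes a `Service` class that takes a name and a list of views (the import and constructor below are assumptions, so check the SDK reference for the exact signature), grouping views might look like this:

```python
from snowplow_signals import Service  # assumed import; verify against the SDK

# Hypothetical service grouping previously defined views so that their
# attributes can be retrieved together. The view variables are placeholders
# for views you have already defined.
personalization_service = Service(
    name="personalization_service",
    views=[user_engagement_view, batch_ecommerce_view],
)
```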

> **Collaborator:** expand on this - there's more info to copy in the other page. why would they want multiple views?