overview draft #969


Merged: 6 commits, Mar 26, 2025
14 changes: 1 addition & 13 deletions _data/nav.yml
@@ -3,12 +3,7 @@
page: index.html

- title: Soda overview
page: overview/overview.md
subcategories:
- subtitle: Data testing
page: overview/data-testing.md
- subtitle: Observability
page: overview/observability.md
page: soda/overview.md

- title: Data testing
page: data-testing/data-testing.md
@@ -54,13 +49,6 @@
subcategories:
- subtitle: Quickstart
page: observability/quickstart.md
- subtitle: Introduction
page: observability/introduction.md
subcategories:
- subtitle: Observability
page: observability/what-is-observability.md
- subtitle: Metrics monitoring
page: observability/metrics-monitoring.md
- subtitle: How it works/Observability Guide
page: get-started/get-started-observability.md
subcategories:
2 changes: 0 additions & 2 deletions _includes/what-is-observability.md

This file was deleted.

19 changes: 0 additions & 19 deletions observability/introduction.md

This file was deleted.

41 changes: 40 additions & 1 deletion observability/observability.md
@@ -9,4 +9,43 @@ nav_order: 500

*Last modified on {% last_modified_at %}*

{% include banner-upgrade.md %}

Use observability to monitor data quality at scale across all your datasets.
Observability helps you catch unexpected issues without needing to define every rule up front.

Where data testing focuses on known expectations, observability helps you detect the unknown unknowns—like late-arriving records, schema changes, or sudden spikes in missing values. It offers broad, low-effort coverage and requires little configuration, making it easy to share data quality responsibilities across technical and non-technical teams.

## What is data observability?

**Data observability** is the practice of continuously monitoring your data for unexpected changes, anomalies, and structural issues. It involves collecting and analyzing metrics about your datasets to understand their health over time.

Instead of writing checks manually for each dataset, observability uses profiling and metrics to automatically detect problems such as:
- A spike in null values
- A drop in row counts
- Unusual value distributions

**Data Observability helps you:**
- Detect incidents faster
- Scale coverage across more data
- Reduce time spent on manual testing
- Empower more team members to spot and act on issues


## What is metrics monitoring?

**Metrics monitoring** is the foundation of data observability in Soda. Soda collects metrics from datasets—such as row count, null values, min/max, and value distribution—and tracks how those metrics evolve over time.

Soda then uses built-in anomaly detection to identify when metrics deviate from expected patterns. These deviations are surfaced in the **Metric Monitors** tab for each dataset.
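
Metric monitoring itself requires no configuration as code, but if you also want explicit, version-controlled coverage for a critical dataset, the same anomaly detection is available as SodaCL checks. A minimal sketch, assuming a hypothetical dataset named `dim_customer` with an `email` column:

```yaml
# Hypothetical SodaCL anomaly detection checks for a dataset named dim_customer.
# Soda learns a baseline from historical measurements and flags deviations from it.
checks for dim_customer:
  - anomaly detection for row_count               # unexpected drops or spikes in volume
  - anomaly detection for missing_count(email)    # unusual increases in missing values
```

Either way, detected anomalies surface in the **Metric Monitors** tab described above.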

You can use metric monitoring to:
- Spot problems without writing checks
- Establish baselines for normal behavior
- Alert data owners when something unusual happens
- Provide insight to business users without requiring code

## What's Next?
To get started with Soda observability, follow one of these guides:

- [Data observability quickstart]({% link observability/quickstart.md %}): Set up monitoring to detect anomalies in your datasets.
- [Data observability guide]({% link observability/observability-guide.md %}): Learn how to get the most out of Soda’s data observability platform.
116 changes: 83 additions & 33 deletions observability/quickstart.md
@@ -1,7 +1,7 @@
---
layout: default
title: Quickstart observability
description: Quickstart observability
title: Quickstart Observability
description: Quickstart Observability
parent: Observability
nav_order: 511
---
@@ -10,23 +10,38 @@ nav_order: 511

*Last modified on {% last_modified_at %}*

In this Quickstart, you'll:
- create a Soda Cloud account,
- connect a data source, and
- configure your first dataset to enable observability.
In this quickstart, you will:
- Create a Soda Cloud account
- Connect a data source
- Configure your first dataset to enable observability.

## Step 1: Create a Soda Cloud Account
1. Go to <a href="https://cloud.soda.io/signup?utm_source=docs" target="_blank"> cloud.soda.io</a> and create a Soda Cloud account.
If you already have a Soda account, log in.
2. By default, Soda prepares a Soda-hosted agent for all newly-created accounts. However, if you are an Admin in an existing Soda Cloud account and wish to use a Soda-hosted agent, navigate to **your avatar** > **Organization Settings**. In the **Organization** tab, click the checkbox to **Enable Soda-hosted Agent**.
3. Navigate to **your avatar** > **Data Sources**, then access the **Agents** tab. Notice your out-of-the-box Soda-hosted agent that is up and running. <br />
1. Go to <a href="https://cloud.soda.io/signup?utm_source=docs" target="_blank"> cloud.soda.io</a> and sign up for a Soda Cloud account. If you already have an account, log in.
2. By default, Soda creates a Soda-hosted Agent for all new accounts. You can think of an Agent as the bridge between your data sources and Soda Cloud. A Soda-hosted Agent runs in Soda's cloud and securely connects to your data sources to scan for data quality issues.
3. If you are an admin and prefer to deploy your own agent, you can configure a self-hosted agent:

- In Soda Cloud, go to **your avatar** > **Agents**
- Click **New Soda Agent** and follow the setup instructions
<br />
![soda-hosted-agent](/assets/images/soda-hosted-agent.png){:height="700px" width="700px"}

> **Soda Agent Basics**
> <br />
> There are two types of Soda Agents:
> 1. **Soda-hosted Agent:** This is an out-of-the-box, ready-to-use agent that Soda provides and manages for you. It's the quickest way to get started with Soda as it requires no installation or deployment. It supports connections to specific data sources like BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, and Snowflake. [Soda-hosted agent (missing)](#)
> 2. **Self-hosted Agent:** This is a version of the agent that you deploy in your own Kubernetes cluster within your cloud environment (like AWS, Azure, or Google Cloud). It gives you more control and supports a wider range of data sources. [Self-hosted agent (missing)](#)
>
> A Soda Agent is essentially Soda Library (the core scanning technology) packaged as a containerized application that runs in Kubernetes. It acts as the bridge between your data sources and Soda Cloud, allowing users to:
> - Connect to data sources securely
> - Run scans to check data quality
> - Create and manage no-code checks directly in the Soda Cloud interface
>
> The agent only sends metadata (not your actual data) to Soda Cloud, keeping your data secure within your environment. See [Soda Agent basic concepts (missing)](#).
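
If you choose a self-hosted agent, it runs in your own Kubernetes cluster and is typically installed with Helm. A hedged sketch of the values you would pass to the chart (the agent name is illustrative, and the API key values are generated in Soda Cloud when you create the agent):

```yaml
# Illustrative values.yml for a self-hosted Soda Agent Helm deployment.
soda:
  apikey:
    id: "<your-api-key-id>"          # generated in Soda Cloud when you create the agent
    secret: "<your-api-key-secret>"
  agent:
    name: "my-soda-agent"            # any unique name to identify the agent in Soda Cloud
```

Refer to the self-hosted agent setup instructions for the exact chart values supported in your environment.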

## Step 2: Add a Data Source
1. In your Soda Cloud account, navigate to **your avatar** > **Data Sources**.
2. Click **New Data Source**, then follow the guided steps to create a new data source (e.g., PostgreSQL, BigQuery).
Enter the required connection details (host, port, database name, credentials).
Refer to the section - **Attributes** below for insight into the values to enter in the fields and editing panels in the guided steps.
1. In Soda Cloud, go to **your avatar** > **Data Sources**.
2. Click **New Data Source**, then follow the guided steps to create the connection.
Use the table below to understand what each field means and how to complete it:

#### Attributes

@@ -39,12 +54,10 @@
| Custom Cron Expression | (Optional) Write your own <a href="https://en.wikipedia.org/wiki/Cron" target="_blank">cron expression</a> to define the schedule Soda Cloud uses to run scans. |
| Anomaly Dashboard Scan Schedule <br />![available-2025](/assets/images/available-2025.png){:height="150px" width="150px"} <br /> | Provide the scan frequency details Soda Cloud uses to execute a daily scan to automatically detect anomalies for the anomaly dashboard. |
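
For example, the cron expression `0 6 * * *` tells Soda Cloud to run the scan every day at 06:00.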

{:start="3"}
3. Complete the connection configuration. These settings are specific to each data source (PostgreSQL, MySQL, Snowflake, etc.) and usually include connection details such as host, port, credentials, and database name.

3. Enter values in the fields to provide the connection configurations Soda Cloud needs to be able to access the data in the data source. Connection configurations are data source-specific and include values for things such as a database's host and access credentials.

Soda hosts agents in a secure environment in Amazon AWS. As a SOC 2 Type 2 certified business, Soda responsibly manages Soda-hosted agents to ensure that they remain private, secure, and independent of all other hosted agents. See [Data security and privacy]({% link soda/data-privacy.md %}#using-a-soda-hosted-agent) for details.

Use the following data source-specific connection configuration pages to populate the connection fields in Soda Cloud.
Use the appropriate guide below to complete the connection:
* [Connect to BigQuery]({% link soda/connect-bigquery.md %})
* [Connect to Databricks SQL]({% link soda/connect-spark.md %}#connect-to-spark-for-databricks-sql)
* [Connect to MS SQL Server]({% link soda/connect-mssql.md %})
@@ -53,27 +66,64 @@ Use the following data source-specific connection configuration pages to populat
* [Connect to Redshift]({% link soda/connect-redshift.md %})
* [Connect to Snowflake]({% link soda/connect-snowflake.md %})
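
For instance, a PostgreSQL connection typically looks like the sketch below; the values are illustrative, and the exact fields for each data source are listed in its guide:

```yaml
# Illustrative connection configuration; field names vary by data source and Soda version.
data_source my_postgres_datasource:
  type: postgres
  host: db.example.com
  port: 5432
  username: ${POSTGRES_USER}       # reference credentials from environment variables or secrets
  password: ${POSTGRES_PASSWORD}
  database: analytics
  schema: public
```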

💡 Already have a data source connected to a self-hosted agent? You can [migrate]({% link soda/upgrade.md %}#migrate-a-data-source-from-a-self-hosted-to-a-soda-hosted-agent) it to a Soda-hosted agent.

## Step 3: Configure Dataset Discovery
Dataset discovery captures metadata about each dataset, including its schema and the data types of each column.

- In Step 3 of the guided workflow, specify the datasets you want to profile. Because dataset discovery can be resource-intensive, only include the datasets you need for observability.
See [Compute consumption and cost considerations]({% link soda-cl/profile.md %}#compute-consumption-and-cost-considerations) for more detail.
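
If you prefer to see the configuration as code, dataset discovery uses include/exclude patterns similar to the profiling syntax; a minimal sketch with illustrative dataset name patterns:

```yaml
# Illustrative dataset discovery configuration; include only the datasets you need.
discover datasets:
  datasets:
    - include prod%   # capture schema metadata for datasets beginning with 'prod'
    - exclude test%   # skip datasets beginning with 'test'
```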

## Step 3: Select and Configure a Dataset
## Step 4: Add Column Profiling
Column profiling extracts metrics such as the mean, minimum, and maximum values in a column, and the number of missing values.

1. In the editing panel of **4. Profile**, use the include and exclude syntax to indicate the datasets for which Soda must profile and prepare an anomaly dashboard. The default syntax in the editing panel instructs Soda to profile every column of every dataset in the data source, and, superfluously, all datasets with names that begin with prod. The `%` is a wildcard character. See [Add column profiling]({% link soda-cl/profile.md %}#add-column-profiling) for more detail on profiling syntax.
- In Step 4 of the guided workflow, use include/exclude patterns to define which columns Soda should profile. Soda uses this information to power the anomaly dashboard. Learn more about [column profiling syntax]({% link soda-cl/profile.md %}#add-column-profiling).

```yaml
profile columns:
columns:
- "%.%" # Includes all your datasets
- prod% # Includes all datasets that begin with 'prod'
profile columns:
columns:
- "%.%" # Includes all columns of all datasets
- "prod%.%" # Includes all columns of all datasets that begin with 'prod'
```

2. Continue the remaining steps to add your new data source, then **Test Connection**, if you wish, and **Save** the data source configuration.
## Step 5: Add Automated Monitoring Checks
In Step 5 of the guided workflow, define which datasets should have automated checks applied for anomaly scores and schema evolution.

> If you are using the early access anomaly dashboard, this step is not required. Soda automatically enables monitoring in the dashboard. See [Anomaly Dashboard]({% link soda-cloud/anomaly-dashboard.md %}) for details.

Use include/exclude filters to target specific datasets. Read more about [automated monitoring configuration]({% link soda-cl/automated-monitoring.md %}).

```yaml
automated monitoring:
datasets:
- include prod% # Includes all the datasets that begin with 'prod'
- exclude test% # Excludes all the datasets that begin with 'test'
```

## Step 6: Assign a Data Source and Dataset Owner
In Step 6 of the guided workflow, assign responsibility for maintaining the data source and each dataset.

- **Data Source Owner:** Manages the connection settings and scan configurations for the data source.
- **Dataset Owner:** Becomes the default owner of each dataset for monitoring and collaboration.

For more details, see [Roles and rights in Soda Cloud]({% link soda-cloud/roles-global.md %}).

## Step 7: Test Connection and Save
- Click **Test Connection** to verify your configuration.
- Click **Save** to start profiling the selected datasets.

Once saved, Soda runs a first scan using your profiling settings. This initial scan provides baseline measurements that Soda uses to begin learning patterns and identifying anomalies.

## Step 8: View Metric Monitor Results
1. Go to the **Datasets** page in Soda Cloud.
2. Select a dataset you included in profiling.
3. Open the **Metric Monitors** tab to view automatically detected issues.

![profile-anomalies](/assets/images/profile-anomalies.png){:height="700px" width="700px"}

3. Soda begins profiling the datasets according to your **Profile** configuration while the algorithm uses the first measurements collected from a scan of your data to begin the work of identifying patterns in the data. You can navigate to the **Dataset** page for a dataset you included in profiling. Click the **Monitors** tab to view the issues Soda automatically detected.
### 🎉 Congratulations! You’ve set up your dataset and enabled observability.

### Congratulations! You’ve set up your dataset and enabled observability.
## What's Next?
Now that your first dataset is configured and observability is active, try:

#### What's Next?
Now that you’ve set up your first dataset and enabled observability, try:
[Exploring detailed metrics in the dashboard.]({% link observability/anomaly-dashboard.md %})
[Setting up notifications for anomaly detection.]({% link observability/set-up-alerts.md %})
- [Explore detailed metrics in the anomaly dashboard]({% link observability/anomaly-dashboard.md %})
- [Set up alerts for anomaly detection]({% link observability/set-up-alerts.md %})
9 changes: 0 additions & 9 deletions overview/data-testing.md

This file was deleted.

9 changes: 0 additions & 9 deletions overview/observability.md

This file was deleted.

9 changes: 0 additions & 9 deletions overview/overview.md

This file was deleted.
