
Conversation


@Karavil Karavil commented Aug 12, 2025

Summary

Adds support for ignoring specific tables from PostgreSQL publications during replication. Tables in IGNORED_PUBLICATION_TABLES get created in SQLite but stay empty - all changes are dropped. Useful for excluding audit logs or other high-volume tables.

The ignored tables are stored in the database alongside publications in InternalShardConfig. Together they define the replication boundary: publications control what Postgres sends, and ignored tables control what Zero Cache accepts.

Important: Changing ignored tables triggers a full resync (like changing publications). This prevents stale data from remaining in SQLite when a table becomes ignored.
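
To make the intended behavior concrete, here is a minimal sketch of the kind of application-level filter described above; the names (Change, isTableIgnored, maybeApplyChange) are illustrative, not the actual zero-cache internals.

type Change = {schema: string; table: string; op: 'insert' | 'update' | 'delete'};

function isTableIgnored(
  ignored: ReadonlySet<string>,
  schema: string,
  table: string,
): boolean {
  // Exact match against fully qualified "schema.table" names.
  return ignored.has(`${schema}.${table}`);
}

// Hypothetical hook in the replication path: changes to ignored tables are
// simply dropped, so the table exists in SQLite but never receives rows.
function maybeApplyChange(
  ignored: ReadonlySet<string>,
  change: Change,
  apply: (c: Change) => void,
): void {
  if (isTableIgnored(ignored, change.schema, change.table)) {
    return;
  }
  apply(change);
}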

Design Decision

Ignored tables are like publications - both define what data replicates. Publications say what Postgres sends, ignored tables say what Zero Cache keeps. If nodes disagree on either during deployment, you get inconsistent data (some nodes have audit_logs, others don't).

Wrestled with where to store the config:

Option 1: In-memory from env vars (problematic)
================================================

PostgreSQL publishes: [users, logs, temp]
                           ↓

Rolling deployment:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Node A    │    │   Node B    │    │   Node C    │
│   (v1.0)    │    │   (v1.0)    │    │   (v1.1)    │ ← new version!
│             │    │             │    │             │
│ ENV: logs   │    │ ENV: logs   │    │ ENV: logs,  │
│             │    │             │    │      temp   │
└─────────────┘    └─────────────┘    └─────────────┘
      ↓                   ↓                   ↓
Replicates:          Replicates:          Replicates:
[users, temp]        [users, temp]        [users]      ← INCONSISTENT!


Option 2: Database storage (what I did)
========================================

PostgreSQL publishes: [users, logs, temp]
                           ↓
                    ┌──────────────┐
                    │  shardConfig │
                    │  ignored:    │
                    │  [logs]      │
                    └──────────────┘
                           ↓

Rolling deployment (all nodes read same config):
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Node A    │    │   Node B    │    │   Node C    │
│   (v1.0)    │    │   (v1.0)    │    │   (v1.1)    │
└─────────────┘    └─────────────┘    └─────────────┘
      ↓                   ↓                   ↓
Replicates:          Replicates:          Replicates:
[users, temp]        [users, temp]        [users, temp] ← CONSISTENT!

The database approach ensures all nodes agree on what to replicate, even mid-deployment. This matters because ignored tables, like publications, are replication config, not app config. App config affects how a node runs (ports, log levels), so it can safely vary between nodes; replication config affects what data exists, so it must be consistent or different nodes end up with different data.
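
As a rough illustration of this approach (the field and function names here are hypothetical, not the real InternalShardConfig schema), every node reads the same stored config at startup instead of consulting its own environment:

// Sketch only: each zero-cache node reads the replication boundary from the
// change database, so a mid-deployment node cannot diverge from its peers.
interface InternalShardConfigLike {
  publications: string[];
  ignoredTables: string[]; // fully qualified, e.g. "public.audit_logs"
}

async function loadReplicationBoundary(
  readShardConfig: () => Promise<InternalShardConfigLike>,
): Promise<{publications: string[]; ignoredTables: ReadonlySet<string>}> {
  const config = await readShardConfig();
  return {
    publications: config.publications,
    ignoredTables: new Set(config.ignoredTables),
  };
}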

First PR here, might be missing context. Happy to refactor if storing in database seems overkill - just felt like ignored tables belong with publications since they both define the replication boundary.

Testing

export ZERO_UPSTREAM_DB="postgresql://user:pass@localhost:5432/mydb"
export ZERO_IGNORED_PUBLICATION_TABLES='["public.audit_logs"]'

Check:

  • Logs show "Skipping initial sync for ignored table"
  • Tables exist but empty
  • Changes don't replicate to ignored tables
  • Changing ignored tables triggers resync: "Dropping shard to change ignored tables"

Adds ZERO_IGNORED_PUBLICATION_TABLES environment variable to exclude specific
tables from replication while preserving schema compatibility.

Key features:
• 🎯 Tables are created but remain empty (schema preserved)
• ⚡ Initial sync skips data copying for ignored tables
• 🔄 Replication changes are dropped for ignored tables
• ✅ Works because SQLite foreign keys are disabled by default

Usage:
export ZERO_IGNORED_PUBLICATION_TABLES='["audit_logs", "staging.imports"]'

This minimal implementation (~50 lines) provides table filtering without
modifying SQL queries, making it safer and simpler than deep integration.
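
For intuition, here is a sketch of what "initial sync skips data copying" could look like in the table-copy loop; createTable, copyTableData, and the table shape are stand-ins rather than the real initial-sync code (the log message matches the one noted under Testing).

// Sketch: create every published table in SQLite, but only copy rows for
// tables that are not ignored. Ignored tables contribute 0 rows to the totals.
type PublishedTable = {schema: string; name: string};

async function syncTables(
  tables: PublishedTable[],
  ignored: ReadonlySet<string>,
  createTable: (t: PublishedTable) => Promise<void>,
  copyTableData: (t: PublishedTable) => Promise<number>,
  log: (msg: string) => void,
): Promise<number> {
  let totalRows = 0;
  for (const t of tables) {
    await createTable(t); // schema is always created, ignored or not
    if (ignored.has(`${t.schema}.${t.name}`)) {
      log(`Skipping initial sync for ignored table ${t.schema}.${t.name}`);
      continue;
    }
    totalRows += await copyTableData(t);
  }
  return totalRows;
}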

vercel bot commented Aug 12, 2025

@Karavil is attempting to deploy a commit to the Rocicorp Team on Vercel.

A member of the Team first needs to authorize it.

Karavil added 18 commits August 12, 2025 18:44
- Add #isTableIgnored() helper method in ChangeMaker
- Add helper functions in initial-sync for table filtering
- Remove redundant checks across all operations
- Remove useless test file that was testing Set behavior
- Create ignored-tables.ts with shared utilities
- Remove duplicated table expansion logic
- Single source of truth for table filtering logic
- Remove confusing auto-expansion of table names
- Use direct Set matching: 'users' matches any schema, 'public.users' matches specific
- Eliminates bugs with table names containing dots
- Much simpler and more predictable behavior
- Require schema.table format (e.g., 'public.users')
- Add validation to reject simple table names
- Remove ambiguity - you must specify exactly which schema
- Simpler implementation with only exact matches
- Remove class method redefinition (#isTableIgnored)
- Use shared isTableIgnored function from ignored-tables.ts
- Pass ignoredTables Set as parameter for all calls
- Improves code consistency and maintainability
- Move ignoredTables from Zero config layer to shard config in database
- Eliminates layering violation where pg abstraction imported Zero config
- Add ignoredTables to InternalShardConfig schema and shardConfig table
- Include migration (v11) to add column for existing shards
- Pass ignoredTables from Zero config during shard initialization
- ChangeMaker and initial-sync now read from InternalShardConfig
- Provides consistency across distributed nodes
- Cleaner architecture with proper separation of concerns
…sing

- Add .map() transform to internalShardConfigSchema to convert array to Set
- Remove buildIgnoredTablesSet function as it's no longer needed
- ChangeMaker and initial-sync now use the Set directly from config
- More efficient - Set is created once during schema parsing
- Cleaner code with less manual conversion
- Add check to compare requested vs replicated ignored tables
- Drop shard and throw AutoResetSignal when mismatch detected
- Prevents stale data from remaining in SQLite when tables become ignored
- Follows same pattern as publication changes
- Use equals() from set-utils instead of deepEqual with arrays
- Create Sets from both sides for proper Set comparison
- Cleaner and more consistent with existing Set operations in codebase
- Use .optional(() => []) pattern consistently for default empty array
- Remove unnecessary fallbacks since value is always defined
- ShardConfig.ignoredTables is now always an array, never undefined
- Cleaner code without || [] checks everywhere
- Filter out ignored tables before processing instead of inside map
- Cleaner code - no fake Promise.resolve for ignored tables
- More efficient - only process tables that actually need copying
- Same behavior - ignored tables still logged and contribute 0 to totals
- Remove obvious comments that just repeat what the code does
- Keep comments that explain why or provide important context
- Code is self-explanatory, no need for comment
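
A minimal sketch tying together the validation, exact Set matching, and resync check described in the commits above; setsEqual stands in for the equals() helper from set-utils, and resetReplica stands in for dropping the shard and throwing AutoResetSignal.

// Sketch: ignored tables must be schema-qualified, and a change in the
// configured set relative to what the replica was built with forces a resync.
function parseIgnoredTables(names: readonly string[]): Set<string> {
  for (const name of names) {
    if (!name.includes('.')) {
      throw new Error(
        `Ignored table "${name}" must be schema-qualified, e.g. "public.audit_logs"`,
      );
    }
  }
  return new Set(names);
}

function setsEqual(a: ReadonlySet<string>, b: ReadonlySet<string>): boolean {
  return a.size === b.size && [...a].every(v => b.has(v));
}

// Hypothetical startup check, following the same pattern as publication changes.
function checkIgnoredTablesUnchanged(
  requested: ReadonlySet<string>,
  replicated: ReadonlySet<string>,
  resetReplica: (reason: string) => never,
): void {
  if (!setsEqual(requested, replicated)) {
    resetReplica('Dropping shard to change ignored tables');
  }
}
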
Added three focused integration tests to verify ignored tables behavior:
• Ignored tables excluded from initial sync - verifies tables are created but remain empty
• Changes to ignored tables are dropped - confirms replication filters out changes
• AutoReset on changed ignored tables - ensures resync when config changes

Tests verify the minimal implementation approach where filtering happens at the application level rather than modifying SQL queries.
Successfully added three comprehensive integration tests that verify:

• Ignored tables are created but remain empty during initial sync
• Changes to ignored tables are properly dropped during replication
• Configuration changes to ignored tables trigger a full resync

All tests pass on PostgreSQL 16 using testcontainers. The tests confirm
that the minimal implementation correctly filters data at the application
level while maintaining table schema for consistency.
• Refactored startReplication to use optional object parameter
• Updated all test calls to use { ignoredTables: [...] } pattern
• Added ignoredTables to all ShardConfig objects in initial-sync tests

test: add comprehensive initial-sync tests for ignored tables

Added two targeted tests directly in initial-sync.pg-test.ts:

• 'ignored tables are created but not synced' - verifies that:
  - Regular tables get their data synced during initial sync
  - Ignored tables are created but remain empty
  - Ignored tables configuration is persisted in shardConfig

• 'multiple ignored tables' - verifies that:
  - Multiple tables can be ignored simultaneously
  - Only non-ignored tables receive data during sync
  - All tables (ignored and regular) have their schema created

These tests verify the behavior at the initialSync function level,
complementing the integration tests in change-source.pg-test.ts.
Implements comprehensive support for ignoring specific tables from PostgreSQL publication-based replication.

Key features:
• Tables defined in IGNORED_PUBLICATION_TABLES env var are created but data is skipped
• Changes to ignored tables are dropped during replication
• Changing the ignored tables list triggers automatic full resync
• Named arguments pattern for better API ergonomics

Tests added:
• Ignored tables excluded from initial sync
• Changes to ignored tables are dropped during replication
• AutoReset triggered when ignored tables list changes
• Multiple ignored tables handled correctly
• Integration tests in both change-source and initial-sync

Implementation follows application-level filtering approach for compatibility
with existing publication infrastructure.
Added tests for complex scenarios involving ignored tables and publications:
• Ignored table with row filter in publication - ensures ignored takes precedence
• Ignored table in multiple publications - verifies table stays empty regardless
• Exact table name matching - confirms no partial matching (test_table vs test_table_2)
• Schema qualification - tests ignoring tables in specific schemas only

All tests verify that ignored tables are created but remain empty, even when:
- Row filters would normally allow some data through
- Multiple publications reference the same table
- Table names are similar but not exact matches
- Tables with same name exist in different schemas
@Karavil Karavil marked this pull request as ready for review August 13, 2025 14:31
Made ignoredTables optional in ShardConfig to avoid breaking existing code:
• Changed ShardConfig type to make ignoredTables optional
• Updated all usages to handle undefined with fallback to empty array/set
• Fixed PostgreSQL ARRAY[] type casting with explicit ::TEXT[] cast
• Fixed duplicate import in init.ts
• Added missing ignoredPublicationTables to pusher.test.ts config

All type checks and tests now pass successfully.
Re-enabled the change-source/pg test suite that was previously skipped.
All tests pass including the new ignored tables functionality tests.
Updated documentation in multiple places:
• Added ZERO_APP_IGNORED_PUBLICATION_TABLES to zbugs/.env.example
• Enhanced zero-config.ts description with env var name and resync note
• Created Configuration section in zero-cache README with detailed usage

Documentation covers:
• Environment variable format (JSON array)
• Requirement for fully qualified table names
• Behavior (tables created but empty)
• Use cases (audit logs, temp data, analytics)
• Important note about full resync on changes
Added ZERO_APP_IGNORED_PUBLICATION_TABLES environment variable wherever
ZERO_APP_PUBLICATIONS is referenced:
• GitHub Actions workflows (prod, sandbox, gigabugs)
• SST config for deployment
• zbugs .env.example file
• Simplified config description to be concise
Restored comprehensive documentation in zero-config.ts including:
• Clear format and examples
• Multiple use cases (audit logs, temp data, analytics, etc.)
• Important notes about schema qualification and resync behavior

This documentation helps users understand how to effectively use the feature.
The ignoredTables field should remain optional in the ShardConfig when
ignoredPublicationTables is not provided in the config. This ensures
backward compatibility and cleaner type handling.

Using conditional spread operator to only include ignoredTables when
ignoredPublicationTables is present in the config.
This repository uses npm, not bun. The bun.lock file was accidentally added.
The migration intentionally leaves ignoredTables empty to trigger a resync.
This ensures the SQLite replica is completely rebuilt without stale data
from newly-ignored tables.
Fixed migration rocicorp#11 to use sql() wrapper for proper identifier quoting.
Also updated test expectations to match new schema version 11.
const SHARD_NUM = 1;

- describe.skip('change-source/pg', {timeout: 30000, retry: 3}, () => {
+ describe('change-source/pg', {timeout: 30000, retry: 3}, () => {
@Karavil (author) commented on this change:

ah, whoops? not sure why this was being skipped... I can undo it? but it seemed useful to run these tests.

@aboodman

Curious about this design decision:

Tables in IGNORED_PUBLICATION_TABLES get created in SQLite but stay empty - all changes are dropped.

Why create the table in SQLite?

@aboodman

Thank you very much for the contribution. Exciting!

@darkgnotic is the best person to review this, but he is on vacation right now. That is why this hasn't been reviewed so far.

From a product pov can you explain a bit more why the publication approach is hard to use? I believe you that it is, I'm just curious...

Is it because you don't want to have to remember to update the PG publication when you add a table? Why is it not possible to maintain the publication as part of the rest of your schema management?


Karavil commented Aug 14, 2025

Curious about this design decision:

Tables in IGNORED_PUBLICATION_TABLES get created in SQLite but stay empty - all changes are dropped.

Why create the table in SQLite?

I was mostly worried about:

  1. Client schema is defined; table A exists
  2. Zero deploys with table A added to the ignore list; the table no longer exists in SQLite
  3. Client breaks?

Not super committed to this behavior though—can make changes!

@aboodman


Karavil commented Aug 14, 2025

Thank you very much for the contribution. Exciting!

@darkgnotic is the best person to review this, but he is on vacation right now. That is why this hasn't been reviewed so far.

From a product pov can you explain a bit more why the publication approach is hard to use? I believe you that it is, I'm just curious...

Is it because you don't want to have to remember to update the PG publication when you add a table? Why is it not possible to maintain the publication as part of the rest of your schema management?

Honestly, it's just more to think about! We have to use custom publications right now because of the sheer amount of data we have on file (initial syncs take ages if we don't do this). And I like that Zero is simple! I'd rather not create and maintain a publication for a few tables I want to ignore.

I do agree that a better abstraction around generating publications could work for this (we use Drizzle Zero; it would plug in nicely there). But it's just another point of friction for experimenting with Zero. Now I have to migrate my database before I can deploy Zero!

@aboodman


Karavil commented Aug 14, 2025

I could also see this playing nicely with your cloud offering in the future? It'd be pretty much plug and play:

-> Create a publication for all tables
-> Select tables to ignore (don't worry, you can edit this later!)
-> Zero instance ready

Replaced 'as string[]' type casts with proper ShardConfig type annotations
in test files. This follows TypeScript best practices and improves type safety.

Karavil commented Aug 14, 2025

The naming here is also a bit weird. Should it be ZERO_APP_IGNORED_TABLES instead of ZERO_APP_IGNORED_PUBLICATION_TABLES? Naming it that way (without the Postgres context) would imply that this has to be supported for all databases in the future, so I was wary of it. Happy to change it though.

@darkgnotic

Hi @Karavil. Thank you for a great proposal and well thought out implementation. I appreciate (and agree with) the design decisions you detailed in the PR description.

I also agree that this would be a useful feature to provide in a cloud offering, and went through the exercise of what that might look like at a high level, were we to implement it.

At the end of the day, the problem boils down to a deficiency in the Postgres API (e.g. for CREATE PUBLICATION). However, Postgres does provide a way to achieve this: you can create an EVENT TRIGGER on the CREATE TABLE event, and add the table to your publication (when desired) in the triggered function.

The advantage of implementing it at the Postgres layer is that:

  • There would be less code and logic to maintain. It would be a matter of adding a table to a publication, rather than intercepting multiple replication points / commands.
  • It would be more efficient, by avoiding the bandwidth and serialization cost of the data of ignored tables (which, as you point out, can be large).

Would you be up for trying this approach? And if you're willing, we'd love to figure out how to make it available for other users, whether it be through documented setup examples, a cli, or something that zero-cache does under the covers during the initial setup.
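
For readers unfamiliar with event triggers, here is a rough sketch of the Postgres-side setup described above, written as a small TypeScript install script; the publication name (zero_pub), the hard-coded ignore list, and the use of a postgres.js-style client are assumptions for illustration only, not part of this PR.

// Sketch: an event trigger that adds newly created tables to the publication
// unless they are on an ignore list, keeping all filtering inside Postgres.
import postgres from 'postgres';

async function installAutoPublishTrigger(dbUrl: string): Promise<void> {
  const sql = postgres(dbUrl);
  // Function that runs at ddl_command_end and adds the new table to the publication.
  await sql.unsafe(`
    CREATE OR REPLACE FUNCTION zero_add_tables_to_publication()
    RETURNS event_trigger LANGUAGE plpgsql AS $fn$
    DECLARE
      obj record;
    BEGIN
      FOR obj IN
        SELECT * FROM pg_event_trigger_ddl_commands()
        WHERE command_tag = 'CREATE TABLE'
      LOOP
        -- Skip tables that should never replicate, e.g. audit logs.
        IF obj.object_identity NOT IN ('public.audit_logs') THEN
          EXECUTE format('ALTER PUBLICATION zero_pub ADD TABLE %s', obj.object_identity);
        END IF;
      END LOOP;
    END;
    $fn$`);
  // Event trigger that fires the function whenever a table is created.
  await sql.unsafe(`
    CREATE EVENT TRIGGER zero_auto_publish
      ON ddl_command_end
      WHEN TAG IN ('CREATE TABLE')
      EXECUTE FUNCTION zero_add_tables_to_publication()`);
  await sql.end();
}

With something like this in place, zero-cache could keep consuming a publication that already excludes the unwanted tables, avoiding the bandwidth and serialization cost of their data entirely.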

@aboodman aboodman closed this Oct 10, 2025