fix(file-based-cdk): handle schema inference error for empty file #802

darynaishchenko · 2025-10-20T14:22:04Z

Mostly reverts changes made in airbytehq/airbyte#44438.

With the change above we lost ability to "ignore" empty files and not it throws a config error Could not infer schema as there are no rows in file. If having an empty CSV file is expected, ignore this. Else, please contact Airbyte.
Unlike previous behavior just logged these files as empty with the error message and not exit discover/sync process. If file is empty it's not added to the discover catalog during discover.

To migrate azure blob storage to latest cdk airbytehq/airbyte#68161 and keep previous behavior this pr add new exception type EmptyFileSchemaInferenceError to keep this error logged and ignored during sync/discover process.

Summary by CodeRabbit

Bug Fixes
- More precise handling and clearer messages for schema inference failures, including a distinct error reported for empty files to improve diagnostics.
Refactor
- Centralized and standardized schema inference error handling across streams for more consistent error reporting and propagation.

github-actions · 2025-10-20T14:22:20Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@daryna/file-based/fix-error-handling#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch daryna/file-based/fix-error-handling

Helpful Resources

CDK API Reference

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poetry-lock - Updates poetry.lock file
/test - Runs connector tests with the updated CDK
/poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
/poe <command> - Runs any poe command in the CDK environment

📝 Edit this welcome message.

darynaishchenko · 2025-10-20T14:22:37Z

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

coderabbitai · 2025-10-20T14:24:31Z

📝 Walkthrough

Walkthrough

Adds EmptyFileSchemaInferenceError, updates CSV parser to raise it when no rows are inferred, and centralizes schema-inference error construction/propagation in DefaultFileBasedStream via a new _raise_schema_inference_error helper used by get_json_schema and _infer_file_schema.

Changes

Cohort / File(s)	Summary
New Exception Definition `airbyte_cdk/sources/file_based/exceptions.py`	Added `EmptyFileSchemaInferenceError` inheriting from `AirbyteTracedException`.
CSV Parser Update `airbyte_cdk/sources/file_based/file_types/csv_parser.py`	Imported `EmptyFileSchemaInferenceError` and changed `infer_schema` to raise it when no rows are inferred (replacing a generic `AirbyteTracedException`).
Centralized Error Handling `airbyte_cdk/sources/file_based/stream/default_file_based_stream.py`	Added `_raise_schema_inference_error(self, exc: Exception, file: Optional[RemoteFile] = None) -> NoReturn` and updated `get_json_schema` and `_infer_file_schema` to route `EmptyFileSchemaInferenceError` and other exceptions through this helper, producing `SchemaInferenceError` with file/format/stream context.

Sequence Diagram

sequenceDiagram
    participant CSV as CSV Parser
    participant Stream as DefaultFileBasedStream
    participant Handler as _raise_schema_inference_error()
    participant Client

    CSV->>CSV: infer_schema()
    alt no rows inferred
        CSV-->>Stream: raises EmptyFileSchemaInferenceError
    else other exception
        CSV-->>Stream: raises Exception
    end

    Stream->>Handler: forward exception + file/format/stream context
    Handler->>Client: raise SchemaInferenceError (with context)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

airbyte-python-cdk

Suggested reviewers

bnchrch
lmossman

Would you like me to suggest applying the same centralized error-handling pattern to other file parsers (JSON, Parquet) for consistency, wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The PR title "fix(file-based-cdk): handle schema inference error for empty file" directly and clearly summarizes the main objective of the changeset. The changes introduce a new exception type (`EmptyFileSchemaInferenceError`), update the CSV parser to use this specific exception, and add centralized error handling in the default file-based stream class—all of which align precisely with handling schema inference errors for empty files. The title is specific, concise, and accurately conveys the primary change without vagueness or generic language.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch daryna/file-based/fix-error-handling

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fae6e3 and 8c20651.

📒 Files selected for processing (1)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (4 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

airbyte_cdk/sources/file_based/exceptions.py (3)

EmptyFileSchemaInferenceError (162-163)

SchemaInferenceError (94-95)

FileBasedSourceError (12-38)

airbyte_cdk/sources/file_based/file_types/csv_parser.py (1)

infer_schema (174-220)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)

GitHub Check: Check: source-intercom
GitHub Check: Check: source-google-drive
GitHub Check: Check: source-pokeapi
GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: source-shopify
GitHub Check: Check: destination-motherduck
GitHub Check: Pytest (Fast)
GitHub Check: Manifest Server Docker Image Build
GitHub Check: SDM Docker Image Build
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)

🔇 Additional comments (4)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (4)

12-31: LGTM! Import organization looks clean.

The typing imports are properly sorted alphabetically, and the new EmptyFileSchemaInferenceError import is exactly what's needed for the enhanced error handling. Nice work addressing the previous review feedback!

246-269: Nice centralization of error handling!

The refactoring to route exceptions through _raise_schema_inference_error creates a consistent error handling pattern. I have one clarifying question though:

Since EmptyFileSchemaInferenceError is already an AirbyteTracedException with a user-friendly message (as shown in the CSV parser), does wrapping it in SchemaInferenceError add meaningful context? I see it adds the format and stream name, which the original might not have. Just wanted to confirm this is intentional and adds value, wdyt?

Also, the # type: ignore on line 246 is necessary because mypy can't prove all paths return when NoReturn functions are called. If you wanted to be more specific, you could use # type: ignore[return] instead, though the current approach is fine too.

396-406: Excellent! File-level error handling mirrors the stream-level pattern.

I really like how this method follows the same exception handling pattern as get_json_schema, but appropriately passes the file parameter to provide file-specific context in error messages. The exception hierarchy is respected (catching the specific EmptyFileSchemaInferenceError before the general AirbyteTracedException), and the chaining via from exc preserves the full error context.

The # type: ignore is justified here for the same reason as in get_json_schema.

408-416: Great design choice to centralize error construction!

This helper method is a nice refactor that:

Ensures consistent SchemaInferenceError construction across the codebase

Properly chains the original exception via from exc for full debugging context

Safely handles the optional file parameter

Uses the correct -> NoReturn type annotation to signal it always raises

The centralization makes it easy to maintain and evolve error messaging in one place. Well done!

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

airbyte_cdk/sources/file_based/exceptions.py (1)
162-163: Consider adding a docstring to clarify the exception's purpose.

The new EmptyFileSchemaInferenceError exception class would benefit from a brief docstring explaining when it should be raised and how it differs from the generic SchemaInferenceError. This would help maintainers understand when to use this specific exception type, wdyt?

For example:
 class EmptyFileSchemaInferenceError(AirbyteTracedException):
+    """
+    Raised during schema inference when a file contains no data rows.
+    This specialized exception allows for more targeted error handling of empty files.
+    """
     pass
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
250-255: Is the separate catch for EmptyFileSchemaInferenceError necessary?

Both the EmptyFileSchemaInferenceError catch (line 250-251) and the generic Exception catch (line 254-255) call the same _raise_schema_inference_error helper. Since EmptyFileSchemaInferenceError is an Exception, the generic handler would catch it anyway.

Would it be cleaner to remove the specific EmptyFileSchemaInferenceError handler and let it fall through to the generic exception handler, or is there a plan to handle it differently in the future? wdyt?

If you'd like to keep the explicit handling for clarity or future differentiation, that's fine too! Otherwise, consider:
         except InvalidSchemaError as config_exception:
             raise AirbyteTracedException(
                 internal_message="Please check the logged errors for more information.",
                 message=FileBasedSourceError.SCHEMA_INFERENCE_ERROR.value,
                 exception=AirbyteTracedException(exception=config_exception),
                 failure_type=FailureType.config_error,
             )
-        except EmptyFileSchemaInferenceError as exc:
-            self._raise_schema_inference_error(exc)
         except AirbyteTracedException as ate:
             raise ate
         except Exception as exc:
             self._raise_schema_inference_error(exc)
389-394: Same question about redundant exception handling.

Similar to the pattern in get_json_schema, both exception handlers call the same helper method. Would it simplify the code to remove the specific EmptyFileSchemaInferenceError handler? wdyt?
         try:
             return await self.get_parser().infer_schema(
                 self.config, file, self.stream_reader, self.logger
             )
-        except EmptyFileSchemaInferenceError as exc:
-            self._raise_schema_inference_error(exc, file)
         except AirbyteTracedException as ate:
             raise ate
         except Exception as exc:
             self._raise_schema_inference_error(exc, file)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 041c201 and e00e10c.

📒 Files selected for processing (3)

airbyte_cdk/sources/file_based/exceptions.py (1 hunks)
airbyte_cdk/sources/file_based/file_types/csv_parser.py (2 hunks)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3 hunks)

🧰 Additional context used

🧬 Code graph analysis (3)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

airbyte_cdk/sources/file_based/exceptions.py (3)

EmptyFileSchemaInferenceError (162-163)

SchemaInferenceError (94-95)

FileBasedSourceError (12-38)

airbyte_cdk/sources/file_based/remote_file.py (1)

RemoteFile (11-18)

airbyte_cdk/sources/file_based/file_types/csv_parser.py (1)

airbyte_cdk/sources/file_based/exceptions.py (3)

EmptyFileSchemaInferenceError (162-163)

FileBasedSourceError (12-38)

RecordParseError (90-91)

airbyte_cdk/sources/file_based/exceptions.py (1)

airbyte_cdk/utils/traced_exception.py (1)

AirbyteTracedException (25-145)

🪛 GitHub Actions: Linters

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

[error] 234-234: Mypy: Missing return statement [return].

[error] 384-384: Mypy: Missing return statement [return].

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: Check: source-google-drive
GitHub Check: Check: source-intercom
GitHub Check: Check: source-hardcoded-records
GitHub Check: Check: source-shopify
GitHub Check: Check: destination-motherduck
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (Fast)
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: SDM Docker Image Build
GitHub Check: Manifest Server Docker Image Build

🔇 Additional comments (3)

airbyte_cdk/sources/file_based/file_types/csv_parser.py (2)

25-29: LGTM! Clean import organization.

The addition of EmptyFileSchemaInferenceError to the imports follows the existing pattern and maintains alphabetical ordering.

209-214: Excellent specialization of the exception type!

Changing from a generic AirbyteTracedException to the more specific EmptyFileSchemaInferenceError enables better error handling downstream while preserving the same user-facing message and failure type. This makes the error semantics clearer and allows the caller to distinguish empty file scenarios from other schema inference failures.

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1)

19-19: LGTM!

The import of EmptyFileSchemaInferenceError is correctly positioned in the existing import block.

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

github-actions · 2025-10-20T14:31:09Z

PyTest Results (Fast)

3 802 tests ±0 3 790 ✅ ±0 6m 31s ⏱️ +9s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 8c20651. ± Comparison against base commit 041c201.

♻️ This comment has been updated with latest results.

darynaishchenko · 2025-10-20T14:47:33Z

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)
246-246: Narrow blanket # type: ignore to the specific code Mypy is complaining about.

To avoid muting unrelated errors, could we switch these to # type: ignore[override], wdyt?
-    def get_json_schema(self) -> JsonSchema:  # type: ignore
+    def get_json_schema(self) -> JsonSchema:  # type: ignore[override]
-    async def _infer_file_schema(self, file: RemoteFile) -> SchemaType:  # type: ignore
+    async def _infer_file_schema(self, file: RemoteFile) -> SchemaType:  # type: ignore[override]
Also applies to: 396-396

264-266: Use bare raise to preserve the original traceback.

raise ate can alter context; raise keeps the original traceback intact. Shall we switch to the idiomatic form, wdyt?
-        except AirbyteTracedException as ate:
-            raise ate
+        except AirbyteTracedException:
+            raise
-        except AirbyteTracedException as ate:
-            raise ate
+        except AirbyteTracedException:
+            raise
Also applies to: 403-404

408-416: Verification confirms the suggestion is valid. Add details=str(exc) to surface the root cause.

The BaseFileBasedSourceError.__init__ accepts any **kwargs and formats them as key=value pairs appended to the error message. Your suggestion to include details=str(exc) would work seamlessly and follow the existing pattern already used for file, format, and stream parameters.

This improves user-facing context by including the root exception while maintaining the exception chain. Shall we apply this refactoring, wdyt?
     raise SchemaInferenceError(
         FileBasedSourceError.SCHEMA_INFERENCE_ERROR,
         file=file.uri if file else None,
         format=str(self.config.format) if self.config.format else None,
         stream=self.name,
+        details=str(exc),
     ) from exc

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 699b40b and 5fae6e3.

📒 Files selected for processing (1)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (4 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)

airbyte_cdk/sources/file_based/exceptions.py (3)

EmptyFileSchemaInferenceError (162-163)

SchemaInferenceError (94-95)

FileBasedSourceError (12-38)

airbyte_cdk/sources/file_based/remote_file.py (1)

RemoteFile (11-18)

airbyte_cdk/sources/file_based/file_types/csv_parser.py (1)

infer_schema (174-220)

🪛 GitHub Actions: Linters

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

[error] 5-5: I001 Import block is unsorted or un-formatted. Found 1 error. This can be fixed with 'ruff --fix'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)

GitHub Check: Check: source-intercom
GitHub Check: Check: destination-motherduck
GitHub Check: Check: source-pokeapi
GitHub Check: Check: source-shopify
GitHub Check: Check: source-google-drive
GitHub Check: Check: source-hardcoded-records
GitHub Check: Pytest (All, Python 3.10, Ubuntu)
GitHub Check: Pytest (All, Python 3.12, Ubuntu)
GitHub Check: Pytest (All, Python 3.11, Ubuntu)
GitHub Check: Pytest (All, Python 3.13, Ubuntu)
GitHub Check: SDM Docker Image Build
GitHub Check: Manifest Server Docker Image Build
GitHub Check: Pytest (Fast)
GitHub Check: Analyze (python)

🔇 Additional comments (2)

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

29-38: Importing EmptyFileSchemaInferenceError looks right.

Catching the specific EmptyFileSchemaInferenceError before AirbyteTracedException ensures it’s handled by the centralized helper as intended. Nice touch, wdyt?

262-268: ✓ Test coverage for empty-file scenario already exists—no action needed.

The centralized error handling through _raise_schema_inference_error looks solid. Good news: test coverage for the empty-file path is already in place. Multiple scenarios across csv_scenarios.py (lines 1151, 1255, 1827, 2657), jsonl_scenarios.py (line 497), and user_input_schema_scenarios.py (lines 151, 194) verify that discovery properly yields a SCHEMA_INFERENCE_ERROR when files are empty or unparseable. The CSV parser also correctly raises EmptyFileSchemaInferenceError when no rows are detected. The implementation and tests align well.

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

github-actions · 2025-10-20T15:00:52Z

PyTest Results (Full)

3 805 tests ±0 3 793 ✅ ±0 11m 16s ⏱️ +6s
1 suites ±0 12 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 8c20651. ± Comparison against base commit 041c201.

maxi297

LGTM!

handle schema inference error for empty file

806cdff

darynaishchenko changed the title ~~handle schema inference error for empty file~~ fix(file-based-cdk) handle schema inference error for empty file Oct 20, 2025

Auto-fix lint and format issues

e00e10c

coderabbitai bot requested changes Oct 20, 2025

View reviewed changes

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py Show resolved Hide resolved

darynaishchenko added 2 commits October 20, 2025 17:34

fix mypy

699b40b

updated typing

5fae6e3

darynaishchenko changed the title ~~fix(file-based-cdk) handle schema inference error for empty file~~ fix(file-based-cdk): handle schema inference error for empty file Oct 20, 2025

Auto-fix lint and format issues

8c20651

github-actions bot added the bug Something isn't working label Oct 20, 2025

coderabbitai bot requested changes Oct 20, 2025

View reviewed changes

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py Show resolved Hide resolved

coderabbitai bot approved these changes Oct 20, 2025

View reviewed changes

darynaishchenko requested a review from maxi297 October 20, 2025 14:59

maxi297 approved these changes Oct 20, 2025

View reviewed changes

darynaishchenko merged commit 73ec055 into main Oct 20, 2025
29 of 31 checks passed

darynaishchenko deleted the daryna/file-based/fix-error-handling branch October 20, 2025 15:43

fix(file-based-cdk): handle schema inference error for empty file #802

fix(file-based-cdk): handle schema inference error for empty file #802

Uh oh!

Conversation

darynaishchenko commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

github-actions bot commented Oct 20, 2025

👋 Greetings, Airbyte Team Member!

Testing This CDK Version

Helpful Resources

PR Slash Commands

Uh oh!

darynaishchenko commented Oct 20, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PyTest Results (Fast)

Uh oh!

darynaishchenko commented Oct 20, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Oct 20, 2025

PyTest Results (Full)

Uh oh!

maxi297 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

darynaishchenko commented Oct 20, 2025 •

edited

Loading

darynaishchenko commented Oct 20, 2025 •

edited by github-actions bot

Loading

coderabbitai bot commented Oct 20, 2025 •

edited

Loading

github-actions bot commented Oct 20, 2025 •

edited

Loading

darynaishchenko commented Oct 20, 2025 •

edited by github-actions bot

Loading