Skip to content

Conversation

@darynaishchenko
Copy link
Contributor

@darynaishchenko darynaishchenko commented Oct 20, 2025

Mostly reverts changes made in airbytehq/airbyte#44438.

With the change above we lost ability to "ignore" empty files and not it throws a config error Could not infer schema as there are no rows in file. If having an empty CSV file is expected, ignore this. Else, please contact Airbyte.
Unlike previous behavior just logged these files as empty with the error message and not exit discover/sync process. If file is empty it's not added to the discover catalog during discover.

To migrate azure blob storage to latest cdk airbytehq/airbyte#68161 and keep previous behavior this pr add new exception type EmptyFileSchemaInferenceError to keep this error logged and ignored during sync/discover process.

Summary by CodeRabbit

  • Bug Fixes

    • More precise handling and clearer messages for schema inference failures, including a distinct error reported for empty files to improve diagnostics.
  • Refactor

    • Centralized and standardized schema inference error handling across streams for more consistent error reporting and propagation.

@darynaishchenko darynaishchenko changed the title handle schema inference error for empty file fix(file-based-cdk) handle schema inference error for empty file Oct 20, 2025
@github-actions
Copy link

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@daryna/file-based/fix-error-handling#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch daryna/file-based/fix-error-handling

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

📝 Edit this welcome message.

@darynaishchenko
Copy link
Contributor Author

darynaishchenko commented Oct 20, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 20, 2025

📝 Walkthrough

Walkthrough

Adds EmptyFileSchemaInferenceError, updates CSV parser to raise it when no rows are inferred, and centralizes schema-inference error construction/propagation in DefaultFileBasedStream via a new _raise_schema_inference_error helper used by get_json_schema and _infer_file_schema.

Changes

Cohort / File(s) Summary
New Exception Definition
airbyte_cdk/sources/file_based/exceptions.py
Added EmptyFileSchemaInferenceError inheriting from AirbyteTracedException.
CSV Parser Update
airbyte_cdk/sources/file_based/file_types/csv_parser.py
Imported EmptyFileSchemaInferenceError and changed infer_schema to raise it when no rows are inferred (replacing a generic AirbyteTracedException).
Centralized Error Handling
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
Added _raise_schema_inference_error(self, exc: Exception, file: Optional[RemoteFile] = None) -> NoReturn and updated get_json_schema and _infer_file_schema to route EmptyFileSchemaInferenceError and other exceptions through this helper, producing SchemaInferenceError with file/format/stream context.

Sequence Diagram

sequenceDiagram
    participant CSV as CSV Parser
    participant Stream as DefaultFileBasedStream
    participant Handler as _raise_schema_inference_error()
    participant Client

    CSV->>CSV: infer_schema()
    alt no rows inferred
        CSV-->>Stream: raises EmptyFileSchemaInferenceError
    else other exception
        CSV-->>Stream: raises Exception
    end

    Stream->>Handler: forward exception + file/format/stream context
    Handler->>Client: raise SchemaInferenceError (with context)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

airbyte-python-cdk

Suggested reviewers

  • bnchrch
  • lmossman

Would you like me to suggest applying the same centralized error-handling pattern to other file parsers (JSON, Parquet) for consistency, wdyt?

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The PR title "fix(file-based-cdk): handle schema inference error for empty file" directly and clearly summarizes the main objective of the changeset. The changes introduce a new exception type (EmptyFileSchemaInferenceError), update the CSV parser to use this specific exception, and add centralized error handling in the default file-based stream class—all of which align precisely with handling schema inference errors for empty files. The title is specific, concise, and accurately conveys the primary change without vagueness or generic language.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch daryna/file-based/fix-error-handling

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fae6e3 and 8c20651.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
airbyte_cdk/sources/file_based/exceptions.py (3)
  • EmptyFileSchemaInferenceError (162-163)
  • SchemaInferenceError (94-95)
  • FileBasedSourceError (12-38)
airbyte_cdk/sources/file_based/file_types/csv_parser.py (1)
  • infer_schema (174-220)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
🔇 Additional comments (4)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (4)

12-31: LGTM! Import organization looks clean.

The typing imports are properly sorted alphabetically, and the new EmptyFileSchemaInferenceError import is exactly what's needed for the enhanced error handling. Nice work addressing the previous review feedback!


246-269: Nice centralization of error handling!

The refactoring to route exceptions through _raise_schema_inference_error creates a consistent error handling pattern. I have one clarifying question though:

Since EmptyFileSchemaInferenceError is already an AirbyteTracedException with a user-friendly message (as shown in the CSV parser), does wrapping it in SchemaInferenceError add meaningful context? I see it adds the format and stream name, which the original might not have. Just wanted to confirm this is intentional and adds value, wdyt?

Also, the # type: ignore on line 246 is necessary because mypy can't prove all paths return when NoReturn functions are called. If you wanted to be more specific, you could use # type: ignore[return] instead, though the current approach is fine too.


396-406: Excellent! File-level error handling mirrors the stream-level pattern.

I really like how this method follows the same exception handling pattern as get_json_schema, but appropriately passes the file parameter to provide file-specific context in error messages. The exception hierarchy is respected (catching the specific EmptyFileSchemaInferenceError before the general AirbyteTracedException), and the chaining via from exc preserves the full error context.

The # type: ignore is justified here for the same reason as in get_json_schema.


408-416: Great design choice to centralize error construction!

This helper method is a nice refactor that:

  • Ensures consistent SchemaInferenceError construction across the codebase
  • Properly chains the original exception via from exc for full debugging context
  • Safely handles the optional file parameter
  • Uses the correct -> NoReturn type annotation to signal it always raises

The centralization makes it easy to maintain and evolve error messaging in one place. Well done!


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/exceptions.py (1)

162-163: Consider adding a docstring to clarify the exception's purpose.

The new EmptyFileSchemaInferenceError exception class would benefit from a brief docstring explaining when it should be raised and how it differs from the generic SchemaInferenceError. This would help maintainers understand when to use this specific exception type, wdyt?

For example:

 class EmptyFileSchemaInferenceError(AirbyteTracedException):
+    """
+    Raised during schema inference when a file contains no data rows.
+    This specialized exception allows for more targeted error handling of empty files.
+    """
     pass
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

250-255: Is the separate catch for EmptyFileSchemaInferenceError necessary?

Both the EmptyFileSchemaInferenceError catch (line 250-251) and the generic Exception catch (line 254-255) call the same _raise_schema_inference_error helper. Since EmptyFileSchemaInferenceError is an Exception, the generic handler would catch it anyway.

Would it be cleaner to remove the specific EmptyFileSchemaInferenceError handler and let it fall through to the generic exception handler, or is there a plan to handle it differently in the future? wdyt?

If you'd like to keep the explicit handling for clarity or future differentiation, that's fine too! Otherwise, consider:

         except InvalidSchemaError as config_exception:
             raise AirbyteTracedException(
                 internal_message="Please check the logged errors for more information.",
                 message=FileBasedSourceError.SCHEMA_INFERENCE_ERROR.value,
                 exception=AirbyteTracedException(exception=config_exception),
                 failure_type=FailureType.config_error,
             )
-        except EmptyFileSchemaInferenceError as exc:
-            self._raise_schema_inference_error(exc)
         except AirbyteTracedException as ate:
             raise ate
         except Exception as exc:
             self._raise_schema_inference_error(exc)

389-394: Same question about redundant exception handling.

Similar to the pattern in get_json_schema, both exception handlers call the same helper method. Would it simplify the code to remove the specific EmptyFileSchemaInferenceError handler? wdyt?

         try:
             return await self.get_parser().infer_schema(
                 self.config, file, self.stream_reader, self.logger
             )
-        except EmptyFileSchemaInferenceError as exc:
-            self._raise_schema_inference_error(exc, file)
         except AirbyteTracedException as ate:
             raise ate
         except Exception as exc:
             self._raise_schema_inference_error(exc, file)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 041c201 and e00e10c.

📒 Files selected for processing (3)
  • airbyte_cdk/sources/file_based/exceptions.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_types/csv_parser.py (2 hunks)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
airbyte_cdk/sources/file_based/exceptions.py (3)
  • EmptyFileSchemaInferenceError (162-163)
  • SchemaInferenceError (94-95)
  • FileBasedSourceError (12-38)
airbyte_cdk/sources/file_based/remote_file.py (1)
  • RemoteFile (11-18)
airbyte_cdk/sources/file_based/file_types/csv_parser.py (1)
airbyte_cdk/sources/file_based/exceptions.py (3)
  • EmptyFileSchemaInferenceError (162-163)
  • FileBasedSourceError (12-38)
  • RecordParseError (90-91)
airbyte_cdk/sources/file_based/exceptions.py (1)
airbyte_cdk/utils/traced_exception.py (1)
  • AirbyteTracedException (25-145)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

[error] 234-234: Mypy: Missing return statement [return].


[error] 384-384: Mypy: Missing return statement [return].

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
🔇 Additional comments (3)
airbyte_cdk/sources/file_based/file_types/csv_parser.py (2)

25-29: LGTM! Clean import organization.

The addition of EmptyFileSchemaInferenceError to the imports follows the existing pattern and maintains alphabetical ordering.


209-214: Excellent specialization of the exception type!

Changing from a generic AirbyteTracedException to the more specific EmptyFileSchemaInferenceError enables better error handling downstream while preserving the same user-facing message and failure type. This makes the error semantics clearer and allows the caller to distinguish empty file scenarios from other schema inference failures.

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1)

19-19: LGTM!

The import of EmptyFileSchemaInferenceError is correctly positioned in the existing import block.

@github-actions
Copy link

github-actions bot commented Oct 20, 2025

PyTest Results (Fast)

3 802 tests  ±0   3 790 ✅ ±0   6m 31s ⏱️ +9s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 8c20651. ± Comparison against base commit 041c201.

♻️ This comment has been updated with latest results.

@darynaishchenko darynaishchenko changed the title fix(file-based-cdk) handle schema inference error for empty file fix(file-based-cdk): handle schema inference error for empty file Oct 20, 2025
@darynaishchenko
Copy link
Contributor Author

darynaishchenko commented Oct 20, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

@github-actions github-actions bot added the bug Something isn't working label Oct 20, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)

246-246: Narrow blanket # type: ignore to the specific code Mypy is complaining about.

To avoid muting unrelated errors, could we switch these to # type: ignore[override], wdyt?

-    def get_json_schema(self) -> JsonSchema:  # type: ignore
+    def get_json_schema(self) -> JsonSchema:  # type: ignore[override]
-    async def _infer_file_schema(self, file: RemoteFile) -> SchemaType:  # type: ignore
+    async def _infer_file_schema(self, file: RemoteFile) -> SchemaType:  # type: ignore[override]

Also applies to: 396-396


264-266: Use bare raise to preserve the original traceback.

raise ate can alter context; raise keeps the original traceback intact. Shall we switch to the idiomatic form, wdyt?

-        except AirbyteTracedException as ate:
-            raise ate
+        except AirbyteTracedException:
+            raise
-        except AirbyteTracedException as ate:
-            raise ate
+        except AirbyteTracedException:
+            raise

Also applies to: 403-404


408-416: Verification confirms the suggestion is valid. Add details=str(exc) to surface the root cause.

The BaseFileBasedSourceError.__init__ accepts any **kwargs and formats them as key=value pairs appended to the error message. Your suggestion to include details=str(exc) would work seamlessly and follow the existing pattern already used for file, format, and stream parameters.

This improves user-facing context by including the root exception while maintaining the exception chain. Shall we apply this refactoring, wdyt?

     raise SchemaInferenceError(
         FileBasedSourceError.SCHEMA_INFERENCE_ERROR,
         file=file.uri if file else None,
         format=str(self.config.format) if self.config.format else None,
         stream=self.name,
+        details=str(exc),
     ) from exc
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 699b40b and 5fae6e3.

📒 Files selected for processing (1)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)
airbyte_cdk/sources/file_based/exceptions.py (3)
  • EmptyFileSchemaInferenceError (162-163)
  • SchemaInferenceError (94-95)
  • FileBasedSourceError (12-38)
airbyte_cdk/sources/file_based/remote_file.py (1)
  • RemoteFile (11-18)
airbyte_cdk/sources/file_based/file_types/csv_parser.py (1)
  • infer_schema (174-220)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

[error] 5-5: I001 Import block is unsorted or un-formatted. Found 1 error. This can be fixed with 'ruff --fix'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: destination-motherduck
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.12, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.13, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Manifest Server Docker Image Build
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

29-38: Importing EmptyFileSchemaInferenceError looks right.

Catching the specific EmptyFileSchemaInferenceError before AirbyteTracedException ensures it’s handled by the centralized helper as intended. Nice touch, wdyt?


262-268: ✓ Test coverage for empty-file scenario already exists—no action needed.

The centralized error handling through _raise_schema_inference_error looks solid. Good news: test coverage for the empty-file path is already in place. Multiple scenarios across csv_scenarios.py (lines 1151, 1255, 1827, 2657), jsonl_scenarios.py (line 497), and user_input_schema_scenarios.py (lines 151, 194) verify that discovery properly yields a SCHEMA_INFERENCE_ERROR when files are empty or unparseable. The CSV parser also correctly raises EmptyFileSchemaInferenceError when no rows are detected. The implementation and tests align well.

@github-actions
Copy link

PyTest Results (Full)

3 805 tests  ±0   3 793 ✅ ±0   11m 16s ⏱️ +6s
    1 suites ±0      12 💤 ±0 
    1 files   ±0       0 ❌ ±0 

Results for commit 8c20651. ± Comparison against base commit 041c201.

Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@darynaishchenko darynaishchenko merged commit 73ec055 into main Oct 20, 2025
29 of 31 checks passed
@darynaishchenko darynaishchenko deleted the daryna/file-based/fix-error-handling branch October 20, 2025 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants