- 
                Notifications
    You must be signed in to change notification settings 
- Fork 30
fix(file-based-cdk): handle schema inference error for empty file #802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. Testing This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@daryna/file-based/fix-error-handling#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch daryna/file-based/fix-error-handlingHelpful ResourcesPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR: 
 | 
| /autofix 
 
 
 | 
| 📝 WalkthroughWalkthroughAdds  Changes
 Sequence DiagramsequenceDiagram
    participant CSV as CSV Parser
    participant Stream as DefaultFileBasedStream
    participant Handler as _raise_schema_inference_error()
    participant Client
    CSV->>CSV: infer_schema()
    alt no rows inferred
        CSV-->>Stream: raises EmptyFileSchemaInferenceError
    else other exception
        CSV-->>Stream: raises Exception
    end
    Stream->>Handler: forward exception + file/format/stream context
    Handler->>Client: raise SchemaInferenceError (with context)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
 Suggested reviewers
 Would you like me to suggest applying the same centralized error-handling pattern to other file parsers (JSON, Parquet) for consistency, wdyt? Pre-merge checks and finishing touches❌ Failed checks (1 warning)
 ✅ Passed checks (2 passed)
 ✨ Finishing touches
 🧪 Generate unit tests (beta)
 📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
 🧰 Additional context used🧬 Code graph analysis (1)airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
 ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
 🔇 Additional comments (4)
 Comment  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/exceptions.py (1)
162-163: Consider adding a docstring to clarify the exception's purpose.The new
EmptyFileSchemaInferenceErrorexception class would benefit from a brief docstring explaining when it should be raised and how it differs from the genericSchemaInferenceError. This would help maintainers understand when to use this specific exception type, wdyt?For example:
class EmptyFileSchemaInferenceError(AirbyteTracedException): + """ + Raised during schema inference when a file contains no data rows. + This specialized exception allows for more targeted error handling of empty files. + """ passairbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
250-255: Is the separate catch for EmptyFileSchemaInferenceError necessary?Both the
EmptyFileSchemaInferenceErrorcatch (line 250-251) and the genericExceptioncatch (line 254-255) call the same_raise_schema_inference_errorhelper. SinceEmptyFileSchemaInferenceErroris anException, the generic handler would catch it anyway.Would it be cleaner to remove the specific
EmptyFileSchemaInferenceErrorhandler and let it fall through to the generic exception handler, or is there a plan to handle it differently in the future? wdyt?If you'd like to keep the explicit handling for clarity or future differentiation, that's fine too! Otherwise, consider:
except InvalidSchemaError as config_exception: raise AirbyteTracedException( internal_message="Please check the logged errors for more information.", message=FileBasedSourceError.SCHEMA_INFERENCE_ERROR.value, exception=AirbyteTracedException(exception=config_exception), failure_type=FailureType.config_error, ) - except EmptyFileSchemaInferenceError as exc: - self._raise_schema_inference_error(exc) except AirbyteTracedException as ate: raise ate except Exception as exc: self._raise_schema_inference_error(exc)
389-394: Same question about redundant exception handling.Similar to the pattern in
get_json_schema, both exception handlers call the same helper method. Would it simplify the code to remove the specificEmptyFileSchemaInferenceErrorhandler? wdyt?try: return await self.get_parser().infer_schema( self.config, file, self.stream_reader, self.logger ) - except EmptyFileSchemaInferenceError as exc: - self._raise_schema_inference_error(exc, file) except AirbyteTracedException as ate: raise ate except Exception as exc: self._raise_schema_inference_error(exc, file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- airbyte_cdk/sources/file_based/exceptions.py(1 hunks)
- airbyte_cdk/sources/file_based/file_types/csv_parser.py(2 hunks)
- airbyte_cdk/sources/file_based/stream/default_file_based_stream.py(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
airbyte_cdk/sources/file_based/exceptions.py (3)
EmptyFileSchemaInferenceError(162-163)
SchemaInferenceError(94-95)
FileBasedSourceError(12-38)airbyte_cdk/sources/file_based/remote_file.py (1)
RemoteFile(11-18)
airbyte_cdk/sources/file_based/file_types/csv_parser.py (1)
airbyte_cdk/sources/file_based/exceptions.py (3)
EmptyFileSchemaInferenceError(162-163)
FileBasedSourceError(12-38)
RecordParseError(90-91)
airbyte_cdk/sources/file_based/exceptions.py (1)
airbyte_cdk/utils/traced_exception.py (1)
AirbyteTracedException(25-145)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
[error] 234-234: Mypy: Missing return statement [return].
[error] 384-384: Mypy: Missing return statement [return].
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: Check: source-google-drive
- GitHub Check: Check: source-intercom
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Check: source-shopify
- GitHub Check: Check: destination-motherduck
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Manifest Server Docker Image Build
🔇 Additional comments (3)
airbyte_cdk/sources/file_based/file_types/csv_parser.py (2)
25-29: LGTM! Clean import organization.The addition of
EmptyFileSchemaInferenceErrorto the imports follows the existing pattern and maintains alphabetical ordering.
209-214: Excellent specialization of the exception type!Changing from a generic
AirbyteTracedExceptionto the more specificEmptyFileSchemaInferenceErrorenables better error handling downstream while preserving the same user-facing message and failure type. This makes the error semantics clearer and allows the caller to distinguish empty file scenarios from other schema inference failures.airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1)
19-19: LGTM!The import of
EmptyFileSchemaInferenceErroris correctly positioned in the existing import block.
| /autofix 
 
 
 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)
246-246: Narrow blanket# type: ignoreto the specific code Mypy is complaining about.To avoid muting unrelated errors, could we switch these to
# type: ignore[override], wdyt?- def get_json_schema(self) -> JsonSchema: # type: ignore + def get_json_schema(self) -> JsonSchema: # type: ignore[override]- async def _infer_file_schema(self, file: RemoteFile) -> SchemaType: # type: ignore + async def _infer_file_schema(self, file: RemoteFile) -> SchemaType: # type: ignore[override]Also applies to: 396-396
264-266: Use bareraiseto preserve the original traceback.
raise atecan alter context;raisekeeps the original traceback intact. Shall we switch to the idiomatic form, wdyt?- except AirbyteTracedException as ate: - raise ate + except AirbyteTracedException: + raise- except AirbyteTracedException as ate: - raise ate + except AirbyteTracedException: + raiseAlso applies to: 403-404
408-416: Verification confirms the suggestion is valid. Adddetails=str(exc)to surface the root cause.The
BaseFileBasedSourceError.__init__accepts any**kwargsand formats them askey=valuepairs appended to the error message. Your suggestion to includedetails=str(exc)would work seamlessly and follow the existing pattern already used forfile,format, andstreamparameters.This improves user-facing context by including the root exception while maintaining the exception chain. Shall we apply this refactoring, wdyt?
raise SchemaInferenceError( FileBasedSourceError.SCHEMA_INFERENCE_ERROR, file=file.uri if file else None, format=str(self.config.format) if self.config.format else None, stream=self.name, + details=str(exc), ) from exc
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- airbyte_cdk/sources/file_based/stream/default_file_based_stream.py(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (3)
airbyte_cdk/sources/file_based/exceptions.py (3)
EmptyFileSchemaInferenceError(162-163)
SchemaInferenceError(94-95)
FileBasedSourceError(12-38)airbyte_cdk/sources/file_based/remote_file.py (1)
RemoteFile(11-18)airbyte_cdk/sources/file_based/file_types/csv_parser.py (1)
infer_schema(174-220)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py
[error] 5-5: I001 Import block is unsorted or un-formatted. Found 1 error. This can be fixed with 'ruff --fix'.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (14)
- GitHub Check: Check: source-intercom
- GitHub Check: Check: destination-motherduck
- GitHub Check: Check: source-pokeapi
- GitHub Check: Check: source-shopify
- GitHub Check: Check: source-google-drive
- GitHub Check: Check: source-hardcoded-records
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (All, Python 3.12, Ubuntu)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.13, Ubuntu)
- GitHub Check: SDM Docker Image Build
- GitHub Check: Manifest Server Docker Image Build
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (2)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
29-38: Importing EmptyFileSchemaInferenceError looks right.Catching the specific EmptyFileSchemaInferenceError before AirbyteTracedException ensures it’s handled by the centralized helper as intended. Nice touch, wdyt?
262-268: ✓ Test coverage for empty-file scenario already exists—no action needed.The centralized error handling through
_raise_schema_inference_errorlooks solid. Good news: test coverage for the empty-file path is already in place. Multiple scenarios acrosscsv_scenarios.py(lines 1151, 1255, 1827, 2657),jsonl_scenarios.py(line 497), anduser_input_schema_scenarios.py(lines 151, 194) verify that discovery properly yields aSCHEMA_INFERENCE_ERRORwhen files are empty or unparseable. The CSV parser also correctly raisesEmptyFileSchemaInferenceErrorwhen no rows are detected. The implementation and tests align well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Mostly reverts changes made in airbytehq/airbyte#44438.
With the change above we lost ability to "ignore" empty files and not it throws a config error
Could not infer schema as there are no rows in file. If having an empty CSV file is expected, ignore this. Else, please contact Airbyte.Unlike previous behavior just logged these files as empty with the error message and not exit discover/sync process. If file is empty it's not added to the discover catalog during discover.
To migrate azure blob storage to latest cdk airbytehq/airbyte#68161 and keep previous behavior this pr add new exception type
EmptyFileSchemaInferenceErrorto keep this error logged and ignored during sync/discover process.Summary by CodeRabbit
Bug Fixes
Refactor