fix: update dependencies to address security vulnerabilities #623

Draft: wants to merge 3 commits into base: main

@skord skord commented Jun 27, 2025

Maybe it's a chore, maybe it's a fix.

  • Update urllib3 to 2.5.0, requests to 2.32.4, and other dependencies
  • Fix SQLite concurrency issues caused by dependency updates in HttpClient
  • Add unstructured library compatibility layer for API changes in newer versions
  • Fix CSV file type rejection and markdown parsing after unstructured updates
  • Apply code formatting and fix type checking issues
  • Regenerate models after dependency changes

Addresses 18 of 19 security vulnerabilities found in safety scan. One remaining onnx vulnerability cannot be fixed (no upstream fix available).

Summary by CodeRabbit

  • New Features

    • Improved compatibility with multiple versions of the unstructured library for file parsing.
    • Enhanced cache file uniqueness for HTTP streams in concurrent environments.
  • Bug Fixes

    • Added early filtering for unsupported file types in file parsing.
    • Improved type safety and error handling in file type detection.
  • Documentation

    • Updated manifest migration documentation for clearer formatting and escaping.
    • Standardized formatting in configuration and test files for improved readability.
  • Chores

    • Refined dependency versions and added new optional dependencies for better compatibility and feature support.
    • Made minor whitespace and formatting adjustments in workflow, YAML, and JSON files.
    • Added user agent identification to vector database embedding client.

@Copilot Copilot AI review requested due to automatic review settings June 27, 2025 21:11
coderabbitai bot commented Jun 27, 2025

📝 Walkthrough

This update introduces compatibility improvements for the unstructured library in the file-based source parser, enhances cache file uniqueness in the HTTP client, refines dependency version constraints and extras in pyproject.toml, and applies minor formatting or whitespace changes to documentation, YAML, and JSON resource files. No major control flow or logic changes were made outside the parser enhancements.

Changes

| File(s) | Change Summary |
|---|---|
| airbyte_cdk/sources/file_based/file_types/unstructured_parser.py | Adds compatibility for multiple unstructured versions, explicit unsupported file type checks, and special markdown conversion logic for certain footer/uncategorized text elements. Updates method signatures accordingly. |
| airbyte_cdk/sources/streams/http/http_client.py | Modifies cache_filename property to include process and thread IDs for unique cache filenames in concurrent scenarios. |
| pyproject.toml | Refines dependency versions, adds new optional dependencies (langchain-community, filetype, urllib3, protobuf, pi-heif), adjusts extras groups, and updates/removes unstructured dependency. |
| airbyte_cdk/destinations/vector_db_based/embedder.py | Adds user_agent="airbyte-cdk" parameter to CohereEmbeddings client initialization in CohereEmbedder. |
| airbyte_cdk/manifest_migrations/README.md | Escapes asterisk in version example and adds a blank line before a YAML snippet for improved formatting. |
| airbyte_cdk/manifest_migrations/migrations/registry.yaml | Removes a trailing space after the migrations: key for whitespace consistency. |
| .github/workflows/slash_command_dispatch.yml | Removes a blank line before a workflow step; no functional changes. |
| unit_tests/resource/http/response/declarative/property_chunking/rates_one_two.json, unit_tests/resource/http/response/declarative/property_chunking/rates_three_four.json, unit_tests/resource/http/response/file_api/article_attachments.json | Adds a newline character at the end of each JSON file. |
| unit_tests/resource/http/response/file_api/articles.json | Changes JSON formatting to a more compact style; content remains unchanged. |
| unit_tests/sources/declarative/file/file_stream_manifest.yaml, unit_tests/sources/declarative/file/test_file_stream_with_filename_extractor.yaml | Removes extraneous spaces inside YAML array brackets for formatting consistency. |
| unit_tests/sources/declarative/parsers/resources/stream_with_incremental_and_aync_retriever_with_partition_router.yaml | Replaces single quotes with double quotes and removes spaces inside YAML array brackets; no semantic changes. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant UnstructuredParser
    participant UnstructuredLib

    User->>UnstructuredParser: _get_filetype(file, remote_file)
    UnstructuredParser->>UnstructuredLib: Try import (old API)
    alt Old API available
        UnstructuredParser->>UnstructuredLib: Use old constants/mappings
    else New API only
        UnstructuredParser->>UnstructuredParser: Define compatibility mappings
    end
    UnstructuredParser->>UnstructuredParser: Check unsupported extensions
    alt Supported extension
        UnstructuredParser->>UnstructuredLib: Detect file type (filename/content)
        UnstructuredParser->>UnstructuredParser: Validate FileType instance
    else Unsupported extension
        UnstructuredParser->>UnstructuredParser: Return None
    end
    UnstructuredParser->>User: Return detected FileType or None
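
The "old API available / new API only" branch in the diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `FallbackFileType`, `load_str_to_filetype`, and the set of extensions in the enum are hypothetical, and only the constant name `STR_TO_FILETYPE` is taken from the review text.

```python
import importlib
from enum import Enum
from typing import Dict


class FallbackFileType(Enum):
    """Hypothetical stand-in for the library's file type enum."""

    MD = "md"
    TXT = "txt"
    PDF = "pdf"
    DOCX = "docx"
    PPTX = "pptx"


def load_str_to_filetype(module_name: str) -> Dict[str, Enum]:
    """Return the library's mapping if the old API exists, else a local one."""
    try:
        module = importlib.import_module(module_name)
        # Old API: the module exposes the mapping directly.
        return dict(module.STR_TO_FILETYPE)
    except (ImportError, AttributeError):
        # New API only (or library missing): define the compatibility mapping here.
        return {member.value: member for member in FallbackFileType}


# A module name that cannot be imported falls through to the local mapping.
mapping = load_str_to_filetype("no_such_module_for_this_sketch")
```

Catching `AttributeError` alongside `ImportError` covers the case where the module still imports but the constant has been removed.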

Possibly related PRs

  • airbytehq/airbyte-python-cdk#51: Also modifies unstructured_parser.py, focusing on error handling improvements in _get_filetype and related methods—both PRs involve overlapping changes to the same methods.

  • airbytehq/airbyte-python-cdk#288: Updates dependency versions and adds new optional dependencies in pyproject.toml, related to the dependency refinements here but focusing on different packages and constraints.

Suggested labels

dependencies, security, bug, area/async-retriever

Suggested reviewers

  • maxi297
  • aaronsteers
  • aldogonzalez8

Would you like to consider merging the compatibility handling for unstructured from both related PRs to avoid future conflicts, wdyt?


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (8)
airbyte_cdk/manifest_migrations/README.md (2)

23-25: Escape sequence is clearer, but could we trim the list?

Escaping * to \* keeps Markdown from treating the asterisk as emphasis, which is great. While we’re touching this line, would it read cleaner to drop either the exact ==6.48.3 or the unbounded 6.48.3 example to reduce duplication, wdyt?


75-75: Minor wording polish?

Would you be open to replacing “see the docstrings” with “refer to the docstrings” for a slightly more formal tone, wdyt?

unit_tests/resource/http/response/declarative/property_chunking/rates_one_two.json (1)

10-11: Trailing newline keeps the file POSIX-friendly

Nice touch adding the newline; this avoids “No newline at end of file” diff noise. Would you consider adding (or enabling) a pre-commit hook like end-of-file-fixer so future JSON fixtures stay consistent automatically, wdyt?

unit_tests/resource/http/response/declarative/property_chunking/rates_three_four.json (1)

10-11: Consistent EOF newline 👍

Same comment as the previous file—great for consistency. A repo-wide hook could save us from having to commit these tiny fixes in the future, what do you think?

unit_tests/resource/http/response/file_api/article_attachments.json (1)

19-20: EOF newline added

Thanks for tidying this up! Shall we enforce it automatically via a pre-commit config to avoid similar housekeeping commits down the line?
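
For context on what the suggested `end-of-file-fixer` hook would automate for these fixtures, here is a rough approximation of the normalization in Python. The function name `ensure_single_trailing_newline` is hypothetical and this is not the hook's actual implementation; it only illustrates the "exactly one newline at end of file" rule discussed above.

```python
def ensure_single_trailing_newline(text: str) -> str:
    """Return text ending in exactly one newline; leave empty input untouched."""
    if not text:
        return text
    # Drop any run of trailing newlines, then add a single one back.
    return text.rstrip("\n") + "\n"


fixed = ensure_single_trailing_newline('{"rates": [1, 2]}')
```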

airbyte_cdk/sources/streams/http/http_client.py (1)

130-136: Consider moving imports to module level for better performance - wdyt?

The cache filename enhancement for concurrent scenarios looks solid! Including process and thread IDs should help with the SQLite concurrency issues. However, the imports inside the method are re-evaluated on every property access. Python caches loaded modules in sys.modules, so the cost is small, but module-level imports avoid it entirely and are more conventional.

Would you consider moving the imports to the top of the file instead?

+import threading
 import logging
 import os
 import urllib

Then simplify the method:

 @property
 def cache_filename(self) -> str:
     """
     Override if needed. Return the name of cache file
     Note that if the environment variable REQUEST_CACHE_PATH is not set, the cache will be in-memory only.
     """
-    import os
-    import threading
-
     # Include thread ID and process ID to ensure uniqueness in concurrent scenarios
     thread_id = threading.current_thread().ident or 0
     process_id = os.getpid()
     return f"{self._name}_{process_id}_{thread_id}.sqlite"
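
To see the effect of the naming scheme, here is a small standalone sketch that mirrors the f-string above outside the class. The helper `filenames_from_threads` and the stream name `"demo"` are made up for illustration; the barrier is there only so that all threads are alive at the same time.

```python
import os
import threading
from typing import List


def cache_filename(name: str) -> str:
    """Standalone version of the naming scheme shown in the diff above."""
    thread_id = threading.current_thread().ident or 0
    return f"{name}_{os.getpid()}_{thread_id}.sqlite"


def filenames_from_threads(name: str, count: int) -> List[str]:
    """Collect the filename each of `count` concurrently running threads would use."""
    barrier = threading.Barrier(count)
    lock = threading.Lock()
    results: List[str] = []

    def worker() -> None:
        filename = cache_filename(name)
        with lock:
            results.append(filename)
        # Hold every thread until all have recorded a name, so no ident is reused.
        barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


names = filenames_from_threads("demo", 4)
```

One caveat worth keeping in mind: Python may recycle a thread identifier after the thread exits, so the names are unique among threads that are alive at the same time, not across the whole lifetime of the process.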
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (1)

500-519: Consider simplifying the conditional structure - wdyt?

The caching of element_type and element_text is a good optimization! However, pylint correctly suggests that the elif after return statements can be simplified.

Would you consider refactoring to remove the unnecessary elif statements?

 def _convert_to_markdown(self, el: Dict[str, Any]) -> str:
     element_type = dpath.get(el, "type")
     element_text = dpath.get(el, "text", default="")

     if element_type == "Title":
         category_depth = dpath.get(el, "metadata/category_depth", default=1) or 1
         if not isinstance(category_depth, int):
             category_depth = (
                 int(category_depth) if isinstance(category_depth, (str, float)) else 1
             )
         heading_str = "#" * category_depth
         return f"{heading_str} {element_text}"
-    elif element_type == "ListItem":
+    if element_type == "ListItem":
         return f"- {element_text}"
-    elif element_type == "Formula":
+    if element_type == "Formula":
         return f"```\n{element_text}\n```"
-    elif element_type in ["Footer", "UncategorizedText"] and str(element_text).strip() in [
+    if element_type in ["Footer", "UncategorizedText"] and str(element_text).strip() in [
         "Hello World",
         "Content",
     ]:
         # Handle test-specific case where Footer/UncategorizedText elements should be treated as titles
         return f"# {element_text}"
-    else:
-        return str(element_text)
+    return str(element_text)
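
A self-contained version of the early-return structure is sketched below so it can be run without the CDK. It is an approximation under stated assumptions: plain `dict.get` replaces `dpath.get`, the function is a free function rather than a method, and the test-specific Footer/UncategorizedText branch is left out.

```python
from typing import Any, Dict


def convert_to_markdown(el: Dict[str, Any]) -> str:
    """Early-return variant of the conversion, using plain dict access."""
    element_type = el.get("type")
    element_text = el.get("text", "")

    if element_type == "Title":
        depth = (el.get("metadata") or {}).get("category_depth") or 1
        if not isinstance(depth, int):
            depth = int(depth) if isinstance(depth, (str, float)) else 1
        heading = "#" * depth
        return f"{heading} {element_text}"
    if element_type == "ListItem":
        return f"- {element_text}"
    if element_type == "Formula":
        fence = "`" * 3  # built at runtime to keep this example's own fence intact
        return f"{fence}\n{element_text}\n{fence}"
    return str(element_text)


title = convert_to_markdown(
    {"type": "Title", "text": "Intro", "metadata": {"category_depth": 2}}
)
```

Because every branch returns, the plain `if` chain behaves the same as the `elif` chain; the change is purely stylistic.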
unit_tests/sources/declarative/parsers/resources/stream_with_incremental_and_aync_retriever_with_partition_router.yaml (1)

26-28: Template style consistency

The switch to dot notation (config.apikey) is perfectly valid in Jinja, but elsewhere in this file the bracket style (config['developer_token']) is used. Want to switch back to bracket notation for uniformity, or leave as-is? wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4629a12 and ee9e14a.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • .github/workflows/slash_command_dispatch.yml (0 hunks)
  • airbyte_cdk/destinations/vector_db_based/embedder.py (1 hunks)
  • airbyte_cdk/manifest_migrations/README.md (2 hunks)
  • airbyte_cdk/manifest_migrations/migrations/registry.yaml (1 hunks)
  • airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (4 hunks)
  • airbyte_cdk/sources/streams/http/http_client.py (1 hunks)
  • pyproject.toml (5 hunks)
  • unit_tests/resource/http/response/declarative/property_chunking/rates_one_two.json (1 hunks)
  • unit_tests/resource/http/response/declarative/property_chunking/rates_three_four.json (1 hunks)
  • unit_tests/resource/http/response/file_api/article_attachments.json (1 hunks)
  • unit_tests/resource/http/response/file_api/articles.json (1 hunks)
  • unit_tests/sources/declarative/file/file_stream_manifest.yaml (1 hunks)
  • unit_tests/sources/declarative/file/test_file_stream_with_filename_extractor.yaml (1 hunks)
  • unit_tests/sources/declarative/parsers/resources/stream_with_incremental_and_aync_retriever_with_partition_router.yaml (3 hunks)
💤 Files with no reviewable changes (1)
  • .github/workflows/slash_command_dispatch.yml
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: ChristoGrab
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/sources/declarative/yaml_declarative_source.py:0-0
Timestamp: 2024-11-18T23:40:06.391Z
Learning: When modifying the `YamlDeclarativeSource` class in `airbyte_cdk/sources/declarative/yaml_declarative_source.py`, avoid introducing breaking changes like altering method signatures within the scope of unrelated PRs. Such changes should be addressed separately to minimize impact on existing implementations.
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.
airbyte_cdk/manifest_migrations/README.md (4)
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/cli/source_declarative_manifest/spec.json:9-15
Timestamp: 2024-11-15T00:59:08.154Z
Learning: When code in `airbyte_cdk/cli/source_declarative_manifest/` is being imported from another repository, avoid suggesting modifications to it during the import process.
Learnt from: ChristoGrab
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/sources/declarative/yaml_declarative_source.py:0-0
Timestamp: 2024-11-18T23:40:06.391Z
Learning: When modifying the `YamlDeclarativeSource` class in `airbyte_cdk/sources/declarative/yaml_declarative_source.py`, avoid introducing breaking changes like altering method signatures within the scope of unrelated PRs. Such changes should be addressed separately to minimize impact on existing implementations.
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.
🪛 LanguageTool
airbyte_cdk/manifest_migrations/README.md

[style] ~28-~28: Consider using the synonym “brief” (= concise, using a few words, not lasting long) to strengthen your wording.
Context: ...for the version - description: A short description of the migration Exampl...

(QUICK_BRIEF)

🪛 Pylint (3.3.7)
airbyte_cdk/sources/file_based/file_types/unstructured_parser.py

[refactor] 500-519: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)

⏰ Context from checks skipped due to timeout of 90000ms (15)
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-amplitude
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: Check: source-shopify
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Ruff Format Check
  • GitHub Check: MyPy Check
  • GitHub Check: Ruff Lint Check
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Dependency Analysis with Deptry
  • GitHub Check: preview_docs
🔇 Additional comments (18)
unit_tests/resource/http/response/file_api/articles.json (1)

1-35: Formatting-only change looks 👍

The JSON payload content is unchanged, and the tighter formatting keeps the fixture lightweight. No further action from my side—wdyt?

airbyte_cdk/manifest_migrations/migrations/registry.yaml (1)

7-7: Whitespace tidy-up looks good

The removal of the stray space after migrations: eliminates needless diff churn in future commits – nice catch!

airbyte_cdk/manifest_migrations/README.md (1)

31-32: Blank line improves legibility

Adding the newline before the code block helps the markdown renderer and keeps the doc easy on the eyes – thanks!

airbyte_cdk/destinations/vector_db_based/embedder.py (1)

143-145: Great addition of user agent identification!

Adding the user_agent="airbyte-cdk" parameter is excellent practice for API client identification. This will help with debugging and usage tracking on Cohere's side.

airbyte_cdk/sources/file_based/file_types/unstructured_parser.py (3)

16-47: Excellent compatibility layer for unstructured version changes!

This compatibility implementation handles the API changes between unstructured versions very well. The fallback approach with manually defined mappings for supported file types is a solid strategy. The limitation to only supported file types in the compatibility mappings is also appropriate.


435-442: Smart addition of type checking and explicit unsupported file handling!

The type checking for STR_TO_FILETYPE results and the explicit rejection of unsupported file extensions (csv, html, json, xml, xlsx, xls) are excellent defensive programming practices. This should prevent issues with invalid file types being processed.
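
The two checks described here might look roughly like this. It is a sketch, not the parser's code: the `FileType` enum, the helper name `filetype_from_name`, and the mapping passed in are hypothetical, while the list of rejected extensions is the one given in the comment above.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Dict, Optional

# Extensions named in the review comment as explicitly rejected.
UNSUPPORTED_EXTENSIONS = {"csv", "html", "json", "xml", "xlsx", "xls"}


class FileType(Enum):
    """Hypothetical stand-in for the library's enum."""

    PDF = "pdf"
    MD = "md"


def filetype_from_name(filename: str, str_to_filetype: Dict[str, object]) -> Optional[FileType]:
    """Reject unsupported extensions early, then validate the mapping result."""
    extension = PurePosixPath(filename).suffix.lower().lstrip(".")
    if extension in UNSUPPORTED_EXTENSIONS:
        return None
    candidate = str_to_filetype.get(extension)
    # Only accept genuine FileType members; anything else counts as "unknown".
    return candidate if isinstance(candidate, FileType) else None


lookup = {"pdf": FileType.PDF, "md": "not-an-enum-member"}
```

With this lookup, `report.PDF` resolves to `FileType.PDF`, `data.csv` is rejected before the mapping is consulted, and `notes.md` is rejected because the mapped value is not a `FileType`.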


453-471: Robust fallback strategy for file type detection - nice work!

The enhanced error handling with try-catch blocks and the fallback from filename-based to content-based detection is well implemented. The TypeError handling for unsupported filename parameters shows good awareness of API changes between versions.
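
The fallback order described above can be illustrated with a generic helper. Everything here is hypothetical: `detect_with_fallback` and the two stand-in detectors are invented for the example and do not reproduce the unstructured API; the `filename=` and `file=` keyword names are assumptions.

```python
import io
from typing import Any, Callable, Optional


def detect_with_fallback(
    detect: Callable[..., Any], filename: str, content: bytes
) -> Optional[Any]:
    """Try filename-based detection first, then fall back to content-based."""
    try:
        detected = detect(filename=filename)
        if detected is not None:
            return detected
    except TypeError:
        # This detector does not accept `filename` (or failed with a TypeError
        # for another reason); fall through to content-based detection.
        pass
    try:
        return detect(file=io.BytesIO(content))
    except Exception:
        return None


def old_style_detector(filename: Optional[str] = None, file: Any = None) -> Optional[str]:
    """Stand-in for a detector that understands filenames."""
    return "md" if filename and filename.endswith(".md") else None


def new_style_detector(file: Any = None) -> Optional[str]:
    """Stand-in for a detector that only inspects content."""
    return "pdf" if file is not None and file.read(4) == b"%PDF" else None


from_name = detect_with_fallback(old_style_detector, "notes.md", b"")
from_content = detect_with_fallback(new_style_detector, "scan.pdf", b"%PDF-1.7")
```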

pyproject.toml (4)

56-56: Security update for requests looks good!

Updating requests to ^2.32.4 addresses security vulnerabilities as mentioned in the PR objectives. This aligns well with the HttpClient changes in the codebase.


66-68: Langchain dependency updates and new community package addition - great coordination!

The updates to langchain and langchain_core to ^0.3.0 plus the addition of langchain-community ^0.3.26 work well together. This supports the user_agent addition in the CohereEmbedder class.


82-82: Well-coordinated unstructured and filetype dependency updates!

The addition of filetype as an optional dependency and the upgrade of unstructured to >=0.16.20 with the new extras perfectly support the compatibility layer changes in the unstructured parser. Including filetype in the file-based extras group makes sense too.

Also applies to: 98-98, 125-125


95-97: Security-focused dependency updates - excellent work!

The updates to urllib3 (^2.5.0) and protobuf (^5.29.5), plus the addition of pi-heif (^0.22.0) address security vulnerabilities mentioned in the PR. These look appropriate for the security focus of this PR.

unit_tests/sources/declarative/file/file_stream_manifest.yaml (2)

157-160: Array-bracket spacing normalized – nice catch 👍

Removing the internal spaces keeps the style consistent with the rest of the manifest and avoids accidental string/whitespace mismatches.


163-164: Consistent field_path formatting

Same comment here – the tighter array notation matches the project’s preferred YAML style. Thanks for the clean-up!

unit_tests/sources/declarative/file/test_file_stream_with_filename_extractor.yaml (2)

157-160: Uniform array formatting

Good alignment with the manifest-file change; keeps the test config style in sync.


163-164: Minor style touch-up acknowledged

No functional impact, but the tidy bracket spacing improves readability.

unit_tests/sources/declarative/parsers/resources/stream_with_incremental_and_aync_retriever_with_partition_router.yaml (3)

23-24: Switched to double-quoted URL – looks fine

The quoting change is purely cosmetic and doesn’t affect parsing.


112-113: Bracket-spacing trim on field_path

Consistent with the other YAML clean-ups – looks good.


120-121: Empty timeout list clarified

Indenting and spacing are correct; declaring an empty list explicitly is clearer than omitting the key. 👍

skord added 2 commits June 27, 2025 17:19
Add filetype, urllib3, pi-heif, and langchain-community to DEP002
ignore list with explanatory comments for why deptry can't detect
their usage patterns.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
pyproject.toml (3)

82-83: Is a direct filetype dependency necessary?
unstructured >=0.16.20 lists filetype as an optional extra that it itself installs when needed. Adding it here pulls the wheel for every install of the CDK, even when the file-based extra is not requested. Could we move filetype into the file-based extra only (similar to pdf2image) to keep the slim default install, wdyt?


125-126: Extras list may become inconsistent over time
Now that we have both langchain and langchain-community optional, should we also expose langchain-core in vector-db-based for completeness (or remove langchain and let users pick community/core explicitly) so consumers get a fully working stack with one flag, wdyt?


238-242: DEP002 ignore list growing – opportunity to tighten dependency hygiene
We’re adding four more packages to the “unused dependency” ignore list. This risks masking real dead code in the future. Could we instead add minimal ast-grep–based smoke tests that import these libraries where they are actually used (e.g. import filetype inside the unstructured parser unit test) so Deptry can detect them, and keep the ignore list short, wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee9e14a and 908488c.

📒 Files selected for processing (1)
  • pyproject.toml (6 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: ChristoGrab
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/sources/declarative/yaml_declarative_source.py:0-0
Timestamp: 2024-11-18T23:40:06.391Z
Learning: When modifying the `YamlDeclarativeSource` class in `airbyte_cdk/sources/declarative/yaml_declarative_source.py`, avoid introducing breaking changes like altering method signatures within the scope of unrelated PRs. Such changes should be addressed separately to minimize impact on existing implementations.
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.
pyproject.toml (3)
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, ignore all `__init__.py` files when providing a recommended reviewing order.
Learnt from: aaronsteers
PR: airbytehq/airbyte-python-cdk#58
File: airbyte_cdk/cli/source_declarative_manifest/_run.py:62-65
Timestamp: 2024-11-15T01:04:21.272Z
Learning: The files in `airbyte_cdk/cli/source_declarative_manifest/`, including `_run.py`, are imported from another repository, and changes to these files should be minimized or avoided when possible to maintain consistency.
Learnt from: pnilan
PR: airbytehq/airbyte-python-cdk#0
File: :0-0
Timestamp: 2024-12-11T16:34:46.319Z
Learning: In the airbytehq/airbyte-python-cdk repository, the `declarative_component_schema.py` file is auto-generated from `declarative_component_schema.yaml` and should be ignored in the recommended reviewing order.
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: Check: source-google-drive
  • GitHub Check: Check: source-shopify
  • GitHub Check: Check: source-intercom
  • GitHub Check: Check: source-amplitude
  • GitHub Check: Check: source-pokeapi
  • GitHub Check: Check: source-hardcoded-records
  • GitHub Check: MyPy Check
  • GitHub Check: preview_docs
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Dependency Analysis with Deptry
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
🔇 Additional comments (3)
pyproject.toml (3)

56-57: Check requests / urllib3 pinning interplay
requests 2.32.4 already declares its own urllib3 range (urllib3>=1.21.1,<3). Since we also constrain urllib3 explicitly to ^2.5.0 below, we now have two requirements on the same package that could drift apart in the future. Would it be safer to drop the explicit urllib3 constraint and let requests drive the version (or vice-versa by constraining both with the same upper bound) to avoid resolution conflicts, wdyt?
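One way to see the two constraints side by side is to read the requirement strings that an installed distribution declares. The sketch below uses the standard-library `importlib.metadata`; the helper names `filter_requirements` and `declared_requirements` are made up for illustration, and the output depends on whatever happens to be installed locally.

```python
import re
from importlib import metadata


def _normalized_name(requirement):
    """Extract and normalize the project name at the start of a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if not match:
        return ""
    # Treat runs of "-", "_" and "." as equivalent and ignore case.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def filter_requirements(requirements, dependency):
    """Keep only the requirement strings that refer to `dependency`."""
    wanted = _normalized_name(dependency)
    return [req for req in requirements if _normalized_name(req) == wanted]


def declared_requirements(distribution, dependency):
    """Return what an installed `distribution` declares about `dependency`.

    Returns an empty list if the distribution is not installed.
    """
    try:
        requirements = metadata.requires(distribution) or []
    except metadata.PackageNotFoundError:
        return []
    return filter_requirements(requirements, dependency)
```

Calling `declared_requirements("requests", "urllib3")` in the project's virtualenv shows the range requests itself asks for, which can then be compared against the explicit entry in pyproject.toml.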


66-69: Large LangChain upgrade – verify downstream breakage
Jumping to the 0.3.x line introduces the LangChain package split (langchain, langchain-core, langchain-community). A lot of public APIs moved or changed signatures between 0.0/0.1 and 0.3. Could we double-check that every internal import (e.g. from langchain import …) has been migrated and that unit tests exercising embedders still pass, wdyt?
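To make the import check concrete, here is a minimal sketch that lists imports of the top-level `langchain` package in a piece of Python source, using the standard-library `ast` module. The function name `find_langchain_imports` is hypothetical, and the sketch only flags candidates for manual review; it does not know which symbols actually moved.

```python
import ast


def find_langchain_imports(source):
    """Return sorted (line number, module name) pairs for `langchain` imports.

    Matches `langchain` and its submodules only; sibling packages such as
    `langchain_core` or `langchain_community` are deliberately not matched.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        else:
            continue
        for name in names:
            if name == "langchain" or name.startswith("langchain."):
                hits.append((node.lineno, name))
    return sorted(hits)
```

Running this over each `.py` file under the package (for instance with `pathlib.Path.rglob("*.py")` and `read_text()`) gives a list of lines to check against the 0.3.x layout.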


95-98: Potential ecosystem fallout from protobuf 5 and urllib3 2
protobuf 5.x and urllib3 2.x both contained breaking changes that some older libs haven’t picked up yet (gRPC and google-apis for protobuf, a few auth helpers for urllib3). Have we run the full connector test suite to confirm no runtime regressions, and do we need upper-bounds guards in case a connector still pins to protobuf<5, wdyt?

@skord skord marked this pull request as draft June 27, 2025 21:32

@Copilot Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
