Update Salesforce Data Source Name #19
Conversation
Walkthrough

The changes update the Salesforce data source integration by renaming its registration identifier from "salesforce" to "pyspark.datasource.salesforce" in the implementation, tests, and examples. The class docstring is expanded to clarify streaming write recovery semantics. No logic, error handling, or control flow is modified.

Sequence Diagram(s): Not applicable; changes are limited to identifier renaming and docstring clarification, with no new or modified control flow.

Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes
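To make the rename concrete for callers, here is a minimal usage sketch, assuming the package-level `SalesforceDataSource` export and PySpark 4.0's `spark.dataSource.register` API; the credential options the sink presumably requires are omitted, and the checkpoint path is illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

from pyspark_datasources import SalesforceDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(SalesforceDataSource)

# Stand-in streaming input; real jobs would read from Kafka, files, etc.
accounts = (
    spark.readStream.format("rate").load()
    .select(
        lit("Test Company").alias("Name"),
        lit("Technology").alias("Industry"),
    )
)

query = (
    accounts.writeStream
    .format("pyspark.datasource.salesforce")  # previously "salesforce"
    .option("checkpointLocation", "/tmp/salesforce-checkpoint")
    .start()
)
```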
Actionable comments posted: 1
🔭 Outside diff range comments (1)
pyspark_datasources/salesforce.py (1)
80-106: Code examples use the obsolete identifier – they no longer run

All illustrative snippets (lines 80-106, 116-122, 129-138, 149-157, etc.) call `.format("salesforce")`. Update them to `"salesforce-sink"` (or document both if an alias is provided) so that copy-paste examples work.

Example fix:

```diff
-    .format("salesforce") \
+    .format("salesforce-sink") \
```

Neglecting this will create immediate runtime errors for users following the docs.
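For illustration, a self-contained sketch of a corrected snippet as of this review round (the identifier was later renamed again to `pyspark.datasource.salesforce`); the checkpoint path is illustrative and the sink's required credential options are omitted:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
# Assumes SalesforceDataSource has already been registered via
# spark.dataSource.register(SalesforceDataSource).

# Stand-in streaming input for the docs snippet.
df = spark.readStream.format("rate").load().select(
    lit("Test Company").alias("Name"),
)

query = (
    df.writeStream
    .format("salesforce-sink")  # was .format("salesforce")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
)
```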
🧹 Nitpick comments (1)
pyspark_datasources/salesforce.py (1)
26-33: Docstring still states `Name: salesforce` – update to avoid confusion

The high-level documentation advertises the old identifier, which contradicts the new return value and will mislead readers.

```diff
-    Name: `salesforce`
+    Name: `salesforce-sink`
```

Suggest keeping a note about the legacy identifier if you adopt the alias approach.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- pyspark_datasources/salesforce.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
pyspark_datasources/salesforce.py (6)
- pyspark_datasources/lance.py (2): name (48-49), LanceSink (10-56)
- pyspark_datasources/huggingface.py (1): name (58-59)
- pyspark_datasources/fake.py (1): name (113-114)
- pyspark_datasources/kaggle.py (1): name (63-64)
- pyspark_datasources/opensky.py (1): name (441-442)
- pyspark_datasources/simplejson.py (1): name (48-49)
```diff
 @classmethod
 def name(cls) -> str:
     """Return the short name for this Salesforce sink."""
-    return "salesforce"
+    return "salesforce-sink"
```
🛠️ Refactor suggestion
Changed short-name silently breaks existing jobs – add an alias or clear migration path
Every user script that previously did `df.writeStream.format("salesforce")…` will now fail to resolve the sink after this one-line change. Either provide an explicit backward-compatibility alias or publish a migration note in the module so that production pipelines do not break without warning.
Minimal code-side mitigation:

```diff
 @classmethod
 def name(cls) -> str:
     """Return the short name for this Salesforce sink."""
     return "salesforce-sink"
+
+# -- Backward-compatibility -------------------------------------------
+@classmethod
+def aliases(cls) -> List[str]:
+    """
+    Alternative identifiers accepted by Spark.
+    Retain the legacy name while users migrate.
+    """
+    return ["salesforce"]  # legacy identifier
```

If `DataSource` doesn't support an `aliases()` hook, expose a second thin subclass that only overrides `name()` back to `"salesforce"` and register both classes.
Failing that, at the very least document the breaking change in the project’s CHANGELOG/README.
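A sketch of that fallback (the thin legacy subclass), assuming the sink class lives at `pyspark_datasources.salesforce.SalesforceDataSource` and that registration happens through PySpark's `spark.dataSource.register`:

```python
from pyspark.sql import SparkSession

from pyspark_datasources.salesforce import SalesforceDataSource


class LegacySalesforceDataSource(SalesforceDataSource):
    """Thin shim that keeps the legacy short name resolvable while users migrate."""

    @classmethod
    def name(cls) -> str:
        return "salesforce"  # legacy identifier


spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(SalesforceDataSource)        # resolves "salesforce-sink"
spark.dataSource.register(LegacySalesforceDataSource)  # resolves "salesforce"
```

Registering the shim is cheap and can be dropped once downstream jobs have migrated.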
🤖 Prompt for AI Agents
In pyspark_datasources/salesforce.py around lines 168 to 172, the change of the
sink's short name from "salesforce" to "salesforce-sink" breaks backward
compatibility for existing jobs using "salesforce". To fix this, either add an
alias for "salesforce" that points to the new sink or create a second subclass
with the old name method returning "salesforce" and register both classes. If
neither is possible, document this breaking change clearly in the project's
CHANGELOG or README to inform users of the migration path.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
tests/test_data_sources.py (1)
4-4: Replace star-import with explicit symbols to remove Ruff F405 and improve clarity
`from pyspark_datasources import *` obfuscates where `SalesforceDataSource`, `FakeDataSource`, etc. come from and triggers Ruff's F405 ("may be undefined from star import"). Import the required classes explicitly instead.

```diff
-from pyspark_datasources import *
+from pyspark_datasources import (
+    GithubDataSource,
+    FakeDataSource,
+    KaggleDataSource,
+    OpenSkyDataSource,
+    SalesforceDataSource,
+)
```
🧹 Nitpick comments (1)
tests/test_data_sources.py (1)
79-93: Use `pytest.raises` for cleaner failure-expectation

The manual `try/except … assert False` block is verbose and can hide assertion mistakes. `pytest.raises` conveys intent and short-circuits correctly.

```diff
-try:
-    # Try to use batch write - should fail since we only support streaming
-    df = spark.range(1).select(
-        lit("Test Company").alias("Name"),
-        lit("Technology").alias("Industry"),
-        lit(50000.0).alias("AnnualRevenue")
-    )
-
-    df.write.format("salesforce-sink").mode("append").save()
-    assert False, "Should have raised error - Salesforce DataSource only supports streaming"
-except Exception as e:
-    # This is expected - Salesforce DataSource only supports streaming writes
-    error_msg = str(e).lower()
-    # The error can be about unsupported mode or missing writer
-    assert "unsupported" in error_msg or "writer" in error_msg or "not implemented" in error_msg
+df = spark.range(1).select(
+    lit("Test Company").alias("Name"),
+    lit("Technology").alias("Industry"),
+    lit(50000.0).alias("AnnualRevenue")
+)
+
+with pytest.raises(Exception) as excinfo:
+    df.write.format("salesforce-sink").mode("append").save()
+
+msg = str(excinfo.value).lower()
+assert any(keyword in msg for keyword in ("unsupported", "writer", "not implemented"))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/salesforce_sink_example.py (5 hunks)
- tests/test_data_sources.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- examples/salesforce_sink_example.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tests/test_data_sources.py
74-74: SalesforceDataSource may be undefined, or defined from star imports
(F405)
🔇 Additional comments (1)
tests/test_data_sources.py (1)
74-74: Assertion updated correctly to the new sink identifier

The expectation now matches the renamed `SalesforceDataSource.name()` value (`"salesforce-sink"`).
Looks good and keeps the test aligned with the production change.
examples/salesforce_sink_example.py (Outdated)
```diff
 # Write to Salesforce
 query = account_data.writeStream \
-    .format("salesforce") \
+    .format("salesforce-sink") \
```
To follow the naming convention for Scala data sources (e.g. `org.spark.sql.<datasource>`), can we use something like `pyspark.datasource.salesforce`? Or we can just use `sfdc` (abbreviation for Salesforce).
Sure, I updated it to `pyspark.datasource.salesforce`.
```
- Uses Salesforce username/password/security token authentication
- Supports batch writing with Salesforce Composite Tree API for efficient processing
- Implements exactly-once semantics through Spark's checkpoint mechanism
- If a streaming write job fails and is resumed from the checkpoint,
```
Let's also update the Name above?
Also the name in the example below?
Sorry for missing it! I updated all data source references.
Update Salesforce Data Source Name to accurately reflect its scope.
Summary by CodeRabbit

- Documentation
- Refactor