-
Notifications
You must be signed in to change notification settings - Fork 745
[GH-2074] Add libpostal integration #2077
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@Imbruced Any idea what might be causing the test failure in this new version I added to the test matrix? spark 3.5.4, java 17, scala 2.12.8 Edit: I see it is that we only set spark home when the spark version is 3.5.0 or 4.0.0. I will update this. |
|
Jpostal requires Java 17? Sedona is compiled against Java 11. Will this even pass our Java tests in CI? |
|
@james-willis Since the jpostal under Wherobots is already a fork, why not just compile it against Java 11? What specific Java 16+ features were used in the JNI binding and can we drop it? |
|
actually turned out to be really easy to support java 11: wherobots/jpostal#13 |
|
Please also remove all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR integrates libpostal to provide address normalization and parsing via two new functions, ExpandAddress and ParseAddress, complete with codegen support, configuration, and documentation.
- Introduce new SQL and DataFrame API functions backed by libpostal (
ExpandAddress,ParseAddress). - Add helper
LibPostalUtilsand extendSedonaConfto configure data directory and useSenzing options. - Provide comprehensive Scala and Python tests, update documentation, add jpostal dependency, and adjust CI matrix.
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| spark/common/src/test/scala/org/apache/sedona/sql/GeoStatsSuite.scala | Disable adaptive SQL execution in tests |
| spark/common/src/test/scala/org/apache/sedona/sql/AddressProcessingFunctionsTest.scala | Add unit tests for ExpandAddress and ParseAddress |
| spark/common/src/test/java/org/apache/sedona/core/utils/SedonaConfTest.java | Cover null input for bytesFromString |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/st_functions.scala | Register ExpandAddress and ParseAddress in SQL functions |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/LibPostalUtils.scala | Implement utility methods for jpostal config |
| spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala | Define expression classes with eval and codegen for the new functions |
| spark/common/src/main/scala/org/apache/sedona/sql/UDF/Catalog.scala | Add new functions to the UDF catalog |
| spark/common/src/main/java/org/apache/sedona/core/utils/SedonaConf.java | Extend configuration to include libpostal settings |
| spark/common/pom.xml | Add com.wherobots:jpostal dependency |
| python/tests/sql/test_dataframe_api.py | Add DataFrame API plan tests for the new functions |
| python/sedona/spark/sql/st_functions.py | Add Python wrappers for ExpandAddress and ParseAddress |
| docs/api/sql/Function.md | Document the new SQL functions |
| .github/workflows/java.yml | Expand CI Spark version matrix and simplify Python setup |
Comments suppressed due to low confidence (1)
spark/common/src/test/scala/org/apache/sedona/sql/AddressProcessingFunctionsTest.scala
Outdated
Show resolved
Hide resolved
spark/common/src/test/scala/org/apache/sedona/sql/AddressProcessingFunctionsTest.scala
Outdated
Show resolved
Hide resolved
spark/common/src/test/java/org/apache/sedona/core/utils/SedonaConfTest.java
Outdated
Show resolved
Hide resolved
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala
Show resolved
Hide resolved
spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/Functions.scala
Outdated
Show resolved
Hide resolved
…ssingFunctionsTest.scala Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ConfTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ssingFunctionsTest.scala Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ExpandAddress and ParseAddess support via libpostal * PR comments; edge case in SedonaConf changes * Update spark/common/src/test/scala/org/apache/sedona/sql/AddressProcessingFunctionsTest.scala Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update python/sedona/spark/sql/st_functions.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update spark/common/src/test/java/org/apache/sedona/core/utils/SedonaConfTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update spark/common/src/test/scala/org/apache/sedona/sql/AddressProcessingFunctionsTest.scala Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * align null case logic between eval and codegen cases --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-2074] my subject. Closes #2074What changes were proposed in this PR?
This add ExpandAddress and ParseAddress functions backed by libpostal, including support for code gen.
How was this patch tested?
Unit Tests
Did this PR include necessary documentation updates?