[SEDONA-664] Add native GeoPackage reader #1603
Conversation
As a follow-up to this MR, I need to add

Thank you for this great work! My major concern is whether it works with GeoPackage files stored on cloud storage such as HDFS or S3. As far as I know

@Kontinuation good point, I'll write a test to make sure it works. An integration test with MinIO should be enough, I think.

WIP
```scala
// skip srid for now
reader.getInt()

skipEnvelope(resolvedFlags._1, reader)

val wkb = new Array[Byte](reader.remaining())
reader.get(wkb)

val wkbReader = new org.locationtech.jts.io.WKBReader()
val geom = wkbReader.read(wkb)
```
I suggest that we take SRID into consideration. `val wkbReader = new WKBReader(new GeometryFactory(new PrecisionModel(), srid))` would be sufficient.
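A sketch of how the SRID-aware decode could look, assuming `reader` is the `java.nio.ByteBuffer` positioned at the SRID field of the GeoPackage binary header, and reusing the `skipEnvelope`/`resolvedFlags` names from the snippet above (everything else here is illustrative, not the PR's final code):

```scala
import org.locationtech.jts.geom.{GeometryFactory, PrecisionModel}
import org.locationtech.jts.io.WKBReader

// Read the SRID instead of skipping it; in the GeoPackage binary
// header it is the int32 that follows the flags byte.
val srid: Int = reader.getInt()

skipEnvelope(resolvedFlags._1, reader)

val wkb = new Array[Byte](reader.remaining())
reader.get(wkb)

// Build the WKBReader on a GeometryFactory carrying the SRID, so every
// decoded geometry reports the SRID from the header.
val wkbReader = new WKBReader(new GeometryFactory(new PrecisionModel(), srid))
val geom = wkbReader.read(wkb)
```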
```scala
if (pathString.toLowerCase(Locale.ROOT).endsWith(".geopackage")) {
  val path = new Path(pathString)
  val fs = path.getFileSystem(hadoopConf)

  val isDirectory = Try(fs.getFileStatus(path).isDirectory).getOrElse(false)
  if (isDirectory) {
    pathString
  } else {
    pathString.substring(0, pathString.length - 3) + "???"
  }
}
```
If I understand it correctly, if `pathString` ends with ".geopackage" and it is not a directory, it will be transformed to "****.geopack???". I cannot grasp the idea of this transformation.
Yeah, you are right. I was sure that at this point I already had the list of files in the directory. I am wondering how it would behave if I pass a list of files, and whether it's actually needed (a user specifying paths with different file formats).
```scala
val serializableConf = new SerializableConfiguration(
  sparkSession.sessionState.newHadoopConfWithOptions(options.asScala.toMap))

val tempFile = FileSystemUtils.copyToLocal(serializableConf.value, files.head.getPath)
```
Can we detect if the path is a local path and skip calling `copyToLocal`?
yeah, sure
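One possible shape for that check (a sketch; `isLocalPath` is a hypothetical helper name, and the usage assumes the `files`/`serializableConf`/`FileSystemUtils` names from the snippet above):

```scala
import org.apache.hadoop.fs.Path

// Hypothetical helper: treat paths with no URI scheme or the "file"
// scheme as local, so the copyToLocal round-trip can be skipped.
def isLocalPath(pathString: String): Boolean =
  Option(new Path(pathString).toUri.getScheme).forall(_ == "file")

// Usage sketch:
// val localFile =
//   if (isLocalPath(files.head.getPath.toString))
//     new java.io.File(files.head.getPath.toUri)
//   else
//     FileSystemUtils.copyToLocal(serializableConf.value, files.head.getPath)
```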
@Kontinuation thanks for the review!
Force-pushed from 767b8ca to 5b88709.
spark/spark-3.3/pom.xml (outdated diff):
```xml
@@ -98,6 +97,36 @@
    <groupId>org.locationtech.jts</groupId>
    <artifactId>jts-core</artifactId>
  </dependency>
  <dependency>
    <groupId>org.testcontainers</groupId>
```
Is there a reason why these dependencies only appear in the `spark-3.3` profile?
The test for loading from S3 (I used MinIO with Testcontainers) only exists for Spark 3.3. I can duplicate it to the other Spark versions as well.
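A minimal sketch of what such a MinIO-backed S3 test setup could look like with Testcontainers; the image tag, credentials, and the use of `GenericContainer` are illustrative assumptions, not taken from the PR:

```scala
import org.testcontainers.containers.GenericContainer
import org.testcontainers.utility.DockerImageName

// Start a MinIO container acting as an S3-compatible endpoint.
val minio = new GenericContainer(DockerImageName.parse("minio/minio:latest"))
minio.withEnv("MINIO_ROOT_USER", "minioadmin")
minio.withEnv("MINIO_ROOT_PASSWORD", "minioadmin")
minio.withCommand("server", "/data")
minio.withExposedPorts(9000)
minio.start()

// Point the Hadoop S3A connector at the container.
val endpoint = s"http://${minio.getHost}:${minio.getMappedPort(9000)}"
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", endpoint)
hadoopConf.set("fs.s3a.access.key", "minioadmin")
hadoopConf.set("fs.s3a.secret.key", "minioadmin")
hadoopConf.set("fs.s3a.path.style.access", "true")
```

After uploading a `.geopackage` file into a bucket, the reader can then be exercised against an `s3a://` path.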
Please move these to the spark-common `pom.xml` so all Spark 3.X `pom.xml` files will share it.
I think dependencies with test scope do not propagate when you put them in common, and I would avoid adding them as normal dependencies, so I copied them for each Spark version.
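For illustration, this is roughly what the test-scoped dependency would look like in each per-version `pom.xml` (the version number here is an assumption, not the one used in the PR):

```xml
<dependency>
  <groupId>org.testcontainers</groupId>
  <artifactId>testcontainers</artifactId>
  <version>1.19.0</version>
  <scope>test</scope>
</dependency>
```

Maven's `test` scope is indeed not transitive, which is why declaring it once in a shared parent does not make it available to the per-version test compilations.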
Do we support the other Spark versions, 3.0, 3.1, and 3.2?
I didn't test it with those; I can add the data sources.
Yes, if the implementation differs among the versions, we should definitely replicate the tests and run them for each version.
@jiayuasu I can't wait to be able to use this for Overture Buildings Conflation. We are looking at using a couple of datasets that are packaged as gpkg files.
Force-pushed from 41d4965 to 18186b9.
```scala
  .option("showMetadata", "true")
  .load(path)

df.where("data_type = 'tiles'").show(false)
```
Please remove or comment out all `show()` calls.
I think we might add a linter for such cases; sorry for that.
Force-pushed from fc24a6e to c73aab3.
@jiayuasu removed the `show` method calls.
Resolved review comments (outdated) on:
- spark/spark-3.0/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala
- spark/spark-3.1/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala
- spark/spark-3.2/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala
- spark/spark-3.3/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala
- spark/spark-3.4/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala
- spark/spark-3.5/src/test/scala/org/apache/sedona/sql/GeoPackageReaderTest.scala
Force-pushed from 5964204 to 4468d67.
@jiayuasu I applied the changes. I also noticed that previous pipeline runs keep going even when new ones start; I think it's worth adding a concurrency mechanism in GitHub Actions.
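A typical GitHub Actions concurrency block that cancels superseded runs of the same workflow on the same ref looks like this (a generic sketch, not configuration from this repository):

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```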
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
What changes were proposed in this PR?
GeoPackage data source.
How was this patch tested?
Integration tests.
Did this PR include necessary documentation updates?