-
Notifications
You must be signed in to change notification settings - Fork 310
Fix bug on generation of avro files with nested records with the same name #249
Conversation
Codecov Report
@@ Coverage Diff @@
## master #249 +/- ##
==========================================
+ Coverage 90.46% 90.71% +0.25%
==========================================
Files 5 5
Lines 325 334 +9
Branches 51 50 -1
==========================================
+ Hits 294 303 +9
Misses 31 31 |
Ping @JoshRosen - this seems like it actually does what #73 had to accomplish. A couple of other people asked for a solution to this, so since this is a relatively common issue (and kind of a bug), this should probably be merged. P.S. This is @vlyubin from different account :) |
@gsolasab Thanks! @JoshRosen is currently on vacation. |
We're getting bitten by this too. If we (@autotraderuk) were able to contribute the missing tests, how soon could the fix be released? |
Tomorrow I will do the tests. |
@gsolasab Cool, please add test cases on file reading as well. |
// Read avro file saved on the last step | ||
val readDf = spark.read.avro(outputFolder) | ||
// Check if the written DataFrame is equals than read DataFrame | ||
assert(readDf.collect() sameElements writeDf.collect()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you change sameElements
to . sameElements(writeDf.collect())
?
Overall LGTM except one comment. |
val elementConverter = createConverterToAvro( | ||
elementType, | ||
structName, | ||
elementType match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, I think we can create a function for these four lines of code.
A function accepts elementType, recordNamespace, structName and returns the record namespace.
So that we won't have duplicated code.
I am making release today, thanks for the work in advance!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This match is also on SchemaConverters, should we create a general function for both classes then? In wich class do you think we can put it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general function In SchemaConverters
, I think.
elementType, | ||
structName, | ||
SchemaConverters.getNewRecordNamespace(elementType, recordNamespace, structName) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put this )
to the above line
https://docs.scala-lang.org/style/indentation.html#methods-with-numerous-arguments
Thanks!
Thanks, merge to master. |
… name spark-avro was failing on the generation of avro files when nested records had the same name, as can be seen on the following issue [#54](#54). For solving that problem has been added a different namespace on each record. This namespace is created with the concatenation of all the parents of each record separated by dot. Added also a new test in AvroSuite for checking the new code. This is related with pull request [#73](#73) but have been added a modification on AvroOutputWriter to solve correctly the issue. Author: gsolasab <gerard.sola.sabate@everis.com> Author: Gerard Solà <gerardsola@gmail.com> Closes #249 from gsolasab/master.
… name spark-avro was failing on the generation of avro files when nested records had the same name, as can be seen on the following issue [#54](#54). For solving that problem has been added a different namespace on each record. This namespace is created with the concatenation of all the parents of each record separated by dot. Added also a new test in AvroSuite for checking the new code. This is related with pull request [#73](#73) but have been added a modification on AvroOutputWriter to solve correctly the issue. Author: gsolasab <gerard.sola.sabate@everis.com> Author: Gerard Solà <gerardsola@gmail.com> Closes #249 from gsolasab/master. (cherry picked from commit 7619b67) Signed-off-by: Wang Gengliang <ltnwgl@gmail.com>
spark-avro was failing on the generation of avro files when nested records had the same name, as can be seen on the following issue #54.
For solving that problem has been added a different namespace on each record. This namespace is created with the concatenation of all the parents of each record separated by dot.
Added also a new test in AvroSuite for checking the new code.
This is related with pull request #73 but have been added a modification on AvroOutputWriter to solve correctly the issue.