Skip to content
This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Fix bug on generation of avro files with nested records with the same name #249

Closed
wants to merge 7 commits into from

Conversation

gsolasab
Copy link

spark-avro was failing on the generation of avro files when nested records had the same name, as can be seen on the following issue #54.
For solving that problem has been added a different namespace on each record. This namespace is created with the concatenation of all the parents of each record separated by dot.
Added also a new test in AvroSuite for checking the new code.
This is related with pull request #73 but have been added a modification on AvroOutputWriter to solve correctly the issue.

@codecov-io
Copy link

codecov-io commented Oct 12, 2017

Codecov Report

Merging #249 into master will increase coverage by 0.25%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #249      +/-   ##
==========================================
+ Coverage   90.46%   90.71%   +0.25%     
==========================================
  Files           5        5              
  Lines         325      334       +9     
  Branches       51       50       -1     
==========================================
+ Hits          294      303       +9     
  Misses         31       31

@lyubinetz
Copy link

Ping @JoshRosen - this seems like it actually does what #73 had to accomplish. A couple of other people asked for a solution to this, so since this is a relatively common issue (and kind of a bug), this should probably be merged.

P.S. This is @vlyubin from different account :)

@gengliangwang
Copy link
Contributor

gengliangwang commented Oct 23, 2017

@gsolasab Thanks! @JoshRosen is currently on vacation.
Could you also add test cases for Array/Map type?

@bloomonkey
Copy link

We're getting bitten by this too. If we (@autotraderuk) were able to contribute the missing tests, how soon could the fix be released?

@gsolasab
Copy link
Author

Tomorrow I will do the tests.

@gengliangwang
Copy link
Contributor

@gsolasab Cool, please add test cases on file reading as well.
We plan to release recently, and I think we should merge this PR.

// Read avro file saved on the last step
val readDf = spark.read.avro(outputFolder)
// Check if the written DataFrame is equals than read DataFrame
assert(readDf.collect() sameElements writeDf.collect())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you change sameElements to . sameElements(writeDf.collect()) ?

@gengliangwang
Copy link
Contributor

Overall LGTM except one comment.

val elementConverter = createConverterToAvro(
elementType,
structName,
elementType match {
Copy link
Contributor

@gengliangwang gengliangwang Oct 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I think we can create a function for these four lines of code.
A function accepts elementType, recordNamespace, structName and returns the record namespace.
So that we won't have duplicated code.
I am making release today, thanks for the work in advance!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This match is also on SchemaConverters, should we create a general function for both classes then? In wich class do you think we can put it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general function In SchemaConverters, I think.

elementType,
structName,
SchemaConverters.getNewRecordNamespace(elementType, recordNamespace, structName)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gengliangwang
Copy link
Contributor

Thanks, merge to master.

gengliangwang pushed a commit that referenced this pull request Oct 27, 2017
… name

spark-avro was failing on the generation of avro files when nested records had the same name, as can be seen on the following issue [#54](#54).
For solving that problem has been added a different namespace on each record. This namespace is created with the concatenation of all the parents of each record separated by dot.
Added also a new test in AvroSuite for checking the new code.
This is related with pull request [#73](#73) but have been added a modification on AvroOutputWriter to solve correctly the issue.

Author: gsolasab <gerard.sola.sabate@everis.com>
Author: Gerard Solà <gerardsola@gmail.com>

Closes #249 from gsolasab/master.
gengliangwang pushed a commit that referenced this pull request Oct 27, 2017
… name

spark-avro was failing on the generation of avro files when nested records had the same name, as can be seen on the following issue [#54](#54).
For solving that problem has been added a different namespace on each record. This namespace is created with the concatenation of all the parents of each record separated by dot.
Added also a new test in AvroSuite for checking the new code.
This is related with pull request [#73](#73) but have been added a modification on AvroOutputWriter to solve correctly the issue.

Author: gsolasab <gerard.sola.sabate@everis.com>
Author: Gerard Solà <gerardsola@gmail.com>

Closes #249 from gsolasab/master.

(cherry picked from commit 7619b67)
Signed-off-by: Wang Gengliang <ltnwgl@gmail.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants