Skip to content

HIVE-26954: Upgrade avro to 1.11.1 #4012

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

akshat0395
Copy link
Contributor

What changes were proposed in this pull request?

Upgrade Avro to 1.11.1
HIVE-26954

Why are the changes needed?

Keeping file formats to latest version

Does this PR introduce any user-facing change?

No

How was this patch tested?

Build + Unit tests

@Aggarwal-Raghav
Copy link
Contributor

@akshat0395, one question: Shouldn't we need to be in-sync with Hadoop (3.3.1 used in hive) on avro version even though we are already ahead. Because difference in file format version might cause problem. Because I think the hadoop jars will be loaded first in classpath which will already contain hadoop avro version (1.7.7)?

There is some discussion on same for this on PR 3878 (HIVE-26829).

@akshat0395
Copy link
Contributor Author

@akshat0395, one question: Shouldn't we need to be in-sync with Hadoop (3.3.1 used in hive) on avro version even though we are already ahead. Because difference in file format version might cause problem. Because I think the hadoop jars will be loaded first in classpath which will already contain hadoop avro version (1.7.7)?

There is some discussion on same for this on PR 3878 (HIVE-26829).
Thanks @Aggarwal-Raghav for sharing these details, I wasn't aware that there is a similar discussion going on.
You made some valid points, I think the solution to which could be extensive testing.
Right now I've run some similar tests that you have done in your upgrade, which seems to work fine to me, and after going to the discussion I feel writing some qtest and some unit test around avro APIs that are used in Hive might help confirming if the upgrade is safe or not, Till then this can be marked in progress.
Regarding compatibility with Hadoop, as we are already ahead of Hadoop and the changes seems to be backward compatible I dont think upgrade should be a problem there.
Also I do agree with @cnauroth regrading the upgrade to 1.11.1 rather than 1.11.0.
@Aggarwal-Raghav @cnauroth Please let me know what do you think about the proposed testing solution.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 3, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
No Duplication information No Duplication information

@akshat0395
Copy link
Contributor Author

++ @ayushtkn
Would be really helpful if you can share some insights here. Thanks

@ayushtkn
Copy link
Member

ayushtkn commented Feb 7, 2023

I am not sure what blocks us from moving forward, we have already diverged from hadoop, Hadoop is at 1.7.7
https://github.com/apache/hadoop/blob/branch-3.3.1/hadoop-project/pom.xml#L65-L66

And we are at 1.8.2 already. So, if anything has to break due to hadoop it should have been broken by now, but we haven't got hold of anything till now, the upgrade was done on branch-3 and branch-2 as well(HIVE-19662), so divergence b/w hadoop and hive avro version is already there.

HIVE-24324 was done to pave way for future upgrades.

The tests are green. @akshat0395 can you deploy and try some tests on an actual cluster, if things work. If we don't find any conclusive evidence like. "This breaks due to Avro upgrade", I think we can move forward.

@zabetak / @abstractdog any thoughts around this?

@zabetak
Copy link
Member

zabetak commented Feb 8, 2023

@ayushtkn I am not aware of the history but I agree with what you are saying. If there are no proofs that it breaks something then I am fine proceeding with the upgrade.

@akshat0395
Copy link
Contributor Author

Thanks @ayushtkn, @zabetak for the input. I've done some further testing and analysis for the same.

  • I've ran a Java API compatibility test between the current avro version i.e 1.8.2 used by Hive and the target version 1.11.1. The compatibility is ~ 95%, There are few new APIs added which are safe to use and doesnt have impact on hive and there are few APIs that are deprecated, The deprecated API are already replaced as part of HIVE-24324 , PR: 1621 as @ayushtkn also mentioned. In conclusion compatibility wise the upgrade seems safe
  • Manual Tests:
  • I've triggered build and ran test with the upgraded version in an cluster and it passed. This PR also passed the test
  • I've create an avro table with different datatypes with a build that have avro 1.8.2
    CREATE TABLE avro_table STORED AS AVRO TBLPROPERTIES ('avro.schema.literal'='{ "type": "record", "name": "my_record", "fields": [ {"name": "tinyint_col", "type": "int"}, {"name": "smallint_col", "type": "int"}, {"name": "int_col", "type": "int"}, {"name": "bigint_col", "type": "long"}, {"name": "float_col", "type": "float"}, {"name": "double_col", "type": "double"}, {"name": "decimal_col", "type": {"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2}}, {"name": "string_col", "type": "string"}, {"name": "varchar_col", "type": "string"}, {"name": "char_col", "type": "string"}, {"name": "binary_col", "type": "bytes"}, {"name": "boolean_col", "type": "boolean"}, {"name": "array_col", "type": {"type": "array", "items": "string"}}, {"name": "map_col", "type": {"type": "map", "values": "int"}}, {"name": "struct_col", "type": {"type": "record", "name": "my_struct", "fields": [{"name": "field1", "type": "string"}, {"name": "field2", "type": "int"}]}} ] }');

I've performed some basic Insert, Modify, and aggregation query on the current version, post that I've performed the same set of queries with the upgraded version and both seems to work without any issues.

Copy link
Member

@ayushtkn ayushtkn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zabetak
Copy link
Member

zabetak commented Feb 22, 2023

@akshat0395 Are there (enough) tests already in Hive reading and writing from Avro files/tables?

@zabetak
Copy link
Member

zabetak commented Feb 22, 2023

@akshat0395 Do we have (should we add) in the repo tests that are reading from Avro files written with old writers?

@akshat0395
Copy link
Contributor Author

@akshat0395 Are there (enough) tests already in Hive reading and writing from Avro files/tables?

There are notable number of q test present in Hive specific to avro that covers various read/write scenarios. Following is the list of q test:

  1. avro_add_column.q
  2. avro_add_column2.q
  3. avro_add_column3.q
  4. avro_add_column_extschema.q
  5. avro_alter_table_update_columns.q
  6. avro_change_schema.q
  7. avro_charvarchar.q
  8. avro_comments.q
  9. avro_compression_enabled.q
  10. avro_compression_enabled_native.q
  11. avro_date.q
  12. avro_decimal.q
  13. avro_decimal_native.q
  14. avro_decimal_old.q
  15. avro_deserialize_map_null.q
  16. avro_evolved_schemas.q
  17. avro_extschema_insert.q
  18. avro_historical_timestamp.q
  19. avro_hybrid_mixed_date.q
  20. avro_hybrid_mixed_timestamp.q
  21. avro_joins.q
  22. avro_joins_native.q
  23. avro_legacy_mixed_date.q
  24. avro_legacy_mixed_timestamp.q
  25. avro_native.q
  26. avro_nullable_fields.q
  27. avro_nullable_union.q
  28. avro_partitioned.q
  29. avro_partitioned_native.q
  30. avro_proleptic_mixed_date.q
  31. avro_proleptic_mixed_timestamp.q
  32. avro_sanity_test.q
  33. avro_schema_evolution_native.q
  34. avro_schema_literal.q
  35. avro_tableproperty_optimize.q
  36. avro_timestamp.q
  37. avro_timestamp2.q
  38. avro_type_evolution.q
  39. avro_write_legacy_timestamp.q
  40. avro_write_new_timestamp.q
  41. avrocountemptytbl.q
  42. avrotblsjoin.q

I think this cover alot of ground, Please let me know your thoughts in case we need to add more specific cases @zabetak

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Will merge soon.

@zabetak zabetak closed this in 4784df9 Feb 24, 2023
InvisibleProgrammer pushed a commit to InvisibleProgrammer/hive that referenced this pull request Feb 28, 2023
…axena, Stamatis Zampetakis, Chris Nauroth)

Co-authored-by: Raghav Aggarwal <raghavaggarwal03.ra@gmail.com>

Closes apache#3878
Closes apache#4012
kasakrisz pushed a commit to kasakrisz/hive that referenced this pull request May 4, 2023
…axena, Stamatis Zampetakis, Chris Nauroth)

Co-authored-by: Raghav Aggarwal <raghavaggarwal03.ra@gmail.com>

Closes apache#3878
Closes apache#4012
yeahyung pushed a commit to yeahyung/hive that referenced this pull request Jul 20, 2023
…axena, Stamatis Zampetakis, Chris Nauroth)

Co-authored-by: Raghav Aggarwal <raghavaggarwal03.ra@gmail.com>

Closes apache#3878
Closes apache#4012
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
…axena, Stamatis Zampetakis, Chris Nauroth)

Co-authored-by: Raghav Aggarwal <raghavaggarwal03.ra@gmail.com>

Closes apache#3878
Closes apache#4012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants