Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature](avro) Support Apache Avro file format #19990

Merged
merged 4 commits into from
Jun 28, 2023

Conversation

DongLiang-0
Copy link
Contributor

Proposed changes

#18431

Problem summary

support read avro file by hdfs() or s3() .

select * from s3(
         "uri" = "http://127.0.0.1:9312/test2/person.avro",
         "ACCESS_KEY" = "ak",
         "SECRET_KEY" = "sk",
         "FORMAT" = "avro");
+--------+--------------+-------------+-----------------+
| name   | boolean_type | double_type | long_type       |
+--------+--------------+-------------+-----------------+
| Alyssa |            1 |     10.0012 | 100000000221133 |
| Ben    |            0 |    5555.999 |      4009990000 |
| lisi   |            0 | 5992225.999 |      9099933330 |
+--------+--------------+-------------+-----------------+

select * from hdfs(
                "uri" = "hdfs://127.0.0.1:9000/input/person2.avro",
                "fs.defaultFS" = "hdfs://127.0.0.1:9000",
                "hadoop.username" = "doris",
                "format" = "avro");
+--------+--------------+-------------+-----------+
| name   | boolean_type | double_type | long_type |
+--------+--------------+-------------+-----------+
| Alyssa |            1 |  8888.99999 |  89898989 |
+--------+--------------+-------------+-----------+

current avro reader only support common data type, the complex data types will be supported later.

Checklist(Required)

  • Does it affect the original behavior
  • Has unit tests been added
  • Has document been added or modified
  • Does it need to update dependencies
  • Is this PR support rollback (If NO, please explain WHY)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@github-actions github-actions bot added area/vectorization kind/docs Categorizes issue or PR as related to documentation. labels May 24, 2023
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@morningman morningman self-assigned this May 24, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Jun 6, 2023

clang-tidy review says "All clean, LGTM! 👍"

2 similar comments
@github-actions
Copy link
Contributor

github-actions bot commented Jun 7, 2023

clang-tidy review says "All clean, LGTM! 👍"

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@DongLiang-0 DongLiang-0 marked this pull request as draft June 14, 2023 05:32
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@DongLiang-0 DongLiang-0 marked this pull request as ready for review June 14, 2023 05:45
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@DongLiang-0
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

morningman
morningman previously approved these changes Jun 16, 2023
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jun 16, 2023
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@DongLiang-0
Copy link
Contributor Author

run buildall

@@ -62,6 +62,21 @@ JniConnector::~JniConnector() {
}
}

Status JniConnector::open() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call open(nullptr, nullptr) directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion, I have modified here

* Use configuration map to provide scan information. The java side should determine how the parameters are parsed.
* @param connector_class Java scanner class
* @param scanner_params Provided configuration map
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment to describe the difference between the two constructors.

JniConnector(std::string connector_class, std::map<std::string, std::string> scanner_params)
: _connector_class(std::move(connector_class)),
_scanner_params(std::move(scanner_params)) {
_is_table_schema = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should add identifiers on the Java side and decide how to clean up memory on the Java side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java memory is used here, not direct memory, which is managed by JVM

* The schema information are stored as a string.
* Use # between column names and column types.
*
* like: col_name1,col_name2,col_name3#col_type1,col_type2.col_type3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an issue with the concatenation method of this string, as type may contain commas, such as the decimal type. You can directly return Map<String, String> type, as shown in the process of passing map types as parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the support of complex types, I have rebuild this piece. Like array, map types are nested.

#include "runtime/types.h"

namespace doris::vectorized {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recommendation: AvroReader -> AvroJniReader

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestion, I have modified here.


@Override
public LocalDate getDate() {
return LocalDate.now();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LocalDate.now() is only used mock jni reader, please check this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it can be changed to return null here?
Avro does not have date type,The code will not run to here either.
what you think?

@@ -78,15 +78,16 @@ under the License.
</exclusions>
</dependency>
<dependency>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change the hadoop dependencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there is a conflict between hudi and avro dependencies

@@ -132,6 +133,11 @@ public int getNext() throws IOException {
}
}

@Override
public TableSchema parseTableSchema() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw UnsupportedOperationException error

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -41,6 +42,9 @@ public abstract class JniScanner {
// Scan data and save as vector table
protected abstract int getNext() throws IOException;

// parse table schema
public abstract TableSchema parseTableSchema() throws IOException;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a protected method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

public class TableSchema {

private final String[] fields;
private final TPrimitiveType[] schemaTypes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column type is defined in org.apache.doris.common.jni.vec.ColumnType, maybe we should use ColumnType, not TPrimitiveType. Because ColumnType has provide the parseType method, and get type is an inverse method.

@DongLiang-0
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@DongLiang-0
Copy link
Contributor Author

run buildall

@DongLiang-0
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@DongLiang-0
Copy link
Contributor Author

run buildall

@@ -258,6 +258,9 @@ public static ColumnType parseType(String columnName, String hiveType) {
case "largeint":
type = Type.LARGEINT;
break;
case "long":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should generate hive compatible types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* like: col_name1,col_name2,col_name3#col_type1,col_type2.col_type3
*
*/
Status get_table_schema(std::string& table_schema_str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Map<string, string> get_table_schema();
field_name -> array<struct<col1: int, col2: string>>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I use the nested format of json.
You can see TableSchema.SchemaColumn

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

1 similar comment
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@DongLiang-0
Copy link
Contributor Author

run buildall

@AshinGau
Copy link
Member

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jun 27, 2023
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@DongLiang-0
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@DongLiang-0
Copy link
Contributor Author

run buildall

1 similar comment
@DongLiang-0
Copy link
Contributor Author

run buildall

Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@morningman morningman merged commit a6b51ec into apache:master Jun 28, 2023
morningman pushed a commit to morningman/doris that referenced this pull request Jul 3, 2023
support read avro file by hdfs() or s3() .
```sql
select * from s3(
         "uri" = "http://127.0.0.1:9312/test2/person.avro",
         "ACCESS_KEY" = "ak",
         "SECRET_KEY" = "sk",
         "FORMAT" = "avro");
+--------+--------------+-------------+-----------------+
| name   | boolean_type | double_type | long_type       |
+--------+--------------+-------------+-----------------+
| Alyssa |            1 |     10.0012 | 100000000221133 |
| Ben    |            0 |    5555.999 |      4009990000 |
| lisi   |            0 | 5992225.999 |      9099933330 |
+--------+--------------+-------------+-----------------+

select * from hdfs(
                "uri" = "hdfs://127.0.0.1:9000/input/person2.avro",
                "fs.defaultFS" = "hdfs://127.0.0.1:9000",
                "hadoop.username" = "doris",
                "format" = "avro");
+--------+--------------+-------------+-----------+
| name   | boolean_type | double_type | long_type |
+--------+--------------+-------------+-----------+
| Alyssa |            1 |  8888.99999 |  89898989 |
+--------+--------------+-------------+-----------+
```

current avro reader only support common data type, the complex data types will be supported later.
morningman pushed a commit that referenced this pull request Jul 4, 2023
support read avro file by hdfs() or s3() .
```sql
select * from s3(
         "uri" = "http://127.0.0.1:9312/test2/person.avro",
         "ACCESS_KEY" = "ak",
         "SECRET_KEY" = "sk",
         "FORMAT" = "avro");
+--------+--------------+-------------+-----------------+
| name   | boolean_type | double_type | long_type       |
+--------+--------------+-------------+-----------------+
| Alyssa |            1 |     10.0012 | 100000000221133 |
| Ben    |            0 |    5555.999 |      4009990000 |
| lisi   |            0 | 5992225.999 |      9099933330 |
+--------+--------------+-------------+-----------------+

select * from hdfs(
                "uri" = "hdfs://127.0.0.1:9000/input/person2.avro",
                "fs.defaultFS" = "hdfs://127.0.0.1:9000",
                "hadoop.username" = "doris",
                "format" = "avro");
+--------+--------------+-------------+-----------+
| name   | boolean_type | double_type | long_type |
+--------+--------------+-------------+-----------+
| Alyssa |            1 |  8888.99999 |  89898989 |
+--------+--------------+-------------+-----------+
```

current avro reader only support common data type, the complex data types will be supported later.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. area/vectorization dev/2.0.0-merged kind/docs Categorizes issue or PR as related to documentation. reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants