-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature](avro) Support Apache Avro file format #19990
Conversation
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
2 similar comments
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR approved by at least one committer and no changes requested. |
PR approved by anyone and no changes requested. |
run buildall |
be/src/vec/exec/jni_connector.cpp
Outdated
@@ -62,6 +62,21 @@ JniConnector::~JniConnector() { | |||
} | |||
} | |||
|
|||
Status JniConnector::open() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call open(nullptr, nullptr)
directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion, I have modified here
be/src/vec/exec/jni_connector.h
Outdated
* Use configuration map to provide scan information. The java side should determine how the parameters are parsed. | ||
* @param connector_class Java scanner class | ||
* @param scanner_params Provided configuration map | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment to describe the difference between the two constructors.
JniConnector(std::string connector_class, std::map<std::string, std::string> scanner_params) | ||
: _connector_class(std::move(connector_class)), | ||
_scanner_params(std::move(scanner_params)) { | ||
_is_table_schema = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should add identifiers on the Java side and decide how to clean up memory on the Java side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java memory is used here, not direct memory, which is managed by JVM
* The schema information are stored as a string. | ||
* Use # between column names and column types. | ||
* | ||
* like: col_name1,col_name2,col_name3#col_type1,col_type2.col_type3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an issue with the concatenation method of this string, as type may contain commas, such as the decimal type. You can directly return Map<String, String> type, as shown in the process of passing map types as parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the support of complex types, I have rebuild this piece. Like array, map types are nested.
be/src/vec/exec/scan/avro_reader.cpp
Outdated
#include "runtime/types.h" | ||
|
||
namespace doris::vectorized { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommendation: AvroReader
-> AvroJniReader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion, I have modified here.
|
||
@Override | ||
public LocalDate getDate() { | ||
return LocalDate.now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LocalDate.now()
is only used mock jni reader, please check this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it can be changed to return null here?
Avro does not have date type,The code will not run to here either.
what you think?
@@ -78,15 +78,16 @@ under the License. | |||
</exclusions> | |||
</dependency> | |||
<dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change the hadoop dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because there is a conflict between hudi and avro dependencies
@@ -132,6 +133,11 @@ public int getNext() throws IOException { | |||
} | |||
} | |||
|
|||
@Override | |||
public TableSchema parseTableSchema() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throw UnsupportedOperationException
error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -41,6 +42,9 @@ public abstract class JniScanner { | |||
// Scan data and save as vector table | |||
protected abstract int getNext() throws IOException; | |||
|
|||
// parse table schema | |||
public abstract TableSchema parseTableSchema() throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a protected method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
public class TableSchema { | ||
|
||
private final String[] fields; | ||
private final TPrimitiveType[] schemaTypes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Column type is defined in org.apache.doris.common.jni.vec.ColumnType
, maybe we should use ColumnType
, not TPrimitiveType
. Because ColumnType
has provide the parseType
method, and get type is an inverse method.
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
run buildall |
abd1fb8
to
8ba4ae7
Compare
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
run buildall |
@@ -258,6 +258,9 @@ public static ColumnType parseType(String columnName, String hiveType) { | |||
case "largeint": | |||
type = Type.LARGEINT; | |||
break; | |||
case "long": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should generate hive compatible types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* like: col_name1,col_name2,col_name3#col_type1,col_type2.col_type3 | ||
* | ||
*/ | ||
Status get_table_schema(std::string& table_schema_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Map<string, string> get_table_schema();
field_name
-> array<struct<col1: int, col2: string>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I use the nested format of json.
You can see TableSchema.SchemaColumn
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
run buildall |
f6b215c
to
a36d011
Compare
LGTM |
PR approved by at least one committer and no changes requested. |
run buildall |
clang-tidy review says "All clean, LGTM! 👍" |
run buildall |
1 similar comment
run buildall |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
support read avro file by hdfs() or s3() . ```sql select * from s3( "uri" = "http://127.0.0.1:9312/test2/person.avro", "ACCESS_KEY" = "ak", "SECRET_KEY" = "sk", "FORMAT" = "avro"); +--------+--------------+-------------+-----------------+ | name | boolean_type | double_type | long_type | +--------+--------------+-------------+-----------------+ | Alyssa | 1 | 10.0012 | 100000000221133 | | Ben | 0 | 5555.999 | 4009990000 | | lisi | 0 | 5992225.999 | 9099933330 | +--------+--------------+-------------+-----------------+ select * from hdfs( "uri" = "hdfs://127.0.0.1:9000/input/person2.avro", "fs.defaultFS" = "hdfs://127.0.0.1:9000", "hadoop.username" = "doris", "format" = "avro"); +--------+--------------+-------------+-----------+ | name | boolean_type | double_type | long_type | +--------+--------------+-------------+-----------+ | Alyssa | 1 | 8888.99999 | 89898989 | +--------+--------------+-------------+-----------+ ``` current avro reader only support common data type, the complex data types will be supported later.
support read avro file by hdfs() or s3() . ```sql select * from s3( "uri" = "http://127.0.0.1:9312/test2/person.avro", "ACCESS_KEY" = "ak", "SECRET_KEY" = "sk", "FORMAT" = "avro"); +--------+--------------+-------------+-----------------+ | name | boolean_type | double_type | long_type | +--------+--------------+-------------+-----------------+ | Alyssa | 1 | 10.0012 | 100000000221133 | | Ben | 0 | 5555.999 | 4009990000 | | lisi | 0 | 5992225.999 | 9099933330 | +--------+--------------+-------------+-----------------+ select * from hdfs( "uri" = "hdfs://127.0.0.1:9000/input/person2.avro", "fs.defaultFS" = "hdfs://127.0.0.1:9000", "hadoop.username" = "doris", "format" = "avro"); +--------+--------------+-------------+-----------+ | name | boolean_type | double_type | long_type | +--------+--------------+-------------+-----------+ | Alyssa | 1 | 8888.99999 | 89898989 | +--------+--------------+-------------+-----------+ ``` current avro reader only support common data type, the complex data types will be supported later.
Proposed changes
#18431
Problem summary
support read avro file by hdfs() or s3() .
current avro reader only support common data type, the complex data types will be supported later.
Checklist(Required)
Further comments
If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...