
[Spark Load][Bug] Keep the column splitting in spark load consistent with broker load / mini load #4532

Merged (4 commits) Sep 6, 2020

Conversation

xy720
Member

@xy720 xy720 commented Sep 3, 2020

Proposed changes

There is a 4-column source data file:

1|1|jim|2|
2|1|grace|2|
3|2|tom|2|
4|3|bush|3|
5|3|helen|3|
5|3|helen|6|
6|3|helen|3|
6|3|helen|3|
...

Given the same column terminator '|', broker load determines that there are 5 columns (the trailing '|' yields an empty fifth column), while spark load determines that there are 4.

And there is another 4-column source, without the trailing terminator:

1|1|jim|2
2|1|grace|2
3|2|tom|2
4|3|bush|3
5|3|helen|3
5|3|helen|6
6|3|helen|3
6|3|helen|3
...

Given the same column terminator '|', both broker load and spark load determine that there are 4 columns.
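The discrepancy can be sketched outside Doris. Python's str.split keeps leading and trailing empty fields (matching how broker load counts columns), while dropping trailing empty fields, as Java's String.split does by default, reproduces the column count spark load was reporting. This is an illustration only; the two function names are ours, not Doris code:

```python
def broker_style_split(line, sep="|"):
    # str.split keeps leading/trailing empty fields, so every separator
    # marks a boundary between two fields -- like broker load's counting.
    return line.split(sep)

def spark_style_split(line, sep="|"):
    # Drop trailing empty fields, mimicking Java's String.split default
    # behavior -- roughly what spark load was effectively doing.
    fields = line.split(sep)
    while fields and fields[-1] == "":
        fields.pop()
    return fields

print(len(broker_style_split("1|1|jim|2|")))  # 5 columns
print(len(spark_style_split("1|1|jim|2|")))   # 4 columns
print(len(broker_style_split("1|1|jim|2")))   # 4 columns
print(len(spark_style_split("1|1|jim|2")))    # 4 columns
```

With the trailing '|' the two splitters disagree (5 vs. 4 fields); without it they agree, which matches the two samples above.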

To Reproduce
Steps to reproduce the behavior:

  1. Submit a broker load.
load label ssb_db.broker_load_label 
( 
    data infile ("hdfs://ymy-host:port/user/palo/table1") 
    into table test_tbl 
    COLUMNS TERMINATED BY "|" 
    (k1,k2,name,clicks ) 
) 
with broker "doris" ("username"  =  "test", "password"  =  "test");
  2. Submit a spark load.
load label ssb_db.spark_load_label 
( data infile ("hdfs://ymy-host:port/user/palo/table1") 
    into table test_tbl 
    COLUMNS TERMINATED BY "|" 
    (k1,k2,name,clicks ) 
) with resource "spark0" 
("spark.executor.memory"  =  "24g", "spark.executor.cores"  =  "2", "spark.executor.instances"  =  "8");
  3. Broker load will report the error "quality not good enough to cancel".

The reason for this bug
This is because the first and last characters of a line are not considered to be delimiters in spark load, so a leading or trailing separator does not produce an empty field.
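A minimal sketch of the intended behavior, in Python rather than the PR's Java: every occurrence of the separator, including one at the very start or end of the line, marks a field boundary, so a trailing separator produces an empty last field, consistent with broker load / mini load. The name split_line mirrors the Java method in the diff below, but the body is our own illustration:

```python
def split_line(line, sep="|"):
    # Treat every separator, including one at position 0 or at the end
    # of the line, as a boundary between two fields, so "a|b|" yields
    # ["a", "b", ""] -- consistent with broker load / mini load.
    if line == "":
        return []  # an empty line has no fields at all
    fields = []
    start = 0
    for i, ch in enumerate(line):
        if ch == sep:
            fields.append(line[start:i])
            start = i + 1
    fields.append(line[start:])  # text after the last separator (may be "")
    return fields
```

For example, split_line("1|1|jim|2|") returns five fields, the last of them empty, in agreement with broker load.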

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

@morningman morningman added area/spark-load Issues or PRs related to the spark load kind/fix Categorizes issue or PR as related to a bug. labels Sep 3, 2020
@@ -640,6 +642,22 @@ private StructType createScrSchema(List<String> srcColumns) {
return srcSchema;
}

// This method is to keep the splitting consistent with broker load / mini load
private String[] splitLine(String line, char sep) {
Contributor


If line is an empty string, this method should return an empty string array.
But here you will return a string array with one empty string in it.
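For reference, the pitfall the reviewer points out exists in Python as well: splitting an empty string yields a one-element list containing an empty string, not an empty list, so the empty line has to be special-cased (split_or_empty is a hypothetical helper, not the PR's code):

```python
# Splitting an empty string does NOT give an empty list:
assert "".split("|") == [""]  # one empty field, not zero fields

# A splitter that should return no fields for an empty line
# must check for it explicitly:
def split_or_empty(line, sep="|"):
    return [] if line == "" else line.split(sep)
```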

Member Author


fixed it

Contributor

@morningman morningman left a comment


LGTM

@morningman morningman added the approved Indicates a PR has been approved by one committer. label Sep 5, 2020
@morningman morningman merged commit aae942b into apache:master Sep 6, 2020
morningman pushed a commit that referenced this pull request Sep 7, 2020
@PasunuriSrinidhi

import csv

with open('input_file.csv') as f:
    reader = csv.reader(f, delimiter='|')
    for row in reader:
        # Handle case where first column has extra character
        if row[0][0] != '1':
            row[0] = row[0][1:]
        # Handle case where last column has extra character
        if row[-1][-1] != '3':
            row[-1] = row[-1][:-1]
        # Process the row as usual
        # ...

In this example, I have used the csv module to read the input data and split it into columns using the | delimiter. Then I check the first and last columns of each row for extra characters and remove them if necessary.

Successfully merging this pull request may close these issues.

[Spark Load][Bug] The number of columns in broker load and spark load is different
3 participants