
[Spark Load][Bug] Keep the column splitting in spark load consistent with broker load / mini load #4532

Merged (4 commits) Sep 6, 2020

Conversation

xy720
Member

@xy720 xy720 commented Sep 3, 2020

Proposed changes

There is a 4-column source data file:

1|1|jim|2|
2|1|grace|2|
3|2|tom|2|
4|3|bush|3|
5|3|helen|3|
5|3|helen|6|
6|3|helen|3|
6|3|helen|3|
...

Given the same column terminator '|', broker load determines that there are 5 columns (the trailing '|' yields an empty fifth column), while spark load determines that there are 4.

And there is another 4-column source, without the trailing terminator:

1|1|jim|2
2|1|grace|2
3|2|tom|2
4|3|bush|3
5|3|helen|3
5|3|helen|6
6|3|helen|3
6|3|helen|3
...

Given the same column terminator '|', both broker load and spark load determine that there are 4 columns.
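The discrepancy can be sketched outside Doris. Python's str.split keeps leading and trailing empty fields (matching how broker load counts columns), while dropping trailing empty fields, as Java's String.split does by default, reproduces the column count spark load was reporting. This is an illustration only; the two function names are ours, not Doris code:

```python
def broker_style_split(line, sep="|"):
    # str.split keeps leading/trailing empty fields, so every separator
    # marks a boundary between two fields -- like broker load's counting.
    return line.split(sep)

def spark_style_split(line, sep="|"):
    # Drop trailing empty fields, mimicking Java's String.split default
    # behavior -- roughly what spark load was effectively doing.
    fields = line.split(sep)
    while fields and fields[-1] == "":
        fields.pop()
    return fields

print(len(broker_style_split("1|1|jim|2|")))  # 5 columns
print(len(spark_style_split("1|1|jim|2|")))   # 4 columns
print(len(broker_style_split("1|1|jim|2")))   # 4 columns
print(len(spark_style_split("1|1|jim|2")))    # 4 columns
```

With the trailing '|' the two splitters disagree (5 vs. 4 fields); without it they agree, which matches the two samples above.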

To Reproduce
Steps to reproduce the behavior:

  1. Submit a broker load.
load label ssb_db.broker_load_label 
( 
    data infile ("hdfs://ymy-host:port/user/palo/table1") 
    into table test_tbl 
    COLUMNS TERMINATED BY "|" 
    (k1,k2,name,clicks ) 
) 
with broker "doris" ("username"  =  "test", "password"  =  "test");
  2. Submit a spark load.
load label ssb_db.spark_load_label 
( data infile ("hdfs://ymy-host:port/user/palo/table1") 
    into table test_tbl 
    COLUMNS TERMINATED BY "|" 
    (k1,k2,name,clicks ) 
) with resource "spark0" 
("spark.executor.memory"  =  "24g", "spark.executor.cores"  =  "2", "spark.executor.instances"  =  "8");
  3. Broker load will report the error "quality not good enough to cancel".

The reason for this bug
This is because the first and last characters of a line are not considered to be delimiters in spark load, so a leading or trailing separator does not produce an empty field.
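A minimal sketch of the intended behavior, in Python rather than the PR's Java: every occurrence of the separator, including one at the very start or end of the line, marks a field boundary, so a trailing separator produces an empty last field, consistent with broker load / mini load. The name split_line mirrors the Java method in the diff below, but the body is our own illustration:

```python
def split_line(line, sep="|"):
    # Treat every separator, including one at position 0 or at the end
    # of the line, as a boundary between two fields, so "a|b|" yields
    # ["a", "b", ""] -- consistent with broker load / mini load.
    if line == "":
        return []  # an empty line has no fields at all
    fields = []
    start = 0
    for i, ch in enumerate(line):
        if ch == sep:
            fields.append(line[start:i])
            start = i + 1
    fields.append(line[start:])  # text after the last separator (may be "")
    return fields
```

For example, split_line("1|1|jim|2|") returns five fields, the last of them empty, in agreement with broker load.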

Types of changes

What types of changes does your code introduce to Doris?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

@morningman morningman added area/spark-load Issues or PRs related to the spark load kind/fix Categorizes issue or PR as related to a bug. labels Sep 3, 2020
@@ -640,6 +642,22 @@ private StructType createScrSchema(List<String> srcColumns) {
return srcSchema;
}

// This method is to keep the splitting consistent with broker load / mini load
private String[] splitLine(String line, char sep) {
Contributor


If line is an empty string, this method should return an empty string array.
But here you will return a string array with one empty string in it.
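For reference, the pitfall the reviewer points out exists in Python as well: splitting an empty string yields a one-element list containing an empty string, not an empty list, so the empty line has to be special-cased (split_or_empty is a hypothetical helper, not the PR's code):

```python
# Splitting an empty string does NOT give an empty list:
assert "".split("|") == [""]  # one empty field, not zero fields

# A splitter that should return no fields for an empty line
# must check for it explicitly:
def split_or_empty(line, sep="|"):
    return [] if line == "" else line.split(sep)
```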

Member Author


fixed it

Contributor

@morningman morningman left a comment


LGTM

@morningman morningman added the approved Indicates a PR has been approved by one committer. label Sep 5, 2020
@morningman morningman merged commit aae942b into apache:master Sep 6, 2020
morningman pushed a commit that referenced this pull request Sep 7, 2020
@PasunuriSrinidhi

import csv

with open('input_file.csv') as f:
    reader = csv.reader(f, delimiter='|')
    for row in reader:
        # Handle case where first column has extra character
        if row[0][0] != '1':
            row[0] = row[0][1:]
        # Handle case where last column has extra character
        if row[-1][-1] != '3':
            row[-1] = row[-1][:-1]
        # Process the row as usual
        # ...

In this example, I have used the csv module to read the input data and split it into columns using the | delimiter. Then I check the first and last columns of each row for extra characters and remove them if necessary.

Successfully merging this pull request may close these issues.

[Spark Load][Bug] The number of columns in broker load and spark load is different
3 participants