
[CARBONDATA-622]unify file header reader #518

Closed
wants to merge 2 commits into from

Conversation

@QiangCai (Contributor) commented Jan 10, 2017

Scenario:
The file header can come either from the DDL command or from the CSV file itself:

  1. If the file header comes from the DDL command, it should be separated by the comma ",".
  2. If the file header comes from the CSV file, it should be separated by the delimiter specified in the DDL command.

Changes:
Before data loading, the file header is generated first, so that both dictionary generation and data loading can use it directly (excluding the Kettle flow).
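The two cases above can be sketched in plain Java (a hypothetical standalone helper for illustration, not the actual CarbonData code; the real entry point is `getCsvHeaderColumns` in `CommonUtil`):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class CsvHeaderReader {

    // Sketch of the two cases described above (hypothetical helper):
    // a header supplied via the DDL command is always comma-separated,
    // while a header taken from the CSV file is split by the delimiter
    // specified in the DDL command.
    public static String[] getCsvHeaderColumns(String ddlHeader,
            String csvFilePath, String delimiter) throws IOException {
        if (ddlHeader != null && !ddlHeader.isEmpty()) {
            // Case 1: header from the DDL command, separated by comma ","
            return ddlHeader.split(",");
        }
        // Case 2: header from the first line of the CSV file,
        // separated by the delimiter given in the DDL command
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get(csvFilePath))) {
            return reader.readLine().split(Pattern.quote(delimiter));
        }
    }
}
```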

@CarbonDataQA

Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/542/

@CarbonDataQA

Build Failed with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/543/

@CarbonDataQA

Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/544/

@QiangCai QiangCai changed the title [WIP]unify file header reader [CARBONDATA-622]unify file header reader Jan 11, 2017
@@ -301,4 +304,45 @@ object CommonUtil {
LOGGER.info(s"mapreduce.input.fileinputformat.split.maxsize: ${ newSplitSize.toString }")
}
}

def getCsvHeaderColumns(carbonLoadModel: CarbonLoadModel): Array[String] = {
val delimiter = if (StringUtils.isEmpty(carbonLoadModel.getCsvDelimiter)) {
Contributor:

I think the delimiter cannot be " ", right? So it is better to use isBlank instead of isEmpty.

@QiangCai (Contributor, author) replied Jan 11, 2017:

I think the delimiter may be a blank " ", just like the comma ",". The customer should provide the proper delimiter in the DDL command (the default value is the comma ",").
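The disagreement hinges on how the two checks treat a single space. Plain-Java equivalents of Commons Lang's StringUtils.isEmpty and StringUtils.isBlank (simplified stand-ins, not the library code) make the difference concrete:

```java
public class DelimiterChecks {

    // Simplified stand-in for StringUtils.isEmpty: true only for
    // null or the zero-length string.
    static boolean isEmpty(String s) {
        return s == null || s.isEmpty();
    }

    // Simplified stand-in for StringUtils.isBlank: also true for
    // whitespace-only strings such as a single space " ".
    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }
}
```

With these definitions, isBlank(" ") is true while isEmpty(" ") is false, so switching to isBlank would reject a blank-space delimiter, which is exactly the case the author wants to keep supporting.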

+ "not the same.")
} else {
LOGGER.error(
"CSV File provided is not proper. Column names in schema and csv header are not same. "
Contributor:

Better to say "CSV header in the input file ($csvFile) is not proper."

Contributor (author):

fixed

CarbonDataLoadSchema schema, String delimiter) throws DataLoadingException {
delimiter = CarbonUtil.delimiterConverter(delimiter);
public static boolean isHeaderValid(String tableName, String[] csvHeader,
CarbonDataLoadSchema schema) throws DataLoadingException {
Contributor:

I think DataLoadingException can be removed; it is not thrown by the method body.

Contributor:

In this function you basically want to compare two String arrays to find out whether they are the same, case-insensitively. Take a look at http://stackoverflow.com/questions/2419061/compare-string-array-using-collection. According to this link, using a TreeSet is optimal in this case.
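A minimal sketch of the suggested approach, using a TreeSet with String.CASE_INSENSITIVE_ORDER (illustrative only; the class and method names here are made up):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class HeaderCompare {

    // Compares two string arrays case-insensitively, ignoring order
    // and duplicates, via TreeSets with a case-insensitive comparator.
    public static boolean sameColumns(String[] a, String[] b) {
        Set<String> left = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        Set<String> right = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        left.addAll(Arrays.asList(a));
        right.addAll(Arrays.asList(b));
        // TreeSet membership uses the comparator, so "ID" matches "id"
        return left.equals(right);
    }
}
```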

Contributor (author):

fixed

@@ -462,6 +389,13 @@ public static boolean isHeaderValid(String tableName, String header,
return count == columnNames.length;
}

public static boolean isHeaderValid(String tableName, String header,
CarbonDataLoadSchema schema, String delimiter) throws DataLoadingException {
delimiter = CarbonUtil.delimiterConverter(delimiter);
Contributor:

Declare a local variable instead of reassigning the parameter.
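The point of the comment: instead of overwriting the delimiter parameter in place, keep it untouched and give the converted value its own name. A sketch (delimiterConverter here is a hypothetical stand-in for CarbonUtil.delimiterConverter, simplified to regex-quoting):

```java
import java.util.regex.Pattern;

public class DelimiterLocal {

    // Stand-in for CarbonUtil.delimiterConverter, simplified here to
    // escaping regex metacharacters such as "|" (hypothetical).
    static String delimiterConverter(String delimiter) {
        return Pattern.quote(delimiter);
    }

    // The reviewer's suggestion: the parameter stays as passed in, and
    // the converted form lives in a clearly named local variable.
    public static String[] splitHeader(String header, String delimiter) {
        String convertedDelimiter = delimiterConverter(delimiter);
        return header.split(convertedDelimiter);
    }
}
```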

Contributor (author):

fixed

@CarbonDataQA

Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/547/

@CarbonDataQA

Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/549/

CarbonDataLoadSchema schema) {
Iterator<String> columnIterator =
CarbonDataProcessorUtil.getSchemaColumnNames(schema, tableName).iterator();
Set<String> csvColumns = new HashSet<String>(Arrays.asList(csvHeader));
@jackylk (Contributor) Jan 11, 2017:

You can use Collections.addAll instead of converting to a list and adding.
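The suggestion replaces the Arrays.asList detour with a single call; a minimal sketch (hypothetical wrapper class):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class AddAllExample {

    // Collections.addAll copies the array straight into the set,
    // skipping the intermediate List that Arrays.asList would create.
    public static Set<String> toSet(String[] csvHeader) {
        Set<String> csvColumns = new HashSet<>();
        Collections.addAll(csvColumns, csvHeader);
        return csvColumns;
    }
}
```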

@CarbonDataQA

Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/550/

throw new DataLoadingException("Not able to read CSV input File ", e);
} finally {
CarbonUtil.closeStreams(fileReader, bufferedReader);
while (columnIterator.hasNext()) {
Contributor:

Please add a comment to describe this logic: the column definitions in the schema should be a subset of the input CSV header.
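The logic being documented, as a standalone sketch (hypothetical names; the real code walks an iterator of schema column names from CarbonDataProcessorUtil):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class HeaderSubsetCheck {

    // Every column defined in the schema must appear in the CSV header
    // (schema columns are a subset of the CSV header); the CSV file may
    // carry extra columns without failing the check.
    public static boolean schemaIsSubsetOfCsv(String[] schemaColumns,
            String[] csvHeader) {
        Set<String> csvColumns = new HashSet<>(Arrays.asList(csvHeader));
        for (String column : schemaColumns) {
            if (!csvColumns.contains(column)) {
                return false;
            }
        }
        return true;
    }
}
```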

@CarbonDataQA

Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/552/

@jackylk (Contributor) commented Jan 11, 2017:

LGTM

@asfgit asfgit closed this in 3003360 Jan 11, 2017
@QiangCai QiangCai deleted the fileheader branch May 12, 2017 01:54
Beyyes pushed a commit to Beyyes/carbondata that referenced this pull request Jul 12, 2018