[CARBONDATA-622] Unify file header reader #518
Conversation
Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/542/
Build Failed with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/543/
Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/544/
@@ -301,4 +304,45 @@ object CommonUtil {
LOGGER.info(s"mapreduce.input.fileinputformat.split.maxsize: ${ newSplitSize.toString }")
}
}

def getCsvHeaderColumns(carbonLoadModel: CarbonLoadModel): Array[String] = {
val delimiter = if (StringUtils.isEmpty(carbonLoadModel.getCsvDelimiter)) {
I think the delimiter cannot be " ", right? So it is better to use isBlank instead of isEmpty.
I think the delimiter may be a blank " ", just like the comma ",".
The customer should provide the proper delimiter in the DDL command (the default value is the comma ",").
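For reference, the difference between the two checks being discussed can be sketched with plain-Java equivalents of the Commons Lang `isEmpty`/`isBlank` methods (simplified stand-ins, not the library implementations):

```java
public class DelimiterCheck {
    // Equivalent of StringUtils.isEmpty: null or zero-length only.
    static boolean isEmpty(String s) {
        return s == null || s.length() == 0;
    }

    // Equivalent of StringUtils.isBlank: null, zero-length, or whitespace-only.
    static boolean isBlank(String s) {
        return s == null || s.trim().length() == 0;
    }

    public static void main(String[] args) {
        // A single space passes isEmpty but fails isBlank,
        // which is exactly the case debated for the delimiter.
        System.out.println(isEmpty(" ")); // false
        System.out.println(isBlank(" ")); // true
    }
}
```

So keeping `isEmpty` leaves " " usable as a delimiter, while switching to `isBlank` would reject it.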
+ "not the same.")
} else {
LOGGER.error(
"CSV File provided is not proper. Column names in schema and csv header are not same. "
Better to say "CSV header in the input file ($csvFile) is not proper."
fixed
CarbonDataLoadSchema schema, String delimiter) throws DataLoadingException {
delimiter = CarbonUtil.delimiterConverter(delimiter);
public static boolean isHeaderValid(String tableName, String[] csvHeader,
CarbonDataLoadSchema schema) throws DataLoadingException {
I think DataLoadingException can be removed; it is not thrown by the method body.
In this function you basically want to compare two String arrays to find out whether they are the same, case-insensitively.
Take a look at http://stackoverflow.com/questions/2419061/compare-string-array-using-collection
According to that link, using a TreeSet is optimal in this case.
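The suggested TreeSet approach can be sketched as follows (a minimal illustration, not the actual CarbonData implementation):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class HeaderCompare {
    // Compare two string arrays for equality, ignoring case and order,
    // by loading both into TreeSets with a case-insensitive comparator.
    static boolean sameColumns(String[] a, String[] b) {
        Set<String> left = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
        left.addAll(Arrays.asList(a));
        Set<String> right = new TreeSet<String>(String.CASE_INSENSITIVE_ORDER);
        right.addAll(Arrays.asList(b));
        // TreeSet.equals uses the comparator via contains(), so the
        // comparison itself is case-insensitive.
        return left.equals(right);
    }

    public static void main(String[] args) {
        System.out.println(sameColumns(
            new String[]{"ID", "Name"},
            new String[]{"name", "id"})); // true
    }
}
```

Note that a set-based comparison also ignores duplicate column names; if duplicates must be rejected, a size check against the original arrays would be needed as well.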
fixed
@@ -462,6 +389,13 @@ public static boolean isHeaderValid(String tableName, String header,
return count == columnNames.length;
}

public static boolean isHeaderValid(String tableName, String header,
CarbonDataLoadSchema schema, String delimiter) throws DataLoadingException {
delimiter = CarbonUtil.delimiterConverter(delimiter);
Declare a local variable instead of reassigning the parameter.
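The suggestion can be illustrated with a small sketch (the converter here is a hypothetical stand-in for CarbonUtil.delimiterConverter, which is assumed to escape the delimiter for use as a regex):

```java
public class DelimiterExample {
    // Hypothetical stand-in for CarbonUtil.delimiterConverter:
    // escapes the delimiter so it can be used with String.split.
    static String delimiterConverter(String delimiter) {
        return "\\" + delimiter;
    }

    static String[] splitHeader(String header, String delimiter) {
        // Assign the converted value to a new local variable rather than
        // reassigning the method parameter, as the review asks.
        String convertedDelimiter = delimiterConverter(delimiter);
        return header.split(convertedDelimiter);
    }

    public static void main(String[] args) {
        System.out.println(splitHeader("id|name|city", "|").length); // 3
    }
}
```

Leaving parameters effectively final makes the method easier to read and avoids accidental use of the pre-conversion value later in the body.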
fixed
Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/547/
Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/549/
CarbonDataLoadSchema schema) {
Iterator<String> columnIterator =
CarbonDataProcessorUtil.getSchemaColumnNames(schema, tableName).iterator();
Set<String> csvColumns = new HashSet<String>(Arrays.asList(csvHeader));
You can use Collections.addAll instead of converting the array to a list and adding it.
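A minimal sketch of the suggested change (the column names here are made up for illustration):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class CsvColumnsExample {
    // Build the column set with Collections.addAll, which copies the
    // array elements into the set directly instead of going through
    // the intermediate Arrays.asList wrapper.
    static Set<String> toColumnSet(String[] csvHeader) {
        Set<String> csvColumns = new HashSet<String>();
        Collections.addAll(csvColumns, csvHeader);
        return csvColumns;
    }

    public static void main(String[] args) {
        Set<String> columns = toColumnSet(new String[]{"id", "name", "city"});
        System.out.println(columns.size()); // 3
    }
}
```

Both forms are correct; Collections.addAll simply avoids allocating the list view.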
Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/550/
throw new DataLoadingException("Not able to read CSV input File ", e);
} finally {
CarbonUtil.closeStreams(fileReader, bufferedReader);
while (columnIterator.hasNext()) {
Please add a comment to describe this logic: the column definitions in the schema should be a subset of the input CSV header.
Build Success with Spark 1.5.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/552/
LGTM
[Launcher]: Refine miscellaneous
Scenario:
The file header can be obtained from either the DDL command or the CSV file.
Changes:
Before data loading, generate the file header first; both dictionary generation and data loading can then use this header directly (excluding the Kettle flow).