-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Plugin-1730] Added format xls #1835
Conversation
cdc5372
to
1c49888
Compare
a1f6456
to
51beae1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is anything using the testdata.xlsx file in the resources directory?
|
||
public static Builder builder() { | ||
return new Builder(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a newline after this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newline Added 👍
@Name(NAME_SHEET_VALUE) | ||
@Description(DESC_SHEET_VALUE) | ||
private String sheetValue; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove extra newline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Newline Removed 👍
public class XlsInputFormatProvider extends PathTrackingInputFormatProvider<XlsInputFormatConfig> { | ||
static final String NAME = "xls"; | ||
static final String DESC = "Plugin for reading files in xls(x) format."; | ||
public static final PluginClass PLUGIN_CLASS = PluginClass.builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like it's unused. Should remove it if so, otherwise please point out where it is being used.
If unused, please also remove the related variables, like XlsInputFormatConfig.XLS_FIELDS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the intent to use this in a future PR? The pattern for other formats is to add the format to ETLBatchTestBase.setupTest() and have test cases in FileBatchSourceTest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this will be used, in FileBatchSourceTest
to create Test Cases.
public static final String DESC_SHEET_VALUE = "Specifies the value corresponding to 'sheet' input. " + | ||
"Can be either sheet name or sheet no; for example: 'Sheet1' or '0' in case user selects 'Sheet Name' or " + | ||
"'Sheet Number' as 'sheet' input respectively. Sheet number starts with 0."; | ||
public static final String DESC_TERMINATE_ROW = "Specify whether processing needs to be terminated in case an" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't looks like XLS_FIELDS is needed, so these should just be specified inline with the Description annotation, so the description is close to the actual variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an odd option to have. I assume it's being carried over from the existing excel source?
If we don't know of an important use case for this, we should remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the implementation, it looks like this is used to stop reading when an empty row is found. Let's change the wording here, I thought 'terminated' meant it would throw an error.
Whether to stop reading after encountering the first empty row. Defaults to false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume it's being carried over from the existing excel source
Yes
- Updated the description for
DESC_TERMINATE_ROW
"Whether to skip the first line of each file. The default value is false."; | ||
public static final String DESC_SHEET = "Select the sheet by name or number. Default is 'Sheet Number'."; | ||
public static final String DESC_SHEET_VALUE = "Specifies the value corresponding to 'sheet' input. " + | ||
"Can be either sheet name or sheet no; for example: 'Sheet1' or '0' in case user selects 'Sheet Name' or " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the behavior if this is not specified? Does it read every sheet? Does it fail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the description.
By default it will read the 1st sheet.
for (int cellIndex = 0; cellIndex < row.getLastCellNum(); cellIndex++) { | ||
if (cellIndex >= fields.size()) { | ||
throw new IllegalArgumentException( | ||
String.format("Schema contains less fields than the number of columns in the excel file. " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
less -> fewer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated !
Schema.Field field = fields.get(cellIndex); | ||
Schema.Type type = field.getSchema().isNullable() ? | ||
field.getSchema().getNonNullable().getType() : field.getSchema().getType(); | ||
String result = formatter.formatCellValue(cell, type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is wasteful and error prone. We shouldn't be converting to a String and then using convertAndSet to change back to the actual type. Should be setting the value directly on the builder based on the cell type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of the XlsInputFormatDataFormatter, it would be better to have a XlsRowConverter class that contains all the logic for converting a Row to a StructuredRecord.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a XlsRowConverter
class
/** | ||
* Formats the cell value of an Excel file. | ||
*/ | ||
public class XlsInputFormatDataFormatter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove 'InputFormat' from the class name, it doesn't need to be used with Hadoop InputFormats
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed !
@@ -0,0 +1,192 @@ | |||
/* | |||
* Copyright © 2023 Cask Data, Inc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update to 2024 everywhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated 2024 🎆
pom.xml
Outdated
@@ -45,6 +45,7 @@ | |||
<module>solrsearch-plugins</module> | |||
<module>spark-plugins</module> | |||
<module>transform-plugins</module> | |||
<module>format-xls</module> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modules are sorted alphabetically, move this up to where the other formats are
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved up !
can you squash the commits that took place after the first review? That way it will be easier to review just the changes instead of all 1.5k lines again. |
6ff8f04
to
783ee3c
Compare
* 5) Check if the name has been found before (without considering case) | ||
* if so add _# where # is the number of times seen before + 1 | ||
*/ | ||
public static List<String> getSafeColumnNames(List<String> columnNames) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was cleanSchemaColumnNames copied directly from there except modified to use Lists instead of arrays?
If the logic is the same, can you move that class to format-common to re-use most of the logic?
format-xls/pom.xml
Outdated
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" | ||
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> | ||
<!-- | ||
~ Copyright © 2023 Cask Data, Inc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the years should be 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a bug on GitHub, it's already 2024 🤔
core-plugins/pom.xml
Outdated
@@ -279,6 +279,18 @@ | |||
<groupId>org.mockito</groupId> | |||
<artifactId>mockito-core</artifactId> | |||
</dependency> | |||
<dependency> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a duplicate, should remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed duplicate !
</dependency> | ||
<dependency> | ||
<groupId>org.apache.logging.log4j</groupId> | ||
<artifactId>log4j-core</artifactId> | ||
<scope>test</scope> | ||
<version>2.17.2</version> | ||
<scope>compile</scope> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this changed to compile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The version of apache poi
being used was having compatibility issue with the log4j library being used, i had to change this to compile to make it work.
FileSplit split, TaskAttemptContext context, @Nullable String pathField, | ||
@Nullable Schema schema) throws IOException { | ||
Configuration jobConf = context.getConfiguration(); | ||
boolean skipFirstRow = jobConf.getBoolean(NAME_SKIP_HEADER, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought the default for this was false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, default is now false. Thanks!
Configuration jobConf = context.getConfiguration(); | ||
boolean skipFirstRow = jobConf.getBoolean(NAME_SKIP_HEADER, true); | ||
boolean terminateIfEmptyRow = jobConf.getBoolean(TERMINATE_IF_EMPTY_ROW, false); | ||
Schema outputSchema = schema != null ? Schema.parseJson(context.getConfiguration().get("schema")) : null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra space after =
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed space.
@@ -50,79 +46,93 @@ | |||
* The {@link XlsInputFormat.XlsRecordReader} reads a given sheet, and within a sheet reads | |||
* all columns and all rows. | |||
*/ | |||
public class XlsInputFormat extends CombineFileInputFormat<LongWritable, StructuredRecord> { | |||
public class XlsInputFormat extends PathTrackingInputFormat { | |||
|
|||
public static final String SHEET_NO = "Sheet Number"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: NO -> NUM is more descriptive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SHEET_NUM
is better 👍
try { | ||
sheetValue = Integer.parseInt(conf.getSheetValue()); | ||
} catch (NumberFormatException e) { | ||
failureCollector.addFailure("Sheet number must be a number.", null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is code is duplicated in the validate() method, it would be better to have a method like getSheetAsNumber() that is called in both places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private Integer getSheetAsNumber(FailureCollector failureCollector) {
if (!Strings.isNullOrEmpty(conf.getSheetValue())) {
try {
int sheetValue = Integer.parseInt(conf.getSheetValue());
if (sheetValue >= 0) {
return sheetValue;
}
failureCollector.addFailure("Sheet number must be a positive number.", null)
.withConfigProperty(XlsInputFormatConfig.NAME_SHEET_VALUE);
} catch (NumberFormatException e) {
failureCollector.addFailure("Sheet number must be a number.", null)
.withConfigProperty(XlsInputFormatConfig.NAME_SHEET_VALUE);
}
}
return null;
}
List<Schema.Field> fields = outputSchema.getFields(); | ||
for (int cellIndex = 0; cellIndex < row.getLastCellNum(); cellIndex++) { | ||
if (cellIndex >= fields.size()) { | ||
throw new IllegalArgumentException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems overly restrictive, it seems like people should be able to read a subset of the columns. Or if there are extra cells on the side it seems like it shouldn't cause a failure. Is this how the existing excel source behaves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, i have added && cellIndex < fields.size()
as a condition in the for loop.
this won't throw any errors now.
The existing plugin does not have this issue as it uses fields like Columns To Be Extracted
and Column Label Mapping
to have a well defined area .
But this forces user to manually enter all the details, so this was not included in this plugin.
cellValue = getCellAsBoolean(cell); | ||
break; | ||
default: | ||
throw new IllegalArgumentException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be checked during plugin validation if it is not already happening. For cases like this when we don't expect the situation to ever happen, we usually add a comment that it's not expected and throw an IllegalStateException
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, this is not suppose to happen.
I have added comments for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: throw IllegalStateException instead of IllegalArgumentException, since hitting this is a problem in the code and not a problem with the input argument
isRowEmpty = false; | ||
} | ||
if (isRowEmpty) { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
method should be annotated with @Nullable
so it is clear the caller needs to handle nulls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Marked !
case BOOLEAN: | ||
return cell.getBooleanCellValue() ? "TRUE" : "FALSE"; | ||
case BLANK: | ||
case ERROR: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there any other CellType values (like date?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No , dates are stored as Numeric, there is a util function to check if the numeric cell is formatted as date.
698dd7e
to
cd5f2a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, please squash all commits and we can merge
cellValue = getCellAsBoolean(cell); | ||
break; | ||
default: | ||
throw new IllegalArgumentException( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: throw IllegalStateException instead of IllegalArgumentException, since hitting this is a problem in the code and not a problem with the input argument
cd5f2a0
to
1f42fb7
Compare
@psainics there are e2e and unit test failures, please comment on whether these are new or expected |
[s] Review Squashed
1f42fb7
to
5ae5ae1
Compare
E2E failure has been resolved. |
Added format xls to support files with
.xls
,.xlsx
formatJira : Plugin-1730
UI Fields
Sample Size: The maximum number of rows that will get investigated for automatic data type detection. The default value is 1000.
Override: A list of columns with the corresponding data types for whom the automatic data type detection gets skipped.
Terminate If Empty Row: Whether to terminate the file reading if an empty row is encountered. The default value is false.
Select Sheet Using: Select the sheet by name or number. Default is ‘Sheet Number’.
Sheet Value: The name/number of the sheet to read from. If not specified, the first sheet will be read. Sheet Number are 0 based, ie first sheet is 0.
Use First Row as Header: Whether to use first row as header.