Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Plugin-1730] Added format xls #1835

Merged
merged 1 commit into from
Feb 12, 2024
Merged

Conversation

psainics
Copy link
Contributor

Added format xls to support files with .xls , .xlsx format

Jira : Plugin-1730

UI Fields

image
  • Sample Size: The maximum number of rows that will get investigated for automatic data type detection. The default value is 1000.

  • Override: A list of columns with the corresponding data types for whom the automatic data type detection gets skipped.

  • Terminate If Empty Row: Whether to terminate the file reading if an empty row is encountered. The default value is false.

  • Select Sheet Using: Select the sheet by name or number. Default is ‘Sheet Number’.

  • Sheet Value: The name/number of the sheet to read from. If not specified, the first sheet will be read. Sheet Number are 0 based, ie first sheet is 0.

  • Use First Row as Header: Whether to use first row as header.

    • Default A,B,C,D....AA,AB,AC are used

Copy link
Contributor

@albertshau albertshau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is anything using the testdata.xlsx file in the resources directory?


public static Builder builder() {
return new Builder();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a newline after this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline Added 👍

@Name(NAME_SHEET_VALUE)
@Description(DESC_SHEET_VALUE)
private String sheetValue;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove extra newline

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Newline Removed 👍

public class XlsInputFormatProvider extends PathTrackingInputFormatProvider<XlsInputFormatConfig> {
static final String NAME = "xls";
static final String DESC = "Plugin for reading files in xls(x) format.";
public static final PluginClass PLUGIN_CLASS = PluginClass.builder()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like it's unused. Should remove it if so, otherwise please point out where it is being used.

If unused, please also remove the related variables, like XlsInputFormatConfig.XLS_FIELDS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intent to use this in a future PR? The pattern for other formats is to add the format to ETLBatchTestBase.setupTest() and have test cases in FileBatchSourceTest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this will be used, in FileBatchSourceTest to create Test Cases.

public static final String DESC_SHEET_VALUE = "Specifies the value corresponding to 'sheet' input. " +
"Can be either sheet name or sheet no; for example: 'Sheet1' or '0' in case user selects 'Sheet Name' or " +
"'Sheet Number' as 'sheet' input respectively. Sheet number starts with 0.";
public static final String DESC_TERMINATE_ROW = "Specify whether processing needs to be terminated in case an" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't looks like XLS_FIELDS is needed, so these should just be specified inline with the Description annotation, so the description is close to the actual variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an odd option to have. I assume it's being carried over from the existing excel source?

If we don't know of an important use case for this, we should remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading the implementation, it looks like this is used to stop reading when an empty row is found. Let's change the wording here, I thought 'terminated' meant it would throw an error.

Whether to stop reading after encountering the first empty row. Defaults to false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it's being carried over from the existing excel source

Yes

  • Updated the description for DESC_TERMINATE_ROW

"Whether to skip the first line of each file. The default value is false.";
public static final String DESC_SHEET = "Select the sheet by name or number. Default is 'Sheet Number'.";
public static final String DESC_SHEET_VALUE = "Specifies the value corresponding to 'sheet' input. " +
"Can be either sheet name or sheet no; for example: 'Sheet1' or '0' in case user selects 'Sheet Name' or " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the behavior if this is not specified? Does it read every sheet? Does it fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the description.
By default it will read the 1st sheet.

for (int cellIndex = 0; cellIndex < row.getLastCellNum(); cellIndex++) {
if (cellIndex >= fields.size()) {
throw new IllegalArgumentException(
String.format("Schema contains less fields than the number of columns in the excel file. " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

less -> fewer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated !

Schema.Field field = fields.get(cellIndex);
Schema.Type type = field.getSchema().isNullable() ?
field.getSchema().getNonNullable().getType() : field.getSchema().getType();
String result = formatter.formatCellValue(cell, type);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is wasteful and error prone. We shouldn't be converting to a String and then using convertAndSet to change back to the actual type. Should be setting the value directly on the builder based on the cell type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of the XlsInputFormatDataFormatter, it would be better to have a XlsRowConverter class that contains all the logic for converting a Row to a StructuredRecord.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a XlsRowConverter class

/**
* Formats the cell value of an Excel file.
*/
public class XlsInputFormatDataFormatter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove 'InputFormat' from the class name, it doesn't need to be used with Hadoop InputFormats

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed !

@@ -0,0 +1,192 @@
/*
* Copyright © 2023 Cask Data, Inc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update to 2024 everywhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated 2024 🎆

pom.xml Outdated
@@ -45,6 +45,7 @@
<module>solrsearch-plugins</module>
<module>spark-plugins</module>
<module>transform-plugins</module>
<module>format-xls</module>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modules are sorted alphabetically, move this up to where the other formats are

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved up !

@albertshau
Copy link
Contributor

can you squash the commits that took place after the first review? That way it will be easier to review just the changes instead of all 1.5k lines again.

* 5) Check if the name has been found before (without considering case)
* if so add _# where # is the number of times seen before + 1
*/
public static List<String> getSafeColumnNames(List<String> columnNames) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was cleanSchemaColumnNames copied directly from there except modified to use Lists instead of arrays?

If the logic is the same, can you move that class to format-common to re-use most of the logic?

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<!--
~ Copyright © 2023 Cask Data, Inc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the years should be 2024

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a bug on GitHub, it's already 2024 🤔

@@ -279,6 +279,18 @@
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
</dependency>
<dependency>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a duplicate, should remove it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed duplicate !

</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
<artifactId>log4j-core</artifactId>
<scope>test</scope>
<version>2.17.2</version>
<scope>compile</scope>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this changed to compile?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The version of apache poi being used was having compatibility issue with the log4j library being used, i had to change this to compile to make it work.

FileSplit split, TaskAttemptContext context, @Nullable String pathField,
@Nullable Schema schema) throws IOException {
Configuration jobConf = context.getConfiguration();
boolean skipFirstRow = jobConf.getBoolean(NAME_SKIP_HEADER, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought the default for this was false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, default is now false. Thanks!

Configuration jobConf = context.getConfiguration();
boolean skipFirstRow = jobConf.getBoolean(NAME_SKIP_HEADER, true);
boolean terminateIfEmptyRow = jobConf.getBoolean(TERMINATE_IF_EMPTY_ROW, false);
Schema outputSchema = schema != null ? Schema.parseJson(context.getConfiguration().get("schema")) : null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space after =

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed space.

@@ -50,79 +46,93 @@
* The {@link XlsInputFormat.XlsRecordReader} reads a given sheet, and within a sheet reads
* all columns and all rows.
*/
public class XlsInputFormat extends CombineFileInputFormat<LongWritable, StructuredRecord> {
public class XlsInputFormat extends PathTrackingInputFormat {

public static final String SHEET_NO = "Sheet Number";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: NO -> NUM is more descriptive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SHEET_NUM is better 👍

try {
sheetValue = Integer.parseInt(conf.getSheetValue());
} catch (NumberFormatException e) {
failureCollector.addFailure("Sheet number must be a number.", null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is code is duplicated in the validate() method, it would be better to have a method like getSheetAsNumber() that is called in both places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private Integer getSheetAsNumber(FailureCollector failureCollector) {
    if (!Strings.isNullOrEmpty(conf.getSheetValue())) {
      try {
        int sheetValue = Integer.parseInt(conf.getSheetValue());
        if (sheetValue >= 0) {
          return sheetValue;
        }
        failureCollector.addFailure("Sheet number must be a positive number.", null)
                    .withConfigProperty(XlsInputFormatConfig.NAME_SHEET_VALUE);
      } catch (NumberFormatException e) {
        failureCollector.addFailure("Sheet number must be a number.", null)
                .withConfigProperty(XlsInputFormatConfig.NAME_SHEET_VALUE);
      }
    }
    return null;
  }

List<Schema.Field> fields = outputSchema.getFields();
for (int cellIndex = 0; cellIndex < row.getLastCellNum(); cellIndex++) {
if (cellIndex >= fields.size()) {
throw new IllegalArgumentException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems overly restrictive, it seems like people should be able to read a subset of the columns. Or if there are extra cells on the side it seems like it shouldn't cause a failure. Is this how the existing excel source behaves?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, i have added && cellIndex < fields.size() as a condition in the for loop.
this won't throw any errors now.

The existing plugin does not have this issue as it uses fields like Columns To Be Extracted and Column Label Mapping to have a well defined area .
But this forces user to manually enter all the details, so this was not included in this plugin.

cellValue = getCellAsBoolean(cell);
break;
default:
throw new IllegalArgumentException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be checked during plugin validation if it is not already happening. For cases like this when we don't expect the situation to ever happen, we usually add a comment that it's not expected and throw an IllegalStateException

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this is not suppose to happen.
I have added comments for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: throw IllegalStateException instead of IllegalArgumentException, since hitting this is a problem in the code and not a problem with the input argument

isRowEmpty = false;
}
if (isRowEmpty) {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

method should be annotated with @Nullable so it is clear the caller needs to handle nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marked !

case BOOLEAN:
return cell.getBooleanCellValue() ? "TRUE" : "FALSE";
case BLANK:
case ERROR:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any other CellType values (like date?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No , dates are stored as Numeric, there is a util function to check if the numeric cell is formatted as date.

Copy link
Contributor

@albertshau albertshau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, please squash all commits and we can merge

cellValue = getCellAsBoolean(cell);
break;
default:
throw new IllegalArgumentException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: throw IllegalStateException instead of IllegalArgumentException, since hitting this is a problem in the code and not a problem with the input argument

@albertshau
Copy link
Contributor

@psainics there are e2e and unit test failures, please comment on whether these are new or expected

[s] Review Squashed
@psainics
Copy link
Contributor Author

psainics commented Feb 9, 2024

E2E failure has been resolved.

@albertshau albertshau merged commit 55ac600 into cdapio:develop Feb 12, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Trigger unit test build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants