[Plugin-1730] Added format xls #1835

psainics · 2023-12-11T04:36:02Z

Added the xls format to support `.xls` and `.xlsx` files.

Jira : Plugin-1730

UI Fields

Sample Size: The maximum number of rows that will get investigated for automatic data type detection. The default value is 1000.
Override: A list of columns with their corresponding data types, for which automatic data type detection is skipped.
Terminate If Empty Row: Whether to terminate the file reading if an empty row is encountered. The default value is false.
Select Sheet Using: Select the sheet by name or number. Default is ‘Sheet Number’.
Sheet Value: The name or number of the sheet to read from. If not specified, the first sheet is read. Sheet numbers are 0-based, i.e. the first sheet is 0.
Use First Row as Header: Whether to use the first row as the header.
- By default, the column names A, B, C, D, ..., AA, AB, AC are used
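
The default column naming described above (A, B, ..., Z, AA, AB, ...) can be sketched as follows. This is a minimal illustration, not the plugin's actual code; the class name `DefaultColumnNames` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PR: generates spreadsheet-style default column names.
final class DefaultColumnNames {
  private DefaultColumnNames() { }

  // Converts a 0-based column index to letters: 0 -> A, 25 -> Z, 26 -> AA, 27 -> AB.
  static String columnName(int index) {
    if (index < 0) {
      throw new IllegalArgumentException("Column index must be non-negative: " + index);
    }
    StringBuilder name = new StringBuilder();
    int n = index;
    while (n >= 0) {
      // Prepend the letter for the current least significant "digit".
      name.insert(0, (char) ('A' + n % 26));
      // Subtract 1 because there is no zero digit in this numbering (A follows Z as AA).
      n = n / 26 - 1;
    }
    return name.toString();
  }

  // Returns the first `count` default column names in order.
  static List<String> columnNames(int count) {
    List<String> names = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      names.add(columnName(i));
    }
    return names;
  }
}
```

For example, `DefaultColumnNames.columnNames(3)` yields A, B, C, and index 28 maps to AC.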

albertshau

is anything using the testdata.xlsx file in the resources directory?

albertshau · 2024-01-03T22:41:33Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatConfig.java

+
+  public static Builder builder() {
+    return new Builder();
+  }


nit: add a newline after this

Newline Added 👍

albertshau · 2024-01-03T22:44:39Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatConfig.java

+  @Name(NAME_SHEET_VALUE)
+  @Description(DESC_SHEET_VALUE)
+  private String sheetValue;
+


nit: remove extra newline

Newline Removed 👍

albertshau · 2024-01-03T22:48:10Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatProvider.java

+public class XlsInputFormatProvider extends PathTrackingInputFormatProvider<XlsInputFormatConfig> {
+  static final String NAME = "xls";
+  static final String DESC = "Plugin for reading files in xls(x) format.";
+  public static final PluginClass PLUGIN_CLASS = PluginClass.builder()


this looks like it's unused. Should remove it if so, otherwise please point out where it is being used.

If unused, please also remove the related variables, like XlsInputFormatConfig.XLS_FIELDS

Is the intent to use this in a future PR? The pattern for other formats is to add the format to ETLBatchTestBase.setupTest() and have test cases in FileBatchSourceTest

Yes this will be used, in FileBatchSourceTest to create Test Cases.

albertshau · 2024-01-03T22:49:32Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatConfig.java

+  public static final String DESC_SHEET_VALUE = "Specifies the value corresponding to 'sheet' input. " +
+    "Can be either sheet name or sheet no; for example: 'Sheet1' or '0' in case user selects 'Sheet Name' or " +
+    "'Sheet Number' as 'sheet' input respectively. Sheet number starts with 0.";
+  public static final String DESC_TERMINATE_ROW = "Specify whether processing needs to be terminated in case an" +


It doesn't look like XLS_FIELDS is needed, so these should just be specified inline with the Description annotation, so the description is close to the actual variable.

This seems like an odd option to have. I assume it's being carried over from the existing excel source?

If we don't know of an important use case for this, we should remove it.

After reading the implementation, it looks like this is used to stop reading when an empty row is found. Let's change the wording here, I thought 'terminated' meant it would throw an error.

Whether to stop reading after encountering the first empty row. Defaults to false.

I assume it's being carried over from the existing excel source

Yes

Updated the description for DESC_TERMINATE_ROW
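
A rough sketch of the "stop reading after the first empty row" behavior discussed in this thread, using plain lists in place of spreadsheet rows. The class `EmptyRowSketch` is hypothetical, and skipping empty rows when the option is off is an assumption, not something the thread confirms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; rows are modeled as lists of cell strings.
final class EmptyRowSketch {
  private EmptyRowSketch() { }

  // A row counts as empty when every cell is null or blank.
  static boolean isEmpty(List<String> row) {
    for (String cell : row) {
      if (cell != null && !cell.trim().isEmpty()) {
        return false;
      }
    }
    return true;
  }

  static List<List<String>> readRows(List<List<String>> rows, boolean terminateIfEmptyRow) {
    List<List<String>> result = new ArrayList<>();
    for (List<String> row : rows) {
      if (isEmpty(row)) {
        if (terminateIfEmptyRow) {
          // Stop reading; this is not an error, the remaining rows are simply ignored.
          break;
        }
        // Assumption: with the option off, empty rows are skipped and reading continues.
        continue;
      }
      result.add(row);
    }
    return result;
  }
}
```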

albertshau · 2024-01-03T22:57:30Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatConfig.java

+    "Whether to skip the first line of each file. The default value is false.";
+  public static final String DESC_SHEET = "Select the sheet by name or number. Default is 'Sheet Number'.";
+  public static final String DESC_SHEET_VALUE = "Specifies the value corresponding to 'sheet' input. " +
+    "Can be either sheet name or sheet no; for example: 'Sheet1' or '0' in case user selects 'Sheet Name' or " +


What is the behavior if this is not specified? Does it read every sheet? Does it fail?

Updated the description.
By default it will read the 1st sheet.

albertshau · 2024-01-03T23:40:01Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormat.java

+      for (int cellIndex = 0; cellIndex < row.getLastCellNum(); cellIndex++) {
+        if (cellIndex >= fields.size()) {
+         throw new IllegalArgumentException(
+           String.format("Schema contains less fields than the number of columns in the excel file. " +


less -> fewer

albertshau · 2024-01-03T23:45:24Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormat.java

+        Schema.Field field = fields.get(cellIndex);
+        Schema.Type type = field.getSchema().isNullable() ?
+                field.getSchema().getNonNullable().getType() : field.getSchema().getType();
+        String result = formatter.formatCellValue(cell, type);


this is wasteful and error prone. We shouldn't be converting to a String and then using convertAndSet to change back to the actual type. Should be setting the value directly on the builder based on the cell type.

Instead of the XlsInputFormatDataFormatter, it would be better to have an XlsRowConverter class that contains all the logic for converting a Row to a StructuredRecord.

Added an XlsRowConverter class
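
To illustrate the point about setting typed values directly instead of round-tripping through a String, here is a minimal sketch. `Kind` and `SimpleCell` are hypothetical stand-ins for a spreadsheet cell, not the Apache POI types, and this is not the `XlsRowConverter` from the PR.

```java
// Hypothetical illustration only; the types below stand in for a real spreadsheet cell API.
final class TypedCellSketch {
  private TypedCellSketch() { }

  // Stand-in for a cell type enum.
  enum Kind { STRING, NUMERIC, BOOLEAN, BLANK }

  // Stand-in for a cell that already holds a typed raw value.
  static final class SimpleCell {
    final Kind kind;
    final Object raw;

    SimpleCell(Kind kind, Object raw) {
      this.kind = kind;
      this.raw = raw;
    }
  }

  // Returns the value in its native Java type, so no String formatting or re-parsing is needed.
  static Object toValue(SimpleCell cell) {
    switch (cell.kind) {
      case STRING:
        return (String) cell.raw;
      case NUMERIC:
        return (Double) cell.raw;
      case BOOLEAN:
        return (Boolean) cell.raw;
      case BLANK:
        return null;
      default:
        // Not expected: every known kind is handled above, so reaching this is a bug in the code.
        throw new IllegalStateException("Unexpected cell kind: " + cell.kind);
    }
  }
}
```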

albertshau · 2024-01-03T23:47:42Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatDataFormatter.java

+/**
+ * Formats the cell value of an Excel file.
+ */
+public class XlsInputFormatDataFormatter {


nit: remove 'InputFormat' from the class name, it doesn't need to be used with Hadoop InputFormats

albertshau · 2024-01-03T23:51:24Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatProvider.java

@@ -0,0 +1,192 @@
+/*
+ * Copyright © 2023 Cask Data, Inc.


update to 2024 everywhere

Updated 2024 🎆

albertshau · 2024-01-03T23:51:57Z

pom.xml

@@ -45,6 +45,7 @@
    <module>solrsearch-plugins</module>
    <module>spark-plugins</module>
    <module>transform-plugins</module>
+    <module>format-xls</module>


modules are sorted alphabetically, move this up to where the other formats are

albertshau · 2024-01-18T00:26:19Z

can you squash the commits that took place after the first review? That way it will be easier to review just the changes instead of all 1.5k lines again.

albertshau · 2024-01-18T18:21:04Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatUtils.java

+   * 5) Check if the name has been found before (without considering case)
+   * if so add _# where # is the number of times seen before + 1
+   */
+  public static List<String> getSafeColumnNames(List<String> columnNames) {


was cleanSchemaColumnNames copied directly from there except modified to use Lists instead of arrays?

If the logic is the same, can you move that class to format-common to re-use most of the logic?
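
For reference, step 5 from the Javadoc in the hunk above (case-insensitive duplicate detection with a `_#` suffix) could look roughly like this. It is a sketch of that single step only, not the code from the PR; the class name `UniqueColumnNames` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of the de-duplication step only.
final class UniqueColumnNames {
  private UniqueColumnNames() { }

  // Appends _# to a name already seen (ignoring case), where # is the times seen before + 1.
  static List<String> dedupe(List<String> columnNames) {
    Map<String, Integer> seen = new HashMap<>();
    List<String> result = new ArrayList<>(columnNames.size());
    for (String name : columnNames) {
      String key = name.toLowerCase(Locale.ROOT);
      int timesSeen = seen.getOrDefault(key, 0);
      result.add(timesSeen == 0 ? name : name + "_" + (timesSeen + 1));
      seen.put(key, timesSeen + 1);
    }
    return result;
  }
}
```

Note that this sketch does not guard against a generated name such as `ID_2` colliding with a column that is literally named `ID_2`.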

albertshau · 2024-01-18T18:22:27Z

format-xls/pom.xml

+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+  <!--
+   ~ Copyright © 2023 Cask Data, Inc.


all the years should be 2024

Looks like a bug on GitHub, it's already 2024 🤔

albertshau · 2024-01-18T18:23:15Z

core-plugins/pom.xml

@@ -279,6 +279,18 @@
      <groupId>org.mockito</groupId>
      <artifactId>mockito-core</artifactId>
    </dependency>
+      <dependency>


this is a duplicate, should remove it

Removed the duplicate!

albertshau · 2024-01-18T18:24:15Z

format-xls/pom.xml

    </dependency>
    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-core</artifactId>
-      <scope>test</scope>
-      <version>2.17.2</version>
+      <scope>compile</scope>


why is this changed to compile?

The version of Apache POI being used had a compatibility issue with the log4j library in use, so I had to change this to compile scope to make it work.

albertshau · 2024-01-18T18:25:08Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormat.java

+          FileSplit split, TaskAttemptContext context, @Nullable String pathField,
+          @Nullable Schema schema) throws IOException {
+    Configuration jobConf = context.getConfiguration();
+    boolean skipFirstRow = jobConf.getBoolean(NAME_SKIP_HEADER, true);


thought the default for this was false?

Fixed, default is now false. Thanks!

albertshau · 2024-01-18T18:25:59Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormat.java

+    Configuration jobConf = context.getConfiguration();
+    boolean skipFirstRow = jobConf.getBoolean(NAME_SKIP_HEADER, true);
+    boolean terminateIfEmptyRow = jobConf.getBoolean(TERMINATE_IF_EMPTY_ROW, false);
+    Schema outputSchema =  schema != null ? Schema.parseJson(context.getConfiguration().get("schema")) : null;


extra space after =

Removed space.

albertshau · 2024-01-18T18:26:42Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormat.java

@@ -50,79 +46,93 @@
 * The {@link XlsInputFormat.XlsRecordReader} reads a given sheet, and within a sheet reads
 * all columns and all rows.
 */
-public class XlsInputFormat extends CombineFileInputFormat<LongWritable, StructuredRecord> {
+public class XlsInputFormat extends PathTrackingInputFormat {

  public static final String SHEET_NO = "Sheet Number";


nit: NO -> NUM is more descriptive

SHEET_NUM is better 👍

albertshau · 2024-01-18T18:33:05Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsInputFormatProvider.java

+            try {
+              sheetValue = Integer.parseInt(conf.getSheetValue());
+            } catch (NumberFormatException e) {
+              failureCollector.addFailure("Sheet number must be a number.", null)


this code is duplicated in the validate() method; it would be better to have a method like getSheetAsNumber() that is called in both places.

private Integer getSheetAsNumber(FailureCollector failureCollector) {
  if (!Strings.isNullOrEmpty(conf.getSheetValue())) {
    try {
      int sheetValue = Integer.parseInt(conf.getSheetValue());
      if (sheetValue >= 0) {
        return sheetValue;
      }
      failureCollector.addFailure("Sheet number must be a positive number.", null)
        .withConfigProperty(XlsInputFormatConfig.NAME_SHEET_VALUE);
    } catch (NumberFormatException e) {
      failureCollector.addFailure("Sheet number must be a number.", null)
        .withConfigProperty(XlsInputFormatConfig.NAME_SHEET_VALUE);
    }
  }
  return null;
}

albertshau · 2024-01-18T18:36:16Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsRowConverter.java

+    List<Schema.Field> fields = outputSchema.getFields();
+    for (int cellIndex = 0; cellIndex < row.getLastCellNum(); cellIndex++) {
+      if (cellIndex >= fields.size()) {
+        throw new IllegalArgumentException(


this seems overly restrictive, it seems like people should be able to read a subset of the columns. Or if there are extra cells on the side it seems like it shouldn't cause a failure. Is this how the existing excel source behaves?

I have added `&& cellIndex < fields.size()` as a condition in the for loop, so this won't throw an error now.

The existing plugin does not have this issue because it uses fields like Columns To Be Extracted and Column Label Mapping to define a well-defined area.
But that forces the user to enter all the details manually, so those fields were not included in this plugin.
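
The bounded loop described in this reply can be sketched with plain collections. The names below are hypothetical and the row is modeled as a list of strings; the idea is only that iteration stops at whichever is smaller, the number of schema fields or the number of cells.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the XlsRowConverter from the PR.
final class ColumnSubsetSketch {
  private ColumnSubsetSketch() { }

  // Maps cells to fields by position, ignoring extra cells and leaving missing fields unset.
  static Map<String, String> mapRow(List<String> fieldNames, List<String> cells) {
    Map<String, String> record = new LinkedHashMap<>();
    int limit = Math.min(fieldNames.size(), cells.size());
    for (int cellIndex = 0; cellIndex < limit; cellIndex++) {
      record.put(fieldNames.get(cellIndex), cells.get(cellIndex));
    }
    return record;
  }
}
```

With this shape, a schema that lists only the first two columns reads a three-column row without failing, and the third cell is ignored.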

albertshau · 2024-01-18T18:37:52Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsRowConverter.java

+          cellValue = getCellAsBoolean(cell);
+          break;
+        default:
+          throw new IllegalArgumentException(


this should be checked during plugin validation if it is not already happening. For cases like this when we don't expect the situation to ever happen, we usually add a comment that it's not expected and throw an IllegalStateException

Yup, this is not supposed to happen.
I have added comments for that.

nit: throw IllegalStateException instead of IllegalArgumentException, since hitting this is a problem in the code and not a problem with the input argument

albertshau · 2024-01-18T18:38:44Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsRowConverter.java

+      isRowEmpty = false;
+    }
+    if (isRowEmpty) {
+      return null;


method should be annotated with @Nullable so it is clear the caller needs to handle nulls.

albertshau · 2024-01-18T18:42:35Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsRowConverter.java

+      case BOOLEAN:
+        return cell.getBooleanCellValue() ? "TRUE" : "FALSE";
+      case BLANK:
+      case ERROR:


are there any other CellType values (like date?)

No, dates are stored as Numeric. There is a util function to check whether a numeric cell is formatted as a date.
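
As background on dates being stored as numbers: in the 1900 date system a date cell holds a serial day count, and the formatting decides whether it is shown as a date. A minimal conversion sketch follows, assuming the 1900 date system; the class `ExcelSerialDate` is hypothetical. The util function mentioned above is presumably Apache POI's `DateUtil.isCellDateFormatted`, though the thread does not name it.

```java
import java.time.LocalDate;

// Hypothetical illustration only; assumes the 1900 date system.
final class ExcelSerialDate {
  // Base date that makes serial numbers line up with calendar dates from 1900-03-01 onward.
  private static final LocalDate BASE = LocalDate.of(1899, 12, 30);

  private ExcelSerialDate() { }

  // Converts a serial day number to a calendar date, dropping any time-of-day fraction.
  static LocalDate toLocalDate(double serial) {
    if (serial < 61) {
      // Earlier serials are affected by the fictitious 1900-02-29 and are not handled here.
      throw new IllegalArgumentException("Serial values below 61 are not supported: " + serial);
    }
    return BASE.plusDays((long) Math.floor(serial));
  }
}
```

For example, serial 25569 corresponds to 1970-01-01 and 45292 to 2024-01-01.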

albertshau

lgtm, please squash all commits and we can merge

albertshau · 2024-01-30T21:45:23Z

format-xls/src/main/java/io/cdap/plugin/format/xls/input/XlsRowConverter.java

+          cellValue = getCellAsBoolean(cell);
+          break;
+        default:
+          throw new IllegalArgumentException(


nit: throw IllegalStateException instead of IllegalArgumentException, since hitting this is a problem in the code and not a problem with the input argument

albertshau · 2024-02-07T21:59:20Z

@psainics there are e2e and unit test failures, please comment on whether these are new or expected

[s] Review Squashed

psainics · 2024-02-09T06:08:38Z

E2E failure has been resolved.

sau42shri requested a review from DJSagarAhire December 12, 2023 10:41

psainics force-pushed the xls_addition branch from cdc5372 to 1c49888 Compare December 18, 2023 08:01

vikasrathee-cs requested a review from albertshau December 19, 2023 14:38

psainics force-pushed the xls_addition branch 2 times, most recently from a1f6456 to 51beae1 Compare December 21, 2023 09:35

psainics force-pushed the xls_addition branch from 51beae1 to c9b31fd Compare January 1, 2024 04:53

albertshau reviewed Jan 3, 2024

psainics requested a review from albertshau January 15, 2024 00:42

psainics force-pushed the xls_addition branch from 6ff8f04 to 783ee3c Compare January 18, 2024 00:32

albertshau reviewed Jan 18, 2024

psainics force-pushed the xls_addition branch from 698dd7e to cd5f2a0 Compare January 22, 2024 16:14

psainics requested a review from albertshau January 22, 2024 16:15

albertshau approved these changes Jan 30, 2024

albertshau added the build Trigger unit test build label Jan 30, 2024

psainics force-pushed the xls_addition branch from cd5f2a0 to 1f42fb7 Compare January 31, 2024 19:40

Added format xls

5ae5ae1

[s] Review Squashed

psainics force-pushed the xls_addition branch from 1f42fb7 to 5ae5ae1 Compare February 8, 2024 03:23

albertshau merged commit 55ac600 into cdapio:develop Feb 12, 2024
5 checks passed
