-
Notifications
You must be signed in to change notification settings - Fork 31
Using Flink Table API to read write Excel files
This is a Flink application demonstrating some of the capabilities of the hadoopoffice Flink TableSource/TableSink library. It takes as input a set of Excel files. It outputs them to a new Excel file into the Sheet "Sheet2". It has successfully been tested with the HDP Sandbox VM 2.5, but other Hadoop distributions should work equally well, if they support Flink. You will need at least Flink 1.5 (older versions may work though). Note: you may need to put the file flink-hadoop-compatibility*.jar in the lib folder of your flink distribution.
You can create yourself an Excel file in LibreOffice or Microsoft Excel. Alternatively, you can download an Excel file that is used for unit testing of hadoopoffice library by executing the following command:
wget --no-check-certificate https://github.com/ZuInnoTe/hadoopoffice/blob/master/examples/scala-flinkts-excel/src/it/resources/testsimple.xlsx?raw=true
You can put it on your HDFS cluster by executing the following commands:
hadoop fs -mkdir -p /user/flink/office/excel/input
hadoop fs -put ./testsimple.xlsx /user/flink/office/excel/input
After it has been copied you are ready to use the example.
Note the HadoopOffice library and the Flink TableSource/TableSink is available on Maven Central.
Execute
git clone https://github.com/ZuInnoTe/hadoopoffice.git hadoopoffice
You can build the application by changing to the directory hadoopoffice/examples/scala-flinkts-excel and using the following command:
sbt +clean +it:test +assembly
Make sure that the output directory is empty:
hadoop fs -rm -R /user/flink/office/excel/output
Execute the following command:
flink run example-ho-flink-ts-scala-excel.jar --input hdfs://localhost:8020/user/flink/office/excel/input --output hdfs://localhost:8020/user/flink/office/excel/output/output.xlsx
After the job finished you can download the Excel file and view it in LibreOffice, Excel or any other spreadsheet tool.
hadoop fs -copyToLocal /user/flink/office/excel/output/output.xlsx
Find here further configuration options of the HadoopOffice library, such as encryption, decryption, locale, meta data filter, linked workbooks and filtering by sheets.