Closed
Description
Describe the bug
The hdfsreader plugin fails with an error when reading a text file.
The job JSON file is as follows:
{
  "job": {
    "setting": {
      "speed": {
        "byte": -1,
        "channel": 1
      }
    },
    "content": [
      {
        "writer": {
          "name": "streamwriter",
          "parameter": {
            "print": "true"
          }
        },
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "column": [
              {
                "index": 0,
                "type": "string"
              },
              {
                "index": 1,
                "type": "long"
              },
              {
                "index": 2,
                "type": "date"
              },
              {
                "index": 3,
                "type": "boolean"
              },
              {
                "index": 4,
                "type": "string"
              }
            ],
            "defaultFS": "hdfs://sandbox-hdp.hortonworks.com:8020",
            "path": "/tmp/out_orc",
            "fileType": "text",
            "fieldDelimiter": "\u0001",
            "fileName": "test_none",
            "encoding": "UTF-8"
          }
        }
      }
    ]
  }
}
The execution output is as follows:
....
2020-12-11 21:27:24.903 [job-0] INFO DFSUtil - get HDFS all files in path = [/tmp/out_orc]
2020-12-11 21:27:26.459 [job-0] ERROR DFSUtil - Failed to check the type of file [hdfs://sandbox-hdp.hortonworks.com:8020/tmp/out_orc/test_none__48bb0c2c_c520_4406_ab12_8039dc277296]. Five file formats are currently supported: ORC, SEQUENCE, RCFile, TEXT, CSV. Please check that your file type and file are correct.
2020-12-11 21:27:26.472 [job-0] INFO StandAloneJobContainerCommunicator - Total 0 records, 0 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 0.00%
2020-12-11 21:27:26.474 [job-0] ERROR Engine - Code:[HdfsReader-10], Description:[Error reading file]. - Code:[HdfsReader-10], Description:[Error reading file]. - Failed to check the type of file [hdfs://sandbox-hdp.hortonworks.com:8020/tmp/out_orc/test_none__48bb0c2c_c520_4406_ab12_8039dc277296]. Five file formats are currently supported: ORC, SEQUENCE, RCFile, TEXT, CSV. Please check that your file type and file are correct. - java.lang.RuntimeException: hdfs://sandbox-hdp.hortonworks.com:8020/tmp/out_orc/test_none__48bb0c2c_c520_4406_ab12_8039dc277296 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [101, 115, 116, 10]
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:531)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:712)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:609)
at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.isParquetFile(DFSUtil.java:893)
at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.checkHdfsFileType(DFSUtil.java:724)
at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.addSourceFileByType(DFSUtil.java:222)
at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.addSourceFileIfNotEmpty(DFSUtil.java:152)
at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getHDFSAllFilesNORegex(DFSUtil.java:209)
at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getHDFSAllFiles(DFSUtil.java:179)
at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getAllFiles(DFSUtil.java:141)
at com.alibaba.datax.plugin.reader.hdfsreader.HdfsReader$Job.prepare(HdfsReader.java:172)
at com.alibaba.datax.core.job.JobContainer.prepareJobReader(JobContainer.java:702)
at com.alibaba.datax.core.job.JobContainer.prepare(JobContainer.java:312)
at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:115)
at com.alibaba.datax.core.Engine.start(Engine.java:90)
at com.alibaba.datax.core.Engine.entry(Engine.java:151)
at com.alibaba.datax.core.Engine.main(Engine.java:169)
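To make the magic-number message in the exception easier to read, here is a minimal Python sketch that decodes the two byte arrays from the log and shows a tail check that returns a boolean instead of throwing. The helper names `tail_bytes` and `looks_like_parquet` are hypothetical, the check runs against a local file rather than HDFS, and none of this is DataX code; it only illustrates what the Parquet reader compared.

```python
import os

# Parquet files end with the 4-byte magic "PAR1".
PARQUET_MAGIC = b"PAR1"

# Byte arrays copied from the exception message in the log.
expected = bytes([80, 65, 82, 49])
found = bytes([101, 115, 116, 10])

print(expected)  # b'PAR1'
print(found)     # b'est\n'


def tail_bytes(path, n=4):
    """Return the last n bytes of a local file (fewer if the file is shorter)."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(size - n, 0))
        return f.read(n)


def looks_like_parquet(path):
    """Hypothetical probe: True only if the file ends with the Parquet magic."""
    return tail_bytes(path) == PARQUET_MAGIC
```

The bytes found at the tail decode to `est` followed by a newline, which is consistent with a plain text file ending in a line break rather than a Parquet footer. The stack trace shows the exception being raised from `DFSUtil.isParquetFile` while `checkHdfsFileType` was running, so the failure appears to come from the Parquet probe itself and not from reading the text data.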
Runtime environment
- OS: CentOS 7.7.1908
- JDK Version: openjdk 14
- DataX Version: 3.1.4