hdfsreader plugin fails when reading a text file #66

Closed
@wgzhao

Description

Describe the bug

The hdfsreader plugin reports an error when reading a text file.

The job JSON file is as follows:

{
    "job": {
        "setting": {
            "speed": {
                "byte": -1,
                "channel": 1
            }
        },
        "content": [
            {
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "print": "true"
                    }
                },
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": [
                            {
                                "index": 0,
                                "type": "string"
                            },
                            {
                                "index": 1,
                                "type": "long"
                            },
                            {
                                "index": 2,
                                "type": "date"
                            },
                            {
                                "index": 3,
                                "type": "boolean"
                            },
                            {
                                "index": 4,
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://sandbox-hdp.hortonworks.com:8020",
                        "path": "/tmp/out_orc",
                        "fileType": "text",
                        "fieldDelimiter": "\u0001",
                        "fileName": "test_none",
                        "encoding": "UTF-8"
                    }
                }
            }
        ]
    }
}

The output is as follows:

....
2020-12-11 21:27:24.903 [job-0] INFO  DFSUtil - get HDFS all files in path = [/tmp/out_orc]
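2020-12-11 21:27:26.459 [job-0] ERROR DFSUtil - Failed to check the type of file [hdfs://sandbox-hdp.hortonworks.com:8020/tmp/out_orc/test_none__48bb0c2c_c520_4406_ab12_8039dc277296]. Five file formats are currently supported: ORC, SEQUENCE, RCFile, TEXT, and CSV. Please check that the file type and the file are correct.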
2020-12-11 21:27:26.472 [job-0] INFO  StandAloneJobContainerCommunicator - Total 0 records, 0 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 0.00%
2020-12-11 21:27:26.474 [job-0] ERROR Engine - Code:[HdfsReader-10], Description:[Error reading file].  - Code:[HdfsReader-10], Description:[Error reading file].  - Failed to check the type of file [hdfs://sandbox-hdp.hortonworks.com:8020/tmp/out_orc/test_none__48bb0c2c_c520_4406_ab12_8039dc277296]. Five file formats are currently supported: ORC, SEQUENCE, RCFile, TEXT, and CSV. Please check that the file type and the file are correct. - java.lang.RuntimeException: hdfs://sandbox-hdp.hortonworks.com:8020/tmp/out_orc/test_none__48bb0c2c_c520_4406_ab12_8039dc277296 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [101, 115, 116, 10]
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:531)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:712)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:609)
	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:152)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.isParquetFile(DFSUtil.java:893)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.checkHdfsFileType(DFSUtil.java:724)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.addSourceFileByType(DFSUtil.java:222)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.addSourceFileIfNotEmpty(DFSUtil.java:152)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getHDFSAllFilesNORegex(DFSUtil.java:209)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getHDFSAllFiles(DFSUtil.java:179)
	at com.alibaba.datax.plugin.reader.hdfsreader.DFSUtil.getAllFiles(DFSUtil.java:141)
	at com.alibaba.datax.plugin.reader.hdfsreader.HdfsReader$Job.prepare(HdfsReader.java:172)
	at com.alibaba.datax.core.job.JobContainer.prepareJobReader(JobContainer.java:702)
	at com.alibaba.datax.core.job.JobContainer.prepare(JobContainer.java:312)
	at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:115)
	at com.alibaba.datax.core.Engine.start(Engine.java:90)
	at com.alibaba.datax.core.Engine.entry(Engine.java:151)
	at com.alibaba.datax.core.Engine.main(Engine.java:169)
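
The trace shows the exception surfacing from `DFSUtil.isParquetFile` while the file type is being checked. The expected tail bytes `[80, 65, 82, 49]` are ASCII `PAR1`, and the bytes actually found, `[101, 115, 116, 10]`, decode to `est` followed by a newline, which is what the end of a plain text file would look like. As a rough illustration only (this is not DataX's code; the class `ParquetMagicCheck` and method `hasParquetMagic` are hypothetical, and it reads a local file rather than HDFS), a type probe could compare the tail bytes and return `false` instead of letting the Parquet reader throw:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of DataX: probes a local file for the Parquet tail magic.
class ParquetMagicCheck {
    // Parquet files end with the ASCII bytes "PAR1" = [80, 65, 82, 49].
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns true only if the last four bytes of the file equal "PAR1". */
    static boolean hasParquetMagic(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long len = raf.length();
            if (len < MAGIC.length) {
                return false; // too short to end with the magic bytes
            }
            byte[] tail = new byte[MAGIC.length];
            raf.seek(len - MAGIC.length);
            raf.readFully(tail);
            return Arrays.equals(tail, MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small text file that uses the \u0001 field delimiter from the job config.
        Path text = Files.createTempFile("test_none", ".txt");
        Files.write(text, "a\u00011\u0001test\n".getBytes(StandardCharsets.UTF_8));

        // The probe reports "not Parquet" without raising an exception.
        System.out.println(hasParquetMagic(text)); // prints "false"
        Files.delete(text);
    }
}
```

With a probe of this shape, a non-Parquet file produces a plain `false`, so the caller can continue to the next format check rather than aborting the job.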

Environment

  • OS: CentOS 7.7.1908
  • JDK Version: openjdk 14
  • DataX Version: 3.1.4

Labels

bug (Something isn't working)