Open
Description
PapaParse breaks multi-byte UTF-8 characters when their bytes are split across separate Buffer chunks. For example, ç would become ��.
To reproduce:
const Papa = require('papaparse')
const {PassThrough} = require('stream')
const csvFileString = 'first_name,last_name\nFrançois,Mitterrand\n'
const input = new PassThrough()
const parser = Papa.parse(Papa.NODE_STREAM_INPUT, {header: true})
input.pipe(parser)
parser.on('data', row => console.log(row))
// Split in the middle of the two-byte UTF-8 sequence for 'ç' (0xC3 0xA7)
input.write(Buffer.from(csvFileString).slice(0, 26))
input.write(Buffer.from(csvFileString).slice(26))
input.end()
Output:
{ first_name: 'Fran��ois', last_name: 'Mitterrand' }
A workaround is to decode the input as UTF-8 before it reaches PapaParse, using string_decoder (a built-in Node module), the WHATWG TextDecoder, or iconv-lite (a user-land dependency).
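For anyone hitting this today, here is a minimal sketch of the string_decoder workaround. It assumes a small Transform placed between the input and the parser; `createUtf8Realigner` is a hypothetical helper name, not part of PapaParse. The decoder holds back an incomplete trailing byte sequence until the next chunk arrives, so every chunk passed downstream contains only whole characters.

```javascript
const { Transform } = require('stream')
const { StringDecoder } = require('string_decoder')

// Hypothetical helper (not part of PapaParse): forwards only complete
// UTF-8 characters, buffering a split sequence until its remaining bytes arrive.
function createUtf8Realigner () {
  const decoder = new StringDecoder('utf8')
  return new Transform({
    transform (chunk, encoding, callback) {
      // write() returns the decodable text and keeps any incomplete tail
      callback(null, decoder.write(chunk))
    },
    flush (callback) {
      // end() returns whatever is still buffered when the input closes
      callback(null, decoder.end())
    }
  })
}

// Usage with the reproduction above (requires papaparse):
// input.pipe(createUtf8Realigner()).pipe(parser)
```

Since the chunks leaving this stream never end mid-character, the parser's own per-chunk decoding should no longer produce replacement characters.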
A better answer is to use string_decoder or TextDecoder inside PapaParse, in place of chunk.toString().
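To illustrate the difference, the sketch below decodes the same two chunks from the reproduction, first independently with chunk.toString() and then with a stateful TextDecoder in streaming mode. This only demonstrates the idea; it is not PapaParse's actual code, and the variable names are made up for the example.

```javascript
const { TextDecoder } = require('util')

const bytes = Buffer.from('first_name,last_name\nFran\u00e7ois,Mitterrand\n')
const chunks = [bytes.subarray(0, 26), bytes.subarray(26)]

// Each chunk decoded in isolation: both halves of 'ç' become U+FFFD
const naive = chunks.map(chunk => chunk.toString()).join('')

// Stateful decoding: { stream: true } carries an incomplete sequence
// over to the next call; the final decode() flushes the decoder
const decoder = new TextDecoder('utf-8')
const streamed = chunks
  .map(chunk => decoder.decode(chunk, { stream: true }))
  .join('') + decoder.decode()

console.log(naive.includes('\ufffd')) // true
console.log(streamed === bytes.toString()) // true
```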
Related to #751