Break multi bytes UTF-8 characters when parsing in Node-style #908

@jdesboeufs

PapaParse breaks multi-byte UTF-8 characters when their bytes are split across different Buffer chunks.
For example, ç becomes ��.

To reproduce:

const Papa = require('papaparse')
const {PassThrough} = require('stream')

const csvFileString = 'first_name,last_name\nFrançois,Mitterrand\n'

const input = new PassThrough()
const parser = Papa.parse(Papa.NODE_STREAM_INPUT, {header: true})

input.pipe(parser)

parser.on('data', row => console.log(row))

// Split at byte offset 26, between the two bytes of "ç" (offsets 25 and 26)
input.write(Buffer.from(csvFileString).slice(0, 26))
input.write(Buffer.from(csvFileString).slice(26))
input.end()

Output:

{ first_name: 'Fran��ois', last_name: 'Mitterrand' }

A workaround is to decode the UTF-8 input before it reaches PapaParse, using string_decoder (a built-in Node module), the WHATWG TextDecoder, or iconv-lite (a user-land dependency).
A better fix would be for PapaParse itself to use string_decoder or TextDecoder in place of chunk.toString().
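To illustrate the difference, here is a minimal sketch that uses only Node built-ins and does not touch PapaParse. It splits the same CSV string at the same byte offset as the reproduction and compares per-chunk toString() with the two stateful decoders. The variable names (naive, viaStringDecoder, viaTextDecoder) are made up for this example.

```javascript
const {StringDecoder} = require('string_decoder')

const csvFileString = 'first_name,last_name\nFrançois,Mitterrand\n'
const bytes = Buffer.from(csvFileString)

// Same split as in the reproduction: the two bytes of "ç" land in different chunks
const chunks = [bytes.subarray(0, 26), bytes.subarray(26)]

// Naive decoding: each chunk is decoded on its own, so the orphaned bytes
// are each replaced with U+FFFD
const naive = chunks.map(chunk => chunk.toString()).join('')

// StringDecoder keeps an incomplete trailing sequence until the next write()
const stringDecoder = new StringDecoder('utf8')
const viaStringDecoder =
  chunks.map(chunk => stringDecoder.write(chunk)).join('') + stringDecoder.end()

// TextDecoder does the same when called with {stream: true};
// a final decode() with no arguments flushes any remaining bytes
const textDecoder = new TextDecoder('utf-8')
const viaTextDecoder =
  chunks.map(chunk => textDecoder.decode(chunk, {stream: true})).join('') +
  textDecoder.decode()

console.log(naive === csvFileString)            // false
console.log(viaStringDecoder === csvFileString) // true
console.log(viaTextDecoder === csvFileString)   // true
```

As a workaround, strings decoded this way could be passed on in place of the raw Buffers. How best to wire that into a given stream pipeline is left open here.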

Related to #751
