Description
Elasticsearch version: 7.2.0
Plugins installed: [none]
JVM version: java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
OS version: Windows 10, Version 1709
Description of the problem including expected versus actual behavior:
Using the split processor (since there is no csv processor for ingest pipelines) to split a CSV line drops the trailing empty fields.
For example, the input A,,B,, yields A, '', B.
Expected behaviour: A, '', B, '', ''
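This matches the default behaviour of Java's String.split(regex), which also removes trailing empty strings. Java, however, provides an overload, String.split(regex, limit), where passing a negative limit such as -1 retains the trailing empty strings. The split processor has no equivalent option.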
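The difference between the two Java overloads can be seen in a minimal sketch like the following. It only exercises plain `String.split` and assumes the processor's behaviour mirrors the default overload; the class and method names (`SplitTrailingDemo`, `splitDefault`, `splitKeepTrailing`) are made up for illustration and are not part of Elasticsearch.

```java
import java.util.Arrays;

// Hypothetical demo class, not Elasticsearch code.
public class SplitTrailingDemo {

    // Default overload: trailing empty strings are removed from the result.
    static String[] splitDefault(String line) {
        return line.split(",");
    }

    // Negative limit: trailing empty strings are kept in the result.
    static String[] splitKeepTrailing(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "A,,B,,";
        System.out.println(Arrays.toString(splitDefault(line)));      // [A, , B]
        System.out.println(Arrays.toString(splitKeepTrailing(line))); // [A, , B, , ]
    }
}
```

The second call returns five elements, which is the result this report expects from the split processor.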
Steps to reproduce:
- Create a simple pipeline:
PUT _ingest/pipeline/test_pipeline
{
  "description": "test",
  "processors": [
    {
      "split": {
        "field": "message",
        "target_field": "splitdata",
        "separator": ","
      }
    }
  ]
}
- Test it.
GET _ingest/pipeline/test_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "A,,B,,"
      }
    }
  ]
}
- Result:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "message" : "A,,B,,",
          "splitdata" : [
            "A",
            "",
            "B"
          ]
        },
        "_ingest" : {
          "timestamp" : "2019-10-23T04:25:26.277Z"
        }
      }
    }
  ]
}
The two trailing empty fields after 'B' are dropped.
- Test with a different input.
GET _ingest/pipeline/test_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "A,,B,,C"
      }
    }
  ]
}
- Result:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "message" : "A,,B,,C",
          "splitdata" : [
            "A",
            "",
            "B",
            "",
            "C"
          ]
        },
        "_ingest" : {
          "timestamp" : "2019-10-23T04:27:38.400Z"
        }
      }
    }
  ]
}
Here the empty values are preserved, because none of them is trailing.