
saveToEs saves case classes fields with NULL values #998

Closed
@angelcervera

Description


What kind of an issue is this?

Bug report.

Possibly related to #792, but not the same issue.

Issue description

saveToEs inserts null fields of case classes as explicit null values instead of ignoring them.
The behavior should be aligned with the Spark SQL integration, where null fields are ignored by default.

From the elasticsearch-hadoop Spark SQL documentation: "By default, elasticsearch-hadoop will ignore null values in favor of not writing any field at all. Since a DataFrame is meant to be treated as structured tabular data, you can enable writing nulls as null valued fields for DataFrame Objects only by toggling the es.spark.dataframe.write.null setting to true."

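For contrast, a minimal sketch of the DataFrame write path described in the documentation quote above, where null fields are dropped by default unless es.spark.dataframe.write.null is set to true. The index name and local node reuse the values from the reproduction below; a running Elasticsearch instance is assumed, and the simplified case class is illustrative only:

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._

case class WithNulls(desc: String, field2: Option[String])

val spark = SparkSession.builder()
  .appName("Null handling via DataFrames")
  .master("local[2]")
  .config("es.nodes", "localhost:9200")
  // Default is false: null fields are not written at all.
  // Setting this to true would write them as explicit null values.
  .config("es.spark.dataframe.write.null", "false")
  .getOrCreate()

import spark.implicits._

// With the default setting, the document indexed for this row should
// contain only the "desc" field, with field2 omitted from _source.
Seq(WithNulls("None and nulls", None)).toDF().saveToEs("testidx/withnulls")
```

The expectation in this report is that the RDD-based saveToEs shown below behaves the same way.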
Steps to reproduce

Code:

      import org.apache.spark.{SparkConf, SparkContext}
      import org.elasticsearch.spark._

      case class WithNulls(desc: String, field2: Option[String], field3: String, inner: Option[WithNullsInner])
      case class WithNullsInner(field4: String, field5: String)

      val conf = new SparkConf()
        .setAppName("Testing null serialization.")
        .setMaster("local[2]")
        .set("es.index.auto.create", "true")
        .set("es.nodes", "localhost:9200")

      new SparkContext(conf).parallelize(
        List(
          WithNulls("all fields", Some("field2_1"), "field3_1", Some(WithNullsInner("field4_1", "field5_1"))),
          WithNulls("None and nulls", None, null, None)
        )
      ).saveToEs("testidx/withnulls")


Current response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "testidx",
        "_type": "withnulls",
        "_id": "AVxEYKgiaEb7vk_5fa7O",
        "_score": 1,
        "_source": {
          "desc": "None and nulls",
          "field2": null,
          "field3": null,
          "inner": null
        }
      },
      {
        "_index": "testidx",
        "_type": "withnulls",
        "_id": "AVxEYKgkaEb7vk_5fa7P",
        "_score": 1,
        "_source": {
          "desc": "all fields",
          "field2": "field2_1",
          "field3": "field3_1",
          "inner": {
            "field4": "field4_1",
            "field5": "field5_1"
          }
        }
      }
    ]
  }
}

Expected response:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "testidx",
        "_type": "withnulls",
        "_id": "AVxEYKgiaEb7vk_5fa7O",
        "_score": 1,
        "_source": {
          "desc": "None and nulls"
        }
      },
      {
        "_index": "testidx",
        "_type": "withnulls",
        "_id": "AVxEYKgkaEb7vk_5fa7P",
        "_score": 1,
        "_source": {
          "desc": "all fields",
          "field2": "field2_1",
          "field3": "field3_1",
          "inner": {
            "field4": "field4_1",
            "field5": "field5_1"
          }
        }
      }
    ]
  }
}

Version Info

OS: Ubuntu 16.04.2 LTS
JVM : openjdk version "1.8.0_131"
Hadoop/Spark: Spark 2.1.0
ES-Hadoop : 5.3
ES : tested in 2.2 and 5.3
