Reading into spark dataframe from indexes with dynamic field mapping configuration #1988

Open
@qaziqarta

Description

What kind of issue is this?

  • Bug report. If you’ve found a bug, please provide a code snippet or test to reproduce it below.
    The easier it is to track down the bug, the faster it is solved.
  • Feature Request. Start by telling us what problem you’re trying to solve.
    Often a solution already exists! Don’t send pull requests to implement new features without
    first getting our support. Sometimes we leave features out on purpose to keep the project small.

Feature description

Allow reading into a dataframe/RDD from indexes that carry dynamic field mapping configuration.
For example, I have an index with dynamic_date_formats, created like this:

PUT my_index
{
  "mappings": {
    "dynamic_date_formats": ["MM/dd/yyyy"]
  }
}

PUT my_index/_doc/1
{
  "create_date": "09/25/2015"
}
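For context, the mapping that the connector fetches for this index places dynamic_date_formats at the same level as properties, which is presumably what trips FieldParser.parseField (see the "invalid map received" message below). A minimal Python sketch of the feature being requested, i.e. skipping dynamic-mapping configuration keys instead of failing; the sample mapping JSON, the key list, and the helper name are illustrative assumptions, not the actual FieldParser code:

```python
import json

# Roughly what GET my_index/_mapping returns for the index above (sketch;
# the exact "create_date" sub-mapping depends on the Elasticsearch version).
mapping_response = json.loads("""
{
  "my_index": {
    "mappings": {
      "dynamic_date_formats": ["MM/dd/yyyy"],
      "properties": {
        "create_date": { "type": "date", "format": "MM/dd/yyyy" }
      }
    }
  }
}
""")

# Keys that configure dynamic mapping behaviour rather than describe fields.
# A tolerant parser would skip these instead of treating them as field maps.
DYNAMIC_MAPPING_KEYS = {
    "dynamic", "dynamic_date_formats", "dynamic_templates",
    "date_detection", "numeric_detection",
}

def extract_fields(mappings):
    """Collect {field: type}, ignoring dynamic-mapping configuration keys."""
    fields = {}
    for key, value in mappings.items():
        if key in DYNAMIC_MAPPING_KEYS:
            continue  # the current FieldParser raises here instead of skipping
        if key == "properties":
            for name, spec in value.items():
                fields[name] = spec.get("type", "object")
    return fields

print(extract_fields(mapping_response["my_index"]["mappings"]))
# {'create_date': 'date'}
```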

When trying to read from the index above, the following exception is thrown:

spark.read.format("org.elasticsearch.spark.sql").load("my_index")
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: invalid map received dynamic_date_formats=[MM/dd/yyyy]
  at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseField(FieldParser.java:146)
  at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseMapping(FieldParser.java:88)
  at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseIndexMappings(FieldParser.java:69)
  at org.elasticsearch.hadoop.serialization.dto.mapping.FieldParser.parseMappings(FieldParser.java:40)
  at org.elasticsearch.hadoop.rest.RestClient.getMappings(RestClient.java:321)
  at org.elasticsearch.hadoop.rest.RestClient.getMappings(RestClient.java:307)
  at org.elasticsearch.hadoop.rest.RestRepository.getMappings(RestRepository.java:293)
  at org.elasticsearch.spark.sql.SchemaUtils$.discoverMappingAndGeoFields(SchemaUtils.scala:103)
  at org.elasticsearch.spark.sql.SchemaUtils$.discoverMapping(SchemaUtils.scala:91)
  at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema$lzycompute(DefaultSource.scala:229)
  at org.elasticsearch.spark.sql.ElasticsearchRelation.lazySchema(DefaultSource.scala:229)
  at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:233)
  at org.elasticsearch.spark.sql.ElasticsearchRelation$$anonfun$schema$1.apply(DefaultSource.scala:233)
  at scala.Option.getOrElse(Option.scala:121)
  at org.elasticsearch.spark.sql.ElasticsearchRelation.schema(DefaultSource.scala:233)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:197)
  ... 49 elided
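Until this is supported, a possible workaround is to create the index with an explicit date mapping up front, so the mappings object contains only properties and FieldParser should parse it normally (sketch; my_index_explicit is a hypothetical index name):

```json
PUT my_index_explicit
{
  "mappings": {
    "properties": {
      "create_date": { "type": "date", "format": "MM/dd/yyyy" }
    }
  }
}
```

Existing data would still need to be reindexed into the new index before reading it from Spark.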

Version Info

OS: Ubuntu 20.04
JVM: OpenJDK 64-Bit Server VM, Java 1.8.0_275
Hadoop/Spark: Spark 2.4.7, Scala 2.11.12
ES-Hadoop: elasticsearch-spark-20_2.11-7.7.0
ES: Elasticsearch 7.7
