Create a doc_id field for data exported from the DocumentDb source #4463

Closed
Tracked by #4460
dlvenable opened this issue Apr 24, 2024 · 1 comment
Labels
enhancement New feature or request
dlvenable commented Apr 24, 2024

Every DocumentDb object must have an _id field. The Data Prepper documentdb source should do two things with this value:

  1. Create a metadata field primary_key which is a string representation of that value.
  2. Create a new field docdb_id which matches the data from _id. This should use the same mappings as found in DocumentDb simple representations of BSON types #4458.

OpenSearch does not accept documents that contain an _id field in their body. Storing the value under docdb_id preserves all the data from a DocumentDb object while still allowing a valid OpenSearch document ID.
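To make the two steps above concrete, here is a minimal Python sketch of the transformation. It is only an illustration, not the Data Prepper implementation: the function names `render_key` and `split_document` are hypothetical, and the compact string format (unquoted keys, quoted string values, no spaces) is an assumption inferred from the examples in this issue. Passing a plain string `_id` through unchanged is also an assumption.

```python
import json
from typing import Any, Dict, Tuple


def render_key(value: Any) -> str:
    """Render an _id value as a compact string (hypothetical format)."""
    if isinstance(value, dict):
        # Keys are left unquoted, mirroring the examples in this issue.
        inner = ",".join(f"{k}:{render_key(v)}" for k, v in value.items())
        return "{" + inner + "}"
    if isinstance(value, str):
        # json.dumps adds the surrounding double quotes and escapes the content.
        return json.dumps(value, ensure_ascii=False)
    return str(value)


def split_document(doc: Dict[str, Any], id_key: str = "docdb_id") -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Return (event, metadata) for one DocumentDb document."""
    raw_id = doc["_id"]
    # Copy every field except _id, then re-add the id under the new name.
    event = {k: v for k, v in doc.items() if k != "_id"}
    event[id_key] = raw_id
    # Assumption: a plain string _id is used as-is; anything else is rendered.
    primary_key = raw_id if isinstance(raw_id, str) else render_key(raw_id)
    return event, {"primary_key": primary_key}
```

The event never contains `_id`, so it can be sent to OpenSearch as-is, while `primary_key` stays available in metadata for the sink to use as the document ID.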

Examples

For example, say we have the following DocumentDb documents:

{
  _id: {
    category: "repository", 
    title: "OpenSearch"
  }, 
  url: "https://github.com/opensearch-project/OpenSearch"
}

and

{
  _id: {
    category: "project", 
    title: "opensearch-project"
  }, 
  url: "https://github.com/opensearch-project/"
}

We'd like to retain the original data inside the OpenSearch index.

So the documentdb source should output the following:

{
  docdb_id: {
    category: "repository", 
    title: "OpenSearch"
  }, 
  url: "https://github.com/opensearch-project/OpenSearch"
}
metadata {
  "primary_key" : "{category:\"repository\",title:\"OpenSearch\"}"
}

and

{
  docdb_id: {
    category: "project", 
    title: "opensearch-project"
  }, 
  url: "https://github.com/opensearch-project/"
}
metadata {
  "primary_key" : "{category:\"project\",title:\"opensearch-project\"}"
}

Say the user has the following sink configuration:

sink:
  - opensearch:
      hosts: ["https://localhost:9200"]
      document_id: "${getMetadata(\"primary_key\")}"
      action: "${getMetadata(\"opensearch_action\")}"

Then, in OpenSearch, the second document will look like:

{
  _id: "{category:\"project\",title:\"opensearch-project\"}",
  docdb_id: {
    category: "project", 
    title: "opensearch-project"
  }, 
  url: "https://github.com/opensearch-project/"
}
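
Note that in the view above `_id` is shown next to the fields for readability, but the ID is not part of the document body that gets sent. As a rough sketch of what a bulk request for this document could look like, the following builds the two NDJSON lines of a bulk `index` action. The helper name `bulk_index_lines` and the index name `docdb-export` are hypothetical, and the action is hardcoded to `index` here, whereas the sink configuration above reads it from the `opensearch_action` metadata.

```python
import json
from typing import Any, Dict


def bulk_index_lines(index: str, event: Dict[str, Any], document_id: str) -> str:
    """Build the two NDJSON lines of a bulk 'index' action."""
    action = {"index": {"_index": index, "_id": document_id}}
    # The ID lives in the action line; the source line carries only the event fields.
    return json.dumps(action) + "\n" + json.dumps(event) + "\n"


event = {
    "docdb_id": {"category": "project", "title": "opensearch-project"},
    "url": "https://github.com/opensearch-project/",
}
body = bulk_index_lines(
    "docdb-export",  # hypothetical index name
    event,
    '{category:"project",title:"opensearch-project"}',
)
```

Because the original `_id` object travels inside the body as `docdb_id`, nothing from the DocumentDb object is lost.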

Configuration

Also, we should allow users to configure the name of the field that holds the original _id value in the source output.

documentdb:
  id_key: docdb_id
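
One way the option could be resolved, sketched in Python under some assumptions: the helper `resolve_id_key` is hypothetical, the default of `docdb_id` is taken from the examples in this issue rather than from any confirmed implementation, and rejecting `_id` as a value is a suggested safeguard, not something this issue requires.

```python
from typing import Any, Dict

DEFAULT_ID_KEY = "docdb_id"  # assumed default, taken from the examples in this issue


def resolve_id_key(source_config: Dict[str, Any]) -> str:
    """Return the field name that should hold the original _id value."""
    id_key = source_config.get("id_key", DEFAULT_ID_KEY)
    if not isinstance(id_key, str) or not id_key:
        raise ValueError("id_key must be a non-empty string")
    if id_key == "_id":
        # Writing the value back under _id would defeat the purpose of the rename.
        raise ValueError("id_key must not be '_id'")
    return id_key
```

With this sketch, `resolve_id_key({})` falls back to the assumed default and `resolve_id_key({"id_key": "source_id"})` returns the user-chosen name.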

dlvenable added the enhancement (New feature or request) label and removed the untriaged label on Apr 30, 2024
dlvenable added this to the v2.8 milestone on May 14, 2024

dlvenable commented

Completed by #4512.
