

@karenyrx karenyrx commented Oct 23, 2025

Description

This PR adds auto-detection capability to the gRPC Bulk API to support ingestion of all OpenSearch XContent document formats (CBOR, SMILE, and YAML), not just JSON.

The main motivation is to improve performance via binary formats (CBOR, SMILE). A secondary reason is to maintain feature parity with the HTTP APIs.

Differences: REST Bulk vs gRPC Bulk API

Some differences compared to the HTTP side are:

  1. HTTP cannot support CBOR/YAML in bulk due to an NDJSON stream-parsing limitation; gRPC does not face this limitation
  • The REST Bulk API requires a stream separator (a newline delimiter, \n) to parse the NDJSON format, where each line is either an action-metadata object or a document. (JSON uses \n and SMILE uses 0xFF as the delimiter between documents, but CBOR and YAML have no such delimiters.) Thus HTTP Bulk, which uses NDJSON, cannot support CBOR or YAML. gRPC avoids this because it uses Protobufs with explicit message boundaries (the bulk_request_body[] array), eliminating the need for stream separators.
  2. Format detection in HTTP relies on an explicit header being passed, while gRPC can auto-detect
  • In HTTP, a header (e.g. application/json, application/smile) must be provided to determine the format of the request. The gRPC request parser uses MediaTypeRegistry.mediaTypeFromBytes to auto-detect the document format. An alternative considered was to provide a "document_type" field in the protobuf request to allow the user to set it explicitly, but this didn't seem necessary.
  3. The encoding scope for HTTP is the full payload, while for gRPC it is just the document
  • In HTTP, the full HTTP payload is encoded in the chosen format, but in gRPC only the document is (the full request is embedded inside a protobuf payload). Thus each document in a bulk request can technically use a different encoding.
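Auto-detection works because each format has a distinguishable leading-byte signature. As a rough illustration only (this is a hypothetical sketch, not the actual MediaTypeRegistry.mediaTypeFromBytes implementation), a detector might look like:

```python
def sniff_format(data: bytes) -> str:
    """Illustrative content sniffing by leading bytes.

    Hypothetical sketch; the real detection lives in
    MediaTypeRegistry.mediaTypeFromBytes on the OpenSearch side.
    """
    if data[:2] == b":)":  # SMILE header is ':)\n' plus a version byte
        return "smile"
    if data[:3] == b"---":  # YAML documents conventionally start with '---'
        return "yaml"
    if data[:1] in (b"{", b"["):  # JSON objects/arrays
        return "json"
    # CBOR: leading byte with major type 4 (array) or 5 (map), e.g. 0xb9
    if data and (data[0] >> 5) in (4, 5):
        return "cbor"
    raise ValueError("unrecognized document format")
```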

Test Plan

  1. Send a bulk request with mixed document formats (JSON, CBOR, YAML, and an invalid doc):
grpcurl -plaintext \
  -import-path ~/OpenSearch/ \
  -proto ~/OpenSearch/protos/services/document_service.proto \
  -d @ localhost:9400 \
  org.opensearch.protobufs.services.DocumentService/Bulk <<'EOM'
{
  "index": "movies",
  "bulk_request_body": [
    {
      "operation_container": {
        "create": {
          "x_index": "movies",
          "x_id": "json-doc-1"
        }
      },
      "object": "eyJ0aXRsZSI6IkluY2VwdGlvbiIsInllYXIiOjIwMTB9"
    },
    {
      "operation_container": {
        "create": {
          "x_index": "movies",
          "x_id": "cbor-doc-1"
        }
      },
      "object": "uQACZXRpdGxlaUluY2VwdGlvbmR5ZWFyGQfa"
    },
    {
      "operation_container": {
        "create": {
          "x_index": "movies",
          "x_id": "yaml-doc-1"
        }
      },
      "object": "LS0tCnRpdGxlOiBJbmNlcHRpb24KeWVhcjogMjAxMAo="
    },
   {
      "operation_container": {
        "create": {
          "x_index": "movies",
          "x_id": "invalid-doc-1"
        }
      },
      "object": "//79/A=="
    }
  ]
}
EOM
{
  "items": [
    {
      "create": {
        "xIndex": "movies",
        "xId": {
          "string": "json-doc-1"
        },
        "xPrimaryTerm": "1",
        "result": "created",
        "xSeqNo": "0",
        "xShards": {
          "successful": 1,
          "total": 2
        },
        "xVersion": "1"
      }
    },
    {
      "create": {
        "xIndex": "movies",
        "xId": {
          "string": "cbor-doc-1"
        },
        "xPrimaryTerm": "1",
        "result": "created",
        "xSeqNo": "1",
        "xShards": {
          "successful": 1,
          "total": 2
        },
        "xVersion": "1"
      }
    },
    {
      "create": {
        "xIndex": "movies",
        "xId": {
          "string": "yaml-doc-1"
        },
        "xPrimaryTerm": "1",
        "result": "created",
        "xSeqNo": "2",
        "xShards": {
          "successful": 1,
          "total": 2
        },
        "xVersion": "1"
      }
    },
   {
      "create": {
        "xIndex": "movies",
        "status": 3,
        "xId": {
          "string": "invalid-doc-1"
        },
        "error": {
          "type": "mapper_parsing_exception",
          "reason": "failed to parse",
          "stackTrace": "MapperParsingException[failed to parse]; nested: NotXContentException[Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes];\n\tat org.opensearch.index.mapper.DocumentParser.wrapInMapperParsingException(DocumentParser.java:206)\n\tat org.opensearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:99)\n\tat org.opensearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:253)\n\tat org.opensearch.index.engine.Engine.prepareIndex(Engine.java:1635)\n\tat org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1205)\n\tat org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:1122)\n\tat org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:655)\n\tat org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:481)\n\tat org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:975)\n\tat org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: org.opensearch.core.compress.NotXContentException: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes\n\tat org.opensearch.core.compress.CompressorRegistry.compressor(CompressorRegistry.java:75)\n\tat org.opensearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:109)\n\tat org.opensearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:87)\n\t... 11 more\n",
          "causedBy": {
            "type": "not_x_content_exception",
            "reason": "Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes",
            "stackTrace": "org.opensearch.core.compress.NotXContentException: Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes\n\tat org.opensearch.core.compress.CompressorRegistry.compressor(CompressorRegistry.java:75)\n\tat org.opensearch.common.xcontent.XContentHelper.createParser(XContentHelper.java:109)\n\tat org.opensearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:87)\n\tat org.opensearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:253)\n\tat org.opensearch.index.engine.Engine.prepareIndex(Engine.java:1635)\n\tat org.opensearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:1205)\n\tat org.opensearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:1122)\n\tat org.opensearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:655)\n\tat org.opensearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:481)\n\tat org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:975)\n\tat org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)\n\tat java.base/java.lang.Thread.run(Thread.java:1583)\n"
          }
        }
      }
    }
  ],
  "took": "755"
}
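For reference, the JSON and YAML base64 `object` payloads used in the request above can be reproduced with Python's standard library (the CBOR payload was produced separately by a binary CBOR encoder and is not regenerated here):

```python
import base64
import json

doc = {"title": "Inception", "year": 2010}

# Compact JSON encoding of the document, then base64 for the protobuf bytes field
json_b64 = base64.b64encode(
    json.dumps(doc, separators=(",", ":")).encode()
).decode()

# Equivalent YAML document, base64-encoded the same way
yaml_b64 = base64.b64encode(
    b"---\ntitle: Inception\nyear: 2010\n"
).decode()

print(json_b64)  # eyJ0aXRsZSI6IkluY2VwdGlvbiIsInllYXIiOjIwMTB9
print(yaml_b64)  # LS0tCnRpdGxlOiBJbmNlcHRpb24KeWVhcjogMjAxMAo=
```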
  2. MatchAll query to verify all three docs were successfully ingested and are returned in the format they were ingested in:
grpcurl -plaintext \
  -import-path ~/OpenSearch/ \
  -proto ~/OpenSearch/protos/services/search_service.proto \
  -d @ localhost:9400 \
  org.opensearch.protobufs.services.SearchService/Search <<'EOM'
{
  "search_request_body": {
    "query": {
      "match_all": {}
    }
  }
}
EOM

{
  "took": "90",
  "xShards": {
    "successful": 1,
    "total": 1
  },
  "hits": {
    "total": {
      "totalHits": {
        "relation": "TOTAL_HITS_RELATION_EQ",
        "value": "3"
      }
    },
    "hits": [
      {
        "xIndex": "movies",
        "xId": "json-doc-1",
        "xScore": {
          "double": 1
        },
        "xSource": "eyJ0aXRsZSI6IkluY2VwdGlvbiIsInllYXIiOjIwMTB9"
      },
      {
        "xIndex": "movies",
        "xId": "cbor-doc-1",
        "xScore": {
          "double": 1
        },
        "xSource": "uQACZXRpdGxlaUluY2VwdGlvbmR5ZWFyGQfa"
      },
      {
        "xIndex": "movies",
        "xId": "yaml-doc-1",
        "xScore": {
          "double": 1
        },
        "xSource": "LS0tCnRpdGxlOiBJbmNlcHRpb24KeWVhcjogMjAxMAo="
      }
    ],
    "maxScore": {
      "float": 1
    }
  }
}
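Decoding the returned `xSource` values confirms that each document comes back in the exact bytes it was ingested as (JSON stays JSON, YAML stays YAML, CBOR stays CBOR):

```python
import base64

# xSource values copied from the search response above
json_src = base64.b64decode("eyJ0aXRsZSI6IkluY2VwdGlvbiIsInllYXIiOjIwMTB9")
yaml_src = base64.b64decode("LS0tCnRpdGxlOiBJbmNlcHRpb24KeWVhcjogMjAxMAo=")
cbor_src = base64.b64decode("uQACZXRpdGxlaUluY2VwdGlvbmR5ZWFyGQfa")

print(json_src)  # b'{"title":"Inception","year":2010}'
print(yaml_src)  # b'---\ntitle: Inception\nyear: 2010\n'
# cbor_src starts with b9 00 02: a CBOR map with two entries
```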

Note: SMILE cannot be tested via a grpcurl command, as there is no plaintext representation of a SMILE document. However, unit tests confirm that SMILE format detection and handling work.

Related Issues

Partially resolves #19311

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@karenyrx karenyrx force-pushed the bulkBinary branch 2 times, most recently from 8168c3c to 25991d7 on October 23, 2025 03:33
@karenyrx karenyrx changed the title [GRPC] Add SMILE/CBOR document format support to Bulk GRPC endpoint [GRPC] Add SMILE/CBOR/YAML document format support to Bulk GRPC endpoint Oct 23, 2025
@github-actions

❌ Gradle check result for 25991d7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Karen X <karenxyr@gmail.com>
@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Search:Performance labels Oct 23, 2025
@karenyrx karenyrx marked this pull request as ready for review October 23, 2025 20:54
@karenyrx karenyrx requested a review from a team as a code owner October 23, 2025 20:54
@github-actions

❌ Gradle check result for 4ce3caf: FAILURE



@msfroh

msfroh commented Oct 23, 2025

An alternative considered was to provide a "document_type" field in the protobuf request to allow the user to set it explictly, but this didn't seem necessary.

This isn't a one-way door, right? If we ever find that the autodetected type is wrong, we have the option of adding a document_type field that will bypass the autodetection.


@msfroh msfroh left a comment


Nice, simple improvement!

@github-actions

❌ Gradle check result for 41bfa33: FAILURE



@github-actions

❌ Gradle check result for 9ad3d33: FAILURE


@github-actions

✅ Gradle check result for 9ad3d33: SUCCESS

@codecov

codecov bot commented Oct 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.09%. Comparing base (a250c35) to head (7499a2e).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19744      +/-   ##
============================================
- Coverage     73.15%   73.09%   -0.06%     
+ Complexity    70958    70940      -18     
============================================
  Files          5736     5736              
  Lines        324734   324743       +9     
  Branches      46979    46980       +1     
============================================
- Hits         237548   237380     -168     
- Misses        68031    68252     +221     
+ Partials      19155    19111      -44     


@github-actions

❌ Gradle check result for 7499a2e: FAILURE


@github-actions

❕ Gradle check result for 7499a2e: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@karenyrx karenyrx merged commit c8734db into opensearch-project:main Oct 27, 2025
34 of 36 checks passed

Development

Successfully merging this pull request may close these issues.

[Feature Request] Support SMILE as a document format for GRPC Bulk requests and Search responses
