@dadoonet dadoonet commented Jun 15, 2018

Note: this is a backport of #28977 which does not require a review.
As I forgot to backport it, I'd just like to double-check that CI is happy with it.

I'll update the PR if CI is unhappy and then will ask for a review.


Today we support a global indexed_chars processor parameter, but in some cases users would like to set this limit per document.
The mapper-attachments plugin used to support this by extracting the limit value from a meta field in the document sent for indexing.

This change adds a setting named indexed_chars_field, which names the document field the limit value is read from.

This allows running:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information. Used to parse pdf and office files",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars_field" : "size"
      }
    }
  ]
}

Then index either:

PUT index/doc/1?pipeline=attachment
{
  "data": "BASE64"
}

This will use the default value (or the one defined by indexed_chars).

Or, to set the limit to 1000 characters for this document only:

PUT index/doc/2?pipeline=attachment
{
  "data": "BASE64",
  "size": 1000
}
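
The fallback behaviour described above can be sketched in a few lines of Java. This is only an illustration, not the actual AttachmentProcessor code: the class name `IndexedCharsSketch`, the method `resolveLimit`, and the use of a plain `Map` for the document are all hypothetical, and the 100000 default is an assumption taken from the documented default of indexed_chars.

```java
import java.util.Map;

public class IndexedCharsSketch {
    // Assumed global default for indexed_chars (hypothetical constant for this sketch).
    static final int DEFAULT_INDEXED_CHARS = 100_000;

    /**
     * Returns the per-document limit when indexedCharsField is configured and the
     * document carries a numeric value for it; otherwise returns the global limit.
     */
    static int resolveLimit(Map<String, Object> doc, String indexedCharsField, int indexedChars) {
        if (indexedCharsField != null) {
            Object value = doc.get(indexedCharsField);
            if (value instanceof Number) {
                return ((Number) value).intValue();
            }
        }
        // Field not configured or absent from the document: use the global limit.
        return indexedChars;
    }

    public static void main(String[] args) {
        Map<String, Object> doc1 = Map.of("data", "BASE64");
        Map<String, Object> doc2 = Map.of("data", "BASE64", "size", 1000);
        System.out.println(resolveLimit(doc1, "size", DEFAULT_INDEXED_CHARS)); // prints 100000
        System.out.println(resolveLimit(doc2, "size", DEFAULT_INDEXED_CHARS)); // prints 1000
    }
}
```

With the pipeline above, document 1 would fall back to the global limit and document 2 would be limited to 1000 characters.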

Backport of #28977 in 6.x branch (6.4.0)

@dadoonet dadoonet self-assigned this Jun 15, 2018
@colings86 colings86 added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP labels Jun 15, 2018
@elasticmachine

Pinging @elastic/es-core-infra

@martijnvg martijnvg left a comment

lgtm

@martijnvg

@dadoonet The CI build failed because of a checkstyle violation: [ERROR] /var/lib/jenkins/workspace/elastic+elasticsearch+pull-request/plugins/ingest-attachment/src/main/java/org/elasticsearch/ingest/attachment/AttachmentProcessor.java:32:8: Unused import - java.io.IOException. [UnusedImports]

@dadoonet dadoonet merged commit 2df907c into elastic:6.x Jun 16, 2018
@dadoonet dadoonet deleted the backport/28977-6x branch June 16, 2018 03:35
dnhatn added a commit that referenced this pull request Jun 19, 2018
* 6.x:
  Add get stored script and delete stored script to high level REST API
  Increasing skip version for failing test on 6.x
  Skip get_alias tests for 5.x (#31397)
  Fix defaults in GeoShapeFieldMapper output (#31302)
  Test: better error message on failure
  Mute DefaultShardsIT#testDefaultShards test
  Fix reference to XContentBuilder.string() (#31337)
  [DOCS] Adds monitoring breaking change (#31369)
  [DOCS] Adds security breaking change (#31375)
  [DOCS] Backports breaking change (#31373)
  RestAPI: Reject forcemerge requests with a body (#30792)
  Docs: Use the default distribution to test docs (#31251)
  Use system context for cluster state update tasks (#31241)
  [DOCS] Adds testing for security APIs (#31345)
  [DOCS] Removes ML item from release highlights
  [DOCS] Removes breaking change (#31376)
  REST high-level client: add validate query API (#31077)
  Move language analyzers from server to analysis-common module. (#31300)
  Expose lucene's RemoveDuplicatesTokenFilter (#31275)
  [Test] Fix :example-plugins:rest-handler on Windows
  Delete typos in SAML docs (#31199)
  Ensure we don't use a remote profile if cluster name matches (#31331)
  Test: Skip alias tests that failed all weekend
  [DOCS] Fix version in SQL JDBC Maven template
  [DOCS] Improve install and setup section for SQL JDBC
  Add ingest-attachment support for per document `indexed_chars` limit (#31352)
  SQL: Fix rest endpoint names in node stats (#31371)
  [DOCS] Fixes small issue in release notes
  Support for remote path in reindex api Closes #22913
  [ML] Put ML filter API response should contain the filter (#31362)
  Remove trial status info from start trial doc (#31365)
  [DOCS] Added links in breaking changes pages
  [DOCS] Adds links to release notes and highlights
  Docs: Document changes in rest client
  QA: Fix tribe tests to use node selector
  REST Client: NodeSelector for node attributes (#31296)
  LLClient: Fix assertion on windows
  LLClient: Support host selection (#30523)
  Add QA project and fixture based test for discovery-ec2 plugin (#31107)
  [ML] Hold ML filter items in sorted set (#31338)
  [ML] Add description to ML filters (#31330)
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement v6.4.0