Skip to content

Use the primary_term field to identify parent documents #27469

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Nov 21, 2017

Conversation

s1monw
Copy link
Contributor

@s1monw s1monw commented Nov 21, 2017

This change stops indxing the _primary_term field for nested documents
to allow fast retrieval of parent documents. Today we create a docvalues
field for children to ensure we have a dense datastructure on disk. Yet,
since we only use the primary term to tie-break on when we see the same
seqID on indexing having a dense datastructure is less important. We can
use this now to improve the nested docs performance and it's memory footprint.

Relates to #24362

This change stops indxing the `_primary_term` field for nested documents
to allow fast retrieval of parent documents. Today we create a docvalues
field for children to ensure we have a dense datastructure on disk. Yet,
since we only use the primary term to tie-break on when we see the same
seqID on indexing having a dense datastructure is less important. We can
use this now to improve the nested docs performance and it's memory footprint.

Relates to elastic#24362
Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I can't speak to the implications on the lucene side - like whether we want to index an illegal value (0) for the primary term vs using the mentioned bitset and the exists query. I presume the bit set will be faster for reads and probably smaller index wise (as the doc values will be compressed better).

final Document doc = context.docs().get(i);
doc.add(seqID.seqNo);
doc.add(seqID.seqNoDocValue);
doc.add(seqID.primaryTerm);
if (includePrimaryTerm) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a comment saying that primary terms are used to distinguish between top level (parent) docs and nested ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 will do that

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I left a few small comments.


public class QueriesTests extends ESTestCase {

public void testNonNestedQuery() {
// This is a custom query that extends AutomatonQuery and want to make sure the equals method works
assertEquals(Queries.newNonNestedFilter(), Queries.newNonNestedFilter());
assertEquals(Queries.newNonNestedFilter().hashCode(), Queries.newNonNestedFilter().hashCode());
Version version = VersionUtils.randomVersion(random());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of a random version maybe also test both pre 7.0.0 and post 7.0.0 specifically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test that checks all versions

@@ -314,7 +326,7 @@ public void testResetRootDocId() throws Exception {
fieldType.setName(VALUE_FIELD_NAME);

BooleanQuery.Builder bq = new BooleanQuery.Builder();
bq.add(Queries.newNonNestedFilter(), BooleanClause.Occur.MUST);
bq.add(Queries.newNonNestedFilter(Version.CURRENT), BooleanClause.Occur.MUST);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

randomize version? The test should be able to handle both the old and new way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

@@ -252,11 +253,16 @@ public void postParse(ParseContext context) throws IOException {
// we share the parent docs fields to ensure good compression
SequenceIDFields seqID = context.seqID();
assert seqID != null;
for (int i = 1; i < context.docs().size(); i++) {
int numDocs = context.docs().size();
final Version versionCreated = context.mapperService().getIndexSettings().getIndexVersionCreated();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this change, but I think we should have add QueryShardContext#getIndexVersionCreated() helper method that does this: mapperService().getIndexSettings().getIndexVersionCreated().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arg... I meant ParseContext#getIndexVersionCreated()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will open a followup but I think we need to have a more common class across all index level contexts maybe a base class?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me, thanks for tackling it.

Regarding implications, my understanding is that primary term lookups are rare so this change should not slow down indexing, even it might make primary term lookups slower. The current way that things are designed, doc value lookups may be linear (this typically happens in 2 cases: if the field is sparse of if splitting into blocks proves to give better compression) but with a very high constant-factor of 2^16. So say you have a segment with 100M documents (which is pretty large for a segment), Lucene will be looping over 1.5k blocks of documents until it reaches the right one.

@s1monw
Copy link
Contributor Author

s1monw commented Nov 21, 2017

@jpountz @martijnvg please take another look I had to work around a lucene bug

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still LGTM

@s1monw s1monw merged commit 5a0b6d1 into elastic:master Nov 21, 2017
s1monw added a commit that referenced this pull request Nov 21, 2017
s1monw added a commit that referenced this pull request Nov 21, 2017
This change stops indexing the `_primary_term` field for nested documents
to allow fast retrieval of parent documents. Today we create a docvalues
field for children to ensure we have a dense datastructure on disk. Yet,
since we only use the primary term to tie-break on when we see the same
seqID on indexing having a dense datastructure is less important. We can
use this now to improve the nested docs performance and it's memory footprint.

Relates to #24362
jasontedor added a commit to olcbean/elasticsearch that referenced this pull request Nov 21, 2017
* master: (41 commits)
  [Test] Fix AggregationsTests#testFromXContentWithRandomFields
  [DOC] Fix mathematical representation on interval (range) (elastic#27450)
  Update version check for CCS optional remote clusters
  Bump BWC version to 6.1.0 for elastic#27469
  Adapt rest test BWC version after backport
  Fix dynamic mapping update generation. (elastic#27467)
  Use the primary_term field to identify parent documents (elastic#27469)
  Move composite aggregation to core (elastic#27474)
  Fix test BWC version after backport
  Protect shard splitting from illegal target shards (elastic#27468)
  Cross Cluster Search: make remote clusters optional (elastic#27182)
  [Docs] Fix broken bulleted lists (elastic#27470)
  Move resync request serialization assertion
  Fix resync request serialization
  Fix issue where pages aren't released (elastic#27459)
  Add YAML REST tests for filters bucket agg (elastic#27128)
  Remove tcp profile from low level nio channel (elastic#27441)
  [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure
  Transition transport apis to use void listeners (elastic#27440)
  AwaitsFix GeoShapeQueryTests#testPointsOnly elastic#27454
  ...
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Nested Docs labels Feb 14, 2018
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
przemekwitek pushed a commit to przemekwitek/elasticsearch that referenced this pull request Jul 29, 2019
We changed things in elastic#27469 to filter parent docs by using an exists query on the 
primary_term field. However, the equivalent query for nested documents is still using 
the type field. This commit makes newNestedFilter build the complement of 
newNonNestedFilter instead.
jkakavas pushed a commit that referenced this pull request Jul 31, 2019
We changed things in #27469 to filter parent docs by using an exists query on the 
primary_term field. However, the equivalent query for nested documents is still using 
the type field. This commit makes newNestedFilter build the complement of 
newNonNestedFilter instead.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories v6.1.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants