Use the primary_term field to identify parent documents #27469

s1monw · 2017-11-21T09:03:32Z

This change stops indxing the _primary_term field for nested documents
to allow fast retrieval of parent documents. Today we create a docvalues
field for children to ensure we have a dense datastructure on disk. Yet,
since we only use the primary term to tie-break on when we see the same
seqID on indexing having a dense datastructure is less important. We can
use this now to improve the nested docs performance and it's memory footprint.

Relates to #24362

This change stops indxing the `_primary_term` field for nested documents to allow fast retrieval of parent documents. Today we create a docvalues field for children to ensure we have a dense datastructure on disk. Yet, since we only use the primary term to tie-break on when we see the same seqID on indexing having a dense datastructure is less important. We can use this now to improve the nested docs performance and it's memory footprint. Relates to elastic#24362

bleskes

This looks good to me. I can't speak to the implications on the lucene side - like whether we want to index an illegal value (0) for the primary term vs using the mentioned bitset and the exists query. I presume the bit set will be faster for reads and probably smaller index wise (as the doc values will be compressed better).

bleskes · 2017-11-21T09:17:47Z

core/src/main/java/org/elasticsearch/index/mapper/SeqNoFieldMapper.java

            final Document doc = context.docs().get(i);
            doc.add(seqID.seqNo);
            doc.add(seqID.seqNoDocValue);
-            doc.add(seqID.primaryTerm);
+            if (includePrimaryTerm) {


can we add a comment saying that primary terms are used to distinguish between top level (parent) docs and nested ones.

+1 will do that

martijnvg

LGTM! I left a few small comments.

martijnvg · 2017-11-21T09:52:24Z

core/src/test/java/org/apache/lucene/search/QueriesTests.java


 public class QueriesTests extends ESTestCase {

    public void testNonNestedQuery() {
        // This is a custom query that extends AutomatonQuery and want to make sure the equals method works
-        assertEquals(Queries.newNonNestedFilter(), Queries.newNonNestedFilter());
-        assertEquals(Queries.newNonNestedFilter().hashCode(), Queries.newNonNestedFilter().hashCode());
+        Version version = VersionUtils.randomVersion(random());


instead of a random version maybe also test both pre 7.0.0 and post 7.0.0 specifically?

I added a test that checks all versions

martijnvg · 2017-11-21T09:53:15Z

...src/test/java/org/elasticsearch/search/aggregations/bucket/nested/NestedAggregatorTests.java

@@ -314,7 +326,7 @@ public void testResetRootDocId() throws Exception {
                fieldType.setName(VALUE_FIELD_NAME);

                BooleanQuery.Builder bq = new BooleanQuery.Builder();
-                bq.add(Queries.newNonNestedFilter(), BooleanClause.Occur.MUST);
+                bq.add(Queries.newNonNestedFilter(Version.CURRENT), BooleanClause.Occur.MUST);


randomize version? The test should be able to handle both the old and new way.

martijnvg · 2017-11-21T10:00:58Z

core/src/main/java/org/elasticsearch/index/mapper/SeqNoFieldMapper.java

@@ -252,11 +253,16 @@ public void postParse(ParseContext context) throws IOException {
        // we share the parent docs fields to ensure good compression
        SequenceIDFields seqID = context.seqID();
        assert seqID != null;
-        for (int i = 1; i < context.docs().size(); i++) {
+        int numDocs = context.docs().size();
+        final Version versionCreated = context.mapperService().getIndexSettings().getIndexVersionCreated();


not related to this change, but I think we should have add QueryShardContext#getIndexVersionCreated() helper method that does this: mapperService().getIndexSettings().getIndexVersionCreated().

Arg... I meant ParseContext#getIndexVersionCreated()

I will open a followup but I think we need to have a more common class across all index level contexts maybe a base class?

jpountz

The change looks good to me, thanks for tackling it.

Regarding implications, my understanding is that primary term lookups are rare so this change should not slow down indexing, even it might make primary term lookups slower. The current way that things are designed, doc value lookups may be linear (this typically happens in 2 cases: if the field is sparse of if splitting into blocks proves to give better compression) but with a very high constant-factor of 2^16. So say you have a segment with 100M documents (which is pretty large for a segment), Lucene will be looping over 1.5k blocks of documents until it reaches the right one.

s1monw · 2017-11-21T13:36:52Z

@jpountz @martijnvg please take another look I had to work around a lucene bug

martijnvg

Still LGTM

This change stops indexing the `_primary_term` field for nested documents to allow fast retrieval of parent documents. Today we create a docvalues field for children to ensure we have a dense datastructure on disk. Yet, since we only use the primary term to tie-break on when we see the same seqID on indexing having a dense datastructure is less important. We can use this now to improve the nested docs performance and it's memory footprint. Relates to #24362

* master: (41 commits) [Test] Fix AggregationsTests#testFromXContentWithRandomFields [DOC] Fix mathematical representation on interval (range) (elastic#27450) Update version check for CCS optional remote clusters Bump BWC version to 6.1.0 for elastic#27469 Adapt rest test BWC version after backport Fix dynamic mapping update generation. (elastic#27467) Use the primary_term field to identify parent documents (elastic#27469) Move composite aggregation to core (elastic#27474) Fix test BWC version after backport Protect shard splitting from illegal target shards (elastic#27468) Cross Cluster Search: make remote clusters optional (elastic#27182) [Docs] Fix broken bulleted lists (elastic#27470) Move resync request serialization assertion Fix resync request serialization Fix issue where pages aren't released (elastic#27459) Add YAML REST tests for filters bucket agg (elastic#27128) Remove tcp profile from low level nio channel (elastic#27441) [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure Transition transport apis to use void listeners (elastic#27440) AwaitsFix GeoShapeQueryTests#testPointsOnly elastic#27454 ...

We changed things in elastic#27469 to filter parent docs by using an exists query on the primary_term field. However, the equivalent query for nested documents is still using the type field. This commit makes newNestedFilter build the complement of newNonNestedFilter instead.

We changed things in #27469 to filter parent docs by using an exists query on the primary_term field. However, the equivalent query for nested documents is still using the type field. This commit makes newNestedFilter build the complement of newNonNestedFilter instead.

s1monw added :Nested Docs >enhancement v6.1.0 v7.0.0 labels Nov 21, 2017

s1monw requested review from jpountz, martijnvg and bleskes November 21, 2017 09:03

add missing file

b65adf5

bleskes approved these changes Nov 21, 2017

View reviewed changes

martijnvg approved these changes Nov 21, 2017

View reviewed changes

jpountz approved these changes Nov 21, 2017

View reviewed changes

s1monw added 3 commits November 21, 2017 11:37

apply review comments

d1c98f8

add workaround for lucene bug

bcbfe46

reference lucene issue

05cd180

martijnvg approved these changes Nov 21, 2017

View reviewed changes

s1monw merged commit 5a0b6d1 into elastic:master Nov 21, 2017

s1monw added a commit that referenced this pull request Nov 21, 2017

Bump BWC version to 6.1.0 for #27469

cc78b24

clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Nested Docs labels Feb 14, 2018

jeffreynscrbdee mentioned this pull request Sep 25, 2018

Extra Lucene DocValueExistsQuery fired - due to nested mapping and primary_terms #34067

Closed

jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

romseygeek mentioned this pull request Jul 17, 2019

Don't use TypeField for nested filters #44482

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use the primary_term field to identify parent documents #27469

Use the primary_term field to identify parent documents #27469

Uh oh!

s1monw commented Nov 21, 2017

Uh oh!

bleskes left a comment

Uh oh!

bleskes Nov 21, 2017

Uh oh!

s1monw Nov 21, 2017

Uh oh!

martijnvg left a comment

Uh oh!

martijnvg Nov 21, 2017

Uh oh!

s1monw Nov 21, 2017

Uh oh!

martijnvg Nov 21, 2017

Uh oh!

s1monw Nov 21, 2017

Uh oh!

martijnvg Nov 21, 2017

Uh oh!

martijnvg Nov 21, 2017

Uh oh!

s1monw Nov 21, 2017

Uh oh!

jpountz left a comment

Uh oh!

s1monw commented Nov 21, 2017

Uh oh!

martijnvg left a comment

Uh oh!

Uh oh!

Use the primary_term field to identify parent documents #27469

Use the primary_term field to identify parent documents #27469

Uh oh!

Conversation

s1monw commented Nov 21, 2017

Uh oh!

bleskes left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martijnvg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jpountz left a comment

Choose a reason for hiding this comment

Uh oh!

s1monw commented Nov 21, 2017

Uh oh!

martijnvg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!