-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Use the primary_term field to identify parent documents #27469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This change stops indxing the `_primary_term` field for nested documents to allow fast retrieval of parent documents. Today we create a docvalues field for children to ensure we have a dense datastructure on disk. Yet, since we only use the primary term to tie-break on when we see the same seqID on indexing having a dense datastructure is less important. We can use this now to improve the nested docs performance and it's memory footprint. Relates to elastic#24362
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I can't speak to the implications on the lucene side - like whether we want to index an illegal value (0) for the primary term vs using the mentioned bitset and the exists query. I presume the bit set will be faster for reads and probably smaller index wise (as the doc values will be compressed better).
final Document doc = context.docs().get(i); | ||
doc.add(seqID.seqNo); | ||
doc.add(seqID.seqNoDocValue); | ||
doc.add(seqID.primaryTerm); | ||
if (includePrimaryTerm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a comment saying that primary terms are used to distinguish between top level (parent) docs and nested ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 will do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I left a few small comments.
|
||
public class QueriesTests extends ESTestCase { | ||
|
||
public void testNonNestedQuery() { | ||
// This is a custom query that extends AutomatonQuery and want to make sure the equals method works | ||
assertEquals(Queries.newNonNestedFilter(), Queries.newNonNestedFilter()); | ||
assertEquals(Queries.newNonNestedFilter().hashCode(), Queries.newNonNestedFilter().hashCode()); | ||
Version version = VersionUtils.randomVersion(random()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of a random version maybe also test both pre 7.0.0 and post 7.0.0 specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a test that checks all versions
@@ -314,7 +326,7 @@ public void testResetRootDocId() throws Exception { | |||
fieldType.setName(VALUE_FIELD_NAME); | |||
|
|||
BooleanQuery.Builder bq = new BooleanQuery.Builder(); | |||
bq.add(Queries.newNonNestedFilter(), BooleanClause.Occur.MUST); | |||
bq.add(Queries.newNonNestedFilter(Version.CURRENT), BooleanClause.Occur.MUST); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
randomize version? The test should be able to handle both the old and new way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
@@ -252,11 +253,16 @@ public void postParse(ParseContext context) throws IOException { | |||
// we share the parent docs fields to ensure good compression | |||
SequenceIDFields seqID = context.seqID(); | |||
assert seqID != null; | |||
for (int i = 1; i < context.docs().size(); i++) { | |||
int numDocs = context.docs().size(); | |||
final Version versionCreated = context.mapperService().getIndexSettings().getIndexVersionCreated(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related to this change, but I think we should have add QueryShardContext#getIndexVersionCreated()
helper method that does this: mapperService().getIndexSettings().getIndexVersionCreated()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arg... I meant ParseContext#getIndexVersionCreated()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will open a followup but I think we need to have a more common class across all index level contexts maybe a base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good to me, thanks for tackling it.
Regarding implications, my understanding is that primary term lookups are rare so this change should not slow down indexing, even it might make primary term lookups slower. The current way that things are designed, doc value lookups may be linear (this typically happens in 2 cases: if the field is sparse of if splitting into blocks proves to give better compression) but with a very high constant-factor of 2^16. So say you have a segment with 100M documents (which is pretty large for a segment), Lucene will be looping over 1.5k blocks of documents until it reaches the right one.
@jpountz @martijnvg please take another look I had to work around a lucene bug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still LGTM
This change stops indexing the `_primary_term` field for nested documents to allow fast retrieval of parent documents. Today we create a docvalues field for children to ensure we have a dense datastructure on disk. Yet, since we only use the primary term to tie-break on when we see the same seqID on indexing having a dense datastructure is less important. We can use this now to improve the nested docs performance and it's memory footprint. Relates to #24362
* master: (41 commits) [Test] Fix AggregationsTests#testFromXContentWithRandomFields [DOC] Fix mathematical representation on interval (range) (elastic#27450) Update version check for CCS optional remote clusters Bump BWC version to 6.1.0 for elastic#27469 Adapt rest test BWC version after backport Fix dynamic mapping update generation. (elastic#27467) Use the primary_term field to identify parent documents (elastic#27469) Move composite aggregation to core (elastic#27474) Fix test BWC version after backport Protect shard splitting from illegal target shards (elastic#27468) Cross Cluster Search: make remote clusters optional (elastic#27182) [Docs] Fix broken bulleted lists (elastic#27470) Move resync request serialization assertion Fix resync request serialization Fix issue where pages aren't released (elastic#27459) Add YAML REST tests for filters bucket agg (elastic#27128) Remove tcp profile from low level nio channel (elastic#27441) [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure Transition transport apis to use void listeners (elastic#27440) AwaitsFix GeoShapeQueryTests#testPointsOnly elastic#27454 ...
We changed things in elastic#27469 to filter parent docs by using an exists query on the primary_term field. However, the equivalent query for nested documents is still using the type field. This commit makes newNestedFilter build the complement of newNonNestedFilter instead.
We changed things in #27469 to filter parent docs by using an exists query on the primary_term field. However, the equivalent query for nested documents is still using the type field. This commit makes newNestedFilter build the complement of newNonNestedFilter instead.
This change stops indxing the
_primary_term
field for nested documentsto allow fast retrieval of parent documents. Today we create a docvalues
field for children to ensure we have a dense datastructure on disk. Yet,
since we only use the primary term to tie-break on when we see the same
seqID on indexing having a dense datastructure is less important. We can
use this now to improve the nested docs performance and it's memory footprint.
Relates to #24362