---
# Copyright Vespa.ai. All rights reserved.
title: "Proton"
redirect_from:
- /en/reference/search-custom-state-api.html
---
<p>
Proton is Vespa's search core and runs on each content node as the <em>vespa-proton-bin</em> process.
Proton maintains disk and memory structures for documents (organized per document type),
handles <a href="reads-and-writes.html#operations">read and write operations</a>,
and executes <a href="#queries">queries</a>.
As the document data is dynamic, disk and memory structures are periodically optimized by
<a href="#proton-maintenance-jobs">maintenance jobs</a>.
</p>
<p>
The content node has a <em>bucket management system</em>
which sends requests to a set of <em>document databases</em>,
each of which consists of three <em>sub-databases</em>:
<code>ready</code>, <code>not ready</code> and <code>removed</code>:
</p>
<img src="/assets/img/proton-feed.svg" alt="Proton feed overview" width="930px" height="auto"/>
<h3 id="bucket-management">Bucket management</h3>
<p>
When the node starts up, it first needs to get an overview of which documents and buckets it has.
Once metadata for all buckets is known, the content node transitions from the down to the up state.
As the distributors want quick access to bucket metadata, the content node maintains an in-memory bucket
database to serve these requests efficiently.
The state of the bucket database can always be reconstructed from the durably persisted
search node state, but this is expensive and therefore only happens at process startup.
</p>
<p>
This database is considered the source of truth for the state of the node's bucket metadata
for the duration of the process' lifetime. As incoming operations mutate the state of the
documents on the node, it is critical that the database is always kept in sync with these changes.
</p>
<h3 id="persistence-threads-and-operation-dispatching">Persistence threads and operation dispatching</h3>
<p>
A content node has a pool of <em>persistence threads</em> that is created at startup
and remains fixed in size for the lifetime of the process. It is the responsibility of the
persistence threads to schedule incoming write and read operations received by the content
node, dispatch them to the search core, and ensure the bucket database remains in sync
with changes caused by write operations.
</p>
<p>
Unless explicitly configured, the size of the thread pool is automatically set based on the
number of CPU cores available.
</p>
<p>
Persistence threads are backed by a <em>persistence queue</em>. Read/write-operations received
by the RPC subsystem are pushed onto this queue. The queue is operation deadline-aware; if
an operation has exceeded its deadline while enqueued, it is immediately failed back to
the sender without being executed. This avoids a failure scenario where a heavily
loaded node spends more and more time processing already-doomed operations
because it cannot drain its queue quickly enough.
</p>
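<p>
To illustrate the deadline check, here is a minimal sketch in Python - illustrative only,
not proton's actual C++ implementation, and all names are hypothetical:
</p>
<pre>{% highlight python %}
import time
from collections import deque

class DeadlineAwareQueue:
    """Toy model of a deadline-aware persistence queue."""

    def __init__(self):
        self._queue = deque()  # entries: (deadline, operation, on_timeout)

    def push(self, deadline, operation, on_timeout):
        self._queue.append((deadline, operation, on_timeout))

    def pop(self):
        """Return the next live operation; fail expired ones without executing them."""
        while self._queue:
            deadline, operation, on_timeout = self._queue.popleft()
            if time.monotonic() > deadline:
                on_timeout(operation)  # failed back to the sender immediately
            else:
                return operation
        return None
{% endhighlight %}</pre>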
<p>
All operations bound for a particular data bucket (such as Puts, Gets etc.) execute in the
context of a <em>bucket lock</em>. Locks are <em>shared</em> for reads and <em>exclusive</em> for writes.
This means that multiple read operations can execute in parallel for the same bucket, but
only one write operation can execute for a bucket at any given time (and no reads can be
started concurrently with existing writes for a bucket, and vice versa). Note that some of
these locking restrictions can be relaxed when it's safe to do so—see
<a href="#performance-optimizations">performance optimizations</a> for details.
</p>
<p>
If a persistence thread tries to pop an operation from the queue and sees that the bucket it's
bound for is already locked, it will leave the operation in place in the queue and try the next
operation(s) instead. This means that although the queue acts as a FIFO for client operations
towards a <em>single</em> bucket, this is not the case across <em>multiple</em> buckets.
</p>
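<p>
A sketch of this scheduling rule (again illustrative Python, not the real scheduler):
</p>
<pre>{% highlight python %}
def next_runnable(queue, lock_table):
    """Pop the first operation whose bucket lock can be acquired.
    Buckets that are locked (or already skipped) are passed over, which
    preserves FIFO order per bucket while other buckets proceed."""
    skipped_buckets = set()
    for index, op in enumerate(queue):
        if op.bucket_id in skipped_buckets:
            continue
        # Reads take the bucket lock shared, writes take it exclusive.
        if lock_table[op.bucket_id].try_acquire(shared=op.is_read):
            del queue[index]  # other queued operations are left in place
            return op
        skipped_buckets.add(op.bucket_id)
    return None  # everything is currently blocked on bucket locks
{% endhighlight %}</pre>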
<h4 id="write-operations">Write operations</h4>
<p>
Write operations are dispatched as <em>asynchronous</em>—i.e. non-blocking—tasks to the search core.
This increases parallelism by freeing up persistence threads to handle other operations, and a deeper
pipeline enables the search core to optimize transaction log synchronization and batching of data
structure updates.
</p>
<p>
Since a deeper pipeline comes at the potential cost of increased latency when many operations are in
flight, the maximum number of concurrent asynchronous operations is bounded by an adaptive persistence
throttling mechanism. The throttler will dynamically scale the window of concurrency until it reaches a
saturation point where further increasing the window size also results in increased operation latencies.
When the number of in-flight operations hits the current maximum, persistence threads will not
dispatch any more writes until the number goes down. Reads can still be processed during this time.
</p>
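<p>
The window scaling resembles congestion-control style additive-increase/multiplicative-decrease.
A hedged sketch of the idea - the actual algorithm, thresholds and parameters in proton differ:
</p>
<pre>{% highlight python %}
class AdaptiveThrottler:
    """Toy concurrency window: grow while latencies stay near the baseline,
    back off when a larger window only makes operations slower."""

    def __init__(self, window=16, min_window=1):
        self.window = window
        self.min_window = min_window
        self.smoothed_latency = None

    def may_dispatch(self, in_flight):
        return in_flight < self.window

    def on_complete(self, latency):
        if self.smoothed_latency is None:
            self.smoothed_latency = latency
        if latency < 1.5 * self.smoothed_latency:
            self.window += 1  # additive increase while latency is stable
        else:
            self.window = max(self.min_window, self.window // 2)  # back off
        # Exponentially weighted moving average of observed latency.
        self.smoothed_latency = 0.95 * self.smoothed_latency + 0.05 * latency
{% endhighlight %}</pre>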
<p>
An asynchronous write-task holds on to the exclusive bucket lock for the duration of its lifetime.
Once the search core completes the write operation, the bucket database is updated with the new metadata
state of the bucket (which reflects the side effects of the write) prior to releasing the lock.
An operation reply is then generated and sent back via the RPC subsystem.
</p>
<h4 id="read-operations">Read operations</h4>
<p>
Read operations are always evaluated <em>synchronously</em>—i.e. blocking—by persistence threads.
To avoid having potentially expensive maintenance read operations (such as those used for
<a href="content/consistency.html#replica-reconciliation">replica reconciliation</a>) block client
operations for prolonged amounts of time, a subset of the persistence threads are <em>not</em> allowed
to process such maintenance operations.
</p>
<p>
Note that the condition evaluation step of a test-and-set write is considered a <em>read</em> sub-operation
and is therefore done synchronously. Since it's part of a write operation, it happens atomically in
the context of the exclusive lock of the higher-level operation.
</p>
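<p>
Conceptually, a test-and-set write therefore looks like this (illustrative Python;
the names and store API are hypothetical):
</p>
<pre>{% highlight python %}
def test_and_set_put(bucket_lock, store, doc_id, new_doc, condition):
    """Evaluate the condition (a read sub-operation) and apply the write
    atomically under the same exclusive bucket lock."""
    with bucket_lock.exclusive():
        current = store.get(doc_id)     # synchronous read sub-operation
        if current is None or not condition(current):
            return "condition-not-met"  # nothing written
        store.put(doc_id, new_doc)
        return "ok"
{% endhighlight %}</pre>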
<h4 id="performance-optimizations">Performance optimizations</h4>
<p>
To reduce thread context switches, some write operations may bypass the persistence thread queues
and be directly asynchronously dispatched to the search core from the RPC thread the operation
was received at.
Such operations must still successfully acquire the appropriate exclusive bucket lock—if the
lock cannot be immediately acquired the operation is pushed onto the persistence queue instead.
</p>
<p>
To reduce lock contention and thread wakeups, smaller numbers of persistence threads are grouped
together in <em>stripes</em> that share a dedicated per-stripe persistence queue.
Operations are routed deterministically to a particular stripe based on their bucket ID, meaning
that stripes operate on non-overlapping parts of the bucket space. Together, the many stripes and
queues form one higher level <em>logical</em> queue that covers the entire bucket space.
</p>
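<p>
A sketch of the deterministic stripe routing (illustrative Python; the real mapping
from bucket ID to stripe is an implementation detail):
</p>
<pre>{% highlight python %}
NUM_STRIPES = 8  # hypothetical - the real count is derived from configuration

stripe_queues = [[] for _ in range(NUM_STRIPES)]

def stripe_of(bucket_id: int) -> int:
    """Same bucket always maps to the same stripe, so stripes own
    disjoint parts of the bucket space."""
    return bucket_id % NUM_STRIPES

def enqueue(operation, bucket_id):
    stripe_queues[stripe_of(bucket_id)].append(operation)
{% endhighlight %}</pre>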
<p>
If the queue contains multiple <em>non-conflicting</em> write operations to the same bucket, these
may be dispatched in parallel in the context of the <em>same</em> write lock. This avoids having
to wait for an entire lock-execute-unlock roundtrip prior to dispatching the next write for the
same bucket. An example of conflicting writes is multiple Puts to the same document ID.
The maximum number of operations dispatched in parallel is implementation-defined.
</p>
<h3 id="document-database">Document database</h3>
<p>
Each document database is responsible for a single document type.
It has a component called FeedHandler which takes care of incoming documents, updates, and remove requests.
All requests are first written to a <a href="#transaction-log">transaction log</a>,
then handed to the appropriate sub-database, based on the request type.
</p>
<h3 id="sub-databases">Sub-databases</h3>
<p>
There are three types of sub-databases, each with its own
<a href="attributes.html#document-meta-store">document meta store</a> and <a href="#document-store">document store</a>.
The document meta store holds a map from the document id to a local id.
This local id is used to address the document in the document store.
The document meta store also maintains information on the
state of the buckets that are present in the sub-database.
</p><p>
The sub-databases are maintained by the <em>Maintenance Controller</em>.
The document distribution changes as the system is resized.
When the number of nodes in the system changes,
the Maintenance Controller will move documents between the Ready and
Not Ready sub-databases to reflect the new distribution.
When an entry in the Removed sub-database gets old it is purged.
The sub-databases are:
</p>
<table class="table">
<thead></thead><tbody>
<tr>
<th style="white-space:nowrap;">Not Ready</th>
<td>
Holds the redundant documents that are not searchable,
i.e. the <em>not ready</em> documents.
Documents that are not ready are only stored, not indexed.
It takes some processing to move from this state to the ready state.
</td>
</tr><tr>
<th>Ready</th>
<td>
Maintains attributes and indexes of all <em>ready</em> documents and keeps them searchable.
One of the ready copies is <em>active</em> while the rest are <em>not active:</em>
<table class="table">
<tr>
<th>Active</th>
<td>There should always be exactly one active copy of each
document in the system, though intermittently there may be more.
These documents produce results when queries are evaluated.</td>
</tr>
<tr>
<th style="white-space:nowrap">Not Active</th>
<td>The ready copies that are not active are indexed but will not produce results.
By being indexed, they are ready to take over immediately
if the node holding the active copy becomes unavailable.
Read more in <a href="reference/services-content.html#searchable-copies">searchable-copies</a>.</td>
</tr>
</table><!-- Todo: add illustration of active/non-active, or add to illustration above -->
</td>
</tr><tr>
<th>Removed</th>
<td>Keeps track of documents that have been removed.
The id and timestamp for each removed document are kept.
This information is used when buckets from two nodes are merged.
If the removed document exists on another node but with a different timestamp,
the most recent entry prevails.
</td>
</tr>
</tbody>
</table>
<h2 id="transaction-log">Transaction log</h2>
<p>
Content nodes have a transaction log to persist mutating operations.
The transaction log persists operations by file append.
Having a transaction log simplifies proton's in-memory index structures
and enables high steady-state performance - read more below.
</p><p>
All operations are written and synced to the transaction log.
This is sequential (not random) IO, but it might impact overall feed performance when running on network-attached
storage (NAS), where the sync operation has a much higher cost than on locally attached storage (e.g. SSD).
See <a href="reference/services-content.html#sync-transactionlog">sync-transactionlog</a>.
</p>
<p>
By default, proton will
<a href="reference/services-content.html#flush-on-shutdown">flush components</a>
like attribute vectors and the memory index on shutdown, for quicker startup after scheduled restarts.
</p>
<h2 id="document-store">Document store</h2>
<p>
Documents are stored as compressed serialized blobs in the <em>document store</em>.
Put, update and remove operations are persisted in the <a href="#transaction-log">transaction log</a>
before updating the document in the document store.
The operation is acked to the client and the result of the operation is immediately seen in search results.
</p>
<p>
Files in the document store are written sequentially, and occur in pairs - example:
</p>
<pre>
-rw-r--r-- 1 owner users 4133380096 Aug 10 13:36 1467957947689211000.dat
-rw-r--r-- 1 owner users 71192112 Aug 10 13:36 1467957947689211000.idx
</pre>
<p>
The <a href="reference/services-content.html#summary-store-logstore-maxfilesize">maximum size</a>
(in bytes) per .dat file on disk can be set using the following:
</p>
<pre>
<content id="mycluster" version="1.0">
<engine>
<proton>
<tuning>
<searchnode>
<summary>
<store>
<logstore>
<span class="pre-hilite"><maxfilesize>8000000000</maxfilesize></span>
</pre>
Notes:
<ul>
<li>The files are written in sequence. <em>proton</em> starts with one pair
and grows it until <em>maxfilesize</em>. Once full, a new pair is started.</li>
<li>This means all pairs except the last are immutable; only the last pair is written to.</li>
<li>A document can exist in multiple versions across multiple files.
Obsolete versions are compacted away: the remaining document versions are written
to a new file pair, which becomes the new active pair.</li>
<li>Read more on implications of setting <em>maxfilesize</em> in
<a href="proton.html#document-store-compaction">proton maintenance jobs</a>.</li>
<li>Files are written in <a href="reference/services-content.html#summary-store-logstore-chunk">chunks</a>,
using compression settings.</li>
</ul>
<h3 id="defragmentation">Defragmentation</h3>
<p>
<a href="#document-store-compaction">Document store compaction</a>,
defragments and sort document store files.
It removes stale versions of documents (i.e. old version of updated documents).
It is triggered when the disk bloat of the document store is larger than the total disk usage of the document store
times <a href="reference/services-content.html#flushstrategy-native-total-diskbloatfactor">
diskbloatfactor</a>.
Refer to <a href="reference/services-content.html#summary">summary tuning</a> for details.
</p>
<p>
Defragmentation status is best observed by tracking the
<a href="reference/searchnode-metrics-reference.html#content_proton_documentdb_ready_document_store_max_bucket_spread">max_bucket_spread</a>
metric over time.
A sawtooth pattern is normal for corpora that change over time.
The <a href="reference/searchnode-metrics-reference.html#content_proton_documentdb_job_document_store_compact">document_store_compact</a>
metric tracks when proton is running the document store compaction job.
If compaction settings are too tight, the metric stays at, or close to, 1.
</p>
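<p>
To track this programmatically, poll the metrics API on the searchnode state port.
A sketch, assuming the standard <code>state/v1/metrics</code> JSON layout - the host and
port below are placeholders, find yours with <em>vespa-model-inspect service searchnode</em>:
</p>
<pre>{% highlight python %}
import json
import urllib.request

# Placeholder endpoint - substitute the searchnode host and state port.
URL = "http://localhost:19110/state/v1/metrics"

def get_metric(name):
    """Return the values dict for the first metric entry matching 'name'."""
    with urllib.request.urlopen(URL) as response:
        snapshot = json.load(response)
    for metric in snapshot.get("metrics", {}).get("values", []):
        if metric.get("name") == name:
            return metric.get("values")
    return None

print(get_metric(
    "content.proton.documentdb.ready.document_store.max_bucket_spread"))
{% endhighlight %}</pre>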
<p>
When benchmarking, it is important to set the correct compaction settings,
make sure that proton has finished compacting files since the corpus was fed (this can take hours),
and verify that it is not actively compacting (<em>document_store_compact</em> should be 0 most of the time).
<!-- The effect of compaction on query latency is <strong>ToDo</strong>. -->
</p>
{% include note.html content="There is no bucket-compaction across files - documents will not move between files."%}
<h3 id="optimized-reads-using-chunks">Optimized reads using chunks</h3>
<p>
As documents are clustered within the .dat file,
proton optimizes reads by reading larger chunks when accessing documents.
When visiting, documents are read in <em>bucket</em> order.
This is the same order as the defragmentation job uses.
</p>
<p>
The first document read in a visit operation for a bucket will read a chunk from the .dat file into memory.
Subsequent document accesses are served by a memory lookup only.
The chunk size is configured by
<a href="reference/services-content.html#summary-store-logstore-chunk-maxsize">maxsize</a>:
</p>
<pre>
<engine>
<proton>
<tuning>
<searchnode>
<summary>
<store>
<logstore>
<chunk>
<span class="pre-hilite"><maxsize>16384</maxsize></span>
</chunk>
</logstore>
</pre>
<p>
A file can have at most 2^22 = 4,194,304 chunks.
This sets a minimum chunk size given <em>maxfilesize</em> -
e.g. an 8 GB file has a minimum chunk size of 8,000,000,000 / 4,194,304 ≈ 1.9 kB.
</p>
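<p>
The arithmetic, as a quick check:
</p>
<pre>{% highlight python %}
MAX_CHUNKS_PER_FILE = 2 ** 22          # 4,194,304

def min_chunk_size(maxfilesize_bytes):
    return maxfilesize_bytes / MAX_CHUNKS_PER_FILE

print(min_chunk_size(8_000_000_000))   # ~1907 bytes, i.e. about 2 kB
{% endhighlight %}</pre>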
<p>
Finally, bucket size is configured by setting
<a href="reference/services-content.html#bucket-splitting">bucket-splitting</a>:
</p>
<pre>
<content id="imagepersonal" version="1.0">
<tuning>
<span class="pre-hilite"><bucket-splitting max-documents="1024"/></span>
</pre>
<p>The following are the relevant sizing units:</p>
<ul>
<li>.dat file size - <em>maxfilesize</em>.
Larger files give fewer files and hence better locality,
but compaction requires more memory and more time to complete.</li>
<li>chunk size - <em>maxsize</em>.
Smaller chunks give less wasted IO per read but more IO operations.</li>
<li>bucket size - <em>bucket-splitting</em>.
Larger buckets give fewer buckets and better locality to nodes and files,
but incur more overhead during content layer bucket maintenance operations.
This overhead can be treated as linear in CPU, memory and network usage
with the bucket size.</li>
</ul>
<h3 id="memory-usage">Memory usage</h3>
<p>
The document store has a mapping in memory from local ID (LID) to position in a document store file (.dat).
Part of this mapping is persisted in the .idx-file paired to the .dat file.
The memory used by the document store is linear in the number of documents and updates to them.
</p><p>
The metric <a href="reference/searchnode-metrics-reference.html#content_proton_documentdb_ready_document_store_memory_usage_allocated_bytes">content.proton.documentdb.ready.document_store.memory_usage.allocated_bytes</a>
gives the size in memory -
use the <a href="reference/state-v1.html#state-v1-metrics">metric API</a> to find it.
A rule of thumb is 12 bytes per document.
</p>
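<p>
With the rule of thumb above, the document store memory footprint is easy to ballpark:
</p>
<pre>{% highlight python %}
BYTES_PER_DOCUMENT = 12  # rule of thumb from above

def document_store_memory_estimate(num_documents):
    return num_documents * BYTES_PER_DOCUMENT

print(document_store_memory_estimate(500_000_000))  # 6_000_000_000 bytes = 6 GB
{% endhighlight %}</pre>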
<h2 id="attributes">Attributes</h2>
<p>
<a href="attributes.html">Attribute</a> fields are in-memory fields used for
matching, ranking, sorting and grouping.
Each attribute is a separate component that consists of a set of
<a href="attributes.html#data-structures">data structures</a>
to store values for that field across all documents in a sub-database.
Attributes are managed by the Ready sub-database.
Some attributes can also be managed by the Not Ready sub-database,
see <a href="attributes.html#fast-access">high-throughput updates</a> for details.
</p>
<h2 id="index">Index</h2>
<p>
Index fields are string fields, used for text search.
Other field types are <a href="attributes.html">attributes</a>
and <a href="document-summaries.html">summary fields</a>.
</p>
<p>
The Index in the Ready sub-database consists of a memory index and one or more disk indexes.
Mutating document operations are applied to the memory index,
which is <a href="#memory-index-flush">flushed</a> regularly.
Flushed memory indexes are <a href="#disk-index-fusion">merged</a> with the primary disk index.
</p><p>
Proton stores position information in text indices by default, for proximity relevance -
<code>posocc</code> (below).
All occurrences of a term are stored in the posting list, with their positions.
This provides superior ranking features,
but is somewhat more expensive than just storing a single occurrence per document.
For most applications it is the correct tradeoff,
since most of the cost is usually elsewhere and relevance is valuable.
</p><p>
Applications that only need occurrence information for filtering
can use <a href="reference/schema-reference.html#rank">rank: filter</a>
to optimize query performance, using only <code>boolocc</code>-files (below).
</p><p>
The memory index has a dictionary per index field.
This contains all unique words in that field with mapping to posting lists with position information.
The position information is used during ranking, see
<a href="nativerank.html">nativeRank</a> for details on how a text match score is calculated.
</p><p>
The disk index stores the content of each index field in separate folders.
Each folder contains:
</p>
<ul>
<li>Dictionary. Files: <code>dictionary.pdat</code>,
<code>dictionary.spdat</code>, <code>dictionary.ssdat</code>.</li>
<li>Compressed posting lists with position information. File: <code>posocc.dat.compressed</code>.</li>
<li>Posting lists with only occurrence information (bitvector). These are generated for common words.
Files: <code>boolocc.bdat</code>, <code>boolocc.idx</code>.</li>
</ul>
<p>
Example:
</p>
<pre>{% highlight sh%}
$ pwd
/opt/vespa/var/db/vespa/search/cluster.mycluster/n1/documents/myschema/0.ready/index/index.flush.1/myfield
$ ls -la
total 7632
drwxr-xr-x 2 org users 145 Oct 29 06:09 .
drwxr-xr-x 74 org users 4096 Oct 29 06:11 ..
-rw-r--r-- 1 org users 4096 Oct 29 06:11 boolocc.bdat
-rw-r--r-- 1 org users 4096 Oct 29 06:11 boolocc.idx
-rw-r--r-- 1 org users 8192 Oct 29 06:11 dictionary.pdat
-rw-r--r-- 1 org users 8192 Oct 29 06:11 dictionary.spdat
-rw-r--r-- 1 org users 4120 Oct 29 06:11 dictionary.ssdat
-rw-r--r-- 1 org users 7778304 Oct 29 06:11 posocc.dat.compressed
{% endhighlight %}</pre>
<p>
Note that <code>boolocc</code>-files are empty if the number of occurrences is small, like in the example above.
</p>
<h2 id="proton-maintenance-jobs">Proton maintenance jobs</h2>
<p>
The memory and disk data structures used in Proton are
periodically optimized by a set of maintenance jobs.
These jobs are automatically executed, and some can be tuned in
<a href="reference/services-content.html#flushstrategy">flush strategy tuning</a>.
All jobs are described in the table below.
</p><p>
There is only one instance of each job at a time - e.g. attributes are flushed in sequence.
When a job is running, its metric is set to 1 - otherwise 0.
Use this to correlate observed performance or resource usage with job runs - see <em>Run metric</em> below.
</p><p>
The <em>temporary</em> resources used when jobs are executed are described in <em>CPU</em>, <em>Memory</em> and <em>Disk</em>.
The memory and disk usage metrics of components that are optimized by the jobs
are described in <em>Metrics</em> (with <em>Metric prefix</em>).
For a list of all available Proton metrics refer to the searchnode metrics in the
<a href="reference/vespa-set-metrics-reference.html#searchnode-metrics">Vespa Metric Set</a>.
Metrics are available at the <a href="operations/metrics.html">Metrics API</a>.
</p>
<table class="table">
<thead>
<tr>
<th>Job</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="7">attribute flush</th>
<td colspan="2">
<p id="attribute-flush">
Flush an <a href="attributes.html">attribute</a> vector from memory to disk,
based on configuration in the
<a href="reference/services-content.html#flushstrategy">flush strategy</a>.
This ensures that Proton starts quicker - see
<a href="reference/services-content.html#flush-on-shutdown">flush on shutdown</a>.
An attribute flush also releases memory after a <a href="#lid-space-compaction">LID-space compaction</a>.
</p>
</td>
</tr>
<tr>
<th>CPU</th>
<td>Little - one thread flushes to disk</td>
</tr>
<tr>
<th>Memory</th>
<td>Little - some temporary use</td>
</tr>
<tr>
<th>Disk</th>
<td>A new file is written, so disk usage is temporarily 2x the size of the attribute on disk until the old flush file is deleted.</td>
</tr>
<tr>
<th>Run metric</th>
<td><em>content.proton.documentdb.job.attribute_flush</em></td>
</tr>
<tr>
<th style="white-space:nowrap;">Metric prefix</th>
<td><em>content.proton.documentdb.[ready|notready].attribute.memory_usage.</em></td>
</tr>
<tr>
<th>Metrics</th>
<td><em>
allocated_bytes.average<br/>
used_bytes.average<br/>
dead_bytes.average<br/>
onhold_bytes.average<br/>
</em></td>
</tr>
<tr>
<th rowspan="7">memory index flush</th>
<td colspan="2">
<p id="memory-index-flush">
Flush a <em>memory index</em> to disk, then trigger <a href="#disk-index-fusion">disk index fusion</a>.
The goal is to shrink memory usage by adding to the disk-backed indices.
Note: A high feed rate can cause multiple smaller flushed indices, like
<em>$VESPA_HOME/var/db/vespa/search/cluster.name/n1/documents/doc/0.ready/index/index.flush.102</em>
- see the high index number.
Multiple smaller indices are a symptom of a memory index that is too small for the feed rate -
to fix, increase <a href="reference/services-content.html#flushstrategy">
flushstrategy > native > component > maxmemorygain</a>.
</p>
</td>
</tr>
<tr>
<th>CPU</th>
<td>Little - one thread flushes to disk</td>
</tr><tr>
<th>Memory</th>
<td>Little</td>
</tr>
<tr>
<th>Disk</th>
<td>Creates a new disk index, size of the memory index.</td>
</tr>
<tr>
<th>Run metric</th>
<td><em>content.proton.documentdb.job.memory_index_flush</em></td>
</tr>
<tr>
<th>Metric prefix</th>
<td><em>content.proton.documentdb.index.memory_usage.</em></td>
</tr>
<tr>
<th>Metrics</th>
<td><em>allocated_bytes.average<br/>
used_bytes.average<br/>
dead_bytes.average<br/>
onhold_bytes.average<br/></em></td>
</tr>
<tr>
<th rowspan="5">disk index fusion</th>
<td colspan="2">
<p id="disk-index-fusion">
Merge the primary disk index with smaller indices generated by
<a href="#memory-index-flush">memory index flush</a> -
triggered by the memory index flush.
</p>
</td>
</tr>
<tr>
<th>CPU</th>
<td>Multiple threads merge indices, configured as a function of
<a href="reference/services-content.html#feeding-concurrency">feeding concurrency</a> -
refer to this for details</td>
</tr>
<tr>
<th>Memory</th>
<td>Little</td>
</tr>
<tr>
<th>Disk</th>
<td>Creates a new index while serving from the current: 2x temporary disk usage for the given index.</td>
</tr>
<tr>
<th>Run metric</th>
<td><em>content.proton.documentdb.job.disk_index_fusion</em></td>
</tr>
<tr>
<th rowspan="5">document store flush</th>
<td colspan="2">
<p id="document-store-flush">
Flushes the <a href="#memory-usage">document store</a>.
</p>
</td>
</tr>
<tr>
<th>CPU</th>
<td>Little</td>
</tr>
<tr>
<th>Memory</th>
<td>Little</td>
</tr>
<tr>
<th>Disk</th>
<td>Little</td>
</tr>
<tr>
<th>Run metric</th>
<td><em>content.proton.documentdb.job.document_store_flush</em></td>
</tr>
<tr>
<th rowspan="7">document store compaction</th>
<td colspan="2">
<p id="document-store-compaction">
Defragment and sort <a href="#document-store">document store</a> files as documents are updated and deleted,
in order to reduce disk usage.
The file is sorted in bucket order on output. Triggered by
<a href="reference/services-content.html#flushstrategy-native-total-diskbloatfactor">
diskbloatfactor</a>.
</p>
</td>
</tr>
<tr>
<th>CPU</th>
<td>Little - one thread reads one file, sorts it, and writes a new file</td>
</tr>
<tr>
<th>Memory</th>
<td>
Holds a document store file in memory plus memory for sorting the file.
<strong>Note: This is important on hosts with little memory!</strong>
Reduce <a href="reference/services-content.html#summary-store-logstore-maxfilesize">
maxfilesize</a> to increase number of files and use less temporary memory for compaction.
</td>
</tr>
<tr>
<th>Disk</th>
<td>A new file is written while the current is serving, max temporary usage is 2x.</td>
</tr>
<tr>
<th>Run metric</th>
<td><em>content.proton.documentdb.job.document_store_compact</em></td>
</tr>
<tr>
<th>Metric prefix</th>
<td><em>content.proton.documentdb.[ready|notready|removed].document_store.</em></td>
</tr>
<tr>
<th>Metrics</th>
<td><em>disk_usage.average<br/>
disk_bloat.average<br/>
max_bucket_spread.average<br/>
memory_usage.allocated_bytes.average<br/>
memory_usage.used_bytes.average<br/>
memory_usage.dead_bytes.average<br/>
memory_usage.onhold_bytes.average<br/></em></td>
</tr>
<tr>
<th rowspan="5">bucket move</th>
<td colspan="2">
<p id="bucket-move">
Triggered by nodes going up/down, refer to <a href="elasticity.html">
content cluster elasticity</a> and
<a href="reference/services-content.html#searchable-copies">searchable-copies</a>.
It causes documents to be indexed or de-indexed, similar to feeding.
This moves documents between the <em>ready</em> and <em>not ready</em> sub-databases.
</p>
</td>
</tr>
<tr>
<th>CPU</th>
<td>CPU similar to feeding.
Consumes capacity from the write threads, so has feeding impact</td>
</tr>
<tr>
<th>Memory</th>
<td>As feeding - e.g. the attribute memory usage and memory index in the <em>ready</em> sub-database will grow</td>
</tr>
<tr>
<th>Disk</th>
<td>As feeding</td>
</tr>
<tr>
<th>Run metric</th>
<td><em>content.proton.documentdb.job.bucket_move</em></td>
</tr>
<tr>
<th rowspan="7">lid-space compaction</th>
<td colspan="2">
<p id="lid-space-compaction">
Each sub-database has a
<a href="attributes.html#document-meta-store">document meta store</a>
that manages a local document id space (LID-space).
When a <a href="elasticity.html">cluster grows</a> with more nodes,
documents are redistributed to new nodes and each node ends up with fewer documents.
The result is holes in the LID-space, and a compaction is triggered when the bloat is above 1%.
This in-place defragments the
<a href="attributes.html#document-meta-store">document meta store</a>
and <a href="attributes.html">attribute</a> vectors
by moving documents from high to low LIDs inside the sub-database.
Resources are freed on a subsequent <a href="#attribute-flush">attribute flush</a>.
</p>
</td>
</tr>
<tr>
<th>CPU</th>
<td>Like feeding - add and remove documents</td>
</tr>
<tr>
<th>Memory</th>
<td>Little</td>
</tr>
<tr>
<th>Disk</th>
<td>0</td>
</tr>
<tr>
<th>Run metric</th>
<td><em>content.proton.documentdb.job.lid_space_compact</em></td>
</tr>
<tr>
<th>Metric prefix</th>
<td><em>content.proton.documentdb.[ready|notready|removed].lid_space.</em></td>
</tr>
<tr>
<th>Metrics</th>
<td><em>lid_limit.last<br/>
lid_bloat_factor.average<br/>
lid_fragmentation_factor.average</em></td>
</tr>
<tr>
<th rowspan="5" style="white-space:nowrap;">removed documents pruning</th>
<td colspan="2">
<p id="removed-documents-pruning">
Prunes the <em>removed</em> sub-database which keeps IDs for deleted documents.
See <a href="https://docs.vespa.ai/en/reference/services-content.html#removed-db">removed-db</a>
for details.
</p>
</td>
</tr>
<tr>
<th>CPU</th>
<td>Little</td>
</tr>
<tr>
<th>Memory</th>
<td>Little</td>
</tr>
<tr>
<th>Disk</th>
<td>Little</td>
</tr>
<tr>
<th>Run metric</th>
<td><em>content.proton.documentdb.job.removed_documents_prune</em></td>
</tr>
</tbody>
</table>
<!-- ToDo: Moved from elastic doc, better fit here.
Merge with the 3 subdatabase section above -->
<h2 id="retrieving-documents">Retrieving documents</h2>
<p>
Retrieving documents is done by specifying an id to <em>get</em>,
or using a <a href="reference/document-select-language.html">selection expression</a>
to <em>visit</em> a range of documents - refer to the <a href="api.html">Document API</a>.
Overview:
</p>
<img src="/assets/img/elastic-visit-get.svg" alt="Retrieving documents" width="705px" height="auto"/>
<table class="table">
<tr>
<th>Get</th>
<td>
<p id="get">
When the content node receives a get request,
it scans through all the document databases,
and for each one it checks all three sub-databases.
Once the document is found, the scan is stopped and the document returned.
If the document is found in a Ready sub-database,
the document retriever will apply any changes that are stored in the
<a href="attributes.html">attributes</a> before returning the document -
the lookup order is sketched below the table.
</p>
</td>
</tr><tr>
<th>Visit</th>
<td>
<p id="visit">
A visit request creates an iterator over each candidate bucket.
This iterator will retrieve matching documents from all sub-databases of all document databases.
As for get, attribute values are applied to document fields in the Ready sub-database.
</p>
</td>
</tr>
</table>
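<p>
The get lookup order referenced above can be sketched like this (illustrative Python;
the sub-database API is hypothetical):
</p>
<pre>{% highlight python %}
def get(document_id, document_dbs):
    """Scan all document databases (one per document type) and their three
    sub-databases; stop at the first hit."""
    for db in document_dbs:
        for sub_db in (db.ready, db.not_ready, db.removed):
            doc = sub_db.lookup(document_id)
            if doc is not None:
                if sub_db is db.ready:
                    # Ready documents get in-memory attribute values applied.
                    doc = db.apply_attribute_values(doc)
                return doc
    return None
{% endhighlight %}</pre>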
<h2 id="queries">Queries</h2>
<p>
Queries have a separate pathway through the system.
They do not use the distributor, nor do they go through the content node persistence threads.
They are orthogonal to the elasticity set up by the storage and retrieval described above.
How queries move through the system:
</p>
<img src="/assets/img/proton-query.svg" width="510px" height="auto" alt="Queries" />
<p>
A query enters the system through the <em>QR-server (query rewrite server)</em>
in the <a href="jdisc/">Vespa Container</a>.
The QR-server issues one query per document type to the search nodes:
</p>
<table class="table">
<tr>
<th>Container</th>
<td>
<p id="container">
The Container knows all the document types and rewrites queries as a
collection of queries, one for each type.
Queries may have a <a href="reference/query-api-reference.html#model.restrict">restrict</a> parameter,
in which case the container will send the query only to the specified document types.
</p>
<p>
It sends the query to content nodes and collects partial results.
It pings all content nodes every second to know whether they are alive,
and keeps open TCP connections to each one.
If a node goes down, the elastic system will make the documents available on other nodes.
</p>
</td>
</tr><tr>
<th>Content node matching</th>
<td>
<p id="content-node-matching">
The <em>match engine</em> <!-- ToDo: link here to design doc / move it -->
receives queries and routes them to the right document database based on the document type.
The query is passed to the <em>Ready</em> sub-database, where the searchable documents are.
Based on information stored in the document meta store,
the query is augmented with a blocklist that ensures only <em>active</em> documents are matched.
</p>
</td>
</tr>
</table>
<h2 id="custom-component-state-api">Custom Component State API</h2>
<p>
This section describes proton's custom extensions to the component state API.
Component status can be found by HTTP GET at <code>http://host:stateport/state/v1/custom/component</code>.
This gives an overview of the relevant search node components and their internal state.
Note that this is <strong>not</strong> a stable API, and it will expand and change between releases.
</p>
<p>
Run <a href="/en/operations-selfhosted/vespa-cmdline-tools.html#vespa-model-inspect">vespa-model-inspect</a>
to find the JSON HTTP stateport:
</p>
<pre>
vespa-model-inspect service searchnode
</pre>
<p>
Example <code>state/v1/custom/component</code>:
</p>
<pre>{% highlight json %}
{
"documentdb": {
"mydoctype": {
"documentType": "mydoctype",
"status": {
"state": "ONLINE",
"configState": "OK"
},
"documents": {
"active": 10,
"ready": 10,
"total": 10,
"removed": 0
},
"url": "http://host:stateport/state/v1/custom/component/documentdb/mydoctype"
}
},
"threadpools": {
"url": "http://host:stateport/state/v1/custom/component/threadpools"
},
"matchengine": {
"status": {
"state": "ONLINE"
},
"url": "http://host:stateport/state/v1/custom/component/matchengine"
},
"flushengine": {
"url": "http://host:stateport/state/v1/custom/component/flushengine"
},
"tls": {
"url": "http://host:stateport/state/v1/custom/component/tls"
},
"hwinfo": {
"url": "http://host:stateport/state/v1/custom/component/hwinfo"
},
"resourceusage": {
"url": "http://host:stateport/state/v1/custom/component/resourceusage",
"disk": 0.25,
"memory": 0.35,
"attribute_address_space": 0
},
"session": {
"search": {
"url": "http://host:stateport/state/v1/custom/component/session/search",
"numSessions": 0
}
}
}
{% endhighlight %}</pre>
<p>
Example <code>state/v1/custom/component/documentdb/mydoctype</code>:
</p>
<pre>{% highlight json %}
{
"documentType": "mydoctype",
"status": {
"state": "ONLINE",
"configState": "OK"
},
"documents": {
"active": 10,
"ready": 10,
"total": 10,
"removed": 0
},
"subdb": {
"removed": {
"url": "http://host:stateport/state/v1/custom/component/documentdb/mydoctype/subdb/removed"
},
"ready": {
"url": "http://host:stateport/state/v1/custom/component/documentdb/mydoctype/subdb/ready"
},
"notready": {
"url": "http://host:stateport/state/v1/custom/component/documentdb/mydoctype/subdb/notready"
}
},
"threadingservice": {
"url": "http://host:stateport/state/v1/custom/component/documentdb/mydoctype/threadingservice"
},
"bucketdb": {
"url": "http://host:stateport/state/v1/custom/component/documentdb/mydoctype/bucketdb",
"numBuckets": 1
},
"maintenancecontroller": {
"url": "http://host:stateport/state/v1/custom/component/documentdb/mydoctype/maintenancecontroller"
}
}
{% endhighlight %}</pre>