-
Notifications
You must be signed in to change notification settings - Fork 408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OAK-11114 - Filter downloaded Mongo documents by path suffix #1716
Conversation
Add missing check for filtered document. Rename NodeDocumentFieldFilter to NodeDocumentFilter
...n/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/ConfigHelper.java
Outdated
Show resolved
Hide resolved
...a/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/NodeDocumentCodec.java
Show resolved
Hide resolved
.../org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/NodeDocumentFilter.java
Outdated
Show resolved
Hide resolved
...che/jackrabbit/oak/index/indexer/document/flatfile/pipelined/PipelinedMongoDownloadTask.java
Outdated
Show resolved
Hide resolved
...n/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/ConfigHelper.java
Show resolved
Hide resolved
...n/java/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/ConfigHelper.java
Show resolved
Hide resolved
...a/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/NodeDocumentCodec.java
Outdated
Show resolved
Hide resolved
...a/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/NodeDocumentCodec.java
Show resolved
Hide resolved
Improve logging in NodeDocumentCodec and reduce frequency of periodic log messages.
...a/org/apache/jackrabbit/oak/index/indexer/document/flatfile/pipelined/NodeDocumentCodec.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's non-functional change, but could improve structure, readability and extensibility of the test. I suggest parameterising it. JUnit 4 doesn't support it well, but still we have capabilities of doing so. Here's the proposed solution:
_package org.apache.jackrabbit.oak.index.indexer.document.flatfile.pipelined;
import static java.util.Collections.emptyList;
import static java.util.Collections.singletonList;
import static org.junit.Assert.assertEquals;
import org.junit.Rule;
import org.junit.Test;
import org.junit.contrib.java.lang.system.RestoreSystemProperties;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;
import org.junit.runners.Suite;
import java.util.Collection;
import java.util.List;
@RunWith(Suite.class)
@Suite.SuiteClasses({ConfigHelperTest.SystemPropertyExistsTest.class, ConfigHelperTest.MissingSystemPropertyTest.class})
public class ConfigHelperTest {
@rule
public final RestoreSystemProperties restoreSystemProperties = new RestoreSystemProperties();
@RunWith(Parameterized.class)
public static class SystemPropertyExistsTest {
private final String key;
private final String value;
private final List<String> expected;
public SystemPropertyExistsTest(String key, String value, List<String> expected) {
this.key = key;
this.value = value;
this.expected = expected;
}
@Parameters
public static Collection<Object[]> data() {
return List.of(new Object[][]{
{"key1", "value1", singletonList("value1")},
{"key2", " ", emptyList()},
{"key3", "v1;v2", List.of("v1", "v2")},
{"key4", "v1; v2", List.of("v1", "v2")}
});
}
@Test
public void givenDefaultValue_whenSystemPropertyExists_thenGetSystemPropertyAsString() {
// given
var defaultValue = "default";
var separator = ';';
System.setProperty(key, value);
// when
final var result = ConfigHelper.getSystemPropertyAsStringList(key, defaultValue, separator);
// then
assertEquals(expected, result);
}
}
@RunWith(Parameterized.class)
public static class MissingSystemPropertyTest {
private final String defaultValue;
private final List<String> expected;
public MissingSystemPropertyTest(String defaultValue, List<String> expected) {
this.defaultValue = defaultValue;
this.expected = expected;
}
@Parameters
public static Collection<Object[]> data() {
return List.of(new Object[][]{
{"", emptyList()},
{"default", singletonList("default")},
{"default1;default2", List.of("default1", "default2")}
});
}
@Test
public void givenDefaultValue_whenSystemPropertyNotExists_thenReturnDefaultValueAsList() {
// given
final var key = "not.defined";
final var separator = ';';
// when
final var result = ConfigHelper.getSystemPropertyAsStringList(key, defaultValue, separator);
// then
assertEquals(expected, result);
}
}
}_
Also could you provide the scenario where separator differs from the one used in system property? I.e. default1;defalut2
and ,
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed solution is much more verbose, I don't think it is justified in this case. I have instead extracted into a separate method the core of the test logic, which reduces repetition and makes the test more readable.
I added a few test cases where the separator in the value of the system property does not match the expected separator.
@@ -796,7 +800,7 @@ void download(FindIterable<NodeDocument> mongoIterable) throws InterruptedExcept | |||
this.lastIdDownloaded = id; | |||
this.documentsDownloadedTotal++; | |||
downloadStatics.incrementDocumentsDownloadedTotal(); | |||
if (this.documentsDownloadedTotal % 20_000 == 0) { | |||
if (this.documentsDownloadedTotal % 50_000 == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you extract 50_000
to constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are inlining similar constants in other places. Maybe it is not a good practise, but it's ok for the time being, I don't want to expose this as a constant.
Additionally, the whole approach to logging progress should be reconsidered. I no longer think that logging based on the number of items processed is the best approach, as download speeds vary a lot from run to run, so we risk either not logging often enough or logging way too frequently. A better approach is to log periodically, so at a certain point in the future I would like to revise the whole approach to logging progress in the indexing job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think this is not needed. Actually I'm against making it a constant: that would just complicate the code without giving any meaningful value, in my view. (I understand others might have a different opinion, that's fine.)
|
||
private void test(List<String> expected, String sysPropertyValue, String defaultValue, char separator) { | ||
System.setProperty("key", sysPropertyValue); | ||
assertEquals(expected, ConfigHelper.getSystemPropertyAsStringList("key", defaultValue, separator)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: I would cleanup the system property
System.clearProperty("key");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I have also changed the key to a value that is less likely to be used by any other code.
Adds support to the indexing job to filter documents downloaded from Mongo early in the processing pipeline, more precisely at the NodeDocumentCodec, when they are being decoded from a binary stream of tokens. The motivation is to reduce the work that the indexing job has to do and the size of the intermediate flat file store, by not processing the documents that are not necessary for indexing, but that cannot be easily filtered using Mongo regex filtering.
This PR adds two system configuration properties:
oak.indexer.pipelined.nodeDocument.filter.filteredPath
- The path where the filter should be applied. Documents outside this path are not considered for filtering.oak.indexer.pipelined.nodeDocument.filter.suffixesToSkip
- The suffix of the documents that should be excluded.