Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OAK-11114 - Filter downloaded Mongo documents by path suffix #1716

Merged
merged 18 commits into from
Sep 16, 2024

Conversation

nfsantos
Copy link
Contributor

@nfsantos nfsantos commented Sep 13, 2024

Adds support to the indexing job to filter documents downloaded from Mongo early in the processing pipeline, more precisely at the NodeDocumentCodec, when they are being decoded from a binary stream of tokens. The motivation is to reduce the work that the indexing job has to do and the size of the intermediate flat file store, by not processing the documents that are not necessary for indexing, but that cannot be easily filtered using Mongo regex filtering.

This PR adds two system configuration properties:

  • oak.indexer.pipelined.nodeDocument.filter.filteredPath - The path where the filter should be applied. Documents outside this path are not considered for filtering.
  • oak.indexer.pipelined.nodeDocument.filter.suffixesToSkip - The suffix of the documents that should be excluded.

@nfsantos nfsantos changed the title OAK-11114 - Implement filtering downloaded Mongo documents by path suffix (WIP) OAK-11114 - Implement filtering downloaded Mongo documents by path suffix Sep 16, 2024
@nfsantos nfsantos changed the title OAK-11114 - Implement filtering downloaded Mongo documents by path suffix OAK-11114 - Filter downloaded Mongo documents by path suffix Sep 16, 2024
Improve logging in NodeDocumentCodec and reduce frequency of periodic log messages.
Copy link

@alagodasii alagodasii Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's non-functional change, but could improve structure, readability and extensibility of the test. I suggest parameterising it. JUnit 4 doesn't support it well, but still we have capabilities of doing so. Here's the proposed solution:

_package org.apache.jackrabbit.oak.index.indexer.document.flatfile.pipelined;

import static java.util.Collections.emptyList;
import static java.util.Collections.singletonList;
import static org.junit.Assert.assertEquals;

import org.junit.Rule;
import org.junit.Test;
import org.junit.contrib.java.lang.system.RestoreSystemProperties;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;
import org.junit.runners.Suite;

import java.util.Collection;
import java.util.List;

@RunWith(Suite.class)
@Suite.SuiteClasses({ConfigHelperTest.SystemPropertyExistsTest.class, ConfigHelperTest.MissingSystemPropertyTest.class})
public class ConfigHelperTest {

@rule
public final RestoreSystemProperties restoreSystemProperties = new RestoreSystemProperties();

@RunWith(Parameterized.class)
public static class SystemPropertyExistsTest {

private final String key;
private final String value;
private final List<String> expected;

public SystemPropertyExistsTest(String key, String value, List<String> expected) {
  this.key = key;
  this.value = value;
  this.expected = expected;
}

@Parameters
public static Collection<Object[]> data() {
  return List.of(new Object[][]{
      {"key1", "value1", singletonList("value1")},
      {"key2", " ", emptyList()},
      {"key3", "v1;v2", List.of("v1", "v2")},
      {"key4", "v1; v2", List.of("v1", "v2")}
  });
}

@Test
public void givenDefaultValue_whenSystemPropertyExists_thenGetSystemPropertyAsString() {
  // given
  var defaultValue = "default";
  var separator = ';';
  System.setProperty(key, value);

  // when
  final var result = ConfigHelper.getSystemPropertyAsStringList(key, defaultValue, separator);

  // then
  assertEquals(expected, result);
}

}

@RunWith(Parameterized.class)
public static class MissingSystemPropertyTest {

private final String defaultValue;
private final List<String> expected;

public MissingSystemPropertyTest(String defaultValue, List<String> expected) {
  this.defaultValue = defaultValue;
  this.expected = expected;
}

@Parameters
public static Collection<Object[]> data() {
  return List.of(new Object[][]{
      {"", emptyList()},
      {"default", singletonList("default")},
      {"default1;default2", List.of("default1", "default2")}
  });
}

@Test
public void givenDefaultValue_whenSystemPropertyNotExists_thenReturnDefaultValueAsList() {
  // given
  final var key = "not.defined";
  final var separator = ';';

  // when
  final var result = ConfigHelper.getSystemPropertyAsStringList(key, defaultValue, separator);

  // then
  assertEquals(expected, result);
}

}
}_

Also could you provide the scenario where separator differs from the one used in system property? I.e. default1;defalut2 and ,.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed solution is much more verbose, I don't think it is justified in this case. I have instead extracted into a separate method the core of the test logic, which reduces repetition and makes the test more readable.

I added a few test cases where the separator in the value of the system property does not match the expected separator.

@@ -796,7 +800,7 @@ void download(FindIterable<NodeDocument> mongoIterable) throws InterruptedExcept
this.lastIdDownloaded = id;
this.documentsDownloadedTotal++;
downloadStatics.incrementDocumentsDownloadedTotal();
if (this.documentsDownloadedTotal % 20_000 == 0) {
if (this.documentsDownloadedTotal % 50_000 == 0) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you extract 50_000 to constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are inlining similar constants in other places. Maybe it is not a good practise, but it's ok for the time being, I don't want to expose this as a constant.

Additionally, the whole approach to logging progress should be reconsidered. I no longer think that logging based on the number of items processed is the best approach, as download speeds vary a lot from run to run, so we risk either not logging often enough or logging way too frequently. A better approach is to log periodically, so at a certain point in the future I would like to revise the whole approach to logging progress in the indexing job.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think this is not needed. Actually I'm against making it a constant: that would just complicate the code without giving any meaningful value, in my view. (I understand others might have a different opinion, that's fine.)


private void test(List<String> expected, String sysPropertyValue, String defaultValue, char separator) {
System.setProperty("key", sysPropertyValue);
assertEquals(expected, ConfigHelper.getSystemPropertyAsStringList("key", defaultValue, separator));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I would cleanup the system property

System.clearProperty("key");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I have also changed the key to a value that is less likely to be used by any other code.

@nfsantos nfsantos merged commit 917c1e4 into apache:trunk Sep 16, 2024
1 of 2 checks passed
@nfsantos nfsantos deleted the OAK-11114 branch September 16, 2024 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants