Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Remote Store] Persist StoreFileMetadata fields in Remote Store #7722

Closed
Bukhtawar opened this issue May 24, 2023 · 7 comments · Fixed by #8088
Closed

[Remote Store] Persist StoreFileMetadata fields in Remote Store #7722

Bukhtawar opened this issue May 24, 2023 · 7 comments · Fixed by #8088
Assignees
Labels
enhancement Enhancement or improvement to existing feature or request Storage:Durability Issues and PRs related to the durability framework

Comments

@Bukhtawar
Copy link
Collaborator

Is your feature request related to a problem? Please describe.
We need to capture the Lucene version of the segment files maintained on the remote store to support version currency and compatibility checks.

Describe the solution you'd like
Consider capturing the metadata information while persisting segments to remote store
https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/store/StoreFileMetadata.java

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@Bukhtawar Bukhtawar added enhancement Enhancement or improvement to existing feature or request untriaged labels May 24, 2023
@sachinpkale sachinpkale added Storage:Durability Issues and PRs related to the durability framework and removed untriaged labels May 24, 2023
@BhumikaSaini-Amazon
Copy link
Contributor

I was looking into this and wanted to get some additional clarity on the expected end result.

Do we want to associate the Lucene version with:

  1. Each uploaded metadata file?
  2. OR, Each uploaded segment file?

(additional thoughts below)


If 1,

If we want to associate the Lucene version with each uploaded metadata file, we may be able to retrieve it using the segment metadata structure that exists today.

As part of RemoteSegmentMetadata, we persist the SegmentInfos bytes (i.e. RemoteSegmentMetadata.segmentInfosBytes) as well to remote store:

public void write(IndexOutput out) throws IOException {
out.writeMapOfStrings(toMapOfStrings());
out.writeLong(generation);
out.writeLong(segmentInfosBytes.length);
out.writeBytes(segmentInfosBytes, segmentInfosBytes.length);
}
public static RemoteSegmentMetadata read(IndexInput indexInput) throws IOException {
Map<String, String> metadata = indexInput.readMapOfStrings();
long generation = indexInput.readLong();
int byteArraySize = (int) indexInput.readLong();
byte[] segmentInfosBytes = new byte[byteArraySize];
indexInput.readBytes(segmentInfosBytes, 0, byteArraySize);
return new RemoteSegmentMetadata(RemoteSegmentMetadata.fromMapOfStrings(metadata), segmentInfosBytes, generation);
}

The Lucene version can be retrieved as SegmentInfos.getCommitLuceneVersion() in this case (docs).


If 2,

If we want to associate the Lucene version with each uploaded segment, we would need to associate the Lucene version with each UploadedSegmentMetadata.

While the SegmentInfos instance is available is available in RemoteSegmentStoreDirectory#uploadMetadata()...

/**
* Upload metadata file
* @param segmentFiles segment files that are part of the shard at the time of the latest refresh
* @param segmentInfosSnapshot SegmentInfos bytes to store as part of metadata file
* @param storeDirectory instance of local directory to temporarily create metadata file before upload
* @param primaryTerm primary term to be used in the name of metadata file
* @throws IOException in case of I/O error while uploading the metadata file
*/
public void uploadMetadata(
Collection<String> segmentFiles,
SegmentInfos segmentInfosSnapshot,
Directory storeDirectory,
long primaryTerm
) throws IOException {
synchronized (this) {
String metadataFilename = MetadataFilenameUtils.getMetadataFilename(
primaryTerm,
segmentInfosSnapshot.getGeneration(),
this.commonFilenameSuffix
);
IndexOutput indexOutput = storeDirectory.createOutput(metadataFilename, IOContext.DEFAULT);
Map<String, String> uploadedSegments = new HashMap<>();
for (String file : segmentFiles) {
if (segmentsUploadedToRemoteStore.containsKey(file)) {
uploadedSegments.put(file, segmentsUploadedToRemoteStore.get(file).toString());
} else {
throw new NoSuchFileException(file);
}
}
ByteBuffersDataOutput byteBuffersIndexOutput = new ByteBuffersDataOutput();
segmentInfosSnapshot.write(new ByteBuffersIndexOutput(byteBuffersIndexOutput, "Snapshot of SegmentInfos", "SegmentInfos"));
byte[] segmentInfoSnapshotByteArray = byteBuffersIndexOutput.toArrayCopy();
metadataStreamWrapper.writeStream(
indexOutput,
new RemoteSegmentMetadata(
RemoteSegmentMetadata.fromMapOfStrings(uploadedSegments),
segmentInfoSnapshotByteArray,
segmentInfosSnapshot.getGeneration()
)
);
indexOutput.close();
storeDirectory.sync(Collections.singleton(metadataFilename));
remoteMetadataDirectory.copyFrom(storeDirectory, metadataFilename, metadataFilename, IOContext.DEFAULT);
storeDirectory.deleteFile(metadataFilename);
}
}

... we actually need access to it during the actual segment file upload...

/**
* This method does the actual upload.
*
* @param file that needs to be uploaded.
* @throws IOException is thrown if the upload fails.
*/
private void performUpload(String file) throws IOException {
remoteDirectory.copyFrom(storeDirectory, file, file, IOContext.DEFAULT);
}
}

... because as part of the upload, we also construct the UploadedSegmentMetadata available in remote store:

UploadedSegmentMetadata segmentMetadata = new UploadedSegmentMetadata(src, remoteFilename, checksum, from.fileLength(src));
segmentsUploadedToRemoteStore.put(src, segmentMetadata);

I will need to work on identifying a good way to achieve this, if option 2 is the preferred end result.


Thanks!

@sachinpkale
Copy link
Member

IMO, we can add version info as part of metadata. A metadata file corresponds to a commit. It is not possible to create segments of different Lucene versions without committing first.

@sachinpkale
Copy link
Member

But if I look at StoreFileMetadata, it has version info for each file. Need to understand why it was done this way instead of storing info in the SegmentInfos file.

@sachinpkale
Copy link
Member

This would be due to the backward compatibility that is maintained across major Lucene versions (at most 1). So, it is possible that one metadata file will refer to segment files from different versions. So, we have to go with Option 2.

@BhumikaSaini-Amazon
Copy link
Contributor

Thanks Sachin for the inputs! I will work on option 2 then.

@ankitkala
Copy link
Member

I agree with @sachinpkale, we need version for each segment file as part of the metadata. Backwards compatibility can be a concern since you can have older segments in an index which were written on older lucene versions(e.g. version upgrade).

@Bukhtawar
Copy link
Collaborator Author

For version upgrades we just need the min version in the commit. However in future it might help to selectively upgrade fewer segments as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Storage:Durability Issues and PRs related to the durability framework
Projects
Status: Done
4 participants