
[chore][fileconsumer/archive] - Add archive read logic #35798

Open
wants to merge 47 commits into main

Conversation

VihasMakwana
Contributor

This PR follows #35098.

Description

  • This PR adds the core logic for matching against the archive. Check this out for the core logic.

Future PRs

  • As of now, we don't keep track of the most recently written index across collector restarts. This is simple to accomplish, and we can use the persister for it. I haven't implemented it in the current PR because I want to keep the focus solely on the reading part. We can address it later in this PR or in a separate PR, independently.
  • Testing and enabling: Once we establish common ground on reading from the archive, we can proceed with testing and enabling the configuration.

@VihasMakwana
Contributor Author

@djaglowski were you able to look at this?

pkg/stanza/fileconsumer/internal/reader/factory.go (outdated; resolved)
pkg/stanza/fileconsumer/internal/tracker/tracker.go (outdated; resolved)
pkg/stanza/fileconsumer/internal/tracker/tracker.go (outdated; resolved)
Comment on lines 198 to 199
archiveReadIndex := t.archiveIndex - 1 // try loading most recently written index and iterate backwards
for i := 0; i < t.pollsToArchive; i++ {
Member

The comment doesn't seem to describe what's happening here.

Contributor Author

I'll update it.

}

func (t *fileTracker) SyncOffsets() {
// SyncOffsets goes through all new (unmatched) readers and updates the metadata, if found on archive.
Member

This wasn't my understanding of how the archive would work. We shouldn't have to reconcile multiple copies of the same metadata and worry about syncing. It's too complicated.

We should use almost the exact same patterns as when searching knownFiles. The only difference is that in order to reduce inefficient read/writes to storage, we should search 1 fileset for N fingerprints before moving on to the next fileset. (As opposed to searching all filesets for 1 fingerprint before moving on to the next fingerprint.)

Searching the archive should look something like:

func (t *fileTracker) FindFiles(fps []fingerprint.Fingerprint) []reader.Metadata {
    matchedFPs := make([]reader.Metadata, 0, len(fps))
    for n := 0; n < pollsToArchive; n++ {
        i := (mostRecentIndex + n) % pollsToArchive
        fs := loadFileset(i) // read the entire fileset at most once per poll
        var modified bool
        for _, fp := range fps {
            if matchedFP := fs.Get(fp); matchedFP != nil { // Get removes fp if matched
                matchedFPs = append(matchedFPs, matchedFP)
                modified = true
            }
        }
        if modified {
            saveFileset(i, fs) // overwrite the entire fileset at most once per poll
        }
    }
    return matchedFPs
}

This way, filesets in the archive are kept up to date in near-real time. The only time they could be out of date is before the function returns. It also ensures we minimize interactions with the archive. (One thing I didn't include in my pseudocode is an early exit once all fingerprints have been found, but we should do that too.)
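To make the intent concrete, here is a runnable version of the pattern above, with simplified stand-in types (the real `fileset` and `fingerprint` packages look different): one fileset is loaded at a time, matched against all outstanding fingerprints, and rewritten only if modified, with an early exit once everything is matched.

```go
package main

import (
	"bytes"
	"fmt"
)

// Simplified stand-ins for the fileset/fingerprint types; real signatures differ.
type Fingerprint []byte
type Metadata struct{ FirstBytes Fingerprint }
type Fileset []Metadata

// Get removes and returns the entry whose fingerprint matches, or nil.
func (fs *Fileset) Get(fp Fingerprint) *Metadata {
	for i, md := range *fs {
		if bytes.Equal(md.FirstBytes, fp) {
			*fs = append((*fs)[:i], (*fs)[i+1:]...)
			return &md
		}
	}
	return nil
}

// findFiles searches each archived fileset once, newest first, matching all
// outstanding fingerprints against it before moving to the next fileset.
func findFiles(archive []Fileset, mostRecent int, fps []Fingerprint) []Metadata {
	var matched []Metadata
	n := len(archive)
	for off := 0; off < n && len(fps) > 0; off++ {
		i := ((mostRecent-off)%n + n) % n // wrap backwards without going negative
		fs := &archive[i]                 // loadFileset(i) in the real code
		remaining := fps[:0]
		modified := false
		for _, fp := range fps {
			if md := fs.Get(fp); md != nil { // a match removes the entry
				matched = append(matched, *md)
				modified = true
			} else {
				remaining = append(remaining, fp)
			}
		}
		fps = remaining
		_ = modified // saveFileset(i, fs) would run here when modified
	}
	return matched
}

func main() {
	archive := []Fileset{
		{{FirstBytes: Fingerprint("a")}},
		{{FirstBytes: Fingerprint("b")}},
	}
	got := findFiles(archive, 1, []Fingerprint{Fingerprint("b")})
	fmt.Println(len(got)) // prints 1
}
```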

Contributor Author

At a high level, this is what I did, but the syncing/reconciling made it complex. I agree with you.

@VihasMakwana
Contributor Author

VihasMakwana commented Oct 29, 2024

This wasn't my understanding of how the archive would work. We shouldn't have to reconcile multiple copies of the same metadata and worry about syncing. It's too complicated.

I agree on this. It's overkill.

We should use almost the exact same patterns as when searching knownFiles. The only difference is that in order to reduce inefficient read/writes to storage, we should search 1 fileset for N fingerprints before moving on to the next fileset. (As opposed to searching all filesets for 1 fingerprint before moving on to the next fingerprint.)

Again, I agree. This is precisely what I did for SyncOffsets()

func (t *fileTracker) SyncOffsets() {
	// SyncOffsets goes through all new (unmatched) readers and updates the metadata, if found on archive.
	// To minimize disk access, we first access the index, then review unmatched readers and synchronize their metadata if a match is found.
	// We exit if no new reader exists.
	archiveReadIndex := t.archiveIndex - 1 // try loading most recently written index and iterate backwards
	for i := 0; i < t.pollsToArchive; i++ {
		newFound := false
		data, err := t.readArchive(archiveReadIndex)
		if err != nil {
			t.set.Logger.Error("error while opening archive", zap.Error(err))
			continue
		}
		for _, v := range t.currentPollFiles.Get() {
			if v.IsNew() {
				newFound = true
				if md := data.Match(v.GetFingerprint(), fileset.StartsWith); md != nil {
					v.SyncMetadata(md)
				}
			}
		}
		if !newFound {
			// No new reader is available, so there’s no need to go through the rest of the archive.
			// Just exit to save time.
			break
		}
		if err := t.updateArchive(archiveReadIndex, data); err != nil {
			t.set.Logger.Error("error while opening archive", zap.Error(err))
			continue
		}
		archiveReadIndex = (archiveReadIndex - 1) % t.pollsToArchive
	}
}
The comments were incorrect, though. I'll be more careful next time.


@djaglowski I'd like your thoughts on an approach.

Note:

  • We need both the file and the fingerprint to create a new reader. To link them together, we can create a struct that includes references to both the file and the fingerprint. For now, let’s call it a record.

For archiving, our main focus is to be efficient when accessing disk storage.

The current implementation looks like the following; correct me if I'm wrong:
[screenshot: diagram of the current reader-creation flow]

It becomes very difficult to integrate archiving the way we want (i.e. we should search 1 fileset for N fingerprints before moving on to the next fileset). Going through paths and creating readers one at a time adds to the difficulty.

I propose a new way to create readers, which takes care of our requirements.

I think what we need is to first combine the files and fingerprints into records (the struct containing the file and fingerprint):

  1. Next, we'll check each record for a match in memory.
  2. Then, we'll go through each archive fileset to find matches for the still-unmatched records.
  3. Finally, we'll have a loop that creates readers from the matched metadata and adds them to the tracker.

In other words, we need to divide our function as per checkpoints:

  1. 1st loop to combine files and fingerprints into an array.
  2. 2nd loop to go through the combined array and try finding a match in memory.
  3. Reading from archive.
  4. Finally, create readers from records.

As of now, only one loop exists and it becomes very difficult to integrate archiving in an efficient manner. Hence, I believe we need to decouple things.

Please let me know your thoughts on this.

@VihasMakwana
Contributor Author

VihasMakwana commented Oct 29, 2024

@djaglowski Here’s a pseudocode outline for the new approach. I've omitted the archiving section for simplicity, but we can easily incorporate it into FindFiles.

type Record struct {
	file *os.File
	fingerprint *fingerprint.Fingerprint
	metadata *reader.Metadata // metadata is non-nil if a file is found in knownFiles. 
}

func makeReaders(paths []string) {
	records := make([]*Record, 0, len(paths))
	for _, path := range paths {
		fp, file := m.makeFingerprint(path)
		if fp == nil {
			continue
		}
		records = append(records, &Record{file: file, fingerprint: fp})
	}

	findFiles(records) // update records with matched metadata, in-place

	// create new readers once matching is done
	for _, record := range records {
		// Exclude duplicate paths with the same content. This can happen when files are
		// being rotated with copy/truncate strategy. (After copy, prior to truncate.)
		if r := m.tracker.GetCurrentFile(record.fingerprint); r != nil {
			m.tracker.Add(r)
			record.file.Close()
			continue
		}
		r, err := m.newReader(ctx, record)
		if err != nil {
			m.set.Logger.Error("Failed to create reader", zap.Error(err))
			continue
		}

		m.tracker.Add(r)

	}

}

// findFiles loops through the records, matches them against the offsets in memory and updates record.metadata with found metadata
func findFiles(records []*Record) {
	for _, record := range records {

        // update record.Metadata if match is found
		if oldReader := t.GetOpenFile(record.fingerprint); oldReader != nil {
			record.metadata = oldReader.Close()
		} else if oldMetadata := t.GetClosedFile(record.fingerprint); oldMetadata != nil {
			record.metadata = oldMetadata
		}
	}
}


func (m *Manager) newReader(ctx context.Context, record *tracker.Record) (*reader.Reader, error) {

	if record.metadata != nil {
		return m.readerFactory.NewReaderFromMetadata(record.file, record.metadata)
	} else {
		// If we don't match any previously known files, create a new reader from scratch
		m.set.Logger.Info("Started watching file", zap.String("path", record.file.Name()))
		return m.readerFactory.NewReader(record.file, record.fingerprint)
	}
}

@djaglowski
Member

This is precisely what I did for SyncOffsets()

I don't see it. It's fundamentally doing something different than what I described. SyncOffsets is

  1. iterating through all sets in the archive
  2. operating on individual files
  3. dependent on unnecessary state that you've added to readers

What I have suggested is that there should be no need for syncing files. We only need:

  1. Load a set from the archive. Search it for matches. Remove matches from the set. Return the matches. If any matches were found, rewrite the set to the archive.
  2. At the end of a poll, instead of deleting the oldest set in knownFiles, write it to the archive.
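The two points above can be sketched with toy types; the `tracker` fields and helper names here are hypothetical, not the actual fileconsumer structures. At the end of a poll, the oldest `knownFiles` set is written into the archive ring buffer instead of being dropped:

```go
package main

import "fmt"

type Fileset []string // simplified: a set of file metadata entries

type tracker struct {
	knownFiles []Fileset // newest first; fixed length == number of kept polls
	archive    []Fileset // ring buffer of pollsToArchive slots
	archiveIdx int
}

// endPoll rotates knownFiles; instead of discarding the oldest set, it is
// written into the archive ring, advancing the write index with wraparound.
func (t *tracker) endPoll(current Fileset) {
	oldest := t.knownFiles[len(t.knownFiles)-1]
	t.archive[t.archiveIdx] = oldest // writeArchive(t.archiveIdx, oldest) in real code
	t.archiveIdx = (t.archiveIdx + 1) % len(t.archive)
	copy(t.knownFiles[1:], t.knownFiles[:len(t.knownFiles)-1])
	t.knownFiles[0] = current
}

func main() {
	t := &tracker{
		knownFiles: []Fileset{{"b"}, {"a"}}, // "a" is the oldest set
		archive:    make([]Fileset, 3),
	}
	t.endPoll(Fileset{"c"})
	fmt.Println(t.archive[0][0], t.knownFiles[0][0]) // prints: a c
}
```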

@VihasMakwana
Contributor Author

VihasMakwana commented Nov 1, 2024

What I have suggested is that there should be no need for syncing files. We only need:

  1. Load a set from the archive. Search it for matches. Remove matches from the set. Return the matches. If any matches were found, rewrite the set to the archive.
  2. At the end of a poll, instead of deleting the oldest set in knownFiles, write it to the archive.

@djaglowski
I see your point.
In your example, the FindFiles function is defined as FindFiles(fps []fingerprint.Fingerprint) []readerMetadata. However, for creating a reader, we have two function signatures:
NewReaderFromMetadata(file *os.File, m *Metadata) and NewReader(file *os.File, fp *Fingerprint).

This raises a concern: we need a way to link the fingerprint or metadata to a specific os.File.
Essentially, we need to indicate that "metadata x belongs to file y," so we can then call NewReaderFromMetadata(x, y).


If we only return an array of metadata, we won't have the corresponding os.File instance needed to create a reader. What are your thoughts on this? Or am I missing something here? Please correct me if I'm going off track.

@VihasMakwana
Contributor Author

VihasMakwana commented Nov 1, 2024

@djaglowski
First, thank you for your patience. I’ve made updates to the PR based on your feedback.

I’ve removed the unnecessary state from the readers and reverted to the original pseudocode. Please take a look!


I’d also like to discuss future PRs to ensure we’re aligned.
If you review this PR again, you’ll notice the introduction of a new convenience struct called Record.

  • Why do we need this?
    • It simplifies reader creation. You can see more details here.
  • What would future PRs look like, and how do we plan to use the Record struct?
    • At a high level, they will follow a structure similar to the following pseudocode:
func makeReaders(paths []string) {
	unmatchedFiles := make([]*Record, 0)
	for _, path := range paths {
		fp, file := m.makeFingerprint(path)
		if fp == nil {
			continue
		}
		// ...Exclude duplicate paths
		if (fp found in tracker) {
			// call NewReaderFromMetadata
		} else {
			unmatchedFiles = append(unmatchedFiles, Record{File: file, Fingerprint: fp})
		}
	}
	
	// Update unmatchedFiles in place by searching for any matches in the archive via the tracker
	// This method will populate the Metadata field in the Record if a match is found
	tracker.FindFiles(unmatchedFiles) // this method is introduced in current PR
	
	// Now, process the unmatchedFiles to determine if they have been matched in the archive
	for _, record := range unmatchedFiles {
		// Check if Metadata has been populated, indicating a successful match
		if record.Metadata != nil {
			r, err := NewReaderFromMetadata(record.File, record.Metadata)
			tracker.Add(r)
		} else {
			r, err := NewReader(record.File, record.Fingerprint)
			tracker.Add(r)
		}
	}
}

@VihasMakwana
Contributor Author

@djaglowski were you able to take a look at this? Let me know your thoughts!

@djaglowski
Member

@djaglowski I see your point. In your example, the FindFiles function is defined as FindFiles(fps []fingerprint.Fingerprint) []readerMetadata. However, for creating a reader, we have two function signatures: NewReaderFromMetadata(file *os.File, m *Metadata) and NewReader(file *os.File, fp *Fingerprint).

This raises a concern: we need a way to link the fingerprint or metadata to a specific os.File. Essentially, we need to indicate that "metadata x belongs to file y," so we can then call NewReaderFromMetadata(x, y).

That's a good point. I think we can handle that a couple different ways. One would be to just return a slice of the same size, where each index of the result slice corresponds to the fingerprint of the same index in the input slice.
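A small sketch of that index-aligned shape, with simplified stand-in types: the result slice has the same length as the input, and a `nil` entry means the archive held no match for the fingerprint at that index.

```go
package main

import (
	"bytes"
	"fmt"
)

// Simplified stand-ins for the real fingerprint/reader types.
type Fingerprint []byte
type Metadata struct{ FirstBytes []byte }

// findFiles returns one slot per input fingerprint; a nil entry means no
// match, so the caller can still pair each result with its os.File by index.
func findFiles(archive []Metadata, fps []Fingerprint) []*Metadata {
	matched := make([]*Metadata, len(fps)) // same size as the input slice
	for i, fp := range fps {
		for j := range archive {
			if bytes.Equal(archive[j].FirstBytes, fp) {
				matched[i] = &archive[j]
				break
			}
		}
	}
	return matched
}

func main() {
	archive := []Metadata{{FirstBytes: []byte("old")}}
	results := findFiles(archive, []Fingerprint{[]byte("new"), []byte("old")})
	fmt.Println(results[0] == nil, results[1] != nil) // prints: true true
}
```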

@djaglowski
Member

I’d also like to discuss future PRs to ensure we’re aligned. If you review this PR again, you’ll notice the introduction of a new convenience struct called Record.

Let's not get into future PRs in this PR. I don't see a need for the Record struct but I do see some pitfalls to using it. Let's leave it out until it is proven to be necessary.

@VihasMakwana
Contributor Author

@djaglowski I see your point. In your example, the FindFiles function is defined as FindFiles(fps []fingerprint.Fingerprint) []readerMetadata. However, for creating a reader, we have two function signatures: NewReaderFromMetadata(file *os.File, m *Metadata) and NewReader(file *os.File, fp *Fingerprint).
This raises a concern: we need a way to link the fingerprint or metadata to a specific os.File. Essentially, we need to indicate that "metadata x belongs to file y," so we can then call NewReaderFromMetadata(x, y).

That's a good point. I think we can handle that a couple different ways. One would be to just return a slice of the same size, where each index of the result slice corresponds to the fingerprint of the same index in the input slice.

I'll explore a few different ways and update the PR.

@VihasMakwana
Contributor Author

@djaglowski can you take a fresh look? I've removed Record, and each index of the result slice now corresponds to the fingerprint at the same index in the input slice. Let me know what you think!


func Mod(x, y int) int {
	return ((x % y) + y) % y
}
Contributor Author

VihasMakwana commented Nov 9, 2024

This is needed because in Go, the % operator computes the remainder, not a true mathematical modulus, so it can return negative integers.
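A quick demonstration (this is standard Go semantics for `%`):

```go
package main

import "fmt"

// Mod mirrors the helper above: a non-negative modulus even for negative x.
func Mod(x, y int) int {
	return ((x % y) + y) % y
}

func main() {
	// Go's % keeps the sign of the dividend, so stepping an index backwards
	// past zero with plain % would yield a negative archive index.
	fmt.Println(-1 % 10)     // prints -1
	fmt.Println(Mod(-1, 10)) // prints 9
}
```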

func (f *Fingerprint) GetFingerprint() *Fingerprint {
	return f
}

Contributor Author

I've added this so that Fingerprint can implement the Matchable interface, giving us a unified return type for

func (t *fileTracker) FindFiles(fps []*fingerprint.Fingerprint) []fileset.Matchable {
	// To minimize disk access, we first access the index, then review unmatched files and update the metadata, if found.

I think this is neater than returning []any or creating a struct to capture the fingerprint/metadata.
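A self-contained sketch of that design, with type shapes simplified from the real `fingerprint`/`fileset` packages: both `*Fingerprint` and `*Metadata` satisfy `Matchable`, so `FindFiles` can return `[]Matchable` without resorting to `[]any` or a wrapper struct.

```go
package main

import "fmt"

type Fingerprint struct{ FirstBytes []byte }

// GetFingerprint lets a Fingerprint stand in for itself in match results.
func (f *Fingerprint) GetFingerprint() *Fingerprint { return f }

type Metadata struct{ Fingerprint *Fingerprint }

// GetFingerprint exposes the fingerprint the metadata was recorded under.
func (m *Metadata) GetFingerprint() *Fingerprint { return m.Fingerprint }

// Matchable is the unified return type: either an unmatched fingerprint or
// the matched metadata, both addressable by fingerprint.
type Matchable interface {
	GetFingerprint() *Fingerprint
}

func main() {
	fp := &Fingerprint{FirstBytes: []byte("abc")}
	results := []Matchable{fp, &Metadata{Fingerprint: fp}}
	for _, r := range results {
		fmt.Println(string(r.GetFingerprint().FirstBytes)) // prints abc twice
	}
}
```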


func TestFindFilesOrder(t *testing.T) {
Contributor Author

Let me know your thoughts over this test @djaglowski.
