
Lassie fetching for retrieval #325

Merged: 12 commits into main from feat/lassie-fetch on Sep 22, 2023

Conversation

hannahhoward (Contributor) commented Sep 15, 2023

Goals

support retrieval for data that is no longer available in the source storage

Implementation

Core implementation pieces:

  • Retriever - coordinates fetching data from SPs via Lassie. Starts with a CID that represents the root of a UnixFS file DAG, a list of SP miner addresses, and a byte range to fetch from the original flat file. Looks up each SP's HTTP endpoint, fetches the requested data with Lassie, and then deserializes and returns the requested flat byte range.
    • EndpointFinder - looks up SP peer IDs from the chain, then contacts them over libp2p to get their HTTP endpoints. Maintains an LRU cache of endpoints for performance.
    • Deserializer - a simple function that deserializes a CAR stream into a flat byte range as it arrives.
  • RetrieveFileHandler
    • The primary addition here is that when we encounter a file that is no longer in the source storage, instead of returning an error, we follow a new code path that implements an io.ReadSeeker by looking up deals for various byte ranges and then using the Retriever to fetch the data from the SPs storing those deals. (A rough sketch of how these pieces fit together follows this list.)
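As a rough sketch of how these pieces might fit together (the type and method names below are illustrative, not the PR's actual ones): the handler hides the Retriever behind an io.ReadSeeker, so callers can read the original flat file even though the bytes now live only with SPs.

```go
package retrieval

import (
	"context"
	"errors"
	"io"
)

// Retriever is assumed here to expose a single range-fetch call; the real type
// resolves SP HTTP endpoints, runs Lassie, and deserializes the returned CAR
// back into flat file bytes.
type Retriever interface {
	RetrieveRange(ctx context.Context, rootCID string, sps []string, start, end int64) ([]byte, error)
}

// remoteFile adapts a Retriever to io.ReadSeeker for files that are no longer
// in source storage.
type remoteFile struct {
	ctx       context.Context
	retriever Retriever
	rootCID   string   // root of the UnixFS file DAG
	sps       []string // SPs holding deals for this file
	size      int64    // total file size, from the database
	offset    int64    // current seek position
}

func (f *remoteFile) Read(p []byte) (int, error) {
	if f.offset >= f.size {
		return 0, io.EOF
	}
	end := f.offset + int64(len(p))
	if end > f.size {
		end = f.size
	}
	data, err := f.retriever.RetrieveRange(f.ctx, f.rootCID, f.sps, f.offset, end)
	if err != nil {
		return 0, err
	}
	n := copy(p, data)
	f.offset += int64(n)
	return n, nil
}

func (f *remoteFile) Seek(offset int64, whence int) (int64, error) {
	switch whence {
	case io.SeekStart:
		f.offset = offset
	case io.SeekCurrent:
		f.offset += offset
	case io.SeekEnd:
		f.offset = f.size + offset
	}
	if f.offset < 0 {
		return 0, errors.New("seek before start of file")
	}
	return f.offset, nil
}
```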

For Discussion

  • Everything is unit tested. It's relatively difficult to implement a full integration test without an SP / mock SP to test with, so the current plan is to integration test through Motion. In the future, we might consider implementing a full integration test using https://github.com/ipld/frisbii/

  • There's a significant optimization still to be done. Currently, we do reads from remotes based on the size of the buffer passed to Read on the io.ReadSeeker implementation, which is probably around 2K. For longer reads, it would make sense to fetch full FileRange objects. At the same time, we probably need to know the bounds of the HTTP request ahead of time so we know how far ahead we can read without wasting SP network bandwidth. I didn't take this on here because the ticket is already large. (A sketch of one possible read-ahead approach follows this list.)
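One possible shape for that optimization, sketched against the hypothetical remoteFile above; the extra buf/bufStart fields and the fixed 1 MiB window are assumptions, not anything in the PR:

```go
const readAheadWindow = 1 << 20 // fetch 1 MiB at a time instead of len(p)-sized chunks

// readBuffered assumes remoteFile gains two extra fields:
//	buf      []byte // last fetched window
//	bufStart int64  // file offset of buf[0]
func (f *remoteFile) readBuffered(p []byte) (int, error) {
	if f.offset >= f.size {
		return 0, io.EOF
	}
	// Refill the window only when the current offset falls outside it.
	if f.offset < f.bufStart || f.offset >= f.bufStart+int64(len(f.buf)) {
		end := f.offset + readAheadWindow
		if end > f.size {
			end = f.size
		}
		data, err := f.retriever.RetrieveRange(f.ctx, f.rootCID, f.sps, f.offset, end)
		if err != nil {
			return 0, err
		}
		f.buf, f.bufStart = data, f.offset
	}
	n := copy(p, f.buf[f.offset-f.bufStart:])
	f.offset += int64(n)
	return n, nil
}
```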


// getContent fetches content through Lassie and writes the CAR file to an output writer
func (r *Retriever) getContent(ctx context.Context, c cid.Cid, rangeStart int64, rangeEnd int64, sps []string, carOutput io.Writer) error {
	writable, err := storage.NewWritable(carOutput, []cid.Cid{c}, car.WriteAsCarV1(true))
Contributor:

the round-trip through a CARv1 seems unfortunate; serializing and deserializing when you already have the blocks is maybe a tad inefficient? Perhaps you should make this accept a storage.WritableStorage, and the "deserializer" accept a storage.ReadableStorage, and your pipe is just a custom storage implementation that takes blocks from the write side, queues them up, and puts them out on the read side, checking that the expected read CID matches the queued CID and erroring if not. You could also apply backpressure there, preventing writes until you get a read, or allowing however much buffering you want.
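As a hedged illustration of this suggestion (the Pipe type below is not from the PR; it is only shaped to line up with go-ipld-prime's storage.WritableStorage and storage.ReadableStorage method signatures): a single struct hands each written block straight to the reader, errors if the reader asks for a different key than the one queued, and gets backpressure for free from an unbuffered channel.

```go
package blockpipe

import (
	"context"
	"fmt"
)

type block struct {
	key  string
	data []byte
}

// Pipe queues blocks from the write side and serves them to the read side.
type Pipe struct {
	ch chan block
}

// New returns a pipe with no buffering, so a Put blocks until the matching Get
// arrives; a buffered channel would allow a bounded amount of read-ahead.
func New() *Pipe {
	return &Pipe{ch: make(chan block)}
}

// Put is shaped like ipld-prime's storage.WritableStorage.Put.
func (p *Pipe) Put(ctx context.Context, key string, content []byte) error {
	select {
	case p.ch <- block{key: key, data: content}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Get is shaped like storage.ReadableStorage.Get. It assumes blocks are written
// in the same order the reader traverses them, which holds for a depth-first
// CAR-style stream with duplicates included.
func (p *Pipe) Get(ctx context.Context, key string) ([]byte, error) {
	select {
	case blk := <-p.ch:
		if blk.key != key {
			return nil, fmt.Errorf("unexpected block: got key %x, want %x", blk.key, key)
		}
		return blk.data, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// Has is required by the storage interfaces; a pipe cannot answer it without
// consuming a block, so this sketch simply reports false.
func (p *Pipe) Has(ctx context.Context, key string) (bool, error) {
	return false, nil
}
```

The deserializer would then read blocks through Get in traversal order while Lassie writes into Put, with no CARv1 framing in between.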

Contributor:

I did something kind of similar here https://github.com/ipld/go-car/blob/master/cmd/car/extract.go#L380C6-L380C22 but it's got to account for the no-dupes case and it is decoding a CAR. But the point is that it's providing a ReadableStorage to the unixfs reifier.

Contributor Author (hannahhoward):

ah I see what you mean, skip CARv1 streaming and just make a block pipe? interesting!

Contributor Author (hannahhoward):

storage.Pipe seems like an interesting concept, though I wonder if it belongs in go-ipld-prime :P

Contributor Author (hannahhoward):

the other thing is that the fact that it's length-encoded as a CAR is what makes using io.Pipe possible.


codecov bot commented Sep 16, 2023

Codecov Report

Patch coverage: 62.46% and project coverage change: -0.45% ⚠️

Comparison is base (88c011e) 74.33% compared to head (4407193) 73.88%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #325      +/-   ##
==========================================
- Coverage   74.33%   73.88%   -0.45%     
==========================================
  Files         137      140       +3     
  Lines        8793     9096     +303     
==========================================
+ Hits         6536     6721     +185     
- Misses       1589     1674      +85     
- Partials      668      701      +33     
Files Changed Coverage Δ
retriever/deserializer/deserializer.go 41.81% <41.81%> (ø)
api/api.go 73.37% <45.45%> (-0.94%) ⬇️
replication/makedeal.go 69.80% <48.00%> (-0.40%) ⬇️
handler/file/retrieve.go 55.46% <56.36%> (+18.62%) ⬆️
retriever/retriever.go 70.58% <70.58%> (ø)
retriever/endpointfinder/endpointfinder.go 94.44% <94.44%> (ø)
api/retrieve.go 53.84% <100.00%> (ø)
handler/file/interface.go 100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes


hannahhoward marked this pull request as ready for review September 18, 2023 15:55
hannahhoward changed the title WIP: Lassie fetching for retrieval → Lassie fetching for retrieval Sep 18, 2023
	err := db.Table("deals").Select("provider").
		Joins("JOIN cars ON deals.piece_cid = cars.piece_cid").
		Where("cars.job_id = ? and deals.state IN (?)", jobID, []model.DealState{
			model.DealPublished,
Contributor:

Are you trying to see if the miners who have published the deal might have sealed it before the tracking gets updated?
Do you want to add DealExpired?

Contributor Author (hannahhoward):

Data is generally retrievable from Boost as soon as the deal is published; sealing just creates the sealed copy.


func findProviders(db *gorm.DB, jobID model.JobID) ([]string, error) {
	var deals []deal
	err := db.Table("deals").Select("provider").
Contributor:

Do you need to add distinct here?
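For illustration, one way the suggested change could look (hedged: this assumes the query ends in a Find on the deals slice, as the surrounding function implies, and uses GORM's Distinct clause):

```go
err := db.Table("deals").Distinct("deals.provider").
	Joins("JOIN cars ON deals.piece_cid = cars.piece_cid").
	Where("cars.job_id = ? AND deals.state IN (?)", jobID, []model.DealState{
		model.DealPublished,
		// ... other retrievable states from the original query ...
	}).
	Find(&deals).Error
```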

		readLen = remainingBytes
	}

	fileRanges, err := findFileRanges(r.db, r.id, r.offset, r.offset+readLen)
Contributor:

This has some room to optimize to reduce db and network requests. Even during a sequential read, the length of the buffer p increases gradually, so you may end up calling findFileRanges and findProviders multiple times.

How about finding all file ranges the first time Read() is called, so we can reduce the database load?

Contributor:

And probably providers too.
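A hedged sketch of that suggestion, reusing findFileRanges and findProviders from the excerpt above; the rangeReader struct, its fields, and the choice to load the whole file's ranges up front are assumptions, not the PR's code:

```go
// ensureMetadata loads every FileRange and the provider list once, on the
// first Read, so later reads only hit the network, not the database.
func (r *rangeReader) ensureMetadata() error {
	if r.allRanges != nil {
		return nil
	}
	ranges, err := findFileRanges(r.db, r.id, 0, r.size) // whole file, one query
	if err != nil {
		return err
	}
	providers, err := findProviders(r.db, r.jobID)
	if err != nil {
		return err
	}
	r.allRanges, r.providers = ranges, providers
	return nil
}

// Read would then call ensureMetadata first and pick the cached ranges that
// overlap [r.offset, r.offset+len(p)) instead of querying per call.
```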

		return err
	}
	request.Duplicates = true
	request.Protocols = []multicodec.Code{multicodec.TransportIpfsGatewayHttp}
Contributor:

Have you evaluated using the piece gateway? Because each fileRange is stored sequentially in the file, you can get the CarBlocks of a fileRange and stream that fileRange with a single HTTP request using a Range header, validating the content as it streams.

Contributor:

Or, we should make it configurable so the user can choose different ways of retrieval based on some header value, e.g. x-transport-protocol.
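A hedged sketch of that idea; the header name is illustrative, and the graphsync/bitswap constants are assumed from go-multicodec's transport codes, with the HTTP default matching what the PR hard-codes today:

```go
import (
	"net/http"

	"github.com/multiformats/go-multicodec"
)

// protocolsFromHeader maps an optional request header onto the Lassie protocol
// list, defaulting to the HTTP transport.
func protocolsFromHeader(h http.Header) []multicodec.Code {
	switch h.Get("X-Transport-Protocol") { // header name is illustrative
	case "graphsync":
		return []multicodec.Code{multicodec.TransportGraphsyncFilecoinv1}
	case "bitswap":
		return []multicodec.Code{multicodec.TransportBitswap}
	default: // unset or "http"
		return []multicodec.Code{multicodec.TransportIpfsGatewayHttp}
	}
}
```

The handler would then set request.Protocols = protocolsFromHeader(req.Header) instead of the fixed slice above.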

hannahhoward merged commit f9699b6 into main Sep 22, 2023
12 checks passed
hannahhoward deleted the feat/lassie-fetch branch September 22, 2023 00:13