
Lassie fetching for retrieval #325

Merged: 12 commits into main from feat/lassie-fetch on Sep 22, 2023

Conversation

hannahhoward (Contributor) commented Sep 15, 2023

Goals

support retrieval for data that is no longer available in the source storage

Implementation

Core implementation pieces:

  • Retriever - coordinates fetching data from SPs via Lassie. Starts with a CID that represents the root of a UnixFS file DAG, a list of SP miner addresses, and a byte range to fetch from the original flat file. Looks up each SP's HTTP endpoint, fetches the requested data with Lassie, and then deserializes and returns the requested flat byte range.
    • EndpointFinder - looks up SP peer IDs from the chain, then contacts them over libp2p to get their HTTP endpoints. Maintains an LRU cache of endpoints for performance.
    • Deserializer - a simple function that deserializes a CAR stream into a flat byte range as it arrives.
  • RetrieveFileHandler
    • The primary addition here is that when we encounter a file that is no longer in the source storage, instead of returning an error, we follow a new code path that implements an io.ReadSeeker by looking up deals for various byte ranges and then using the Retriever to fetch the data from the SPs storing those deals. (A rough sketch of how these pieces fit together follows this list.)
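As a rough sketch of how these pieces might fit together (the type and method names below are illustrative, not the PR's actual ones): the handler hides the Retriever behind an io.ReadSeeker, so callers can read the original flat file even though the bytes now live only with SPs.

```go
package retrieval

import (
	"context"
	"errors"
	"io"
)

// Retriever is assumed here to expose a single range-fetch call; the real type
// resolves SP HTTP endpoints, runs Lassie, and deserializes the returned CAR
// back into flat file bytes.
type Retriever interface {
	RetrieveRange(ctx context.Context, rootCID string, sps []string, start, end int64) ([]byte, error)
}

// remoteFile adapts a Retriever to io.ReadSeeker for files that are no longer
// in source storage.
type remoteFile struct {
	ctx       context.Context
	retriever Retriever
	rootCID   string   // root of the UnixFS file DAG
	sps       []string // SPs holding deals for this file
	size      int64    // total file size, from the database
	offset    int64    // current seek position
}

func (f *remoteFile) Read(p []byte) (int, error) {
	if f.offset >= f.size {
		return 0, io.EOF
	}
	end := f.offset + int64(len(p))
	if end > f.size {
		end = f.size
	}
	data, err := f.retriever.RetrieveRange(f.ctx, f.rootCID, f.sps, f.offset, end)
	if err != nil {
		return 0, err
	}
	n := copy(p, data)
	f.offset += int64(n)
	return n, nil
}

func (f *remoteFile) Seek(offset int64, whence int) (int64, error) {
	switch whence {
	case io.SeekStart:
		f.offset = offset
	case io.SeekCurrent:
		f.offset += offset
	case io.SeekEnd:
		f.offset = f.size + offset
	}
	if f.offset < 0 {
		return 0, errors.New("seek before start of file")
	}
	return f.offset, nil
}
```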

For Discussion

  • Everything is unit tested. It's relatively difficult to implement a full integration test without an SP / mock SP to test with, so the current plan is to integration test through Motion. In the future, we might consider implementing a full integration test using https://github.com/ipld/frisbii/

  • There's a significant optimization still to be done. Currently, we do reads from remotes based on the size of the buffer passed to Read on the io.ReadSeeker implementation, which is probably around 2K. For longer reads, it would make sense to fetch full FileRange objects. At the same time, we probably need to know the bounds of the HTTP request ahead of time so we know how far ahead we can read without wasting SP network bandwidth. I didn't take this on here because the ticket is already large. (A sketch of one possible read-ahead approach follows this list.)
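One possible shape for that optimization, sketched against the hypothetical remoteFile above; the extra buf/bufStart fields and the fixed 1 MiB window are assumptions, not anything in the PR:

```go
const readAheadWindow = 1 << 20 // fetch 1 MiB at a time instead of len(p)-sized chunks

// readBuffered assumes remoteFile gains two extra fields:
//	buf      []byte // last fetched window
//	bufStart int64  // file offset of buf[0]
func (f *remoteFile) readBuffered(p []byte) (int, error) {
	if f.offset >= f.size {
		return 0, io.EOF
	}
	// Refill the window only when the current offset falls outside it.
	if f.offset < f.bufStart || f.offset >= f.bufStart+int64(len(f.buf)) {
		end := f.offset + readAheadWindow
		if end > f.size {
			end = f.size
		}
		data, err := f.retriever.RetrieveRange(f.ctx, f.rootCID, f.sps, f.offset, end)
		if err != nil {
			return 0, err
		}
		f.buf, f.bufStart = data, f.offset
	}
	n := copy(p, f.buf[f.offset-f.bufStart:])
	f.offset += int64(n)
	return n, nil
}
```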


// getContent fetches content through Lassie and writes the CAR file to an output writer
func (r *Retriever) getContent(ctx context.Context, c cid.Cid, rangeStart int64, rangeEnd int64, sps []string, carOutput io.Writer) error {
	writable, err := storage.NewWritable(carOutput, []cid.Cid{c}, car.WriteAsCarV1(true))
Contributor:

the round-trip through a CARv1 seems unfortunate; serializing and deserializing when you already have the blocks is maybe a tad inefficient? Perhaps you should make this accept a storage.WritableStorage, and the "deserializer" accept a storage.ReadableStorage, and your pipe is just a custom storage implementation that takes blocks from the write side, queues them up, and puts them out on the read side, checking that the expected read CID matches the queued CID and erroring if not. You could also apply backpressure there, preventing writes until you get a read, or allowing however much buffering you want.
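As a hedged illustration of this suggestion (the Pipe type below is not from the PR; it is only shaped to line up with go-ipld-prime's storage.WritableStorage and storage.ReadableStorage method signatures): a single struct hands each written block straight to the reader, errors if the reader asks for a different key than the one queued, and gets backpressure for free from an unbuffered channel.

```go
package blockpipe

import (
	"context"
	"fmt"
)

type block struct {
	key  string
	data []byte
}

// Pipe queues blocks from the write side and serves them to the read side.
type Pipe struct {
	ch chan block
}

// New returns a pipe with no buffering, so a Put blocks until the matching Get
// arrives; a buffered channel would allow a bounded amount of read-ahead.
func New() *Pipe {
	return &Pipe{ch: make(chan block)}
}

// Put is shaped like ipld-prime's storage.WritableStorage.Put.
func (p *Pipe) Put(ctx context.Context, key string, content []byte) error {
	select {
	case p.ch <- block{key: key, data: content}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Get is shaped like storage.ReadableStorage.Get. It assumes blocks are written
// in the same order the reader traverses them, which holds for a depth-first
// CAR-style stream with duplicates included.
func (p *Pipe) Get(ctx context.Context, key string) ([]byte, error) {
	select {
	case blk := <-p.ch:
		if blk.key != key {
			return nil, fmt.Errorf("unexpected block: got key %x, want %x", blk.key, key)
		}
		return blk.data, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// Has is required by the storage interfaces; a pipe cannot answer it without
// consuming a block, so this sketch simply reports false.
func (p *Pipe) Has(ctx context.Context, key string) (bool, error) {
	return false, nil
}
```

The deserializer would then read blocks through Get in traversal order while Lassie writes into Put, with no CARv1 framing in between.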

Contributor:

I did something kind of similar here https://github.com/ipld/go-car/blob/master/cmd/car/extract.go#L380C6-L380C22 but it's got to account for the no-dupes case and it is decoding a CAR. But the point is that it's providing a ReadableStorage to the unixfs reifier.

Contributor Author (hannahhoward):

ah I see what you mean, skip CARv1 streaming and just make a block pipe? interesting!

Contributor Author (hannahhoward):

storage.Pipe seems like an interesting concept, though I wonder if it belongs in go-ipld-prime :P

Contributor Author (hannahhoward):

the other thing is that the fact that it's length-encoded as a CAR is what makes using io.Pipe possible.


codecov bot commented Sep 16, 2023

Codecov Report

Patch coverage: 62.46% and project coverage change: -0.45% ⚠️

Comparison is base (88c011e) 74.33% compared to head (4407193) 73.88%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #325      +/-   ##
==========================================
- Coverage   74.33%   73.88%   -0.45%     
==========================================
  Files         137      140       +3     
  Lines        8793     9096     +303     
==========================================
+ Hits         6536     6721     +185     
- Misses       1589     1674      +85     
- Partials      668      701      +33     
Files Changed Coverage Δ
retriever/deserializer/deserializer.go 41.81% <41.81%> (ø)
api/api.go 73.37% <45.45%> (-0.94%) ⬇️
replication/makedeal.go 69.80% <48.00%> (-0.40%) ⬇️
handler/file/retrieve.go 55.46% <56.36%> (+18.62%) ⬆️
retriever/retriever.go 70.58% <70.58%> (ø)
retriever/endpointfinder/endpointfinder.go 94.44% <94.44%> (ø)
api/retrieve.go 53.84% <100.00%> (ø)
handler/file/interface.go 100.00% <100.00%> (ø)

... and 3 files with indirect coverage changes


hannahhoward marked this pull request as ready for review September 18, 2023 15:55
hannahhoward changed the title WIP: Lassie fetching for retrieval → Lassie fetching for retrieval Sep 18, 2023
	err := db.Table("deals").Select("provider").
		Joins("JOIN cars ON deals.piece_cid = cars.piece_cid").
		Where("cars.job_id = ? and deals.state IN (?)", jobID, []model.DealState{
			model.DealPublished,
Contributor:

Are you trying to see if the miners who have published the deal might have sealed it before the tracking gets updated?
Do you want to add DealExpired?

Contributor Author (hannahhoward):

Data is generally retrievable from Boost as soon as the deal is published; sealing just creates the sealed copy.


func findProviders(db *gorm.DB, jobID model.JobID) ([]string, error) {
	var deals []deal
	err := db.Table("deals").Select("provider").
Contributor:

Do you need to add distinct here?
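For illustration, one way the suggested change could look (hedged: this assumes the query ends in a Find on the deals slice, as the surrounding function implies, and uses GORM's Distinct clause):

```go
err := db.Table("deals").Distinct("deals.provider").
	Joins("JOIN cars ON deals.piece_cid = cars.piece_cid").
	Where("cars.job_id = ? AND deals.state IN (?)", jobID, []model.DealState{
		model.DealPublished,
		// ... other retrievable states from the original query ...
	}).
	Find(&deals).Error
```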

		readLen = remainingBytes
	}

	fileRanges, err := findFileRanges(r.db, r.id, r.offset, r.offset+readLen)
Contributor:

This has some room to optimize to reduce db and network requests. Even during a sequential read, the length of the buffer p increases gradually, so you may end up calling findFileRanges and findProviders multiple times.

How about finding all file ranges the first time Read() is called, so we can reduce the database load?

Contributor:

And probably providers too.
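A hedged sketch of that suggestion, reusing findFileRanges and findProviders from the excerpt above; the rangeReader struct, its fields, and the choice to load the whole file's ranges up front are assumptions, not the PR's code:

```go
// ensureMetadata loads every FileRange and the provider list once, on the
// first Read, so later reads only hit the network, not the database.
func (r *rangeReader) ensureMetadata() error {
	if r.allRanges != nil {
		return nil
	}
	ranges, err := findFileRanges(r.db, r.id, 0, r.size) // whole file, one query
	if err != nil {
		return err
	}
	providers, err := findProviders(r.db, r.jobID)
	if err != nil {
		return err
	}
	r.allRanges, r.providers = ranges, providers
	return nil
}

// Read would then call ensureMetadata first and pick the cached ranges that
// overlap [r.offset, r.offset+len(p)) instead of querying per call.
```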

		return err
	}
	request.Duplicates = true
	request.Protocols = []multicodec.Code{multicodec.TransportIpfsGatewayHttp}
Contributor:

Have you evaluated using the piece gateway? Because each fileRange is stored sequentially in the file, you can get the CarBlocks of a fileRange and stream that fileRange with a single HTTP request using a Range header, validating the content as it streams.

Contributor:

Or, we should make it configurable so the user can choose different ways of retrieval based on some header value, e.g. x-transport-protocol.
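A hedged sketch of that idea; the header name is illustrative, and the graphsync/bitswap constants are assumed from go-multicodec's transport codes, with the HTTP default matching what the PR hard-codes today:

```go
import (
	"net/http"

	"github.com/multiformats/go-multicodec"
)

// protocolsFromHeader maps an optional request header onto the Lassie protocol
// list, defaulting to the HTTP transport.
func protocolsFromHeader(h http.Header) []multicodec.Code {
	switch h.Get("X-Transport-Protocol") { // header name is illustrative
	case "graphsync":
		return []multicodec.Code{multicodec.TransportGraphsyncFilecoinv1}
	case "bitswap":
		return []multicodec.Code{multicodec.TransportBitswap}
	default: // unset or "http"
		return []multicodec.Code{multicodec.TransportIpfsGatewayHttp}
	}
}
```

The handler would then set request.Protocols = protocolsFromHeader(req.Header) instead of the fixed slice above.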

hannahhoward merged commit f9699b6 into main Sep 22, 2023
12 checks passed
hannahhoward deleted the feat/lassie-fetch branch September 22, 2023 00:13