
Reed-Solomon layer over IPFS #196

@jackie-scholl

Consider this scenario:

You are some entity with a whole lot of data. You're using IPFS to store and retrieve that data for your servers (either public cloud or on-prem). You're probably using an (as-yet-unimplemented) private IPFS network to keep your files totally safe. To ensure resiliency, you always make sure that each file you care about is stored on IPFS nodes in at least three geographically isolated locations, out of the 10 locations that you have storage servers in. That means that for every GB of storage you need, you need 3 GB of disk space. But when you look over to your peers, they manage to store more data with less disk and higher resiliency using Reed-Solomon erasure coding! In fact, your friends at Facebook are using 10-of-14 encoding, which allows them to survive the failure of any 4 nodes out of the 14 storing their data while only using 1.4 GB of disk per GB of storage. But you're using IPFS, so you can't achieve that efficiency. Or can you?
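To make the storage math concrete: the disk overhead of k-of-n erasure coding is simply n/k, and the number of survivable chunk losses is n - k. A tiny Python illustration, using the numbers from the scenario above (3x replication can be viewed as the degenerate case RS(1, 3)):

```python
def storage_overhead(k: int, n: int) -> float:
    """Disk consumed per unit of data stored, for k-of-n erasure coding."""
    return n / k

def tolerated_failures(k: int, n: int) -> int:
    """How many of the n chunks can be lost before the data is unrecoverable."""
    return n - k

# 3x replication, i.e. RS(1, 3): 3.0 GB of disk per GB, survives 2 losses.
print(storage_overhead(1, 3), tolerated_failures(1, 3))      # 3.0 2
# Facebook-style RS(10, 14): 1.4 GB of disk per GB, survives any 4 losses.
print(storage_overhead(10, 14), tolerated_failures(10, 14))  # 1.4 4
```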

A Possible Architecture for Reed-Solomon on IPFS

(Note: I use the term "availability zone" here to indicate groups of servers that are likely to fail together. I assume failures between availability zones are independent.)

I present here an idea for how one might achieve the efficiency of Reed-Solomon storage and also preserve many of the benefits of IPFS.

There are four components to this architecture:

  • A coordination service that decides what files need to be stored, and on what nodes they should be stored (a rough sketch of the records it might keep follows this list)
  • Storage nodes, which are running an IPFS client and receive instructions for pinning and unpinning IPFS files from the coordination service. They also report to the coordination service how much space they have left, which files they have pinned, what availability zone they're in, etc.
  • Encoding nodes, which receive requests from the coordination service to Reed-Solomon-encode a particular IPFS file and put the chunks on IPFS, and then return the necessary data to the coordination service.
  • Decoding nodes, which sit on the network and offer up files that they do not have locally but that are stored, RS-encoded, on the storage nodes.
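To make the division of labor concrete, here's a rough sketch, in Python with entirely hypothetical names, of the records the coordination service might keep about storage nodes and encoded files:

```python
from dataclasses import dataclass, field

@dataclass
class StorageNodeStatus:
    """What each storage node periodically reports to the coordination service."""
    node_id: str
    availability_zone: str
    free_bytes: int
    pinned_cids: list[str] = field(default_factory=list)

@dataclass
class EncodedFileRecord:
    """The key-value pair the coordination service keeps per cold file:
    source-file reference -> reference to the directory of RS chunks."""
    source_cid: str      # the key: hash of the original source file
    chunk_dir_cid: str   # the value: hash of the directory of chunks
    k: int = 7           # chunks needed to reconstruct, e.g. RS(7, 10)
    n: int = 10          # total chunks produced
```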

Here's how the basic flow works (a code sketch of the encoding steps follows the list):

  1. Some application puts a file in IPFS that needs to be reliably stored
  2. The application tells the coordination service that the file should be stored
  3. The coordination service chooses three storage nodes in different availability zones and tells them each to pin the file
  4. The file is accessed for a while, but then requests for the file start coming less and less frequently
  5. The coordination service decides that the file is getting accessed rarely enough that it should go into "cold storage"
  6. The coordination service picks an encoding node (at random?) that's not busy, and tells it to RS-encode this file with (just as an example) 7-of-10 resiliency (also known as RS(7, 10))
  7. The encoding node pins the "source" file and runs the Reed-Solomon algorithm and generates 10 chunks, such that any 7 can be used to perfectly recreate the file. The chunks are self-describing, in that they include all the parameters an algorithm will need to later decode the chunks
  8. The encoding node makes each chunk its own file, and puts them all in a directory
  9. The encoding node puts the directory on IPFS
  10. The encoding node returns the reference (hash, address, whatever you want to call it) of the directory to the coordination service
  11. The coordination service stores this as a key-value pair, where the key is the reference of the source file and the value is the reference to the directory returned by the encoder
  12. The coordination service steps through the directory and, for each file in the directory, tells a storage node in a different availability zone to pin that file. In this case, there are 10 such chunk files
  13. Once the coordination service has confirmed that each storage node has the file it needs, the source file can be considered reliably stored in these nodes, but it is not directly accessible from them
  14. The coordination service chooses at least one (but probably many) decoding node(s), and sends them the key-value pair from before
  15. The decoding nodes store this key-value pair, and also pin (directly, not recursively) the value part, which, if you'll remember, was the directory with the list of chunks
  16. The decoding nodes now announce that they are serving the original/source file (the key part of the key-value pair), despite the fact that it is not stored locally
  17. At this point, the source file is highly available through the storage/decoding system, so the coordination service is able to tell those original three storage nodes that they don't need to store the source file. It's also able to tell the encoding node that it doesn't need to store any of that data any more
  18. Whenever a request comes in for the original file, the decoding node looks up the value for that key, and recursively pins that value reference
  19. This causes IPFS to download each chunk, which will end up coming from those 10 storage nodes from before
  20. The decoding node actively watches the directory as the chunks come in
  21. As soon as it has 7 chunks, it is able to reassemble the source file
  22. As soon as the source file is reassembled, it can be served to the requester, and the decoding node can erase the assembled file and also un-pin the chunks that it used
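To make steps 6-10 concrete, here's a sketch of the encoding node's job in Python. The `rs_encode` and `ipfs_add_dir` helpers are hypothetical stand-ins, not real APIs: a real implementation might back them with an erasure-coding library such as zfec and the IPFS HTTP API, respectively.

```python
import os
import tempfile

def rs_encode(data: bytes, k: int, n: int) -> list[bytes]:
    """Hypothetical: produce n self-describing chunks from `data`, any k of
    which suffice to reconstruct it (e.g. via a library like zfec)."""
    raise NotImplementedError

def ipfs_add_dir(path: str) -> str:
    """Hypothetical: add a directory to IPFS and return its reference (hash)."""
    raise NotImplementedError

def encode_and_publish(source_data: bytes, k: int = 7, n: int = 10) -> str:
    """Steps 7-10: RS-encode the pinned source file, write each chunk as its
    own file in a directory, put the directory on IPFS, and return the
    directory's reference, which the coordination service then stores as the
    value in its key-value pair."""
    chunks = rs_encode(source_data, k, n)
    workdir = tempfile.mkdtemp(prefix="rs-chunks-")
    for i, chunk in enumerate(chunks):
        with open(os.path.join(workdir, f"chunk-{i:02d}"), "wb") as f:
            f.write(chunk)
    return ipfs_add_dir(workdir)
```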

One of the amazing things is that I believe this could be done without touching the IPFS client code at all. Obviously the coordination service, storage nodes, and encoding nodes all use IPFS in pretty mundane ways, but because of the amazing property that IPFS can be mounted as a filesystem, and stored files are then just files in a directory on that filesystem, I think one could implement the decoder as a program that only touches the filesystem and perhaps doesn't even "know" about IPFS. The way you would do this is to implement a FUSE filesystem that pretends it has each of the files the decoder node should be broadcasting. You then tell the IPFS client that its files are stored in some directory on this filesystem, and it will detect all the files the decoder is pretending to have. When a request for some object comes in through IPFS, the IPFS client will turn around and ask the FUSE filesystem for that object. The decoder will, in turn, look at the chunk directory it has stored and ask the IPFS mount for each of the chunks. The IPFS mount will ask the IPFS client, which will fetch them (from the storage servers) and pass them back to the mount, which passes them up to the decoder. The decoder waits until it has enough chunks (7 in this example), recombines them, and hands the result (which is exactly the source file) back to the IPFS client, which serves it to the original requester. So, yeah, a bit complicated and a bit slow, but I think it would work!
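As a very rough sketch of that FUSE trick, using the fusepy library: everything IPFS- and RS-specific is hidden behind a hypothetical `reassemble_from_chunks` helper, which would read the chunks through the IPFS mount and decode once 7 have arrived. This is a sketch of the shape of the thing, not a working decoder.

```python
import errno
import stat
from fuse import FUSE, FuseOSError, Operations  # pip install fusepy

# Hypothetical: source-file hash -> chunk-directory hash, i.e. the
# key-value pairs the decoding node received in step 14.
CHUNK_DIRS: dict[str, str] = {}

def reassemble_from_chunks(source_hash: str) -> bytes:
    """Hypothetical: read the chunks under CHUNK_DIRS[source_hash] via the
    IPFS mount, RS-decode as soon as 7 of the 10 are available, and return
    the reassembled source file."""
    raise NotImplementedError

class DecoderFS(Operations):
    """A filesystem that pretends to hold every source file the decoding
    node is advertising, and reassembles one on demand when it is read."""

    def readdir(self, path, fh):
        # The IPFS client scanning this directory "discovers" every file
        # the decoder claims to have.
        return ['.', '..'] + list(CHUNK_DIRS)

    def getattr(self, path, fh=None):
        if path == '/':
            return {'st_mode': stat.S_IFDIR | 0o755, 'st_nlink': 2}
        if path.lstrip('/') in CHUNK_DIRS:
            # The real size is unknown until reassembly; a production
            # version would need to keep it in the stored metadata.
            return {'st_mode': stat.S_IFREG | 0o444, 'st_nlink': 1, 'st_size': 0}
        raise FuseOSError(errno.ENOENT)

    def read(self, path, size, offset, fh):
        # A read from the IPFS client triggers the fetch-and-decode dance.
        data = reassemble_from_chunks(path.lstrip('/'))
        return data[offset:offset + size]

if __name__ == '__main__':
    FUSE(DecoderFS(), '/mnt/rs-decoder', foreground=True)
```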

Some advantages to this architecture:

  • Files that are stored with RS-encoding would appear exactly the same to clients
  • Files can be silently moved from mirroring to RS-encoded storage
  • You should be able to seamlessly move from one set of parameters to another and re-encode your files
  • Highly distributed. If we leave the coordination service out of the picture (as I have no details on how it would work anyways), the entire system can seamlessly handle node failures and network partitions and other weird things
  • You can have as many decoding nodes as you like, and they can serve whichever files you like. Given that the metadata for each file that needs to be stored on the node is probably only a few kilobytes, you could reasonably serve millions of files from each node, and it wouldn't be out of the question to serve several billion files from each node. This way you can have every decoding node serve every file
  • You can tune latency (mirroring vs RS-encoding) and reliability on a per-file basis, and even change your decision over time
  • You can scale encoding, storage, and decoding independently of each other, so you can tune for your particular use case
  • If that proposal for private IPFS networks gets implemented, you might be able to put the RS-encoded stuff on its own network, so it's totally invisible to the rest of the world. This could be helpful if your source files are on a public network that your end users can access, but you don't want them to be able to see how you store stuff on the backend

Some disadvantages:

  • If the storage node storing a given chunk goes down, it's not clear how you would re-generate the chunk in order to regain resiliency. You might need to re-code the file entirely (see the sketch after this list)
  • It would probably end up being very slow
  • How does that coordination service work? We're asking it to keep a lot of data. Would it end up having to be centralized?
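On the first point, the full re-code mentioned above could look something like this sketch (same hypothetical `rs_encode` as in the earlier sketch, plus a hypothetical `rs_decode`): fetch any 7 surviving chunks, reconstruct the source, re-encode, and keep just the chunk that was lost. No copy of the original file is needed, but it does cost a full decode and re-encode.

```python
def rs_encode(data: bytes, k: int, n: int) -> list[bytes]:
    """Hypothetical, as in the earlier sketch."""
    raise NotImplementedError

def rs_decode(chunks: dict[int, bytes], k: int, n: int) -> bytes:
    """Hypothetical: reconstruct the source file from any k of the n
    chunks, keyed by chunk index."""
    raise NotImplementedError

def repair_chunk(surviving: dict[int, bytes], lost_index: int,
                 k: int = 7, n: int = 10) -> bytes:
    """Regenerate one lost chunk by re-coding the whole file."""
    assert len(surviving) >= k, "fewer than k chunks left; data is gone"
    source = rs_decode(surviving, k, n)          # full decode from any 7
    return rs_encode(source, k, n)[lost_index]   # re-encode, keep the lost one
```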

(P.S. I'm interested in any comments/questions/suggestions/whatever that anyone might have!)
(P.P.S. I wasn't sure if this was the right place to post this? I'd be happy to move it if that'd be helpful.)
