ipfs and pacman #84
Relevant thread on Arch Linux forums: https://bbs.archlinux.org/viewtopic.php?id=203853 |
@robcat thanks! let us know how we can help. we'd love to contribute to making this easy + nice. cc @whyrusleeping |
(adding @anatol, an Arch Linux developer who has shown interest on the forums) So, in general ipfs can help package distribution in many ways (distributed storage, versioning, package signing). The low-hanging fruit is the distributed storage: given a "package entry" in the pacman database, get the package directly from ipfs by hash. The big problem is the following: the pacman database includes only SHA256 and MD5 hashes, and no ipfs-style multihashes (which unfortunately cannot be constructed from the hashes already included). My plan is to build a custom XferCommand script that, by querying some kind of service, translates the SHA256 hash into a "standard" ipfs hash, and then does This hash translation service can be centralized in this initial stage, but it adds a constant lag to each package download (the next step should be to build and distribute a package database with ipfs-multihashes). I'm still in search of a more elegant solution that doesn't require a central translation service; ideas are welcome |
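To illustrate why the SHA256 sum in the pacman database is not enough on its own: a raw digest can be wrapped into a multihash mechanically, but the result only names the raw file bytes. By default `ipfs add` hashes the nodes of the chunked DAG it builds, so the value printed below is generally not the hash ipfs reports for the same file. This is a minimal sketch; the function names and the sample input are made up for illustration.

```python
import hashlib

# Base58 alphabet used by Bitcoin and by IPFS "Qm..." hashes.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes as base58, keeping leading zero bytes as '1' characters."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def sha256_hex_to_multihash(sha256_hex: str) -> str:
    """Prefix a SHA-256 digest with the multihash header and base58-encode it.

    0x12 is the multihash code for sha2-256, 0x20 is the digest length (32).
    """
    digest = bytes.fromhex(sha256_hex)
    if len(digest) != 32:
        raise ValueError("expected a 32-byte SHA-256 digest")
    return base58_encode(b"\x12\x20" + digest)


if __name__ == "__main__":
    # Hypothetical stand-in for the bytes of a package file.
    package_bytes = b"example package contents"
    print(sha256_hex_to_multihash(hashlib.sha256(package_bytes).hexdigest()))
```

The output has the familiar 46-character `Qm...` shape, which is exactly why it is easy to mistake for the hash of the added file.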
A possible solution could be creating a dag with all packages: For example having the repo mirror servers add everything to ipfs and provide the hash of the folder, then deduplication takes care of efficiency and the pacman client only needs that one hash of the latest version of all the packages and the .db repository files 👍 For this to work smoothly we would probably need to patch pacman. Or maybe a tool that wraps pacman but uses IPFS to get the data could be built. |
@fazo96 About deduplication: I downloaded the pool of archlinux packages and tried both of the chunking algorithms provided by ipfs (fixed blocks and rabin). There was no detectable deduplication with either. Can you suggest an effective chunking strategy for xz compressed packages? About the dag of packages: cool idea, but it requires managing a fat central mirror that regularly rsyncs and compiles a new dag at every update. Unfortunately I don't have such a server available (but maybe an existing Arch mirror could be interested in running the ipfs daemon?). |
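A rough way to reproduce this kind of measurement outside ipfs is to split every file into fixed-size blocks and count how many bytes remain once identical blocks are stored only once. The sketch below does that for in-memory byte strings; the 256 KiB default mirrors the default fixed block size of ipfs, while the function names are invented here and rabin chunking is not covered.

```python
import hashlib


def fixed_chunks(data: bytes, chunk_size: int):
    """Yield consecutive fixed-size chunks of data (the last may be shorter)."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def dedup_ratio(blobs, chunk_size=256 * 1024):
    """Return stored_bytes / total_bytes when identical chunks are kept once.

    1.0 means nothing was deduplicated; 0.5 means half the bytes were saved.
    """
    seen = set()
    total = 0
    stored = 0
    for blob in blobs:
        for chunk in fixed_chunks(blob, chunk_size):
            total += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return stored / total if total else 1.0
```

Feeding it the contents of the package files gives a single number to compare between chunk sizes; a ratio near 1.0 would match the "no detectable deduplication" observation.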
@robcat having an existing mirror run an ipfs daemon would be ideal: by mounting its IPNS publication using fuse, it could store the data directly in IPFS while also being able to read it from the file system, and when it updates the files it would automagically update what's on IPFS. About deduplication, there's probably no trivial way to apply it inside packages, but at least two copies of the same package will produce the same hash and thus will become the same file in IPFS, so that the pacman client can get a package from any IPFS node that has a copy without any configuration telling it the nodes' ip addresses. This way choosing the best mirror, ranking them etc will not be needed: ipfs will be in charge of downloading the package from the best location. This also means that updating without an Internet connection could be implemented, as long as there's a reachable computer serving the packages and .db files. |
Are they tarballs? Did you try @whyrusleeping what is the tar branch that detects tarballs on add? |
Cool feature! I didn't know about that. (at the moment it doesn't seem to work for some tars, I'm opening a separate issue) In the Arch Linux case it's unfortunately not very useful, since the packages are signed only after compression. Distributing the packages in uncompressed form would mean:
|
Well, packages aren't that heavy. I mean, the biggest package I ever downloaded is probably netbeans which is less than 300 MB, and it's wayyyyyy bigger than the average package which is like a few megabytes (quantity pulled out of thin air). I don't think there are many packages that share common data except multiple versions of the same package. It would be very nice to figure out how to take advantage of deduplication with the current way pacman packages stuff, but even without that, an IPFS transport for pacman would be a huge step forward in terms of efficiency and repository management. Just having the official repository publish the hash of the latest version of the packages folder and all the mirrors just automatically pinning that when it updates would propagate changes pretty fast and no one would have to worry about choosing the fastest mirror except bitswap developers 👍 To sum it up my humble suggestion is not to try too hard to fix this problem (applying deduplication to arch packages) now. IPFS is still very useful in this use case, even without that, and in the future pacman could consider packaging stuff differently if IPFS becomes the standard for distribution. |
@fazo96 But the problem is, to follow your plan would mean to convince the central official mirror to:
There is a problem of incentives here: why should the central authority do all this if nobody is using ipfs yet? |
@robcat shouldn't any mirror work, as long as IPFS users trust it? Anyone could set up a mirror to serve as the bridge to IPFS. Then once people start using it, down the line the official repo could take over that function. Setting up a new mirror to copy packages into IPFS is closely related to the archival efforts discussed at ipfs-inactive/archives#5 |
It's not hard at all to do, there are also docker images.
You can mount IPNS with FUSE so that you don't need twice the space
Well, if the packages are all in one folder, and that folder is inside the IPNS mountpoint (using FUSE), when a file changes everything gets republished. It's not the best solution but it's worth a shot (not by the official mirror maintainers of course, but by some interested third party)
Yes, that's the point! 👍 I don't see why a random guy can't set up an IPFS mirror (except the actual hassle of downloading every package). Also no need to trust IPFS or the mirror since the packages are signed, if my understanding is correct. If somebody has the resources needed to set up an IPFS mirror of the arch repositories and some people manage to start using it, it will be a lot easier to get more adoption. I could do it, but I only have 1 Mbit upload bandwidth, for four people and at least 6 devices... |
Ok, so what about this three pronged approach:
My node is already up at 178.62.202.191 and it independently adds all the x86_64 packages from core, extra, multilib, testing and multilib-testing (it syncs to an official arch mirror and adds the packages to ipfs). @fazo96 you can go ahead and publish the ipns entry, if IPFS works the bulk of the requests will not even hit your node :) |
@robcat I'm very interested in trying this out! :) However:
Also, another use case for this is local package sharing: I have two computers in LAN and a shitty bandwidth to the internet, and all available solutions to share packages in LAN are pretty ugly compared to this. If you want we can meet on freenode (I'm |
If you have troubles with IPNS bug me directly. we'll improve the perf to fix it for you. |
@jbenet thanks a lot, but I think with the 0.3.10 caching we'll be fine 👍 It would be cool if records would be kept for a long time if nothing more recent is available, not sure if that's already in the codebase. |
@fazo96 right now records expire. i do hear you though, most IPNS users today would prefer non-expiring records. my crypto paranoia could be deferred. |
@fazo96 A bit off-topic. I was looking for a simple Arch package sharing tool for LAN and did not find anything that matched my expectations. So I wrote my own tool - https://github.com/anatol/pacoloco an Arch package proxy repo. It is not published anywhere and is just a bunch of code without much documentation. But it has worked well for me for several months. Advantages of this tool - pure C, no external dependencies except glibc, runs perfectly on an OpenWRT MIPS router, single-threaded, event-loop based architecture, HTTP pipeline support to reduce request latency. |
@anatol Cool, if you could set up some documentation on how to use it and maybe even a PKGBUILD to the AUR, I'd use it. |
Ok, time for some results! me and @robcat did it. He set up an arch linux mirror on a VPS then we got it to store files inside the FUSE mountpoint (
Using IPFS to download arch packages
The node that is serving them is
It has a few hiccups, for example the timeout is too low and if it's your first time downloading a package it's going to fall back to boring centralized mirrors. Also the mirror we have is incomplete due to the resources needed to hash tens of gigabytes of files. But it works! Keep in mind that the IPFS mirror mentioned was just used for testing, we are not guaranteeing anything. I achieved the best result by using this XferCommand:
This gives IPFS the time to answer (long timeout) and tells wget not to clutter the output and display a nice progress bar.
Setting up a mirror
Just follow the steps to set up a regular arch mirror, but mount IPNS to
EDIT: another issue is that signatures for the
09:39:50.913 ERROR core/serve: Path Resolve error: no link named "core.db.sig" under QmbEx32ruv1FLp6dHdsgouKSsg1hdaidnZnwSLA52mvAiu gateway_handler.go:458 |
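For readers who want to experiment with the same idea, here is a minimal sketch of a download helper that pacman could call through XferCommand. It is not the wget command used in the thread. It assumes the Server line already points at an IPFS gateway, uses a generous timeout so IPFS has time to answer, and returns a non-zero exit code on failure so that pacman can try its next configured mirror. The 300 second timeout and the script name are assumptions.

```python
import shutil
import sys
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 300  # assumed value; the thread only says "long timeout"


def fetch(url: str, output_path: str, timeout: float = TIMEOUT_SECONDS) -> int:
    """Download url into output_path; return 0 on success, 1 on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response, \
                open(output_path, "wb") as output:
            shutil.copyfileobj(response, output)
    except (urllib.error.URLError, OSError, ValueError) as error:
        print(f"download failed: {error}", file=sys.stderr)
        return 1
    return 0


# Only act when both arguments (pacman's %u and %o) were supplied.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(fetch(sys.argv[1], sys.argv[2]))
```

In pacman.conf this could be wired up as `XferCommand = /usr/local/bin/ipfs_xfer.py %u %o`, where `%u` is the URL and `%o` the output file; the install path is hypothetical.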
This is awesome! congrats @robcat and @fazo96 ! \o/ Please take a look at @diasdavid's recent
@diasdavid uses the new Also |
@jbenet Thanks!! We know about |
I have to say, it is absolutely awesome that it's this simple. ❤️ to all the hard work making the abstractions nice. (cc @whyrusleeping) @robcat @fazo96 -- @whyrusleeping and i have discussed making the |
That's fantastic. it really is ❤️ @fazo96 we may want to make a simple
|
We're still figuring out issues with rsync crashing go-ipfs when used with a target inside EDIT: looks like we were using a build from the The biggest issue now is that for some reason rsync can't detect that two files are the same if one of the two is in |
Ok, I've been able to sync the "core" repository to /ipns/local using this script, and it works wonderfully on my Arch machine with ipfs 0.3.10: But me and @fazo96 were trying to make it work on an Ubuntu 15.10 VM (same versions of ipfs and fuse), and there rsync doesn't detect the unchanged files (it overwrites every file, triggering a re-hash and re-publish by ipfs). This is of course too slow. @jbenet The mountable files api you described can be useful in a lot of situations. |
@robcat i think I know the issue. the Atime and Ctime of the files in /ipns aren't set. Let me work on a patch for ya |
hrmm... after a bit more thought it's not going to be easy to fix. I'm working modtimes into the mfs code on 0.4.0 (the code behind the fuse interface). You could try the |
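Some background on why timestamps matter here: by default rsync skips a file only when size and modification time match on both sides, so a mount that does not report usable modification times makes every file look changed. The sketch below mimics that quick check in a simplified form. It is not rsync's actual code, and the function name is made up.

```python
import os


def quick_check_unchanged(source: str, destination: str) -> bool:
    """Return True when both files have the same size and the same mtime.

    Timestamps are compared at whole-second resolution; a missing file on
    either side counts as changed.
    """
    try:
        src = os.stat(source)
        dst = os.stat(destination)
    except FileNotFoundError:
        return False
    return (src.st_size == dst.st_size
            and int(src.st_mtime) == int(dst.st_mtime))
```

If the destination always reports the same fixed timestamp regardless of what was written, this check fails for every file, which would be consistent with rsync rewriting the whole repository on each run.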
Uhm, I double checked and in fact there is something seriously wrong with my Ubuntu machine.
But the file objects are stored correctly by ipfs:
To ask for a third opinion, I will try soon with osx. |
@robcat that's very nice! I'll give it a try too on my machine if I have the time and network bandwidth. We could try setting up a dockerfile based on the go-ipfs one that also mounts fuse and keeps an updated arch mirror using your script, so that we can run it on any platform. I'm not sure it's possible to mount IPFS with fuse inside a docker container... Then, only the "root" arch mirror using IPFS needs to actually sync with the other mirrors using rsync. The other IPFS based mirrors will just need to pin the IPNS name of the "root" mirror once pub/sub is implemented. As a temporary workaround, a very small script could poll for IPNS updates and pin the new hash. |
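The polling workaround mentioned above could look roughly like this: resolve the IPNS name with `ipfs name resolve`, and when the result changes, pin the new path with `ipfs pin add` and drop the previous pin with `ipfs pin rm`. This is a sketch under assumptions: the ten minute interval, the decision to unpin the old snapshot, and all function names are choices made here, not something agreed in the thread.

```python
import subprocess
import time


def run_ipfs(*args: str) -> str:
    """Run the ipfs CLI with the given arguments and return trimmed stdout."""
    result = subprocess.run(["ipfs", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def sync_once(ipns_name: str, last_path, run=run_ipfs):
    """Resolve ipns_name and pin the result if it differs from last_path.

    Returns the currently resolved path (for example "/ipfs/Qm...").
    """
    current = run("name", "resolve", ipns_name)
    if current != last_path:
        run("pin", "add", current)
        if last_path is not None:
            # Assumed policy: keep only the newest snapshot pinned.
            run("pin", "rm", last_path)
    return current


def follow(ipns_name: str, interval_seconds: int = 600) -> None:
    """Poll forever, re-pinning whenever the IPNS name points somewhere new."""
    last = None
    while True:
        try:
            last = sync_once(ipns_name, last)
        except subprocess.CalledProcessError as error:
            print(f"ipfs failed: {error.stderr}")
        time.sleep(interval_seconds)
```

The `run` parameter exists so the logic can be exercised without a daemon; a mirror would simply call `follow()` with the IPNS name of the "root" mirror.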
@RubenKelevra, do you have any plan on how to sync, update and use it? Pinning a snapshot of the official mirror does not seem that useful unless all those details are in place. |
This comment has been minimized.
@RubenKelevra shouldn't the unpin be recursive? |
It should be by default,
You're welcome! :) |
This comment has been minimized.
Alright guys, I hit some bugs here and there, but I think it's finally ready for prime time. You can join the cluster with a simple
You can use the IPFS as mirror with a simple additional line in your mirror config for pacman:
Both require a running IPFS daemon on the machine. The first one obviously requires more space than the standard 10 GB limit, so you need to raise it. If you just want to fetch updates via IPFS, the default 10 GB limit is most likely fine. |
Note to anybody who had already joined the cluster a while back:
The cluster setup ran into a non-recoverable state because of a bug and would not replicate anything from the master. Therefore there's a new secret and a new name for the cluster; restart following the cluster with the command quoted above. I tried to remove all old pins from the old cluster before shutdown, but this might not have worked either. So check whether you still have local pins in your IPFS-Client. If you have nothing running in the IPFS-Client except the cluster follower, you can safely remove all pinned entries and start with the command above again. The following command will unpin everything from your IPFS-Client:
(may take some minutes) |
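As an illustration of what "unpin everything" involves (this is not the command from the comment above), the sketch below lists all recursive pins with `ipfs pin ls --type=recursive --quiet` and removes them one by one with `ipfs pin rm`. The function names are invented, and it should only be run on a node that holds nothing but the cluster follower's pins.

```python
import subprocess


def run_ipfs(*args: str) -> str:
    """Run the ipfs CLI with the given arguments and return trimmed stdout."""
    result = subprocess.run(["ipfs", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def unpin_all(run=run_ipfs) -> int:
    """Remove every recursive pin from the local node; return the count."""
    listing = run("pin", "ls", "--type=recursive", "--quiet")
    cids = [line.strip() for line in listing.splitlines() if line.strip()]
    for cid in cids:
        run("pin", "rm", cid)
    return len(cids)
```

Unpinning alone does not free disk space; the blocks stay in the local repository until garbage collection runs.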
If you run multiple machines in a network:
MDNS-Method:
Make sure both IPFS-Daemons have mDNS activated. They will connect to each other and exchange the data when needed.
Single IPFS-Client as Gateway-Method:
You can change the listening address for the web gateway in the settings of the IPFS-Client which should serve as the local package cache. It won't 'cache' in the traditional sense of serving older, outdated content; it will always fetch the latest content directly from IPFS. But if multiple machines download the same updates, the packages will already be stored locally on the IPFS-Client which runs the Gateway and just have to be transferred over the local network. Make sure to increase the cache size for this application, since there might be more than 8 GB of different packages installed in your network, so that no redownload will be necessary. If you have a large number of clients and want to prefetch the entire repo to the IPFS-Cache of your local web gateway to speed up updates even more, you can do this by running this script: Note that this script has to dig through a lot of data, so there's a substantial amount of IO going on for each run which finds a new update. Make sure to run this on an SSD, not an HDD. https://github.com/RubenKelevra/pacman.store/blob/master/toolset/follow_cluster_parts.sh |
@RubenKelevra , I got these error messages.
|
The errors are gone after changing the default port in the config file. I use the syncthing daemon (8080), so I changed the ipfs gateway port to 8082. |
I'll take a look at my setup, to confirm everything is working on my end, will report back in some minutes, thanks for the report! If you use a different port, you have to set an environment variable, before you launch ipfs-cluster-follower:
More info on that can be found here: https://cluster.ipfs.io/documentation/reference/follow/ Just a sidenote: The cluster holds a large amount of information, so the initial "catch up" to the current state will take some time and cannot be interrupted. If you need to interrupt early, it's best to start from the beginning again. Once this is finished, you can interrupt at any time. This will be improved in the near future when ipfs-cluster has implemented commits, which will pack multiple changes into one 'change' of the cluster state. |
@dontdieych So I've tested my setup. When you change the port to the local IPFS-Instance it should work flawlessly. If you still run into troubles, just send me an email. :) |
@RubenKelevra Thanks. It looks like it's working. Another question: is there a way to automatically download packages that are already installed? Obviously only when a new version appears. |
@dontdieych yes, you can simply run a This will issue a download request for everything new, and download them to your local cache. |
@RubenKelevra Thanks! again. 👍 |
I've started a discussion about whether it would make sense to extend pacman natively with the ability to request packages from a running IPFS daemon via the API. To be able to do this, the database format needs to be extended with the Content-ID of each package (currently it holds an MD5 and a SHA256 sum plus a signature). This way there would be no noticeable download delay since there's no need to fetch and open large folders (which is currently the main reason why my approach is slow). The discussion is on the pacman-development mailing list if somebody wants to follow: https://lists.archlinux.org/pipermail/pacman-dev/2020-April/024179.html |
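To make the proposal concrete, here is a sketch of how a client could read a Content-ID from a package entry. pacman's sync database stores each entry as `%FIELD%` headers followed by value lines; the `%IPFSCID%` field name used below is purely hypothetical and is not part of pacman or of the mailing list proposal.

```python
def parse_desc(text: str) -> dict:
    """Parse a pacman-style desc entry into a {FIELD: [values]} mapping."""
    fields = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            current = None  # a blank line ends the current field
        elif len(line) > 2 and line.startswith("%") and line.endswith("%"):
            current = line[1:-1]
            fields[current] = []
        elif current is not None:
            fields[current].append(line)
    return fields


def ipfs_path_for(entry: dict):
    """Return "/ipfs/<cid>" if the hypothetical IPFSCID field exists, else None."""
    values = entry.get("IPFSCID")
    return f"/ipfs/{values[0]}" if values else None
```

With the CID stored next to the existing checksums, a client could request the package by path directly instead of walking large directory objects first, which is the delay described above.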
This discussion has ceased, but if you'd like to add notes or track the status long term, you can find a ticket here: RubenKelevra/pacman.store#41 |
Hey @jbenet I think we can close this one, right? :) |
Off-Topic: What do you guys think about the idea of asking the Flatpak project to integrate IPFS? |
I still don't have another method of reaching you all, so I add this here: Make sure to update your cluster nodes to 0.14.0 :) |
In case someone likes EndeavourOS: The small additional repository of this project is now available on the pacman.store collab cluster as well.
Originally posted by @RubenKelevra in endeavouros-team/mirrors#1 (comment) |
notes integrating with pacman here