Replies: 1 comment 1 reply
tl;dr: 3, then 5. As it currently stands, the Bacalhau CLI contains the code necessary to interact directly with any of the providers implementing the available publisher specs (as long as they are IPFS or HTTP based). This works as long as the CLI has access to the storage (i.e. it is reachable from the machine performing the download). The CLI already has to create a local IPFS node in order to communicate with other nodes and retrieve the data. That aside, this works fine in unconstrained environments, but is unlikely to work well should we introduce storage plugins.

Option 5, even if the first version simply replicates what the CLI does now, has many benefits. First, it removes the need entirely for both the temporary repository and the optional manually managed IPFS node on the client machine. It also gives us a policy decision point around access to the data, and access to all the storage providers, even ones the CLI knows nothing about, because the client will just need to read from an HTTP endpoint. As we look forward to implementing pluggable storage providers (perhaps via the Container Storage Interface), there will be no need at all for the client to understand, access or interoperate with the underlying storage.

For the shorter term, option 3 seems like the best compromise: it will solve the problem and won't require any deep changes that might influence future design decisions. Any option that requires installing IPFS on a user's desktop/laptop is suboptimal, as I think it's unfair to expect users to have to run infra.
Hi all, I've been working with @JvD007 to get `bacalhau get` working with downloads from a private IPFS cluster. The initial problem report was that `bacalhau get` fails with a timeout when trying to do this. Investigation reveals that the connection with the IPFS node is never established, with only some opaque errors, and it turns out this is because the Bacalhau client needs the same swarm key as the private IPFS cluster.

The workaround is:
```bash
# 1. Initialise a fresh IPFS repo in a temporary location (no default files)
export IPFS_PATH=/tmp/something; ipfs init -e
# 2. Copy the private cluster's swarm key into that repo
cp swarm.key $IPFS_PATH
# 3. Point Bacalhau at the repo and at the private swarm
export BACALHAU_SERVE_IPFS_PATH=/tmp/something
export BACALHAU_IPFS_SWARM_ADDRESSES=...
export BACALHAU_API_HOST=...
```
Now `bacalhau get` should work without issues. This isn't a problem for `bacalhau serve` because it uses the unauthenticated HTTP API to save data (I presume?).

Slightly different steps where we allow Bacalhau to create the IPFS repo (e.g. skip step 1 and just `mkdir` the directory) also work, but fail after the first `get` because the IPFS config written into the repo by Bacalhau doesn't quite work. Jaco explains the issue with the Bacalhau-generated config.
So the situation is that getting data from a private Bacalhau cluster currently requires non-trivial configuration on the client side. I would like to discuss if/what we want to do to make this smoother. AFAICS the options are:
1. Do nothing, document the workaround
We can include these details in our documentation for running private clusters. Clearly better than nothing, but this either involves having the `ipfs` binaries on the client machine or getting clients to manually mess with the config after the first `bacalhau get`; neither is super great.
2. Make `bacalhau` write better config

We should be able to just get Bacalhau to use the value of `BACALHAU_IPFS_SWARM_ADDRESSES` in the config it outputs. This would remove the need to mess around with the config, but would still require clients to manually establish an IPFS repo and copy the swarm key into it.
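For illustration, here is a rough sketch (in Go, since that's what Bacalhau is written in) of what option 2 boils down to: rewrite the Bootstrap list of the repo config that Bacalhau generates, using the addresses it already knows about. This is not existing Bacalhau code; it assumes `BACALHAU_IPFS_SWARM_ADDRESSES` is a comma-separated list of multiaddrs and that patching the config file on disk is an acceptable place to do it.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"path/filepath"
	"strings"
)

// patchBootstrap rewrites the Bootstrap list of an IPFS repo's config file
// with the swarm addresses Bacalhau already has from the environment.
func patchBootstrap(repoPath string) error {
	cfgPath := filepath.Join(repoPath, "config")
	raw, err := os.ReadFile(cfgPath)
	if err != nil {
		return err
	}
	var cfg map[string]any
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return err
	}
	// Assumption: the variable holds a comma-separated list of multiaddrs.
	addrs := strings.Split(os.Getenv("BACALHAU_IPFS_SWARM_ADDRESSES"), ",")
	cfg["Bootstrap"] = addrs
	out, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(cfgPath, out, 0o600)
}

func main() {
	if err := patchBootstrap(os.Getenv("IPFS_PATH")); err != nil {
		log.Fatal(err)
	}
}
```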
3. Make `bacalhau` accept a swarm key config parameter

We could give Bacalhau the power to recognise an environment variable for the swarm key. This would mean that users would only need to have the swarm key on their local machine, and all of the IPFS repo management can remain with Bacalhau. E.g. setting `BACALHAU_IPFS_SWARM_KEYFILE=...` would allow Bacalhau to use that key as part of its temporary repo during a `bacalhau get`.
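A minimal sketch of how the CLI side of option 3 could work, assuming the proposed `BACALHAU_IPFS_SWARM_KEYFILE` variable (which does not exist today); the function name and the point at which it would be called are invented for illustration:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// installSwarmKey copies the user-supplied swarm key into the temporary IPFS
// repo that Bacalhau manages, so the in-process node can join the private swarm.
func installSwarmKey(tempRepoPath string) error {
	keyFile := os.Getenv("BACALHAU_IPFS_SWARM_KEYFILE") // proposed, not an existing setting
	if keyFile == "" {
		return nil // no key configured: behave as today (public network)
	}
	key, err := os.ReadFile(keyFile)
	if err != nil {
		return fmt.Errorf("reading swarm key: %w", err)
	}
	// IPFS expects the key in a file named swarm.key at the root of the repo.
	return os.WriteFile(filepath.Join(tempRepoPath, "swarm.key"), key, 0o600)
}

func main() {
	// In the real CLI this would receive the temporary repo created for
	// `bacalhau get`; os.TempDir() is only a stand-in so the sketch runs.
	if err := installSwarmKey(os.TempDir()); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

With something like this in place, the manual `ipfs init` / `cp swarm.key` steps from the workaround disappear entirely.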
4. Make `bacalhau get` automatically detect the private swarm details

We could make the published result returned by the requester node automatically include the swarm addresses and swarm key. When using `bacalhau get`, the in-process IPFS node could use them to establish a private connection to the swarm. This means that we don't need to do any special configuration to establish a private connection; essentially, the requester node provides the secret details automatically to authenticated clients.
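Purely to make option 4 concrete, the published result might carry something like the following; the type and field names are invented and are not part of any current Bacalhau API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PublishedResult is an invented type showing the extra fields the requester
// could return under option 4. SwarmKey would only be included for clients
// that have already authenticated with the requester.
type PublishedResult struct {
	CID            string   `json:"cid"`             // content identifier of the result
	SwarmAddresses []string `json:"swarm_addresses"` // multiaddrs of the private IPFS swarm
	SwarmKey       string   `json:"swarm_key"`       // PSK for the private swarm
}

func main() {
	// Example payload the client-side in-process IPFS node could consume.
	res := PublishedResult{
		CID:            "Qm...",
		SwarmAddresses: []string{"/ip4/10.0.0.5/tcp/4001/p2p/Qm..."},
		SwarmKey:       "/key/swarm/psk/1.0.0/\n/base16/\n<hex key>",
	}
	out, _ := json.MarshalIndent(res, "", "  ")
	fmt.Println(string(out))
}
```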
5. Introduce our own data download mechanism

We could introduce our own protocol between requester and client for downloading results. Essentially the requester would become a proxy for downloading the data. We have talked about this previously as a way of having a private data storage location whilst still allowing users to download data for just their jobs. Such a thing would also mean we could deliver data to lower-trust users without having them make a direct connection to the data store. But this would be a non-trivial engineering problem and would be reinventing a wheel that has already been invented a number of times before.
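To make the proxy idea a little more tangible, here is a very rough sketch of the kind of endpoint the requester could expose. The route, port and `fetchResult` helper are all invented; a real implementation would need authentication, archive/streaming semantics and proper error handling:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"strings"
)

// fetchResult stands in for whatever publisher/storage provider actually
// holds the job's results; here it just returns placeholder bytes.
func fetchResult(jobID string) (io.ReadCloser, error) {
	return io.NopCloser(strings.NewReader("result bytes for " + jobID)), nil
}

// resultsHandler streams a job's results back to the caller. Because the
// requester sits in the middle, it is also the natural policy decision point:
// it can check that the caller is allowed to read this job's results before
// touching the storage at all.
func resultsHandler(w http.ResponseWriter, r *http.Request) {
	jobID := r.URL.Query().Get("job_id")
	rc, err := fetchResult(jobID)
	if err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	defer rc.Close()
	w.Header().Set("Content-Type", "application/octet-stream")
	io.Copy(w, rc) // the client only ever speaks HTTP, never IPFS
}

func main() {
	http.HandleFunc("/api/v1/results", resultsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```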
I'd welcome feedback on these options and what we think is reasonable now and next.