Add suggestions for distributing large datasets to book
type: documentation
pr: #360
casey committed Apr 11, 2020
1 parent ff6f6d4 commit 1c9ff0c
Showing 3 changed files with 154 additions and 1 deletion.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -4,7 +4,8 @@ Changelog

UNRELEASED - 2020-04-11
-----------------------
- :white_check_mark: [`xxxxxxxxxxxx`](https://github.com/casey/intermodal/commits/master) Test that `--glob`s match entire file paths ([#357](https://github.com/casey/intermodal/pull/357)) - _Casey Rodarmor <casey@rodarmor.com>_
- :books: [`xxxxxxxxxxxx`](https://github.com/casey/intermodal/commits/master) Add suggestions for distributing large datasets to book ([#360](https://github.com/casey/intermodal/pull/360)) - _Casey Rodarmor <casey@rodarmor.com>_
- :white_check_mark: [`ff6f6d4c3de1`](https://github.com/casey/intermodal/commit/ff6f6d4c3de1a14c6b2ebef270c0ec542300f0de) Test that `--glob`s match entire file paths ([#357](https://github.com/casey/intermodal/pull/357)) - _Casey Rodarmor <casey@rodarmor.com>_
- :books: [`b914c175949f`](https://github.com/casey/intermodal/commit/b914c175949fa6063b6fb0428f4ebd66a51fdda3) Add buildtorretn to prior art section of book ([#355](https://github.com/casey/intermodal/pull/355)) - _Casey Rodarmor <casey@rodarmor.com>_


1 change: 1 addition & 0 deletions book/src/SUMMARY.md
@@ -15,6 +15,7 @@ Summary
- [`imdl torrent verify`](./commands/imdl-torrent-verify.md)

- [Bittorrent](./bittorrent.md)
- [Distributing Large Datasets](./bittorrent/distributing-large-datasets.md)
- [BEP Support](./bittorrent/bep-support.md)
- [Alternatives & Prior Art](./bittorrent/prior-art.md)
- [UDP Tracker Protocol](./bittorrent/udp-tracker-protocol.md)
151 changes: 151 additions & 0 deletions book/src/bittorrent/distributing-large-datasets.md
@@ -0,0 +1,151 @@
Distributing Large Data Sets
============================

Even though BitTorrent is well-suited for distributing large amounts of data,
very large torrents can still cause problems. Here are some of the problems you
might encounter, as well as suggestions for how to avoid or ameliorate those
issues.

Intermodal currently uses a single-threaded piece hashing algorithm. If you're
distributing a large data set and hashing time is a problem, please open an
issue! I'm eager to improve hashing performance, but want to make sure I do it
in such a way that real workloads benefit.


Background
----------

In order to support incremental download and verification, as well as
resumption of partial downloads, the contents of a torrent are broken into
pieces.

The length of pieces is configurable, and the ideal choice of piece
length depends on many factors, but values between 16 KiB and 256 KiB are
common. Very large torrents may use much larger piece lengths, like 16 MiB.

Each piece is hashed, and `.torrent` files, also referred to as metainfo,
contain a list of those hashes.
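
To make the piece model concrete, here is a minimal Python sketch of splitting content into pieces and hashing each one. It assumes the SHA-1 piece hashes used by version 1 metainfo files, and `piece_hashes` is a hypothetical helper written for illustration, not part of `imdl`. A real implementation would stream files from disk rather than hold everything in memory.

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int) -> list[bytes]:
    """Split `data` into fixed-size pieces and return each piece's SHA-1 digest."""
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    return [
        hashlib.sha1(data[offset:offset + piece_length]).digest()
        for offset in range(0, len(data), piece_length)
    ]


# 40 KiB of data with a 16 KiB piece length gives three pieces;
# the final piece is shorter than the others.
hashes = piece_hashes(bytes(40 * 1024), 16 * 1024)
print(len(hashes), "pieces,", sum(len(h) for h in hashes), "bytes of hashes")
```

Each digest is 20 bytes, so the hash list grows linearly with the number of pieces.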

For all the example commands, I'll be using `dir` for the directory containing
the data set you want to share.


Issues
------

### `.torrent` file too large

When the amount of data is large, or the piece length is small, the number of
pieces can make the `.torrent` file very big.

To avoid this, you can either break the data into multiple torrents, or make
the piece length larger, so the `.torrent` file contains fewer pieces.
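
As a rough illustration of the trade-off, the Python sketch below estimates how many bytes of piece hashes a torrent needs. It assumes 20-byte SHA-1 hashes, as in version 1 metainfo files, and ignores the comparatively small overhead of file names and other metadata. The function names are hypothetical.

```python
HASH_LENGTH = 20  # assumed: bytes per SHA-1 piece hash in a v1 metainfo file

KIB, MIB, TIB = 2**10, 2**20, 2**40


def piece_count(content_length: int, piece_length: int) -> int:
    """Number of pieces, rounding up to account for a short final piece."""
    return -(-content_length // piece_length)


def hashes_size(content_length: int, piece_length: int) -> int:
    """Bytes of piece hashes stored in the metainfo."""
    return piece_count(content_length, piece_length) * HASH_LENGTH


# Compare a common piece length with a very large one for 1 TiB of data.
for piece_length in (256 * KIB, 16 * MIB):
    count = piece_count(TIB, piece_length)
    size = hashes_size(TIB, piece_length) / MIB
    print(f"{piece_length // KIB} KiB pieces: {count} pieces, {size:.2f} MiB of hashes")
```

For 1 TiB of data, 256 KiB pieces need about 80 MiB of hashes, while 16 MiB pieces need about 1.25 MiB.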

#### Breaking data into multiple torrents

`imdl torrent create` has a `--glob` option that can be used to control which
files are included in a torrent. If your data set is divided into multiple
files, ideally with a consistent naming scheme, this can be used to easily
create multiple torrents with different subsets of the data.

The name of the created torrent is usually derived from the name of the input,
so the output torrent name should be given manually to avoid conflicts:

    $ imdl torrent create -i dir -o a.torrent --glob 'dir/0*'
    $ imdl torrent create -i dir -o b.torrent --glob 'dir/1*'
    $ imdl torrent create -i dir -o c.torrent --glob 'dir/2*'
    # etc…
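
If there are many prefixes, the commands can be generated rather than typed by hand. The Python sketch below prints one `imdl torrent create` command per leading character of the file names, using only the `-i`, `-o`, and `--glob` options shown above. `split_commands` is a hypothetical helper, and it assumes file names start with plain letters or digits, since other characters could be interpreted as glob syntax.

```python
import shlex


def split_commands(names, root="dir"):
    """Return one `imdl torrent create` command per leading character of `names`."""
    prefixes = sorted({name[0] for name in names if name})
    return [
        shlex.join([
            "imdl", "torrent", "create",
            "-i", root,
            "-o", f"{prefix}.torrent",      # explicit output name avoids conflicts
            "--glob", f"{root}/{prefix}*",  # select files starting with this prefix
        ])
        for prefix in prefixes
    ]


# Example file names; in practice these might come from os.listdir("dir").
for command in split_commands(["000.csv", "001.csv", "100.csv", "200.csv"]):
    print(command)
```

Printing the commands instead of running them leaves room to review the split before hashing starts.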

#### Making the piece length larger

`imdl` has an automatic piece length picker, which should choose a good piece
length. You can see what choices it makes for different torrent sizes with:

    $ imdl torrent piece-length

Some torrent clients don't do well with piece lengths over 16 MiB, so the piece
length picker will never pick piece lengths over 16 MiB. This can be
overridden by specifying `--piece-length` manually. `--piece-length` takes
binary unit suffixes, like `KiB`, `MiB`, and `GiB`:

    $ imdl torrent create -i dir --piece-length 128mib


### Too many files

Torrents containing a large number of separate files can cause performance
issues. It's not clear if these performance issues are due to BitTorrent client
implementations, host OS file system issues, or both.

#### Distributing your data set as an ISO image

By distributing your data set as an ISO image, all the files in your torrent
will be packed into a single `.iso` file. Additionally, recipients of the ISO
won't have to decompress the whole data set to browse or extract individual
files.

You can create an ISO with `genisoimage`, which can be installed on Debian or
Ubuntu with:

    $ sudo apt install genisoimage

To create a compressed ISO containing your data set:

    # -transparent-compression  compress data in the ISO
    # -untranslated-filenames   don't mangle filenames
    # -verbose                  verbose output
    # -output data.iso          output path
    # -V DATA_SET_NAME          volume name
    # dir                       input path
    $ genisoimage \
        -transparent-compression \
        -untranslated-filenames \
        -verbose \
        -output data.iso \
        -V DATA_SET_NAME \
        dir

The same command, but with short flags:

    $ genisoimage -zUvo data.iso -V DATA_SET_NAME dir

A torrent can then be created containing the ISO:

    $ imdl torrent create --input data.iso

Users can mount and unmount the ISO on Linux:

    $ sudo mkdir -p /mnt                   # create mount point
    $ sudo mount --read-only data.iso /mnt # mount ISO
    $ sudo umount /mnt                     # unmount when finished

Or on macOS:

    $ hdiutil mount data.iso                 # mount ISO
    $ hdiutil unmount /Volumes/DATA_SET_NAME # unmount when finished

On Windows, macOS, and some Linux desktop environments, ISOs can also be
mounted by double-clicking the file.


### Torrent client issues

Some torrent clients don't do well with torrents with large piece sizes, many
files, or a large amount of data.

#### Switch to a `libtorrent`-based client

If you're experiencing issues downloading a large data set, switching torrent
clients may help.

In my personal experience, torrent clients that use Arvid Norberg's
`libtorrent` have done well with large amounts of data.

`libtorrent`'s [Wikipedia page](https://en.wikipedia.org/wiki/Libtorrent) has a
[list](https://en.wikipedia.org/wiki/Libtorrent#Applications) of torrent
clients that use `libtorrent`.


Conclusion
----------

If you have suggestions for this guide, please don't hesitate to open an
[issue](https://github.com/casey/intermodal/issues).

In particular, if you've found specific torrent clients to be good or bad at
downloading large data sets, or have run into issues or found solutions not
covered by this guide, I would love to know!
