Skip to content

shlok/streamly-archive

Repository files navigation

streamly-archive

Hackage CI built with garnix

Stream data from archives (tar, tar.gz, zip, or any other format supported by libarchive) using the Haskell streamly library.

Requirements

Install libarchive on your system.

  • Debian Linux: sudo apt-get install libarchive-dev.
  • macOS: brew install libarchive.

Quick start

{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}

module Main where

import Crypto.Hash
import Data.ByteString (ByteString)
import Data.Function
import Data.Functor
import Data.Maybe
import Streamly.Data.Fold (Fold)
import qualified Streamly.Data.Fold as F
import qualified Streamly.Data.Stream.Prelude as S
import Streamly.External.Archive

main :: IO ()
main = do
  -- A fold for converting each archive entry (which is a Header followed by
  -- zero or more ByteStrings) into a path and corresponding SHA-256 hash
  -- (Nothing for no data).
  let entryFold :: Fold IO (Either Header ByteString) (String, Maybe String) =
        F.foldlM'
          ( \(mpath, mctx) e ->
              case e of
                Left h -> do
                  mpath' <- headerPathName h
                  return (mpath', mctx)
                Right bs ->
                  return
                    ( mpath,
                      Just . (`hashUpdate` bs) $
                        fromMaybe (hashInit @SHA256) mctx
                    )
          )
          (return (Nothing, Nothing))
          <&> ( \(mpath, mctx) ->
                  ( show $ fromMaybe (error "path expected") mpath,
                    show . hashFinalize <$> mctx
                  )
              )

  -- Execute the stream, grouping at the headers (the Lefts) using the above
  -- fold, and output the paths and SHA-256 hashes along the way.
  S.unfold readArchive (id, "/path/to/archive.tar.gz")
    & groupByLeft entryFold
    & S.mapM print
    & S.fold F.drain

Benchmarks

See ./bench/README.md. Summary (with rough figures from our machine):

  • For 1-byte files, this library has roughly a 70 ns/byte overhead compared to plain Haskell IO code, which has roughly a 895 ns/byte overhead compared to plain C.
  • For larger (> 10 KB) files, this library performs just as good as plain Haskell IO code, which has roughly a 0.15 ns/byte overhead compared to plain C.

July 2024; NixOS 22.11; Intel i7-12700K (3.6 GHz, 12 cores); Corsair VENGEANCE LPX DDR4 RAM 64GB (2 x 32GB) 3200MHz; Samsung 970 EVO Plus SSD 2TB (M.2 NVMe).

About

Stream data from archives using the Haskell streamly library.

Resources

License

Stars

Watchers

Forks

Packages

No packages published