Research/quantify performance envelopes of multiple CDC algorithms #227

Open
@ribasushi

Description

  • oI 95% Assemble corpora of data from various prior performance-research initiatives ( both within and outside of PL )
    • 💯 Enumerate/obtain test datasets
    • 90% Document rationales for the test datasets
    • 95% Publish all of the above as plain HTTP + IPFS pinned download
  • oI 85% Document prior art, motivation and precise scope and types of sought metrics
    • 💯 Solicit/assemble feedback from various stakeholders
    • 💯 Collect/determine relevance of existing academic research into chunking ( 14 distinct papers selected for evaluation )
    • 💯 Convert the pre-PL chunk-tester to proper multi-streaming, to dramatically lower the cost of experiments ( aiming at about 500 MB/s stream processing; with the correct implementation and hardware, about 3.5 GiB/s standard ingestion 🎉 )
    • 80% Generate a few preliminary datapoints to aid in understanding the goal/scope
    • 90% In-depth study/evaluation/application of findings from the above works
    • 💯 Understand and reuse existing go-ipfs implementations of CDCs ( Rabin + Buzzhash ) in a simpler go-ipfs independent utility, allowing rapid retries of different parameters
    • 💯 Same as above but pertaining to linking strategies ( trickle-dag etc ), as ignoring the link-layer of streams skews the results disproportionately
    • 98% ( subsumes a large portion of the points below; v0.1 ETA: DEMO AT TEAM-WEEK ) Fully implement a standalone CLI utility re-implementing/converging with go-ipfs on all of the above algorithms. The distinguishing feature of said tool is the exposure of each chunker/linker as an atomic, composable primitive. The UX is similar to that of ffmpeg, whereby an input stream is processed via multiple "filters", with the result being a stream of blocks with statistics on their counts/sizes plus a valid IPFS CID. Current remaining tasks:
      • 💯 Profile/optimize baseline stream ingestion, ensuring there is no penalty from applying a "null-filter", which allows one to benchmark a particular hardware setup's theoretical maximum throughput
      • 💯 Finalize the "stackable chunkers" UI/UX, allowing effortless demonstration of impact of such chunker chains on the
      • 💯 Adjust statistics compilation/output for the above ( it currently looks like this, ignoring various "filter-levels" )
      • 💯 Make final pass on memory allocation profile and fixup obvious low hanging fruit before v0.1
      • 80% README / godoc / stuffz
    • 80% Rewrite the previously utilized plotly.js-based visualiser to aid with the above point
  • oI Open the document to a short discussion soliciting feedback from workgroups
  • oII Perform a number of "brute force" tests aiming at reproducible results ( utilizing https://github.com/ipfs/testground ); for the purposes of what we are trying to quantify, iptb will be sufficient
  • oII ( half-covered by initial writeup ) Convert raw results into multi-dimensional scatter-plot visualizations ( plotly.js )
  • oIII Combine all available results into a "compromise chunking settings" RFC document
  • oIV Publish the results for discussion and decision of the level of incorporation into IPFS implementations ( default parameters, use of selected algorithm by default, etc )
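For anyone unfamiliar with how a CDC boundary decision works, here is a minimal sketch of a buzzhash-style rolling-hash chunker in Go. This is NOT the go-ipfs implementation referenced above; the function names, the 32-byte window, and the 13-bit boundary mask ( ~8 KiB average chunks ) are illustrative assumptions only:

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	windowSize = 32                // rolling-hash window; 32 makes the removal rotation a no-op on uint32
	boundary   = uint32(1<<13 - 1) // cut when the low 13 bits are zero => ~8 KiB average chunks
)

// table maps each byte value to a fixed pseudo-random word.
var table [256]uint32

func init() {
	rng := rand.New(rand.NewSource(1)) // fixed seed: boundaries must be reproducible across runs
	for i := range table {
		table[i] = rng.Uint32()
	}
}

func rotl(v uint32) uint32 { return v<<1 | v>>31 }

// chunk returns the lengths of the content-defined chunks of data. A byte
// falling out of the window is XORed back out; with a 32-byte window its
// table word has been rotated a full 32 bits, i.e. it is unchanged.
func chunk(data []byte) []int {
	var lengths []int
	var h uint32
	start := 0
	for i, b := range data {
		h = rotl(h) ^ table[b]
		if i-start >= windowSize {
			h ^= table[data[i-windowSize]]
		}
		if i-start+1 >= windowSize && h&boundary == 0 {
			lengths = append(lengths, i+1-start)
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		lengths = append(lengths, len(data)-start) // trailing partial chunk
	}
	return lengths
}

func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(42)).Read(data)
	lens := chunk(data)
	total := 0
	for _, l := range lens {
		total += l
	}
	fmt.Printf("chunks=%d bytes=%d avg=%d\n", len(lens), total, total/len(lens))
}
```

The key property being benchmarked across algorithms is exactly this boundary decision: because a cut depends only on the last windowSize bytes, an insertion early in a stream shifts only nearby boundaries, which is what makes CDC attractive for dedup in the first place.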
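The "null-filter" baseline mentioned above can be illustrated with a sketch: a filter that cuts the stream into blocks without inspecting any bytes, so timing it measures the pipeline's own overhead rather than any chunking algorithm. The `filter` signature and all names here are assumptions for illustration, not the actual API of the tool:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"time"
)

// filter is a stand-in for the atomic, composable primitive described above:
// it consumes a stream and emits blocks. (Hypothetical signature.)
type filter func(r io.Reader, emit func(block []byte)) error

// nullFilter emits fixed read-sized blocks without looking at the bytes,
// so its throughput approximates the hardware/pipeline ceiling.
func nullFilter(r io.Reader, emit func([]byte)) error {
	buf := make([]byte, 1<<20) // 1 MiB reads
	for {
		n, err := r.Read(buf)
		if n > 0 {
			emit(buf[:n])
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	data := bytes.Repeat([]byte{0xAB}, 64<<20) // 64 MiB in-memory test stream
	var blocks, total int
	start := time.Now()
	_ = nullFilter(bytes.NewReader(data), func(b []byte) {
		blocks++
		total += len(b)
	})
	elapsed := time.Since(start)
	fmt.Printf("blocks=%d bytes=%d throughput=%.1f MiB/s\n",
		blocks, total, float64(total)/(1<<20)/elapsed.Seconds())
}
```

Comparing this number against a raw io.Copy of the same reader would reveal any per-block penalty introduced by the filter plumbing itself, which is the "no penalty from applying a null-filter" goal listed above.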
