| fip | title | author | discussions-to | status | type | created |
|---|---|---|---|---|---|---|
| 0058 | Verifiable Data Aggregation | Jakub Sztandera (@Kubuxu), Nicola Greco (@nicola), Peter Rabbitson (@ribasushi) | | Draft | FRC | 2022-12-11 |
Enable Clients using data aggregation services to verify the correct aggregation of their data and to prove this fact to third parties.
This proposal describes a scheme enabling Aggregators to produce a Proof of Data Segment Inclusion, certifying proper aggregation of a Client's data. The produced proof assures:
- the inclusion of the Client's data within the on-chain deal
- that the Client's data can be trivially discovered within the deal to enable retrieval
- that malicious behaviour of an Aggregator, or of another user whose data was aggregated, does not interfere with the retrievability of the Client's data
A large majority of users onboard data onto the Filecoin network via an Aggregator, a third party combining small pieces of data into a single large deal. Today the work done by Aggregators is unverifiable and unprovable. The user has to rely on the Aggregator to perform the work correctly, and at the same time it is impossible to prove to a third party that a given piece of data was included in a deal, which is highly requested functionality for user-programmable data use cases.
This is a critical link in enabling and exposing small pieces of data to the FEVM ecosystem. In the majority of cases, small pieces of data undergo an aggregation process combining them into a large deal for acceptance by a Storage Provider. Without the proposed proof, data within aggregated deals becomes a second-class citizen in the Filecoin ecosystem. A significant portion of the F(E)VM use case is the ability to process and reason about the data stored by Filecoin Storage Providers. The Proof of Data Segment Inclusion makes it possible to apply this capability to segments of data which are too small to be on-boarded in their own deals due to economic constraints.
The specification consists of three parts:
- Data Segment Format and alignment
- Data Segment Index and retrieval handling
- Proof of Data Segment Inclusion
A Data Segment is an array of bytes the Client requested to be stored, after the process of Fr32 bit padding. Data Segments are identified by the commitment to the data (using the same algorithm as PieceCID: a SHA-256-truncated, 2-ary Merkle tree) and their size.
Data Segments can then be placed within a deal according to alignment rules of current sectors.
Data Segments have to be aligned to a power-of-two boundary, with a NUL trailer padding them up to the next power-of-two boundary:
```
DataSegment size = ceil(original sub-deal size * 128/127) // Fr32 padding
DataSegment alignment = 2**ceil(log2(DataSegment size))
DataSegment with trailer size = DataSegment alignment
```
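As a non-normative illustration, the sizing and alignment rules above can be computed as in the following Go sketch; the function names are illustrative and do not come from any reference library.

```go
// Illustrative sketch of the data segment sizing rules; not a reference
// implementation.
package main

import (
	"fmt"
	"math/bits"
)

// fr32PaddedSize returns ceil(original * 128/127), the sub-deal size after
// Fr32 bit padding.
func fr32PaddedSize(original uint64) uint64 {
	return (original*128 + 126) / 127
}

// nextPowerOfTwo rounds n up to the nearest power of two, i.e. 2**ceil(log2(n)).
func nextPowerOfTwo(n uint64) uint64 {
	if n <= 1 {
		return 1
	}
	return 1 << bits.Len64(n-1)
}

func main() {
	original := uint64(1 << 20) // a 1 MiB sub-deal, chosen only as an example
	padded := fr32PaddedSize(original)
	aligned := nextPowerOfTwo(padded) // segment size including the NUL trailer
	fmt.Printf("padded=%d aligned=%d trailer=%d\n", padded, aligned, aligned-padded)
}
```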
Data Segment Index is an optional region at the end of a deal describing all the data segments contained within that deal. The index has a maximum size depending on the size of the deal, but not all entries in the Data Segment Index have to be utilised. Unused entries should be filled out with NUL bytes.
The number of entries in the index is defined as:
```
number of entries = max(4, 2**floor(log2(padded size of deal / 2048 / 64 [byte/entry])))
size of index = number of entries * 64 [byte/entry]
start of the index = padded size of deal - size of index
```
This results in an index capable of containing 256Ki entries and occupying 16MiB for a 32GiB deal, and double that for a 64GiB deal.
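The following sketch shows how these quantities could be derived; it is illustrative only, and the names used are not taken from a reference implementation.

```go
// Illustrative sketch of the Data Segment Index sizing rules.
package main

import (
	"fmt"
	"math/bits"
)

const entrySize = 64 // bytes per Data Segment Index entry

// indexLayout returns the number of entries, total index size and start
// offset for a deal of the given padded size in bytes.
func indexLayout(paddedDealSize uint64) (entries, size, start uint64) {
	n := paddedDealSize / 2048 / entrySize
	if n < 4 {
		entries = 4
	} else {
		entries = uint64(1) << (bits.Len64(n) - 1) // 2**floor(log2(n))
	}
	size = entries * entrySize
	start = paddedDealSize - size
	return
}

func main() {
	entries, size, start := indexLayout(32 << 30) // 32 GiB deal
	fmt.Println(entries, size, start)             // 262144 16777216 34342961152
}
```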
Each entry within the Data Segment Index has a size of 64 bytes after Fr32 bit padding, resulting in a usable space of 508 bits: two 254-bit nodes (the two high bits of each 32-byte node are set to 0). Each entry is also aligned to a 64-byte boundary after Fr32 bit padding. The format is designed to be accessible and readable in the Fr32-padded form.
Entries consist of:
- CommDS - commitment to the data segment, 254 bits, the first node
- Offset - offset in bytes from the start of the deal to the Data Segment, low 64 bits of the second node
- Size - size in bytes of the Data Segment, next 64 bits
- Checksum - SHA-256 of (CommDS, Offset, Size, 0s), truncated to 126 bits, filling out the second node
For the purpose of the SHA-256 checksum, the data is structured in the same format as a Data Segment Index entry, with the checksum bits filled with 0s.
A Data Segment Index entry is defined as valid if it is positioned within the Data Segment Index area (from the start of the index to the end of the deal) and contains a valid checksum.
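The sketch below shows one possible in-memory representation of an entry and its checksum computation. The byte-level layout assumed here (little-endian fields, with the 126 checksum bits occupying the last 16 bytes and their two highest bits cleared) is an assumption for illustration; the normative layout is the Fr32-padded form described above.

```go
// Illustrative entry representation and checksum; the byte layout is an
// assumption, not the normative Fr32-padded encoding.
package datasegment

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
)

// SegmentDesc mirrors a 64-byte Data Segment Index entry.
type SegmentDesc struct {
	CommDS   [32]byte // commitment to the data segment (first node)
	Offset   uint64   // bytes from the start of the deal
	Size     uint64   // size of the data segment in bytes
	Checksum [16]byte // truncated SHA-256, 126 bits used
}

// serialize writes the entry into its assumed 64-byte form.
func (e SegmentDesc) serialize() []byte {
	buf := make([]byte, 64)
	copy(buf[0:32], e.CommDS[:])
	binary.LittleEndian.PutUint64(buf[32:40], e.Offset)
	binary.LittleEndian.PutUint64(buf[40:48], e.Size)
	copy(buf[48:64], e.Checksum[:])
	return buf
}

// computeChecksum hashes the entry with its checksum bytes zeroed and
// truncates the digest to 126 bits.
func (e SegmentDesc) computeChecksum() [16]byte {
	zeroed := e
	zeroed.Checksum = [16]byte{}
	digest := sha256.Sum256(zeroed.serialize())
	var out [16]byte
	copy(out[:], digest[:16])
	out[15] &= 0x3f // clear the two high bits, keeping 126 bits
	return out
}

// valid reports whether the stored checksum matches the recomputed one.
func (e SegmentDesc) valid() bool {
	want := e.computeChecksum()
	return bytes.Equal(e.Checksum[:], want[:])
}
```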
After receiving deal data, a Storage Provider should process the index area to discover any valid Data Segment Index entries within it. When a valid entry is discovered, the SP should queue up a task to compute the commitment of the referenced Data Segment. If the commitment in the index matches the referenced Data Segment, the Storage Provider should make it possible to retrieve the Data Segment via its Data Segment commitment (equivalent to a PieceCID).
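Under the same assumed layout as above, the Storage Provider's scan of the index region could look roughly like the following; the queuing of commitment checks is only indicated by a comment.

```go
// Illustrative scan of the Data Segment Index region; layout assumptions as
// in the entry sketch above.
package datasegment

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
)

// DiscoveredSegment records a valid index entry found during the scan.
type DiscoveredSegment struct {
	CommDS [32]byte
	Offset uint64
	Size   uint64
}

// scanIndex walks the Fr32-padded deal data from indexStart in 64-byte steps
// and returns every entry whose checksum validates.
func scanIndex(dealData []byte, indexStart uint64) []DiscoveredSegment {
	var found []DiscoveredSegment
	for off := indexStart; off+64 <= uint64(len(dealData)); off += 64 {
		entry := dealData[off : off+64]

		// Recompute the truncated checksum over the entry with its
		// checksum bytes zeroed (assumed: 16 bytes, 126 bits used).
		zeroed := make([]byte, 64)
		copy(zeroed, entry[:48])
		digest := sha256.Sum256(zeroed)
		digest[15] &= 0x3f
		if !bytes.Equal(entry[48:64], digest[:16]) {
			continue // unused or corrupted entry
		}

		var seg DiscoveredSegment
		copy(seg.CommDS[:], entry[:32])
		seg.Offset = binary.LittleEndian.Uint64(entry[32:40])
		seg.Size = binary.LittleEndian.Uint64(entry[40:48])
		found = append(found, seg)
		// A real SP would now queue recomputation of the referenced
		// segment's commitment and compare it with seg.CommDS before
		// offering retrieval by that commitment.
	}
	return found
}
```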
The proof consists of two inclusion proofs:
- An inclusion proof of a sub-tree corresponding to the tree of the Client's data within the aggregator's commitment.
- An inclusion proof of the double leaf data segment descriptor within the data segment index.
The Client possesses the following information: the commitment $\mathrm{CommDS}$ to their data.
The aggregator inserts the Client's data into the deal and also adds the data segment descriptor to the data segment index.
Following that, the aggregator produces the data commitment and two inclusion proofs and provides them to the client:
- $\mathrm{CommD}_ A$ - a commitment to the data created by the aggregator
- $|\mathrm{D}_ A|$ - size of the aggregator's data, corresponding to the deal/sector size
- $\mathrm{pos}_ D$ - position of the Client's data within the aggregator's data
- $\pi_ {st}$ - sub-tree proof, proving the inclusion of $\mathrm{DS}$ in $\mathrm{D}_ A$ at position $\mathrm{pos}_ D$
- $\mathrm{pos}_ {ds}$ - position of the index entry within the aggregator's data
- $\pi_ {ds}$ - leaf inclusion proof, proving the inclusion of $(\mathrm{CommDS}, \mathrm{pos}_ D, |\mathrm{D}_ C|, \mathrm{checksum})$ within $\mathrm{D}_ A$ at $\mathrm{pos}_ {ds}$
Auxiliary information provided by the aggregator to the Client is the DealID of the deal which contains the Client's data.
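For illustration, the information handed to the Client could be bundled along the following lines; the type and field names are assumptions rather than a normative encoding.

```go
// Illustrative, non-normative bundle of the proof material returned by the
// aggregator to the Client.
package datasegment

// Node is a 32-byte Merkle tree node (a commitment).
type Node = [32]byte

// ProofData is a Merkle proof path: sibling nodes from the proven node up to
// the root CommD_A.
type ProofData struct {
	Path []Node
}

// AggregationReceipt bundles everything the Client needs for verification.
type AggregationReceipt struct {
	CommDA  Node      // CommD_A: commitment to the aggregated deal data
	SizeDA  uint64    // |D_A|: padded size of the deal/sector in bytes
	PosD    uint64    // pos_D: position of the Client's segment in the deal
	ProofST ProofData // pi_st: sub-tree inclusion proof of the segment
	PosDS   uint64    // pos_ds: position of the segment's index entry
	ProofDS ProofData // pi_ds: inclusion proof of the 64-byte index entry
	DealID  uint64    // on-chain deal containing the Client's data
}
```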
To complete the verification of the proof, the Client has to verify the two inclusion proofs as well as check that a given sector was on-boarded:

- Compute $\mathrm{CommD}_ A', |\mathrm{D}_ A'| = \mathrm{ComputeRoot}(\mathrm{CommDS}, |\mathrm{DS}|, \pi_ {st}, \mathrm{pos}_ D)$
- Compute $\mathrm{CommD}_ A'', |\mathrm{D}_ A''| = \mathrm{ComputeRoot}(\mathrm{Comm}(\mathrm{CommDS}, \mathrm{pos}_ D, |\mathrm{D}_ C|, \mathrm{checksum}), 2, \pi_ {ds}, \mathrm{pos}_ {ds})$
- Verify $\mathrm{CommD}_ A' \overset{?}{=} \mathrm{CommD}_ A''$
- Verify $|\mathrm{D}_ A'| \overset{?}{=} |\mathrm{D}_ A''|$
- Verify that $\mathrm{pos}_ {ds}$ falls into the data segment index based on $|\mathrm{D}_ A'|$
- Verify that DealID is active on chain
- Verify that DealID has a data commitment of $\mathrm{CommD}_ A'$ and size $|\mathrm{D}_ A'|$
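A condensed sketch of these steps is given below. It assumes byte offsets for $\mathrm{pos}_ D$ and $\mathrm{pos}_ {ds}$, bottom-up Merkle proof paths, and the truncated-SHA-256 node combination used for piece commitments; the on-chain checks are passed in as plain parameters because this FRC does not prescribe a particular chain API.

```go
// Illustrative client-side verification; assumptions are noted in the
// surrounding text and in the comments below.
package datasegment

import (
	"crypto/sha256"
	"errors"
)

// node is a 32-byte Merkle tree node.
type node = [32]byte

// combine hashes two sibling nodes and clears the two high bits of the last
// byte, following the truncated-SHA-256 tree discipline.
func combine(left, right node) node {
	h := sha256.New()
	h.Write(left[:])
	h.Write(right[:])
	var out node
	copy(out[:], h.Sum(nil))
	out[31] &= 0x3f
	return out
}

// computeRoot folds a proof path over the proven node, returning the
// recomputed root and the implied tree size: nodeSize * 2^len(path).
func computeRoot(n node, nodeSize uint64, path []node, index uint64) (node, uint64) {
	cur := n
	for _, sib := range path {
		if index&1 == 0 {
			cur = combine(cur, sib)
		} else {
			cur = combine(sib, cur)
		}
		index >>= 1
	}
	return cur, nodeSize << uint(len(path))
}

// verify mirrors the bullet list above; entryNode is the commitment to the
// Client's 64-byte index entry, i.e. Comm(CommDS, pos_D, |D_C|, checksum).
func verify(commDS node, sizeDS uint64, proofST []node, posD uint64,
	entryNode node, proofDS []node, posDS uint64,
	dealActive bool, dealCommD node, dealSize uint64) error {

	commA1, sizeA1 := computeRoot(commDS, sizeDS, proofST, posD/sizeDS)
	commA2, sizeA2 := computeRoot(entryNode, 64, proofDS, posDS/64)

	if commA1 != commA2 || sizeA1 != sizeA2 {
		return errors.New("the two inclusion proofs do not commit to the same deal")
	}
	// pos_ds must additionally fall inside the Data Segment Index region
	// implied by sizeA1 (see the index sizing rules above).
	if !dealActive {
		return errors.New("deal is not active on chain")
	}
	if dealCommD != commA1 || dealSize != sizeA1 {
		return errors.New("on-chain deal commitment or size mismatch")
	}
	return nil
}
```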
The design of this proposal is guided by three main goals: to ensure the provable inclusion of the Client's data within the aggregator's on-chain data deal, to make it easy for the storage provider to find the user's data within the larger data segment, and to prevent malicious behavior of the aggregator or other users whose data was aggregated from causing irrecoverable problems with retrieval.
To achieve the first goal, the proposal includes a mechanism for producing an inclusion proof. This proof allows the client or a third party to verify that their data is properly included within the on-chain deal.
Additionally, the proposal includes a Data Segment Index which provides discoverability of data within the sector. This feature allows the storage provider to easily locate the client's data and prevents malicious behavior from influencing the retrieval of data. By including this mechanism, the proposal ensures that the client's data is retrievable, even in the presence of malicious behavior from the aggregator or other users.
Because of the form of the data commitments available on-chain, Merkle inclusion proofs are the only choice for proving inclusion of the Client's sub-trees.
The existence of the Data Segment Index within the deal, in combination with the power-of-two alignment constraint, reduces the storage efficiency of subdeals that are large in relation to the deal. A deal containing a Data Segment Index cannot include two subdeals each half the size of the deal. If the deal were to contain only two subdeals, the maximum storage efficiency is 75%. The efficiency approaches 100% as the number of subdeals increases, reaching 99% with seven subdeals.
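For example, with two subdeals of $\frac{1}{2}$ and $\frac{1}{4}$ of the deal size the utilised fraction is $\frac{1}{2} + \frac{1}{4} = 75\%$, while seven subdeals of sizes $\frac{1}{2}, \frac{1}{4}, \ldots, \frac{1}{128}$ utilise $\sum_{i=1}^{7} 2^{-i} = 1 - 2^{-7} \approx 99.2\%$ of the deal.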
Not applicable
Pending
Does not impact core Filecoin security.
Does not impact incentive systems.
This FRC enables new deal capabilities in environments with aggregators. These new capabilities should allow for growth of contract-driven data on-boarding by allowing contracts to utilise aggregator services.
WIP
Copyright and related rights waived via CC0.