Merge pull request #221 from ipld/codec-terminology-and-config-consistency

codecs: more docs, a terminology guide, consistency in options.
warpfork authored Aug 12, 2021
2 parents 70b6d15 + 6d3da1f commit 1285c1d
Showing 14 changed files with 348 additions and 139 deletions.
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -58,7 +58,7 @@ When a release tag is made, this block of bullet points will just slide down to
- The codecs do not reject other orderings when parsing serial data.
The `ipld.Node` trees resulting from deserialization will still preserve the serialized order.
However, it has now become impossible to re-encode data in that same preserved order.
- If doing your own encoding, there are customization options in `dagcbor.MarshalOptions.MapSortMode` and `dagjson.MarshalOptions.SortMapKeys`.
- If doing your own encoding, there are customization options in `dagcbor.EncodeOptions.MapSortMode` and `dagjson.EncodeOptions.MapSortMode`.
(However, note that these options are not available to you while using any systems that only operate in terms of multicodec codes.)
- _Be cautious of this change._ It is now extremely easy to write code which puts data into an `ipld.Node` in memory in one order,
then save and load that data using these codecs, and end up with different data as a result because the sorting changes the order of data.
77 changes: 77 additions & 0 deletions codec/README.md
@@ -0,0 +1,77 @@
Codecs
======

The `go-ipld-prime/codec` package is a grouping package.
Its subpackages contain the codecs that reside in this repo.

The codecs included here are our "batteries included" codecs,
but they are not otherwise special.

It is not necessary for a codec to be a subpackage here to be a valid codec to use with go-ipld-prime;
anything that implements the `ipld.Encoder` and `ipld.Decoder` interfaces is fine.


Terminology
-----------

We generally refer to "codecs" as having an "encode" function and "decode" function.

We consider "encoding" to be the process of going from {Data Model} to {serial data},
and "decoding" to be the process of going from {serial data} to {Data Model}.

### Codec vs Multicodec

A "codec" is _any_ function that goes from {Data Model} to {serial data}, or vice versa.

A "multicodec" is a function which does that and is _also_ specifically recognized and described in
the tables in https://github.com/multiformats/multicodec/ .

Multicodecs generally leave no further room for customization and configuration,
because their entire behavior is supposed to be specified by a multicodec indicator code number.

Our codecs, in the child packages of this one, usually offer configuration options.
They also usually offer exactly one function that does *not* allow configuration:
the one that supplies the multicodec-compatible behavior.
You'll see this marked in the docs on those functions.
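
The pattern can be sketched roughly as follows: an options struct whose method does the work, plus one package-level function that pins the options. This is a self-contained illustration rather than the real API; `Node`, `Entry`, and the `SortKeys` field are hypothetical stand-ins, and the "serial form" is a made-up `key=value;` text.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sort"
)

// Entry and Node are hypothetical stand-ins for an ordered map node.
type Entry struct{ Key, Value string }
type Node []Entry

// EncodeOptions holds the knobs a caller may turn.
// SortKeys is an invented option for this sketch.
type EncodeOptions struct {
	SortKeys bool
}

// Encode is the configurable entry point.
func (cfg EncodeOptions) Encode(n Node, w io.Writer) error {
	entries := append(Node(nil), n...) // copy, so the caller's order is untouched
	if cfg.SortKeys {
		sort.Slice(entries, func(i, j int) bool {
			return entries[i].Key < entries[j].Key
		})
	}
	for _, e := range entries {
		if _, err := fmt.Fprintf(w, "%s=%s;", e.Key, e.Value); err != nil {
			return err
		}
	}
	return nil
}

// Encode (package level) is the fixed-behavior function: no configuration,
// so it is the one that could be registered under a multicodec code.
func Encode(n Node, w io.Writer) error {
	return EncodeOptions{SortKeys: true}.Encode(n, w)
}

// encodeToString is a small helper for demonstration purposes.
func encodeToString(enc func(Node, io.Writer) error, n Node) string {
	var buf bytes.Buffer
	if err := enc(n, &buf); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	n := Node{{"b", "2"}, {"a", "1"}}
	fmt.Println(encodeToString(Encode, n))                 // a=1;b=2;
	fmt.Println(encodeToString(EncodeOptions{}.Encode, n)) // b=2;a=1;
}
```

Both entry points have the same function signature, so either can be passed wherever an encoder function is expected; only the package-level one has behavior that is fully determined in advance.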

### Marshal vs Encode

It's common to see the terms "marshal" and "unmarshal" used in golang.

Those terms are usually describing when structured data is transformed into linearized, tokenized data
(and then, perhaps, all the way to serially encoded data), or vice versa.

We would use the words the same way... except we don't end up using them,
because that feature doesn't really come up in our codec layer.

In IPLD, we would describe mapping some typed data into Data Model as "marshalling".
(It's one step shy of tokenizing, but barely: Data Model does already have defined ordering for every element of data.)
And we do have systems that do this:
`bindnode` and our codegen systems both do this, implicitly, when they give you an `ipld.Node` of the representation of some data.

We just don't end up talking about it as "marshalling" because of how it's done implicitly by those systems.
As a result, all of our features relating to codecs only end up speaking about "encoding" and "decoding".

### Legacy code

There are some appearances of the words "marshal" and "unmarshal" in some of our subpackages here.

That verbiage is generally on the way out.
For functions and structures with those names, you'll notice their docs marking them as deprecated.


Why have "batteries-included" codecs?
-------------------------------------

These codecs live in this repo because they're commonly used, highly supported,
and general-purpose codecs that we recommend for widespread usage in new developments.

Also, it's just plain nice to have something in-repo for development purposes.
It makes sure that if we try to make any API changes, we immediately see if they'd make codecs harder to implement.
We also use the batteries-included codecs for debugging, for test fixtures, and for benchmarking.

Further yet, the batteries-included codecs let us offer getting-started APIs.
For example, we offer some helper APIs which use codecs such as JSON to give consumers of the libraries
one-step helper methods that "do the right thing" with zero config... so long as they happen to use that codec.
Even for consumers who don't use those codecs, such functions then serve as natural documentation
and examples for what to do to put their codec of choice to work.
8 changes: 8 additions & 0 deletions codec/api.go
@@ -42,3 +42,11 @@ type ErrBudgetExhausted struct{}
func (e ErrBudgetExhausted) Error() string {
return "decoder resource budget exhausted (message too long or too complex)"
}

type MapSortMode uint8

const (
MapSortMode_None MapSortMode = iota
MapSortMode_Lexical
MapSortMode_RFC7049
)
20 changes: 14 additions & 6 deletions codec/cbor/multicodec.go
@@ -3,8 +3,6 @@ package cbor
import (
"io"

"github.com/polydawn/refmt/cbor"

"github.com/ipld/go-ipld-prime"
"github.com/ipld/go-ipld-prime/codec/dagcbor"
"github.com/ipld/go-ipld-prime/multicodec"
@@ -20,12 +18,22 @@ func init() {
multicodec.RegisterDecoder(0x51, Decode)
}

// Decode deserializes data from the given io.Reader and feeds it into the given ipld.NodeAssembler.
// Decode fits the ipld.Decoder function interface.
//
// This is the function that will be registered in the default multicodec registry during package init time.
func Decode(na ipld.NodeAssembler, r io.Reader) error {
return dagcbor.Unmarshal(na, cbor.NewDecoder(cbor.DecodeOptions{}, r),
dagcbor.UnmarshalOptions{AllowLinks: false})
return dagcbor.DecodeOptions{
AllowLinks: false,
}.Decode(na, r)
}

// Encode walks the given ipld.Node and serializes it to the given io.Writer.
// Encode fits the ipld.Encoder function interface.
//
// This is the function that will be registered in the default multicodec registry during package init time.
func Encode(n ipld.Node, w io.Writer) error {
return dagcbor.Marshal(n, cbor.NewEncoder(w),
dagcbor.MarshalOptions{AllowLinks: false})
return dagcbor.EncodeOptions{
AllowLinks: false,
}.Encode(n, w)
}
75 changes: 52 additions & 23 deletions codec/dagcbor/marshal.go
@@ -2,41 +2,62 @@ package dagcbor

import (
"fmt"
"io"
"sort"

"github.com/polydawn/refmt/cbor"
"github.com/polydawn/refmt/shared"
"github.com/polydawn/refmt/tok"

ipld "github.com/ipld/go-ipld-prime"
"github.com/ipld/go-ipld-prime/codec"
cidlink "github.com/ipld/go-ipld-prime/linking/cid"
)

// This file should be identical to the general feature in the parent package,
// except for the `case ipld.Kind_Link` block,
// which is dag-cbor's special sauce for schemafree links.

const (
MapSortMode_none = iota
MapSortMode_RFC7049
)

type MarshalOptions struct {
// If true, allow encoding of Link nodes as CBOR tag(42), otherwise reject
// them as unencodable
// EncodeOptions can be used to customize the behavior of an encoding function.
// The Encode method on this struct fits the ipld.Encoder function interface.
type EncodeOptions struct {
// If true, allow encoding of Link nodes as CBOR tag(42);
// otherwise, reject them as unencodable.
AllowLinks bool

// Control the sorting of map keys, MapSortMode_none for no sorting or
// MapSortMode_RFC7049 for length-first bytewise sorting as per RFC7049 and
// DAG-CBOR
MapSortMode int
// Control the sorting of map keys, using one of the `codec.MapSortMode_*` constants.
MapSortMode codec.MapSortMode
}

func Marshal(n ipld.Node, sink shared.TokenSink, options MarshalOptions) error {
// Encode walks the given ipld.Node and serializes it to the given io.Writer.
// Encode fits the ipld.Encoder function interface.
//
// The behavior of the encoder can be customized by setting fields in the EncodeOptions struct before calling this method.
func (cfg EncodeOptions) Encode(n ipld.Node, w io.Writer) error {
// Probe for a builtin fast path. Shortcut to that if possible.
type detectFastPath interface {
EncodeDagCbor(io.Writer) error
}
if n2, ok := n.(detectFastPath); ok {
return n2.EncodeDagCbor(w)
}
// Okay, generic inspection path.
return Marshal(n, cbor.NewEncoder(w), cfg)
}

// Future work: we would like to remove the Marshal function,
// and in particular, stop seeing types from refmt (like shared.TokenSink) be visible.
// Right now, some kinds of configuration (e.g. for whitespace and prettyprint) are only available through interacting with the refmt types;
// we should improve our API so that this can be done with only our own types in this package.

// Marshal is a deprecated function.
// Please consider switching to EncodeOptions.Encode instead.
func Marshal(n ipld.Node, sink shared.TokenSink, options EncodeOptions) error {
var tk tok.Token
return marshal(n, &tk, sink, options)
}

func marshal(n ipld.Node, tk *tok.Token, sink shared.TokenSink, options MarshalOptions) error {
func marshal(n ipld.Node, tk *tok.Token, sink shared.TokenSink, options EncodeOptions) error {
switch n.Kind() {
case ipld.Kind_Invalid:
return fmt.Errorf("cannot traverse a node that is absent")
@@ -138,14 +159,14 @@ func marshal(n ipld.Node, tk *tok.Token, sink shared.TokenSink, options MarshalO
}
}

func marshalMap(n ipld.Node, tk *tok.Token, sink shared.TokenSink, options MarshalOptions) error {
func marshalMap(n ipld.Node, tk *tok.Token, sink shared.TokenSink, options EncodeOptions) error {
// Emit start of map.
tk.Type = tok.TMapOpen
tk.Length = int(n.Length()) // TODO: overflow check
if _, err := sink.Step(tk); err != nil {
return err
}
if options.MapSortMode == MapSortMode_RFC7049 {
if options.MapSortMode != codec.MapSortMode_None {
// Collect map entries, then sort by key
type entry struct {
key string
@@ -163,14 +184,22 @@ func marshalMap(n ipld.Node, tk *tok.Token, sink shared.TokenSink, options Marsh
}
entries = append(entries, entry{keyStr, v})
}
// RFC7049 style sort as per DAG-CBOR spec
sort.Slice(entries, func(i, j int) bool {
li, lj := len(entries[i].key), len(entries[j].key)
if li == lj {
// Apply the desired sort function.
switch options.MapSortMode {
case codec.MapSortMode_Lexical:
sort.Slice(entries, func(i, j int) bool {
return entries[i].key < entries[j].key
}
return li < lj
})
})
case codec.MapSortMode_RFC7049:
sort.Slice(entries, func(i, j int) bool {
// RFC7049 style sort as per DAG-CBOR spec
li, lj := len(entries[i].key), len(entries[j].key)
if li == lj {
return entries[i].key < entries[j].key
}
return li < lj
})
}
// Emit map contents (and recurse).
for _, e := range entries {
tk.Type = tok.TString
46 changes: 24 additions & 22 deletions codec/dagcbor/multicodec.go
@@ -3,9 +3,8 @@ package dagcbor
import (
"io"

"github.com/polydawn/refmt/cbor"

"github.com/ipld/go-ipld-prime"
"github.com/ipld/go-ipld-prime/codec"
"github.com/ipld/go-ipld-prime/multicodec"
)

@@ -19,28 +18,31 @@ func init() {
multicodec.RegisterDecoder(0x71, Decode)
}

// Decode deserializes data from the given io.Reader and feeds it into the given ipld.NodeAssembler.
// Decode fits the ipld.Decoder function interface.
//
// A similar function is available on DecodeOptions type if you would like to customize any of the decoding details.
// This function uses the defaults for the dag-cbor codec
// (meaning: links (indicated by tag 42) are decoded).
//
// This is the function that will be registered in the default multicodec registry during package init time.
func Decode(na ipld.NodeAssembler, r io.Reader) error {
// Probe for a builtin fast path. Shortcut to that if possible.
type detectFastPath interface {
DecodeDagCbor(io.Reader) error
}
if na2, ok := na.(detectFastPath); ok {
return na2.DecodeDagCbor(r)
}
// Okay, generic builder path.
return Unmarshal(na, cbor.NewDecoder(cbor.DecodeOptions{}, r),
UnmarshalOptions{AllowLinks: true})
return DecodeOptions{
AllowLinks: true,
}.Decode(na, r)
}

// Encode walks the given ipld.Node and serializes it to the given io.Writer.
// Encode fits the ipld.Encoder function interface.
//
// A similar function is available on EncodeOptions type if you would like to customize any of the encoding details.
// This function uses the defaults for the dag-cbor codec
// (meaning: links are encoded, and map keys are sorted (with RFC7049 ordering!) during encode).
//
// This is the function that will be registered in the default multicodec registry during package init time.
func Encode(n ipld.Node, w io.Writer) error {
// Probe for a builtin fast path. Shortcut to that if possible.
type detectFastPath interface {
EncodeDagCbor(io.Writer) error
}
if n2, ok := n.(detectFastPath); ok {
return n2.EncodeDagCbor(w)
}
// Okay, generic inspection path.
return Marshal(n, cbor.NewEncoder(w),
MarshalOptions{AllowLinks: true, MapSortMode: MapSortMode_RFC7049})
return EncodeOptions{
AllowLinks: true,
MapSortMode: codec.MapSortMode_RFC7049,
}.Encode(n, w)
}
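
The "fast path" probe that this commit moves out of these functions and into the `EncodeOptions.Encode` and `DecodeOptions.Decode` methods relies on an optional-interface type assertion. A minimal standalone sketch of that technique follows; `Node`, `plainNode`, `fastNode`, and the `EncodeFast` method are hypothetical stand-ins (the real probe looks for `EncodeDagCbor` / `DecodeDagCbor`), and the "serialization" is just tagged text.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// Node is a hypothetical stand-in for ipld.Node.
type Node interface {
	AsString() string
}

// plainNode only implements the basic Node interface.
type plainNode string

func (p plainNode) AsString() string { return string(p) }

// fastNode additionally knows how to serialize itself directly.
type fastNode string

func (f fastNode) AsString() string { return string(f) }

func (f fastNode) EncodeFast(w io.Writer) error {
	_, err := io.WriteString(w, "fast:"+string(f))
	return err
}

// Encode probes for the optional method and falls back to a generic path.
func Encode(n Node, w io.Writer) error {
	// Probe for a builtin fast path. Shortcut to that if possible.
	type detectFastPath interface {
		EncodeFast(io.Writer) error
	}
	if n2, ok := n.(detectFastPath); ok {
		return n2.EncodeFast(w)
	}
	// Generic inspection path.
	_, err := io.WriteString(w, "generic:"+n.AsString())
	return err
}

// encodeToString is a small helper for demonstration purposes.
func encodeToString(n Node) string {
	var buf bytes.Buffer
	if err := Encode(n, &buf); err != nil {
		panic(err)
	}
	return buf.String()
}

func main() {
	fmt.Println(encodeToString(plainNode("x"))) // generic:x
	fmt.Println(encodeToString(fastNode("x")))  // fast:x
}
```

One thing this sketch makes visible: the fast-path method receives only the writer, so in this shape any options held by the caller are not passed along to it.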
35 changes: 31 additions & 4 deletions codec/dagcbor/unmarshal.go
@@ -3,9 +3,11 @@ package dagcbor
import (
"errors"
"fmt"
"io"
"math"

cid "github.com/ipfs/go-cid"
"github.com/polydawn/refmt/cbor"
"github.com/polydawn/refmt/shared"
"github.com/polydawn/refmt/tok"

@@ -27,12 +29,37 @@ const (
// except for the `case tok.TBytes` block,
// which has dag-cbor's special sauce for detecting schemafree links.

type UnmarshalOptions struct {
// DecodeOptions can be used to customize the behavior of a decoding function.
// The Decode method on this struct fits the ipld.Decoder function interface.
type DecodeOptions struct {
// If true, parse DAG-CBOR tag(42) as Link nodes, otherwise reject them
AllowLinks bool
}

func Unmarshal(na ipld.NodeAssembler, tokSrc shared.TokenSource, options UnmarshalOptions) error {
// Decode deserializes data from the given io.Reader and feeds it into the given ipld.NodeAssembler.
// Decode fits the ipld.Decoder function interface.
//
// The behavior of the decoder can be customized by setting fields in the DecodeOptions struct before calling this method.
func (cfg DecodeOptions) Decode(na ipld.NodeAssembler, r io.Reader) error {
// Probe for a builtin fast path. Shortcut to that if possible.
type detectFastPath interface {
DecodeDagCbor(io.Reader) error
}
if na2, ok := na.(detectFastPath); ok {
return na2.DecodeDagCbor(r)
}
// Okay, generic builder path.
return Unmarshal(na, cbor.NewDecoder(cbor.DecodeOptions{}, r), cfg)
}

// Future work: we would like to remove the Unmarshal function,
// and in particular, stop seeing types from refmt (like shared.TokenSource) be visible.
// Right now, some kinds of configuration (e.g. for whitespace and prettyprint) are only available through interacting with the refmt types;
// we should improve our API so that this can be done with only our own types in this package.

// Unmarshal is a deprecated function.
// Please consider switching to DecodeOptions.Decode instead.
func Unmarshal(na ipld.NodeAssembler, tokSrc shared.TokenSource, options DecodeOptions) error {
// Have a gas budget, which will be decremented as we allocate memory, and an error returned when exceeded (or about to be exceeded).
// This is a DoS defense mechanism.
// It's *roughly* in units of bytes (but only very, VERY roughly) -- it also treats words as 1 in many cases.
@@ -41,7 +68,7 @@ func Unmarshal(na ipld.NodeAssembler, tokSrc shared.TokenSource, options Unmarsh
return unmarshal1(na, tokSrc, &gas, options)
}

func unmarshal1(na ipld.NodeAssembler, tokSrc shared.TokenSource, gas *int, options UnmarshalOptions) error {
func unmarshal1(na ipld.NodeAssembler, tokSrc shared.TokenSource, gas *int, options DecodeOptions) error {
var tk tok.Token
done, err := tokSrc.Step(&tk)
if err != nil {
@@ -55,7 +82,7 @@ func unmarshal1(na ipld.NodeAssembler, tokSrc shared.TokenSource, gas *int, opti

// starts with the first token already primed. Necessary to get recursion
// to flow right without a peek+unpeek system.
func unmarshal2(na ipld.NodeAssembler, tokSrc shared.TokenSource, tk *tok.Token, gas *int, options UnmarshalOptions) error {
func unmarshal2(na ipld.NodeAssembler, tokSrc shared.TokenSource, tk *tok.Token, gas *int, options DecodeOptions) error {
// FUTURE: check for schema.TypedNodeBuilder that's going to parse a Link (they can slurp any token kind they want).
switch tk.Type {
case tok.TMapOpen: