Proposal: An intrinsic Byte data type #562

Open
johnwikman opened this issue Apr 1, 2022 · 1 comment
Labels
RFC Request for Comments

Comments

@johnwikman
Contributor

Currently, there is no data type in MCore for representing a byte. While String-based I/O may be sufficient for some applications, the absence of binary I/O heavily hurts performance in applications that need to serialize data, such as saving or checkpointing weights when training neural networks.

The proposal is to add a new intrinsic data type Byte that takes up 1 byte per element. That is, a tensor tensorCreateDense [n] (lam. #byte"0x00") should take up n + O(1) bytes of memory, where the O(1) term represents the constant overhead of the tensor.
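To make the n + O(1) requirement concrete, here is a rough illustration in Python rather than MCore. It compares a packed byte buffer with a list of element references. The helper names `packed_size` and `boxed_size` are made up for this sketch, and the measurements rely on how CPython reports object sizes, so treat it as an analogy and not as a statement about any MCore backend.

```python
import sys

def packed_size(n: int) -> int:
    """Size in bytes of a packed buffer holding n zero bytes (CPython-specific)."""
    return sys.getsizeof(bytes(n))

def boxed_size(n: int) -> int:
    """Size in bytes of a list with n element references (CPython-specific).

    This counts only the list's own storage (one pointer per element),
    not the objects the pointers refer to.
    """
    return sys.getsizeof([0] * n)

n = 1_000_000
constant_overhead = packed_size(0)             # the O(1) part
payload = packed_size(n) - constant_overhead   # grows by exactly 1 byte per element

print("packed:", packed_size(n), "bytes, of which payload:", payload)
print("boxed: ", boxed_size(n), "bytes (pointers only)")
```

The packed buffer grows by one byte per element, which is the behavior the proposal asks of a Byte tensor; the boxed list pays at least one pointer per element.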

  • A Byte constant can be instantiated in code as #byte"0x10" or #b"0x10" (short-hand version). The only restriction is that the value inside the quotation marks is between 0 and 255 inclusive; the notation used to write that value is not restricted.

  • The behavior of a Byte (apart from storage space) would be completely implemented through externals. E.g. there would be no intrinsic addbyte a b. For OCaml, we might implement the following externals:

    • external int2bytesbe: Int -> [Byte] (be = Big Endian)
    • external float2bytesbe: Float -> [Byte] (could be IEEE754 format or something else, up to the backend to decide)
    • external bytesbe2int: [Byte] -> Int
    • external bytesbe2float: [Byte] -> Float
    • external readBytes ! : ReadChannel -> Int -> [Byte]
    • external writeBytes ! : WriteChannel -> [Byte] -> ()
    • etc.
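The literal rule and the conversion externals listed above could behave roughly as in the following Python sketch. This is an illustration only, not the proposed implementation: the function `parse_byte_literal` is hypothetical, and the choice of a 64-bit signed integer and an IEEE 754 double is an assumption, since the proposal leaves the encoding up to the backend.

```python
import struct

def parse_byte_literal(text: str) -> int:
    """Hypothetical check for the text inside #byte"...".

    Any integer notation is accepted (hex, binary, octal, decimal);
    only the resulting value is restricted to 0..255.
    """
    value = int(text, 0)  # base 0 infers the base from the prefix
    if not 0 <= value <= 255:
        raise ValueError(f"byte literal out of range: {text!r}")
    return value

# Assumption: Int is serialized as a 64-bit signed big-endian integer.
def int2bytesbe(value: int) -> bytes:
    return struct.pack(">q", value)

def bytesbe2int(data: bytes) -> int:
    return struct.unpack(">q", data)[0]

# Assumption: Float is serialized as an IEEE 754 double, big-endian.
def float2bytesbe(value: float) -> bytes:
    return struct.pack(">d", value)

def bytesbe2float(data: bytes) -> float:
    return struct.unpack(">d", data)[0]

# Round trips through the byte representation.
assert parse_byte_literal("0x10") == parse_byte_literal("16")
assert bytesbe2int(int2bytesbe(-42)) == -42
assert bytesbe2float(float2bytesbe(1.5)) == 1.5
```

With fixed-width encodings like these, every Int and every Float serializes to the same number of bytes regardless of its value, which is the property a reader needs in order to know how many bytes to request from readBytes.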

The consequence is that the behavior of a byte becomes completely defined by the backend used, which I see as favorable because it offloads many of the underlying encoding requirements from MCore.

The immediate use case for me is to be able to serialize & deserialize large tensors containing floats. Currently this is not really feasible, since I have to use float2string and string2float every time I do file I/O, and I also have to parse the strings to check for delimiters, etc. In previous attempts at loading tensors from string representations, fully parsing the produced strings took days. With the Byte type available, however, I could instead use more efficient writeTensor and readTensor functions:

let writeTensor: WriteChannel -> Tensor[Float] -> () = lam ch. lam t.
  writeBytes ch (int2bytesbe (tensorRank t));
  foldl (lam. lam dimsize.
    writeBytes ch (int2bytesbe dimsize)
  ) () (tensorShape t);
  let n = tensorSize t in
  recursive let iterH = lam i.
    if eqi i n then () else (
      writeBytes ch (float2bytesbe (tensorLinearGetExn t i));
      iterH (addi i 1)
    )
  in
  iterH 0

let readTensor: ReadChannel -> Tensor[Float] = lam ch.
  -- assuming that floats and ints have the same serialized size regardless of value (might need to have more expressive externals here...)
  let sizeFloat = length (float2bytesbe 0.0) in
  let sizeInt = length (int2bytesbe 0) in
  let rank = bytesbe2int (readBytes ch sizeInt) in
  recursive let mkshapeH = lam acc. lam i.
    if eqi i rank then
      acc
    else
      mkshapeH (snoc acc (bytesbe2int (readBytes ch sizeInt)))
               (addi i 1)
  in
  let shape = mkshapeH [] 0 in
  let t = tensorCreateDense shape (lam. 0.0) in
  let n = tensorSize t in
  recursive let fillTensorH = lam i.
    if eqi i n then () else (
      tensorLinearSetExn t i (bytesbe2float (readBytes ch sizeFloat));
      fillTensorH (addi i 1)
    )
  in
  fillTensorH 0;
  t
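
The on-disk layout used by writeTensor and readTensor (rank, then one integer per dimension, then the elements in linear order) can be tried out with the following Python sketch. It is not MCore and not part of the proposal: the names `write_tensor`, `read_tensor`, and `read_exact` are made up here, a tensor is modeled as a shape list plus a flat list of floats, and the 8-byte big-endian encodings are an assumption.

```python
import io
import struct
from math import prod

def write_tensor(ch, shape, data):
    """Write rank, each dimension size, then every element, all big-endian."""
    if len(data) != prod(shape):
        raise ValueError("data length does not match shape")
    ch.write(struct.pack(">q", len(shape)))   # rank
    for dim in shape:
        ch.write(struct.pack(">q", dim))      # dimension sizes
    for x in data:
        ch.write(struct.pack(">d", x))        # elements in linear order

def read_exact(ch, n):
    """Read exactly n bytes or fail, so truncated input is detected."""
    buf = ch.read(n)
    if len(buf) != n:
        raise EOFError("unexpected end of input")
    return buf

def read_tensor(ch):
    """Inverse of write_tensor; returns (shape, flat data)."""
    (rank,) = struct.unpack(">q", read_exact(ch, 8))
    shape = [struct.unpack(">q", read_exact(ch, 8))[0] for _ in range(rank)]
    n = prod(shape)
    data = list(struct.unpack(f">{n}d", read_exact(ch, 8 * n)))
    return shape, data

# Round trip through an in-memory channel.
channel = io.BytesIO()
write_tensor(channel, [2, 3], [0.0, 1.5, -2.0, 3.25, 4.0, 5.5])
channel.seek(0)
shape, data = read_tensor(channel)
```

Under these assumptions a 2×3 float tensor occupies 8 bytes for the rank, 16 for the shape, and 48 for the elements, with no delimiters to scan for when reading it back.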

@david-broman added the RFC (Request for Comments) label on Apr 4, 2022
@johnwikman
Contributor Author

This has been discussed during meetings, and the idea would probably merge into how we handle external types. Related issue: #586

Still, it would be good to have some way to specify size guarantees in MCore, so that a backend cannot use arbitrary amounts of memory for a value that should be small and of fixed size. To solve this, we could introduce a Blob type, which can take a number as a parameter specifying the number of bits it should occupy:

external type UInt8
external type Int8
external type UInt16
external type Int16
external type BigInt

ffi ocaml
  type UInt8 = Blob[8]
  type Int8 = Blob[8]
  type UInt16 = Blob[16]
  type Int16 = Blob[16]
  type BigInt = Blob

The syntax for Blob would then be

  • Blob[<bits>]: Fixed-size blob that cannot be resized.
  • Blob: Arbitrary blob of data to be managed by the backend. Can be resized if necessary.
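
As a rough illustration of the fixed-size case, the Python sketch below stores an integer in exactly the number of bits given and rejects values that do not fit. The helpers `to_blob` and `from_blob` are hypothetical names for this example. The big-endian byte order and the restriction to bit counts that are multiples of 8 are assumptions made to keep the sketch short; the proposal does not specify either.

```python
def to_blob(value: int, bits: int, signed: bool) -> bytes:
    """Store value in exactly bits // 8 bytes; raise OverflowError if it does not fit."""
    if bits <= 0 or bits % 8 != 0:
        raise ValueError("this sketch only handles whole bytes")
    return value.to_bytes(bits // 8, "big", signed=signed)

def from_blob(blob: bytes, bits: int, signed: bool) -> int:
    """Interpret a fixed-size blob as an integer of the given signedness."""
    if len(blob) * 8 != bits:
        raise ValueError("blob does not have the declared size")
    return int.from_bytes(blob, "big", signed=signed)

# UInt8 and Int8 both occupy Blob[8]; only the interpretation differs.
as_uint8 = to_blob(255, 8, signed=False)
as_int8 = to_blob(-1, 8, signed=True)
assert as_uint8 == as_int8 == b"\xff"
assert from_blob(as_uint8, 8, signed=False) == 255
assert from_blob(as_uint8, 8, signed=True) == -1
```

This mirrors the ffi declarations above, where UInt8 and Int8 map to the same Blob[8] and the external type decides how the bits are read.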
