Currently, there is no data type in MCore for representing a byte. While String-based I/O may be enough for some applications, the absence of binary I/O heavily impacts performance in applications where serialization of data is useful, such as saving/checkpointing weights during machine learning with neural networks.
The proposal is to have a new intrinsic data type Byte that takes up 1 byte per element. That is, a tensor created with tensorCreateDense [n] (lam. #byte"0x00") should take up n + O(1) bytes of memory, where the O(1) represents constant overhead for the tensor.
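To make the n + O(1) storage claim concrete, here is a minimal OCaml sketch of how a backend might hold such a tensor. It assumes a flat Bigarray of unsigned 8-bit elements; the names create_byte_tensor and payload_bytes are hypothetical and not part of MCore or its OCaml backend.

```ocaml
(* Assumption: a Byte tensor is backed by a flat Bigarray of unsigned
   8-bit elements, so each element occupies exactly one byte. *)
let create_byte_tensor (n : int) (init : int) =
  let t = Bigarray.Array1.create Bigarray.int8_unsigned Bigarray.c_layout n in
  Bigarray.Array1.fill t init;
  t

(* Size of the element payload in bytes, excluding the constant-size header. *)
let payload_bytes t = Bigarray.Array1.size_in_bytes t
```

With this representation the payload of a tensor of n elements is n bytes, and the only extra cost is the fixed Bigarray header.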
A Byte constant can be instantiated in code as #byte"0x10" or #b"0x10" (short-hand version). The only restriction is that the value inside the quotation marks is between 0 and 255 inclusive; the restriction is on the value, not on the notation used to write it.
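The range restriction on literals could be checked along these lines. This is only a sketch: byte_of_literal is a hypothetical helper, and accepting every notation that OCaml's int_of_string_opt understands (decimal, 0x, 0o, 0b) is an assumption rather than something the proposal specifies.

```ocaml
(* Hypothetical validator for the text between the quotation marks of a
   Byte literal. The notation is free; only the resulting value is checked. *)
let byte_of_literal (s : string) : int option =
  match int_of_string_opt s with
  | Some v when v >= 0 && v <= 255 -> Some v
  | _ -> None
```

Under this sketch "0x10" and "16" denote the same byte, while "256" and "-1" are rejected.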
The behavior of a Byte (apart from storage space) would be completely implemented through externals. E.g. there would be no intrinsic addbyte a b. For OCaml, we might implement the following externals:
external int2bytesbe: Int -> [Byte] (be = Big Endian)
external float2bytesbe: Float -> [Byte] (could be IEEE 754 format or something else, up to the backend to decide)
external bytesbe2int: [Byte] -> Int
external bytesbe2float: [Byte] -> Float
external readBytes ! : ReadChannel -> Int -> [Byte]
external writeBytes ! : WriteChannel -> [Byte] -> ()
The consequence of this would be that the behavior of a byte becomes completely defined by the backend used. I see this as favorable, since it offloads a lot of the underlying encoding requirements from MCore.
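As an illustration of what the OCaml side of these externals might look like, here is a hedged sketch of the four conversion functions. It makes several assumptions that the proposal leaves open: [Byte] is modelled as an int list with elements in 0..255, both Int and Float serialize to exactly 8 bytes, and floats use their IEEE 754 bit pattern. The helpers bytes_to_list and list_to_bytes are hypothetical.

```ocaml
(* Assumption: [Byte] is modelled as an int list with elements in 0..255. *)
let bytes_to_list (b : Bytes.t) : int list =
  List.init (Bytes.length b) (fun i -> Char.code (Bytes.get b i))

let list_to_bytes (l : int list) : Bytes.t =
  let b = Bytes.create (List.length l) in
  List.iteri (fun i v -> Bytes.set b i (Char.chr (v land 0xFF))) l;
  b

(* Assumption: an Int is serialized as 8 big-endian bytes. *)
let int2bytesbe (n : int) : int list =
  let b = Bytes.create 8 in
  Bytes.set_int64_be b 0 (Int64.of_int n);
  bytes_to_list b

(* Raises Invalid_argument if fewer than 8 bytes are supplied. *)
let bytesbe2int (l : int list) : int =
  Int64.to_int (Bytes.get_int64_be (list_to_bytes l) 0)

(* Assumption: a Float is serialized as its 64-bit IEEE 754 bit pattern. *)
let float2bytesbe (x : float) : int list =
  let b = Bytes.create 8 in
  Bytes.set_int64_be b 0 (Int64.bits_of_float x);
  bytes_to_list b

let bytesbe2float (l : int list) : float =
  Int64.float_of_bits (Bytes.get_int64_be (list_to_bytes l) 0)
```

A different backend could pick another width or float encoding without any change on the MCore side, which is the point of keeping the behavior in externals.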
The immediate use case for me is to be able to serialize & deserialize large tensors containing floats. Currently this is not really feasible: I have to use float2string and string2float every time I do file I/O, and I also have to parse the strings to check for delimiters, etc. In previous attempts at loading tensors from string representations, fully parsing the produced strings would take days. With the Byte type available, however, I could instead use more efficient writeTensor and readTensor functions:
let writeTensor: WriteChannel -> Tensor[Float] -> () = lam ch. lam t.
  -- header: rank followed by the size of each dimension
  writeBytes ch (int2bytesbe (tensorRank t));
  foldl (lam. lam dimsize.
    writeBytes ch (int2bytesbe dimsize)
  ) () (tensorShape t);
  let n = tensorSize t in
  recursive let iterH = lam i.
    if eqi i n then () else (
      writeBytes ch (float2bytesbe (tensorLinearGetExn t i));
      iterH (addi i 1)
    )
  in
  iterH 0

let readTensor: ReadChannel -> Tensor[Float] = lam ch.
  -- assuming that floats and ints have the same serialized size regardless of value (might need to have more expressive externals here...)
  let sizeFloat = length (float2bytesbe 0.0) in
  let sizeInt = length (int2bytesbe 0) in
  let rank = bytesbe2int (readBytes ch sizeInt) in
  recursive let mkshapeH = lam acc. lam i.
    -- return the accumulated shape once all dimensions have been read
    if eqi i rank then acc else
      mkshapeH (snoc acc (bytesbe2int (readBytes ch sizeInt)))
               (addi i 1)
  in
  let shape = mkshapeH [] 0 in
  let t = tensorCreateDense shape (lam. 0.0) in
  let n = tensorSize t in
  recursive let fillTensorH = lam i.
    if eqi i n then () else (
      tensorLinearSetExn t i (bytesbe2float (readBytes ch sizeFloat));
      fillTensorH (addi i 1)
    )
  in
  fillTensorH 0;
  t
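The file layout used by writeTensor and readTensor (rank, then each dimension size, then the elements in linear order) can be tried out directly in OCaml. The sketch below is not the proposed MCore implementation; it assumes 8-byte big-endian fields and models a tensor with a hypothetical record holding a shape and a flat float array.

```ocaml
(* Hypothetical stand-in for Tensor[Float]: a shape plus flat element data. *)
type tensor = { shape : int array; data : float array }

(* Assumption: every field is written as 8 big-endian bytes. *)
let write_int64 (oc : out_channel) (v : int64) : unit =
  let b = Bytes.create 8 in
  Bytes.set_int64_be b 0 v;
  output_bytes oc b

let read_int64 (ic : in_channel) : int64 =
  let b = Bytes.create 8 in
  really_input ic b 0 8;
  Bytes.get_int64_be b 0

(* Layout: rank, then each dimension size, then the elements in linear order. *)
let write_tensor (oc : out_channel) (t : tensor) : unit =
  write_int64 oc (Int64.of_int (Array.length t.shape));
  Array.iter (fun d -> write_int64 oc (Int64.of_int d)) t.shape;
  Array.iter (fun x -> write_int64 oc (Int64.bits_of_float x)) t.data

let read_tensor (ic : in_channel) : tensor =
  let rank = Int64.to_int (read_int64 ic) in
  let shape = Array.init rank (fun _ -> Int64.to_int (read_int64 ic)) in
  let n = Array.fold_left ( * ) 1 shape in
  let data = Array.init n (fun _ -> Int64.float_of_bits (read_int64 ic)) in
  { shape; data }

(* Write a tensor to a temporary binary file and read it back. *)
let round_trip (t : tensor) : tensor =
  let path = Filename.temp_file "tensor" ".bin" in
  let oc = open_out_bin path in
  write_tensor oc t;
  close_out oc;
  let ic = open_in_bin path in
  let t' = read_tensor ic in
  close_in ic;
  Sys.remove path;
  t'
```

Because every element has a fixed width, reading needs no delimiter scanning or number parsing, which is where the string-based approach spends its time.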
This has been discussed during meetings, and the idea would probably merge into how we handle external types. Related issue: #586
It would still be good, though, to have some way to specify size guarantees in MCore, so that a backend cannot use an arbitrary amount of memory for a value that should be small and of fixed size. To solve this, we could introduce the Blob type, which can take a number as a parameter specifying the number of bits that it should occupy:
external type UInt8
external type Int8
external type UInt16
external type Int16
external type BigInt
ffi ocaml
type UInt8 = Blob[8]
type Int8 = Blob[8]
type UInt16 = Blob[16]
type Int16 = Blob[16]
type BigInt = Blob
The syntax for Blob would then be
Blob[<bits>]: Fixed size blob that is not able to be resized.
Blob: Arbitrary blob of data to be managed by the backend. Can be resized if necessary.
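One possible reading of the fixed-size variant, sketched in OCaml. Everything here is hypothetical: the record type, the function names, and the choice to round the bit count up to whole bytes are assumptions made for illustration, not part of the proposal.

```ocaml
(* Hypothetical model of Blob[<bits>]: the size is fixed at creation. *)
type blob = { bits : int; data : Bytes.t }

(* Assumption: storage is rounded up to a whole number of bytes. *)
let blob_create (bits : int) : blob =
  if bits <= 0 then invalid_arg "blob_create: bits must be positive";
  { bits; data = Bytes.make ((bits + 7) / 8) '\000' }

let blob_size_bytes (b : blob) : int = Bytes.length b.data

(* Overwrite the contents; a payload of a different size is rejected,
   so the blob can never grow or shrink. *)
let blob_set (b : blob) (src : Bytes.t) : unit =
  if Bytes.length src <> Bytes.length b.data then
    invalid_arg "blob_set: size mismatch"
  else Bytes.blit src 0 b.data 0 (Bytes.length src)
```

Under this model UInt8 and Int8 would both occupy one byte and UInt16 and Int16 two bytes, with their interpretation left entirely to the externals that operate on them.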