frank-dong-ms-zz

Fixes issue #5336.

  1. Use a byte array to create the tensor instead of a string
  2. Use Unicode encoding (UTF-16, `Encoding.Unicode` in .NET) instead of UTF8

This issue is a little complicated, so please read through the details below:

A user wants to load a pb model in ML.NET. The input tensor, shown below, takes a serialized Example object (a binary buffer, not a text string):
inputs['inputs'] tensor_info: dtype: DT_STRING shape: (-1) name: input_example_tensor:0

A workable solution I found is to first convert the Example object to a protobuf-encoded byte array:
example.ToByteArray()
then convert the byte array to a string (char array) using a reliable encoding (ideally Unicode or Base64):
Encoding.Unicode.GetString(example.ToByteArray())
ML.NET then converts the string back to a byte array with the same encoding and passes it to TF.NET:
Encoding.Unicode.GetBytes(((ReadOnlyMemory<char>)(object)data[i]).ToArray());

ML.NET creates the Tensor in CastDataAndReturnAsTensor. Previously we used UTF8 there to convert the string to a byte array. UTF8 is not a reliable encoding for arbitrary binary data, as I described in this comment, so I would like to change the encoding to Unicode.
Also, Xiaoyun recently upgraded our TF version in this PR and changed Tensor creation to use string[] instead of byte[][]. In this case we need byte[][], because the input string is itself converted from a binary (protobuf-encoded) buffer.
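
The encoding concern above can be checked in isolation, without ML.NET or TF.NET. The following is a minimal sketch, not the code from this PR: `EncodingRoundTrip`, `Payload`, `ViaEncoding`, and `ViaBase64` are hypothetical names, and `Payload` is a made-up stand-in for `example.ToByteArray()`. It pushes the same bytes through a bytes -> string -> bytes round trip with UTF8, Unicode (UTF-16), and Base64.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingRoundTrip
{
    // Hypothetical stand-in for example.ToByteArray(): arbitrary binary data.
    // 0xFF, 0xFE, and a lone 0x80 are not well-formed UTF-8.
    public static readonly byte[] Payload = { 0x0A, 0xFF, 0xFE, 0x80, 0x01, 0x02 };

    // bytes -> string -> bytes through a text encoding.
    public static byte[] ViaEncoding(Encoding encoding, byte[] data) =>
        encoding.GetBytes(encoding.GetString(data));

    // bytes -> Base64 text -> bytes.
    public static byte[] ViaBase64(byte[] data) =>
        Convert.FromBase64String(Convert.ToBase64String(data));

    public static void Main()
    {
        // UTF8 replaces each invalid byte with U+FFFD, so the original bytes are lost.
        Console.WriteLine(Payload.SequenceEqual(ViaEncoding(Encoding.UTF8, Payload)));    // False

        // UTF-16 survives for THIS payload (even length, no surrogate code units).
        Console.WriteLine(Payload.SequenceEqual(ViaEncoding(Encoding.Unicode, Payload))); // True

        // Base64 is designed for arbitrary bytes and always round-trips.
        Console.WriteLine(Payload.SequenceEqual(ViaBase64(Payload)));                     // True
    }
}
```

Note that the Unicode result holds only for this particular payload. As far as I understand the default `Encoding.Unicode` behavior, an odd-length buffer or bytes that decode to an unpaired surrogate would also be replaced, so Base64 is the only one of the three that is safe for every input.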

@frank-dong-ms-zz

I realized that UTF8 is the default encoding for TensorFlow, so I can't use Unicode encoding here.

@frank-dong-ms-zz frank-dong-ms-zz deleted the frdong/issue-5336 branch October 24, 2020 01:36
@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022