-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix from_base64 Presto function for inputs without padding #8647
Conversation
✅ Deploy Preview for meta-velox canceled.
|
6311221
to
c4e455e
Compare
f4e1867
to
8543883
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Joe-Abraham Thank you for the fix.
Bae64.h/cpp files were copied from somewhere and therefore do not match Velox coding style. We need to fix this at some point.
It looks like there is no unit test for utilities provided in these files. Would you add velox/common/encode/tests/Base64Test.cpp with tests for the modified functionality?
Would you also clarify whether this change affects any other users of the utilities provided by Base64.h? Somehow, I cannot tell easily from the code.
Finally, please, make yourself familiar with the coding style documented at https://github.com/facebookincubator/velox/blob/main/CODING_STYLE.md and guidelines for writing PR titles and descriptions at https://github.com/facebookincubator/velox/blob/main/CONTRIBUTING.md#code-contribution-process
velox/common/encode/Base64.h
Outdated
static inline size_t countPadding(const char* src, size_t len) { | ||
DCHECK_GE(len, 2); | ||
return src[len - 1] != kBase64Pad ? 0 : src[len - 2] != kBase64Pad ? 1 : 2; | ||
size_t padding_count = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: numPadding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the code.
velox/common/encode/Base64.h
Outdated
DCHECK_GE(len, 2); | ||
return src[len - 1] != kBase64Pad ? 0 : src[len - 2] != kBase64Pad ? 1 : 2; | ||
size_t padding_count = 0; | ||
while (len > 0 && src[len - 1] == kPadding) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why introduce kPadding? It seems to duplicate existing kBase64Pad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a mistake on my end, and I have corrected it.
My actual task was to add the presto functions
- 'to_base32'(Add presto function 'to_base32' #8652)
- 'from_base32'(Add presto function
from_base32
andto_base32
#7672).
Since base64 and base32 share a lot of code, I was advised to create a utility class for the common code(#8650) and to clean base64(#8651)
While testing my functionality, I found the decoding in base64 had a bug! (#8646)
I had all these changes in one single PR, and @aditi-pandit suggested to separate out the PRs. while seperating out I missed to use kBase64Pad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Joe-Abraham Got it. BTW, let's also update documentation for the affected function to clarify that it allows inputs without padding.
velox/common/encode/Base64.cpp
Outdated
if (size % 4 != 0) { | ||
// Check if the input data is padded | ||
if (isPadded(data, size)) { | ||
/// If padded, ensure that the string length is a multiple of the encoded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// -> //
/// used only for comments on public classes, methods, functions in header files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comments
velox/common/encode/Base64.h
Outdated
constexpr static char kPadding = '='; | ||
|
||
// Size of the encoded block after encoding. | ||
constexpr static int kEncodedBlockSize = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These constants can go into .cpp file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved those variables to .cpp
velox/common/encode/Base64.cpp
Outdated
return needed - | ||
ceil((padding * kBinaryBlockSize) / | ||
static_cast<double>(kEncodedBlockSize)); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop 'else' after 'return'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the code
@@ -59,8 +59,7 @@ class Base64 { | |||
|
|||
/// Returns decoded size for the specified input. Adjusts the 'size' to | |||
/// subtract the length of the padding, if exists. | |||
static size_t | |||
calculateDecodedSize(const char* data, size_t& size, bool withPadding = true); | |||
static size_t calculateDecodedSize(const char* data, size_t& size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment needs updating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, but the comment doesn't explain why 'size' argument is an output argument? Would it be possible to clarify or change the type of 'size' to size_t?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rephrased the comment
8543883
to
2213f42
Compare
34ad23c
to
0cbcde9
Compare
@Joe-Abraham This PR is marked as Draft. Is is still a work-in-progress or ready for review? |
@mbasmanova It is WIP and I have a few reworks pending. |
@Joe-Abraham Got it. Thank you for clarifying. Sounds good to me. |
8aa6ab0
to
c0dd0a8
Compare
@mbasmanova, Can you please look into the reworks that were done? |
c0dd0a8
to
090e5c8
Compare
@mbasmanova Can you please review? |
8e1a45a
to
d30d22d
Compare
@Joe-Abraham will review it next week. |
d30d22d
to
a87ed54
Compare
velox/common/encode/Base64.cpp
Outdated
|
||
// Adjust the needed size for padding | ||
return needed - | ||
ceil((padding * kBinaryBlockSize) / |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just curious, can we replace this with (size * kBinaryBlockSize) / kEncodedBlockSize;
since the size has been updated at L351 ? This would simplify this calculation.
If not, can we replace this with:
((padding * kBinaryBlockSize) + (kEncodedBlockSize - 1)) / kEncodedBlockSize
to simulate the additional ceil and avoid floating point conversions + arithmetic
Also, can you point me to resources about this and maybe add a comment about how this formula takes care of both 32 and 64 bit encodings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bikramSingh91 Updated the code
a87ed54
to
d4ad413
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just two simple nits
velox/common/encode/Base64.cpp
Outdated
@@ -22,6 +22,10 @@ | |||
|
|||
namespace facebook::velox::encoding { | |||
|
|||
// Constants defining the size of binary and encoded blocks for Base64 encoding. | |||
constexpr static int kBinaryBlockSize = 3; // 3 bytes of binary = 24 bits | |||
constexpr static int kEncodedBlockSize = 4; // 4 bytes of encoded = 24 bits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
// 4 bytes of encoded = 24 bits
This comment is a bit confusing. How about we make it explicit in the variable name like:
kEncodedBlockByteSize
similarly for kBinaryBlockByteSize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the code
/// subtract the length of the padding, if exists. | ||
static size_t | ||
calculateDecodedSize(const char* data, size_t& size, bool withPadding = true); | ||
/// Returns the actual size of the decoded data. Will also remove the padding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: enclose the variable name in single quotes like
/// length from the input data 'size'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the code
return needed - padding; | ||
|
||
// Adjust the needed size for padding | ||
return needed - |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I haven't had the opportunity to review the subsequent PRs, but my initial confusion during the review was due to the intertwining of some code that is specific to base64 and some that applies to both 64 and 32. I had to delve into the details of both to assure myself of their functionality. It would be great if you could annotate/comment on some of these complex calculations in the final version (where this code is reused) to clarify how we arrived at these formulas. This would greatly aid future readers. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the comments regarding the need for the calculations we have done here.
The calculation is a bit different in base32 because of the need to handle some exceptional cases of multiple padded lengths. This code couldn't be reused.
@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@Joe-Abraham can you please address the final nits and update the velox documentation at |
@bikramSingh91 I am a bit confused about what needs to be added at |
@Joe-Abraham The documentation for from_base64 function says "Decodes binary data from the base64 encoded string." It is not clear whether inputs are required to have padding or not. It would be nice to clarify and perhaps include a few examples. https://facebookincubator.github.io/velox/functions/presto/binary.html#from_base64 |
:: | ||
[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100] | ||
|
||
In these examples, both the padded and non-padded Base64 strings 'SGVsbG8gV29ybGQ=' and 'SGVsbG8gV29ybGQ' decode to the binary representation of the text 'Hello World'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very nice description. Typically, in examples, the results are includes as comments
SELECT from_base64('SGVsbG8gV29ybGQ='); -- [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
It would be nice to also mention that "partial" padding is supported as well, i.e. if full padding requires ==, a single = is also allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated it accordingly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. What do you think about
It would be nice to also mention that "partial" padding is supported as well, i.e. if full padding requires ==, a single = is also allowed.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova partial padding is not supported.
For example, YQ==
and YQ
decode to a
, while YQ=
throws an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Joe-Abraham Got it. Thank you for clarifying. Let's mention this in the doc as well for completeness. BTW, does this behavior match Presto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova The behaviour is exactly like presto. I have updated the documentation
@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@bikramSingh91 merged this pull request in b7bacaf. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…ncubator#8647) Summary: Fixes facebookincubator#8646 Pull Request resolved: facebookincubator#8647 Reviewed By: amitkdutta, DanielMunozT Differential Revision: D56792399 Pulled By: bikramSingh91 fbshipit-source-id: 212acce56c0dd708e1220e229e1943380d4d976b
…ncubator#8647) Summary: Fixes facebookincubator#8646 Pull Request resolved: facebookincubator#8647 Reviewed By: amitkdutta, DanielMunozT Differential Revision: D56792399 Pulled By: bikramSingh91 fbshipit-source-id: 212acce56c0dd708e1220e229e1943380d4d976b
Fixes #8646