Fix from_base64 Presto function for inputs without padding #8647

Joe-Abraham · 2024-02-02T06:08:03Z

netlify · 2024-02-02T06:08:08Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`69c9bf0`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/66335362b2a31700098de115

mbasmanova

@Joe-Abraham Thank you for the fix.

Base64.h/cpp files were copied from somewhere and therefore do not match the Velox coding style. We need to fix this at some point.

It looks like there is no unit test for utilities provided in these files. Would you add velox/common/encode/tests/Base64Test.cpp with tests for the modified functionality?

Would you also clarify whether this change affects any other users of the utilities provided by Base64.h? Somehow, I cannot tell easily from the code.

Finally, please, make yourself familiar with the coding style documented at https://github.com/facebookincubator/velox/blob/main/CODING_STYLE.md and guidelines for writing PR titles and descriptions at https://github.com/facebookincubator/velox/blob/main/CONTRIBUTING.md#code-contribution-process

mbasmanova · 2024-03-14T10:55:29Z

velox/common/encode/Base64.h

  static inline size_t countPadding(const char* src, size_t len) {
-    DCHECK_GE(len, 2);
-    return src[len - 1] != kBase64Pad ? 0 : src[len - 2] != kBase64Pad ? 1 : 2;
+    size_t padding_count = 0;


naming: numPadding

Updated the code.

mbasmanova · 2024-03-14T10:56:52Z

velox/common/encode/Base64.h

-    DCHECK_GE(len, 2);
-    return src[len - 1] != kBase64Pad ? 0 : src[len - 2] != kBase64Pad ? 1 : 2;
+    size_t padding_count = 0;
+    while (len > 0 && src[len - 1] == kPadding) {


Why introduce kPadding? It seems to duplicate existing kBase64Pad.

It was a mistake on my end, and I have corrected it.

My actual task was to add the presto functions

'to_base32'(Add presto function 'to_base32' #8652)

'from_base32'(Add presto function from_base32 and to_base32 #7672).

Since base64 and base32 share a lot of code, I was advised to create a utility class for the common code(#8650) and to clean base64(#8651)

While testing my functionality, I found the decoding in base64 had a bug! (#8646)

I had all these changes in a single PR, and @aditi-pandit suggested splitting it into separate PRs. While separating them, I forgot to use kBase64Pad.

@Joe-Abraham Got it. BTW, let's also update documentation for the affected function to clarify that it allows inputs without padding.
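For readers following the countPadding discussion, the loop in the diff can be sketched as a backward scan over trailing pad characters. This is only a minimal sketch, not the merged Velox code: the free function `countPaddingSketch` is a hypothetical name, and the loop body is an assumption, since the hunk above shows only the loop condition.

```cpp
#include <cstddef>

// Pad character for Base64; same value as kBase64Pad in the thread.
constexpr char kBase64Pad = '=';

// Hypothetical helper: counts trailing pad characters. Unlike the removed
// ternary, it does not require len >= 2, so empty and very short inputs
// are handled without reading out of bounds.
inline std::size_t countPaddingSketch(const char* src, std::size_t len) {
  std::size_t numPadding = 0;
  while (len > 0 && src[len - 1] == kBase64Pad) {
    ++numPadding;
    --len;
  }
  return numPadding;
}
```

The scan stops at the first non-pad character from the end, so an unpadded input simply yields zero.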

mbasmanova · 2024-03-14T10:58:53Z

velox/common/encode/Base64.cpp

-    if (size % 4 != 0) {
+  // Check if the input data is padded
+  if (isPadded(data, size)) {
+    /// If padded, ensure that the string length is a multiple of the encoded


/// -> //

/// used only for comments on public classes, methods, functions in header files.

Updated the comments

mbasmanova · 2024-03-14T10:59:20Z

velox/common/encode/Base64.h

+  constexpr static char kPadding = '=';
+
+  // Size of the encoded block after encoding.
+  constexpr static int kEncodedBlockSize = 4;


These constants can go into .cpp file.

moved those variables to .cpp

mbasmanova · 2024-03-14T10:59:58Z

velox/common/encode/Base64.cpp

+    return needed -
+        ceil((padding * kBinaryBlockSize) /
+             static_cast<double>(kEncodedBlockSize));
+  } else {


drop 'else' after 'return'

Updated the code

mbasmanova · 2024-03-14T11:01:06Z

velox/common/encode/Base64.h

@@ -59,8 +59,7 @@ class Base64 {

  /// Returns decoded size for the specified input. Adjusts the 'size' to
  /// subtract the length of the padding, if exists.
-  static size_t
-  calculateDecodedSize(const char* data, size_t& size, bool withPadding = true);
+  static size_t calculateDecodedSize(const char* data, size_t& size);


comment needs updating

Updated the comment

Thanks, but the comment doesn't explain why the 'size' argument is an output argument. Would it be possible to clarify this, or to change the type of 'size' to size_t?

Rephrased the comment
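To illustrate why 'size' is taken by reference, here is a hedged sketch of a decoded-size calculation that accepts both padded and unpadded inputs. It is an assumption-laden reconstruction from the hunks in this thread, not the code that was merged: `decodedSizeSketch`, `kPad`, and the use of `std::invalid_argument` are hypothetical choices for the example.

```cpp
#include <cstddef>
#include <stdexcept>

constexpr char kPad = '=';
constexpr std::size_t kBinaryBlockByteSize = 3;  // bytes per decoded block
constexpr std::size_t kEncodedBlockByteSize = 4; // chars per encoded block

// Hypothetical sketch. Returns the decoded size and, for padded input,
// shrinks 'size' so that it no longer counts the trailing pad characters.
inline std::size_t decodedSizeSketch(const char* data, std::size_t& size) {
  if (size == 0) {
    return 0;
  }
  if (data[size - 1] == kPad) {
    // Padded input must be a whole number of encoded blocks.
    if (size % kEncodedBlockByteSize != 0) {
      throw std::invalid_argument("padded input length must be a multiple of 4");
    }
    const std::size_t needed =
        (size * kBinaryBlockByteSize) / kEncodedBlockByteSize;
    std::size_t padding = 0;
    while (size > 0 && data[size - 1] == kPad) {
      ++padding;
      --size; // the caller sees the length without padding
    }
    // Integer ceiling of padding * 3 / 4.
    return needed -
        (padding * kBinaryBlockByteSize + kEncodedBlockByteSize - 1) /
        kEncodedBlockByteSize;
  }
  // Unpadded input: a single leftover character cannot encode a byte.
  const std::size_t extra = size % kEncodedBlockByteSize;
  if (extra == 1) {
    throw std::invalid_argument("invalid unpadded input length");
  }
  return (size / kEncodedBlockByteSize) * kBinaryBlockByteSize +
      (extra * kBinaryBlockByteSize) / kEncodedBlockByteSize;
}
```

In this sketch, a caller passing "YQ==" with size 4 gets back a decoded size of 1 and finds 'size' reduced to 2, which is the side effect the header comment needs to spell out.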

mbasmanova · 2024-03-21T09:16:03Z

@Joe-Abraham This PR is marked as Draft. Is it still a work-in-progress or ready for review?

Joe-Abraham · 2024-03-21T09:34:28Z

@mbasmanova It is WIP and I have a few reworks pending.
I will request a re-review once I am done.
Hope that sounds good!

mbasmanova · 2024-03-21T09:40:32Z

@Joe-Abraham Got it. Thank you for clarifying. Sounds good to me.

Joe-Abraham · 2024-03-22T11:17:57Z

@mbasmanova, Can you please look into the reworks that were done?

Joe-Abraham · 2024-04-22T06:10:20Z

@mbasmanova Can you please review?

bikramSingh91 · 2024-04-27T00:26:20Z

@Joe-Abraham will review it next week.

bikramSingh91 · 2024-04-26T23:45:50Z

velox/common/encode/Base64.cpp

+
+    // Adjust the needed size for padding
+    return needed -
+        ceil((padding * kBinaryBlockSize) /


Just curious, can we replace this with (size * kBinaryBlockSize) / kEncodedBlockSize, since the size has been updated at L351? This would simplify the calculation.

If not, can we replace this with:

((padding * kBinaryBlockSize) + (kEncodedBlockSize - 1)) / kEncodedBlockSize

to simulate the additional ceil and avoid floating point conversions + arithmetic

Also, can you point me to resources about this and maybe add a comment about how this formula takes care of both 32 and 64 bit encodings?

@bikramSingh91 Updated the code
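The integer form suggested above can be checked against the floating-point version with a small sketch. The function names `paddingBytesFloat` and `paddingBytesInt` are hypothetical and exist only for this comparison; the constants use the values from the diff.

```cpp
#include <cmath>
#include <cstddef>

constexpr std::size_t kBinaryBlockSize = 3;
constexpr std::size_t kEncodedBlockSize = 4;

// Original form: floating-point division followed by ceil.
inline std::size_t paddingBytesFloat(std::size_t padding) {
  return static_cast<std::size_t>(std::ceil(
      (padding * kBinaryBlockSize) / static_cast<double>(kEncodedBlockSize)));
}

// Suggested form: integer ceiling division, ceil(a / b) == (a + b - 1) / b
// for non-negative a and positive b.
inline std::size_t paddingBytesInt(std::size_t padding) {
  return (padding * kBinaryBlockSize + (kEncodedBlockSize - 1)) /
      kEncodedBlockSize;
}
```

For Base64 the padding count is 0, 1, or 2, and both forms give 0, 1, and 2 bytes respectively, so the integer version avoids the double conversion without changing the result.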

bikramSingh91

Looks good, just two simple nits

bikramSingh91 · 2024-04-30T21:27:19Z

velox/common/encode/Base64.cpp

@@ -22,6 +22,10 @@

 namespace facebook::velox::encoding {

+// Constants defining the size of binary and encoded blocks for Base64 encoding.
+constexpr static int kBinaryBlockSize = 3; // 3 bytes of binary = 24 bits
+constexpr static int kEncodedBlockSize = 4; // 4 bytes of encoded = 24 bits


nit:

// 4 bytes of encoded = 24 bits

This comment is a bit confusing. How about we make it explicit in the variable name like:
kEncodedBlockByteSize
similarly for kBinaryBlockByteSize

Updated the code

bikramSingh91 · 2024-04-30T21:28:45Z

velox/common/encode/Base64.h

-  /// subtract the length of the padding, if exists.
-  static size_t
-  calculateDecodedSize(const char* data, size_t& size, bool withPadding = true);
+  /// Returns the actual size of the decoded data. Will also remove the padding


nit: enclose the variable name in single quotes like
/// length from the input data 'size'.

Updated the code

bikramSingh91 · 2024-04-30T21:37:32Z

velox/common/encode/Base64.cpp

-    return needed - padding;
+
+    // Adjust the needed size for padding
+    return needed -


nit: I haven't had the opportunity to review the subsequent PRs, but my initial confusion during the review was due to the intertwining of some code that is specific to base64 and some that applies to both 64 and 32. I had to delve into the details of both to assure myself of their functionality. It would be great if you could annotate/comment on some of these complex calculations in the final version (where this code is reused) to clarify how we arrived at these formulas. This would greatly aid future readers. Thanks

I have updated the comments regarding the need for the calculations we have done here.
The calculation is a bit different in base32 because of the need to handle some exceptional cases of multiple padded lengths. This code couldn't be reused.

facebook-github-bot · 2024-04-30T21:38:27Z

@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

bikramSingh91 · 2024-05-01T23:41:56Z

@Joe-Abraham can you please address the final nits and update the velox documentation at velox/docs/functions/presto/binary.rst to clarify this behavior. Thanks!

Joe-Abraham · 2024-05-02T05:25:28Z

@bikramSingh91 I am a bit confused about what needs to be added at velox/docs/functions/presto/binary.rst. Could you please help me understand the changes required in this file?

mbasmanova · 2024-05-02T05:57:28Z

@Joe-Abraham The documentation for from_base64 function says "Decodes binary data from the base64 encoded string." It is not clear whether inputs are required to have padding or not. It would be nice to clarify and perhaps include a few examples.

https://facebookincubator.github.io/velox/functions/presto/binary.html#from_base64

mbasmanova · 2024-05-02T07:05:18Z

velox/docs/functions/presto/binary.rst

+    ::
+        [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]
+
+    In these examples, both the padded and non-padded Base64 strings 'SGVsbG8gV29ybGQ=' and 'SGVsbG8gV29ybGQ' decode to the binary representation of the text 'Hello World'.


This is a very nice description. Typically, in examples, the results are included as comments:

SELECT from_base64('SGVsbG8gV29ybGQ='); -- [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]

It would be nice to also mention that "partial" padding is supported as well, i.e. if full padding requires ==, a single = is also allowed.

Updated it accordingly

Thanks. What do you think about

It would be nice to also mention that "partial" padding is supported as well, i.e. if full padding requires ==, a single = is also allowed.

?

@mbasmanova partial padding is not supported.
For example, YQ== and YQ decode to a, while YQ= throws an error.

@Joe-Abraham Got it. Thank you for clarifying. Let's mention this in the doc as well for completeness. BTW, does this behavior match Presto?

@mbasmanova The behaviour is exactly like presto. I have updated the documentation
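The padding rule described in this exchange (full padding or none, never partial) can be sketched as a length check. This is a hedged illustration assuming C++17, with a hypothetical helper name `hasAcceptedPadding`; it looks only at length and trailing '=' and does not validate the Base64 alphabet, and it is not taken from the Velox or Presto sources.

```cpp
#include <string_view>

// Hypothetical check mirroring the rule discussed above:
//  - padded input is accepted only if its length is a multiple of 4;
//  - unpadded input is accepted unless exactly one character is left over,
//    because a single Base64 character carries only 6 bits.
inline bool hasAcceptedPadding(std::string_view input) {
  if (input.empty()) {
    return true;
  }
  if (input.back() == '=') {
    return input.size() % 4 == 0;
  }
  return input.size() % 4 != 1;
}
```

Under these assumptions, "YQ==" and "YQ" pass while "YQ=" is rejected, matching the examples given in the comment above.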

facebook-github-bot · 2024-05-02T18:20:36Z

@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-05-06T21:50:44Z

@bikramSingh91 merged this pull request in b7bacaf.

conbench-facebook · 2024-05-06T22:13:16Z

Conbench analyzed the 1 benchmark run on commit b7bacaf3.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

…ncubator#8647) Summary: Fixes facebookincubator#8646 Pull Request resolved: facebookincubator#8647 Reviewed By: amitkdutta, DanielMunozT Differential Revision: D56792399 Pulled By: bikramSingh91 fbshipit-source-id: 212acce56c0dd708e1220e229e1943380d4d976b

facebook-github-bot added the CLA Signed label Feb 2, 2024

This was referenced Feb 2, 2024

Refactor presto function base64 to make use of the new utility file #8651

Draft

Add presto function from_base32 and to_base32 #7672

Open

Joe-Abraham force-pushed the test_base64 branch 2 times, most recently from 6311221 to c4e455e Compare February 5, 2024 09:01

Joe-Abraham force-pushed the test_base64 branch 2 times, most recently from f4e1867 to 8543883 Compare March 11, 2024 05:41

Joe-Abraham mentioned this pull request Mar 14, 2024

from_base64 Presto function isn't decoding the input when the input isn't padded. #8646

Closed

mbasmanova changed the title ~~Fix padding issue~~ Fix from_base64 Presto function for inputs without padding Mar 14, 2024

mbasmanova reviewed Mar 14, 2024

Joe-Abraham force-pushed the test_base64 branch from 8543883 to 2213f42 Compare March 15, 2024 06:59

Joe-Abraham marked this pull request as draft March 15, 2024 08:01

Joe-Abraham force-pushed the test_base64 branch 4 times, most recently from 34ad23c to 0cbcde9 Compare March 21, 2024 08:27

Joe-Abraham force-pushed the test_base64 branch 6 times, most recently from 8aa6ab0 to c0dd0a8 Compare March 22, 2024 11:11

Joe-Abraham requested a review from mbasmanova March 22, 2024 11:16

Joe-Abraham marked this pull request as ready for review March 22, 2024 11:16

Joe-Abraham force-pushed the test_base64 branch from c0dd0a8 to 090e5c8 Compare March 25, 2024 09:09

Joe-Abraham requested a review from mbasmanova April 15, 2024 06:09

Joe-Abraham force-pushed the test_base64 branch 2 times, most recently from 8e1a45a to d30d22d Compare April 25, 2024 04:36

Joe-Abraham force-pushed the test_base64 branch from d30d22d to a87ed54 Compare April 29, 2024 09:16

bikramSingh91 reviewed Apr 29, 2024

Joe-Abraham force-pushed the test_base64 branch from a87ed54 to d4ad413 Compare April 30, 2024 09:39

bikramSingh91 approved these changes Apr 30, 2024

bikramSingh91 reviewed Apr 30, 2024

bikramSingh91 added the ready-to-merge label May 1, 2024

Joe-Abraham force-pushed the test_base64 branch from d4ad413 to 42da1c7 Compare May 2, 2024 05:19

Joe-Abraham force-pushed the test_base64 branch from 42da1c7 to afa2311 Compare May 2, 2024 06:57

mbasmanova reviewed May 2, 2024

Joe-Abraham force-pushed the test_base64 branch from afa2311 to ee8c4f4 Compare May 2, 2024 07:19

Fix from_base64 Presto function for inputs without padding

69c9bf0

Joe-Abraham force-pushed the test_base64 branch from d1b6ad2 to 69c9bf0 Compare May 2, 2024 08:48

facebook-github-bot closed this in b7bacaf May 6, 2024

facebook-github-bot added the Merged label May 6, 2024

Joe-Abraham deleted the test_base64 branch May 8, 2024 04:28
