Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix from_base64 Presto function for inputs without padding #8647

Closed
wants to merge 1 commit into from

Conversation

Joe-Abraham
Copy link
Contributor

Fixes #8646

Copy link

netlify bot commented Feb 2, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 69c9bf0
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66335362b2a31700098de115

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 2, 2024
@Joe-Abraham Joe-Abraham force-pushed the test_base64 branch 2 times, most recently from 6311221 to c4e455e Compare February 5, 2024 09:01
@Joe-Abraham Joe-Abraham force-pushed the test_base64 branch 2 times, most recently from f4e1867 to 8543883 Compare March 11, 2024 05:41
@mbasmanova mbasmanova changed the title Fix padding issue Fix from_base64 Presto function for inputs without padding Mar 14, 2024
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Joe-Abraham Thank you for the fix.

Bae64.h/cpp files were copied from somewhere and therefore do not match Velox coding style. We need to fix this at some point.

It looks like there is no unit test for utilities provided in these files. Would you add velox/common/encode/tests/Base64Test.cpp with tests for the modified functionality?

Would you also clarify whether this change affects any other users of the utilities provided by Base64.h? Somehow, I cannot tell easily from the code.

Finally, please, make yourself familiar with the coding style documented at https://github.com/facebookincubator/velox/blob/main/CODING_STYLE.md and guidelines for writing PR titles and descriptions at https://github.com/facebookincubator/velox/blob/main/CONTRIBUTING.md#code-contribution-process

static inline size_t countPadding(const char* src, size_t len) {
DCHECK_GE(len, 2);
return src[len - 1] != kBase64Pad ? 0 : src[len - 2] != kBase64Pad ? 1 : 2;
size_t padding_count = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: numPadding

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code.

DCHECK_GE(len, 2);
return src[len - 1] != kBase64Pad ? 0 : src[len - 2] != kBase64Pad ? 1 : 2;
size_t padding_count = 0;
while (len > 0 && src[len - 1] == kPadding) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why introduce kPadding? It seems to duplicate existing kBase64Pad.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a mistake on my end, and I have corrected it.

My actual task was to add the presto functions

  1. 'to_base32'(Add presto function 'to_base32' #8652)
  2. 'from_base32'(Add presto function from_base32 and to_base32 #7672).

Since base64 and base32 share a lot of code, I was advised to create a utility class for the common code(#8650) and to clean base64(#8651)

While testing my functionality, I found the decoding in base64 had a bug! (#8646)

I had all these changes in one single PR, and @aditi-pandit suggested to separate out the PRs. while seperating out I missed to use kBase64Pad.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Joe-Abraham Got it. BTW, let's also update documentation for the affected function to clarify that it allows inputs without padding.

velox/common/encode/Base64.cpp Show resolved Hide resolved
if (size % 4 != 0) {
// Check if the input data is padded
if (isPadded(data, size)) {
/// If padded, ensure that the string length is a multiple of the encoded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// -> //

/// used only for comments on public classes, methods, functions in header files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comments

constexpr static char kPadding = '=';

// Size of the encoded block after encoding.
constexpr static int kEncodedBlockSize = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These constants can go into .cpp file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved those variables to .cpp

return needed -
ceil((padding * kBinaryBlockSize) /
static_cast<double>(kEncodedBlockSize));
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop 'else' after 'return'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code

@@ -59,8 +59,7 @@ class Base64 {

/// Returns decoded size for the specified input. Adjusts the 'size' to
/// subtract the length of the padding, if exists.
static size_t
calculateDecodedSize(const char* data, size_t& size, bool withPadding = true);
static size_t calculateDecodedSize(const char* data, size_t& size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment needs updating

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but the comment doesn't explain why 'size' argument is an output argument? Would it be possible to clarify or change the type of 'size' to size_t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrased the comment

@mbasmanova
Copy link
Contributor

@Joe-Abraham This PR is marked as Draft. Is is still a work-in-progress or ready for review?

@Joe-Abraham
Copy link
Contributor Author

@mbasmanova It is WIP and I have a few reworks pending.
I will request for re-review, once I am done.
Hope that sounds good!

@mbasmanova
Copy link
Contributor

@Joe-Abraham Got it. Thank you for clarifying. Sounds good to me.

@Joe-Abraham Joe-Abraham force-pushed the test_base64 branch 6 times, most recently from 8aa6ab0 to c0dd0a8 Compare March 22, 2024 11:11
@Joe-Abraham Joe-Abraham marked this pull request as ready for review March 22, 2024 11:16
@Joe-Abraham
Copy link
Contributor Author

@mbasmanova, Can you please look into the reworks that were done?

@Joe-Abraham
Copy link
Contributor Author

@mbasmanova Can you please review?

@Joe-Abraham Joe-Abraham force-pushed the test_base64 branch 2 times, most recently from 8e1a45a to d30d22d Compare April 25, 2024 04:36
@bikramSingh91
Copy link
Contributor

@Joe-Abraham will review it next week.


// Adjust the needed size for padding
return needed -
ceil((padding * kBinaryBlockSize) /
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious, can we replace this with (size * kBinaryBlockSize) / kEncodedBlockSize; since the size has been updated at L351 ? This would simplify this calculation.

If not, can we replace this with:

((padding * kBinaryBlockSize) + (kEncodedBlockSize - 1)) / kEncodedBlockSize

to simulate the additional ceil and avoid floating point conversions + arithmetic

Also, can you point me to resources about this and maybe add a comment about how this formula takes care of both 32 and 64 bit encodings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bikramSingh91 Updated the code

Copy link
Contributor

@bikramSingh91 bikramSingh91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just two simple nits

@@ -22,6 +22,10 @@

namespace facebook::velox::encoding {

// Constants defining the size of binary and encoded blocks for Base64 encoding.
constexpr static int kBinaryBlockSize = 3; // 3 bytes of binary = 24 bits
constexpr static int kEncodedBlockSize = 4; // 4 bytes of encoded = 24 bits
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

// 4 bytes of encoded = 24 bits

This comment is a bit confusing. How about we make it explicit in the variable name like:
kEncodedBlockByteSize
similarly for kBinaryBlockByteSize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code

/// subtract the length of the padding, if exists.
static size_t
calculateDecodedSize(const char* data, size_t& size, bool withPadding = true);
/// Returns the actual size of the decoded data. Will also remove the padding
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: enclose the variable name in single quotes like
/// length from the input data 'size'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the code

return needed - padding;

// Adjust the needed size for padding
return needed -
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I haven't had the opportunity to review the subsequent PRs, but my initial confusion during the review was due to the intertwining of some code that is specific to base64 and some that applies to both 64 and 32. I had to delve into the details of both to assure myself of their functionality. It would be great if you could annotate/comment on some of these complex calculations in the final version (where this code is reused) to clarify how we arrived at these formulas. This would greatly aid future readers. Thanks

Copy link
Contributor Author

@Joe-Abraham Joe-Abraham May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the comments regarding the need for the calculations we have done here.
The calculation is a bit different in base32 because of the need to handle some exceptional cases of multiple padded lengths. This code couldn't be reused.

@facebook-github-bot
Copy link
Contributor

@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@bikramSingh91 bikramSingh91 added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label May 1, 2024
@bikramSingh91
Copy link
Contributor

@Joe-Abraham can you please address the final nits and update the velox documentation at velox/docs/functions/presto/binary.rst to clarify this behavior. Thanks!

@Joe-Abraham
Copy link
Contributor Author

@bikramSingh91 I am a bit confused about what needs to be added at velox/docs/functions/presto/binary. rst. Could you please help me understand the changes required in this file?

@mbasmanova
Copy link
Contributor

@Joe-Abraham The documentation for from_base64 function says "Decodes binary data from the base64 encoded string." It is not clear whether inputs are required to have padding or not. It would be nice to clarify and perhaps include a few examples.

https://facebookincubator.github.io/velox/functions/presto/binary.html#from_base64

::
[72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]

In these examples, both the padded and non-padded Base64 strings 'SGVsbG8gV29ybGQ=' and 'SGVsbG8gV29ybGQ' decode to the binary representation of the text 'Hello World'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very nice description. Typically, in examples, the results are includes as comments

SELECT from_base64('SGVsbG8gV29ybGQ='); -- [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]

It would be nice to also mention that "partial" padding is supported as well, i.e. if full padding requires ==, a single = is also allowed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it accordingly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. What do you think about

It would be nice to also mention that "partial" padding is supported as well, i.e. if full padding requires ==, a single = is also allowed.

?

Copy link
Contributor Author

@Joe-Abraham Joe-Abraham May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova partial padding is not supported.
For example, YQ== and YQ decode to a, while YQ= throws an error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Joe-Abraham Got it. Thank you for clarifying. Let's mention this in the doc as well for completeness. BTW, does this behavior match Presto?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova The behaviour is exactly like presto. I have updated the documentation

image

@facebook-github-bot
Copy link
Contributor

@bikramSingh91 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@bikramSingh91 merged this pull request in b7bacaf.

Copy link

Conbench analyzed the 1 benchmark run on commit b7bacaf3.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@Joe-Abraham Joe-Abraham deleted the test_base64 branch May 8, 2024 04:28
Joe-Abraham added a commit to Joe-Abraham/velox that referenced this pull request May 8, 2024
…ncubator#8647)

Summary:
Fixes facebookincubator#8646

Pull Request resolved: facebookincubator#8647

Reviewed By: amitkdutta, DanielMunozT

Differential Revision: D56792399

Pulled By: bikramSingh91

fbshipit-source-id: 212acce56c0dd708e1220e229e1943380d4d976b
Joe-Abraham added a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…ncubator#8647)

Summary:
Fixes facebookincubator#8646

Pull Request resolved: facebookincubator#8647

Reviewed By: amitkdutta, DanielMunozT

Differential Revision: D56792399

Pulled By: bikramSingh91

fbshipit-source-id: 212acce56c0dd708e1220e229e1943380d4d976b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall
Projects
None yet
Development

Successfully merging this pull request may close these issues.

from_base64 Presto function isn't decoding the input when the input isn't padded.
4 participants