-
Notifications
You must be signed in to change notification settings - Fork 6.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add function bitSlice for String and FixedString #33360
Conversation
"Illegal type " + arguments[2]->getName() + " of second argument of function " + getName(), | ||
ErrorCodes::ILLEGAL_TYPE_OF_ARGUMENT); | ||
|
||
return std::make_shared<DataTypeString>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why It's always String
and not FixedString
? Is it because length
can be not a constant and each string can have different lenght?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, length
may not be a constant, it will make the implementation simple
source, StringSink(*col_res, input_rows_count), static_cast<size_t>(start_value - 1)); | ||
else if (start_value < 0) | ||
bitSliceFromRightConstantOffsetUnbounded( | ||
source, StringSink(*col_res, input_rows_count), -static_cast<size_t>(start_value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we negate first not to have overflow?
source, StringSink(*col_res, input_rows_count), -static_cast<size_t>(start_value)); | |
source, StringSink(*col_res, input_rows_count), static_cast<size_t>(-start_value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I test both ways are perimt, even if the negate is removed , there will be no overflow problem. I refer the code of subString,I don't know what is the best way to deal with this problem。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the negate also work on size_t, so result is right. I will change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find negate first is a undefined-behavior when value is INT64_MIN.
I test
#include <iostream>
int main() {
ssize_t t = INT64_MIN;
size_t tt = static_cast<size_t>(-t);
std::cout << tt << std::endl;
return 0;
}
compile and run
$ clang++ -fsanitize=undefined -Wall main.cpp -o main && ./main
main.cpp:5:37: runtime error: negation of -9223372036854775808 cannot be represented in type 'ssize_t' (aka 'long'); cast to an unsigned type to negate this value to itself
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior main.cpp:5:37 in
9223372036854775808
src/Functions/bitSlice.cpp
Outdated
const ColumnConst * column_start_const, | ||
const ColumnConst * column_length_const, | ||
Int64 start_value, | ||
Int64 length_value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pass two arguments of type optional<Int64>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I'll change it here
@@ -0,0 +1,111 @@ | |||
SELECT 'Const Offset'; | |||
select 1, subString(bin('Hello'), 1), bin(bitSlice('Hello', 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good to add tests with non const and nullable first argument. Also with nulls in start
and length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
|
||
``` text | ||
┌─bin('Hello')─────────────────────────────┬─bin(bitSlice('Hello', 1, 8))─┐ | ||
│ 0100100001100101011011000110110001101111 │ 01001000 │ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear what happen if slice length isn't multiple of 8. We add zero pad at right, why right, not left?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also what we will do when slice out of bound.
SELECT
number,
bin('' AS e),
bin(bitSlice(e, number, number + 1))
FROM numbers(10)
┌─number─┬─bin('')──────────┬─bin(bitSlice('', number, plus(number, 1)))─┐
│ 0 │ 1111111111111111 │ │
│ 1 │ 1111111111111111 │ 11000000 │
│ 2 │ 1111111111111111 │ 11100000 │
│ 3 │ 1111111111111111 │ 11110000 │
│ 4 │ 1111111111111111 │ 11111000 │
│ 5 │ 1111111111111111 │ 11111100 │
│ 6 │ 1111111111111111 │ 11111110 │
│ 7 │ 1111111111111111 │ 11111111 │
│ 8 │ 1111111111111111 │ 1111111110000000 │
│ 9 │ 1111111111111111 │ 11111111 │
└────────┴──────────────────┴────────────────────────────────────────────┘
Looks unnatural, maybe let's make result length always is a length
argument padded to 8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The minimum unit of string is byte, so we add 0 to the right and fill it with 1byte. I don't know if there is a better way.
why right, not left?
I think most usually expectations for return are valid data at the beginning
for example
select bin(bitSlice(unbin('1111111111111111'), 1, 9));
//right
┌─bin(bitSlice(unbin('1111111111111111'), 1, 9))─┐
│ 1111111110000000 │
└────────────────────────────────────────────────┘
//left
┌─bin(bitSlice(unbin('1111111111111111'), 1, 9))─┐
│ 0000000111111111 │
└────────────────────────────────────────────────┘
I think add zero pad at right will reduce its effect.
Or we specify that the length must be a multiple of 8, But it will bring some restrictions on its use.
Also what we will do when slice out of bound.
SELECT
number,
bin('' AS e),
bin(bitSlice(e, number, number + 1))
FROM numbers(10)
┌─number─┬─bin('')─┬─bin(bitSlice('', number, plus(number, 1)))─┐
│ 0 │ │ │
│ 1 │ │ │
│ 2 │ │ │
│ 3 │ │ │
│ 4 │ │ │
│ 5 │ │ │
│ 6 │ │ │
│ 7 │ │ │
│ 8 │ │ │
│ 9 │ │ │
└────────┴─────────┴────────────────────────────────────────────┘
The test is normal here. I try to find out why
{ | ||
bool left_offset = start > 0; | ||
size_t offset = left_offset ? start - 1 : -start; | ||
size_t offset_byte; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we initialize with some value here (e.g. with result of first branch)? It's dificult to keep that all wariables will be initalized in any branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if else
can clearly compare the differences between the two situations. I separate offset
and length
, now It looks clear.
src/Functions/bitSlice.cpp
Outdated
ColumnPtr column_length; | ||
|
||
if (number_of_arguments == 3) | ||
column_length = arguments[2].column; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's assign column_length
and column_length_const
under same if if (number_of_arguments == 3)
below, because this variables connected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
src/Functions/bitSlice.cpp
Outdated
void bitSliceDynamicOffsetBounded(Source && src, StringSink && sink, const IColumn & offset_column, const IColumn & length_column) const | ||
{ | ||
const bool is_offset_null = offset_column.onlyNull(); | ||
const auto * offset_nullable = typeid_cast<const ColumnNullable *>(&offset_column); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to handle nulls? If yes in what way? According to code below we replace nulls with some defaults. But useDefaultImplementationForNulls
is not set, so nulls rows with nulls in args will result to null.
Also same question for useDefaultImplementationForConstants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bitSlice(s, offset, lenght)
, I think null lenght and offset values should not make result to null. The behavior here is consistent with substring
and arrayslice
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, There is no need to handle nulls. I will remove these code.
Good work! I left some comments. Generally fine, just need to pay attention for corner cases (like when/where we add padding) and add more tests (including some corner-cases like nullable, non-constant arguments at different positions etc.) |
I can't reproduce it on macos and ubuntu.
please check it and add how to reproduce it. |
Oh, sorry, I copied query wrong. I meant query |
@vdimir So you mean it's confusing to have padding and truncate at the same time. but if we don't care truncate, it will return useless data
the function will truncate, only because I refer to substring and arraySlice implement. If you don't think need it, I can remove it. |
@vdimir What do you think? should we make result length depend only on the param length? |
Actually it's the case only when we ran outside of bit range, so I suppose current solution is correct, we do not need add extra zeros from outside. Let's try to mention @asurovenko, maybe he as author of request of feature (#8922) can give advice what is better. Also is there examples how it works in other programming languages? |
I think BitSet is very similar to the scene here, I tried some languages. C++ #include <bitset>
#include <iostream>
using namespace std;
int main() {
std::bitset<4> s;
s.set(3);
cout << s.to_ulong() << endl; // 8
return 0;
} Java import java.util.BitSet;
class T {
public static void main(String[ ] args) {
BitSet s = new BitSet(4);
s.set(3);
System.out.print(s.toByteArray()[0]); // 8
}
} Rust use bit_vec::BitVec;
fn main() {
let mut bv = BitVec::from_elem(4, false);
bv.set(3, true);
println!("{:?}", bv.to_bytes()[0]); // 16
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your work!
Changelog category:
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
bitSlice(s, offset, length)
Returns a substring starting with the bit from the ‘offset’ index that is ‘length’ bits long. bit indexing starts from 1
This closes #28535