-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce SHA1 and UUID type parsers #165
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #165 +/- ##
==========================================
+ Coverage 86.49% 87.39% +0.89%
==========================================
Files 9 10 +1
Lines 1718 1856 +138
==========================================
+ Hits 1486 1622 +136
- Misses 232 234 +2
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parsing looks good to me
Some question about the intended down-stream use of SHA1
|
||
### SHA1 ### | ||
|
||
struct SHA1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we expect this SHA1
type to be a public user interface? For example, would CSV return a SHA1 column? (or would we expect CSV to define it's own type that represents SHA1?)
I think we should probably add a docstring.
What i'm wondering is if we need to do to make this a reasonably usable type for downstream users to receive. e.g. I'm wondering if we also define certain common functions (e.g. that UUID
suuports).
Like string
/print
/show
: should string(SHA1((1,2,3,4,5))) == "0000000100000002000000030000000400000005"
? Should we print SHA1 as SHA1("0000000100000002000000030000000400000005")
?
Should we add constructors and convert methods? e.g. SHA1(::String)
? or SHA1(::Vector{UInt8})
(so we can construct a SHA1
from SHA.sha1
?
Also i think we'd need isless
(so we can compare SHA1
s) and broadcastable
(so we can operate on a column of SHA1
s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I thought about this and I agree, that we want the string
and broadcastable
, but additional constructors are a little tricky, because people could expect this to actually calculate the sha1 hash of the input string or bytes. Since this is a storage-only type, I think it also make sense to "show it as you store it" for the show
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think my preference would be to match UUID
But i'm happy to leave this as a topic for potential follow-up PRs
src/hexadecimal.jl
Outdated
pos += 1 | ||
incr!(source) | ||
end | ||
check != 0xFFFF || @goto backtrack_error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can these be functions rather than @goto
s?
e.g.
check != 0xFFFF || @goto backtrack_error | |
check != 0xFFFF || return backtrack_error!(source, pos, segment_len, ...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I tried a couple things here:
a) current thing where jump to the error handling segment of the code
b) make a @inline
d function _backtrack_error(source, pos, len, code, segment_len, hi, lo, pl)
c) make a @noinline
d function _backtrack_error(source, pos, len, code, segment_len, hi, lo, pl)
In an micro benchmark where we parse a single valid UUID string, I didn't see any perf difference. But there is a difference in the size of the generated code. For the three versions I ran
@code_native debuginfo=:none syntax=:intel Parsers.typeparser(UUID, collect(codeunits(su)), 1, (length(su)), 0x00, Int16(0), Parsers.PosLen(0,0,false,false), (Parsers.OPTIONS))
And then I very scientifically counted the number of lines generated:
a) 488
b) 705
c) 507
So I think there the original version is at least preferable in terms of codesize? The @noinline
d seems to only differ in that it inserts a couple of call
instructions which are also just gotos/jumps + some small changes to the stack (if we hit the error, that is). But do let me know if you think that the gains in legibility are worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i personally this the gains in legibility are worth it if there's no time cost
oh i'm not a member on this repo anymore, probably since i reset my GitHub 2FA -- @quinnj please could you add me back so my approval counts for merging? thanks! |
Adapted from https://github.com/Drvi/ChunkedCSV.jl/pull/28 and extended to support
IO
sources.