Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce SHA1 and UUID type parsers #165

Merged
merged 4 commits into from
May 11, 2023
Merged

Introduce SHA1 and UUID type parsers #165

merged 4 commits into from
May 11, 2023

Conversation

Drvi
Copy link
Collaborator

@Drvi Drvi commented May 10, 2023

Adapted from https://github.com/Drvi/ChunkedCSV.jl/pull/28 and extended to support IO sources.

@codecov
Copy link

codecov bot commented May 10, 2023

Codecov Report

Patch coverage: 98.55% and project coverage change: +0.89 🎉

Comparison is base (d8995fe) 86.49% compared to head (d25e0c6) 87.39%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #165      +/-   ##
==========================================
+ Coverage   86.49%   87.39%   +0.89%     
==========================================
  Files           9       10       +1     
  Lines        1718     1856     +138     
==========================================
+ Hits         1486     1622     +136     
- Misses        232      234       +2     
Impacted Files Coverage Δ
src/Parsers.jl 87.29% <ø> (ø)
src/hexadecimal.jl 98.55% <98.55%> (ø)

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Collaborator

@nickrobinson251 nickrobinson251 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parsing looks good to me

Some question about the intended down-stream use of SHA1


### SHA1 ###

struct SHA1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect this SHA1 type to be a public user interface? For example, would CSV return a SHA1 column? (or would we expect CSV to define it's own type that represents SHA1?)

I think we should probably add a docstring.

What i'm wondering is if we need to do to make this a reasonably usable type for downstream users to receive. e.g. I'm wondering if we also define certain common functions (e.g. that UUID suuports).

Like string/print/show: should string(SHA1((1,2,3,4,5))) == "0000000100000002000000030000000400000005"? Should we print SHA1 as SHA1("0000000100000002000000030000000400000005")?

Should we add constructors and convert methods? e.g. SHA1(::String)? or SHA1(::Vector{UInt8}) (so we can construct a SHA1 from SHA.sha1?

Also i think we'd need isless (so we can compare SHA1s) and broadcastable (so we can operate on a column of SHA1s)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I thought about this and I agree, that we want the string and broadcastable, but additional constructors are a little tricky, because people could expect this to actually calculate the sha1 hash of the input string or bytes. Since this is a storage-only type, I think it also make sense to "show it as you store it" for the show method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think my preference would be to match UUID

But i'm happy to leave this as a topic for potential follow-up PRs

src/hexadecimal.jl Show resolved Hide resolved
pos += 1
incr!(source)
end
check != 0xFFFF || @goto backtrack_error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can these be functions rather than @gotos?

e.g.

Suggested change
check != 0xFFFF || @goto backtrack_error
check != 0xFFFF || return backtrack_error!(source, pos, segment_len, ...)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I tried a couple things here:
a) current thing where jump to the error handling segment of the code
b) make a @inlined function _backtrack_error(source, pos, len, code, segment_len, hi, lo, pl)
c) make a @noinlined function _backtrack_error(source, pos, len, code, segment_len, hi, lo, pl)

In an micro benchmark where we parse a single valid UUID string, I didn't see any perf difference. But there is a difference in the size of the generated code. For the three versions I ran

@code_native debuginfo=:none syntax=:intel Parsers.typeparser(UUID, collect(codeunits(su)), 1, (length(su)), 0x00, Int16(0), Parsers.PosLen(0,0,false,false), (Parsers.OPTIONS))

And then I very scientifically counted the number of lines generated:
a) 488
b) 705
c) 507

So I think there the original version is at least preferable in terms of codesize? The @noinlined seems to only differ in that it inserts a couple of call instructions which are also just gotos/jumps + some small changes to the stack (if we hit the error, that is). But do let me know if you think that the gains in legibility are worth it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i personally this the gains in legibility are worth it if there's no time cost

test/hexadecimal.jl Show resolved Hide resolved
@nickrobinson251
Copy link
Collaborator

oh i'm not a member on this repo anymore, probably since i reset my GitHub 2FA -- @quinnj please could you add me back so my approval counts for merging? thanks!

@Drvi Drvi merged commit 682fa0c into main May 11, 2023
@Drvi Drvi deleted the td-misc-hexadecimals branch May 11, 2023 11:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants