Introduce SHA1 and UUID type parsers #165

Drvi · 2023-05-10T09:51:58Z

Adapted from https://github.com/Drvi/ChunkedCSV.jl/pull/28 and extended to support IO sources.

codecov · 2023-05-10T11:30:11Z

Codecov Report

Patch coverage: 98.55% and project coverage change: +0.89 🎉

Comparison is base (d8995fe) 86.49% compared to head (d25e0c6) 87.39%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #165      +/-   ##
==========================================
+ Coverage   86.49%   87.39%   +0.89%     
==========================================
  Files           9       10       +1     
  Lines        1718     1856     +138     
==========================================
+ Hits         1486     1622     +136     
- Misses        232      234       +2

Impacted Files	Coverage Δ
src/Parsers.jl	`87.29% <ø> (ø)`
src/hexadecimal.jl	`98.55% <98.55%> (ø)`

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

nickrobinson251

Parsing looks good to me

Some question about the intended down-stream use of SHA1

nickrobinson251 · 2023-05-10T11:37:13Z

src/hexadecimal.jl

+
+### SHA1 ###
+
+struct SHA1


Do we expect this SHA1 type to be a public user interface? For example, would CSV return a SHA1 column? (or would we expect CSV to define it's own type that represents SHA1?)

I think we should probably add a docstring.

What i'm wondering is if we need to do to make this a reasonably usable type for downstream users to receive. e.g. I'm wondering if we also define certain common functions (e.g. that UUID suuports).

Like string/print/show: should string(SHA1((1,2,3,4,5))) == "0000000100000002000000030000000400000005"? Should we print SHA1 as SHA1("0000000100000002000000030000000400000005")?

Should we add constructors and convert methods? e.g. SHA1(::String)? or SHA1(::Vector{UInt8}) (so we can construct a SHA1 from SHA.sha1?

Also i think we'd need isless (so we can compare SHA1s) and broadcastable (so we can operate on a column of SHA1s)

So I thought about this and I agree, that we want the string and broadcastable, but additional constructors are a little tricky, because people could expect this to actually calculate the sha1 hash of the input string or bytes. Since this is a storage-only type, I think it also make sense to "show it as you store it" for the show method.

i think my preference would be to match UUID

But i'm happy to leave this as a topic for potential follow-up PRs

src/hexadecimal.jl

nickrobinson251 · 2023-05-10T11:42:16Z

src/hexadecimal.jl

+        pos += 1
+        incr!(source)
+    end
+    check != 0xFFFF || @goto backtrack_error


can these be functions rather than @gotos?

e.g.

Suggested change

check != 0xFFFF || @goto backtrack_error

check != 0xFFFF || return backtrack_error!(source, pos, segment_len, ...)

So I tried a couple things here:
a) current thing where jump to the error handling segment of the code
b) make a @inlined function _backtrack_error(source, pos, len, code, segment_len, hi, lo, pl)
c) make a @noinlined function _backtrack_error(source, pos, len, code, segment_len, hi, lo, pl)

In an micro benchmark where we parse a single valid UUID string, I didn't see any perf difference. But there is a difference in the size of the generated code. For the three versions I ran

@code_native debuginfo=:none syntax=:intel Parsers.typeparser(UUID, collect(codeunits(su)), 1, (length(su)), 0x00, Int16(0), Parsers.PosLen(0,0,false,false), (Parsers.OPTIONS))

And then I very scientifically counted the number of lines generated:
a) 488
b) 705
c) 507

So I think there the original version is at least preferable in terms of codesize? The @noinlined seems to only differ in that it inserts a couple of call instructions which are also just gotos/jumps + some small changes to the stack (if we hit the error, that is). But do let me know if you think that the gains in legibility are worth it.

i personally this the gains in legibility are worth it if there's no time cost

test/hexadecimal.jl

nickrobinson251 · 2023-05-10T11:46:08Z

oh i'm not a member on this repo anymore, probably since i reset my GitHub 2FA -- @quinnj please could you add me back so my approval counts for merging? thanks!

Drvi added 2 commits May 10, 2023 11:51

Introduce SHA1 and UUID type parsers

94bcaeb

Fix fastseek!

8c82dcf

nickrobinson251 approved these changes May 10, 2023

View reviewed changes

Drvi added 2 commits May 11, 2023 11:56

Add a string and broadcastable methods to SHA1

2b3ee73

Replace a goto with a call

d25e0c6

Drvi merged commit 682fa0c into main May 11, 2023

Drvi deleted the td-misc-hexadecimals branch May 11, 2023 11:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce SHA1 and UUID type parsers #165

Introduce SHA1 and UUID type parsers #165

Drvi commented May 10, 2023

codecov bot commented May 10, 2023 •

edited

Loading

nickrobinson251 left a comment

nickrobinson251 May 10, 2023

Drvi May 11, 2023

nickrobinson251 May 11, 2023

nickrobinson251 May 10, 2023

Drvi May 11, 2023

nickrobinson251 May 11, 2023

nickrobinson251 commented May 10, 2023

	check != 0xFFFF \|\| @goto backtrack_error
	check != 0xFFFF \|\| return backtrack_error!(source, pos, segment_len, ...)

Introduce SHA1 and UUID type parsers #165

Introduce SHA1 and UUID type parsers #165

Conversation

Drvi commented May 10, 2023

codecov bot commented May 10, 2023 • edited Loading

Codecov Report

nickrobinson251 left a comment

Choose a reason for hiding this comment

nickrobinson251 May 10, 2023

Choose a reason for hiding this comment

Drvi May 11, 2023

Choose a reason for hiding this comment

nickrobinson251 May 11, 2023

Choose a reason for hiding this comment

nickrobinson251 May 10, 2023

Choose a reason for hiding this comment

Drvi May 11, 2023

Choose a reason for hiding this comment

nickrobinson251 May 11, 2023

Choose a reason for hiding this comment

nickrobinson251 commented May 10, 2023

codecov bot commented May 10, 2023 •

edited

Loading