Skip to content

Conversation

@chrisdickinson
Copy link
Contributor

WIP. Opening this now to double check that this is headed in the right direction.

This is based on utf-8-validate. The primary differences are the use of clz (vs. a lookup table) to compute the number of extra bytes and the introduction of the error/glyph callbacks.

The Utf8Consume function will iterate over valid glyphs, calling a provided OnGlyph callback and an OnError callback as necessary. Provided error strategies are "Halt" and "Skip" – which either halts the consumer at the first error, or skips past them as appropriate.

The strategies were added to accommodate the desire to build a utf8-to-utf16 translator as part of this work.

src/util.cc Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: no C-style casts (here and elsewhere.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Sorry about that, missed while porting.

src/util.cc Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0xC2? Shouldn't that be 0xC0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. I'll look into that – the original version has it as 0xC2.

@piscisaureus
Copy link
Contributor

The msvc "equivalent" is _BitScanReverse. https://github.com/libuv/libuv/blob/3346082132dc2ff809dfd5d25d451c0b33905f53/src/win/tty.c#L1442

Edit: you already knew :) sorry

src/util.cc Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you want here can be implemented portable and reasonably efficient using the following:

inline uint32_t log2(uint8_t v) {
  const uint32_t r = (v > 15) << 2;
  v >>= r;
  const uint32_t s = (v > 3) << 1;
  v >>= s;
  v >>= 1;
  return r | s | v;
}

inline uint32_t clz(uint8_t v) {
  // clz(0) == 7.  Add a zero check if that's an issue.
  return 7 - log2(v);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to mention, the behavior of static_cast<int>(...) << 24 is implementation-defined on platforms where ints are 32 bits (i.e. all of them.) You're not allowed to shift values into the sign bit.

src/util.cc Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this mask off the high bits? Also, when is extrabytes == 5 for valid UTF-8?

@mscdex mscdex added the c++ Issues and PRs that require attention from people who are familiar with C++. label Apr 1, 2015
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real Programmers(TM) write this as (glyph & 0x7800) != 0x5800 :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real Programmers(TM) write this as (glyph & 0x7800) != 0x5800 :-)

Compilers are pretty good in replacing idiomatic range comparisons with bit hacks. Have you checked whether it's really necessary to have these ... ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Har, har.

(But it's true that both clang and gcc manage to pull it off.)

@jbergstroem
Copy link
Member

Now that #1199 landed -- could we perhaps add unit tests for this?

@chrisdickinson
Copy link
Contributor Author

Nixed clz support, since a quick benchmark showed that @bnoordhuis' implementation was faster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style issue: the first argument goes on the same line as the function name and the other arguments should line up below it. The only time that's deviated from is when the 80 column limit is exceeded.

@bnoordhuis
Copy link
Member

I second @jbergstroem's sentiment on unit tests. :-)

@petkaantonov
Copy link
Contributor

Also always interesting to run on an utf-8 decoder is the utf8 decoder stress test

@brendanashworth brendanashworth added the wip Issues and PRs that are still a work in progress. label May 11, 2015
@Fishrock123
Copy link
Contributor

So I think I heard at some point v8 has a version of this for TypedArrays? Am I mistaken or will this soon be obsolete?

@Fishrock123
Copy link
Contributor

@trevnorris does TypedArrays give us any sort of this functionality?

@trevnorris
Copy link
Contributor

Since this isn't used anywhere not even sure what it's for, but typed arrays don't have anything like this afaik.

@jasnell
Copy link
Member

jasnell commented Oct 22, 2015

@chrisdickinson @trevnorris ... what the status on this one?

@jasnell jasnell added the stalled Issues and PRs that are stalled. label Nov 16, 2015
@chrisdickinson
Copy link
Contributor Author

Closing this due to inactivity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c++ Issues and PRs that require attention from people who are familiar with C++. stalled Issues and PRs that are stalled. wip Issues and PRs that are still a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants