Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buffer: add isAscii method #46046

Merged
merged 1 commit into from
Jan 20, 2023
Merged

Conversation

anonrig
Copy link
Member

@anonrig anonrig commented Jan 1, 2023

This is the second pull request on my node::encoding proposal.

This pull request will expose require('buffer').isAscii to validate if a buffer contains ascii-encoded data. For example: require('buffer').isAscii((new TextEncoder()).encode('Yağız')) will return false, since ğ is not ascii/latin1 but UTF-8.

I'm adding thenotable-change label due to adding a new API.

cc @nodejs/buffer

@anonrig anonrig added buffer Issues and PRs related to the buffer subsystem. notable-change PRs with changes that should be highlighted in changelogs. labels Jan 1, 2023
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. labels Jan 1, 2023
@mscdex
Copy link
Contributor

mscdex commented Jan 1, 2023

Is this really checking for valid ASCII (<=0x7F) or something else? That should help determine the name. There's really no way to validate a single byte encoding that uses all 256 byte values (e.g. latin1).

@anonrig
Copy link
Member Author

anonrig commented Jan 1, 2023

Is this really checking for valid ASCII (<=0x7F) or something else? That should help determine the name. There's really no way to validate a single byte encoding that uses all 256 byte values (e.g. latin1).

@mscdex It is literally checking for valid ASCII (<=0x7F), referencing the simdutf repository. Happy new year btw!

@marco-ippolito
Copy link
Member

marco-ippolito commented Jan 1, 2023

Since basically wrapping simdutf::validate_ascii and the documentation states

return true if and only if the string is valid ASCII

I'd go with isAscii

lib/buffer.js Outdated Show resolved Hide resolved
doc/api/buffer.md Outdated Show resolved Hide resolved
@addaleax addaleax added the semver-minor PRs that contain new features and should be released in the next minor version. label Jan 2, 2023
@addaleax
Copy link
Member

addaleax commented Jan 2, 2023

Since we can't have polls in Github, please comment if you think isLatin1 is a better name.

Maybe this is pointing out something obvious, but isLatin1 is a bad name because it would be plain wrong.

@tniessen
Copy link
Member

tniessen commented Jan 3, 2023

After reading through #45823, I still don't really understand the motivation for this API. IIRC isUtf8 is commonly used and helpful for undici, but is the same true for ASCII?

I guess most of the times I implement a check for a subset of ASCII, it's a strict subset, so most of my implementations would still have to iterate over the string.

@anonrig
Copy link
Member Author

anonrig commented Jan 3, 2023

After reading through #45823, I still don't really understand the motivation for this API. IIRC isUtf8 is commonly used and helpful for undici, but is the same true for ASCII?

@tniessen There are lots of packages that checks & validates ascii only input by either iterating through all inputs, or by using regexp. My reasoning for adding this is to provide a fast & native solution to these libraries.

Ideally, I'd put this to node:encoding but my proposal received lots of feedback for not doing it, and therefore, here I am adding it to node:buffer

I guess most of the times I implement a check for a subset of ASCII, it's a strict subset, so most of my implementations would still have to iterate over the string.

You are right, but with this pull request, we will have isUtf8, isAscii and probably in the future isLatin1 too.

@anonrig anonrig added the review wanted PRs that need reviews. label Jan 3, 2023
@mscdex
Copy link
Contributor

mscdex commented Jan 3, 2023

probably in the future isLatin1 too

There is no way to validate this. Latin1 is a single byte encoding with values from 0x00 to 0xFF.

@anonrig
Copy link
Member Author

anonrig commented Jan 3, 2023

probably in the future isLatin1 too

There is no way to validate this. Latin1 is a single byte encoding with values from 0x00 to 0xFF.

Why is that? From my understanding:

  • ASCII: characters from 0x00 to 0x7F (0 to 128)
  • Latin1 (Extended ASCII): All single-byte characters from 0x00 to 0xFF (0 to 255)
  • UTF-8: All characters that can be represented by one to four one-byte code units.

If we are aiming for latin1-only, we can easily calculate it by checking if it contains only single-byte characters and is within the range of 0x00 to 0xFF.

@aduh95
Copy link
Contributor

aduh95 commented Jan 3, 2023

If we are aiming for latin1-only, we can easily calculate it by checking if it contains only single-byte characters and is within the range of 0x00 to 0xFF.

Here is a candidate implementation:

function isLatin1() {
  return true;
}

A single-byte value is always within the range of 0 to 255, so if there are no invalid value for latin1, a validation method would not be very useful.

@anonrig
Copy link
Member Author

anonrig commented Jan 3, 2023

A single-byte value is always within the range of 0 to 255, so if there are no invalid value for latin1, a validation method would not be very useful.

Am I missing something? I think you are looking from an ArrayBuffer perspective, whereas I'm looking from a string input perspective.

Naive way of understanding if it's a valid latin1 character is:

  1. Iterate through each character of the string input
  2. Check the byte length of each character
  3. If byte length is bigger than 1, return false.
  4. Return true, after the exit of the iteration.

For example, "Yağiz" is not latin1, because "ğ" has a representation of 2 bytes.

If you don't have a string input, but have an ArrayBuffer input, you have to check trailing code points, if multiple bytes are used to construct an unicode character. An example implementation a.k.a simdutf::count_utf8 is available on: https://github.com/simdutf/simdutf/blob/master/src/scalar/utf8.h#L159

@aduh95
Copy link
Contributor

aduh95 commented Jan 3, 2023

Am I missing something? I think you are looking from an ArrayBuffer perspective, whereas I'm looking from a string input perspective.

Yeah I'm looking from an ArrayBuffer perspective, because we're talking about adding the method to Buffer. ECMAScript strings are UTF-16 encoded, so they would never be latin1. Or am I missing something?
I guess we could have a method that checks whether a string contains only char that have a latin1 representation, but that seems quite far from the isAscii method this PR is about.

For example, "Yağiz" is not latin1, because "ğ" has a representation of 2 bytes.

Hum I'm not sure I follow, all chars have a representation of one-byte in latin1, ğ is outside of the charset and therefore has no representation, there is no 2-byte representation for latin1 AFAIK.

If you don't have a string input, but have an ArrayBuffer input, you have to check trailing code points, if multiple bytes are used to construct an unicode character. An example implementation a.k.a simdutf::count_utf8 is available on: https://github.com/simdutf/simdutf/blob/master/src/scalar/utf8.h#L159

That's assuming the ArrayBuffer contains a UTF-8 encoded string, right? In that case, isLatin1 would be confusing, it should be called isUTF8AndCanBeConvertedToLatin1 or something.

@anonrig
Copy link
Member Author

anonrig commented Jan 3, 2023

Yeah I'm looking from an ArrayBuffer perspective

This was one of the reasons I wanted to introduce a new module for encoding. Maybe that discussion should continue.

@tniessen
Copy link
Member

tniessen commented Jan 4, 2023

There are lots of packages that checks & validates ascii only input by either iterating through all inputs, or by using regexp.

Do you have some performance-critical examples? I see that is-utf8 has millions of weekly downloads, but I wasn't able to find a similarly popular package for ASCII (or Latin1, for that matter).

My reasoning for adding this is to provide a fast & native solution to these libraries.

My concern is that all these added APIs are not really specific to node, and "fast & native" in this case also means non-portable.

@anonrig
Copy link
Member Author

anonrig commented Jan 18, 2023

Do you have some performance-critical examples?

I tried looking for any; even though several are not widespread, developers might develop their solution (js based) and have yet to release a package.

My concern is that all these added APIs are not really specific to node, and "fast & native" in this case also means non-portable.

I agree but I don't think there is any other way to endorse developers to get access to performant functions (as far as I know).

Initially, I wanted this function to live inside node:encoding but here we are...

@anonrig
Copy link
Member Author

anonrig commented Jan 18, 2023

Appreciate any reviews. cc @nodejs/cpp-reviewers @nodejs/performance

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@tniessen
Copy link
Member

Do you have some performance-critical examples?

I tried looking for any; even though several are not widespread, developers might develop their solution (js based) and have yet to release a package.

Based on the lack of evidence and our guidelines, I guess the only semi-strong argument for including this in core is

The component can only be implemented in a performant way in core.

which is not exactly true because it might as well be implemented outside of core, but most people won't do it. Partially, perhaps, because inputs tend to be small so shaving off a few microseconds here and there might not matter.

And even when it's in core, using it comes at the high price of losing portability, being unable to run the same code in browsers, bun, deno, CF workers, ...

(None of my comments are blocking; those who create/approve the PR take responsibility for it.)

@benjamingr
Copy link
Member

I'm +0 on this and don't see the big motivation either for the reasons @tniessen mentioned above, but I acknowledge others may have a use case where this being performant and in core is an advantage and I don't understand this space enough to block.

@bnoordhuis
Copy link
Member

So, this is basically the code below but SIMDified? (def a word)

function isAscii(b) {
  for (let i = 0, n = b.length; i < n; i++)
    if (b[i] > 127)
      return false;
  return true;
}

I don't expect performance to be all that different until the input starts getting large (10s of kilobytes, maybe even 100s) and it's also... dunno, kind of niche and trivial? Useful if you're implementing Kermit, otherwise not so much.

@anonrig anonrig added the commit-queue Add this label to land a pull request using GitHub Actions. label Jan 20, 2023
@nodejs-github-bot nodejs-github-bot removed the commit-queue Add this label to land a pull request using GitHub Actions. label Jan 20, 2023
@nodejs-github-bot nodejs-github-bot merged commit babe6d7 into nodejs:main Jan 20, 2023
@nodejs-github-bot
Copy link
Collaborator

Landed in babe6d7

@anonrig anonrig deleted the buffer/isascii branch January 20, 2023 03:32
@tniessen tniessen removed the notable-change PRs with changes that should be highlighted in changelogs. label Jan 20, 2023
@tniessen
Copy link
Member

I'm adding thenotable-change label due to adding a new API.

Given that most semver-minor PRs are not labeled as notable-change and that this feature appears to be "kind of niche and trivial" in @bnoordhuis' words, I have removed the label. Feel free to re-add if someone feels strongly that it is notable or if there is some guidelines saying to label all new APIs as notable changes.

ruyadorno pushed a commit that referenced this pull request Feb 1, 2023
PR-URL: #46046
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
@ruyadorno ruyadorno mentioned this pull request Feb 1, 2023
@ruyadorno ruyadorno added the notable-change PRs with changes that should be highlighted in changelogs. label Feb 1, 2023
ruyadorno added a commit that referenced this pull request Feb 1, 2023
Notable changes:

buffer:
  * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046
deps:
  * upgrade npm to 9.4.0 (npm team) #46353
esm:
  * leverage loaders when resolving subsequent loaders (Maël Nison) #43772
fs:
  * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358
src,lib:
  * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218
test_runner:
  * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712
v8:
  * (SEMVER-MINOR) support gc profile (theanarkh) #46255
vm:
  * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320

PR-URL: #46455
ruyadorno added a commit that referenced this pull request Feb 2, 2023
Notable changes:

buffer:
  * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046
deps:
  * upgrade npm to 9.4.0 (npm team) #46353
esm:
  * leverage loaders when resolving subsequent loaders (Maël Nison) #43772
fs:
  * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358
src,lib:
  * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218
test_runner:
  * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712
v8:
  * (SEMVER-MINOR) support gc profile (theanarkh) #46255
vm:
  * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320

PR-URL: #46455
juanarbol pushed a commit that referenced this pull request Mar 3, 2023
PR-URL: #46046
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
juanarbol added a commit that referenced this pull request Mar 3, 2023
Notable Changes

buffer:
  * add isAscii method (Yagiz Nizipli) #46046
fs:
  * add statfs() functions (Colin Ihrig) #46358
src,lib:
  * add constrainedMemory API for process (theanarkh) #46218
v8:
  * support gc profile (theanarkh) #46255
vm:
  * expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320

PR-URL: TDB
@juanarbol juanarbol mentioned this pull request Mar 3, 2023
juanarbol added a commit that referenced this pull request Mar 3, 2023
Notable Changes

buffer:
  * add isAscii method (Yagiz Nizipli) #46046
fs:
  * add statfs() functions (Colin Ihrig) #46358
src,lib:
  * add constrainedMemory API for process (theanarkh) #46218
v8:
  * support gc profile (theanarkh) #46255
vm:
  * expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320

PR-URL: #46920
juanarbol added a commit that referenced this pull request Mar 3, 2023
Notable Changes:

buffer:
  * add isAscii method (Yagiz Nizipli) #46046
fs:
  * add statfs() functions (Colin Ihrig) #46358
src,lib:
  * add constrainedMemory API for process (theanarkh) #46218
v8:
  * support gc profile (theanarkh) #46255
vm:
  * expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320

PR-URL: #46920
juanarbol added a commit that referenced this pull request Mar 3, 2023
Notable Changes:

buffer:
  * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046
doc,lib,src,test:
  * rename --test-coverage (Colin Ihrig) #46017
fs:
  * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358
src,lib:
  * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218
test_runner:
  * add initial code coverage support (Colin Ihrig) #46017
  * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712
v8:
  * (SEMVER-MINOR) support gc profile (theanarkh) #46255
vm:
  * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320

PR-URL: #46920
juanarbol pushed a commit that referenced this pull request Mar 5, 2023
PR-URL: #46046
Reviewed-By: Antoine du Hamel <duhamelantoine1995@gmail.com>
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
juanarbol added a commit that referenced this pull request Mar 5, 2023
Notable changes:

buffer:
  * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046
doc,lib,src,test:
  * rename --test-coverage (Colin Ihrig) #46017
fs:
  * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358
src,lib:
  * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218
test_runner:
  * add initial code coverage support (Colin Ihrig) #46017
  * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712
v8:
  * (SEMVER-MINOR) support gc profile (theanarkh) #46255
vm:
  * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320

PR-URL: https://github.com/nodejs/node/pull/46920/commits
juanarbol added a commit that referenced this pull request Mar 5, 2023
Notable changes:

buffer:
  * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046
doc,lib,src,test:
  * rename --test-coverage (Colin Ihrig) #46017
fs:
  * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358
src,lib:
  * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218
test_runner:
  * add initial code coverage support (Colin Ihrig) #46017
  * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712
v8:
  * (SEMVER-MINOR) support gc profile (theanarkh) #46255
vm:
  * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320

PR-URL: #46920
juanarbol added a commit that referenced this pull request Mar 5, 2023
Notable changes:

buffer:
  * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046
doc,lib,src,test:
  * rename --test-coverage (Colin Ihrig) #46017
fs:
  * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358
src,lib:
  * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218
test_runner:
  * add initial code coverage support (Colin Ihrig) #46017
  * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712
v8:
  * (SEMVER-MINOR) support gc profile (theanarkh) #46255
vm:
  * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320

PR-URL: #46920
juanarbol added a commit that referenced this pull request Mar 7, 2023
Notable changes:

buffer:
  * (SEMVER-MINOR) add isAscii method (Yagiz Nizipli) #46046
doc,lib,src,test:
  * rename --test-coverage (Colin Ihrig) #46017
fs:
  * (SEMVER-MINOR) add statfs() functions (Colin Ihrig) #46358
src,lib:
  * (SEMVER-MINOR) add constrainedMemory API for process (theanarkh) #46218
test_runner:
  * add initial code coverage support (Colin Ihrig) #46017
  * (SEMVER-MINOR) add reporters (Moshe Atlow) #45712
v8:
  * (SEMVER-MINOR) support gc profile (theanarkh) #46255
vm:
  * (SEMVER-MINOR) expose cachedDataRejected for vm.compileFunction (Anna Henningsen) #46320

PR-URL: #46920
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. notable-change PRs with changes that should be highlighted in changelogs. review wanted PRs that need reviews. semver-minor PRs that contain new features and should be released in the next minor version.
Projects
None yet
Development

Successfully merging this pull request may close these issues.