Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do we still need asciilib for better performance? #98229

Open
sobolevn opened this issue Oct 12, 2022 · 7 comments
Open

Do we still need asciilib for better performance? #98229

sobolevn opened this issue Oct 12, 2022 · 7 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage type-feature A feature request or enhancement

Comments

@sobolevn
Copy link
Member

sobolevn commented Oct 12, 2022

Feature or enhancement

asciilib was introduced in c3cec78 as a performance feature for ascii strings.

11 years have passed since then.

While working on #98025 we with @vstinner experimented on unicode_count with and without asciilib_count calls.

Results were clear: it does not bring any benefits on our benchmarks.
And this commit was made: df3a6d9

Later, while working on #98228 I've noticed that asciilib_rsplit_whitespace also does not provide significant performance gains on my platform and my simple data input.

So, maybe this should be analyzed deeply?

Pitch

  • I think that these calls should be further investigated: do we really need this?
  • We should come up with better data to make a final decision, including:
    • What pytohn methods / c-api functions are affected?
    • Short and long ascii / non-ascii strings (because this check slows down all non-ascii strings by calling extra PyUnicode_IS_ASCII(str)
    • Windows / MacOS / Linux platforms, maybe the results will be different

I don't have access to Windows, but I can do the research for other platforms.

If it does not provide any performance benefits, it should be removed.

@sobolevn sobolevn added type-feature A feature request or enhancement performance Performance or resource usage labels Oct 12, 2022
@mdboom
Copy link
Contributor

mdboom commented Oct 13, 2022

I'd generally be in favor of removing almost-duplicate code if there is no performance benefit.

Have you done any analysis with the pyperformance benchmarks?

@vstinner
Copy link
Member

ASCII-only strings have multiple efficient properties. Examples:

  • Looking for a non-ASCII character in an ASCII string has complexity of O(1): it's always false
  • _PyUnicode_FromASCII() doesn't have to scan the string to compute the maximum code path: it's known that the string is ASCII
  • Maximum code point is 127 (STRINGLIB_MAX_CHAR=127 for asciilib functions)

Some optimizations are implemented outside stringlib. Like:

static Py_ssize_t
any_find_slice(PyObject* s1, PyObject* s2, ...)
{
    ...
    kind1 = PyUnicode_KIND(s1);
    kind2 = PyUnicode_KIND(s2);
    if (kind1 < kind2)
        return -1;
    ...
}

@sobolevn
Copy link
Member Author

Have you done any analysis with the pyperformance benchmarks?

Only very limited ones in #98025 (comment) and #98228 (comment)

@serhiy-storchaka
Copy link
Member

Please perform tests also with longer strings (100, 1000, 10000 characters or more). Also test with strings containing non-ASCII whitespace characters \x85 and \xa0.

@rhpvorderman
Copy link
Contributor

If you want to work with some real-world ASCII data you can download a FASTQ file named SRR15096500_1.fastq.gz here https://www.ebi.ac.uk/ena/browser/view/SRR15096500.
FASTQ is a format to store DNA (or RNA) sequencing reads. You can find a description of the format here: https://en.wikipedia.org/wiki/FASTQ_format.

The data can be quite big. Whole genome sequencing can yield FASTQ files that are 100GB when compressed with gzip.

@marcelm has created a nice library called dnaio for parsing the format. To add to @vstinner ASCII has some really interesting other properties to (to complement @vstinner):

  • The last 5 bits of lowercase and uppercase characters are the same. Allowing quick case-independent comparing.
  • The last 4 bits of 0, 1, 2, 3, are 0000, 0001, 0010, 0011.. you get the idea
  • The fact that the highest code point is at most 127 allows you to create reasonably small lookup tables with only 128 entries.
  • The most significant bit of ASCII is always 0. So you can do a very quick check if strings are ascii by doing a Bitwise OR on all characters and checking if the result's significant bit is also zero. That trick is used in dnaio and runs at approximately 50GB/s (using SSE2 instructions) or 20GB/s (without). This could potentially be used in UTF-8 parsing.

Anyway. I am curious to see if asciilib works well when using real-world data such as posted above. A lot of formats in my field are ASCII, so Python's performance in this area matters to me.

@vstinner
Copy link
Member

vstinner commented Apr 5, 2023

This could potentially be used in UTF-8 parsing.

The utf8_decode() function of the Python "stringlib" internal library uses this property to check if a string is ASCII-only in an efficient way. It even works on size_t words (ex: 8 bytes on 64-bit systems) per loop iteration. See: Objects/stringlib/codecs.h.

@rhpvorderman
Copy link
Contributor

The utf8_decode() function of the Python "stringlib" internal library uses this property to check if a string is ASCII-only in an efficient way.

@vstinner I am aware that the current ASCII check exists and is quite efficient. However, it can be more efficient: #101289

The current ASCII check is per string and involves a lot of code. A quick check on the input buffer is almost cost-free. The copying can then proceed using only PyUnicode_New(127) and memcpy. That is much faster. Although it will require some reworking of the ASCII/UTF-8 code.

@iritkatriel iritkatriel added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

6 participants