-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-33578: Add getstate/setstate for CJK codec #6984
Conversation
This comment has been minimized.
This comment has been minimized.
(Added GitHub name to bugs.python.org. CLA is signed already) |
Modules/cjkcodecs/multibytecodec.c
Outdated
if (_PyLong_AsByteArray(statelong, pendingbuffer, | ||
sizeof(pendingbuffer), | ||
PY_LITTLE_ENDIAN, 0) < 0) { | ||
Py_DECREF(statelong); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to refactor all error paths under one "error" label section and substitute these with gotosas indicated here so there is only one exit point from the function?
Modules/cjkcodecs/multibytecodec.c
Outdated
unsigned long long flag; | ||
|
||
if (self->pendingsize > 0) { | ||
buffer = PyBytes_FromString((const char *)self->pending); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
return NULL; | ||
} | ||
|
||
buffersize = PyBytes_Size(buffer); |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
8b5ee96
to
db1a055
Compare
Thanks for the review @pablogsal, I've made the changes. Refactoring to use an error label helped catch a missing decref 👍 |
db1a055
to
6f9869d
Compare
Lib/test/test_io.py
Outdated
f.write("\u0300") | ||
f.close() | ||
|
||
f = self.open(support.TESTFN, "r", encoding="jisx0213") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is leaking a file descriptor:
ResourceWarning: unclosed file <_io.TextIOWrapper name='@test_22392_tmp' mode='r' encoding='jisx0213'>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, missed the last .close, thanks.
self.assertEqual(f.readline(), "\u3046\u3048\n") | ||
self.assertEqual(f.tell(), p1) | ||
f.close() | ||
|
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
6f9869d
to
0af0cc7
Compare
Lib/test/test_io.py
Outdated
f.close() | ||
|
||
def test_seek_with_encoder_state(self): | ||
f = self.open(support.TESTFN, "w", encoding="jisx0213") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use "euc_jis_2004" instead of "jisx0213" alias.
"JIS X 0213:2004" defines charset. There are three encoding for it; "ISO-2022-JP-2004", "EUC-JIS-2004", and "Shift-JIS 2004".
Additionally, there is an old version of it; "JIS X 0213:2000".
I don't know why "jisx0213" alias for "euc_jis_2004" was added, but it seems not predictable alias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I've changed it to use euc_jis_2004. Also replaced euc-jp with euc_jp.
0af0cc7
to
44727e7
Compare
be4dc14
to
3c18d61
Compare
Need to fix the following warnings:
and Travis has picked up a checksum mismatch:
Update: done |
3c18d61
to
f60f50c
Compare
return NULL; | ||
} | ||
if (!PyArg_ParseTuple(state, "OK;setstate(): illegal state argument", | ||
&buffer, &flag)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use S
for better error message..
https://docs.python.org/3/library/codecs.html#codecs.IncrementalDecoder.getstate
Modules/cjkcodecs/multibytecodec.c
Outdated
unsigned long long flag; | ||
|
||
if (!PyTuple_Check(state)) { | ||
PyErr_SetString(PyExc_TypeError, "state argument must be a tuple"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyArg_ParseTuple will raise TypeError. So this check may be redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyArg_ParseTuple is enough but the error message is not very user friendly:
>>> import codecs
>>> decoder = codecs.getincrementaldecoder('euc_jp')()
>>> decoder.setstate(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
SystemError: new style getargs format but argument is not a tuple
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, you can use ArgumentClinic again. object(subclass_of='&PyTuple_Type')
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that works nicely :)
>>> decoder.setstate(0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: setstate() argument must be tuple, not int
Modules/cjkcodecs/multibytecodec.c
Outdated
return NULL; | ||
} | ||
Py_DECREF(ret); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split }
and else {
in two lines.
Modules/cjkcodecs/multibytecodec.c
Outdated
return NULL; | ||
} | ||
self->pendingsize = buffersize; | ||
memcpy(self->pending, bufferstr, self->pendingsize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to check buffersize <= MAXDECPENDING
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And test cases for possible wrong inputs are needed too.
Modules/cjkcodecs/multibytecodec.c
Outdated
unsigned long long flag; | ||
|
||
if (self->pendingsize > 0) { | ||
buffer = PyBytes_FromString((const char *)self->pending); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use PyBytes_FromStringAndSize
.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Modules/cjkcodecs/multibytecodec.c
Outdated
|
||
flag = self->state.ull; | ||
|
||
return Py_BuildValue("NK", buffer, flag); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is make_tuple()
in this file. It can be used instead.
Modules/cjkcodecs/multibytecodec.h
Outdated
unsigned char pending[MAXENCPENDING]; | ||
Py_ssize_t pendingsize; | ||
MultibyteCodec_State codecstate; | ||
} MultibyteIncrementalEncoderState; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This struct is used from only two place in single source file.
Move this to the file.
c97199f
to
edf0b4e
Compare
Changes to address feedback from @doerwalter have been pushed.
|
Modules/cjkcodecs/multibytecodec.c
Outdated
{ | ||
const char *pendingbuffer = NULL; | ||
Py_ssize_t pendingsize; | ||
MultibyteIncrementalEncoderState state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Struct layout is different by compiler.
To build portable integer, you need to use just bytearray:
unsigned char state[1 + MAXPENDING*4 + 8];
...
state[0] = peindingsize;
memcpy(state+1, pendingbuffer, pendingsize);
int pos = 1 + pendingsize;
memcpy(state+pos, self->state, 8);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @methane. state should now be independent of x32/64, endianness, and compiler.
Latest commit changes:
- Replace
MultibyteIncrementalEncoderState
with byte array - Update
test_multibytecodec
to match new byte array format (difference:pendingsize
moved to the first byte + onlypendingsize
bytes stored in buffer position)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also added test_getstate_returns_expected_value
to test_multibytecodec
so that the consistency of the state
byte array structure is validated by the test suite.
e43fb6f
to
b36149f
Compare
@methane: you approved the change, but you didn't merge it. Do you want another core dev to approve the change, or did you just forget to merge it? |
@doerwalter Would you review this again before merging this? |
@@ -51,6 +51,8 @@ | |||
; \ | |||
} | |||
|
|||
#define STATE_OFFSET 0 | |||
|
|||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the meaning of this #define
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt c[STATE_OFFSET]
was more explicit than c[0]
, but don't mind inlining it if you prefer. Alternatively, STATE_POSITION
may be a better name, as I wanted to get across that only byte 0 is relevant in state->c
for the CN codec.
Perhaps even CN_STATE_POSITION
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I get it: c[8]
is the merged version of the previous union, so it serves two purposes: it's a character array and a (very) small integer. IMHO this (and how it's used) should be documented where MultibyteCodec_State
is defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I were to document how I see MultibyteCodec_State
now, I would say something along the lines of:
A union that provides 8 bytes of state for multibyte codecs. Codecs
are free to use this how they want. Note: if you need to add a new
member to this union, ensure that its byte order is independent of CPU
endianness so that the return value of `getstate` doesn't differ
between little and big endian CPUs.
Should I add this as a comment next to the union?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. However MultibyteCodec_State
no longer is a union, "if you need to add a new member to this union" should be phrased differently.
And STATE_OFFSET
would have a comment like: "Codecs that have to store a small integer in MultibyteCodec_State
store it at the following offset".
And yes, multibytecodec.h
is a little light on documentation, but that's probably another patch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added comments next to STATE_OFFSET
(now CN_STATE_OFFSET
for clarity) and MultibyteCodec_State
. Please let me know if the explanations make sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but I'd like @methane to make the final decision about merging this PR.
b36149f
to
4aea0e1
Compare
|
|
|
I'm looking into the buildbot failures to see if they're related.
If I understand correctly, the s390x systems are big endian, so there is probably a problem in the handling of endianness. I will look into emulating a big endian machine locally in a few hours from now to debug this. Update: I managed to recreate locally and the problem seems to be in |
I've opened a PR (#10290) to address the issues found by the big-endian build bots. |
This implements getstate and setstate for the cjkcodecs multibyte incremental encoders/decoders, primarily to fix issues with seek/tell.
The encoder getstate/setstate is slightly tricky as the "state" is pending bytes + MultibyteCodec_State but only an integer can be returned. The approach I've taken is to encode this data into a long, similar to how .tell() encodes a "cookie_type" as a long.
https://bugs.python.org/issue33578