Check pofile string delimiters #1151

Open: rtobar wants to merge 6 commits into base: master

@rtobar rtobar commented Nov 17, 2024

This PR adds checks to the pofile parser code to validate that message strings are correctly delimited by double quotes. In keeping with the current design, an error is raised only if requested; otherwise a warning is printed, the faulty lines are corrected, and parsing goes on.

I found this issue while processing a pofile used in the Spanish translation of the CPython documentation. One of our files was malformed, and of all our tooling only the msgcat tool from GNU's gettext package complained; babel, polib and others did not. See python/python-docs-es#2873, izimobil/polib#161 and https://git.afpy.org/AFPy/powrap/pulls/4 for further reference.

While implementing this change I found that the _NormalizedString class not only was used to contain message lines, but also participated in the parsing process (and hid some parsing as well). I thus broke down my changes into three separate commits:

  • I first clarified the usage of the current _NormalizedString class across the codebase (see details in commit).
  • I then added the double-quote delimitation check to the parser.
  • Now that all strings have the same form, I more formally constrained how _NormalizedString behaves

Along the way I also implemented three small quality-of-life changes. They are included as the first three commits of this PR; I am happy to submit them separately if required:

  • Avoid re-compiling a regular expression
  • Remove a duplicate test assertion
  • Perform a stricter assertion in a particular test, presumably what was intended in the first place

While the re module caches some of the latest compilations, it's better
form to not rely on it doing so.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
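
As a minimal sketch of the idea in the commit above (not the actual babel code), a pattern can be compiled once at module level and reused on every call. The names `_ESCAPE_RE`, `_REPLACEMENTS` and `unescape_body` are hypothetical and chosen only for illustration.

```python
import re

# Compiled once at import time instead of on every call.
# Hypothetical pattern: a backslash followed by one escapable character.
_ESCAPE_RE = re.compile(r'\\([\\trn"])')

# Escape letters that map to a control character; any other matched
# character (backslash, double quote) stands for itself.
_REPLACEMENTS = {"n": "\n", "t": "\t", "r": "\r"}


def unescape_body(text: str) -> str:
    """Replace escape sequences in text using the precompiled pattern."""
    return _ESCAPE_RE.sub(
        lambda m: _REPLACEMENTS.get(m.group(1), m.group(1)), text
    )
```

Calling `unescape_body` repeatedly reuses the same compiled object, so the behavior no longer depends on the internal cache of the `re` module.
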
The exact same check is performed a few lines above.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
Python's sorting has been guaranteed to be stable since Python 2.3, and
sorted() has shared that guarantee since it was added in 2.4. The
comment was wrong, and thus it makes sense to do the full assertion as
clearly intended.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
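
The stability guarantee mentioned in the commit above can be illustrated with a small, self-contained example. The data is invented for illustration and is unrelated to the test being changed.

```python
# Records with duplicate sort keys; the second field only marks the
# original position of each record among those sharing a key.
records = [("b", 2), ("a", 1), ("b", 1), ("a", 2)]

# Sorting by the first field only: records with equal keys keep their
# original relative order because sorted() is stable.
by_key = sorted(records, key=lambda record: record[0])

print(by_key)  # [('a', 1), ('a', 2), ('b', 2), ('b', 1)]
```

Because the order of equal-keyed items is deterministic, a test can assert on the full sorted result rather than only on part of it.
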
The _NormalizedString helper class mixes some responsibilities, not only
acting as a container for potentially multiple lines of a single string
message, but also doing and hiding some of the parsing of such strings.
"Doing" because it performs a .strip() on all incoming strings in order
to remove any whitespace before/after them, and "hiding" because when
invoking the "denormalize" method, each line is sliced to remove the
first and last element, which are implicitly assumed to be the string
delimiters (double quotes, in principle).

These multiple roles have already led to confusion within the codebase
as to how this class is supposed to be used. Its existing unit test
doesn't provide strings with proper delimiters (and thus calling
.denormalize() on these objects would return unexpected results -- empty
strings in all cases). Similarly, missing msgstr instances also result
in a call to _NormalizedString(""), which does work, but is conceptually
incorrect, as the empty string is something that _NormalizedString
should never see coming in.

This commit changes all the places where confusing usage of the
_NormalizedString class happens. In particular, the existing unit test's
strings are now always delimited by double quotes (so calling
.denormalize on them would yield the expected value). A number of new
unit tests have also been added exercising the denormalize() method,
which includes unescaping escaped characters. Finally, the construction
of an empty string message has been simplified to _NormalizedString().

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
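
The implicit slicing described in the commit above can be illustrated with a simplified sketch. `strip_delimiters` is a hypothetical stand-in for the slicing step only; it is not babel's implementation and it ignores unescaping.

```python
def strip_delimiters(line: str) -> str:
    """Drop the first and last character, assuming both are double quotes."""
    return line[1:-1]


# Properly delimited input gives the expected content.
print(strip_delimiters('"hello"'))  # hello

# A one-character input without delimiters collapses to an empty string.
print(repr(strip_delimiters("a")))  # ''

# A missing opening quote silently eats the first real character.
print(strip_delimiters('hello"'))  # ello
```

The last two calls show why undelimited input is a problem: nothing fails, and content is lost without any diagnostic.
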
Strings should be delimited on both ends by double quotes, but this
is currently not being detected, and content is simply being
incorrectly trimmed.

This commit adds a check for each string to verify it starts and ends
with a double quote character, issuing a warning/error if that's not the
case (and fixing it as appropriate).

A few new test cases have been added to check that the lack of double
quotes to delimit strings issues errors as expected.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
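
The warn-or-raise behavior described in the commit above could look roughly like the following sketch. It rests on assumptions: `ensure_quoted` and `DelimiterError` are hypothetical names, the `abort_invalid` flag merely mirrors the idea of an error being raised only on request, and an escaped quote at the end of a line is not treated specially.

```python
import warnings


class DelimiterError(ValueError):
    """Hypothetical error raised when a string is not properly delimited."""


def ensure_quoted(line: str, lineno: int, abort_invalid: bool = False) -> str:
    """Return line delimited by double quotes, warning or raising if it was not."""
    line = line.strip()
    starts = line.startswith('"')
    # A lone '"' must not count as both the opening and the closing quote.
    ends = len(line) > 1 and line.endswith('"')
    if starts and ends:
        return line

    message = f"line {lineno}: string not delimited by double quotes: {line!r}"
    if abort_invalid:
        raise DelimiterError(message)

    # Lenient mode: report the problem, repair the line, and carry on.
    warnings.warn(message)
    if not starts:
        line = '"' + line
    if not ends:
        line = line + '"'
    return line
```

In lenient mode the caller always receives a line that starts and ends with a double quote, so later stages can rely on that form.
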
Now that all strings given as inputs to _NormalizedString have been
verified (or corrected) to be correctly delimited with double quotes,
there's no reason to keep doing an internal strip. Moreover, we can
express this internal constraint with an assertion to avoid issues in
the future.

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
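
A container that encodes the constraint from the commit above as an assertion might be sketched as follows. `QuotedLines` is a hypothetical, simplified class, not the real `_NormalizedString`; in particular, its `denormalize` does no unescaping.

```python
class QuotedLines:
    """Hypothetical container for lines already delimited by double quotes."""

    def __init__(self, *lines: str) -> None:
        self._lines: list[str] = []
        for line in lines:
            self.append(line)

    def append(self, line: str) -> None:
        # The parser is expected to have validated or repaired the line,
        # so a violation here signals a programming error, not bad input.
        assert len(line) >= 2 and line[0] == '"' and line[-1] == '"', (
            f"line must be delimited by double quotes: {line!r}"
        )
        self._lines.append(line)

    def denormalize(self) -> str:
        """Join the content of all lines with the delimiters removed."""
        return "".join(line[1:-1] for line in self._lines)
```

An empty message is simply `QuotedLines()`, which avoids passing an empty string that would violate the constraint.
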
rtobar added a commit to python/python-docs-es that referenced this pull request Nov 18, 2024
In `library/re.po` there was an entry that was not correctly delimited
with double quotes (if you look at the whole diff it is the last entry
in the diff, or you can simply look at the first commit of this PR).
This caused `powrap --check` to skip the file and not validate it.
That, in turn, happened because the `msgcat` utility from `gettext`
detected the syntax error and failed when run. `powrap` did not take
those errors into account when computing the process exit code, so the
file not only remained invalid but also went unchecked. Likewise, the
file could not be wrapped correctly using `powrap library/re.po`.

I already opened a PR against `powrap` to change this behavior at
https://git.afpy.org/AFPy/powrap/pulls/4 (update: the PR has been
merged and a new version of powrap has been released, so in this PR I
also updated our powrap dependency, as well as the powrap pre-commit
hook).

On the other hand, the rest of our tools did *not* consider this file
invalid. This is because `polib` did not perform the corresponding
validation and parsed the entry incorrectly. I also opened a PR against
polib for this at izimobil/polib#161.
Update: in the meantime I also realized that the `babel` package
suffers from the same problem; I had incorrectly assumed that babel
depended on polib. PR created against babel:
python-babel/babel#1151.

After fixing the syntax error, I ran powrap so that `library/re.po` is
now well formatted.

---------

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>