Lawbox Content Contains CP1252 Characters Encoded Using ISO-8859 (which fails) #410

Open

Description

ISO-8859 strikes again. It seems that when we imported the lawbox content, we trusted the encoding declarations in the HTML files when they said the files were ISO-8859. Well, as we learned long ago, whenever you see ISO-8859, you should assume you have some characters from CP1252, which is effectively a superset of ISO-8859-1: it fills the 0x80-0x9F range (unprintable control codes in ISO-8859-1) with useful characters like em dashes, curly quotes and whatnot.

If you look at the Wikipedia page for CP1252, you can see which characters are extra:

https://en.wikipedia.org/wiki/Windows-1252#Code_page_layout

So, for example, byte 0x97 is an em dash in CP1252. Now, take a look at footnote four of this opinion:

https://www.courtlistener.com/opinion/1625969/doe-v-rosenberg/

And you'll see little squares like �. If we had imported using CP1252 instead of stupid ISO-8859, we would have correctly interpreted this as an em dash.
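To make the failure concrete, here is a minimal sketch of what happens to a single 0x97 byte under each decoding. The sample bytes are made up for illustration and are not taken from the lawbox files.

```python
# Made-up sample: a CP1252-encoded em dash (byte 0x97) inside ASCII text.
raw = b"footnote 4 \x97 see above"

# ISO-8859-1 maps every byte straight to the code point with the same number,
# so 0x97 becomes U+0097, an invisible C1 control character (the "square").
as_latin1 = raw.decode("iso-8859-1")

# CP1252 maps 0x97 to U+2014, the em dash we actually wanted.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

Neither decode raises an error, which is why the bad import went unnoticed: ISO-8859-1 accepts every byte value and silently turns it into something.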

The good news is that we can look these problems up with something like:

os98 = Opinion.objects.filter(html_lawbox__contains='�')

And we can replace them with something like:

o.html_lawbox.encode('utf-8').replace('�', '—').decode('utf-8')

Honestly, we don't want a replace statement for each of the 27 characters unique to CP1252, so we'll want to find a better way of re-decoding the whole file, but this is the right track.
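One possible "better way", sketched here and not tested against the real data: build a translation table once from Python's own `cp1252` codec and apply it with `str.translate`. It assumes the stored text was decoded as ISO-8859-1, so each stray CP1252 byte now sits in the database as a C1 control character (U+0080 to U+009F). The names `C1_TO_CP1252` and `fix_cp1252_leftovers` are hypothetical, not part of the CourtListener code.

```python
def build_c1_to_cp1252_table():
    """Map each C1 control code point to the character CP1252 defines
    for the same byte value, skipping the bytes CP1252 leaves undefined."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec has no mapping for this byte; leave it alone.
            continue
    return table


# Hypothetical module-level constant, built once and reused for every document.
C1_TO_CP1252 = build_c1_to_cp1252_table()


def fix_cp1252_leftovers(text):
    """Replace C1 control characters left over from an ISO-8859-1 decode
    with the characters CP1252 intended; all other text is untouched."""
    return text.translate(C1_TO_CP1252)


if __name__ == "__main__":
    broken = "Doe v. Rosenberg \x97 footnote \x93four\x94"
    print(fix_cp1252_leftovers(broken))
```

Translating the already-decoded string, as opposed to re-encoding to ISO-8859-1 and decoding again as CP1252, means text that already contains correct non-Latin-1 characters does not raise an error, and the handful of byte values CP1252 leaves undefined are passed through instead of failing the whole document.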

ISO-8859, may you die in hell.
