Opened on Dec 31, 2015
ISO-8859 strikes again. It seems that when we imported the lawbox content, we trusted the encodings in the HTML files when they said the files were ISO-8859. Well, as we learned long ago, whenever you see ISO-8859, you should assume you have some characters from CP1252, which is effectively a superset of ISO-8859-1: it swaps the rarely used control codes at 0x80-0x9F for useful characters like em dashes, curly quotes, small tildes and whatnot.
If you look at the Wikipedia page for CP1252, you can see which characters are extra:
https://en.wikipedia.org/wiki/Windows-1252#Code_page_layout
So, for example, byte 0x97 is an em dash in CP1252. Now, take a look at footnote four of this opinion:
https://www.courtlistener.com/opinion/1625969/doe-v-rosenberg/
And you'll see little squares where the dashes should be (the unprintable control character U+0097). If we had imported using CP1252 instead of stupid ISO-8859, we would have correctly interpreted this byte as an em dash.
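To make the failure concrete, here is a minimal Python 3 sketch that decodes the same bytes both ways. The sample bytes are made up for illustration and are not taken from the lawbox files.

```python
# Hypothetical sample: the byte 0x97 sitting between two words.
raw = b"plaintiff\x97defendant"

# ISO-8859-1 maps 0x97 to the invisible C1 control character U+0097.
as_latin1 = raw.decode("iso-8859-1")

# CP1252 maps the same byte to U+2014, the em dash.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

Both decodes succeed without raising, which is why the wrong encoding went unnoticed at import time.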
The good news is that we can look these problems up with something like:
os98 = Opinion.objects.filter(html_lawbox__contains=u'\x97')
And we can replace them with something like:
o.html_lawbox = o.html_lawbox.replace(u'\x97', u'\u2014')
Honestly, we don't want a replace statement for each of the 27 characters unique to CP1252, so we'll want to find a better way of re-encoding the whole file, but this is the right track.
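One possible way to handle all of them at once is sketched below in Python 3. It assumes the bad text is already a Unicode string containing C1 control characters (U+0080 to U+009F) where CP1252 characters should be. It builds a translation table from Python's own `cp1252` codec and applies it with `str.translate`. The names `build_c1_table`, `C1_TO_CP1252` and `fix_cp1252` are invented for this example. A plain `text.encode('latin-1').decode('cp1252')` round trip would be shorter, but it raises on any character outside Latin-1 and on the five byte values CP1252 leaves undefined, so the table approach seems safer.

```python
def build_c1_table():
    """Map each C1 code point to the character CP1252 assigns to that byte."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in CP1252;
            # leave those code points untouched.
            continue
    return table


C1_TO_CP1252 = build_c1_table()


def fix_cp1252(text):
    """Replace mis-decoded C1 control characters with their CP1252 meaning."""
    return text.translate(C1_TO_CP1252)


if __name__ == "__main__":
    print(repr(fix_cp1252("plaintiff\x97defendant")))
```

Applied to the opinions found by the query, this would amount to assigning `fix_cp1252(o.html_lawbox)` back to each object and saving it. That step is untested here.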
ISO-8859, may you die in hell.