Lawbox Content Contains CP1252 Characters Encoded Using ISO-8859 (which fails)

ISO-8859 strikes again. It seems that when we imported the Lawbox content, we trusted the encodings declared in the HTML files when they said the files were ISO-8859. Well, as we learned long ago, whenever you see ISO-8859, you should assume you have some characters from CP1252, which is effectively a superset of ISO-8859-1: it fills the 0x80-0x9F range, which ISO-8859-1 leaves to control codes, with useful characters like em dashes, curly quotes, tildes and whatnot.

If you look at the Wikipedia page for CP1252, you can see which characters are extra:

https://en.wikipedia.org/wiki/Windows-1252#Code_page_layout

So, for example, byte 0x97 is an em dash (U+2014) in CP1252. Now, take a look at footnote four of this opinion:

https://www.courtlistener.com/opinion/1625969/doe-v-rosenberg/

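And you'll see little squares where the dashes should be. Each square is U+0097, an invisible control character, which is what byte 0x97 becomes when it is decoded as ISO-8859. If we had imported using CP1252 instead of stupid ISO-8859, we would have correctly interpreted this as an em dash.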
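To make the failure concrete, here is a minimal Python 3 sketch that decodes the same bytes both ways. The sample bytes are made up for illustration and are not taken from the opinion above.

```python
# Hypothetical sample: some text with a CP1252 em dash (byte 0x97) in it.
raw = b"plaintiff \x97 defendant"

# What the import did: trust the declared ISO-8859-1 encoding.
# Byte 0x97 becomes the C1 control character U+0097 (the "little square").
as_latin1 = raw.decode("iso-8859-1")

# What it should have done: decode as CP1252.
# Byte 0x97 becomes U+2014, a real em dash.
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

Neither call raises an error, which is why the problem slipped through: ISO-8859-1 happily accepts every byte value.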

The good news is that we can look these problems up with something like:

```
# U+0097 is what the CP1252 em-dash byte 0x97 becomes when decoded as ISO-8859-1.
os98 = Opinion.objects.filter(html_lawbox__contains=u'\x97')
```

And we can replace them with something like:

```
# Swap the stray control character for a real em dash (U+2014).
o.html_lawbox = o.html_lawbox.replace(u'\x97', u'\u2014')
```

Honestly, we don't want a replace statement for each of the 27 characters unique to CP1252, so we'll want to find a better way of re-encoding the whole file, but this is the right track.
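One possible "better way", as a sketch rather than a tested migration: build a translation table for the whole 0x80-0x9F range from Python's own `cp1252` codec and run the text through `str.translate`. This assumes Python 3 and that the stored text is already a Unicode string; `C1_TO_CP1252` and `fix_c1_controls` are names made up for this example. A plain `text.encode('latin-1').decode('cp1252')` round trip is shorter, but it would raise on any character outside Latin-1 and on the five byte values that Python's `cp1252` codec leaves undefined.

```python
def build_c1_to_cp1252_table():
    """Map each C1 control code point to the character CP1252 puts at that byte."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # This byte is undefined in CP1252, so leave the character alone.
            continue
    return table


C1_TO_CP1252 = build_c1_to_cp1252_table()


def fix_c1_controls(text):
    """Replace mis-decoded C1 control characters with their CP1252 meanings."""
    return text.translate(C1_TO_CP1252)


print(repr(fix_c1_controls("see \x93Doe\x94 \x97 footnote four")))
```

Characters outside the 0x80-0x9F range pass through untouched, so running this over text that is already clean should be harmless.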

ISO-8859, may you die in hell.


Lawbox Content Contains CP1252 Characters Encoded Using ISO-8859 (which fails) #410

mlissner opened on Dec 31, 2015