แปลงรัฐธรรมนูญ (ร่างต้นปี 2559) จาก PDF เป็น HTML ดูคำอธิบายได้ในสไลด์ https://www.slideshare.net/arthit/pdf-plain-text และโน๊ต https://www.facebook.com/notes/10154493302702646
Convert Thai constitution draft (early 2016) from PDF to plaintext and correct encoding glitches. It is crafted to work with a specific set of PDF. Mainly one from https://www.parliament.go.th/ewtcommittee/ewt/draftconstitution2/download/article/article_20160129132217.pdf and following versions. Cannot be guarantted to work with other PDFs. This is like web scraping, you have to tailor it to a particular website.
- Use Apache PDFBox for PDF to HTML
java -jar pdfbox-app.jar ExtractText -html file.pdf file.html
- Cannot convert directly to plaintext, as there are Thai characters in the PDF that use codepoints in Private User Area (PUA) -- all the PUAs will be discarded for conversion to plaintext
- Convert Thai characters that encoded as HTML entities to UTF-8. The same process will also convert PUAs to valid codepoints.
pua = { '63233': 'ิ', # 0xf701 Sara I '63234': 'ี', # 0xf702 '63235': 'ึ', # 0xf703 '63236': 'ื', # 0xf704 '63237': '่', # 0xf705 '63238': '้', # 0xf706 Mai Tho (on Po Pla) '63242': '่', # 0xf70a Mai Ek '63243': '้', # 0xf70b Mai Tho '63246': '์', # 0xf70e Thantakat '63248': 'ั', # 0xf710 Mai Han Akhat (on Po Pla) '63250': '็', # 0xf712 Mai Tai Khu (on Po Pla) '63251': '่', # 0xf713 '63252': '้' # 0xf714 }
- Correct wrong order of Thai characters, like tonemark + vowel --> vowel + tonemark
- Basic reformatting
More explanation (in Thai): slides, notes
Ideally, there should be no need for a script like this. All laws should be available in search friendly and machine-readable format.