--encoding is not working when exporting to HTML
#155
Comments
Thank you for the issue. I will try to look at it as soon as I can.
@brrd Thanks for the thumbs up - always appreciated. This is really to do with Word. I did a quick search and couldn't find anything specifically about this issue. --encoding seems to affect the codepage but, as you have found out, may not change how the conversion to HTML happens. I'll have a look. I see you are converting an .rtf; do you get a different result if you use a .doc or .docx?
Thanks for your quick answer! I tried to convert from .doc and .docx to HTML and I get exactly the same result: the file is encoded in windows-1252 and contains: <meta http-equiv=Content-Type content="text/html; charset=windows-1252"> I hope that it will be possible to fix this. I will work in windows-1252 encoding until then. Do not hesitate to ask if you need any other information.
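For anyone who wants to check both symptoms described above (the declared charset in the meta tag and the actual byte encoding of the file), here is a minimal Python sketch. The helper names `declared_charset` and `is_valid_utf8` are hypothetical and not part of docto; the regular expression only covers the unquoted or quoted `charset=` form seen in Word's output.

```python
import re

# Matches the charset value inside a <meta ... charset=...> tag,
# e.g. <meta http-equiv=Content-Type content="text/html; charset=utf-8">
META_CHARSET = re.compile(
    rb"<meta[^>]*charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def declared_charset(html_bytes):
    """Return the charset named in the first matching meta tag, or None."""
    match = META_CHARSET.search(html_bytes)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


def is_valid_utf8(html_bytes):
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        html_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

To use it on an exported file, read the file in binary mode (for example `open(path, "rb").read()`) and pass the bytes to both functions. Note that a pure-ASCII document decodes as UTF-8 regardless of what Word intended, so the byte check is only informative when the document contains accented or other non-ASCII characters.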
@brrd I don't seem to be able to recreate this. I saved documents with docto using both docto.exe -f "CullohillApplePie - Copy.doc" -o ..\Test1\cullohil.html -t wdFormatHTML -e 65001 and docto.exe -f "CullohillApplePie - Copy.doc" -o ..\Test1\cullohil2.html -t wdFormatHTML
Both of them create an HTML file with <html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8"> So I think it has more to do (in the first instance) with settings in Word that I'm not overriding, but I need to know which setting to override. Can you watch this video and tell me which setting your Word has?
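The two invocations compared above (with and without -e 65001) can be scripted so that the comparison is repeatable. This is a rough sketch only: `build_docto_args` is a hypothetical helper, and it uses only the flags that appear in the commands quoted in this thread (-f, -o, -t, -e).

```python
def build_docto_args(source, output, encoding=None, exe="docto.exe"):
    """Build the argument list for an HTML export like the ones above.

    encoding is a Windows codepage number such as 65001 (UTF-8);
    when it is None the -e flag is left out entirely.
    """
    args = [exe, "-f", source, "-o", output, "-t", "wdFormatHTML"]
    if encoding is not None:
        args += ["-e", str(encoding)]
    return args
```

On a machine with Word and docto installed, the list can be passed to `subprocess.run(..., check=True)`. Passing a list rather than a single string avoids quoting problems with file names that contain spaces, such as "CullohillApplePie - Copy.doc".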
Your intuition was right, my default export setting was not Unicode: I tried to run docTo again after switching this setting to "Unicode":
(I tried both) But I always get the same result: <html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
Finally I could make it work by checking the box "Always save in default encoding" with "Unicode" selected. When running docTo again (same commands as above) I could get the expected encoding: <html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">
I'll play around with some settings and see what I can do.
I think I have found it. It seems there is a web encoding as well as a standard encoding. Please find the .zip attached with an updated binary. Let me know if it works and I'll merge to main.
I can confirm this is working perfectly with this new version! When running this:
then the output contains, as expected: <meta http-equiv=Content-Type content="text/html; charset=utf-8"> with proper UTF-8 encoding. When running the same command without
Thank you very much @tobya for fixing this and for your great responsiveness! (You may want to keep this issue open until the next release is published, so I'll let you decide when to close it.)
Great, I'll merge into the next release.
Pushed to release. https://github.com/tobya/DocTo/releases/tag/V1.04
First of all, thank you very much for this very useful program.
Describe the bug
When exporting a document to HTML with the --encoding option, the output file is always encoded in windows-1252. This issue looks like this one (someone suggested an answer, but I don't know if it's relevant here): https://stackoverflow.com/q/34026716
To Reproduce
Here with UTF-8:
The same behavior is encountered when running the command from Node.js (https://github.com/brrd/msoconvert).
Expected behavior
I would expect the HTML file to be encoded in UTF-8, and its header to contain this meta:
Instead, the file is encoded in windows-1252 and the header contains the following:
Additional context
-L 10 to provide verbose logging and paste that into your bug report.
Windows 10 Pro 20H2