--encoding is not working when exporting to HTML #155

Closed
brrd opened this issue May 28, 2021 · 11 comments

@brrd

brrd commented May 28, 2021

First of all, thank you very much for this very useful program.

Describe the bug
When exporting a document to HTML with the --encoding option, the output file is always encoded in windows-1252.

This issue looks like this one (someone suggested an answer, but I don't know if it's relevant here): https://stackoverflow.com/q/34026716

To Reproduce
Here with UTF-8:

docto.exe -F input.rtf -T wdFormatHTML -O test.html -E 65001

The same behavior is encountered when running the command from Node.js (https://github.com/brrd/msoconvert).

Expected behavior
I would expect the HTML file to be encoded in UTF-8, and its header to contain this meta tag:

<meta http-equiv=Content-Type content="text/html; charset=utf-8">

Instead, the file is encoded in windows-1252 and the header contains the following:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
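A quick way to check which of the two cases an output file falls into is to read the declared charset from the meta tag and confirm that the bytes actually decode as UTF-8. The following is a minimal Python sketch, not part of DocTo; the function names `declared_charset` and `is_valid_utf8` are hypothetical, and it assumes the meta tag appears within the first 4096 bytes, as it does in the Word output shown in this thread.

```python
import re
from typing import Optional

# Matches charset=... inside a <meta> tag. Word writes attribute values
# without quotes in places, so the quote is optional.
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def declared_charset(html_bytes: bytes) -> Optional[str]:
    """Return the charset named in the first matching <meta> tag, lower-cased."""
    match = _META_CHARSET.search(html_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def is_valid_utf8(html_bytes: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        html_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, reading test.html with `open("test.html", "rb").read()` and passing the bytes to both functions shows whether the declared charset and the real byte encoding agree.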

Additional context

  • Please run the command with -L 10 to provide verbose logging and paste that into your bug report.
docto.exe -F input.rtf -T wdFormatHTML -O test.html -E 65001 -L 10
[20210528 19:21:26 -]: [DEBUG]  Log Level Set To:10
Loading ChooseConverter...
Parameter Count is 10
Converter:MS Word
[DEBUG]  Log Level Set To:10
[INFO]   Loading Configuration...
[DEBUG]  Parameter Count is 10
[DEBUG]  Input File is: C:\Users\Thomas\Desktop\input.rtf
[DEBUG]  Type Integer is: 8
[INFO]   Output file: C:\Users\Thomas\Desktop\test.html
[INFO]   Log Level Set To:10
[DEBUG]  Current Directory: C:\Users\Thomas\Desktop
[DEBUG]  Ready to Execute
[DEBUG]  Executing Conversion ...
[INFO]   ExecuteConversion:C:\Users\Thomas\Desktop\input.rtf
[DEBUG]  Version >= 14 Using Saveas2 Function
[INFO]   File Converted: C:\Users\Thomas\Desktop\test.html
  • Please also run docto.exe -v so I can see what version of Docto and Word you are running.
docto.exe -v
DocTo Version:1.03.30.54
OfficeApp Version:16
Source: https://github.com/tobya/DocTo/
  • What OS: [e.g. Windows Server 2012]

Windows 10 Pro 20H2

@github-actions

Thank You for the Issue. I will try to get to look at it as soon as I can.

@tobya
Owner

tobya commented May 29, 2021

@brrd Thanks for the thumbs up - always appreciated.

This is really to do with Word. I did a quick search and couldn't find anything specifically about this issue. --encoding seems to affect the code page but, as you have found out, may not change how the conversion to HTML happens.

I'll have a look.

I see you are converting an RTF. Do you get a different result if you use a .doc or .docx?

@brrd
Author

brrd commented May 29, 2021

Thanks for your quick answer!

I tried converting from .doc and .docx to HTML and I get exactly the same result: the file is encoded in windows-1252 and contains:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

I hope that it will be possible to fix this. I will work in windows-1252 encoding until then.

Do not hesitate if you need any other information.

@tobya
Owner

tobya commented May 29, 2021

@brrd I don't seem to be able to recreate this when I save documents with docto.

When I do

docto.exe -f "CullohillApplePie - Copy.doc" -o ..\Test1\cullohil.html -t wdFormatHTML -e 65001

or

docto.exe -f "CullohillApplePie - Copy.doc" -o ..\Test1\cullohil2.html -t wdFormatHTML

Both of them create an HTML file with

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">

So I think perhaps it is more to do (in the first instance) with settings in Word that I'm not overriding, but I need to know which setting to override.

Can you watch this video and tell me what setting your Word has?

https://vimeo.com/556655462/b46c1f8539

@brrd
Author

brrd commented May 30, 2021

Your intuition was right, my default export setting was not unicode:

image

I tried to run docTo again after switching this setting to "Unicode":

docto.exe -F input.doc -T wdFormatHTML -O test.html -E 65001
docto.exe -F input.doc -T wdFormatHTML -O test.html

(I tried both)

But I always get the same result:

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

@brrd
Author

brrd commented May 30, 2021

Finally I could make it work by checking the box "Always save in default encoding" with "Unicode" selected.

image

When running docTo again (same commands as above) I could get the expected encoding:

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8">

@tobya
Owner

tobya commented May 31, 2021

I'll play around with some settings and see what I can do.

@tobya
Owner

tobya commented May 31, 2021

I think I have found it. It seems there is a web encoding as well as a standard encoding.

Please find the .zip attached with the updated binary. Let me know if it works and I'll merge to main.

[docto.zip](url)
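To illustrate what a separate "web encoding" can look like in Word's automation model: a Document exposes a `WebOptions.Encoding` property (an MsoEncoding value, where 65001 is UTF-8) in addition to the `Encoding` argument of `SaveAs2`. The Python sketch below sets both before saving. It is an assumption about the kind of change involved, not the actual DocTo patch (DocTo is not written in Python), and the function name `save_html_with_encoding` is hypothetical.

```python
WD_FORMAT_HTML = 8         # matches "Type Integer is: 8" in the log above
MSO_ENCODING_UTF8 = 65001  # the same code page number passed to -E


def save_html_with_encoding(doc, output_path, encoding=MSO_ENCODING_UTF8):
    """Save a Word document as HTML with an explicit web encoding.

    `doc` is assumed to be a Word Document automation object, or any
    stand-in that offers the same WebOptions attribute and SaveAs2 method.
    """
    # Per-document web encoding: assumed here to be what decides the
    # charset written into the HTML <meta> tag.
    doc.WebOptions.Encoding = encoding
    # Standard encoding argument of SaveAs2, passed as well for consistency.
    doc.SaveAs2(FileName=output_path, FileFormat=WD_FORMAT_HTML, Encoding=encoding)
```

With pywin32 installed on a machine that has Word, `doc` could come from `win32com.client.Dispatch("Word.Application").Documents.Open(path)`; that part is left out so the sketch stays independent of a Word installation.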

@brrd
Author

brrd commented May 31, 2021

I can confirm this is working perfectly with this new version!

When running this:

docto.exe -F input.doc -T wdFormatHTML -O test.html -E 65001

then the output contains, as expected:

<meta http-equiv=Content-Type content="text/html; charset=utf-8">

with proper UTF-8 encoding.

When running the same command without -E, the default encoding is used, which I think is good behavior.

Thank you very much @tobya for fixing this and for your great reactivity!

(You may want to keep this issue open until the next release is published, so I'll let you decide when to close it.)
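For anyone scripting the conversion, the behavior described above (pass -E to force an encoding, omit it to fall back to Word's default) can be wrapped in a small helper. This is a minimal Python sketch that uses only the flags shown in this thread; the helper name `build_docto_command` and the `exe` parameter are hypothetical.

```python
from typing import List, Optional


def build_docto_command(
    input_path: str,
    output_path: str,
    encoding: Optional[int] = None,
    exe: str = "docto.exe",
) -> List[str]:
    """Build the argument list for an HTML export with docto.

    When `encoding` is None, -E is omitted so Word's default encoding applies.
    """
    cmd = [exe, "-F", input_path, "-T", "wdFormatHTML", "-O", output_path]
    if encoding is not None:
        cmd += ["-E", str(encoding)]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, for example `build_docto_command("input.doc", "test.html", encoding=65001)` for UTF-8.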

@tobya
Owner

tobya commented May 31, 2021

Great, I'll merge into next release.

@tobya
Owner

tobya commented May 31, 2021

Pushed to release. https://github.com/tobya/DocTo/releases/tag/V1.04

@tobya tobya closed this as completed Jun 1, 2021