DOCX-HTML

Create Accessible HTML Files From DOCX

DOCX-HTML Script Usage

Requirements

Optional

  • Node.js - for use with the -m option to output math as svg or mathspeak
  • MathJax Node SRE - for use with the -m option to output math as svg or mathspeak
  • VIM - for use with -e flag (if you want to view any errors in the terminal and edit them directly)

Setup

  1. Install Pandoc
  2. Download the DOCX-HTML.sh script to your Mac or PC
  3. Place the script in an easy-to-locate folder (C:\scripts\ for PC or ~/scripts/ for macOS)
  4. Create a "stylesheets" folder on your C:\ drive (C:\stylesheets\ on a PC) or in your Home directory (~/stylesheets/ on a Mac), and name your default stylesheet "standard.css"
  5. Download and unpack the NuHTML zip folder to your C:\ drive (PC) or to your Home directory ~/ (Mac). On a PC, the path to the vnu.bat file should be C:\vnu-runtime-image\bin; on macOS, the vnu.jar file should be in ~/.
  6. Place Tasklist.exe in the C:\scripts\ folder (PC only)

Note: If you want the script to be available systemwide on a PC, place it in the following folder: C:\Program Files\Git\usr\bin; on macOS, place it here: /usr/local/bin. On macOS, you will need to show hidden folders to access this directory in the Finder window. Use CMD + Shift + . to reveal hidden folders/files and then use it again to hide them.
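The setup steps above can be spot-checked from the terminal. The following is a minimal sketch, not part of DOCX-HTML.sh: the helper names have_tool and has_default_stylesheet are hypothetical, and the base directory passed to the second helper is an assumption (your Home directory on macOS, or the Git Bash path for your C:\ drive on a PC). The script's own -d option remains the supported way to run diagnostics.

```shell
# Hypothetical helper (not part of DOCX-HTML.sh):
# succeeds if the named command can be found on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: succeeds if the default stylesheet exists
# under the given base directory (e.g. "$HOME" on macOS).
has_default_stylesheet() {
  [ -f "$1/stylesheets/standard.css" ]
}

# Example use (commented out so the block has no side effects):
# have_tool pandoc || echo "pandoc was not found on PATH"
# has_default_stylesheet "$HOME" || echo "standard.css is missing"
```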

Optional Setup for processing math equations as svg or mathspeak:

  1. Install Node.js

  2. Open the terminal and type npm install -g mathjax-node-cli and press Enter. Wait for installation to complete. (Note: use sudo npm install -g mathjax-node-cli if you are on macOS)

  3. Open the terminal and type npm install -g mathjax-node-sre and press Enter. Wait for the installation to complete. (Note: use sudo npm install -g mathjax-node-sre if you are on macOS)

  4. (For PC users) Edit the environment variables on your machine so that PATH includes the node_modules and mathjax-node-cli\bin folders. These should be located here:

         C:\Users\YOUR-NAME\AppData\Roaming\npm\node_modules
       
         C:\Users\YOUR-NAME\AppData\Roaming\npm\node_modules\mathjax-node-cli\bin
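To confirm that a folder is on your PATH after editing the environment variables, a small check like the one below may help. This is a sketch under assumptions: path_contains is a hypothetical helper, and it assumes that Git Bash presents PATH as a colon-separated list with Windows folders in /c/Users/... form. Open a new terminal after changing the variables so the updated PATH is picked up.

```shell
# Hypothetical helper: succeeds if directory $1 is an entry in the
# colon-separated list $2 (defaults to the current PATH).
path_contains() {
  case ":${2:-$PATH}:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use (commented out; replace YOUR-NAME with your user folder):
# path_contains "/c/Users/YOUR-NAME/AppData/Roaming/npm/node_modules/mathjax-node-cli/bin" \
#   || echo "mathjax-node-cli\\bin is not on PATH yet"
```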
    

Optional Setup for uploading files to Canvas (-u Option):

  1. Download the canvas_token.txt file and place it in your "scripts" folder (C:\scripts\ for PC or ~/scripts/ for macOS)
  2. Enter your Canvas API key on line two of canvas_token.txt (From your Canvas account, go to Settings and click on the + New Access Token button.)
  3. Enter your Canvas domain on line four of canvas_token.txt (e.g., bcourses.berkeley.edu)
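Because the key and the domain are identified by line number, it can be useful to check what sits on lines two and four before uploading. The sketch below is illustrative only: nth_line, canvas_token, and canvas_domain are hypothetical helpers, and it is an assumption that stray Windows carriage returns should be stripped. It does not show how DOCX-HTML.sh itself reads the file.

```shell
# Hypothetical helper: print line number $2 of file $1,
# dropping any Windows carriage return.
nth_line() {
  sed -n "${2}p" "$1" | tr -d '\r'
}

# Line two holds the API key, line four the Canvas domain.
canvas_token()  { nth_line "$1" 2; }
canvas_domain() { nth_line "$1" 4; }

# Example use (commented out):
# canvas_domain ~/scripts/canvas_token.txt
```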

Features

  • Table of contents (Headings 1-3)
  • language switches (lang attribute) are added automatically when using the -l option, for up to nine secondary languages
  • table accessibility markup is automatically added (e.g., colspan, rowspan, scope attributes) when MS Word Tags are used
  • math output options (mathjax, mathml, webtex, SVG, mathspeak). Mathjax is default. Use -m option for other targets
  • hyperlinks for footnotes (automatically added with the -f option)
  • <aside> element for line numbers (poetry) when using the -n option
  • <aside> element for secondary text and footnote regions
  • <details> element for extended descriptions
  • inspect the alternative text of mathematical content (when using the -i option)
  • export just HTML code, excluding the <head>, <style>, and <footer> sections, for easier copying and pasting to LMS (when using the -j option)
  • export new DOCX file for PDF workflows (when using -p option), which can be helpful for producing alternative text for math content
  • receive HTML accessibility warnings and errors in the terminal (NuHTML checker)
  • edit HTML directly with VIM in the terminal (when using the -e option)
  • check for correct setup with -d option (Diagnostics)
  • upload files directly to Canvas LMS, as a course page or into Files area (when using the -u option)
  • batch processing

Overview

DOCX-HTML.sh is a bash script that converts DOCX files to HTML (web) format. The script performs numerous find and replace operations on an HTML file to ensure that the file is fully accessible to students using assistive technologies.

This ReadMe has three parts:

  1. How to structure your MS Word document for use with the DOCX-HTML.sh script
  2. How to run the DOCX-HTML.sh script
  3. Further resources

Getting Started

Before using the DOCX-HTML.sh script, it is important to make sure that the MS Word document you are converting contains the following features:

  • Heading structure
  • Alternative text for images
  • Page numbers are styled as Heading 6 (for easier navigation)

In addition to these elements, there are some HTML accessibility features that are not added by the script unless you use “MS Word tags” in your document. These “tags” are detected by the script and replaced with the appropriate HTML elements and attributes.

MS Word Tags

For a complete list of the tags that are recognized by the DOCX-HTML.sh script, see the MS Word Tags document. This document explains how each of the tags should be used in your MS Word document.

We will give a few examples of the most common tags below:

Secondary Text

When there is text in a document that is not essential to the main content (e.g., a sidebar), this is “secondary text”. On the line above secondary text, enter the following tag:

Secondary Text Begin:

And on the line below secondary text, enter:

Secondary Text End.
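To make the idea of a tag being "detected and replaced" concrete, here is a minimal sketch of one such find-and-replace step. It is not the script's actual code: mark_secondary_text is a hypothetical function, and it assumes that each tag arrives in the intermediate HTML as its own paragraph (for example <p>Secondary Text Begin:</p>).

```shell
# Hypothetical sketch: turn the paired Secondary Text tags into an
# <aside> element. Assumes each tag is alone in its own <p> line.
mark_secondary_text() {
  sed -e 's|^<p>Secondary Text Begin:</p>$|<aside>|' \
      -e 's|^<p>Secondary Text End\.</p>$|</aside>|'
}

# Example use (commented out):
# mark_secondary_text < input.html > output.html
```

Because the match is exact, a tag with different capitalization or a missing colon or period would be left untouched in this sketch.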

Footnote Text

If there are footnotes in your document, make sure that these have superscript formatting. For the footnote references (on the bottom of each page OR at the end of the document) use the following tag above the footnote text region:

Footnote Begin:

And on the line below footnote text region, enter:

Footnote End.

Foreign Languages

The DOCX-HTML.sh script automatically makes English the default language for the HTML file. If there are other languages in your MS Word document, these languages need to be marked with language tags.

Insert the following tag before the foreign language text:

###1

And enter the following tag after the foreign language text:

%%%

Note: These tags should be used in-line with text in your document. If there is more than one foreign language, use ###2 and ###3 before the second and third foreign languages, respectively, and use the same ending tag (%%%) at the end of these passages. The DOCX-HTML.sh script can process up to nine foreign languages in your MS Word document.
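The pairing of a numbered start tag with the shared %%% end tag can be illustrated with a short sketch. This is an assumption-laden illustration rather than the script's implementation: mark_language is a hypothetical function, the use of a <span> element to carry the lang attribute is assumed, and the pattern assumes the tagged passage contains no % character.

```shell
# Hypothetical sketch: wrap text between ###N and %%% in an element
# carrying a lang attribute. $1 = tag number, $2 = ISO language code.
mark_language() {
  sed "s|###${1}\([^%]*\)%%%|<span lang=\"${2}\">\1</span>|g"
}

# Example use (commented out): Italian is ###1, French is ###2.
# mark_language 1 it < input.html | mark_language 2 fr > output.html
```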

Figure Captions

A figure caption is text that identifies an image. It normally appears immediately before or after the image and helps readers understand what the image is about.

Write the following tag on the line above the figure caption:

Figcaption Begin:

And on the line below the figure caption text, enter:

Figcaption End

Extended Descriptions

If there are complex images in the document, write an extended description. Keep the alt text for this image short (What does it show?) and write “description to follow.” at the end of the alt text.

Write your extended description and then add the following tag above the description:

Description Begin:

And on the line below the extended description, write:

Description End.

Note: Extended descriptions must come immediately after the image in the MS Word document for the DOCX-HTML.sh script to process the Description Begin: … Description End. tags successfully.

Table Captions

A table caption is text that identifies a table. It is text that normally appears immediately before or after the table.

Write the following tag immediately before table captions in your MS Word document:

Caption Begin:

And on the line below the table caption, write:

Caption End.

Note: Table captions must come immediately before the table in the MS Word document for the DOCX-HTML.sh script to process the Caption Begin: … Caption End. tags successfully.

MS Word Tables

General Information

There are two types of tables that you may encounter in your MS Word document: simple tables and complex tables.

A simple table has no merged cells. A complex table has merged cells, so the number of cells can differ from row to row.

When you are working with a simple table, place your cursor in the table and make sure that the “Header Row” checkbox is checked in the Table Style Options group of the Table Design tab.

If the table is complex and there are multiple column headers for cells in the table, place your cursor in the table and make sure that the “Header Row” checkbox is unchecked. See the example of a complex table.

Note that columns 2-4 have children columns. When using the DOCX-HTML.sh script, this table should not have the “Header Row” marked.

Tags for MS Word Tables

If your table is simple, you usually do not have to do anything beyond making sure that the “Header Row” is marked.

If your table is complex, however, you will need to add tags to the table to ensure that it is processed correctly by the DOCX-HTML.sh script. Otherwise, you will need to do heavy editing of the HTML document in an HTML editor (e.g., Dreamweaver), which can significantly increase the amount of time you spend converting the document.

For an extended explanation of how to use MS Word tags in tables for use with the DOCX-HTML.sh script, see the MS Word Tags document.

We will give a few examples of common MS Word tags for tables below:

Column Headers

When a cell in the first row of your table has multiple columns underneath it, we call this cell a “parent column header”. The cell is the parent of “children columns”. With this type of table, you will need to add tags so that the DOCX-HTML.sh script can determine the number of parent column headers and children columns correctly.

See the example of a complex table + column headers - with tags

In the first cell of this table, the tag begins with the number of children columns for this cell. There is only one column beneath this cell so we insert the number 1. Next we indicate that this cell is a column header by using $. Next we use the @ symbol followed by the number of children columns of each parent column header (122) for the entire table.

With the rest of the cells in the first row of this table, we again use tags to indicate the number of children columns underneath the cells (2) and to indicate that these cells are column header cells ($)

In the second row of the table, we also use the $ tag to indicate that these cells are column headers. We do not use a number next to them because they are not parent column headers.
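As a worked example of the numbers after the @ symbol: in 122, the three parent column headers have 1, 2, and 2 children columns, so the second header row has 1 + 2 + 2 = 5 cells. The sketch below does this arithmetic. The function total_columns is hypothetical and assumes each count is a single digit, as in the example; it is not taken from DOCX-HTML.sh.

```shell
# Hypothetical sketch: add up the single-digit child-column counts
# that follow the @ symbol (e.g. 122 means 1 + 2 + 2).
total_columns() {
  digits=$1
  total=0
  while [ -n "$digits" ]; do
    rest=${digits#?}            # everything after the first digit
    first=${digits%"$rest"}     # the first digit on its own
    total=$((total + first))
    digits=$rest
  done
  echo "$total"
}

# Example use (commented out):
# total_columns 122
```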

For more information about complex tables with parent column headers, see the MS Word Tags document.

Row Headers

When a cell in the left column of your table has multiple rows to the right of it, we call this cell a "parent row header". The cell is the parent of "children rows". With this type of table, you will also need to add tags so that the DOCX-HTML.sh script can determine the number of parent row headers and children rows correctly.

See the example of a complex table + row headers - with tags.

In each of the "parent row header" cells in this table, we use tags to indicate the number of children rows (3) to the right of the cells and to indicate that these cells are row header cells (^).

In the second column of the table, we also use the ^ tag to indicate that these cells are row headers. We do not use a number next to them because they are not parent row headers.

For more information about complex tables with parent row headers, see the MS Word Tags document.

VBA Macros for MS Word Tags

To speed up the process of adding "tags" to your MS Word document, you can use VBA macros + your own keyboard shortcuts.

See VBA Macros - MS Word Tags

Note: MS Word Tags are case sensitive and must not have any formatting such as styles, bold, italics, or superscripts applied to them.

Usage

To use the script, follow these instructions:

  1. Place the DOCX file(s) into a folder (e.g., "HTML Projects")
  2. If you are on a PC, right-click in the folder and select "Git Bash Here"; if you are on macOS, open the terminal and change directories to the folder with the DOCX file(s).
  3. In the terminal window, type the path to the script: '/c/scripts/DOCX-HTML.sh' (for PC) or /scripts/DOCX-HTML_mac.sh (for macOS). [OPTIONAL] use an option at runtime (see the help menu, -h, for more information)
  4. Press ENTER to run the script on the DOCX file(s) in your current working directory.
  5. View the terminal output for any warnings or errors. The HTML files will be output to a folder with the same name as the DOCX in the current working directory.

Note: on macOS, you will first need to use chmod +x /scripts/DOCX-HTML_mac.sh to make the script executable.
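If you want to confirm that the permission change took effect, a check along these lines may be useful. This is a sketch only: ensure_executable is a hypothetical helper, and the path in the example is the one given in the note above.

```shell
# Hypothetical helper: set the executable bit on a file and
# succeed only if the file is then executable.
ensure_executable() {
  chmod +x "$1" && [ -x "$1" ]
}

# Example use (commented out):
# ensure_executable /scripts/DOCX-HTML_mac.sh || echo "could not make the script executable"
```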

DOCX-HTML Help Menu

Sample Files

See the following documents for examples of DOCX files with "MS Word tags" and their HTML versions.

Example 1: Multiple Languages (using the -l flag)

Languages.docx | Languages.html

Note: When using the -l flag, you must enter the ISO language code for the secondary language(s). Use the -l option before each ISO language code if there are multiple secondary languages. For example, '/c/scripts/DOCX-HTML.sh' -l it -l fr. In this example, there are two secondary languages, Italian (it) and French (fr). These are marked with ###1 and ###2, respectively, in the MS Word document. See the MS Word Tags document for more information about usage.
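With several secondary languages, the repeated -l options can be built from a list of codes. The sketch below shows one way to do that; lang_flags is a hypothetical helper, not something provided by DOCX-HTML.sh, and the order of the codes is assumed to correspond to ###1, ###2, and so on, as in the example above.

```shell
# Hypothetical helper: turn a list of ISO codes into repeated -l options.
# The first code corresponds to ###1, the second to ###2, and so on.
lang_flags() {
  flags=""
  for code in "$@"; do
    flags="${flags:+$flags }-l $code"
  done
  echo "$flags"
}

# Example use (commented out; left unquoted so the options split into words):
# '/c/scripts/DOCX-HTML.sh' $(lang_flags it fr)
```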

N.B. Secondary languages are displayed initially with a purple background for easier review and can then be removed by the user.

Example 2: Complex Tables and Extended Descriptions

Complex_tables.docx | Complex_tables.html

Example 3: Footnote Text Regions (using the -f option for adding footnotes)

Footnote.docx | Footnote.html

Note: Footnotes must be superscripted in the text. They also must have a footnote reference.

Example 4: Line Numbers (using the -n flag)

Line_Numbers.docx | Line_Numbers.html

Example 5: Mathematical Content

Math_test.docx | Math_test.html

Note: the DOCX-HTML.sh script processes mathematical content that is in OMML (Office Math Markup Language) only. MathType equations are not supported. See GrindEQ's MathType to Equation tool for converting MathType equations to MS Word Equations (OMML).

More Resources

AHG Conference - 2020

See the following links to the presentation and video tutorials from our workshop "Accessible HTML – Why It’s a Good Format for your Students and How to Create It!" delivered on November 17, 2020 at the Accessing Higher Ground Conference - 2020:

Note: The current version of the DOCX-HTML.sh script is slightly different from the version used at the AHG conference. There is now a version for macOS as well.

Math to HTML Workflow

See the following presentation entitled "Turning Math into HTML" delivered on April 28, 2021 for Phillip White and North Carolina accessibility professionals:

Turning Math Into HTML - Presentation

Validate-HTML Script

  • If you would like to run the NuHTML validator after creating one or more HTML files, you can use the Validate-HTML script (for PC):

Validate-HTML.sh (for PC only)

macOS version forthcoming...
