I’ve been given a group of Word documents to turn into a Web site. As they are long and contain a lot of internal formatting I wanted to convert them from Word into HTML before working on them, rather than saving the content as raw text and putting all the formatting back in.
This used not to be practical. It has always been possible using ‘Save As’ within Word, but the results are so stuffed with unwanted HTML that the resulting file is almost impossible to edit further, so in practice it was not worth doing. But there’s now an alternative way of doing it, using Google Docs, or Google Drive as you are now supposed to call it. Well, you know what I mean – the part of Google’s bid for world domination that consists of looking after your files for you.
Put your Word document in there, select it by checking the box by the name, then pull down ‘More’ from the header bar and select ‘Open With… Google Drive Viewer’. (Some Word docs appear on Google already automatically viewable with Google Drive Viewer, without your needing to select it.) You’ll see something that looks pretty much like your Word document with the formatting in. Select ‘File… Download As… Web Page’. This generates a zipped file with the suffix .html.
Using this method I get something with paragraphs, hyperlinks, headings etc. but there’s still quite a bit of work ahead before it is in a form I find acceptable.
Firstly, Google’s converter imposes its own stylesheet with all sorts of styles known only to itself. There’s a whole bunch of style information near the beginning which can be discarded and replaced with your own stylesheet. The rest of the HTML is littered with attributes such as ‘class="c12"’. Sometimes Google’s styles join forces, and you get something like ‘class="c1 c9 c11"’. Fortunately some Perl one-liners get rid of these quite easily.
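A minimal sketch of this cleanup, written in Python rather than Perl. It assumes the style information sits in an ordinary <style> element and that Google's class names always follow the c-plus-digits pattern quoted above; the function name is made up for illustration.

```python
import re

def strip_google_styles(html):
    """Remove embedded <style> blocks and Google's generated class attributes."""
    # Drop any <style>...</style> block (assumed to hold Google's stylesheet).
    html = re.sub(r"<style\b[^>]*>.*?</style>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    # Drop class attributes made up only of names like c1 or c12,
    # whether there is one name or several separated by spaces.
    html = re.sub(r'\s+class="c\d+(?:\s+c\d+)*"', "", html)
    return html

sample = ('<style type="text/css">.c1{color:#000}</style>'
          '<p class="c1 c9 c11">Hello <span class="c12">world</span></p>')
print(strip_google_styles(sample))  # <p>Hello <span>world</span></p>
```

Class attributes that do not match the pattern (for example class="title") are left alone, so any you have added yourself survive the pass.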
Then there’s our friend the line break. This time I actually want some in the HTML source, to make the text more readable, but if I’m to get them I need to put them in myself after headings and paragraphs. Again, Perl one-liners help. The process shows up a few ghost headings of the form <h4> </h4> which can easily be deleted, and ghost paragraphs <p></p> which also serve no purpose. Every so often a whole paragraph, or part of a paragraph, will leap out in <h3> or some other heading tag for no particular reason. Perhaps some deleted formatting in the original document is being picked up?
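The two steps described here can be sketched in Python (standing in for the Perl one-liners). The sketch assumes that a ghost heading or paragraph contains nothing but whitespace or a non-breaking space; the function name is hypothetical.

```python
import re

def tidy_blocks(html):
    """Delete empty headings and paragraphs, then add line breaks to the source."""
    # Remove <h1>..<h6> and <p> elements holding only whitespace or &nbsp;.
    html = re.sub(r"<(h[1-6]|p)\b[^>]*>(?:\s|&nbsp;)*</\1>", "", html)
    # Put exactly one newline after every closing heading or paragraph tag.
    html = re.sub(r"(</(?:h[1-6]|p)>)\s*", r"\1\n", html)
    return html

sample = "<h2>Title</h2><h4> </h4><p>First.</p><p></p><p>Second.</p>"
print(tidy_blocks(sample))
```

With the sample input, the empty <h4> and <p> disappear and each remaining heading or paragraph ends up on its own line. Paragraphs that have wrongly come out as headings still need fixing by eye, since nothing in the markup distinguishes them from real headings.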
The HTML converter has particular problems with hyperlinks. They tend to come out duplicated, with one of the pair not enclosing any text; the closing </a> is also often incorrectly positioned. For good measure, sometimes an unscheduled paragraph break gets thrown into the mix. A reliable pointer to duplication is the non-breaking space, which usually marks a place where Google’s converter hasn’t really understood what’s going on. Even so, it would be a real pain to put all the hyperlinks in by hand, so I prefer Google’s conversion as the lesser of two evils.
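A rough Python sketch of how the duplicates might be hunted down. It assumes the spare hyperlink is an <a> element with nothing inside it but whitespace or &nbsp;, and both function names are invented for the example. Misplaced closing </a> tags are not handled; those still need checking by hand.

```python
import re

# An <a ...> element containing no text (only whitespace or &nbsp;).
EMPTY_ANCHOR = re.compile(r"<a\b[^>]*>(?:\s|&nbsp;)*</a>")

def nbsp_lines(html):
    """Return 1-based numbers of lines containing a non-breaking space."""
    return [number
            for number, line in enumerate(html.splitlines(), start=1)
            if "&nbsp;" in line or "\u00a0" in line]

def drop_empty_links(html):
    """Remove anchors that enclose no text."""
    return EMPTY_ANCHOR.sub("", html)

sample = ('<p><a href="http://example.com/"></a>'
          '<a href="http://example.com/">Example</a></p>\n'
          '<p>Text&nbsp;here</p>')
print(nbsp_lines(sample))        # [2]
print(drop_empty_links(sample))
```

Running nbsp_lines first gives a list of places to inspect before anything is deleted. Note that drop_empty_links also removes empty named anchors such as <a name="top"></a>, so it is only safe if the document does not rely on those.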
Lists really cause problems for the converter. A panoply of ordered lists is generated, although many of them are on closer inspection lists containing only one item, because the next item in the list starts a new list. In fact the original document contained only unordered lists, apart from a few which were ordered by letter (which the converter understands and handles correctly).
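One possible repair, sketched in Python under two assumptions: that the fragmented lists sit directly next to each other with at most whitespace between them, and that every <ol> the pass touches should really be a <ul>. The function name is hypothetical.

```python
import re

def merge_lists(html):
    """Join adjacent ordered lists and turn them into unordered lists."""
    # A closing </ol> followed at once by a new <ol> is treated as a
    # false break inside one list, so the pair of tags is removed.
    html = re.sub(r"</ol>\s*<ol\b[^>]*>", "", html)
    # Convert the remaining ordered lists to unordered ones.
    html = re.sub(r"<ol\b[^>]*>", "<ul>", html)
    html = html.replace("</ol>", "</ul>")
    return html

sample = ('<ol class="c3"><li>One</li></ol>'
          '<ol class="c3"><li>Two</li></ol>')
print(merge_lists(sample))  # <ul><li>One</li><li>Two</li></ul>
```

As written this converts every ordered list, including the lettered ones the converter got right, so those would need to be excluded from the pass or restored afterwards.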
As well as adding unnecessary HTML, the conversion process also removes some formatting which I have had to replace, such as emphasis (used in quotations and some headings in the original document).
But I should end on a positive note. There were a number of tables containing text, something which Word-to-HTML conversion used to mangle spectacularly. These now come out correctly, and while they don’t automatically have borders (which become a box round the text inside), a simple global change using Perl puts borders back in.
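The post does not say what the global change was. One way to do it, sketched here in Python, is to add a border attribute to every <table> tag that lacks one; a CSS rule in your own stylesheet would be an alternative. The function name is made up for illustration.

```python
import re

def add_table_borders(html):
    """Add border="1" to each <table> tag that has no border attribute."""
    # The negative lookahead skips tags that already set a border.
    return re.sub(r"<table\b(?![^>]*\bborder=)", '<table border="1"', html)

sample = '<table class="c5"><tr><td>Cell</td></tr></table>'
print(add_table_borders(sample))
# <table border="1" class="c5"><tr><td>Cell</td></tr></table>
```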
So do I recommend using Google’s HTML conversion tool? It boils down to one word: hyperlinks. If you have more than a handful of these, it is worth converting the document to HTML, because the risk of creating an error if you cut and paste a link incorrectly outweighs the inconvenience of having formatting incorrectly rendered. Incorrect formatting can easily be spotted when you look at the resulting page, but a mistake in a hyperlink can only be found by trying to follow it. And it knocks spots off Word’s own HTML conversion.