Paul Kiddie

Converting .doc to .docx programmatically

July 25, 2007

I’ve been looking for something to help me convert .doc files and other formats to HTML, or to some intermediate representation that I can parse further on the server side. I’ve come across some interesting applications such as PurePage and ConvertDoc. These are great and have nice, simple and effective APIs, but unfortunately they cost too much for small-scale projects. The main issue is converting the binary .doc format. Several attempts exist, including Apache POI, but these are more proof of concept than anything.

So, armed with my copy of Word 2007, I’ve been playing with the COM interface. It looks like there is another way to get at the content: extracting the raw OpenXML from the document. When Word 2007 opens a traditional .doc it converts it to the OpenXML representation, which we can extract through the COM interface. We can then close the instance of Word and post-process the XML. Here’s what I did:

  1. Added a COM reference in Visual Studio to the Word 12.0 Object Library.
  2. Wrote a little bit of code:

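// Start Word through COM (needs the Word 12.0 Object Library reference from step 1)
Microsoft.Office.Interop.Word.Application wordApp = new Microsoft.Office.Interop.Word.Application();

object file = src;     // string containing the path to the .doc file

object nullobj = System.Reflection.Missing.Value;     // stands in for every optional argument

// Open2002 takes 15 ref parameters; only the file name is supplied here
Microsoft.Office.Interop.Word.Document doc = wordApp.Documents.Open2002(
    ref file, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj);

doc.ActiveWindow.Selection.WholeStory();     // select the entire story

string xml = doc.ActiveWindow.Selection.get_XML(false);  // XML for the selection; false keeps the Word markup rather than data only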

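  3. ‘xml’ now contains the OpenXML representation of the .doc file. Don’t forget to close the document with doc.Close(ref nullobj, ref nullobj, ref nullobj) and then shut Word itself down with wordApp.Quit(ref nullobj, ref nullobj, ref nullobj).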
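
The close-down step matters because a conversion that throws part-way through would otherwise leave a hidden Word process running. Here is a minimal sketch of the same steps wrapped in try/finally. The class and method names (WordXmlExtractor, ExtractXml) are hypothetical, and opening the file read-only and discarding changes on close are assumptions of mine rather than something the steps above require.

```csharp
using System;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

static class WordXmlExtractor
{
    // Hypothetical helper: returns the XML for the whole document at 'path'.
    public static string ExtractXml(string path)
    {
        object missing = System.Reflection.Missing.Value;
        object file = path;
        object readOnly = true;   // assumption: we never need to modify the source file
        object saveChanges = Word.WdSaveOptions.wdDoNotSaveChanges;

        Word.Application wordApp = new Word.Application();
        Word.Document doc = null;
        try
        {
            // Same 15-argument call as above, with ReadOnly (third argument) set.
            doc = wordApp.Documents.Open2002(
                ref file, ref missing, ref readOnly,
                ref missing, ref missing, ref missing,
                ref missing, ref missing, ref missing,
                ref missing, ref missing, ref missing,
                ref missing, ref missing, ref missing);

            doc.ActiveWindow.Selection.WholeStory();
            return doc.ActiveWindow.Selection.get_XML(false);
        }
        finally
        {
            // Runs whether or not the extraction succeeded.
            if (doc != null)
            {
                doc.Close(ref saveChanges, ref missing, ref missing);
            }
            wordApp.Quit(ref saveChanges, ref missing, ref missing);
            Marshal.ReleaseComObject(wordApp);
        }
    }
}
```

Calling WordXmlExtractor.ExtractXml(src) would then replace the inline code. The compiler may warn that Close and Quit are ambiguous between a method and an event of the same name; the method is the one that gets called.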

Now I just need to find an OpenXML C# library that can help me make sense of the resulting XML string!
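
In the meantime, the built-in System.Xml classes are enough for a first look at the string. The sketch below pulls out the plain text of each paragraph. It assumes paragraphs are elements named p and text runs are elements named t, as in WordprocessingML, and it matches on local names so it does not depend on which namespace URI Word emits. WordXmlText and ExtractParagraphs are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

static class WordXmlText
{
    // Hypothetical helper: returns the plain text of each paragraph found in 'xml'.
    public static List<string> ExtractParagraphs(string xml)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(xml);

        List<string> paragraphs = new List<string>();

        // Assumption: every element whose local name is "p" is a paragraph.
        foreach (XmlNode paragraph in document.SelectNodes("//*[local-name()='p']"))
        {
            StringBuilder text = new StringBuilder();

            // Concatenate the text runs ("t" elements) inside this paragraph.
            foreach (XmlNode run in paragraph.SelectNodes(".//*[local-name()='t']"))
            {
                text.Append(run.InnerText);
            }
            paragraphs.Add(text.ToString());
        }
        return paragraphs;
    }
}
```

Passing the ‘xml’ string from the steps above to WordXmlText.ExtractParagraphs(xml) gives one string per paragraph. Formatting, tables and images are ignored, and a paragraph nested inside another (in a text box, say) would be counted twice, so this is only a starting point for the post-processing.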

👋 I'm Paul Kiddie, a software engineer working in London. I'm currently working as a Principal Engineer at trainline.