David R. Heffelfinger
OpenOffice.org Document Version Control With Mercurial
I've always wanted to put my documentation under version control, just as I do with my source code. However, word processor files are binary, which makes them poorly suited to version control (track changes aside). They can be committed, of course, but as binaries they can't be diffed easily.
Standard OpenDocument Text files (the default format for OpenOffice.org Writer since version 2) are nothing but zipped XML files. I searched for an easy, automated way to unzip and rezip them "on the fly" as necessary, thinking that I could put the "raw" XML files under version control. However, I couldn't find anything that would help in that regard, and manually zipping and unzipping files seemed like more trouble than it was worth.
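To make the unzip/zip round trip concrete, here is a minimal Python sketch using only the standard library. It is an illustration of the idea rather than a tool I actually used: the function names unpack_odt and pack_odt are hypothetical, and the sketch assumes the usual OpenDocument packaging convention that the "mimetype" entry comes first in the archive and is stored uncompressed.

```python
import zipfile
from pathlib import Path


def unpack_odt(odt_path, target_dir):
    """Extract every member of an .odt archive into target_dir."""
    with zipfile.ZipFile(odt_path) as archive:
        archive.extractall(target_dir)


def pack_odt(source_dir, odt_path):
    """Zip the files under source_dir back into an .odt archive.

    The "mimetype" entry is written first and uncompressed (assumed
    packaging convention); everything else is deflated.
    """
    source = Path(source_dir)
    mimetype = source / "mimetype"
    with zipfile.ZipFile(odt_path, "w", zipfile.ZIP_DEFLATED) as archive:
        if mimetype.exists():
            archive.write(mimetype, "mimetype",
                          compress_type=zipfile.ZIP_STORED)
        # Sorted order keeps the archive layout stable between runs.
        for path in sorted(source.rglob("*")):
            if path.is_file() and path != mimetype:
                archive.write(path, path.relative_to(source).as_posix())
```

With something like this, the unpacked directory (content.xml, styles.xml, and so on) would be what gets committed, and pack_odt would rebuild the document when it is needed in Writer.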
OpenOffice.org's word processor, Writer, can save in text-based formats such as DocBook XML, Microsoft Word 2003 XML, and OpenDocument Text Flat XML (.fodt). I figured I could use one of these formats internally: since they are text based, they would be "diffable" by Mercurial (or any other version control tool). Then, when I needed to distribute the document, I could export it to Word format, PDF, or what have you.
I hadn't had the opportunity to work with DocBook before, and I admit I had been curious about it, so I tried this option first. Unfortunately, it turned out I couldn't use this format: I frequently work with Word templates (they work fine in Writer, even though I use OpenOffice.org), and DocBook doesn't seem to support them.
I then turned my attention to the OpenDocument Text Flat XML (.fodt) format. This format can work with Word templates, and it is saved as a plain text (XML) file. It looked like the perfect solution. To test it out, I created a simple document, saved it as OpenDocument Text Flat XML, and committed it to a Mercurial repository. I then made a simple change to the document and ran hg diff on it.
To my dismay, this very simple change (I just added a new paragraph containing a single sentence) resulted in quite a number of diffs between the two versions. Apparently this format contains a bunch of metadata such as the creation time, the creator, and the time the file was last saved. This metadata produced a number of diffs that were irrelevant to the task at hand, which was to find out what change I had actually made to the file.
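One way to cut down on this noise would be to filter the metadata out before diffing. The following is only a sketch under stated assumptions: it assumes the metadata lives in a single <office:meta> element, as it does in flat OpenDocument files, and the name strip_metadata is hypothetical. It uses a regular expression rather than an XML parser so that the rest of the file is left byte-for-byte untouched.

```python
import re

# Matches the whole <office:meta> ... </office:meta> block, across lines.
META_BLOCK = re.compile(r"<office:meta\b[^>]*>.*?</office:meta>\s*", re.DOTALL)


def strip_metadata(fodt_text):
    """Return the flat XML text with the <office:meta> block removed."""
    return META_BLOCK.sub("", fodt_text)
```

Other volatile sections, such as the view settings Writer stores with the document, may need similar treatment; I have not verified exactly which ones change on every save.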
At this point I considered using the oodiff trick described under "Handling OpenDocument Files" on the Mercurial site; however, this trick seemed to me more like a hack than a proper solution. With this approach, files are checked in as binaries; then, when diffing, a tool called odt2txt converts each document to plain text "on the fly" and the plain text versions are diffed. The problem with this approach is that the files are still committed to version control as binaries, and most version control tools are not very efficient at storing binary files.
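The convert-then-diff idea can be sketched in a few lines of Python. This is not the oodiff script from the Mercurial site, just an illustration of the same technique: it assumes the odt2txt command is installed and on the PATH, and the function names odt_to_text and text_diff are hypothetical.

```python
import difflib
import subprocess


def odt_to_text(path):
    """Convert an .odt file to plain text by calling odt2txt (assumed on PATH)."""
    result = subprocess.run(["odt2txt", path],
                            capture_output=True, text=True, check=True)
    return result.stdout


def text_diff(old_path, new_path, convert=odt_to_text):
    """Return a unified diff of the plain text versions of two documents.

    The convert parameter exists so another converter can be swapped in.
    """
    old_lines = convert(old_path).splitlines()
    new_lines = convert(new_path).splitlines()
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_path, tofile=new_path,
                                     lineterm=""))
```

The repository still stores the binary files; only the comparison happens on text.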
At this point I started using the above trick; recently, however, I found the color extension for Mercurial, which allows diffs to be color coded. After I installed this extension, I gave the .fodt format another try, and I started to notice patterns to look for when reading diffs. For example, paragraphs are nested inside a <text:p> tag, which makes it easy to find text changes. Images are stored inside a <draw:image> tag, which makes it straightforward to see whether an image was added, deleted or moved. Table cells use the <table:table-cell> tag, making them fairly easy to identify. This seemed like a good solution, but after a while I noticed that sometimes a simple change to the document (for example, adding a heading somewhere in the middle) produced a large number of diffs again. For example, lines that were now farther down in the document were reported as deleted from one place and added in another, which is inaccurate.
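Since the text I care about sits in paragraph and heading elements, one possible workaround is to extract just those and diff the resulting lines instead of the raw XML. The sketch below is an assumption-laden illustration, not something I have used in practice: it relies on the standard OpenDocument text namespace, treats <text:p> and <text:h> as the only elements of interest, and the function names paragraphs and paragraph_diff are hypothetical.

```python
import difflib
import xml.etree.ElementTree as ET

# Namespace URI used for text:p and text:h in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def paragraphs(fodt_data):
    """Return the text of each non-empty paragraph or heading, in document order."""
    root = ET.fromstring(fodt_data)
    lines = []
    for element in root.iter():
        if element.tag in PARAGRAPH_TAGS:
            text = "".join(element.itertext()).strip()
            if text:
                lines.append(text)
    return lines


def paragraph_diff(old_fodt, new_fodt):
    """Unified diff of two flat XML documents, one line per paragraph."""
    return list(difflib.unified_diff(paragraphs(old_fodt),
                                     paragraphs(new_fodt),
                                     fromfile="old", tofile="new",
                                     lineterm=""))
```

Because each paragraph becomes one line, inserting a heading in the middle should show up as a single added line rather than as a cascade of moved XML. Formatting, image and table structure changes are invisible to this kind of diff, which is the trade-off.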
For now, I have gone back to the oodiff trick. It still bothers me a bit that I am checking binary files into the repository, but this approach produces sane diffs that actually let me track what changed in the document.