Collaborative edition with docx and gitlab

When a group of persons need to write a document collaboratively, there are no really good options to do so efficiently:

  • I think the best way would be to write in latex with the source versioned in a gitlab server… But this is most of the time not an option, because it’s very rare that everybody in a group of writers master both latex and git, or is willing to learn.
  • A common solution is to use google docs, but
    • there is no versioning, or at least you do not control it
    • you can not edit offline, in the plane
    • all your data may be captured by a third party (e.g., Google)
  • Instead of googledocs, you may try OnlyOffice or Nextcloud documents, but they are much harder to install or have less functionalities, and you still may not edit offline or benefit from versioning.
  • A degraded option is to use Etherpad, but it’s far too limited for complex documents.
  • Yet another option is to send each other docx documents with the track change option activated, but it’s a lot of work for the main editor, and still, no versioning (I’m not going to talk about why versioning is important, this is for another post all by itself…)

So… You’re finally stuck with one of these not-so-good solution. Or at least, I don’t know any other (if you do, please please drop me a line !)

Nevertheless, I think that there’s one solution that is better, when at least one of writers in the group knows about gitlab and unix scripting. The idea is then that this person plays the role of the main editor and that all the other writers simply use Libreoffice or Word as they are used to, and send their edits by email to the main editor. The main editor then computes diffs introduced by each editor and update another master docx document, which is always automatically made available on a web page. Every diff must also be versioned in gitlab.

There are a few tricks to consider to implement such a process moothly:

  • The docx must not be put in gitlab, but it must be first unzipped, and passed through xmllint to version nice-looking XML files line by line.
  • When a new version of the master docx is compiled, it must include somewhere the git hash version, so that diffing asynchronously every writer change is possible. I usually use a script that automatically introduces this hash code within the footer of the docx, with a warning not to modify it.
  • For every, even minor, modification produce by a write, many lines in the XML files might actually by changed, making versioning rather inefficient.

To solve the latter issue, I suggest to actually use a single master docx file on the main editor computer, and that the editor only integrates the updates of each writer. This is well suited when only the text has been modified, because images requires more integration work. This may be achieved via a script that gets the hash from the emailed docx, checkouts the corresponding version, transforms these XML files as well as the updated XML files into markdown files with pandoc or the Document library in python, and finally prints out the diff with the editor favourite diff tool, such as vimdiff. The main editor can then further use vimdiff between the raw XML files to only report the interesting text updates into the master docx XML files.

I have used this process for writing a project proposal, and yes, it’s a bit of work for the main editor, but I think it is much less work than trying to merge documents via track change, with the added bonus of having everything versioned !

Written on November 24, 2017