Case Study in TLA+ / TLC

Case Study in TLA⁺ / TLC

As part of the course CS-986: Advanced Topics in Program Verification, we (the students and myself) used TLA+ to specify and verify properties of several concurrent algorithms. When it was possible (i.e., when processes states were finite), we also used TLC to help us debug / verify our formal models. One of these cases seemed a good enough illustration of TLC to us to possibly be of interest to other TLA⁺ and/or TLC users. This is the reason for this web page.

The starting point is a paper on "Non-blocking atomic commitment" by Babaoglu and Toueg, from Mulender's book on Distributed Systems. They present an algorithm for distributed atomic commitment that enjoys different sets of properties depending on what type of broadcast is used. They give an informal correctness proof which they claim is compositional, although I'd rather call it incremental (when the algorithm is modified, the proof is modified accordingly, some parts being reused).

I wanted to see how similar (or not) a TLA proof would be and, possibly, attempt a true compositional proof later. So, we have modeled this algorithm in TLA⁺. Although I'm still not done with the proof, I've used TLC extensively on this specification, and I think it can provide an interesting example of what the tool can (and cannot) do.

The original description assumes a reliable synchronous communication model (transmission times are bounded), and relies on it to detect faulty sites. In order to simplify the model, we switched to a reliable (implicit) asynchronous communication model where a site can either receive a message from another site (if the message was sent) or realize that this other site is down (if the message was not sent). We don't specify explicitly how this detection is achieved.

A first TLA⁺ module (ACP_SB) implements the generic algorithm from the paper with a simple broadcast (messages are simply sent successively). It satisfies some atomic commitment properties, but does not always terminate in the presence of failure.

Probably because we tried to separate safety and liveness concerns as much as possible, we ended up with a set of properties that is a bit stronger than the original. (The original specification does not force an abort decision if at least a vote is No and no failures occur. Ours does.)

ACP_SB is then extended into an ACP_NB module where the simple broadcast is replaced with a reliable broadcast. This reliable broadcast is achieved by having each participant forward the decision it receives to all other participants. Thanks to this modification, the algorithm now always terminates, even in the presence of failures.

Interestingly enough, we made a mistake when we wrote this module for the first time. It's important that a decision is not delivered locally before it has been forwarded to all other participants. If this happens (ACP_NB_WRONG), then the consistency property (all decisions are the same) is violated. We were careless, and TLC provided us with a nice counterexample that helped realize our mistake right away, before starting any proof. The output from TLC builds a prefix of a computation that falsifies invariant AC1 (and it remains reasonably readable). The last state shows one process that aborts while the other commits.

Once the problem was corrected (ACP_NB), TLC was able to verify the safety property of the algorithm for 4 participants and the liveness for 3 participants. The verification of safety for 4 participants took 9 days, 14 hours and 25 minutes on my desk machine. Even though the machine was used at other tasks at the same time, it confirms that TLC is to be considered more as a debugger than a verification tool. In its debugging role, TLC was quite good.

Here is the list of TLA⁺ and TLC files that were used:

ACP_SB

ACP_NB

ACP_NB_WRONG

Following are some other files that were used and that could be of interest to someone:

By default, TLC stores the graph of generated states in a subdirectory "states" of the current directory. As this directory can be quite huge, this is inconvenient if TLA files are in your home directory, and that directory is on a small partition or possibly accessed through NFS. TLC should offer a possibility to choose where to store those states, but until it does, I have added this option to TLC's command-line. Here is the modified TLC.java. This is the only file that is modified (without any authorization from TLC's developers, I'm afraid). When TLC is compiled with this file, it offers a -dir <dirname> option. Everything works fine with TLC as of version 1.63 of Jan 4, 2002 (the modification is minor anyway).

As I could never remember the capitalization and the options of TLA tools, I use bash functions to launch them:

# TLA tools
function tlatex () {
    java -showversion tlatex.TLA -shade \
         -textwidth 550 -textheight 725 -hoffset -100 -voffset -100 $*
}
function sany () {
    java -showversion tlasany.SANY $*
}
function tlc () {
    java -showversion -Xmx1024m tlatk.TLC -dir /var/tmp/states $*
}
function tlapdf () {
    tlatex -latexCommand pdflatex $*
}

Here is the bash script used to generate a curve of the number of states in queue from TLC's output, such as this one. I find it convenient because the shape of the curve in the middle of a computation gives me a rough idea of the amount of time still required before completion. The script is horrible, though, as it generates a temporary file with numbers from 5 to 50000, one by line. There must be a way to avoid that, but I don't know enough gnuplot for that.

Naturally, comments on this case study and feedback on similar experiments are welcome. Drop me a line and let me know what you think...

Michel Charpentier <

Last modified: Mon May 5 17:30:41 EDT 2003