Running TLC in Distributed Mode

Contents
  Different modes of Distributed TLC
  How Distributed TLC in Ad Hoc Mode Works
  Running the Master
  Running the Slave Computers
  Limitations of Distributed Mode 
  Getting and Giving Help

Different modes of Distributed TLC

The TLA+ Toolbox can run distributed TLC in several modes (ad hoc, Azure, aws-ec2), which fall into two categories: ad hoc mode and Cloud Distributed TLC. The former (ad hoc) runs TLC on a set of your own computers; it is up to you to install all necessary prerequisites (see below) on each of them. The latter (Cloud Distributed TLC) moves Distributed TLC into the cloud and can thus automate all the installation steps of ad hoc mode for you, at the price of the service charges of the cloud compute provider. For Cloud Distributed TLC, please switch over to its dedicated documentation.

How Distributed TLC in Ad Hoc Mode Works

TLC has a collection of worker threads that do the actual model checking--that is, they compute the graph of reachable states and check invariants and other safety properties.  There is also a master thread that coordinates the worker threads.  In ordinary (non-distributed) mode, these threads all run on the same Java Virtual Machine (JVM) on the same computer as the Toolbox.  In distributed mode, the threads can be run on multiple computers.  The master thread runs on the same computer as the Toolbox--a computer we call here the master computer.  The worker threads are run on a collection of slave computers.  Each slave computer can run multiple worker threads--the default is to run as many threads as it has processors (cores).

TLC keeps fingerprints of all states that it has found, which it uses to determine if a newly computed state has already been examined.   It writes the set of fingerprints to disk if it becomes too big to fit in memory.  Having to read fingerprints from disk slows TLC down considerably.  As explained below, you can have distributed TLC distribute the set of fingerprints to multiple slave computers (instead of keeping them on the master computer).

Unless you're just trying it out, you're running TLC in distributed mode because your model is quite large.  For any large model, it's a good idea to give TLC as much of the computer's memory as is not needed by the operating system and other programs that will be running at the same time.  This applies to TLC being run on the slaves and on the master if slave fingerprint servers are not used. 

Running the Master

TLC is run on a model.  To specify that it is to be run in distributed mode, select "ad hoc" in the Run in distributed mode drop down list in the How to run? section of the Model Overview Page. That section also allows you to adjust the amount of memory allocated to the master.

Your master computer is likely to have multiple network interface cards and thus several different IP addresses. The Master's network address drop down list shows all IP addresses the Toolbox could identify. Choose the one to which the workers will be able to connect. The Toolbox tries to guess the best match and selects it by default.
It is up to you to configure your master's firewall to allow the workers incoming connections.

You start a TLC run in distributed mode as usual by clicking on the run button, by selecting Run model on the TLC Model Checker menu, or by typing F11.  This should cause the Current status field of the Model Checking Results Page to change to Master is waiting for (remote) workers and then to indicate that one or more workers have registered with the master.

If you want TLC to distribute the fingerprint set among slave computers, you must run a (single) fingerprint server on each of them.  Your model must tell TLC how many fingerprint servers it will use.  If you want fingerprints stored on 5 fingerprint servers (5 slave computers), increase the Number of distributed fingerprint sets spinner to 5.  Setting it to 0 means that the master stores the fingerprints; the master uses more memory in this mode.

For distributed mode, you may also want to add special arguments to the command that launches the JVM that runs the master.  As explained below, one such argument may be needed for the master and the workers to communicate.  The arguments are put in the JVM arguments field of the model's TLC Options page, which may be opened via a link on the Model Overview page.
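
For example, to give the master a large heap and to pin the address it advertises to the workers (see A Possible Problem below), the JVM arguments field might contain something like the following line, where the values are purely illustrative:

   -Xmx24G -Djava.rmi.server.hostname=192.168.1.10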

If your model will run for a long time (days or weeks), you may not want to keep the Toolbox open for the entire run.  Instead, you can run TLC from a command line.  The easiest way to do this is to create the model in the Toolbox and validate it.  You can then run distributed TLC with the following command, where tool-path is the complete pathname of the directory containing the file tla2tools.jar and model-path is the complete pathname of the directory Spec.toolbox/ModelName, with Spec the specification's name and ModelName the model's name (see How TLC is Run):

   java  -cp tool-path  tlc2.tool.distributed.TLCServer  model-path/MC
Like the JVM argument -cp tool-path, other JVM arguments can come between java and tlc2.tool.distributed.TLCServer in this command. TLC options follow model-path/MC.
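
For example, to give the master 24 gigabytes of heap (again an illustrative value), the command becomes:

   java  -Xmx24G  -cp tool-path  tlc2.tool.distributed.TLCServer  model-path/MC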

Running the Slave Computers

The following are the basic steps that must be accomplished to start the slaves.  The details of how these steps are performed will vary according to the operating system and network configuration that you are using.  If you have trouble getting distributed TLC to work on your system, try finding help on the TLA+ Google group.

To run the slaves, you must have a network of computers that can communicate with one another and with the machine running the master and the Toolbox.  We assume that there is some remote management/administration system installed on the computers that run the workers.  Examples of such a system are Remote Desktop (Windows), ssh, rsh, and telnet.  A Java Runtime Environment (JRE), version 11 or later, must also be installed on the machines.
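
To confirm this prerequisite, you can ask the installed Java runtime on each slave for its version; it should report version 11 or later:

      java -version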

Start the Toolbox on the master computer, and then open a web browser to the URL

      http://master-computer:10996
  
where master-computer is the name (or IP address) of the master computer.

The resulting web page will have further instructions concerning running the slaves; the gist of them is the following:

With the Toolbox running on the master computer, do the following in a shell on each slave computer:
     wget http://master-computer:10996/files/tla2tools.jar
  
Then execute, in that slave's shell, the java command shown on the same web page for the role the slave is to play (worker, fingerprint server, or both).
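
As a sketch--the class names below are the entry points found in recent TLC releases; rely on the web page for the exact command for your version of the Toolbox--starting a worker looks roughly like

      java  -cp tla2tools.jar  tlc2.tool.distributed.TLCWorker  master-computer

and starting a fingerprint server looks roughly like

      java  -cp tla2tools.jar  tlc2.tool.distributed.fp.DistributedFPSet  master-computer

where master-computer is again the name (or IP address) of the master computer.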

The wget command downloads the file tla2tools.jar into the current directory on the worker machine; the java command actually starts the worker.  The wget command therefore has to be executed only the first time you run a worker, and again whenever you install a new version of the Toolbox.  The wget command is not part of Windows, but it can be installed on Windows as part of Cygwin.

You can add JVM arguments to the java command, such as the argument -Xmx7G that gives the slave 7 gigabytes of memory.
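
Using the worker sketch above, such an argument is placed before the main class name; for example:

      java  -Xmx7G  -cp tla2tools.jar  tlc2.tool.distributed.TLCWorker  master-computer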

By default, each slave that runs workers runs as many worker threads as it has processors.  Running that many threads on the slave might cause problems with some operating systems.  The following JVM argument causes each slave to run 2 threads:

   -Dtlc2.tool.distributed.TLCWorker.threadCount=2
(This JVM argument applies only to the slaves; it has no effect if passed to the master process.)

A Possible Problem

The master-computer name you use in the instructions above for running the workers is usually a name like jones-home-laptop or tla.msr-inria.inria.fr.  However, some computers have multiple network interfaces with different IP addresses--for example, 192.168.1.10 and 10.10.0.1.  If the master computer is such a computer, then using this name as the master-computer name in the instructions above may prevent the workers from communicating with the master, because the two sides are using different IP addresses.  To solve this problem, choose one of those IP addresses--say 192.168.1.10--use that address as the master-computer name in the instructions above for running the workers, and add the following JVM argument to the model:
   -Djava.rmi.server.hostname=192.168.1.10

Limitations of Distributed Mode

When run in distributed mode, TLC cannot check liveness properties. Nor does it use depth-first mode or simulation mode, even if one of those modes is selected on the TLC Options Page.  Coverage information is not provided.  You should therefore check a smaller model in ordinary non-distributed mode to make sure that the specification specifies what you think it does. 

Getting and Giving Help

You may want to run TLC in distributed mode with many workers.  (Preliminary tests suggest that linear speedup is possible even with hundreds of workers.)  Manually starting so many workers is not practical.  It is possible to write scripts that automate the installation of the JRE as well as the steps described above for copying the tla2tools.jar file and starting the workers.  How this is done will depend on the operating system and network configuration.  If you need help, try going to the TLA+ Google group.  If you have successfully run TLC in distributed mode, please use that group to tell others how you did it on your system.