Debugging a Java multi-threaded program
We consider the Java program described in
https://www.cs.unh.edu/~charpov/programming-tlabuffer.html.
This Java program consists in several threads that communicate using a single shared buffer (class Buffer
). Some of the threads are producers that write to the shared buffer. The other threads are consumers that read data from the shared buffer.
As explained in the document at the link above, this program has a bug: the system may deadlock. This was detected on a production version of the application. The author has been able to reproduce the bug through testing. However, it required to let the system run for a very long time before the bug occured. This is unsatisfactory as this mostly relies on luck. Moreover, the resulting trace is so long that debugging from the trace is almost unachievable.
A starter
The author has modeled the system in TLA+ and used the TLC model-checker to detect the bug and get a short trace leading to the deadlock. Our goal is to reproduce his methodology in a slightly different settings:
- we will model the Java program in the higher-level language PlusCal instead of directly in TLA+ (although some definitions in TLA+ will be useful)
- we also claim that the model can be improved
Read the document at the link above. You need to fully understand the Java code (refering to the language specification), his TLA+ model, the issue that is identified by the author, and his solution. In particular, you should answer the following questions:
- Which standard data structure is implemented by the Java class
Buffer
? - Why are some methods
synchronized
and what does it imply on the runs of the program? - How do
wait()
,notify()
andnotifyAll()
work? - What are the
CONSTANTS
in the TLA+ model, and what is the goal of theASSUME
condition and theTypeInv
definition set by the author? - What is modeled by
buffer
andwaitSet
, and how are they modeled? - From the Java documentation, argue why the definitions of
Notify
,NotifyAll
andWait
model accurately the semantics of the corresponding Java methods. - Justify why the definitions
Put
andGet
correspond to the semantics of the Java methods. - Why do the
Init
,Next
andProg
definitions capture exactly the runs of the Java program? Justify - Finally, explain why model-checking the property
NoDeadlock
corresponds to searching for the bug in the author’s TLA+ model.
Modeling the buggy Java program in PlusCal
Our ultimate goal is to provide a PlusCal model that improves the TLA+ model of the author. We start from the skeleton file java_prodcons_skel.tla that builds on the ideas of the author. The goal is to model the Java code containing the bug in such a way that (1) the model behaves similary to the Java program and (2) the bug can be detected from the model using the TLC model-checker.
Modeling the data structures
A first step consists in identifying the data structures in the Java code that need to ne modeled in PlusCal.
- What are the data structures used by this Java code?
- The Java code of the
Buffer
class has two methodsput()
andget()
that do two things. First, they check if writing or reading from the buffer can be performed. Then, they update the buffer. We suggest to split these two parts, and write macros or definitions for the methodsisEmpty
,isFull
,next
,push
andpop
(or the subset that you have identified as necessary). The macrospush
andpop
perform the modifications on the buffer. - Proceed similarly for the other data structures that you have identified.
Implementing the processes
Now, we can model the processes.
- Write a PlusCal algorithm of the producer processes. It should model the behaviors of the Java
put()
method. - Similarly, the algorithm of a consummer should have the same behaviors as the Java method
get()
. - How can we detect the bug using the TLC model-checker on your model? Check that your model exhibits a similar error than the author’s TLA model for the settings described in the paper (3 producers, 2 consumers and a buffer of capacity 2).
- Replace the calls to
notify()
bynotifyAll()
in your model. Check that the bug disapperas and explain why.
Fixing the PlusCal model
Using your model, you have been able to identifiy the bug that was mentionned by the author. Our goal now is to fix the model, in such a way that we can prove it behaves correctly. In this part, we consider an instance of the system with 3 producers, 2 consumers and a buffer of capacity 2. Copy your buggy model in a new .tla
file that you will be working on from now.
- Identify a way to fix your model, that can be implemented in Java.
- Modify your PlusCal model accordingly. Then, check that your model does not exhibit the error anymore.
- One you have checked that your model is correct, implement your solution in the Java code.
Improving the model
In the previous section, you have been able to fix your model, and check that your new model is correct for 3 producers, 2 consumers and a buffer of capacity 2. Our goal is to be able to model-check bigger instances of the system.
- Check that your solution is correct for an instance with 12 producers, 10 consumers and a buffer of capacity 10. Interrupt model-checking if it has not terminated within 2 minutes.
- If verification takes more than 2 minutes, your model is uselessly complex. Identify the issue from the section “Limitations of Model Checking” in the document.
- The issues in your model can come from different parts: too many interleavings of processes which do not correspond to actual runs of the Java program, data structures that could be modeled in a more abstract way, etc. Identify parts of your model taht can be improved. Find solutions and argue that they are correct. Update your model until you manage to verify the bigger instance within 2 minutes.
- What can you conclude on your model, and on the Java program?