These three issues are all related, yet very different. When we deal with reports of problems, it’s essential to clearly identify what is actually happening. Sometimes, we need clarification by asking about the symptoms, instead of accepting what the user says without delving any deeper. The root cause of a problem may not even be what the user is experiencing directly, or what they claim they’re experiencing.
In this post, we’ll cover the similarities and differences of timeouts, blocking, and deadlocks, how they manifest themselves to an end user, and how they may cause, or lead to, each other.
Blocking (also known as live locking) occurs when a process attempts to acquire a lock on a resource, but has to wait in the locking queue before the lock is granted.
From the outside, it appears that the process is doing nothing, when in fact it’s waiting for the other process(es) to release the lock(s) ahead of it in the queue.
If a process is blocked for too long, this can lead to timeouts. If locks are acquired in a specific order, blocking can lead to deadlocks.
The thing to remember is that blocking occurs all the time to synchronize access to resources. It only becomes a problem when the amount of time a process spends waiting becomes excessive, either because it’s causing timeouts, or because it’s causing a user to wait for more time than they expect is normal. Blocking may also be experienced as a “frozen” application. Users may not complain about blocking until their work is impeded enough to frustrate them into taking action to get the problem fixed.
Timeouts occur when a client application makes a request for a resource, and while waiting for a response to be returned, decides things have taken too long, and stops waiting. This usually results in an error being raised by the client API.
Timeouts occur for a variety of reasons, including blocking, the request needing to do a ton of work on the server, or maybe the network is saturated or simply really slow. There are millions of reasons, all ultimately because the client was waiting, and gave up after a certain period of time (the timeout threshold).
An important thing to note about timeouts is that they do not cause other problems (except if the error that’s raised is handled incorrectly). In other words, timeouts are only a symptom of another problem; to solve the timeout issue, solve the issue that caused the timeout.
Since timeouts always depend on something else happening (or not happening), they can be difficult to reproduce in a testing environment. Without sufficient tracing/logging mechanisms in place, it can be difficult to track down the source of the problem. Users will usually be able to tell you that something is timing out, and probably what they were doing at the time, but finding the cause often requires some digging.
Deadlocks occur when two or more processes hold locks on resources (i.e., block each other), and also try to take locks on resources held by the other process(es).
This creates a situation such that neither process can continue unless one of them is terminated by another external process. In SQL Server, this external process is called the Deadlock Monitor, and upon detecting a deadlock, it summarily rolls back one or more of the queries to resolve the situation.
I demonstrated how deadlocks are created in a demo video in a previous post. As part of setting up the demo, I showed how blocking works in relation to deadlocks.
By definition, deadlocks are caused in part by blocking, and in part by the order in which the locks were acquired. Usually it’s very clear to users when deadlocks occur because of the error messages. In some cases, deadlocks can lead to timeouts if the external monitoring process takes too long to pick up on the fact there is a deadlock in the system (the deadlocked processes are blocked and waiting).
Fixing deadlocks is often an exercise in information gathering, because they normally aren’t reproducible except under specific circumstances. The key is setting up a tracing/logging solution to record the pertinent information when deadlocks happen, so you can analyze and fix the problems later. I explained how to do that in a previous post as well. A more proactive solution would be to determine the “hot” areas of the application, and ensure that locks are acquired in the same order in every piece of code that accesses that area.
While well-written applications may attempt to retry the operation in progress when a deadlock occurs, the question of why the deadlock occurred in the first place must still be asked and answered; tracing/logging or other solutions still apply.
As you can see, there are relationships between blocking, timeouts, and deadlocks. The next time you deal with an error report that involves one of these three phenomenon, I encourage you to dig deeper, and put in place processes that either prevent the problem from occurring in the first place, or simply record enough information to be able to fully solve the problem later.