The Art of Making Multithreading Issues Worse

Reading Time: 3 minutes
It also took me ages to find that these were dead

I recently spent days looking for a bug – a thread safety bug – that I should have found in minutes. The reason it took so long is that someone had found it before me and attempted to fix it, twice, but each time had in fact made matters worse.

What made matters even worse than that is a phenomenon that will be all too familiar to you if you handle multithreaded code a lot – the customer could make it happen every time but on the bench, back at the office with all the debug tools? Practically impossible to reproduce.

So the code started off like this, and here I’ve extracted the essence of the problem, this is not the real code. In reality the mistake was the same, but heavily obfuscated right from the start.

class AlarmManager
{
    private Dictionary alarms = new Dictionary();

    public Guid StartAlarm()
    {
        Alarm newAlarm = new Alarm();
        newAlarm.AlarmId = Guid.NewGuid();
        alarms.Add(newAlarm.AlarmId, newAlarm);
        newAlarm.AlarmWorkflow = WorkflowManager.CreateAlarmWorkflow(newAlarm.AlarmId);
        return newAlarm.AlarmId;
    }

    public void DeleteAlarm(Guid alarmId)
    {
        Alarm toRemove;
        if (alarms.TryGetValue(alarmId, out toRemove))
        {
            toRemove.AlarmWorkflow.StopAndRemove();
            alarms.Remove(alarmId);
        }
    }
}

Two schoolboy errors in there. Someone spotted the first, Dictionaries are not thread safe, so they added some locking…

class AlarmManager
{
    private Dictionary alarms = new Dictionary();

    public Guid StartAlarm()
    {
        Alarm newAlarm = new Alarm();
        newAlarm.AlarmId = Guid.NewGuid();
        lock (alarms)
        {
            alarms.Add(newAlarm.AlarmId, newAlarm);
        }
        newAlarm.AlarmWorkflow = WorkflowManager.CreateAlarmWorkflow(newAlarm.AlarmId);
        return newAlarm.AlarmId;
    }

    public void DeleteAlarm(Guid alarmId)
    {
        Alarm toRemove;
        lock (alarms)
        {
            if (alarms.TryGetValue(alarmId, out toRemove))
            {
                toRemove.AlarmWorkflow.StopAndRemove();
                alarms.Remove(alarmId);
            }
        }
    }
}

Fixed? No. This is where we hit the real problems. Someone noticed that the line toRemove.AlarmWorkflow.Stop(); was throwing null reference exceptions, so rather than investigate how toRemove.AlarmWorkflow came to be null they simply put a null check in…

public void DeleteAlarm(Guid alarmId)
{
    Alarm toRemove;
    lock (alarms)
    {
        if (alarms.TryGetValue(alarmId, out toRemove))
        {
            if (toRemove.AlarmWorkflow != null)
                toRemove.AlarmWorkflow.StopAndRemove();
            alarms.Remove(alarmId);
        }
    }
}

The reason it was occasionally null is in StartAlarm. That’s where the bug is – the new alarm is added to the dictionary and the lock released before it’s finished initialising. So if it’s deleted by another thread immediately after it’s started, the threads can interleave in a way where the alarm is removed from the dictionary with the workflow being null, then the workflow is assigned and started. As the workflow is managed by the WorkflowManager, there’s still a reference to it, hence it continues to run.

Now the issue got compounded further, because someone spotted that exceptions were still being thrown during the delete, from the StopAndRemove method. This is where my simplification falls down a bit because the real reasons for the exception are somewhat complex involving events and another access to the alarms dictionary, suffice to say however that this was not the way to solve the problem…

public void DeleteAlarm(Guid alarmId)
{
    Alarm toRemove;
    lock (alarms)
    {
        if (alarms.TryGetValue(alarmId, out toRemove))
        {
            try
            {
                if (toRemove.AlarmWorkflow != null)
                    toRemove.AlarmWorkflow.StopAndRemove();
            }
            catch 
            {
            }
            alarms.Remove(alarmId);
        }
    }
}

These two attempted fixes are part of a mindset of patching it up rather than fixing the root cause. In certain extreme circumstances patching it up may be acceptable. I’ve had to do it, but it must be highlighted that this is what has been done and that it may actually be masking the root cause or indeed causing further knock-on issues.
I find the #warning preprocessor directive useful in such circumstances.

The solution is trivial…

class AlarmManager
{
    private Dictionary alarms = new Dictionary();

    public Guid StartAlarm()
    {
        Alarm newAlarm = new Alarm();
        newAlarm.AlarmId = Guid.NewGuid();
        lock (alarms)
        {
            newAlarm.AlarmWorkflow = WorkflowManager.CreateAlarmWorkflow(newAlarm.AlarmId);
            alarms.Add(newAlarm.AlarmId, newAlarm);
        }
        return newAlarm.AlarmId;
    }

    public void DeleteAlarm(Guid alarmId)
    {
        Alarm toRemove;
        lock (alarms)
        {
            if (alarms.TryGetValue(alarmId, out toRemove))
            {
                alarms.Remove(alarmId);
                toRemove.AlarmWorkflow.Stop();
            }
        }
    }
}