September 2011 – Tom Fosdick

Code Complexity – A Complex Matter

“Oh we didn’t use your code in the end,” said an ex-colleague some time after I’d left a previous employer, “nobody could understand it.” I was a little taken aback because I’d thought the code was particularly clear given the complexity of what I was doing.

I could understand that some less experienced developers might have had trouble with it, but to someone experienced I didn’t think it would be an issue. This left me feeling distinctly sheepish – should I have written the code in the way I did? Instead of going for ultimate performance could I have simplified it, got enough of a performance gain and achieved a greater degree of maintainability?

I felt sheepish because code that nobody understands is a big development problem for two reasons. The first is that at some point someone will need to understand it, which takes time, and development time is expensive. The second is the danger that someone who doesn’t understand it (even if they think they do) modifies it and makes a mistake. People under pressure to fix bugs often have to make guesses, and sometimes they’re wrong: a change might appear to fix the issue while actually unleashing a greater demon that’s much, much harder to find.

Sometimes code does need to be complex: the earth, for instance, is not a perfect sphere, and it’s hard to avoid some pretty hideous maths if you need to work with its geometry. That’s excusable, but every effort should be made to lay the code out clearly and comment it well. What’s not acceptable is needless complexity through bad design, bad coding style, bad layout, lack of commenting or even, in some cases, developers showboating.

A good developer is not the one who writes really clever, complex code that nobody else understands. It’s the one who solves complex problems by writing simple, easy-to-understand code. Maintainability is the key: that’s what a business needs to succeed.

The Art of Making Multithreading Issues Worse

[Image: two sticks of deceased DDR2. It also took me ages to find that these were dead.]

I recently spent days looking for a bug – a thread safety bug – that I should have found in minutes. The reason it took so long is that someone had found it before me and attempted to fix it, twice, but each time had in fact made matters worse.

What made matters even worse than that is a phenomenon that will be all too familiar to you if you handle multithreaded code a lot – the customer could make it happen every time but on the bench, back at the office with all the debug tools? Practically impossible to reproduce.

The code started off like this. I’ve extracted the essence of the problem here; this is not the real code. In reality the mistake was the same, but heavily obfuscated right from the start.

class AlarmManager
{
    private Dictionary<Guid, Alarm> alarms = new Dictionary<Guid, Alarm>();

    public Guid StartAlarm()
    {
        Alarm newAlarm = new Alarm();
        newAlarm.AlarmId = Guid.NewGuid();
        alarms.Add(newAlarm.AlarmId, newAlarm);
        newAlarm.AlarmWorkflow = WorkflowManager.CreateAlarmWorkflow(newAlarm.AlarmId);
        return newAlarm.AlarmId;
    }

    public void DeleteAlarm(Guid alarmId)
    {
        Alarm toRemove;
        if (alarms.TryGetValue(alarmId, out toRemove))
        {
            toRemove.AlarmWorkflow.StopAndRemove();
            alarms.Remove(alarmId);
        }
    }
}

Two schoolboy errors in there. Someone spotted the first: a Dictionary is not thread safe, so they added some locking…

class AlarmManager
{
    private Dictionary<Guid, Alarm> alarms = new Dictionary<Guid, Alarm>();

    public Guid StartAlarm()
    {
        Alarm newAlarm = new Alarm();
        newAlarm.AlarmId = Guid.NewGuid();
        lock (alarms)
        {
            alarms.Add(newAlarm.AlarmId, newAlarm);
        }
        newAlarm.AlarmWorkflow = WorkflowManager.CreateAlarmWorkflow(newAlarm.AlarmId);
        return newAlarm.AlarmId;
    }

    public void DeleteAlarm(Guid alarmId)
    {
        Alarm toRemove;
        lock (alarms)
        {
            if (alarms.TryGetValue(alarmId, out toRemove))
            {
                toRemove.AlarmWorkflow.StopAndRemove();
                alarms.Remove(alarmId);
            }
        }
    }
}

Fixed? No. This is where we hit the real problems. Someone noticed that the line toRemove.AlarmWorkflow.StopAndRemove(); was throwing null reference exceptions, so rather than investigate how toRemove.AlarmWorkflow came to be null they simply put a null check in…

public void DeleteAlarm(Guid alarmId)
{
    Alarm toRemove;
    lock (alarms)
    {
        if (alarms.TryGetValue(alarmId, out toRemove))
        {
            if (toRemove.AlarmWorkflow != null)
                toRemove.AlarmWorkflow.StopAndRemove();
            alarms.Remove(alarmId);
        }
    }
}

The reason it was occasionally null is in StartAlarm. That’s where the bug is – the new alarm is added to the dictionary and the lock released before it’s finished initialising. So if it’s deleted by another thread immediately after it’s started, the threads can interleave in a way where the alarm is removed from the dictionary with the workflow being null, then the workflow is assigned and started. As the workflow is managed by the WorkflowManager, there’s still a reference to it, hence it continues to run.
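The interleaving is easier to see when it is forced rather than left to the scheduler. What follows is a minimal sketch, not the real code: Workflow, WorkflowManager.Active, RacyAlarmManager, the betweenSteps callback and RaceDemo are all hypothetical stand-ins invented for illustration. The callback simply marks the point where another thread could be scheduled, after the lock is released but before the workflow is assigned.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the types the post mentions but does not show.
class Workflow
{
    public bool Running = true;

    public void StopAndRemove()
    {
        Running = false;
        WorkflowManager.Active.Remove(this);
    }
}

static class WorkflowManager
{
    // The manager keeps its own references, so a workflow can outlive its alarm.
    public static readonly List<Workflow> Active = new List<Workflow>();

    public static Workflow CreateAlarmWorkflow(Guid alarmId)
    {
        var workflow = new Workflow();
        Active.Add(workflow);
        return workflow;
    }
}

class Alarm
{
    public Guid AlarmId;
    public Workflow AlarmWorkflow;
}

class RacyAlarmManager
{
    private readonly Dictionary<Guid, Alarm> alarms = new Dictionary<Guid, Alarm>();

    public int Count { get { lock (alarms) { return alarms.Count; } } }

    // betweenSteps stands in for "another thread gets scheduled here".
    public Guid StartAlarm(Action<Guid> betweenSteps)
    {
        Alarm newAlarm = new Alarm();
        newAlarm.AlarmId = Guid.NewGuid();
        lock (alarms)
        {
            alarms.Add(newAlarm.AlarmId, newAlarm);
        }
        betweenSteps(newAlarm.AlarmId); // lock already released: the race window
        newAlarm.AlarmWorkflow = WorkflowManager.CreateAlarmWorkflow(newAlarm.AlarmId);
        return newAlarm.AlarmId;
    }

    public void DeleteAlarm(Guid alarmId)
    {
        Alarm toRemove;
        lock (alarms)
        {
            if (alarms.TryGetValue(alarmId, out toRemove))
            {
                if (toRemove.AlarmWorkflow != null) // the null check hides the race
                    toRemove.AlarmWorkflow.StopAndRemove();
                alarms.Remove(alarmId);
            }
        }
    }
}

static class RaceDemo
{
    // Returns the number of workflows still running with no alarm tracking them.
    public static int OrphanedWorkflows()
    {
        WorkflowManager.Active.Clear();
        var manager = new RacyAlarmManager();
        // The delete lands inside the race window, before the workflow exists.
        manager.StartAlarm(id => manager.DeleteAlarm(id));
        return manager.Count == 0 ? WorkflowManager.Active.Count : 0;
    }
}
```

Calling RaceDemo.OrphanedWorkflows() returns 1 in this sketch: the dictionary is empty, yet one workflow is still held by the manager and nothing is left that could stop it. The null check did not fix anything; it only stopped the exception that was pointing at the problem.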

Now the issue got compounded further, because someone spotted that exceptions were still being thrown during the delete, from the StopAndRemove method. This is where my simplification falls down a bit, because the real reasons for the exception are somewhat complex, involving events and another access to the alarms dictionary. Suffice to say, however, that this was not the way to solve the problem…

public void DeleteAlarm(Guid alarmId)
{
    Alarm toRemove;
    lock (alarms)
    {
        if (alarms.TryGetValue(alarmId, out toRemove))
        {
            try
            {
                if (toRemove.AlarmWorkflow != null)
                    toRemove.AlarmWorkflow.StopAndRemove();
            }
            catch 
            {
            }
            alarms.Remove(alarmId);
        }
    }
}

These two attempted fixes are part of a mindset of patching it up rather than fixing the root cause. In certain extreme circumstances patching it up may be acceptable. I’ve had to do it, but it must be highlighted that this is what has been done and that it may actually be masking the root cause or indeed causing further knock-on issues.
I find the #warning preprocessor directive useful in such circumstances.
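As a rough illustration of how #warning can flag a patch, here is a small sketch. LegacyPatch and TryStop are hypothetical names, and the choice of InvalidOperationException is an assumption made only for the example. The directive makes the compiler emit a warning (CS1030) on every build, so the patch cannot be quietly forgotten.

```csharp
using System;

static class LegacyPatch
{
    // Hypothetical example: tolerate a failure we do not yet understand,
    // but make every build remind us that this is a patch, not a fix.
    public static bool TryStop(Action stop)
    {
        try
        {
            stop();
            return true;
        }
        catch (InvalidOperationException ex)
        {
#warning Patch only: swallowing InvalidOperationException from stop until the root cause is found
            Console.Error.WriteLine("Suppressed while stopping: " + ex.Message);
            return false;
        }
    }
}
```

Unlike an empty catch block, this catches one specific exception type, logs it, and reports the failure to the caller. Note that a project built with warnings treated as errors will fail to compile while the directive is in place, which may or may not be what you want.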

The solution is trivial…

class AlarmManager
{
    private Dictionary<Guid, Alarm> alarms = new Dictionary<Guid, Alarm>();

    public Guid StartAlarm()
    {
        Alarm newAlarm = new Alarm();
        newAlarm.AlarmId = Guid.NewGuid();
        lock (alarms)
        {
            newAlarm.AlarmWorkflow = WorkflowManager.CreateAlarmWorkflow(newAlarm.AlarmId);
            alarms.Add(newAlarm.AlarmId, newAlarm);
        }
        return newAlarm.AlarmId;
    }

    public void DeleteAlarm(Guid alarmId)
    {
        Alarm toRemove;
        lock (alarms)
        {
            if (alarms.TryGetValue(alarmId, out toRemove))
            {
                alarms.Remove(alarmId);
                toRemove.AlarmWorkflow.StopAndRemove();
            }
        }
    }
}

Code Complexity – A Complex Matter

The Art of Making Multithreading Issues Worse