C#’s IEnumerable – Two Simple Gotchas

Dummies in suits...

Dummies in suits…

The other day I saw someone do something very much like this and it made me cringe a little. There’s no problem with this when the number of items that are being dealt with is small, but as the number of items increases performance will take a spectacular nosedive.

        static string[] composers = {"Pyotr Ilyich Tchaikovsky",
                                    "Gabriel Fauré",
                                    "Sergei Sergeyevich Prokofiev",
                                    "Jean Sibelius",
                                    "Ludwig van Beethoven",
                                    "Benjamin Britten",
                                    "Erik Satie"};


        static void Main(string[] args)
        {
            Console.WriteLine(" L I S T I N G   S H O R T   N A M E D   C O M P O S E R S");
            Console.WriteLine(" =========================================================");
            IEnumerable<string> shortComposers = composers.Where(x => x.Length < 15);
            for (int i = 0; i < shortComposers.Count(); i++)
            {
                Console.WriteLine("Composer {0} => {1}", i, shortComposers.ElementAt(i));
            }


            Console.Write(" - PRESS ANY KEY TO EXIT - ");
            Console.ReadKey();
        }

What’s the problem? Well in this trivial example nothing at all, but if the list of names were a bit longer then we could find that this code runs very slowly indeed. There are two problems.

C#’s for loop combined with IEnumerable’s .Count() method is somewhat toxic.
In order to understand why we need to look at what .NET is doing behind the scenes. Originally I used the rather excellent IL Spy to decompile part of the .NET Framework itself.
Since then Microsoft have actually published the source – so let’s take a look at what .Count() really does.

public static int Count<TSource>(this IEnumerable<TSource> source) 
{
    if (source == null) 
        throw Error.ArgumentNull("source");

    ICollection<TSource> collectionoft = source as ICollection<TSource>;
    if (collectionoft != null) 
        return collectionoft.Count;

    ICollection collection = source as ICollection;
    if (collection != null) 
        return collection.Count;

    int count = 0;
    using (IEnumerator<TSource> e = source.GetEnumerator()) 
    {
        checked 
        {
            while (e.MoveNext()) count++;
        }
    }
    return count;
}

(original here)

So if it’s a ICollection<T> or a plain ICollection it returns the Count property of the object, if not it gets the enumerator and counts up every item in the IEnumerable.

Now we have to look at C#’s for loop and how it’s evaluated. If we look at the C# 4.0 language specification it says;

A for statement is executed as follows:

  • If a for-initializer is present, the variable initializers or statement expressions are executed in the order they are written. This step is only performed once.
  • If a for-condition is present, it is evaluated.
  • If the for-condition is not present or if the evaluation yields true, control is transferred to the embedded statement. When and if control reaches the end point of the embedded statement (possibly from execution of a continue statement), the expressions of the for-iterator, if any, are evaluated in sequence, and then another iteration is performed, starting with evaluation of the for-condition in the step above.
  • If the for-condition is present and the evaluation yields false, control is transferred to the end point of the for statement.

Or in other words every time the loop goes round the Count() method is called. So if there are 1000 items in the list then the Count() method will be called 1000 times. Each time the Count() method itself is called it will get the enumerator and count up 1000. So that’s more than 1,000,000 operations that are performed when only 1000 should have been.

It’s easy to fix…

            int shortComposersCount=shortComposers.Count();
            for (int i = 0; i < shortComposersCount; i++)
            {
                Console.WriteLine("Composer {0} => {1}", i, shortComposers.ElementAt(i));
            }

Now there are 2000ish operations performed because the Count() method is only called once. Note that there is some overhead in creating the shortComposersCount variable, so for small numbers you may find that it’s faster just to leave the Count() where it was. If you do leave it as was though be aware that performance will degrade exponentially as the count increases.

.ElementAt() is inefficient
To look at the second problem let’s fire up the Dot Net Reference Source again and take a look at what .ElementAt() does.

public static TSource ElementAt<TSource>(this IEnumerable<TSource> source, int index) 
{
    if (source == null) 
        throw Error.ArgumentNull("source");

    IList<TSource> list = source as IList<TSource>;
    if (list != null) 
        return list[index];

    if (index < 0) 
        throw Error.ArgumentOutOfRange("index");

    using (IEnumerator<TSource> e = source.GetEnumerator()) 
    {
        while (true) 
        {
            if (!e.MoveNext()) 
                 throw Error.ArgumentOutOfRange("index");
            if (index == 0) 
                return e.Current;
            index--;
        }
    }
}

(original here)

Having seen what Count() does that won’t surprise you. If the IEnumerable doesn’t implement IList then ElementAt() gets the enumerator and calls MoveNext() the requisite number of times to move through to the specified position, then returns the current item. This is not good – we’re performing a whole heap of unnecessary operations and the performance drop off as the number of items increases will be horrid.

So we need a bit of a rethink – in this case simply using a foreach instead of a for fixes everything neatly.

            int i=0;
            foreach (string shortComposer in shortComposers)
            {
                Console.WriteLine("Composer {0} => {1}", i++, shortComposer);
            }

This way the enumerator is only got and enumerated once – the performance improvement of this way of doing it over the original for loop implementation will be an eyebrow-raiser.

Comparing this with an array

But what if we have to use a for loop (for some other reason)?

I won’t go into the details of the implementation of the C# array but…

  • It has a Length property (not a method) that directly returns the size of the array, it doesn’t do any counting.
  • It (obviously) has an indexer property that directly returns an item from an array – again no counting involved.

So an array also solves both of our problems…

            string[] shortComposers = composers.Where(x => x.Length < 15).ToArray();            
            for (int i = 0; i < shortComposers.Length; i++)
            {
                Console.WriteLine("Composer {0} => {1}", i, shortComposers[i]);
            }

This isn’t bad, but you must be aware that the instantiation of a new object – the array – introduces quite a bit of overhead. You may find that using this construct is slower than the original in some surprising circumstances. For instance if the item is a value type (such as a struct or string) then the overhead of creating an array of them can easily be more than the cost of doing all those extra operations. Reference types (e.g. objects) will be less prone to this.

It's not safe!

It’s not safe!

Conclusion
Be aware that a lot of LINQ operations end up iterating through item by item performing some operation or other. This fact can be hidden in innocuous looking method calls such as ElementAt(). The number of unnecessary operations can very quickly get out of hand, especially when combined with other iterative operations.

If in doubt, use the source. If you haven’t got the source for whatever you’re using then, if it’s legal to do so where you live, crank up your favourite .NET decompiler and find out what’s going on underneath.