Tuesday, April 10, 2018

String comparison, easy as a == b, right? Wrong!

I have recently used a lot of string comparison. Comparing strings is not as easy as one might think. The string comparison should also be done fast when you're dealing with massive amounts of text. I have needed both case-sensitive and case-insensitive string comparisons. I've also needed to determine if a string is a substring of the other string. I have used a few methods and their variants.

To determine if a string is a substring of another string I've used two methods:
- Contains method from the string class: .Contains()
- IndexOf method from the string class: .IndexOf()

Both methods do case-sensitive matching of the strings. However, IndexOf method has an overloaded version which does a case-insensitive matching:
- .IndexOf(, 0, StringComparison.OrdinalIgnoreCase)
Another method to do case-insensitive matching is to use ToLower() method (or ToUpper()) with both strings. Since I needed fast string comparisons I started to wonder if an extra ToLower() method call would cause a huge time penalty.

I decided to compare four methods and variants:
- case-sensitive Contains method: .Contains()
- case-sensitive IndexOf method: .IndexOf()
- case-insensitive IndexOf method with ToLower method: .ToLower().IndexOf(.ToLower())
- case-insensitive .IndexOf(, 0, StringComparison.OrdinalIgnoreCase)

I wrote a small CSharp console application that provides comparisons above and would get accurate enough timing of the comparisons. To get measurable timings each comparison variant was repeated in a loop.

So here is the code:

class Program
{
    public static string stringToSearch;
    public static string text = "A Function to search for";
    public static bool match = false;
    public static DateTime startTime;
    public static TimeSpan elapsedTime;

    public static int Method1(int loops, string stringToSearch)
    {
        startTime = DateTime.Now;
        for(int i = 0; i < loops; i++)
        { 
            match = text.Contains(stringToSearch);
        }
        elapsedTime = DateTime.Now.Subtract(startTime);
        Console.WriteLine("Contains(" + stringToSearch + "): " + match.ToString());
        Console.WriteLine("Elapsed: " + (int)elapsedTime.TotalMilliseconds);
        return (int)elapsedTime.TotalMilliseconds;
    }

    public static int Method2(int loops, string stringToSearch)
    {
        startTime = DateTime.Now;
        for (int i = 0; i < loops; i++)
        {
            match = text.IndexOf(stringToSearch) >= 0;
        }
        elapsedTime = DateTime.Now.Subtract(startTime);
        Console.WriteLine("IndexOf(" + stringToSearch + "): " + match.ToString());
        Console.WriteLine("Elapsed: " + (int)elapsedTime.TotalMilliseconds);
        return (int)elapsedTime.TotalMilliseconds;
    }

    public static int Method3(int loops, string stringToSearch)
    {
        startTime = DateTime.Now;
        for (int i = 0; i < loops; i++)
        {
            match = text.ToLower().IndexOf(stringToSearch.ToLower()) >= 0;
        }
        elapsedTime = DateTime.Now.Subtract(startTime);
        Console.WriteLine("IndexOf(" + stringToSearch + ".ToLower()): " + match.ToString());
        Console.WriteLine("Elapsed: " + (int)elapsedTime.TotalMilliseconds);
        return (int)elapsedTime.TotalMilliseconds;
    }

    public static int Method4(int loops, string stringToSearch)
    {
        startTime = DateTime.Now;
        for (int i = 0; i < loops; i++)
        {
            match = text.IndexOf(stringToSearch, 0, StringComparison.OrdinalIgnoreCase) >= 0;
        }
        elapsedTime = DateTime.Now.Subtract(startTime);
        Console.WriteLine("IndexOf(" + stringToSearch + ", 0, StringComparison.OrdinalIgnoreCase) :" + match.ToString());
        Console.WriteLine("Elapsed: " + (int)elapsedTime.TotalMilliseconds);
        return (int)elapsedTime.TotalMilliseconds;
    }

    static void Main(string[] args)
    {
        int loops = 1000000; // One million
        int m1 = 0;
        int m2 = 0;
        int m3 = 0;
        int m4 = 0;

        stringToSearch = "Function";
        m1 += Method1(loops, stringToSearch);
        m2 += Method2(loops, stringToSearch);
        m3 += Method3(loops, stringToSearch);
        m4 += Method4(loops, stringToSearch);

        Console.WriteLine();
        stringToSearch = "not found";
        m1 += Method1(loops, stringToSearch);
        m2 += Method2(loops, stringToSearch);
        m3 += Method3(loops, stringToSearch);
        m4 += Method4(loops, stringToSearch);

        Console.WriteLine();
        Console.WriteLine("Method 1 Elapsed: " + (int)(m1 / 2));
        Console.WriteLine("Method 2 Elapsed: " + (int)(m2 / 2));
        Console.WriteLine("Method 3 Elapsed: " + (int)(m3 / 2));
        Console.WriteLine("Method 4 Elapsed: " + (int)(m4 / 2));

        Console.ReadKey();
    }
}


The code itself is pretty simple and should be self- explanatory.



Each method was executed both with a string that would be found and with a string that would not be found. The final timing was the average of these two.

For the case-insensitive searching .IndexOf(, 0, StringComparison.OrdinalIgnoreCase) seems to be the best choice. I was a bit surprised that this was faster than basic .IndexOf() search. If you need only case-sensitive search then .Contains() would be the fastest method.

No comments: