Optimise Text Search in Core Data

We typically use one of the following text search predicates. Each is listed with its relative performance cost, from cheap ($) to expensive ($$$):

  • Beginswith, Endswith ($)

These are the cheapest text comparisons available. Only the first or last few characters of the text are checked, and the comparison stops as soon as a mismatch is found.

  • Equality ($)

This query behaves much like Beginswith, except that it has to check every character in the text rather than only a prefix.

  • Contains ($$)

This is a bit more expensive, because it keeps looking for a match along the whole length of the text.

  • Matches ($$$)

This is the most expensive of the plain text comparisons, because the system has to hand the work to the regular expression engine.
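The operators above can be sketched as predicates as follows. This is only an illustration: the attribute name `title` and the helper functions are hypothetical, and a dictionary stands in for a managed object so the predicates can be evaluated in memory.

```objective-c
#import <Foundation/Foundation.h>

// Each factory builds one of the predicates discussed above against a
// hypothetical attribute called `title`.
static NSPredicate *PrefixPredicate(NSString *query) {
    return [NSPredicate predicateWithFormat:@"title BEGINSWITH %@", query];
}

static NSPredicate *SuffixPredicate(NSString *query) {
    return [NSPredicate predicateWithFormat:@"title ENDSWITH %@", query];
}

static NSPredicate *EqualityPredicate(NSString *query) {
    return [NSPredicate predicateWithFormat:@"title == %@", query];
}

static NSPredicate *ContainsPredicate(NSString *query) {
    return [NSPredicate predicateWithFormat:@"title CONTAINS %@", query];
}

// MATCHES takes a regular expression that must match the whole string.
static NSPredicate *RegexPredicate(NSString *pattern) {
    return [NSPredicate predicateWithFormat:@"title MATCHES %@", pattern];
}

// Evaluates a predicate in memory; the dictionary plays the role of a
// managed object that has a `title` attribute.
static BOOL TitleMatches(NSPredicate *predicate, NSString *title) {
    return [predicate evaluateWithObject:@{ @"title": title }];
}
```

In a Core Data fetch, a predicate like these would be assigned to the `predicate` property of an `NSFetchRequest` instead of being evaluated by hand.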

  • Case and Diacritic Insensitivity [cd] ($$$)

This is a far more expensive query. Let's first see what it does. Adding [cd] to a Contains query treats a, A, à, á, â, ä, ã, å, ā as simply ‘a’. By doing this we are telling the system not to distinguish between any of these characters, so it has to do a lot of extra work for every comparison.
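To make the effect of [cd] concrete, here is a minimal sketch that compares a plain Contains predicate with its case- and diacritic-insensitive form. The attribute name `name` and both helper functions are assumptions made for this example only.

```objective-c
#import <Foundation/Foundation.h>

// Plain CONTAINS: case and accents must match exactly.
// `name` is a hypothetical attribute; the dictionary stands in for a managed object.
static BOOL NameContainsExact(NSString *name, NSString *query) {
    NSPredicate *predicate =
        [NSPredicate predicateWithFormat:@"name CONTAINS %@", query];
    return [predicate evaluateWithObject:@{ @"name": name }];
}

// CONTAINS[cd]: c = case insensitive, d = diacritic insensitive.
static BOOL NameContainsInsensitive(NSString *name, NSString *query) {
    NSPredicate *predicate =
        [NSPredicate predicateWithFormat:@"name CONTAINS[cd] %@", query];
    return [predicate evaluateWithObject:@{ @"name": name }];
}
```

With these helpers, searching "Crème Brûlée" for "creme" fails with the exact form and succeeds with the [cd] form; the extra folding on every candidate string is where the cost comes from.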

  • Solution: use canonicalized text search [n] ($$)

When people search for text, they type a few characters and expect results. What we really need here is a canonicalized text property. To get one, we separate the text that will be searched from the text that will be displayed: the display text, with its diacritic characters, is converted once into lowercase canonicalized text and stored alongside it. The following code converts a string into its canonicalized form.

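// Work on a mutable copy: CFStringTransform modifies the string in place.
NSMutableString *str = [@"àä" mutableCopy];
CFStringTransform((__bridge CFMutableStringRef)str, NULL, kCFStringTransformStripCombiningMarks, false);
// Lowercase the stripped text to get the canonical search string.
NSString *searchText = [str lowercaseString];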

kCFStringTransformStripCombiningMarks is the identifier of a transform to strip combining marks (accents or diacritics).

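Once we have a string that is the normalized form of the display text, we can compare against it with the [n] option and pass in a query that has been canonicalized the same way. The [n] option tells the system that both strings are already normalized, which saves the many clock cycles that a case- and diacritic-insensitive search would otherwise spend on every comparison.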
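Putting the pieces together, the approach might look like the sketch below. The helper names and the stored attribute `canonicalName` are hypothetical; the assumption is that `canonicalName` is filled in with the canonicalized display text whenever the object is saved.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: strips combining marks and lowercases the text.
// Use the same function for the stored attribute and for the user's query.
static NSString *CanonicalSearchText(NSString *displayText) {
    NSMutableString *buffer = [displayText mutableCopy];
    CFStringTransform((__bridge CFMutableStringRef)buffer, NULL,
                      kCFStringTransformStripCombiningMarks, false);
    return [buffer lowercaseString];
}

// Builds a prefix predicate against the assumed `canonicalName` attribute.
// [n] states that both sides are already normalized, so no folding is needed.
static NSPredicate *CanonicalPrefixPredicate(NSString *userQuery) {
    return [NSPredicate predicateWithFormat:@"canonicalName BEGINSWITH[n] %@",
                                            CanonicalSearchText(userQuery)];
}

// In-memory check; the dictionary stands in for a managed object whose
// `canonicalName` was derived from its display name.
static BOOL DisplayNameMatchesQuery(NSString *displayName, NSString *userQuery) {
    NSDictionary *item = @{ @"canonicalName": CanonicalSearchText(displayName) };
    return [CanonicalPrefixPredicate(userQuery) evaluateWithObject:item];
}
```

The design choice here is to pay the canonicalization cost once per write and once per query, rather than once per row on every search.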