One of the amazing things about Google is that when you type in a question or word, you do not need to be precise about spelling, punctuation or the niceties of grammar: you still get a list of suggestions that very often match what you are looking for. But how does Google achieve this feat in a matter of milliseconds?
Enter fuzzy matching. In a world where vast amounts of data are generated every second, the data, especially text data, is often not clean. Human-generated data contains errors - misspellings, wrong punctuation, prefixes and suffixes that mean nothing to others, inconsistent formatting - that make it hard to recognise what is relevant and what is not. Fuzzy matching algorithms help to assess how similar two words are to each other, taking the guesswork out of what a user is asking Google about.
Fuzzy matching is used in the following scenarios to automate what is essentially a manual task:
- Finding documents that are similar to each other in terms of their content
- Classifying documents into two or more categories based on their content
- Finding the best match for a given word from a given set of words - applicable in scenarios where customer lists need to be collated from varied sources
- Screening a customer list against international databases - for Anti-Money Laundering (AML), fraud detection and immigration purposes
There are various classes of algorithms used to perform a fuzzy match on text data, and it is the data itself that determines which type is applied to the problem at hand. Each algorithm is judged by the number of false positives and false negatives it produces. A false positive is a case where a match is reported between two strings when in fact there is none. A false negative is a case where the algorithm mistakenly indicates that there is no match, when in fact there is. These errors can lead to increased workloads (in the case of false positives) or missed information and associated risks (in the case of false negatives).
Types of Algorithms Employed:
The various classes of algorithms used for fuzzy matching are listed below:
- Edit Distance - The edit distance method looks at how many character changes it takes to go from one name to the other. For example, "Katherine" and "Catherine" differ by a single change, the substitution of the first letter "K" with "C". Algorithms in this class include the Levenshtein distance, the Jaro-Winkler distance and the Hamming distance. All of these approaches look at two basic factors:
- The number of similar characters
- The number of edit operations it takes to convert one string into another - edit operations being insertion, deletion and substitution (some variants also count transposition)
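As an illustrative sketch of the second factor, the Levenshtein distance can be computed with a short dynamic-programming routine (the function name is my own; this is a minimal implementation, not a production library):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b (dynamic programming)."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Katherine", "Catherine"))  # 1: substitute K -> C
```

A lower distance means a closer match; in practice a threshold (say, distance <= 2) decides whether two strings are treated as the same name.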
- Phonetic Algorithms - Phonetic algorithms reduce words that sound similar to a key or code that is based on their English pronunciation. Similar sounding names have the same key in this system. Soundex is an example of this type of algorithm.
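To make the phonetic idea concrete, here is a simplified sketch of the classic American Soundex encoding in Python. It follows the standard letter-to-digit mapping but skips edge cases such as non-alphabetic input, so treat it as an illustration rather than a reference implementation:

```python
def soundex(name: str) -> str:
    """Simplified Soundex: keep the first letter, encode remaining
    consonants as digits, drop vowels, pad/truncate to 4 characters."""
    mapping = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        mapping.update(dict.fromkeys(letters, digit))
    name = name.lower()
    code = name[0].upper()
    prev = mapping.get(name[0], "")
    for ch in name[1:]:
        digit = mapping.get(ch, "")
        if digit and digit != prev:   # adjacent duplicate codes count once
            code += digit
        if ch not in "hw":            # h and w do not separate duplicates
            prev = digit
    return (code + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Similar-sounding names such as "Robert" and "Rupert" receive the same key, so a lookup table keyed by Soundex code groups them together automatically.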
- Token-based similarity algorithms - These are used to find similar documents or for document classification. They convert a given document into a set of tokens (i.e. a bag of words) that is then compared with another document's tokens for similarity, for example by counting how many words appear in both documents.
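A minimal illustration of token-based similarity is the Jaccard index, which scores two bags of words by the overlap of their token sets (the whitespace tokenisation below is a simplifying assumption; real pipelines also strip punctuation and stop words):

```python
def jaccard(doc_a: str, doc_b: str) -> float:
    """Jaccard similarity between two documents' bags of words:
    size of the token intersection divided by size of the union."""
    tokens_a = set(doc_a.lower().split())
    tokens_b = set(doc_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty documents are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(jaccard("the quick brown fox", "the slow brown fox"))  # 0.6
```

A score of 1.0 means identical token sets and 0.0 means no words in common, which makes the measure easy to threshold for classification.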
- List building - This method generates all possible spelling variations of a given name combination and then looks for matching names from among these lists of name variations.
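The list-building approach can be sketched by generating every spelling one edit away from a name and then intersecting the result with a reference list (the function name and the sample reference list below are illustrative):

```python
import string

def one_edit_variants(word):
    """All spellings one edit away from word: deletions, transpositions,
    substitutions and insertions -- the building block of a variation list."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | transposes | substitutes | inserts

# Look for known names among the generated variations
known = {"catherine", "kathryn", "joan"}
print(one_edit_variants("katherine") & known)  # {'catherine'}
```

Applying the generator repeatedly covers names two or more edits away, at the cost of rapidly growing lists, which is why this method is usually combined with a fast set lookup.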
A financial services firm in the renewable energy financing space had a manual process for matching applicants for loans with the list of current homeowners for a given area so as to determine genuine applicants for financing.
Manual operations were leading to many matching errors due to the high volume of applications and the stringent criteria for matching. At Devise Math we employed fuzzy matching techniques using the grepl function in R to ascertain whether the name on a loan application matched an entry in the homeowner database.
This resulted in the automation of the title matching step of the loan application process and significant dollar savings to the company.
A leading CRM company working with leading consumer electronics firms was looking to standardize the product database for its clients in order to enable faster processing of post-purchase offers and increase up-sell opportunities.
A lack of standardisation of product codes and descriptions hampered their efforts on this initiative. We employed fuzzy matching using a mix of the Levenshtein distance algorithm and the Jaro distance algorithm to identify similar product codes and product descriptions. The completed dataset allowed for the successful completion of this initiative, leading to greater adoption of the company's loyalty programs.
Fuzzy matching is a technique that's constantly evolving, with new approaches and use cases being developed to meet the changing requirements of the big data world. With the explosion in the amount and variety of text data across the world, in multiple languages, this will be an important component of every data scientist's toolbox for years to come.
Interested in learning more about data science applications for your business? Reach out to us at firstname.lastname@example.org and let's have a chat about how we can help solve your challenges.