The goal of this search procedure is to find words that are structurally closest to the search term, even if they do not match exactly.
What is Approximate String Matching?
Approximate String Matching is a computer science procedure that corrects flawed or inaccurate input with high probability. It is often referred to as Fuzzy Search or Similarity Search.
The method finds all matches that deviate only slightly from the search term, measured in distance points. This is particularly valuable for:
- Very long words with a high probability of typos.
- Historical data where spellings may vary.
- Database queries involving input errors.
How Does Approximate String Matching Work?
The central principle is the calculation of the distance between the search term and the words in the index. This distance is determined using a distance measure algorithm.
The Levenshtein Distance
The best-known algorithm is the Levenshtein Distance (also called Edit Distance). It measures how many single edit operations are necessary to transform one word into another.
Edit operations include:
- A Substitution (changing one character)
- An Insertion (adding one character)
- A Deletion (removing one character)
| Example | Levenshtein Distance | Explanation |
| Customer ➔ Custmer | 1 | A deletion of the letter ‘o’ |
| Berlin ➔ Berlim | 1 | A substitution of ‘n’ for ‘m’ |
| Searchword ➔ Searchwordee | 2 | Two insertions of ‘e’ |
| Search ➔ Phonetics | 8 | High distance, words are not similar. |
Advantages and Differentiation
Approximate String Matching, together with Phonetic Search, forms a robust fault tolerance layer for Full-Text Search.
- Efficient Error Correction: It corrects errors that are neither phonetic nor exact. This makes it the most comprehensive method for resolving typos.
- Data Cleansing: The mechanism can be used for the automatic detection and merging of duplicates in databases (e.g., “Frankfurt am Main” and “Frankfurt a. M.”).
- Clear Differentiation from Phonetics: This String Matching is script-based. It would effectively correct errors like “Maier” “Meier”. However, it would not recognize the difference between “Meyer” and “Meier” based on sound like Phonetic Search, as their Edit Distance is zero, but the phonetic similarity is given. The procedures are complementary.
Approximate String Matching offers a fundamental form of content-based approximation, but no contextual or semantic interpretation. It only recognizes the proximity of the letters.
Outlook and Conclusion
While advanced search functions in Full-Text Search (such as wildcards or the tilde operator for Fuzzy Search) enable this at the user level, Approximate String Matching is the underlying, formal principle. Unlike rigid Full-Text Search and sound-based Phonetic Search, this search Method uses algorithmic distance measures to calculate the structural proximity of two strings.
These procedures are essential for ensuring data quality in Document Management for increasing the hit rate in complex, error-prone databases.
