What is the state of the art for name-processing?

Name-matching can be as simple as a quick visual scan of a Contacts list on a cell phone, or as complex as a multi-technique automated review of billions of lines of unstructured data in search of possible references to one or more persons of interest.

Tools and techniques to cope with the special complexities of names have been devised since well before the advent of digital computers, because consistent handling of names for filing and retrieval has long been recognized as an important and on-going organizational requirement.

The commercial marketplace and the academic world at present offer numerous products and techniques that address various aspects of name processing. A small number of algorithmic solutions recur broadly and repeatedly over this set, especially when the primary task at hand is name-matching (also called name-linking, name-finding or name-searching). Typically, other name-processing techniques are used as complements to the name-matching process, in order to improve the delivered outcomes. Across these diverse resources, three basic name-matching methodologies tend to predominate:

String-similarity

The most intuitive, widely used and easily grasped way to compare two or more Named Entities is to consider the degree to which they are spelled using the same characters. The underlying assumption for this matching method is that related or equivalent names will generally comprise the same set of characters, found in generally the same places, when the name is considered as a string or, more specifically, a vector of symbols from a closed set (e.g., the Roman A-Z alphabet).

String-similarity matching also offers the attractive feature of noise-tolerance, that is, overcoming random errors that may have been introduced into a name at some point before the comparison is made. Another advantage of a string-similarity approach is that the results of an automated search can be ranked by degree of similarity, a useful feature when answer-sets can number in the thousands or beyond for a particular name-search request.
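The ranking behavior described above can be sketched in a few lines of Python. This is only an illustration, not a recommended production metric: it assumes the standard-library `difflib.SequenceMatcher` ratio as the similarity score, and the function name `rank_by_similarity` and the sample names are invented for the example.

```python
from difflib import SequenceMatcher


def rank_by_similarity(query, candidates):
    """Return (candidate, score) pairs sorted from most to least similar.

    The score is difflib's ratio, a value between 0.0 and 1.0. It stands in
    here for whichever string-similarity metric a real system would use.
    """
    q = query.casefold()
    scored = [
        (name, SequenceMatcher(None, q, name.casefold()).ratio())
        for name in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical answer-set for a search on a misspelled name.
names = ["Theodore", "Kathryn", "Catherine", "Katherine"]
ranked = rank_by_similarity("Katharine", names)
for name, score in ranked:
    print(f"{name:10s} {score:.3f}")
```

Because every candidate receives a numeric score, the same function supports a cut-off threshold as well as a ranked display.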

Borrowing initially from more abstract theories devised by Russian mathematicians in the late 19th and early 20th centuries, certain string-similarity techniques later evolved in a way that made them better suited first to written language in general, and then to names in particular. Currently, public-domain similarity metrics such as n-grams, Damerau-Levenshtein and Jaro-Winkler are commonly applied in many name-matching applications, and in some instances are even implemented as “built-in” calculations in certain programming languages.

While name-matching techniques in this category have important practical and theoretical differences, they also share several essential assumptions: first, a name can be modeled as an ordered vector of orthographic symbols (i.e., letters from a specific alphabet or writing system); second, the similarity between two compared names can be measured by determining the number of equivalent symbols (letters) found in the same position in both names, possibly after certain manipulations such as reversing or justification.
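As a concrete instance of treating a name as an ordered vector of symbols, the following minimal sketch computes plain Levenshtein edit distance. Note the assumptions: this is the basic variant (insertions, deletions and substitutions only), without the transposition operation that the Damerau extension adds, and the `similarity` normalization by the longer name's length is one common convention chosen here for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to 0.0..1.0 by the longer name's length
    (an assumed convention, not the only possible one)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("SMITH", "SMYTHE"))  # one substitution plus one insertion
print(similarity("SMITH", "SMYTHE"))
```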

Key-generation

In addition to string-similarity metrics, the other widely used set of automated name-matching techniques is key generation. As the scope and complexity of IT systems grew rapidly through the 1960s, and as increasing amounts of data came to be held within or managed by these IT systems (especially in a corporate context), there was also a growing need to deal with names that were stored in a database management system (DBMS), typically as a field within a record, or collection of related data values.

Data contained within a DBMS can most easily and efficiently be manipulated as sets of records, rather than as an iteration over all individual records. Certain data values within a data record are typically identified as keys, and much DBMS-based processing (storage, retrieval, searching, defining subsets) involves operations on key values.

Set-oriented, key-based operations were easily implemented for numeric values, but less well-defined for character data. Finding all records containing a specific numeric value, or any value within a defined range, was well understood and easily implemented. But how could the same operations be made available for non-numeric data, such as text values? And how, in particular, could names that fell within a certain degree of spelling similarity be represented?

Fortunately, a convenient and efficient solution for converting names into keys came readily to hand, in the form of an obscure technique first devised in the 1920s by two researchers working with data from the US Census of 1890. This technique, called Soundex, was among those characterized as “fundamental” in Donald Knuth’s seminal and highly influential work, The Art of Computer Programming, which was published just as DBMS technologies were gaining wide commercial use in IT.

As the practical limitations of Soundex for use with names were encountered, specialized techniques were developed that also converted a name into a key value, but did so in a way that more frequently caused similar or equivalent names to render the same key, and thus allowed those names to be retrieved as a set. In each instance, systems such as NYSIIS and Metaphone sought to broaden the complexity and ethnic diversity of the names that could be grouped as similar by the keys that were generated, linking names that were pronounced the same way (generally, by speakers of English), despite varying ways these names might be represented in written form.

A common thread across all these key-generation techniques is their mathematical property as functions. That is, they all produce exactly one output value (the key) for each input value, and many different inputs may map to the same key. Tables or charts of equivalency rules are used to identify sequences of symbols (e.g., letters) that are potential sources of spelling variation, then neutralize these by omitting them or by replacing them with a normalized symbol-sequence. The resulting key, in theory, is shared by all names which are commonly understood as being pronounced the same but spelled differently.

Key-generation techniques provide a way to pre-define sets of names that are likely spelling variants of each other, and to do so in the context of data-management techniques that work well in very large collections of data records that contain names. As set-wise operators, they also combine efficiently with multi-criterion or record-based searches, when a search request contains both a name and other heterogeneous data-fields, a quality that makes them especially valuable for DBMS applications.
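To make the key-as-function idea concrete, here is a minimal sketch of the widely documented American Soundex rules, followed by a helper that groups names by key. It is an illustration under stated assumptions rather than a reference implementation: only the letters A-Z are considered, and the helper name `group_by_key` is invented for this example.

```python
from collections import defaultdict

# Consonant classes of American Soundex; vowels, Y, H and W carry no code.
CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(name):
    """Return the four-character Soundex key, or "" if name has no A-Z letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    last_code = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != last_code:
            key += code
        last_code = code  # a vowel resets this, so a repeated code is kept
    return (key + "000")[:4]  # pad with zeros, truncate to four characters


def group_by_key(names):
    """Pre-compute sets of likely spelling variants, one set per key."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)


print(group_by_key(["Smith", "Smythe", "Jones", "Robert", "Rupert"]))
```

In a DBMS the key would typically be stored in an indexed column, so that retrieving all variants of a name becomes an ordinary equality lookup. The sketch also shows a known limitation: because the first letter is kept as-is, Katherine and Catherine receive different keys.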

Name grouping

Automated name-matching techniques based on string similarity or key-generation arose largely within the context of US and Western European names and naming practices, because these techniques were driven, in the main, by the need to solve recurring, ubiquitous practical problems encountered by information technologists. Since the personal names for which these techniques were devised were understood to be represented uniformly in the Roman alphabet (perhaps including certain of its European extensions), and since the predominant naming practices used by this population were fairly consistent, a purely algorithmic approach can deliver acceptable results in many instances.

However, it soon became clear that not all names that, say, speakers of English understand as being closely related by sound or by meaning also possess a reasonable degree of similarity in their written forms. As a trivial example, use of familiar names or nicknames, as an equivalent form for a person’s first given-name, is a naming practice that spans many cultures of Western Europe and the Americas. In many name-matching contexts, the searcher looking for the name Katherine would presumably also want to see matching names that contain the form Kathy, and a search for Theodore should probably be able to return a match on Ted.

To accommodate this and other, related aspects of standard naming practices observed in personal-name data from the Americas and Western Europe, the technique of name grouping arose. In this technique, culturally and linguistically qualified individuals manually compile lists and tables of related names that are sufficiently distinct in their spelling to prohibit a calculation of relatedness by string-similarity or by key-generation. The most widely used form of this technique is the nickname table, which allows a name or name-part to be mapped to a form that is considered more basic or canonical, in the context of a specific language and/or culture.

Although name-grouping resources are frequently used to supplement and support algorithmic techniques, name-grouping can also form the sole basis for a fully automated name-search system. Given enough linguistic/cultural expertise and name-data that fully represents the naming practices of a specific population, it would be possible to build a table that encompasses all the relationships between and among names that are needed to perform a search and to deliver as output all names previously defined as related by the experts.
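A nickname table of the kind described above amounts to a lookup from a variant form to its canonical form. The sketch below is a toy illustration only: the three entries come from examples used in this article, whereas a real table would be compiled by qualified experts for a specific language and culture. The names `NICKNAMES`, `canonical` and `same_group` are invented for the example.

```python
# Illustrative entries only; a production table would be far larger and
# would be compiled manually for a specific language and culture.
NICKNAMES = {
    "KATHY": "KATHERINE",
    "TED": "THEODORE",
    "RICK": "RICHARD",
}


def canonical(given_name):
    """Map a given-name to its canonical form; unknown names map to themselves."""
    key = given_name.strip().upper()
    return NICKNAMES.get(key, key)


def same_group(a, b):
    """True when two given-names resolve to the same canonical form."""
    return canonical(a) == canonical(b)


print(same_group("Kathy", "Katherine"))
print(same_group("Ted", "Theodore"))
print(same_group("Ted", "Richard"))
```

Neither string-similarity nor a phonetic key would reliably link Ted to Theodore, which is why this relationship has to be stated explicitly in a table.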

The State of the Art

All automated systems that link or search or relate names can be shown to use one or more of the three foregoing techniques. Each of the three techniques has its practical, computational and/or linguistic advantages; each has its known shortcomings and drawbacks.

The most advanced automated systems for name-processing use all three of these techniques in complementary ways which operate efficiently over the scope of the name-data that they comprise. They do so in an application architecture that takes maximum advantage of the observed and measurable statistical properties of names, especially the typical type-token distributions for personal names in a specific population or data-collection.

In this way, the case encountered most often will also be the one that is handled most efficiently and well by the system, and infrequent cases from the “long tail” will still be accommodated, if not quite as rapidly.
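The type-token distribution mentioned above can be measured directly. The following sketch, using invented sample data, counts how many name tokens (occurrences) are accounted for by the most frequent name types (distinct forms); the function name `coverage` is an assumption made for this illustration.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Share of all name tokens accounted for by the top_n most frequent types."""
    counts = Counter(token.upper() for token in tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


# Invented sample: 10 tokens spread over 4 types.
sample = ["Smith"] * 5 + ["Jones"] * 3 + ["Smythe", "Tymczak"]
print(coverage(sample, 2))  # two of the four types cover most of the tokens
```

When a small number of types covers most tokens, a system can reasonably give those types the fastest path (for example, pre-computed results) and route the rarer forms through slower, more general matching.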

In summary, the state of the art for name-matching systems is reached when a given system can handle both random and predictable variation in names, across a variety of named-entity types (personal names, business names, place-names), encompassing a wide variety of major and minor name-models observed within the cultural/linguistic/social contexts in which the names are actually used. It should be able to do so in a high-volume, time-sensitive operational environment, and present results in such a way as to make the most likely matches more easily identified.

Ideally, the system would also provide, on demand, an explanatory dimension for search outcomes, such that consumers of search results would not necessarily require a complete grasp of all possible name-variation patterns in order to appreciate and act consistently upon those results.

To the extent that name-processing is seen as essentially an IT issue, and identifying related names is seen as a purely mathematical exercise in defining away or working around superficial dissimilarities in names that would otherwise be understood as adequately related, that name-processing system can be expected to fall well short of the state of the art.

Other Name-Processing Resources

Although the foregoing discussions have had as their principal focus name-matching (name-linking, name-filtering, name-searching), certain additional automated tools are also extant in commercial and public-domain contexts. These tools can be used as complements to name-matching, in order to refine and enhance the utility of the search results for users.

  • Name genderization: for personal names, assignment of the likely gender of the bearer, usually by examination of one or more of the given-names (“first names”).
  • Name parsing: for personal names and business names, mapping of discrete name-parts into an appropriate name-model (e.g., assigning parts to the First, Middle and Last Name fields in a database record).
  • Name ethnicity: identifying the cultural/linguistic context(s) exhibited by a personal name or a business name. Name distribution, showing the places where a name has been observed and the relative frequency in each location, is a special case of this process.
  • Name frequency: identifying the relative frequency with which a name (or name-part) is observed within a certain population or data collection.
  • Name transliteration: converting a name from its current representation in a specific writing system (e.g., the Roman alphabet) to its equivalent form in a different writing system (e.g., Greek, Cyrillic, Hangul, Chinese,…). Romanization is a special case of this process, in which a name is converted from a non-Roman writing system to its Roman-alphabet equivalent.
  • Name standardization: conversion of a specific personal name into a more typical or canonical form, as when the surname SMYTHE is converted to SMITH or the given-name RICK is converted to RICHARD.

Want to learn more? With multiple patents in automated processing techniques and over 20 years of experience with governments and corporations around the world, the experts at Onomastic Resources are uniquely qualified to help you solve even the most difficult name-processing challenges. Contact us to learn more about applying state-of-the-art techniques in your organization.
