
Phase 1: Characterize Data and Determine Matching Algorithm

This phase answers the question, “What data elements should be used in the matching algorithm?” It assumes there is not an established set of matching variables or a matching algorithm in operation. It also considers the identifiers commonly used within the social context to distinguish one person from another.

Countries may collect different PII. Consequently, each country will likely develop matching algorithms that accommodate their distinct characteristics. Although this module provides best practices, some tailoring is expected, much as different cooks alter spices or types of sweeteners (e.g., honey instead of sugar) to satisfy a range of tastes.

Other modules within this toolkit discuss implementation and considerations of alphanumeric IDs assigned by governing authorities and of biometric identifiers. These, along with other person characteristics, are included as potential data points that can identify an individual.

Some types of identifying data are more effective than others in singling out the same individual within and across datasets and across time, but a combination of identifiers, biometrics (if available), and demographic attributes is more likely to result in accurate person matching. In short, no single identifier is 100% reliable. Therefore, this phase determines the optimal combination of identifying data (i.e., the algorithm) that consistently recognizes true duplicate records as the same individual and true unique records as different. In this way the algorithm achieves the most favorable balance between false positive (i.e., identifies different individuals as the same) and false negative (i.e., does not identify records for the same person as duplicates) matches.

This first phase involves the following three main steps:

  • Step 1A: Determine appropriate identifiers

  • Step 1B: Address data source limitations

  • Step 1C: Select an optimal algorithm

Step 1A: Determine appropriate identifiers

To decide which identifiers to use for matching, start by determining the kinds of data used to identify people and whether those data are available in the context. Source databases use both static and non-static identifiers to tell people apart: static data does not change for a person, while non-static data can change over time or be specific to a data source. These identifiers include, but are not limited to, the following:

Static identifiers do not change over time. Some examples of static identifiers include:

  • Unique code, such as National Patient Identification number

  • Name and demographic attributes, such as given name, middle name or initial, family name (semi-static), sex at birth, date of birth, people group affiliation, and biometrics data (fingerprint, photo, etc.)

Non-static identifiers can change for an individual over time or are data source specific. Non-static identifiers can help to confirm a match but not to determine a match. Some examples of non-static identifiers include:

  • Contact identifiers, such as home address, phone number, and email address.

  • Other identifiers, such as marital status, place of birth, Medical Record Number, or other facility identifiers.

The reliability of static and non-static identifiers depends on data quality and on how the source system captures the data. For example, a National Patient Identifier that is keyed into a system rather than scanned could contain keystroke errors, reducing its reliability as an identifier. Phase 2: Implementing the Matching Process discusses data quality review and resolution in more detail as a routine process for identifying and correcting data quality issues within the source data. For this phase, the algorithm should account for missing data, potential capture errors, and data inconsistencies.

Step 1B: Address data source limitations

Matching performance depends on data completeness and other aspects of data quality. Therefore, summarizing the amount of missing or outlier values informs decisions about how to handle these situations in the algorithm. For example, the matching algorithm can be designed to ignore missing values and assign a lower probability of match when only some of the combinations of identifiers are available.

To understand and address data source limitations, it is useful to:

  • Summarize the data.

  • Evaluate characteristics of the identifier data available from source systems.

  • Understand the variations.

  • Normalize and validate the data.

Let’s explore each of these four aspects.

Summarize the data:

The data summarization process helps us to understand the distribution of data values in each identifier at the source. It should yield the following three metrics for each identifier (a code sketch follows this list):

  • Frequency of each possible value: This is the number of occurrences of each value in each identifying data element. A histogram can be used to illustrate the distribution of the counts. Since many values will have a count of 1 or 2, those values can be filtered out of the histogram or grouped into a single category.

  • Summary statistics: The following summary statistics can also be computed for each identification data element using the frequency data described above: Maximum, Minimum (most likely to be 1, so additional filtering may be needed), Mean, Median, and Standard Deviation.

  • Combined data values: The same metrics can also be computed for combined data values, for example, first name + last name or first name + last name + DOB.

    • If there are missing values in any of the identifiers, the combined value should be NULL.

    • These results will support identifying blocking schemes (see FAQ), defining deterministic matching methods, and improving the security of privacy-preserving record linkage methods (further discussion planned for the next version of the guidelines).
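The summarization can be scripted with standard data tools. Below is a minimal sketch in Python with pandas, assuming a hypothetical extract file and identifier columns (first_name, last_name, dob) that would be replaced by the identifiers actually available in the source data.

```python
import pandas as pd

# Hypothetical extract of identifier columns from a source system
source_df = pd.read_csv("source_identifiers.csv")  # columns: first_name, last_name, dob

identifier_cols = ["first_name", "last_name", "dob"]

for col in identifier_cols:
    # Frequency of each possible value; values occurring only once or twice
    # can be grouped before plotting a histogram
    counts = source_df[col].value_counts(dropna=True)

    # Summary statistics computed over the frequency counts
    print(col, {
        "max": counts.max(),
        "min": counts.min(),
        "mean": counts.mean(),
        "median": counts.median(),
        "std": counts.std(),
    })

# Combined data values: treated as NULL (None) if any component identifier is missing
combined = source_df[identifier_cols].apply(
    lambda row: None if row.isna().any() else "|".join(row.astype(str)), axis=1
)
print("combined", combined.value_counts(dropna=True).describe())
```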

Evaluate the characteristics of the identifier data

Selection of a linkage algorithm depends on the characteristics of the underlying linkage data. For example, probabilistic algorithms might handle missing data more effectively than deterministic methods, or knowledge of the distribution of data values might inform the selection of data blocking schemes. To better understand identifier data, it can be useful to evaluate the data using metrics such as the following (a sketch computing several of these ratios appears after this list):

  • Data completeness using missing data ratio (MDR)

    Data completeness issues include the following:

    • Missing data in a field may be indicated with such terms as “NULL,” “Not available,” “N/A,” or “Missing,” or may consist of white space (“ ”) or an empty string (“”).

    • Placeholders for default values are similar to missing data and have no value in the matching process. Examples are using “Baby boy” for first name, “999” for age, and “To be determined” for address. Placeholders should be treated as missing data in the context of record linkage and thus converted to NULL.

    • Data is incomplete when part of the value in a field is absent. Examples are a date of birth with no day or month and a home address missing the street number.

Beyond completeness, the identifier data can also be evaluated using:

  • An identifier’s valid value ratio (VVR)

A valid value of an identifier is a value within the predefined value range. For example, to be valid, a date of birth cannot be in the future, a name must not contain digits, and sex must not contain values outside the defined value set.

  • Data’s discriminatory power using the distinct value ratio (DVR)

  • Data variability over time

Variability looks at temporal trends to identify any periodicity in the data. For example, date of birth might be missing for many records during earlier years of a system implementation but more consistently populated during more recent time periods, or data completeness might drop off during long periods of power interruption or during staffing changes within facilities.

  • Data availability

Availability refers to whether identifying data elements can be accessed, whether because they are not collected or because they cannot be shared. Data sharing may be hindered by regulatory, privacy, confidentiality, or other concerns.

  • Network capabilities

    Two features, bandwidth and latency, significantly influence patient matching system design.

  • Bandwidth, the maximum amount of data that can be transferred over a network or system within a given timeframe, impacts the speed of data queries, which affects match processing speed.

  • Latency, the delay before data transfer begins after an instruction is issued, also plays a role in data querying and match processing speed.

The technical capacities of bandwidth and latency inform decisions on whether the data need to be centralized for matching or can use federated queries. For example, a federated approach allows for local secure storage of sensitive data, but limitations in latency and bandwidth could slow it down. Therefore, a centralized approach may be needed to mitigate long latency and low bandwidth issues.
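For illustration, the missing data ratio, valid value ratio, and distinct value ratio could be computed along the following lines in Python with pandas. The placeholder list, the validity rules for date of birth and sex, and the extract file are assumptions to be adapted to the local value sets.

```python
import pandas as pd

source_df = pd.read_csv("source_identifiers.csv")  # hypothetical extract

# Placeholders treated as missing data (adapt to local conventions)
PLACEHOLDERS = {"", "N/A", "NULL", "Not available", "Missing", "999", "To be determined"}

def missing_data_ratio(series: pd.Series) -> float:
    """Share of records whose value is missing or a known placeholder (MDR)."""
    as_text = series.astype("string").str.strip()
    missing = series.isna() | as_text.isin(PLACEHOLDERS)
    return missing.mean()

def valid_value_ratio(series: pd.Series, is_valid) -> float:
    """Share of non-missing values that fall within the predefined valid range (VVR)."""
    non_missing = series.dropna()
    if non_missing.empty:
        return 0.0
    return non_missing.map(is_valid).mean()

def distinct_value_ratio(series: pd.Series) -> float:
    """Distinct non-missing values divided by total records (DVR):
    a rough measure of how well the field discriminates between individuals."""
    return series.dropna().nunique() / len(series)

# Example validity rules (assumptions): DOB not in the future, sex within a defined value set
dob = pd.to_datetime(source_df["dob"], errors="coerce")
print("dob MDR:", missing_data_ratio(source_df["dob"]))
print("dob VVR:", valid_value_ratio(dob, lambda d: d <= pd.Timestamp.today()))
print("sex VVR:", valid_value_ratio(source_df["sex"], lambda s: s in {"M", "F"}))
print("last_name DVR:", distinct_value_ratio(source_df["last_name"]))
```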

In addition to examining the types of data, it is important to determine the volume of data in the dataset, that is, how many records there are. Data volume can affect computational complexity, required computing resources (such as memory and processing cores), linkage algorithm performance, blocking strategies, the accuracy of the results (both false positive and false negative matches), data quality, and the run time and resources the process requires.

Understand the variations

It is useful to know how data fields vary across the data sources. Variations can exist for field names, coded values, format, and collection conventions across different healthcare facilities, regions and/or systems. Below are examples of variations:

  • Variations in naming conventions: Field names can be presented differently for the same concept. For example:

    • first name, could be presented as First Name, FN, first_name, or fname, and

    • date of birth, could be presented as DOB or birthdate.

  • Variations in data coded values: The coded values used within a field to denote the captured concepts might be different or not align. For example:

    • Sex could use the values male and female; 0 = Male and 1 = Female; or 1 = Female, 2 = Male, 3 = Other.

  • Variations in data format: Each source system might use dissimilar field formats to store the same information. For example:

    • Date fields such as date of birth could use the format MM/DD/YYYY, store the day, month, and year as separate fields, or be formatted as a string.

    • The recorded phone number could be formatted as a numerical string or include separators, such as (country code) (phone number)

  • Variations in data collection convention: In some source systems identifiers might be required and in others optional. If a field is required, it is also helpful to know how often the information is actually available at the time of data entry; if the information is sometimes unknown, the stored value could represent a best guess and be inaccurate. If the field is optional, the field might be incomplete more frequently.

For a matching strategy across multiple sources with such variations, one option is for each source to submit its data in the types and formats it currently uses and then to standardize the differences within the linking or matching process. An alternative is to define an operational data model with a standard variable name, data type, format, and value set for each identifier; each source would then transform its data to the common model.
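As a rough illustration of the common-data-model option, each source could map its local field names, coded values, and formats to the shared model before submission. The field names, value mappings, and formats below are hypothetical and would come from the agreed operational data model.

```python
import pandas as pd

# Hypothetical mapping from one source system's local conventions to the common model;
# each source would maintain its own mappings
FIELD_MAP = {
    "FN": "first_name",
    "fname": "first_name",
    "DOB": "date_of_birth",
    "birthdate": "date_of_birth",
    "Sex": "sex",
}
SEX_MAP = {"male": "M", "female": "F", "0": "M", "1": "F"}  # assumed local coding

def to_common_model(source_df: pd.DataFrame) -> pd.DataFrame:
    """Rename fields, harmonize coded values, and standardize formats."""
    df = source_df.rename(columns=FIELD_MAP)
    if "sex" in df.columns:
        df["sex"] = df["sex"].astype(str).str.strip().str.lower().map(SEX_MAP)
    if "date_of_birth" in df.columns:
        # Store dates as ISO YYYY-MM-DD strings regardless of the source format
        df["date_of_birth"] = pd.to_datetime(
            df["date_of_birth"], errors="coerce"
        ).dt.strftime("%Y-%m-%d")
    return df

# Example: a source record using its own field names and codes
sample = pd.DataFrame([{"FN": "Amina", "DOB": "05/14/1990", "Sex": "female"}])
print(to_common_model(sample))
```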

Normalize and validate the data:

After understanding the data characteristics, it is critical to address these limitations within the matching process. If there are errors in the data, then comparing two records using deterministic methods to ascertain matches may overlook true matches (that is, records belonging to the same person could be missed). For example, without accommodation, the likelihood of a true match is reduced if fields, such as names, address, phone number, and email, contain typographical errors (repeated, omitted, or transposed characters such as ‘Allen’ versus ‘Allne’) or phonetic errors (a name with multiple possible spellings such as ‘Allen’ versus ‘Alan’).

Although staff training and data quality monitoring can help decrease such errors, there is no way to eliminate all errors. Therefore, to minimize the impact of errors and improve matching efficiency, it is important to apply normalization methods and validate the data using other information within the record. These help to understand the approximate (probabilistic) agreement between fields when the data are not exactly the same. The selection/implementation of these methods depends on the availability of comparator data, tools, and other resources. The methods include the following:

  • Algorithmic normalization techniques: These techniques, including longest common substring, Levenshtein distance, and Jaro–Winkler similarity, use algorithms to compute a score that represents the similarity of two text strings (see the sketch after this list).

  • Normalization via gold standard datasets: Data in the person’s record is compared against nationally available data sets containing names, addresses, or cell phone numbers. These could include deidentified census data sets, such as separate lists of names and addresses, or a set of valid cell phone numbers from a local provider, such as might be used for random digit dialing phone surveys.

  • Use of internally validating content available in the record: This uses pre-established validating attributes, typically non-static identifiers (e.g., facility record number), to confirm potential matches. This could also include using clinical data to confirm or suggest missing identifying data. For example, the presence of a hysterectomy procedure for a patient whose birth sex is missing could be used to attribute a female birth sex to the individual’s record.

  • Consistent values for missing data. Variations in how missingness is represented should be converted to a consistent value such as NULL.
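To illustrate how an edit-distance score captures approximate agreement, here is a self-contained sketch of the Levenshtein distance normalized to a 0–1 similarity; Jaro–Winkler and other measures are available in open-source packages (for example, the jellyfish Python package) if a library is preferred.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Normalize the edit distance into a 0-1 similarity score."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# A typographical variation scores high; unrelated names score low
print(similarity("Allen", "Allne"))   # 0.6
print(similarity("Allen", "Robert"))  # low
```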

Step 1C: Select the optimal algorithm

Once the appropriate set of identifiers and the strategy to address data limitations are determined, the final step is to execute the matching process and test the results to identify the optimal algorithm.

This step comprises the following three items:

  • Determine technical approach.

  • Test initial algorithm.

  • Tune algorithm.

Determine technical approach:

The first decision is whether to write custom code or to leverage existing software to match records. If leveraging a software package, actual or synthetic data can be used to test candidate packages and determine which provides the optimal performance. The community organization Open Health Information Exchange (OpenHIE) lists possible packages at Reference Technologies - Documents - OpenHIE Wiki (ohie.org). Further examples of open source and commercial matching software include:

  • OpenCR

  • Sante Suite

  • HAPI EMPI

  • OpenEMPI

  • Febrl

  • LinKing

  • R Module

  • Python Package

  • Intersystems

  • Verato

  • 4medica

Selecting which software packages to test should be based on the following considerations:

  • cost,

  • reported performance,

  • availability of community or user support,

  • command-line vs. graphical user interface, and

  • supported functionalities.

Potential software packages or custom code can be tested using synthetic or actual data as the gold standard. Synthetic data can be created or generated as follows (a minimal sketch follows the numbered steps):

Synthetic data as testing data:

  1. Use a data-generating tool (e.g., Mockaroo) to generate hypothetical data for the identifiers used in the actual data.

  2. Generate a master dataset at different sizes (e.g., 10K, 1M, 10M) according to the estimated number of records to be linked in the actual datasets.

  3. Create a random administrative identifier for each record in the master dataset.

  4. From the master dataset, create subsets of data. The number of subsets and the number of records in each subset depend on the testing scenario. Any two records in any two subsets that have the same administrative identifier belong to the same individual in the master dataset. For testing purposes, data in each subset can be corrupted to emulate typographical errors and other limitations (see Step 1B: Address data source limitations) present in the actual data.

  5. The synthetic data matched with the administrative identifiers becomes the testing gold standard to use as a comparator to test the software package or custom code.
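The following sketch, in Python with pandas, illustrates steps 3 and 4: assigning administrative identifiers, drawing overlapping subsets, and corrupting a share of the values. The master file name, corruption rate, and transposition-style error are assumptions; a generator such as Mockaroo (or a library such as Faker) would supply the master dataset.

```python
import random
import uuid
import pandas as pd

random.seed(42)

# Step 3: add a random administrative identifier to each master record
master = pd.read_csv("synthetic_master.csv")  # hypothetical output of a data generator
master["admin_id"] = [str(uuid.uuid4()) for _ in range(len(master))]

def corrupt_name(name: str) -> str:
    """Emulate a typographical error by transposing two adjacent characters."""
    if not isinstance(name, str) or len(name) < 2:
        return name
    i = random.randrange(len(name) - 1)
    chars = list(name)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Step 4: draw overlapping subsets and corrupt a share of the values
subset_a = master.sample(frac=0.6, random_state=1).copy()
subset_b = master.sample(frac=0.6, random_state=2).copy()
for subset in (subset_a, subset_b):
    to_corrupt = subset.sample(frac=0.1, random_state=3).index
    subset.loc[to_corrupt, "first_name"] = subset.loc[to_corrupt, "first_name"].map(corrupt_name)

# Records sharing admin_id across subsets are true matches (the testing gold standard)
true_matches = subset_a.merge(subset_b, on="admin_id", suffixes=("_a", "_b"))
```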

Important note: While synthetic data might be more easily obtained than genuine patient identifiers extracted from clinical systems, it is vital to recognize the limitations when using synthetic data to evaluate the effectiveness of matching algorithms. The performance of matching algorithms hinges on the attributes of the data under consideration, and shifts in these attributes can significantly impact algorithm performance. For accurate assessment of algorithm performance, the test data must accurately mirror the genuine characteristics of the data (like missing values, discriminatory power, etc.). Synthetic data typically does not faithfully replicate the attributes of actual data intended for matching. Consequently, synthetic data may not adequately reflect the algorithm's performance when matching real-world data. Therefore, when using synthetic data, further algorithm evaluation using real world data is necessary prior to match system deployment.

Actual data as testing data:

  • Actual data can be used as testing data when the linkage between individuals within the datasets has been confirmed by human adjudicators or by including known duplicates. The use of actual data for testing is subject to institutional or federal regulations for data governance, data privacy, and security. For example, the actual data can be scrambled so that addresses are consistently associated with new names, and likewise birthdates with names.

  • This validated actual data becomes the testing gold standard to use as a comparator to test the software package or custom code.

The following outlines the process to test the selected software package(s) or custom code using the testing gold standard data:

  1. Deploy the software or custom code in the testing computing environment and perform the matching process using the testing gold standard data.

  2. Compare the software or custom code matching results with the gold standard results to determine how well it correctly identified duplicate individuals. This can be summarized by looking at the number of:

    1. true positive (TP) – matches correctly identified,

    2. false positive (FP) – incorrectly identified as matches,

    3. false negative (FN) – unidentified matches, and

    4. true negatives (TN) – correctly identified as not matches.

  3. Use the following metrics to assess the performance of the custom code or each software package tested (a small sketch computing these metrics follows this procedure):

  • Precision (also called positive predictive value) refers to the proportion of true matches among the identified potential matches. (Note the nuance from other uses of the term precision to indicate how close repeated measurements are to each other.) This metric is represented as: Precision = TP / (TP + FP).

  • Recall (also called sensitivity) is the proportion of true matches that are correctly identified among all the true matches in the data set (represented by the sum of true positives and false negatives). This metric is represented as: Recall = TP / (TP + FN).

  • The F-measure derives a single number to rate match quality based on precision and recall. It is represented as: F-measure = 2 × (Precision × Recall) / (Precision + Recall).

  4. Rank multiple software packages or combinations of software and custom code based on performance (F-measure).

  5. Choose a software package or custom code using the collected information on performance, functionality, and cost.

    1. One option is to provide feedback to the custom code developers or communicate with each software’s owner about its performance and discuss areas for improvement.
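A small sketch of these calculations, assuming the gold-standard and algorithm outputs are represented as sets of matched record-ID pairs (the record IDs below are hypothetical):

```python
# Matched pairs expressed as frozensets of two record IDs (hypothetical IDs)
gold_standard = {frozenset(p) for p in [("a1", "b7"), ("a2", "b9"), ("a5", "b3")]}
algorithm_output = {frozenset(p) for p in [("a1", "b7"), ("a2", "b4"), ("a5", "b3")]}

tp = len(gold_standard & algorithm_output)   # matches correctly identified
fp = len(algorithm_output - gold_standard)   # incorrectly identified as matches
fn = len(gold_standard - algorithm_output)   # unidentified matches

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f_measure = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
# precision=0.67 recall=0.67 F=0.67 for this toy example
```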

Test initial algorithm

Once the initial matching algorithm has been chosen, test it using naive matching data, which consists of records with unknown duplicates. This testing phase involves conducting two types of reviews for the matched pairs that result from the algorithm:

  • A quick match review identifies obvious problems with the process and can be performed following this procedure:

    • Randomly select at least 100 matches determined by each algorithm iteration.

    • Using the unique identifiers in the selected match, place the full records associated with those identifiers side by side.

    • Compare those full records to identify obvious match errors.

  • A formal match review determines errors and validates the methodology by adjudicating matched pairs with the source systems, service delivery sites, or other sources that can reliably ascertain that the pair represents the same person. For example, the reviewers could be individuals engaged with patients at source sites or individuals who contact the person, while maintaining confidentiality, to determine whether they have sought care at the source site. (See the Learn from Others section of the Toolkit for detailed descriptions.) The procedure is below.

    • Randomly select a sample of matched records identified by the tested algorithms, including linkages made by deterministic methods, probabilistic methods, or both. If multiple methods were used, make sure the sample includes matches made by all methods. The sample size should be determined using methods such as a population proportion calculation and the available resources.

    • After matched records are selected for validation, the clear-text values of each record pair are compared by three reviewers: two serve as primary reviewers, and the third serves as a tiebreaker. For each matched pair, reviewers record one of three possible outcomes: correct, incorrect, or unable to determine. The final match outcome decision is determined by the majority. If all three reviewers have different outcomes or at least two are undecided, the records should be subjected to in-depth validation using additional data elements contained in the EHR.

Note: Consider including linkages with match scores close to your threshold, i.e., which records just made it and which just missed out on being matched. When reviewing large datasets, the bulk of non-matching scores should be near 0 and the bulk of matching scores should be near 100. The records near the threshold are where fine-tuning matters most, i.e., where new weightings may need to be introduced or altered to increase or decrease the match score.
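One way to pull the borderline pairs for manual review, assuming scored pairs are held in a table with a match_score column on a 0–100 scale (the file name, threshold, and window are hypothetical):

```python
import pandas as pd

scored_pairs = pd.read_csv("scored_pairs.csv")  # hypothetical: record_id_a, record_id_b, match_score
THRESHOLD = 85   # assumed match threshold on a 0-100 scale
WINDOW = 5       # review band on either side of the threshold

near_threshold = scored_pairs[
    scored_pairs["match_score"].between(THRESHOLD - WINDOW, THRESHOLD + WINDOW)
]
# Sample these borderline pairs for manual review and weighting adjustments
review_sample = near_threshold.sample(n=min(100, len(near_threshold)), random_state=0)
```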

Tune algorithm

After testing the initial algorithm, its parameters must be tuned to increase its accuracy and improve its ability to meet performance goals, which vary across projects. For example, a project that prioritizes precision will choose a stringent match threshold that limits the number of false positive matches while a project that prioritizes recall might implement a more permissive match threshold. Improvements are typically sought in precision, recall, F-measure, and efficiency. Factors such as data quality, computational trade-offs, and the availability of human adjudication should also be considered.

  • Improve precision (reduce false positives)

The objective here is to maximize the number of correct matches out of all the potential matches identified. Strategies to do so depend on the matching method used:

  • For deterministic methods, additional identifying variables can be used to make the algorithm more restrictive. If the original deterministic method, for example, used first name + last name + DOB, the updated method could use first name + last name + DOB + home village.

  • For probabilistic methods, the matching threshold is increased.

  • Improve recall (increase true positive rate).

The objective here is to minimize the number of missed matches. Strategies to do so depend on the matching method used:

  • For deterministic methods, identifying variables are removed, or absolute values are replaced with fuzzy values. If the original deterministic method, for example, used first name + last name + DOB, the updated method could use Soundex of first name + Soundex of last name + DOB, which utilizes the Soundex phonetic algorithm for indexing names by sound.

  • For probabilistic methods, the matching threshold is decreased.

  • Improve F-measure.

The objective here is to find the optimal balance between precision and recall. Strategies used to improve precision and recall are combined to achieve the desired F-measure score.

  • Improve linkage efficiency.

The objective here is to shorten the run time of the technical process that identifies potential duplicate record pairs. Probabilistic methods in particular are often computationally intensive, so the options to improve efficiency often involve removing, replacing, or constraining them. These include (a blocking sketch follows this list):

  • Removing probabilistic methods altogether.

  • Replacing probabilistic methods with deterministic methods.

  • Restricting blocking strategies further for the probabilistic methods. For example, if the original blocking scheme was Year of birth + City, the updated blocking scheme could be Month of birth + Year of birth + people group.

  • Limiting the number of record pairs in each block by setting a maximum number for each. If a pair of records is excluded from a block due to this limit, it should be included in a different block.
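For illustration, a blocking scheme groups records by a key so that detailed (often probabilistic) comparison happens only within each block. The sketch below uses a hypothetical key of Soundex of last name plus year of birth; the Soundex implementation is simplified (it ignores the H/W rule), and packages such as jellyfish provide full versions.

```python
from collections import defaultdict
from itertools import groupby

SOUNDEX_CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}

def simple_soundex(name: str) -> str:
    """Simplified Soundex: first letter plus up to three digit codes."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return "0000"
    codes = [SOUNDEX_CODES.get(ch, "") for ch in letters]
    collapsed = [code for code, _ in groupby(codes)]   # collapse runs of identical codes
    digits = [code for code in collapsed[1:] if code]  # drop first letter's code and blanks
    return (letters[0].upper() + "".join(digits) + "000")[:4]

def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: Soundex of last name + year of birth."""
    return f"{simple_soundex(record['last_name'])}-{record['dob'][:4]}"

# Group records into blocks; pairwise comparison then happens within blocks only
records = [
    {"id": 1, "last_name": "Allen", "dob": "1990-05-01"},
    {"id": 2, "last_name": "Alan",  "dob": "1990-12-31"},
    {"id": 3, "last_name": "Okoro", "dob": "1985-07-15"},
]
blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

for key, members in blocks.items():
    print(key, [m["id"] for m in members])
# Records 1 and 2 share block "A450-1990"; record 3 falls in its own block.
```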

The data characteristics and performance goals guide tuning of the matching algorithm by managing trade-offs among aspects of the process. For example, to understand the burden of asthma, a public health agency may want to link a public health registry to a list of known asthma patients from a local clinic EHR. To do so, it may want to capture as many of the true asthma cases as possible, even if some patients without asthma are linked as well. This activity favors sensitivity, or recall. Alternatively, a physician may wish to link HIV lab tests to their patients in an EHR to better understand which patients have HIV. They do not want to link a positive test to a person who does not have HIV, so they are willing to risk missing a match to avoid incorrectly matching two distinct patients. This use case favors PPV, or precision. While a simple algorithm will perform well if data are of high quality, more sophisticated algorithms are needed when data are incomplete or not standardized.

It is also impossible to eliminate all false positive matches because of the inverse relationship between false positives and false negatives: when one is decreased, the other increases, as shown in the graph below.

Figure 1

If avoiding false positives is a chief concern, then higher precision is desirable. In contrast, if the main goal is avoiding missed matches, then achieving high recall (sensitivity) is needed. Since these two metrics (precision and recall) are in tension, performance goals should be defined in advance as a guide for setting the match score threshold in the tuning process.

The availability of human operators for the match process is also a consideration. Although matching algorithms can be configured to declare one of two outcomes (match or non-match), a third outcome as an uncertain match (“gray zone”) can be identified algorithmically if human supervision is available. This gray zone includes likely matches that fall below the threshold to automatically declare them a match; those are then adjudicated by human operators.
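A sketch of this three-outcome configuration, assuming two hypothetical thresholds on a 0–100 match score:

```python
AUTO_MATCH_THRESHOLD = 90   # assumed: scores at or above are declared matches
NON_MATCH_THRESHOLD = 60    # assumed: scores below are declared non-matches

def classify(match_score: float) -> str:
    """Declare match, non-match, or route the 'gray zone' to human review."""
    if match_score >= AUTO_MATCH_THRESHOLD:
        return "match"
    if match_score < NON_MATCH_THRESHOLD:
        return "non-match"
    return "possible match (human review)"

print(classify(95), classify(75), classify(20))
# match  possible match (human review)  non-match
```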

It is essential to keep in mind that tuning is an ongoing process that requires regular testing and refinement of the matching algorithm. This is discussed in Phase 3: Reviewing algorithm performance and refining as needed. Multiple data sources are often used to validate the algorithm and improve its reliability and accuracy. As new records and matching scenarios are introduced and because data can “drift” over time, continuous monitoring and refinement are needed to ensure optimal performance.
