Phase 2: Implementing the algorithm and adjudicating potential duplicates

During this phase, the matching and deduplication processes are implemented within routine system operations. If implemented in a site-level system, this involves determining how often to run the algorithm and resolving record pairs identified as potential duplicates. If implemented within a national data repository (NDR), it involves both an initial processing of the existing data and an ongoing process that compares incoming data with the records already in the NDR to identify and adjudicate potential matches. Implementing the matching algorithm relies on other processes that regularly (e.g., daily, weekly) identify and resolve potential data quality issues. This section discusses the operational aspects of ongoing matching and deduplication of individual records.

This phase involves the following main steps:

  • Step 2A: Determine roles and responsibilities

  • Step 2B: Deploy software or custom code in a production environment and perform patient matching

  • Step 2C: Handle potential duplicate records

Step 2A: Determine roles and responsibilities

Roles and responsibilities

As with all ongoing processes, identifying who is responsible for each aspect ensures smoother operations. Whether matching within a single system or across multiple data sources, individuals responsible for the following roles should be identified:

  • Leadership: to supervise execution of all steps in the record matching and adjudication processes, obtain regulatory permissions, and ensure standards are met.

  • Software and hardware deployment: to install and maintain necessary software (prerequisite tools such as Python, R, or Java and, if the site is doing its own linkage, the linkage software for data processing) and to ensure availability of hardware such as physical computers, servers, or virtual machines.

  • Data quality assurance: to identify data quality issues, work with the source system to resolve them, and monitor that the data are updated.

  • Duplicate adjudication: to work with the data source to determine whether potential duplicates are true matches and, if so, to initiate linkage between the records.

The responsible personnel should be able to communicate with each other easily and quickly before, during, and after the matching process. Communication options include email, instant messaging (e.g., WhatsApp, Slack), or a message board (e.g., Discord, GitHub).

Step 2B: Deploy Software or Custom Code in Production Environment and Perform Patient Matching

Software or custom code deployment

Once individuals are identified to fill each of the roles, the one(s) responsible for software (or custom code) and hardware deployment should install the selected software or custom code in the production environment and ensure that the hardware in use is configured to handle the volume of records being reviewed for potential matches. Once these are in place, the responsible individual(s) should routinely monitor the software and hardware to ensure they continue to operate without issue, and should troubleshoot and resolve any problems with running the match process. Problems might include changes to incoming data sources such as missing data elements, invalid data formats or values that were not previously accounted for, or an incorrect record count.
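The kinds of incoming-data problems listed above can be checked automatically before each run. The following is a minimal sketch, not a prescribed implementation: the field names in REQUIRED_FIELDS, the YYYY-MM-DD date format, and the function name check_incoming_batch are all hypothetical and should be replaced with whatever the chosen algorithm actually uses.

```python
from datetime import datetime

# Hypothetical field names; substitute the data elements your algorithm uses.
REQUIRED_FIELDS = ("first_name", "last_name", "date_of_birth")


def check_incoming_batch(records, expected_count):
    """Return a list of human-readable problems found in an incoming batch."""
    problems = []

    # Incorrect record count compared with what the source reported sending.
    if len(records) != expected_count:
        problems.append(
            f"record count {len(records)} != expected {expected_count}"
        )

    for i, rec in enumerate(records):
        # Missing (absent or empty) data elements.
        for field in REQUIRED_FIELDS:
            if not rec.get(field):
                problems.append(f"record {i}: missing {field}")

        # Invalid format or value (assumes dates arrive as YYYY-MM-DD).
        dob = rec.get("date_of_birth")
        if dob:
            try:
                datetime.strptime(dob, "%Y-%m-%d")
            except ValueError:
                problems.append(f"record {i}: invalid date_of_birth {dob!r}")

    return problems
```

An empty list means the batch passed these checks; any entries can be routed to the person responsible for data quality assurance before the match process starts.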

Performing patient matching

As discussed previously, matching may be performed with deterministic methods, which require exact agreement on identifiers such as a National Patient Identifier, first name, last name, and date of birth, and/or probabilistic methods, which use statistical analysis to determine likely matches based on a match score threshold even when the data are not exactly the same (such as variants of a first name or transposed numbers in a street address). There are many possible specific methods to use within each of these two general method types. In probabilistic methods, “blocking” makes the linkage more computationally manageable, especially for large datasets, by using a few specific identifiers to reduce the number of candidate pairs and filter out unlikely matches.
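To make the distinction concrete, the sketch below combines a deterministic check, a blocking step, and a scored comparison. It is illustrative only: the field names, the blocking key (birth year plus last-name initial), and the threshold of 85 are assumptions, and the score is a simple average of string similarities from Python's difflib standing in for a real probabilistic model, which would weight fields statistically.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def deterministic_match(a, b):
    """Exact agreement on every identifying field (hypothetical field names)."""
    keys = ("national_id", "first_name", "last_name", "date_of_birth")
    return all(a[k] == b[k] for k in keys)


def similarity_score(a, b):
    """Average string similarity (0-100); a stand-in for a probabilistic score."""
    keys = ("first_name", "last_name", "date_of_birth")
    ratios = [
        SequenceMatcher(None, a[k].lower(), b[k].lower()).ratio() for k in keys
    ]
    return 100 * sum(ratios) / len(ratios)


def find_potential_duplicates(records, threshold=85):
    # Blocking: only records sharing birth year and last-name initial
    # are compared, which avoids scoring every possible pair.
    blocks = defaultdict(list)
    for rec in records:
        key = (rec["date_of_birth"][:4], rec["last_name"][:1].lower())
        blocks[key].append(rec)

    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if deterministic_match(a, b):
                pairs.append((a["record_id"], b["record_id"], "deterministic", 100.0))
            else:
                score = similarity_score(a, b)
                if score >= threshold:
                    pairs.append(
                        (a["record_id"], b["record_id"], "probabilistic", round(score, 1))
                    )
    return pairs
```

Note the trade-off in the blocking key: a narrow key reduces comparisons but will miss true duplicates whose blocking fields themselves disagree, for example a mistyped birth year.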

Step 2C: Handle Potential Duplicate Records

Patient matching output

The output of the patient matching process should include metadata such as the matching strategies used and the run time of each. Software operations during this step might include pausing (temporarily stopping the operation for later resumption, which is particularly helpful when data volume is large) and termination, in which the current match process is ended. This step should also include troubleshooting any problems with the matching process. These might include 1) a hung process, in which a method takes longer than expected to execute; and 2) a process that generates zero matches. While a zero-match result may be legitimate, it should be verified that the process actually ran, for example by reviewing the number of record pairs compared.
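The two troubleshooting checks described above can be expressed as a small post-run review. This is a sketch under assumptions: the function name review_run and its parameters are hypothetical, and the expected maximum duration would need to be set from experience with earlier runs.

```python
def review_run(pairs_compared, matches_found, elapsed_seconds, expected_max_seconds):
    """Return warnings that a person should look at after a match run."""
    flags = []

    # A run far beyond its usual duration may indicate a hung process.
    if elapsed_seconds > expected_max_seconds:
        flags.append("possible hung process: run exceeded expected duration")

    # Zero matches can be legitimate, but only if pairs were actually compared.
    if matches_found == 0:
        if pairs_compared == 0:
            flags.append("zero matches and zero pairs compared: the process may not have run")
        else:
            flags.append("zero matches: plausible, but confirm against pairs compared")

    return flags
```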

Expected match result output should be stored in a table within the database and include the following:

  • Randomly assigned administrative identifiers. Matching software or custom code should assign a unique number to each record in the dataset for administrative purposes, in order to identify and count that record during the matching process.

  • Match strategy. The output should include an indication of the method (deterministic and/or probabilistic) used to identify matches.

  • Confidence score. If the record pair was linked by a probabilistic method, the confidence score represents the strength of the match. If a record pair was linked by a deterministic method, the confidence score will always be 100 (exact match).
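One possible shape for such a table is sketched below using SQLite from the Python standard library. The table name, column names, and the use of random UUIDs as administrative identifiers are assumptions made for illustration; a production system would follow its own database and naming standards.

```python
import sqlite3
import uuid


def new_admin_id():
    """Randomly generated administrative identifier (a UUID, as an assumption)."""
    return uuid.uuid4().hex


def create_match_table(conn):
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS match_result (
            record_a_admin_id TEXT NOT NULL,
            record_b_admin_id TEXT NOT NULL,
            match_strategy    TEXT NOT NULL
                CHECK (match_strategy IN ('deterministic', 'probabilistic')),
            confidence_score  REAL NOT NULL
                CHECK (confidence_score BETWEEN 0 AND 100)
        )
        """
    )


def save_match(conn, record_a_admin_id, record_b_admin_id, strategy, score=None):
    # Deterministic links are exact matches, so the score is always 100.
    if strategy == "deterministic":
        score = 100.0
    elif score is None:
        raise ValueError("a probabilistic match needs a confidence score")
    conn.execute(
        "INSERT INTO match_result VALUES (?, ?, ?, ?)",
        (record_a_admin_id, record_b_admin_id, strategy, score),
    )
```

The CHECK constraints reject rows with an unknown strategy or an out-of-range score at insert time, so malformed output is caught before it reaches adjudicators.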

The match process should ideally produce log files that summarize the results of each run using descriptive statistics such as the following:

  • Number of records compared

  • Number of potential duplicate records identified

  • Number of matches by general method type: deterministic and/or probabilistic

  • Number of matches by each specific method
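The metrics listed above could be computed from the match output along these lines. This is a minimal sketch: the function name summarize_run and the dictionary keys (record_ids, method_type, specific_method) are hypothetical, and the JSON line format is just one convenient choice for a log file.

```python
import json
from collections import Counter


def summarize_run(records_compared, matches):
    """Build the run summary from an iterable of match dictionaries.

    Each match is assumed to have 'record_ids' (the pair of record IDs),
    'method_type' and 'specific_method' keys.
    """
    matches = list(matches)
    # A record appearing in several pairs is counted once.
    duplicate_ids = {rid for m in matches for rid in m["record_ids"]}
    return {
        "records_compared": records_compared,
        "potential_duplicate_records": len(duplicate_ids),
        "matches_by_method_type": dict(Counter(m["method_type"] for m in matches)),
        "matches_by_specific_method": dict(
            Counter(m["specific_method"] for m in matches)
        ),
    }


def to_log_line(summary):
    """Serialize the summary as one JSON line for appending to a log file."""
    return json.dumps(summary, sort_keys=True)
```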

Linkage result adjudication

If a high degree of precision is needed, match results can be adjudicated manually by reviewing matched pairs with the source systems, service delivery sites, or another source that can reliably ascertain whether the pair represents the same person. As mentioned in Phase 1, Step 1C, the reviewers could be, for example, individuals engaged with patients at source sites or individuals who contact the person, while maintaining confidentiality, to determine whether they have sought care at the source site. (See the Learn from Others section of the Toolkit for detailed descriptions.)
