Frequently Asked Questions (FAQ)


Q: What are blocking schemes?

A: To improve the efficiency of probabilistic methods, data blocking methods, or blocking schemes, are used to reduce the search space of comparisons needed between two or more datasets. A blocking scheme divides the datasets into smaller, manageable subsets, known as "blocks." These blocks are constructed from certain attributes or features shared between records in both datasets. For example, if we have two datasets of people's information, we could create blocks based on the first letter of their last names, so that all records with the same initial letter are placed in the same block. When a blocking scheme is applied, only records in the same block are compared.

From Bruce Dickie: The best analogy is pairing up all your socks: you start by creating small piles of socks of the same color, so you are no longer comparing red socks with blue socks, only with other red socks, quickly reducing the search space.

In the context of patient matching, you can reduce the search space by almost half by splitting on gender at birth. You can reduce it further by looking only within age brackets.

More effort can be made, but then you run the risk of accidentally excluding correct records (false negatives) by making your blocking rules too strict.
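As a rough illustration, here is a minimal sketch of this idea in Python. The field names (gender_at_birth, age) and the 10-year age bracket are assumptions made for the example, not part of any particular implementation.

```python
from collections import defaultdict


def blocking_key(record):
    """Blocking key: gender at birth plus a 10-year age bracket.
    The field names are illustrative assumptions."""
    return (record["gender_at_birth"], record["age"] // 10)


def candidate_pairs(dataset_a, dataset_b):
    """Yield only those record pairs that fall into the same block."""
    blocks_b = defaultdict(list)
    for rec in dataset_b:
        blocks_b[blocking_key(rec)].append(rec)
    for rec_a in dataset_a:
        for rec_b in blocks_b.get(blocking_key(rec_a), []):
            yield rec_a, rec_b
```

Instead of comparing every record in dataset_a against every record in dataset_b, only pairs that share a block are compared.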

Q: What are deterministic and probabilistic linkage methods? How or when can the two be used together?

From Bruce Dickie: Deterministic rules are very simple to follow and implement, and much less computationally expensive. Used on good data, they produce great results; but when data isn't captured well, accuracy starts to drop off.

This is when probabilistic approaches are required. They are more computationally intensive but offer more flexibility in identifying matches; on messy data, they can yield higher accuracy rates.

Deterministic approaches are enough with high-quality data, for example when the data includes robust unique identifiers such as a national ID, SSN, NHS number, etc. In that case we simply search for other instances of the same identifier.
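A minimal sketch of such a deterministic rule, assuming a hypothetical national_id field present in both datasets:

```python
def deterministic_match(record_a, record_b):
    """Link two records if and only if they share the same non-empty national ID.
    The field name "national_id" is an assumption made for the example."""
    id_a = record_a.get("national_id")
    id_b = record_b.get("national_id")
    return bool(id_a) and id_a == id_b
```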

Likewise, when only a few fields are available, a probabilistic approach will not add much beyond what a deterministic one already covers.

If such identifiers don't exist (which is often the case), we need to use multiple fields and apply a different weight to each one, since not all fields carry the same weight in decision making. The uniqueness, data quality, and reliability of each field are factored into calculating its weight, and the weighted comparisons are combined into a total probabilistic score.
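A hedged sketch of that weighting idea is shown below; the field names and weight values are purely illustrative assumptions, not recommended settings.

```python
# Illustrative field weights: more unique and more reliable fields
# contribute more to the total score. These values are placeholders.
FIELD_WEIGHTS = {
    "national_id": 10.0,
    "date_of_birth": 5.0,
    "last_name": 4.0,
    "first_name": 3.0,
    "postcode": 2.0,
}


def match_score(record_a, record_b):
    """Sum the weights of the fields on which the two records agree."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        value_a, value_b = record_a.get(field), record_b.get(field)
        if value_a and value_b and value_a == value_b:
            score += weight
    return score
```

Here an agreement simply adds the full field weight; real implementations typically also penalize disagreements and handle missing values, but the overall shape of the score is the same.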

Q: What is the strategy for setting the match threshold in probabilistic linkage?

A: From Bruce Dickie: Choosing thresholds requires experience, a bit of trial and error, and some manual adjudication: listing out the scores and reviewing which ones correspond to accurate matches and which don't. From there, you draw the thresholds that yield the best results, which means deciding what matters more: accidentally rejecting matching records (false negatives) or accepting non-matching records (false positives). Depending on the context, minimizing one of these can be more important than the other, but it is often the case that you can't minimize both.
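For example, a pair of thresholds can split scored candidate pairs into matches, non-matches, and a middle band for manual review. The cut-off values below are arbitrary placeholders; in practice they would come out of the review process described above.

```python
UPPER_THRESHOLD = 15.0  # at or above this score: accept as a match
LOWER_THRESHOLD = 8.0   # at or below this score: reject as a non-match


def classify(score):
    """Classify a scored pair; scores between the thresholds go to manual review."""
    if score >= UPPER_THRESHOLD:
        return "match"
    if score <= LOWER_THRESHOLD:
        return "non-match"
    return "manual review"
```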

Suggestions for improvement

  • It might be worth including a step in the overall process to brainstorm a long list of scenarios to validate [the performance of the linkage method using synthetic data]. This would help from a use case definition perspective but also from a testing scenario perspective. Perhaps this doc could itemise some examples of those; obviously they would need to be nuanced to each different country setting, etc. - From Jack Hilton
