Integrating information from multiple sources plays a key role in social science research. However, when a unique identifier that unambiguously links records is not available, merging datasets can be a difficult and error-prone endeavor. In “Active Learning for Probabilistic Record Linkage”, I propose an active learning algorithm which efficiently incorporates human judgement to produce a probabilistic estimate for the unobserved matching status across records. I show that the algorithm significantly improves accuracy at the cost of manually labeling a small number of records. I illustrate the advantage of the algorithm using two empirical applications. The first one uses data from local politicians in Brazil, where a unique identifier is available for validation. The second is a recent vote validation study conducted for the ANES. I show that in both cases, the proposed method can recover estimates that are indistinguishable from those obtained from more extensive, expensive, and time-consuming clerical reviews.
Read more about me and my research in my webpage.