"Fast ER GPU Accelerated Record Linkage in Python" by Dr. R. Michael Alvarez and Jacob Morrier, Caltech

Watch the latest presentation from the Data Analytics Colloquium, "Fast ER GPU Accelerated Record Linkage in Python" by Dr. R. Michael Alvarez and Jacob Morrier, Caltech, at https://youtu.be/wsv__0a_KDY?si=00slm3wG1RWUa0-Y.

Abstract:

Record linkage, also called "entity resolution," consists of matching observations from two datasets that represent the same unit, even when consistent common identifiers are absent. This process typically involves computing string similarity metrics, such as the Jaro-Winkler metric, for all pairs of values between the datasets. The Fast-ER package accelerates these computations with graphics processing units (GPUs). It estimates the parameters of the Fellegi-Sunter model, a widely used probabilistic record linkage model, and performs the necessary data processing on CUDA-enabled GPUs. Our experiments demonstrate that this approach can increase processing speed more than 60-fold compared to the previous leading software implementation, reducing processing time from hours to minutes. This significantly improves the scalability of probabilistic record linkage and deduplication for large datasets.
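
To make the all-pairs comparison step concrete, here is a minimal pure-Python sketch of the Jaro-Winkler metric applied to every pair of values from two lists. It is an illustration only and does not use or reproduce the Fast-ER API: the function names (`jaro`, `jaro_winkler`, `similarity_matrix`) and the sample names are hypothetical, and the prefix scale of 0.1 with a four-character prefix cap is the conventional default, assumed here rather than taken from the talk.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (
        matches / len1
        + matches / len2
        + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults p=0.1, cap 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1.0 - base)


def similarity_matrix(left: list[str], right: list[str]) -> list[list[float]]:
    """Compare every value in `left` with every value in `right`."""
    return [[jaro_winkler(a, b) for b in right] for a in left]


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    left = ["MARTHA", "JONES"]
    right = ["MARHTA", "JOHNSON", "SMITH"]
    for name, row in zip(left, similarity_matrix(left, right)):
        print(name, [round(score, 4) for score in row])
```

The nested comprehension in `similarity_matrix` does work proportional to the product of the two list lengths, and each pair is scored independently of every other pair. That independence is what makes this step a natural candidate for GPU parallelism, although how Fast-ER actually organizes the computation is not described in the abstract.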
