Monday, December 11, 2006

Personal Name Matching, Data Linkage and Geocoding


"A Comparison of Personal Name Matching: Techniques and Practical Issues" and "Privacy-Preserving Data Linkage and Geocoding: Current Approaches and Research Directions"

Peter Christen (DCS, ANU)

DATE: 2006-12-13
TIME: 16:00:00 - 17:00:00
LOCATION: CSIT Building, N101, ANU, Canberra

In this seminar I will present two talks I will give at the IEEE International Conference on Data Mining (ICDM) in Hong Kong, 18-22 December.

1) Finding and matching personal names is at the core of an increasing number of applications, from text and Web mining and search engines to information extraction, deduplication and data linkage systems. Variations and errors in names make exact string matching problematic, so approximate matching techniques have to be applied. Compared to general text, however, personal names have different characteristics that need to be considered. In this talk I will discuss the characteristics of personal names and present potential sources of variations and errors. I will then give an overview of a large number of commonly used, as well as some recently developed, name matching techniques. Experimental comparisons using four large name data sets indicate that there is no clear best matching technique.
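As a small illustration of why exact matching fails on name variations, the following Python sketch scores two names with a normalised edit (Levenshtein) distance. It is only one generic approximate matching approach, chosen here as an assumption for illustration; the function names `edit_distance` and `name_similarity` are hypothetical and are not taken from the talk or from any particular library.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after simple normalisation
    (assumed here: trimming whitespace and lower-casing)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    # An exact comparison treats these as different names,
    # while the similarity score shows they are close.
    print("christen" == "christian")
    print(name_similarity("christen", "christian"))
```

A pair of names would then be treated as a potential match when the score exceeds a chosen threshold; where that threshold should sit depends on the data.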

2) Data linkage is the task of matching and aggregating records that relate to the same entity from one or more data sets. A related technique is geocoding, the matching of addresses to their geographic locations. As data linkage is often based on personal information (like names and addresses), privacy and confidentiality are of paramount importance. In this talk I will present an overview of current approaches to privacy-preserving data linkage and discuss their limitations. Using real-world scenarios, I will illustrate the significance of developing improved techniques for automated, large-scale and distributed privacy-preserving linking and geocoding. I will then discuss four core research areas that need to be addressed in order to make linking and geocoding of large confidential data collections feasible.

Dr Peter Christen is a lecturer at the Department of Computer Science at the Australian National University. He received his Diploma in computer science engineering from the ETH Zurich (Switzerland) in 1995 and his PhD in computer science from the University of Basel (Switzerland) in 1999. His research interests are data mining (especially data linkage and data pre-processing), high-performance computing, and most recently security and privacy preservation (in the context of data linkage and health informatics).

Over the last four years his research has concentrated on the project "Investigation and Development of Parallel Large Scale Record Linkage Techniques", an ARC Linkage project conducted in collaboration with, and partially funded by, the NSW Department of Health.


