Linking, selecting cut-offs, and examining quality in the Integrated Data Infrastructure (IDI)
Laura O’Sullivan
Statistics New Zealand
IAOS Vietnam, October 2014
Outline
The Integrated Data Infrastructure (IDI)
Terminology
IDI linking
• Near-exact and non-exact
• Selecting cut-offs
• Quality
• Clerical review
Linking at Statistics New Zealand and at the Australian Bureau of Statistics
Integrated Data Infrastructure (IDI)
Person-centred data:
• Student loans & allowances
• Migration & movements
• Education
• Benefits
• Business data
• Tax
• Justice
• Health & safety
• Families & households
Terminology
Data integration (aka record linkage)
Deterministic linking
Probabilistic linking (Fellegi-Sunter theory)
Weights
• Indicate how likely it is that two records are from the same person
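As a rough illustration of how weights arise under Fellegi-Sunter theory, the sketch below sums log-likelihood ratios across compared fields. The field names and the m- and u-probabilities are made-up values for illustration, not IDI parameters.

```python
import math

# Hypothetical m- and u-probabilities per field (illustrative values only):
# m = P(field agrees | records are from the same person)
# u = P(field agrees | records are from different people)
FIELD_PROBS = {
    "first_name": (0.95, 0.01),
    "last_name": (0.95, 0.005),
    "date_of_birth": (0.98, 0.001),
    "sex": (0.99, 0.5),
}

def field_weight(m: float, u: float, agrees: bool) -> float:
    """Log2 likelihood ratio for one field: positive on agreement, negative on disagreement."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def pair_weight(agreement: dict) -> float:
    """Total weight for a record pair: the sum of its field weights."""
    return sum(field_weight(*FIELD_PROBS[f], agrees) for f, agrees in agreement.items())

all_agree = pair_weight({f: True for f in FIELD_PROBS})
all_disagree = pair_weight({f: False for f in FIELD_PROBS})
print(round(all_agree, 2), round(all_disagree, 2))
```

Pairs that agree on every field get a large positive weight and pairs that disagree on every field get a large negative one, so the weight orders record pairs from least to most likely to be the same person.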
Cut-offs
[Figure: Distribution of the weights. Number of record pairs (y-axis, 40 to 1,240) plotted against weight (x-axis, -95 to 50), with regions labelled Non-links and Links. Source: Statistics New Zealand]
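A minimal sketch of how cut-offs partition the weight distribution, assuming the common arrangement of a lower and an upper cut-off with a clerical review band between them. The weights and cut-off values are hypothetical.

```python
def classify(weight: float, lower: float, upper: float) -> str:
    """Assign a record pair to a region based on its weight."""
    if lower > upper:
        raise ValueError("lower cut-off must not exceed upper cut-off")
    if weight >= upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "clerical review"

# Hypothetical weights and cut-offs, chosen only for illustration.
weights = [-60.0, -5.0, 12.0, 31.5]
labels = [classify(w, lower=0.0, upper=25.0) for w in weights]
print(labels)
```

Setting `lower` equal to `upper` reduces this to a single cut-off with no review band.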
Quality

            True matches      Non-matches
Linked      True positives    False positives
Unlinked    False negatives   True negatives
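The four cells above can be turned into error rates. The sketch below assumes the convention often used in record linkage, where the false positive rate is the share of links that are not true matches; the counts are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LinkCounts:
    true_positives: int
    false_positives: int
    false_negatives: int
    true_negatives: int

    def false_positive_rate(self) -> float:
        """Share of links that are not true matches (assumed definition)."""
        linked = self.true_positives + self.false_positives
        return self.false_positives / linked if linked else 0.0

    def false_negative_rate(self) -> float:
        """Share of true matches that were left unlinked."""
        matches = self.true_positives + self.false_negatives
        return self.false_negatives / matches if matches else 0.0

# Hypothetical counts, for illustration only.
counts = LinkCounts(true_positives=950, false_positives=50,
                    false_negatives=100, true_negatives=8900)
print(counts.false_positive_rate(), counts.false_negative_rate())
```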
Near-exact and non-exact
First name and Last name agreement

Data  Insert   Delete  Replace  Double   Single   Swap    Append  Truncate
A     Robert   Robert  Robert   Robert   Robbert  Robert  Kat     Katie
B     Robiert  Robrt   Rovert   Roobert  Robert   Robret  Katie   Kat
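One way to detect name discrepancies of the kinds shown above is an edit-distance check. The sketch below uses the optimal string alignment distance (insert, delete, replace, adjacent swap) plus a prefix test for appended or truncated names. The threshold of one edit and the prefix rule are assumptions made for illustration, not the IDI's definition of near-exact agreement.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, replace, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[rows - 1][cols - 1]

def name_agreement(a: str, b: str) -> str:
    """Classify a name comparison as exact, near-exact, disagree, or missing."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return "missing"
    if a == b:
        return "exact"
    # Assumed rule: one edit, or one name is a prefix of the other.
    if osa_distance(a, b) <= 1 or a.startswith(b) or b.startswith(a):
        return "near-exact"
    return "disagree"

pairs = [("Robert", "Robiert"), ("Robert", "Robrt"), ("Robert", "Rovert"),
         ("Robert", "Roobert"), ("Robbert", "Robert"), ("Robert", "Robret"),
         ("Kat", "Katie"), ("Katie", "Kat")]
for a, b in pairs:
    print(a, b, name_agreement(a, b))
```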
Date of birth agreement

Data  Replace     Swap        Transpose
A     04/08/1982  02/08/1982  02/08/1982
B     04/02/1982  20/08/1982  08/02/1982
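The three date-of-birth discrepancies above can be told apart with simple digit comparisons. This is a sketch under the assumption that dates arrive as DD/MM/YYYY strings; the function name and labels are hypothetical.

```python
def dob_discrepancy(a: str, b: str) -> str:
    """Compare two DD/MM/YYYY strings and name the kind of discrepancy, if any."""
    if a == b:
        return "exact"
    da, ma, ya = a.split("/")
    db, mb, yb = b.split("/")
    # Day and month recorded in the opposite order.
    if (da, ma, ya) == (mb, db, yb):
        return "transpose"
    digits_a, digits_b = da + ma + ya, db + mb + yb
    if len(digits_a) != len(digits_b):
        return "disagree"
    diffs = [i for i, (x, y) in enumerate(zip(digits_a, digits_b)) if x != y]
    # Exactly one digit differs.
    if len(diffs) == 1:
        return "replace"
    # Two adjacent digits appear in reversed order.
    if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
            and digits_a[diffs[0]] == digits_b[diffs[1]]
            and digits_a[diffs[1]] == digits_b[diffs[0]]):
        return "swap"
    return "disagree"

print(dob_discrepancy("04/08/1982", "04/02/1982"))
print(dob_discrepancy("02/08/1982", "20/08/1982"))
print(dob_discrepancy("02/08/1982", "08/02/1982"))
```

Strings that do not split into exactly three parts raise a `ValueError`, so real data would need validation before this check.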
Selecting the cut-off
[Figure: Graph of near-exact and non-exact links. Frequency of links (y-axis, 0 to 300,000) for the Non-exact and Near-exact series, plotted against x-axis values from 0 to 40. Source: Statistics New Zealand]
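A comparison like the one in the graph can be tabulated and scanned for a candidate cut-off. The rule below (keep lowering the cut-off while near-exact links make up at least a chosen share of links) is a hypothetical heuristic for illustration only, as are the counts; it is not presented as the rule used for the IDI.

```python
from typing import Dict, Optional, Tuple

def suggest_cutoff(counts: Dict[int, Tuple[int, int]],
                   min_near_exact_share: float = 0.5) -> Optional[int]:
    """Return the lowest weight w such that, at w and at every higher weight,
    near-exact links are at least `min_near_exact_share` of all links.

    `counts` maps weight -> (near_exact_links, non_exact_links).
    """
    cutoff = None
    for weight in sorted(counts, reverse=True):
        near, non = counts[weight]
        total = near + non
        if total and near / total < min_near_exact_share:
            break
        cutoff = weight
    return cutoff

# Hypothetical frequencies by weight: (near-exact, non-exact).
counts = {0: (10, 900), 4: (50, 600), 8: (400, 300), 12: (2000, 150), 16: (9000, 40)}
print(suggest_cutoff(counts))
print(suggest_cutoff(counts, min_near_exact_share=0.9))
```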
Quality in the IDI
False positive rates
• Sample from non-exact links
• Assume near-exact links are true matches
• Use proportional sampling
Non-exact rates
• Monitoring
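The estimation steps above can be sketched as follows, reading "proportional sampling" as proportional allocation of a clerical sample across strata of non-exact links. The strata, sample sizes, and review results are hypothetical.

```python
def proportional_allocation(stratum_sizes: dict, sample_size: int) -> dict:
    """Allocate the sample across strata in proportion to stratum size.
    Rounding means the allocations may not sum exactly to sample_size."""
    total = sum(stratum_sizes.values())
    return {s: round(sample_size * n / total) for s, n in stratum_sizes.items()}

def estimate_false_positive_rate(near_exact_links: int, stratum_sizes: dict,
                                 sampled: dict, false_found: dict) -> float:
    """Estimate false positives as a share of all links, assuming that
    near-exact links are true matches and only non-exact links can be false."""
    estimated_false = sum(
        stratum_sizes[s] * false_found[s] / sampled[s]
        for s in stratum_sizes if sampled[s]
    )
    total_links = near_exact_links + sum(stratum_sizes.values())
    return estimated_false / total_links

# Hypothetical non-exact links by weight band, and hypothetical review results.
strata = {"low": 2000, "mid": 3000, "high": 5000}
sampled = proportional_allocation(strata, sample_size=500)
false_found = {"low": 20, "mid": 15, "high": 5}
rate = estimate_false_positive_rate(90000, strata, sampled, false_found)
print(sampled, rate)
```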
Clerical review
A link with two first names matching and different last name

Dataset        A            B
First names    Mary Louise  Mary Lou
Last names     Brown        Hughes
Date of birth  04/11/1984   04/11/1984
Sex            2            2
A link with unique identifiers and missing name information in one dataset

Dataset        A           B
Identifier     12345       12345
First names    Owen        -
Last names     Keyes       -
Date of birth  06/01/1951  06/01/1951
Sex            1           1
A link with missing name information and without unique identifiers

Dataset        A              B
First names    Holly Jessica  Holly
Last names     Gordon
Date of birth  01/05/1940     01/05/1940
Sex            2              2
Statistics New Zealand and the Australian Bureau of Statistics
Statistics New Zealand
• Census to the Post-enumeration survey (PES)
• Linking the longitudinal census
Australian Bureau of Statistics
• Linking projects using name and address
• Census data enhancement project
Thank you for listening
Questions