key: cord-0024857-kim93gtb
authors: Harper, Gill; Stables, David; Simon, Paul; Ahmed, Zaheer; Smith, Kelvin; Robson, John; Dezateux, Carol
title: Evaluation of the ASSIGN open-source deterministic address-matching algorithm for allocating unique property reference numbers to general practitioner-recorded patient addresses
date: 2021-12-08
journal: nan
DOI: 10.23889/ijpds.v6i1.1674
sha: ed7ed198d35caa5ea2ff8b9be616e1bd987a7831
doc_id: 24857
cord_uid: kim93gtb

INTRODUCTION: Linking places to people is a core element of the UK government’s geospatial strategy. Matching patient addresses in electronic health records to their Unique Property Reference Numbers (UPRNs) enables spatial linkage for research, innovation and public benefit. Available algorithms are not transparent or evaluated for use with addresses recorded by health care providers. OBJECTIVES: To describe and quality assure the open-source deterministic ASSIGN address-matching algorithm applied to general practitioner-recorded patient addresses. METHODS: Best practice standards were used to report the ASSIGN algorithm match rate, sensitivity and positive predictive value using gold-standard datasets from London and Wales. We applied the ASSIGN algorithm to the recorded addresses of a sample of 1,757,018 patients registered with all general practices in north east London. We examined bias in match results for the study population using multivariable analyses to estimate the likelihood of an address-matched UPRN by demographic, registration, and organisational variables. RESULTS: We found a 99.5% and 99.6% match rate with high sensitivity (0.999,0.998) and positive predictive value (0.996,0.998) for the Welsh and London gold standard datasets respectively, and a 98.6% match rate for the study population. The 1.4% of the study population without a UPRN match were more likely to have changed registered address in the last 12 months (match rate: 95.4%), be from a Chinese ethnic background (95.5%), or registered with a general practice using the SystmOne clinical record system (94.4%). Conversely, people registered for more than 6.5 years with their general practitioner were more likely to have a match (99.4%) than those with shorter registration durations. CONCLUSIONS: ASSIGN is a highly accurate open-source address-matching algorithm with a high match rate and minimal biases when evaluated against a large sample of general practice-recorded patient addresses. 
ASSIGN has potential to be used in other address-based datasets including those with information relevant to the wider determinants of health.


Keywords: data linkage; electronic health record; addresses; address-matching; quality assurance; population health; place-based health

Introduction

Data linkage is increasingly used in health data science, with growing examples of spatial linkage of electronic health records (EHRs) to environmental information for population health research [1] [2] [3] [4]. Address-matching is a form of data linkage that enables spatial linkage by matching non-standardised addresses recorded in an administrative dataset to a reference address gazetteer that provides standardised address formats, property reference numbers, and geographic co-ordinates.

Linking places to people is a core element of the UK government's geospatial strategy [5]. In 2019, the Public Sector Geospatial Agreement [6] gave more than 5,000 public sector organisations unlimited access to Ordnance Survey data, including Unique Property Reference Numbers (UPRNs), the unique identifier for every addressable location in Great Britain. UPRNs are described as the 'golden thread' which links datasets together, with the potential 'to underpin huge advances in our digital society, improving our lives and equipping the economy to recover from the effects of Coronavirus' [5]. UPRNs are now a mandated standard across the public sector; however, challenges remain in implementing this fully within the National Health Service (NHS) to enable geospatial linkage for research, innovation, and public benefit.

The UPRN acts as an address standardiser, a household identifier, and a high-resolution geocoder, and ultimately as the granular spatial link to environmental information to be used for direct patient care as well as for health research. Address-based geography using UPRNs moves away from the acknowledged limitations of area-based geography ecological approaches and enables more accurate patient-level analysis of the effect of geographical and household exposures and covariates on health outcomes.

Robust methods are important for linking addresses in health data to UPRNs. Schinasi et al. (2018) [4] concluded that such linkage is a major research opportunity, and that future research should include more detailed descriptions of methods used to geocode addresses and for dealing with missing or poor quality geographic information. They recommended assessment of the extent and impact of biases including the adoption or design of formal methods to assess the extent to which patterns of missing geographic data will lead to biased results.

While other address-matching algorithms are available in the UK [7] [8] [9] few, if any, have been developed specifically for patient recorded addresses available in EHRs, and their methods, accuracy and potential biases are often not transparent or evaluated, limiting the extent to which users of address-matching results can be aware of and assess implications for analyses.

From our experience we propose that there are five general factors that will affect the match rates, quality, and bias of match success of any address-matching algorithm:

1. How the address is provided, i.e. the quality of the address provided by the patient when registering with respect to its completeness, spelling mistakes or omissions.

2. How the address is recorded, i.e. manually by a data entry person using free-text or auto-fill prompts, and the level of attention to accuracy when doing this.

3. The content and quality of the property gazetteer being matched to, i.e. whether the gazetteer is up-to-date and complete.

4. The geography of the address, as some geographic areas are more prone to variation or errors in how the address is provided, which may then differ from the standardised address in the property gazetteer. For example, apartment numbering can be represented in multiple ways, and properties in rural areas can be addressed by a house name or a number.

5. The matching algorithm, i.e. the quality and appropriateness of the method used to find a match.

We describe ASSIGN (AddreSS MatchInG to Unique Property Reference Numbers), an address-matching algorithm specifically designed, developed and validated by Dr Gill Harper and Dr David Stables for the linkage of patient addresses as routinely recorded in EHRs to the UPRNs in the Ordnance Survey Great Britain property gazetteer database AddressBase Premium (ABP) [10] . ASSIGN as implemented in the north east London Discovery Data Service (DDS) enables the UPRNs from ABP to be assigned to each patient address in near real-time and subsequent changes to patient addresses and gazetteer databases to be automatically updated as required.

Overall, our objective was to transparently describe and quality assure the ASSIGN address-matching algorithm and examine potential biases in match results so that users of the algorithm and its outputs have this information available to them and have clarity on how their analyses may be affected by it.

If an address-matching algorithm is not accurate, an incorrect UPRN can result in the incorrect residential location being attributed to a patient. This can result in misclassification of environmental exposure estimates and, consequently, error in epidemiologic effect estimates, potentially systematically, for example of air pollution exposure on asthma-related emergency department visits [11]. It can also result in misassignment of occupants to a UPRN which, when used as a proxy for a household, can introduce error in studies where the household occupancy or type is the risk factor, for example in COVID-19 studies [12] [13] [14].

Knowledge of address-matching algorithm accuracy and error supports confident use of UPRNs not only within EHRs but across the growing variety of sectors who are moving towards the implementation of UPRNs on their address data.

The study population comprises 1,757,018 patients aged ≥18 years, alive and currently registered as at census date 16th November 2020 with one of 277 general practices providing primary care services to the entire geography covered by seven north east London Clinical Commissioning Groups (CCGs), all of which publish primary care EHR data on a daily basis into the north east London DDS and associated subscriber database. These patients were recorded as living at 945,196 distinct addresses.

The ASSIGN algorithm was developed by exploiting the address-matching experience of the designers and with inner north east London GP recorded patient addresses as the test addresses. Repeated checks of false positives and false negatives were made to inform coding improvements and increase match rate and accuracy with each iteration. The input address to be matched is named the 'candidate' address, and the addresses in ABP to be matched to are named the 'standard' addresses. The method consists of three stages: reformat, match and return.

The ABP files are loaded into a database and mapped directly to the combination of eleven standard address object fields that exist across both the Royal Mail Delivery Point Address (DPA) and local authority Local Property Identifier (LPI) versions of addresses in ABP: flat, building, number, dependent thoroughfare, street, dependent locality, locality, town, postcode, organisation, vertical, concatenating where required.

These are stored and heavily indexed using a set of single and compound indexes designed to improve search performance at run time. In addition, certain performance improving indexes are generated based on semantic equivalence or semantic importance. Examples include correcting spelling errors, de-pluralisation, replacing or removing punctuation and lower casing, and removing extraneous words that are unnecessary in the match process, for example, the range of words that are equivalent to the word 'flat' such as 'apartment' or 'maisonette'.
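The reformatting operations listed above can be illustrated with a minimal sketch. This is not the ASSIGN code: the function name `normalise`, the two small lookup tables, and the naive de-pluralisation rule are all assumptions made for illustration, and words equivalent to 'flat' are collapsed to 'flat' here rather than removed.

```python
import re

# Hypothetical lookup tables; the real equivalence lists are not given in the text.
FLAT_WORDS = {"apartment", "maisonette"}
SPELLING_FIXES = {"raod": "road", "stret": "street"}


def normalise(text: str) -> str:
    """Lower-case, strip punctuation, fix spellings, de-pluralise, map flat synonyms."""
    text = text.lower()
    text = re.sub(r"['’]", "", text)        # drop apostrophes so "king's" becomes "kings"
    text = re.sub(r"[^\w\s]", " ", text)    # replace remaining punctuation with spaces
    words = []
    for word in text.split():
        word = SPELLING_FIXES.get(word, word)
        # Naive de-pluralisation (assumption): drop one trailing 's' on longer words.
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        if word in FLAT_WORDS:
            word = "flat"
        words.append(word)
    return " ".join(words)


print(normalise("Apartments 3, King's Raod"))
```

Applying the same function to both standard and candidate addresses means that later comparisons operate on like-for-like strings.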

When a candidate address is submitted, the address string is parsed using a combination of Regex [18] matching expressions and index checking to form the same eleven address object fields. For example, the postcode is identified by checking the format and position in the string (postcodes are usually submitted at the end of the string or in a separate comma-delimited field). A further candidate address version is created by applying the same reformatting techniques as applied to the standard addresses, so that both unformatted and formatted versions of the eleven candidate address fields are available to the algorithm.
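As a hedged illustration of identifying the postcode by format and position, the sketch below looks for a postcode-shaped token at the end of the string. The regular expression is a commonly used loose approximation of the UK postcode format, not the expression used in ASSIGN, and `split_postcode` is a hypothetical helper name.

```python
import re
from typing import Optional, Tuple

# Loose UK postcode shape (assumption): outward code, optional space, inward code,
# anchored to the end of the string.
POSTCODE_AT_END = re.compile(
    r"([A-Z]{1,2}[0-9][0-9A-Z]?)\s*([0-9][A-Z]{2})\s*$", re.IGNORECASE
)


def split_postcode(address: str) -> Tuple[str, Optional[str]]:
    """Return (address without postcode, normalised postcode or None)."""
    match = POSTCODE_AT_END.search(address)
    if match is None:
        return address.strip(" ,"), None
    postcode = f"{match.group(1).upper()} {match.group(2).upper()}"
    remainder = address[: match.start()].strip(" ,")
    return remainder, postcode


print(split_postcode("Flat 2, 10 Mile End Road, London, e1 4ns"))
```

A postcode supplied in a separate field would not need this step; the sketch covers only the end-of-string case mentioned in the text.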

The final reformatting step is positional checking. For example, the candidate address abbreviation 'st' would be mapped to 'street' as a spelling correction, but not when it is the first word in a field: 'St David's', for example, would be retained as 'St David's'.
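The positional rule for 'st' can be sketched as follows. The function name `expand_st` is hypothetical and the rule is reduced to this single abbreviation; the actual algorithm may apply positional checks more widely.

```python
def expand_st(field: str) -> str:
    """Expand 'st' to 'street' unless it is the first word of the field."""
    words = field.split()
    expanded = []
    for position, word in enumerate(words):
        is_st = word.lower().rstrip(".") == "st"
        if is_st and position > 0:
            expanded.append("street")   # 'High St' -> 'High street'
        else:
            expanded.append(word)       # leading 'St' is kept, as in 'St David's'
    return " ".join(expanded)


print(expand_st("High St"))
print(expand_st("St David's"))
```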

The objective of the matching algorithm is to reach a high level of confidence that the matched candidate address refers to the same location as the standard address, and more so than any other available standard address. After blocking by matching postcode area, potential matching standard addresses are 'tried on for size' deterministically by applying matching judgement rules in rank order. The rules that are applied are determined by the content of the candidate address string and the text manipulation required. Higher ranking rules require the least amount of address string manipulation, so that rank 1 is an exact word-for-word match for the entire address string. Rank 1 is the most frequent match rule.
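A minimal sketch of blocking followed by rank-ordered deterministic rules is given below. Everything here is illustrative: the two rules, their names, the `Standard` record, the placeholder UPRN strings, and the interpretation of 'postcode area' as the leading letters of the postcode are assumptions, not the ASSIGN rule set.

```python
import re
from typing import Callable, List, NamedTuple, Optional, Tuple


class Standard(NamedTuple):
    uprn: str       # placeholder identifiers, not real UPRNs
    address: str
    postcode: str


def postcode_area(postcode: str) -> str:
    """Leading letters of the postcode, e.g. 'E' for 'E1 4NS' (assumed blocking key)."""
    found = re.match(r"[A-Za-z]+", postcode.strip())
    return found.group().upper() if found else ""


def rule_exact(candidate: str, standard: str) -> bool:
    return candidate == standard


def rule_case_and_spacing(candidate: str, standard: str) -> bool:
    return " ".join(candidate.lower().split()) == " ".join(standard.lower().split())


# Rules in rank order: the least string manipulation comes first.
RULES: List[Tuple[str, Callable[[str, str], bool]]] = [
    ("rank1_exact", rule_exact),
    ("rank2_case_and_spacing", rule_case_and_spacing),
]

GAZETTEER = [
    Standard("UPRN-A", "10 Mile End Road", "E1 4NS"),
    Standard("UPRN-B", "12 Mile End Road", "E1 4NS"),
    Standard("UPRN-C", "10 Mile End Road", "N1 9GU"),
]


def match_address(candidate: str, postcode: str,
                  gazetteer: List[Standard]) -> Optional[Tuple[str, str]]:
    """Return (uprn, rule name) for the first rule giving a single hit, else None."""
    area = postcode_area(postcode)
    block = [s for s in gazetteer if postcode_area(s.postcode) == area]
    for name, rule in RULES:
        hits = [s for s in block if rule(candidate, s.address)]
        if len(hits) == 1:
            return hits[0].uprn, name
    return None


print(match_address("10  mile end road", "e1 4ns", GAZETTEER))
```

Returning the rule name alongside the identifier mirrors the idea that each match carries a record of how it was obtained.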

These rules mirror human pattern recognition and manipulate and compare the address strings until the best available match is found. Human pattern recognition refers to knowing that similar or the same words in different orders, or transposed characters, or the correct spelling of a misspelled word usually means the same thing. The algorithm codes these using, for example, Levenshtein distance [19], pattern matching with Regex [18], field swapping and pluralisation.
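For readers unfamiliar with Levenshtein distance, a standard dynamic-programming implementation is sketched below. The helper `is_probable_misspelling` and its threshold of one edit are assumptions for illustration; the thresholds used in ASSIGN are not stated here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_probable_misspelling(candidate: str, standard: str, max_distance: int = 1) -> bool:
    """Hypothetical helper: different strings within max_distance edits."""
    return 0 < levenshtein(candidate, standard) <= max_distance


print(levenshtein("stret", "street"))
```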

The algorithm can be considered as a decision tree handling a combination of ANDs, ORs and NOTs, with branching occurring on the OR conditions. The nodes of the tree relate to comparisons of the different address fields and are pass/fail tests; travelling down one of the next branches means a test has been passed. If a test fails, the process goes back up the feeder branch to the next branching node and tries the next untried branch, until all branches are exhausted. This is similar to a tableaux tree [20], except that the nodes branch on human judgement-based decision making rather than pure logic.
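The backtracking behaviour described above can be sketched as a depth-first search over pass/fail nodes. The tree, its three tests, and the field names are hypothetical and far smaller than the real rule set; the sketch only shows how a failed test returns control to the parent node so that the next untried branch is attempted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Address = Dict[str, str]


@dataclass
class Node:
    label: str
    test: Callable[[Address, Address], bool]
    children: List["Node"] = field(default_factory=list)


def search(node: Node, cand: Address, std: Address) -> Optional[List[str]]:
    """Return the labels along the first fully passing path, or None."""
    if not node.test(cand, std):
        return None                      # fail: the caller tries its next branch
    if not node.children:
        return [node.label]              # leaf: every test on this path passed
    for child in node.children:          # OR branches, tried in order
        path = search(child, cand, std)
        if path is not None:
            return [node.label] + path
    return None                          # all branches exhausted


def number_equal(c: Address, s: Address) -> bool:
    return c["number"] == s["number"]


# Hypothetical tree: postcode AND ((exact street OR all-but-last-word street) AND number)
TREE = Node("postcode", lambda c, s: c["postcode"] == s["postcode"], [
    Node("street_exact", lambda c, s: c["street"] == s["street"],
         [Node("number", number_equal)]),
    Node("street_all_but_last_word",
         lambda c, s: c["street"].split()[:-1] == s["street"].split()[:-1],
         [Node("number", number_equal)]),
])

candidate = {"postcode": "E1 4NS", "street": "mile end rd", "number": "10"}
standard = {"postcode": "E1 4NS", "street": "mile end road", "number": "10"}
print(search(TREE, candidate, standard))
```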

A match is made with one of four overall qualifiers that qualifies the relationship between the candidate address and the matched standard address in relation to approximate geography, or no match is made. The qualifiers are:

1. Best match: the closest match out of all available

2. Child: candidate address is a 'child' sub-property of the UPRN it has been matched to

3. Parent: candidate address is the 'parent' building shell of the UPRN it has been matched to

4. Sibling: candidate address is a near neighbour of the UPRN it has been matched to

Where there is a match, the algorithm returns the UPRN, the overall qualifier, the standard address, the match pattern, and match rule identifier employed to get that match. The match rule is a label identifying which section of the code made the match, and the match pattern depicts how five address objects were manipulated to achieve the match. These five address objects are merged from the original eleven: flat, building, number, street, postcode. Twelve possible match terms (Table 1) exist and can be combined in up to 50 different ways on the five address fields. These are restricted to plausible terms, for example, postcodes are never swapped with streets.

Table 1 (term; symbol; description)

[term and symbol missing in source]; Indicates a match using more than one candidate field

moved to; >; Means that the candidate field was moved to another field to match, e.g. number moved to flat

moved from; <; Means that the candidate field was moved from another field to match on this field

field merged; f; When moved from and to, the fields are then merged to match

ABP field ignored; i; ABP field was ignored in order to match, i.e. the ABP address contained more precise detail than the candidate but was unnecessary in order to match. This usually means that the candidate field is null

Candidate field dropped; d; The candidate field was dropped in order to match, i.e. the candidate address has more precise detail than the authority address. The ABP address would probably be null

Matched as parent; a; The candidate field matched as being at a higher level than the ABP field, for example flat 6 matching to flat 6a

Matched as child; c; The candidate field matched as being at a lower level than the ABP field, for example candidate flat 6a, ABP flat 6

Partial match; p; The candidate field was partially matched to the ABP field (or vice versa), typically 2 out of 3 words

Possible spelling error; l; The candidate field and ABP field were matched using the Levenshtein distance algorithm, taking account of misspellings

Level based match; v; The level of a flat in a building (vertical from the street) was used to create the match, e.g. 2b for second floor b

Equivalent; e; The fields are equivalent, albeit not necessarily spelled the same, using various equivalence lists, word swaps, word drops etc.

An example of a match pattern is 'Pe,Se,Ne,Bp,Fe'. This means that the postcode, street, number, and flat fields were equivalent matches between the candidate and standard address, and the building field was a partial match between the candidate and standard address.
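To show how a match pattern such as 'Pe,Se,Ne,Bp,Fe' might be read programmatically, the sketch below decodes each comma-separated part into a field name and its match terms. The field letters (P, S, N, B, F) are inferred from the example above, and the assumption that a part may carry several term symbols is ours; the function `decode_pattern` is hypothetical.

```python
from typing import Dict, List

# Field letters inferred from the example pattern (assumption).
FIELDS = {"P": "postcode", "S": "street", "N": "number", "B": "building", "F": "flat"}

# Term symbols as listed in Table 1.
TERMS = {
    ">": "moved to",
    "<": "moved from",
    "f": "field merged",
    "i": "ABP field ignored",
    "d": "candidate field dropped",
    "a": "matched as parent",
    "c": "matched as child",
    "p": "partial match",
    "l": "possible spelling error",
    "v": "level based match",
    "e": "equivalent",
}


def decode_pattern(pattern: str) -> Dict[str, List[str]]:
    """Map each address field in the pattern to its list of match terms."""
    decoded: Dict[str, List[str]] = {}
    for part in pattern.split(","):
        part = part.strip()
        field_name = FIELDS[part[0]]
        decoded[field_name] = [TERMS.get(symbol, f"unknown ({symbol})") for symbol in part[1:]]
    return decoded


for name, terms in decode_pattern("Pe,Se,Ne,Bp,Fe").items():
    print(name, terms)
```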

The ASSIGN algorithm code is fully open-source [21] and free to use, and information on the algorithm method [22] is freely provided for users. Supplementary Appendix 1 describes ASSIGN in the GUILD [23] format.

ASSIGN seeks to match the input addresses to addresses in Ordnance Survey's AddressBase Premium (ABP) [10] . This is a comprehensive property gazetteer of all current, historic and future addresses, properties and land areas in Great Britain. Each property address is recorded in national standard BS7666 [24] format and is represented by a Unique Property Reference Number (UPRN). Property classification type and geographic co-ordinates are also provided for each UPRN. Updates to ABP are provided by the Ordnance Survey every six weeks as Epochs. When the ASSIGN algorithm matches the candidate address to a UPRN, metadata relating to the match and variables of interest from ABP is assigned. Match metadata is listed in Table 2 .

ASSIGN is designed so that only records in ABP that are of relevant property types are made available for matching. These include all residential property types and a considered selection of commercial property types that we found can be given as a person's place of residence for example if they live above a public house. The choice of property types can be varied as required, for example, if matching patient addresses solely to commercial addresses such as care homes. We evaluated Version 4.2.1 of the ASSIGN algorithm and Epoch 75 of AddressBase Premium which were implemented in the north east London DDS at the time of data extraction.

The algorithm was run on two 'gold standard' external reference address datasets with previously assigned and verified UPRNs in order to calculate accuracy rates. The first of these datasets comprised 9,177 local authority sourced addresses in Wales, and the second 9,475 local authority sourced addresses from the London Borough of Tower Hamlets in north east London. The ASSIGN algorithm has been developed using north east London patient addresses. Therefore, addresses from the rural geography of Wales and from local authority sourced addresses that tend to be of poorer address quality than patient addresses were considered to be a challenging test of ASSIGN's performance.

For all analyses reported in this paper, we used identifiable data from general practitioner electronic health records data which are held in the DDS subscriber database and curated by the Queen Mary University of London based Clinical Effectiveness Group (CEG). The GP EHR data are provided daily from GP system suppliers to the CEG DDS database and contain demographic and clinical data and address history for each patient registration. Approval for access to the person identifiable data (patient addresses) used in this study was provided by the DDS data controllers to the CEG as appointed data sub-processors for the purpose of developing and evaluating the ASSIGN algorithm for direct patient care purposes only. This access was limited to approved individuals with appropriate information governance training working in a secure trusted data environment.

The primary outcome was a binary variable indicating whether a UPRN had been matched or not matched to the patient address using the ASSIGN algorithm.

We selected a range of patient level demographic and registration characteristics, and organisational features to evaluate match rates and biases. These are listed in Table 3 .

In the absence of a formal standard method to evaluate address-matching algorithms, we considered the GUILD [23] data linkage reporting principles to be a relevant framework for this purpose because address-matching is fundamentally a data linkage exercise linking address data between two sources to find a match. GUILD proposes which information may be required at each step of the linkage pathway to improve the transparency, reproducibility, and accuracy of linkage processes, and the validity of analyses and interpretation of results. We follow this framework as much as possible, in particular for the calculation of match and accuracy rates.

We applied ASSIGN to the two gold standard external reference datasets and calculated the data linkage accuracy metrics described in Table 4 . We estimated the match rate obtained from applying ASSIGN to the 945,196 distinct patient addresses from our study population.

Descriptive summary statistics by three age bands (18-19, 20-64 and ≥65), five ethnic groups (White, South Asian, Black, Other (including Chinese and Mixed) and Not Stated), Sex (female, male, other) and IMD 2019 score quintile (1 = most deprived, 5 = least deprived) were calculated for the entire study population, separately for those with and without an ASSIGN-matched UPRN, in order to compare the characteristics of each group, including those with missing data. In total, 268,382 had missing values across these four variables, the majority from missing ethnic groups.

The absolute difference in the proportion matched, relative to the reference group for each explanatory variable was calculated. We considered an absolute difference in match rates of 1% or greater to be potentially an important difference.
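The comparison against a reference group can be sketched as below. The group names and counts are entirely hypothetical and are not study data; only the 1 percentage point threshold comes from the text.

```python
from typing import Dict, Tuple

THRESHOLD = 1.0  # percentage points, as stated in the text


def match_rate(matched: int, total: int) -> float:
    """Match rate as a percentage."""
    return 100 * matched / total


def flag_differences(groups: Dict[str, Tuple[int, int]], reference: str) -> Dict[str, float]:
    """Return groups whose absolute match-rate difference from the reference is >= 1 point."""
    reference_rate = match_rate(*groups[reference])
    flagged = {}
    for name, (matched, total) in groups.items():
        difference = match_rate(matched, total) - reference_rate
        if name != reference and abs(difference) >= THRESHOLD:
            flagged[name] = difference
    return flagged


# Hypothetical (matched, total) counts for illustration only.
example_groups = {
    "reference": (9900, 10000),
    "group_a": (9850, 10000),
    "group_b": (9700, 10000),
}
print(flag_differences(example_groups, "reference"))
```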

We fitted a Poisson multilevel mixed-effects generalized linear model in a complete case analysis to estimate UPRN match prevalence ratios and their 99% confidence intervals after mutual adjustment for all explanatory variables described previously, with GP practice included as a random effect. We also explored between-practice variation in match rates.

All analyses were conducted using Stata/MP 15 (StataCorp LP).

When assessed against the Welsh and Tower Hamlets gold-standard datasets, the match rates were, respectively, 99.5% and 99.6%.

Table 4: Data linkage accuracy metrics

Positive predictive value (PPV): The proportion of record pairs classified by the algorithm as links that are true matches. Also known as precision.

Sensitivity: The proportion of true matches that are correctly classified as links. Also known as recall.

F-measure: The harmonic mean of positive predictive value and sensitivity. Often used to compare the overall efficiency of a method. F-measure = 2 × (PPV × sensitivity) / (PPV + sensitivity)
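As a worked example of the F-measure formula, the sketch below applies it to the sensitivity and PPV values reported in the abstract for the Welsh (0.999, 0.996) and London (0.998, 0.998) gold-standard datasets. The function name `f_measure` is ours.

```python
def f_measure(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of positive predictive value and sensitivity."""
    return 2 * (ppv * sensitivity) / (ppv + sensitivity)


# Values reported in the abstract (sensitivity, PPV) for each gold-standard dataset.
wales = f_measure(ppv=0.996, sensitivity=0.999)
london = f_measure(ppv=0.998, sensitivity=0.998)

print(f"Wales:  {wales:.4f}")
print(f"London: {london:.4f}")
```

Both values are well above the 0.8 threshold for 'very good' linkage cited in the discussion.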

Those addresses without a match were more likely in specific postcode areas, and for invalid addresses or postcodes, or address strings beginning with an alphabetic character indicating a flat rather than a house. Full details on the GUILD reporting of match and accuracy rates are provided in the Supplementary Appendix 1. 

Absolute match rate differences to the reference groups greater than 1% were found for people aged 15-29 or ≥85 years, from Chinese or Not Stated ethnic groups, with a missing IMD 2019 quintile or GP registration duration, with the longest GP registration duration quartiles, with ≥2 address changes in the previous 12 months, or who were registered with a GP practice using the SystmOne clinical record system or registered with a GP practice in Tower Hamlets. The match rate was consistently high with a minimum of 94.4%, and match rates were similar for any missing and non-missing categories, with the exception of the 0.2% of the study population with missing IMD 2019 values which had a substantially lower match rate of 23.5%. As the IMD score is assigned via the postcode, if this is missing or of poor quality, it is also likely that a UPRN cannot be assigned. The match rate in those with missing ethnicity codes (n = 265,525) was similar to that reported for those from White ethnic groups.

Full details of the UPRN match rates and absolute difference in the proportion matched relative to the reference group, for all explanatory variables and the complete study population are given in Supplementary Appendix 3.

The adjusted complete case analysis prevalence ratios and 99% confidence intervals are presented in Table 5 , which excludes 278,875 patients with missing data, the majority excluded due to missing ethnicity codes.

Based on absolute differences greater than 1% from the reference category, people aged 15-29 or 85 years and over, those of Chinese ethnic background, with ≥3 address changes in the preceding 12 months, registered at a GP practice using SystmOne, or at a GP practice in Tower Hamlets were less likely to have an address matched to a UPRN. Conversely, people registered with their GP practice for more than 6.5 years were more likely to have an address matched to a UPRN than the reference group (Figure 1) . At the practice level, GP practice UPRN match rates ranged from 84.9% to 99.96% with an average of 98.6% (data not shown). The three GP practices with UPRN match rates below 90% included one GP practice for homeless people and two using SystmOne supplier systems. There was no clear association between GP practice UPRN match rate and GP practice list size.

This is to our knowledge the first address-matching algorithm developed specifically to assign UPRNs to patient addresses recorded at registration for NHS general medical practitioner services. Using GUILD [23] specified criteria and methods we have shown that the ASSIGN algorithm achieved a greater than 99.5% match rate in two gold standard datasets drawn from diverse populations with high accuracy as indicated by the sensitivity, PPV and the F-measures. Incorrect matches were extremely low overall, with marginally higher percentages of incorrect matches for the Welsh addresses and of missed true matches in the Tower Hamlets addresses. The high value of the F-measures (0.99) for the ASSIGN algorithm exceeds the threshold of ≥0.8 specified by Ferrante and Boyd (2012) [26] for 'very good' linkage algorithms.

A similarly high match rate (98.6%) was also achieved by ASSIGN when applied to routinely entered GP registered patient addresses for an entire population of predominantly working age, ethnically diverse and socially disadvantaged adults in the complete geography of north east London. We found relatively small differences in some demographic and provider organisation characteristics among the 1.4% patients for whom a match to a UPRN was unsuccessful.

We found that UPRN matching success was less likely among patients aged 15-29 or ≥85 years, and those from Chinese ethnic backgrounds, with missing IMD, who were highly mobile (as assessed by three or more changes in address in the preceding 12 months) or were registered at a GP practice in Tower Hamlets CCG, or using the SystmOne clinical record system. Conversely, UPRN matching success was more likely for patients with missing GP registration dates, or with longer duration of GP registration.

In conclusion, we consider ASSIGN to be a transparent, robust and quality assured address-matching algorithm with a high and accurate match rate with minimal biases in those not matched when evaluated against a whole population dataset of NHS addresses registered as part of routine NHS processes.

Figure 1: Adjusted prevalence ratios and 99% CIs for number of address changes in the preceding 12 months, GP EHR system, and GP registration duration

Strengths of our study include the use of robust best-practice methods to calculate and evaluate the accuracy of the ASSIGN algorithm using two gold standard datasets from different populations in the UK reflecting very different demographics, geography, and property types. In doing so, we have addressed many of the methodological issues highlighted by Schinasi et al. (2018) [4] , by providing a detailed and transparent account of methods we used to geocode addresses, and to evaluate missing or poor quality geographic information, and have undertaken a rigorous evaluation of bias in matching success. To our knowledge, similar accounts of accuracy checks and bias have not been provided by other address-matching algorithms currently in use in the UK.

We evaluated the ASSIGN algorithm in addresses routinely recorded for more than 1.75 million adults who include all those registered for general medical services in an extensive geographic area in north east London. This diverse urban geography provided challenging address quality for developing, optimising and evaluating the algorithm. As the ASSIGN algorithm was developed on NHS patient addresses, it has a high potential for health service specific applications and is readily scalable. In addition, ASSIGN is open-source and freely available for others to use. ASSIGN has potential to be used in other address-based datasets including those with information relevant to the wider determinants of health.

The Clinical Effectiveness Group in north east London has pioneered the recording of ethnic background in general practice which is higher than that reported in other geographies or in acute care EHRs [27] . Although an ethnicity code was missing for 15% of our population, we found that the match rate for those with missing ethnicity was similar to that observed in those from White ethnic backgrounds.

While a number of alternative metrics are available to summarise the linkage performance we selected three metrics to be harmonised with GUILD [23] and others such as Office for National Statistics (ONS) [28] .

We reported the UPRN match rate based on all match qualifiers combined and not separately for the 2.2% of matches that were 'child', 'parent' or 'sibling' qualifiers which would not be exact matches to the actual patient address. The implications of this will depend on the use of the UPRN: for example, these qualifiers are fit for purpose when using UPRNs for geographical analyses but may be less appropriate for household analyses. We are currently undertaking further work to evaluate approaches to using UPRNs to represent households.

We did not evaluate address-matching success for patients who do not register with a GP practice at all or who are registered at non-residential addresses such as homeless or migrant people.

The ASSIGN algorithm achieved a very high rate of accurate matches, as evidenced by its performance against the two gold standard external datasets. The slightly higher incorrect match rate for the Welsh addresses reflects the greater challenge of addresses which contain Welsh language words and spelling.

In the context of the very high match rates achieved, the biases in match success are small but important to identify. The impact of these biases can then be considered when using UPRNs in different populations and for a range of purposes.

We considered reasons specific to the study population that could influence the five factors associated with a non-match. Quality of the recorded patient address and the address type are both important, as certain address types are more likely to vary from the address format given in AddressBase Premium, particularly addresses for flats, which are more prevalent in urban areas. For example, Tower Hamlets has a higher proportion of properties that are flats, whose addresses tend to be more poorly recorded. There was also evidence of slightly lower UPRN match success among those who are more mobile, as evidenced by address and practice changes and by duration of registration at the practice. Address-matching was slightly less successful for younger people after taking account of mobility, and the reasons for this are unclear. Of interest were the differences noted by GP EHR supplier system, and further investigation of the address format in SystmOne may be warranted. In summary, those without a successful UPRN match, while small in absolute numbers, demonstrate some demographic, geographic and organisational characteristics indicative of underlying poorer address quality. Some of these factors may be amenable to improvement at the point of address recording in general practices and warrant further exploration.

Specifically, the GP practice is key to the accurate recording of the patient address and to improving address quality in the NHS. We are considering how results from this analysis could be fed back to GP practices to improve systems for patient address recording as well as to confirm accuracy of address with patients since many aspects of direct patient care depend on accurate patient addresses.

The momentum for address-matching and assigning UPRNs to address data created by the UK government's geospatial data strategy has not been matched by greater transparency and evaluation of the methods used to assign UPRNs, as highlighted by Schinasi et al. (2018) [4]. The Secure Anonymised Information Linkage (SAIL) databank [29] in Wales has a 14-year history of data linkage of national datasets, including by address and UPRN, with NHS Wales Informatics Service as the Trusted Third Party (TTP) organisation that carries out the linkage. To date, address keys (e.g. UPRNs) have been assigned to addresses using Experian QAS [30] with the Postcode Address File [31] and ESRI LocatorHub [32] which, together with other internal methods in the Welsh Address Matching Service, does not have a transparent methodology. The methodology behind the ONS address-matching service [7] is open-source and documented, and its performance has been evaluated by match rate and by clerically checking the quality of matches compared to other commercial solutions, but there has been no evaluation of bias. The Scottish Improvement Service's Data Hub's address-matching methodology is not documented or evaluated in detail, and the available guide states that 'no thorough clerical review of automatic matches' had yet been carried out [8]. The Ordnance Survey's Match and Cleanse service [9] does not currently provide transparent documentation of the method or any quality assurance.

The ASSIGN method is innovative in its transparency of methodology, quality assurance and bias, is open-source, and is scalable. We have now implemented automatic UPRN matching for the patient addresses of 6.9 million London citizens registered with general practitioners who are included in the London Discovery Programme. We are currently exploring wider implementation of ASSIGN in different geographic areas in the UK, as well as across different organisations to support integration of data between health and local authorities including schools and social care settings and to other non-residential property types, particularly care homes.

The ASSIGN address-matching algorithm has been developed for use with NHS recorded patient addresses in an ethnically diverse urban population. It offers a transparent, accurate and quality-assured method for assigning UPRNs and advancing the use of geospatial linkage for effective health care and population health management, for supporting planning and policy for whole systems approaches, and for health data science research.

1 An incomplete address <8 characters in length; or contains no alphanumeric characters; or contains the words: unknown, no fixed abode, dummy, nfa, not found, not entitled, overseas, not known, not given, overseas, patient, visitor, unk, address, zz99, @, place of birth, none; or begins with: a special character, london, xx, or x; or does not follow full UK postcode format
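The invalid-address criteria in the footnote above can be expressed as a short predicate. The following Python sketch is illustrative only and is not taken from the ASSIGN code base: the function name `is_invalid_address`, the whole-word matching of the listed terms, and the simplified postcode regular expression are all assumptions, and the example address and postcode in the usage note are made up.

```python
import re

# Terms listed in the footnote ('@' is handled separately as a plain substring).
INVALID_TERMS = (
    "unknown", "no fixed abode", "dummy", "nfa", "not found", "not entitled",
    "overseas", "not known", "not given", "patient", "visitor", "unk",
    "address", "zz99", "place of birth", "none",
)
# Prefixes listed in the footnote ('x' also covers 'xx').
INVALID_PREFIXES = ("london", "xx", "x")
# Assumption: a simplified pattern for the full UK postcode format.
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}$")


def is_invalid_address(address: str, postcode: str) -> bool:
    """Return True if the address meets any of the footnote's criteria."""
    text = address.strip().lower()
    if len(text) < 8:                              # incomplete address
        return True
    if not any(ch.isalnum() for ch in text):       # no alphanumeric characters
        return True
    if "@" in text:
        return True
    # Assumption: listed terms are matched as whole words or phrases.
    if any(re.search(rf"\b{re.escape(term)}\b", text) for term in INVALID_TERMS):
        return True
    # Begins with a special character, or with one of the listed prefixes.
    if not text[0].isalnum() or text.startswith(INVALID_PREFIXES):
        return True
    return UK_POSTCODE.match(postcode.strip().upper()) is None
```

For instance, under these assumptions `is_invalid_address("12 Example Road, Exampletown", "AB1 2CD")` is false, while the same address with the truncated postcode `"AB1"` is flagged as invalid.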

After blocking by matching postcode area, potential matching standard addresses are assessed deterministically by applying matching judgement rules in rank order of extent of string manipulation (rank 1 = no manipulation). A decision tree determines which string comparison match tests are passed and which fail, until all branches are exhausted and the best match is found. These rules mirror human pattern recognition and are coded using, for example, Levenshtein distance, pattern matching (regular expressions), field swapping and pluralisation. A match is made with one of four overall qualifiers that describe the relationship between the candidate address and the matched standard address in relation to approximate geography, or no match is made. The four qualifiers are:

• Best match: the closest match out of all available
• Child: candidate address is a 'child' sub-property of the UPRN it has been matched to
• Parent: candidate address is the 'parent' building shell of the UPRN it has been matched to
• Sibling: candidate address is a near neighbour of the UPRN it has been matched to

Where there is a match, the algorithm returns the UPRN, the overall qualifier, the standard address, the match pattern and match rule identifier employed to get that match. The match rule is a label identifying which section of the code made the match, and the match pattern depicts how five address objects were manipulated to achieve the match. These five address objects are merged from the original eleven: flat, building, number, street, postcode. Twelve possible match terms (see Table 1) exist and can be combined in up to 50 different ways on the five address fields. These are restricted to plausible terms, for example, postcodes are never swapped with streets.
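The general shape of blocked, rank-ordered deterministic matching can be sketched in Python as follows. This is a minimal illustration rather than the ASSIGN implementation: the two rules, their identifiers, the edit-distance threshold of 2, and the function names are hypothetical, and the sketch always returns the 'Best match' qualifier instead of distinguishing child, parent and sibling matches.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Address = Dict[str, str]
FIELDS = ("flat", "building", "number", "street", "postcode")


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def postcode_area(postcode: str) -> str:
    """Leading letters of a postcode, e.g. 'AB' for 'AB1 2CD'."""
    match = re.match(r"[A-Za-z]+", postcode.strip())
    return match.group(0).upper() if match else ""


def exact(candidate: Address, standard: Address) -> bool:
    """Rank 1: all five fields equal, no string manipulation."""
    return all(candidate[f] == standard[f] for f in FIELDS)


def street_typo(candidate: Address, standard: Address) -> bool:
    """Hypothetical rank 2: street differs by a small edit distance."""
    others_equal = all(candidate[f] == standard[f] for f in FIELDS if f != "street")
    return others_equal and levenshtein(candidate["street"], standard["street"]) <= 2


# Rules listed in rank order; earlier rules involve less manipulation.
RULES: List[Tuple[str, Callable[[Address, Address], bool]]] = [
    ("rank1_exact", exact),
    ("rank2_street_typo", street_typo),
]


def best_match(candidate: Address, standards: List[Address]) -> Optional[Tuple[str, str, str]]:
    """Return (uprn, qualifier, rule identifier) or None if no rule passes."""
    area = postcode_area(candidate["postcode"])
    block = [s for s in standards if postcode_area(s["postcode"]) == area]
    for rule_id, rule in RULES:
        for standard in block:
            if rule(candidate, standard):
                return standard["uprn"], "Best match", rule_id
    return None
```

Trying every standard address in the block against the lowest-ranked rule before moving to the next rule means that a match requiring less manipulation is always preferred over one requiring more.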


An example of a match pattern is 'Pe,Se,Ne,Bp,Fe'. This means that the postcode, street, number, and flat fields were equivalent matches between the candidate and standard address, and the building field was a partial match between the candidate and standard address.

There are higher proportions of 'Other' postcodes, addresses beginning with an alphabetic character (i.e. a flat rather than a house) or a special character, and invalid addresses or postcodes among unmatched than among matched addresses. Differences between matched and unmatched addresses across all characteristics were found to be significant using chi-square tests, but this could be attributable to the large sample size. Patient and registration characteristics are compared in section 'Population characteristics' of the paper.
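The match-pattern notation described above, such as 'Pe,Se,Ne,Bp,Fe', can be decoded mechanically. The Python sketch below assumes that each comma-separated token is a field letter followed by a match-term code; only the two term codes explained in the text ('e' for equivalent, 'p' for partial) are included, so the remaining terms from Table 1 are reported as unknown. The function name is hypothetical.

```python
from typing import Dict

# Field letters inferred from the worked example in the text.
FIELD_CODES = {"P": "postcode", "S": "street", "N": "number",
               "B": "building", "F": "flat"}
# Only the two match terms explained in the text; Table 1 lists twelve.
TERM_CODES = {"e": "equivalent", "p": "partial"}


def decode_match_pattern(pattern: str) -> Dict[str, str]:
    """Map each address field in a match pattern to its match term."""
    decoded: Dict[str, str] = {}
    for token in pattern.split(","):
        token = token.strip()
        field_code, term_code = token[0], token[1:]
        field = FIELD_CODES[field_code]
        decoded[field] = TERM_CODES.get(term_code, f"unknown term '{term_code}'")
    return decoded
```

Applied to 'Pe,Se,Ne,Bp,Fe', this yields 'equivalent' for postcode, street, number and flat, and 'partial' for building, in line with the interpretation given in the text.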

Body mass index and the built and social environments in children and adolescents using electronic health records. American journal of preventive medicine

Community vital signs": incorporating geocoded social determinants into electronic records to promote patient and population health

Use of electronic health records and geographic information systems in public health surveillance of type 2 diabetes: a feasibility study. JMIR public health and surveillance

Using electronic health record data for environmental and place based population health research: a systematic review

Unlocking the power of location: The UK's Geospatial Strategy

Public Sector Geospatial Agreement (PSGA)

ONS working paper series no 17 -Using data science for the address matching service

A guide to CHI-UPRN Residential Linkage (CURL) file. Scottish Centre for Administrative Data Research and Public Health Scotland

EDRIS/_docs/CURL-Report

Geocoding error, spatial uncertainty, and implications for exposure assessment and environmental epidemiology. International journal of environmental research and public health

Association between living with children and outcomes from covid-19: OpenSAFELY cohort study of 12 million adults in England. bmj

Covid-19: breaking the chain of household transmission. bmj

Household factors and the risk of severe COVID-like illness early in the US pandemic. medRxiv

Tableaux tables

ASSIGN open-source code

ASSIGN description

GUILD: GUidance for information about linking data sets

British Standard BS7666

NHS data model and dictionary -ethnic category

A transparent and transportable methodology for evaluating Data Linkage software

Research into practice: understanding ethnic differences in healthcare usage and outcomes in general practice

Developing standard tools for data linkage

What is the PAF

GUILD: GUidance for Information about Linking Data sets

This work was supported by a UKRI Rutherford Postdoctoral fellowship (GH), and by funding from Endeavour Health Charity, Barts Charity (Grant/Award Number: MGU0419), and Health Data Research UK, an initiative funded by UK Research and Innovation, Department of Health and Social Care (England) and the devolved administrations, and leading medical research charities. This work used data provided by patients in east London and recorded by the NHS general practitioners who shared deidentified data for research purposes via the Discovery Data Service which was curated with the support of the Queen Mary University Clinical Effectiveness Group and the north east London Discovery Programme.

None declared.

Ethics approval was not required or obtained. Approval for access to the person identifiable data (patient addresses) used in this study was provided by the north east London Discovery Data Service data controllers to the Clinical Effectiveness Group as appointed data sub-processors for the sole purpose of developing and evaluating the ASSIGN algorithm for direct patient care. This access was limited to approved individuals with appropriate information governance training working in a secure trusted data environment. Only aggregated patient data are reported in this study.

• into eleven standard address object fields: flat, building, number, dependent thoroughfare, street, dependent locality, locality, town, postcode, organisation, vertical
• a second version of the eleven standard address object fields is created by correcting spelling errors, de-pluralisation, replacing or removing punctuation and lower casing, and removing extraneous words that are unnecessary in the match process, for example, the range of words that are equivalent to the word 'flat' such as 'apartment' or 'maisonette'
• positional checking is carried out, e.g. the abbreviation 'st' would be mapped to "street" as a spelling correction, but not if it was presented as the first word in a field; "St David's", for example, would be retained as "St David".

See https://github.com/endeavourhealthdiscovery/uprn-match/tree/master/UPRN/yottadb for address preformatting routines.

The addresses are reformatted:
• into eleven standard address object fields: flat, building, number, dependent thoroughfare, street, dependent locality, locality, town, postcode, organisation, vertical
• the eleven standard address object fields are indexed with single and compound indexes to improve search performance time
• the eleven standard address object fields are indexed with performance improving indexes based on semantic equivalence or semantic performance including correcting spelling errors, de-pluralisation, replacing or removing punctuation and lower casing, and removing extraneous words that are unnecessary in the match process, for example, the range of words that are equivalent to the word 'flat' such as 'apartment' or 'maisonette'

Linkability: replaced with artificial identifiers to reduce disclosure before linkage
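Some of the preformatting steps listed above (lower casing, punctuation handling, removal of extraneous words, and the positional check on 'st') can be sketched for a single address field in Python. This is an illustration under stated assumptions and is not the routine in the linked repository: the function name and word list are hypothetical, possessive removal stands in for de-pluralisation, and reading "removing extraneous words" as dropping 'flat' and its equivalents is one interpretation of the text.

```python
import re

# Assumption: 'flat' and the equivalents named in the text are dropped.
EXTRANEOUS_WORDS = {"flat", "apartment", "maisonette"}


def normalise_field(text: str) -> str:
    """Produce a simplified comparison version of one address field."""
    cleaned = text.lower()
    # Drop possessive endings, so "St David's" becomes "st david".
    cleaned = re.sub(r"['’]s\b", "", cleaned)
    # Replace remaining punctuation with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    words = []
    for position, word in enumerate(cleaned.split()):
        if word in EXTRANEOUS_WORDS:
            continue
        # Positional check: 'st' is expanded to 'street' unless it is the
        # first word of the field, where it is more likely to mean 'Saint'.
        if word == "st" and position > 0:
            word = "street"
        words.append(word)
    return " ".join(words)
```

Under these assumptions "High St." becomes "high street", whereas "St David's" is retained as "st david" because the abbreviation is the first word in the field.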