Exploring the Geocoding Process:
A Municipal Case Study using Crime Data

Greta Nicoara
Summer 2005
The University of Texas at Dallas

Abstract

Addresses are the fundamental means by which people conceptualize location in the modern world. In a Geographic Information System, addresses are converted to features on a map via the geocoding process. Lack of standardization in addressing, differing ways of defining location, ownership and property definitions, the complexities of multi-source location and address data maintenance, differing address table schemas and models, the variety of algorithms used, and potential user error at every step along the way are all factors that make geocoding an error-prone process. This poses a dilemma given the prevailing need for accurate geolocating tools in our increasingly spatially aware world, where the spatial representation of tabular data is becoming standard practice. This project reviews the fundamental algorithms behind the geocoding process, such as probabilistic record linkage, standardization algorithms, Soundex, and the general geocoding process within the ESRI application framework. As a case study, crime data is analyzed and processed for the City of Richardson, Texas, solutions for the various geocoding errors are presented, and geocoding base file modifications are made to accommodate site-specific needs. Programming modifications are utilized to further enhance matching capabilities. Additionally, a geocoding workflow model is proposed for municipal applications.

Introduction

Geocoding, which is the transformation of location based tabular data into spatial data, is widely used in various fields and applications. Some of the more prominent disciplines that rely on accurate geocoding for spatial analysis include crime mapping at law enforcement institutions, health database geocoding and linking at health institutions, commercial customer database geocoding, environmental applications using point source data, and a multitude of municipal government applications. The FGDC estimates that geographic location is a key feature in 80%-90% of governmental data collections [5], hence the need and value of a spatial representation for tabular data for more effective decision-making. The original development of the fundamental processes behind geocoding is related to efforts at the US Census Bureau from the 1950's onward; efforts which have since continued to grow as an important area of research at the Census Bureau, particularly with the advent of automated GIS [21].

Much literature has been written on the topic of geocoding and the underlying algorithms that make it function. Questions such as "What is an acceptable match rate?" [24] and "How do different algorithms affect the geocoding result?" [17] are readily encountered in the literature. While discussion can vary on the topic based on the particular entity and its needs, the primary concern generally relates to the accuracy of a geocoding technique. For example, national datasets and local datasets face different challenges. While large national datasets must contend with a diversity of address formats [29], requiring more complex rules that define how an address is broken down for geocoding, local datasets necessitate better accuracy since the geocoded data is often analyzed in relatively small geographic units (ex. number of rentals within a homeowner's association, homeowners that reported a missed recycling pick-up, etc.). With the increasing power of GIS systems and the need for point-based spatial analysis, there has been an increase in technology firms that sell custom geocoding tools and reference data (TelContar, NavTech, Tele Atlas, GDT, etc.) to enable individuals, companies, or governmental institutions to perform geocoding tasks. Additionally, web-based address look-up engines such as MapQuest, Google Maps, and Yahoo Maps have become mainstream among the general public. Certainly, a large market exists for viewing and analyzing spatial location data, and this need is only increasing as GIS continues to become more integrated into the operational workflow of many organizations, from governmental to private industry.

While simple in concept, geocoding as a process is not as simple as just putting a dot on a map. Techniques involved in geocoding borrow from various disciplines such as information theory, decision theory, probability theory, and phonetics, as will be discussed in the sections below. Furthermore, geocoding is not an error-free process; many variables must be taken into account, such as reference data availability and quality, address data quality, and the required accuracy of the geocoded location relative to the true location. The geocoding algorithms used by a particular software vendor or developer affect all of these parameters. A clear understanding of the fundamental processes involved in geocoding is essential in order to better troubleshoot such a broadly based application.

As a case study in troubleshooting the geocoding process, the geocoding match rate of crime data at the City of Richardson, Texas, was analyzed in order to identify the major sources of error and to identify potential solutions to those errors. As will be discussed in more detail in the Analysis & Results section, three major sources of geocoding error were identified:

  1. Data input errors
  2. Reference data errors
  3. Errors related to the underlying geocoding process

These errors were broken down into more precise categories for analysis. Programming and non-programming solutions were sought and the possibilities of further customization (within ESRI's application framework) were investigated.

Project Objectives


Theoretical Framework

Techniques involved in geocoding borrow from various academic fields, most notably: information theory, decision theory, probability theory, and phonetics. As such, it is important to first review these techniques in a conceptual framework in order to better grasp the nature of the problems being investigated. What follows is a brief description of the theoretical foundation of the 5 fundamental concepts considered integral to geocoding in the context of this project:

  1. Geocoding as an all-encompassing term that describes the process of converting tabular data to spatial point data.
  2. Probabilistic record linkage, which allows input data containing human error and other variances to be converted.
  3. Standardization algorithms which define how an address component is broken down for the matching process.
  4. The Soundex function which allows for further flexibility in human spelling errors.
  5. Address data modeling concepts which are essential to efficient workflows in an enterprise system such as our study area, the City of Richardson.

  1. Geocoding Theory

    1. Definitions:
      1. Geocoding: the process of assigning an x,y coordinate value to the description of a place by comparing the descriptive location-specific elements to those present in the reference data. [8]
      2. Geocoding process: The steps involved in translating an address entry, searching for the address in the reference data, and delivering the best candidate or candidates. These steps include parsing the address, standardizing abbreviated values, assigning each address element to a category known as a match key, indexing the needed categories, searching the reference data, assigning a score to each potential candidate, filtering the list of candidates based on the minimum match score, and delivering the best match. [8]
      3. Address locator: One of the entities in the geocoding framework that acts to combine the style-specific guidelines and reference data. The address locator defines the technique to be used by the geocoding engine in interpreting the address against the rule base files. The file containing information about an address locator is distinguished with a LOC file extension. A locator is referred to as a Geocoding Service in 8.x.[8]
      4. Rule base: A collection of files that directs the geocoding engine in how to standardize address data and match it to the related location in the reference data. Each geocoding style uses specific files in the rule base. [8]
    2. ESRI Geocoding Process (in a nutshell):
      1. Input the original address (or table of addresses)
      2. Address is parsed
      3. Abbreviations are standardized
      4. Elements are assigned to match keys in the rule base files
      5. Index values are calculated
      6. Score of each potential match is established
      7. List of candidates is filtered
      8. Best candidate is matched
      9. Unmatched candidates are presented for review
      10. If edits are made to the reference data, geocoding indexes must be updated
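
The steps above can be illustrated with a minimal Python sketch. It is not ESRI's implementation: the lookup tables, the reference records, the score weights, and the function names (`parse_and_standardize`, `score`, `geocode`) are all hypothetical, chosen only to show how parsing, standardizing, scoring, and filtering fit together. Address parity and Soundex indexing are left out for brevity.

```python
import re

# Hypothetical lookup tables standing in for the rule base (not ESRI's files).
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
              "N": "N", "S": "S", "E": "E", "W": "W"}
STREET_TYPES = {"STREET": "ST", "ST": "ST", "ROAD": "RD", "RD": "RD",
                "DRIVE": "DR", "DR": "DR"}

# Hypothetical reference data: one record per street segment.
REFERENCE = [
    {"predir": "E", "name": "MAIN", "type": "ST", "low": 100, "high": 198},
    {"predir": "W", "name": "MAIN", "type": "ST", "low": 100, "high": 198},
    {"predir": "", "name": "NORTH STAR", "type": "RD", "low": 100, "high": 198},
]


def parse_and_standardize(address):
    """Split an address into match keys and standardize abbreviations."""
    words = re.findall(r"[A-Za-z0-9]+", address.upper())
    keys = {"number": None, "predir": "", "name": "", "type": ""}
    if words and words[0].isdigit():
        keys["number"] = int(words.pop(0))
    if len(words) > 1 and words[0] in DIRECTIONS:
        keys["predir"] = DIRECTIONS[words.pop(0)]
    if len(words) > 1 and words[-1] in STREET_TYPES:
        keys["type"] = STREET_TYPES[words.pop()]
    keys["name"] = " ".join(words)
    return keys


def score(keys, ref):
    """Score a candidate; the weights are illustrative and sum to 100."""
    total = 0
    if keys["name"] == ref["name"]:
        total += 50
    if keys["number"] is not None and ref["low"] <= keys["number"] <= ref["high"]:
        total += 25
    if keys["predir"] == ref["predir"]:
        total += 15
    if keys["type"] == ref["type"]:
        total += 10
    return total


def geocode(address, min_score=60):
    """Return (score, reference record) for the best candidate, or None."""
    keys = parse_and_standardize(address)
    # Only records sharing the street name are scored at all.
    candidates = [(score(keys, ref), ref) for ref in REFERENCE
                  if ref["name"] == keys["name"]]
    candidates = [c for c in candidates if c[0] >= min_score]
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[0] if candidates else None
```

In this sketch, `geocode("123 East Main Street")` selects the E MAIN ST segment, while `geocode("100 North Star Rd")` returns `None` because the parser treats "North" as a directional prefix and then looks for a street named "STAR".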

    3. Geocoding Applications (City of Richardson)
      1. Crime analysis application: crime data is geocoded nightly from Police Record Management System (RMS) using an ESRI-based custom crime application. The resulting data is used by the Crime Analyst for spatial analysis of crime patterns and to generate statistical reports based on geographic areas. The crime datasets cover the full gamut of potential geocoding issues, errors, and considerations, which is why this particular dataset was used as the sample data on which to test the geocoding process.
      2. E-911 applications: require geocoding of all incoming 911 calls to assist dispatchers.
      3. Web applications: must have ability to locate any addresses on user input in the Find Address dialog, must be able to use dynamic geocoding capabilities for retrieving the most up-to-date information in the geodatabase.
      4. General geocoding needs: there is a constant need to geocode tables of addresses for spatial analysis for many departments within the City (ex. we regularly geocode rental properties, building permits, neighborhood integrity violations, missed recycling locations, active work order locations, etc.) Variations in input data need to be accommodated by the geocoding engine.

    4. Common Geocoding Errors
      1. Address component standardization issues (ex. a directional that is part of the street name, as in North Star Rd; names containing Saint or Mount; ordinals such as 1st; missing street suffix/type)
      2. Address types that cannot be matched such as addresses with unit numbers (commercial suites, apartments, multi-story buildings, etc.)
      3. Locations without valid street address (ex. shopping centers that use shopping center name as address)
      4. Ambiguous addresses: those missing components of a complete address, such as directional prefix, street type, incorrect address range, etc.
      5. Data entry errors
      6. New addresses: address database must be kept current
      7. Vanity addresses: those that cannot be located on a street segment
      8. Addresses with mixed parity (do not follow convention of odd and even numbers on opposite sides)
      9. Non-address input such as PO Boxes, or miscellaneous information within an address (ex. 1300 blk E Arapaho Rd)
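
Several of the error types listed above can be flagged before geocoding is attempted. The following Python sketch is one possible pre-processing step, not part of the original project: the regular expressions, the flag names, and the `preprocess` function are assumptions made for illustration, and the sample addresses in the usage note are made up.

```python
import re

# Hypothetical patterns; a production list would be site-specific.
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*BOX\b", re.IGNORECASE)
BLOCK = re.compile(r"\b(?:BLK|BLOCK)\b", re.IGNORECASE)
UNIT = re.compile(r"\s+(?:(?:APT|STE|SUITE|UNIT)\.?\s+|#\s*)\w+\s*$",
                  re.IGNORECASE)


def preprocess(raw):
    """Return (cleaned_address, flags) for one raw address string."""
    flags = []
    text = " ".join(raw.split())          # collapse repeated whitespace
    if PO_BOX.search(text):
        flags.append("po_box")            # cannot be placed on a street
        return text, flags
    if BLOCK.search(text):
        flags.append("block_level")       # e.g. "1300 blk E Arapaho Rd"
        text = " ".join(BLOCK.sub(" ", text).split())
    unit = UNIT.search(text)
    if unit:
        flags.append("unit")              # strip trailing suite/apartment
        text = text[:unit.start()]
    if not re.match(r"\d+\s", text):
        flags.append("no_house_number")   # e.g. a shopping center name
    return text, flags
```

For example, `preprocess("1300 blk E Arapaho Rd")` returns the cleaned string `1300 E Arapaho Rd` with the flag `block_level`, so the record can still be matched while its reduced precision is recorded.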

    5. Decision Process for Geocoding Error Resolution
      1. Initial concern: Why are address records not matching?
      2. What are the most common causes of error?
      3. What is the most logical approach to resolve individual errors?
      4. The intricate nature of the process requires that the problem be broken down into its logical components.
      5. First consideration: Edit the reference data (are address ranges incorrect?, are address points missing?, etc.)
      6. Make simple default setting changes (in the geocoding style, alternate reference data could be used)
      7. Edit/post-process raw address data before geocoding
      8. Require additional reference files?
      9. Modify locator template?
      10. Modify the rule base?
      11. Combination of solutions will most likely be needed
      12. How can geocoding flexibility be increased in the overall day-to-day process?

    [Figure: Decision Process]

  2. Probabilistic Record Linkage

    1. ESRI's geocoding applications are built on top of MatchWare Technology's Probabilistic Record Linkage system. Probabilistic record linkage systems use a statistically valid form of fuzzy logic to score how well records do or do not match. Within the ESRI framework, the rule base upon which this system functions is customizable. [27] The following sections describe the primary concepts involved in a traditional probabilistic record linkage system. Please refer to the Tools & Resources section for more detailed information on how this type of system works within the ESRI rule base and geocoding framework.
    2. Definition:
      1. Probabilistic record linkage involves matching two data files under conditions of uncertainty.
      2. The objective is to identify and link records which represent a common entity whether the entity is an individual, a family, an event, a business, an institution, or an address.
      3. The concept is in contrast to Deterministic Record Linkage, which assumes error-free identifying fields and links records that exactly match on these identifying fields (for example, joining database records on matching primary and foreign keys). When no error-free identifier is shared by all of the data sources, a probabilistic record linkage technique can be used to potentially join or merge the data sources. [14]
      4. Each field participating in the linkage comparison is subject to error which is measured by the probability that the field agrees given a record pair matches versus the probability of chance agreement of its values. This makes the process probabilistic due to the lack of definitive or consistent primary and foreign keys across databases.
      5. According to Matthew Jaro, a MatchWare developer: "Probabilistic record linkage methodology is imperative if computers are to consistently and effectively replicate the evaluation and judgment process of human clerks attempting to link common records. The ideal goal is to have the computer emulate the intuitive thought process of a human being as they might review, judge, evaluate, measure, and score linkage qualifications of records representing commonality." [16]
      6. Synonymous terms: record linkage (used mostly by epidemiologists and statisticians), entity heterogeneity, entity identification, object isomerism, instance identification, merge/purge, entity reconciliation, list washing, data cleaning, etc., (others used mostly by computer scientists). [14]
    3. History:
      1. The essential ideas of modern Record Linkage originated with geneticist Howard Newcombe who introduced odds ratios of frequencies and the decision rules for delineating matches and non-matches. [31]
      2. The term itself originated in the public health area when files of individual patients were brought together using name, date-of-birth and other information. [31]
      3. I. P. Fellegi and A. B. Sunter provided the formal mathematical foundations of Record Linkage in 1969. Their theory is commonly referred to in the literature as the Fellegi-Sunter Model. [31]
      4. The first practical implementation of probabilistic record linkage in the United States was originally designed by Matthew Jaro on behalf of the U.S. Census Bureau in 1985.
      5. Record linkage is important at the US Census for census undercount determination, address list compilation and general census evaluation. [16]
    4. Basic Algorithm [31]:
      1. Premise: Two files A and B are to be matched
      2. Goal: To classify pairs in a product space AxB from two files A and B into M, the set of true matches and U, the set of true non-matches
      3. Ratio of Probability: R = P(γ ∈ Γ | M) / P(γ ∈ Γ | U), where γ is an arbitrary agreement pattern in comparison space Γ
      4. If R > UPPER, then designate the pair as a match (where UPPER is a predefined upper threshold on R)
      5. If LOWER<=R<=UPPER, then designate pair as a possible match and hold for clerical review
      6. If R < LOWER, then designate pair as non-match
      7. The basic research problems in calibrating this ratio of probabilities have been open since the work of Newcombe (1959) and Fellegi & Sunter (1969). The major difficulties in all situations have been determining how identifying information can be used and what the relative value of a field is in matching in comparison with other fields.
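
The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not the MatchWare or ESRI implementation: the per-field probabilities in `M_PROB` and `U_PROB`, the field names, and the thresholds used in the usage note are invented for the example, and the fields are assumed to be statistically independent so that R is a simple product.

```python
# Hypothetical per-field probabilities (assumptions for illustration only):
# M_PROB[f] = P(field f agrees | pair is a true match)
# U_PROB[f] = P(field f agrees | pair is a true non-match)
M_PROB = {"street_name": 0.95, "house_number": 0.90, "street_type": 0.85}
U_PROB = {"street_name": 0.01, "house_number": 0.05, "street_type": 0.40}


def likelihood_ratio(agreement):
    """Compute R for an agreement pattern such as {"street_name": True, ...}.

    Assumes the fields are independent, so R is the product of the
    per-field ratios.
    """
    r = 1.0
    for field, agrees in agreement.items():
        m, u = M_PROB[field], U_PROB[field]
        r *= (m / u) if agrees else ((1 - m) / (1 - u))
    return r


def classify(r, lower, upper):
    """Apply the three-way decision rule to the ratio R."""
    if r > upper:
        return "match"
    if r < lower:
        return "non-match"
    return "possible match"  # held for clerical review
```

With `lower=1` and `upper=100`, a pair that agrees on all three fields is classified as a match, a pair that agrees on none is a non-match, and a pair that agrees only on street name falls between the thresholds and is held for review.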
    5. Information Flow for a Probabilistic Record Linkage System [14]:
      1. Data is the starting point: datasets from different sources (databases) that need to be linked
      2. Standardization:
        • Also called data cleaning or attribute-level reconciliation.
        • Without standardization many true matches could be wrongly designated as non-matches because the common identifying attributes do not have sufficient similarity.
        • Replace spelling variations of commonly occurring words with standard spelling.
        • Standardize to same system of units or to same coding system.
        • Perform integrity checks on attribute values.
        • Standardization methods need to be specific to the population of an application and the data capture process, ideal for local level data, more difficult for more diverse datasets that require varying standardization rules.
        • In the ESRI geocoding rule base, the PAT file is primarily responsible for address standardization.
      3. Searching/Blocking:
        • Blocking is used to reduce the number of comparisons of record pairs by bringing linkable pairs together (this increases the efficiency of the system)
        • A good attribute variable for blocking should contain a large number of attribute values that are fairly uniformly distributed. Such an attribute must have a low probability of reporting errors (ex. street name - this is also where Soundex is used since errors in the attributes used for blocking can cause a mismatch).
        • The ideal blocking component would be one which nearly always agrees in 'true match' record pairs but nearly always disagrees between pairs which are not valid matches. [16]
        • Street names are used as the blocking mechanism in ESRI geocoding tools. This is why if the street name is not found, no possible matches are suggested (ex. "North Star Rd"- the street name is actually "North Star"- standardization error makes "North" the directional prefix, geocoding engine does not find "Star Rd")
      4. Selection of attributes for Matching/Comparison:
        • Common attributes should be selected for use in comparison function.
        • Generally predefined in addresses (PreDir, StName, StType…); however, there are significant regional variations due to the many atypical addresses in many parts of the country and the world.
      5. Weights (Comparison):
        • A comparison vector is used for each pair based on assigned weights.
        • The discriminating power of a component (or address element) is a measure of how useful that component is in predicting a match. [16]
        • If the components are assumed to be statistically independent, then the composite weight is equal to the sum of the individual component weights.[16]
        • For each record pair, a decision is made whether to classify the pair as a match (M), unmatch (U) or unknown (T) which must be followed-up interactively by the user.
      6. Decision Models/Record pair matching:
        • Once matching weights of individual attributes or two records are calculated, the next step is to combine them to form a composite weight or score and then decide whether a record pair should be a match, non-match or possible match.
      7. Performance Measurement: the quality of record linkage can be measured in the following ways:
        • The number of record pairs linked correctly (true positives)
        • The number of record pairs linked incorrectly (false positives, Type I error)
        • The number of record pairs unlinked correctly (true negatives)
        • The number of record pairs unlinked incorrectly (false negatives, Type II error)
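
Two of the steps above, blocking and performance measurement, lend themselves to a short Python sketch. The record layout (an id paired with a street name), the `block_pairs` and `linkage_quality` function names, and the sample data in the usage note are hypothetical.

```python
from collections import defaultdict


def block_pairs(file_a, file_b, key):
    """Yield only the record pairs that share the same blocking key.

    Records from file_b are grouped by key once, so each record in file_a
    is compared against its own block instead of against all of file_b.
    """
    blocks = defaultdict(list)
    for rec_b in file_b:
        blocks[key(rec_b)].append(rec_b)
    for rec_a in file_a:
        for rec_b in blocks.get(key(rec_a), []):
            yield rec_a, rec_b


def linkage_quality(linked, true_matches, all_pairs):
    """Count the four outcomes listed above from sets of record-id pairs."""
    return {
        "true_positives": len(linked & true_matches),
        "false_positives": len(linked - true_matches),
        "true_negatives": len(all_pairs - linked - true_matches),
        "false_negatives": len(true_matches - linked),
    }
```

As a usage example, blocking two input records against three reference records on street name might leave 2 candidate pairs to score instead of all 6, at the cost of missing any true match whose blocking value was recorded with an error.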
     
  3. Standardization

    1. Standardization is a part of the Probabilistic Record Linkage process, but it is such a specialized component in GIS that it deserves individual attention.
    2. The most common approach for name and address standardization is the manual specification of parsing and transformation rules. [7]
    3. Input string is separated into individual words. Each word is then mapped to a token of a particular class. [7] :

                Input address: 123 East Main Street
                Standardized: 123 E MAIN ST
                Tokenized (by PAT): ^ | D | ? | T

    4. The choice of class is determined by the presence of that word in user-supplied, class specific lexicons (look-up tables, such as CLS files), or by the type of characters found in the word such as all numeric, alphanumeric or alphabetical. [7]
    5. The process described above is a deterministic approach (as used in ESRI's geocoding). [7] presents a probabilistic method for standardization using Hidden Markov Models (HMMs). An HMM is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters, based on this assumption. The extracted model parameters can then be used to perform further analysis, such as for pattern recognition applications. HMMs are used extensively in the fields of speech recognition and natural language processing to help solve problems such as word-sense disambiguation and part-of-speech tagging. More recently, HMMs have been applied to the problem of extracting structured information from unstructured text. This probabilistic approach to standardization uses a sort of inverted logic in which an address is composed of hidden states that do not exist until they are actualized into an address that is not always in a standard format. The algorithm "hypothesizes" what format that address might have been in to produce the observed results. The algorithm is similar to a graph/network model through which it must find the most likely path (that is, what path through this HMM must one take to produce a certain address, the input string?).
    6. Within the ESRI framework, standardization is the process of address parsing that prepares the address to be matched against the reference data by breaking down the address into elements. Those elements are assigned to match keys and the abbreviated address elements are converted into standardized values. The process involves standardization commands (STN), a match key dictionary (DCT), a classification table for providing standard address abbreviations (CLS), and patterns and actions for standardizing an address (PAT). Each of these files will be discussed in more detail in the Tools & Resources section. [27]
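
The deterministic, lexicon-driven tokenization described in this section can be sketched as follows. The token symbols (^ for a number, D for a directional, T for a street type, ? for an unclassified word) are taken from the example above; the two small dictionaries and the `standardize` function are hypothetical stand-ins for the CLS and PAT files, whose actual syntax is more elaborate.

```python
# Hypothetical class-specific lexicons (stand-ins for CLS look-up tables).
DIRECTIONALS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
                "N": "N", "S": "S", "E": "E", "W": "W"}
STREET_TYPES = {"STREET": "ST", "ST": "ST", "ROAD": "RD", "RD": "RD",
                "AVENUE": "AVE", "AVE": "AVE"}


def standardize(address):
    """Return (standardized address, token pattern) for an input string."""
    std_words, tokens = [], []
    for word in address.upper().split():
        if word.isdigit():
            std_words.append(word)
            tokens.append("^")                 # numeric word
        elif word in DIRECTIONALS:
            std_words.append(DIRECTIONALS[word])
            tokens.append("D")                 # directional
        elif word in STREET_TYPES:
            std_words.append(STREET_TYPES[word])
            tokens.append("T")                 # street type
        else:
            std_words.append(word)
            tokens.append("?")                 # unclassified word
    return " ".join(std_words), " | ".join(tokens)
```

Because each word is classified in isolation, `standardize("North Star Rd")` yields the pattern `D | ? | T`, which shows how a street actually named "North Star" ends up with "North" treated as a directional.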
     
  4. Soundex

    1. Definition/History:
      1. Soundex is a way of indexing information based on how the word sounds rather than how it is spelled. It is a phonetic indexing system, blocking together many of the common types of spelling errors and abbreviations.
      2. Soundex was first developed and patented by Margaret O'Dell and Robert C. Russell in 1918. Most versions of Soundex convert a surname into a code consisting of the first (leftmost) letter of the surname, followed by 3 or more digits. [22]
      3. The method is based on the six phonetic classifications of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal), which in turn are based on where you put your lips and tongue to make the sounds.
      4. The key concept behind Soundex relies on the assumption that a constant relationship between letters and sounds should assure that similar-sounding names are assigned the same code.
      5. Soundex also functions as a compression scheme as the code contains one half to two thirds the information contents of the full name. [32]
      6. Source code is available in various programming languages for implementing the Soundex function in an application.[25]
      7. Within the ESRI geocoding process, a Soundex index is applied for the street name component of the tokenized address which is stored in the geocoding index table.

    2. Standard Soundex Algorithm: while there are many variations of the Soundex algorithm that accommodate region-specific sound values and other international variations, the steps and the letter-code table below reflect the algorithm in its most basic form. Essentially, a Soundex search algorithm takes a word, such as a street name, as input and produces a character string which identifies a set of words that are (roughly) phonetically alike. For address matching, street name is an important blocking variable. Therefore, to maximize the chance that similarly spelled street names reside in the same index block, the Soundex system is used as a blocking variable. The basic algorithm is as follows:
      1. Retain the first letter of the string and drop all occurrences of a, e, h, i, o, u, w and y in other positions.
      2. Change all occurrences of the letters in the table below to the corresponding number.
      3. Ignore the spaces if the string contains multiple words.
      4. If 2 or more letters with the same code were adjacent in the original string, omit all but the first.
      5. Pad the string that resulted from the last step with trailing zeros and return only the first four positions, which will be of the form <uppercase letter><digit><digit><digit>
      6. For any character that can't be mapped, ESRI's geocoding products either retain the character or skip to the next character for generating the Soundex code.

      For example, the string "Main" (and "Maine" or "Mane") would produce the following Soundex code: M500. As can be seen in this example, this system can have both advantages and disadvantages. If "Maine" or "Mane" are misspellings of "Main", then the system has successfully caught spelling variations and errors and identified the correct street. However, if there exist both a "Main" and a "Maine" street in the same reference database (and, if they have similar block number ranges), they would both be given the same index code which would give each candidate an equal score in the matching process and cause potential false matches.

      Letter                    Code
      B, F, P, V                1
      C, G, J, K, Q, S, X, Z    2
      D, T                      3
      L                         4
      M, N                      5
      R                         6
      H, Y, W                   Omitted
      A, E, I, O, U             Omitted
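
The basic algorithm can be written compactly in Python. This is a sketch of the simplified steps listed above, assuming the conventional code 6 for R; it is not the code used inside ESRI's products. In particular, any omitted letter is treated as breaking adjacency (following step 4 literally), whereas some Soundex variants handle H and W differently, and unmappable characters are simply skipped.

```python
# Letter-to-digit table from the basic algorithm (R is coded as 6).
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a four-character code: first letter plus three digits."""
    # Keep letters only, so spaces in multi-word names are ignored.
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = CODES.get(letters[0])       # code of the previous letter, if any
    for ch in letters[1:]:
        digit = CODES.get(ch)          # None for vowels, H, W, Y
        if digit is not None and digit != prev:
            code += digit              # skip repeats of an adjacent code
        prev = digit
    return (code + "000")[:4]          # pad with zeros, keep four positions
```

For example, `soundex("Main")`, `soundex("Maine")` and `soundex("Mane")` all return `M500`, matching the example given earlier in this section.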

      Limitations: [22]

      1. Dependence on initial letter
      2. Noise intolerance (mis-typing)
      3. Differing transcription systems
      4. Names containing particles
      5. Perceptual differences
      6. Silent consonants
      7. Name syntax variation
      8. Initials

    3. How helpful is Soundex in alleviating user input errors during geocoding? As will be demonstrated in later sections, approximately 1% of match error in the sample crime dataset is still due to misspellings despite the Soundex algorithm. The most common misspelling errors are caused by either the transposition of two letters (referred to as noise above) or due to concatenating street name components that should be separate (ex. "Beltline Rd" vs. "Belt Line Rd"). In these cases, Soundex has no effect. However, for a relatively small municipality like the City of Richardson with 1065 unique street names, misspellings are a minor error factor in the overall scope of the geocoding problem, since all of the address data references these 1065 streets, and most street names are well known by city staff, and most often properly spelled.
     
  5. Address Data Modeling

    1. Definition:
      1. One of the main challenges to geocoding is having good reference data. This includes a set of geographic features to match against, as well as robust address characteristics (attribute data) that enable matching address records to feature locations in GIS. This necessitates a sturdy address model to organize the reference data components in a logical, maintainable and site-specific way. Two kinds of address data models are presented:
        • DIME model: (from the Census DIME data format) all address data is stored in a single flat file (ex. with LeftTo, LeftFrom, RightTo, RightFrom ranges for the street centerline)
        • Nickel model: each street segment is associated with 2 separate range records, one for each side of the street (called nickel since it has half as many columns per record as the DIME). This model facilitates storing different names and address ranges for different sides of the street and enables easier validation of cases where there is no address range for one side of the street. It also better supports cases where streets have multiple address ranges and names. The ESRI Address Model follows this structure. [1]
      2. Address models are most practical in organizations where a master address repository is in place. In most small municipalities, however, address data is not uniform and can be maintained by various departments, each with or without its own standards.
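
The difference between the two models can be made concrete with a small Python sketch. The class and field names below (`DimeRecord`, `NickelRange`, `dime_to_nickel`) are hypothetical and are not taken from the Census DIME format or the ESRI Address Model; they only illustrate one flat record per segment versus one range record per side.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DimeRecord:
    """DIME style: one flat record per segment, both sides in one row."""
    segment_id: int
    street_name: str
    left_from: Optional[int]
    left_to: Optional[int]
    right_from: Optional[int]
    right_to: Optional[int]


@dataclass
class NickelRange:
    """Nickel style: one range record per side of a segment."""
    segment_id: int
    side: str            # "L" or "R"
    street_name: str     # may differ between the two sides
    from_addr: int
    to_addr: int


def dime_to_nickel(rec: DimeRecord) -> List[NickelRange]:
    """Split a DIME record into per-side ranges, skipping empty sides."""
    ranges = []
    if rec.left_from is not None and rec.left_to is not None:
        ranges.append(NickelRange(rec.segment_id, "L", rec.street_name,
                                  rec.left_from, rec.left_to))
    if rec.right_from is not None and rec.right_to is not None:
        ranges.append(NickelRange(rec.segment_id, "R", rec.street_name,
                                  rec.right_from, rec.right_to))
    return ranges
```

A segment with addresses on only one side produces a single `NickelRange`, so a missing range is represented by the absence of a record rather than by empty columns.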

    2. Challenges to building good reference data: [1]
      1. Addresses can be associated with many kinds of feature classes in a reference database (ex. roads, parcels, address points, building structures, etc.)
      2. The complexities of address component relationships dictate that some address elements are organized in separate, related tables in the DBMS.
      3. Addresses and features in the GIS can share complex relationships (many-to-many relationships may exist).
      4. A feature might have sub addresses (ex. a parcel may house a duplex with two separate addresses).
      5. Sets of address components can vary by locale and culture.
      6. Address locators must be built for geocoding addresses against the reference database (which may or may not follow a certain model).

    3. ESRI Address Model (for the City of Calgary):
    4. The ESRI Address Model for the City of Calgary is an address database schema intended to model most addressing needs and relationships. As such, it is more elaborate than necessary for most small municipalities and needs to be "localized". An additional disadvantage of this model is that it implies the existence of a master address database, which is rarely the case in municipalities. In the City of Richardson, for example, not only is address data maintained by multiple departments, but in some cases different departments even use different naming conventions for the same stretch of road (for example, North East Frontage Rd, N Central Expy, etc.). Thus, the goal of a unified Master Address Repository is still just that, a goal. However, studying the address model as presented by ESRI can give one a better idea of how various address cases and relationships might be modeled and how that model might be modified to fit local needs, given that geocoding functionality is integrated into the model.

    5. Data Models & Geocoding

Literature Review

The following section presents a brief literature review on the topic of geocoding, both conceptually and in regard to practical applications. One of the most prolific sources of information is the Statistical Research Division of the US Census Bureau, which has over the years produced many publications on its research into the geocoding process, including concepts such as probabilistic record linkage and standardization techniques. Additionally, numerous articles have appeared in peer-reviewed journals that test the efficiency of geocoding methods, examine the impact of accuracy on spatial analysis, and so on. This brief look at the literature reveals the importance of this topic to all individuals and entities that rely on geocoded data in their work or studies.

  1. Geocoding Developments at the US Census Bureau
    1. Urban geocoding developed rapidly in preparation for 1970 US Census. According to [9], "A conceptual framework for location referencing or coding is not well developed [as of 1974] despite the logical need for conceptual development to precede the technical developments."
    2. [21] sketches the first 15 years of development of geocoding principles and methods at the US Census Bureau and the fundamental requirements of a geocoding system:
      1. A geographic reference file with all levels of needed codes (street segments with street names and address ranges)
      2. Computer programs, including:
        • A program to identify the components of the input address, standardize them, and arrange them into a record format consistent with the geographic base file.
        • A program to link each record to the proper geographic base file record through computer matching techniques (blocking and weighting) and assign the standard geographic location identifiers or codes.
    3. Techniques involved in geocoding have evolved greatly since then with much effort from the Bureau's Statistical Research Division.
     
  2. US Census Bureau Chronology

  3. Below is a chronological listing of the major developments and the evolution of techniques and reference data associated with geocoding at the US Census Bureau:

    Year Major Developments
    1960 Census of Population and Housing mailed questionnaires to respondents. Each questionnaire was picked up from households by enumerators. The process was not automated.
    1961-1964 Development of first major geocoding systems: AULT (Automatic Location table) and SACS (Street Address Conversion Systems). AULT was developed in response to needs in transportation planning to code large files of origin-destination data (street addresses) to traffic zones. SACS was a geographic base file of street segment records containing ranges of street addresses and end-point coordinates.
    1963 (Economic Census) A major development of automating the geographic coding of postal addresses was accomplished. An Address Reference File consisting of street segment records within Census tracts was constructed for areas serviced by post offices located in cities of 25,000+ population. A Building Reference File was developed as an adjunct to the Address Reference File for coding non-street name/house number type of addresses such as motels, hotels, buildings, etc. Rules for matching required that two words, street name and house number within a post office and postal zone, match those of the Address Reference File. In the coding system, each component of an address had equal value. This worked well but it wasted the discriminating power of some fields.
    1966-1969 First nationwide system developed: ACG (Address Coding Guides). The Census Bureau sent questionnaires to ~60% of US households. It was planned to use mail for both sending and returning surveys; this demanded geocoding capability and the subsequent development of the ACG. Before generating mailing labels, they had the addresses coded to 1970 tract and block using ACGs for 147 SMSAs (Standard Metropolitan Statistical Areas). They then related a census block number to a range of addresses along one side of the street.
    1967 (Economic Census) Modified to incorporate the ZIP codes information and to include building names, eliminating a separate building file. The Address Reference File was expanded to include cities down to 2500 in population.
    1967-1971 Second nationwide system developed: DIME (Dual Independent Map Encoding). This new model was based on the street segment as the basic record. It treated a street network as a linear graph and each segment had attributes (LeftTo, LeftFrom, RightTo, RightFrom). This enables fast address lookup from a single flat file.
    1971 Unimatch & Unistan programs were developed for the geocoding process. (Similar programs called AutoMatch and AutoStan by MatchWare). These were laboratory tools to simulate existing or contemplated matching algorithms on an IBM computer.
    1972 & 1977 (Economic Census) Modifications were made to geocoding programs to modify weights. A string comparator technique on post office name introduced.
    1980 By 1980 ACG/DIME had become GBF/DIME (Geographic Base File/DIME). This was followed by a call for a nationwide, seamless digital map, to be called TIGER (Topologically Integrated Geographic Encoding and Referencing). Census Bureau geographer Robert Marx and his team implemented TIGER for the 1990 Census. The block face reference file for the 1970 Census did not require a probabilistic matching scheme, and deterministic coding was still the policy in 1980: "Because of both the quantity and sources of records involved and the requirement for the fine level coding, the [1980] Decennial Census has chosen not to accept the risk of probabilistic computer coding".
    1990 TIGER files now contain address ranges rather than individual addresses. An address list referred to as the ACF (Address Control File) was created. The ACF, the second primary source of city-style addresses for the Census 2000 address list, was used in conjunction with the DSF to create that list.
    2000 LUCA (Local Update of Census Addresses) program initiated.
    2010 (Preparations) MTAIP (MAF/TIGER Accuracy Improvement Project). The program seeks to combine MAF (Master Address File) and TIGER into a single integrated database. The project began in 2002 to correct and update the spatial accuracy of the TIGER database. A portion of the nation's counties or statistically equivalent entities will be updated each year until all counties are completed in 2008. The program seeks to achieve a high level of map coordinate accuracy in TIGER by acquiring and using, as a first priority among data sources, digital (GIS) files prepared and provided by state, local and tribal governments (that meet or exceed the street centerline spatial accuracy requirement of 7.6 meters and are provided without royalty or copyright restriction). [Gillum, 2005]
    (Note: The Census Bureau is bound by the confidentiality provisions of Title 13, United States Code, which is the collective set of laws passed by the Congress over several decades. Title 13 requires, among other things, that information about individuals and establishments (including their addresses and/or specific locations) cannot be disclosed (except in the form of statistical totals) to anyone who has not agreed to abide by these restrictions. This means that the address data being compiled can not be redistributed without a legally binding contract.)
     
  4. Positional error in geocoding [3]
    1. There has been a dramatic increase in the number of public health applications using geocoding to assess geographical distributions of health-related issues: zones of exposure, rates of disease, etc.
    2. This study evaluated positional error caused during automated geocoding of residential addresses and how this error varied between population densities.
    3. The study achieved better positional accuracy with parcel points vs. street centerline geocoding due to centerline interpolation errors, particularly in rural areas.
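
The centerline interpolation contrasted with parcel points above can be sketched as a linear placement of a house number along a street segment. This is a minimal illustration, not the method evaluated in [3]; the function name, the straight-line segment, and the sample address range and coordinates are all assumptions made for the example.

```python
def interpolate_along_segment(house_number, from_addr, to_addr, start_xy, end_xy):
    """Place a house number proportionally between a segment's end points."""
    low, high = sorted((from_addr, to_addr))
    if not low <= house_number <= high:
        raise ValueError("house number is outside the segment's address range")
    # Fraction of the way along the range; a one-address range maps to the midpoint.
    if to_addr == from_addr:
        fraction = 0.5
    else:
        fraction = (house_number - from_addr) / (to_addr - from_addr)
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)


# Hypothetical segment: addresses 100-200 along a straight 1,000-unit centerline.
point = interpolate_along_segment(150, 100, 200, (0.0, 0.0), (1000.0, 0.0))
print(point)
```

The sketch assumes that addresses are evenly spaced along the range. Where lots are large or irregular, the interpolated point can fall some distance from the actual building, which is one way the interpolation errors noted above can arise.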
     
  5. Minimum acceptable hit rate for crime geocoding [24]
    1. This study tests the accuracy of thematic crime maps generated from data sets with incomplete geocoding in order to arrive at a first estimate of a reliable minimum geocoding match rate.
    2. The authors ran an iterative process, randomly removing data to see at what point the spatial pattern changes and no longer reflects the true pattern.
    3. Result: an 85% match rate is the minimum at which the mapped pattern remains reliable.
    4. One limitation of this study is that it does not take into account that mismatches aren't necessarily random. Often all addresses on an entire street may not geocode due to rule base issues. The random removal of data does not mimic this process.
     
  6. Evaluation of uncertainties associated with geocoding techniques [17]
    1. This study looked into the uncertainties associated with interpolation techniques used in geocoding. They tested three algorithms (using TIGER as reference data) and then compared their results against address data collected by GPS. The three algorithms used were: LocMatch (developed by authors), ArcView 3.2 and TANA (Tele Atlas North America).
    2. Results: The differences among the three algorithms are not significant, although TANA did fare best.
    3. Conclusions: The authors identify a need for the following when TIGER data is used as the reference data:
      1. Sophisticated interpolation techniques to geocode out-of-sequence address ranges using additional data from other databases
      2. Improved address matching techniques.
      3. Reference databases with higher spatial accuracy.
     
  7. Public health geocoding [18]
    1. The US government report, Healthy People 2010, set a goal of geocoding 90% "of all major National, State, and local health data systems to promote nationwide use of GIS at all levels."
    2. This study sought to determine the accuracy of geocoding for public health databases.
    3. The authors tested a file of 70 addresses with known errors, geocoding it to census tract and block group levels by 4 commercial geocoding firms (which were undisclosed).
    4. Results: Match rates for the test file ranged from 44% to 84%. When testing with real public health database data, the best firm matched 96% of all addresses.
    5. Conclusions: Public health studies involving geocoded databases should evaluate and report on methods used to verify accuracy, since such a wide range of accuracy exists among geocoding firms.
     
  8. Probabilistic Geocoding System (Australian Model) [4] [23]
    1. This study describes a geocoding system based on a comprehensive high-quality geocoded national address database, the Australian Geocoded National Address File (G-NAF).
    2. The authors use a learning address parser based on Hidden Markov Models to separate free-form addresses into components and a rule-based matching engine to determine the best set of candidate matches to a reference file.
    3. This is all implemented as part of Febrl: an open-source data linkage system written entirely in Python.
    4. The G-NAF is a comprehensive property-based database that is now available for Australia. It is an enormous collection of data, all of which is used, even rejected addresses, which are held to assist in the rectification of new addresses. The strength of G-NAF and the feature that makes the process unique in the world is the vast collection of high resolution spatial information that is used in combination with multiple address data sources (from approximately 13 organizations) with an estimated data volume in excess of 20 gigabytes and approximately 25 million address records. In order to access the data, data share arrangements have been made with many of the contributors. Under this arrangement there is a free two-way exchange of data. In this way the G-NAF has access to the data that it requires to develop an authoritative and ultimately definitive dataset and the contributor receives the benefit of data ‘purification’ through the G-NAF process.
    5. In this program, standardization is not a rule-based system. Rather, it is based on Hidden Markov Models, which, the authors argue, achieve better address standardization accuracy and are easier to set up and maintain than popular commercial linkage systems (address matching, however, is still rule-based).
    6. According to the authors, for addresses, the HMM approach produced equal or better standardization accuracies than widely-used rule-based systems (such as that by ESRI).
     
  9. Addressing Issues [10]
  10. A wealth of discussion exists about addressing standards among many major governmental organizations and local governments. The interest is evident in the fact that URISA sponsors an annual Addressing Conference and that ESRI recently (2004) sponsored a full pre-conference seminar (as part of the International User Conference) on addressing issues. Addressing is an important issue at the City of Richardson as well due to the lack of a unified addressing system and a master address database. This lack of standards at so many levels of government only serves to magnify geocoding problems. The following section gives a brief overview of addressing issues in general and the impact this has on geocoding, based on notes from the 2004 ESRI Pre-Conference Seminar on addressing issues:

    1. The Addressing Problem:
      1. Everyone knows their address
      2. There is no complete list of addresses
      3. Addresses and street names are stored in databases in different formats
      4. Some properties don't have addresses
      5. Addresses can be wrong
      6. Exceptions are not handled consistently
    2. History of Addressing:
      1. Local governments historically have assigned street addresses since the early 19th century (sometimes attributed to Benjamin Franklin, Postmaster General).
      2. Each unit of government kept track of addresses that it either created or used. For example, water utilities kept track of their customers, building departments kept addresses for building permits, etc.
      3. No one was mandated to create or maintain a master list of all addresses within the jurisdiction.
      4. The official address is usually what was assigned by the unit of a local government (or sometimes the local post office) that is responsible for assigning new addresses.
    3. An Additional Consideration: 3D/Substructure Addressing
      1. This is sometimes referred to as occupancy addressing, or unit based addressing
      2. Having such precise data would be extremely useful for emergencies
      3. This kind of data is not historically created, so records may not exist
      4. Types of addresses involved:
        1. Multi-Unit Residential Buildings
        2. Condominiums, Landominiums, Boatominiums, Business Condos
        3. Multi-tenant Commercial Structures (malls, office buildings, industrial buildings)
        4. Campuses (Schools, Universities, Hospitals)
      5. Recommendations for MAR:
        1. All property parcels, vacant or improved
        2. All buildings, excepting small, non-inhabitable outbuildings (garden tool sheds, doghouses, etc.)
        3. All occupancies (apartments, office suites, retail tenancies, condominiums, etc.)--although this is difficult to maintain
    4. Data Validation
      1. Every piece of data put into the repository should be validated.
      2. Database integrity rules such as value domains should be implemented whenever possible.
    5. How Does the Lack of Addressing Standards Impact Geocoding?
      1. Lack of communication between departments on addressing issues can result in incomplete reference data used for geocoding.
      2. Lack of standards on data entry validation results in inconsistent addresses across databases. When attempting to geocode this data, these inconsistencies result in a lower geocoding match rate.
      3. Having to repeatedly correct the same issues results in an inefficient workflow.

Tools & Resources

  1. ESRI Geocoding Rule Base [27]

  2. The geocoding rule base is the collection of tables which define how a table of addresses is actually translated (based on various rules and look-up tables) into a recognizable address format that can then be geocoded properly against the reference data. A Geocoding Rule Base Developer's Guide is provided with the ESRI 9.x documentation, which describes the rule base files and some of the modifications that can be made.

    1. MAT (Match Rules Definition): The MAT file is where the MatchRules are defined (MatchRules define the address fields from the reference data used for matching), m and u probabilities are specified and the weight of each address element is set. It is broken into 2 sections:

    2. Possible modifications:

      Match Rules Definition file
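
To make the role of the m and u probabilities concrete, the sketch below shows how they are conventionally turned into field weights in probabilistic record linkage (the Fellegi-Sunter formulation, with base-2 logarithms). It is an illustration under stated assumptions: the field names and probability values are invented, and nothing here reproduces the contents or syntax of an actual ESRI MAT file.

```python
import math


def agreement_weight(m, u):
    """Weight added when a field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight added (usually negative) when a field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


# Hypothetical (m, u) pairs. m is the probability that the field agrees for a
# true match; u is the probability that it agrees by chance for a non-match.
FIELD_PROBABILITIES = {
    "house_number": (0.95, 0.01),
    "street_name": (0.90, 0.05),
    "street_type": (0.80, 0.30),
}


def composite_score(agreements):
    """Sum the field weights for one candidate, given {field: True/False}."""
    score = 0.0
    for field, (m, u) in FIELD_PROBABILITIES.items():
        if agreements[field]:
            score += agreement_weight(m, u)
        else:
            score += disagreement_weight(m, u)
    return score


full_match = composite_score(
    {"house_number": True, "street_name": True, "street_type": True})
wrong_type = composite_score(
    {"house_number": True, "street_name": True, "street_type": False})
print(round(full_match, 2), round(wrong_type, 2))
```

Under these assumed values, a disagreement on street type lowers the score far less than a disagreement on house number would, since the street type agrees by chance much more often (a high u).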

    3. STN (Command File): The STN is the starting point for each set of standardizing rules. It specifies input record size, debug mode, output file name, and the process name. It should not be edited without corresponding modifications to the PAT.
    4. Possible modifications:

      Command File

    5. DCT (Match Key Dictionary): The DCT specifies the fields that an address may be parsed into as a result of the standardization process. Each field is referred to as a Match Key. The match keys contain the values of the standardization result and are used for matching against the reference data.
    6. Possible modifications:

      Match Key Dictionary

    7. CLS (Classification table):
    8. The CLS table standardizes and tokenizes an input address based on keywords found in an address. It is categorized into classes at the beginning of the file.

      For example, the following address, 123 East Main Street, would be broken down as follows:
      Input address:            123 East Main Street
      Standardized:            123 E MAIN ST
      Tokenized (by PAT):    ^ | D | ? | T
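
The keyword lookup behind the example above can be sketched as follows. This is a simplified illustration, not ESRI's implementation: the table is a tiny hypothetical stand-in for a CLS file, and the token meanings (^ for a number, D for a directional, T for a street type, ? for an unrecognized word) are inferred from the example. Unlike a real rule base, the sketch emits one ? per unrecognized word and does not handle ambiguous keywords such as "ST".

```python
# Hypothetical classification table: keyword -> (standard abbreviation, class).
CLS_TABLE = {
    "EAST": ("E", "D"), "E": ("E", "D"),
    "WEST": ("W", "D"), "W": ("W", "D"),
    "NORTH": ("N", "D"), "N": ("N", "D"),
    "SOUTH": ("S", "D"),
    "STREET": ("ST", "T"),
    "ROAD": ("RD", "T"), "RD": ("RD", "T"),
    "DRIVE": ("DR", "T"), "DR": ("DR", "T"),
}


def classify(address):
    """Return (standardized address, token pattern) for a free-form address."""
    standardized, tokens = [], []
    for word in address.upper().replace(".", "").replace(",", " ").split():
        if word.isdigit():
            standardized.append(word)      # numeric component, e.g. house number
            tokens.append("^")
        elif word in CLS_TABLE:
            abbreviation, token_class = CLS_TABLE[word]
            standardized.append(abbreviation)
            tokens.append(token_class)
        else:
            standardized.append(word)      # unrecognized word, e.g. street name
            tokens.append("?")
    return " ".join(standardized), " | ".join(tokens)


print(classify("123 East Main Street"))
```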

      The CLS table can be edited, especially styles A, T, S and D. Other tokens may or may not work since some of the abbreviations done with the CLS are sent elsewhere, such as the TBL and PAT files for additional standardization.

      Some of the more common standardizations of the CLS file include the following:

      "S" is a special class for "ST". "ST" is handled in the PAT file because it can mean many different things (Street, Saint, Suite, State, Ordinal Street Suffix). However, it can be handled a couple of ways without editing the PAT file, depending on how much flexibility is allowed with the reference data:

      Possible modifications:

      Classification File

    9. PAT (Pattern matching file):
    10. The PAT file is important in the standardization process as it defines pattern rules and actions and converts the recognized operands into match key fields. A tokenized address is matched against hundreds of recognized full and partial address style patterns. The PAT is the only file in the rule base that is binary encoded (since the patterns within and how they are handled are proprietary to ESRI). Editing this file requires the use of a tool (STANEDIT) provided in the Geocoding Developer's Kit (GTK) to edit and re-encode the file. Using STANEDIT, edits are actually made to an XAT file, which is then encoded into a PAT file.

      Possible modifications:

      Pattern Matching File

    11. TBL (Direct.tbl, Expand.tbl, Prefix.tbl, Suffix.tbl): These are optional look-up files used exclusively by the PAT file's pattern recognition process. They should not be confused with the CLS file, where abbreviations may be edited.
  3. ESRI Geocoding Developer's Kit (GTK)
  4. An additional resource that was not used in the study is a toolkit provided by ESRI, referred to as the Geocoding Developer's Toolkit (GTK). One of the original goals of this study was to use the GTK to make modifications to the encrypted PAT file, which defines (proprietary ESRI) standardization rules. It was hoped that PAT modifications could be made to model addresses with unit components by adding an additional token class within the rule base along with the appropriate interpretive rules. In this way, the unit component could be extracted from an address field and used in the matching process for more accurate geocoding. However, while the GTK is freely distributed, ESRI requires that all intellectual property rights for any new rules developed using this kit be granted to ESRI. This requirement has made it difficult for various entities, from local governments to public educational institutions, that are leery of the legal implications to acquire the kit. That was the case in this study; therefore, no pattern rule modifications have been made.

  5. Existing ArcObjects Samples & ArcScripts
  6. Various tools, utilities and scripts already exist that can be used to improve the geocoding process, on the ArcObjects website, the ArcScripts website, and occasionally offered by users on the various ESRI User Forums. A few are cited below:

    1. PickAddress tool: manually specify unmatched addresses by clicking on map location
    2. Address standardization utility: can be used to see how a particular address will be standardized before geocoding
    3. No Parse Locator styles: can be used to geocode a table of addresses in which the address data is already parsed into separate fields (traditionally, it is not parsed)

Data Sources

The City of Richardson operates on an enterprise GIS platform, running ArcGIS 8.3, ArcSDE 8.3, and SQL Server 2000 (as of this study). Geocoding is used in many city applications, with two of the traditional base files serving as the primary geocoding reference data: the road centerline dataset and the parcels dataset. The geocoding match rate against these two reference data sources using default out-of-the-box parameters has always been an issue that needed further investigation due to the occasional poor match rates. While the literature expressed much interest over what constitutes an appropriate match rate and how data analysis is affected by a low match rate (especially in regard to the validity of spatial patterns as with crime data), for a small municipality such as the City of Richardson, match rates of less than 100% are not acceptable due to the high definition of the data available and the precision of spatial analysis required (for example, it might be asked, "How many geocoded records are associated with an individual parcel?"). The section below outlines the primary sources of data used for this study. All data is in the North Central Texas State Plane, NAD 83, projected coordinate system and all data (unless noted otherwise) is stored in an ArcSDE geodatabase.

  1. Crime data: The crime application, in which Police RMS data is geocoded nightly, provides the most compact data set that demonstrates most of the geocoding needs and errors encountered at the City. The application uses a custom composite geocoder which first matches against parcels and then against roads. The resulting geocoded data is categorized into four feature classes in the geodatabase:
    1. Incidents
    2. Field Interview (suspicious activity, persons stopped and questioned)
    3. Accidents (vehicle/roadway accidents)
    4. CAD (911/dispatch calls)
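
The fallback order of the composite geocoder described above (parcels first, then roads) can be sketched as follows. This is only an outline of the control flow; the dictionary-based locators, function names, and coordinates are hypothetical stand-ins and do not represent the ArcGIS geocoding services used by the City.

```python
def make_locator(reference):
    """Return a lookup function over a dict of standardized address -> (x, y)."""
    def locate(address):
        return reference.get(address.strip().upper())
    return locate


def composite_geocode(address, locators):
    """Try each (name, locator) pair in order and return the first match."""
    for name, locate in locators:
        point = locate(address)
        if point is not None:
            return name, point
    return None, None


# Hypothetical reference data; the coordinates are made up.
parcels = make_locator({"123 E MAIN ST": (2500100.0, 7040200.0)})
roads = make_locator({
    "123 E MAIN ST": (2500090.0, 7040150.0),
    "300 E MAIN ST": (2501000.0, 7040150.0),
})
locators = [("parcels", parcels), ("roads", roads)]

print(composite_geocode("123 E Main St", locators))   # found in parcels first
print(composite_geocode("300 E Main St", locators))   # falls through to roads
```

Ordering the locators from most to least spatially precise means a record is only placed by centerline data when no parcel match exists.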

  2. Roads: As a fundamental data layer, the roads feature class plays an important role in many citywide GIS applications (ex. routing, field inspection application, public safety applications). Some of its attribute characteristics include:
    1. Full address ranges are used for major roads (ex. 100-199 N Coit Rd); actual address ranges are used for residential roads (ex. 1705-1711 Meadowlark Dr). It is generally preferred to geocode with as much block-face accuracy as possible (i.e. true address ranges), but this can cause problems when approximated addresses are used in input data (ex. 300 block E Main St, where the actual parcel range may run from 315-345).
    2. Address ranges have been corrected and modified many times based on application needs.
    3. Mixed parity issues exist for some roads, which throws off interpolation techniques.
    4. The direction of digitization follows increasing addressing, so all ranges increase from LeftFrom to LeftTo and from RightFrom to RightTo.
    5. Road segments are split at (almost) all block number changes and at all true intersections.
    6. There is a restriction on splitting roads into any further segments due to conflicts with other applications (primarily Field Inspector which relies on comparing values between years based on Road segment ID).
    7. Street name alias fields exist in the attribute table since naming standards vary across departments. This also allows for some flexibility in modifying street names to better fit geocoding rule base expectations.
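
A simple check for the range characteristics listed above (mixed parity, and ranges that should increase with the direction of digitization) might look like the sketch below. The function name, messages, and sample values are assumptions for illustration; the field names follow the LeftFrom/LeftTo/RightFrom/RightTo convention mentioned earlier.

```python
def range_problems(left_from, left_to, right_from, right_to):
    """Return a list of issues found in one road segment's address ranges."""
    problems = []
    if left_from % 2 != left_to % 2:
        problems.append("left side mixes odd and even numbers")
    if right_from % 2 != right_to % 2:
        problems.append("right side mixes odd and even numbers")
    # Only compare the two sides when each side is internally consistent.
    if not problems and left_from % 2 == right_from % 2:
        problems.append("both sides share the same parity")
    if left_from > left_to or right_from > right_to:
        problems.append("range decreases along the direction of digitization")
    return problems


# Hypothetical segments.
print(range_problems(101, 199, 100, 198))   # clean segment
print(range_problems(101, 200, 100, 198))   # mixed parity on the left side
```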

  3. Parcels: Another fundamental layer, the parcels feature class is traditionally the most spatially accurate layer with address information available to a municipality. Geocoding against parcels allows one to match against individual plots of land (or rather, the centroids of those polygons) rather than interpolating against a street centerline. This is particularly useful in areas where parcels are not regularly addressed (such as on roads with mixed parity) or those parcels that may be quite a distance from the centerline. Some of its attribute characteristics include:
    1. Some parcels do not have a true address (have an abstract number or non-standard reference, ex. Rail ROW)
    2. Street naming standards do not always match the Roads naming standards.

  4. Address Points: To address some of the accuracy issues noted above, an address points dataset was created primarily as a result of this study. The intention was to create a complete master address file which could contain all valid addresses within the City. It is a merger of various other point files, some existing and some that were generated for this purpose. Those files include:
    1. Parcel centroids (for all residential and major parcel properties) with any duplicates and non-address features removed.
    2. A point file of block number addresses for every valid block range (that is, a point for the beginning of every 100-block). This was created to enable matching to approximated addresses based on the block number (for example, "300 W Main St", "400 W Main St", etc.).
    3. A point file of sub-addresses such as individual apartment units, condominium units, duplexes and triplexes (which do not exist in the parcels dataset).
    4. A point file of selected addresses within an approximately 100-mile radius outside of Richardson used commonly by public safety applications (for example, detention centers or other locations to/from which prisoners may be transferred or officers may travel).
    5. The centroids of a polygon file of commercial suites with suite and address information collected from field data.

  5. Geocoded Land Data: The closest dataset the city has to a Master Address List is the Land dataset in the AS400 system. This contains an address table containing address records each with a unique Land ID. This dataset is edited by both Water Customer Service and Building Inspections. Though not truly a 100% complete address list, in the sense of every valid city address from all apartment units to all business suites, this dataset is still useful since many other departments use it as a basis for identifying location in their records. Since each record contains a unique Land ID this offers another opportunity for matching city generated tabular data (which contains this unique ID) against this master file. To accomplish this, this dataset was geocoded against the address points and roads datasets and the numerous unmatched addresses were manually matched until all Land IDs were mapped. This dataset can be used either in a join against Land ID (if there is a one-to-one relationship) or geocoding against the Land ID field using a Single Field geocoding service style, which has proven to run significantly faster than standard address geocoding.
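
The Land ID approach described above amounts to a key lookup rather than address matching, which is consistent with it running faster. A minimal sketch, assuming a one-to-one relationship between Land IDs and locations, is shown below; the IDs, coordinates, and record fields are invented for the example.

```python
def locate_by_land_id(records, land_points):
    """Attach coordinates to records by Land ID; return (matched, unmatched)."""
    matched, unmatched = [], []
    for record in records:
        point = land_points.get(record["land_id"])
        if point is None:
            unmatched.append(record)
        else:
            matched.append({**record, "x": point[0], "y": point[1]})
    return matched, unmatched


# Hypothetical geocoded Land table and departmental records.
land_points = {1001: (2500100.0, 7040200.0), 1002: (2500350.0, 7040410.0)}
records = [
    {"permit": "B-17", "land_id": 1001},
    {"permit": "B-18", "land_id": 9999},   # Land ID not present in the Land table
]

matched, unmatched = locate_by_land_id(records, land_points)
print(len(matched), len(unmatched))
```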

Analysis & Methodology

  1. Analysis of Existing Match Rates: Before any modifications were made to the original geocoding services or reference data used by the crime application, the match rate of the four files was analyzed. The following section details the resulting Match, Tie and Unmatch percentages. As can be seen, unmatch rates were as high as nearly 20% in some cases, significantly impeding an analyst's ability to properly summarize the data spatially. The original match rates (at the beginning of this project) were as follows:
    File             M/U/T   No. Records   Percentage
    Incidents        U             7,408       12.90%
                     T             1,976        3.40%
                     M            48,215       83.70%
    Field Interview  U             1,630       17.70%
                     T               554        6.00%
                     M             7,007       76.20%
    CAD              U            21,240       16.20%
                     T            10,376        7.90%
                     M            99,396       75.90%
    Accidents        U             2,094       19.80%
                     T             1,619       15.30%
                     M             6,875       64.90%
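
The percentages in the table follow directly from the record counts, as the short calculation below shows. The counts are taken from the table above; the variable and function names are only illustrative.

```python
# Record counts per file: U = unmatched, T = tied, M = matched.
counts = {
    "Incidents": {"U": 7408, "T": 1976, "M": 48215},
    "Field Interview": {"U": 1630, "T": 554, "M": 7007},
    "CAD": {"U": 21240, "T": 10376, "M": 99396},
    "Accidents": {"U": 2094, "T": 1619, "M": 6875},
}


def rates(file_counts):
    """Return each status as a percentage of the file's total, to one decimal."""
    total = sum(file_counts.values())
    return {status: round(100 * n / total, 1) for status, n in file_counts.items()}


for name, file_counts in counts.items():
    print(name, sum(file_counts.values()), rates(file_counts))
```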

    The Unmatched category was further broken down into 20 categories of error so that the problem could be addressed one category at a time. These individual errors were then summarized into 3 main categories:
    1. Data Entry Errors: 46.1%
    2. Correctable Errors: 43%
    3. Geocoding Rule Base issues: 10.8%

    Data entry errors included the most categories, which reflected the lack of data validation on input. These errors consisted of leaving out important address components such as the street type or directional prefix, sloppy addresses due to inconsistent abbreviation and non-address information in the address field, and in some instances the complete lack of any valid address or the use of a landmark name rather than its address. While some of these errors can be corrected by making the reference data more flexible (by using additional reference data such as alternate name and/or alias tables), the core issue must ultimately be resolved by forcing data validation upon input. As this requires extensive cooperation from both software vendors (who provide the software into which crime data is input) and department managers (who should reinforce the importance of inputting address data in a standard way), both of these solutions were outside the scope of this study. However, making these simple modifications would greatly increase the match rate without any further effort.

     

    Geocoding Errors                Incidents        Field Interview   CAD              Accident         Average
                                    No.     %        No.     %         No.     %        No.     %
    Data Entry Errors                                                                                    46.1%
    Block number out of range       576     7.8%     105     6.4%      654     3.1%     105     5.0%
    Inconsistent abbreviation       25      0.3%     56      3.4%      81      0.4%     46      2.2%
    Incorrect directional prefix    34      0.5%     6       0.4%      8       0.0%     9       0.4%
    Incorrect street                4       0.1%     2       0.1%      56      0.3%     27      1.3%

    Incorrect street type

    33

    0.4%

    31

    1.9%

    1

    0.0%

    17