Integrating SAS® and Geographic Information Systems for Regional Land Use Planning
Bill Bass, Houston-Galveston Area Council, Houston, TX

ABSTRACT

The Houston-Galveston Area Council (H-GAC) provides regional socio-economic and land-use forecasting analysis for the
13 counties surrounding the Houston metropolitan area. Forecasting efforts require the integration of geographic data and
large amounts of tabular data from various sources, such as parcel boundary datasets and county appraisal records. H-GAC
uses SAS® in conjunction with ESRI® ArcGIS® geographic information systems (GIS) software to produce a comprehensive
land-use database for the 13-county region. The integrated process involves millions of appraisal data records as well as
large volumes of geographic data. By combining SAS® and GIS, H-GAC streamlines the data development process compared
with using other SQL and desktop database technologies.

INTRODUCTION

H-GAC is the region-wide voluntary association of local governments in the 13-county Gulf Coast Planning region of Texas. It
is one of several Council of Government organizations (COGs) in the State of Texas, and serves 12,500 square miles with
more than 5.7 million people. H-GAC is governed by a Board of Directors composed of local elected officials who serve on
the governing bodies of member local governments. There are 35 members on the H-GAC Board. H-GAC provides many
tools, information, region-wide plans, and services to support municipalities, districts, and non-profit organizations. H-GAC's
mission is to serve as the instrument of local government cooperation, promoting the region's orderly development and the
safety and welfare of its citizens (H-GAC 2008). One of H-GAC’s programs is regional socio-economic modeling.

The Socioeconomic Modeling group is an information and research hub in the Community and Environmental Planning
department that gathers, processes, generates, analyzes, and disseminates information on the past, present, and future
land use, economy, and population of our region in order to support comprehensive regional operations and planning
(H-GAC 2008). The primary purpose of forecasting efforts within the socio-economic group is to support Travel Demand
Modeling, which is used in Regional Transportation Planning (RTP). However, H-GAC also uses socio-economic products for
other long-range planning purposes that involve environmental conservation, water quality, and urban planning.

Due to the large amount and complexity of the data obtained for use in socio-economic modeling, SAS® is used for a variety
of functions including: data development, data organization, statistical analysis, and integration of data across multiple
databases. This paper will explain how H-GAC’s Socio-Economic Modeling group uses SAS® in conjunction with GIS to
develop regional land use data, which is one component of the overall regional modeling framework employed at H-GAC.

PROCESSING OF COUNTY PARCEL BOUNDARY AND APPRAISAL DISTRICT DATA

H-GAC obtains appraisal data from each of the 13 County Appraisal District (CAD) offices where data is electronically
available. Appraisal data are typically very large datasets that cover a wide variety of attributes regarding parcels (real
property) within each county. Some of the data attributes included in appraisal roll datasets are:

Valuation of land and improvements (e.g. buildings)
Land usage through the State Classification Coding framework
Ownership and legal descriptions of property
Taxing entities and exemptions
Square footage and structural amenities

In addition, H-GAC obtains parcel boundary datasets from the counties that complement the appraisal roll data. Parcel
boundaries are typically provided in the industry-standard shapefile format, which can be viewed in GIS software such as
ESRI® ArcView® or ArcInfo® products. In many cases, the parcel boundary data and appraisal roll data are not related in a
manner that allows them to be used as a relational database system, although the two datasets do share common fields, such
as Account Number. Furthermore, data schemas are not standardized across county appraisal systems, which
yields a variety of source data layouts and structures with a variety of field naming conventions.

ISSUES AND CHALLENGES IN WORKING WITH APPRAISAL DATA

Through H-GAC’s efforts in working with appraisal roll data, a number of challenges have been identified and overcome in
order to develop a comprehensive regional appraisal database. These challenges exist in both the appraisal roll datasets that
contain the property attribute data and the GIS parcel boundary datasets. Challenges for working with appraisal
roll data include:

Multiple datasets stored within a single text file, each with its own unique data schema
The need to convert data imported as character format to numeric, and numeric data to character
Cleanup of data entry errors such as leading and trailing spaces for primary key fields
Replacing zero values with NULL values to prevent errors when analyzing data

There are also challenges in working with the appraisal parcel boundary data due to the manner in which data is stored
within the county GIS systems. For instance, it is common for a single parcel to have more than one account number
affiliated with it, either because the parcel has multiple owners or because it contains many separately owned units, as
with a high-rise condominium complex. Such cases are typically stored by “stacking” identical parcel features on top of one
another within the GIS and giving each feature a different Account Number. Although this provides an effective end
product for viewing ownership at the parcel level in a single-table format, it does not support a topologically integrated
geographic database in which a single parcel of land can have one or more owners, which is typically represented through
a more relational database structure.

In the following section, these issues and challenges will be explained in detail, as well as how GIS and SAS® are used
together to develop standardized data for the region.

APPRAISAL ROLL DATA DEVELOPMENT

Writing the SAS® code needed to import data files can be lengthy, and appraisal data is no exception. It is common for an
appraisal roll dataset to contain more than 100 fields, each of which needs to be listed in the DATA step that reads the
file. Therefore an Excel® spreadsheet is used to help generate SAS® code that can be copied into the SAS® editor.
Through the use of Excel® formulas and hard-coded text strings, a list of field names can be loaded into an Excel®
spreadsheet, and from there used to generate the INFILE, LENGTH, and INPUT portions of the DATA step. This
method saves time and reduces data entry errors, as field names are copied, not typed.
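
The same idea can be sketched inside SAS® itself, as an alternative to the Excel® formulas rather than a description of H-GAC's spreadsheet. The minimal example below assumes a hypothetical field list (the dataset name field_list, its variables, and the column positions are all invented for illustration) and writes the lines of an INPUT statement to the SAS® log, from where they can be copied into a program.

```sas
/* Hypothetical field list: field name, type (C = character, N = numeric),
   and start/end columns in the flat file. All values are illustrative. */
data field_list;
  length name $32 type $1;
  input name $ type $ start_col end_col;
  datalines;
First_Field C 1 10
Second_Field C 11 60
Land_Value N 62 73
;
run;

/* Write one line of an INPUT statement per field to the SAS log. */
data _null_;
  set field_list end=last;
  if _n_ = 1 then put 'Input';
  /* +(-1) backs up over the blank that list output adds after a value */
  if type = 'C' then put '  ' name '$ ' start_col +(-1) '-' end_col;
  else put '  ' name start_col +(-1) '-' end_col;
  if last then put ';';
run;
```

A LENGTH statement could be generated from the same list in the same way, so that field names are still copied rather than typed.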

Once data is imported into SAS® dataset format, additional SAS® code is written to clean up and standardize the datasets
into a common data structure for all counties in the region. Using standardized dataset names, field names, and
formats greatly simplifies data development and allows the data to be used more efficiently when
analyzing appraisal data. The following are some examples of how SAS® is used to clean up and standardize the
appraisal attribute data.

Attribute data is typically provided in either one or several flat-file layouts. These are typically delimited text files using
either comma or tab delimiters. In some cases multiple types of datasets are stored in a single text file, and thus SAS® is
used to determine which records to read. For example, it is common for a single file to contain not only the appraisal data
that includes ownership, valuation, and land use, but also summary data that aggregates valuations by subdivision. With
SAS®, a DATA step such as the one illustrated below can process the file, importing only the records that represent the
appraisal roll data. In many cases, multiple import steps are used, so that each type of data can be loaded into a separate
SAS® table. The following is an illustration of a conditional import step that only loads records that have a Record Type
of ‘4’ in the source file.

Data Appraisal_Data Other_Data;
  Infile 'Input_file.txt'          /* Name of flat file to load */
    MISSOVER
    lrecl=5000;

  /* Following code specifies field attributes used in conditional processing */
  Length Record_Type $ 1;          /* Initializes record type field */
  Input Record_Type $ 61-61 @;     /* Defines location of record type value in
                                      flat file; the trailing @ holds the record
                                      so the next INPUT reads the same line and
                                      no records are skipped */

  /* Following code is conditional processing to only load certain record types */
  If Record_Type = '4' Then Do;    /* Only loads record types with a value of '4' */
    Length                         /* Initializes and defines other fields in
                                      flat file to import; Record_Type is not
                                      listed here because it is already defined */
      First_Field  $ 10
      Second_Field $ 50;
    Input
      First_Field  $ 1-10
      Second_Field $ 11-60
      Record_Type  $ 61-61;
    Output Appraisal_Data;         /* Name of dataset to write data */
  End;
  Else Output Other_Data;          /* Puts all other records into a scratch
                                      dataset, not used */
Run;

Once data is loaded into SAS®, additional SAS® statements are used to assist in further cleaning the data. For instance, it is
common for some fields to be initially imported as text, when in fact they should be defined as numeric. The same
holds true for some attributes that are imported as numeric when they should be text (e.g. numbers that have leading
zeros). The following are two examples of code used in a SAS® DATA step to handle these conversion
scenarios.

*Code for converting values from Numeric (N) to Character (C);
C = Strip(Put(N,10.)); *Where '10.' is the format width used to write the number;
*Code for converting values from Character (C) to Numeric (N);
N = Input(C,8.0); *Where '8.0' is a numeric informat;
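
As a small worked illustration of both conversions, the sketch below uses invented variable names and values (Tract_Num, Value_Txt, and so on are assumptions, not fields from the appraisal data). Because a SAS® variable cannot change type in place, each conversion writes to a new variable. The Z format is one way to restore leading zeros when a code was read as a number.

```sas
data convert_demo;
  /* Hypothetical input values */
  Tract_Num = 4521;          /* numeric, leading zeros were lost on import */
  Value_Txt = '125000';      /* character, should be numeric               */

  length Tract_Char $6;
  Tract_Char = put(Tract_Num, z6.);    /* '004521': zero-padded to 6 digits */
  Value_Num  = input(Value_Txt, 8.);   /* 125000 as a numeric value         */
run;

proc print data=convert_demo;
run;
```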

Another data cleanup task performed on SAS® datasets is replacing zero values with NULL (missing) values. For
appraisal data, it is typically not appropriate to record some values as zero. Consider the value of land and improvements.
These items, if they exist, have a value associated with them. If the value is not known, then it should not be zero, but
rather NULL so as to not skew statistical analysis. Land values of ‘0’ in the appraisal roll data are changed to NULL,
as all land has a value. The same holds true for improvement values: if an improvement exists, it should have a value
greater than zero, so any values of zero are changed to NULL. These changes are performed using a simple IF-THEN
statement in SAS® that looks for a zero value and sets it to NULL, which SAS® represents as a period for numeric variables.

*Replaces zero values with NULL values;
If Land_Value = 0 Then Land_Value = .;
If Improvement_Value = 0 Then Improvement_Value = .;
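
The effect on summary statistics can be seen with a minimal sketch using three invented records (the dataset names and values are hypothetical). With the zero left in place, the mean land value is 40,000 over three records. Once the zero is set to missing, PROC MEANS excludes it and reports a mean of 60,000 over two records.

```sas
data land_demo;
  input Acct_Num $ Land_Value;
  datalines;
R1001 50000
R1002 0
R1003 70000
;
run;

data land_clean;
  set land_demo;
  if Land_Value = 0 then Land_Value = .;   /* zero becomes missing */
run;

proc means data=land_demo n mean;    /* N = 3, mean = 40000 */
  var Land_Value;
run;

proc means data=land_clean n mean;   /* N = 2, mean = 60000 */
  var Land_Value;
run;
```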

In some instances, data such as zip codes are provided either as aggregated values (e.g. 77027-1234) or as separate values
in their own fields (e.g. 77027, 1234). H-GAC chooses to store zip code data as two separate fields, so for counties
where the data is only provided in an aggregated format, the SAS® SUBSTR function is used. The following is an example
of how two separate zip code fields are created from a single aggregate zip code field.

*Code separates 5-digit zip prefix from 4-digit suffix;
Zip_Code = Substr(Orig_Zip,1,5); *Reads and stores values of positions 1 thru 5;
Zip_Code_Plus4 = Substr(Orig_Zip,7,4); *Reads and stores values of positions 7 thru 10;
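
Where the position of the hyphen cannot be relied on, the SCAN function offers an alternative to fixed positions. The sketch below is a variation on the approach above, not H-GAC's code, and uses two invented zip values. SCAN returns a blank for the second piece when no suffix is present.

```sas
data zip_demo;
  length Orig_Zip $10 Zip_Code $5 Zip_Code_Plus4 $4;
  input Orig_Zip $;
  Zip_Code       = scan(Orig_Zip, 1, '-');   /* text before the hyphen */
  Zip_Code_Plus4 = scan(Orig_Zip, 2, '-');   /* text after it, or blank */
  datalines;
77027-1234
77002
;
run;

proc print data=zip_demo;
run;
```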

Finally, in some instances primary key fields and fields with formatted codes are missing characters or are preceded by
spaces. This can cause issues when trying to join data in multiple tables, as SQL treats spaces as valid characters. Thus a
value of ‘R1234’ in one table is not the same as a value preceded by a space, such as ‘ R1234’, in another table, with the
latter value being a data entry error. To resolve these issues, a DATA step assignment is used to remove the spaces from
such fields. The following is an example of such a statement.

*Removes leading and trailing spaces from account number field;
Acct_Num = Strip(Acct_Num);
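
To illustrate why the cleanup matters for joins, the following sketch builds two one-row tables with invented values (the table names roll and parcels are hypothetical). The first query finds no match because of the leading space. The second finds one match once both keys are stripped.

```sas
data roll;
  length Acct_Num $8;
  Acct_Num = ' R1234';     /* leading space: data entry error */
  Land_Value = 50000;
run;

data parcels;
  length Acct_Num $8 Parcel_ID $8;
  Acct_Num = 'R1234';
  Parcel_ID = 'HR890';
run;

proc sql;
  /* Raw keys: ' R1234' does not equal 'R1234', so the count is 0 */
  select count(*) as Raw_Matches
  from roll as a inner join parcels as b
  on a.Acct_Num = b.Acct_Num;

  /* Stripped keys: the count is 1 */
  select count(*) as Clean_Matches
  from roll as a inner join parcels as b
  on strip(a.Acct_Num) = strip(b.Acct_Num);
quit;
```

Cleaning the key once in a DATA step, as shown above, avoids repeating STRIP in every join condition.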

The end result of using SAS® to process appraisal roll data is a standardized set of SAS® datasets for each county, with
common fields and naming conventions for attributes such as owners, legal descriptions, land value, improvement value,
and state classification code. Although each set of appraisal data from the county includes far more than just the
standardized fields used by H-GAC, these additional fields are not dropped. Instead they are appended to the end of the
common variables. From this point, analysis can be run against the SAS® appraisal roll datasets and reports generated and,
if needed, exported to other formats such as Excel®, DBF, or delimited files.

GIS PARCEL BOUNDARY DATA DEVELOPMENT

In addition to performing quality review on attribute data, SAS® is also used to assist in the cleanup of geographic parcel
boundary data. Depending upon the type of parcel (residential, commercial, mixed use, etc.), the GIS dataset may contain
multiple features ‘stacked’ on top of one another, with each feature having its own account number. For instance, if there
were two owners of a single parcel of land, each with their own account number for a single-family residential property,
there may be two spatially and geometrically identical polygon features, each carrying the account number of the owner it
represents. Therefore H-GAC uses ESRI® ArcGIS® in conjunction with SAS® to create a single polygon for these features
while retaining the multiple account number assignments. In effect, the flat file structure of the appraisal GIS dataset is
transformed into a more extensive relational database system, capable of supporting complex analysis.

First, GIS is used to calculate a centroid for each polygon in the original parcel dataset, expressed as an X/Y
coordinate. An X/Y value can be thought of as a latitude and longitude pair. If two geometrically identical parcels are
stacked on top of one another in the same geographic space, both will have the same X/Y coordinate value, or centroid
location.

Next, the parcel dataset is processed using a method called dissolving, in which polygons are grouped and merged
based on some common value, in this case the X/Y coordinate. The result of the dissolve process is a new dataset that
contains only one parcel boundary for a given space, where before there may have been multiple parcels stacked on top of
one another. This new dataset also retains the X/Y coordinate value of the final aggregated polygons. If a parcel is not
stacked on top of another parcel to begin with, then the dissolve process merely places the single parcel into
the new dataset.

What exists at this point are two GIS datasets:

The original parcel boundaries, which contain stacked and non-stacked parcels, each with its respective account
number and X/Y coordinate; and,
The dissolved parcel boundaries, which contain only one parcel for each area of land and the X/Y coordinate of each
parcel

For the newly created dissolved parcel boundaries dataset, each parcel is given a unique parcel identification code, or
Parcel ID. The Parcel ID field serves as the primary key for this dataset. Then both parcel datasets are exported to a
shapefile format, which stores attribute data such as X/Y coordinate, Account Number, and Parcel ID in a DBF data table.

At this point, SAS® assists in the integration of the two datasets into a relational database structure. SAS® handles
the large amount of data to be processed for each county, sometimes upwards of 1 million parcels, very efficiently.
Using PROC IMPORT steps as illustrated below, both DBF tables are loaded into SAS®.

*Loads original parcels data table containing the Account Number and X/Y
coordinate of each parcel;
Proc Import Out=Original_Parcels Replace
Datafile= 'c:Original_Parcels.dbf';
Run;

*Loads dissolved parcels data table containing the Parcel ID and X/Y coordinate
of each parcel;
Proc Import Out=Dissolved_Parcels Replace
Datafile= 'c:Dissolved_Parcels.dbf';
Run;

Next, the two datasets are joined using a PROC SQL LEFT JOIN statement as illustrated below.

*Joins dissolved parcels dataset to original parcels dataset to obtain account
numbers affiliated with each dissolved parcel;
Proc SQL;
Create Table Parcel_ID_to_Account_Number as
SELECT X.Parcel_ID, X.XY_Coord, Y.Account_Number
From Dissolved_Parcels AS X
Left Join
Original_Parcels AS Y
On X.XY_Coord = Y.XY_Coord;
Quit;

The above PROC SQL step creates a dataset that contains all Parcel IDs from the dissolved parcels dataset and their
affiliated Account Numbers from the original dataset. The Parcel ID to Account Number table becomes a critical link
between the parcel boundary GIS data and the Appraisal Roll property attribute data. Specifically, it allows a single
parcel of land to be related to one or more accounts affiliated with that parcel, and each account to its corresponding
detail record in the Appraisal Roll dataset. The following section will illustrate how having such a table allows H-GAC to
produce parcel level land use data for the region.
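
The one-to-many relationship can be seen on a toy example. The sketch below is self-contained and uses invented account numbers, Parcel IDs, and coordinate values. It also assumes the X/Y coordinate is stored as a single text key (XY_Coord), since the exact storage format is not described here. Matching on a text key avoids the small floating-point differences that can arise when comparing numeric coordinates directly.

```sas
/* Original (stacked) parcels: two accounts share one location */
data Original_Parcels;
  length Account_Number $8 XY_Coord $20;
  input Account_Number $ XY_Coord $;
  datalines;
R12345 3101500_13840200
R45678 3101500_13840200
R99999 3102750_13841100
;
run;

/* Dissolved parcels: one feature per location, each with a Parcel ID */
data Dissolved_Parcels;
  length Parcel_ID $8 XY_Coord $20;
  input Parcel_ID $ XY_Coord $;
  datalines;
HR890 3101500_13840200
HR891 3102750_13841100
;
run;

proc sql;
  create table Parcel_ID_to_Account_Number as
  select X.Parcel_ID, X.XY_Coord, Y.Account_Number
  from Dissolved_Parcels as X
  left join Original_Parcels as Y
  on X.XY_Coord = Y.XY_Coord;

  /* HR890 has 2 accounts, HR891 has 1 */
  select Parcel_ID, count(Account_Number) as Num_Accounts
  from Parcel_ID_to_Account_Number
  group by Parcel_ID;
quit;
```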

DETERMINATION OF LAND USE FROM APPRAISAL ROLL DATABASES

H-GAC uses appraisal data as a basis for determining land use in the 13-county region surrounding the Houston
Metropolitan area. To process large amounts of appraisal data, H-GAC organizes appraisal records by parcel, which can
number upwards of 1 million records for a county, and over 3 million for the region.

However, appraisal data is not the only input to the land use determination process. H-GAC also acquires a variety of other
data related to land use, such as locations of schools, government buildings, infrastructure, and environmental
conservation and park areas. This additional information is used in conjunction with the appraisal roll data to obtain a more
accurate land use determination, or to provide one where the appraisal data offers none.

The first step in the process is to assign each appraisal roll record a Parcel ID. As discussed in the prior section, SAS® was
used to process data from the H-GAC GIS to determine parcel assignments for each appraisal account. Using a PROC SQL
LEFT JOIN statement illustrated below, each appraisal roll record is assigned to a parcel.

*Joins appraisal roll to Parcel ID based on Account Number assigned to parcels;
Proc SQL;
Create Table Appraisal_Roll_Parcel_ID as
SELECT X.Account_Number, X.Owner_Name, X.Legal, X.State_Class_Code,
Y.Parcel_ID
From Harris_Appraisal_Roll AS X
Left Join
Parcel_ID_to_Account_Number AS Y
On X.Account_Number = Y.Account_Number;
Quit;

The result of the query is a table that can be used as the basis for the land use model to determine land use and ownership
by parcel. Since the process is primarily focused on land use, only a few of the many fields available in the Appraisal Roll
dataset are retained for further processing. In order to determine land use, the State_Class_Code field will be the field of
focus, as this field contains two-character codes that denote the type of property (e.g. single-family residential,
commercial, industrial, etc.).

The next step in the process is to determine land use of each parcel based on the State Class Code attribute retained in the
prior query. Each record in the Appraisal Roll dataset is aggregated by the combined values of the Parcel_ID and
State_Class_Code fields. This prevents the same combination of Parcel ID and State Class Code from being
listed more than once. For instance, if account ‘R12345’ has a State Class Code of ‘A1’, and account ‘R45678’ has a State
Class Code of ‘A1’, and both are assigned to Parcel ID ‘HR890’, then all that is needed is a record that lists parcel HR890 as
having a State Class Code of ‘A1’. Alternatively, if one of the State Class Codes for the above two accounts were different, say
‘A2’ for account R45678, then two records would be produced for parcel HR890, one with a State Class Code value of ‘A1’,
and another with a value of ‘A2’. The following is an illustration of the PROC SQL code used for this step in the process.

*Keeps only unique Parcel ID and State Class Code combinations;
*ORDER BY guarantees the sort order needed by the BY statement in the next DATA step;
Proc SQL;
Create Table Unique_Parcels_SC AS
SELECT Parcel_ID AS Unique_Parcel_ID, State_Class_Code,
Count(State_Class_Code) AS NumberOfDups
From Appraisal_Roll_Parcel_ID
GROUP BY Parcel_ID, State_Class_Code
ORDER BY Unique_Parcel_ID, State_Class_Code;
Quit;

As the next step, a DATA step and PROC TRANSPOSE are used to transpose the vertical records for each parcel, whether it
has a single State Class Code or several, into columns. Those columns are then merged to create a single State Class Code
field, or SSC.

*Creates counter to identify first Parcel ID record;
Data Unique_Parcels_SC_N (Rename =(Unique_Parcel_ID = Parcel_ID));
Retain Counter;
Set Unique_Parcels_SC (Drop = NumberOfDups);
By Unique_Parcel_ID;
If First.Unique_Parcel_ID Then Counter = 1;
Else Counter = Counter +1;
Run;

The result of the above statement is a dataset that numbers each Parcel ID observation in order starting with a value of ‘1’
for the first instance, and then ‘2’, ‘3’, etc if there are additional observations for that Parcel ID. This dataset is then used as
input to the PROC TRANSPOSE statement below.

*Transposes based on Parcel ID for each State Class Code value;
Proc Transpose
Data =Unique_Parcels_SC_N
Out = Parcels_SC_Horiz (Drop = _Name_);
By Parcel_ID;
Var State_Class_Code;
ID Counter;
Run;

The result of the above statement is a table that lists each Parcel ID as a record with one or more values in horizontal
attribute columns. Some parcels may have only one State Class Code value, whereas others may have several, and thus the
dataset may have anywhere from one to seven attribute fields, one for each transposed value. Those multiple values are then
merged into a single State Class Code field as illustrated below.

*Creates final transposed parcel to state class code dataset;
Data Parcel_SSC (Keep = Parcel_ID State_Class_Code);
Set Parcels_SC_Horiz;
Length State_Class_Code $10; *Set field size to be sum of all variables being merged;
State_Class_Code = Strip(Strip(_1)||' '||Strip(_2)); *Merges multiple values;
Run;

The above statement creates a two-column table that contains a field for Parcel ID and the merged State Class Code value
stored as SSC. The Strip function is used to remove any leading or trailing spaces that result from merging fields that
may be empty.
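
When the number of transposed columns varies from one to seven, listing each column by name becomes tedious. One possible variation, sketched below on invented data, adds a PREFIX= option to PROC TRANSPOSE so the columns are named SC_1, SC_2, and so on (the SC_ prefix is an assumption, not part of the code above), and then uses CATX with a name-prefix list. CATX strips each value, inserts the separator, and skips empty columns, so no nested Strip calls are needed.

```sas
/* Toy input: one parcel with two codes, one parcel with a single code */
data Unique_Parcels_SC_N;
  length Parcel_ID $8 State_Class_Code $2;
  input Parcel_ID $ State_Class_Code $ Counter;
  datalines;
HR890 A1 1
HR890 A2 2
HR891 F1 1
;
run;

proc transpose data=Unique_Parcels_SC_N
               out=Parcels_SC_Horiz (drop=_name_)
               prefix=SC_;            /* columns become SC_1, SC_2, ... */
  by Parcel_ID;
  var State_Class_Code;
  id Counter;
run;

data Parcel_SSC (keep=Parcel_ID State_Class_Code);
  set Parcels_SC_Horiz;
  /* Seven 2-character codes plus six separating spaces = 20 */
  length State_Class_Code $20;
  State_Class_Code = catx(' ', of SC_:);   /* 'A1 A2' for HR890, 'F1' for HR891 */
run;
```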

Next, the Parcel_SSC table is joined with a Land Use to State Class Code lookup table to assign a land use code to each
parcel. H-GAC has defined approximately 70 land use types and has grouped them into 8 Land Use Categories. The Land Use
to State Class Code lookup table includes the following fields: Land Use Code, Land Use Category, and State Class Code.
Using a PROC SQL LEFT JOIN statement, the Parcel_SSC table is joined to the lookup table to
obtain the corresponding Land Use Code and Land Use Category information for each parcel based on its State Class Code
value.
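
A minimal sketch of this lookup join follows. The lookup table name (Land_Use_Lookup), the land use codes, and the category labels are all invented for illustration, and only single-code values are shown. A parcel whose State Class Code has no lookup entry keeps missing land use fields, which makes unmatched codes easy to find afterwards.

```sas
data Parcel_SSC;
  length Parcel_ID $8 State_Class_Code $20;
  input Parcel_ID $ State_Class_Code $;
  datalines;
HR890 A1
HR891 F1
HR892 X9
;
run;

/* Hypothetical lookup table; codes and categories are illustrative only */
data Land_Use_Lookup;
  length State_Class_Code $20 Land_Use_Code $4 Land_Use_Category $20;
  input State_Class_Code $ Land_Use_Code $ Land_Use_Category $;
  datalines;
A1 1100 Residential
F1 2100 Commercial
;
run;

proc sql;
  create table Parcel_Land_Use as
  select X.Parcel_ID, X.State_Class_Code,
         Y.Land_Use_Code, Y.Land_Use_Category
  from Parcel_SSC as X
  left join Land_Use_Lookup as Y
  on X.State_Class_Code = Y.State_Class_Code;
quit;
```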

At this point, a baseline land use determination is established for each parcel. However, as previously mentioned, H-GAC
has additional information that can supplement the appraisal data to determine a more accurate land use classification.
This supplemental information is helpful, as many appraisal roll records have Exempt status for their State Class Code
values. Exempt properties are typically schools, religious entities, government property, public infrastructure, and natural
areas that are not taxed in the same way as non-exempt properties. As a separate initiative, H-GAC uses GIS to overlay
source data representing these types of properties on top of the parcel boundary framework, in order to obtain Parcel IDs
for each of these entities. That information for each geographic dataset is then aggregated and placed into a single Land
Use Overrides table that contains fields for the Parcel ID and the Land Use Code determined by the nature of the source
geographic data (e.g. school, religious, government owned, park, etc.).

As a final step in creating a regional land use dataset, the baseline land use data developed in SAS® is joined with the
Land Use Overrides table using a series of SAS® statements. These statements compare each parcel’s override value with
its baseline value. If the two are the same, the override value is ignored and the existing land use value determined from
the appraisal roll data is retained. This allows for more accurate tracking of how land use was determined, and helps to
gauge the accuracy of appraisal data over time. Furthermore, if there are conflicting values in the override table for a
parcel, such as a parcel being listed as both a commercial facility and an industrial facility, those override records are
ignored as well, and an error report table is produced so that the override values can be investigated further and
corrected. What remains following the override audit steps is a final list of land use codes that should replace the
existing baseline land use determination values. The override values are then joined to the baseline land use table, and a
final land use code is determined for each parcel: parcels with a valid override value take that value, and parcels with
no match in the override table retain their baseline value.
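
The override audit described above could be sketched along the following lines. This is an illustration under stated assumptions, not H-GAC's actual code: the table names, the Override_Code and Source columns, and all code values are invented. The sketch sends conflicting overrides to an error table, drops overrides that merely repeat the baseline, and applies what remains.

```sas
data Baseline;
  length Parcel_ID $8 Land_Use_Code $4;
  input Parcel_ID $ Land_Use_Code $;
  datalines;
HR890 1100
HR891 9000
HR892 9000
HR893 2100
;
run;

data Overrides;
  length Parcel_ID $8 Override_Code $4;
  input Parcel_ID $ Override_Code $;
  datalines;
HR890 1100
HR891 6100
HR892 2100
HR892 3100
;
run;

proc sql;
  /* Parcels with more than one distinct override value: error report */
  create table Override_Errors as
  select Parcel_ID, Override_Code
  from Overrides
  group by Parcel_ID
  having count(distinct Override_Code) > 1;

  /* Valid overrides: not in conflict and different from the baseline */
  create table Valid_Overrides as
  select distinct O.Parcel_ID, O.Override_Code
  from Overrides as O
  inner join Baseline as B
  on O.Parcel_ID = B.Parcel_ID
  where O.Override_Code ne B.Land_Use_Code
    and O.Parcel_ID not in (select Parcel_ID from Override_Errors);

  /* Final code: the override where one is valid, otherwise the baseline */
  create table Final_Land_Use as
  select B.Parcel_ID,
         coalesce(V.Override_Code, B.Land_Use_Code) as Land_Use_Code,
         case when V.Parcel_ID is null then 'Appraisal'
              else 'Override' end as Source
  from Baseline as B
  left join Valid_Overrides as V
  on B.Parcel_ID = V.Parcel_ID;
quit;
```

With this toy data, HR890 keeps its appraisal value because the override repeats it, HR891 takes the override, HR892 is reported as a conflict and keeps its baseline, and HR893 has no override. The Source column is one simple way to track how each land use value was determined.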

As a final output of the land use model, SAS® is used to create land use datasets that can be joined with GIS datasets using
the Parcel ID value. This provides a simple way to produce regional land use maps. Furthermore, SAS® is
used to summarize the land use table by land use type to determine the amount of acreage in the region for each land use
type. This is accomplished by joining the land use table to a table that lists each parcel and its acreage.
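
The acreage summary can be sketched as a join followed by a GROUP BY. The table names, codes, and acreage figures below are invented for illustration.

```sas
data Final_Land_Use;
  length Parcel_ID $8 Land_Use_Code $4;
  input Parcel_ID $ Land_Use_Code $;
  datalines;
HR890 1100
HR891 6100
HR893 1100
;
run;

/* Hypothetical table listing each parcel and its acreage */
data Parcel_Acreage;
  length Parcel_ID $8;
  input Parcel_ID $ Acres;
  datalines;
HR890 0.25
HR891 12.5
HR893 0.40
;
run;

proc sql;
  create table Acres_By_Land_Use as
  select L.Land_Use_Code,
         count(*)     as Num_Parcels,
         sum(A.Acres) as Total_Acres
  from Final_Land_Use as L
  left join Parcel_Acreage as A
  on L.Parcel_ID = A.Parcel_ID
  group by L.Land_Use_Code;
quit;
```

Here code 1100 totals 0.65 acres across two parcels and code 6100 totals 12.5 acres for one parcel.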

CONCLUSION

As discussed in this paper, H-GAC uses SAS® as a critical component in determining land use for the region. The regional
land use effort cannot be accomplished through the use of a single technology or software platform,
but rather requires integrating two separate software products. By using the best capabilities of two different systems,
ESRI® ArcGIS® and SAS®, an integrated process has been developed. This process assists in overcoming challenges such as
large-volume datasets, quality review and control of variables, and relating multiple datasets from different sources to
create a comprehensive regional database. Furthermore, it allows H-GAC to conduct regional analysis by standardizing data
across all county geographies.

REFERENCES

H-GAC (Houston-Galveston Area Council). 2008. www.h-gac.com.

CONTACT INFORMATION

Bill Bass, GISP
Houston-Galveston Area Council
Socio-Economic Modeling
3555 Timmons Lane
Suite 120
Houston, Texas 77027
(713) 499-6687
William.Bass@h-gac.com
