Part 3 - Geographic Analysis

The goal of part 3 is to introduce some analysis and geoprocessing features and techniques using a site selection problem as an example. Over the course of this exercise you'll learn how to: create a new project from an existing one, create a subset of a layer and process it to create land boundaries, join an attribute table to a shapefile, map the attributes of a shapefile, take a list of coordinates and convert it to a shapefile, draw buffers around a set of features, and select features based on their attributes and their spatial relationship to other features.

The object of this particular exercise is to identify potential areas within neighborhoods in New York City for locating a comic book store. Market research suggests that the primary demographic groups that purchase comic books are adults aged 18 to 34, people employed in professional and related occupations, and men. Based on this research we will identify neighborhoods that have a high percentage of adults in this age bracket and that don't have a large imbalance between the number of men and women. We will also identify areas within these neighborhoods that are within a half mile of a college or university (where young adults tend to congregate and people tend to work in professional occupations).

I. Creating New Project From Existing One
II. Geoprocessing Shapefiles
III. Joining and Mapping Attribute Data
IV. Plotting Coordinate Data
V. Running Statistics and Querying Attributes
VI. Drawing Buffers and Making Selections
VII. Screen Captures

Section I: Creating New Project From Existing One

This section will show you how to create a new project from an existing one and will set the working environment for the rest of part 3.

Steps

  1. Open project. Launch QGIS. Hit the Open Project button (or go to File > Open Project). Browse through your folders to the QGIS project file you created for part 2, and select it to open it.
  2. Save Project As. Once your project has loaded, hit the Save Project As button (or File > Save Project As). Browse to the data folder for part 3. Save the project in that folder as part3.qgs. Hit Save. You've now saved a new copy of your old project, and are currently working in this new copy (you can tell by looking at the title at the top of the window, where the project name is listed). We will work with this new project, part3.qgs, for this part of the tutorial.
    Project title
  3. Remove a layer. We don't need the raster layer for this exercise. Select the drg_central_park layer in the Map Legend (ML). Right click on the layer in the ML and select Remove (or, hit the Remove Layer button on the toolbar).
  4. Zoom out and save. Hit the Zoom to Full Extent button to zoom out to the full extent of your layers. Then hit the Save button.

Commentary

Saving Projects and Removing Layers

Use the Save button to save the current project, and the Save As button to save the current project as a new copy with a different project name. Save As saves you the effort of starting from scratch if you have an existing project that you can use to branch off from. When you remove a layer from a project you're just severing the link between a particular project and that data; you're not actually deleting the data itself.


Section II: Geoprocessing Shapefiles

In this section you'll learn how to process a shapefile to prepare it for analysis. This is a common GIS task; normally when you download publicly available shapefiles you'll have to do some processing to make them usable for your projects.

You'll be processing a boundary file for Public Use Microdata Areas (PUMAs) which we'll use to approximate neighborhood boundaries. PUMAs are statistical boundaries created by the US Census Bureau. The file was downloaded from the US Census TIGER Line Files.

Steps

  1. Add the PUMA shapefile. Hit the Add Vector Data button. Hit the Browse button and browse to the data files for part 3. Select the PUMA layer, which is called tl_2009_36_puma500.shp, hit Open, and Open again to add the layer. By default the new layer will be drawn over top of the existing layers.
    Added PUMA layer
  2. Organize layers. Select the PUMA layer in the Map Legend (ML) and hit the Zoom to Layer button. You'll see the PUMA layer covers all of NY state, but we only need PUMAs for NYC. We'll do some operations and create a new file that just has the NYC PUMAs. Select the counties layer in the ML and hit the Zoom to Layer button. Select the PUMA layer in the ML, and drag it to the bottom of the ML. Uncheck the boxes beside the green space, facilities, and colleges layers to turn them off for now.
  3. Change symbols. Double click on the counties layer. Under the style tab in the Outline option section, change the width box from .26 to .75. Hit OK. Then double click on the PUMAs layer. Click on the labels tab. Check the Display labels box. In the Field Containing Labels dropdown, select the PUMA5CE00 field as the labels field. Change the font size to 8. Click OK to apply.
    Counties and pumas with new symbols
  4. Activate the fTools plugin. If you haven't done so already, go to Plugins > Manage Plugins, and make sure the fTools plugin is checked. This will make the Vector menu appear on the menu bar.
  5. Select PUMAs within the counties layer. Go to Vector > Research Tools > Select by Location. Select features in the puma layer (tl_2009_36_puma500) that intersect features in the counties layer (nyc_counties_2008), and keep the default for Modify current selection by creating new selection. Click OK. You'll see that all PUMAs within and touching the NYC county layer have been selected. Close the Select by Location menu when finished.
    Select by Location / PUMAs that intersect NYC counties
  6. Remove PUMAs outside NYC from selection. Select the PUMA layer in the ML. Hit the Select Features button. While holding down the CTRL key, click on each of the PUMAs that are outside of the dark NYC boundary one by one to unselect each one. There are seven PUMAs that you must unselect (clockwise from top): 03400, 03505, 04201, 04204, 04205, 04211, 04212. If you unselect a PUMA by mistake, just click it again to re-select it. If you inadvertently unselect all of the PUMAs, you'll have to redo the previous step with the Select by Location tool to reselect them.
    Just NYC PUMAs selected
  7. Save selection as new layer. Select the PUMA layer in the ML. Right click and choose the Save Selection As option. In the Save Selection As menu, save the new layer as an ESRI shapefile. Browse and save it in your part 3 folder as pumas_nyc_boundaries. Leave the Encoding as the default, but hit the Browse button beside the CRS box and change the Original CRS to NAD 83 by selecting it from the list. Hit OK.
    Save vector layer menu
  8. Add new layer to map. Hit the Add Vector Data button. Browse to the part 3 folder and add the new layer you created, pumas_nyc_boundaries.shp. Drag it to the bottom of the ML. Then select the original PUMA layer for NY state in the ML. Right click and remove the layer, as we don't need it any more. Save your project at this point.
    NYC counties and PUMAs
  9. Convert PUMAs from statistical boundaries to land boundaries. Our last geoprocessing step is to convert the PUMA boundaries, which incorporate land and water, to boundaries that represent just land. Add a vector layer, browse to the part 3 data folder and add the layer nym_water. On the menu bar go to Vector > Geoprocessing Tools > Difference. Select pumas_nyc_boundaries as the Input vector layer, nym_water as the Difference layer, and Browse and save the new file in your part 3 data folder as pumas_nyc_land. Hit OK. When prompted to add the layer to the project, say Yes. Close the difference menu.
    Difference menu / Water layer added
  10. Clean up. Select the nym_water layer in the ML, right click and remove it. Do the same for the pumas_nyc_boundaries layer. Then drag the new pumas_nyc_land layer to the bottom of the ML. At this point, you have a brand new PUMA layer just for NYC that represents land boundaries. Save your project.
    New layer: PUMA land boundaries for NYC

Commentary

Geographic Units

For this exercise we're working with Public Use Microdata Areas (PUMAs) which are a statistical area created by the US Census Bureau. While PUMAs were created for a specific purpose (geographically aggregating census microdata), they are also useful for mapping areas within large cities. PUMAs were designed to have approximately 100,000 people, which makes them better than legal or administrative units for making comparisons or mapping distributions. Neighborhoods in most North American cities are rarely formally delineated, so a geographic unit like a PUMA can serve as a proxy.

The choice of a geographic unit is an important decision; it's often a balance between the availability of data for an area, the suitability of the unit for the analysis, the amount of work that has to be invested in processing and analyzing the data, and the final outputs that will be created (tables, charts, maps) to explain the data.

PUMAs are a good choice for our exercise because: data is regularly published for these areas by the Census Bureau (annually as three year estimates published in the American Community Survey), PUMAs are good for approximating NYC neighborhoods, they are designed to have approximately the same number of people, and there are only 55 of them in the city which makes it manageable for this tutorial. One disadvantage is that the large size of a PUMA can mask individual population clusters within it, which makes it difficult to pinpoint an exact location for a retail store (but isn't unreasonable for getting a general idea of which areas to explore).

Compare PUMAs with other geographic units to get a better idea of the strengths and weaknesses. ZIP codes are more familiar to people and are commonly used in marketing, but the boundaries tend to be irregular and areas vary widely in size and population (ZIP codes were designed for delivering mail, not for studying populations). ZIP Code data is only available from the decennial census, which has a much smaller number of variables relative to the American Community Survey and is only updated every ten years. Census tracts are census statistical areas created to have an optimum size of 4,000 people. They would be better for pinpointing a more specific location for a store, but there are thousands of them in the city and would require more time to process and work with. Census tract data is available from the American Community Survey annually as five year estimates.

TIGER Line Files

The Census Bureau creates and maintains legal, statistical, and administrative boundaries for all geographic areas that it publishes data for. It also creates and maintains geographic features such as water, roads, and landmarks that are used when creating statistical boundaries. These files were originally in a vector format created by the census called Topologically Integrated Geographic Encoding and Referencing or TIGER. The Census now provides this data in shapefile format. The files are in the public domain and can be downloaded for free at http://www.census.gov/geo/www/tiger/.

The PUMAs used in this tutorial were downloaded from the Census TIGER site. Most of the other files used in this exercise were created from the TIGER files. The NYC counties file is a subset of the TIGER county file for New York State, while the facilities and parks layers are aggregations and selections from the TIGER landmarks file for each of the five counties. All three layers were previously geoprocessed to convert legal boundaries to land boundaries, using a subset of the TIGER water features.

Download TIGER Line Files

We were able to add the PUMA layer directly to our project because it shares the same geographic coordinate system as our other layers - NAD 83. Data downloaded from the Census TIGER site are all projected in NAD 83. We'll discuss and work with map projections later on in this tutorial.

The Census Bureau makes minor updates to boundaries from year to year, but major changes occur each decade as each decennial census is released. The ACS data we are using in this tutorial is from 2007-2009, so we're using the 2009 TIGER files, which are based on 2000 Census geography. ACS data for 2010 will be tabulated based on new 2010 Census geography.

Geographic Selection

One of the strengths of GIS is the ability to perform spatial queries on features; i.e. select all areas that intersect other areas. This is one area where QGIS is still developing. The Select by Location feature of the fTools plugin only allows you to select features that intersect other features. However, several other spatial query options exist in other GIS packages, such as selecting features that border each other, or that are within or have their center within other features (the latter would have been the preferred option for selecting PUMAs within NYC counties). QGIS does have a Spatial Query plugin that can be activated in the plugins menu, and provides other options such as: crosses, disjoint, intersects, touches, and within. However, the tool isn't perfect and seems to have trouble when making selections between two polygon layers, which is why it wasn't demonstrated in this tutorial (although it works better when selecting points or lines in relation to polygons). If you need spatial query options beyond intersect, you can use other open source software: the command line GDAL / OGR tools, a geodatabase (PostGIS or SpatiaLite) tool, or GRASS GIS.
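
If you want that kind of flexibility in an open source scripting environment, a "center within" selection can be sketched with the Python geopandas library. This is my own substitution, not something the tutorial uses; it only reuses the file names from this exercise and assumes geopandas is installed.

    import geopandas as gpd

    pumas = gpd.read_file("tl_2009_36_puma500.shp")    # statewide PUMA boundaries
    counties = gpd.read_file("nyc_counties_2008.shp")  # NYC counties

    # Keep only PUMAs whose interior point falls inside any NYC county,
    # which avoids grabbing neighbors that merely touch the county boundary.
    nyc_area = counties.unary_union
    nyc_pumas = pumas[pumas.representative_point().within(nyc_area)]
    nyc_pumas.to_file("pumas_nyc_boundaries.shp")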

It's pretty common that you'll download geographic data that covers an area that is wider than you need. Since GIS data is malleable, it usually makes sense to grab data for a larger area and select out just the portions you need, if you can't find a layer that consists just of the areas you want; this is something to keep in mind when you search for data on the web.

Geoprocessing

It's also rather common that you'll download shapefiles that represent boundaries, but these boundaries will often incorporate land and water. If your intention is to show the actual boundary lines for reference purposes, then you will want to use the files as is. However, if you want to map the distribution of phenomena by area you'll want to process the boundaries to remove water, as that phenomenon isn't likely distributed there (i.e. there is no population living in the harbor or ocean). You'd also want to alter the boundaries if you're creating maps and want the user to be able to clearly understand the areas you're depicting. The Difference tool accomplishes this by subtracting the areas of bodies of water from the boundaries, resulting in features that show the outline of land.
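
For readers who prefer scripting, the same subtraction can be sketched with geopandas (an assumption on my part; the tutorial itself uses the QGIS Difference tool). The file names match the ones used in Section II.

    import geopandas as gpd

    pumas = gpd.read_file("pumas_nyc_boundaries.shp")
    water = gpd.read_file("nym_water.shp")

    # Subtract the water polygons from the PUMA polygons, leaving land only.
    land = gpd.overlay(pumas, water, how="difference")
    land.to_file("pumas_nyc_land.shp")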

This is merely one application and tool in the geoprocessing toolkit. Geoprocessing is essentially a GIS operation to manipulate the spatial aspects of GIS data. In the broad sense it includes layer overlay, feature selection, data conversion, and topology processing. In the more narrow sense that we're using here, it refers specifically to topology processing; modifying the actual geometry (points, lines, and areas) of features and files. Via the fTools plugin, QGIS provides the following Geoprocessing tools for vector layers: Convex hull(s), Buffer(s), Intersect, Union, Symmetrical difference, Clip, Difference, and Dissolve (running each tool creates a new layer; it does not modify existing layers).

There are also some geoprocessing tools under the Geometry Tools menu in fTools that convert or break polygons apart into simpler features like lines or points (we'll cover single-part and multi-part polygons later) and under the Data Management Tools menu (for aggregating many shapefiles into one file; the opposite of the selection / subset process). Geoprocessing for raster layers is available through the GDAL plugin.

File Naming Conventions

You may have noticed that when we've created new layers, we have used underscores instead of spaces when naming files, e.g. pumas_nyc_boundaries.shp. When naming files it's best practice to use underscores instead of spaces and to avoid using any punctuation in file names. This helps to ensure compatibility of data across operating systems and to prevent possible errors when loading or reading data in the software. You should follow the same rules when creating folders to store data. The name of your file should reflect what it contains; you could include the geographic area it covers, the type of feature, and possibly a date or number to indicate different iterations of the data.


Section III: Joining and Mapping Attribute Data

In this section you'll learn how to join an attribute table to a shapefile and map the attributes in that table. Now that the PUMA boundaries are ready, we need to associate them with census data on the age and gender of residents of those PUMAs in order to select the optimal neighborhoods for locating our store.

Steps

  1. Open the data file. Minimize (don't exit) QGIS for the moment. Using your file manager, browse to the data folder for part 3. Look for a file called acs_2007_2009_data.dbf. A dbf is a dBase file used for storing tabular data. This is a stand-alone dbf file that is not associated with a shapefile. Depending on what operating system you're using, open this file with a spreadsheet package like Excel or Calc (if you're in Windows, right click the file, select Open With, and then choose the option to select a program from the list. Choose Excel, hit OK, and open the file).
  2. Examine the data file. The data file contains one row for each PUMA in NYC and several columns of attributes. The first four columns contain identifiers for each PUMA; the column ID2 is a FIPS code that we'll use to join this table to the shapefile. The remaining columns contain data from the American Community Survey. Columns are paired together, with the first representing the data itself and the second representing the margin of error (MOE) for the data. So, for the Riverdale / Kingsbridge PUMA (the first one in the spreadsheet), we're 90% confident (that's the confidence level the ACS uses) that there were 113,178 residents between 2007-2009, plus or minus 4,775. The columns that follow (each with an associated MOE column) are: male population, percent of total population that is male, female population, percent of total population that is female, population aged 18-34, and percentage of total population aged 18-34.
    ACS data
  3. Examine the attribute table of the PUMAs. Close the dbf file, exit your spreadsheet software and maximize QGIS. Select the PUMA layer in the ML, right click and open the attribute table. In the table, note the column labeled PUMA5ID00. It contains the same FIPS code that was stored in the ID2 column in the data table: two digits representing the State of New York (36) followed by five digits representing the PUMA number. Since these columns are the same, we can use them to join the two files. Close the table.
    PUMA attribute table
  4. Join data table to shapefile (new in 1.7). Hit the Add Vector Data button. Hit the Browse button to browse to your part 3 data folder. Select the acs_2007_2009_data.dbf data table and hit Open (if you don't see the file, make sure the Files of Type dropdown menu at the bottom of the window is set to display all files). Hit Open again to add the table to your project. It should appear in the ML. You can select it in the ML and hit the Open Table button to verify that the table displays correctly. If all looks good, close the table, and double click on the pumas_nyc_land layer to open its properties menu. Hit the Joins tab. Hit the green plus button to add a join. The join layer will be the data table acs_2007_2009_data. The Join field in that table is ID2. The Target field in the puma layer is PUMA5ID00. Hit OK. Close the properties menu. Right click on pumas_nyc_land in the ML and open the attribute table. Scroll over to the right, and you'll see all of the layer's attributes and the data that is stored in the dbf file. Close the attribute table.
    Join menu
  5. Work-around (for 1.7). Dynamic joins, where layers can be loosely coupled to data tables in a join, are new in QGIS 1.7. Unfortunately this new feature has a bug; even though the data is joined successfully, we won't be able to classify the data in the table properly in order to symbolize or map it. In order to get this to work we're going to have to create a new shapefile where the data from the table becomes permanently fused to the shapefile. So, select pumas_nyc_land in the ML, right click and choose Save As. Browse to the part 3 data folder and save the layer as pumas_data. Leave the encoding alone but Browse and change the CRS to NAD 83. Hit OK. Once it's been saved hit the Add Vector Data button to add the new pumas_data layer to the project.
    Save As
  6. Reorder the layers. Select the pumas_nyc_land layer in the ML, right click and remove it. Drag the new pumas_data layer to the bottom of the ML. Save your project.
  7. Map the age data. Double click the pumas_data layer in the ML and go to the Style tab. Change the Legend type dropdown from Single symbol to Graduated symbol. Change the Classification field to PER_AGE (percent of population age 18-34). Keep the mode set to Equal Interval. Change the number of classes to four. Hit the Classify button. Select the last class in the Classify box. In the fill option, hit the fill box (which should be dark blue), change the color to dark green, click OK. Then click OK to apply all the symbol changes. You should now have a choropleth (shaded area) map that shows the percentage of the population of each PUMA aged 18-34, classified by equal intervals (divides data into categories that have an equal value range). Save your project.
    Symbolization menu / Shaded area map

Commentary

Census Data

The demographic data used in this exercise comes from the US Census Bureau's American Community Survey (ACS). Each year the Census Bureau publishes annual survey data for all geographic areas in the US that have at least 65,000 people. Since the survey results for areas with smaller populations are often not statistically reliable, the bureau averages data over several years for smaller areas. Data for areas that have at least 20,000 people is averaged for a three year period, and areas with less than 20,000 people down to the census tract level are averaged for a five year period. Each year the bureau releases a new annual data set and updates the two averaged data sets by adding the latest year of data and dropping the oldest one. For our exercise, we are using 3 year average data from 2007-2009. Even though PUMAs have a target population of 100,000 residents, it is better to use the three year data. As you drill down from the general population to figures that describe more specific groups, more data will be available in the three year dataset as the figures for smaller groups will not be reliable in the annual dataset.

The American Community Survey was designed to provide data on a frequent basis and to largely replace the form on the decennial census that collected detailed information about the population. Beginning with the 2010 Census, the decennial census only provides basic demographic indicators of the population such as age, gender, race, and the total number of households and housing units. The decennial census is a count (not a survey) of the population and continues to be useful for making historical comparisons, providing a baseline for creating estimates, and for doing analysis below the census tract level (the decennial census is mandated by law and is used to reapportion seats in the House of Representatives). A third data product, Population Estimates, is published annually and is created using demographic calculations (as opposed to a count or survey) based on births, deaths, and migration. Basic estimates (total population, age, gender, race, and housing units) are published for states, counties, incorporated places, and metropolitan areas.

American Factfinder

All the datasets from the US Census are available for download from the bureau's American Factfinder data portal at http://factfinder.census.gov/. All of the data is free and in the public domain. When you download the data you may have to process it to aggregate certain variables before you can use it. The age data that we are using in this exercise has been preprocessed; when initially downloaded, there was one column for each age cohort for each gender; the appropriate age and gender columns for the 18 to 34 population were combined and the unnecessary columns deleted.

Census data from other countries may be more difficult to obtain, as it may not be free or in the public domain, may not be documented in English, and may not be available in a digital format. You can check the website of the statistical agency for an individual country to see what is available, or you can visit the websites of international organizations like the United Nations or World Bank to obtain basic population data for all countries.

The decision of which census variables to examine in this study was made by consulting psychographic data and market research reports. This data is generated by marketing surveys to determine which groups of people are interested in products or activities relative to other groups based on age, gender, race, occupation, education level, and geographic location. The census data for this exercise was chosen based on statistics from the Market Reporter, a series of psychographic reports published in a database called MRI+. This data is not freely or publicly available; you would have to access it through an organization that subscribes to the database, such as your university library, academic department, or place of work.

Identifiers

The ability to join data tables in a database or a data table to a shapefile is made possible by the use of identifiers, which are codes used to uniquely identify features. If features in two separate data tables share the same identifier, those data tables can be matched or joined together based on that common identifier, allowing you to create new data or to map data in a table.

There are several standard codes for identifying features. In the United States, FIPS (Federal Information Processing Standards) codes are a classification system for identifying all legal, administrative, and statistical areas in the country. For example, FIPS 36061 is the FIPS code for New York County (Manhattan). The first two digits are the code for New York State, while the last three digits are the unique code within New York State for New York County. In an attribute table these codes may appear in separate columns (state, county) or in a single column as one string.

The US government has also created two-letter alpha FIPS codes for each of the world's countries and uses them for international data published by various agencies. However, international data is more commonly coded with ISO codes (ISO 3166) which are available in a two-letter alpha format, a three letter alpha format, and a three-digit numeric format.

Sample Country Codes
Country              FIPS 10-4   ISO 3166 alpha-2   ISO 3166 alpha-3   ISO 3166 numeric
Denmark              DA          DK                 DNK                208
Djibouti             DJ          DJ                 DJI                262
Dominica             DO          DM                 DMA                212
Dominican Republic   DR          DO                 DOM                214

It is generally best practice to store ID codes as text and not as numbers since they don't represent quantities. Storing ID codes as numbers can result in data loss and misidentification. If codes begin with a value of zero and the ID is stored as a number, the zero will be dropped and the code will be incorrect (i.e. imagine you have a file with US ZIP codes and all ZIP codes that begin with zero are truncated).
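
A tiny illustration of the problem, using hypothetical ZIP codes in Python (any language or spreadsheet behaves the same way once a code is treated as a number):

    # Stored as numbers, the leading zeros are already gone.
    zips_as_numbers = [7030, 10013, 2139]
    # Padding them back out to five characters of text repairs the codes,
    # but only if you know how many digits they are supposed to have.
    zips_as_text = [str(z).zfill(5) for z in zips_as_numbers]
    print(zips_as_text)   # ['07030', '10013', '02139']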

In order to join two tables together based on an identifier, you need to be sure that each field is stored in the same data format; if one is stored as text and the other is numeric, the join will fail. Furthermore, you need to ensure that each record is unique because one to many joins are not allowed; if you have a data table that has multiple records for one country, only one of those records will be joined to a shapefile and the others will be dropped. Finally, you should never use place names as identifiers or join fields because there are often many inconsistencies (imagine the number of different ways for spelling or abbreviating country names like the United States or South Korea).

Adding or appending identifiers to tabular data that lack this information is a common data processing task that you'll likely have to perform.

One footnote - technically, FIPS codes have been superseded by ANSI INCITS codes (American National Standards Institute, International Committee for Information Technology Standards). INCITS 38:2009 are the codes for states (formerly FIPS 5-2) and INCITS 31:2009 are for counties (formerly FIPS 6-4). FIPS codes for countries (FIPS 10-4) were also deprecated but not replaced with anything; the assumption is that codes comparable to ISO will eventually be adopted. In the meantime many federal agencies continue to use FIPS 10-4. Despite these recent changes, the term 'FIPS', while no longer correct, is still commonly used to refer to identifier codes created by the US government. The US Census Bureau maintains a list of the ANSI (FIPS) codes on its website: http://www.census.gov/geo/www/ansi/ansi.html

A list of US State ANSI / FIPS codes is available in the appendix of this tutorial.

DBF Files

DBF files are an old data table format that originated with a database system called dBase and was once supported by several other database systems. While many of those systems are no longer widely used, the file format has survived, in part because the dbf file is the component of a shapefile that stores all of the attributes of its features. QGIS is able to take data stored in standalone dbf files and join them to the dbfs affiliated with shapefiles based on a common ID code, using basic relational database techniques (a SQL join statement).
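
As an aside, the same join can be scripted outside of QGIS. The sketch below uses geopandas and pandas (my own substitution, not something the tutorial requires) with the file and field names from Section III; OGR can usually read a standalone .dbf as an attribute-only table, but if that fails in your setup, export the table to CSV and read it with pandas.read_csv instead.

    import geopandas as gpd

    pumas = gpd.read_file("pumas_nyc_land.shp")
    acs = gpd.read_file("acs_2007_2009_data.dbf")   # attribute-only table

    # Make sure both join fields are text so the FIPS codes match exactly.
    pumas["PUMA5ID00"] = pumas["PUMA5ID00"].astype(str)
    acs["ID2"] = acs["ID2"].astype(str)

    joined = pumas.merge(acs.drop(columns="geometry", errors="ignore"),
                         left_on="PUMA5ID00", right_on="ID2", how="left")
    joined.to_file("pumas_data.shp")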

Important things to note about DBFs:
- Field (column) names are limited to 10 characters; longer names are truncated when a table is saved as a dbf.
- Each field has a fixed data type (such as string, integer, real / decimal, or date) and width, which are set when the field is created.
- The dbf that accompanies a shapefile is one of its required component files; don't delete, rename, or edit it outside of your GIS software, or the shapefile may become unusable.

CSV files are an alternative format for getting stand-alone data tables into QGIS. We'll cover these later in the tutorial.


Section IV: Plotting Coordinate Data

In this section you'll learn how to take a text file with coordinate data, plot the data in GIS, and convert it to a shapefile. It's often difficult to find pre-existing shapefiles of buildings, particularly businesses and residences. But you can create your own point layers if you have the coordinates of the places you wish to plot. In this exercise you'll create a layer of comic book stores from a text file that lists each store with its latitude and longitude coordinates.

Steps

  1. Inspect the text file. Go to your data folder for part 3, open the file comicbks_refusa_july2010.txt in a text editor (like Notepad on MS Windows) and examine it. This is a tab-delimited text file with data for comic book stores in NYC; each record represents one store and each attribute column is separated by a tab. Close the file when you're finished.
  2. Activate the delimited text plugin. In QGIS, make sure that this plugin is activated under Plugins > Manage Plugins by checking it off in the list, and that the plugin toolbar is visible by right-clicking an empty area of the toolbar and checking the plugin box.
  3. Launch the delimited text plugin (updated in 1.7). Click the Add Delimited Text button or launch it from the Plugins menu. For the delimited text layer browse to the part 3 data folder and select comicbks_refusa_july2010.txt. Accept the default layer name. Choose the Selected delimiters radio button and the Tab checkbox. Make sure that by XY Fields the X field is Longitude and the Y field is Latitude. Hit OK.
    Add delimited text menu
  4. Convert the plot to a shapefile. Even though the points have been plotted, it isn't a shapefile yet. To convert it, right click on the comics layer in the ML and choose Save As. Save it as an ESRI shapefile in your part 3 data folder and call it comics_nyc. Change the default for the CRS to NAD 83.
  5. Add the new comic layer. Hit the add vector data button and add the new comics_nyc shapefile to your project. Then select the original text file in the ML, right click and remove it. Save your project.
    Stores plotted
  6. View the attribute table. Select the comics_nyc layer in the ML, right click and open the attribute table, to take a look at what's there. You should see all of the data that's affiliated with the comic book stores. Close the table when you're finished.

Commentary

Coordinate Data Sources

The coordinate data for the comic book stores was downloaded from a database called ReferenceUSA and processed so that it was ready for plotting. While government agencies often create and provide geographic data for boundaries and physical features, private features like businesses are usually not captured. These datasets must often be purchased or created from address or coordinate data. ReferenceUSA is not a freely available resource, but it is commonly held by many academic and public libraries. You can search for businesses by name, industrial classification code, and geography and download the data in spreadsheet format; although the number of records you can access in one download is limited. They provide comprehensive business, health care, and residence data for the US and Canada. The inclusion of XY coordinates (longitude and latitude) for each record makes it possible to plot the data in GIS.

Reference USA

There are free, public sources for downloading coordinate data that you can use to create features for natural (lakes, mountain peaks, parks, etc.) and human-made (cities, airports, schools, cemeteries, etc.) features, such as the USGS Geographic Names Information System (for US features) and the NGA's GEOnet Names Server (for international features). If you have batches of addresses, you can look-up the coordinates for these addresses using a geocoding service such as the Geocoding Service at the USC GIS lab.

Delimited Text Files

A text file is a plain document format that is often used for storing and sharing data. Since it is relatively simple and contains no formatting it is cross platform and historically stable. The attributes of each record are separated by a delimiter to indicate different fields. This allows spreadsheet and database programs to parse the text file into columns when you open or import it into that software. Common delimiters include commas, tabs, and pipes. The disadvantage of text files is that the fields are not associated with a specific data type; unlike a DBF file where a field can be designated as a string, integer, real, or other type. When importing text files you need to be careful that columns are designated correctly during the import process; strings inadvertently stored as numbers may have zeros dropped, while numbers inadvertently stored as strings cannot be treated mathematically. Depending on the source of the text files, fields that are intended to be strings may be surrounded by quotes, so that software can recognize and import those fields correctly.
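
If you ever need to do the same plotting step in code rather than through the plugin, a rough equivalent with pandas and geopandas (an assumption; the Longitude and Latitude column names follow the fields described in Section IV) looks like this:

    import pandas as pd
    import geopandas as gpd

    # Read the tab-delimited store list; each column becomes a field.
    stores = pd.read_csv("comicbks_refusa_july2010.txt", sep="\t")

    # Build point geometry from the coordinate columns (NAD 83, EPSG:4269).
    points = gpd.GeoDataFrame(
        stores,
        geometry=gpd.points_from_xy(stores["Longitude"], stores["Latitude"]),
        crs="EPSG:4269",
    )
    points.to_file("comics_nyc.shp")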


Section V: Running Statistics and Querying Attributes

In this section you'll learn to calculate basic statistics for attributes and use some of the advanced query features. Now that all of the data is in place, we can begin to remove neighborhoods that don't meet our site selection criteria. We want to target neighborhoods that don't have a large number of existing stores, have a high percentage of 18 to 34 year olds, and don't have a large imbalance between men and women.

Steps

  1. Examine the age distribution. By looking at the map we can see that, based on the two highest age categories (.273 to .320 and .320 to .366) the largest concentration of persons aged 18 to 34 is in lower and upper Manhattan, western Brooklyn and Queens, and the southern Bronx. But is this division of categories really significant? Open the attribute table for the pumas_data layer. Sort the data by the PER_AGE column. We can see the gap between the data class starting with .273 and the previous class is quite small; the previous class ends with .272 and the value in the next class is .274. Furthermore, if you look at all the values from smallest to largest the distribution looks pretty consistent, with few sizable gaps between values.
  2. Examine the gender distribution. Sort the table by the PER_FEM column. You'll see that the PUMA with the highest concentration of women is approx 57%, and the one with the lowest concentration is approx 47%. Overall, there really isn't a huge imbalance between men and women within each PUMA. Given this fact, for the purpose of our example we won't consider gender in our selection criteria.
  3. Run some basic statistics. Close the attribute table. On the menu bar select Vector > Analysis Tools > Basic Statistics. Choose pumas_data as the input vector layer and the PER_AGE field as the target field. Hit OK. You'll see that the mean percentage is approximately .262 and if you scroll to the bottom, you'll see the median is .256. For the purpose of our example, we'll use 26% as our cut-off; PUMAs where 18 to 34 year olds make up 26% or more of the population will be included, while any with less than that number will be excluded. Close the stats menu.
    Running basic statistics
  4. Count stores by neighborhood. We should exclude PUMAs that already have a large number of comic book stores. On the menu bar go to Vector > Analysis Tools > Points in Polygons. Specify pumas_data as the Input polygon layer and comics_nyc as the Input point layer. Keep the output count field name as PNTCNT. Browse to your part 3 data folder and save the output as pumas_data_count. Hit OK to create the new shapefile. Close the Points in Polygons menu.
    Counting points in polygons
  5. Swap your layers. Select the pumas_data layer in the ML, right click and remove it. Drag the new pumas_data_count layer to the bottom of the ML. Don't worry about symbolizing the new layer.
  6. View the table for the new layer. Select pumas_data_count in the ML, right click and open the attribute table. Scroll the table all the way to the right. You'll see the new PNTCNT field, which shows the number of comic book stores in each PUMA. Click on the PNTCNT column heading to sort the table by that field. You'll see only five PUMAs have more than two comic book stores.
    Comic stores per PUMA
  7. Build an advanced query. Hit the Advanced Search button in the lower right-hand corner of the attribute table menu. In the Fields box scroll down and click PER_AGE. In the Operators box click greater than or equal to (>=). In the SQL where clause box, type in the value .26 (be sure to include the decimal point). Hit the AND button in the Operators box. Double-click the PNTCNT field in the fields box. Hit the less than button (<) in the Operators box. Hit the All button under the Values box to populate it with a list of all PNTCNT values. Double-click on the value 3. In the SQL Where Clause box, your statement should read: PER_AGE >= .26 AND PNTCNT < 3. Hit the Test button to test your statement - you should have 23 features selected as a result. Hit OK. Close the attribute table to view your selections in the map view.
    PUMAs that match our query
  8. Save your selection as a new shapefile. Select pumas_data_count in the map legend (ML). Right click and choose the Save selection as option. Browse to your part 3 folder and save the selection as pumas_selection. Browse and change the CRS to NAD 83. Hit OK to save it. Hit the add vector layer button to add the new pumas_selection layer to your map. Select the old pumas_data_count in the ML, right click and remove it. Drag the new pumas_selection layer to the bottom of the ML. Save your project.
    Saving the selection / New selection layer

Commentary

Selection Criteria

Since the goal of our exercise is to demonstrate the capabilities and possible uses of GIS, we're not adhering to really strict criteria in our site selection process; the example is merely illustrative. Is a cut-off of 26% of total residents aged 18 to 34 reasonable? It really depends on your goals, and whether you would prefer to have a focused, narrow selection of places or a more expansive one. Does it make sense to omit a PUMA that is only a fraction of a percentage point below 26%? These are the kinds of decisions you'll have to make for each project you do. You may decide that a line has to be drawn somewhere and that's it, or you may wish to allow an exception within a few decimal places or to round your numbers. You also could decide to make a qualitative decision - based on what you know about the neighborhood that's near the dividing line, should you include it or exclude it?

You have a few tools at your disposal for making these decisions; the basic statistics for determining mean, median, range, and standard deviation to establish a baseline are helpful. The data classification tools for symbolizing your data based on quantiles or equal intervals can also aid your decision (we'll discuss these later on). Regardless of what you do, look at the attribute table and make sure to examine and understand your data. You can easily copy the data from the attribute table using the copy to clipboard button and paste it into a spreadsheet, where you can create a scatter plot of the distribution to visualize where gaps in the data are. This can also aid you in classifying it (the natural breaks method; we'll discuss this later as well). It also helps to become familiar with the places you are studying, so you can draw on your more qualitative experiences to make decisions and perform a "reality check" on your observations.
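
As one example of that kind of check, you can pull the attribute of interest into a quick summary outside of QGIS. This sketch assumes geopandas and the pumas_data layer and PER_AGE field from this exercise:

    import geopandas as gpd

    pumas = gpd.read_file("pumas_data.shp")
    print(pumas["PER_AGE"].describe())              # count, mean, std, quartiles
    print(pumas["PER_AGE"].sort_values().values)    # scan for gaps between values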

Some Basic SQL

The advanced selection menu under the attribute table allows you to build complex queries for selecting features. QGIS, and most GIS packages, use the Structured Query Language (SQL) that's used when working with databases. Some tips:
- Text values must be enclosed in single quotes; numeric values are not.
- Combine conditions with AND, OR, and NOT, and use parentheses to control the order in which conditions are evaluated.
- Use the LIKE operator with the % wildcard to match part of a text string.
- Use the Test button to check how many features your statement selects before you apply it.
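
For comparison, here is a loose scripted version of the Section V workflow - counting stores per PUMA and applying the same where-clause logic - using geopandas (an assumption; the predicate argument to sjoin requires a reasonably recent version of the library):

    import geopandas as gpd

    pumas = gpd.read_file("pumas_data.shp")
    stores = gpd.read_file("comics_nyc.shp")

    # Points in Polygons: count the stores that fall within each PUMA.
    joined = gpd.sjoin(stores, pumas, predicate="within")
    counts = joined.groupby("index_right").size()
    pumas["PNTCNT"] = counts.reindex(pumas.index, fill_value=0)

    # Equivalent of: PER_AGE >= .26 AND PNTCNT < 3
    selected = pumas[(pumas["PER_AGE"] >= 0.26) & (pumas["PNTCNT"] < 3)]
    selected.to_file("pumas_selection.shp")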


Section VI: Drawing Buffers and Making Selections

One of the primary strengths of GIS is the ability to layer different features and to combine or extract information to create new features. In this section you'll learn how to create buffers around features and to deduct areas from selections. For our example, since young people tend to congregate around universities we'll identify these zones and remove areas from our neighborhood selection that are not near schools.

Steps

  1. Activate the colleges layer. Hit the check box beside the nyc_4yr_colleges layer to turn it on. If it is similar in color to the comics layer, right click on one of the layers in the ML, open its properties, and change the fill in the style tab so you can clearly tell them apart (in our example, the comic stores are light yellow and the colleges are dark blue).
    Comic stores and colleges
  2. Create buffers. On the menu bar, go to Vector > Geoprocessing tools > Buffers. Specify the college layer, nyc_4yr_colleges as the input vector layer. For the buffer distance, type .01 (this is in degrees and represents approx 1/2 mile; see commentary below for explanation). Check the box that says Dissolve buffer results. Hit the browse button to save the new shapefile in your part 3 folder as colleges_buffer. Hit OK. Click Yes to add the new layer. Close the buffer menu. Drag the buffer layer just below the colleges layer in the ML. Explore the map; you'll see a circular zone in a 1/2 mile radius around each college. The boundaries between each buffer zone are merged where zones intersect (as a result of checking the dissolve results box).
    Buffer menu / Map with buffers
  3. Isolate areas within buffers and pumas. On the menu bar, go to Vector > Geoprocessing tools > Intersect. Choose pumas_selection as the input vector layer. Choose colleges_buffer as the intersect layer. Browse and save the new result to your part 3 data folder as selected_areas. Hit OK. Close the Intersect menu. The new selected_areas layer shows you the areas to consider targeting: areas within a half mile of a college or university that are within PUMAS where the 18-34 age group represents 26% or more of the total population and there are less than three existing comic book stores.
    Intersect menu / Buffers and selected areas
  4. Clean up your map. Uncheck the colleges_buffer and pumas_selection layers in the ML to turn them off. Drag the selected_areas layer so that it is directly above the nyc_counties layer. Check the boxes beside the nyc_facilities and nyc_greenspace layers to turn them back on. We could refine our analysis a bit more by subtracting the green space and facilities areas that intersect our areas of interest, since we couldn't build a store on this land. For now, overlaying these land uses on top of our areas of interest should suffice. Select the selected_areas layer in the ML, then hit the Zoom to layer button so our areas of interest are maximized within the map window.
    Final map layout
  5. Identify areas. Through the selection process, the attributes of our previous layers have been preserved in our new layers. Select the selected_areas layer in the ML. Use the identify button and click on one of the areas. You'll see the attributes from our earlier PUMA layer. While the identifying information, like the name of the neighborhood, is useful, many of the other attributes are now incorrect. The population figures represent the entire PUMA and not the small subset we've selected. If we were going to save these layers for future analysis or projects, we would want to delete the attributes that are no longer necessary. Save your project.

Commentary

Buffers and Distance Measurement

Since the coordinate system of our layers is NAD 83 and it uses degrees of latitude and longitude, we have to specify our buffer distances in degrees. This is awkward for a number of reasons; it's much easier to conceive of how large a kilometer or mile is than how large a degree is. A thornier issue is that the length of a degree isn't constant - the distance between degrees of longitude decreases as we move from the equator to the poles. The distance between degrees of latitude is relatively consistent, but is also not equal to a degree of longitude, which requires us (or software) to make complex calculations to transform degrees into simple distance measurements. Here are some ways to get around this problem when creating buffers:
- Reproject the layer (or save a copy of it) into a projected coordinate system that uses feet or meters, such as a State Plane or UTM zone, and specify the buffer distance in those units (we'll discuss and work with projections later in this tutorial).
- Approximate the conversion yourself: a degree of latitude is roughly 69 miles, and at New York City's latitude a degree of longitude is roughly 53 miles, so the .01 degree value used above works out to roughly half a mile east to west (and a bit more north to south).
- Use the measure tool to check how far a given number of degrees actually spans across your study area before settling on a buffer distance.
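
The first work-around can also be scripted. The sketch below assumes geopandas and EPSG:2263 (NAD 83 / New York Long Island, in US feet) as the local projected system; it buffers the colleges by 2,640 feet (half a mile), dissolves the result, and converts back to NAD 83 degrees.

    import geopandas as gpd

    colleges = gpd.read_file("nyc_4yr_colleges.shp")

    colleges_ft = colleges.to_crs(epsg=2263)   # distance units are now US feet
    buffers = colleges_ft.buffer(2640)         # half a mile around each college

    # Dissolve the overlapping circles into one feature, then return to NAD 83.
    dissolved = gpd.GeoDataFrame(geometry=[buffers.unary_union],
                                 crs=colleges_ft.crs).to_crs(epsg=4269)
    dissolved.to_file("colleges_buffer.shp")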

In our example we chose to dissolve the boundaries of the buffers where they intersected because we were interested in the total area within a half mile of any college. The resulting shapefile consisted of a single feature - the entire buffer. What if we wanted to preserve the individual boundaries of each buffer? We would leave that Dissolve box unchecked. The resulting shapefile would consist of several features, one buffer for each school, AND each feature would take the attributes of the school it surrounds (i.e. the school's ID codes, name, address, etc).

File Management

As we've moved through this exercise, we've created many shapefiles along the way; every time we made a selection or performed a geoprocessing function we ended up with a new file. There are two things we should note here.

First, this can get pretty confusing. With each new file you create, it's easy to lose track of what each one represents. You can mitigate this by giving your files names that clearly indicate what they are. Documenting your progress in a logbook, whether it's on paper or in a simple text file, can help you keep things straight. You may also decide to delete files that were created during the middle of the process. This is fine as long as you think you won't need to go back and re-do a step, either because the parameters of your project have changed or you've spotted an error.

Second, QGIS has recently made some improvements so it's not always necessary to create a new file with every single processing step. Some menus will give you the option to select features or perform operations on features that are ALREADY selected. This allows you to work with just the features you need from one layer to create a new one, skipping the interim step of creating a new shapefile of just the features you want to work with.

Site Selection

Site selection theories and land use analysis can be traced back to the early 19th century with the introduction of Von Thunen's land rent gradient. Subsequent work that included Weber's median location, Hotelling's competitive location problem, Christaller's Central Place Theory, and Tobler's Laws of Geography have provided a framework for the science (and art) of optimal site selection. Optimal site selection is studied within the fields of geography, location science, and operations management, and has expanded with the introduction and evolution of GIS. The three laws of location science, as summarized by Church and Murray (Business site selection, location analysis, and GIS, 2009) are:
- Some locations are better than others for a given purpose.
- Spatial context can alter site efficiencies.
- Sites of an optimal multisite pattern must be selected simultaneously rather than independently, one at a time.

It's also important to understand the unique spatial patterns of each type of business or industry; a phenomenon that economic and urban geographers have been studying for many decades. Products or services classified as low order goods tend to be located in most environments, and there will be more of these businesses in places with higher population densities. High order goods tend to require a higher population density and will be present in fewer locations. For example, businesses like gas stations, dry cleaners, and family doctor's offices will be located in most areas, while office towers, specialty retail, and major hospitals will be located in fewer places, spaced further apart. Businesses like gas stations and convenience stores tend to cluster around major transportation intersections, while car dealerships and hotels tend to cluster around each other in districts. Movie theaters and large shopping malls on the other hand tend not to cluster together; they are spaced apart to serve different populations.

We worked with comic book stores in our exercise as they are a good example of a specialized, moderately high order business. They serve a very specific demographic group and as a sub-industry they have not been co-opted by larger retailers or by the Internet (at least not yet). Since there are approximately fifty in New York City our example was manageable. The use of PUMAs instead of census tracts was feasible; as a higher order business comic book stores attract customers from a wider geographic area relative to lower order businesses.

The location of non-retail or non-service industries is also distinct. Manufacturing industries often depend on the availability of raw materials and inputs and the distance for finished products to reach transportation and markets, while hi-tech industries tend to locate near pools of highly educated labor. Agricultural uses often appear where other land uses are not present and where land is inexpensive. The types of crops or livestock they produce will vary based on environmental factors like climate or soil.

The bottom line: if you are going to conduct a site selection analysis, you must understand the context: study the industry or business you are interested in, do some market research, make sure you're familiar with the geographic environment you're working with, and choose your geographic units of analysis and indicators carefully.


Section VII: Screen Captures

In this brief section you'll learn how to create a screen shot of your map that you can easily share with others. You'll learn how to make a presentation quality map in the next part.

Steps

  1. Zoom to layer. With the selected_areas layer selected in the ML, hit the zoom to layer button. Use the hand tool to center the map view. If you want to be fancier, you could activate some plugins under Plugins > Manage Plugins to add a north arrow, scale bar, and copyright info to the screen, and use the Text Annotation button to add a title.
  2. Save the map view screen. On the menu bar, go to File > Save as Image. Browse to your data folder for part 3 and save the image there as map_screen. Change the Files of Type dropdown to PNG file. Click Save.
  3. View your map. Save your project and then close QGIS. Navigate to your data folder for part 3. Look for the file map_screen.png. Double-click it to open the file in your computer's default photo viewing program, and you'll see your map view. This is a quick way to save and share your map content. This is a simple, static image file that is not connected to your project or data files. You can easily email or text this file to anyone.

Commentary

Considerations and Next Steps

Based on our results, what would you do next? How would you decide where to locate the store? What else would you investigate? Is there anything that we've done in this exercise that you would do differently, if you had to conduct an analysis like this for an actual project?

For more practice, some things to try:
