# Overview
This project consists of a number of SQL scripts and other supporting files for importing recent American Community Survey (ACS) releases from the US Census Bureau. (Support for importing the Decennial Census will hopefully be added soon.) To avoid having to produce copious amounts of documentation, the naming conventions adhere very closely to the Census data dictionary. This means that fields are named, e.g., `b07401037` rather than `moved_within_same_county_aged_20_to_24`. The data are also maintained in tables mirroring the Census "sequences", making it easy for data managers to drop or load data in the same structures the Census uses for distribution. The data manager who wants more memorable names is advised to create views aliasing commonly used columns.
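For instance, a minimal sketch of such an aliasing view (the view name and the sequence table name `seq0017` are placeholders I have made up; use whichever sequence table actually holds Census table B07401):

```sql
-- Hypothetical view giving a memorable alias to a Census column code.
CREATE VIEW movers_by_age AS
SELECT geoid,
       b07401037 AS moved_within_same_county_aged_20_to_24
FROM seq0017;  -- placeholder: the sequence table containing B07401
```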
There are two types of scripts:
1. Scripts which create the necessary data schema, bulk load the data to import tables, and coerce it to final form;
2. Meta-scripts, i.e. scripts which create functions that generate the schema creation and data import scripts.
The data manager has the choice of running the data import scripts directly or running the meta-scripts. The data import scripts are themselves created programmatically by the meta-scripts. The meta-scripts can be run either to (a) return the desired SQL statements, to inspect or execute later, or (b) execute the SQL directly, creating the tables and views and loading and transforming the desired data. The first option (running scripts specific to each Census product) is probably easier to understand conceptually. The second option is more powerful, as it gives the data manager more control over the import process, and in practice will be easier to use once you understand what the scripts are doing.
The data definition and data manipulation scripts appear in the folders named for specific Census products (e.g. 2010_5yr). The meta-scripts appear in the meta-scripts folder. My desire is to eventually make the schema creation and data import scripts (the first kind of script) conform to the SQL standard, so that they could be used in SQL implementations other than Postgres. As the meta-scripts (the second kind of script) rely upon PL/pgSQL, I doubt they could be made implementation-agnostic; I would be more likely to rewrite them in Python or another language.
There are two ways in which I depart from the Census data formats. I have created the `sumlevel` field as an integer, even though text is arguably more appropriate (the summary level is a three-digit field that is often written with leading zeroes). I have also added the `geoid` field, already present in the `geoheader` table, to the sequence tables. Primarily intended by the Census for spatial data support, this field allows a single-key join between `geoheader` and the sequence tables (otherwise the join requires both the state code and an integer identifier that is only unique within states).
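As a hedged illustration (the sequence table name `seq0002` and the column names other than `geoid` are assumptions based on the Census layouts, not taken from the scripts), the single-key join looks like this:

```sql
-- With geoid in both tables, geoheader joins to a sequence table on one key.
SELECT g.name, s.b01001001   -- b01001001: an example estimate column
FROM geoheader g
JOIN seq0002 s USING (geoid);

-- Without geoid, the join would need two keys, roughly:
--   JOIN seq0002 s ON g.stusab = s.stusab AND g.logrecno = s.logrecno
```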
A number of optional features are described at the end of this file.
# Getting the Data
# Running the Data Scripts
**The following instructions indicate how to import the Census data using the data import scripts. If you would prefer to use the meta-scripts, you do not need to read this. Instead, look at the README in the meta-scripts folder.**
The data scripts should be run in a specific order. They will generate storage tables, staging (temp) tables, and views which mirror Census "subject tables", as well as perform the actual data import.
*Table and view names in these scripts are **not** schema-qualified,* allowing the data manager to choose their own schema name. As mentioned above, I use a schema name based on folder names from the Census Bureau FTP server. Assuming you do the same, each script needs to be prepended with
```sql
SET search_path = acs2010_5yr, public; --or other appropriate schema name
```
If you want to avoid altering each script, `SET search_path` once in psql, then `\i` each script file.
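For example, a possible psql session under that approach (script names are taken from the lists below; run them in the order described in the following sections):

```sql
SET search_path = acs2010_5yr, public;
\i create_geoheader.sql
\i create_tmp_geoheader.sql
\i create_import_tables.sql
```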
These scripts may be run in any order.
* create_tmp_geoheader.sql (contains `DROP … IF EXISTS` statement)
* drop_import_tables.sql
* create_import_tables.sql
* drop_import_moe.sql
These scripts use forward slashes to represent filesystem separators. Testing on Windows Vista indicates that forward slashes will be interpreted correctly. Backslashes, if used, are treated as escape characters and would need to be doubled.
These scripts contain a filesystem placeholder "\<census_upload_root\>". This placeholder should be updated to reflect your filesystem. This folder should have a child matching the Census product name, e.g. `acs2010_5yr`. The `acs2010_5yr` folder should have two children. As mentioned above, the files downloaded from Census should be in two sibling directories named `All_Geographies_Not_Tracts_Block_Groups` and `Tracts_Block_Groups_Only`.
The geoheader files use a fixed-length format, and are therefore imported to a table with a single column. This column is then parsed for insertion into the final `geoheader` table. The geoheader files contain *all* geographies, regardless of whether they are downloaded with the larger or smaller (tracts and block groups only) datasets. These scripts assume the existence of the `All_Geographies_Not_Tracts_Block_Groups` folder. If you have only downloaded the tracts and block groups, you will have to modify the script or create the expected folder and move the geography files (`g20105xx.txt`).
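Purely as an illustration (the staging table name `tmp_geoheader`, the example path, and the choice of state file are my assumptions), a geoheader COPY statement ends up looking roughly like this once `<census_upload_root>` has been replaced:

```sql
-- Load Alabama's fixed-width geography file into the single-column staging table.
COPY tmp_geoheader
FROM '/data/census/acs2010_5yr/All_Geographies_Not_Tracts_Block_Groups/g20105al.txt';
```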
These scripts may be run in any order.
## Create Data Store
Researchers will typically interact with the data via a "subject table", a collection of related data. Often a subject table will break down the population into categories (e.g. age and sex) and include summary columns (e.g. total population, male population, female population). The data are stored by sequences, so subject tables are constructed as views.
### Create Geoheader
The `geoheader` table is the glue that holds the ACS together. It contains names of the geographic units (e.g. state or county names), as well as the unit's path in the Census geographic hierarchy (e.g. State → County → Census Tract).
* create_geoheader.sql
5. view_estimate_stored_by_tables.sql
6. view_moe_stored_by_tables.sql
# Optional Features
The following features are not necessary to use the project, but may be of interest to some data managers.
## PostGIS Support
Demographic data is inherently spatial, and many ACS data users will want to map the data. If you have installed the PostGIS extension, you may be interested in the scripts included in the `postgis_support` folder. The modifications are minor, but useful. Basically, a geometry column and a unique integer identifier are added to the `geoheader` table. Spatial data from any source may be loaded into this table, but I assume use of TIGER/Line data from USCB. If you use data from a different source, the SRID of the geometry column (4269 for USCB sources) may need to be changed.
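In outline, and only as a sketch (the column names and geometry type are assumptions; see the scripts in `postgis_support` for the actual definitions), the modification amounts to something like:

```sql
-- Add a unique integer identifier and a NAD83 (SRID 4269) geometry column,
-- matching the reference system of USCB TIGER/Line data.
ALTER TABLE geoheader ADD COLUMN gid serial;
SELECT AddGeometryColumn('geoheader', 'geom', 4269, 'MULTIPOLYGON', 2);
```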
Note that the standard Geographic Information System paradigm is to view a spatial "layer" as a collection of entities at the same scale (e.g., only counties), while the ACS mashes all geographies, from the nation down to block groups, into one table, with a `sumlevel` ("summary level") column to separate them out. Since most GISes will not expect to work with data organized this way, the data manager intending to support spatial analysis might also want to implement partitioning (next section).
## Partitioning
**Currently under development**
It is expected that analysts will usually be working with data within a geographic scale (e.g. running a regression on county-level data), and often working within a defined region. Scale is identified by two columns, `sumlevel` and `component`. `sumlevel` represents the level in the geographic hierarchy from Nation to State to County, etc. (as well as many others such as Tribal Tracts, ZIP Code Tabulation Areas or ZCTAs, Metropolitan and Micropolitan Areas, etc.), and `component` represents a subset of the population in that geography, e.g. the urban-only or rural-only population. (The code `00` represents the total population, and is often the component of interest. Codes other than `00` are only defined for county-level geographies and larger.)
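Until partitioning is in place, an analyst working at a single scale filters on these two columns directly. A hedged example (050 is the standard Census summary level code for State-County; adjust the `component` literal if that column is stored as an integer):

```sql
-- County-level rows for the total population component.
SELECT geoid, name
FROM geoheader
WHERE sumlevel = 50       -- summary level 050 = State-County
  AND component = '00';   -- 00 = total population
```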
Rather than require the analyst to constantly filter based on scale, the data may be partitioned by summary level and component. Since I anticipate that the analyst will rarely require data from multiple scales in the same query (and when desired this can easily be accomplished by a UNION query), and in order to avoid GIS users adding a layer with states, counties, tracts, ZCTAs, etc. all layered on top of each other, this is *not* implemented using inheritance. Partitioning this way will also save storage space, as many sequences contain *no* data for specific summary levels, but rows representing those geographies nonetheless appear in the sequence files with all NULLs after the identifying columns. Therefore when creating these partitions, the script will check for all NULL rows and not add those rows to the partition.
The tables may be partitioned at a variety of geographic scales, but since the data are distributed by state, partitioning is implemented by state. This is standard partitioning (i.e. *with* inheritance), since the analyst might at different points want to query a specific state or all states in the country. This will also aid use in GIS, as the analyst can easily add one state or a handful of neighboring states by table, or add the entire country or construct a more complex geographic query (e.g., by metropolitan area) that crosses state boundaries.
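A minimal sketch of one such state partition, assuming a sequence table named `seq0002` and a state abbreviation column named `stusab` (both names are assumptions; the generated scripts may differ):

```sql
-- Child table holding only Rhode Island rows; querying the parent seq0002
-- still covers the whole country, since the child inherits from it.
CREATE TABLE seq0002_ri (
    CHECK (stusab = 'ri')
) INHERITS (seq0002);
```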
## Alternative Storage Formats
I am experimenting with two alternative storage formats which combine all the sequences into one table. Because of Postgres' limit of 1600 columns in a table, these formats use array columns (one per sequence) or a single hstore column to store the data. In testing, the array column approach breaks down as the row size gets past 20 or so sequences (even with TOAST tables helping out), and the hstore approach was *extremely* slow for data loading. Experiment at your own risk, and let me know what you discover.
If following the one-big-table approach, the geoheader columns appear as the first columns in these tables (either array column or hstore). Nonetheless, the CREATE TABLE scripts rely upon the existence of `geoheader` (using the `LIKE` keyword), so `create_geoheader.sql` should be run in any event.
### Create Array Column-based Data Store
At the moment, the array column table is named `by_arrays`. When this project moves past the experimental phase, a less silly name will be chosen. First, create the table. This table will have no rows until `parse_tmp_geoheader.sql` is run, modified to `INSERT INTO by_arrays`. Then the sequences can be inserted, using UPDATE to match the sequence data with existing geoheader data. Margin of error data is not yet handled.
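A rough sketch of that UPDATE step (every table and column name here is an assumption based on the description above, not taken from the actual scripts):

```sql
-- Attach one sequence's estimates to the array-column table by matching
-- staging rows to the geoheader rows already loaded into by_arrays.
UPDATE by_arrays AS a
SET seq0002 = ARRAY[t.b01001001, t.b01001002, t.b01001003]
FROM tmp_seq0002 AS t
WHERE a.geoid = t.geoid;
```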
# Future Plans
The array column-based and hstore-based storage methods are half-baked, but so far don't seem promising. I still want to test some typical extracts and calculations against the different storage methods, to see which one performs best.
Partitioning is currently being implemented.
The data are most interesting when mapped. The basic requirements of PostGIS support have been added. A useful feature would be to completely automate downloading the spatial data from the Census, loading it into Postgres, and joining it permanently to the demographic data.
For comments, or if you are interested in assisting, please feel free to contact me at Lee.Hachadoorian@gmail.com