---
title: "Table of Contents"
output:
  github_document:
    toc: true
    toc_depth: 3
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "docs/figures/",
  out.width = "100%"
)
```
> :warning: **NOTE** :warning:
>
> The [condominium model](https://github.com/ccao-data/model-condo-avm) (this repo) is nearly identical to the [residential (single/multi-family) model](https://github.com/ccao-data/model-res-avm), with a few [key differences](#differences-compared-to-the-residential-model). Please read the documentation for the [residential model](https://github.com/ccao-data/model-res-avm) first.
# Prior Models
This repository contains code, data, and documentation for the Cook County Assessor's condominium reassessment model. Information about prior year models can be found at the following links:
| Year(s) | Triad(s) | Method | Language / Framework | Link |
|---------|----------|---------------------------------------------|----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|
| 2015 | City | N/A | SPSS | [Link](https://gitlab.com/ccao-data-science---modeling/ccao_sf_cama_dev/-/tree/master/code.legacy/2015%20City%20Tri/2015%20Condo%20Models) |
| 2018 | City | N/A | N/A | Not available. Values provided by vendor |
| 2019 | North | Linear regression or GBM model per township | R (Base) | [Link](https://gitlab.com/ccao-data-science---modeling/ccao_sf_cama_dev) |
| 2020 | South | Linear regression or GBM model per township | R (Base) | [Link](https://gitlab.com/ccao-data-science---modeling/ccao_sf_cama_dev) |
| 2021 | City | County-wide LightGBM model | R (Tidyverse / Tidymodels) | [Link](https://github.com/ccao-data/model-condo-avm/tree/2021-assessment-year) |
| 2022 | North | County-wide LightGBM model | R (Tidyverse / Tidymodels) | [Link](https://github.com/ccao-data/model-condo-avm/tree/2022-assessment-year) |
| 2023 | South | County-wide LightGBM model | R (Tidyverse / Tidymodels) | [Link](https://github.com/ccao-data/model-condo-avm/tree/2023-assessment-year) |
| 2024 | City | County-wide LightGBM model | R (Tidyverse / Tidymodels) | [Link](https://github.com/ccao-data/model-condo-avm/tree/2024-assessment-year) |
# Model Overview
The duty of the Cook County Assessor's Office is to value property in a fair, accurate, and transparent way. The Assessor is committed to transparency throughout the assessment process. As such, this document contains:
* [A description of the differences between the residential model and this (condominium) model](#differences-compared-to-the-residential-model)
* [An outline of ongoing issues specific to condominium assessments](#ongoing-issues)
The repository itself contains the [code](./pipeline) for the Automated Valuation Model (AVM) used to generate initial assessed values for all condominium properties in Cook County. This system is effectively an advanced machine learning model (hereafter referred to as "the model"). It uses previous sales to generate estimated sale values (assessments) for all properties.
## Differences Compared to the Residential Model
The Cook County Assessor's Office has started to track a limited number of characteristics (building-level square footage, unit-level square footage, bedrooms, and bathrooms) for condominiums, but the data we have ***varies in both the characteristics available and their completeness*** between triads.
Like most assessors nationwide, our office staff cannot enter buildings to observe property characteristics. For condos, this means we cannot observe amenities, quality, or any other interior characteristics, which must instead be gathered from listings and a number of additional third-party sources.
The only _complete_ information our office currently has about individual condominium units is their age, location, sale date/price, and percentage of ownership. This makes modeling condos particularly challenging, as the number of usable features is quite small. Fortunately, condos have two qualities which make modeling a bit easier:
1. Condos are more homogeneous than single/multi-family properties, i.e. the range of potential condo sale prices is much narrower.
2. Condos are pre-grouped into clusters of like units (buildings), and units within the same building usually have similar sale prices.
### Rolling Average Sale Price Feature
We leverage the qualities above to produce a leave-one-out, time-weighted, rolling average building-level sale price for each sale. In layman's terms, we take the average of the sales in the same building from the prior 5 years, _excluding the current sale_. Here's what the rolling windows look like for sales in the training data, where the last row represents the actual assessment scenario (where the "sale" occurs on the lien date):

This feature is similar to a spatial lag model. It captures the spatial relationship and price dependence of units within the same building, effectively giving the primary model a hint that "these units are related and should have a similar price." Intuitively, this makes sense: if a unit in a building sells for $500K, it's likely that future sales in the same building will be around the same price (or at least trend together).
Note, however, that because this feature is a _building-level_ average, it doesn't account for unit-level differences. A unit on the top floor with a view of the lake will get (roughly) the same building-level average as a unit on the ground floor with no view. As such, we still need a main predictive model to determine unit-level values, and the rolling average feature is used as a predictor in that model.
Some additional technical notes on this feature:
- The time weights used to weight sales are _global_, rather than building-specific. They follow a simple logistic curve centered 3 years before the most recent sale, i.e. sales close to the lien date are weighted most heavily.
- The feature is calculated using sales from the _entire range_ of the training data. This means that the test set version of the feature has seen training data sales from the same building, but not the sale being predicted. We contend that this is not data leakage, as it mirrors the real-world production scenario where the model sees all sales in the building from the five years prior to the lien date.
- We contend that this feature is _not_ sales chasing (in the IAAO sense) because it excludes the _current_ sale from the average. This means that the model is not using the current sale to predict itself, but rather using the _prior_ sales to predict the current sale.
- The average excludes outlier sales and sales of non-livable units.
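The notes above can be sketched in base R. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the column names, the function names `time_weight()` and `loo_rolling_mean()`, and the logistic steepness `k` are all hypothetical, and outlier and non-livable filtering are omitted.

```r
# Hypothetical sales for one building; column names are illustrative only
sales <- data.frame(
  pin10 = "1234567890",
  sale_date = as.Date(
    c("2019-06-01", "2021-03-15", "2022-09-30", "2023-11-01")
  ),
  sale_price = c(200000, 220000, 250000, 260000)
)

# Global logistic time weight centered 3 years before the most recent sale,
# so recent sales get weights near 1. The steepness `k` is an assumed value.
time_weight <- function(sale_date, latest_date, center_years = 3, k = 2) {
  years_before_latest <- as.numeric(latest_date - sale_date) / 365.25
  1 / (1 + exp(k * (years_before_latest - center_years)))
}

# Leave-one-out, time-weighted mean of prior sales in the same building
loo_rolling_mean <- function(df, window_years = 5) {
  w <- time_weight(df$sale_date, max(df$sale_date))
  vapply(seq_len(nrow(df)), function(i) {
    age_days <- as.numeric(df$sale_date[i] - df$sale_date)
    in_window <- age_days >= 0 & age_days <= window_years * 365.25
    in_window[i] <- FALSE # exclude the current sale
    in_window <- in_window & df$pin10 == df$pin10[i]
    if (!any(in_window)) return(NA_real_)
    weighted.mean(df$sale_price[in_window], w[in_window])
  }, numeric(1))
}

sales$bldg_roll_mean <- loo_rolling_mean(sales)
```

In this sketch the earliest sale has no prior sales in its window, so its feature value is `NA`; the pipeline handles such cases separately (see the sections on buildings with few or no sales).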
### Features Used
Because our individual condo unit characteristics are sparse and incomplete, we must rely primarily on aggregate geospatial features, economic features, and time of sale to determine condo assessed values. The features in the table below are the ones used in the most recent assessment model.
```{r features_used, message=FALSE, echo=FALSE}
library(dplyr)
library(glue)
library(jsonlite)
library(purrr)
library(readr)
library(tidyr)
library(yaml)
condo_params <- read_yaml("params.yaml")
condo_preds <- as_tibble(condo_params$model$predictor$all)
# Some values are derived in the model itself, so they are not documented
# in the dbt DAG and need to be documented here
# nolint start
hardcoded_descriptions <- tribble(
  ~"column", ~"description",
  "sale_year", "Sale year calculated as the number of years since 0 B.C.E",
  "sale_day",
  "Sale day calculated as the number of days since January 1st, 1997",
  "sale_quarter_of_year", "Character encoding of quarter of year (Q1 - Q4)",
  "sale_month_of_year", "Character encoding of month of year (Jan - Dec)",
  "sale_day_of_year", "Numeric encoding of day of year (1 - 365)",
  "sale_day_of_month", "Numeric encoding of day of month (1 - 31)",
  "sale_day_of_week", "Numeric encoding of day of week (1 - 7)",
  "sale_post_covid", "Indicator for whether sale occurred after COVID-19 was widely publicized (around March 15, 2020)",
  "pin10_bldg_roll_mean", "Time-weighted five-year rolling average sale price for all units in the PIN10 (excluding the target sale)",
  "pin10_bldg_roll_pct_sold", "Ratio of sale count in the PIN10 over the past five years to PIN10 unit count (excluding the target sale)"
)
# nolint end
# Load the dbt DAG from our prod docs site
dbt_manifest <- fromJSON(
  "https://ccao-data.github.io/data-architecture/manifest.json"
)
# nolint start: cyclomp_linter
get_column_description <- function(colname, dag_nodes, hardcoded_descriptions) {
  # Retrieve the description for a column `colname` either from a set of
  # dbt DAG nodes (`dag_nodes`) or a set of hardcoded descriptions
  # (`hardcoded_descriptions`). Column descriptions that come from dbt DAG nodes
  # will be truncated starting from the first period to reflect the fact that
  # we use periods in our dbt documentation to separate high-level column
  # summaries from their detailed notes
  #
  # Prefer the hardcoded descriptions, if they exist
  if (colname %in% hardcoded_descriptions$column) {
    return(
      hardcoded_descriptions[
        match(colname, hardcoded_descriptions$column),
      ]$description
    )
  }
  # If no hardcoded description exists, fall back to checking the dbt DAG
  for (node_name in ls(dag_nodes)) {
    node <- dag_nodes[[node_name]]
    for (column_name in ls(node$columns)) {
      if (column_name == colname) {
        description <- node$columns[[column_name]]$description
        if (!is.null(description) && trimws(description) != "") {
          # Strip everything after the first period, since we use the first
          # period as a delimiter separating a column's high-level summary from
          # its detailed notes in our dbt docs
          summary_description <- strsplit(description, ".", fixed = TRUE)[[1]][1]
          return(gsub("\n", " ", summary_description))
        }
      }
    }
  }
  # No match in either the hardcoded descriptions or the dbt DAG, so fall
  # back to an empty string
  return("")
}
# nolint end
# Make a vector of column descriptions that we can add to the param tibble
# as a new column
param_notes <- condo_preds$value %>%
  ccao::vars_rename(names_from = "model", names_to = "athena") %>%
  map(~ get_column_description(
    .x, dbt_manifest$nodes, hardcoded_descriptions
  )) %>%
  unlist()
res_params <- read_yaml(
  "https://raw.githubusercontent.com/ccao-data/model-res-avm/master/params.yaml"
)
res_preds <- res_params$model$predictor$all
condo_unique_preds <- setdiff(condo_preds$value, res_preds)
condo_preds_fmt <- condo_preds %>%
  mutate(description = param_notes) %>%
  left_join(
    ccao::vars_dict,
    by = c("value" = "var_name_model")
  ) %>%
  distinct(
    feature_name = var_name_pretty,
    variable_name = value,
    description,
    category = var_type,
    type = var_data_type
  ) %>%
  mutate(
    category = recode(
      category,
      char = "Characteristic", acs5 = "ACS5", loc = "Location",
      prox = "Proximity", ind = "Indicator", time = "Time",
      meta = "Meta", other = "Other", ccao = "Other", shp = "Parcel Shape"
    ),
    feature_name = recode(
      feature_name,
      "Tieback Proration Rate" = "Condominium % Ownership",
      "Year Built" = "Condominium Building Year Built"
    ),
    unique_to_condo_model = ifelse(
      variable_name %in% condo_unique_preds |
        feature_name %in%
          c("Condominium Building Year Built", "Condominium % Ownership"),
      TRUE, FALSE
    )
  ) %>%
  arrange(desc(unique_to_condo_model), category)
condo_preds_fmt %>%
  write_csv("docs/data-dict.csv")
condo_preds_fmt %>%
  mutate(unique_to_condo_model = ifelse(unique_to_condo_model, "X", "")) %>%
  rename(
    "Feature Name" = "feature_name",
    "Variable Name" = "variable_name",
    "Description" = "description",
    "Category" = "category",
    "Type" = "type",
    "Unique to Condo Model" = "unique_to_condo_model"
  ) %>%
  knitr::kable(format = "markdown")
```
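The time-of-sale features described in the hardcoded descriptions above can be approximated in base R. This is a sketch rather than the pipeline's own code: the sale dates are made up, and the day-of-week numbering and the exact post-COVID cutoff are assumptions.

```r
# Hypothetical sale dates
sale_date <- as.Date(c("2019-12-31", "2020-03-15", "2024-02-29"))
lt <- as.POSIXlt(sale_date)

time_features <- data.frame(
  sale_year = lt$year + 1900,
  sale_day = as.integer(sale_date - as.Date("1997-01-01")),
  sale_quarter_of_year = paste0("Q", lt$mon %/% 3 + 1),
  sale_month_of_year = month.abb[lt$mon + 1],
  sale_day_of_year = lt$yday + 1,
  sale_day_of_month = lt$mday,
  # Assumption: the week starts on Sunday (1 = Sunday, 7 = Saturday)
  sale_day_of_week = lt$wday + 1,
  # Assumption: sales on or after 2020-03-15 count as post-COVID
  sale_post_covid = sale_date >= as.Date("2020-03-15")
)
```

Note that a sale on December 31 of a leap year would get a `sale_day_of_year` of 366 under this calculation.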
We maintain a few useful resources for working with these features:
- Once you've [pulled the input data](#getting-data), you can inner join the data to the CSV version of the data dictionary ([`docs/data-dict.csv`](./docs/data-dict.csv)) to filter for only the features that we use in the model.
- You can browse our [data catalog](https://ccao-data.github.io/data-architecture/#!/overview) to see more details about these features, in particular the [condo model input view](https://ccao-data.github.io/data-architecture/#!/model/model.ccao_data_athena.model.vw_pin_condo_input) which is the source of our training data.
- You can use the [`ccao` R package](https://ccao-data.github.io/ccao/) or its [Python equivalent](https://ccao-data.github.io/ccao/python/) to programmatically convert variable names to their human-readable versions ([`ccao::vars_rename()`](https://ccao-data.github.io/ccao/reference/vars_rename.html)) or convert numerically-encoded variables to human-readable values ([`ccao::vars_recode()`](https://ccao-data.github.io/ccao/reference/vars_recode.html)). The [`ccao::vars_dict` object](https://ccao-data.github.io/ccao/reference/vars_dict.html) is also useful for inspecting the raw crosswalk that powers the rename and recode functions.
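As a rough illustration of the first point, the sketch below filters a data frame down to the columns listed in the data dictionary. Both data frames are hypothetical stand-ins: in practice `data_dict` would be read from `docs/data-dict.csv` (which has a `variable_name` column) and `input_data` from the downloaded input files. The example column names are illustrative only.

```r
# Hypothetical stand-ins for docs/data-dict.csv and the input data
data_dict <- data.frame(
  variable_name = c("meta_township_code", "char_yrblt")
)
input_data <- data.frame(
  meta_pin = c("A", "B"),
  meta_township_code = c("70", "71"),
  char_yrblt = c(1995, 2004)
)

# One row per input column, then inner join to the dictionary
cols <- data.frame(variable_name = names(input_data))
joined <- merge(cols, data_dict, by = "variable_name")
model_cols <- as.character(joined$variable_name)

# Keep only the columns that appear in the data dictionary
model_data <- input_data[, model_cols, drop = FALSE]
```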
### Valuation
For the most part, condos are valued the same way as single- and multi-family residential property. We [train a model](https://github.com/ccao-data/model-res-avm#how-it-works) using individual condo unit sales, predict the value of all units, and then apply any [post-modeling adjustment](https://github.com/ccao-data/model-res-avm#post-modeling).
However, because the CCAO has so [little information about individual condo units](#differences-compared-to-the-residential-model), we must rely on the [condominium percentage of ownership](#features-used) to differentiate between units in a building. This feature is effectively the proportion of the building's overall value held by a unit. It is created when a condominium declaration is filed with the County (usually by the developer of the building). The critical assumption underlying the condo valuation process is that percentage of ownership correlates with the relative market value differences between units.
Percentage of ownership is used in two ways:
1. It is used directly as a predictor/feature in the regression model to estimate differing unit values within the same building.
2. It is used to reapportion unit values directly, i.e. the value of a unit is ultimately equal to `% of ownership * total building value`.
Visually, this looks like:

For what the office terms "nonlivable" spaces (parking spaces, storage space, and common area), the breakout of value works differently. See [this Excel sheet](docs/spreadsheets/condo_nonlivable_demo.xlsx) for an interactive example of how nonlivable spaces are valued based on the total value of a building's livable space.
Percentage of ownership is the single most important feature in the condo model. It determines almost all intra-building differences in unit values.
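The reapportionment step (`% of ownership * total building value`) can be sketched as follows. The building, its predicted values, and the column names are hypothetical, and the sketch assumes that total building value is the sum of unit-level predictions; nonlivable spaces are ignored.

```r
# Hypothetical building: unit-level predictions and ownership shares
units <- data.frame(
  unit = c("1A", "1B", "2A", "PH"),
  pred_value = c(210000, 190000, 205000, 395000),
  pct_ownership = c(0.20, 0.20, 0.20, 0.40)
)

# Assumption: total building value is the sum of unit-level predictions
bldg_total <- sum(units$pred_value)

# Final unit value = % of ownership * total building value
units$final_value <- units$pct_ownership * bldg_total
```

Here the three 20% units end up with the same final value even though their raw predictions differ slightly, which is why percentage of ownership drives almost all intra-building differences.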
### Multi-PIN Sales
The condo model is trained on a select number of "multi-PIN sales" (or "multi-sales") in addition to single-parcel sales. Multi-sales are sales that include more than one parcel. In the case of condominiums, many units are sold bundled with deeded parking spaces that are separate parcels. These two-parcel sales are highly reflective of the unit's actual market price. We split the total value of these two-parcel sales according to their relative percent of ownership before using them for training. For example, for a \$100,000 sale of a unit (4% ownership) and a parking space (1% ownership), the sale would be adjusted to \$80,000:
$$\frac{0.04}{0.04 + 0.01} \times \$100,000 = \$80,000$$
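The same split can be written as a small helper. The function name `split_multi_sale()` is hypothetical and used only for illustration.

```r
# Split a multi-PIN sale price across parcels in proportion to each
# parcel's percentage of ownership
split_multi_sale <- function(sale_price, pct_ownership) {
  sale_price * pct_ownership / sum(pct_ownership)
}

# The example from the text: a unit (4%) sold with a parking space (1%)
adjusted <- split_multi_sale(100000, c(unit = 0.04, parking = 0.01))
```

The `unit` element of `adjusted` is the \$80,000 used for training, and the remaining \$20,000 is attributed to the parking space.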
# Ongoing Issues
The CCAO faces a number of ongoing issues specific to condominium modeling. We are currently working on processes to fix these issues. We list the issues here for the sake of transparency and to provide a sense of the challenges we face.
### Unit Heterogeneity
The current modeling methodology for condominiums makes two assumptions:
1. Condo units within the same building are similar and will sell for similar amounts.
2. If units are not similar, the percentage of ownership will accurately reflect and be proportional to any difference in value between units.
The model process works even in heterogeneous buildings as long as assumption 2 is met. For example, imagine a building with 8 identical units and 1 penthouse unit. This building violates assumption 1 because the penthouse unit is likely larger and worth more than the other 8. However, if the percentage of ownership of each unit is roughly proportional to its value, then each unit will still receive a fair assessment.
However, the model can produce poor results when both of these assumptions are violated. For example, if a building has an extreme mix of different units, each with the same percentage of ownership, then smaller, less expensive units will be overvalued and larger, more expensive units will be undervalued.
This problem is rare, but does occur in certain buildings with many heterogeneous units. Such buildings typically go through a process of secondary review to ensure the accuracy of the individual unit values.
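A toy example makes the failure mode concrete. The building below is hypothetical: its units have very different market values but identical percentages of ownership, and the sketch assumes the total building value is estimated perfectly.

```r
# Hypothetical building that violates both assumptions
units <- data.frame(
  unit = c("studio", "two_bed", "penthouse"),
  market_value = c(100000, 200000, 300000),
  pct_ownership = rep(1 / 3, 3)
)

# Even with a perfectly estimated building total...
bldg_total <- sum(units$market_value)

# ...equal ownership shares give every unit the same value
units$assessed <- units$pct_ownership * bldg_total
units$ratio <- units$assessed / units$market_value
```

The resulting `ratio` is above 1 for the studio (overvalued), about 1 for the two-bedroom unit, and below 1 for the penthouse (undervalued).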
### Buildings With Few Sales
The condo model relies on sales within the same building to calculate a weighted, rolling average building sale price. This method works well for large buildings with many sales, but can break down when there are only 1 or 2 sales in a building. The primary danger here is _unrepresentative_ sales, i.e. sales that deviate significantly from the real average value of a building's units. When this happens, buildings can have their average building sale price pegged too high or low.
Fortunately, buildings without any recent sales are relatively rare, as condos have a higher turnover rate than single- and multi-family property. Smaller buildings with low turnover are the most likely to lack recent sales.
### Buildings Without Sales
When no sales have occurred in a building in the 5 years prior to assessment, the building's mean sale price feature is imputed. The model will look at nearby buildings that have similar unit counts, age, and other features, then try to assign an appropriate average to the target building.
Most of the time, this technique produces reasonable results. However, buildings without sales still go through an additional round of review to ensure the accuracy of individual unit values.
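One simple way to picture this kind of imputation is a nearest-neighbor average over building-level features. The sketch below is only an assumption about the general idea, not the model's actual imputation method: the function `impute_knn_mean()`, the feature set, and the data are all hypothetical, and location is left out for brevity.

```r
# Hypothetical building-level data; `roll_mean` is NA where no sales occurred
bldgs <- data.frame(
  bldg = c("A", "B", "C", "D"),
  units = c(10, 12, 200, 11),
  age = c(30, 28, 5, 31),
  roll_mean = c(200000, 220000, 600000, NA)
)

# Fill missing targets with the mean of the k most similar buildings
impute_knn_mean <- function(df, features, target, k = 2) {
  x <- scale(as.matrix(df[, features])) # put features on a common scale
  out <- df[[target]]
  known <- which(!is.na(out))
  for (i in which(is.na(out))) {
    # Euclidean distance from building i to every building with sales
    d <- sqrt(colSums((t(x[known, , drop = FALSE]) - x[i, ])^2))
    nearest <- known[order(d)[seq_len(min(k, length(known)))]]
    out[i] <- mean(df[[target]][nearest])
  }
  out
}

bldgs$roll_mean_imputed <- impute_knn_mean(
  bldgs, c("units", "age"), "roll_mean"
)
```

Building D has no sales, so it borrows the average of its two most similar neighbors (A and B) rather than the much larger, newer building C.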
# FAQs
**Note:** The FAQs listed here are for condo-specific questions. See the residential model documentation for [more general FAQs](https://github.com/ccao-data/model-res-avm#faqs).
**Q: What are the most important features in the condo model?**
As with the [residential model](https://github.com/ccao-data/model-res-avm), the importance of individual features varies by location and time. However, generally speaking, the most important features are:
* Location, location, location. Location is the largest driver of county-wide variation in condo value. We account for location using [geospatial features like neighborhood](#features-used).
* Condo percentage of ownership, which determines the intra-building variation in unit price.
* Other sales in the building. This is captured by a rolling average of sales in the building over the past 5 years, excluding any sales of the target condo unit.
**Q: How do I see the assessed value of other units in my building?**
You can use the [CCAO's Address Search](https://www.cookcountyassessor.com/address-search#address) to see all the PINs and values associated with a specific condominium building. Simply leave the `Unit Number` field blank when submitting a search.
**Q: How do I view my unit's percentage of ownership?**
The percentage of ownership for individual units is printed on assessment notices. You may also be able to find it via your building's board or condo declaration.
# Usage
Installation and usage of this model are identical to the [installation and usage of the residential model](https://github.com/ccao-data/model-res-avm#usage). Please follow the instructions listed there.
## Getting Data
The data required to run these scripts is produced by the [ingest stage](pipeline/00-ingest.R), which uses SQL pulls from the CCAO's Athena database as a primary data source. CCAO employees can run the ingest stage or pull the latest version of the input data from our internal DVC store using:
```bash
dvc pull
```
Public users can download data for each assessment year using the links below. Each file should be placed in the `input/` directory prior to running the model pipeline.
#### 2021
- [assmntdata.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2021/assmntdata.parquet)
- [modeldata.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2021/modeldata.parquet)
#### 2022
- [assessment_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2022/assessment_data.parquet)
- [condo_strata_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2022/condo_strata_data.parquet)
- [land_nbhd_rate_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2022/land_nbhd_rate_data.parquet)
- [training_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2022/training_data.parquet)
#### 2023
- [assessment_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2023/assessment_data.parquet)
- [condo_strata_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2023/condo_strata_data.parquet)
- [land_nbhd_rate_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2023/land_nbhd_rate_data.parquet)
- [training_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2023/training_data.parquet)
#### 2024
Due to a [data issue](https://github.com/ccao-data/data-architecture/pull/334) with the initial 2024 model run, there are actually _two_ final 2024 models. The run `2024-02-16-silly-billy` was used for Rogers Park only, while the run `2024-03-11-pensive-manasi` was used for all subsequent City of Chicago townships.
The data issue caused some sales to be omitted from the `2024-02-16-silly-billy` training set; however, the actual impact on predicted values was _extremely_ minimal. We chose to update the data and create a second final model out of an abundance of caution, and, given low transaction volume in 2023, to include as many arm's-length transactions in the training set as possible.
##### 2024-02-16-silly-billy
- [assessment_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-02-16-silly-billy/assessment_data.parquet)
- [char_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-02-16-silly-billy/char_data.parquet)
- [condo_strata_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-02-16-silly-billy/condo_strata_data.parquet)
- [land_nbhd_rate_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-02-16-silly-billy/land_nbhd_rate_data.parquet)
- [training_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-02-16-silly-billy/training_data.parquet)
##### 2024-03-11-pensive-manasi (final)
- [assessment_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-03-11-pensive-manasi/assessment_data.parquet)
- [char_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-03-11-pensive-manasi/char_data.parquet)
- [condo_strata_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-03-11-pensive-manasi/condo_strata_data.parquet)
- [land_nbhd_rate_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-03-11-pensive-manasi/land_nbhd_rate_data.parquet)
- [training_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2024/run_id=2024-03-11-pensive-manasi/training_data.parquet)
#### 2025
- [assessment_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2025/assessment_data.parquet)
- [char_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2025/char_data.parquet)
- [land_nbhd_rate_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2025/land_nbhd_rate_data.parquet)
- [training_data.parquet](https://ccao-data-public-us-east-1.s3.amazonaws.com/models/inputs/condo/2025/training_data.parquet)
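For convenience, the 2025 files listed above could be fetched into `input/` with a few lines of base R. This is a sketch: the helper names `input_urls()` and `download_inputs()` are hypothetical, and the base URL and file names are copied from the links above. Earlier years use different file names (and, for 2024, a `run_id=` path segment), so adjust accordingly.

```r
# Base URL and file names are taken from the 2025 links listed above
base_url <- paste0(
  "https://ccao-data-public-us-east-1.s3.amazonaws.com",
  "/models/inputs/condo"
)
files_2025 <- c(
  "assessment_data.parquet", "char_data.parquet",
  "land_nbhd_rate_data.parquet", "training_data.parquet"
)

# Build the public URL for each input file
input_urls <- function(year, files) {
  paste(base_url, year, files, sep = "/")
}

# Download each file into the input directory
download_inputs <- function(year, files, dest_dir = "input") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  urls <- input_urls(year, files)
  dests <- file.path(dest_dir, files)
  for (i in seq_along(urls)) {
    # mode = "wb" keeps binary parquet files intact on Windows
    download.file(urls[i], dests[i], mode = "wb")
  }
  invisible(dests)
}

# Uncomment to download (requires network access)
# download_inputs(2025, files_2025)
```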
For other data from the CCAO, please visit the [Cook County Data Portal](https://datacatalog.cookcountyil.gov/).
# License
Distributed under the AGPL-3 License. See [LICENSE](./LICENSE) for more information.
# Contributing
We welcome pull requests, comments, and other feedback via GitHub. For more involved collaboration or projects, please see the [Developer Engagement Program](https://github.com/ccao-data/people#external) documentation on our group wiki.