Re-work multi-card values and add multi-card pins to challenge groups #328
pipeline/02-assess.R
Outdated
total_fmv = sum(pred_card_initial_fmv, na.rm = TRUE),
total_bldg_sf_pin = sum(char_bldg_sf, na.rm = TRUE),
share_bldg_sf = char_bldg_sf / total_bldg_sf_pin,
pred_card_initial_fmv = total_fmv * share_bldg_sf
A small number of multi-card pins have a card with a sqft of 0, which results in the value for that card being 0. I'm not sure how to handle this. We could just apply some fixed cut in that case. I'm not super clear on the downstream exposure of these. Open to ideas.
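To make the zero-sqft case concrete, here is a minimal base R sketch of the proportional split from the diff above. The toy data, the `meta_pin` column name, and the output column `pred_card_fmv_by_share` are assumptions for illustration only; the pipeline itself may compute this differently.

```r
# Hypothetical toy data: one pin with three cards, one of which has 0 sqft
cards <- data.frame(
  meta_pin = c("A", "A", "A"),
  char_bldg_sf = c(1500, 500, 0),
  pred_card_initial_fmv = c(300000, 120000, 40000)
)

# Pin-level totals, repeated on every card row of the pin
sum_na_rm <- function(x) sum(x, na.rm = TRUE)
cards$total_fmv <- ave(
  cards$pred_card_initial_fmv, cards$meta_pin, FUN = sum_na_rm
)
cards$total_bldg_sf_pin <- ave(
  cards$char_bldg_sf, cards$meta_pin, FUN = sum_na_rm
)

# Each card's share of the pin's building sqft, and its share of the value
cards$share_bldg_sf <- cards$char_bldg_sf / cards$total_bldg_sf_pin
cards$pred_card_fmv_by_share <- cards$total_fmv * cards$share_bldg_sf

print(cards[, c("char_bldg_sf", "share_bldg_sf", "pred_card_fmv_by_share")])
```

The pin total is preserved, but the card with 0 sqft is assigned a share of 0 and therefore a value of 0, which is the behavior described in the comment.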
run_id: "2024-03-17-stupefied-maya"
year: "2024"
run_id: "2025-01-10-serene-boni"
year: "2025"
Maybe I'll leave this just because it is the current baseline?
The sales data we use to measure accuracy is the most recent sale per multi-card
pin if there was one after 2020.

```{r _decile_ratio_graph}
Ratio decile graph based on sale_recent_1_price after 2020
p_deciles
```

```{r _scatterplot_pred_vs_sale}
Interactive scatter
* Revert MC data munging * Simplify multi-card handling code
The method and results here look sound. I double-checked everything by comparing the predictions here against the most recent baseline run; everything looks good. Nice work @wagnerlmichael.
…#328)
* Add some additive solutions and checks for single card values
* Remove other methods
* Clean up multi-card edit and replace paths
* Remove spaces
* De-aggregate card preds
* Format
* Shorten ariable name
* Format variable name length
* Format variable name length
* Add space back
* Remove strings
* Add back group by
* Add multi-card analysis for challenge groups
* Format
* Format
* Clean up comment
* Fix pin aggregation
* Switch strategy between 2-3 cards and 4+ cards
* Lint
* Fix decile calculation
* Simplify new multi-card method (#335)
* Revert MC data munging
* Simplify multi-card handling code
---------
Co-authored-by: Dan Snow <dan@sno.ws>
Co-authored-by: Dan Snow <31494343+dfsnow@users.noreply.github.com>
This PR updates two things.
Previously, we used either a simple aggregation of the cards' predicted values or a single model prediction for the card with the largest sqft. The choice between the two was based on a cap on the YoY % change in value. We tended to over-predict these values. My hypothesis is that when we predict on multiple cards that share the same location data, we are in effect counting the value from location twice. Since location data is the bulk of the model's value, it is hard not to over-predict.
The new strategy relies on choosing the single card with the highest square footage from a multi-card property, but it then adjusts that card’s square footage by adding the square footage of the remaining cards. By folding the entire property’s building area into one card, the model produces a single prediction, predicting on the location data a single time. The values with this method look much better. For more information on the results and the different specifications tested, see this issue, or this report.
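As a rough illustration of the aggregation step, the base R sketch below keeps the largest card per pin and replaces its square footage with the pin's total. The toy data and the `meta_pin` and `meta_card_num` column names are assumptions, and tie-breaking (first card wins) is a choice made here for the example, not something this PR specifies.

```r
# Hypothetical toy data: pin A has 2 cards, pin B has 3 cards
cards <- data.frame(
  meta_pin = c("A", "A", "B", "B", "B"),
  meta_card_num = c(1, 2, 1, 2, 3),
  char_bldg_sf = c(1200, 800, 900, 2000, 600)
)

# Total building sqft per pin, repeated on each card row
total_sf <- ave(cards$char_bldg_sf, cards$meta_pin, FUN = sum)

# Flag the largest card within each pin (first card wins on ties).
# ave() returns numeric 0/1 here, so compare against 1 to get a logical.
is_largest <- ave(
  cards$char_bldg_sf, cards$meta_pin,
  FUN = function(x) seq_along(x) == which.max(x)
) == 1

# Keep one row per pin and fold the whole pin's sqft into that card
frame_to_predict <- cards[is_largest, ]
frame_to_predict$char_bldg_sf <- total_sf[is_largest]

print(frame_to_predict)
```

The resulting frame has a single row per pin, so the model would see the location features once per property rather than once per card.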
We are going to implement this strategy for multi-card pins with 2-3 cards, which represent the vast majority of multi-card properties. For pins with 4+ cards, due to data problems and wacky pin shapes, we are going to keep the prior method of predicting each card individually and then summing the predictions. For example, a multi-card sale could have a sale price that represents a single card, while in our data it is attached to the full pin.
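The split by card count could be sketched as follows. `predict_fmv()` is a made-up stand-in for the real model (a fixed "location" component plus a per-sqft component), and `pin_value()` and the toy data are likewise hypothetical. In this toy model every card shares all features except sqft, so predicting on the combined sqft stands in for predicting on the largest card with adjusted sqft.

```r
# Hypothetical stand-in for the model: fixed location value plus per-sqft value
predict_fmv <- function(sf) 100000 + 150 * sf

# Value one pin from a vector of its cards' square footage
pin_value <- function(card_sf) {
  n_cards <- length(card_sf)
  if (n_cards %in% 2:3) {
    # 2-3 cards: a single prediction on the combined square footage
    predict_fmv(sum(card_sf))
  } else {
    # 1 card is unchanged; 4+ cards: predict each card, then sum
    sum(predict_fmv(card_sf))
  }
}

# Hypothetical toy data: pin A has 2 cards, pin B has 4 cards
cards <- data.frame(
  meta_pin = c("A", "A", "B", "B", "B", "B"),
  char_bldg_sf = c(1200, 800, 500, 500, 500, 500)
)

pin_values <- tapply(cards$char_bldg_sf, cards$meta_pin, pin_value)
print(pin_values)
```

In this toy setup, summing per-card predictions for pin A would add the fixed location component twice, which is the kind of over-prediction the new method for 2-3 card pins is meant to avoid.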
We also tested applying this square footage aggregation to the training data, in addition to doing it in the assess stage. The multi-card predictions were better, but at the cost of worse overall model performance. A next step for multi-cards might be trying to debug this.
However, with aggregation in the assess stage only, we still see substantial performance gains. More details on these figures are in the report.