-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strategy for dealing with column names that overlap between tables #67
Comments
Although I thought I had solved my issue with the approach you've documented here as Example 2, it cropped up again, and I'm not sure that your proposed strategy handles this case. As it pertains to generating I think there are two solutions to this use case: 1) update the large mnl code so that the obs and alts tables that get passed in to the merged choice table constructor are pared down to only the columns that appear in the model expression; or 2) replace the joins in these lines in the MergedChoiceTable code with merges. I totally understand the rationale for sticking with merges and not wanted to involuntarily create new and arbitrary column names, thats why I originally raised the issue with you. However, after thinking about it some more, I'm fairly certain that as long as we're only using merges within the MergedChoiceTable class itself, which is more or less a temporary table only ever stored in memory, there really shouldn't be any downstream effects. Moreover, none of the columns that are actually needed by the model would be effected by the auto-generation of suffixes by I think we could go either way. What do you think? |
@mxndrwgrdnr I like approach 1, where the data passed to the MCT constructor is pared down to only include necessary columns. We should be doing this anyway because it's more efficient. (I think i had put it off just because it was more complicated to code than the equivalent for OLS.) I can work on this today if it's something that's blocking your progress. It also seems reasonable to add a parameter to the MCT constructor so that people who are using choicemodels directly can specify what they want to happen with overlapping columns: raise an error, drop some, or append suffixes. And we just wouldn't use that functionality from the UrbanSim side. |
That makes sense. And yeah, it should hopefully results in faster run times. Anyways, there's nothing currently blocking my progress because I have the joins replaced with merges in my local branch (I will just refrain from committing my changes and do a rebase once you've updated master) but it will block progress once we've assembled all these new simulation steps in the activitysynth repo and have multiple folks trying to run it. I make the changes to if you have more pressing stuff on your plate at the moment. |
I've been working on some code to verify ex-ante that a list of columns can be drawn unambiguously from a list of auto-merged tables. Snapshot in a branch of utils.py. It turns out to be hard to do this well within the current system of how Orca manages tables, columns, and join keys ("broadcasts"). Basically, we want to check whether column names are unique across a set of tables, making an exception for join keys that will collapse into a single column -- but NOT making an exception for keys that won't be used in merging this set of tables, since that's one of the problems we're trying to catch. Challenge: index namesThe first problem, which is annoying but surmountable, is that the Orca API treats index and non-index columns completely separately. You can't fetch index columns by name. To get ALL the column names, you need to specify I think the reason Orca was designed this way is that the Pandas API used to do the same thing. But Pandas is moving beyond this -- v0.23 and later let you refer to index columns by name for tasks like What matters in our broader data model is things like whether column names are unique, whether key values are unique, whether a primary-key-foreign-key relationship is valid. Pandas indexes don't affect the substantive output, they're just an implementation detail that speeds up certain operations -- so better to handle the distinction internally wherever possible. Challenge: broadcast validationA bigger problem -- at least for validating merges ex-ante -- is that any algorithm for checking column names relies on the broadcasts themselves being valid, and I don't think there's an easy way to confirm this before performing the merge. Orca won't stop you from specifying bad broadcasts, you'll just get a runtime error if they can't be resolved properly. So it would be nice to implement broadcast validation too, but this probably requires changes to Orca -- it would be a bigger project and require a lot of testing. Next stepsHere's my inclination:
|
Ok, we've taken care of the top-priority updates. Templates PR #74 pares down chooser and alternative tables to only include necessary columns before merging them. ChoiceModels PR #54 adds a check to MergedChoiceTable so that users get a clearer error if there are duplicate column names. I'm leaving this issue open for general discussion, and because there are other updates we might want to make in the future: MergedChoiceTable support for dropping or renaming columns in the case of name conflicts, broadcast validation before runtime, etc. |
Two notes:
|
We should shave some strategies and policies for dealing with column names that exist in multiple tables -- both so we can advise users about what's expected, and so our code handles things the same way in different places. This is prompted by conversations with @mxndrwgrdnr and @janowicz.
Problem
The same column name can exist in multiple tables, with values that are either aligned (e.g.
building_id
) or totally different (e.g.area
).When tables with name conflicts are merged together, Pandas either raises an error or appends suffixes to the names, depending on how the merge/join is specified. This can cause UrbanSim to crash or potentially to generate ambiguous or incorrect results.
What are the use cases where duplicate column names are justified?
a. Primary/foreign keys: the buildings table has a
building_id
, and the households table also has abuilding_id
because households are matched to buildings.b. Metrics that exist at multiple scales: there could be a tract-level
population
and a zone-levelpopulation
, and it might not be convenient to give the columns globally unique names.c. Temporary data re-indexing: if zone-level attributes are needed for many different building-level calculations, it could be computationally efficient to copy them onto the buildings table temporarily, to avoid repeated merges.
Existing strategy
It seems like our existing strategy is to mildly discourage overlapping column names without doing much to explicitly check for them. Users need to work out the details if they cause problems.
Orca's table merging will either (1) silently append suffixes, or (2) silently drop all but the left-most occurrence of the duplicate column.
Proposed strategy
The ambiguity and flux of auto-dropping columns or changing column names in merged tables seems bad to me.
My proposal: In situations where columns are automatically assembled from multiple tables, output column names other than the join keys must be unique. I think this will help make simulations more reliable and help identify potential data problems more proactively.
Example 1: The model expression for a move-out model is
move_choice ~ age + tenure
, drawing data from thehouseholds
andunits
tables. Iftenure
accidentally appears in both tables, the template will raise an error. If some other variable likebuilding_id
appears in both tables, it's no problem because this is not part of the model expression.Example 2: Choosers (
households
) and alternatives (workplaces
) both have a column namedtract_id
. Iftract_id
appears in the model expression, or as a join index for interaction terms, choicemodels will raise an error. Since we probably want to retain both id's in the merged data, a better practice is to store them with different names:household_tract_id
andworkplace_tract_id
. This is annoying, but I think it's better than silently resolving name overlaps.Other approaches to consider
Universally disallow duplicate column names except for indexes?
This seems too strict. It would prevent use cases (b) and (c) above.
Be more strict about how we specify column names, using some kind of
table_name.column_name
syntax in model expressions so that every column is unique?I find this appealing: it's clear, unambiguous, and would align well with database syntax. But referring to column names on their own is pretty fundamental to the Orca API, so it might not be something we want to change. (I toyed with this syntax in orca_test, but ended up only using it for specifying foreign keys.)
Questions
Are there use cases where my proposed strategy will cause problems?
The text was updated successfully, but these errors were encountered: