
update_column() utility not working as intended? #89

Open
mxndrwgrdnr opened this issue Jan 31, 2019 · 13 comments
Comments

@mxndrwgrdnr
Member

I'm suspicious that the new urbansim_templates.utils.update_column() method is not functioning properly, or at least only functioning in certain settings: it definitely works in some places but not in others. I have a hypothesis as to why this might be happening, but I wonder what you think, @smmaurer.

Here I am running the mode choice model in activitySynth:
[screenshot: 2019-01-31, 12:51 PM]

which uses the small MNL template and, as such, should automatically append the results to the correct output table, as seen here.

But that doesn't happen:
[screenshot: 2019-01-31, 12:53 PM]

In order to figure out whether the problem might be in the model step itself, I manually trigger the update_column() command, but to no avail:
[screenshot: 2019-01-31, 12:55 PM]

So then I debug these lines of code from the update_column() method itself to see if the problem lies there, but it seems to work as intended:
[screenshot: 2019-01-31, 12:56 PM]

But in fact it doesn't, because the updated column cannot be found in the parent table:
[screenshot: 2019-01-31, 12:58 PM]

So my working hypothesis is that, for some reason, in some instances, updates made to an orca.get_table() DataFrameWrapper object are not propagated to the orca table itself. I believe this has to do with the fact that successive orca.get_table() calls on the same table return objects stored at different locations in memory, as seen here:
[screenshot: 2019-01-31, 1:01 PM]

Whether or not the different in-memory addresses are the issue, the following exercise shows that changes do not propagate:
[screenshot: 2019-01-31, 1:03 PM]

Am I missing something here, or is my reasoning sound? The bigger mystery to me now is why the update_column() method would ever succeed in updating the parent orca table.

@mxndrwgrdnr
Member Author

Here's a thought: could this have to do with whether or not the cache=True flag is set when the table is created?

@smmaurer
Member

smmaurer commented Jan 31, 2019

Interesting (and troubling). Agree that it seems like DataFrameWrapper.update_col() is not reliably updating the underlying table in Orca.

I'm not sure the last part (beginning with "So my working hypothesis") is the reason, though -- I don't think I'd expect changes to propagate between copies anyway, even if they'd been recorded by Orca.

I'm looking through the Orca documentation to see if I can come up with any leads...

@smmaurer
Member

It does seem like this is the intended workflow. And here's an example of the underlying data being updated properly, in the Large MNL unit tests: test_large_multinomial_logit.py#L202-L211

One thing to try might be explicitly replacing the orca table with the new one, which might be more reliable. So in your debugging section, something like:

dfw = orca.get_table(mc.out_tables)
dfw.update_col(mc.out_column, mc.choices)
orca.add_table(mc.out_tables, dfw.to_frame())

If that works, it would be an easy fix to add it to the template utility.

@mxndrwgrdnr
Member Author

Hmm, and the cache=True flag isn't set in your tests so there goes my theory.

@mxndrwgrdnr
Member Author

I'll test the explicit table replacement and report back.

@bridwell

bridwell commented Jan 31, 2019

@mxndrwgrdnr I think you're exactly right about the cache status. Basically, what's going on here is that (a) a function evaluates and returns a data frame, (b) you're modifying that data frame, and then (c) the function is being evaluated again, so you're getting back what looks to be the original data frame.

import orca
import pandas as pd

# define a non-cached table
@orca.table()
def my_table():
    return pd.DataFrame({'a': ['uno', 'dos', 'tres']})

# grab the table from orca and update it
my_wrapper = orca.get_table('my_table')
my_wrapper.update_col('b', pd.Series(['cuatro', 'cinco', 'seis']))

# this will bring back the original result
orca.get_table('my_table').local
#       a
# 0   uno
# 1   dos
# 2  tres

Instead, if we cache, then the function is NOT re-evaluated and we get the same instance:

@orca.table(cache=True, cache_scope='forever')
def my_table():
    return pd.DataFrame({'a': ['uno', 'dos', 'tres']})

my_wrapper = orca.get_table('my_table')
my_wrapper.update_col('b', pd.Series(['cuatro', 'cinco', 'seis']))
orca.get_table('my_table').local
#       a       b
# 0   uno  cuatro
# 1   dos   cinco
# 2  tres    seis

In lieu of messing with the cache, I suppose you could retrieve the data frame from the function and then add it back into orca as a data frame rather than a function.

@orca.table()
def my_table():
    return pd.DataFrame({'a': ['uno', 'dos', 'tres']})

orca.add_table('my_table', orca.get_table('my_table').local)
my_wrapper = orca.get_table('my_table')
my_wrapper.update_col('b', pd.Series(['cuatro', 'cinco', 'seis']))
orca.get_table('my_table').local
#       a       b
# 0   uno  cuatro
# 1   dos   cinco
# 2  tres    seis

@mxndrwgrdnr
Member Author

OK, so then why do your large MNL unit tests work, given that you aren't setting the cache here?

@bridwell

That looks to be creating an orca table wrapper from an already-created pandas.DataFrame, not from a function that returns a pandas.DataFrame.

@mxndrwgrdnr
Member Author

Yes, but there is a function that returns a pandas.DataFrame that gets called behind the scenes here.

@bridwell

bridwell commented Feb 1, 2019

How is the __persons_CHTS_format__ table being added to orca?

Like this?

# create a data frame and add it explicitly
df = pandas.DataFrame({'a': [1, 2, 3]})
orca.add_table('__persons_CHTS_format__', df)

Or like this?

# wrap a function and add it to orca
@orca.table()
def __persons_CHTS_format__():
    return pandas.DataFrame({'a': [1, 2, 3]})

If the table is added as a wrapped function (the 2nd case), I would set cache=True and cache_scope='forever', or add the table explicitly. If that doesn't fix it, then there's a template or orca problem.

@smmaurer
Member

smmaurer commented Feb 1, 2019

Thanks @bridwell, this is all extremely helpful. It does seem like it must be the cache settings.

Which means that the workflow of modifying a table within a model step only works if caching is enabled for the table, which I hadn't fully realized.

Max and I were discussing what to do to prevent the sort of problem he ran into; here are some possibilities:

  1. We should definitely add documentation in the templates library explaining this requirement.

  2. We could raise a warning in the template itself if the user is trying to write data to a table with caching disabled.

  3. We could also raise a similar warning in orca.DataFrameWrapper.update_col().

  4. Do we want to explore changing Orca's default from cache=False to cache=True? Probably not -- this would be disruptive to existing codebases, and I think the current defaults make sense. The idea is probably that tables coming from disk should use orca.add_table(df), where caching is automatic, while tables of derived data would use either orca.add_table(func) or the @orca.table() decorator. And a default of not caching derived data makes sense. Often we use the decorator for tables coming from disk because it's an easy way to incorporate data cleaning, but I don't think this is a workflow we should be encouraging.

So my inclination is to do numbers 1 through 3.

@bridwell

bridwell commented Feb 4, 2019

One more thing to be mindful of: in the examples I provided, the table is cached with a scope of 'forever'. If the table were cached at a lesser scope (step, iteration), you would run into the same problems, and these would probably be even more difficult to debug because the updates would sometimes persist and then be removed.

I agree that changing Orca's cache defaults would probably cause problems. Raising a warning in orca when update_col is called on a table that is (a) returned by a function and (b) not cached forever makes sense. In terms of the templates, I wonder if you should be stricter and raise an error?

I also use orca decorators to initially load my tables. I don't really see anything wrong with that workflow; it's definitely helpful and convenient. But the whole caching idea does seem to create confusion at times. Perhaps a new decorator or input argument could be added to clarify and differentiate this type of workflow? Maybe @orca.load_table(), which would be equivalent to @orca.table(cache=True, cache_scope='forever').

@mxndrwgrdnr
Member Author

@bridwell I believe the default is cache_scope='forever', so no worries there.

But yeah, @smmaurer I think this solution (1-3) makes the most sense.

Regarding point 4, though, would the idea be to convert all of the @orca.table() calls and table-func definitions in datasources.py files into a bunch of separate pd.read_csv() and orca.add_table() calls? Would these get automatically injected into the orca namespace the same way when doing from scripts import variables?
