
update_column() utility not working as intended? #89

Open
mxndrwgrdnr opened this issue Jan 31, 2019 · 13 comments
Comments

@mxndrwgrdnr
Member

I'm suspicious that the new urbansim_templates.utils.update_column() method is not functioning properly, or at least only functioning in certain settings: it definitely works in some places but not in others. I have a hypothesis as to why this might be happening, but I wonder what you think, @smmaurer.

Here I am running the mode choice model in activitySynth:
[screenshot: 2019-01-31, 12:51 PM]

which uses the small MNL template and, as such, should automatically append the results to the correct output table, as seen here.

But that doesn't happen:
[screenshot: 2019-01-31, 12:53 PM]

In order to figure out whether the problem might be in the model step itself, I manually trigger the update_column() command, but to no avail:
[screenshot: 2019-01-31, 12:55 PM]

So then I debug these lines of code from the update_column() method itself to see if the problem lies there, but it seems to work as intended:
[screenshot: 2019-01-31, 12:56 PM]

But in fact it doesn't, because the updated column cannot be found in the parent table:
[screenshot: 2019-01-31, 12:58 PM]

So my working hypothesis is that, for some reason, in some instances, updates made to an orca.get_table() DataFrameWrapper object are not propagated to the orca table itself. I believe this has to do with the fact that successive orca.get_table() calls on the same table return objects stored at different locations in memory, as seen here:
[screenshot: 2019-01-31, 1:01 PM]

Whether or not the different in-memory addresses are the issue, the following exercise shows that changes do not propagate:
[screenshot: 2019-01-31, 1:03 PM]

Am I missing something here, or is my reasoning sound? The bigger mystery to me now is why the update_column() method would ever succeed in updating the parent orca table.

@mxndrwgrdnr
Member Author

Here's a thought: could this have to do with whether or not the cache=True flag is set when the table is created?

@smmaurer
Member

smmaurer commented Jan 31, 2019

Interesting (and troubling). Agree that it seems like DataFrameWrapper.update_col() is not reliably updating the underlying table in Orca.

I'm not sure the last part (beginning with "So my working hypothesis") is the reason, though -- I don't think I'd expect changes to propagate between copies anyway, even if they'd been recorded by Orca.

I'm looking through the Orca documentation to see if I can come up with any leads...

@smmaurer
Member

It does seem like this is the intended workflow. And here's an example of the underlying data being updated properly, in the Large MNL unit tests: test_large_multinomial_logit.py#L202-L211

One thing to try might be explicitly replacing the orca table with the new one, which might be more reliable. So in your debugging section, something like:

dfw = orca.get_table(mc.out_tables)
dfw.update_col(mc.out_column, mc.choices)
orca.add_table(mc.out_tables, dfw.to_frame())

If that works, it would be an easy fix to add it to the template utility.

@mxndrwgrdnr
Member Author

Hmm, and the cache=True flag isn't set in your tests so there goes my theory.

@mxndrwgrdnr
Member Author

I'll test the explicit table replacement and report back.

@bridwell

bridwell commented Jan 31, 2019

@mxndrwgrdnr I think you're exactly right about the cache status. Basically, what's going on here is that (a) a function evaluates and returns a data frame, (b) you're modifying that data frame, and then (c) the function is being evaluated again, so you're getting back what looks to be the original data frame.

import orca
import pandas as pd

# define a non-cached table
@orca.table()
def my_table():
    return pd.DataFrame({'a': ['uno', 'dos', 'tres']})

# grab the table from orca and update it
my_wrapper = orca.get_table('my_table')
my_wrapper.update_col('b', pd.Series(['cuatro', 'cinco', 'seis']))

# this will bring back the original result
orca.get_table('my_table').local
#       a
# 0   uno
# 1   dos
# 2  tres

Instead, if we cache, then the function is NOT re-evaluated and we get the same instance:

@orca.table(cache=True, cache_scope='forever')
def my_table():
    return pd.DataFrame({'a': ['uno', 'dos', 'tres']})

my_wrapper = orca.get_table('my_table')
my_wrapper.update_col('b', pd.Series(['cuatro', 'cinco', 'seis']))
orca.get_table('my_table').local
#       a       b
# 0   uno  cuatro
# 1   dos   cinco
# 2  tres    seis

In lieu of messing with the cache, I suppose you could retrieve the data frame from the function and then add it back into orca as a data frame rather than a function.

@orca.table()
def my_table():
    return pd.DataFrame({'a': ['uno', 'dos', 'tres']})

orca.add_table('my_table', orca.get_table('my_table').local)
my_wrapper = orca.get_table('my_table')
my_wrapper.update_col('b', pd.Series(['cuatro', 'cinco', 'seis']))
orca.get_table('my_table').local
#       a       b
# 0   uno  cuatro
# 1   dos   cinco
# 2  tres    seis

@mxndrwgrdnr
Member Author

OK, so then why do your large MNL unit tests work, given that you aren't setting the cache here?

@bridwell

That looks to be creating an orca table wrapper from an already-created pandas.DataFrame, not from a function that returns a pandas.DataFrame.

@mxndrwgrdnr
Member Author

Yes, but there is a function that returns a pandas.DataFrame that gets called behind the scenes here.

@bridwell

bridwell commented Feb 1, 2019

How is the __persons_CHTS_format__ table being added to orca?

Like this?

# create a data frame and add it explicitly
df = pandas.DataFrame({'a': [1, 2, 3]})
orca.add_table('__persons_CHTS_format__', df)

Or like this?

# wrap a function and add it to orca
@orca.table()
def __persons_CHTS_format__():
    return pandas.DataFrame({'a': [1, 2, 3]})

If the table is added as a wrapped function (the 2nd case), I would set cache=True and cache_scope='forever', or add the table explicitly. If that doesn't fix it, then there's a template or orca problem.

@smmaurer
Member

smmaurer commented Feb 1, 2019

Thanks @bridwell, this is all extremely helpful. It does seem like it must be the cache settings.

Which means that the workflow of modifying a table within a model step only works if caching is enabled for the table, which I hadn't fully realized.

Max and I were discussing what to do to prevent the sort of problem he ran into; here are some possibilities:

  1. We should definitely add documentation in the templates library explaining this requirement.

  2. We could raise a warning in the template itself if the user is trying to write data to a table with caching disabled.

  3. We could also raise a similar warning in orca.DataFrameWrapper.update_col().

  4. Do we want to explore changing Orca's default from cache=False to cache=True? Probably not -- this would be disruptive to existing codebases, and I think the current defaults make sense. The idea is probably that tables coming from disk should use orca.add_table(df), where caching is automatic, while tables of derived data would use either orca.add_table(func) or the @orca.table() decorator. And a default of not caching derived data makes sense. Often we use the decorator for tables coming from disk because it's an easy way to incorporate data cleaning, but I don't think this is a workflow we should be encouraging.

So my inclination is to do numbers 1 through 3.

@bridwell

bridwell commented Feb 4, 2019

One more thing to be mindful of: in the examples I provided, the table is cached with a scope of 'forever'. If the table were cached at a lesser scope (step, iteration), you would run into the same problems, and these would probably be even more difficult to debug because the updates would sometimes persist and then be removed.

I agree that changing Orca's cache defaults would probably cause problems. Raising a warning in orca when update_col is called on a table that is (a) returned by a function and (b) not cached forever makes sense. In terms of the templates, I wonder if you should be stricter and raise an error?

I also use orca decorators to initially load my tables. I don't really see anything wrong with that workflow; it's definitely helpful and convenient. But the whole caching idea does seem to create confusion at times. Perhaps a new decorator or input argument could be added to clarify and differentiate this type of workflow? Maybe @orca.load_table(), which would be equivalent to @orca.table(cache=True, cache_scope='forever').

@mxndrwgrdnr
Member Author

@bridwell I believe the default is cache_scope='forever', so no worries there.

But yeah, @smmaurer I think this solution (1-3) makes the most sense.

Regarding point 4, though, would the idea be to convert all of the @orca.table() calls and table-func definitions in datasources.py files into a bunch of separate pd.read_csv() and orca.add_table() calls? Would these get automatically injected into the orca namespace the same way when doing from scripts import variables?
