-
Notifications
You must be signed in to change notification settings - Fork 582
reduce chunkstore memory footprint #747
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -193,7 +193,7 @@ def serialize(self, df): | |
ret[METADATA][TYPE] = dtype | ||
return ret | ||
|
||
def deserialize(self, data, columns=None): | ||
def deserialize(self, data, columns=None, inplace=False): | ||
""" | ||
Deserializes SON to a DataFrame | ||
|
||
|
@@ -203,13 +203,17 @@ def deserialize(self, data, columns=None): | |
columns: None, or list of strings | ||
optionally you can deserialize a subset of the data in the SON. Index | ||
columns are ALWAYS deserialized, and should not be specified | ||
inplace: Convert and remove items from data in-place | ||
this will modify data | ||
|
||
Returns | ||
------- | ||
pandas dataframe or series | ||
""" | ||
if not data: | ||
return pd.DataFrame() | ||
if not inplace: | ||
data = data[:] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. there are some errors in the tests, so I'm thinking this will need to be tweaked a bit more |
||
|
||
meta = data[0][METADATA] if isinstance(data, list) else data[METADATA] | ||
index = INDEX in meta | ||
|
@@ -218,16 +222,19 @@ def deserialize(self, data, columns=None): | |
if index: | ||
columns = columns[:] | ||
columns.extend(meta[INDEX]) | ||
if len(columns) > len(set(columns)): | ||
raise Exception("Duplicate columns specified, cannot de-serialize") | ||
columns = list(set(columns)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure I see this as a win. It seems like the caller may have a bug if they are specifying duplicate columns, we're just hiding the error now There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The current logic is confusing when subsetting data frames with indexes. For example, if you have the data frame: The logic works if the user passes: I don't see the value in the check - it should just do the right thing.. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Using pandas nomenclature the columns and the index are separate. If there is an index, you always get it back, even if you specify a subset of columns (and even if they do not include the index columns). Maybe the documentation should be improved. If for example, you specify price and security, you'll still get date as well as price and security, so your fix would only introduce more weirdness (in my opinion). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could remove index columns from columns and then check for duplicates. This keeps the nomenclature but keeps the user interface 'minimum surprise'. Or raise an error saying they have included index columns in the column list. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The result would be the same though, no? You'd supply index columns and it wont complain. I foresee someone opening a bug complaining they only specified 1 of 3 index columns but still got all 3 back. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually, in retrospect, that means breaking the API for clients. How about we keep the fuzziness for clients and simply output a warning instead? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok, lets see a warning :) but i still think that get info should change, otherwise how would you ever know how to rid yourself of the warning? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agree with get_info change There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so sounds like you just need to fix the broken tests and add the log and we're all set :D |
||
|
||
if not isinstance(data, list): | ||
df = self.converter.objify(data, columns) | ||
else: | ||
df = pd.concat([self.converter.objify(d, columns) for d in data], ignore_index=not index) | ||
dfs = [] | ||
while len(data): | ||
bmoscon marked this conversation as resolved.
Show resolved
Hide resolved
|
||
dfs.append(self.converter.objify(data.pop(0), columns)) | ||
df = pd.concat(dfs, ignore_index=not index) | ||
del dfs | ||
|
||
if index: | ||
df = df.set_index(meta[INDEX]) | ||
df.set_index(meta[INDEX], inplace=True) | ||
if meta[TYPE] == 'series': | ||
return df[df.columns[0]] | ||
return df | ||
|
Uh oh!
There was an error while loading. Please reload this page.