Feature request: Series.flatmap, DataFrame.flatmap #8517
This is pretty straightforward (and pretty fast), to simply do this
|
s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'])
s.str.split(r'\s+').apply(pd.Series).unstack().T.dropna().reset_index(1, drop=True)
Out[36]:
0 This
0 and
0 and
1 is
1 here
1 3
2 text
2 is
3 No
3 no.
4 1.
4 2
dtype: object

That does work with slight tweaking, thanks. However, I have about 15 MB of text chunks. This seems like a natural operation to have supported: in map-reduce terms, pandas is great at processing rows, but converting one row into multiple rows (the map stage) just isn't there as a stock operation. |
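For reference, later pandas releases (0.25+) added `Series.explode`, which is effectively the flat-map step being asked for here: one output row per list element, repeating the original index label. A minimal sketch on the same data:

```python
import pandas as pd

s = pd.Series(['This is text No 1.', 'and here is no. 2', 'and 3'])

# split each string into a list of tokens, then emit one row per token;
# explode repeats the original index label for every element of the list
out = s.str.split().explode()
```

This produces 12 rows with index labels 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, without the unstack/dropna footwork.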
not really sure what you are doing. If you don't need the indices then just iterate, split, and concatenate. |
As much as I like the flatMap interface, I'm not sure how you would apply it to a multi-column DataFrame, except perhaps by replicating whole rows (is that operation useful?). For a Series of strings, there's nothing to vectorize, and unless you (ab)use a character-wise representation of strings (which you can't do with pandas), there's not much you can do to beat the performance of something along the lines of

In [38]: s = pd.Series(['This is text No 1.', 'and here is no. 2','and 3']); s
Out[38]:
0 This is text No 1.
1 and here is no. 2
2 and 3
dtype: object
In [39]: mapper = lambda x: pd.lib.list_to_object_array(x.split())
In [40]: vals = [(k, mapper(v)) for k, v in s.iteritems()]
In [41]: pd.Series(
np.concatenate([v[1] for v in vals]),
np.concatenate([np.repeat(v[0], len(v[1])) for v in vals]),
name=s.name)
Out[41]:
0 This
0 is
0 text
0 No
0 1.
1 and
1 here
1 is
1 no.
1 2
2 and
2 3
dtype: object |
"Is that operation useful?" It is for what I'm trying to do; I'd argue that for text data it would be very useful. Perhaps with the index gaining another level to enumerate the generated sub-results within the original index label. Thanks for the example — lots of footwork, but it doesn't throw away memory like jreback's hack. You're both basically saying do it somewhere else and then throw it in a DataFrame. Not awesome, but of course always an option. groupby.apply() converts a returned series into multiple rows, doesn't it? Why can't pandas have similar functionality for a DataFrame apply-style operation? |
@kay1793 why don't you show a copy/pastable, end-to-end example of what you are actually trying to do. Are you trying to, say, get word counts? Or a dummies matrix? (That's what my soln does, but if you have a large word frequency it WILL blow up memory; that assumes you want to track the index, and apparently you don't.) @immerrr's soln is just fine for what you are showing, but then you are not harnessing the power of pandas at all — it's just list manipulation. An example is worth 1000s of words. |
Fair enough. How do I get from this:

In [8]: df=pd.DataFrame([['2014-01-01T00:01:00','Alice',"It was a dark and stormy night. Alice and bob were at their computer"],
   ...: ['2014-01-02T00:02:23','Bob',"Detective Morgan was 9 feet tall. Then the mirror broke. He cut his left foot on the shards"]],columns=['ts','name','text'])
   ...: df

to this:

In [10]: df=pd.DataFrame([['2014-01-01T00:01:00','Alice',"It was a dark and stormy night"],
...: ['2014-01-01T00:01:00','Alice',"Alice and bob were at their computer"],
...: ['2014-01-02T00:02:23','Bob',"Detective Morgan was 9 feet tall"],
...: ['2014-01-02T00:02:23','Bob',"Then the mirror broke"],
...: ['2014-01-02T00:02:23','Bob',"He cut his left foot on the shards"]],
...: columns=['ts','name','text'],index=pd.MultiIndex.from_arrays([[0,0,1,1,1],[0,1,0,1,2]]))
...: df
...:
Out[10]:
ts name text
0 0 2014-01-01T00:01:00 Alice It was a dark and stormy night
1 2014-01-01T00:01:00 Alice Alice and bob were at their computer
1 0 2014-01-02T00:02:23 Bob Detective Morgan was 9 feet tall
1 2014-01-02T00:02:23 Bob Then the mirror broke
2 2014-01-02T00:02:23 Bob He cut his left foot on the shards

The index is unique; nothing else is, in general. One chunk may have 50 (or 1000) paragraphs, others just one. Once I get it in this form, I can easily do summary stats and selection queries, looking for patterns. Basically, I'm using a pandas DataFrame as my corpus. |
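A sketch of this exact transformation using features pandas later gained (assuming pandas ≥ 0.25 for `explode` and ≥ 1.4 for the `regex=` keyword of `str.split`; the sentence-splitting regex `\.\s+` is a simplification for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [['2014-01-01T00:01:00', 'Alice',
      "It was a dark and stormy night. Alice and bob were at their computer"],
     ['2014-01-02T00:02:23', 'Bob',
      "Detective Morgan was 9 feet tall. Then the mirror broke. "
      "He cut his left foot on the shards"]],
    columns=['ts', 'name', 'text'])

# split each chunk into sentences, then emit one row per sentence,
# repeating ts/name and the original index label
out = df.assign(text=df['text'].str.split(r'\.\s+', regex=True)).explode('text')

# add a second index level numbering the sentences within each chunk
out = out.set_index(out.groupby(level=0).cumcount().to_numpy(), append=True)
```

This yields the five-row frame with the (chunk, sentence) MultiIndex shown above.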
Which was exactly my point: if we're doing native object manipulation, there's not much pandas can help with performance-wise. As for the example, it's helpful indeed. So the desired output is equivalent to taking out that one column, flat-mapping it as a Series while adding an integer index level, and joining it back with the rest of the dataframe on the original index. |
I think the performance of Sol #1 was due to bad complexity; otherwise, non-vectorized is good enough for my data sizes.
Yes, but with 2 sugars please. I've come to think of pandas as THE way to work with data, and now that I'm working with a corpus it lacks some operations that appear generally useful and common. The groupby point I made earlier seems solid to me; I hope you'll agree. |
closing, but pls reopen if the issue persists. |
Just came across this and I think it's worth reopening. Some sugar for |
We would take a cookbook entry for these recipes; the approach in https://stackoverflow.com/questions/31080258/pysparks-flatmap-in-pandas is most idiomatic. |
This is how I'm doing it for now. It's not the most performant way, but it works:
|
I'm supportive of adding a |
Is this being considered? I think having a flatMap or bind operation would be very interesting and useful (besides being theoretically satisfying for people with a functional programming background, but this isn't really an argument). I had a use-case today. I have a dataframe of events, with start time and end time:
And we'd like to transform this to a dataframe that has a row for each minute when an event was actually happening:
There are lots of ways to do this, but if we had a flatMap method this would be almost trivial:

def make_time_table(row):
    return DataFrame(
        {'event': row['event'], 'time': time}
        for time in pandas.date_range(...)
    )

df.flatMap(make_time_table)

Or something like it. |
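A runnable version of this sketch, with a generic row-to-DataFrame `flat_map` built from `pd.concat` (the `flat_map` helper and the end-exclusive minute range are illustrative assumptions; `inclusive='left'` requires pandas ≥ 1.4):

```python
import pandas as pd

def flat_map(df, func):
    # apply func (row -> DataFrame) to each row and stack the results
    return pd.concat([func(row) for _, row in df.iterrows()],
                     ignore_index=True)

events = pd.DataFrame({
    'event': ['standup'],
    'start': [pd.Timestamp('2020-01-01 09:00')],
    'end':   [pd.Timestamp('2020-01-01 09:03')],
})

def make_time_table(row):
    # one row per minute the event was running (end treated as exclusive)
    times = pd.date_range(row['start'], row['end'],
                          freq='min', inclusive='left')
    return pd.DataFrame({'event': row['event'], 'time': times})

table = flat_map(events, make_time_table)
```

For a one-row event from 09:00 to 09:03 this yields three rows, one per minute.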
I've been using @timotta 's function above which has been surprisingly fast |
The topic I reference in the above link (https://stackoverflow.com/questions/12680754/split-pandas-dataframe-string-entry-to-separate-rows) also leads to https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows. |
I have a long text in one of the dataframe columns. I need to be able to split the text into 2 or more columns depending on the length of that cell. |
Was this closed because of opposition to storing lists in dataframes/series? I don't think that flatmap requires this, and the lack of flatmap is a fairly big problem with non-trivial and non-performant workarounds. (Plus Scala, Spark, and many other functional systems have it as a first-class operation.) |
see #16538 which has a PR |
Works great for me. Thanks @timotta! Slightly modified to simplify and to be more similar to the original poster's wished-for syntax:

def flatmap(self, func):
    rows = []
    for idx, row in self.iterrows():
        multrows = func(row)
        rows.extend(multrows)
    return pd.DataFrame.from_records(rows)

pd.DataFrame.flatmap = flatmap
df.flatmap(lambda x: ....) |
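A self-contained demonstration of this monkey-patch in use (the toy `func` returning a list of record dicts is an assumption; any iterable of records that `from_records` accepts would work):

```python
import pandas as pd

def flatmap(self, func):
    # func maps one row (a Series) to an iterable of record dicts,
    # one dict per output row
    rows = []
    for idx, row in self.iterrows():
        rows.extend(func(row))
    return pd.DataFrame.from_records(rows)

pd.DataFrame.flatmap = flatmap  # monkey-patch, as in the comment above

df = pd.DataFrame({'text': ['a b', 'c']})
out = df.flatmap(lambda row: [{'text': w} for w in row['text'].split()])
```

Note that `from_records` builds a fresh default index, so the original index labels are not preserved, unlike the explode-style approaches discussed earlier.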
I'm working on some language analysis and using pandas to munge the data and grab some descriptive stats. This is just an illustrative example; I'm doing all kinds of slightly different things.
Suppose I have a series containing chunks of text, and I want to turn each line into multiple lines, preserving the index values. Here are the naive results:
What I'd like to be able to do is (made-up example):
In general, I'd like to be able to explode a single row in a dataframe into multiple rows, by transforming one column value into multiple values, each becoming a new row with the values of the other columns preserved, for example:
Perhaps there's another way to do this, but that's how my natural instinct suggests this should be done; flatmap is a fairly universal concept.
Groupby already does similar things based on return type; it doesn't have to be limited to groupby though.
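For the record, this request was eventually addressed by `DataFrame.explode` (pandas 0.25+), which does exactly what is described here — one list-valued column becomes multiple rows, with the other columns' values and the index label repeated:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'words': [['a', 'b'], ['c']]})

# one row per list element; 'id' and the index label are repeated
out = df.explode('words')
```

The original row 0 becomes two rows (both with id 1 and index label 0), and row 1 stays a single row.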