@@ -17,6 +17,8 @@ Reshaping and Pivot Tables
 Reshaping by pivoting DataFrame objects
 ---------------------------------------
 
+.. image:: _static/reshaping_pivot.png
+
 .. ipython::
    :suppress:
 
@@ -33,8 +35,7 @@ Reshaping by pivoting DataFrame objects
 
    In [3]: df = unpivot(tm.makeTimeDataFrame())
 
-Data is often stored in CSV files or databases in so-called "stacked" or
-"record" format:
+Data is often stored in so-called "stacked" or "record" format:
 
 .. ipython:: python
 
@@ -66,8 +67,6 @@ To select out everything for variable ``A`` we could do:
 
    df[df['variable'] == 'A']
 
-.. image:: _static/reshaping_pivot.png
-
 But suppose we wish to do time series operations with the variables. A better
 representation would be where the ``columns`` are the unique variables and an
 ``index`` of dates identifies individual observations. To reshape the data into
@@ -87,7 +86,7 @@ column:
 .. ipython:: python
 
    df['value2'] = df['value'] * 2
-   pivoted = df.pivot('date', 'variable')
+   pivoted = df.pivot(index='date', columns='variable')
    pivoted
 
 You can then select subsets from the pivoted ``DataFrame``:
@@ -99,6 +98,12 @@ You can then select subsets from the pivoted ``DataFrame``:
 Note that this returns a view on the underlying data in the case where the data
 are homogeneously-typed.
 
+.. note::
+   :func:`~pandas.pivot` will error with a ``ValueError: Index contains duplicate
+   entries, cannot reshape`` if the index/column pair is not unique. In this
+   case, consider using :func:`~pandas.pivot_table` which is a generalization
+   of pivot that can handle duplicate values for one index/column pair.
+
 .. _reshaping.stacking:
 
 Reshaping by stacking and unstacking
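
Reviewer sketch (not part of the patch): the behavior described in the note added by this hunk can be checked on a small frame. The data and the names ``row``, ``col``, and ``val`` below are hypothetical; only ``DataFrame.pivot`` and ``DataFrame.pivot_table`` come from the text.

```python
import pandas as pd

# Hypothetical toy data: the ("r0", "c0") index/column pair occurs twice.
df = pd.DataFrame({
    "row": ["r0", "r0", "r1"],
    "col": ["c0", "c0", "c1"],
    "val": [1.0, 3.0, 5.0],
})

# pivot cannot place two values in one cell, so it raises ValueError.
try:
    df.pivot(index="row", columns="col", values="val")
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)

# pivot_table aggregates the duplicates (mean by default) instead of raising.
table = df.pivot_table(index="row", columns="col", values="val")
```

Here the duplicated pair collapses to a single cell holding the mean of its two values, while pairs that never occur stay missing.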
@@ -704,10 +709,103 @@ handling of NaN:
     In [3]: np.unique(x, return_inverse=True)[::-1]
     Out[3]: (array([3, 3, 0, 4, 1, 2]), array([nan, 3.14, inf, 'A', 'B'], dtype=object))
 
-
 .. note::
     If you just want to handle one column as a categorical variable (like R's factor),
     you can use ``df["cat_col"] = pd.Categorical(df["col"])`` or
     ``df["cat_col"] = df["col"].astype("category")``. For full docs on :class:`~pandas.Categorical`,
     see the :ref:`Categorical introduction <categorical>` and the
     :ref:`API documentation <api.categorical>`.
+
+Examples
+--------
+
+In this section, we will review frequently asked questions and examples. The
+column names and relevant column values are named to correspond with how this
+DataFrame will be pivoted in the answers below.
+
+.. ipython:: python
+
+   np.random.seed([3, 1415])
+   n = 20
+
+   cols = np.array(['key', 'row', 'item', 'col'])
+   df = cols + pd.DataFrame((np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str))
+   df.columns = cols
+   df = df.join(pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val'))
+
+   df
+
+Pivoting with Single Aggregations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Suppose we want to pivot ``df`` such that the ``col`` values are the columns,
+the ``row`` values are the index, and the means of ``val0`` are the values. In
+particular, the resulting DataFrame should look like:
+
+.. code-block:: ipython
+
+   col   col0   col1   col2   col3  col4
+   row
+   row0  0.77  0.605    NaN  0.860  0.65
+   row2  0.13    NaN  0.395  0.500  0.25
+   row3   NaN  0.310    NaN  0.545   NaN
+   row4   NaN  0.100  0.395  0.760  0.24
+
+This solution uses :func:`~pandas.pivot_table`. Also note that
+``aggfunc='mean'`` is the default. It is included here to be explicit.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='mean')
+
+Note that we can also replace the missing values by using the ``fill_value``
+parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='mean', fill_value=0)
+
+We can pass in other aggregation functions as well. For example, we can pass
+in ``sum``.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc='sum', fill_value=0)
+
+Another aggregation we can do is to calculate the frequency with which the
+columns and rows occur together, a.k.a. "cross tabulation". To do this, we can
+pass ``size`` to the ``aggfunc`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
+
+Pivoting with Multiple Aggregations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We can also perform multiple aggregations. For example, to perform both a
+``sum`` and ``mean``, we can pass in a list to the ``aggfunc`` argument.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values='val0', index='row', columns='col', aggfunc=['mean', 'sum'])
+
+Note that to aggregate over multiple value columns, we can pass in a list to
+the ``values`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values=['val0', 'val1'], index='row', columns='col', aggfunc=['mean'])
+
+Note that to subdivide over multiple columns, we can pass in a list to the
+``columns`` parameter.
+
+.. ipython:: python
+
+   df.pivot_table(
+       values=['val0'], index='row', columns=['item', 'col'], aggfunc=['mean'])
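
Reviewer sketch (not part of the patch): the aggregation variants added in this hunk can be sanity-checked on a frame small enough to verify by hand. The four-row data below is hypothetical and only reuses the column names from the patch. The comparison with ``pd.crosstab`` is an assumption about equivalence on this toy input, not a claim made by the patch.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the patch's example.
df = pd.DataFrame({
    "row": ["row0", "row0", "row0", "row1"],
    "col": ["col0", "col0", "col1", "col1"],
    "val0": [1.0, 3.0, 5.0, 7.0],
})

# Single aggregation: (row, col) pairs that never occur are filled with 0.
mean_table = df.pivot_table(
    values="val0", index="row", columns="col", aggfunc="mean", fill_value=0)

# Frequency of each (row, col) pair, compared against pd.crosstab.
counts = df.pivot_table(index="row", columns="col", aggfunc="size", fill_value=0)
xtab = pd.crosstab(df["row"], df["col"])

# A list of aggregations adds an outer column level, one entry per function.
multi = df.pivot_table(
    values="val0", index="row", columns="col", aggfunc=["mean", "sum"])
```

With a list passed to ``aggfunc``, individual cells are addressed by a tuple such as ``("sum", "col0")``, and pairs with no observations remain ``NaN`` unless ``fill_value`` is given.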