Data Transformation and Aggregation QNA
Question: Given a DataFrame `df` with a numeric column called `sales`, how can you find the total (sum) of the `sales` column?
Answer & Explanation:
```python
import pandas as pd

# Example DataFrame
data = {
    'sales': [200, 300, 150, 400],
    'region': ['North', 'South', 'East', 'West']
}
df = pd.DataFrame(data)

# Compute sum of 'sales' column
total_sales = df['sales'].sum()
print(total_sales)  # Output: 1050
```
- `df['sales'].sum()` returns the sum of all values in the `sales` column.
- Common aggregates include `.sum()`, `.mean()`, `.max()`, `.min()`, and `.count()`.
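As a quick illustration, the other aggregates listed above can be applied the same way (a minimal sketch reusing the example data):

```python
import pandas as pd

df = pd.DataFrame({'sales': [200, 300, 150, 400]})

print(df['sales'].mean())   # 262.5
print(df['sales'].max())    # 400
print(df['sales'].min())    # 150
print(df['sales'].count())  # 4
```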
Question: Suppose you have a DataFrame `df` with columns `region` and `sales`. How do you find the total `sales` per `region`?
Answer & Explanation:
```python
import pandas as pd

# Example DataFrame
data = {
    'region': ['North', 'South', 'East', 'East', 'North', 'West', 'South'],
    'sales': [200, 300, 250, 100, 400, 150, 350]
}
df = pd.DataFrame(data)

# Group by 'region' and compute sum of 'sales'
sales_by_region = df.groupby('region')['sales'].sum()
print(sales_by_region)
```
Output:
```
region
East     350
North    600
South    650
West     150
Name: sales, dtype: int64
```
- `df.groupby('region')` creates a GroupBy object grouped by `region`.
- `['sales'].sum()` applies the sum aggregation to the `sales` column.
Question: In the same `df` with `region` and `sales`, how can you get the sum, mean, and count of `sales` in one line of code?
Answer & Explanation:
```python
aggregations = df.groupby('region')['sales'].agg(['sum', 'mean', 'count'])
print(aggregations)
```
Output (example):
```
        sum   mean  count
region
East    350  175.0      2
North   600  300.0      2
South   650  325.0      2
West    150  150.0      1
```
- `.agg([...])` allows applying multiple functions at once.
- The result is a DataFrame with each aggregation as a separate column.
Question: If you have a DataFrame with columns `region`, `store_type`, and `sales`, how can you group by both `region` and `store_type` to find the total `sales`?
Answer & Explanation:
```python
data = {
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail', 'Online', 'Retail'],
    'sales': [300, 200, 400, 150, 250, 100]
}
df = pd.DataFrame(data)

# Group by two columns
grouped_sales = df.groupby(['region', 'store_type'])['sales'].sum()
print(grouped_sales)
```
Output (example):
```
region  store_type
East    Online        250
        Retail        100
North   Online        300
        Retail        200
South   Online        400
        Retail        150
Name: sales, dtype: int64
```
- `df.groupby(['region', 'store_type'])` creates groups based on both columns.
- You can append `.reset_index()` if you want a standard (flat) DataFrame instead of a MultiIndex result.
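A minimal sketch of the `.reset_index()` point (data assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'store_type': ['Online', 'Retail', 'Online'],
    'sales': [300, 200, 400]
})

# Without reset_index: a Series with a MultiIndex (region, store_type)
grouped = df.groupby(['region', 'store_type'])['sales'].sum()

# With reset_index: a flat DataFrame with ordinary columns
flat = grouped.reset_index()
print(flat.columns.tolist())  # ['region', 'store_type', 'sales']
```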
Question: Given a DataFrame `df` with columns `region` and `sales`, how can you find only those regions whose total `sales` exceed 500?
Answer & Explanation:
```python
grouped = df.groupby('region')['sales'].sum()

# Filter to get regions with sales > 500
filtered_regions = grouped[grouped > 500]
print(filtered_regions)
```
- First, group by `region` and sum the `sales`.
- Then, select the groups whose sum is above a threshold (in this case, 500).
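If you instead want the original rows belonging to those regions (rather than just the summed totals), `GroupBy.filter` is one option; a sketch with assumed data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'East', 'North', 'West', 'South'],
    'sales': [200, 300, 250, 100, 400, 150, 350]
})

# Keep only the rows whose region's total sales exceed 500
rows_in_big_regions = df.groupby('region').filter(lambda g: g['sales'].sum() > 500)
print(rows_in_big_regions['region'].unique())
```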
Question: Suppose `df` has columns `region`, `store_type`, and `sales`. How do you create a pivot table that shows the total `sales` for each `region` vs. each `store_type`?
Answer & Explanation:
```python
pivot_df = df.pivot_table(values='sales',
                          index='region',
                          columns='store_type',
                          aggfunc='sum',
                          fill_value=0)
print(pivot_df)
```
- `pivot_table` aggregates data similarly to Excel pivot tables.
- `values` specifies the column to aggregate, `index` and `columns` define the pivot axes, and `aggfunc` is the aggregation function.
- `fill_value=0` replaces missing values with 0.
Question: Given a time-indexed DataFrame `df` with a numeric column called `sales`, how do you compute a 7-day rolling mean of `sales`?
Answer & Explanation:
```python
import pandas as pd

# Example: create a date range and sample data
dates = pd.date_range(start='2023-01-01', periods=14, freq='D')
sales_data = [200, 220, 250, 180, 300, 350, 400, 250, 270, 280, 260, 300, 310, 350]
df = pd.DataFrame({'sales': sales_data}, index=dates)

# Compute 7-day rolling mean
df['rolling_mean_7d'] = df['sales'].rolling(window=7).mean()
print(df)
```
- `.rolling(window=7)` creates a rolling window of size 7; the first six rows produce `NaN` because fewer than seven observations are available.
- `.mean()` calculates the mean value within each rolling window.
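One detail worth noting: the leading `NaN` rows can be relaxed with `min_periods`. A minimal sketch:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Default: NaN until the window holds 7 observations
default = s.rolling(window=7).mean()

# min_periods=1: start averaging as soon as one value is available
relaxed = s.rolling(window=7, min_periods=1).mean()

print(default.isna().sum())  # 6 leading NaNs
print(relaxed.isna().sum())  # 0
```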
Question: If you want to use your own function (e.g., finding the range of `sales`), how can you apply a custom function after a groupby operation?
Answer & Explanation:
```python
def sales_range(x):
    return x.max() - x.min()

range_by_region = df.groupby('region')['sales'].apply(sales_range)
print(range_by_region)
```
- You define a function `sales_range` that calculates `max - min`.
- `.apply()` allows you to use any custom function on each group.
Question: If you have multiple custom functions (or a mix of built-in and custom) to apply on `sales`, how do you handle that in a single call?
Answer & Explanation:
```python
def sales_range(x):
    return x.max() - x.min()

# Named aggregation: each keyword becomes a column name; its value is a
# built-in function name (as a string) or a custom function
results = df.groupby('region')['sales'].agg(
    sum_sales='sum',
    mean_sales='mean',
    sales_range=sales_range
)
print(results)
```
- Use named aggregation with `.agg(...)`: each keyword argument names an output column, and its value is either a built-in function name (as a string) or a custom function. (Passing a plain dict of new names to functions to a single-column groupby raises a `SpecificationError` in modern pandas.)
- The result is a DataFrame with a column for each aggregation.
Question: You want the sum of `sales` only for records where `sales > 200`. How can you do that?
Answer & Explanation:
```python
import numpy as np

# One approach: mask values inside assign, then group
conditional_sum = df.assign(
    sales_filtered=np.where(df['sales'] > 200, df['sales'], 0)
).groupby('region')['sales_filtered'].sum()
print(conditional_sum)
```
- First, create a temporary column `sales_filtered` that keeps the `sales` value if it's above 200, otherwise 0.
- Then, group by `region` and sum that column.
- Alternatively, you could filter the DataFrame before grouping, but sometimes you need the entire DataFrame context for other computations, so this approach is flexible.
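The filter-before-grouping alternative mentioned above can be sketched as follows (data assumed for illustration); note that regions whose rows are all filtered out disappear entirely rather than showing 0:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'South'],
    'sales': [300, 150, 100, 400]
})

# Filter rows first, then group and sum
conditional_sum = df[df['sales'] > 200].groupby('region')['sales'].sum()
print(conditional_sum)
```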
- Remember common aggregations: `sum`, `mean`, `count`, `max`, `min`, and `std` (standard deviation).
- Use `agg()` for multiple functions: pass a list of built-ins, a dict mapping columns to functions, or named aggregation (`agg(new_name=func)`).
- Reset index: if you want a "flat" DataFrame after grouping, use `.reset_index()`.
- Pivot tables: great for summarizing data with multi-dimensional grouping and aggregations in a cross-tab format.
- Rolling, expanding, and cumulative aggregations: `.rolling()`, `.expanding()`, `.cumsum()`, etc., are important for time-series or sequential data.
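The expanding and cumulative variants mentioned above can be sketched briefly (a minimal example):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# Expanding mean: mean of all values seen so far
print(s.expanding().mean().tolist())  # [10.0, 15.0, 20.0, 25.0]

# Cumulative sum: running total
print(s.cumsum().tolist())  # [10, 30, 60, 100]
```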
Question: What is the difference between `pivot` and `pivot_table`?
Answer & Explanation:
- `pivot`:
  - Reshapes data (long to wide) based on unique values of an index/column pair.
  - Does not allow aggregation; it expects your data to have a single value for each index-column pair.
  - If the data contains duplicates for the same index-column pair, it raises an error.
- `pivot_table`:
  - Similar to `pivot` but allows aggregation when there are duplicates for the same index-column combination.
  - You can specify `aggfunc` (e.g., `sum`, `mean`, etc.) and `fill_value` for missing values.
Question Example:
You have a DataFrame `df` with columns: `date`, `store_type`, and `sales`. Each combination of `(date, store_type)` appears exactly once. How do you reshape it so that `store_type` values become separate columns, and the rows are indexed by `date`?
```python
import pandas as pd

data = {
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail'],
    'sales': [100, 150, 120, 130]
}
df = pd.DataFrame(data)
```
Answer & Explanation:
```python
wide_df = df.pivot(index='date', columns='store_type', values='sales')
print(wide_df)
```
- `pivot` reshapes the DataFrame such that each unique value in `store_type` becomes a new column.
- The `date` column becomes the index.
- The `sales` column fills the values in the new wide format.
Question Example:
What happens if your data contains multiple rows for the same `(index, columns)` pair when using `pivot`? How do you solve this?
Answer & Explanation:
- `pivot` will raise a `ValueError` if there are duplicate combinations, because it doesn't know how to handle multiple values for the same cell.
- To handle duplicates gracefully, use `pivot_table` and specify an aggregation function (`aggfunc`). For example:

```python
wide_df = df.pivot_table(
    index='date',
    columns='store_type',
    values='sales',
    aggfunc='sum',
    fill_value=0
)
```
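A runnable sketch of this behavior, using hypothetical duplicate rows:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02'],
    'store_type': ['Online', 'Online', 'Retail'],  # duplicate (date, store_type) pair
    'sales': [100, 50, 130]
})

# pivot cannot decide between the two Online values for 2023-01-01
try:
    df.pivot(index='date', columns='store_type', values='sales')
except ValueError as e:
    print('pivot failed:', e)

# pivot_table aggregates the duplicates instead
wide = df.pivot_table(index='date', columns='store_type',
                      values='sales', aggfunc='sum', fill_value=0)
print(wide)
```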
Question Example:
Given a DataFrame `df` with columns `region`, `store_type`, and `sales`, how do you create a pivot table that shows the sum of `sales` for each `(region, store_type)` combination?
```python
data = {
    'region': ['North', 'North', 'South', 'South', 'East'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail', 'Online'],
    'sales': [100, 200, 300, 400, 150]
}
df = pd.DataFrame(data)
```
Answer & Explanation:
```python
pivot_df = df.pivot_table(
    values='sales',
    index='region',
    columns='store_type',
    aggfunc='sum',
    fill_value=0
)
print(pivot_df)
```
- `pivot_table` aggregates the data; note that the default `aggfunc` is `'mean'`, so `aggfunc='sum'` must be specified explicitly here.
- Missing combinations are filled with 0 instead of `NaN` by using `fill_value=0`.
Question Example:
You have a wide DataFrame `wide_df` with columns `['date', 'Online', 'Retail']`. How do you convert it into a long format so that you have columns `date`, `store_type`, and `sales` again?
```python
wide_df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Online': [100, 120],
    'Retail': [150, 130]
})
```
Answer & Explanation:
```python
long_df = wide_df.melt(
    id_vars='date',
    value_vars=['Online', 'Retail'],
    var_name='store_type',
    value_name='sales'
)
print(long_df)
```
- `melt` transforms columns into rows, creating two new columns: one for the original column names (`var_name`) and one for their values (`value_name`).
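As a sanity check, the melted frame can be pivoted back to the original wide shape (a sketch with the same example data):

```python
import pandas as pd

wide_df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Online': [100, 120],
    'Retail': [150, 130]
})

long_df = wide_df.melt(id_vars='date', value_vars=['Online', 'Retail'],
                       var_name='store_type', value_name='sales')
print(long_df.shape)  # (4, 3): one row per (date, store_type) pair

# Pivoting back recovers the wide layout
round_trip = long_df.pivot(index='date', columns='store_type', values='sales')
print(round_trip)
```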
Question Example:
What is the difference between `stack()` and `melt()`?
Answer & Explanation:
- `stack()`:
  - Rotates (pivots) the columns of a DataFrame into its index.
  - Typically used when your DataFrame has MultiIndex columns.
  - The result is a Series when all column levels are stacked, or a DataFrame when column levels remain.
- `melt()`:
  - Primarily used to "unpivot" from wide to long format, producing explicit columns (`id_vars`, a variable-name column, a value column).
  - More flexible for typical tabular transformations, especially when you don't have multi-level columns.
Here’s a quick example with `stack()`:

```python
df = pd.DataFrame({
    ('sales', 'Online'): [100, 120],
    ('sales', 'Retail'): [150, 130]
}, index=['2023-01-01', '2023-01-02'])

stacked_df = df.stack()
print(stacked_df)
```
Question Example:
Given a MultiIndex DataFrame, how can you transform the inner level of the index into columns?
Answer & Explanation:
- `unstack(level=-1)` (by default, the last level) transforms the specified level of the index into columns.

```python
stacked = df.stack()
unstacked = stacked.unstack()  # Convert the last level of the index to columns
print(unstacked)
```

- `unstack` is especially useful after a groupby operation that results in a MultiIndex Series or DataFrame.
Question Example:
You might have a “long” DataFrame where one column contains multiple variables (e.g., “Sales_Online” and “Sales_Retail”) that should be split into different columns. What’s the approach?
Answer & Explanation:
- Often, you first melt your DataFrame to get a single “variable” column, then split that column into multiple parts using string operations, and finally pivot or pivot_table to restructure as needed.
For example:

```python
# Assuming a melted df with columns 'date', 'variable' (e.g., "Sales_Online"), and 'value'
df['variable_split'] = df['variable'].str.split('_', expand=True)[1]  # e.g., "Online" / "Retail"
wide_df = df.pivot(index='date', columns='variable_split', values='value')
```
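A self-contained sketch of this melt-then-split pattern (the column names `Sales_Online`/`Sales_Retail` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Sales_Online': [100, 120],
    'Sales_Retail': [150, 130]
})

# 1. Melt to long format: one 'variable' column holds the combined names
long_df = df.melt(id_vars='date', var_name='variable', value_name='value')

# 2. Split "Sales_Online" -> "Online" via string operations
long_df['store_type'] = long_df['variable'].str.split('_', expand=True)[1]

# 3. Pivot back to wide, one column per store type
wide_df = long_df.pivot(index='date', columns='store_type', values='value')
print(wide_df)
```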
Question Example:
Given a DataFrame with `region`, `sales`, and `quantity`, how do you create a pivot table that shows both the average `sales` and the total `quantity` by `region`?
Answer & Explanation:
```python
pivot_df = df.pivot_table(
    index='region',
    values=['sales', 'quantity'],
    aggfunc={'sales': 'mean', 'quantity': 'sum'}
)
print(pivot_df)
```
- You can pass a dictionary to `aggfunc`, mapping each column to an aggregation function.
- The output is a DataFrame showing different aggregations for different columns.
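A runnable sketch with assumed example data:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'sales': [100, 200, 300],
    'quantity': [1, 2, 3]
})

# Different aggregation per column via an aggfunc dict
pivot_df = df.pivot_table(
    index='region',
    values=['sales', 'quantity'],
    aggfunc={'sales': 'mean', 'quantity': 'sum'}
)
print(pivot_df)
```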
Question Example:
How do you handle missing values when pivoting or melting?
Answer & Explanation:
- In a `pivot_table`, you can specify `fill_value` to replace `NaN` with a specific value (e.g., `0`).
- You can use data cleaning techniques (e.g., `dropna()`, `fillna()`) before or after pivot/melt.
- For example:

```python
pivot_df = df.pivot_table(
    index='region',
    columns='store_type',
    values='sales',
    aggfunc='sum',
    fill_value=0
)
```
- If using `melt`, rows with `NaN` values are kept by default; drop them afterwards with `.dropna()` on the value column if needed. (The `ignore_index=True` default simply renumbers the resulting index; it does not drop `NaN` rows.)
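A sketch of dropping `NaN` rows after melting (data assumed for illustration):

```python
import pandas as pd
import numpy as np

wide_df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Online': [100, np.nan],
    'Retail': [150, 130]
})

long_df = wide_df.melt(id_vars='date', var_name='store_type', value_name='sales')
print(len(long_df))  # 4

# Drop the row whose melted 'sales' value is NaN
clean_df = long_df.dropna(subset=['sales'])
print(len(clean_df))  # 3
```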
- `pivot` vs. `pivot_table`: use `pivot_table` if duplicates exist or you need aggregations.
- `melt` (wide to long): helps unpivot data into a tall, skinny table.
- `stack`/`unstack`: primarily for MultiIndex transformations.
- Multiple aggregations: use dictionaries in `pivot_table` or `groupby().agg()`.
- Handling missing values: `fill_value=...`, `dropna()`, or `fillna()` where needed.