Data Transformation and Aggregation QNA

Praveen Kumar Anwla edited this page Jan 15, 2025 · 2 revisions

Question 1: What are the top 10 questions on data aggregation using pandas?

1. How do you compute a simple aggregate (e.g., sum) of a column in a pandas DataFrame?

Question: Given a DataFrame df with a numeric column called sales, how can you find the total (sum) of the sales column?

Answer & Explanation:

import pandas as pd

# Example DataFrame
data = {
    'sales': [200, 300, 150, 400],
    'region': ['North', 'South', 'East', 'West']
}
df = pd.DataFrame(data)

# Compute sum of 'sales' column
total_sales = df['sales'].sum()
print(total_sales)  # Output: 1050
  • df['sales'].sum() returns the sum of all values in the sales column.
  • Common aggregates include .sum(), .mean(), .max(), .min(), and .count().

2. How do you group by a specific column and compute summary statistics?

Question: Suppose you have a DataFrame df with columns region and sales. How do you find the total sales per region?

Answer & Explanation:

import pandas as pd

# Example DataFrame
data = {
    'region': ['North', 'South', 'East', 'East', 'North', 'West', 'South'],
    'sales': [200, 300, 250, 100, 400, 150, 350]
}
df = pd.DataFrame(data)

# Group by 'region' and compute sum of 'sales'
sales_by_region = df.groupby('region')['sales'].sum()
print(sales_by_region)

Output:

region
East     350
North    600
South    650
West     150
Name: sales, dtype: int64
  • df.groupby('region') creates a GroupBy object grouped by region.
  • ['sales'].sum() applies the sum aggregation to the sales column.

3. How do you apply multiple aggregations (e.g., sum, mean, count) using a single groupby operation?

Question: In the same df with region and sales, how can you get the sum, mean, and count of sales in one line of code?

Answer & Explanation:

aggregations = df.groupby('region')['sales'].agg(['sum', 'mean', 'count'])
print(aggregations)

Output (example):

         sum   mean  count
region                    
East     350  175.0      2
North    600  300.0      2
South    650  325.0      2
West     150  150.0      1
  • .agg([...]) allows multiple functions.
  • The result is a DataFrame with each aggregation as a separate column.

4. How do you group by multiple columns and perform an aggregation?

Question: If you have a DataFrame with columns region, store_type, and sales, how can you group by both region and store_type to find the total sales?

Answer & Explanation:

data = {
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail', 'Online', 'Retail'],
    'sales': [300, 200, 400, 150, 250, 100]
}
df = pd.DataFrame(data)

# Group by two columns
grouped_sales = df.groupby(['region', 'store_type'])['sales'].sum()
print(grouped_sales)

Output (example):

region  store_type
East    Online         250
        Retail         100
North   Online         300
        Retail         200
South   Online         400
        Retail         150
Name: sales, dtype: int64
  • df.groupby(['region', 'store_type']) creates groups based on both columns.
  • You can also reset the index by appending .reset_index() if you want a standard DataFrame format.
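The .reset_index() tip can be sketched like this (same data as above; the flat variable name is just for illustration):

```python
import pandas as pd

# Same data as in this question
data = {
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail', 'Online', 'Retail'],
    'sales': [300, 200, 400, 150, 250, 100]
}
df = pd.DataFrame(data)

# .reset_index() turns the (region, store_type) group keys back into ordinary columns
flat = df.groupby(['region', 'store_type'])['sales'].sum().reset_index()
print(flat.columns.tolist())  # ['region', 'store_type', 'sales']
```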

5. How do you use groupby and then filter groups based on an aggregation result?

Question: Given a DataFrame df with columns region and sales, how can you find only those regions whose total sales exceed 500?

Answer & Explanation:

grouped = df.groupby('region')['sales'].sum()

# Filter to get regions with sales > 500
filtered_regions = grouped[grouped > 500]
print(filtered_regions)
  • First, group by region and sum the sales.
  • Then, select the groups whose sum is above a threshold (in this case, 500).
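As an alternative sketch, GroupBy.filter keeps the original rows of each qualifying group rather than the aggregated totals (reusing the sample data from question 2):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'East', 'North', 'West', 'South'],
    'sales': [200, 300, 250, 100, 400, 150, 350]
})

# Keep every row belonging to a region whose total sales exceed 500
rows_in_big_regions = df.groupby('region').filter(lambda g: g['sales'].sum() > 500)
print(rows_in_big_regions)
```

Use this form when you need the row-level data of the qualifying groups, not just their sums.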

6. How do you pivot a DataFrame to summarize data (similar to an Excel pivot table)?

Question: Suppose df has columns region, store_type, and sales. How do you create a pivot table that shows the total sales for each region vs. each store_type?

Answer & Explanation:

pivot_df = df.pivot_table(values='sales', 
                          index='region', 
                          columns='store_type', 
                          aggfunc='sum', 
                          fill_value=0)
print(pivot_df)
  • pivot_table aggregates data similarly to Excel pivot tables.
  • values specifies the column to aggregate, index and columns define the pivot axes, and aggfunc is the aggregation function.
  • fill_value=0 replaces missing values with 0.

7. How do you perform a rolling aggregation on a time series?

Question: Given a time-indexed DataFrame df with a numeric column called sales, how do you compute a 7-day rolling mean of sales?

Answer & Explanation:

import pandas as pd

# Example: create a date range and sample data
dates = pd.date_range(start='2023-01-01', periods=14, freq='D')
sales_data = [200, 220, 250, 180, 300, 350, 400, 250, 270, 280, 260, 300, 310, 350]
df = pd.DataFrame({'sales': sales_data}, index=dates)

# Compute 7-day rolling mean
df['rolling_mean_7d'] = df['sales'].rolling(window=7).mean()
print(df)
  • .rolling(window=7) creates a rolling window of size 7.
  • .mean() calculates the mean value within each rolling window.

8. How do you apply a custom aggregation function with groupby?

Question: If you want to use your own function (e.g., finding the range of sales), how can you apply a custom function after a groupby operation?

Answer & Explanation:

def sales_range(x):
    return x.max() - x.min()

range_by_region = df.groupby('region')['sales'].apply(sales_range)
print(range_by_region)
  • You define a function sales_range that calculates max - min.
  • .apply() allows you to use any custom function on each group.

9. How do you apply multiple custom functions using agg?

Question: If you have multiple custom functions (or a mix of built-in and custom) to apply on sales, how do you handle that in a single call?

Answer & Explanation:

def sales_range(x):
    return x.max() - x.min()

results = df.groupby('region')['sales'].agg(
    sum_sales='sum',
    mean_sales='mean',
    sales_range=sales_range
)
print(results)
  • Named aggregation (keyword arguments to .agg, available since pandas 0.25) makes each keyword the new column name, with the value being either a built-in function name (as a string) or a custom function.
  • Passing a dict of new names to functions on a SeriesGroupBy raises a SpecificationError ("nested renamer is not supported") in modern pandas, so prefer the keyword form shown here.
  • The result is a DataFrame with a column for each aggregation.

10. How do you perform conditional aggregation (e.g., sum only where sales > 200)?

Question: You want the sum of sales only for records where sales > 200. How can you do that?

Answer & Explanation:

import numpy as np

# One approach is using np.where or mask inside groupby:
conditional_sum = df.assign(
    sales_filtered = np.where(df['sales'] > 200, df['sales'], 0)
).groupby('region')['sales_filtered'].sum()

print(conditional_sum)
  • First, create a temporary column sales_filtered that keeps the sales value if it’s above 200, otherwise 0.
  • Then, group by region and sum that column.
  • Alternatively, you could filter the DataFrame before grouping, but sometimes you need the entire DataFrame context for other computations, so this approach is flexible.

Final Tips

  1. Remember Common Aggregations: sum, mean, count, max, min, and std (standard deviation).
  2. Use agg() for Multiple Functions: Either pass a list of built-ins or a dict with names -> functions.
  3. Reset Index: If you want a “flat” DataFrame after grouping, use .reset_index().
  4. Pivot Tables: Great for summarizing data with multi-dimensional grouping and aggregations in a cross-tab format.
  5. Rolling, Expanding, and Cumulative Aggregations: .rolling(), .expanding(), .cumsum(), etc., are important for time-series or sequential data.
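Tip 5 can be illustrated with a minimal sketch of .cumsum() and .expanding():

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# Cumulative sum: running total over the series
running_total = s.cumsum()
print(running_total.tolist())  # [10, 30, 60, 100]

# Expanding mean: mean of all values seen so far at each position
running_mean = s.expanding().mean()
print(running_mean.tolist())  # [10.0, 15.0, 20.0, 25.0]
```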

Question 2: What are the top 10 questions on data transformation?

Answer:

1. What is the difference between pivot and pivot_table in pandas?

Answer Explanation:

  • pivot:

    • Reshapes data (long to wide) based on unique values of an index/column pair.
    • Does not allow for aggregation; it expects your data to have a single value for each index-column pair.
    • If the data contains duplicates for the same index-column pair, it raises an error.
  • pivot_table:

    • Similar to pivot but allows aggregation when there are duplicates in the data for the same index-column combination.
    • You can specify aggfunc (e.g., sum, mean, etc.) and fill_value for missing values.
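The contrast above can be demonstrated with a small sketch: pivot fails on a duplicate (index, column) pair, while pivot_table aggregates it (the data values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-01'],
    'store_type': ['Online', 'Online', 'Retail'],  # duplicate (date, Online) pair
    'sales': [100, 50, 150]
})

# pivot raises ValueError because (2023-01-01, Online) appears twice
try:
    df.pivot(index='date', columns='store_type', values='sales')
    raised = False
except ValueError:
    raised = True
print('pivot raised ValueError:', raised)

# pivot_table aggregates the duplicate entries instead
wide = df.pivot_table(index='date', columns='store_type', values='sales', aggfunc='sum')
print(wide)
```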

2. How do you convert a DataFrame from a long format to a wide format using pivot?

Question Example:
You have a DataFrame df with columns: date, store_type, and sales. Each combination of (date, store_type) appears exactly once. How do you reshape it so that store_type values become separate columns, and the rows are indexed by date?

import pandas as pd

data = {
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail'],
    'sales': [100, 150, 120, 130]
}
df = pd.DataFrame(data)

Answer & Explanation:

wide_df = df.pivot(index='date', columns='store_type', values='sales')
print(wide_df)
  • pivot reshapes the DataFrame such that each unique value in store_type becomes a new column.
  • The date column becomes the index.
  • The sales column fills the values in the new wide format.

3. How do you handle duplicate entries when pivoting?

Question Example:
What happens if your data contains multiple rows for the same (index, columns) pair when using pivot? How to solve this?

Answer & Explanation:

  • pivot will raise a ValueError if there are duplicate combinations, because it doesn’t know how to handle multiple values for the same cell.

  • To handle duplicates gracefully, you should use pivot_table and specify an aggregation function (aggfunc). For example:

    wide_df = df.pivot_table(
        index='date', 
        columns='store_type', 
        values='sales', 
        aggfunc='sum', 
        fill_value=0
    )

4. How do you create a pivot table that summarizes data with an aggregation function?

Question Example:
Given a DataFrame df with columns region, store_type, and sales, how do you create a pivot table that shows the sum of sales for each (region, store_type) combination?

data = {
    'region': ['North', 'North', 'South', 'South', 'East'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail', 'Online'],
    'sales': [100, 200, 300, 400, 150]
}
df = pd.DataFrame(data)

Answer & Explanation:

pivot_df = df.pivot_table(
    values='sales',
    index='region',
    columns='store_type',
    aggfunc='sum',
    fill_value=0
)
print(pivot_df)
  • pivot_table aggregates the data; note that its default aggfunc is 'mean', so pass aggfunc='sum' explicitly when you want totals.
  • Missing combinations are filled with 0 instead of NaN by using fill_value=0.

5. How do you convert data from a wide format back to a long format using melt?

Question Example:
You have a wide DataFrame wide_df with columns ['date', 'Online', 'Retail']. How do you convert it into a long format so that you have columns date, store_type, and sales again?

wide_df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Online': [100, 120],
    'Retail': [150, 130]
})

Answer & Explanation:

long_df = wide_df.melt(
    id_vars='date',
    value_vars=['Online', 'Retail'],
    var_name='store_type',
    value_name='sales'
)
print(long_df)
  • melt transforms columns into rows, creating two new columns: one for the original column names (var_name) and one for their values (value_name).

6. How do you stack a DataFrame, and what is the difference between melt and stack?

Answer & Explanation:

  • stack():
    • Rotates or pivots a level of the columns into the (row) index.
    • Typically used when your DataFrame has MultiIndex columns.
    • The result is a Series when all column levels end up stacked, or a DataFrame when some column levels remain.
  • melt():
    • Primarily used to “unpivot” or move from wide to long format, resulting in explicit columns (id_vars, variable name, value, etc.).
    • melt is more flexible for typical tabular transformations, especially if you don’t have multi-level columns.

Here’s a quick example with stack():

df = pd.DataFrame({
    ('sales', 'Online'): [100, 120],
    ('sales', 'Retail'): [150, 130]
}, index=['2023-01-01', '2023-01-02'])

stacked_df = df.stack()
print(stacked_df)

7. How do you unstack (the inverse of stack) a multi-index DataFrame?

Question Example:
Given a multi-index DataFrame, how can you transform its inner level of the index into columns?

Answer & Explanation:

  • unstack(level=-1) (by default the last level) transforms the specified level of the index into columns.

    stacked = df.stack()
    unstacked = stacked.unstack()  # Convert the last level of the index to columns
    print(unstacked)
  • unstack is especially useful when you’ve done a groupby operation resulting in a multi-index Series/ DataFrame.


8. How do you reshape data if you have multiple variables stored in the same column?

Question Example:
You might have a “long” DataFrame where one column contains multiple variables (e.g., “Sales_Online” and “Sales_Retail”) that should be split into different columns. What’s the approach?

Answer & Explanation:

  • Often, you first melt your DataFrame to get a single “variable” column, then split that column into multiple parts using string operations, and finally pivot or pivot_table to restructure as needed.

For example, assuming df has already been melted so that it has a variable column (e.g., "Sales_Online") and a value column:

df['variable_split'] = df['variable'].str.split('_', expand=True)[1]  # e.g., "Online" / "Retail"
wide_df = df.pivot(index='date', columns='variable_split', values='value')
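A fuller sketch of the melt, split, then pivot flow, using the hypothetical Sales_Online / Sales_Retail columns from the question:

```python
import pandas as pd

# Hypothetical wide data: two variables packed into the column names
wide = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Sales_Online': [100, 120],
    'Sales_Retail': [150, 130]
})

# 1) melt to long format: one 'variable' column, one 'value' column
long = wide.melt(id_vars='date', var_name='variable', value_name='value')

# 2) split the packed names, keeping the part after the underscore
long['store_type'] = long['variable'].str.split('_', expand=True)[1]

# 3) pivot back so each store_type becomes its own column
result = long.pivot(index='date', columns='store_type', values='value')
print(result)
```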

9. How do you create a pivot table with multiple aggregation functions on the same pivot?

Question Example:
Given a DataFrame with region, sales, and quantity, how do you create a pivot table that shows both the average sales and the total quantity by region?

Answer & Explanation:

pivot_df = df.pivot_table(
    index='region',
    values=['sales', 'quantity'],
    aggfunc={'sales': 'mean', 'quantity': 'sum'}
)
print(pivot_df)
  • You can pass a dictionary to aggfunc, mapping each column to an aggregation function.
  • The output is a DataFrame showing different aggregations for different columns.

10. How do you handle missing or null values during pivot/melt operations?

Answer & Explanation:

  • In a pivot_table, you can specify fill_value to replace NaN with a specific value (e.g., 0).

  • You can use data cleaning techniques (e.g., dropna(), fillna()) before or after pivot/melt.

  • For example:

    pivot_df = df.pivot_table(
        index='region',
        columns='store_type',
        values='sales',
        aggfunc='sum',
        fill_value=0
    )
  • melt itself keeps rows whose value is NaN; if you don't want them, drop them afterwards with .dropna(subset=['sales']). (melt's ignore_index parameter only controls whether the original index is reset, not how NaN values are handled.)
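A minimal sketch of dropping NaN rows after a melt (sample data is illustrative):

```python
import pandas as pd
import numpy as np

wide = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Online': [100, np.nan],
    'Retail': [150, 130]
})

# melt keeps the NaN row; it must be dropped explicitly if unwanted
long = wide.melt(id_vars='date', var_name='store_type', value_name='sales')
print(len(long))  # 4 rows, including the NaN one

clean = long.dropna(subset=['sales'])
print(len(clean))  # 3 rows
```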


Key Takeaways

  1. pivot vs. pivot_table: Use pivot_table if duplicates exist or you need aggregations.
  2. melt (Wide to Long): Helps unpivot data into a tall, skinny table.
  3. stack / unstack: Primarily for multi-index transformations.
  4. Multiple Aggregations: Use dictionaries in pivot_table or groupby().agg().
  5. Handling Missing Values: fill_value=..., dropna(), or fillna() where needed.