Data Transformation and Aggregation QNA

Praveen Kumar Anwla edited this page Jan 15, 2025 · 2 revisions

Question 1: What are the top 10 questions on data aggregation using pandas?

1. How do you compute a simple aggregate (e.g., sum) of a column in a pandas DataFrame?

Question: Given a DataFrame df with a numeric column called sales, how can you find the total (sum) of the sales column?

Answer & Explanation:

import pandas as pd

# Example DataFrame
data = {
    'sales': [200, 300, 150, 400],
    'region': ['North', 'South', 'East', 'West']
}
df = pd.DataFrame(data)

# Compute sum of 'sales' column
total_sales = df['sales'].sum()
print(total_sales)  # Output: 1050
  • df['sales'].sum() returns the sum of all values in the sales column.
  • Common aggregates include .sum(), .mean(), .max(), .min(), and .count().

2. How do you group by a specific column and compute summary statistics?

Question: Suppose you have a DataFrame df with columns region and sales. How do you find the total sales per region?

Answer & Explanation:

import pandas as pd

# Example DataFrame
data = {
    'region': ['North', 'South', 'East', 'East', 'North', 'West', 'South'],
    'sales': [200, 300, 250, 100, 400, 150, 350]
}
df = pd.DataFrame(data)

# Group by 'region' and compute sum of 'sales'
sales_by_region = df.groupby('region')['sales'].sum()
print(sales_by_region)

Output:

region
East     350
North    600
South    650
West     150
Name: sales, dtype: int64
  • df.groupby('region') creates a GroupBy object grouped by region.
  • ['sales'].sum() applies the sum aggregation to the sales column.

3. How do you apply multiple aggregations (e.g., sum, mean, count) using a single groupby operation?

Question: In the same df with region and sales, how can you get the sum, mean, and count of sales in one line of code?

Answer & Explanation:

aggregations = df.groupby('region')['sales'].agg(['sum', 'mean', 'count'])
print(aggregations)

Output (example):

         sum   mean  count
region                    
East     350  175.0      2
North    600  300.0      2
South    650  325.0      2
West     150  150.0      1
  • .agg([...]) allows multiple functions.
  • The result is a DataFrame with each aggregation as a separate column.

4. How do you group by multiple columns and perform an aggregation?

Question: If you have a DataFrame with columns region, store_type, and sales, how can you group by both region and store_type to find the total sales?

Answer & Explanation:

data = {
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail', 'Online', 'Retail'],
    'sales': [300, 200, 400, 150, 250, 100]
}
df = pd.DataFrame(data)

# Group by two columns
grouped_sales = df.groupby(['region', 'store_type'])['sales'].sum()
print(grouped_sales)

Output (example):

region  store_type
East    Online         250
        Retail         100
North   Online         300
        Retail         200
South   Online         400
        Retail         150
Name: sales, dtype: int64
  • df.groupby(['region', 'store_type']) creates groups based on both columns.
  • You can also reset the index by appending .reset_index() if you want a standard DataFrame format.
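The .reset_index() tip can be sketched like this (same data as above; the flat variable name is just for illustration):

```python
import pandas as pd

# Same data as in this question
data = {
    'region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail', 'Online', 'Retail'],
    'sales': [300, 200, 400, 150, 250, 100]
}
df = pd.DataFrame(data)

# .reset_index() turns the (region, store_type) group keys back into ordinary columns
flat = df.groupby(['region', 'store_type'])['sales'].sum().reset_index()
print(flat.columns.tolist())  # ['region', 'store_type', 'sales']
```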

5. How do you use groupby and then filter groups based on an aggregation result?

Question: Given a DataFrame df with columns region and sales, how can you find only those regions whose total sales exceed 500?

Answer & Explanation:

grouped = df.groupby('region')['sales'].sum()

# Filter to get regions with sales > 500
filtered_regions = grouped[grouped > 500]
print(filtered_regions)
  • First, group by region and sum the sales.
  • Then, select the groups whose sum is above a threshold (in this case, 500).
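As an alternative sketch, GroupBy.filter keeps the original rows of each qualifying group rather than the aggregated totals (reusing the sample data from question 2):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'East', 'North', 'West', 'South'],
    'sales': [200, 300, 250, 100, 400, 150, 350]
})

# Keep every row belonging to a region whose total sales exceed 500
rows_in_big_regions = df.groupby('region').filter(lambda g: g['sales'].sum() > 500)
print(rows_in_big_regions)
```

Use this form when you need the row-level data of the qualifying groups, not just their sums.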

6. How do you pivot a DataFrame to summarize data (similar to an Excel pivot table)?

Question: Suppose df has columns region, store_type, and sales. How do you create a pivot table that shows the total sales for each region vs. each store_type?

Answer & Explanation:

pivot_df = df.pivot_table(values='sales', 
                          index='region', 
                          columns='store_type', 
                          aggfunc='sum', 
                          fill_value=0)
print(pivot_df)
  • pivot_table aggregates data similarly to Excel pivot tables.
  • values specifies the column to aggregate, index and columns define the pivot axes, and aggfunc is the aggregation function.
  • fill_value=0 replaces missing values with 0.

7. How do you perform a rolling aggregation on a time series?

Question: Given a time-indexed DataFrame df with a numeric column called sales, how do you compute a 7-day rolling mean of sales?

Answer & Explanation:

import pandas as pd

# Example: create a date range and sample data
dates = pd.date_range(start='2023-01-01', periods=14, freq='D')
sales_data = [200, 220, 250, 180, 300, 350, 400, 250, 270, 280, 260, 300, 310, 350]
df = pd.DataFrame({'sales': sales_data}, index=dates)

# Compute 7-day rolling mean
df['rolling_mean_7d'] = df['sales'].rolling(window=7).mean()
print(df)
  • .rolling(window=7) creates a rolling window of size 7.
  • .mean() calculates the mean value within each rolling window.

8. How do you apply a custom aggregation function with groupby?

Question: If you want to use your own function (e.g., finding the range of sales), how can you apply a custom function after a groupby operation?

Answer & Explanation:

def sales_range(x):
    return x.max() - x.min()

range_by_region = df.groupby('region')['sales'].apply(sales_range)
print(range_by_region)
  • You define a function sales_range that calculates max - min.
  • .apply() allows you to use any custom function on each group.

9. How do you apply multiple custom functions using agg?

Question: If you have multiple custom functions (or a mix of built-in and custom) to apply on sales, how do you handle that in a single call?

Answer & Explanation:

def sales_range(x):
    return x.max() - x.min()

results = df.groupby('region')['sales'].agg(
    sum_sales='sum',
    mean_sales='mean',
    sales_range=sales_range
)
print(results)
  • Named aggregation (keyword arguments to .agg, available since pandas 0.25) makes each keyword the new column name, with the value being either a built-in function name (as a string) or a custom function.
  • Passing a dict of new names to functions on a SeriesGroupBy raises a SpecificationError ("nested renamer is not supported") in modern pandas, so prefer the keyword form shown here.
  • The result is a DataFrame with a column for each aggregation.

10. How do you perform conditional aggregation (e.g., sum only where sales > 200)?

Question: You want the sum of sales only for records where sales > 200. How can you do that?

Answer & Explanation:

import numpy as np

# One approach is using np.where or mask inside groupby:
conditional_sum = df.assign(
    sales_filtered = np.where(df['sales'] > 200, df['sales'], 0)
).groupby('region')['sales_filtered'].sum()

print(conditional_sum)
  • First, create a temporary column sales_filtered that keeps the sales value if it’s above 200, otherwise 0.
  • Then, group by region and sum that column.
  • Alternatively, you could filter the DataFrame before grouping, but sometimes you need the entire DataFrame context for other computations, so this approach is flexible.

Final Tips

  1. Remember Common Aggregations: sum, mean, count, max, min, and std (standard deviation).
  2. Use agg() for Multiple Functions: Either pass a list of built-ins or a dict with names -> functions.
  3. Reset Index: If you want a “flat” DataFrame after grouping, use .reset_index().
  4. Pivot Tables: Great for summarizing data with multi-dimensional grouping and aggregations in a cross-tab format.
  5. Rolling, Expanding, and Cumulative Aggregations: .rolling(), .expanding(), .cumsum(), etc., are important for time-series or sequential data.
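Tip 5 can be illustrated with a minimal sketch of .cumsum() and .expanding():

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# Cumulative sum: running total over the series
running_total = s.cumsum()
print(running_total.tolist())  # [10, 30, 60, 100]

# Expanding mean: mean of all values seen so far at each position
running_mean = s.expanding().mean()
print(running_mean.tolist())  # [10.0, 15.0, 20.0, 25.0]
```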

Question 2: What are the top 10 questions on data transformation?

Answer:

1. What is the difference between pivot and pivot_table in pandas?

Answer Explanation:

  • pivot:

    • Reshapes data (long to wide) based on unique values of an index/column pair.
    • Does not allow for aggregation; it expects your data to have a single value for each index-column pair.
    • If the data contains duplicates for the same index-column pair, it raises an error.
  • pivot_table:

    • Similar to pivot but allows aggregation when there are duplicates in the data for the same index-column combination.
    • You can specify aggfunc (e.g., sum, mean, etc.) and fill_value for missing values.
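The contrast above can be demonstrated with a small sketch: pivot fails on a duplicate (index, column) pair, while pivot_table aggregates it (the data values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-01'],
    'store_type': ['Online', 'Online', 'Retail'],  # duplicate (date, Online) pair
    'sales': [100, 50, 150]
})

# pivot raises ValueError because (2023-01-01, Online) appears twice
try:
    df.pivot(index='date', columns='store_type', values='sales')
    raised = False
except ValueError:
    raised = True
print('pivot raised ValueError:', raised)

# pivot_table aggregates the duplicate entries instead
wide = df.pivot_table(index='date', columns='store_type', values='sales', aggfunc='sum')
print(wide)
```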

2. How do you convert a DataFrame from a long format to a wide format using pivot?

Question Example:
You have a DataFrame df with columns: date, store_type, and sales. Each combination of (date, store_type) appears exactly once. How do you reshape it so that store_type values become separate columns, and the rows are indexed by date?

import pandas as pd

data = {
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail'],
    'sales': [100, 150, 120, 130]
}
df = pd.DataFrame(data)

Answer & Explanation:

wide_df = df.pivot(index='date', columns='store_type', values='sales')
print(wide_df)
  • pivot reshapes the DataFrame such that each unique value in store_type becomes a new column.
  • The date column becomes the index.
  • The sales column fills the values in the new wide format.

3. How do you handle duplicate entries when pivoting?

Question Example:
What happens if your data contains multiple rows for the same (index, columns) pair when using pivot? How to solve this?

Answer & Explanation:

  • pivot will raise a ValueError if there are duplicate combinations, because it doesn’t know how to handle multiple values for the same cell.

  • To handle duplicates gracefully, you should use pivot_table and specify an aggregation function (aggfunc). For example:

    wide_df = df.pivot_table(
        index='date', 
        columns='store_type', 
        values='sales', 
        aggfunc='sum', 
        fill_value=0
    )

4. How do you create a pivot table that summarizes data with an aggregation function?

Question Example:
Given a DataFrame df with columns region, store_type, and sales, how do you create a pivot table that shows the sum of sales for each (region, store_type) combination?

data = {
    'region': ['North', 'North', 'South', 'South', 'East'],
    'store_type': ['Online', 'Retail', 'Online', 'Retail', 'Online'],
    'sales': [100, 200, 300, 400, 150]
}
df = pd.DataFrame(data)

Answer & Explanation:

pivot_df = df.pivot_table(
    values='sales',
    index='region',
    columns='store_type',
    aggfunc='sum',
    fill_value=0
)
print(pivot_df)
  • pivot_table aggregates the data; note that its default aggfunc is 'mean', so pass aggfunc='sum' explicitly when you want totals.
  • Missing combinations are filled with 0 instead of NaN by using fill_value=0.

5. How do you convert data from a wide format back to a long format using melt?

Question Example:
You have a wide DataFrame wide_df with columns ['date', 'Online', 'Retail']. How do you convert it into a long format so that you have columns date, store_type, and sales again?

wide_df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Online': [100, 120],
    'Retail': [150, 130]
})

Answer & Explanation:

long_df = wide_df.melt(
    id_vars='date',
    value_vars=['Online', 'Retail'],
    var_name='store_type',
    value_name='sales'
)
print(long_df)
  • melt transforms columns into rows, creating two new columns: one for the original column names (var_name) and one for their values (value_name).

6. How do you stack a DataFrame, and what is the difference between melt and stack?

Answer & Explanation:

  • stack():
    • Rotates or pivots a level of the columns into the (row) index.
    • Typically used when your DataFrame has MultiIndex columns.
    • The result is a Series when all column levels end up stacked, or a DataFrame when some column levels remain.
  • melt():
    • Primarily used to “unpivot” or move from wide to long format, resulting in explicit columns (id_vars, variable name, value, etc.).
    • melt is more flexible for typical tabular transformations, especially if you don’t have multi-level columns.

Here’s a quick example with stack():

df = pd.DataFrame({
    ('sales', 'Online'): [100, 120],
    ('sales', 'Retail'): [150, 130]
}, index=['2023-01-01', '2023-01-02'])

stacked_df = df.stack()
print(stacked_df)

7. How do you unstack (the inverse of stack) a multi-index DataFrame?

Question Example:
Given a multi-index DataFrame, how can you transform its inner level of the index into columns?

Answer & Explanation:

  • unstack(level=-1) (by default the last level) transforms the specified level of the index into columns.

    stacked = df.stack()
    unstacked = stacked.unstack()  # Convert the last level of the index to columns
    print(unstacked)
  • unstack is especially useful when you’ve done a groupby operation resulting in a multi-index Series/ DataFrame.


8. How do you reshape data if you have multiple variables stored in the same column?

Question Example:
You might have a “long” DataFrame where one column contains multiple variables (e.g., “Sales_Online” and “Sales_Retail”) that should be split into different columns. What’s the approach?

Answer & Explanation:

  • Often, you first melt your DataFrame to get a single “variable” column, then split that column into multiple parts using string operations, and finally pivot or pivot_table to restructure as needed.

For example, assuming df has already been melted so that it has a variable column (e.g., "Sales_Online") and a value column:

df['variable_split'] = df['variable'].str.split('_', expand=True)[1]  # e.g., "Online" / "Retail"
wide_df = df.pivot(index='date', columns='variable_split', values='value')
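A fuller sketch of the melt, split, then pivot flow, using the hypothetical Sales_Online / Sales_Retail columns from the question:

```python
import pandas as pd

# Hypothetical wide data: two variables packed into the column names
wide = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Sales_Online': [100, 120],
    'Sales_Retail': [150, 130]
})

# 1) melt to long format: one 'variable' column, one 'value' column
long = wide.melt(id_vars='date', var_name='variable', value_name='value')

# 2) split the packed names, keeping the part after the underscore
long['store_type'] = long['variable'].str.split('_', expand=True)[1]

# 3) pivot back so each store_type becomes its own column
result = long.pivot(index='date', columns='store_type', values='value')
print(result)
```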

9. How do you create a pivot table with multiple aggregation functions on the same pivot?

Question Example:
Given a DataFrame with region, sales, and quantity, how do you create a pivot table that shows both the average sales and the total quantity by region?

Answer & Explanation:

pivot_df = df.pivot_table(
    index='region',
    values=['sales', 'quantity'],
    aggfunc={'sales': 'mean', 'quantity': 'sum'}
)
print(pivot_df)
  • You can pass a dictionary to aggfunc, mapping each column to an aggregation function.
  • The output is a DataFrame showing different aggregations for different columns.

10. How do you handle missing or null values during pivot/melt operations?

Answer & Explanation:

  • In a pivot_table, you can specify fill_value to replace NaN with a specific value (e.g., 0).

  • You can use data cleaning techniques (e.g., dropna(), fillna()) before or after pivot/melt.

  • For example:

    pivot_df = df.pivot_table(
        index='region',
        columns='store_type',
        values='sales',
        aggfunc='sum',
        fill_value=0
    )
  • melt itself keeps rows whose value is NaN; if you don't want them, drop them afterwards with .dropna(subset=['sales']). (melt's ignore_index parameter only controls whether the original index is reset, not how NaN values are handled.)
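A minimal sketch of dropping NaN rows after a melt (sample data is illustrative):

```python
import pandas as pd
import numpy as np

wide = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-02'],
    'Online': [100, np.nan],
    'Retail': [150, 130]
})

# melt keeps the NaN row; it must be dropped explicitly if unwanted
long = wide.melt(id_vars='date', var_name='store_type', value_name='sales')
print(len(long))  # 4 rows, including the NaN one

clean = long.dropna(subset=['sales'])
print(len(clean))  # 3 rows
```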


Key Takeaways

  1. pivot vs. pivot_table: Use pivot_table if duplicates exist or you need aggregations.
  2. melt (Wide to Long): Helps unpivot data into a tall, skinny table.
  3. stack / unstack: Primarily for multi-index transformations.
  4. Multiple Aggregations: Use dictionaries in pivot_table or groupby().agg().
  5. Handling Missing Values: fill_value=..., dropna(), or fillna() where needed.