Welcome, and thanks for taking the time to work on this case.
This exercise simulates a simplified version of how we ingest and transform customer feedback data into analytics-ready tables. It’s intentionally scoped so you can complete it in ~2–3 hours.
You are free to use any tools and stack you prefer:
- Python (Pandas, SQLAlchemy, etc.)
- SQL (Postgres, DuckDB, SQLite, etc.)
- dbt-like models
- Notebooks (Jupyter, VS Code, etc.)
- Or a combination
We’re primarily interested in how you think about data engineering: modeling, cleaning, joining, and explaining decisions.
You’re a data engineer in an AI/Data team.
The company collects survey feedback from different systems. For this case, you get:
- A file with survey responses
- A file with user metadata
Your job is to turn these into:
- Clean, well-structured staging tables
- A unified “analytics-ready” fact table
- A few simple aggregations for analysis
- A short description of your design and reasoning
We’ll review both your code and your explanations, and we’ll discuss your solution in a follow-up conversation.
All input data is under the data/ folder:
- `data/survey_results.csv`
- `data/user_metadata.csv`
Each row is a single survey response.
| Column | Type | Description |
|---|---|---|
| `submission_id` | string | Unique identifier for the survey submission (UUID). |
| `timestamp` | string | ISO 8601 timestamp when the response was submitted. |
| `user_email` | string | Email of the respondent. |
| `rating` | integer | Score from 1 to 5. |
| `comment_text` | string | Free-text comment from the respondent. |
| `region` | string | Region of the respondent (e.g. EMEA, Americas, APAC). |
Each row represents a user.
| Column | Type | Description |
|---|---|---|
| `user_email` | string | Email of the user (join key to survey data). |
| `full_name` | string | Full name of the user. |
| `department` | string | Department (e.g. Engineering, Sales, Marketing, etc.). |
| `country` | string | Country of the user. |

Note on `region` vs `country`: The `region` field in survey data is self-reported by the respondent at submission time. The `country` field in user metadata is the user's HR-recorded country. These come from different source systems and may not always align perfectly.
The data contains some realistic imperfections that you might encounter in production systems. Part of the exercise is identifying and handling these appropriately.
Ingest both CSV files into a data environment of your choice and create two staging structures:
- `stg_survey_results`
- `stg_user_metadata`
You decide what “table” means in your setup:
- Database tables (Postgres, DuckDB, SQLite)
- DataFrames in Python
- dbt models
- Views or temporary tables
What we care about:
- Naming / structure makes sense
- It’s clear where “raw” ends and “cleaned/standardized” begins
Apply sensible validation and cleaning steps.
You don’t have to do anything overly complex, but we expect you to think about data quality and consistency.
- Remove exact duplicate submissions, if any
- Ensure `rating` is an integer between 1 and 5
- Parse and standardize `timestamp` into a proper datetime column
- Normalize `user_email` (e.g. lowercase, trimmed)
- Consider what you’d do if:
  - `user_email` is missing
  - `rating` is outside 1–5
  - `timestamp` is invalid
You can handle issues by:
- Dropping rows
- Flagging them
- Replacing invalid values
- Logging problems (even if just in comments / docs)
Just make sure your choices are explicit.
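As an illustration of the "flag rather than drop" option, here is a minimal pandas sketch (column names assumed from the table above; the `is_valid` flag is our own convention):

```python
import pandas as pd

def clean_survey(df: pd.DataFrame) -> pd.DataFrame:
    # Exact duplicate submissions are removed outright.
    out = df.drop_duplicates()
    out = out.assign(
        user_email=out["user_email"].str.strip().str.lower(),
        timestamp=pd.to_datetime(out["timestamp"], errors="coerce", utc=True),
        rating=pd.to_numeric(out["rating"], errors="coerce"),
    )
    # Flag invalid rows instead of silently dropping them,
    # so downstream consumers can decide how to treat them.
    out["is_valid"] = (
        out["user_email"].notna()
        & out["timestamp"].notna()
        & out["rating"].between(1, 5)
    )
    return out
```

Whether you drop, flag, or impute is up to you; the point is that the rule is written down in one place.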
- Ensure `user_email` is unique per user (or document how you handle conflicts)
- Standardize / trim `department` values
- Validate that `country` is present (not null)
- Think about how this table would behave as the company grows (more countries, more departments, etc.)
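A minimal sketch of these steps, assuming the column names from the table above and a "keep first" conflict rule (any rule is fine as long as it is documented):

```python
import pandas as pd

def clean_users(df: pd.DataFrame) -> pd.DataFrame:
    out = df.assign(
        user_email=df["user_email"].str.strip().str.lower(),
        department=df["department"].str.strip(),
    )
    # One row per user_email: keep the first occurrence; a real
    # pipeline would log which conflicting rows were discarded.
    out = out.drop_duplicates(subset="user_email", keep="first")
    # Surface rows with a missing country instead of failing silently.
    missing_country = out["country"].isna()
    if missing_country.any():
        print(f"warning: {missing_country.sum()} user(s) missing country")
    return out
```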
Create a unified, analytics-ready table by joining survey results with user metadata on `user_email`.
We suggest naming this final table:
`fct_survey_feedback`
At a minimum, it should contain:
- `submission_id`
- `timestamp`
- `user_email`
- `rating`
- `comment_text`
- `region`
- `department`
- `country`
You can add additional derived fields if you think they are useful and still keep the table clear and well-structured (e.g., `rating_bucket`, `year_month`, etc.).
Key things we look at here:
- Does the join make sense?
- How do you handle surveys with missing user metadata (if any)?
- Is the final table structured in a way that’s easy for analytics/BI to use?
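One way to address the missing-metadata question is a left join with an explicit match flag. A sketch (assuming both inputs are already staged; `has_user_metadata` is our own naming):

```python
import pandas as pd

def build_fct(surveys: pd.DataFrame, users: pd.DataFrame) -> pd.DataFrame:
    fct = surveys.merge(
        users[["user_email", "department", "country"]],
        on="user_email",
        how="left",              # keep responses even when metadata is missing
        indicator="meta_match",  # records whether the join found a match
    )
    # Make the metadata gap explicit for downstream analysis.
    fct["has_user_metadata"] = fct["meta_match"].eq("both")
    return fct.drop(columns="meta_match")
```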
Using your `fct_survey_feedback` table, produce the following aggregations:
1. Average rating per department
   - Output columns: `department`, `avg_rating`
2. Average rating per region
   - Output columns: `region`, `avg_rating`
3. Rating counts per department and rating value
   - Output columns: `department`, `rating`, `rating_count`
You can:
- Provide SQL queries
- Generate CSV files with the results
- Show results in a notebook
Choose the format that best demonstrates your approach.
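For illustration, the three aggregations expressed as pandas group-bys (equivalent SQL `GROUP BY` queries work just as well):

```python
import pandas as pd

def aggregate(fct: pd.DataFrame) -> dict:
    # 1. Average rating per department
    avg_by_dept = (fct.groupby("department", as_index=False)
                      .agg(avg_rating=("rating", "mean")))
    # 2. Average rating per region
    avg_by_region = (fct.groupby("region", as_index=False)
                        .agg(avg_rating=("rating", "mean")))
    # 3. Rating counts per department and rating value
    counts = (fct.groupby(["department", "rating"], as_index=False)
                 .agg(rating_count=("rating", "size")))
    return {"by_department": avg_by_dept,
            "by_region": avg_by_region,
            "rating_distribution": counts}
```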
This task is optional but encouraged if you have time.
Imagine this survey pipeline needs to evolve: instead of batch-processing CSV files, the system must handle ~1,000 survey submissions per minute in near real-time, with results available for analytics within seconds of submission.
Your task:
Create a simple architecture diagram (can be ASCII, a sketch, or any diagramming tool) showing how you would design this real-time pipeline. Include a brief explanation of your choices.
Consider including:
- How surveys are ingested (API, message queue, etc.)
- Stream processing or micro-batch approach
- Where validation and cleaning happen
- How data lands in the analytics layer
- How you'd handle failures or late-arriving data
We're looking for:
- Reasonable technology choices with brief justification
- Understanding of trade-offs (latency vs. throughput, complexity vs. reliability)
- Awareness of where the batch approach you built would need to change
This doesn't need to be production-ready — a clear diagram with 3–5 sentences explaining your reasoning is enough. Place this in your design_notes.md or as a separate file (e.g., realtime_architecture.md).
Please include a small design/notes document, e.g. design_notes.md.
Suggested content:
- Tools & stack chosen
  - e.g. “DuckDB + SQL + one Python script”, “Postgres + dbt”, “Pandas only”, etc.
- High-level data flow
  - A short description or simple diagram: raw → staging → cleaned → final fact table → aggregations
- Transformation logic
  - How you cleaned the data, joined the tables, and calculated the aggregations
- Data quality & validation
  - What you validate and where (e.g. input, staging, final table)
  - How you would log/alert/fail in a real pipeline
- Scalability
  - How your approach would change if the data volume was 100x larger
  - Any optimizations you’d consider (indexes, partitioning, incremental loads, etc.)
- If you had more time…
  - What improvements you would make
  - How you might productionize this (scheduling, orchestration, testing, monitoring)
This doesn’t have to be long — a couple of well-structured sections with bullet points is enough — but it’s very important for us to understand your thinking.
You’re free to structure your project in the way that feels natural to you.
Here is a suggested layout:
```
.
├── data/
│   ├── survey_results.csv
│   └── user_metadata.csv
│
├── src/
│   ├── ...                # your scripts, SQL files, or notebooks
│   └── ...
│
├── models/                # optional: dbt-style models if you want
│   └── ...
│
├── aggregations/          # optional: CSVs / exports for the 3 aggregations
│   └── ...
│
├── design_notes.md        # your design & reasoning
└── README.md              # this file
```