A simple but powerful Jupyter Notebook built to clean CSV files by identifying and removing duplicate rows. This is a common and essential first step in any data cleaning or analysis pipeline.
This notebook provides a clear, step-by-step process to:
- Load a CSV file into a Pandas DataFrame.
- Analyze the data to find the total number of duplicate rows.
- Remove all duplicate rows efficiently.
- Save the clean, deduplicated data back to a new CSV file.
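The four steps above can be sketched in a few lines of pandas. This is a minimal, self-contained illustration (it builds a tiny sample CSV first so it can run on its own; the file names are placeholders, not the notebook's actual ones):

```python
import pandas as pd

# Create a small sample CSV containing one duplicate row (for demonstration only)
pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]}).to_csv("sample.csv", index=False)

# 1. Load the CSV file into a DataFrame
df = pd.read_csv("sample.csv")

# 2. Count duplicate rows (rows identical to an earlier row)
num_duplicates = df.duplicated().sum()
print(f"Found {num_duplicates} duplicate rows")

# 3. Remove all duplicate rows, keeping the first occurrence of each
df_clean = df.drop_duplicates()

# 4. Save the deduplicated data to a new CSV file
df_clean.to_csv("sample_deduplicated.csv", index=False)
```

By default, `duplicated()` and `drop_duplicates()` compare entire rows; pass a `subset=` of column names to deduplicate on specific columns instead.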
- Python: The language used for all data-cleaning logic.
- Pandas: The core library used for data loading, manipulation, and analysis.
- Jupyter Notebook: For interactive code execution and clear documentation.
- Add Your File: Place the CSV file you want to clean into the same folder as this notebook.
- Open the Notebook: Launch the `.ipynb` file (e.g., in VS Code, Jupyter Lab, or Google Colab).
- Update the Filename: In the first code cell, change the `filename` variable to match the name of your CSV file.

  ```python
  # Change 'your-file-name.csv' to the name of your file
  filename = 'your-file-name.csv'
  ```
- Run All Cells: Run all the cells in the notebook from top to bottom.
- Get Your Clean File: The notebook will save a new file in the same folder, named `your-original-filename_deduplicated.csv`. This new file contains your clean, duplicate-free dataset!
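The output name is derived from the input name. One way to build it, assuming the notebook simply appends `_deduplicated` before the `.csv` extension (`data.csv` below is a placeholder):

```python
from pathlib import Path

filename = "data.csv"  # placeholder input name

# Take the file name without its extension and append the suffix
output_name = f"{Path(filename).stem}_deduplicated.csv"
print(output_name)  # data_deduplicated.csv
```

Using `Path.stem` keeps this robust for any input file name, including ones containing dots before the extension.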