This project is a task assignment for Datacolab. The objective is to create a Python script or notebook that combines NLP, computer vision, machine learning, and data visualization to analyze and understand a dataset of book summaries.
The project involves:
- Data Preprocessing and EDA (Exploratory Data Analysis): Cleaning the data and exploring its characteristics to uncover patterns and insights.
- NLP (Natural Language Processing): Condensing the book summaries to make them shorter while preserving the main information.
- Computer Vision: Converting the condensed text summaries into images using a Text-to-Image model.
- Clean the data: Handle missing values, remove duplicates, and ensure the data is in a usable format.
- Explore the dataset: Understand the distribution of data, identify patterns, and generate summary statistics.
- Condense the book summaries: Use NLP techniques to shorten the summaries while keeping the essential information intact.
- Convert text to images: Use a Text-to-Image model to transform the condensed summaries into visual representations.
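The cleaning and exploration tasks above can be illustrated with a minimal standard-library sketch. It is not the project's actual implementation: the `summary` field name, the helper names `clean_records` and `length_stats`, and the sample rows are all assumptions made for illustration.

```python
import statistics


def clean_records(rows, text_field="summary"):
    """Drop rows with a missing summary and remove duplicate summaries.

    `text_field` is an assumed column name, not confirmed by the dataset.
    """
    seen = set()
    cleaned = []
    for row in rows:
        text = (row.get(text_field) or "").strip()
        if not text:
            continue  # missing or empty summary
        # Collapse runs of whitespace so trivially different copies match.
        text = " ".join(text.split())
        if text in seen:
            continue  # duplicate summary
        seen.add(text)
        cleaned.append({**row, text_field: text})
    return cleaned


def length_stats(rows, text_field="summary"):
    """Word-count summary statistics; expects at least one row."""
    counts = [len(row[text_field].split()) for row in rows]
    return {
        "count": len(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
        "min": min(counts),
        "max": max(counts),
    }


# Hypothetical sample rows, for illustration only.
sample = [
    {"title": "A", "summary": "A young wizard  attends a school of magic."},
    {"title": "B", "summary": None},
    {"title": "A2", "summary": "A young wizard attends a school of magic."},
    {"title": "C", "summary": "Two families feud in Verona."},
]
cleaned = clean_records(sample)
stats = length_stats(cleaned)
print(stats)
```

Deduplicating on whitespace-normalized text is one possible choice; a real pipeline might instead deduplicate on title, or use a library such as pandas.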
The script should provide:
- Condensed text summaries: Shortened versions of the original book summaries.
- Converted images: Visual representations of the condensed summaries.
- Findings from EDA: Insights and patterns discovered during the exploratory data analysis.
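As an illustration of how condensed summaries might be produced, here is a small frequency-based extractive summarizer. This is a swapped-in baseline technique, not necessarily the NLP method the project uses; the function name `condense`, the `max_sentences` parameter, and the tiny stop-word list are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "in", "to", "is", "it",
    "that", "with", "his", "her", "he", "she",
}


def _content_words(text):
    """Lowercased word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def condense(text, max_sentences=2):
    """Keep the highest-scoring sentences, in their original order."""
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s.strip()
    ]
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    freq = Counter(_content_words(text))

    def score(sentence):
        # Average document-level frequency of the sentence's content words.
        tokens = _content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


# Hypothetical example text, for illustration only.
text = (
    "The ring must be destroyed. Frodo carries the ring to Mordor. "
    "The weather is pleasant. Sam helps Frodo carry the ring."
)
summary = condense(text, max_sentences=2)
print(summary)
```

An extractive approach only selects existing sentences; an abstractive model that rewrites the text would be a different design with different trade-offs.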
- Python 3.8 or higher: Ensure Python is installed on your system.
- Jupyter Notebook or any other Python IDE: For running and editing the script/notebook.
1. Clone the repository:
   git clone https://github.com/MuhammadMahdiAmirpour/data_science_project_book_dataset.git
   cd data_science_project_book_dataset
2. Install the required packages:
   pip install -r requirements.txt
3. Open the Jupyter Notebook:
   jupyter notebook
4. Run the notebook cells to execute the script.
data_science_project_book_dataset/
├── data/
│ └── book_summaries.csv # Dataset containing book summaries
├── notebooks/
│ └── analysis.ipynb # Jupyter notebook for analysis
├── src/
│ ├── data_preprocessing.py # Script for data preprocessing and EDA
│ ├── nlp_component.py # Script for NLP tasks
│ └── computer_vision_component.py # Script for converting text to images
├── requirements.txt # Required Python packages
└── README.md # Project documentation
The dataset will be located in the data directory and contains book summaries in a CSV file named book_summaries.csv.
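A minimal sketch of reading that CSV with the standard library follows. The column names `title` and `summary`, the UTF-8 encoding, and the helper name `load_summaries` are assumptions; the real file's schema is not documented here.

```python
import csv
import io


def load_summaries(source):
    """Read rows from an open CSV file object into a list of dicts."""
    return list(csv.DictReader(source))


# In the repository this would presumably look like (encoding is an assumption):
# with open("data/book_summaries.csv", newline="", encoding="utf-8") as f:
#     rows = load_summaries(f)

# In-memory stand-in with hypothetical columns, so the sketch runs anywhere.
demo = io.StringIO("title,summary\nDune,A desert planet holds a precious spice.\n")
rows = load_summaries(demo)
print(rows[0]["title"])
```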
This project was developed as part of a task assignment for Datacolab.
Muhammad Mahdi Amirpour
- GitHub: @MuhammadMahdiAmirpour