Data Science Project: Book Information Dataset

This project is a task assignment for Datacolab. The goal is to build a Python script or notebook that combines NLP, computer vision, machine learning, and data visualization to analyze a dataset of book summaries.

🎯 Objective

The project analyzes a dataset of book summaries in three stages:

  • Data Preprocessing and EDA (Exploratory Data Analysis): Cleaning the data and exploring its characteristics to uncover patterns and insights.
  • NLP (Natural Language Processing): Condensing the book summaries to make them shorter while preserving the main information.
  • Computer Vision: Converting the condensed text summaries into images using a Text-to-Image model.

🛠️ Requirements

1. Data Preprocessing and EDA

  • Clean the data: Handle missing values, remove duplicates, and ensure the data is in a usable format.
  • Explore the dataset: Understand the distribution of data, identify patterns, and generate summary statistics.
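The cleaning and exploration steps above could be sketched with pandas as follows. This is a minimal illustration, not the project's actual implementation: the column name `summary` and the helper names `clean_summaries` and `describe_lengths` are assumptions, since the dataset schema is not documented here.

```python
import pandas as pd


def clean_summaries(df: pd.DataFrame, text_col: str = "summary") -> pd.DataFrame:
    """Drop missing or empty summaries and remove duplicate summaries.

    `text_col` is a hypothetical column name; adjust it to the real schema.
    """
    out = df.dropna(subset=[text_col]).copy()
    # Collapse runs of whitespace so near-identical rows compare equal.
    out[text_col] = (
        out[text_col].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
    )
    out = out[out[text_col] != ""]
    return out.drop_duplicates(subset=[text_col]).reset_index(drop=True)


def describe_lengths(df: pd.DataFrame, text_col: str = "summary") -> pd.Series:
    """Summary statistics (count, mean, quartiles, ...) of summary length in words."""
    return df[text_col].str.split().str.len().describe()
```

A typical call would be `clean_summaries(pd.read_csv("data/book_summaries.csv"))`, followed by `describe_lengths` on the result to get a first view of how long the summaries are.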

2. NLP Component

  • Condense the book summaries: Use NLP techniques to shorten the summaries while keeping the essential information intact.
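The README does not name a specific summarization method. As one possible baseline, the sketch below uses frequency-based extractive summarization with only the standard library: it scores each sentence by the average corpus frequency of its non-stop-words and keeps the top few in their original order. The function names, the stop-word list, and the `max_sentences` parameter are all illustrative assumptions; a transformer-based abstractive model would be a reasonable alternative.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stop-word list; a real run would likely use a fuller one.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that",
    "this", "was", "for", "on", "with", "as", "by", "at", "but", "be", "are",
}


def split_sentences(text: str) -> List[str]:
    """Naive sentence split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def content_words(text: str) -> List[str]:
    """Lowercased word tokens with stop-words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def condense(text: str, max_sentences: int = 2) -> str:
    """Keep the `max_sentences` highest-scoring sentences, in original order."""
    sentences = split_sentences(text)
    if len(sentences) <= max_sentences:
        return " ".join(sentences)

    freq = Counter(content_words(text))

    def score(sentence: str) -> float:
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Because the output reuses sentences from the input verbatim, this approach cannot introduce facts that are not in the original summary, at the cost of less fluent results than an abstractive model.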

3. Computer Vision Component

  • Convert text to images: Use a Text-to-Image model to transform the condensed summaries into visual representations.
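The text-to-image model is not specified in this README. The sketch below assumes the Hugging Face `diffusers` library and a Stable Diffusion checkpoint; `model_id` is left as a required argument because the actual checkpoint used is unknown. The helper names, the default style suffix, and the `max_words` limit are hypothetical. The prompt is shortened by word count because text encoders accept a limited number of tokens and silently truncate anything longer.

```python
def build_prompt(condensed: str, style: str = "book cover illustration",
                 max_words: int = 50) -> str:
    """Turn a condensed summary into a short prompt with a style suffix."""
    core = " ".join(condensed.split()[:max_words])
    return f"{core}, {style}"


def summary_to_image(condensed: str, model_id: str, out_path: str,
                     device: str = "cuda") -> str:
    """Generate one image for a condensed summary and save it to `out_path`.

    `model_id` is an assumption: pass the Hub ID or local path of whichever
    Stable Diffusion checkpoint you have access to.
    """
    # Imported lazily so this module loads without diffusers/torch installed.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    pipe = pipe.to(device)
    image = pipe(build_prompt(condensed)).images[0]  # a PIL image
    image.save(out_path)
    return out_path
```

Loading the pipeline is expensive, so when converting many summaries it would make sense to create `pipe` once and reuse it rather than calling `from_pretrained` per summary as this sketch does.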

📊 Output

The script should provide:

  • Condensed text summaries: Shortened versions of the original book summaries.
  • Converted images: Visual representations of the condensed summaries.
  • Findings from EDA: Insights and patterns discovered during the exploratory data analysis.

🚀 Getting Started

Prerequisites

  • Python 3.8 or higher: Ensure Python is installed on your system.
  • Jupyter Notebook or any other Python IDE: For running and editing the script/notebook.

Installation

  1. Clone the repository

    git clone https://github.com/MuhammadMahdiAmirpour/data_science_project_book_dataset.git
    cd data_science_project_book_dataset
  2. Install the required packages

    pip install -r requirements.txt

Running the Project

  1. Start Jupyter Notebook from the repository root

    jupyter notebook
  2. Open notebooks/analysis.ipynb and run the cells in order

📝 Project Structure

data_science_project_book_dataset/
├── data/
│   └── book_summaries.csv        # Dataset containing book summaries
├── notebooks/
│   └── analysis.ipynb            # Jupyter notebook for analysis
├── src/
│   ├── data_preprocessing.py     # Script for data preprocessing and EDA
│   ├── nlp_component.py          # Script for NLP tasks
│   └── computer_vision_component.py  # Script for converting text to images
├── requirements.txt              # Required Python packages
└── README.md                     # Project documentation

📚 Data

The dataset is expected in the data/ directory as a CSV file named book_summaries.csv.
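Since the column layout of the CSV is not described here, a quick way to inspect it before running the notebook is to read the header row and count the records. The helper below is a minimal sketch using only the standard library; the function name and the UTF-8 default are assumptions.

```python
import csv
from typing import List, Tuple


def inspect_csv(path: str, encoding: str = "utf-8") -> Tuple[List[str], int]:
    """Return (column names, number of data rows) for a CSV file."""
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])          # empty list if the file is empty
        n_rows = sum(1 for _ in reader)    # remaining rows are data records
    return header, n_rows
```

For example, `inspect_csv("data/book_summaries.csv")` returns the column names, which tell you what to pass as the text column in later steps.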

🎓 Acknowledgments

This project was developed as part of a task assignment for Datacolab.

👨‍💻 Author

Muhammad Mahdi Amirpour


Built with ❤️ by Muhammad Mahdi Amirpour
