This repository includes the semester project of the course ETSP (source repository). The code is written in Python v3.12.
Main packages: annoy, sentence_transformers, torch, numpy, sklearn (TfidfVectorizer), scipy, pandas, matplotlib, tqdm.
readAlike is a book recommendation system that provides similar books to a given input book. The system leverages multiple techniques, including TF-IDF vectorization, Sentence-BERT embeddings, and Approximate Nearest Neighbors (ANN) for generating content-based book recommendations based on title, description, author, and category data (dataset).
To run the recommendation engine:
- Open the command line and execute pip install -r requirements.txt.
- Execute main.py. Given an example book, the program prints the top five recommended books for each of three methods: TF-IDF, SBERT, and ANN.
- preprocessing/: Manages data preprocessing.
  - Preprocessor: Handles data cleaning and formatting from a CSV file of book data.
- core/: Contains the core modules for recommendation.
  - Library and Book: Model the library of books and individual book data.
  - Vectorizer: Converts book text data into numerical vectors using TF-IDF and Sentence-BERT.
  - DimensionalityReducer: Reduces the dimensionality of TF-IDF vectors using Truncated SVD.
  - Ann: Creates an Approximate Nearest Neighbors model for efficient similarity search.
  - Recommender: Main recommendation engine that integrates the above components to provide recommendations.
- config.py: Configuration file with column names for title, description, authors, and categories.
- main.py: Main entry point for running the recommendation pipeline.
- Preprocessing: The Preprocessor class reads the dataset and performs data cleaning.
- Library Initialization: Library is initialized with the cleaned dataset, storing each book as a Book object.
- Vectorization: Vectorizer creates TF-IDF and Sentence-BERT embeddings for each book.
- Dimensionality Reduction: DimensionalityReducer reduces TF-IDF embeddings for optimized ANN performance.
- ANN Construction: Ann constructs an ANN model based on the reduced vectors.
- Recommendation: Recommender groups its recommendations into TF-IDF, SBERT, and ANN-based results and outputs the most similar books for each.
Preprocessor:
- Attributes:
  - df: DataFrame containing cleaned book data.
- Methods:
  - preprocess_data(): Cleans and formats the data.
  - drop_items_with_short_entries(), drop_duplicates(), convert_strings_into_lists(): Helper functions to clean the dataset.
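The cleaning steps named above could look roughly like the following. This is a minimal sketch, not the repository's implementation: the column names, the `min_chars` threshold, the choice to deduplicate on title, and the assumption that list-like columns are stored as strings such as `"['Frank Herbert']"` are all hypothetical.

```python
import ast

import pandas as pd

# Hypothetical column names; in the project they are defined in config.py.
LIST_COLUMNS = ["authors", "categories"]


def drop_items_with_short_entries(df: pd.DataFrame, min_chars: int = 10) -> pd.DataFrame:
    """Drop rows whose description is missing or shorter than min_chars (assumed rule)."""
    lengths = df["description"].fillna("").str.len()
    return df[lengths >= min_chars]


def drop_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the first occurrence of each title (assumed deduplication key)."""
    return df.drop_duplicates(subset="title", keep="first")


def convert_strings_into_lists(df: pd.DataFrame) -> pd.DataFrame:
    """Parse stringified Python lists into real lists."""
    df = df.copy()
    for col in LIST_COLUMNS:
        df[col] = df[col].apply(ast.literal_eval)
    return df


def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Run the helper steps in order and renumber the rows."""
    df = drop_items_with_short_entries(df)
    df = drop_duplicates(df)
    df = convert_strings_into_lists(df)
    return df.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "title": ["Dune", "Dune", "Emma", "X"],
        "description": [
            "A desert planet epic.",
            "A desert planet epic.",
            "A comedy of manners.",
            "short",
        ],
        "authors": ["['Frank Herbert']", "['Frank Herbert']", "['Jane Austen']", "['Nobody']"],
        "categories": ["['Fiction']"] * 4,
    }
)
clean = preprocess_data(raw)
print(clean)
```

In this toy input the duplicate "Dune" row and the row with the five-character description are removed, leaving two books.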
Library:
- Attributes:
  - books: List of Book objects.
- Methods:
  - get_combined_data(): Concatenates title, description, authors, and categories into a single string per book.
  - get_book_idx(): Retrieves the index of a book within the library.
Book:
- Attributes:
  - title, description, authors, categories: Fields describing the book.
- Methods:
  - get_combined_data(): Combines title, description, authors, and categories into a single string.
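To make the relationship between the two data classes concrete, here is a small sketch using dataclasses. The field types, the space-joined output format, and the lookup-by-exact-title behaviour of `get_book_idx()` are assumptions for illustration; the actual classes may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Book:
    title: str
    description: str
    authors: list[str] = field(default_factory=list)
    categories: list[str] = field(default_factory=list)

    def get_combined_data(self) -> str:
        """Join all text fields into one string, skipping empty parts."""
        parts = [
            self.title,
            self.description,
            " ".join(self.authors),
            " ".join(self.categories),
        ]
        return " ".join(p for p in parts if p)


@dataclass
class Library:
    books: list[Book] = field(default_factory=list)

    def get_combined_data(self) -> list[str]:
        """Return one combined string per book, in library order."""
        return [book.get_combined_data() for book in self.books]

    def get_book_idx(self, title: str) -> int:
        """Hypothetical lookup by exact title; raises ValueError if absent."""
        for idx, book in enumerate(self.books):
            if book.title == title:
                return idx
        raise ValueError(f"Book not found: {title!r}")


lib = Library(
    [
        Book("Dune", "A desert planet epic.", ["Frank Herbert"], ["Fiction"]),
        Book("Emma", "A comedy of manners.", ["Jane Austen"], ["Fiction"]),
    ]
)
print(lib.get_combined_data())
print(lib.get_book_idx("Emma"))
```

The combined strings are what a vectorizer would consume, and the index returned by `get_book_idx()` identifies the matching row in any matrix built from them.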
Vectorizer:
- Attributes:
  - tfidf_matrix: Sparse matrix of TF-IDF vectors.
  - sbert_embeddings: Sentence-BERT embeddings for each book.
- Methods:
  - tfidf_vectorize(): Vectorizes a book using TF-IDF.
  - sbert_vectorize(): Vectorizes a book or library using SBERT.
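The TF-IDF half of this class can be sketched with scikit-learn's `TfidfVectorizer`. The toy corpus, the `stop_words="english"` setting, and the function signature are assumptions; the SBERT half is left out here because it requires downloading a model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the combined text of each book in the library.
corpus = [
    "Dune A desert planet epic about spice and empire",
    "Emma A comedy of manners set in an English village",
    "Foundation The fall and rebirth of a galactic empire",
]

# Fit once on the whole library; the fitted vocabulary is reused for queries.
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(corpus)  # sparse, shape (n_books, n_terms)


def tfidf_vectorize(text: str):
    """Project a single book's text into the already-fitted TF-IDF space."""
    return tfidf.transform([text])


query_vec = tfidf_vectorize("A story about a desert planet")
print(tfidf_matrix.shape, query_vec.shape)
```

Using `transform` rather than `fit_transform` for a single book keeps its vector in the same term space as `tfidf_matrix`, which is what makes the two comparable.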
DimensionalityReducer:
- Attributes:
  - reduced_matrix: Dimensionality-reduced version of the TF-IDF matrix.
- Methods:
  - reduce(): Reduces a TF-IDF vector to the lower dimension.
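A possible shape for the reduction step, using scikit-learn's `TruncatedSVD`, is sketched below. The random sparse matrix merely stands in for a real TF-IDF matrix, and `n_components=10` is an arbitrary illustrative value, not the project's setting.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Stand-in for a TF-IDF matrix: 50 "books" x 200 "terms", roughly 10% non-zero.
rng = np.random.default_rng(0)
dense = rng.random((50, 200))
dense[dense < 0.9] = 0.0
tfidf_matrix = csr_matrix(dense)

# Fit the SVD on the whole library and keep the reduced matrix.
svd = TruncatedSVD(n_components=10, random_state=0)
reduced_matrix = svd.fit_transform(tfidf_matrix)  # dense, shape (50, 10)


def reduce(vector):
    """Map one TF-IDF row vector into the fitted lower-dimensional space."""
    return svd.transform(vector)


single = reduce(tfidf_matrix[0])
print(reduced_matrix.shape, single.shape)
```

Truncated SVD works directly on sparse input, which is why it suits TF-IDF matrices; the dense, low-dimensional output is the form an ANN index expects.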
Ann:
- Attributes:
  - ann_indices: ANN model for similarity search.
- Methods:
  - get_nearest_neighbors_by_index(), get_nearest_neighbors_by_vector(): Retrieve nearest neighbors by item index or by vector.
Recommender:
- Attributes:
  - lib: Library of books.
  - vectorizer: Vectorizer instance.
  - reducer: Dimensionality reducer instance.
  - ann: ANN instance.
- Methods:
  - recommend(): Provides top recommendations based on TF-IDF, SBERT, and ANN.
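For the exact (non-ANN) methods, a ranking step along the following lines is one plausible approach: compute cosine similarity between the query book and every other book, then keep the highest scores. The function signature and the use of cosine similarity are assumptions; the same routine could be applied to either TF-IDF or SBERT vectors once they are in dense form.

```python
import numpy as np


def recommend(
    embeddings: np.ndarray, titles: list[str], query_idx: int, top_k: int = 5
) -> list[str]:
    """Return the titles of the top_k books most similar to the query book."""
    # Normalize rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_idx]
    scores[query_idx] = -np.inf  # never recommend the query book itself
    best = np.argsort(scores)[::-1][:top_k]
    return [titles[i] for i in best]


# Toy 2-D embeddings: B points almost the same way as A, C is orthogonal to A.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
titles = ["A", "B", "C", "D"]
result = recommend(embeddings, titles, query_idx=0, top_k=2)
print(result)  # ['B', 'D']
```

Masking the query's own score before sorting avoids the trivial result in which every book is its own best match.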