LaTeX project made on Overleaf, with its PDF and images available in this repository:
Overleaf: https://overleaf.com
PDF: In the root folder, or on ArXiv:
Images: In the "Images" folder
The project uses a Python virtual environment (venv), and the full dependency list is in requirements.txt. Assuming the Python venv module is already installed on your device, you can either run pip3 install -r requirements.txt inside the activated venv, or follow the steps below:
Create the venv called "firstEnv":
python3 -m venv firstEnv
Activate it:
source firstEnv/bin/activate
Install Jupyter Notebook, which we'll need to make the work presentable:
pip3 install jupyter
TensorFlow 1.x, for all the machine learning:
pip3 install tensorflow
Keras, to make the AI models even easier to implement:
pip3 install keras
Matplotlib, for our data visualization:
pip3 install matplotlib
tqdm, for progress bars in our Jupyter notebook or terminal:
pip3 install tqdm
Scikit-learn, for its extremely efficient implementations of a few famous algorithms:
pip3 install scikit-learn
Pandas, for all our dataframes and CSV manipulation:
python3 -m pip install --upgrade pandas
Surprise, for ready-made recommender algorithms to compare against:
pip3 install surprise
NumPy and SciPy, for dealing with large arrays:
pip3 install numpy
pip3 install scipy
Versions used in this project:
Python: 3.7.3
TensorFlow: 1.14.0
Keras: 2.2.4
MatplotLib: 3.1.1
TQDM: 4.36.1
Scikit-Learn: 0.21.3
Surprise: 0.1
Scipy: 1.3.0
Pandas: 0.25.1
Jupyter Notebook: 1.0.0
Markdown: 3.1.1
Numpy: 1.16.4
Pip: 19.3
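To confirm your own environment matches, a minimal snippet like the one below should work inside the activated venv (it skips Jupyter, Markdown, Surprise, and pip, which don't expose a version attribute as uniformly):

```python
# Minimal version check for most of the libraries listed above.
import sys
import keras
import matplotlib
import numpy
import pandas
import scipy
import sklearn
import tensorflow
import tqdm

print("Python:", sys.version.split()[0])
for module in (tensorflow, keras, matplotlib, tqdm, sklearn, scipy, pandas, numpy):
    print(f"{module.__name__}: {module.__version__}")
```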
- ML_Dataset Folder: Contains the MovieLens smallest and 27M datasets. It's in .gitignore because every file beyond the original datasets is generated by running the scripts in this project.
- first_env Folder: Also in .gitignore; it's the Python virtual environment where we install everything we need. I called it "first" because, in the future, there may be other environments to try different combinations of libraries, like TensorFlow 2.0 or Seaborn instead of Matplotlib, but that hasn't happened yet.
- images Folder: Has the images used in the theoretical and experimental sections of the article.
- main Folder: Literally the main folder: all the Python scripts are here, divided into the following subfolders:
-- dataPrep folder: The data preparation scripts. This covers dataset loading, the long-tail crop, dimensionality reduction, and clustering (minimal sketches of these steps follow the listing below).
- longTail_crop.py: Crops the movies that have too few ratings, and the users that rated too few movies, to make our data less noisy.
- profiles.py: Creates movie_profile.csv and user_profile.csv, the characteristic vectors that define movies and users in this recommender system.
- pca.py: Tries to reduce the dimensionality of these profiles; if it succeeds in a significant manner, it rewrites the old dataset.
- HDBSCAN_applied.py: Responsible for running the clustering algorithm over the movies dataset and creating the file movies_cluster.csv. That file ended up not being created, because the majority of users, and a good chunk of movies, could not be clustered.
- data_split.py: Splits the original CSVs into more strategic ones: a training set and a test set, called training_movies.csv and test_movies.csv.
- correct_dataset.py: Removes excess columns from the datasets.
-- recommenders folder: The Surprise library application (see the Surprise sketch after the listing below). Each file is very similar: some contain a grid search, like KNN Basic and SVD, while all the others are executions with 5-fold cross-validation. Each file is named after its algorithm.
-- results folder: Includes the results from almost all algorithm executions: the 20K predictions of several algorithms, the grid search results in KNNBasic_results and SVD_results, the HDBSCAN fit results for users and movies, and the PCA results.
- longTailCrop
- profiles
- correct_dataset
- PCA
- HDBSCAN_Applied
- data_split
- KNNBasic
- SVD
- KNNMeans
- KNNBaseline
- KNNZScore
- Baseline
- CoClustering
- SlopeOne
- NormalPred
- NMF
- SVDpp
- ploting
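As referenced in the dataPrep description above, here is a minimal sketch of the long-tail crop idea in longTail_crop.py. The thresholds, the ratings.csv path, and the column names (the standard MovieLens layout) are illustrative assumptions, not the project's actual values:

```python
import pandas as pd

# Illustrative thresholds, not the project's actual values.
MIN_RATINGS_PER_MOVIE = 20
MIN_RATINGS_PER_USER = 20

ratings = pd.read_csv("ML_Dataset/ratings.csv")  # hypothetical path

# Drop movies with too few ratings (the long tail)...
movie_counts = ratings["movieId"].value_counts()
keep_movies = movie_counts[movie_counts >= MIN_RATINGS_PER_MOVIE].index
ratings = ratings[ratings["movieId"].isin(keep_movies)]

# ...and users who rated too few movies.
user_counts = ratings["userId"].value_counts()
keep_users = user_counts[user_counts >= MIN_RATINGS_PER_USER].index
ratings = ratings[ratings["userId"].isin(keep_users)]

ratings.to_csv("ML_Dataset/ratings_cropped.csv", index=False)
```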
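A sketch of what profiles.py could look like. The construction here is an assumption for illustration (the article defines the real characteristic vectors): a movie profile as a one-hot encoding of its MovieLens genres, and a user profile as the mean of the profiles of the movies that user rated:

```python
import pandas as pd

movies = pd.read_csv("ML_Dataset/movies.csv")    # movieId, title, genres
ratings = pd.read_csv("ML_Dataset/ratings.csv")  # userId, movieId, rating

# One-hot encode the pipe-separated genre list as the movie profile.
genres = movies["genres"].str.get_dummies(sep="|")
movie_profile = pd.concat([movies[["movieId"]], genres], axis=1)

# User profile: mean of the profiles of the movies the user rated.
merged = ratings.merge(movie_profile, on="movieId")
user_profile = merged.groupby("userId")[list(genres.columns)].mean()

movie_profile.to_csv("movie_profile.csv", index=False)
user_profile.to_csv("user_profile.csv")
```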
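A sketch of the pca.py step with scikit-learn, using an illustrative 95% explained-variance threshold to decide how many components to keep and whether the reduction is worth rewriting the file:

```python
import pandas as pd
from sklearn.decomposition import PCA

profiles = pd.read_csv("movie_profile.csv", index_col="movieId")

# Keep enough components to explain ~95% of the variance
# (illustrative threshold, not the project's actual criterion).
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(profiles)

# Only overwrite the old profiles if the reduction is significant.
if reduced.shape[1] < profiles.shape[1]:
    pd.DataFrame(reduced, index=profiles.index).to_csv("movie_profile.csv")
```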
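A sketch of HDBSCAN_applied.py. Note that the hdbscan package is not in the install list above (pip3 install hdbscan), and min_cluster_size is an illustrative parameter:

```python
import pandas as pd
import hdbscan

profiles = pd.read_csv("movie_profile.csv", index_col="movieId")

clusterer = hdbscan.HDBSCAN(min_cluster_size=15)  # illustrative parameter
labels = clusterer.fit_predict(profiles)

# Label -1 marks points HDBSCAN could not cluster (noise); as noted above,
# most users and a good chunk of movies ended up there.
clusters = pd.DataFrame({"movieId": profiles.index, "cluster": labels})
clusters.to_csv("movies_cluster.csv", index=False)
```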
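A sketch of data_split.py using scikit-learn's train_test_split; the 80/20 proportion and the seed are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

ratings = pd.read_csv("ML_Dataset/ratings.csv")  # hypothetical path

# 80/20 split (illustrative proportion); fixed seed for reproducibility.
train, test = train_test_split(ratings, test_size=0.2, random_state=42)
train.to_csv("training_movies.csv", index=False)
test.to_csv("test_movies.csv", index=False)
```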
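Finally, a sketch of what the recommenders scripts look like with Surprise, shown here with SVD: a 5-fold cross-validation plus, as in the KNN Basic and SVD scripts, a grid search. The parameter grid is illustrative:

```python
import pandas as pd
from surprise import SVD, Dataset, Reader
from surprise.model_selection import GridSearchCV, cross_validate

ratings = pd.read_csv("training_movies.csv")
reader = Reader(rating_scale=(0.5, 5.0))  # MovieLens rating scale
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)

# 5-fold cross-validation, as in most of the recommender scripts.
cross_validate(SVD(), data, measures=["RMSE", "MAE"], cv=5, verbose=True)

# Grid search, as in the KNN Basic and SVD scripts (illustrative grid).
param_grid = {"n_factors": [50, 100], "n_epochs": [20, 30]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=5)
gs.fit(data)
print(gs.best_params["rmse"], gs.best_score["rmse"])
```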