The Machine Learning Works Repository is a comprehensive collection of Python implementations covering fundamental machine learning algorithms, data preprocessing techniques, model evaluation methods, and practical applications. It provides educational code examples for both supervised and unsupervised learning using popular Python libraries like scikit-learn, matplotlib, and numpy.
## Table of Contents

- Overview
- Repository Structure
- Technologies
- Installation and Setup
- Module Documentation
- Usage Examples
- Testing
- Contributing Guidelines
- References
- Support
## Overview

This repository serves as an educational resource for machine learning concepts and implementations. It includes:
- Supervised learning algorithms (classification and regression)
- Unsupervised learning techniques (clustering and dimensionality reduction)
- Data preprocessing and feature engineering methods
- Model evaluation and hyperparameter tuning techniques
- Machine learning pipelines and workflow automation
## Repository Structure

```
chains-and-algorithm/
│   make-pipe.py
│   pipeline-1.py
│   pipeline-gridSearch.py
data_representation/
│   one-hot-encoding.py
model-evaluation/
│   GroupKFold.py
│   cross-validation-works.py
│   grid-search.py
│   k-fold-evaluation.py
models/
│   classifying_iris.py
supervised_learning/
│   SVM-works.py
│   decision_trees.py
│   kNN-classifier-works.py
│   kNN-regressor-works.py
│   linear_models_works.py
│   logistic-regression-works.py
│   neural-networks.py
unsupervised_learning/
│   Agglomerative-works.py
│   DBSCAN-works.py
│   NMF-works.py
│   PCA-works.py
│   comparison.py
│   feature_extraction.py
│   k-means-works.py
│   scaling-works.py
│   testing_file.py
```
## Technologies

- Python: 3.8+
- Scikit-learn: 1.2+ for machine learning algorithms
- Matplotlib: 3.5+ for data visualization
- NumPy: 1.22+ for numerical computations
- Pandas: For data manipulation and analysis
- mglearn: For educational visualization utilities
## Installation and Setup

### Prerequisites

- Python: 3.8 or higher (verify with `python --version`)
- pip: Python package installer
- Git: For cloning the repository

### Installation Steps

1. Clone the repository:

   ```bash
   git clone https://github.com/mixro/machine-learning-works.git
   cd machine-learning-works
   ```

2. Create a virtual environment (recommended):

   ```bash
   python -m venv ml-env
   source ml-env/bin/activate  # On Windows: ml-env\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   If `requirements.txt` is not available, install the core packages directly:

   ```bash
   pip install scikit-learn matplotlib numpy pandas mglearn jupyter
   ```
## Module Documentation

### chains-and-algorithm/

#### make-pipe.py
- Purpose: Demonstrates creating custom pipelines for machine learning workflows.
- Key Features: Custom transformer implementation, building preprocessing and modeling pipelines, integration with scikit-learn pipeline API
- Example Imports:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
```
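As a rough sketch of the pattern (not the script's actual code), a custom transformer only needs `fit` and `transform` to slot into a `Pipeline`; `ColumnClipper` below is a hypothetical example:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class ColumnClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each feature to a percentile range learned at fit time."""
    def __init__(self, low=1, high=99):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        self.low_, self.high_ = np.percentile(X, [self.low, self.high], axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.low_, self.high_)

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipe = Pipeline([
    ("clip", ColumnClipper()),                    # custom preprocessing step
    ("scale", StandardScaler()),                  # built-in preprocessing step
    ("clf", LogisticRegression(max_iter=1000)),   # final estimator
])
pipe.fit(X, y)
print("training accuracy:", pipe.score(X, y))
```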
#### pipeline-1.py
- Purpose: Basic pipeline implementation for data preprocessing and modeling.
- Key Features: Sequential data transformation, combined preprocessing and model training, streamlined machine learning workflows
#### pipeline-gridSearch.py
- Purpose: Integrates pipelines with grid search for hyperparameter tuning.
- Key Features: Hyperparameter tuning across pipeline steps, optimization of preprocessing parameters alongside model parameters, efficient parameter search space definition
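A minimal sketch of the pattern (the dataset and grid values here are illustrative, not necessarily what the script uses): parameters of a pipeline step are addressed as `step__parameter`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
param_grid = {
    "svm__C": [0.01, 0.1, 1, 10, 100],       # model parameter of the "svm" step
    "svm__gamma": [0.001, 0.01, 0.1, 1],
}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)                    # scaling is refit inside every CV split
print("best parameters:", grid.best_params_)
print("test score:", grid.score(X_test, y_test))
```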
### data_representation/

#### one-hot-encoding.py
- Purpose: Demonstrates one-hot encoding for categorical variables.
- Key Features: Conversion of categorical data to numerical format, handling of nominal variables in machine learning, comparison with other encoding techniques
- Example Imports:
```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
```
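For illustration, a made-up categorical column encoded two ways (the `city` column is invented for this sketch):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["London", "Paris", "Paris", "Nairobi"]})

# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(df[["city"]])
print(encoder.get_feature_names_out())   # ['city_London' 'city_Nairobi' 'city_Paris']
print(encoded)

# pandas offers the same idea via get_dummies:
print(pd.get_dummies(df, columns=["city"]))
```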
### model-evaluation/

#### GroupKFold.py
- Purpose: Implements Group K-Fold cross-validation for data with groups.
- Key Features: Cross-validation that ensures samples from the same group never appear in both the training and test folds, useful for subject-specific or group-specific data
- Example Imports:
```python
from sklearn.model_selection import GroupKFold
import numpy as np
```
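A small sketch with synthetic data and made-up group labels; the point is that samples sharing a group id never end up on both sides of a split:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=16, random_state=0)
groups = np.repeat([0, 1, 2, 3], 4)   # e.g. four samples per subject

scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y,
    groups=groups, cv=GroupKFold(n_splits=4))   # each fold holds out one whole group
print("per-fold scores:", scores)
```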
#### cross-validation-works.py
- Purpose: Demonstrates various cross-validation techniques.
- Key Features: Implementation of k-fold, stratified k-fold, and leave-one-out cross-validation, comparison of different validation strategies
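A condensed sketch comparing a few strategies on the iris data (chosen here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

strategies = [
    ("k-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
    ("stratified k-fold", StratifiedKFold(n_splits=5)),
    ("leave-one-out", LeaveOneOut()),
]
for name, cv in strategies:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy = {scores.mean():.3f} over {len(scores)} folds")
```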
#### grid-search.py
- Purpose: Implements hyperparameter tuning using grid search.
- Key Features: Exhaustive search over specified parameter values, model selection based on cross-validation performance, visualization of parameter performance
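A minimal sketch of an exhaustive search (the parameter grid is an arbitrary example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5, return_train_score=True)
grid.fit(X_train, y_train)   # fits every C/gamma combination with 5-fold CV

print("best cross-validation score:", grid.best_score_)
print("best parameters:", grid.best_params_)
print("test set score:", grid.score(X_test, y_test))
```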
#### k-fold-evaluation.py
- Purpose: Focused implementation and evaluation of K-Fold cross-validation.
- Key Features: Detailed analysis of k-fold performance, impact of different k values on model evaluation, bias-variance tradeoff analysis
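For example, a small sketch of how the choice of k can be compared (dataset and k values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
for k in (3, 5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    # larger k -> more training data per fold, but noisier per-fold estimates
    print(f"k={k}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```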
### models/

#### classifying_iris.py
- Purpose: Implements classification algorithms on the Iris dataset.
- Key Features: Loads and preprocesses the famous Iris dataset, implements multiple classification algorithms, evaluates model performance metrics
- Example Imports:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
```
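A condensed sketch of this kind of workflow; the script itself may use additional classifiers and metrics:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```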
### supervised_learning/

#### SVM-works.py
- Purpose: Demonstrates Support Vector Machines for classification and regression.
- Key Features: Linear and nonlinear SVM implementation, kernel trick visualization (RBF, polynomial), tuning of the C and gamma parameters
- Example Imports:
```python
from sklearn.svm import SVC, SVR
import matplotlib.pyplot as plt
import numpy as np
```
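A minimal sketch of an RBF-kernel classifier on scaled data (the dataset and the C/gamma values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)            # fit the scaler on training data only
svm = SVC(kernel="rbf", C=10, gamma=0.01)         # C: regularization, gamma: kernel width
svm.fit(scaler.transform(X_train), y_train)
print("test accuracy:", svm.score(scaler.transform(X_test), y_test))
```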
#### decision_trees.py
- Purpose: Implements decision trees for classification and regression.
- Key Features: Tree-based model construction, visualization of decision boundaries, pruning and complexity parameter tuning
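A small sketch of pre-pruning via `max_depth` (dataset and depth values chosen for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0)

for depth in (None, 4):   # unrestricted tree vs. depth-limited (pre-pruned) tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.3f}, "
          f"test={tree.score(X_test, y_test):.3f}")

print(export_text(tree, max_depth=2))   # text view of the pruned tree's top levels
```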
#### kNN-classifier-works.py
- Purpose: Demonstrates k-Nearest Neighbors for classification tasks.
- Key Features: Distance-based classification, impact of k parameter on model performance, feature scaling importance for kNN
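A minimal sketch of how the k parameter and feature scaling can be explored together (values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0)

for k in (1, 3, 5, 11):
    # kNN is distance-based, so scale features before measuring neighbours
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_train, y_train)
    print(f"k={k}: test accuracy = {knn.score(X_test, y_test):.3f}")
```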
#### kNN-regressor-works.py
- Purpose: Implements k-Nearest Neighbors for regression tasks.
- Key Features: Instance-based regression, distance-weighted predictions, comparison with other regression techniques
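A small sketch on synthetic 1-D data (an assumption for illustration rather than the script's actual setup); `weights="distance"` gives closer neighbours more influence:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)   # noisy sine wave

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X_train, y_train)
print("test R^2:", reg.score(X_test, y_test))
print("prediction at x=0:", reg.predict([[0.0]]))
```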
#### linear_models_works.py
- Purpose: Demonstrates linear models for regression and classification.
- Key Features: Linear and logistic regression implementation, regularization techniques (Ridge, Lasso, ElasticNet), coefficient analysis and interpretation
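A minimal sketch comparing ordinary least squares with Ridge and Lasso (dataset and alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_diabetes(return_X_y=True), random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.5)):
    model.fit(X_train, y_train)
    n_used = np.sum(model.coef_ != 0)   # Lasso drives some coefficients exactly to zero
    print(f"{type(model).__name__}: test R^2 = {model.score(X_test, y_test):.3f}, "
          f"nonzero coefficients = {n_used}")
```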
#### logistic-regression-works.py
- Purpose: Focused implementation of logistic regression for classification.
- Key Features: Binary and multiclass classification, probability calibration and threshold tuning, regularization path analysis
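A small sketch of probability output and a custom decision threshold (dataset and threshold value chosen for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]       # probability of the positive class
custom = (proba >= 0.3).astype(int)           # lower threshold -> more predicted positives
print("default-threshold accuracy:", clf.score(X_test, y_test))
print("positives at 0.5 vs 0.3 threshold:", (proba >= 0.5).sum(), custom.sum())
```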
#### neural-networks.py
- Purpose: Implements neural networks using scikit-learn's MLP classifier/regressor.
- Key Features: Feedforward neural network implementation, hidden layer architecture experimentation, activation function and solver comparison
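A minimal sketch of an MLP wrapped with scaling (the hidden-layer sizes shown are an arbitrary example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0)

mlp = make_pipeline(
    StandardScaler(),                              # MLPs train much better on scaled inputs
    MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                  solver="adam", max_iter=1000, random_state=0))
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```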
### unsupervised_learning/

#### Agglomerative-works.py
- Purpose: Demonstrates hierarchical clustering using Agglomerative Clustering.
- Key Features: Implementation of Agglomerative Clustering algorithm, visualization of dendrograms and clustering results, comparison of different linkage methods
- Example Imports:
```python
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
import numpy as np
import mglearn
from sklearn.datasets import make_blobs
```
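A minimal sketch on blob data; the dendrogram here is drawn with SciPy's `ward` linkage (SciPy comes with scikit-learn's dependencies) rather than mglearn's helpers:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, ward
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print("cluster sizes:", [int((labels == i).sum()) for i in range(3)])

dendrogram(ward(X))          # hierarchical merge tree for the same data
plt.xlabel("sample index")
plt.ylabel("cluster distance")
plt.show()
```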
#### DBSCAN-works.py
- Purpose: Implements Density-Based Spatial Clustering of Applications with Noise (DBSCAN).
- Key Features: Demonstration of density-based clustering, handling of noisy data and irregular cluster shapes, parameter tuning for epsilon and minimum samples
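A small sketch on the two-moons data, where density-based clustering handles non-spherical shapes (eps and min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)   # DBSCAN's eps is a distance, so scale first

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points (label -1):", int(np.sum(labels == -1)))
```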
#### NMF-works.py
- Purpose: Implements Non-Negative Matrix Factorization for feature extraction and dimensionality reduction.
- Key Features: Matrix factorization for pattern recognition, applications in topic modeling and feature extraction, comparison with other dimensionality reduction techniques
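A toy sketch using a tiny made-up corpus to show the document-topic / topic-term factorization:

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell sharply", "investors sold shares and stocks"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                      # non-negative matrix, as NMF requires

nmf = NMF(n_components=2, init="nndsvda", random_state=0)
W = nmf.fit_transform(X)                           # document-topic weights
H = nmf.components_                                # topic-term weights

terms = tfidf.get_feature_names_out()
for i, topic in enumerate(H):
    top = topic.argsort()[::-1][:3]
    print(f"topic {i}:", [terms[j] for j in top])
```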
#### PCA-works.py
- Purpose: Demonstrates Principal Component Analysis for dimensionality reduction.
- Key Features: Implementation of PCA for feature reduction, visualization of explained variance ratio, applications in data compression and visualization
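A minimal sketch on the iris data (used here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)       # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print("reduced shape:", X_2d.shape)                              # (150, 2)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("total variance kept:", pca.explained_variance_ratio_.sum())
```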
#### comparison.py
- Purpose: Compares different unsupervised learning algorithms.
- Key Features: Side-by-side comparison of clustering algorithms, performance metrics for unsupervised learning, visualization of different techniques on common datasets
#### feature_extraction.py
- Purpose: Demonstrates various feature extraction techniques.
- Key Features: Text feature extraction (CountVectorizer, TF-IDF), image feature extraction, custom feature creation methods
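A small sketch of the text side on a made-up corpus, contrasting raw counts with TF-IDF weights:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the quick brown fox", "the lazy dog", "the quick dog"]

vectorizer = CountVectorizer().fit(corpus)
print("vocabulary:", vectorizer.get_feature_names_out())
print("count matrix:\n", vectorizer.transform(corpus).toarray())

tfidf = TfidfVectorizer().fit_transform(corpus)
print("tf-idf matrix:\n", tfidf.toarray().round(2))
```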
#### k-means-works.py
- Purpose: Implements k-Means clustering algorithm.
- Key Features: Partition-based clustering, elbow method for determining optimal k, cluster initialization strategies
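A minimal sketch of the elbow method on synthetic blobs (the range of k is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia = within-cluster sum of squares; look for the "elbow" in these values
    print(f"k={k}: inertia={km.inertia_:.1f}")
```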
#### scaling-works.py
- Purpose: Demonstrates data scaling techniques for machine learning.
- Key Features: Standardization and normalization methods, impact of scaling on different algorithms, robust scaling for data with outliers
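A tiny sketch contrasting three scalers on made-up data containing an outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # RobustScaler uses median/IQR, so the outlier distorts it far less
    print(type(scaler).__name__, scaler.fit_transform(X).ravel().round(2))
```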
#### testing_file.py
- Purpose: Serves as a sandbox for testing new ideas and algorithms.
- Key Features: Experimental code development, algorithm prototyping, quick testing of concepts
## Usage Examples

- Running a supervised learning algorithm:

  ```bash
  python supervised_learning/SVM-works.py
  ```

- Performing unsupervised clustering:

  ```bash
  python unsupervised_learning/k-means-works.py
  ```

- Testing a complete pipeline:

  ```bash
  python chains-and-algorithm/pipeline-gridSearch.py
  ```

- Evaluating model performance:

  ```bash
  python model-evaluation/k-fold-evaluation.py
  ```

- Using Jupyter Notebook for exploration:

  ```bash
  jupyter notebook
  ```
## Testing

The repository includes various test files to verify the functionality of different modules:

```bash
# Run specific test files
python -m pytest model-evaluation/ -v

# Run all test files in the repository
python -m pytest
```

## Contributing Guidelines

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Add tests for new functionality
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please adhere to PEP 8 guidelines and include docstrings for all functions and classes following the Google Python Style Guide.
## References

- Scikit-learn documentation: https://scikit-learn.org
- Matplotlib documentation: https://matplotlib.org
- Python Data Science Handbook: https://jakevdp.github.io/PythonDataScienceHandbook/
- Introduction to Machine Learning with Python: https://www.oreilly.com/library/view/introduction-to-machine/9781449369880/
## Support

For assistance, please refer to the repository issues page or contact the maintainers with detailed error logs and context.