This repository contains our solution for the Neftecode Hackathon, held in April 2024 by ITMO University in collaboration with an industrial partner interested in predictive modeling of lubricant properties. The main goal was to evaluate the applicability of modern machine learning methods for predicting the viscosity of oil mixtures based on an encrypted dataset.
The provided encrypted dataset describes various oil blends and includes:
- Oil type: Categorical identifier for the type of oil.
- Oil properties: Physical and chemical properties of the oil (e.g., density at different temperatures, viscosity of the base oil, additives, ion compositions, and others).
- Component classes: Types of additives present in the oil.
- Component properties: Physical properties of each component (e.g., pour point, demulsification time, separated water volume, and others).
- SMILES strings: Text-based representations of the molecular structures of components.
The target variable is viscosity, measured by the industrial partner according to the ASTM D445-24 standard. A log-transformed distribution of the target is shown below.
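A minimal sketch of this inspection step is shown below; the CSV path and the `viscosity` column name are assumptions for illustration, not the actual dataset layout:

```python
# Inspect the target distribution in log space.
# The file path and column name below are hypothetical placeholders.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset/train.csv")       # hypothetical path
log_viscosity = np.log(df["viscosity"])     # log-transform the target

plt.hist(log_viscosity, bins=30, edgecolor="black")
plt.xlabel("log(viscosity)")
plt.ylabel("Count")
plt.title("Log-transformed target distribution")
plt.show()
```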
The work was organized into the following tasks:

- Data Analysis and Literature Review: Perform exploratory data analysis and review relevant publications to understand the domain and dataset.
- SMILES Conversion: Convert SMILES strings to machine-readable embeddings (a minimal fingerprint-based sketch follows this list) using methods such as:
  - Transformers
  - Graph Neural Networks (GNNs)
  - Quantum chemical descriptors
- Model Development - Part 1: Develop a model to predict viscosity using encrypted physical and chemical data, capable of handling missing values.
- Model Development - Part 2: Build a model to predict viscosity from SMILES embeddings, accommodating varying numbers of SMILES per sample.
- Pipeline Integration: Combine both models into a unified pipeline, perform hyperparameter optimization, and evaluate performance using standard regression metrics.
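As a concrete illustration of the SMILES-embedding and pooling steps above, the sketch below builds Morgan fingerprints with RDKit and mean-pools them over a blend's components, so a blend with any number of SMILES maps to a fixed-size vector. The helper name and fingerprint settings (radius 2, 1024 bits) are illustrative choices, not the competition configuration:

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def embed_blend(smiles_list, n_bits=1024):
    """Mean-pool Morgan fingerprints over a variable number of components."""
    vecs = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable SMILES
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
        arr = np.zeros((n_bits,))
        DataStructs.ConvertToNumpyArray(fp, arr)
        vecs.append(arr)
    # Mean pooling yields a fixed-size vector regardless of component count
    return np.mean(vecs, axis=0) if vecs else np.zeros(n_bits)

# Example: a two-component blend
embedding = embed_blend(["CCO", "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"])
print(embedding.shape)  # (1024,)
```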
Given the dataset's limited size (340 samples), we focused on tree-based models, which are known for their effectiveness on small datasets and for their interpretability, an essential factor for industrial applications. Three architectures were compared (a baseline sketch follows this list):
- Decision Tree (DT)
- Random Forest (RF)
- Gradient Boosting (GB)
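A hedged sketch of how such a baseline comparison can be run with scikit-learn; the synthetic data and hyperparameters are placeholders, not the competition setup:

```python
# Compare the three tree-based families with 5-fold cross-validation.
# X and y stand in for the prepared feature matrix and target;
# here they are replaced with synthetic data of matching size.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=340, n_features=20, noise=0.1, random_state=42)

models = {
    "DT": DecisionTreeRegressor(random_state=42),
    "RF": RandomForestRegressor(n_estimators=300, random_state=42),
    "GB": GradientBoostingRegressor(random_state=42),
}
for name, model in models.items():
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: CV MAE = {mae:.3f}")
```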
- Hyperparameter Tuning: Utilized `GridSearchCV` from `scikit-learn` for systematic hyperparameter optimization.
- Cross-Validation: Employed 5-fold cross-validation to ensure model robustness.
- Evaluation Metrics: Assessed models using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
- Target Transformation: Applied a logarithmic transformation to the target variable (viscosity), which consistently improved model performance across all architectures. These steps are combined in the sketch below.
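A minimal sketch combining imputation, a log-transformed target, and grid search with 5-fold cross-validation; the parameter grid, imputation strategy, and synthetic data are illustrative assumptions, not the exact competition configuration:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Stand-in data; viscosity is positive, so the log-transform is valid.
X, y = make_regression(n_samples=340, n_features=20, random_state=42)
y = np.abs(y) + 1.0

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handles missing values
    ("model", TransformedTargetRegressor(
        regressor=GradientBoostingRegressor(random_state=42),
        func=np.log, inverse_func=np.exp)),          # fit in log space
])
grid = GridSearchCV(
    pipe,
    param_grid={
        "model__regressor__n_estimators": [100, 300],
        "model__regressor__max_depth": [2, 3, 4],
        "model__regressor__learning_rate": [0.05, 0.1],
    },
    cv=5, scoring="neg_root_mean_squared_error",
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)
```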
The Gradient Boosting (GB) model demonstrated superior performance, achieving the lowest MAE and RMSE. Consequently, it was selected as the final model for prediction tasks.
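For reuse in the inference notebook, the fitted estimator can be serialized, for example with `joblib`; the filename below is a hypothetical placeholder:

```python
# Persist the selected model to the model/ directory (filename hypothetical).
import joblib

joblib.dump(grid.best_estimator_, "model/gb_viscosity.joblib")
```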
Repository structure:

```
.
├── dataset/        # Dataset files
├── model/          # Trained model
├── results/        # Prediction outputs
├── Predict.ipynb   # Inference pipeline
└── Train.ipynb     # Training pipeline
```
1. Training the Model:
   - Open `Train.ipynb` in Jupyter Notebook or JupyterLab.
   - Execute the cells sequentially to preprocess data, train the model, and evaluate performance.
   - The trained model will be saved in the `model/` directory.
2. Making Predictions:
   - Open `Predict.ipynb`.
   - Ensure the trained model is available in the `model/` directory.
   - Execute the cells to load the model and make predictions on new data (a minimal loading sketch follows this list).
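A sketch of what the prediction step boils down to, assuming `joblib` serialization; all file names are hypothetical placeholders:

```python
# Load the trained model and write predictions to results/.
import joblib
import pandas as pd

model = joblib.load("model/gb_viscosity.joblib")
X_new = pd.read_csv("dataset/new_blends.csv")   # hypothetical input file
predictions = model.predict(X_new)
pd.DataFrame({"viscosity_pred": predictions}).to_csv(
    "results/predictions.csv", index=False)
```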


