In this repository, you'll find my approach to building solutions for data mining projects. I follow the widely used CRISP-DM methodology, outlined in the image below, which provides an overview of the typical phases of a project, the tasks associated with each phase, and how these tasks interconnect.

1. Business Understanding
- Determine the main objectives of the organization related to the project;
- Analyze the current scenario and identify challenges or opportunities;
- Conduct an in-depth analysis of the current situation (resources, requirements, risks, and constraints);
- Clearly define the objectives of data mining in the context of the project;
- Develop a detailed plan outlining the phases and tasks to be carried out during the project.
2. Data Understanding
- Description of the data (meaning of the variables, number of variables and records, meaning of the records, among others);
- Data exploration (graphical and statistical analysis, identification of correlations);
- Data quality verification (identification of possible errors, such as missing values, outliers, or inconsistent data);
- Analysis of the prevalence and nature of the identified errors.
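As a minimal sketch of the exploration and quality-verification tasks above, assuming the dataset lives in a hypothetical `data.csv` file and using pandas:

```python
import pandas as pd

# Hypothetical input file; replace with the project's actual dataset.
df = pd.read_csv("data.csv")

# Data description: number of records/variables, variable types, basic statistics.
print(df.shape)
print(df.dtypes)
print(df.describe())

# Data exploration: correlations between numeric variables.
numeric = df.select_dtypes("number")
print(numeric.corr())

# Data quality verification: missing values, duplicates, and a simple outlier count.
print(df.isna().sum())                 # missing values per column
print(df.duplicated().sum())           # duplicated records
z_scores = (numeric - numeric.mean()) / numeric.std()
print((z_scores.abs() > 3).sum())      # values more than 3 standard deviations from the mean
```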
3. Data Preparation
- Data cleaning:
  - Selection of relevant data for analysis, considering its importance to the project objectives;
  - Removal or imputation of missing data, including selection of the best estimator, if necessary;
  - Cleaning incorrect, incomplete, or duplicated data;
- Data transformation:
  - Feature engineering: Creating new aggregated or derived features from existing data to uncover insights;
  - Encoding categorical data: Converting text categories to numbers for modeling;
  - Data normalization or scaling: Standardizing data ranges to enable meaningful comparisons;
  - Anomaly detection;
  - Combining data from different sources;
  - Anonymizing personal information;
  - Converting data types;
  - Structuring unstructured data;
  - & Others
- Data integration (grouping data, combining information from various columns);
- Feature selection: Selecting the most relevant features to avoid over-fitting;
- Addressing class imbalance: Re-sampling if one target class dominates to prevent bias;
- & Others
- Data formatting;
- Data splitting: Dividing the data into three sets (training, validation, and test data). A minimal preprocessing sketch covering several of these tasks follows this list.
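Below is the preprocessing sketch mentioned above (imputation, encoding, scaling, and the train/validation/test split), assuming a pandas DataFrame `df` with a label column named `target`; the column names and split ratios are illustrative choices, not requirements:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed layout: 'target' is the label, everything else is a predictor.
X = df.drop(columns=["target"])
y = df["target"]

numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# Data cleaning + transformation: impute missing values, scale numeric columns,
# and one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Data splitting: 60% training, 20% validation, 20% test (a common choice, not a rule).
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

X_train_prepared = preprocess.fit_transform(X_train)   # fit only on training data
X_val_prepared = preprocess.transform(X_val)
X_test_prepared = preprocess.transform(X_test)
```

Fitting the preprocessing only on the training split avoids leaking information from the validation and test sets into the model.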
4. Data Modeling
Data modeling is the phase of the CRISP-DM methodology that aims to estimate the inherent structure of a dataset in order to reveal valuable patterns and predict unseen instances. Remember, CRISP-DM is an iterative process: this step often loops back to the previous one (data preparation).
At this step we need to choose the appropriate machine learning algorithms according to our objectives. In the following diagram we can see all the possible options:

Supervised Learning:
Supervised learning can be categorized into two main types:
- Classification: This involves predicting a discrete label, such as identifying an email as spam or not spam.
- Regression: This involves predicting a continuous value, like forecasting the price of a house based on its features.
Note: The category depends on the target variable. The predictors can be continuous or discrete (or both) depending on the model selected.
Popular Supervised Learning Algorithms:
- Linear Regression: Used for predicting continuous outcomes. It models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
- Logistic Regression: Used for binary classification tasks (e.g., predicting yes/no outcomes). It estimates probabilities using a logistic function.
- Decision Trees: These models predict the value of a target variable by learning simple decision rules inferred from the data features.
- Random Forests: An ensemble of decision trees, typically used for classification and regression, improving model accuracy and overfitting control.
- Support Vector Machines (SVM): Effective in high-dimensional spaces, SVM is primarily used for classification but can also be used for regression.
- Neural Networks: These are powerful models that can capture complex non-linear relationships. They are widely used in deep learning applications.
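As a quick illustration of a few of these algorithms, the sketch below compares them with cross-validation on one of scikit-learn's built-in datasets; it is meant as a starting point, not a model recommendation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Built-in binary classification dataset used purely for illustration.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

# 5-fold cross-validated accuracy for each candidate model.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```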
Advanced Supervised Learning Algorithms
Hyper-Parameter Optimization Techniques:
- Grid Search: This method exhaustively searches through a specified hyperparameter space. While it guarantees finding the best combination within the grid, it can be computationally expensive and time-consuming, especially with numerous hyperparameters.
- Random Search: Unlike grid search, random search samples a fixed number of hyperparameter combinations randomly. This approach can be more efficient and often yields comparable results to grid search with less computational cost.
- Bayesian Optimization: This advanced technique models the performance of the model as a probabilistic function and uses this model to select the most promising hyperparameters to evaluate next. It is particularly effective in reducing the number of evaluations needed to find optimal hyperparameters.
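A sketch of grid search and random search with scikit-learn, tuning a random forest; the parameter grid and distributions are arbitrary illustrative choices. Bayesian optimization is typically done with a dedicated library such as Optuna or scikit-optimize, which I don't show here:

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# Grid search: exhaustively evaluates every combination in the grid.
grid = GridSearchCV(
    model,
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print("Grid search best:", grid.best_params_, grid.best_score_)

# Random search: samples a fixed number of combinations from the given distributions.
random_search = RandomizedSearchCV(
    model,
    param_distributions={"n_estimators": randint(50, 500), "max_depth": randint(2, 20)},
    n_iter=10,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X, y)
print("Random search best:", random_search.best_params_, random_search.best_score_)
```

For the same compute budget, the randomized search usually explores a wider range of values than an exhaustive grid.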
Ensemble Methods in Supervised Learning
- Bagging: This technique involves training multiple models on different subsets of the data and aggregating their predictions. Random Forest is a well-known example of bagging, where multiple decision trees are trained and their outputs are combined to enhance accuracy.
- Boosting: Boosting focuses on sequentially training models, where each new model attempts to correct the errors made by the previous ones. AdaBoost is a popular boosting algorithm that adjusts the weights of misclassified instances to improve the model's performance.
- Stacking: This method involves training multiple models and then using another model to combine their predictions. It leverages the strengths of various algorithms to achieve better accuracy.
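A sketch of the three ensemble strategies using scikit-learn's built-in estimators; the base models and hyperparameters are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

ensembles = {
    # Bagging: many trees trained on bootstrap samples, predictions aggregated.
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42),
    # Boosting: models trained sequentially, each focusing on the previous errors.
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
    # Stacking: a meta-model (logistic regression) combines the base models' predictions.
    "Stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier(random_state=42)),
                    ("bag", BaggingClassifier(random_state=42))],
        final_estimator=LogisticRegression(max_iter=5000),
    ),
}

for name, model in ensembles.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```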
Unsupervised Learning:
Unsupervised learning involves training algorithms on unlabeled data to discover inherent patterns, structures, or relationships within the data.
Unsupervised machine learning methods:
- Clustering (K-means, SOM & others):
  - Exclusive clustering: Data is grouped in a way where a single data point can only exist in one cluster. This is also referred to as "hard" clustering. A common example of exclusive clustering is the K-means clustering algorithm, which partitions data points into a user-defined number K of clusters.
  - Overlapping clustering: Data is grouped in a way where a single data point can exist in two or more clusters with different degrees of membership. This is also referred to as "soft" clustering.
  - Hierarchical clustering: Data is divided into distinct clusters based on similarities, which are then repeatedly merged and organized based on their hierarchical relationships. There are two main types of hierarchical clustering: agglomerative and divisive clustering. This method is also referred to as HAC (hierarchical cluster analysis).
  - Probabilistic clustering: Data is grouped into clusters based on the probability of each data point belonging to each cluster. This approach differs from the other methods, which group data points based on their similarities to others in a cluster.
- Association (Apriori, FP-Growth):
  - Association rule mining is a rule-based approach to reveal interesting relationships between data points in large datasets. Unsupervised learning algorithms search for frequent if-then associations (also called rules) to discover correlations and co-occurrences within the data and the different connections between data objects.
  - Apriori algorithms are the most widely used for association rule learning to identify related collections of items or sets of items. However, other types are used, such as the Eclat and FP-growth algorithms.
- Dimensionality reduction:
  - Dimensionality reduction is an unsupervised learning technique that reduces the number of features, or dimensions, in a dataset. More data is generally better for machine learning, but it can also make it more challenging to visualize the data.
  - Dimensionality reduction extracts important features from the dataset, reducing the number of irrelevant or random features present. This method uses algorithms such as principal component analysis (PCA) and singular value decomposition (SVD) to reduce the number of data inputs without compromising the integrity of the properties in the original data.
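A sketch combining two of the techniques above, dimensionality reduction with PCA followed by K-means ("hard") clustering, on scikit-learn's iris dataset; the number of components and clusters are illustrative choices:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Unlabeled view of the iris data (the labels are discarded on purpose).
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Dimensionality reduction: project the 4 original features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Exclusive ("hard") clustering: each point is assigned to exactly one of K clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_2d)
print("Cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```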
Real-world unsupervised learning examples:
- Anomaly detection: Unsupervised clustering can process large datasets and discover data points that are atypical in a dataset.
- Recommendation engines: Using association rules, unsupervised machine learning can help explore transactional data to discover patterns or trends that can be used to drive personalized recommendations for online retailers.
- Customer segmentation: Unsupervised learning is also commonly used to generate buyer-persona profiles by clustering customers' common traits or purchasing behaviors. These profiles can then be used to guide marketing and other business strategies.
- Fraud detection: Unsupervised learning is useful for anomaly detection, revealing unusual data points in datasets. These insights can help uncover events or behaviors that deviate from normal patterns in the data, revealing fraudulent transactions or unusual behavior like bot activity.
- Natural language processing (NLP): Unsupervised learning is commonly used for various NLP applications, such as categorizing articles in news sections, text translation and classification, or speech recognition in conversational interfaces.
- Genetic research: Genetic clustering is another common unsupervised learning example. Hierarchical clustering algorithms are often used to analyze DNA patterns and reveal evolutionary relationships.
Semi-Supervised Learning:
Semi-supervised learning (SSL) is a machine learning technique that uses a small portion of labeled data and lots of unlabeled data to train a predictive model.
It has a limited range of applications (mostly for clustering purposes) and often provides less accurate results. We won't be exploring SSL solutions in this repository.
Reinforcement Learning:
Reinforcement learning (RL) is a machine learning (ML) technique that trains software to make decisions to achieve optimal results. It mimics the trial-and-error learning process that humans use to achieve their goals. Software actions that work towards your goal are reinforced, while actions that detract from the goal are ignored. RL algorithms use a reward-and-punishment paradigm as they process data. They learn from the feedback of each action and self-discover the best processing paths to achieve final outcomes. The algorithms are also capable of delayed gratification: the best overall strategy may require short-term sacrifices, so the best approach they discover may include some punishments or backtracking along the way. RL is a powerful method to help artificial intelligence (AI) systems achieve optimal outcomes in unseen environments.
Types of reinforcement learning algorithms:
- Model-based RL (well-known algorithms):
  - Dynamic Programming (DP)
  - Monte Carlo Tree Search (MCTS)
- Model-free RL (well-known algorithms):
  - Temporal Difference (TD) learning (e.g., Q-learning, SARSA)
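As a minimal model-free example, the sketch below implements tabular Q-learning (a TD method) on a toy corridor environment defined inline; the environment and hyperparameters are made up purely for illustration:

```python
import numpy as np

n_states, n_actions = 5, 2          # states 0..4; actions: 0 = left, 1 = right
goal = n_states - 1
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != goal:
        # Epsilon-greedy action selection (trial and error).
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(q_table[state]))
        next_state = max(0, state - 1) if action == 0 else min(goal, state + 1)
        reward = 1.0 if next_state == goal else 0.0   # reward only at the goal state
        # Temporal-difference update of the action-value estimate.
        q_table[state, action] += alpha * (
            reward + gamma * q_table[next_state].max() - q_table[state, action]
        )
        state = next_state

print("Greedy policy (0 = left, 1 = right):", q_table.argmax(axis=1))
```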
Neural networks (the backbone of deep learning algorithms): A neural network with more than three layers, including the input and output layers, can be considered a deep learning algorithm.
Deep learning builds on machine learning to progressively get a computer to learn on its own and perform human-like tasks, such as image identification, speech recognition, or prediction, by processing large amounts of data through many layers of algorithms.
Applications of Deep Learning:
- Artificial vision: Systems acquire the ability to recognise characters, images, objects, and even faces; the impact on Industry 4.0, for example in quality control, will be significant.
- Predictive analysis: Predictive analysis can generate more accurate forecasts of business results, market developments or energy needs.
- Virtual assistants: Alexa, Cortana or Siri are assistants that understand and execute the user's voice commands in natural language and are able to learn over time.
- Chatbots
- Robotics
- Health
- Entertainment
- & Others
Most commonly used Deep Learning Algorithms:
- Convolutional Neural Networks (CNNs): CNNs are deep learning models that process structured grid data such as images. They have succeeded in image classification, object detection, and face recognition tasks.
- Recurrent Neural Networks (RNNs): RNNs are designed to recognize patterns in data sequences, such as time series or natural language. They maintain a hidden state that captures information about previous inputs.
- Long Short-Term Memory Networks (LSTMs): LSTMs are a special kind of RNN capable of learning long-term dependencies. They are designed to avoid the long-term dependency problem, making them more effective for tasks like speech recognition and time series prediction.
- Generative Adversarial Networks (GANs): GANs generate realistic data by training two neural networks in a competitive setting. They have been used to create realistic images, videos, and audio.
- Transformer Networks: Transformers are the backbone of many modern NLP models. They process input data using self-attention, allowing for parallelization and improved handling of long-range dependencies.
- Autoencoders: Autoencoders are unsupervised learning models for tasks like data compression, denoising, and feature learning. They learn to encode data into a lower-dimensional representation and then decode it back to the original data.
- Deep Belief Networks (DBNs): DBNs are generative models composed of multiple layers of stochastic, latent variables. They are used for feature extraction and dimensionality reduction.
- Deep Q-Networks (DQNs): DQNs combine deep learning with Q-learning, a reinforcement learning algorithm, to handle environments with high-dimensional state spaces. They have been successfully applied to tasks such as playing video games and controlling robots.
- Variational Autoencoders (VAEs): VAEs are generative models that use variational inference to generate new data points similar to the training data. They are used for generative tasks and anomaly detection.
- Graph Neural Networks (GNNs): GNNs operate directly on graph-structured data, learning representations of nodes and edges. They are used for tasks such as social network analysis, recommendation systems, and molecular property prediction.
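As a small illustration of the first architecture in this list, the sketch below defines a CNN for 28x28 grayscale images with Keras, assuming TensorFlow is installed; the layer sizes and the commented-out training call (with hypothetical `x_train`/`y_train` arrays) are illustrative:

```python
import tensorflow as tf

# A small CNN for 10-class classification of 28x28 grayscale images (MNIST-like data).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),   # learn local spatial filters
    tf.keras.layers.MaxPooling2D((2, 2)),                    # downsample the feature maps
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),         # class probabilities
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=5, validation_split=0.1)  # x_train/y_train are hypothetical
```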
Time Series Forecasting:
Time series forecasting refers to the practice of examining data that changes over time and then using a statistical model to predict future patterns and trends.
Components of time series forecasting models
- Trend: Increase or decrease in the series of data over a longer period.
- Seasonality: Fluctuations in the pattern due to seasonal determinants over a period such as a day, week, month, or season.
- Cyclical variations: Occur when data exhibit rises and falls at irregular intervals.
- Random or irregular variations: Instability due to random factors that do not repeat in the pattern.
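These components can be made visible with a classical decomposition; the sketch below uses statsmodels on a synthetic monthly series so the example is self-contained:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: upward trend + yearly seasonality + random noise.
rng = np.random.default_rng(0)
index = pd.date_range("2018-01-01", periods=72, freq="MS")
values = (
    np.linspace(100, 160, 72)                     # trend
    + 10 * np.sin(2 * np.pi * index.month / 12)   # seasonality
    + rng.normal(0, 2, 72)                        # irregular variation
)
series = pd.Series(values, index=index)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))
```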
Top algorithms for time series forecasting:
- Autoregressive (AR): An autoregressive (AR) model predicts future behaviour based on past behaviour. It’s used for forecasting when there is some correlation between values in a time series and the values that precede and succeed them.
- Autoregressive Integrated Moving Average (ARIMA): ARIMA models are among the most widely used approaches for time series forecasting. ARIMA is a class of models that 'explains' a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that the resulting equation can be used to forecast future values.
- Seasonal Autoregressive Integrated Moving Average (SARIMA): SARIMA models extend basic ARIMA models and allow for the incorporation of seasonal patterns.
- Exponential Smoothing (ES): Exponential smoothing is a time series forecasting method for univariate data that can be extended to support data with a systematic trend or seasonal component.
- Prophet: Prophet is an open-source library released by Facebook's Core Data Science team, designed for automatic forecasting of univariate time series data.
- LSTM: Long Short-Term Memory (LSTM) is a type of recurrent neural network that can learn the order dependence between items in a sequence. It is often used to solve time series forecasting problems.
- DeepAR: DeepAR developed by Amazon is a probabilistic forecasting model based on autoregressive recurrent neural networks.
- N-BEATS: N-BEATS is a custom Deep Learning algorithm which is based on backward and forward residual links for univariate time series point forecasting.
- Temporal Fusion Transformer (Google): A novel attention-based architecture which combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics.
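A sketch of fitting a SARIMA model and forecasting the next 12 months with statsmodels, again on a synthetic monthly series; the (p, d, q)(P, D, Q, s) orders are illustrative and would normally be chosen from diagnostics or a search:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly series with trend and yearly seasonality.
rng = np.random.default_rng(1)
index = pd.date_range("2018-01-01", periods=72, freq="MS")
series = pd.Series(
    np.linspace(100, 160, 72) + 10 * np.sin(2 * np.pi * index.month / 12) + rng.normal(0, 2, 72),
    index=index,
)

# SARIMA(1, 1, 1)(1, 1, 1, 12): non-seasonal and seasonal AR/differencing/MA terms.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fitted = model.fit(disp=False)

forecast = fitted.forecast(steps=12)   # point forecasts for the next 12 months
print(forecast)
```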
5. Evaluation
This step depends heavily on the previous steps. Here, we need to find the correct methodology to assess whether our model delivers the results we expect. This step can be divided into three parts:
- Evaluate results: Assess the degree to which the model meets the business objectives and, if time and budget permit, test the models on test applications.
- Review process: Conduct a more thorough review of the data mining engagement to determine whether any important factor or task has been overlooked, identify any quality assurance issues, and summarize the process review, highlighting activities that were missed and/or should be repeated.
- Determine next steps: Assess how to proceed with the project. In this part, it is important to list potential further actions, along with the reasons for and against each option, and to describe how to proceed.
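On the technical side of evaluating results, a minimal sketch of common evaluation outputs with scikit-learn, using a built-in dataset as a stand-in for the project's own test set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compare predictions on unseen data against the business-defined success criteria.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```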
New topic: Time Series Forecasting with Generative AI
6. Deployment
I'll leave this part for another time, but here are some ideas on how you can explore the deployment of data mining projects: