This project implements an unredactor system using machine learning techniques to predict redacted names in text documents based on contextual information. The system utilizes a Random Forest classifier and extracts various features from the text to make predictions.
- Loads data from a TSV file containing redacted names and their contexts.
- Cleans and preprocesses the data by removing empty entries.
- Extracts several features from the text such as:
- Length
- Word count
- Uppercase count
- Punctuation count
- Digit count
- Uses a Random Forest classifier to train on the extracted features.
- Implements train-test split for model validation.
- Evaluates model performance by calculating:
- Precision
- Recall
- F1-score
- These metrics are computed on a validation set and logged.
- Predicts redacted names for new text inputs based on the trained model.
- Creates a submission file with predictions for a test set.
- NLTK for natural language processing tasks.
- Random Forest classifier for model training and prediction.
- tqdm for progress tracking.
- Logging for tracking execution and progress.
- Error Handling to ensure robustness during execution.
To run the code, follow these steps:
First, make sure that pipenv is installed. You can install it using pip:
```
pip3 install pipenv
```

Install the required dependencies using pipenv:

```
pipenv install
```

Activate the virtual environment created by pipenv:

```
pipenv shell
```

Install the necessary Python packages for the project:

```
pipenv install pandas numpy scikit-learn nltk tqdm pytest
```

Finally, run the main script, unredactor.py:

```
python unredactor.py
```

To run the pytests in pytest_unredactor.py:

```
pytest pytest_unredactor.py
```

This will execute the code and perform the tasks as defined in the code.
This function loads and preprocesses the data from a given TSV file. The approach is well-reasoned because:
- Robust loading: It reads the data with the `sep="\t"` argument, ensuring it handles tab-separated values correctly, which is essential for TSV files.
- Handling missing data: It fills NaN entries in the 'context' column with empty strings (`fillna('')`) and converts all text to strings (`astype(str)`), making the data ready for processing.
- Data cleaning: It removes rows where the 'context' column contains only whitespace (keeping rows where `data['context'].str.strip().str.len() > 0`), ensuring that only valid rows are processed.
- Error handling: If an error occurs during loading or preprocessing, it is logged, ensuring that any issues are captured and can be debugged easily.
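Below is a minimal sketch of this loading step, assuming a three-column TSV layout (`id`, `name`, `context`); the column names and function signature are assumptions, not the project's actual code:

```python
import logging

import pandas as pd

def load_data(path: str) -> pd.DataFrame:
    """Load a TSV of redaction examples and drop empty contexts."""
    try:
        # Assumed column layout: id, redacted name, surrounding context.
        data = pd.read_csv(path, sep="\t", names=["id", "name", "context"])
        # Fill missing contexts and force everything to string.
        data["context"] = data["context"].fillna("").astype(str)
        # Keep only rows whose context is not just whitespace.
        return data[data["context"].str.strip().str.len() > 0]
    except Exception as exc:
        logging.error("Failed to load %s: %s", path, exc)
        raise
```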
This function extracts features from a given text. The approach is well-thought-out because:
- Feature selection: It calculates the text's length, word count, uppercase character count, punctuation count, and digit count. These are all useful features for identifying redacted names in the text, as names often have distinguishing characteristics such as length and word patterns.
- Simplicity and efficiency: The function uses simple Python string operations (e.g., `split()`, `isupper()`, `isdigit()`) to extract these features, making it efficient for processing large datasets.
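A minimal sketch of such an extractor, covering the five features listed above (the name `get_features()` follows the description; the actual implementation may differ):

```python
import string

def get_features(text: str) -> dict:
    """Compute simple surface features from a context string."""
    return {
        "length": len(text),
        "word_count": len(text.split()),
        "uppercase_count": sum(c.isupper() for c in text),
        "punctuation_count": sum(c in string.punctuation for c in text),
        "digit_count": sum(c.isdigit() for c in text),
    }
```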
This function processes the entire dataset to extract features for each row. It is a reasoned approach because:
- Iterative feature extraction: It iterates over the rows of the data, extracting context features with the `get_features()` function for each row. Using `tqdm` for progress tracking provides visibility into the process, especially with larger datasets.
- Handling errors: If an error occurs during feature extraction for a row, a warning is logged and the process continues, ensuring robustness without crashing the entire pipeline.
- Flexible structure: The features are stored as dictionaries, which makes it easy to convert them to a format suitable for machine learning models (like a feature matrix).
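A sketch of this row-wise loop, reusing the `get_features()` sketch above (the `context` column name is an assumption):

```python
import logging

from tqdm import tqdm

def extract_all_features(data) -> list[dict]:
    """Build one feature dict per row, skipping rows that fail."""
    features = []
    for _, row in tqdm(data.iterrows(), total=len(data)):
        try:
            features.append(get_features(row["context"]))
        except Exception as exc:
            # Log and continue rather than abort the whole pipeline.
            logging.warning("Skipping row: %s", exc)
    return features
```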
This function trains a Random Forest model using the extracted features. The approach is well-reasoned because:
- Efficient feature vectorization: A `DictVectorizer` is used to convert the feature dictionaries into a sparse matrix (X). This is an efficient way to turn structured data (dictionaries) into input for machine learning algorithms.
- Train-test split: The data is split into training and validation sets using `train_test_split()`, ensuring that the model is trained and evaluated on separate data, which helps assess generalization performance.
- Model choice: A Random Forest classifier is chosen, a robust ensemble method that can handle non-linear relationships in the data and is less prone to overfitting than simpler models.
- Performance evaluation: The function uses `precision_recall_fscore_support()` to evaluate the model on the validation set, reporting its precision, recall, and F1 score.
- Time tracking: Training time is measured with `time.time()`, allowing the user to assess the efficiency of the model training process.
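A sketch of this training step under the same assumptions as above; the 80/20 split, forest size, and weighted metric averaging are illustrative choices, not confirmed details of the project:

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

def train_model(features: list[dict], labels: list[str]):
    """Vectorize feature dicts, train a Random Forest, report metrics."""
    start = time.time()
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(features)  # sparse feature matrix

    X_train, X_val, y_train, y_val = train_test_split(
        X, labels, test_size=0.2, random_state=42
    )

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_val, model.predict(X_val), average="weighted", zero_division=0
    )
    print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
    print(f"Training took {time.time() - start:.1f}s")
    return model, vectorizer
```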
This function predicts the unredacted name for a given text using the trained model. The approach is well-thought-out because:
- Feature extraction: The function uses the same feature extraction logic (`get_features()`) as in training, ensuring that the same set of features is used for both training and prediction.
- Prediction: It uses the `model.predict()` method to make the prediction from the extracted features, applying the model's learned patterns to new data.
- Consistency: Because the same vectorizer used in training is applied here, the features are transformed consistently, so the prediction process matches the training process.
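A corresponding sketch, assuming the `get_features()` and `train_model()` sketches above:

```python
def predict_unredacted(model, vectorizer, text: str) -> str:
    """Predict the redacted name for a single context string."""
    # Transform (not fit) with the training-time vectorizer so the
    # feature space matches exactly what the model was fit on.
    X = vectorizer.transform([get_features(text)])
    return model.predict(X)[0]
```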
This function generates a submission file by predicting unredacted names for a test set. The approach is reasoned because:
- Test data processing: The function loads the test data (`test_file`) and iterates over each row, making predictions with the `predict_unredacted()` function. The `tqdm` library provides a progress bar for this process, giving feedback to the user when handling large datasets.
- Submission formatting: It creates a new DataFrame with the test data's IDs and corresponding predicted names, then saves it as a TSV file (`submission.tsv`), making it ready for submission or further analysis.
- Error handling: Any errors that occur while creating the submission are logged, which ensures the user can track issues without the function failing silently.
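A sketch of this step; the test-file column layout (`id`, `context`) is an assumption, and the helper names come from the sketches above:

```python
import logging

import pandas as pd
from tqdm import tqdm

def create_submission(model, vectorizer, test_file: str = "tests/test.tsv"):
    """Predict a name for every test row and write submission.tsv."""
    try:
        test = pd.read_csv(test_file, sep="\t", names=["id", "context"])
        predictions = [
            predict_unredacted(model, vectorizer, row["context"])
            for _, row in tqdm(test.iterrows(), total=len(test))
        ]
        pd.DataFrame({"id": test["id"], "name": predictions}).to_csv(
            "submission.tsv", sep="\t", index=False
        )
    except Exception as exc:
        logging.error("Failed to create submission: %s", exc)
```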
This function ties everything together and runs the entire process. The approach is well-thought-out because:
- Sequential workflow: The `main()` function calls the other functions in sequence: it first loads and preprocesses the training data, then trains the model, and finally generates the submission file for the test dataset.
- Time tracking: Overall execution time is tracked, which helps monitor the performance of the entire pipeline, especially in production environments where time efficiency may be critical.
- Error handling: Any exceptions that arise during the execution of the main process are caught and logged, ensuring that failures are captured and can be debugged.
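A sketch tying the pieces above together; the file paths mirror the setup notes later in this document, and the helper names come from the earlier sketches:

```python
import logging
import time

def main():
    """Run the full pipeline: load, featurize, train, predict, submit."""
    start = time.time()
    try:
        data = load_data("data/unredactor.tsv")
        features = extract_all_features(data)
        model, vectorizer = train_model(features, data["name"].tolist())
        create_submission(model, vectorizer, "tests/test.tsv")
    except Exception as exc:
        logging.error("Pipeline failed: %s", exc)
    finally:
        logging.info("Total time: %.1fs", time.time() - start)

if __name__ == "__main__":
    main()
```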
- Logging: The code uses logging extensively, which is a thoughtful approach for tracking the execution flow, debugging, and performance monitoring. Logs provide insight into different stages of the process (e.g., data loading, feature extraction, model training), and can help identify bottlenecks or errors.
- Error handling: Throughout the code, error handling is implemented using `try`/`except` blocks. This ensures that the system remains robust even when unexpected situations occur.
- Efficiency: The use of `tqdm` for progress bars, along with time tracking, enhances the user experience and keeps the process efficient and transparent.
- Scalability: By limiting the size of the dataset processed at each stage (e.g., using `train_test_split()` for model validation), the approach manages computational resources well, making it scalable to larger datasets.
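As a minimal illustration of the logging setup these points rely on (the format string is an assumption, not the project's actual configuration):

```python
import logging

# One-time configuration: every subsequent logging call then records
# a timestamp and severity, which helps trace pipeline stages.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logging.info("Pipeline started")
```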
- Cross-validation: Currently, the model uses a single train-test split. Implementing cross-validation could improve the model's evaluation and make the performance metrics more reliable.
- Hyperparameter tuning: Random Forests have several hyperparameters (e.g., number of trees, max depth) that could be fine-tuned to improve model performance.
- Model comparison: Trying other machine learning models or ensemble methods might yield better results. For example, Gradient Boosting, SVMs, LSTMs, word embeddings, or BERT-based models could provide improvements depending on the data characteristics.
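A sketch combining the cross-validation and hyperparameter-tuning suggestions above, using scikit-learn's `GridSearchCV` (the grid values are illustrative, not tuned for this task):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; useful ranges depend on the dataset.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 20]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                    # 5-fold cross-validation instead of one split
    scoring="f1_weighted",
)
# search.fit(X, labels)     # X from DictVectorizer, as in train_model()
# print(search.best_params_, search.best_score_)
```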
The approach loads and processes text data, extracts features like text length, word count, and punctuation count, and trains a Random Forest classifier using these features. The model's performance is measured with precision, recall, and F1 scores, and it predicts unredacted names based on the trained model. The process also includes creating a submission file for test data and uses error handling and logging to ensure smooth operation. The approach is designed to handle larger datasets and track execution time for better performance. The model's performance on the validation set is as follows:
Validation Precision: 0.0779
Validation Recall: 0.0719
Validation F1-score: 0.0715
- The code assumes that the input data is in a specific format, with redacted names represented by block characters (█).
- The code assumes that the length of the redacted block corresponds exactly to the length of the original name.
- The implementation assumes that only person names are redacted and need to be unredacted (my choice).
- The code uses NLTK for named entity recognition, assuming it can accurately identify person names in the text.
- It assumes that the context around a redacted name (words before and after) provides useful information for prediction.
- The implementation uses TF-IDF scores of surrounding words as features, assuming these are indicative of the redacted name.
- It assumes that the movie review ratings (extracted from filenames) are relevant features for name prediction.
- The code uses a Random Forest classifier, assuming it's suitable for this type of prediction task.
- The code assumes that the top 5 predictions for each redacted name are sufficient for evaluation purposes.
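To illustrate the block-character assumptions above, here is a hypothetical helper (not from the source) that recovers the redacted name's length by counting block characters:

```python
def redaction_length(context: str, block: str = "\u2588") -> int:
    """Length of the redacted span, assuming one block char per character."""
    return context.count(block)

print(redaction_length("The movie starring ████████ was a hit."))  # -> 8
```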
Here's an example of how to use the unredactor. Given an input file like:
| id | context |
|---|---|
| 1 | The movie starring ████████ was a blockbuster hit. |
| 2 | ██████████ directed the award-winning film. |
running:

```
python unredactor.py
```

produces output like:

| id | name |
|---|---|
| 1 | Tom Cruise |
| 2 | Steven Spielberg |
- Make sure that the training data is in `data/unredactor.tsv`.
- Place the test data in `tests/test.tsv`.
- The script will automatically load and preprocess the training data.
- It will then train a Random Forest classifier on the extracted features.
- The trained model will be used to predict redacted names in the test data.
- Predictions will be written to `submission.tsv` in the project root.
- The script will output validation metrics (precision, recall, F1-score) to the console.
- To modify feature extraction, edit the `get_features()` function.
- To change the model, update the `train_model()` function.