This project implements an unredactor system using machine learning techniques to predict redacted names in text documents based on contextual information. The system utilizes a Random Forest classifier and extracts various features from the text to make predictions.
- Loads data from a TSV file containing redacted names and their contexts.
- Cleans and preprocesses the data by removing empty entries.
- Extracts several features from the text such as:
- Length
- Word count
- Uppercase count
- Punctuation count
- Digit count
- Uses a Random Forest classifier to train on the extracted features.
- Implements train-test split for model validation.
- Evaluates model performance by calculating:
- Precision
- Recall
- F1-score
- These metrics are computed on a validation set and logged.
- Predicts redacted names for new text inputs based on the trained model.
- Creates a submission file with predictions for a test set.
- NLTK for natural language processing tasks.
- Random Forest classifier for model training and prediction.
- tqdm for progress tracking.
- Logging for tracking execution and progress.
- Error Handling to ensure robustness during execution.
To run the code, follow these steps:
First, make sure that pipenv is installed. You can install it using pip:
```
pip3 install pipenv
```

Install the required dependencies using pipenv:

```
pipenv install
```

Activate the virtual environment created by pipenv:

```
pipenv shell
```

Install the necessary Python packages for the project:

```
pipenv install pandas numpy scikit-learn nltk tqdm pytest
```

Finally, run the main script, unredactor.py:

```
python unredactor.py
```

To run the pytests in pytest_unredactor.py:

```
pytest pytest_unredactor.py
```

This will execute the code and perform the tasks as defined in the code.
This function loads and preprocesses the data from a given TSV file. The approach is well-reasoned because:
- Robust loading: It reads the data with the `sep="\t"` argument, ensuring it handles tab-separated values correctly, which is essential for TSV files.
- Handling missing data: It fills NaN entries in the 'context' column with empty strings (`fillna('')`) and converts all text to strings (`astype(str)`), making the data ready for processing.
- Data cleaning: It removes rows where the 'context' column contains only whitespace (keeping rows where `data['context'].str.strip().str.len() > 0`), ensuring that only valid rows are processed.
- Error handling: If an error occurs during loading or preprocessing, it is logged, ensuring that any issues are captured and can be debugged easily.
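Below is a minimal sketch of this loading step, assuming a three-column TSV layout (`id`, `name`, `context`); the column names and function signature are assumptions, not the project's actual code:

```python
import logging

import pandas as pd

def load_data(path: str) -> pd.DataFrame:
    """Load a TSV of redaction examples and drop empty contexts."""
    try:
        # Assumed column layout: id, redacted name, surrounding context.
        data = pd.read_csv(path, sep="\t", names=["id", "name", "context"])
        # Fill missing contexts and force everything to string.
        data["context"] = data["context"].fillna("").astype(str)
        # Keep only rows whose context is not just whitespace.
        return data[data["context"].str.strip().str.len() > 0]
    except Exception as exc:
        logging.error("Failed to load %s: %s", path, exc)
        raise
```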
This function extracts features from a given text. The approach is well-thought-out because:
- Feature selection: It calculates the text's length, word count, uppercase character count, punctuation count, and digit count. These are all useful features for identifying redacted names in the text, as names often have distinguishing characteristics such as length and word patterns.
- Simplicity and efficiency: The function uses simple Python string operations (e.g., `split()`, `isupper()`, `isdigit()`) to extract these features, making it efficient for processing large datasets.
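A minimal sketch of such an extractor, covering the five features listed above (the name `get_features()` follows the description; the actual implementation may differ):

```python
import string

def get_features(text: str) -> dict:
    """Compute simple surface features from a context string."""
    return {
        "length": len(text),
        "word_count": len(text.split()),
        "uppercase_count": sum(c.isupper() for c in text),
        "punctuation_count": sum(c in string.punctuation for c in text),
        "digit_count": sum(c.isdigit() for c in text),
    }
```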
This function processes the entire dataset to extract features for each row. It is a reasoned approach because:
- Iterative feature extraction: It iterates over the rows of the data, extracting context features with the `get_features()` function for each row. Using `tqdm` for progress tracking provides visibility into the process, especially with larger datasets.
- Handling errors: If an error occurs during feature extraction for a row, a warning is logged and the process continues, ensuring robustness without crashing the entire pipeline.
- Flexible structure: The features are stored as dictionaries, which makes it easy to convert them to a format suitable for machine learning models (like a feature matrix).
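A sketch of this row-wise loop, reusing the `get_features()` sketch above (the `context` column name is an assumption):

```python
import logging

from tqdm import tqdm

def extract_all_features(data) -> list[dict]:
    """Build one feature dict per row, skipping rows that fail."""
    features = []
    for _, row in tqdm(data.iterrows(), total=len(data)):
        try:
            features.append(get_features(row["context"]))
        except Exception as exc:
            # Log and continue rather than abort the whole pipeline.
            logging.warning("Skipping row: %s", exc)
    return features
```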
This function trains a Random Forest model using the extracted features. The approach is well-reasoned because:
- Efficient feature vectorization: A `DictVectorizer` is used to convert the feature dictionaries into a sparse matrix (X). This is an efficient way to turn structured data (dictionaries) into input for machine learning algorithms.
- Train-test split: The data is split into training and validation sets using `train_test_split()`, ensuring that the model is trained and evaluated on separate data, which helps assess generalization performance.
- Model choice: A Random Forest classifier is chosen, a robust ensemble method that can handle non-linear relationships in the data and is less prone to overfitting than simpler models.
- Performance evaluation: The function uses `precision_recall_fscore_support()` to evaluate the model on the validation set, reporting its precision, recall, and F1 score.
- Time tracking: Training time is measured with `time.time()`, allowing the user to assess the efficiency of the model training process.
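A sketch of this training step under the same assumptions as above; the 80/20 split, forest size, and weighted metric averaging are illustrative choices, not confirmed details of the project:

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

def train_model(features: list[dict], labels: list[str]):
    """Vectorize feature dicts, train a Random Forest, report metrics."""
    start = time.time()
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(features)  # sparse feature matrix

    X_train, X_val, y_train, y_val = train_test_split(
        X, labels, test_size=0.2, random_state=42
    )

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_val, model.predict(X_val), average="weighted", zero_division=0
    )
    print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
    print(f"Training took {time.time() - start:.1f}s")
    return model, vectorizer
```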
This function predicts the unredacted name for a given text using the trained model. The approach is well-thought-out because:
- Feature extraction: The function uses the same feature extraction logic (`get_features()`) as in training, ensuring that the same set of features is used for both training and prediction.
- Prediction: It uses the `model.predict()` method to make the prediction from the extracted features, applying the model's learned patterns to new data.
- Consistency: Because the same vectorizer used in training is applied here, the features are transformed consistently, so the prediction process matches the training process.
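A corresponding sketch, assuming the `get_features()` and `train_model()` sketches above:

```python
def predict_unredacted(model, vectorizer, text: str) -> str:
    """Predict the redacted name for a single context string."""
    # Transform (not fit) with the training-time vectorizer so the
    # feature space matches exactly what the model was fit on.
    X = vectorizer.transform([get_features(text)])
    return model.predict(X)[0]
```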
This function generates a submission file by predicting unredacted names for a test set. The approach is reasoned because:
- Test data processing: The function loads the test data (`test_file`) and iterates over each row, making predictions with the `predict_unredacted()` function. The `tqdm` library provides a progress bar for this process, giving feedback to the user when handling large datasets.
- Submission formatting: It creates a new DataFrame with the test data's IDs and corresponding predicted names, then saves it as a TSV file (`submission.tsv`), making it ready for submission or further analysis.
- Error handling: Any errors that occur while creating the submission are logged, which ensures the user can track issues without the function failing silently.
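A sketch of this step; the test-file column layout (`id`, `context`) is an assumption, and the helper names come from the sketches above:

```python
import logging

import pandas as pd
from tqdm import tqdm

def create_submission(model, vectorizer, test_file: str = "tests/test.tsv"):
    """Predict a name for every test row and write submission.tsv."""
    try:
        test = pd.read_csv(test_file, sep="\t", names=["id", "context"])
        predictions = [
            predict_unredacted(model, vectorizer, row["context"])
            for _, row in tqdm(test.iterrows(), total=len(test))
        ]
        pd.DataFrame({"id": test["id"], "name": predictions}).to_csv(
            "submission.tsv", sep="\t", index=False
        )
    except Exception as exc:
        logging.error("Failed to create submission: %s", exc)
```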
This function ties everything together and runs the entire process. The approach is well-thought-out because:
- Sequential workflow: The `main()` function calls the other functions in sequence: it first loads and preprocesses the training data, then trains the model, and finally generates the submission file for the test dataset.
- Time tracking: Overall execution time is tracked, which helps monitor the performance of the entire pipeline, especially in production environments where time efficiency may be critical.
- Error handling: Any exceptions that arise during the execution of the main process are caught and logged, ensuring that failures are captured and can be debugged.
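A sketch tying the pieces above together; the file paths mirror the setup notes later in this document, and the helper names come from the earlier sketches:

```python
import logging
import time

def main():
    """Run the full pipeline: load, featurize, train, predict, submit."""
    start = time.time()
    try:
        data = load_data("data/unredactor.tsv")
        features = extract_all_features(data)
        model, vectorizer = train_model(features, data["name"].tolist())
        create_submission(model, vectorizer, "tests/test.tsv")
    except Exception as exc:
        logging.error("Pipeline failed: %s", exc)
    finally:
        logging.info("Total time: %.1fs", time.time() - start)

if __name__ == "__main__":
    main()
```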
- Logging: The code uses logging extensively, which is a thoughtful approach for tracking the execution flow, debugging, and performance monitoring. Logs provide insight into different stages of the process (e.g., data loading, feature extraction, model training), and can help identify bottlenecks or errors.
- Error handling: Throughout the code, error handling is implemented using `try`/`except` blocks. This ensures that the system remains robust even when unexpected situations occur.
- Efficiency: The use of `tqdm` for progress bars, along with time tracking, enhances the user experience and keeps the process efficient and transparent.
- Scalability: By limiting the size of the dataset processed at each stage (e.g., using `train_test_split()` for model validation), the approach manages computational resources well, making it scalable to larger datasets.
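As a minimal illustration of the logging setup these points rely on (the format string is an assumption, not the project's actual configuration):

```python
import logging

# One-time configuration: every subsequent logging call then records
# a timestamp and severity, which helps trace pipeline stages.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logging.info("Pipeline started")
```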
- Cross-validation: Currently, the model uses a single train-test split. Implementing cross-validation could improve the model's evaluation and make the performance metrics more reliable.
- Hyperparameter tuning: Random Forests have several hyperparameters (e.g., number of trees, max depth) that could be fine-tuned to improve model performance.
- Model comparison: Trying other machine learning models or ensemble methods might yield better results. For example, Gradient Boosting, SVMs, LSTMs, word embeddings, or BERT-based models could provide improvements depending on the data characteristics.
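A sketch combining the cross-validation and hyperparameter-tuning suggestions above, using scikit-learn's `GridSearchCV` (the grid values are illustrative, not tuned for this task):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; useful ranges depend on the dataset.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 20]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                    # 5-fold cross-validation instead of one split
    scoring="f1_weighted",
)
# search.fit(X, labels)     # X from DictVectorizer, as in train_model()
# print(search.best_params_, search.best_score_)
```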
The approach loads and processes text data, extracts features like text length, word count, and punctuation count, and trains a Random Forest classifier using these features. The model's performance is measured with precision, recall, and F1 scores, and it predicts unredacted names based on the trained model. The process also includes creating a submission file for test data and uses error handling and logging to ensure smooth operation. The approach is designed to handle larger datasets and track execution time for better performance. The model's performance on the validation set is as follows:
Validation Precision: 0.0779
Validation Recall: 0.0719
Validation F1-score: 0.0715
- The code assumes that the input data is in a specific format, with redacted names represented by block characters (█).
- The code assumes that the length of the redacted block corresponds exactly to the length of the original name.
- The implementation assumes that only person names are redacted and need to be unredacted (my choice).
- The code uses NLTK for named entity recognition, assuming it can accurately identify person names in the text.
- It assumes that the context around a redacted name (words before and after) provides useful information for prediction.
- The implementation uses TF-IDF scores of surrounding words as features, assuming these are indicative of the redacted name.
- It assumes that the movie review ratings (extracted from filenames) are relevant features for name prediction.
- The code uses a Random Forest classifier, assuming it's suitable for this type of prediction task.
- The code assumes that the top 5 predictions for each redacted name are sufficient for evaluation purposes.
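To illustrate the block-character assumptions above, here is a hypothetical helper (not from the source) that recovers the redacted name's length by counting block characters:

```python
def redaction_length(context: str, block: str = "\u2588") -> int:
    """Length of the redacted span, assuming one block char per character."""
    return context.count(block)

print(redaction_length("The movie starring ████████ was a hit."))  # -> 8
```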
Here's an example of how to use the unredactor. Given an input file like:
| id | context |
|---|---|
| 1 | The movie starring ████████ was a blockbuster hit. |
| 2 | ██████████ directed the award-winning film. |
running:

```
python unredactor.py
```

produces output like:

| id | name |
|---|---|
| 1 | Tom Cruise |
| 2 | Steven Spielberg |
- Make sure that the training data is in `data/unredactor.tsv`.
- Place the test data in `tests/test.tsv`.
- The script will automatically load and preprocess the training data.
- It will then train a Random Forest classifier on the extracted features.
- The trained model will be used to predict redacted names in the test data.
- Predictions will be written to `submission.tsv` in the project root.
- The script will output validation metrics (precision, recall, F1-score) to the console.
- To modify feature extraction, edit the `get_features()` function.
- To change the model, update the `train_model()` function.