Automatic Terms of Service and Privacy Policy parser and summarizer, powered by custom NLP techniques.
Created by Andrew Mascillaro, Spencer Ng, William Qin, and Eric Zheng. Winner of PennApps XXI's Best Use of Google Cloud award.
AutoTOS is developed using `finetune`, a powerful NLP library with specific dependencies. Either install `finetune` as instructed on its GitHub page or use the following instructions:
- (Recommended) Create a virtual environment using venv or conda:

  ```sh
  conda create -n autotos python=3.8
  conda activate autotos
  ```

- Install the requirements, then download the spaCy English model (spaCy must be installed before the download step):

  ```sh
  pip install -r requirements.txt
  python -m spacy download en
  ```

- Run the script `nlp/train.py` (described in the repository structure below) to load model-specific data.
This should install all files and dependencies for development or for running the model locally.
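As a quick sanity check (this snippet is illustrative, not part of the repository), you can verify that the core dependencies and the spaCy English model load correctly:

```python
# Illustrative sanity check: confirm finetune imports and the spaCy "en"
# model (installed above) can split text into sentences.
import finetune  # noqa: F401 -- import check only
import spacy

nlp = spacy.load("en")  # shortcut created by `python -m spacy download en`
doc = nlp("We collect your data. You agree to arbitration.")
print([sent.text for sent in doc.sents])
```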
This repository contains the following folders:
- `artifacts`: automatically downloaded files and those generated from data on the TOS;DR website. See below for a complete description.
- `config`: manually-created configuration files to facilitate TOS downloading, NLP model training, and deploying to the cloud.
- `data`: scripts to download and clean data from TOS;DR into formats used for ML training.
- `nlp`: scripts to test and train the custom RoBERTa-based NLP model.
- `prediction`: scripts to serve the API backend and run evaluation on the trained model.
- `docs`: code for the AutoTOS website frontend.
- `gcp-docker`: archival scripts used to integrate AutoTOS with Google Cloud Platform via Docker.
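At a glance, the layout looks like this (root folder name illustrative):

```
autotos/
├── artifacts/
├── config/
├── data/
├── docs/
├── gcp-docker/
├── nlp/
└── prediction/
```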
The `config` folder contains:

- `cloudbuild.json`: configuration for Google Cloud Build for training the NLP model on the cloud. Unused.
- `cookies.config`: cookies used for the TOS;DR website login to parse TOS excerpts.
- `mapped_classes.json`: manually-created mapping between the original `classes.csv` descriptions of privacy topics and the combined classes used by AutoTOS's model. Created generally from original classes with high appearance frequency and/or similar descriptions (a usage sketch follows this list).
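For illustration, a class-combining step might look like the following sketch. The JSON layout (a dictionary from original class ID to combined class name) is an assumption, not documented in this README:

```python
# Hypothetical sketch: collapse original TOS;DR class IDs into AutoTOS's
# combined classes. Assumes mapped_classes.json maps original class IDs
# to combined class names.
import json
from typing import Optional

with open("config/mapped_classes.json") as f:
    MAPPED_CLASSES = json.load(f)

def combined_class(original_id: str) -> Optional[str]:
    """Return the combined AutoTOS class for an original class ID, if mapped."""
    return MAPPED_CLASSES.get(original_id)
```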
The `data` folder contains:

- Download `all.json` from source.
- `download_tos` package: uses `all.json` to fetch both the full text of TOS's and the excerpts that correspond to points. Produces `labeled_excerpts.csv` and `classes.csv`.
  - `__main__.py`
  - `fulltext.py`
  - `point_text.py`
- `cleanup.py`: uses `labeled_excerpts.csv`, adds extraneous (noise) data from the full TOS (broken down by sentence) for training, and generates `annotated_sentences.json` (a rough sketch of this step follows the list).
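The noise-injection idea can be sketched like this (hypothetical code, not the actual `cleanup.py`; the record fields are assumptions):

```python
# Hypothetical sketch of the cleanup step: sentences from the full TOS that
# were never excerpted become unlabeled "padding" examples, giving the model
# realistic negatives alongside the labeled excerpts.
import spacy

nlp = spacy.load("en")

def padding_sentences(full_tos_text: str, labeled_excerpts: set) -> list:
    """Return sentences from the full TOS that match no labeled excerpt."""
    doc = nlp(full_tos_text)
    return [
        {"text": sent.text, "class_id": None}  # None marks padding
        for sent in doc.sents
        if sent.text not in labeled_excerpts
    ]
```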
The `artifacts` folder contains:

- `all.json`: data from TOS;DR's GitHub repo with the category of various "points" (annotated excerpts from TOS's), without the corresponding excerpt text. However, it links to relevant parts of the TOS;DR website that do contain these excerpts.
- `labeled_excerpts.csv`: list of excerpts from TOS texts. Each is labelled with the class ID and company slug.
- `classes.csv`: mapping between class IDs and the descriptive title, score, and frequency of each class.
- `tos/*.txt`: full terms of service texts, as generated by `fulltext.py`.
- `annotated_sentences.json`: sentences from TOS's that either contain padding (sentences that belong to no class) or phrases labeled by class ID (as specified by `classes.csv`). Used directly by the training/testing model (an assumed example record follows this list).
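The exact schema of `annotated_sentences.json` is not documented here; a plausible record might look like this (field names and the class label are assumptions):

```python
# Assumed shape of one annotated_sentences.json record; a padding sentence
# would carry an empty annotation list.
example_record = {
    "text": "We may share your personal data with third parties.",
    "annotations": [
        {"start": 7, "end": 50, "label": "third-party-sharing"}
    ],
}
```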
The `nlp` folder contains:

- `split.py`: takes `annotated_sentences.json` and splits it into an 80/20 train/test set as `train_filter.json` and `test_filter.json`. Filters by the new classes in `mapped_classes.json`. All padding data (i.e. data without a class ID) are put into `train_filter.json`.
- `train.py`: takes `train_filter.json` and builds a RoBERTa-based NLP model for sentence segmentation via `huggingface` or `finetune`. In testing, the `finetune`-based model is slightly more precise, with significantly fewer false positives in practice (a training sketch follows this list).
- `test.py`: takes the generated model and `test_filter.json` and prints out model statistics.
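As a sketch of the `finetune` path (`SequenceLabeler` and the RoBERTa base model are part of finetune's documented API, but the dummy data, label name, and checkpoint path below are assumptions standing in for what `train.py` reads from `train_filter.json`):

```python
# Minimal finetune-based training sketch, not the actual train.py.
from finetune import SequenceLabeler
from finetune.base_models import RoBERTa

texts = ["We may sell your personal data to advertisers."]
annotations = [[{  # one list of labeled spans per text
    "start": 7,
    "end": 45,
    "label": "data-sale",
    "text": "sell your personal data to advertisers",
}]]

model = SequenceLabeler(base_model=RoBERTa)
model.fit(texts, annotations)
model.save("nlp/checkpoints/autotos")  # checkpoint name is illustrative
```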
TensorFlow-based models are output to the `nlp/checkpoints` folder.
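Loading a saved model back for evaluation might look like this (checkpoint path assumed):

```python
# Sketch: restore a trained model and tag a new sentence.
from finetune import SequenceLabeler

model = SequenceLabeler.load("nlp/checkpoints/autotos")
print(model.predict(["You waive your right to participate in a class action."]))
```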
The `prediction` folder contains:

- `predictor.py`
- `api.py`

To be written...
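Until this section is written, here is a hypothetical sketch of what serving the model behind an HTTP endpoint could look like (Flask, the route, and the checkpoint path are all assumptions, not AutoTOS's actual `api.py`):

```python
# Hypothetical API sketch: accept a TOS text and return labeled spans as JSON.
from finetune import SequenceLabeler
from flask import Flask, jsonify, request

app = Flask(__name__)
model = SequenceLabeler.load("nlp/checkpoints/autotos")  # assumed path

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    return jsonify(model.predict([text]))

if __name__ == "__main__":
    app.run()
```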