Automatic Terms of Service and Privacy Policy parser and summarizer, powered by custom NLP techniques.
Created by Andrew Mascillaro, Spencer Ng, William Qin, and Eric Zheng. Winner of PennApps XXI's Best Use of Google Cloud award.
AutoTOS is developed using `finetune`, a powerful NLP library with specific dependencies. Either install `finetune` as instructed on its GitHub page or use the following instructions:
- (Recommended) Create a virtual environment using venv or conda:

  ```sh
  conda create -n autotos python=3.8
  conda activate autotos
  ```

- Install the requirements, then download the spaCy English model (spaCy must be installed before the download step):

  ```sh
  pip install -r requirements.txt
  python -m spacy download en
  ```

- Run the script `nlp/train.py` (described in the repository structure below) to load model-specific data.
This should install all files and dependencies for development or for running the model locally.
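As a quick sanity check (this snippet is illustrative, not part of the repository), you can verify that the core dependencies and the spaCy English model load correctly:

```python
# Illustrative sanity check: confirm finetune imports and the spaCy "en"
# model (installed above) can split text into sentences.
import finetune  # noqa: F401 -- import check only
import spacy

nlp = spacy.load("en")  # shortcut created by `python -m spacy download en`
doc = nlp("We collect your data. You agree to arbitration.")
print([sent.text for sent in doc.sents])
```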
This repository contains the following folders:
- `artifacts`: automatically downloaded files and those generated from data on the TOS;DR website. See below for a complete description.
- `config`: manually-created configuration files to facilitate TOS downloading, NLP model training, and deploying to the cloud.
- `data`: scripts to download and clean data from TOS;DR into formats used for ML training.
- `nlp`: scripts to test and train the custom RoBERTa-based NLP model.
- `prediction`: scripts to serve the API backend and run evaluation on the trained model.
- `docs`: code for the AutoTOS website frontend.
- `gcp-docker`: archival scripts used to integrate AutoTOS with Google Cloud Platform via Docker.
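At a glance, the layout looks like this (root folder name illustrative):

```
autotos/
├── artifacts/
├── config/
├── data/
├── docs/
├── gcp-docker/
├── nlp/
└── prediction/
```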
The `config` folder contains:

- `cloudbuild.json`: configuration for Google Cloud Build for training the NLP model on the cloud. Unused.
- `cookies.config`: cookies used for the TOS;DR website login to parse TOS excerpts.
- `mapped_classes.json`: manually-created mapping between the original `classes.csv` descriptions of privacy topics and the combined classes used by AutoTOS's model. Created generally from original classes with high appearance frequency and/or similar descriptions (a usage sketch follows this list).
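For illustration, a class-combining step might look like the following sketch. The JSON layout (a dictionary from original class ID to combined class name) is an assumption, not documented in this README:

```python
# Hypothetical sketch: collapse original TOS;DR class IDs into AutoTOS's
# combined classes. Assumes mapped_classes.json maps original class IDs
# to combined class names.
import json
from typing import Optional

with open("config/mapped_classes.json") as f:
    MAPPED_CLASSES = json.load(f)

def combined_class(original_id: str) -> Optional[str]:
    """Return the combined AutoTOS class for an original class ID, if mapped."""
    return MAPPED_CLASSES.get(original_id)
```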
The `data` folder contains:

- Download `all.json` from source.
- `download_tos` package: uses `all.json` to fetch both the full text of TOS's and the excerpts that correspond to points. Produces `labeled_excerpts.csv` and `classes.csv`.
  - `__main__.py`
  - `fulltext.py`
  - `point_text.py`
- `cleanup.py`: uses `labeled_excerpts.csv`, adds extraneous (noise) data from the full TOS (broken down by sentence) for training, and generates `annotated_sentences.json` (a rough sketch of this step follows the list).
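The noise-injection idea can be sketched like this (hypothetical code, not the actual `cleanup.py`; the record fields are assumptions):

```python
# Hypothetical sketch of the cleanup step: sentences from the full TOS that
# were never excerpted become unlabeled "padding" examples, giving the model
# realistic negatives alongside the labeled excerpts.
import spacy

nlp = spacy.load("en")

def padding_sentences(full_tos_text: str, labeled_excerpts: set) -> list:
    """Return sentences from the full TOS that match no labeled excerpt."""
    doc = nlp(full_tos_text)
    return [
        {"text": sent.text, "class_id": None}  # None marks padding
        for sent in doc.sents
        if sent.text not in labeled_excerpts
    ]
```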
The `artifacts` folder contains:

- `all.json`: data from TOS;DR's GitHub repo with the category of various "points" (annotated excerpts from TOS's), without the corresponding excerpt text. However, it links to relevant parts of the TOS;DR website that do contain these excerpts.
- `labeled_excerpts.csv`: list of excerpts from TOS texts. Each is labelled with the class ID and company slug.
- `classes.csv`: mapping between class IDs and the descriptive title, score, and frequency of each class.
- `tos/*.txt`: full terms of service texts, as generated by `fulltext.py`.
- `annotated_sentences.json`: sentences from TOS's that either contain padding (sentences that belong to no class) or phrases labeled by class ID (as specified by `classes.csv`). Used directly by the training/testing model (an assumed example record follows this list).
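The exact schema of `annotated_sentences.json` is not documented here; a plausible record might look like this (field names and the class label are assumptions):

```python
# Assumed shape of one annotated_sentences.json record; a padding sentence
# would carry an empty annotation list.
example_record = {
    "text": "We may share your personal data with third parties.",
    "annotations": [
        {"start": 7, "end": 50, "label": "third-party-sharing"}
    ],
}
```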
The `nlp` folder contains:

- `split.py`: takes `annotated_sentences.json` and splits it into an 80/20 train/test set as `train_filter.json` and `test_filter.json`. Filters by the new classes in `mapped_classes.json`. All padding data (i.e. data without a class ID) are put into `train_filter.json`.
- `train.py`: takes `train_filter.json` and builds a RoBERTa-based NLP model for sentence segmentation via `huggingface` or `finetune`. In testing, the `finetune`-based model is slightly more precise, with significantly fewer false positives in practice (a training sketch follows this list).
- `test.py`: takes the generated model and `test_filter.json` and prints out model statistics.
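As a sketch of the `finetune` path (`SequenceLabeler` and the RoBERTa base model are part of finetune's documented API, but the dummy data, label name, and checkpoint path below are assumptions standing in for what `train.py` reads from `train_filter.json`):

```python
# Minimal finetune-based training sketch, not the actual train.py.
from finetune import SequenceLabeler
from finetune.base_models import RoBERTa

texts = ["We may sell your personal data to advertisers."]
annotations = [[{  # one list of labeled spans per text
    "start": 7,
    "end": 45,
    "label": "data-sale",
    "text": "sell your personal data to advertisers",
}]]

model = SequenceLabeler(base_model=RoBERTa)
model.fit(texts, annotations)
model.save("nlp/checkpoints/autotos")  # checkpoint name is illustrative
```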
TensorFlow-based models are output to the `nlp/checkpoints` folder.
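Loading a saved model back for evaluation might look like this (checkpoint path assumed):

```python
# Sketch: restore a trained model and tag a new sentence.
from finetune import SequenceLabeler

model = SequenceLabeler.load("nlp/checkpoints/autotos")
print(model.predict(["You waive your right to participate in a class action."]))
```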
The `prediction` folder contains:

- `predictor.py`
- `api.py`

To be written...
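Until this section is written, here is a hypothetical sketch of what serving the model behind an HTTP endpoint could look like (Flask, the route, and the checkpoint path are all assumptions, not AutoTOS's actual `api.py`):

```python
# Hypothetical API sketch: accept a TOS text and return labeled spans as JSON.
from finetune import SequenceLabeler
from flask import Flask, jsonify, request

app = Flask(__name__)
model = SequenceLabeler.load("nlp/checkpoints/autotos")  # assumed path

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]
    return jsonify(model.predict([text]))

if __name__ == "__main__":
    app.run()
```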