Commit

initial commit for labs repo
charlesfrye committed Jul 13, 2022
0 parents commit d1bdd17
Showing 20 changed files with 1,853 additions and 0 deletions.
57 changes: 57 additions & 0 deletions .gitignore
@@ -0,0 +1,57 @@
# Data
data/downloaded
data/processed
data/interim

# Logs
*/training/logs


# Editors
.vscode
*.sw?
*~

# Node
node_modules

# Python
__pycache__
.pytest_cache
.ipynb_checkpoints

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# logging
wandb
*.pt
*.ckpt
notebooks/lightning_logs
lightning_logs/
logs
flagged

# Misc
.DS_Store
_labs
.mypy_cache
lab9/requirements.txt
.coverage
/requirements.txt
bootstrap.py
21 changes: 21 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 Full Stack Deep Learning, LLC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
33 changes: 33 additions & 0 deletions Makefile
@@ -0,0 +1,33 @@
# Arcane incantation to print all the other targets, from https://stackoverflow.com/a/26339924
help:
	@$(MAKE) -pRrq -f $(lastword $(MAKEFILE_LIST)) : 2>/dev/null | awk -v RS= -F: '/^# File/,/^# Finished Make data base/ {if ($$1 !~ "^[#.]") {print $$1}}' | sort | egrep -v -e '^[^[:alnum:]]' -e '^$@$$'

# Install exact Python and CUDA versions
conda-update:
	conda env update --prune -f environment.yml
	echo "!!!RUN THE conda activate COMMAND ABOVE RIGHT NOW!!!"

# Compile and install exact pip packages
pip-tools:
	pip install pip-tools==6.3.1 setuptools==59.5.0
	pip-compile requirements/prod.in && pip-compile requirements/dev.in
	pip-sync requirements/prod.txt requirements/dev.txt

# Compile and install the requirements for local linting (optional)
pip-tools-lint:
	pip install pip-tools==6.3.1 setuptools==59.5.0
	pip-compile requirements/prod.in && pip-compile requirements/dev.in && pip-compile requirements/dev-lint.in
	pip-sync requirements/prod.txt requirements/dev.txt requirements/dev-lint.txt

# Bump versions of transitive dependencies
pip-tools-upgrade:
	pip install pip-tools==6.3.1 setuptools==59.5.0
	pip-compile --upgrade requirements/prod.in && pip-compile --upgrade requirements/dev.in && pip-compile --upgrade requirements/dev-lint.in

# Example training command
train-mnist-cnn-ddp:
	python training/run_experiment.py --max_epochs=10 --gpus=-1 --accelerator=ddp --num_workers=20 --data_class=MNIST --model_class=CNN

# Lint
lint:
	tasks/lint.sh
3 changes: 3 additions & 0 deletions data/raw/emnist/metadata.toml
@@ -0,0 +1,3 @@
filename = 'matlab.zip'
sha256 = 'e1fa805cdeae699a52da0b77c2db17f6feb77eed125f9b45c022e7990444df95'
url = 'https://s3-us-west-2.amazonaws.com/fsdl-public-assets/matlab.zip'
9 changes: 9 additions & 0 deletions data/raw/emnist/readme.md
@@ -0,0 +1,9 @@
# EMNIST dataset

"The EMNIST dataset is a set of handwritten character digits derived from the NIST Special Database 19
and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset."
From https://www.nist.gov/itl/iad/image-group/emnist-dataset

The original URL is http://www.itl.nist.gov/iaui/vip/cs_links/EMNIST/matlab.zip

We uploaded the same file to our S3 bucket for faster download.
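The metadata.toml above pairs the download URL with a SHA-256 checksum. As a minimal sketch of how a downloaded archive could be verified against such a checksum (the function names here are illustrative, not the repo's actual helpers):

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Return True if the file at `path` matches the expected checksum."""
    return sha256_of_file(path) == expected_sha256
```

A mismatch would indicate a corrupted or tampered download, so a loader would typically delete the file and re-fetch it in that case.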
85 changes: 85 additions & 0 deletions data/raw/fsdl_handwriting/fsdl_handwriting.jsonl

(Large diff not rendered.)

3 changes: 3 additions & 0 deletions data/raw/fsdl_handwriting/metadata.toml
@@ -0,0 +1,3 @@
url = "https://dataturks.com/projects/sergeykarayev/fsdl_handwriting/export"
filename = "fsdl_handwriting.json"
sha256 = "720d6c72b4317a9a5492630a1c9f6d83a20d36101a29311a5cf7825c1d60c180"
8 changes: 8 additions & 0 deletions data/raw/fsdl_handwriting/readme.md
@@ -0,0 +1,8 @@
# FSDL Handwriting Dataset

## Collection

Handwritten paragraphs were collected in the FSDL March 2019 class.
The resulting PDF was stored at https://fsdl-public-assets.s3-us-west-2.amazonaws.com/fsdl_handwriting_20190302.pdf

Pages were extracted from the PDF by running `gs -q -dBATCH -dNOPAUSE -sDEVICE=jpeg -r300 -sOutputFile=page-%03d.jpg -f fsdl_handwriting_20190302.pdf` and uploaded to S3, with urls like https://fsdl-public-assets.s3-us-west-2.amazonaws.com/fsdl_handwriting_20190302/page-001.jpg
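Since gs names pages with a zero-padded `page-%03d.jpg` pattern, the per-page URLs follow a predictable scheme and can be generated programmatically. A small sketch (the page numbers used below are illustrative; the actual page count is not stated here):

```python
BASE = "https://fsdl-public-assets.s3-us-west-2.amazonaws.com/fsdl_handwriting_20190302"


def page_url(page_number):
    """Build the URL for a 1-indexed page, matching gs's page-%03d.jpg naming."""
    return f"{BASE}/page-{page_number:03d}.jpg"


first_three = [page_url(n) for n in range(1, 4)]
```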
4 changes: 4 additions & 0 deletions data/raw/iam/metadata.toml
@@ -0,0 +1,4 @@
url = 'https://s3-us-west-2.amazonaws.com/fsdl-public-assets/iam/iamdb.zip'
filename = 'iamdb.zip'
sha256 = 'f3c9e87a88a313e557c6d3548ed8a2a1af2dc3c4a678c5f3fc6f972ba4a50c55'
test_ids = ["m01-049","m01-060","m01-079","m01-084","m01-090","m01-095","m01-104","m01-121","m01-110","m01-131","m01-115","m01-125","m01-136","m01-149","m01-160","m02-048","m02-052","m02-055","m02-059","m02-066","m02-069","m02-072","m02-075","m02-106","m02-080","m02-083","m02-087","m02-090","m02-095","m02-102","m02-109","m02-112","m03-006","m03-013","m03-020","m03-033","m03-062","m03-095","m03-110","m03-114","m03-118","m04-000","m04-007","m04-012","m04-019","m04-024","m04-030","m04-038","m04-043","m04-061","m04-072","m04-078","m04-081","m04-093","m04-100","m04-107","m04-113","m04-123","m04-131","m04-138","m04-145","m04-152","m04-164","m04-180","m04-251","m04-190","m04-200","m04-209","m04-216","m04-222","m04-231","m04-238","m04-246","n04-000","n04-009","m06-019","n06-148","n06-156","n06-163","n06-169","n06-175","n06-182","n06-186","n06-194","n06-201","m06-031","m06-042","m06-048","m06-056","m06-067","m06-076","m06-083","m06-091","m06-098","m06-106","n01-000","n01-009","n01-004","n01-020","n01-031","n01-045","n01-036","n01-052","n01-057","n02-000","n02-016","n02-004","n02-009","n02-028","n02-049","n02-033","n02-037","n02-040","n02-045","n02-054","n02-062","n02-082","n02-098","p03-057","p03-087","p03-096","p03-103","p03-112","n02-151","n02-154","n02-157","n03-038","n03-064","n03-066","n03-079","n03-082","n03-091","n03-097","n03-103","n03-106","n03-113","n03-120","n03-126","n04-015","n04-022","n04-031","n04-039","n04-044","n04-048","n04-052","n04-060","n04-068","n04-075","n04-084","n04-092","n04-100","n04-107","n04-114","n04-130","n04-139","n04-149","n04-156","n04-163","n04-171","n04-183","n04-190","n04-195","n04-202","n04-209","n04-213","n04-218","n06-074","n06-082","n06-092","n06-100","n06-111","n06-119","n06-123","n06-128","n06-133","n06-140","p01-147","p01-155","p01-168","p01-174","p02-000","p02-008","p02-017","p02-022","p02-027","p02-069","p02-090","p02-076","p02-081","p02-101","p02-105","p02-109","p02-115","p02-121","p02-127","p02-131","p02-135","p02-139","p02-144","p02-150","p02-155","p03-004","p03-009","p03-012","p03-023","p03-027","p03-029","p03-033","p03-040","p03-047","p03-069","p03-072","p03-080","p03-121","p03-135","p03-142","p03-151","p03-158","p03-163","p03-173","p03-181","p03-185","p03-189","p06-030","p06-042","p06-047","p06-052","p06-058","p06-069","p06-088","p06-096","p06-104"]
30 changes: 30 additions & 0 deletions data/raw/iam/readme.md
@@ -0,0 +1,30 @@
# IAM Dataset

The IAM Handwriting Database contains forms of handwritten English text which can be used to train and test handwritten text recognizers and to perform writer identification and verification experiments.

- 657 writers contributed samples of their handwriting
- 1,539 pages of scanned text
- 13,353 isolated and labeled text lines

- http://www.fki.inf.unibe.ch/databases/iam-handwriting-database

## Pre-processing

First, all forms were placed into one directory called `forms`, from original directories like `formsA-D`.

To save space, I converted the original PNG files to JPG and resized them to half size:
```
mkdir forms-resized
cd forms
ls -1 *.png | parallel --eta -j 6 convert '{}' -adaptive-resize 50% '../forms-resized/{.}.jpg'
```
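The `parallel ... convert` pipeline above maps each `forms/NAME.png` to `forms-resized/NAME.jpg` (the `{.}` placeholder strips the extension). That filename mapping can be sketched in Python with `pathlib`; the resize itself would still be done by an image library, so only the path logic is shown here:

```python
from pathlib import Path


def resized_path(png_path, out_dir="forms-resized"):
    """Map forms/NAME.png to forms-resized/NAME.jpg, mirroring parallel's {.} placeholder."""
    png_path = Path(png_path)
    return Path(out_dir) / png_path.with_suffix(".jpg").name
```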

## Split

The data split we will use is the IAM Lines Large Writer Independent Text Line Recognition Task (lwitlrt), which contains 9,862 text lines.

- The validation set has been merged into the train set.
- The train set has 7,101 lines from 326 writers.
- The test set has 1,861 lines from 128 writers.
- The text lines of all data sets are mutually exclusive, thus each writer has contributed to one set only.
9 changes: 9 additions & 0 deletions environment.yml
@@ -0,0 +1,9 @@
name: fsdl-text-recognizer-2022
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.7 # versioned to match Google Colab; version also pinned in Dockerfile
  - cudatoolkit=11.3.1
  - cudnn=8.3.2
  - pip=21.1.3 # versioned to match Google Colab; version also pinned in Dockerfile
