Add Dockerfile #116

Draft: wants to merge 1 commit into base: dev
59 changes: 59 additions & 0 deletions Dockerfile
@@ -0,0 +1,59 @@
FROM rust:latest AS build
COPY Cargo.toml Cargo.toml
COPY src/ src/
COPY tests/ tests/
COPY benches/ benches/
COPY README.md README.md

# Install build dependencies for KenLM
RUN apt-get update && apt-get install -y libboost-all-dev libeigen3-dev cmake clang
RUN cargo build --all-features --release
RUN ls target
RUN ls target/release

FROM alpine AS dl
ARG model_url=https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
#TODO blocklist commit?

# install wget, git and unzip
RUN apk add --no-cache wget git unzip

# KenLM models too big to bundle?

# get langid
RUN wget -O langid.bin $model_url

# get blocklist
RUN wget https://github.com/olbat/ut1-blacklists/archive/refs/heads/master.zip
RUN unzip master.zip
RUN mv ut1-blacklists-master ut1-blacklists

# decompress the biggest one
RUN gzip -d ut1-blacklists/blacklists/adult/domains.gz

# extract blocklist commit id
#RUN git rev-parse HEAD > ut1-blacklists-commitid.txt

# TODO: find a lighter base image?
FROM debian

# copy binary
COPY --from=build target/release/ungoliant /bin/ungoliant
RUN ls /bin/

# copy model and blocklists
COPY --from=dl langid.bin /langid.bin
COPY --from=dl ut1-blacklists/blacklists/ /blocklists/

# create volumes for shards, KenLM models and corpus output
VOLUME /shards
VOLUME /kenlm
VOLUME /output

RUN ls


ENTRYPOINT ["/bin/ungoliant"]

#CMD ["pipeline", "--foo"]
CMD ["pipeline", "--domain-blocklists", "/blocklists/", "--kenlms-path", "/kenlm", "--lid-path", "/langid.bin", "--split_size", "10000", "--comp", "/shards", "/output"]
7 changes: 7 additions & 0 deletions README.md
@@ -15,6 +15,13 @@ Ungoliant is a replacement of [goclassy](https://github.com/oscar-corpus/goclass

## Installation

### Docker

WIP:

/!\\ Make sure the open-file limit is high enough before running, e.g. `ulimit -n 10000`.

Command: `docker run --rm -v <path_to_shards>:/shards -v ./output:/output --ulimit nofile=10000:10000 ungo:latest`
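
The command above assumes an image tagged `ungo:latest` already exists. Here is a minimal sketch of the build and run steps, not a definitive procedure. It assumes the image is built from the repository root. The variables `SHARDS_DIR`, `KENLM_DIR` and `OUTPUT_DIR` and the `run` helper are hypothetical names introduced for this example. The extra `/kenlm` mount is there because the default `CMD` in the Dockerfile passes `--kenlms-path /kenlm`. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the command by default,
# executes it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assumed host directories; adjust to your setup.
SHARDS_DIR="${SHARDS_DIR:-$PWD/shards}"   # input shards
KENLM_DIR="${KENLM_DIR:-$PWD/kenlm}"      # KenLM models
OUTPUT_DIR="${OUTPUT_DIR:-$PWD/output}"   # corpus output

# Build the image from the directory containing the Dockerfile.
run docker build -t ungo:latest .

# Run the default pipeline with the three volumes mounted.
run docker run --rm \
    -v "$SHARDS_DIR":/shards \
    -v "$KENLM_DIR":/kenlm \
    -v "$OUTPUT_DIR":/output \
    --ulimit nofile=10000:10000 \
    ungo:latest
```

Absolute host paths are used for the mounts because older Docker versions do not accept relative paths such as `./output` with `-v`.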

### Installing/Compiling the binary
* Via `cargo`: `cargo install ungoliant`
* Via `git`: `cargo install --git https://github.com/oscar-corpus/ungoliant`