feat: upload DB dump to AWS S3 (#10863)
Upload the following dump files to AWS S3 right after they are created
during the `gen_feeds_daily_off.sh` run:
- en.openfoodfacts.org.products.csv
- en.openfoodfacts.org.products.csv.gz
- fr.openfoodfacts.org.products.csv
- fr.openfoodfacts.org.products.csv.gz
- en.openfoodfacts.org.products.rdf
- fr.openfoodfacts.org.products.rdf
- openfoodfacts-products.jsonl.gz
- openfoodfacts-mongodbdump.gz
- openfoodfacts_recent_changes.jsonl.gz

Also add redirects (HTTP 302) to AWS S3 in the off nginx configuration
so that these downloads are served by S3 and we save I/O on our server.

We use the MinIO client (`mc`) to copy the files to S3. We expect
`/home/off/.mc/config.json` to contain the AWS credentials.
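
Since the upload depends on that config file being present, a pre-flight check could be run before the copy. The following is only a sketch and is not part of this commit: the function name `check_mc_ready` and its return codes are hypothetical, and it checks only that the client is on PATH and that the config file is readable and non-empty, not that the credentials are valid.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of this commit.
# Usage: check_mc_ready [config_path] [client_binary]
check_mc_ready() {
    config="${1:-/home/off/.mc/config.json}"
    client="${2:-mc}"

    # The client binary must be on PATH.
    if ! command -v "$client" >/dev/null 2>&1; then
        echo "missing client: $client" >&2
        return 1
    fi

    # The config file must be readable and non-empty.
    if [ ! -r "$config" ] || [ ! -s "$config" ]; then
        echo "missing or empty config: $config" >&2
        return 2
    fi
    return 0
}
```

A script could call `check_mc_ready || exit 1` before its first `mc cp`, so that a missing config is reported once instead of failing on every upload.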
raphael0202 authored Oct 7, 2024
1 parent 0558184 commit 34ae5e4
Showing 3 changed files with 40 additions and 0 deletions.
23 changes: 23 additions & 0 deletions conf/nginx/sites-available/off
@@ -53,6 +53,29 @@ server {
gunzip on;
}

# Add an HTTP 302 redirect to AWS S3 bucket for specific dump files
location = /data/openfoodfacts_recent_changes.jsonl.gz {
return 302 https://openfoodfacts-ds.s3.eu-west-3.amazonaws.com/openfoodfacts_recent_changes.jsonl.gz;
}
location = /data/openfoodfacts-mongodbdump.gz {
return 302 https://openfoodfacts-ds.s3.eu-west-3.amazonaws.com/openfoodfacts-mongodbdump.gz;
}
location = /data/openfoodfacts-products.jsonl.gz {
return 302 https://openfoodfacts-ds.s3.eu-west-3.amazonaws.com/openfoodfacts-products.jsonl.gz;
}
location = /data/en.openfoodfacts.org.products.csv {
return 302 https://openfoodfacts-ds.s3.eu-west-3.amazonaws.com/en.openfoodfacts.org.products.csv;
}
location = /data/en.openfoodfacts.org.products.csv.gz {
return 302 https://openfoodfacts-ds.s3.eu-west-3.amazonaws.com/en.openfoodfacts.org.products.csv.gz;
}
location = /data/fr.openfoodfacts.org.products.csv {
return 302 https://openfoodfacts-ds.s3.eu-west-3.amazonaws.com/fr.openfoodfacts.org.products.csv;
}
location = /data/fr.openfoodfacts.org.products.csv.gz {
return 302 https://openfoodfacts-ds.s3.eu-west-3.amazonaws.com/fr.openfoodfacts.org.products.csv.gz;
}

if ($http_referer ~* (jobothoniel.com) ) { return 403; } # blocked since 2021-07-13

# the app requests /1.json to get the product count...
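
The seven `location` blocks in this hunk differ only in the file name, so they could be generated from a list instead of being maintained by hand. A minimal sketch, assuming the same bucket URL as in the diff; the helper name `print_redirects` is hypothetical and nothing like it exists in the repository as far as this commit shows.

```shell
#!/bin/sh
# Hypothetical helper; not part of this commit.
# Bucket URL copied from the redirects in the diff.
BUCKET_URL="https://openfoodfacts-ds.s3.eu-west-3.amazonaws.com"

# Print one exact-match nginx location block per file name given.
print_redirects() {
    for f in "$@"; do
        printf 'location = /data/%s {\n    return 302 %s/%s;\n}\n' \
            "$f" "$BUCKET_URL" "$f"
    done
}

# The seven files redirected by this commit.
print_redirects \
    openfoodfacts_recent_changes.jsonl.gz \
    openfoodfacts-mongodbdump.gz \
    openfoodfacts-products.jsonl.gz \
    en.openfoodfacts.org.products.csv \
    en.openfoodfacts.org.products.csv.gz \
    fr.openfoodfacts.org.products.csv \
    fr.openfoodfacts.org.products.csv.gz
```

The output can be pasted into the `server` block or written to a file pulled in with nginx's `include` directive. The two RDF files are uploaded by this commit but not redirected, so they are left out of the list here as well.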
10 changes: 10 additions & 0 deletions scripts/gen_feeds_daily_off.sh
@@ -21,6 +21,16 @@ for export in en.openfoodfacts.org.products.csv fr.openfoodfacts.org.products.cs
mv -f new.$export.gz $export.gz
done

# Copy CSV and RDF files to AWS S3 using MinIO client
mc cp \
en.openfoodfacts.org.products.csv \
en.openfoodfacts.org.products.csv.gz \
en.openfoodfacts.org.products.rdf \
fr.openfoodfacts.org.products.csv \
fr.openfoodfacts.org.products.csv.gz \
fr.openfoodfacts.org.products.rdf \
s3/openfoodfacts-ds

# Generate the MongoDB dumps and jsonl export
cd /srv/off/scripts

7 changes: 7 additions & 0 deletions scripts/mongodb_dump.sh
@@ -60,4 +60,11 @@ popd > /dev/null # data/delta
mongoexport --collection recent_changes --host $HOST --db $DB --fields=_id,comment,code,userid,rev,countries_tags,t,diffs | gzip -9 > "new.${PREFIX}_recent_changes.jsonl.gz" && \
mv new.${PREFIX}_recent_changes.jsonl.gz ${PREFIX}_recent_changes.jsonl.gz

# Copy files to AWS S3 using MinIO client
mc cp \
${PREFIX}-products.jsonl.gz \
${PREFIX}_recent_changes.jsonl.gz \
${PREFIX}-mongodbdump.gz \
s3/openfoodfacts-ds

popd > /dev/null # data
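
Both scripts call `mc cp` unconditionally, so a dump that failed to generate could be uploaded stale or not exist at all. One possible guard is a wrapper that refuses to upload if any file is missing or empty. This is a sketch under assumptions, not the commit's behaviour: `upload_dumps` and the `MC` override variable are hypothetical names.

```shell
#!/bin/sh
# Hypothetical wrapper; not part of this commit.
# MC can be overridden (for example MC=echo) for a dry run.
MC="${MC:-mc}"

# Usage: upload_dumps TARGET FILE...
upload_dumps() {
    target="$1"
    shift
    for f in "$@"; do
        # Refuse to upload if any dump is missing or zero bytes.
        if [ ! -s "$f" ]; then
            echo "skipping upload, missing or empty: $f" >&2
            return 1
        fi
    done
    # Same invocation shape as in the diff: sources first, target last.
    "$MC" cp "$@" "$target"
}
```

In `mongodb_dump.sh` this would be called as `upload_dumps s3/openfoodfacts-ds "${PREFIX}-products.jsonl.gz" "${PREFIX}_recent_changes.jsonl.gz" "${PREFIX}-mongodbdump.gz"`. It only checks that the files are non-empty; it does not detect a stale file left over from a previous run.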
