Skip to content

Commit

Permalink
Improve NLTK downloader log output and error messages (#1690)
Browse files Browse the repository at this point in the history
Improves the build log output, error messages and buildpack
failure metrics for the NLTK downloader feature, and adds a
test for the app having an invalid `nltk.txt` (since previously
only the successful cases were tested).

This should be the last failure type seen in the wild that doesn't
emit a `failure_reason` metric.

GUS-W-8059892.
  • Loading branch information
edmorley authored Nov 8, 2024
1 parent 0cd8efc commit c496baa
Show file tree
Hide file tree
Showing 5 changed files with 60 additions and 15 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,9 @@

- Added a warning when the files for multiple package managers are found. In the future this warning will become an error. ([#1692](https://github.com/heroku/heroku-buildpack-python/pull/1692))
- Updated the build log message shown when installing dependencies to include the package manager command being run. ([#1689](https://github.com/heroku/heroku-buildpack-python/pull/1689))
- Improved the error messages and buildpack metrics for package manager related failures. ([#1689](https://github.com/heroku/heroku-buildpack-python/pull/1689))
- Changed test dependency installation on Heroku CI to now install `requirements.txt` and `requirements-test.txt` in a single `pip install` invocation rather than separately. This allows pip's resolver to resolve any version conflicts between the two files. ([#1689](https://github.com/heroku/heroku-buildpack-python/pull/1689))
- Improved the error messages and buildpack metrics for package manager related failures. ([#1689](https://github.com/heroku/heroku-buildpack-python/pull/1689))
- Improved the build log output, error messages and buildpack failure metrics for the NLTK downloader feature. ([#1690](https://github.com/heroku/heroku-buildpack-python/pull/1690))

## [v265] - 2024-11-06

Expand Down
24 changes: 18 additions & 6 deletions bin/steps/nltk
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
#!/usr/bin/env bash
# shellcheck disable=SC2250 # TODO: Use braces around variable references even when not strictly required.

# This script is run in a subshell via sub_env so doesn't inherit the options/vars/utils from `bin/compile`.
# TODO: Stop running this script in a subshell.
Expand All @@ -21,17 +20,30 @@ EXPORT_PATH="${BUILDPACK_DIR}/export"
if is_module_available 'nltk'; then
output::step "Downloading NLTK corpora..."

nltk_packages_definition="$BUILD_DIR/nltk.txt"
nltk_packages_definition="${BUILD_DIR}/nltk.txt"

if [[ -f "$nltk_packages_definition" ]]; then
if [[ -f "${nltk_packages_definition}" ]]; then
meta_set "nltk_downloader" "enabled"

readarray -t nltk_packages <"$nltk_packages_definition"
readarray -t nltk_packages <"${nltk_packages_definition}"
output::step "Downloading NLTK packages: ${nltk_packages[*]}"

python -m nltk.downloader -d "$BUILD_DIR/.heroku/python/nltk_data" "${nltk_packages[@]}" | output::indent
set_env NLTK_DATA "/app/.heroku/python/nltk_data"
nltk_data_dir="/app/.heroku/python/nltk_data"

if ! python -m nltk.downloader -d "${nltk_data_dir}" "${nltk_packages[@]}" |& output::indent; then
output::error <<-EOF
Error: Unable to download NLTK data.
The 'python -m nltk.downloader' command to download NLTK
data did not exit successfully.
See the log output above for more information.
EOF
meta_set "failure_reason" "nltk-downloader"
exit 1
fi

set_env NLTK_DATA "${nltk_data_dir}"
else
meta_set "nltk_downloader" "skipped-no-nltk-file"
echo " 'nltk.txt' not found, not downloading any corpora"
Expand Down
1 change: 1 addition & 0 deletions spec/fixtures/nltk_txt_invalid/nltk.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
invalid!
1 change: 1 addition & 0 deletions spec/fixtures/nltk_txt_invalid/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
nltk
46 changes: 38 additions & 8 deletions spec/hatchet/nltk_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -11,13 +11,13 @@
expect(clean_output(app.output)).to match(Regexp.new(<<~REGEX))
remote: -----> Downloading NLTK corpora...
remote: -----> Downloading NLTK packages: city_database stopwords
remote: .*: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
remote: \\[nltk_data\\] Downloading package city_database to
remote: \\[nltk_data\\] /tmp/build_.*/.heroku/python/nltk_data...
remote: \\[nltk_data\\] Unzipping corpora/city_database.zip.
remote: \\[nltk_data\\] Downloading package stopwords to
remote: \\[nltk_data\\] /tmp/build_.*/.heroku/python/nltk_data...
remote: \\[nltk_data\\] Unzipping corpora/stopwords.zip.
remote: .*: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
remote: \\[nltk_data\\] Downloading package city_database to
remote: \\[nltk_data\\] /app/.heroku/python/nltk_data...
remote: \\[nltk_data\\] Unzipping corpora/city_database.zip.
remote: \\[nltk_data\\] Downloading package stopwords to
remote: \\[nltk_data\\] /app/.heroku/python/nltk_data...
remote: \\[nltk_data\\] Unzipping corpora/stopwords.zip.
REGEX

# TODO: Add a test that the downloaded corpora can be found at runtime.
Expand All @@ -44,7 +44,37 @@

it 'does not try to install the specified NLTK corpora' do
app.deploy do |app|
expect(app.output.downcase).not_to include('nltk')
expect(app.output).not_to include('NLTK')
expect(app.output).not_to include('nltk_data')
end
end
end

context 'when nltk.txt contains invalid entries' do
let(:app) { Hatchet::Runner.new('spec/fixtures/nltk_txt_invalid', allow_failure: true) }

it 'fails the build' do
app.deploy do |app|
expect(clean_output(app.output)).to match(Regexp.new(<<~REGEX, Regexp::MULTILINE))
remote: -----> Downloading NLTK corpora...
remote: -----> Downloading NLTK packages: invalid!
remote: .+: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
remote: \\[nltk_data\\] Error loading invalid!: Package 'invalid!' not found in
remote: \\[nltk_data\\] index
remote: Error installing package. Retry\\? \\[n/y/e\\]
remote: Traceback \\(most recent call last\\):
remote: .+
remote: EOFError: EOF when reading a line
remote:
remote: ! Error: Unable to download NLTK data.
remote: !
remote: ! The 'python -m nltk.downloader' command to download NLTK
remote: ! data did not exit successfully.
remote: !
remote: ! See the log output above for more information.
remote:
remote: ! Push rejected, failed to compile Python app.
REGEX
end
end
end
Expand Down

0 comments on commit c496baa

Please sign in to comment.