Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve NLTK downloader log output and error messages #1690

Merged
merged 1 commit into from
Nov 8, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,8 +4,9 @@

- Added a warning when the files for multiple package managers are found. In the future this warning will become an error. ([#1692](https://github.com/heroku/heroku-buildpack-python/pull/1692))
- Updated the build log message shown when installing dependencies to include the package manager command being run. ([#1689](https://github.com/heroku/heroku-buildpack-python/pull/1689))
- Improved the error messages and buildpack metrics for package manager related failures. ([#1689](https://github.com/heroku/heroku-buildpack-python/pull/1689))
- Changed test dependency installation on Heroku CI to now install `requirements.txt` and `requirements-test.txt` in a single `pip install` invocation rather than separately. This allows pip's resolver to resolve any version conflicts between the two files. ([#1689](https://github.com/heroku/heroku-buildpack-python/pull/1689))
- Improved the error messages and buildpack metrics for package manager related failures. ([#1689](https://github.com/heroku/heroku-buildpack-python/pull/1689))
- Improved the build log output, error messages and buildpack failure metrics for the NLTK downloader feature. ([#1690](https://github.com/heroku/heroku-buildpack-python/pull/1690))

## [v265] - 2024-11-06

Expand Down
24 changes: 18 additions & 6 deletions bin/steps/nltk
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
#!/usr/bin/env bash
# shellcheck disable=SC2250 # TODO: Use braces around variable references even when not strictly required.

# This script is run in a subshell via sub_env so doesn't inherit the options/vars/utils from `bin/compile`.
# TODO: Stop running this script in a subshell.
Expand All @@ -21,17 +20,30 @@ EXPORT_PATH="${BUILDPACK_DIR}/export"
if is_module_available 'nltk'; then
output::step "Downloading NLTK corpora..."

nltk_packages_definition="$BUILD_DIR/nltk.txt"
nltk_packages_definition="${BUILD_DIR}/nltk.txt"

if [[ -f "$nltk_packages_definition" ]]; then
if [[ -f "${nltk_packages_definition}" ]]; then
meta_set "nltk_downloader" "enabled"

readarray -t nltk_packages <"$nltk_packages_definition"
readarray -t nltk_packages <"${nltk_packages_definition}"
output::step "Downloading NLTK packages: ${nltk_packages[*]}"

python -m nltk.downloader -d "$BUILD_DIR/.heroku/python/nltk_data" "${nltk_packages[@]}" | output::indent
set_env NLTK_DATA "/app/.heroku/python/nltk_data"
nltk_data_dir="/app/.heroku/python/nltk_data"

if ! python -m nltk.downloader -d "${nltk_data_dir}" "${nltk_packages[@]}" |& output::indent; then
colincasey marked this conversation as resolved.
Show resolved Hide resolved
output::error <<-EOF
Error: Unable to download NLTK data.

The 'python -m nltk.downloader' command to download NLTK
data did not exit successfully.

See the log output above for more information.
EOF
meta_set "failure_reason" "nltk-downloader"
exit 1
fi

set_env NLTK_DATA "${nltk_data_dir}"
else
meta_set "nltk_downloader" "skipped-no-nltk-file"
echo " 'nltk.txt' not found, not downloading any corpora"
Expand Down
1 change: 1 addition & 0 deletions spec/fixtures/nltk_txt_invalid/nltk.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
invalid!
1 change: 1 addition & 0 deletions spec/fixtures/nltk_txt_invalid/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
nltk
46 changes: 38 additions & 8 deletions spec/hatchet/nltk_spec.rb
Original file line number Diff line number Diff line change
Expand Up @@ -11,13 +11,13 @@
expect(clean_output(app.output)).to match(Regexp.new(<<~REGEX))
remote: -----> Downloading NLTK corpora...
remote: -----> Downloading NLTK packages: city_database stopwords
remote: .*: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
remote: \\[nltk_data\\] Downloading package city_database to
remote: \\[nltk_data\\] /tmp/build_.*/.heroku/python/nltk_data...
remote: \\[nltk_data\\] Unzipping corpora/city_database.zip.
remote: \\[nltk_data\\] Downloading package stopwords to
remote: \\[nltk_data\\] /tmp/build_.*/.heroku/python/nltk_data...
remote: \\[nltk_data\\] Unzipping corpora/stopwords.zip.
remote: .*: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
remote: \\[nltk_data\\] Downloading package city_database to
remote: \\[nltk_data\\] /app/.heroku/python/nltk_data...
remote: \\[nltk_data\\] Unzipping corpora/city_database.zip.
remote: \\[nltk_data\\] Downloading package stopwords to
remote: \\[nltk_data\\] /app/.heroku/python/nltk_data...
remote: \\[nltk_data\\] Unzipping corpora/stopwords.zip.
REGEX

# TODO: Add a test that the downloaded corpora can be found at runtime.
Expand All @@ -44,7 +44,37 @@

it 'does not try to install the specified NLTK corpora' do
app.deploy do |app|
expect(app.output.downcase).not_to include('nltk')
expect(app.output).not_to include('NLTK')
expect(app.output).not_to include('nltk_data')
end
end
end

context 'when nltk.txt contains invalid entries' do
let(:app) { Hatchet::Runner.new('spec/fixtures/nltk_txt_invalid', allow_failure: true) }

it 'fails the build' do
app.deploy do |app|
expect(clean_output(app.output)).to match(Regexp.new(<<~REGEX, Regexp::MULTILINE))
remote: -----> Downloading NLTK corpora...
remote: -----> Downloading NLTK packages: invalid!
remote: .+: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
remote: \\[nltk_data\\] Error loading invalid!: Package 'invalid!' not found in
remote: \\[nltk_data\\] index
remote: Error installing package. Retry\\? \\[n/y/e\\]
remote: Traceback \\(most recent call last\\):
remote: .+
remote: EOFError: EOF when reading a line
remote:
remote: ! Error: Unable to download NLTK data.
remote: !
remote: ! The 'python -m nltk.downloader' command to download NLTK
remote: ! data did not exit successfully.
remote: !
remote: ! See the log output above for more information.
remote:
remote: ! Push rejected, failed to compile Python app.
REGEX
end
end
end
Expand Down