[Issue: dolma.core.errors.DolmaFatalError in Step 1: Run Taggers] #194

Closed
@yushengsu-thu

Hello @soldni,
I have one more question.
When I run Step 1: Run Taggers,

# --experiment is optional; assigning a name groups taggers in a single directory
dolma tag \
    --documents "wikipedia/v0/documents/*" \
    --experiment exp \
    --taggers random_number_v1 \
              cld2_en_paragraph_with_doc_score_v2 \
              ft_lang_id_en_paragraph_with_doc_score_v2 \
              char_length_with_paragraphs_v1 \
              whitespace_tokenizer_with_paragraphs_v1 \
    --processes 16   # run on 96 cores

I encounter the following issue:

Traceback (most recent call last):
  File "/home/yushensu/miniconda3/envs/data/bin/dolma", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/cli/main.py", line 93, in main
    return cli.run_from_args(args=args, config=config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/cli/__init__.py", line 192, in run_from_args
    return cls.run(parsed_config)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/cli/tagger.py", line 129, in run
    create_and_run_tagger(
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/core/runtime.py", line 483, in create_and_run_tagger
    tagger_processor(
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/core/parallel.py", line 516, in __call__
    fn(
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/core/parallel.py", line 439, in _multiprocessing_run_all
    result.get()
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
dolma.core.errors.DolmaFatalError: Failed to process wikipedia/v0/documents/wiki_00.gz due to ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword. 

My env:

Python 3.12.4
numpy 2.1.0
dolma 1.0.11

Does this issue come from the processed data wikipedia/v0/documents/wiki_00.gz (which I generated with scripts/make_wikipedia.py) or from the dolma codebase? Do you have any suggestions to mitigate or solve it?
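
One way to separate a data problem from a library problem is to check the input file on its own, without dolma or NumPy involved. The sketch below is a minimal check, assuming the file is gzipped JSON Lines and that each document should carry at least `id` and `text` fields; `check_documents` and `REQUIRED_KEYS` are hypothetical names, not part of dolma.

```python
import gzip
import json

# Assumption: the minimum fields a tagger needs in each document.
REQUIRED_KEYS = ("id", "text")

def check_documents(path, limit=1000):
    """Inspect up to `limit` lines of a gzipped JSONL file.

    Returns (number of lines checked, list of (line number, problem)).
    """
    problems = []
    checked = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno > limit:
                break
            checked += 1
            try:
                doc = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc}"))
                continue
            if not isinstance(doc, dict):
                problems.append((lineno, "not a JSON object"))
                continue
            missing = [k for k in REQUIRED_KEYS if k not in doc]
            if missing:
                problems.append((lineno, f"missing keys: {missing}"))
    return checked, problems
```

Calling `check_documents("wikipedia/v0/documents/wiki_00.gz")` and getting an empty problem list would suggest the file is at least structurally readable, which points toward the library versions rather than the data.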

soldni commented on Aug 28, 2024

Member

Oh, I think it's because numpy 2.x is incompatible with some numpy 1.x APIs. I'm cutting a quick fix and a new release (dolma 1.0.12) momentarily to address that.
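
For context, the ValueError in the traceback comes from a NumPy 2.x change to the `copy` keyword: `np.array(obj, copy=False)` now raises when a copy cannot be avoided, whereas NumPy 1.x copied silently. The sketch below illustrates the portable replacement that the error message itself suggests. `as_array_allow_copy` is a hypothetical helper name, and this is not necessarily how the dolma fix is implemented.

```python
import numpy as np

def as_array_allow_copy(obj, dtype=None):
    """Hypothetical helper: portable stand-in for np.array(obj, copy=False).

    np.asarray avoids a copy when possible and copies otherwise,
    on both NumPy 1.x and 2.x.
    """
    return np.asarray(obj, dtype=dtype)

data = [1, 2, 3]  # a Python list cannot become an array without copying

# Probe how the installed NumPy treats copy=False.
try:
    np.array(data, copy=False)
    copy_false_raises = False  # NumPy 1.x: copies silently
except ValueError:
    copy_false_raises = True   # NumPy 2.x: refuses to copy

arr = as_array_allow_copy(data)   # works on either version
same = as_array_allow_copy(arr)   # already an ndarray: returned as-is

print("NumPy", np.__version__, "| copy=False raises:", copy_false_raises)
print("no copy for existing arrays:", same is arr)
```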

yushengsu-thu commented on Aug 28, 2024

Contributor, Author

@soldni thanks for your reply.
I found that this issue comes from Step 0: Obtain Wikipedia (the processed data), because of the wikiextractor package it uses.

I have found a temporary workaround:
downgrade Python (3.12 --> 3.11) and pin the packages to the following versions:

Python 3.11.9
numpy 1.26.3
wikiextractor 3.0.6
dolma 1.0.11

Then re-run Step 0: Obtain Wikipedia:

python scripts/make_wikipedia.py \
  --output wikipedia \
  --date 20231001 \
  --lang simple \
  --processes 16

Finally, use the processed data it produces for Step 1: Run Taggers, which mitigates this issue.
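
Before re-running the pipeline, it can help to confirm that the environment actually matches the versions listed in this workaround. The following is a small sketch using only the standard library; `KNOWN_GOOD`, `major`, `numpy_needs_pin`, and `report` are hypothetical names, and the version table simply mirrors what was reported in this thread rather than an official support matrix.

```python
from importlib import metadata

# Versions reported as working in this thread (not an official support matrix).
KNOWN_GOOD = {"numpy": "1.26.3", "wikiextractor": "3.0.6", "dolma": "1.0.11"}

def major(version):
    """Return the leading integer of a version string such as '2.1.0'."""
    return int(version.split(".")[0])

def numpy_needs_pin(numpy_version):
    """True for NumPy 2.x or later, the range that failed with dolma 1.0.11 here."""
    return major(numpy_version) >= 2

def report():
    """Print installed versions next to the versions from the workaround."""
    for name, wanted in KNOWN_GOOD.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = "not installed"
        print(f"{name}: installed={found}, workaround={wanted}")

if __name__ == "__main__":
    report()
```

With the versions from the original report, `numpy_needs_pin("2.1.0")` is true and `numpy_needs_pin("1.26.3")` is false.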

[Issue: `dolma.core.errors.DolmaFatalError` in Step 1: Run Taggers] · Issue #194 · allenai/dolma