Closed
Hello @soldni,
I have one more question.
When I run Step 1: Run Taggers:
# --experiment is optional; assigning a name groups taggers in a single directory
dolma tag \
    --documents "wikipedia/v0/documents/*" \
    --experiment exp \
    --taggers random_number_v1 \
        cld2_en_paragraph_with_doc_score_v2 \
        ft_lang_id_en_paragraph_with_doc_score_v2 \
        char_length_with_paragraphs_v1 \
        whitespace_tokenizer_with_paragraphs_v1 \
    --processes 16  # run on 96 cores
I encounter the following issue:
Traceback (most recent call last):
  File "/home/yushensu/miniconda3/envs/data/bin/dolma", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/cli/main.py", line 93, in main
    return cli.run_from_args(args=args, config=config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/cli/__init__.py", line 192, in run_from_args
    return cls.run(parsed_config)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/cli/tagger.py", line 129, in run
    create_and_run_tagger(
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/core/runtime.py", line 483, in create_and_run_tagger
    tagger_processor(
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/core/parallel.py", line 516, in __call__
    fn(
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/site-packages/dolma/core/parallel.py", line 439, in _multiprocessing_run_all
    result.get()
  File "/home/yushensu/miniconda3/envs/data/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
dolma.core.errors.DolmaFatalError: Failed to process wikipedia/v0/documents/wiki_00.gz due to ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.
My env:
Python 3.12.4
numpy 2.1.0
dolma 1.0.11
Does this issue come from the processed data (wikipedia/v0/documents/wiki_00.gz, which I generated with scripts/make_wikipedia.py) or from the dolma codebase? Do you have any suggestions to mitigate or solve it?
soldni commented on Aug 28, 2024
Oh I think it's because numpy 2.x is incompatible with numpy 1.x APIs. Cutting a quick fix and a new release (dolma 1.0.12) momentarily to fix that.
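For readers hitting the same ValueError, here is a minimal sketch of the NumPy behavior change that the error message describes. It is not dolma's code: the traceback does not show which call inside dolma or its dependencies triggers the error, and the helper names `copy_false_is_strict` and `to_array` are hypothetical, chosen only for illustration.

```python
import numpy as np


def copy_false_is_strict() -> bool:
    """Return True if np.array(..., copy=False) refuses to copy (NumPy 2.x behavior)."""
    try:
        # A Python list can never be wrapped without copying, so this call
        # raises ValueError on NumPy 2.x and silently copies on NumPy 1.x.
        np.array([1, 2, 3], copy=False)
    except ValueError:
        return True
    return False


def to_array(obj):
    """Convert obj to an ndarray, copying only when a copy is needed."""
    # np.asarray behaves the same way on NumPy 1.x and 2.x.
    return np.asarray(obj)


values = to_array([1, 2, 3])   # a list is copied into a new array
existing = np.arange(3)
same = to_array(existing)      # an existing ndarray is returned as-is

print("numpy", np.__version__, "strict copy=False:", copy_false_is_strict())
print("no copy for existing ndarray:", same is existing)
```

This is the replacement the error message itself suggests: code that used `copy=False` to mean "avoid a copy if possible" keeps that meaning under both major versions when it calls `np.asarray` instead.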
yushengsu-thu commented on Aug 28, 2024
@soldni thanks for your reply.
I found that this issue comes from the data processed in Step 0: Obtain Wikipedia, because of the package it uses, wikiextractor.
I have found a temporary solution: change Python from 3.12 to 3.11 and set the packages to the following versions:
Then re-run Step 0: Obtain Wikipedia and use its processed data for Step 1: Run Taggers. That mitigates this issue.