[BUG: validate_data.py ModuleNotFoundError (finetune & tensorflow) #98
Comments
I fixed part of my issue by running the command directly from within the
Still, I am now getting another error:
Downgrading to
Any idea? Best, C. |
Any help? I have tried also with |
I tried it in a new environment for python 3.10 and it worked. You have to run it as a module ( cd $HOME/mistral-finetune
python -m utils.reformat_data $HOME/data/ultrachat_chunk_train.jsonl
python -m utils.reformat_data $HOME/data/ultrachat_chunk_eval.jsonl FYI that package |
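The difference between the two invocation styles can be reproduced in isolation. The sketch below is only an illustration under assumptions: it builds a throwaway directory with a hypothetical package `localpkg_demo` sitting next to `utils/` (both names are made up here and are not taken from the repository), then runs the same file once as a script and once as a module. When a file is run as a script, Python puts the script's own directory first on `sys.path`; with `-m`, it puts the current working directory there instead, which is why a sibling package becomes importable.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways() -> tuple[int, int]:
    """Return exit codes for (script-style, module-style) invocations."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical layout: a local package next to utils/, as in a repo checkout.
        (root / "localpkg_demo").mkdir()
        (root / "localpkg_demo" / "__init__.py").write_text("VALUE = 1\n")
        (root / "utils").mkdir()
        (root / "utils" / "__init__.py").write_text("")
        (root / "utils" / "tool.py").write_text(
            "import localpkg_demo\nprint(localpkg_demo.VALUE)\n"
        )
        # Script style: sys.path[0] is utils/, so the sibling package is not found.
        as_script = subprocess.run(
            [sys.executable, "utils/tool.py"], cwd=root, capture_output=True
        )
        # Module style: sys.path[0] is the working directory, so the import resolves.
        as_module = subprocess.run(
            [sys.executable, "-m", "utils.tool"], cwd=root, capture_output=True
        )
        return as_script.returncode, as_module.returncode


if __name__ == "__main__":
    script_code, module_code = run_both_ways()
    print("script exit code:", script_code)
    print("module exit code:", module_code)
```

In this toy setup the script-style run exits with a non-zero code because of a `ModuleNotFoundError`, while the module-style run succeeds. Installing a same-named package from PyPI would hide the error without fixing the import path.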
Dear @NazimHAli, many thanks for your support. As written above, I have fixed some of the issues by running it as a module instead of as a script (e.g., the missing finetune package). I went through the whole process once more and realized that I forgot to modify the I could thus successfully complete the dataset verification section: Nevertheless, it fails at training when running:
I get:
Any idea? Best, C. |
Try uninstalling It's possible a combination of the packages + local environment is causing it to install version 2, but not have the dependencies correctly defined. |
Many thanks for your support, that indeed fixed the "A module that was compiled using NumPy 1.x cannot be run in Unfortunately, when running
FYI I am trying to run the scripts on our University GPU Cluster. |
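For anyone hitting the same NumPy 1.x versus 2.x message, a quick check of which major version is actually installed can save a round of guessing. This is a minimal sketch using only the standard library; the helper names `major_version` and `numpy_status` are made up for illustration and are not part of the repository.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Extract the leading integer of a dotted version string."""
    return int(version.split(".")[0])


def numpy_status() -> str:
    """Describe the installed NumPy without importing it."""
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return "numpy is not installed"
    if major_version(installed) >= 2:
        # Extensions compiled against NumPy 1.x may refuse to load here.
        return f"numpy {installed}: 2.x series"
    return f"numpy {installed}: 1.x series"


if __name__ == "__main__":
    print(numpy_status())
```

Reading the version from package metadata avoids importing NumPy itself, so the check also works in an environment where the import fails.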
I don't have experience with this, so not sure how to debug because it could be specific to your cluster - you can try first getting it to run with a single GPU and go from there. This might be a better question in the |
Dear @NazimHAli, Thanks for the suggestion, unfortunately it fails similarly:
I will thus open a thread on the Best, C. |
@NazimHAli, according to the PyTorch developers, the issue comes from your code and not from their package: pytorch/pytorch#137082. So I could go a step further by setting up Though it still fails later on...
Any idea what went wrong this time? |
Hey, Sorry for the late reply, lost track of things. From this error, it's complaining about your dataset: ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (32768,) + inhomogeneous part. Can you create a public repo with reproducible code and sample data? |
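The quoted `ValueError` usually means NumPy was handed an outer sequence (here of length 32768) whose elements do not all have the same length, so it cannot form a rectangular array. A rough way to look for that condition in plain Python is sketched below; `length_histogram` and `is_ragged` are hypothetical helpers for illustration, and the sketch does not touch the training code or say where in the pipeline the ragged data arises.

```python
from collections import Counter


def length_histogram(rows):
    """Count how many rows have each length."""
    return Counter(len(row) for row in rows)


def is_ragged(rows) -> bool:
    """True when the rows do not all share one length."""
    return len(length_histogram(rows)) > 1


if __name__ == "__main__":
    sample = [[1, 2, 3], [4, 5], [6, 7, 8]]
    print(length_histogram(sample))
    print(is_ragged(sample))
```

If the histogram shows more than one length, the offending rows are the ones to inspect first.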
@NazimHAli, no worries and thanks for the reply. In fact, I have strictly followed your README and downloaded the Ultrachat_200k dataset from HuggingFace using Python:
Then, again as suggested in your README, I made use of the ./utils/reformat_data.py to correct the data:
Maybe this last step corrupted the dataset? EDIT: There seems to be something wrong with the Ultrachat_200k dataset from HuggingFace, because when I verify the training yaml to make sure the data is correctly formatted running
Which is not what is expected according to your README, namely:
|
Python Version
Pip Freeze
Reproduction Steps
python ./mistral-finetune/utils/validate_data.py --train_yaml ./mistral-finetune/example/7B.yaml
Expected Behavior
According to the README, it should return "a summary of the data input and training parameters" such as:
Additional Context
The script returns the following error:
When installing the latest 'finetune-0.10.0' release, it returns a second error also related to a missing package:
Suggested Solutions
Installing the second missing package, 'tensorflow-2.17.0', should fix the problem, but it triggers a pip dependency conflict:
Since finetune 0.10.0 requires numpy <1.24.0 while tensorflow-2.17.0 requires numpy 1.26.4, I really don't see how I could make your script work.
Any idea?
Best,
C.
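The conflict described above comes down to a simple comparison: a version that one package pins cannot also fall below the other package's upper bound. The sketch below uses a deliberately simplified version comparison (plain dotted integers, not full PEP 440 handling), and the helper names are made up for illustration.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as '1.26.4' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies_upper_bound(candidate: str, exclusive_upper: str) -> bool:
    """Check candidate < exclusive_upper, comparing component by component."""
    return parse(candidate) < parse(exclusive_upper)


if __name__ == "__main__":
    # Version numbers taken from the report above.
    print(satisfies_upper_bound("1.26.4", "1.24.0"))
```

Because 1.26.4 is not below 1.24.0, no single NumPy version satisfies both requirements as stated, so pip cannot resolve them in one environment.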
Follow up:
Command
torchrun --nproc-per-node 8 --master_port $RANDOM -m train example/7B.yaml
seems to fail as well due to a missing package:
And when trying to install 'train-0.0.5', I got another pip dependency conflict with the same packages as above: