Can you release the sharegpt dataset? #90
I am wondering whether the ShareGPT dataset can be released?
Comments
If the data can't be released, can you please share the code for dataset crawling and all the processing you did to get markdown from the HTML? |
"Open-Source"
That's not open source. Not at all; don't claim to be open. |
Hi @Kreijstal, @LZY-the-boys and @ari9dam, thanks for your interest! We plan to release the weights once we have addressed all concerns and have a low-resource version of the inference code ready. We released the demo first to get some early feedback on the model. We have no current plans to release the dataset and will first communicate with the ShareGPT team. The data cleaning script is fastchat/data/clean_sharegpt.py (at commit 6f42570) in the FastChat repo.
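For anyone who just wants the gist of that script, here is a minimal sketch of the HTML-to-markdown cleaning step; the real clean_sharegpt.py is more involved. It assumes the raw dump is a list of {"id", "conversations": [{"from", "value"}]} records and that the beautifulsoup4 and markdownify packages are available.

```python
# Hypothetical sketch of the HTML -> markdown cleaning step; the real
# fastchat/data/clean_sharegpt.py is more involved. Assumes the raw dump
# is a list of {"id": ..., "conversations": [{"from": ..., "value": <html>}]}
# and that the beautifulsoup4 and markdownify packages are installed.
import json

from bs4 import BeautifulSoup
from markdownify import markdownify


def clean_html(raw_html: str) -> str:
    # Parse first so stray/unclosed tags don't leak into the markdown.
    soup = BeautifulSoup(raw_html, "html.parser")
    return markdownify(str(soup)).strip()


def clean_file(in_path: str, out_path: str) -> None:
    with open(in_path) as f:
        records = json.load(f)
    for record in records:
        for turn in record.get("conversations", []):
            turn["value"] = clean_html(turn["value"])
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)


if __name__ == "__main__":
    # File names are placeholders.
    clean_file("sharegpt_html.json", "sharegpt_clean.json")
```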
|
Regarding the dataset: is the decision not to release it out of respect for the ShareGPT team, who disabled their endpoint? My understanding is that it was disabled for security reasons, which I can respect. If so, do you know of any efforts being made to build public datasets for training foundational models like Vicuna? If not, do you know of any resources that could help others interested in such efforts? |
@merrymercy Seems the |
ShareGPT Dataset: zipped JSONs with 90,000 conversations from ShareGPT, split into two files with 45k each:
The format should work as-is for training. Use the clean tool to remove HTML markup: https://github.com/lm-sys/FastChat/blob/main/docs/commands/data_cleaning.md
(Note: I'm just relaying this info from someone who sent it my way, so I don't know anything more than anyone else.)
The entire pre-cleaned 90k conversation dataset is also available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset
A pre-cleaned, English-only, "unfiltered," and 2048-token-split version of the ShareGPT dataset, ready for finetuning, is available here: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered |
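If it helps anyone sanity-check a download before training, here is a small sketch that loads one of those JSON files and inspects the conversation format. The "id"/"conversations"/"from"/"value" schema is the commonly circulated ShareGPT layout and is an assumption about these particular files:

```python
# Quick sanity check of a ShareGPT-style JSON file before training.
# The "id"/"conversations"/"from"/"value" schema is assumed from the
# commonly shared ShareGPT format; verify it against your actual file.
import json
from collections import Counter

with open("ShareGPT_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

print(f"{len(data)} conversations")

# Count turns per speaker role ("human" vs "gpt" in the usual schema).
roles = Counter(
    turn["from"] for record in data for turn in record["conversations"]
)
print("turns per role:", dict(roles))

# Peek at the first exchange to confirm the HTML markup was cleaned.
for turn in data[0]["conversations"][:2]:
    print(f'{turn["from"]}: {turn["value"][:120]}')
```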
@MarkSchmidty |
For all you scrapers out there, there's another site that also has ChatGPT conversations and is rather easy to scrape: It has around 80k conversations from what I can tell. |
Not all heroes wear capes! |
@MarkSchmidty Thank you for providing the higher-quality version that has all the senseless/misguided OpenAI moralizing purged. |
hello |
I fine-tuned the 13B model using the dataset from the Hugging Face link above, but the model's performance was poor; in some cases it failed to correctly output the end-of-sequence token. |
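In case it helps debug that: one common cause of a model never emitting the end symbol is preprocessing that never appends the tokenizer's EOS token to the assistant turns, so the model gets no supervised signal for when to stop. A minimal sanity check, with the model path as a placeholder (this is not FastChat's exact preprocessing):

```python
# Hypothetical sanity check: make sure training targets end with EOS.
# The model path is a placeholder; not FastChat's exact preprocessing.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-13b")


def build_target(assistant_reply: str) -> list[int]:
    # Append the EOS id explicitly; without it the model never learns
    # when to stop generating.
    ids = tokenizer(assistant_reply, add_special_tokens=False).input_ids
    return ids + [tokenizer.eos_token_id]


ids = build_target("Sure, here is the answer.")
assert ids[-1] == tokenizer.eos_token_id
```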
@alanxmay Did you fine-tune it with the unfiltered dataset? |
Don't take everything for granted. Given that OpenAI is so closed, I have real respect for Meta for releasing LLaMA, and also for all the research groups that released follow-up work on LLaMA. |
ClosedAI, they really parted ways with all the nice principles they once had. |
Closing this issue for now. So far, we have released the
We're unable to release the data due to various factors out of our control. We'll keep pushing the limit and get the community better and more open LLMs!
|
@BadisG Yes, I am using this one: https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_unfiltered_cleaned_split.json |
@alanxmay How did it go? Did you manage to make it better? |
@MarkSchmidty |
I didn't generate these; I was sent them by an anonymous source. It's not possible to crawl ShareGPT anymore. ShareGPT used to have a page which you could crawl, but now it does not. |
ok, that's sad news. |
Theoretically, you could scrape Twitter. Anything someone shares publicly on social media is technically fair game to scrape. |
Yeah, but they are charging exorbitant fees for scraping now. Twitter is as good as closed now, at least for ordinary developers and researchers. Academic access RIP. |
Ya, it's another case of sad news. Not sure how this is all going to play out long-term. Best of luck to everyone. |
I used this dataset with the Baichuan 7B model, with the following command:
CUDA_VISIBLE_DEVICES="7" torchrun --nproc_per_node=1 --master_port=20001 fastchat/train/train_baichuan.py \
  --model_name_or_path /workspace/baichuan/model_para/Baichuan-7B \
  --data_path /workspace/baichuan/dataset/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json \
  --bf16 False \
  --output_dir output_baichuan \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 1200 \
  --save_total_limit 10 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
  --tf32 False \
  --model_max_length 64 \
  --gradient_checkpointing True \
  --lazy_preprocess True
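One flag that stands out above is --model_max_length 64: the cleaned split targets 2048-token chunks, so a 64-token cutoff will truncate almost every conversation. A quick sketch for checking the token-length distribution before picking a cutoff, reusing the paths from the command above (the 1,000-record sample is an arbitrary choice, and trust_remote_code=True is assumed to be needed for Baichuan's custom tokenizer):

```python
# Check the token-length distribution of the dataset before choosing
# --model_max_length; paths are taken from the command above, and
# trust_remote_code=True is assumed for Baichuan's custom tokenizer.
import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/workspace/baichuan/model_para/Baichuan-7B", trust_remote_code=True
)

with open(
    "/workspace/baichuan/dataset/ShareGPT_Vicuna_unfiltered/"
    "ShareGPT_V3_unfiltered_cleaned_split.json"
) as f:
    data = json.load(f)

lengths = []
for record in data[:1000]:  # sample for speed; 1,000 is arbitrary
    text = "\n".join(turn["value"] for turn in record["conversations"])
    lengths.append(len(tokenizer(text).input_ids))

lengths.sort()
for pct in (50, 90, 99):
    print(f"p{pct}: {lengths[len(lengths) * pct // 100]} tokens")
```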