[Script] Images folder convert script to data_info.json #57

Merged


Contributor

@frutiemax92 frutiemax92 commented Apr 21, 2024

This script converts a folder of images and captions into the expected folder structure, including the data_info.json file. It also copies each image into the InternImgs folder under an indexed file name, keeping the original extension. It works recursively, i.e. you can put multiple dataset folders in the root folder.

There is also an optional argument --caption_extension, which defaults to .txt; users can change it if they wish.

I thought this would be a useful script as I am more used to the other folder structure.
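For readers skimming the thread, the conversion described above can be sketched roughly as follows. This is not the merged script, only a minimal illustration under stated assumptions: the function name `convert_folder`, the set of image extensions, the location of data_info.json at the output root, and the `path` and `prompt` keys are all hypothetical, and the real file may require further fields.

```python
import json
import shutil
from pathlib import Path

# Assumed set of image extensions; the actual script may accept others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def convert_folder(input_dir, output_dir, caption_extension=".txt"):
    """Copy captioned images into InternImgs and write data_info.json.

    Hypothetical sketch: the entry keys ("path", "prompt") and the
    location of data_info.json are assumptions, not the confirmed schema.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    images_dir = output_dir / "InternImgs"
    images_dir.mkdir(parents=True, exist_ok=True)

    entries = []
    # rglob descends into nested dataset folders; sorting keeps the
    # index assignment deterministic between runs.
    for image_path in sorted(input_dir.rglob("*")):
        if not image_path.is_file():
            continue
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        # The caption is expected next to the image, with the same stem.
        caption_path = image_path.with_suffix(caption_extension)
        if not caption_path.is_file():
            continue  # skip images that have no caption file

        # Indexed file name, same extension as the original image.
        new_name = f"{len(entries)}{image_path.suffix}"
        shutil.copy(image_path, images_dir / new_name)
        caption = caption_path.read_text(encoding="utf-8").strip()
        entries.append({"path": new_name, "prompt": caption})

    info_path = output_dir / "data_info.json"
    info_path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries
```

Calling `convert_folder("my_images", "my_dataset", caption_extension=".caption")` would pair each image with a `.caption` file instead of a `.txt` file; both folder names here are placeholders.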

@frutiemax92 frutiemax92 marked this pull request as draft April 21, 2024 00:26
@frutiemax92 frutiemax92 force-pushed the script_convert_images_to_json branch from 976eec2 to 2256a99 Compare April 21, 2024 00:28
@frutiemax92 frutiemax92 marked this pull request as ready for review April 21, 2024 00:28
@lawrence-cj
Contributor

lawrence-cj commented Apr 21, 2024

Pretty good and useful scripts. Thx a lot. Let's add a how-to-use in the Readme file? @frutiemax92

@frutiemax92 frutiemax92 force-pushed the script_convert_images_to_json branch from 2256a99 to d0888b1 Compare April 21, 2024 13:31
@frutiemax92 frutiemax92 force-pushed the script_convert_images_to_json branch from d0888b1 to b6dcc1a Compare April 21, 2024 13:39
@frutiemax92 frutiemax92 force-pushed the script_convert_images_to_json branch from fb9474b to 0a27daa Compare April 29, 2024 13:57
@Radtoo

Radtoo commented May 12, 2024

The people who are most likely to train Pixart-Sigma tend to have SDXL-structured (image + .txt caption) training data. Such a script should be officially included and documented. Otherwise, maybe the functionality needed to use SDXL-structured training data could go in train.py?

But I think the empty sharegptv4 values it generates are currently triggering an assertion error.

@frutiemax92 frutiemax92 force-pushed the script_convert_images_to_json branch from 610e4af to dff44c4 Compare May 16, 2024 19:18
@lawrence-cj
Contributor

really nice work. Thank you so much for your PR.🥰 @frutiemax92

@lawrence-cj lawrence-cj merged commit 815fcc0 into PixArt-alpha:master May 19, 2024