[MS-334] Dataset Converter #269

Merged
merged 22 commits into dev_main from dataset-converter
Jul 30, 2024


imda-normanchia
Collaborator

Description

This feature lets users convert an existing dataset in CSV format to JSON in our dataset format. The only requirement is a CSV file that contains two columns, target and input. The CSV converter looks for these two columns in the file and converts them into our dataset shape.
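The CSV conversion described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `csv_to_dataset` and the output keys (`name`, `description`, `examples`) are assumptions about the target shape, and only the `input` and `target` column names come from the description.

```python
import csv
import io

# The two columns the description says the converter looks for.
REQUIRED_COLUMNS = ("input", "target")


def csv_to_dataset(csv_text, name, description=""):
    """Convert CSV text into a dataset dict (hypothetical shape)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []

    # Fail early if either required column is absent.
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(
            "CSV is missing required column(s): " + ", ".join(missing)
        )

    # Keep only the two required columns; any extra columns are ignored.
    examples = [
        {"input": row["input"], "target": row["target"]} for row in reader
    ]
    return {"name": name, "description": description, "examples": examples}
```

The resulting dict can be written out with `json.dump`. Extra columns are dropped silently in this sketch; whether the real converter does the same is not stated in the PR.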

This feature also lets users download datasets from Hugging Face through Moonshot, and it converts each downloaded dataset into our dataset shape.
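A rough sketch of the Hugging Face path, under stated assumptions: the helper names, the column-mapping parameters, and the output keys are hypothetical and not taken from this PR. The only real library call used is `load_dataset` from the Hugging Face `datasets` package, imported lazily so the mapping step can be tried without it installed.

```python
def rows_to_dataset(rows, name, input_col, target_col):
    """Map an iterable of row dicts into a dataset dict (hypothetical shape).

    input_col and target_col name the source columns to copy into the
    "input" and "target" fields of each example.
    """
    examples = [
        {"input": row[input_col], "target": row[target_col]} for row in rows
    ]
    return {"name": name, "examples": examples}


def load_hugging_face_rows(dataset_name, config=None, split="train"):
    """Download one split from the Hugging Face Hub (needs network access)."""
    # Lazy import: only required when actually downloading.
    from datasets import load_dataset

    # Iterating the returned split yields one dict per row.
    return load_dataset(dataset_name, config, split=split)
```

In use, the two steps would be chained, for example `rows_to_dataset(load_hugging_face_rows(...), name, input_col, target_col)`, with the column names chosen to match the dataset being downloaded.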

Motivation and Context

This lets users easily convert their own datasets, or download new ones, for use with Moonshot.

Type of Change

How to Test

[Provide clear instructions on how to test and verify the changes introduced by this pull request, including any specific unit tests you have created to demonstrate your changes.]

Checklist

Please check all the boxes that apply to this pull request using "x":

  • I have tested the changes locally and verified that they work as expected.
  • I have added or updated the necessary documentation (README, API docs, etc.).
  • I have added appropriate unit tests or functional tests for the changes made.
  • I have followed the project's coding conventions and style guidelines.
  • I have rebased my branch onto the latest commit of the main branch.
  • I have squashed or reorganized my commits into logical units.
  • I have added any necessary dependencies or packages to the project's build configuration.
  • I have performed a self-review of my own code.
  • I have read, understood and agree to the Developer Certificate of Origin below, which this project utilises.

Screenshots (if applicable)

[If the changes involve visual modifications, include screenshots or GIFs that demonstrate the changes.]

Additional Notes

[Add any additional information or context that might be relevant to reviewers.]

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
   have the right to submit it under the open source license
   indicated in the file; or

(b) The contribution is based upon previous work that, to the best
   of my knowledge, is covered under an appropriate open source
   license and I have the right under that license to submit that
   work with modifications, whether created in whole or in part
   by me, under the same open source license (unless I am
   permitted to submit under a different license), as indicated
   in the file; or

(c) The contribution was provided directly to me by some other
   person who certified (a), (b) or (c) and I have not modified
   it.

(d) I understand and agree that this project and the contribution
   are public and that a record of the contribution (including all
   personal information I submit with it, including my sign-off) is
   maintained indefinitely and may be redistributed consistent with
   this project or the open source license(s) involved.

@imda-normanchia imda-normanchia requested review from imda-kelvinkok and imda-lionelteo and removed request for imda-kelvinkok July 22, 2024 02:32
@imda-lionelteo imda-lionelteo merged commit 773977b into dev_main Jul 30, 2024
1 of 2 checks passed
@imda-normanchia imda-normanchia deleted the dataset-converter branch July 30, 2024 13:37
@imda-benedictlee
Contributor

Tested. Working as expected.

Create Dataset via CSV:
Screenshot 2024-07-31 at 12 50 53 PM

Screenshot 2024-07-31 at 12 53 07 PM Screenshot 2024-07-31 at 12 53 10 PM

Create Dataset via CSV using the same name:
Screenshot 2024-07-31 at 1 01 22 PM

Create Dataset via Hugging Face:
Screenshot 2024-07-31 at 12 58 31 PM

Screenshot 2024-07-31 at 12 59 02 PM Screenshot 2024-07-31 at 12 59 06 PM

Create Dataset via Hugging Face using the same name:
Screenshot 2024-07-31 at 1 02 32 PM
