Modify add_column() to optionally accept a FeatureType as param #7143

Merged

Conversation

varadhbhatnagar
Contributor

@varadhbhatnagar varadhbhatnagar commented Sep 8, 2024

Fix #7142.

Before (Add + Cast):

from datasets import load_dataset, Value
ds = load_dataset("rotten_tomatoes", split="test")
lst = [i for i in range(len(ds))]

ds = ds.add_column("new_col", lst)
# Assigns int64 to new_col by default
print(ds.features)

ds = ds.cast_column("new_col", Value(dtype="uint16", id=None))
print(ds.features)

Before (Numpy Workaround):

from datasets import load_dataset
import numpy as np
ds = load_dataset("rotten_tomatoes", split="test")
lst = [i for i in range(len(ds))]

ds = ds.add_column("new_col", np.array(lst, dtype=np.uint16))
print(ds.features)

After:

from datasets import load_dataset, Value
ds = load_dataset("rotten_tomatoes", split="test")
lst = [i for i in range(len(ds))]
val = Value(dtype="uint16", id=None)
ds = ds.add_column("new_col", lst, feature=val)
print(ds.features)

@varadhbhatnagar
Contributor Author

Requesting review @lhoestq
I will also update the docs if this looks good.

@lhoestq
Member

lhoestq commented Sep 9, 2024

Cool ! Maybe you can rename the argument to feature, with type FeatureType ? This way it would work the same way as .cast_column().

@varadhbhatnagar varadhbhatnagar force-pushed the add-pa-schema-to-add-column-func branch from b6b3aa0 to 383a18e on September 10, 2024 15:03
@varadhbhatnagar varadhbhatnagar changed the title from "Modify add_column() to optionally accept a pyarrow schema as param" to "Modify add_column() to optionally accept a FeatureType as param" on Sep 10, 2024
@varadhbhatnagar
Contributor Author

@lhoestq Since there is no way to get a pyarrow.Schema from a FeatureType, I had to go via Features. How does this look?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@lhoestq lhoestq left a comment

LGTM ! Can you just add the feature argument to the docstrings before we merge ?

@varadhbhatnagar
Contributor Author

@lhoestq done!

@varadhbhatnagar
Contributor Author

@lhoestq anything pending on this?

Member

@lhoestq lhoestq left a comment

Thanks ! LGTM :)

@lhoestq lhoestq merged commit 43b1fe1 into huggingface:main Sep 16, 2024
14 checks passed
@varadhbhatnagar varadhbhatnagar deleted the add-pa-schema-to-add-column-func branch September 17, 2024 03:47
Development

Successfully merging this pull request may close these issues.

Specifying datatype when adding a column to a dataset.
3 participants