Remove need for datasets fork #6

Open
alex-hh opened this issue Oct 31, 2024 · 2 comments

alex-hh commented Oct 31, 2024

The point is that there are two cases:

  1. We are working with local arrow tables, in which case we use the arrow schema.
  2. We are loading from / pushing to the Hub or local files, in which case we need to serialise the features. What guides the serialisation, and what is the flow?

This is where the features get written into the schema metadata (`ArrowWriter._build_metadata` and `update_metadata_with_features`):
def _build_metadata(info: DatasetInfo, fingerprint: Optional[str] = None) -> Dict[str, str]:
    info_keys = ["features"]  # we can add support for more DatasetInfo keys in the future
    info_as_dict = asdict(info)
    metadata = {}
    metadata["info"] = {key: info_as_dict[key] for key in info_keys}
    if fingerprint is not None:
        metadata["fingerprint"] = fingerprint
    return {"huggingface": json.dumps(metadata)}

def update_metadata_with_features(table: Table, features: Features):
    """To be used in dataset transforms that modify the features of the dataset, in order to update the features stored in the metadata of its schema."""
    features = Features({col_name: features[col_name] for col_name in table.column_names})
    if table.schema.metadata is None or b"huggingface" not in table.schema.metadata:
        pa_metadata = ArrowWriter._build_metadata(DatasetInfo(features=features))
    else:
        metadata = json.loads(table.schema.metadata[b"huggingface"].decode())
        if "info" not in metadata:
            metadata["info"] = asdict(DatasetInfo(features=features))
        else:
            metadata["info"]["features"] = asdict(DatasetInfo(features=features))["features"]
        pa_metadata = {"huggingface": json.dumps(metadata)}
    table = table.replace_schema_metadata(pa_metadata)
    return table
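
For reference, a minimal sketch of that metadata round-trip on a bare pyarrow table - the features dict is written out by hand here purely to show the layout that `_build_metadata` produces:

```python
import json

import pyarrow as pa
from datasets import Features

# toy table with a single string column
table = pa.table({"seq": ["MKV", "MLL"]})

# the {"huggingface": json.dumps({"info": {"features": ...}})} layout built above
metadata = {"info": {"features": {"seq": {"dtype": "string", "_type": "Value"}}}}
table = table.replace_schema_metadata({"huggingface": json.dumps(metadata)})

# reading it back is exactly what Features.from_arrow_schema (quoted below) does
stored = json.loads(table.schema.metadata[b"huggingface"].decode())
features = Features.from_dict(stored["info"]["features"])
print(features)  # a Features dict with a string Value for "seq" (repr varies by version)
```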

This also happens in `Dataset.__init__`:

inferred_features = Features.from_arrow_schema(arrow_table.schema)

    @classmethod
    def from_arrow_schema(cls, pa_schema: pa.Schema) -> "Features":
        """
        Construct [`Features`] from Arrow Schema.
        It also checks the schema metadata for Hugging Face Datasets features.
        Non-nullable fields are not supported and set to nullable.

        Also, pa.dictionary is not supported and it uses its underlying type instead.
        Therefore datasets convert DictionaryArray objects to their actual values.

        Args:
            pa_schema (`pyarrow.Schema`):
                Arrow Schema.

        Returns:
            [`Features`]
        """
        # try to load features from the arrow schema metadata
        metadata_features = Features()
        if pa_schema.metadata is not None and "huggingface".encode("utf-8") in pa_schema.metadata:
            metadata = json.loads(pa_schema.metadata["huggingface".encode("utf-8")].decode())
            if "info" in metadata and "features" in metadata["info"] and metadata["info"]["features"] is not None:
                metadata_features = Features.from_dict(metadata["info"]["features"])
        metadata_features_schema = metadata_features.arrow_schema
        obj = {
            field.name: (
                metadata_features[field.name]
                if field.name in metadata_features and metadata_features_schema.field(field.name) == field
                else generate_from_arrow_type(field.type)
            )
            for field in pa_schema
        }
        return cls(**obj)
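
One way the subclass override could look - a sketch, untested; `BioFeatures`, the `b"biodatasets"` metadata key and `feature_from_dict` are made-up names used only for illustration:

```python
import json

from datasets import Features


def feature_from_dict(feature_dict: dict):
    # placeholder: reconstruct e.g. a Structure/Sequence feature from its serialised dict
    raise NotImplementedError


class BioFeatures(Features):
    @classmethod
    def from_arrow_schema(cls, pa_schema) -> "BioFeatures":
        # let stock datasets reconstruct everything it can from the "huggingface"
        # metadata key / the raw arrow types
        features = super().from_arrow_schema(pa_schema)
        # then overlay any bio feature types stored under a separate metadata key
        if pa_schema.metadata is not None and b"biodatasets" in pa_schema.metadata:
            bio_metadata = json.loads(pa_schema.metadata[b"biodatasets"].decode())
            for name, feature_dict in bio_metadata.get("features", {}).items():
                if name in features:
                    features[name] = feature_from_dict(feature_dict)
        return features
```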

The relevant entry points in the builder flow:

- `load_dataset_builder`
- `builder_instance.as_streaming_dataset`
- `builder_instance.download_and_prepare`
- `builder_instance.as_dataset`
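
Roughly how those fit together in the `load_dataset` flow (a sketch, not the exact source; `"user/repo"` is a placeholder):

```python
from datasets import load_dataset_builder

builder = load_dataset_builder("user/repo")  # resolves the dataset module + DatasetInfo (features)

# streaming path
streaming_ds = builder.as_streaming_dataset(split="train")  # -> _as_streaming_dataset_single below

# non-streaming path
builder.download_and_prepare()          # writes arrow files, with ArrowWriter embedding the metadata
ds = builder.as_dataset(split="train")  # -> _as_dataset below, reading them back via ArrowReader
```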

    def _as_streaming_dataset_single(
        self,
        splits_generator,
    ) -> IterableDataset:
        ex_iterable = self._get_examples_iterable_for_split(splits_generator)
        # add auth to be able to access and decode audio/image files from private repositories.
        token_per_repo_id = {self.repo_id: self.token} if self.repo_id else {}
        return IterableDataset(
            ex_iterable, info=self.info, split=splits_generator.name, token_per_repo_id=token_per_repo_id
        )

    def _as_dataset(self, split: Union[ReadInstruction, Split] = Split.TRAIN, in_memory: bool = False) -> Dataset:
        """Constructs a `Dataset`.

        This is the internal implementation to overwrite called when user calls
        `as_dataset`. It should read the pre-processed datasets files and generate
        the `Dataset` object.

        Args:
            split (`datasets.Split`):
                which subset of the data to read.
            in_memory (`bool`, defaults to `False`):
                Whether to copy the data in-memory.

        Returns:
            `Dataset`
        """
        cache_dir = self._fs._strip_protocol(self._output_dir)
        dataset_name = self.dataset_name
        if self._check_legacy_cache():
            dataset_name = self.name
        dataset_kwargs = ArrowReader(cache_dir, self.info).read(
            name=dataset_name,
            instructions=split,
            split_infos=self.info.splits.values(),
            in_memory=in_memory,
        )
        fingerprint = self._get_dataset_fingerprint(split)
        return Dataset(fingerprint=fingerprint, **dataset_kwargs)

alex-hh commented Nov 2, 2024

TODO: we need `from_yaml_dict` to load the bio features -
this is how the info gets loaded from the dataset card.
Then there is possibly a second `from_arrow_schema` step - not sure exactly when that applies.

If we override `Features`, then `from_dataset_card_data` doesn't need overriding:
this fixes all of the dataset card stuff.
However, we may also need to override arrow schema saving and loading,
and we definitely need to override `Features.encode_example` and `Features.decode_example`.

The call chain:

- `DatasetInfosDict.from_dataset_card_data` gets invoked in the module factory.
- `DatasetInfosDict.from_dataset_card_data()` invokes `from_yaml_dict`.
- `DatasetInfo.from_yaml_dict` needs to load the biodatasets metadata as well as the huggingface metadata.
- `to_dataset_card_data` calls `DatasetInfo._to_yaml_dict`.
- These `DatasetInfo` methods call `Features._to_yaml_list` and `Features._from_yaml_list`.

So we probably do need to override `datasets.Features`.
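
A sketch of what the yaml side of that override might look like - untested; `BioFeatures`, the `"structure"` dtype and `build_bio_feature` are made-up names, and the `_from_yaml_list` classmethod signature is assumed from the current datasets source:

```python
from datasets import Features


def build_bio_feature(entry: dict):
    # placeholder: reconstruct e.g. a Structure feature from its dataset-card YAML entry
    raise NotImplementedError


class BioFeatures(Features):
    @classmethod
    def _from_yaml_list(cls, yaml_data: list) -> "BioFeatures":
        # split out entries carrying a bio dtype, delegate the rest to stock datasets
        bio_entries = [e for e in yaml_data if e.get("dtype") == "structure"]
        hf_entries = [e for e in yaml_data if e.get("dtype") != "structure"]
        features = super()._from_yaml_list(hf_entries)
        for entry in bio_entries:
            features[entry["name"]] = build_bio_feature(entry)
        return features

    # _to_yaml_list would need the symmetric override so that to_dataset_card_data round-trips
```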


alex-hh commented Nov 2, 2024

Basically, just modifying `_from_yaml_list`, `_to_yaml_list`, `from_arrow_schema` and `arrow_schema` should handle everything.

It's tricky to get everything working with anything other than a direct `Features` override - `Features` is hardcoded in `DatasetInfosDict`, which in turn is hardcoded in the module factory.
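
If the goal is to hook a subclass in without forking, the hardcoding still leaves the option of patching the name those modules resolve - an untested sketch; the `datasets.info` module path and its module-level `Features` import are assumptions about the current layout:

```python
import datasets.info
from datasets import Features


class BioFeatures(Features):  # the hypothetical subclass sketched above
    pass


# make DatasetInfo / DatasetInfosDict (and hence the module factory) build BioFeatures
# when they deserialise features from the dataset card
datasets.info.Features = BioFeatures
```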
