Remove need for datasets fork #6

Open
alex-hh opened this issue Oct 31, 2024 · 2 comments

alex-hh commented Oct 31, 2024

The point is that there are two cases:

  1. We are working with local arrow tables, in which case we use the arrow schema.
  2. We are loading from / pushing to the Hub or local files, in which case we need to serialise the features. What guides the serialisation, and what is the flow?

This is where the features get written into the schema metadata (`ArrowWriter._build_metadata` and `update_metadata_with_features`):
def _build_metadata(info: DatasetInfo, fingerprint: Optional[str] = None) -> Dict[str, str]:
    info_keys = ["features"]  # we can add support for more DatasetInfo keys in the future
    info_as_dict = asdict(info)
    metadata = {}
    metadata["info"] = {key: info_as_dict[key] for key in info_keys}
    if fingerprint is not None:
        metadata["fingerprint"] = fingerprint
    return {"huggingface": json.dumps(metadata)}

def update_metadata_with_features(table: Table, features: Features):
    """To be used in dataset transforms that modify the features of the dataset, in order to update the features stored in the metadata of its schema."""
    features = Features({col_name: features[col_name] for col_name in table.column_names})
    if table.schema.metadata is None or b"huggingface" not in table.schema.metadata:
        pa_metadata = ArrowWriter._build_metadata(DatasetInfo(features=features))
    else:
        metadata = json.loads(table.schema.metadata[b"huggingface"].decode())
        if "info" not in metadata:
            metadata["info"] = asdict(DatasetInfo(features=features))
        else:
            metadata["info"]["features"] = asdict(DatasetInfo(features=features))["features"]
        pa_metadata = {"huggingface": json.dumps(metadata)}
    table = table.replace_schema_metadata(pa_metadata)
    return table
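
For reference, a minimal sketch of that metadata round-trip on a bare pyarrow table - the features dict is written out by hand here purely to show the layout that `_build_metadata` produces:

```python
import json

import pyarrow as pa
from datasets import Features

# toy table with a single string column
table = pa.table({"seq": ["MKV", "MLL"]})

# the {"huggingface": json.dumps({"info": {"features": ...}})} layout built above
metadata = {"info": {"features": {"seq": {"dtype": "string", "_type": "Value"}}}}
table = table.replace_schema_metadata({"huggingface": json.dumps(metadata)})

# reading it back is exactly what Features.from_arrow_schema (quoted below) does
stored = json.loads(table.schema.metadata[b"huggingface"].decode())
features = Features.from_dict(stored["info"]["features"])
print(features)  # a Features dict with a string Value for "seq" (repr varies by version)
```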

This also happens in `Dataset.__init__`:

inferred_features = Features.from_arrow_schema(arrow_table.schema)

    @classmethod
    def from_arrow_schema(cls, pa_schema: pa.Schema) -> "Features":
        """
        Construct [`Features`] from Arrow Schema.
        It also checks the schema metadata for Hugging Face Datasets features.
        Non-nullable fields are not supported and set to nullable.

        Also, pa.dictionary is not supported and it uses its underlying type instead.
        Therefore datasets convert DictionaryArray objects to their actual values.

        Args:
            pa_schema (`pyarrow.Schema`):
                Arrow Schema.

        Returns:
            [`Features`]
        """
        # try to load features from the arrow schema metadata
        metadata_features = Features()
        if pa_schema.metadata is not None and "huggingface".encode("utf-8") in pa_schema.metadata:
            metadata = json.loads(pa_schema.metadata["huggingface".encode("utf-8")].decode())
            if "info" in metadata and "features" in metadata["info"] and metadata["info"]["features"] is not None:
                metadata_features = Features.from_dict(metadata["info"]["features"])
        metadata_features_schema = metadata_features.arrow_schema
        obj = {
            field.name: (
                metadata_features[field.name]
                if field.name in metadata_features and metadata_features_schema.field(field.name) == field
                else generate_from_arrow_type(field.type)
            )
            for field in pa_schema
        }
        return cls(**obj)
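
One way the subclass override could look - a sketch, untested; `BioFeatures`, the `b"biodatasets"` metadata key and `feature_from_dict` are made-up names used only for illustration:

```python
import json

from datasets import Features


def feature_from_dict(feature_dict: dict):
    # placeholder: reconstruct e.g. a Structure/Sequence feature from its serialised dict
    raise NotImplementedError


class BioFeatures(Features):
    @classmethod
    def from_arrow_schema(cls, pa_schema) -> "BioFeatures":
        # let stock datasets reconstruct everything it can from the "huggingface"
        # metadata key / the raw arrow types
        features = super().from_arrow_schema(pa_schema)
        # then overlay any bio feature types stored under a separate metadata key
        if pa_schema.metadata is not None and b"biodatasets" in pa_schema.metadata:
            bio_metadata = json.loads(pa_schema.metadata[b"biodatasets"].decode())
            for name, feature_dict in bio_metadata.get("features", {}).items():
                if name in features:
                    features[name] = feature_from_dict(feature_dict)
        return features
```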

The relevant entry points in the builder flow:

- `load_dataset_builder`
- `builder_instance.as_streaming_dataset`
- `builder_instance.download_and_prepare`
- `builder_instance.as_dataset`
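
Roughly how those fit together in the `load_dataset` flow (a sketch, not the exact source; `"user/repo"` is a placeholder):

```python
from datasets import load_dataset_builder

builder = load_dataset_builder("user/repo")  # resolves the dataset module + DatasetInfo (features)

# streaming path
streaming_ds = builder.as_streaming_dataset(split="train")  # -> _as_streaming_dataset_single below

# non-streaming path
builder.download_and_prepare()          # writes arrow files, with ArrowWriter embedding the metadata
ds = builder.as_dataset(split="train")  # -> _as_dataset below, reading them back via ArrowReader
```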

    def _as_streaming_dataset_single(
        self,
        splits_generator,
    ) -> IterableDataset:
        ex_iterable = self._get_examples_iterable_for_split(splits_generator)
        # add auth to be able to access and decode audio/image files from private repositories.
        token_per_repo_id = {self.repo_id: self.token} if self.repo_id else {}
        return IterableDataset(
            ex_iterable, info=self.info, split=splits_generator.name, token_per_repo_id=token_per_repo_id
        )

    def _as_dataset(self, split: Union[ReadInstruction, Split] = Split.TRAIN, in_memory: bool = False) -> Dataset:
        """Constructs a `Dataset`.

        This is the internal implementation to overwrite called when user calls
        `as_dataset`. It should read the pre-processed datasets files and generate
        the `Dataset` object.

        Args:
            split (`datasets.Split`):
                which subset of the data to read.
            in_memory (`bool`, defaults to `False`):
                Whether to copy the data in-memory.

        Returns:
            `Dataset`
        """
        cache_dir = self._fs._strip_protocol(self._output_dir)
        dataset_name = self.dataset_name
        if self._check_legacy_cache():
            dataset_name = self.name
        dataset_kwargs = ArrowReader(cache_dir, self.info).read(
            name=dataset_name,
            instructions=split,
            split_infos=self.info.splits.values(),
            in_memory=in_memory,
        )
        fingerprint = self._get_dataset_fingerprint(split)
        return Dataset(fingerprint=fingerprint, **dataset_kwargs)

alex-hh commented Nov 2, 2024

TODO: we need `from_yaml_dict` to load the bio features -
this is how the info gets loaded from the dataset card.
Then there is possibly a second `from_arrow_schema` step - not sure exactly when that applies.

If we override `Features`, then `from_dataset_card_data` doesn't need overriding:
this fixes all of the dataset card stuff.
However, we may also need to override arrow schema saving and loading,
and we definitely need to override `Features.encode_example` and `Features.decode_example`.

The call chain:

- `DatasetInfosDict.from_dataset_card_data` gets invoked in the module factory.
- `DatasetInfosDict.from_dataset_card_data()` invokes `from_yaml_dict`.
- `DatasetInfo.from_yaml_dict` needs to load the biodatasets metadata as well as the huggingface metadata.
- `to_dataset_card_data` calls `DatasetInfo._to_yaml_dict`.
- These `DatasetInfo` methods call `Features._to_yaml_list` and `Features._from_yaml_list`.

So we probably do need to override `datasets.Features`.
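
A sketch of what the yaml side of that override might look like - untested; `BioFeatures`, the `"structure"` dtype and `build_bio_feature` are made-up names, and the `_from_yaml_list` classmethod signature is assumed from the current datasets source:

```python
from datasets import Features


def build_bio_feature(entry: dict):
    # placeholder: reconstruct e.g. a Structure feature from its dataset-card YAML entry
    raise NotImplementedError


class BioFeatures(Features):
    @classmethod
    def _from_yaml_list(cls, yaml_data: list) -> "BioFeatures":
        # split out entries carrying a bio dtype, delegate the rest to stock datasets
        bio_entries = [e for e in yaml_data if e.get("dtype") == "structure"]
        hf_entries = [e for e in yaml_data if e.get("dtype") != "structure"]
        features = super()._from_yaml_list(hf_entries)
        for entry in bio_entries:
            features[entry["name"]] = build_bio_feature(entry)
        return features

    # _to_yaml_list would need the symmetric override so that to_dataset_card_data round-trips
```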


alex-hh commented Nov 2, 2024

Basically, just modifying `_from_yaml_list`, `_to_yaml_list`, `from_arrow_schema` and `arrow_schema` should handle everything.

It's tricky to get everything working with anything other than a direct `Features` override - `Features` is hardcoded in `DatasetInfosDict`, which in turn is hardcoded in the module factory.
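
If the goal is to hook a subclass in without forking, the hardcoding still leaves the option of patching the name those modules resolve - an untested sketch; the `datasets.info` module path and its module-level `Features` import are assumptions about the current layout:

```python
import datasets.info
from datasets import Features


class BioFeatures(Features):  # the hypothetical subclass sketched above
    pass


# make DatasetInfo / DatasetInfosDict (and hence the module factory) build BioFeatures
# when they deserialise features from the dataset card
datasets.info.Features = BioFeatures
```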
