FileNotFoundError when reading from Hugging Face #554

@lhoestq

Description

Hi, I'm Quentin from HF :) I wanted to play with datachain after #375 by @dberenbaum, but I'm getting this error:

from datachain import DataChain

DataChain.from_csv("hf://datasets/infinite-dataset-hub/MobilePlanAssistant/data.csv").show()
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-2-1e396698d13d> in <cell line: 3>()
      1 from datachain import DataChain
      2 
----> 3 DataChain.from_csv("hf://datasets/infinite-dataset-hub/MobilePlanAssistant/data.csv").show()

5 frames
/usr/local/lib/python3.10/dist-packages/datachain/lib/dc.py in from_csv(cls, path, delimiter, header, output, object_name, model_name, source, nrows, session, settings, column_types, **kwargs)
   1860             convert_options=convert_options,
   1861         )
-> 1862         return chain.parse_tabular(
   1863             output=output,
   1864             object_name=object_name,

/usr/local/lib/python3.10/dist-packages/datachain/lib/dc.py in parse_tabular(self, output, object_name, model_name, source, nrows, **kwargs)
   1743         if col_names or not output:
   1744             try:
-> 1745                 schema = infer_schema(self, **kwargs)
   1746                 output = schema_to_output(schema, col_names)
   1747             except ValueError as e:

/usr/local/lib/python3.10/dist-packages/datachain/lib/arrow.py in infer_schema(chain, **kwargs)
    112     schemas = []
    113     for file in chain.collect("file"):
--> 114         ds = dataset(file.get_path(), filesystem=file.get_fs(), **kwargs)  # type: ignore[union-attr]
    115         schemas.append(ds.schema)
    116     return pa.unify_schemas(schemas)

/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
    792 
    793     if _is_path_like(source):
--> 794         return _filesystem_dataset(source, **kwargs)
    795     elif isinstance(source, (tuple, list)):
    796         if all(_is_path_like(elem) or isinstance(elem, FileInfo) for elem in source):

/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
    474             fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
    475     else:
--> 476         fs, paths_or_selector = _ensure_single_source(source, filesystem)
    477 
    478     options = FileSystemFactoryOptions(

/usr/local/lib/python3.10/dist-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem)
    439         paths_or_selector = [path]
    440     else:
--> 441         raise FileNotFoundError(path)
    442 
    443     return filesystem, paths_or_selector

FileNotFoundError: /infinite-dataset-hub/MobilePlanAssistant/data.csv

It looks like _ensure_single_source incorrectly uses a LocalFileSystem instead of the HfFileSystem.
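One more observation that may help narrow this down: the path in the error message has lost the leading `datasets` component as well as the `hf://` scheme. A minimal standard-library sketch shows that generic URL splitting produces exactly that path, because it treats `datasets` as the host/bucket part. Whether datachain or pyarrow actually splits the URL this way is an assumption on my side, not something I verified in the code:

```python
from urllib.parse import urlsplit

# Both strings are copied from this report.
url = "hf://datasets/infinite-dataset-hub/MobilePlanAssistant/data.csv"
error_path = "/infinite-dataset-hub/MobilePlanAssistant/data.csv"

# Generic URL parsing: scheme, then "netloc" (host or bucket), then path.
parts = urlsplit(url)

print(parts.scheme)              # hf
print(parts.netloc)              # datasets
print(parts.path)                # /infinite-dataset-hub/MobilePlanAssistant/data.csv
print(parts.path == error_path)  # True
```

If something along these lines is happening, the lookup would fail for a path without the `datasets/` prefix no matter which filesystem ends up handling it, so it might be worth checking both the filesystem and the path that reach `_ensure_single_source`.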

The same path works from pandas via fsspec:

>>> import pandas as pd
>>> df = pd.read_csv("hf://datasets/infinite-dataset-hub/MobilePlanAssistant/data.csv")
>>> df.head()
   idx                                         user_input  \
0    0                 Hi, I'm looking for a mobile plan.   
1    1   I need unlimited data and international calling.   
2    2            I want at least 10GB of data per month.   
3    3  That's too expensive, do you have anything che...   
4    4    I'm allergic to cats, will this affect my plan?   

                                        bot_response            labels  
0  Hello! I'd be happy to help you find the best ...          Greeting  
1  Great, do you have a preferred data limit and ...      Data Inquiry  
2  I found a plan with unlimited data and interna...   Plan Suggestion  
3  I found another plan with 8GB of data and inte...  Price Comparison  
4  I'm sorry, but my abilities are focused on mob...  Unexpected Topic 

Version Info

datachain 0.6.3
Python 3.10.12

Labels: bug (Something isn't working)
