CLI to upload arbitrary huge folder #2254
Changes from 91 commits
@@ -138,3 +138,5 @@ dmypy.json

# Spell checker config
cspell.json

tmp*
@@ -103,6 +103,80 @@ set, files are uploaded at the root of the repo.

For more details about the CLI upload command, please refer to the [CLI guide](./cli#huggingface-cli-upload).

## Upload a large folder

In most cases, the [`upload_folder`] method and `huggingface-cli upload` command should be the go-to solutions to upload files to the Hub. They ensure a single commit will be made, handle a lot of use cases, and fail explicitly when something goes wrong. However, when dealing with a large amount of data, you will usually prefer a resilient process, even if it leads to more commits or requires more CPU usage. The [`upload_large_folder`] method has been implemented in that spirit:
- it is resumable: the upload process is split into many small tasks (hashing files, pre-uploading them, and committing them). Each time a task is completed, the result is cached locally in a `.cache/huggingface/` folder inside the folder you are trying to upload. Restarting the process after an interruption therefore resumes from where it left off, skipping tasks that already completed.
- it is multi-threaded: hashing large files and pre-uploading them benefits a lot from multithreading if your machine allows it.
- it is resilient to errors: a high-level retry mechanism retries each independent task indefinitely until it passes, no matter if it raises an `OSError`, `ConnectionError`, `PermissionError`, etc. This mechanism is double-edged: if transient errors happen, the process will continue and retry; if permanent errors happen (e.g. permission denied), it will retry indefinitely without solving the root cause. A minimal sketch of this kind of retry loop follows below.
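
To make the retry behavior concrete, here is a minimal, hypothetical sketch of an indefinite retry loop. It is not the actual implementation of [`upload_large_folder`]; the names `process_task` and `task` are made up for illustration.

```py
import logging
import time


def run_with_indefinite_retry(process_task, task, delay_s: float = 10.0):
    """Retry a single task until it eventually succeeds (illustrative only)."""
    while True:
        try:
            # e.g. hash a file, pre-upload it, or commit it
            return process_task(task)
        except Exception as error:  # OSError, ConnectionError, PermissionError, ...
            # Transient errors eventually pass; permanent ones (e.g. permission
            # denied) make this loop spin forever, which is the double-edged part.
            logging.warning("Task %r failed (%s); retrying in %ss", task, error, delay_s)
            time.sleep(delay_s)
```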

If you want more technical details about how `upload_large_folder` is implemented under the hood, please have a look at the [`upload_large_folder`] package reference.

Here is how to use [`upload_large_folder`] in a script. The method signature is very similar to [`upload_folder`]:

```py
>>> from huggingface_hub import HfApi
>>> api = HfApi()
>>> api.upload_large_folder(
...     repo_id="HuggingFaceM4/Docmatix",
...     repo_type="dataset",
...     folder_path="/path/to/local/docmatix",
... )
```

You will see the following output in your terminal:
```
Repo created: https://huggingface.co/datasets/HuggingFaceM4/Docmatix
Found 5 candidate files to upload
Recovering from metadata files: 100%|█████████████████████████████████████| 5/5 [00:00<00:00, 542.66it/s]

---------- 2024-07-22 17:23:17 (0:00:00) ----------
Files:   hashed 5/5 (5.0G/5.0G) | pre-uploaded: 0/5 (0.0/5.0G) | committed: 0/5 (0.0/5.0G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 5 | committing: 0 | waiting: 11
---------------------------------------------------
```

First, the repo is created if it didn't exist before. Then the local folder is scanned for files to upload and, for each file, we try to recover metadata information from a previously interrupted upload. From there, workers are launched and a status update is printed every minute. Here, we can see that the 5 files have already been hashed but not yet pre-uploaded: 5 workers are pre-uploading files while the 11 others are waiting for a task.

A command line tool is also provided. You can define the number of workers and the level of verbosity in the terminal:

```sh
huggingface-cli upload-large-folder HuggingFaceM4/Docmatix --repo-type=dataset /path/to/local/docmatix --num-workers=16
```

<Tip>

For large uploads, you have to set `repo_type="model"` or `--repo-type=model` explicitly. Usually, this information is implicit in all other `HfApi` methods. Requiring it here avoids uploading data to a repository with the wrong type, in which case you would have to re-upload everything.

</Tip>

<Tip warning={true}>

While it is much more robust for uploading large folders, `upload_large_folder` is more limited than [`upload_folder`] feature-wise. In practice:
- you cannot set a custom `path_in_repo`. If you want to upload to a subfolder, you need to set the proper structure locally.
- you cannot set a custom `commit_message` or `commit_description`, since multiple commits are created.
- you cannot delete from the repo while uploading. Please make a separate commit first.
- you cannot create a PR directly. Please create a PR first and then commit to it by passing `revision` (see the sketch after this tip).

> Why?

> Because we would have to add some logic to ensure the script is resumable. Since large uploads are meant for long-lasting processes, I don't think that manually creating PRs will be a blocker here. Worst case, the user can create a script to call

</Tip>
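
If you do need the files to land on a pull request, a possible workaround is to open the PR yourself and then target its git reference. The sketch below is an illustration only: the repo id and folder path are made up, and it assumes the returned discussion object exposes the PR's git reference (e.g. `refs/pr/1`) and that `upload_large_folder` accepts a `revision` argument as described above.

```py
from huggingface_hub import HfApi

api = HfApi()

# Open an empty pull request first...
pr = api.create_pull_request(
    repo_id="HuggingFaceM4/Docmatix",  # hypothetical target repo
    repo_type="dataset",
    title="Add Docmatix data",
)

# ...then upload the large folder directly to that PR's branch.
api.upload_large_folder(
    repo_id="HuggingFaceM4/Docmatix",
    repo_type="dataset",
    folder_path="/path/to/local/docmatix",
    revision=pr.git_reference,  # e.g. "refs/pr/1"
)
```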

### Tips and tricks for large uploads

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data, getting an upload/push to fail at the end of the process or encountering a degraded experience, be it on hf.co or when working locally, can be very annoying.

Check out our [Repository limitations and recommendations](https://huggingface.co/docs/hub/repositories-recommendations) guide for best practices on how to structure your repositories on the Hub. Next, let's move on with some practical tips to make your upload process as smooth as possible.

- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate on a script when failing takes only a little time.
- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always best to consider that something will fail at least once, no matter if it's due to your machine, your connection, or our servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you already uploaded before uploading the next batch. You are ensured that an LFS file that is already committed will never be re-uploaded, but checking it client-side can still save some time. This is what [`upload_large_folder`] does for you.
- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up uploads on machines with very high bandwidth. To use `hf_transfer`:

  1. Specify the `hf_transfer` extra when installing `huggingface_hub` (e.g. `pip install huggingface_hub[hf_transfer]`).
  2. Set `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable (see the example below).

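As a quick illustration, here is a minimal sketch of enabling `hf_transfer` from Python. It assumes the environment variable is set before `huggingface_hub` is imported so that the flag is picked up; the file path and repo id are made up.

```py
import os

# Enable hf_transfer before importing huggingface_hub so the flag is picked up.
# Requires the extra: pip install "huggingface_hub[hf_transfer]"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="/path/to/local/model.safetensors",  # hypothetical file
    path_in_repo="model.safetensors",
    repo_id="username/my-model",  # hypothetical repo
)
```
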
<Tip warning={true}>

`hf_transfer` is a power user tool! It is tested and production-ready, but it lacks user-friendly features like advanced error handling or proxies. For more details, please take a look at this [section](https://huggingface.co/docs/huggingface_hub/hf_transfer).

</Tip>

## Advanced features

In most cases, you won't need more than [`upload_file`] and [`upload_folder`] to upload your files to the Hub.

@@ -418,36 +492,6 @@ you don't store another reference to it. This is expected as we don't want to keep in memory content that is
already uploaded. Finally, we create the commit by passing all the operations to [`create_commit`]. You can pass
additional operations (add, delete or copy) that have not been processed yet and they will be handled correctly.
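
As an illustration of that last point, here is a minimal sketch of a commit mixing an add and a delete operation; the repo id and file paths are made up:

```py
from huggingface_hub import CommitOperationAdd, CommitOperationDelete, create_commit

operations = [
    # A new file that has not been pre-uploaded yet: it will be handled on the fly.
    CommitOperationAdd(path_in_repo="data/new-shard.parquet", path_or_fileobj="/path/to/new-shard.parquet"),
    # A deletion bundled into the same commit.
    CommitOperationDelete(path_in_repo="data/obsolete-shard.parquet"),
]

create_commit(
    repo_id="username/my-dataset",  # hypothetical repo
    repo_type="dataset",
    operations=operations,
    commit_message="Replace obsolete shard",
)
```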

## (legacy) Upload files with Git LFS

All the methods described above use the Hub's API to upload files. This is the recommended way to upload files to the Hub.

> For a follow-up PR: I think we could support this. The PR title could be the commit message (and then just add "Part 1" or something like that).