forked from iterative/dvc.org
Merge branch 'master' into cmd-ref/option-defaults
Showing 26 changed files with 800 additions and 80 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
# dvc.api.get_url() | ||
|
||
Returns the URL to the storage location of a data file or directory tracked in a | ||
<abbr>DVC project</abbr>. | ||
|
||
```py | ||
def get_url(path: str, | ||
repo: str = None, | ||
rev: str = None, | ||
remote: str = None) -> str | ||
``` | ||
|
||
#### Usage: | ||
|
||
```py | ||
import dvc.api | ||
|
||
resource_url = dvc.api.get_url( | ||
'get-started/data.xml', | ||
repo='https://github.com/iterative/dataset-registry') | ||
|
||
# resource_url is now "https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355" | ||
``` | ||
|
||
## Description | ||
|
||
Returns the URL string of the storage location (in a | ||
[DVC remote](/doc/command-reference/remote)) where a target file or directory, | ||
specified by its `path` in a `repo` (<abbr>DVC project</abbr>), is stored. | ||
|
||
The URL is formed by reading the project's | ||
[remote configuration](/doc/command-reference/config#remote) and the | ||
[DVC-file](/doc/user-guide/dvc-file-format) where the given `path` is an | ||
<abbr>output</abbr>. The URL scheme returned depends on the | ||
[type](/doc/command-reference/remote/add#supported-storage-types) of the | ||
`remote` used (see the [Parameters](#parameters) section). | ||
|
||
If the target is a directory, the returned URL will end in `.dir`. Refer to | ||
[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) | ||
and `dvc add` to learn more about how DVC handles data directories. | ||
|
||
⚠️ This function does not check for the actual existence of the file or | ||
directory in the remote storage. | ||
|
||
💡 Given the resource's URL, it should be possible to download it directly with | ||
an appropriate library, such as | ||
[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.download_fileobj) | ||
or | ||
[`paramiko`](https://docs.paramiko.org/en/stable/api/sftp.html#paramiko.sftp_client.SFTPClient.get). | ||
|
||
## Parameters | ||
|
||
- **`path`** - location and file name of the file or directory in `repo`, | ||
relative to the project's root. | ||
|
||
- `repo` - specifies the location of the DVC project. It can be a URL or a file | ||
system path. Both HTTP and SSH protocols are supported for online Git repos | ||
(e.g. `[user@]server:project.git`). _Default_: The current project is used | ||
(the current working directory tree is walked up to find it). | ||
|
||
- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as | ||
a branch or tag name, or a commit hash). If `repo` is not a Git repo, this | ||
option is ignored. _Default_: `HEAD`. | ||
|
||
- `remote` - name of the [DVC remote](/doc/command-reference/remote) to use to | ||
form the returned URL string. _Default_: The | ||
[default remote](/doc/command-reference/remote/default) of `repo` is used. | ||
|
||
## Exceptions | ||
|
||
- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project. | ||
|
||
- `dvc.exceptions.NoRemoteError` - no `remote` is found. | ||
|
||
## Example: Getting the URL to a DVC-tracked file | ||
|
||
```py | ||
import dvc.api | ||
|
||
resource_url = dvc.api.get_url( | ||
'get-started/data.xml', | ||
repo='https://github.com/iterative/dataset-registry' | ||
) | ||
|
||
print(resource_url) | ||
``` | ||
|
||
The script above prints: | ||
|
||
`https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355` | ||
|
||
This URL represents the location where the data is stored, and is built by | ||
reading the corresponding DVC-file | ||
([`get-started/data.xml.dvc`](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc)) | ||
where the `md5` file hash is stored, | ||
|
||
```yaml | ||
outs: | ||
- md5: a304afb96060aad90176268345e10355 | ||
path: get-started/data.xml | ||
``` | ||
|
||
and the project configuration | ||
([`.dvc/config`](https://github.com/iterative/dataset-registry/blob/master/.dvc/config)) | ||
where the remote URL is saved: | ||
|
||
```ini | ||
['remote "storage"'] | ||
url = https://remote.dvc.org/dataset-registry | ||
``` |
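
The way the example URL lines up with the two snippets above can be sketched in a few lines. This is only an illustration of the layout visible in this example (the remote URL, then the first two characters of the `md5` hash, then the rest). The helper names `cache_style_url` and `points_to_directory` are hypothetical and are not part of `dvc.api`, and other remote types or DVC versions may lay out storage differently.

```python
def cache_style_url(remote_url: str, md5: str) -> str:
    """Join a remote base URL and an MD5 hash as <base>/<first 2 chars>/<rest>.

    Hypothetical helper: mirrors the layout seen in the example above only.
    """
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


def points_to_directory(url: str) -> bool:
    """Directory targets are described above as having URLs that end in `.dir`."""
    return url.endswith(".dir")


# Values copied from the DVC-file and .dvc/config snippets above.
url = cache_style_url(
    "https://remote.dvc.org/dataset-registry",
    "a304afb96060aad90176268345e10355",
)
print(url)
```

As the warning in the Description notes, a URL built this way says nothing about whether the object actually exists in the remote.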
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# Python API | ||
|
||
DVC can be used as a Python library: simply [install](/doc/install) it with `pip` | ||
or `conda`. This reference provides the details about the functions in the API | ||
module `dvc.api`, which can be imported in any regular way, for example: | ||
|
||
```py | ||
import dvc.api | ||
``` | ||
|
||
The purpose of this API is to provide programmatic access to the data or models | ||
[stored and versioned](/doc/use-cases/versioning-data-and-model-files) in | ||
<abbr>DVC repositories</abbr> from Python code. | ||
|
||
Please choose a function from the navigation sidebar to the left, or click the | ||
`Next` button below to jump into the first one ↘ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,200 @@ | ||
# dvc.api.open() | ||
|
||
Opens a tracked file. | ||
|
||
```py | ||
def open(path: str, | ||
repo: str = None, | ||
rev: str = None, | ||
remote: str = None, | ||
mode: str = "r", | ||
encoding: str = None) | ||
``` | ||
|
||
#### Usage: | ||
|
||
```py | ||
import dvc.api | ||
|
||
with dvc.api.open( | ||
'get-started/data.xml', | ||
repo='https://github.com/iterative/dataset-registry' | ||
) as fd: | ||
    # ... fd is a file object that can be processed normally. | ||
``` | ||
|
||
## Description | ||
|
||
Open a data or model file tracked in a <abbr>DVC project</abbr> and generate a | ||
corresponding | ||
[file object](https://docs.python.org/3/glossary.html#term-file-object). The | ||
file can be tracked by DVC or by Git. | ||
|
||
> The exact type of file object depends on the `mode` used. For more details, | ||
> please refer to Python's | ||
> [`open()`](https://docs.python.org/3/library/functions.html#open) built-in, | ||
> which is used under the hood. | ||
|
||
`dvc.api.open()` may only be used as a | ||
[context manager](https://www.python.org/dev/peps/pep-0343/#context-managers-in-the-standard-library) | ||
(using the `with` keyword, as shown in the examples). | ||
|
||
This function makes a direct connection to the | ||
[remote storage](/doc/command-reference/remote/add#supported-storage-types) | ||
(except for Google Drive), so the file contents can be streamed. Your code can | ||
process the data [buffer](https://docs.python.org/3/c-api/buffer.html) as it's | ||
streamed, which optimizes memory usage. | ||
|
||
> Use `dvc.api.read()` to load the complete file contents in a single function | ||
> call – no _context manager_ involved. Neither function uses disk space. | ||
|
||
## Parameters | ||
|
||
- **`path`** - location and file name of the file in `repo`, relative to the | ||
project's root. | ||
|
||
- `repo` - specifies the location of the DVC project. It can be a URL or a file | ||
system path. Both HTTP and SSH protocols are supported for online Git repos | ||
(e.g. `[user@]server:project.git`). _Default_: The current project is used | ||
(the current working directory tree is walked up to find it). | ||
|
||
- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as | ||
a branch or tag name, or a commit hash). If `repo` is not a Git repo, this | ||
option is ignored. _Default_: `HEAD`. | ||
|
||
- `remote` - name of the [DVC remote](/doc/command-reference/remote) to look for | ||
the target data. _Default_: The | ||
[default remote](/doc/command-reference/remote/default) of `repo` is used if a | ||
`remote` argument is not given. For local projects, the <abbr>cache</abbr> is | ||
tried before the default remote. | ||
|
||
- `mode` - specifies the mode in which the file is opened. Defaults to `"r"` | ||
(read). Mirrors the namesake parameter in the built-in | ||
[`open()`](https://docs.python.org/3/library/functions.html#open). | ||
|
||
- `encoding` - | ||
[codec](https://docs.python.org/3/library/codecs.html#standard-encodings) used | ||
to decode the file contents to a string. This should only be used in text | ||
mode. Defaults to `"utf-8"`. Mirrors the namesake parameter in the built-in | ||
`open()`. | ||
|
||
## Exceptions | ||
|
||
- `dvc.exceptions.FileMissingError` - file in `path` is missing from `repo`. | ||
|
||
- `dvc.exceptions.PathMissingError` - `path` cannot be found in `repo`. | ||
|
||
- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project. | ||
|
||
- `dvc.exceptions.NoRemoteError` - no `remote` is found. | ||
|
||
## Example: Use data or models from DVC repositories | ||
|
||
Any <abbr>data artifact</abbr> hosted online can be processed directly in your | ||
Python code with this API. For example, an XML file tracked in a public DVC repo | ||
on GitHub can be processed like this: | ||
|
||
```py | ||
from xml.sax import parse | ||
import dvc.api | ||
from mymodule import mySAXHandler | ||
|
||
with dvc.api.open( | ||
'get-started/data.xml', | ||
repo='https://github.com/iterative/dataset-registry' | ||
) as fd: | ||
parse(fd, mySAXHandler) | ||
``` | ||
|
||
Notice that we use a [SAX](http://www.saxproject.org/) XML parser here because | ||
`dvc.api.open()` is able to stream the data from | ||
[remote storage](/doc/command-reference/remote/add#supported-storage-types). | ||
(The `mySAXHandler` object should handle the event-driven parsing of the | ||
document in this case.) Streaming keeps memory usage low, and is typically | ||
faster than loading the whole file into memory first. | ||
|
||
> If you just need to load the complete file contents into memory, you can use | ||
> `dvc.api.read()` instead: | ||
> | ||
> ```py | ||
> from xml.dom.minidom import parseString | ||
> import dvc.api | ||
> | ||
> xmldata = dvc.api.read('get-started/data.xml', | ||
>                        repo='https://github.com/iterative/dataset-registry') | ||
> # read() returns the file contents as a string, so parseString() is used | ||
> xmldom = parseString(xmldata) | ||
> ``` | ||
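
`mymodule.mySAXHandler` in the streaming example above is a placeholder. As a rough sketch of what such a handler might look like, the hypothetical `ElementCounter` below counts element names as they stream past. An in-memory `io.BytesIO` with made-up XML stands in for the file object that `dvc.api.open()` would yield, so the snippet runs without network access.

```python
import io
from xml.sax import parse
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Hypothetical handler: counts start tags by name as the parser streams."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag; the whole document is never held in memory.
        self.counts[name] = self.counts.get(name, 0) + 1


# Stand-in for the file object yielded by dvc.api.open() (assumed sample data).
fd = io.BytesIO(b"<posts><row Id='1'/><row Id='2'/></posts>")
handler = ElementCounter()
parse(fd, handler)
print(handler.counts)
```

Because the handler only keeps running totals, its memory use does not grow with the size of the document, which is the point of pairing a SAX parser with a streamed file.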

## Example: Accessing private repos

This is just a matter of using the right `repo` argument, for example an SSH URL
(requires that the
[credentials are configured](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh)
locally):

```py
import dvc.api

with dvc.api.open(
    'features.dat',
    repo='git@server.com:path/to/repo.git'
) as fd:
    # ... Process 'features'
```

## Example: Use different versions of data

The `rev` argument lets you specify any Git commit to look for an artifact. This
way, any previous version or alternative experiment can be accessed
programmatically. For example, let's say your DVC repo has tagged releases of a
CSV dataset:

```py
import csv
import dvc.api

with dvc.api.open(
    'clean.csv',
    rev='v1.1.0'
) as fd:
    reader = csv.reader(fd)
    # ... Process 'clean' data from version 1.1.0
```

Also, notice that we didn't supply a `repo` argument in this example. DVC will
attempt to find a <abbr>DVC project</abbr> to use in the current working
directory tree, and look for the file contents of `clean.csv` in its local
<abbr>cache</abbr>; no download will happen if found. See the
[Parameters](#parameters) section for more info.
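
Since the `fd` above is described as a regular file object, the `csv` processing step can be written and tried out separately from DVC. The sketch below is one possible shape for that step: the helper `load_rows` and the column names are hypothetical, and an `io.StringIO` stands in for the text-mode file object that `dvc.api.open()` would yield.

```python
import csv
import io


def load_rows(fd):
    """Return each CSV row as a dict keyed by the header line."""
    return list(csv.DictReader(fd))


# Stand-in for the file object yielded by dvc.api.open() (assumed sample data).
fd = io.StringIO("id,label\n1,cat\n2,dog\n")
rows = load_rows(fd)
print(rows[0]["label"])
```

Collecting all rows into a list gives up the memory benefit of streaming; for large datasets, iterating over the `csv.DictReader` directly inside the `with` block is likely the better fit.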

## Example: Choose a specific remote as the data source

Sometimes we may want to choose the [remote](/doc/command-reference/remote) data
source, for example if the `repo` has no default remote set. This can be done by
providing a `remote` argument:

```py
import re
import dvc.api

with dvc.api.open(
    'activity.log',
    repo='location/of/dvc/project',
    remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
```
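
The line-by-line loop above works on any iterable of text lines, so the matching logic can be factored out and checked without a remote. A minimal sketch, assuming a hypothetical helper `users_in_log` and made-up log lines (an `io.StringIO` stands in for the file object from `dvc.api.open()`):

```python
import io
import re

USER_PATTERN = re.compile(r"user=(\w+)")


def users_in_log(lines):
    """Collect the distinct user names found in an iterable of log lines."""
    users = set()
    for line in lines:
        match = USER_PATTERN.search(line)
        if match:  # lines without a user= field are skipped
            users.add(match.group(1))
    return users


# Stand-in for the streamed log file (assumed sample data).
fd = io.StringIO("GET /a user=alice\nPOST /b user=bob\nGET /c\nGET /d user=alice\n")
print(sorted(users_in_log(fd)))
```

Only the set of names is retained, so the log is still consumed one line at a time.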

## Example: Specify the text encoding

To choose which codec to open a text file with, pass an `encoding` argument:

```py
import dvc.api

with dvc.api.open(
    'data/nlp/words_ru.txt',
    encoding='koi8_r'
) as fd:
    # ... Process Russian words
```
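
To see why the codec matters, the standard-library sketch below encodes a Russian word as KOI8-R bytes and decodes it again. It does not call DVC at all: `io.TextIOWrapper` over an `io.BytesIO` is used here as a stand-in for a text-mode file object, and the sample word is made up.

```python
import io

word = "слово"
raw = word.encode("koi8_r")  # bytes as they might be stored in the file

# Decoding with the matching codec recovers the original text.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="koi8_r").read()

# Decoding the same bytes as UTF-8 fails, because they are not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(text, utf8_ok)
```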