Merge branch 'master' into cmd-ref/option-defaults
jorgeorpinel committed Mar 11, 2020
2 parents d6aed90 + efc377a commit aa6b125
Showing 26 changed files with 800 additions and 80 deletions.
2 changes: 1 addition & 1 deletion package.json
@@ -8,7 +8,7 @@
  "debug": "node --inspect-brk server.js",
  "build": "next build",
  "test": "jest",
- "start": "NODE_ENV=production node server.js",
+ "start": "./scripts/clear-cloudflare-cache.js; NODE_ENV=production node server.js",
  "format-staged": "pretty-quick --staged --no-restage --bail",
  "format-check": "prettier --check '{.,pages/**,public/static/docs/**,src/**}/*.{js,md,json}'",
  "lint-check": "eslint --ext .json,.js src pages",
110 changes: 110 additions & 0 deletions public/static/docs/api-reference/get_url.md
@@ -0,0 +1,110 @@
# dvc.api.get_url()

Returns the URL to the storage location of a data file or directory tracked in a
<abbr>DVC project</abbr>.

```py
def get_url(path: str,
            repo: str = None,
            rev: str = None,
            remote: str = None) -> str
```

#### Usage:

```py
import dvc.api

resource_url = dvc.api.get_url(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry')

# resource_url is now "https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355"
```

## Description

Returns the URL string of the storage location (in a
[DVC remote](/doc/command-reference/remote)) where a target file or directory,
specified by its `path` in a `repo` (<abbr>DVC project</abbr>), is stored.

The URL is formed by reading the project's
[remote configuration](/doc/command-reference/config#remote) and the
[DVC-file](/doc/user-guide/dvc-file-format) where the given `path` is an
<abbr>output</abbr>. The URL scheme returned depends on the
[type](/doc/command-reference/remote/add#supported-storage-types) of the
`remote` used (see the [Parameters](#parameters) section).

If the target is a directory, the returned URL will end in `.dir`. Refer to
[Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
and `dvc add` to learn more about how DVC handles data directories.

⚠️ This function does not check for the actual existence of the file or
directory in the remote storage.

💡 Having the resource's URL, it should be possible to download it directly with
an appropriate library, such as
[`boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Object.download_fileobj)
or
[`paramiko`](https://docs.paramiko.org/en/stable/api/sftp.html#paramiko.sftp_client.SFTPClient.get).
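
As a rough illustration of the tip above, the sketch below streams a resource
from a URL into a local file using only Python's standard library. It assumes
the remote is served over plain HTTP(S) and is publicly readable, like the
example URL in the Usage section; other remote types (S3, SSH, etc.) need their
own client libraries. The `download()` helper and its parameters are
hypothetical and not part of `dvc.api`.

```python
import urllib.request


def download(url: str, dest: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream the resource at `url` into the local file `dest`.

    Hypothetical helper (not part of dvc.api). Returns the bytes written.
    """
    written = 0
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        while True:
            # Read a bounded chunk so the whole file is never held in memory.
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

With the URL from the usage example, a call such as
`download(resource_url, 'data.xml')` would be expected to save the file locally,
provided the file actually exists in the remote.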

## Parameters

- **`path`** - location and file name of the file or directory in `repo`,
relative to the project's root.

- `repo` - specifies the location of the DVC project. It can be a URL or a file
system path. Both HTTP and SSH protocols are supported for online Git repos
(e.g. `[user@]server:project.git`). _Default_: The current project is used
(the current working directory tree is walked up to find it).

- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as
a branch or tag name, or a commit hash). If `repo` is not a Git repo, this
option is ignored. _Default_: `HEAD`.

- `remote` - name of the [DVC remote](/doc/command-reference/remote) to use to
form the returned URL string. _Default_: The
[default remote](/doc/command-reference/remote/default) of `repo` is used.
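
The default for `repo` can be pictured with a small sketch that walks up from a
starting directory until it finds a project root. This is not DVC's actual
implementation: it assumes a project root is marked by a `.dvc` directory, and
`find_project_root()` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Optional[Path]:
    """Return the closest ancestor of `start` (or `start` itself) that
    contains a `.dvc` directory, or None if there is none.

    Hypothetical helper; assumes `.dvc` marks the root of a project.
    """
    start = start.resolve()
    # Check the starting directory first, then each parent up to the root.
    for candidate in (start, *start.parents):
        if (candidate / ".dvc").is_dir():
            return candidate
    return None
```

Calling `find_project_root(Path.cwd())` returns `None` when no enclosing
directory contains `.dvc`.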

## Exceptions

- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project.

- `dvc.exceptions.NoRemoteError` - no `remote` is found.

## Example: Getting the URL to a DVC-tracked file

```py
import dvc.api

resource_url = dvc.api.get_url(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
)

print(resource_url)
```

The script above prints

`https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355`

This URL represents the location where the data is stored, and is built by
reading the corresponding DVC-file
([`get-started/data.xml.dvc`](https://github.com/iterative/dataset-registry/blob/master/get-started/data.xml.dvc))
where the `md5` file hash is stored,

```yaml
outs:
  - md5: a304afb96060aad90176268345e10355
    path: get-started/data.xml
```

and the project configuration
([`.dvc/config`](https://github.com/iterative/dataset-registry/blob/master/.dvc/config))
where the remote URL is saved:

```ini
['remote "storage"']
url = https://remote.dvc.org/dataset-registry
```
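
To make the composition concrete, here is a minimal sketch that joins the two
values shown above into the printed URL. The path layout (first two hash
characters as a directory, the rest as the file name) is inferred from this
example's output only; `build_storage_url()` is a hypothetical helper and not
how `dvc.api.get_url()` is necessarily implemented.

```python
def build_storage_url(remote_url: str, md5: str) -> str:
    """Join a remote base URL and an md5 hash into a storage URL.

    Hypothetical helper; layout inferred from the example output.
    """
    # The first two hex characters become a directory, the rest the file name.
    return f"{remote_url.rstrip('/')}/{md5[:2]}/{md5[2:]}"


url = build_storage_url(
    "https://remote.dvc.org/dataset-registry",  # url from .dvc/config
    "a304afb96060aad90176268345e10355",         # md5 from data.xml.dvc
)
print(url)
```

This prints the same URL as the script above.
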
16 changes: 16 additions & 0 deletions public/static/docs/api-reference/index.md
@@ -0,0 +1,16 @@
# Python API

DVC can be used as a Python library: simply [install](/doc/install) it with
`pip` or `conda`. This reference provides the details about the functions in the
API module `dvc.api`, which can be imported in any regular way, for example:

```py
import dvc.api
```

The purpose of this API is to provide programmatic access to the data or models
[stored and versioned](/doc/use-cases/versioning-data-and-model-files) in
<abbr>DVC repositories</abbr> from Python code.

Please choose a function from the navigation sidebar to the left, or click the
`Next` button below to jump into the first one ↘
200 changes: 200 additions & 0 deletions public/static/docs/api-reference/open.md
@@ -0,0 +1,200 @@
# dvc.api.open()

Opens a tracked file.

```py
def open(path: str,
         repo: str = None,
         rev: str = None,
         remote: str = None,
         mode: str = "r",
         encoding: str = None)
```

#### Usage:

```py
import dvc.api

with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    # ... fd is a file object that can be processed normally.
    pass
```

## Description

Open a data or model file tracked in a <abbr>DVC project</abbr> and generate a
corresponding
[file object](https://docs.python.org/3/glossary.html#term-file-object). The
file can be tracked by DVC or by Git.

> The exact type of file object depends on the `mode` used. For more details,
> please refer to Python's
> [`open()`](https://docs.python.org/3/library/functions.html#open) built-in,
> which is used under the hood.

`dvc.api.open()` may only be used as a
[context manager](https://www.python.org/dev/peps/pep-0343/#context-managers-in-the-standard-library)
(using the `with` keyword, as shown in the examples).

This function makes a direct connection to the
[remote storage](/doc/command-reference/remote/add#supported-storage-types)
(except for Google Drive), so the file contents can be streamed. Your code can
process the data [buffer](https://docs.python.org/3/c-api/buffer.html) as it's
streamed, which optimizes memory usage.
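
The memory benefit of streaming can be sketched with a function that consumes
a file object in fixed-size chunks. In this sketch an in-memory `io.BytesIO`
stands in for the object that `dvc.api.open()` would yield in binary mode
(assuming `mode="rb"` behaves as in the built-in `open()`); `md5_of_stream()` is
a hypothetical helper.

```python
import hashlib
import io


def md5_of_stream(fd, chunk_size: int = 64 * 1024) -> str:
    """Hash a binary file object without loading it fully into memory."""
    digest = hashlib.md5()
    # iter() with a sentinel calls fd.read(chunk_size) until it returns b"".
    for chunk in iter(lambda: fd.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for the file object of: with dvc.api.open(..., mode="rb") as fd
stream = io.BytesIO(b"<root>" + b"<row/>" * 10_000 + b"</root>")
checksum = md5_of_stream(stream, chunk_size=4096)
```

Only one chunk is held in memory at a time, regardless of the total size of the
file.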

> Use `dvc.api.read()` to load the complete file contents in a single function
> call – no _context manager_ involved. Neither function uses disk space.

## Parameters

- **`path`** - location and file name of the file in `repo`, relative to the
project's root.

- `repo` - specifies the location of the DVC project. It can be a URL or a file
system path. Both HTTP and SSH protocols are supported for online Git repos
(e.g. `[user@]server:project.git`). _Default_: The current project is used
(the current working directory tree is walked up to find it).

- `rev` - Git commit (any [revision](https://git-scm.com/docs/revisions) such as
a branch or tag name, or a commit hash). If `repo` is not a Git repo, this
option is ignored. _Default_: `HEAD`.

- `remote` - name of the [DVC remote](/doc/command-reference/remote) to look for
the target data. _Default_: The
[default remote](/doc/command-reference/remote/default) of `repo` is used. For
local projects, the <abbr>cache</abbr> is tried before the default remote.

- `mode` - specifies the mode in which the file is opened. Defaults to `"r"`
(read). Mirrors the namesake parameter in builtin
[`open()`](https://docs.python.org/3/library/functions.html#open).

- `encoding` -
[codec](https://docs.python.org/3/library/codecs.html#standard-encodings) used
to decode the file contents to a string. This should only be used in text
mode. Defaults to `"utf-8"`. Mirrors the namesake parameter in builtin
`open()`.

## Exceptions

- `dvc.exceptions.FileMissingError` - file in `path` is missing from `repo`.

- `dvc.exceptions.PathMissingError` - `path` cannot be found in `repo`.

- `dvc.api.UrlNotDvcRepoError` - `repo` is not a DVC project.

- `dvc.exceptions.NoRemoteError` - no `remote` is found.

## Example: Use data or models from DVC repositories

Any <abbr>data artifact</abbr> hosted online can be processed directly in your
Python code with this API. For example, an XML file tracked in a public DVC repo
on GitHub can be processed like this:

```py
from xml.sax import parse
import dvc.api
from mymodule import mySAXHandler

with dvc.api.open(
    'get-started/data.xml',
    repo='https://github.com/iterative/dataset-registry'
) as fd:
    parse(fd, mySAXHandler)
```

Notice that we use a [SAX](http://www.saxproject.org/) XML parser here because
`dvc.api.open()` is able to stream the data from
[remote storage](/doc/command-reference/remote/add#supported-storage-types).
(The `mySAXHandler` object should handle the event-driven parsing of the
document in this case.) Streaming keeps memory usage low, and is typically
faster than first loading the whole file into memory.
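
For readers unfamiliar with SAX, the following is one possible shape for such a
handler: a minimal sketch that counts elements by tag name as they stream past.
The `ElementCounter` class and the sample XML are assumptions made for
illustration, and an in-memory `io.BytesIO` stands in for the `fd` object
yielded by `dvc.api.open()`.

```python
import collections
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical SAX handler that counts elements by tag name."""

    def __init__(self):
        super().__init__()
        self.counts = collections.Counter()

    def startElement(self, name, attrs):
        # Called once for every opening tag as the parser advances.
        self.counts[name] += 1


handler = ElementCounter()
# Stand-in for the streamed file object; the sample content is made up.
sample = io.BytesIO(b"<posts><row Id='1'/><row Id='2'/></posts>")
xml.sax.parse(sample, handler)
print(handler.counts["row"])
```

Note that `xml.sax.parse()` expects a handler instance, so a class like this
one has to be instantiated before it is passed in.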

> If you just need to load the complete file contents into memory, you can use
> `dvc.api.read()` instead:
>
> ```py
> from xml.dom.minidom import parseString
> import dvc.api
>
> xmldata = dvc.api.read('get-started/data.xml',
>                        repo='https://github.com/iterative/dataset-registry')
> # read() returns the file contents, so parse them as a string
> xmldom = parseString(xmldata)
> ```

## Example: Accessing private repos

This is just a matter of using the right `repo` argument, for example an SSH URL
(requires that the
[credentials are configured](https://help.github.com/en/github/authenticating-to-github/connecting-to-github-with-ssh)
locally):

```py
import dvc.api

with dvc.api.open(
    'features.dat',
    repo='git@server.com:path/to/repo.git'
) as fd:
    # ... Process 'features'
    pass
```

## Example: Use different versions of data

The `rev` argument lets you specify any Git commit to look for an artifact. This
way any previous version, or alternative experiment can be accessed
programmatically. For example, let's say your DVC repo has tagged releases of a
CSV dataset:

```py
import csv
import dvc.api

with dvc.api.open(
    'clean.csv',
    rev='v1.1.0'
) as fd:
    reader = csv.reader(fd)
    # ... Process 'clean' data from version 1.1.0
```

Also, notice that we didn't supply a `repo` argument in this example. DVC will
attempt to find a <abbr>DVC project</abbr> to use in the current working
directory tree, and look for the file contents of `clean.csv` in its local
<abbr>cache</abbr>; no download will happen if found. See the
[Parameters](#parameters) section for more info.

## Example: Choose a specific remote as the data source

Sometimes we may want to choose the [remote](/doc/command-reference/remote) data
source, for example if the `repo` has no default remote set. This can be done by
providing a `remote` argument:

```py
import re
import dvc.api

with dvc.api.open(
    'activity.log',
    repo='location/of/dvc/project',
    remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
```
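
The "process users activity log" step could look roughly like the sketch below,
which tallies how often each user appears. The log format is an assumption
(only the `user=` field comes from the example above), `count_users()` is a
hypothetical helper, and an `io.StringIO` stands in for the `fd` object.

```python
import collections
import io
import re

USER_PATTERN = re.compile(r"user=(\w+)")


def count_users(lines):
    """Count occurrences of each user name in an iterable of log lines."""
    counts = collections.Counter()
    for line in lines:
        match = USER_PATTERN.search(line)
        if match:  # lines without a user= field are skipped
            counts[match.group(1)] += 1
    return counts


# Stand-in for the streamed log file; the contents are made up.
log = io.StringIO(
    "ts=1 user=ana action=login\n"
    "ts=2 action=ping\n"
    "ts=3 user=ana action=logout\n"
    "ts=4 user=bo action=login\n"
)
users = count_users(log)
```

Because the function only iterates over lines, it works the same way on a file
object that is being streamed.
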

## Example: Specify the text encoding

To choose which codec to open a text file with, pass an `encoding` argument:

```py
import dvc.api

with dvc.api.open(
    'data/nlp/words_ru.txt',
    encoding='koi8_r') as fd:
    # ... Process Russian words
    pass
```
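
Since `mode` and `encoding` mirror the parameters of the built-in `open()`,
their effect can be tried out locally without any DVC project. The sketch below
uses only the standard library and a temporary file with made-up contents; it
shows what the built-in does and is not a statement about DVC internals.

```python
import os
import tempfile

text = "привет мир"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "words_ru.txt")
    # Store the text as KOI8-R bytes, one byte per character.
    with open(path, "wb") as out:
        out.write(text.encode("koi8_r"))

    # Text mode with the matching codec yields the original string.
    with open(path, mode="r", encoding="koi8_r") as fd:
        decoded = fd.read()

    # Binary mode yields the raw bytes, with no decoding applied.
    with open(path, mode="rb") as fd:
        raw = fd.read()
```

Opening the same file in text mode with a different codec would either raise a
decoding error or produce different characters, which is why the codec has to
match the one the file was written with.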