regular updates and fixes (late August) (2) #591

Merged: 6 commits, Aug 30, 2019
4 changes: 2 additions & 2 deletions static/docs/changelog/0.35.md
@@ -14,8 +14,8 @@ improvements) we have done in the last few months:

- 📖 The [Get Started](/doc/get-started/agenda) section has been simplified
(e.g. to use tags instead of branches) and extended. We have also prepared a
[Github DVC project ](https://github.com/iterative/example-get-started)that
reflects the sequence of steps in the “get started” guide. You can now
[Github DVC project ](https://github.com/iterative/example-get-started) that
reflects the sequence of chapters in the “get started” guide. You can now
download the whole project and reproduce all the models.

- **`dvc diff`** **command introduced**. Summary statistics for the
6 changes: 3 additions & 3 deletions static/docs/commands-reference/checkout.md
@@ -235,6 +235,6 @@ $ md5 model.pkl
MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
```

Previously this took two steps, `git checkout` followed by `dvc checkout`, but
we have skipped having to remember to run that second step. Instead it is
automatically executed for us, and the workspace is automatically synchronized.
Previously this took two steps, `git checkout` followed by `dvc checkout`. We
can now skip the second one, which is automatically executed for us. The
workspace is automatically synchronized accordingly.
36 changes: 20 additions & 16 deletions static/docs/commands-reference/commit.md
@@ -1,7 +1,8 @@
# commit

Record changes to the repository by updating
[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to cache.
[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to
<abbr>cache</abbr>.

## Synopsis

@@ -19,8 +20,9 @@ positional arguments:
The `dvc commit` command is useful for several scenarios where a dataset is
being changed: when a [stage](/doc/commands-reference/run) or
[pipeline](/doc/commands-reference/pipeline) is in development, when one wishes
to run commands outside the control of DVC, or to force DVC-file updates to save
time tying stages or a pipeline.
to run commands outside the control of DVC, or to force
[DVC-file](/doc/user-guide/dvc-file-format) updates to save time tying stages or
a pipeline.

- Code or data for a stage is under active development, with rapid iteration of
code, configuration, or data. Run DVC commands (`dvc run`, `dvc repro`, and
@@ -46,8 +48,9 @@ DVC-files and save data to cache. They are still useful, but keep in mind that
DVC can't guarantee reproducibility in those cases – You commit any data you
want. Let's take a look at what is happening in the first scenario closely:

Normally DVC commands like `dvc add`, `dvc repro` or `dvc run`, commit the data
to the DVC cache as the last step. What _commit_ means is that DVC:
Normally DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data
to the <abbr>DVC cache</abbr> after creating a DVC-file. What _commit_ means is
that DVC:

- Computes a checksum for the file/directory
- Enters the checksum and file name into the DVC-file
@@ -56,13 +59,13 @@ to the DVC cache as the last step. What _commit_ means is that DVC:
(`dvc init --no-scm`), this does not happen.)
- Adds the file/directory to the DVC cache

There are many cases where the last step is not desirable (usually, rapid
iteration on some experiment). For the DVC commands where available, the
`--no-commit` option prevents the last step from occurring, thus we are saving
time and space by not storing all the <abbr>data artifacts</abbr> for every
command attempt. The checksum is still computed and added to the DVC-file, but
the file is not added to the cache. That's where the `dvc commit` command comes
into play. It handles that last step of adding the file to the DVC cache.
There are many cases where the last step is not desirable (for example rapid
iterations on an experiment). The `--no-commit` option prevents the last step
from occurring (on the commands where it's available), saving time and space by
not storing unwanted <abbr>data artifacts</abbr>. Checksums are still computed
and added to the DVC-file, but the actual data file is not saved in the DVC
cache. This is where the `dvc commit` command comes into play. It performs that
last step: storing the file in the DVC cache.
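
The steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DVC's actual implementation: the function names (`file_md5`, `save_output`, `commit`), the dictionary standing in for a DVC-file entry, and the two-character prefix layout of the cache directory are all assumptions made for this example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fobj:
        for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def commit(path: Path, checksum: str, cache_dir: Path) -> Path:
    """The deferred last step: store the file in the cache under its checksum."""
    # Hypothetical layout: <cache>/<first 2 hex chars>/<remaining chars>
    target = cache_dir / checksum[:2] / checksum[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copyfile(path, target)
    return target


def save_output(path: Path, cache_dir: Path, no_commit: bool = False) -> dict:
    """Compute the checksum and record it; skip the cache when no_commit is set."""
    checksum = file_md5(path)
    entry = {"md5": checksum, "path": path.name}  # stand-in for a DVC-file entry
    if not no_commit:
        commit(path, checksum, cache_dir)
    return entry


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        cache = workspace / "cache"
        model = workspace / "model.pkl"
        model.write_bytes(b"weights v1")

        # Like `--no-commit`: the checksum is recorded, nothing is cached yet.
        entry = save_output(model, cache, no_commit=True)
        print(entry["path"], "cache exists:", cache.exists())

        # Like `dvc commit`: perform the skipped step later.
        stored = commit(model, entry["md5"], cache)
        print("stored in cache:", stored.exists())
```

Splitting `commit` out of `save_output` mirrors the text: the checksum is always recorded, while copying into the cache is a separate step that can be deferred until the results are worth keeping.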

## Options

@@ -128,8 +131,8 @@ $ dvc pull --all-branches --all-tags

Sometimes we want to iterate through multiple changes to configuration, code, or
data, trying multiple options to improve the output of a stage. To avoid filling
the DVC cache with undesired intermediate results, we can run a single stage
with `dvc run --no-commit`, or reproduce an entire pipeline using
the <abbr>DVC cache</abbr> with undesired intermediate results, we can run a
single stage with `dvc run --no-commit`, or reproduce an entire pipeline using
`dvc repro --no-commit`. This prevents data from being pushed to cache. When
development of the stage is finished, `dvc commit` can be used to store data
files in the DVC cache.
@@ -219,7 +222,8 @@ that the new instance of `model.pkl` is in the cache.

It is also possible to execute the commands that are executed by `dvc repro` by
hand. You won't have DVC helping you, but you have the freedom to run any script
you like, even ones not recorded in a DVC-file. For example:
you like, even ones not recorded in a
[DVC-file](/doc/user-guide/dvc-file-format). For example:

```dvc
$ python src/featurization.py data/prepared data/features
@@ -228,7 +232,7 @@ $ python src/evaluate.py model.pkl data/features auc.metric
```

As before, `dvc status` will show which files have changed, and when your
work is finalized `dvc commit` will commit everything to the cache.
work is finalized `dvc commit` will commit everything to the <abbr>cache</abbr>.

## Example: Updating dependencies

7 changes: 4 additions & 3 deletions static/docs/commands-reference/get-url.md
@@ -1,7 +1,8 @@
# get-url

Download or copy file or directory from any supported URL (for example `s3://`,
`ssh://`, and other protocols) or local directory to the local file system.
Download or copy a file or directory from any supported URL (for example
`s3://`, `ssh://`, and other protocols) or local directory to the local file
system.

> Unlike `dvc import-url`, this command does not track the downloaded data files
> (does not create a DVC-file).
@@ -18,7 +19,7 @@ positional arguments:

## Description

In some cases it's convenient to get a data file or directory from a remote
In some cases it's convenient to get a <abbr>data artifact</abbr> from a remote
location into the current working directory, regardless of whether it's a DVC
project. The `dvc get-url` command helps the user do just that.

7 changes: 7 additions & 0 deletions static/docs/commands-reference/get.md
@@ -37,6 +37,13 @@ created in the current working directory, with its original file name.

## Options

- `-o`, `--out` - specify a path (directory and file name) to the desired
location to place the downloaded data in. The default value (when this option
isn't used) is the current working directory (`.`) and original file name.

- `--rev` - specific Git revision of the DVC repository to download the data
from. `HEAD` by default.

- `-h`, `--help` - prints the usage/help message, and exits.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
11 changes: 6 additions & 5 deletions static/docs/commands-reference/import-url.md
@@ -1,11 +1,12 @@
# import-url

Download or copy file or directory from any supported URL (for example `s3://`,
`ssh://`, and other protocols) or local directory to the <abbr>workspace</abbr>,
and track changes in the remote data source with DVC. Creates a DVC-file.
Download or copy a file or directory from any supported URL (for example
`s3://`, `ssh://`, and other protocols) or local directory to the
<abbr>workspace</abbr>, and track changes in the remote data source with DVC.
Creates a DVC-file.

> See also `dvc get-url` which corresponds to the first step this command
> performs (just download the data).
> See also `dvc get-url` which corresponds to the first half of what this
> command does (downloading the <abbr>data artifact</abbr>).
## Synopsis

6 changes: 3 additions & 3 deletions static/docs/commands-reference/import.md
@@ -57,9 +57,9 @@ downloaded data artifact from the external DVC repo.

## Options

- `-o`, `--out` - specify a location in the workspace to place the imported data
in, as a path to the desired directory. The default value (when this option
isn't used) is the current working directory (`.`).
- `-o`, `--out` - specify a path (directory and file name) to the desired
location to place the imported data in. The default value (when this option
isn't used) is the current working directory (`.`) and original file name.

- `--rev` - specific Git revision of the DVC repository to import the data from.
`HEAD` by default.
21 changes: 11 additions & 10 deletions static/docs/commands-reference/move.md
@@ -48,20 +48,20 @@ outs:
path: data.csv
```
If we move this using the regular `mv data.csv other.csv` the DVC-file would not
know that we changed the `path` of `data.csv` to `other.csv`.
If we move this using the regular `mv data.csv other.csv` command, DVC wouldn't
know that we changed the `path` of `data.csv` to `other.csv`, as the old
location is still registered in the corresponding DVC-file.

`dvc move` adjusts the content of the DVC-file to update `path`. So that saves
some manual and programming steps.

To illustrate, notice that `path` value has changed, as well as the DVC-file
name:
`dvc move` adjusts the content of the DVC-file to update `path`. This saves
users from performing several manual operations:

```dvc
$ dvc move data.csv other.csv
$ cat other.csv.dvc
```

Notice that `path` value has changed, as well as the DVC-file name.

And here is the updated content of the `other.csv.dvc`:

```yaml
@@ -102,9 +102,10 @@ $ tree
```

Here we use `dvc add` to put a file under DVC control. Then we use `dvc move` to
change its location. Note that the `data.csv.dvc` DVC-file is also moved. If
target path already exists and is a directory, data file is moved with unchanged
name into this folder.
change its location. Note that the `data.csv.dvc`
[DVC-file](/doc/user-guide/dvc-file-format) is also moved. If the target path
already exists and is a directory, the data file is moved into that directory
with its name unchanged.
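
The behavior described here can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not how `dvc move` is implemented: the helper names `resolve_target` and `move_tracked` are hypothetical, and the DVC-file is edited as plain text rather than parsed as YAML.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_target(source: Path, target: Path) -> Path:
    """If target is an existing directory, keep the source file name inside it."""
    return target / source.name if target.is_dir() else target


def move_tracked(source: Path, target: Path) -> Path:
    """Move a data file together with its `<name>.dvc` file, updating `path`."""
    target = resolve_target(source, target)
    old_dvc = source.with_name(source.name + ".dvc")
    new_dvc = target.with_name(target.name + ".dvc")

    # Rewrite only the `path:` entry; every other line is copied unchanged.
    lines = []
    for line in old_dvc.read_text().splitlines():
        if line.strip() == f"path: {source.name}":
            line = line.replace(source.name, target.name)
        lines.append(line)

    shutil.move(str(source), str(target))
    new_dvc.write_text("\n".join(lines) + "\n")
    old_dvc.unlink()
    return new_dvc


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        (ws / "data.csv").write_text("a,b\n")
        (ws / "data.csv.dvc").write_text(
            "outs:\n- cache: true\n  md5: abc\n  path: data.csv\n"
        )
        print(move_tracked(ws / "data.csv", ws / "other.csv").name)
```

Because the DVC-file travels with the data file, the `path` value only needs to reflect the new file name, which is why a plain `mv` leaves the two out of sync.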

```dvc
$ tree
4 changes: 2 additions & 2 deletions static/docs/get-started/configure.md
@@ -27,8 +27,8 @@ $ dvc remote add -d myremote /tmp/dvc-storage
$ git commit .dvc/config -m "Configure local remote"
```

> We only use a local remote in this guide for simplicity's sake in following
> these basic steps as you are learning to use DVC. We realize that for most
> We only use a local remote in this guide for simplicity's sake when following
> these steps, as you are learning to use DVC. For most
> [use cases](/doc/use-cases), other "more remote" types of remotes will be
> required.
21 changes: 10 additions & 11 deletions static/docs/get-started/example-versioning.md
@@ -336,16 +336,15 @@ changed.

This is where the DVC pipelines feature comes in handy, as this is what it was
designed for. We touched on it briefly when we described `dvc run` and
`dvc repro` at the very end.
The next step here would be splitting the script into two steps and utilizing
DVC pipelines. See this [example](/doc/get-started/example-pipeline) to get a
hands-on experience with them and try to apply it here. Don't hesitate to join
our [community](/chat) to ask any questions!

Another thing, you should have noticed, is the metrics file (`metrics.json`) and
the way we captured it with `-M metrics.json` option. Metric file is a special
type of output DVC provides an interface on top to compare across tags or
The next step here would be splitting the script into two parts, and utilizing
DVC [pipelines](/doc/commands-reference/pipeline). See
[this example](/doc/get-started/example-pipeline) to get a hands-on experience
with pipelines and try to apply it here. Don't hesitate to join our
[community](/chat) to ask any questions!

Another detail we only touched on here is the way we captured the `metrics.json`
metrics file with the `-M` option of `dvc run`. Metric files are a special type
of output DVC provides an interface for, in order to compare across Git tags or
branches. See `dvc metrics` command and
[Compare Experiments](/doc/get-started/compare-experiments) to learn more about
managing metrics. Next step you should try on your own is converting both
iterations we had into `dvc run` and then utilize `dvc metrics show` to compare
them.
managing metrics with DVC.
12 changes: 6 additions & 6 deletions static/docs/get-started/index.md
@@ -1,12 +1,12 @@
# Get Started

_Get Started_ is a step by step introduction into basic DVC concepts. It doesn't
_Get Started_ is a step-by-step introduction to basic DVC concepts. It doesn't
go into details much, but provides links and expandable sections to learn more.

At the very end there are a few complete step-by-step examples to give you more
hands-on experience with real life scenarios. The first one is about model and
dataset [versioning](/doc/get-started/example-versioning), and the second one is
focused on [pipelines and reproducibility](/doc/get-started/example-pipeline).
At the very end there are a few complete examples to give you more hands-on
experience with real life scenarios. The first one is about model and dataset
[versioning](/doc/get-started/example-versioning), and the second one is focused
on [pipelines and reproducibility](/doc/get-started/example-pipeline).

✅ Please, join our [community](/chat) or see these [support](/support) options
if you have any questions or need any help. We are very responsive ⚡.
@@ -18,5 +18,5 @@ us a ⭐ if you like the project!
[on Patreon](https://www.patreon.com/DVCorg/overview) to support the project.

Separate to this section, the longer [Tutorial](/doc/tutorial) also introduces
DVC step-by-step while additionally explaining in great detail the motivation
DVC step-by-step, while additionally explaining in great detail the motivation
and what's happening internally.
5 changes: 3 additions & 2 deletions static/docs/get-started/reproduce.md
@@ -20,8 +20,9 @@ $ dvc repro train.dvc

> If you've just followed the previous chapters, the command above will have
> nothing to reproduce since you've already run all the pipeline stages. To
> easily try this command, you can clone this example
> [Github project](https://github.com/iterative/example-get-started) first.
> easily try this command, clone this example
> [Github project](https://github.com/iterative/example-get-started) and run it
> from there.
`train.dvc` file internally describes what data files and code we should take
and how to run the command to get the binary model file. For each data file it
22 changes: 5 additions & 17 deletions static/docs/get-started/retrieve-data.md
@@ -1,34 +1,22 @@
# Retrieve Data

> Make sure that the steps described in the
> [initialization](/doc/get-started/initialize) and
> [configuration](/doc/get-started/configure) chapters are completed before you
> run the `dvc pull` command in a newly cloned or initialized Git repository.
> You'll need to complete the [initialization](/doc/get-started/initialize) and
> [configuration](/doc/get-started/configure) chapters before being able to run
> the commands explained here.
To retrieve data files into the <abbr>workspace</abbr> in your local machine,
run:

```dvc
$ rm -f data/data.xml
$ dvc pull
```

> If you've followed previous chapters of this section, try deleting
> `data/data.xml` before running the command above, otherwise DVC won't find a
> need to [checkout](/doc/commands-reference/checkout) the file, since it's
> already in your workspace.
This command retrieves data files that are referenced in all
[DVC-files](/doc/user-guide/dvc-file-format) in the <abbr>project</abbr>. So,
you usually run it after `git clone`, `git pull`, or `git checkout`.

As an easy way to test it:

```dvc
$ rm -f data/data.xml
$ dvc pull
```

Alternatively, if you want to retrieve a single dataset or a file:
Alternatively, if you want to retrieve a single dataset or a file you can use:

```dvc
$ dvc pull data/data.xml.dvc