init: add subdir description #1022

Merged
merged 4 commits into from
Mar 9, 2020
11 changes: 8 additions & 3 deletions public/static/docs/command-reference/config.md
@@ -62,14 +62,14 @@ file (in `.dvc/config` by default), and they support the options below:

This is the main section with the general config options:

- `core.loglevel` - log level that the `dvc` command should use. Possible values
are: `info`, `debug`, `warning`, `error`.
- `core.loglevel` - log level that the `dvc` command should use. Accepts values
`info`, `debug`, `warning`, or `error`.

- `core.remote` - name of the remote storage that should be used by default.

- `core.interactive` - whether to always ask for confirmation before reproducing
each [stage](/doc/command-reference/run) in `dvc repro`. (Normally, this
behavior requires the use of option `-i` in that command.) Accepts values
behavior requires the use of option `-i` in that command.) Accepts values:
`true` and `false`.

- `core.analytics` - used to turn off
@@ -85,6 +85,11 @@ This is the main section with the general config options:
project is on a file system that doesn't properly support file locking (e.g.
[NFS v3 and older](http://nfs.sourceforge.net/)).

- `core.no_scm` - tells DVC to not expect or integrate with Git (even if the
<abbr>project</abbr> is initialized inside a Git repo). Accepts values `true`
and `false` (default). Set with the `--no-scm` option of `dvc init`
([more details](/doc/command-reference/init#initializing-dvc-without-git)).

### remote

These are sections in the config file that describe particular remotes. These
186 changes: 171 additions & 15 deletions public/static/docs/command-reference/init.md
@@ -1,49 +1,178 @@
# init

This command initializes a <abbr>DVC project</abbr> on a directory.

Note that by default the current working directory is expected to contain a Git
repository, unless the `--no-scm` option is used.
Initialize a <abbr>DVC project</abbr> in the current working directory.

## Synopsis

```usage
usage: dvc init [-h] [-q | -v] [--no-scm] [-f]
usage: dvc init [-h] [-q | -v] [--no-scm] [-f] [--subdir]
```

## Description

DVC works on top of a Git repository by default. This enables all features,
providing the most value. It means that `dvc init` (without flags) expects to
run in a Git repository root (a `.git/` directory should be present).

The command options can be used to start an alternative workflow for advanced
scenarios like monorepos, automation, etc:

- [Initializing DVC in subdirectories](#initializing-dvc-in-subdirectories) -
support for monorepos, nested <abbr>DVC projects</abbr>, etc.
- [Initializing DVC without Git](#initializing-dvc-without-git) - support for
SCM other than Git, deployment automation cases, etc.

After DVC initialization, a new directory `.dvc/` will be created with the
`config` and `.gitignore` files. These and other files and directories are
hidden from the user, as typically there's no need to interact with them directly.
See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to
learn more.

`.dvc/cache` is one of the most important
[DVC directories](/doc/user-guide/dvc-files-and-directories). It will hold all
the contents of tracked data files. Note that `.dvc/.gitignore` lists this
directory, which means that the cache directory is not tracked by Git. This is a
local cache and you cannot `git push` it.
### Initializing DVC in subdirectories

`--subdir` must be provided to initialize DVC in a subdirectory of a Git
repository. DVC still expects to find the Git repository (it checks all
directories up to the root to find `.git`). This option does not affect any
config files; the `.dvc` directory is created the same way as in the default
mode. This way, multiple DVC projects (including nested ones) can be initialized
in a single Git repository, providing isolation and granular project management.

#### When is this useful?

This option is mostly used in the scenario of a
[monorepo](https://en.wikipedia.org/wiki/Monorepo), but it can also be used in
other workflows when such isolation and/or finer granularity is needed.

Let's imagine we have an existing Git repository that is split into sub-projects
(monorepo). In this case `dvc init --subdir` can be run in one or many
sub-projects to mitigate the issues of initializing in the Git repository root:

- Repository maintainers might not allow an extra top-level `.dvc` directory,
especially if DVC is being used by a small number of sub-projects.

- Not enough isolation/granularity - DVC config, cache, and other files are
shared across different sub-projects. This means that it's not easy to use, for
example, different remote storages for different sub-projects.

- Not enough isolation/granularity - commands like `dvc pull`, `dvc checkout`,
and others analyze the whole repository to look for
[DVC-files](/doc/user-guide/dvc-file-format) to download files and
directories, to reproduce <abbr>pipelines</abbr>, etc. This can be expensive in
large repositories with a lot of projects.

- Not enough isolation/granularity - commands like `dvc metrics diff`,
`dvc pipeline show` and others by default dump all the metrics, all the
pipelines, etc.

#### How does it affect DVC commands?

No matter what mode is used, DVC looks for the `.dvc` directory when it starts
(from the current working directory and up). The location of the found `.dvc`
directory determines the root of the DVC project. (With `--subdir`, the Git
repository root may be located at a different path than the DVC project root.)
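
The upward lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not DVC's actual implementation; `find_upwards` is a hypothetical helper name introduced here.

```python
from pathlib import Path
from typing import Optional


def find_upwards(start: Path, marker: str) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: it mimics the "current working directory and up"
    search described in the text, nothing more.
    """
    start = start.resolve()
    # Check `start` itself first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None
```

Calling `find_upwards(cwd, ".dvc")` and `find_upwards(cwd, ".git")` from the same working directory may return two different paths, which is exactly the `--subdir` situation where the DVC project root and the Git repository root differ.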

The DVC project root defines the scope for most DVC commands, mostly meaning
that all DVC-files under the root path are analyzed.

If there are multiple DVC sub-projects but they _are not_ nested, e.g.:

```sh
.
├── .git
│
├── project-A
│   └── .dvc
│ ...
├── project-B
│   └── .dvc
│ ...
```

## Options
DVC considers them two separate DVC projects. Any DVC command run in
`project-A` is not aware of the DVC `project-B`. DVC does not consider the Git
repository root an initialized DVC project in this case, and commands that
require a DVC project will raise an error when run there.

On the other hand, if there _are_ nested DVC projects, e.g.:

```sh
project-A
├── .dvc
├── data-A.dvc
│ ...
└── project-B
    ├── .dvc
    ├── data-B.dvc
    │ ...
```

Nothing changes for `project-B`. But any DVC command run in `project-A`
ignores the whole `project-B/` directory, meaning for example:

```dvc
$ cd project-A
$ dvc pull
```

won't download or check out data for the `data-B.dvc` file.
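
As a rough illustration of this scoping rule, the Python sketch below collects DVC-files under a project root while skipping any nested directory that has its own `.dvc/`. It assumes DVC-files are simply the files ending in `.dvc`; `collect_dvc_files` is a hypothetical helper and does not reflect how DVC itself walks the tree.

```python
import os
from pathlib import Path
from typing import List


def collect_dvc_files(project_root: Path) -> List[Path]:
    """List `*.dvc` files under `project_root`, skipping nested DVC projects."""
    found: List[Path] = []
    for dirpath, dirnames, filenames in os.walk(project_root):
        current = Path(dirpath)
        # Prune in place so os.walk does not descend into the project's own
        # `.dvc/` directory or into subdirectories that contain a `.dvc/`
        # (those are treated as separate, nested projects).
        dirnames[:] = [
            d for d in dirnames
            if d != ".dvc" and not (current / d / ".dvc").is_dir()
        ]
        found.extend(current / f for f in filenames if f.endswith(".dvc"))
    return sorted(found)
```

With the nested layout shown earlier, scanning `project-A` would yield only `data-A.dvc`, while scanning `project-A/project-B` would yield only `data-B.dvc`.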

### Initializing DVC without Git

In rare cases, the `--no-scm` option can be used to initialize DVC in a
directory that is not part of a Git repository, or to make DVC ignore Git.
Examples include:

- An SCM other than Git is being used. Even though some DVC features require
DVC to be run in a Git repo, DVC can work well with other version control
systems. Since DVC relies on simple text
[DVC-files](/doc/user-guide/dvc-file-format) to manage <abbr>pipelines</abbr>,
data, etc., they can be added to any SCM, thus providing versioning for large
data files and directories.

- There is no need to keep the history at all, e.g. in deployment automation
such as running a data pipeline using `cron`.

In this mode DVC features that depend on Git being present are not available -
e.g. managing `.gitignore` files on `dvc add` or `dvc run` to avoid committing
DVC-tracked files into Git, or `dvc diff` and `dvc metrics diff` that accept
Git-revisions to compare, etc.

- `--no-scm` - skip Git specific initialization, `.dvc/.gitignore` will not be
written.
DVC sets the `core.no_scm` option value to `true` in the DVC
[config](/doc/command-reference/config) when it is initialized this way. This
means that even if the project was already Git-tracked, or Git is initialized
in it later, DVC keeps operating detached from Git.

## Options

- `-f`, `--force` - remove `.dvc/` if it exists before initialization. Will
remove any existing local cache. Useful when a previous `dvc init` has been
corrupted.

- `--subdir` - initialize the DVC project in the current working directory,
_even if it's not the Git repository root_. (If run in a project root, this
option is ignored.) It affects how other DVC commands behave afterwards;
please see
[Initializing DVC in subdirectories](#initializing-dvc-in-subdirectories) for
more details.

- `--no-scm` - initialize the DVC project detached from Git. It means that DVC
doesn't try to find or use Git in the directory it's initialized in. Certain
DVC features are not available in this mode; please see
[Initializing DVC without Git](#initializing-dvc-without-git) for more
details.

- `-h`, `--help` - prints the usage/help message, and exits.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.

## Examples
## Examples: Most common initialization workflow

Create a new <abbr>DVC repository</abbr> (requires Git):
Create a new <abbr>DVC repository</abbr> (must be run in the Git repository
root):

```dvc
$ mkdir example && cd example
@@ -67,3 +196,30 @@ $ cat .dvc/.gitignore
...
/cache
```

## Examples: Initializing DVC in a subdirectory

Create a new <abbr>DVC repository</abbr> in a subdirectory of a Git repository:

```dvc
$ mkdir repo && cd repo

$ git init
$ mkdir project-a && cd project-a

$ dvc init --subdir
```

In this case, the Git repository is inside the `repo` directory, while the
<abbr>DVC repository</abbr> is inside `repo/project-a`.

```dvc
$ tree repo -a
repo
├── .git
.
.
.
└── project-a
    └── .dvc
```