Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update go-license-tool #553

Merged
merged 3 commits into from
Jan 7, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
138 changes: 118 additions & 20 deletions py/kubeflow/testing/go-license-tools/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,41 +6,139 @@ When we release go library images (can be considered as redistributing third
party binary).

We need to do the following to be compliant:
* Put license declarations in the image for licences of all dependencies and transistive dependencies.
* Put license declarations in the image for licenses of all dependencies and transitive dependencies.
* Mirror source code in the image for code with MPL, EPL, GPL or CDDL licenses.

It's not an easy task to get license of all (transitive) dependencies of a go
library. Thus, we need these tools to automate this task.

## How to get all dependencies with license and source code?

1. Install CLI tools here: `python setup.py install`
1. Collect dependencies + transitive dependencies in a go library. Put them together in a text file called `dep.txt`. Format: each line has a library name. The library name should be a valid golang import module name.
### I. Setup
Download this folder to your local folder namely `<license_tool>` and install it:
```
$ python <license_tool>/setup.py install
```

Example ways to get it:
* argo uses gopkg for package management. It has a [Gopkg.lock file](https://github.com/argoproj/argo/blob/master/Gopkg.lock)
with all of its dependencies and transitive dependencies. All the name fields in this file is what we need. You can run `parse-toml-dep` to parse it.
* minio uses [official go modules](https://blog.golang.org/using-go-modules), there's a [go.mod file](https://github.com/minio/minio/blob/master/go.mod) describing its direct dependencies. Run command `go list -m all` to get final versions that will be used in a build for all direct and indirect dependencies, [reference](https://github.com/golang/go/wiki/Modules#daily-workflow). Parse its output to make a file we need.
### II. Get all dependency repositories
1. Collect dependencies and transitive dependencies in a Go library into a text file called `dep.txt`, where each line is a valid golang import module name. For example
```
......
cloud.google.com/go
github.com/BurntSushi/toml
github.com/beorn7/perks
github.com/bmatcuk/doublestar
......
```

Reminder: don't forget to put the library itself into `dep.txt`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we keep this reminder?
For redistributing 3rd party images, we need the library itself's license.
For distributing first party images, I think it's better we deliver our own license too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added to line 43

1. Run `get-github-repo` to resolve github repos of golang imports. Not all
imports can be figured out by my script, needs manual help for <2% of libraries.
Typical ways to get it:
* `gopkg` for package management. `gopkg` has a [Gopkg.lock file](https://github.com/argoproj/argo/blob/master/Gopkg.lock)
with all of its dependencies and transitive dependencies. All the name fields in this file is what we need. You can run `parse-toml-dep.py` to parse it.
* [Official go modules](https://blog.golang.org/using-go-modules) has a [go.mod file](https://github.com/minio/minio/blob/master/go.mod) describing its direct dependencies. Run command

For a library we cannot resolve, manually put it in `dep-repo-mapping.manual.csv`, so the tool knows how to find its github repo the next time.
```$ go list -m all | cut -d ' ' -f 1 > dep.txt```

Defaults to read dependencies from `dep.txt` and writes to `repo.txt`.
1. Run `get-github-license-info` to crawl github license info of these libraries. (Not all repos have github recognizable license, needs manual help for <2% of libraries)
to get final versions that will be used in a build for all direct and indirect dependencies, ([reference](https://github.com/golang/go/wiki/Modules#daily-workflow)).

Defaults to read repos from `repo.txt` and writes to `license-info.csv`. You
need to configure github personal access token because it sends a lot of
requests to github. Follow instructions in `get-github-license-info -h`.
**Reminder:** don't forget to put your library itself into `dep.txt`.
2. Run `$ python <licence_tool>/get-github-repo.py` to resolve github repositories of golang imports. Not all imports can be figured out by my script, needs manual help for <2% of libraries. For example, you may see an output like this:
```
......
Successfully resolved github repo for 89 dependencies and saved to repo.txt. Failed to resolve 3 dependencies.
We failed to resolve the following dependencies:
gomodules.xyz/jsonpatch/v2
honnef.co/go/tools
ml_metadata
```

For repos that fails to fetch license, it's usually because their github repo
For a library we cannot resolve, manually put it in `dep-repo-mapping.manual.csv`, so the tool knows how to find its github repo in the future. For example, the corresponding `dep-repo-mapping.manual.csv` for the example above is
```
gomodules.xyz/jsonpatch/v2,gomodules/jsonpatch
honnef.co/go/tools,dominikh/go-tools
ml_metadata,google/ml-metadata
```
3. Rerun the command in this to resolve all repositories this time:
```
$ python <license_tool>/get_github_repo.py

......
Successfully resolved github repo for 92 dependencies and saved to repo.txt. Failed to resolve 0 dependencies.
```

### III. Get all license URLs and types

1. Crawl github license info of these libraries via the following command to produce the `license_info.csv` file. (Not all repositories have github recognizable license, needs manual help for <2% of libraries)
```
$ python <license_tool>/third_party/cli/get_github_license_info.py --github-api-token-file=<github_token_file>
......
Fetching license for google/ml-metadata
Fetching license for kubernetes-sigs/controller-runtime
Fetching license for kubernetes-sigs/testing_frameworks
Fetching license for kubernetes-sigs/yaml
Fetched github license info, 91 succeeded, 0 failed.
```
You have to create a `<github_token_file>` in order to access Github repositories, because it sends a lot of requests to github. Follow instructions in `get-github-license-info -h`.

For repositories that fails to fetch license, it's usually because their github repo
doesn't have a github understandable license file. Check its readme and
update correct info into `license-info.csv`. (Usually, use its README file which mentions license.)
1. Edit license info file. Manually check the license file for all repos with a license categorized as "Other" by github. Figure out their true license names.
1. Run `concatenate-license` to crawl full text license files for all dependencies and concat them into one file.

2. Fill in missing license information. If you open `license_info.csv`, you can see some fields are marked as `Other`. We have to update them to the right license types. First we need to grep all these unknown license URLs:
```
$ cat license_info.csv | grep Other | cut -d ',' -f 2

GoogleCloudPlatform/gcloud-golang,https://github.com/googleapis/google-cloud-go/blob/master/LICENSE,Other,https://raw.githubusercontent.com/googleapis/google-cloud-go/master/LICENSE
ghodss/yaml,https://github.com/ghodss/yaml/blob/master/LICENSE,Other,https://raw.githubusercontent.com/ghodss/yaml/master/LICENSE
gogo/protobuf,https://github.com/gogo/protobuf/blob/master/LICENSE,Other,https://raw.githubusercontent.com/gogo/protobuf/master/LICENSE
......
```

Now we can open these license all at once in Chrome via a plugin called [OpenList](https://chrome.google.com/webstore/detail/openlist/nkpjembldfckmdchbdiclhfedcngbgnl?hl=en).

After checking the license content one by one, we can now create `additional_license_info.csv` to record the right license types. The content of `additional_license_info.csv` looks like this:
```
https://github.com/googleapis/google-cloud-go/blob/master/LICENSE,Apache License 2.0
https://github.com/ghodss/yaml/blob/master/LICENSE,MIT
https://github.com/gogo/protobuf/blob/master/LICENSE,BSD 3-Clause "New" or "Revised" License
......
```

Finally, we can patch the additional license types in `additional_license_info.csv` on `license_info.csv` to get the final list of licenses with types.

```
$ python patch_additional_license_info.py
```


3. Run `concatenate-license` to crawl full text license files for all dependencies and concat them into one file.

Defaults to read license info from `license-info.csv`. Writes to `license.txt`.
Put `license.txt` to `third_party/library/license.txt` where it is read when building docker images.
1. Manually update a list of dependencies that requires source code, put it into `third_party/library/repo-MPL.txt`.
4. Manually update a list of dependencies that requires source code, put it into `third_party/library/repo-MPL.txt`.

## Add CI tests for license information.
It is considered as best practice to continuously test whether the right licence information is presented in the `license.txt` file for every new commit in your code repository. So that it is always safe to deliver a new image from the source code.

For examples, you can add the following tests into your CI pipeline.

1. Check if `dep.txt` is updated and force the license information to be updated in the same PR.

Suppose your repository uses standard Go module and `dep.txt` is checked in. The test Shell script can simply be
```
go list -m all | cut -d ' ' -f 1 > /tmp/generated_dep.txt

if ! diff /tmp/generated_dep.txt dep.txt; then
echo "Please update the license file for changed dependencies."
exit 1
fi
```

2. Check if the final `license.txt` is up-to-date. The test Shell script can be
```
python3 concatenate_license.py --output=/tmp/generated_license.txt

if ! diff /tmp/generated_license.txt license.txt; then
echo "Please regenerate third_party/license.txt."
exit 1
fi
```
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
parser = argparse.ArgumentParser(
description='Generate dependencies json from license.csv file.'
)
parser.add_argument(
'license_info_file',
nargs='?',
default='license_info.csv',
help='CSV file with license info fetched from github using '
'get-github-license-info CLI tool. (default: %(default)s)',
)
parser.add_argument(
'additional_license_info_file',
nargs='?',
default='additional_license_info.csv',
help='CSV file with license info. Each line is in the form '
'<license_url>,<license_type>. (default: %(default)s)',
)
args = parser.parse_args()


def main():
mapping = {}
with open(args.additional_license_info_file, 'r') as f:
for line in f:
parts = line.strip().split(',')
assert len(parts) == 2
[license_url, license_type] = parts
mapping[license_url] = license_type

newlines = []
with open(args.license_info_file, 'r') as f:
for line in f:
parts = line.strip().split(',')
_, license_url, license_type, *_ = parts
if license_type == 'Other':
if not license_url in mapping:
raise ValueError(
'Unknown license type: '
'please add the right license type for {} in file {}'
.format(license_url, args.additional_license_info_file)
)
parts[2] = mapping[license_url]
print(
'Update license {} to type {}'.format(
license_url, mapping[license_url]
)
)
newlines.append(','.join(parts))

with open(args.license_info_file, 'w') as f:
for line in newlines:
print(line, file=f)


if __name__ == "__main__":
main()