Commit 5826429

Merge branch 'develop' into add-license-detection
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
2 parents 6a91773 + aba3112 commit 5826429

87 files changed: +6614 -3021 lines changed

CHANGELOG.rst

Lines changed: 126 additions & 89 deletions
@@ -5,16 +5,12 @@ Changelog
 31.0.0 (next, roadmap)
 -----------------------

-Important API changes:
-~~~~~~~~~~~~~~~~~~~~~~~~
+This is a major release with important bug and security fixes, new and improved
+features and API changes.

-- Adopted the new skeleton from https://github.com/nexB/skeleton
-The key change is the location of the virtual environment. It used to be
-created at the root of the scancode-toolkit directory. It is now created
-under the ``venv`` subdirectory.

-- The main package API function `get_package_infos` is deprecated, and
-replaced by `get_package_data`.
+Important API changes:
+~~~~~~~~~~~~~~~~~~~~~~~~

 - The data structure of the JSON output has changed for copyrights, authors
 and holders. We now use a proper name for attributes and not a generic "value".
@@ -31,14 +27,14 @@ Important API changes:
 rather than "packages". This has all the data attributes of a "package_data"
 field plus others: "package_uuid", "package_data_files" and "files".

-- There is a a new top-level "packages" attribute that contains package
-instances that can be aggregating data from multiple manifests.
+- There is a new top-level "packages" attribute that contains package
+instances that can be aggregating data from multiple manifests.

-- There is a a new top-level "dependencies" attribute that contains each dependency
-instance, these can be standalone or releated to a package.
+- There is a new top-level "dependencies" attribute that contains each
+dependency instance, these can be standalone or related to a package.

-- There is a new resource-level attribute "for_packages" which refers to packages
-through package_uuids (pURL + uuid string).
+- There is a new resource-level attribute "for_packages" which refers to
+packages through package_uuids (pURL + uuid string).

 - The data structure for HTML output has been changed to include emails and
 urls under the "infos" object. The HTML template displays output for holders,
@@ -48,12 +44,18 @@ Important API changes:
 column to "path". "copyright_holder" has been renamed to "holder"

 - The license clarity scoring plugin has been overhauled to show new license
-clarity criteria. More details of the new criteria are provided below.
+clarity criteria. More details of the new scoring criteria are provided below.
+
+- The functionality of the summary plugin has been improved to provide declared
+origin and license information for the codebase being scanned. The previous
+summary plugin functionality has been preserved in the new ``tallies`` plugin.
+More details are provided below.

-- The functionality of the summary plugin has been changed to provide declared
-origin information for the codebase being scanned. The previous summary plugin
-functionality has been preserved in the new ``tallies`` plugin. More details
-are provided below.
+- ScanCode has adopted the new code skeleton from https://github.com/nexB/skeleton
+The key change is the location of the virtual environment. It used to be
+created at the root of the scancode-toolkit directory. It is now created
+under the ``venv`` subdirectory. You must be aware of this if you use ScanCode
+from a git clone.


 Copyright detection:
@@ -76,7 +78,7 @@ License detection:
 - XXXX new license detection rules have been added, and
 - XXXX existing license rules have been updated.
 - XXXX existing false positive license rules have been removed (see below).
-- The SPDX license list has been updated to the latest v3.15
+- The SPDX license list has been updated to the latest v3.16

 - The rule attribute "only_known_words" has been renamed to "is_continuous" and its
 meaning has been updated and expanded. A rule tagged as "is_continuous" can only
@@ -85,10 +87,10 @@ License detection:
 The processing for "is_continuous" has been merged in "key phrases" processing
 below.

-- Key phrases can now be defined in RULEs by surrounding one or more words with
-`{{` and `}}`. When defined a RULE will only match when the key phrases match
-exactly. When all the text of rule is a "key phrase", this is the same as being
-"is_continuous".
+- Key phrases can now be defined in a RULE text by surrounding one or more words
+with double curly braces `{{` and `}}`. When defined, a RULE will only match
+when the key phrases match exactly. When all the text of a rule is a "key phrase",
+this is the same as being "is_continuous".

 - The "--unknown-licenses" option now also detects unknown licenses using a
 simple and effective ngrams-based matching in areas that are not matched or
@@ -135,6 +137,7 @@ License detection:
 tagged and they may not be detected unless you activate this new indexing
 feature.

+
 Package detection:
 ~~~~~~~~~~~~~~~~~~

@@ -172,77 +175,84 @@ Package detection:
 License Clarity Scoring Update
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-- We are moving away from the license clarity scoring defined by ClearlyDefined
-in the license clarity score plugin. The previous license clarity scoring
-logic produced a score that was misleading when it would return a low score
-due to the stringent scoring criteria. We are now
-using more general criteria to get a sense of what provenance information has
-been provided and whether or not there is a conflict in licensing between
-what licenses were declared at the top-level key files and what licenses have
-been detected in the files under the top-level.
+- We are moving away from the original license clarity scoring designed for
+ClearlyDefined in the license clarity score plugin. The previous license
+clarity scoring logic produced a score that was misleading when it would
+return a low score due to the stringent scoring criteria. We are now using
+more general criteria to get a sense of what provenance information has been
+provided and whether or not there is a conflict in licensing between what
+licenses were declared at the top-level key files and what licenses have been
+detected in the files under the top-level.

-- The license clarity score is a value from 0-100 calculated by combining the
-weighted values determined for each of the scoring elements:
+- The license clarity score is a value from 0-100 calculated by combining the
+weighted values determined for each of the scoring elements:

-- Declared license:
+- Declared license:

-- When true, indicates that the software package licensing is documented at
-top-level or well-known locations in the software project, typically in a
-package manifest, NOTICE, LICENSE, COPYING or README file.
-- Scoring Weight = 40
+- When true, indicates that the software package licensing is documented at
+top-level or well-known locations in the software project, typically in a
+package manifest, NOTICE, LICENSE, COPYING or README file.
+- Scoring Weight = 40

-- Identification precision:
+- Identification precision:

-- Indicates how well the license statement(s) of the software identify known
-licenses that can be designated by precise keys (identifiers) as provided in
-a publicly available license list, such as the ScanCode LicenseDB, the SPDX
-license list, the OSI license list, or a URL pointing to a specific license
-text in a project or organization website.
-- Scoring Weight = 40
+- Indicates how well the license statement(s) of the software identify known
+licenses that can be designated by precise keys (identifiers) as provided in
+a publicly available license list, such as the ScanCode LicenseDB, the SPDX
+license list, the OSI license list, or a URL pointing to a specific license
+text in a project or organization website.
+- Scoring Weight = 40

-- License texts:
+- License texts:

-- License texts are provided to support the declared license expression in
-files such as a package manifest, NOTICE, LICENSE, COPYING or README.
-- Scoring Weight = 10
+- License texts are provided to support the declared license expression in
+files such as a package manifest, NOTICE, LICENSE, COPYING or README.
+- Scoring Weight = 10

-- Declared copyright:
+- Declared copyright:

-- When true, indicates that the software package copyright is documented at
-top-level or well-known locations in the software project, typically in a
-package manifest, NOTICE, LICENSE, COPYING or README file.
-- Scoring Weight = 10
+- When true, indicates that the software package copyright is documented at
+top-level or well-known locations in the software project, typically in a
+package manifest, NOTICE, LICENSE, COPYING or README file.
+- Scoring Weight = 10

-- Ambiguous compound licensing:
+- Ambiguous compound licensing:

-- When true, indicates that the software has a license declaration that
-makes it difficult to construct a reliable license expression, such as in
-the case of multiple licenses where the conjunctive versus disjunctive
-relationship is not well defined.
-- Scoring Weight = -10
+- When true, indicates that the software has a license declaration that
+makes it difficult to construct a reliable license expression, such as in
+the case of multiple licenses where the conjunctive versus disjunctive
+relationship is not well defined.
+- Scoring Weight = -10

-- Conflicting license categories:
+- Conflicting license categories:

-- When true, indicates that the declared license expression of the software is in
-the permissive category, but that other potentially conflicting categories,
-such as copyleft and proprietary, have been detected in lower level code.
-- Scoring Weight = -20
+- When true, indicates that the declared license expression of the software
+is in the permissive category, but that other potentially conflicting
+categories, such as copyleft and proprietary, have been detected in lower
+level code.
+- Scoring Weight = -20


 Summary Plugin Update
 ~~~~~~~~~~~~~~~~~~~~~
-The summary plugin's behavior has been changed. Previously, it provided a count
-of the detected license expressions, copyrights, holders, authors, and
-programming languages from a scan. We have preserved this functionality by
-creating a new plugin called ``tallies``. All functionality of the previous
-summary plugin have been preserved in the tallies plugin.

-The plugin now attempts to determine a declared license expression, holder, and
-primary programming language from a scan. The license clarity score provides
-context on what origin information is provided from key files. It also returns
-lists of tallies of the other detected license expressions, holders, and
-programming languages. All information is provided in the codebase level
-attribute named ``summary``.
+- The summary plugin's behavior has been changed. Previously, it provided a
+count of the detected license expressions, copyrights, holders, authors, and
+programming languages from a scan.
+
+We have preserved this functionality by creating a new plugin called ``tallies``.
+All functionality of the previous summary plugin has been preserved in the
+tallies plugin.
+
+- The new summary plugin now attempts to determine a declared license expression,
+declared holder, and the primary programming language from a scan. And the
+updated license clarity score provides context on the quality of the license
+information provided in the codebase key files.
+
+- The new summary plugin also returns lists of tallies for the other "secondary"
+detected license expressions, copyright holders, and programming languages.
+
+All summary information is provided in the codebase-level attribute named ``summary``.


 Outputs:
@@ -258,15 +268,36 @@ Outputs:
 Output version
 --------------

-Scancode Data Output Version is now 3.0.0.
+Scancode Data Output Version is now 2.0.0.
+

 Changes:

-- rename resource level attribute `packages` to `package_data`.
-- add top-level attribute `packages`.
-- add top-level attribute `dependencies`.
-- add resource-level attribute `for_packages`.
-- remove `package-data` attribute `root_path`.
+- Rename resource level attribute `packages` to `package_data`.
+- Add top-level attribute `packages`.
+- Add top-level attribute `dependencies`.
+- Add resource-level attribute `for_packages`.
+- Remove `package-data` attribute `root_path`.
+- The fields of the license clarity scoring plugin have been replaced with the
+following fields. An overview of the new fields can be found in the "License
+Clarity Scoring Update" section above.
+- `score`
+- `declared_license`
+- `identification_precision`
+- `has_license_text`
+- `declared_copyrights`
+- `conflicting_license_categories`
+- `ambigious_compound_licensing`
+- The fields of the summary plugin have been replaced with the following fields.
+An overview of the new fields can be found in the "Summary Plugin Update"
+section above.
+- `declared_license_expression`
+- `license_clarity_score`
+- `declared_holder`
+- `primary_language`
+- `other_license_expressions`
+- `other_holders`
+- `other_languages`


 Documentation Update
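
Reviewer note: the output changes listed in the hunk above can be illustrated with a small sketch. It assumes a scan result shaped like the changelog describes: a top-level ``files`` list whose entries carry a ``path`` and the new resource-level ``for_packages`` list of package_uuids. The sample data and the helper name ``files_by_package`` are made up for illustration and are not part of ScanCode.

```python
import json
from collections import defaultdict


def files_by_package(scan):
    """Group resource paths by the package_uuids listed in "for_packages".

    Hypothetical helper: it only relies on the attribute names mentioned in
    the changelog ("files", "path", "for_packages").
    """
    grouped = defaultdict(list)
    for resource in scan.get("files", []):
        for package_uuid in resource.get("for_packages", []):
            grouped[package_uuid].append(resource["path"])
    return dict(grouped)


# Made-up sample data; the package_uuid value is illustrative only.
sample = json.loads("""
{
  "packages": [{"package_uuid": "pkg:pypi/example@1.0?uuid=0000"}],
  "dependencies": [],
  "files": [
    {"path": "example/setup.py", "for_packages": ["pkg:pypi/example@1.0?uuid=0000"]},
    {"path": "example/lib.py", "for_packages": ["pkg:pypi/example@1.0?uuid=0000"]},
    {"path": "README", "for_packages": []}
  ]
}
""")

grouped = files_by_package(sample)
```

Files that belong to no package (an empty ``for_packages`` list) simply do not appear in the grouping.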
@@ -276,16 +307,22 @@ Documentation Update
 correct minor documentation issues.


-Development environment changes:
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Development environment and Code API changes:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+- The main package API function `get_package_infos` is deprecated, and
+replaced by `get_package_data`.
+
+- The Resource paths are always the same regardless of the strip-root or
+full-root arguments.

-- The license cache consistency is not checked anymore when you are using a Git
+- The license cache consistency is not checked anymore when you are using a git
 checkout. The SCANCODE_DEV_MODE tag file has been removed entirely. Use
 instead the --reindex-licenses option to rebuild the license index.

-- We can now regenerate updated test fixtures using the new SCANCODE_REGEN_TEST_FIXTURES
-environment variable. There is no need to replace the regen=False with regen=True
-in the code.
+- We can now regenerate test fixtures using the new SCANCODE_REGEN_TEST_FIXTURES
+environment variable. There is no need to replace the regen=False with
+regen=True in the code.


 30.1.0 - 2021-09-25
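
Reviewer note: the "License Clarity Scoring Update" text in this file gives a weight per scoring element and says the score is a value from 0-100. A minimal sketch of one way to combine them follows. It assumes each element is a simple true/false flag, that the weights of the true elements are summed, and that the result is clamped to the 0-100 range; the changelog does not spell out the exact combination rule, so this is an illustration and not ScanCode's implementation. The dictionary keys are illustrative names, not confirmed field names.

```python
# Weights copied from the "Scoring Weight" values in the changelog text.
# The keys are illustrative names chosen for this sketch.
WEIGHTS = {
    "declared_license": 40,
    "identification_precision": 40,
    "has_license_text": 10,
    "declared_copyrights": 10,
    "ambiguous_compound_licensing": -10,
    "conflicting_license_categories": -20,
}


def clarity_score(elements):
    """Sum the weights of the elements that are true, clamped to 0-100.

    Assumption: every element is a boolean flag and missing elements
    count as false.
    """
    total = sum(weight for name, weight in WEIGHTS.items() if elements.get(name))
    return max(0, min(100, total))


full = clarity_score({
    "declared_license": True,
    "identification_precision": True,
    "has_license_text": True,
    "declared_copyrights": True,
})
conflicted = clarity_score({
    "declared_license": True,
    "identification_precision": True,
    "has_license_text": True,
    "declared_copyrights": True,
    "conflicting_license_categories": True,
})
```

Under these assumptions the two negative weights act as penalties: they can only lower a score built up from the four positive elements.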

README.rst

Lines changed: 13 additions & 4 deletions
@@ -10,6 +10,12 @@ Read more about ScanCode here: https://scancode-toolkit.readthedocs.io/.

 Check out the code at https://github.com/nexB/scancode-toolkit

+Discover also:
+
+- The ScanCode.io server project here: https://scancodeio.readthedocs.io
+- Other companion SCA projects for code origin, license and security analysis
+here: https://aboutcode.org
+

 Build and tests status
 ======================
@@ -92,12 +98,15 @@ for upcoming features.
 Documentation
 =============

-The ScanCode documentation is hosted at `scancode-toolkit.readthedocs.io <https://scancode-toolkit.readthedocs.io/en/latest/>`_.
+The ScanCode documentation is hosted at
+`scancode-toolkit.readthedocs.io <https://scancode-toolkit.readthedocs.io/en/latest/>`_.

-If you are new to Scancode, start `here <https://scancode-toolkit.readthedocs.io/en/latest/getting-started/newcomer.html>`_.
+If you are new to Scancode, start with our
+`newcomer <https://scancode-toolkit.readthedocs.io/en/latest/getting-started/newcomer.html>`_ page.

-If you want to compare output changes between different versions of Scancode, or want to look at reference scans
-generated by Scancode, start `here <https://github.com/nexB/scancode-toolkit-reference-scans>`_.
+If you want to compare output changes between different versions of Scancode,
+or want to look at scans generated by Scancode, review our
+`reference scans <https://github.com/nexB/scancode-toolkit-reference-scans>`_.

 Other Important Documentation Pages:

docs/source/tutorials/how_to_run_a_scan.rst

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ This extracts the zlib.tar.gz package:

 .. note::

-``--shallow`` option can be used to recursively extract packages.
+Use the ``--shallow`` option to prevent recursive extraction of nested archives.


 Deciding Scan Options

requirements.txt

Lines changed: 2 additions & 2 deletions
@@ -9,7 +9,7 @@ chardet==4.0.0
 charset-normalizer==2.0.12
 click==8.0.4
 colorama==0.4.4
-commoncode==30.2.0
+commoncode==31.0.0b4
 construct==2.10.68
 container-inspector==31.0.0
 cryptography==36.0.2
@@ -49,7 +49,7 @@ pefile==2021.9.3
 pip-requirements-parser==31.2.0
 pkginfo2==30.0.0
 pluggy==1.0.0
-plugincode==30.0.0
+plugincode==31.0.0b1
 ply==3.11
 publicsuffix2==2.20191221
 pyahocorasick==2.0.0b1
