Skip to content

SemClone/osslili

OSS License & Copyright Detector (osslili)

License Python 3.8+ PyPI version

A high-performance tool for identifying licenses and copyright information in local source code. Produces detailed evidence of where licenses are detected with support for all 700+ SPDX license identifiers, enabling comprehensive compliance documentation for the SEMCL.ONE ecosystem.

Features

  • Three-Tier License Detection: Dice-SΓΈrensen similarity, TLSH fuzzy hashing, and regex pattern matching
  • Evidence-Based Output: Exact file paths, confidence scores, and detection methods
  • 700+ SPDX Licenses: Comprehensive support for all SPDX license identifiers
  • SEMCL.ONE Integration: Works seamlessly with purl2notices, ospac, and other ecosystem tools

How It Works

Three-Tier License Detection System

The tool uses a sophisticated multi-tier approach for maximum accuracy:

  1. Tier 1: Dice-SΓΈrensen Similarity with TLSH Confirmation

    • Compares license text using Dice-SΓΈrensen coefficient (97% threshold)
    • Confirms matches using TLSH fuzzy hashing to prevent false positives
    • Achieves 97-100% accuracy on standard SPDX licenses
  2. Tier 2: TLSH Fuzzy Hash Matching

    • Uses Trend Micro Locality Sensitive Hashing for variant detection
    • Catches license variants like MIT-0, BSD-2-Clause vs BSD-3-Clause
    • Pre-computed hashes for all 700+ SPDX licenses
  3. Tier 3: Pattern Recognition

    • Regex-based detection for license references and identifiers
    • Extracts from comments, headers, and documentation

Additional Detection Methods

  • Package Metadata Scanning: Detects licenses from package.json, composer.json, pyproject.toml, etc.
  • Copyright Extraction: Advanced pattern matching with validation and deduplication
  • SPDX Identifier Detection: Finds SPDX-License-Identifier tags in source files

Installation

pip install osslili

For development:

git clone https://github.com/SemClone/osslili.git
cd osslili
pip install -e .

Quick Start

# Fast default scan (LICENSE files + metadata + docs) - RECOMMENDED
osslili .

# Comprehensive deep scan (all source files)
osslili . --deep

# Generate SBOM with license evidence
osslili ./my-project -f cyclonedx-json -o sbom.json

Usage

Scanning Modes

osslili offers three scanning modes optimized for different use cases:

πŸš€ Default Mode (Recommended)

Fast and practical - scans LICENSE files, package metadata, and documentation.

# Scans: LICENSE*, README*, *.md, *.txt, package.json, go.mod, etc.
osslili ./my-project

What it scans:

  • LICENSE files: LICENSE, COPYING, NOTICE, COPYRIGHT, etc. (28+ patterns)
  • Documentation: README, CHANGELOG, CONTRIBUTING (.txt, .md, .rst, .adoc)
  • Package metadata: package.json, go.mod, Cargo.toml, pom.xml, etc. (40+ files)
  • Coverage: 12+ package ecosystems (npm, Python, Go, Java, .NET, Rust, Ruby, PHP, Swift, Dart, Elixir, Scala)

Performance: ~8 seconds on ffmpeg-6.0 (4,000+ files) Use case: Daily development, CI/CD pipelines, quick license checks

πŸ”¬ Deep Mode (Comprehensive)

Thorough scan of all source files for embedded licenses.

# Scans ALL files: .py, .js, .java, .c, .go, etc.
osslili ./my-project --deep

Performance: ~5 minutes on ffmpeg-6.0 (40x slower than default) Use case: Legal compliance reviews, finding embedded license headers

⚑ Strict Mode (Fastest)

LICENSE files only - maximum speed.

# Scans ONLY LICENSE files (no metadata, no README)
osslili ./my-project --license-files-only

Performance: ~7 seconds on ffmpeg-6.0 Use case: When you only need declared licenses


CLI Usage

# Default scan - fast and smart (RECOMMENDED)
osslili /path/to/project

# Deep scan - comprehensive but slower
osslili /path/to/project --deep

# Strict scan - LICENSE files only
osslili /path/to/project --license-files-only

# Generate different output formats
osslili ./my-project -f kissbom -o kissbom.json
osslili ./my-project -f cyclonedx-json -o sbom.json
osslili ./my-project -f cyclonedx-xml -o sbom.xml

# Scan with parallel processing (default: 4 threads)
osslili ./my-project --threads 8

# Scan with limited depth (only 2 levels deep)
osslili ./my-project --max-depth 2

# Extract and scan archives
osslili package.tar.gz --max-extraction-depth 2

# Use caching for faster repeated scans
osslili ./my-project --cache-dir ~/.cache/osslili

# Check version
osslili --version

# Save results to file
osslili ./my-project -o license-evidence.json

# With custom configuration and verbose output
osslili ./src --config config.yaml --verbose

# Debug mode for detailed logging
osslili ./project --debug

Example Output

{
  "scan_results": [{
    "path": "./project",
    "license_evidence": [
      {
        "file": "/path/to/project/LICENSE",
        "detected_license": "Apache-2.0",
        "confidence": 0.988,
        "detection_method": "dice-sorensen",
        "category": "declared",
        "match_type": "text_similarity",
        "description": "Text matches Apache-2.0 license (98.8% similarity)"
      },
      {
        "file": "/path/to/project/package.json",
        "detected_license": "Apache-2.0",
        "confidence": 1.0,
        "detection_method": "tag",
        "category": "declared",
        "match_type": "spdx_identifier",
        "description": "SPDX-License-Identifier: Apache-2.0 found"
      }
    ],
    "copyright_evidence": [
      {
        "file": "/path/to/project/src/main.py",
        "holder": "Example Corp",
        "years": [2023, 2024],
        "statement": "Copyright 2023-2024 Example Corp"
      }
    ]
  }],
  "summary": {
    "total_files_scanned": 42,
    "declared_licenses": {"Apache-2.0": 2},
    "detected_licenses": {},
    "referenced_licenses": {},
    "copyright_holders": ["Example Corp"]
  }
}

Library Usage

from osslili import LicenseCopyrightDetector

# Initialize detector
detector = LicenseCopyrightDetector()

# Process a local directory
result = detector.process_local_path("/path/to/source")

# Process a single file  
result = detector.process_local_path("/path/to/LICENSE")

# Generate different output formats
evidence = detector.generate_evidence([result])
kissbom = detector.generate_kissbom([result])
cyclonedx = detector.generate_cyclonedx([result], format_type="json")
cyclonedx_xml = detector.generate_cyclonedx([result], format_type="xml")

# Access results directly
for license in result.licenses:
    print(f"License: {license.spdx_id} ({license.confidence:.0%} confidence)")
    print(f"  Category: {license.category}")  # declared, detected, or referenced
for copyright in result.copyrights:
    print(f"Copyright: Β© {copyright.holder}")

Output Format

The tool outputs JSON evidence showing:

  • File path: Where the license was found
  • Detected license: The SPDX identifier of the license
  • Confidence: How confident the detection is (0.0 to 1.0)
  • Match type: How the license was detected (license_text, spdx_identifier, license_reference, text_similarity)
  • Description: Human-readable description of what was found

Integration with SEMCL.ONE

OSS License & Copyright Detector is a core component of the SEMCL.ONE ecosystem:

  • Works with purl2notices for comprehensive legal notice generation
  • Integrates with ospac for policy-based compliance evaluation
  • Supports ossnotices for simplified attribution documentation
  • Complements upmex for package metadata extraction workflows

Configuration

Create a config.yaml file:

similarity_threshold: 0.97
max_recursion_depth: 10
max_extraction_depth: 10
thread_count: 4
cache_dir: "~/.cache/osslili"
custom_aliases:
  "Apache 2": "Apache-2.0"
  "MIT License": "MIT"

Documentation

Contributing

We welcome contributions! Please see CONTRIBUTING.md for details on:

  • Code of conduct
  • Development setup
  • Submitting pull requests
  • Reporting issues

Support

For support and questions:

License

Apache License 2.0 - see LICENSE file for details.

Authors

See AUTHORS.md for a list of contributors.


Part of the SEMCL.ONE ecosystem for comprehensive OSS compliance and code analysis.

About

OSLILI - Open Source License Identification Library

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Contributors 2

  •  
  •