
8. More scripts

Pierre-Yves Lapersonne edited this page Apr 2, 2024 · 8 revisions

Extract all emails from all repos

Some features can be combined, for example to extract all email addresses from all the repositories of an organisation (on GitHub or GitLab, for example). You can combine several scripts to do that.

First, copy the extract-emails-from-history.sh and extract-contributors-lists.rb scripts (you can pick them in toolbox/diver/utils) into a folder of your choice, called your "workspace" below. extract-contributors-lists.rb must be placed in a utils folder. Also create a data folder in your workspace.

Then, make a dump of the Git repositories. Fill in the relevant Ruby configuration file of the project, then run:

# For GitHub
bash GitHubWizard.sh backup-all-repositories-from-org

# For GitLab
bash GitLabWizard.sh backup-all-repositories-from-org

Finally, to prepare the big iteration, set the PLATFORM_DUMP_PATH variable in the code below to the path where the clones of the repositories are. Copy the code into a script and run it.

#!/bin/bash

PLATFORM_DUMP_PATH="folder-to-make-dump-of-repositories"

echo "Will look inside '$PLATFORM_DUMP_PATH'"

# For each Git repository, extract emails
for file in "$PLATFORM_DUMP_PATH"/*; do
    if [ -d "$file" ]; then
        if [ -d "$file/.git" ]; then
            echo "Directory '$file' is a git repository, process it"
            bash extract-emails-from-history.sh --project "$file" --loglimit 10.years > /dev/null # The script produces files in a "data" folder
        fi
    fi
done

# Concatenate all result files
global_emails_file="$$_all-emails.txt"
for result_file in data/*-extracted-emails.txt; do
    cat "$result_file" >> "data/$global_emails_file".tmp1
done

# Keep unique values, then sort by email domain
sort -u "data/$global_emails_file".tmp1 | sort -t@ -k2 > "data/$global_emails_file".tmp2

# Keep only the addresses of the wanted domains, here '@(gmail|live|outlook)' (for emails with @gmail or @live or @outlook)
grep -E '@(gmail|live|outlook)' "data/$global_emails_file".tmp2 > "data/$global_emails_file"

# The end
rm "data/$global_emails_file".tmp1
rm "data/$global_emails_file".tmp2
echo "RESULT FILE IS: 'data/$global_emails_file'" # Emails sorted by domain, filtered and deduplicated are in this file
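
The deduplication and ordering step can be tried on its own. The sketch below is only an illustration: the sort_emails_by_domain function and the sample addresses are made up for the example and are not part of the toolbox.

```shell
#!/bin/bash

# Hypothetical helper (not in the toolbox): reads one email address per line
# on stdin, removes duplicates, then orders the lines by domain (text after '@').
sort_emails_by_domain() {
    LC_ALL=C sort -u | LC_ALL=C sort -t@ -k2
}

# Sample data invented for the example
printf '%s\n' 'bob@live.com' 'alice@gmail.com' 'bob@live.com' 'carol@example.org' \
    | sort_emails_by_domain
# prints:
# carol@example.org
# alice@gmail.com
# bob@live.com
```

Removing duplicates first and sorting by domain last matters: a plain sort applied after the domain sort would reorder the lines by their first characters again.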

Compute metrics (lines of code, languages, files) for all repositories of an organisation

You may want to know how many lines of code you have in all the repositories of your organisation. You may also be interested in other metrics like programming languages or number of files. The diver contains a script which computes these metrics thanks to cloc; here are some tips to run it for all of your repositories.

First, fill in the configuration.rb file (for GitHub or GitLab) with your personal access token and the organisation name or ID, and prepare your SSH environment if needed. Then you can clone all the repositories of the platform:

# For GitHub
bash GitHubWizard.sh backup-all-repositories-from-org

# For GitLab
bash GitLabWizard.sh backup-all-repositories-from-org

Of course, you must have defined the destination of your dump in the configuration files (REPOSITORIES_CLONE_LOCATION_PATH variable).

Then, once you have downloaded everything, run the computing script with the destination of your dump as parameter:

bash lines-count.sh --folder "value defined in REPOSITORIES_CLONE_LOCATION_PATH"
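
Before running the metrics script, you may want to check which folders of the dump are really Git repositories. The lines below are a minimal sketch, assuming the dump contains one folder per repository; list_git_repos is a hypothetical helper, not a script of the toolbox.

```shell
#!/bin/bash

# Hypothetical helper (not in the toolbox): prints every direct subfolder
# of the given dump folder which contains a ".git" folder.
list_git_repos() {
    local dump_path="$1"
    local dir
    for dir in "$dump_path"/*/; do
        if [ -d "${dir}.git" ]; then
            printf '%s\n' "${dir%/}"
        fi
    done
    return 0
}

# Use the folder given as first argument, or the current folder by default
list_git_repos "${1:-.}"
```

Piping the output to wc -l gives the number of repositories found.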

Extract all pom.xml files from a project

You may have to audit some big backend projects with plenty of pom.xml files because the project uses Maven as a dependency manager. The script below helps you to find them and copy them elsewhere; save it in a file and run it from the root of the project:

#!/bin/bash
set -euo pipefail

destination="output-$$" # To be sure each new run writes its files in another location
mkdir "$destination"
# Skip the destination folder so that files already copied are not picked up again
find . -name "pom.xml" ! -path "./$destination/*" -print0 | while IFS= read -r -d '' file
do
	newFileName=$(printf '%s\n' "$file" | tr / _ | sed 's/^\._//') # Replace slashes by underscores, and remove the useless leading "._" in file names
	echo "Found pom.xml at '$file', copy it in folder '$destination' with name '$newFileName'"
	cp "$file" "$destination/$newFileName"
done
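
The renaming step can be checked in isolation. This is a sketch with a hypothetical flatten_path function and made-up paths; it uses a sed expression anchored at the start of the line, so only the leading "./" coming from find is dropped.

```shell
#!/bin/bash

# Hypothetical helper: turns a path printed by "find ." into a flat file name.
# Slashes become underscores, then the leading "._" (from "./") is removed.
flatten_path() {
    printf '%s\n' "$1" | tr / _ | sed 's/^\._//'
}

flatten_path "./backend/api/pom.xml" # prints: backend_api_pom.xml
flatten_path "./pom.xml"             # prints: pom.xml
```

Each copied file keeps the trace of its original location in its name, so two pom.xml files from different modules do not overwrite each other.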

Third-party components declaration for iOS apps

iOS applications can, like any software, contain third-party components. It is a best practice, and depending on the FLOSS license sometimes mandatory, to list the third-party components in use somewhere. There is a simple and efficient tool called LicensePlist, made by mono0926 and available on GitHub under the MIT License, which can scan your project and build PLIST files to add to your Settings.bundle. Have a look at their awesome README. In a few lines:

# Install
brew install licenseplist

# Run in your project
license-plist --add-version-numbers

Quick scan of projects for licenses

ScanCode Toolkit is a tool which helps you to quickly look in a project for third-party licences or copyrights. It can also look for email addresses. The source code is on GitHub and the documentation is on their website.

In a few lines:

# Install
pip install scancode-toolkit

# Run in your project
# -clpeui : look for copyrights, licenses, package manifests, emails, URLs...
# -n 6 : 6 processes
# and build an HTML report named "rapport.html"
scancode -clpeui -n 6 --html rapport.html .

Apply some REUSE template to sources with contributors

You may be interested in adding a formatted header to your source files, following the REUSE specification. First, you will need to install reuse:

pip install reuse

Then, in the directory you want to scan and update, create a folder named .reuse and, inside it, a subfolder named templates. At this location, add a new file named for example template.jinja2. This template can contain something like:

{% for copyright_line in copyright_lines %}
{{ copyright_line }}
{% endfor %}
{% for expression in spdx_expressions %}
SPDX-License-Identifier: {{ expression }}
{% endfor %}

{% for contributor in contributor_lines[0].split(',') %}
{% if contributor %}
SPDX-FileContributor: {{ contributor }}
{% endif %}
{% endfor %}

Then run the command in your project:

reuse annotate --template="template" --skip-unrecognised --contributor="$(git log --format='%aN' | sort -u | tr '\n' ',')" --copyright="Cyberdyne Systems" --copyright-style="spdx-symbol" --license="MIT" **/*

Finally, you will get in your sources a new header using the comment symbols of the programming language (here the single-line comment symbol of Swift):

// SPDX-FileCopyrightText: © 2024 Cyberdyne Systems
// SPDX-License-Identifier: MIT
//
// SPDX-FileContributor: Lydia Ahrolsedovah 
// SPDX-FileContributor: Adam Jensen
// SPDX-FileContributor: John Shepard
// SPDX-FileContributor: Liliana Vess
// SPDX-FileContributor: renovate[bot]

Note that the values for FileContributor are the entities identified as authors of commits (including bots!) using the git command above.

You can get more details in the REUSE documentation, and nice Git commands here. The usage is also explained in their GitHub repository.
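
To see what the value given to --contributor looks like, the same pipeline can be fed with sample names instead of a real git log. This is only a sketch: join_contributors is a hypothetical helper and the input is made up.

```shell
#!/bin/bash

# Hypothetical helper: turns one author name per line (as printed by
# "git log --format='%aN'") into a comma-separated list of unique names.
join_contributors() {
    LC_ALL=C sort -u | tr '\n' ','
}

# Sample names standing in for the output of git log
printf '%s\n' 'John Shepard' 'Adam Jensen' 'John Shepard' | join_contributors
# prints: Adam Jensen,John Shepard,
```

The list ends with a comma, so splitting it on ',' gives an empty last item. This is why the template checks each contributor with an if before writing an SPDX-FileContributor line.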
