What is it? A semantic scanner for thousands of GitHub repositories, looking for leaked sensitive information.
The flow:
- Parsing repositories
- Checking deleted/history data with filters (blob data)
- Codebase languages (contains `.html`, `.py`, `.rs`, `.json`, `.sol`)
- Codebase configuration files (`docker-compose.yaml`, `.env`, `.k8s`, `.js`, `.py`)
- Generic binary files
  - Compiled `.exe` with patched secrets, `.pyc`, and other language-specific precompiled/cache files

Simply put, for each blob filetype we run a different set of regexp parsers plus trufflehog.
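The per-filetype dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual rule set: the pattern table and function names are mine, and the real pipeline also pipes every blob through trufflehog.

```python
import re
from pathlib import Path

# Illustrative per-filetype regexes (assumed, not the project's actual rules).
PATTERNS = {
    ".env": [re.compile(r"PRIVATE_KEY\s*=\s*\S+")],
    ".sol": [re.compile(r"0x[0-9a-fA-F]{64}")],  # raw 32-byte hex, e.g. a private key
    ".py":  [re.compile(r"(?i)api_key\s*=\s*['\"][^'\"]+['\"]")],
}

def blob_suffix(name):
    """File extension, treating dotfiles like `.env` as their own suffix."""
    p = Path(name)
    return p.suffix or p.name

def scan_blob(name, text):
    """Run the regex set matching the blob's extension; return raw matches."""
    hits = []
    for pattern in PATTERNS.get(blob_suffix(name), []):
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits
```

Binary blobs (the `.exe`/`.pyc` cases) would need a strings-extraction pass first, which is omitted here.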
- Trufflehog
- GH (Github-CLI)
- You need to set up your GH CLI before running the program.
Then run with Python 3.12:
pip install -r requirements.txt
gh search repos --limit 500 --json fullName --jq '.[].fullName' 'hardhat language:Solidity' > hardhat_solidity_repos.txt
gh search repos --limit 500 --json fullName --jq '.[].fullName' 'language:Python web3.py' > python_web3py_repos.txt
gh search code --limit 500 --json repository,path --jq '.[] | .repository.fullName + "/" + .path' 'filename:.env PRIVATE_KEY' > potential_dotenv_leaks.txt
Make sure that Redis is running. To start Redis:
docker-compose up redis
Start Crawler Worker(s):
rq worker -c config web3_crawler_queue --url redis://localhost:6379/0
Start Analyzer Worker(s):
rq worker -c config web3_analyzer_queue --url redis://localhost:6379/1
Submit Initial Jobs:
python gitsens/submit_jobs.py
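The submission step above can be approximated with RQ directly. A minimal sketch, assuming the queue name and Redis URL from the worker commands; the task path `"gitsens.crawler.crawl_repo"` is a placeholder, not necessarily the project's real function:

```python
def load_repos(path):
    """Read one `owner/name` per line, skipping blanks and duplicates."""
    seen, repos = set(), []
    with open(path) as f:
        for line in f:
            repo = line.strip()
            if repo and repo not in seen:
                seen.add(repo)
                repos.append(repo)
    return repos

def submit_crawl_jobs(repos, redis_url="redis://localhost:6379/0"):
    """Enqueue one crawl job per repository on the crawler queue."""
    # Imported lazily so load_repos() is usable without a running Redis.
    from redis import Redis
    from rq import Queue
    queue = Queue("web3_crawler_queue", connection=Redis.from_url(redis_url))
    for full_name in repos:
        # RQ accepts a dotted string path to the task function;
        # "gitsens.crawler.crawl_repo" is a hypothetical example.
        queue.enqueue("gitsens.crawler.crawl_repo", full_name)
```

Feeding it one of the `gh search` output files from above would look like `submit_crawl_jobs(load_repos("hardhat_solidity_repos.txt"))`.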
Warning: if you're submitting jobs to the analyzer directly, populate the file, commonly named `direct_repos_to_analyze.txt`. For details, look at the `gitsens/submit_jobs.py` implementation.
Simply make sure you have Docker Engine and docker-compose. You should also be logged in to GitHub via the GitHub CLI, since docker-compose will mount this folder from your local machine:
volumes:
- ~/.config/gh:/root/.config/gh:ro
docker-compose up --build
Check logs:
docker-compose logs -f crawler_worker
docker-compose logs -f analyzer_worker
After start, the analyzer will batch-clone the specified repositories into `./analysis_output`, and the tree will look like:
analysis_output
├── cloned_repos
├── custom_regex_findings
├── dangling_blobs
├── restored_files
└── trufflehog_findings
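The `dangling_blobs` and `restored_files` outputs correspond to recovering deleted/history data with plain git plumbing. A minimal sketch, assuming `git` is on PATH; the function names are mine, not the project's:

```python
import subprocess

def find_dangling_blobs(repo_path):
    """List SHAs of dangling blobs reported by `git fsck`."""
    out = subprocess.run(
        ["git", "-C", repo_path, "fsck", "--dangling", "--no-progress"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split()[2] for line in out.splitlines()
            if line.startswith("dangling blob")]

def restore_blob(repo_path, sha):
    """Return the raw content of a blob object, even if no ref points to it."""
    return subprocess.run(
        ["git", "-C", repo_path, "cat-file", "-p", sha],
        capture_output=True, text=True, check=True,
    ).stdout
```

Each restored blob would then be fed back into the per-filetype scanners and trufflehog.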
- GH API limit of 50000 requests/hour
- Limited heuristics
- No post-analysis of output data
It's a secret. Stay tuned.
Thanks to trufflehog for the great security and reconnaissance tool! You can find it at https://github.com/trufflesecurity/trufflehog
P.S.: Honestly, I just wanted to get a better understanding of git.