File Registry

A Python-based file registry system for cataloging and searching files across storage locations with metadata and AI analysis capabilities.

Overview

File Registry helps you build and maintain a central database of file information across your storage volumes. It indexes file metadata including paths, names, sizes, and MD5 hashes, enabling fast searches and analysis across distributed storage systems.

Key Features

Fast File Search: Quickly locate files by name, path, or hash across multiple storage units and servers
MD5 Hash Computation: Generate and store MD5 hashes for file integrity and duplicate detection
Metadata Framework: Extensible system for adding custom metadata and AI-generated metadata to files
Large-Scale Support: Optimized for handling millions of files across distributed storage
Configurable Scanning: Control which directories and file types to include/exclude from scanning

Use Cases

Content Management: Find and organize files across multiple storage locations
Data Preparation: Catalog and prepare data for AI analysis and machine learning
Storage Optimization: Identify duplicates and analyze storage patterns
Digital Asset Management: Track and manage large collections of digital assets
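
As an illustration of the storage-optimization use case, duplicates can be found by grouping registry records on their MD5 hash. This is a minimal sketch, not the project's own code: the `find_duplicates` helper and the record shape (dicts with `path` and `md5` keys) are assumptions made for the example.

```python
from collections import defaultdict


def find_duplicates(records):
    """Group paths by MD5 and keep only hashes that occur more than once.

    `records` is assumed to be an iterable of dicts with "path" and "md5"
    keys; the real registry schema may differ.
    """
    by_hash = defaultdict(list)
    for record in records:
        by_hash[record["md5"]].append(record["path"])
    # A hash shared by two or more paths marks a set of likely duplicates.
    return {md5: sorted(paths) for md5, paths in by_hash.items() if len(paths) > 1}
```

Matching MD5 hashes are a strong hint rather than proof of identical content, so a byte-for-byte comparison is a sensible last step before deleting anything.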

Installation

Clone Repository

# Clone the repository
git clone https://github.com/GameFusion/file-registry.git
cd file-registry

Create a Virtual Environment (Optional but Recommended)

# Optional: Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows, use: venv\Scripts\activate

Install Requirements

# Install required packages (either method works)
pip install -r requirements.txt
# Or install directly:
# pip install mysql-connector-python

Run the Setup Wizard

# Run the setup script to configure your environment
python setup.py

The setup script will guide you through:

Creating necessary configuration files
Setting up database connection credentials
Creating database tables
Configuring excluded files and directories

Configuration

The system uses JSON configuration files (stored in the config directory):

credentials.json - Database connection details
excluded_files.json - Files to exclude from scanning
excluded_dirs.json - Directories to exclude from scanning

These files are created automatically by the setup script (setup.py), but can be modified manually as needed.
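
For readers who want to inspect or reuse these files from their own scripts, the sketch below shows one way to load them. It assumes that each exclusion file holds a plain JSON array of names, which this README does not confirm; the helper names `load_json_config` and `load_exclusions` are hypothetical and not part of the project.

```python
import json
from pathlib import Path

# Fallback taken from the directories this README lists as excluded.
DEFAULT_EXCLUDED_DIRS = [".snapshot", ".git", ".gitold"]


def load_json_config(config_dir, filename, default=None):
    """Return the parsed JSON file from config_dir, or `default` if it is missing."""
    path = Path(config_dir) / filename
    if not path.is_file():
        return default
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def load_exclusions(config_dir="config"):
    """Load both exclusion lists (assumed format: a JSON array of names)."""
    return {
        "files": set(load_json_config(config_dir, "excluded_files.json", default=[])),
        "dirs": set(
            load_json_config(
                config_dir, "excluded_dirs.json", default=DEFAULT_EXCLUDED_DIRS
            )
        ),
    }
```

If the files generated by setup.py use a different layout (for example a JSON object rather than an array), adjust the parsing accordingly.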

Usage

Scanning Files

python file_registry_scan.py /path/to/scan
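
Conceptually, a scan walks the directory tree, skips excluded directories, and records each file's path, name, and size. The following is a rough sketch of that idea using only the standard library; it is not the code in file_registry_scan.py, and the function name `iter_file_records` and the record layout are assumptions.

```python
import os

# Directory names this README lists as excluded from scanning.
EXCLUDED_DIRS = frozenset({".snapshot", ".git", ".gitold"})


def iter_file_records(root, excluded_dirs=EXCLUDED_DIRS, excluded_files=frozenset()):
    """Yield one metadata dict per file found under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            if name in excluded_files:
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            yield {"path": path, "name": name, "size": size}
```

Pruning `dirnames` in place matters on large volumes: an excluded directory such as .snapshot is never opened at all, rather than being walked and filtered afterwards.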

Searching Files

python file_registry_search.py "search_term"
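
To illustrate what a search by name, path, or hash might look like at the database level, here is a self-contained sketch. It uses Python's built-in sqlite3 as a stand-in so it runs without a server, whereas the project itself connects to MySQL through mysql-connector-python (which uses `%s` placeholders instead of `?`). The `files` table, its columns, and both function names are hypothetical.

```python
import sqlite3


def create_registry(conn):
    """Create a hypothetical registry table; the real schema may differ."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, name TEXT, size INTEGER, md5 TEXT)"
    )


def search_files(conn, term):
    """Return rows whose name or path contains term, or whose md5 equals it."""
    pattern = f"%{term}%"
    cursor = conn.execute(
        "SELECT path, name, size, md5 FROM files "
        "WHERE name LIKE ? OR path LIKE ? OR md5 = ? ORDER BY path",
        (pattern, pattern, term),
    )
    return cursor.fetchall()
```

The search term is passed as a bound parameter rather than formatted into the SQL string, which avoids SQL injection. Note that `%` and `_` inside the term still act as LIKE wildcards in this sketch.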

Viewing Logs

python file_registry_log.py

MD5 Metadata Scanner

python md5_metadata_scanner.py /path/to/scan

How to Use

# Store MD5s and file metadata to database (default)
python md5_metadata_scanner.py /path/to/scan

# Store MD5s as extended attributes (Linux)
python md5_metadata_scanner.py /path/to/scan --storage xattr

# Store MD5s in both database and extended attributes
python md5_metadata_scanner.py /path/to/scan --storage both

# Enable verbose output to see details of each file
python md5_metadata_scanner.py /path/to/scan -v
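
The two storage modes can be pictured with the sketch below: hash a file in fixed-size chunks so large files never need to fit in memory, then optionally write the digest to an extended attribute. This is an approximation under stated assumptions, not the scanner's actual implementation; the attribute name `user.md5` and both function names are made up for the example.

```python
import hashlib
import os

XATTR_NAME = "user.md5"  # hypothetical attribute name, not confirmed by the project


def compute_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunk_size pieces."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_md5_xattr(path, md5_hex, attr=XATTR_NAME):
    """Try to store the hash as an extended attribute; return True on success."""
    if not hasattr(os, "setxattr"):
        return False  # os.setxattr is only available on Linux
    try:
        os.setxattr(path, attr, md5_hex.encode("ascii"))
    except OSError:
        return False  # the filesystem may not support user xattrs
    return True
```

MD5 is adequate for integrity checks and duplicate detection, but it is not collision-resistant, so it should not be relied on where an adversary could craft files.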

Project Structure

file_registry_scan.py - Main script for scanning and adding files to the database
file_registry_search.py - Search for files in the database registry
file_registry_log.py - Display log information
md5_metadata_scanner.py - Compute and store MD5 hashes for files

Performance Optimizations

The system includes several optimizations for handling large file systems:

Caching mechanism to speed up repeated operations
Exclusion of system directories like .snapshot, .git, and .gitold
Connection validation to ensure database reliability
Warm-up phase optimization for faster startup

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add some amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

Authors

Andreas Carlen - Initial work - GitHub
