A Python-based file registry system for cataloging and searching files across storage locations with metadata and AI analysis capabilities.
File Registry helps you build and maintain a central database of file information across your storage volumes. It indexes file metadata including paths, names, sizes, and MD5 hashes, enabling fast searches and analysis across distributed storage systems.
- Fast File Search: Quickly locate files by name, path, or hash across multiple storage units and servers
- MD5 Hash Computation: Generate and store MD5 hashes for file integrity and duplicate detection
- Metadata Framework: Extensible system for adding custom metadata and AI-generated metadata to files
- Large-Scale Support: Optimized for handling millions of files across distributed storage
- Configurable Scanning: Control which directories and file types to include/exclude from scanning
- Content Management: Find and organize files across multiple storage locations
- Data Preparation: Catalog and prepare data for AI analysis and machine learning
- Storage Optimization: Identify duplicates and analyze storage patterns
- Digital Asset Management: Track and manage large collections of digital assets
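The duplicate-detection idea behind the MD5 feature can be illustrated with a small standalone sketch. It is not the project's implementation: the function names `md5_of_file` and `find_duplicates` are hypothetical, and it groups files in memory rather than in the database.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group regular files under root by MD5; keep groups with 2+ members."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_hash[md5_of_file(path)].append(str(path))
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_duplicates("/path/to/scan")` returns a dictionary that maps each shared hash to the sorted list of paths with that content. MD5 is adequate for spotting accidental duplicates, but it is not collision-resistant against deliberately crafted files.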
# Clone the repository
git clone https://github.com/GameFusion/file-registry.git
cd file-registry
# Optional: Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows, use: venv\Scripts\activate
# Install required packages (either method works)
pip install -r requirements.txt
# Or install directly:
# pip install mysql-connector-python
# Run the setup script to configure your environment
python setup.py
The setup script will guide you through:
- Creating necessary configuration files
- Setting up database connection credentials
- Creating database tables
- Configuring excluded files and directories
The system uses JSON configuration files (stored in the config directory):
- credentials.json: Database connection details
- excluded_files.json: Files to exclude from scanning
- excluded_dirs.json: Directories to exclude from scanning
These files are created automatically by the setup script (setup.py), but can be modified manually as needed.
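As a rough illustration of how these files might be read, the sketch below loads all three from a config directory. The file names come from this README, but the helper names (`load_json`, `load_config`) and the assumed shapes (a JSON object for credentials, JSON arrays for the exclusion lists) are assumptions; check the files generated by setup.py for the actual schema.

```python
import json
from pathlib import Path

CONFIG_DIR = Path("config")  # directory name taken from the README


def load_json(name, default, config_dir=CONFIG_DIR):
    """Load config_dir/name, returning default if the file does not exist."""
    path = Path(config_dir) / name
    if not path.exists():
        return default
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def load_config(config_dir=CONFIG_DIR):
    """Collect the three configuration files into one dictionary."""
    return {
        # Assumed shape: a JSON object of connection settings.
        "credentials": load_json("credentials.json", {}, config_dir),
        # Assumed shape: JSON arrays of names to skip.
        "excluded_files": load_json("excluded_files.json", [], config_dir),
        "excluded_dirs": load_json("excluded_dirs.json", [], config_dir),
    }
```

Falling back to empty defaults lets a scan run before every file has been created, at the cost of silently ignoring a missing exclusion list.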
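# Scan a directory and add its files to the registry database
python file_registry_scan.py /path/to/scan
# Search the registry for files matching a term
python file_registry_search.py "search_term"
# Display log information
python file_registry_log.py
# Compute and store MD5 hashes for files under a directory
python md5_metadata_scanner.py /path/to/scan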
# Store MD5s and file metadata to database (default)
python md5_metadata_scanner.py /path/to/scan
# Store MD5s as extended attributes (Linux)
python md5_metadata_scanner.py /path/to/scan --storage xattr
# Store MD5s in both database and extended attributes
python md5_metadata_scanner.py /path/to/scan --storage both
# Enable verbose output to see details of each file
python md5_metadata_scanner.py /path/to/scan -v
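To show what storing an MD5 as an extended attribute can look like on Linux, here is a minimal sketch built on the standard library's `os.setxattr` and `os.getxattr`. The attribute name `user.md5` and the function names are hypothetical; the name that md5_metadata_scanner.py actually uses may differ.

```python
import hashlib
import os

XATTR_NAME = "user.md5"  # hypothetical attribute name, not taken from the project


def compute_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_md5_xattr(path, attr=XATTR_NAME):
    """Compute the MD5 and try to store it as an xattr.

    Returns (digest, stored); stored is False when extended attributes
    are unavailable on this platform or filesystem.
    """
    digest = compute_md5(path)
    if not hasattr(os, "setxattr"):  # os.setxattr exists only on Linux
        return digest, False
    try:
        os.setxattr(path, attr, digest.encode("ascii"))
    except OSError:
        return digest, False
    return digest, True


def read_md5_xattr(path, attr=XATTR_NAME):
    """Return the stored digest, or None if it is missing or unsupported."""
    try:
        return os.getxattr(path, attr).decode("ascii")
    except (AttributeError, OSError):
        return None
```

Extended attributes travel with the file on filesystems that support them, so a later scan could read the stored hash instead of recomputing it. Some filesystems and copy tools drop them, though, which is one reason to keep the database copy as well.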
- file_registry_scan.py: Main script for scanning and adding files to the database
- file_registry_search.py: Search for files in the database registry
- file_registry_log.py: Display log information
- md5_metadata_scanner.py: Compute and store MD5 hashes for files
The system includes several optimizations for handling large file systems:
- Caching mechanism to speed up repeated operations
- Exclusion of system directories like .snapshot, .git, and .gitold
- Connection validation to ensure database reliability
- Warm-up phase optimization for faster startup
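The directory-exclusion optimization can be sketched with `os.walk`, which skips a whole subtree when a directory is removed from its `dirnames` list in place. This is an illustrative sketch only; `iter_files` is a hypothetical name, and the project's scanner may apply exclusions differently.

```python
import os

EXCLUDED_DIRS = {".snapshot", ".git", ".gitold"}  # names listed in the README


def iter_files(root, excluded_dirs=EXCLUDED_DIRS):
    """Yield (path, size) for files under root, skipping excluded directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            yield path, size
```

Pruning before descent matters on large volumes: a .snapshot directory can mirror the entire tree, so skipping it avoids walking every file a second time.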
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Andreas Carlen - Initial work - GitHub