A simple command-line tool for running CC AI models locally with performance optimizations.
- What is CC CLI?
- Features
- Quick Installation
- Usage
- Performance Optimizations
- Optimization Details
- Hardware Analysis
- LLM Optimizations
- Performance Mode
- Testing Model Performance
- Available CC Models
- Performance Tips
- Cloud Integration
- Technical Details
- Troubleshooting
- Advanced Configurations
- License
- Contributing
- Credits
CC CLI is a user-friendly command-line interface that makes it easy to install, manage, and run CC AI models on your local machine. It's built on top of Ollama and provides simplified commands for common tasks.
- Simple Installation: One command to install everything
- Easy Model Management: Download and switch between different CC models
- Intuitive Commands: Simple, memorable commands for running AI models
- Cross-Platform: Works on macOS and Linux
- Configurable: Set default models and preferences
- Performance Optimizations: Hardware-specific optimizations for faster inference
- Cloud Integration:
- Authenticate with GCP, AWS, and Azure
- Find the most cost-effective cloud instances for running models
- Provision and manage cloud instances for resource-intensive models
- Intelligent hardware requirements calculation based on model parameters
# Clone this repository
git clone https://github.com/Noahcasarotto/cc-cli.git
cd cc-cli
# Run the installer
bash install.sh
Or install with a single command:
curl -fsSL https://raw.githubusercontent.com/Noahcasarotto/cc-cli/main/install.sh | bash
cc install
# Run with interactive prompt
cc run
# Run with a specific prompt
cc run "What is quantum computing?"
# Run a specific model
cc run cc-r1:14b "Explain the theory of relativity"
# List available models
cc list
# Download a specific model
cc pull cc-r1:8b
# Set default model
cc set-default cc-r1:14b
# Enable verbose mode
cc verbose on
# Disable verbose mode
cc verbose off
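For example, you can chain the documented commands to set up a model and run a one-off prompt (the model and prompt here are illustrative):

```bash
# Download a model, make it the default, then ask a one-off question.
cc pull cc-r1:8b
cc set-default cc-r1:8b
cc run "Summarize the difference between TCP and UDP in two sentences."
```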
CC CLI includes several tools to optimize the performance of AI models on your local hardware:
- `analyze_hardware.sh`: Analyzes your system and recommends suitable models
- `optimize_llm.sh`: Applies comprehensive optimizations for LLM inference
- `test_models.sh`: Tests and compares the performance of different models
These tools work together to provide a seamless and efficient experience when running CC models.
# Analyze your hardware
./analyze_hardware.sh
# Apply optimizations
./optimize_llm.sh
# Test model performance
./test_models.sh
# Use optimized CLI
cc-optimized run "Your prompt here"
# Enable high-performance mode
llm-performance-mode
The `optimize_llm.sh` script applies the following optimizations:
- Metal GPU Optimizations:
  - Disables debug layers for production performance
  - Configures optimal buffer sizes
  - Sets Apple Silicon specific parameters
- Memory Management:
  - Reduces memory fragmentation
  - Optimizes memory allocation
  - Configures memory mapping thresholds
- Thread Optimizations:
  - Sets optimal thread counts for your CPU (see the sketch after this list)
  - Balances workloads across cores
  - Configures numerical libraries for parallelism
- System-Level Improvements:
  - Performance-oriented environment variables
  - I/O optimizations
  - Library-specific settings
- Automatic Loading:
  - Creates LaunchAgent for automatic loading at login
  - Adds environment settings to shell profile
  - Provides convenient wrapper scripts
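As a concrete illustration of the thread settings, the core-count detection behind them could look like this minimal sketch (assumed logic; the actual script may differ in detail):

```bash
#!/usr/bin/env bash
# Detect the logical core count, then cap the thread pools of common
# numerical libraries to that number to avoid oversubscription.
if [[ "$(uname)" == "Darwin" ]]; then
  CPU_CORES=$(sysctl -n hw.ncpu)   # macOS
else
  CPU_CORES=$(nproc)               # Linux
fi
export OMP_NUM_THREADS=$CPU_CORES
export MKL_NUM_THREADS=$CPU_CORES
export VECLIB_MAXIMUM_THREADS=$CPU_CORES
export NUMEXPR_NUM_THREADS=$CPU_CORES
export OPENBLAS_NUM_THREADS=$CPU_CORES
```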
The `analyze_hardware.sh` script analyzes your system hardware and provides model recommendations based on:
- CPU type and core count
- Available RAM
- Disk space
- GPU capabilities
./analyze_hardware.sh
Analyzing system hardware...
----------------------------------------
CPU Information:
Model: Apple M3
Cores: 8
Memory Information:
Total Memory: 8.00GB
Disk Information:
Available Space: 13Gi
GPU Information:
GPU: Chipset Model: Apple M3
Model Recommendation:
Recommended: Use medium-sized models (cc-r1:8b)
Performance Optimization Suggestions:
- GPU detected: Consider enabling GPU acceleration
----------------------------------------
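The recommendation is driven primarily by total RAM. Below is a minimal sketch of that decision, with hypothetical cutoffs chosen to match the tiers in the model table further down, not the script's exact thresholds:

```bash
# Total RAM in GB (hw.memsize is bytes on macOS; MemTotal is kB on Linux).
if [[ "$(uname)" == "Darwin" ]]; then
  TOTAL_GB=$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
else
  TOTAL_GB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 / 1024 ))
fi
# Hypothetical cutoffs for illustration only.
if   (( TOTAL_GB >= 32 )); then echo "Recommended: Use large models (cc-r1:14b+)"
elif (( TOTAL_GB >= 8 ));  then echo "Recommended: Use medium-sized models (cc-r1:8b)"
else                            echo "Recommended: Use small models (cc-r1:1.5b)"
fi
```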
The `optimize_llm.sh` script applies comprehensive optimizations for AI model inference:
- Metal GPU Optimizations for Apple Silicon
  - Disables debug features for production performance
  - Configures optimal buffer sizes
  - Sets Apple Silicon specific parameters
- Memory Management Optimizations
  - Reduces memory fragmentation
  - Optimizes memory allocation
  - Configures memory mapping thresholds
- Thread Optimizations
  - Sets optimal thread counts for your CPU
  - Configures numerical libraries for efficient parallelism
- System-Level Improvements
  - Performance-oriented environment variables
  - Library-specific settings
- Automatic Loading
  - Creates LaunchAgent for automatic loading at login (see the sketch below)
  - Adds environment settings to shell profile
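To make the Automatic Loading step concrete, here is a hypothetical sketch of a LaunchAgent that publishes one optimization variable into the login session. The plist label matches the one referenced in Troubleshooting; the rest is illustrative:

```bash
mkdir -p ~/Library/LaunchAgents
cat > ~/Library/LaunchAgents/com.llm.optimizations.plist <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Label</key><string>com.llm.optimizations</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>-c</string>
    <string>launchctl setenv OMP_NUM_THREADS "$(sysctl -n hw.ncpu)"</string>
  </array>
  <key>RunAtLoad</key><true/>
</dict>
</plist>
EOF
# Load it immediately instead of waiting for the next login.
launchctl load ~/Library/LaunchAgents/com.llm.optimizations.plist
```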
./optimize_llm.sh
- Environment Files: Created in `~/.llm-optimizations/`
- Wrapper Scripts:
  - `cc-optimized`: Wrapper for running CC with optimizations (see the sketch after this list)
  - `llm-performance-mode`: Script for enabling high-performance mode
- LaunchAgent: For automatic loading at login
- Shell Configuration: Updates to `~/.zshrc`
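The wrapper itself can be very small. A minimal sketch of what `cc-optimized` might contain (assumed structure; the generated script may differ):

```bash
#!/usr/bin/env bash
# cc-optimized: load the saved optimization environment, then hand
# every argument through to the regular cc command.
set -euo pipefail
source "$HOME/.llm-optimizations/environment"
exec cc "$@"
```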
After running the optimization script, you can use:
# Run with optimized settings
cc-optimized run "Your prompt here"
# Run with a specific model
cc-optimized run phi "Calculate the square root of 144"
The `llm-performance-mode` script optimizes system resources for LLM operations:
- Flushes disk cache for better I/O performance
- Provides performance recommendations
- Optimizes system resources
llm-performance-mode
This script requires sudo access to flush the disk cache.
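Flushing the disk cache is platform-specific. Both commands below are standard, though the script's exact implementation is assumed:

```bash
# Write pending data to disk, then drop filesystem caches.
if [[ "$(uname)" == "Darwin" ]]; then
  sync && sudo purge                                   # macOS
else
  sync && echo 3 | sudo tee /proc/sys/vm/drop_caches   # Linux
fi
```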
The `test_models.sh` script allows you to test and compare the performance of different models:
- Tests multiple models with the same prompt
- Measures response time
- Captures model outputs
- Provides performance comparisons
./test_models.sh
Starting model phi...
Paris.
total duration: 5.46s
load duration: 68.53ms
prompt eval count: 44 token(s)
prompt eval duration: 2.44s
prompt eval rate: 17.99 tokens/s
eval count: 4 token(s)
eval duration: 2.94s
eval rate: 1.36 tokens/s
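If you want a comparable benchmark of your own, a minimal loop over the documented `cc run` command could look like this (the model list and prompt are illustrative):

```bash
#!/usr/bin/env bash
# Time the same prompt across several models in wall-clock seconds.
PROMPT="What is the capital of France?"
for MODEL in phi cc-r1:1.5b cc-r1:8b; do
  echo "Starting model $MODEL..."
  START=$(date +%s)
  cc run "$MODEL" "$PROMPT"
  echo "Wall-clock time: $(( $(date +%s) - START ))s"
done
```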
| Model | Size | Description | Recommended RAM |
|-------|------|-------------|-----------------|
| cc-r1:1.5b | ~1.1GB | Lightweight model for basic tasks | 8GB+ |
| cc-r1:8b | ~4.9GB | Good balance of capability and resource usage | 16GB+ |
| cc-r1:14b | ~8.5GB | Better performance but requires more resources | 32GB+ |
| cc-r1:32b | ~19GB | Advanced capabilities but requires significant resources | 32-64GB+ |
| cc-r1:70b | ~40GB | Most capable but requires powerful hardware | 64GB+ |
- Smaller models (1.5B, 8B) run faster but have more limited capabilities
- Larger models (14B, 32B, 70B) offer better reasoning but require more RAM and processing power
- First-time model loading is slower as the model gets optimized for your hardware
- Subsequent runs are faster as the model remains cached
- Use `llm-performance-mode` before running intensive tasks
- For Apple Silicon, Metal optimizations provide significant speed improvements
- Match thread count to your available CPU cores for optimal performance
CC CLI provides built-in functionality to authenticate and work with major cloud providers, allowing you to leverage cloud resources for running larger models or managing cloud infrastructure.
Before using the cloud integration features, you need to install the necessary cloud provider CLIs. A helper script is provided to streamline this process:
./install_cloud_deps.sh
This script will guide you through installing:
- Google Cloud SDK (gcloud)
- AWS CLI
- Azure CLI
- jq (required for JSON processing)
You can also install individual CLIs:
./install_cloud_deps.sh --gcp # Install only Google Cloud SDK
./install_cloud_deps.sh --aws # Install only AWS CLI
./install_cloud_deps.sh --azure # Install only Azure CLI
./install_cloud_deps.sh --all # Install all without prompting
The `cc-login.sh` script allows you to authenticate with all supported cloud providers through a simple interface:
./cc-login.sh # Interactive login menu
./cc-login.sh --all # Login to all configured cloud providers
./cc-login.sh --gcp # Login to GCP only
./cc-login.sh --aws # Login to AWS only (supports SSO and access keys)
./cc-login.sh --azure # Login to Azure only
./cc-login.sh --status # Check authentication status
./cc-login.sh --logout # Logout from all providers
The cloud authentication system provides several features:
- Multiple Authentication Methods:
  - AWS: Support for both SSO and traditional access keys
  - Azure: Web-based authentication with subscription selection
  - GCP: Interactive project selection and management
- Status Tracking: The system keeps track of which providers you're authenticated with (see the status sketch below)
- Configuration Management: Cloud provider configurations are stored in `~/.cc-cli/config`
- Intelligent Detection: Automatically detects if you're already logged in
- Guided Setup: Walks you through project/subscription selection where applicable
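Under the hood, a status check like `./cc-login.sh --status` can lean on each provider's own CLI. The commands below are real, but the script's exact logic is assumed:

```bash
# Print the active identity for each provider; blank means logged out.
echo "GCP:   $(gcloud auth list --filter=status:ACTIVE --format='value(account)' 2>/dev/null)"
echo "AWS:   $(aws sts get-caller-identity --query Arn --output text 2>/dev/null)"
echo "Azure: $(az account show --query user.name --output tsv 2>/dev/null)"
```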
The `cloud_compute.sh` script helps you find the most cost-effective cloud instances for running your models. It uses an advanced model requirements calculation system that determines optimal hardware configurations based on model parameters, quantization, and intended use case.
./cloud_compute.sh find-cheapest <model> [performance-level]
Where `<model>` is one of the supported models (cc-r1:1.5b, cc-r1:8b, llama3:8b, etc.) and the optional `[performance-level]` can be:
- `basic`: Lowest cost configuration, CPU-only for smaller models
- `standard`: Balanced cost/performance (default)
- `optimal`: Best performance, higher-tier GPUs with more resources
# Find cheapest instance for running cc-r1:8b with standard performance
./cloud_compute.sh find-cheapest cc-r1:8b
# Find the optimal (highest performance) instance for llama3:8b
./cloud_compute.sh find-cheapest llama3:8b optimal
# Find a basic (lowest cost) instance for phi
./cloud_compute.sh find-cheapest phi basic
The `cloud_compute.sh` script uses a sophisticated approach to determine hardware requirements:
- Model Parameter Analysis: Calculates resource needs based on model size, context length, and architecture
- Quantization Awareness: Adjusts memory requirements based on quantization method (none, int8, int4)
- Performance Tier Scaling: Scales requirements based on desired performance level
- GPU Selection: Intelligently selects appropriate GPU types based on memory needs
- Provider-Specific Optimization: Considers differences between GCP and Azure instance types
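As a worked example of the memory side of this calculation: weights take roughly parameters × bytes-per-parameter, where quantization sets the bytes (fp16 ≈ 2, int8 ≈ 1, int4 ≈ 0.5), plus headroom for the KV cache and runtime. The sketch below uses an assumed 1.2x margin, not the script's actual constants:

```bash
# estimate_gb <params-in-billions> <quant>: rough memory footprint in GB.
estimate_gb() {
  local params_b=$1 quant=$2 bpp
  case "$quant" in
    none) bpp=2 ;;    # fp16 weights
    int8) bpp=1 ;;
    int4) bpp=0.5 ;;
    *) echo "unknown quantization: $quant" >&2; return 1 ;;
  esac
  # 1.2x headroom for KV cache and runtime overhead (assumed margin).
  awk -v p="$params_b" -v b="$bpp" 'BEGIN { printf "%.1f\n", p * b * 1.2 }'
}
estimate_gb 8 int4   # ~4.8 GB, consistent with cc-r1:8b's ~4.9GB size
```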
The system supports a wide range of models with automatically calculated requirements:
- CC models: cc-r1:1.5b, cc-r1:8b, cc-r1:14b, cc-r1:32b, cc-r1:70b
- Third-party models: phi, mistral, gemma:2b, llama3:8b, qwen:4b
The `cloud_compute.sh` script also provides functionality to provision cloud instances based on model requirements:
./cloud_compute.sh provision <model> [performance-level]
This command:
- Finds the most cost-effective instance across providers
- Provisions the instance with the necessary configuration
- Sets up the environment with Docker and CC CLI
- Provides connection instructions
After provisioning, you can:
- Connect to your instance:
  # For GCP (example)
  gcloud compute ssh <instance-name> --zone=<zone>
- Run models on the provisioned instance:
  cc run --cloud=gcp --machine=<machine-type> <model> "Your prompt"
- Terminate the instance when done:
  # For GCP (example)
  gcloud compute instances delete <instance-name> --zone=<zone>
- Install cloud dependencies and authenticate:
  ./install_cloud_deps.sh --gcp
  ./cc-login.sh --gcp
- Find the optimal instance for your model:
  ./cloud_compute.sh find-cheapest cc-r1:70b
- Provision an instance:
  ./cloud_compute.sh provision cc-r1:70b
- Connect and use the model
- Terminate when done
CC CLI is a bash script wrapper around Ollama, making it easier to use CC models specifically. It handles:
- Installation of Ollama
- Model download and management
- Configuration preferences
- Simple command-line interface
- Hardware-specific optimizations
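Because it delegates to Ollama, the heart of the wrapper can be a thin dispatch onto `ollama` subcommands. A minimal sketch of that pattern (assumed structure; the real script adds default-model handling, verbose mode, and configuration in `~/.cc-cli/config`):

```bash
#!/usr/bin/env bash
# Thin dispatch from cc subcommands onto the underlying ollama CLI.
case "${1:-}" in
  run)  ollama run "$2" "${@:3}" ;;   # cc run <model> "prompt"
  pull) ollama pull "$2" ;;           # cc pull <model>
  list) ollama list ;;                # cc list
  *)    echo "Usage: cc {run|pull|list} ..." >&2; exit 1 ;;
esac
```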
The following Metal environment variables are configured:
export MTL_CAPTURE_ENABLED=0
export MTL_DEBUG_LAYER=0
export MTL_MAX_BUFFER_LENGTH=1073741824
export MTL_GPU_FAMILY=apple7
export MTL_GPU_VERSION=1
Memory allocation is optimized with:
export MALLOC_ARENA_MAX=2
export MALLOC_MMAP_THRESHOLD_=131072
export MALLOC_TRIM_THRESHOLD_=131072
export MALLOC_MMAP_MAX_=65536
Thread counts are optimized for your CPU:
export OMP_NUM_THREADS=$CPU_CORES
export MKL_NUM_THREADS=$CPU_CORES
export VECLIB_MAXIMUM_THREADS=$CPU_CORES
export NUMEXPR_NUM_THREADS=$CPU_CORES
export OPENBLAS_NUM_THREADS=$CPU_CORES
Additional performance settings include:
export PYTHONUNBUFFERED=1
export TF_CPP_MIN_LOG_LEVEL=2
export TF_ENABLE_ONEDNN_OPTS=1
export HDF5_USE_FILE_LOCKING=FALSE
- LaunchAgent Loading Error
  If you see `Load failed: 5: Input/output error` when running `optimize_llm.sh`, you can try:
  sudo launchctl bootstrap system ~/Library/LaunchAgents/com.llm.optimizations.plist
- Permission Issues
  If you encounter permission issues with `llm-performance-mode`, ensure the script is executable:
  chmod +x $HOME/bin/llm-performance-mode
- Command Not Found
  If the `cc-optimized` or `llm-performance-mode` commands are not found, ensure `$HOME/bin` is in your PATH:
  echo 'export PATH="$HOME/bin:$PATH"' >> ~/.zshrc
  source ~/.zshrc
You can customize the optimization settings by editing the configuration files in `~/.llm-optimizations/`:
# Edit Metal configuration
nano ~/.llm-optimizations/metal.conf
# Edit memory configuration
nano ~/.llm-optimizations/memory.conf
# Edit main environment file
nano ~/.llm-optimizations/environment
For optimal performance with specific models:
- Small Models (1.5B, 2B)
  - Ideal for quick responses and basic tasks
  - Works well with minimal optimizations
- Medium Models (8B)
  - Benefits from thread optimizations
  - Good balance of performance and capability
- Large Models (14B+)
  - Requires full set of optimizations
  - Benefits significantly from GPU acceleration
  - Use `llm-performance-mode` for best results
Contributions are welcome! Please feel free to submit a Pull Request.