| 1 | +# Memory Monitoring for Funannotate2 Ab Initio Predictions |
| 2 | + |
| 3 | +This document describes the memory monitoring and prediction system implemented for funannotate2's ab initio gene prediction step. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +The memory monitoring system provides: |
| 8 | + |
| 9 | +1. **Memory Usage Prediction** - Estimate memory requirements based on contig length |
| 10 | +2. **Real-time Memory Monitoring** - Track actual memory usage of subprocess calls |
| 11 | +3. **Memory-aware CPU Allocation** - Adjust parallelization based on memory constraints |
| 12 | +4. **Memory Usage Reporting** - Generate detailed memory usage reports |
| 13 | + |
| 14 | +## Features |
| 15 | + |
| 16 | +### 1. Memory Prediction Models |
| 17 | + |
| 18 | +The system includes empirical linear models (a fixed base cost plus a cost per MB of sequence) to predict memory usage for each ab initio tool: |
| 19 | + |
| 20 | +- **SNAP**: Base 50 MB + 0.5 MB per MB of sequence |
| 21 | +- **Augustus**: Base 100 MB + 2.0 MB per MB of sequence |
| 22 | +- **GlimmerHMM**: Base 30 MB + 0.3 MB per MB of sequence |
| 23 | +- **GeneMark**: Base 80 MB + 1.0 MB per MB of sequence |
| 24 | + |
| 25 | +These models provide rough estimates that can be refined with actual usage data. |
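To make the arithmetic concrete, here is a minimal sketch of the linear models listed above. The function name `predict_peak_mb` and the `MEMORY_MODELS` table are hypothetical and are not taken from the funannotate2 source. The sketch also assumes that one "MB of sequence" means 1,048,576 bp, because that convention reproduces the 51.4 MB (SNAP) and 105.4 MB (Augustus) figures shown in the example output for a 2,847,392 bp contig.

```python
# Coefficients copied from the list above: (base MB, MB per MB of sequence).
MEMORY_MODELS = {
    "snap": (50.0, 0.5),
    "augustus": (100.0, 2.0),
    "glimmerhmm": (30.0, 0.3),
    "genemark": (80.0, 1.0),
}

# Assumption: one "MB of sequence" is 1024 * 1024 bp.
BP_PER_MB = 1024 * 1024


def predict_peak_mb(tool_name: str, contig_length: int) -> float:
    """Return the estimated peak memory in MB for one tool on one contig."""
    base_mb, mb_per_seq_mb = MEMORY_MODELS[tool_name.lower()]
    return base_mb + mb_per_seq_mb * (contig_length / BP_PER_MB)


if __name__ == "__main__":
    for tool in ("snap", "augustus"):
        print(f"{tool}: {predict_peak_mb(tool, 2_847_392):.1f} MB")
```

Keeping the coefficients in a single table makes it straightforward to refine them later with measured peak values.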
| 26 | + |
| 27 | +### 2. Real-time Memory Monitoring |
| 28 | + |
| 29 | +The monitor uses `psutil` to track subprocess memory usage in real time: |
| 30 | + |
| 31 | +- Tracks RSS (Resident Set Size) and VMS (Virtual Memory Size) |
| 32 | +- Monitors parent process and all child processes |
| 33 | +- Samples memory usage at configurable intervals (default: 100 ms) |
| 34 | +- Calculates peak, average, and duration statistics |
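The peak, average, and duration statistics can be derived from a plain list of samples. The sketch below is an illustration only: the function name `summarize_samples`, the `(timestamp, rss_bytes, vms_bytes)` tuple layout, and the dictionary keys are assumptions, not the actual structures used by `MemoryMonitor`.

```python
from statistics import mean

BYTES_PER_MB = 1024 * 1024


def summarize_samples(samples):
    """Summarize (timestamp_s, rss_bytes, vms_bytes) tuples into MB statistics."""
    if not samples:
        # Nothing was sampled, e.g. the process exited before the first poll.
        return {"duration_s": 0.0, "peak_rss_mb": 0.0, "peak_vms_mb": 0.0,
                "avg_rss_mb": 0.0, "samples": 0}
    times = [t for t, _, _ in samples]
    rss = [r / BYTES_PER_MB for _, r, _ in samples]
    vms = [v / BYTES_PER_MB for _, _, v in samples]
    return {
        "duration_s": max(times) - min(times),
        "peak_rss_mb": max(rss),
        "peak_vms_mb": max(vms),
        "avg_rss_mb": mean(rss),
        "samples": len(samples),
    }
```

Note that a short-lived process may finish between samples, so the measured peak is a lower bound on the true peak.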
| 35 | + |
| 36 | +### 3. Memory-aware Scheduling |
| 37 | + |
| 38 | +The scheduler automatically adjusts CPU allocation based on: |
| 39 | + |
| 40 | +- Available system memory |
| 41 | +- Predicted memory usage per process |
| 42 | +- User-specified memory limits |
| 43 | +- System memory buffer (20% reserved for OS) |
| 44 | + |
| 45 | +### 4. Integration with Existing Code |
| 46 | + |
| 47 | +The memory monitoring is integrated into: |
| 48 | + |
| 49 | +- `runSubprocess()` - Optional memory monitoring for individual commands |
| 50 | +- `abinitio_wrapper()` - Memory prediction and logging per contig |
| 51 | +- `runProcessJob()` - Memory-aware CPU allocation for multiprocessing |
| 52 | + |
| 53 | +## Usage |
| 54 | + |
| 55 | +### Command Line Options |
| 56 | + |
| 57 | +Add memory monitoring to `funannotate2 predict`: |
| 58 | + |
| 59 | +```bash |
| 60 | +# Enable memory monitoring |
| 61 | +funannotate2 predict -i input_dir --monitor-memory |
| 62 | + |
| 63 | +# Enable memory monitoring with memory limit |
| 64 | +funannotate2 predict -i input_dir --monitor-memory --memory-limit 16 |
| 65 | +``` |
| 66 | + |
| 67 | +### CLI Options |
| 68 | + |
| 69 | +- `--monitor-memory`: Enable memory monitoring and prediction |
| 70 | +- `--memory-limit GB`: Set a memory limit in GB; CPU allocation is reduced as needed to stay within it |
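For readers wiring similar flags into their own tools, the two options could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not funannotate2's actual parser: the `build_parser` helper, the `--input-dir` long name, the help strings, and the `float` type for the limit are all hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser exposing the two memory-related flags."""
    parser = argparse.ArgumentParser(prog="funannotate2 predict")
    # "--input-dir" is a hypothetical long name; the document only shows "-i".
    parser.add_argument("-i", "--input-dir", required=True,
                        help="input directory")
    parser.add_argument("--monitor-memory", action="store_true",
                        help="enable memory monitoring and prediction")
    parser.add_argument("--memory-limit", type=float, default=None,
                        metavar="GB",
                        help="memory limit in GB used to adjust CPU allocation")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["-i", "input_dir", "--monitor-memory", "--memory-limit", "16"]
    )
    print(args.monitor_memory, args.memory_limit)
```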
| 71 | + |
| 72 | +### Example Output |
| 73 | + |
| 74 | +When memory monitoring is enabled, you'll see output like: |
| 75 | + |
| 76 | +``` |
| 77 | +Memory monitoring enabled for ab initio predictions |
| 78 | +Memory limit set to 16.0 GB |
| 79 | +Memory usage estimate for 150 contigs with tools ['snap', 'augustus']: |
| 80 | + Total estimated peak memory: 2847.3 MB |
| 81 | +System memory: 14.2 GB available |
| 82 | +Processing contig scaffold_1.fasta (length: 2,847,392 bp) |
| 83 | +SNAP memory prediction for scaffold_1.fasta: 51.4 MB |
| 84 | +Augustus memory prediction for scaffold_1.fasta: 105.4 MB |
| 85 | +Memory usage for snap-scaffold_1.fasta: |
| 86 | +Process: snap-scaffold_1.fasta |
| 87 | +Duration: 12.34 seconds |
| 88 | +Peak RSS: 48.2 MB |
| 89 | +Peak VMS: 156.7 MB |
| 90 | +Average RSS: 42.1 MB |
| 91 | +Samples collected: 247 |
| 92 | +``` |
| 93 | + |
| 94 | +## API Reference |
| 95 | + |
| 96 | +### Core Functions |
| 97 | + |
| 98 | +#### `predict_memory_usage(tool_name, contig_length, prediction_data=None)` |
| 99 | + |
| 100 | +Predict memory usage for an ab initio tool based on contig length. |
| 101 | + |
| 102 | +**Parameters:** |
| 103 | +- `tool_name`: Name of the ab initio tool ('snap', 'augustus', etc.) |
| 104 | +- `contig_length`: Length of the contig in base pairs |
| 105 | +- `prediction_data`: Optional historical data for improved predictions |
| 106 | + |
| 107 | +**Returns:** Dictionary with predicted memory usage statistics |
| 108 | + |
| 109 | +#### `MemoryMonitor.monitor_process(process, process_name)` |
| 110 | + |
| 111 | +Monitor memory usage of a subprocess in real-time. |
| 112 | + |
| 113 | +**Parameters:** |
| 114 | +- `process`: subprocess.Popen object to monitor |
| 115 | +- `process_name`: Name identifier for the process |
| 116 | + |
| 117 | +**Returns:** Dictionary containing memory statistics |
| 118 | + |
| 119 | +#### `estimate_total_memory_usage(contigs, tools, prediction_data=None)` |
| 120 | + |
| 121 | +Estimate total memory usage for running ab initio predictions on multiple contigs. |
| 122 | + |
| 123 | +**Parameters:** |
| 124 | +- `contigs`: List of contig file paths |
| 125 | +- `tools`: List of ab initio tools to run |
| 126 | +- `prediction_data`: Optional historical data |
| 127 | + |
| 128 | +**Returns:** Dictionary with total memory estimates |
| 129 | + |
| 130 | +#### `suggest_cpu_allocation(total_memory_estimate, available_memory_gb, max_cpus)` |
| 131 | + |
| 132 | +Suggest optimal CPU allocation based on memory constraints. |
| 133 | + |
| 134 | +**Parameters:** |
| 135 | +- `total_memory_estimate`: Total estimated memory usage in MB |
| 136 | +- `available_memory_gb`: Available system memory in GB |
| 137 | +- `max_cpus`: Maximum number of CPUs available |
| 138 | + |
| 139 | +**Returns:** Dictionary with CPU allocation suggestions |
| 140 | + |
| 141 | +### Utility Functions |
| 142 | + |
| 143 | +#### `get_system_memory_info()` |
| 144 | + |
| 145 | +Get current system memory information. |
| 146 | + |
| 147 | +**Returns:** Dictionary with system memory statistics |
| 148 | + |
| 149 | +#### `get_contig_length(contig_file)` |
| 150 | + |
| 151 | +Get the length of a contig from a FASTA file. |
| 152 | + |
| 153 | +**Parameters:** |
| 154 | +- `contig_file`: Path to the contig FASTA file |
| 155 | + |
| 156 | +**Returns:** Length of the contig in base pairs |
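One simple way to obtain a contig length is to sum the lengths of all non-header lines in the FASTA file. The helper names below (`count_sequence_bases`, `fasta_sequence_length`) are hypothetical, and the real `get_contig_length()` may be implemented differently.

```python
def count_sequence_bases(lines) -> int:
    """Count sequence characters, skipping FASTA header lines (those starting with '>')."""
    return sum(len(line.strip()) for line in lines if not line.startswith(">"))


def fasta_sequence_length(path: str) -> int:
    """Return the total sequence length in a FASTA file."""
    with open(path) as handle:
        return count_sequence_bases(handle)
```

This counts every record in the file, which equals the contig length when each file holds a single contig, as the per-contig file names in the example output suggest.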
| 157 | + |
| 158 | +#### `format_memory_report(stats)` |
| 159 | + |
| 160 | +Format memory statistics into a human-readable report. |
| 161 | + |
| 162 | +**Parameters:** |
| 163 | +- `stats`: Memory statistics dictionary |
| 164 | + |
| 165 | +**Returns:** Formatted string report |
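A report in the shape of the example output shown earlier can be produced with a few f-strings. The dictionary keys used here (`process_name`, `duration_s`, `peak_rss_mb`, `peak_vms_mb`, `avg_rss_mb`, `samples`) and the function name `format_report` are assumptions for illustration; the actual statistics dictionary may use different names.

```python
def format_report(stats: dict) -> str:
    """Render a statistics dictionary as a multi-line, human-readable report."""
    lines = [
        f"Process: {stats['process_name']}",
        f"Duration: {stats['duration_s']:.2f} seconds",
        f"Peak RSS: {stats['peak_rss_mb']:.1f} MB",
        f"Peak VMS: {stats['peak_vms_mb']:.1f} MB",
        f"Average RSS: {stats['avg_rss_mb']:.1f} MB",
        f"Samples collected: {stats['samples']}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report({
        "process_name": "snap-scaffold_1.fasta",
        "duration_s": 12.34,
        "peak_rss_mb": 48.2,
        "peak_vms_mb": 156.7,
        "avg_rss_mb": 42.1,
        "samples": 247,
    }))
```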
| 166 | + |
| 167 | +## Implementation Details |
| 168 | + |
| 169 | +### Memory Monitoring Process |
| 170 | + |
| 171 | +1. **Prediction Phase**: Before running ab initio tools, estimate memory usage based on contig lengths |
| 172 | +2. **System Check**: Assess available system memory and suggest CPU allocation |
| 173 | +3. **Real-time Monitoring**: During subprocess execution, sample memory usage at regular intervals |
| 174 | +4. **Reporting**: Log memory statistics and generate reports |
| 175 | +5. **Model Updates**: Optionally update prediction models with actual usage data |
| 176 | + |
| 177 | +### Memory Sampling |
| 178 | + |
| 179 | +The memory monitor: |
| 180 | +- Creates a `psutil.Process` object for the subprocess |
| 181 | +- Samples memory usage every 100 ms (configurable) |
| 182 | +- Tracks both the main process and all child processes |
| 183 | +- Handles process termination gracefully |
| 184 | +- Calculates statistics from all samples |
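The sampling steps above can be sketched as a polling loop. This is an illustrative outline rather than the `MemoryMonitor` implementation: `process_tree_memory` and `sample_until_exit` are hypothetical names, and the `read_memory` parameter is an assumption added so the loop can be exercised without a real process. Only documented `psutil` calls are used (`Process`, `children(recursive=True)`, `memory_info()`).

```python
import time


def process_tree_memory(pid: int):
    """Return (rss_bytes, vms_bytes) summed over a process and its children."""
    import psutil  # third-party; imported lazily so the loop below runs without it

    try:
        parent = psutil.Process(pid)
        procs = [parent] + parent.children(recursive=True)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        return 0, 0  # the process already exited or cannot be inspected
    rss = vms = 0
    for proc in procs:
        try:
            info = proc.memory_info()
            rss += info.rss
            vms += info.vms
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue  # a child exited between listing and sampling; skip it
    return rss, vms


def sample_until_exit(process, interval=0.1, read_memory=process_tree_memory):
    """Poll a subprocess.Popen-like object until it exits, collecting samples."""
    samples = []
    while process.poll() is None:
        rss, vms = read_memory(process.pid)
        samples.append((time.time(), rss, vms))
        time.sleep(interval)
    return samples
```

The loop blocks the caller until the process exits. Since `threading` is listed among the dependencies, the real monitor presumably runs its sampling in a background thread so that the caller can keep handling the subprocess's output.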
| 185 | + |
| 186 | +### CPU Allocation Logic |
| 187 | + |
| 188 | +The system adjusts CPU allocation by: |
| 189 | +1. Estimating memory usage per parallel process |
| 190 | +2. Calculating how many processes can fit in available memory |
| 191 | +3. Leaving a 20% buffer for the operating system |
| 192 | +4. Ensuring at least 1 CPU is allocated |
| 193 | +5. Not exceeding the user-specified maximum |
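The five steps above amount to a short calculation. The following sketch shows one way to express it; the function name `suggest_cpus`, its parameters, and the use of 1 GB = 1024 MB are assumptions, and the real `suggest_cpu_allocation()` returns a dictionary rather than a bare integer.

```python
def suggest_cpus(per_process_mb: float, available_gb: float, max_cpus: int,
                 os_buffer: float = 0.20) -> int:
    """Return how many parallel processes fit in memory, clamped to [1, max_cpus]."""
    # Step 3: reserve a fraction of available memory for the operating system.
    usable_mb = available_gb * 1024 * (1 - os_buffer)
    if per_process_mb <= 0:
        # No usable estimate, so memory does not constrain the allocation.
        return max(1, max_cpus)
    # Step 2: number of processes that fit in the usable memory.
    fit = int(usable_mb // per_process_mb)
    # Steps 4 and 5: at least one CPU, never more than the user-specified maximum.
    return max(1, min(fit, max_cpus))


if __name__ == "__main__":
    print(suggest_cpus(per_process_mb=2048, available_gb=16, max_cpus=8))
```

With 16 GB available and a 20% buffer, about 13,107 MB are usable, so six 2,048 MB processes fit even though eight CPUs were requested.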
| 194 | + |
| 195 | +## Testing |
| 196 | + |
| 197 | +Run the test suite to verify functionality: |
| 198 | + |
| 199 | +```bash |
| 200 | +python test_memory_monitoring.py |
| 201 | +``` |
| 202 | + |
| 203 | +This will test: |
| 204 | +- Memory prediction models |
| 205 | +- System memory information |
| 206 | +- CPU allocation suggestions |
| 207 | +- Total memory estimation |
| 208 | +- Real-time memory monitoring |
| 209 | + |
| 210 | +## Dependencies |
| 211 | + |
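| 212 | +The memory monitoring system requires one third-party package and three standard-library modules: |
| 213 | + |
| 214 | +- `psutil` - For system and process memory monitoring (third-party; install separately) |
| 215 | +- `json` - For saving/loading memory statistics (standard library) |
| 216 | +- `time` - For timing and sampling (standard library) |
| 217 | +- `threading` - For concurrent memory monitoring (standard library) |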
| 218 | + |
| 219 | +## Future Enhancements |
| 220 | + |
| 221 | +Potential improvements include: |
| 222 | + |
| 223 | +1. **Machine Learning Models** - Use actual usage data to train better prediction models |
| 224 | +2. **Memory Profiling** - Detailed analysis of memory allocation patterns |
| 225 | +3. **Dynamic Scheduling** - Adjust CPU allocation during runtime based on actual usage |
| 226 | +4. **Memory Limits** - Hard memory limits with process termination |
| 227 | +5. **Historical Analysis** - Long-term memory usage trends and optimization |
| 228 | +6. **Tool-specific Tuning** - Fine-tune memory models for different ab initio tools |
| 229 | + |
| 230 | +## Troubleshooting |
| 231 | + |
| 232 | +### Common Issues |
| 233 | + |
| 234 | +1. **psutil not available**: Install with `pip install psutil` |
| 235 | +2. **Permission errors**: Some systems may restrict process monitoring |
| 236 | +3. **Inaccurate predictions**: Models are empirical and may need tuning for your data |
| 237 | +4. **Memory monitoring overhead**: Monitoring adds a small amount of CPU and memory overhead |
| 238 | + |
| 239 | +### Performance Impact |
| 240 | + |
| 241 | +Memory monitoring has minimal performance impact: |
| 242 | +- ~1-2% CPU overhead for sampling |
| 243 | +- ~1-5 MB memory overhead for the monitor |
| 244 | +- Sampling interval can be adjusted to reduce overhead |