64 commits
ba600cc
Add documentation for BPI2020 Domestic Declarations dataset and syste…
TataSatyaPratheek Mar 10, 2025
0f1d20f
Enhance GAT model with residual connections, batch normalization, and…
TataSatyaPratheek Mar 10, 2025
266f8da
Update requirements and README for enhanced installation instructions…
TataSatyaPratheek Mar 10, 2025
f2ceadb
Refactor main process and GAT model for improved validation reporting…
TataSatyaPratheek Mar 10, 2025
75f072c
Add new analysis papers and enhance LSTM/GAT model training with visu…
TataSatyaPratheek Mar 10, 2025
7e23d6a
Merge branch 'main' of https://github.com/TataSatyaPratheek/GNN
TataSatyaPratheek Mar 10, 2025
000a4a7
Remove redundant import statements in main.py and process_mining.py
TataSatyaPratheek Mar 10, 2025
9ce189e
Remove obsolete PDF and markdown files from paper analysis directory
TataSatyaPratheek Mar 10, 2025
71c629a
Add new PDF files and markdown documents for thesis statement and pro…
TataSatyaPratheek Mar 10, 2025
6d1ddcd
Updated with ablation study
TataSatyaPratheek Mar 10, 2025
a1e3155
fixed the issues with data processing
TataSatyaPratheek Mar 10, 2025
271abc4
Refactor data loading to use standard preprocessing directly, removin…
TataSatyaPratheek Mar 10, 2025
18d0038
Fix position calculations in EnhancedGNN and PositionalEncoding to en…
TataSatyaPratheek Mar 10, 2025
11428db
Enhance ablation study evaluation by adding graph-level target conver…
TataSatyaPratheek Mar 11, 2025
654d16e
Improve t-SNE and UMAP visualization by ensuring safe perplexity and …
TataSatyaPratheek Mar 11, 2025
d722d0c
debugged all the issues with dt, rf, xgboost
TataSatyaPratheek Mar 11, 2025
9c855ae
Add memory optimization utilities and ablation study script
TataSatyaPratheek Mar 11, 2025
ecd0175
Add initial implementation of process mining package with CLI, utilit…
TataSatyaPratheek Mar 11, 2025
5b218a7
Add base model interfaces and conformance checking utilities for proc…
TataSatyaPratheek Mar 11, 2025
9eea5c1
Add CLI modules for logging, device setup, and data processing pipeline
TataSatyaPratheek Mar 11, 2025
6a23fb7
Add centralized device management and checkpointing utilities; update…
TataSatyaPratheek Mar 11, 2025
b6bc857
simplified the package
TataSatyaPratheek Mar 11, 2025
f835b84
Add unit tests for process mining utilities to improve code coverage
TataSatyaPratheek Mar 11, 2025
3371843
Add module initializations and documentation for process mining compo…
TataSatyaPratheek Mar 12, 2025
d5e4b7e
Add error handling and logging for data processing pipeline
TataSatyaPratheek Mar 12, 2025
5269a09
Refactor data module imports and enhance graph data processing in LST…
TataSatyaPratheek Mar 12, 2025
bb42289
Refactor process mining utilities for improved readability and mainta…
TataSatyaPratheek Mar 12, 2025
e7fd874
Remove outdated heterogeneous graph building function and update docu…
TataSatyaPratheek Mar 12, 2025
8b003f7
Add model factory module and update core imports for enhanced model m…
TataSatyaPratheek Mar 12, 2025
746c68a
Refactor imports across multiple modules for improved organization an…
TataSatyaPratheek Mar 12, 2025
833da30
Add initial module structure and core functionality for process minin…
TataSatyaPratheek Mar 13, 2025
af9d281
Implement new data visualization features for enhanced user insights
TataSatyaPratheek Mar 13, 2025
eec418c
Refactor graph building imports and enhance synthetic data generation…
TataSatyaPratheek Mar 13, 2025
422c67a
Refactor import statements for improved module organization and consi…
TataSatyaPratheek Mar 13, 2025
568ea3c
Refactor CLI argument handling and enhance graph data building tests …
TataSatyaPratheek Mar 13, 2025
7731f8d
Update process mining toolkit to use DGL for graph operations and enh…
TataSatyaPratheek Mar 13, 2025
9ca9779
Add DGL as a core dependency and update data loading to use DGL's Gra…
TataSatyaPratheek Mar 13, 2025
0837a4e
Integrate DGL for graph-level target extraction and enhance evaluatio…
TataSatyaPratheek Mar 13, 2025
f5b3037
Enhance graph processing and visualization with DGL integration and o…
TataSatyaPratheek Mar 13, 2025
6668497
Refactor graph data processing to utilize DGL for improved feature ha…
TataSatyaPratheek Mar 13, 2025
900a388
Enhance DGL graph data handling with improved error checking and opti…
TataSatyaPratheek Mar 13, 2025
24a48d3
Implement additional optimizations for DGL graph processing and enhan…
TataSatyaPratheek Mar 13, 2025
5486b6e
Add memory management utilities for DGL data loading and processing
TataSatyaPratheek Mar 13, 2025
a801306
Enhance DGL sampling methods in training process and optimize Express…
TataSatyaPratheek Mar 13, 2025
0f871fc
Add DGL memory optimization and ablation study CLI integration
TataSatyaPratheek Mar 13, 2025
4c17348
Add ablation study CLI integration and enhance model evaluation utili…
TataSatyaPratheek Mar 13, 2025
fdcb88f
Implement optimizations for DGL graph processing and improve memory m…
TataSatyaPratheek Mar 13, 2025
82cc6e7
Refactor LSTM and RNN models to enhance memory efficiency and remove …
TataSatyaPratheek Mar 13, 2025
819d901
Refactor ExpressiveGATConv for improved dropout handling and add DGL …
TataSatyaPratheek Mar 13, 2025
88a975b
Refactor DGL integration in CLI and model training, enhance sampling …
TataSatyaPratheek Mar 13, 2025
702e21c
Refactor memory management in model training and enhance DGL integrat…
TataSatyaPratheek Mar 13, 2025
c417928
Add advanced workflow example and integrate ablation study functionality
TataSatyaPratheek Mar 13, 2025
ed9c6e4
Update DGL version, add ablation study scripts, and enhance preproces…
TataSatyaPratheek Mar 18, 2025
17c1ec8
Enhance GNN model architecture with additional layers, improve column…
TataSatyaPratheek Mar 18, 2025
da5f290
Refactor interactive process flow visualization for compatibility wit…
TataSatyaPratheek Mar 18, 2025
0df0718
Fix adaptive normalization to ensure proper type handling and improve…
TataSatyaPratheek Mar 18, 2025
06c7693
Add use_edge_features parameter to build_graph_data for enhanced grap…
TataSatyaPratheek Mar 18, 2025
4de1036
Enhance get_batch_graphs_from_indices to support NumPy array indices …
TataSatyaPratheek Mar 18, 2025
ec47711
Refactor stratified data splitting for graph classification tasks, en…
TataSatyaPratheek Mar 18, 2025
a052d4c
Refactor model training and prediction logic to handle tuple losses a…
TataSatyaPratheek Mar 18, 2025
9f73d87
Add loss display utility to handle various loss formats in training p…
TataSatyaPratheek Mar 18, 2025
7d5f6ad
Refactor model creation logic to standardize parameter handling and i…
TataSatyaPratheek Mar 18, 2025
b1d9957
Enhance model evaluation and factory logic to support diverse loss ty…
TataSatyaPratheek Mar 18, 2025
2b4479d
Add prediction method to LSTM model and enhance tensor dataloader for…
TataSatyaPratheek Mar 18, 2025
99 changes: 90 additions & 9 deletions README.md
@@ -1,4 +1,4 @@
## Process Mining with Graph Neural Networks
# Process Mining with Graph Neural Networks

An advanced implementation combining Graph Neural Networks, Deep Learning, and Process Mining techniques for business process analysis and prediction.

@@ -68,7 +68,18 @@ git clone https://github.com/ERPdotAI/GNN.git
cd GNN
```

2. Install dependencies:
2. Create and activate a virtual environment:
```bash
# For Linux/macOS
python -m venv pm-venv
source pm-venv/bin/activate

# For Windows
python -m venv pm-venv
pm-venv\Scripts\activate
```

3. Install dependencies:
```bash
pip install -r requirements.txt
```
@@ -84,10 +95,36 @@ The system expects process event logs in CSV format with the following structure

## 8. Usage

### Basic Usage

```bash
python main.py <input-file-path>
```

For example:
```bash
python main.py input/BPI2020_DomesticDeclarations.csv
```

### Advanced Options

The script supports several command-line arguments:

```bash
python main.py input/BPI2020_DomesticDeclarations.csv --epochs 30 --batch-size 64 --norm-features
```

Available options:
- `--epochs`: Number of epochs for GNN training (default: 20)
- `--lstm-epochs`: Number of epochs for LSTM training (default: 5)
- `--batch-size`: Batch size for training (default: 32)
- `--norm-features`: Use L2 normalization for features
- `--skip-rl`: Skip reinforcement learning step
- `--skip-lstm`: Skip LSTM modeling step
- `--output-dir`: Custom output directory

### Output Structure

Results are stored in timestamped directories under `results/` with the following structure:
```
results/run_timestamp/
@@ -100,30 +137,74 @@

## 9. Technical Details

Graph Neural Network Architecture
### Graph Neural Network Architecture
- Multi-head attention mechanisms
- Dynamic graph construction
- Adaptive feature learning
- Custom loss functions for process-specific metrics
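
Below is a minimal sketch of how such a block can be composed in DGL (which later commits in this PR adopt for graph operations): one multi-head `GATConv` layer followed by batch normalization and a residual connection. The class name and hyperparameters are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn
from dgl.nn import GATConv

class GATBlock(nn.Module):
    """Hypothetical multi-head attention block with residual + batch norm."""
    def __init__(self, in_feats, out_feats, num_heads=4, dropout=0.2):
        super().__init__()
        # Each head produces out_feats features; head outputs are concatenated.
        self.conv = GATConv(in_feats, out_feats, num_heads,
                            feat_drop=dropout, attn_drop=dropout)
        self.norm = nn.BatchNorm1d(out_feats * num_heads)
        # Linear projection so the residual matches the concatenated width.
        self.skip = nn.Linear(in_feats, out_feats * num_heads)

    def forward(self, graph, feat):
        h = self.conv(graph, feat)   # (N, num_heads, out_feats)
        h = h.flatten(1)             # (N, num_heads * out_feats)
        return torch.relu(self.norm(h + self.skip(feat)))
```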

LSTM Implementation
### LSTM Implementation
- Bidirectional sequence modeling
- Variable-length sequence handling
- Custom embedding layer for process activities
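
A minimal sketch of this pattern in PyTorch, assuming padded activity-index sequences; the class and parameter names are illustrative, not the repository's actual code.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class ActivityLSTM(nn.Module):
    """Hypothetical bidirectional LSTM over variable-length activity traces."""
    def __init__(self, num_activities, embed_dim=32, hidden_dim=64):
        super().__init__()
        # Custom embedding for process activities; index 0 reserved for padding.
        self.embed = nn.Embedding(num_activities, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, num_activities)

    def forward(self, seqs, lengths):
        # Packing lets the LSTM skip padded positions of shorter traces.
        packed = pack_padded_sequence(self.embed(seqs), lengths.cpu(),
                                      batch_first=True, enforce_sorted=False)
        out, _ = self.lstm(packed)
        out, _ = pad_packed_sequence(out, batch_first=True)
        return self.head(out)  # per-step next-activity logits
```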

Process Mining Components
### Process Mining Components
- Inductive miner implementation
- Token-based replay
- Custom conformance checking metrics
- Advanced bottleneck detection algorithms
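
For reference, inductive-miner discovery and token-based replay can be combined with PM4Py roughly as follows. This is a hedged sketch; the column names are assumptions about the event-log schema, not the repository's actual configuration.

```python
import pandas as pd
import pm4py

# Assumed column names; adjust to the actual CSV schema.
df = pd.read_csv("input/BPI2020_DomesticDeclarations.csv")
log = pm4py.format_dataframe(df, case_id="case_id",
                             activity_key="activity",
                             timestamp_key="timestamp")

# Discover a Petri net with the inductive miner, then replay the log on it.
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)
fitness = pm4py.fitness_token_based_replay(log, net, initial_marking, final_marking)
print(fitness)
```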

Reinforcement Learning
### Reinforcement Learning
- Custom environment for process optimization
- State-action space modeling
- Policy gradient methods
- Resource allocation optimization
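
As a generic illustration of the policy-gradient part only (the project's actual environment and reward design are not shown here), a REINFORCE-style loss can be computed as:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Generic REINFORCE loss; log_probs are log pi(a_t|s_t) per step."""
    returns, g = [], 0.0
    for r in reversed(rewards):          # discounted return-to-go
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    # Normalizing returns reduces gradient variance.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs) * returns).sum()
```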

## 10. Contributing
### Visualization Capabilities
- Process flow network diagrams
- Bottleneck identification
- Transition heatmaps
- Interactive Sankey diagrams
- Cycle time distributions
- Task embedding visualizations
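
For example, an interactive Sankey diagram of activity transitions can be produced with Plotly; the activities and counts below are placeholders, not output of this repository.

```python
import plotly.graph_objects as go

activities = ["Submit", "Approve", "Pay"]  # placeholder activity labels
fig = go.Figure(go.Sankey(
    node=dict(label=activities),
    link=dict(source=[0, 1],     # indices into `activities` (edge start)
              target=[1, 2],     # indices into `activities` (edge end)
              value=[120, 95]),  # placeholder transition frequencies
))
fig.write_html("results/process_sankey.html")
```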

## 10. Troubleshooting

### Common Issues

1. **UMAP/Numba version incompatibility**

If you encounter an error like:
```
ImportError: Numba needs NumPy 2.1 or less. Got NumPy 2.2.
```

The code is designed to handle this gracefully by falling back to t-SNE for dimensionality reduction.
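
A sketch of that fallback logic (not the repository's exact code), with the perplexity kept safely below the sample count:

```python
import numpy as np

def reduce_to_2d(embeddings: np.ndarray) -> np.ndarray:
    try:
        import umap  # raises ImportError under incompatible Numba/NumPy
        return umap.UMAP(n_components=2).fit_transform(embeddings)
    except ImportError:
        from sklearn.manifold import TSNE
        # t-SNE requires perplexity < number of samples.
        perplexity = min(30, max(2, len(embeddings) - 1))
        return TSNE(n_components=2,
                    perplexity=perplexity).fit_transform(embeddings)
```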

2. **PM4Py installation issues**

If PM4Py installation fails, you can use the code without conformance checking:
```bash
python main.py <input-file-path> --skip-conformance
```

3. **CUDA/GPU issues**

The code will automatically detect and use the appropriate device (CUDA, MPS, or CPU).
If you encounter GPU memory issues, try reducing the batch size:
```bash
python main.py <input-file-path> --batch-size 16
```
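
The selection logic is roughly the following (a minimal sketch; the repository's actual helper may differ):

```python
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")        # NVIDIA GPU
    if torch.backends.mps.is_available():  # Apple Silicon GPU
        return torch.device("mps")
    return torch.device("cpu")
```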

### Getting Help

If you encounter issues not covered above, please open an issue on the GitHub repository with:
- Full error message
- Python version
- OS details
- Dependencies list (output of `pip freeze`)

## 11. Contributing

We welcome contributions from the research community. Please follow these steps:

@@ -132,7 +213,7 @@ We welcome contributions from the research community. Please follow these steps:
3. Implement your changes
4. Submit a pull request with detailed documentation

## 11. Citation
## 12. Citation

If you use this code in your research, please cite:

@@ -144,4 +225,4 @@ If you use this code in your research, please cite:
publisher = {ERP.AI},
url = {https://github.com/ERPdotAI/GNN}
}
```
27 changes: 27 additions & 0 deletions ablation.sh
@@ -0,0 +1,27 @@
#!/bin/bash
set -euo pipefail

# Original parameters unchanged
DATASET="input/BPI2020_DomesticDeclarations.csv"
OUTPUT_DIR="ablation_results"
LOG_DIR="${OUTPUT_DIR}/logs"
mkdir -p "$LOG_DIR"

# decision_tree random_forest xgboost
for MODEL in mlp lstm basic_gat positional_gat diverse_gat enhanced_gnn; do
echo "Running ablation for: $MODEL"

(
python -c "import torch; torch.cuda.empty_cache()" 2>/dev/null
python main.py $DATASET \
--run-ablation \
--model-type $MODEL \
--output-dir "${OUTPUT_DIR}/${MODEL}" \
--batch-size 32 \
--epochs 5 | tee "${LOG_DIR}/${MODEL}.log"
)

sleep 1
done

echo "All ablation studies completed"
73 changes: 73 additions & 0 deletions ablation_logs/fix_preprocessing.py
@@ -0,0 +1,73 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Patched data preprocessing module to fix tuple handling issues
"""

import importlib.util
import sys
from colorama import Fore, Style

# Load the main module
spec = importlib.util.spec_from_file_location("main", "main.py")
main = importlib.util.module_from_spec(spec)
sys.modules["main"] = main
spec.loader.exec_module(main)

# Fix load_and_preprocess_data_phase1 to properly handle tuple results
def patched_load_and_preprocess_data_phase1(data_path, args):
    from modules.data_preprocessing import load_and_preprocess_data, create_feature_representation, build_graph_data

    main.print_section_header("Loading and Preprocessing Data with Phase 1 Enhancements")

    # Load and preprocess data
    result = load_and_preprocess_data(
        data_path,
        use_adaptive_norm=args.adaptive_norm,
        enhanced_features=args.enhanced_features,
        enhanced_graphs=args.enhanced_graphs,
        batch_size=args.batch_size
    )

    # Proper type checking with diagnostic output
    if isinstance(result, tuple):
        print(f"{Fore.YELLOW}Debug: load_and_preprocess_data returned tuple of length {len(result)}{Style.RESET_ALL}")

        if len(result) == 4:
            # Properly returns (df, graphs, task_encoder, resource_encoder)
            return result
        elif len(result) >= 1:
            # Extract the dataframe from the first element if it's a dataframe
            candidate_df = result[0]
            if hasattr(candidate_df, 'columns'):
                print(f"{Fore.GREEN}Debug: Successfully extracted dataframe from tuple[0]{Style.RESET_ALL}")
                df = candidate_df
            else:
                print(f"{Fore.RED}Error: First element of tuple is not a dataframe{Style.RESET_ALL}")
                df = result  # Let it fail later with a clear error
        else:
            print(f"{Fore.RED}Error: Returned tuple is empty{Style.RESET_ALL}")
            df = result  # Let it fail later with a clear error
    else:
        # Just returns a dataframe or other object
        df = result

    # Process the dataframe normally
    if hasattr(df, 'columns'):
        # Create feature representation
        df, task_encoder, resource_encoder = create_feature_representation(df, use_norm_features=args.adaptive_norm)
        graphs = build_graph_data(df)
        return df, graphs, task_encoder, resource_encoder
    else:
        print(f"{Fore.RED}Error: df is not a dataframe, it's a {type(df)}{Style.RESET_ALL}")
        raise TypeError(f"Expected DataFrame, got {type(df)}")

# Apply our patch
main.load_and_preprocess_data_phase1 = patched_load_and_preprocess_data_phase1

# Run the main function with the arguments passed to this script
if __name__ == "__main__":
    # Pass all arguments to main function
    main.main()
24 changes: 24 additions & 0 deletions ablation_logs/run_ablation.py
@@ -0,0 +1,24 @@
#!/usr/bin/env python3
import sys
import subprocess

# Get the command line arguments
args = sys.argv[1:]

# Print what we're going to run
print(f"Running: python main.py {' '.join(args)}")

# Run the command and capture output
try:
    result = subprocess.run(['python', 'main.py'] + args,
                            check=True,
                            text=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    print(result.stdout)
    sys.exit(0)
except subprocess.CalledProcessError as e:
    print(f"Error running main.py: {e}")
    print(e.stdout)
    sys.exit(e.returncode)
56 changes: 56 additions & 0 deletions clear_gpu_memory.py
@@ -0,0 +1,56 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import torch
import gc
import os
import psutil
from colorama import Fore, Style, init

# Initialize colorama
init()

def clear_gpu_memory():
    """Release cached GPU memory and report before/after usage."""
    print(f"{Fore.CYAN}Clearing GPU memory...{Style.RESET_ALL}")

    # Force garbage collection
    gc.collect()

    if torch.cuda.is_available():
        # Get initial memory stats
        initial_allocated = torch.cuda.memory_allocated() / (1024**2)
        initial_reserved = torch.cuda.memory_reserved() / (1024**2)

        print(f"Initial GPU memory: {initial_allocated:.1f} MB allocated, {initial_reserved:.1f} MB reserved")

        # Empty cache
        torch.cuda.empty_cache()

        # Synchronize device
        torch.cuda.synchronize()

        # Get final memory stats
        final_allocated = torch.cuda.memory_allocated() / (1024**2)
        final_reserved = torch.cuda.memory_reserved() / (1024**2)

        print(f"Final GPU memory: {final_allocated:.1f} MB allocated, {final_reserved:.1f} MB reserved")
        print(f"Freed {initial_reserved - final_reserved:.1f} MB")
    else:
        print(f"{Fore.YELLOW}No GPU available{Style.RESET_ALL}")

    # Also report CPU memory
    cpu_memory = psutil.virtual_memory()
    print(f"CPU memory: {cpu_memory.percent}% used, {cpu_memory.available / (1024**3):.2f} GB available")

if __name__ == "__main__":
    clear_gpu_memory()

    # Also kill any orphaned CUDA processes on Linux.
    # Caution: this force-kills every python process holding GPU memory.
    # The machine-readable query output gives a reliable PID field.
    if os.name == 'posix':
        try:
            os.system(
                "nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader"
                " | grep python | cut -d, -f1 | xargs -r kill -9"
            )
            print(f"{Fore.GREEN}Killed orphaned CUDA processes{Style.RESET_ALL}")
        except Exception:
            pass