Critical Bug Reproduction for Apache Hadoop ABFS Driver
This repository contains a complete reproduction test case and proof-of-concept fix for a critical thread leak in the Apache Hadoop ABFS (Azure Data Lake Storage Gen2) driver.
Bug: AbfsClientThrottlingAnalyzer creates Timer threads that are never cleaned up, causing memory exhaustion and OutOfMemoryError crashes in production environments.
Impact: Production Hive Metastore instances crash with 60,000+ leaked threads
Component: hadoop-azure - AbfsClientThrottlingAnalyzer class
Affected Versions: Hadoop 3.3.6, 3.4.1+, and all versions with ABFS autothrottling
- Java 11+ (tested with Java 23)
- Maven 3.6+
- Azure storage account with access key
-
Clone and configure:
git clone <this-repo> cd abfs-thread-leak-test # Update your Azure storage account key # Edit src/main/java/AbfsThreadLeakTest.java line 141: # conf.set("fs.azure.account.key.YOURACCOUNT.dfs.core.windows.net", "YOUR_KEY_HERE"); # Edit src/main/java/AbfsThreadLeakTest.java line 26 and 27: # String[] storageAccounts = { # "abfs://<container1>@<account>.dfs.core.windows.net/", # "abfs://<container2>@<account>.dfs.core.windows.net/" # };
-
Build:
./mvnw clean package
-
Run the reproduction test:
java --add-opens java.base/java.lang=ALL-UNNAMED \ --add-opens java.base/java.util=ALL-UNNAMED \ --add-opens java.base/java.util.concurrent=ALL-UNNAMED \ --add-opens java.base/java.net=ALL-UNNAMED \ --add-opens java.base/java.io=ALL-UNNAMED \ -Djava.security.manager=allow \ -Dhadoop.home.dir=/tmp \ -Dhadoop.security.authentication=simple \ -jar target/abfs-thread-leak-test-1.0.0.jar
Cycle 1: 6→22 total threads (+16), 0→2 ABFS timer threads (+2)
Cycle 2: 22→24 total threads (+2), 2→4 ABFS timer threads (+2)
...
Final: 55 total threads (+49), 36 ABFS timer threads (+36)
⚠️ THREAD LEAK DETECTED!
36 ABFS timer threads were not cleaned up
Final: 324 total threads (+318), 304 ABFS timer threads (+304)
⚠️ MASSIVE THREAD LEAK DETECTED!
304 ABFS timer threads were not cleaned up
$ jstack <PID> | grep 'abfs-timer-client-throttling-analyzer' | wc -l
282
290
294
298
304
308 # Continues growing even after test completionAbfsClientThrottlingAnalyzer creates Timer objects in its constructor but has no cleanup mechanism:
// In constructor - creates timer threads
this.timer = new Timer(
String.format("abfs-timer-client-throttling-analyzer-%s", name), true);
// Missing: No close() method to call timer.cancel()- Each ABFS filesystem connection creates an
AbfsClient - Each
AbfsClientcreates anAbfsClientThrottlingAnalyzer - Each analyzer creates Timer threads for read/write throttling
- When
AbfsClient.close()is called, it doesn't clean up analyzer timers - Timer threads persist until JVM shutdown
Leaked threads have consistent naming pattern:
abfs-timer-client-throttling-analyzer-read {storage-account}
abfs-timer-client-throttling-analyzer-write {storage-account}
This repository includes a working fix that completely eliminates the thread leak.
Added to AbfsClientThrottlingAnalyzer.java:
/**
* Closes the throttling analyzer and cleans up resources.
*/
public void close() {
if (timer != null) {
timer.cancel();
timer.purge();
timer = null;
}
if (blobMetrics != null) {
blobMetrics.set(null);
}
LOG.debug("AbfsClientThrottlingAnalyzer for {} has been closed and cleaned up", name);
}
@VisibleForTesting
public boolean isClosed() {
return timer == null;
}🔴 Without fix: +3 ABFS timer threads (LEAKED)
✅ With fix: +0 ABFS timer threads (NO LEAK)
✅ FIX WORKS! No additional threads leaked when close() is called!
For production environments experiencing this issue:
<property>
<name>fs.azure.enable.autothrottling</name>
<value>false</value>
</property>Trade-offs:
- Completely eliminates thread leak
- Disables client-side throttling (server-side throttling still active)
- May see increased 503 responses under heavy load
- Test environment: 304 threads in minutes
- Production environment: 60,000+ threads over weeks
- Memory impact: Each thread consumes ~1MB stack space
- Crash threshold: Varies by JVM configuration and available memory
src/main/java/
├── AbfsThreadLeakTest.java # Main test harness
└── org/apache/hadoop/fs/azurebfs/services/
└── AbfsClientThrottlingAnalyzer.java # Fixed version with close()
- Bug Reproduction: Normal ABFS operations with autothrottling enabled
- Workaround Validation: Same operations with autothrottling disabled
- Fix Verification: Direct analyzer testing with proper cleanup
- Live Monitoring: Real-time thread count tracking
- Java version and vendor
- Hadoop version and build details
- Thread states and IDs
- Memory usage analysis
- Configuration details
Monitor thread leaks in production:
# Count ABFS timer threads
jstack <PID> | grep 'abfs-timer-client-throttling-analyzer' | wc -l
# Monitor total thread count
jstack <PID> | grep "^\"" | wc -l
# Watch for thread growth
watch -n 5 "jstack <PID> | grep 'abfs-timer-client-throttling-analyzer' | wc -l"- Apply workaround:
fs.azure.enable.autothrottling=false - Monitor thread counts in production environments
- Restart affected services when thread count exceeds safe limits
- Add
close()method toAbfsClientThrottlingAnalyzer - Ensure
AbfsClient.close()calls analyzer cleanup - Add proper resource management to prevent future leaks
This repository provides a complete test suite for:
- Reproducing the bug reliably
- Validating the workaround
- Verifying the fix effectiveness
- Monitoring production environments
Test Environment:
- Java: 23.0.2 (also tested on Java 11)
- Hadoop: 3.3.6
- Maven: 3.9+
- OS: macOS 15.3 (cross-platform compatible)
Thread Leak Pattern:
- +2 threads per ABFS filesystem connection (read + write analyzer)
- Threads persist indefinitely until JVM shutdown
- Memory usage grows linearly with thread count
- No self-healing or cleanup mechanism