Advanced AI-powered PTZ camera tracking system with flexible object priorities
NOLOcam is a sophisticated Go-based system for tracking and monitoring objects using PTZ cameras with YOLO object detection. Unlike YOLO's "You Only Look Once" approach, NOLOcam continuously tracks objects with spatial awareness and predictive capabilities - hence "Never Only Look Once".
TLDR: Dude hooked up AI to a PTZ camera and it works.
See it in action here: https://www.youtube.com/@MiamiRiverCamera/streams
Why build this software? While everyone uses AI to enable code, I'm using code to enable AI. This project began as an experiment to test a fundamental thesis: AI integration isn't just coming to everything—it's inevitable, and I think the most powerful implementations will be hybrid systems that blend local and cloud intelligence.
This camera system represents a microcosm of where I believe all technology is heading. We're transitioning from AI as a separate tool to AI as an integrated nervous system within traditional hardware. Consider the implications:
Today: A security camera records video that humans review later. At best, a fixed (non-moving) camera offers motion alerts or basic object detection; otherwise it offloads footage to a cloud service for any real AI processing.
Tomorrow: Devices that think, track, analyze, and make decisions in real-time while seamlessly integrating with cloud-based reasoning systems. It's going to happen everywhere: children's toys, cars, and even the kitchen.
But this extends far beyond cameras. Imagine:
- Smart kitchens where sensor-aware cookware communicates thermal dynamics, moisture levels, and chemical changes to AI models embedded in ovens and ranges, automatically adjusting cooking parameters in real-time
- Intelligent manufacturing where every tool, sensor, and component participates in a distributed AI network, predicting failures, optimizing processes, and self-correcting
- Responsive infrastructure where bridges, roads, and buildings continuously monitor their structural health and environmental conditions, making micro-adjustments and predictive maintenance decisions
- Children's toys where you speak to your childhood best friend Teddy Ruxpin and learn math from a talking fish. It's going to happen.
Pure cloud-based AI processing is fundamentally flawed for real-time applications:
The Math Doesn't Work: Processing 30fps video at broadcast quality would require constant cloud API calls. Even with perfect connectivity, you're looking at:
- Latency that kills real-time responsiveness instantly. Dead in the water.
- API costs that could easily reach $$$$$$$$$ for continuous operation
The same math problem applies to cybersecurity detection and monitoring, or really any real-time streaming dataset.
The Local Reality: A modest gaming laptop with an NVIDIA GeForce RTX 3070 sits 80% idle while effortlessly processing this high-bandwidth video stream. The computational power already exists at the edge. That said, I'm not running any of the truly advanced models on this machine yet.
This led to the system's hybrid design: local AI handles the high-frequency, low-latency tasks (object detection, tracking, movement control), while cloud AI provides deep reasoning and analysis for complex decision-making. This isn't just more efficient—it's more intelligent.
Local AI handles: Frame-by-frame analysis, real-time tracking, immediate responses
Cloud AI handles: Complex scene interpretation, behavioral analysis, strategic decision-making
This hybrid model represents the future across domains:
- Cybersecurity: Local models detecting anomalies in microseconds, cloud models providing sophisticated threat analysis and summarization.
- Autonomous vehicles: Local processing for immediate hazard response, cloud intelligence for route optimization and traffic pattern learning
- Medical devices: Local monitoring for critical vital signs, cloud analysis for diagnostic insights and treatment recommendations
We're witnessing a fundamental transformation in how intelligence is distributed through our technological ecosystem. Rather than centralizing AI in distant data centers, we're creating a web of interconnected intelligence that spans from the smallest embedded processors to the largest cloud infrastructures and overlapping APIs.
This camera system is proof of concept for a world where every device isn't just connected—it's intelligent. Where the boundary between hardware and software dissolves into responsive, adaptive systems that learn, predict, and evolve.
We're going to wire up AI to things that have controls and it's going to be very interesting to watch "things" become more than just "things".
The question isn't whether AI will be embedded in everything. The question is whether we'll build these systems thoughtfully, with proper attention to latency, cost, privacy, and resilience. This project suggests we can—and we must.
I hadn't set up a webcam in decades, and I decided to set up something basic. All these Chinese-branded cameras advertise AI and auto tracking, so after buying and returning about five of them I realized that was all made-up, scammy marketing. These little cameras do not have the processing power to do any of that. Then you have to offload your images or video to a cloud service, which is not free, not good, and not secure. So I thought: hey, all of the parts and pieces are here. I have a GPU on a laptop and Linux, and a little more research turned up a PTZ camera that I could get status messages from, which is very important: you get absolute values and know exactly where the camera is positioned.
Publishing to YouTube also has its own quirks. I started with an AI narrator hooked into the stream via ChatGPT's gpt-4o model, narrating what the camera sees while the AI tracking runs. That system is included in the packages and is very useful, because it could also be used for many other deep-analysis tasks such as object classification and animal or vessel descriptions (what type of panther or boat is this, etc.). All in all it is a challenging process.
I'm pretty sure this software is flexible enough to monitor areas less complicated than the Miami River. I'm sure it can run well for a lot of applications, so I am interested to see what people come up with and how they use the software.
I tried to use the cheapest off-the-shelf PTZ camera I could find, and it was a disaster.
Most of them support RTSP in different ways that don't make sense. Their APIs are flaky and not reliable.
Major issues: zoom speed, digital zoom only via apps, very bad autofocus, RTSP reliability, API crashes, etc.
I finally settled on a Hikvision ColorVu AcuSense DS-2DE7A412MCG-EB 4MP Outdoor PTZ camera. The API is pretty good, the RTSP stream is rock solid, quality is good, and the PTZ control provides absolute position status. All these features allow it to be wired into AI pretty smoothly.
Failures:
- Foscam PTZ (Almost worked but slow focus and lack of API position data killed it)
- SUNBA PTZ (Bad for this type of thing)
- Reolink PTZ (Very bad for this type of project)
While this project uses Hikvision cameras for technical demonstration purposes, it's essential to acknowledge the serious ethical concerns surrounding this manufacturer. Hangzhou Hikvision Digital Technology Co., Ltd. is a Chinese state-owned surveillance equipment company that has been extensively documented as being directly complicit in human rights violations.
The Uyghur Human Rights Project has documented that Hikvision has secured lucrative contracts with Chinese authorities to supply, develop, and directly operate mass surveillance systems specifically targeting Uyghur and other ethnic minorities in East Turkistan (Xinjiang). These systems have been installed in and around internment camps, mosques, and schools across the region. After working with the camera and AI models, I can see how a specifically trained model could be used to target just about anything you want. So, technically, this is possible.
Investigations by IPVM have revealed that Hikvision developed facial recognition technology specifically designed to identify ethnic minorities, including what internal documents described as "Uyghur detection" capabilities. The company's technology is integral to China's Integrated Joint Operations Platform (IJOP), a predictive policing system used for mass surveillance and targeting of ethnic minorities.
The international community has responded with comprehensive sanctions. The United States has placed Hikvision on multiple sanctions lists, including the Entity List (2019), Chinese Military Companies list (2021), and banned their equipment from federal facilities under the National Defense Authorization Act. The UK, Australia, and several European countries have also implemented bans or restrictions on Hikvision equipment in government facilities.
While Hikvision cameras were selected for this project purely for their technical capabilities and API compatibility, I want to be transparent about these ethical concerns. The robustness and reliability that makes these cameras effective for this wildlife monitoring project are the same qualities that unfortunately make them effective tools for authoritarian surveillance. As this project evolves, I'm actively seeking partnerships with ethical camera manufacturers like Sony, Panasonic, or Bosch to migrate away from Hikvision hardware. If you represent an alternative camera manufacturer and would like to collaborate on integrating your APIs with this system, please reach out.
The camera pipeline is astoundingly fast. Capture takes about 33ms per frame, which matches the 30fps source, so it is not a bottleneck.
Capture: 30.1 fps (Read: 33.312905ms)
The process stage runs at an estimated 60fps: YOLO takes 12-13ms, and the tracking code and overlay rendering are sub-millisecond.
Process: 60.0 fps (YOLO: 12.728951ms, Track: 15.102µs)
The write stage takes about 1ms per frame (roughly 1000fps, more than a frame per millisecond), so it is not a bottleneck either.
Write: 30.0 fps (Write: 1.067839ms)
I had to create some buffer systems to handle the pipeline, and they seem to be working great. There were a lot of tricky memory-leak and frame-ordering issues to deal with; it kind of drove me crazy for a bit.
Buffers: FrameChan 1/120 (0.8%) | WriteQueue 0/120 (0.0%) | PendingFrames 1/120 (0.8%)
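For anyone curious what the drop-oldest buffering can look like in Go, here is a minimal sketch (not the actual pipeline code; it assumes a single capture goroutine is the only writer to the channel):
// Minimal sketch of a bounded frame buffer with a drop-oldest policy.
var frameChan = make(chan gocv.Mat, 120)

func enqueueFrame(frame gocv.Mat) {
	select {
	case frameChan <- frame: // fast path: buffer has room
	default:
		// Buffer full: drop the oldest frame and release its Mat to avoid leaks.
		select {
		case old := <-frameChan:
			old.Close()
		default:
		}
		frameChan <- frame // safe with a single producer goroutine
	}
}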
The most challenging bug was a subtle aspect ratio issue that took forever to debug. The camera produces 2688×1520 frames (1.768:1 aspect ratio), but YOLO expects square 832×832 input (1:1 aspect ratio). This mismatch caused detection boxes to appear in completely wrong locations.
The Initial Confusion: "Crop" vs "Letterbox"
I initially thought OpenCV's BlobFromImage crop parameter would handle this properly:
// What we THOUGHT would work (it doesn't):
blob := gocv.BlobFromImage(frame, 1.0/255.0, image.Pt(832, 832),
gocv.NewScalar(0, 0, 0, 0), true, false) // crop=false
The crop=false parameter was supposed to add letterbox padding, but it actually just stretches the image to fit 832×832, destroying the aspect ratio. YOLO would see a distorted, horizontally-compressed image and return bounding boxes for the stretched coordinate space.
The Solution: Manual Letterboxing
I implemented proper letterboxing by hand:
// STEP 1: Calculate letterbox dimensions
originalWidth := 2688.0  // Camera resolution (width)
originalHeight := 1520.0 // Camera resolution (height)
aspectRatio := originalWidth / originalHeight // ≈ 1.768
contentHeight := int(832.0 / aspectRatio)     // 470px of real content (preserves aspect ratio)
yOffset := (832 - contentHeight) / 2          // 181px black bars on top/bottom

// STEP 2: Create 832×832 black canvas
letterboxed := gocv.NewMatWithSize(832, 832, gocv.MatTypeCV8UC3)
letterboxed.SetTo(gocv.NewScalar(0, 0, 0, 0)) // Fill black

// STEP 3: Resize frame to 832×470 (preserves aspect ratio)
resized := gocv.NewMat()
gocv.Resize(frame, &resized, image.Pt(832, contentHeight), 0, 0, gocv.InterpolationLinear)

// STEP 4: Copy resized content to center of canvas (Y offset = 181px)
contentROI := letterboxed.Region(image.Rect(0, yOffset, 832, yOffset+contentHeight))
resized.CopyTo(&contentROI)

// STEP 5: Now we can safely use crop=true since image is pre-letterboxed
blob := gocv.BlobFromImage(letterboxed, 1.0/255.0, image.Pt(832, 832),
	gocv.NewScalar(0, 0, 0, 0), true, true) // crop=true is safe now
Coordinate Transformation Magic
The real complexity is transforming YOLO's detected bounding boxes back to original camera coordinates:
Camera Frame: 2688×1520 (landscape)
┌─────────────────────────┐
│ │
│ Camera View │ 1520px
│ │
└─────────────────────────┘
2688px
YOLO Input: 832×832 (square with letterbox)
┌─────────────────────────┐
│ BLACK BAR (181px) │ ← yOffset
├─────────────────────────┤
│ │
│ Content Area (470px) │ ← contentHeight
│ │
├─────────────────────────┤
│ BLACK BAR (181px) │ ← yOffset
└─────────────────────────┘
832px
Detection Transform:
1. YOLO returns: (xNorm, yNorm, wNorm, hNorm) in 0.0-1.0 range
2. Convert to 832×832 space: yPixel832 = yNorm * 832
3. Remove letterbox offset: yContentPixel = yPixel832 - 181
4. Scale to camera coords: finalY = yContentPixel * (1520/470)
This coordinate transformation ensures that a boat detected at the bottom of the letterboxed YOLO input correctly maps to the bottom of the original camera frame, not somewhere in the middle due to the black bars.
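Written out in Go, the transform above looks roughly like this (a sketch built from the constants in this section, not a copy of the production code):
// Letterbox-aware coordinate transform using the 2688x1520 -> 832x832 constants above.
const (
	camW, camH    = 2688.0, 1520.0
	yoloSize      = 832.0
	contentHeight = 470.0 // 832 / 1.768
	yOffset       = 181.0 // (832 - 470) / 2
)

// yoloToCamera maps a normalized YOLO box (0.0-1.0) back to camera pixel coordinates.
func yoloToCamera(xNorm, yNorm, wNorm, hNorm float64) (x, y, w, h float64) {
	// X axis: content spans the full 832px width, so a straight scale works.
	x = xNorm * camW
	w = wNorm * camW
	// Y axis: convert to 832-space, remove the letterbox offset, then scale 470px -> 1520px.
	y = (yNorm*yoloSize - yOffset) * (camH / contentHeight)
	h = hNorm * yoloSize * (camH / contentHeight)
	return
}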
Why This Was So Confusing
The crop parameter in OpenCV's BlobFromImage is misleadingly named. We thought crop=false meant "add letterbox padding," but it actually means "stretch to fit." The GPU calls weren't creating letterboxing - they were creating stretched, distorted images that made YOLO's coordinate system completely wrong.
FFmpeg was also a bit of a pain in the ass: the apt and brew versions are so old and out of date that I had to build it from source, and I still haven't gotten drawtext working in 7.x. So there went the narrator AI feed, though I was never sure how much value it added.
Another issue was using nginx to publish the stream to YouTube. I had to use the -async 1 option, it still wasn't working well, so I added -vsync passthrough on top of that. In the end I decided nginx was terrible at RTMP and switched the program over to SRS, which is much better at handling this type of stream and feed management.
I still have an issue where VLC attaches to the RTMP feed and hits some sort of ordering problem. I suspect there's a frame-reordering bug somewhere in my pipeline, where frame 10 goes out before frame 9 or something like that, which makes VLC freak out. I'm not sure if the problem is VLC or my pipeline.
Moving from the CPU to the GPU was a huge win, but it came with its own issues on NVIDIA and Apple Silicon. NVIDIA is clearly far better supported on Linux than open source tools like FFmpeg are on Apple Silicon, even on native macOS. For example, with the "slower" (highest quality) encode preset, Apple Silicon could never really reach 1:1 real-time speed, whereas NVIDIA, after some of its own internal workings, does hold a 1:1 frame rate in real time. That matters for publishing to YouTube and keeping the stream looking good.
I also did not take the time to use the on/off (continuous) movements, which are much smoother than absolute moves, and I really don't have the time to do that right now. I also could not get the Hikvision camera to manage speed dynamically because the API does some strange shit. If anyone at Hikvision can help, please shoot me a note.
┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐ ┌──────────────┐
│ RTSP Camera │───▶│ captureFrames│───▶│ frameChan │───▶│ writeFrames │
│ (30fps 1440p) │ │ Goroutine │ │ (120 buffer) │ │ Goroutine │
└─────────────────┘ └──────────────┘ └─────────────────┘ └──────────────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Frame Valid │ │ YOLO Detect │
│ 8UC3 Check │ │ & Process │
└─────────────┘ └─────────────┘
│
▼
┌─────────────────┐ ┌─────────────┐ ┌─────────────────┐ ┌─────────────┐
│ FFmpeg Output │◀───│ Overlay │◀───│ Spatial Track │◀───│ P1/P2 Filter│
│ (RTMP Stream) │ │ Rendering │ │ Integration │ │ & Classify │
└─────────────────┘ └─────────────┘ └─────────────────┘ └─────────────┘
┌─────────────────┐
│ Raw Frame (Mat) │
└─────────┬───────┘
│
▼
┌─────────────────┐ NO ┌─────────────┐
│ Frame Valid? │────────▶│ Drop Frame │
│ (8UC3, 3-chan) │ └─────────────┘
└─────────┬───────┘
│ YES
▼ ┌─────────────────┐
┌─────────────────┐ │ Pre-Overlay │◀─── -pre-overlay-jpg
│ Clone for │ │ JPEG Save │ (LOCK state only)
│ Processing │ │ timestamp_pre- │
└─────────┬───────┘ │ overlay_det_N │
│ └─────────────────┘
▼
┌─────────────────┐
│ Status Overlay │◀──── -status-overlay flag
│ (FPS, Mode, │
│ Time, etc.) │
└─────────┬───────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ YOLO Blob │ │ 4:3 Aspect │
│ Creation │───────▶│ Ratio Fix │
│ (640x640) │ │ (Black Borders) │
└─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ net.SetInput() │ │ CUDA/CPU │
│ net.Forward() │───────▶│ Acceleration │
│ (12-13ms) │ │ Auto-Detect │
└─────────┬───────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ P1/P2 Class │ P1 │ Always Valid │
│ Filtering │───────▶│ (boat, kayak) │
│ (isP1/isP2Obj) │ └─────────────────┘
└─────────┬───────┘
│ P2 ┌─────────────────┐
└────────────────▶│ Valid Only in │
│ Tracking Mode │
│ (person, etc.) │
└─────────────────┘
│
▼
┌─────────────────┐
│ Confidence │
│ Threshold │
│ (>0.3 P1, │
│ >0.5 P2) │
└─────────┬───────┘
│
▼
┌─────────────────────────────────────────────────┐
│ SPATIAL INTEGRATION │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │updateAll │ │detectP2ObjectsInP1Targets│ │
│ │Boats() │ │(Enhancement Detection) │ │
│ └─────────────┘ └──────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │cleanupLost │ │feedLockedBoatsWithCluster│ │
│ │Boats() │ │Detections() │ │
│ └─────────────┘ └──────────────────────────┘ │
│ │ │ │
│ └──────┬────────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │selectTarget │ │
│ │Boat() │ │
│ └─────────────┘ │
└─────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ PTZ CONTROL │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │Camera State │ │Rate Limiting │ │
│ │Manager │ │(Command Timing) │ │
│ └─────────────┘ └──────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │Position │ │Hikvision HTTP API │ │
│ │Validation │ │(XML Commands) │ │
│ │& Clamping │ │ │ │
│ └─────────────┘ └──────────────────────────┘ │
└─────────────────┬───────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ OVERLAY RENDERING │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │Target Box │ │P2 Enhancement Indicators │ │
│ │Drawing │ │(Person icons, etc.) │ │
│ └─────────────┘ └──────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │PIP Zoom │ │Direction Arrows │ │
│ │Display │ │(Predictive Tracking) │ │
│ └─────────────┘ └──────────────────────────┘ │
└─────────────────┬────────────────────────────────┘
│
▼ ┌─────────────────┐
┌─────────────────┐ │ Post-Overlay │◀─── -post-overlay-jpg
│ Final Frame │────────────────────▶│ JPEG Save │ (LOCK state only)
│ (with overlays) │ │ timestamp_post- │ to organized dirs
└─────────┬───────┘ │ overlay_det_N │ /path/YYYY-MM-DD_HHPM/
│ └──────────────────┘
▼
┌─────────────────┐
│ FFmpeg Write │
│ (1ms latency) │
└─────────┬───────┘
│
▼
┌─────────────────┐
│ RTMP Stream │
│ (YouTube, SRS) │
└─────────────────┘
P1 OBJECTS (Primary Tracking Targets)
┌─────────────────────────────────────────────────────────────┐
│ boat, kayak, surfboard, etc. (-p1-track="boat,kayak") │
└──────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ Always Tracked │───▶│ Can Achieve │
│ (conf > 0.3) │ │ LOCK Status │
└─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Builds History │ │ Camera Follows │
│ (2+ = LOCK) │ │ (PTZ Control) │
│ (24+ = SUPER) │ │ │
└─────────────────┘ └─────────────────┘
P2 OBJECTS (Enhancement Objects)
┌─────────────────────────────────────────────────────────────┐
│ person, backpack, bottle, etc. (-p2-track="person,all") │
└──────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ Only During │───▶│ Only INSIDE │
│ Tracking Mode │ │ Locked P1 Boxes │
│ (conf > 0.5) │ │ │
└─────────────────┘ └─────────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ Priority Bonus │ │ Enhanced │
│ (+0.7 score) │ │ Targeting │
│ │ │ (Centroid) │
└─────────────────┘ └─────────────────┘
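The +0.7 enhancement bonus shown above can be pictured as a simple scoring function. This is only an illustration: the struct, field names, and the history weighting are assumptions, not necessarily how selectTargetBoat() actually weighs candidates; only the +0.7 bonus comes from this README and the debug logs.
// Illustrative target scoring: P1 candidates containing P2 objects get a +0.7 bonus.
type Candidate struct {
	Confidence float64 // YOLO confidence for the P1 object
	P2Count    int     // P2 objects detected inside its bounding box
	Detections int     // consecutive detection history
}

func targetScore(c Candidate) float64 {
	score := c.Confidence
	if c.P2Count > 0 {
		score += 0.7 // enhancement bonus for carrying P2 objects
	}
	// Favor mature targets; this particular weighting is invented for illustration.
	if c.Detections > 24 {
		c.Detections = 24
	}
	score += float64(c.Detections) / 24.0
	return score
}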
LOCK PROGRESSION
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ BUILDING │───▶│ LOCK │───▶│ SUPER LOCK │
│ (0-1 det) │ │ (2-23 det) │ │ (24+ det) │
│ │ │ │ │ │
│ No PTZ │ │ PTZ Follow │ │ Enhanced │
│ No PIP │ │ PIP Enabled │ │ PIP Zoom │
│ │ │ │ │ Predictive │
└─────────────┘ └─────────────┘ └─────────────┘
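A hedged sketch of how those thresholds translate into code (the type and function names are illustrative; only the 2-detection and 24-detection thresholds come from this README):
// LockState mirrors the BUILDING -> LOCK -> SUPER LOCK progression above.
type LockState int

const (
	Building  LockState = iota // 0-1 detections: no PTZ, no PIP
	Locked                     // 2-23 detections: PTZ follows, PIP enabled
	SuperLock                  // 24+ detections: enhanced PIP zoom, predictive tracking
)

func lockStateFor(detections int) LockState {
	switch {
	case detections >= 24:
		return SuperLock
	case detections >= 2:
		return Locked
	default:
		return Building
	}
}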
┌─────────────────────────────────────────────────────────────┐
│ GOROUTINE STRUCTURE │
└─────────────────────────────────────────────────────────────┘
Main Thread
│
├─── captureFrames() Goroutine
│ │
│ ├─── webcam.Read() ────┐
│ ├─── Frame Validation │
│ └─── frameChan <- data ─┼─── 120 Buffer Channel
│ │
│ │
└─── writeFrames() Goroutine◀┘
│
├─── YOLO Processing (12-13ms)
├─── Spatial Tracking (<1ms)
├─── Overlay Rendering (<1ms)
└─── FFmpeg Write (1ms)
┌─────────────────────────────────────────────────────────────┐
│ MEMORY TRACKING │
└─────────────────────────────────────────────────────────────┘
Mat Allocation Points:
├─── trackMatAlloc("capture") - Frame capture
├─── trackMatAlloc("buffer") - Frame cloning
├─── trackMatAlloc("yolo") - YOLO blob & output
└─── trackMatClose() calls - Explicit cleanup
Buffer Management:
├─── frameChan: 120 frames max
├─── writeQueue: 120 frames max
├─── pendingFrames: 120 max
└─── Drop policy: Oldest first
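The trackMatAlloc/trackMatClose helpers themselves aren't shown in this README; a minimal version of the idea might look like this (an assumption, not the actual implementation; uses "sync" and "sync/atomic"):
// Per-label counters of live gocv.Mat objects; a non-zero residual at shutdown
// points at a leak in that part of the pipeline.
var matCounts sync.Map // label -> *int64

func trackMatAlloc(label string) {
	v, _ := matCounts.LoadOrStore(label, new(int64))
	atomic.AddInt64(v.(*int64), 1)
}

func trackMatClose(label string, m *gocv.Mat) {
	m.Close()
	if v, ok := matCounts.Load(label); ok {
		atomic.AddInt64(v.(*int64), -1)
	}
}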
Most computer vision projects you see online use fixed cameras - a single static view where YOLO detects objects in a consistent scene. This is relatively straightforward: train on one perspective, optimize for one lighting condition, done.
This project tackles something exponentially more complex: real-time AI on a fully articulated PTZ camera system.
Our camera system operates across:
- Pan: 0-3599 positions (3,600 unique angles)
- Tilt: 0-900 positions (901 unique angles)
- Zoom: 10-120 levels (111 unique magnifications)
Total unique camera positions: 3,600 × 901 × 111 = 360,039,600
At 2600×1426 resolution, this creates ~1.34 quadrillion unique pixel-position combinations that the system must understand spatially.
Fixed Camera Approach:
- ✅ Single scene, consistent perspective
- ✅ Static object relationships
- ✅ Simple "digital zoom" on existing pixels
- ✅ One-time calibration
PTZ Camera Reality (This Project):
- 🔥 360+ million different scenes based on camera position
- 🔥 Dynamic spatial relationships - same object appears different at each zoom/angle
- 🔥 Real-time coordinate transformation between pixel space and physical PTZ coordinates
- 🔥 Continuous recalibration as camera moves through 3D space
- 🔥 Predictive tracking across position transitions
- 🔥 Multi-scale object detection - boats look different at 1x vs 12x zoom
While most PTZ projects are simple "move camera toward detected object" systems, this project solves the true spatial intelligence problem:
- Spatial Coordinate System: Translates between pixel coordinates and real-world PTZ positions with mathematical precision
- Multi-Scale Intelligence: Maintains object identity across zoom levels (a boat at 1x zoom vs 12x zoom)
- Predictive Movement: Anticipates where objects will be based on motion vectors
- Production-Grade Engineering: Handles the 360+ million position combinations reliably
Hoped-for result: A system that doesn't just "follow objects" but truly understands 3D space through a moving camera. It's really not working exactly how I want yet, but it's a foundation for a new kind of surveillance system.
Okay on to the software...
- P1 Objects (Primary): Main tracking targets that can achieve LOCK status
- P2 Objects (Enhancement): Objects detected inside locked P1 targets for enhanced tracking
- Configurable at runtime with -p1-track and -p2-track flags
- Spatial Tracking: Real-world coordinate tracking with camera position awareness
- Predictive Tracking: Anticipates object movement when temporarily lost
- LOCK/SUPER LOCK Modes: Progressive tracking confidence levels (2+ → 24+ detections)
- Picture-in-Picture (PIP): Automatic zoom on locked targets with P2 objects
- Recovery Mode: Smart re-acquisition of lost targets
- Smart Camera Movement: Smooth tracking with velocity compensation
- Scanning Patterns: Customizable area scanning when no targets detected
- Position Limits: Safety boundaries to prevent mechanical damage
- State Management: Coordinated camera commands with latency compensation
- YOLO/ONNX Integration: Local AI inference (no API calls)
- GPU Acceleration: CUDA support for enhanced performance
- Multi-threaded Pipeline: Capture → Process → Track → Control
- Memory Management: Efficient OpenCV Mat handling with leak detection
- Verbose Debug Mode: -debug-verbose for detailed calculations
- JPEG Frame Saving: Auto-organized by date/hour (/path/2025-01-01_03PM/)
- Granular Overlay Control: Enable/disable specific overlay elements
- Performance Monitoring: Real-time FPS and latency tracking
Most AI camera tracking projects online are proof-of-concept Python scripts that:
- Run basic YOLO detection on static cameras
- Use simple "move camera toward object" logic
- Lack production reliability and error handling
- Process 1-5 FPS with basic tracking
This project is production-grade software that delivers:
- Go-based implementation: Higher performance and concurrency than Python alternatives
- Real-time processing: 30 FPS with sub-100ms latency
- Comprehensive error recovery: Handles camera disconnects, network issues, and tracking failures
- Production monitoring: Memory leak detection, performance metrics, debug infrastructure
- Coordinate transformation mathematics: Pixel→PTZ position mapping with calibration
- Multi-scale object persistence: Maintains identity across 1x→12x zoom transitions
- Predictive tracking algorithms: Kalman filters and spatial integration
- State machine progression: SEARCH→TRACK→LOCK→SUPER LOCK modes
- Local real-time AI: YOLO inference for immediate responses
- Cloud reasoning integration: Ready for GPT-4V scene analysis
- Cost-optimized: Avoids $200k+ monthly API costs of pure cloud solutions
- Edge-to-cloud intelligence: Best of both worlds
- Runtime configuration: No code changes needed for different environments
- Safety systems: Hardware position limits and emergency stops
- Industrial reliability: Designed for 24/7 operation
- Scalable architecture: Multi-camera network support foundation
The bottom line: While others build demos, this project solves the real engineering challenges of autonomous camera systems - the foundation for next-generation surveillance, security, and monitoring applications.
Run ./NOLO -h to see all available options:
Usage of ./NOLO:
-YOLOdebug
Save YOLO input blob images to /tmp/YOLOdebug/ for analysis
-debug
Enable debug mode with overlay and detailed tracking logs
-debug-verbose
Enable verbose debug output (includes detailed YOLO, calibration, and tracking calculations)
-exit-on-first-track
Exit after first successful target lock (useful for debugging single track sessions)
-input string
RTSP input stream URL (required)
Example: rtsp://admin:password@192.168.1.100:554/Streaming/Channels/201
-jpg-path string
Directory path for saving JPEG frames (required when using JPEG flags)
-maskcolors string
Comma-separated hex colors to mask out (e.g., 6d9755,243314)
-masktolerance int
Color tolerance for masking (0-255, default: 50) (default 50)
-max-pan float
Maximum pan position in camera units (omit flag for hardware maximum)
Example: -max-pan=3000 prevents panning right of position 3000 (default -1)
-max-tilt float
Maximum tilt position in camera units (omit flag for hardware maximum)
Example: -max-tilt=900 prevents tilting too high (default -1)
-max-zoom float
Maximum zoom level in camera units (omit flag for hardware maximum)
Example: -max-zoom=120 prevents zooming above 12x (default -1)
-min-pan float
Minimum pan position in camera units (omit flag for hardware minimum)
Example: -min-pan=1000 prevents panning left of position 1000 (default -1)
-min-tilt float
Minimum tilt position in camera units (omit flag for hardware minimum)
Example: -min-tilt=0 prevents tilting below horizon (default -1)
-min-zoom float
Minimum zoom level in camera units (omit flag for hardware minimum)
Example: -min-zoom=10 prevents zooming below 1x (default -1)
-p1-track string
Priority 1 tracking objects (comma-separated) - primary targets that can achieve LOCK
Example: -p1-track="boat,surfboard,kayak" (default "boat")
-p2-track string
Priority 2 tracking objects (comma-separated, or 'all') - enhancement objects detected inside locked P1 targets
Example: -p2-track="person,backpack" or -p2-track="all" (default "person")
-pip-zoom
Enable Picture-in-Picture zoom display of locked targets (default: true) (default true)
-post-overlay-jpg
Save frames after overlay processing (requires -jpg-path)
-pre-overlay-jpg
Save frames before overlay processing (requires -jpg-path)
-ptzinput string
PTZ camera HTTP URL (required)
Example: http://admin:pass@192.168.0.59:80/
-status-overlay
Show status information overlay (time, FPS, mode) in lower-left corner
-target-overlay
Show tracking and targeting overlays (bounding boxes, paths, object info)
-terminal-overlay
Show debug terminal overlay (real-time messages) in upper-left corner
# Default behavior (boats with people enhancement)
./NOLO -input rtsp://... -ptzinput http://...
# Track boats and kayaks, enhance with people and backpacks
./NOLO -input rtsp://... -ptzinput http://... \
-p1-track="boat,kayak" \
-p2-track="person,backpack"
# Track surfboards, enhance with ANY detected object
./NOLO -input rtsp://... -ptzinput http://... \
-p1-track="surfboard" \
-p2-track="all"
# Multiple P1 objects with specific P2 enhancement
./NOLO -input rtsp://... -ptzinput http://... \
-p1-track="boat,surfboard,kayak" \
-p2-track="person,bottle,backpack"
# Save clean frames (before overlays)
./NOLO -input [URL] -ptzinput [URL] -jpg-path=/tmp/clean -pre-overlay-jpg
# Save processed frames (with all overlays)
./NOLO -input [URL] -ptzinput [URL] -jpg-path=/tmp/processed -post-overlay-jpg
# Save both types for complete analysis
./NOLO -input [URL] -ptzinput [URL] -jpg-path=/tmp/analysis -pre-overlay-jpg -post-overlay-jpg
# Debug mode with JPEG saving
./NOLO -input [URL] -ptzinput [URL] -debug -jpg-path=/tmp/debug -post-overlay-jpg
# Files are automatically organized into subdirectories: /path/2025-01-01_03PM/
# Clean view (no overlays)
./NOLO -input [URL] -ptzinput [URL]
# Status info only (lightweight monitoring)
./NOLO -input [URL] -ptzinput [URL] -status-overlay
# Target tracking visualization
./NOLO -input [URL] -ptzinput [URL] -target-overlay
# Terminal-style overlay display
./NOLO -input [URL] -ptzinput [URL] -terminal-overlay
# Picture-in-Picture zoom display
./NOLO -input [URL] -ptzinput [URL] -pip-zoom
# Full debug mode with all overlays
./NOLO -input [URL] -ptzinput [URL] -debug -status-overlay -target-overlay -terminal-overlay -pip-zoom
# Debug mode (clean output)
./NOLO -input [URL] -ptzinput [URL] -debug
# Verbose debug mode (includes detailed YOLO, calibration, and tracking calculations)
./NOLO -input [URL] -ptzinput [URL] -debug -debug-verbose
# YOLO analysis (save detection inputs)
./NOLO -input [URL] -ptzinput [URL] -debug -YOLOdebug
# Single track debugging (exit after first lock)
./NOLO -input [URL] -ptzinput [URL] -debug -exit-on-first-track
# PTZ movement limits (camera coordinate units)
-min-pan=1000 -max-pan=3000 # Pan boundaries
-min-tilt=0 -max-tilt=900 # Tilt boundaries
-min-zoom=10 -max-zoom=120 # Zoom boundaries
# Color masking for water removal
-maskcolors="6d9755,243314" # Mask water colors for better detection
-masktolerance=50 # Color tolerance (0-255)
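Here is a rough sketch of how hex-color masking with a tolerance might be applied before detection. The gocv calls are real, but the function shape and the exact masking strategy NOLO uses are assumptions (uses "strconv"; frames from the RTSP capture are BGR-ordered):
// applyColorMask zeroes out pixels within tolerance of the given hex colors
// (e.g. "6d9755" for water), so YOLO spends less attention on them.
func applyColorMask(frame gocv.Mat, hexColors []string, tol float64) gocv.Mat {
	masked := frame.Clone()
	black := gocv.NewMatWithSize(frame.Rows(), frame.Cols(), frame.Type())
	black.SetTo(gocv.NewScalar(0, 0, 0, 0))
	defer black.Close()
	for _, hex := range hexColors {
		if len(hex) < 6 {
			continue // skip malformed entries
		}
		r, _ := strconv.ParseUint(hex[0:2], 16, 8)
		g, _ := strconv.ParseUint(hex[2:4], 16, 8)
		b, _ := strconv.ParseUint(hex[4:6], 16, 8)
		lower := gocv.NewScalar(float64(b)-tol, float64(g)-tol, float64(r)-tol, 0)
		upper := gocv.NewScalar(float64(b)+tol, float64(g)+tol, float64(r)+tol, 0)
		mask := gocv.NewMat()
		gocv.InRangeWithScalar(frame, lower, upper, &mask)
		black.CopyToWithMask(&masked, mask) // paint matching pixels black
		mask.Close()
	}
	return masked
}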
- Go 1.19+
- OpenCV 4.x with Go bindings (gocv)
- YOLO Model: YOLOv8n ONNX format (models/yolov8n.onnx)
- PTZ Camera: Hikvision-compatible HTTP API
- RTSP Stream: Camera video feed
- Optional: CUDA for GPU acceleration
This system supports multiple YOLO architectures for maximum flexibility:
Model | Type | Performance | Use Case |
---|---|---|---|
YOLOv3-tiny | OpenCV .weights | Fast, lightweight | CPU-optimized real-time detection |
YOLOv8n | ONNX | Balanced speed/accuracy | GPU acceleration, higher precision |
The system can detect and track 80 different object types from the COCO dataset:
🚢 Vehicles & Transportation:
- boat, car, bicycle, motorbike, aeroplane, bus, train, truck
👥 People & Accessories:
- person, backpack, umbrella, handbag, tie, suitcase
🏠 Furniture & Indoor Objects:
- chair, sofa, bed, diningtable, toilet, tvmonitor, laptop, mouse, remote, keyboard
🍕 Food & Kitchen:
- bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, microwave, oven, toaster, sink, refrigerator
🐕 Animals:
- bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe
⚽ Sports & Recreation:
- frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket
🌱 Plants & Environment:
- pottedplant
🛠️ Tools & Household:
- scissors, hair drier, toothbrush, clock, vase, book, teddy bear
🚦 Infrastructure:
- traffic light, fire hydrant, stop sign, parking meter, bench
📱 Electronics:
- cell phone
Required Files:
models/
├── yolov8n.onnx # YOLOv8 nano model (12MB)
├── yolov3-tiny.weights # YOLOv3-tiny weights (34MB)
├── yolov3-tiny.cfg # YOLOv3-tiny configuration
└── coco.names # 80 object class names
Download Links:
- YOLOv8 Models: Ultralytics YOLOv8 Releases
- YOLOv3-tiny: Official YOLO Weights
- COCO Names: COCO Class Labels
YOLOv3-tiny:
- ✅ CPU-optimized: Runs efficiently on modest hardware
- ✅ Low latency: ~15-30ms inference time
- ✅ Small model: 34MB weights file
⚠️ Trade-off: Slightly lower accuracy than full models
YOLOv8n (nano):
- ✅ GPU-accelerated: Leverages CUDA when available
- ✅ Better accuracy: Improved detection performance
- ✅ ONNX format: Cross-platform compatibility
⚠️ Requires: More computational resources
The system processes video at 832×832 resolution for YOLO inference with smart letterboxing to maintain aspect ratios. Detection confidence thresholds and Non-Maximum Suppression (NMS) are automatically optimized for river monitoring scenarios.
Priority Object Configuration:
# Default: Track boats, enhance with people
-p1-track="boat" -p2-track="person"
# Maritime focus: Multiple watercraft types
-p1-track="boat,surfboard,kayak" -p2-track="person,backpack"
# Complete flexibility: Any object can be primary/secondary
-p1-track="car,truck,bus" -p2-track="person,bicycle"
- Clone the repository:
git clone https://github.com/doxx/NOLOcam.git
cd NOLOcam
- Install dependencies:
go mod tidy
go mod download
- Set up YOLO models:
  - Download YOLOv8n: Place yolov8n.onnx in the models/ directory
  - Download YOLOv3-tiny: Place yolov3-tiny.weights and yolov3-tiny.cfg in the root directory
  - Ensure coco.names contains the 80 COCO class labels (see the YOLO section above for links)
- Configure scanning pattern (required):
  - Edit scanning.json with your camera's scan positions
  - Define areas of interest for systematic monitoring
- Build the application:
go build .
The AI commentator system represents one of the most underutilized but potentially powerful components of this project. While currently implemented as entertainment (a sarcastic pirate making fun of Miami yacht owners), it demonstrates sophisticated AI-camera integration that could revolutionize autonomous surveillance systems.
The ai_commentary standalone application embodies "Captain BlackEye," a pirate trapped inside the camera who provides running commentary about what he sees on the Miami River:
// The Captain BlackEye persona system prompt (excerpt):
You were once a pirate named Captain BlackEye, and somehow you got trapped inside of a camera.
Technical Implementation:
- Direct camera capture via Hikvision ISAPI with digest authentication
- GPT-4 vision analysis of captured JPEG images every 30 seconds (configurable)
- Conversation memory maintains context across 10 previous exchanges
- Atomic file writes to /tmp/commentary.txt for FFmpeg integration
- Robust error handling with network timeouts and retry logic
# Captain BlackEye in action:
./ai_commentary
💬 Captain BlackEye says: "Ahoy! Another rich asshole in a yacht that's
bigger than my entire pirate ship ever was. Look at 'em, probably doesn't
even know port from starboard. Miami's finest maritime morons, I tell ya!"
This system scratches the surface of what's possible when you combine computer vision with large language models. The current implementation could be transformed into a sophisticated AI director for autonomous camera operations:
Instead of just making jokes, the AI could analyze scenes and direct camera movement:
// Potential enhancement:
type SceneAnalysis struct {
InterestingObjects []DetectedObject
RecommendedFocus CameraPosition
TrackingPriority int
SceneDescription string
}
// AI could say: "I see a rare manatee at coordinates (1200, 800).
// Suggest zooming to 8x and panning 15 degrees east for optimal tracking."
The AI could analyze object behavior patterns and predict movement:
// Enhanced detection integration:
"I've been watching this boat for 3 minutes. Based on its trajectory
and speed, it will reach the bridge in 90 seconds. Recommend switching
to wide-angle view to capture the entire passage sequence."
The AI could rate scenes in real-time and guide the tracking system to focus on the most interesting events:
type InterestScore struct {
WildlifeActivity int // Manatees, dolphins, pelicans
BoatActivity int // Unusual vessels, accidents, interactions
EnvironmentalEvent int // Weather changes, tidal effects
OverallPriority int // Combined weighted score
}
Integration with weather APIs, tide data, and historical patterns:
"Storm approaching from the southeast. Recommend lowering zoom to 2x
and focusing on the marina - boats will start moving to shelter in
approximately 20 minutes based on historical patterns."
The standalone design isolates the AI from the main detection pipeline, which limits its potential:
// Current: Isolated system
ai_commentary -> Camera -> GPT-4 -> Text File -> FFmpeg
// Potential: Integrated intelligence
Main Pipeline -> Detection Results -> AI Analysis -> Camera Commands
\-> Scene Understanding -> Tracking Priorities -> PTZ Control
Real-Time Detection Enhancement:
// AI could enhance detection confidence:
"I see what YOLO detected as a 'boat' but based on shape and movement
patterns, this appears to be a manatee. Adjusting classification and
switching to wildlife tracking mode."
Proactive Scene Management:
// AI could anticipate interesting events:
"Multiple boats converging near the sandbar. Historical data shows
this often leads to interesting interactions. Recommend pre-positioning
camera at coordinates (2400, 1100) with 6x zoom."
Contextual Understanding:
// AI provides rich context beyond basic object detection:
"Large yacht 'Sea Demon' (identified by hull markings) is the same
vessel that had mechanical issues here last month. Captain appears
to be showing off again - watch for potential comedy gold."
This represents the future of AI-augmented surveillance - not just detecting objects, but understanding scenes, predicting behavior, and making intelligent decisions about what deserves attention. Captain BlackEye may be making jokes about yacht owners now, but the underlying technology could power everything from wildlife research to security systems that truly understand what they're watching.
The gap between "AI that detects boats" and "AI that understands maritime behavior" is exactly what this commentator system could bridge. It's not just computer vision anymore - it's computer comprehension.
For accurate PTZ tracking, the system must understand the precise relationship between pixel movement on screen and actual camera motor movement. This relationship changes dramatically with zoom level - at 10x zoom, moving 100 pixels might require 20 camera units, but at 120x zoom, the same 100 pixels might only require 3 camera units.
Tool: calibration/hand_calibrator/
How it works:
- Interactive guidance through PTZ positioning
- User manually positions camera to align objects at screen edges
- Records precise camera coordinates for left→right (pan) and top→bottom (tilt) movements
- Calculates pixels-per-PTZ-unit for each zoom level
Process:
cd calibration/hand_calibrator
go build -o hand_calibrator
./hand_calibrator
The tool guides you through:
- Pan Calibration: Align object to left edge → record position → pan to right edge → record position
- Tilt Calibration: Align object to top edge → record position → tilt to bottom edge → record position
- Multi-Zoom: Repeat process at different zoom levels (10x, 20x, 30x, etc.)
Output: Precise conversion ratios like 15.27 pixels per pan unit at 50x zoom
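The ratio itself falls out of the two recorded positions. A sketch of the arithmetic, assuming the 2688px frame width from earlier (this is not the tool's actual code):
// panPixelsPerUnit: how many pixels of on-screen travel correspond to one pan unit
// at the zoom level being calibrated. 2688 is the full frame width in pixels.
func panPixelsPerUnit(leftPanPos, rightPanPos float64) float64 {
	unitsTraversed := math.Abs(rightPanPos - leftPanPos)
	return 2688.0 / unitsTraversed
}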
Tool: calibration/scanning_recorder/
Purpose: Creates predefined camera positions for automated scanning mode
How it works:
- Manual positioning: Move camera to desired scan positions
- Record coordinates: Press ENTER to save each position with name and dwell time
- Pattern generation: Creates scanning.json with complete movement sequence
Process:
cd calibration/scanning_recorder
go build -o scanning_recorder
./scanning_recorder
Interactive workflow:
🎯 Position camera at "river_left" → ENTER → Set dwell time: 15 seconds
🎯 Position camera at "bridge_center" → ENTER → Set dwell time: 10 seconds
🎯 Position camera at "dock_area" → ENTER → Set dwell time: 20 seconds
Type 'done' → Saves to scanning.json
Tool: calibration/pixelinches/
Purpose: Enables speed calculations, boat length measurements, and distance estimation
How it works:
- Reference distance: Uses known 369-inch reference object in camera view
- Multi-zoom capture: Takes photos at all zoom levels (10x-120x)
- Manual measurement: User measures pixels for the 369-inch reference in each photo
- Calibration calculation: Generates pixels-per-inch ratios for each zoom level
Process:
cd calibration/pixelinches
go run pixelinches.go
For each zoom level:
- Camera automatically sets zoom level
- Captures photo (Z10.jpg, Z20.jpg, etc.)
- User opens photo, measures the 369-inch reference in pixels
- System calculates conversion: pixels ÷ 369 inches = pixels-per-inch
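Once those ratios exist, real-world sizing is a one-liner. A hypothetical example of how a boat length estimate could use them (the function and map names are illustrative):
// estimateLengthFeet converts a detection's pixel width into feet using the
// per-zoom pixels-per-inch calibration produced by the pixelinches tool.
func estimateLengthFeet(boxWidthPx float64, zoom int, pixelsPerInch map[int]float64) float64 {
	ppi, ok := pixelsPerInch[zoom]
	if !ok || ppi == 0 {
		return 0 // no calibration recorded for this zoom level
	}
	return (boxWidthPx / ppi) / 12.0
}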
PanPixelsPerUnit: map[int]float64{
10: 4.87, // 4.87 pixels per pan unit at 10x zoom
20: 7.91, // 7.91 pixels per pan unit at 20x zoom
30: 10.38, // 10.38 pixels per pan unit at 30x zoom
40: 12.74, // 12.74 pixels per pan unit at 40x zoom
50: 15.27, // 15.27 pixels per pan unit at 50x zoom
60: 18.41, // 18.41 pixels per pan unit at 60x zoom
70: 19.91, // 19.91 pixels per pan unit at 70x zoom
80: 21.68, // 21.68 pixels per pan unit at 80x zoom
90: 26.10, // 26.10 pixels per pan unit at 90x zoom
100: 27.71, // 27.71 pixels per pan unit at 100x zoom
110: 30.20, // 30.20 pixels per pan unit at 110x zoom
120: 36.32, // 36.32 pixels per pan unit at 120x zoom
}
// Convert pixel detection to camera movement
panPixelsPerUnit := st.InterpolatePanCalibration(currentZoom)
tiltPixelsPerUnit := st.InterpolateTiltCalibration(currentZoom)
// Calculate required camera adjustment
panAdjustment := float64(pixelOffsetX) / panPixelsPerUnit
tiltAdjustment := float64(pixelOffsetY) / tiltPixelsPerUnit
// Result: Precise camera movement to center detected object
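The InterpolatePanCalibration call above presumably falls back to interpolating between the recorded zoom levels. A hedged sketch of that idea (the actual method isn't shown in this README; uses "sort"):
// interpolatePanCalibration returns pixels-per-pan-unit for an arbitrary zoom level
// by linearly interpolating between the calibrated points.
func interpolatePanCalibration(cal map[int]float64, zoom int) float64 {
	if v, ok := cal[zoom]; ok {
		return v // exact calibrated zoom level
	}
	zooms := make([]int, 0, len(cal))
	for z := range cal {
		zooms = append(zooms, z)
	}
	sort.Ints(zooms)
	if zoom <= zooms[0] {
		return cal[zooms[0]] // clamp below the calibrated range
	}
	for i := 1; i < len(zooms); i++ {
		if zoom < zooms[i] {
			lo, hi := zooms[i-1], zooms[i]
			t := float64(zoom-lo) / float64(hi-lo)
			return cal[lo] + t*(cal[hi]-cal[lo])
		}
	}
	return cal[zooms[len(zooms)-1]] // clamp above the calibrated range
}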
Poor calibration → Objects drift off-screen, tracking failures
Good calibration → Smooth tracking, objects stay centered
Excellent calibration → Rock-solid lock, imperceptible tracking adjustments
Tool | Output File | Purpose |
---|---|---|
Hand Calibrator | PTZ-master-calibration.json | Pixel↔PTZ unit conversion ratios |
Scanning Recorder | scanning.json | Automated scanning positions |
Pixel-Inches | pixels-inches-cal.json | Real-world measurement conversions |
Each calibration method generates JSON files that the tracking system loads for precise spatial calculations.
Professional live streaming requires rock-solid reliability - any interruption, freeze, or crash results in lost viewers and revenue. Traditional FFmpeg scripts often fail silently, hang indefinitely, or crash without recovery. The NOLO broadcast system solves this with enterprise-grade monitoring and automatic recovery.
The broadcast system combines two video sources into a single professional stream:
- Primary Video: RTMP stream from NOLO tracking system (rtmp://192.168.0.12/live/stream)
- Audio Source: Direct RTSP feed from camera with audio (rtsp://admin:password@camera/audio)
- AI Commentary: Dynamic text overlay from /tmp/commentary.txt (live AI analysis)
Different encoding configurations for various hardware setups:
Different encoding configurations for various hardware setups:
Config File | Purpose | Encoding | Use Case |
---|---|---|---|
broadcast_config.json | CPU Encoding | libx264 | Standard servers, development |
broadcast_config_nvidia.json | GPU Encoding | h264_nvenc | Production with NVIDIA GPU |
broadcast_config_nvidia_nodrawtext.json | GPU No Overlay | h264_nvenc | Raw feed without AI commentary |
broadcast_config_darwin.json | macOS | Platform-optimized | Apple development |
// Monitors multiple health indicators
- FFmpeg output activity (15-second timeout)
- Frame progression tracking (detects stalls)
- Process status monitoring
- DTS/PTS timestamp error detection
- Memory and resource usage
- Unlimited restarts (configurable)
- Smart restart delays prevent resource exhaustion
- Graceful shutdown handling (SIGTERM → SIGKILL if needed)
- State preservation across restarts
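In practice the recovery loop boils down to a watchdog around FFmpeg. A simplified sketch of the idea (this is not the broadcast-monitor source; the health check here only watches stderr activity, and it uses "os/exec", "sync/atomic", and "time" with Go 1.19+):
// runWithWatchdog restarts FFmpeg whenever it exits or goes silent longer than healthTimeout.
func runWithWatchdog(args []string, healthTimeout, restartDelay time.Duration) {
	for {
		cmd := exec.Command("ffmpeg", args...)
		stderr, err := cmd.StderrPipe()
		if err == nil {
			err = cmd.Start()
		}
		if err != nil {
			time.Sleep(restartDelay)
			continue
		}
		var lastOutput atomic.Int64
		lastOutput.Store(time.Now().UnixNano())
		go func() {
			buf := make([]byte, 4096)
			for {
				n, rerr := stderr.Read(buf)
				if n > 0 {
					lastOutput.Store(time.Now().UnixNano()) // any stderr output counts as a heartbeat
				}
				if rerr != nil {
					return
				}
			}
		}()
		done := make(chan error, 1)
		go func() { done <- cmd.Wait() }()
	monitor:
		for {
			select {
			case <-done:
				break monitor // process exited on its own; restart below
			case <-time.After(time.Second):
				if time.Since(time.Unix(0, lastOutput.Load())) > healthTimeout {
					cmd.Process.Kill() // stalled: no stderr output within the timeout
					<-done
					break monitor
				}
			}
		}
		time.Sleep(restartDelay) // brief back-off before restarting
	}
}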
{
"max_restarts": 999999,
"health_timeout_seconds": 15,
"restart_delay_seconds": 1,
"ffmpeg_args": [
"-re", "-thread_queue_size", "1024",
"-i", "rtmp://192.168.0.12/live/stream", // NOLO video
"-i", "rtsp://admin:pass@camera:554/audio", // Camera audio
"-map", "0:v:0", "-map", "1:a:0", // Mix video + audio
"-vf", "scale=2560:1440,drawtext=...", // Scale + AI overlay
"-c:v", "h264_nvenc", "-preset", "p7", // GPU encoding
"-b:v", "16000k", "-maxrate", "18000k", // Bitrate control
"-f", "flv", "rtmp://a.rtmp.youtube.com/live2/YOUR_KEY"
]
}
cd broadcast
./build.sh
./broadcast-monitor
# Use different config file
./broadcast-monitor -c broadcast_config_nvidia.json
# Enable local recording
./broadcast-monitor -record ./recordings
# Background operation
nohup ./broadcast-monitor > /dev/null 2>&1 &
NOLO AI Tracking → RTMP Stream
↓
Broadcast Monitor
↓
[Video Mixing + Audio + AI Overlay]
↓
Professional YouTube Live Stream
{
"ffmpeg_args": [
"-c:v", "h264_nvenc", "-preset", "p7", "-profile:v", "high",
"-b:v", "16000k", "-maxrate", "18000k", "-bufsize", "32000k",
"-rc", "vbr", "-cq", "23", "-spatial_aq", "1", "-temporal_aq", "1"
]
}
{
"ffmpeg_args": [
"-c:v", "libx264", "-preset", "fast", "-profile:v", "high",
"-b:v", "16000k", "-maxrate", "18000k", "-bufsize", "48000k"
]
}
The broadcast monitor provides comprehensive visibility:
- Frame processing statistics (every 30 seconds)
- Network connectivity status
- Encoding performance metrics
- Error categorization and frequency
- Restart event logging
- Resource utilization tracking
This production-grade approach ensures your AI-powered streams maintain professional quality and reliability 24/7.
This system operates with multiple unencrypted communication channels that transmit credentials in plaintext:
Protocol | Port | Security Issue | Risk Level |
---|---|---|---|
RTSP | 554 | No encryption, credentials in URL | 🔴 HIGH |
HTTP Camera Control | 80 | Plaintext Basic Auth | 🔴 HIGH |
RTMP | 1935 | No encryption | 🟡 MEDIUM |
All camera communications send usernames and passwords in plaintext:
# ❌ THESE CREDENTIALS ARE VISIBLE ON THE NETWORK:
rtsp://user:password123@192.168.1.100:554/stream
http://user:password123@192.168.1.100:80/
Anyone monitoring network traffic can capture:
- Camera login credentials
- Video streams
- PTZ control commands
- Administrative access tokens
# ✅ Suitable for:
- Private property monitoring (isolated networks)
- Development and testing environments
- Proof-of-concept demonstrations
- Internal research projects with proper network controls
- Education and learning (controlled environments)
./NOLO \
-input "rtsp://admin:password@192.168.1.100:554/Streaming/Channels/101" \
-ptzinput "http://admin:password@192.168.1.100:80/" \
-debug
# Track boats and kayaks, enhance with people and equipment
./NOLO \
-input "rtsp://admin:password@192.168.1.100:554/Streaming/Channels/101" \
-ptzinput "http://admin:password@192.168.1.100:80/" \
-p1-track="boat,kayak" \
-p2-track="person,backpack,bottle" \
-pip-zoom \
-debug
# Track any floating objects, enhance with birds or other wildlife
./NOLO \
-input "rtsp://admin:password@192.168.1.100:554/Streaming/Channels/101" \
-ptzinput "http://admin:password@192.168.1.100:80/" \
-p1-track="boat,surfboard" \
-p2-track="bird,person" \
-min-zoom=20 -max-zoom=100
# Debug mode with YOLO analysis and single-track exit
./NOLO \
-input "rtsp://admin:password@192.168.1.100:554/Streaming/Channels/101" \
-ptzinput "http://admin:password@192.168.1.100:80/" \
-debug \
-debug-verbose \
-YOLOdebug \
-exit-on-first-track \
-jpg-path=/tmp/debug \
-pre-overlay-jpg \
-post-overlay-jpg
NOLO provides rich debugging information:
[TRACKING_CONFIG] P1 (Primary): [boat kayak]
[TRACKING_CONFIG] P2 (Enhancement): [person backpack]
📊 Frame 1250: YOLO detected 2 boats, 1 people, 0 others
🎯 Boat boat_123 gets +0.7 priority for having 1 P2 objects!
🔒 Target boat boat_123 LOCKED for camera tracking (mature target)!
👤📺 Using LOCK target with P2 objects (det:15, lost:0) for PIP
🎯👤 Using P2 centroid (1245,680) for tracking - 1 P2 objects, quality 0.85
Common P1 (Primary) Objects:
- boat - Motor boats, sailboats, yachts
- kayak - Kayaks, canoes
- surfboard - Surfboards, paddleboards
- Any COCO dataset object class
Common P2 (Enhancement) Objects:
- person - People on boats/boards
- backpack - Equipment, luggage
- bottle - Drinks, containers
- umbrella - Shade, equipment
- chair - Seating, furniture
- all - Any non-P1 object
- scanning.json: Camera scan pattern definitions
- pixels-inches-cal.json: Spatial calibration data
- coco.names: YOLO class labels
- models/yolov8n.onnx: YOLO detection model
- Debug Overlay: Real-time tracking visualization
- Verbose Mode: Detailed YOLO, calibration, and tracking calculations
- Spatial Logs: Coordinate transformation details
- YOLO Analysis: Save detection inputs for model debugging
- Performance Stats: FPS, latency, memory usage
- Session Logs: Per-object tracking history
- JPEG Capture: Organized frame saving by date/hour
Join us on Discord: https://discord.gg/Gr9rByrEzZ
Copyright (c) 2025 Barrett Lyon
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction for non-commercial, ethical, and educational purposes, subject to the following conditions:
- ✅ Personal projects and research
- ✅ Educational and academic purposes
- ✅ Open source contributions and improvements
- ✅ Wildlife and environmental monitoring
- ✅ Property security (private, residential)
- ✅ Technology demonstrations and prototyping
- ✅ Non-profit organizational use
The following uses are explicitly prohibited under this license:
- 🚫 Commercial deployment without explicit written license from Barrett Lyon
- 🚫 Revenue-generating activities using this software
- 🚫 Integration into commercial products or services
- 🚫 Corporate surveillance systems
- 🚫 Paid consulting services using this codebase
- 🚫 Facial recognition or biometric identification of any kind
- 🚫 Human tracking, profiling, or behavioral analysis
- 🚫 Surveillance that violates privacy rights or reasonable expectation of privacy
- 🚫 Discriminatory targeting based on race, religion, gender, nationality, or other protected characteristics
- 🚫 Mass surveillance programs or bulk data collection
- 🚫 Stalking, harassment, or intimidation of individuals
- 🚫 Military applications or defense contractor use
- 🚫 Weapons systems integration or targeting assistance
- 🚫 Border security or immigration enforcement
- 🚫 Law enforcement surveillance without proper judicial oversight
- 🚫 Intelligence gathering operations
- 🚫 Any use that could cause physical or psychological harm to individuals
- 🚫 Tracking of vulnerable populations (children, elderly, disabled, refugees)
- 🚫 Social credit scoring or behavioral modification systems
-
Attribution Required: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
-
Commercial Licensing: For any commercial use, contact Barrett Lyon for explicit written permission and commercial licensing terms.
-
Ethical Compliance: Users must certify their intended use complies with the ethical restrictions outlined above.
-
Immediate Termination: Any violation of the prohibited uses immediately terminates your rights under this license.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
For commercial licensing inquiries, please contact:
Barrett Lyon - blyon@blyon.com
This project stands on the shoulders of giants. NOLO would not be possible without the incredible work of these open source communities and organizations:
Ultralytics YOLOv8
- The foundation of our object detection system
- Thanks to Glenn Jocher and the Ultralytics team for advancing real-time object detection
- YOLOv8 Models & Documentation
YOLOv3/YOLOv4 (Joseph Redmon & Alexey Bochkovskiy)
- Original YOLO architecture that revolutionized computer vision
- Joseph Redmon's pioneering work on "You Only Look Once"
- YOLO: Real-Time Object Detection
COCO Dataset (Microsoft)
- 80-class object detection dataset that enables flexible tracking
- Microsoft's contribution to computer vision research
- Essential for training robust object detection models
OpenCV (Open Source Computer Vision Library)
- The backbone of all image processing, from frame capture to overlay rendering
- 20+ years of computer vision innovation by the global OpenCV community
- OpenCV GitHub Repository
GoCV (Go Bindings for OpenCV)
- Makes OpenCV accessible from Go with excellent performance
- Thanks to Ron Evans (@deadprogram) and the Hybridgroup team
- GoCV GitHub Repository
FFmpeg
- Absolutely essential for real-time video streaming and encoding
- Powers our RTMP pipeline for YouTube and live streaming
- The Swiss Army knife of multimedia processing
- Incredible engineering by Fabrice Bellard and the FFmpeg team spanning decades
- Supports virtually every video format and codec imaginable
- FFmpeg GitHub Repository
NVIDIA CUDA & cuDNN
- Game-changing GPU acceleration that makes real-time AI possible
- NVIDIA's decades of parallel computing innovation
- CUDA Toolkit Documentation
- Deep learning acceleration primitives
- Optimized neural network inference that powers our YOLO processing
SRS (Simple Realtime Server)
- Much better RTMP handling than nginx for live streaming
- Reliable, efficient real-time media server
Built with ❤️ for watching crazy boat stuff.