The project is designed to solve the "data fatigue" problem associated with motion-activated trail cameras.
While trail cameras are excellent at capturing high-frequency activity, they are notoriously indiscriminate; any movement, ranging from swaying tree branches and wind-blown foliage to domestic activity like gardening or children playing, triggers a recording. This results in a massive library of "false positive" videos that require hours of manual scrubbing to find actual wildlife encounters.
This project aims to implement an intelligent, privacy-preserving, and fully automated pipeline that acts as a first-pass filter. By leveraging local Large Language Models (LLMs) with vision capabilities, the system will autonomously analyze video content and categorize footage, allowing the user to bypass the noise and focus exclusively on meaningful biological observations.
The core architecture of the project relies on Multimodal Large Language Models (MLLMs) running locally via Ollama. By keeping the processing local, the project ensures total privacy for the household and eliminates the costs and latency associated with cloud-based video analysis.
The proposed workflow involves three primary stages:
- Frame Sampling: Instead of analyzing every frame (which is computationally expensive), the system will extract a series of periodic snapshots (e.g., one frame every 5 seconds) from each video file.
- Visual Inference: These sampled frames are fed into a vision-capable model (via Ollama). The model is prompted with a specific instruction:
  "Identify if there is any wildlife (animals, birds, insects) present in this image. Respond with only 'Yes' or 'No'."
- Automated Curation: Based on the AI's response, the system executes a file management command. If "Yes" is detected in any sampled frame, the video is moved to a `Keep` directory; otherwise, it is moved to a `Discard` directory.
This pipeline transforms a manual, labor-intensive task into a hands-off, background process that runs on the input folder.
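The curation step can be sketched as a small shell function. This is an illustration only, not the project's script: the function name `curate_video` and its argument order are hypothetical, and it assumes the `Keep` and `Discard` directories live under a caller-supplied root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move a video into <root>/Keep or <root>/Discard.
# Usage: curate_video <video-file> <Keep|Discard> <root-directory>
curate_video() {
  local video="$1" label="$2" root="$3"
  case "$label" in
    Keep|Discard) ;;                                  # only the two labels from the workflow
    *) echo "unknown label: $label" >&2; return 1 ;;
  esac
  mkdir -p "$root/$label"                             # create the target directory on first use
  mv -- "$video" "$root/$label/"
}
```

Rejecting any label other than `Keep` or `Discard` means an unexpected model reply cannot move footage somewhere unintended.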
The Proof of Concept is designed to validate whether vision-enabled LLMs can reliably produce a binary classification (Wildlife vs. No Wildlife) from static images. This stage avoids the computational overhead of video processing to focus purely on model accuracy.
A representative dataset of images was curated, containing both target wildlife (wallabies, various birds, rodents, possums, and large lizards) and "noise" images (empty garden or forest).
The testing was conducted using a shell script located in the Src/PoC directory. The script performs the following steps:
- Uses `find` to iterate through all image files in the sample directory.
- Passes each image to Ollama using a vision-capable model.
- Uses a strict prompt to keep the output short and parsable:
  "Identify if there is any wildlife (animals, birds, insects) present in this image. Respond with only 'Yes' or 'No'."
```shell
# Directory containing the images to process
DIRECTORY="./"
# The vision-enabled model to use
MODEL="llava:latest"
# The strict prompt for the AI
PROMPT="Identify if there is any wildlife (animals, birds, insects) present in this image. Respond with only 'Yes' or 'No'."
# Find all images and pass the prompt and the file path to the Ollama model
find "$DIRECTORY" -type f \( -iname "*.jpg" -o -iname "*.png" \) -exec ollama run --hidethinking "$MODEL" "$PROMPT" {} \;
```
The following hardware was utilized for the testing of the shell script and model inference.
Hardware Summary:
- Processor (CPU): Intel i7-13700K
- Memory (RAM): 32GB
- Graphics Card (GPU): NVIDIA RTX 3060 12GB
- Inference Engine: Ollama
The initial testing used the Llava model. However, because the outputs of this model were unreliable and inconclusive, multiple models were tested to find the most reliable candidate.
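When comparing models, it can help to separate clean "Yes"/"No" replies from anything else. The sketch below is an assumption about how that could be done, not part of the PoC script; the function name `normalize_answer` and the `Inconclusive` label are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a raw model reply to Yes, No, or Inconclusive.
normalize_answer() {
  local raw
  # Lower-case the reply, then drop whitespace and punctuation so that
  # " Yes." and "'NO'" are both recognised.
  raw="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '[:space:][:punct:]')"
  case "$raw" in
    yes) echo "Yes" ;;
    no)  echo "No" ;;
    *)   echo "Inconclusive" ;;   # anything that is not a bare yes/no
  esac
}
```

A reply such as a full sentence describing the scene falls into `Inconclusive`, which makes it easy to count how often a model ignores the "Respond with only" instruction.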
Results Summary Table
The following table compares the performance of different LLMs (Vision-enabled) evaluated against the Testing Samples dataset.
| Sample Image | Expected | Llava:7b | Llava:13b | ministral-3:14b | ministral-3:8b | mistral-small3.2:24b |
|---|---|---|---|---|---|---|
| bee-garden.png | No | βοΈ | βοΈ | β | β | β |
| bird-back-close-left-forest.png | Yes | β | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-close-center-forest.png | Yes | β | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-close-center-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-close-left-forest.png | Yes | β | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-close-left-garden.png | Yes | β | β | βοΈ | βοΈ | βοΈ |
| bird-flying-close-left-forest.png | Yes | β | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-left-garden.png | Yes | β | β | βοΈ | βοΈ | βοΈ |
| birds-left-garden.png | Yes | β | β | βοΈ | βοΈ | βοΈ |
| day-forest.png | No | βοΈ | β | βοΈ | βοΈ | βοΈ |
| day-garden.png | No | βοΈ | β | βοΈ | βοΈ | βοΈ |
| early-morning-garden.png | No | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| lizard-close-left-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| night-forest.png | No | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| night-garden.png | No | βοΈ | β | β | β | β |
| possum-right-night-forest.png | Yes | βοΈ | β | βοΈ | βοΈ | βοΈ |
| rodent-extreme-right-night-forest.png | Yes | β | β | βοΈ | βοΈ | βοΈ |
| rodent-right-night-forest.png | Yes | β | β | β | βοΈ | β |
| wallaby-back-center-night-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| wallaby-close-right-night-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| wallaby-left-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| wallaby-occluded-left-garden.png | Yes | βοΈ | β | βοΈ | βοΈ | βοΈ |
| wallaby-standing-forest.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| wallaby-standing-right-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| Sample Image | Expected | gemma3:27b | gemma3:12b | qwen3.5:9b | gemma4:26b | gemma4:e4b |
|---|---|---|---|---|---|---|
| bee-garden.png | No | β | β | β | βοΈ | βοΈ |
| bird-back-close-left-forest.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-close-center-forest.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-close-center-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-close-left-forest.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-close-left-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-flying-close-left-forest.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| bird-left-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| birds-left-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| day-forest.png | No | βοΈ | β | βοΈ | βοΈ | βοΈ |
| day-garden.png | No | βοΈ | β | βοΈ | βοΈ | βοΈ |
| early-morning-garden.png | No | βοΈ | β | βοΈ | βοΈ | βοΈ |
| lizard-close-left-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| night-forest.png | No | β | β | βοΈ | βοΈ | βοΈ |
| night-garden.png | No | β | β | βοΈ | βοΈ | βοΈ |
| possum-right-night-forest.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| rodent-extreme-right-night-forest.png | Yes | βοΈ | βοΈ | βοΈ | β | β |
| rodent-right-night-forest.png | Yes | βοΈ | βοΈ | βοΈ | β | β |
| wallaby-back-center-night-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| wallaby-close-right-night-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| wallaby-left-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| wallaby-occluded-left-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| wallaby-standing-forest.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
| wallaby-standing-right-garden.png | Yes | βοΈ | βοΈ | βοΈ | βοΈ | βοΈ |
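A per-model score can be derived from rows like these with a short tally. The sketch below assumes a hypothetical tab-separated results file (image name, expected label, model answer), which is not a format the project defines; the function name `score_results` is likewise illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "image<TAB>expected<TAB>actual" lines on stdin
# and print "<correct>/<total>".
score_results() {
  awk -F'\t' '
    NF >= 3 { total++; if ($2 == $3) correct++ }   # count rows and exact matches
    END     { printf "%d/%d\n", correct, total }
  '
}
```

For example, `score_results < results-llava.tsv` would print a single fraction for that model, where `results-llava.tsv` is a placeholder file name.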
Critical Evaluation:
- Gemma 4 delivered the best overall results, balancing speed and accuracy effectively.
- Llava proved unreliable due to inconsistent outputs and frequent errors.
- Qwen 3.5 was significantly slower than its peers, with the 27B model being particularly unsuitable for the current hardware setup.
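Critical Finding: The proof of concept validated that a local vision-enabled model can produce a usable binary classification (Wildlife vs. No Wildlife) from static images, which is the premise the automated pipeline depends on.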
The full implementation moves from static images to automated video processing.
This solution is contained within the Src/ShellScript directory and consists of two modular scripts.
This script handles the analysis of a single video file.
Workflow:
- Frame Extraction: It uses `ffmpeg` to extract one frame every 5 seconds from the input video. This interval is chosen to balance computational speed with the likelihood of capturing an animal in motion.
- Image Analysis: The script iterates through the extracted frames and pipes them into the selected Ollama vision model.
- Decision Logic: The script applies the strict "Yes/No" prompt. If any single frame in the video returns a "Yes", the script flags the video as "Keep". If all frames return "No", the video is flagged as "Discard".
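The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of the actual script: the function names (`extract_frames`, `ask_model`, `classify_frames`), the frame file pattern, and the default model are all hypothetical. The `fps=1/5` filter is one common way to ask FFmpeg for one frame every 5 seconds.

```shell
#!/usr/bin/env bash
# Sketch only: all function names and defaults below are hypothetical.
MODEL="${MODEL:-llava:latest}"
PROMPT="Identify if there is any wildlife (animals, birds, insects) present in this image. Respond with only 'Yes' or 'No'."

# Extract one frame every 5 seconds from video $1 into directory $2.
extract_frames() {
  ffmpeg -loglevel error -i "$1" -vf fps=1/5 "$2/frame-%04d.png"
}

# Ask the vision model about a single frame; prints the raw reply.
ask_model() {
  ollama run --hidethinking "$MODEL" "$PROMPT" "$1"
}

# Print "Keep" as soon as any frame in directory $1 gets a reply that
# starts with "yes" (case-insensitive); otherwise print "Discard".
classify_frames() {
  local frame
  for frame in "$1"/*.png; do
    [ -e "$frame" ] || continue          # the glob matched no files
    if ask_model "$frame" | grep -qi '^[[:space:]]*yes'; then
      echo "Keep"
      return 0
    fi
  done
  echo "Discard"
}

# Example wiring (not executed here):
#   frames="$(mktemp -d)"
#   extract_frames Wildlife.mp4 "$frames" && classify_frames "$frames"
```

Stopping at the first "Yes" avoids further model calls once the video is already known to be a keeper.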
This is the master script used to process entire directories of footage.
Workflow:
- File Discovery: It uses the `find` command to scan a designated input directory for all video files (e.g., `.mp4`, `.avi`).
- Execution Loop: It iterates through the discovered files and invokes `process-video.sh` for each one.
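A discovery-and-dispatch loop of this kind might look like the sketch below. It is an assumption about the general shape rather than the real master script: the function name `process_directory` is hypothetical, and the per-video processor is passed in as a path so that any executable can be substituted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a per-video processor on every video under a directory.
# Usage: process_directory <input-directory> <path-to-processor>
process_directory() {
  local input_dir="$1" processor="$2"
  # -iname makes the extension match case-insensitive (.MP4, .Avi, ...);
  # -exec runs the processor once per discovered file.
  find "$input_dir" -type f \( -iname '*.mp4' -o -iname '*.avi' \) \
    -exec "$processor" {} \;
}
```

Called as `process_directory ./footage ./process-video.sh`, where `./footage` is a placeholder path, it would hand each matching file to the single-video script in turn.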
To run the full pipeline on a directory:
```shell
./Src/ShellScript/wildlife-detect.sh -d <directory> -m <model>
```
The application is built using a modern, robust .NET ecosystem:
- Language & Framework: The implementation is written in `C#` utilizing `.NET 10`.
- User Experience (UX): To provide a professional and interactive terminal interface, the tool uses `Spectre.Console` (for rich text, progress bars, and tables) and `Spectre.Console.Cli` (for advanced command-line parsing and structured command management).
- AI Orchestration: The tool interfaces with the Ollama ecosystem via `OllamaSharp`. This allows the application to send extracted image snapshots to Large Multimodal Models (LMMs) to perform the actual detection and identification logic.
- Video Processing: Frame extraction and snapshot generation are handled by `Xabe.FFmpeg`, which provides a high-level wrapper around FFmpeg to programmatically capture specific frames from video files.
- .NET 10 SDK.
- An active Ollama instance running with a multimodal model (like llava) downloaded.
- FFmpeg will be automatically downloaded and saved under the binaries directory.
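Before building, the first two prerequisites can be checked from the terminal. The helper below is a hypothetical convenience, not something shipped with the project; it only reports which of the named commands are missing from `PATH`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of each given command that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example (not executed here):
#   missing_tools dotnet ollama   # prints nothing when both are installed
#   ollama list                   # lists the models already downloaded
```

FFmpeg is left out of the check because the tool downloads it automatically.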
Open your terminal in the `Src/dotNet` directory and execute:
```shell
dotnet build
```
From the same terminal, execute:
```shell
dotnet run [<directory>] [--url <url>]
```
Arguments:
- `<directory>`: (Optional) The path to the folder containing the video files you wish to analyze. If this argument is omitted, the tool defaults to `./`.
- `--url <url>`: (Optional) The network address of your Ollama server. If this argument is omitted, the tool defaults to `http://localhost:11434`.
The following screenshot demonstrates the interactive terminal interface, showing the progress of video processing and the formatted detection results:
Two distinct video files were used to test each automated detection implementation.
| Video Filename | Video Content Description | Expected Label |
|---|---|---|
| No-Wildlife.mp4 | Footage of a forest with gentle swaying of the leaves and branches in a light breeze. | Discard |
| Wildlife.mp4 | Garden footage featuring a visiting wallaby. | Keep |



























