Computer Vision System
A production-grade computer vision platform built on three custom-trained YOLO models — developed entirely from scratch, from raw data collection to deployed product. The system gives operators real-time and historical intelligence on vehicle traffic. What makes it different: users can spin up a new computer vision service directly from a dashboard, with no configuration required beyond pointing it at a camera stream.
Traffic monitoring traditionally required expensive proprietary hardware or vendor lock-in. Off-the-shelf models weren't accurate enough for local conditions: public datasets don't reflect Indonesian vehicles, road environments, or plate formats. If this were going to work in production, the models had to be built from the ground up.
The solution: three custom-trained YOLO models, each handling a distinct detection task, unified into one product with a clean operator dashboard. The R&D covered the full machine learning lifecycle: collecting real-world data, annotating it, training, evaluating, iterating, and then integrating those models into a scalable production system.
The Three Models
Object Detection
General-purpose detection layer — identifies and classifies objects within a camera frame. Forms the foundation that the vehicle-specific models build on. Trained from scratch using YOLO on collected real-world data.
Vehicle Detection
Specialized model trained to detect and classify vehicles by type (car, motorcycle, truck, bus) in traffic camera feeds. Trained on real Indonesian road and parking environments. Outputs: vehicle counts, type distribution, traffic flow by time period.
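The outputs above are aggregations over raw detection events. A minimal sketch of that aggregation step, assuming events arrive as (timestamp, vehicle type) pairs; the class names and event shape here are illustrative, not the system's actual schema:

```python
from collections import Counter
from datetime import datetime

def aggregate_detections(events):
    """Roll raw detection events up into counts by vehicle type
    and by hour of day (a stand-in for 'traffic flow by time period')."""
    by_type = Counter()
    by_hour = Counter()
    for ts, vtype in events:
        by_type[vtype] += 1
        by_hour[ts.hour] += 1
    return by_type, by_hour

# Illustrative events as an inference service might emit them.
events = [
    (datetime(2024, 1, 1, 8, 15), "motorcycle"),
    (datetime(2024, 1, 1, 8, 40), "car"),
    (datetime(2024, 1, 1, 17, 5), "motorcycle"),
]
by_type, by_hour = aggregate_detections(events)
# by_type → Counter({'motorcycle': 2, 'car': 1})
```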
License Plate Recognition
Two-stage pipeline: Stage 1 locates the plate region within the vehicle bounding box; Stage 2 runs OCR on the cropped region. Trained on Indonesian plate formats. Outputs: plate number logs, most frequently detected plates, entry/exit tracking.
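The two-stage flow can be sketched as follows. Both models are stubbed out here (the real system runs a trained plate detector and an OCR model), so this only shows the crop-then-recognize control flow, not actual inference:

```python
def detect_plate_region(vehicle_crop):
    """Stage 1 stand-in: the trained plate detector would return the
    plate box (x1, y1, x2, y2) inside the vehicle crop."""
    h, w = len(vehicle_crop), len(vehicle_crop[0])
    return (w // 4, h // 2, 3 * w // 4, 3 * h // 4)

def run_ocr(plate_crop):
    """Stage 2 stand-in for the OCR model; the plate string is a placeholder."""
    return "B 1234 XYZ"

def crop(img, box):
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in img[y1:y2]]

def read_plate(frame, vehicle_box):
    """Crop the vehicle, locate the plate within it, then OCR only that region."""
    vehicle = crop(frame, vehicle_box)
    plate_box = detect_plate_region(vehicle)
    return run_ocr(crop(vehicle, plate_box))

frame = [[0] * 1280 for _ in range(720)]  # dummy grayscale frame
print(read_plate(frame, (100, 100, 500, 400)))
```

The point of the design is that OCR only ever sees a tight plate crop rather than a whole vehicle, which is what drives the accuracy gain described under Key Challenges.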
The Full ML Pipeline
Data Collection
Gathered real-world footage from traffic environments. Manually collected and curated datasets for Indonesian vehicles and license plates — no public dataset was sufficient for local conditions.
Data Annotation
Labeled bounding boxes and classes for each object type across thousands of images. Quality of annotation directly determines model quality — this step was as important as training.
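For YOLO training, each annotation is stored as one line per object: a class id plus a box center and size, all normalized to the image dimensions. A small converter from pixel-space boxes to that format (the class ids are whatever the dataset defines):

```python
def to_yolo_label(cls_id, box, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box into a YOLO label line:
    'class x_center y_center width height', all normalized to [0, 1]."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A box covering the center quarter of a 1280x720 frame:
print(to_yolo_label(1, (320, 180, 960, 540), 1280, 720))
# → "1 0.500000 0.500000 0.500000 0.500000"
```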
Model Training (YOLO)
Trained each model from scratch using the YOLO architecture. Tuned hyperparameters, managed class imbalance, and iterated based on validation performance.
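The write-up doesn't name the YOLO tooling, so as one plausible setup: with the Ultralytics toolchain, a run is driven by a dataset config like this (paths and class list illustrative):

```yaml
# data.yaml — dataset layout for the vehicle-detection model (sketch)
path: datasets/vehicles
train: images/train
val: images/val
names:
  0: car
  1: motorcycle
  2: truck
  3: bus
```

Training then becomes a single command such as `yolo detect train data=data.yaml epochs=100 imgsz=640`, with hyperparameters (learning rate, augmentation, image size) tuned across iterations.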
Evaluation
Measured precision, recall, and mAP (mean Average Precision) for each model. Re-collected data and retrained where performance fell short in specific conditions (night, rain, angle, occlusion).
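At a single IoU threshold, precision and recall reduce to matching predicted boxes against ground truth; mAP then averages precision over recall levels, classes, and thresholds. A simplified sketch of the matching step (greedy matching, one class, fixed threshold — real evaluation also ranks predictions by confidence):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def precision_recall(preds, truths, thresh=0.5):
    """Greedily match predictions to unmatched ground-truth boxes at an
    IoU threshold; unmatched predictions are FPs, unmatched truths are FNs."""
    matched = set()
    tp = 0
    for p in preds:
        for i, t in enumerate(truths):
            if i not in matched and iou(p, t) >= thresh:
                matched.add(i)
                tp += 1
                break
    fp = len(preds) - tp
    fn = len(truths) - tp
    return tp / (tp + fp), tp / (tp + fn)

preds = [(0, 0, 10, 10), (50, 50, 60, 60)]
truths = [(1, 1, 11, 11)]
p, r = precision_recall(preds, truths)
# one of two predictions matches the single truth box: precision 0.5, recall 1.0
```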
Integration & Productization
Wrapped trained models into Python inference services, connected to the platform backend, and exposed results through the Vue.js dashboard with real-time and historical views.
Tech Stack
Python, YOLO, Vue.js, RTSP + MediaMTX, RabbitMQ
Architecture Decisions
RTSP + MediaMTX Streaming
Live video doesn't flow through a REST API — it needs a dedicated media transport layer. MediaMTX receives RTSP streams from cameras, routes them to Python inference services, and redistributes annotated output streams to dashboard consumers. Multiple clients can watch the same stream simultaneously without overloading the source.
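A MediaMTX setup along these lines would express that topology; the path names and camera URL below are illustrative:

```yaml
# mediamtx.yml sketch — one source path per camera, one output path per service
paths:
  cam-gate-1:
    source: rtsp://192.168.1.50:554/stream   # MediaMTX pulls from the camera
  cam-gate-1-annotated:                      # inference service publishes here
```

The inference service reads `rtsp://<mediamtx-host>:8554/cam-gate-1`, publishes its annotated output to the second path, and dashboard clients all consume that path — so the camera itself only ever serves one connection.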
Why RabbitMQ?
Camera streams produce detection events continuously and at high volume. Writing every event directly to a database would create a bottleneck. RabbitMQ acts as a buffer — inference services push events to a queue, and consumers process and store them asynchronously. This keeps the system stable under load and makes horizontal scaling straightforward.
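The decoupling pattern can be shown in miniature with an in-process queue; in production the queue is RabbitMQ and the consumer is a separate service, so `queue.Queue` here only stands in for the broker:

```python
import queue
import threading

events = queue.Queue()   # stand-in for the RabbitMQ queue
stored = []              # stand-in for the database

def inference_service(n):
    """Producer: pushes detection events without waiting on storage."""
    for i in range(n):
        events.put({"plate": f"B {1000 + i} XYZ"})
    events.put(None)  # sentinel: no more events

def storage_consumer():
    """Consumer: drains the queue and persists events at its own pace."""
    while (event := events.get()) is not None:
        stored.append(event)  # stand-in for a (possibly batched) DB write

t = threading.Thread(target=storage_consumer)
t.start()
inference_service(100)
t.join()
print(len(stored))  # → 100
```

Because the producer returns as soon as `put` succeeds, a slow database never stalls inference — and adding more consumers (or more RabbitMQ consumers in production) scales writes horizontally.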
Key Challenges
No suitable training data for Indonesian conditions
Public datasets don't represent Indonesian vehicles, plates, or road environments accurately enough for production use. I collected and annotated training data from real local environments — time-consuming but non-negotiable for model quality.
Plate recognition accuracy under real-world conditions
Plates appear at angles, in motion, in low light, and partially obscured. A two-stage pipeline (detect plate region first, then run OCR on the cropped region) significantly improved accuracy compared to end-to-end approaches.
Processing high-volume camera streams without data loss
Real-time detection from multiple feeds generates enormous event volume. RabbitMQ solved this by decoupling inference from storage — keeping inference fast and writes reliable even under heavy load.
Making computer vision accessible to non-technical users
Most CV systems require technical setup to operate. Abstracting all of that into a dashboard-driven service creation flow meant rethinking the product from the operator's perspective, not the engineer's.
This project taught me what "research and development" actually means in practice: you don't just build — you question, test, fail, learn, and rebuild.
Getting good at computer vision meant getting good at data — because a great model trained on bad data is a useless model. Productizing the models — turning research into something operators could use without technical help — was a different challenge entirely. It's where I learned that the hardest part of AI isn't the model. It's the system around it.