Computer Vision System
A production-grade computer vision platform built on three custom-trained YOLO models — developed entirely from scratch, from raw data collection to deployed product. The system gives operators real-time and historical intelligence on vehicle traffic. What makes it different: users can spin up a new computer vision service directly from a dashboard, with no configuration required beyond pointing it at a camera stream.
Traffic monitoring traditionally required expensive proprietary hardware or vendor lock-in. Off-the-shelf models weren't accurate enough for local conditions: public datasets don't reflect Indonesian vehicles, road environments, or plate formats. If this were going to work in production, the models had to be built from the ground up.
The solution: three custom-trained YOLO models, each handling a distinct detection task, unified into one product with a clean operator dashboard. The R&D covered the full machine learning lifecycle: collecting real-world data, annotating it, training, evaluating, iterating, and then integrating those models into a scalable production system.
The Three Models
Object Detection
General-purpose detection layer — identifies and classifies objects within a camera frame. Forms the foundation that the vehicle-specific models build on. Trained from scratch using YOLO on collected real-world data.
Vehicle Detection
Specialized model trained to detect and classify vehicles by type (car, motorcycle, truck, bus) in traffic camera feeds. Trained on real Indonesian road and parking environments. Outputs: vehicle counts, type distribution, traffic flow by time period.
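The outputs above are aggregations over raw detection events. A minimal sketch of that aggregation step, assuming events arrive as (timestamp, vehicle type) pairs; the class names and event shape here are illustrative, not the system's actual schema:

```python
from collections import Counter
from datetime import datetime

def aggregate_detections(events):
    """Roll raw detection events up into counts by vehicle type
    and by hour of day (a stand-in for 'traffic flow by time period')."""
    by_type = Counter()
    by_hour = Counter()
    for ts, vtype in events:
        by_type[vtype] += 1
        by_hour[ts.hour] += 1
    return by_type, by_hour

# Illustrative events as an inference service might emit them.
events = [
    (datetime(2024, 1, 1, 8, 15), "motorcycle"),
    (datetime(2024, 1, 1, 8, 40), "car"),
    (datetime(2024, 1, 1, 17, 5), "motorcycle"),
]
by_type, by_hour = aggregate_detections(events)
# by_type → Counter({'motorcycle': 2, 'car': 1})
```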
License Plate Recognition
Two-stage pipeline: Stage 1 locates the plate region within the vehicle bounding box; Stage 2 runs OCR on the cropped region. Trained on Indonesian plate formats. Outputs: plate number logs, most frequently detected plates, entry/exit tracking.
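The two-stage flow can be sketched as follows. Both models are stubbed out here (the real system runs a trained plate detector and an OCR model), so this only shows the crop-then-recognize control flow, not actual inference:

```python
def detect_plate_region(vehicle_crop):
    """Stage 1 stand-in: the trained plate detector would return the
    plate box (x1, y1, x2, y2) inside the vehicle crop."""
    h, w = len(vehicle_crop), len(vehicle_crop[0])
    return (w // 4, h // 2, 3 * w // 4, 3 * h // 4)

def run_ocr(plate_crop):
    """Stage 2 stand-in for the OCR model; the plate string is a placeholder."""
    return "B 1234 XYZ"

def crop(img, box):
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in img[y1:y2]]

def read_plate(frame, vehicle_box):
    """Crop the vehicle, locate the plate within it, then OCR only that region."""
    vehicle = crop(frame, vehicle_box)
    plate_box = detect_plate_region(vehicle)
    return run_ocr(crop(vehicle, plate_box))

frame = [[0] * 1280 for _ in range(720)]  # dummy grayscale frame
print(read_plate(frame, (100, 100, 500, 400)))
```

The point of the design is that OCR only ever sees a tight plate crop rather than a whole vehicle, which is what drives the accuracy gain described under Key Challenges.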
The Full ML Pipeline
Data Collection
Gathered real-world footage from traffic environments. Manually collected and curated datasets for Indonesian vehicles and license plates — no public dataset was sufficient for local conditions.
Data Annotation
Labeled bounding boxes and classes for each object type across thousands of images. Quality of annotation directly determines model quality — this step was as important as training.
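For YOLO training, each annotation is stored as one line per object: a class id plus a box center and size, all normalized to the image dimensions. A small converter from pixel-space boxes to that format (the class ids are whatever the dataset defines):

```python
def to_yolo_label(cls_id, box, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box into a YOLO label line:
    'class x_center y_center width height', all normalized to [0, 1]."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2 / img_w
    yc = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{cls_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A box covering the center quarter of a 1280x720 frame:
print(to_yolo_label(1, (320, 180, 960, 540), 1280, 720))
# → "1 0.500000 0.500000 0.500000 0.500000"
```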
Model Training (YOLO)
Trained each model from scratch using the YOLO architecture. Tuned hyperparameters, managed class imbalance, and iterated based on validation performance.
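The write-up doesn't name the YOLO tooling, so as one plausible setup: with the Ultralytics toolchain, a run is driven by a dataset config like this (paths and class list illustrative):

```yaml
# data.yaml — dataset layout for the vehicle-detection model (sketch)
path: datasets/vehicles
train: images/train
val: images/val
names:
  0: car
  1: motorcycle
  2: truck
  3: bus
```

Training then becomes a single command such as `yolo detect train data=data.yaml epochs=100 imgsz=640`, with hyperparameters (learning rate, augmentation, image size) tuned across iterations.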
Evaluation
Measured precision, recall, and mAP (mean Average Precision) for each model. Re-collected data and retrained where performance fell short in specific conditions (night, rain, angle, occlusion).
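At a single IoU threshold, precision and recall reduce to matching predicted boxes against ground truth; mAP then averages precision over recall levels, classes, and thresholds. A simplified sketch of the matching step (greedy matching, one class, fixed threshold — real evaluation also ranks predictions by confidence):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def precision_recall(preds, truths, thresh=0.5):
    """Greedily match predictions to unmatched ground-truth boxes at an
    IoU threshold; unmatched predictions are FPs, unmatched truths are FNs."""
    matched = set()
    tp = 0
    for p in preds:
        for i, t in enumerate(truths):
            if i not in matched and iou(p, t) >= thresh:
                matched.add(i)
                tp += 1
                break
    fp = len(preds) - tp
    fn = len(truths) - tp
    return tp / (tp + fp), tp / (tp + fn)

preds = [(0, 0, 10, 10), (50, 50, 60, 60)]
truths = [(1, 1, 11, 11)]
p, r = precision_recall(preds, truths)
# one of two predictions matches the single truth box: precision 0.5, recall 1.0
```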
Integration & Productization
Wrapped trained models into Python inference services, connected to the platform backend, and exposed results through the Vue.js dashboard with real-time and historical views.
Tech Stack
Python, YOLO, Vue.js, RTSP + MediaMTX, RabbitMQ
Architecture Decisions
RTSP + MediaMTX Streaming
Live video doesn't flow through a REST API — it needs a dedicated media transport layer. MediaMTX receives RTSP streams from cameras, routes them to Python inference services, and redistributes annotated output streams to dashboard consumers. Multiple clients can watch the same stream simultaneously without overloading the source.
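A MediaMTX setup along these lines would express that topology; the path names and camera URL below are illustrative:

```yaml
# mediamtx.yml sketch — one source path per camera, one output path per service
paths:
  cam-gate-1:
    source: rtsp://192.168.1.50:554/stream   # MediaMTX pulls from the camera
  cam-gate-1-annotated:                      # inference service publishes here
```

The inference service reads `rtsp://<mediamtx-host>:8554/cam-gate-1`, publishes its annotated output to the second path, and dashboard clients all consume that path — so the camera itself only ever serves one connection.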
Why RabbitMQ?
Camera streams produce detection events continuously and at high volume. Writing every event directly to a database would create a bottleneck. RabbitMQ acts as a buffer — inference services push events to a queue, and consumers process and store them asynchronously. This keeps the system stable under load and makes horizontal scaling straightforward.
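The decoupling pattern can be shown in miniature with an in-process queue; in production the queue is RabbitMQ and the consumer is a separate service, so `queue.Queue` here only stands in for the broker:

```python
import queue
import threading

events = queue.Queue()   # stand-in for the RabbitMQ queue
stored = []              # stand-in for the database

def inference_service(n):
    """Producer: pushes detection events without waiting on storage."""
    for i in range(n):
        events.put({"plate": f"B {1000 + i} XYZ"})
    events.put(None)  # sentinel: no more events

def storage_consumer():
    """Consumer: drains the queue and persists events at its own pace."""
    while (event := events.get()) is not None:
        stored.append(event)  # stand-in for a (possibly batched) DB write

t = threading.Thread(target=storage_consumer)
t.start()
inference_service(100)
t.join()
print(len(stored))  # → 100
```

Because the producer returns as soon as `put` succeeds, a slow database never stalls inference — and adding more consumers (or more RabbitMQ consumers in production) scales writes horizontally.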
Key Challenges
No suitable training data for Indonesian conditions
Public datasets don't represent Indonesian vehicles, plates, or road environments accurately enough for production use. I collected and annotated training data from real local environments — time-consuming but non-negotiable for model quality.
Plate recognition accuracy under real-world conditions
Plates appear at angles, in motion, in low light, and partially obscured. A two-stage pipeline (detect plate region first, then run OCR on the cropped region) significantly improved accuracy compared to end-to-end approaches.
Processing high-volume camera streams without data loss
Real-time detection from multiple feeds generates enormous event volume. RabbitMQ solved this by decoupling inference from storage — keeping inference fast and writes reliable even under heavy load.
Making computer vision accessible to non-technical users
Most CV systems require technical setup to operate. Abstracting all of that into a dashboard-driven service creation flow meant rethinking the product from the operator's perspective, not the engineer's.
This project taught me what "research and development" actually means in practice: you don't just build — you question, test, fail, learn, and rebuild.
Getting good at computer vision meant getting good at data — because a great model trained on bad data is a useless model. Productizing the models — turning research into something operators could use without technical help — was a different challenge entirely. It's where I learned that the hardest part of AI isn't the model. It's the system around it.