Object detection & tracking within a video

After playing around with Detectron2 & Mask2Former, I was keen to explore the concept of object tracking within videos. For this article, I used Ultralytics’s YOLOv8 model.

Object tracking

Object tracking means identifying each unique object in an image or video frame and following it as it moves through the subsequent frames or set of images. It is the natural next step after correctly detecting each distinct object within a single image.
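To make the idea concrete, here is a toy sketch of tracking by overlap: a box in the current frame inherits the ID of the box it overlaps most in the previous frame, otherwise it gets a new ID. This is only an illustration, not what YOLOv8 runs (Ultralytics ships more robust trackers such as BoT-SORT and ByteTrack). The function names, the box format and the 0.3 threshold are all assumptions made for the example.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track(frames, iou_threshold=0.3):
    """Give every box an ID. A box keeps the ID of the best-overlapping,
    not yet claimed box from the previous frame; otherwise it gets a new ID."""
    next_id = count(1)
    previous = []  # (track_id, box) pairs from the previous frame
    labelled = []
    for boxes in frames:
        current, used = [], set()
        for box in boxes:
            best_id, best_iou = None, iou_threshold
            for track_id, prev_box in previous:
                overlap = iou(box, prev_box)
                if track_id not in used and overlap >= best_iou:
                    best_id, best_iou = track_id, overlap
            if best_id is None:
                best_id = next(next_id)  # unseen object: start a new track
            used.add(best_id)
            current.append((best_id, box))
        previous = current
        labelled.append(current)
    return labelled


# Hypothetical boxes: one object drifting right, one replaced by a new object.
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(2, 0, 12, 10), (51, 50, 61, 60)],
    [(4, 0, 14, 10), (100, 100, 110, 110)],
]
ids_per_frame = [[tid for tid, _ in frame] for frame in track(frames)]
print(ids_per_frame)
```

A greedy matcher like this loses an object as soon as it is hidden for a single frame, which is exactly the kind of case real trackers are built to handle.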

YOLOv8

YOLOv8 is a different style of neural network from Detectron2 or Mask2Former. YOLO, short for You Only Look Once, is a single-stage detector: it detects objects in a single forward pass of a convolutional neural network. Mask2Former and Detectron2 both belong to the two-stage family of models, which first generate regions of interest and then detect objects within them.

Since the original YOLO, the accuracy of the single-stage approach has improved from version to version, and v8 is lightning quick, fast and accurate enough to operate in real time. It is important to note that YOLOv8 does not support high resolutions, so if you need that for your project, it is not suitable. For more information comparing the different YOLO versions, Encord wrote a detailed Medium article on the topic.

For my purposes, high resolution was not critical, so on to testing out the object detection and tracking capabilities.

Detection & tracking

I grabbed a clip from Safa Brian’s cycling YouTube videos, where he and Dante Young are cruising down the Pacific Coast Highway in Los Angeles. This clip was a good test because it has people coming off the beach, vehicles parked on the side of the road, vehicles driving past the cyclists and other cyclists on the road, all while both Dante and Safa are moving pretty quickly themselves.

Running the 30-second clip through the model, with an 80% confidence threshold for each object class, was pretty quick and accurate. It picked up all the cars, bicycles and people, and its only confusion was labelling caravans as trucks or buses, as shown in the clip. This simply comes down to the data underpinning the model: camper vans and caravans were not among the labelled classes.
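For reference, a run like this can be sketched in a few lines with the Ultralytics Python package. This is a minimal sketch rather than the exact script behind the clip: the `yolov8n.pt` weights, the helper names and the sample scores are assumptions. The small `keep_confident` helper only illustrates, in spirit, what a confidence threshold does; the real filtering happens inside the library when `conf=0.8` is passed.

```python
def keep_confident(detections, threshold=0.8):
    """Illustration only: drop (label, score) pairs scoring below the threshold."""
    return [(label, score) for label, score in detections if score >= threshold]


def track_clip(video_path, weights="yolov8n.pt", conf=0.8):
    """Yield one Ultralytics result per frame, with tracker IDs attached."""
    # Imported lazily so the rest of this sketch runs without the package;
    # install it with `pip install ultralytics`.
    from ultralytics import YOLO

    model = YOLO(weights)  # assumed weights file; any YOLOv8 detection model works
    # stream=True returns a generator, so frames are not all held in memory.
    yield from model.track(source=video_path, conf=conf, stream=True)


# Hypothetical scores, to show the effect of the 0.8 threshold.
sample = [("car", 0.93), ("person", 0.81), ("truck", 0.42)]
kept = keep_confident(sample)
print(kept)
```

Looping over `track_clip("clip.mp4")` (a placeholder path) would then give access to each frame's boxes, classes and track IDs.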

The speed and real-time detection capability of the model are very impressive. Further, the tracking was incredible: it correctly followed each object and labelled them all uniquely. It didn’t muddle up any objects that I would have labelled differently by hand, and objects such as Dante’s bike were tracked consistently as they went in and out of view.

Filtering specific classes

I then passed in class filters to restrict detection to bicycles, people and cars, and re-ran the model to see how the results compared. Again this was super impressive, and it shows the potential of training the model on a bespoke object and then tracking it specifically across frames.
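In the Ultralytics API, this kind of filter is expressed through the `classes` argument, which takes a list of class IDs. The sketch below looks the IDs up by name instead of hard-coding them; the helper names, the weights file and the small name mapping are assumptions for illustration (the real mapping comes from `model.names`).

```python
def class_ids_for(names, wanted):
    """Return the sorted class IDs whose names appear in `wanted`.

    `names` maps class ID -> class name, like `model.names` in Ultralytics.
    """
    wanted = set(wanted)
    ids = sorted(i for i, name in names.items() if name in wanted)
    missing = wanted - {names[i] for i in ids}
    if missing:
        raise ValueError(f"Unknown class names: {sorted(missing)}")
    return ids


def track_filtered(video_path, wanted=("person", "bicycle", "car"),
                   weights="yolov8n.pt", conf=0.8):
    """Track only the wanted classes, yielding one result per frame."""
    from ultralytics import YOLO  # lazy import; needs `pip install ultralytics`

    model = YOLO(weights)
    ids = class_ids_for(model.names, wanted)
    # classes= restricts detection (and therefore tracking) to these IDs.
    yield from model.track(source=video_path, conf=conf, classes=ids, stream=True)


# A hand-written subset of the COCO names used by the pretrained weights.
coco_subset = {0: "person", 1: "bicycle", 2: "car", 3: "motorcycle", 5: "bus", 7: "truck"}
selected = class_ids_for(coco_subset, ["person", "bicycle", "car"])
print(selected)
```

Looking names up this way also fails loudly on a typo such as "bike", which is not a COCO class name.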

Count the number of unique objects by class within the clip

The only minor frustration was trying to get a summary of the unique instances per class I was filtering on. There is a PR to fix this, but in the meantime I added a quick function that goes through each detection as each frame is parsed and checks whether that object has already been counted.

The overall count of unique detections by class in the previous clip was:

People: 5
Bicycles: 6
Cars: 37
Buses: 7
Trucks: 3
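A minimal sketch of how such a per-class count could be built is below. It is not the exact function used for the numbers above: it assumes each detection carries a class name and a tracker ID, and counts an object once per distinct ID. The function names and sample data are hypothetical; `detections_from_result` shows one way to pull those pairs out of an Ultralytics result, where `boxes.id` is `None` when no track IDs were assigned.

```python
from collections import defaultdict


def count_unique(frames):
    """Count distinct track IDs per class.

    `frames` is an iterable of per-frame lists of (class_name, track_id).
    """
    seen = defaultdict(set)
    for detections in frames:
        for class_name, track_id in detections:
            if track_id is None:  # detection without a tracker ID: skip it
                continue
            seen[class_name].add(track_id)
    return {name: len(ids) for name, ids in seen.items()}


def detections_from_result(result):
    """Adapt one Ultralytics result into (class_name, track_id) pairs."""
    boxes = result.boxes
    if boxes is None or boxes.id is None:
        return []
    class_ids = boxes.cls.int().cpu().tolist()
    track_ids = boxes.id.int().cpu().tolist()
    return [(result.names[c], t) for c, t in zip(class_ids, track_ids)]


# Hypothetical tracker output for three frames.
frames = [
    [("car", 1), ("person", 2)],
    [("car", 1), ("car", 3), ("person", 2)],
    [("bicycle", 4), ("car", 3), ("person", None)],
]
counts = count_unique(frames)
print(counts)
```

Counting by track ID means the totals are only as good as the tracker: an object that is lost and picked up again under a new ID is counted twice.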

Conclusion

Overall, YOLOv8 has been immensely fun and easy to use. Its real-time capabilities, combined with a high degree of accuracy, are impressive. The model was able to accurately track multiple objects throughout the clip with ease, regardless of whether they were moving or stationary.

Further, YOLOv8 has to be one of the simplest and easiest-to-use models I have explored. The developer experience is second to none, requiring few additional installs and packages and producing virtually no errors. The Python install on Colab was quick, and the entire model could be run on CPU.

Thanks for following along.