YOLOv5 Custom Object Detection on Fashion Accessories

Introduction

In this project, we focused on building a custom object detection model using YOLOv5 to identify specific clothing accessories such as shirts, pants, shoes, handbags, and sunglasses. The primary goal was to train a model that can accurately detect and classify these items in images.

After completing the training process, we ran inference on a set of test images to evaluate the model’s performance. The results demonstrated the model’s ability to successfully identify and classify the targeted objects with high accuracy. Each detected object was marked with a bounding box, and the corresponding label was displayed.

Below are examples of the model’s output, illustrating its ability to detect and label clothing accessories:


(Example detections: YOLO 1, YOLO 2, YOLO 3)


These results validate the model’s potential for real-world applications, such as fashion analysis, e-commerce, and inventory management, where quick and precise object detection is crucial.

What is Object Detection?

Object detection is a critical task in computer vision that involves identifying and locating objects within digital images or videos. The primary goal of object detection is to answer two key questions:

  1. What is the object? - Identifying the category or type of an object (e.g., person, car, dog).

  2. Where is the object? - Determining the specific location of the object within the image or video, often represented by a bounding box around it.

How does object detection work?

Object detection combines two tasks:

  • Classification: determines what the object is.
  • Localization: identifies where the object is by drawing a bounding box around it. (A minimal sketch of a single detection is shown below.)
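
To make this concrete, here is a minimal, illustrative sketch (not taken from this project's code) of how a single detection is commonly represented: a class label answering "what", a bounding box answering "where", and a confidence score. All values are made up.

# One detection = what (label) + how sure (confidence) + where (bounding box)
detection = {
    "label": "Shirts",            # classification: what the object is
    "confidence": 0.91,           # model's confidence in the prediction (illustrative)
    "box": (120, 80, 200, 260),   # localization: x, y, width, height in pixels
}

x, y, w, h = detection["box"]
print(f"{detection['label']} ({detection['confidence']:.0%}) at ({x}, {y}), size {w}x{h}")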

Real-world use cases of object detection

  1. Security and Surveillance:
    • In security systems, object detection is used to monitor and identify suspicious activities.
    • For instance, CCTV cameras equipped with object detection can automatically recognize unauthorized access, detect abandoned objects (e.g., luggage in airports), and alert security personnel. This improves the effectiveness of surveillance systems in places like banks, airports, and public spaces.
  2. Augmented Reality (AR):
    • Object detection plays an important role in AR applications by enabling the interaction between digital objects and the real world.
    • For example, in mobile apps like IKEA Place, object detection helps users visualize how furniture will look in their homes by detecting the room’s layout and placing virtual furniture in appropriate locations. This enhances the shopping experience by encouraging customers to try before they buy.

Object detection is a powerful tool in computer vision with applications across a wide variety of domains.

Object Detection Algorithms

  1. EfficientDet:
    • EfficientDet is a family of object detection models that focuses on optimizing both accuracy and efficiency.
    • Built on top of the EfficientNet backbone, EfficientDet scales the model architecture effectively using a compound scaling method that balances network depth, width, and resolution.
    • Features:
    • Highly scalable
    • Ideal for applications requiring high performance with limited computational resources.

The architecture of the EfficientDet object detection algorithm is described in detail in the EfficientDet paper; please refer to it for more details.

  2. SSD (Single Shot MultiBox Detector):
    • It processes an entire image in a single pass.
    • It divides the image into a grid and makes predictions for each grid cell, similar to YOLO, but uses multiple feature layers for improved accuracy.
    • Features:
    • Good balance between speed and accuracy
    • Suitable for real-time detection on resource-constrained devices

You can find more details in the paper here.

  3. Faster R-CNN (Region-Based CNN):
    • It is a two-stage object detection algorithm that first proposes regions of interest (ROIs) and then classifies these regions into object categories.
    • The first stage involves a Region Proposal Network (RPN) that generates candidate bounding boxes, and
    • the second stage classifies and refines the objects within these proposals.
    • Features:
    • High accuracy, especially for complex and cluttered scenes
    • Computationally intensive, requiring more processing time than YOLO

Please follow the original paper here for more details.

  4. YOLO (You Only Look Once):
    • YOLO is one of the fastest and most efficient object detection algorithms.
    • It treats object detection as a single regression problem, directly predicting the class probabilities and bounding box coordinates from an image in one evaluation.
    • Unlike traditional methods that run a detector over multiple regions of an image, YOLO considers the entire image at once, making it incredibly fast.
    • Features:
    • High speed and good accuracy
    • Single forward pass through the network

You can explore more details on the architecture and experiments done in the research paper here.

Here, we will be using YOLOv5. YOLOv5 is known for its speed, accuracy, and ease of use, making it a popular choice for real-time object detection.

Why YOLOv5?

  • YOLOv5 builds on previous versions of YOLO while introducing improvements in model architecture and training techniques.

  • It offers multiple pre-trained models that vary in size and performance, allowing users to choose the model that best fits their requirements in terms of speed and accuracy.

The implementation of YOLOv5 by Ultralytics is built using PyTorch, an open-source machine learning framework. You can explore its basics through the official PyTorch tutorials.
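
As a quick, hedged illustration of how this PyTorch implementation is typically used (this is the generic torch.hub workflow from the Ultralytics README, not part of our custom training code), the snippet below loads an official pre-trained checkpoint and runs it on the README's sample image:

import torch

# Load a small pre-trained YOLOv5 model from the Ultralytics repository via torch.hub
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run inference on the sample image used in the Ultralytics README
results = model('https://ultralytics.com/images/zidane.jpg')

results.print()                   # summary of detected classes and confidences
print(results.pandas().xyxy[0])   # bounding boxes as a pandas DataFrame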

How to train a YOLOv5 model?

Training a YOLOv5 model is straightforward, thanks to its user-friendly documentation. Here's a brief overview of the steps:

  • Data Annotation

    • Label your images with the necessary tags to create a dataset that the model can learn from. This labelled data is crucial for supervised machine learning.


(Annotation examples: da 1, da 2, da 3)


  • YOLOv5 Training
    • Train a custom YOLOv5 model using the annotated dataset. After training, we’ll receive a weight file that encapsulates the learned features.
  • YOLOv5 Inference
    • Use the trained model to detect objects, such as clothing accessories, in new images during runtime.

1) Data Annotation

In this step, we focus on annotating the full-shot image data that has been scraped, preparing it for training a custom YOLOv5 model. Proper data annotation is crucial as it involves labeling the images with the exact locations and categories of the objects we want our model to detect.

Annotating Images in YOLOv5 Supported Format

The annotation process involves drawing bounding boxes around the target objects within the images and assigning the appropriate labels to each bounding box. For this project, we are interested in detecting specific clothing and accessories, and the labels we are considering are:

  • Shirts
  • Pants
  • Shoes
  • Handbags
  • Sunglasses

Using Label Studio

To facilitate the annotation process, I used an online data annotation tool called Label Studio. Label Studio offers a user-friendly interface for drawing bounding boxes around objects and assigning labels to them, making it an efficient choice for annotating large datasets.

In the context of this project, I created bounding boxes around each instance of Handbags, Pants, Shirts, Shoes, and Sunglasses in the images. Each box was labeled accordingly, ensuring that the YOLOv5 model can later recognize these objects during training and inference.

Here’s a demo video showing the data annotation process for this project using Label Studio, which provides a visual guide on how to draw bounding boxes and assign labels effectively.

YOLOv5 Data Format

YOLOv5 requires a specific format for the annotated data, where each bounding box is stored with the following information:

  1. Category:
    • The class or category to which the bounding box belongs (e.g., shirts, pants).
  2. x-center:
    • The normalized X coordinate of the center of the bounding box, representing the midpoint of the object in the horizontal axis.
  3. y-center:
    • The normalized Y coordinate of the center of the bounding box, representing the midpoint of the object in the vertical axis.
  4. Width:
    • The normalized width of the bounding box, indicating how wide the object is relative to the image.
  5. Height:
    • The normalized height of the bounding box, indicating how tall the object is relative to the image.

Normalization means that these coordinates and dimensions are scaled between 0 and 1, based on the size of the image. This standardization allows the YOLOv5 model to accurately process the data, regardless of the image resolution or aspect ratio.
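
To make the format concrete, below is a small illustrative sketch: an example label line and a helper that converts a pixel-space box into normalized YOLO coordinates. The numbers and the to_yolo_format helper are made up for illustration and are not taken from the project's dataset.

# Each image has a .txt label file with one line per object:
#   <class_id> <x_center> <y_center> <width> <height>    (all normalized to 0-1)
# e.g. a "Shirts" box (class id 2) roughly centered in the image:
#   2 0.512 0.430 0.280 0.350

def to_yolo_format(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box (top-left and bottom-right corners) to YOLO format."""
    x_center = ((x_min + x_max) / 2) / img_w
    y_center = ((y_min + y_max) / 2) / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return class_id, x_center, y_center, width, height

# Example: a 300x420 pixel box on a 1000x1200 image
print(to_yolo_format(2, 350, 200, 650, 620, 1000, 1200))
# -> (2, 0.5, 0.341666..., 0.3, 0.35)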


By carefully annotating the data in this format, we ensure that the YOLOv5 model can effectively learn to detect the specified clothing and accessories during the training phase.

2) YOLOv5 Training

YOLOv5 offers five pre-defined models, each varying in size and performance. These models are:

  • YOLOv5-nano
  • YOLOv5-small
  • YOLOv5-medium
  • YOLOv5-large
  • YOLOv5-extraLarge

Training the YOLOv5-Large Model

For our project, we will be training a YOLOv5-large model. This model strikes a good balance between speed and accuracy, making it well-suited for our task of detecting specific clothing and accessories.

The training process will follow the steps outlined in the official YOLOv5 documentation. This documentation provides a comprehensive guide on how to train YOLOv5 models using PyTorch, including setting up the environment, preparing the dataset, and initiating the training process.

Customizing YOLOv5 for Custom Object Detection

While the pre-trained YOLOv5 models can detect 80 common classes, such as cars, persons, boats, and birds, we need to customize the model to detect our specific classes: Handbags, Pants, Shirts, Shoes, and Sunglasses. To achieve this, we must create a custom .yaml file that specifies the classes we want to detect and the paths to our training and validation datasets.

.yaml File Configuration

The .yaml file is a configuration file that YOLOv5 uses to understand the structure of the dataset. Below is a general description of the key elements in the .yaml file:

  1. Classes: This section lists the custom classes that we want the YOLOv5 model to detect. For our project, this would include Handbags, Pants, Shirts, Shoes, and Sunglasses.

  2. Train: The path to the directory containing the training images. This path tells the model where to find the images it should learn from.

  3. Val (Validation): The path to the directory containing the validation images. These images are used to evaluate the model’s performance during training, ensuring that it is learning effectively.

  4. Test (optional): The path to the directory containing the test images, which can be used to evaluate the model after training is complete.

Here is a snapshot of what the .yaml file might look like for our specific use case:

# Paths to the datasets
path: /content/Annotated_Data                # dataset root dir
train: /content/Annotated_Data/images/train  # train images (450 images)
val: /content/Annotated_Data/images/val      # val images (50 images)
test:                                        # test images (optional)

# Number of classes
nc: 5

# Classes (index: name)
names:
  0: Handbags
  1: Pants
  2: Shirts
  3: Shoes
  4: Sunglasses

This configuration file ensures that YOLOv5 knows exactly what classes it needs to detect and where to find the data necessary for training and validation.
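
For reference, the dataset layout implied by this configuration (and by the training logs shown later) looks roughly like the sketch below; YOLOv5 locates each image's annotation file by replacing the images directory with labels in the path. The exact file names are illustrative.

/content/Annotated_Data/
├── pinterest.yaml          # the dataset configuration file shown above
├── images/
│   ├── train/              # 450 training images
│   └── val/                # 50 validation images
└── labels/
    ├── train/              # one .txt label file per training image
    └── val/                # one .txt label file per validation image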

By following these steps, we’ll be able to train a custom YOLOv5-large model that accurately detects the specific types of clothing and accessories we’re interested in.

Training Steps

# importing required dependencies

import pandas as pd
import numpy as np
import tensorflow as tf
import random
import cv2
from google.colab.patches import cv2_imshow
from tqdm.auto import tqdm
import os
import shutil as sh
from IPython.display import Image, clear_output
import torch

Now, we unzip the sample annotated dataset and use it for model training.

unzip '/content/Annotated_Data.zip'

Next, we clone the Ultralytics YOLOv5 repository so we can train our custom model starting from the YOLOv5-large weights.

#Cloning the official YOLOv5 repository and other dependencies
git clone https://github.com/ultralytics/yolov5
pip install -U pycocotools

#Installing dependencies
pip install -qr yolov5/requirements.txt
cp yolov5/requirements.txt ./
Cloning into 'yolov5'...
remote: Enumerating objects: 16843, done.
remote: Counting objects: 100% (18/18), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 16843 (delta 4), reused 10 (delta 0), pack-reused 16825
Receiving objects: 100% (16843/16843), 15.57 MiB | 13.40 MiB/s, done.
Resolving deltas: 100% (11553/11553), done.
Requirement already satisfied: pycocotools in /usr/local/lib/python3.10/dist-packages (2.0.8)
Requirement already satisfied: matplotlib>=2.1.0 in /usr/local/lib/python3.10/dist-packages (from pycocotools) (3.7.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from pycocotools) (1.26.4)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools) (4.53.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools) (24.1)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.1.0->pycocotools) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib>=2.1.0->pycocotools) (1.16.0)

We can now begin training our custom YOLOv5 object detection model. The training is expected to take some time depending on the type of hardware used.
python /content/yolov5/train.py --img 640 --batch 16 --epochs 15 --weights yolov5l.pt --data /content/Annotated_Data/pinterest.yaml
train: weights=yolov5l.pt, cfg=, data=/content/Annotated_Data/pinterest.yaml, hyp=yolov5/data/hyps/hyp.scratch-low.yaml, epochs=15, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, evolve_population=yolov5/data/hyps, resume_evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=yolov5/runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest, ndjson_console=False, ndjson_file=False
github: up to date with https://github.com/ultralytics/yolov5 ✅
YOLOv5 🚀 v7.0-351-g19ce9029 Python-3.10.12 torch-2.3.1+cu121 CUDA:0 (Tesla T4, 15102MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.2, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.0, copy_paste=0.0
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir yolov5/runs/train', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=5

                  from  n    params  module                                  arguments                     
  0                -1  1      7040  models.common.Conv                      [3, 64, 6, 2, 2]              
  1                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]               
  2                -1  3    156928  models.common.C3                        [128, 128, 3]                 
  3                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]              
  4                -1  6   1118208  models.common.C3                        [256, 256, 6]                 
  5                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]              
  6                -1  9   6433792  models.common.C3                        [512, 512, 9]                 
  7                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]             
  8                -1  3   9971712  models.common.C3                        [1024, 1024, 3]               
  9                -1  1   2624512  models.common.SPPF                      [1024, 1024, 5]               
  10                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1]             
  11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
  12           [-1, 6]  1         0  models.common.Concat                    [1]                           
  13                -1  3   2757632  models.common.C3                        [1024, 512, 3, False]         
  14                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]              
  15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
  16           [-1, 4]  1         0  models.common.Concat                    [1]                           
  17                -1  3    690688  models.common.C3                        [512, 256, 3, False]          
  18                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]              
  19          [-1, 14]  1         0  models.common.Concat                    [1]                           
  20                -1  3   2495488  models.common.C3                        [512, 512, 3, False]          
  21                -1  1   2360320  models.common.Conv                      [512, 512, 3, 2]              
  22          [-1, 10]  1         0  models.common.Concat                    [1]                           
  23                -1  3   9971712  models.common.C3                        [1024, 1024, 3, False]        
  24      [17, 20, 23]  1     53850  models.yolo.Detect                      [5, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [256, 512, 1024]]
Model summary: 368 layers, 46159834 parameters, 46159834 gradients, 108.3 GFLOPs

Transferred 607/613 items from yolov5l.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 101 weight(decay=0.0), 104 weight(decay=0.0005), 104 bias
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
train: Scanning /content/Annotated_Data/labels/train.cache... 450 images, 0 backgrounds, 0 corrupt: 100% 450/450 [00:00<?, ?it/s]
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
val: Scanning /content/Annotated_Data/labels/val.cache... 50 images, 0 backgrounds, 0 corrupt: 100% 50/50 [00:00<?, ?it/s]

AutoAnchor: 4.66 anchors/target, 0.996 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to yolov5/runs/train/exp2/labels.jpg... 
Image sizes 640 train, 640 val
Using 2 dataloader workers
Logging results to yolov5/runs/train/exp2
Starting training for 15 epochs...

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
        0/14      9.85G    0.09428    0.05607    0.04508          9        640: 100% 29/29 [00:46<00:00,  1.61s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:08<00:00,  4.21s/it]
                    all         50        149      0.757      0.211      0.187     0.0464

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
        1/14      10.3G    0.06949    0.05329    0.02564         14        640: 100% 29/29 [00:41<00:00,  1.42s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.20it/s]
                    all         50        149      0.457      0.589      0.401      0.107

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
        2/14      10.3G    0.06675    0.04398    0.01508         16        640: 100% 29/29 [00:44<00:00,  1.53s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.57it/s]
                    all         50        149       0.26      0.485      0.328      0.103

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
        3/14      10.3G    0.06548    0.03972    0.01043         15        640: 100% 29/29 [00:42<00:00,  1.48s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.26it/s]
                    all         50        149      0.282      0.704      0.537      0.177

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
        4/14      10.3G    0.06062    0.03606    0.00862         12        640: 100% 29/29 [00:41<00:00,  1.44s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.38it/s]
                    all         50        149      0.404      0.798      0.667      0.328

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
        5/14      10.3G    0.05276    0.03419   0.007469         13        640: 100% 29/29 [00:43<00:00,  1.51s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.43it/s]
                    all         50        149      0.514      0.609      0.582      0.328

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
        6/14      10.3G    0.04568    0.03347   0.005993          7        640: 100% 29/29 [00:43<00:00,  1.50s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:02<00:00,  1.18s/it]
                    all         50        149      0.514      0.772        0.6      0.334

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
        7/14      10.3G     0.0409    0.03169   0.004739          6        640: 100% 29/29 [00:44<00:00,  1.54s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:02<00:00,  1.12s/it]
                    all         50        149      0.676      0.826      0.786      0.493

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
        8/14      10.3G     0.0381    0.03034    0.00471          7        640: 100% 29/29 [00:41<00:00,  1.42s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.37it/s]
                    all         50        149      0.863      0.888      0.902       0.52

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
        9/14      10.3G      0.035    0.02909   0.004154         22        640: 100% 29/29 [00:42<00:00,  1.45s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.46it/s]
                    all         50        149       0.81      0.863       0.87      0.515

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      10/14      10.3G    0.03239    0.02947   0.003605         11        640: 100% 29/29 [00:42<00:00,  1.47s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.40it/s]
                    all         50        149      0.828      0.872      0.869      0.571

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      11/14      10.3G    0.03016    0.02709   0.003643          9        640: 100% 29/29 [00:42<00:00,  1.48s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.43it/s]
                    all         50        149      0.822      0.896      0.893      0.628

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      12/14      10.3G    0.02786    0.02756   0.003469         11        640: 100% 29/29 [00:41<00:00,  1.42s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.53it/s]
                    all         50        149       0.87      0.862        0.9      0.619

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      13/14      10.3G    0.02609    0.02607   0.002858         14        640: 100% 29/29 [00:46<00:00,  1.61s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.40it/s]
                    all         50        149      0.859      0.886        0.9      0.632

      Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
      14/14      10.3G    0.02503    0.02679    0.00295         15        640: 100% 29/29 [00:42<00:00,  1.48s/it]
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:01<00:00,  1.38it/s]
                    all         50        149      0.853      0.889      0.907      0.649

15 epochs completed in 0.220 hours.
Optimizer stripped from yolov5/runs/train/exp2/weights/last.pt, 92.8MB
Optimizer stripped from yolov5/runs/train/exp2/weights/best.pt, 92.8MB

Validating yolov5/runs/train/exp2/weights/best.pt...
Fusing layers... 
Model summary: 267 layers, 46129818 parameters, 0 gradients, 107.7 GFLOPs
                  Class     Images  Instances          P          R      mAP50   mAP50-95: 100% 2/2 [00:02<00:00,  1.14s/it]
                    all         50        149      0.854      0.889      0.907      0.649
              Handbags         50         10       0.73        0.9      0.839      0.683
                  Pants         50         45       0.94      0.933      0.978      0.807
                Shirts         50         32      0.842      0.938       0.92      0.744
                  Shoes         50         41      0.758      0.683      0.805      0.527
            Sunglasses         50         21          1      0.992      0.995      0.482
Results saved to yolov5/runs/train/exp2

We will be using TensorBoard to log and visualize metrics during model training.

from tensorboard import notebook

%load_ext tensorboard
%tensorboard --logdir /content/yolov5/runs/train

notebook.display(port=6006, height=1000)
Selecting TensorBoard with logdir /content/yolov5/runs/train (started 0:00:00 ago; port 6006, pid 8519).
# Converting the weights from .pt format to .onnx format
python yolov5/export.py --weights  /content/yolov5/runs/train/exp2/weights/best.pt --include onnx
export: data=yolov5/data/coco128.yaml, weights=['/content/yolov5/runs/train/exp2/weights/best.pt'], imgsz=[640, 640], batch_size=1, device=cpu, half=False, inplace=False, keras=False, optimize=False, int8=False, per_tensor=False, dynamic=False, simplify=False, mlmodel=False, opset=17, verbose=False, workspace=4, nms=False, agnostic_nms=False, topk_per_class=100, topk_all=100, iou_thres=0.45, conf_thres=0.25, include=['onnx']
YOLOv5 🚀 v7.0-351-g19ce9029 Python-3.10.12 torch-2.3.1+cu121 CPU

Fusing layers... 
Model summary: 267 layers, 46129818 parameters, 0 gradients, 107.7 GFLOPs

PyTorch: starting from /content/yolov5/runs/train/exp2/weights/best.pt with output shape (1, 25200, 10) (88.5 MB)

ONNX: starting export with onnx 1.16.2...
ONNX: export success ✅ 5.1s, saved as /content/yolov5/runs/train/exp2/weights/best.onnx (176.4 MB)

Export complete (9.0s)
Results saved to /content/yolov5/runs/train/exp2/weights
Detect:          python detect.py --weights /content/yolov5/runs/train/exp2/weights/best.onnx 
Validate:        python val.py --weights /content/yolov5/runs/train/exp2/weights/best.onnx 
PyTorch Hub:     model = torch.hub.load('ultralytics/yolov5', 'custom', '/content/yolov5/runs/train/exp2/weights/best.onnx')  
Visualize:       https://netron.app

We have now successfully trained our custom YOLOv5-large model. We will now save the exported model weights and use them for inference.

# Downloading the exported weight
from google.colab import files
files.download('/content/yolov5/runs/train/exp2/weights/best.onnx')

3) YOLOv5 Inference

In this step, we’ll use the trained YOLOv5 model to run inference on test images, utilizing the saved weights from the training process. The goal is to accurately identify and classify various clothing accessories, such as shirts, pants, shoes, handbags, and sunglasses in new images.

Running Inference with OpenCV-DNN

To perform inference, we’ll utilize OpenCV-DNN (Deep Neural Network module in OpenCV), which provides an efficient way to load and run the YOLOv5 model on images. OpenCV-DNN is a popular tool for deploying deep learning models, offering high performance and compatibility with different programming languages like Python and C++.

Using OpenCV-DNN with our custom YOLOv5-large model, we can detect and classify objects in images, highlighting them with bounding boxes and labels. This approach is particularly useful for real-time applications where speed and accuracy are essential.

You can find more detailed guidance on how to implement YOLOv5 inference using OpenCV-DNN in the official OpenCV-DNN documentation. This resource provides step-by-step instructions for setting up the environment, loading the YOLOv5 model, and running inference on images, whether you are working with Python or C++.

# importing necessary dependencies
import cv2
import time
import sys
import numpy as np
import os

from matplotlib import pyplot as plt
# model input width
INPUT_WIDTH = 640

# model input height
INPUT_HEIGHT = 640

# confidence threshold for object detection
CONFIDENCE_THRESHOLD = 0.5

# making an array of all categories
CATEGORIES = ['Handbags', 'Pants', 'Shirts', 'Shoes', 'Sunglasses']
# load model using OpenCV
def build_model():
  # load custom YOLOv5
  model = cv2.dnn.readNet('/content/best.onnx')
  # return loaded model
  return model
# performing object detection
def detect(image, net):
  # convert image to blob
  # mean subtraction and scaling
  blob = cv2.dnn.blobFromImage(image, 1/255.0,
                            (INPUT_WIDTH, INPUT_HEIGHT),
                            swapRB=True, crop=False)

  # set the blob as input to the network
  net.setInput(blob)

  # get prediction from the model
  prediction = net.forward()
  return prediction
# load classes
class_list = CATEGORIES
class_list
['Handbags', 'Pants', 'Shirts', 'Shoes', 'Sunglasses']
# pre-processing
def format_yolov5(frame):

  # defining number of rows and columns in image numpy representation
  row, col, _ = frame.shape

  # finding the maximum between row and column
  _max = max(col, row)

  # initializing result matrix with zeros
  result = np.zeros((_max, _max, 3), np.uint8)

  # copying data from frame ndarray to result
  result[0:row, 0:col] = frame

  return result
# post-processing
def wrap_detection(input_image, output_data):
  class_ids = []
  confidences = []
  boxes = []

  # fetching number of rows in output_data
  rows = output_data.shape[0]

  # fetching height and width of input_image (a numpy image has shape (rows, cols, channels))
  image_height, image_width, _ = input_image.shape

  # x-factor for resizing
  x_factor = image_width / INPUT_WIDTH

  # y_factor for resizing
  y_factor =  image_height / INPUT_HEIGHT

  # iterate through detections
  for r in range(rows):

    # fetch bounding box co-ordinates
    row = output_data[r]

    # fetch confidence of the detection
    confidence = row[4]

    # keeping only sufficiently confident detections
    if confidence >= CONFIDENCE_THRESHOLD:
      classes_scores = row[5:]

      # get index of max class score
      _, _, _, max_idx = cv2.minMaxLoc(classes_scores)
      class_id = max_idx[1]

      if (classes_scores[class_id] > 0.25):
        # append confidence of new bounding box to the list
        confidences.append(confidence)
        class_ids.append(class_id)

        # get the center coordinates (x, y) and the width and height of the bounding box
        x, y, w, h = row[0].item(), row[1].item(), row[2].item(), row[3].item()

        # calculate x-coordinate of top-left point of bounding box
        left = int((x-0.5*w) * x_factor)

        # calculate y-coordinate of top-left point of bounding box
        top = int((y-0.5*h) * y_factor)

        # calculate width of bounding box
        width = int(w * x_factor)

        # calculate height of bounding box
        height = int(h * y_factor)

        # create array of coordinates of the bounding box
        box = np.array([left, top, width, height])
        # append new bounding box coordinates to the list
        boxes.append(box)

  # use non-maximum suppression to avoid multiple bounding boxes for the same object
  indexes = cv2.dnn.NMSBoxes(boxes, confidences, 0.25, 0.45)

  # define new lists to store class ID's, confidences and bounding boxes
  result_class_ids = []
  result_confidences = []
  result_boxes = []

  # loop through indices
  for i in indexes:
    # add detection confidence to the list
    result_confidences.append(confidences[i])

    # add detection class id to the list
    result_class_ids.append(class_ids[i])

    # add detection bounding box to the list
    result_boxes.append(boxes[i])

  return result_class_ids, result_confidences, result_boxes
# define colors for drawing bounding boxes
colors = [(255, 255, 0), (0, 255, 0), (0, 255, 255), (255, 0, 0), (255, 0, 255)]

# load model for inference
net = build_model()
import cv2
from matplotlib import pyplot as plt

def display_object_detection(frame):
    # Display the input and detected images
    plt.figure(figsize=(8, 8))
    # Show the detected image
    plt.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    plt.title('YOLOv5 Detection')
    plt.axis('off')
    plt.show()

# Main function for the custom object detection
def yolo_detect(img_path):
    # Read the image
    frame = cv2.imread(img_path)
    # Proceed forward if the image read above is not None
    if frame is not None:

        # Make a copy of the original image for displaying later
        original_image = frame.copy()

        # Pre-process the input image
        inputImage = format_yolov5(frame)
        # Object detection using our custom model
        outs = detect(inputImage, net)

        # Post-process the detections
        class_ids, confidences, boxes = wrap_detection(inputImage, outs[0])

        # Iterate through the detections for drawing annotations on the image
        for (classid, confidence, box) in zip(class_ids, confidences, boxes):
            # Choosing the color for drawing annotations
            color = colors[int(classid) % len(colors)]
            # Drawing the bounding box
            cv2.rectangle(frame, box, color, 2)
            # Preparing the label with the class name
            label = f"{class_list[classid]}"
            # Calculating label size
            label_size, base_line = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)
            # Drawing a filled rectangle above the bounding box for the label
            top_left = (box[0], box[1] - label_size[1] - base_line)
            bottom_right = (box[0] + label_size[0], box[1])
            cv2.rectangle(frame, top_left, bottom_right, color, cv2.FILLED)
            # Writing the label on top of the bounding box
            cv2.putText(frame, label, (box[0], box[1] - base_line), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)

        # Save the annotated image
        annotated_img_path = 'annotated_' + img_path.split('/')[-1]
        cv2.imwrite(annotated_img_path, frame)

        # display_object_detection(frame)
        return annotated_img_path
yolo_detect('input1.png')

(annotated output image)

yolo_detect('input2.png')

(annotated output image)

yolo_detect('input3.png')

(annotated output image)

Now, we will run inference on more test images to see how the object detection performs.

import cv2
import os
from matplotlib import pyplot as plt

# Function to create a grid of images
def create_image_grid(image_list, grid_size):
    # Calculate the size of the grid
    grid_rows, grid_cols = grid_size, grid_size
    fig, axes = plt.subplots(grid_rows, grid_cols, figsize=(15, 15))
    fig.subplots_adjust(hspace=0.1, wspace=0.05)

    # Iterate over the images and add them to the grid
    for i, img_path in enumerate(image_list):
        # Read the image
        frame = cv2.imread(img_path)
        if frame is not None:
            # Perform object detection and get the annotated image
            annotated_img_path = yolo_detect(img_path)
            # Read the annotated image
            detected_img = cv2.imread(annotated_img_path)
            # Convert BGR to RGB for displaying
            detected_img_rgb = cv2.cvtColor(detected_img, cv2.COLOR_BGR2RGB)

            # Calculate the row and column in the grid
            row = i // grid_cols
            col = i % grid_cols
            # Plot the image on the grid
            axes[row, col].imshow(detected_img_rgb)
            axes[row, col].axis('off')

    plt.show()

# Function to process a directory of images
def process_dir(directory_path, grid_size=(3, 3)):
    # Collect every image file in the directory
    image_list = [os.path.join(directory_path, img) for img in os.listdir(directory_path) if img.endswith(('.jpg', '.png', '.jpeg'))]

    # Create a grid of the detected images (detection is run inside create_image_grid)
    create_image_grid(image_list, grid_size[0])
process_dir('input')

(grid of annotated output images)

In this project, we successfully trained a custom object detection model that identifies clothing accessories worn by individuals: shirts (topwear), pants (bottomwear), sunglasses (eyewear), shoes (footwear), and handbags. The model has been fine-tuned to accurately locate and label these accessories in images, which makes it useful for applications such as fashion analysis and personal styling.