Before jumping into training the object detection model and all the fun that comes with it, there is an important step that needs to be taken first: preparing the data. Object detection training depends on a quality image dataset, which is the foundation of a well-trained, accurate model. Therefore, special attention must be given to the data we are about to train our model with. This article covers steps and tips on accomplishing that goal using Nvidia’s stack of tools through a practical example. Without further ado, let’s get on with it!


As mentioned before, this process is going to be explained through a practical example. Let’s assume our task is to train a delivery detection model which will detect UPS and FedEx logos. This example also covers how to prepare data for training a multiclass detection model. Please keep in mind that the same principles apply to a single-class detection model. This article will be followed by another one explaining how to train a multiclass object detection model using Nvidia DIGITS and the dataset created in this article.




The whole process of preparing the data can be summarized into a few (easy) steps in the following order:


  1. Obtaining an image dataset

  2. Resizing images to adequate size

  3. Obtaining an image labeling tool which supports KITTI format

  4. Labeling images

  5. Splitting images and labels into training and validation folders

  6. Importing dataset into Nvidia DIGITS



Obtaining an image dataset


When it comes to obtaining an image dataset for your custom classes, there could be two scenarios: either you’ll be able to find images from one of the online sources, or you’ll have to take them yourself with a camera. In our case, there are plenty of UPS and FedEx images online so I’m going to use those. 

Keep in mind that some of the parameters like camera angle, distance from the camera, brightness and quantity of images significantly depend on your use case. If you want to detect objects during the night, use images taken at night. Daylight images tend not to work well in that case. Similarly, if you want to detect objects from a bird’s perspective, you need images of objects from that perspective. Images taken from a frog’s perspective won’t yield good results there. Also, make sure you can get a sufficient number of images. There is no specified minimum since in some cases you’ll be able to train the model with fewer images, but more is always better. Image size is also something you should pay attention to. Make sure, if possible, that your images are not too small because you might lose some of the details.

For this demo, I have downloaded 186 FedEx and 141 UPS images from Google. None of them is smaller than 400 x 400 pixels and most of them are taken at eye level. The images include trains, trucks, planes, packages, buildings, toys and other objects with the logos. Remember it’s the logo we are after here, not a specific delivery vehicle.




Resizing images to adequate size

Images must be resized to match the detection model’s input dimensions. In our case, we’ll use Nvidia’s DetectNet as our main object detection model in DIGITS v6. The DetectNet configuration can be altered to accept custom image sizes, and by default it’s set to 1392 x 512. This resolution is fine if you want to train your model on the KITTI dataset (Cars, Vans, Pedestrians…), but we want to use our custom data so we’ll stick with the standard HD resolution of 1280 by 720 pixels. How to customize the detection model will be explained in our next article.

The most important thing when resizing images is to maintain the aspect ratio. There is an option to resize images in DIGITS v6, but it doesn’t preserve the aspect ratio. To automate resizing of hundreds of pictures I wrote a simple script in Python. It loops through the images in a folder, resizes each one to fit inside the target size, and adds black padding to the bottom and right edges so the aspect ratio is preserved.

The script takes four command-line arguments, in this order: target width, target height, input directory and output directory. Here is the full script:

import sys
import os
from os import walk
import cv2

# expect exactly four arguments after the script name
if len(sys.argv) != 5:
    print("Please specify width, height, input directory and output directory.")
    sys.exit(1)
# width to resize
width = int(sys.argv[1])
# height to resize
height = int(sys.argv[2])
# location of the input dataset
input_dir = sys.argv[3]
# location of the output dataset
out_dir = sys.argv[4]
# get all the pictures in directory
images = []
ext = (".jpeg", ".jpg", ".png")
for (dirpath, dirnames, filenames) in walk(input_dir):
    for filename in filenames:
        if filename.endswith(ext):
            images.append(os.path.join(dirpath, filename))
for image in images:
    img = cv2.imread(image, cv2.IMREAD_UNCHANGED)
    if img is None:
        # skip files OpenCV cannot decode
        continue
    h, w = img.shape[:2]
    pad_bottom, pad_right = 0, 0
    ratio = w / h
    if h > height or w > width:
        # shrinking image algorithm
        interp = cv2.INTER_AREA
    else:
        # stretching image algorithm
        interp = cv2.INTER_CUBIC
    # fit to the target width first, then fall back to the target height
    w = width
    h = round(w / ratio)
    if h > height:
        h = height
        w = round(h * ratio)
    pad_bottom = abs(height - h)
    pad_right = abs(width - w)
    scaled_img = cv2.resize(img, (w, h), interpolation=interp)
    # pad the bottom and right edges with black pixels up to the target size
    padded_img = cv2.copyMakeBorder(
        scaled_img, 0, pad_bottom, 0, pad_right,
        borderType=cv2.BORDER_CONSTANT, value=[0, 0, 0])
    cv2.imwrite(os.path.join(out_dir, os.path.basename(image)), padded_img)


Now, you should have the same pictures in size 1280 by 720 pixels. Here’s what a padded picture looks like after running the script:

Obtaining an Image Labeling Tool That Supports KITTI Format


KITTI Format Explained

Labeling images for object detection is a process where we create files that describe the regions of interest in each image. A region of interest (ROI) is a rectangular bounding box around the area of an image containing the object we want to detect. There are a few formats for labeling object detection data, but Nvidia’s DetectNet uses the KITTI format. Each label file has the same name as its image, with a .txt extension.

More on KITTI format can be found here. As you can see, besides class names and bounding boxes, KITTI format can accept some other parameters like truncated, occluded, etc. For the sake of simplicity we are going to omit those extra parameters and use only class names and bounding boxes, but feel free to experiment with those later.
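To make the format more concrete, here is a minimal sketch of how a single label line could be built when only the class name and the bounding box are known. The helper name kitti_line and the sample coordinates are made up for illustration. The field order (type, truncated, occluded, alpha, four bounding box values, then seven 3D-related values) follows the KITTI object label layout, and every field we are omitting is simply set to zero.

```python
def kitti_line(class_name, left, top, right, bottom):
    """Build one KITTI label line from a class name and a pixel bounding box."""
    # Fields: type, truncated, occluded, alpha, bbox (left, top, right, bottom),
    # dimensions (3 values), location (3 values), rotation_y.
    # Everything except type and bbox is zeroed out in this sketch.
    fields = [class_name, "0.0", "0", "0.0"]
    fields += [f"{v:.2f}" for v in (left, top, right, bottom)]
    fields += ["0.0"] * 7
    return " ".join(fields)


# hypothetical FedEx logo located at (410, 225)-(690, 340) in a 1280 x 720 image
print(kitti_line("FedEx", 410, 225, 690, 340))
```

An image with several logos would have one such line per logo in its .txt file.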


Installing Image Labeling software

At the time of writing this article there weren’t many tools for labeling images that support KITTI format. I found only one free application capable of doing that and it’s not really a breathtaking piece of software. There is a workaround of using a different format and converting it to KITTI with some Python scripts, but I just wanted to keep it simple and use the direct solution. The labeling tool is called “Alp’s labelling tool” and it’s a free plugin for an image editor called “Fiji”. Download links and an installation guide can be found here. Make sure you also check out the videos explaining how to use the tool at the bottom of the page.


Labeling images

Notice that in this example I’m using the custom class names UPS and FedEx. You can also use predefined KITTI names like Car, Van, Pedestrian or any other from the list. It doesn’t matter if these names do not match yours, because you can always map your names to KITTI ones during inference.

Now, there is one class you should pay special attention to and it’s called “dontcare”. With this class you allow an object contained within that bounding box to be ignored during training. For example, if you think the object you want to detect is too small, too far, clipped or otherwise difficult to recognize, label it as “dontcare”. This way detections in those regions are not counted as false positives during training.

If you want to experiment with different class names after you have already labeled all the images, you can rename every occurrence of a string in your label files with a command like this one:

$ find /path_to_label_folder/ -type f -exec sed -i 's/Car/Fedex/g' {} \;

The same command can also be executed in Windows 10 if you enable the Ubuntu bash console feature.
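Keep in mind that the sed command replaces the string wherever it appears in a line. As a more conservative alternative, here is a small Python sketch that rewrites only the first field of each label line, which is where KITTI stores the class name. The function name rename_class is hypothetical, and the sketch assumes your label files are plain space-separated .txt files.

```python
import os


def rename_class(label_dir, old_name, new_name):
    """Rename a class in every .txt label file under label_dir.

    Returns the number of label lines that were changed.
    """
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(label_dir):
        for filename in filenames:
            if not filename.endswith(".txt"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path) as f:
                lines = f.read().splitlines()
            new_lines = []
            for line in lines:
                parts = line.split(" ")
                # the class name is the first space-separated field
                if parts[0] == old_name:
                    parts[0] = new_name
                    changed += 1
                new_lines.append(" ".join(parts))
            with open(path, "w") as f:
                f.write("\n".join(new_lines) + ("\n" if new_lines else ""))
    return changed
```

Calling rename_class("/path_to_label_folder/", "Car", "Fedex") would then do the same job as the command above, but only for lines whose class is exactly Car.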

At the end of this step you should have label files for all the images in your folder.


Splitting Images and Labels into Training and Validation Folders

Images and labels should be split into two folders: “train” and “val”. Each of them should contain another two folders, “images” and “labels”. The validation (val) folder should contain about 10% of the images and labels from your original folder, and the training (train) folder should contain the other 90%. By doing this we are giving DIGITS one folder to train on and one folder to validate on. After each epoch of training with pictures and labels from the “train” folder, DIGITS will try to validate the model using images and labels from the “val” folder. If a detection does not match the bounding boxes stated in the corresponding label file, the calculated accuracy will be lower. The goal is to get the highest possible accuracy. Splitting folders is also explained in this link under “Folder structure”.

train/
├── images/
│   └── 000001.png
└── labels/
    └── 000001.txt
val/
├── images/
│   └── 000002.png
└── labels/
    └── 000002.txt
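If you don’t feel like moving files by hand, the split can be scripted. The following is only a sketch under a few assumptions: all images sit in one folder, all labels sit in another, and each label shares its image’s base name. The function name split_dataset and its parameters are hypothetical, and the 10% validation share is just the default.

```python
import os
import random
import shutil


def split_dataset(image_dir, label_dir, out_dir, val_fraction=0.1, seed=0):
    """Copy image/label pairs into out_dir/train and out_dir/val.

    Returns a tuple (number of training pairs, number of validation pairs).
    """
    exts = (".jpeg", ".jpg", ".png")
    images = sorted(f for f in os.listdir(image_dir) if f.lower().endswith(exts))
    # keep only images that have a label file with the same base name
    pairs = []
    for img in images:
        label = os.path.splitext(img)[0] + ".txt"
        if os.path.isfile(os.path.join(label_dir, label)):
            pairs.append((img, label))
    # shuffle with a fixed seed so the split is reproducible
    random.Random(seed).shuffle(pairs)
    n_val = round(len(pairs) * val_fraction)
    splits = {"val": pairs[:n_val], "train": pairs[n_val:]}
    for split, items in splits.items():
        img_out = os.path.join(out_dir, split, "images")
        lbl_out = os.path.join(out_dir, split, "labels")
        os.makedirs(img_out, exist_ok=True)
        os.makedirs(lbl_out, exist_ok=True)
        for img, label in items:
            shutil.copy(os.path.join(image_dir, img), img_out)
            shutil.copy(os.path.join(label_dir, label), lbl_out)
    return len(splits["train"]), len(splits["val"])
```

The files are copied rather than moved, so the original folders stay untouched in case you want to try a different split later.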



Importing the Dataset into Nvidia DIGITS

Finally, the last step is to import all those images and labels from the folders we created into Nvidia DIGITS. Once you open your instance of DIGITS, select the Datasets tab and, on the right side, select Object Detection from New Dataset (Images) combo box. You should be able to see a form like this one:

Parameters You Should Enter Correctly:

  • Training image folder – location of your images folder inside training folder /your_location/train/images

  • Training label folder – location of your labels folder inside training folder /your_location/train/labels

  • Validation image folder – location of your images folder inside validation folder /your_location/val/images

  • Validation label folder – location of your labels folder inside validation folder /your_location/val/labels

  • Custom classes – if you’ve decided to use predefined KITTI class names, leave this blank. If you are using your custom class names, like in this example, enter the class names separated with a comma. It’s important that you also put “dontcare” first in the list if you are going to use predefined weights to train your model on. It is recommended to use predefined weights, otherwise it will take too much time for your model to be trained. If you don’t include “dontcare” at the beginning of the list, the first class will be ignored and it won’t be trained at all! Don’t worry about capital letters because they will all be turned into lowercase.

  • Dataset name – name your dataset


After completing the form, click the Create button at the bottom. The import process should take a few seconds, and the job turns green when it finishes successfully.

There is a possibility the process will fail and be displayed in red. An error message should be displayed then, letting you know what went wrong. Failures mostly happen due to missing or incorrectly named label files. In that case, check that the number of images matches the number of labels in the folders, and that every label file name matches its corresponding image name. Mistyping a folder path can also cause an error, as shown in the picture above. Once the process successfully completes, your dataset is ready to be used for object detection model training in DIGITS!
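Checking names by eye gets tedious with a few hundred files, so a quick script can do it before you retry a failed import. This is a rough sketch, not something DIGITS provides: the function name find_mismatches is hypothetical, and it assumes images use the .jpeg, .jpg or .png extensions and labels use .txt.

```python
import os


def find_mismatches(image_dir, label_dir):
    """Compare base names of images and labels.

    Returns (images without a label, labels without an image), both sorted.
    """
    exts = (".jpeg", ".jpg", ".png")
    image_names = {
        os.path.splitext(f)[0]
        for f in os.listdir(image_dir)
        if f.lower().endswith(exts)
    }
    label_names = {
        os.path.splitext(f)[0]
        for f in os.listdir(label_dir)
        if f.lower().endswith(".txt")
    }
    return sorted(image_names - label_names), sorted(label_names - image_names)
```

Run it once for the train folders and once for the val folders. If both returned lists are empty, every image has a label with a matching name and the other way around.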



After going through the whole process of data preparation it’s obvious that this can be a time-consuming and repetitive task, especially the part where you have to acquire and label lots of images. It takes about an hour to hand-pick and download 320 pictures, and approximately three hours to label them. Fortunately, those tasks do not require any special machine learning or computer vision skills so they can be delegated to someone else. It would also be nice if Nvidia could provide a feature for image resizing that maintains the aspect ratio in the next version of DIGITS. Anyhow, despite all the effort it takes, it’s still an exciting thing to do and it definitely feels good once you start to train your model and watch those accuracy lines rise on the chart.

Feel free to contact us if you have any questions!