When I first started using Nvidia DIGITS to train object detection models, I found many articles on how to train a single-class object detection model based on the KITTI dataset. One of those tutorials can be found here; it shows how to detect cars.

Once I felt comfortable training single-class models, I wanted to be able to detect multiple object classes too. My assumption that training a multiclass object detection model with DIGITS would be straightforward was proven wrong. There are some things that need to be done but are not available by clicking a button or selecting an option in the DIGITS interface. Furthermore, I couldn't find a single article about training a multiclass detection model on the Internet, not even on Nvidia’s websites. All I found were bits of information scattered around different forums, so I decided to put it all together into one article.

If you are struggling to train a multiclass detection model, getting only zeros on your chart while training, and wondering what’s wrong, this is the article for you! I even take it a step further and train the model to detect custom objects which are not KITTI Cars, Vans, Pedestrians or similar.



This article is a sequel to my previous article explaining the process of data preparation. I would highly recommend reading it, since it covers important steps which are crucial for the training part to work as expected. In this article I will show you how to train a model to detect UPS and FedEx delivery vehicles using the dataset from my previous article.



The object detection model training process can be summarized in the following steps:

  1. Preparing data for custom object detection model training (done in the previous article)

  2. Learning about DIGITS, DetectNet and prerequisites

  3. Creating an object detection task in DIGITS

  4. Modifying the “prototxt” file

  5. Training a model with DIGITS

  6. Experimenting with different pretrained weights




About DIGITS, DetectNet and Prerequisites


Nvidia DIGITS is a graphical user interface that runs in your browser. It allows you, among other things, to train an object detection model. What it actually does behind the scenes is this: it takes the parameters you have entered through the GUI and, in the case of object detection, runs a training process using the DetectNet network configured with those parameters. DetectNet is an object detection network implemented using Nvidia’s branch of the popular Caffe deep learning framework, called “nvcaffe”. It is DIGITS’s main and, at the time of this writing, only network for training object detection models. The purpose of DIGITS is to simplify and visualize the training process built on top of a concrete deep learning framework, Caffe in this case.

Running DIGITS has some requirements: an Ubuntu operating system, Nvidia graphics card drivers, CUDA drivers and a CUDA-capable Nvidia graphics card. Installation of those requirements won’t be covered here, but the process is well documented on Nvidia’s website. I would recommend the Docker image installation because it’s simpler. My DIGITS instance is running on a remote computer with an Nvidia GTX 1080 Ti 11GB graphics card.




Creating an Object Detection task in DIGITS


On the DIGITS home page, select the Models tab then click New Model > Images > Object Detection. A form will open for you to enter some of the parameters I’m going to discuss now.



Getting good accuracy when training custom classes can sometimes take a lot of time, so I enter a larger number of epochs, 600 in this case. Notice that I’ve selected the “UPS_and_Fedex2” dataset, which is labeled with KITTI predefined class names. FedEx is labeled as a car, and UPS as a van. Training with custom labels, which in my case is the “UPS_and_Fedex3” dataset (sorry for the bad naming), will be explained later.

Batch size and accumulation are parameters which will change depending on your graphics card’s memory size. They determine how many pictures are processed at once. Since my graphics card has 11 GB of memory, the numbers 2 and 5 work great. If you are not sure which numbers to enter, start with 1 and keep increasing until you run out of memory, which will be indicated by a red error message after you start training.



A number indicating how many bytes it tried to allocate will give you an idea of how you should modify your batch size. The goal is to fill as much memory as you can, without getting an error. Keep in mind that image size also makes a difference here. Larger images will take more memory.
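A rough way to think about these two numbers: gradients are accumulated over several small batches, so the batch the optimizer effectively sees is their product. A simple illustration (not DIGITS code), using the values from my setup:

```python
# Batch size: images per forward/backward pass (limited by GPU memory).
# Accumulation: how many of those passes are accumulated before a weight update.
batch_size = 2
accumulation = 5

# The optimizer effectively updates weights using batches of this many images:
effective_batch_size = batch_size * accumulation
print(effective_batch_size)  # 10
```

So lowering the batch size to fit your GPU memory while raising accumulation keeps the effective batch size roughly the same.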

Set subtract mean to none and solver type to ADAM. ADAM is an algorithm that updates weights in the neural network after each iteration. It has proven to be very effective, so I decided to use it from the start. You can learn more about ADAM here. Feel free to experiment with other solvers too.

Learning rate is also a parameter that can be experimented with. It’s the proportion by which weights are updated. A lower learning rate means slower learning with more accuracy. In this case, 0.0001 has proven to be a good starting point. Under policy, select exponential decay, which is a function that decreases the learning rate over time.



Under the Networks tab, select Custom Network and click the Caffe tab underneath. The network you can see in the picture won’t be there because we need to obtain and modify it first. Skip that step for now; it’s going to be explained in the next chapter. Under pretrained model(s) you need to enter the path to your pretrained weights. It’s very important to use pretrained weights, otherwise training will take too much time. The GoogLeNet weights used in this example can be downloaded here.



Modifying the “prototxt” File

Prototxt is a Caffe file format describing the structure of a neural network. DetectNet, the network we are using for object detection, can be downloaded here. However, this file is configured for training a single-class model, so we need to modify it. Fortunately, Nvidia made a two-class prototxt which can be downloaded here. This file is a good starting point for multiclass object detection. Let’s look at the differences in order to work out what should be modified for training an N-class model.

$ diff detectnet_network.prototxt detectnet_network-2classes.prototxt

Running this command will give you the output representing differences between the files (you can also use a GUI diff online tool). Here is the summary:


Modification when Using KITTI Labels

In the case of multiple classes, a mapping needs to be done according to this document. We used car to represent FedEx and van to represent UPS, so source 8 becomes source 2 because that’s the class ID of a van. This is the result, which also has to be replaced on line 121:
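Assuming the standard KITTI class IDs (1 for car, 2 for van), the resulting mapping block should look roughly like this (verify against your own prototxt):

```
object_class: { src: 1 dst: 0 }  # car (FedEx) -> class 0
object_class: { src: 2 dst: 1 }  # van (UPS)   -> class 1
```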


Modification when Using Custom Class Names

When training with a dataset that has custom class names (Fedex, UPS), source numbers start at 1 and increase by one for every class, while destination numbers start at 0 and increase by one. Here is the example:
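Applying that numbering rule to our two custom classes gives a mapping along these lines:

```
object_class: { src: 1 dst: 0 }  # Fedex -> class 0
object_class: { src: 2 dst: 1 }  # UPS   -> class 1
```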

If we decided to also include DHL delivery, the corresponding object would look like this:
object_class: { src: 3 dst: 2} # DHL -> 2


Dimension and Python Layer Modifications

The number of outputs of the cvg/classifier layer has to be changed from 1 to N:
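In the prototxt, cvg/classifier is a convolution layer; for our two classes the relevant fragment should end up along these lines (elided fields stay as they are in your file):

```
layer {
  name: "cvg/classifier"
  type: "Convolution"
  # ...
  convolution_param {
    num_output: 2  # one coverage output per class; was 1 in the single-class file
    # ...
  }
}
```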

Cluster boxes layer modification:
Pay attention to param_str and the last number in the string, which is the number of classes (line 2507). Every class also has its own entry before the top parameter (lines 2502 and 2503).
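From memory of Nvidia’s two-class prototxt (double-check the exact values and layer names against your file), the cluster layer looks roughly like this; the per-class top entries and the trailing “2” in param_str are the multiclass additions:

```
layer {
  name: "cluster"
  type: "Python"
  bottom: "coverage"
  bottom: "bboxes"
  top: "bbox-list-class0"  # one top per class
  top: "bbox-list-class1"
  python_param {
    module: "caffe.layers.detectnet.clustering"
    layer: "ClusterDetections"
    # image width, image height, stride, coverage threshold,
    # rectangle threshold, rectangle eps, min height, number of classes
    param_str: "1248, 352, 16, 0.6, 3, 0.02, 22, 2"
  }
}
```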

Similar modifications are made in the other Python layers:

New Python layers are added for the second class: previous entries get the suffix class0, and the same layers are copied and renamed to accommodate the second class, class1.

Now, the only thing left is to set the correct image input size. We used standard HD resolution, which is 1280 x 720 pixels, but this example uses 1248 x 352 pixels. We can fix this by finding and replacing all occurrences of those numbers in the entire file, which should be an easy task. After this step you should be able to copy the contents of the modified prototxt file into the Caffe model text area in DIGITS and click Create to start training.
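The find-and-replace can be done with sed. The snippet below runs on a tiny stand-in file just to show the substitutions; in practice you would point the sed command at your modified two-class prototxt (the filename here is only an example):

```shell
# Create a tiny stand-in for the network file and swap the example
# dimensions (1248 x 352) for our HD resolution (1280 x 720).
printf 'dim: 352\ndim: 1248\n' > network.prototxt
sed -i 's/1248/1280/g; s/352/720/g' network.prototxt
cat network.prototxt  # now shows dim: 720 and dim: 1280
```

Note that `sed -i` edits the file in place, so keep a backup of the original prototxt if you want to compare afterwards.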


Training a Model with DIGITS

With everything set up correctly, you should be able to see the training process going on. It took 3 hours and 13 minutes to go through 600 epochs with my graphics card. The results are shown in the picture below, and they are a disappointment. The first class didn’t gain any accuracy at all!


Despite this failure, I wanted to give it another chance and try something else. I made the previously failed model a pretrained model and used its weights to start a new clone job. To make a model pretrained, find the green button under the Trained Models section, give the model a name, and save it. Now you should be able to click Clone Job in the top right corner and select the Pretrained Models tab, where there should be a list containing your pretrained model. Select that model and click Create again. This will start a new training run with the same configuration and the weights from the previously failed model. After 2 hours and 52 minutes, this is what I got:

The result is not bad: class0 reached a mAP of 67 and class1 a mAP of 68 at epoch #320. Before any fine tuning, I decided to run inference on a few pictures to see how accurate the model is in practice. The results were pretty good!

This model could also be fine tuned by repeating the training process with an even lower learning rate, and maybe another solver algorithm. If you decide to go down that path, don’t forget to make your successful or partially successful model pretrained and use its weights as I did before. Also make sure you select the epoch whose accuracy, precision and recall you are most satisfied with.


Experimenting with Other Pretrained Weights

The previous training was a success, but it took a lot of time and GPU power to get the desired result. I kept wondering if this process could be any faster. I had some car and van detection models previously trained on the KITTI dataset with decent accuracy. The dimensions matched, since that was also a 2-class model. I picked an epoch where accuracy for both classes was good and made a pretrained model. Then I started a new training run, this time using the second dataset labeled with custom names (Fedex, UPS), just to prove it can also work. The result after only 100 epochs was already something, compared to the previous example where one of the classes didn’t gain any accuracy at all even after 600 epochs!

I ran some inferences and got slightly worse results than with the previous model, but that was expected since this time accuracy was much lower. This model needs to go through at least 300 more epochs to gain some serious accuracy, but the point is proven.



Training object detection models can be so much fun! It’s really exciting to see results working just as you expected. However, getting an accurate model can be time consuming. Fortunately, there are some tricks that can speed things up a little bit. In any case, you will be rewarded for your patience. If results are not showing in the first 100, 200, or maybe 300 epochs, don’t lose hope. Give it some time, and results will eventually show up!

Feel free to contact us if you have any questions!