Solved: dlib detector too slow on the TK1 board


Recently we started using dlib on our TK1 (ARM) board, but it seems to take too long (about 3 s) to detect one face in a picture.

We installed it with 'pip install dlib' and tested it with the code below:

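import dlib
from skimage import io  # assumed: the original post did not show its imports

detector = dlib.get_frontal_face_detector()
img = io.imread("/home/ubuntu/face.jpg")
for i in range(1000):
    # Second argument: number of times to upsample the image before detecting
    dets = detector(img, 1)
    print("Number of faces detected: {}".format(len(dets)))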

It takes about 3 s to detect faces in one picture. Do you know what is wrong and how to fix it? Thanks~
Does the BLAS library have that much impact?

25 Answers

✔️Accepted Answer

My test code and compiler settings are here.

Updated RPI3 measurements:

Raspberry Pi 3 Model B [rev. a02082] (circa 2016)

armv7/1.2GHz (g++ (Raspbian 4.9.2-10/Raspbian))

Run   Flags                               Duration (ms)  Notes
5.    -O3                                 ~2904          Compiled, ran.
6.    -O3 -mfpu=neon                      ~1267          Compiled, ran.
10a.  -O3 -mfpu=neon -fprofile-generate   ~5600          Compiled, ran.
10b.  -O3 -mfpu=neon -fprofile-use        ~444           Did 10a, then compiled, ran.

Wow! 🥇
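To put the measurements in perspective, here is a small sketch that turns the durations above into speedup factors relative to plain -O3. The numbers are the approximate values from the table; the dictionary and function names are only illustrative.

```python
# Durations in ms, copied from the measurements above (all approximate).
durations_ms = {
    "-O3": 2904,
    "-O3 -mfpu=neon": 1267,
    "-O3 -mfpu=neon -fprofile-use": 444,
}


def speedup(baseline_ms, new_ms):
    """Return how many times faster new_ms is compared with baseline_ms."""
    return baseline_ms / new_ms


baseline = durations_ms["-O3"]
speedups = {flags: speedup(baseline, ms) for flags, ms in durations_ms.items()}

for flags, factor in speedups.items():
    print(f"{flags}: {factor:.2f}x vs plain -O3")
```

On these figures NEON alone gives roughly 2.3x, and NEON plus profile-guided optimization roughly 6.5x over plain -O3.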

Other Answers:

dets = detector(img, 1)

First, try changing this to dets = detector(img, 0), so the image is not upsampled before detection.
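A rough way to see why the second argument matters: assuming each upsampling step doubles the image width and height (my understanding of what the detector does, not something stated in this thread), the detector has about four times as many pixels to scan per step. The sketch below just does that arithmetic for a 400x600 image, the size mentioned elsewhere in this thread; the helper names are hypothetical.

```python
def upsampled_size(width, height, upsample_num_times):
    """Image size after doubling each dimension upsample_num_times times.

    Assumption: one upsampling step is an exact 2x scale in each dimension.
    """
    scale = 2 ** upsample_num_times
    return width * scale, height * scale


def pixel_count(width, height, upsample_num_times):
    """Number of pixels the detector would have to scan."""
    w, h = upsampled_size(width, height, upsample_num_times)
    return w * h


for n in (0, 1):
    w, h = upsampled_size(400, 600, n)
    print(f"upsample={n}: {w}x{h} = {pixel_count(400, 600, n)} pixels")
```

The trade-off is that without upsampling the detector will miss the smallest faces, so this only helps if the faces in your images are large enough.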

The next step is to use NEON optimizations. This is discussed here: #276
Another possibility is to run partial face detection (frontal faces only). This makes it run about 2x faster, at the cost of missing some faces. You can try reading this for more info

The TK1's CPU is quite slow, and the whole idea of the TK1 is to use the GPU for all processing tasks. Dlib does not support FHOG detectors on the GPU, but there are some in OpenCV.

One more problem with the TK1 is its 32-bit architecture: the maximum CUDA version for it is 6.5, and Dlib requires at least CUDA 7.5.

Switching to a Jetson TX1/TX2 is required to run Dlib's DNN algorithms.

400x600 is quite a small resolution, so I think there is no need to try smaller images.
Next is the NEON question. This is not something officially supported and should be double-checked.
Also check this doc for Jetson CPU speed tuning

Other possible optimizations are to skip the pyramid and to use only the frontal detector:

        typedef dlib::scan_fhog_pyramid<dlib::pyramid_down<6>, dlib::fhog_feature_extractor > image_scanner_type;
        image_scanner_type scanner;
        detector = dlib::object_detector<image_scanner_type>(scanner, detector.get_overlap_tester(), detector.get_w());

This new detector will run about 4x faster, but it will miss non-frontal faces and will detect only a limited range of face sizes (about 80 pixels).
But these are general optimizations that will work on a PC too, while a 50x gap is something very different. I assume the TK1 has about half the CPU frequency, which brings the gap down to 25x. SIMD should then give about a 2x-4x performance improvement, and the rest is possibly down to architecture differences, memory speed and bandwidth.
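The back-of-the-envelope estimate above can be written out as a short calculation. This is only a sketch of the reasoning: the 50x gap, the 2x clock ratio and the 2x-4x SIMD gain are the rough figures from this post, and the function name is hypothetical.

```python
def residual_gap(total_gap, clock_ratio, simd_gain):
    """Slowdown left unexplained after dividing out clock speed and SIMD."""
    return total_gap / (clock_ratio * simd_gain)


total_gap = 50.0   # TK1-vs-PC slowdown quoted in this post
clock_ratio = 2.0  # assumed: the TK1 clock is about half the PC's

after_clock = total_gap / clock_ratio
residual_low = residual_gap(total_gap, clock_ratio, 4.0)   # if SIMD gives 4x
residual_high = residual_gap(total_gap, clock_ratio, 2.0)  # if SIMD gives 2x

print(f"gap after clock difference: {after_clock:.1f}x")
print(f"gap left after SIMD: {residual_low:.2f}x to {residual_high:.2f}x")
```

Under these assumptions a factor of roughly 6x to 12x is still left over, and that is the part attributed to architecture, memory speed and bandwidth.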
To understand the real situation, I recommend measuring the face detection stages separately. The first stage is FHOG feature extraction:

        dlib::array<dlib::array2d<double>> hog;
        dlib::impl_fhog::impl_extract_fhog_features(img, hog, 8, 1, 1);
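If you are testing from Python rather than C++, a minimal timing helper along these lines can be used to measure each stage on its own. This is a sketch only: time_stage and the two stand-in workloads are hypothetical names, and the stand-ins just burn CPU so the example runs without dlib installed.

```python
import time


def time_stage(fn, repeats=5):
    """Call fn() repeats times and return (best_ms, mean_ms)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return min(samples), sum(samples) / len(samples)


# Stand-in workloads (hypothetical); replace them with the real stages.
def fake_feature_extraction():
    sum(i * i for i in range(20000))


def fake_full_detection():
    sum(i * i for i in range(60000))


stages = {
    "feature extraction": fake_feature_extraction,
    "full detection": fake_full_detection,
}
results = {name: time_stage(fn) for name, fn in stages.items()}

for name, (best_ms, mean_ms) in results.items():
    print(f"{name}: best {best_ms:.2f} ms, mean {mean_ms:.2f} ms")
```

With dlib available, the full detector can be timed by passing something like lambda: detector(img, 0) as fn. Reporting the best time as well as the mean helps separate the cost of the code itself from noise caused by CPU frequency scaling.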

The real way to make face detection work well on the Tegra TK1 is to rewrite the code in CUDA; this is the main idea of all Jetsons.

