Solving libfreenect hangs in the Depth Controller - gdb saves the day

This is a lengthy technical post explaining some of the work involved in the development of the Nitrogen Logic Depth Camera Controller.  You can follow my thought process through the discovery of a bug, the investigation of its cause, and the implementation of a solution.  Executive summary: developer places tongue in cheek, heroically overcomes numerous obstacles, uses stress testing and gdb to find the cause of a bug, and ultimately fixes the problem.

the problem

Since the introduction of color image support in the Depth Controller, the low-level camera server on the controller has been experiencing occasional hangs.  These hangs could be reproduced quite easily by switching to the Video tab on the controller's web interface, then moving or resizing a zone.

The camera server's built-in watchdog detects these hangs, resetting the depth camera when they occur.  Every time the camera is reset, there's a several second delay while the software and camera reconnect, and sometimes the first few seconds of data from the depth camera are skewed by about half a meter.  With these hangs happening quite often when a zone was adjusted on the Video tab, it was clearly necessary to do something about the problem.
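At its core, a watchdog like this is just a staleness check on a shared timestamp.  Here's a minimal sketch of the pattern; the names last_update and touch_watchdog() and the one-second timeout are illustrative, not taken from the actual knd source:

```c
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

/* Sketch of a watchdog staleness check.  The names last_update,
 * touch_watchdog(), and the one-second timeout are illustrative;
 * they are not taken from the actual knd source. */

static atomic_long last_update;         /* time of the last camera update */
static const long WATCHDOG_TIMEOUT = 1; /* seconds of silence before firing */

/* Called by worker threads every time fresh camera data arrives. */
static void touch_watchdog(void)
{
    atomic_store(&last_update, (long)time(NULL));
}

/* Called periodically by the watchdog thread; returns nonzero if the
 * camera has gone quiet for too long and should be reset. */
static int watchdog_expired(void)
{
    return (long)time(NULL) - atomic_load(&last_update) > WATCHDOG_TIMEOUT;
}
```

Worker threads call touch_watchdog() whenever a frame is processed; a dedicated watchdog thread polls watchdog_expired() and resets the camera when it returns true.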

the investigation

Based on the symptoms, I came up with a few possible explanations:

  1. The cause might be within the Kinect itself: perhaps repeatedly stopping and starting its RGB camera was causing the device to hang.
  2. A bug in libfreenect was causing an invalid command to be sent to the camera.
  3. The controller didn't have enough USB bandwidth or CPU power to keep up with all of the depth and video data, causing data to be lost.  The resulting data loss caused the camera to hang.
  4. A race condition between the server's network thread and camera processing threads was leading to a deadlock.

The only way to narrow down the cause was to gather more information.  I started by adding better logging facilities to the camera backend that would also allow me to dump low-level debugging information from libfreenect.  If I could find a common set of events that always happened right before the camera stopped, I would most likely eliminate hypothesis #1.  The massive amount of data generated by the new logging features made debugging directly on the controller untenable.
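To give a rough idea of what leveled logging looks like, here's a minimal sketch.  The knd backend reads its verbosity from the KND_LOG_LEVEL environment variable (which shows up again later in this post), but the function names and level numbering here are assumptions:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of leveled logging gated by the KND_LOG_LEVEL environment
 * variable.  The variable is real, but the function names and level
 * numbering here are assumptions for illustration. */

static int log_level(void)
{
    const char *s = getenv("KND_LOG_LEVEL");
    return s ? atoi(s) : 0; /* unset means quiet */
}

static void log_msg(int level, const char *fmt, ...)
{
    va_list ap;

    if (level > log_level())
        return; /* message is more verbose than the current level */

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
}
```

With a scheme like this, the ultra-verbose libusb-level spew only costs anything when the environment variable is raised past its default.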

Hypothesis #1 seemed unlikely because the bug was triggered by specific application behavior layered on top of video support, not by video streaming alone.  One might also expect to see USB disconnection messages in the kernel dmesg output; the lack thereof further discredited hypothesis #1.

I upgraded to the latest version of libfreenect, in case the problem was #2, a libfreenect bug that had since been fixed upstream.  The updated version of libfreenect looked like it supported a 320x240 mode (the libfreenect.h header defines a constant called FREENECT_RESOLUTION_LOW), which would use 25% as much data as 640x480.  I was hoping that switching to 320x240 would make the problem go away, confirming hypothesis #3.  Unfortunately, libfreenect doesn't actually support a low-resolution video mode, and there are some claims that the Kinect itself lacks a 320x240 mode (despite Xbox 360 press coverage indicating that the Kinect's launch resolution was 320x240, with a later software update to reach 640x480).

I reasoned that if I could reproduce the bug on my desktop PC I would eliminate hypothesis #3, so I ran a pair of active USB extension cables from my desktop to the Kinect.  At first the bug did not occur on my PC, even when I opened multiple browsers to the Video tab and thrashed dozens of zones around like a madman.


I tried using valgrind's thread checking tools to identify the problem, but they made the camera server run so slowly that the watchdog timer fired constantly and the interface was unusable.

Meanwhile, I was still working on further improving performance to allow more zones to be created and monitored, trying to get the Depth Camera software working on the Raspberry Pi, and looking for other ways of reducing USB bandwidth requirements.  In every case I ran into a dead end: the performance "improvements" turned out to be neutral, the Raspberry Pi proved incapable of working with the Kinect, and, as mentioned before, 320x240 support was out of the question.

All this discouragement over several days was getting rather annoying, so I spent quite a few hours playing the awesome Black Mesa Mod to clear my mind and get a fresh perspective.

the cause

I kept thinking about hypothesis #4 — if indeed there was a race condition on the controller under a certain amount of stress, maybe I could trigger the same condition on my PC by running a stress test.  Since the hang would occur while simultaneously watching the Video tab and moving a zone, I opened my browser to the Video tab, then ran the following command to simulate rapidly changing a zone:

$ (while true; do echo 'setzone 34,xmin,-1.589'; done) | nc localhost 14308

Success!  In less than a minute the hang was triggered.  Hypothesis #3 was eliminated.  But wait, I forgot to enable the detailed logging features!  I produced a debug build, set the appropriate environment variable to enable ultra-verbose logging, and started gdb (the low-level camera backend process is called knd):

$ KND_LOG_LEVEL=10 gdb ./knd

In order to get an exact snapshot of the program at the time of the hang, I set a breakpoint in the watchdog callback, automatically printing a stack trace of each thread at the instant the watchdog fires (there are five threads in the camera server):

(gdb) break watchdog_callback
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
> info threads
> thread 1
> bt
> thread 2
> bt
> thread 3
> bt
> thread 4
> bt
> thread 5
> bt
> end

Then I launched the camera server with the run command, reloaded the Video tab in my web browser, and started my stress test.  Within minutes, I had logs and stack traces for several hangs.  It was time to look for a pattern.

Side note: as an example of how measuring a system can change its behavior, I added an additional field to the logging output after generating five traces.  With this one additional field, the hang became much rarer, requiring several minutes of stress testing to trigger a single event.  Tricky timing bugs are tricky.

Every trace was the same, with the obvious exception of timestamps and thread IDs.  Here are the first few lines from a trace (manually colored for clarity and visual appeal):

2012-10-05 03:53:00.992642 -0600 - main_thread - Camera spew: Got video frame of size 307200/307200, 162/162 packets arrived, TS a5ffa17f
2012-10-05 03:53:00.993029 -0600 - video_thread - Camera debug: Write Reg 0x0005
2012-10-05 03:53:01.941985 -0600 - watchdog_thread - knd.c:89: watchdog_callback():  Timed out: at least 0.949313668s since last update.
[Switching to Thread 0x7ffff6934700 (LWP 9212)]

Breakpoint 1, watchdog_callback (data=, interval=0x7ffff6933e80) at knd.c:91
91              if(info->stop) {
  Id   Target Id                                        Frame
  5    Thread 0x7ffff5004700 (LWP 9215) "kndsrv_thread" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
  4    Thread 0x7ffff5805700 (LWP 9214) "video_thread" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
  3    Thread 0x7ffff6006700 (LWP 9213) "depth_thread" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
* 2    Thread 0x7ffff6934700 (LWP 9212) "watchdog_thread" watchdog_callback (data=, interval=0x7ffff6933e80) at knd.c:91
  1    Thread 0x7ffff7fbd720 (LWP 9211) "main_thread" sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86

Every single hang was preceded by the same event, a write of 0x00 to register 0x0005 on the camera.  According to the Protocol Documentation, writing 0x00 to register 0x0005 disables the RGB camera stream.  It seems there's a problem at the point of disabling the video feed after taking a snapshot.

It's also notable that each thread is in the same function every time, and apart from the watchdog thread, they are all lock-related functions.  Hypothesis #4 grows ever stronger.  The next step is to identify which locks are held by which threads (Java makes this much easier than C, but this isn't Java).  We will pay particular attention to the main and video threads, which were the last two threads to print any data.

We'll need to compare the stack trace from each thread to look for common ancestors, then read the source code near each entry to identify mutexes and semaphores waited for or held by each thread.  Here are the traces, manually colored and annotated with lock information:

Thread 1 — main_thread

Holds something in libusb.
Waiting for the video_empty semaphore.

sem_wait ()
#1 video_callback (dev=, videobuf=0x644880, timestamp=2784993663) [waits for the video_empty semaphore, locks the video_in_use mutex, posts the video_full semaphore]
#2 iso_callback (xfer=0x641ee8)
#3 ?? ()
#4 ?? ()
#5 ?? ()
#6 libusb_handle_events_timeout ()
#7 freenect_process_events_timeout (ctx=0x63b750, timeout=)
#8 freenect_process_events (ctx=)
#9 vidproc_doevents (info=0x63b580)
#10 main (argc=, argv=)

Thread 2 — watchdog_thread

Happy as a clam.

watchdog_callback (data=, interval=0x7ffff6933e80)
#1 watchdog_thread (d=0x63b3e0)
#2 start_thread (arg=0x7ffff6934700)
#3 clone ()
#4 ?? ()

Thread 3 — depth_thread

Holds the depth_empty semaphore.
Waiting for the depth_full semaphore.

sem_wait ()
#1 depth_thread (d=0x63b580) [waits for the depth_full semaphore, locks the depth_in_use mutex, posts the depth_empty semaphore]
#2 start_thread (arg=0x7ffff6006700)
#3 clone ()
#4 ?? ()

Thread 4 — video_thread

Holds the video_in_use mutex.
Holds the video_empty semaphore.
Waiting for something in libusb.

pthread_cond_timedwait@@GLIBC_2.3.2 ()
#1 libusb_wait_for_event ()
#2 libusb_handle_events_timeout ()
#3 libusb_handle_events ()
#4 libusb_control_transfer ()
#5 fnusb_control (dev=, bmRequestType=, bRequest=, wValue=, wIndex=, data=, wLength=12)
#6 send_cmd (dev=0x63ba00, cmd=3, cmdbuf=, cmd_len=4, replybuf=0x7ffff5804e30, reply_len=4)
#7 write_register (dev=0x63ba00, reg=, data=)
#8 freenect_stop_video (dev=0x63ba00)
#9 video_thread (d=0x63b580) [waits for the video_full semaphore, locks the video_in_use mutex, posts the video_empty semaphore]
#10 start_thread (arg=0x7ffff5805700)
#11 clone ()
#12 ?? ()

Thread 5 — kndsrv_thread

Waiting for the video_in_use mutex.

__lll_lock_wait ()
#1 _L_lock_883 ()
#2 __pthread_mutex_lock (mutex=0x63b670)
#3 request_video (info=0x63b580) [locks the video_in_use mutex]
#4 getbright_func (client=0x7ffff7e07010, command=, argc=, args=)
#5 parse_line (line=0x7ffff7e1f4ed "getbright", client=0x7ffff7e07010)
#6 parse_buffer (count=131072, client=0x7ffff7e07010)
#7 knd_read (arg=0x7ffff7e07010, buf_event=)
#8 knd_read (buf_event=, arg=0x7ffff7e07010)
#9 bufferevent_readcb (fd=19, event=2, arg=0x643ab0)
#10 event_process_active (base=0x63a8f0)
#11 event_base_loop (base=0x63a8f0, flags=0)
#12 event_base_dispatch (event_base=0x63a8f0)
#13 kndsrv_thread (d=0x61a1a0)
#14 start_thread (arg=0x7ffff5004700)
#15 clone ()
#16 ?? ()

Even though the traces always have the same functions at the top of the stack, the exact call stack of the final trace occasionally differs.  Most of the time the kndsrv thread was processing a video image for the getvideo command, but sometimes it was handling a getbright command instead.  Both commands need to access video imagery.

Also interesting is the fact that two threads have libusb_handle_events_timeout() in their call stack; one under freenect_process_events() where you might expect it, and another under freenect_stop_video().  You may recall that every hang happens right after the camera is told to stop the video stream.

Anyway, as you can see, we have a classic deadlock: Thread 4 (video) holds video_empty, needs something in libusb.  Thread 1 (main) holds something in libusb, needs video_empty.  The other threads are operating as expected, waiting for work that will never come.  Hypothesis #4 is confirmed.
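For reference, the double-buffered handoff implied by the annotations above looks roughly like this when it works correctly.  The semaphore and mutex names come from the traces; the function bodies and buffer size are assumptions:

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <string.h>

/* Sketch of the double-buffered frame handoff implied by the annotated
 * traces.  The semaphore and mutex names come from the traces; the
 * function bodies and buffer size are assumptions. */

static sem_t video_empty, video_full;
static pthread_mutex_t video_in_use = PTHREAD_MUTEX_INITIALIZER;
static unsigned char frame_buf[640 * 480];

static void handoff_init(void)
{
    sem_init(&video_empty, 0, 1); /* the shared buffer starts out empty */
    sem_init(&video_full, 0, 0);
}

/* Main thread, from the libusb video callback: fill the buffer. */
static void producer_deliver(const unsigned char *frame, size_t len)
{
    sem_wait(&video_empty);
    pthread_mutex_lock(&video_in_use);
    memcpy(frame_buf, frame, len);
    pthread_mutex_unlock(&video_in_use);
    sem_post(&video_full);
}

/* Video thread: drain the buffer, then hand it back. */
static void consumer_take(unsigned char *out, size_t len)
{
    sem_wait(&video_full);
    pthread_mutex_lock(&video_in_use);
    memcpy(out, frame_buf, len);
    pthread_mutex_unlock(&video_in_use);
    sem_post(&video_empty);
}
```

The hang happens when the video thread calls freenect_stop_video() before posting video_empty: the control transfer can only complete if the main thread's libusb event loop keeps running, but the main thread is stuck in video_callback() waiting for video_empty.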

the solution

When working with third-party libraries in threaded code, it's easy to run into concurrency issues where they aren't expected.  It is clear that libfreenect is not safe to call concurrently from multiple threads, nor recursively from within its own callbacks.  One solution would be to wrap all access to libfreenect in a mutex, ensuring that only one thread at a time tries to access libusb.  But this is rather like fighting fire with fire: you have to plan very carefully to ensure your fire breaks are in the right place, and a mutex held across a libfreenect callback could reintroduce exactly the kind of lock cycle diagnosed above.  Better to avoid concurrent and recursive access altogether, in this case by moving all calls to freenect_*() functions into a single thread, outside of any libfreenect callbacks.
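A minimal sketch of this single-thread approach follows.  The flag name, the stub standing in for the real freenect_stop_video() call, and the service function are all illustrative, not the actual knd code:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of funneling all libfreenect access into one thread.  Worker
 * threads only set a request flag; the event thread makes the actual
 * library call.  All names here are illustrative, not from knd. */

static atomic_bool stop_video_requested;
static bool video_running = true;

/* Stand-in for the real freenect_stop_video() call. */
static void do_stop_video(void)
{
    video_running = false;
}

/* Callable from any thread: just sets a flag, never touches libusb. */
static void request_stop_video(void)
{
    atomic_store(&stop_video_requested, true);
}

/* Called only from the main event thread, between iterations of
 * freenect_process_events() and outside any libfreenect callback. */
static void service_requests(void)
{
    if (atomic_exchange(&stop_video_requested, false))
        do_stop_video();
}
```

Because service_requests() runs in the same loop as freenect_process_events(), only one thread ever enters libfreenect, and never from inside one of its callbacks, so the lock cycle can't form.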

After building and uploading a firmware image with these changes and reconnecting the Kinect to a Depth Camera Controller, video support is fully functional and the hangs are nowhere to be found.  Mission accomplished.