Debugging live apps – SIGQUIT for the win

Example SIGQUIT stack trace from mb-util rubygem

Hey there, everyone! This is just a quick post to share a debugging technique I’ve found useful and that I recently added to mb-util. Today I’m talking about using SIGQUIT to trigger some kind of debugging action in an application, such as opening a REPL or logging a stack trace.

The reason SIGQUIT is useful is that most terminals will send that signal if you press Ctrl-\ (control and backslash), and many server environments also have a mechanism for sending UNIX signals.

I’m not the first person to do this of course. My inspiration comes from the JVM, where SIGQUIT will print stack traces and JVM statistics, but I believe the idea goes back further. Here I’ll show how to apply this concept to a Ruby application, but you can do the same thing in any language.

First example: printing stack traces in console and server-side apps

The MB::U.sigquit_backtrace function I recently added to mb-util iterates over all Ruby threads and prints a colorized stack trace for each thread.

Here’s the code; it’s really simple:

trap :QUIT do
  thread_count = Thread.list.count
  Thread.list.each.with_index do |t, idx|
    MB::U.headline "Thread #{idx + 1}/#{thread_count}: #{MB::U.highlight(t)}#{t == Thread.current ? ' (current thread)' : ''}"
    puts MB::U.color_trace(t.backtrace)
  end
end
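
If pulling in mb-util isn't an option, the same idea can be approximated with core Ruby alone. The sketch below is not the mb-util implementation: it has no colors, `thread_dump` is a made-up helper name, and writing to `$stderr` is just an assumption about where the output should go.

```ruby
# Plain-Ruby sketch of a SIGQUIT thread dump (no mb-util required).
# thread_dump is a hypothetical helper name, not part of mb-util.
def thread_dump(threads = Thread.list)
  threads.each_with_index.map do |t, idx|
    marker = t == Thread.current ? ' (current thread)' : ''
    header = "=== Thread #{idx + 1}/#{threads.count}: #{t.inspect}#{marker} ==="
    # Thread#backtrace returns nil for a dead thread, so fall back to an empty list.
    frames = (t.backtrace || []).map { |line| "    #{line}" }
    [header, *frames].join("\n")
  end.join("\n\n")
end

# Print the dump whenever the process receives SIGQUIT (e.g. Ctrl-\).
trap(:QUIT) { $stderr.puts thread_dump }
```

Building the text in a separate method keeps the handler itself tiny and makes the formatting easy to check without sending a signal.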

Second example: opening a REPL in a console app

I’ve also used SIGQUIT for interactively debugging console apps with a REPL (Read Eval Print Loop, a fancy term for an interactive command line that accepts the same syntax as the programming language itself).

Note that, at least in Ruby, this only works if your app has an interruptible loop in one of its threads or you create a new thread, as you cannot run Pry from within a signal handler. It’s also best to run binding.pry in a context that has access to variables you want to inspect.

Here’s some example code:

#!/usr/bin/env ruby
# pry_sigquit.rb

require 'pry-byebug'

$pry_next = false

trap :QUIT do
  puts 'Got SIGQUIT, waking up main thread'
  $pry_next = true
  Thread.main.run
end

count = 0
loop do
  puts "Processing step #{count}"
  count += 1
  sleep 2
  if $pry_next
    puts 'Pry REPL:'
    binding.pry 
    $pry_next = false
  end
end

You can run that example script, then press Ctrl-\ to view and manipulate the program using the Pry REPL.

gem install pry-byebug
ruby pry_sigquit.rb

In this session I use Ctrl-\ to start Pry, change a local variable, then resume execution. We can see that the updated value of the variable is used in the next loop iteration.

Processing step 0
Processing step 1
Processing step 2
^\Got SIGQUIT, waking up main thread
Pry REPL:

From: /tmp/pry_sigquit.rb:22 :

    17:   count += 1
    18:   sleep 2
    19:   if $pry_next
    20:     puts 'Pry REPL:'
    21:     binding.pry 
 => 22:     $pry_next = false
    23:   end
    24: end

[1] pry(main)> count = -42
=> -42
[2] pry(main)> continue
Processing step -42
Processing step -41
Processing step -40

Note that if you try to run Pry from the trap signal handler, this happens instead:

ERROR: Unable to obtain mutex lock.
This can happen if binding.pry is called from a signal handler
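
That message presumably comes from a general restriction in MRI (CRuby): locking a Mutex inside a trap handler raises ThreadError. The sketch below reproduces the restriction without Pry and shows that a thread started from the handler is not subject to it. It assumes MRI on a UNIX-like system, and `mutex_in_trap_demo` is a hypothetical name used only for illustration.

```ruby
# Try to lock a Mutex directly in a QUIT handler, and again from a thread
# started by that handler. Returns a hash describing what happened in each case.
def mutex_in_trap_demo
  lock = Mutex.new
  results = {}
  previous = trap(:QUIT) do
    begin
      lock.synchronize { results[:in_handler] = :locked }
    rescue ThreadError => e
      results[:in_handler] = e
    end
    # A thread started by the handler runs outside trap context.
    Thread.new { lock.synchronize { results[:in_thread] = :locked } }
  end
  Process.kill(:QUIT, Process.pid)
  # Handlers run asynchronously, so poll briefly until both results arrive.
  100.times { break if results.size == 2; sleep 0.01 }
  results
ensure
  # Put back whatever handler was installed before the demo.
  trap(:QUIT, previous || 'DEFAULT')
end

results = mutex_in_trap_demo
puts "In handler: #{results[:in_handler].inspect}"
puts "In new thread: #{results[:in_thread].inspect}"
```

This is why the example script above only sets a flag in the handler and leaves the call to binding.pry to the main loop.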

Why? Some real-world scenarios

Ok, now that you have this new tool in your toolbelt, when would you use it?

Scenario 1: stalling web app

Let’s say you’re running a web app and you have some requests that seem to take forever. But you don’t know where in the code those requests are taking their time, and for some reason your APM (Application Performance Monitor) isn’t catching the culprit. You are running an APM and keeping app logs, right?

In this case, if your code already has a SIGQUIT handler, you can use your server infrastructure to send SIGQUIT to the application process. In Kubernetes, for example, you might use kubectl exec to open a command line in the running pod or, if you’ve wisely removed all shells from your containers, use kubectl debug to attach a different container image to the running pod. Then you run kill -QUIT [app pid], e.g. kill -QUIT 1 if your app is running as PID 1 in the Kubernetes pod.

From here, look at your application logs and find the stack trace printed by your SIGQUIT handler. Most likely the stack trace will show you exactly where your code has stalled.
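
To rehearse sending the signal from a separate process before relying on it in production, something like the following sketch can be run locally. It assumes a UNIX-like system where fork is available; the forked child stands in for the stalled app, the pipe stands in for the application log, and `signal_child_and_read` is a hypothetical helper name.

```ruby
# Fork a child that installs a QUIT handler, signal it from the parent,
# and return whatever the child's handler wrote.
def signal_child_and_read
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    trap(:QUIT) do
      writer.puts "child #{Process.pid} got SIGQUIT"
      writer.puts Thread.main.backtrace.first(3)
      exit!(0) # leave immediately so the parent sees end-of-file
    end
    writer.puts 'ready' # tell the parent the handler is installed
    sleep               # stand-in for a stalled request
  end
  writer.close
  reader.gets              # wait for the child's 'ready' line
  Process.kill(:QUIT, pid) # same effect as running `kill -QUIT <pid>`
  output = reader.read
  Process.wait(pid)
  output
end

puts signal_child_and_read
```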

Scenario 2: debugging a console app

Animation script for a YouTube video with a Pry REPL

I use Ruby code to generate most of the animations for my YouTube channel, and sometimes the animation breaks, or I want to tweak the running animation just a little bit. I’ve regularly used a SIGQUIT handler to open a Pry REPL in these cases. With Pry open I can inspect the animation state, change variable values, etc.

Scenario 3: is RSpec stuck in an infinite loop or just a really slow test?

Example RSpec run with a SIGQUIT backtrace

The most recent case, and the one that prompted this post, was some slow RSpec tests in mb-math, combined with code that I knew could potentially loop forever. You probably already know that RSpec's default formatter just prints a . to the screen for each test it runs, and when random ordering is enabled the tests run in a different order each time, so it’s not obvious which test is taking a long time to run.

I added a SIGQUIT stack trace printer to my spec/spec_helper.rb using MB::U.sigquit_backtrace, then used Ctrl-\ to see what code RSpec was running. The stack traces allowed me to narrow down the root causes and correct them.
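
For a project that doesn't use mb-util, a rough equivalent for spec/spec_helper.rb might look like the sketch below. It is an assumption-laden stand-in, not what MB::U.sigquit_backtrace does: it relies on `RSpec.current_example` from rspec-core to name the running test, prints only the main thread's backtrace, and `describe_running_example` is a hypothetical helper name.

```ruby
# spec/spec_helper.rb (sketch)

# Describe the example RSpec is currently running, if any.
# describe_running_example is a hypothetical helper, not an RSpec API.
def describe_running_example
  example = RSpec.current_example if defined?(RSpec) && RSpec.respond_to?(:current_example)
  return '(no example running)' unless example

  "#{example.full_description} (#{example.location})"
end

# On SIGQUIT (Ctrl-\), show which example is running and where the main thread is.
trap(:QUIT) do
  $stderr.puts "\nCurrently running: #{describe_running_example}"
  $stderr.puts Thread.main.backtrace
end
```

Printing the example's description alongside the backtrace helps tell a single slow test apart from one that never finishes.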

Alternatives

In some cases you won’t have the ability to send a UNIX signal, your app isn’t running in a console, or you are using a framework that already interprets SIGQUIT differently. No worries — the same concept can be applied in different ways. The root pattern or principle is allowing an authorized engineer to trigger a debugging action in the application. You could use a web endpoint that requires a secure password, a different UNIX signal, etc. Always keep security in mind, though, if you implement something that can be accessed remotely.

Summary

I’ve shown two different things you can do with a SIGQUIT signal handler, namely printing a stack trace and opening a REPL, but undoubtedly there are more options. I believe this is a useful tool that every developer should have in their arsenal.

For more reading: