From missing parentheses to memory corruption: how one typo crashed Ruby and uncovered two core bugs

Ruby bug #21220

  • Ruby's range syntax has lower precedence than method invocation, so 1..16.size returns the Range 1..8 (probably 1..4 on 32-bit systems), while (1..16).size returns the Integer 16. Because of this, we need to wrap a range in parentheses or store it in a variable when we want to call a method on the range.
  • I missed the parentheses around a range during a late night coding session.
  • Ruby parsed this as a flip-flop operator but my program appeared to work normally.
  • My test suite uses SimpleCov to gather code coverage metrics.
  • Ruby 3.4 switched to the PRISM parser.
  • The PRISM compiler was generating a line number of zero for flip-flop operators.
  • The Ruby code coverage system stores line counts in an Array.
  • A line number of zero would reference memory before the start of the Array.
  • Memory allocators use tags before and/or after allocated blocks to record information about the allocation.
  • Writing the line count for line zero corrupted the heap by overwriting the allocator's tags.
  • The program would crash some time later when the garbage collector deallocated memory and libc found the heap corruption.
  • mb-sound issue 36
  • Ruby issue 21220
  • Ruby issue 21259
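
The precedence rule from the first summary point is easy to check in irb. This is a minimal sketch with made-up variable names; the value of 16.size depends on the platform (it is the byte width of the Integer's machine representation, 8 on typical 64-bit Linux builds).

```ruby
# Method calls bind tighter than the range operator, so `.size` applies to 16 alone.
unparenthesized = (1..16.size)   # parsed the same way as a bare 1..16.size
parenthesized   = (1..16).size   # .size applies to the whole Range

puts unparenthesized.inspect     # a Range that ends at 16.size
puts parenthesized               # 16
```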

Back in March of 2025 I started what should have been a straightforward Ruby version upgrade. Little did I know I was about to fall down a multi-day rabbit hole that would ultimately reveal a memory corruption bug (now fixed) deep within core Ruby systems.

Often when you make a typo or miss an operator in code the program simply doesn’t work. But there are those rare cases where you get a subtle, insidious change that appears to work correctly. This is the latter case, with missing parentheses leading to a crash due to memory corruption.

Here’s a bit of foreshadowing (did you know that Ruby has an operator called the flip-flop operator?):

raise 'Invalid channel' if !1..16.cover?(channel)
# vs.
raise 'Invalid channel' if !(1..16).cover?(channel)

How will these two lines behave?

Keep reading to see how code coverage, automated testing, uncommon operators, and out-of-bounds array writes all converge in a very unexpected way.

The setting

There’s this library I wrote called mb-sound. I use it to generate sound and visualize audio processing systems for my videos on YouTube. One of the tools in that library is called midi_roll.rb, and it generates a visualization of the notes in a MIDI music file in the terminal.

Output from midi_roll

MIDI streams are divided into 16 channels, each of which can play a different instrument or a different “hand” on the same instrument. midi_roll.rb has a --channel option to select a specific MIDI channel. I wrote a check to ensure the channel number is within range:

raise 'MIDI channel must be an integer from 1 to 16, or -1' if channel && !1..16.cover?(channel)

But that line should have looked like this:

raise 'MIDI channel must be an integer from 1 to 16, or -1' if channel && !(1..16).cover?(channel)

I did not write tests for what happens if you give an invalid value to --channel, and as you’ll see, it’s pretty lucky that I didn’t.

Tests

Automated testing is a huge time saver. My approach to testing mb-sound is a pragmatic, gray-box style at a mix of abstraction layers. What matters to me is code coverage, functionality coverage, and application usability, so I’ll write tests from multiple angles that don’t always neatly fall into “unit” or “integration” categories.

Since many of the features of mb-sound are used extensively by the tools in bin/, and these scripts are an important part of interacting with mb-sound, I usually write tests for these scripts as well.

Here’s a simple example of a bin/ test from mb-sound:

RSpec.describe('bin/midi_info.rb') do
  it 'can show MIDI info' do
    text = `bin/midi_info.rb spec/test_data/midi.mid 2>&1`
    expect($?).to be_success

    expect(text).to include('Events')
    expect(text).to include('Unnamed')
  end
end

A simple test for a standalone executable script.

This test makes three assertions:

  • The bin/midi_info.rb script exits with a successful status code.
  • The script output includes the word “Events” (part of the table header).
  • The script output includes the word “Unnamed” (part of the table contents).

This looks basic but it actually verifies several aspects of the script and of mb-sound:

  • In order for that script to exit successfully, the entire library has to parse (no syntax errors), the script itself must be correct Ruby code, and there must be no errors during the entire process.
  • To be able to print the word Events, the data for the table must be passed correctly into the function from mb-util that draws tables. So we’re also testing the script correctness here.
  • And to show the word Unnamed, the MIDI file must be loaded correctly because that’s the name of a track in the MIDI file. Thus we’re testing the MIDI subsystem of mb-sound.

So it’s worth having these tests to cover large areas of code without writing tons of test cases.

Code coverage

I use SimpleCov code coverage metrics when developing a large new feature to decide what tests to write. I usually don’t target 100% coverage of everything, especially for a hobby project. Instead I review the line-by-line coverage of the files I’m working on just to make sure I’m hitting all of the happy paths (a “happy path” is the normal flow of code when no special options are passed and no errors occur), and any of the other paths that seem really likely or important.
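
SimpleCov builds on Ruby's built-in Coverage module, which records an execution count per source line. The sketch below shows the raw data for a tiny throwaway script; the file name and its contents are made up for illustration. Each array index is the line number minus one, executed lines hold a count, and lines with nothing to execute hold nil.

```ruby
require 'coverage'
require 'tmpdir'

# Write a tiny script to a temporary file, run it with coverage enabled,
# and return the per-line execution counts recorded for it.
line_counts = Dir.mktmpdir do |dir|
  path = File.join(dir, 'coverage_sample.rb') # hypothetical file name
  File.write(path, <<~RUBY)
    x = 1
    if x > 1
      puts 'never runs'
    end
  RUBY

  Coverage.start # no arguments: line coverage only
  load path      # only files loaded after Coverage.start are measured
  _file, counts = Coverage.result.find { |file, _| File.basename(file) == 'coverage_sample.rb' }
  counts
end

p line_counts # index 0 is line 1, index 2 is the puts that never runs
```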

As I was developing more and more scripts that use mb-sound, I really wanted to be able to include those scripts in my code coverage metrics. Why? Look at the midi_info.rb example test above – it touches several different parts of the MIDI subsystem, so we should count that as test coverage for the MIDI subsystem.

Unfortunately the bin/ tests run as standalone processes, while SimpleCov is only loaded in the main RSpec process. So I needed a way to load SimpleCov in the script processes as well. What I landed on was using the RUBYOPT environment variable to inject SimpleCov into the scripts I test. I have this line in my spec_helper.rb file:

ENV['RUBYOPT'] = "-r#{File.join(__dir__, 'simplecov_helper.rb')}"

And in simplecov_helper.rb, which that RUBYOPT line injects into the subprocess, we start SimpleCov and then load the script file:

# File to be included via simplecov_runner.rb when testing standalone scripts from bin/
require 'securerandom'
require 'simplecov'

SimpleCov.start do
  command_name "#{$0} #{$$} #{SecureRandom.uuid}"
  formatter SimpleCov::Formatter::SimpleFormatter
  minimum_coverage 0
  enable_coverage :branch
end

# This require line makes sure the original script file is processed by simplecov
require File.expand_path($0, '.')

exit 0

This setup lets me include all of my standalone tools in the code coverage metrics for mb-sound.
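
The RUBYOPT mechanism can be tried in isolation. This is a rough sketch, not the mb-sound setup itself: the helper file here is a hypothetical stand-in that only prints a message, and it assumes the temporary directory path contains no spaces (RUBYOPT is split on whitespace).

```ruby
require 'rbconfig'
require 'tmpdir'

output = Dir.mktmpdir do |dir|
  helper = File.join(dir, 'preload_helper.rb') # hypothetical stand-in for simplecov_helper.rb
  File.write(helper, "puts 'helper loaded first'\n")

  # Any Ruby process started with this environment reads RUBYOPT,
  # so -r<path> requires the helper before the target script runs.
  env = { 'RUBYOPT' => "-r#{helper}" }
  IO.popen(env, [RbConfig.ruby, '-e', "puts 'script body'"], &:read)
end

puts output
```

The child process prints the helper's message before the script's own output, which is the same ordering that lets SimpleCov start before the script under test is loaded.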

The livestream

Okay, now you have the key background info for understanding this bug. You know what mb-sound is and how I test it, and that there was a script in mb-sound called midi_roll.rb with a --channel option that wasn’t covered by automated tests. Let’s get into the actual discovery of the bug.

For a long time I was using Ruby 2.7 by default and testing versions through 3.3 in CI. I planned a livestream in March of 2025 where I would go through all of the changes necessary to add Ruby 3.4 to mb-sound and its dependencies (mb-util, mb-math, etc.). My process was basically switch Ruby versions, run the test suite, and see what breaks.

When I got to the tests for mb-sound under Ruby 3.4, I kept getting test failures for midi_roll.rb, with Aborted (core dumped) in the process output:

munmap_chunk(): invalid pointer
Aborted (core dumped)

Since the tool appeared to work just fine when I ran it standalone, and it passed all tests in other Ruby versions, I disabled SimpleCov in subprocesses for Ruby 3.4 so I could finish the livestream. But I really want coverage metrics from subprocesses and I don’t like letting a root cause go undiscovered, so I logged the crash in mb-sound to revisit later.

The investigation

It’s generally best to assume that bugs are your own fault, and not the fault of the programming language or compiler. This is only the second time in my 20+ year career that I’ve found an actual bug in the language. I approached my investigation into this issue with the assumption that I had done something wrong in my own Ruby or C code.

I’ve pieced these events together from my Git history and my comments on mb-sound issue 36, so if you want you can follow the timeline there while you read here.

Apr. 4, 2025

Initially I had no idea what was causing Ruby 3.4 to crash when testing midi_roll.rb, so I started by making a copy of mb-sound and deleting as much code as I could until the crash stopped happening. This is when I found that having a large coverage stats directory makes the crash more likely.

After trimming down a copy of mb-sound to the point where the munmap_chunk crash stopped happening, I found that having a large coverage/ directory made the bug extremely likely to happen, but removing coverage/ made the bug extremely unlikely to happen.

Apr. 5, 2025

Now that I had a more minimal test case, I next looked for resources to expand my debugging capabilities. I’m familiar enough with tools like pry-byebug, gdb, and Valgrind, but wanted to both refresh my memory and get a broader list of ideas and tools for digging into the issue. Here are the resources I found (these are also listed on the mb-sound issue):

Resources that might be useful for debugging:

libc_malloc_debug

Glibc provides an alternative malloc implementation that does more checks to help find memory allocation and deallocation mistakes, so that was my next test:

export MALLOC_CHECK_=3
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libc_malloc_debug.so
./ruby34_bug_wrapper.rb # The minimized bug reproduction script

malloc_check_get_size: memory corruption
Aborted (core dumped)

This confirmed there was memory corruption going on (“memory corruption” was printed), rather than just an incorrect call to free() or something (“invalid pointer” from the earlier error). Unfortunately it didn’t move me much closer to finding the source of the corruption. One of the tricky things about memory corruption is that the crash often happens much later than the actual trigger, making it particularly difficult to find the root cause.

gdb

Ruby is written in C, so next I loaded Ruby into gdb, the standard C debugger on Linux, to start looking at the C layer. Here’s when I learned that the crash was happening inside Ruby’s garbage collector. I found that I couldn’t generate Ruby backtraces by setting a breakpoint, because generating a Ruby backtrace allocates memory, and you can’t allocate memory while garbage collecting memory.

I wasn’t able to generate Ruby backtraces for each Ruby thread using the example linked above, because the bug occurs within Ruby’s GC.

Instead, I wrote a GDB script to run the program and print the Ruby stack traces after the SIGABRT signal is raised. I included some embedded Python code generated by Google Search’s AI summary to count the number of running threads because I couldn’t find the documentation for GDB’s Python API or any other way to get a thread count in a GDB script. I really prefer writing code myself based on reading docs, but again, I couldn’t find the docs in Search, yet somehow the AI had this info.

I added this script to the automated test suite in the minimized test case repo, and copied the script’s output to the mb-sound issue.

The C stack trace showed in more detail that the crash happened during GC (garbage collection):

...
malloc_printerr
munmap_chunk
...
rb_gc_impl_free

The Ruby stack trace showed the crash happened within SimpleCov’s code that saves updated results, which explains why having a larger coverage directory made the crash more likely.

Valgrind

Like I mentioned above, memory corruption is tricky because the crash or bad behavior usually happens long after the initial corruption. That’s where Valgrind comes in. Valgrind has a bunch of tools for testing memory allocation and multithreaded lock management. Its default tool, called memcheck, can find double-free, use-after-free, out-of-bounds access, memory leaks, etc. Often, it can even tell you which line of code allocated the block of memory, and which line of code wrote outside that block.

Most of my C projects have an option to run the test suite under Valgrind because it’s so useful.

I didn’t realize it yet, but Valgrind pointed directly at the root cause:

==29689== Invalid read of size 8
==29689==    at 0x4B3CE0F: rb_array_const_ptr (rarray.h:305)
==29689==    by 0x4B3CE0F: RARRAY_AREF (array.h:147)
==29689==    by 0x4B3CE0F: update_line_coverage (thread.c:5681)
.
.
.
==29689==  Address 0x22391a98 is 8 bytes before a block of size 1,128 alloc'd
==29689==    at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==29689==    by 0x49C5D7A: rb_gc_impl_malloc (default.c:8195)
==29689==    by 0x49C61C6: ruby_xmalloc2_body (gc.c:4604)
==29689==    by 0x49C61C6: ruby_xmalloc2 (gc.c:4598)
==29689==    by 0x48D57BF: ary_heap_alloc_buffer (array.c:350)
==29689==    by 0x48D57BF: ary_new (array.c:727)

It was almost midnight so my brain was too fried, and I was too unfamiliar with the Ruby codebase, to understand what Valgrind was telling me.

Apr. 6, 2025

The next day I spun my wheels for a while longer looking at garbage collection, but eventually had the good sense to switch to looking at what Valgrind was telling me about update_line_coverage().

I’ve confirmed that Ruby 3.3.5 does not crash, so a good next step might be comparing the source code of the functions mentioned by Valgrind between 3.3.5 and 3.4.2.

The update_line_coverage function hasn’t changed between 3.3.5 and 3.4.2, but when I build with debug info and -O0, run with Valgrind+vgdb, and break thread.c:5673 if line < 0, I see line is -1 which could cause the “invalid read of size 8” Valgrind error:

Thread 1 hit Breakpoint 2, update_line_coverage (data=4, trace_arg=0x1ffeff9740) at thread.c:5673
5673                if (GET_VM()->coverage_mode & COVERAGE_TARGET_ONESHOT_LINES) {
1: line = -1
(gdb) list
5668            if (lines) {
5669                long line = rb_sourceline() - 1;
5670                long count;
5671                VALUE num;
5672                void rb_iseq_clear_event_flags(const rb_iseq_t *iseq, size_t pos, rb_event_flag_t reset);
5673                if (GET_VM()->coverage_mode & COVERAGE_TARGET_ONESHOT_LINES) {
5674                    rb_iseq_clear_event_flags(cfp->iseq, cfp->pc - ISEQ_BODY(cfp->iseq)->iseq_encoded - 1, RUBY_EVENT_COVERAGE_LINE);
5675                    rb_ary_push(lines, LONG2FIX(line + 1));
5676                    return;
5677                }
(gdb) disp line
2: line = -1

I still didn’t make the final connection in my mind until later in the day:

…so maybe something has changed between 3.3.5 and 3.4.2 that causes rb_sourceline() to return 0 where it didn’t before?

But also this might be a red herring and not the cause of the memory corruption that happens later.

Finally I convinced myself that the memory corruption was caused by writing before the start of the line coverage array:

I believe I’ve confirmed that the attempt to access an array at index -1 is causing the later memory corruption. If I run set line = 0 at the above mentioned breakpoint on thread.c:5673 if line < 0, the program continues without aborting:

However, if I just type continue without altering line, the program aborts with the munmap_chunk(): invalid pointer error:

I think this is probably enough to open a bug in upstream Ruby, but I might dig a little further into any changes to rb_sourceline() and its downstream methods.

For more confirmation, update_line_coverage() is writing before the start of the array. Definite memory corruption.

Apr. 7, 2025

This was the day I finally came to my senses and stopped trying to dig further on my own. I filed the bug with the Ruby team and worked on other things for the rest of the day.

As a wild guess I think it would take me a couple of weeks full time to track down the root cause of rb_sourceline() returning zero in Ruby 3.4 but not Ruby 3.3, so I have opened a bug upstream: https://bugs.ruby-lang.org/issues/21220

I got as far as looking at rb_vm_get_sourceline(), calc_lineno, and calc_pos but to go any further I’d have to get a deep understanding of the Ruby VM.

I included a proposed fix that stopped the memory corruption, but noted that I didn’t really think this was the root cause:

Something like this should prevent the memory corruption, but may be hiding a deeper issue:

-            if (line >= RARRAY_LEN(lines)) { /* no longer tracked */
+            if (line < 0 || line >= RARRAY_LEN(lines)) { /* no longer tracked */
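
To make the failure mode concrete, here is a toy Ruby model of the out-of-bounds write and of the bounds check above. It is only an illustration under loud assumptions: the flat "heap" array, the single tag cell, and the lambda names are all invented, and real allocators and Ruby's C code are far more involved.

```ruby
# Toy model, not Ruby's real allocator: a flat "heap" in which the cell just
# before a block holds the allocator's bookkeeping tag (here, the block size).
header = 2                 # index of the tag cell
base   = header + 1        # index of the block's first element
heap   = Array.new(8, 0)
heap[header] = 4           # the "coverage array" has room for 4 line counters

# Mirrors the unchecked logic: convert a 1-based source line to a 0-based index.
bump_unchecked = lambda do |sourceline|
  index = sourceline - 1
  heap[base + index] += 1
end

# Mirrors the proposed guard: ignore indices that fall outside the block.
bump_checked = lambda do |sourceline|
  index = sourceline - 1
  next if index < 0 || index >= heap[header]
  heap[base + index] += 1
end

bump_unchecked.call(1)     # line 1 increments the first counter
bump_checked.call(0)       # line 0 is ignored by the guard
tag_before = heap[header]
bump_unchecked.call(0)     # line 0 lands on the tag cell and changes it
tag_after = heap[header]

puts "tag before: #{tag_before}, tag after: #{tag_after}"
```

In the model the damaged tag has no immediate effect, which matches how the real crash only appeared later, when the allocator next inspected its own bookkeeping.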

Apr. 8, 2025

Amazingly, two legendary Ruby core team members, byroot (Jean Boussier) and mame (Yusuke Endoh), reproduced my issue, found the true root cause, and came up with a good fix, all within a single day. You really gotta read the bug comments to appreciate this.

Here are the issues they identified and the fixes:

  • RUBY_EVENT_COVERAGE_LINE in compile.c and prism_compile.c was firing when line numbers were <= 0. Diff:
- if (ISEQ_COVERAGE(iseq) && ISEQ_LINE_COVERAGE(iseq)) {
+ if (line > 0 && ISEQ_COVERAGE(iseq) && ISEQ_LINE_COVERAGE(iseq)) {
  • The update_line_coverage() function in thread.c did not check for negative indices. Diff:
  long line = rb_sourceline() - 1;
+ VM_ASSERT(line >= 0);
  • The PRISM compiler was generating a line number of zero for flip-flop operators, even though the PRISM parser had correct line numbers. This was logged separately as Ruby issue 21259 and fixed a few months later by one of the greatest Ruby legends, tenderlovemaking (Aaron Patterson). Diff:
- const pm_node_location_t location = { .line = ISEQ_BODY(iseq)->location.first_lineno, .node_id = -1 };
+ const pm_node_location_t location = PM_NODE_START_LOCATION(scope_node->parser, node);

- const pm_node_location_t location = { .line = ISEQ_BODY(iseq)->location.first_lineno, .node_id = -1 };
+ const pm_node_location_t location = { .line = lineno, .node_id = -1 };

This is also when I learned about my typo – before this point I had no idea that I had missed those parentheses in midi_roll.rb:

@mbcodeandsound (Mike Bourgeous) Just FYI, I bet you meant to write !(1..16).cover?(channel) in the following line.

https://github.com/mike-bourgeous/reproduce-simplecov-ruby34-bug/blob/d73c3fe80014cb91d8b6c64847581feb8a19d1b6/bin/midi_roll.rb#L42

Thank goodness for us, because it resulted in the discovery of a bug in Ruby :-)

If I had written tests for that --channel check then I would have added the missing parentheses, and never found these bugs in Ruby’s flip-flop operator and coverage tracking code!
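
Here is a sketch of what such an error-path test might have revealed. The method names and the rejects? helper are hypothetical; only the two raise lines come from the story above.

```ruby
# The line as originally written. Ruby reads the condition as a flip-flop:
#   (channel && !1) .. (16.cover?(channel))
# The left side is always falsy, so the flip-flop never switches on, the right
# side is never evaluated, and the raise can never happen.
def check_with_typo(channel)
  raise 'MIDI channel must be an integer from 1 to 16, or -1' if channel && !1..16.cover?(channel)
  :accepted
end

# With parentheses, the Range is built first and cover? is called on it.
def check_with_parens(channel)
  raise 'MIDI channel must be an integer from 1 to 16, or -1' if channel && !(1..16).cover?(channel)
  :accepted
end

# Hypothetical error-path test helper: true if the check raises for this channel.
def rejects?(check, channel)
  send(check, channel)
  false
rescue RuntimeError
  true
end

puts rejects?(:check_with_typo, 99)    # false: the invalid channel slips through
puts rejects?(:check_with_parens, 99)  # true
puts rejects?(:check_with_parens, 5)   # false: a valid channel is accepted
```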

Flip-flop and RuboCop

Okay, as I asked in the intro, did you know that Ruby has an operator called the flip-flop operator? I didn’t before this. It uses the same double-dot or triple-dot syntax as Range creation, but only works within the context of a conditional.

I could see the flip-flop operator being really useful for parsing semi-structured text or digging a range of events out of a server log, and enough people like it that it’s still in Ruby. But to me it seems pretty dangerous since it has the same syntax as Range creation.
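
As a small illustration of the log-parsing use case, this sketch pulls a block of lines out of an invented log. The log contents and the BEGIN/END markers are made up for the example.

```ruby
log = [
  'boot ok',
  'BEGIN request 42',
  'query users',
  'END request 42',
  'idle'
]

# Inside a conditional, `a..b` is a flip-flop: it switches on when `a` is true
# and stays on through the line where `b` becomes true.
selected = []
log.each do |line|
  selected << line if line.start_with?('BEGIN')..line.start_with?('END')
end

puts selected
```

Only the three lines from BEGIN through END are kept. Outside a conditional, the same two-dot syntax would try to build a Range instead.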

RuboCop, a code “linting” tool for Ruby, does have a check for the flip-flop operator. I always use RuboCop on professional Ruby projects. It probably should be on mb-sound too, but if I’d added RuboCop to mb-sound, then I never would have found this Ruby bug, and you wouldn’t be reading this!

Many of RuboCop’s default rules are… suboptimal in my opinion, but I will most likely get around to adding it to all of my open source projects eventually.

Apr. 9, 2025

They merged the fix the next day. This is an excellent turnaround time, and really added to my love of the Ruby language.

I’m sure they’ll never read this, but I have to say I genuinely appreciate the dedication and skill that the Ruby team members bring to the Ruby project. My brief interaction with them on my bug report was very pleasant and productive.

And now I can say I have my name in the commit history!

Ruby commit 0d6263bd


July 2025

While the memory corruption bug warranted a fast response, there was no need to rush the fix for line numbers on flip-flop operators. In the worst case scenario I can imagine, code coverage metrics might be slightly lower, or an error message might list line zero for one of the backtrace entries.

The fix for the PRISM compiler was committed on July 17 and backported to Ruby 3.4 on July 21.

Conclusion

So now you’ve seen how one simple typo in my Ruby code led to memory corruption through a series of small errors. The Ruby team handled the bug report exceptionally well, and all issues have been fixed and backported.

Some key takeaways:

  • These bugs were only found because several small issues aligned, like in the Swiss cheese model.
    • Lack of error-path tests → missing parentheses → obscure flip-flop operator → compiler line number bug → memory corruption bug
  • The open memory model used by C continues to bite us, but I still enjoy writing C, and tools like Valgrind go a long way toward mitigating the risks of pointer management.
  • It’s really valuable to fix both the proximate cause and the root cause of an issue. You fix the proximate cause first just to get your project moving again, but if you don’t take the time to find and fix the root cause, it’s likely to surface again at the worst possible time.
  • I should probably use RuboCop and/or other code quality tools on my open source projects like I do on my commercial projects. Rulesets can be customized to disable checks that get in the way more than they help. I could rant about how unhelpful linting rules affect software teams but maybe I’ll leave that for another post.
  • When I run into a memory-related crash, I’ll probably start with Valgrind next time instead of using it last.
  • It’s never the compiler’s fault. But sometimes, it is. So it’s worthwhile to have experience with the tools to debug your own code as well as your dependencies.

I hope this was an entertaining read! I’ve put almost as much time into writing this as I did into finding the Ruby bug in the first place.

Have a good one everyone, and keep having fun with Ruby!