A tale of a nasty bug, which turned out to be a bug in the language itself.

The initial bug

When a bug with the title ActionView::Template::Error: undefined method 'split' for nil:NilClass pops into Rollbar, you don’t think this is a hard problem right? Especially when the feature just launched. Experience tells you that this is programmer error. This was my feeling as well. I was wrong.

Digging deeper

This didn’t happen very often, 2 times on production, sometimes on CI, couldn’t replicate it locally.

As I’ve started digging into this and found that the bug is triggered in a gem called active_data. Because I had not done any work in it, and because it’s always nice to find a friend in misery, I pinged a friend in the company more versed in it. He confirmed that this is weird to him as well and suggested dumping some debug data along with the error. So I did.

- rescue NoMethodError => e
  - Rollbar.error(e, attrs: action.class._attributes, aliases: action.class._attribute_aliases)
  - raise e

So at this point, the first weird thing happened: pretty quickly at that. We got this error: RuntimeError: can't add a new key into hash during iteration.

It seems that something weird was happening with the debugging here. so I’ve changed the code to dump as safely as possible with good old #inspect.

And now we wait

Unfortunately, the next time this bug happened (despite me trying to make it happen each day) was 11 days later.

When it did, something troubling came up in the debug data as action_class_attrs:

"\x98\xD1\xCAG;\x7F\x00\x00I\x0E\xFEG;\x7F\x00\x00-\x00\"\"\x00\x00\x00\x00 \xD1\xCA"=>
  #<ActiveData::Model::Attributes::Reflections::Attribute:0x00007f3b491398f0 @
    name="\x98\xD1\xCAG;\x7F\x00\x00I\x0E\xFEG;\x7F\x00\x00-\x00\"\"\x00\x00\x00\x00 \xD1\xCA", 

Do you see that? What the hell is "\x98\xD1\xCAG;\x7F\x00\x00I\x0E\xFEG;\x7F\x00\x00-\x00\"\"\x00\x00\x00\x00 \xD1\xCA"? Initially I thought this was UTF code points, but it’s pretty much just gibberish.

The last part of the puzzle to reveal is the definition of the model class:

class Foo
  include ActiveData::Model

  attribute 'something_happened_reason', String

OK. Now you’ve got all the info I had. Figured it out yet? Go ahead, I’ll wait.

Might it be, a bug in Ruby?

I would like to tell you that I connected all the dots, but alas I did not. It was my colleague. The length of the attribute triggered a memory of an article about Ruby handling different length strings differently. This led him to try to find any current string bugs in Ruby that might be related to our situation, this led him to this little gem:

a = ('a' * 24).b.strip

->{ eval "", binding, a, 1 }.call

puts "Before garbage_collection: a=#{a.inspect}"
4.times { GC.start }
puts "After garbage_collection: a=#{a.inspect}"

Which outputs:

Before garbage_collection: a="aaaaaaaaaaaaaaaaaaaaaaaa"
After garbage_collection: a="\xA0\xA1=\x02\x00\x00\x00\x00aaaaaaaaaaaaaaaa"

Have you recovered from that? Cool. Take another deep breath, because this is a pretty edgy edge case. Messing with the above code in any way is prone to break the reproduction of the bug. Even more interesting is the fact that 24 is the magic limit on Ruby strings.

So basically, a bug in Ruby caused some strings above 23 chars to get corrupted after being massaged in specific ways and then cleaned up with the GC. Those strings might be hash keys. And something like this happens.

What this means

I’ve started programming 20 years ago. When I started it, I encountered these kinds of situations daily. I was sure that I was doing everything right and it was the computer’s fault when something didn’t work. Of course, this was always my fault, I just didn’t know it. As I got better, I got used to the idea that the problem is with me, and I stopped attributing these errors to outside forces. It saved time.

But keep in mind, that this can happen. And when it does, be ready for it. And have smart people nearby.