Ruby - Don't Freeze User-Defined Strings!
This week I was working on a Rails project that involved keeping a large hash of dynamic data in memory. On first thought, I was going to build the hash keys based on several data points which will be combined into a hash key. It would have looked something like this:
element_name = 'NETINC'
period = '20160930'
span = 'span_trailing_twelve_months'
"#{element_name}_#{period}_#{span}"
#=> "NETINC_20160930_span_trailing_twelve_months"
# ---------------------------------
# Then I would be able to reference
# the hash like this:
hash['NETINC_20160930_span_trailing_twelve_months']
However, if once I build the hash keys I freeze them using String.freeze
, do they get garbage collected when I'm no
longer processing that key? So let's say I'm making thousands of these hash keys, process the hash and then move on
to a new data set. Would Ruby keep the defunct hash key strings in memory for the remainder of the program?
The answer is YES! Freezing the keys could possibly cause a memory leak because the program will be maintaining each string object in an in-memory table even after I'm done using it. The memory pile of string objects could grow indefinitely. This is why freezing strings of user-defined input could be very dangerous - a malicious user could crash your server by causing it to run out of memory this way.
I verified this in a discussion with the Chicago Ruby Slack group:
"foo".freeze.object_id == "foo".freeze.object_id" # => true
, but without the freeze it'sfalse
. And ifdef a; "foo".freeze.object_id; end; a; a
gives the same id twice.Oh yeah, good point... it does allow some memory optimizations if you freeze it, like two strings being the same string in memory if they're frozen and identical
Wrapped up work and dug a minute: in
string.c
, the retaining is happening instr_new_frozen
. The vm callsstr_replace_shared_without_enc
to eval the string literal, that callsrb_fstring
, that callsstr_replace_shared_without_enc
, that callsstr_new_frozen
.They're marked as SHARED. Then
gc.c
,gc_mark_children
doesgc_mark(objspace, any->as.string.as.heap.aux.shared);
on all the shared strings so unless there's something I'm not spotting that sneaks around to unset SHARED those strings are retained for the life of the process.
The approach I took to get around this problem was to build my hash to be dimensions deep. So to access the value, I would instead use a chain of keys:
hash['NETINC']['20160930']['span_trailing_twelve_months']
All it took was a small modification of the fetch methods wrapping my Hash to implement this. element_name
and span
are fine to freeze because there is a finite number of them in my program.