@SafeAF
Created January 16, 2016 13:50
export RUBY_GC_HEAP_INIT_SLOTS=600000
export RUBY_GC_HEAP_FREE_SLOTS=600000
export RUBY_GC_HEAP_GROWTH_FACTOR=1.25
export RUBY_GC_HEAP_GROWTH_MAX_SLOTS=300000
Or this:
# Boost Ruby
export RUBY_GC_HEAP_INIT_SLOTS=1000000 # 1M
export RUBY_GC_HEAP_FREE_SLOTS=500000 # 0.5M
export RUBY_GC_HEAP_GROWTH_FACTOR=1.1
export RUBY_GC_HEAP_GROWTH_MAX_SLOTS=10000000 # 10M
export RUBY_GC_MALLOC_LIMIT_MAX=1000000000 # 1G
export RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR=1.1
# export RUBY_GC_OLDMALLOC_LIMIT=500000000 # 500M
# export RUBY_GC_OLDMALLOC_LIMIT_MAX=1000000000 # 1G
# export RUBY_GC_OLDMALLOC_LIMIT_GROWTH_FACTOR=1.1
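Not from the original gist: a small sanity-check sketch (the script is mine, assuming Ruby 2.1 and the GC.stat keys shown further down) to confirm the exports were actually picked up by the Ruby process you booted.

# check_gc_env.rb -- hypothetical helper, not part of the gist
%w[
  RUBY_GC_HEAP_INIT_SLOTS
  RUBY_GC_HEAP_FREE_SLOTS
  RUBY_GC_HEAP_GROWTH_FACTOR
  RUBY_GC_HEAP_GROWTH_MAX_SLOTS
  RUBY_GC_MALLOC_LIMIT_MAX
  RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR
].each do |name|
  puts format("%-40s %s", name, ENV.fetch(name, "(not set)"))
end

# The malloc limits this Ruby is actually using right now (Ruby 2.1 key names):
puts GC.stat[:malloc_limit]      # current malloc limit in bytes
puts GC.stat[:oldmalloc_limit]   # current oldmalloc limit in bytes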
Performance MRI Ruby Patch
$ rvm install 2.1.2 --patch railsexpress -n railsexpress
$ rvm --default use 2.1.2-railsexpress
Run this on Ruby 2.1.1 and you will be out of memory soon:
while true
  "a" * (1024 ** 2) # allocate a 1MB string every iteration and immediately discard it
end
Malloc limits, Ruby and you
Ruby has tracked memory allocation since its very early versions. This is why I found FUD comments such as this troubling:
the issue is that the Ruby GC is triggered on total number of objects, and not total amount of used memory
This is a clear misunderstanding of Ruby. In fact, the aforementioned article never mentions that memory allocation may trigger a GC.
Historically Ruby was quite conservative issuing GCs based on the amount of memory allocated. Ruby keeps track of all memory allocated (using malloc) outside of the Ruby heaps between GCs. In Ruby 2.0, out of the box, every 8MB of allocations will result in a full GC. This number is way too small for almost any Rails app, which is why increasing RUBY_GC_MALLOC_LIMIT is one of the most cargo-culted settings out there in the wild.
Matz picked this tiny number years ago, when it was a reasonable default; however, it was not revised until Ruby 2.1 landed.
For Ruby 2.1 Koichi decided to revamp this sub-system. The goal was to have defaults that work well for both scripts and web apps.
Instead of having a single malloc limit for our app, we now have a starting-point malloc limit that dynamically grows every time we trigger a GC by exceeding the limit. To stop unbounded growth of the limit, max values are set.
We track memory allocations from 2 points in time:
memory allocated outside Ruby heaps since last minor GC
memory allocated since last major GC.
At any point in time we can get a snapshot of the current situation with GC.stat:
> GC.stat
=> {:count=>25,
:heap_used=>263,
:heap_length=>406,
:heap_increment=>143,
:heap_live_slot=>106806,
:heap_free_slot=>398,
:heap_final_slot=>0,
:heap_swept_slot=>25258,
:heap_eden_page_length=>263,
:heap_tomb_page_length=>0,
:total_allocated_object=>620998,
:total_freed_object=>514192,
:malloc_increase=>1572992,
:malloc_limit=>16777216,
:minor_gc_count=>21,
:major_gc_count=>4,
:remembered_shady_object=>1233,
:remembered_shady_object_limit=>1694,
:old_object=>65229,
:old_object_limit=>93260,
:oldmalloc_increase=>2298872,
:oldmalloc_limit=>16777216}
malloc_increase denotes the amount of memory allocated since the last minor GC; oldmalloc_increase, the amount since the last major GC.
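As a small illustration (mine, not from the original post), you can watch both counters climb and then reset as GCs fire, assuming Ruby 2.1's GC.stat keys:

# Watch malloc_increase / oldmalloc_increase climb and reset as GCs fire.
5.times do
  10.times { "a" * (1024 ** 2) }   # roughly 10MB of malloc'd (off-heap) string data
  s = GC.stat
  puts "minor GCs: #{s[:minor_gc_count]}  major GCs: #{s[:major_gc_count]}  " \
       "malloc_increase: #{s[:malloc_increase]}  oldmalloc_increase: #{s[:oldmalloc_increase]}"
end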
We can tune our settings, from "Ruby 2.1 Out-of-Band GC":
RUBY_GC_MALLOC_LIMIT: (default: 16MB)
RUBY_GC_MALLOC_LIMIT_MAX: (default: 32MB)
RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR: (default: 1.4x)
and
RUBY_GC_OLDMALLOC_LIMIT: (default: 16MB)
RUBY_GC_OLDMALLOC_LIMIT_MAX: (default: 128MB)
RUBY_GC_OLDMALLOC_LIMIT_GROWTH_FACTOR: (default: 1.2x)
So, in theory, unbounded memory growth is not possible for the script above. The two MAX values should simply cap the growth and force GCs.
However, this is not the case in Ruby 2.1.1.
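To make the intended behaviour concrete, here is a toy model (my sketch, not the actual gc.c logic) of how the limit is supposed to grow by the growth factor and get capped at the MAX value:

# Toy model of the intended limit growth: each time allocations since the last GC
# exceed the limit, a GC fires and the limit grows by GROWTH_FACTOR, never past MAX.
limit  = 16 * 1024 * 1024   # RUBY_GC_MALLOC_LIMIT default (16MB)
max    = 32 * 1024 * 1024   # RUBY_GC_MALLOC_LIMIT_MAX default (32MB)
factor = 1.4                # RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR default

6.times do |i|
  puts "after GC #{i}: limit = #{limit} bytes"
  limit = [(limit * factor).to_i, max].min
end
# The limit converges on 32MB, so a loop that keeps allocating big strings should
# keep hitting the limit and triggering GCs rather than growing without bound.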
Investigating the issue
We spent a lot of time ensuring we had extensive instrumentation built into Ruby 2.1: we added memory profiling hooks, we added GC hooks, and we exposed a large amount of internal information. This has certainly paid off.
Analyzing the issue raised by this mini script is trivial using the gc_tracer gem. This gem allows us to get a very detailed snapshot of the system every time a GC is triggered and store it in a text file, easily consumable by a spreadsheet.
We simply add this to the rogue script:
require 'gc_tracer'
GC::Tracer.start_logging("log.txt")
And we get a very detailed trace back in the text file, one row per GC event.
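gc_tracer also supports a block form if you want to scope the logging to just the rogue loop; this is a sketch from memory of the gem's README, so double-check the exact API:

require 'gc_tracer'

# Block form: logging starts here and stops automatically when the block returns.
GC::Tracer.start_logging("log.txt") do
  1_000.times { "a" * (1024 ** 2) }
end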
Installs tested
rvm install 1.9.3-p385
rvm install 1.9.3-p385 --patch https://github.com/funny-falcon/ruby/compare/p385...p385_falcon.diff
rvm install 2.0.0
Note: this is an in-progress post; feel free to comment, but it will grow quite a bit as I flesh it out.
Environments tested
default
RUBY_GC_MALLOC_LIMIT=90000000 (GC)
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.0.1.0 (GC+TC)
RAILS_ENV=production bundle exec thin start -p 3111
RAILS_ENV=production RUBY_GC_MALLOC_LIMIT=90000000 bundle exec thin start -p 3111
RAILS_ENV=production RUBY_GC_MALLOC_LIMIT=90000000 LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.0.1.0 bundle exec thin start -p 3111
(I tried a bunch of other "recommended" optimisations; most of them made little to no difference.)
Test rig
Production: Intel(R) Xeon(R) CPU E3-1280 V2 @ 3.60GHz
Test performed
Results for: ab -n 100 -c 1 http://localhost:3111/ (times in ms)
first run discarded
rvm use 1.9.3-p385
RAILS_ENV=production bundle exec thin start -p 3111
min mean[+/-sd] median max
Default: 209 260 15.3 263 297
GC : 84 135 32.1 139 205
GC+TC : 76 124 27.1 131 178
rvm install 1.9.3-p385 --patch falcon
min mean[+/-sd] median max
Default: 72 131 49.1 124 540
GC : 70 120 18.9 126 158
GC+TC: 67 121 40.4 116 395
rvm use 2.0.0
min mean[+/-sd] median max
Default: 119 158 25.3 154 211
GC : 65 120 36.1 118 221
GC+TC : 67 113 20.0 114 175
What is the deal with RUBY_GC_MALLOC_LIMIT?
This is one of the most poorly documented bits of optimisation out there:
github.com/ruby/ruby/blob/trunk/gc.c#L68-L70
github.com/ruby/ruby/blob/trunk/gc.c#L3479-L3483
Out of the box, Ruby forces a GC every time 8MB of memory is allocated. This is insanely low: our front page caused about 32MB of allocation, thus causing 4 GCs. Bumping this up to allow 90MB of allocation before forcing a GC puts much less pressure on the GC; we are able to handle 3 or so requests without a GC.
As a rule we want the number large, but not too large: if it's too big we will bleed way too much memory and GC will take forever; if it's too small, as it is out of the box, we will have multiple GCs per page, which is horrible on so many levels.
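A quick way to see this for yourself (my snippet, not from the original post): count how many GCs a chunk of work triggers, then re-run it with different RUBY_GC_MALLOC_LIMIT values. The gcs_during helper is hypothetical.

# Hypothetical helper: count GCs triggered while the block runs.
def gcs_during
  before = GC.stat[:count]
  yield
  GC.stat[:count] - before
end

# Roughly the ~32MB of allocation the front page causes.
puts gcs_during { 32.times { "a" * (1024 ** 2) } }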
What is the impact of raising RUBY_GC_MALLOC_LIMIT to 90 million?
On our test rig, resident memory rose from 116MB to 193MB for the GC setting alone, and up to 211MB when combined with tcmalloc.
What do you recommend?
I have 2 recommendations depending on your stance:
Conservative: all conservative installs should use RUBY_GC_MALLOC_LIMIT=90000000; it's a no-brainer, and the increased memory usage is well worth the extra perf. You get a 48% decrease in median request time just by doing that.
Bleeding Edge: I would recommend using Ruby 2.0 with memory allocated via tcmalloc. You get a 57% decrease in median request time using that.
I can no longer recommend the falcon patches, since most of the patches are in 2.0 anyway; embrace the future.
Issues with this test
The test was performed on a CPU that has Turbo Boost enabled. This means that clock speed varies based on load, which made it more complicated to get consistent test runs. That said, I repeated all the tests multiple times, and they were run on our production boxes.
Why are we eating so much memory per request?
The easiest thing to answer here is what kind of memory we are consuming (by adding ?pp=profile-gc to the request):
230 THOUSAND strings allocated during one request
It is going to take us a while to work through this to find the actual culprits, though quite a few of the strings allocated over and over again do seem to be coming from Active Record, so focusing on cutting the amount of querying is a sure way to reduce string allocations during a request life cycle.
What about Rails 4?
We need to re-test this under Rails 4 to see if it's faster or slower, and determine whether string allocations have gone up or down.
Are you able to reproduce these allocations on OS X? If so (since you're able to use Ruby 2.0), you should be able to find the allocations pretty easily with DTrace. Use this script (I'm writing this from memory, so it may need tweaks):
/* aggregate string allocations by file name (arg1) and line number (arg2) */
ruby$target:::string-create
{
  @_[copyinstr(arg1), arg2] = count();
}
Start your Rails process, run a few request, then attach to the pid with DTrace, run another request, and kill DTrace. You should end up with aggregate counts of file name / line number as the key, and number of string allocations done at that location.
The biggest offenders according to the probes appear to be:
# this MiniProfiler monkey patch
def values(*args, &blk)
  return values_without_profiling(*args, &blk) unless @miniprofiler_sql_id

# and this AR method
def hash_rows
  @hash_rows ||= @rows.map { |row|
    # this string allocation
    Hash[@columns.zip(row)]
  }
end
One way AR could get rid of this double allocation would be to nuke the underlying hash in favor of a real object (tm) and cache some sort of deserializer.
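A rough sketch of that idea (mine; the class names LightResult and RowView are made up, and this is not AR's actual fix): build one column-index map per result set and hand each row a lightweight wrapper instead of materialising a Hash per row.

# Sketch of the "real object + cached deserializer" shape, not AR's implementation.
class RowView
  def initialize(index, row)
    @index = index   # column name => position, shared by every row
    @row   = row
  end

  def [](column)
    pos = @index[column]
    pos && @row[pos]
  end
end

class LightResult
  def initialize(columns, rows)
    @columns = columns
    @rows    = rows
    # Built once per result set instead of allocating one Hash per row.
    @index = {}
    columns.each_with_index { |c, i| @index[c] = i }
  end

  def hash_rows
    @hash_rows ||= @rows.map { |row| RowView.new(@index, row) }
  end
end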
A HUGE packet of these string allocations could go away if pg supported a mode where it returned coerced types as opposed to strings.
I wonder if the MiniProfiler monkey patch should really be attributed as the "responsible party" for the allocation; the allocation actually happened a fair bit deeper, in the C extension.
I can confirm this fix removes that huge number of allocations; it is a huge perf boost for AR.
A very big reason for the excessive allocations is us. For example, we run:
SELECT "topics".* FROM "topics" WHERE ("topics"."deleted_at" IS NULL) AND (topics.archetype <> 'private_message') ORDER BY CASE WHEN topics.category_id IS NULL and topics.pinned THEN 0 ELSE 1 END, topics.bumped_at DESC LIMIT 30
We are not limiting the query to the columns we actually need. This table has 38 columns, so the pg gem is allocating at least double the number of strings we need.
The natural thing is to add a .select there, but my concern is that if we forget a column the failure mode is very problematic: stuff will not "explode", it will fail silently, pretending the missing columns are nil. I wonder if we should add some special mode in dev to assist here. Regardless, we will fix this.
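For illustration only (the column list is an abbreviated guess, not the real one), the change and the failure mode I'm worried about look roughly like this:

# Only pull the columns the topic list actually renders (illustrative subset).
topics = Topic.select([:id, :title, :category_id, :pinned, :bumped_at])
              .where(deleted_at: nil)
              .limit(30)

topics.first.title       # fine, the column was selected
topics.first.created_at  # per the concern above: a forgotten column may quietly
                         # come back as nil instead of blowing up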
Additionally, we have an N+1 query hanging off category.
I think I will add a mode to MiniProfiler that allows you to see object creation counts per query; that will allow people to attack the biggest offenders.
The very frustrating thing about this topic table is that only 2 out of the 38 columns are actually strings.
I profiled the simple:
Topic.limit(30).to_a
(I ran it twice before in production to ensure it was warmed up) and got the table below. The amount of string allocations still feels excessive (even after the patch; I patched locally).
File / Line Number / Allocations.
-------------------------------------
/gems/activerecord-3.2.12/lib/active_record/connection_adapters/abstract_adapter.rb 276 1
/gems/activerecord-3.2.12/lib/active_record/log_subscriber.rb 4 1
/gems/activerecord-3.2.12/lib/active_record/log_subscriber.rb 8 1
/gems/activerecord-3.2.12/lib/active_record/querying.rb 38 1
/gems/activerecord-3.2.12/lib/active_record/relation/spawn_methods.rb 53 1
/gems/activerecord-3.2.12/lib/active_record/scoping/default.rb 119 1
/gems/arel-3.0.2/lib/arel.rb 43 1
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 126 1
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 131 1
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 135 1
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 139 1
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 192 1
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 200 1
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 283 1
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 339 1
test.rb 4 1
test.rb 7 1
/gems/activerecord-3.2.12/lib/active_record/relation/query_methods.rb 268 2
/gems/activerecord-3.2.12/lib/active_record/scoping/default.rb 123 2
/gems/arel-3.0.2/lib/arel.rb 39 2
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 134 2
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 136 2
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 187 2
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 282 2
/gems/arel-3.0.2/lib/arel/visitors/to_sql.rb 366 2
test.rb 6 2
/gems/activerecord-3.2.12/lib/active_record/connection_adapters/abstract/connection_specification.rb 149 3
/gems/activesupport-3.2.12/lib/active_support/log_subscriber.rb 91 3
/gems/activerecord-3.2.12/lib/active_record/scoping.rb 130 4
/gems/activerecord-3.2.12/lib/active_record/scoping.rb 126 5
/gems/activerecord-3.2.12/lib/active_record/relation/spawn_methods.rb 14 6
/gems/activerecord-3.2.12/lib/active_record/relation/spawn_methods.rb 26 8
/gems/activerecord-3.2.12/lib/active_record/relation/spawn_methods.rb 52 12
/gems/activerecord-3.2.12/lib/active_record/base.rb 514 30
/gems/activerecord-3.2.12/lib/active_record/model_schema.rb 162 30
/gems/activerecord-3.2.12/lib/active_record/relation.rb 26 32
/gems/activerecord-3.2.12/lib/active_record/connection_adapters/postgresql_adapter.rb 665 38
/gems/activerecord-3.2.12/lib/active_record/result.rb 33 38
/gems/activerecord-3.2.12/lib/active_record/relation.rb 27 40
/gems/activerecord-3.2.12/lib/active_record/inheritance.rb 68 60
/gems/activesupport-3.2.12/lib/active_support/callbacks.rb 81 60
/gems/activesupport-3.2.12/lib/active_support/callbacks.rb 414 180
/gems/activerecord-3.2.12/lib/active_record/connection_adapters/postgresql_adapter.rb 605 910
Sam Saffron (co-founder), Mar '13:
More profiling (both warmed up in production):
100.times do
Topic.limit(30).to_a
end
vs.
100.times do
ActiveRecord::Base.connection.raw_connection.async_exec("select * from topics limit 30").values
end
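To put numbers on the two, a harness like this would work (mine, not part of the original post):

require 'benchmark'

Benchmark.bmbm do |x|
  x.report("ActiveRecord") do
    100.times { Topic.limit(30).to_a }
  end

  x.report("raw pg") do
    100.times do
      ActiveRecord::Base.connection.raw_connection
                        .async_exec("select * from topics limit 30").values
    end
  end
end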
sampler:
/* remember the class (arg0) and method (arg1) of the most recent C method entry */
ruby*:::cmethod-entry
{
  k = copyinstr(arg0);
  m = copyinstr(arg1);
}

/* sample 5000 times a second and count the class/method pair we were last in */
tick-5000hz
/k != 0/
{
  @[k, m] = count();
}
AR
Hash each 72
String hash 82
Kernel freeze 87
PG::Result ftype 88
Thread [] 89
Module name 94
Array zip 107
Array each 120
Thread current 123
Class allocate 131
Fixnum abs 136
String intern 147
Hash delete 150
Kernel instance_variable_set 183
Hash []= 195
String to_s 195
String initialize 237
String to_sym 238
Kernel respond_to? 247
Fixnum to_s 286
Class superclass 349
PG::Connection async_exec 354
PG::Result values 383
#<Class:0x007ff6e8b9e510> relation 399
Module === 415
Kernel block_given? 450
Kernel hash 512
Kernel class 517
Hash [] 566
Symbol to_s 586
Hash key? 6810
Raw
Mutexd current 1
Thread [] 1
Mutex unlock 3
Module name 8
Thread current 13
PG::Result values 78
PG::Connection async_exec 266
My particular concern here is all these extraneous calls to .key?; it seems the sampler is spending a lot of time there.
These key? calls are all coming from:
2 143160 vm_call_cfunc:cmethod-entry Hash key? /Users/sam/.rvm/gems/ruby-2.0.0-p0/gems/activerecord-3.2.12/lib/active_record/locking/optimistic.rb 174
2 143160 vm_call_cfunc:cmethod-entry Hash key? /Users/sam/.rvm/gems/ruby-2.0.0-p0/gems/activerecord-3.2.12/lib/active_record/attribute_methods/serialization.rb 66

GC tuning settings by Ruby version:

Ruby 1.9.2 (source): GC_MALLOC_LIMIT, HEAP_MIN_SLOTS, FREE_MIN (hard-coded, but customizable via environment variables with this patch)

Ruby 1.9.3 (source): RUBY_GC_MALLOC_LIMIT, RUBY_HEAP_MIN_SLOTS, RUBY_FREE_MIN

Ruby 2.0.0 (source): same as Ruby 1.9.3

Ruby 2.1.0 (source):
  RUBY_GC_HEAP_INIT_SLOTS (obsoletes RUBY_HEAP_MIN_SLOTS)
  RUBY_GC_HEAP_FREE_SLOTS (obsoletes RUBY_FREE_MIN)
  RUBY_GC_HEAP_GROWTH_FACTOR (new)
  RUBY_GC_HEAP_GROWTH_MAX_SLOTS (new)
  RUBY_GC_MALLOC_LIMIT
  RUBY_GC_MALLOC_LIMIT_MAX (new)
  RUBY_GC_MALLOC_LIMIT_GROWTH_FACTOR (new)
  RUBY_GC_OLDMALLOC_LIMIT (new)
  RUBY_GC_OLDMALLOC_LIMIT_MAX (new)
  RUBY_GC_OLDMALLOC_LIMIT_GROWTH_FACTOR (new)

Ruby 2.1.1 (source): adds RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR (new)
