- Need to make sure Ember references are cleared after the response is sent back.
- http://stackoverflow.com/questions/5326300/garbage-collection-with-node-js
- http://blog.caustik.com/2012/04/08/scaling-node-js-to-100k-concurrent-connections/
- http://dtrace.org/blogs/bmc/2012/05/05/debugging-node-js-memory-leaks/
- http://www.scirra.com/blog/76/how-to-write-low-garbage-real-time-javascript
- http://benoitvallee.net/blog/2012/06/node-js-garbage-collector-explicit-call/ (run with --expose_gc and call global.gc())
- http://blog.caustik.com/2012/04/11/escape-the-1-4gb-v8-heap-limit-in-node-js/
- http://www.ibm.com/developerworks/web/library/wa-memleak/
- http://stackoverflow.com/questions/5733665/how-to-prevent-memory-leaks-in-node-js
- http://blog.caustik.com/2012/04/10/node-js-w250k-concurrent-connections/
- http://stackoverflow.com/questions/5903675/node-js-garbage-collection-event-or-trace-gc-to-stderr
- http://dtrace.org/blogs/dap/2012/01/05/where-does-your-node-program-spend-its-time/
- http://stackoverflow.com/questions/9941374/node-js-gc-mark-compact
- https://groups.google.com/forum/?fromgroups#!topic/datamapper/M7BMUe8AjlY
- http://s3.mrale.ph/nodecamp.eu
- https://github.com/TooTallNate/node-weak
- manually trigger gc:
node --expose_gc --nouse_idle_notification tmp/leaks.js
- https://groups.google.com/forum/?fromgroups#!topic/nodejs/BO6JdYi4n2k
node --expose_gc --nouse_idle_notification --trace-gc tmp/leaks.js
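A minimal sketch of what calling the exposed collector looks like; the interval and logging below are illustrative only, not part of any framework mentioned in these notes.

```javascript
// Run with: node --expose_gc --nouse_idle_notification tmp/leaks.js
// global.gc is only defined when --expose_gc is passed.
function forceGC() {
  if (typeof global.gc === 'function') {
    var before = process.memoryUsage().heapUsed;
    global.gc();
    var after = process.memoryUsage().heapUsed;
    console.log('gc freed ~' + Math.round((before - after) / 1024) + ' KB');
  }
}

// Illustrative: collect every 30s instead of relying on V8's idle notifications.
setInterval(forceGC, 30000);
```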
- https://github.com/raganwald/homoiconic/blob/master/2012/03/garbage_collection_in_coffeescript.md
- precise generational garbage collection
- https://developers.google.com/chrome-developer-tools/docs/heap-profiling
- https://developers.google.com/chrome-developer-tools/docs/heap-profiling-dominators
- http://lostechies.com/derickbailey/2012/03/19/backbone-js-and-javascript-garbage-collection/
- http://stackoverflow.com/questions/3788805/garbage-collection-and-javascript-delete-is-this-overkill-obfuscation-or-a-g
- https://developer.mozilla.org/en/JavaScript/Memory_Management
- http://stackoverflow.com/questions/6297007/javascript-anonymous-function-garbage-collection
- http://stackoverflow.com/questions/864516/what-is-javascript-garbage-collection/864544#864544
- http://stackoverflow.com/questions/7347203/circular-references-in-javascript-garbage-collector
- question: for $.ajax(error: fn, success: fn), if one of the closures doesn't execute does the reference still exist?
- http://stackoverflow.com/questions/4324133/javascript-garbage-collection
- http://nodeguide.com/convincing_the_boss.html
- http://www.acunote.com/blog/2008/01/garbage-collection-is-why-ruby-is-slow.html
- Further, simply allocating memory is relatively expensive, and that will also show up in profiler output. [which is why reusing objects as much as possible is helpful]
- http://stackoverflow.com/questions/6480148/is-there-a-better-solution-than-activerecord-for-batch-data-imports
- http://www.coffeepowered.net/2009/01/23/mass-inserting-data-in-rails-without-killing-your-performance/
- http://www.williambharding.com/blog/uncategorized/rails-3-performance-abysmal-to-good-to-great/
- identity map removed in rails: https://github.com/rails/rails/commit/302c912bf6bcd0fa200d964ec2dc4a44abe328a6
- http://mongoid.org/en/mongoid/docs/identity_map.html
The problem in Rails was this: if you call post.comments.first.update_attribute('post', null) and then Post.destroy(post.id), it will still destroy the comment, even though you just set its postId to null so it should no longer belong to the post. To fix this you need to remove the comment from the comments array after its postId changes. So you need a map from the comment to the associations (cursors) it belongs to, and just iterate through them and remove it. Basically, whenever a property that is part of a cursor's observableFields changes on the comment, iterate through all cursors for the comment, and if it no longer matches, remove it from the in-memory array. This way, when you find the post again with Post.destroy(), which returns the in-memory post, it will still have the post.comments association, but the comment won't be in there, so dependent-destroy won't have any effect. Also, dependent-destroy shouldn't even be affecting this... it should realize comment.postId is null now. A sketch of this cursor bookkeeping follows.
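A plain-JavaScript sketch of the above; cursorsByRecordId, observableFields, matches, and data are hypothetical names here, not Tower or Ember API. The point is just the shape: a per-record list of cursors, re-checked whenever an observed field changes.

```javascript
// Hypothetical registry: record id -> the in-memory cursors that currently contain the record.
var cursorsByRecordId = {};

function registerInCursor(record, cursor) {
  (cursorsByRecordId[record.id] = cursorsByRecordId[record.id] || []).push(cursor);
}

// Call whenever a field in a cursor's observableFields changes on the record.
function recheckCursors(record, changedField) {
  var cursors = cursorsByRecordId[record.id] || [];
  for (var i = cursors.length - 1; i >= 0; i--) {
    var cursor = cursors[i];
    if (cursor.observableFields.indexOf(changedField) === -1) continue;
    if (!cursor.matches(record)) {
      // e.g. comment.postId was set to null, so drop it from post.comments' in-memory array.
      var idx = cursor.data.indexOf(record);
      if (idx !== -1) cursor.data.splice(idx, 1);
      cursors.splice(i, 1);
    }
  }
}
```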
Answer. So, global identity map scoped to the current request. Attach the request to the controller and vice-versa. Do App.Post.with(@), which initializes an identity map on the request. Have the identity map keep track of all the cursors and models instantiated in the request. Then after the controller responds, after any after callbacks, everything in the identity map is cleared from memory with Ember.Object#destroy. If you have some async callback after the request has been written (say, doing a streaming operation or progress indicator), then it's up to you to fetch the records again. Instead of doing that, you should create a background job and pass the current user's socket id so you can send them messages through the already-instantiated web socket. This frees up the controller and everything in the identity map for garbage collection, making room for the next request.
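A rough sketch of such a request-scoped identity map. Ember.Object#destroy is real Ember API; the IdentityMap itself, its key scheme, and where it hangs off the request are assumptions for illustration.

```javascript
// Hypothetical request-scoped identity map; clear() runs after the controller responds.
function IdentityMap() {
  this.records = {};
}

IdentityMap.prototype.find = function (type, id, build) {
  var key = type + ':' + id;
  return this.records[key] || (this.records[key] = build());
};

IdentityMap.prototype.clear = function () {
  for (var key in this.records) {
    this.records[key].destroy(); // Ember.Object#destroy: tears down meta, observers, bindings
    delete this.records[key];
  }
  this.records = {};
};

// e.g. after the `after` callbacks run: request.identityMap.clear()
```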
Rails identity map is cleared when a request is closed. rails/rails#6524
You can keep a global hash pointing to all of the instantiated controllers, and if a controller or model has not been accessed within some interval of time, destroy it. This way, you could store all Ember guids in a global list (['__ember__guid_1232181', '__ember__guid_1232132', ...]) and refresh a timer whenever one of those objects is accessed within the specified interval; otherwise, iterate through the list, find each object by its guid, and delete it.
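A sketch of that timed sweep, assuming a plain object keyed by Ember guid; the registry, the touch helper, and the 60s interval are illustrative.

```javascript
// Hypothetical guid registry: guid -> { object, lastAccessedAt }.
var registry = {};
var TTL = 60 * 1000; // destroy anything untouched for 60s (arbitrary)

function touch(object) {
  registry[Ember.guidFor(object)] = { object: object, lastAccessedAt: Date.now() };
}

setInterval(function () {
  var now = Date.now();
  for (var guid in registry) {
    if (now - registry[guid].lastAccessedAt > TTL) {
      registry[guid].object.destroy(); // Ember.Object#destroy clears meta/observers
      delete registry[guid];
    }
  }
}, TTL);
```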
Perhaps we could also keep a global object pool of instantiated models of each type, just so the server only needs to swap the attributes out. It might be cheaper to just delete them and start over, but maybe it's better to have like a million objects in memory, and swap them out.
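A tiny sketch of that per-type pool; setProperties is real Ember API, everything else (the pool, checkout/checkin) is hypothetical.

```javascript
// Hypothetical per-type pool: reuse model instances by swapping their attributes.
var pool = { 'App.Post': [] };

function checkout(type, klass, attributes) {
  var record = (pool[type] || (pool[type] = [])).pop() || klass.create();
  record.setProperties(attributes); // reuse the instance instead of allocating a new one
  return record;
}

function checkin(type, record) {
  pool[type].push(record);
}
```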
The computed properties are what we need to worry about on the server. If they return another model instance, then there is potentially a memory leak if they are not destroyed, no? Hmm... If there are no objects pointing to either of those records (circular referencing each other), will there be a memory leak? That is, if they are "unreachable", shouldn't they be garbage collected? Need to test.
What about variables in the controller, do they need to be garbage collected?
How about the .instance() property for the current controller on the server? (don't think we're even using that)
Need to set up some sort of debugger/logger for the properties watched in Ember (or all the event listeners) on the server.
Need to clear out the cursor.data property.
You want your requests to return as quickly as possible so the JavaScript can be garbage collected. Then run processor-intensive functions in a separate process. How do you then do things like streaming back progressive file upload data? Maybe in this case you have a deallocate function that you can run when you start your long-running process. Or you can get access to the socket for the user from a background job! Tower.connections[job.data.socketId]. To make this work we'll have to message via the command line and hook.io to the socket.io server. Unless there's some way to run the worker alongside the job.
If this happens in the controller, will the controller be garbage collected (and all properties on it), even if the function it calls internally is long-running?
class App.AttachmentsController extends App.Controller
  create: ->
    App.Attachment.create @params, (error, attachment) =>
      # Say this is non-blocking but takes about a minute, will everything except the attachment be garbage collected?
      # Probably not, which is why you want to start up background processes.
      # So, this function should create a background job, passing the currentUser id, which we can use to search the sockets
      # for the socket, which we can use to send data back, all in a separate process so the
      # request/response cycle can be freed up and garbage collected.
      attachment.processAndUploadInBackground()
      @render json: attachment
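A hedged sketch of the background-job side implied by those comments: the worker looks up the already-open socket via Tower.connections (named elsewhere in these notes) and streams progress back. The queue API and uploadToS3 are hypothetical placeholders.

```javascript
// Illustrative worker process; Tower.connections[socketId] is assumed to return a live socket.io socket.
queue.process('attachment.upload', function (job, done) {
  var socket = Tower.connections[job.data.socketId];

  uploadToS3(job.data.attachmentId, function onProgress(percent) {
    if (socket) socket.emit('upload:progress', { percent: percent });
  }, function onComplete(error) {
    if (socket) socket.emit('upload:complete', { error: !!error });
    done(error);
  });
});
```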
- There is no explicit garbage collection code for the current HTTP request, so it must be getting cleaned up.
You can set the Ember guid to the model guid!
record[Ember.GUID_KEY] = databaseRecord._id.toString()
Then maybe whenever you call Ember.guidFor and it matches the object id, you can pass that into the identity map.
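A small sketch of that idea: give the record the database id as its Ember guid so Ember.guidFor doubles as the identity-map key. App.Post and the identityMap shape are assumptions carried over from the sketch above.

```javascript
function materialize(identityMap, databaseRecord) {
  var id = databaseRecord._id.toString();
  var key = 'App.Post:' + id;
  if (identityMap.records[key]) return identityMap.records[key];

  var record = App.Post.create(databaseRecord);
  record[Ember.GUID_KEY] = id; // Ember.guidFor(record) should now return the database id
  identityMap.records[key] = record;
  return record;
}
```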
- Ember.destroy: Tears down the meta on an object so that it can be garbage collected.
- Ember.Object.create().destroy(): Destroys an object by setting the isDestroyed flag and removing its metadata, which effectively destroys observers and bindings.
- Ember.Object#willDestroy: called the frame before it will actually be destroyed.
- Ember.Object#didDestroy: called the next frame, just after all metadata for it has been destroyed.
- https://github.com/chrisa/node-dtrace-provider
node-inspector --web-port=8989
- http://dtrace.org/blogs/dap/2012/04/25/profiling-node-js/
npm install -g stackvis
sudo dtrace -o stacks.out -n 'profile-97/execname == "node" && arg1/{ @[jstack(100, 8000)] = count(); } tick-60s { exit(0); }'
npm install memwatch (https://github.com/lloyd/node-memwatch); a usage sketch follows these links
- http://stackoverflow.com/questions/5718391/memory-leak-in-node-js-scraper
- http://www.unix.com/man-page/OpenSolaris/1/mdb/ (referenced a lot in dtrace's blog)
- http://dtrace.org/blogs/dap/2012/01/13/playing-with-nodev8-postmortem-debugging/
brew install mdbtools
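A minimal node-memwatch sketch using its documented 'leak' and 'stats' events and HeapDiff; runSuspectCode is a placeholder for whatever path you think is leaking.

```javascript
var memwatch = require('memwatch');

// Fired when the heap keeps growing across several consecutive GCs.
memwatch.on('leak', function (info) {
  console.error('possible leak:', info);
});

// Fired after each full GC with heap usage numbers.
memwatch.on('stats', function (stats) {
  console.log('heap after gc:', stats.current_base);
});

// Diff the heap around a suspect code path to see which object types grew.
var hd = new memwatch.HeapDiff();
runSuspectCode(); // placeholder
console.log(JSON.stringify(hd.end(), null, 2));
```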
Better way to remove an item from an array (avoids feeding the garbage collector by allocating a new array):
// Shift everything after `index` down one slot, then truncate in place.
for (var i = index, len = arr.length - 1; i < len; i++) {
  arr[i] = arr[i + 1];
}
arr.length = len;
- http://nosql.mypopescu.com/post/13493023635/rails-caching-benchmarked-mongodb-redis-memcached
- https://github.com/SFEley/mongo_store
- Cache queries in mongodb with cursor.toParams: http://stackoverflow.com/questions/5709773/how-to-cache-a-query-in-ruby-on-rails-3
- http://www.mongodb.org/display/DOCS/Caching
- https://github.com/jnunemaker/bin/blob/master/lib/active_support/cache/bin.rb
- http://www.quora.com/Is-MongoDB-a-good-replacement-for-Memcached
- "Are you caching data that would benefit more than just a key-value store? Now we're talking. This plays directly into the strengths of MongoDB, and takes memcache where it wasn't really intended to go."
- that means we can store the results from multiple computations (getting groups/tweets/memberships) and potentially cache them.
- http://stackoverflow.com/questions/5465737/memcache-vs-java-memory
- https://github.com/mape/node-caching/
- http://www.mongodb.org/display/DOCS/Caching
- http://www.quora.com/Which-is-a-better-choice-for-a-web-analytics-service-Redis-or-MongoDB
- http://openmymind.net/2011/5/8/Practical-NoSQL-Solving-a-Real-Problem-w-Mongo-Red/
- http://redis.io/commands/expire
- http://stackoverflow.com/questions/4188620/redis-and-memcache-or-just-redis
- https://github.com/jodosha/redis-store/blob/master/redis-store/lib/redis/store/marshalling.rb
- http://stackoverflow.com/questions/10558465/memcache-vs-redis
- http://highscalability.com/scaling-twitter-making-twitter-10000-percent-faster/
- "Send message to invalidate friend's cache in the background instead of doing all individually, synchronously."
- http://www.codypowell.com/taods/2012/01/the-beautiful-marriage-of-mongodb-and-redis.html
- store tweets in both redis and mongodb: "Was it faster to pull the app ids from Redis, use that to pull the documents from MongoDB, then use Python to reorder everything? Actually, yes. Thus far, getting from the cache takes 1/3 of the time that it did before. Meanwhile, adding to the cache is essentially free."
- http://stackoverflow.com/questions/10317732/why-use-redis-instead-of-mongodb-for-caching
- http://broadcastingadam.com/2011/05/advanced_caching_in_rails/
- http://highscalability.com/blog/2011/7/6/11-common-web-use-cases-solved-in-redis.html
- http://stackoverflow.com/questions/7888880/what-is-redis-and-what-do-i-use-it-for
- http://antirez.com/post/take-advantage-of-redis-adding-it-to-your-stack.html/
- Precomputed queries! All cursors should store ids of sorted/matching records in redis.
- Use redis to store sorted sets of up to 10,000 records each (first 50 pages); see the sorted-set sketch after these links.
- http://stackoverflow.com/questions/10205635/redis-filter-by-range-sort-and-return-10-first
- http://playnice.ly/blog/2010/05/05/a-fast-fuzzy-full-text-index-using-redis/
- http://openmymind.net/Paging-And-Ranking-With-Large-Offsets-MongoDB-vs-Redis-vs-Postgresql/
- http://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps/
- http://patshaughnessy.net/2011/11/29/two-ways-of-using-redis-to-build-a-nosql-autocomplete-search-index
- https://github.com/seatgeek/soulmate
- https://github.com/seatgeek/soulmate/blob/master/lib/soulmate/matcher.rb
- http://redis.io/topics/twitter-clone
- http://www.quora.com/Redis/How-efficient-would-Redis-sorted-sets-be-for-a-news-feed-architecture
- http://santosh-log.heroku.com/2011/05/21/relationlike-redis/
- https://github.com/smrchy/redis-tagging
- http://dr-josiah.blogspot.com/2011/02/some-redis-use-cases.html
- redis for rate limiting
- nested sets (trees) in redis: https://groups.google.com/forum/#!topic/redis-db/IsLJ4PlBo9E/discussion
- https://github.com/rediscookbook/rediscookbook
- http://openmymind.net/Data-Modeling-In-Redis/
- http://knokio.com/data/analytics-with-redis/
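A node_redis sketch of the precomputed-query bullets above: each cursor keeps a sorted set of matching ids (score = sort key), capped at 10,000, and a page is just a range query. Key names are made up for illustration.

```javascript
var redis = require('redis');
var client = redis.createClient();

// When a record starts matching a cursor, add its id with the sort field as the score.
function addToCursor(cursorKey, record, callback) {
  client.zadd(cursorKey, record.createdAt.getTime(), record.id.toString(), function (err) {
    if (err) return callback(err);
    // Keep only the newest 10,000 ids (the first 50 pages).
    client.zremrangebyrank(cursorKey, 0, -10001, callback);
  });
}

// Page n (0-based) is a range on the sorted set; hydrate the ids from MongoDB afterwards.
function fetchPage(cursorKey, page, perPage, callback) {
  var start = page * perPage;
  client.zrevrange(cursorKey, start, start + perPage - 1, callback);
}
```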
Use MongoDB to store the details (membership.createdAt, membership.role, etc.) but use redis just to map the ids (user.membership_ids, user.group_ids).
You want to store all of these ids in redis so you can do fast writes as well! So every time a user posts a tweet, it can instantly grab all users following that group (pure redis query) and push that tweet id into their feeds (even with 1 million followers redis can do that in about 10 seconds). And twitter probably only pushes it into the feeds of users that have been recently active (so if you come back after a month away, you have to wait for your timeline to load). In that case, it has to fetch all the users you follow and compute the ids (grab the ids of everyone the user follows from redis, then grab the latest tweets from each of those users' timelines, and add them to this user's home timeline).
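A node_redis sketch of that fan-out-on-write; the followers:<id> and timeline:<id> key names and the 800-tweet cap are illustrative.

```javascript
var redis = require('redis');
var client = redis.createClient();

// Fan-out on write: push the new tweet id onto every follower's precomputed timeline.
function fanOutTweet(authorId, tweetId, createdAt, callback) {
  client.smembers('followers:' + authorId, function (err, followerIds) {
    if (err) return callback(err);
    followerIds.forEach(function (followerId) {
      client.zadd('timeline:' + followerId, createdAt, tweetId);
      client.zremrangebyrank('timeline:' + followerId, 0, -801); // cap each timeline at 800 entries
    });
    callback(null, followerIds.length);
  });
}
```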
- twitter stream algorithm
- http://www.vijaykandy.com/2012/03/autocomplete-using-a-trie-in-redis/
- http://stackoverflow.com/questions/11095331/best-solution-for-finding-1-x-1-million-set-intersection-redis-mongo-other
- http://stackoverflow.com/questions/11441293/incrementing-hundreds-of-counters-at-once-redis-or-mongodb
- https://gist.github.com/896321
- https://github.com/twitter/snowflake
- twitter's recommendation engine
- http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html
- http://blog.waxman.me/how-to-build-a-fast-news-feed-in-redis
- redis can do 100,000 writes per second: http://redis.io/topics/benchmarks
- http://serverfault.com/questions/237505/what-hardware-makes-a-good-mongodb-server-where-to-get-it
- http://www.quora.com/Twitter-Trends/What-is-the-basis-of-Twitters-current-Trending-Topics-algorithm?q=trending+algorithm
- http://engineering.wattpad.com/post/20902824042/using-redis-pipeline-to-write-news-feed
- http://www.quora.com/What-are-best-practices-for-building-something-like-a-News-Feed?q=news+feeds
Determining what data to store depends on your front-end (including what activities your users participate in) and your back-end. I'll describe some general information you can store. Italics are special, optional information you might want or need depending on your schema.
Activity(id, user_id, source_id, activity_type, edge_rank, parent_id, parent_type, data, time)
- user_id: user who generated the activity
- source_id: record the activity is related to
- activity_type: type of activity (photo album, comment, etc.)
- edge_rank: the rank for this particular activity
- parent_type: the parent activity type (particular interest, group, etc.)
- parent_id: primary key id for the parent type
- data: serialized object with meta-data
- http://www.quora.com/What-are-the-scaling-issues-to-keep-in-mind-while-developing-a-social-network-feed
- http://www.brianfrankcooper.net/pubs/follows.pdf
- http://www.slideshare.net/nkallen/q-con-3770885
- https://github.com/paulasmuth/recommendify
- very good insights: http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html
To support relevance filtering and personalization, we needed three types of signals:
- Static signals, added at indexing time
- Resonance signals, dynamically updated over time
- Information about the searcher, provided at search time
- https://github.com/ryanking/earlybird/blob/master/earlybird.rb
- spiderduck: http://engineering.twitter.com/2011/11/spiderduck-twitters-real-time-url.html
- kestrel (message queueing system for twitter): https://github.com/robey/kestrel
- http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
These servers use a specialized ranking function that combines relevance signals and the social graph to compute a personalized relevance score for each Tweet.
Twitter is a complex yet elegant distributed network of queues, daemons, caches, and databases.