Skip to content

Instantly share code, notes, and snippets.

@brixen
Created November 21, 2011 18:58
Show Gist options
  • Select an option

  • Save brixen/1383557 to your computer and use it in GitHub Desktop.

Select an option

Save brixen/1383557 to your computer and use it in GitHub Desktop.
Proposal: Convert Symbol table to proper Ruby objects.
Goals:
1. Add Encoding to Symbol without needing Encoding index.
2. Provide ability to garbage collect Symbols.
3. Improve concurrent allocation of Symbols.
4. Potentially improve speed of Symbol lookup.
5. Simplify 1.9 Symbol methods.
Implementation:
Use a concurrent lock-free Hash Array Mapped Trie as the SymbolMap (3).
Structure of SymbolEntry:
The SymbolEntry object stores a CharArray which includes an Encoding reference
and makes Symbol#== direct and consistent with other Ruby String
Encoding-aware operations (5). The same is true of the numerous String-like
methods that were added to Symbol in 1.9. The SymbolEntry object has the
following fields (1):
symbol - The Symbol immediate value for String#to_sym
string - A reference to a CharArray which has bytes, size, Encoding
Structure of SymbolMap:
Address the SymbolMap with the untagged Symbol value and with the Symbol
String hash value. Both of these are machine word sized and suitable for the
trie lookup that HAMT implements. In psuedo-code:
se = SymbolEntry.new str
SymbolMap[se.symbol] = se
SymbolMap[se.string.hash] = se
Using a concurrent lock-free HAMT for the SymbolMap has a couple of nice
features. First, concurrent allocation of Symbols is simple and robust (3).
Second, the data structure requires no resizing and provides very fast
lookup (4).
Allocating Symbols:
Allocate Symbol immediate values using a vector of machine words where the
most significant bit is 0 if bits are free or 1 if no bits are free. The
untagged immediate value is computed by:
word_index * bits_in_word + bit_index
most significant bit
^
| bits for symbol values
| ^
+-+---------+
0 |1|1111.....|
+-+---------+ <- SymbolMap bitmap
1 |0|000....11|
+-+---------+
^ ^
| bit_index
word_index
The bitmap is a machine word vector 0-addressable. The bits_in_word is the
machine word size in bits (ie 32 or 64). The bit_index is the bit that
represents an untagged Symbol value. The word_index is the machine word in
the bitmap containing the bit for the Symbol value.
For example, the bit representing Symbol :<some_foo> is the 3rd
(1-addressable) bit in the 2nd (index 1) bitmap word. On a 64bit
architecture, the Symbol value is:
1 * 64 + 3 = 67
The untagged value for the symbol is 67.
Tagging gives the final Symbol value.
Garbage Collection:
A high water mark is set for SymbolMap. During a full GC, when a Symbol is
seen the bit representing the Symbol is set in a 2nd bitmap. Once the trace
phase is complete, all Symbols allocated in the main SymbolMap bitmap that are
not allocated in the GC bitmap are deleted from the SymbolMap. The GC bitmap
replaces the SymbolMap bitmap (2).
C-API:
Presently, Symbol values in the C-API are not tracked as references and there
is no distinction made between ID and Symbol. We would need to track Symbols
as references, and return handle values instead of the Symbol immediate value.
The SYM2ID macro could just return the same handle value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment