Created
November 21, 2011 18:58
-
-
Save brixen/1383557 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Proposal: Convert Symbol table to proper Ruby objects. | |
| Goals: | |
| 1. Add Encoding to Symbol without needing Encoding index. | |
| 2. Provide ability to garbage collect Symbols. | |
| 3. Improve concurrent allocation of Symbols. | |
| 4. Potentially improve speed of Symbol lookup. | |
| 5. Simplify 1.9 Symbol methods. | |
| Implementation: | |
| Use a concurrent lock-free Hash Array Mapped Trie as the SymbolMap (3). | |
| Structure of SymbolEntry: | |
| The SymbolEntry object stores a CharArray which includes an Encoding reference | |
| and makes Symbol#== direct and consistent with other Ruby String | |
| Encoding-aware operations (5). The same is true of the numerous String-like | |
| methods that were added to Symbol in 1.9. The SymbolEntry object has the | |
| following fields (1): | |
| symbol - The Symbol immediate value for String#to_sym | |
| string - A reference to a CharArray which has bytes, size, Encoding | |
| Structure of SymbolMap: | |
| Address the SymbolMap with the untagged Symbol value and with the Symbol | |
| String hash value. Both of these are machine word sized and suitable for the | |
| trie lookup that HAMT implements. In psuedo-code: | |
| se = SymbolEntry.new str | |
| SymbolMap[se.symbol] = se | |
| SymbolMap[se.string.hash] = se | |
| Using a concurrent lock-free HAMT for the SymbolMap has a couple of nice | |
| features. First, concurrent allocation of Symbols is simple and robust (3). | |
| Second, the data structure requires no resizing and provides very fast | |
| lookup (4). | |
| Allocating Symbols: | |
| Allocate Symbol immediate values using a vector of machine words where the | |
| most significant bit is 0 if bits are free or 1 if no bits are free. The | |
| untagged immediate value is computed by: | |
| word_index * bits_in_word + bit_index | |
| most significant bit | |
| ^ | |
| | bits for symbol values | |
| | ^ | |
| +-+---------+ | |
| 0 |1|1111.....| | |
| +-+---------+ <- SymbolMap bitmap | |
| 1 |0|000....11| | |
| +-+---------+ | |
| ^ ^ | |
| | bit_index | |
| word_index | |
| The bitmap is a machine word vector 0-addressable. The bits_in_word is the | |
| machine word size in bits (ie 32 or 64). The bit_index is the bit that | |
| represents an untagged Symbol value. The word_index is the machine word in | |
| the bitmap containing the bit for the Symbol value. | |
| For example, the bit representing Symbol :<some_foo> is the 3rd | |
| (1-addressable) bit in the 2nd (index 1) bitmap word. On a 64bit | |
| architecture, the Symbol value is: | |
| 1 * 64 + 3 = 67 | |
| The untagged value for the symbol is 67. | |
| Tagging gives the final Symbol value. | |
| Garbage Collection: | |
| A high water mark is set for SymbolMap. During a full GC, when a Symbol is | |
| seen the bit representing the Symbol is set in a 2nd bitmap. Once the trace | |
| phase is complete, all Symbols allocated in the main SymbolMap bitmap that are | |
| not allocated in the GC bitmap are deleted from the SymbolMap. The GC bitmap | |
| replaces the SymbolMap bitmap (2). | |
| C-API: | |
| Presently, Symbol values in the C-API are not tracked as references and there | |
| is no distinction made between ID and Symbol. We would need to track Symbols | |
| as references, and return handle values instead of the Symbol immediate value. | |
| The SYM2ID macro could just return the same handle value. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment