Berkeley DB pool tuning instructions

First, a little background. The Berkeley DB Java Edition library used by the pools (if so configured) uses a log-structured file format. This means that the files of the database (called log segments) are only ever appended to. Once a segment reaches a certain size (10 MB by default), a new log segment is created, and the previous log segments are never modified. If existing data is modified or deleted, this leaves unused fragments in these database files. Once the utilization (the fraction of data still in use) of a segment falls below a certain level, the remaining data is copied to the end of the last segment and the original segment is deleted (this is all textbook log-structured database design).
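To make the append-only and cleaning behaviour concrete, here is a toy model in Python. It is only a sketch of the general log-structured idea, not of how Berkeley DB is actually implemented; the class names, the tiny segment size, and the threshold values are all made up for illustration.

```python
class Segment:
    """One append-only file of the toy log."""

    def __init__(self):
        self.live = {}     # key -> size of entries still in use
        self.written = 0   # total bytes ever appended to this segment

    def utilization(self):
        # Fraction of the segment that still holds live data.
        return sum(self.live.values()) / self.written if self.written else 1.0


class ToyLog:
    def __init__(self, segment_size=100, min_utilization=0.5):
        self.segment_size = segment_size        # hypothetical roll-over size
        self.min_utilization = min_utilization  # hypothetical cleaning threshold
        self.segments = [Segment()]
        self.location = {}                      # key -> segment with current version

    def put(self, key, size):
        old = self.location.get(key)
        if old is not None:
            del old.live[key]                   # old version becomes an unused fragment
        if self.segments[-1].written >= self.segment_size:
            self.segments.append(Segment())     # last segment is full: start a new one
        tail = self.segments[-1]
        tail.live[key] = size
        tail.written += size
        self.location[key] = tail

    def clean(self):
        # Copy live data out of poorly utilized segments, then drop them.
        for seg in list(self.segments[:-1]):
            if seg.utilization() < self.min_utilization:
                for key, size in list(seg.live.items()):
                    self.put(key, size)
                self.segments.remove(seg)
```

Writing four entries, overwriting one of them, and then calling `clean()` shows the effect: the segment that held the overwritten entry drops to 50% utilization, its remaining entry is re-appended at the end of the log, and the segment disappears.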

Berkeley DB uses a btree structure, i.e. the database is organized as a tree, with the actual data at the leaves and the internal nodes allowing fast lookup of the data.

Berkeley DB internally maintains a cache of the files. The default in dCache is to use 20% of the maximum heap size as a cache for Berkeley DB. Oracle recommends that the cache be big enough to contain all the internal btree nodes.

Since not everything is cached, some data will have to be read from disk (or at least from the file system cache). Although only the last file is ever appended to, reads happen from all the files of the database. By default the library keeps up to 100 files open, so if the database grows beyond this limit it will have to close one file before it can open another. In particular during pool startup, when the entire content is read in lexicographic order, this may cause a lot of file open-close cycles.

The Berkeley DB Java Edition library has a lot of configuration settings that can be set by placing a je.properties file inside the meta directory. The settings in this file are read when the pool is restarted. There are two settings I want to point out here:

je.log.fileCacheSize=100

This is the number of files to keep open. It defaults to 100, but if you have significantly more jdb files in the meta directory, it may be worth increasing this limit. Be aware though that you need a file descriptor for each open file. You should make sure to increase the OS limit on how many file descriptors the pool process may keep open.
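A quick way to see whether this applies to a pool is to count the jdb files in its meta directory. The following Python sketch does that; the function names are made up, and the default of 100 is simply the value quoted above.

```python
from pathlib import Path


def count_log_segments(meta_dir):
    """Count the *.jdb files (log segments) in a pool's meta directory."""
    return sum(1 for p in Path(meta_dir).glob("*.jdb") if p.is_file())


def exceeds_file_cache(meta_dir, file_cache_size=100):
    """True if there are more log segments than je.log.fileCacheSize allows open."""
    return count_log_segments(meta_dir) > file_cache_size
```

For example, `exceeds_file_cache("/srv/ore_ndgf_org_002/pool/meta")` would report whether that pool has more log segments than the default limit. How far to raise the limit is still a judgement call, and the file descriptor limit of the pool process has to be checked separately.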

The other relevant settings are

je.maxMemory=
je.maxMemoryPercent=20

These are two alternative ways of setting the same limit. The former sets the number of bytes used for the btree cache, while the latter defines it as a percentage of the max heap size. E.g. if dcache.java.memory.heap is set to 2048m, then 20% of that, roughly 410 MB, is used for the btree cache.
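The arithmetic behind that example can be sketched in a few lines of Python. The helper names are made up, and the parsing only assumes the usual k/m/g suffixes of Java heap settings.

```python
def parse_heap_size(value):
    """Convert a Java-style heap size such as '2048m' into bytes."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    value = value.strip().lower()
    if value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)


def default_cache_bytes(heap, percent=20):
    """Cache size in bytes when je.maxMemoryPercent is applied to the max heap."""
    return parse_heap_size(heap) * percent // 100


# Heap of 2048m with the default of 20%:
print(round(default_cache_bytes("2048m") / 1024 ** 2))  # prints 410
```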

The question is then how to determine a good size for the cache. There are some hidden utilities you can use to do this:

/srv/ore_ndgf_org_002/pool/meta$ java -cp  /usr/share/dcache/classes/je-*.jar com.sleepycat.je.util.DbPrintLog -h . -S

<DbPrintLog>
Log statistics:
           type               total         provisional               total                 min                 max                 avg             entries
                              count               count               bytes               bytes               bytes               bytes         as % of log
          MapLN                  26                   0               2,211                  49                 125                  85                   0
         NameLN                   4                   0                 141                  31                  38                  35                   0
  FileSummaryLN                  23                   0             331,258                  24              74,250              14,402                 2.3
             IN                  21                   0              35,582                  44               4,383               1,694                 0.2
            BIN                 394                 394             242,633                  70               2,130                 615                 1.7
         DbTree                  10                   0               1,174                 100                 134                 117                   0
         Commit              69,561                   0           2,225,952                  32                  32                  32                15.4
      CkptStart                   5                   0                 147                  28                  31                  29                   0
        CkptEnd                   5                   0                 761                  61                 177                 152                   0
          Trace                  12                   0               1,392                  47                 383                 116                   0
     FileHeader                   3                   0                 114                  38                  38                  38                   0
      DEL_LN_TX              18,833                   0           1,261,811                  67                  67                  67                 8.7
      INS_LN_TX              18,837                   0           4,362,303                  95                 391                 231                30.2
      UPD_LN_TX              31,891                   0           5,454,607                 100                 396                 171                37.7
         UPD_LN               2,340                   0             542,409                  18                 522                 231                 3.7
    NewBINDelta                  17                  17               2,487                 121                 364                 146                   0
      key bytes              71,901                               2,802,302                   1                  61                  38              (19.4)
     data bytes              53,068                               6,910,747                   1                 502                 130              (47.8)

Total bytes in portion of log read: 14,464,982
Total number of entries: 141,982

Per checkpoint interval info:
          lnTxn                  ln            mapLNTxn               mapLN          end to end        end to start        start to end         maxLNReplay             ckptEnd
         21,276                 790                   0                   8          14,711,616          14,469,073             242,543              22,066   0x41/0x47e4c0
         48,261               1,550                   0                   8           9,724,864           9,520,804             204,060              49,813   0x42/0x43b200
              0                   0                   0                   4               1,339                 664                 675                   0   0x42/0x43b73b
              8                   0                   0                   3              12,182               1,999              10,183                   8   0x42/0x43e6d1
             16                   0                   0                   3           5,564,955           5,553,901              11,054                  16   0x43/0x3a6c
              0                   0                   0                   0                   0                   0                   0                   0   0x43/0x3a6c
</DbPrintLog>

This prints some statistics about the database. In particular the two rows called key bytes and data bytes are relevant. For the following step you need the values from the avg column - i.e. 38 and 130 in this case. You also need the value of “Total number of entries” (141,982 in this case).
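If you have many pools, picking these numbers out by hand gets tedious. Here is a rough Python sketch that extracts them from the DbPrintLog output; it assumes the column layout shown above (count, bytes, min, max, avg, percentage) and the function names are my own invention.

```python
import re


def parse_print_log(output):
    """Extract avg key size, avg data size and entry count from DbPrintLog -S output."""
    stats = {}
    for line in output.splitlines():
        stripped = line.strip()
        for label in ("key bytes", "data bytes"):
            if stripped.startswith(label):
                # Remaining columns: count, total bytes, min, max, avg, (% of log)
                columns = stripped[len(label):].split()
                stats[label] = int(columns[4].replace(",", ""))
        match = re.match(r"Total number of entries:\s*([\d,]+)", stripped)
        if match:
            stats["entries"] = int(match.group(1).replace(",", ""))
    return stats


def cache_size_args(stats):
    """Arguments for com.sleepycat.je.util.DbCacheSize built from the parsed values."""
    return ["-records", str(stats["entries"]),
            "-key", str(stats["key bytes"]),
            "-data", str(stats["data bytes"])]
```

Feeding it the output above gives the arguments `-records 141982 -key 38 -data 130`.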

Now you take those values and put them into this command:

/srv/ore_ndgf_org_002/pool/meta$ java -cp  /usr/share/dcache/classes/je-*.jar com.sleepycat.je.util.DbCacheSize -records 141982 -key 38 -data 130

=== Environment Cache Overhead ===

3,157,213 minimum bytes

=== Database Cache Size ===

  Number of Bytes  Description
  ---------------  -----------
   11,473,200  Internal nodes only
   37,341,168  Internal nodes and leaf nodes

For this very small pool it tells us that we need a bit above 11 MB to keep all the internal btree nodes cached. Oracle’s recommendation is that if the database is updated often, the cache is at least big enough to contain the internal nodes. Obviously you want to make it a bit bigger to leave room for it to grow. One could take this and configure the cache using the je.maxMemory setting. If the size is lower than the 20% of the max heap size you already use, then you don’t need to do anything (I do not suggest lowering it further).
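The decision described here can be written down as a small Python sketch. The function name is made up, and the headroom factor of 1.5 is just an arbitrary illustration of "a bit bigger", not a recommendation from Oracle or dCache.

```python
def recommend_max_memory(internal_nodes_bytes, heap_bytes,
                         headroom=1.5, default_percent=20):
    """Return a value for je.maxMemory, or None if the default cache is big enough.

    headroom is a hypothetical growth factor; pick your own.
    """
    needed = int(internal_nodes_bytes * headroom)
    default_cache = heap_bytes * default_percent // 100
    if needed <= default_cache:
        return None        # 20% of the heap already covers the internal nodes
    return needed


# The small pool from above with a 2048m heap: nothing to change.
print(recommend_max_memory(11_473_200, 2048 * 1024 ** 2))  # prints None
```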

There is one caveat though: If you increase the cache size, less free space is left on the heap. If the pool is pushed to the limit, this may actually slow down the pool as garbage collection overhead increases. You need to ensure that enough space is left in addition to the cache (possibly by increasing the max heap size). An alternative to adjusting je.maxMemory is of course to adjust the max heap size. If you make it large enough so that 20% is enough to cache the internal btree nodes, then all is well. This may however mean you assign significantly more memory to the pool than it really needs. Yeah, there are lots of things to consider :-)

If all of this is confusing and you are happy with your pools, then simply ignore all I said.

A little bonus: There is also this command:

/srv/ore_ndgf_org_002/pool/meta$ java -cp  /usr/share/dcache/classes/je-*.jar com.sleepycat.je.util.DbSpace -h .
 File    Size (KB)  % Used
--------  ---------  ------
00000041       9765       4
00000042       4345      11
00000043         14      86
TOTALS       14125       6

It will tell you for each of the database files (the log segments) how big it is and what its utilization is. Berkeley DB will try to keep the total utilization above 50%, but for a small pool like this one it cannot do so. I figure you may find this interesting after reading about utilization above.
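The total in the last row is the size-weighted average of the per-file values. A small Python sketch that parses the DbSpace output and recomputes it (function names are made up, and since the per-file percentages are already rounded, the result is only approximate):

```python
def parse_db_space(output):
    """Return (file, size_kb, percent_used) for each log segment row."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        if (len(parts) == 3 and parts[0] != "TOTALS"
                and parts[1].isdigit() and parts[2].isdigit()):
            rows.append((parts[0], int(parts[1]), int(parts[2])))
    return rows


def total_utilization(rows):
    """Size-weighted utilization in percent across all segments."""
    total = sum(size for _, size, _ in rows)
    live = sum(size * used / 100 for _, size, used in rows)
    return 100 * live / total if total else 0.0
```

For the three files above this comes out at a little over 6%, matching the TOTALS row.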

The log segment size and the utilization goal can be adjusted too, but I cannot give any sound advice on whether that’s a good idea and how to determine good values.

All of the above is only relevant if you use the Berkeley DB backend for pools.

gbehrmann/je_tuning.md