- 2010: InnoDB 1.1 announced; will first ship with MySQL 5.5.
- Historically InnoDB development lags while MySQL is trying to GA
- lots of things fixed in InnoDB since MySQL 5.0
- important note -- plugin version of InnoDB is not the default in 5.1
- percona server uses innodb plugin by default
- xtradb is a fork of the innodb plugin
- the on-disk format of xtradb does not change by default, you can switch back and forth; some extra features require changing the on-disk format, however
- percona xtradb releases closely tied to upstream mysql releases
- release model is short release cycle, rebased against new mysql releases
- historically there is an average of a new release every 1-2 months
- in general, innodb has fewer opportunities for reordering writes than an operating system, since it must be more reliable
- innodb uses fsync, myisam does not
- for a basic select, first innodb queries the buffer pool, then the tablespace
- for a basic update, first buffer pool is updated, then the log file is updated, and the update to the tablespace is delayed
- the optimization here for updates is based on the theory that sequential writes to log files are cheap
- logs are only used during recovery; to figure out which pages need to be evicted there is a flush list and an LRU, both of which are in memory
- background process for syncing dirty pages is normally referred to as a "checkpoint"
- innodb log file size can be increased to cause innodb to flush to disk less frequently (and store more data in the logs)
- on a commit, only log files need to be synced, nothing else
- Fast Index Creation (inherited from the plugin)
- adding an index just adds the index, doesn't rebuild the table
- same with dropping index
- does not work for utf-8 (bug #???) or primary key
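a quick sketch of what fast index creation enables (table and index names here are made up):

```sql
-- builds only the new secondary index; the table itself is not rebuilt
ALTER TABLE orders ADD INDEX idx_customer (customer_id);
-- dropping a secondary index is similarly cheap
ALTER TABLE orders DROP INDEX idx_customer;
```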
- I/O scalability (via plugin)
- --innodb_io_capacity tells mysql how many io operations per second your disk is capable of; 100 iops is typical for a 7200rpm disk
- --innodb_read_io_threads and --innodb_write_io_threads tell innodb how many threads to use for reading and writing
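a hedged config sketch (the numbers are illustrative, not recommendations; innodb_io_capacity should be dynamic in the plugin, the thread counts are startup-only):

```sql
-- size to your disks' real iops, e.g. for a small RAID array
SET GLOBAL innodb_io_capacity = 500;
-- startup-only, so these belong in my.cnf:
--   innodb_read_io_threads  = 8
--   innodb_write_io_threads = 8
SHOW GLOBAL VARIABLES LIKE 'innodb_%io%';
```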
- CPU scalability (via plugin)
- mostly consists of 3-4 patches that transparently make innodb work better with multi-core CPUs
- lots of x-locks converted to read/write-locks
- lots of locks broken into more fine-grained locks
- Adaptive Flushing (invented by percona server)
- teaches mysql to do background work more aggressively when log space is about to run out, so that TPM (transactions per minute) doesn't suddenly drop right when a log is rotated
- see also: adaptive checkpointing, which is similar but different
- Fast Crash Recovery (invented by percona)
- see mysql bug #29847, related to innodb recovery being slow (e.g. a larger buffer pool can increase innodb recovery time)
- in many cases 10x faster than builtin
- Improved Buffer Pool Scalability
- buffer pool mutex becomes four mutexes: flush list, lru, free, and hash mutexes
- Data Dictionary control
- once an innodb table is open it's never freed from the in-memory data dictionary
- adds --innodb_dict_size_limit
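a sketch of capping the dictionary cache (percona server only; the value is made up, and if the variable isn't dynamic in your build it goes in my.cnf instead):

```sql
-- evict least-recently-used dictionary entries once ~64 MB is exceeded
SET GLOBAL innodb_dict_size_limit = 67108864;
```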
- Undo Slots
- by default only 1024 transactions can be open at once, some statements require 2 undo slots
- in xtradb it's expanded to up to 4072 with --innodb_extra_undoslots=1
- Rollback Segments
- makes it possible to have more than one rollback segment, and also reduces mutex contention on the rollback segment
- this is also fixed upstream in mysql 5.5
- Fast Checksums
- binary format incompatible
- Different Page Sizes
- instead of 16k, you can have 4k or 8k page sizes (probably useful only for older systems)
- Separate Purge Thread
- cleans up long history list length faster
- speaker thinks that the best features are usability improvements, not performance enhancements
- ability to show buffer pool contents
- ability to save the contents of the buffer pool (actually saves page offsets in the innodb files instead of data contents, so 8 bytes instead of 16 KiB per page); can help with restarting mysql faster; stores to a file called 'ib_lru_dump'; it's possible to dump the lru from one server and load it on another server, but be super careful
- transactional replication (also available in mysql 5.5); slaves store their position in relay-log.info, which isn't fsynced; somehow makes recovery from slave crashes more reliable
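as I recall the dump/restore is driven through an information_schema table in percona server; a sketch, so verify the exact incantation against the percona docs:

```sql
-- write the page ids of the current LRU to ib_lru_dump in the datadir
SELECT * FROM information_schema.XTRADB_ADMIN_COMMAND /*!XTRA_LRU_DUMP*/;
-- after a restart, warm the buffer pool back up from that file
SELECT * FROM information_schema.XTRADB_ADMIN_COMMAND /*!XTRA_LRU_RESTORE*/;
```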
- import/export tables, makes it easy to just restore one table at a time from the ibd files; import/export still has to be done with xtrabackup, however
- better handling of corrupt tables, a corrupt table (i.e. totally bogus data) won't crash mysql, instead the table is just marked as corrupt and the server continues starting up
- the slow query log for percona has hella more stuff if you increase verbosity; mk-query-digest also understands the new percona columns
- user statistics, new tables that show up in the information_schema database. for instance, allows you to find indexes that are unused
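a sketch of the unused-index query (assumes user statistics collection is enabled; table and column names should be double-checked against the percona docs):

```sql
-- indexes that never appear in INDEX_STATISTICS, i.e. no rows were
-- ever read through them since stats collection started
SELECT s.table_schema, s.table_name, s.index_name
FROM information_schema.statistics s
LEFT JOIN information_schema.index_statistics i
       ON s.table_schema = i.table_schema
      AND s.table_name   = i.table_name
      AND s.index_name   = i.index_name
WHERE i.index_name IS NULL
  AND s.table_schema NOT IN ('mysql', 'information_schema');
```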
- xtrabackup can report on fragmentation of indexes (runs very slowly, so you should do it offline); if an index is super fragmented, you can drop and re-add the index
- main areas of improvement: sql layer, innodb, windows, performance schema
- LOCK_open fixed; it was used in a lot of places where there are global locks
- MySQL 5.5 introduces the MDL (Meta Data Locking) subsystem; it optimizes for both the number of operations and reduced contention
- smaller fixes:
- LOCK_alarm mutex replaced by posix-specific apis where available
- LOCK_thread_count mutex usage reduced from 1-2 times per query to just connect/disconnect
- THR_LOCK_charset mutex, a hotspot for setting up character sets, completely removed
- improved log sys mutex; had been used for both log records and lsn, and also something about mtr
- this has been changed to 2 separate mutexes, but contention can still be observed for some heavy log write workloads
- separate flush_list mutex, so that the flush list mutex is different now from the buffer pool mutex; the buffer pool mutex is very hot
- ability to create multiple buffer pools using innodb_buffer_pool_instances; however, there's no ability for users to specify how tables/indexes map to buffer pools
- some internal operations need to query all buffer pools, so for certain things internal overhead can be increased
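a my.cnf sketch (both settings are startup-only; sizes are illustrative):

```sql
-- in my.cnf:
--   innodb_buffer_pool_size      = 16G
--   innodb_buffer_pool_instances = 8   -- roughly 2G per instance
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool%';
```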
- multiple rollback segments
- now there are 128 segments, giving up to 128k transactions
- a slow shutdown (innodb_fast_shutdown=0) is needed for this change to take effect
- asynchronous i/o on linux
- for many years only windows was using native async i/o
- on linux it was simulated using helper threads
- can make crash recovery up to 1.5x faster
- assumes usage of O_DIRECT (which is good to use anyway)
- change buffering
- insert buffer has existed for a long time, speeds up inserts for i/o bound workloads
- however, updates and deletes were unable to use this feature
- mysql 5.5 supports "delete mark" and "purge" change buffering
- it handles update case as well as delete because secondary indexes are never updated in place
- up to 40x performance for large deletes
- separate purge thread
- before mysql 5.5 purging of old records is done by the main thread
- use innodb_purge_threads=1 to use a separate purge thread
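startup-only, so it goes in my.cnf; a quick way to check it took effect:

```sql
-- in my.cnf: innodb_purge_threads = 1
SHOW GLOBAL VARIABLES LIKE 'innodb_purge_threads';
```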
- adaptive flushing
- not a new feature in mysql 5.5, but was updated to deal with multiple buffer pools, and got lots of small tunings
- mysql 5.5 should be a lot "smoother" on many workloads
- will probably continue to improve as mysql 5.5 matures, as the mysql team is still looking at it
- performance schema
- includes many innodb probes
- can be used by developers to better understand sources of performance problems
- is mysql 5.5 faster? peter says no, it's not faster across the board, but there are no big specific regressions for now
- remaining issues:
- per index lock; use partitioning as partial workaround
- kernel mutex; seems to be a big focus for innodb in mysql 5.6
- dropping tablespace with innodb file per table
- log mutex
- group commit
- peter notes that probably mysql 5.5 is generally better than percona server 5.1
- innodb plugin is default in 5.5
- improved cpu scalability, fast index creation, buffer pool tablescan resistance, fast crash recovery, multiple buffer pools
- 5.5 plugin is innodb 1.1.x, 5.1 plugin is 1.0.x
- communicates with the server via the row-based storage engine interface
- foreign keys handled by the engine backend
- features: crash safe, mvcc, full index reading, change buffering
- new plugin supports compression and full hot backup support
- commands: SHOW ENGINE INNODB STATUS and SHOW ENGINE INNODB MUTEX
- make sure that the averages are calculated over at least 20-30 seconds; you have no control over which interval you get
- the rollback victim is determined based on which transaction has the least number of "rollback" operations to do
- whenever you see two numbers together, they could really be representing one 64-bit integer (all ints are 32-bit in the output)
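for example, combining a split 64-bit value from the status output (the two halves here are made up):

```sql
SHOW ENGINE INNODB STATUS\G
-- if the output prints a pair like "4 1089940951", the real value is
-- high * 2^32 + low:
SELECT 4 * POW(2, 32) + 1089940951 AS combined_value;
```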
- note to self: read more about innodb concurrency tickets
- http://everythingisdata.wordpress.com/2009/10/17/numbers-everyone-should-know/
- make sure you have noatime enabled, so the redo logs aren't constantly having their atime updated
- data in tablespace is organized at page level, not row level
- communication always happens with the buffer pool before it hits the tablespace
- on a write, the buffer pool page is marked dirty, the log file is updated, and the log file is fsynced
- so: pages in the tablespace are written in a delayed manner
- logs are only read from when doing a recovery
- contrast to oracle which uses log files for reading and doing fancy things like replication
- log contains two kinds of data: the flush list, and the LRU
- all log activities are assigned an LSN (log sequence number), which is monotonically increasing
- LSN value is stored in tablespace after each checkpoint
- oracle calls their logs "redo logs", but it means the same thing
- mysql does fuzzy checkpointing, because after a checkpoint there may still be dirty pages in the buffer pool
- outside of the buffer pool, changing innodb_log_file_size is one of the most important settings; bigger logs smooth out background I/O over a longer period
- by default up to 90% of the data in the buffer pool can be dirty
- if log files are larger than the buffer pool, you're going to be fucked when doing a recovery
- you can also change innodb_flush_log_at_trx_commit to 0 or 2 to reduce durability (good for slaves!)
- innodb_log_buffer_size is useful for writing big TEXT/BLOB changes
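a my.cnf sketch of the log settings above (sizes are illustrative; both are startup-only, and resizing the log files needs a clean shutdown plus removing the old ib_logfile* files):

```sql
-- in my.cnf:
--   innodb_log_file_size   = 512M   -- smooths out background flushing
--   innodb_log_buffer_size = 16M    -- bigger if you write large blobs
SHOW GLOBAL VARIABLES LIKE 'innodb_log%';
```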
- look for innodb_log_waits going up over time
- tuning the logs is a tradeoff between performance and recovery time; what you pick must be contingent on your requirements for recovery time
- these numbers are shown in show innodb status:
- log sequence number
- log flushed up to
- last checkpoint at
- you can track the amount of log you're generating (e.g. per hour) by tracking the delta of the LSN
- status variables in percona server also show this
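one hedged way to track log volume without parsing the status output: sample the bytes-written counter twice and diff:

```sql
-- bytes written to the redo log since server start; the hourly delta
-- tells you how big the log files should be
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';
```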
- rollback segments are used to get consistent read view, and also used to rollback uncommitted changes during recovery
- innodb_read_io_threads changes how many threads the buffer pool can use to read data from the tablespace
- the innodb_max_dirty_pages_pct variable changes how much of the buffer pool can be marked dirty
- innodb_buffer_pool_wait_free in "show global status" means innodb is spending time waiting for free pages in the buffer pool
- when you write to a file you are writing to a buffer; the buffer exists in your process; flushing a buffer moves it to the os cache; anything flushed will survive the process crashing; data synced will survive a power outage
- by default, innodb_flush_log_at_trx_commit=1; this is really expensive, which is why it's important to have a battery-backed write cache or an ssd. this is the only mode in which innodb is fully ACID compliant
- with innodb_flush_log_at_trx_commit=2, on every commit innodb writes the log buffer to the log file (into the os cache); once each second innodb syncs the log to disk
- if you run with this on the master, after a crash you could have slaves that received rows which were actually rolled back on the master; manifests itself as a "duplicate key error" on all slaves
- innodb_flush_log_at_trx_commit=0 is bad times
- some synchronization thing called XA: http://dev.mysql.com/doc/refman/5.1/en/xa.html
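a sketch of switching the durability mode at runtime (the variable is dynamic):

```sql
-- fully durable (ACID) default:
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
-- on a slave, trading up to a second of durability for speed is often fine:
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```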
- everything is stored in innodb table space
- by default, the tablespace is one contiguous logical file called ibdata1; you can instruct it to be multiple physical files, but only the last one can be autoextend
- every page is filled 15/16 full in innodb, with no way to configure it (called the "fill factor" in oracle)
- ext3 bad for mysql, xfs better, due to fsync behavior
- using raw partitions is "here be dragons" and makes backups really fucking hard
- you can configure --innodb-file-per-table to have one physical file per table, but some things are still stored in the main tablespace
- file-per-table is usually better since you can reclaim space whenever you TRUNCATE/DROP a table, and therefore reduce backup size
- some of the new options from mysql 5.5, like compression and off-page storage of blobs, only work with file-per-table
- cons:
- uses more file descriptors
- need to tune something
- may be more fragmentation
- complicated internal synchronization when starting up, so starting mysql can take a long time (but this would probably only be an issue if you had like 50k or 100k tables)
- but pro: works better with ext3 and O_DIRECT
- as well as table data, table space contains:
- tablespace header
- innodb-specific data dictionary
- undo information
- the doublewrite buffer
- the insert buffer
- you can turn on/off file-per-table, but existing tables will remain how they were (in ibdata1 or in separate file) until you do alter table or optimize table
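a sketch of flipping it on and migrating one existing table (dynamic with the plugin; table name is made up):

```sql
SET GLOBAL innodb_file_per_table = 1;  -- affects newly (re)created tables
-- rebuild an existing table so it moves into its own .ibd file:
ALTER TABLE mydb.orders ENGINE=InnoDB;
```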
- most internal allocations are done using extents
- innodb_autoextend_increment affects how much the tablespace is expanded at a time
- everything fits into pages, which are 16 KiB
- a data page has a header, rows, and a trailer
- rows are normally in variable width format, there is a fixed width format but it is very rarely used
- the header says what kind of page it is (data or index), and has a checksum and the LSN (the last value of the LSN as flushed to disk)
- some things in the trailer: part of the lsn, checksum
- the checksum is always checked when reading a page into innodb; if the checksum is incorrect then innodb will crash (unless using percona server)
- checksums are updated when writing pages back to disk
- xtradb has fast checksum option for super fast disks (e.g. fusion i/o), but changes xtradb compatibility
- data row format is like:
- transaction id (for when row was created)
- roll pointers
- field pointers
- fields
- most important things: the transaction id, and storage of a pointer to the older version of the row in the UNDO space
- exact format is different across versions -- in 5.0 there is a COMPACT row format where null values are stored more efficiently
- all fields are stored inline except BLOBs, which are a special case
- rows cannot be split across pages
- innodb sets a restriction that two rows must fit in one page
- so a single row cannot be more than 8 KiB
- this can cause VARCHAR to move to off-page locations
- a 20 byte pointer to an overflow page is used for VARCHAR/TEXT/BLOB in the case where the blob does not fit on the same page as the row
- for lots of internal buffers (including the buffer pool), the space allocated for a varchar or whatever is based on the maximum possible size of that field
- an 800 KB row can take up to 1.5 MB on disk for various technical reasons
- if the whole row fits in ~8000 bytes then the blob is always stored on the page, otherwise just the first 768 bytes (antelope) or a 20 byte pointer (barracuda)
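a made-up schema illustrating the limit: three ~3000-byte columns can push a row past ~8000 bytes, so columns spill off-page:

```sql
CREATE TABLE big_rows (
  id INT PRIMARY KEY,
  a  VARCHAR(3000),
  b  VARCHAR(3000),
  c  VARCHAR(3000)
) ENGINE=InnoDB;
-- antelope keeps a 768-byte prefix in-page per overflowing column;
-- barracuda (plugin + file-per-table, ROW_FORMAT=DYNAMIC) keeps only
-- the 20-byte pointer
```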
- what does log file look like? records are 512 byte aligned, does not store a complete change, just the parts that changed, i.e. Space ID + Page ID + Offset + Payload
- undo information is stored in the rollback segment
- xtradb supports multiple, innodb just supports one
- the rollback segment is always in the global tablespace
- if you get an error message about too many open transactions then you ran out of undo slots; you should increase the number of rollback segments
- besides buffer pool, in memory:
- meta data for accessed tables (never released, added whenever any table is opened), a.k.a. the data dictionary
- lock information
- adaptive hash index
- sql statements
- thread stacks
- innodb_use_sys_malloc (default 1) disables the additional memory pool
- regardless of built-in or plugin, leaving innodb_additional_mem_pool_size at the default setting is usually fine
- best thing evar: --innodb-buffer-pool-size
- you may be surprised at how much space insert buffer and undo buffers take up!
- warning: leave space for metadata, which always consumes an additional 5-10% on top of the buffer pool
- some things like binary logs and relay logs depend on OS caching, so don't use all memory
- you can change innodb_max_dirty_pages_pct
- with innodb plugin you can set io capacity to also affect how quickly dirty pages are flushed
- show innodb status shows how many "modified db pages"
- for blobs, maybe tweak innodb_log_buffer_size to be as big as the largest blob you ever expect to write
- if innodb_log_waits in GLOBAL STATUS is high then the log buffer isn't big enough
- locks are held for the duration of a transaction
- for updates/deletes, more rows than you may expect may end up being locked
- "index level locking" more apt name than "row level locking"
- mysql creates next-key locks to prevent inserts from happening on slaves when statements are committed out of order
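a sketch of the surprise (table is made up): without a usable index, an update locks every row it examines, not just the rows it changes:

```sql
-- with no index on `state`, innodb scans the whole table and holds
-- next-key locks on everything it read until COMMIT/ROLLBACK
BEGIN;
UPDATE accounts SET flagged = 1 WHERE state = 'NY';
COMMIT;
```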
- two common lock types: shared and exclusive
- an S lock never occurs for reads because of MVCC; S locks are normally only visible with foreign key constraints
- in innodb everything is an index
- data is stored in a clustered index organized by the primary key; in the absence of a primary key, the first unique not null key is selected
- other indexes are stored as secondary indexes
- clustered index means that data is stored alongside index, in myisam the index just stores a pointer to the actual data
- in innodb primary key lookup is faster than myisam, but secondary index is slower: secondary index points to primary key, so need to do two index lookups
- all access to the data is always done by primary key
- once mysql uses a RANGE, it cannot use anything else in the index (e.g. an index on (date, customer_id): usually date is more selective, but if you ever want to do a query by customer_id and a date range then you're screwed)
- just says don't use UUIDs for primary key on innodb
- never piggy-back the primary key onto the end of a composite index or covering index - it is already included for free; mk-duplicate-key-checker will find this
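a sketch of the column-order point (schema is made up): put the equality column first so the range column can still be used:

```sql
-- good: index (customer_id, order_date) serves this query fully;
-- bad:  index (order_date, customer_id) stops being useful at the range
SELECT *
FROM orders
WHERE customer_id = 42
  AND order_date BETWEEN '2010-04-01' AND '2010-04-30';
```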
- innodb has an adaptive hash
- if you have a workload that has "hotspots" then innodb can internally build a hash index for values that you access frequently. this hash index exists only in memory, and there's no control over it from the end-user standpoint.
- the hash index does not cover the whole index (out of time)
- for the insert buffer: when you make multiple updates to a single page, they can be buffered in the insert buffer, and then the data on disk is merged with the in-memory insert buffer
- mysql 5.5 also has this for deletes
- may want to disable for ssd
- can measure efficiency for innodb 5.1 and later with show innodb status
- double write buffer helps to protect against partially written/corrupt pages
- always 2MB, two sets of 64 pages
- you can turn the double write buffer off if you have a transactional filesystem like ZFS
- adaptive flushing is in mysql 5.5
- by default 200 iops (assumes raid 10 and 4 disks), change with innodb_io_capacity
- this is a more generic solution, not a mysql-specific solution; and there are other ways of doing this
- speaker (yves trudeau) is a former mysql and sun consultant, has a phd, english is not his first language
- the plan
- the challenges
- the solution
- building the solution
- other solutions
- challenges:
- cost! people don't want to have a big standby instance since it costs money; different from traditional HA solutions where people have two big servers, one standby and one active;
- no static ips on the private network, and you have no control over which ip it receives
- no vip / no {broad,multi}cast
- the solution:
- there's a small monitoring instance, typically the smallest instance type available
- there's also a larger instance
- farm of web or application servers connected to the private ip of the large instance
- the large instance stores the mysql data on an ebs volume that is used as permanent storage
- if there is a failure, the small instance will detect the failure and will initiate the failover procedure
- the small instance will create a new large instance, the new large instance will mount the EBS volume, and recovery will initiate
- web/app servers still don't know where to connect
- the small instance will connect to each web/app server and update their configuration to point to the new large instance
- technology pieces:
- cluster manager
- pacemaker; others could work too: http://www.linux-ha.org/wiki/Pacemaker
- unicast communication
- lack of broadcast/multicast
- setup of the cluster partner ip for a new host
- reconfiguring service customers
- Pacemaker
- allow port 694 UDP within the security group
- heartbeat or corosync for communication, unicast based
- scripts-type resource on the monitoring host (type anything); on the remote end it's a no-op script
- localization weight favoring the large instance
- contents of /etc/ha.cf (left side of screen is cut off): autojoin any; ucast eth0; warntime 5; deadtime 15; initdead 60; keepalive 2; crm respawn
- be careful to set these values high, since amazon's network can be laggy, which can cause spurious heartbeat failures
- custom ec2 images that are pre-configured with pacemaker/heartbeat and mysql
- aws tools; parsing can be difficult --> helper scripts
- failover steps:
- detect failure
- kill the faulty instance
- free the shared resource
- launch the new instance
- reassign shared resource
- reconfigure heartbeat
- spread connection info
- ec2-run-instance has an interesting feature: a user-data script which is run by root on startup; used to seed the new instance with the cluster partner ip
- don't use pacemaker to move resources because of some weird synchronization thing
- be careful about timing
- series of blog posts at http://www.mysqlperformanceblog.com/2010/06/17/high-availability-for-mysql-on-amazon-ec2-part-1-intro/
- other solutions:
- RDS, pros/cons, only mysql
- continuent
- rightscale (eip based)
- scalarium
- scalebase (beta)
- others?