- 2010(2009?) InnoDB 1.1 announced, will first ship with MySQL 5.5.
- Historically InnoDB development lags while MySQL is trying to GA
- lots of things fixed in InnoDB since MySQL 5.0
- important note -- plugin version of InnoDB is not the default in 5.1
- percona server uses innodb plugin by default
- xtradb is a fork of the innodb plugin
- the on-disk format of xtradb does not change by default, you can switch back and forth; some extra features require changing the on-disk format, however
- percona xtradb releases closely tied to upstream mysql releases
- release model is short release cycle, rebased against new mysql releases
- historically there is an average of a new release every 1-2 months
- in general, operating system has less opportunity to merge/reorder innodb writes, since InnoDB keeps durability. InnoDB itself must change it's algorithms to produce more sequential IO.
- innodb uses fsync, myisam does not
- for a basic select, first innodb queries the buffer pool, then the tablespace
- for a basic update, first buffer pool is updated, then the log file is writen to, and the update to the tablespace is delayed as part of a background thread.
- optimization here for updates is based around the theory that sequential updates to log files is cheap
- logs are only used during recovery, to figure out which pages need to be evicted there is a flush list and LRU, both of which are in memory
- background process for syncing dirty pages is normally referred to as a "checkpoint"
- innodb buffer size can be increased to cause innodb to flush to disk less frequently (and store more data in the logs). Log file size must also be larger enough that it can hold all the pending changes.
- on a commit, only log files need to be synced, nothing else
- Fast Index Creation (inherited from the plugin)
- adding an index just adds the index, doesn't rebuild the table
- same with dropping index
- does not work for utf-8 (bug #33650) or primary key
- I/O scalability (via plugin)
--innodb_io_capacity
, tell mysql how many io operations per second your disk is capable of; 100 iops is typical for a 7200rpm disk--innodb_read_io_threads
and--innodb_write_io_threads
-- tell innodb how many threads to use for reading and writing
- CPU scalability (via plugin)
- mostly consists of 3-4 patches that transparently make innodb work better with multi-core CPUs
- lots of x-locks converted to read/write-locks
- lots of locks broken into more fine-grained locks
- Adapting Flushing (invented by percona server - also in InnoDB plugin)
- teaches mysql to do background work more aggressively when log space is about to run out, so that TPM (transactions per minute) doesn't suddenly drop right when a log runs out of free space.
- see also: adaptive checkpointing, which is similar but different
- Fast Crash Recovery (invented by percona; current implimentation independently written by InnoDB team)
- see mysql bug #29847, related to innodb recovery being slow (e.g. a larger buffer pool can increase innodb recovery time)
- in many cases 10x faster than builtin
- Improved Buffer Pool Scalability
- buffer pool mutex becomes four mutexes: flush list, lru, free, and hash mutexes
- Data Dictionary control
- once an innodb table is open it's never freed from the in-memory data dictionary
- adds --innodb_dict_size_limit
- Undo Slots
- by default only 1024 transactions can be open at once, some statements require 2 undo slots
- in xtradb it's expanded to up to 4072 with --innodb_extra_undoslots=1
- Rollback Segments
- makes it possible to have more htan one rollback segment, and also reduces mutex convention on the rollback segment
- this is also fixed upstream in mysql 5.5
- Fast Checksums
- binary format incompatible
- Different Page Sizes
- instead of 16k, you can have 4k or 8k page sizes (probably useful only for older systems)
- Separate Purge Thread
- cleans up long history list length faster
-
speaker thinks that the best features are usability improvements, not performance enhancements
-
ability to show buffer pool contents
-
ability to save contents of the buffer pool (actually saves offsets in the innodb files instead of data contents, so 8 bytes not 16k bytes for each thing); can help with restarting mysql faster; stores to file called 'ib_lru_dump'; it's possible to dump the lru from one server and load it on another server, but be super careful
-
transactional replication (also available in mysql 5.5). [by default] slaves store position in relay-log.info, which isn't fsynced. Transactional replication stores these coordinates in an InnoDB table to make slave crashe proof.
-
import/export tables, makes it easy to just restore one table at a time from the ibd files; import/export still has to be done with xtrabackup, however. Destination must be percona server, source can be MySQL/Percona.
-
better handling of corrupt tables, a corrupt table (i.e. totally bogus data) won't crash mysql, instead the table is just marked as corrupt and the server continues starting up
-
the slow query log for percona has hella more stuff if you increase verbosity; mk-query-digest also understands the new percona columns
-
user statistics, new tables that show up in the information_schema database. for instance, allows you to find indexes that are unused
-
xtrabackup can report on fragmentation of indexes (runs very slowly, you should do offline); if an index is super fragmented, you can drop and re-add the index
- main areas of improvement: sl level, innodb, windows, performance schema
LOCK_open
fixed, used in a lot of places where there are global locks- MySQL 5.5 introduces MDL (Meta Data Locking) subsystem; optimizes for both number of operations, and reduces contention
- smaller fixes
LOCK_alarm
mutex, replaced by posix-specific apis where availableLOCK_thread_count
mutex reduced usage to connect/disconnect from 1-2 times per queryTHR_LOCK_charset
mutex, hotspot for setting up character sets, completely removed
- improved log sys mutex; had been used for both log records and lsn, and also something about mtr
- this has been changed to 2 separate mutexes, but contention can still be observed for some heavy log write workloads
- separate flush_list mutex, so that the flush list mutex is different now from the buffer pool mutex; the buffer pool mutex is very hot
- ability to create multiple buffer pools, using
innodb_buffer_pool_instances
; however, no ability for users to specify how to map table/index constructs to buffer pools - some internal operations need to query all buffer pools, so for certain things internal overhead can be increased
- multiple rollback segments
- now there are 128 segmetns, giving up to 128k transactions
innodb_fast_shutdown=0
needed for this change
- asynchronous i/o on linux
- for many years only windows was using native async i/o
- on linux it was simulated using helper threads
- can make crash recovery up to 1.5x faster
- assumes usage of O_DIRECT (which is good to use anyway)
- change buffering
- insert buffer has existed for a long time, speeds up inserts for i/o bound workloads
- however, updates and deletes were unable to use this feature
- mysql 5.5 supports "delete mark" and "purge" change buffering
- it handles update case as well as delete because secondary indexes are never updated in place
- up to 40x performance for large deletes
- separate purge thread
- before mysql 5.5 purging of old records is done by the main thread
- use
innodb_purge_threads=1
to use separate purge thread
- adaptive flushing
- not a new feature in mysql 5.5, but was updated to deal with multiple bufer pools, and got lots of small tunings
- mysql 5.5 should be a lot "smoother" on many workloads
- will probably continue to improve as mysql 5.5 matures, as the mysql team is still looking at it
- performance schema
- includes many innodb probes
- can be used by developers to better understand sources of performance problems
-
is mysql 5.5 faster? peter says no it's not across the board, but there are no big specific regressions for now
-
remaining issues:
- per index lock; use partitioning as partial workaround
- kernel mutex; seems to be a big focus for innodb in mysql 5.6
- dropping tablespace with innodb file per table
- log mutex
- group commit
- peter notes that probably mysql 5.5 is generally better than percona server 5.1
-
innodb plugin is default in 5.5
-
improved cpu scalability, fast index creation, buffer pool tablescan resistance, fast crash recovery, multiple buffer pools
-
5.5 plugin is innodb 1.1.x, 5.1 plugin is 1.0.x
-
communicates via the row-based relationship
-
foreign keys handled by engine backend
-
features: crash save, mvcc, full index reading, change buffering
-
new plugin supports compression and full hot backup support
-
commands:
SHOW ENGINE INNODB STATUS
SHOW ENGINE INNODB MUTEX
- make sure that the averages calculated time is at least 20-30 seconds, you have no control over what value you get
- rollback is determined based on which transaction has the least number of "rollback" operations to do
- whenever you see two numbers together, they could be really representing one 64-bit integer (all ints are 32 bit in output)
- note to self: read more about innodb concurrency tickets
- http://everythingisdata.wordpress.com/2009/10/17/numbers-everyone-should-know/
- make sure you have noatime enabled, so the redo logs aren't constantly having their atime update
- data in tablespace is organized at page level, not row level
- communication always happens with the buffer pool before it hits the tablespace
- on a wite, buffer pool is marked dirty, the log file is updated, and the log file is fsynced
- so: pages in the tablespace are written in a delayed manner
- logs are only read from when doing a recovery
- contrast to oracle which uses log files for reading and doing fancy things like replication
- log contains two kinds of data: the flush list, and the LRU
- all log activities are assign a LSN (log sequence number), which is monotonically increasing
- LSN value is stored in tablespace after each checkpoint
- oracle calls their logs "redo logs", but it meanst the same thing
- mysql does fuzzy checkpointing, because after a checkpoint there may still be dirty pages in the buffer file
- outside of the buffer pool, changing
innodb_log_file_size
is one of the most important settings, to smooth out background I/O for longer - by default up to 90% of the data in the buffer pool can be dirty
- if log files are larger than the buffer pool, you're going to be fucked when doing a recovery
- you can also change
innodb_flush_log_at_trx_commit
to 0 or 2 to reduce durability (good for slaves!) innodb_log_buffer_size
is useful for writing bigTEXT
/BLOB
changes
- look for innodb log weights going up over time
- tuning log buffer time is a tradeoff between performance and recovery time; tuning this must be contingent on your requirements for recovery time
- these numbers are show in show innodb status:
- log sequence number
- log flushed up to
- last checkpoint at
- you can track the amount of log you're generating (e.g. per hour) by tracking the delta of the LSN
- status variables in percona server also show this
- rollback segments are used to get consistent read view, and also used to rollback uncommitted changes during recovery
innodb_read_io_threads
changes how many threads the buffer pool can use to read data from tablespaceinnodb_max_dirty_page_pct
variable to change how much of the buffer pool can be marked dirtyinnodb_buffer_pool_wait_free
in "show global status" means innodb is spending time waiting for free pages in the buffer pool- when you write to a file you are writing to a buffer; the buffer exists in your pocess; flushing a buffer moes it to the os cache; anything flushed will survive the process crashing; data synced will survive power outage
- by default,
innodb_flush_log_at_trx_commit=1
. this is really expensive, which is why it's important to have a battery backed write cache or ssd. This is the only mode in which innodb is fully ACID compliant - if we have
innodb_flush_log_at_trx_commit=2
then on every commit (or once each second), innodb will flush to the innodb log buffer; every second innodb will sync the log buffer to disk.
- if you run on the master you could have slaves receiving rows that were actually rolled back on the master; manifests itself as a "duplicate key error" on all slaves
innodb_flush_log_at_trx_commit=0
is bad times- some synchronization thing called XA: http://dev.mysql.com/doc/refman/5.1/en/xa.html
- everything is stored in innodb table space
- by default, tablespace is one contiguous logical file called
ibdata1
; you can instruct it to be multiple physical files, but only the last one can be autoextend - every page is filled 15/16 for innodb, no way to configure it (called the "fill factor" in oracle)
- ext3 bad for mysql, xfs better, due to fsync behavior
- using raw partitions is "here be dragons" and makes backups really fucking hard
- you can configure
--inodb-file-per-table
to have one physical file per table, but some things are still stored in the main in the main thing - per table is usually better since you can reclaim space whenever you TRUNCATE/DROP a table, and therefore reduce back size
- some of the new options like compression and off-page storage of logs from mysql 5.5 only work with file per table
- cons:
- uses more file descriptors
- need to tune something
- may frbe more gragmentation
- complicated internal synchronization when starting up, so starting up mysql can take a long time (but would probably on be an issue if you had like 50k or 100k tables)
- but pro: works better with ext3 and O_DIRECT
- as well as table data, table space contains:
- tablespace header
- innodb-specific data dictionary
- undo information
- the doublewrite buffer
- the insert buffer
- you can turn on/off file-per-table, but existing tables will remain how they were (in ibdata1 or in separate file) until you do alter table or optimize table
- most internal allocations are done using extents
innodb_autoextend_increment
affects how big to expand the tablespace at a time- everything fits into pages, which are 16 Kib
- a data page has a header, rows, and a trailer
- rows are normally in variable width format, there is a fixed width format but it is very rarely used
- header sys what kind (data or index), has a checksum, the LSN, the last value of the LSN as flushed to disk
- some things in the trailer: part of the lsn, checksum
- checksum is always checked when reading page to innodb, if checksum is incorrect then innodb will crash (unless using percona server)
- checksums are updated when writing pages back to disk
- xtradb has fast checksum option for super fast disks (e.g. fusion i/o), but changes xtradb compatibility
- data row format is like:
- transaction id (for when row was created)
- roll pointers
- field pointers
- fields
- most important things: transactoin id, storage of pointer to older version of
row in the
UNDO
space - exact format is different -- in 5.0 there is a
COMPACT
row format where null values are stored more efficiently - all field are stored inline except Blobs, which are a special case
- rows cannot be split across pages
- innodb sets a restriction that two rows must fit in one page
- so a single row cannot be more than 8 KiB
- this can cause
VARCHAR
to move to off-page locations
- 20 byte pointer to overflow page for
VARCHAR
/TEXT
/BLOB
for case where blob does not fit on the same page as the row - for lots of internal buffers (including buffer pool), space allocated for varchar or whatever is based on the maximum possible size of that field
- 800 Kb row can take up to 1.5M on disk for various technical reasons
- if the whole row firts in ~8000 bytes then the blob is always stored on the page, otherwise just first 768 bytes (antelope) or 20 byte pointer (barracuda)
- what does log file look like? records are 512 byte aligned, does not store a complete change, just the parts that changed, i.e. Space ID + Page ID + Offset + Payload
- undo information is stored in the rollback segment
- xtradb supports multiple, innodb just supports one
- the rollback segment is always in the global tablesapce
- if you get error message about too many open transactions then you ran out of undo slots, you should increase number of rollback segments
- besides buffer pool, in memory:
- meta data for accessed tables (never released, happens whenever any table is copied in), a.k.a. data dictionary
- lock information
- adaptive hash index
- sql statements
- thread stacks
innodb_use_sys_malloc
(default 1) disavles the additional memory pool- regardless of built-in or plugin, leaving
innodb_additional_mem_pool_size
at the default setting is usually fine - best thing evar:
--innodb-buffer-pool-size
- you may be surprised at how much space insert buffer and undo buffers take up!
- warning: leave space for metadata, alwasy consumes an additional 5-10% from buffer pool
- some things like binary logs and relay logs depend on OS caching, so don't use all memory
- you can change
innodb_max_pct_dirty_pages
- with innodb plugin you can set io capacity to also affect how quickly dirty pages are flushed
- show innodb status shows how many "modified db pages"
- for blobs maybe tweak
innodeb_log_buffer_size
to be as big as the largest blob you ever expect to write - if
innodb_log_waits
inGLOBAL STATUS
is high then the log buffer isn't big enough - locks are held for the duration of a transaction
- for updates/deletes, more rows than you may expect may end up being locked
- "index level locking" more apt name than "row level locking"
- mysql creates next-key locks to prevent inserts from happening on slaves when statements are committed out of order
- two common lock types: shared and exclusive
- an S lock never ocfurs for reads because of MVCC; S locks are normally only visible with foreign key constraints
- in inndob everything is an index
- data is stored in a clustered index organized by the primary key; in the absence of a primary key, the first unique not null key is selected
- other indexes are stored as secondary indexes
- clustered index means that data is stored alongside index, in myisam the index just stores a pointer to the actual data
- in innodb primary key lookup is faster than myisam, but secondary index is slower: secondary index points to primary key, so need to do two index lookups
- all access to the data is always done by primary key
- once mysql uses a RANGE, it cannot use anything else in the index
(e.g. example of (
date
,customer_id
) where usually date is more selective, but if you ever want to do a query bycustomer_id
and adate
range then you're screwed) - just says don't use UUIDs for primary key on innodb
- never piggy-back the primary key index into the end of a composite index or covering index - it is already included for free
mk-duplicate-key-checker
will find this
- innodb has an adaptive hash
- if you have a workload that has "hotspots" then innodb can internally build a hash index for values that you access frequently. this hash index exists only in memory, and there's no control over it from the end-user standpoint.
- the hash index does not cover the whole index (out of time)
- for insert buffer, when you make multiple updates to a single page, they can be buffered into the insert buffer, and then data on disk is merged with the in-memroy insert buffers
- mysql 5.5 also has this for deletes
- may want to disable for ssd
- can measure efficiency for innodb 5.1 and later with show innodb status
- double write buffer helps to protect against partially written/corrupt pages
- always 2MB, two sets of 64 pages
- you can turn the double write buffer off if you have a transactional filesystem like ZFS
- adaptive flushing is in mysql 5.5
- by default 200 iops (assumes raid 10 and 4 disks), change with
innodb_io_capacity
- this is a more generic solution, not MySQL solution; and there are other ways of doing this
- speaker (yves trudeau) is a former mysql and sun consultant, has a phd, english is not his first language
- the plan
- the challenges
- the solution
- building the solution
- other solutions
- challenges:
- cost! people don't want to have a big standby instance since it costs money; different from traditional HA solutions where people have two big servers, one standby and one active;
- no static ips on the private network, and you have no control over which ip it receives
- no vip / no {broad,multi}cast
-
the solution
-
there's a small monitoring instance, typically the smallest instance type available
-
there's also a larger instance
-
farm of web or application servers connected to the private ip of the large instance
-
the large instance stores the mysql data on an ebs volume that is used as permanent storage
-
if there is a failure, the small instance will detect the failure and will initiate the failover procedure
-
small instance will create a new large instance, the large instance will mount the EBS volume, and recovery will initiate
-
web/app servers still don't know where to connect
-
small instance will connect to each web/app server and update their configuration to point to the new large instance
-
technology pieces
-
cluster manager
- pacemaker, other could work too http://www.linux-ha.org/wiki/Pacemaker
- unicast communication
- lack of broadcast/multicast
- setup cluster partner ip for new host
- reconvfiguring service customers
-
Pacemaker
-
allow port 694 UDP within security group
-
heartbeart or corosync for communication, unicast based
-
scripts type resource on the monitoring host (type anything), on remote end is a no-op script
-
localiztion weight favoring the large instance
-
contents of
/etc/ha.cf
(left side of screen is cut off) autojoin any ucast eth0 warntime 5 deadtime 15 initdead 60 keepalive 2 crm respawn -
be careful to set values high, since amazon's network can be laggy which can cause spurious heartbeat failures
-
custom ec2 images that are pre-configured with pacemaker/heartbeat and mysql
-
aws tools; parsing can be difficult --> helper scripts
-
failover steps:
- detect failure
- kill the faulty instance
- free the shared resource
- launch the new instance
- reassign shared resource
- reconfigure heartbeat
- spread connection info
-
ec2-run-instance
has an interesting feature for user data script which is run by root on startup; used to seed the new instance with cluster parnter ip -
don't use pacemaker to move resources because of some weird synchronization thing
-
be careful about timing
-
series of blog posts at http://www.mysqlperformanceblog.com/2010/06/17/high-availability-for-mysql-on-amazon-ec2-part-1-intro/
-
other solutions:
- RDS, pros/cons, only mysql
- continuent
- rightscale (eip based)
- scalarium
- scalebase (beta)
- others?