- 2010: InnoDB 1.1 announced; will first ship with MySQL 5.5.
- Historically InnoDB development lags while MySQL is trying to GA
- lots of things fixed in InnoDB since MySQL 5.0
- important note -- plugin version of InnoDB is not the default in 5.1
- percona server uses innodb plugin by default
- xtradb is a fork of the innodb plugin
- the on-disk format of xtradb does not change by default, you can switch back and forth; some extra features require changing the on-disk format, however
- percona xtradb releases closely tied to upstream mysql releases
- release model is short release cycle, rebased against new mysql releases
- historically there is an average of a new release every 1-2 months
- in general, innodb has fewer opportunities for reordering writes than an operating system, since it must be more reliable
- innodb uses fsync, myisam does not
- for a basic select, first innodb queries the buffer pool, then the tablespace
- for a basic update, first buffer pool is updated, then the log file is updated, and the update to the tablespace is delayed
- the optimization here for updates is based on the theory that sequential writes to log files are cheap
- logs are only used during recovery; to figure out which pages need to be evicted there is a flush list and an LRU, both of which are in memory
- background process for syncing dirty pages is normally referred to as a "checkpoint"
- innodb log file size can be increased to cause innodb to flush to disk less frequently (and store more data in the logs)
- on a commit, only log files need to be synced, nothing else
- Fast Index Creation (inherited from the plugin)
- adding an index just adds the index, doesn't rebuild the table
- same with dropping index
- does not work for utf-8 (bug #???) or primary key
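a quick sketch of what fast index creation enables (table and index names here are made up):

```sql
-- builds only the new secondary index; the table itself is not rebuilt
ALTER TABLE orders ADD INDEX idx_customer (customer_id);
-- dropping a secondary index is similarly cheap
ALTER TABLE orders DROP INDEX idx_customer;
```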
- I/O scalability (via plugin)
- --innodb_io_capacity tells mysql how many io operations per second your disk is capable of; 100 iops is typical for a 7200rpm disk
- --innodb_read_io_threads and --innodb_write_io_threads tell innodb how many threads to use for reading and writing
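a hedged config sketch (the numbers are illustrative, not recommendations; innodb_io_capacity should be dynamic in the plugin, the thread counts are startup-only):

```sql
-- size to your disks' real iops, e.g. for a small RAID array
SET GLOBAL innodb_io_capacity = 500;
-- startup-only, so these belong in my.cnf:
--   innodb_read_io_threads  = 8
--   innodb_write_io_threads = 8
SHOW GLOBAL VARIABLES LIKE 'innodb_%io%';
```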
- CPU scalability (via plugin)
- mostly consists of 3-4 patches that transparently make innodb work better with multi-core CPUs
- lots of x-locks converted to read/write-locks
- lots of locks broken into more fine-grained locks
- Adaptive Flushing (invented by percona server)
- teaches mysql to do background work more aggressively when log space is about to run out, so that TPM (transactions per minute) doesn't suddenly drop right when a log is rotated
- see also: adaptive checkpointing, which is similar but different
- Fast Crash Recovery (invented by percona)
- see mysql bug #29847, related to innodb recovery being slow (e.g. a larger buffer pool can increase innodb recovery time)
- in many cases 10x faster than builtin
- Improved Buffer Pool Scalability
- buffer pool mutex becomes four mutexes: flush list, lru, free, and hash mutexes
- Data Dictionary control
- once an innodb table is open it's never freed from the in-memory data dictionary
- adds --innodb_dict_size_limit
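a sketch of capping the dictionary cache (percona server only; the value is made up, and if the variable isn't dynamic in your build it goes in my.cnf instead):

```sql
-- evict least-recently-used dictionary entries once ~64 MB is exceeded
SET GLOBAL innodb_dict_size_limit = 67108864;
```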
- Undo Slots
- by default only 1024 transactions can be open at once, some statements require 2 undo slots
- in xtradb it's expanded to up to 4072 with --innodb_extra_undoslots=1
- Rollback Segments
- makes it possible to have more than one rollback segment, and also reduces mutex contention on the rollback segment
- this is also fixed upstream in mysql 5.5
- Fast Checksums
- binary format incompatible
- Different Page Sizes
- instead of 16k, you can have 4k or 8k page sizes (probably useful only for older systems)
- Separate Purge Thread
- cleans up long history list length faster
- speaker thinks that the best features are usability improvements, not performance enhancements
- ability to show buffer pool contents
- ability to save the contents of the buffer pool (actually saves page offsets in the innodb files instead of data contents, so 8 bytes instead of 16 KiB per page); can help with restarting mysql faster; stores to a file called 'ib_lru_dump'; it's possible to dump the lru from one server and load it on another server, but be super careful
- transactional replication (also available in mysql 5.5); slaves store their position in relay-log.info, which isn't fsynced; somehow makes recovery from slave crashes more reliable
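as I recall the dump/restore is driven through an information_schema table in percona server; a sketch, so verify the exact incantation against the percona docs:

```sql
-- write the page ids of the current LRU to ib_lru_dump in the datadir
SELECT * FROM information_schema.XTRADB_ADMIN_COMMAND /*!XTRA_LRU_DUMP*/;
-- after a restart, warm the buffer pool back up from that file
SELECT * FROM information_schema.XTRADB_ADMIN_COMMAND /*!XTRA_LRU_RESTORE*/;
```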
- import/export tables, makes it easy to just restore one table at a time from the ibd files; import/export still has to be done with xtrabackup, however
- better handling of corrupt tables, a corrupt table (i.e. totally bogus data) won't crash mysql, instead the table is just marked as corrupt and the server continues starting up
- the slow query log for percona has hella more stuff if you increase verbosity; mk-query-digest also understands the new percona columns
- user statistics, new tables that show up in the information_schema database. for instance, allows you to find indexes that are unused
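a sketch of the unused-index query (assumes user statistics collection is enabled; table and column names should be double-checked against the percona docs):

```sql
-- indexes that never appear in INDEX_STATISTICS, i.e. no rows were
-- ever read through them since stats collection started
SELECT s.table_schema, s.table_name, s.index_name
FROM information_schema.statistics s
LEFT JOIN information_schema.index_statistics i
       ON s.table_schema = i.table_schema
      AND s.table_name   = i.table_name
      AND s.index_name   = i.index_name
WHERE i.index_name IS NULL
  AND s.table_schema NOT IN ('mysql', 'information_schema');
```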
- xtrabackup can report on fragmentation of indexes (runs very slowly, so you should do it offline); if an index is super fragmented, you can drop and re-add the index
- main areas of improvement: sql layer, innodb, windows, performance schema
- LOCK_open fixed; it was used in a lot of places where there are global locks
- MySQL 5.5 introduces the MDL (Meta Data Locking) subsystem; it optimizes for both the number of operations and reduced contention
- smaller fixes:
- LOCK_alarm mutex replaced by posix-specific apis where available
- LOCK_thread_count mutex usage reduced from 1-2 times per query to just connect/disconnect
- THR_LOCK_charset mutex, a hotspot for setting up character sets, completely removed
- improved log sys mutex; had been used for both log records and lsn, and also something about mtr
- this has been changed to 2 separate mutexes, but contention can still be observed for some heavy log write workloads
- separate flush_list mutex, so that the flush list mutex is different now from the buffer pool mutex; the buffer pool mutex is very hot
- ability to create multiple buffer pools using innodb_buffer_pool_instances; however, there's no ability for users to specify how tables/indexes map to buffer pools
- some internal operations need to query all buffer pools, so for certain things internal overhead can be increased
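a my.cnf sketch (both settings are startup-only; sizes are illustrative):

```sql
-- in my.cnf:
--   innodb_buffer_pool_size      = 16G
--   innodb_buffer_pool_instances = 8   -- roughly 2G per instance
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool%';
```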
- multiple rollback segments
- now there are 128 segments, giving up to 128k transactions
- a slow shutdown (innodb_fast_shutdown=0) is needed for this change to take effect
- asynchronous i/o on linux
- for many years only windows was using native async i/o
- on linux it was simulated using helper threads
- can make crash recovery up to 1.5x faster
- assumes usage of O_DIRECT (which is good to use anyway)
- change buffering
- insert buffer has existed for a long time, speeds up inserts for i/o bound workloads
- however, updates and deletes were unable to use this feature
- mysql 5.5 supports "delete mark" and "purge" change buffering
- it handles update case as well as delete because secondary indexes are never updated in place
- up to 40x performance for large deletes
- separate purge thread
- before mysql 5.5 purging of old records is done by the main thread
- use innodb_purge_threads=1 to use a separate purge thread
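startup-only, so it goes in my.cnf; a quick way to check it took effect:

```sql
-- in my.cnf: innodb_purge_threads = 1
SHOW GLOBAL VARIABLES LIKE 'innodb_purge_threads';
```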
- adaptive flushing
- not a new feature in mysql 5.5, but was updated to deal with multiple buffer pools, and got lots of small tunings
- mysql 5.5 should be a lot "smoother" on many workloads
- will probably continue to improve as mysql 5.5 matures, as the mysql team is still looking at it
- performance schema
- includes many innodb probes
- can be used by developers to better understand sources of performance problems
- is mysql 5.5 faster? peter says no, it's not faster across the board, but there are no big specific regressions for now
- remaining issues:
- per index lock; use partitioning as partial workaround
- kernel mutex; seems to be a big focus for innodb in mysql 5.6
- dropping tablespace with innodb file per table
- log mutex
- group commit
- peter notes that probably mysql 5.5 is generally better than percona server 5.1
- innodb plugin is default in 5.5
- improved cpu scalability, fast index creation, buffer pool tablescan resistance, fast crash recovery, multiple buffer pools
- 5.5 plugin is innodb 1.1.x, 5.1 plugin is 1.0.x
- communicates with the server via the row-based storage engine interface
- foreign keys handled by the engine backend
- features: crash safe, mvcc, full index reading, change buffering
- new plugin supports compression and full hot backup support
- commands: SHOW ENGINE INNODB STATUS and SHOW ENGINE INNODB MUTEX
- make sure that the averages are calculated over at least 20-30 seconds; you have no control over which interval you get
- the rollback victim is determined based on which transaction has the least number of "rollback" operations to do
- whenever you see two numbers together, they could really be representing one 64-bit integer (all ints are 32-bit in the output)
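for example, combining a split 64-bit value from the status output (the two halves here are made up):

```sql
SHOW ENGINE INNODB STATUS\G
-- if the output prints a pair like "4 1089940951", the real value is
-- high * 2^32 + low:
SELECT 4 * POW(2, 32) + 1089940951 AS combined_value;
```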
- note to self: read more about innodb concurrency tickets
- http://everythingisdata.wordpress.com/2009/10/17/numbers-everyone-should-know/
- make sure you have noatime enabled, so the redo logs aren't constantly having their atime updated
- data in tablespace is organized at page level, not row level
- communication always happens with the buffer pool before it hits the tablespace
- on a write, the buffer pool page is marked dirty, the log file is updated, and the log file is fsynced
- so: pages in the tablespace are written in a delayed manner
- logs are only read from when doing a recovery
- contrast to oracle which uses log files for reading and doing fancy things like replication
- log contains two kinds of data: the flush list, and the LRU
- all log activities are assigned an LSN (log sequence number), which is monotonically increasing
- LSN value is stored in tablespace after each checkpoint
- oracle calls their logs "redo logs", but it means the same thing
- mysql does fuzzy checkpointing, because after a checkpoint there may still be dirty pages in the buffer pool
- outside of the buffer pool, changing innodb_log_file_size is one of the most important settings; bigger logs smooth out background I/O over a longer period
- by default up to 90% of the data in the buffer pool can be dirty
- if log files are larger than the buffer pool, you're going to be fucked when doing a recovery
- you can also change innodb_flush_log_at_trx_commit to 0 or 2 to reduce durability (good for slaves!)
- innodb_log_buffer_size is useful for writing big TEXT/BLOB changes
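a my.cnf sketch of the log settings above (sizes are illustrative; both are startup-only, and resizing the log files needs a clean shutdown plus removing the old ib_logfile* files):

```sql
-- in my.cnf:
--   innodb_log_file_size   = 512M   -- smooths out background flushing
--   innodb_log_buffer_size = 16M    -- bigger if you write large blobs
SHOW GLOBAL VARIABLES LIKE 'innodb_log%';
```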
- look for innodb_log_waits going up over time
- tuning the logs is a tradeoff between performance and recovery time; what you pick must be contingent on your requirements for recovery time
- these numbers are shown in show innodb status:
- log sequence number
- log flushed up to
- last checkpoint at
- you can track the amount of log you're generating (e.g. per hour) by tracking the delta of the LSN
- status variables in percona server also show this
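one hedged way to track log volume without parsing the status output: sample the bytes-written counter twice and diff:

```sql
-- bytes written to the redo log since server start; the hourly delta
-- tells you how big the log files should be
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';
```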
- rollback segments are used to get consistent read view, and also used to rollback uncommitted changes during recovery
- innodb_read_io_threads changes how many threads the buffer pool can use to read data from the tablespace
- the innodb_max_dirty_pages_pct variable changes how much of the buffer pool can be marked dirty
- innodb_buffer_pool_wait_free in "show global status" means innodb is spending time waiting for free pages in the buffer pool
- when you write to a file you are writing to a buffer; the buffer exists in your process; flushing a buffer moves it to the os cache; anything flushed will survive the process crashing; data synced will survive a power outage
- by default, innodb_flush_log_at_trx_commit=1; this is really expensive, which is why it's important to have a battery-backed write cache or an ssd. this is the only mode in which innodb is fully ACID compliant
- with innodb_flush_log_at_trx_commit=2, on every commit innodb writes the log buffer to the log file (into the os cache); once each second innodb syncs the log to disk
- if you run with this on the master, after a crash you could have slaves that received rows which were actually rolled back on the master; manifests itself as a "duplicate key error" on all slaves
- innodb_flush_log_at_trx_commit=0 is bad times
- some synchronization thing called XA: http://dev.mysql.com/doc/refman/5.1/en/xa.html
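a sketch of switching the durability mode at runtime (the variable is dynamic):

```sql
-- fully durable (ACID) default:
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
-- on a slave, trading up to a second of durability for speed is often fine:
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```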
- everything is stored in innodb table space
- by default, the tablespace is one contiguous logical file called ibdata1; you can instruct it to be multiple physical files, but only the last one can be autoextend
- every page is filled 15/16 full in innodb, with no way to configure it (called the "fill factor" in oracle)
- ext3 bad for mysql, xfs better, due to fsync behavior
- using raw partitions is "here be dragons" and makes backups really fucking hard
- you can configure --innodb-file-per-table to have one physical file per table, but some things are still stored in the main tablespace
- file-per-table is usually better since you can reclaim space whenever you TRUNCATE/DROP a table, and therefore reduce backup size
- some of the new options from mysql 5.5, like compression and off-page storage of blobs, only work with file-per-table
- cons:
- uses more file descriptors
- need to tune something
- may be more fragmentation
- complicated internal synchronization when starting up, so starting mysql can take a long time (but this would probably only be an issue if you had like 50k or 100k tables)
- but pro: works better with ext3 and O_DIRECT
- as well as table data, table space contains:
- tablespace header
- innodb-specific data dictionary
- undo information
- the doublewrite buffer
- the insert buffer
- you can turn on/off file-per-table, but existing tables will remain how they were (in ibdata1 or in separate file) until you do alter table or optimize table
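a sketch of flipping it on and migrating one existing table (dynamic with the plugin; table name is made up):

```sql
SET GLOBAL innodb_file_per_table = 1;  -- affects newly (re)created tables
-- rebuild an existing table so it moves into its own .ibd file:
ALTER TABLE mydb.orders ENGINE=InnoDB;
```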
- most internal allocations are done using extents
- innodb_autoextend_increment affects how much the tablespace is expanded at a time
- everything fits into pages, which are 16 KiB
- a data page has a header, rows, and a trailer
- rows are normally in variable width format, there is a fixed width format but it is very rarely used
- the header says what kind of page it is (data or index), and has a checksum and the LSN (the last value of the LSN as flushed to disk)
- some things in the trailer: part of the lsn, checksum
- the checksum is always checked when reading a page into innodb; if the checksum is incorrect then innodb will crash (unless using percona server)
- checksums are updated when writing pages back to disk
- xtradb has fast checksum option for super fast disks (e.g. fusion i/o), but changes xtradb compatibility
- data row format is like:
- transaction id (for when row was created)
- roll pointers
- field pointers
- fields
- most important things: the transaction id, and storage of a pointer to the older version of the row in the UNDO space
- exact format is different across versions -- in 5.0 there is a COMPACT row format where null values are stored more efficiently
- all fields are stored inline except BLOBs, which are a special case
- rows cannot be split across pages
- innodb sets a restriction that two rows must fit in one page
- so a single row cannot be more than 8 KiB
- this can cause VARCHAR to move to off-page locations
- a 20 byte pointer to an overflow page is used for VARCHAR/TEXT/BLOB in the case where the blob does not fit on the same page as the row
- for lots of internal buffers (including the buffer pool), the space allocated for a varchar or whatever is based on the maximum possible size of that field
- an 800 KB row can take up to 1.5 MB on disk for various technical reasons
- if the whole row fits in ~8000 bytes then the blob is always stored on the page, otherwise just the first 768 bytes (antelope) or a 20 byte pointer (barracuda)
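a made-up schema illustrating the limit: three ~3000-byte columns can push a row past ~8000 bytes, so columns spill off-page:

```sql
CREATE TABLE big_rows (
  id INT PRIMARY KEY,
  a  VARCHAR(3000),
  b  VARCHAR(3000),
  c  VARCHAR(3000)
) ENGINE=InnoDB;
-- antelope keeps a 768-byte prefix in-page per overflowing column;
-- barracuda (plugin + file-per-table, ROW_FORMAT=DYNAMIC) keeps only
-- the 20-byte pointer
```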
- what does log file look like? records are 512 byte aligned, does not store a complete change, just the parts that changed, i.e. Space ID + Page ID + Offset + Payload
- undo information is stored in the rollback segment
- xtradb supports multiple, innodb just supports one
- the rollback segment is always in the global tablespace
- if you get an error message about too many open transactions then you ran out of undo slots; you should increase the number of rollback segments
- besides buffer pool, in memory:
- meta data for accessed tables (never released, added whenever any table is opened), a.k.a. the data dictionary
- lock information
- adaptive hash index
- sql statements
- thread stacks
- innodb_use_sys_malloc (default 1) disables the additional memory pool
- regardless of built-in or plugin, leaving innodb_additional_mem_pool_size at the default setting is usually fine
- best thing evar: --innodb-buffer-pool-size
- you may be surprised at how much space insert buffer and undo buffers take up!
- warning: leave space for metadata, which always consumes an additional 5-10% on top of the buffer pool
- some things like binary logs and relay logs depend on OS caching, so don't use all memory
- you can change innodb_max_dirty_pages_pct
- with innodb plugin you can set io capacity to also affect how quickly dirty pages are flushed
- show innodb status shows how many "modified db pages"
- for blobs, maybe tweak innodb_log_buffer_size to be as big as the largest blob you ever expect to write
- if innodb_log_waits in GLOBAL STATUS is high then the log buffer isn't big enough
- locks are held for the duration of a transaction
- for updates/deletes, more rows than you may expect may end up being locked
- "index level locking" more apt name than "row level locking"
- mysql creates next-key locks to prevent inserts from happening on slaves when statements are committed out of order
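a sketch of the surprise (table is made up): without a usable index, an update locks every row it examines, not just the rows it changes:

```sql
-- with no index on `state`, innodb scans the whole table and holds
-- next-key locks on everything it read until COMMIT/ROLLBACK
BEGIN;
UPDATE accounts SET flagged = 1 WHERE state = 'NY';
COMMIT;
```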
- two common lock types: shared and exclusive
- an S lock never occurs for reads because of MVCC; S locks are normally only visible with foreign key constraints
- in innodb everything is an index
- data is stored in a clustered index organized by the primary key; in the absence of a primary key, the first unique not null key is selected
- other indexes are stored as secondary indexes
- clustered index means that data is stored alongside index, in myisam the index just stores a pointer to the actual data
- in innodb primary key lookup is faster than myisam, but secondary index is slower: secondary index points to primary key, so need to do two index lookups
- all access to the data is always done by primary key
- once mysql uses a RANGE, it cannot use anything else in the index (e.g. an index on (date, customer_id): usually date is more selective, but if you ever want to do a query by customer_id and a date range then you're screwed)
- just says don't use UUIDs for primary key on innodb
- never piggy-back the primary key onto the end of a composite index or covering index - it is already included for free; mk-duplicate-key-checker will find this
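a sketch of the column-order point (schema is made up): put the equality column first so the range column can still be used:

```sql
-- good: index (customer_id, order_date) serves this query fully;
-- bad:  index (order_date, customer_id) stops being useful at the range
SELECT *
FROM orders
WHERE customer_id = 42
  AND order_date BETWEEN '2010-04-01' AND '2010-04-30';
```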
- innodb has an adaptive hash
- if you have a workload that has "hotspots" then innodb can internally build a hash index for values that you access frequently. this hash index exists only in memory, and there's no control over it from the end-user standpoint.
- the hash index does not cover the whole index (out of time)
- for the insert buffer: when you make multiple updates to a single page, they can be buffered in the insert buffer, and then the data on disk is merged with the in-memory insert buffer
- mysql 5.5 also has this for deletes
- may want to disable for ssd
- can measure efficiency for innodb 5.1 and later with show innodb status
- double write buffer helps to protect against partially written/corrupt pages
- always 2MB, two sets of 64 pages
- you can turn the double write buffer off if you have a transactional filesystem like ZFS
- adaptive flushing is in mysql 5.5
- by default 200 iops (assumes raid 10 and 4 disks), change with innodb_io_capacity
- this is a more generic solution, not a mysql-specific solution; and there are other ways of doing this
- speaker (yves trudeau) is a former mysql and sun consultant, has a phd, english is not his first language
- the plan
- the challenges
- the solution
- building the solution
- other solutions
- challenges:
- cost! people don't want to have a big standby instance since it costs money; different from traditional HA solutions where people have two big servers, one standby and one active;
- no static ips on the private network, and you have no control over which ip it receives
- no vip / no {broad,multi}cast
- the solution:
- there's a small monitoring instance, typically the smallest instance type available
- there's also a larger instance
- farm of web or application servers connected to the private ip of the large instance
- the large instance stores the mysql data on an ebs volume that is used as permanent storage
- if there is a failure, the small instance will detect the failure and will initiate the failover procedure
- the small instance will create a new large instance, the new large instance will mount the EBS volume, and recovery will initiate
- web/app servers still don't know where to connect
- the small instance will connect to each web/app server and update their configuration to point to the new large instance
- technology pieces:
- cluster manager
- pacemaker; others could work too: http://www.linux-ha.org/wiki/Pacemaker
- unicast communication
- lack of broadcast/multicast
- setup of the cluster partner ip for a new host
- reconfiguring service customers
- Pacemaker
- allow port 694 UDP within the security group
- heartbeat or corosync for communication, unicast based
- scripts-type resource on the monitoring host (type anything); on the remote end it's a no-op script
- localization weight favoring the large instance
- contents of /etc/ha.cf (left side of screen is cut off): autojoin any; ucast eth0; warntime 5; deadtime 15; initdead 60; keepalive 2; crm respawn
- be careful to set these values high, since amazon's network can be laggy, which can cause spurious heartbeat failures
- custom ec2 images that are pre-configured with pacemaker/heartbeat and mysql
- aws tools; parsing can be difficult --> helper scripts
- failover steps:
- detect failure
- kill the faulty instance
- free the shared resource
- launch the new instance
- reassign shared resource
- reconfigure heartbeat
- spread connection info
- ec2-run-instance has an interesting feature: a user-data script which is run by root on startup; used to seed the new instance with the cluster partner ip
- don't use pacemaker to move resources because of some weird synchronization thing
- be careful about timing
- series of blog posts at http://www.mysqlperformanceblog.com/2010/06/17/high-availability-for-mysql-on-amazon-ec2-part-1-intro/
- other solutions:
- RDS, pros/cons, only mysql
- continuent
- rightscale (eip based)
- scalarium
- scalebase (beta)
- others?