Created
January 22, 2019 04:08
-
-
Save belltailjp/482f329bfaa3a2a6141d8f1ed6e55522 to your computer and use it in GitHub Desktop.
Python+SQLite+Peewee Squeeze insertion performance challenge
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import math\n", | |
"import random\n", | |
"from tqdm import tqdm" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Model definition\n", | |
"\n", | |
"主キーの他に2つのIntegerFieldをもつモデルを定義。\n", | |
"一方はそれらのカラムそれぞれに対して・およびペアでインデクスを張るモデル、もう一方はインデクスを張らないモデル。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import peewee\n", | |
"\n", | |
"from peewee import IntegerField\n", | |
"from peewee import Model\n", | |
"from peewee import SqliteDatabase\n", | |
"\n", | |
"from peewee import chunked\n", | |
"\n", | |
"db = SqliteDatabase(None)\n", | |
"\n", | |
"class NoIndexedTable(Model):\n", | |
" value_1 = IntegerField(unique=False)\n", | |
" value_2 = IntegerField(unique=False)\n", | |
" \n", | |
" class Meta:\n", | |
" database = db\n", | |
" \n", | |
" def __str__(self):\n", | |
" return \"{}-{}\".format(self.value_1, self.value_2)\n", | |
" \n", | |
"class IndexedTable(Model):\n", | |
" value_1 = IntegerField(index=True, unique=False)\n", | |
" value_2 = IntegerField(index=True, unique=False)\n", | |
" \n", | |
" class Meta:\n", | |
" database = db\n", | |
" \n", | |
" indexes = (((\"value_1\", \"value_2\"), True),)\n", | |
" \n", | |
" def __str__(self):\n", | |
" return \"{}-{}\".format(self.value_1, self.value_2)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Initialize DB\n", | |
"\n", | |
"まっさらなDBを作成する。\n", | |
"IO速度による影響を軽減するためにいくつかのオプションを使用。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import os\n", | |
"os.remove('tmp.db')\n", | |
"db.init('tmp.db', pragmas={'synchronous': 0, 'journal_mode': 'memory'})\n", | |
"db.connect()\n", | |
"db.create_tables([NoIndexedTable, IndexedTable])\n", | |
"\n", | |
"def delete_all():\n", | |
" \"\"\"Delete all the columns in IndexedTable and NoIndexedTable\"\"\"\n", | |
" IndexedTable.delete().execute()\n", | |
" NoIndexedTable.delete().execute()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# 実験条件の設定\n", | |
"\n", | |
"1億件のinsertをベンチマーク。\n", | |
"SQLiteでは1クエリには最大999変数まで含められ、各モデルの1レコードのinsertに2変数を保存できるため、1クエリ(チャンク)サイズは最大で499である。ここでは`chunk_size`として400を設定。\n", | |
"\n", | |
"レコードの各カラムの値としてランダム値を用い、その最大値を`max_value`とする。\n", | |
"\n", | |
"このnotebookの実行に全体で約5時間(i7 8700K上で)。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"N = 1000 * 1000 * 100\n", | |
"chunk_size = 400\n", | |
"max_value = N ** 2" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# (1) Peeweeの通常のインタフェースでのバッチinsertの速度の計測\n", | |
"\n", | |
"これがベースライン。\n", | |
"\n", | |
"時間についてはtqdmのit/sではなく%timeのWall timeを参照のこと。\n", | |
"\n", | |
"## 両モデルにレコードを`N`要素追加するコード" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def insert_records_peewee_standard_api(table, tpl=False):\n", | |
" with db.atomic(), tqdm(total=N) as pbar:\n", | |
" for _ in range(int(math.ceil(N / chunk_size))):\n", | |
" if tpl:\n", | |
" table.insert_many([(random.randint(0, max_value), random.randint(0, max_value))\n", | |
" for _ in range(chunk_size)],\n", | |
" fields=[NoIndexedTable.value_1, NoIndexedTable.value_2]).execute()\n", | |
" else:\n", | |
" table.insert_many([{'value_1': random.randint(0, max_value),\n", | |
" 'value_2': random.randint(0, max_value)}\n", | |
" for _ in range(chunk_size)]).execute()\n", | |
" pbar.update(chunk_size)\n", | |
" assert table.select().count() == N" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## インデクスを(主キー以外)もたないモデル\n", | |
"\n", | |
"### (1a) Baseline (dict)\n", | |
"通常のinsert_manyをチャンクごとに呼ぶパターン。\n", | |
"\n", | |
"### (1b) Baseline (dict)\n", | |
"insert_manyに渡す際小さなdictを大量生成しそれをPeewee内部で確認するオーバヘッドの懸念があるためタプルで渡すようにしたパターン。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [19:16<00:00, 86472.91it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 19min 17s, sys: 4.62 s, total: 19min 22s\n", | |
"Wall time: 19min 18s\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [18:08<00:00, 91852.23it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 18min 10s, sys: 3.95 s, total: 18min 14s\n", | |
"Wall time: 18min 10s\n" | |
] | |
} | |
], | |
"source": [ | |
"%time insert_records_peewee_standard_api(NoIndexedTable)\n", | |
"delete_all()\n", | |
"%time insert_records_peewee_standard_api(NoIndexedTable, tpl=True)\n", | |
"delete_all()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"dictを大量に生成し渡すよりタプルを用いたほうが約1分高速。" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### インデクスをもつモデル" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [54:52<00:00, 30370.43it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 35min 2s, sys: 20min 1s, total: 55min 3s\n", | |
"Wall time: 54min 54s\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [53:54<00:00, 30915.55it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 33min 48s, sys: 20min 17s, total: 54min 6s\n", | |
"Wall time: 53min 56s\n" | |
] | |
} | |
], | |
"source": [ | |
"%time insert_records_peewee_standard_api(IndexedTable)\n", | |
"delete_all()\n", | |
"%time insert_records_peewee_standard_api(IndexedTable, tpl=True)\n", | |
"delete_all()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"インデクスがある場合はinsertの速度が大幅に遅くなっていることがわかる。タプルを渡すのはやはり有効で、インデクスがある場合・ない場合とも約1分の高速化。この1分はdictに絡むオーバヘッドで消費されていることがわかる(1レコードあたり0.6マイクロ秒なのでそんなものだろうか)。" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"| Table | Method | Wall Time | Average speed | \n", | |
"|-----------|----------------------------------------|-----------|------------------|\n", | |
"| NoIndexed | (1a) Baseline (dict) | 19min 18s | 86356 records/s |\n", | |
"| NoIndexed | (1b) Baseline (tuple) | 18min 10s | 91743 records/s |\n", | |
"| Indexed | (1a) Baseline (dict) | 54min 54s | 30358 records/s |\n", | |
"| Indexed | (1b) Baseline (tuple) | 53min 56s | 30902 records/s |" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# (2) インデクスを一旦無効化してinsertし再有効化する\n", | |
"\n", | |
"アクロバティックでかつあまり安全ではない方法。\n", | |
"\n", | |
"インデクスの更新をレコード追加のたびに行うのはオーバヘッドが大きいため、一旦インデクスを削除した状態でレコードを追加した上でインデクスを張り直す。まとめてインデクスを再構築したほうが、レコード追加のたびに更新を行うより効率がいいという仮説に基づく方法。\n", | |
"\n", | |
"Peeweeの拡張モジュールplayhouseにあるDBマイグレーションAPIを用いて、インデクスの削除および再構築を行う。\n", | |
"\n", | |
"※DBをオンラインに保った状態でこれをするのは危険もしくは動かない。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from playhouse.migrate import SqliteMigrator, migrate\n", | |
"migrator = SqliteMigrator(db)\n", | |
"\n", | |
"def drop_index_insert_remake_index():\n", | |
" table_name = IndexedTable._meta.table_name\n", | |
" with db.atomic():\n", | |
" migrate(\n", | |
" migrator.drop_index(table_name, table_name + '_value_1'),\n", | |
" migrator.drop_index(table_name, table_name + '_value_2'),\n", | |
" migrator.drop_index(table_name, table_name + '_value_1_value_2'),\n", | |
" )\n", | |
"\n", | |
" insert_records_peewee_standard_api(IndexedTable, tpl=True)\n", | |
" \n", | |
" with db.atomic():\n", | |
" migrate(\n", | |
" migrator.add_index(table_name, ('value_1',), False),\n", | |
" migrator.add_index(table_name, ('value_2',), False),\n", | |
" migrator.add_index(table_name, ('value_1', 'value_2'), True),\n", | |
" )\n", | |
" assert IndexedTable.select().count() == N" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [18:13<00:00, 91489.88it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 20min 33s, sys: 15.1 s, total: 20min 48s\n", | |
"Wall time: 20min 45s\n" | |
] | |
} | |
], | |
"source": [ | |
"%time drop_index_insert_remake_index()\n", | |
"delete_all()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"| Table | Method | Wall Time | Average speed |\n", | |
"|-------------|----------------------------------------|-----------|------------------|\n", | |
"| NoIndexed | (1a) Baseline (dict) | 19min 18s | 86356 records/s |\n", | |
"| NoIndexed | (1b) Baseline (tuple) | 18min 10s | 91743 records/s |\n", | |
"| Indexed | (1a) Baseline (dict) | 54min 54s | 30358 records/s |\n", | |
"| Indexed | (1b) Baseline (tuple) | 53min 56s | 30902 records/s |\n", | |
"| **Indexed** | **(2) Drop index->insert->Reindex** | **20min 45s** | **80321 records/s** |\n", | |
"\n", | |
"インデクスのあるテーブルに対して、53分56秒→20分45秒(2.6倍)高速化。" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# (3) SQLクエリを効率化する\n", | |
"\n", | |
"## (3-1) モデルを経由せずSQLiteクエリを直接Peeweeから実行する\n", | |
"\n", | |
"前述の方法とは直交する改善案。\n", | |
"\n", | |
"極端に大量のレコードを一括で挿入する場合、ORMそのもののオーバヘッドが速度の制約となる可能性がある。\n", | |
"\n", | |
"そこで、Peeweeのexecute_sqlを呼び生SQLでレコードを作成する。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def insert_records_peewee_execute_sql(table):\n", | |
" with db.atomic():\n", | |
" query = 'insert into {}(value_1, value_2) values (?,?)'.format(table._meta.table_name)\n", | |
" for _ in tqdm(range(N)):\n", | |
" db.execute_sql(query, (random.randint(0, max_value), random.randint(0, max_value)))\n", | |
" assert table.select().count() == N" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"下記はインデクスのないテーブルの場合の結果。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [07:21<00:00, 226344.94it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 7min 21s, sys: 3.63 s, total: 7min 25s\n", | |
"Wall time: 7min 23s\n" | |
] | |
} | |
], | |
"source": [ | |
"%time insert_records_peewee_execute_sql(NoIndexedTable)\n", | |
"delete_all()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"下記はインデクスのあるテーブルの場合の結果。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [39:46<00:00, 41903.39it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 19min 39s, sys: 20min 13s, total: 39min 52s\n", | |
"Wall time: 39min 47s\n" | |
] | |
} | |
], | |
"source": [ | |
"%time insert_records_peewee_execute_sql(IndexedTable)\n", | |
"delete_all()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"| Table | Method | Wall Time | Average speed |\n", | |
"|-------------|----------------------------------------|-----------|------------------|\n", | |
"| NoIndexed | (1a) Baseline (dict) | 19min 18s | 86356 records/s |\n", | |
"| NoIndexed | (1b) Baseline (tuple) | 18min 10s | 91743 records/s |\n", | |
"| Indexed | (1a) Baseline (dict) | 54min 54s | 30358 records/s |\n", | |
"| Indexed | (1b) Baseline (tuple) | 53min 56s | 30902 records/s |\n", | |
"| Indexed | (2) Drop index->insert->Reindex | 20min 45s | 80321 records/s |\n", | |
"| **NoIndexed** | **(3-1) Raw SQL through ORM** | **7min 23s** | **225734 records/s** |\n", | |
"| **Indexed** | **(3-1) Raw SQL through ORM** | **39min 47s** | **41893 records/s** |" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"(1)で示した標準APIで追加する場合では、インデクスなしで18分、インデクスありで約54分であり、その差は約36分であった。\n", | |
"ここで示した直接SQL実行でのインデクスなしとありの差は32分であった。\n", | |
"つまりこのおよそ30分強がインデクスの更新で消費されていると言えそう。1要素あたりにならすと18マイクロ秒程度であるが1億件insertでは無視できない。" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## (3-2) SQLite APIでPeeweeを一切介さずにレコードを挿入\n", | |
"\n", | |
"上記の直接SQL発行では1要素ずつクエリを発行していた。これは現在のPeeweeのAPIの制約によるものであるが、SQLite自体は複数レコードを一度にinsertできる。\n", | |
"\n", | |
"オーバヘッドを極力減らすため、ORMを介さずレコード挿入をしてみる。\n", | |
"\n", | |
"なおこのときPeeweeで開いたDBは一旦閉じる必要がある。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import sqlite3\n", | |
"\n", | |
"def insert_records_direct_sqlite(table):\n", | |
" con = sqlite3.connect(\"tmp.db\")\n", | |
" c = con.cursor()\n", | |
" query = \"insert into {}(value_1, value_2) values (?,?)\".format(table._meta.table_name)\n", | |
" c.executemany(query, [(random.randint(0, max_value), random.randint(0, max_value))\n", | |
" for _ in tqdm(range(N))])\n", | |
" con.commit()\n", | |
" con.close()\n", | |
"\n", | |
"def measure(table):\n", | |
" db.close()\n", | |
" %time insert_records_direct_sqlite(table)\n", | |
" db.init('tmp.db')\n", | |
" assert table.select().count() == N\n", | |
" delete_all()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"インデクスのないテーブルの場合とインデクスのあるテーブルの場合。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [03:03<00:00, 543872.58it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 3min 56s, sys: 6.18 s, total: 4min 2s\n", | |
"Wall time: 4min 2s\n" | |
] | |
}, | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [03:01<00:00, 550154.27it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 15min 12s, sys: 19min 16s, total: 34min 29s\n", | |
"Wall time: 53min 23s\n" | |
] | |
} | |
], | |
"source": [ | |
"measure(NoIndexedTable)\n", | |
"measure(IndexedTable)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"| Table | Method | Wall Time | Average speed |\n", | |
"|-------------|----------------------------------------|-----------|------------------|\n", | |
"| NoIndexed | (1a) Baseline (dict) | 19min 18s | 86356 records/s |\n", | |
"| NoIndexed | (1b) Baseline (tuple) | 18min 10s | 91743 records/s |\n", | |
"| Indexed | (1a) Baseline (dict) | 54min 54s | 30358 records/s |\n", | |
"| Indexed | (1b) Baseline (tuple) | 53min 56s | 30902 records/s |\n", | |
"| Indexed | (2) Drop index->insert->Reindex | 20min 45s | 80321 records/s |\n", | |
"| NoIndexed | (3-1) Raw SQL through ORM | 7min 23s | 225734 records/s |\n", | |
"| Indexed | (3-1) Raw SQL through ORM | 39min 47s | 41893 records/s |\n", | |
"| **NoIndexed** | **(3-2) Direct SQLite** | **4min 2s** | **413223 records/s** |\n", | |
"| **Indexed** | **(3-2) Raw SQL through ORM** | **53min 23s** | **31221 records/s** |\n", | |
"\n", | |
"インデクスのないテーブルの場合は、execute_sqlを呼ぶ場合に比べてさらに大幅に高速化(7分23秒→4分2秒、差は約3分強)。\n", | |
"\n", | |
"当然、インデクスのあるテーブルの場合は全く恩恵に与れない。" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# (4) 上記のテクニックを組み合わせる\n", | |
"\n", | |
"(2)で示したように、インデクスを一旦削除しレコードを追加した上でインデクス再構築することが(安全性と引き換えに)パフォーマンス上有効。\n", | |
"\n", | |
"また(3)で示したように、ORMをすっとばしてレコードを追加することがパフォーマンス上有効。\n", | |
"\n", | |
"これらを組み合わせて最速insertを目指す。\n", | |
"\n", | |
"## Peeweeにおける汎用的なindexの削除・再構築コード\n", | |
"\n", | |
"index名を手打ちしてPeeweeのマイグレータに投げるのは能率が悪いので、これを自動化する。" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from playhouse.migrate import SqliteMigrator, migrate\n", | |
"\n", | |
"class drop_and_recreate_index:\n", | |
" def __init__(self, table):\n", | |
" self.table = table\n", | |
" self.table_name = table._meta.table_name\n", | |
" self.migrator = SqliteMigrator(table._meta.database)\n", | |
" \n", | |
" self.indexed_columns = [(name, col.unique)\n", | |
" for name, col in self.table._meta.columns.items()\n", | |
" if col.index]\n", | |
" \n", | |
" # [(('value_1', 'value_2'), True), ...]\n", | |
" self.multi_column_indexes = self.table._meta.indexes\n", | |
"\n", | |
" def drop_index(self):\n", | |
" with self.table._meta.database.atomic():\n", | |
" # Drop column indexes\n", | |
" for name, _ in self.indexed_columns:\n", | |
" idx_name = self.table_name + '_' + name\n", | |
" migrate(migrator.drop_index(self.table_name, idx_name))\n", | |
" \n", | |
" # Drop multi-column indexes\n", | |
" for columns, _ in self.multi_column_indexes:\n", | |
" idx_name = \"_\".join([self.table_name] + list(columns))\n", | |
" migrate(migrator.drop_index(self.table_name, idx_name))\n", | |
"\n", | |
" def recreate_index(self):\n", | |
" with self.table._meta.database.atomic():\n", | |
" # Recreate column indexes\n", | |
" for name, unique in self.indexed_columns:\n", | |
" migrate(migrator.add_index(self.table_name, (name,), unique))\n", | |
" \n", | |
" # Recreate multi-column indexes\n", | |
" for (columns, unique) in self.multi_column_indexes:\n", | |
" migrate(migrator.add_index(self.table_name, columns, unique))\n", | |
"\n", | |
" def __enter__(self):\n", | |
" self.drop_index()\n", | |
" \n", | |
" def __exit__(self, exception_type, exception_value, traceback):\n", | |
" self.recreate_index()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"上記のコードで、このように使える。\n", | |
"\n", | |
"```python\n", | |
"with drop_and_recreate_index(TableName):\n", | |
" # insert records to DB\n", | |
"```\n", | |
"\n", | |
"なお途中で失敗した場合にインデクスがない状態のままDBが残る危険があることなど実装としては十分安全に詰められていないので実際に使う場合は注意されたい。" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## (4-1) execute_sqlでのレコード挿入" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def insert_records_peewee_execute_sql_drop_index():\n", | |
" with drop_and_recreate_index(IndexedTable):\n", | |
" insert_records_peewee_execute_sql(IndexedTable)\n", | |
" assert IndexedTable.select().count() == N" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 17, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [07:23<00:00, 225438.44it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 9min 45s, sys: 13.8 s, total: 9min 59s\n", | |
"Wall time: 9min 57s\n" | |
] | |
} | |
], | |
"source": [ | |
"%time insert_records_peewee_execute_sql_drop_index()\n", | |
"delete_all()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## (4-2) sqlite3直叩きでのレコード挿入" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 18, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def insert_records_direct_sqlite_drop_index():\n", | |
" with drop_and_recreate_index(IndexedTable):\n", | |
" insert_records_direct_sqlite(IndexedTable)\n", | |
" assert IndexedTable.select().count() == N" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 19, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"100%|██████████| 100000000/100000000 [03:02<00:00, 548998.48it/s]\n" | |
] | |
}, | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"CPU times: user 6min 15s, sys: 17.8 s, total: 6min 33s\n", | |
"Wall time: 6min 33s\n" | |
] | |
} | |
], | |
"source": [ | |
"%time insert_records_direct_sqlite_drop_index()\n", | |
"delete_all()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"| Table | Method | Wall Time | Average speed |\n", | |
"|-------------|----------------------------------------|-----------|------------------|\n", | |
"| NoIndexed | (1a) Baseline (dict) | 19min 18s | 86356 records/s |\n", | |
"| NoIndexed | (1b) Baseline (tuple) | 18min 10s | 91743 records/s |\n", | |
"| Indexed | (1a) Baseline (dict) | 54min 54s | 30358 records/s |\n", | |
"| Indexed | (1b) Baseline (tuple) | 53min 56s | 30902 records/s |\n", | |
"| Indexed | (2) Drop index->insert->Reindex | 20min 45s | 80321 records/s |\n", | |
"| NoIndexed | (3-1) Raw SQL through ORM | 7min 23s | 225734 records/s |\n", | |
"| Indexed | (3-1) Raw SQL through ORM | 39min 47s | 41893 records/s |\n", | |
"| NoIndexed | (3-2) Direct SQLite | 4min 2s | 413223 records/s |\n", | |
"| Indexed | (3-2) Raw SQL through ORM | 53min 23s | 31221 records/s |\n", | |
"| **Indexed** | **(4-1) (2)+(3-1)** | **9min 57s** | **167504 records/s** |\n", | |
"| **Indexed** | **(4-2) (2)+(3-2)** | **6min 33s** | **254453 records/s** |" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"インデクスをもつモデルでも1億件のinsertを6分33秒で完了(インデクス破棄・再構築の時間を全て含む)、平均して254000レコード毎秒を達成。通常のバッチ+トランザクションでのinsertでは3.1万レコード毎秒が最速であったので、8倍以上の高速化を達成。\n", | |
"\n", | |
"インデクスをもたないモデルの場合でも、9.2万レコード毎秒であったものが41万レコード毎秒と4倍以上の高速化。\n", | |
"\n", | |
"ORMはもちろん柔軟性や安全性を担保するために必要な各種の仕組みのために細かなオーバヘッドが生じているが、DBが壊れるリスクを承知の上、また挿入する値がアプリケーションロジック上正当であることが確実である場合でありかつ極端に多数のinsertを行う場合については、これらのテクニックが有効。" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment