{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## データの準備\n",
"\n",
"### データのダウンロード\n",
"今回はMNISTというデータセットを利用します。データセットはscikit-learnを利用してダウンロードします。Xとyには画像とラベルが入ります。\n",
"X.shape や y.shapeとすると、どういった次元なのかがわかります。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_openml\n",
"X, y = fetch_openml('mnist_784', version=1, return_X_y=True)"
]
},
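{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of the shapes (a minimal sketch, assuming the download cell above has been run):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# X should be (70000, 784): 70,000 images, each flattened to 28x28 = 784 pixels.\n",
"# y should be (70000,): one label per image.\n",
"print(X.shape)\n",
"print(y.shape)"
]
},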
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### データの分割\n",
"\n",
"60000枚と10000枚に画像とラベルを分けて学習データとテストデータを作成しましょう。スライスという機能を使えば配列データを分けることができます。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train = X[:60000]\n",
"X_test = X[60000:]\n",
"y_train = y[:60000]\n",
"y_test = y[60000:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## kNNの学習\n",
"\n",
"データから近いk個のラベルを見て、その多数決でデータのラベルを判定するものです。理論的には全データと比較すれば良いというもので、学習は不要ですが、推論の比較数が多く大変なので、データを木構造にして探索しやすくしています。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"neigh = KNeighborsClassifier(n_neighbors=1)\n",
"neigh.fit(X_train, y_train) "
]
},
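{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note, scikit-learn exposes the search structure through the `algorithm` parameter of KNeighborsClassifier. The sketch below (illustrative, not part of the original notebook) requests a ball tree explicitly; by default `'auto'` picks a strategy based on the data, and for high-dimensional input like MNIST it often falls back to brute-force search."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Explicitly build a ball tree for the neighbor search.\n",
"# This changes only how neighbors are found, not the predictions.\n",
"neigh_bt = KNeighborsClassifier(n_neighbors=1, algorithm='ball_tree')\n",
"neigh_bt.fit(X_train, y_train)"
]
},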
{
"cell_type": "markdown",
"metadata": {},
"source": [
"上で説明したようにkNNは推論が大変です。最初の5個だけ選んで推論します。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"y_predict = neigh.predict(X_test[0:5])\n",
"print(y_predict)\n",
"print(y_test[0:5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 決定木の学習\n",
"\n",
"決定木は、データをルールで分類する木構造の分類器です。分類のルールをデータから学習します。scikit-learnでは、DecisionTreeClassifier()を呼び出してからfit()を実行します。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import tree\n",
"\n",
"clf = tree.DecisionTreeClassifier()\n",
"clf = clf.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"いったん学習が終われば predict で分類を行うことができます。X_testを入力すると予測を得ることができます。ついでに時間を測るとkNNより速いことも確認できると思います。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"y_predict = clf.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"予測の分類 y_predict は、正解の y_test と合っている方が良い結果ということになります。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score\n",
"accuracy_score(y_test, y_predict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"多クラス分類のときは、どのクラスがどこに分類されているか(例えば、7が1になってないかなど)が気になると思います。Confusion Matrixは、クラス間の分類結果を示すことができます。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"confusion_matrix(y_test, y_predict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Keras で単純パーセプトロン/多層パーセプトロン\n",
"\n",
"単純パーセプトロンは全結合層と活性化層のみで構成されます。MNISTの場合は、10次元に全結合してから、活性化層に通します。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow.keras.models import Sequential\n",
"from tensorflow.keras.layers import Dense, Activation, Dropout\n",
"\n",
"model = Sequential()\n",
"model.add(Dense(10, input_shape=(784,)))\n",
"model.add(Activation('softmax'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"モデルの定義ができたら、compileで最適化手法やロスなどを決めます。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.compile(optimizer='rmsprop',\n",
" loss='categorical_crossentropy',\n",
" metrics=['accuracy'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"summary()を使うことでモデルの内容を見ることができます。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"学習は同じく fit を使うことで行います。ニューラルネットワークでは、数値を正規化することで学習が安定することがあります。ここでは255に割っておきます。また、出力が10次元なので、それにあわせてone-hotなベクトル(ラベルが6なら[0,0,0,0,0,0,1,0,0,0])に変換します。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow.keras.utils import to_categorical\n",
"model.fit(X_train.astype('float32')/255, to_categorical(y_train), epochs=10, batch_size=32)"
]
},
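{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what to_categorical produces, here is a quick illustrative check of the one-hot encoding described above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A label of 6 maps to [0,0,0,0,0,0,1,0,0,0].\n",
"print(to_categorical([6], num_classes=10))"
]
},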
{
"cell_type": "markdown",
"metadata": {},
"source": [
"同じようにpredictで予測できます。結果はone-hotなベクトルなので、argmaxで逆にラベルに変換しましょう。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"y_predict = model.predict(X_test)\n",
"y_predict = np.argmax(y_predict, axis =1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"精度とConfusion_matrix を表示してみます。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"score = model.evaluate(X_test.astype('float32')/255, to_categorical(y_test), verbose=0)\n",
"print('Test loss:', score[0])\n",
"print('Test accuracy:', score[1])\n",
"\n",
"from sklearn.metrics import confusion_matrix\n",
"confusion_matrix(y_test.astype(int), y_predict)"
]
},
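{
"cell_type": "markdown",
"metadata": {},
"source": [
"The heading above also mentions the multi-layer perceptron. Below is a minimal sketch of one, assuming the same data preparation as before: a hidden Dense layer with a ReLU activation and the Dropout layer imported earlier sit between the input and the 10-way softmax output. The hidden size of 128 and dropout rate of 0.2 are illustrative choices, not values from the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Multi-layer perceptron: 784 -> Dense(128, relu) -> Dropout(0.2) -> Dense(10, softmax).\n",
"mlp = Sequential()\n",
"mlp.add(Dense(128, input_shape=(784,)))\n",
"mlp.add(Activation('relu'))\n",
"mlp.add(Dropout(0.2))\n",
"mlp.add(Dense(10))\n",
"mlp.add(Activation('softmax'))\n",
"\n",
"mlp.compile(optimizer='rmsprop',\n",
"            loss='categorical_crossentropy',\n",
"            metrics=['accuracy'])\n",
"\n",
"mlp.fit(X_train.astype('float32')/255, to_categorical(y_train), epochs=10, batch_size=32)\n",
"mlp.evaluate(X_test.astype('float32')/255, to_categorical(y_test), verbose=0)"
]
},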
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SageMaker を使いこなす\n",
"\n",
"ここではノートブックインスタンスで学習しましたが、学習用インスタンスで学習することができます。Kerasの場合は、[こちら](https://github.com/harusametime/sagemaker-notebooks/tree/master/keras_tensorflow)に詳しい説明があります。\n",
"\n",
"学習用インスタンスにデータを渡すために、S3にいったんデータを置きましょう。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"import os\n",
"import numpy as np\n",
"sagemaker_session = sagemaker.Session()\n",
"\n",
"# Save the arrays locally, then upload them to S3 for the training job.\n",
"os.makedirs(\"./s3_data\", exist_ok=True)\n",
"np.save(\"./s3_data/image.npy\", X)\n",
"np.save(\"./s3_data/label.npy\", y.astype(int))\n",
"\n",
"input_data = sagemaker_session.upload_data(\"./s3_data\", key_prefix=\"mnist_data\")\n",
"print(\"Data is uploaded to {}\".format(input_data))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"次に学習に必要なコードを用意します。以下を実行すると、%%writefile train.py 以下の内容が train.py に保存されます。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile train.py\n",
"import argparse\n",
"import keras\n",
"import os\n",
"import numpy as np\n",
"\n",
"import tensorflow as tf\n",
"from tensorflow.keras.models import Sequential\n",
"from tensorflow.keras.utils import to_categorical\n",
"from tensorflow.keras.layers import Dense, Dropout, Activation\n",
"from tensorflow.keras import backend as K\n",
"\n",
"if __name__ == '__main__':\n",
" \n",
" parser = argparse.ArgumentParser()\n",
"\n",
" # input data and model directories\n",
" parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])\n",
" parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAINING'])\n",
" \n",
" args, _ = parser.parse_known_args()\n",
" \n",
" train_dir = args.train\n",
" \n",
" X = np.load(os.path.join(train_dir, 'image.npy'))\n",
" y = np.load(os.path.join(train_dir, 'label.npy'))\n",
" \n",
" X_train = X[:60000]\n",
" X_test = X[60000:]\n",
" y_train = y[:60000]\n",
" y_test = y[60000:]\n",
" \n",
" model = Sequential()\n",
" model.add(Dense(10, input_shape=(784,)))\n",
" model.add(Activation('softmax'))\n",
"\n",
" model.compile(optimizer='rmsprop',\n",
" loss='categorical_crossentropy',\n",
" metrics=['accuracy'])\n",
" \n",
" model.fit(X_train.astype('float32')/255, to_categorical(y_train), epochs=10, batch_size=32)\n",
" score = model.evaluate(X_test.astype('float32')/255, to_categorical(y_test), verbose=0)\n",
" print('Test loss:', score[0])\n",
" print('Test accuracy:', score[1])\n",
"\n",
" '''\n",
" Save Keras model as a model for tensorflow serving\n",
" '''\n",
" sess = K.get_session()\n",
" tf.saved_model.simple_save(\n",
" sess,\n",
" os.path.join(args.model_dir, 'model/1'),\n",
" inputs={'inputs': model.input},\n",
" outputs={t.name: t for t in model.outputs})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SageMaker の API を利用して、学習用インスタンスで学習を行います。ここでは、インスタンスタイプや数、上記で作成した学習スクリプト train.py を指定します。S3のファイルパスを指定してfitを実行すれば、学習を行うことできます。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.tensorflow import TensorFlow\n",
"from sagemaker import get_execution_role\n",
"\n",
"role = get_execution_role()\n",
"mnist_estimator = TensorFlow(entry_point = \"train.py\",\n",
" role=role,\n",
" train_instance_count=1,\n",
" train_instance_type=\"ml.m4.xlarge\",\n",
" framework_version=\"1.14.0\",\n",
" py_version='py3',\n",
" script_mode=True)\n",
"\n",
"mnist_estimator.fit(input_data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"学習が終わればデプロイしてWeb APIを発行することもできます。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictor = mnist_estimator.deploy(instance_type='ml.m4.xlarge',\n",
" initial_instance_count=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"APIに対してリクエストを送ることが可能です。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import random\n",
"import matplotlib.pyplot as plt\n",
"\n",
"num_samples = 5\n",
"indices = random.sample(range(X_test.shape[0] - 1), num_samples)\n",
"images, labels = X_test[indices]/255, y_test[indices]\n",
"\n",
"for i in range(num_samples):\n",
" plt.subplot(1,num_samples,i+1)\n",
" plt.imshow(images[i].reshape(28, 28), cmap='gray')\n",
" plt.title(labels[i])\n",
" plt.axis('off')\n",
" \n",
"prediction = predictor.predict(images)['predictions']\n",
"prediction = np.array(prediction)\n",
"predicted_label = prediction.argmax(axis=1)\n",
"print('The predicted labels are: {}'.format(predicted_label))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"不要な場合はエンドポイントを削除しましょう。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictor.delete_endpoint()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_tensorflow_p36",
"language": "python",
"name": "conda_tensorflow_p36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}