{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# Assignment 4\n## Understaning scaling of linear algebra operations on Apache Spark using Apache SystemML\n\nIn this assignment we want you to understand how to scale linear algebra operations from a single machine to multiple machines, memory and CPU cores using Apache SystemML. Therefore we want you to understand how to migrate from a numpy program to a SystemML DML program. Don't worry. We will give you a lot of hints. Finally, you won't need this knowledge anyways if you are sticking to Keras only, but once you go beyond that point you'll be happy to see what's going on behind the scenes.\n\nSo the first thing we need to ensure is that we are on the latest version of SystemML, which is 1.2.0:\n\nThe steps are:\n- pip install\n- start execution at the cell with the version - check"
},
{
"metadata": {},
"cell_type": "code",
"source": "from IPython.display import Markdown, display\ndef printmd(string):\n display(Markdown('# <span style=\"color:red\">'+string+'</span>'))\n\n\nif ('sc' in locals() or 'sc' in globals()):\n printmd('<<<<<!!!!! It seems that you are running in a IBM Watson Studio Apache Spark Notebook. Please run it in an IBM Watson Studio Default Runtime (without Apache Spark) !!!!!>>>>>')\n \n",
"execution_count": 1,
"outputs": []
},
{
"metadata": {},
"cell_type": "code",
"source": "!pip install pyspark==2.4.5",
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": "Collecting pyspark==2.4.5\n Downloading pyspark-2.4.5.tar.gz (217.8 MB)\n\u001b[K |\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 217.8 MB 12 kB/s s eta 0:00:01MB 11.1 MB/s eta 0:00:18 |\u2588\u2588\u2588\u2588\u2588 | 33.5 MB 38.8 MB/s eta 0:00:05 |\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u258f | 130.4 MB 42.3 MB/s eta 0:00:03 |\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u258d | 199.7 MB 49.4 MB/s eta 0:00:01\ufffd\u2588\u258d | 207.1 MB 48.5 MB/s eta 0:00:01\n\u001b[?25hCollecting py4j==0.10.7\n Downloading py4j-0.10.7-py2.py3-none-any.whl (197 kB)\n\u001b[K |\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 197 kB 39.1 MB/s eta 0:00:01\n\u001b[?25hBuilding wheels for collected packages: pyspark\n Building wheel for pyspark (setup.py) ... \u001b[?25ldone\n\u001b[?25h Created wheel for pyspark: filename=pyspark-2.4.5-py2.py3-none-any.whl size=218257927 sha256=45da134efeece9a05d505ef977c4bd60ffad06c60479d839c8a1686c65665c1b\n Stored in directory: /tmp/wsuser/.cache/pip/wheels/01/c0/03/1c241c9c482b647d4d99412a98a5c7f87472728ad41ae55e1e\nSuccessfully built pyspark\nInstalling collected packages: py4j, pyspark\nSuccessfully installed py4j-0.10.7 pyspark-2.4.5\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "code",
"source": "!pip install systemml",
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": "Collecting systemml\n Downloading systemml-1.2.0.tar.gz (9.7 MB)\n\u001b[K |\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 9.7 MB 8.3 MB/s eta 0:00:01\n\u001b[?25hRequirement already satisfied: numpy>=1.8.2 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from systemml) (1.18.5)\nRequirement already satisfied: scipy>=0.15.1 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from systemml) (1.5.0)\nRequirement already satisfied: pandas in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from systemml) (1.0.5)\nRequirement already satisfied: scikit-learn in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from systemml) (0.23.1)\nRequirement already satisfied: Pillow>=2.0.0 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from systemml) (7.2.0)\nRequirement already satisfied: pytz>=2017.2 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from pandas->systemml) (2020.1)\nRequirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from pandas->systemml) (2.8.1)\nRequirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from scikit-learn->systemml) (2.1.0)\nRequirement already satisfied: joblib>=0.11 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from scikit-learn->systemml) (0.16.0)\nRequirement already satisfied: six>=1.5 in /opt/conda/envs/Python-3.7-main/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas->systemml) (1.15.0)\nBuilding wheels for collected packages: systemml\n Building wheel for systemml (setup.py) ... \u001b[?25ldone\n\u001b[?25h Created wheel for systemml: filename=systemml-1.2.0-py3-none-any.whl size=9724741 sha256=dd4a8b65a78be4f8bb754189000790aab3f58d93184a43e14bb4d2665b47bbb9\n Stored in directory: /tmp/wsuser/.cache/pip/wheels/a6/d6/32/93b56093b91654d5141d5e5e56b8b0d1fb2b12f4478c5f8235\nSuccessfully built systemml\nInstalling collected packages: systemml\nSuccessfully installed systemml-1.2.0\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "code",
"source": "from pyspark import SparkContext, SparkConf\nfrom pyspark.sql import SQLContext, SparkSession\nfrom pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType\nsc = SparkContext.getOrCreate(SparkConf().setMaster(\"local[*]\"))\nfrom pyspark.sql import SparkSession\nspark = SparkSession \\\n .builder \\\n .getOrCreate()",
"execution_count": 4,
"outputs": []
},
{
"metadata": {},
"cell_type": "code",
"source": "!mkdir -p /home/dsxuser/work/systemml",
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": "mkdir: cannot create directory \u2018/home/dsxuser\u2019: Permission denied\r\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "code",
"source": "from systemml import MLContext, dml\nimport numpy as np\nimport time\nml = MLContext(spark)\nml.setConfigProperty(\"sysml.localtmpdir\", \"mkdir /home/dsxuser/work/systemml\")\nprint(ml.version())\n \nif not ml.version() == '1.2.0':\n raise ValueError('please upgrade to SystemML 1.2.0, or restart your Kernel (Kernel->Restart & Clear Output)')",
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"text": "1.2.0\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Congratulations, if you see version 1.2.0, please continue with the notebook..."
},
{
"metadata": {},
"cell_type": "markdown",
"source": "We use an MLContext to interface with Apache SystemML. Note that we passed a SparkSession object as parameter so SystemML now knows how to talk to the Apache Spark cluster"
},
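{
"metadata": {},
"cell_type": "markdown",
"source": "*Added for illustration (not part of the original assignment):* a minimal round trip through the MLContext created above. We define a tiny one-line DML script, register `s` as an output, and pull the resulting scalar back into Python. The toy script and the variable names `toy` and `s` are our own."
},
{
"metadata": {},
"cell_type": "code",
"source": "# Minimal MLContext round trip (illustrative sketch, our own addition):\n# build a tiny DML script, mark 's' as an output, execute it and fetch the result.\ntoy = dml(\"s = sum(matrix(1.0, rows=3, cols=3))\").output('s')\nprint(ml.execute(toy).get('s'))  # expected: 9.0",
"execution_count": null,
"outputs": []
},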
{
"metadata": {},
"cell_type": "markdown",
"source": "Now we create some large random matrices to have numpy and SystemML crunch on it"
},
{
"metadata": {},
"cell_type": "code",
"source": "u = np.random.rand(1000,10000)\ns = np.random.rand(10000,1000)\nw = np.random.rand(1000,1000)",
"execution_count": 7,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "Now we implement a short one-liner to define a very simple linear algebra operation\n\nIn case you are unfamiliar with matrxi-matrix multiplication: https://en.wikipedia.org/wiki/Matrix_multiplication\n\nsum(U' * (W . (U * S)))\n\n\n| Legend | | \n| ------------- |-------------| \n| ' | transpose of a matrix | \n| * | matrix-matrix multiplication | \n| . | scalar multiplication | \n\n"
},
{
"metadata": {},
"cell_type": "code",
"source": "start = time.time()\nres = np.sum(u.T.dot(w * u.dot(s)))\nprint (time.time()-start)",
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"text": "1.2589201927185059\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "As you can see this executes perfectly fine. Note that this is even a very efficient execution because numpy uses a C/C++ backend which is known for it's performance. But what happens if U, S or W get such big that the available main memory cannot cope with it? Let's give it a try:"
},
{
"metadata": {},
"cell_type": "code",
"source": "#u = np.random.rand(10000,100000)\n#s = np.random.rand(100000,10000)\n#w = np.random.rand(10000,10000)",
"execution_count": 9,
"outputs": []
},
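{
"metadata": {},
"cell_type": "markdown",
"source": "A quick back-of-the-envelope calculation (our own addition) shows why the allocation above must fail: numpy stores float64 values at 8 bytes each, so each of the blown-up matrices alone already approaches or exceeds the main memory of a typical notebook runtime."
},
{
"metadata": {},
"cell_type": "code",
"source": "# Back-of-the-envelope memory estimate (illustrative, our own addition):\n# a dense float64 matrix needs rows * cols * 8 bytes.\nfor rows, cols in [(10000, 100000), (100000, 10000), (10000, 10000)]:\n    gb = rows * cols * 8 / 1e9\n    print('%d x %d -> %.1f GB' % (rows, cols, gb))\n# ~8 GB each for U and S (plus intermediates) exceeds a typical 4-8 GB runtime.",
"execution_count": null,
"outputs": []
},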
{
"metadata": {},
"cell_type": "markdown",
"source": "After a short while you should see a memory error. This is because the operating system process was not able to allocate enough memory for storing the numpy array on the heap. Now it's time to re-implement the very same operations as DML in SystemML, and this is your task. Just replace all ###your_code_goes_here sections with proper code, please consider the following table which contains all DML syntax you need:\n\n| Syntax | | \n| ------------- |-------------| \n| t(M) | transpose of a matrix, where M is the matrix | \n| %*% | matrix-matrix multiplication | \n| * | scalar multiplication | \n\n## Task"
},
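{
"metadata": {},
"cell_type": "markdown",
"source": "Before you tackle the task, here is a tiny self-contained warm-up (our own addition, not the solution) showing the three DML constructs from the table above on a 2x2 matrix:"
},
{
"metadata": {},
"cell_type": "code",
"source": "# Tiny DML syntax demo (illustrative sketch, our own addition):\n# t() transposes, %*% is matrix-matrix multiplication, * is element-wise.\ndemo = \"\"\"\nA = matrix(\"1 2 3 4\", rows=2, cols=2)\nB = t(A)            # transpose\nC = A %*% B         # matrix-matrix multiplication\nD = A * A           # element-wise multiplication\nout = sum(C) + sum(D)\n\"\"\"\nprint(ml.execute(dml(demo).output('out')).get('out'))  # expected: 82.0",
"execution_count": null,
"outputs": []
},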
{
"metadata": {},
"cell_type": "markdown",
"source": "In order to show you the advantage of SystemML over numpy we've blown up the sizes of the matrices. Unfortunately, on a 1-2 worker Spark cluster it takes quite some time to complete. Therefore we've stripped down the example to smaller matrices below, but we've kept the code, just in case you are curious to check it out. But you might want to use some more workers which you easily can configure in the environment settings of the project within Watson Studio. Just be aware that you're currently limited to free 50 capacity unit hours per month wich are consumed by the additional workers."
},
{
"metadata": {},
"cell_type": "code",
"source": "script = \"\"\"\nU = rand(rows=1000,cols=10000)\nS = rand(rows=10000,cols=1000)\nW = rand(rows=1000,cols=1000)\nres = sum(t(U) %*% (W * (U%*% S)))\n\"\"\"",
"execution_count": 10,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "To get consistent results we switch from a random matrix initialization to something deterministic"
},
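{
"metadata": {},
"cell_type": "markdown",
"source": "*Our own addition (a sketch, not the graded script):* DML's `rand` accepts a `seed` argument, so seeding every matrix makes repeated runs reproducible."
},
{
"metadata": {},
"cell_type": "code",
"source": "# Deterministic initialization (illustrative sketch, our own addition):\n# rand() in DML accepts a seed argument, making the result reproducible.\nseeded = \"\"\"\nU = rand(rows=1000, cols=10000, seed=42)\nS = rand(rows=10000, cols=1000, seed=42)\nW = rand(rows=1000, cols=1000, seed=42)\nres = sum(t(U) %*% (W * (U %*% S)))\n\"\"\"\nprint(ml.execute(dml(seeded).output('res')).get('res'))",
"execution_count": null,
"outputs": []
},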
{
"metadata": {},
"cell_type": "code",
"source": "prog = dml(script).output('res')\nres = ml.execute(prog).get('res')\nprint(res)",
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": "ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7ANTLR Tool version 4.5.3 used for code generation does not match the current runtime version 4.7ANTLR Runtime version 4.5.3 used for parser compilation does not match the current runtime version 4.7\nSystemML Statistics:\nTotal execution time:\t\t11.143 sec.\nNumber of executed Spark inst:\t0.\n\n\n6248244030210.443\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "If everything runs fine you should get *6252492444241.075* as result (or something in that bullpark). Feel free to submit your DML script to the grader now!\n\n### Submission"
},
{
"metadata": {},
"cell_type": "code",
"source": "!rm -f rklib.py\n!wget https://raw.githubusercontent.com/romeokienzler/developerWorks/master/coursera/ai/rklib.py",
"execution_count": 12,
"outputs": [
{
"output_type": "stream",
"text": "--2020-11-19 22:48:22-- https://raw.githubusercontent.com/romeokienzler/developerWorks/master/coursera/ai/rklib.py\nResolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.112.133\nConnecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.\nHTTP request sent, awaiting response... 200 OK\nLength: 2289 (2.2K) [text/plain]\nSaving to: \u2018rklib.py\u2019\n\nrklib.py 100%[===================>] 2.24K --.-KB/s in 0s \n\n2020-11-19 22:48:23 (19.9 MB/s) - \u2018rklib.py\u2019 saved [2289/2289]\n\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "code",
"source": "from rklib import submit\nkey = \"esRk7vn-Eeej-BLTuYzd0g\"\n\n\nemail = \"[email protected]\"",
"execution_count": 13,
"outputs": []
},
{
"metadata": {},
"cell_type": "code",
"source": "part = \"fUxc8\"\ntoken = \"w9WCu1B9vOyzbKW3\" #you can obtain it from the grader page on Coursera (have a look here if you need more information on how to obtain the token https://youtu.be/GcDo0Rwe06U?t=276)\nsubmit(email, token, key, part, [part], script)",
"execution_count": 14,
"outputs": [
{
"output_type": "stream",
"text": "Submission successful, please check on the coursera grader page for the status\n-------------------------\n{\"elements\":[{\"itemId\":\"P1p3F\",\"id\":\"tE4j0qhMEeecqgpT6QjMdA~P1p3F~WgGPuyq5Eeuzsw4eXVq3sw\",\"courseId\":\"tE4j0qhMEeecqgpT6QjMdA\"}],\"paging\":{},\"linked\":{}}\n-------------------------\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.7",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.7.9",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
}
},
"nbformat": 4,
"nbformat_minor": 1
}