Created
July 24, 2020 00:37
-
-
Save Ben-Epstein/b5dc591d59106539a02012ec6267c37f to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{"cells":[{"metadata":{},"cell_type":"markdown","source":"# Splice + MLflow: What you need to know\n<blockquote><p class='quotation'><span style='font-size:15px'>Mlflow allows you to track experiments and share results with teammates easily.<br>At Splice Machine, MLflow is embedded directly into your database (MLManager). This means that all of the configuration is taken care of for you, and <b>everything</b> you track in mlflow is persisted to the database.<br><br>\n MLflow requires the NSDS (or ExtNSDS) as a parameter to connect to the database. If are unfamliar with our NSDS, check out the <a href=\"./7.1 Splice and Spark.ipynb\">previous notebook</a> on using Splice Machine and Spark.<footer>Splice Machine</footer>\n</blockquote>\n\n#### Let's start our Spark Session"},{"metadata":{"trusted":true},"cell_type":"code","source":"# Setup\nfrom pyspark.sql import SparkSession\nfrom splicemachine.spark import PySpliceContext\n\nspark = SparkSession.builder.getOrCreate()\nsplice = PySpliceContext(spark)\n","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Importing MLflow Support\n<blockquote><p class='quotation'><span style='font-size:15px'>Using MLflow on Splice is as easy as a single import. After imporing, you immediately have access to the <code>mlflow</code> module. <br>You will have access to all of the functions in the standard MLflow API as well as some extra ones that are custom to Splice Machine.<br> You can check out our full <a href='https://pysplice.readthedocs.io/en/latest/splicemachine.mlflow_support.html'>documentation</a> for everything available and our <a href=\"https://www.github.com/splicemachine/pysplice\">GitHub</a> repo to raise issues and ask questions. <br>After importing, you can register your Splice Context for access to even more functions.<br><br><footer>Splice Machine</footer>\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"# MLFlow Setup\nfrom splicemachine.mlflow_support import *\nmlflow.register_splice_context(splice)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Step 0: The MLflow UI\n<blockquote> You can access the MLflow UI in 2 ways:\n <ul>\n <li>From the url at <a href=/mlflow>/mlflow</a></li>\n <li>From the Notebook as an IFrame using the <code>get_mlflow_ui</code> function. You can also pass in an optional experiment ID and/or run ID to open the IFrame directly to your experiment/run.</li>\n </ul>\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"from splicemachine.notebook import get_mlflow_ui\nget_mlflow_ui()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## MLflow concepts\n<blockquote>MLflow Tracking is organized around the concept <code>experiments</code> and <code>runs</code>:<br> \n <ul>\n <li>Experiments can be thought of as the problem you are trying to track or solve (ie Performance Testing TPC-C</li>\n <li>Runs are single executions of some piece of code (ie a single full execution of TPC-C with some database configuration). Experiments have multiple runs (1-to-many).</li>\n </ul>\n</blockquote>"},{"metadata":{},"cell_type":"markdown","source":"### Setting an Experiment\n<blockquote>To start an Experiment, you can call <code>mlflow.set_experiment('EXP_NAME')</code> and pass in an experiment name.<br> \n If the Experiment exists, it will be set to the <code>active</code> experiment. Otherwise, mlflow will create the Experiment for you and set it to active.\n\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"help(mlflow.set_experiment)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.set_experiment('mlflow_api_demo')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### View Your [Experiment](/mlflow)"},{"metadata":{"trusted":true},"cell_type":"code","source":"exp_id = mlflow.client.get_experiment_by_name('mlflow_api_demo').experiment_id\nget_mlflow_ui(exp_id)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Starting a run\n<blockquote>Once you have an Experiment, you can start your run by calling <code>mlflow.start_run(run_name='RUN_NAME')</code> and pass in a run name. You can also pass in the optional <code>tags</code> parameter as a dictionary and store key value pairs associated to the run.<br> \nWhen you start a run, MLFlow (MLManager) automatically logs some information for you:\n <ul>\n <li>Start Date</li>\n <li>Current User</li>\n <li>Run ID</li>\n <li>DB Transaction ID</li>\n <li>Source (where the run came from)</li>\n </ul>\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"help(mlflow.start_run)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.start_run(run_name='First_pass_default_settings', tags={'team': 'pd', 'purpose':'performance testing'})","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Tracking Concepts\n<blockquote>There are 4 main conepts when tracking a run:<br>\n <ul>\n <li><b>Tags</b>: Any key value pair that likely won't be used for comparison between runs (non-measurable items). Only tags can be overwritten</li>\n <li><b>Parameters</b>: Configuration options that were made before starting the run that may have a measurable effect on the outcome</li>\n <li><b>Metrics</b>: The measured outcomes between runs that can be compared. Metrics have an optional <code>step</code> parmeter if you want to track metrics over time for a specific run</li>\n <li><b>Artifacts</b>: Objects (files, images, notebooks, etc) to be associated with a run</li>\n </ul>\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"help(mlflow.set_tag)\nprint('---------------------------------------------------------------------------------')\nhelp(mlflow.lp)\nprint('---------------------------------------------------------------------------------')\nhelp(mlflow.lm)\nprint('---------------------------------------------------------------------------------')\nhelp(mlflow.log_artifact)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.set_tag('teammates', 'carol, daniel')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.lp('spark executors', '5')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.lm('execution time sec', 25)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Setting metrics over \"steps\"\nfor i in range(10):\n mlflow.lm('Build time', i*3, step=i)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"get_mlflow_ui(mlflow.current_exp_id(), mlflow.current_run_id())","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### End Run\n<blockquote>When you finish a run, you call <code>mlflow.end_run()</code>.<br> You know a run is ended in the MLFlow UI because there is a green check mark next to it</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.end_run()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Artifacts\n<blockquote>Artifacts can be any file type. The artifact is serialized as a BLOB and stored in the database. When storing artifacts in the database, files with file extensions such as <code>.txt</code>, <code>.pdf</code>, <code>.yaml</code>, <code>.pdf</code>, <code>.jpeg</code> etc. will be available for preview in the mlflow ui <br>We can use some neat Jupyter tricks like <code>writefile</code> to make artifacts even more useful.\n</blockquote>\n\n#### Write a yaml file"},{"metadata":{"trusted":true},"cell_type":"code","source":"%%writefile my_env.yaml\n\nname: datatest \nchannels:\n- defaults\n- conda-forge\n- ericmjl\ndependencies:\n- python=3.6\n- colorama=0.3.9\n- jupyter=1.0.0\n- ipykernel=4.6.1\n- jupyterlab=0.25.2\n- pytest=3.1.3\n- pytest-cov=2.5.1\n- tinydb=3.3.1\n- pyyaml=3.12\n- pandas-summary=0.0.41\n- environment_kernels=1.1\n- missingno=0.3.7\n","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Write a code snippet"},{"metadata":{"trusted":true},"cell_type":"code","source":"%%writefile harm_mean.py\ndef harm_mean(nums, rnd=4):\n \"\"\"\n Calculates the harmonic mean of n numbers rounded to rnd decimal places\n :param nums: List of numbers\n :param rnd: Number of decimal places to round the result\n \"\"\"\n return round(len(nums)/sum([1/i for i in nums]),rnd)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Put it together\n#### Start a run, log our artifacts, view the results"},{"metadata":{"trusted":true},"cell_type":"code","source":"!jupyter nbconvert --to html '7.2 Splice MLflow Support.ipynb'\nwith mlflow.start_run(run_name='environment_requirements'):\n run_id = mlflow.current_run_id()\n exp_id = mlflow.current_exp_id()\n mlflow.log_artifact('my_env.yaml', name='my_env.yaml')\n mlflow.log_artifact('harm_mean.py', name='harm_mean.py')\n mlflow.log_artifact('7.2 Splice MLflow Support.ipynb', name='training_notebook.ipynb')\n mlflow.log_artifact('7.2 Splice MLflow Support.html', name='training_notebook.html')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Click on one of your artifacts to render the results!"},{"metadata":{"scrolled":false,"trusted":true},"cell_type":"code","source":"get_mlflow_ui(exp_id, run_id)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Another Artifact Example"},{"metadata":{"trusted":true},"cell_type":"code","source":"import matplotlib.pyplot as plt\nfrom random import random\nwith mlflow.start_run(run_name='my_plot'):\n plt.rcParams.update({\n \"pgf.texsystem\": \"pdflatex\",\n \"pgf.preamble\": [\n r\"\\usepackage[utf8x]{inputenc}\",\n r\"\\usepackage[T1]{fontenc}\",\n r\"\\usepackage{cmbright}\",\n ]\n })\n\n plt.figure(figsize=(4.5, 2.5))\n plt.plot([random()*19 for _ in range(10)])\n plt.text(0.5, 3., \"serif\", family=\"serif\")\n plt.text(0.5, 2., \"monospace\", family=\"monospace\")\n plt.text(2.5, 2., \"sans-serif\", family=\"sans-serif\")\n plt.xlabel(r\"µ is not $\\mu$\")\n plt.tight_layout(.5)\n\n plt.savefig(\"pgf_texsystem.png\")\n mlflow.log_artifact('pgf_texsystem.png', 'results.png')\n rid = mlflow.current_run_id()\n eid = mlflow.current_exp_id()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"get_mlflow_ui(eid,rid)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Context Managers in Runs\n<blockquote>There are 2 Context Managers in MLManager/MLflow. <code>start_run</code> and <code>timer</code>.<br>\nContext managers enable some autologging and cleanup functions for you. To use a Context Manager, prepend the command with the <code>with</code> call append a <code>:</code> after the call, and indent all lines after it.<br>\nAnother great feature is if the run fails for some reason MLflow will track that for you</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"with mlflow.start_run(run_name='run with context manager'):\n mlflow.lp('foo','bar')\n mlflow.lm('score', 92)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"with mlflow.start_run(run_name='a run that failed'):\n raise Exception","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from time import sleep\n# Multiple context managers\nwith mlflow.start_run(run_name='using the timer too'):\n with mlflow.timer('run time'):\n sleep(2)\n print('done!')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Timers are default stored as parameters, but can also be stored as metrics"},{"metadata":{"trusted":true},"cell_type":"code","source":"from time import sleep\n# Multiple context managers\nwith mlflow.start_run(run_name='using the timer as a metric'):\n with mlflow.timer('run time', param=False):\n sleep(2)\n print('done!')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Nested Runs\n<blockquote>MLFlow supports the concept of <code>nested</code> runs. A nested run is a run that occurs underneath a parent run. In machine learning, this could be used for hyperparmeter tuning (like choosing K in a k-means clustering algorithm). But it can be used for anything you find useful.<br> To use it, simply pass <code>nested=True</code> to the <code>start_run</code> function</blockquote>"},{"metadata":{"scrolled":false,"trusted":true},"cell_type":"code","source":"from random import randint, sample\nfrom time import sleep\nfrom tqdm.notebook import tqdm\nexec_time = [1,3,5,2]\nnum_execs = []\nwith mlflow.start_run(run_name='parent run'):\n for i in tqdm(range(4)):\n with mlflow.start_run(run_name=f'child {i+1}', nested=True):\n with mlflow.timer('run time', param=False):\n sleep(exec_time[i])\n mlflow.set_tag('child', 'yes')\n mlflow.lp('num_executors', i+1)\n num_execs.append(i+1)\n # Plot results\n plt.figure(figsize=(4.5, 2.5))\n plt.plot(num_execs, exec_time)\n\n plt.ylabel('exec time')\n plt.xlabel('num executors')\n plt.tight_layout(.5)\n plt.savefig(\"spark_results.png\")\n mlflow.log_artifact('spark_results.png','spark_results.png')\n display(get_mlflow_ui(mlflow.current_exp_id()))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Storing ML Models\n<blockquote><p class='quotation'><span style='font-size:15px'>Just like everything else we've tracked so far, tracking ML Models is easy with Splice Machine's MLManager. The <code>log_model</code> and <code>load_model</code> functions are all you need. \n <footer>Splice Machine</footer> \n</blockquote>\n\n#### Let's try it out"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn import svm\nfrom sklearn import datasets\nfrom sklearn.metrics import accuracy_score\n\n# Start a run\nwith mlflow.start_run(run_name='my first model'):\n # Load some sklearn data\n digits = datasets.load_digits()\n\n # Build a simple model\n clf = svm.SVC(gamma=0.001, C=100.)\n # Log parameters to mlflow\n mlflow.lp('gamma', 0.001)\n mlflow.lp('C', 100.0)\n\n # Train the model\n with mlflow.timer('train_time'):\n clf.fit(digits.data[:-1], digits.target[:-1])\n\n # Predict with some data\n preds = clf.predict(digits.data[:-1])\n\n # Measure accuracy\n acc = accuracy_score(digits.target[:-1], preds)\n print('Accuracy:',acc)\n # Log metric to mlflow\n mlflow.lm('accuracy', acc)\n \n # Save model\n mlflow.log_model(clf, 'clf_model')\n rid = mlflow.current_run_id()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Load our model back and make new predictions"},{"metadata":{"scrolled":true,"trusted":true},"cell_type":"code","source":"loaded_model = mlflow.load_model(run_id=rid, name='clf_model')\ndisplay(loaded_model)\n# Make a new prediction\nnew_data = [ \n 0., 0., 12., 10., 0., 0., 0., 0., 0., 0., 14., 16., 16.,\n 14., 0., 0., 0., 0., 13., 16., 15., 10., 1., 0., 0., 0.,\n 11., 16., 16., 7., 0., 0., 0., 0., 0., 4., 7., 16., 7.,\n 0., 0., 0., 0., 0., 4., 16., 9., 0., 0., 0., 5., 4.,\n 12., 16., 4., 0., 0., 0., 9., 16., 16., 10., 0., 0.\n]\nprint('Prediction on new data:', loaded_model.predict([new_data])[0])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"spark.stop()","execution_count":32,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# Fantastic!\n<blockquote> \nNow you have all of the tools necessary to start tracking your experiments and sharing results! Again, feel free to check out our <a href=\"https://pysplice.readthedocs.io/en/latest/splicemachine.mlflow_support.html\">documentation</a>!<br><br>\n Next Up: <a href='./7.3 Data Exploration.ipynb'>Using MLManager to explore and analyze your data</a>\n<footer>Splice Machine</footer>\n</blockquote>"}],"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3","language":"python"},"language_info":{"name":"python","version":"3.7.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"toc":{"nav_menu":{},"number_sections":false,"sideBar":true,"skip_h1_title":false,"base_numbering":1,"title_cell":"Table of Contents","title_sidebar":"Contents","toc_cell":false,"toc_position":{"height":"calc(100% - 180px)","left":"10px","top":"150px","width":"212px"},"toc_section_display":true,"toc_window_display":false}},"nbformat":4,"nbformat_minor":4} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Awesome ToolBox, thanks