Ben-Epstein · July 24, 2020 00:37 · AllieUbisse · Aug 10, 2020
diff --git a/7.2 Splice MLflow Support.ipynb b/7.2 Splice MLflow Support.ipynb
 {"cells":[{"metadata":{},"cell_type":"markdown","source":"# Splice + MLflow: What you need to know\n<blockquote><p class='quotation'><span style='font-size:15px'>MLflow allows you to track experiments and share results with teammates easily.<br>At Splice Machine, MLflow is embedded directly into your database (MLManager). This means that all of the configuration is taken care of for you, and <b>everything</b> you track in MLflow is persisted to the database.<br><br>\n    MLflow requires the NSDS (or ExtNSDS) as a parameter to connect to the database. If you are unfamiliar with our NSDS, check out the <a href=\"./7.1 Splice and Spark.ipynb\">previous notebook</a> on using Splice Machine and Spark.<footer>Splice Machine</footer>\n</blockquote>\n\n#### Let's start our Spark Session"},{"metadata":{"trusted":true},"cell_type":"code","source":"# Setup\nfrom pyspark.sql import SparkSession\nfrom splicemachine.spark import PySpliceContext\n\nspark = SparkSession.builder.getOrCreate()\nsplice = PySpliceContext(spark)\n","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Importing MLflow Support\n<blockquote><p class='quotation'><span style='font-size:15px'>Using MLflow on Splice is as easy as a single import. After importing, you immediately have access to the <code>mlflow</code> module. <br>You will have access to all of the functions in the standard MLflow API as well as some extra ones that are custom to Splice Machine.<br> You can check out our full <a href='https://pysplice.readthedocs.io/en/latest/splicemachine.mlflow_support.html'>documentation</a> for everything available and our <a href=\"https://www.github.com/splicemachine/pysplice\">GitHub</a> repo to raise issues and ask questions. 
<br>After importing, you can register your Splice Context for access to even more functions.<br><br><footer>Splice Machine</footer>\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"# MLflow Setup\nfrom splicemachine.mlflow_support import *\nmlflow.register_splice_context(splice)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Step 0: The MLflow UI\n<blockquote> You can access the MLflow UI in 2 ways:\n    <ul>\n        <li>From the URL at <a href=/mlflow>/mlflow</a></li>\n        <li>From the Notebook as an IFrame using the <code>get_mlflow_ui</code> function. You can also pass in an optional experiment ID and/or run ID to open the IFrame directly to your experiment/run.</li>\n    </ul>\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"from splicemachine.notebook import get_mlflow_ui\nget_mlflow_ui()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## MLflow concepts\n<blockquote>MLflow Tracking is organized around the concepts of <code>experiments</code> and <code>runs</code>:<br> \n    <ul>\n        <li>Experiments can be thought of as the problem you are trying to track or solve (e.g. performance testing TPC-C)</li>\n        <li>Runs are single executions of some piece of code (e.g. a single full execution of TPC-C with some database configuration). Experiments have multiple runs (1-to-many).</li>\n    </ul>\n</blockquote>"},{"metadata":{},"cell_type":"markdown","source":"### Setting an Experiment\n<blockquote>To start an Experiment, call <code>mlflow.set_experiment('EXP_NAME')</code> with an experiment name.<br> \n    If the Experiment exists, it becomes the <code>active</code> experiment. 
Otherwise, MLflow will create the Experiment for you and set it to active.\n\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"help(mlflow.set_experiment)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.set_experiment('mlflow_api_demo')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### View Your [Experiment](/mlflow)"},{"metadata":{"trusted":true},"cell_type":"code","source":"exp_id = mlflow.client.get_experiment_by_name('mlflow_api_demo').experiment_id\nget_mlflow_ui(exp_id)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Starting a run\n<blockquote>Once you have an Experiment, you can start your run by calling <code>mlflow.start_run(run_name='RUN_NAME')</code> and passing in a run name. You can also pass in the optional <code>tags</code> parameter as a dictionary to store key-value pairs associated with the run.<br> \nWhen you start a run, MLflow (MLManager) automatically logs some information for you:\n    <ul>\n        <li>Start Date</li>\n        <li>Current User</li>\n        <li>Run ID</li>\n        <li>DB Transaction ID</li>\n        <li>Source (where the run came from)</li>\n    </ul>\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"help(mlflow.start_run)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.start_run(run_name='First_pass_default_settings', tags={'team': 'pd', 'purpose':'performance testing'})","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Tracking Concepts\n<blockquote>There are 4 main concepts when tracking a run:<br>\n    <ul>\n        <li><b>Tags</b>: Any key-value pair that likely won't be used for comparison between runs (non-measurable items). 
Only tags can be overwritten</li>\n        <li><b>Parameters</b>: Configuration options chosen before starting the run that may have a measurable effect on the outcome</li>\n        <li><b>Metrics</b>: The measured outcomes that can be compared between runs. Metrics have an optional <code>step</code> parameter if you want to track metrics over time for a specific run</li>\n        <li><b>Artifacts</b>: Objects (files, images, notebooks, etc.) to be associated with a run</li>\n    </ul>\n</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"help(mlflow.set_tag)\nprint('---------------------------------------------------------------------------------')\nhelp(mlflow.lp)\nprint('---------------------------------------------------------------------------------')\nhelp(mlflow.lm)\nprint('---------------------------------------------------------------------------------')\nhelp(mlflow.log_artifact)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.set_tag('teammates', 'carol, daniel')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.lp('spark executors', '5')","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.lm('execution time sec', 25)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"# Setting metrics over \"steps\"\nfor i in range(10):\n    mlflow.lm('Build time', i*3, step=i)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"get_mlflow_ui(mlflow.current_exp_id(), mlflow.current_run_id())","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### End Run\n<blockquote>When you finish a run, call <code>mlflow.end_run()</code>.<br> A run that has ended is marked in the MLflow UI with a green check mark next to 
it</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"mlflow.end_run()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Artifacts\n<blockquote>Artifacts can be any file type. The artifact is serialized as a BLOB and stored in the database. Files with extensions such as <code>.txt</code>, <code>.pdf</code>, <code>.yaml</code>, <code>.jpeg</code> etc. will be available for preview in the MLflow UI.<br>We can use some neat Jupyter tricks like <code>writefile</code> to make artifacts even more useful.\n</blockquote>\n\n#### Write a YAML file"},{"metadata":{"trusted":true},"cell_type":"code","source":"%%writefile my_env.yaml\n\nname: datatest  \nchannels:\n- defaults\n- conda-forge\n- ericmjl\ndependencies:\n- python=3.6\n- colorama=0.3.9\n- jupyter=1.0.0\n- ipykernel=4.6.1\n- jupyterlab=0.25.2\n- pytest=3.1.3\n- pytest-cov=2.5.1\n- tinydb=3.3.1\n- pyyaml=3.12\n- pandas-summary=0.0.41\n- environment_kernels=1.1\n- missingno=0.3.7\n","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Write a code snippet"},{"metadata":{"trusted":true},"cell_type":"code","source":"%%writefile harm_mean.py\ndef harm_mean(nums, rnd=4):\n    \"\"\"\n    Calculates the harmonic mean of n numbers rounded to rnd decimal places\n    :param nums: List of numbers\n    :param rnd: Number of decimal places to round the result\n    \"\"\"\n    return round(len(nums) / sum(1/i for i in nums), rnd)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Put it together\n#### Start a run, log our artifacts, view the results"},{"metadata":{"trusted":true},"cell_type":"code","source":"!jupyter nbconvert --to html '7.2 Splice MLflow Support.ipynb'\nwith mlflow.start_run(run_name='environment_requirements'):\n    run_id = mlflow.current_run_id()\n    exp_id = mlflow.current_exp_id()\n    
mlflow.log_artifact('my_env.yaml', name='my_env.yaml')\n    mlflow.log_artifact('harm_mean.py', name='harm_mean.py')\n    mlflow.log_artifact('7.2 Splice MLflow Support.ipynb', name='training_notebook.ipynb')\n    mlflow.log_artifact('7.2 Splice MLflow Support.html', name='training_notebook.html')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Click on one of your artifacts to render the results!"},{"metadata":{"scrolled":false,"trusted":true},"cell_type":"code","source":"get_mlflow_ui(exp_id, run_id)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Another Artifact Example"},{"metadata":{"trusted":true},"cell_type":"code","source":"import matplotlib.pyplot as plt\nfrom random import random\nwith mlflow.start_run(run_name='my_plot'):\n    plt.rcParams.update({\n        \"pgf.texsystem\": \"pdflatex\",\n        \"pgf.preamble\": [\n             r\"\\usepackage[utf8x]{inputenc}\",\n             r\"\\usepackage[T1]{fontenc}\",\n             r\"\\usepackage{cmbright}\",\n             ]\n    })\n\n    plt.figure(figsize=(4.5, 2.5))\n    plt.plot([random()*19 for _ in range(10)])\n    plt.text(0.5, 3., \"serif\", family=\"serif\")\n    plt.text(0.5, 2., \"monospace\", family=\"monospace\")\n    plt.text(2.5, 2., \"sans-serif\", family=\"sans-serif\")\n    plt.xlabel(r\"µ is not $\\mu$\")\n    plt.tight_layout(pad=.5)\n\n    plt.savefig(\"pgf_texsystem.png\")\n    mlflow.log_artifact('pgf_texsystem.png', 'results.png')\n    rid = mlflow.current_run_id()\n    eid = mlflow.current_exp_id()","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"get_mlflow_ui(eid,rid)","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Context Managers in Runs\n<blockquote>There are 2 Context Managers in MLManager/MLflow: 
<code>start_run</code> and <code>timer</code>.<br>\nContext managers enable some autologging and cleanup functions for you. To use a Context Manager, prefix the call with the <code>with</code> keyword, append a <code>:</code> after the call, and indent the lines that belong to the block.<br>\nAnother benefit: if the run fails for any reason, MLflow records the failure for you.</blockquote>"},{"metadata":{"trusted":true},"cell_type":"code","source":"with mlflow.start_run(run_name='run with context manager'):\n    mlflow.lp('foo','bar')\n    mlflow.lm('score', 92)","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"with mlflow.start_run(run_name='a run that failed'):\n    raise Exception","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"from time import sleep\n# Multiple context managers\nwith mlflow.start_run(run_name='using the timer too'):\n    with mlflow.timer('run time'):\n        sleep(2)\n    print('done!')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Timers are stored as parameters by default, but can also be stored as metrics"},{"metadata":{"trusted":true},"cell_type":"code","source":"from time import sleep\n# Multiple context managers\nwith mlflow.start_run(run_name='using the timer as a metric'):\n    with mlflow.timer('run time', param=False):\n        sleep(2)\n    print('done!')","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"### Nested Runs\n<blockquote>MLflow supports the concept of <code>nested</code> runs. A nested run is a run that occurs underneath a parent run. In machine learning, this could be used for hyperparameter tuning (like choosing K in a k-means clustering algorithm). 
But it can be used for anything you find useful.<br> To use it, simply pass <code>nested=True</code> to the <code>start_run</code> function.</blockquote>"},{"metadata":{"scrolled":false,"trusted":true},"cell_type":"code","source":"import matplotlib.pyplot as plt\nfrom time import sleep\nfrom tqdm.notebook import tqdm\nexec_time = [1,3,5,2]\nnum_execs = []\nwith mlflow.start_run(run_name='parent run'):\n    for i in tqdm(range(4)):\n        with mlflow.start_run(run_name=f'child {i+1}', nested=True):\n            with mlflow.timer('run time', param=False):\n                sleep(exec_time[i])\n            mlflow.set_tag('child', 'yes')\n            mlflow.lp('num_executors', i+1)\n            num_execs.append(i+1)\n    # Plot results\n    plt.figure(figsize=(4.5, 2.5))\n    plt.plot(num_execs, exec_time)\n\n    plt.ylabel('exec time')\n    plt.xlabel('num executors')\n    plt.tight_layout(pad=.5)\n    plt.savefig(\"spark_results.png\")\n    mlflow.log_artifact('spark_results.png','spark_results.png')\n    display(get_mlflow_ui(mlflow.current_exp_id()))","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"## Storing ML Models\n<blockquote><p class='quotation'><span style='font-size:15px'>Just like everything else we've tracked so far, tracking ML Models is easy with Splice Machine's MLManager. The <code>log_model</code> and <code>load_model</code> functions are all you need. 
\n    <footer>Splice Machine</footer>   \n</blockquote>\n\n#### Let's try it out"},{"metadata":{"trusted":true},"cell_type":"code","source":"from sklearn import svm\nfrom sklearn import datasets\nfrom sklearn.metrics import accuracy_score\n\n# Start a run\nwith mlflow.start_run(run_name='my first model'):\n    # Load some sklearn data\n    digits = datasets.load_digits()\n\n    # Build a simple model\n    clf = svm.SVC(gamma=0.001, C=100.)\n    # Log parameters to mlflow\n    mlflow.lp('gamma', 0.001)\n    mlflow.lp('C', 100.0)\n\n    # Train the model\n    with mlflow.timer('train_time'):\n        clf.fit(digits.data[:-1], digits.target[:-1])\n\n    # Predict with some data\n    preds = clf.predict(digits.data[:-1])\n\n    # Measure accuracy\n    acc = accuracy_score(digits.target[:-1], preds)\n    print('Accuracy:',acc)\n    # Log metric to mlflow\n    mlflow.lm('accuracy', acc)\n    \n    # Save model\n    mlflow.log_model(clf, 'clf_model')\n    rid = mlflow.current_run_id()","execution_count":null,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"#### Load our model back and make new predictions"},{"metadata":{"scrolled":true,"trusted":true},"cell_type":"code","source":"loaded_model = mlflow.load_model(run_id=rid, name='clf_model')\ndisplay(loaded_model)\n# Make a new prediction\nnew_data = [ \n    0.,  0., 12., 10.,  0.,  0.,  0.,  0.,  0.,  0., 14., 16., 16.,\n    14.,  0.,  0.,  0.,  0., 13., 16., 15., 10.,  1.,  0.,  0.,  0.,\n    11., 16., 16.,  7.,  0.,  0.,  0.,  0.,  0.,  4.,  7., 16.,  7.,\n    0.,  0.,  0.,  0.,  0.,  4., 16.,  9.,  0.,  0.,  0.,  5.,  4.,\n    12., 16.,  4.,  0.,  0.,  0.,  9., 16., 16., 10.,  0.,  0.\n]\nprint('Prediction on new data:', loaded_model.predict([new_data])[0])","execution_count":null,"outputs":[]},{"metadata":{"trusted":true},"cell_type":"code","source":"spark.stop()","execution_count":32,"outputs":[]},{"metadata":{},"cell_type":"markdown","source":"# Fantastic!\n<blockquote> \nNow you have all of the tools 
necessary to start tracking your experiments and sharing results! Again, feel free to check out our <a href=\"https://pysplice.readthedocs.io/en/latest/splicemachine.mlflow_support.html\">documentation</a>!<br><br>\n    Next Up: <a href='./7.3 Data Exploration.ipynb'>Using MLManager to explore and analyze your data</a>\n<footer>Splice Machine</footer>\n</blockquote>"}],"metadata":{"kernelspec":{"name":"python3","display_name":"Python 3","language":"python"},"language_info":{"name":"python","version":"3.7.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"toc":{"nav_menu":{},"number_sections":false,"sideBar":true,"skip_h1_title":false,"base_numbering":1,"title_cell":"Table of Contents","title_sidebar":"Contents","toc_cell":false,"toc_position":{"height":"calc(100% - 180px)","left":"10px","top":"150px","width":"212px"},"toc_section_display":true,"toc_window_display":false}},"nbformat":4,"nbformat_minor":4}