@carolineh101
Last active September 27, 2017 19:27
Stanford Azure Machine Learning Workshop

Azure Machine Learning: Titanic Survival

Originally written by Christine Matheney

Goal

Use Azure Machine Learning to predict survival on the Titanic!

Before Starting...

If you don't already have a Microsoft Azure account, sign up here!

Get the Data

  1. Sign up for an account on https://www.kaggle.com.
  2. Go to https://www.kaggle.com/c/titanic.
  3. Click the data tab at the left in the dashboard.
  4. Download the train CSV file and open it in Excel.

Feature Engineering

  1. Add a column for another feature.
  • Example: Let's try Age*Class. Type that text into cell M1 as the column header.
  • Enter the following formula into cell M2: =F2*C2 (column F is Age and column C is Pclass).
  • Grab the bottom right corner of M2 and drag it down to the end of the data to fill the formula down the column.
  • To keep the values but clear the formula, select all of column M, copy it, and paste it back into the same place.
  • A small box will pop up in the corner. Select it and choose "Values only."
  2. Add more features that you think will help your model better predict survivors!
  • Check the Kaggle site for hints and to understand the data.
  • LEFT(<cell>,1) returns the first character in a cell.
  • 1 British pound in 1912 = $87.66 today
  • When you are done, save the spreadsheet as a CSV file (if it wasn't already).
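
If you would rather script these features than build them in Excel, the same two ideas (the Age*Class product and the LEFT(<cell>,1) first character) can be sketched in Python with only the standard library. This is a rough sketch, not part of the original workshop: the sample rows are made up, and the column names AgeClass and Deck are hypothetical names chosen for illustration.

```python
import csv
import io

# Two made-up rows in the column layout of Kaggle's train.csv (illustrative values only).
SAMPLE = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A/5 0000,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,PC 0000,71.28,C85,C
"""

def add_features(row):
    """Return a copy of the row with two engineered columns added."""
    out = dict(row)
    # Age*Class, the same product as the =F2*C2 formula.
    # Left blank when Age is missing (Excel would show 0 for a blank cell).
    out["AgeClass"] = float(row["Age"]) * int(row["Pclass"]) if row["Age"] else ""
    # First character of Cabin, like LEFT(<cell>,1); blank when Cabin is missing.
    out["Deck"] = row["Cabin"][:1]
    return out

rows = [add_features(r) for r in csv.DictReader(io.StringIO(SAMPLE))]

# Write the result back out as CSV text.
# To write a real file, swap the buffer for open("train_features.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

To use it on the real data, read the downloaded train CSV with csv.DictReader instead of the inline sample.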

Modeling

  1. Navigate to https://studio.azureml.net/ and log in.
  • If you run into issues, try opening a private browser session (e.g., Incognito mode in Chrome).
  2. Click on the + sign on the bottom right and choose Dataset –> From Local File. Upload your dataset.
  3. From the same + sign, choose Experiment –> Blank Experiment. Rename it to "Titanic Experiment."
  4. Drag your newly added dataset onto the Azure ML canvas. You'll find it on the left side, under Saved Datasets –> My Datasets.
  5. Go to Data Transformation –> Manipulation and drag "Clean Missing Data" to the canvas.
  • Click the bottom circle of the dataset module and drag the arrow that appears to the top circle of "Clean Missing Data".
  6. Go to Data Transformation –> Manipulation and drag "Select Columns in Dataset" to the canvas.
  • On the right side of the screen, click "Launch Column Selector" to choose your independent variables (the features) and your dependent variable (Survived).
  • Connect the bottom of "Clean Missing Data" to the top of "Select Columns in Dataset".
  7. Go to Data Transformation –> Sample and Split and drag "Split Data" to the canvas.
  • On the right side of the screen, change the fraction of rows to 0.7 to use 70% of the data to train the model and 30% to test it.
  • Connect the bottom of "Select Columns in Dataset" to the top of "Split Data".
  8. Go to Machine Learning –> Train and drag "Train Model" to the canvas.
  • On the right side of the screen, click "Launch Column Selector" to choose your dependent variable (Survived).
  • Connect the bottom left of "Split Data" to the top right of "Train Model".
  9. Go to Machine Learning –> Initialize Model and drag any model to the canvas.
  • Connect the bottom of the model to the top left of "Train Model."
  10. Go to Machine Learning –> Score and drag "Score Model" to the canvas.
  • Connect the bottom right of "Split Data" to the top right of "Score Model".
  • Connect the bottom of "Train Model" to the top left of "Score Model".
  11. Repeat steps 8-10 to add another model to your experiment.
  12. Go to Machine Learning –> Evaluate and drag "Evaluate Model" to the canvas.
  • Connect the bottom of each "Score Model" module to the top of "Evaluate Model".
  13. Click the "Run" button at the bottom of the screen.
  • Once it has finished running, click on the bottom of "Evaluate Model" and select "Visualize".
  • Keep trying other algorithms until you are happy with your results!
  • Click the bottom of the better "Train Model" module and go to Trained Model –> Save as Trained Model.
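
To see what the chain of modules above is doing conceptually, here is a minimal Python sketch of the same flow: fill in missing values, split 70/30, train, score, and evaluate. It is only an analogy under stated assumptions, not what Azure ML Studio runs internally. The function names simply mirror the module names, the rows are made up, filling missing ages with the mean is just one possible cleaning choice, and the "model" is a hypothetical stand-in that predicts the most common outcome for each value of one feature.

```python
import random
from collections import Counter, defaultdict

def clean_missing(rows, column):
    """Replace missing (None) values in a numeric column with the column mean."""
    known = [r[column] for r in rows if r[column] is not None]
    mean = sum(known) / len(known)
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]

def split_data(rows, fraction=0.7, seed=0):
    """Shuffle the rows, then put `fraction` of them in the first output."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

def train_model(rows, feature, label):
    """Hypothetical stand-in model: the majority label for each feature value."""
    votes = defaultdict(Counter)
    for r in rows:
        votes[r[feature]][r[label]] += 1
    overall = Counter(r[label] for r in rows).most_common(1)[0][0]
    table = {value: counts.most_common(1)[0][0] for value, counts in votes.items()}
    # Fall back to the overall majority for feature values never seen in training.
    return lambda r: table.get(r[feature], overall)

def score_model(model, rows):
    """Attach a prediction to every row, in a "Scored Labels" column."""
    return [{**r, "Scored Labels": model(r)} for r in rows]

def evaluate_model(scored, label):
    """Accuracy: the share of rows where the prediction matches the label."""
    hits = sum(1 for r in scored if r["Scored Labels"] == r[label])
    return hits / len(scored)

# Made-up passengers (illustrative values only, not taken from the Kaggle data).
passengers = [
    {"Sex": "female", "Age": 29.0, "Survived": 1},
    {"Sex": "female", "Age": None, "Survived": 1},
    {"Sex": "female", "Age": 41.0, "Survived": 1},
    {"Sex": "female", "Age": 18.0, "Survived": 0},
    {"Sex": "female", "Age": 35.0, "Survived": 1},
    {"Sex": "male", "Age": 22.0, "Survived": 0},
    {"Sex": "male", "Age": 54.0, "Survived": 0},
    {"Sex": "male", "Age": None, "Survived": 0},
    {"Sex": "male", "Age": 30.0, "Survived": 1},
    {"Sex": "male", "Age": 47.0, "Survived": 0},
]

cleaned = clean_missing(passengers, "Age")
train, test = split_data(cleaned, fraction=0.7)
model = train_model(train, feature="Sex", label="Survived")
scored = score_model(model, test)
accuracy = evaluate_model(scored, label="Survived")
print(f"Accuracy on held-out rows: {accuracy:.2f}")
```

The model is scored only on the 30% of rows it never saw during training, which is why "Split Data" feeds its left output to "Train Model" and its right output to "Score Model".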

Creating Your Kaggle Submission

  1. Go back to Kaggle and download the test.csv file.
  2. Open it in Excel and add the same features you had for your training set.
  3. Upload your dataset to Azure ML and rename it "Titanic Test Data."
  4. Create a new blank experiment and rename it Kaggle submission - <names>.
  5. Drag your newly uploaded dataset and model (under Trained Models) onto the canvas.
  6. Drag "Score Model" onto the canvas and connect the trained model to its top left and the dataset to its top right.
  7. Use a "Select Columns in Dataset" module to select only the columns needed for submission (PassengerId, Scored Labels).
  8. Go to Data Transformation –> Manipulation and drag "Edit Metadata" to the canvas.
  • Launch Column Selector to select Scored Labels. Then under "New column names", type in Survived.
  9. Go to Data Format Conversions and drag "Convert to CSV" onto the canvas. Connect "Edit Metadata" to it.
  10. Click "Run". Then click the bottom of the "Convert to CSV" module and select "Save as Dataset".
  11. In the dashboard at the far left, go to "Datasets", select your dataset, and download it to your desktop.
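
The file you download should have exactly two columns, PassengerId and Survived. As a rough illustration of what the "Select Columns in Dataset", "Edit Metadata", and "Convert to CSV" steps accomplish together, here is a small Python sketch. The scored rows are made up, and to_submission is a hypothetical helper name, not something Azure ML or Kaggle provides.

```python
import csv
import io

# Made-up scored rows, shaped like the output of "Score Model" (illustrative values only).
scored = [
    {"PassengerId": 892, "Pclass": 3, "Scored Labels": 0, "Scored Probabilities": 0.12},
    {"PassengerId": 893, "Pclass": 3, "Scored Labels": 1, "Scored Probabilities": 0.61},
]

def to_submission(rows):
    """Keep only the two needed columns and rename Scored Labels to Survived."""
    return [{"PassengerId": r["PassengerId"], "Survived": r["Scored Labels"]} for r in rows]

# Convert to CSV text with a header row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["PassengerId", "Survived"], lineterminator="\n")
writer.writeheader()
writer.writerows(to_submission(scored))
submission_csv = buffer.getvalue()
print(submission_csv)
```

Opening your downloaded file in a text editor and comparing its header row to this shape is a quick check before you upload it.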

Submitting to Kaggle

Click "make a submission" on the left-hand side of the competition page. Then upload your dataset and see where you rank!
