Use Azure Machine Learning to predict survival on the Titanic!
If you don't already have a Microsoft Azure account, sign up here!
- Sign up for an account on https://www.kaggle.com.
- Go to https://www.kaggle.com/c/titanic.
- Click the data tab at the left in the dashboard.
- Download the train.csv file and open it in Excel.
- Add a column for another feature.
- Example: Let's try Age*Class, and put that text into cell M1.
- Enter the following formula into cell M2: =F2*C2
- Grab the bottom right corner of M2 and drag it down to the end of the data to populate the formula down the column.
- To save the data but clear the formula, select all of column M, copy it, and paste in the same area.
- A small box will pop up in the corner. Select it and choose "Values only."
- Add more features that you think will help your model better predict survivors!
- Check the Kaggle site for hints and to understand the data.
- LEFT(<cell>,1) will return the first character in a cell (for example, the leading letter of a Cabin value).
- 1 British pound in 1912 = $87.66 today
- When you are done, save the spreadsheet as a CSV file (if it wasn't already).
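
For anyone who would rather script the spreadsheet steps above, here is a minimal Python sketch of the same three ideas: the Age*Class product, the first character of a cell, and the 1912 pound conversion. The column names `Deck` and `FareUSD`, the function `add_features`, and the two sample rows are made up for illustration. It also assumes Fare is quoted in 1912 pounds, and it leaves Age*Class blank when Age is missing, whereas the Excel formula would treat the empty cell as 0.

```python
import csv
import io

# Conversion figure quoted in the hints above (1 pound in 1912 = $87.66 today).
POUND_1912_TO_USD_TODAY = 87.66


def add_features(rows):
    """Return copies of the row dicts with three illustrative feature columns."""
    out = []
    for row in rows:
        new = dict(row)
        age, pclass = row.get("Age", ""), row.get("Pclass", "")
        # Same idea as =F2*C2, but left blank when Age is missing.
        new["Age*Class"] = str(float(age) * int(pclass)) if age and pclass else ""
        # Same idea as LEFT(<cell>,1), applied to the Cabin column.
        new["Deck"] = row.get("Cabin", "")[:1]
        fare = row.get("Fare", "")
        # Hypothetical feature: the fare expressed in today's dollars.
        new["FareUSD"] = f"{float(fare) * POUND_1912_TO_USD_TODAY:.2f}" if fare else ""
        out.append(new)
    return out


# Two made-up rows in the layout of train.csv (not real passenger records).
sample = io.StringIO(
    "PassengerId,Survived,Pclass,Age,Fare,Cabin\n"
    "9001,0,3,20,10,\n"
    "9002,1,1,,50,C12\n"
)
rows = add_features(list(csv.DictReader(sample)))
for row in rows:
    print(row["PassengerId"], row["Age*Class"], row["Deck"], row["FareUSD"])
```

Whichever route you take, the training and test files need exactly the same added columns.
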
- Navigate to https://studio.azureml.net/ and log in.
- If you run into issues, try opening a private browser session (e.g., Incognito mode on Chrome).
- Click on the + sign on the bottom right and choose Dataset –> From Local File. Upload your dataset.
- From the same + sign, choose Experiment –> Blank Experiment. Rename it to "Titanic Experiment."
- Drag your newly added dataset to the Azure ML Canvas. It is listed on the left side, under Saved Datasets –> My Datasets.
- Go to Data Transformation –> Manipulation and drag "Clean Missing Data" to the canvas.
- Click the bottom circle of the dataset module and drag the arrow that appears to the top circle of "Clean Missing Data".
- Go to Data Transformation –> Manipulation and drag "Select Columns in Dataset" to the canvas.
- On the right side of the screen, click "Launch Column Selector" to choose your independent and dependent variables.
- Connect the bottom of "Clean Missing Data" to the top of "Select Columns in Dataset".
- Go to Data Transformation –> Sample and Split and drag "Split Data" to the canvas.
- On the right side of the screen, change the fraction of rows to 0.7. This sends 70% of the data to the left output (used to train the model) and 30% to the right output (used to test it).
- Connect the bottom of "Select Columns in Dataset" to the top of "Split Data".
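
To picture what the 0.7 fraction does, here is a rough Python sketch of a randomized 70/30 split. It illustrates the idea only and is not how the "Split Data" module is implemented; the function name `split_rows` and the fixed seed are assumptions made for the example.

```python
import random


def split_rows(rows, fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into (first part, second part)."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


# train.csv has 891 passenger rows; plain integers stand in for them here.
train, test = split_rows(range(891))
print(len(train), len(test))  # prints "624 267"
```

The first part plays the role of the left output (training data) and the second part the right output (test data).
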
- Go to Machine Learning –> Train and drag "Train Model" to the canvas.
- On the right side of the screen, click "Launch Column Selector" to choose your dependent variable (Survived).
- Connect the bottom left of "Split Data" to the top right of "Train Model".
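- Go to Machine Learning –> Initialize Model and drag any classification model to the canvas. Survived is a yes/no outcome, so a two-class model such as "Two-Class Logistic Regression" is a good fit.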
- Connect the bottom of the model to the top left of "Train Model."
- Go to Machine Learning –> Score and drag "Score Model" to the canvas.
- Connect the bottom right of "Split Data" to the top right of "Score Model".
- Connect the bottom of "Train Model" to the top left of "Score Model".
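- Repeat the three steps above ("Train Model", Initialize Model, and "Score Model") to add a second model to your experiment, connected to the same "Split Data" outputs.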
- Go to Machine Learning –> Evaluate and drag "Evaluate Model" to the canvas.
- Connect the bottom of one "Score Model" module to the top left of "Evaluate Model" and the bottom of the other to the top right.
- Click the "Run" button at the bottom of the screen.
- Once it is finished running, click on the bottom of "Evaluate Model" and select "Visualize".
- Keep trying other algorithms until you are happy with your results!
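
When comparing models in the visualization, it helps to know how the headline numbers relate to the scored rows. The sketch below computes accuracy, precision, recall, and F1 from actual and predicted labels. The function name `binary_metrics` and the eight example labels are made up for illustration and are not Titanic results.

```python
def binary_metrics(actual, predicted):
    """Compute basic two-class metrics from 0/1 labels of equal length."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # survivors found
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # non-survivors found
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # wrongly predicted survived
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed survivors
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 1 = survived, 0 = did not survive.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(actual, predicted)
print(metrics)
```

Kaggle ranks Titanic submissions by accuracy, so that is the number to watch most closely when choosing between models.
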
- Click the bottom of the better-performing "Train Model" module and go to Trained Model –> Save as Trained Model.
- Go back to Kaggle and download the test.csv file.
- Open it in Excel and add the same features you had for your training set.
- Upload your dataset to Azure ML and rename it "Titanic Test Data."
- Create a new blank experiment and rename it Kaggle submission - <names>.
- Drag your newly uploaded dataset and model (under Trained Models) onto the canvas.
- Drag "Score Model" onto the canvas and connect the dataset and trained model modules to it.
- Use a "Select Columns in Dataset" module to select only the columns needed for submission (PassengerId, Scored Labels).
- Go to Data Transformation –> Manipulation and drag "Edit Metadata" to the canvas.
- Launch Column Selector to select Scored Labels. Then under "New column names", type in Survived.
- Go to Data Format Conversions and drag "Convert to CSV" onto the canvas. Connect "Edit Metadata" to it.
- Click "Run". Then click the bottom of the "Convert to CSV" module and select "Save as Dataset".
- At the far left dashboard, go to "Datasets", select your dataset, and download it to your desktop.
- Go back to the Kaggle competition page and click "Make a submission" on the left-hand side. Then upload your dataset and see where you rank!
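
Before uploading, it can be worth sanity-checking the downloaded file. The sketch below checks that the header is exactly PassengerId,Survived, that there is one row per test passenger (test.csv has 418), and that every prediction is 0 or 1. The function name `check_submission` and the sample strings are hypothetical, and the check assumes your model's scored labels are written as 0 and 1.

```python
import csv
import io

EXPECTED_HEADER = ["PassengerId", "Survived"]


def check_submission(csv_text, expected_rows=418):
    """Return a list of human-readable problems; an empty list means it looks OK."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0] if rows else None
    if header != EXPECTED_HEADER:
        problems.append(f"header should be {EXPECTED_HEADER}, got {header}")
    body = rows[1:]
    if len(body) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(body)}")
    bad = [r for r in body if len(r) != 2 or r[1] not in ("0", "1")]
    if bad:
        problems.append(f"{len(bad)} rows do not have a 0/1 Survived value")
    return problems


# Tiny made-up examples; a real file would have 418 data rows.
good = "PassengerId,Survived\n892,0\n893,1\n"
bad = "PassengerID,Scored Labels\n892,0\n"
print(check_submission(good, expected_rows=2))
print(check_submission(bad, expected_rows=2))
```

To check a real file, read it with `open(path, newline="")` and pass its contents to `check_submission` with the default row count.
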