This repository outlines my approach to the 90POE Data Science Take Home Task. The given data is detailed listing information for Airbnb properties in Berlin. The task was open-ended and an opportunity to showcase my strengths. I aim to answer these questions:
- What insights can you gather?
- How can you visualize the data?
- What is surprising for you about the data?
- Create an ML model that can predict the listing price from the other information in the data file, then evaluate your model.
90POE Data Science Take Home Test

Table of Contents
- Getting Started
- Data Preprocessing and Feature Engineering
- Exploratory Data Analysis
- Data Modelling and Results
- Future Work
This repository explains how I handled the project. I used Jupyter notebooks to make my data analysis process clear and interpretable. Since the task also includes a presentation, I added some visualizations to the Exploratory Data Analysis (EDA) part.
- `eda.ipynb` is the exploratory data analysis and preprocessing notebook.
- `training.ipynb` trains the machine learning model that predicts price values; it also includes feature selection.
- `nlp.ipynb` is my attempt to predict the room size from the space column of the Airbnb data by utilizing NLP.
- `data` is the folder with raw and processed data.
My steps can be listed as:
Preprocessing:
- Drop useless columns
- Drop text columns for now
- Drop 51 rows with many NaN values
- Categorize columns
- Replace NaN values in cleaning_fee and security_deposit with 0
- Remove the $ sign from price columns
- Remove the % sign from host_response_rate
- Remove abnormal prices
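The cleaning steps above can be sketched in pandas. The toy frame, the exact column handling, and the 500-dollar cut-off are illustrative assumptions; only the column names follow the Airbnb listing file:

```python
import numpy as np
import pandas as pd

# Toy listing data standing in for the raw Airbnb file.
df = pd.DataFrame({
    "price": ["$50.00", "$120.00", "$9,000.00"],
    "cleaning_fee": ["$10.00", np.nan, "$30.00"],
    "security_deposit": [np.nan, "$200.00", np.nan],
    "host_response_rate": ["100%", "90%", np.nan],
})

money_cols = ["price", "cleaning_fee", "security_deposit"]

# Missing fees/deposits are treated as 0, then "$" and "," are stripped.
df[money_cols] = (
    df[money_cols]
    .fillna("$0.00")
    .replace(r"[\$,]", "", regex=True)
    .astype(float)
)

# Remove the "%" sign and convert to a 0-1 fraction.
df["host_response_rate"] = (
    df["host_response_rate"].str.rstrip("%").astype(float) / 100
)

# Drop abnormal prices (threshold chosen for illustration only).
df = df[df["price"] < 500]
print(df)
```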
Feature Engineering:
- Create a distance column from each property's coordinates to Berlin's centre
- Create amenities features
- Create host_verification features
- Create a property_size feature from the text data by regex
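A minimal sketch of the distance and amenities features, assuming a haversine distance to an approximate centre coordinate and an amenities column formatted as `{"Wifi","Kitchen"}`, as in the Airbnb export:

```python
import numpy as np
import pandas as pd

BERLIN_CENTRE = (52.5200, 13.4050)  # approximate lat/lon of Berlin's centre

def haversine_km(lat, lon, centre=BERLIN_CENTRE):
    """Great-circle distance in km from each listing to the city centre."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat, lon, *centre])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    "latitude": [52.49, 52.55],
    "longitude": [13.40, 13.35],
    "amenities": ['{"Wifi","Kitchen"}', '{"Wifi","Heating"}'],
})

df["distance_km"] = haversine_km(df["latitude"], df["longitude"])

# One binary column per amenity: strip the braces/quotes, then one-hot encode.
amenity_flags = (
    df["amenities"].str.strip("{}").str.replace('"', "").str.get_dummies(sep=",")
)
df = df.join(amenity_flags)
print(df[["distance_km", "Wifi", "Kitchen", "Heating"]])
```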
The plot above shows the distribution of all bookable places on Airbnb Berlin, but it is hard to generalize from. Therefore I made this interactive map instead of the previous plot, where you can also see every neighbourhood and its mean price:
Well, this is surprising: Kreuzberg seems to have the smallest area, yet the most Airbnb listings in Berlin. I think this is due to the socio-cultural character of Kreuzberg; the area is famous for its immigrant communities, and it sits in the centre, right next to the historical Berlin Wall.

The next plot shows distance to the centre of Berlin: the closer a listing is, the more its price varies. There are also more available Airbnbs at the centre.
I selected the XGBoost algorithm for my model for several reasons:
- It handles missing values
- It is easy to use and tune
- It is fast
- Its large number of tunable hyper-parameters is a primary advantage over plain gradient boosting machines
- It is much more interpretable than a neural network, so it is a good starting point
I used GridSearch to find the optimal parameters:
{'colsample_bytree': 0.6, 'gamma': 0.0, 'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 400}
MSE: 26.2054
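The search can be sketched with scikit-learn's GridSearchCV. Here a GradientBoostingRegressor on synthetic data stands in for the actual XGBRegressor and listing features, and the grid is cut down for speed; the real search covered colsample_bytree, gamma, learning_rate, max_depth and n_estimators:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the processed listing features and price target.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small illustrative grid; xgboost's XGBRegressor is a drop-in replacement
# for the estimator below and accepts the full parameter set quoted above.
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [3, 7]}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=100, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_, round(mse, 4))
```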
The data has links in the listing_url and host_url columns. Most of the links are broken, but some of them can be used for scraping other types of data, such as images of the properties.
Instead of XGBoost, a neural network could be used to predict the price. We could also create features from the text data.
I would like to explore the text columns and extract more insights, especially from the summary, space and description columns.
Another thing I would like to add is extracting property_size from the text data with question-answering Transformer models. I tried a large BERT model fine-tuned for Q&A, but it did not give better results than my regex solution.
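The regex baseline mentioned above could look like the sketch below; the pattern and the example texts are my own illustration, not the notebook's exact rule:

```python
import re
import pandas as pd

# Hypothetical snippets from the free-text "space" column.
spaces = pd.Series([
    "Cozy 45 m2 flat near the canal",
    "Bright studio, 60 sqm, great light",
    "Room in a shared apartment",
])

# Capture a number directly followed by a square-metre unit (m2 / m\u00b2 / sqm).
SIZE_RE = re.compile(r"(\d{2,4})\s*(?:m2|m\u00b2|sqm)", re.IGNORECASE)

def extract_size(text):
    """Return the first square-metre figure in the text, or None."""
    match = SIZE_RE.search(text)
    return float(match.group(1)) if match else None

property_size = spaces.map(extract_size)
print(property_size.tolist())
```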