This repository outlines my approach to the 90POE Data Science Take Home Task. The given data is detailed listing information for Airbnb properties in Berlin. The task was open-ended and an opportunity to showcase my strengths. I aim to answer these questions:
- What insights can you gather?
- How can you visualize the data?
- What is surprising for you about the data?
- Create an ML model that can predict the listing price from the other information in the data file, then evaluate your model.
90POE Data Science Take Home Test

Table of Contents
- Getting Started
- Data Preprocessing and Feature Engineering
- Exploratory Data Analysis
- Data Modelling and Results
- Future Work
This repository explains how I handled the project. I used Jupyter notebooks to make my data analysis process clear and interpretable. Since the task also includes a presentation, I added some visualizations to the Exploratory Data Analysis (EDA) part.
- `eda.ipynb` is the exploratory data analysis and preprocessing notebook.
- `training.ipynb` trains the machine learning model that predicts price values; it also includes feature selection.
- `nlp.ipynb` is my attempt to predict the room size from the space column of the Airbnb data by utilizing NLP.
- `data` is the folder with raw and processed data.
My steps can be listed as:
Preprocessing:
- Drop useless columns
- Drop text columns for now
- Drop 51 rows with many NaN values
- Categorize columns
- Replace NaN values in cleaning_fee and security_deposit with 0
- Remove the $ sign from price columns
- Remove the % sign from host_response_rate
- Remove abnormal prices
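The cleaning steps above can be sketched in pandas. The toy frame, the exact column handling, and the 500-dollar cut-off are illustrative assumptions; only the column names follow the Airbnb listing file:

```python
import numpy as np
import pandas as pd

# Toy listing data standing in for the raw Airbnb file.
df = pd.DataFrame({
    "price": ["$50.00", "$120.00", "$9,000.00"],
    "cleaning_fee": ["$10.00", np.nan, "$30.00"],
    "security_deposit": [np.nan, "$200.00", np.nan],
    "host_response_rate": ["100%", "90%", np.nan],
})

money_cols = ["price", "cleaning_fee", "security_deposit"]

# Missing fees/deposits are treated as 0, then "$" and "," are stripped.
df[money_cols] = (
    df[money_cols]
    .fillna("$0.00")
    .replace(r"[\$,]", "", regex=True)
    .astype(float)
)

# Remove the "%" sign and convert to a 0-1 fraction.
df["host_response_rate"] = (
    df["host_response_rate"].str.rstrip("%").astype(float) / 100
)

# Drop abnormal prices (threshold chosen for illustration only).
df = df[df["price"] < 500]
print(df)
```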
Feature Engineering:
- Create a distance column from each property's coordinates to Berlin's centre
- Create amenities features
- Create host_verification features
- Create a property_size feature from the text data by regex
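A minimal sketch of the distance and amenities features, assuming a haversine distance to an approximate centre coordinate and an amenities column formatted as `{"Wifi","Kitchen"}`, as in the Airbnb export:

```python
import numpy as np
import pandas as pd

BERLIN_CENTRE = (52.5200, 13.4050)  # approximate lat/lon of Berlin's centre

def haversine_km(lat, lon, centre=BERLIN_CENTRE):
    """Great-circle distance in km from each listing to the city centre."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat, lon, *centre])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    "latitude": [52.49, 52.55],
    "longitude": [13.40, 13.35],
    "amenities": ['{"Wifi","Kitchen"}', '{"Wifi","Heating"}'],
})

df["distance_km"] = haversine_km(df["latitude"], df["longitude"])

# One binary column per amenity: strip the braces/quotes, then one-hot encode.
amenity_flags = (
    df["amenities"].str.strip("{}").str.replace('"', "").str.get_dummies(sep=",")
)
df = df.join(amenity_flags)
print(df[["distance_km", "Wifi", "Kitchen", "Heating"]])
```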
The plot above shows the distribution of all bookable places on Airbnb Berlin, but it is hard to generalize from. Therefore I made this interactive map instead of the previous plot, where you can also see every neighbourhood and its mean price:
Well, this is surprising: Kreuzberg seems to have the smallest area, yet the most Airbnb listings in Berlin. I think this is due to the socio-cultural character of Kreuzberg; the area is famous for its immigrant communities, and it sits in the centre, right next to the historical Berlin Wall.

The next plot shows distance to the centre of Berlin: the closer a listing is, the more its price varies. There are also more available Airbnbs at the centre.
I selected the XGBoost algorithm for my model for several reasons:
- It handles missing values
- It is easy to use and tune
- It is fast
- Its large number of tunable hyper-parameters is a primary advantage over plain gradient boosting machines
- It is much more interpretable than a neural network, so it is a good starting point
I used GridSearch to find the optimal parameters:
{'colsample_bytree': 0.6, 'gamma': 0.0, 'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 400}
MSE: 26.2054
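The search can be sketched with scikit-learn's GridSearchCV. Here a GradientBoostingRegressor on synthetic data stands in for the actual XGBRegressor and listing features, and the grid is cut down for speed; the real search covered colsample_bytree, gamma, learning_rate, max_depth and n_estimators:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the processed listing features and price target.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small illustrative grid; xgboost's XGBRegressor is a drop-in replacement
# for the estimator below and accepts the full parameter set quoted above.
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [3, 7]}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=100, random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))
print(search.best_params_, round(mse, 4))
```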
The data has links in the listing_url and host_url columns. Most of the links are broken, but some of them can be used for scraping other types of data, such as images of the properties.
Instead of XGBoost, a neural network could be used to predict the price. We could also create features from the text data.
I would like to explore the text columns and extract more insights, especially from the summary, space and description columns.
Another thing I would like to add is extracting property_size from the text data with question-answering Transformer models. I tried a large BERT model fine-tuned for Q&A, but it did not give better results than my regex solution.
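The regex baseline mentioned above could look like the sketch below; the pattern and the example texts are my own illustration, not the notebook's exact rule:

```python
import re
import pandas as pd

# Hypothetical snippets from the free-text "space" column.
spaces = pd.Series([
    "Cozy 45 m2 flat near the canal",
    "Bright studio, 60 sqm, great light",
    "Room in a shared apartment",
])

# Capture a number directly followed by a square-metre unit (m2 / m\u00b2 / sqm).
SIZE_RE = re.compile(r"(\d{2,4})\s*(?:m2|m\u00b2|sqm)", re.IGNORECASE)

def extract_size(text):
    """Return the first square-metre figure in the text, or None."""
    match = SIZE_RE.search(text)
    return float(match.group(1)) if match else None

property_size = spaces.map(extract_size)
print(property_size.tolist())
```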