New York City is the city that never sleeps. It is one of the most popular cities in the world and attracts tens of millions of visitors every year. A popular lodging alternative to hotels is Airbnb, where city locals can offer their homes for visitors to stay in. The price of an Airbnb per night can vary based on many factors, like its location, room type (entire home/apt, private room, etc.), reviews, and availability. We want to explore whether we can predict the price of Airbnbs in New York City from these factors using data science and machine learning methods.
The first step in the data science pipeline is obtaining data and making sure that it is in a usable form for analysis, visualization, and modeling.
import pandas as pd
For this section we will be using the pandas
library so that we can put our data into a dataframe format. The easiest way to install pandas is as part of the Anaconda distribution.
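If you are not using Anaconda, one common alternative is to install it with pip:
pip install pandas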
# Import data from csv file to dataframe
data = pd.read_csv('AB_NYC_2019.csv')
data.head()
The next step is to make sure the data is organized and easy to use. We follow the model for tidy data, where:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
To make sure our data follows the tidy model, we clean the data so that each column (attribute) has its own uniform format and units for each row. As an example, we must make sure that the price for each row is in the same units (US dollars). We must also make sure each row forms an entity. In this case, each row represents an Airbnb listing and the entire table is made up of NYC listings. We also want to drop columns that we will not be using in our analysis and create any new columns based on the existing columns that we may need for our models later on.
For regression, we need to ensure that all inputs are numerical in nature. So for the latest date reviewed, we convert the date to an actual measure of time, in this case the number of days since the last review.
# Drop unused columns
data = data.drop(['id', 'host_id', 'host_name'], axis=1)
# Remove entries that are missing the name attribute
data = data.dropna(subset=['name'])
# Convert the datetime information into a numerical attribute
# by extracting the number of days since the last review (comparing to the current day)
data['last_review'] = pd.to_datetime(data['last_review'])
today = pd.to_datetime('today')
data['days_since_last_review'] = today - data['last_review']
data['days_since_last_review'] = data['days_since_last_review'].dt.days
data.drop(['last_review'], axis=1, inplace=True)
data.head()
More information about tidy data can be found in Hadley Wickham's paper titled Tidy Data
In this section we will take a deep dive into showing patterns and correlations and visualizing our data for the common eye. The raw CSV file on its own is nearly impossible to make sense of, so presenting the data through visually appealing statistics, graphs, and plots will help us better understand it and get a truly holistic overview.
import numpy as np
import matplotlib.pyplot as plt
from ipyleaflet import Map, Heatmap
import seaborn as sns
from collections import defaultdict, Counter
The main libraries we will use for data visualization are matplotlib.pyplot and seaborn, as they are both geared towards drawing attractive, informative statistical graphics, while ipyleaflet will be used for geographic visualization of our data. Below is one way to install the libraries.
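For example, using pip (assuming pip is available on your system):
pip install matplotlib seaborn ipyleaflet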
Now we will begin looking at our data, but before we do, let's look at where the Airbnb listings are located. Rather than using other measures of density, we will overlay a heatmap on New York City to get a clear sense of which NYC borough has the most listings.
# Collect the latitude and longitude of each listing (plus an intensity value) into the location list
location = []
for idx, row in data.iterrows():
location.append([row['latitude'], row['longitude'], 2])
# Create the NYC map using the Map module from the ipyleaflet library
nyc_map = Map(center=(40.75,-74.01), zoom=11)
nyc_map
heatMap = Heatmap(
locations = location,
radius =8
)
# Add the heatmap layer onto the map
nyc_map.add_layer(heatMap);
nyc_map
This map is not only useful for identifying areas with a high density of listings (such as the areas next to Central Park and around the World Trade Center), but it also gives the user a nice interactive overview of what to expect from the data. More generally, this library is very useful for any data that has a geographical aspect to it.
Now, to begin the graphs and plots, I envision comparing the boroughs (Manhattan, Brooklyn, Queens, etc.) against each other. So, rather than using multiple if statements to extract each borough, I am simply going to split them up into their own respective dataframes using the loc method.
# Create a dataframe for each borough by using the loc method and having a condition within it for which borough
# you want
df_manhattan = data.loc[data['neighbourhood_group'] == "Manhattan"]
df_brooklyn = data.loc[data['neighbourhood_group'] == "Brooklyn"]
df_queens = data.loc[data['neighbourhood_group'] == "Queens"]
df_staten_island = data.loc[data['neighbourhood_group'] == "Staten Island"]
df_bronx = data.loc[data['neighbourhood_group'] == "Bronx"]
Let's compare the distribution of prices for each borough and try to develop a deeper understanding of whether pricing changes from borough to borough.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,6))
sns.set(style="whitegrid")
bp1 = sns.boxplot(x = 'neighbourhood_group', y = 'price', data = data, ax=ax1)
bp2 = sns.boxplot(x = 'neighbourhood_group', y = 'price', data = data, ax=ax2, showfliers=False)
bp1.set_xticklabels(bp1.get_xticklabels(), rotation=45)
bp1.set_title("Neighbourhood Group vs. Price (With Outliers)")
bp2.set_xticklabels(bp2.get_xticklabels(), rotation=45)
bp2.set_title("Neighbourhood Group vs. Price (Without Outliers)")
plt.show()
As you can see, there are two boxplots, one with outliers and one without. I think it is important to show both: the boxplot with outliers shows the reader that there is a very wide range of prices in each borough, while the boxplot without outliers shows what the majority of the data looks like.
One thing we can determine about the pricing for each borough is that the Airbnbs in Manhattan tend to be more expensive than their counterparts, while the Airbnbs in the Bronx seem like the way to go if you want a cheaper option.
The next aspect we are going to look at is the number of reviews, and whether that variable affects any of the other variables in the dataframe. The first thing we will do is draw a distribution plot of the number of reviews, which is a histogram with a density line over it. Before drawing further plots and graphs, we always want to check the distribution of a variable in order to understand its range, skew, outliers, etc. This distribution builds the foundation for all the other graphs and plots we draw for that variable.
# Create a 3x2 grid of subplots
fig, ax = plt.subplots(3, 2, figsize=(15,20))
# Create the distribution plots, assigning each to a variable for additional styling later in the code
manhattan = sns.distplot(df_manhattan['number_of_reviews'], ax=ax[0,0])
bronx = sns.distplot(df_bronx['number_of_reviews'], ax=ax[0,1])
queens = sns.distplot(df_queens['number_of_reviews'], ax=ax[1,0])
st = sns.distplot(df_staten_island['number_of_reviews'], ax=ax[1,1])
brooklyn = sns.distplot(df_brooklyn['number_of_reviews'], ax=ax[2,0])
overall = sns.distplot(data['number_of_reviews'], ax=ax[2, 1])
# Additional styling, such as setting the titles and increasing the font size
manhattan.set_title("Manhattan", weight='bold').set_fontsize('14')
bronx.set_title("Bronx", weight='bold').set_fontsize('14')
queens.set_title("Queens", weight='bold').set_fontsize('14')
st.set_title("Staten Island", weight='bold').set_fontsize('14')
brooklyn.set_title("Brooklyn", weight='bold').set_fontsize('14')
overall.set_title("Overall", weight='bold').set_fontsize('14')
plt.show()
Each borough's distribution, as well as the overall distribution, is strongly right-skewed. This means that the mean is greater than the median for the number of reviews: most listings have relatively few reviews, while a long tail of listings have many.
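As a quick sanity check of this claim (the exact values depend on the dataset), we can compare the mean and the median directly:
# A right skew should show up as the mean exceeding the median
print('Mean number of reviews: ', data['number_of_reviews'].mean())
print('Median number of reviews: ', data['number_of_reviews'].median())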
Since we want to continue examining the number of reviews variable, let's group the data by borough, compute the mean price and mean number of reviews for each, and see if we can notice any patterns.
# Create a dataframe that groups the data by borough and then calculates the mean for each of the other columns
x = data.groupby('neighbourhood_group').mean(numeric_only=True)
plt.figure(figsize= (15, 8))
# Creating the prices bar in blue
plt.bar(x.index, x['price'], label = 'mean prices')
# Creating the number of reviews bar in orange
plt.bar(x.index, x['number_of_reviews'], label = 'mean number of reviews')
plt.legend()
plt.title("Means vs Boroughs", weight='bold').set_fontsize('14')
plt.xlabel("Boroughs", weight='book').set_fontsize('14')
plt.ylabel("Dollars", weight='book').set_fontsize('14')
plt.show()
This bar graph corroborates what we found from the boxplots and leads to an interesting observation about the relationship between the two variables. It confirms that the average price for an Airbnb is highest in Manhattan, while the cheapest Airbnbs can be found in the Bronx. More interestingly, the mean number of reviews is roughly constant across all the boroughs, suggesting that the number of reviews has little to do with the pricing of an Airbnb.
To test the observation from the bar graph above, that there is no correlation between price and the number of reviews, let's create a correlation matrix and a scatter plot of the two variables and examine them.
print(data[["price", "number_of_reviews"]].corr())
data.plot.scatter(x='price', y='number_of_reviews')
According to the correlation matrix and the scatter plot, it seems our assumption is correct: there is essentially no correlation between price and the number of reviews. From the correlation matrix, we can see that the correlation between the two is -0.04, which indicates virtually no linear relationship since the value is so close to 0. Furthermore, the scatter plot shows no discernible pattern either.
Through this analysis we can conclude that there is no meaningful correlation between price and the number of reviews; in other words, the number of reviews does not dictate the price of an Airbnb.
Now let's say, just for fun, you wanted to figure out which neighbourhoods are the most popular within each borough. Why? Well, if you put yourself in the shoes of someone trying to find a safe, affordable, and nice Airbnb, you would expect it to be in a neighbourhood with a high density of listings. So this next part shows the top 10 neighbourhoods with the most listings within each borough.
# Create a table for each borough and populate it with the neighbourhood as the key and the number of listings
# as the value. defaultdict initializes missing keys to 0 for us, so we don't have to write that
# extra initialization code.
manhattan, bronx, queens, staten_island, brooklyn, overall = defaultdict(int), defaultdict(int), defaultdict(int), defaultdict(int),defaultdict(int), defaultdict(int)
for idx, row in df_manhattan.iterrows():
manhattan[row['neighbourhood']] += 1
for idx, row in df_bronx.iterrows():
bronx[row['neighbourhood']] += 1
for idx, row in df_queens.iterrows():
queens[row['neighbourhood']] += 1
for idx, row in df_staten_island.iterrows():
staten_island[row['neighbourhood']] += 1
for idx, row in df_brooklyn.iterrows():
brooklyn[row['neighbourhood']] += 1
for idx, row in data.iterrows():
overall[row['neighbourhood']] += 1
manhattan = dict(Counter(manhattan).most_common(10))
bronx = dict(Counter(bronx).most_common(10))
queens = dict(Counter(queens).most_common(10))
staten_island = dict(Counter(staten_island).most_common(10))
brooklyn = dict(Counter(brooklyn).most_common(10))
overall = dict(Counter(overall).most_common(10))
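As an aside, pandas can compute the same top-10 counts without the explicit loops above; a minimal equivalent for Manhattan (the same pattern applies to the other dataframes):
# value_counts sorts neighbourhoods by listing count in descending order
df_manhattan['neighbourhood'].value_counts().head(10).to_dict()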
# Now we create each respective pie chart
#Manhattan
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0) # only "explode" the 1st slice
fig, ax = plt.subplots(3, 2, figsize=(15,20)) # create a 3x2 grid of pie charts
ax[0,0].pie(manhattan.values(), explode= explode, labels=manhattan.keys(), autopct='%1.1f%%',
shadow=True, startangle=90)
ax[0,0].axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
ax[0,0].set_title('Manhattan', weight='bold').set_fontsize('18')
#Bronx
ax[0,1].pie(bronx.values(), explode= explode, labels=bronx.keys(), autopct='%1.1f%%',
shadow=True, startangle=90)
ax[0,1].axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
ax[0,1].set_title('Bronx', weight='bold').set_fontsize('18')
#Queens
ax[1,0].pie(queens.values(), explode= explode, labels=queens.keys(), autopct='%1.1f%%',
shadow=True, startangle=90)
ax[1,0].axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
ax[1,0].set_title('Queens', weight='bold').set_fontsize('18')
#Staten Island
ax[1,1].pie(staten_island.values(), explode= explode, labels=staten_island.keys(), autopct='%1.1f%%',
shadow=True, startangle=90)
ax[1,1].axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
ax[1,1].set_title('Staten Island', weight='bold').set_fontsize('18')
#Brooklyn
ax[2,0].pie(brooklyn.values(), explode= explode, labels=brooklyn.keys(), autopct='%1.1f%%',
shadow=True, startangle=90)
ax[2,0].axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
ax[2,0].set_title('Brooklyn', weight='bold').set_fontsize('18')
#Overall
ax[2,1].pie(overall.values(), explode= explode, labels=overall.keys(), autopct='%1.1f%%',
shadow=True, startangle=90)
ax[2,1].axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
ax[2,1].set_title('Overall', weight='bold').set_fontsize('18')
plt.show()
Data visualization and analysis allow the reader to understand the data in a simple and insightful way. Through our analysis, we showed a visualization of where all the listings are in NYC and let the reader interactively find high-density and low-density areas. We then took a deep dive into one variable of the dataframe and showed, through bar graphs, scatter plots, density plots, and a correlation matrix, that there is no meaningful correlation between the number of reviews and the price. Finally, we created a visually appealing grid of pie charts so the reader can see which neighbourhoods have the most listings and better inform their decisions and opinions.
Through data visualization, we try to abstract away the code and make the data as easy to understand as possible, which is done through visually appealing plots and graphs. If you are using matrices or math to describe your conclusions, be sure to thoroughly describe what all the variables mean and why they are significant.
As we can see in this dataset, Airbnb listings are distributed across the different boroughs. An interesting comparison we can test is whether the average price in one borough is different from another; specifically, whether listings in Manhattan and Brooklyn differ in average price.
To state the null hypothesis: we expect that the mean price in one borough is the same as the mean price in the other; the borough a listing is located in should not affect its price. Our alternative hypothesis is that the prices do in fact differ based on the borough the listing is located in.
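Formally, writing $\mu_{m}$ and $\mu_{b}$ for the mean prices in Manhattan and Brooklyn respectively:

$$H_0: \mu_{m} = \mu_{b} \qquad \text{vs.} \qquad H_1: \mu_{m} \neq \mu_{b}$$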
First let's extract the data from Brooklyn and the Manhattan areas.
brooklyn_data = data[data['neighbourhood_group'] == 'Brooklyn']
manhattan_data = data[data['neighbourhood_group'] == 'Manhattan']
We can now randomly sample 30 locations from each borough.
brooklyn_sample = brooklyn_data.sample(n=30, random_state=1)
manhattan_sample = manhattan_data.sample(n=30, random_state=1)
Next, let's compute the mean price of each sample.
brooklyn_sample_mean = np.mean(brooklyn_sample['price'])
manhattan_sample_mean = np.mean(manhattan_sample['price'])
print('Brooklyn: Mean: {}'.format(brooklyn_sample_mean))
print('Manhattan: Mean: {}'.format(manhattan_sample_mean))
We now define a random variable $Y = \bar{X_{m}} - \bar{X_{b}}$, where $m$ represents Manhattan and $b$ represents Brooklyn. Under the assumption that the prices are the same, our expected value for the difference is 0, while the variance is the sum of the variances.
We thus need an estimate of the mean and variability of price across both boroughs to use in this approximation. We can do this by assuming that all observations from this experiment can be treated as i.i.d. draws; under that assumption we estimate the pooled mean and variance as shown below, and use the pooled variance as our approximation for the variance of $Y$.
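Concretely, under the null hypothesis we approximate the distribution of $Y$ as

$$Y \sim N\left(0, \hat{\sigma}^{2}\right),$$

where $\hat{\sigma}^{2}$ is the sample variance of the pooled prices computed below.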
approx_y_bar = np.mean(pd.concat([brooklyn_sample, manhattan_sample])['price'])
approx_y_var = np.var(pd.concat([brooklyn_sample, manhattan_sample])['price'])
print('Approx: Mean: {}, Variance: {}'.format(approx_y_bar, approx_y_var))
The estimate $\hat{y}$ based on the data we recorded is approximately 39.2 (177.433 - 138.233). We can now use the CLT approximation to determine how likely it is, under the null hypothesis, to observe a difference at least this large, $P(Y \geq \hat{y})$.
import scipy.stats as stats
mu = 0
sigma = np.sqrt(approx_y_var)
# Upper-tail probability of observing a difference at least as large as 39.2
1 - stats.norm(mu, sigma).cdf(39.2)
It can be seen that we fail to reject the null hypothesis that the prices are the same, since the tail probability we computed (approximately .38) is greater than 0.05, the standard alpha value.
We can plot the normal distribution based on our mean and standard deviation to visualize this. The plot below shows that this result makes sense, as the standard deviation is quite large relative to the observed difference.
x = np.linspace(-1000, 1000, 100)
y = stats.norm.pdf(x,mu,sigma)
plt.plot(x,y, color='coral')
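As a supplementary check (not part of the CLT-based calculation above), scipy also provides a two-sample t-test. Below is a minimal sketch comparing the same two samples, using Welch's variant since the two boroughs may have unequal variances; a p-value above 0.05 would again mean failing to reject the null hypothesis.
# Welch's two-sample t-test on the same samples (a supplementary check)
t_stat, p_value = stats.ttest_ind(manhattan_sample['price'],
                                  brooklyn_sample['price'],
                                  equal_var=False)
print('t-statistic: {:.3f}, p-value: {:.3f}'.format(t_stat, p_value))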
import sklearn
# Data Prep
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
# Text Extraction for Name
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer
from sklearn.decomposition import PCA
# ML Models
from sklearn.linear_model import LinearRegression, ElasticNet, Ridge, Lasso
# Metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
sklearn is the main library we will use for machine learning. It is a very popular Python library that provides efficient tools for machine learning and data analysis. Below is one way to install the library.
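For example, using pip:
pip install scikit-learn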
The goal is to predict the price of an Airbnb listing based on the given data. In Section 2 we removed the ids, as they contain no useful information about the listing, and we also removed the host's name, since it may introduce bias that is not relevant to the problem.
We continue to include the name information to see if we can apply feature extraction methods to it (and, back in Section 2, we dropped any rows missing this field). However, as you will see later, it did not prove to be very useful.
# Work on a copy so that feature engineering does not modify the original dataframe
ml_data = data.copy()
ml_data.head()
For linear regression, the model generally works best when the attribute being regressed (more precisely, the residuals) is approximately normally distributed. Let's see if that is the case by plotting the distribution of prices.
# Plot distribution of price
sns.distplot(ml_data['price'])
plt.title('Distribution of Price')
We can see that the distribution is strongly right-skewed. This is problematic for our regression, so we address it by applying a log transform to obtain a roughly normal distribution. Additionally, since the log of 0 is undefined, we add 1 to our price prior to taking the log so that we avoid invalid values.
# Calculate the log price
ml_data['log_price'] = np.log(ml_data['price'] + 1)
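# Equivalently: ml_data['log_price'] = np.log1p(ml_data['price'])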
# Plot the new distribution
sns.distplot(ml_data['log_price'])
plt.title('Distribution of Log Price')
It can be seen that this feature has a much more normal distribution and thus is ready for regression.
In this section we want to process our data which includes steps like splitting the data into a training and testing set so that we can fit a model to our data.
The next thing we want to do is split the data into X and y. Remember that y is now the log of the price, not the price itself.
X = ml_data.drop(['price', 'log_price'], axis=1)
y = ml_data[['log_price']]
# Obtain feature vectors
v = TfidfVectorizer()
namesVect = v.fit_transform(X['name']).toarray()
# Scale the data before PCA
scale = StandardScaler()
namesVect = scale.fit_transform(namesVect)
# Apply PCA
pca = PCA(n_components=10)
namesVect_pca = pca.fit_transform(namesVect)
namesVect_pca
pca.explained_variance_ratio_
Looking at the PCA explained variance ratio, we can examine how much of the variance each component captures. Since so little of the variance is captured in the first 10 components, this kind of analysis is unlikely to be useful for predicting price, so we do not include it in our model.
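As an aside, since the TF-IDF matrix is sparse, a more memory-friendly way to attempt the same dimensionality reduction (a sketch only, not used in the rest of this tutorial) is TruncatedSVD, which operates directly on sparse input:
from sklearn.decomposition import TruncatedSVD
# TruncatedSVD works on the sparse TF-IDF matrix directly,
# avoiding the dense toarray() conversion used above
svd = TruncatedSVD(n_components=10, random_state=1)
names_svd = svd.fit_transform(v.transform(X['name']))
svd.explained_variance_ratio_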
Next, we split the data into training and testing with 80% of the data in the training set and 20% of it in the testing set.
# random_state is fixed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
We now build a pipeline for processing our data.
The pipeline first selects the relevant columns of the data that we want to feed into our model. For numerical data, we impute missing values with the column mean and then scale the inputs. For categorical data, we impute with the most frequent value and then one-hot encode the categories into numerical values that can be passed into the regression model. We then combine the data from these different pipelines using sklearn's FeatureUnion class.
# Transformation will select numerical or categorical columns automatically
from sklearn.base import BaseEstimator, TransformerMixin
# Create a class to select numerical or categorical columns
class DataFrameSelector(BaseEstimator, TransformerMixin):
def __init__(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names].values
# Select the relevant numerical columns
num_cols = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
'reviews_per_month', 'calculated_host_listings_count',
'availability_365', 'days_since_last_review']
# Select relevant categorical columns
cat_cols = ['neighbourhood_group', 'room_type']
# Pipeline for handling numerical data: select, impute, scale
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_cols)),
('imputer', SimpleImputer(strategy="mean")),
('scaler', StandardScaler())
])
# Pipeline for handling categorical data: select, impute, encode
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_cols)),
('imputer', SimpleImputer(strategy='most_frequent')),
('one-hot', OneHotEncoder())
])
# Combine pipelines to get final pipeline for all data
full_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
Now, we use the pipeline to fit to and transform the training data, and then transform the test data. It is very important that we only transform the test data and not fit on it. This is so the test data does not leak into the training data and we maintain the test set as a viable evaluation of our model.
X_train_prepared = full_pipeline.fit_transform(X_train)
X_train_prepared.shape
X_test_prepared = full_pipeline.transform(X_test)
X_test_prepared.shape
We now have our processed data and are ready to fit a model to our data! There are a variety of different regression models, but we will specifically look at linear models such as Linear Regression. Additionally, we will look into using different variations that add regularization such as Ridge and Lasso as well as ElasticNet.
If you would like to learn more about these kinds of models, here are a few links to look at:
First, let's build a Linear Regression model using the default parameters.
# Build model and fit with training data
lin_reg = LinearRegression()
lin_reg.fit(X_train_prepared, y_train)
# Use model to get predictions on testing set
y_preds = lin_reg.predict(X_test_prepared)
print('Mean Absolute Error: ', mean_absolute_error(y_test, y_preds))
print('Root Mean Squared Error: ', np.sqrt(mean_squared_error(y_test, y_preds)))
We can now take this code and turn it into a function, allowing us to test out a variety of different models.
# Train and evaluate a given model, returns the trained model
def train_and_evaluate_model(model):
model.fit(X_train_prepared, y_train)
y_preds = model.predict(X_test_prepared)
print('Mean Absolute Error: ', mean_absolute_error(y_test, y_preds))
print('Root Mean Squared Error: ', np.sqrt(mean_squared_error(y_test, y_preds)))
return model
Let's take a look at the performance for Ridge, Lasso and Elastic-Net Regression.
print('Ridge Scores: ')
ridge = train_and_evaluate_model(Ridge())
print('\nLasso Scores: ')
lasso = train_and_evaluate_model(Lasso())
print('\nElastic Net Scores: ')
elastic = train_and_evaluate_model(ElasticNet())
We can see that the Ridge and plain Linear Regression models perform the best, with the lowest mean absolute error and root mean squared error. One possible reason is that the default regularization strengths for Lasso and ElasticNet are too high for this data. To improve on these results, you can look into techniques such as cross-validation and grid search for finding the best hyperparameters to tune your model, as in the sketch below. Additionally, you can look into other types of regression models like SVMs and Random Forests.
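For example, here is a minimal sketch of a cross-validated grid search over the Ridge regularization strength (the alpha grid below is an arbitrary choice, not tuned for this dataset):
from sklearn.model_selection import GridSearchCV
# Try a few regularization strengths with 5-fold cross-validation
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5,
                           scoring='neg_mean_absolute_error')
grid_search.fit(X_train_prepared, y_train)
print('Best alpha: ', grid_search.best_params_['alpha'])
print('Best cross-validated MAE: ', -grid_search.best_score_)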
This link can give you a better idea of what model might suit your needs the best.
In this tutorial we went over the main steps of a data science pipeline:
1. Data Curation, Parsing, and Management
2. Exploratory Data Analysis
3. Hypothesis Testing and Machine Learning
We applied the data science pipeline to our specific study of Airbnb listings in New York City in 2019. However, these steps can be generalized to any kind of data or problem that you may be interested in.
Data science is a very powerful tool that has changed the way we understand people, businesses, and society. We can understand trends and patterns from the past and even extrapolate them to predict the future. Data science gives us a scientific and objective lens that provides insight into the world around us through data.