SG Airbnb Insights: Looking into Superhost Status

Samantha Lin
11 min read · Nov 4, 2020
Infinity pool at Marina Bay Sands

Problem Statement

Airbnb is a home-sharing platform that allows homeowners to list their properties online so that guests from anywhere can book a stay. Hosts are expected to set their own prices for their listings. Although Airbnb provides some general guidance, there are currently no free and accurate services that help hosts understand the current Airbnb situation in Singapore. This study addresses the following question:

  • Does the relationship between guest satisfaction rate and price differ by superhost status?

The following packages are loaded:

library(tidyverse)
library(gridExtra)
library(grid)
library(gvlma)
library(moments)
library(ggcorrplot)
library(caret)
library(broom)
library(lubridate)
library(ggmap)
library(modelr)
library(car)
library(ggfortify)
library(leaflet)
library(jtools)
library(huxtable)
library(ggstance)
library(interactions)
library(mediumr)

Import

In this study, the Singapore Airbnb dataset is imported from Inside Airbnb, as scraped on 22nd June 2020.

raw.data <- read_csv("http://data.insideairbnb.com/singapore/sg/singapore/2020-06-22/data/listings.csv.gz")

Variables

The dataset consists of 106 variables and 7,323 observations.
The variables that will be selected in this study are:

  • Price of listing given by host (price)
  • Region of the listing (neighbourhood_group_cleansed)
  • Neighbourhood of the listing (neighbourhood_cleansed)
  • Type of property listed on Airbnb (property_type)
  • Type of room listed on Airbnb (room_type)
  • Number of reviews given by guests (number_of_reviews)
  • Satisfaction rating given by guests (review_scores_rating)
  • Latitude/longitude location of the unit listed on Airbnb (latitude/longitude)
  • Amenities provided by the host (amenities)
  • Cleaning fee imposed by the host (cleaning_fee)
  • Status of the host on Airbnb (host_is_superhost)
  • Number of bedrooms in the listed unit (bedrooms)
  • Number of bathrooms in the listed unit (bathrooms)
  • Number of guests the listed unit can accommodate (accommodates)

Tidy & Transform

The study targets short-term stays (minimum nights of fewer than 3). The snippet below selects the variables of interest; the filter on minimum_nights itself is not shown.

data_filtered <- raw.data %>%
  select(neighbourhood_group_cleansed, neighbourhood_cleansed, property_type,
         room_type, price, number_of_reviews, review_scores_rating,
         latitude, longitude, amenities, cleaning_fee, host_is_superhost,
         bedrooms, bathrooms, accommodates)

glimpse(data_filtered)
## Rows: 7,323
## Columns: 15
## $ neighbourhood_group_cleansed <chr> "North Region", "Central Region", "North Region", "East R…
## $ neighbourhood_cleansed <chr> "Woodlands", "Bukit Timah", "Woodlands", "Tampines", "Tam…
## $ property_type <chr> "Apartment", "Apartment", "Apartment", "Villa", "House", …
## $ room_type <chr> "Private room", "Private room", "Private room", "Private …
## $ price <chr> "$84.00", "$80.00", "$70.00", "$167.00", "$95.00", "$84.0…
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 20, 12, 133, 104, 14…
## $ review_scores_rating <dbl> 94, 91, 98, 89, 83, 88, 82, 99, 99, 88, 94, 89, 90, 84, 9…
## $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.34567, 1.34702, 1.3…
## $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.9571, 103.9596, 103.961…
## $ amenities <chr> "{TV,\"Cable TV\",Internet,Wifi,\"Air conditioning\",\"Pe…
## $ cleaning_fee <chr> NA, NA, NA, "$56.00", "$28.00", "$28.00", "$70.00", "$0.0…
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TR…
## $ bedrooms <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0, NA, 0.5, 1.0…
## $ accommodates <dbl> 1, 2, 1, 6, 3, 3, 6, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, …

The next steps of the tidying process are:

  • Undefined (NA) observations are replaced with zero.
  • All price variables are converted from character to numeric, with dollar signs removed.
  • amenities is replaced by the number of amenities listed (number of commas plus one).
  • Missing values in host_is_superhost are replaced with FALSE.
  • A superhost column is created to encode the boolean as numeric, where FALSE is 0 and TRUE is 1.

data_filtered <- data_filtered %>%
  mutate(
    review_scores_rating = replace_na(review_scores_rating, 0),
    # Parse the dollar strings first, then replace missing fees with 0
    price = parse_number(price),
    cleaning_fee = replace_na(parse_number(cleaning_fee), 0),
    # Count amenities as number of commas + 1
    amenities = str_count(amenities, ",") + 1,
    host_is_superhost = replace_na(host_is_superhost, FALSE),
    superhost = ifelse(host_is_superhost, 1, 0),
    bedrooms = replace_na(bedrooms, 0),
    bathrooms = replace_na(bathrooms, 0)
  )
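
As a quick illustration of two of the transformations above, the sketch below applies parse_number() and the comma-counting rule to a few made-up strings in the same format as the raw columns. The toy values are assumptions for illustration only.

```r
library(readr)
library(stringr)

# Made-up strings in the same format as the raw price and amenities columns
toy_price     <- c("$84.00", "$1,200.00", NA)
toy_amenities <- c("{TV,\"Cable TV\",Internet}", "{Wifi}", "{}")

# parse_number() drops the dollar sign and the thousands separator
parsed_price <- parse_number(toy_price)

# Number of amenities = number of commas + 1
amenity_count <- str_count(toy_amenities, ",") + 1

parsed_price
amenity_count
```

Note that under this rule an empty amenities set ("{}") is still counted as 1.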

Check for any presence of NAs

map(data_filtered, ~sum(is.na(.)))

To calculate the distance from each listing to the city centre (Marina Bay Sands, MBS),
the latitude and longitude of MBS are extracted from the Google API.

## Rows: 1
## Columns: 2
## $ lon <dbl> 103.8607
## $ lat <dbl> 1.283894

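Create a function to calculate the distance between each Airbnb listing and MBS.
The geocoded latitude and longitude of MBS are 1.283894 and 103.8607; note that the function below uses a slightly different reference point (1.285332, 103.8594).
Listings within 5 km of MBS are termed near; all others are termed far.

dist_centre <- function(lat, long) {
  # Approximate kilometres per degree (flat-earth approximation)
  degree_to_km <- 111.139
  degree_to_km * sqrt((1.285332 - lat)^2 + (103.8594 - long)^2)
}

# dist_centre() is vectorised, so it can be applied to the columns directly
data_filtered <- data_filtered %>%
  mutate(distance = dist_centre(latitude, longitude))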
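
The near/far labelling described above is not shown in code. A minimal sketch, assuming a hypothetical proximity column and the 5 km cut-off from the text, could look like this. The first two coordinates come from the glimpse() output above; the third is made up to sit close to MBS.

```r
library(dplyr)

# Same approximation and reference point as dist_centre() in the post
dist_centre <- function(lat, long) {
  111.139 * sqrt((1.285332 - lat)^2 + (103.8594 - long)^2)
}

# Two listings from the glimpse() output plus one made-up point near MBS
toy_listings <- tibble(
  latitude  = c(1.44255, 1.33235, 1.28600),
  longitude = c(103.7958, 103.7852, 103.8600)
)

# proximity is a hypothetical column name, not used elsewhere in the post
toy_listings <- toy_listings %>%
  mutate(distance  = dist_centre(latitude, longitude),
         proximity = if_else(distance <= 5, "near", "far"))

toy_listings
```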

Rename columns

  • neighbourhood_group_cleansed to region
  • neighbourhood_cleansed to neighbourhood
  • review_scores_rating to satisfaction_rate

data_filtered <- data_filtered %>%
  rename(region = neighbourhood_group_cleansed,
         neighbourhood = neighbourhood_cleansed,
         satisfaction_rate = review_scores_rating)

As price is an important predictor in model building, it is essential to check for outliers.

price_bp1 <- data_filtered %>%
  ggplot(aes(y = price)) +
  geom_boxplot() +
  labs(title = "Singapore Airbnb Price Distribution",
       subtitle = "Presence of extreme outliers",
       caption = "Source: http://data.insideairbnb.com",
       y = "Price (SGD)")

price_bp1

From the boxplot above, extreme price outliers can be observed, with some listings priced above SGD 5,000 per night. These outliers are removed so that a handful of extreme listings do not distort the analysis. The outliers are assigned to a vector and then filtered out.

price_outlier <- boxplot(data_filtered$price, plot = FALSE)$out
# filter() is safe even when no outliers are found, unlike -which(...)
data_filtered <- data_filtered %>%
  filter(!price %in% price_outlier)
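
For context, boxplot() by default flags a value as an outlier when it lies more than 1.5 times the hinge spread (roughly the interquartile range) beyond the hinges. The sketch below reproduces that rule by hand on made-up prices; the numbers are illustrative and not taken from the dataset.

```r
# Made-up nightly prices (SGD) with one extreme value
toy_price <- c(40, 52, 63, 70, 80, 84, 95, 167, 209, 5000)

# Values flagged by the default boxplot rule (coef = 1.5)
price_outlier <- boxplot(toy_price, plot = FALSE)$out

# The same fences computed by hand from the lower and upper hinges
hinges <- fivenum(toy_price)[c(2, 4)]
fences <- hinges + c(-1.5, 1.5) * diff(hinges)

# Keep only the prices that were not flagged
kept <- toy_price[!toy_price %in% price_outlier]

price_outlier
fences
```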

Visual comparison of the price distribution before and after outlier removal

grid.arrange(price_bp1, price_bp2, ncol= 2)
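
price_bp2 is used in the grid.arrange() call above but is not defined in the text. A minimal sketch, assuming it mirrors price_bp1 on the outlier-free data, is given below. price_boxplot is a hypothetical helper and toy_filtered is a small stand-in for data_filtered; in the post's pipeline the helper would be called on data_filtered itself.

```r
library(ggplot2)

# Hypothetical helper (not from the original post): same styling as price_bp1
price_boxplot <- function(df, subtitle) {
  ggplot(df, aes(y = price)) +
    geom_boxplot() +
    labs(title = "Singapore Airbnb Price Distribution",
         subtitle = subtitle,
         caption = "Source: http://data.insideairbnb.com",
         y = "Price (SGD)")
}

# Toy stand-in for data_filtered after outlier removal
toy_filtered <- data.frame(price = c(84, 80, 70, 167, 95, 84, 209, 52, 54))

price_bp2 <- price_boxplot(toy_filtered, "After removal of outliers")
```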

After removal of the outliers, the interquartile range of the price variable becomes clearer and more statistical insights can be inferred.

Exploratory Data Analysis

Visualization of price/listing distribution by Region

The region variable has 5 unique categories. From the bar chart, the Central Region has the largest number of Airbnb listings while the North Region has the fewest. From the boxplot, the Central and East Regions have a similar mean price and interquartile range.

Visualization of price/listing distribution by Neighbourhood

The neighbourhood variable has 39 unique categories. The 10 most frequent neighbourhoods are filtered to gain more insights. Kallang is the most popular neighbourhood among Airbnb hosts and, within the same top 10, Downtown Core listings have the highest average price. 9 of the top 10 neighbourhoods are located in the Central Region, with Bedok in the East Region as the exception.
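
The code behind these charts is not shown. A minimal sketch of how the most frequent neighbourhoods and their average prices could be summarised is given below; top_neighbourhoods is a hypothetical helper and the toy listings are made up for illustration.

```r
library(dplyr)

# Made-up listings; in the post this would be data_filtered
toy_listings <- tibble(
  neighbourhood = c("Kallang", "Kallang", "Kallang", "Bedok", "Bedok", "Downtown Core"),
  price         = c(80, 100, 90, 60, 70, 250)
)

# Hypothetical helper: the k most frequent neighbourhoods with their mean price
top_neighbourhoods <- function(df, k = 10) {
  df %>%
    group_by(neighbourhood) %>%
    summarise(listings = n(), mean_price = mean(price), .groups = "drop") %>%
    arrange(desc(listings)) %>%
    slice_head(n = k)
}

top2 <- top_neighbourhoods(toy_listings, k = 2)
top2
```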

Visualization of price/listing distribution by Room Type

The room_type variable has 4 unique categories. From the bar chart, private rooms have the largest number of listings, and the entire home/apartment room type has the highest average price.

Transformation

Before diving deeper into the dataset, the variables need further transformation to support the analysis. The categorical variables are converted to numeric codes.

columns <- c("region", "neighbourhood", "property_type", "room_type")
data_filtered <- data_filtered %>%
  mutate(across(all_of(columns), ~ as.numeric(as.factor(.x))))
glimpse(data_filtered)
## Rows: 6,973
## Columns: 17
## $ region <dbl> 3, 1, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 4, 2,…
## $ neighbourhood <dbl> 40, 7, 40, 36, 36, 36, 36, 2, 2, 5, 5, 20, 12, 5, 5, 36, 12, 20, 20,…
## $ property_type <dbl> 2, 2, 2, 26, 18, 18, 18, 25, 25, 2, 2, 2, 2, 2, 2, 18, 25, 21, 25, 2…
## $ room_type <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ price <dbl> 84, 80, 70, 167, 95, 84, 209, 52, 54, 49, 41, 63, 40, 45, 41, 49, 80…
## $ number_of_reviews <dbl> 1, 18, 20, 20, 24, 48, 29, 176, 199, 20, 12, 133, 104, 14, 10, 58, 0…
## $ satisfaction_rate <dbl> 94, 91, 98, 89, 83, 88, 82, 99, 99, 88, 94, 89, 90, 84, 93, 92, 0, 9…
## $ latitude <dbl> 1.44255, 1.33235, 1.44246, 1.34541, 1.34567, 1.34702, 1.34348, 1.323…
## $ longitude <dbl> 103.7958, 103.7852, 103.7967, 103.9571, 103.9596, 103.9610, 103.9634…
## $ amenities <dbl> 9, 13, 10, 28, 25, 19, 24, 37, 36, 15, 17, 28, 16, 13, 16, 29, 11, 2…
## $ cleaning_fee <dbl> 0, 0, 0, 56, 28, 28, 70, 0, 0, 65, 65, 42, 0, 55, 65, 28, 0, 42, 42,…
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, …
## $ bedrooms <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 1.0, 0.0, 0.5, 1.0, 1.0, 0.0…
## $ accommodates <dbl> 1, 2, 1, 6, 3, 3, 6, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2,…
## $ superhost <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ distance <dbl> 18.848617, 9.761806, 18.803281, 12.748842, 13.002183, 13.212958, 13.…

Correlation Matrix

In preparation for the model, a correlation matrix is created to see how the variables of interest (within the model) are related.

# Drop latitude, longitude and the logical superhost flag (columns 8, 9 and 12)
data_modeling <- select(data_filtered, -latitude, -longitude, -host_is_superhost)
glimpse(data_modeling)

# All remaining columns are numeric, so cor() can be applied directly
corr <- round(cor(data_modeling), 2)

ggcorrplot(corr, lab = TRUE, colors = c("indianred1", "white", "dodgerblue"),
           show.legend = TRUE, outline.color = "white", type = "lower",
           hc.order = TRUE, tl.cex = 10, lab_size = 3, sig.level = .2,
           title = "Correlation Matrix")

Model Building

Mean Centering

Mean-centering transformations are performed on all the variables that will be turned into interaction terms.

data_centered <- data_modeling %>%
  select(satisfaction_rate,            # Outcome variable
         price,
         neighbourhood, property_type, distance,
         room_type, region, number_of_reviews, accommodates,
         amenities, cleaning_fee, bedrooms, bathrooms,
         superhost) %>%                # Moderator
  # across() replaces the deprecated mutate_at()/funs() combination
  mutate(across(price:bathrooms, ~ .x - mean(.x, na.rm = TRUE)))
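
To make the effect of mean-centering concrete, the sketch below centres a few made-up values with across() and then adds the mean back to recover the original prices, which is the same idea used later to undo the centering. The toy data frame is an assumption for illustration.

```r
library(dplyr)

# Made-up values; the moderator (superhost) is left uncentred
toy <- tibble(
  satisfaction_rate = c(94, 91, 98, 89),
  price             = c(84, 80, 70, 167),
  bathrooms         = c(1, 1, 1, 0.5),
  superhost         = c(0, 0, 1, 1)
)

toy_centered <- toy %>%
  mutate(across(price:bathrooms, ~ .x - mean(.x, na.rm = TRUE)))

# Adding the original mean back recovers the raw prices
restored_price <- toy_centered$price + mean(toy$price)

toy_centered
```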

Specify the Regression Model

The regression models are fitted as generalized linear models with a Poisson distribution. In the first model (model1), satisfaction rate is regressed on price and the comfort factors.

The key investigation lies in the next model (model2), in which satisfaction rate is regressed on price, the comfort factors and the price × superhost interaction term.

The anova function is then used to test whether model2, with the interaction term, has more explanatory power than model1.

The results of the analysis suggest that adding the interaction term significantly improves the explanatory power of model2 compared with model1.
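
The code for model1 and model2 is not shown in the post. The sketch below illustrates the comparison on simulated data, assuming the formulas mirror the VIF output and the stone model further down (main effects only versus price*superhost). The simulated coefficients and the reduced set of predictors are made up for illustration.

```r
set.seed(1)
n <- 500

# Simulated, already mean-centred predictors (illustrative only)
toy <- data.frame(
  price     = rnorm(n),
  superhost = rbinom(n, 1, 0.3),
  amenities = rnorm(n)
)

# Made-up data-generating process with a price x superhost interaction
toy$satisfaction_rate <- rpois(
  n, exp(4.4 - 0.05 * toy$price + 0.10 * toy$price * toy$superhost)
)

# model1: main effects only; model2: adds the interaction term
model1 <- glm(satisfaction_rate ~ price + superhost + amenities,
              data = toy, family = poisson)
model2 <- glm(satisfaction_rate ~ price * superhost + amenities,
              data = toy, family = poisson)

# Likelihood-ratio (chi-squared) test of the nested models
comparison <- anova(model1, model2, test = "Chisq")
comparison
```

For a Poisson GLM, anova() compares residual deviance rather than R-squared, so the test asks whether the interaction term reduces the deviance by more than chance would.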

Multicollinearity Check

Check for multicollinearity in model1 and model2 using vif().

vif(model1); vif(model2)
##             price         superhost     neighbourhood         room_type     property_type
##          2.115065          1.140453          1.049608          1.799833          1.056309
##          distance number_of_reviews      accommodates         amenities      cleaning_fee
##          3.348875          1.098825          1.608353          1.263925          1.175992
##            region          bedrooms         bathrooms
##          3.204368          1.657571          1.228814

##             price         superhost     neighbourhood         room_type     property_type
##          2.593531          1.178599          1.049829          1.828312          1.060624
##          distance number_of_reviews      accommodates         amenities      cleaning_fee
##          3.363900          1.104111          1.604539          1.285326          1.179046
##            region          bedrooms         bathrooms   price:superhost
##          3.212724          1.658764          1.236818          1.452826

From the above results, all VIF values are below 5, which indicates no issue with collinearity.
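
For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on the built-in mtcars data, which is only a stand-in for the Airbnb predictors; vif_manual is a hypothetical helper and not part of the car package.

```r
# Hypothetical helper: VIF of one predictor from its R-squared on the others
vif_manual <- function(data, predictor) {
  others <- setdiff(names(data), predictor)
  fit <- lm(reformulate(others, response = predictor), data = data)
  1 / (1 - summary(fit)$r.squared)
}

# Built-in data used as a stand-in for the model's predictors
predictors <- mtcars[, c("disp", "hp", "wt")]

vifs <- sapply(names(predictors), vif_manual, data = predictors)
vifs
```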

Visualize — Price x Superhost

To visualize the regression fitted above, the model's predictions are stored over a grid of price and superhost values, with all other (mean-centered) predictors held at 0, that is, at their means.

predicted_Y1 <- data_centered %>%
  modelr::data_grid(price,
                    superhost,
                    neighbourhood = 0,
                    region = 0,
                    property_type = 0,
                    number_of_reviews = 0,
                    accommodates = 0,
                    amenities = 0,
                    cleaning_fee = 0,
                    bedrooms = 0,
                    bathrooms = 0,
                    room_type = 0,
                    distance = 0) %>%
  mutate(pred_SR1 = predict(model2, ., type = "response"))

Undo the centering of variable (price).

predicted_Y1 <- predicted_Y1 %>% 
mutate(price = price + mean(data_modeling$price)
)

The following figure shows two lines that illustrate how the relationship between satisfaction rate and price differs by superhost status.
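
The plotting code is not shown in the post. A minimal ggplot2 sketch of such a two-line figure is given below; toy_pred stands in for predicted_Y1 and its numbers are made up purely to produce two lines with different slopes.

```r
library(ggplot2)

# Made-up predictions standing in for predicted_Y1 (illustrative numbers only)
toy_pred <- expand.grid(price = seq(50, 300, by = 50), superhost = c(0, 1))
toy_pred$pred_SR1 <- with(toy_pred,
                          80 - 0.02 * price + superhost * (5 + 0.03 * price))

# One line per superhost status
interaction_lines <- ggplot(toy_pred,
                            aes(x = price, y = pred_SR1,
                                colour = factor(superhost))) +
  geom_line() +
  labs(x = "Price (SGD)",
       y = "Predicted satisfaction rate",
       colour = "Superhost")

interaction_lines
```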

Probing Interactions.

stone <- glm(satisfaction_rate ~ price * superhost + neighbourhood + room_type +
               property_type + distance + number_of_reviews + accommodates +
               amenities + cleaning_fee + region + bedrooms + bathrooms,
             data = data_centered, family = poisson)

Run a simple slopes analysis, first without and then with the Johnson-Neyman interval.

sim_slopes(stone,
           data = data_modeling,
           pred = price,
           modx = superhost,
           johnson_neyman = FALSE)
## SIMPLE SLOPES ANALYSIS
##
## Slope of price when superhost = 0.00 (0):
##
## Est. S.E. z val. p
## ------- ------ -------- ------
## -0.00 0.00 -30.82 0.00
##
## Slope of price when superhost = 1.00 (1):
##
## Est. S.E. z val. p
## ------ ------ -------- ------
## 0.00 0.00 4.26 0.00

sim_slopes(stone,
           data = data_modeling,
           pred = price,
           modx = superhost,
           johnson_neyman = TRUE)
## JOHNSON-NEYMAN INTERVAL
##
## When superhost is OUTSIDE the interval [0.78, 0.91], the slope of price is p < .05.
##
## Note: The range of observed values of superhost is [0.00, 1.00]
##
## SIMPLE SLOPES ANALYSIS
##
## Slope of price when superhost = 0.00 (0):
##
## Est. S.E. z val. p
## ------- ------ -------- ------
## -0.00 0.00 -30.82 0.00
##
## Slope of price when superhost = 1.00 (1):
##
## Est. S.E. z val. p
## ------ ------ -------- ------
## 0.00 0.00 4.26 0.00

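The result indicates that the slope of price is significant (p < .05) whenever superhost lies outside the interval [0.78, 0.91]. Since superhost only takes the values 0 and 1, both of which fall outside this interval, the slope of price is significant for superhosts and non-superhosts alike.

Run interact_plot() to plot the interaction, adding a benchmark for the regions of significance.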

Interpretation of the Results

From the analysis above, the following findings are inferred:

  1. Guest satisfaction rate increases with price for superhosts, whereas it decreases with price for non-superhosts. Hence, it is important for Airbnb hosts to achieve superhost status, as it seems to garner a better staying experience for guests, and this may indirectly increase booking demand for their listings.
  2. From the earlier exploratory data analysis, price is more specific to room type and region. Entire homes and apartments tend to be more expensive because a larger space is rented. The Central Region has higher demand, and most of the popular neighbourhoods are also located there.

Limitations

Airbnb has a fairly complex relationship with Singapore. For about five years, the Singapore government has labeled the short-term rentals offered through Airbnb as an illegal service. This study used a dataset scraped in June 2020, during the Covid-19 pandemic. With the negative impact of Covid-19 on tourism and government restrictions on Airbnb's business concept, many hosts may have left the platform. In addition, the study is based on short-term rentals, hence many observations have been filtered away.

Samantha Lin

A development engineer exploring in the field of data analytics and additive manufacturing