Introduction

My research goal was to find how much of a role location plays in the popularity of restaurants in Boston. I was especially interested in whether proximity to an MBTA subway stop had an effect on Yelp ratings. At first, I assumed that restaurants closer to T-stops would be more popular, but as I started to collect data I realized this may not be the case, because it is difficult to find many restaurants more than 0.3 miles from a stop. I expected to see trends in overall rating and number of reviews as a function of zip code, but not as a function of distance to T-stops.

Before creating my models, I hypothesized the following: 1) the linear regressions of star rating and of number of reviews as functions of distance from a subway stop would both show no correlation, 2) number of reviews and star ratings would have the greatest change in slope in the North End and the greatest decrease in Chinatown, and 3) there would be no correlation between distance to T-stops and neighborhood.

Methods

I recorded the star rating and number of reviews from Yelp.com for the first 25 “moderately priced” restaurants in each of four zip codes in Boston, Massachusetts. Each restaurant was placed in a neighborhood category based on its zip code: 02113 for the North End, 02116 for Back Bay, 02111 for Chinatown, and 02215 for Fenway. I used Google Maps to find how far, in miles, each restaurant was from the nearest MBTA subway station.
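The zip code to neighborhood assignment described above can be sketched in R as a named lookup vector. This is only an illustration: the names `zip_to_neighborhood` and `neighborhood_for` are hypothetical and do not come from the original analysis.

```r
# Zip codes and neighborhood labels exactly as listed in the Methods text.
zip_to_neighborhood <- c(
  "02113" = "North End",
  "02116" = "Back Bay",
  "02111" = "Chinatown",
  "02215" = "Fenway"
)

# Hypothetical helper: return the neighborhood for each zip code.
# Zip codes must be stored as character strings, because a numeric
# zip code would lose its leading zero and fail to match.
neighborhood_for <- function(zip) {
  unname(zip_to_neighborhood[as.character(zip)])
}

neighborhood_for(c("02113", "02215"))
```

Any zip code outside these four would come back as `NA`, which makes stray entries easy to spot.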

To analyze my collected data, I used lm_basic models to perform inference for the mean and linear regression tests. Inference for the mean looks at the change in slope for different categories of a variable, and linear regression estimates a line relating two continuous variables.
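As a rough sketch of the linear regression step, the code below uses base R's `lm()` and `confint()` as stand-ins for `lm_basic`, on the assumption that they produce a comparable estimate and 95% interval for each coefficient. The function name `fit_with_ci` and the `toy` data frame are hypothetical; the values in `toy` are placeholders for illustration, not the collected Yelp data.

```r
# Fit rating as a linear function of distance and return each
# coefficient alongside its 95% confidence interval.
# Base R lm() is used here as an assumed substitute for lm_basic.
fit_with_ci <- function(data) {
  fit <- lm(rating ~ 1 + distance, data = data)
  cbind(Estimate = coef(fit), confint(fit, level = 0.95))
}

# Placeholder values only; NOT the data collected for this report.
toy <- data.frame(
  distance = c(0.05, 0.10, 0.15, 0.20, 0.30),
  rating   = c(3.5, 4.0, 3.5, 4.0, 4.5)
)

fit_with_ci(toy)
```

The result has one row per coefficient (intercept and slope) and three columns: the estimate and the lower and upper bounds of the interval.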

Results

Linear Regression

First, I looked at both of the continuous variables that help determine the popularity of a restaurant, rating and reviews, as a function of the distance from a subway stop using linear regression models.

## 
## Call:
## lm_basic(formula = rating ~ 1 + distance, data = yelp_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8977 -0.3532  0.1023  0.1913  0.6913 
## 
## Coefficients:
##             Estimate   2.5 % 97.5 %
## (Intercept)   3.7643  3.5928  3.936
## distance      0.4446 -0.2064  1.096
## 
## Residual standard error: 0.3719 on 98 degrees of freedom
## Multiple R-squared:  0.0184, Adjusted R-squared:  0.008381 
## F-statistic: 1.837 on 1 and 98 DF,  p-value: 0.1785

The scatter plot above shows a slight positive correlation between distance and rating. However, as shown in the regression table, the R-squared value is 0.0184, so distance explains less than 2% of the variability in rating, and the 95% confidence interval for the slope ranges from -0.2064 to 1.096. Because this interval includes zero (and the p-value is 0.1785), the relationship is not statistically significant.
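To make the regression table concrete, the short sketch below plugs the reported coefficients into the fitted line and checks whether the slope interval contains zero. The variable names are illustrative, and the 0.3-mile distance is chosen only because it is the upper end of the distances mentioned in the Introduction.

```r
# Coefficients and slope interval copied from the regression table.
intercept <- 3.7643
slope     <- 0.4446
slope_ci  <- c(-0.2064, 1.096)

# Predicted rating for a restaurant 0.3 miles from a subway stop:
# 3.7643 + 0.4446 * 0.3 = 3.89768
predicted <- intercept + slope * 0.3

# An interval that spans zero means a slope of zero cannot be ruled out.
includes_zero <- slope_ci[1] < 0 && slope_ci[2] > 0

predicted
includes_zero
```

Across the whole 0 to 0.3 mile range, the fitted line moves the predicted rating by only about 0.13 stars, which is small compared with the residual standard error of 0.3719.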