I. Introduction

Zillow has realized that its housing market predictions are not as accurate as they could be because they do not factor in enough local intelligence. As such they have asked us to build a better predictive model of home prices for Mecklenberg County, NC. This model aims to build an accurate and generalizable hedonic model that predicts home prices for Mecklenberg County, NC by deconstructing overall home price into the value of constituent parts. Accurate models lead to only small differences between the predicted and observed values, and generalizable models accurately predict on new data and with comparable accuracy across various groups. By working to improve the accuracy and generalizability of the predictive model, we are ultimately striving to create a more useful decision-making tool. Such a model may be useful for local governments as they assess property taxes, for example.

II. Data Manipulation and Visualization

2.0 Setup

In this section, we loaded necessary libraries, created plot theme options and map theme options, and identified functions of quintile breaks, and average nearest neighbor distance for further analysis.

# Load some libraries

library(tidyverse)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot) # plot correlation plot
library(corrr)      # another way to plot correlation plot
library(kableExtra)
library(jtools)     # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr)    # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(tidycensus)
library(geojsonio)
library(stargazer)
library(utils)
library(rgdal)
#library(plyr)

# functions and data directory
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c("#25CB10", "#5AB60C", "#8FA108",   "#C48C04", "#FA7800")

2.1 Data Wrangling

In this section, we loaded Mecklenburg County data and other complementary datasets and conducted feature selection and engineering. Besides, for other analyses, we separate the whole dataset into and in advance of Mecklenburg.

Five categorical datasets were used in this project: Mecklenburg County House Price and Internal Characteristics: basic geometric dataset provided by the course in advance.

Mecklenburg County Base Data: geometric data containing the geographic information of Mecklenburg County. Mecklenburg County Beach Neighborhood: As an alternative way to using neighborhood data of Mecklenburg County Beach, we used Mecklenburg County Municipal Boundary data that is a polygon feature class of municipal boundaries within Mecklenburg County-Dade County, and specifically filter geometric data in Mecklenburg County Beach.

Census Data: demographic variables from the ACS 2017 for census tracts in Mecklenburg County. We specifically selected 8 variables in the analysis:

• TotalPop: Total population in each census tract
• Whites: White population in each census tract
• MedHHInc: Median household income in each census tract
• MedRent: Median Rent for properties in each census tract
• TotalPoverty: Population living under the level of poverty in each census tract
• TotalUnit: Total housing units in each census tract
• pctWhite: white population proportion in each census tract
• pctBachelors: Bachelor population proportion in each census tract
• pctPoverty: Poverty population proportion in each census tract
• pctTotalVacant: Vacant unites proportion in each census tract

Amenity Data: most of the data come from the website of government: http://maps.co.mecklenburg.nc.us/openmapping/data.html.Compilatory data were chosen for describing amenities and public services of Mecklenburg County. The datasets we chose encompasses fireplace, police station, education opportunity, healthcare service, and entertainment accessibility. See each dataset below:

1. Public School: Dataset contains points representing Elementary, Middle, and High Schools in Mecklenburg County.
2. Bus station: A point feature class of CATS Bus Stops for Mecklenburg County.
3. Fire station : Geographical representation of the physical locations of the Charlotte Fire Department Fire Stations.
4. Church : Churches dataset defines locations classified as religious organizations in Mecklenburg County, data is queried from Master Address Table.
5. Medical Facilities: This dataset was generated to publish a geospatial inventory of Medical Facilities in Mecklenburg County, NC. This data serves to support and assist the Charlotte Mecklenburg Planning Department, The City of Charlotte, and others in Urban Planning decisions.
6. Park : Park amenities for all parks in Mecklenburg County, including Charlotte, Davidson, Cornelius, Davidson, Huntersville, Matthews, Mint Hill, and Pineville.
7. Police: Charlotte-Mecklenburg Police Department Stations.

primary.data<-
  st_read("https://raw.githubusercontent.com/mafichman/MUSA_508_Lab/main/Midterm/data/2022/studentData.geojson")%>%
  st_transform('EPSG:2264')

primary.data<-
  primary.data%>%
  filter(price<10000000)

mklb.sf <- 
  primary.data %>% 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 2264, agr = "constant") %>%
  st_transform('EPSG:2264') %>%
  mutate(storyheigh = as.numeric(gsub(" STORY", "", primary.data$storyheigh))
  )

mklb.sf<-
  mklb.sf %>%
  mutate(age=2022-yearbuilt,prisqft =price/heatedarea)%>%
  dplyr::select(commonpid,price,age,storyheigh,heatedarea,numfirepla,totalac,
                fullbaths,halfbaths,bedrooms,units,toPredict,prisqft)

numericVars <- 
  select_if(st_drop_geometry(mklb.sf), is.numeric) %>% na.omit()

#school data
cmpd_school<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CMS_Schools.geojson') %>%
  st_transform('EPSG:2264')

# firestation
cmpd_firestation<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CFD_FireStations.geojson') %>%
  st_transform('EPSG:2264')

# church
cmpd_church<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/Churches.geojson') %>%
  st_transform('EPSG:2264')

# medical facilities
cmpd_medical<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/MedicalFacilities.geojson') %>%
  st_transform('EPSG:2264')
# park
cmpd_park<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/Park_Locations.geojson') %>%
  st_transform('EPSG:2264')
# police
cmpd_police<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CMPD_Police_Offices.geojson') %>%
  st_transform('EPSG:2264')

# bus
cmpd_busstation<-
  st_read('https://raw.githubusercontent.com/RumRon/MUSA_508/main/Midterm/geojson/CATS_BusStops.geojson') %>%
  st_transform('EPSG:2264')

#neighborhood
neighborhood <-
  st_read('https://raw.githubusercontent.com/RumRon/HW_UP_IP/main/Midterm/geojson/VFD_FireDistricts.geojson')%>%
  st_transform('EPSG:2264')

police.sf <-
  cmpd_police %>%
    dplyr::select(name) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

bus.sf<-
  cmpd_busstation %>%
    dplyr::select(StopID) %>%
    na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

church.sf<-
  cmpd_church %>%
    dplyr::select(num_parent) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()


firestation.sf<-
  cmpd_firestation %>%
    dplyr::select(NAME) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

medical.sf<-
  cmpd_medical %>%
    dplyr::select(Name) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

park.sf<-
  cmpd_park %>%
    dplyr::select(prkname) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

school.sf<-
  cmpd_school %>%
    dplyr::select(school) %>%
    #na.omit() %>%
    st_zm(drop = TRUE) %>%
    st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
    st_transform('EPSG:2264') %>%
    distinct()

nhood.sf<-
  neighborhood %>%
  dplyr::select(vfd) %>%
  #filter(name != NA)%>%
  st_zm(drop = TRUE) %>%
  st_as_sf(coords = c("Longitude", "Latitude"), crs = "EPSG:2264") %>%
  st_transform('EPSG:2264') %>%
  distinct()

mklb.sf <- 
  mklb.sf %>% 
     mutate(
       police_nn1 = nn_function(st_coordinates(mklb.sf), 
                                st_coordinates(police.sf), k = 1),
       police_nn2 = nn_function(st_coordinates(mklb.sf), 
                                st_coordinates(police.sf), k = 2),
       police_nn3 = nn_function(st_coordinates(mklb.sf), 
                                st_coordinates(police.sf), k = 3),
       school_nn1 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(school.sf), k = 1),
       school_nn2 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(school.sf), k = 2),
       school_nn3 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(school.sf), k = 3),
       park_nn1 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(park.sf), k = 1),
       park_nn2 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(park.sf), k = 2),
       park_nn3 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(park.sf), k = 3),
       firestation_nn1 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(firestation.sf), k = 1),
       firestation_nn2 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(firestation.sf), k = 2),
       firestation_nn3 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(firestation.sf), k = 3),
       church_nn1 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(church.sf), k = 1),
       church_nn2 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(church.sf), k = 2),
       church_nn3 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(church.sf), k = 3),
       medical_nn1 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(medical.sf), k = 1),
       medical_nn2 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(medical.sf), k = 2),
       medical_nn3 = nn_function(st_coordinates(mklb.sf), 
                              st_coordinates(medical.sf), k = 3))

mklb.sf <-st_join(mklb.sf, nhood.sf)

# acs
acs_variable_list.20 <- load_variables(2020, #year
                                         "acs5", #five year ACS estimates
                                         cache = TRUE)
tracts20 <- 
  get_acs(geography = "tract", variables = c("B25026_001E","B02001_002E","B15001_050E",
                                             "B15001_009E","B19013_001E","B25058_001E",
                                             "B06012_002E","B28010_007E","B08101_001E",
                                             "B09001_001E","B09001_003E","B09021_002E",
                                             "B11001I_001E", "B14001_009E",
                                             "B17001_002E","B27001_001E","B18101_001E",
                                             "B19001_001E","B25001_001E","B25040_001E"), 
          year=2020, state=37, county=119, geometry=T, output="wide") %>%
  st_transform('EPSG:2264') %>%
  mutate(TotalPop = B25026_001E, 
         Whites = B02001_002E,
         FemaleBachelors = B15001_050E, 
         MaleBachelors = B15001_009E,
         MedHHInc = B19013_001E, 
         MedRent = B25058_001E,
         TotalPoverty = B06012_002E,
         Nocom = B28010_007E, 
         Waytowork = B08101_001E,
         Popunder18 = B09001_001E, 
         Popunder3 = B09001_003E,
         Singleadult = B09021_002E, 
         Householdtype = B11001I_001E,
         Addmittogra = B14001_009E,
         Poverty  = B17001_002E,
         Healthins  = B27001_001E,
         Disable  = B18101_001E,
         Familyincome  = B19001_001E,
         Housingunits  = B25001_001E,
         Househeatingfuel  = B25040_001E)%>%
  mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop,0),
         pctBachelors = ifelse(TotalPop > 0, ((FemaleBachelors + MaleBachelors) / TotalPop),0),
         pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop, 0),
         year = "2020")

2.2 Summary Statistics

Present a table of summary statistics with variable descriptions. We sort these variables by their category (internal characteristics, amenities/public services or spatial structure).

#internal
factor_internal <- mklb.sf%>%
  dplyr::select(commonpid,price,age,storyheigh,heatedarea,numfirepla,totalac,
                fullbaths,halfbaths,bedrooms,units,prisqft) %>%
  filter(prisqft!= 'Inf') 

factor_internal.sf <- factor_internal

factor_internal <-
  select_if(st_drop_geometry(factor_internal), is.numeric)  %>% na.omit()
#amentities
factor_amenity <- mklb.sf%>%
  dplyr::select(price,prisqft,police_nn1, police_nn2, police_nn3, school_nn1, school_nn2,  school_nn3, church_nn1, church_nn2, church_nn3, park_nn1, park_nn2, park_nn3,firestation_nn1, firestation_nn2, firestation_nn3, medical_nn1, medical_nn2, medical_nn3,police_nn1, police_nn2, police_nn3)

factor_amenity <- factor_amenity%>%filter(prisqft!= 'Inf')
  # dplyr::select(-commonpid,-age,-storyheigh,-heatedarea,-numfirepla,-totalac,
  #                -fullbaths,-halfbaths,-bedrooms,-units)

factor_amenity.sf <- factor_amenity
factor_amenity <-
  select_if(st_drop_geometry(factor_amenity), is.numeric)  %>% na.omit()
#ACS
ACS1 <- mklb.sf%>%
  dplyr::select(price,prisqft)

ACS2 <- tracts20%>%
  dplyr::select(pctWhite, pctBachelors, pctPoverty, TotalPop, MedHHInc, Singleadult, Healthins, Familyincome, Housingunits, Househeatingfuel)

factor_acs <-st_join(ACS1, ACS2) #%>%
  #distinct(geometry, .keep_all = TRUE)

all.sf <-st_join(mklb.sf, ACS2)

factor_acs <-factor_acs%>%filter(prisqft!= 'Inf')
factor_acs.sf <- factor_acs
factor_acs <-
  select_if(st_drop_geometry(factor_acs), is.numeric) %>% na.omit()


star_internal <- st_drop_geometry(factor_internal)
star_amenity <- st_drop_geometry(factor_amenity)
star_acs <- st_drop_geometry(factor_acs)




star_internal <- st_drop_geometry(factor_internal)
star_amenity <- st_drop_geometry(factor_amenity)
star_acs <- st_drop_geometry(factor_acs)

stargazer(star_internal, type = "html", 
          title = "Table DATA 2.1 Summary Statistics of Internal Characteristics ",
          header = FALSE,
          #out = "try.html",
          single.row = TRUE)

**Table DATA 2.1 Summary Statistics of Internal Characteristics**

Statistic	N	Mean	St. Dev.	Min	Max

price	44,596	462,928.900	377,354.600	0	7,700,000
age	44,596	27.656	25.838	0	2,022
storyheigh	44,596	1.646	0.484	1.000	3.000
heatedarea	44,596	2,367.630	1,070.427	306.000	14,710.000
numfirepla	44,596	0.787	0.466	0	7
totalac	44,596	1.826	108.598	0.000	9,757.440
fullbaths	44,596	2.284	0.829	0	9
halfbaths	44,596	0.602	0.527	0	4
bedrooms	44,596	3.512	0.950	0	65
units	44,596	0.976	0.199	0	8
prisqft	44,596	195.697	107.364	0.000	5,080.808

stargazer(star_amenity, type = "html",
          title = "Table DATA 2.2 Summary Statistics of Amenities Characteristics ",
          header = FALSE,
          #out = "try.html",
          single.row = TRUE)

**Table DATA 2.2 Summary Statistics of Amenities Characteristics**

Statistic	N	Mean	St. Dev.	Min	Max

price	46,201	461,394.900	373,709.800	0	7,700,000
prisqft	46,201	195.586	106.372	0.000	5,080.808
police_nn1	46,201	19,759.000	11,646.630	232.622	60,796.570
police_nn2	46,201	24,973.550	13,360.550	1,689.842	72,490.110
police_nn3	46,201	28,464.720	14,531.070	5,592.504	79,044.680
school_nn1	46,201	4,763.712	3,075.632	116.992	22,481.340
school_nn2	46,201	5,883.655	3,092.816	116.992	22,972.960
school_nn3	46,201	6,949.664	3,335.845	634.527	24,897.730
church_nn1	46,201	2,932.691	2,097.287	17.465	14,635.710
church_nn2	46,201	3,469.833	2,208.135	42.527	16,387.490
church_nn3	46,201	3,929.411	2,308.243	113.048	17,229.310
park_nn1	46,201	4,069.461	2,518.566	68.884	16,490.430
park_nn2	46,201	4,977.050	2,588.607	205.557	16,614.240
park_nn3	46,201	5,711.460	2,679.841	294.621	18,545.810
firestation_nn1	46,201	11,730.960	11,229.980	143.254	63,000.030
firestation_nn2	46,201	14,871.700	11,175.460	2,031.637	64,225.180
firestation_nn3	46,201	17,164.520	11,329.790	3,618.091	66,259.220
medical_nn1	46,201	5,652.947	4,373.851	37.400	24,931.340
medical_nn2	46,201	6,285.044	4,412.851	124.004	25,453.750
medical_nn3	46,201	6,749.224	4,480.915	147.203	26,588.270

stargazer(star_acs, type = "html",
          title = "Table DATA 2.3 Summary Statistics of ACS ",
          header = FALSE,
          #out = "try.html",
          single.row = TRUE)

**Table DATA 2.3 Summary Statistics of ACS**

Statistic	N	Mean	St. Dev.	Min	Max

price	46,198	461,403.800	373,718.800	0	7,700,000
prisqft	46,198	195.587	106.376	0.000	5,080.808
pctWhite	46,198	0.571	0.270	0.011	1.710
pctBachelors	46,198	0.011	0.013	0.000	0.314
pctPoverty	46,198	0.091	0.082	0.000	0.614
TotalPop	46,198	4,030.445	1,309.832	790	7,703
MedHHInc	46,198	86,205.120	37,418.970	17,685	238,750
Singleadult	46,198	396.728	217.725	29	1,202
Healthins	46,198	4,049.650	1,311.905	790	7,703
Familyincome	46,198	1,503.466	440.230	237	2,659
Housingunits	46,198	1,598.289	469.749	249	2,778
Househeatingfuel	46,198	1,503.466	440.230	237	2,659

topredict <-
  all.sf%>%
  filter(toPredict!="CHALLENGE") %>%
  filter(prisqft!= 'Inf')

topredict_num <-
  select_if(st_drop_geometry(topredict), is.numeric) %>% na.omit() 

challenge <-
  all.sf%>%
  filter(toPredict=="CHALLENGE")

challenge_num<-
  select_if(st_drop_geometry(challenge), is.numeric) %>% na.omit()

Present a table of summary statistics with variable descriptions. Sort these variables by their category (internal characteristics, amenities/public services or spatial structure). Check out the stargazer package for this.

2.3 Correlation Matrix

For better visualization, features were divided into internal characteristics and amenity again when creating correlation matrix. Based on the graduated color, some features have strong relation with Sale Price, such as bathroom, fireplace and house age.

Internal characteristics:

# #internal features
# ggcorrplot(
#   round(cor(factor_internal), 1),
#   p.mat = cor_pmat(factor_internal),
#   colors = c('#05A167', "white", '#6897BB'),
#   type="lower",
#   insig = "blank") +
#     labs(title = "Correlation across numeric variables\n(internal features)\n",
#        caption = 'Figure DATA 2.2') +
#     plotTheme()
# 
# #amenity features
# factor_amenity <-factor_amenity%>%filter(prisqft != 'Inf')
# ggcorrplot(
#   round(cor(factor_amenity), 1),
#   p.mat = cor_pmat(factor_amenity),
#   colors = c('#05A167', "white", '#6897BB'),
#   type="lower",
#   insig = "blank") +
#     labs(title = "Correlation across numeric variables\n(Amenity)\n",
#        caption = 'Figure DATA 2.3') +
#     plotTheme()
# 
# ggcorrplot(
#   round(cor(factor_acs), 1),
#   p.mat = cor_pmat(factor_acs),
#   colors = c('#05A167', "white", '#6897BB'),
#   type="lower",
#   insig = "blank") +
#     labs(title = "Correlation across numeric variables\n(ACS)\n",
#        caption = 'Figure DATA 2.4')+
#     plotTheme()

factor_internal %>%
  correlate() %>%
  autoplot()+
  geom_text(aes(label = round(r,digits=2)),size = 2)

Amenity :There is also a strong connection between house price and public facilities. Generally speaking, if the public facilities are better, the house price will also increase.

factor_amenity %>%
  correlate() %>%
  autoplot() +
  geom_text(aes(label = round(r,digits=2)),size = 2)

ACS data :Some of the variables in the ACS data are also closely related to house prices. For example, in this matrix price per square feet seems to be related with percentage of white.

factor_acs %>%
  correlate() %>%
  autoplot() +
  geom_text(aes(label = round(r,digits=2)),size = 2)

2.4 Home prices scatterplots

We present 4 home price correlation scatterplots that we think are of interest, age, bedrooms, living area and single adult. We found a positive correlation between the number of rooms, livingarea and room rates, which is also consistent with common sense. Age and single adults are negatively correlated with house prices, but the coefficients are not large. Age may be due to the fact that buildings built earlier tend to be in low quality.

## Home Features cor
st_drop_geometry(all.sf) %>% 
  mutate(LivingArea = heatedarea) %>%
  dplyr::select(price, LivingArea, age,bedrooms, Singleadult) %>%
  filter(price <= 1000000, age < 500) %>%
  gather(Variable, Value, -price) %>% 
   ggplot(aes(Value, price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, ncol = 4, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     plotTheme()

2.5 Map Home Prices

Looking at the current situation, we can see that home prices are higher in the south and north of Mecklenburg County and lower in the middle.

maphomeprice <-topredict%>%filter(prisqft!= 'Inf')
ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = maphomeprice, aes(colour = q5(prisqft)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(maphomeprice,"prisqft"),
                   name="Quintile\nBreaks") +
  labs(title="Price Per Square Foot, Mecklenburg County") +
  mapTheme()

2.6 Three interesting maps

By plotting the matrix, we choose some variables with high positive or negative correlation. Percentage of white:
The first independent variables we choose is the percentage of white.As the matrix shows, percentage of white and house prices are more positively correlated.We found that both the southern and northern parts of Mecklenburg County have a higher percentage of whites, basically greater than 75%, and this all corresponds to the higher prices in the previous house price map. In contrast, the places with lower house prices have a lower percentage of whites, even below 25%.

# plot map of white

ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = factor_acs.sf%>%filter(pctWhite<1), aes(colour = pctWhite),
          show.legend = "point", size = .75) +
  #scale_colour_manual(values = palette5,
                   #labels=qBr(maphomeprice,"prisqft"),
                  #name="Quintile\nBreaks") +
  labs(title="Percentage of white, Mecklenburg County") +
  mapTheme()

Number of fullbaths in houses:
The second variable is Housing the number of full baths.We chose this variable because they have a strong correlation in the matrix, which is also in line with the general common sense inference. We can see that there are indeed more toilets in the south where the prices are higher as well.

# plot map of age
factor_internal.sf<-factor_internal.sf%>%filter(age != 2022)
ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = factor_internal.sf, aes(colour = fullbaths),
          show.legend = "point", size = .75) +
  #scale_colour_manual(values = palette5,
                   #labels=qBr(maphomeprice,"prisqft"),
                   #name="Quintile\nBreaks") +
  labs(title="Number of fullbaths in houses in Mecklenburg County") +
  mapTheme()

the average distance to nearest park:
The third independent variables is the average distance to nearest park. To our surprise, most of the amenities have negative correlation with the house price, so we choose park as an example. Perhaps the fact that the parks are mostly located at the edge of the city where traffic is difficult has affected the surrounding property prices.

# plot map of park

ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = factor_amenity.sf, aes(colour = park_nn1),
          show.legend = "point", size = .75) +
  #scale_colour_manual(values = palette5,
                   #labels=qBr(maphomeprice,"prisqft"),
                   #name="Quintile\nBreaks") +
  labs(title="Distribution of average distance to\nnearest park in Mecklenburg County") +
  mapTheme()

2.7 Bonus for something extra engaging

hospital: We believe that health care resources are also an important factor in housing prices. Although the correlation is not high in the matrix, it is possible that it is influenced by hospital quality. We still find that areas with higher house prices in the south are closer to hospitals.

hospital_nn1_expanded <- topredict %>%
  mutate(hospital_nn1.expanded = medical_nn1 * 1000)

ggplot() + 
  geom_sf(data = st_union(tracts20),fill = '#E5E5E5') +
  geom_sf(data = st_centroid(hospital_nn1_expanded),aes(color = hospital_nn1.expanded),size = .5) +
  # scale_color_manual(values = palette,
  #                    labels = qBr(hospital_nn1_expanded,'hospital_nn1.expanded'),
  #                    name = "Average Distrance\nto Nearest Hospital\n(timed by 1000 \nfor better visualization)\n") +
  labs(title = 'Distribution of Average Distrance to Nearest Hospital\nof Each Home Sale Observation\n') +
  mapTheme() + 
  plotTheme()

III Methods

Method: The mean objective of this analysis is to train a robust model to predict house price as accurately as possible. Accordingly, this section started to provide a detailed interpretation on the process of the model building and training.

Here are three models used in this part.
model1: the primitive model, where all the facotrs are in.
model2: the model after training, with the facotrs selected.
model3: the model to figure out the ‘challenge’

data，model, feature, result(map). why you interprete it

• Data and Sample: In order to accurately build and test models, we randomly split the whole into two parts: with 60% of observations for model training and with 40% of observations for model testing.
• Dependent Variable: The dependent variable is home sale price: Price
• Independent Variables: Based on the hedonic home price model, potential features should encompass at least three components: internal characteristics, like the number of bathrooms; neighborhood amenities/public services, like hospital accessibility; and the underlying spatial process of prices.Plus, we also take the ACS data into consideration.

After creating features and engineering and checking correlation coefficients(in model1), 38 features are eligible for the model and obviously or slightly correlated with that were the initial independent variables for model building. In further modifications of the model, we mainly considered p-values in determining the correlation of variables and removed the values with larger p-values:totalac, park_nn1, park_nn2, firestation_nn1, medical_nn1, TotalPop. We gradually screened out variables and eventually selected only 32 robust features included in the final model model2.

• Procedure: The primitive approach used in this analysis for model building is Ordinary Least Squares Regression (OLS) which is aimed at predicting the of houses as a function of several statistical parameters such as intercept, slope and selected features.

In order to determine which features were able to significantly increase the accuracy and generalizability of the model, we systematically compared, plotted, and mapping the mean absolute errors (MAE), mean absolute percent errors (MAPE), as well as root mean square errors (RMSE, the standard deviation of the residuals) between original and predicted of both training set and test set, conducted the cross-validation on 100 folds, explored the underlying spatial pattern of price and errors caused by problematic prediction. To address the problem of neglecting the underlying spatial correlation caused by neighborhood, we then compared, plotted, and mapped the mean absolute errors (MAE), and mean absolute percent errors (MAPE) on the test set between baseline model without neighborhood features and updated model with neighborhood features, and then determined our final model based on series model examinations.

In this process include:

IV. Result

4.1 Dataset Splitting and Model Building

Firstly, as mentioned above, we randomly split the variables into 60% and 40%. Secondly,we built three models with different features to examine which features should be included.

model1: the primitive model with baseline, where all the facotrs are in. model2: the model after training, with the facotrs selected. model3: it is based on model2, the function of model3 is to figure out the ‘challenge’ to complete the competition.

The baseline model contained 10 internal characteristic features: age, storyheigh, heatedare, numfirepla, totalac, fullbaths, halfbaths, bedrooms, units, prisqft and ,as well as 18 amenity features. Plus, there are also some ACS features.

set.seed(42)

inTrain <- caret::createDataPartition(
  y = paste(topredict$name,topredict$price),
  p = .60, list = FALSE)

# split data into training and test
model.training <- topredict[inTrain,]
model.test <- topredict[-inTrain,] 

## First model: All internal features, amenity features
model1 <- lm(price ~ ., data = as.data.frame(model.training) %>% 
           dplyr::select(colnames(factor_internal), 
                         colnames(factor_amenity),
                         colnames(factor_acs)
))

# stargazer(model1, type = "html", 
#            title = "Table 4.1.1 Summary Statistics of Model 1 ",
#            #out = "model1.html",
#            header = FALSE,
#            single.row = TRUE)

model3 <- lm(price ~ ., data = as.data.frame(topredict) %>% 
           dplyr::select(# internal features
                         price,age, storyheigh,heatedarea,#prisqft,
                         numfirepla, fullbaths,halfbaths,bedrooms,
                         units,
                         # amenity features
                         police_nn1,police_nn2,police_nn3,school_nn1,
                         school_nn2,school_nn3,church_nn1,church_nn2,
                         church_nn3,park_nn1,park_nn3,firestation_nn2,
                         firestation_nn3,medical_nn2,medical_nn3,
                         #acs
                         pctWhite,pctBachelors,pctPoverty,MedHHInc,
                         Singleadult,Healthins,Familyincome
                         ))
challenge<-
  challenge %>%
  mutate(SalePrice.Predict = predict(model3,challenge))

challenge<- challenge %>%
  mutate(price=SalePrice.Predict,
         prisqft = price/heatedarea)

4.2 Table of results (training set)

In this part, we study the Summary Statistics of Model 1 to select variables to build our final model. We mainly considered p-values in determining the correlation of variables and removed the values with larger p-values. The R² is 0.87, which is an is an okay fit.

model2 <- lm(price ~ ., data = as.data.frame(model.training) %>% 
           dplyr::select(# internal features
                         price, age, storyheigh,heatedarea, prisqft,
                         numfirepla, fullbaths,halfbaths,bedrooms,
                         units,
                         # amenity features
                         police_nn1,police_nn2,police_nn3,school_nn1,
                         school_nn2,school_nn3,church_nn1,church_nn2,
                         church_nn3,park_nn3,firestation_nn2,
                         firestation_nn3,medical_nn1,
                         #acs
                         pctWhite,pctBachelors,pctPoverty,MedHHInc,
                         Singleadult,Healthins,Familyincome
                         )
          
)
model.test <-
  model.test %>%
  mutate(SalePrice.Predict = predict(model2, model.test),
         SalePrice.Error = SalePrice.Predict - price,
         SalePrice.AbsError = abs(SalePrice.Predict - price),
         SalePrice.APE = (abs(SalePrice.Predict - price)) / SalePrice.Predict)%>%
  filter(price < 5000000)

stargazer(model2, type = "html", 
          title = "Table 4.1.2 Summary Statistics of Model 2 ",
          header = FALSE,
          #out = "model2.html",
          single.row = TRUE)

**Table 4.1.2 Summary Statistics of Model 2**

	Dependent variable:

	price

age	202.952^*** (44.923)
storyheigh	-53,389.670^*** (2,792.163)
heatedarea	247.015^*** (1.662)
prisqft	2,117.438^*** (9.094)
numfirepla	7,346.763^*** (2,213.482)
fullbaths	33,980.480^*** (2,035.367)
halfbaths	52,451.320^*** (2,297.486)
bedrooms	-17,336.010^*** (1,162.913)
units	-19,394.380^*** (4,619.095)
police_nn1	-2.092^*** (0.319)
police_nn2	9.818^*** (0.925)
police_nn3	-10.682^*** (0.743)
school_nn1	-3.998^*** (0.980)
school_nn2	13.874^*** (2.179)
school_nn3	-8.966^*** (1.667)
church_nn1	-10.073^*** (1.823)
church_nn2	19.892^*** (4.041)
church_nn3	-18.297^*** (2.940)
park_nn3	5.086^*** (0.437)
firestation_nn2	7.404^*** (0.946)
firestation_nn3	-3.523^*** (1.036)
medical_nn1	-1.122^*** (0.288)
pctWhite	-32,483.920^*** (5,633.095)
pctBachelors	-681,540.400^*** (76,951.500)
pctPoverty	152,606.600^*** (15,849.790)
MedHHInc	1.115^*** (0.044)
Singleadult	173.080^*** (9.437)
Healthins	17.852^*** (2.551)
Familyincome	-95.006^*** (9.078)
Constant	-507,483.900^*** (9,603.819)

Observations	27,781
R²	0.870
Adjusted R²	0.869
Residual Std. Error	152,455.000 (df = 27751)
F Statistic	6,376.479^*** (df = 29; 27751)

Note:	p<0.1; p<0.05; p<0.01

4.3 Table of goodness of fit (test set)

The Mean Absolute Errors (MAE) and Mean Absolute Percent Errors (MAPE) are statistical measures of how accurate a forecast system is.Based on the following table, the MAE is 125091.8, the errors are accounted for 0.4281398.

test_fit<-as.data.frame(cbind(mean(model.test$SalePrice.AbsError,na.rm = T),
                              mean(model.test$SalePrice.APE,na.rm = T)))
colnames(test_fit)<-c("Mean Absolute Errors (MAE)","MAPE")
kable(test_fit,
      caption = 'Table RESULT. MAE and MAPE for Test Set') %>%
  kable_styling("striped", full_width = F)

Table RESULT. MAE and MAPE for Test Set
Mean Absolute Errors (MAE)	MAPE
67696.76	0.2694697

4.4 Cross-Validation Results

Note that R² judges model accuracy on the data used to train the model. Cross-validation ensures that the goodness of fit results for a single hold out is not a fluke. While there are many forms of cross-validation.

This section conducted cross-validation to assess how our model would generalize to other independent data sets, in which we could flag problem of overfitting or feature selection bias.

Both a summary table and distribution plots of R-square, RMSE, and MAE in each of 100 fold are shown below.The cross-validation output provides very important goodness of fit information. The value of each metric is actually the mean value across all folds. The train function returns many objects (names(reg.cv)), one of which is resample which provides goodness of fit for each of the 100 folds.

The variation of 100 folds can also be visualized with a histogram of across-fold MAE. Although most of the errors are clustering tightly together, some errors are scattered, indicating the inconsistency of the model prediction.

fitControl <- trainControl(method = "cv", number = 100)
set.seed(42)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(topredict) %>% 
          dplyr::select(price, age, storyheigh,heatedarea, prisqft,
                         numfirepla, fullbaths,halfbaths,bedrooms,
                         units,
                         # amenity features
                         police_nn1,police_nn2,police_nn3,school_nn1,
                         school_nn2,school_nn3,church_nn1,church_nn2,
                         church_nn3,park_nn1,park_nn3,firestation_nn2,
                         firestation_nn3,medical_nn2,medical_nn3,
                         #acs
                         pctWhite,pctBachelors,pctPoverty,MedHHInc,
                         Singleadult,Healthins,Familyincome),
        method = "lm", trControl = fitControl, na.action = na.pass)

kable(reg.cv$resample,
          caption = 'Table RESULT. Cross-validation Test: Summary of RMSE, R Squared and MAE') %>%
  kable_styling("striped", full_width = F)

Table RESULT. Cross-validation Test: Summary of RMSE, R Squared and MAE
RMSE	Rsquared	MAE	Resample
122504.66	0.8743783	64995.85	Fold001
124896.26	0.8588380	68202.16	Fold002
161994.86	0.8501994	73482.32	Fold003
164450.47	0.8695818	68036.35	Fold004
232355.88	0.7604469	87745.75	Fold005
107231.59	0.8852886	58874.35	Fold006
113517.04	0.9099313	69629.86	Fold007
169506.55	0.8612415	66351.18	Fold008
124349.27	0.9020569	65360.79	Fold009
137400.39	0.8561117	71727.48	Fold010
93829.79	0.8977372	65428.13	Fold011
179934.58	0.8984982	74901.61	Fold012
154884.94	0.8054865	67446.97	Fold013
132590.30	0.8768806	66809.67	Fold014
251280.14	0.8101429	80274.94	Fold015
109039.12	0.8924788	69736.76	Fold016
92859.32	0.9117293	60955.78	Fold017
124155.40	0.9281102	71874.20	Fold018
100533.15	0.9415016	65049.85	Fold019
179268.54	0.8178322	74135.68	Fold020
116399.02	0.9112885	64687.85	Fold021
128788.02	0.8563619	69126.81	Fold022
104117.79	0.8871478	64615.74	Fold023
146810.39	0.8604731	70608.83	Fold024
158307.23	0.9063928	80484.43	Fold025
129193.00	0.8839831	73458.36	Fold026
125955.56	0.8920068	69655.12	Fold027
118312.84	0.9095948	69633.09	Fold028
198265.11	0.8501643	76105.95	Fold029
108238.14	0.8852280	65449.00	Fold030
109046.91	0.9447907	63812.01	Fold031
126490.16	0.8707950	67027.19	Fold032
87321.48	0.9163583	58679.07	Fold033
106528.26	0.9058978	65071.24	Fold034
157714.06	0.8805110	81052.13	Fold035
235190.04	0.8580654	76437.91	Fold036
127674.15	0.9228530	67403.00	Fold037
233191.85	0.8386995	74514.32	Fold038
97814.49	0.8952665	63047.65	Fold039
102358.01	0.9093912	66173.10	Fold040
112505.58	0.9170178	67778.25	Fold041
162883.67	0.8892207	73312.85	Fold042
98120.88	0.9074506	66002.77	Fold043
110778.78	0.8861272	69547.06	Fold044
98891.90	0.8861429	63566.78	Fold045
144205.06	0.8711989	68590.85	Fold046
157739.82	0.8456615	69601.91	Fold047
125412.27	0.8972081	70654.62	Fold048
110048.23	0.8957732	64754.34	Fold049
190007.52	0.8005864	72659.83	Fold050
104048.33	0.9186885	60732.56	Fold051
125236.13	0.8781373	73615.91	Fold052
101578.30	0.9005791	67809.58	Fold053
123969.59	0.8846029	71353.91	Fold054
117987.88	0.8924390	69108.59	Fold055
159625.63	0.8318301	71377.70	Fold056
123243.97	0.8862276	71867.86	Fold057
108392.98	0.8762266	67567.22	Fold058
117386.06	0.9184440	68697.02	Fold059
109793.58	0.8552744	66327.24	Fold060
144880.62	0.8488145	73357.18	Fold061
136770.47	0.8723256	74378.98	Fold062
88753.19	0.9049809	59530.08	Fold063
95735.44	0.9083808	63364.06	Fold064
120142.18	0.8995945	76963.70	Fold065
105672.02	0.8756720	64940.74	Fold066
111164.83	0.9059645	68931.19	Fold067
153613.46	0.8584548	69407.24	Fold068
103301.61	0.8895758	64793.44	Fold069
175748.89	0.8517784	74892.60	Fold070
101318.78	0.8848949	65510.83	Fold071
207164.14	0.8239167	77540.45	Fold072
137219.06	0.8898394	68582.86	Fold073
163950.27	0.8747377	72493.63	Fold074
176636.86	0.8256292	68399.89	Fold075
114895.26	0.8866002	67287.00	Fold076
169918.35	0.8645435	76566.94	Fold077
96132.56	0.8980111	63998.02	Fold078
113673.02	0.8779258	70226.71	Fold079
105753.26	0.8886117	64894.23	Fold080
111718.72	0.8957923	65028.89	Fold081
192409.31	0.8389409	74206.22	Fold082
167120.32	0.8679844	74065.45	Fold083
133460.93	0.8811429	72745.85	Fold084
145402.01	0.8708571	68966.66	Fold085
107468.71	0.9044587	62019.22	Fold086
144057.51	0.8719382	67479.74	Fold087
145949.43	0.8836133	64587.22	Fold088
127397.63	0.8749113	64681.93	Fold089
122381.26	0.8843794	69220.71	Fold090
170742.17	0.8806996	77403.90	Fold091
171960.36	0.8423798	80525.62	Fold092
132629.81	0.8851896	63103.04	Fold093
113658.81	0.8739056	60451.39	Fold094
126422.42	0.9071228	67514.70	Fold095
119956.37	0.9051804	68877.02	Fold096
115539.32	0.8793529	65974.18	Fold097
121882.34	0.8707135	63871.41	Fold098
106684.54	0.9184723	65713.60	Fold099
226283.68	0.7269545	74794.28	Fold100

ggplot(data = reg.cv$resample) +
  geom_histogram(aes(x = reg.cv$resample$MAE), fill = 'blue') +
  labs(title="Distribution of Cross-validation MAE",
       subtitle = "K = 100\n",
       caption = "Figure RESULT") +
  xlab('Mean Absolute Error') +
  ylab('Count') +
  plotTheme()

reg.cv$resample %>% 
  pivot_longer(-Resample) %>% 
  mutate(name = as.factor(name)) %>% 
  ggplot(., aes(x = name, y = value, color = name)) +
  geom_jitter(width = 0.1) +
  facet_wrap(~name, ncol = 3, scales = "free") +
  theme_bw() +
  theme(legend.position = "none") +
  labs(title = 'Cross-validation Test: Distribution of MAE, RMSE, R Squared\n',
       caption = "Figure RESULT") +
  plotTheme()

4.5 Map of residuals for our test:setResiduals Map w/ Moran’s I/plot of the spatial lag in errors

lIn this section, we focus on verifying whether the spatial distribution will have an impact on our model. We mainly use the following approaches:

residual maps: It shows the residuals in different places in Mecklenberg County, and we can also observed the distribution of residuals.

map.res.test <- ggplot() + 
  geom_sf(data = st_union(tracts20),fill = 'grey') +
  geom_sf(data = model.test %>%filter(SalePrice.AbsError!= 'NA') ,aes(color = q5(SalePrice.AbsError)), show.legend = "point",size = 1) +
  scale_color_manual(values = palette5,
                     labels = qBr(model.test,'SalePrice.AbsError'),
                     name = "Residuals") +
  labs(title = 'Map of residuals for test set\n',
       caption = 'Figure 6.1') +
  mapTheme() + 
  plotTheme()
map.res.test

spatial lag in errors： A spatial lag is a variable that averages the neighboring values of a location. The spatial lag of error plot shows that as home price errors increase, the nearby home price errors slightly increase, indicating that our model errors are rarely spatially autocorrelated.

coords <- st_coordinates(all.sf) 

neighborList <- knn2nb(knearneigh(coords, 5))

spatialWeights <- nb2listw(neighborList, style="W")

all.sf$lagPrice <- lag.listw(spatialWeights, all.sf$price)

coords.test <-  st_coordinates(model.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))

spatialWeights.test <- nb2listw(neighborList.test, style="W")

model.test <- model.test %>%
  mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error,NAOK =TRUE))

ggplot(model.test, aes(x = lagPriceError, y = price)) +
  geom_point(colour = "#6897BB") +
  geom_smooth(method = "lm", se = FALSE, colour = "#05A167") +
  labs(title = "Error on Test Set as a function of the Spatial Lag of Price",
       caption = "Figure RESULT 6.2",
       x = "Spatial lag of errors (Mean error of 5 nearest neighbors)",
       y = "Sale Price") +
  plotTheme()

#morantest

MoranTest:An approach for measuring spatial autocorrelation. In the picture, the frequency of all 999 randomly permutated I are plotted as a histogram with the Observed I indicated by the orange line. That the observed I is higher then all of the 999 randomly generated, so it provides visual confirmation of spatial autocorrelation.

moranTest <- moran.mc(model.test$SalePrice.Error, zero.policy=TRUE, #zero.policy make code runs, but the neighbor is empty
                      spatialWeights.test, nsim = 999,na.action=na.omit)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()

4.6 Plot predicted prices as a function of observed prices

From the comparison of the two graphs, we can find the change before and after adding the neighborhood effect data. Although this change is not very obvious in this graph, it is still an improvement as far as the overall picture is concerned. Perhaps it is because the local neighborhoods are not very homogeneous.

reg.nhood <- lm(price ~ ., data = as.data.frame(model.training) %>% 
                                 dplyr::select(price,age, storyheigh,heatedarea,#prisqft,
                         numfirepla, fullbaths,halfbaths,bedrooms,
                         units,
                         # amenity features
                         police_nn1,police_nn2,police_nn3,school_nn1,
                         school_nn2,school_nn3,church_nn1,church_nn2,
                         church_nn3,park_nn1,park_nn3,firestation_nn2,
                         firestation_nn3,medical_nn2,medical_nn3,
                         #acs
                         pctWhite,pctBachelors,pctPoverty,MedHHInc,
                         Singleadult,Healthins,Familyincome))

model.test.nhood <-
  model.test %>%
  mutate(Regression = "Neighborhood Effects",
         SalePrice.Predict = predict(reg.nhood, model.test),
         SalePrice.Error = SalePrice.Predict- price,
         SalePrice.AbsError = abs(SalePrice.Predict- price),
         SalePrice.APE = (abs(SalePrice.Predict- price)) / price)%>%
  filter(price < 5000000)

model.test <- model.test %>%
  mutate(Regression = "Baseline Regression")

bothRegressions <- 
  rbind(
    dplyr::select(model.test, starts_with("SalePrice"), Regression, vfd, price) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error,NAOK =TRUE)),
    dplyr::select(model.test.nhood, starts_with("SalePrice"), Regression, vfd, price) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, SalePrice.Error,NAOK =TRUE)))   

st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -vfd) %>%
  filter(Variable == "SalePrice.AbsError" | Variable == "SalePrice.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
    spread(Variable, meanValue) %>%
    kable()

Regression	SalePrice.AbsError	SalePrice.APE
Baseline Regression	67696.76	0.2694697
Neighborhood Effects	125118.41	0.3317715

bothRegressions %>%
  dplyr::select(SalePrice.Predict, price, Regression) %>%
    ggplot(aes(price, SalePrice.Predict)) +
  geom_point() +
  stat_smooth(aes(price, price), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(SalePrice.Predict, price), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  plotTheme()

4.7 map of predicted values for CHALLENGE and entire data set

We filled in the data for the challenge with the modified model.

# challenge<- challenge %>%
#   dplyr::select(-price,-prisqft)

testprice <-challenge #%>%filter(prisqft!= 'Inf')
ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = testprice%>%filter(prisqft!= 'NA'), aes(colour = q5(prisqft)), 
          show.legend = "point", size = .75) +
  scale_colour_manual(values = palette5,
                   labels=qBr(testprice,"prisqft"),
                   name="Quintile\nBreaks") +
  labs(title="Price Per Sqaure Foot , Mecklenburg County") +
  mapTheme()

modeling111 <- topredict %>%
  dplyr::select(commonpid,prisqft)

challenge111 <- challenge %>%
  dplyr::select(commonpid,prisqft)

allpricesqft <- rbind(modeling111,challenge111) %>%
  filter(prisqft!= 'NA')


ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = allpricesqft, aes(colour = q5(prisqft)), 
          show.legend = "point", size = .5) +
  scale_colour_manual(values = palette5,
                   labels=qBr(allpricesqft,"prisqft"),
                   name="Quintile\nBreaks") +
  labs(title="Price Per Sqaure Foot (entire)",subtitle="Mecklenburg County") +
  mapTheme()

4.8 Map of MAPE by neighborhood

Indicated in the summary table as well as the choropleth map below, MAPE varies across neighborhoods. However, we can find that the neighborhood effect help us to improve the model, especially for the places in north Mecklenburg County.

left_join(
  st_drop_geometry(model.test) %>%
    group_by(vfd) %>%
    summarize(meanPrice = mean(price, na.rm = T),
              meanPrediction = mean(SalePrice.Predict, na.rm = T),
            meanMAE = mean(SalePrice.AbsError, na.rm = T),
            meanMAPE = mean(SalePrice.APE, na.rm = T)) ,
    mutate(model.test, predict.fe = 
                        predict(lm(price ~ vfd, data = model.test), 
                        model.test)) %>%
    st_drop_geometry %>%
    group_by(vfd) %>%
      summarize(meanPrediction = mean(predict.fe))) %>%
      kable() %>% kable_styling()

vfd	meanPrice	meanPrediction	meanMAE	meanMAPE
Carolina	351840.3	365681.1	42812.93	0.2110667
Charlotte	424147.2	428015.1	65794.23	0.2543526
Charlotte Rural	332920.2	321013.3	56363.09	0.2126259
Cook’s	385791.7	351826.8	82903.41	0.0004740
Cornelius	624048.7	617531.8	79572.01	0.6514301
Cornelius Rural	527500.0	590905.0	118206.97	0.1862289
Davidson	587246.3	604355.7	80885.35	0.1300826
Davidson Rural	765058.8	711736.9	119976.50	0.2309773
Hemby Bridge	448000.0	456287.3	29568.14	0.0628255
Huntersville	442093.2	473562.6	73203.90	0.1843572
Huntersville Rural	665235.3	663242.1	115645.82	-1.5420211
Idlewild	276348.5	280378.2	29753.74	0.1197498
Long Creek	298238.6	285984.8	45245.10	0.2445054
Matthews	421957.7	431926.4	61024.72	0.2168084
Midland	340500.0	414049.1	13049.15	0.0315159
Mint Hill	399338.2	417444.1	65657.35	0.8153683
Mint Hill Rural	381580.5	355958.1	76335.50	0.3542023
Pineville	369035.3	387666.6	47419.69	0.0714546
Robinson	319331.6	263131.3	72953.90	1.0179925
Steele Creek #1	397110.0	487275.2	106506.87	0.1493951
Steele Creek #2	500462.2	550223.7	87543.24	0.1499737
West Mecklenburg	350218.5	332017.8	54543.38	0.3702651

st_drop_geometry(bothRegressions) %>%
  group_by(Regression, vfd) %>%
  summarize(mean.MAPE = mean(SalePrice.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(nhood.sf) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = mean.MAPE)) +
      geom_sf(data = bothRegressions, colour = "black", size = .05) +
      facet_wrap(~Regression) +
      scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by neighborhood") +
      mapTheme()

4.9 Predicted Map (Test Set)

In this map, we can see that the house price of south Mecklenburg County is relatively high, and most of them are more than $237/squarefeet.

ggplot() +
  geom_sf(data = st_union(tracts20), fill = "grey50") +
  geom_sf(data = model.test, aes(colour = q5(prisqft)), 
          show.legend = "point", size = .6) +
  scale_colour_manual(values = palette5,
                   labels=qBr(model.test,"prisqft"),
                   name="Quintile\nBreaks") +
  labs(title="Price predicted per sqft (Test Set)",subtitle="Mecklenburg County") +
  mapTheme()

4.10 Scatterplot - MAPE by neighborhood mean price

nhood_sum <- model.test %>% 
  group_by(vfd) %>%
  summarize(meanPrice = mean(price, na.rm = T),
            meanPrediction = mean(SalePrice.Predict, na.rm = T),
            meanMAE = mean(SalePrice.AbsError, na.rm = T),
            meanMAPE = mean(SalePrice.APE, na.rm = T)) 
MAPE.nhood.plot <-ggplot()+
  geom_point(data = nhood_sum, aes(x = meanPrice, y = meanMAPE)) +
  stat_smooth(data = nhood_sum, aes(x = meanPrice, y = meanMAPE), method = "loess", se = FALSE, size = 1, colour="red") + 
  labs(title = "MAPE by Neighborhood as a Function of\nMean Price by Neighborhood\n",
       caption = 'Figure RESULT 9') +
  xlab('Mean Price by Neighborhood') +
  ylab('Mean MAPE by Neighborhood') +
  plotTheme()



MAPE.nhood.plot

4.11 Split by Census Groups

In this part, we chose two variables:race, income and plot them by census tract.

We consider a local white population of more than 50% to be Majority white, and a MedHHInc of more than $32,322 to be High Income. The first map shows that Mecklenburg County is racially divided, with predominantly white neighborhoods to the north and south, which are also where housing prices are higher.

Next, we calculated the Mean absolute percentage error by merging the ACS data with the model2 data. After adding the Neighborhood Effects model,from the perspective of ethnicity we can see that there is a significant difference in goodness of fit across the races, the MAPE of the majority non-white baseline is 32%, but that is just 23% for the majority white. This may be determined by the characteristics of the local neighborhood, such as the wide variation in neighborhood housing conditions and lack of consistency; or it may be that ethnic factors play a greater role here.

From the perspective of income, we can find something different. After adding the Neighborhood Effects model, the The MAPE of neighborhood effects in high-income is higher than that in high-income places, which may indicate the neighborhood’s adjustability for high-income areas.

tracts20 <- tracts20 %>%
  mutate(percentWhite = Whites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority White", "Majority Non-White"),
         incomeContext = ifelse(MedHHInc > 32322, "High Income", "Low Income"))

grid.arrange(ncol = 2,
  ggplot() + geom_sf(data = na.omit(tracts20), aes(fill = raceContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Race Context") +
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + geom_sf(data = na.omit(tracts20), aes(fill = incomeContext)) +
    scale_fill_manual(values = c("#25CB10", "#FA7800"), name="Income Context") +
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))

st_join(bothRegressions, tracts20) %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(raceContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood racial context")

Test set MAPE by neighborhood racial context
Regression	Majority Non-White	Majority White
Baseline Regression	32%	23%
Neighborhood Effects	37%	30%

st_join(bothRegressions, tracts20) %>% 
  filter(!is.na(incomeContext)) %>%
  group_by(Regression, incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(SalePrice.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  spread(incomeContext, mean.MAPE) %>%
  kable(caption = "Test set MAPE by neighborhood income context")

Test set MAPE by neighborhood income context
Regression	High Income	Low Income
Baseline Regression	27%	10%
Neighborhood Effects	33%	51%

V Discussion

In general, the data of neighborhood ‘fixed effects’ increased the accuracy and generalizability of the model, but model errors remained clustered. We will then discuss these two aspects separately.

We think the from the coefficient in model2, we can find some important feature, such as the number of fullbaths, percentage of white and living area. We think the errors in my model mainly rely on the amenities. From the matrix, I find that the data we found about the facilities are not strongly correlated with house prices. The results fitted with poorly correlated data will lack some accuracy.

5.1 Accuracy

From the accuracy point of view, the R² value of our model is high, about 0.87 in Model2, which can be considered as a good indicator. With the inclusion of neighborhood effects, there is a slight improvement in accuracy, especially in the neighborhoods on the east side of Mecklenburg County. It is worth noting that despite the adjustment for neighborhood, the APE for house prices is still 33%, which seems a bit too high for data scientists and suggests that the model still needs a lot of refinement before it can be used in the real world. But the model also has strengths. The model shows relatively low APE and error values when measuring white concentrations in the south and north, meaning that the accuracy of the model may vary for different parcels of land. And the model has smaller error values in places with high concentrations of whites and higher house prices.

5.2 Generalizability

As for generalizability，neighborhood controls created a more generalizable model, presumably because it successfully ‘borrowed’ a diverse set of housing market ‘experiences’ across different communities. However, despite the use of neighborhoods for adjustment, there are still some things that are not right, such as what we mentioned above, for low-income areas，the mape of them with neighborhood effect is even larger than that of them in basic regression。 The MAPE of low-income places in basic regression is 43%, but it increase to 51% when we add the data of neighborhood. Therefore, I believe that the model has some adaptability when adjusted for the neighborhood effect, but it will be different for different kinds of neighborhoods. The model is more applicable in areas with high income and high housing prices.

VI Conclusion

If all the model examination and interpretation have been conducted correctly as shown above, we would like to recommend this model to Zillow as it incorporates some Mecklenburg County’s local intelligent features that were absent in the previous models and performed comparatively well on either accuracy or generalizability. We can continue add some elements to improve the model. To further reduce the error rate, more amenity features such as local restaurants, commercials and bars (data unavailable so far), financial related agencies, public transportation and spatial features such as zoning should be taken into consideration. What’s more, given the previous research done on ethnicity, I think we should do more research in areas with concentrations of people of color to observe the pattern of housing prices to adjust our models.

Predicting Housing Prices for Mecklenberg County, NC - MUSA508 Midterm

Rui Jiang, Jingyi Cai

2022-09-25