Mapping Statewide School Ratings with US Census Tracts

In this post, I’d like to share some work related to geo-spatial mapping, statewide school ratings, and US Census Bureau data using census tracts. Specifically, I wanted to investigate whether there was a relation between the median housing price in an area, and the statewide achievement ratings for schools in the corresponding area. There is a strong relation between socio-economic status and student achievement, but less is known about how statewide ratings for schools relate to the demographics of the corresponding area. Fair warning - this is a rather long post, so you’ll have to have a bit of endurance to get through it all (I considered splitting it into two posts but it felt less cohesive). We’ll be using R to produce the following map.

This post uses the tidyverse, Thomas Leeper’s rio package for data import and Sam Firke’s janitor package for quick cleaning up of column names. The leaflet packages, is used for mapping, as well the terrific tidycensus package by Kyle Walker, and the
ggmap package is used for pulling lattitude and longitude data from physical addresses. Finally, this post builds off a wonderful post by Julia Silge.

The data

For this project, I wanted to use only publicly available data. I meandered around the Oregon Department of Education (ODE) website for a while and eventually came to the file I wanted: Report Card Ratings for each school. So let’s import the data directly from the ODE website, clean it up a bit, and take a look.

# load packages for data import/prep

ratings <- import("",
                  setclass = "tbl_df") %>% 

# Clean up the file to what we need
achievement <- ratings %>% 
  filter(district_name == "Portland SD 1J") %>% 
  spread(indicator, rating) %>% 
  clean_names() %>% 
  select(school_name, achievement) %>% 
  mutate(achievement = parse_number(achievement)) %>% 
## # A tibble: 84 x 2
##                      school_name achievement
##                            <chr>       <dbl>
##  1   Abernethy Elementary School           5
##  2   Ainsworth Elementary School           5
##  3     Alameda Elementary School           5
##  4      Arleta Elementary School           3
##  5       Astor Elementary School           3
##  6    Atkinson Elementary School           4
##  7  Rosa Parks Elementary School           2
##  8       Beach Elementary School           3
##  9        Beaumont Middle School           4
## 10 Boise-Eliot Elementary School           2
## # ... with 74 more rows

In the above code, we import the ratings, filter for only schools in Portland School District, spread the data out so we have different columns for each type of rating (achievement, growth, and student subgroup growth), and then select only the two variables we really need - the school name and the achievement rating.

This is a good start, but we don’t yet have the geographical coordinates of the schools. To do that, we’ll get another dataset from ODE that has the physical address of each school. We’ll then transform those addresses to lattitude and longitude using a bit of help from the ggmap package First, the addresses:

addresses <- import("",
                    setclass = "tbl_df") %>% 
  clean_names() %>% 
  filter(city == "PORTLAND") %>% 
  mutate(name = str_to_title(name)) %>% 
  unite(address, c(address, city, state, zip), sep = " ") %>% 
  select(name, address)
## # A tibble: 146 x 2
##                           name                                     address
##  *                       <chr>                                       <chr>
##  1 Abernethy Elementary School   2421 SE ORANGE AVE PORTLAND OR 97214-5398
##  2 Ainsworth Elementary School    2425 SW VISTA AVE PORTLAND OR 97201-2399
##  3   Alameda Elementary School   2732 NE FREMONT ST PORTLAND OR 97212-2542
##  4     Alder Elementary School    17200 SE ALDER ST PORTLAND OR 97233-4260
##  5     Alice Ott Middle School   12500 SE RAMONA ST PORTLAND OR 97236-4699
##  6        Alliance High School   4039 NE ALBERTA CT PORTLAND OR 97211-8199
##  7    Arleta Elementary School     5109 SE 66TH AVE PORTLAND OR 97206-4599
##  8              Arthur Academy 13717 SE DIVISION ST PORTLAND OR 97236-2841
##  9     Astor Elementary School       5601 N YALE ST PORTLAND OR 97203-5299
## 10  Atkinson Elementary School  5800 SE DIVISION ST PORTLAND OR 97206-1499
## # ... with 136 more rows

After import, we limit to only schools in Portland because we’re going to be joining this dataset with our ratings dataset based on the school name, and we want to ensure we don’t join the data for schools with the same name but but different districts. Unfortunately, the addresses file doesn’t tell us the district, or we could include that as one of the keyed variables in our join.

We’ll now join the addresses dataset with our ratings dataset.

d <- inner_join(achievement, addresses, by = c("school_name" = "name"))

Now, we can find the longitude and lattitude of each school using the physical address with the help of the ggmap package and, specifically, the geocode function. This takes a bit of time.

lat_long <- geocode(d$address)
d <- bind_cols(d, lat_long)
d[ ,c(1, 4:5)]
## # A tibble: 81 x 3
##                      school_name       lon      lat
##                            <chr>     <dbl>    <dbl>
##  1   Abernethy Elementary School -122.6515 45.50576
##  2   Ainsworth Elementary School -122.7001 45.50995
##  3     Alameda Elementary School -122.6373 45.54771
##  4      Arleta Elementary School -122.5968 45.48616
##  5       Astor Elementary School -122.7303 45.57785
##  6    Atkinson Elementary School -122.6049 45.50503
##  7  Rosa Parks Elementary School -122.7129 45.58781
##  8       Beach Elementary School -122.6856 45.55763
##  9        Beaumont Middle School -122.6209 45.54885
## 10 Boise-Eliot Elementary School -122.6727 45.54786
## # ... with 71 more rows

Note - an earlier draft of this post had specified source = "dsk" when running the geocode function. I was getting some missing values (~19%) when using google (the default) and I received no missing data with source = "dsk". However, it seems the Data Science Toolkit website is fairly frequently overwhelmed. Last time I tried this it took forever and failed, so I suppose the best option may depend on the availability of the servers at time you run the function.

Mapping the data

To map the data, we’ll use the leaflet and sf packages. We’ll first write a quick function that tells leaflet what color we want the schools in our map to appear depending on their statewide achievement rating. We’ll make a simple diverging palette, where low values are red, middle values are white, and high values are green. These colors must correspond with the possible values for markerColor from leaflet::awesomeIcons.


get_col <- function(df, indicator) {
  sapply(df[[as.character(indicator)]], function(rating) {
    if(rating == 5) {
    } else if(rating == 4) {
    } else if(rating == 3) {
    } else if(rating == 2) {
    } else {
    } })

Next we’ll map the Portland area using a combination of leaflet and the tidycensus packages to not only produce the map, but color census tracts according to the median housing income for the tract. Note that you have to have a US Census API key for these function to work (see here). Below, I use the get_acs function from tidycensus package to pull census tract data for Multnomah County. The variable argument tells the function which variable to get data from (see the load_variables function for other variables).


pdx_acs <- get_acs(geography = "tract", 
                     variables = "B25077_001", 
                     state = "OR",
                     county = "Multnomah County",
                     geometry = TRUE)

The pdx_acs object now has data on the median housing cost for all census tracts in Multnomah County, as well as the geometry of each census tract. We can use this information to produce the map. First, however, we need to create a color palette for the tracts. We’ll use the viridis palette, which is not only beautiful, but also good for people with color blindness, as well as for printing in black and white. We’ll use the colorNumeric function from leaflet, and ask it to fill each tract according to the estimate of the median housing cost, which we just extracted with the tidycensus package.

pal <- colorNumeric(palette = "viridis", 
                    domain = pdx_acs$estimate)

Next we’ll get the actual icons we want to use, coloring them according to the achievement ratings. The last line in this code gets a hexadecimal approximation for these colors, which we’ll use in a legend later.

pins <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = get_col(d, "achievement")

cols <- gplots::col2hex(c("red", "pink", "white", "lightgreen", "darkgreen"))

Finally, we’ll produce the map, using the code below. Although it appears somewhat complicated, it’s actually not too bad, it’s just rather long(ish). Essentially we first separate NAME into three distinct variables for the census tract, county, and state, then we (a) create a blank map, (b) set the style in which we want the map to render, (c) add tract information, (d) add school information, and (e) add legends.

pdx_acs %>%
  separate(NAME, c("tract", "county", "state"), sep = ",") %>% 
  leaflet() %>%
  addProviderTiles(provider = "Stamen.TerrainBackground") %>%
  addPolygons(popup = ~ tract,
              stroke = FALSE,
              smoothFactor = 0,
              fillOpacity = 0.7,
              color = ~ pal(estimate)) %>%
  addAwesomeMarkers(data = d, 
                    icon = pins, 
                    popup= ~school_name) %>% 
            pal = pal, 
            values = ~ estimate,
            title = "Median Home Value",
            labFormat = labelFormat(prefix = "$"),
            opacity = 1) %>% 
            colors = rev(cols),
            labels = 5:1,
            title = "Achievement Ratings",
            opacity = 1)

One of the nice things about this map is that it’s fully interactive. And that’s particularly helpful here because to really see the relation you’ll need to zoom in quite a bit. Because we added the optional popup argument to addPolygons and addAwesomeMarkers, we’re able to click to identify a specific census tract or the name of a specific school. If we wanted the plot to render differently we could change the provider. For example, you could produce a nice terrain map with provider = "Stamen.TerrainBackground". All the different options are available here, where you can preview a map before applying it to yours.

Note that we have a very clear and very strong relation here - essentially all the red, pink, and white schools are in the more purple tracts, correpsonding to lower median housing price areas. This is powerful information but it is important that it’s not confused to mean that the lowest quality schools are located in the more impoverished areas. This information tells us very little about the quality of the school and, indeed, if we look at estimates of growth (shown below without all the code) there is a much less strong correlation. That is, the ratings schools receive on the amount of growth students make is less correlated with the median housing cost in the area than is absolute achievement, although there does still appear to be some correlation.

comments powered by Disqus