Data Science for Nighttime Lights

by JEFF CHEN and STAR YING, COMMERCE DATA SERVICE
in collaboration with CHRIS ELVIDGE, NOAA
As part of the Commerce Data Usability Project, NOAA, in collaboration with the Commerce Data Service, has created a tutorial that merges demographic data with NOAA's public satellite imagery to improve economic monitoring. If you have questions, feel free to reach out to the Commerce Data Service at DataUsability@doc.gov.

Every day, NOAA satellites collect terabytes of raw data

Every satellite pass is an opportunity to improve our understanding of society in near real time. The imagery captures different bandwidths of light, enabling a wide range of scientific and operational capabilities. In particular, nighttime satellite data can be transformed into approximations of economic activity as well as demographic patterns.

In collaboration with the National Oceanic and Atmospheric Administration (NOAA), the Commerce Data Service has conducted a number of research and development sprints to identify novel applications of satellite imagery that improve economic monitoring.

Among the many satellite missions is the Suomi National Polar-orbiting Partnership (Suomi NPP), which carries five state-of-the-art instruments that help us understand the Earth in unprecedented detail. One of these instruments is the Visible Infrared Imaging Radiometer Suite (VIIRS), which is designed to provide moderate-resolution, radiometrically accurate images of the entire Earth twice daily. The VIIRS instrument collects a variety of data corresponding to different bandwidths of light. These in turn can be combined through well-tuned remote sensing algorithms to monitor and analyze a robust set of environmental conditions, including aerosols, clouds, land (fires, temperature), and ocean (ocean color and sea temperature). The immediate scientific contributions have been many, and the applications in fields beyond earth science remain largely untapped.

The VIIRS Day/Night Band (DNB) collects levels of nighttime light across the globe at 750-meter resolution, which can be used to proxy population among other demographic and economic indicators. The data is available in a number of forms, ranging from daily imagery to monthly and annual composites.

For ease of analysis, we'll focus on the monthly composites. To illustrate their value, the 35 largest US cities by population are mapped and colored in order to reveal the intricate detail and gradations in each local urban environment. Within each city, the relative intensity of light is indicative of differences in human activity: brighter yellow indicates relatively more activity and darker blue indicates relatively less. The development patterns are distinct in each city, contoured to the waterways and roadways, and the central business districts appear as brilliant yellow clusters.

In the map above, light intensity is scaled relative to each city's own light distribution. Placing all cities on the same light intensity scale instead allows for between-city comparisons that reveal relative differences in human activity across the country. Whereas the yellow areas were relatively contained in the previous map, the new map shows yellow areas that are far more widespread in large cities like New York and Los Angeles, indicating that neighborhoods within those cities are among the brightest in the United States. Occasional specks of black indicate areas where the emitted light is "off the charts" and was not well sampled in the interval calculation due to its rarity.

Seeing is just the beginning

Beyond visual examination, the radiance (light) distribution for a given geographic area provides clues into the size and density of the local population. To help illustrate this, the radiance distribution is plotted for five Metropolitan Statistical Areas (MSAs), a common geographic unit of analysis. Each of the following MSAs was selected to represent a different stratum within the corresponding population distribution:

  • Carson City, NV (Pop: 54,522 ~ Smallest MSA)
  • Dalton, GA (Pop: 142,952 ~ 25th Percentile)
  • Topeka, KS (Pop: 233,758 ~ 50th Percentile)
  • Scranton-Wilkes-Barre-Hazleton, PA (Pop: 559,679 ~ 75th Percentile)
  • New York-Newark-Jersey City, NY-NJ-PA (Pop: 20,092,883 ~ 100th Percentile)

In the histograms below, the x-axis represents the radiance level captured in the nighttime imagery, logarithmically scaled for ease of interpretation, where larger values indicate exponential increases in brightness. The y-axis represents the total number of pixels within the geographic footprint that correspond to each level of brightness. The Total Nighttime Light (TNL) values, which roughly correspond to the sum of each radiance level multiplied by its pixel count, are provided in the labels. It is clear that MSAs with greater populations have larger TNL values as well as fuller radiance distributions that span both low and high radiance values.
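
As a back-of-the-envelope illustration of the TNL arithmetic (using a made-up vector of per-pixel radiances, not the tutorial data), summing the per-pixel radiances is equivalent to summing each radiance level weighted by its pixel count:

##Hypothetical per-pixel radiances for a tiny footprint
  rad <- c(0.5, 0.5, 1.2, 1.2, 1.2, 3.8, 7.1)

##TNL as the sum of per-pixel radiances
  sum(rad) #15.5

##Equivalently, multiply each radiance level by its pixel count, then sum
  counts <- table(rad)
  sum(as.numeric(names(counts)) * counts) #also 15.5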

In direct comparison, TNL and population are positively correlated, with a relatively high correlation coefficient of 0.78. This shows promise as a proxy, and the relationship can be tightened by incorporating the radiance distribution to approximate density.

Correlating with population is just the beginning. As the data is collected and disseminated daily in great spatial detail, there is an opportunity to improve the timeliness and spatial granularity of socioeconomic and demographic measures through approximation.

Getting Started

In this tutorial, we will illustrate how to manipulate raster data using the R statistical programming language. As a refresher, raster data is essentially a matrix of pixels organized into a grid (rows and columns) where each grid cell contains a value representing some measurement, such as light, temperature, or concentration. Photographs and satellite imagery are examples of raster data. By the end of this notebook, you will be able to process raster data into a form that can be joined with basic demographic data (population estimates).
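
As a toy illustration (the values are made up and not part of the tutorial data), the raster package can wrap an ordinary matrix as a raster layer and recover the underlying grid:

  library(raster)

##A 3 x 3 grid of made-up light readings
  m <- matrix(c(0, 1, 5,
                2, 9, 3,
                0, 4, 1), nrow=3, byrow=TRUE)

##Wrap the matrix as a raster layer, then recover the grid of values
  r <- raster(m)
  as.matrix(r)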

The processing steps are as follows:

  • Obtain and load GeoTIFFs from NOAA (geo-referenced raster file) and MSA shapefiles from the US Census Bureau, then assign a commonly used spatial projection that dictates the shape of a map canvas;
  • Visually inspect rasters by color coding raster values on a map using k-means clustering;
  • Crop raster data by geographic boundaries and extract associated radiance summary statistics;
  • Combine raster summary data with demographic data to uncover underlying patterns.

To get started quickly, the code for this tutorial can be found at the following GitHub repo.

Leading Off

Start off by specifying the working directory and loading the following seven libraries:

  • doParallel: allows for parallel processing using multiple cores
  • foreach: enables for-each statements to be used for parallel processing
  • raster: interface to raster (e.g. TIFF) data
  • sp: interface to spatial data processing
  • rgdal: interface for manipulating spatial data
  • ggmap: contains a Google geocoding wrapper
  • plotly: API interface to plotly's graphing libraries, providing interactive web visualizations
library(doParallel)
library(foreach)
library(raster)
library(sp)
library(rgdal)
library(ggmap)
library(plotly)

Loading Data

Satellite data is complex and is processed in a number of ways to correct for a variety of environmental conditions such as stray light. For analyzing population demographics, we use VIIRS DNB monthly composites that omit records affected by stray light. To obtain the data:

  • Raster files can be downloaded from http://ngdc.noaa.gov/eog/viirs/download_monthly.html;
  • Under each month's folder, select Tile1_75N180W, which contains data for North America;
  • Within this folder, download the file labeled VCMCFG, which contains data that excludes any stray light.
  • There are two files. The file whose name ends in avg_rade9 contains average radiances; this is the one to set aside for use. The other file contains the number of cloud-free observations included in each pixel's calculation.

Create a folder labeled "imagery" and save the avg_rade9 .tif file into that directory. Below, we specify the path for future use, open the first (and only) raster in the list of .tif files, and assign WGS84 as the spatial projection.

##Set directory path
   imagery = "[path]/imagery"

##Obtain a list of TIF files, load in the first file in list
   tifs = list.files(imagery,pattern = "\\.tif")
   rast <- raster(paste(imagery,"/",tifs[1],sep=""))

##Specify WGS84 as the projection of the raster file
   wgs84 <- "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"
   projection(rast) <- CRS(wgs84)
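
As an optional sanity check (output will vary with the composite you downloaded), printing the raster object displays its dimensions, resolution, extent, and the newly assigned CRS:

##Inspect the raster's dimensions, resolution, extent, and CRS
  print(rast)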

In addition, we use MSA shapefiles and population data from the US Census Bureau. To draw down the shapefile, we write a function to download the shapefiles from the US Census Bureau website, then assign the same WGS84 projection to the shapefile. By assigning the same spatial projection, we can work across shapefiles and raster files.

##Draw down MSA Shapefile
shape_direct <- function(url, shp) {
  library(rgdal)
  temp = tempfile()
  download.file(url, temp) ##download the URL target to the temp file
  unzip(temp, exdir=getwd()) ##unzip that file
  return(readOGR(paste(shp,".shp",sep=""), shp)) ##read in the unzipped shapefile
}
msa <- shape_direct(url="http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_cbsa_20m.zip", 
                    shp="cb_2014_us_cbsa_20m")
projection(msa) <- CRS(wgs84)

We load in a CSV containing population data by MSA, as found in the Census Bureau's Population Estimates section. As we only need MSAs, we'll subset the data using the LSAD field, re-sort the data frame by 2014 population estimates, and convert the MSA names to character strings rather than factors. This will save some trouble down the line.

msa_pop <- read.csv("http://www.census.gov/popest/data/metro/totals/2014/files/CBSA-EST2014-alldata.csv")
msa_pop <- msa_pop[msa_pop$LSAD=="Metropolitan Statistical Area",]
msa_pop <- msa_pop[order(msa_pop$POPESTIMATE2014),]
msa_pop$NAME <- as.character(msa_pop$NAME) 
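
Because the sort is ascending, a quick optional check of the bottom rows should surface the most populous MSAs:

##Peek at the most populous MSAs (the tail of the ascending sort)
  tail(msa_pop[,c("NAME","POPESTIMATE2014")], 3)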

Mapping

At this point, we'll map the 35 most populous cities, scaling the color palette to reveal local detail within each city. The cities are listed below:

  cities <- c("New York, NY", "Los Angeles, CA","Chicago, IL", "Houston, TX",
              "Philadelphia, PA", "Phoenix, AZ", "San Antonio, TX", "San Diego, CA",     
              "Dallas, TX", "San Jose, CA", "Austin, TX", "Jacksonville, FL",
              "San Francisco, CA", "Indianapolis, IN", "Columbus, OH", "Fort Worth, TX",
              "Charlotte, NC", "Detroit, MI", "El Paso, TX", "Seattle, WA",
              "Denver, CO","Washington, DC", "Memphis, TN", "Boston, MA",
              "Nashville, TN", "Baltimore, MD", "Oklahoma City, OK", "Portland, OR",
              "Las Vegas, NV", "Louisville, KY","Milwaukee, WI","Albuquerque, NM",
              "Tucson, AZ","Fresno, CA","Sacramento, CA")

Next, we'll set up the graph layout with no margins (mai), 7 rows and 5 columns (mfrow), and a navy blue background (bg).

##Set graph layout
  par(mai=c(0,0,0,0),mfrow = c(7,5),bg='#001a4d', bty='n')

Now let's get looping to create a montage of city maps. The process is as follows:

  • First set a placeholder dataframe for coordinates
  • For each city, geocode the city name for the centroid using the geocode() function from ggmap, then append the coordinates to the placeholder. This coordinates file will be used again later.
    • Use the extent function to specify the spatial bounding box (the frame around the city) as +/-1 degree longitude and +/-0.25 degree latitude.
    • Crop the raster file (rast) by the bounding box.
    • Convert the cropped raster into a vector in order to get the radiances, then run a k-means clustering algorithm to find natural intervals within the radiance distribution. For each cluster, extract the maximum radiance.
    • Map the city with a navy to yellow color palette with intervals from the k-means clustering.
##Loop through data
  coords <- data.frame() ##place holder
  
  for(i in 1:length(cities)){
    
    ##Coords
    temp_coord <- geocode(cities[i], source = "google")
    coords <- rbind(coords,temp_coord)
    
      e <- extent(temp_coord$lon - 1, temp_coord$lon + 1,
                  temp_coord$lat - 0.25, temp_coord$lat + 0.25)
      rc <- crop(rast, e)    
      
    ##Rescale brackets
      sampled <- as.vector(rc)
      clusters <- 15
      clust <- kmeans(sampled,clusters)$cluster
      combined <- as.data.frame(cbind(sampled,clust))
      brk <- sort(aggregate(combined[,1], list(combined[,2]), max)[,2])
      
    #Plots
      plot(rc, breaks=brk, col=colorRampPalette(c("#001a4d","#0066FF", "yellow"))(clusters), 
           legend=F,yaxt='n',xaxt='n',frame = F, asp=1.5)
      text(temp_coord$lon ,temp_coord$lat + 0.15,
           substr(cities[i],1,regexpr(",",cities[i])-1), 
           col="white", cex=1.25)
       
    rm(combined)
  }

Whereas the map above shows city-specific patterns, this second map focuses on comparing patterns across cities using the same color coding. Rather than running a k-means algorithm for each city to generate the radiance intervals, one set of intervals is generated from a random sample of pixels drawn from across the country.

#Set layout
  par(mai=c(0,0,0,0),mfrow = c(7,5),bg='#001a4d', bty='n')

#Run clustering
  set.seed(123) #set seed for reproducibility
  sampled <- sample(rast, 20000) #sample 20,000 pixels
  clusters <- 15 ##15 clusters
  clust <- kmeans(sampled,clusters)$cluster
  combined <- as.data.frame(cbind(sampled,clust))
  brk <- sort(aggregate(combined[,1], list(combined[,2]), max)[,2])

##Loop through each city
  for(i in 1:length(cities)){
    
    temp_coord <- coords[i,] ##re-use the coordinates 
      e <- extent(temp_coord$lon - 1, temp_coord$lon + 1,
                  temp_coord$lat - 0.25, temp_coord$lat + 0.25)
      rc <- crop(rast, e)    
    
    #Plots
      plot(rc, breaks=brk, col=colorRampPalette(c("#001a4d","#0066FF", "yellow"))(clusters), 
           legend=F,yaxt='n',xaxt='n',frame = F, asp=1.5)
      text(temp_coord$lon ,temp_coord$lat + 0.15,
           substr(cities[i],1,regexpr(",",cities[i])-1), 
           col="white", cex=1.25)
       
  }

Comparing measures through visualizations

Histogram by select MSA

To compare the radiance distribution across geographic areas, we'll need to develop a function that extracts radiance values from the raster (rast) based on the shape of a geographic area from a Census shapefile (shp), where the index i specifies which polygon in the shapefile to extract.

  masq <- function(shp,rast,i){
  
    #Extract one polygon based on index value i
    polygon <- shp[i,] #extract one polygon
    extent <- extent(polygon) #extract the polygon extent 
    
    #Raster extract
    outer <- crop(rast, extent) #extract raster by polygon extent
    inner <- mask(outer,polygon) #keeps values from raster extract that are within polygon
    
    #Convert cropped raster into a vector
    #Specify coordinates
    coords <- expand.grid(seq(extent@xmin,extent@xmax,(extent@xmax-extent@xmin)/(ncol(inner)-1)),
                          seq(extent@ymin,extent@ymax,(extent@ymax-extent@ymin)/(nrow(inner)-1)))
    #Convert raster into vector
    data <- as.vector(inner)
    
    #package data in neat dataframe
    data <- cbind(as.character(shp@data$CBSAFP[i]),coords, data) 
    colnames(data)<-c("GEOID","lon","lat","avg_rad") #rename columns for clarity
    data <- data[!is.na(data$avg_rad),] #keep non-NA values only
    
    return(data)
  }
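
As a quick usage check (assuming the msa shapefile and rast raster from the earlier steps are loaded), we can extract the pixels for a single polygon and inspect the first few rows:

##Extract the radiance grid for the first polygon in the shapefile
  one_msa <- masq(msa, rast, 1)
  head(one_msa) #GEOID, lon, lat, and avg_rad for pixels inside the polygon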

For a quick example, we've selected five MSAs that are serially processed into histograms of logarithmically transformed radiance. The data is processed first for each of the five MSAs, then turned into a series of histograms using a combination of ggplot and plotly.

##MSAs by GEOID
  msa_list <- c(16180,19140,45820,42540,35620)

##Placeholder
  radiances <- data.frame() 
  
##Loop MSA file
  for(i in msa_list){
  
    print(i)
    
    #Extract MSA i polygon
      shp_temp <- msa[msa@data$GEOID==i,]
      
    #Extract MSA abbreviated name (truncate at the first hyphen, or at the comma if there is none)
      name <- as.character(shp_temp@data$NAME)[1]
      if(regexpr("-",name)[1]==-1){
        loc = substr(name,1,regexpr(",",name)-1)
      } else{
        loc = substr(name,1,regexpr("-",name)-1)
      }
    
    #Extract the radiances, append to radiances placeholder
      rad <- masq(shp_temp,rast,1)$avg_rad 
      temp <- data.frame(loc = as.character(paste(loc,"(TNL = ",round(sum(rad),0),")",sep="")), avg_rad = rad) 
      radiances <- rbind(radiances,temp)
  }
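
Before plotting, an optional peek at the assembled data frame (values will vary with your composite) confirms that each row pairs an MSA label, with its TNL baked into the label, against a single pixel's radiance:

##Inspect the assembled radiance data
  head(radiances)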

#Use ggplot to create histograms by MSA group; ggplotly() below will convert the most recent plot
  ggplot(radiances, aes(x=log(avg_rad))) +
    geom_histogram(position="identity", alpha=0.4) +
    facet_grid(. ~ loc)

#Remove all axes labels for style
    x <- list(
        zeroline = FALSE,
        showline = FALSE,
        showticklabels = FALSE,
        showgrid = FALSE
    )
    y <- list(
        zeroline = FALSE,
        showline = FALSE,
        showticklabels = FALSE,
        showgrid = FALSE
    ) 
    
#Initiate a plotly graph without axes
  ggplotly()  %>% layout(xaxis=x, yaxis=y)

Scatterplot of TNL vs. Population

Extracting radiance data for five MSAs takes a few seconds, but a few hundred MSAs takes a while longer. We rely on the doParallel and foreach libraries to parallelize data extraction. Below, two processor cores are used to extract the TNL for each MSA and record the results in a consolidated data frame of MSAs and TNL values. This data frame is then joined with the MSA population file to allow direct comparison between population and TNL. The resulting interactive graph illustrates a clear positive correlation between the two quantities.

##Set up comparisons
    registerDoParallel(cores=2)
    extract <- foreach(i=1:nrow(msa@data), .combine=rbind) %dopar% {
        data <- masq(msa,rast,i)
        data.frame(GEOID = data$GEOID[1],sum = sum(data$avg_rad))
    }
   extract$GEOID <- as.numeric(as.character(extract$GEOID))
    
  ##Join in data
    joined<-merge(extract, msa_pop[,c("CBSA","NAME","POPESTIMATE2014")],by.x="GEOID",by.y="CBSA")
   
    colnames(joined) <- c("GEOID","TNL","MSA","Population")
    
    plot_ly(joined, 
            x = log(TNL), 
            y = log(Population), 
            text = paste("MSA: ", MSA),
            mode = "markers", 
            color = TNL,colors="PuOr")  %>% 
            layout(title="Total Nighttime Light vs. Population", showlegend = F)
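
To put a number on the relationship in the scatterplot (in the spirit of the 0.78 correlation cited earlier; the exact value depends on the month and composite used), the correlation can be computed directly from the joined data frame:

##Correlation between TNL and population, raw and on the plotted log scales
  cor(joined$TNL, joined$Population)
  cor(log(joined$TNL), log(joined$Population))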

Basic joining is just the beginning. In an upcoming tutorial, we will demonstrate how to mine county-level data and identify which county-level measures can be accurately predicted using VIIRS data.