Every time a satellite passes overhead, we have an opportunity to improve our understanding of society in near real time. Satellite instruments capture imagery in different bands of light, enabling a wide range of scientific and operational capabilities. In particular, nighttime satellite data can be transformed into approximations of economic activity as well as demographic patterns.
In collaboration with the National Oceanic and Atmospheric Administration (NOAA), the Commerce Data Service has conducted a number of research and development sprints to identify novel applications of satellite imagery that improve economic monitoring.
Among the many satellite missions is the Suomi National Polar-orbiting Partnership (Suomi NPP), which carries five state-of-the-art instruments that help us understand the Earth in unprecedented detail. Among these instruments is the Visible Infrared Imaging Radiometer Suite (VIIRS), designed to provide moderate-resolution, radiometrically accurate images of the entire Earth twice daily. The VIIRS instrument collects data across a variety of spectral bands. These in turn can be combined through well-tuned remote sensing algorithms to monitor and analyze a robust set of environmental conditions, including aerosols, clouds, land (fires, temperature), and ocean (ocean color and sea temperature). The immediate scientific contributions have been many, and the applications in fields beyond earth science are largely untapped.
The VIIRS Day/Night Band (DNB) collects levels of nighttime light across the globe at 750-meter resolution, which can be used as a proxy for population, among other demographic and economic indicators. The data is available in a number of forms:
For ease of analysis, we'll focus on the monthly composites. To illustrate their value, the 35 largest US cities by population are mapped and colored to reveal the intricate detail and gradations in each local urban environment. Within each city, the relative intensity of light is indicative of differences in human activity: brighter yellow indicates relatively more activity and darker blue relatively less. The development patterns are distinct in each city, contoured to the waterways and roadways, and the central business districts appear as brilliant yellow clusters.
In the map above, light intensity is scaled relative to each city's own light distribution. Placing all cities on the same light intensity scale instead allows for between-city comparisons that reveal relative differences in human activity across the country. Whereas the yellow areas were relatively contained in the previous map, the new map shows yellow areas that are far more widespread in large cities like New York and Los Angeles, indicating that neighborhoods within those cities are among the brightest in the United States. Occasional specks of black indicate areas where the emitted light is "off the charts" and, due to their rarity, was not well sampled when the color intervals were calculated.
Beyond visual examination, the radiance (light) distribution for a given geographic area provides clues into the size and density of local populations. To help illustrate this, the radiance distribution is plotted for five Metropolitan Statistical Areas (MSAs), a common geographic unit of analysis. Each of the following MSAs was selected to represent a different stratum of the population distribution:
In the histograms below, the x-axis represents the radiance level captured in the nighttime imagery, logarithmically scaled for ease of interpretation, where larger values indicate exponential increases in brightness. The y-axis represents the total number of pixels within the geographic footprint that correspond to each level of brightness. The Total Nighttime Light (TNL) value, which roughly corresponds to the sum of each radiance level multiplied by its pixel count, is provided in each label. It is clear that MSAs with greater populations have larger TNL values as well as fuller radiance distributions that span both low and high radiance values.
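As a minimal sketch of the TNL calculation, using a short hypothetical vector of per-pixel radiances rather than actual VIIRS values:
##Hypothetical per-pixel radiance values for a small footprint
rad <- c(0.5, 1.2, 1.2, 3.8, 10.4)
##Total Nighttime Light is simply the sum of radiance over all pixels
tnl <- sum(rad)
tnl ##17.1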
In direct comparison, TNL and population are positively correlated, with a relatively high correlation coefficient of 0.78. This shows promise for TNL as a proxy, and the approximation can be tightened by incorporating the radiance distribution to account for density.
Correlating with population is just the beginning. Because the data is collected and disseminated daily in great spatial detail, there is an opportunity to use such approximations to improve the timeliness and spatial granularity of socioeconomic and demographic measures.
In this tutorial, we will illustrate how to manipulate raster data using the R statistical programming language. As a refresher, raster data is essentially a matrix of pixels organized into a grid (rows and columns) where each grid cell contains a value containing some information such as light, temperature, and concentration. Photographs and satellite imagery are considered to be raster data. By the end of this notebook, you will be able to process raster data into a form that can be joined with basic demographic data (population estimates).
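To make the raster structure concrete, here is a minimal sketch, separate from the tutorial's data, that wraps a toy matrix of values as a RasterLayer using the raster package; the values, extent, and projection are arbitrary.
##Build a toy raster from a plain matrix
library(raster)
m <- matrix(runif(100, min = 0, max = 50), nrow = 10) ##toy 10x10 grid of "radiance" values
r <- raster(m) ##wrap the matrix as a RasterLayer
extent(r) <- c(-77.1, -77.0, 38.8, 38.9) ##assign an arbitrary lon/lat extent
projection(r) <- CRS("+proj=longlat +datum=WGS84") ##assign a projection
r ##prints dimensions, resolution, extent, and CRS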
The processing steps are as follows:
To get started quickly, the code for this tutorial can be found at the following GitHub repo.
Start off by specifying the working directory and loading the following seven libraries:
library(doParallel) ##parallel backend for foreach
library(foreach)    ##looping construct used for parallel extraction
library(raster)     ##raster data handling
library(sp)         ##spatial data classes
library(rgdal)      ##reading shapefiles (readOGR) and projections
library(ggmap)      ##geocoding city names
library(plotly)     ##interactive graphics
Satellite data is complex and is processed in a number of ways to correct for environmental conditions such as stray light. For analyzing population demographics, we use VIIRS DNB monthly composites that omit records affected by stray light. To obtain the data:
Create a folder labeled "imagery" and save the avg_rade9 .tif file into that directory. Below, we then specify the path for future use, open the first and only raster in the list of .tif files, then assign WGS84 as the spatial projection.
##Set directory path
imagery = "[path]/imagery"
##Obtain a list of TIF files, load in the first file in list
tifs = list.files(imagery,pattern = "\\.tif")
rast <- raster(paste(imagery,"/",tifs[1],sep=""))
##Specify WGS84 as the projection of the raster file
wgs84 <- "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"
projection(rast) <- CRS(wgs84)
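As an optional sanity check, not part of the original walkthrough, you can inspect the loaded raster before moving on:
##Confirm the raster loaded with the expected resolution, extent, and projection
rast ##prints class, dimensions, resolution, extent, and CRS
res(rast) ##cell size in decimal degrees
ncell(rast) ##total number of pixels in the composite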
In addition, we use MSA shapefiles and population data from the US Census Bureau. To draw down the shapefile, we write a function that downloads and unzips it from the US Census Bureau website, then assign the same WGS84 projection to the shapefile. By assigning the same spatial projection, we can work across shapefiles and raster files.
##Draw down MSA Shapefile
shape_direct <- function(url, shp) {
library(rgdal)
temp = tempfile()
download.file(url, temp) ##download the URL target to the temp file
unzip(temp,exdir=getwd()) ##unzip that file
return(readOGR(paste(shp,".shp",sep=""),shp))
}
msa <- shape_direct(url="http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_cbsa_20m.zip",
shp= "cb_2014_us_cbsa_20m")
projection(msa) <- CRS(wgs84)
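As another optional check, you can confirm that the polygons and the attribute fields used later (GEOID and NAME) loaded as expected:
##Inspect the MSA shapefile
length(msa) ##number of core-based statistical areas
head(msa@data[,c("GEOID","NAME")]) ##identifiers and names used later
plot(msa) ##quick outline map of the polygons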
We load in a CSV containing population data by MSA, found in the Census Bureau's Population Estimates section. As we only need MSAs, we'll subset the data to MSAs using the LSAD field, re-sort the data frame by 2014 population estimates, and convert the MSA names to character strings as opposed to factors. This will save some trouble down the line.
msa_pop <- read.csv("http://www.census.gov/popest/data/metro/totals/2014/files/CBSA-EST2014-alldata.csv")
msa_pop <- msa_pop[msa_pop$LSAD=="Metropolitan Statistical Area",]
msa_pop <- msa_pop[order(msa_pop$POPESTIMATE2014),]
msa_pop$NAME <- as.character(msa_pop$NAME)
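A quick way to verify the result: since the data frame is sorted in ascending order of population, the last few rows should list the largest MSAs.
##Largest MSAs appear at the bottom of the sorted frame
tail(msa_pop[,c("NAME","POPESTIMATE2014")])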
At this point, we'll begin to map the top 35 most populous cities, scaling the color palette to reveal local detail within each city. The top 35 cities are listed below:
cities <- c("New York, NY", "Los Angeles, CA","Chicago, IL", "Houston, TX",
"Philadelphia, PA", "Phoenix, AZ", "San Antonio, TX", "San Diego, CA",
"Dallas, TX", "San Jose, CA", "Austin, TX", "Jacksonville, FL",
"San Francisco, CA", "Indianapolis, IN", "Columbus, OH", "Fort Worth, TX",
"Charlotte, NC", "Detroit, MI", "El Paso, TX", "Seattle, WA",
"Denver, CO","Washington, DC", "Memphis, TN", "Boston, MA",
"Nashville, TN", "Baltimore, MD", "Oklahoma City, OK", "Portland, OR",
"Las Vegas, NV", "Louisville, KY","Milwaukee, WI","Albuquerque, NM",
"Tucson, AZ","Fresno, CA","Sacramento, CA")
Next, we'll set up the graph layout with no margins (mai), with 7 rows and 5 columns (mfrow), with a navy blue background (bg).
##Set graph layout
par(mai=c(0,0,0,0),mfrow = c(7,5),bg='#001a4d', bty='n')
Now let's get looping to create a montage of city maps. The process is as follows:
##Loop through data
coords <- data.frame() ##place holder
for(i in 1:length(cities)){
##Coords
temp_coord <- geocode(cities[i], source = "google")
coords <- rbind(coords,temp_coord)
e <- extent(temp_coord$lon - 1, temp_coord$lon + 1,
temp_coord$lat - 0.25, temp_coord$lat + 0.25)
rc <- crop(rast, e)
##Rescale brackets
sampled <- as.vector(rc)
clusters <- 15
clust <- kmeans(sampled,clusters)$cluster
combined <- as.data.frame(cbind(sampled,clust))
brk <- sort(aggregate(combined[,1], list(combined[,2]), max)[,2])
#Plots
plot(rc, breaks=brk, col=colorRampPalette(c("#001a4d","#0066FF", "yellow"))(clusters),
legend=F,yaxt='n',xaxt='n',frame = F, asp=1.5)
text(temp_coord$lon ,temp_coord$lat + 0.15,
substr(cities[i],1,regexpr(",",cities[i])-1),
col="white", cex=1.25)
rm(combined)
}
Whereas the map above shows city-specific patterns, the second map focuses on comparing patterns across cities using the same color coding. Rather than running a k-means algorithm for each city in order to generate the radiance intervals, one interval is generated based on a random sample of pixels from across the country.
#Set layout
par(mai=c(0,0,0,0),mfrow = c(7,5),bg='#001a4d', bty='n')
#Run clustering
set.seed(123) #set seed for reproducibility
sampled <- sample(rast, 20000) #sample 20,000 pixels
clusters <- 15 ##15 clusters
clust <- kmeans(sampled,clusters)$cluster
combined <- as.data.frame(cbind(sampled,clust))
brk <- sort(aggregate(combined[,1], list(combined[,2]), max)[,2])
##Loop through each city
for(i in 1:length(cities)){
temp_coord <- coords[i,] ##re-use the coordinates
e <- extent(temp_coord$lon - 1, temp_coord$lon + 1,
temp_coord$lat - 0.25, temp_coord$lat + 0.25)
rc <- crop(rast, e)
#Plots
plot(rc, breaks=brk, col=colorRampPalette(c("#001a4d","#0066FF", "yellow"))(clusters),
legend=F,yaxt='n',xaxt='n',frame = F, asp=1.5)
text(temp_coord$lon ,temp_coord$lat + 0.15,
substr(cities[i],1,regexpr(",",cities[i])-1),
col="white", cex=1.25)
}
To compare the radiance distribution across geographic areas, we'll need a function that extracts the radiance values from the raster (rast) based on the shape of a geographic area in a Census shapefile (shp), doing so for the i-th shape in the file.
masq <- function(shp,rast,i){
#Extract one polygon based on index value i
polygon <- shp[i,] #extract one polygon
extent <- extent(polygon) #extract the polygon extent
#Raster extract
outer <- crop(rast, extent) #extract raster by polygon extent
inner <- mask(outer,polygon) #keeps values from raster extract that are within polygon
#Convert cropped raster into a vector
#Specify coordinates
coords <- expand.grid(seq(extent@xmin,extent@xmax,(extent@xmax-extent@xmin)/(ncol(inner)-1)),
seq(extent@ymin,extent@ymax,(extent@ymax-extent@ymin)/(nrow(inner)-1)))
#Convert raster into vector
data <- as.vector(inner)
#package data in neat dataframe
data <- cbind(as.character(shp@data$CBSAFP[i]),coords, data)
colnames(data) <- c("GEOID","lon","lat","avg_rad") #rename columns
data <- data[!is.na(data$avg_rad),] #keep non-NA values only
return(data)
}
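As a quick illustrative call, not shown in the original code, the function can be applied to a single polygon; here the first polygon in msa is used purely as an example.
##Extract the radiance pixels for the first MSA polygon
msa_rad <- masq(msa,rast,1)
head(msa_rad) ##GEOID, lon, lat, and avg_rad for each pixel
sum(msa_rad$avg_rad) ##Total Nighttime Light (TNL) for that MSA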
For a quick example, we've selected five MSAs that are serially processed into histograms of logarithmically transformed radiance. The data is processed first for each of the five MSAs, then turned into a series of histograms using a combination of ggplot and plotly.
##MSAs by GEOID
msa_list <- c(16180,19140,45820,42540,35620)
##Placeholder
radiances <- data.frame()
##Loop MSA file
for(i in msa_list){
print(i)
#Extract MSA i polygon
shp_temp <- msa[msa@data$GEOID==i,]
#Extract MSA abbreviated name
if(regexpr("-",as.character(shp_temp@data$NAME)[1])[1]==-1){
loc = as.character(substr(as.character(shp_temp@data$NAME)[1],1,regexpr(",",as.character(shp_temp@data$NAME)[1])-1))
} else{
loc = as.character(substr(as.character(shp_temp@data$NAME)[1],1,regexpr("-",as.character(shp_temp@data$NAME)[1])-1))
}
#Extract the radiances, append to radiances placeholder
rad <- masq(shp_temp,rast,1)$avg_rad
temp <- data.frame(loc = as.character(paste(loc,"(TNL = ",round(sum(rad),0),")",sep="")), avg_rad = rad)
radiances <- rbind(radiances,temp)
}
#Use ggplot to create histograms by MSA group (ggplotly() below converts this plot)
ggplot(radiances, aes(x=log(avg_rad))) +
geom_histogram(position="identity", alpha=0.4) +
facet_grid(. ~ loc)
#Remove all axes labels for style
x <- list(
zeroline = FALSE,
showline = FALSE,
showticklabels = FALSE,
showgrid = FALSE
)
y <- list(
zeroline = FALSE,
showline = FALSE,
showticklabels = FALSE,
showgrid = FALSE
)
#Initiate a plotly graph without axes
ggplotly() %>% layout(xaxis=x, yaxis=y)
Extracting radiance data for five MSAs takes a few seconds, but a few hundred MSAs takes considerably longer. We rely on the doParallel and foreach libraries to parallelize the data extraction. Below, two processor cores are used to extract the TNL for each MSA and record it in a consolidated data frame of MSAs and TNL values. This data frame is then joined with the MSA population file to allow direct comparison between population and TNL. The resulting interactive graph illustrates a clear positive correlation between the two quantities.
##Set up comparisons
registerDoParallel(cores=2)
extract <- foreach(i=1:nrow(msa@data), .combine=rbind) %dopar% {
data <- masq(msa,rast,i)
data.frame(GEOID = data$GEOID[1],sum = sum(data$avg_rad))
}
extract$GEOID <- as.numeric(as.character(extract$GEOID))
##Join in data
joined<-merge(extract, msa_pop[,c("CBSA","NAME","POPESTIMATE2014")],by.x="GEOID",by.y="CBSA")
colnames(joined) <- c("GEOID","TNL","MSA","Population")
plot_ly(joined,
x = log(TNL),
y = log(Population),
text = paste("MSA: ", MSA),
mode = "markers",
color = TNL,colors="PuOr") %>%
layout(title="Total Nighttime Light vs. Population", showlegend = F)
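As a follow-up, not part of the original code, the correlation referenced earlier can be computed directly from the joined data frame; the exact figure depends on whether raw or log-transformed values are used.
##Correlation between TNL and population
cor(joined$TNL, joined$Population) ##raw scale
cor(log(joined$TNL), log(joined$Population)) ##log-log scale, as plotted above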
Basic joining is just the beginning. In an upcoming tutorial, we will demonstrate how to mine county-level data and identify which measures can be accurately predicted using VIIRS data.