1 Overview

First and foremost, install R (following instructions here) so that your computer knows how to interpret R code. Then, I highly recommend you install RStudio (following instructions here), a powerful program for organizing your interaction with the R language (a subtle distinction; you wouldn’t really be ill-served by considering RStudio and R to be the same thing). For the jargon-hungry, RStudio is an Integrated Development Environment (IDE) for the R language.

Some general guidelines for getting the most out of this workshop:

  1. Active participation is key! You should be running every snippet of code I go over on your own machine. As with human languages, exposure is one of the building blocks of fluency. You would be insane to think that you can osmote the ability to understand and use R code just by watching me do things on a screen. This facility takes practice, practice, hair pulling, and more practice. For those with little exposure to more advanced programming languages, today’s workshop will be very heavy with material. You shouldn’t expect to absorb everything, but actively participating will help a lot.

  2. Questions are openly encouraged! As I mentioned, this will, despite my efforts to make things as straightforward as possible, inevitably be a somewhat dense workshop. So nobody should hesitate to stop me if they’re lost. I am no fan of hearing myself speak, so if nobody in the audience is following me, that’s a waste of everyone’s time.




2 R Basics (8 AM - 9 AM)

Before we get to the fun stuff (statistical analysis), it’s important that we gain some facility with doing basic things in R first – adding numbers, creating objects, and tools for exploring/understanding new objects as we come across them, including understanding and overcoming common errors that can creep into our code.

First things first, let’s follow the time-honored tradition of making R say “Hello World”.

Make sure you’re in the console (the cursor in front of the right angle bracket (>) should be blinking) and type (or copy-paste) the following:

print("Hello World")
## [1] "Hello World"

Easy-peasy. The console is a great place to do what I call sandboxing – running small commands to test whether they run as expected and produce the right output in the right format, etc. But it would be very cumbersome to use the console to do everything. There’s a distinct lack of permanency to anything we do in the console.

More common is to run code from an R script. You can create an R script in RStudio by clicking the plus-page icon (New R script) and hitting “R Script” or by using the keyboard shortcut Ctrl+Shift+N. Try printing “Hello World” again by adding print("Hello World") to your new script and pressing Ctrl+Enter on this line. Ctrl+Enter is the shortcut for running the selected line, or for running a highlighted section of code.

Once we save our R script, we can easily share it with coauthors or the general public (or even ourselves, on a different computer), who can simply run the code again on their machine to reproduce your analysis.


2.1 Assignment

Consider the following:

x = 3

When we execute the above line of code, we’re creating the variable x and associating with it the value 3. This is like creating a local in Stata.

Variables (a.k.a. objects) are the most crucial building block for everything we want to do in R. When we create a variable, we create a shorthand for some value that we’ll refer to lower in our code.

Assignment has to pass from right to left – the object on the right of = is assigned to the name on the left of =.

The opposite is an error, since we haven’t told R what y is yet:

3 = y
## Error in 3 = y: invalid (do_set) left-hand side to assignment

This error tells us that 3 is not a valid name to which to assign; as a rule of thumb for beginners, all variable names must start with a letter (though they can contain numbers, periods, and underscores thereafter).
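To make the naming rule concrete, here’s a quick sandbox check (the particular names are arbitrary, chosen just for illustration):

```r
#all of these are valid names
x2 <- 3
my.result <- 3
my_result <- 3
#but a name can't start with a number --
#  uncommenting the next line throws a syntax error
#2x <- 3
```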

Side note: the following works just as well: x <- 3. This method of assignment is probably more common, but we’ll stick with using = today since I think it’s more intuitive. The differences between = and <- are all-but irrelevant for beginners, but for future reference, this Q&A is worth a read:


2.2 Vectors

The easiest way to think of a vector is as a column (or a row) in Excel. A column in Excel can contain many numbers, but instead of referring to each of them individually, we refer to the column as a whole. Typically, these numbers all have something in common (for example, a column Name in Excel should be filled with people’s names).

The way to declare vectors in R is using c, which stands for _c_oncatenate:

x <- c(1, 2, 3, 4)
x
## [1] 1 2 3 4

Now x contains four numbers. Note that since this type of sequence is so common, R has built in the colon (:) as an operator to create a variable like this more quickly/concisely:

x <- 1:4
x
## [1] 1 2 3 4
#Also works in reverse
x <- 4:1
x
## [1] 4 3 2 1

2.3 Types

Thus far, everything we’ve assigned to a variable has been a number. But this is far from the only thing we can assign to a variable in R. Consider:

x <- c("Philadelphia", "Pennsylvania")
y <- c(TRUE, FALSE)
z <- c(1L, 2L)

x contains strings of letters demarcated by "" (or '') to distinguish them from variable names. We refer to x as a character vector.

y is a logical vector; it is often very convenient to keep binary variables (gender, treatment group, etc.) stored as logical variables, for reasons we’ll see below.

z is an integer vector. This is distinguished from c(1, 2), which gets stored as a numeric vector. The difference between the two probably won’t affect you for quite some time, but it’s often important for saving computer memory. 1, as a numeric, takes up more space in your computer’s memory than does 1L. L signifies integer; don’t worry about why (if you insist, it stands for _L_ong integer, which is too involved for this session).
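To see the memory difference directly, we can compare with the built-in object.size function (exact byte counts vary by system, but numeric storage uses roughly 8 bytes per element versus 4 for integer):

```r
x_int <- 1:100000             #integer vector (~4 bytes per element)
x_num <- as.numeric(x_int)    #numeric vector (~8 bytes per element)
object.size(x_int)
object.size(x_num)  #roughly twice as large
```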

2.3.1 Lists

Consider this:

x <- c(1, 2L, TRUE, "America")

All four components of x are of a different type (namely: numeric, integer, logical, and character). But recall from above that the elements of a vector should have something in common – as we declared x, that couldn’t be further from the truth. So R forces all of the components to a common type – namely, character:

x
## [1] "1"       "2"       "TRUE"    "America"

Note the quotation marks – all of the components are now stored as character strings in x.

The specific coercion hierarchy is logical < integer < numeric < character
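We can verify this hierarchy directly with class – in each case, the result takes the “highest” type involved:

```r
class(c(TRUE, 2L))   #logical + integer -> integer
## [1] "integer"
class(c(2L, 3.5))    #integer + numeric -> numeric
## [1] "numeric"
class(c(3.5, "a"))   #numeric + character -> character
## [1] "character"
```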

However, mixing types is a fundamental feature of almost all data analysis, so it stands to reason that there is a straightforward way to do so. In R, this is done with the list type. We can replace the code above using list:

x <- list(1, 2L, TRUE, "America")
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] "America"

Note how different the output looks, as compared to using c!! The quotation marks are gone except for the last component. You can ignore the mess of [[ and [ for now, but as an intimation, consider some more complicated lists:

x <- list(c(1, 2), c("a", "b"), c(TRUE, FALSE), c(5L, 6L))
x
## [[1]]
## [1] 1 2
## 
## [[2]]
## [1] "a" "b"
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 5 6
y <- list(list(1, 2, 3), list(4:5), 6)
y
## [[1]]
## [[1]][[1]]
## [1] 1
## 
## [[1]][[2]]
## [1] 2
## 
## [[1]][[3]]
## [1] 3
## 
## 
## [[2]]
## [[2]][[1]]
## [1] 4 5
## 
## 
## [[3]]
## [1] 6

x is a list which has 4 components, each of which is a vector with 2 components. This gives the first hint at how R treats a dataset with many variables of different types – at core, R stores a data set in a list!

y is a nested list – it’s a list that has lists for some of its components. This is very useful for more advanced operations, but probably won’t come up for quite some time, so don’t worry if you haven’t wrapped your head around this yet.


2.4 Extraction/Indexing

Consider a simple numeric vector:

x <- 5:14

How do we get at the various numbers stored in certain positions in x? For example, how could we get the first number in x, 5?

This process is called extraction, and, for vectors, is done with square brackets ([]), e.g.:

x[1] #first element of x
## [1] 5
x[5] #fifth element of x
## [1] 9

This also works on lists, but lists also have some other ways to get at their contents:

y <- list(1:3, 4:6, 7:9)
y[1]
## [[1]]
## [1] 1 2 3

Note that the output still has [[ in it. This means the result of y[1] is still a list.

More typically, we want to remove the list structure and just get 1:3 instead of list(1:3). To do this, we use [[:

y[[1]]
## [1] 1 2 3

2.4.1 Named vectors and lists

It is also possible to name the elements of vectors and lists. This is convenient for making it easy to get certain elements without having to remember whether you stored it first, third, or whatever:

x <- c("Iowa" = "Cruz", "Ohio" = "Kasich", "Pennsylvania" = "Trump")

y <- list(names = c("Kasich", "Cruz", "Trump"),
          ages = c(63, 45, 69),
          hates_clinton = c(TRUE, TRUE, TRUE))

For named vectors, we keep using [] to extract elements, but we can use the name instead of the index:

x["Iowa"]
##   Iowa 
## "Cruz"

For named lists, we can use [], but, as with numbered indices, we’ll get a list in return. If we want the actual object contained at that point in the list, we can still use [[, or we can also use $, which is another extraction operator:

#now that y has names, we no longer see [[ -- instead, we see
#  the name of each element of the list
y["ages"]
## $ages
## [1] 63 45 69
y[["ages"]]
## [1] 63 45 69
y$names
## [1] "Kasich" "Cruz"   "Trump"

2.4.2 Multiple extraction

As often as not, when we need to extract, we need more than one element of the vector.

To do this, we pass a vector to []. It’s most intuitive when we need elements in sequence, e.g.:

# R conveniently keeps the letters of the
#   alphabet stored automatically in two
#   vectors, letters and LETTERS;
#   the former is lower-, the latter uppercase
x <- LETTERS
x
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
#Get the first 5 letters
x[1:5]
## [1] "A" "B" "C" "D" "E"

What if we need the first and 5th elements, but not the 2nd/3rd/4th elements?

Remember that 1:5 is the same as c(1, 2, 3, 4, 5). So we could have just used the latter:

x[c(1, 2, 3, 4, 5)]
## [1] "A" "B" "C" "D" "E"

By extension, if we just want the first and fifth elements:

x[c(1, 5)]
## [1] "A" "E"

NOTE: there’s no way to extract multiple elements using the other extraction operators, [[ and $. Trying to do so either throws an error or (for [[) gets silently interpreted as a recursive index:

x <- list(a = 1:3, b = 4:6)
#multiple extraction by names works fine:
x[c("a", "b")]
## $a
## [1] 1 2 3
## 
## $b
## [1] 4 5 6
#but not with [[ or $
x[[1:2]] #technically, R interprets this as a recursive index...
## [1] 2
x[[c("a", "b")]] #again, a recursive index...
## Error in x[[c("a", "b")]]: subscript out of bounds
x$c("a", "b")
## Error in eval(expr, envir, enclos): attempt to apply non-function

2.4.3 Logical indexing

Another exceedingly common approach is to subset an object based on a condition satisfied by some of the elements.

Here, we must introduce the logical operators (also called binary operators, which is important for understanding associated error messages) in R.

These are <, >, <=, >=, ==, and !=. These are all the same in Stata (note: ~=, which works in Stata as an alternate to !=, doesn’t work in R). First, let’s explore them:

ages <- c(12, 14, 16, 18, 20, 22, 24, 30,
          38, 40, 55, 60, 63, 66, 68, 70)

#Who is voting age?
ages >= 18
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE
#Who can't drink?
ages < 21
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE
#Who is exactly 40?
ages == 40
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE

Note that R automatically figures out whether each element of ages satisfies the condition.

The way to extend this to extraction is simple:

#Only take people under age 18
ages[ages < 18]
## [1] 12 14 16
#Only take AARPers
ages[ages >= 50]
## [1] 55 60 63 66 68 70
#Exclude any 70 year old:
ages[ages != 70]
##  [1] 12 14 16 18 20 22 24 30 38 40 55 60 63 66 68

The other common logical operations are intersection, union and negation, aka AND, OR and NOT, which in R, as in Stata, are &, | and ! (you’ll also often see && and || in other people’s code; don’t worry about the difference yet, but for those who know Matlab – it’s the same).

Quickly, let’s get working-agers:

ages[ages >= 18 & ages <= 65]
##  [1] 18 20 22 24 30 38 40 55 60 63

We can combine these to our heart’s fancy and create any combination of logical requirements, e.g.:

#get people who are over 18, but not exactly 34 or 70
ages[ages >= 18 & !(ages == 34 | ages == 70)]
##  [1] 18 20 22 24 30 38 40 55 60 63 66 68

2.4.4 Negative indexing

Sometimes, we’d rather exclude a small number of elements, rather than include elements like we have been doing so far.

To do this, we precede the vector we want to extract with - (if it’s numeric/integer/character) or ! (if it’s logical).

Suppose we wanted to find out how many years have passed since our earliest observation, and to exclude the earliest observation. We might do something like:

years <- c(1970, 1973, 1978, 1980, 1990, 1995)
#the earliest year is 1970, which comes first; to exclude:
years[-1]
## [1] 1973 1978 1980 1990 1995
#to exclude and subtract 1970:
years[-1] - years[1]
## [1]  3  8 10 20 25
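The ! form works the same way when the index is logical – continuing the years example as a quick sketch:

```r
years <- c(1970, 1973, 1978, 1980, 1990, 1995)
is_pre_1980 <- years < 1980
#negate the logical index with ! to exclude those years
years[!is_pre_1980]
## [1] 1980 1990 1995
```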

2.5 Base functions

R comes equipped with a vast (vast) library of built-in functions intended to make your life as a data analyst as easy as possible. As of this writing I count 1,203 functions included in base R, and 2,374 total functions that come ready to use as soon as you open up RStudio.

These are the real workhorses of R, and many of them should be familiar to Excel and Stata users. Like in those programs, functions are distinguished by the use of parentheses () (in fact we’ve already used many functions to this point, most obviously the concatenation function c). We won’t be able to get anywhere near understanding all of the multitudinous functions R makes available to us today, but we can make a start.

2.5.1 Basic Arithmetic/Stats

Finding the sum of a column in Excel is sort of a pain. You have to write =SUM( and then highlight all the proper cells. It’s kind of a pain in Stata as well. You either have to run summarize var and then di r(sum), or create a new variable with egen sum_var = sum(var). I find this and many other arithmetic operations much simpler to do in R.

x <- 1:10

#find the sum
sum(x)
## [1] 55
#find the mean
mean(x)
## [1] 5.5
#find the variance and standard deviation
var(x)
## [1] 9.166667
sd(x)
## [1] 3.02765
#scalar arithmetic
3 * x
##  [1]  3  6  9 12 15 18 21 24 27 30
x - 2
##  [1] -1  0  1  2  3  4  5  6  7  8
x^2
##  [1]   1   4   9  16  25  36  49  64  81 100
x / 4
##  [1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
#exponentials
exp(1:3)
## [1]  2.718282  7.389056 20.085537
log(1:10)
##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851
#other arithmetic
abs(-5:5)
##  [1] 5 4 3 2 1 0 1 2 3 4 5
#rounding up/down
x <- c(.1, 1.1, 1.9, 2, 2.8)
floor(x)
## [1] 0 1 1 2 2
ceiling(x)
## [1] 1 2 2 2 3

2.5.2 Function arguments

Most functions accept more than one argument. Consider rounding a number:

x <- c(1.12, 10.2, 1.56, 21.9, 2)
round(x)
## [1]  1 10  2 22  2
#round to 1 digit past the decimal
round(x, 1)
## [1]  1.1 10.2  1.6 21.9  2.0
#round to -1 digit past the decimal
#  (i.e., 1 digit before the decimal)
round(x, -1)
## [1]  0 10  0 20  0

There’s no limit to the number of arguments a function can accept to accomplish a wide variety of tasks. With round, it’s easy to handle this because there are only two arguments – first, a vector of numbers, and second, a (single) number of digits to which to round the vector.

When the number of arguments balloons, however, it gets more and more difficult to keep track of which argument goes where.

Consider the quantile function, which takes 5 arguments – x, a vector of numbers; probs, a vector of probabilities (quantiles); na.rm, which tells it whether to ignore missing data (more on that later); names, which tells R whether the result should be named (see example); and type, which tells R the algorithm to use for measuring the quantiles (there are 9 readily available).

It would be quite a pain to expect users to memorize the order in which quantile expects arguments, especially if we had to do so for every function we ever wanted to use (remember, there are thousands). To facilitate this, R allows us to name our arguments.

x <- c(1, 1, 1, 2, 3, 4, 5, 6, 
       7, 8, 8, 8, 9, 10, 10, 10)

#by default, quantile calculates all 5 quartiles:
#  min, 25%-ile, median, 75%-ile, max
quantile(x)
##    0%   25%   50%   75%  100% 
##  1.00  2.75  6.50  8.25 10.00
median(x)
## [1] 6.5
#if we just want the median
quantile(x, .5)
## 50% 
## 6.5
#or, we can name it to be explicit
quantile(x, probs = .5)
## 50% 
## 6.5
#if we want to exclude the names
quantile(x, probs = .5, names = FALSE)
## [1] 6.5

In addition to making it easier for us as analysts to use the functions, naming arguments also makes it easier for others to read our code. While it’s fair to expect most users to know the basic functions and their arguments pretty well, the more advanced the function is that you’re using, the more understandable your code gets when aided by named arguments.

2.5.3 Infix operators and match

Infix operators are written between their arguments, left to right like basic arithmetic, but they do things not covered by the five main arithmetic operators (+, -, *, /, ^). These operators in R are always surrounded by two percent signs (%). The two math-y ones have to do with modular arithmetic:

x <- 1:20
# integer division
x %/% 3
##  [1] 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6
# modular division (remainder)
x %% 3
##  [1] 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2

By far the most commonly used infix, though, has to be %in%. It is used to determine whether each object on the left can be found in the set on the right. Best seen with an example:

candidates <- 
  c("Bernie Sanders", "Hillary Clinton", "Lincoln Chafee", 
    "Jim Webb", "Laurence Lessig", "Donald Trump", "John Kasich",
    "Ted Cruz", "Marco Rubio", "Jeb Bush", "Ben Carson",
    "Carly Fiorina", "Lindsey Graham", "Chris Christie",
    "Rick Santorum", "Rand Paul", "Jim Gilmore", "Mike Huckabee",
    "George Pataki", "Bobby Jindal", "Scott Walker", "Rick Perry")

nominees <- c("Hillary Clinton", "Donald Trump")

candidates %in% nominees
##  [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

A quick note that what %in% is actually doing is using the match function, which is also useful in its own right. match takes three main arguments. The first and second are like the left- and right-hand sides of %in%, respectively. match tries to find each element of the first in the second and, if it’s found, gives the position. The third argument, nomatch, tells it what to do when an element is not found. Example:

color <- c("red", "green", "orange", "blue",
           "purple", "teal", "mahogany", "yellow",
           "orange", "white", "lavender")

roygbiv <- c("red", "orange", "yellow", "green",
             "blue", "indigo", "violet")

#%in% tells you if each element of
#  color is in roygbiv at all
color %in% roygbiv
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
#match tells you WHERE in roygbiv 
#  each element of color was found;
#  we set nomatch to say what
#  we expect when the color isn't found
#  (it must be an integer)
match(color, roygbiv, nomatch = -1)
##  [1]  1  4  2  5 -1 -1 -1  3  2 -1 -1

2.5.4 More useful functions

2.5.4.1 length

length is used to find how many elements something has (in data, how many observations there are):

x <- 1:50
length(x)
## [1] 50

2.5.4.2 rep

rep is used to _rep_eat an object a certain number of times. The each argument repeats each element a set number of times. length.out makes sure the output has a certain number of elements.

rep(1:3, 4)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each = 4)
##  [1] 1 1 1 1 2 2 2 2 3 3 3 3
rep(1:3, each = 4, length.out = 10)
##  [1] 1 1 1 1 2 2 2 2 3 3

2.5.4.3 unique / duplicated

unique eliminates all duplicates of a vector; duplicated returns a logical vector telling you which elements appeared earlier in the vector.

x <- c(1, 1, 1, 2, 2, 3)
unique(x)
## [1] 1 2 3
duplicated(x)
## [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE
### unique is the same as (but faster than) x[!duplicated(x)]
x[!duplicated(x)]
## [1] 1 2 3
#unique is often used in conjunction with length
#  to find out the number of IDs in a vector, e.g.
length(unique(x))
## [1] 3

2.5.4.4 seq

seq produces a _seq_uence. It has four main arguments: from (where to start), to (where to end), by (increment), and length.out (what should be the length of the result?). We must specify no more than 3 of these. Some examples:

#same as 1:10
seq(10)
##  [1]  1  2  3  4  5  6  7  8  9 10
#same as 2:11
seq(2, 11)
##  [1]  2  3  4  5  6  7  8  9 10 11
#make a grid from 0 to 10 with 30 points;
#  R automatically figures out the increment
seq(0, 10, length.out = 30)
##  [1]  0.0000000  0.3448276  0.6896552  1.0344828  1.3793103  1.7241379
##  [7]  2.0689655  2.4137931  2.7586207  3.1034483  3.4482759  3.7931034
## [13]  4.1379310  4.4827586  4.8275862  5.1724138  5.5172414  5.8620690
## [19]  6.2068966  6.5517241  6.8965517  7.2413793  7.5862069  7.9310345
## [25]  8.2758621  8.6206897  8.9655172  9.3103448  9.6551724 10.0000000
#get all the odd numbers from 1 to 50:
#  note that the upper endpoint won't
#  necessarily be included in the output
seq(1, 50, by = 2)
##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49

2.5.4.5 sort

sort will give an ordered version of its input. Default is increasing order; use the decreasing argument to reverse this.

x <- c(1, 2, 4, -1, 8, 9, 20, 13, 0)
sort(x)
## [1] -1  0  1  2  4  8  9 13 20
sort(x, decreasing = TRUE)
## [1] 20 13  9  8  4  2  1  0 -1

2.5.4.6 paste / paste0

paste and paste0 are R’s ways of concatenating strings, i.e., combining them together. paste has an argument sep which tells it how to _sep_arate the components; paste0 is a slightly faster version of paste that uses sep = "" (i.e., don’t separate the components). The collapse argument, available to both, will reduce a vector to a single string, using collapse to separate the elements. Examples:

first <- c("Sam", "Charles", "Matthew", "Thom")
last <- c("Beam", "Mingus", "Embree", "Yorke")

#default sep is a space, sep = " "
paste(first, last)
## [1] "Sam Beam"       "Charles Mingus" "Matthew Embree" "Thom Yorke"
paste(last, first, sep = ", ")
## [1] "Beam, Sam"       "Mingus, Charles" "Embree, Matthew" "Yorke, Thom"
paste(last, first, collapse = ", ")
## [1] "Beam Sam, Mingus Charles, Embree Matthew, Yorke Thom"
paste0(first, last, collapse = ", ")
## [1] "SamBeam, CharlesMingus, MatthewEmbree, ThomYorke"

2.6 Random Numbers

As statistical analysts, randomness is our lifeblood. As a language designed for statistical analysis, then, it stands to reason that R comes well-equipped to handle many common statistical operations. This includes producing random numbers from a number of common distributions, random permutations, random subsets, etc.

2.6.1 Uniform and normal random numbers

The most common kind of random numbers that people want are uniform draws and normal draws.

#generate 10 U[0, 1] draws
runif(10)
##  [1] 0.3533790 0.4154977 0.2749768 0.8417500 0.3754361 0.3238726 0.4187158
##  [8] 0.8402353 0.8282875 0.1257449
#generate 10 U[3, 5] draws
runif(10, min = 3, max = 5)
##  [1] 3.034160 3.778142 3.449561 4.149737 3.859568 4.279103 4.196884
##  [8] 3.718585 4.642759 4.263920
#generate 10 N(0, 1) draws
rnorm(10)
##  [1] -0.2616094  0.4220936  0.5989283 -0.1320829 -0.6563510 -0.1877255
##  [7]  1.2570120 -0.9213782  0.4822936 -1.5137514
#generate 10 N(3, 5) draws
rnorm(10, mean = 3, sd = 5)
##  [1] -5.5352378 -0.2025639  8.3196767  2.5940324  0.4463488 -1.8024958
##  [7]  0.2398940  0.9296785 -0.9146949  5.5596145

These examples highlight the common format of random number generators in R. To get random numbers from a distribution, there is probably a function named like rdist, where dist is an abbreviation for the distribution (here, uniform and normal). The first argument is the number of draws; the rest of the arguments are parameters.
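For instance, the same pattern holds for the Poisson and binomial generators (these draws are random, so your numbers will differ from run to run):

```r
#10 draws from a Poisson distribution with rate lambda = 2
rpois(10, lambda = 2)
#10 draws from a Binomial(size = 5, prob = 0.5) distribution
rbinom(10, size = 5, prob = 0.5)
```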

Here is a complete table of all of the common distributions built in to R, and the function used to invoke their random number generator (RNG):

List of native distributions from the built-in stats package

| Distribution | RNG | Other Parameters | Wikipedia |
| --- | --- | --- | --- |
| Beta | rbeta | shape1, shape2, ncp | https://en.wikipedia.org/wiki/Beta_distribution |
| Binomial | rbinom | size, prob | https://en.wikipedia.org/wiki/Binomial_distribution |
| Cauchy | rcauchy | location, scale | https://en.wikipedia.org/wiki/Cauchy_distribution |
| Chi-Squared | rchisq | df, ncp | https://en.wikipedia.org/wiki/Chi-squared_distribution |
| Exponential | rexp | rate | https://en.wikipedia.org/wiki/Exponential_distribution |
| F | rf | df1, df2, ncp | https://en.wikipedia.org/wiki/F-distribution |
| Gamma | rgamma | shape, rate, scale | https://en.wikipedia.org/wiki/Gamma_distribution |
| Geometric | rgeom | prob | https://en.wikipedia.org/wiki/Geometric_distribution |
| Hypergeometric | rhyper | m, n, k | https://en.wikipedia.org/wiki/Hypergeometric_distribution |
| Log-Normal | rlnorm | meanlog, sdlog | https://en.wikipedia.org/wiki/Log-normal_distribution |
| Logistic | rlogis | location, scale | https://en.wikipedia.org/wiki/Logistic_distribution |
| Multinomial | rmultinom | size, prob | https://en.wikipedia.org/wiki/Multinomial_distribution |
| Negative Binomial | rnbinom | size, prob, mu | https://en.wikipedia.org/wiki/Negative_binomial_distribution |
| Poisson | rpois | lambda | https://en.wikipedia.org/wiki/Poisson_distribution |
| Wilcoxon Signed-Rank Statistic | rsignrank* | n | https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test |
| Student t | rt | df, ncp | https://en.wikipedia.org/wiki/Student%27s_t-distribution |
| Weibull | rweibull | shape, scale | https://en.wikipedia.org/wiki/Weibull_distribution |
| Wilcoxon Rank-Sum Statistic | rwilcox* | m, n | https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test |
| Wishart | rWishart | df, Sigma | https://en.wikipedia.org/wiki/Wishart_distribution |

In addition to RNGs, R typically has three other functions associated with a given distribution: a density function with prefix d, e.g. dnorm, giving the PDF; a probability function with prefix p, e.g. pnorm, giving the CDF; and a quantile function with prefix q, e.g. qnorm, giving quantiles.

#this is mainly used for plotting, see below
dnorm(seq(-3, 3, length.out = 30))
##  [1] 0.004431848 0.008069595 0.014077583 0.023529569 0.037679866
##  [6] 0.057811501 0.084982332 0.119688537 0.161505004 0.208799242
## [11] 0.258631467 0.306932813 0.348991450 0.380185689 0.396813332
## [16] 0.396813332 0.380185689 0.348991450 0.306932813 0.258631467
## [21] 0.208799242 0.161505004 0.119688537 0.084982332 0.057811501
## [26] 0.037679866 0.023529569 0.014077583 0.008069595 0.004431848
#great for calculating p values
pnorm(1.96)
## [1] 0.9750021
#great for getting critical values
qnorm(c(.005, .025, .05))
## [1] -2.575829 -1.959964 -1.644854

2.6.2 sample

As often as we need random draws from uniform or normal distributions, we need to take random subsets of vectors, or to get random integers. The main function for tasks like this is sample, which takes 4 arguments: x (a vector from which to sample, or an integer representing the max of numbers from which to sample), size (the number of elements to sample, which defaults to length(x)), replace (whether draws are taken with replacement), and prob (a vector of probabilities, for weighted sampling – default is uniform).

As always, it’s easiest to see with some examples:

#A permutation of 1:5
sample(5)
## [1] 2 4 5 1 3
#A size-3 subset of 1:5
sample(5, 3)
## [1] 2 3 5
#A size-5 subset of 1:5, with replacement
sample(5, 5, replace = TRUE)
## [1] 3 1 2 3 5
#A weighted sample of 1:5, heavily weighting 1
sample(5, 5, replace = TRUE, prob = c(.99, .0025, .0025, .0025, .0025))
## [1] 1 1 1 1 1

We can use the basic tools here to run basic randomization exercises, like rolling dice, flipping coins, or drawing cards from a deck:

#Flip 10 coins, get the percentage heads
flips <- sample(c("H", "T"), 10, replace = TRUE)
mean(flips == "H")
## [1] 0.6
#Roll a pair of dice and add them
sum(sample(6, 2, replace = TRUE))
## [1] 10
#create a deck of labeled cards:
#  (I copy-pasted the text for the suits from
#   Wikipedia, but we could have used S/H/D/C:
#   https://en.wikipedia.org/wiki/Playing_cards_in_Unicode)
deck <- paste0(rep(c(2:10, "J", "Q", "K", "A"), 4),      #card values
               rep(c("♠", "♥", "♦", "♣"), each = 13)) #suits
deck
##  [1] "2♠"  "3♠"  "4♠"  "5♠"  "6♠"  "7♠"  "8♠"  "9♠"  "10♠" "J♠"  "Q♠" 
## [12] "K♠"  "A♠"  "2♥"  "3♥"  "4♥"  "5♥"  "6♥"  "7♥"  "8♥"  "9♥"  "10♥"
## [23] "J♥"  "Q♥"  "K♥"  "A♥"  "2♦"  "3♦"  "4♦"  "5♦"  "6♦"  "7♦"  "8♦" 
## [34] "9♦"  "10♦" "J♦"  "Q♦"  "K♦"  "A♦"  "2♣"  "3♣"  "4♣"  "5♣"  "6♣" 
## [45] "7♣"  "8♣"  "9♣"  "10♣" "J♣"  "Q♣"  "K♣"  "A♣"
#Now draw a poker hand
sample(deck, 5)
## [1] "A♠" "K♣" "7♣" "9♥" "Q♠"

2.6.3 replicate

Full-scale Monte Carlo/bootstrapping requires replicating a simulation exercise over and over. Again, R is ready, with the replicate function. Let’s simulate a few poker hands to find their frequency and compare with the theoretical frequency:

#It'll be easier to deal with numbers than
#  with the fancy strings we used above
deck_nos <- 1:52
#Simulating a flush
simulations <- 
  replicate(10000, #number of repetitions/simulations
            { #use curly braces to run a simulation that takes more than one line
              hand <- sample(deck_nos, 5) #draw a five-card hand
              suits <- hand %% 4 # force card number into one of 4 categories
              length(unique(suits)) == 1 #returns TRUE if all suits are equal, FALSE else
            }) #close braces and parentheses
#How often does a flush happen?
#  True frequency is 0.1965%
mean(simulations)
## [1] 0.0017
#What about a pair (just two cards match)?
#  true frequency: 42%
simulations <- 
  replicate(10000, {
    hand <- sample(deck_nos, 5)
    faces <- hand %% 13 # force into one of 13 categories
    sum(duplicated(faces)) == 1 #one pair if there's exactly _one_ match
  })
mean(simulations)
## [1] 0.4272

2.6.4 set.seed

Just a quick note: any time we run a simulation, it’s a good idea (for replicability) to set and save the random seed. Random numbers from computers may look random, but in fact they’re completely deterministic – if we know the value of the random seed. This is a disadvantage for information security (which is only sometimes important to us as data analysts), but an advantage for academic research, where replicability is paramount.

A simple example:

set.seed(100)
runif(1)
## [1] 0.3077661
#if we run again, we'll get a different number
runif(1)
## [1] 0.2576725
#but if we re-set the seed, we'll get the same
set.seed(100)
runif(1)
## [1] 0.3077661

2.7 Vectorization / *apply

The last set of major workhorse functions in R are the *apply functions – sapply and lapply. apply is also important, but I want to avoid touching on matrices today, if I can. The oft-ignored cousins are rapply (recursive), tapply (tagged), mapply (multivariate), vapply (verified), and eapply (environment).

You should think of lapply as list apply and sapply as simplified apply, because the former returns a list, and the latter returns a more simplified object (vector/matrix/array).

Both of these functions are, by and large, what you should be using in R instead of a for loop. A for loop often consists of going through every element of a list and doing something to its contents. for loops exist in R, as they probably do in every language, but are typically much slower than what is often called a vectorized approach. We’ll get into that more later, but for now I’ll just leave an example of lapply and sapply in action:

#pick three triplets of numbers at random
l <- list(sample(100, 3), sample(100, 3), sample(100, 3))
l
## [[1]]
## [1] 26 55  6
## 
## [[2]]
## [1] 47 48 80
## 
## [[3]]
## [1] 38 55 17
#now find the max within each group
lapply(l, max)
## [[1]]
## [1] 55
## 
## [[2]]
## [1] 80
## 
## [[3]]
## [1] 55
#note that the output is a list. It's typically
#  more convenient to have the output as a vector:
sapply(l, max)
## [1] 55 80 55
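For comparison, here's the same task written as an explicit for loop. Both approaches give the same answer; the *apply version is just terser (and typically faster). This sketch draws its own random list, so the particular numbers will differ from the ones above:

```r
l <- list(sample(100, 3), sample(100, 3), sample(100, 3))
#pre-allocate the output vector, then fill it one element at a time
maxes <- numeric(length(l))
for (i in seq_along(l)) {
  maxes[i] <- max(l[[i]])
}
#confirm the loop agrees with the vectorized version
all(maxes == sapply(l, max))
## [1] TRUE
```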

2.8 Packages

One of the things that makes R truly exceptional is its vast library of user-contributed packages.

R comes pre-loaded with a boat-load of the most common functions / methods of analysis. But in no way is this built-in library complete.

Complementing this core of the most common operations are external packages, which are basically sets of functions designed to accomplish specific tasks. The entire afternoon of this workshop is devoted to gaining facility in just a few of these.

Best of all, unlike some super-expensive programming languages, all of the thousands of packages available to R users (most importantly through CRAN, the Comprehensive R Archive Network) are completely free of charge.

The three most important things to know about packages for now (we’ll start using my favorite, data.table, shortly) are where to find them, how to install them, and how to load them.

2.8.1 Where to find packages

Long story short: Google. Got a particular statistical technique in mind? The best R package for this is almost always the top Google result if asked correctly.

How about hierarchical linear modeling (HLM)? A quick Google for “HLM R” or “hierarchical linear modeling R” both turn up some articles/tutorials/CRAN pages referencing lmer, the workhorse function of the lme4 package.

2.8.2 How to install packages

Just use install.packages!

#we won't be able to run this, because it requires
#  administrative privileges, but you can do
#  this easily on your own computer
install.packages("lme4")

2.8.3 How to load packages

Just add it to your library!

#this won't run if it's not installed
library(lme4)

Et voila! You’ll now be able to run HLM in R. You can also Google “tutorial lmer” (or in general “tutorial [package name]”) and you’re very likely to find a trove of sites trying to help you learn the package. Most popular packages also come with worked examples available through the example function, e.g., example("lmer", package = "lme4").


2.9 Toolkit: ls/rm/?/class/dput/str

Just a few more things that aren’t related to statistical analysis per se, but which will often facilitate your coding experience.

First, ls and rm, short-hand for list objects and remove objects. These can be useful for understanding what names you’ve assigned to objects, and for occasionally cleaning up the clutter of variables you’ll no longer use (partly for the meticulously clean among us, but also because leaving a bunch of unused objects lying around is a great way to create irritating errors and problems in your code).

Let’s see all the junk we’ve created thus far in the workshop:

#ls is a function, so we have to use () to call it
ls()
##  [1] "ages"        "candidates"  "color"       "deck"        "deck_nos"   
##  [6] "first"       "flips"       "l"           "last"        "nominees"   
## [11] "roygbiv"     "simulations" "x"           "y"           "years"      
## [16] "z"
#one of the optional arguments to ls is to 
#  also list some "hidden" objects:
ls(all = TRUE)
##  [1] "ages"         "candidates"   "color"        "deck"        
##  [5] "deck_nos"     "first"        "flips"        "l"           
##  [9] "last"         "nominees"     ".Random.seed" "roygbiv"     
## [13] "simulations"  "x"            "y"            "years"       
## [17] "z"
#what a mess! let's get rid of some stuff
rm(ages, candidates, color)

#ugh, this is getting boring. What about
#  a scorched earth technique?
rm(list = ls(all = TRUE))

#after the apocalypse, nothing remains:
ls(all = TRUE)
## character(0)

2.9.1 Troubleshooting

Finally, I’ll leave you with some tools that are often the first place to look when trying to troubleshoot code that isn’t working correctly.

2.9.1.1 ?

? is the help operator in R. If you’d like to know anything about any function in R, just type ? before its name and hit enter. Let’s start with the help page for replicate by entering ?replicate. All help files are structured similarly, with the following common sections (I’ve highlighted in bold the most useful for troubleshooting):

  1. Description: an overview of the function(s) described on the page. ?replicate, for example, redirects us to the help page for lapply, since in addition to replicate this page covers lapply, sapply, vapply, and simplify2array (don’t worry about the last two, they’re a bit more advanced and won’t come up today).
  2. Usage: a generic snippet of code which shows the name and default value of every argument of every function on the page. This is very useful for understanding the ordering and naming of all the arguments to a function. replicate, for example, has 3 arguments, only 2 of which we used above: n is the number of repetitions, which has no default; expr is the expression to repeat (surrounded in curly braces {} above), which has no default; and simplify, which gives us control over how/whether the results are converted from a list, which has default "array", leading to the output we saw above.
  3. Arguments: gives a brief overview of what each function expects to receive for each argument (the type, whether it can be a vector or has to be a scalar, etc.). You shouldn’t expect to fully understand everything that’s being told to you here for a while, since it’s pretty specific, but it can still help.
  4. Details: Some nitty-gritty about how the functions work, some mistakes to avoid, some of the more intricate details of how exactly the function goes about its task. Tends to be very dense and jargon-heavy.
  5. Value: Tells you details about the type of object that you should expect to be returned as a result of running each function (e.g., we noted that lapply always returns a list; this is noted in the Value section of ?replicate in the first sentence).
  6. Note: Some more technical details about the function and its edge cases.
  7. References: Especially useful for packages – this usually tells you where to go to find some more reading about the algorithms implemented.
  8. See Also: Gives some related functions which might also help accomplish similar tasks.
  9. Examples: A great place to look for some simple reproducible examples of how the functions work, with code that can be copy-pasted and run by you, the user.
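As a quick illustration of the simplify argument described in Usage:

```r
#the default simplify = "array" collapses the repetitions into a vector
replicate(3, 2 + 2)
## [1] 4 4 4
#simplify = FALSE leaves them as a list, just like lapply would return
replicate(3, 2 + 2, simplify = FALSE)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 4
```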

The help files are an absolute must-go-to reference – they’re always the first place you should check when you’re stuck.

2.9.1.2 class

class simply tells you the class/type of an object in memory. You’d be surprised how often the root of an issue is simply that you think a certain variable is type X (say, numeric), but actually R is keeping track of it as if it’s type Y (frighteningly often, factor – more on that later).
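The factor trap in particular deserves a quick demonstration:

```r
f <- factor(c(10, 20, 30))
class(f)
## [1] "factor"
#naive conversion returns the internal level codes, not the values!
as.numeric(f)
## [1] 1 2 3
#the standard idiom is to convert via character first
as.numeric(as.character(f))
## [1] 10 20 30
```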

2.9.1.3 dput

dput is like a microscope for R objects. It deconstructs any variable in R to its atoms – its absolute fundamental components – and prints out what’s hidden under the hood. This is especially useful because all of the most complicated objects in R tend to have an associated print method which renders them in a digestible form – great for taking a glance at a complicated object, but not so helpful for understanding what’s going wrong with it.

Take, for example:

#we'll get more into how R understands dates/times
#  later in the day; the key takeaway here is 
#  how R prints the object vs. how it's actually stored.
t <- as.POSIXlt("2015-05-12")
t
## [1] "2015-05-12 EDT"
#when we just enter the object at the console, its
#  print method is invoked, i.e.:
print(t)
## [1] "2015-05-12 EDT"
#this is a bit misleading. From what we see, it would appear
#  that t is just a character string. However:
t == "2015-05-15 EDT"
## [1] FALSE
#so, what is t? use dput
dput(t)
## structure(list(sec = 0, min = 0L, hour = 0L, mday = 12L, mon = 4L, 
##     year = 115L, wday = 2L, yday = 131L, isdst = 1L, zone = "EDT", 
##     gmtoff = NA_integer_), .Names = c("sec", "min", "hour", "mday", 
## "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", 
## "POSIXt"))
#actually, it's a rather complicated list with components
#  describing every which thing about the date/time we entered

2.9.1.4 str

Often the output of dput is far too verbose to be of real use. str is a similar function which gives us a structured look at the components of a complicated object, but with the advantage that it doesn’t bury us in detail; it’s more like a table of contents for R objects.

#don't worry about what lm is yet, we'll get there soon
y <- rnorm(1000)
x <- rnorm(1000)
r <- lm(y ~ x)
str(r)
## List of 12
##  $ coefficients : Named num [1:2] 0.015549 -0.000618
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
##  $ residuals    : Named num [1:1000] -0.5905 -1.1076 -0.1212 0.0885 0.5711 ...
##   ..- attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:1000] -0.4921 0.0198 -0.1315 0.0739 0.6333 ...
##   ..- attr(*, "names")= chr [1:1000] "(Intercept)" "x" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:1000] 0.0148 0.0147 0.0161 0.0162 0.0147 ...
##   ..- attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:1000, 1:2] -31.6228 0.0316 0.0316 0.0316 0.0316 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:1000] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "x"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.03 1.04
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 998
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = y ~ x)
##  $ terms        :Classes 'terms', 'formula' length 3 y ~ x
##   .. ..- attr(*, "variables")= language list(y, x)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. ..$ : chr "x"
##   .. ..- attr(*, "term.labels")= chr "x"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(y, x)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  $ model        :'data.frame':   1000 obs. of  2 variables:
##   ..$ y: num [1:1000] -0.576 -1.093 -0.105 0.105 0.586 ...
##   ..$ x: num [1:1000] 1.214 1.296 -0.851 -0.983 1.369 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 y ~ x
##   .. .. ..- attr(*, "variables")= language list(y, x)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. .. ..$ : chr "x"
##   .. .. ..- attr(*, "term.labels")= chr "x"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(y, x)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  - attr(*, "class")= chr "lm"

Whew, that was a lot! Congratulations!

Now, on to the more fun and useful stuff!




3 Data Basics (9 AM - 11 AM)

This section is intended to lay the groundwork of all of the most common data analysis tasks we face when tackling our empirical problems.

3.1 Reading .csv Files

The most common format of file you’re likely to find in the real world is the .csv (comma-separated value) format.

Quick clarification: not all comma-separated files have the .csv extension. Often you’ll find a file with the extension .txt that actually contains comma-separated text. You just have to inspect the file to figure out how it’s stored.

In fact, that’s the case for our first example. We’re going to look at teacher attendance data for the School District of Philadelphia.

The school district releases, through their Open Data Initiative, a whole bunch of data files related to topics like school catchment zones, employee salaries, budget details, etc. All of the data can be found here: http://webgui.phila.k12.pa.us/offices/o/open-data-initiative.

We could use R to download the files (we may touch on this in the afternoon), but to ease us into things, I’ve gone ahead and downloaded some files and put them on my GitHub page. Let’s look at the data from 2013-14. Here’s a link to the data, but it should also be loaded on your machine.

We can preview the file using the terminal on a Mac. Learning to do some basic stuff in the terminal is a great investment of your time, but that’s not the focus of today’s lesson, so we’ll just stick with doing things through R. We can send commands to the terminal in R using the system command. The terminal command head (which is also a function in R, but only applies to objects already loaded) will spit out the first couple lines of a file so we can see what it looks like without having to load the whole thing (which, for larger files, can be time-consuming).

#Just to flex some of R's muscles, we'll find
#  the data file from within R. As often as not,
#  we'll just navigate to the file ourselves and
#  just copy-paste the file path.
#  We're looking for the folder with the data
#  so we use include.dirs to include directories
list.files(include.dirs = TRUE) 
## [1] "data"                                  
## [2] "iesrtutorial_cache"                    
## [3] "iesrtutorial_files"                    
## [4] "iesrtutorial.html"                     
## [5] "iesrtutorial.Rmd"                      
## [6] "README.md"                             
## [7] "workshop_installation_instructions.pdf"
#I've kept things in the data folder here. We
#  can tell it's a folder as it has no extensions.
#  We can find the files in that folder using a 
#  relative path (./ means starting from the 
#  current folder, enter the folder after /).
#  We use the full.names of the file for the next step.
list.files("./data", full.names = TRUE)
##  [1] "./data/00keys.csv"                                      
##  [2] "./data/00staff.dat"                                     
##  [3] "./data/2015-2016 Enr Dem (Suppressed).xls"              
##  [4] "./data/ag131a_supp.sas7bdat"                            
##  [5] "./data/carsdata.dta"                                    
##  [6] "./data/employee_information.csv"                        
##  [7] "./data/Neighborhoods_Philadelphia.dbf"                  
##  [8] "./data/Neighborhoods_Philadelphia.prj"                  
##  [9] "./data/Neighborhoods_Philadelphia.sbn"                  
## [10] "./data/Neighborhoods_Philadelphia.sbx"                  
## [11] "./data/Neighborhoods_Philadelphia.shp"                  
## [12] "./data/Neighborhoods_Philadelphia.shx"                  
## [13] "./data/Neighborhoods_Philadelphia.xml"                  
## [14] "./data/pennsylvania_districts.cpg"                      
## [15] "./data/pennsylvania_districts.dbf"                      
## [16] "./data/pennsylvania_districts.prj"                      
## [17] "./data/pennsylvania_districts.qpj"                      
## [18] "./data/pennsylvania_districts.shp"                      
## [19] "./data/pennsylvania_districts.shx"                      
## [20] "./data/phila_school_addr.csv"                           
## [21] "./data/README_SDP_Employee.txt"                         
## [22] "./data/School Profiles Teacher Attendance 2013-2014.TXT"
#Now we see the data files. The one we're after
#  currently is "School Profiles Teacher Attendance 2013-2014.TXT"
#  To get it previewed, we send the command head to the system
#  and follow it with the quoted file name (since the
#  file name has spaces, which is a pain)
command <- paste0('head ', '"./data/School Profiles Teacher Attendance 2013-2014.TXT"')
system(command)
## [1] "/home/michael/Documents/iesrtutorial"
## ULCS_NO,SCHOOL_YEAR,SCH_TEACHER_ATTEND,SDP_TEACHER_ATTEND,SCHOOL_ID
## "8350","2013-2014",92.7,93.4,"835"
## "7320","2013-2014",95,93.4,"732"
## "5440","2013-2014",96,93.4,"544"
## "8320","2013-2014",95.1,93.4,"832"
## "6440","2013-2014",96.6,93.4,"644"
## "4370","2013-2014",93.8,93.4,"437"
## "4030","2013-2014",94.3,93.4,"403"
## "2240","2013-2014",94,93.4,"224"
## "6450","2013-2014",95.3,93.4,"645"

We can see that even though the data has extension .txt, the file itself is very regular and clearly comma-separated.

3.1.1 fread/data.table

To read in this data, I’m going to eschew the standard intro-to-R suggestion to use read.csv because we’re presented now with the perfect opportunity to introduce to you the crown jewel of R – data.table (full disclosure, I am an active contributor to it, but this has followed from my love for it).

data.table is a package created by Matthew Dowle and actively maintained by him, Arun Srinivasan, and Jan Gorecki. Basically, it’s designed to make working with data in R very straightforward/intuitive and incredibly fast (since it’s designed to handle data with tens of millions of observations). We’ll be spending a lot of time working with data.table today, and picking most of it up as we go along. You can read more on the data.table homepage when you get a chance.

The first and most famous tool of data.table is fread. The “f” means FAST. fread is a highly optimized tool for reading data into R at blazing speeds. You’ll barely notice the difference for typical files (less than 100,000 observations), but when you do notice the difference, it’s a sight to behold.

On your own machine, you’ll need to install data.table (luckily for us, it’s already been pre-installed on your machines for today) using:

install.packages("data.table")

#the more confident and ambitious can install
#  the development version of the package,
#  which tends to have more features, but may
#  occasionally be more error-prone.
#  Caveat coder.
install.packages("data.table", type = "source",
                 repos = "https://Rdatatable.github.io/data.table")

Once we’re sure it’s on our machine, we can read the data by running:

#load the data.table package
library(data.table)

attendance <- 
  fread("./data/School Profiles Teacher Attendance 2013-2014.TXT")

Just like that!

Entering the data variable by itself will invoke the print.data.table command and display a preview of the data that was loaded:

attendance
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
#we can also explicitly invoke print,
#  which has a few more bells and whistles to
#  facilitate understanding what our data.table
#  looks like.
print(attendance, class = TRUE)
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##       <char>      <char>              <num>              <num>    <char>
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430

Using the class argument to print, the preview now includes an abbreviated type under each of the column headers letting you know what class R thinks each column is (note class is a recent update to data.table; for the moment, it is not available in the version on CRAN, but should be available there soon).

What about attendance itself? It looks like something we’ve not seen yet. Let’s explore:

class(attendance)
## [1] "data.table" "data.frame"
str(attendance)
## Classes 'data.table' and 'data.frame':   212 obs. of  5 variables:
##  $ ULCS_NO           : chr  "8350" "7320" "5440" "8320" ...
##  $ SCHOOL_YEAR       : chr  "2013-2014" "2013-2014" "2013-2014" "2013-2014" ...
##  $ SCH_TEACHER_ATTEND: num  92.7 95 96 95.1 96.6 93.8 94.3 94 95.3 92.2 ...
##  $ SDP_TEACHER_ATTEND: num  93.4 93.4 93.4 93.4 93.4 93.4 93.4 93.4 93.4 93.4 ...
##  $ SCHOOL_ID         : chr  "835" "732" "544" "832" ...
##  - attr(*, ".internal.selfref")=<externalptr>
is.list(attendance)
## [1] TRUE

OK, lots of important stuff here. We’ve discovered a new type of object! attendance is a data.table. You can think of this as being a single sheet of an Excel workbook. It is specific to the data.table package, and is an improvement on the data.frame type found in base R (we’ll see this type a bit later).

As far as how R thinks of it, a data.table is a list on steroids. Each element of the list is a column, and each element has the same number of items. We can see from str that, like a list, we can use $ to extract items (here: columns):

attendance$SCH_TEACHER_ATTEND
##   [1] 92.7 95.0 96.0 95.1 96.6 93.8 94.3 94.0 95.3 92.2 90.8 94.3 93.6 95.0
##  [15] 92.8 85.3 86.4 95.9 93.0 94.8 90.1 95.0 93.4 94.4 91.8 94.6 95.4 92.3
##  [29] 95.8 94.0 95.1 94.7 88.4 93.7 94.6 95.5 94.8 95.3 94.0 95.3 89.9 93.2
##  [43] 91.7 89.0 93.1 92.7 94.2 97.2 93.6 92.4 92.6 92.7 92.0 94.4 94.3 93.6
##  [57] 92.4 95.1 93.4 95.6 87.8 94.7 93.5 95.7 90.9 94.4 95.7 90.2 95.7 95.2
##  [71] 93.3 93.5 89.0 95.0 89.7 93.9 93.2 93.5 92.1 93.0 94.8 95.1 94.8 90.8
##  [85] 95.7 93.3 92.8 89.6 95.4 91.5 94.1 93.3 93.7 96.8 95.3 94.1 93.9 88.0
##  [99] 94.6 95.1 86.9 95.5 95.0 97.6 95.8 93.8 94.4 96.0 95.8 89.7 94.3 92.7
## [113] 93.1 95.1 94.5 95.4 94.4 94.4 90.8 94.5 93.3 90.6 93.5 94.4 91.9 90.7
## [127] 91.6 92.1 94.7 95.4 86.5 94.8 94.5 90.1 96.1 93.1 95.8 92.0 94.1 91.7
## [141] 94.0 92.4 95.6 91.2 92.1 92.8 93.8 93.7 88.8 96.1 94.4 90.6 95.3 95.3
## [155] 90.9 94.0 94.1 95.6 93.9 91.9 91.5 93.2 94.2 93.0 91.2 91.8 95.9 92.1
## [169] 93.8 92.6 90.0 94.1 96.5 91.1 89.8 93.6 94.3 97.6 93.0 98.1 91.4 93.4
## [183] 94.5 94.1 94.1 88.7 94.0 93.4 94.2 94.6 96.2 91.4 92.0 96.5 92.0 95.1
## [197] 94.7 96.6 96.8 91.7 96.1 95.9 93.0 91.5 94.4 93.2 88.5 92.0 91.4 95.3
## [211] 94.7 94.7

3.1.2 Exploring Attendance Data

What are the columns in this data? We usually hope we have a data dictionary, but I didn’t find one on the SDP website. It appears that ULCS_NO is the same as SCHOOL_ID (padded with a 0). It’s not clear if the school district is anonymizing the schools or if we simply have to match the SCHOOL_ID to the school’s name in another file.
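We can test the ULCS_NO hunch directly with the $ extraction we used earlier (this assumes the attendance table loaded above):

```r
#if ULCS_NO really is just SCHOOL_ID with a trailing zero,
#  this should come out TRUE for every row
all(attendance$ULCS_NO == paste0(attendance$SCHOOL_ID, "0"))
```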

The meat of the data are the two columns SCH_TEACHER_ATTEND and SDP_TEACHER_ATTEND. The former is the attendance rate of teachers at a given school; the latter is the average for the whole district (which is why it’s the same in every row).

We can now start to explore the data. The only real column of interest is SCH_TEACHER_ATTEND. With this, we can explore how teacher attendance varies across the city.

Interacting with a data.table for exploration involves accessing it with []. Within [], there are three main arguments:

  1. i: which rows do we want? (subsetting)
  2. j: which columns do we want? (selection and manipulation)
  3. by: grouping our operations in j by categories within a column.

The i and j notation come from thinking of a data.table as a matrix. To express the ith row and jth column of a matrix A, math people often write A[i,j].

Let’s see the first two in action in exploring our first data set

#Summarize the variable to get a quick understanding
#we're looking at ALL rows, so we leave the first argument BLANK
#we're trying to summarize SCH_TEACHER_ATTEND, so we
#  use the summary function in the second argument
attendance[ , summary(SCH_TEACHER_ATTEND)]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   85.30   92.10   93.90   93.43   95.00   98.10

So most schools are within 2 percent of the district average. What about schools that are particularly high or particularly low?

#We are focusing on certain rows, so we
#  say the condition defining those
#  rows in the first argument
attendance[SCH_TEACHER_ATTEND < 90]
##     ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##  1:    4350   2013-2014               85.3               93.4       435
##  2:    4570   2013-2014               86.4               93.4       457
##  3:    1340   2013-2014               88.4               93.4       134
##  4:    7350   2013-2014               89.9               93.4       735
##  5:    7220   2013-2014               89.0               93.4       722
##  6:    2480   2013-2014               87.8               93.4       248
##  7:    6360   2013-2014               89.0               93.4       636
##  8:    8240   2013-2014               89.7               93.4       824
##  9:    5430   2013-2014               89.6               93.4       543
## 10:    6260   2013-2014               88.0               93.4       626
## 11:    6340   2013-2014               86.9               93.4       634
## 12:    7290   2013-2014               89.7               93.4       729
## 13:    5370   2013-2014               86.5               93.4       537
## 14:    1330   2013-2014               88.8               93.4       133
## 15:    7510   2013-2014               89.8               93.4       751
## 16:    4240   2013-2014               88.7               93.4       424
## 17:    1130   2013-2014               88.5               93.4       113
#And schools with particularly high attendance
attendance[SCH_TEACHER_ATTEND > 97]
##    ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1:    5340   2013-2014               97.2               93.4       534
## 2:    6040   2013-2014               97.6               93.4       604
## 3:    8560   2013-2014               97.6               93.4       856
## 4:    1190   2013-2014               98.1               93.4       119

Without knowing more about the schools, all we can say is that there are a certain number of schools with very high or very low attendance.
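We’ll only need i and j for this data set, but here’s a small synthetic sketch of the third argument, by, in action (the room/score columns are invented for illustration):

```r
library(data.table)
scores <- data.table(room  = c("A", "A", "B", "B", "B"),
                     score = c(90, 80, 70, 85, 95))
#the j expression is computed separately within each value of room
scores[ , mean(score), by = room]
```

The result has one row per group; since we didn’t name the j expression, data.table labels the column V1 by default.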

3.1.3 Teacher Salaries

Another file posted to the SDP website contains information on teachers’ salaries. I’ve again downloaded this data already for you, but if you’re curious here’s a link to the data from the website.

This one actually comes with a data dictionary of sorts in the form of a README file. We can use R to display the README like so:

writeLines(readLines("./data/README_SDP_Employee.txt"))
README_SDP_Employee.txt
REVISIONS
20160401 (April 1st, 2016) - Second release in 2016.

This README file contains some background information and guidance on School District of Philadelphia (SDP) data. This set of data was extracted on March 31st 2016, and includes every active employee of the School District of Philadelphia as of the extract date. Do not hesitate to contact us if you have any questions, or would like to request other data sets.
Happy data diving!
opendata@philasd.org

Individual files:

README_SDP_EMPLOYEE.TXT  - You're soaking in it!

employee_information.csv
Provides basic information on every employee of the School District of Philadelphia. 
LAST_NAME - Employee's last name.
FIRST_NAME - Employee's first name.
PAY_RATE_TYPE - The pay rate type for the employee, either SALARIED, DAILY or HOURLY.
PAY_RATE - The pay rate for the employee. If the pay rate type is salaried, then the annual salary (minus any benefits). If the pay rate type is daily, then the daily pay, and if pay rate type is hourly, then the hourly pay rate. If the pay rate is zero the employee is typically in a leave status that does not receive pay; for example military leave.
TITLE_DESCRIPTION - Title for employee; there are over 500 unique titles.
HOME_ORGANIZATION - The home organization code identifies where the employee is primarily stationed. Some employees are "per diem", meaning they work on a daily basis at a variety of locations; other employees may work at multiple locations, but are attributed to one location.
HOME_ORGANIZATION_DESCRIPTION - The home organization description identifies the home location by name, typically the name of the school or office.
ORGANIZATION_LEVEL - Identifies the type of location the employee works at; for example, administrative office, garage, elementary school, etc.
TYPE_OF_REPRESENTATION - The union representation for the employee. NON REP means the employee is not represented by any union.
GENDER - Gender of employee.
RUN_DATE - The date the data was extracted from SDP data systems.

School District Data Sets Terms of Use.pdf
Terms of use for use the open data sets from the School District of Philadelphia.

readLines is a function used to bring text files into R as objects. It returns every line of the file as one element of a character vector. writeLines is typically used for creating text files, but if we don’t tell it a file to use, it prints the output to the console.
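A minimal round trip through a temporary file shows both halves at once:

```r
txt <- c("first line", "second line")
tmp <- tempfile()    #a throwaway file path
writeLines(txt, tmp) #writes one file line per vector element
readLines(tmp)       #reads them back as a character vector
## [1] "first line"  "second line"
```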

The file describes all three of the files that came with the download; of importance is employee_information.csv. We see here that the columns are all described, which will help us understand what we see when we load in the data. Let’s do it!

salaries <- fread("./data/employee_information.csv")
print(salaries, class = TRUE)
##           LAST_NAME FIRST_NAME PAY_RATE_TYPE PAY_RATE
##              <char>     <char>        <char>    <int>
##     1:        AARON     ANDREA      SALARIED    31261
##     2:        AARON      PEGGY      SALARIED     9349
##     3:        ABARY     RODNEY      SALARIED    76461
##     4:        ABATE     JO-ANN        HOURLY       48
##    ---                                               
## 17558:        ZWICK      KEVIN      SALARIED    46694
## 17559:        ZWIRN     RHONDA      SALARIED    76461
## 17560:       ZWOLAK       PAUL      SALARIED    54364
## 17561:    ZYCHOWICZ       LISA      SALARIED    67706
## 17562: ZYGMUND-FELT      ALLYN      SALARIED    90051
##                     TITLE_DESCRIPTION HOME_ORGANIZATION
##                                <char>            <char>
##     1:       GENERAL CLEANER, 8 HOURS              4300
##     2:  STUDENT CLIMATE STAFF,4 HOURS              6360
##     3:                   SCHOOL NURSE              2370
##     4: TEACHER-EXTRA CURR/STAFF DEVEL              9EW0
##    ---                                                 
## 17558:              TEACHER,FULL TIME              8300
## 17559:         TEACHER,SPEC EDUCATION              6210
## 17560:              TEACHER,FULL TIME              2290
## 17561:              TEACHER,FULL TIME              8420
## 17562:              TEACHER,FULL TIME              7100
##        HOME_ORGANIZATION_DESCRIPTION        ORGANIZATION_LEVEL
##                               <char>                    <char>
##     1:         HESTON, EDWARD SCHOOL         ELEMENTARY SCHOOL
##     2:   ROOSEVELT ELEMENTARY SCHOOL         ELEMENTARY SCHOOL
##     3:   MCDANIEL, DELAPLAINE SCHOOL         ELEMENTARY SCHOOL
##     4:           NON-PUBLIC PROGRAMS NON ADMINISTRATIVE OFFICE
##    ---                                                        
## 17558:                MAYFAIR SCHOOL         ELEMENTARY SCHOOL
## 17559:   EDMONDS, FRANKLIN S. SCHOOL         ELEMENTARY SCHOOL
## 17560:      FRANKLIN LEARNING CENTER               HIGH SCHOOL
## 17561:       DECATUR, STEPHEN SCHOOL         ELEMENTARY SCHOOL
## 17562:  COOKE, JAY ELEMENTARY SCHOOL         ELEMENTARY SCHOOL
##        TYPE_OF_REPRESENTATION GENDER RUN_DATE
##                        <char> <char>   <char>
##     1:             LOCAL 1201      F 4/1/2016
##     2:              LOCAL 634      F 4/1/2016
##     3:            PFT-TEACHER      M 4/1/2016
##     4:            PFT-TEACHER      F 4/1/2016
##    ---                                       
## 17558:            PFT-TEACHER      M 4/1/2016
## 17559:            PFT-TEACHER      F 4/1/2016
## 17560:            PFT-TEACHER      M 4/1/2016
## 17561:            PFT-TEACHER      F 4/1/2016
## 17562:            PFT-TEACHER      F 4/1/2016
#it's a pain to have to hold down the SHIFT key all the
#  time, so let's rename the columns so that they
#  are in lower case. We need three functions:
#  tolower, which takes all upper-case letters in a
#  string and replaces them with their lower-case
#  counterparts; names, which returns the names of
#  any object (vector, list, data.table, etc.);
#  and setnames, which has three arguments:
#  1) a data.table, 2) the columns to rename, and
#  3) their replacements. If we only use two arguments,
#  data.table assumes we want to replace _all_ column
#  names, and that the second argument is the replacements.
setnames(salaries, tolower(names(salaries)))

#checking the result:
salaries
##           last_name first_name pay_rate_type pay_rate
##     1:        AARON     ANDREA      SALARIED    31261
##     2:        AARON      PEGGY      SALARIED     9349
##     3:        ABARY     RODNEY      SALARIED    76461
##     4:        ABATE     JO-ANN        HOURLY       48
##     5:  ABAYOMI-IGE   OLABIMPE      SALARIED    76461
##    ---                                               
## 17558:        ZWICK      KEVIN      SALARIED    46694
## 17559:        ZWIRN     RHONDA      SALARIED    76461
## 17560:       ZWOLAK       PAUL      SALARIED    54364
## 17561:    ZYCHOWICZ       LISA      SALARIED    67706
## 17562: ZYGMUND-FELT      ALLYN      SALARIED    90051
##                     title_description home_organization
##     1:       GENERAL CLEANER, 8 HOURS              4300
##     2:  STUDENT CLIMATE STAFF,4 HOURS              6360
##     3:                   SCHOOL NURSE              2370
##     4: TEACHER-EXTRA CURR/STAFF DEVEL              9EW0
##     5:         TEACHER,SPEC EDUCATION              6100
##    ---                                                 
## 17558:              TEACHER,FULL TIME              8300
## 17559:         TEACHER,SPEC EDUCATION              6210
## 17560:              TEACHER,FULL TIME              2290
## 17561:              TEACHER,FULL TIME              8420
## 17562:              TEACHER,FULL TIME              7100
##         home_organization_description        organization_level
##     1:          HESTON, EDWARD SCHOOL         ELEMENTARY SCHOOL
##     2:    ROOSEVELT ELEMENTARY SCHOOL         ELEMENTARY SCHOOL
##     3:    MCDANIEL, DELAPLAINE SCHOOL         ELEMENTARY SCHOOL
##     4:            NON-PUBLIC PROGRAMS NON ADMINISTRATIVE OFFICE
##     5: LEEDS, MORRIS E. MIDDLE SCHOOL             MIDDLE SCHOOL
##    ---                                                         
## 17558:                 MAYFAIR SCHOOL         ELEMENTARY SCHOOL
## 17559:    EDMONDS, FRANKLIN S. SCHOOL         ELEMENTARY SCHOOL
## 17560:       FRANKLIN LEARNING CENTER               HIGH SCHOOL
## 17561:        DECATUR, STEPHEN SCHOOL         ELEMENTARY SCHOOL
## 17562:   COOKE, JAY ELEMENTARY SCHOOL         ELEMENTARY SCHOOL
##        type_of_representation gender run_date
##     1:             LOCAL 1201      F 4/1/2016
##     2:              LOCAL 634      F 4/1/2016
##     3:            PFT-TEACHER      M 4/1/2016
##     4:            PFT-TEACHER      F 4/1/2016
##     5:            PFT-TEACHER      F 4/1/2016
##    ---                                       
## 17558:            PFT-TEACHER      M 4/1/2016
## 17559:            PFT-TEACHER      F 4/1/2016
## 17560:            PFT-TEACHER      M 4/1/2016
## 17561:            PFT-TEACHER      F 4/1/2016
## 17562:            PFT-TEACHER      F 4/1/2016
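The same renaming pattern works outside data.table, too; on a plain data.frame, base R’s names<- does the job (a toy sketch with made-up columns):

```r
df <- data.frame(NAME = c("AARON", "ABARY"), PAY = c(31261, 76461))
names(df) <- tolower(names(df))
names(df)
## [1] "name" "pay"
```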

This is a much bigger data set – covering all employees of the SDP and giving many more details about them. Let’s explore a bit:

#table is a great function for describing discrete variables.
#  it gives the count of observations in each cell, and can also
#  be used to create what Stata calls two-way tables.
salaries[ , table(pay_rate_type)]
## pay_rate_type
##    DAILY   HOURLY SALARIED 
##      141      526    16895
salaries[ , table(organization_level)]
## organization_level
##                        ACADEMY          ADMINISTRATIVE OFFICE 
##                              4                           1578 
## CAREER AND TECHNICAL HIGH SCHL                CHARTER SCHOOLS 
##                            456                             12 
##     DISTRICT BUILDING/PROPERTY                EARLY CHILDHOOD 
##                             64                            356 
##              ELEMENTARY SCHOOL                         GARAGE 
##                           9667                              6 
##                    HIGH SCHOOL         INACTIVE WITH ACTIVITY 
##                           3101                             11 
##               LEARNING NETWORK                  MIDDLE SCHOOL 
##                             20                            954 
##      NON ADMINISTRATIVE OFFICE              NON PUBLIC SCHOOL 
##                           1191                              1 
##    TRANSITION / OVERAGE SCHOOL 
##                            141
#a cross-tabulation of these two; the first argument will
#  appear in the rows, the second in the columns
salaries[ , table(pay_rate_type, organization_level)]
##              organization_level
## pay_rate_type ACADEMY ADMINISTRATIVE OFFICE CAREER AND TECHNICAL HIGH SCHL
##      DAILY          2                    31                              4
##      HOURLY         0                   147                              0
##      SALARIED       2                  1400                            452
##              organization_level
## pay_rate_type CHARTER SCHOOLS DISTRICT BUILDING/PROPERTY EARLY CHILDHOOD
##      DAILY                  0                          0               0
##      HOURLY                 0                          0               2
##      SALARIED              12                         64             354
##              organization_level
## pay_rate_type ELEMENTARY SCHOOL GARAGE HIGH SCHOOL INACTIVE WITH ACTIVITY
##      DAILY                   23      0          11                      0
##      HOURLY                   0      0           2                      0
##      SALARIED              9644      6        3088                     11
##              organization_level
## pay_rate_type LEARNING NETWORK MIDDLE SCHOOL NON ADMINISTRATIVE OFFICE
##      DAILY                   2             1                        65
##      HOURLY                  0             0                       375
##      SALARIED               18           953                       751
##              organization_level
## pay_rate_type NON PUBLIC SCHOOL TRANSITION / OVERAGE SCHOOL
##      DAILY                    0                           2
##      HOURLY                   0                           0
##      SALARIED                 1                         139
salaries[ , table(organization_level, gender)]
##                                 gender
## organization_level                  F    M
##   ACADEMY                           1    3
##   ADMINISTRATIVE OFFICE           911  667
##   CAREER AND TECHNICAL HIGH SCHL  269  187
##   CHARTER SCHOOLS                  11    1
##   DISTRICT BUILDING/PROPERTY       26   38
##   EARLY CHILDHOOD                 343   13
##   ELEMENTARY SCHOOL              7989 1678
##   GARAGE                            0    6
##   HIGH SCHOOL                    1988 1113
##   INACTIVE WITH ACTIVITY            5    6
##   LEARNING NETWORK                  7   13
##   MIDDLE SCHOOL                   706  248
##   NON ADMINISTRATIVE OFFICE       597  594
##   NON PUBLIC SCHOOL                 1    0
##   TRANSITION / OVERAGE SCHOOL      94   47
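To see exactly what table is counting, here it is on small made-up vectors (hypothetical pets and sizes, just for illustration):

```r
pets <- c("cat", "dog", "cat", "fish", "cat")
table(pets)
## pets
##  cat  dog fish 
##    3    1    1
#a two-way table: first argument in rows, second in columns
size <- c("small", "big", "small", "small", "big")
table(pets, size)
```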

Let’s say we’re only interested in salaried employees. How do we get rid of the hourly/daily employees?

Unlike in Stata, where we’d use drop and would have to jump through hoops to recover the lost observations (in case we decided to expand our focus later in our analysis), we can simply create a new data.table in R that contains only the observations we care about. This is, in my opinion, one of the most salient examples of a feature of R that blows Stata out of the water. We can keep many data sets at our disposal at once, without having to unload/reload the ones we’re not currently using. This will prove especially helpful when it comes to data cleaning/merging/reshaping.

salaried <- salaries[pay_rate_type == "SALARIED"]

Even more useful may be to restrict attention to teachers. First, we have to exclude anyone who’s not a teacher. How do we figure out who’s a teacher?

Approach:

  1. Count the number of observations in each category of title_description (we can see from the preview that there are at least some teachers under TEACHER,FULL TIME)
  2. Sort these categories to find the most common categories, since presumably teachers are by far the most common SDP employees.

Carrying this out will introduce a few new things:

  • Grouping. We’re going to group by title_description, and count observations in each group. If I remember correctly, in Stata, this would be like by title_description: count. When data.table does this, you should think of your data.table being split into a whole bunch of smaller data.tables, each of which has all of the observations for exactly one level of title_description.

Here’s an illustration (figure group_split: the data.table split into one sub-data.table per level of title_description):

  • .N. data.table uses the abbreviation .N to stand for “number of observations”. It is defined locally, within each group, so each of the sub-data.tables mentioned above has its own .N. This is basically like count in Stata.
  • order. We can arrange the rows of a table by using order in the first argument (i); use a minus sign (-) to go in decreasing order, otherwise it’s increasing order (like we saw with sort).
  • Chaining. We can tack on more and more [] to any given data.table operation to refine, continue, or adjust our results beyond what we could do in a single call to []. The general syntax is DT[...][...][...][...], where each [...] performs some manipulation/subset/aggregation, etc.
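For comparison, here’s the same count-then-sort idea in base R on a toy vector (a sketch of the logic, not the data.table idiom we’re about to use):

```r
titles <- c("TEACHER", "NURSE", "TEACHER", "CLEANER", "TEACHER", "NURSE")
#count each level, then sort the counts in decreasing order
counts <- sort(table(titles), decreasing = TRUE)
counts[1]  #the most common title (TEACHER, with 3)
```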
#Let's take it step-by-step
# 1: Find the count of employees in each category
salaried[ , .N, by = title_description]
##                   title_description    N
##   1:       GENERAL CLEANER, 8 HOURS  503
##   2:  STUDENT CLIMATE STAFF,4 HOURS  451
##   3:                   SCHOOL NURSE  184
##   4:         TEACHER,SPEC EDUCATION 1314
##   5:  STUDENT CLIMATE STAFF,3 HOURS  410
##  ---                                    
## 531: DATA ANALYST - CHARTER SCHOOLS    1
## 532:        DIR,OPERATIONAL SYS DEV    1
## 533:         DIR,SYSTEMS ADMIN UNIT    1
## 534:  DIR,INSURANCE RISK MANAGEMENT    1
## 535:   DIR,DISTRICT PERFORMANCE OFF    1
# 2: reorder these to find the most common
salaried[ , .N, by = title_description][order(N)]
##                   title_description    N
##   1:   SCHOOL-BASED TECH MAINT ASST    1
##   2: DIR,CERT,SUB SVCS,SCH ALLOT SU    1
##   3:             NUTRITIONIST, PKHS    1
##   4: ADMINISTRATOR,PHILA VIRTUAL AC    1
##   5:     HR SYSTEMS CONTROL ANALYST    1
##  ---                                    
## 531:       GENERAL CLEANER, 8 HOURS  503
## 532:    CLASSROOM ASST,SP ED,SV HND  594
## 533:    ONE TO ONE ASST, SPECIAL ED  757
## 534:         TEACHER,SPEC EDUCATION 1314
## 535:              TEACHER,FULL TIME 6652
# 3: put it in decreasing order
salaried[ , .N, by = title_description][order(-N)]
##                   title_description    N
##   1:              TEACHER,FULL TIME 6652
##   2:         TEACHER,SPEC EDUCATION 1314
##   3:    ONE TO ONE ASST, SPECIAL ED  757
##   4:    CLASSROOM ASST,SP ED,SV HND  594
##   5:       GENERAL CLEANER, 8 HOURS  503
##  ---                                    
## 531: DATA ANALYST - CHARTER SCHOOLS    1
## 532:        DIR,OPERATIONAL SYS DEV    1
## 533:         DIR,SYSTEMS ADMIN UNIT    1
## 534:  DIR,INSURANCE RISK MANAGEMENT    1
## 535:   DIR,DISTRICT PERFORMANCE OFF    1
# 4: Only look at the top 20 most frequent positions
# (idea -- once it's resorted, these are simply
#  the first twenty rows of the sorted table)
salaried[ , .N, by = title_description][order(-N)][1:20]
##                  title_description    N
##  1:              TEACHER,FULL TIME 6652
##  2:         TEACHER,SPEC EDUCATION 1314
##  3:    ONE TO ONE ASST, SPECIAL ED  757
##  4:    CLASSROOM ASST,SP ED,SV HND  594
##  5:       GENERAL CLEANER, 8 HOURS  503
##  6:  STUDENT CLIMATE STAFF,4 HOURS  451
##  7:  STUDENT CLIMATE STAFF,3 HOURS  410
##  8:            FOOD SVCS ASSISTANT  386
##  9: SUPPORTIVE SERVICES ASST, 4 HR  317
## 10:                  BUS ATTENDANT  295
## 11:          SCHOOL POLICE OFFICER  264
## 12: SUPPORTIVE SERVICES ASST, 3 HR  249
## 13:    SCHOOL COUNSELOR, 10 MONTHS  245
## 14:            CUSTODIAL ASSISTANT  241
## 15:                      PRINCIPAL  225
## 16:                    SECRETARY I  224
## 17:                   SCHOOL NURSE  184
## 18:        FOOD SVCS WORKER SENIOR  163
## 19:  STUDENT CLIMATE STAFF,5 HOURS  145
## 20:  BUS CHAUFFEUR PT (4-5HRS/DAY)  128

We can start to see many interesting facets of employment at SDP here: 264 school police officers, but only 184 school nurses. Ripe ground for exploring on your own!

I’ll continue focusing on teachers; it appears all but the special ed. teachers are grouped under TEACHER,FULL TIME, so we can subset:

teachers <- salaried[title_description == "TEACHER,FULL TIME"]
teachers
##          last_name first_name pay_rate_type pay_rate title_description
##    1:       ABBOTT      JOYCE      SALARIED    76461 TEACHER,FULL TIME
##    2:     ABDALLAH JUWAYRIYAH      SALARIED    46694 TEACHER,FULL TIME
##    3:  ABDEL-JALIL    GHADEER      SALARIED    45359 TEACHER,FULL TIME
##    4:  ABDUL BASIT    BARBARA      SALARIED    67706 TEACHER,FULL TIME
##    5: ABDUL-LATEEF     VILLIA      SALARIED    48945 TEACHER,FULL TIME
##   ---                                                                 
## 6648:      ZUNGOLO    WILLIAM      SALARIED    90051 TEACHER,FULL TIME
## 6649:        ZWICK      KEVIN      SALARIED    46694 TEACHER,FULL TIME
## 6650:       ZWOLAK       PAUL      SALARIED    54364 TEACHER,FULL TIME
## 6651:    ZYCHOWICZ       LISA      SALARIED    67706 TEACHER,FULL TIME
## 6652: ZYGMUND-FELT      ALLYN      SALARIED    90051 TEACHER,FULL TIME
##       home_organization home_organization_description organization_level
##    1:              1290       HAMILTON, ANDREW SCHOOL  ELEMENTARY SCHOOL
##    2:              1470           LOCKE, ALAIN SCHOOL  ELEMENTARY SCHOOL
##    3:              7440         TAYLOR, BAYARD SCHOOL  ELEMENTARY SCHOOL
##    4:              7530         ROWEN, WILLIAM SCHOOL  ELEMENTARY SCHOOL
##    5:              1010     BARTRAM, JOHN HIGH SCHOOL        HIGH SCHOOL
##   ---                                                                   
## 6648:              8390     FITZPATRICK, A. L. SCHOOL  ELEMENTARY SCHOOL
## 6649:              8300                MAYFAIR SCHOOL  ELEMENTARY SCHOOL
## 6650:              2290      FRANKLIN LEARNING CENTER        HIGH SCHOOL
## 6651:              8420       DECATUR, STEPHEN SCHOOL  ELEMENTARY SCHOOL
## 6652:              7100  COOKE, JAY ELEMENTARY SCHOOL  ELEMENTARY SCHOOL
##       type_of_representation gender run_date
##    1:            PFT-TEACHER      F 4/1/2016
##    2:            PFT-TEACHER      F 4/1/2016
##    3:            PFT-TEACHER      F 4/1/2016
##    4:            PFT-TEACHER      F 4/1/2016
##    5:            PFT-TEACHER      F 4/1/2016
##   ---                                       
## 6648:            PFT-TEACHER      M 4/1/2016
## 6649:            PFT-TEACHER      M 4/1/2016
## 6650:            PFT-TEACHER      M 4/1/2016
## 6651:            PFT-TEACHER      F 4/1/2016
## 6652:            PFT-TEACHER      F 4/1/2016

3.1.4 Group mean salaries – various cuts

Now we can explore data by group. How does pay differ by gender?

#simple to code!
teachers[ , mean(pay_rate), by = gender]
##    gender       V1
## 1:      F 68624.85
## 2:      M 67689.59
#the mean column automatically got named V1. If we
#  want friendlier output, we have to adjust a little:
teachers[ , .(avg_wage = mean(pay_rate)), by = gender]
##    gender avg_wage
## 1:      F 68624.85
## 2:      M 67689.59

So in teaching in Philly, women actually earn more than men on average (the difference is statistically significant; I’ll show you how to test that in a bit)!

(Of course men and women with the same experience and certification are paid basically the same – it’s written explicitly in the teachers’ contract – but we don’t have any experience or certification measures to control for this)
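As a preview of that significance test (covered properly later), here’s the shape of a two-sample t.test call. The salary vectors below are simulated, not the real SDP data:

```r
set.seed(1)
#hypothetical salaries for illustration only
women <- rnorm(100, mean = 68625, sd = 15000)
men   <- rnorm(100, mean = 67690, sd = 15000)
pval <- t.test(women, men)$p.value
pval  #a p-value between 0 and 1
```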

How does pay differ by organization level?

teachers[ , .(avg_wage = mean(pay_rate)), by = organization_level]
##                organization_level avg_wage
## 1:              ELEMENTARY SCHOOL 67639.16
## 2:                    HIGH SCHOOL 69527.33
## 3:                  MIDDLE SCHOOL 69481.66
## 4: CAREER AND TECHNICAL HIGH SCHL 71931.52
## 5:                EARLY CHILDHOOD 74058.15
## 6:    TRANSITION / OVERAGE SCHOOL 63434.37
## 7:          ADMINISTRATIVE OFFICE 66255.22
## 8:      NON ADMINISTRATIVE OFFICE 82723.82
## 9:                        ACADEMY 67789.00

Which school has the best-paid teachers?

teachers[ , .(avg_wage = mean(pay_rate)),
         by = home_organization_description
          ][order(-avg_wage)]
##      home_organization_description avg_wage
##   1:       LOGAN SCHOOL HEAD START 90051.00
##   2:   COOK-WISSAHICKON HEAD START 90051.00
##   3:       SHAWMONT BRIGHT FUTURES 90051.00
##   4:        MUNOZ-MARIN HEAD START 90051.00
##   5:          SHARSWOOD HEAD START 90051.00
##  ---                                       
## 287:        DUNBAR, PAUL L. SCHOOL 56805.95
## 288:        SOUTH PHILA HEAD START 51113.00
## 289:             BLAINE HEAD START 51113.00
## 290:            BETHUNE HEAD START 51113.00
## 291:            MIFFLIN HEAD START     0.00

This looks fishy! It’s rare for any average worth knowing to come out exactly to an integer. I suspect there are very few teachers at these schools. How can we check?

#remember our handy friend .N!!
teachers[ , .(.N, avg_wage = mean(pay_rate)),
          by = home_organization_description
          ][order(-avg_wage)]
##      home_organization_description  N avg_wage
##   1:       LOGAN SCHOOL HEAD START  1 90051.00
##   2:   COOK-WISSAHICKON HEAD START  1 90051.00
##   3:       SHAWMONT BRIGHT FUTURES  1 90051.00
##   4:        MUNOZ-MARIN HEAD START  1 90051.00
##   5:          SHARSWOOD HEAD START  1 90051.00
##  ---                                          
## 287:        DUNBAR, PAUL L. SCHOOL 21 56805.95
## 288:        SOUTH PHILA HEAD START  1 51113.00
## 289:             BLAINE HEAD START  1 51113.00
## 290:            BETHUNE HEAD START  1 51113.00
## 291:            MIFFLIN HEAD START  1     0.00

In fact, almost all of the “best-” and “worst-paid” schools are outliers because there’s only one teacher listed. So that’s not really a fair comparison.

So what can we do to improve our measure?

What’s typically done is to exclude all schools that don’t pass some minimum cutoff number of employees. For this exercise, let’s say we want to exclude all schools that have fewer than 10 full-time teachers in the data.

The way to do so is pretty natural. Remember that when we use by, we’re essentially splitting our main data.table up into 291 smaller data.tables, one for each school. We want to keep only those sub-data.tables that have at least 10 teachers – i.e., they have at least 10 rows, i.e., .N is at least 10:

#we no longer keep .N as an output -- just for focus,
#  since after filtering, we know all remaining schools
#  have at least ten teachers
teachers[ , if (.N >= 10) .(avg_wage = mean(pay_rate)),
          by = home_organization_description
          ][order(-avg_wage)]
##      home_organization_description avg_wage
##   1:           NON-PUBLIC PROGRAMS 82723.82
##   2:       SOUTH PHILADELPHIA H.S. 78295.27
##   3:  GIRLS, PHILA HIGH SCHOOL FOR 77710.83
##   4:          POWEL, SAMUEL SCHOOL 77700.50
##   5:       PENN TREATY HIGH SCHOOL 76291.82
##  ---                                       
## 208:               KENSINGTON CAPA 59277.92
## 209:        HUEY, SAMUEL B. SCHOOL 58582.12
## 210:      PEIRCE, THOMAS M. SCHOOL 58114.21
## 211: CLEMENTE, ROBERTO MIDDLE SCHL 57635.85
## 212:        DUNBAR, PAUL L. SCHOOL 56805.95

Here we see our first if statement in R. The basic form of an if statement in R is:

#Note the parentheses around the logical!! Very easy to forget
if (some_logical){
  # what to do if true
} else {
  # what to do if false
}

It’s important to note that some_logical has to be either TRUE or FALSE – it can’t be a vector!! This sounds easy now but it’s guaranteed to trip you up at some point.
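When your condition is accidentally a vector, the usual fix is to collapse it with any or all first – a toy sketch:

```r
x <- c(1, 5, 10)
#if (x > 3) would fail: the condition has length 3, not 1
if (any(x > 3)) {
  message("at least one element of x exceeds 3")
}
if (all(x > 3)) {
  message("this never prints: not every element exceeds 3")
}
```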

There’s no else clause in the condition we used above. So, for all the schools where there are fewer than 10 teachers, nothing happens (technically, the statement returns NULL – but don’t worry about that for now), and data.table drops any such sub-data.table from the result.
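You can verify the NULL behavior of an else-less if directly:

```r
#an if with no else returns NULL when the condition is FALSE
res <- if (FALSE) "this never runs"
is.null(res)
## [1] TRUE
```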

I don’t really know what NON-PUBLIC PROGRAMS means… but besides that, we see a lot of familiar high school names. Of course, all teachers in Philly are on the same teachers’ contract and hence the same pay scale – so what this exercise has really told us is that South Philadelphia High has the most experienced and/or certified teachers, and that teachers at Dunbar are young and/or uncertified.

We could go on exploring / slicing&dicing this data all day, but I think we’ve learned enough essentials of manipulation/inspection for now. We’ll pick up more tricks as we move along.


3.2 Merging

Recall our teacher attendance data:

attendance
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430

We couldn’t really tell anything about the schools because we didn’t know what the school codes meant.

Well, it appears we’ve found the Rosetta Stone in the teacher salary data. The home_organization field looks very much like the ULCS_NO field in the attendance data.

Obviously, it’d be ideal to know for sure that they represent the same ID for each school. In lieu of that, we can bolster our confidence in the alignment by comparing the fields a bit. Here are some techniques:

#We know from attendance there are 212 schools.
#  How many unique schools are represented in the teacher data?
teachers[ , uniqueN(home_organization)]
## [1] 291
#How many of the ULCS_NO values are also found in home_organization?
#  intersect finds the middle of the Venn diagram for
#  its first and second arguments
length(intersect(attendance$ULCS_NO, teachers[ , unique(home_organization)]))
## [1] 211

That’s pretty convincing if you ask me.
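On toy vectors, intersect’s behavior is easy to see (made-up codes in the style of the real IDs):

```r
a <- c("1010", "2290", "8300")
b <- c("2290", "8300", "9EW0")
intersect(a, b)
## [1] "2290" "8300"
```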

So, how do we match the two data sets? In Stata, depending on which data set we currently had in memory, we would perform either a one-to-many or a many-to-one merge.

In R, merging either way is a cinch. But first, we need a few more tools.

3.2.1 := - Adding a column to a data.table

When we want to add a new column to an existing data.table, we use the := operator in j (the second argument). As a quick aside, this is done by reference, which means it’s memory-efficient and fast; see this Q&A for more details.

Let’s try it out with two examples. Right now, in attendance, teacher attendance rates are listed as a percentage (out of 100); sometimes, it’s more convenient to have percentages as a proportion (out of one).

Also, we might like to know the relative attendance rates – the ratio of attendance at a given school to average attendance.

In teachers, gender is stored as a character – either M or F. But it’s often quite convenient to have binary variables stored as logicals, e.g., for subsetting.

#add attendance as a proportion
attendance[ , attendance_proportion := SCH_TEACHER_ATTEND / 100]

#add relative attendance
attendance[ , attendance_relative := SCH_TEACHER_ATTEND / SDP_TEACHER_ATTEND]

#a quirk of data.table is that after using
#  :=, we sometimes have to force printing,
#  which we can do by appending [] to the end
attendance[]
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
#add logical version of gender
teachers[ , male := gender == "M"]

#now, subset to show only male teachers
teachers[(male)]
##        last_name first_name pay_rate_type pay_rate title_description
##    1: ABDULALEEM   MUHAMMAD      SALARIED    65121 TEACHER,FULL TIME
##    2:   ABDULLAH    RASHEED      SALARIED    83382 TEACHER,FULL TIME
##    3:   ACKERMAN       CARL      SALARIED    83382 TEACHER,FULL TIME
##    4:      ADAIR    DARRELL      SALARIED    66461 TEACHER,FULL TIME
##    5:      ADAIR     JOSEPH      SALARIED    51113 TEACHER,FULL TIME
##   ---                                                               
## 1822:     ZUCKER    JEFFREY      SALARIED    73453 TEACHER,FULL TIME
## 1823:     ZUFOLO      BRIAN      SALARIED    57451 TEACHER,FULL TIME
## 1824:    ZUNGOLO    WILLIAM      SALARIED    90051 TEACHER,FULL TIME
## 1825:      ZWICK      KEVIN      SALARIED    46694 TEACHER,FULL TIME
## 1826:     ZWOLAK       PAUL      SALARIED    54364 TEACHER,FULL TIME
##       home_organization  home_organization_description
##    1:              6090 RANDOLPH TECHNICAL HIGH SCHOOL
##    2:              4220        BLAINE, JAMES G. SCHOOL
##    3:              2670       CONSTITUTION HIGH SCHOOL
##    4:              2010 FRANKLIN, BENJAMIN HIGH SCHOOL
##    5:              5400                RICHMOND SCHOOL
##   ---                                                 
## 1822:              7310       FELTONVILLE INTERMEDIATE
## 1823:              5330      HUNTER, WILLIAM H. SCHOOL
## 1824:              8390      FITZPATRICK, A. L. SCHOOL
## 1825:              8300                 MAYFAIR SCHOOL
## 1826:              2290       FRANKLIN LEARNING CENTER
##                   organization_level type_of_representation gender
##    1: CAREER AND TECHNICAL HIGH SCHL            PFT-TEACHER      M
##    2:              ELEMENTARY SCHOOL            PFT-TEACHER      M
##    3:                    HIGH SCHOOL            PFT-TEACHER      M
##    4:                    HIGH SCHOOL            PFT-TEACHER      M
##    5:              ELEMENTARY SCHOOL            PFT-TEACHER      M
##   ---                                                             
## 1822:              ELEMENTARY SCHOOL            PFT-TEACHER      M
## 1823:              ELEMENTARY SCHOOL            PFT-TEACHER      M
## 1824:              ELEMENTARY SCHOOL            PFT-TEACHER      M
## 1825:              ELEMENTARY SCHOOL            PFT-TEACHER      M
## 1826:                    HIGH SCHOOL            PFT-TEACHER      M
##       run_date male
##    1: 4/1/2016 TRUE
##    2: 4/1/2016 TRUE
##    3: 4/1/2016 TRUE
##    4: 4/1/2016 TRUE
##    5: 4/1/2016 TRUE
##   ---              
## 1822: 4/1/2016 TRUE
## 1823: 4/1/2016 TRUE
## 1824: 4/1/2016 TRUE
## 1825: 4/1/2016 TRUE
## 1826: 4/1/2016 TRUE

3.2.2 Merge and update

Now we’re equipped to understand the syntax for a merge. Let’s add the school’s name (home_organization_description in teachers) to the attendance data.

First, note that teachers has more data in it than we need. There are many teachers in each school, meaning the mapping between home_organization code and school name is repeated many times (once for each teacher in a given school). This presents an issue when we try to merge the teachers data into attendance – for each ULCS_NO that gets matched to a home_organization in teachers, there will be many rows, even though they all carry the same school-level information. To deal with this, we’ll use unique to eliminate the duplicates. We’ll go step-by-step first, then all at once:

#this will take the FIRST row associated
#  with each school. This is important in
#  general, but not here since we only
#  care about the school-level data for now,
#  which is the same in all rows.
schools <- unique(teachers, by = "home_organization")
schools
##         last_name first_name pay_rate_type pay_rate title_description
##   1:       ABBOTT      JOYCE      SALARIED    76461 TEACHER,FULL TIME
##   2:     ABDALLAH JUWAYRIYAH      SALARIED    46694 TEACHER,FULL TIME
##   3:  ABDEL-JALIL    GHADEER      SALARIED    45359 TEACHER,FULL TIME
##   4:  ABDUL BASIT    BARBARA      SALARIED    67706 TEACHER,FULL TIME
##   5: ABDUL-LATEEF     VILLIA      SALARIED    48945 TEACHER,FULL TIME
##  ---                                                                 
## 287:     VALENTIN   MADELINE      SALARIED    65121 TEACHER,FULL TIME
## 288:   WERTZ YORI    DESIREE      SALARIED    76461 TEACHER,FULL TIME
## 289:     WIDMAIER      DONNA      SALARIED    76461 TEACHER,FULL TIME
## 290:     WILLIAMS     TAMMAR      SALARIED    76461 TEACHER,FULL TIME
## 291:       WRIGHT        AMY      SALARIED    67789 TEACHER,FULL TIME
##      home_organization home_organization_description organization_level
##   1:              1290       HAMILTON, ANDREW SCHOOL  ELEMENTARY SCHOOL
##   2:              1470           LOCKE, ALAIN SCHOOL  ELEMENTARY SCHOOL
##   3:              7440         TAYLOR, BAYARD SCHOOL  ELEMENTARY SCHOOL
##   4:              7530         ROWEN, WILLIAM SCHOOL  ELEMENTARY SCHOOL
##   5:              1010     BARTRAM, JOHN HIGH SCHOOL        HIGH SCHOOL
##  ---                                                                   
## 287:              7262     ELLWOOD SCHOOL HEAD START    EARLY CHILDHOOD
## 288:              8352    SPRUANCE SCHOOL HEAD START    EARLY CHILDHOOD
## 289:              6202       DAY, ANNA B. HEAD START    EARLY CHILDHOOD
## 290:              1027         LEA SCHOOL HEAD START    EARLY CHILDHOOD
## 291:              2242              BREGY HEAD START    EARLY CHILDHOOD
##      type_of_representation gender run_date  male
##   1:            PFT-TEACHER      F 4/1/2016 FALSE
##   2:            PFT-TEACHER      F 4/1/2016 FALSE
##   3:            PFT-TEACHER      F 4/1/2016 FALSE
##   4:            PFT-TEACHER      F 4/1/2016 FALSE
##   5:            PFT-TEACHER      F 4/1/2016 FALSE
##  ---                                             
## 287:             PFT- PRE K      F 4/1/2016 FALSE
## 288:             PFT- PRE K      F 4/1/2016 FALSE
## 289:             PFT- PRE K      F 4/1/2016 FALSE
## 290:             PFT- PRE K      F 4/1/2016 FALSE
## 291:             PFT- PRE K      F 4/1/2016 FALSE
#now merge, using := to add the school name column
attendance[schools, school_name :=
             home_organization_description, 
           on = c(ULCS_NO = "home_organization")]
attendance
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
##                         school_name
##   1:       SPRUANCE, GILBERT SCHOOL
##   2:        HOWE, JULIA WARD SCHOOL
##   3:     WILLARD, FRANCES E. SCHOOL
##   4: LABRUM,GEN HARRY MIDDLE SCHOOL
##   5:     LINGELBACH, ANNA L. SCHOOL
##  ---                               
## 208:     KEARNY, GEN. PHILIP SCHOOL
## 209:    HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211:      ROBESON, PAUL HIGH SCHOOL
## 212:          HESTON, EDWARD SCHOOL

The general form of the syntax we just used is X[Y, variable_in_x := variable_in_y, on = c(a = "b")]. Let’s break that down a bit:

  • X and Y are data.tables. Their rows are matched according to alignment in the variables specified by the on argument, inside the square brackets ([]) used to access X.
  • on = c(a = "b") tells R that rows of X should be matched & associated with rows of Y whenever X’s column a (both items on the left of =) has the same value as Y’s column b.
  • We add the variable on the left of := to data.table X. The variable on the right of := is typically found in Y – though it could also be an expression involving columns of both X and Y.
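To make the pattern concrete, here’s a minimal, self-contained sketch with invented toy tables (X, Y, and all column names below are made up purely for illustration):

```r
library(data.table)

#two toy tables (invented data, purely for illustration)
X <- data.table(a = c(1, 2, 3), val_x = c("p", "q", "r"))
Y <- data.table(b = c(2, 3), val_y = c("hello", "world"))

#for each row of X whose a matches some b in Y,
#  copy Y's val_y into a new column of X;
#  unmatched rows of X (here, a == 1) get NA
X[Y, new_col := val_y, on = c(a = "b")]
X
```

Note that X is modified in place – no assignment with <- is needed to keep the new column.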

Let’s explore a few other ways we could have gone about doing this merge. First, let’s remove the column that we just created so that we’re really starting from scratch. The way to remove columns from a data.table is to set them to NULL, which is very fast.

attendance[ , school_name := NULL][]
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
#alternative 1: skip defining schools
#  extra tidbit: sometimes, both the 
#  main table (x) and the merging table
#  (in the first argument, i -- we called
#   it Y above) have columns with the same
#  name. To overcome the ambiguity this 
#  engenders, we can prepend i. to the
#  column names in teachers to tell R
#  that we're referring _explicitly_
#  to that column in teachers. Similarly,
#  we can prepend x. to tell R that we're
#  referring to the column in attendance
attendance[unique(teachers, by = "home_organization"),
           school_name := i.home_organization_description,
           on = c(ULCS_NO = "home_organization")][]
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
##                         school_name
##   1:       SPRUANCE, GILBERT SCHOOL
##   2:        HOWE, JULIA WARD SCHOOL
##   3:     WILLARD, FRANCES E. SCHOOL
##   4: LABRUM,GEN HARRY MIDDLE SCHOOL
##   5:     LINGELBACH, ANNA L. SCHOOL
##  ---                               
## 208:     KEARNY, GEN. PHILIP SCHOOL
## 209:    HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211:      ROBESON, PAUL HIGH SCHOOL
## 212:          HESTON, EDWARD SCHOOL
#reset again
attendance[ , school_name := NULL]

#alternative 2: use unique within the merge
#  by = .EACHI is a syntax unique to merging.
#  Using this tells R to do the thing in j
#  within each group that's matched.
#  So, here, each ULCS_NO gets matched to many
#  home_organization rows in teachers.
#  within this group, we use unique to
#  get the school's name, since unique turns the
#  vector of school names (one per teacher in
#  each school that's matched) into a single value.
attendance[teachers, 
           school_name := unique(home_organization_description),
           on = c(ULCS_NO = "home_organization"), by = .EACHI][]
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
##                         school_name
##   1:       SPRUANCE, GILBERT SCHOOL
##   2:        HOWE, JULIA WARD SCHOOL
##   3:     WILLARD, FRANCES E. SCHOOL
##   4: LABRUM,GEN HARRY MIDDLE SCHOOL
##   5:     LINGELBACH, ANNA L. SCHOOL
##  ---                               
## 208:     KEARNY, GEN. PHILIP SCHOOL
## 209:    HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211:      ROBESON, PAUL HIGH SCHOOL
## 212:          HESTON, EDWARD SCHOOL
#reset again
attendance[ , school_name := NULL]

#alternative 3: like 2, except use [] instead of unique.
#  This alternative works, and is faster, but is less robust
#  to finding mistakes in your data. Thankfully, this data is
#  pretty clean, but we often find messy data in the wild --
#  maybe there was a typo in the organization code or school name,
#  and instead of each home_organization being matched to
#  exactly one home_organization_description, we might find
#  it matched to several (e.g., "PS 131" and "PS #131").
#  If we use the unique approach from alternative 2,
#  we'll get an error, which tells us we need to examine our data
#  more closely. This approach, by contrast, silently takes the
#  first match and will never surface such an error.
attendance[teachers,
           school_name := home_organization_description[1],
           on = c(ULCS_NO = "home_organization"), by = .EACHI][]
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
##                         school_name
##   1:       SPRUANCE, GILBERT SCHOOL
##   2:        HOWE, JULIA WARD SCHOOL
##   3:     WILLARD, FRANCES E. SCHOOL
##   4: LABRUM,GEN HARRY MIDDLE SCHOOL
##   5:     LINGELBACH, ANNA L. SCHOOL
##  ---                               
## 208:     KEARNY, GEN. PHILIP SCHOOL
## 209:    HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211:      ROBESON, PAUL HIGH SCHOOL
## 212:          HESTON, EDWARD SCHOOL

Which approach suits you is largely a matter of taste – each is well-suited to certain situations.

3.2.3 Missing Data

No data set is perfect. In fact, most data sets are nightmares. Missing data is pervasive, so it stands to reason that R is well-equipped to handle it.

All missing data is coded in R as NA – provided, that is, that we’ve been careful when loading the data into R, usually by telling R what missing data looks like in a given file. Take fread, for example: it has an argument called na.strings that tells R which strings in a data file should be treated as missing (i.e., read as NA). We haven’t needed it yet because the two data sets we’ve loaded have been complete/balanced.
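As a quick, self-contained illustration (the tiny file below is invented; note the "s" coding for missing values, a convention we’ll meet again shortly in real data):

```r
library(data.table)

#write a tiny file where missing values are coded as "s"
tmp <- tempfile(fileext = ".csv")
writeLines(c("school,count", "A,10", "B,s"), tmp)

#without na.strings, the "s" forces the whole count column to be character
str(fread(tmp))

#telling fread what missingness looks like yields a numeric column with NA
str(fread(tmp, na.strings = "s"))
```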

We now have some missing data after our merge, however. To see this, note the following:

#is.na returns a logical vector saying
#  whether each element of school_name
#  is missing (NA) or not
attendance[is.na(school_name)]
##    ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1:    2140   2013-2014               95.7               93.4       214
##    attendance_proportion attendance_relative school_name
## 1:                 0.957            1.024625          NA
#NOTE: TESTING NA WITH EQUALITY DOESN'T WORK
attendance[school_name == NA]
## Empty data.table (0 rows) of 8 cols: ULCS_NO,SCHOOL_YEAR,SCH_TEACHER_ATTEND,SDP_TEACHER_ATTEND,SCHOOL_ID,attendance_proportion...
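Why does == fail here? Comparisons involving NA return NA rather than TRUE or FALSE (R can’t know whether two unknown values are equal), and subsetting keeps only rows where the condition is TRUE:

```r
NA == NA            #NA, not TRUE
c(1, NA, 3) == NA   #NA NA NA -- never TRUE, so the subset comes back empty
is.na(c(1, NA, 3))  #FALSE TRUE FALSE -- is.na is the right tool
```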

So what’s up with ULCS_NO 2140? Why does this school have no teachers in the salary data?

It turns out the school is in the salary data, but none of its teachers are listed as TEACHER,FULL TIME:

#remember, salaries is the full set
#  of ALL employees of SDP -- we
#  didn't remove any observations from here
#(using the nrows argument for print to condense output)
print(salaries[home_organization == "2140"], nrows = 10)
##        last_name  first_name pay_rate_type pay_rate
##  1:        ABNEY       GILDA      SALARIED    79586
##  2:       AIKENS       SARAH      SALARIED    48110
##  3:         AKES      RONALD      SALARIED    17963
##  4: AMIT-CUBBAGE       DONNA      SALARIED    84880
##  5:       BALLEW       MEGAN      SALARIED    52530
## ---                                                
## 79:      TARANTA CHRISTOPHER      SALARIED    84880
## 80:       TAYLOR   ELIZABETH      SALARIED    84880
## 81:        VECSI       JANEL      SALARIED    77961
## 82: VOLIN AVELIN   ALEXANDRA      SALARIED    77961
## 83:     WILLIAMS      LESLIE      SALARIED    34164
##               title_description home_organization
##  1: SCHOOL COUNSELOR, 10 MONTHS              2140
##  2:        FOOD SVCS MANAGER II              2140
##  3:         FOOD SVCS WORKER II              2140
##  4:       TEACHER,DEMONSTRATION              2140
##  5:       TEACHER,DEMONSTRATION              2140
## ---                                              
## 79:       TEACHER,DEMONSTRATION              2140
## 80:       TEACHER,DEMONSTRATION              2140
## 81:       TEACHER,DEMONSTRATION              2140
## 82:       TEACHER,DEMONSTRATION              2140
## 83:         CUSTODIAL ASSISTANT              2140
##      home_organization_description organization_level
##  1: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
##  2: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
##  3: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
##  4: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
##  5: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
## ---                                                  
## 79: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
## 80: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
## 81: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
## 82: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
## 83: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
##     type_of_representation gender run_date
##  1:            PFT-TEACHER      F 4/1/2016
##  2:            PFT-F/S MGR      F 4/1/2016
##  3:              LOCAL 634      M 4/1/2016
##  4:            PFT-TEACHER      F 4/1/2016
##  5:            PFT-TEACHER      F 4/1/2016
## ---                                       
## 79:            PFT-TEACHER      M 4/1/2016
## 80:            PFT-TEACHER      F 4/1/2016
## 81:            PFT-TEACHER      F 4/1/2016
## 82:            PFT-TEACHER      F 4/1/2016
## 83:             LOCAL 1201      F 4/1/2016
#since they're not full-time teachers, what are they?
salaries[home_organization == "2140",
         table(title_description)]
## title_description
##                ASST PRINCIPAL   BUILDING ENGINEER-GROUP III 
##                             1                             1 
##           CUSTODIAL ASSISTANT           FOOD SVCS ASSISTANT 
##                             2                             2 
##          FOOD SVCS MANAGER II      FOOD SVCS UTILITY WORKER 
##                             1                             1 
##            FOOD SVCS WORKER I           FOOD SVCS WORKER II 
##                             1                             1 
##      GENERAL CLEANER, 8 HOURS   ONE TO ONE ASST, SPECIAL ED 
##                             3                             1 
##                     PRINCIPAL   SCHOOL COUNSELOR, 10 MONTHS 
##                             1                             3 
##                  SCHOOL NURSE     SCHOOL OPERATIONS OFFICER 
##                             1                             1 
##                   SECRETARY I STUDENT CLIMATE STAFF,3 HOURS 
##                             3                             2 
## STUDENT CLIMATE STAFF,4 HOURS         TEACHER,DEMONSTRATION 
##                             3                            54 
## TEACHER,DEMONSTRATION,SPEC ED 
##                             1

That’s right – at Masterman, none of the teachers appear to be classified as such. They’re classified for wage purposes as TEACHER,DEMONSTRATION. I don’t really know what this means… I was only able to find a small tidbit on the SDP website:

Demonstration Teachers are unique educators. They utilize the most current educational techniques and instructional media, which contribute to effective teaching practices in their specific teaching areas. They are observed at times by personnel from within the School District as well as visitors from other areas and institutions, both national and international.

This appears to be something basically exclusive to Masterman, with a fair presence at John Hancock as well:

salaries[title_description == "TEACHER,DEMONSTRATION",
         .N, by = home_organization_description]
##     home_organization_description  N
## 1:           HANCOCK, JOHN SCHOOL 22
## 2:        PENN TREATY HIGH SCHOOL  1
## 3: MASTERMAN,JULIA R. HIGH SCHOOL 54
## 4:        HOWE, JULIA WARD SCHOOL  1
## 5:          FRANKFORD HIGH SCHOOL  1
## 6: LABRUM,GEN HARRY MIDDLE SCHOOL  3
## 7: MASTBAUM, JULES E. HIGH SCHOOL  1

So, it turns out we’d be better off just using salaries for the merge to define the school’s name:

attendance[salaries, 
           school_name := unique(home_organization_description),
           on = c(ULCS_NO = "home_organization"), by = .EACHI]
attendance[is.na(school_name)] #fixed!
## Empty data.table (0 rows) of 8 cols: ULCS_NO,SCHOOL_YEAR,SCH_TEACHER_ATTEND,SDP_TEACHER_ATTEND,SCHOOL_ID,attendance_proportion...

Now we can repeat our earlier exercise and reveal the names of the best- and worst-attended schools in Philly. While we’re at it, we can introduce column subsetting. To focus attention on only a few columns, we simply name them in j and wrap them in .(). Only the columns we name will show up in the output (so it takes up less room and we can focus immediately on the most important information).

attendance[SCH_TEACHER_ATTEND < 90 | 
             SCH_TEACHER_ATTEND > 97
           ][order(-SCH_TEACHER_ATTEND), 
             .(ULCS_NO, school_name, SCH_TEACHER_ATTEND)]
##     ULCS_NO                    school_name SCH_TEACHER_ATTEND
##  1:    1190         MOTIVATION HIGH SCHOOL               98.1
##  2:    6040    SAUL, WALTER B. HIGH SCHOOL               97.6
##  3:    8560            THE WORKSHOP SCHOOL               97.6
##  4:    5340        LUDLOW, JAMES R. SCHOOL               97.2
##  5:    7350        LOWELL, JAMES R. SCHOOL               89.9
##  6:    7510    BETHUNE, MARY MCLEOD SCHOOL               89.8
##  7:    8240       DISSTON, HAMILTON SCHOOL               89.7
##  8:    7290       STEARNE, ALLEN M. SCHOOL               89.7
##  9:    5430          AMY 5 AT JAMES MARTIN               89.6
## 10:    7220       CARNELL, LAURA H. SCHOOL               89.0
## 11:    6360    ROOSEVELT ELEMENTARY SCHOOL               89.0
## 12:    1330         HUEY, SAMUEL B. SCHOOL               88.8
## 13:    4240 CASSIDY,LEWIS C ACADEMICS PLUS               88.7
## 14:    1130           TILDEN MIDDLE SCHOOL               88.5
## 15:    1340                  LEA, HENRY C.               88.4
## 16:    6260       HOUSTON, HENRY H. SCHOOL               88.0
## 17:    2480      ARTHUR, CHESTER A. SCHOOL               87.8
## 18:    6340     PENNELL, JOSEPH ELEMENTARY               86.9
## 19:    5370            MOFFET, JOHN SCHOOL               86.5
## 20:    4570   MEADE, GEN. GEORGE G. SCHOOL               86.4
## 21:    4350       RHODES ELEMENTARY SCHOOL               85.3
##     ULCS_NO                    school_name SCH_TEACHER_ATTEND

How fitting! Motivation High School appears to value teacher attendance very highly.

3.3 Non-.csv Data, Part I: Excel Files

Often, the world is cruel and we are unable to find the data that we want to use for our project in a standard character-separated format. When this situation arises, we have two options.

The first is to simply convert that data into our favorite format and treat the new file as the master data file, never looking back.

There are three major problems with this approach:

  1. Processing time. It can be time-consuming to convert many files; this is a fixed cost, but often our time is better spent analyzing the data than toying with it and tinkering with the parameters of a data converter.
  2. Data fidelity. All data converters have their flaws. Encoding issues, unexpected data types, you name it. It’s almost impossible to tell when, in the process of converting your data, some observation’s entries have been mistakenly changed without your knowing.
  3. Software licenses. Some files are stored in program-specific binary file formats (SAS, Stata, even R all have specific file types). This is intended to make it really easy to work from session to session within that program, and re-loading a data environment is fast and simple. Moving to machines that don’t have that software installed, however, can be nightmarish.

My preferred approach is to use R itself to read in files of all formats. Users of R have encountered all of our problems before, and many kind souls out there in the ether have been generous enough to donate their tailor-made solutions to the R community.

Excel files are an example of one such foreign format. As you can imagine, Excel files are ubiquitous. It’s no surprise, then, that CRAN is plastered with different packages designed to load and interact with data created and stored in Excel formats. See this Q&A for the gamut of options.

Today we’ll cover just two options, since I have found them to be quite effective and have never needed any of the others. The first is the package readxl. It is designed to be streamlined and fast. It has few bells and whistles and is not very flexible. To handle more fringe/out-of-the-ordinary data files, there is xlsx. The main advantage I’ve found for xlsx is that it’s better at properly reading fields formatted as dates, which is a complicated facet of how .xlsx files are stored.

On our own machines, we would install them like we would any other package:

#we can install multiple packages at once by
#  passing a character vector to install.packages
install.packages(c("xlsx", "readxl"))

3.3.1 School Demographic Data

Philly’s ODI also releases some Excel files summarizing some basic demographic information about each school. This data, as usual already on your machine, was downloaded from here.

R doesn’t have any easy facility to preview the data in an Excel file, so we’ll just pause from R for now and check it out in Excel.

Excuse us for a moment, R.

<<R sips coffee>>

As we can see, the data is a tad messy, but R has the strength to deal with it! Let’s go!

library(readxl)
#As we saw, this file has many sheets. the read_excel function
#  has an argument sheet which we use to specify the sheet we want
#Also, as we saw, there's a bunch of extraneous rows at the top of the file.
#  We use the skip argument to ignore them. But in so doing, we necessarily
#  lose some information (namely, which columns correspond to Female and
#  which correspond to Male).
#To overcome this, we're going to supply
#  the column names ourselves. This is also favorable since the column
#  headers in the raw data file are plagued by spaces -- which makes
#  them quite hard to deal with once they're brought into R.
school_gender <-
  read_excel("./data/2015-2016 Enr Dem (Suppressed).xls", 
             sheet = "Gender", skip = 6,
             #enter a blank for the first column, which
             #  will be dropped anyway
             col_names = c("", "school_id", "school_name", "grade",
                           "total_enrolled", "count_female",
                           "pct_female", "count_male", "pct_male"))
## Warning in read_xls(path, sheet, col_names, col_types, na, skip): Found
## blank columns: 1

3.3.1.1 setDT

Now that we’ve ventured outside of the data.table world, we’ll have to deal with a minor annoyance – data.frames.

data.frames are R’s native data management class. I find them a pain to deal with; for example, printing a data.frame prints the entire table, instead of truncating all but the beginning and end of the object – a useless overload of information. Also, most of the handy syntax we’re starting to get used to from data.table is inaccessible/doesn’t work in data.frames, and data.frame syntax is ugly by comparison.

Luckily for us, data.table has a perfect tool for people like us that are lost in the data.frame wilderness: setDT!

setDT will do its best to convert non-data.table objects to data.tables by reference, which means no copy is made. This is probably meaningless to you now, but just know that if you have a huge data set (millions of observations), making a copy of your data is something you should try to avoid at all costs, as it’s very time- and memory-consuming.

class(school_gender)
## [1] "tbl_df"     "tbl"        "data.frame"
#(usually, I just wrap this
#  around read_excel to do it all at once)
setDT(school_gender)
class(school_gender)
## [1] "data.table" "data.frame"

3.3.1.2 Forcing numeric: lapply(.SD, ...)

print(school_gender, class = TRUE)
##       school_id                  school_name      grade total_enrolled
##           <num>                       <char>     <char>          <num>
##    1:       101     John Bartram High School   9.000000            200
##    2:       101     John Bartram High School  10.000000            197
##    3:       101     John Bartram High School  11.000000            172
##    4:       101     John Bartram High School  12.000000            141
##   ---                                                                 
## 1662:       878 Philadelphia Virtual Academy   9.000000             46
## 1663:       878 Philadelphia Virtual Academy  10.000000             69
## 1664:       878 Philadelphia Virtual Academy  11.000000             79
## 1665:       878 Philadelphia Virtual Academy  12.000000             64
## 1666:       878 Philadelphia Virtual Academy ALL GRADES            311
##       count_female pct_female count_male pct_male
##             <char>     <char>     <char>   <char>
##    1:    70.000000   0.350000 130.000000 0.650000
##    2:    86.000000   0.436548 111.000000 0.563452
##    3:    78.000000   0.453488  94.000000 0.546512
##    4:    63.000000   0.446809  78.000000 0.553191
##   ---                                            
## 1662:            s          s          s        s
## 1663:    43.000000   0.623188  26.000000 0.376812
## 1664:    47.000000   0.594937  32.000000 0.405063
## 1665:    41.000000   0.640625  23.000000 0.359375
## 1666:   191.000000   0.614148 120.000000 0.385852

We’ve got some issues to deal with.

  1. There’s too much data here (for what we’ve got in mind) – we’re only interested in the school-wide demographics for now. So we’ve got to cut all the rows where grade is not ALL GRADES (we know how to do this already)
  2. A couple of columns that are clearly numeric – count_female, pct_female, count_male, pct_male – are currently being stored as character. This is because some data has been suppressed, probably for privacy reasons when the sample size is too small (e.g., at Philadelphia Virtual Academy, grade 9 appears to have only 46 students, which would mean roughly 23 students of each gender – it’s common to hide data that comes from such small samples for privacy). We could have used the na argument in read_excel (setting it to "s" instead of the default ""), but cleaning up ex-post allows us to show some new tricks!

#subsetting
school_gender <- school_gender[grade == "ALL GRADES"]

#forcing numeric -- note the warning telling us 
#  there is some data loss because R doesn't know
#  how to convert "s" to a number, so it forces it to be NA
school_gender[ , count_female := as.numeric(count_female)]
## Warning: NAs introduced by coercion
#repeat for the other columns
school_gender[ , pct_female := as.numeric(pct_female)]
## Warning: NAs introduced by coercion
school_gender[ , count_male := as.numeric(count_male)]
## Warning: NAs introduced by coercion
school_gender[ , pct_male := as.numeric(pct_male)]
## Warning: NAs introduced by coercion

It’s kind of a pain to run basically the same line of code over and over again, isn’t it?

Don’t worry, R has a simple way to automate this process for us, of course!

We’ll need two more bits of knowledge about data.table before we proceed.

First, there’s another special data.table variable, like .N, that is useful to know. .SD (stands for Subset of the Data), like .N, is two things, depending on whether we are doing a grouped operation using by. If we’re not using by, .SD is just our whole data.table. If we are using by, there is a different .SD for every group (recall the idea of splitting the data.table into one group for each level of the by variable – each of the split sub-tables has its own .SD). Watch:

school_gender[ , .SD]
##      school_id                                 school_name      grade
##   1:       101                    John Bartram High School ALL GRADES
##   2:       102               West Philadelphia High School ALL GRADES
##   3:       103                   High School of the Future ALL GRADES
##   4:       105 Paul Robeson High School for Human Services ALL GRADES
##   5:       110                William L. Sayre High School ALL GRADES
##  ---                                                                 
## 214:       842                      Stephen Decatur School ALL GRADES
## 215:       843                     Joseph Greenberg School ALL GRADES
## 216:       844                   William H. Loesche School ALL GRADES
## 217:       856                         The Workshop School ALL GRADES
## 218:       878                Philadelphia Virtual Academy ALL GRADES
##      total_enrolled count_female pct_female count_male pct_male
##   1:            710          297   0.418310        413 0.581690
##   2:            538          242   0.449814        296 0.550186
##   3:            518          243   0.469112        275 0.530888
##   4:            294          169   0.574830        125 0.425170
##   5:            511          231   0.452055        280 0.547945
##  ---                                                           
## 214:           1030          478   0.464078        552 0.535922
## 215:            738          368   0.498645        370 0.501355
## 216:            782          368   0.470588        414 0.529412
## 217:            207           79   0.381643        128 0.618357
## 218:            311          191   0.614148        120 0.385852

.SD is itself a data.table. That means we can use [] on it to treat it just like any other data.table:

#the same as school_gender[1:10]
school_gender[ , .SD[1:10] ]
##     school_id                                 school_name      grade
##  1:       101                    John Bartram High School ALL GRADES
##  2:       102               West Philadelphia High School ALL GRADES
##  3:       103                   High School of the Future ALL GRADES
##  4:       105 Paul Robeson High School for Human Services ALL GRADES
##  5:       110                William L. Sayre High School ALL GRADES
##  6:       113                    William T. Tilden School ALL GRADES
##  7:       119                      Motivation High School ALL GRADES
##  8:       120                           John Barry School ALL GRADES
##  9:       123                    William C. Bryant School ALL GRADES
## 10:       125                  Joseph W. Catharine School ALL GRADES
##     total_enrolled count_female pct_female count_male pct_male
##  1:            710          297   0.418310        413 0.581690
##  2:            538          242   0.449814        296 0.550186
##  3:            518          243   0.469112        275 0.530888
##  4:            294          169   0.574830        125 0.425170
##  5:            511          231   0.452055        280 0.547945
##  6:            470          210   0.446809        260 0.553191
##  7:            344          183   0.531977        161 0.468023
##  8:            757          395   0.521797        362 0.478203
##  9:            478          234   0.489540        244 0.510460
## 10:            619          304   0.491115        315 0.508885
#get the entire first row associated with each school_id
school_gender[ , .SD[1], by = school_id]
##      school_id                                 school_name      grade
##   1:       101                    John Bartram High School ALL GRADES
##   2:       102               West Philadelphia High School ALL GRADES
##   3:       103                   High School of the Future ALL GRADES
##   4:       105 Paul Robeson High School for Human Services ALL GRADES
##   5:       110                William L. Sayre High School ALL GRADES
##  ---                                                                 
## 214:       842                      Stephen Decatur School ALL GRADES
## 215:       843                     Joseph Greenberg School ALL GRADES
## 216:       844                   William H. Loesche School ALL GRADES
## 217:       856                         The Workshop School ALL GRADES
## 218:       878                Philadelphia Virtual Academy ALL GRADES
##      total_enrolled count_female pct_female count_male pct_male
##   1:            710          297   0.418310        413 0.581690
##   2:            538          242   0.449814        296 0.550186
##   3:            518          243   0.469112        275 0.530888
##   4:            294          169   0.574830        125 0.425170
##   5:            511          231   0.452055        280 0.547945
##  ---                                                           
## 214:           1030          478   0.464078        552 0.535922
## 215:            738          368   0.498645        370 0.501355
## 216:            782          368   0.470588        414 0.529412
## 217:            207           79   0.381643        128 0.618357
## 218:            311          191   0.614148        120 0.385852

This is mainly useful in conjunction with another argument we can supply inside [] of a data.table: .SDcols. .SDcols tells R which columns .SD should include in the current operation:

school_gender[ , .SD, .SDcols = c("school_id", "grade", "total_enrolled")]
##      school_id      grade total_enrolled
##   1:       101 ALL GRADES            710
##   2:       102 ALL GRADES            538
##   3:       103 ALL GRADES            518
##   4:       105 ALL GRADES            294
##   5:       110 ALL GRADES            511
##  ---                                    
## 214:       842 ALL GRADES           1030
## 215:       843 ALL GRADES            738
## 216:       844 ALL GRADES            782
## 217:       856 ALL GRADES            207
## 218:       878 ALL GRADES            311

This combination is exactly what we need for our as.numeric problem, where we want to apply the same function to many columns. Recall from earlier that the *apply functions are our friends when it comes to doing repetitive things. When we do lapply(DT, sum), we apply sum to every column of DT. For as.numeric, it’s much the same:

#Troublesome. We don't want to convert
#  most of the columns, and we go too far here
school_gender[ , lapply(.SD, as.numeric)]
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion

## Warning in lapply(.SD, as.numeric): NAs introduced by coercion
##      school_id school_name grade total_enrolled count_female pct_female
##   1:       101          NA    NA            710          297   0.418310
##   2:       102          NA    NA            538          242   0.449814
##   3:       103          NA    NA            518          243   0.469112
##   4:       105          NA    NA            294          169   0.574830
##   5:       110          NA    NA            511          231   0.452055
##  ---                                                                   
## 214:       842          NA    NA           1030          478   0.464078
## 215:       843          NA    NA            738          368   0.498645
## 216:       844          NA    NA            782          368   0.470588
## 217:       856          NA    NA            207           79   0.381643
## 218:       878          NA    NA            311          191   0.614148
##      count_male pct_male
##   1:        413 0.581690
##   2:        296 0.550186
##   3:        275 0.530888
##   4:        125 0.425170
##   5:        280 0.547945
##  ---                    
## 214:        552 0.535922
## 215:        370 0.501355
## 216:        414 0.529412
## 217:        128 0.618357
## 218:        120 0.385852
#Use .SDcols to control which columns we affect
school_gender[ , lapply(.SD, as.numeric), 
               .SDcols = c("count_female", "pct_female", 
                           "count_male", "pct_male")]
##      count_female pct_female count_male pct_male
##   1:          297   0.418310        413 0.581690
##   2:          242   0.449814        296 0.550186
##   3:          243   0.469112        275 0.530888
##   4:          169   0.574830        125 0.425170
##   5:          231   0.452055        280 0.547945
##  ---                                            
## 214:          478   0.464078        552 0.535922
## 215:          368   0.498645        370 0.501355
## 216:          368   0.470588        414 0.529412
## 217:           79   0.381643        128 0.618357
## 218:          191   0.614148        120 0.385852
#Great, we've converted the columns to numeric, right?
#  No, not quite. We have done the calculation,
#  but we haven't _assigned_ the output to anything.
#  To do that, we need := again.

num_cols <- c("count_female", "pct_female",
              "count_male", "pct_male")
#**NOTE** we have to surround num_cols on the left with
#  parentheses so that data.table knows we're not
#  trying to create a column called num_cols.
school_gender[ , (num_cols) := lapply(.SD, as.numeric),
               .SDcols = num_cols]
print(school_gender, class = TRUE)
##      school_id                                 school_name      grade
##          <num>                                      <char>     <char>
##   1:       101                    John Bartram High School ALL GRADES
##   2:       102               West Philadelphia High School ALL GRADES
##   3:       103                   High School of the Future ALL GRADES
##   4:       105 Paul Robeson High School for Human Services ALL GRADES
##  ---                                                                 
## 214:       842                      Stephen Decatur School ALL GRADES
## 215:       843                     Joseph Greenberg School ALL GRADES
## 216:       844                   William H. Loesche School ALL GRADES
## 217:       856                         The Workshop School ALL GRADES
## 218:       878                Philadelphia Virtual Academy ALL GRADES
##      total_enrolled count_female pct_female count_male pct_male
##               <num>        <num>      <num>      <num>    <num>
##   1:            710          297   0.418310        413 0.581690
##   2:            538          242   0.449814        296 0.550186
##   3:            518          243   0.469112        275 0.530888
##   4:            294          169   0.574830        125 0.425170
##  ---                                                           
## 214:           1030          478   0.464078        552 0.535922
## 215:            738          368   0.498645        370 0.501355
## 216:            782          368   0.470588        414 0.529412
## 217:            207           79   0.381643        128 0.618357
## 218:            311          191   0.614148        120 0.385852
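The (cols) := lapply(.SD, fun) pattern is worth internalizing. For comparison, here is the same many-column conversion sketched in base R on a toy data frame (hypothetical names and values):

```r
# Toy data frame with number-like character columns (hypothetical)
df <- data.frame(id = 1:3,
                 count_a = c("10", "20", "s"),
                 count_b = c("5", "15", "25"),
                 stringsAsFactors = FALSE)

num_cols <- c("count_a", "count_b")

# Convert the selected columns in one pass; "s" becomes NA
# (warnings suppressed for brevity)
df[num_cols] <- suppressWarnings(lapply(df[num_cols], as.numeric))

sapply(df[num_cols], class)
```

The base R version has to re-assign into df explicitly; data.table’s := does the equivalent update in place.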

3.3.1.3 Merging, and Repeating

Now, we can add this data to our teacher data file from earlier. How? A merge of course!

First, unfortunately, the school codes in school_gender are missing the trailing 0 that the home_organization codes in teachers have, so we have to fix this before merging.

school_gender[ , school_id := paste0(school_id, "0")]

teachers[school_gender, 
         school_pct_male := i.pct_male, 
         on = c(home_organization = "school_id")]
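The update-join syntax (x[i, new_col := i.col, on = ...]) is specific to data.table; for intuition, here is the same lookup mimicked in base R with match() on toy tables (hypothetical names and values):

```r
# Toy lookup table and staff table (hypothetical values)
schools <- data.frame(school_id = c("1010", "1020"),
                      pct_male  = c(0.58, 0.55),
                      stringsAsFactors = FALSE)
staff <- data.frame(home_organization = c("1020", "1010", "1020"),
                    stringsAsFactors = FALSE)

# match() finds, for each staff row, the matching row of schools;
# indexing pct_male by that carries the school value over per teacher
staff$school_pct_male <-
  schools$pct_male[match(staff$home_organization, schools$school_id)]

staff$school_pct_male  # 0.55 0.58 0.55
```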

The school demographics Excel file also contains data on ELL status, IEP status, ethnicity, and economic disadvantage. The approach to adding this is basically the same, so let’s do it all at once.

#For conciseness, we'll do all of the data manipulation we did
#  in addition to reading in the gender sheet all at once here
school_ell <- 
  setDT(read_excel("./data/2015-2016 Enr Dem (Suppressed).xls", 
                   sheet = "ELL", skip = 6,
                   col_names = c("", "school_id", "school_name", "grade",
                                 "total_enrolled", "count_ell", "pct_ell", 
                                 "count_non_ell", "pct_non_ell"))
        )[grade == "ALL GRADES"
          #we're only going to keep pct_ell, so only convert that;
          #  also convert school code
          ][ , c("school_id", "pct_ell") := 
               .(paste0(school_id, "0"), as.numeric(pct_ell))]
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning in read_xls(path, sheet, col_names, col_types, na, skip): Found
## blank columns: 1
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning: NAs introduced by coercion
#note: ELL status is commonly missing
school_ell[is.na(pct_ell), .N]
## [1] 104
#merge
teachers[school_ell, 
         school_pct_ell := i.pct_ell,
         on = c(home_organization = "school_id")]

##NOW: IEP
## Note: since all we're doing is selecting a column, we don't
##   even have to store the Excel sheet as a new data set --
##   we can pass the intermediate result directly to the
##   merge with teachers, like so:
teachers[setDT(read_excel(
  "./data/2015-2016 Enr Dem (Suppressed).xls", 
  sheet = "IEP", skip = 6,
  col_names = c("", "school_id", "school_name", "grade",
                "total_enrolled", "count_iep", "pct_iep", 
                "count_non_iep", "pct_non_iep"))
        )[grade == "ALL GRADES"
          ][ , c("school_id", "pct_iep") := 
               .(paste0(school_id, "0"), as.numeric(pct_iep))],
  school_pct_iep := i.pct_iep, 
  on = c(home_organization = "school_id")]
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning in read_xls(path, sheet, col_names, col_types, na, skip): Found
## blank columns: 1
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning: NAs introduced by coercion
##NOW: Ethnicity
##  Ethnicity is not binary, so things will be a little bit more involved.
##  we'll use the `:=`() form of assignment here. Basically, we can
##  treat := like a function (surrounded by backticks ``), since, 
##  technically, it IS a function
teachers[setDT(read_excel(
  "./data/2015-2016 Enr Dem (Suppressed).xls", 
  sheet = "Ethnicity", skip = 6,
  col_names = c("", "school_id", "school_name", 
                "grade", "total_enrolled", 
                "count_amerindian", "pct_amerindian", 
                "count_asian", "pct_asian", 
                "count_black", "pct_black",
                "count_hispanic", "pct_hispanic", 
                "count_multi", "pct_multi", 
                "count_pacific", "pct_pacific",
                "count_white", "pct_white"))
        )[grade == "ALL GRADES"
          ][ , c("school_id",
                 paste0("pct_", c("asian", "black", 
                                  "hispanic", "white"))) := 
               #we could do this with lapply, but it looks ugly...
               .(paste0(school_id, "0"), as.numeric(pct_asian),
                 as.numeric(pct_black), as.numeric(pct_hispanic),
                 as.numeric(pct_white))],
  `:=`(school_pct_asian = i.pct_asian,
       school_pct_black = i.pct_black,
       school_pct_hispanic = i.pct_hispanic,
       school_pct_white = i.pct_white),
  on = c(home_organization = "school_id")]
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning in read_xls(path, sheet, col_names, col_types, na, skip): Found
## blank columns: 1
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion
#Whew, what a mess! If you prefer, of course, you
#  can always do this step by step.

#Lastly, economically disadvantaged.
#  this sheet is formatted a bit differently,
#  so our merge must adapt accordingly.
#  In particular, there are more blank columns.
teachers[setDT(read_excel(
  "./data/2015-2016 Enr Dem (Suppressed).xls", 
  sheet = "Econ Disadv", skip = 6,
  col_names = c("", "school_id", "school_name", 
                "grade", "pct_disadv", rep("", 4)))
        )[ , school_id := paste0(school_id, "0")],
  school_pct_disadv := 100 * i.pct_disadv,
  on = c(home_organization = "school_id")]
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning in read_xls(path, sheet, col_names, col_types, na, skip): Found
## blank columns: 1,6,7,8,9
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
teachers
##          last_name first_name pay_rate_type pay_rate title_description
##    1:       ABBOTT      JOYCE      SALARIED    76461 TEACHER,FULL TIME
##    2:     ABDALLAH JUWAYRIYAH      SALARIED    46694 TEACHER,FULL TIME
##    3:  ABDEL-JALIL    GHADEER      SALARIED    45359 TEACHER,FULL TIME
##    4:  ABDUL BASIT    BARBARA      SALARIED    67706 TEACHER,FULL TIME
##    5: ABDUL-LATEEF     VILLIA      SALARIED    48945 TEACHER,FULL TIME
##   ---                                                                 
## 6648:      ZUNGOLO    WILLIAM      SALARIED    90051 TEACHER,FULL TIME
## 6649:        ZWICK      KEVIN      SALARIED    46694 TEACHER,FULL TIME
## 6650:       ZWOLAK       PAUL      SALARIED    54364 TEACHER,FULL TIME
## 6651:    ZYCHOWICZ       LISA      SALARIED    67706 TEACHER,FULL TIME
## 6652: ZYGMUND-FELT      ALLYN      SALARIED    90051 TEACHER,FULL TIME
##       home_organization home_organization_description organization_level
##    1:              1290       HAMILTON, ANDREW SCHOOL  ELEMENTARY SCHOOL
##    2:              1470           LOCKE, ALAIN SCHOOL  ELEMENTARY SCHOOL
##    3:              7440         TAYLOR, BAYARD SCHOOL  ELEMENTARY SCHOOL
##    4:              7530         ROWEN, WILLIAM SCHOOL  ELEMENTARY SCHOOL
##    5:              1010     BARTRAM, JOHN HIGH SCHOOL        HIGH SCHOOL
##   ---                                                                   
## 6648:              8390     FITZPATRICK, A. L. SCHOOL  ELEMENTARY SCHOOL
## 6649:              8300                MAYFAIR SCHOOL  ELEMENTARY SCHOOL
## 6650:              2290      FRANKLIN LEARNING CENTER        HIGH SCHOOL
## 6651:              8420       DECATUR, STEPHEN SCHOOL  ELEMENTARY SCHOOL
## 6652:              7100  COOKE, JAY ELEMENTARY SCHOOL  ELEMENTARY SCHOOL
##       type_of_representation gender run_date  male school_pct_male
##    1:            PFT-TEACHER      F 4/1/2016 FALSE        0.518072
##    2:            PFT-TEACHER      F 4/1/2016 FALSE        0.502146
##    3:            PFT-TEACHER      F 4/1/2016 FALSE        0.464029
##    4:            PFT-TEACHER      F 4/1/2016 FALSE        0.502868
##    5:            PFT-TEACHER      F 4/1/2016 FALSE        0.581690
##   ---                                                             
## 6648:            PFT-TEACHER      M 4/1/2016  TRUE        0.522134
## 6649:            PFT-TEACHER      M 4/1/2016  TRUE        0.523733
## 6650:            PFT-TEACHER      M 4/1/2016  TRUE        0.403207
## 6651:            PFT-TEACHER      F 4/1/2016 FALSE        0.535922
## 6652:            PFT-TEACHER      F 4/1/2016 FALSE        0.516899
##       school_pct_ell school_pct_iep school_pct_asian school_pct_black
##    1:             NA       0.185886               NA         0.893287
##    2:       0.077253       0.111588         0.053648         0.796137
##    3:       0.210432       0.088129               NA         0.176259
##    4:             NA       0.124283               NA         0.927342
##    5:       0.067606       0.263380               NA         0.929577
##   ---                                                                
## 6648:       0.061294       0.145289         0.044268         0.187287
## 6649:       0.140788       0.098150         0.211585         0.132743
## 6650:       0.159221       0.061856         0.114548         0.505155
## 6651:       0.054369       0.127184         0.053398         0.153398
## 6652:       0.091451       0.121272               NA         0.819085
##       school_pct_hispanic school_pct_white school_pct_disadv
##    1:                  NA               NA         100.00000
##    2:                  NA               NA         100.00000
##    3:            0.726619               NA         100.00000
##    4:                  NA               NA         100.00000
##    5:                  NA               NA         100.00000
##   ---                                                       
## 6648:            0.107832         0.547106          77.72985
## 6649:            0.209171         0.348351          91.26307
## 6650:            0.230241         0.119129          89.62199
## 6651:            0.093204         0.577670          65.70874
## 6652:            0.083499               NA         100.00000

3.4 Regressions

OK, enough data manipulation for now. Let’s get to some actual analysis, for cryin’ out loud!

We of course don’t have any experimental or quasi-experimental data here, so we’ll only be able to tease out correlations from our data. But we do have enough data to illustrate the basics of typical regression techniques, so here we go!

3.4.1 OLS

Most linear regressions in R are handled by the lm function (stands for linear model).

Let’s run a very simple regression – the most basic possible test of the gender pay gap. Like any of the other typical functions we’ve been using, we run lm in j of our data.table.

##no frills regression
teachers[ , lm(pay_rate ~ gender)]
## 
## Call:
## lm(formula = pay_rate ~ gender)
## 
## Coefficients:
## (Intercept)      genderM  
##     68624.9       -935.3

The basic output of lm tells us 1) the regression we ran and 2) the point estimates of the coefficients. This is nice, but I find it rather uninformative – crucially, the standard errors are missing. But fear not! We’re only scratching the surface of what lm gives us.

#now we assign the output of lm so
#  we can explore it in a bit more detail
gender_reg <- teachers[ , lm(pay_rate ~ gender)]
str(gender_reg)
## List of 13
##  $ coefficients : Named num [1:2] 68625 -935
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "genderM"
##  $ residuals    : Named num [1:6652] 7836 -21931 -23266 -919 -19680 ...
##   ..- attr(*, "names")= chr [1:6652] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:6652] -5576090 -34041 -23198 -851 -19612 ...
##   ..- attr(*, "names")= chr [1:6652] "(Intercept)" "genderM" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:6652] 68625 68625 68625 68625 68625 ...
##   ..- attr(*, "names")= chr [1:6652] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:6652, 1:2] -81.5598 0.0123 0.0123 0.0123 0.0123 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:6652] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "genderM"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   .. ..- attr(*, "contrasts")=List of 1
##   .. .. ..$ gender: chr "contr.treatment"
##   ..$ qraux: num [1:2] 1.01 1.01
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 6650
##  $ contrasts    :List of 1
##   ..$ gender: chr "contr.treatment"
##  $ xlevels      :List of 1
##   ..$ gender: chr [1:2] "F" "M"
##  $ call         : language lm(formula = pay_rate ~ gender)
##  $ terms        :Classes 'terms', 'formula' length 3 pay_rate ~ gender
##   .. ..- attr(*, "variables")= language list(pay_rate, gender)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "pay_rate" "gender"
##   .. .. .. ..$ : chr "gender"
##   .. ..- attr(*, "term.labels")= chr "gender"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: 0x4549aa8> 
##   .. ..- attr(*, "predvars")= language list(pay_rate, gender)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "character"
##   .. .. ..- attr(*, "names")= chr [1:2] "pay_rate" "gender"
##  $ model        :'data.frame':   6652 obs. of  2 variables:
##   ..$ pay_rate: int [1:6652] 76461 46694 45359 67706 48945 49615 65121 83382 75850 59532 ...
##   ..$ gender  : chr [1:6652] "F" "F" "F" "F" ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 pay_rate ~ gender
##   .. .. ..- attr(*, "variables")= language list(pay_rate, gender)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "pay_rate" "gender"
##   .. .. .. .. ..$ : chr "gender"
##   .. .. ..- attr(*, "term.labels")= chr "gender"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: 0x4549aa8> 
##   .. .. ..- attr(*, "predvars")= language list(pay_rate, gender)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "character"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "pay_rate" "gender"
##  - attr(*, "class")= chr "lm"

We can see from str that the output of lm is a list with a long set of components. The output is a bit of an information overload; we can home in on some important details by tinkering with our gender_reg object some more:

names(gender_reg)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"

Our gender_reg object is actually pretty complicated! Just by running lm, we’re given back all sorts of essential results about the relationship between pay_rate and gender:

  • coefficients are of course the coefficients of the regression. We can access them with gender_reg$coefficients.
  • residuals are of course the residuals – the actual minus the predicted value for each observation.
  • effects are “the uncorrelated single-degree-of-freedom values obtained by projecting the data onto the successive orthogonal subspaces generated by the QR decomposition during the fitting process.” (I have no idea what this means)
  • rank should be equal to the number of coefficients (including the intercept), unless there is a multicollinearity problem.
  • fitted.values are the y_hat, the predicted pay_rate produced by the model for each observation.
  • assign helps map the matrix of regressors to the formula (I’ve never used this)
  • qr is the QR decomposition matrix associated with the regression (I’ve never used this)
  • df.residual is the residual degrees of freedom (= n - k), where n is the number of observations and k is the number of estimated coefficients (including the intercept)
  • contrasts tells us which contrasts were used (contrasts are the schemes used to encode discrete variables as numeric regressors, e.g. contr.treatment)
  • xlevels gives the levels of the factors (discrete variables) that we included – for gender_reg, this is F and M, e.g.
  • call stores the regression we ran as a call to lm
  • terms gives a bunch of information about the variables we included in the regression – what they are, which are factors, which is the intercept, etc.
  • model gives the model frame – the data actually used in fitting (the design matrix itself comes from model.matrix)
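Since all of these components live in a list, we can pull them out with $ or, better, with the accessor functions R provides. Here’s a minimal sketch on simulated data (the toy data is made up; the same calls work on gender_reg):

```r
#simulate a small dataset mimicking the pay_rate ~ gender setup
set.seed(1)
toy <- data.frame(gender = rep(c("F", "M"), each = 50),
                  pay_rate = 60000 + rnorm(100, sd = 5000))
fit <- lm(pay_rate ~ gender, data = toy)

#two equivalent ways to get the coefficients;
#  the accessor function is generally preferred
fit$coefficients
coef(fit)

#residuals and fitted values also have accessors
all.equal(fit$residuals, resid(fit))
all.equal(fit$fitted.values, fitted(fit))

#rank is 2 here: one intercept + one genderM dummy
fit$rank
```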

It’s wonderful to know that R keeps track of all of this stuff for us, but what about the real meat of our regression – the stars! For this, we’ll use the summary function, which has a method for the output of lm that gives a nice, easily-digested summary of the model we ran:

summary(gender_reg)
## 
## Call:
## lm(formula = pay_rate ~ gender)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68625 -10239   1517   8771  22361 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  68624.9      203.6 337.040   <2e-16 ***
## genderM       -935.3      388.6  -2.407   0.0161 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14140 on 6650 degrees of freedom
## Multiple R-squared:  0.0008702,  Adjusted R-squared:  0.00072 
## F-statistic: 5.792 on 1 and 6650 DF,  p-value: 0.01613

This calls attention to some summary statistics about the residuals (the five-number summary, as well as the residual standard error) and the predictive power (multiple and adjusted R-squared, F statistic), but most importantly, it prints the standard error of each coefficient, and the associated two-sided t test results for each coefficient.

Here, we can see that men are paid on average significantly less (at alpha = .05) than are women among Philadelphia teachers, by about $1,000.
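If we want those numbers programmatically rather than just printed, the coefficient matrix inside the summary can be extracted with coef(summary(...)). A sketch on simulated data (the numbers are made up; the same indexing works on summary(gender_reg)):

```r
#simulate a gender pay gap of about -1000
set.seed(2)
toy <- data.frame(gender = rep(c("F", "M"), each = 50))
toy$pay_rate <- 68000 - 1000 * (toy$gender == "M") + rnorm(100, sd = 500)
fit <- lm(pay_rate ~ gender, data = toy)

#coef(summary(fit)) is a matrix with columns
#  Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))
coef_table["genderM", "Estimate"]    #the estimated gender gap
coef_table["genderM", "Pr(>|t|)"]    #its two-sided p-value
```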

We spent some time going over the details, so let’s summarize quickly with a comparison to Stata. Here’s the boiled-down version of getting a regression summary in R:

teachers[ , summary(lm(pay_rate ~ gender))]
## 
## Call:
## lm(formula = pay_rate ~ gender)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68625 -10239   1517   8771  22361 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  68624.9      203.6 337.040   <2e-16 ***
## genderM       -935.3      388.6  -2.407   0.0161 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14140 on 6650 degrees of freedom
## Multiple R-squared:  0.0008702,  Adjusted R-squared:  0.00072 
## F-statistic: 5.792 on 1 and 6650 DF,  p-value: 0.01613

In Stata, to do what we just did, one would simply run reg pay_rate gender. Hard to get simpler than that! My sales pitch in favor of R is only that it’s not that much more complicated, and as we saw above, R gives us at-hand access to a slew of summary statistics about the regression that are a bit less transparent to access in Stata.

3.4.1.1 Other covariates

It’s not a stretch to expand our regression to include other covariates to try and tame their influence. Perfect covariates here would be experience and certification, but sadly we’re not so lucky. But we do have a bunch of other school-level covariates we can include. Let’s try a few:

#skipping right to the summary
teachers[ , summary(lm(pay_rate ~ gender + school_pct_male + school_pct_black))]
## 
## Call:
## lm(formula = pay_rate ~ gender + school_pct_male + school_pct_black)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69929 -10025   1259   9614  24684 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       66136.6     1904.8  34.722  < 2e-16 ***
## genderM            -513.0      397.8  -1.290   0.1972    
## school_pct_male    8317.6     3626.9   2.293   0.0219 *  
## school_pct_black  -4014.2      588.1  -6.826 9.56e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14060 on 6312 degrees of freedom
##   (336 observations deleted due to missingness)
## Multiple R-squared:  0.008304,   Adjusted R-squared:  0.007833 
## F-statistic: 17.62 on 3 and 6312 DF,  p-value: 2.188e-11

Note that we lost a lot of observations because school_pct_male and school_pct_black are frequently missing. This was included in the summary output: (336 observations deleted due to missingness).

How do we interpret the results of this regression? The most naive possible conclusion is that mostly-black schools pay their teachers less – i.e., that there’s somehow a causal relationship between student makeup and the wages offered to teachers.

But we know that the teachers’ contract is fixed district-wide, so each school has no power whatsoever (well, not quite…) over what its teachers are paid. Instead, we’re witnessing the direct result of selection bias. We can get into this more later when we explore plotting.

3.4.1.2 Interactions

Stata handles interactions with the # operator (or ## to also include main effects). R, by contrast, uses * for main effects plus their interaction (and : for the interaction alone). So if we wanted to explore the interaction between gender and school_pct_white (i.e., to allow the slope of school_pct_white to differ by gender), we’d simply run:

#In Stata: reg pay_rate gender##c.school_pct_white
teachers[ , summary(lm(pay_rate ~ gender*school_pct_white))]
## 
## Call:
## lm(formula = pay_rate ~ gender * school_pct_white)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72526  -9521   3577   7368  21002 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               68943.1      453.4 152.044   <2e-16 ***
## genderM                     770.6      880.4   0.875   0.3814    
## school_pct_white           4297.2     1515.3   2.836   0.0046 ** 
## genderM:school_pct_white  -3852.0     3205.6  -1.202   0.2296    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13740 on 3330 degrees of freedom
##   (3318 observations deleted due to missingness)
## Multiple R-squared:  0.002439,   Adjusted R-squared:  0.001541 
## F-statistic: 2.714 on 3 and 3330 DF,  p-value: 0.04332

The output genderM:school_pct_white is the interaction term – it’s the difference in the slope of the linear relationship between pay and school whiteness for male versus female teachers. Specifically, the expected pay change for a female teacher from a one-percentage-point increase in the percentage of white students at her school is about $43 (we divide the coefficient by 100 since school_pct_white is measured as a proportion, not a percentage); for men, the expected pay change is about 43 - 39 = $4.

The interaction term is not significant, meaning we don’t have real evidence that the relationship between school whiteness and pay differs between male and female teachers.
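To be concrete about the formula shorthand: a*b expands to a + b + a:b, while a:b alone is just the interaction. A toy sketch (made-up data):

```r
set.seed(3)
d <- data.frame(y = rnorm(40),
                g = rep(c("F", "M"), 20),
                x = runif(40))

#g*x is shorthand for g + x + g:x
fit_star <- lm(y ~ g*x, data = d)
fit_long <- lm(y ~ g + x + g:x, data = d)
names(coef(fit_star))
all.equal(coef(fit_star), coef(fit_long))

#g:x alone drops the main effects
names(coef(lm(y ~ g:x, data = d)))
```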

3.4.1.3 Functions of variables

One of my big pet peeves in Stata is the inability to run something like reg y log(x) – we’re forced to first create a logx variable like gen logx = log(x). Now we’ve got two variables in memory carrying basically the same information! A waste of memory, in addition to being a pain to remember to do this all the time.

R does not suffer from this inanity. We can simply use functions within the formula specification:

#Let's try!
teachers[ , summary(lm(log(pay_rate) ~ gender))]
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): NA/NaN/Inf in 'y'
#Whoops. Turns out we still have some teachers
#  who are stored as having pay rate 0:
teachers[pay_rate == 0, .N]
## [1] 31
#So we need to exclude them because log(0) is -Inf:
teachers[pay_rate>0, summary(lm(log(pay_rate) ~ gender))]
## 
## Call:
## lm(formula = log(pay_rate) ~ gender)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76603 -0.12744  0.04124  0.14091  0.30451 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 11.121710   0.002995 3713.032  < 2e-16 ***
## genderM     -0.018088   0.005708   -3.169  0.00154 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2075 on 6619 degrees of freedom
## Multiple R-squared:  0.001515,   Adjusted R-squared:  0.001364 
## F-statistic: 10.04 on 1 and 6619 DF,  p-value: 0.001538

It’s a bit harder to interpret the coefficients in a regression with logs and discrete covariates, but the rule is still roughly that the coefficients are percentages. So men are paid about 2% less on average in Philly.
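For the record, the “coefficients are percentages” reading is an approximation; the exact effect of flipping a dummy in a log regression is exp(b) - 1. With the genderM coefficient from above:

```r
b <- -0.018088   #the genderM coefficient from the log regression

#rough reading: men paid about 1.8% less
b * 100

#exact: switching the dummy multiplies pay by exp(b),
#  i.e., changes it by exp(b) - 1
(exp(b) - 1) * 100   #close to the rough reading, since b is small
```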

3.4.1.4 More functions of variables

What if we wanted to control not for the black and Hispanic percentages separately, but for the combined percentage of black and Hispanic students at the school?

In Stata, we’d have to do something like gen black_hisp = school_pct_black + school_pct_hispanic. And from what we’ve seen so far, lm(pay_rate ~ school_pct_black + school_pct_hispanic) will give us separate coefficients for the black and Hispanic percentages.

To get variables like this (most commonly, involving simple arithmetic operations like + or *), there’s the special I function for regression formulas. Basically, we surround some operation done on our variables with I(), and whatever results from that will be considered as a single covariate. Watch:

teachers[ , summary(lm(pay_rate ~ gender + 
                         I(school_pct_black + school_pct_hispanic)))]
## 
## Call:
## lm(formula = pay_rate ~ gender + I(school_pct_black + school_pct_hispanic))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71223  -9740   1986   9112  23446 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## (Intercept)                                73646.2      602.5 122.243
## genderM                                     -278.9      458.3  -0.609
## I(school_pct_black + school_pct_hispanic)  -6876.7      859.7  -7.999
##                                           Pr(>|t|)    
## (Intercept)                                < 2e-16 ***
## genderM                                      0.543    
## I(school_pct_black + school_pct_hispanic) 1.59e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13770 on 4509 degrees of freedom
##   (2140 observations deleted due to missingness)
## Multiple R-squared:  0.01422,    Adjusted R-squared:  0.01378 
## F-statistic: 32.52 on 2 and 4509 DF,  p-value: 9.465e-15
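The reason I() is needed: inside a formula, + means “add another regressor”, not arithmetic addition. A toy comparison (made-up data):

```r
set.seed(5)
d <- data.frame(y = rnorm(30), x1 = runif(30), x2 = runif(30))

#without I(): x1 and x2 enter as two separate regressors
names(coef(lm(y ~ x1 + x2, data = d)))

#with I(): their sum is treated as a single regressor
names(coef(lm(y ~ I(x1 + x2), data = d)))
```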

3.4.2 Predictions from a model

One of the most common things to do with a fitted regression is to use it for in- or out-of-sample predictions. For this, R has the predict function, which takes as its first argument a fitted model object (e.g., the output of lm), and as its second argument a data.table of new values at which to predict.

Any prediction is model dependent; we’ll compare the predictions that come out of a few different relevant specifications.

#Model 1: Simplest linear specification
reg1 <- 
  teachers[ , lm(pay_rate ~ gender + school_pct_male +
                   school_pct_black + school_pct_disadv)]

#Model 2: Some interactions
reg2 <- 
    teachers[ , lm(pay_rate ~ gender*school_pct_disadv + 
                     school_pct_male + school_pct_black)]

#Model 3: Full interactions
reg3 <- 
    teachers[ , lm(pay_rate ~ gender*school_pct_disadv* 
                     school_pct_male*school_pct_black)]

Let’s try to predict the wages of a male teacher at a school with 70% male students, 40% black students, and 80% economically disadvantaged students, using each model.

#declare new data to use for prediction
predict_table <- data.table(gender = "M", school_pct_male = .7,
                            school_pct_black = .4,
                            school_pct_disadv = .8)

#prediction from Model 1
predict(reg1, predict_table)
##        1 
## 77028.89
#prediction from Model 2
predict(reg2, predict_table)
##       1 
## 76675.7
#prediction from Model 3
predict(reg3, predict_table)
##       1 
## 73401.6
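For a linear model, predict is nothing mysterious – it’s just the dot product of the coefficients with the new covariate values. A sketch verifying this on simulated data (toy names; the same check works with reg1 and predict_table):

```r
set.seed(6)
d <- data.frame(x = runif(50), z = runif(50))
d$y <- 1 + 2*d$x - d$z + rnorm(50, sd = .1)
fit <- lm(y ~ x + z, data = d)

newdat <- data.frame(x = .7, z = .4)

#predict() result...
p1 <- predict(fit, newdat)

#...matches multiplying out (1, x, z) with the coefficients
p2 <- sum(coef(fit) * c(1, .7, .4))
all.equal(unname(p1), p2)
```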

3.4.2.1 Many predictions

This will be more useful when we get to plotting, but often in more complicated models we want to trace out the marginal effect of a variable on our prediction, holding other regressors constant – is the relationship monotonic? Increasing?

For this, we simply provide more than one row in the predict_table that we supply, e.g.:

#explore the relationship between % male students
#  and teacher pay
predict_table <- 
  data.table(gender = "M", 
             school_pct_male =
               #only use 10 points for now for
               #  simplicity since we're
               #  not plotting
               seq(.5, 1, length.out = 10),
             school_pct_black = .4,
             school_pct_disadv = .8)

#Since all of these models are linear, 
#  we know for sure the relationship
#  is monotonic, and we can predict
#  the change from step to step because
#  we know the coefficient on school_pct_male
predict(reg1, predict_table)
##        1        2        3        4        5        6        7        8 
## 74441.11 75159.94 75878.77 76597.59 77316.42 78035.24 78754.07 79472.90 
##        9       10 
## 80191.72 80910.55

3.4.2.2 Prediction intervals

Point estimates are rarely the only thing we care about – as important are reasonable ranges in which our predictions might fall. We may predict wages of $75,000, but it would hardly call our model into question if we found actual wages of $75,001 for that person. Prediction intervals are a way of coupling our predictions with a range of estimates that are consistent with our model.

To this end, predict also has a few other arguments we can use to produce prediction intervals, e.g., interval and level, which specify the type of interval we want and the confidence level (1 - alpha) to use, respectively:

predict(reg1, predict_table,
        interval = "prediction", level = .95)
##         fit      lwr      upr
## 1  74441.11 46800.92 102081.3
## 2  75159.94 47509.83 102810.0
## 3  75878.77 48212.76 103544.8
## 4  76597.59 48909.71 104285.5
## 5  77316.42 49600.70 105032.1
## 6  78035.24 50285.75 105784.7
## 7  78754.07 50964.87 106543.3
## 8  79472.90 51638.10 107307.7
## 9  80191.72 52305.46 108078.0
## 10 80910.55 52966.98 108854.1

That the intervals are so wide shows just how poor our chosen covariates are at predicting teachers’ wages.
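Note that interval = "prediction" is not the only option – interval = "confidence" gives the (narrower) interval for the average outcome at those covariate values, rather than for a single new observation. A sketch on simulated data:

```r
set.seed(7)
d <- data.frame(x = runif(100))
d$y <- 2 + 3*d$x + rnorm(100)
fit <- lm(y ~ x, data = d)
nd <- data.frame(x = .5)

conf_i <- predict(fit, nd, interval = "confidence", level = .95)
pred_i <- predict(fit, nd, interval = "prediction", level = .95)

#both return fit/lwr/upr; the prediction interval is
#  always wider, since it adds the residual variance
conf_i
pred_i
(pred_i[ , "upr"] - pred_i[ , "lwr"]) > (conf_i[ , "upr"] - conf_i[ , "lwr"])
```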

3.4.3 Probit/Logit

The next most common form of regression analysis is logit/probit, for when we’re interested in understanding predictors of a binary dependent variable. I’ve never heard a compelling empirical reason for using one over the other; logistic is far more common, probably because it provides odds ratios, which are infinitely more interpretable than are the plain coefficients of either logit or probit models. But the difference in R between running logit and probit is trivial, so we’ll cover both here.

To avoid simply re-hashing what we did in the previous section and inducing a room more rife with glaze than Krispy Kreme, let’s focus on a new variable for our left-hand side.

Noticing that most schools are 100% economically disadvantaged, I wonder, are there certain characteristics which help predict that a teacher is at a school with less than full “poverty”?

We’ll define “less than full poverty” as school_pct_disadv < 100, which is true for teachers[school_pct_disadv < 100, .N] = 1666 teachers (and unique(teachers, by = "home_organization")[school_pct_disadv < 100, .N] = 52 schools).

The workhorse of binary-dependent-variable models in R is glm. This stands for generalized linear models. Generalized linear models subsume OLS and incorporate several regression models which can be related to OLS by a “link function”, including logit, probit and Poisson regression. See Wikipedia for more.

glm works basically like lm – its first argument is specified in exactly the same way (specifically, with an object of class formula), but we also have to include a second argument – family. This specifies to glm the error distribution (and associated link function) that you’ve got in mind for your regression. There is a multitude of options – Gamma, inverse.gaussian, poisson, quasipoisson, etc. (see ?family), but the most common and important is binomial.

Technically, what we pass to the family argument is a function – meaning most importantly that it itself accepts arguments! Specifically, the binomial function accepts one important argument, link. By default, this is "logit", meaning that the default for family = binomial is to run a logistic regression. There are two other options – link = "probit" (probit regression) and link = "cauchit" (which uses a Cauchy CDF, if you know what that is).
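To make this concrete – calling binomial() returns a “family” object whose pieces (including the link and its inverse) glm uses internally:

```r
#binomial() is a function; calling it returns a family object
fam_logit  <- binomial()                 #default link
fam_probit <- binomial(link = "probit")

fam_logit$link
fam_probit$link

#the family carries the inverse link, too; for logit,
#  an index of 0 maps to a probability of 0.5
fam_logit$linkinv(0)
```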

Let’s see this in action and run our first logit and probits:

#We could define our own "poverty" variable,
#  but instead we'll illustrate again another
#  way to use the I function in formulas
logit <- teachers[ , glm(I(school_pct_disadv < 100) ~ 
                           pay_rate + gender + school_pct_male, 
                         family = binomial)]

summary(logit)
## 
## Call:
## glm(formula = I(school_pct_disadv < 100) ~ pay_rate + gender + 
##     school_pct_male, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5236  -0.7773  -0.6549   0.5844   2.3622  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.072e+00  3.590e-01   8.556  < 2e-16 ***
## pay_rate         1.922e-05  2.240e-06   8.579  < 2e-16 ***
## genderM          3.649e-01  6.554e-02   5.568 2.58e-08 ***
## school_pct_male -1.080e+01  6.429e-01 -16.795  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7216.1  on 6363  degrees of freedom
## Residual deviance: 6793.8  on 6360  degrees of freedom
##   (288 observations deleted due to missingness)
## AIC: 6801.8
## 
## Number of Fisher Scoring iterations: 4
#run a probit instead by specifying the link
#  argument in the binomial function
probit <- teachers[ , glm(I(school_pct_disadv < 100) ~ 
                            pay_rate + gender + school_pct_male, 
                          family = binomial(link = "probit"))]


summary(probit)                  
## 
## Call:
## glm(formula = I(school_pct_disadv < 100) ~ pay_rate + gender + 
##     school_pct_male, family = binomial(link = "probit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4922  -0.7824  -0.6595   0.5984   2.4259  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.809e+00  2.082e-01   8.686  < 2e-16 ***
## pay_rate         1.104e-05  1.287e-06   8.579  < 2e-16 ***
## genderM          2.212e-01  3.866e-02   5.721 1.06e-08 ***
## school_pct_male -6.357e+00  3.705e-01 -17.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7216.1  on 6363  degrees of freedom
## Residual deviance: 6800.8  on 6360  degrees of freedom
##   (288 observations deleted due to missingness)
## AIC: 6808.8
## 
## Number of Fisher Scoring iterations: 4

We can’t compare coefficients between the models, except for their signs. The signs and significance both agree (as we would hope), and again seem to be telling us stuff we may have surmised – high-poverty schools have lower-wage teachers, have fewer male teachers, and have more male students.
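The coefficients differ mainly by a scale factor, since the logistic distribution has a larger standard deviation than the standard normal; the logit slope typically comes out around 1.6–1.8 times the probit slope. A sketch on simulated data:

```r
#simulate binary data from a logit model and fit both links
set.seed(9)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

lg <- glm(y ~ x, family = binomial)
pr <- glm(y ~ x, family = binomial(link = "probit"))

#the slope ratio reflects the two latent scales
coef(lg)["x"] / coef(pr)["x"]
```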

To compare the models, we can look at the predicted probability of being at a low-poverty school for a typical teacher. We’ll define a typical teacher as having the median pay rate; we’ll plug the overall share of male teachers into the gender term; and we’ll plug in the average percentage of male students.

I’m unaware of a way to tell predict to plug a fraction (like the share of male teachers) into a factor variable such as gender, so we have to get our predictions by hand. Luckily this isn’t so bad:

#here is the "typical teacher"; we
#  include 1 first to stand for the
#  intercept term
typical <- teachers[!is.na(school_pct_male),
                    c(1, median(pay_rate), mean(gender == "M"),
                      mean(school_pct_male))]

#prediction from the logit model; recall that 
#  the interpretation of logit is that:
#  P[y = 1 | X] = F(X'B)
#  where X are our predictors, F is the link
#  function CDF, and B is the coefficient vector

#In the case of logistic regression, F is the
#  logistic CDF:
plogis(sum(logit$coefficients * typical))
## [1] 0.2446419
#In the case of probit, F is the normal CDF:
pnorm(sum(probit$coefficients * typical))
## [1] 0.2480797
#For comparison, the overall average in the sample is:
teachers[ , mean(school_pct_disadv < 100, na.rm = TRUE)]
## [1] 0.2598253
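Incidentally, when the new data holds actual factor values (rather than a fraction like mean(gender == "M")), predict will apply the inverse link for us via type = "response". A sketch on simulated data showing it agrees with the hand computation:

```r
set.seed(10)
d <- data.frame(g = rep(c("F", "M"), 100), x = rnorm(200))
d$y <- rbinom(200, 1, plogis(-1 + .5 * (d$g == "M") + d$x))
fit <- glm(y ~ g + x, family = binomial, data = d)

nd <- data.frame(g = "M", x = 0)

#predict with type = "response" applies the inverse link...
p1 <- predict(fit, nd, type = "response")

#...and agrees with doing it by hand with plogis;
#  c(1, 1, 0) is (intercept, gM dummy, x)
p2 <- plogis(sum(coef(fit) * c(1, 1, 0)))
all.equal(unname(p1), p2)
```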

3.5 Reshaping

Let’s read the full demographic file on student gender again. This is just a copy-paste of what we did above.

school_gender <-
  setDT(read_excel("./data/2015-2016 Enr Dem (Suppressed).xls", 
                   sheet = "Gender", skip = 6,
                   col_names = c("", "school_id", "school_name", "grade",
                                 "total_enrolled", "count_female",
                                 "pct_female", "count_male", "pct_male")))
school_gender
##       school_id                  school_name      grade total_enrolled
##    1:       101     John Bartram High School   9.000000            200
##    2:       101     John Bartram High School  10.000000            197
##    3:       101     John Bartram High School  11.000000            172
##    4:       101     John Bartram High School  12.000000            141
##    5:       101     John Bartram High School ALL GRADES            710
##   ---                                                                 
## 1662:       878 Philadelphia Virtual Academy   9.000000             46
## 1663:       878 Philadelphia Virtual Academy  10.000000             69
## 1664:       878 Philadelphia Virtual Academy  11.000000             79
## 1665:       878 Philadelphia Virtual Academy  12.000000             64
## 1666:       878 Philadelphia Virtual Academy ALL GRADES            311
##       count_female pct_female count_male pct_male
##    1:    70.000000   0.350000 130.000000 0.650000
##    2:    86.000000   0.436548 111.000000 0.563452
##    3:    78.000000   0.453488  94.000000 0.546512
##    4:    63.000000   0.446809  78.000000 0.553191
##    5:   297.000000   0.418310 413.000000 0.581690
##   ---                                            
## 1662:            s          s          s        s
## 1663:    43.000000   0.623188  26.000000 0.376812
## 1664:    47.000000   0.594937  32.000000 0.405063
## 1665:    41.000000   0.640625  23.000000 0.359375
## 1666:   191.000000   0.614148 120.000000 0.385852
#Note that grade is pretty poorly formatted.
#  Let's fix that quickly. I'll explain what's
#  going on here later when we go into
#  regular expressions in a bit more depth.
#  Basically, we're deleting the decimal and everything after it.
school_gender[grade != "ALL GRADES", 
              grade := paste0("GRADE_", gsub("\\..*", "", grade))][]
##       school_id                  school_name      grade total_enrolled
##    1:       101     John Bartram High School    GRADE_9            200
##    2:       101     John Bartram High School   GRADE_10            197
##    3:       101     John Bartram High School   GRADE_11            172
##    4:       101     John Bartram High School   GRADE_12            141
##    5:       101     John Bartram High School ALL GRADES            710
##   ---                                                                 
## 1662:       878 Philadelphia Virtual Academy    GRADE_9             46
## 1663:       878 Philadelphia Virtual Academy   GRADE_10             69
## 1664:       878 Philadelphia Virtual Academy   GRADE_11             79
## 1665:       878 Philadelphia Virtual Academy   GRADE_12             64
## 1666:       878 Philadelphia Virtual Academy ALL GRADES            311
##       count_female pct_female count_male pct_male
##    1:    70.000000   0.350000 130.000000 0.650000
##    2:    86.000000   0.436548 111.000000 0.563452
##    3:    78.000000   0.453488  94.000000 0.546512
##    4:    63.000000   0.446809  78.000000 0.553191
##    5:   297.000000   0.418310 413.000000 0.581690
##   ---                                            
## 1662:            s          s          s        s
## 1663:    43.000000   0.623188  26.000000 0.376812
## 1664:    47.000000   0.594937  32.000000 0.405063
## 1665:    41.000000   0.640625  23.000000 0.359375
## 1666:   191.000000   0.614148 120.000000 0.385852

Above, we were focused only on school-level demographic data, so we threw away a lot of information.

In certain settings, we may be more focused on grade-level outcomes (if, for example, we knew the grade that each teacher primarily taught, then we might want to associate the grade-level demographics to each teacher).

All of this information is here in the file we’ve got, but it’s in a slightly unfortunate format – there are many rows associated with each school. It’s much easier to merge a data.table where each school is found in one, and only one, row.

3.5.1 Long to Wide: dcast

To accomplish this in Stata we would use reshape wide, like reshape wide stub, i(i) j(j).

In R, we use the dcast function (I believe this means data cast, with the idea that we’re casting data “out and to the side” by going from long to wide). There are three main arguments to dcast: data (the data.table to be reshaped), formula (a specification of how to reshape, akin to the i(i) and j(j) in Stata), and value.var (the equivalent of stub in Stata).

It can be confusing to wrap your head around which arguments go where; here is an illustration to help orient your thinking about a given problem:

Dcast Illustration

Basically, the variables that we want to end up as rows, we put on the left of the formula (Left-Hand Side), and those that we want to end up as columns we put on the right of the formula. value.var is(are) the variable(s) dictating what comes from each LHS/RHS pair in the data.
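Before tackling the real data, here’s a minimal self-contained dcast on a toy table (hypothetical names):

```r
library(data.table)

#a toy long table: one row per (school, grade) pair
long <- data.table(school_id = c(1, 1, 2, 2),
                   grade     = c("G1", "G2", "G1", "G2"),
                   pct_male  = c(.5, .6, .4, .7))

#rows = school_id, columns = grade, cells = pct_male
wide <- dcast(long, school_id ~ grade, value.var = "pct_male")
wide
```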

Let’s reshape the gender demographics data so that there is only one row associated with each school:

school_gender_wide <- 
  dcast(school_gender, school_id ~ grade, value.var = "pct_male")
school_gender_wide
##      school_id ALL GRADES  GRADE_0  GRADE_1 GRADE_10 GRADE_11 GRADE_12
##   1:       101   0.581690       NA       NA 0.563452 0.546512 0.553191
##   2:       102   0.550186       NA       NA 0.468531 0.549180 0.537736
##   3:       103   0.530888       NA       NA 0.547826 0.480315 0.512195
##   4:       105   0.425170       NA       NA 0.485714 0.426667 0.324324
##   5:       110   0.547945       NA       NA 0.517241 0.537190 0.500000
##  ---                                                                  
## 214:       842   0.535922 0.603604 0.457944       NA       NA       NA
## 215:       843   0.501355 0.515625 0.515152       NA       NA       NA
## 216:       844   0.529412 0.515873 0.551724       NA       NA       NA
## 217:       856   0.618357       NA       NA        s 0.543860        s
## 218:       878   0.385852       NA       NA 0.376812 0.405063 0.359375
##       GRADE_2  GRADE_3  GRADE_4  GRADE_5  GRADE_6  GRADE_7  GRADE_8
##   1:       NA       NA       NA       NA       NA       NA       NA
##   2:       NA       NA       NA       NA       NA       NA       NA
##   3:       NA       NA       NA       NA       NA       NA       NA
##   4:       NA       NA       NA       NA       NA       NA       NA
##   5:       NA       NA       NA       NA       NA       NA       NA
##  ---                                                               
## 214: 0.470149 0.549550 0.500000 0.568000 0.576577 0.459459 0.656863
## 215: 0.556962 0.529412 0.481481 0.486842 0.468354 0.478723 0.481481
## 216: 0.552448 0.537879 0.467213 0.545455       NA       NA       NA
## 217:       NA       NA       NA       NA       NA       NA       NA
## 218:       NA       NA       NA       NA        s        s        s
##       GRADE_9
##   1: 0.650000
##   2: 0.628743
##   3: 0.575163
##   4: 0.466667
##   5: 0.607595
##  ---         
## 214:       NA
## 215:       NA
## 216:       NA
## 217: 0.555556
## 218:        s

We now have all of the grades as columns in the output. Note how much data is missing – this shows that most schools only cover a small range of grades. The grades are also out of order – they’ve been alphabetized, and "GRADE_10" comes before "GRADE_2" alphabetically. We can fix this by using factors; we may get into that later.
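A quick sketch of the factor fix on a toy table (made-up names): dcast orders the new columns by the factor’s levels rather than alphabetically:

```r
library(data.table)

long <- data.table(id    = rep(1:2, each = 3),
                   grade = rep(c("GRADE_1", "GRADE_2", "GRADE_10"), 2),
                   pct   = runif(6))

#character grade: columns come out alphabetized
#  (GRADE_10 before GRADE_2)
names(dcast(long, id ~ grade, value.var = "pct"))

#factor grade with explicit levels: columns follow the level order
long[ , grade := factor(grade, levels = c("GRADE_1", "GRADE_2", "GRADE_10"))]
names(dcast(long, id ~ grade, value.var = "pct"))
```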

Note that we’ve lost a lot of information that was in our data originally – the data on student counts, the name of the school, etc. We can keep this sort of thing by using dcast more flexibly:

#keep school name by adding it as an ID
#  variable; since school_name is the same
#  for every school_id, this will not
#  change our output.
dcast(school_gender, school_id + school_name ~ grade, value.var = "pct_male")
##      school_id                                 school_name ALL GRADES
##   1:       101                    John Bartram High School   0.581690
##   2:       102               West Philadelphia High School   0.550186
##   3:       103                   High School of the Future   0.530888
##   4:       105 Paul Robeson High School for Human Services   0.425170
##   5:       110                William L. Sayre High School   0.547945
##  ---                                                                 
## 214:       842                      Stephen Decatur School   0.535922
## 215:       843                     Joseph Greenberg School   0.501355
## 216:       844                   William H. Loesche School   0.529412
## 217:       856                         The Workshop School   0.618357
## 218:       878                Philadelphia Virtual Academy   0.385852
##       GRADE_0  GRADE_1 GRADE_10 GRADE_11 GRADE_12  GRADE_2  GRADE_3
##   1:       NA       NA 0.563452 0.546512 0.553191       NA       NA
##   2:       NA       NA 0.468531 0.549180 0.537736       NA       NA
##   3:       NA       NA 0.547826 0.480315 0.512195       NA       NA
##   4:       NA       NA 0.485714 0.426667 0.324324       NA       NA
##   5:       NA       NA 0.517241 0.537190 0.500000       NA       NA
##  ---                                                               
## 214: 0.603604 0.457944       NA       NA       NA 0.470149 0.549550
## 215: 0.515625 0.515152       NA       NA       NA 0.556962 0.529412
## 216: 0.515873 0.551724       NA       NA       NA 0.552448 0.537879
## 217:       NA       NA        s 0.543860        s       NA       NA
## 218:       NA       NA 0.376812 0.405063 0.359375       NA       NA
##       GRADE_4  GRADE_5  GRADE_6  GRADE_7  GRADE_8  GRADE_9
##   1:       NA       NA       NA       NA       NA 0.650000
##   2:       NA       NA       NA       NA       NA 0.628743
##   3:       NA       NA       NA       NA       NA 0.575163
##   4:       NA       NA       NA       NA       NA 0.466667
##   5:       NA       NA       NA       NA       NA 0.607595
##  ---                                                      
## 214: 0.500000 0.568000 0.576577 0.459459 0.656863       NA
## 215: 0.481481 0.486842 0.468354 0.478723 0.481481       NA
## 216: 0.467213 0.545455       NA       NA       NA       NA
## 217:       NA       NA       NA       NA       NA 0.555556
## 218:       NA       NA        s        s        s        s
#keep the count of male students; for brevity,
#  we'll also exclude a bunch of grades so the
#  output is not as space-consuming
dcast(school_gender[grade %in% paste0("GRADE_", 1:6)],
      school_id ~ grade, value.var = c("pct_male", "count_male"))
##      school_id pct_male_GRADE_1 pct_male_GRADE_2 pct_male_GRADE_3
##   1:       113               NA               NA               NA
##   2:       120         0.450820         0.470000         0.489362
##   3:       123         0.491525         0.571429         0.402985
##   4:       125         0.534483         0.478723         0.509091
##   5:       126         0.439560         0.523256         0.530303
##  ---                                                             
## 164:       841         0.486486         0.554622         0.438776
## 165:       842         0.457944         0.470149         0.549550
## 166:       843         0.515152         0.556962         0.529412
## 167:       844         0.551724         0.552448         0.537879
## 168:       878               NA               NA               NA
##      pct_male_GRADE_4 pct_male_GRADE_5 pct_male_GRADE_6 count_male_GRADE_1
##   1:               NA                s         0.548148                 NA
##   2:         0.425000         0.482353         0.564516          55.000000
##   3:                s         0.615385         0.551020          29.000000
##   4:         0.536082         0.515789               NA          62.000000
##   5:         0.633803         0.588235         0.557692          40.000000
##  ---                                                                      
## 164:         0.560000         0.535714         0.552083          54.000000
## 165:         0.500000         0.568000         0.576577          49.000000
## 166:         0.481481         0.486842         0.468354          51.000000
## 167:         0.467213         0.545455               NA          64.000000
## 168:               NA               NA                s                 NA
##      count_male_GRADE_2 count_male_GRADE_3 count_male_GRADE_4
##   1:                 NA                 NA                 NA
##   2:          47.000000          46.000000          34.000000
##   3:          36.000000          27.000000                  s
##   4:          45.000000          56.000000          52.000000
##   5:          45.000000          35.000000          45.000000
##  ---                                                         
## 164:          66.000000          43.000000          56.000000
## 165:          63.000000          61.000000          59.000000
## 166:          44.000000          45.000000          39.000000
## 167:          79.000000          71.000000          57.000000
## 168:                 NA                 NA                 NA
##      count_male_GRADE_5 count_male_GRADE_6
##   1:                  s          74.000000
##   2:          41.000000          35.000000
##   3:          32.000000          27.000000
##   4:          49.000000                 NA
##   5:          30.000000          29.000000
##  ---                                      
## 164:          60.000000          53.000000
## 165:          71.000000          64.000000
## 166:          37.000000          37.000000
## 167:          78.000000                 NA
## 168:                 NA                  s

3.5.2 Wide to Long: melt

As often as we’re given data in long form that we prefer in wide form, we’re presented with data in wide form that we’d prefer in long form. Stata uses parallel terminology for the two operations – reshape wide for the former and reshape long for the latter – but data.table uses different words: dcast for the former and melt for the latter. I picture the data set as made of wax, melting away into a longer form.

melt also takes three main arguments – data (same as before), id.vars (same as the LHS above), and measure.vars (basically the RHS above). measure.vars can be a bit tricky, and there are a few ways to use it. First, we can specify the column names explicitly. Second, we can specify a pattern matched by the names of the columns we’d like. Last, we can use column numbers (this isn’t recommended, so I won’t demonstrate it). Things get slightly more complicated when we want to split into more than one new column.

Let’s try this with our school_gender data. We’ll stack the genders on top of each other and create a single column for student counts, then another for percentages (basically doubling the number of rows in the table).

#First, using explicit column names
melt(school_gender, 
     id.vars = c("school_id", "grade"), 
     measure.vars = c("count_male", "count_female"))
##       school_id      grade     variable      value
##    1:       101    GRADE_9   count_male 130.000000
##    2:       101   GRADE_10   count_male 111.000000
##    3:       101   GRADE_11   count_male  94.000000
##    4:       101   GRADE_12   count_male  78.000000
##    5:       101 ALL GRADES   count_male 413.000000
##   ---                                             
## 3328:       878    GRADE_9 count_female          s
## 3329:       878   GRADE_10 count_female  43.000000
## 3330:       878   GRADE_11 count_female  47.000000
## 3331:       878   GRADE_12 count_female  41.000000
## 3332:       878 ALL GRADES count_female 191.000000
#It's usually more concise to use the patterns
#  convenience "function"; the syntax for this
#  generally involves regular expressions, so
#  I'll hold off on demonstrating the full
#  power of this approach for now. 
melt(school_gender, 
     id.vars = c("school_id", "grade"), 
     measure.vars = patterns("count"))
##       school_id      grade     variable      value
##    1:       101    GRADE_9 count_female  70.000000
##    2:       101   GRADE_10 count_female  86.000000
##    3:       101   GRADE_11 count_female  78.000000
##    4:       101   GRADE_12 count_female  63.000000
##    5:       101 ALL GRADES count_female 297.000000
##   ---                                             
## 3328:       878    GRADE_9   count_male          s
## 3329:       878   GRADE_10   count_male  26.000000
## 3330:       878   GRADE_11   count_male  32.000000
## 3331:       878   GRADE_12   count_male  23.000000
## 3332:       878 ALL GRADES   count_male 120.000000
#Being a bit more careful about the output
#  to make it more understandable
melt(school_gender, 
     id.vars = c("school_id", "grade"), 
     measure.vars = patterns("count"),
     variable.name = "gender",
     value.name = "count")
##       school_id      grade       gender      count
##    1:       101    GRADE_9 count_female  70.000000
##    2:       101   GRADE_10 count_female  86.000000
##    3:       101   GRADE_11 count_female  78.000000
##    4:       101   GRADE_12 count_female  63.000000
##    5:       101 ALL GRADES count_female 297.000000
##   ---                                             
## 3328:       878    GRADE_9   count_male          s
## 3329:       878   GRADE_10   count_male  26.000000
## 3330:       878   GRADE_11   count_male  32.000000
## 3331:       878   GRADE_12   count_male  23.000000
## 3332:       878 ALL GRADES   count_male 120.000000

We can extend this approach to handle reshaping both count and pct like so:

#Admittedly a bit tough to tell whether variable = 1
#  corresponds to Male or Female
melt(school_gender,
     id.vars = c("school_id", "grade"),
     measure.vars = patterns("count", "pct"))
##       school_id      grade variable     value1   value2
##    1:       101    GRADE_9        1  70.000000 0.350000
##    2:       101   GRADE_10        1  86.000000 0.436548
##    3:       101   GRADE_11        1  78.000000 0.453488
##    4:       101   GRADE_12        1  63.000000 0.446809
##    5:       101 ALL GRADES        1 297.000000 0.418310
##   ---                                                  
## 3328:       878    GRADE_9        2          s        s
## 3329:       878   GRADE_10        2  26.000000 0.376812
## 3330:       878   GRADE_11        2  32.000000 0.405063
## 3331:       878   GRADE_12        2  23.000000 0.359375
## 3332:       878 ALL GRADES        2 120.000000 0.385852
#If we want to use the explicitly-named approach
#  from before, we have to keep all associated columns
#  as one item of a list -- here, the first
#  item of the list is for counts, and the second
#  item of the list is for percentages.
#Advantage: we know for sure that count_male
#  and pct_male correspond to variable = 1
melt(school_gender,
     id.vars = c("school_id", "grade"),
     measure.vars = list(c("count_male", "count_female"),
                         c("pct_male", "pct_female")))
##       school_id      grade variable     value1   value2
##    1:       101    GRADE_9        1 130.000000 0.650000
##    2:       101   GRADE_10        1 111.000000 0.563452
##    3:       101   GRADE_11        1  94.000000 0.546512
##    4:       101   GRADE_12        1  78.000000 0.553191
##    5:       101 ALL GRADES        1 413.000000 0.581690
##   ---                                                  
## 3328:       878    GRADE_9        2          s        s
## 3329:       878   GRADE_10        2  43.000000 0.623188
## 3330:       878   GRADE_11        2  47.000000 0.594937
## 3331:       878   GRADE_12        2  41.000000 0.640625
## 3332:       878 ALL GRADES        2 191.000000 0.614148
#We can again specify value.name to facilitate understanding
melt(school_gender,
     id.vars = c("school_id", "grade"),
     measure.vars = list(c("count_male", "count_female"),
                         c("pct_male", "pct_female")),
     value.name = c("count", "pct"))
##       school_id      grade variable      count      pct
##    1:       101    GRADE_9        1 130.000000 0.650000
##    2:       101   GRADE_10        1 111.000000 0.563452
##    3:       101   GRADE_11        1  94.000000 0.546512
##    4:       101   GRADE_12        1  78.000000 0.553191
##    5:       101 ALL GRADES        1 413.000000 0.581690
##   ---                                                  
## 3328:       878    GRADE_9        2          s        s
## 3329:       878   GRADE_10        2  43.000000 0.623188
## 3330:       878   GRADE_11        2  47.000000 0.594937
## 3331:       878   GRADE_12        2  41.000000 0.640625
## 3332:       878 ALL GRADES        2 191.000000 0.614148
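Because melt and dcast are inverses of one another, it can help to see the round trip on a tiny, self-contained example (the toy table here is made up, not part of our school data):

```r
library(data.table)

#a toy wide table: one row per id, one column per group
dt <- data.table(id = 1:2, x_a = c(10, 20), x_b = c(30, 40))

#wide to long: stack x_a and x_b into a single column
long <- melt(dt, id.vars = "id",
             variable.name = "grp", value.name = "x")

#long to wide: dcast recovers the original shape
wide <- dcast(long, id ~ grp, value.var = "x")
```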

3.6 String Data / regex

String data is ubiquitous. And it is ubiquitously messy. Typos, transpositions, inconsistent abbreviations, strange character encodings, arbitrary punctuation… there’s no end to the nightmares awaiting a data analyst forced to work with string data.

The most powerful tool that a data analyst can have in their belt to attack such data is a facility with regular expressions.

Basically, regular expressions are a language for describing and manipulating patterns in string data. Have you ever wondered how many words in English fit a certain pattern? Regular expressions are your best friend.

What are all the words in the English language that have the five vowels in order? I used the Regex Dictionary to search for the pattern a.*e.*i.*o.*u (we’ll cover what this means momentarily) and got the following list:

abstemious, adventitious, amentiferous, anemophilous, arenicolous, argentiferous, arsenious, arteriovenous, autoecious, cavernicolous, facetious, garnetiferous, sacrilegious, theater-in-the-round

Just like that, instantly! Regular expressions are basically a programming language in and of themselves. They can be accessed through most other programs – Stata, Excel, SAS, MATLAB, Python, you name it. Here we’ll obviously focus on their use in R.

To introduce some of the most common regex constructs, I’ll run through a bunch of quick examples demonstrating each type of pattern on a single string; then at the end we’ll quickly apply this to some of our school data.

3.6.1 gsub, grep, grepl, strsplit

These four are the workhorse functions for using regular expressions in R. Each serves a distinct purpose, and I’ll try to use all four a couple of different times in this brief overview.

  1. gsub is used for substitution of patterns. Want to delete all numbers? Cut certain strings from names? This is your friend. A quick side note on etymology – the “g” in gsub stands for global. There is a rarely-used alternative to gsub, sub, which will replace only the first match of the supplied pattern… I’ve only ever used it once, just something to keep in mind.
  2. grep (an obscure acronym for global regular expression print, though I only ever say “grep”) has two purposes. The first is to identify which elements of a string vector match a pattern, returning the corresponding indices. The second (setting the argument value = TRUE) is to return the matched strings themselves.
  3. grepl (the “l” stands for logical) is used to simply test for the existence of the pattern in each string. If the pattern is matched, we get TRUE; otherwise, FALSE.
  4. strsplit is used to split strings into pieces, cutting at every match of the supplied pattern (think of splitting full names at the space into first and last names). It returns a list with one element per input string.
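The first three will get plenty of action below; since strsplit and the sub/gsub distinction won’t come up as much, here’s a quick, self-contained taste of both (the example strings are made up):

```r
#strsplit cuts each string at every match of the pattern,
#  returning a list with one element per input string
strsplit(c("Bob Dylan", "Johnny Cash"), " ")

#sub replaces only the FIRST match; gsub replaces ALL of them
sub("s", "S", "mississippi")
## [1] "miSsissippi"
gsub("s", "S", "mississippi")
## [1] "miSSiSSippi"
```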

3.6.2 Simple, Exact Substring Matching

The most common regular expression is simply an exact sequence of letters. Suppose we wanted to find all the people named Bob. Then we might do:

strings <- c("Bob Dylan", "Johnny Cash", "Bob Marley", "Janis Joplin")

#note the order of arguments!!
#  this is a pet peeve of mine, as I'd expect
#  strings to go first.
grep("Bob", strings, value = TRUE)
## [1] "Bob Dylan"  "Bob Marley"
firsts <- c("Bob", "Johnny", "Bob", "Janis")
lasts <- c("Dylan", "Cash", "Marley", "Joplin")
#We might use grep without value = TRUE to
#  find the last names of all Bobs:
grep("Bob", strings)
## [1] 1 3
lasts[grep("Bob", strings)]
## [1] "Dylan"  "Marley"
mons <- c("January", "February", "March", "April",
          "May", "June", "July", "August",
          "September", "October", "November", "December")

#Which months can we eat oysters?
grep("r", mons)
## [1]  1  2  3  4  9 10 11 12

3.6.3 Beginning/End Anchors: ^ and $

Sometimes we want to be sure we’re matching a pattern from the beginning or end of a string. Regex uses two symbols to represent these: ^ (beginning) and $ (end).

Let’s try it out:

strings <- c("Bob Dylan", "Johnny Cash",
             "Bob Marley", "Janis Joplin",
             "Billy-Bob Thornton")

#Just matching Bob will give us the last
#  string as well:
grep("Bob", strings, value = TRUE)
## [1] "Bob Dylan"          "Bob Marley"         "Billy-Bob Thornton"
#To exclude this, we can force Bob
#  to have come from the beginning
#  of the string:
grep("^Bob", strings, value = TRUE)
## [1] "Bob Dylan"  "Bob Marley"
states <- 
  c("Alabama", "Alaska", "Arizona", "Arkansas",
    "California", "Colorado", "Connecticut",
    "Delaware", "Florida", "Georgia", "Hawaii",
    "Idaho", "Illinois", "Indiana", "Iowa",
    "Kansas", "Kentucky", "Louisiana", "Maine",
    "Maryland", "Massachusetts", "Michigan",
    "Minnesota", "Mississippi", "Missouri",
    "Montana", "Nebraska", "Nevada", "New Hampshire", 
    "New Jersey", "New Mexico", "New York", 
    "North Carolina", "North Dakota", "Ohio", 
    "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", 
    "South Carolina", "South Dakota", "Tennessee", 
    "Texas", "Utah", "Vermont", "Virginia", "Washington", 
    "West Virginia", "Wisconsin", "Wyoming")

#Which states start with New?
grep("^New", states, value = TRUE)
## [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"
#Which states DON'T end with a?
states[!grepl("a$", states)]
##  [1] "Arkansas"      "Colorado"      "Connecticut"   "Delaware"     
##  [5] "Hawaii"        "Idaho"         "Illinois"      "Kansas"       
##  [9] "Kentucky"      "Maine"         "Maryland"      "Massachusetts"
## [13] "Michigan"      "Mississippi"   "Missouri"      "New Hampshire"
## [17] "New Jersey"    "New Mexico"    "New York"      "Ohio"         
## [21] "Oregon"        "Rhode Island"  "Tennessee"     "Texas"        
## [25] "Utah"          "Vermont"       "Washington"    "Wisconsin"    
## [29] "Wyoming"
#We could also have used another
#  argument of grep, invert,
#  to accomplish the same thing as
#  x[!grepl("pattern", x)]
grep("a$", states, value = TRUE, invert = TRUE)
##  [1] "Arkansas"      "Colorado"      "Connecticut"   "Delaware"     
##  [5] "Hawaii"        "Idaho"         "Illinois"      "Kansas"       
##  [9] "Kentucky"      "Maine"         "Maryland"      "Massachusetts"
## [13] "Michigan"      "Mississippi"   "Missouri"      "New Hampshire"
## [17] "New Jersey"    "New Mexico"    "New York"      "Ohio"         
## [21] "Oregon"        "Rhode Island"  "Tennessee"     "Texas"        
## [25] "Utah"          "Vermont"       "Washington"    "Wisconsin"    
## [29] "Wyoming"

3.6.4 Multiple Possibilities: |

As is typical in programming, we can use | to mean “or”; it separates alternative patterns. Let’s try it out:

#Let's get states starting with directional names
grep("North|South|East|West", states, value = TRUE)
## [1] "North Carolina" "North Dakota"   "South Carolina" "South Dakota"  
## [5] "West Virginia"

3.6.5 White Space: \\s

White space comes in many forms. Obviously the most common is a simple space. But also included in this are other sorts of blank characters that may show up unexpectedly in our data, like tabs (\t), newlines (\n), a carriage return (\r), a form feed (\f), etc. (like I said, string data is messy).

In regex in R, \\s matches all of these (those who have seen regex before should note that we need two backslashes because the backslash itself needs to be escaped in an R string). For those interested in web scraping, unfortunately, it appears that \\s does not match &nbsp;; we’ll return to this later.

#Which states have a space of any sort?
grep("\\s", states, value = TRUE)
##  [1] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
##  [5] "North Carolina" "North Dakota"   "Rhode Island"   "South Carolina"
##  [9] "South Dakota"   "West Virginia"

3.6.6 Wildcard: .

Sometimes, we don’t care what character is in a certain place, just that it’s there. This type of pattern is called a wildcard, and it’s denoted in regex by a period (.).

Actually, it’s rare to use it by itself, but it’ll come up a lot when we have more tools. The only thing I can think of for this alone is cheating on crosswords. Suppose we wanted to fill in a crossword answer that had the pattern L _ V _ _ _ _ _ _. As long as we know it’s one word, we can go to the Regex dictionary and search the pattern: l.v...... and we’ll get the full list of possible answers: lava-lava, lavaliere, leviathan, Levitical, Leviticus, liverleaf, liverwort, liveryman, livestock. We’ll use this more soon.
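In R, the same crossword-cheating trick is a one-liner with grepl (a toy sketch; the candidate words here are made up on the spot):

```r
#gr.y means: g, r, then ANY single character, then y
grepl("gr.y", c("gray", "grey", "gravy"))
## [1]  TRUE  TRUE FALSE
```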

3.6.7 Quantifiers: *, +, ?, {m(, n)}

The most common appearance of . from the last section is as .*. We saw this when I found all the pan-vowel words in the introduction to this section.

* is a quantifier – it quantifies the piece of the pattern directly before it. In particular, * says to match the thing before it 0 or more times. So .* means “match anything at least 0 times.” So the regex in the introduction, a.*e.*i.*o.*u means match “‘a’ followed by anything any number of times followed by ‘e’ followed by anything any number of times followed by ‘i’ followed by anything any number of times followed by ‘o’ followed by anything any number of times followed by ‘u’”.

Here are all the possible quantifiers:

  1. *: Match 0 or more times.
  2. +: Match 1 or more times.
  3. ?: Match 0 or 1 times (i.e., optionally).
  4. {m}: Match exactly m times.
  5. {m,n}: Match between m and n times, inclusive. If n is blank ({m,}), match at least m times.

Some demonstration:

#What states have a doubled s?
grep("s{2}", states, value = TRUE)
## [1] "Massachusetts" "Mississippi"   "Missouri"      "Tennessee"
#Which states have an a in them
#  between the first and the last letter?
#  (first letter is excluded because
#   we're only looking for lowercase a)
grep("a.+", states, value = TRUE)
##  [1] "Alabama"        "Alaska"         "Arkansas"       "California"    
##  [5] "Colorado"       "Delaware"       "Hawaii"         "Idaho"         
##  [9] "Indiana"        "Kansas"         "Louisiana"      "Maine"         
## [13] "Maryland"       "Massachusetts"  "Michigan"       "Montana"       
## [17] "Nebraska"       "Nevada"         "New Hampshire"  "North Carolina"
## [21] "North Dakota"   "Oklahoma"       "Pennsylvania"   "Rhode Island"  
## [25] "South Carolina" "South Dakota"   "Texas"          "Utah"          
## [29] "Washington"
flips <-
  c("HTHHHTHHHTH",
    "THTTHTTHHTH",
    "THTHTHHTHHHT",
    "THTHTTTHTTHTTT")

#Which sequences of coin flips had
#  at least 3 heads in a row?
grepl("H{3, }", flips)
## [1]  TRUE FALSE  TRUE FALSE
#Let's match the middle initial K, possibly
#  followed by some punctuation
grepl("\\sK.?", c("Johnson, Mary K",
                  "Hobbes, Calvin K.",
                  "Martin, George R. R."))
## [1]  TRUE  TRUE FALSE

3.6.8 Group Matching: []

[] is used for a more selective version of the . wildcard. We specify the characters we’re OK with inside [], and any one of them will be matched. Most commonly, this is used for matching any letter or any number. To do this, we use - inside [] to mean “through”, like [A-Z] (A through Z):

#Let's match ANY middle initial
nms <- c("Johnson, Mary K",
         "Hobbes, Calvin K.",
         "Martin, George R. R.", "Obama, Barack",
         "Prince, The Artist Formerly Known As")
grepl("\\s[A-Z]\\.?$", nms)
## [1]  TRUE  TRUE  TRUE FALSE FALSE
#If we want to find anyone with a period,
#  we have to be careful. Since . matches
#  ANY character, this won't work:
grepl(".", nms)
## [1] TRUE TRUE TRUE TRUE TRUE
#We CAN put a period inside a group:
grepl("[.]", nms)
## [1] FALSE  TRUE  TRUE FALSE FALSE
#OR we can escape it with \\:
grepl("\\.", nms)
## [1] FALSE  TRUE  TRUE FALSE FALSE
#Let's remove all digits from some addresses
gsub("[0-9]", "", c("3718 Locust Walk",
                    "219 S 41st Street",
                    "1600 Pennsylvania Ave NW"))
## [1] " Locust Walk"         " S st Street"         " Pennsylvania Ave NW"
#Whoops, we took away too much there!
#  We also missed the leading white space
#  Let's try again:
gsub("^[0-9]*\\s*", "", c("3718 Locust Walk",
                          "219 S 41st Street",
                          "1600 Pennsylvania Ave NW"))
## [1] "Locust Walk"         "S 41st Street"       "Pennsylvania Ave NW"

3.6.8.1 Specials in Grouping: [:punct:], [:alnum:]

Also within [] we can refer to certain sets of characters with special shorthands. The two most useful are [:punct:], which matches all punctuation (specifically, the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~), and [:alnum:], which matches all alphanumeric characters (i.e., [:alnum:] is the same as [A-Za-z0-9]). (There are also [:upper:], [:lower:], [:digit:], and [:alpha:], but all of these are longer to write out than just writing A-Z, a-z, 0-9, and a-zA-Z, respectively.)

#get rid of all punctuation
gsub("[[:punct:]]", "", "John C. Willington, Jr.")
## [1] "John C Willington Jr"
#get rid of all alphanumeric characters
gsub("[[:alnum:]]", "", "John C. Willington, Jr.")
## [1] " . , ."

3.6.8.2 Negative Group Matching: [^]

We can specify a group of characters not to match by starting the matching group with [^ instead of [.

#Keep only digits
gsub("[^0-9]", "", c("3718 Locust Walk",
                     "219 S 41st Street",
                     "1600 Pennsylvania Ave NW"))
## [1] "3718"  "21941" "1600"

3.6.9 Awesome Tools for Practicing

I think that’s enough for today. Some other common things to look up:

  • () to denote matched groups, along with \\1, \\2, etc.
  • Look-ahead and look-behind, both positive and negative, using perl = TRUE
  • \\U and \\L to force letters upper- or lower-case within regex
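To give a quick flavor of the first and third of these (a minimal sketch; the strings are invented):

```r
#() captures a group; in the replacement, \\1 and \\2
#  refer back to the first and second captured groups
gsub("(\\w+), (\\w+)", "\\2 \\1", "Dylan, Bob")
## [1] "Bob Dylan"

#\\U upper-cases the referenced group (requires perl = TRUE)
gsub("^(\\w)", "\\U\\1", "pennsylvania", perl = TRUE)
## [1] "Pennsylvania"
```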

Besides that, regex is all about playing around extensively with your actual data.

And staring at a regex cheat sheet! I have this printed and pasted above my desk:

https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

I also frequently visit this website, which allows you to test regexes and breaks down the pattern step-by-step with explanations and pretty colors:

https://regex101.com/

3.6.10 A Quick Data Exercise

Remember how we found out that Masterman had a strange job description for its teachers? The reason we were susceptible to this is that we relied on finding the most common job description to identify teachers. This was bound to fail if a small, specialized school like Masterman uses some alternative description on a small scale.

A better approach to tackling this problem would have been to use regex. After some brief inspection of our salaries data, we could have seen that TEACHER appears to be a constant in teachers’ job descriptions. We can start from there and winnow down to only the teachers we’re interested in, while remaining more inclusive than just TEACHER,FULL TIME:

#putting grepl in the first argument means
#  we'll only get rows for which there is
#  a match to the pattern
salaries[grepl("TEACHER", title_description),
         .N, by = title_description][order(-N)]
##                  title_description    N
##  1:              TEACHER,FULL TIME 6652
##  2:         TEACHER,SPEC EDUCATION 1314
##  3: TEACHER-EXTRA CURR/STAFF DEVEL  336
##  4:              TEACHER ASST,PKHS  109
##  5:          TEACHER,DEMONSTRATION   83
##  6:       TEACHER,HEAD,PKHS/BRIGHT   38
##  7:  RETIRED TEACHER, PER DIEM SUB   19
##  8:             CONSULTING TEACHER   19
##  9:      TEACHER,SPEC ASSIGN,12 MO   16
## 10:  TEACHER,DEMONSTRATION,SPEC ED   16
## 11: RETIRED TEACHER,PER DIEM SP ED   16
## 12:   TEACHER,LONG TERM SUBSTITUTE    6
## 13:          TEACHER,NON-CERT HRLY    4
## 14:   TEACHER, PER DIEM SUBSTITUTE    1
## 15:            MGR TEACHER COACHES    1

On a smaller scale like this, it’s a lot easier to sift through every possible TEACHER-related job title. We can see Masterman’s TEACHER,DEMONSTRATION right there as the 5th-most-common. Suppose we had wanted to keep only TEACHER,FULL TIME and TEACHER,DEMONSTRATION for our teachers data set. We might have subsetted like this:

#(FULL|DEMO) matches ONLY
#  TEACHER,FULL or TEACHER,DEMO;
#  the rest of the regex is to exclude
#  TEACHER,DEMONSTRATION,SPEC ED, which
#  is distinguished by having a comma
#  later in the string, so we require all
#  the remaining characters after
#  (FULL|DEMO) to NOT be a comma.
salaries[grepl("TEACHER,(FULL|DEMO)[^,]*$", title_description),
         unique(title_description)]
## [1] "TEACHER,FULL TIME"     "TEACHER,DEMONSTRATION"

What if we only wanted to compare the average wage of PFT employees vs. non-PFT employees? PFT represents several classes of employee in the district, so we might not want to look solely at PFT-TEACHER. We can do this with regex:

#make sure we don't mix hourly employees
#  since their wages are quoted by the hour
salaries[pay_rate_type == "SALARIED", 
         .(avg_wage = mean(pay_rate)), 
         by = .(pft = grepl("^PFT", type_of_representation))]
##      pft avg_wage
## 1: FALSE 35481.39
## 2:  TRUE 58571.78

Lastly, we can extract only employees working at schools (as opposed to offices) with something like:

salaries[grepl("SCHOOL", organization_level),
         .N, organization_level][order(-N)]
##             organization_level    N
## 1:           ELEMENTARY SCHOOL 9667
## 2:                 HIGH SCHOOL 3101
## 3:               MIDDLE SCHOOL  954
## 4: TRANSITION / OVERAGE SCHOOL  141
## 5:             CHARTER SCHOOLS   12
## 6:           NON PUBLIC SCHOOL    1

3.7 Non-.csv Data

Comma- or tab-separated data is ideal, especially since we have the incredibly-fast fread by our side to facilitate the data reading process when we’re lucky enough to have this kind of data.

But we’re not always so lucky. We’ve already covered how to read in Excel data files. Here we’ll cover three more file types I’ve found frequently in the wild – files created in SAS, files created in Stata, and fixed-width files. The first two are handled by the haven package, which is specifically designed for reading foreign, non-csv data types into R; for fixed-width files, we’ll use read_fwf and revisit fread.

3.7.1 SAS: read_sas

Users of SAS may be familiar with its sas7bdat export file (I think it stands for “SAS 7 binary data”). I have found this file format in one place while working on projects in education – the Common Core of Data. Many of the district- and school-level data files found there are available in sas7bdat format.

To just run through a quick example, let’s load in the most recent LEA Universe Survey Data, from 2013-14. The original data can be found on this site here, but has been loaded onto your computer already.

Basically, reading such a file is as simple as using the read_sas function from the haven package:

#To install from CRAN
install.packages("haven")
#Now load the package:
library(haven)

leas_13 <- read_sas("data/ag131a_supp.sas7bdat")
class(leas_13)
## [1] "tbl_df"     "tbl"        "data.frame"
#again, it's been read as a data.frame,
#  so we'll use setDT to make it a data.table
setDT(leas_13)

#these files are huge...
ncol(leas_13)
## [1] 339
#so let's just preview the first, say, 30 columns:
leas_13[ , 1:30, with = FALSE]
##        SURVYEAR   LEAID FIPST STID
##     1:     2013 0100002    01  210
##     2:     2013 0100005    01  101
##     3:     2013 0100006    01  048
##     4:     2013 0100007    01  158
##     5:     2013 0100008    01  169
##    ---                            
## 18781:     2013 6600040    66  001
## 18782:     2013 6900030    69  001
## 18783:     2013 7200030    72   01
## 18784:     2013 7800002    78  002
## 18785:     2013 7800030    78  001
##                                             NAME      PHONE
##     1:                    ALABAMA YOUTH SERVICES 3342153850
##     2:                          ALBERTVILLE CITY 2568911183
##     3:                           MARSHALL COUNTY 2565823171
##     4:                               HOOVER CITY 2054391015
##     5:                              MADISON CITY 2564648370
##    ---                                                     
## 18781:                    GUAM DEPT OF EDUCATION 6714750512
## 18782:                 CNMI PUBLIC SCHOOL SYSTEM 6702373061
## 18783:       PUERTO RICO DEPARTMENT OF EDUCATION 7877735800
## 18784:               SAINT CROIX SCHOOL DISTRICT 3407731747
## 18785: SAINT THOMAS - SAINT JOHN SCHOOL DISTRICT 3407752250
##                        MSTREE        MCITY MSTATE  MZIP MZIP4
##     1:             P O BOX 66     MT MEIGS     AL 36057  0066
##     2:       107 WEST MAIN ST  ALBERTVILLE     AL 35950  0025
##     3: 12380 US HIGHWAY 431 S GUNTERSVILLE     AL 35976  9351
##     4:  2810 METROPOLITAN WAY       HOOVER     AL 35243  5500
##     5:       211 CELTIC DRIVE      MADISON     AL 35758  1615
##    ---                                                       
## 18781:              PO BOX DE        AGANA     GU 96932  9008
## 18782:       PO BOX 501370 CK       SAIPAN     MP 96950      
## 18783:          PO BOX 190759     SAN JUAN     PR 00919  0759
## 18784:   2133 HOSPITAL STREET  SAINT CROIX     VI 00820  4665
## 18785:      386 ANNAS RETREAT SAINT THOMAS     VI 00802      
##                                LSTREE        LCITY LSTATE  LZIP LZIP4 TYPE
##     1:    1000 INDUSTRIAL SCHOOL ROAD    MT. MEIGS     AL 36057  0066    1
##     2:               107 WEST MAIN ST  ALBERTVILLE     AL 35950  0025    1
##     3:         12380 US HWY 431 SOUTH GUNTERSVILLE     AL 35976  9351    1
##     4:          2810 METROPOLITAN WAY       HOOVER     AL 35243  5500    1
##     5:                  211 CELTIC DR      MADISON     AL 35758  1615    1
##    ---                                                                    
## 18781:               312 ASPINALL AVE        AGANA     GU 96932  9008    1
## 18782:                  PO BOX 501370       SAIPAN     MP 96950          1
## 18783: CALLE FEDERICO COSTAS NUM. 150     HATO REY     PR 00919  0759    1
## 18784:               2133 HOSPITAL ST  SAINT CROIX     VI 00820  4665    1
## 18785:              386 ANNAS RETREAT SAINT THOMAS     VI 00802          1
##        UNION CONUM             CONAME CSA  CBSA METMIC ULOCAL CDCODE
##     1:   000 01101  MONTGOMERY COUNTY   N 33860      1     21   0103
##     2:   000 01095    MARSHALL COUNTY 290 10700      2     32   0104
##     3:   000 01095    MARSHALL COUNTY 290 10700      2     42   0104
##     4:   000 01073   JEFFERSON COUNTY 142 13820      1     13   0106
##     5:   000 01089     MADISON COUNTY 290 26620      1     21   0105
##    ---                                                              
## 18781:   000     N                  N   N     N      0      N      N
## 18782:   000     N                  N   N     N      0      N      N
## 18783:   000 72127 SAN JUAN MUNICIPIO 490 41980      1     21   7298
## 18784:   000 78010   ST. CROIX ISLAND   N     N      0     33   7898
## 18785:   000 78030  ST. THOMAS ISLAND   N     N      0     33   7898
##         LATCOD   LONCOD BIEA BOUND AMEMPUP
##     1: 32.3770 -86.0830    2     1       2
##     2: 34.2675 -86.2086    2     1       2
##     3: 34.3050 -86.2867    2     1       2
##     4: 33.4062 -86.7669    2     1       2
##     5: 34.6873 -86.7449    2     1       2
##    ---                                    
## 18781:  0.0000   0.0000    2     1       2
## 18782:  0.0000   0.0000    2     1       2
## 18783: 18.4243 -66.0706    2     1       2
## 18784: 17.7025 -64.8669    2     1       2
## 18785: 18.3395 -64.8822    2     1       2

3.7.2 Stata: read_dta

I haven’t come across any files in education that are only available in .dta format, so we’ll just quickly digress into using non-educational data to demonstrate how to read such files in case you find them in your research.

The file I found is from here, this file. Again, it’s pre-loaded on your machine.

As I mentioned, the haven package is designed to handle reading all sorts of foreign formats into R; Stata is among them.

#this is a very tiny file
car_data <- read_dta("data/carsdata.dta")
#we could use setDT, etc., but we won't
car_data
##   cars hhsize
## 1    1      1
## 2    2      2
## 3    2      3
## 4    2      4
## 5    3      5

3.7.3 Fixed-width read_fwf/fread redux

Fixed-width files are basically the bane of my existence. Trouble is, I keep running into them, regardless of how nightmarish they are to deal with, especially when you have to create the width dictionary yourself.

This is essentially the case with old Universe files from the Common Core. Take, for example, the 1995-96 LEA Universe Flat File, which is in fixed-width format. There is a record layout on the NCES website as well… unfortunately, it, too, is in fixed-width format! See what I mean about fixed-width…

To simplify our exercise a bit, instead of using the NCES files, we’ll use some files from Wisconsin. The main advantage is that I’ve already done a fair bit of data cleaning on the original raw files.

Much like Philadelphia, Wisconsin publishes data on all of its teachers. But the Wisconsin data is much more comprehensive. It’s statewide, and it includes information on teachers’ age, experience, fringe benefits pay, etc. The data goes back to 1994-95, but we’ll just use the 1999-2000 data for now.

The data files can be found here, but be forewarned – they’re very messy. Messy enough to have served as inspiration for the “Data Cleaning” section in the appendix. I’ve done everyone the favor of including the more manageable version in today’s data files.

3.7.3.1 read_fwf

The package we’ll use for reading fixed-width files is readr. Let’s load it:

install.packages("readr")
library(readr)

read_fwf is the workhorse function in readr. In addition to the column names, it requires an extra argument – col_positions. This tells read_fwf how to find the columns in the data. It can be specified in a number of ways (see ?read_fwf), but the one we’ll use today is the helper function fwf_widths. This latter function maps column widths into something that col_positions can understand. It also takes an argument allowing us to specify column names, which is convenient.

#The column widths are kept in a separate dictionary file,
#  which I had to create myself
fwf_dict <- fread("data/00keys.csv", header = FALSE)
teachers_wisc <-
  read_fwf("data/00staff.dat", 
           col_positions = 
             fwf_widths(fwf_dict$V2, fwf_dict$V1))
#we gotta setDT again
class(teachers_wisc)
## [1] "tbl_df"     "tbl"        "data.frame"
setDT(teachers_wisc)[]
##                id    last_name first_name gender ethnicity birth_year
##      1: 000540952     Anderson  Shelley J      F         W       1949
##      2: 000540943       Bowers     Teresa      F         W       1966
##      3: 000540930   Branstiter      Zoila      F         H       1957
##      4: 000540973        Braun   Connie M      F         W       1940
##      5: 000540896      Broeske    Marilyn      F         W       1952
##     ---                                                              
## 155428: 000603626    Vanpernis       Anna      F         W       1977
## 155429: 000603663     Vonfange     Elaine      F         W       1947
## 155430: 000603604    Wahlquist       Jean      F         W       1955
## 155431: 000603590 Zuber Sequin        Amy      F         W       1977
## 155432: 000603590 Zuber Sequin        Amy      F         W       1977
##         highest_degree highest_degree_year year_session months_employed
##      1:             NA                  NA         2000            0000
##      2:              4                1999         2000            0950
##      3:             NA                  NA         2000            0000
##      4:              4                1962         2000            0950
##      5:             NA                  NA         2000            0000
##     ---                                                                
## 155428:             NA                  NA         2000            0000
## 155429:             NA                  NA         2000            0000
## 155430:             NA                  NA         2000            0000
## 155431:              4                1999         2000            0950
## 155432:              4                1999         2000            0950
##         local_exp total_exp  salary  fringe filler category filler
##      1:       000       000 0011946 0007620     NA        3     NA
##      2:       020       020 0022211 0004296     NA        1     NA
##      3:       000       000 0012224 0007673     NA        3     NA
##      4:       250       310 0039487 0011001     NA        1     NA
##      5:       000       000 0012848 0009661     NA        3     NA
##     ---                                                           
## 155428:       000       000 0017557 0006214     NA        3     NA
## 155429:       000       000 0023056 0007543     NA        3     NA
## 155430:       000       000 0034772 0009772     NA        3     NA
## 155431:       010       010 0024973 0004813     NA        1     NA
## 155432:       010       010 0024973 0004813     NA        1     NA
##         hire_agency agency agency_hire_type school position_code area
##      1:        0007   0007               04   0020            97  907
##      2:        0007   0007               04   0020            53  050
##      3:        0007   0007               04   0020            97  907
##      4:        0007   0007               04   0020            53  265
##      5:        0007   0007               04   0020            97  907
##     ---                                                              
## 155428:        9912   9912               01     NA            68  000
## 155429:        9912   9912               01     NA            68  000
## 155430:        9912   9912               01     NA            69  000
## 155431:        9912   9912               01     NA            64  000
## 155432:        9912   9912               01     NA            64  020
##         low_grade high_grade low_grade_code high_grade_code bilingual
##      1:        KG         KG             16              16         N
##      2:        01         03             20              28         N
##      3:        KG         05             16              36         N
##      4:        04         05             32              36         N
##      5:        KG         05             16              36         N
##     ---                                                              
## 155428:        3E         12              4              64         N
## 155429:        KG         12             16              64         N
## 155430:        KG         12             16              64         N
## 155431:        3E         12              4              64         N
## 155432:        3E         12              4              64         N
##         full_time_equiv filler final_contract filler         agency_name
##      1:             094     NA              Y     NA Abbotsford Sch Dist
##      2:             085     NA              N     NA Abbotsford Sch Dist
##      3:             054     NA              Y     NA Abbotsford Sch Dist
##      4:             013     NA              N     NA Abbotsford Sch Dist
##      5:             094     NA              Y     NA Abbotsford Sch Dist
##     ---                                                                 
## 155428:             091     NA              Y     NA                  NA
## 155429:             100     NA              Y     NA                  NA
## 155430:             100     NA              Y     NA                  NA
## 155431:             045     NA              Y     NA                  NA
## 155432:             055     NA              Y     NA                  NA
##                         school_name grade_level cesa county
##      1:               Abbotsford El           6   10     10
##      2:               Abbotsford El           6   10     10
##      3:               Abbotsford El           6   10     10
##      4:               Abbotsford El           6   10     10
##      5:               Abbotsford El           6   10     10
##     ---                                                    
## 155428: Cooperative Ed Serv Agcy 12          NA   12     74
## 155429: Cooperative Ed Serv Agcy 12          NA   12     74
## 155430: Cooperative Ed Serv Agcy 12          NA   12     74
## 155431: Cooperative Ed Serv Agcy 12          NA   12     74
## 155432: Cooperative Ed Serv Agcy 12          NA   12     74
##                county_name agency_work_type   school_mail_1 school_mail_2
##      1:       Clark County               04 112 W Spruce St            NA
##      2:       Clark County               04 112 W Spruce St            NA
##      3:       Clark County               04 112 W Spruce St            NA
##      4:       Clark County               04 112 W Spruce St            NA
##      5:       Clark County               04 112 W Spruce St            NA
##     ---                                                                  
## 155428: CESA County County               01  618 Beaser Ave            NA
## 155429: CESA County County               01  618 Beaser Ave            NA
## 155430: CESA County County               01  618 Beaser Ave            NA
## 155431: CESA County County               01  618 Beaser Ave            NA
## 155432: CESA County County               01  618 Beaser Ave            NA
##                     school_mail_3 school_shipping_1 school_shipping_2
##      1: Abbotsford WI  54405-9734                NA                NA
##      2: Abbotsford WI  54405-9734                NA                NA
##      3: Abbotsford WI  54405-9734                NA                NA
##      4: Abbotsford WI  54405-9734                NA                NA
##      5: Abbotsford WI  54405-9734                NA                NA
##     ---                                                              
## 155428:    Ashland WI  54806-2751                NA                NA
## 155429:    Ashland WI  54806-2751                NA                NA
## 155430:    Ashland WI  54806-2751                NA                NA
## 155431:    Ashland WI  54806-2751                NA                NA
## 155432:    Ashland WI  54806-2751                NA                NA
##         school_shipping_3  mail_city mail_state   mail_zip ship_city
##      1:                NA Abbotsford         WI 54405-9734        NA
##      2:                NA Abbotsford         WI 54405-9734        NA
##      3:                NA Abbotsford         WI 54405-9734        NA
##      4:                NA Abbotsford         WI 54405-9734        NA
##      5:                NA Abbotsford         WI 54405-9734        NA
##     ---                                                             
## 155428:                NA    Ashland         WI 54806-2751        NA
## 155429:                NA    Ashland         WI 54806-2751        NA
## 155430:                NA    Ashland         WI 54806-2751        NA
## 155431:                NA    Ashland         WI 54806-2751        NA
## 155432:                NA    Ashland         WI 54806-2751        NA
##         ship_state ship_zip    telephone filler       admin_name nee
##      1:         NA       NA 715-223-4281     NA  Jerry Zanotelli  NA
##      2:         NA       NA 715-223-4281     NA  Jerry Zanotelli  NA
##      3:         NA       NA 715-223-4281     NA  Jerry Zanotelli  NA
##      4:         NA       NA 715-223-4281     NA  Jerry Zanotelli  NA
##      5:         NA       NA 715-223-4281     NA  Jerry Zanotelli  NA
##     ---                                                             
## 155428:         NA       NA 715-682-2363     NA Fred Schlichting  NA
## 155429:         NA       NA 715-682-2363     NA Fred Schlichting  NA
## 155430:         NA       NA 715-682-2363     NA Fred Schlichting  NA
## 155431:         NA       NA 715-682-2363     NA Fred Schlichting  NA
## 155432:         NA       NA 715-682-2363     NA Fred Schlichting  NA
##         contracted_employee long_term_sub
##      1:                   N             N
##      2:                   N             N
##      3:                   N             N
##      4:                   N             N
##      5:                   N             N
##     ---                                  
## 155428:                   N             N
## 155429:                   N             N
## 155430:                   N             N
## 155431:                   N             N
## 155432:                   N             N

4 Plotting (11 AM - 12 PM)

Now we can finally start working on some design!

Fine-tuning plots is a pain in any language/program. But doing basic plots in R is very simple. We’ll start with some easy plots and move on to some more bells and whistles.

4.1 base Plots

R comes ready and loaded with functions for all of the most common types of plots – histograms, kernel densities, scatter plots, line plots, barplots, etc. I’ll go through these one by one, sprinkling in some tricks along the way which are applicable to any plotting scenario (e.g., setting the axis labels and plot title).

After we have all of the main plot types handy, I’ll move on to tiling plots (plotting more than one plot at once).

4.1.1 Histograms

Histograms are probably the most common way to get a glance at the way that a variable is distributed in your data. Let’s examine the distribution of birth years in the Wisconsin teacher data.

First, a quick jab of data cleaning:

#Only include teachers matching the following criteria:
#  * Have bachelor's or master's degree
#  * Are full-time
#  * Strictly positive salary
#  * Aren't hired by administrative buildings

teachers_wisc <- 
  teachers_wisc[highest_degree %in% c(4, 5) & 
                  position_code == "53" &
                  as.integer(salary) > 0 & 
                  substr(agency, 1, 2) != "99" & 
                  as.integer(total_exp) %between% c(10, 300)]

#Now set salaries and fringe benefits to be numeric
teachers_wisc[ , salary := as.numeric(salary)]
teachers_wisc[ , fringe := as.numeric(fringe)]

#experience is kept in units of tenth-years, for some reason
teachers_wisc[ , total_exp := as.numeric(total_exp)/10]

Now we can examine the distribution on a more select/interesting sample of employees. Drawing histograms is managed through the hist function (so it’s very similar to Stata). It only needs one argument – a vector which will be converted into a distribution.

#no-frills histogram
teachers_wisc[ , hist(birth_year)]

That’s not bad considering how simple it was to program. There are three things that jump out at me immediately as being deficiencies of this plot:

  1. The title is abstruse
  2. The axis titles are odd
  3. There’s no color – how drab!

Let’s take care of these three in turn.

4.1.1.1 Main Title

The main title of a plot is almost always handled by the main argument to a plotting function:

#Customizing the main title
teachers_wisc[ , hist(birth_year, 
                      main = "Distribution of Year of Birth among Wisconsin Teachers")]

#May be preferable to have the title split over
#  two lines -- to do this, add \n where
#  you want the title to split
teachers_wisc[ , hist(birth_year, 
                      main = "Distribution of Year of Birth\nAmong Wisconsin Teachers")]

#To keep our code lines from getting too long, we can split the
#  title over two lines with paste0:
teachers_wisc[ , hist(birth_year, 
                      main = paste0("Distribution of Year of Birth\n",
                                    "Among Wisconsin Teachers"))]

4.1.1.2 Axis Labels

Looking better already. Now let’s handle the axes. There are four arguments involved here. xlab and ylab handle the labels of the two axes. And we can use xaxt and yaxt to tell R not to plot an axis (usually this means we want to specify our own axes/ticks; we’ll experiment with this later):

#remember, argument order doesn't
#  matter as long as we name the arguments
teachers_wisc[ , hist(birth_year, xlab = "Year of Birth",
                      main = paste0("Distribution of Year of Birth\n",
                                    "Among Wisconsin Teachers"))]

4.1.1.3 Colors

Now let’s add some real life by coloring our plot! This is handled with the col argument.

There are 657 color names understood by this argument. You can see all of them by running colors(). More often, when choosing a color, I visit this website, which has the names of all of these colors as well as a preview of each color itself. Here’s the first page:

Thanks Tian Zhang!
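If you’d rather not leave the console, you can also filter the color names directly; here’s a quick sketch using base R’s grep (the pattern "blue" is just an example):

```r
# colors() returns all 657 recognized color names as a character vector;
#   grep(..., value = TRUE) keeps only the names matching the pattern
blues <- grep("blue", colors(), value = TRUE)
head(blues)
length(blues)
```

This is handy for quickly finding, say, every shade of blue R knows by name.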

#something plain
teachers_wisc[ , hist(birth_year, xlab = "Year of Birth", col = "red",
                      main = paste0("Distribution of Year of Birth\n",
                                    "Among Wisconsin Teachers"))]

#something more exotic
teachers_wisc[ , hist(birth_year, xlab = "Year of Birth",
                      col = "darkslateblue",
                      main = paste0("Distribution of Year of Birth\n",
                                    "Among Wisconsin Teachers"))]

4.1.1.4 More Histogram Bells & Whistles

There are a bunch of other ways we can spruce up plots, but we’ll save those for other plot types. For now, let’s just name a few extras specifically made for histograms.

4.1.1.4.1 Density vs. Frequency

Histograms can be used to represent either frequencies or probabilities. Frequencies are raw counts; when plotted as probabilities, we force the total area of the histogram to be 1 (just like any probability density). We can switch to probabilities by setting the argument freq to FALSE:

#note the different y axis
teachers_wisc[ , hist(birth_year, xlab = "Year of Birth",
                      col = "darkslateblue", freq = FALSE,
                      main = paste0("Distribution of Year of Birth\n",
                                    "Among Wisconsin Teachers"))]

4.1.1.4.2 Customized Bins/Breaks

Actually, the histogram is a fairly well-studied (and oft-criticized) non-parametric estimator of the distribution of a random variable. There is extensive debate about the proper way to construct a histogram – how wide the bins should be, how many there should be, where the bins should be centered, etc.

By default, R uses Sturges’ formula (based on the range of the data) for determining most of this. But there is also easy access to alternative algorithms through the breaks argument. If we pass "Scott" (specifically tailored to the normal distribution) or "FD" (based on the inter-quartile range), we’ll get a slightly different histogram based on the bins produced by these approaches. To be honest, I’ve never used any of these, so we’ll skip them for now.
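If you’re curious anyway, trying them is just a matter of passing the rule’s name to breaks. Here’s a minimal sketch on simulated data (the numbers are made up to mimic birth years; plot = FALSE returns the binning without drawing anything):

```r
set.seed(1023)  # arbitrary seed, just for reproducibility
x <- rnorm(1000, mean = 1960, sd = 10)  # fake "birth years"

# each rule typically chooses a different number of bins
h_sturges <- hist(x, plot = FALSE)                   # default: Sturges
h_scott   <- hist(x, breaks = "Scott", plot = FALSE)
h_fd      <- hist(x, breaks = "FD", plot = FALSE)

c(sturges = length(h_sturges$breaks),
  scott   = length(h_scott$breaks),
  fd      = length(h_fd$breaks))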

We can also send some other things to breaks – most commonly, a single number (the number of bins to be used), but also a vector (specifying the endpoints of the bins) and some other more obscure options (see ?hist for more). Let’s test these out:

#Same as above, for comparison without scrolling
teachers_wisc[ , hist(birth_year, xlab = "Year of Birth",
                      col = "darkslateblue", freq = FALSE,
                      main = paste0("Distribution of Year of Birth\n",
                                    "Among Wisconsin Teachers"))]

#Setting the breakpoints ourself to have one for every year
teachers_wisc[ , hist(birth_year, xlab = "Year of Birth",
                      breaks = min(birth_year):max(birth_year),
                      col = "darkslateblue", freq = FALSE,
                      main = paste0("Distribution of Year of Birth\n",
                                    "Among Wisconsin Teachers"))]

#Setting the NUMBER of bins ourself
teachers_wisc[ , hist(birth_year, xlab = "Year of Birth", breaks = 20,
                      col = "darkslateblue", freq = FALSE,
                      main = paste0("Distribution of Year of Birth\n",
                                    "Among Wisconsin Teachers"))]

4.1.2 Kernel Density Plots

A kernel density plot is simply the continuous version of a histogram. Rather than trying to approximate the distribution of a continuous variable with a necessarily-discrete histogram, statisticians have devised kernel density estimates as a continuous approximation for continuous variables.

4.1.2.1 Barebones

These are handled in R with the density function (cf. kdensity in Stata). Unlike hist, the density function does not itself plot the kernel density. Instead, we have to send the output of density to the main plotting function in R, plot. plot accepts all of the same plot-style bells and whistles that we sent to hist.

teachers_wisc[ , plot(density(salary), main = "Density of Salaries",
                      xlab = "$", col = "darkgreen")]

A few things about this basic plot jump out to me and highlight some new things we can add to our toolbelt to improve the readability of this plot.

  1. The line is too thin. It’s a bit hard to see.
  2. The numbers on the x- and y-axes are impossible to understand.

After handling each of these, we’ll turn to addressing some density-specific parameters.

4.1.2.2 Line Width

Line width is handled by the lwd parameter. This is specified as a number; the default is 1. I usually specify lwd = 3 in presentation-ready plots. As we draw more and more complicated plots, this is something we can use to distinguish certain parts of our plot.

teachers_wisc[ , plot(density(salary), main = "Density of Salaries",
                      xlab = "$", col = "darkgreen", lwd = 3)]

4.1.2.3 Axis Labels

R does its best to come up with nice, round, easy-on-the-eyes axis labels, but sometimes it doesn’t do so well. When we notice this is the case, R has given us the tools to customize what shows up on all of the axes. Surprise, surprise, this is done through the axis function. axis takes three main arguments: 1) side, which tells to which side of the plot to add the axis (from 1 through 4, going clockwise from the bottom, so side = 1 is the x axis, side = 2 is the y axis, etc.); 2) at, which tells where (which x or y values) to put ticks (this defaulted to c(0, 20000, 40000, 60000, 80000) in our original plot); and 3) labels, which tells what to write at those ticks.

When we specify custom axis labels, we probably want to shut off the automatic axis creation done by the original plot; we can do this by specifying xaxt = "n" (or similarly for yaxt). Let’s first demonstrate the effect of this:

teachers_wisc[ , plot(density(salary), main = "Density of Salaries",
                      xlab = "$", col = "darkgreen", lwd = 3,
                      xaxt = "n")]

If we follow a plot call with a call to axis, it will add the axis we specify to the “current” plot. Basically, the “current” plot is the one drawn the most recent time we ran plot (with some exceptions we won’t get into today). So let’s add an axis:

axis(side = 1, at = seq(0, 90000, by = 10000),
     labels = paste0("$", seq(0, 90, by = 10), "k"))
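In case it’s not obvious what that labels vector contains, we can inspect it on its own:

```r
# paste0 recycles the "$" and "k" across the whole sequence,
#   yielding "$0k", "$10k", ..., "$90k"
paste0("$", seq(0, 90, by = 10), "k")
```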

You’ll also notice the numbers on the y axis are hard to read. To be honest, I don’t think those numbers tell us much of anything, so I’m generally fine excluding the y axis on a density plot entirely:

teachers_wisc[ , plot(density(salary), main = "Density of Salaries",
                      xlab = "$", col = "darkgreen", lwd = 3,
                      xaxt = "n", yaxt = "n")]
axis(side = 1, at = seq(0, 90000, by = 10000),
     labels = paste0("$", seq(0, 90, by = 10), "k"))

4.1.2.4 Density Parameters

Just like with histograms, and perhaps even more so, kernel densities can be micromanaged to no end by specifying a number of parameters. See ?density for more, as I don’t think it’s worthwhile to spend much time on the nuanced differences between the estimators; here, I’ll just list a few options and some sample values.

  • bw: the bandwidth. As with breaks from hist, there are a few typical algorithms available out of the box. I’ll just name them here; see ?bandwidth for more: "nrd0", "nrd", "ucv", "bcv", "SJ". We can also specify the bandwidth as a number ourselves; the larger the bandwidth, the farther away points can be and still contribute to the estimate (so larger bandwidths produce smoother estimates).

  • kernel: the kernel used to accrue information from nearby points. These basically differ in how much information is ascribed to very-nearby points as opposed to slightly-further points. The default in R is the Gaussian (normal) kernel (kernel = "gaussian"), which means R differs from Stata, which uses the Epanechnikov kernel by default. The latter has some better statistical properties, but is less appealing for other reasons. Besides Gaussian, R also supports the following values for kernel: "epanechnikov", "rectangular", "triangular", "biweight", "cosine", and "optcosine".
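To see what these parameters do without endorsing any particular choice, here’s a minimal sketch on simulated data, overlaying an Epanechnikov estimate with a hand-picked (and deliberately large) bandwidth on top of the default; lines, like axis, draws onto the “current” plot:

```r
set.seed(1023)  # arbitrary seed, just for reproducibility
x <- c(rnorm(500, 30000, 4000), rnorm(500, 45000, 4000))  # fake bimodal "salaries"

d_default <- density(x)  # bw chosen by the default "nrd0" rule
d_wide    <- density(x, bw = 5000, kernel = "epanechnikov")

plot(d_default, main = "Effect of bw and kernel", lwd = 3, col = "darkgreen")
lines(d_wide, lwd = 3, col = "darkred")
```

The hand-picked bandwidth is much larger than what "nrd0" selects here, so the red curve is visibly smoother than the default.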

4.1.3 Scatterplots

Histograms and kernel densities are the main tools for graphically examining one-dimensional random variables.

Now, we’ll move on to examining joint distributions graphically. The main tool people use here is the scatter plot.

Let’s examine the relationship between teachers’ salary and fringe benefits. This would be an essential first step in trying to understand our data if we planned to run a regression involving these two variables. Since it’s not very interesting to see a scatterplot of ~65,000 points, we’ll focus more specifically on English teachers (distinguished by the variable area equaling 300):

english <- teachers_wisc[area == 300]

4.1.3.1 Barebones

Scatterplots are handled through the plot function. If we pass two vectors and say nothing else, plot assumes we want a scatterplot and treats the first vector as the “x” variable, and the second as the “y” variable. Other than that, business as usual, per above:

english[ , plot(salary, fringe, xlab = "Salary ($)",
                ylab = "Fringe Benefits ($)",
                main = "Fringe Benefits vs. Salary",
                col = "darkblue")]

Immediately, we can see that some outliers are going to be tainting any naive analysis we may perform on the data. After fixing this, we’ll learn one more general plotting trick, and one scatterplot-specific trick.

english[fringe<3e4, plot(salary, fringe, xlab = "Salary ($)",
                         ylab = "Fringe Benefits ($)",
                         main = "Fringe Benefits vs. Salary",
                         col = "darkblue")]

4.1.3.2 Axis Label Orientation

In some plots, having the y axis labels parallel to the y axis makes them harder to read. There is a parameter that controls the orientation of axis labels relative to each axis – las. It can take 4 possible values – the default, 0, means “always parallel”; 1 means “always horizontal”; 2 means “always perpendicular”; and 3 means “always vertical”. These numbers don’t really correspond to anything, so far as I can tell – in a given situation, I usually just plug in one or two of the options to see what works best. Here, we want las = 1:

english[fringe<3e4, plot(salary, fringe, xlab = "Salary ($)",
                         ylab = "Fringe Benefits ($)",
                         main = "Fringe Benefits vs. Salary",
                         col = "darkblue", las = 1)]

#This makes the y axis label conflict
#  with the tick labels; there's another way
#  to deal with this, but for now we can just abbreviate
#  the axis labels like we did above
english[fringe<3e4, plot(salary, fringe, xlab = "Salary ($)",
                         ylab = "Fringe Benefits ($)",
                         main = "Fringe Benefits vs. Salary",
                         col = "darkblue", yaxt = "n")]
#Since we suppressed the y axis in plot, changing las
#  there wouldn't matter (since the x axis is
#  unaffected by our choice of las); to handle
#  the orientation here, we have to send our
#  las value of choice to the axis function
axis(side = 2, at = 5000*(0:5), las = 1,
     labels = paste0("$", 5*(0:5), "k"))

4.1.3.3 Conditional Coloring

One awesome thing about scatterplots is that we can set the color of each point individually if we so choose, by supplying a vector to col that is equal in length to our data. Fully arbitrary point-by-point colors are rare; much more common is to set the color of each point according to some condition. The easiest way to do this is with ifelse (this is inefficient, and I hate ifelse, but it will suffice for today). Let’s color the points by the degree that a teacher has.

english[fringe<3e4, plot(salary, fringe, xlab = "Salary ($)",
                         ylab = "Fringe Benefits ($)",
                         main = "Fringe Benefits vs. Salary", yaxt = "n",
                         col = ifelse(highest_degree == 4, "darkred", "darkblue"))]
axis(side = 2, at = 5000*(0:5), las = 1,
     labels = paste0("$", 5*(0:5), "k"))

4.1.3.4 Legends

The above plot clearly demonstrates that there is a strong correlation between the degree you have, the fringe benefits you receive, and your salary. But it’s not at all clear to a casual observer what the difference is between red and blue dots (Republican vs. Democrat?). To elucidate this, we’ll add a legend. You guessed right – the function we need is legend. legend, like axis, is a plotting function that adds its output to the “current” plot. legend takes a bunch of parameters. The first (or first and second) specifies the position (either a relative position, e.g. "top" or "bottomright", or, when specifying x and y, the exact coordinates of the box). The legend argument specifies the labels that appear in the legend.

In our case, we also need the pch argument, which tells the point character to associate with each entry (since we’re making a legend about points and not lines). The default (open circles) is pch = 1. Here is a chart of all the possible numeric values of pch and their associated plotting symbol (we can also set pch to any single character, e.g., pch = "x" or pch = "$"):

pch

legend("topleft", pch = 1,
       legend = c("Bachelor's", "Master's"), 
       #make sure to make the first color
       #  correspond to the first item in legend, etc.
       col = c("darkred", "darkblue"))

4.1.4 Line Plots

The next very common plot type is line plots. Most often I use these for time series. Line plots are also created with plot, but with a different option activated – we have to set type = "l".

We’ll look at how pay changes with experience. I’ll also introduce a new option for use inside [] in data.table: keyby. keyby is just like by – it tells data.table how to group the operation you ask for in j, with the added bonus that the resultant output is sorted by the column(s) in keyby. We require this for creating a line plot because, when plotting lines, R considers successive elements of the input to plot as endpoints of a line segment – so if the input has the “x” values out of order, the result will jump all over the screen. Let’s demonstrate:

#Create the average pay by experience level
teachers_wisc[ , mean(salary), by = total_exp]
##     total_exp       V1
##  1:       2.0 27939.43
##  2:      14.0 43214.71
##  3:       8.5 36623.25
##  4:       3.5 29290.36
##  5:       9.0 37472.05
##  6:      10.5 38771.31
##  7:       8.0 35874.94
##  8:      11.0 39882.17
##  9:      30.0 50009.76
## 10:      22.0 47629.35
## 11:      17.0 45231.46
## 12:      19.0 45954.67
## 13:       1.0 25787.18
## 14:      16.0 44828.16
## 15:      12.0 40629.91
## 16:      20.0 46839.10
## 17:       5.0 31790.32
## 18:      19.5 47670.21
## 19:      21.0 47046.23
## 20:       3.0 29222.76
## 21:      25.0 48829.72
## 22:      23.0 48097.09
## 23:       7.0 34340.35
## 24:       6.0 32877.20
## 25:      10.0 38643.73
## 26:      15.0 43997.41
## 27:      28.0 50174.52
## 28:      26.0 49163.91
## 29:      29.0 49327.19
## 30:      24.0 48221.33
## 31:      27.0 49673.50
## 32:      18.0 45777.87
## 33:      26.5 50390.47
## 34:       4.0 30795.40
## 35:      13.0 41843.66
## 36:      23.5 49427.25
## 37:       6.5 34378.83
## 38:      25.5 49214.40
## 39:       1.5 26772.91
## 40:      20.5 47701.77
## 41:      29.5 50819.22
## 42:      22.5 49248.42
## 43:      24.5 48752.19
## 44:      21.5 47647.44
## 45:      13.5 41760.95
## 46:       2.5 27983.53
## 47:       9.5 37868.50
## 48:       4.5 31068.15
## 49:      28.5 50647.09
## 50:      27.5 51242.26
## 51:      11.5 41130.22
## 52:      12.5 41097.27
## 53:      17.5 46761.14
## 54:      15.5 44543.61
## 55:      16.5 45558.84
## 56:      18.5 46618.63
## 57:      14.5 43890.86
## 58:       5.5 32467.58
## 59:       7.5 35376.18
##     total_exp       V1
#We then build on this output with a plot
teachers_wisc[ , mean(salary), by = total_exp
               ][ , plot(total_exp, V1, type = "l")]

A total mess! Because total_exp was out of order. Let’s fix that with keyby:

teachers_wisc[ , mean(salary), keyby = total_exp
               ][ , plot(total_exp, V1, type = "l", 
                         ylab = "Salary ($)", col = "blue",
                         xlab = "Total Experience", lwd = 3,
                         main = "Evolution of Pay with Experience")]

Looking much better. I don’t like those jumps. I assume it’s due to some fractional values of experience, which may have a smaller sample size. To fix this, we might force experience to be an integer; let’s use the floor, for example:

teachers_wisc[ , mean(salary), keyby = .(exp = floor(total_exp))
               ][ , plot(exp, V1, type = "l", ylab = "Salary ($)", 
                         col = "blue", xlab = "Total Experience", lwd = 3,
                         main = "Evolution of Pay with Experience")]

I suspect the remaining jitters are due to most teachers retiring by 30 years’ experience (so the selection effects become powerful enough to distort experience effects).
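Before moving on, the ordering difference between by and keyby is easy to see on a toy table, without any plotting (a minimal sketch – dt and its columns are made up):

```r
library(data.table)

dt <- data.table(g = c(3, 1, 2, 1, 3), v = 1:5)

#by returns groups in the order they're first encountered...
dt[ , sum(v), by = g]$g      # 3 1 2
#...while keyby sorts the result by the grouping column
dt[ , sum(v), keyby = g]$g   # 1 2 3
```

This is exactly why the first line plot jumped around: plot connected the points in encounter order, not x order.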

One more thing you might want to change about plots in general is the axis limits. These are controlled by the xlim and ylim parameters, which by default are chosen to bracket the range of your data. However, we might want to force the y axis to include 0, for example. Both xlim and ylim take a length-two numeric vector – the first element being the minimum of the axis, the second the maximum.

#force 0 at the bottom, then set the top
#  to be the highest average salary
teachers_wisc[ , mean(salary), keyby = .(exp = floor(total_exp))
               ][ , plot(exp, V1, type = "l", ylab = "Salary ($)", 
                         col = "blue", xlab = "Total Experience", 
                         lwd = 3, ylim = c(0, max(V1)),
                         main = "Evolution of Pay with Experience")]

We’ll get into some more facets of line plotting later.

4.1.5 Bar Plots

Bar plots have a few common uses. Their simplest use is to demonstrate the relationship between a discrete variable and a summary (a mean, a median, a count) of a continuous variable. Let’s start by showing the average pay by CESA. A CESA (Cooperative Educational Service Agency) is basically an administrative unit of school districts sitting between the state level and the district level. Here’s a map of the 12 CESAs of Wisconsin:

CESA Map

4.1.5.1 Categorical Bar Plot (1D)

The function used to create a bar plot (R is full of surprises) is barplot. It can be pretty confusing to use (as we’ll see below), but for this example will be simple. We need to supply a vector to the first argument, height, which will be used as plot heights. We can also supply a character vector to names.arg which will help us distinguish the bars from one another. That should suffice for now:

teachers_wisc[ , mean(salary), keyby = cesa
               ][ , barplot(V1, names.arg = cesa, 
                            col = "cornflowerblue",
                            main = "Average Pay in Each CESA")]

We could also associate colors to each CESA, as was done on the map:

cols <- c("01" = "khaki", "02" = "darkred",
          "03" = "darkgreen", "04" = "goldenrod",
          "05" = "cornflowerblue", "06" = "magenta",
          "07" = "navy", "08" = "salmon",
          "09" = "steelblue4", "10" = "lavender",
          "11" = "tomato", "12" = "wheat")
teachers_wisc[ , mean(salary), keyby = cesa
               ][ , barplot(V1, names.arg = cesa, col = cols[cesa],
                            main = "Average Pay in Each CESA")]

If you’re going to be analyzing things by CESA consistently in your paper, you probably want to create a color scheme for consistency throughout your paper – cols as used here is a great approach.

4.1.5.2 Categorical Bar Plot (2D)

Another highly useful but somewhat more complicated version of the bar plot is the stacked bar plot, which serves much the same purpose as the much-hated pie chart.

Basically, we want to show the breakdown of some variable within each of several categories. Let’s stick with CESA as our category, and see what Master’s certification rates are like in each.

In this case, we need to supply a matrix to height. Each column of the matrix will represent one bar. The values within each column will correspond to (cumulative) heights within each bar. To create this matrix, we’ll use rbind, which binds vectors together as rows (there’s also cbind for columns). Let’s demonstrate:

#We want to know the prevalence of
#  masters' certification within each CESA
teachers_wisc[ , .(bach = mean(highest_degree == 4)), keyby = cesa]
##     cesa      bach
##  1:   01 0.6167300
##  2:   02 0.6286683
##  3:   03 0.7002237
##  4:   04 0.6257831
##  5:   05 0.6548516
##  6:   06 0.6684145
##  7:   07 0.7225452
##  8:   08 0.7987013
##  9:   09 0.5936842
## 10:   10 0.6929825
## 11:   11 0.6927874
## 12:   12 0.6932476
teachers_wisc[ , .(bach = mean(highest_degree == 4)), 
               keyby = cesa
               #1-bach is the masters' incidence,
               #  since there are only two degree
               #  types in our subsetted data
               ][ , barplot(rbind(1-bach, bach), names.arg = cesa,
                            col = c("red", "blue"), las = 1,
                            ylab = "Proportion",
                            main = "Certification Rates by CESA")]

In this case, col is applied to the rows of the matrix – the segments within each bar – not to the bars themselves. Since each column has two rows, we can only specify two colors. This is somewhat unfortunate; see this rather intricate approach for a way to get around this problem.
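For intuition, here are rbind and cbind in isolation, with made-up vectors:

```r
#rbind binds vectors together as rows: this makes a 2x3 matrix
m1 <- rbind(first = 1:3, second = 4:6)
dim(m1)  # 2 3
#cbind binds them as columns instead: 3x2
m2 <- cbind(first = 1:3, second = 4:6)
dim(m2)  # 3 2
```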

4.1.5.3 Side-by-Side Bar Plot

Just as common as stacking breakdowns within a category is comparing them with side-by-side bars. This is almost identical to the previous case, except we add the argument beside = TRUE. Let’s use this to see ethnicity by CESA. We’ll have to take a slightly more complicated approach here, since not every ethnicity is represented in every CESA.

We’ll also need a new function – as.matrix – which can be used to convert an object into a matrix. In particular, if we use as.matrix on a data.table, it will convert each column of the data.table into a column of the resulting matrix. If there are conflicts in data types within the data.table, they’ll all be coerced, since a matrix can only store one type of data. Typically this means that if there’s a character column, everything will be converted to character.
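A quick toy sketch of that coercion behavior (assumes data.table is loaded; the table is made up):

```r
library(data.table)

DT <- data.table(id = c("a", "b"), x = 1:2)
m <- as.matrix(DT)
#the character column id forces the whole matrix to character,
#  so the numbers become strings
class(m[ , "x"])  # "character"
```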

#First, count by CESA & ethnicity;
#  note that in, e.g., CESA 12, there are
#  no black teachers
teachers_wisc[ , .N, keyby = .(cesa, ethnicity)]
##     cesa ethnicity     N
##  1:   01         A   102
##  2:   01         B  1437
##  3:   01         H   388
##  4:   01         I    40
##  5:   01         W 15929
##  6:   02         A    52
##  7:   02         B   122
##  8:   02         H    93
##  9:   02         I    16
## 10:   02         W 10928
## 11:   03         H     2
## 12:   03         I     1
## 13:   03         W  2232
## 14:   04         A     6
## 15:   04         B     2
## 16:   04         H     8
## 17:   04         W  3017
## 18:   05         A     2
## 19:   05         H     6
## 20:   05         I     1
## 21:   05         W  4505
## 22:   06         A    27
## 23:   06         B     2
## 24:   06         H     9
## 25:   06         W  7209
## 26:   07         A    20
## 27:   07         B     6
## 28:   07         H    26
## 29:   07         I     4
## 30:   07         W  6136
## 31:   08         A     3
## 32:   08         B     1
## 33:   08         H     2
## 34:   08         I    36
## 35:   08         W  2114
## 36:   09         A    11
## 37:   09         B     1
## 38:   09         H     2
## 39:   09         I     5
## 40:   09         W  2831
## 41:   10         A     9
## 42:   10         H     3
## 43:   10         I     4
## 44:   10         W  3290
## 45:   11         A     1
## 46:   11         H     3
## 47:   11         I     1
## 48:   11         W  3683
## 49:   12         A     2
## 50:   12         H     3
## 51:   12         I    19
## 52:   12         W  1531
##     cesa ethnicity     N
#We'll now reshape the result of this wide so that
#  the columns are CESAs, since that's how we'll
#  plot it afterwards
dcast(teachers_wisc[ , .N, keyby = .(cesa, ethnicity)],
      ethnicity ~ cesa, value.var = "N")
##    ethnicity    01    02   03   04   05   06   07   08   09   10   11   12
## 1:         A   102    52   NA    6    2   27   20    3   11    9    1    2
## 2:         B  1437   122   NA    2   NA    2    6    1    1   NA   NA   NA
## 3:         H   388    93    2    8    6    9   26    2    2    3    3    3
## 4:         I    40    16    1   NA    1   NA    4   36    5    4    1   19
## 5:         W 15929 10928 2232 3017 4505 7209 6136 2114 2831 3290 3683 1531
#Uh-oh! NAs are no good. We'll need a new argument for
#  dcast -- fill, which tells dcast what to do
#  when a certain combination is missing:
dcast(teachers_wisc[ , .N, keyby = .(cesa, ethnicity)],
      ethnicity ~ cesa, value.var = "N", fill = 0)
##    ethnicity    01    02   03   04   05   06   07   08   09   10   11   12
## 1:         A   102    52    0    6    2   27   20    3   11    9    1    2
## 2:         B  1437   122    0    2    0    2    6    1    1    0    0    0
## 3:         H   388    93    2    8    6    9   26    2    2    3    3    3
## 4:         I    40    16    1    0    1    0    4   36    5    4    1   19
## 5:         W 15929 10928 2232 3017 4505 7209 6136 2114 2831 3290 3683 1531
#Now we've got a format we can work with, that
#  barplot understands
dcast(teachers_wisc[ , .N, keyby = .(cesa, ethnicity)],
      ethnicity ~ cesa, value.var = "N", fill = 0
      )[ , barplot(as.matrix(.SD[ , -1, with = FALSE]),
                   main = "Ethnicity of Teachers by CESA",
                   beside = TRUE, names.arg = names(.SD)[-1],
                   col = c("blue", "red", "darkgreen",
                           "orange", "purple"))]
#Add a legend for clarity
legend("topright", legend = c("Asian", "Black", "Hispanic",
                              "Indian", "White"),
       #pch 15 is a solid block
       pch = 15, col = c("blue", "red", "darkgreen",
                           "orange", "purple"))

(Wisconsin isn’t very ethnically diverse…)

4.1.6 Box and Whisker Plots

Box and whisker plots provide a graphical version of the five-number summary for a distribution. This is especially useful for comparing the distribution of a continuous variable with respect to a discrete variable. Above, we looked at average salary and how it changes by CESA. But this might be obscuring some more interesting action going on underneath the surface. Are the different averages being driven by skewing in one direction or the other in some CESAs? A boxplot will evince this for us graphically.

In a typical show of originality, a boxplot is generated by the boxplot command. There are a few ways to specify your boxplot; I find the approach using a formula to be most intuitive. boxplot can be deployed within the j argument to data.table, as above, but doing so is somewhat troublesome for reasons I don’t really want to get into. Instead we’ll use the data argument to boxplot; this tells boxplot where to find the salary and cesa variables we specify in our formula:

boxplot(salary ~ cesa, data = teachers_wisc,
        main = "Distribution of Wages by CESA",
        xlab = "CESA", ylab = "Salary ($)", col = cols)

boxplot automatically singles out the outliers in the data. However, I am unsatisfied with this plot. I don’t like the orientation, and there is a simple argument we can add to make it much more informative.

I think this plot is much easier to read if the bars are stacked vertically. To do this, we use the horizontal argument:

boxplot(salary ~ cesa, data = teachers_wisc,
        main = "Distribution of Wages by CESA",
        horizontal = TRUE, las = 1, ylab = "CESA",
        #note we have to switch to xlab and ylab
        xlab = "Salary ($)", col = cols)

boxplot also comes with a handy argument for displaying confidence intervals around the sample medians – notch. Observe:

boxplot(salary ~ cesa, data = teachers_wisc,
        main = "Distribution of Wages by CESA",
        horizontal = TRUE, las = 1, ylab = "CESA",
        xlab = "Salary ($)", col = cols, notch = TRUE)

4.1.7 Multiple Plots

Often we need to combine more than one plot at once to concisely present more information. This comes in two basic flavors – adding more points/plots/lines to a single set of axes, and creating multiple sets of axes with different plots in the same output. The former is easier to handle (it’s basically a matter of learning a few new functions and managing not to create a new plot window by accident).

4.1.7.1 More Information on the Same Axes

Let’s return to the wage-tenure curve highlighted above:

#force 0 at the bottom, then set the top
#  to be the highest average salary
teachers_wisc[ , mean(salary), keyby = .(exp = floor(total_exp))
               ][ , plot(exp, V1, type = "l", ylab = "Salary ($)", 
                         col = "blue", xlab = "Total Experience", 
                         lwd = 3, ylim = c(0, max(V1)),
                         main = "Evolution of Pay with Experience")]

This shows the mean wage at each experience level. It could also be instructive to include lines denoting some quantiles of the wage distribution at each experience level, say the 25th and 75th percentiles. We’ll show two ways to do this.

4.1.7.1.1 lines

The first is a minor step beyond what we did above. This entails adding lines to the existing plot using the lines function. lines cannot be called to start a plot – it can only be called to add to an existing plot. And it can only draw lines.

We’ll also demonstrate here the use of curly braces {} within j. Surrounding our statements in j with curly braces allows us to execute more than one line of code there. Basically, when we use braces, j is taken to be everything between them.

teachers_wisc[ , .(avg = mean(salary), 
                   #include the other quantities of interest
                   q25 = quantile(salary, .25),
                   q75 = quantile(salary, .75)), 
               keyby = .(exp = floor(total_exp))
               ][ , {#start the braces. The first statement is to plot
                 plot(exp, avg, type = "l", ylab = "Salary ($)", 
                      col = "blue", xlab = "Total Experience", 
                      #also need to adjust the y limits since
                      #  the high quantile is likely to
                      #  exceed the average
                      lwd = 3, ylim = c(0, max(q75)),
                      main = "Evolution of Pay with Experience")
                 #now we can start our second statement within j
                 lines(exp, q25)
                 lines(exp, q75)
               }]

This looks OK, but I’m not satisfied with the quantile lines. I’d prefer them dashed rather than solid. We can do this with a new argument – lty, which specifies the line type. Like pch, it for the most part takes integer arguments. Here are the possibilities illustrated:

lty

teachers_wisc[ , .(avg = mean(salary),
                   q25 = quantile(salary, .25),
                   q75 = quantile(salary, .75)), 
               keyby = .(exp = floor(total_exp))
               ][ , {
                 plot(exp, avg, type = "l", ylab = "Salary ($)", 
                      col = "blue", xlab = "Total Experience",
                      lwd = 3, ylim = c(0, max(q75)),
                      main = "Evolution of Pay with Experience")
                 lines(exp, q25, lwd = 3, col = "blue", lty = 2)
                 lines(exp, q75, lwd = 3, col = "blue", lty = 2)
               }]

4.1.7.1.2 matplot

The other approach introduces a new function, matplot. This operates like barplot in that it accepts a matrix argument and turns each column into a new plot object. The main advantage of this over using lines is conciseness and extensibility. If we wanted to add 15 lines to a plot, it would be nightmarish to repetitively type lines over and over (yes, there is a way to do this in a for loop, but that will necessarily be ugly and hard to wrangle with). matplot allows us to plot all of the lines at once. Let’s see it in action:

teachers_wisc[ , .(avg = mean(salary),
                   q25 = quantile(salary, .25),
                   q75 = quantile(salary, .75)), 
               keyby = .(exp = floor(total_exp))
               #the [ , !"exp", with = FALSE] syntax
               #  tells data.table to drop the column "exp"
               ][ , matplot(exp, .SD[ , !"exp", with = FALSE],
                            type = "l", xlab = "Total Experience",
                            #we have to specify one lty for each
                            #  line we add (order matters)
                            ylab = "Salary ($)", lty = c(1, 2, 2), 
                            lwd = 3, col = "blue",
                            main = "Evolution of Pay with Experience")]
#Specify top-to-bottom order so it jibes with what we see in the plot
legend("topleft", legend = c("Q75", "Mean", "Q25"),
       lty = c(2, 1, 2), lwd = 3, col = "blue")

4.1.7.1.3 Adding Vertical/Horizontal/Trend Lines

There is one more function that is worth mentioning since it comes up so often – abline (so named because it’s designed to add lines of the form a + b*x, i.e., in slope-intercept form). Primarily, I use this to add vertical, horizontal, and trend lines to a plot. Let’s return to our scatterplot from before, plotting fringe benefits vs. salary. Let’s decorate this plot with some extra lines. Like lines, points, legend, and axis, abline adds objects to the current plot.

english[fringe<3e4, plot(salary, fringe, xlab = "Salary ($)",
                         ylab = "Fringe Benefits ($)",
                         main = "Fringe Benefits vs. Salary",
                         col = "darkblue", yaxt = "n")]
axis(side = 2, at = 5000*(0:5), las = 1,
     labels = paste0("$", 5*(0:5), "k"))
#add a vertical line at the average salary
english[fringe<3e4, abline(v = mean(salary),
                           #and a horizontal line at average fringe
                           h = mean(fringe),
                           #and a trend line as determined by the results
                           #  of the regression of fringe on salary
                           reg = lm(fringe ~ salary),
                           #emphasize the trend line
                           lwd = c(3, 1, 1))]
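To see the slope-intercept (a, b) form itself, here’s a self-contained sketch with simulated data (the numbers are made up):

```r
set.seed(1)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)
plot(x, y)
#a is the intercept and b the slope: this draws y = 2 + 0.5*x
abline(a = 2, b = 0.5, lty = 2)
#reg accepts a fitted model, so the trend line is one call away
fit <- lm(y ~ x)
abline(reg = fit, lwd = 2, col = "darkblue")
```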

4.1.7.2 Multiple Axis Sets

Plotting on more than one pair of axes in a single figure can get very messy and involved, if perfection is what you’re after. The basic approach consists of setting an option before plotting.

First, a quick aside to talk about plotting metaparameters. Every time we create a plot, R draws from a list of 72 (yes, SEVENTY-TWO) parameters dictating subtle things about how the plot is presented, from the x:y ratio to the current plot limits to the font and scaling of the axes. If you have a few hours on your hands, try to pore through ?par. To this day, I’m only familiar with about 30% of these parameters.

That said, there are a few that are crucial to be aware of if you want to create more fine-tuned, perfectionist plots in your research. The first involves the figure’s margins. In all of the plots above, we implicitly allowed the default plot margins to reign supreme. This caused a bit of trouble when the y axis label smushed up against the y axis ticks. Rather than abbreviating the tick labels, we could have given more space to the left-hand-side margin. Earl Glynn has provided this excellent page with a few graphics that I have stared at time and time again (pro tip: Google “oma r” and it’s the first result). Here’s the one explicating margins in simple plots:

oma

Both oma and mar are specified as four numbers, each representing one side of the plot. By default, oma is 0 on all sides; it’s only really useful for tiling plots. mar, on the other hand, has the cryptic defaults of c(5.1, 4.1, 4.1, 2.1). I got nothing as far as an explanation goes – this is just something I’ve committed to memory. As with axis, they proceed clockwise starting from the bottom. So if we wanted to give some extra space to the margin next to the y axis, we would probably use par(mar = c(5.1, 5.1, 4.1, 2.1)).
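Since par settings persist for the rest of your session, it’s good practice to save what you change and restore it afterwards – par invisibly returns the old values of whatever you set. A minimal sketch:

```r
#widen the left margin from the default 4.1 to 5.1 lines
op <- par(mar = c(5.1, 5.1, 4.1, 2.1))
plot(1:10, seq(0, 1e6, length.out = 10), las = 1,
     ylab = "Salary ($)")
#op holds the prior value of mar, so this puts it back
par(op)
```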

The next most common parameter to adjust is mfcol. This stands for “multiple figures, column-wise” (see this site of mnemonics created by Paul Murrell for an attempt to create mnemonics for all of the par parameters). It is a length-two vector specifying how many rows (first element) and columns (second element) the total figure will have. If mfcol = c(2, 2), we fill the figure by making four calls to plot. The first will go in the upper-left; the second in the lower-left; the third in the upper-right; and the fourth in the lower-right. Earl Glynn’s site is again handy:

mfcol

Let’s make a wage-tenure scatter plot overlaid with per-experience average wages for four Wisconsin districts: Milwaukee, Madison, Green Bay, and Manitowoc. Here’s a first pass:

#set main plot parameter to create 2x2 figure
par(mfcol = c(2, 2))
#first plot: Milwaukee scatter
teachers_wisc[agency_name == "Milwaukee Sch Dist",
              plot(total_exp, salary, main = "Milwaukee")]
#add by-tenure average using lines
teachers_wisc[agency_name == "Milwaukee Sch Dist",
              mean(salary), keyby = .(exp = floor(total_exp))
              ][ , lines(exp, V1, lwd = 3, col = "red")]
#now repeat for the other districts
teachers_wisc[agency_name == "Madison Metropolitan Sch Dist",
              plot(total_exp, salary, main = "Madison")]
teachers_wisc[agency_name == "Madison Metropolitan Sch Dist",
              mean(salary), keyby = .(exp = floor(total_exp))
              ][ , lines(exp, V1, lwd = 3, col = "red")]

teachers_wisc[agency_name == "Green Bay Area Sch Dist",
              plot(total_exp, salary, main = "Green Bay")]
teachers_wisc[agency_name == "Green Bay Area Sch Dist",
              mean(salary), keyby = .(exp = floor(total_exp))
              ][ , lines(exp, V1, lwd = 3, col = "red")]

teachers_wisc[agency_name == "Manitowoc Sch Dist",
              plot(total_exp, salary, main = "Manitowoc")]
teachers_wisc[agency_name == "Manitowoc Sch Dist",
              mean(salary), keyby = .(exp = floor(total_exp))
              ][ , lines(exp, V1, lwd = 3, col = "red")]

This is fine, but it leaves something to be desired. First, the plots are hard to compare since they’re on different scales. Second, we had to spend a lot of time writing repetitive code. We should be able to generate this plot with a for loop. Third, look at all the wasted white space! The plot makes inefficient use of the space allotted. Fourth, in a similar vein, we don’t need all of these axes. Once we fix the same scale on all four plots, we can infer the axes of plots from those to the left/right or above/below.

All of these are somewhat simple to take care of, but it’ll probably look intimidating at a first pass. The idea is to crush the margins of each individual figure to save on white space, and to put all of the relevant information in the outer margins. We’ll also add some bonus information to each sub-plot – the number of teachers at each district, prettily formatted.

dists <- c("Milwaukee", "Madison Metropolitan",
           "Green Bay Area", "Manitowoc")
#get full range of salaries to use as axis limits:
yl <- teachers_wisc[ , range(salary)]
par(mfcol = c(2, 2), 
    mar = c(0, 0, 0, 0),
    oma = c(5.1, 4.1, 4.1, 2.1))
#seq_along dists is the same as 1:length(dists)
for (ii in seq_along(dists)){
  #get the name of the current district
  dd <- dists[ii]
  teachers_wisc[agency_name == paste(dd, "Sch Dist"),
                plot(total_exp, salary,
                     #we'll use the title function to 
                     #  add the main title; also
                     #  set the ylimits so they're common
                     #  across charts
                     main = "", ylim = yl,
                     #exclude x axes from plots 1 & 3
                     xaxt = if (ii %% 2 == 0) "s" else "n",
                     #exclude y axes from plots 3 & 4
                     yaxt = if (ii <= 2) "s" else "n")]
  teachers_wisc[agency_name == paste(dd, "Sch Dist"),
                mean(salary), keyby = .(exp = floor(total_exp))
                ][ , lines(exp, V1, lwd = 3, col = "red")]
  
  #count teachers
  N_teach <- teachers_wisc[agency_name == paste(dd, "Sch Dist"), .N]
  
  #line = -1 forces the title to come down closer to the x axis
  #prettyNum prettifies numbers, in particular adding
  #  comma-separation (1000 -> 1,000)
  title(paste0(dd, " (", prettyNum(N_teach, big.mark = ","),
               " teachers)"), line = -1)
}

#Now for some bonus fun! Adding margin text and a main title
#  we specify outer = TRUE to make sure that we mean to use
#  the OUTER margin instead of the CURRENT FIGURE's margins.
#  line controls how far away from the main plot body the
#  text is written
mtext(side = 1, text = "Total Experience", outer = TRUE, line = 3)
mtext(side = 2, text = "Salary ($)", outer = TRUE, line = 2.5)
title(main = "Wage-Tenure Profiles in Wisconsin Districts", outer = TRUE)

4.2 ggplot

I don’t know how to use it really, but many of my R-using friends swear by ggplot2. It is the plotting method/aesthetic of choice for Nate Silver’s fivethirtyeight.com, a site about quantitative analysis of politics, sports, etc. Here’s an example:

538 Style

There are whole textbooks dedicated to how to use ggplot. Search around on Google for some examples and see if you like the aesthetic (I don’t), in which case it may be worth exploring and investing some time in learning how to use it. This book was written by the package author (Hadley Wickham, who in fact used this package as a substantial portion of his PhD thesis), and there’s a free chapter on graphs from Winston Chang’s R Graphics Cookbook.

4.3 LaTeX Tables: texreg/xtable/stargazer

I don’t think we’ll have time today to get into LaTeX tables, especially since I don’t think much of the audience has ever heard of LaTeX (for those who haven’t – look it up when you get a chance, it’s awesome). There are three main packages – texreg (for the output of regression models), xtable (for other tables), and stargazer (I’ve honestly never used it).

For the TeX-savvy, see this Q&A.

5 Web Scraping: rvest (1:30 PM - 2:30 PM)

A lot of data that we find is seemingly locked out of our grasp – we find it on a website, ripe for absorbing, but we’re estopped from reaping the fruits of those sweet sweet raw data seeds by a lack of access to the raw data in a readily-handled format. Tables on sites like Wikipedia, the Census Bureau, ESPN, you name it, all tantalize the data analyst who hasn’t added the tools of web scraping to their belt.

Web scraping is basically the process of extracting data from websites. I have mainly used this for toy side projects – scraping data about baseball statistics, Supreme Court judges, swing states, etc.

Scraping in R is done almost entirely through the rvest package. If I’m being honest, it’s not very fully fleshed out, since it’s still quite a young package, but I’ve been able to make it work in every situation I’ve encountered so far. Python users are blessed with web-scraping libraries that are mature and robust – though to be honest I found them much harder to pick up than rvest. Installing rvest:

install.packages("rvest")

CAVEAT SCRAPER: It’s not always legal to scrape a website. Amazon, for example, includes in its terms of service that using automated scrapers is against policy. Something to be aware of.

Also, too much scraping can get your IP address banned and/or throttled. This means that you’ll be denied access to a website, perhaps permanently, if you generate too much traffic.

The proper approach for longer-term, more intense scraping operations is thus to download the HTML file associated with each web page, then read from that file, rather than loading the page over and over again, which causes undue stress on companies’/organizations’ servers (i.e., you want to avoid anything looking like a DDoS attack).
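One way to follow that advice is a “cache-first” pattern: save the page to disk on first contact, and parse the local copy thereafter. A sketch (the URL is a placeholder, and the page content is faked here so the example is self-contained):

```r
cache_file <- file.path(tempdir(), "rankings.html")
if (!file.exists(cache_file)) {
  #in real use, this would hit the server exactly once:
  #  download.file("http://example.com/rankings", cache_file)
  #we fake the downloaded page instead
  writeLines("<html><body><table><tr><td>1</td></tr></table></body></html>",
             cache_file)
}
#rvest's read_html accepts a local file path just like a URL, so
#  later runs never touch the server:
#  page <- read_html(cache_file)
raw <- paste(readLines(cache_file), collapse = "\n")
```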

All that said, web scraping is powerful and (to me) fun. This is basically what’s being done when you hear people talking about a sentiment analysis of Facebook or Twitter.

All web scraping starts with finding a page with data on it.

Let’s try and glean some data from U.S. News & World Report’s National Universities Rankings. Again we have to give R a moment’s rest.

<<R sips coffee>>

Web scraping illustrates one of the best features about Google Chrome – the Inspect view for trying to target information about key elements of web pages.

We’re targeting the main table of information, so right-click it, and hit “Inspect”. This will bring up the Inspection Toolbar which gives a surfeit of information about the HTML tree of the current web page.

If you hover over an HTML element in the inspection view, the output on the web page that it corresponds to is highlighted. You can expand/contract nodes of the HTML tree by clicking the arrows next to the ones visible. Clicking an arrow pointing down will collapse its “children”; clicking an arrow pointing right will expand its “children”.

After hovering around a bit, we see that the following item, when we hover over it, appears to encompass the entire table of interest:

<table class="ranking-data table table-lighthead table-packed ranking-data-free ">

It’s no coincidence that this item is a <table>. The <table> tag in HTML will come up almost constantly in your quests to scrape the web, so make note.

5.1 xpath and Selectors

It’s great that we found the correct node, but this is often of little use per se. We need a way to tell an HTML parser (which is basically what all web scraping is – parsing HTML) how to find the node we’re interested in.

There are two sort-of languages used for specifying HTML/XML nodes. The first is xpath. If you’ve got the time, it’s really worth learning the xpath syntax, which will open the door to becoming a web scraping wizard (to date I have only a superficial grasp of the subject).

Happily, we don’t really need to know anything about how xpath works, because Google Chrome is awesome. We can simply right click the XML element in the Inspection sidebar and hit Copy -> Copy XPath, and the xpath associated with that element is copied to our clipboard. For example, the xpath of our table is: //*[@id="article"]/table, which basically says that this <table> tag is the child of another tag (of any sort – *) which has the HTML attribute id equal to "article". Indeed, scanning up the HTML tree, we see: <div id="article" class="article">. These two characteristics (being a <table> and being the child of a tag with id="article") are enough to uniquely identify our object of interest in this page.

An alternative to xpath is CSS Selectors. They’re generally very similar to xpath – which you use is a matter of personal preference, and I prefer xpath, so we’ll stick with that for today. See this page for more on CSS selectors.
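To make the comparison concrete, here’s a minimal sketch on a made-up HTML fragment mimicking the structure we found on the U.S. News page; the xpath call and the CSS selector call pick out the same node.

```r
library(rvest)

#a made-up fragment mimicking the structure
#  we found above: a <table> inside a tag
#  with id="article"
doc <- read_html('<div id="article"><table><tr><td>x</td></tr></table></div>')

#xpath version: any tag (*) with id="article",
#  then its <table> child
html_node(doc, xpath = '//*[@id="article"]/table')

#CSS selector version of the same thing:
#  # means "has this id", > means "child of"
html_node(doc, css = "#article > table")
```

Both calls return the same `xml_node`, so for this page the choice really is just preference.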

5.2 Scraping U.S. News

OK, enough front-end lead-up, where’s the payoff?

We’ve now got all the ingredients necessary to scrape this page successfully. Before we do that, I want to introduce one more thing – the infamous pipe of R: %>%.

%>% is originally found in the magrittr package, but has permeated throughout the hadleyverse, and is particularly in fashion in code associated with rvest. The pipe sends the output of the thing to the left of %>% to the function on the right of it. In particular, it’s sent as the first argument to that function unless we specify otherwise. Some quick examples to show what I mean:

#the pipe is one of the functions
#  exported by rvest, so we need to 
#  load it first
library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
#convoluted version of cbind(1:3, 4:6)
1:3 %>% cbind(4:6)
##      .  
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
#we can also be explicit about where to send
#  the output by using its alias on the RHS, .
1:3 %>% cbind(., 4:6)
##      .  
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
1:3 %>% cbind(4:6, .)
##        .
## [1,] 4 1
## [2,] 5 2
## [3,] 6 3
#something slightly more interesting;
#  note we write sort with empty
#  parentheses (magrittr also accepts
#  a bare sort here)
rnorm(10) %>% sort()
##  [1] -1.73870282 -1.70163777 -1.19474078 -0.84678196  0.04950394
##  [6]  0.37545330  0.72647934  1.38989501  1.54481210  2.02377989

OK, enough build-up. Let’s scrape!

#I always store the page URL
#  as its own variable
URL <- paste0("http://colleges.usnews.rankingsandreviews.com/",
              "best-colleges/rankings/national-universities/data")

#now we break out the rvest weapons
#  read_html takes to the web and 
#  retrieves the web page at the
#  address you give it (i.e., it's
#  doing the first part of what
#  Chrome does when you enter a URL
#  in your browser search bar).
#  It returns the XML tree of the page.
pg <- read_html(URL)

#Now that we've got the XML tree, we
#  have to find the node we were after
#  from before. To do this, we use
#  the html_node function, which
#  has an xpath option, to which
#  we can supply the xpath from above --
#  making sure that we're careful
#  about quotation marks.
xp <- '//*[@id="article"]/table'
pg %>% html_node(xpath = xp)
## {xml_node}
## <table>
## [1] <thead>\n        \n        \n        <tr>\n          \n          \n  ...
## [2] <tbody>\n        <tr valign="top" class="row-first row-odd table-row ...
#This has returned an `xml_node` corresponding
#  to our table. We just need one more function 
#  and we'll really be cooking -- html_table.
#  html_table tries to parse a table from
#  HTML form into a data.frame so that we can
#  start dealing with it like we would other
#  objects read from CSV, Excel, etc.
pg %>% html_node(xpath = xp) %>%
  html_table()
##                                                                                           U.S. News rank
## 1     #1\n                \n\n            \n            \n                Overall Score: 100 out of 100.
## 2      #2\n                \n\n            \n            \n                Overall Score: 99 out of 100.
## 3      #3\n                \n\n            \n            \n                Overall Score: 97 out of 100.
## 4   #4Tie\n                \n\n            \n            \n                Overall Score: 95 out of 100.
## 5   #4Tie\n                \n\n            \n            \n                Overall Score: 95 out of 100.
## 6   #4Tie\n                \n\n            \n            \n                Overall Score: 95 out of 100.
## 7      #7\n                \n\n            \n            \n                Overall Score: 93 out of 100.
## 8      #8\n                \n\n            \n            \n                Overall Score: 92 out of 100.
## 9      #9\n                \n\n            \n            \n                Overall Score: 91 out of 100.
## 10 #10Tie\n                \n\n            \n            \n                Overall Score: 90 out of 100.
## 11 #10Tie\n                \n\n            \n            \n                Overall Score: 90 out of 100.
## 12 #12Tie\n                \n\n            \n            \n                Overall Score: 89 out of 100.
## 13 #12Tie\n                \n\n            \n            \n                Overall Score: 89 out of 100.
## 14    #14\n                \n\n            \n            \n                Overall Score: 85 out of 100.
## 15 #15Tie\n                \n\n            \n            \n                Overall Score: 84 out of 100.
## 16 #15Tie\n                \n\n            \n            \n                Overall Score: 84 out of 100.
## 17 #15Tie\n                \n\n            \n            \n                Overall Score: 84 out of 100.
## 18 #18Tie\n                \n\n            \n            \n                Overall Score: 82 out of 100.
## 19 #18Tie\n                \n\n            \n            \n                Overall Score: 82 out of 100.
## 20    #20\n                \n\n            \n            \n                Overall Score: 77 out of 100.
## 21 #21Tie\n                \n\n            \n            \n                Overall Score: 76 out of 100.
## 22 #21Tie\n                \n\n            \n            \n                Overall Score: 76 out of 100.
## 23 #23Tie\n                \n\n            \n            \n                Overall Score: 74 out of 100.
## 24 #23Tie\n                \n\n            \n            \n                Overall Score: 74 out of 100.
## 25 #23Tie\n                \n\n            \n            \n                Overall Score: 74 out of 100.
##                                                                                          School
## 1                    Princeton University \n                  \n                  Princeton, NJ
## 2                      Harvard University \n                  \n                  Cambridge, MA
## 3                         Yale University \n                  \n                  New Haven, CT
## 4                      Columbia University \n                  \n                  New York, NY
## 5                      Stanford University \n                  \n                  Stanford, CA
## 6                     University of Chicago \n                  \n                  Chicago, IL
## 7   Massachusetts Institute of Technology \n                  \n                  Cambridge, MA
## 8                            Duke University \n                  \n                  Durham, NC
## 9           University of Pennsylvania \n                  \n                  Philadelphia, PA
## 10      California Institute of Technology \n                  \n                  Pasadena, CA
## 11               Johns Hopkins University \n                  \n                  Baltimore, MD
## 12                        Dartmouth College \n                  \n                  Hanover, NH
## 13                 Northwestern University \n                  \n                  Evanston, IL
## 14                      Brown University \n                  \n                  Providence, RI
## 15                        Cornell University \n                  \n                  Ithaca, NY
## 16                  Vanderbilt University \n                  \n                  Nashville, TN
## 17     Washington University in St. Louis \n                  \n                  St. Louis, MO
## 18                          Rice University \n                  \n                  Houston, TX
## 19              University of Notre Dame \n                  \n                  Notre Dame, IN
## 20       University of California—​Berkeley \n                  \n                  Berkeley, CA
## 21                         Emory University \n                  \n                  Atlanta, GA
## 22                 Georgetown University \n                  \n                  Washington, DC
## 23            Carnegie Mellon University \n                  \n                  Pittsburgh, PA
## 24 University of California—​Los Angeles \n                  \n                  Los Angeles, CA
## 25    University of Southern California \n                  \n                  Los Angeles, CA
##                            Tuition and Fees Total enrollment
## 1                                   $43,450            8,088
## 2                                   $45,278           19,929
## 3                                   $47,600           12,336
## 4                                   $51,008           24,221
## 5                                   $46,320           16,795
## 6                                   $50,193           12,558
## 7                                   $46,704           11,319
## 8                                   $49,341           15,856
## 9                                   $49,536           21,296
## 10                                  $45,390            2,209
## 11                                  $48,710           21,484
## 12                                  $49,506            6,298
## 13                                  $49,047           20,336
## 14                                  $49,346            9,181
## 15                                  $49,116           21,850
## 16                                  $43,838           12,686
## 17                                  $48,093           14,348
## 18                                  $42,253            6,621
## 19                                  $47,929           12,179
## 20 in-state: $13,432, out-of-state: $38,140           37,581
## 21                                  $46,314           14,769
## 22                                  $48,611           17,858
## 23                                  $50,410           13,285
## 24 in-state: $12,753, out-of-state: $35,631           43,239
## 25                                  $50,210           42,469
##    Fall 2014 acceptance rate Average freshman retention rate
## 1                       7.4%                             98%
## 2                         6%                             97%
## 3                       6.3%                             99%
## 4                         7%                             99%
## 5                       5.1%                             98%
## 6                       8.8%                             99%
## 7                       7.9%                             98%
## 8                      11.4%                             97%
## 9                      10.4%                             98%
## 10                      8.8%                             97%
## 11                       15%                             97%
## 12                     11.5%                             98%
## 13                     13.1%                             97%
## 14                      8.7%                             98%
## 15                     14.2%                             97%
## 16                     13.1%                             97%
## 17                     17.1%                             97%
## 18                     15.1%                             97%
## 19                     21.1%                             98%
## 20                       16%                             97%
## 21                     26.8%                             95%
## 22                     17.4%                             96%
## 23                     24.6%                             95%
## 24                     18.6%                             97%
## 25                       18%                             97%
##    6-year graduation rate Classes with fewer than 20 students
## 1                     97%                              Locked
## 2                     98%                              Locked
## 3                     96%                              Locked
## 4                     96%                              Locked
## 5                     95%                              Locked
## 6                     93%                              Locked
## 7                     91%                              Locked
## 8                     95%                              Locked
## 9                     96%                              Locked
## 10                    92%                              Locked
## 11                    94%                              Locked
## 12                    95%                              Locked
## 13                    94%                              Locked
## 14                    96%                              Locked
## 15                    93%                              Locked
## 16                    93%                              Locked
## 17                    95%                              Locked
## 18                    93%                              Locked
## 19                    96%                              Locked
## 20                    91%                              Locked
## 21                    91%                              Locked
## 22                    95%                              Locked
## 23                    88%                              Locked
## 24                    91%                              Locked
## 25                    91%                              Locked
##    SAT/ACT 25th-75th percentile
## 1                        Locked
## 2                        Locked
## 3                        Locked
## 4                        Locked
## 5                        Locked
## 6                        Locked
## 7                        Locked
## 8                        Locked
## 9                        Locked
## 10                       Locked
## 11                       Locked
## 12                       Locked
## 13                       Locked
## 14                       Locked
## 15                       Locked
## 16                       Locked
## 17                       Locked
## 18                       Locked
## 19                       Locked
## 20                       Locked
## 21                       Locked
## 22                       Locked
## 23                       Locked
## 24                       Locked
## 25                       Locked
#Now that we know it works, we can assign it to
#  an object by repeating the code and also 
#  sending it to setDT so we have a data.table
usnews <- pg %>% html_node(xpath = xp) %>%
  html_table() %>% setDT()

#First, the last two columns are only available
#  to paid subscribers, so let's just drop them:
usnews[ , (ncol(usnews) - 1):ncol(usnews) := NULL]

#The table is a bit of a mess; we'll start
#  by assigning more manageable names:
setnames(usnews, c("rank", "name", "tuition_fees",
                   "enrollment", "acceptance_2014", 
                   "retention_freshman", "grad_rate_6_yr"))[]
##                                                                                                      rank
##  1:    #1\n                \n\n            \n            \n                Overall Score: 100 out of 100.
##  2:     #2\n                \n\n            \n            \n                Overall Score: 99 out of 100.
##  3:     #3\n                \n\n            \n            \n                Overall Score: 97 out of 100.
##  4:  #4Tie\n                \n\n            \n            \n                Overall Score: 95 out of 100.
##  5:  #4Tie\n                \n\n            \n            \n                Overall Score: 95 out of 100.
##  6:  #4Tie\n                \n\n            \n            \n                Overall Score: 95 out of 100.
##  7:     #7\n                \n\n            \n            \n                Overall Score: 93 out of 100.
##  8:     #8\n                \n\n            \n            \n                Overall Score: 92 out of 100.
##  9:     #9\n                \n\n            \n            \n                Overall Score: 91 out of 100.
## 10: #10Tie\n                \n\n            \n            \n                Overall Score: 90 out of 100.
## 11: #10Tie\n                \n\n            \n            \n                Overall Score: 90 out of 100.
## 12: #12Tie\n                \n\n            \n            \n                Overall Score: 89 out of 100.
## 13: #12Tie\n                \n\n            \n            \n                Overall Score: 89 out of 100.
## 14:    #14\n                \n\n            \n            \n                Overall Score: 85 out of 100.
## 15: #15Tie\n                \n\n            \n            \n                Overall Score: 84 out of 100.
## 16: #15Tie\n                \n\n            \n            \n                Overall Score: 84 out of 100.
## 17: #15Tie\n                \n\n            \n            \n                Overall Score: 84 out of 100.
## 18: #18Tie\n                \n\n            \n            \n                Overall Score: 82 out of 100.
## 19: #18Tie\n                \n\n            \n            \n                Overall Score: 82 out of 100.
## 20:    #20\n                \n\n            \n            \n                Overall Score: 77 out of 100.
## 21: #21Tie\n                \n\n            \n            \n                Overall Score: 76 out of 100.
## 22: #21Tie\n                \n\n            \n            \n                Overall Score: 76 out of 100.
## 23: #23Tie\n                \n\n            \n            \n                Overall Score: 74 out of 100.
## 24: #23Tie\n                \n\n            \n            \n                Overall Score: 74 out of 100.
## 25: #23Tie\n                \n\n            \n            \n                Overall Score: 74 out of 100.
##                                                                                                      rank
##                                                                                             name
##  1:                   Princeton University \n                  \n                  Princeton, NJ
##  2:                     Harvard University \n                  \n                  Cambridge, MA
##  3:                        Yale University \n                  \n                  New Haven, CT
##  4:                     Columbia University \n                  \n                  New York, NY
##  5:                     Stanford University \n                  \n                  Stanford, CA
##  6:                    University of Chicago \n                  \n                  Chicago, IL
##  7:  Massachusetts Institute of Technology \n                  \n                  Cambridge, MA
##  8:                           Duke University \n                  \n                  Durham, NC
##  9:          University of Pennsylvania \n                  \n                  Philadelphia, PA
## 10:      California Institute of Technology \n                  \n                  Pasadena, CA
## 11:               Johns Hopkins University \n                  \n                  Baltimore, MD
## 12:                        Dartmouth College \n                  \n                  Hanover, NH
## 13:                 Northwestern University \n                  \n                  Evanston, IL
## 14:                      Brown University \n                  \n                  Providence, RI
## 15:                        Cornell University \n                  \n                  Ithaca, NY
## 16:                  Vanderbilt University \n                  \n                  Nashville, TN
## 17:     Washington University in St. Louis \n                  \n                  St. Louis, MO
## 18:                          Rice University \n                  \n                  Houston, TX
## 19:              University of Notre Dame \n                  \n                  Notre Dame, IN
## 20:       University of California—​Berkeley \n                  \n                  Berkeley, CA
## 21:                         Emory University \n                  \n                  Atlanta, GA
## 22:                 Georgetown University \n                  \n                  Washington, DC
## 23:            Carnegie Mellon University \n                  \n                  Pittsburgh, PA
## 24: University of California—​Los Angeles \n                  \n                  Los Angeles, CA
## 25:    University of Southern California \n                  \n                  Los Angeles, CA
##                                                                                             name
##                                 tuition_fees enrollment acceptance_2014
##  1:                                  $43,450      8,088            7.4%
##  2:                                  $45,278     19,929              6%
##  3:                                  $47,600     12,336            6.3%
##  4:                                  $51,008     24,221              7%
##  5:                                  $46,320     16,795            5.1%
##  6:                                  $50,193     12,558            8.8%
##  7:                                  $46,704     11,319            7.9%
##  8:                                  $49,341     15,856           11.4%
##  9:                                  $49,536     21,296           10.4%
## 10:                                  $45,390      2,209            8.8%
## 11:                                  $48,710     21,484             15%
## 12:                                  $49,506      6,298           11.5%
## 13:                                  $49,047     20,336           13.1%
## 14:                                  $49,346      9,181            8.7%
## 15:                                  $49,116     21,850           14.2%
## 16:                                  $43,838     12,686           13.1%
## 17:                                  $48,093     14,348           17.1%
## 18:                                  $42,253      6,621           15.1%
## 19:                                  $47,929     12,179           21.1%
## 20: in-state: $13,432, out-of-state: $38,140     37,581             16%
## 21:                                  $46,314     14,769           26.8%
## 22:                                  $48,611     17,858           17.4%
## 23:                                  $50,410     13,285           24.6%
## 24: in-state: $12,753, out-of-state: $35,631     43,239           18.6%
## 25:                                  $50,210     42,469             18%
##                                 tuition_fees enrollment acceptance_2014
##     retention_freshman grad_rate_6_yr
##  1:                98%            97%
##  2:                97%            98%
##  3:                99%            96%
##  4:                99%            96%
##  5:                98%            95%
##  6:                99%            93%
##  7:                98%            91%
##  8:                97%            95%
##  9:                98%            96%
## 10:                97%            92%
## 11:                97%            94%
## 12:                98%            95%
## 13:                97%            94%
## 14:                98%            96%
## 15:                97%            93%
## 16:                97%            93%
## 17:                97%            95%
## 18:                97%            93%
## 19:                98%            96%
## 20:                97%            91%
## 21:                95%            91%
## 22:                96%            95%
## 23:                95%            88%
## 24:                97%            91%
## 25:                97%            91%
##     retention_freshman grad_rate_6_yr
#There's still some mess to be cleaned up. Time to 
#  break out the regex skills from this morning:
usnews[ , score := 
          gsub(".*Overall Score: ([0-9]*) out of.*", "\\1", rank)]
usnews[ , rank := gsub("#([0-9]*).*", "\\1", rank)]
usnews[ , name := gsub("\n.*", "", name)]
usnews[ , enrollment := 
          as.integer(gsub(",", "", enrollment, fixed = TRUE))]
usnews[ , acceptance_2014 := 
          as.numeric(gsub("%", "", acceptance_2014, fixed = TRUE))]
usnews[ , retention_freshman := 
          as.numeric(gsub("%", "", retention_freshman, fixed = TRUE))]
usnews[ , grad_rate_6_yr := 
          as.numeric(gsub("%", "", grad_rate_6_yr, fixed = TRUE))]

#What remains is the issue of tuition_fees:
usnews[]
##     rank                                   name
##  1:    1                  Princeton University 
##  2:    2                    Harvard University 
##  3:    3                       Yale University 
##  4:    4                   Columbia University 
##  5:    4                   Stanford University 
##  6:    4                 University of Chicago 
##  7:    7 Massachusetts Institute of Technology 
##  8:    8                       Duke University 
##  9:    9            University of Pennsylvania 
## 10:   10    California Institute of Technology 
## 11:   10              Johns Hopkins University 
## 12:   12                     Dartmouth College 
## 13:   12               Northwestern University 
## 14:   14                      Brown University 
## 15:   15                    Cornell University 
## 16:   15                 Vanderbilt University 
## 17:   15    Washington University in St. Louis 
## 18:   18                       Rice University 
## 19:   18              University of Notre Dame 
## 20:   20     University of California—​Berkeley 
## 21:   21                      Emory University 
## 22:   21                 Georgetown University 
## 23:   23            Carnegie Mellon University 
## 24:   23  University of California—​Los Angeles 
## 25:   23     University of Southern California 
##     rank                                   name
##                                 tuition_fees enrollment acceptance_2014
##  1:                                  $43,450       8088             7.4
##  2:                                  $45,278      19929             6.0
##  3:                                  $47,600      12336             6.3
##  4:                                  $51,008      24221             7.0
##  5:                                  $46,320      16795             5.1
##  6:                                  $50,193      12558             8.8
##  7:                                  $46,704      11319             7.9
##  8:                                  $49,341      15856            11.4
##  9:                                  $49,536      21296            10.4
## 10:                                  $45,390       2209             8.8
## 11:                                  $48,710      21484            15.0
## 12:                                  $49,506       6298            11.5
## 13:                                  $49,047      20336            13.1
## 14:                                  $49,346       9181             8.7
## 15:                                  $49,116      21850            14.2
## 16:                                  $43,838      12686            13.1
## 17:                                  $48,093      14348            17.1
## 18:                                  $42,253       6621            15.1
## 19:                                  $47,929      12179            21.1
## 20: in-state: $13,432, out-of-state: $38,140      37581            16.0
## 21:                                  $46,314      14769            26.8
## 22:                                  $48,611      17858            17.4
## 23:                                  $50,410      13285            24.6
## 24: in-state: $12,753, out-of-state: $35,631      43239            18.6
## 25:                                  $50,210      42469            18.0
##                                 tuition_fees enrollment acceptance_2014
##     retention_freshman grad_rate_6_yr score
##  1:                 98             97   100
##  2:                 97             98    99
##  3:                 99             96    97
##  4:                 99             96    95
##  5:                 98             95    95
##  6:                 99             93    95
##  7:                 98             91    93
##  8:                 97             95    92
##  9:                 98             96    91
## 10:                 97             92    90
## 11:                 97             94    90
## 12:                 98             95    89
## 13:                 97             94    89
## 14:                 98             96    85
## 15:                 97             93    84
## 16:                 97             93    84
## 17:                 97             95    84
## 18:                 97             93    82
## 19:                 98             96    82
## 20:                 97             91    77
## 21:                 95             91    76
## 22:                 96             95    76
## 23:                 95             88    74
## 24:                 97             91    74
## 25:                 97             91    74
##     retention_freshman grad_rate_6_yr score
#We have to account for the fact that tuition is 
#  sometimes listed as in- and sometimes as 
#  out-of-state. We have three options that I see:
#  1) Pick one or the other and delete the other
#  2) Average the two
#  3) Create two columns -- in- and out-of-state tuition
#  We'll go with number 3.

#We'll have to introduce some new tricks to get this to work.
#  Let's do this step by step, then go all at once.

## Get _only_ the tuition_fees values which differentiate
##   between in- and out-of-state tuition:
in_out <- usnews[grepl("in-state", tuition_fees), tuition_fees]
in_out
## [1] "in-state: $13,432, out-of-state: $38,140"
## [2] "in-state: $12,753, out-of-state: $35,631"
## Now, split each element of this string into pieces.
##   We can do this with the strsplit function, which
##   takes a character vector and an argument split.
##   split is a regular expression saying where to
##   split each element of the string. In our current
##   case, we'll split at the colons following in-state
##   and out-of-state, as well as the comma preceding
##   out-of-state. We can't just split on commas, though,
##   since the tuition values themselves contain commas.
##   What distinguishes the comma we want (and, as luck
##   would have it, the colons as well) is that they're
##   followed by a space, which we'll leverage to split it:
in_out_split <- strsplit(in_out, split = "[,:] ")
in_out_split
## [[1]]
## [1] "in-state"     "$13,432"      "out-of-state" "$38,140"     
## 
## [[2]]
## [1] "in-state"     "$12,753"      "out-of-state" "$35,631"
## Next, we want to extract only the numbers from this split.
##   To do this, we extract only the 2nd and 4th elements, and then
##   (as above) remove the dollar sign and comma. We want to
##   return a list since the right hand side of := has to be a list.
in_out_num_l <- lapply(in_out_split, function(io)
  as.integer(gsub("[$,]", "", io[c(2, 4)])))
in_out_num_l
## [[1]]
## [1] 13432 38140
## 
## [[2]]
## [1] 12753 35631
## Finally, one more trick. := expects each _element_ of
##   the list it receives to correspond to one _column_.
##   Right now, this is backwards (it would be more
##   obvious if the list weren't 2x2...), since the first element
##   has one in-state and one out-of-state value (and the same
##   for the second element). What we need is for the first
##   element to have BOTH in-state values, and the second
##   element to have BOTH out-of-state values. data.table
##   has a tool designed for just such a situation:
##   transpose. Basically, transpose flips the nesting of
##   a list. Right now, the list has the list elements
##   corresponding to different ROWS and the elements
##   WITHIN a list element corresponding to different COLUMNS
##   If we transpose this, we'll get list elements
##   corresponding to different COLUMNS and the elements
##   WITHIN a list element corresponding to different ROWS,
##   which is just what we want.
transpose(in_out_num_l)
## [[1]]
## [1] 13432 12753
## 
## [[2]]
## [1] 38140 35631
## Now, we can do all of that at once like so:
usnews[grepl("in-state", tuition_fees),
       paste0("tuition_", c("in", "out"), "_state") :=
       transpose(lapply(strsplit(tuition_fees, split = "[:,] "),
               function(io) as.integer(gsub("[$,]", "", io[c(2, 4)]))))]

#And to finish off, we assign the same value to in- and 
#  out-of-state tuition for those (private) universities which
#  don't make a distinction:
usnews[is.na(tuition_in_state), 
       paste0("tuition_", c("in", "out"), "_state") :=
         #since we haven't quite fixed the punctuation in
         #  the other rows yet, we do so here as an 
         #  intermediate step
       {x <- as.integer(gsub("[$,]", "", tuition_fees))
         .(x, x)}][]
##     rank                                   name
##  1:    1                  Princeton University 
##  2:    2                    Harvard University 
##  3:    3                       Yale University 
##  4:    4                   Columbia University 
##  5:    4                   Stanford University 
##  6:    4                 University of Chicago 
##  7:    7 Massachusetts Institute of Technology 
##  8:    8                       Duke University 
##  9:    9            University of Pennsylvania 
## 10:   10    California Institute of Technology 
## 11:   10              Johns Hopkins University 
## 12:   12                     Dartmouth College 
## 13:   12               Northwestern University 
## 14:   14                      Brown University 
## 15:   15                    Cornell University 
## 16:   15                 Vanderbilt University 
## 17:   15    Washington University in St. Louis 
## 18:   18                       Rice University 
## 19:   18              University of Notre Dame 
## 20:   20     University of California—​Berkeley 
## 21:   21                      Emory University 
## 22:   21                 Georgetown University 
## 23:   23            Carnegie Mellon University 
## 24:   23  University of California—​Los Angeles 
## 25:   23     University of Southern California 
##     rank                                   name
##                                 tuition_fees enrollment acceptance_2014
##  1:                                  $43,450       8088             7.4
##  2:                                  $45,278      19929             6.0
##  3:                                  $47,600      12336             6.3
##  4:                                  $51,008      24221             7.0
##  5:                                  $46,320      16795             5.1
##  6:                                  $50,193      12558             8.8
##  7:                                  $46,704      11319             7.9
##  8:                                  $49,341      15856            11.4
##  9:                                  $49,536      21296            10.4
## 10:                                  $45,390       2209             8.8
## 11:                                  $48,710      21484            15.0
## 12:                                  $49,506       6298            11.5
## 13:                                  $49,047      20336            13.1
## 14:                                  $49,346       9181             8.7
## 15:                                  $49,116      21850            14.2
## 16:                                  $43,838      12686            13.1
## 17:                                  $48,093      14348            17.1
## 18:                                  $42,253       6621            15.1
## 19:                                  $47,929      12179            21.1
## 20: in-state: $13,432, out-of-state: $38,140      37581            16.0
## 21:                                  $46,314      14769            26.8
## 22:                                  $48,611      17858            17.4
## 23:                                  $50,410      13285            24.6
## 24: in-state: $12,753, out-of-state: $35,631      43239            18.6
## 25:                                  $50,210      42469            18.0
##                                 tuition_fees enrollment acceptance_2014
##     retention_freshman grad_rate_6_yr score tuition_in_state
##  1:                 98             97   100            43450
##  2:                 97             98    99            45278
##  3:                 99             96    97            47600
##  4:                 99             96    95            51008
##  5:                 98             95    95            46320
##  6:                 99             93    95            50193
##  7:                 98             91    93            46704
##  8:                 97             95    92            49341
##  9:                 98             96    91            49536
## 10:                 97             92    90            45390
## 11:                 97             94    90            48710
## 12:                 98             95    89            49506
## 13:                 97             94    89            49047
## 14:                 98             96    85            49346
## 15:                 97             93    84            49116
## 16:                 97             93    84            43838
## 17:                 97             95    84            48093
## 18:                 97             93    82            42253
## 19:                 98             96    82            47929
## 20:                 97             91    77            13432
## 21:                 95             91    76            46314
## 22:                 96             95    76            48611
## 23:                 95             88    74            50410
## 24:                 97             91    74            12753
## 25:                 97             91    74            50210
##     retention_freshman grad_rate_6_yr score tuition_in_state
##     tuition_out_state
##  1:             43450
##  2:             45278
##  3:             47600
##  4:             51008
##  5:             46320
##  6:             50193
##  7:             46704
##  8:             49341
##  9:             49536
## 10:             45390
## 11:             48710
## 12:             49506
## 13:             49047
## 14:             49346
## 15:             49116
## 16:             43838
## 17:             48093
## 18:             42253
## 19:             47929
## 20:             38140
## 21:             46314
## 22:             48611
## 23:             50410
## 24:             35631
## 25:             50210
##     tuition_out_state
#And now we can delete the old column
usnews[ , tuition_fees := NULL][]
##     rank                                   name enrollment acceptance_2014
##  1:    1                  Princeton University        8088             7.4
##  2:    2                    Harvard University       19929             6.0
##  3:    3                       Yale University       12336             6.3
##  4:    4                   Columbia University       24221             7.0
##  5:    4                   Stanford University       16795             5.1
##  6:    4                 University of Chicago       12558             8.8
##  7:    7 Massachusetts Institute of Technology       11319             7.9
##  8:    8                       Duke University       15856            11.4
##  9:    9            University of Pennsylvania       21296            10.4
## 10:   10    California Institute of Technology        2209             8.8
## 11:   10              Johns Hopkins University       21484            15.0
## 12:   12                     Dartmouth College        6298            11.5
## 13:   12               Northwestern University       20336            13.1
## 14:   14                      Brown University        9181             8.7
## 15:   15                    Cornell University       21850            14.2
## 16:   15                 Vanderbilt University       12686            13.1
## 17:   15    Washington University in St. Louis       14348            17.1
## 18:   18                       Rice University        6621            15.1
## 19:   18              University of Notre Dame       12179            21.1
## 20:   20     University of California—​Berkeley       37581            16.0
## 21:   21                      Emory University       14769            26.8
## 22:   21                 Georgetown University       17858            17.4
## 23:   23            Carnegie Mellon University       13285            24.6
## 24:   23  University of California—​Los Angeles       43239            18.6
## 25:   23     University of Southern California       42469            18.0
##     rank                                   name enrollment acceptance_2014
##     retention_freshman grad_rate_6_yr score tuition_in_state
##  1:                 98             97   100            43450
##  2:                 97             98    99            45278
##  3:                 99             96    97            47600
##  4:                 99             96    95            51008
##  5:                 98             95    95            46320
##  6:                 99             93    95            50193
##  7:                 98             91    93            46704
##  8:                 97             95    92            49341
##  9:                 98             96    91            49536
## 10:                 97             92    90            45390
## 11:                 97             94    90            48710
## 12:                 98             95    89            49506
## 13:                 97             94    89            49047
## 14:                 98             96    85            49346
## 15:                 97             93    84            49116
## 16:                 97             93    84            43838
## 17:                 97             95    84            48093
## 18:                 97             93    82            42253
## 19:                 98             96    82            47929
## 20:                 97             91    77            13432
## 21:                 95             91    76            46314
## 22:                 96             95    76            48611
## 23:                 95             88    74            50410
## 24:                 97             91    74            12753
## 25:                 97             91    74            50210
##     retention_freshman grad_rate_6_yr score tuition_in_state
##     tuition_out_state
##  1:             43450
##  2:             45278
##  3:             47600
##  4:             51008
##  5:             46320
##  6:             50193
##  7:             46704
##  8:             49341
##  9:             49536
## 10:             45390
## 11:             48710
## 12:             49506
## 13:             49047
## 14:             49346
## 15:             49116
## 16:             43838
## 17:             48093
## 18:             42253
## 19:             47929
## 20:             38140
## 21:             46314
## 22:             48611
## 23:             50410
## 24:             35631
## 25:             50210
##     tuition_out_state

That wasn’t so bad, was it?

Well, we’re not quite done yet. Observe the following:

print(usnews, quote = TRUE)
##       rank                                     name   enrollment
##  1:    "1"                  "Princeton University "      " 8088"
##  2:    "2"                    "Harvard University "      "19929"
##  3:    "3"                       "Yale University "      "12336"
##  4:    "4"                   "Columbia University "      "24221"
##  5:    "4"                   "Stanford University "      "16795"
##  6:    "4"                 "University of Chicago "      "12558"
##  7:    "7" "Massachusetts Institute of Technology "      "11319"
##  8:    "8"                       "Duke University "      "15856"
##  9:    "9"            "University of Pennsylvania "      "21296"
## 10:   "10"    "California Institute of Technology "      " 2209"
## 11:   "10"              "Johns Hopkins University "      "21484"
## 12:   "12"                     "Dartmouth College "      " 6298"
## 13:   "12"               "Northwestern University "      "20336"
## 14:   "14"                      "Brown University "      " 9181"
## 15:   "15"                    "Cornell University "      "21850"
## 16:   "15"                 "Vanderbilt University "      "12686"
## 17:   "15"    "Washington University in St. Louis "      "14348"
## 18:   "18"                       "Rice University "      " 6621"
## 19:   "18"              "University of Notre Dame "      "12179"
## 20:   "20"     "University of California—​Berkeley "      "37581"
## 21:   "21"                      "Emory University "      "14769"
## 22:   "21"                 "Georgetown University "      "17858"
## 23:   "23"            "Carnegie Mellon University "      "13285"
## 24:   "23"  "University of California—​Los Angeles "      "43239"
## 25:   "23"     "University of Southern California "      "42469"
##     "rank"                                   "name" "enrollment"
##       acceptance_2014   retention_freshman   grad_rate_6_yr   score
##  1:            " 7.4"                 "98"             "97"   "100"
##  2:            " 6.0"                 "97"             "98"    "99"
##  3:            " 6.3"                 "99"             "96"    "97"
##  4:            " 7.0"                 "99"             "96"    "95"
##  5:            " 5.1"                 "98"             "95"    "95"
##  6:            " 8.8"                 "99"             "93"    "95"
##  7:            " 7.9"                 "98"             "91"    "93"
##  8:            "11.4"                 "97"             "95"    "92"
##  9:            "10.4"                 "98"             "96"    "91"
## 10:            " 8.8"                 "97"             "92"    "90"
## 11:            "15.0"                 "97"             "94"    "90"
## 12:            "11.5"                 "98"             "95"    "89"
## 13:            "13.1"                 "97"             "94"    "89"
## 14:            " 8.7"                 "98"             "96"    "85"
## 15:            "14.2"                 "97"             "93"    "84"
## 16:            "13.1"                 "97"             "93"    "84"
## 17:            "17.1"                 "97"             "95"    "84"
## 18:            "15.1"                 "97"             "93"    "82"
## 19:            "21.1"                 "98"             "96"    "82"
## 20:            "16.0"                 "97"             "91"    "77"
## 21:            "26.8"                 "95"             "91"    "76"
## 22:            "17.4"                 "96"             "95"    "76"
## 23:            "24.6"                 "95"             "88"    "74"
## 24:            "18.6"                 "97"             "91"    "74"
## 25:            "18.0"                 "97"             "91"    "74"
##     "acceptance_2014" "retention_freshman" "grad_rate_6_yr" "score"
##       tuition_in_state   tuition_out_state
##  1:            "43450"             "43450"
##  2:            "45278"             "45278"
##  3:            "47600"             "47600"
##  4:            "51008"             "51008"
##  5:            "46320"             "46320"
##  6:            "50193"             "50193"
##  7:            "46704"             "46704"
##  8:            "49341"             "49341"
##  9:            "49536"             "49536"
## 10:            "45390"             "45390"
## 11:            "48710"             "48710"
## 12:            "49506"             "49506"
## 13:            "49047"             "49047"
## 14:            "49346"             "49346"
## 15:            "49116"             "49116"
## 16:            "43838"             "43838"
## 17:            "48093"             "48093"
## 18:            "42253"             "42253"
## 19:            "47929"             "47929"
## 20:            "13432"             "38140"
## 21:            "46314"             "46314"
## 22:            "48611"             "48611"
## 23:            "50410"             "50410"
## 24:            "12753"             "35631"
## 25:            "50210"             "50210"
##     "tuition_in_state" "tuition_out_state"
#See that space after name? It's gonna be a bugbear.
#Observe:
gsub("\\s", "", usnews$name)
##  [1] "PrincetonUniversity "               
##  [2] "HarvardUniversity "                 
##  [3] "YaleUniversity "                    
##  [4] "ColumbiaUniversity "                
##  [5] "StanfordUniversity "                
##  [6] "UniversityofChicago "               
##  [7] "MassachusettsInstituteofTechnology "
##  [8] "DukeUniversity "                    
##  [9] "UniversityofPennsylvania "          
## [10] "CaliforniaInstituteofTechnology "   
## [11] "JohnsHopkinsUniversity "            
## [12] "DartmouthCollege "                  
## [13] "NorthwesternUniversity "            
## [14] "BrownUniversity "                   
## [15] "CornellUniversity "                 
## [16] "VanderbiltUniversity "              
## [17] "WashingtonUniversityinSt.Louis "    
## [18] "RiceUniversity "                    
## [19] "UniversityofNotreDame "             
## [20] "UniversityofCalifornia—​Berkeley "   
## [21] "EmoryUniversity "                   
## [22] "GeorgetownUniversity "              
## [23] "CarnegieMellonUniversity "          
## [24] "UniversityofCalifornia—​LosAngeles " 
## [25] "UniversityofSouthernCalifornia "
#The problem is, that space at the end isn't just
#  any space. It's the dreaded NON-BREAKING SPACE
#  (you'll sometimes see this on websites or in
#   newspapers: &nbsp;). The \\s token in
#  regex doesn't recognize it as blank space.
#  We have to do something really ugly to get
#  rid of the non-breaking space. First, the magic:

usnews[ , name := gsub(intToUtf8(160), "", name)][]
##     rank                                  name enrollment acceptance_2014
##  1:    1                  Princeton University       8088             7.4
##  2:    2                    Harvard University      19929             6.0
##  3:    3                       Yale University      12336             6.3
##  4:    4                   Columbia University      24221             7.0
##  5:    4                   Stanford University      16795             5.1
##  6:    4                 University of Chicago      12558             8.8
##  7:    7 Massachusetts Institute of Technology      11319             7.9
##  8:    8                       Duke University      15856            11.4
##  9:    9            University of Pennsylvania      21296            10.4
## 10:   10    California Institute of Technology       2209             8.8
## 11:   10              Johns Hopkins University      21484            15.0
## 12:   12                     Dartmouth College       6298            11.5
## 13:   12               Northwestern University      20336            13.1
## 14:   14                      Brown University       9181             8.7
## 15:   15                    Cornell University      21850            14.2
## 16:   15                 Vanderbilt University      12686            13.1
## 17:   15    Washington University in St. Louis      14348            17.1
## 18:   18                       Rice University       6621            15.1
## 19:   18              University of Notre Dame      12179            21.1
## 20:   20     University of California—​Berkeley      37581            16.0
## 21:   21                      Emory University      14769            26.8
## 22:   21                 Georgetown University      17858            17.4
## 23:   23            Carnegie Mellon University      13285            24.6
## 24:   23  University of California—​Los Angeles      43239            18.6
## 25:   23     University of Southern California      42469            18.0
##     rank                                  name enrollment acceptance_2014
##     retention_freshman grad_rate_6_yr score tuition_in_state
##  1:                 98             97   100            43450
##  2:                 97             98    99            45278
##  3:                 99             96    97            47600
##  4:                 99             96    95            51008
##  5:                 98             95    95            46320
##  6:                 99             93    95            50193
##  7:                 98             91    93            46704
##  8:                 97             95    92            49341
##  9:                 98             96    91            49536
## 10:                 97             92    90            45390
## 11:                 97             94    90            48710
## 12:                 98             95    89            49506
## 13:                 97             94    89            49047
## 14:                 98             96    85            49346
## 15:                 97             93    84            49116
## 16:                 97             93    84            43838
## 17:                 97             95    84            48093
## 18:                 97             93    82            42253
## 19:                 98             96    82            47929
## 20:                 97             91    77            13432
## 21:                 95             91    76            46314
## 22:                 96             95    76            48611
## 23:                 95             88    74            50410
## 24:                 97             91    74            12753
## 25:                 97             91    74            50210
##     retention_freshman grad_rate_6_yr score tuition_in_state
##     tuition_out_state
##  1:             43450
##  2:             45278
##  3:             47600
##  4:             51008
##  5:             46320
##  6:             50193
##  7:             46704
##  8:             49341
##  9:             49536
## 10:             45390
## 11:             48710
## 12:             49506
## 13:             49047
## 14:             49346
## 15:             49116
## 16:             43838
## 17:             48093
## 18:             42253
## 19:             47929
## 20:             38140
## 21:             46314
## 22:             48611
## 23:             50410
## 24:             35631
## 25:             50210
##     tuition_out_state

So the space is gone, but what happened? It turns out that we can construct the &nbsp; character directly from its numeric code: 160 is the Unicode code point corresponding to &nbsp;. So intToUtf8(160) outputs the &nbsp; as a character understood by R and gsub, after which point we can delete it like we would any other character.
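
To see the mechanics in isolation, here's a standalone sketch (separate from the scraping pipeline; the sample string is made up):

```r
# intToUtf8() turns a Unicode code point into the character it names;
# utf8ToInt() inverts this. Code point 160 is the non-breaking space.
nbsp <- intToUtf8(160)
nchar(nbsp)        # a single character, just like an ordinary space
utf8ToInt(" ")     # an ordinary space, by contrast, is code point 32
# Once we hold the character itself, gsub can remove it like any other:
gsub(nbsp, "", paste0("Princeton University", nbsp))
```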

5.2.1 Extending

All that work for very little payoff, right?

Well, sort of. We got 9*25 = 225 data points from this scraping operation… which is not many, admittedly.

However, we can now re-use this code to scrape other pages of data from the U.S. News Site, since the format is basically the same (crucially, we’ll be able to re-use the same xpath to get the table on all the other pages). We can then stitch together the table we get from each page into a much bigger data set. The additional footwork required to go from scraping the first page to scraping all the pages is typically not much (unless there are major differences in format from one page to the next).
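
Before running the full loop, here's the stitching step on its own, with two tiny made-up tables standing in for the per-page results (the names and values are invented for illustration):

```r
library(data.table)

# Two stand-in "pages" of results, as a scraping loop might produce them
pages <- list(
  data.table(rank = 1:2, name = c("School A", "School B")),
  data.table(rank = 3:4, name = c("School C", "School D"))
)
# rbindlist stacks a list of tables into a single table, row-wise
rbindlist(pages)
```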

Let’s demonstrate:

#Only the first 8 pages contain
#  universities that are
#  actually ranked. We could have
#  determined this programmatically
#  (useful in case the paradigm later
#   changes, for example),
#  but I didn't see the point for today
URLS <- paste0(URL, "/page+", 1:8)

usnews <- 
  #rbindlist compiles a list of data.tables
  #  into a single data.table. This is particularly
  #  useful for situations where we create a bunch of
  #  tables in a loop, as is happening here
  rbindlist(lapply(URLS, function(uu){
    #for each page uu, repeat the above...
    usn <- read_html(uu) %>% html_node(xpath = xp) %>%
      html_table() %>% setDT()
    usn[ , (ncol(usn) - 1):ncol(usn) := NULL]
    setnames(usn, c("rank", "name", "tuition_fees",
                    "enrollment", "acceptance_2014", 
                    "retention_freshman", "grad_rate_6_yr"))
    usn[ , score := 
           gsub(".*Overall Score: ([0-9]*) out of.*", "\\1", rank)]
    usn[ , rank := gsub("#([0-9]*).*", "\\1", rank)]
    usn[ , name := gsub("\n.*", "", name)]
    usn[ , enrollment := 
           as.integer(gsub(",", "", enrollment, fixed = TRUE))]
    usn[ , acceptance_2014 := 
           as.numeric(gsub("%", "", acceptance_2014, fixed = TRUE))]
    #Well, slight adjustment here. There are some colleges where
    #  these fields have a footnote because it's calculated
    #  idiosyncratically. Everything follows the percent sign,
    #  so it's only a minor adjustment.
    usn[ , retention_freshman := 
           as.numeric(gsub("%.*", "", retention_freshman))]
    usn[ , grad_rate_6_yr := 
           as.numeric(gsub("%.*", "", grad_rate_6_yr))]
    usn[grepl("in-state", tuition_fees),
        paste0("tuition_", c("in", "out"), "_state") :=
          transpose(lapply(strsplit(tuition_fees, split = "[:,] "),
                           function(io) 
                             as.integer(gsub("[$,]", "", io[c(2, 4)]))))]
    usn[is.na(tuition_in_state), 
        paste0("tuition_", c("in", "out"), "_state") :=
        {x <- as.integer(gsub("[$,]", "", tuition_fees))
        .(x, x)}]
    usn[ , tuition_fees := NULL]
    usn[ , name := gsub(intToUtf8(160), "", name)]}))
    
#just like that! All done. Just some clean-up:
usnews <- usnews[!grepl("Rank Not", score)]
usnews
##      rank                                              name enrollment
##   1:    1                              Princeton University       8088
##   2:    2                                Harvard University      19929
##   3:    3                                   Yale University      12336
##   4:    4                               Columbia University      24221
##   5:    4                               Stanford University      16795
##  ---                                                                  
## 199:  199                                Ashland University       5737
## 200:  199 Indiana University-​Purdue University—​Indianapolis      30690
## 201:  199                         Louisiana Tech University      11225
## 202:  199                       New Mexico State University      15829
## 203:  199                     University of Colorado—​Denver      22791
##      acceptance_2014 retention_freshman grad_rate_6_yr score
##   1:             7.4                 98             97   100
##   2:             6.0                 97             98    99
##   3:             6.3                 99             96    97
##   4:             7.0                 99             96    95
##   5:             5.1                 98             95    95
##  ---                                                        
## 199:            71.9                 74             57    21
## 200:            70.5                 72             44    21
## 201:            65.0                 78             54    21
## 202:            69.6                 73             46    21
## 203:            73.0                 74             40    21
##      tuition_in_state tuition_out_state
##   1:            43450             43450
##   2:            45278             45278
##   3:            47600             47600
##   4:            51008             51008
##   5:            46320             46320
##  ---                                   
## 199:            20242             20242
## 200:             9056             29774
## 201:             8853             25851
## 202:             6729             21234
## 203:            10404             29334

6 Geospatial Tools (2:30 PM - 3:30 PM)

You could spend a whole semester learning about geospatial analysis, so this will necessarily be a shallow overview.

Geospatial analysis basically consists of incorporating geographic information into your approach to understanding a problem. Is college dropout related to how far a student lives from their school? Are there geographic discontinuities that we can leverage to find quasi-experimental results (boundary-induced exogeneities)? Is there geographic heterogeneity in the treatment effect of your RCT?

Answering all of these questions requires some familiarity/facility with the tools of geospatial analysis, only a smattering of which we’ll cover today. Let’s dive right in by briefly covering the main file type associated with GIS: the shapefile (.shp).

6.1 Shapefiles

A shapefile basically encapsulates a variety of information about a geography of interest – maybe the counties of Pennsylvania, maybe the school district boundaries in Iowa, maybe the rivers and lakes of Utah.

Shapefiles basically come in three varieties:

  1. Points: Consist of zero-dimensional spatial objects, like markers for cities, towns, places of interest, etc.
  2. Lines: Consist of one-dimensional spatial objects, like markers for rivers, roads, sewage lines, etc.
  3. Polygons: Consist of two-dimensional spatial objects, like catchment areas, oceans, country borders, etc.

The third is by far the most common/useful, though the first can also be helpful for enhancing plots. I’ve never used the second (they’re probably most useful to people doing research in the life sciences).

In addition to limning every detail of the boundaries of certain objects, point by point, .shp files are typically accompanied by a .dbf file, which associates some data with each shape – the population of that city, the number of students in that catchment area, etc.

6.1.1 Coordinate Reference Systems

One thing that’s very important for dealing empirically with shapefiles is to have at least a rudimentary understanding of what coordinate reference systems (CRSs) are, what the most common ones are, and how to tell what the CRS is of a shapefile you’ve got your hands on.

A CRS is basically an approximation of the three-dimensional surface of the Earth. When we first learn the shape of the Earth, we’re told it’s a sphere. And indeed approximating the surface of the Earth by a sphere isn’t a terrible idea. But if you measure the distance between two points on the approximating sphere, your measurement could, depending on where on the Earth they are, be badly wrong. For most purposes of concern to us, this doesn’t matter. But if you want to launch a rocket into space, for example, or fly a plane at high velocity halfway around the world, such inaccuracies could be dangerous, or at the very least costly.

As you advanced in school, perhaps you learned that the Earth is not a sphere but an oblate spheroid – basically a sphere compressed at the poles and stretched at the equator, ever-so-slightly distinct from a true sphere.

This, too, is only an approximation. Actually there’s no nice, neat shape that can be used to describe the surface of the Earth. That said, we only gain very minimally in accuracy beyond a certain point, while we trade off dearly in computational cost.

So instead of using an exact model of the Earth’s surface, typically what is done is to use either the most common CRS (the one used by Google Maps, for example), or an approximation of the Earth’s surface that is highly accurate for only a small part of it. The latter is the typical approach of state GIS departments – most maps of Philadelphia I’ve seen, for example, use a CRS that is most accurate in Southeastern Pennsylvania, a tiny fraction of the Earth’s surface.

We’ll return to this once we have some actual data in hand.
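To make this concrete, here’s a small sketch of how a few common CRSs are specified in R via the CRS function from the sp package (which we’ll install shortly); the EPSG codes are standard, but note that resolving them generally requires the rgdal package to be installed:

```r
library(sp)

#WGS84: plain longitude/latitude, used by GPS
CRS("+init=epsg:4326")

#NAD83: the longitude/latitude system used by
#  US Census shapefiles (as we'll see below)
CRS("+init=epsg:4269")

#NAD83 State Plane Pennsylvania South:
#  highly accurate, but only for southeastern PA
CRS("+init=epsg:2272")
```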

6.2 Loading PA District Boundaries

The library we need to read .shp files is maptools:

install.packages("maptools")

This package has three main reading functions – readShapePoints, readShapeLines, and readShapePoly. As you can imagine, the last is the most useful, and it’s what we must use to read in a .shp file with polygons:

library(maptools)
## Loading required package: sp
## Checking rgeos availability: TRUE
## 
## Attaching package: 'maptools'
## The following object is masked from 'package:sp':
## 
##     nowrapSpatialLines
pa_districts <- readShapePoly("data/pennsylvania_districts.shp")

#what have we got here?
class(pa_districts)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
#beautifully, we can just use the same
#  plot function we've been using thus
#  far and still get what we expect
plot(pa_districts, main = "PA School Districts",
     col = "seagreen")

#set max level because 
#  SpatialPolygonsDataFrame objects
#  are typically massive and it
#  would take about 10,000 lines to print
#  the full str for this one
str(pa_districts, max.level = 2)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  500 obs. of  16 variables:
##   .. ..- attr(*, "data_types")= chr [1:16] "C" "C" "C" "C" ...
##   ..@ polygons   :List of 500
##   .. .. [list output truncated]
##   ..@ plotOrder  : int [1:500] 86 323 123 159 250 460 456 90 430 85 ...
##   ..@ bbox       : num [1:2, 1:2] -80.5 39.7 -74.7 42.3
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot

Something’s new here – something we haven’t seen before: @.

6.2.1 S4 objects/methods

A SpatialPolygonsDataFrame is slightly different from the things we’ve seen so far. It’s an S4 object, whereas everything we’ve been using up to now has been an S3 object. The distinction is unlikely to matter to you any time soon.

Practically speaking, the difference between an S4 object and an S3 object is that we use @ to extract components instead of $. Let’s experiment a bit with pa_districts:
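As an aside, here’s a minimal, self-contained illustration of @ in action, using a toy S4 class of our own (the class and slot names here are made up purely for demonstration):

```r
#define a toy S4 class with two slots
setClass("school", representation(name = "character",
                                  enrollment = "numeric"))

#create an instance with new()
penn <- new("school", name = "Penn Alexander", enrollment = 550)

#S4 slots are extracted with @, not $
penn@name
slotNames(penn)
```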

#proj4string is where information about
#  our polygons' CRS would (SHOULD) be
pa_districts@proj4string
## CRS arguments: NA
#we'll deal with this being a data.frame
#  a bit later on...
class(pa_districts@data)
## [1] "data.frame"
head(pa_districts@data)
##   STATEFP10 UNSDLEA10 GEOID10                            NAME10 LSAD10
## 0        42     26610 4226610      Windber Area School District     00
## 1        42     26130 4226130  Westmont Hilltop School District     00
## 2        42     20250 4220250          Richland School District     00
## 3        42     19560 4219560      Portage Area School District     00
## 4        42     17610 4217610  Northern Cambria School District     00
## 5        42     10950 4210950 Greater Johnstown School District     00
##   LOGRADE10 HIGRADE10 MTFCC10 SDTYP10 FUNCSTAT10   ALAND10 AWATER10
## 0        PK        12   G5420    <NA>          E 149008250   172076
## 1        KG        12   G5420    <NA>          E  40389065    84890
## 2        KG        12   G5420    <NA>          E  55489201    81591
## 3        KG        12   G5420    <NA>          E  64940601   252439
## 4        KG        12   G5420    <NA>          E 161563015   143662
## 5        KG        12   G5420    <NA>          E  74222845  1233573
##    INTPTLAT10   INTPTLON10 Shape_Leng  Shape_Area
## 0 +40.2029364 -078.7526689  0.6476420 0.015779913
## 1 +40.2980142 -078.9656740  0.4289308 0.004287447
## 2 +40.2847439 -078.8471471  0.3839040 0.005885062
## 3 +40.3819464 -078.6372129  0.3963847 0.006913603
## 4 +40.6324824 -078.7881406  0.6604640 0.017215603
## 5 +40.3422418 -078.9475814  0.9270410 0.007997733
#Unlike with a list, names doesn't return the
#  names of the slots. There's a new
#  function for that:
names(pa_districts)
##  [1] "STATEFP10"  "UNSDLEA10"  "GEOID10"    "NAME10"     "LSAD10"    
##  [6] "LOGRADE10"  "HIGRADE10"  "MTFCC10"    "SDTYP10"    "FUNCSTAT10"
## [11] "ALAND10"    "AWATER10"   "INTPTLAT10" "INTPTLON10" "Shape_Leng"
## [16] "Shape_Area"
slotNames(pa_districts)
## [1] "data"        "polygons"    "plotOrder"   "bbox"        "proj4string"
#bbox is for Bounding BOX, it tells the 
#  extremes of the coordinates (very
#  useful for plotting our polygons);
#  the units here also give a hint as
#  to what the CRS being used is!
pa_districts@bbox
##        min       max
## x -80.5195 -74.68950
## y  39.7198  42.26933

6.2.2 Setting the CRS while Reading

Typically, a .shp file is accompanied by several other data files. The most important right now is the .prj file, which should contain some information about the projection associated with the file. Let’s examine the one we have:

cat(readLines("data/pennsylvania_districts.prj"))
## GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137,298.257222101]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]]

I just discovered a website which can help us turn the information from a .prj file into something we can use. There are a number of other ways to do this, but all are a bit too involved for today’s purposes (basically, they involve loading the file into free GIS software like QGIS or SAGA, which have good built-in interpreters for this sort of thing).

Plugging the contents of our .prj file into prj2epsg.org, we find that our projection is equivalent to EPSG:4269. I’ve found that getting things into standard EPSG code form is the simplest way to handle defining a CRS in R. With this in hand, we can repeat our reading-in exercise:

#the CRS function converts an EPSG
#  specification into something
#  understood by proj4string
pa_districts <- 
  readShapePoly("data/pennsylvania_districts.shp",
                proj4string = CRS("+init=epsg:4269"))

#it's the same as before, basically:
plot(pa_districts, main = "PA School Districts",
     col = "seagreen")

#but now comes more complete:
pa_districts@proj4string
## CRS arguments:
##  +init=epsg:4269 +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80
## +towgs84=0,0,0

6.2.3 Reprojection of CRS

You’ll often find that one shapefile is set in one CRS, and another shapefile you need for your analysis is stored in another CRS. Thankfully, R has simple functionalities for transforming polygons from CRS to CRS.

Let’s try that out here quickly. The most common CRS I’ve come across is WGS84 (EPSG:4326), the longitude/latitude system used by GPS; it’s also the system in which Google Maps stores its coordinates (though it displays them via the related Web Mercator projection). Let’s convert our pa_districts into WGS84:

pa_districts <-
  spTransform(pa_districts, CRS("+proj=longlat +datum=WGS84"))

pa_districts@proj4string
## CRS arguments:
##  +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0

6.3 Geocoding

Geocoding is the process of converting street addresses (like "3700 Walnut Street, Philadelphia, PA 19104") into spatial coordinates. This is actually a computationally expensive operation. As evidence, consider that this is one of the only services I know of that Google doesn’t offer at massive scale for free – there is a fairly small daily query limit. And other firms typically charge quite a bit to geocode your data for you. As a minor plug, I managed to come across a firm called SmartyStreets which offers unlimited free geocoding to students (basically in the hope that you’ll become a subscriber upon graduating to a job with real income).

For smaller-scale jobs (on the order of a few hundred queries), using Google Maps’ service should be just fine. And in fact there is an R function that handles most of the heavy lifting under the hood to make geocoding as simple as possible. So let’s get to it!
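Before running the full batch below, it may help to see the shape of the output on a single address; this is a minimal sketch (it requires an internet connection, and the exact coordinates returned can drift as Google’s data changes):

```r
library(ggmap)

#geocode returns a data.frame with
#  two columns: lon and lat
geocode("3700 Walnut Street, Philadelphia, PA 19104",
        source = "google")
```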

We’ll be finding coordinates for all of the (traditional) public schools in Philadelphia (from 2013-14, per the Common Core; I pre-condensed the data file to save space here). Let’s read in the file quickly:

phila_addr <- fread("data/phila_school_addr.csv")
phila_addr
##                                  SCHNAM                  LSTREE
##   1: FELTONVILLE SCH OF ARTS & SCIENCES   210 EAST COURTLAND ST
##   2:                      CAYUGA SCHOOL   43444358 NORTH 5TH ST
##   3:        OVERBROOK ELEMENTARY SCHOOL      2032 NORTH 62ND ST
##   4:                  MARSHALL THURGOOD       5120 NORTH 6TH ST
##   5:                   MUNOZ-MARIN LUIS       3300 NORTH 3RD ST
##  ---                                                           
## 195:                             AMY NW          6000 RIDGE AVE
## 196:                BODINE WILLIAM W HS       1101 NORTH 4TH ST
## 197:       DEBURGOS BILINGUAL MAGNET MS     401 WEST LEHIGH AVE
## 198: KENSINGTON URBAN EDUCATION ACADEMY 2051 EAST CUMBERLAND ST
## 199:              FRANKLIN BENJAMIN SCH     5735 RISING SUN AVE
##             LCITY LSTATE  LZIP
##   1: PHILADELPHIA     PA 19120
##   2: PHILADELPHIA     PA 19140
##   3: PHILADELPHIA     PA 19151
##   4: PHILADELPHIA     PA 19120
##   5: PHILADELPHIA     PA 19140
##  ---                          
## 195: PHILADELPHIA     PA 19128
## 196: PHILADELPHIA     PA 19123
## 197: PHILADELPHIA     PA 19133
## 198: PHILADELPHIA     PA 19125
## 199: PHILADELPHIA     PA 19120

As you can see, we have inherently geospatial data, but without a deep knowledge of the street system of Philadelphia, the data is basically useless in its current form. Geocoding will bridge the gap between textual geospatial data and numeric geospatial data. The ggmap library provides a geocode function which is perfect for the data at hand:

install.packages("ggmap")

The geocode function takes a source argument which allows us to choose the service used to look up the addresses. I find that Google is the most robust, so I usually stick with source = "google". You can read more in ?geocode. Now we’ve got all we need to do the geocoding:

#cache here to avoid running the geocoding too often
library(ggmap)
#the output of geocode comes back as a 
#  data.frame with two columns --
#  lon and lat. We'll add these columns
#  to our data.table
phila_addr[ , c("longitude","latitude") := 
              geocode(paste(LSTREE, LCITY, LSTATE, LZIP),
                      source = "google")]
phila_addr[]
##                                  SCHNAM                  LSTREE
##   1: FELTONVILLE SCH OF ARTS & SCIENCES   210 EAST COURTLAND ST
##   2:                      CAYUGA SCHOOL   43444358 NORTH 5TH ST
##   3:        OVERBROOK ELEMENTARY SCHOOL      2032 NORTH 62ND ST
##   4:                  MARSHALL THURGOOD       5120 NORTH 6TH ST
##   5:                   MUNOZ-MARIN LUIS       3300 NORTH 3RD ST
##  ---                                                           
## 195:                             AMY NW          6000 RIDGE AVE
## 196:                BODINE WILLIAM W HS       1101 NORTH 4TH ST
## 197:       DEBURGOS BILINGUAL MAGNET MS     401 WEST LEHIGH AVE
## 198: KENSINGTON URBAN EDUCATION ACADEMY 2051 EAST CUMBERLAND ST
## 199:              FRANKLIN BENJAMIN SCH     5735 RISING SUN AVE
##             LCITY LSTATE  LZIP longitude latitude
##   1: PHILADELPHIA     PA 19120 -75.12164 40.01939
##   2: PHILADELPHIA     PA 19140 -75.13561 40.01311
##   3: PHILADELPHIA     PA 19151 -75.24561 39.98187
##   4: PHILADELPHIA     PA 19120 -75.13394 40.03010
##   5: PHILADELPHIA     PA 19140 -75.13571 40.00121
##  ---                                             
## 195: PHILADELPHIA     PA 19128 -75.21287 40.03092
## 196: PHILADELPHIA     PA 19123 -75.14311 39.96820
## 197: PHILADELPHIA     PA 19133 -75.13863 39.99275
## 198: PHILADELPHIA     PA 19125 -75.12633 39.98430
## 199: PHILADELPHIA     PA 19120 -75.10503 40.04217
#It's common for some addresses to be poorly formatted, 
#  in which case the entry for that place would be missing
phila_addr[is.na(longitude)]
## Empty data.table (0 rows) of 7 cols: SCHNAM,LSTREE,LCITY,LSTATE,LZIP,longitude...

6.4 Spatial Joins

One of the most important tasks of GIS is the spatial join – basically cross-referencing information from different shapefiles and combining them based on shared spatial characteristics.

This comes in several forms (partial overlap of polygons, containment of lines, etc.), but for today I think it will suffice to consider the simplest form of spatial join, namely, overlapping points to polygons.

Specifically, we’re going to find out the neighborhood in Philadelphia where each of our schools is found.

First we’ll need to load the polygons associated with these neighborhoods. Philadelphia is blessed to have a local GIS firm, Azavea Inc., which has taken it upon itself to partition Philadelphia into colloquially known neighborhoods (Fishtown, Point Breeze, Society Hill) – a pretty daunting task if you ask me, particularly deciding on definitive borders for something that is defined socially, not politically. Anyway, you can download their shapefile through OpenDataPhilly (by the way, an excellent source for all sorts of free data about Philadelphia) here.

#This file is stored in the ever-famous
#  NAD 1983 State Plane Pennsylvania South,
#  used by basically every government
#  map made in Philly. The EPSG code was
#  again determined by plugging the .prj
#  contents into prj2epsg.org.
azavea <- readShapePoly("data/Neighborhoods_Philadelphia.shp",
                        proj4string = CRS("+init=epsg:2272"))

plot(azavea, main = "Azavea Neighborhoods in Philadelphia",
     col = "darkblue", lwd = 3)

Next, we have to convert the coordinates we retrieved into a SpatialPointsDataFrame, since that’s the type of object for which we have methods to perform a spatial join.

#Remembering that we got our points in WGS84 above,
#  we'll also take this opportunity to convert
#  them to the coordinates used in the 
#  neighborhood map
school_points <- 
  spTransform(SpatialPointsDataFrame(
    coords = with(phila_addr, cbind(longitude, latitude)),
    data = phila_addr,
    proj4string = CRS("+proj=longlat +datum=WGS84")),
    CRS("+init=epsg:2272"))

#To see what we're after, let's overlay the points
#  to the neighborhood map by using the add = TRUE
#  parameter to plot, which prevents plot from
#  creating a new figure upon being called
plot(azavea, main = "Azavea Neighborhoods in Philadelphia",
     col = "darkblue", lwd = 3)
plot(school_points, pch = 8, add = TRUE, col = "red")

Spatial joins, once we have the pieces arranged nicely (all in the same CRS, most crucially), are actually pretty simple. The hardest part is dealing with the output, since it’s a data.frame. The crux of the operation is the %over% infix operator, which returns, for each object on its left-hand side, the data attributes (i.e., the row) of the object from its right-hand side that is spatially associated with it.

In our example, the left will be the schools, and the right will be the neighborhood. We’ll find the neighborhood that is spatially associated with each school (here, that simply means the neighborhood in whose boundary the school is located), and return the attributes of that neighborhood.

Typically, we then add those returned attributes to our data, either adding it to the original data.table, phila_addr, or adding it to the derived SpatialPointsDataFrame, school_points. We’ll do the latter here since it demonstrates a bit of dealing with data.frames, an unfortunate reality that chases all those who would dare do geospatial analysis (data.tables unfortunately don’t play very nicely with GIS in R, yet).

Without further ado:

#simple join, no assignment yet
head(school_points %over% azavea)
##           NAME     LISTNAME      MAPNAME Shape_Leng Shape_Area
## 1  FELTONVILLE  Feltonville  Feltonville   22648.99   27907972
## 2 HUNTING_PARK Hunting Park Hunting Park   32920.80   39024501
## 3    OVERBROOK    Overbrook    Overbrook   57004.92   76924995
## 4        OLNEY        Olney        Olney   32197.21   50308403
## 5     FAIRHILL     Fairhill     Fairhill   12327.77    7514288
## 6        OLNEY        Olney        Olney   32197.21   50308403
#We only care about the MAPNAME,
#  which is the abbreviation of the
#  neighborhood name which is most
#  mapping-friendly. We can extract
#  only that column like so (note 
#  the subtle difference vis-a-vis data.table)
head((school_points %over% azavea)["MAPNAME"])
##        MAPNAME
## 1  Feltonville
## 2 Hunting Park
## 3    Overbrook
## 4        Olney
## 5     Fairhill
## 6        Olney
#Now we can formally complete the join.
#  Note we extract the column as a
#  vector with [[ so that it's stored
#  as a plain column named MAPNAME,
#  not a nested data.frame:
school_points@data$MAPNAME <-
  (school_points %over% azavea)[["MAPNAME"]]

head(school_points@data)
##                               SCHNAM                LSTREE        LCITY
## 1 FELTONVILLE SCH OF ARTS & SCIENCES 210 EAST COURTLAND ST PHILADELPHIA
## 2                      CAYUGA SCHOOL 43444358 NORTH 5TH ST PHILADELPHIA
## 3        OVERBROOK ELEMENTARY SCHOOL    2032 NORTH 62ND ST PHILADELPHIA
## 4                  MARSHALL THURGOOD     5120 NORTH 6TH ST PHILADELPHIA
## 5                   MUNOZ-MARIN LUIS     3300 NORTH 3RD ST PHILADELPHIA
## 6           WASHINGTON GROVER JR SCH    201 EAST OLNEY AVE PHILADELPHIA
##   LSTATE  LZIP longitude latitude      MAPNAME
## 1     PA 19120 -75.12164 40.01939  Feltonville
## 2     PA 19140 -75.13561 40.01311 Hunting Park
## 3     PA 19151 -75.24561 39.98187    Overbrook
## 4     PA 19120 -75.13394 40.03010        Olney
## 5     PA 19140 -75.13571 40.00121     Fairhill
## 6     PA 19120 -75.11828 40.03556        Olney

6.5 Polygon Aggregation/Dissolution

Polygon aggregation, the fusion of neighboring polygons into single polygons, is done primarily through the unionSpatialPolygons function from maptools; see ?unionSpatialPolygons.

Polygon dissolution is a bit more complicated, depending on what exactly it is you have in mind, but you should start with the gIntersection and gDifference functions from the rgeos library; see this Q&A.
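As a minimal sketch of aggregation (assuming pa_districts from above is still loaded, and that rgeos is installed, since unionSpatialPolygons relies on it): the IDs argument assigns each input polygon to an output polygon, so a constant vector dissolves everything into one shape:

```r
library(maptools)

#one ID per input polygon; a constant
#  dissolves all 500 districts into a
#  single outline of Pennsylvania
pa_outline <- unionSpatialPolygons(
  pa_districts, IDs = rep(1, length(pa_districts)))

plot(pa_outline, main = "Pennsylvania, Dissolved")
```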

7 Automatic Presentations/Papers: knitr (3:30 PM - 4:30 PM)

knitr is what was used to make this gigantic presentation/workshop today!

8 Dynamic/Interactive Output: shiny (4:30 PM - 5:30 PM)

See here.

9 Appendix

vapply

CJ

setkey

9.1 Data Cleaning

9.2 Parallelization

9.3 Source Code

9.4 Package development

9.5 Advanced types