1 Overview

First and foremost is to install R (following instructions here) so that your computer knows how to interpret R code. Then, I highly recommend you install RStudio (following instructions here), which is a powerful program for organizing your interaction with the R language (a subtle distinction, and you wouldn’t really be ill-served by considering RStudio and R to be the same thing); for the jargon-hungry, RStudio is an Integrated Development Environment (IDE) for the R language.

Some general guidelines for getting the most out of this workshop:

  1. Active participation is key! You should be running every snippet of code I go over on your own machine. As with human languages, exposure is one of the building blocks of fluency. You would be insane to think that you can osmote the ability to understand and use R code just by watching me do things on a screen. This facility takes practice, practice, hair pulling, and more practice. For those with little exposure to more advanced programming languages, today’s workshop will be very heavy with material. You shouldn’t expect to absorb everything, but actively participating will help a lot.

  2. Questions are openly encouraged! As I mentioned, this will, despite my efforts to make things as straightforward as possible, inevitably be a somewhat dense workshop. So nobody should hesitate to stop me if they’re lost. I am no fan of hearing myself speak, so if nobody in the audience is following me, that’s a waste of everyone’s time.




2 R Basics (8 AM - 9 AM)

Before we get to the fun stuff (statistical analysis), it’s important that we gain some facility with doing basic things in R first – adding numbers, creating objects, and tools for exploring/understanding new objects as we come across them, including understanding and overcoming common errors that can creep into our code.

First things first, let’s follow the time-honored tradition of making R say “Hello World”.

Make sure you’re in the console (the cursor in front of the right angle bracket (>) should be blinking) and type (or copy-paste) the following:

print("Hello World")
## [1] "Hello World"

Easy-peasy. The console is a great place to do what I call sandboxing – running small commands to test whether they run as expected and produce the right output in the right format, etc. But it would be very cumbersome to use the console to do everything. There’s a distinct lack of permanency to anything we do in the console.

More common is to run code from an R script. You can create an R script in RStudio by clicking the plus-page icon (New R script) and hitting “R Script” or by using the keyboard shortcut Ctrl+Shift+N. Try printing “Hello World” again by adding print("Hello World") to your new script and pressing Ctrl+Enter on this line. Ctrl+Enter is the shortcut for running the selected line, or for running a highlighted section of code.

Once we save our R script, we can easily share it with coauthors or the general public (or even ourselves, on a different computer), who can simply run the code again on their machine to reproduce the analysis.


2.1 Assignment

Consider the following:

x = 3

When we execute the above line of code, we’re creating the variable x and associating with it the value 3. This is like creating a local in Stata.

Variables (a.k.a. objects) are the most crucial building block for everything we want to do in R. When we create a variable, we create a shorthand for some value that we’ll refer to later in our code.

Assignment passes from right to left – the value on the right of = is assigned to the name on the left of =.

The opposite is an error, since we haven’t told R what y is yet:

3 = y
## Error in 3 = y: invalid (do_set) left-hand side to assignment

This error tells us that 3 is not a valid thing to which to assign; as a rule of thumb for beginners, all variable names must start with letters (though can contain other characters thereafter).
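
To make the naming rule concrete, here’s a small sketch (the names are invented for illustration):

```r
#legal: names start with a letter, and may contain
#  letters, digits, dots, and underscores thereafter
x2 <- 5
my.total <- 10
my_total <- 15

#illegal: starting with a digit is a syntax error
#2x <- 5   # Error: unexpected symbol
```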

Side note: the following works just as well: x <- 3. This arrow form of assignment is probably more common in practice, so you’ll see both styles in R code in the wild. The differences between = and <- are all-but irrelevant for beginners, but for future reference, this Q&A is worth a read.


2.2 Vectors

The easiest way to think of a vector is as a column (or a row) in Excel. A column in Excel can contain many values, but instead of referring to each of them individually, we refer to the column as a whole. Typically, these values all have something in common (for example, a column Name in Excel should be filled with people’s names).

The way to declare vectors in R is using c, which stands for _c_oncatenate:

x <- c(1, 2, 3, 4)
x
## [1] 1 2 3 4

Now x contains four numbers. Note that since this type of sequence is so common, R has built in the colon (:) as an operator to create a vector like this more quickly/concisely:

x <- 1:4
x
## [1] 1 2 3 4
#Also works in reverse
x <- 4:1
x
## [1] 4 3 2 1

2.3 Types

Thus far, everything we’ve assigned to a variable has been a number. But this is far from the only thing we can assign to a variable in R. Consider:

x <- c("Philadelphia", "Pennsylvania")
y <- c(TRUE, FALSE)
z <- c(1L, 2L)

x contains a string of letters demarcated by "" (or '') to distinguish them from variable names. We refer to x as a character vector.

y is a logical vector; it is often very convenient to keep binary variables (gender, treatment group, etc.) stored as logical variables, for reasons we’ll see below.

z is an integer vector. This is distinguished from c(1, 2), which gets stored as a numeric vector. The difference between the two probably won’t affect you for quite some time, but it’s often important for saving computer memory. 1, as a numeric, takes up more space in your computer’s memory than does 1L. L signifies integer; don’t worry about why (if you insist, it stands for _L_ong integer, which is too involved for this session).
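
If you ever need to check what type a vector has, base R’s class() function (and the is.* family) will tell you; this goes slightly beyond what we’ve covered so far, but it’s a handy sanity check:

```r
x <- c("Philadelphia", "Pennsylvania")
y <- c(TRUE, FALSE)
z <- c(1L, 2L)

class(x)
## [1] "character"
class(y)
## [1] "logical"
class(z)
## [1] "integer"
#plain 1 is numeric, not integer
is.integer(1)
## [1] FALSE
is.integer(1L)
## [1] TRUE
```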

2.3.1 Lists

Consider this:

x <- c(1, 2L, TRUE, "America")

All four components of x are of different types (namely, numeric, integer, logical, character). Recall that we said the elements of a vector should have something in common – as declared, these four could hardly be more different. So R forces all of the components to a common type – namely, character:

x
## [1] "1"       "2"       "TRUE"    "America"

Note the quotation marks – every component of x is now treated as a character string.

The specific hierarchy is logical < integer < numeric < character

However, mixing types is a fundamental feature of almost all data analysis, so it stands to reason that there is a straightforward way to do so. In R, this is done with the list type. We can replace the code above using list:

x <- list(1, 2L, TRUE, "America")
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] "America"

Note how different the output looks, as compared to using c!! The quotation marks are gone except for the last component. You can ignore the mess of [[ and [ for now, but as an intimation, consider some more complicated lists:

x <- list(c(1, 2), c("a", "b"), c(TRUE, FALSE), c(5L, 6L))
x
## [[1]]
## [1] 1 2
## 
## [[2]]
## [1] "a" "b"
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 5 6
y <- list(list(1, 2, 3), list(4:5), 6)
y
## [[1]]
## [[1]][[1]]
## [1] 1
## 
## [[1]][[2]]
## [1] 2
## 
## [[1]][[3]]
## [1] 3
## 
## 
## [[2]]
## [[2]][[1]]
## [1] 4 5
## 
## 
## [[3]]
## [1] 6

x is a list which has 4 components, each of which is a vector with 2 components. This gives the first hint at how R treats a dataset with many variables of different types – at core, R stores a data set in a list!

y is a nested list – it’s a list that has lists for some of its components. This is very useful for more advanced operations, but probably won’t come up for quite some time, so don’t worry if you haven’t wrapped your head around this yet.


2.4 Extraction/Indexing

Consider a simple numeric vector:

x <- 5:14

How do we get at the various numbers stored in certain positions in x? For example, how could we get the first number in x, 5?

This process is called extraction, and, for vectors, is done with square brackets ([]), e.g.:

x[1] #first element of x
## [1] 5
x[5] #fifth element of x
## [1] 9

This also works on lists, but lists also have some other ways to get at their contents:

y <- list(1:3, 4:6, 7:9)
y[1]
## [[1]]
## [1] 1 2 3

Note that the output still has [[ in it. This means the result of y[1] is still a list.

More typically, we want to remove the list structure and just get 1:3 instead of list(1:3). To do this, we use [[:

y[[1]]
## [1] 1 2 3

2.4.1 Named vectors and lists

It is also possible to name the elements of vectors and lists. This is convenient for making it easy to get certain elements without having to remember whether you stored it first, third, or whatever:

x <- c("Iowa" = "Cruz", "Ohio" = "Kasich", "Pennsylvania" = "Trump")

y <- list(names = c("Kasich", "Cruz", "Trump"),
          ages = c(63, 45, 69),
          hates_clinton = c(TRUE, TRUE, TRUE))

For named vectors, we keep using [] to extract elements, but we can use the name instead of the index:

x["Iowa"]
##   Iowa 
## "Cruz"

For named lists, we can use [], but, as with numbered indices, we’ll get a list in return. If we want the actual object contained at that point in the list, we can still use [[, or we can also use $, which is another extraction operator:

#now that y has names, we no longer see [[ -- instead, we see
#  the name of each element of the list
y["ages"]
## $ages
## [1] 63 45 69
y[["ages"]]
## [1] 63 45 69
y$names
## [1] "Kasich" "Cruz"   "Trump"
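
Names can also be attached after the fact with the names() function. A quick sketch, reusing the ages stored in y above:

```r
x <- c(45, 63, 69)
names(x) <- c("Cruz", "Kasich", "Trump")
x["Kasich"]
## Kasich 
##     63
#names() without assignment retrieves the names
names(x)
## [1] "Cruz"   "Kasich" "Trump"
```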

2.4.2 Multiple extraction

As often as not, when we need to extract, we need more than one element of the vector.

To do this, we pass a vector to []. It’s most intuitive when we need elements in sequence, e.g.:

# R conveniently keeps the letters of the
#   alphabet stored automatically in two
#   vectors, letters and LETTERS;
#   the former is lower-, the latter uppercase
x <- LETTERS
x
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
#Get the first 5 letters
x[1:5]
## [1] "A" "B" "C" "D" "E"

What if we need the first and 5th elements, but not the 2nd/3rd/4th elements?

Remember that 1:5 is the same as c(1, 2, 3, 4, 5). So we could have just used the latter:

x[c(1, 2, 3, 4, 5)]
## [1] "A" "B" "C" "D" "E"

By extension, if we just want the first and fifth elements:

x[c(1, 5)]
## [1] "A" "E"

NOTE: there’s no way to extract multiple elements using the other extraction operators, [[ and $; trying to do so either errors or does something unexpected:

x <- list(a = 1:3, b = 4:6)
#multiple extraction by names works fine:
x[c("a", "b")]
## $a
## [1] 1 2 3
## 
## $b
## [1] 4 5 6
#but not with [[ or $
x[[1:2]] #technically, R interprets this as a recursive index...
## [1] 2
x[[c("a", "b")]] #again, a recursive index...
## Error in x[[c("a", "b")]]: subscript out of bounds
x$c("a", "b")
## Error in eval(expr, envir, enclos): attempt to apply non-function

2.4.3 Logical indexing

Another exceedingly common approach is to subset an object based on a condition satisfied by some of the elements.

Here, we must introduce the logical operators (also called binary operators, which is important for understanding associated error messages) in R.

These are <, >, <=, >=, ==, and !=. These all work the same as in Stata (note: ~=, which works in Stata as an alternative to !=, doesn’t work in R). First, let’s explore them:

ages <- c(12, 14, 16, 18, 20, 22, 24, 30,
          38, 40, 55, 60, 63, 66, 68, 70)

#Who is voting age?
ages >= 18
##  [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE  TRUE  TRUE  TRUE  TRUE
#Who can't drink?
ages < 21
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE
#Who is exactly 40?
ages == 40
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE

Note that R automatically figures out whether each element of ages satisfies the condition.

The way to extend this to extraction is simple:

#Only take people under age 18
ages[ages < 18]
## [1] 12 14 16
#Only take AARPers
ages[ages >= 50]
## [1] 55 60 63 66 68 70
#Exclude any 70 year old:
ages[ages != 70]
##  [1] 12 14 16 18 20 22 24 30 38 40 55 60 63 66 68

The other common logical operations are intersection, union and negation, aka AND, OR and NOT, which in R, as in Stata, are &, | and ! (you’ll also often see && and || in other people’s code; don’t worry about the difference yet, but for those who know Matlab – it’s the same).

Quickly, let’s get working-agers:

ages[ages >= 18 & ages <= 65]
##  [1] 18 20 22 24 30 38 40 55 60 63

We can combine these to our heart’s fancy and create any combination of logical requirements, e.g.:

#get people who are over 18, but not exactly 34 or 70
ages[ages >= 18 & !(ages == 34 | ages == 70)]
##  [1] 18 20 22 24 30 38 40 55 60 63 66 68
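
One related trick worth knowing: logical vectors can be passed to sum and mean (TRUE counts as 1, FALSE as 0), which turns counting and shares into one-liners:

```r
ages <- c(12, 14, 16, 18, 20, 22, 24, 30,
          38, 40, 55, 60, 63, 66, 68, 70)
#how many people are voting age?
sum(ages >= 18)
## [1] 13
#what share of people are voting age?
mean(ages >= 18)
## [1] 0.8125
```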

2.4.4 Negative indexing

Sometimes, we’d rather exclude a small number of elements, rather than include elements like we have been doing so far.

To do this, we precede the vector we want to extract with - (if it’s numeric/integer/character) or ! (if it’s logical).

Suppose we wanted to find out how many years have passed since our earliest observation, and to exclude the earliest observation. We might do something like:

years <- c(1970, 1973, 1978, 1980, 1990, 1995)
#the earliest year is 1970, which comes first; to exclude:
years[-1]
## [1] 1973 1978 1980 1990 1995
#to exclude and subtract 1970:
years[-1] - years[1]
## [1]  3  8 10 20 25

2.5 Base functions

R comes equipped with a vast (vast) library of built-in functions intended to make your life as a data analyst as easy as possible. As of this writing I count 1,203 functions included in base R, and 2,374 total functions that come ready to use as soon as you open up RStudio.

These are the real work horses of R, and many of them should be familiar to Excel and Stata users. Like in those programs, functions are distinguished by the use of parentheses () (in fact we’ve already used many functions to this point, most obviously the concatenation function c). We won’t be able to get anywhere near understanding all of the multitudinous functions R makes available to us today, but we can make a start.

2.5.1 Basic Arithmetic/Stats

Finding the sum of a column in Excel is sort of a pain. You have to write =SUM( and then highlight all the proper cells. It’s kind of a pain in Stata as well. You either have to run summarize var followed by di r(sum), or create a new variable with egen sum_var = sum(var). I find this and many other arithmetic operations much simpler to do in R.

x <- 1:10

#find the sum
sum(x)
## [1] 55
#find the mean
mean(x)
## [1] 5.5
#find the variance and standard deviation
var(x)
## [1] 9.166667
sd(x)
## [1] 3.02765
#scalar arithmetic
3 * x
##  [1]  3  6  9 12 15 18 21 24 27 30
x - 2
##  [1] -1  0  1  2  3  4  5  6  7  8
x^2
##  [1]   1   4   9  16  25  36  49  64  81 100
x / 4
##  [1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
#exponentials
exp(1:3)
## [1]  2.718282  7.389056 20.085537
log(1:10)
##  [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595 1.9459101
##  [8] 2.0794415 2.1972246 2.3025851
#other arithmetic
abs(-5:5)
##  [1] 5 4 3 2 1 0 1 2 3 4 5
#rounding up/down
x <- c(.1, 1.1, 1.9, 2, 2.8)
floor(x)
## [1] 0 1 1 2 2
ceiling(x)
## [1] 1 2 2 2 3
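
A few more everyday helpers in the same vein; min, max, range, and cumsum are all base functions that operate on a whole vector at once:

```r
x <- 1:10
min(x)
## [1] 1
max(x)
## [1] 10
#both extremes at once
range(x)
## [1]  1 10
#running total
cumsum(x)
##  [1]  1  3  6 10 15 21 28 36 45 55
```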

2.5.2 Function arguments

Most functions accept more than one argument. Consider rounding a number:

x <- c(1.12, 10.2, 1.56, 21.9, 2)
round(x)
## [1]  1 10  2 22  2
#round to 1 digit past the decimal
round(x, 1)
## [1]  1.1 10.2  1.6 21.9  2.0
#round to -1 digit past the decimal
#  (i.e., 1 digit before the decimal)
round(x, -1)
## [1]  0 10  0 20  0

There’s no limit to the number of arguments a function can accept. With round, keeping the arguments straight is easy because there are only two: first, a vector of numbers, and second, a (single) number of digits to which to round the vector.

When the number of arguments balloons, however, it gets more and more difficult to keep track of which argument goes where.

Consider the quantile function, which takes 5 arguments – x, a vector of numbers; probs, a vector of probabilities (quantiles); na.rm, which tells it whether to ignore missing data (more on that later); names, which tells R whether the result should be named (see example); and type, which tells R the algorithm to use for measuring the quantiles (there are 9 readily available).

It would be quite a pain to expect users to memorize the order in which quantile expects arguments, especially if we had to do so for every function we ever wanted to use (remember, there are thousands). To facilitate this, R allows us to name our arguments.

x <- c(1, 1, 1, 2, 3, 4, 5, 6, 
       7, 8, 8, 8, 9, 10, 10, 10)

#by default, quantile calculates all 5 quartiles:
#  min, 25%-ile, median, 75%-ile, max
quantile(x)
##    0%   25%   50%   75%  100% 
##  1.00  2.75  6.50  8.25 10.00
median(x)
## [1] 6.5
#if we just want the median
quantile(x, .5)
## 50% 
## 6.5
#or, we can name it to be explicit
quantile(x, probs = .5)
## 50% 
## 6.5
#if we want to exclude the names
quantile(x, probs = .5, names = FALSE)
## [1] 6.5

In addition to making it easier for us as analysts to use the functions, naming arguments also makes it easier for others to read our code. While it’s fair to expect most users to know the basic functions and their arguments pretty well, the more advanced the function is that you’re using, the more understandable your code gets when aided by named arguments.
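
Relatedly, you never have to memorize an argument list: args() prints a function’s signature, and prefixing a function name with ? opens its help page (in RStudio, in the Help pane).

```r
#print round's signature
args(round)
## function (x, digits = 0) 
## NULL
#open the full help page for quantile
?quantile
```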

2.5.3 Infix operators and match

Infix operators are written between their operands, like basic arithmetic, but do things not covered by the five main arithmetic operators (+, -, *, /, ^). Beyond those five, infix operators in R are surrounded by percent signs (e.g., %%). The two math-y ones have to do with modular arithmetic:

x <- 1:20
# integer division
x %/% 3
##  [1] 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6
# modular division (remainder)
x %% 3
##  [1] 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2

By far the most commonly used infix, though, has to be %in%. It is used to determine whether each element of the left-hand object can be found in the right-hand set. This is best seen with an example:

candidates <- 
  c("Bernie Sanders", "Hillary Clinton", "Lincoln Chafee", 
    "Jim Webb", "Laurence Lessig", "Donald Trump", "John Kasich",
    "Ted Cruz", "Marco Rubio", "Jeb Bush", "Ben Carson",
    "Carly Fiorina", "Lindsey Graham", "Chris Christie",
    "Rick Santorum", "Rand Paul", "Jim Gilmore", "Mike Huckabee",
    "George Pataki", "Bobby Jindal", "Scott Walker", "Rick Perry")

nominees <- c("Hillary Clinton", "Donald Trump")

candidates %in% nominees
##  [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

A quick note that what %in% is actually doing is using the match function, which is also useful. match takes three main arguments. The first and second are like the left- and right-hand sides of %in%, respectively. match tries to find each element of the left in the right and, if it’s found, gives the position. The third argument, nomatch, tells it what to do when a left element is not found. Example:

color <- c("red", "green", "orange", "blue",
           "purple", "teal", "mahogany", "yellow",
           "orange", "white", "lavender")

roygbiv <- c("red", "orange", "yellow", "green",
             "blue", "indigo", "violet")

#%in% tells you if each element of
#  color is in roygbiv at all
color %in% roygbiv
##  [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
#match tells you WHERE in roygbiv 
#  each element of color was found;
#  we set nomatch to say what
#  we expect when the color isn't found
#  (it must be an integer)
match(color, roygbiv, nomatch = -1)
##  [1]  1  4  2  5 -1 -1 -1  3  2 -1 -1
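
%in% also combines naturally with the logical indexing from above; for example, to keep only the colors that actually appear in the rainbow:

```r
color <- c("red", "green", "orange", "blue",
           "purple", "teal", "mahogany", "yellow",
           "orange", "white", "lavender")
roygbiv <- c("red", "orange", "yellow", "green",
             "blue", "indigo", "violet")
color[color %in% roygbiv]
## [1] "red"    "green"  "orange" "blue"   "yellow" "orange"
```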

2.5.4 More useful functions

2.5.4.1 length

length is used to find how many elements something has (in data, how many observations there are):

x <- 1:50
length(x)
## [1] 50

2.5.4.2 rep

rep is used to _rep_eat an object a certain number of times. The each argument repeats each element a set number of times. length.out makes sure the output has a certain number of elements.

rep(1:3, 4)
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
rep(1:3, each = 4)
##  [1] 1 1 1 1 2 2 2 2 3 3 3 3
rep(1:3, each = 4, length.out = 10)
##  [1] 1 1 1 1 2 2 2 2 3 3

2.5.4.3 unique / duplicated

unique eliminates all duplicates of a vector; duplicated returns a logical vector telling you which elements appeared earlier in the vector.

x <- c(1, 1, 1, 2, 2, 3)
unique(x)
## [1] 1 2 3
duplicated(x)
## [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE
#unique is the same as (but faster than) x[!duplicated(x)]
x[!duplicated(x)]
## [1] 1 2 3
#unique is often used in conjunction with length
#  to find out the number of IDs in a vector, e.g.
length(unique(x))
## [1] 3
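
A close cousin of unique is table(), which counts how many times each distinct value appears:

```r
x <- c(1, 1, 1, 2, 2, 3)
table(x)
## x
## 1 2 3 
## 3 2 1
```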

2.5.4.4 seq

seq produces a _seq_uence. It has four main arguments: from (where to start), to (where to end), by (the increment), and length.out (how many elements the output should have). We must specify no more than 3 of these. Some examples:

#same as 1:10
seq(10)
##  [1]  1  2  3  4  5  6  7  8  9 10
#same as 2:11
seq(2, 11)
##  [1]  2  3  4  5  6  7  8  9 10 11
#make a grid from 0 to 10 with 30 points;
#  R automatically figures out the increment
seq(0, 10, length.out = 30)
##  [1]  0.0000000  0.3448276  0.6896552  1.0344828  1.3793103  1.7241379
##  [7]  2.0689655  2.4137931  2.7586207  3.1034483  3.4482759  3.7931034
## [13]  4.1379310  4.4827586  4.8275862  5.1724138  5.5172414  5.8620690
## [19]  6.2068966  6.5517241  6.8965517  7.2413793  7.5862069  7.9310345
## [25]  8.2758621  8.6206897  8.9655172  9.3103448  9.6551724 10.0000000
#get all the odd numbers from 1 to 50:
#  note that the upper endpoint won't
#  necessarily be included in the output
seq(1, 50, by = 2)
##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49
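
Two shortcuts in the same family are seq_len(n) (equivalent to 1:n) and seq_along(x) (the indices of x); they’re safer than 1:length(x) because they return an empty vector, rather than counting backwards, when the input has length zero:

```r
seq_len(5)
## [1] 1 2 3 4 5
seq_along(c("a", "b", "c"))
## [1] 1 2 3
#the edge case: 1:0 counts down, but seq_len(0) is empty
1:0
## [1] 1 0
seq_len(0)
## integer(0)
```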

2.5.4.5 sort

sort will give an ordered version of its input. Default is increasing order; use the decreasing argument to reverse this.

x <- c(1, 2, 4, -1, 8, 9, 20, 13, 0)
sort(x)
## [1] -1  0  1  2  4  8  9 13 20
sort(x, decreasing = TRUE)
## [1] 20 13  9  8  4  2  1  0 -1
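
sort’s close relative order() returns the permutation that would sort the input, i.e., the positions of the smallest, second-smallest, ... elements. This is most useful for sorting one vector by the values of another (e.g., names by ages):

```r
x <- c(1, 2, 4, -1, 8, 9, 20, 13, 0)
order(x)
## [1] 4 9 1 2 3 5 6 8 7
#indexing by order(x) reproduces sort(x)
x[order(x)]
## [1] -1  0  1  2  4  8  9 13 20
```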

2.5.4.6 paste / paste0

paste and paste0 are R’s ways of concatenating strings, i.e., combining them together. paste has an argument sep which tells how to _sep_arate the components; paste0 is a slightly faster version of paste that uses sep = "" (i.e., don’t separate the components). The collapse argument, available to both, will reduce a vector to a single string, using its value as the separator. Examples:

first <- c("Sam", "Charles", "Matthew", "Thom")
last <- c("Beam", "Mingus", "Embree", "Yorke")

#default sep is a space, sep = " "
paste(first, last)
## [1] "Sam Beam"       "Charles Mingus" "Matthew Embree" "Thom Yorke"
paste(last, first, sep = ", ")
## [1] "Beam, Sam"       "Mingus, Charles" "Embree, Matthew" "Yorke, Thom"
paste(last, first, collapse = ", ")
## [1] "Beam Sam, Mingus Charles, Embree Matthew, Yorke Thom"
paste0(first, last, collapse = ", ")
## [1] "SamBeam, CharlesMingus, MatthewEmbree, ThomYorke"

2.6 Random Numbers

As statistical analysts, randomness is our lifeblood. As a language designed for statistical analysis, then, it stands to reason that R comes well-equipped to handle many common statistical operations. This includes producing random numbers from a number of common distributions, random permutations, random subsets, etc.

2.6.1 Uniform and normal random numbers

The most common kind of random numbers that people want are uniform draws and normal draws.

#generate 10 U[0, 1] draws
runif(10)
##  [1] 0.3533790 0.4154977 0.2749768 0.8417500 0.3754361 0.3238726 0.4187158
##  [8] 0.8402353 0.8282875 0.1257449
#generate 10 U[3, 5] draws
runif(10, min = 3, max = 5)
##  [1] 3.034160 3.778142 3.449561 4.149737 3.859568 4.279103 4.196884
##  [8] 3.718585 4.642759 4.263920
#generate 10 N(0, 1) draws
rnorm(10)
##  [1] -0.2616094  0.4220936  0.5989283 -0.1320829 -0.6563510 -0.1877255
##  [7]  1.2570120 -0.9213782  0.4822936 -1.5137514
#generate 10 N(3, 5) draws
rnorm(10, mean = 3, sd = 5)
##  [1] -5.5352378 -0.2025639  8.3196767  2.5940324  0.4463488 -1.8024958
##  [7]  0.2398940  0.9296785 -0.9146949  5.5596145

These examples highlight the common format of random number generators in R. To get random numbers from a distribution, there is probably a function named like rdist, where dist is an abbreviation for the distribution (here, uniform and normal). The first argument is the number of draws; the rest of the arguments are parameters.

Here is a complete table of all of the common distributions built in to R, and the function used to invoke their random number generator (RNG):

List of native distributions from built-in stats package

Distribution | RNG | Other Parameters | Wikipedia
Beta | rbeta | shape1, shape2, ncp | https://en.wikipedia.org/wiki/Beta_distribution
Binomial | rbinom | size, prob | https://en.wikipedia.org/wiki/Binomial_distribution
Cauchy | rcauchy | location, scale | https://en.wikipedia.org/wiki/Cauchy_distribution
Chi-Squared | rchisq | df, ncp | https://en.wikipedia.org/wiki/Chi-squared_distribution
Exponential | rexp | rate | https://en.wikipedia.org/wiki/Exponential_distribution
F | rf | df1, df2, ncp | https://en.wikipedia.org/wiki/F-distribution
Gamma | rgamma | shape, rate, scale | https://en.wikipedia.org/wiki/Gamma_distribution
Geometric | rgeom | prob | https://en.wikipedia.org/wiki/Geometric_distribution
Hypergeometric | rhyper | m, n, k | https://en.wikipedia.org/wiki/Hypergeometric_distribution
Log-Normal | rlnorm | meanlog, sdlog | https://en.wikipedia.org/wiki/Log-normal_distribution
Logistic | rlogis | location, scale | https://en.wikipedia.org/wiki/Logistic_distribution
Multinomial | rmultinom | size, prob | https://en.wikipedia.org/wiki/Multinomial_distribution
Negative Binomial | rnbinom | size, prob, mu | https://en.wikipedia.org/wiki/Negative_binomial_distribution
Poisson | rpois | lambda | https://en.wikipedia.org/wiki/Poisson_distribution
Wilcoxon Signed Rank Statistic | rsignrank* | n | https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
Student t | rt | df, ncp | https://en.wikipedia.org/wiki/Student%27s_t-distribution
Weibull | rweibull | shape, scale | https://en.wikipedia.org/wiki/Weibull_distribution
Wilcoxon Rank Sum Statistic | rwilcox* | m, n | https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
Wishart | rWishart | df, Sigma | https://en.wikipedia.org/wiki/Wishart_distribution

In addition to RNGs, R typically has three other functions associated with a given distribution: a density function with prefix d, e.g. dnorm, giving the PDF; a probability function with prefix p, e.g. pnorm, giving the CDF; and a quantile function with prefix q, e.g. qnorm, giving quantiles.

#this is mainly used for plotting, see below
dnorm(seq(-3, 3, length.out = 30))
##  [1] 0.004431848 0.008069595 0.014077583 0.023529569 0.037679866
##  [6] 0.057811501 0.084982332 0.119688537 0.161505004 0.208799242
## [11] 0.258631467 0.306932813 0.348991450 0.380185689 0.396813332
## [16] 0.396813332 0.380185689 0.348991450 0.306932813 0.258631467
## [21] 0.208799242 0.161505004 0.119688537 0.084982332 0.057811501
## [26] 0.037679866 0.023529569 0.014077583 0.008069595 0.004431848
#great for calculating p values
pnorm(1.96)
## [1] 0.9750021
#great for getting critical values
qnorm(c(.005, .025, .05))
## [1] -2.575829 -1.959964 -1.644854
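
As a quick illustration of pnorm in action, the familiar two-sided p-value for a standard normal test statistic is a one-liner:

```r
z <- 1.96
#two-sided p-value: twice the upper-tail probability
2 * (1 - pnorm(abs(z)))
## [1] 0.04999579
#equivalently, via the lower tail
2 * pnorm(-abs(z))
## [1] 0.04999579
```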

2.6.2 sample

As often as we need random draws from uniform or normal distributions, we need to take random subsets of vectors, or to get random integers. The main function for tasks like this is sample, which takes 4 arguments: x (a vector from which to sample, or an integer representing the max of numbers from which to sample), size (the number of elements to sample, which defaults to length(x)), replace (whether draws are taken with replacement), and prob (a vector of probabilities, for weighted sampling – default is uniform).

As always, it’s easiest to see with some examples:

#A permutation of 1:5
sample(5)
## [1] 2 4 5 1 3
#A size-3 subset of 1:5
sample(5, 3)
## [1] 2 3 5
#A size-5 subset of 1:5, with replacement
sample(5, 5, replace = TRUE)
## [1] 3 1 2 3 5
#A weighted sample of 1:5, heavily weighting 1
sample(5, 5, replace = TRUE, prob = c(.99, .0025, .0025, .0025, .0025))
## [1] 1 1 1 1 1

We can use the basic tools here to run basic randomization exercises, like rolling dice, flipping coins, or drawing cards from a deck:

#Flip 10 coins, get the percentage heads
flips <- sample(c("H", "T"), 10, replace = TRUE)
mean(flips == "H")
## [1] 0.6
#Roll a pair of dice and add them
sum(sample(6, 2, replace = TRUE))
## [1] 10
#create a deck of labeled cards:
#  (I copy-pasted the text for the suits from
#   Wikipedia, but we could have used S/H/D/C:
#   https://en.wikipedia.org/wiki/Playing_cards_in_Unicode)
deck <- paste0(rep(c(2:10, "J", "Q", "K", "A"), 4),      #card values
               rep(c("♠", "♥", "♦", "♣"), each = 13)) #suits
deck
##  [1] "2♠"  "3♠"  "4♠"  "5♠"  "6♠"  "7♠"  "8♠"  "9♠"  "10♠" "J♠"  "Q♠" 
## [12] "K♠"  "A♠"  "2♥"  "3♥"  "4♥"  "5♥"  "6♥"  "7♥"  "8♥"  "9♥"  "10♥"
## [23] "J♥"  "Q♥"  "K♥"  "A♥"  "2♦"  "3♦"  "4♦"  "5♦"  "6♦"  "7♦"  "8♦" 
## [34] "9♦"  "10♦" "J♦"  "Q♦"  "K♦"  "A♦"  "2♣"  "3♣"  "4♣"  "5♣"  "6♣" 
## [45] "7♣"  "8♣"  "9♣"  "10♣" "J♣"  "Q♣"  "K♣"  "A♣"
#Now draw a poker hand
sample(deck, 5)
## [1] "A♠" "K♣" "7♣" "9♥" "Q♠"

2.6.3 replicate

Full scale Monte Carlo/bootstrapping requires replicating a simulation exercise over and over. Again, R is ready and has the replicate function for us. Let’s simulate a few poker hands to find their frequency and compare with the theoretical frequency:

#It'll be easier to deal with numbers than
#  with the fancy strings we used above
deck_nos <- 1:52
#Simulating a flush
simulations <- 
  replicate(10000, #number of repetitions/simulations
            { #use curly braces to run a simulation that takes more than one line
              hand <- sample(deck_nos, 5) #draw a five-card hand
              suits <- hand %% 4 # 4 equal-sized groups of 13 cards each -- a stand-in for suits
              length(unique(suits)) == 1 #returns TRUE if all suits are equal, FALSE else
            }) #close braces and parentheses
#How often does a flush happen?
#  True frequency is 0.1965%
mean(simulations)
## [1] 0.0017
#What about a pair (just two cards match)?
#  true frequency: 42%
simulations <- 
  replicate(10000, {
    hand <- sample(deck_nos, 5)
    faces <- hand %% 13 # force into one of 13 categories
    sum(duplicated(faces)) == 1 #one pair if there's exactly _one_ match
  })
mean(simulations)
## [1] 0.4272
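
The same replicate pattern powers a bare-bones bootstrap: resample the data with replacement many times, compute a statistic each time, and inspect its distribution. A minimal sketch (the data here are invented for illustration):

```r
set.seed(2016)
x <- c(2, 4, 4, 5, 7, 9, 12, 15)
#resample x with replacement 10,000 times, storing the mean each time
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))
#an approximate 95% interval for the mean of x
quantile(boot_means, probs = c(.025, .975))
```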

2.6.4 set.seed

Just a quick note that any time we run a simulation, it’s a good idea (for replicability) to set and save the random seed! Random numbers from computers may look random, but in fact they’re completely deterministic – if we know the value of the random seed! This is a disadvantage for information security (which is only sometimes important to us as data analysts), but an advantage for academic research, where replicability is paramount.

A simple example:

set.seed(100)
runif(1)
## [1] 0.3077661
#if we run again, we'll get a different number
runif(1)
## [1] 0.2576725
#but if we re-set the seed, we'll get the same
set.seed(100)
runif(1)
## [1] 0.3077661

2.7 Vectorization / *apply

The last set of major workhorse functions in R are the *apply functions – sapply and lapply. apply is also important, but I want to avoid touching on matrices today, if I can. The oft-ignored cousins are rapply (recursive), tapply (tagged), mapply (multivariate), vapply (verified), and eapply (environment).

You should think of lapply as list apply and sapply as simplified apply, because the former returns a list, and the latter returns a more simplified object (vector/matrix/array).

Both of these functions are, by and large, what you should be using in R instead of a for loop. A for loop often consists of going through every element of a list and doing something to its contents. for loops exist in R, as they probably do in every language, but are typically much slower than what is often called a vectorized approach. We’ll get into that more later, but for now I’ll just leave an example of lapply and sapply in action:

#pick three triplets of numbers at random
l <- list(sample(100, 3), sample(100, 3), sample(100, 3))
l
## [[1]]
## [1] 26 55  6
## 
## [[2]]
## [1] 47 48 80
## 
## [[3]]
## [1] 38 55 17
#now find the max within each group
lapply(l, max)
## [[1]]
## [1] 55
## 
## [[2]]
## [1] 80
## 
## [[3]]
## [1] 55
#note that the output is a list. It's typically
#  more convenient to have the output as a vector:
sapply(l, max)
## [1] 55 80 55
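For comparison, here's the same find-the-max task written as an explicit for loop; the triplets are hard-coded so the snippet stands alone. Same answer, more typing (and, on bigger problems, more waiting):

```r
l2 <- list(c(26, 55, 6), c(47, 48, 80), c(38, 55, 17))
maxes <- numeric(length(l2))  #pre-allocate the result vector
for (i in seq_along(l2)) {    #visit each element by index
  maxes[i] <- max(l2[[i]])
}
maxes
## [1] 55 80 55
identical(maxes, sapply(l2, max))
## [1] TRUE
```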

2.8 Packages

One of the things that makes R truly exceptional is its vast library of user-contributed packages.

R comes pre-loaded with a boat-load of the most common functions / methods of analysis. But in no way is this congenital library complete.

Complementing this core of the most common operations are external packages, which are basically sets of functions designed to accomplish specific tasks. The entire afternoon of this workshop is devoted to gaining facility in just a few of these.

Best of all, unlike some super-expensive programming languages, all of the thousands of packages available to R users (most importantly through CRAN, the Comprehensive R Archive Network) are completely free of charge.

The most important things to know about packages for now (we’ll start using my favorite, data.table, shortly) are where to find them, how to install them, and how to load them.

2.8.1 Where to find packages

Long story short: Google. Got a particular statistical technique in mind? The best R package for it is almost always the top Google result, if you phrase the query correctly.

How about hierarchical linear modeling (HLM)? A quick Google for “HLM R” or “hierarchical linear modeling R” both turn up some articles/tutorials/CRAN pages referencing the lme4 package and its workhorse function, lmer.

2.8.2 How to install packages

Just use install.packages!

#we won't be able to run this, because it requires
#  administrative privileges, but you can do
#  this easily on your own computer
#  (note that lmer lives in the lme4 package)
install.packages("lme4")

2.8.3 How to load packages

Just add it to your library!

#this won't run if it's not installed
library(lme4)

Et voila! You’ll now be able to run HLM in R. You can also Google “tutorial lmer” (or in general “tutorial [package name]”) and you’re very likely to find a trove of sites trying to help you learn the package. Most popular packages also come with worked examples available through the example function, e.g., example("lmer", package = "lme4").


2.9 Toolkit: ls/rm/?/class/dput/str

Just a few more things that aren’t related to statistical analysis per se, but which will often facilitate your coding experience.

First, ls and rm, short-hand for list objects and remove objects. These can be useful for understanding what names you’ve assigned to objects, and for occasionally cleaning up the clutter of variables you’ll no longer use (partly for the meticulously clean, but also because leaving a bunch of unused objects lying around is a great way to create irritating errors/problems in your code).

Let’s see all the junk we’ve created thus far in the workshop:

#ls is a function, so we have to use () to call it
ls()
##  [1] "ages"        "candidates"  "color"       "deck"        "deck_nos"   
##  [6] "first"       "flips"       "l"           "last"        "nominees"   
## [11] "roygbiv"     "simulations" "x"           "y"           "years"      
## [16] "z"
#one of the optional arguments to ls is to 
#  also list some "hidden" objects:
ls(all = TRUE)
##  [1] "ages"         "candidates"   "color"        "deck"        
##  [5] "deck_nos"     "first"        "flips"        "l"           
##  [9] "last"         "nominees"     ".Random.seed" "roygbiv"     
## [13] "simulations"  "x"            "y"            "years"       
## [17] "z"
#what a mess! let's get rid of some stuff
rm(ages, candidates, color)

#ugh, this is getting boring. What about
#  a scorched earth technique?
rm(list = ls(all = TRUE))

#after the apocalypse, nothing remains:
ls(all = TRUE)
## character(0)

2.9.1 Troubleshooting

Finally, I’ll leave you with some tools that are often the first place to look when trying to troubleshoot code that isn’t working correctly.

2.9.1.1 ?

? is the help operator in R. If you’d like to know anything about any function in R, just type ? before its name and hit enter. Let’s start with the help page for replicate by entering ?replicate. All help files are structured similarly, with the following common sections (I’ve highlighted in bold the most useful for troubleshooting):

  1. Description: an overview of the function(s) described on the page. ?replicate, for example, redirects us to the help page for lapply, since in addition to replicate this page covers lapply, sapply, vapply, and simplify2array (don’t worry about the last two, they’re a bit more advanced and won’t come up today).
  2. Usage: a generic snippet of code which shows the name and default value of every argument of every function on the page. This is very useful for understanding the ordering and naming of all the arguments to a function. replicate, for example, has 3 arguments, only 2 of which we used above: n is the number of repetitions, which has no default; expr is the expression to repeat (surrounded in curly braces {} above), which has no default; and simplify, which gives us control over how/whether the results are converted from a list, which has default "array", leading to the output we saw above.
  3. Arguments: gives a brief overview of what each function expects to receive for each argument (the type, whether it can be a vector or has to be a scalar, etc.). You shouldn’t expect to fully understand everything that’s being told to you here for a while, since it’s pretty specific, but it can still help.
  4. Details: Some nitty-gritty about how the functions work, some mistakes to avoid, some of the more intricate details of how exactly the function goes about its task. Tends to be very dense and jargon-heavy.
  5. Value: Tells you details about the type of object that you should expect to be returned as a result of running each function (e.g., we noted that lapply always returns a list; this is noted in the Value section of ?replicate in the first sentence).
  6. Note: Some more technical details about the function and its edge cases.
  7. References: Especially useful for packages – this usually tells you where to go to find some more reading about the algorithms implemented.
  8. See Also: Gives some related functions which might also help accomplish similar tasks.
  9. Examples: A great place to look for some simple reproducible examples of how the functions work, with code that can be copy-pasted and run by you, the user.
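The simplify argument just described is easy to see in action; sum(1:10) here is only a placeholder expression:

```r
#the default (simplify = "array") collapses the results into a vector
replicate(3, sum(1:10))
## [1] 55 55 55
#simplify = FALSE returns a list instead, just as lapply would
replicate(3, sum(1:10), simplify = FALSE)
## [[1]]
## [1] 55
## 
## [[2]]
## [1] 55
## 
## [[3]]
## [1] 55
```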

The help files are an absolute must-go-to reference – they're always the first place you should check when you’re stuck.

2.9.1.2 class

class simply tells you the class/type of an object in memory. You’d be surprised how often the root of an issue is simply that you think a certain variable is type X (say, numeric), but actually R is keeping track of it as if it’s type Y (frighteningly often, factor – more on that later).
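A classic instance, sketched with a made-up vector: a factor that prints like numbers, but isn't stored as them.

```r
f <- factor(c("10", "20", "30"))
f
## [1] 10 20 30
## Levels: 10 20 30
class(f)  #not numeric!
## [1] "factor"
#naive conversion returns the internal level codes, not the numbers
as.numeric(f)
## [1] 1 2 3
#the safe idiom: convert to character first
as.numeric(as.character(f))
## [1] 10 20 30
```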

2.9.1.3 dput

dput is like a microscope for R objects. It deconstructs any variable in R to its atoms – its absolute fundamental components – and prints out what’s hidden under the hood. This is especially useful because all of the most complicated objects in R tend to have an associated print method which renders them in a digestible form. This is great for taking a glance at a complicated object, but not useful for understanding what’s going wrong with that object.

Take, for example:

#we'll get more into how R understands dates/times
#  later in the day; the key takeaway here is 
#  how R prints the object vs. how it's actually stored.
t <- as.POSIXlt("2015-05-12")
t
## [1] "2015-05-12 EDT"
#when we just enter the object at the console, its
#  print method is invoked, i.e.:
print(t)
## [1] "2015-05-12 EDT"
#this is a bit misleading. From what we see, it would appear
#  that t is just a character string. However:
t == "2015-05-15 EDT"
## [1] FALSE
#so, what is t? use dput
dput(t)
## structure(list(sec = 0, min = 0L, hour = 0L, mday = 12L, mon = 4L, 
##     year = 115L, wday = 2L, yday = 131L, isdst = 1L, zone = "EDT", 
##     gmtoff = NA_integer_), .Names = c("sec", "min", "hour", "mday", 
## "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", 
## "POSIXt"))
#actually, it's a rather complicated list with components
#  describing every which thing about the date/time we entered

2.9.1.4 str

Often the output of dput is far too verbose to be of real use. str is a similar function which gives us a structured look at the components of a complicated object, with the advantage that it doesn’t drown us in detail; it’s more like a table of contents for R objects.

#don't worry about what lm is yet, we'll get there soon
y <- rnorm(1000)
x <- rnorm(1000)
r <- lm(y ~ x)
str(r)
## List of 12
##  $ coefficients : Named num [1:2] 0.015549 -0.000618
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
##  $ residuals    : Named num [1:1000] -0.5905 -1.1076 -0.1212 0.0885 0.5711 ...
##   ..- attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:1000] -0.4921 0.0198 -0.1315 0.0739 0.6333 ...
##   ..- attr(*, "names")= chr [1:1000] "(Intercept)" "x" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:1000] 0.0148 0.0147 0.0161 0.0162 0.0147 ...
##   ..- attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:1000, 1:2] -31.6228 0.0316 0.0316 0.0316 0.0316 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:1000] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "x"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.03 1.04
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 998
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = y ~ x)
##  $ terms        :Classes 'terms', 'formula' length 3 y ~ x
##   .. ..- attr(*, "variables")= language list(y, x)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. ..$ : chr "x"
##   .. ..- attr(*, "term.labels")= chr "x"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(y, x)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  $ model        :'data.frame':   1000 obs. of  2 variables:
##   ..$ y: num [1:1000] -0.576 -1.093 -0.105 0.105 0.586 ...
##   ..$ x: num [1:1000] 1.214 1.296 -0.851 -0.983 1.369 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 y ~ x
##   .. .. ..- attr(*, "variables")= language list(y, x)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. .. ..$ : chr "x"
##   .. .. ..- attr(*, "term.labels")= chr "x"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(y, x)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  - attr(*, "class")= chr "lm"

Whew, that was a lot! Congratulations!

Now, on to the more fun and useful stuff!




3 Data Basics (9 AM - 11 AM)

This section is intended to lay the groundwork of all of the most common data analysis tasks we face when tackling our empirical problems.

3.1 Reading .csv Files

The most common format of file you’re likely to find in the real world is the .csv (comma-separated values) format.

A quick clarification: not all comma-separated files have the .csv extension. Often you’ll find a file with the extension .txt that actually contains comma-separated text. You just have to inspect the file to figure out how it’s stored.

In fact that’s the case for our first example. We’re going to look at teacher attendance data for the School District of Philadelphia.

The school district releases, through their Open Data Initiative, a whole bunch of data files related to topics like school catchment zones, employee salaries, budget details, etc. All of the data can be found here: http://webgui.phila.k12.pa.us/offices/o/open-data-initiative.

We could use R to download the files (we may touch on this in the afternoon), but to ease us into things, I’ve gone ahead and downloaded some files and put them on my GitHub page. Let’s look at the data from 2013-14. Here’s a link to the data, but it should also be loaded on your machine.

We can preview the file using the terminal (on a Mac or Linux machine). Learning to do some basic stuff in the terminal is a great investment of your time, but that’s not the focus of today’s lesson, so we’ll just stick with doing things through R. We can send commands to the terminal in R using the system command. The terminal command head (which is also a function in R, but only applies to objects already loaded) will spit out the first couple lines of a file so we can see what it looks like without having to load the whole thing (which, for larger files, can be time-consuming).

#Just to flex some of R's muscles, we'll find
#  the data file from within R. As often as not,
#  we'll just navigate to the file ourselves and
#  just copy-paste the file path.
#  We're looking for the folder with the data
#  so we use include.dirs to include directories
list.files(include.dirs = TRUE) 
## [1] "data"                                  
## [2] "iesrtutorial_cache"                    
## [3] "iesrtutorial_files"                    
## [4] "iesrtutorial.html"                     
## [5] "iesrtutorial.Rmd"                      
## [6] "README.md"                             
## [7] "workshop_installation_instructions.pdf"
#I've kept things in the data folder here. We
#  can tell it's a folder as it has no extensions.
#  We can find the files in that folder using a 
#  relative path (./ means starting from the 
#  current folder, enter the folder after /).
#  We use the full.names of the file for the next step.
list.files("./data", full.names = TRUE)
##  [1] "./data/00keys.csv"                                      
##  [2] "./data/00staff.dat"                                     
##  [3] "./data/2015-2016 Enr Dem (Suppressed).xls"              
##  [4] "./data/ag131a_supp.sas7bdat"                            
##  [5] "./data/carsdata.dta"                                    
##  [6] "./data/employee_information.csv"                        
##  [7] "./data/Neighborhoods_Philadelphia.dbf"                  
##  [8] "./data/Neighborhoods_Philadelphia.prj"                  
##  [9] "./data/Neighborhoods_Philadelphia.sbn"                  
## [10] "./data/Neighborhoods_Philadelphia.sbx"                  
## [11] "./data/Neighborhoods_Philadelphia.shp"                  
## [12] "./data/Neighborhoods_Philadelphia.shx"                  
## [13] "./data/Neighborhoods_Philadelphia.xml"                  
## [14] "./data/pennsylvania_districts.cpg"                      
## [15] "./data/pennsylvania_districts.dbf"                      
## [16] "./data/pennsylvania_districts.prj"                      
## [17] "./data/pennsylvania_districts.qpj"                      
## [18] "./data/pennsylvania_districts.shp"                      
## [19] "./data/pennsylvania_districts.shx"                      
## [20] "./data/phila_school_addr.csv"                           
## [21] "./data/README_SDP_Employee.txt"                         
## [22] "./data/School Profiles Teacher Attendance 2013-2014.TXT"
#Now we see the data files. The one we're after
#  currently is "School Profiles Teacher Attendance 2013-2014.TXT"
#  To get it previewed, we send the command head to the system
#  and follow it with the quoted file name (since the
#  file name has spaces, which is a pain)
command <- paste0('head ', '"./data/School Profiles Teacher Attendance 2013-2014.TXT"')
system(command)
## [1] "/home/michael/Documents/iesrtutorial"
## ULCS_NO,SCHOOL_YEAR,SCH_TEACHER_ATTEND,SDP_TEACHER_ATTEND,SCHOOL_ID
## "8350","2013-2014",92.7,93.4,"835"
## "7320","2013-2014",95,93.4,"732"
## "5440","2013-2014",96,93.4,"544"
## "8320","2013-2014",95.1,93.4,"832"
## "6440","2013-2014",96.6,93.4,"644"
## "4370","2013-2014",93.8,93.4,"437"
## "4030","2013-2014",94.3,93.4,"403"
## "2240","2013-2014",94,93.4,"224"
## "6450","2013-2014",95.3,93.4,"645"

We can see that even though the data has extension .txt, the file itself is very regular and clearly comma-separated.

3.1.1 fread/data.table

To read in this data, I’m going to eschew the standard intro-to-R suggestion to use read.csv because we’re presented now with the perfect opportunity to introduce to you the crown jewel of R – data.table (full disclosure, I am an active contributor to it, but this has followed from my love for it).

data.table is a package created by Matthew Dowle and actively maintained by him, Arun Srinivasan, and Jan Gorecki. Basically, it’s designed to make working with data in R very straightforward/intuitive and incredibly fast (since it’s designed to handle data with tens of millions of observations). We’ll be spending a lot of time working with data.table today, and picking most of it up as we go along. You can read more on the data.table homepage when you get a chance.

The first and most famous tool of data.table is fread; the “f” stands for fast. fread is a highly optimized tool for reading data into R at blazing speeds. You’ll barely notice the difference for typical files (less than 100,000 observations), but when you do notice the difference, it’s a sight to behold.

On your own machine, you’ll need to install data.table (luckily for us, it’s already been pre-installed on your machines for today) using:

install.packages("data.table")

#the more confident and ambitious can install
#  the development version of the package,
#  which tends to have more features, but may
#  occasionally be more error-prone.
#  Caveat coder.
install.packages("data.table", type = "source",
                 repos = "https://Rdatatable.github.io/data.table")
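If you'd like to see the speed difference yourself once data.table is installed, here's a sketch; it writes a throwaway million-row file and times the two readers (exact timings will vary by machine, and the gap grows with file size):

```r
library(data.table)
tmp <- tempfile(fileext = ".csv")  #a throwaway file, deleted below
fwrite(data.table(x = runif(1e6), y = sample(letters, 1e6, TRUE)), tmp)
system.time(df1 <- read.csv(tmp))  #the base R reader
system.time(df2 <- fread(tmp))     #typically several times faster
unlink(tmp)                        #clean up after ourselves
```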

Once we’re sure it’s on our machine, we can read the data by running:

#load the data.table package
library(data.table)

attendance <- 
  fread("./data/School Profiles Teacher Attendance 2013-2014.TXT")

Just like that!

Entering the data variable by itself will invoke the print.data.table method and display a preview of the data that was loaded:

attendance
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
#we can also explicitly invoke print,
#  which has a few more bells and whistles to
#  facilitate understanding what our data.table
#  looks like.
print(attendance, class = TRUE)
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##       <char>      <char>              <num>              <num>    <char>
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430

Using the class argument to print, the preview now includes an abbreviated type under each of the column headers letting you know what class R thinks each column is (note class is a recent update to data.table; for the moment, it is not available in the version on CRAN, but should be available there soon).

What about attendance itself? It looks like something we’ve not seen yet. Let’s explore:

class(attendance)
## [1] "data.table" "data.frame"
str(attendance)
## Classes 'data.table' and 'data.frame':   212 obs. of  5 variables:
##  $ ULCS_NO           : chr  "8350" "7320" "5440" "8320" ...
##  $ SCHOOL_YEAR       : chr  "2013-2014" "2013-2014" "2013-2014" "2013-2014" ...
##  $ SCH_TEACHER_ATTEND: num  92.7 95 96 95.1 96.6 93.8 94.3 94 95.3 92.2 ...
##  $ SDP_TEACHER_ATTEND: num  93.4 93.4 93.4 93.4 93.4 93.4 93.4 93.4 93.4 93.4 ...
##  $ SCHOOL_ID         : chr  "835" "732" "544" "832" ...
##  - attr(*, ".internal.selfref")=<externalptr>
is.list(attendance)
## [1] TRUE

OK, lots of important stuff here. We’ve discovered a new type of object! attendance is a data.table. You can think of this as being a single sheet of an Excel workbook. It is specific to the data.table package, and is an improvement on the data.frame type found in base R (we’ll see this type a bit later).

As far as how R thinks of it, a data.table is a list on steroids. Each element of the list is a column, and each element has the same number of items. We can see from str that, like a list, we can use $ to extract items (here: columns):

attendance$SCH_TEACHER_ATTEND
##   [1] 92.7 95.0 96.0 95.1 96.6 93.8 94.3 94.0 95.3 92.2 90.8 94.3 93.6 95.0
##  [15] 92.8 85.3 86.4 95.9 93.0 94.8 90.1 95.0 93.4 94.4 91.8 94.6 95.4 92.3
##  [29] 95.8 94.0 95.1 94.7 88.4 93.7 94.6 95.5 94.8 95.3 94.0 95.3 89.9 93.2
##  [43] 91.7 89.0 93.1 92.7 94.2 97.2 93.6 92.4 92.6 92.7 92.0 94.4 94.3 93.6
##  [57] 92.4 95.1 93.4 95.6 87.8 94.7 93.5 95.7 90.9 94.4 95.7 90.2 95.7 95.2
##  [71] 93.3 93.5 89.0 95.0 89.7 93.9 93.2 93.5 92.1 93.0 94.8 95.1 94.8 90.8
##  [85] 95.7 93.3 92.8 89.6 95.4 91.5 94.1 93.3 93.7 96.8 95.3 94.1 93.9 88.0
##  [99] 94.6 95.1 86.9 95.5 95.0 97.6 95.8 93.8 94.4 96.0 95.8 89.7 94.3 92.7
## [113] 93.1 95.1 94.5 95.4 94.4 94.4 90.8 94.5 93.3 90.6 93.5 94.4 91.9 90.7
## [127] 91.6 92.1 94.7 95.4 86.5 94.8 94.5 90.1 96.1 93.1 95.8 92.0 94.1 91.7
## [141] 94.0 92.4 95.6 91.2 92.1 92.8 93.8 93.7 88.8 96.1 94.4 90.6 95.3 95.3
## [155] 90.9 94.0 94.1 95.6 93.9 91.9 91.5 93.2 94.2 93.0 91.2 91.8 95.9 92.1
## [169] 93.8 92.6 90.0 94.1 96.5 91.1 89.8 93.6 94.3 97.6 93.0 98.1 91.4 93.4
## [183] 94.5 94.1 94.1 88.7 94.0 93.4 94.2 94.6 96.2 91.4 92.0 96.5 92.0 95.1
## [197] 94.7 96.6 96.8 91.7 96.1 95.9 93.0 91.5 94.4 93.2 88.5 92.0 91.4 95.3
## [211] 94.7 94.7

3.1.2 Exploring Attendance Data

What are the columns in this data? We usually hope we have a data dictionary, but I didn’t find one on the SDP website. It appears that ULCS_NO is the same as SCHOOL_ID (padded with a 0). It’s not clear if the school district is anonymizing the schools or if we simply have to match the SCHOOL_ID to the school’s name in another file.

The meat of the data are the two columns SCH_TEACHER_ATTEND and SDP_TEACHER_ATTEND. The former is the attendance rate of teachers at a given school; the latter is the average for the whole district (which is why it’s the same in every row).

We can now start to explore the data. The only real column of interest is SCH_TEACHER_ATTEND. With this, we can explore how teacher attendance varies across the city.

Interacting with a data.table for exploration involves accessing it with []. Within [], there are three main arguments:

  1. i: which rows do we want? (subsetting)
  2. j: which columns do we want? (selection and manipulation)
  3. by: grouping our operations in j by categories within a column.

The i and j notation come from thinking of a data.table as a matrix. To express the ith row and jth column of a matrix A, math people often write A[i,j].
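The third argument, by, is easiest to see with a tiny made-up table (the school labels and attendance numbers here are invented for illustration):

```r
library(data.table)
#a tiny made-up table -- the schools and numbers are invented
DT <- data.table(school = c("A", "A", "B", "B", "B"),
                 attend = c(95, 91, 88, 97, 93))
#i: keep rows with attend > 90; j: take the mean; by: within each school
DT[attend > 90, mean(attend), by = school]
#  school A's mean is (95 + 91)/2 = 93; school B's is (97 + 93)/2 = 95
```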

Let’s see the first two in action in exploring our first data set

#Summarize the variable to get a quick understanding
#we're looking at ALL rows, so we leave the first argument BLANK
#we're trying to summarize SCH_TEACHER_ATTEND, so we
#  use the summary function in the second argument
attendance[ , summary(SCH_TEACHER_ATTEND)]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   85.30   92.10   93.90   93.43   95.00   98.10

So most schools are within about 2 percentage points of the district average. What about schools that are particularly high or particularly low?

#We are focusing on certain rows, so we
#  say the condition defining those
#  rows in the first argument
attendance[SCH_TEACHER_ATTEND < 90]
##     ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##  1:    4350   2013-2014               85.3               93.4       435
##  2:    4570   2013-2014               86.4               93.4       457
##  3:    1340   2013-2014               88.4               93.4       134
##  4:    7350   2013-2014               89.9               93.4       735
##  5:    7220   2013-2014               89.0               93.4       722
##  6:    2480   2013-2014               87.8               93.4       248
##  7:    6360   2013-2014               89.0               93.4       636
##  8:    8240   2013-2014               89.7               93.4       824
##  9:    5430   2013-2014               89.6               93.4       543
## 10:    6260   2013-2014               88.0               93.4       626
## 11:    6340   2013-2014               86.9               93.4       634
## 12:    7290   2013-2014               89.7               93.4       729
## 13:    5370   2013-2014               86.5               93.4       537
## 14:    1330   2013-2014               88.8               93.4       133
## 15:    7510   2013-2014               89.8               93.4       751
## 16:    4240   2013-2014               88.7               93.4       424
## 17:    1130   2013-2014               88.5               93.4       113
#And schools with particularly high attendance
attendance[SCH_TEACHER_ATTEND > 97]
##    ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1:    5340   2013-2014               97.2               93.4       534
## 2:    6040   2013-2014               97.6               93.4       604
## 3:    8560   2013-2014               97.6               93.4       856
## 4:    1190   2013-2014               98.1               93.4       119

Without knowing more about the schools, all we can say is that there are a certain number of schools with very high or very low attendance.

3.1.3 Teacher Salaries

Another file posted to the SDP website contains information on teachers’ salaries. I’ve again downloaded this data already for you, but if you’re curious here’s a link to the data from the website.

This one actually comes with a data dictionary of sorts in the form of a README file. We can use R to display the README like so:

writeLines(readLines("./data/README_SDP_Employee.txt"))
README_SDP_Employee.txt
REVISIONS
20160401 (April 1st, 2016) - Second release in 2016.

This README file contains some background information and guidance on School District of Philadelphia (SDP) data. This set of data was extracted on March 31st 2016, and includes every active employee of the School District of Philadelphia as of the extract date. Do not hesitate to contact us if you have any questions, or would like to request other data sets.
Happy data diving!
opendata@philasd.org

Individual files:

README_SDP_EMPLOYEE.TXT  - You're soaking in it!

employee_information.csv
Provides basic information on every employee of the School District of Philadelphia. 
LAST_NAME - Employee's last name.
FIRST_NAME - Employee's first name.
PAY_RATE_TYPE - The pay rate type for the employee, either SALARIED, DAILY or HOURLY.
PAY_RATE - The pay rate for the employee. If the pay rate type is salaried, then the annual salary (minus any benefits). If the pay rate type is daily, then the daily pay, and if pay rate type is hourly, then the hourly pay rate. If the pay rate is zero the employee is typically in a leave status that does not receive pay; for example military leave.
TITLE_DESCRIPTION - Title for employee; there are over 500 unique titles.
HOME_ORGANIZATION - The home organization code identifies where the employee is primarily stationed. Some employees are "per diem", meaning they work on a daily basis at a variety of locations; other employees may work at multiple locations, but are attributed to one location.
HOME_ORGANIZATION_DESCRIPTION - The home organization description identifies the home location by name, typically the name of the school or office.
ORGANIZATION_LEVEL - Identifies the type of location the employee works at; for example, administrative office, garage, elementary school, etc.
TYPE_OF_REPRESENTATION - The union representation for the employee. NON REP means the employee is not represented by any union.
GENDER - Gender of employee.
RUN_DATE - The date the data was extracted from SDP data systems.

School District Data Sets Terms of Use.pdf
Terms of use for use the open data sets from the School District of Philadelphia.

readLines is a function for bringing text files into R; it returns every line of the file as one element of a character vector. writeLines is typically used for creating text files, but if we don’t give it a file to write to, it prints the output to the console.
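A quick self-contained round trip through a temporary file shows the pair in action:

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)  #creates the file
lines <- readLines(tmp)  #one element per line of the file
lines
## [1] "first line"  "second line"
unlink(tmp)  #clean up
```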

The file describes all three of the files that came with the download; of importance is employee_information.csv. We see here the columns are all described, which will help us understand what we see when we load in the data. Let’s do it!

salaries <- fread("./data/employee_information.csv")
print(salaries, class = TRUE)
##           LAST_NAME FIRST_NAME PAY_RATE_TYPE PAY_RATE
##              <char>     <char>        <char>    <int>
##     1:        AARON     ANDREA      SALARIED    31261
##     2:        AARON      PEGGY      SALARIED     9349
##     3:        ABARY     RODNEY      SALARIED    76461
##     4:        ABATE     JO-ANN        HOURLY       48
##    ---                                               
## 17558:        ZWICK      KEVIN      SALARIED    46694
## 17559:        ZWIRN     RHONDA      SALARIED    76461
## 17560:       ZWOLAK       PAUL      SALARIED    54364
## 17561:    ZYCHOWICZ       LISA      SALARIED    67706
## 17562: ZYGMUND-FELT      ALLYN      SALARIED    90051
##                     TITLE_DESCRIPTION HOME_ORGANIZATION
##                                <char>            <char>
##     1:       GENERAL CLEANER, 8 HOURS              4300
##     2:  STUDENT CLIMATE STAFF,4 HOURS              6360
##     3:                   SCHOOL NURSE              2370
##     4: TEACHER-EXTRA CURR/STAFF DEVEL              9EW0
##    ---                                                 
## 17558:              TEACHER,FULL TIME              8300
## 17559:         TEACHER,SPEC EDUCATION              6210
## 17560:              TEACHER,FULL TIME              2290
## 17561:              TEACHER,FULL TIME              8420
## 17562:              TEACHER,FULL TIME              7100
##        HOME_ORGANIZATION_DESCRIPTION        ORGANIZATION_LEVEL
##                               <char>                    <char>
##     1:         HESTON, EDWARD SCHOOL         ELEMENTARY SCHOOL
##     2:   ROOSEVELT ELEMENTARY SCHOOL         ELEMENTARY SCHOOL
##     3:   MCDANIEL, DELAPLAINE SCHOOL         ELEMENTARY SCHOOL
##     4:           NON-PUBLIC PROGRAMS NON ADMINISTRATIVE OFFICE
##    ---                                                        
## 17558:                MAYFAIR SCHOOL         ELEMENTARY SCHOOL
## 17559:   EDMONDS, FRANKLIN S. SCHOOL         ELEMENTARY SCHOOL
## 17560:      FRANKLIN LEARNING CENTER               HIGH SCHOOL
## 17561:       DECATUR, STEPHEN SCHOOL         ELEMENTARY SCHOOL
## 17562:  COOKE, JAY ELEMENTARY SCHOOL         ELEMENTARY SCHOOL
##        TYPE_OF_REPRESENTATION GENDER RUN_DATE
##                        <char> <char>   <char>
##     1:             LOCAL 1201      F 4/1/2016
##     2:              LOCAL 634      F 4/1/2016
##     3:            PFT-TEACHER      M 4/1/2016
##     4:            PFT-TEACHER      F 4/1/2016
##    ---                                       
## 17558:            PFT-TEACHER      M 4/1/2016
## 17559:            PFT-TEACHER      F 4/1/2016
## 17560:            PFT-TEACHER      M 4/1/2016
## 17561:            PFT-TEACHER      F 4/1/2016
## 17562:            PFT-TEACHER      F 4/1/2016
#it's a pain to have to lean on the SHIFT key all the
#  time, so let's rename the columns so that they
#  are in lower case. We need three functions:
#  tolower, which takes all upper-case letters in a 
#  string and replaces them with their lower-case
#  counterpart; names, which returns the names of
#  any object (vector, list, data.table, etc.);
#  and setnames, which has three arguments:
#  1) a data.table, 2) the columns to rename, and
#  3) their replacements. If we only use two arguments,
#  R assumes we want to replace _all_ column names, and
#  that the second argument is the replacement values.
setnames(salaries, tolower(names(salaries)))

#checking the result:
salaries
##           last_name first_name pay_rate_type pay_rate
##     1:        AARON     ANDREA      SALARIED    31261
##     2:        AARON      PEGGY      SALARIED     9349
##     3:        ABARY     RODNEY      SALARIED    76461
##     4:        ABATE     JO-ANN        HOURLY       48
##     5:  ABAYOMI-IGE   OLABIMPE      SALARIED    76461
##    ---                                               
## 17558:        ZWICK      KEVIN      SALARIED    46694
## 17559:        ZWIRN     RHONDA      SALARIED    76461
## 17560:       ZWOLAK       PAUL      SALARIED    54364
## 17561:    ZYCHOWICZ       LISA      SALARIED    67706
## 17562: ZYGMUND-FELT      ALLYN      SALARIED    90051
##                     title_description home_organization
##     1:       GENERAL CLEANER, 8 HOURS              4300
##     2:  STUDENT CLIMATE STAFF,4 HOURS              6360
##     3:                   SCHOOL NURSE              2370
##     4: TEACHER-EXTRA CURR/STAFF DEVEL              9EW0
##     5:         TEACHER,SPEC EDUCATION              6100
##    ---                                                 
## 17558:              TEACHER,FULL TIME              8300
## 17559:         TEACHER,SPEC EDUCATION              6210
## 17560:              TEACHER,FULL TIME              2290
## 17561:              TEACHER,FULL TIME              8420
## 17562:              TEACHER,FULL TIME              7100
##         home_organization_description        organization_level
##     1:          HESTON, EDWARD SCHOOL         ELEMENTARY SCHOOL
##     2:    ROOSEVELT ELEMENTARY SCHOOL         ELEMENTARY SCHOOL
##     3:    MCDANIEL, DELAPLAINE SCHOOL         ELEMENTARY SCHOOL
##     4:            NON-PUBLIC PROGRAMS NON ADMINISTRATIVE OFFICE
##     5: LEEDS, MORRIS E. MIDDLE SCHOOL             MIDDLE SCHOOL
##    ---                                                         
## 17558:                 MAYFAIR SCHOOL         ELEMENTARY SCHOOL
## 17559:    EDMONDS, FRANKLIN S. SCHOOL         ELEMENTARY SCHOOL
## 17560:       FRANKLIN LEARNING CENTER               HIGH SCHOOL
## 17561:        DECATUR, STEPHEN SCHOOL         ELEMENTARY SCHOOL
## 17562:   COOKE, JAY ELEMENTARY SCHOOL         ELEMENTARY SCHOOL
##        type_of_representation gender run_date
##     1:             LOCAL 1201      F 4/1/2016
##     2:              LOCAL 634      F 4/1/2016
##     3:            PFT-TEACHER      M 4/1/2016
##     4:            PFT-TEACHER      F 4/1/2016
##     5:            PFT-TEACHER      F 4/1/2016
##    ---                                       
## 17558:            PFT-TEACHER      M 4/1/2016
## 17559:            PFT-TEACHER      F 4/1/2016
## 17560:            PFT-TEACHER      M 4/1/2016
## 17561:            PFT-TEACHER      F 4/1/2016
## 17562:            PFT-TEACHER      F 4/1/2016
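As a quick sketch of the three-argument form of setnames (on a toy data.table, not the salaries data), renaming just one column looks like this:

```r
library(data.table)

#a toy table with two columns
dt <- data.table(OLD = 1:2, KEEP = 3:4)

#three arguments: the table, the old name(s), the new name(s)
setnames(dt, "OLD", "new")

names(dt) #the first column is now called "new"; "KEEP" is untouched
```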

This is a much bigger data set – covering all employees of the SDP and giving many more details about them. Let’s explore a bit:

#table is a great function for describing discrete variables.
#  it gives the count of observations in each cell, and can also
#  be used to create what Stata calls two-way tables.
salaries[ , table(pay_rate_type)]
## pay_rate_type
##    DAILY   HOURLY SALARIED 
##      141      526    16895
salaries[ , table(organization_level)]
## organization_level
##                        ACADEMY          ADMINISTRATIVE OFFICE 
##                              4                           1578 
## CAREER AND TECHNICAL HIGH SCHL                CHARTER SCHOOLS 
##                            456                             12 
##     DISTRICT BUILDING/PROPERTY                EARLY CHILDHOOD 
##                             64                            356 
##              ELEMENTARY SCHOOL                         GARAGE 
##                           9667                              6 
##                    HIGH SCHOOL         INACTIVE WITH ACTIVITY 
##                           3101                             11 
##               LEARNING NETWORK                  MIDDLE SCHOOL 
##                             20                            954 
##      NON ADMINISTRATIVE OFFICE              NON PUBLIC SCHOOL 
##                           1191                              1 
##    TRANSITION / OVERAGE SCHOOL 
##                            141
#a cross-tabulation of these two; the first argument will
#  appear in the rows, the second in the columns
salaries[ , table(pay_rate_type, organization_level)]
##              organization_level
## pay_rate_type ACADEMY ADMINISTRATIVE OFFICE CAREER AND TECHNICAL HIGH SCHL
##      DAILY          2                    31                              4
##      HOURLY         0                   147                              0
##      SALARIED       2                  1400                            452
##              organization_level
## pay_rate_type CHARTER SCHOOLS DISTRICT BUILDING/PROPERTY EARLY CHILDHOOD
##      DAILY                  0                          0               0
##      HOURLY                 0                          0               2
##      SALARIED              12                         64             354
##              organization_level
## pay_rate_type ELEMENTARY SCHOOL GARAGE HIGH SCHOOL INACTIVE WITH ACTIVITY
##      DAILY                   23      0          11                      0
##      HOURLY                   0      0           2                      0
##      SALARIED              9644      6        3088                     11
##              organization_level
## pay_rate_type LEARNING NETWORK MIDDLE SCHOOL NON ADMINISTRATIVE OFFICE
##      DAILY                   2             1                        65
##      HOURLY                  0             0                       375
##      SALARIED               18           953                       751
##              organization_level
## pay_rate_type NON PUBLIC SCHOOL TRANSITION / OVERAGE SCHOOL
##      DAILY                    0                           2
##      HOURLY                   0                           0
##      SALARIED                 1                         139
salaries[ , table(organization_level, gender)]
##                                 gender
## organization_level                  F    M
##   ACADEMY                           1    3
##   ADMINISTRATIVE OFFICE           911  667
##   CAREER AND TECHNICAL HIGH SCHL  269  187
##   CHARTER SCHOOLS                  11    1
##   DISTRICT BUILDING/PROPERTY       26   38
##   EARLY CHILDHOOD                 343   13
##   ELEMENTARY SCHOOL              7989 1678
##   GARAGE                            0    6
##   HIGH SCHOOL                    1988 1113
##   INACTIVE WITH ACTIVITY            5    6
##   LEARNING NETWORK                  7   13
##   MIDDLE SCHOOL                   706  248
##   NON ADMINISTRATIVE OFFICE       597  594
##   NON PUBLIC SCHOOL                 1    0
##   TRANSITION / OVERAGE SCHOOL      94   47

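table also pairs nicely with base R’s prop.table, which converts a table of counts into shares. Here’s a toy sketch (an illustrative vector, not the SDP data):

```r
#toy vector of pay types
x <- c("SALARIED", "SALARIED", "HOURLY", "SALARIED", "DAILY")

#prop.table divides each cell of a table by the grand total,
#  turning counts into proportions
prop.table(table(x))
```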
Let’s say we’re only interested in salaried employees. How do we get rid of the hourly/daily employees?

Unlike in Stata, where we’d use drop and would have to jump through hoops to recover the lost observations (in case we decided to expand our focus later in our analysis), we can simply create a new data.table in R that contains only the observations we care about. This is, in my opinion, one of the most salient examples of a feature of R that blows Stata out of the water. We can keep many data sets at our disposal at once, without having to unload/reload the ones we’re not currently using. This will prove especially helpful when it comes to data cleaning/merging/reshaping.

salaried <- salaries[pay_rate_type == "SALARIED"]
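A toy sketch of the point above — the subset is a brand-new object, and the original table is untouched:

```r
library(data.table)

#toy stand-in for the salaries data
pay <- data.table(type = c("SALARIED", "HOURLY", "SALARIED"))

#subsetting creates a new table; nothing is dropped from pay
sal <- pay[type == "SALARIED"]

nrow(pay) #still 3
nrow(sal) #2
```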

Even more useful may be to focus only on teachers. First, we have to exclude anyone who’s not a teacher. How do we figure out who’s a teacher?

Approach:

  1. Count the number of observations in each category of title_description (we can see from the preview that there are at least some teachers under TEACHER,FULL TIME)
  2. Sort these categories to find the most common categories, since presumably teachers are by far the most common SDP employees.

Carrying this out will introduce a few new things:

  • Grouping. We’re going to group by title_description, and count observations in each group. If I remember correctly, in Stata, this would be like by title_description: count. When data.table does this, you should think of your data.table being split into a whole bunch of smaller data.tables, each of which has all of the observations for exactly one level of title_description.

Here’s an illustration:

[Illustration: a data.table split into per-group sub-data.tables, one for each level of title_description]

  • .N. data.table uses the abbreviation .N to stand for “number of observations”. It is defined locally, within each group, so each of the many sub-data.tables mentioned above has its own .N. This is basically like count in Stata.
  • order. We can arrange the rows of a table by using order in the first argument (i); use a minus sign (-) to go in decreasing order, otherwise it’s increasing order (like we saw with sort).
  • Chaining. We can tack on more and more [] to any given data.table operation to hone or continue or adjust our outcomes beyond what we could do in a single call to []. The general syntax is DT[...][...][...][...], where each [...] consists of doing some manipulation/subset/aggregation, etc.
#Let's take it step-by-step
# 1: Find the count of employees in each category
salaried[ , .N, by = title_description]
##                   title_description    N
##   1:       GENERAL CLEANER, 8 HOURS  503
##   2:  STUDENT CLIMATE STAFF,4 HOURS  451
##   3:                   SCHOOL NURSE  184
##   4:         TEACHER,SPEC EDUCATION 1314
##   5:  STUDENT CLIMATE STAFF,3 HOURS  410
##  ---                                    
## 531: DATA ANALYST - CHARTER SCHOOLS    1
## 532:        DIR,OPERATIONAL SYS DEV    1
## 533:         DIR,SYSTEMS ADMIN UNIT    1
## 534:  DIR,INSURANCE RISK MANAGEMENT    1
## 535:   DIR,DISTRICT PERFORMANCE OFF    1
# 2: reorder these to find the most common
salaried[ , .N, by = title_description][order(N)]
##                   title_description    N
##   1:   SCHOOL-BASED TECH MAINT ASST    1
##   2: DIR,CERT,SUB SVCS,SCH ALLOT SU    1
##   3:             NUTRITIONIST, PKHS    1
##   4: ADMINISTRATOR,PHILA VIRTUAL AC    1
##   5:     HR SYSTEMS CONTROL ANALYST    1
##  ---                                    
## 531:       GENERAL CLEANER, 8 HOURS  503
## 532:    CLASSROOM ASST,SP ED,SV HND  594
## 533:    ONE TO ONE ASST, SPECIAL ED  757
## 534:         TEACHER,SPEC EDUCATION 1314
## 535:              TEACHER,FULL TIME 6652
# 3: put it in decreasing order
salaried[ , .N, by = title_description][order(-N)]
##                   title_description    N
##   1:              TEACHER,FULL TIME 6652
##   2:         TEACHER,SPEC EDUCATION 1314
##   3:    ONE TO ONE ASST, SPECIAL ED  757
##   4:    CLASSROOM ASST,SP ED,SV HND  594
##   5:       GENERAL CLEANER, 8 HOURS  503
##  ---                                    
## 531: DATA ANALYST - CHARTER SCHOOLS    1
## 532:        DIR,OPERATIONAL SYS DEV    1
## 533:         DIR,SYSTEMS ADMIN UNIT    1
## 534:  DIR,INSURANCE RISK MANAGEMENT    1
## 535:   DIR,DISTRICT PERFORMANCE OFF    1
# 4: Only look at the top 20 most frequent positions
# (idea -- once it's resorted, these are simply
#  the first twenty rows of the sorted table)
salaried[ , .N, by = title_description][order(-N)][1:20]
##                  title_description    N
##  1:              TEACHER,FULL TIME 6652
##  2:         TEACHER,SPEC EDUCATION 1314
##  3:    ONE TO ONE ASST, SPECIAL ED  757
##  4:    CLASSROOM ASST,SP ED,SV HND  594
##  5:       GENERAL CLEANER, 8 HOURS  503
##  6:  STUDENT CLIMATE STAFF,4 HOURS  451
##  7:  STUDENT CLIMATE STAFF,3 HOURS  410
##  8:            FOOD SVCS ASSISTANT  386
##  9: SUPPORTIVE SERVICES ASST, 4 HR  317
## 10:                  BUS ATTENDANT  295
## 11:          SCHOOL POLICE OFFICER  264
## 12: SUPPORTIVE SERVICES ASST, 3 HR  249
## 13:    SCHOOL COUNSELOR, 10 MONTHS  245
## 14:            CUSTODIAL ASSISTANT  241
## 15:                      PRINCIPAL  225
## 16:                    SECRETARY I  224
## 17:                   SCHOOL NURSE  184
## 18:        FOOD SVCS WORKER SENIOR  163
## 19:  STUDENT CLIMATE STAFF,5 HOURS  145
## 20:  BUS CHAUFFEUR PT (4-5HRS/DAY)  128

We can start to see many interesting facets of employment at SDP here. 264 school police officers, but only 184 nurses. Ripe ground for exploring on your own!

I’ll continue focusing on teachers; it appears all but the special ed. teachers are grouped under TEACHER,FULL TIME, so we can subset:

teachers <- salaried[title_description == "TEACHER,FULL TIME"]
teachers
##          last_name first_name pay_rate_type pay_rate title_description
##    1:       ABBOTT      JOYCE      SALARIED    76461 TEACHER,FULL TIME
##    2:     ABDALLAH JUWAYRIYAH      SALARIED    46694 TEACHER,FULL TIME
##    3:  ABDEL-JALIL    GHADEER      SALARIED    45359 TEACHER,FULL TIME
##    4:  ABDUL BASIT    BARBARA      SALARIED    67706 TEACHER,FULL TIME
##    5: ABDUL-LATEEF     VILLIA      SALARIED    48945 TEACHER,FULL TIME
##   ---                                                                 
## 6648:      ZUNGOLO    WILLIAM      SALARIED    90051 TEACHER,FULL TIME
## 6649:        ZWICK      KEVIN      SALARIED    46694 TEACHER,FULL TIME
## 6650:       ZWOLAK       PAUL      SALARIED    54364 TEACHER,FULL TIME
## 6651:    ZYCHOWICZ       LISA      SALARIED    67706 TEACHER,FULL TIME
## 6652: ZYGMUND-FELT      ALLYN      SALARIED    90051 TEACHER,FULL TIME
##       home_organization home_organization_description organization_level
##    1:              1290       HAMILTON, ANDREW SCHOOL  ELEMENTARY SCHOOL
##    2:              1470           LOCKE, ALAIN SCHOOL  ELEMENTARY SCHOOL
##    3:              7440         TAYLOR, BAYARD SCHOOL  ELEMENTARY SCHOOL
##    4:              7530         ROWEN, WILLIAM SCHOOL  ELEMENTARY SCHOOL
##    5:              1010     BARTRAM, JOHN HIGH SCHOOL        HIGH SCHOOL
##   ---                                                                   
## 6648:              8390     FITZPATRICK, A. L. SCHOOL  ELEMENTARY SCHOOL
## 6649:              8300                MAYFAIR SCHOOL  ELEMENTARY SCHOOL
## 6650:              2290      FRANKLIN LEARNING CENTER        HIGH SCHOOL
## 6651:              8420       DECATUR, STEPHEN SCHOOL  ELEMENTARY SCHOOL
## 6652:              7100  COOKE, JAY ELEMENTARY SCHOOL  ELEMENTARY SCHOOL
##       type_of_representation gender run_date
##    1:            PFT-TEACHER      F 4/1/2016
##    2:            PFT-TEACHER      F 4/1/2016
##    3:            PFT-TEACHER      F 4/1/2016
##    4:            PFT-TEACHER      F 4/1/2016
##    5:            PFT-TEACHER      F 4/1/2016
##   ---                                       
## 6648:            PFT-TEACHER      M 4/1/2016
## 6649:            PFT-TEACHER      M 4/1/2016
## 6650:            PFT-TEACHER      M 4/1/2016
## 6651:            PFT-TEACHER      F 4/1/2016
## 6652:            PFT-TEACHER      F 4/1/2016

3.1.4 Group mean salaries – various cuts

Now we can explore data by group. How does pay differ by gender?

#simple to code!
teachers[ , mean(pay_rate), by = gender]
##    gender       V1
## 1:      F 68624.85
## 2:      M 67689.59
#the mean column automatically got named V1. If we want
#  friendlier output, we have to change things a little:
teachers[ , .(avg_wage = mean(pay_rate)), by = gender]
##    gender avg_wage
## 1:      F 68624.85
## 2:      M 67689.59

So among teachers in Philly, women actually earn more than men (and the difference is indeed statistically significant; I’ll show you how to test that in a bit)!

(Of course men and women with the same experience and certification are paid basically the same – it’s written explicitly in the teachers’ contract – but we don’t have any experience or certification measures to control for this)
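We can also compute several named summaries per group in a single .() call — a toy sketch, not the real teachers data:

```r
library(data.table)

#toy stand-in for the teachers data
toy <- data.table(gender = c("F", "F", "M"),
                  pay    = c(70000, 68000, 67000))

#several named summaries at once, plus the group size via .N
toy[ , .(avg_wage = mean(pay), min_wage = min(pay), n = .N), by = gender]
```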

How does pay differ by organization level?

teachers[ , .(avg_wage = mean(pay_rate)), by = organization_level]
##                organization_level avg_wage
## 1:              ELEMENTARY SCHOOL 67639.16
## 2:                    HIGH SCHOOL 69527.33
## 3:                  MIDDLE SCHOOL 69481.66
## 4: CAREER AND TECHNICAL HIGH SCHL 71931.52
## 5:                EARLY CHILDHOOD 74058.15
## 6:    TRANSITION / OVERAGE SCHOOL 63434.37
## 7:          ADMINISTRATIVE OFFICE 66255.22
## 8:      NON ADMINISTRATIVE OFFICE 82723.82
## 9:                        ACADEMY 67789.00

Which school has the best-paid teachers?

teachers[ , .(avg_wage = mean(pay_rate)),
         by = home_organization_description
          ][order(-avg_wage)]
##      home_organization_description avg_wage
##   1:       LOGAN SCHOOL HEAD START 90051.00
##   2:   COOK-WISSAHICKON HEAD START 90051.00
##   3:       SHAWMONT BRIGHT FUTURES 90051.00
##   4:        MUNOZ-MARIN HEAD START 90051.00
##   5:          SHARSWOOD HEAD START 90051.00
##  ---                                       
## 287:        DUNBAR, PAUL L. SCHOOL 56805.95
## 288:        SOUTH PHILA HEAD START 51113.00
## 289:             BLAINE HEAD START 51113.00
## 290:            BETHUNE HEAD START 51113.00
## 291:            MIFFLIN HEAD START     0.00

This looks fishy! It’s rare for any average worth knowing to come out exactly to an integer. I suspect there are very few teachers at these schools. How can we check?

#remember our handy friend .N!!
teachers[ , .(.N, avg_wage = mean(pay_rate)),
          by = home_organization_description
          ][order(-avg_wage)]
##      home_organization_description  N avg_wage
##   1:       LOGAN SCHOOL HEAD START  1 90051.00
##   2:   COOK-WISSAHICKON HEAD START  1 90051.00
##   3:       SHAWMONT BRIGHT FUTURES  1 90051.00
##   4:        MUNOZ-MARIN HEAD START  1 90051.00
##   5:          SHARSWOOD HEAD START  1 90051.00
##  ---                                          
## 287:        DUNBAR, PAUL L. SCHOOL 21 56805.95
## 288:        SOUTH PHILA HEAD START  1 51113.00
## 289:             BLAINE HEAD START  1 51113.00
## 290:            BETHUNE HEAD START  1 51113.00
## 291:            MIFFLIN HEAD START  1     0.00

In fact, almost all of the “best-” and “worst-” paid schools are outliers because there’s only one employee listed. So the ranking isn’t really a fair comparison.

So what can we do to improve our measure?

What’s typically done is to exclude all schools that don’t pass some cutoff minimum number of employees. For this exercise, let’s say we want to exclude all schools that have fewer than 10 full-time teachers in the data.

The way to do so is pretty natural. Remember that when we use by, we’re essentially splitting our main data.table up into 291 smaller data.tables, one for each school. We want to keep only those sub-data.tables that have at least 10 teachers – i.e., they have at least 10 rows, i.e., .N is at least 10:

#We no longer keep .N as an output -- this is just for
#  focus, since after filtering we already know every
#  remaining school has at least ten teachers
teachers[ , if (.N >= 10) .(avg_wage = mean(pay_rate)),
          by = home_organization_description
          ][order(-avg_wage)]
##      home_organization_description avg_wage
##   1:           NON-PUBLIC PROGRAMS 82723.82
##   2:       SOUTH PHILADELPHIA H.S. 78295.27
##   3:  GIRLS, PHILA HIGH SCHOOL FOR 77710.83
##   4:          POWEL, SAMUEL SCHOOL 77700.50
##   5:       PENN TREATY HIGH SCHOOL 76291.82
##  ---                                       
## 208:               KENSINGTON CAPA 59277.92
## 209:        HUEY, SAMUEL B. SCHOOL 58582.12
## 210:      PEIRCE, THOMAS M. SCHOOL 58114.21
## 211: CLEMENTE, ROBERTO MIDDLE SCHL 57635.85
## 212:        DUNBAR, PAUL L. SCHOOL 56805.95

Here we see our first if statement in R. The basic form of an if statement in R is:

#Note the parentheses around the logical!! Very easy to forget
if (some_logical){
  # what to do if true
} else {
  # what to do if false
}

It’s important to note that some_logical has to be either TRUE or FALSE – it can’t be a vector!! This sounds easy now but it’s guaranteed to trip you up at some point.
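For element-wise decisions over a whole vector, use ifelse (or data.table’s faster fifelse) instead — a toy sketch:

```r
#if() needs a single TRUE/FALSE; for a vector of conditions,
#  ifelse() chooses element by element
pay <- c(48, 31261, 9349)

ifelse(pay > 1000, "salaried-ish", "hourly-ish")
#returns "hourly-ish" "salaried-ish" "salaried-ish"
```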

There’s no else statement in the condition we used above. So, for all the schools where there are fewer than 10 teachers, nothing happens (technically, the statement returns NULL – but don’t worry about that for now), and data.table ignores any such sub-data.table.

I don’t really know what NON-PUBLIC PROGRAMS means… but besides that, we see a lot of staple high school names. Of course, all teachers in Philly are on the same teachers’ contract and hence the same pay scale – so what this exercise has really told us is that South Philly high has the most experienced and/or certified teachers, and that teachers at Dunbar are young and/or uncertified.

We could go on exploring / slicing&dicing this data all day, but I think we’ve learned enough essentials of manipulation/inspection for now. We’ll pick up more tricks as we move along.


3.2 Merging

Recall our teacher attendance data:

attendance
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430

We couldn’t really tell anything about the schools because we didn’t know what the school codes meant.

Well, it appears we’ve found the Rosetta Stone in the teacher salary data. The home_organization field looks very much like the ULCS_NO field in the attendance data.

Obviously, it’d be ideal to know for sure that they represent the same ID for each school. In lieu of that, we can bolster our confidence in the alignment by comparing the fields a bit. Here are some techniques:

#We know from attendance there are 212 schools.
#  How many unique schools are represented in the teacher data?
teachers[ , uniqueN(home_organization)]
## [1] 291
#How many of the ULCS_NO values are also found in home_organization?
#  intersect finds the overlap (the middle of the Venn
#  diagram) of its first and second arguments
length(intersect(attendance$ULCS_NO, teachers[ , unique(home_organization)]))
## [1] 211

That’s pretty convincing if you ask me.
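To see which codes don’t line up, setdiff (the complement of intersect) is handy — a toy sketch with hypothetical codes, not the real ones:

```r
#hypothetical ID codes, for illustration only
att_ids <- c("1010", "1290", "9999")
sal_ids <- c("1010", "1290", "1470")

#codes in the first set but not the second
setdiff(att_ids, sal_ids)
## [1] "9999"
```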

So, how do we match the two data sets? In Stata, depending on which data set you currently have in memory, we would either perform a one-to-many or a many-to-one merge.

In R, merging either way is a cinch. But first, we need a few more tools.

3.2.1 := - Adding a column to a data.table

When we want to add a new column to an existing data.table, we use the := operator in j (the second argument). A quick aside: this is done by reference, which means it’s memory-efficient and fast; see this Q&A for more details.

Let’s try it out with two examples. Right now, in attendance, teacher attendance rates are listed as a percentage (out of 100); sometimes, it’s more convenient to have percentages as a proportion (out of one).

Also, we might like to know the relative attendance rates – the ratio of attendance at a given school to average attendance.

In teachers, gender is stored as a character – either M or F. But it’s often quite convenient to have binary variables stored as logical – for subsetting.

#add attendance as a proportion
attendance[ , attendance_proportion := SCH_TEACHER_ATTEND / 100]

#add relative attendance
attendance[ , attendance_relative := SCH_TEACHER_ATTEND / SDP_TEACHER_ATTEND]
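Both assignments could also be done in one step with the functional form of :=; here’s a toy sketch, not the real attendance data:

```r
library(data.table)

#toy stand-in for the attendance data
dt <- data.table(sch = c("A", "B"), attend = c(92.7, 95.0))

#add two columns in a single call, by reference
dt[ , `:=`(attend_prop = attend / 100,
           attend_rel  = attend / mean(attend))]
dt[]
```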

#a quirk of data.table is that after using
#  :=, we sometimes have to force printing,
#  which we can do by appending [] to the end
attendance[]
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
#add logical version of gender
teachers[ , male := gender == "M"]

#now, subset to show only male teachers
teachers[(male)]
##        last_name first_name pay_rate_type pay_rate title_description
##    1: ABDULALEEM   MUHAMMAD      SALARIED    65121 TEACHER,FULL TIME
##    2:   ABDULLAH    RASHEED      SALARIED    83382 TEACHER,FULL TIME
##    3:   ACKERMAN       CARL      SALARIED    83382 TEACHER,FULL TIME
##    4:      ADAIR    DARRELL      SALARIED    66461 TEACHER,FULL TIME
##    5:      ADAIR     JOSEPH      SALARIED    51113 TEACHER,FULL TIME
##   ---                                                               
## 1822:     ZUCKER    JEFFREY      SALARIED    73453 TEACHER,FULL TIME
## 1823:     ZUFOLO      BRIAN      SALARIED    57451 TEACHER,FULL TIME
## 1824:    ZUNGOLO    WILLIAM      SALARIED    90051 TEACHER,FULL TIME
## 1825:      ZWICK      KEVIN      SALARIED    46694 TEACHER,FULL TIME
## 1826:     ZWOLAK       PAUL      SALARIED    54364 TEACHER,FULL TIME
##       home_organization  home_organization_description
##    1:              6090 RANDOLPH TECHNICAL HIGH SCHOOL
##    2:              4220        BLAINE, JAMES G. SCHOOL
##    3:              2670       CONSTITUTION HIGH SCHOOL
##    4:              2010 FRANKLIN, BENJAMIN HIGH SCHOOL
##    5:              5400                RICHMOND SCHOOL
##   ---                                                 
## 1822:              7310       FELTONVILLE INTERMEDIATE
## 1823:              5330      HUNTER, WILLIAM H. SCHOOL
## 1824:              8390      FITZPATRICK, A. L. SCHOOL
## 1825:              8300                 MAYFAIR SCHOOL
## 1826:              2290       FRANKLIN LEARNING CENTER
##                   organization_level type_of_representation gender
##    1: CAREER AND TECHNICAL HIGH SCHL            PFT-TEACHER      M
##    2:              ELEMENTARY SCHOOL            PFT-TEACHER      M
##    3:                    HIGH SCHOOL            PFT-TEACHER      M
##    4:                    HIGH SCHOOL            PFT-TEACHER      M
##    5:              ELEMENTARY SCHOOL            PFT-TEACHER      M
##   ---                                                             
## 1822:              ELEMENTARY SCHOOL            PFT-TEACHER      M
## 1823:              ELEMENTARY SCHOOL            PFT-TEACHER      M
## 1824:              ELEMENTARY SCHOOL            PFT-TEACHER      M
## 1825:              ELEMENTARY SCHOOL            PFT-TEACHER      M
## 1826:                    HIGH SCHOOL            PFT-TEACHER      M
##       run_date male
##    1: 4/1/2016 TRUE
##    2: 4/1/2016 TRUE
##    3: 4/1/2016 TRUE
##    4: 4/1/2016 TRUE
##    5: 4/1/2016 TRUE
##   ---              
## 1822: 4/1/2016 TRUE
## 1823: 4/1/2016 TRUE
## 1824: 4/1/2016 TRUE
## 1825: 4/1/2016 TRUE
## 1826: 4/1/2016 TRUE
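The parentheses in teachers[(male)] are a common data.table idiom for subsetting on a logical column, wrapping the column name so it is evaluated as a logical expression in i; a toy sketch:

```r
library(data.table)

dt <- data.table(name = c("ANDREA", "RODNEY"), male = c(FALSE, TRUE))

dt[(male)] #rows where male is TRUE
dt[!male]  #the complement
```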

3.2.2 Merge and update

Now we’re equipped to understand the syntax for a merge. Let’s add the school’s name (home_organization_description in teachers) to the attendance data.

First, note that teachers has more data in it than we need. There are many teachers in each school, meaning the mapping between home_organization code and school name is repeated many times (once for each teacher in a given school). This presents something of an issue when we try to merge the teachers data into attendance – for each ULCS_NO that gets matched to a home_organization in teachers, there will be many rows, even if they all have the same information. To deal with this we’ll use unique to eliminate all the duplicates. We’ll go step-by-step first, then all at once:

#this will take the FIRST row associated
#  with each school. This is important in
#  general, but not here since we only
#  care about the school-level data for now,
#  which is the same in all rows.
schools <- unique(teachers, by = "home_organization")
schools
##         last_name first_name pay_rate_type pay_rate title_description
##   1:       ABBOTT      JOYCE      SALARIED    76461 TEACHER,FULL TIME
##   2:     ABDALLAH JUWAYRIYAH      SALARIED    46694 TEACHER,FULL TIME
##   3:  ABDEL-JALIL    GHADEER      SALARIED    45359 TEACHER,FULL TIME
##   4:  ABDUL BASIT    BARBARA      SALARIED    67706 TEACHER,FULL TIME
##   5: ABDUL-LATEEF     VILLIA      SALARIED    48945 TEACHER,FULL TIME
##  ---                                                                 
## 287:     VALENTIN   MADELINE      SALARIED    65121 TEACHER,FULL TIME
## 288:   WERTZ YORI    DESIREE      SALARIED    76461 TEACHER,FULL TIME
## 289:     WIDMAIER      DONNA      SALARIED    76461 TEACHER,FULL TIME
## 290:     WILLIAMS     TAMMAR      SALARIED    76461 TEACHER,FULL TIME
## 291:       WRIGHT        AMY      SALARIED    67789 TEACHER,FULL TIME
##      home_organization home_organization_description organization_level
##   1:              1290       HAMILTON, ANDREW SCHOOL  ELEMENTARY SCHOOL
##   2:              1470           LOCKE, ALAIN SCHOOL  ELEMENTARY SCHOOL
##   3:              7440         TAYLOR, BAYARD SCHOOL  ELEMENTARY SCHOOL
##   4:              7530         ROWEN, WILLIAM SCHOOL  ELEMENTARY SCHOOL
##   5:              1010     BARTRAM, JOHN HIGH SCHOOL        HIGH SCHOOL
##  ---                                                                   
## 287:              7262     ELLWOOD SCHOOL HEAD START    EARLY CHILDHOOD
## 288:              8352    SPRUANCE SCHOOL HEAD START    EARLY CHILDHOOD
## 289:              6202       DAY, ANNA B. HEAD START    EARLY CHILDHOOD
## 290:              1027         LEA SCHOOL HEAD START    EARLY CHILDHOOD
## 291:              2242              BREGY HEAD START    EARLY CHILDHOOD
##      type_of_representation gender run_date  male
##   1:            PFT-TEACHER      F 4/1/2016 FALSE
##   2:            PFT-TEACHER      F 4/1/2016 FALSE
##   3:            PFT-TEACHER      F 4/1/2016 FALSE
##   4:            PFT-TEACHER      F 4/1/2016 FALSE
##   5:            PFT-TEACHER      F 4/1/2016 FALSE
##  ---                                             
## 287:             PFT- PRE K      F 4/1/2016 FALSE
## 288:             PFT- PRE K      F 4/1/2016 FALSE
## 289:             PFT- PRE K      F 4/1/2016 FALSE
## 290:             PFT- PRE K      F 4/1/2016 FALSE
## 291:             PFT- PRE K      F 4/1/2016 FALSE
#now merge, using := to add the school name column
attendance[schools, school_name :=
             home_organization_description, 
           on = c(ULCS_NO = "home_organization")]
attendance
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
##                         school_name
##   1:       SPRUANCE, GILBERT SCHOOL
##   2:        HOWE, JULIA WARD SCHOOL
##   3:     WILLARD, FRANCES E. SCHOOL
##   4: LABRUM,GEN HARRY MIDDLE SCHOOL
##   5:     LINGELBACH, ANNA L. SCHOOL
##  ---                               
## 208:     KEARNY, GEN. PHILIP SCHOOL
## 209:    HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211:      ROBESON, PAUL HIGH SCHOOL
## 212:          HESTON, EDWARD SCHOOL

The general form of the syntax we just used is X[Y, variable_in_x := variable_in_y, on = c(a = "b")]. Let’s break that down a bit:

  • X and Y are data.tables. We match their rows according to alignment in the variables specified by the on argument within the square brackets ([]) used to access X.
  • on = c(a = "b") tells R that rows whose value in column a of X (the unquoted name, on the left) matches the value in column b of Y (the quoted name, on the right) should be associated with one another.
  • We add the variable on the left of := to data.table X. The variable on the right of := is typically found in Y, though it could also be an expression involving columns from both X and Y.
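To make that concrete, here is a minimal sketch of the same update-join pattern using two invented toy tables (X, Y, and the column names a, b, value, and label are all made up for illustration):

```r
library(data.table)

#X is the main table; Y is a lookup table we want to pull a column from
X <- data.table(a = c(1, 2, 3), value = c(10, 20, 30))
Y <- data.table(b = c(2, 3, 4), label = c("two", "three", "four"))

#add Y's label column to X, matching X's column a against Y's column b;
#  the i. prefix makes explicit that label comes from the i table (Y)
X[Y, label := i.label, on = c(a = "b")]

X$label
#the row with a = 1 has no match in Y, so its label is NA;
#  Y's unmatched row (b = 4) contributes nothing to X
```

Note that this is an update join: X keeps exactly its original rows and simply gains a column, rather than producing a new, possibly longer table.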

Let’s explore a few other ways we could have gone about doing this merge. First, let’s remove the column that we just created so that we’re really starting from scratch. The way to remove columns from a data.table is to set them to NULL, which is very fast.

attendance[ , school_name := NULL][]
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
#alternative 1: skip defining schools
#  extra tidbit: sometimes, both the 
#  main table (x) and the merging table
#  (in the first argument, i -- we called
#   it Y above) have columns with the same
#  name. To overcome the ambiguity this 
#  engenders, we can prepend i. to the
#  column names in teachers to tell R
#  that we're referring _explicitly_
#  to that column in teachers. Similarly,
#  we can prepend x. to tell R that we're
#  referring to the column in attendance
attendance[unique(teachers, by = "home_organization"),
           school_name := i.home_organization_description,
           on = c(ULCS_NO = "home_organization")][]
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
##                         school_name
##   1:       SPRUANCE, GILBERT SCHOOL
##   2:        HOWE, JULIA WARD SCHOOL
##   3:     WILLARD, FRANCES E. SCHOOL
##   4: LABRUM,GEN HARRY MIDDLE SCHOOL
##   5:     LINGELBACH, ANNA L. SCHOOL
##  ---                               
## 208:     KEARNY, GEN. PHILIP SCHOOL
## 209:    HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211:      ROBESON, PAUL HIGH SCHOOL
## 212:          HESTON, EDWARD SCHOOL
#reset again
attendance[ , school_name := NULL]

#alternative 2: use unique within the merge
#  by = .EACHI is a syntax unique to merging.
#  Using this tells R to do the thing in j
#  within each group that's matched.
#  So, here, each ULCS_NO gets matched to many
#  home_organization rows in teachers.
#  within this group, we use unique to
#  get the school's name, since unique turns the
#  vector of school names (one per teacher in
#  each school that's matched) into a single value.
attendance[teachers, 
           school_name := unique(home_organization_description),
           on = c(ULCS_NO = "home_organization"), by = .EACHI][]
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
##                         school_name
##   1:       SPRUANCE, GILBERT SCHOOL
##   2:        HOWE, JULIA WARD SCHOOL
##   3:     WILLARD, FRANCES E. SCHOOL
##   4: LABRUM,GEN HARRY MIDDLE SCHOOL
##   5:     LINGELBACH, ANNA L. SCHOOL
##  ---                               
## 208:     KEARNY, GEN. PHILIP SCHOOL
## 209:    HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211:      ROBESON, PAUL HIGH SCHOOL
## 212:          HESTON, EDWARD SCHOOL
#reset again
attendance[ , school_name := NULL]

#alternative 3: like 2, except use [] instead of unique.
#  This alternative works, and is faster, but is less robust
#  to finding mistakes in your data. Thankfully, this data is
#  pretty clean, but we often find messy data in the wild --
#  maybe there was a typo in the organization code or school name,
#  and instead of each home_organization being matched to
#  exactly one home_organization_description, we might find
#  it matched to several (e.g., "PS 131" and "PS #131").
#  If we use the unique approach from alternative 2,
#  we'll get an error, which tells us we need to examine our data
#  more closely. This approach, by contrast, silently takes the
#  first value and will never raise such an error.
attendance[teachers,
           school_name := home_organization_description[1],
           on = c(ULCS_NO = "home_organization"), by = .EACHI][]
##      ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
##   1:    8350   2013-2014               92.7               93.4       835
##   2:    7320   2013-2014               95.0               93.4       732
##   3:    5440   2013-2014               96.0               93.4       544
##   4:    8320   2013-2014               95.1               93.4       832
##   5:    6440   2013-2014               96.6               93.4       644
##  ---                                                                    
## 208:    5480   2013-2014               92.0               93.4       548
## 209:    1300   2013-2014               91.4               93.4       130
## 210:    6090   2013-2014               95.3               93.4       609
## 211:    1050   2013-2014               94.7               93.4       105
## 212:    4300   2013-2014               94.7               93.4       430
##      attendance_proportion attendance_relative
##   1:                 0.927           0.9925054
##   2:                 0.950           1.0171306
##   3:                 0.960           1.0278373
##   4:                 0.951           1.0182013
##   5:                 0.966           1.0342612
##  ---                                          
## 208:                 0.920           0.9850107
## 209:                 0.914           0.9785867
## 210:                 0.953           1.0203426
## 211:                 0.947           1.0139186
## 212:                 0.947           1.0139186
##                         school_name
##   1:       SPRUANCE, GILBERT SCHOOL
##   2:        HOWE, JULIA WARD SCHOOL
##   3:     WILLARD, FRANCES E. SCHOOL
##   4: LABRUM,GEN HARRY MIDDLE SCHOOL
##   5:     LINGELBACH, ANNA L. SCHOOL
##  ---                               
## 208:     KEARNY, GEN. PHILIP SCHOOL
## 209:    HARRINGTON, AVERY D. SCHOOL
## 210: RANDOLPH TECHNICAL HIGH SCHOOL
## 211:      ROBESON, PAUL HIGH SCHOOL
## 212:          HESTON, EDWARD SCHOOL

Which approach you use is largely a matter of taste – each one is well-suited to certain situations.

3.2.3 Missing Data

No data set is perfect. In fact, most data sets are nightmares. Missing data is pervasive, so it stands to reason that R is well-equipped to handle missing data.

In R, missing data is coded as NA – at least, if we’ve been careful when loading the data, usually by telling R what missing data looks like in a given file. Take fread, for example. We haven’t needed this yet because the two data sets we’ve loaded are complete/balanced, but fread has an argument called na.strings which tells R what to look for in a data file to treat as missing (i.e., as NA).
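As a tiny, made-up illustration of na.strings (fread can read from a literal string, which is handy for quick demos; the column names here are invented):

```r
library(data.table)

#"." marks missing values in this (made-up) raw data
dt <- fread("id,score\n1,87\n2,.\n3,91", na.strings = ".")

dt$score
#the "." became NA, and the column is numeric rather than character
```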

We now have some missing data after our merge, however. To see this, note the following:

#is.na returns a logical vector saying
#  whether each element of school_name
#  is missing (NA) or not
attendance[is.na(school_name)]
##    ULCS_NO SCHOOL_YEAR SCH_TEACHER_ATTEND SDP_TEACHER_ATTEND SCHOOL_ID
## 1:    2140   2013-2014               95.7               93.4       214
##    attendance_proportion attendance_relative school_name
## 1:                 0.957            1.024625          NA
#NOTE: TESTING NA WITH EQUALITY DOESN'T WORK
attendance[school_name == NA]
## Empty data.table (0 rows) of 8 cols: ULCS_NO,SCHOOL_YEAR,SCH_TEACHER_ATTEND,SDP_TEACHER_ATTEND,SCHOOL_ID,attendance_proportion...
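The reason == NA fails is that in R, any comparison involving NA evaluates to NA rather than TRUE or FALSE (“is this value equal to an unknown? Unknown.”), and subsetting keeps only rows where the condition is TRUE. In base R:

```r
NA == NA     #NA -- is one unknown equal to another unknown? Unknown.
5 == NA      #NA as well; comparisons with NA never yield TRUE or FALSE

x <- c(1, NA, 3)
is.na(x)     #FALSE TRUE FALSE -- the correct test for missingness
```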

So what’s up with ULCS_NO 2140? Why does this school have no teachers in the salary data??

Turns out, it is in the salary data, but none of the teachers are listed as TEACHER,FULL TIME:

#remember, salaries is the full set
#  of ALL employees of SDP -- we
#  didn't remove any observations from here
#(using the nrows argument for print to condense output)
print(salaries[home_organization == "2140"], nrows = 10)
##        last_name  first_name pay_rate_type pay_rate
##  1:        ABNEY       GILDA      SALARIED    79586
##  2:       AIKENS       SARAH      SALARIED    48110
##  3:         AKES      RONALD      SALARIED    17963
##  4: AMIT-CUBBAGE       DONNA      SALARIED    84880
##  5:       BALLEW       MEGAN      SALARIED    52530
## ---                                                
## 79:      TARANTA CHRISTOPHER      SALARIED    84880
## 80:       TAYLOR   ELIZABETH      SALARIED    84880
## 81:        VECSI       JANEL      SALARIED    77961
## 82: VOLIN AVELIN   ALEXANDRA      SALARIED    77961
## 83:     WILLIAMS      LESLIE      SALARIED    34164
##               title_description home_organization
##  1: SCHOOL COUNSELOR, 10 MONTHS              2140
##  2:        FOOD SVCS MANAGER II              2140
##  3:         FOOD SVCS WORKER II              2140
##  4:       TEACHER,DEMONSTRATION              2140
##  5:       TEACHER,DEMONSTRATION              2140
## ---                                              
## 79:       TEACHER,DEMONSTRATION              2140
## 80:       TEACHER,DEMONSTRATION              2140
## 81:       TEACHER,DEMONSTRATION              2140
## 82:       TEACHER,DEMONSTRATION              2140
## 83:         CUSTODIAL ASSISTANT              2140
##      home_organization_description organization_level
##  1: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
##  2: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
##  3: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
##  4: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
##  5: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
## ---                                                  
## 79: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
## 80: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
## 81: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
## 82: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
## 83: MASTERMAN,JULIA R. HIGH SCHOOL        HIGH SCHOOL
##     type_of_representation gender run_date
##  1:            PFT-TEACHER      F 4/1/2016
##  2:            PFT-F/S MGR      F 4/1/2016
##  3:              LOCAL 634      M 4/1/2016
##  4:            PFT-TEACHER      F 4/1/2016
##  5:            PFT-TEACHER      F 4/1/2016
## ---                                       
## 79:            PFT-TEACHER      M 4/1/2016
## 80:            PFT-TEACHER      F 4/1/2016
## 81:            PFT-TEACHER      F 4/1/2016
## 82:            PFT-TEACHER      F 4/1/2016
## 83:             LOCAL 1201      F 4/1/2016
#since they're not full-time teachers, what are they?
salaries[home_organization == "2140",
         table(title_description)]
## title_description
##                ASST PRINCIPAL   BUILDING ENGINEER-GROUP III 
##                             1                             1 
##           CUSTODIAL ASSISTANT           FOOD SVCS ASSISTANT 
##                             2                             2 
##          FOOD SVCS MANAGER II      FOOD SVCS UTILITY WORKER 
##                             1                             1 
##            FOOD SVCS WORKER I           FOOD SVCS WORKER II 
##                             1                             1 
##      GENERAL CLEANER, 8 HOURS   ONE TO ONE ASST, SPECIAL ED 
##                             3                             1 
##                     PRINCIPAL   SCHOOL COUNSELOR, 10 MONTHS 
##                             1                             3 
##                  SCHOOL NURSE     SCHOOL OPERATIONS OFFICER 
##                             1                             1 
##                   SECRETARY I STUDENT CLIMATE STAFF,3 HOURS 
##                             3                             2 
## STUDENT CLIMATE STAFF,4 HOURS         TEACHER,DEMONSTRATION 
##                             3                            54 
## TEACHER,DEMONSTRATION,SPEC ED 
##                             1

That’s right – at Masterman, none of the teachers appear to be classified as such. They’re classified for wage purposes as TEACHER,DEMONSTRATION. I don’t really know what this means… I was only able to find a small tidbit on the SDP website:

Demonstration Teachers are unique educators. They utilize the most current educational techniques and instructional media, which contribute to effective teaching practices in their specific teaching areas. They are observed at times by personnel from within the School District as well as visitors from other areas and institutions, both national and international.

This appears to be something basically exclusive to Masterman, with a fair presence at John Hancock as well:

salaries[title_description == "TEACHER,DEMONSTRATION",
         .N, by = home_organization_description]
##     home_organization_description  N
## 1:           HANCOCK, JOHN SCHOOL 22
## 2:        PENN TREATY HIGH SCHOOL  1
## 3: MASTERMAN,JULIA R. HIGH SCHOOL 54
## 4:        HOWE, JULIA WARD SCHOOL  1
## 5:          FRANKFORD HIGH SCHOOL  1
## 6: LABRUM,GEN HARRY MIDDLE SCHOOL  3
## 7: MASTBAUM, JULES E. HIGH SCHOOL  1

So, it turns out we’d be better off just using salaries for the merge to define the school’s name:

attendance[salaries, 
           school_name := unique(home_organization_description),
           on = c(ULCS_NO = "home_organization"), by = .EACHI]
attendance[is.na(school_name)] #fixed!
## Empty data.table (0 rows) of 8 cols: ULCS_NO,SCHOOL_YEAR,SCH_TEACHER_ATTEND,SDP_TEACHER_ATTEND,SCHOOL_ID,attendance_proportion...

Now we can repeat our earlier exercise and reveal the names of the best- and worst-attended schools in Philly. While we’re at it, we can introduce column subsetting. To focus attention on only a few columns, we can simply put them in j, wrapped in .(). Only the columns we name there will show up in the output (so it doesn’t take up as much room and we can immediately focus on the most important information).

attendance[SCH_TEACHER_ATTEND < 90 | 
             SCH_TEACHER_ATTEND > 97
           ][order(-SCH_TEACHER_ATTEND), 
             .(ULCS_NO, school_name, SCH_TEACHER_ATTEND)]
##     ULCS_NO                    school_name SCH_TEACHER_ATTEND
##  1:    1190         MOTIVATION HIGH SCHOOL               98.1
##  2:    6040    SAUL, WALTER B. HIGH SCHOOL               97.6
##  3:    8560            THE WORKSHOP SCHOOL               97.6
##  4:    5340        LUDLOW, JAMES R. SCHOOL               97.2
##  5:    7350        LOWELL, JAMES R. SCHOOL               89.9
##  6:    7510    BETHUNE, MARY MCLEOD SCHOOL               89.8
##  7:    8240       DISSTON, HAMILTON SCHOOL               89.7
##  8:    7290       STEARNE, ALLEN M. SCHOOL               89.7
##  9:    5430          AMY 5 AT JAMES MARTIN               89.6
## 10:    7220       CARNELL, LAURA H. SCHOOL               89.0
## 11:    6360    ROOSEVELT ELEMENTARY SCHOOL               89.0
## 12:    1330         HUEY, SAMUEL B. SCHOOL               88.8
## 13:    4240 CASSIDY,LEWIS C ACADEMICS PLUS               88.7
## 14:    1130           TILDEN MIDDLE SCHOOL               88.5
## 15:    1340                  LEA, HENRY C.               88.4
## 16:    6260       HOUSTON, HENRY H. SCHOOL               88.0
## 17:    2480      ARTHUR, CHESTER A. SCHOOL               87.8
## 18:    6340     PENNELL, JOSEPH ELEMENTARY               86.9
## 19:    5370            MOFFET, JOHN SCHOOL               86.5
## 20:    4570   MEADE, GEN. GEORGE G. SCHOOL               86.4
## 21:    4350       RHODES ELEMENTARY SCHOOL               85.3
##     ULCS_NO                    school_name SCH_TEACHER_ATTEND

How fitting! Motivation High School appears to value teacher attendance very highly.

3.3 Non-.csv Data, Part I: Excel Files

Often, the world is cruel and we are unable to find the data that we want to use for our project in a standard character-separated format. When this situation arises, we have two options.

The first is to simply convert that data into our favorite format and treat the new file as the master data file, never looking back.

There are three major problems with this approach:

  1. Processing time. It can be time-consuming to convert many files; this is a fixed cost, but often our time is better spent analyzing the data than toying with it and tinkering with the parameters of a data converter.
  2. Data fidelity. All data converters have their flaws. Encoding issues, unexpected data types, you name it. It’s almost impossible to tell when, in the process of converting your data, some observation’s entries have been mistakenly changed without your knowing.
  3. Software licenses. Some files are stored in program-specific binary file formats (SAS, Stata, even R all have specific file types). This is intended to make it really easy to work from session to session within that program, and re-loading a data environment is fast and simple. Moving to machines that don’t have that software installed, however, can be nightmarish.

My preferred approach is to use R itself to read in files of all formats. Users of R have encountered all of our problems before, and many kind souls out there in the ether have been so generous as to donate their tailor-made solutions to the R community.

Excel files are an example of one such foreign format. As you can imagine, Excel files are ubiquitous. It’s no surprise, then, that CRAN is plastered with different packages designed to load and interact with data created and stored in Excel formats. See this Q&A for the gamut of options.

Today we’ll cover just two options, since I have found them to be quite effective and have never needed any of the others. The first is the package readxl. It is designed to be streamlined and fast. It has few bells and whistles and is not very flexible. To handle more fringe/out-of-the-ordinary data files, there is xlsx. The main advantage I’ve found for xlsx is that it’s better at properly reading fields formatted as dates, which is a complicated facet of how .xlsx files are stored.

On our own machines, we would install them like we would any other package:

#we can install multiple packages at once by
#  passing a character vector to install.packages
install.packages(c("xlsx", "readxl"))

3.3.1 School Demographic Data

Philly’s ODI also releases some Excel files summarizing some basic demographic information about each school. This data, as usual already on your machine, was downloaded from here.

R doesn’t have any easy facility to preview the data in an Excel file, so we’ll just pause from R for now and check it out in Excel.

Excuse us for a moment, R.

<<R sips coffee>>

As we can see, the data is a tad messy, but R has the strength to deal with it! Let’s go!

library(readxl)
#As we saw, this file has many sheets. the read_excel function
#  has an argument sheet which we use to specify the sheet we want
#Also, as we saw, there's a bunch of extraneous rows at the top of the file.
#  We use the skip argument to ignore them. But in so doing, we necessarily
#  lose some information (namely, which columns correspond to Female and
#  which correspond to Male).
#To overcome this, we're going to supply
#  the column names ourselves. This is also favorable since the column
#  headers in the raw data file are plagued by spaces -- which makes
#  them quite hard to deal with once they're brought into R.
school_gender <-
  read_excel("./data/2015-2016 Enr Dem (Suppressed).xls", 
             sheet = "Gender", skip = 6,
             #enter a blank for the first column, which
             #  will be dropped anyway
             col_names = c("", "school_id", "school_name", "grade",
                           "total_enrolled", "count_female",
                           "pct_female", "count_male", "pct_male"))
## Warning in read_xls(path, sheet, col_names, col_types, na, skip): Found
## blank columns: 1

3.3.1.1 setDT

Now that we’ve ventured outside of the data.table world, we’ll have to deal with a minor annoyance – data.frames.

data.frames are R’s native data management class. I find them to be a pain to deal with; for example, if we print a data.frame, it will print the entire table, instead of truncating all but the beginning and end of the object, which is basically a useless overload of information. Also, most of the handy syntax that we’re starting to get used to from data.table is inaccessible/doesn’t work in data.frames. data.frame syntax is ugly.

Luckily for us, data.table has the perfect tool for those of us lost in the data.frame wilderness: setDT!

setDT will do its best to convert non-data.table objects to data.tables by reference, which means that there’s no copy being made. This is probably meaningless to you now, but just know that if you have a huge data set (millions of observations), making a copy of your data is something you should try and avoid doing at all costs, as it’s very time- and memory-consuming.

class(school_gender)
## [1] "tbl_df"     "tbl"        "data.frame"
#(usually, I just wrap this
#  around read_excel to do it all at once)
setDT(school_gender)
class(school_gender)
## [1] "data.table" "data.frame"

3.3.1.2 Forcing numeric: lapply(.SD, ...)

print(school_gender, class = TRUE)
##       school_id                  school_name      grade total_enrolled
##           <num>                       <char>     <char>          <num>
##    1:       101     John Bartram High School   9.000000            200
##    2:       101     John Bartram High School  10.000000            197
##    3:       101     John Bartram High School  11.000000            172
##    4:       101     John Bartram High School  12.000000            141
##   ---                                                                 
## 1662:       878 Philadelphia Virtual Academy   9.000000             46
## 1663:       878 Philadelphia Virtual Academy  10.000000             69
## 1664:       878 Philadelphia Virtual Academy  11.000000             79
## 1665:       878 Philadelphia Virtual Academy  12.000000             64
## 1666:       878 Philadelphia Virtual Academy ALL GRADES            311
##       count_female pct_female count_male pct_male
##             <char>     <char>     <char>   <char>
##    1:    70.000000   0.350000 130.000000 0.650000
##    2:    86.000000   0.436548 111.000000 0.563452
##    3:    78.000000   0.453488  94.000000 0.546512
##    4:    63.000000   0.446809  78.000000 0.553191
##   ---                                            
## 1662:            s          s          s        s
## 1663:    43.000000   0.623188  26.000000 0.376812
## 1664:    47.000000   0.594937  32.000000 0.405063
## 1665:    41.000000   0.640625  23.000000 0.359375
## 1666:   191.000000   0.614148 120.000000 0.385852

We’ve got some issues to deal with.

  1. There’s too much data here (for what we’ve got in mind) – we’re only interested in the school-wide demographics for now. So we’ve got to cut all the rows where grade is not ALL GRADES (we know how to do this already).
  2. Several columns that are clearly numeric are currently stored as character: count_female, pct_female, count_male, pct_male. This is because some data has been suppressed, probably for privacy reasons when the sample size is too small (e.g., at Philadelphia Virtual Academy, it appears there are only 46 students, which would mean roughly 23 students of each gender – it’s common to hide data that comes from such small samples for privacy). We could have used the na argument in read_excel (setting it to "s" instead of the default ""), but cleaning up ex post allows us to show some new tricks!

#subsetting
school_gender <- school_gender[grade == "ALL GRADES"]

#forcing numeric -- note the warning telling us 
#  there is some data loss because R doesn't know
#  how to convert "s" to a number, so it forces it to be NA
school_gender[ , count_female := as.numeric(count_female)]
## Warning: NAs introduced by coercion
#repeat for the other columns
school_gender[ , pct_female := as.numeric(pct_female)]
## Warning: NAs introduced by coercion
school_gender[ , count_male := as.numeric(count_male)]
## Warning: NAs introduced by coercion
school_gender[ , pct_male := as.numeric(pct_male)]
## Warning: NAs introduced by coercion

It’s kind of a pain to run basically the same line of code over and over again, isn’t it?

Don’t worry, R has a simple way to automate this process for us, of course!

We’ll need two more bits of knowledge about data.table before we proceed.

First, there’s another special data.table variable like .N that is useful to know. .SD (short for Subset of the Data), like .N, means one of two things, depending on whether we’re doing a grouping operation with by. If we’re not using by, .SD is just our whole data.table. If we are using by, there is a different .SD for every group (recall the idea of splitting the data.table into one sub-table for each level of the by variable – each of those sub-tables has its own .SD). Watch:

school_gender[ , .SD]
##      school_id                                 school_name      grade
##   1:       101                    John Bartram High School ALL GRADES
##   2:       102               West Philadelphia High School ALL GRADES
##   3:       103                   High School of the Future ALL GRADES
##   4:       105 Paul Robeson High School for Human Services ALL GRADES
##   5:       110                William L. Sayre High School ALL GRADES
##  ---                                                                 
## 214:       842                      Stephen Decatur School ALL GRADES
## 215:       843                     Joseph Greenberg School ALL GRADES
## 216:       844                   William H. Loesche School ALL GRADES
## 217:       856                         The Workshop School ALL GRADES
## 218:       878                Philadelphia Virtual Academy ALL GRADES
##      total_enrolled count_female pct_female count_male pct_male
##   1:            710          297   0.418310        413 0.581690
##   2:            538          242   0.449814        296 0.550186
##   3:            518          243   0.469112        275 0.530888
##   4:            294          169   0.574830        125 0.425170
##   5:            511          231   0.452055        280 0.547945
##  ---                                                           
## 214:           1030          478   0.464078        552 0.535922
## 215:            738          368   0.498645        370 0.501355
## 216:            782          368   0.470588        414 0.529412
## 217:            207           79   0.381643        128 0.618357
## 218:            311          191   0.614148        120 0.385852

.SD is itself a data.table. That means we can use [] on it to treat it just like any other data.table:

#the same as school_gender[1:10]
school_gender[ , .SD[1:10] ]
##     school_id                                 school_name      grade
##  1:       101                    John Bartram High School ALL GRADES
##  2:       102               West Philadelphia High School ALL GRADES
##  3:       103                   High School of the Future ALL GRADES
##  4:       105 Paul Robeson High School for Human Services ALL GRADES
##  5:       110                William L. Sayre High School ALL GRADES
##  6:       113                    William T. Tilden School ALL GRADES
##  7:       119                      Motivation High School ALL GRADES
##  8:       120                           John Barry School ALL GRADES
##  9:       123                    William C. Bryant School ALL GRADES
## 10:       125                  Joseph W. Catharine School ALL GRADES
##     total_enrolled count_female pct_female count_male pct_male
##  1:            710          297   0.418310        413 0.581690
##  2:            538          242   0.449814        296 0.550186
##  3:            518          243   0.469112        275 0.530888
##  4:            294          169   0.574830        125 0.425170
##  5:            511          231   0.452055        280 0.547945
##  6:            470          210   0.446809        260 0.553191
##  7:            344          183   0.531977        161 0.468023
##  8:            757          395   0.521797        362 0.478203
##  9:            478          234   0.489540        244 0.510460
## 10:            619          304   0.491115        315 0.508885
#get the entire first row associated with each school_id
school_gender[ , .SD[1], by = school_id]
##      school_id                                 school_name      grade
##   1:       101                    John Bartram High School ALL GRADES
##   2:       102               West Philadelphia High School ALL GRADES
##   3:       103                   High School of the Future ALL GRADES
##   4:       105 Paul Robeson High School for Human Services ALL GRADES
##   5:       110                William L. Sayre High School ALL GRADES
##  ---                                                                 
## 214:       842                      Stephen Decatur School ALL GRADES
## 215:       843                     Joseph Greenberg School ALL GRADES
## 216:       844                   William H. Loesche School ALL GRADES
## 217:       856                         The Workshop School ALL GRADES
## 218:       878                Philadelphia Virtual Academy ALL GRADES
##      total_enrolled count_female pct_female count_male pct_male
##   1:            710          297   0.418310        413 0.581690
##   2:            538          242   0.449814        296 0.550186
##   3:            518          243   0.469112        275 0.530888
##   4:            294          169   0.574830        125 0.425170
##   5:            511          231   0.452055        280 0.547945
##  ---                                                           
## 214:           1030          478   0.464078        552 0.535922
## 215:            738          368   0.498645        370 0.501355
## 216:            782          368   0.470588        414 0.529412
## 217:            207           79   0.381643        128 0.618357
## 218:            311          191   0.614148        120 0.385852

This is mainly useful in conjunction with another argument we can use within [] with data.table: .SDcols. .SDcols tells R which columns to include in the current operation:

school_gender[ , .SD, .SDcols = c("school_id", "grade", "total_enrolled")]
##      school_id      grade total_enrolled
##   1:       101 ALL GRADES            710
##   2:       102 ALL GRADES            538
##   3:       103 ALL GRADES            518
##   4:       105 ALL GRADES            294
##   5:       110 ALL GRADES            511
##  ---                                    
## 214:       842 ALL GRADES           1030
## 215:       843 ALL GRADES            738
## 216:       844 ALL GRADES            782
## 217:       856 ALL GRADES            207
## 218:       878 ALL GRADES            311

This is exactly what we need for our as.numeric problem, where we want to apply the same function to many columns. Recall from earlier that the *apply functions are our friends when it comes to doing repetitive things. When we do lapply(DT, sum), we apply sum to every column of DT. For as.numeric, it’s much the same:

#Troublesome. We don't want to convert
#  most of the columns, and we go too far here
school_gender[ , lapply(.SD, as.numeric)]
## Warning in lapply(.SD, as.numeric): NAs introduced by coercion

## Warning in lapply(.SD, as.numeric): NAs introduced by coercion
##      school_id school_name grade total_enrolled count_female pct_female
##   1:       101          NA    NA            710          297   0.418310
##   2:       102          NA    NA            538          242   0.449814
##   3:       103          NA    NA            518          243   0.469112
##   4:       105          NA    NA            294          169   0.574830
##   5:       110          NA    NA            511          231   0.452055
##  ---                                                                   
## 214:       842          NA    NA           1030          478   0.464078
## 215:       843          NA    NA            738          368   0.498645
## 216:       844          NA    NA            782          368   0.470588
## 217:       856          NA    NA            207           79   0.381643
## 218:       878          NA    NA            311          191   0.614148
##      count_male pct_male
##   1:        413 0.581690
##   2:        296 0.550186
##   3:        275 0.530888
##   4:        125 0.425170
##   5:        280 0.547945
##  ---                    
## 214:        552 0.535922
## 215:        370 0.501355
## 216:        414 0.529412
## 217:        128 0.618357
## 218:        120 0.385852
#Use .SDcols to control which columns we affect
school_gender[ , lapply(.SD, as.numeric), 
               .SDcols = c("count_female", "pct_female", 
                           "count_male", "pct_male")]
##      count_female pct_female count_male pct_male
##   1:          297   0.418310        413 0.581690
##   2:          242   0.449814        296 0.550186
##   3:          243   0.469112        275 0.530888
##   4:          169   0.574830        125 0.425170
##   5:          231   0.452055        280 0.547945
##  ---                                            
## 214:          478   0.464078        552 0.535922
## 215:          368   0.498645        370 0.501355
## 216:          368   0.470588        414 0.529412
## 217:           79   0.381643        128 0.618357
## 218:          191   0.614148        120 0.385852
#Great, we've converted the columns to numeric, right?
#  No, not quite. We have done the calculation,
#  but we haven't _assigned_ the output to anything.
#  To do that, we need := again.

num_cols <- c("count_female", "pct_female",
              "count_male", "pct_male")
#**NOTE** we have to surround num_cols on the left with
#  parentheses so that data.table knows we're not
#  trying to create a column called num_cols.
school_gender[ , (num_cols) := lapply(.SD, as.numeric),
               .SDcols = num_cols]
print(school_gender, class = TRUE)
##      school_id                                 school_name      grade
##          <num>                                      <char>     <char>
##   1:       101                    John Bartram High School ALL GRADES
##   2:       102               West Philadelphia High School ALL GRADES
##   3:       103                   High School of the Future ALL GRADES
##   4:       105 Paul Robeson High School for Human Services ALL GRADES
##  ---                                                                 
## 214:       842                      Stephen Decatur School ALL GRADES
## 215:       843                     Joseph Greenberg School ALL GRADES
## 216:       844                   William H. Loesche School ALL GRADES
## 217:       856                         The Workshop School ALL GRADES
## 218:       878                Philadelphia Virtual Academy ALL GRADES
##      total_enrolled count_female pct_female count_male pct_male
##               <num>        <num>      <num>      <num>    <num>
##   1:            710          297   0.418310        413 0.581690
##   2:            538          242   0.449814        296 0.550186
##   3:            518          243   0.469112        275 0.530888
##   4:            294          169   0.574830        125 0.425170
##  ---                                                           
## 214:           1030          478   0.464078        552 0.535922
## 215:            738          368   0.498645        370 0.501355
## 216:            782          368   0.470588        414 0.529412
## 217:            207           79   0.381643        128 0.618357
## 218:            311          191   0.614148        120 0.385852

3.3.1.3 Merging, and Repeating

Now, we can add this data to our teacher data file from earlier. How? A merge of course!

First, unfortunately, the school codes in school_gender are missing the trailing 0 that appears in teachers’ home_organization column, so we have to fix this before we can merge.

school_gender[ , school_id := paste0(school_id, "0")]

teachers[school_gender, 
         school_pct_male := i.pct_male, 
         on = c(home_organization = "school_id")]
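The pattern teachers[school_gender, new_col := i.col, on = ...] is data.table’s “update join”: for each row of teachers, find the matching row of school_gender (pairing home_organization with school_id) and copy the matched column across – the i. prefix refers to columns of the table inside the []. A minimal sketch with made-up tables:

```r
library(data.table)

# toy tables, invented purely for illustration
staff   <- data.table(org = c("A", "B", "A"), name = c("x", "y", "z"))
schools <- data.table(id  = c("A", "B"), pct = c(0.4, 0.6))

# for each staff row, look up the schools row with id == org
#   and copy its pct over as a new column
staff[schools, org_pct := i.pct, on = c(org = "id")]
staff$org_pct
## [1] 0.4 0.6 0.4
```

Note that staff is modified by reference – no `<-` assignment is needed.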

The school demographics Excel file also contains data on ELL status, IEP status, ethnicity, and economic disadvantage. The approach to adding this is basically the same, so let’s do it all at once.

#For conciseness, we'll do all of the data manipulation we did
#  in addition to reading in the gender sheet all at once here
school_ell <- 
  setDT(read_excel("./data/2015-2016 Enr Dem (Suppressed).xls", 
                   sheet = "ELL", skip = 6,
                   col_names = c("", "school_id", "school_name", "grade",
                                 "total_enrolled", "count_ell", "pct_ell", 
                                 "count_non_ell", "pct_non_ell"))
        )[grade == "ALL GRADES"
          #we're only going to keep pct_ell, so only convert that;
          #  also convert school code
          ][ , c("school_id", "pct_ell") := 
               .(paste0(school_id, "0"), as.numeric(pct_ell))]
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning in read_xls(path, sheet, col_names, col_types, na, skip): Found
## blank columns: 1
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning: NAs introduced by coercion
#note: ELL status is commonly missing
school_ell[is.na(pct_ell), .N]
## [1] 104
#merge
teachers[school_ell, 
         school_pct_ell := i.pct_ell,
         on = c(home_organization = "school_id")]

##NOW: IEP
## Note: since all we're doing is selecting a column, we don't
##   even have to store the Excel sheet as a new data set --
##   we can pass the intermediate result directly to the
##   merge with teachers, like so:
teachers[setDT(read_excel(
  "./data/2015-2016 Enr Dem (Suppressed).xls", 
  sheet = "IEP", skip = 6,
  col_names = c("", "school_id", "school_name", "grade",
                "total_enrolled", "count_iep", "pct_iep", 
                "count_non_iep", "pct_non_iep"))
        )[grade == "ALL GRADES"
          ][ , c("school_id", "pct_iep") := 
               .(paste0(school_id, "0"), as.numeric(pct_iep))],
  school_pct_iep := i.pct_iep, 
  on = c(home_organization = "school_id")]
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning in read_xls(path, sheet, col_names, col_types, na, skip): Found
## blank columns: 1
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning: NAs introduced by coercion
##NOW: Ethnicity
##  Ethnicity is not binary, so things will be a little bit more involved.
##  we'll use the `:=`() form of assignment here. Basically, we can
##  treat := like a function (surrounded by backticks ``), since, 
##  technically, it IS a function
teachers[setDT(read_excel(
  "./data/2015-2016 Enr Dem (Suppressed).xls", 
  sheet = "Ethnicity", skip = 6,
  col_names = c("", "school_id", "school_name", 
                "grade", "total_enrolled", 
                "count_amerindian", "pct_amerindian", 
                "count_asian", "pct_asian", 
                "count_black", "pct_black",
                "count_hispanic", "pct_hispanic", 
                "count_multi", "pct_multi", 
                "count_pacific", "pct_pacific",
                "count_white", "pct_white"))
        )[grade == "ALL GRADES"
          ][ , c("school_id",
                 paste0("pct_", c("asian", "black", 
                                  "hispanic", "white"))) := 
               #we could do this with lapply, but it looks ugly...
               .(paste0(school_id, "0"), as.numeric(pct_asian),
                 as.numeric(pct_black), as.numeric(pct_hispanic),
                 as.numeric(pct_white))],
  `:=`(school_pct_asian = i.pct_asian,
       school_pct_black = i.pct_black,
       school_pct_hispanic = i.pct_hispanic,
       school_pct_white = i.pct_white),
  on = c(home_organization = "school_id")]
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning in read_xls(path, sheet, col_names, col_types, na, skip): Found
## blank columns: 1
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion
#Whew, what a mess! If you prefer, of course, you
#  can always do this step by step.

#Lastly, economically disadvantaged.
#  this sheet is formatted a bit differently,
#  so our merge must adapt accordingly.
#  In particular, there are more blank columns.
teachers[setDT(read_excel(
  "./data/2015-2016 Enr Dem (Suppressed).xls", 
  sheet = "Econ Disadv", skip = 6,
  col_names = c("", "school_id", "school_name", 
                "grade", "pct_disadv", rep("", 4)))
        )[ , school_id := paste0(school_id, "0")],
  school_pct_disadv := 100 * i.pct_disadv,
  on = c(home_organization = "school_id")]
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
## Warning in read_xls(path, sheet, col_names, col_types, na, skip): Found
## blank columns: 1,6,7,8,9
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 03 00 00 00 00 00 00 0d 3b 01 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 05 00 00 00 00 00 00 0d 3b 03 00 05 00 05 00 01 00 12 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 02 00 00 00 00 00 00 0d 3b 00 00 05 00 05 00 01 00 08 00 
## DEFINEDNAME: 21 00 00 01 0b 00 00 00 04 00 00 00 00 00 00 0d 3b 02 00 05 00 05 00 01 00 08 00
teachers
##          last_name first_name pay_rate_type pay_rate title_description
##    1:       ABBOTT      JOYCE      SALARIED    76461 TEACHER,FULL TIME
##    2:     ABDALLAH JUWAYRIYAH      SALARIED    46694 TEACHER,FULL TIME
##    3:  ABDEL-JALIL    GHADEER      SALARIED    45359 TEACHER,FULL TIME
##    4:  ABDUL BASIT    BARBARA      SALARIED    67706 TEACHER,FULL TIME
##    5: ABDUL-LATEEF     VILLIA      SALARIED    48945 TEACHER,FULL TIME
##   ---                                                                 
## 6648:      ZUNGOLO    WILLIAM      SALARIED    90051 TEACHER,FULL TIME
## 6649:        ZWICK      KEVIN      SALARIED    46694 TEACHER,FULL TIME
## 6650:       ZWOLAK       PAUL      SALARIED    54364 TEACHER,FULL TIME
## 6651:    ZYCHOWICZ       LISA      SALARIED    67706 TEACHER,FULL TIME
## 6652: ZYGMUND-FELT      ALLYN      SALARIED    90051 TEACHER,FULL TIME
##       home_organization home_organization_description organization_level
##    1:              1290       HAMILTON, ANDREW SCHOOL  ELEMENTARY SCHOOL
##    2:              1470           LOCKE, ALAIN SCHOOL  ELEMENTARY SCHOOL
##    3:              7440         TAYLOR, BAYARD SCHOOL  ELEMENTARY SCHOOL
##    4:              7530         ROWEN, WILLIAM SCHOOL  ELEMENTARY SCHOOL
##    5:              1010     BARTRAM, JOHN HIGH SCHOOL        HIGH SCHOOL
##   ---                                                                   
## 6648:              8390     FITZPATRICK, A. L. SCHOOL  ELEMENTARY SCHOOL
## 6649:              8300                MAYFAIR SCHOOL  ELEMENTARY SCHOOL
## 6650:              2290      FRANKLIN LEARNING CENTER        HIGH SCHOOL
## 6651:              8420       DECATUR, STEPHEN SCHOOL  ELEMENTARY SCHOOL
## 6652:              7100  COOKE, JAY ELEMENTARY SCHOOL  ELEMENTARY SCHOOL
##       type_of_representation gender run_date  male school_pct_male
##    1:            PFT-TEACHER      F 4/1/2016 FALSE        0.518072
##    2:            PFT-TEACHER      F 4/1/2016 FALSE        0.502146
##    3:            PFT-TEACHER      F 4/1/2016 FALSE        0.464029
##    4:            PFT-TEACHER      F 4/1/2016 FALSE        0.502868
##    5:            PFT-TEACHER      F 4/1/2016 FALSE        0.581690
##   ---                                                             
## 6648:            PFT-TEACHER      M 4/1/2016  TRUE        0.522134
## 6649:            PFT-TEACHER      M 4/1/2016  TRUE        0.523733
## 6650:            PFT-TEACHER      M 4/1/2016  TRUE        0.403207
## 6651:            PFT-TEACHER      F 4/1/2016 FALSE        0.535922
## 6652:            PFT-TEACHER      F 4/1/2016 FALSE        0.516899
##       school_pct_ell school_pct_iep school_pct_asian school_pct_black
##    1:             NA       0.185886               NA         0.893287
##    2:       0.077253       0.111588         0.053648         0.796137
##    3:       0.210432       0.088129               NA         0.176259
##    4:             NA       0.124283               NA         0.927342
##    5:       0.067606       0.263380               NA         0.929577
##   ---                                                                
## 6648:       0.061294       0.145289         0.044268         0.187287
## 6649:       0.140788       0.098150         0.211585         0.132743
## 6650:       0.159221       0.061856         0.114548         0.505155
## 6651:       0.054369       0.127184         0.053398         0.153398
## 6652:       0.091451       0.121272               NA         0.819085
##       school_pct_hispanic school_pct_white school_pct_disadv
##    1:                  NA               NA         100.00000
##    2:                  NA               NA         100.00000
##    3:            0.726619               NA         100.00000
##    4:                  NA               NA         100.00000
##    5:                  NA               NA         100.00000
##   ---                                                       
## 6648:            0.107832         0.547106          77.72985
## 6649:            0.209171         0.348351          91.26307
## 6650:            0.230241         0.119129          89.62199
## 6651:            0.093204         0.577670          65.70874
## 6652:            0.083499               NA         100.00000

3.4 Regressions

OK, enough data manipulation for now. Let’s get to some actual analysis, for cryin’ out loud!

We of course don’t have any experimental or quasi-experimental data here, so we’ll only be able to tease out correlations from our data. But we do have enough data to illustrate the basics of typical regression techniques, so here we go!

3.4.1 OLS

Most linear regressions in R are handled by the lm function (stands for linear model).

Let’s run a very simple regression – the most basic test of the gender pay gap. Like any of the other typical functions we’ve been using, we run lm in j of our data.table.

##no frills regression
teachers[ , lm(pay_rate ~ gender)]
## 
## Call:
## lm(formula = pay_rate ~ gender)
## 
## Coefficients:
## (Intercept)      genderM  
##     68624.9       -935.3

The basic output of lm tells us 1) the regression we ran and 2) the point estimates of the coefficients. This is nice, but I find it rather uninformative – crucially, the standard errors are missing. But fear not! We’ve only scratched the surface of what lm does.

#now we assign the output of lm so
#  we can explore it in a bit more detail
gender_reg <- teachers[ , lm(pay_rate ~ gender)]
str(gender_reg)
## List of 13
##  $ coefficients : Named num [1:2] 68625 -935
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "genderM"
##  $ residuals    : Named num [1:6652] 7836 -21931 -23266 -919 -19680 ...
##   ..- attr(*, "names")= chr [1:6652] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:6652] -5576090 -34041 -23198 -851 -19612 ...
##   ..- attr(*, "names")= chr [1:6652] "(Intercept)" "genderM" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:6652] 68625 68625 68625 68625 68625 ...
##   ..- attr(*, "names")= chr [1:6652] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:6652, 1:2] -81.5598 0.0123 0.0123 0.0123 0.0123 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:6652] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "genderM"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   .. ..- attr(*, "contrasts")=List of 1
##   .. .. ..$ gender: chr "contr.treatment"
##   ..$ qraux: num [1:2] 1.01 1.01
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 6650
##  $ contrasts    :List of 1
##   ..$ gender: chr "contr.treatment"
##  $ xlevels      :List of 1
##   ..$ gender: chr [1:2] "F" "M"
##  $ call         : language lm(formula = pay_rate ~ gender)
##  $ terms        :Classes 'terms', 'formula' length 3 pay_rate ~ gender
##   .. ..- attr(*, "variables")= language list(pay_rate, gender)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "pay_rate" "gender"
##   .. .. .. ..$ : chr "gender"
##   .. ..- attr(*, "term.labels")= chr "gender"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: 0x4549aa8> 
##   .. ..- attr(*, "predvars")= language list(pay_rate, gender)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "character"
##   .. .. ..- attr(*, "names")= chr [1:2] "pay_rate" "gender"
##  $ model        :'data.frame':   6652 obs. of  2 variables:
##   ..$ pay_rate: int [1:6652] 76461 46694 45359 67706 48945 49615 65121 83382 75850 59532 ...
##   ..$ gender  : chr [1:6652] "F" "F" "F" "F" ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 pay_rate ~ gender
##   .. .. ..- attr(*, "variables")= language list(pay_rate, gender)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "pay_rate" "gender"
##   .. .. .. .. ..$ : chr "gender"
##   .. .. ..- attr(*, "term.labels")= chr "gender"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: 0x4549aa8> 
##   .. .. ..- attr(*, "predvars")= language list(pay_rate, gender)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "character"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "pay_rate" "gender"
##  - attr(*, "class")= chr "lm"

We can see from str that the output of lm is a list with a long set of components. That’s a bit of information overload; we can home in on some important details by tinkering with our gender_reg object some more:

names(gender_reg)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"

Our gender_reg object is actually pretty complicated! Just by running lm, we’re given back all sorts of essential results about the relationship between pay_rate and gender:

  • coefficients are of course the coefficients of the regression. We can access them with gender_reg$coefficients.
  • residuals are of course the residuals – the actual minus the predicted value for each observation.
  • effects are “the uncorrelated single-degree-of-freedom values obtained by projecting the data onto the successive orthogonal subspaces generated by the QR decomposition during the fitting process.” (I have no idea what this means)
  • rank should be equal to the number of coefficients (including the intercept), unless there is a multicollinearity problem.
  • fitted.values are the y_hat, the predicted pay_rate produced by the model for each observation.
  • assign helps map the matrix of regressors to the formula (I’ve never used this)
  • qr is the QR decomposition matrix associated with the regression (I’ve never used this)
  • df.residual is the residual degrees of freedom (= n - k), where n is the number of observations and k is the number of estimated coefficients (including the intercept)
  • contrasts tells us which contrasts were used (contrasts are the coding schemes applied to the levels of discrete variables)
  • xlevels gives the levels of the factors (discrete variables) that we included – for gender_reg, this is F and M, e.g.
  • call stores the regression we ran as a call to lm
  • terms gives a bunch of information about the variables we included in the regression – what they are, which are factors, which is the intercept, etc.
  • model gives the X matrix (predictor matrix)
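All of these components can be poked at on any fitted model; here’s a minimal sketch using the built-in mtcars data (so it runs without the teachers data):

```r
# fit a small regression on built-in data
fit <- lm(mpg ~ wt, data = mtcars)

fit$coefficients          # intercept and slope on wt
head(fit$residuals)       # actual - fitted, one per car
fit$rank                  # 2: intercept + one regressor
fit$df.residual           # 32 rows - 2 coefficients = 30
```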

It’s wonderful to know that R keeps track of all of this stuff for us, but what about the real meat of our regression – the stars! For this, we’ll use the summary function, which has a method for the output of lm that gives a nice, easily-digested summary of the model we ran:

summary(gender_reg)
## 
## Call:
## lm(formula = pay_rate ~ gender)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68625 -10239   1517   8771  22361 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  68624.9      203.6 337.040   <2e-16 ***
## genderM       -935.3      388.6  -2.407   0.0161 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14140 on 6650 degrees of freedom
## Multiple R-squared:  0.0008702,  Adjusted R-squared:  0.00072 
## F-statistic: 5.792 on 1 and 6650 DF,  p-value: 0.01613

This calls attention to some summary statistics about the residuals (the five-number summary, as well as the residual standard error), the predictive power (multiple R-squared, adjusted R-squared, F statistic), but most importantly, it prints the standard error of each coefficient and the associated two-sided t test results for each coefficient.

Here, we can see that men are paid on average significantly less (at alpha = .05) than are women among Philadelphia teachers, by about $1,000.
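As a sanity check on where those stars come from: the t value printed by summary is just the estimate divided by its standard error, and the p-value is the two-sided tail probability. A quick sketch using the genderM numbers from the output above:

```r
est <- -935.3   # genderM estimate from the summary above
se  <- 388.6    # its standard error

tval <- est / se
tval                             # about -2.407, matching the t value column
2 * pt(-abs(tval), df = 6650)    # two-sided p-value, about 0.016
```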

We spent some time going over the details, so let’s summarize quickly with a comparison to Stata. Here’s the boiled-down version of getting a regression summary in R:

teachers[ , summary(lm(pay_rate ~ gender))]
## 
## Call:
## lm(formula = pay_rate ~ gender)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68625 -10239   1517   8771  22361 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  68624.9      203.6 337.040   <2e-16 ***
## genderM       -935.3      388.6  -2.407   0.0161 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14140 on 6650 degrees of freedom
## Multiple R-squared:  0.0008702,  Adjusted R-squared:  0.00072 
## F-statistic: 5.792 on 1 and 6650 DF,  p-value: 0.01613

In Stata, to do what we just did, one would simply run reg pay_rate gender. Hard to get simpler than that! My sales pitch in favor of R is only that it’s not that much more complicated, and as we saw above, R gives us at-hand access to a slew of summary statistics about the regression that are a bit less transparent to access in Stata.

3.4.1.1 Other covariates

It’s not a stretch to expand our regression to include other covariates to control for their influence. Ideal covariates here would be experience and certification, but sadly we’re not so lucky. We do, however, have a bunch of other school-level covariates we can include. Let’s try a few:

#skipping right to the summary
teachers[ , summary(lm(pay_rate ~ gender + school_pct_male + school_pct_black))]
## 
## Call:
## lm(formula = pay_rate ~ gender + school_pct_male + school_pct_black)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69929 -10025   1259   9614  24684 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       66136.6     1904.8  34.722  < 2e-16 ***
## genderM            -513.0      397.8  -1.290   0.1972    
## school_pct_male    8317.6     3626.9   2.293   0.0219 *  
## school_pct_black  -4014.2      588.1  -6.826 9.56e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14060 on 6312 degrees of freedom
##   (336 observations deleted due to missingness)
## Multiple R-squared:  0.008304,   Adjusted R-squared:  0.007833 
## F-statistic: 17.62 on 3 and 6312 DF,  p-value: 2.188e-11

Note that we lost a lot of observations because school_pct_male and school_pct_black are frequently missing. This was included in the summary output: (336 observations deleted due to missingness).

How do we interpret the results of this regression? The most naive possible conclusion is that mostly-black schools pay their teachers less – i.e., that there’s somehow a causal relationship between student makeup and the wages offered to teachers.

But we know that the teachers’ contract is fixed district-wide, so each school has no power whatsoever (well, not quite…) over what its teachers are paid. Instead, we’re witnessing the direct result of selection bias. We can get into this more later when we explore plotting.

3.4.1.2 Interactions

Stata handles interactions with the # operator (interaction only) and ## (main effects plus the interaction). In R, : gives the interaction alone, while * expands to main effects plus the interaction – like Stata’s ##. So if we wanted to explore the interaction between gender and school_pct_white (i.e., to allow the slope of school_pct_white to differ by gender), we’d simply run:

#In Stata: reg pay_rate i.gender##c.school_pct_white
teachers[ , summary(lm(pay_rate ~ gender*school_pct_white))]
## 
## Call:
## lm(formula = pay_rate ~ gender * school_pct_white)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -72526  -9521   3577   7368  21002 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               68943.1      453.4 152.044   <2e-16 ***
## genderM                     770.6      880.4   0.875   0.3814    
## school_pct_white           4297.2     1515.3   2.836   0.0046 ** 
## genderM:school_pct_white  -3852.0     3205.6  -1.202   0.2296    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13740 on 3330 degrees of freedom
##   (3318 observations deleted due to missingness)
## Multiple R-squared:  0.002439,   Adjusted R-squared:  0.001541 
## F-statistic: 2.714 on 3 and 3330 DF,  p-value: 0.04332

The genderM:school_pct_white output is the interaction term – the difference in the slope of the linear relationship between salary and school whiteness between female and male teachers. Specifically, the expected pay change for a female teacher from a one-percentage-point increase in the percentage of white students at her school is about $43 (we divide the coefficient by 100 since school_pct_white is measured as a proportion, not a percentage); for men, the expected pay change is about $43 - $39 = $4.

The interaction term is not significant, meaning we have no evidence that the relationship between school whiteness and pay differs by teacher gender.
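To see the formula expansion concretely, here’s a sketch on the built-in mtcars data: a*b is shorthand for a + b + a:b, so the two specifications below fit the identical model:

```r
# '*' expands to main effects plus the interaction...
f1 <- lm(mpg ~ factor(am) * wt, data = mtcars)
# ...so spelling it out with ':' gives the same fit
f2 <- lm(mpg ~ factor(am) + wt + factor(am):wt, data = mtcars)

names(coef(f1))               # intercept, am dummy, wt, and the interaction
all.equal(coef(f1), coef(f2)) # TRUE
```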

3.4.1.3 Functions of variables

One of my big pet peeves in Stata is the inability to run something like reg y log(x) – we’re forced to first create a logx variable with gen logx = log(x). Now we’ve got two objects in memory encoding basically the same information! A waste of memory, in addition to being a pain to remember to do this all the time.

R does not suffer from this inanity. We can simply use functions within the formula specification:

#Let's try!
teachers[ , summary(lm(log(pay_rate) ~ gender))]
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): NA/NaN/Inf in 'y'
#Whoops. Turns out we still have some teachers
#  who are stored as having pay rate 0:
teachers[pay_rate == 0, .N]
## [1] 31
#So we need to exclude them because log(0) is -Inf:
teachers[pay_rate>0, summary(lm(log(pay_rate) ~ gender))]
## 
## Call:
## lm(formula = log(pay_rate) ~ gender)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76603 -0.12744  0.04124  0.14091  0.30451 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 11.121710   0.002995 3713.032  < 2e-16 ***
## genderM     -0.018088   0.005708   -3.169  0.00154 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2075 on 6619 degrees of freedom
## Multiple R-squared:  0.001515,   Adjusted R-squared:  0.001364 
## F-statistic: 10.04 on 1 and 6619 DF,  p-value: 0.001538

It’s a bit harder to interpret the coefficients in a regression with logs and discrete covariates, but the rule of thumb is still that the coefficients are roughly percentage differences. So men are paid about 2% less on average in Philly.
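The “coefficients are percentages” rule is an approximation that works well for small coefficients; the exact percent difference is exp(b) - 1. A quick check with the genderM coefficient above:

```r
b <- -0.018088       # genderM coefficient from the log regression above

b * 100              # approximate percent difference: about -1.8%
(exp(b) - 1) * 100   # exact percent difference: about -1.79%
```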

3.4.1.4 More functions of variables

What if we wanted to control for the combined percentage of black and Hispanic students at the school?

In Stata, we’d have to do something like gen black_hisp = school_pct_black + school_pct_hispanic. And from what we’ve seen so far, lm(pay_rate ~ school_pct_black + school_pct_hispanic) will give us separate coefficients for the black and Hispanic percentages.

To get variables like this (most commonly, involving simple arithmetic operations like + or *), there’s the special I function for regression formulas. Basically, we surround some operation done on our variables with I(), and whatever results from that will be considered as a single covariate. Watch:

teachers[ , summary(lm(pay_rate ~ gender + 
                         I(school_pct_black + school_pct_hispanic)))]
## 
## Call:
## lm(formula = pay_rate ~ gender + I(school_pct_black + school_pct_hispanic))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71223  -9740   1986   9112  23446 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## (Intercept)                                73646.2      602.5 122.243
## genderM                                     -278.9      458.3  -0.609
## I(school_pct_black + school_pct_hispanic)  -6876.7      859.7  -7.999
##                                           Pr(>|t|)    
## (Intercept)                                < 2e-16 ***
## genderM                                      0.543    
## I(school_pct_black + school_pct_hispanic) 1.59e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13770 on 4509 degrees of freedom
##   (2140 observations deleted due to missingness)
## Multiple R-squared:  0.01422,    Adjusted R-squared:  0.01378 
## F-statistic: 32.52 on 2 and 4509 DF,  p-value: 9.465e-15
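To see the difference I() makes, here’s a self-contained sketch with simulated data (not the teachers data): without I(), the formula gives two separate slopes; with it, one combined slope.

```r
set.seed(1)
d <- data.frame(x1 = runif(50), x2 = runif(50))
# y depends on x1 and x2 only through their sum
d$y <- 2 * (d$x1 + d$x2) + rnorm(50, sd = 0.1)

coef(lm(y ~ x1 + x2, data = d))      # two separate slopes, both near 2
coef(lm(y ~ I(x1 + x2), data = d))   # one combined slope, near 2
```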

3.4.2 Predictions from a model

One of the most common things to do with a fitted regression is to use it for in- or out-of-sample predictions. For this, R has the predict function, which takes as its first argument the result of a model (e.g., lm), and as its second argument a new data.table.

Any prediction is model dependent; we’ll compare the predictions that come out of a few different relevant specifications.

#Model 1: Simplest linear specification
reg1 <- 
  teachers[ , lm(pay_rate ~ gender + school_pct_male +
                   school_pct_black + school_pct_disadv)]

#Model 2: Some interactions
reg2 <- 
    teachers[ , lm(pay_rate ~ gender*school_pct_disadv + 
                     school_pct_male + school_pct_black)]

#Model 3: Full interactions
reg3 <- 
    teachers[ , lm(pay_rate ~ gender*school_pct_disadv* 
                     school_pct_male*school_pct_black)]

Let’s try to predict the wages of a male teacher at a school with 70% male students, 40% black students, and 80% economically disadvantaged students using each model.

#declare new data to use for prediction
predict_table <- data.table(gender = "M", school_pct_male = .7,
                            school_pct_black = .4,
                            school_pct_disadv = .8)

#prediction from Model 1
predict(reg1, predict_table)
##        1 
## 77028.89
#prediction from Model 2
predict(reg2, predict_table)
##       1 
## 76675.7
#prediction from Model 3
predict(reg3, predict_table)
##       1 
## 73401.6

3.4.2.1 Many predictions

This will be more useful when we get to plotting, but often in more complicated models we want to trace out the marginal effect of a variable on our prediction, holding other regressors constant – is the relationship monotonic? Increasing?

For this, we simply provide more than one row in the predict_table that we supply, e.g.:

#explore the relationship between % male students
#  and teacher pay
predict_table <- 
  data.table(gender = "M", 
             school_pct_male =
               #only use 10 points for now for
               #  simplicity since we're
               #  not plotting
               seq(.5, 1, length.out = 10),
             school_pct_black = .4,
             school_pct_disadv = .8)

#Since all of these models are linear, 
#  we know for sure the relationship
#  is monotonic, and we can predict
#  the change from step to step because
#  we know the coefficient on school_pct_male
predict(reg1, predict_table)
##        1        2        3        4        5        6        7        8 
## 74441.11 75159.94 75878.77 76597.59 77316.42 78035.24 78754.07 79472.90 
##        9       10 
## 80191.72 80910.55

3.4.2.2 Prediction intervals

Point estimates are rarely the only thing we care about – as important are reasonable ranges in which our predictions might fall. We may predict wages of $75,000, but it would hardly call our model into question if we found actual wages of $75,001 for that person. Prediction intervals are a way of coupling our predictions with a range of estimates that are consistent with our model.

To this end, predict also has a few other arguments we can use to produce prediction intervals, e.g., interval and level, which tell the type of interval we want and the “alpha” to use, respectively:

predict(reg1, predict_table,
        interval = "prediction", level = .95)
##         fit      lwr      upr
## 1  74441.11 46800.92 102081.3
## 2  75159.94 47509.83 102810.0
## 3  75878.77 48212.76 103544.8
## 4  76597.59 48909.71 104285.5
## 5  77316.42 49600.70 105032.1
## 6  78035.24 50285.75 105784.7
## 7  78754.07 50964.87 106543.3
## 8  79472.90 51638.10 107307.7
## 9  80191.72 52305.46 108078.0
## 10 80910.55 52966.98 108854.1

That the intervals are so wide shows just how poor our chosen covariates are at predicting teachers’ wages.
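A self-contained sketch with mtcars, contrasting interval = "prediction" (a range for a new individual observation) with interval = "confidence" (a range for the conditional mean, which is much narrower):

```r
fit  <- lm(mpg ~ wt, data = mtcars)
newd <- data.frame(wt = c(2.5, 3.5))

# range in which a single new car's mpg should fall
predict(fit, newd, interval = "prediction", level = .95)
# narrower range for the *average* mpg at these weights
predict(fit, newd, interval = "confidence", level = .95)
```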

3.4.3 Probit/Logit

The next most common form of regression analysis is logit/probit, for when we’re interested in understanding predictors of a binary dependent variable. I’ve never heard a compelling empirical reason for using one over the other; logistic is far more common, probably because it provides odds ratios, which are infinitely more interpretable than are the plain coefficients of either logit or probit models. But the difference in R between running logit and probit is trivial, so we’ll cover both here.

To avoid simply re-hashing what we did in the previous section and inducing a room more rife with glaze than Krispy Kreme, let’s focus on a new variable for our left-hand side.

Noticing that most schools are 100% economically disadvantaged, I wonder, are there certain characteristics which help predict that a teacher is at a school with less than full “poverty”?

We’ll define “less than full poverty” as school_pct_disadv < 100, which is true for teachers[school_pct_disadv < 100, .N] = 1666 teachers (and unique(teachers, by = "home_organization")[school_pct_disadv < 100, .N] = 52 schools).

The workhorse of binary-dependent-variable models in R is glm. This stands for generalized linear models. Generalized linear models subsume OLS and incorporate several regression models which can be related to OLS by a “link function”, including logit, probit and Poisson regression. See Wikipedia for more.

glm works basically like lm – its first argument is specified in exactly the same way (specifically, with an object of class formula), but we also have to include a second argument – family. This is basically specifying to glm the type of link function that you’ve got in mind for your regression. There is a multitude of options – Gamma, inverse.gaussian, poisson, quasipoisson, etc. (see ?family), but the most common and important is binomial.

Technically, what we pass to the family argument is a function – meaning most importantly that it itself accepts arguments! Specifically, the binomial function accepts one important argument, link. By default, this is "logit", meaning that the default for family = binomial is to run a logistic regression. There are two other options – link = "probit" (probit regression) and link = "cauchit" (which uses a Cauchy CDF, if you know what that is).
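Because binomial is a function call, we can inspect what it returns before handing it to glm; a quick sketch:

```r
binomial()$link                  # "logit" -- the default
binomial(link = "probit")$link   # "probit"
```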

Let’s see this in action and run our first logit and probits:

#We could define our own "poverty" variable,
#  but instead we'll illustrate again another
#  way to use the I function in formulas
logit <- teachers[ , glm(I(school_pct_disadv < 100) ~ 
                           pay_rate + gender + school_pct_male, 
                         family = binomial)]

summary(logit)
## 
## Call:
## glm(formula = I(school_pct_disadv < 100) ~ pay_rate + gender + 
##     school_pct_male, family = binomial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5236  -0.7773  -0.6549   0.5844   2.3622  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.072e+00  3.590e-01   8.556  < 2e-16 ***
## pay_rate         1.922e-05  2.240e-06   8.579  < 2e-16 ***
## genderM          3.649e-01  6.554e-02   5.568 2.58e-08 ***
## school_pct_male -1.080e+01  6.429e-01 -16.795  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7216.1  on 6363  degrees of freedom
## Residual deviance: 6793.8  on 6360  degrees of freedom
##   (288 observations deleted due to missingness)
## AIC: 6801.8
## 
## Number of Fisher Scoring iterations: 4
#run a probit instead by specifying the link
#  argument in the binomial function
probit <- teachers[ , glm(I(school_pct_disadv < 100) ~ 
                            pay_rate + gender + school_pct_male, 
                          family = binomial(link = "probit"))]


summary(probit)                  
## 
## Call:
## glm(formula = I(school_pct_disadv < 100) ~ pay_rate + gender + 
##     school_pct_male, family = binomial(link = "probit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4922  -0.7824  -0.6595   0.5984   2.4259  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      1.809e+00  2.082e-01   8.686  < 2e-16 ***
## pay_rate         1.104e-05  1.287e-06   8.579  < 2e-16 ***
## genderM          2.212e-01  3.866e-02   5.721 1.06e-08 ***
## school_pct_male -6.357e+00  3.705e-01 -17.160  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7216.1  on 6363  degrees of freedom
## Residual deviance: 6800.8  on 6360  degrees of freedom
##   (288 observations deleted due to missingness)
## AIC: 6808.8
## 
## Number of Fisher Scoring iterations: 4

We can’t compare coefficients between the models, except for their signs. The signs and significance both agree (as we would hope), and again seem to be telling us stuff we may have surmised – high-poverty schools have lower-wage teachers, have fewer male teachers, and have more male students.

To compare the models, we can look at the predicted probability of being at a low-poverty school for a typical teacher. We’ll define a typical teacher as having the median pay rate; for gender, we’ll plug in the share of male teachers; and we’ll plug in the average percentage of male students.

I’m unaware of a way to tell predict to plug a fractional value (like the share of male teachers) into a categorical predictor like gender, so we’ll get our predictions by hand. Luckily this isn’t so bad:

#here is the "typical teacher"; we
#  include 1 first to stand for the
#  intercept term
typical <- teachers[!is.na(school_pct_male),
                    c(1, median(pay_rate), mean(gender == "M"),
                      mean(school_pct_male))]

#prediction from the logit model; recall that 
#  the interpretation of logit is that:
#  P[y = 1 | X] = F(X'B)
#  where X are our predictors, F is the link
#  function CDF, and B is the coefficient vector

#In the case of logistic regression, F is the
#  logistic CDF:
plogis(sum(logit$coefficients * typical))
## [1] 0.2446419
#In the case of probit, F is the normal CDF:
pnorm(sum(probit$coefficients * typical))
## [1] 0.2480797
#For comparison, the overall average in the sample is:
teachers[ , mean(school_pct_disadv < 100, na.rm = TRUE)]
## [1] 0.2598253
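The by-hand formula P[y = 1 | X] = F(X'B) can be verified against predict with type = "response", which returns fitted probabilities directly (for concrete covariate values, not fractional factor levels); a sketch with mtcars, using am (transmission type) as the binary outcome:

```r
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# by hand: plug X'B into the logistic CDF
x      <- c(1, 3)   # intercept term, then wt = 3
manual <- plogis(sum(coef(fit) * x))

# the same prediction from predict()
auto <- predict(fit, data.frame(wt = 3), type = "response")

all.equal(manual, unname(auto))  # TRUE
```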

3.5 Reshaping

Let’s read the full demographic file on student gender again. This is just a copy-paste of what we did above.

school_gender <-
  setDT(read_excel("./data/2015-2016 Enr Dem (Suppressed).xls", 
                   sheet = "Gender", skip = 6,
                   col_names = c("", "school_id", "school_name", "grade",
                                 "total_enrolled", "count_female",
                                 "pct_female", "count_male", "pct_male")))
school_gender
##       school_id                  school_name      grade total_enrolled
##    1:       101     John Bartram High School   9.000000            200
##    2:       101     John Bartram High School  10.000000            197
##    3:       101     John Bartram High School  11.000000            172
##    4:       101     John Bartram High School  12.000000            141
##    5:       101     John Bartram High School ALL GRADES            710
##   ---                                                                 
## 1662:       878 Philadelphia Virtual Academy   9.000000             46
## 1663:       878 Philadelphia Virtual Academy  10.000000             69
## 1664:       878 Philadelphia Virtual Academy  11.000000             79
## 1665:       878 Philadelphia Virtual Academy  12.000000             64
## 1666:       878 Philadelphia Virtual Academy ALL GRADES            311
##       count_female pct_female count_male pct_male
##    1:    70.000000   0.350000 130.000000 0.650000
##    2:    86.000000   0.436548 111.000000 0.563452
##    3:    78.000000   0.453488  94.000000 0.546512
##    4:    63.000000   0.446809  78.000000 0.553191
##    5:   297.000000   0.418310 413.000000 0.581690
##   ---                                            
## 1662:            s          s          s        s
## 1663:    43.000000   0.623188  26.000000 0.376812
## 1664:    47.000000   0.594937  32.000000 0.405063
## 1665:    41.000000   0.640625  23.000000 0.359375
## 1666:   191.000000   0.614148 120.000000 0.385852
#Note that grade is pretty poorly formatted.
#  Let's fix that quickly. I'll explain what's
#  going on here later when we go into
#  regular expressions in a bit more depth.
#  Basically, we're deleting the decimal and everything after it.
school_gender[grade != "ALL GRADES", 
              grade := paste0("GRADE_", gsub("\\..*", "", grade))][]
##       school_id                  school_name      grade total_enrolled
##    1:       101     John Bartram High School    GRADE_9            200
##    2:       101     John Bartram High School   GRADE_10            197
##    3:       101     John Bartram High School   GRADE_11            172
##    4:       101     John Bartram High School   GRADE_12            141
##    5:       101     John Bartram High School ALL GRADES            710
##   ---                                                                 
## 1662:       878 Philadelphia Virtual Academy    GRADE_9             46
## 1663:       878 Philadelphia Virtual Academy   GRADE_10             69
## 1664:       878 Philadelphia Virtual Academy   GRADE_11             79
## 1665:       878 Philadelphia Virtual Academy   GRADE_12             64
## 1666:       878 Philadelphia Virtual Academy ALL GRADES            311
##       count_female pct_female count_male pct_male
##    1:    70.000000   0.350000 130.000000 0.650000
##    2:    86.000000   0.436548 111.000000 0.563452
##    3:    78.000000   0.453488  94.000000 0.546512
##    4:    63.000000   0.446809  78.000000 0.553191
##    5:   297.000000   0.418310 413.000000 0.581690
##   ---                                            
## 1662:            s          s          s        s
## 1663:    43.000000   0.623188  26.000000 0.376812
## 1664:    47.000000   0.594937  32.000000 0.405063
## 1665:    41.000000   0.640625  23.000000 0.359375
## 1666:   191.000000   0.614148 120.000000 0.385852
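The regular expression used above is small enough to demystify now: \\. matches a literal period and .* matches everything after it, so gsub replaces both with nothing. A standalone sketch:

```r
# delete the first period and everything after it;
# strings with no period pass through unchanged
gsub("\\..*", "", c("9.000000", "10.000000", "ALL GRADES"))
## [1] "9"          "10"         "ALL GRADES"
```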

Above, we were focused only on school-level demographic data, so we threw away a lot of information.

In certain settings, we may be more focused on grade-level outcomes (if, for example, we knew the grade that each teacher primarily taught, then we might want to associate the grade-level demographics with each teacher).

All of this information is here in the file we’ve got, but it’s in a slightly unfortunate format – there are many rows associated with each school. It’s much easier to merge a data.table where each school is found in one, and only one, row.

3.5.1 Long to Wide: dcast

To accomplish this in Stata we would use reshape wide, like reshape wide stub, i(i) j(j).

In R, we use the dcast function (I believe this means data cast, with the idea that we’re casting data “out and to the side” by going from long to wide). There are three main arguments to dcast: data (the data.table to be reshaped), formula (a specification of how to reshape, akin to the i(i) and j(j) in Stata), and value.var (the equivalent of stub in Stata).

It can be confusing to wrap your head around which arguments go where; here is an illustration to help orient your thinking about a given problem:

Dcast Illustration

Basically, the variables that we want to identify the rows go on the left-hand side of the formula, and those whose values we want spread out as columns go on the right-hand side. value.var names the variable(s) whose values fill in each LHS/RHS cell of the result.
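Here’s a toy example of that formula logic with made-up data (not the school file):

```r
library(data.table)

long <- data.table(id    = c(1, 1, 2, 2),
                   grade = c("G1", "G2", "G1", "G2"),
                   pct   = c(.50, .60, .40, .70))

# id identifies the rows; grade values become columns;
# pct fills in the cells
dcast(long, id ~ grade, value.var = "pct")
```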

Let’s reshape the gender demographics data so that there is only one row associated with each school:

school_gender_wide <- 
  dcast(school_gender, school_id ~ grade, value.var = "pct_male")
school_gender_wide
##      school_id ALL GRADES  GRADE_0  GRADE_1 GRADE_10 GRADE_11 GRADE_12
##   1:       101   0.581690       NA       NA 0.563452 0.546512 0.553191
##   2:       102   0.550186       NA       NA 0.468531 0.549180 0.537736
##   3:       103   0.530888       NA       NA 0.547826 0.480315 0.512195
##   4:       105   0.425170       NA       NA 0.485714 0.426667 0.324324
##   5:       110   0.547945       NA       NA 0.517241 0.537190 0.500000
##  ---                                                                  
## 214:       842   0.535922 0.603604 0.457944       NA       NA       NA
## 215:       843   0.501355 0.515625 0.515152       NA       NA       NA
## 216:       844   0.529412 0.515873 0.551724       NA       NA       NA
## 217:       856   0.618357       NA       NA        s 0.543860        s
## 218:       878   0.385852       NA       NA 0.376812 0.405063 0.359375
##       GRADE_2  GRADE_3  GRADE_4  GRADE_5  GRADE_6  GRADE_7  GRADE_8
##   1:       NA       NA       NA       NA       NA       NA       NA
##   2:       NA       NA       NA       NA       NA       NA       NA
##   3:       NA       NA       NA       NA       NA       NA       NA
##   4:       NA       NA       NA       NA       NA       NA       NA
##   5:       NA       NA       NA       NA       NA       NA       NA
##  ---                                                               
## 214: 0.470149 0.549550 0.500000 0.568000 0.576577 0.459459 0.656863
## 215: 0.556962 0.529412 0.481481 0.486842 0.468354 0.478723 0.481481
## 216: 0.552448 0.537879 0.467213 0.545455       NA       NA       NA
## 217:       NA       NA       NA       NA       NA       NA       NA
## 218:       NA       NA       NA       NA        s        s        s
##       GRADE_9
##   1: 0.650000
##   2: 0.628743
##   3: 0.575163
##   4: 0.466667
##   5: 0.607595
##  ---         
## 214:       NA
## 215:       NA
## 216:       NA
## 217: 0.555556
## 218:        s

We now have all of the grades as columns in the output. Note how much data is missing – this shows that most schools only cover a small range of grades. The grades are also out of order – they’ve been alphabetized, and "GRADE_10" comes before "GRADE_2" alphabetically. We can fix this by using factors; we may get into that later.
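A sketch of the factor fix on toy data (assuming data.table’s dcast, which in my experience orders the resulting columns by the factor’s levels rather than alphabetically):

```r
library(data.table)

long <- data.table(id    = 1,
                   grade = c("GRADE_10", "GRADE_2", "GRADE_9"),
                   pct   = c(.5, .6, .7))

# character grades sort alphabetically: GRADE_10 before GRADE_2
names(dcast(long, id ~ grade, value.var = "pct"))

# an explicit factor ordering fixes the column order
long[ , grade := factor(grade, levels = paste0("GRADE_", c(2, 9, 10)))]
names(dcast(long, id ~ grade, value.var = "pct"))
```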

Note that we’ve lost a lot of information that was in our data originally – the data on student counts, the name of the school, etc. We can keep this sort of thing by using dcast more flexibly:

#keep school name by adding it as an ID
#  variable; since school_name is the same
#  for every school_id, this will not
#  change our output.
dcast(school_gender, school_id + school_name ~ grade, value.var = "pct_male")
##      school_id                                 school_name ALL GRADES
##   1:       101                    John Bartram High School   0.581690
##   2:       102               West Philadelphia High School   0.550186
##   3:       103                   High School of the Future   0.530888
##   4:       105 Paul Robeson High School for Human Services   0.425170
##   5:       110                William L. Sayre High School   0.547945
##  ---                                                                 
## 214:       842                      Stephen Decatur School   0.535922
## 215:       843                     Joseph Greenberg School   0.501355
## 216:       844                   William H. Loesche School   0.529412
## 217:       856                         The Workshop School   0.618357
## 218:       878                Philadelphia Virtual Academy   0.385852
##       GRADE_0  GRADE_1 GRADE_10 GRADE_11 GRADE_12  GRADE_2  GRADE_3
##   1:       NA       NA 0.563452 0.546512 0.553191       NA       NA
##   2:       NA       NA 0.468531 0.549180 0.537736       NA       NA
##   3:       NA       NA 0.547826 0.480315 0.512195       NA       NA
##   4:       NA       NA 0.485714 0.426667 0.324324       NA       NA
##   5:       NA       NA 0.517241 0.537190 0.500000       NA       NA
##  ---                                                               
## 214: 0.603604 0.457944       NA       NA       NA 0.470149 0.549550
## 215: 0.515625 0.515152       NA       NA       NA 0.556962 0.529412
## 216: 0.515873 0.551724       NA       NA       NA 0.552448 0.537879
## 217:       NA       NA        s 0.543860        s       NA       NA
## 218:       NA       NA 0.376812 0.405063 0.359375       NA       NA
##       GRADE_4  GRADE_5  GRADE_6  GRADE_7  GRADE_8  GRADE_9
##   1:       NA       NA       NA       NA       NA 0.650000
##   2:       NA       NA       NA       NA       NA 0.628743
##   3:       NA       NA       NA       NA       NA 0.575163
##   4:       NA       NA       NA       NA       NA 0.466667
##   5:       NA       NA       NA       NA       NA 0.607595
##  ---                                                      
## 214: 0.500000 0.568000 0.576577 0.459459 0.656863       NA
## 215: 0.481481 0.486842 0.468354 0.478723 0.481481       NA
## 216: 0.467213 0.545455       NA       NA       NA       NA
## 217:       NA       NA       NA       NA       NA 0.555556
## 218:       NA       NA        s        s        s        s
#keep the count of male students; for brevity,
#  we'll also exclude a bunch of grades so the
#  output is not as space-consuming
dcast(school_gender[grade %in% paste0("GRADE_", 1:6)],
      school_id ~ grade, value.var = c("pct_male", "count_male"))
##      school_id pct_male_GRADE_1 pct_male_GRADE_2 pct_male_GRADE_3
##   1:       113               NA               NA               NA
##   2:       120         0.450820         0.470000         0.489362
##   3:       123         0.491525         0.571429         0.402985
##   4:       125         0.534483         0.478723         0.509091
##   5:       126         0.439560         0.523256         0.530303
##  ---                                                             
## 164:       841         0.486486         0.554622         0.438776
## 165:       842         0.457944         0.470149         0.549550
## 166:       843         0.515152         0.556962         0.529412
## 167:       844         0.551724         0.552448         0.537879
## 168:       878               NA               NA               NA
##      pct_male_GRADE_4 pct_male_GRADE_5 pct_male_GRADE_6 count_male_GRADE_1
##   1:               NA                s         0.548148                 NA
##   2:         0.425000         0.482353         0.564516          55.000000
##   3:                s         0.615385         0.551020          29.000000
##   4:         0.536082         0.515789               NA          62.000000
##   5:         0.633803         0.588235         0.557692          40.000000
##  ---                                                                      
## 164:         0.560000         0.535714         0.552083          54.000000
## 165:         0.500000         0.568000         0.576577          49.000000
## 166:         0.481481         0.486842         0.468354          51.000000
## 167:         0.467213         0.545455               NA          64.000000
## 168:               NA               NA                s                 NA
##      count_male_GRADE_2 count_male_GRADE_3 count_male_GRADE_4
##   1:                 NA                 NA                 NA
##   2:          47.000000          46.000000          34.000000
##   3:          36.000000          27.000000                  s
##   4:          45.000000          56.000000          52.000000
##   5:          45.000000          35.000000          45.000000
##  ---                                                         
## 164:          66.000000          43.000000          56.000000
## 165:          63.000000          61.000000          59.000000
## 166:          44.000000          45.000000          39.000000
## 167:          79.000000          71.000000          57.000000
## 168:                 NA                 NA                 NA
##      count_male_GRADE_5 count_male_GRADE_6
##   1:                  s          74.000000
##   2:          41.000000          35.000000
##   3:          32.000000          27.000000
##   4:          49.000000                 NA
##   5:          30.000000          29.000000
##  ---                                      
## 164:          60.000000          53.000000
## 165:          71.000000          64.000000
## 166:          37.000000          37.000000
## 167:          78.000000                 NA
## 168:                 NA                  s

3.5.2 Wide to Long: melt

As often as we’re given data in long form that we’d prefer in wide form, we’re presented with data in wide form that we’d prefer in long form. Stata uses parallel terminology for the two operations – reshape wide for the former and reshape long for the latter – but data.table uses different words: dcast for the former and melt for the latter. I picture the data set as made of wax, melting away into a longer form.

melt also takes three main arguments – data (same as before), id.vars (same as the LHS above), and measure.vars (basically the RHS above). measure.vars can be a bit tricky, and there are a few ways to use it. First, we can specify the column names explicitly. Second, we can supply a pattern matched by the names of the columns we’d like. Last, we can use column numbers (this isn’t recommended, so I won’t demonstrate it). Things get slightly more complicated when we want to split the measured columns into more than one new value column.

Let’s try this with our school_gender data. We’ll stack the genders on top of each other and create a single column for student counts, then another for percentages (basically doubling the number of rows in the table).

#First, using explicit column names
melt(school_gender, 
     id.vars = c("school_id", "grade"), 
     measure.vars = c("count_male", "count_female"))
##       school_id      grade     variable      value
##    1:       101    GRADE_9   count_male 130.000000
##    2:       101   GRADE_10   count_male 111.000000
##    3:       101   GRADE_11   count_male  94.000000
##    4:       101   GRADE_12   count_male  78.000000
##    5:       101 ALL GRADES   count_male 413.000000
##   ---                                             
## 3328:       878    GRADE_9 count_female          s
## 3329:       878   GRADE_10 count_female  43.000000
## 3330:       878   GRADE_11 count_female  47.000000
## 3331:       878   GRADE_12 count_female  41.000000
## 3332:       878 ALL GRADES count_female 191.000000
#It's usually more concise to use the patterns
#  convenience "function"; the syntax for this
#  generally involves regular expressions, so
#  I'll hold off on demonstrating the full
#  power of this approach for now. 
melt(school_gender, 
     id.vars = c("school_id", "grade"), 
     measure.vars = patterns("count"))
##       school_id      grade     variable      value
##    1:       101    GRADE_9 count_female  70.000000
##    2:       101   GRADE_10 count_female  86.000000
##    3:       101   GRADE_11 count_female  78.000000
##    4:       101   GRADE_12 count_female  63.000000
##    5:       101 ALL GRADES count_female 297.000000
##   ---                                             
## 3328:       878    GRADE_9   count_male          s
## 3329:       878   GRADE_10   count_male  26.000000
## 3330:       878   GRADE_11   count_male  32.000000
## 3331:       878   GRADE_12   count_male  23.000000
## 3332:       878 ALL GRADES   count_male 120.000000
#A bit more careful about the output to make
#  it a bit more understandable
melt(school_gender, 
     id.vars = c("school_id", "grade"), 
     measure.vars = patterns("count"),
     variable.name = "gender",
     value.name = "count")
##       school_id      grade       gender      count
##    1:       101    GRADE_9 count_female  70.000000
##    2:       101   GRADE_10 count_female  86.000000
##    3:       101   GRADE_11 count_female  78.000000
##    4:       101   GRADE_12 count_female  63.000000
##    5:       101 ALL GRADES count_female 297.000000
##   ---                                             
## 3328:       878    GRADE_9   count_male          s
## 3329:       878   GRADE_10   count_male  26.000000
## 3330:       878   GRADE_11   count_male  32.000000
## 3331:       878   GRADE_12   count_male  23.000000
## 3332:       878 ALL GRADES   count_male 120.000000

We can extend this approach to handle reshaping both count and pct like so:

#Admittedly a bit tough to tell whether variable = 1
#  corresponds to Male or Female
melt(school_gender,
     id.vars = c("school_id", "grade"),
     measure.vars = patterns("count", "pct"))
##       school_id      grade variable     value1   value2
##    1:       101    GRADE_9        1  70.000000 0.350000
##    2:       101   GRADE_10        1  86.000000 0.436548
##    3:       101   GRADE_11        1  78.000000 0.453488
##    4:       101   GRADE_12        1  63.000000 0.446809
##    5:       101 ALL GRADES        1 297.000000 0.418310
##   ---                                                  
## 3328:       878    GRADE_9        2          s        s
## 3329:       878   GRADE_10        2  26.000000 0.376812
## 3330:       878   GRADE_11        2  32.000000 0.405063
## 3331:       878   GRADE_12        2  23.000000 0.359375
## 3332:       878 ALL GRADES        2 120.000000 0.385852
#If we want to use the explicit-names approach
#  from before, we have to keep all associated columns
#  as one item of a list -- here, the first
#  item of the list is for counts, and the second
#  item of the list is for percentages.
#Advantage: we know for sure that count_male
#  and pct_male correspond to variable = 1
melt(school_gender,
     id.vars = c("school_id", "grade"),
     measure.vars = list(c("count_male", "count_female"),
                         c("pct_male", "pct_female")))
##       school_id      grade variable     value1   value2
##    1:       101    GRADE_9        1 130.000000 0.650000
##    2:       101   GRADE_10        1 111.000000 0.563452
##    3:       101   GRADE_11        1  94.000000 0.546512
##    4:       101   GRADE_12        1  78.000000 0.553191
##    5:       101 ALL GRADES        1 413.000000 0.581690
##   ---                                                  
## 3328:       878    GRADE_9        2          s        s
## 3329:       878   GRADE_10        2  43.000000 0.623188
## 3330:       878   GRADE_11        2  47.000000 0.594937
## 3331:       878   GRADE_12        2  41.000000 0.640625
## 3332:       878 ALL GRADES        2 191.000000 0.614148
#We can again specify value.name to facilitate understanding
melt(school_gender,
     id.vars = c("school_id", "grade"),
     measure.vars = list(c("count_male", "count_female"),
                         c("pct_male", "pct_female")),
     value.name = c("count", "pct"))
##       school_id      grade variable      count      pct
##    1:       101    GRADE_9        1 130.000000 0.650000
##    2:       101   GRADE_10        1 111.000000 0.563452
##    3:       101   GRADE_11        1  94.000000 0.546512
##    4:       101   GRADE_12        1  78.000000 0.553191
##    5:       101 ALL GRADES        1 413.000000 0.581690
##   ---                                                  
## 3328:       878    GRADE_9        2          s        s
## 3329:       878   GRADE_10        2  43.000000 0.623188
## 3330:       878   GRADE_11        2  47.000000 0.594937
## 3331:       878   GRADE_12        2  41.000000 0.640625
## 3332:       878 ALL GRADES        2 191.000000 0.614148
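One wrinkle remains: the variable column is just 1 or 2. Since we know item 1 of each list element was the male column and item 2 the female column, we can relabel the factor after the fact. A minimal sketch on a made-up toy table (the toy data and the gender labels here are my own illustration, not the school_gender data):

```r
library(data.table)

#a tiny toy table in the same wide shape as school_gender
dt <- data.table(school_id = 1:2,
                 count_male = c(10, 20), count_female = c(12, 18),
                 pct_male = c(0.45, 0.53), pct_female = c(0.55, 0.47))

long <- melt(dt, id.vars = "school_id",
             measure.vars = list(c("count_male", "count_female"),
                                 c("pct_male", "pct_female")),
             value.name = c("count", "pct"))

#variable is a factor with levels "1" (the *_male columns)
#  and "2" (the *_female columns); relabel it in place
long[ , gender := factor(variable, labels = c("male", "female"))]
long
```

The key is that the ordering of the labels must match the ordering of the columns inside each element of the measure.vars list.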

3.6 String Data / regex

String data is ubiquitous. And it is ubiquitously messy. Typos, transpositions, inconsistent abbreviations, strange character encodings, arbitrary punctuation… there’s no end to the nightmares awaiting a data analyst forced to work with string data.

The most powerful tool that a data analyst can have in their belt to attack such data is a facility with regular expressions.

Basically, regular expressions are a language for describing and manipulating patterns in string data. Have you ever wondered how many words in English fit a certain pattern? Regular expressions are your best friend.

What are all the words in the English language that have the five vowels in order? I used the Regex Dictionary to search for the pattern a.*e.*i.*o.*u (we’ll cover what this means momentarily) and got the following list:

abstemious, adventitious, amentiferous, anemophilous, arenicolous, argentiferous, arsenious, arteriovenous, autoecious, cavernicolous, facetious, garnetiferous, sacrilegious, theater-in-the-round

Just like that, instantly! Regular expressions are basically a programming language in and of themselves. They can be accessed from most other programs – Stata, Excel, SAS, MATLAB, Python, you name it. Here we’ll obviously focus on their use in R.

To introduce some of the most common regex constructs, I’ll run through a bunch of quick examples demonstrating each type of pattern on a single string; then at the end we’ll quickly apply this to some of our school data.

3.6.1 gsub, grep, grepl, strsplit

These four are the workhorse functions for using regular expressions in R. Each serves a distinct purpose, and I’ll try to use all four a couple of different times in this brief overview.

  1. gsub is used for substitution of patterns. Want to delete all numbers? Cut certain strings from names? This is your friend. A quick side note on etymology – the “g” in gsub stands for global. There is a rarely-used alternative to gsub, sub, which will replace only the first match of the supplied pattern… I’ve only ever used it once, just something to keep in mind.
  2. grep (an obscure acronym for global regular expression print, though I only ever say “grep”) has two purposes. By default, it identifies which elements of a character vector match a pattern and returns the corresponding indices. With the argument value = TRUE, it instead returns the matched strings themselves.
  3. grepl (the “l” stands for logical) is used to simply test for existence of the pattern in strings. If the pattern is matched, we get TRUE, otherwise FALSE.
  4. strsplit splits each string into pieces at every match of the pattern, returning a list with one character vector per input string.
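Before diving into patterns proper, here’s a quick taste of all four on a toy vector (the strings are made up for illustration):

```r
x <- c("apple pie", "banana split", "apple turnover")

grep("apple", x)                 #indices of matches: 1 3
grep("apple", x, value = TRUE)   #the matching strings themselves
grepl("apple", x)                #TRUE FALSE TRUE
gsub("apple", "peach", x)        #substitute every match
strsplit(x, " ")                 #split each string at the spaces
```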

3.6.2 Simple, Exact Substring Matching

The most common regular expression is simply an exact sequence of letters. Suppose we wanted to find all the people named Bob. Then we might do:

strings <- c("Bob Dylan", "Johnny Cash", "Bob Marley", "Janis Joplin")

#note the order of arguments!!
#  this is a pet peeve of mine, as I'd expect
#  strings to go first.
grep("Bob", strings, value = TRUE)
## [1] "Bob Dylan"  "Bob Marley"
firsts <- c("Bob", "Johnny", "Bob", "Janis")
lasts <- c("Dylan", "Cash", "Marley", "Joplin")
#We might use grep without value = TRUE to
#  find the last names of all Bobs:
grep("Bob", strings)
## [1] 1 3
lasts[grep("Bob", strings)]
## [1] "Dylan"  "Marley"
mons <- c("January", "February", "March", "April",
          "May", "June", "July", "August",
          "September", "October", "November", "December")

#Which months can we eat oysters?
grep("r", mons)
## [1]  1  2  3  4  9 10 11 12

3.6.3 Beginning/End Anchors: ^ and $

Sometimes we want to be sure we’re matching a pattern from the beginning or end of a string. Regex uses two symbols to represent these: ^ (beginning) and $ (end).

Let’s try it out:

strings <- c("Bob Dylan", "Johnny Cash",
             "Bob Marley", "Janis Joplin",
             "Billy-Bob Thornton")

#Just matching Bob will give us the last
#  string as well:
grep("Bob", strings, value = TRUE)
## [1] "Bob Dylan"          "Bob Marley"         "Billy-Bob Thornton"
#To exclude this, we can force Bob
#  to have come from the beginning
#  of the string:
grep("^Bob", strings, value = TRUE)
## [1] "Bob Dylan"  "Bob Marley"
states <- 
  c("Alabama", "Alaska", "Arizona", "Arkansas",
    "California", "Colorado", "Connecticut",
    "Delaware", "Florida", "Georgia", "Hawaii",
    "Idaho", "Illinois", "Indiana", "Iowa",
    "Kansas", "Kentucky", "Louisiana", "Maine",
    "Maryland", "Massachusetts", "Michigan",
    "Minnesota", "Mississippi", "Missouri",
    "Montana", "Nebraska", "Nevada", "New Hampshire", 
    "New Jersey", "New Mexico", "New York", 
    "North Carolina", "North Dakota", "Ohio", 
    "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", 
    "South Carolina", "South Dakota", "Tennessee", 
    "Texas", "Utah", "Vermont", "Virginia", "Washington", 
    "West Virginia", "Wisconsin", "Wyoming")

#Which states start with New?
grep("^New", states, value = TRUE)
## [1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"
#Which states DON'T end with a?
states[!grepl("a$", states)]
##  [1] "Arkansas"      "Colorado"      "Connecticut"   "Delaware"     
##  [5] "Hawaii"        "Idaho"         "Illinois"      "Kansas"       
##  [9] "Kentucky"      "Maine"         "Maryland"      "Massachusetts"
## [13] "Michigan"      "Mississippi"   "Missouri"      "New Hampshire"
## [17] "New Jersey"    "New Mexico"    "New York"      "Ohio"         
## [21] "Oregon"        "Rhode Island"  "Tennessee"     "Texas"        
## [25] "Utah"          "Vermont"       "Washington"    "Wisconsin"    
## [29] "Wyoming"
#We could also have used another
#  argument of grep, invert,
#  to accomplish the same thing as
#  x[!grepl("pattern", x)]
grep("a$", states, value = TRUE, invert = TRUE)
##  [1] "Arkansas"      "Colorado"      "Connecticut"   "Delaware"     
##  [5] "Hawaii"        "Idaho"         "Illinois"      "Kansas"       
##  [9] "Kentucky"      "Maine"         "Maryland"      "Massachusetts"
## [13] "Michigan"      "Mississippi"   "Missouri"      "New Hampshire"
## [17] "New Jersey"    "New Mexico"    "New York"      "Ohio"         
## [21] "Oregon"        "Rhode Island"  "Tennessee"     "Texas"        
## [25] "Utah"          "Vermont"       "Washington"    "Wisconsin"    
## [29] "Wyoming"

3.6.4 Multiple Possibilities: |

As is typical in programming, we can use | to mean “or”; it separates alternative patterns. Let’s try it out:

#Let's get states starting with directional names
grep("North|South|East|West", states, value = TRUE)
## [1] "North Carolina" "North Dakota"   "South Carolina" "South Dakota"  
## [5] "West Virginia"

3.6.5 White Space: \\s

White space comes in many forms. Obviously the most common is a simple space. But also included in this are other sorts of blank characters that may show up unexpectedly in our data, like tabs (\t), newlines (\n), a carriage return (\r), a form feed (\f), etc. (like I said, string data is messy).

In regex in R, \\s matches all of these (those who have seen regex before should note that we need two backslashes because the backslash itself needs to be escaped in an R string). For those interested in web scraping, unfortunately, it appears that \\s does not match &nbsp;; we’ll return to this later.

#Which states have a space of any sort?
grep("\\s", states, value = TRUE)
##  [1] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
##  [5] "North Carolina" "North Dakota"   "Rhode Island"   "South Carolina"
##  [9] "South Dakota"   "West Virginia"

3.6.6 Wildcard: .

Sometimes, we don’t care what character is in a certain place, just that it’s there. This type of pattern is called a wildcard, and it’s denoted in regex by the period (.).

Actually, it’s rare to use it by itself, but it’ll come up a lot when we have more tools. The only thing I can think of for this alone is cheating on crosswords. Suppose we wanted to fill in a crossword answer that had the pattern L _ V _ _ _ _ _ _. As long as we know it’s one word, we can go to the Regex dictionary and search the pattern: l.v...... and we’ll get the full list of possible answers: lava-lava, lavaliere, leviathan, Levitical, Leviticus, liverleaf, liverwort, liveryman, livestock. We’ll use this more soon.
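We can replicate the crossword trick in R directly with the anchors from earlier – nine characters total, with only the L and V pinned down (the word list here is made up for illustration):

```r
words <- c("leviathan", "livestock", "lovage", "leverage", "lavender")

#nine-letter words matching L _ V _ _ _ _ _ _
grep("^l.v......$", words, value = TRUE)
```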

3.6.7 Quantifiers: *, +, ?, {m(, n)}

The most common appearance of . from the last section is as .*. We saw this when I found all the pan-vowel words in the introduction to this section.

* is a quantifier – it quantifies the piece of the pattern directly before it. In particular, * says to match the thing before it 0 or more times, so .* means “match anything, any number of times (including zero).” The regex in the introduction, a.*e.*i.*o.*u, thus means: match an a, followed by anything any number of times, then an e, then anything any number of times, then an i, then an o, then a u (with anything allowed in between each pair).

Here are all the possible quantifiers:

  1. *: Match at least 0 times
  2. +: Match at least 1 time
  3. ?: Match 0 or 1 time.
  4. {m}: Match exactly m times.
  5. {m,n}: Match between m and n times, inclusive. If n is blank, it’s at least m times.
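With quantifiers in hand, we can sanity-check the pan-vowel pattern from the introduction ourselves:

```r
#a, then anything, then e, then i, then o, then u -- in order
grepl("a.*e.*i.*o.*u", c("facetious", "sacrilegious", "education"))
#"education" fails because no e follows its a
```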

Some demonstration:

#What states have a doubled s?
grep("s{2}", states, value = TRUE)
## [1] "Massachusetts" "Mississippi"   "Missouri"      "Tennessee"
#Which states have an a in them
#  between the first and the last letter?
#  (first letter is excluded because
#   we're only looking for lowercase a)
grep("a.+", states, value = TRUE)
##  [1] "Alabama"        "Alaska"         "Arkansas"       "California"    
##  [5] "Colorado"       "Delaware"       "Hawaii"         "Idaho"         
##  [9] "Indiana"        "Kansas"         "Louisiana"      "Maine"         
## [13] "Maryland"       "Massachusetts"  "Michigan"       "Montana"       
## [17] "Nebraska"       "Nevada"         "New Hampshire"  "North Carolina"
## [21] "North Dakota"   "Oklahoma"       "Pennsylvania"   "Rhode Island"  
## [25] "South Carolina" "South Dakota"   "Texas"          "Utah"          
## [29] "Washington"
flips <-
  c("HTHHHTHHHTH",
    "THTTHTTHHTH",
    "THTHTHHTHHHT",
    "THTHTTTHTTHTTT")

#Which sequences of coin flips had
#  at least 3 heads in a row?
grepl("H{3,}", flips)
## [1]  TRUE FALSE  TRUE FALSE
#Let's match the middle initial K, possibly
#  followed by one more character (like a period)
grepl("\\sK.?", c("Johnson, Mary K",
                  "Hobbes, Calvin K.",
                  "Martin, George R. R."))
## [1]  TRUE  TRUE FALSE

3.6.8 Group Matching: []

[] is used for a less inclusive version of the wildcard. We specify the characters that we’re OK with inside of [] and only those will be matched. Most commonly, this is used for matching any letter or any number. To do this, we use - inside [] to mean “through”, like [A-Z] (A through Z):

#Let's match ANY middle initial
nms <- c("Johnson, Mary K",
         "Hobbes, Calvin K.",
         "Martin, George R. R.", "Obama, Barack",
         "Prince, The Artist Formerly Known As")
grepl("\\s[A-Z]\\.?$", nms)
## [1]  TRUE  TRUE  TRUE FALSE FALSE
#If we want to find anyone with a period, 
#  we have to be careful. This won't work:
grepl(".", nms)
## [1] TRUE TRUE TRUE TRUE TRUE
#We CAN put a period inside a group:
grepl("[.]", nms)
## [1] FALSE  TRUE  TRUE FALSE FALSE
#OR we can escape it with \\:
grepl("\\.", nms)
## [1] FALSE  TRUE  TRUE FALSE FALSE
#Let's remove all digits from some addresses
gsub("[0-9]", "", c("3718 Locust Walk",
                    "219 S 41st Street",
                    "1600 Pennsylvania Ave NW"))
## [1] " Locust Walk"         " S st Street"         " Pennsylvania Ave NW"
#Whoops, we took away too much there!
#  We also missed the leading white space
#  Let's try again:
gsub("^[0-9]*\\s*", "", c("3718 Locust Walk",
                          "219 S 41st Street",
                          "1600 Pennsylvania Ave NW"))
## [1] "Locust Walk"         "S 41st Street"       "Pennsylvania Ave NW"

3.6.8.1 Specials in Grouping: [:punct:], [:alnum:]

Also within [] we can refer to certain sets of characters with special shorthands. The two most useful are [:punct:] to refer to all punctuation (well, most – roughly the characters !"#$%&'()*+,-./:;<=>?@[]\^_`{|}~), and [:alnum:] to refer to all alphanumeric characters (i.e., [:alnum:] is the same as [A-Za-z0-9]). (There are also [:upper:], [:lower:], [:digit:], and [:alpha:], but all of these are longer to write out than just writing A-Z, a-z, 0-9, and a-zA-Z, respectively.)

#get rid of all punctuation
gsub("[[:punct:]]", "", "John C. Willington, Jr.")
## [1] "John C Willington Jr"
#get rid of all alphanumeric characters
gsub("[[:alnum:]]", "", "John C. Willington, Jr.")
## [1] " . , ."

3.6.8.2 Negative Group Matching: [^]

We can specify a group of characters not to match by starting the matching group with [^ instead of [.

#Keep only digits
gsub("[^0-9]", "", c("3718 Locust Walk",
                     "219 S 41st Street",
                     "1600 Pennsylvania Ave NW"))
## [1] "3718"  "21941" "1600"

3.6.9 Awesome Tools for Practicing

I think that’s enough for today. Some other common things to look up:

  • () to denote matched groups, along with \\1, \\2, etc.
  • Look-ahead and look-behind, both positive and negative, using perl = TRUE
  • \\U and \\L to force letters upper- or lower-case within regex
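As a quick teaser for the first and third bullets (a sketch, not a full treatment):

```r
#() captures a group; \\1 refers back to whatever that group
#  matched, so (.)\\1 finds any doubled character
grep("(.)\\1", c("Tennessee", "Ohio", "Mississippi"), value = TRUE)
## [1] "Tennessee"   "Mississippi"

#\\U upper-cases the captured group (requires perl = TRUE);
#  \\b is a word boundary
gsub("\\b([a-z])", "\\U\\1", "new york city", perl = TRUE)
## [1] "New York City"
```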

Besides that, regex is all about playing around extensively with your actual data.

And staring at a regex cheat sheet! I have this printed and pasted above my desk:

https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

I also frequently visit this website, which allows you to test regexes and breaks down the pattern step-by-step with explanations and pretty colors:

https://regex101.com/

3.6.10 A Quick Data Exercise

Remember how we found out that Masterman had a strange job description for its teachers? The reason we were susceptible to this is that we relied on finding the most common job description to identify teachers. This was bound to fail if a small, specialized school like Masterman uses some alternative description on a small scale.

A better approach to tackling this problem would have been to use regex. After some brief inspection of our salaries data, we could have seen that TEACHER appears to be a constant in the job description of teachers. We can start from there and winnow down our options to only identify the teachers we’re interested in, but also be more inclusive than just including TEACHER,FULL TIME:

#putting grepl in the first argument means
#  we'll only get rows for which there is
#  a match to the pattern
salaries[grepl("TEACHER", title_description),
         .N, by = title_description][order(-N)]
##                  title_description    N
##  1:              TEACHER,FULL TIME 6652
##  2:         TEACHER,SPEC EDUCATION 1314
##  3: TEACHER-EXTRA CURR/STAFF DEVEL  336
##  4:              TEACHER ASST,PKHS  109
##  5:          TEACHER,DEMONSTRATION   83
##  6:       TEACHER,HEAD,PKHS/BRIGHT   38
##  7:  RETIRED TEACHER, PER DIEM SUB   19
##  8:             CONSULTING TEACHER   19
##  9:      TEACHER,SPEC ASSIGN,12 MO   16
## 10:  TEACHER,DEMONSTRATION,SPEC ED   16
## 11: RETIRED TEACHER,PER DIEM SP ED   16
## 12:   TEACHER,LONG TERM SUBSTITUTE    6
## 13:          TEACHER,NON-CERT HRLY    4
## 14:   TEACHER, PER DIEM SUBSTITUTE    1
## 15:            MGR TEACHER COACHES    1

At this smaller scale, it’s a lot easier to pick through every possible TEACHER-related job title. We can see Masterman’s TEACHER,DEMONSTRATION right there as 5th-most-common. Suppose we had wanted to keep only TEACHER,FULL TIME and TEACHER,DEMONSTRATION for our teachers data set. We might have subsetted like this:

#(FULL|DEMO) matches ONLY
#  TEACHER,FULL or TEACHER,DEMO;
#  the rest of the regex is to exclude
#  TEACHER,DEMONSTRATION,SPEC ED, which
#  is distinguished by having a comma
#  later in the string, so we require
#  our match to have all the remaining
#  characters after (FULL|DEMO) to
#  NOT be a comma.
salaries[grepl("TEACHER,(FULL|DEMO)[^,]*$", title_description),
         unique(title_description)]
## [1] "TEACHER,FULL TIME"     "TEACHER,DEMONSTRATION"

What if we only wanted to compare the average wage of PFT employees vs. non-PFT employees? The PFT represents several classes of employees in the district, so we might not want to look solely at PFT-TEACHER. We can do this with regex:

#make sure we don't mix hourly employees
#  since their wages are quoted by the hour
salaries[pay_rate_type == "SALARIED", 
         .(avg_wage = mean(pay_rate)), 
         by = .(pft = grepl("^PFT", type_of_representation))]
##      pft avg_wage
## 1: FALSE 35481.39
## 2:  TRUE 58571.78

Lastly, we can extract only employees working at schools (as opposed to offices) with something like:

salaries[grepl("SCHOOL", organization_level),
         .N, organization_level][order(-N)]
##             organization_level    N
## 1:           ELEMENTARY SCHOOL 9667
## 2:                 HIGH SCHOOL 3101
## 3:               MIDDLE SCHOOL  954
## 4: TRANSITION / OVERAGE SCHOOL  141
## 5:             CHARTER SCHOOLS   12
## 6:           NON PUBLIC SCHOOL    1

3.7 Non-.csv Data

Comma- or tab-separated data is ideal, especially since we have the incredibly-fast fread by our side to facilitate the data reading process when we’re lucky enough to have this kind of data.

But we’re not always so lucky. We’ve already covered how to read in Excel data files. Here we’ll cover three more file types I’ve frequently found in the wild – files created in SAS, files created in Stata, and fixed-width files. The first two are handled by the haven package, which is specifically designed for reading foreign, non-csv data formats into R; for fixed-width files we’ll turn to other tools.

3.7.1 SAS: read_sas

Users of SAS may be familiar with its sas7bdat export file (I think it stands for “SAS 7 binary data”). I have found this file format in one place while working on projects in education – the Common Core of Data. Many of the district- and school-level data files found there are available in sas7bdat format.

To just run through a quick example, let’s load in the most recent LEA Universe Survey Data, from 2013-14. The original data can be found on this site here, but has been loaded onto your computer already.

Basically, reading such a file is as simple as using the read_sas function from the haven package:

#To install from CRAN
install.packages("haven")
#Now load the package:
library(haven)

leas_13 <- read_sas("data/ag131a_supp.sas7bdat")
class(leas_13)
## [1] "tbl_df"     "tbl"        "data.frame"
#again, it's been read as a data.frame,
#  so we'll use setDT to make it a data.table
setDT(leas_13)

#these files are huge...
ncol(leas_13)
## [1] 339
#so let's just preview the first, say, 30 columns:
leas_13[ , 1:30, with = FALSE]
##        SURVYEAR   LEAID FIPST STID
##     1:     2013 0100002    01  210
##     2:     2013 0100005    01  101
##     3:     2013 0100006    01  048
##     4:     2013 0100007    01  158
##     5:     2013 0100008    01  169
##    ---                            
## 18781:     2013 6600040    66  001
## 18782:     2013 6900030    69  001
## 18783:     2013 7200030    72   01
## 18784:     2013 7800002    78  002
## 18785:     2013 7800030    78  001
##                                             NAME      PHONE
##     1:                    ALABAMA YOUTH SERVICES 3342153850
##     2:                          ALBERTVILLE CITY 2568911183
##     3:                           MARSHALL COUNTY 2565823171
##     4:                               HOOVER CITY 2054391015
##     5:                              MADISON CITY 2564648370
##    ---                                                     
## 18781:                    GUAM DEPT OF EDUCATION 6714750512
## 18782:                 CNMI PUBLIC SCHOOL SYSTEM 6702373061
## 18783:       PUERTO RICO DEPARTMENT OF EDUCATION 7877735800
## 18784:               SAINT CROIX SCHOOL DISTRICT 3407731747
## 18785: SAINT THOMAS - SAINT JOHN SCHOOL DISTRICT 3407752250
##                        MSTREE        MCITY MSTATE  MZIP MZIP4
##     1:             P O BOX 66     MT MEIGS     AL 36057  0066
##     2:       107 WEST MAIN ST  ALBERTVILLE     AL 35950  0025
##     3: 12380 US HIGHWAY 431 S GUNTERSVILLE     AL 35976  9351
##     4:  2810 METROPOLITAN WAY       HOOVER     AL 35243  5500
##     5:       211 CELTIC DRIVE      MADISON     AL 35758  1615
##    ---                                                       
## 18781:              PO BOX DE        AGANA     GU 96932  9008
## 18782:       PO BOX 501370 CK       SAIPAN     MP 96950      
## 18783:          PO BOX 190759     SAN JUAN     PR 00919  0759
## 18784:   2133 HOSPITAL STREET  SAINT CROIX     VI 00820  4665
## 18785:      386 ANNAS RETREAT SAINT THOMAS     VI 00802      
##                                LSTREE        LCITY LSTATE  LZIP LZIP4 TYPE
##     1:    1000 INDUSTRIAL SCHOOL ROAD    MT. MEIGS     AL 36057  0066    1
##     2:               107 WEST MAIN ST  ALBERTVILLE     AL 35950  0025    1
##     3:         12380 US HWY 431 SOUTH GUNTERSVILLE     AL 35976  9351    1
##     4:          2810 METROPOLITAN WAY       HOOVER     AL 35243  5500    1
##     5:                  211 CELTIC DR      MADISON     AL 35758  1615    1
##    ---                                                                    
## 18781:               312 ASPINALL AVE        AGANA     GU 96932  9008    1
## 18782:                  PO BOX 501370       SAIPAN     MP 96950          1
## 18783: CALLE FEDERICO COSTAS NUM. 150     HATO REY     PR 00919  0759    1
## 18784:               2133 HOSPITAL ST  SAINT CROIX     VI 00820  4665    1
## 18785:              386 ANNAS RETREAT SAINT THOMAS     VI 00802          1
##        UNION CONUM             CONAME CSA  CBSA METMIC ULOCAL CDCODE
##     1:   000 01101  MONTGOMERY COUNTY   N 33860      1     21   0103
##     2:   000 01095    MARSHALL COUNTY 290 10700      2     32   0104
##     3:   000 01095    MARSHALL COUNTY 290 10700      2     42   0104
##     4:   000 01073   JEFFERSON COUNTY 142 13820      1     13   0106
##     5:   000 01089     MADISON COUNTY 290 26620      1     21   0105
##    ---                                                              
## 18781:   000     N                  N   N     N      0      N      N
## 18782:   000     N                  N   N     N      0      N      N
## 18783:   000 72127 SAN JUAN MUNICIPIO 490 41980      1     21   7298
## 18784:   000 78010   ST. CROIX ISLAND   N     N      0     33   7898
## 18785:   000 78030  ST. THOMAS ISLAND   N     N      0     33   7898
##         LATCOD   LONCOD BIEA BOUND AMEMPUP
##     1: 32.3770 -86.0830    2     1       2
##     2: 34.2675 -86.2086    2     1       2
##     3: 34.3050 -86.2867    2     1       2
##     4: 33.4062 -86.7669    2     1       2
##     5: 34.6873 -86.7449    2     1       2
##    ---                                    
## 18781:  0.0000   0.0000    2     1       2
## 18782:  0.0000   0.0000    2     1       2
## 18783: 18.4243 -66.0706    2     1       2
## 18784: 17.7025 -64.8669    2     1       2
## 18785: 18.3395 -64.8822    2     1       2

3.7.2 Stata: read_dta

I haven’t come across any files in education that are only available in .dta format, so we’ll just take a quick digression into non-educational data to demonstrate how to read such files, in case you encounter them in your research.

The file I found is from here (specifically, this file). Again, it’s pre-loaded on your machine.

As I mentioned, the haven package is designed to handle reading all sorts of foreign formats into R; Stata is among them.

#haven provides read_dta for Stata files
library(haven)

#this is a very tiny file
car_data <- read_dta("data/carsdata.dta")
#we could use setDT, etc., but we won't
car_data
##   cars hhsize
## 1    1      1
## 2    2      2
## 3    2      3
## 4    2      4
## 5    3      5
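
One wrinkle specific to haven is worth flagging: Stata value labels come through as haven’s labelled class rather than as ordinary factors. A minimal sketch, using a hand-built labelled vector since carsdata.dta carries no value labels (the values and labels here are made up for illustration):

```r
library(haven)

#Stata value labels arrive as haven's "labelled" class;
#  as_factor() converts them to ordinary R factors
x <- labelled(c(1, 2, 1), c(yes = 1, no = 2))
as_factor(x)
```

If you plan to model or tabulate such columns, converting them with as_factor (or stripping the labels with zap_labels) right after import saves confusion later.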

3.7.3 Fixed-width read_fwf/fread redux

Fixed-width files are basically the bane of my existence. Trouble is, I keep running into them, regardless of how nightmarish they are to deal with, especially when you have to create the width dictionary yourself.

This is essentially the case with old Universe files from the Common Core. Take, for example, the 1995-96 LEA Universe Flat File, which is in fixed-width format. There is a record layout on the NCES website as well… unfortunately, it, too, is in fixed-width format! See what I mean about fixed-width…

To simplify our exercise a bit, instead of using the NCES files, we’ll use some files from Wisconsin. The main advantage is that I’ve already done a fair bit of data cleaning on the original raw files.

Much like Philadelphia, Wisconsin publishes data on all of its teachers. But the Wisconsin data is much more comprehensive. It’s statewide, and it includes information on teachers’ age, experience, fringe benefits pay, etc. The data goes back to 1994-95, but we’ll just use the 1999-2000 data for now.

The data files can be found here, but be forewarned – they’re very messy. Messy enough to have served as inspiration for the “Data Cleaning” section in the appendix. I’ve done everyone the favor of including the more manageable version in today’s data files.

3.7.3.1 read_fwf

The package we’ll use for reading fixed-width files is readr. Let’s load it:

install.packages("readr")
library(readr)

read_fwf is the relevant function in readr. In addition to the column names, it requires an extra argument, col_positions, which tells read_fwf how to find the columns in the data. It can be specified in a number of ways (see ?read_fwf); the one we’ll use today is the helper function fwf_widths, which maps column widths into something col_positions can understand. It also takes an argument allowing us to specify column names, which is convenient.
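
To demystify what read_fwf is doing under the hood, here’s a base-R sketch of fixed-width parsing: given a vector of field widths, compute each field’s start and end position and slice the line with substring(). (The line and widths here are made up for illustration.)

```r
#a toy line with four fixed-width fields
line   <- "001JohnSmith 1975"
widths <- c(3, 4, 6, 4)

#each field starts at 1 plus the cumulative widths of the earlier fields
starts <- cumsum(c(1, head(widths, -1)))
#each field ends at the running total of the widths
ends   <- cumsum(widths)

substring(line, starts, ends)
#returns "001" "John" "Smith " "1975"
```

fwf_widths does this bookkeeping for you (and read_fwf additionally handles type conversion, trimming, and reading line by line).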

#The column widths are kept in a separate dictionary file,
#  which I had to create myself
fwf_dict <- fread("data/00keys.csv", header = FALSE)
teachers_wisc <-
  read_fwf("data/00staff.dat", 
           col_positions = 
             fwf_widths(fwf_dict$V2, fwf_dict$V1))
#we gotta setDT again
class(teachers_wisc)
## [1] "tbl_df"     "tbl"        "data.frame"
setDT(teachers_wisc)[]
##                id    last_name first_name gender ethnicity birth_year
##      1: 000540952     Anderson  Shelley J      F         W       1949
##      2: 000540943       Bowers     Teresa      F         W       1966
##      3: 000540930   Branstiter      Zoila      F         H       1957
##      4: 000540973        Braun   Connie M      F         W       1940
##      5: 000540896      Broeske    Marilyn      F         W       1952
##     ---                                                              
## 155428: 000603626    Vanpernis       Anna      F         W       1977
## 155429: 000603663     Vonfange     Elaine      F         W       1947
## 155430: 000603604    Wahlquist       Jean      F         W       1955
## 155431: 000603590 Zuber Sequin        Amy      F         W       1977
## 155432: 000603590 Zuber Sequin        Amy      F         W       1977
##         highest_degree highest_degree_year year_session months_employed
##      1:             NA                  NA         2000            0000
##      2:              4                1999         2000            0950
##      3:             NA                  NA         2000            0000
##      4:              4                1962         2000            0950
##      5:             NA                  NA         2000            0000
##     ---                                                                
## 155428:             NA                  NA         2000            0000
## 155429:             NA                  NA         2000            0000
## 155430:             NA                  NA         2000            0000
## 155431:              4                1999         2000            0950
## 155432:              4                1999         2000            0950
##         local_exp total_exp  salary  fringe filler category filler
##      1:       000       000 0011946 0007620     NA        3     NA
##      2:       020       020 0022211 0004296     NA        1     NA
##      3:       000       000 0012224 0007673     NA        3     NA
##      4:       250       310 0039487 0011001     NA        1     NA
##      5:       000       000 0012848 0009661     NA        3     NA
##     ---                                                           
## 155428:       000       000 0017557 0006214     NA        3     NA
## 155429:       000       000 0023056 0007543     NA        3     NA
## 155430:       000       000 0034772 0009772     NA        3     NA
## 155431:       010       010 0024973 0004813     NA        1     NA
## 155432:       010       010 0024973 0004813     NA        1     NA
##         hire_agency agency agency_hire_type school position_code area
##      1:        0007   0007               04   0020            97  907
##      2:        0007   0007               04   0020            53  050
##      3:        0007   0007               04   0020            97  907
##      4:        0007   0007               04   0020            53  265
##      5:        0007   0007               04   0020            97  907
##     ---                                                              
## 155428:        9912   9912               01     NA            68  000
## 155429:        9912   9912               01     NA            68  000
## 155430:        9912   9912               01     NA            69  000
## 155431:        9912   9912               01     NA            64  000
## 155432:        9912   9912               01     NA            64  020
##         low_grade high_grade low_grade_code high_grade_code bilingual
##      1:        KG         KG             16              16         N
##      2:        01         03             20              28         N
##      3:        KG         05             16              36         N
##      4:        04         05             32              36         N
##      5:        KG         05             16              36         N
##     ---                                                              
## 155428:        3E         12              4              64         N
## 155429:        KG         12             16              64         N
## 155430:        KG         12             16              64         N
## 155431:        3E         12              4              64         N
## 155432:        3E         12              4              64         N
##         full_time_equiv filler final_contract filler         agency_name
##      1:             094     NA              Y     NA Abbotsford Sch Dist
##      2:             085     NA              N     NA Abbotsford Sch Dist
##      3:             054     NA              Y     NA Abbotsford Sch Dist
##      4:             013     NA              N     NA Abbotsford Sch Dist
##      5:             094     NA              Y     NA Abbotsford Sch Dist
##     ---                                                                 
## 155428:             091     NA              Y     NA                  NA
## 155429:             100     NA              Y     NA                  NA
## 155430:             100     NA              Y     NA                  NA
## 155431:             045     NA              Y     NA                  NA
## 155432:             055     NA              Y     NA                  NA
##                         school_name grade_level cesa county
##      1:               Abbotsford El           6   10     10
##      2:               Abbotsford El           6   10     10
##      3:               Abbotsford El           6   10     10
##      4:               Abbotsford El           6   10     10
##      5:               Abbotsford El           6   10     10
##     ---                                                    
## 155428: Cooperative Ed Serv Agcy 12          NA   12     74
## 155429: Cooperative Ed Serv Agcy 12          NA   12     74
## 155430: Cooperative Ed Serv Agcy 12          NA   12     74
## 155431: Cooperative Ed Serv Agcy 12          NA   12     74
## 155432: Cooperative Ed Serv Agcy 12          NA   12     74
##                county_name agency_work_type   school_mail_1 school_mail_2
##      1:       Clark County               04 112 W Spruce St            NA
##      2:       Clark County               04 112 W Spruce St            NA
##      3:       Clark County               04 112 W Spruce St            NA
##      4:       Clark County               04 112 W Spruce St            NA
##      5:       Clark County               04 112 W Spruce St            NA
##     ---                                                                  
## 155428: CESA County County               01  618 Beaser Ave            NA
## 155429: CESA County County               01  618 Beaser Ave            NA
## 155430: CESA County County               01  618 Beaser Ave            NA
## 155431: CESA County County               01  618 Beaser Ave            NA
## 155432: CESA County County               01  618 Beaser Ave            NA
##                     school_mail_3 school_shipping_1 school_shipping_2
##      1: Abbotsford WI  54405-9734                NA                NA
##      2: Abbotsford WI  54405-9734                NA                NA
##      3: Abbotsford WI  54405-9734                NA                NA
##      4: Abbotsford WI  54405-9734                NA                NA
##      5: Abbotsford WI  54405-9734                NA                NA
##     ---                                                              
## 155428:    Ashland WI  54806-2751                NA                NA
## 155429:    Ashland WI  54806-2751                NA                NA
## 155430:    Ashland WI  54806-2751                NA                NA
## 155431:    Ashland WI  54806-2751                NA                NA
## 155432:    Ashland WI  54806-2751                NA                NA
##         school_shipping_3  mail_city mail_state   mail_zip ship_city
##      1:                NA Abbotsford         WI 54405-9734        NA
##      2:                NA Abbotsford         WI 54405-9734        NA
##      3:                NA Abbotsford         WI 54405-9734        NA
##      4:                NA Abbotsford         WI 54405-9734        NA
##      5:                NA Abbotsford         WI 54405-9734        NA
##     ---                                                             
## 155428:                NA    Ashland         WI 54806-2751        NA
## 155429:                NA    Ashland         WI 54806-2751        NA
## 155430:                NA    Ashland         WI 54806-2751        NA
## 155431:                NA    Ashland         WI 54806-2751        NA
## 155432:                NA    Ashland         WI 54806-2751        NA
##         ship_state ship_zip    telephone filler       admin_name nee
##      1:         NA       NA 715-223-4281     NA  Jerry Zanotelli  NA
##      2:         NA       NA 715-223-4281     NA  Jerry Zanotelli  NA
##      3:         NA       NA 715-223-4281     NA  Jerry Zanotelli  NA
##      4:         NA       NA 715-223-4281     NA  Jerry Zanotelli  NA
##      5:         NA       NA 715-223-4281     NA  Jerry Zanotelli  NA
##     ---                                                             
## 155428:         NA       NA 715-682-2363     NA Fred Schlichting  NA
## 155429:         NA       NA 715-682-2363     NA Fred Schlichting  NA
## 155430:         NA       NA 715-682-2363     NA Fred Schlichting  NA
## 155431:         NA       NA 715-682-2363     NA Fred Schlichting  NA
## 155432:         NA       NA 715-682-2363     NA Fred Schlichting  NA
##         contracted_employee long_term_sub
##      1:                   N             N
##      2:                   N             N
##      3:                   N             N
##      4:                   N             N
##      5:                   N             N
##     ---                                  
## 155428:                   N             N
## 155429:                   N             N
## 155430:                   N             N
## 155431:                   N             N
## 155432:                   N             N

4 Plotting (11 AM - 12 PM)

Now we can finally start working on some design!

Fine-tuning plots is a pain in any language/program, but doing basic plots in R is very simple. We’ll start with some easy plots and move on to some more bells and whistles.

4.1 base Plots

R comes ready and loaded with functions for all of the most common types of plots – histograms, kernel density, scatter plots, line plots, barplots, etc. I’ll go through these one by one, sprinkling in some tricks along the way that are applicable to any plotting scenario (e.g., setting the axis labels and plot title).

After we have all of the main plot types handy, I’ll move on to tiling plots (plotting more than one plot at once).

4.1.1 Histograms

Histograms are probably the most common way to get a glance at the way that a variable is distributed in your data. Let’s examine the distribution of birth years in the Wisconsin teacher data.

First, a quick jab of data cleaning:

#Only include employees matching the following criteria:
#  * Have a bachelor's or master's degree (codes 4 and 5)
#  * Hold position code 53 (teachers)
#  * Have a strictly positive salary
#  * Aren't hired by administrative buildings (agency codes starting with 99)
#  * Have between 1 and 30 years of total experience

teachers_wisc <- 
  teachers_wisc[highest_degree %in% c(4, 5) & 
                  position_code == "53" &
                  as.integer(salary) > 0 & 
                  substr(agency, 1, 2) != "99" & 
                  as.integer(total_exp) %between% c(10, 300)]

#Now set salaries and fringe benefits to be numeric
teachers_wisc[ , salary := as.numeric(salary)]
teachers_wisc[ , fringe := as.numeric(fringe)]

#experience is kept in units of tenth-years, for some reason
teachers_wisc[ , total_exp := as.numeric(total_exp)/10]
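
As an aside on why the conversions above work: the raw fields are zero-padded character strings, and as.numeric parses them directly, simply dropping the leading zeros. A quick check with a salary value from the printout above:

```r
#zero-padded character fields parse cleanly; leading zeros are dropped
as.numeric("0011946")
#returns 11946
```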

Now we can examine the distribution on a more select/interesting sample of employees. Drawing histograms is managed through the hist function (so it’s very similar to Stata). It only needs one argument – a vector whose values will be binned into a distribution.

#no-frills histogram
teachers_wisc[ , hist(birth_year)]

That’s not bad considering how simple it was to program. There are three things that jump out at me immediately as being deficiencies of this plot:

  1. The title is abstruse
  2. The axis titles are odd
  3. There’s no color – how drab!

Let’s take care of these three in turn.

4.1.1.1 Main Title

The main title of a plot is almost always handled by the main argument to a plotting function:

#Customizing the main title
teachers_wisc[ , hist(birth_year, 
                      main = "Distribution of Year of Birth among Wisconsin Teachers")]