A lot of data that we find is seemingly locked out of our grasp – we find it on a website, ripe for absorbing, but we’re estopped from reaping the fruits of those sweet sweet raw data seeds by a lack of access to the raw data in a readily-handled format. Tables on sites like Wikipedia, the Census Bureau, ESPN, you name it, all tantalize the data analyst who hasn’t added the tools of web scraping to their belt.
Web scraping is basically the process of extracting data from websites. I have mainly used this for toy side projects – scraping data about baseball statistics, Supreme Court judges, swing states, etc.
Also, too much scraping can get your IP address banned and/or throttled. This means that you’ll be denied access to a website, perhaps permanently, if you generate too much traffic.
The proper approach for longer-term, more intense scraping operations is thus to download the HTML file associated with each web page, then read from that file, rather than loading the page over and over again, which causes undue stress on companies’/organizations’ servers (i.e., you want to avoid anything looking like a DDoS attack).
All that said, web scraping is powerful and (to me) fun. This is basically what’s being done when you hear people talking about a sentiment analysis of Facebook or Twitter.
All web scraping starts with finding a page with data on it.
Web scraping illustrates one of the best features about Google Chrome – the Inspect view for trying to target information about key elements of web pages.
We’re targeting the main table of information, so right-click it, and hit “Inspect”. This will bring up the Inspection Toolbar which gives a surfeit of information about the HTML tree of the current web page.
If you hover over an HTML element in the inspection view, the output on the web page that it corresponds to is highlighted. You can expand/contract nodes of the HTML tree by clicking the arrows next to the ones visible. Clicking an arrow pointing down will collapse its “children”; clicking an arrow pointing right will expand its “children”.
After hovering around a bit, we see that the following item, when we hover over it, appears to encompass the entire table of interest:
Scraping U.S. News
OK, enough front-end lead-up, where’s the payoff?
We’ve now got all the ingredients necessary to scrape this page successfully. Before we do that, I want to introduce one more thing – the infamous pipe of R: %>%
.
%>%
is originally found in the magrittr
package, but has permeated throughout the hadleyverse, and is particularly in fashion in code associated with rvest
. The pipe sends the output of the thing to the left of %>%
to the function on the right of it. In particular, it’s sent as the first argument to that function unless we specify otherwise. Some quick examples to show what I mean:
#the pipe is one of the functions
# exported by rvest, so we need to
# load it first
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
#convoluted version of cbind(1:3, 4:6)
1:3 %>% cbind(4:6)
## .
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
#we can also be explicit about where to send
# the output but using its alias on the RHS, .
1:3 %>% cbind(., 4:6)
## .
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
1:3 %>% cbind(4:6, .)
## .
## [1,] 4 1
## [2,] 5 2
## [3,] 6 3
#something slightly more interesting;
# note we have to open and close
# the parentheses for it to work
rnorm(10) %>% sort()
## [1] -1.73870282 -1.70163777 -1.19474078 -0.84678196 0.04950394
## [6] 0.37545330 0.72647934 1.38989501 1.54481210 2.02377989
OK, enough build-up. Let’s scrape!
#I always store the page URL
# as its own variable
URL <- paste0("http://colleges.usnews.rankingsandreviews.com/",
"best-colleges/rankings/national-universities/data")
#now we break out the rvest weapons
# read_html takes to the web and
# retrieves the web page at the
# address you give it (i.e., it's
# doing the first part of what
# Chrome does when you enter a URL
# in your browser search bar).
# It returns the XML tree of the page.
pg <- read_html(URL)
#Now that we've got the XML tree, we
# have to find the node we were after
# from before. To do this, we use
# the html_node function, which
# has an xpath option, to which
# we can supply the xpath from above --
# making sure that we're careful
# about quotation marks.
xp <- '//*[@id="article"]/table'
pg %>% html_node(xpath = xp)
## {xml_node}
## <table>
## [1] <thead>\n \n \n <tr>\n \n \n ...
## [2] <tbody>\n <tr valign="top" class="row-first row-odd table-row ...
#This has returned an `xml_node` corresponding
# to our table. We just need one more function
# and we'll really be cooking -- html_table.
# html_table tries to parse an table from
# HTML form into a data.frame so that we can
# start dealing with it like we would other
# objects read from CSV, Excel, etc.
pg %>% html_node(xpath = xp) %>%
html_table()
## U.S. News rank
## 1 #1\n \n\n \n \n Overall Score: 100 out of 100.
## 2 #2\n \n\n \n \n Overall Score: 99 out of 100.
## 3 #3\n \n\n \n \n Overall Score: 97 out of 100.
## 4 #4Tie\n \n\n \n \n Overall Score: 95 out of 100.
## 5 #4Tie\n \n\n \n \n Overall Score: 95 out of 100.
## 6 #4Tie\n \n\n \n \n Overall Score: 95 out of 100.
## 7 #7\n \n\n \n \n Overall Score: 93 out of 100.
## 8 #8\n \n\n \n \n Overall Score: 92 out of 100.
## 9 #9\n \n\n \n \n Overall Score: 91 out of 100.
## 10 #10Tie\n \n\n \n \n Overall Score: 90 out of 100.
## 11 #10Tie\n \n\n \n \n Overall Score: 90 out of 100.
## 12 #12Tie\n \n\n \n \n Overall Score: 89 out of 100.
## 13 #12Tie\n \n\n \n \n Overall Score: 89 out of 100.
## 14 #14\n \n\n \n \n Overall Score: 85 out of 100.
## 15 #15Tie\n \n\n \n \n Overall Score: 84 out of 100.
## 16 #15Tie\n \n\n \n \n Overall Score: 84 out of 100.
## 17 #15Tie\n \n\n \n \n Overall Score: 84 out of 100.
## 18 #18Tie\n \n\n \n \n Overall Score: 82 out of 100.
## 19 #18Tie\n \n\n \n \n Overall Score: 82 out of 100.
## 20 #20\n \n\n \n \n Overall Score: 77 out of 100.
## 21 #21Tie\n \n\n \n \n Overall Score: 76 out of 100.
## 22 #21Tie\n \n\n \n \n Overall Score: 76 out of 100.
## 23 #23Tie\n \n\n \n \n Overall Score: 74 out of 100.
## 24 #23Tie\n \n\n \n \n Overall Score: 74 out of 100.
## 25 #23Tie\n \n\n \n \n Overall Score: 74 out of 100.
## School
## 1 Princeton University \n \n Princeton, NJ
## 2 Harvard University \n \n Cambridge, MA
## 3 Yale University \n \n New Haven, CT
## 4 Columbia University \n \n New York, NY
## 5 Stanford University \n \n Stanford, CA
## 6 University of Chicago \n \n Chicago, IL
## 7 Massachusetts Institute of Technology \n \n Cambridge, MA
## 8 Duke University \n \n Durham, NC
## 9 University of Pennsylvania \n \n Philadelphia, PA
## 10 California Institute of Technology \n \n Pasadena, CA
## 11 Johns Hopkins University \n \n Baltimore, MD
## 12 Dartmouth College \n \n Hanover, NH
## 13 Northwestern University \n \n Evanston, IL
## 14 Brown University \n \n Providence, RI
## 15 Cornell University \n \n Ithaca, NY
## 16 Vanderbilt University \n \n Nashville, TN
## 17 Washington University in St. Louis \n \n St. Louis, MO
## 18 Rice University \n \n Houston, TX
## 19 University of Notre Dame \n \n Notre Dame, IN
## 20 University of California—Berkeley \n \n Berkeley, CA
## 21 Emory University \n \n Atlanta, GA
## 22 Georgetown University \n \n Washington, DC
## 23 Carnegie Mellon University \n \n Pittsburgh, PA
## 24 University of California—Los Angeles \n \n Los Angeles, CA
## 25 University of Southern California \n \n Los Angeles, CA
## Tuition and Fees Total enrollment
## 1 $43,450 8,088
## 2 $45,278 19,929
## 3 $47,600 12,336
## 4 $51,008 24,221
## 5 $46,320 16,795
## 6 $50,193 12,558
## 7 $46,704 11,319
## 8 $49,341 15,856
## 9 $49,536 21,296
## 10 $45,390 2,209
## 11 $48,710 21,484
## 12 $49,506 6,298
## 13 $49,047 20,336
## 14 $49,346 9,181
## 15 $49,116 21,850
## 16 $43,838 12,686
## 17 $48,093 14,348
## 18 $42,253 6,621
## 19 $47,929 12,179
## 20 in-state: $13,432, out-of-state: $38,140 37,581
## 21 $46,314 14,769
## 22 $48,611 17,858
## 23 $50,410 13,285
## 24 in-state: $12,753, out-of-state: $35,631 43,239
## 25 $50,210 42,469
## Fall 2014 acceptance rate Average freshman retention rate
## 1 7.4% 98%
## 2 6% 97%
## 3 6.3% 99%
## 4 7% 99%
## 5 5.1% 98%
## 6 8.8% 99%
## 7 7.9% 98%
## 8 11.4% 97%
## 9 10.4% 98%
## 10 8.8% 97%
## 11 15% 97%
## 12 11.5% 98%
## 13 13.1% 97%
## 14 8.7% 98%
## 15 14.2% 97%
## 16 13.1% 97%
## 17 17.1% 97%
## 18 15.1% 97%
## 19 21.1% 98%
## 20 16% 97%
## 21 26.8% 95%
## 22 17.4% 96%
## 23 24.6% 95%
## 24 18.6% 97%
## 25 18% 97%
## 6-year graduation rate Classes with fewer than 20 students
## 1 97% Locked
## 2 98% Locked
## 3 96% Locked
## 4 96% Locked
## 5 95% Locked
## 6 93% Locked
## 7 91% Locked
## 8 95% Locked
## 9 96% Locked
## 10 92% Locked
## 11 94% Locked
## 12 95% Locked
## 13 94% Locked
## 14 96% Locked
## 15 93% Locked
## 16 93% Locked
## 17 95% Locked
## 18 93% Locked
## 19 96% Locked
## 20 91% Locked
## 21 91% Locked
## 22 95% Locked
## 23 88% Locked
## 24 91% Locked
## 25 91% Locked
## SAT/ACT 25th-75th percentile
## 1 Locked
## 2 Locked
## 3 Locked
## 4 Locked
## 5 Locked
## 6 Locked
## 7 Locked
## 8 Locked
## 9 Locked
## 10 Locked
## 11 Locked
## 12 Locked
## 13 Locked
## 14 Locked
## 15 Locked
## 16 Locked
## 17 Locked
## 18 Locked
## 19 Locked
## 20 Locked
## 21 Locked
## 22 Locked
## 23 Locked
## 24 Locked
## 25 Locked
#Now that we know it works, we can assign it to
# an object by repeating the code and also
# sending it to setDT so we have a data.table
usnews <- pg %>% html_node(xpath = xp) %>%
html_table() %>% setDT()
#First, the last two columns are only available
# to paid subscribers, let's just drop them:
usnews[ , (ncol(usnews) - 1):ncol(usnews) := NULL]
#The table is a bit of a mess; we'll start
# by assigning more manageable names:
setnames(usnews, c("rank", "name", "tuition_fees",
"enrollment", "acceptance_2014",
"retention_freshman", "grad_rate_6_yr"))[]
## rank
## 1: #1\n \n\n \n \n Overall Score: 100 out of 100.
## 2: #2\n \n\n \n \n Overall Score: 99 out of 100.
## 3: #3\n \n\n \n \n Overall Score: 97 out of 100.
## 4: #4Tie\n \n\n \n \n Overall Score: 95 out of 100.
## 5: #4Tie\n \n\n \n \n Overall Score: 95 out of 100.
## 6: #4Tie\n \n\n \n \n Overall Score: 95 out of 100.
## 7: #7\n \n\n \n \n Overall Score: 93 out of 100.
## 8: #8\n \n\n \n \n Overall Score: 92 out of 100.
## 9: #9\n \n\n \n \n Overall Score: 91 out of 100.
## 10: #10Tie\n \n\n \n \n Overall Score: 90 out of 100.
## 11: #10Tie\n \n\n \n \n Overall Score: 90 out of 100.
## 12: #12Tie\n \n\n \n \n Overall Score: 89 out of 100.
## 13: #12Tie\n \n\n \n \n Overall Score: 89 out of 100.
## 14: #14\n \n\n \n \n Overall Score: 85 out of 100.
## 15: #15Tie\n \n\n \n \n Overall Score: 84 out of 100.
## 16: #15Tie\n \n\n \n \n Overall Score: 84 out of 100.
## 17: #15Tie\n \n\n \n \n Overall Score: 84 out of 100.
## 18: #18Tie\n \n\n \n \n Overall Score: 82 out of 100.
## 19: #18Tie\n \n\n \n \n Overall Score: 82 out of 100.
## 20: #20\n \n\n \n \n Overall Score: 77 out of 100.
## 21: #21Tie\n \n\n \n \n Overall Score: 76 out of 100.
## 22: #21Tie\n \n\n \n \n Overall Score: 76 out of 100.
## 23: #23Tie\n \n\n \n \n Overall Score: 74 out of 100.
## 24: #23Tie\n \n\n \n \n Overall Score: 74 out of 100.
## 25: #23Tie\n \n\n \n \n Overall Score: 74 out of 100.
## rank
## name
## 1: Princeton University \n \n Princeton, NJ
## 2: Harvard University \n \n Cambridge, MA
## 3: Yale University \n \n New Haven, CT
## 4: Columbia University \n \n New York, NY
## 5: Stanford University \n \n Stanford, CA
## 6: University of Chicago \n \n Chicago, IL
## 7: Massachusetts Institute of Technology \n \n Cambridge, MA
## 8: Duke University \n \n Durham, NC
## 9: University of Pennsylvania \n \n Philadelphia, PA
## 10: California Institute of Technology \n \n Pasadena, CA
## 11: Johns Hopkins University \n \n Baltimore, MD
## 12: Dartmouth College \n \n Hanover, NH
## 13: Northwestern University \n \n Evanston, IL
## 14: Brown University \n \n Providence, RI
## 15: Cornell University \n \n Ithaca, NY
## 16: Vanderbilt University \n \n Nashville, TN
## 17: Washington University in St. Louis \n \n St. Louis, MO
## 18: Rice University \n \n Houston, TX
## 19: University of Notre Dame \n \n Notre Dame, IN
## 20: University of California—Berkeley \n \n Berkeley, CA
## 21: Emory University \n \n Atlanta, GA
## 22: Georgetown University \n \n Washington, DC
## 23: Carnegie Mellon University \n \n Pittsburgh, PA
## 24: University of California—Los Angeles \n \n Los Angeles, CA
## 25: University of Southern California \n \n Los Angeles, CA
## name
## tuition_fees enrollment acceptance_2014
## 1: $43,450 8,088 7.4%
## 2: $45,278 19,929 6%
## 3: $47,600 12,336 6.3%
## 4: $51,008 24,221 7%
## 5: $46,320 16,795 5.1%
## 6: $50,193 12,558 8.8%
## 7: $46,704 11,319 7.9%
## 8: $49,341 15,856 11.4%
## 9: $49,536 21,296 10.4%
## 10: $45,390 2,209 8.8%
## 11: $48,710 21,484 15%
## 12: $49,506 6,298 11.5%
## 13: $49,047 20,336 13.1%
## 14: $49,346 9,181 8.7%
## 15: $49,116 21,850 14.2%
## 16: $43,838 12,686 13.1%
## 17: $48,093 14,348 17.1%
## 18: $42,253 6,621 15.1%
## 19: $47,929 12,179 21.1%
## 20: in-state: $13,432, out-of-state: $38,140 37,581 16%
## 21: $46,314 14,769 26.8%
## 22: $48,611 17,858 17.4%
## 23: $50,410 13,285 24.6%
## 24: in-state: $12,753, out-of-state: $35,631 43,239 18.6%
## 25: $50,210 42,469 18%
## tuition_fees enrollment acceptance_2014
## retention_freshman grad_rate_6_yr
## 1: 98% 97%
## 2: 97% 98%
## 3: 99% 96%
## 4: 99% 96%
## 5: 98% 95%
## 6: 99% 93%
## 7: 98% 91%
## 8: 97% 95%
## 9: 98% 96%
## 10: 97% 92%
## 11: 97% 94%
## 12: 98% 95%
## 13: 97% 94%
## 14: 98% 96%
## 15: 97% 93%
## 16: 97% 93%
## 17: 97% 95%
## 18: 97% 93%
## 19: 98% 96%
## 20: 97% 91%
## 21: 95% 91%
## 22: 96% 95%
## 23: 95% 88%
## 24: 97% 91%
## 25: 97% 91%
## retention_freshman grad_rate_6_yr
#There's still some mess to be cleaned up. Time to
# break out the regex skills from this morning:
usnews[ , score :=
gsub(".*Overall Score: ([0-9]*) out of.*", "\\1", rank)]
usnews[ , rank := gsub("#([0-9]*).*", "\\1", rank)]
usnews[ , name := gsub("\n.*", "", name)]
usnews[ , enrollment :=
as.integer(gsub(",", "", enrollment, fixed = TRUE))]
usnews[ , acceptance_2014 :=
as.numeric(gsub("%", "", acceptance_2014, fixed = TRUE))]
usnews[ , retention_freshman :=
as.numeric(gsub("%", "", retention_freshman, fixed = TRUE))]
usnews[ , grad_rate_6_yr :=
as.numeric(gsub("%", "", grad_rate_6_yr, fixed = TRUE))]
#What remains is the issue of tuition_fees:
usnews[]
## rank name
## 1: 1 Princeton University
## 2: 2 Harvard University
## 3: 3 Yale University
## 4: 4 Columbia University
## 5: 4 Stanford University
## 6: 4 University of Chicago
## 7: 7 Massachusetts Institute of Technology
## 8: 8 Duke University
## 9: 9 University of Pennsylvania
## 10: 10 California Institute of Technology
## 11: 10 Johns Hopkins University
## 12: 12 Dartmouth College
## 13: 12 Northwestern University
## 14: 14 Brown University
## 15: 15 Cornell University
## 16: 15 Vanderbilt University
## 17: 15 Washington University in St. Louis
## 18: 18 Rice University
## 19: 18 University of Notre Dame
## 20: 20 University of California—Berkeley
## 21: 21 Emory University
## 22: 21 Georgetown University
## 23: 23 Carnegie Mellon University
## 24: 23 University of California—Los Angeles
## 25: 23 University of Southern California
## rank name
## tuition_fees enrollment acceptance_2014
## 1: $43,450 8088 7.4
## 2: $45,278 19929 6.0
## 3: $47,600 12336 6.3
## 4: $51,008 24221 7.0
## 5: $46,320 16795 5.1
## 6: $50,193 12558 8.8
## 7: $46,704 11319 7.9
## 8: $49,341 15856 11.4
## 9: $49,536 21296 10.4
## 10: $45,390 2209 8.8
## 11: $48,710 21484 15.0
## 12: $49,506 6298 11.5
## 13: $49,047 20336 13.1
## 14: $49,346 9181 8.7
## 15: $49,116 21850 14.2
## 16: $43,838 12686 13.1
## 17: $48,093 14348 17.1
## 18: $42,253 6621 15.1
## 19: $47,929 12179 21.1
## 20: in-state: $13,432, out-of-state: $38,140 37581 16.0
## 21: $46,314 14769 26.8
## 22: $48,611 17858 17.4
## 23: $50,410 13285 24.6
## 24: in-state: $12,753, out-of-state: $35,631 43239 18.6
## 25: $50,210 42469 18.0
## tuition_fees enrollment acceptance_2014
## retention_freshman grad_rate_6_yr score
## 1: 98 97 100
## 2: 97 98 99
## 3: 99 96 97
## 4: 99 96 95
## 5: 98 95 95
## 6: 99 93 95
## 7: 98 91 93
## 8: 97 95 92
## 9: 98 96 91
## 10: 97 92 90
## 11: 97 94 90
## 12: 98 95 89
## 13: 97 94 89
## 14: 98 96 85
## 15: 97 93 84
## 16: 97 93 84
## 17: 97 95 84
## 18: 97 93 82
## 19: 98 96 82
## 20: 97 91 77
## 21: 95 91 76
## 22: 96 95 76
## 23: 95 88 74
## 24: 97 91 74
## 25: 97 91 74
## retention_freshman grad_rate_6_yr score
#We have to account for the fact that tuition is
# sometimes listed as in- and sometimes as
# out-of-state. We have three options that I see:
# 1) Pick one or the other and delete the other
# 2) Average the two
# 3) Create two columns -- in- and out-of-state tuition
# We'll go with number 3.
#We'll have to introduce some new tricks to get this to work.
# Let's do this step by step, then go all at once.
## Get _only_ the tuition_fees values which differentiate
## between in- and out-of-state tuition:
in_out <- usnews[grepl("in-state", tuition_fees), tuition_fees]
in_out
## [1] "in-state: $13,432, out-of-state: $38,140"
## [2] "in-state: $12,753, out-of-state: $35,631"
## Now, split each element of this string into pieces.
## We can do this with the strsplit function, which
## takes a character vector and an argument split.
## split is a regular expression saying where to
## split each element of the string. In our current
## case, we'll split at the colon following in-state
## and the out-of-state, as well as the comma preceding
## out-of-state. We can't just split on commas, though,
## since the tuition values themselves still have these.
## What distinguishes the comma we want (and, as luck
## would have it, the colons as well) is that they're
## followed by a space, which we'll leverage to split it:
in_out_split <- strsplit(in_out, split = "[,:] ")
in_out_split
## [[1]]
## [1] "in-state" "$13,432" "out-of-state" "$38,140"
##
## [[2]]
## [1] "in-state" "$12,753" "out-of-state" "$35,631"
## Next, we want to extract only the numbers from this split.
## To do this, extract only the 2nd and 4th element, and then
## (as above) remove the dollar sign and comma. We want to
## return a list since the right hand side of := has to be a list.
in_out_num_l <- lapply(in_out_split, function(io)
as.integer(gsub("[$,]", "", io[c(2, 4)])))
in_out_num_l
## [[1]]
## [1] 13432 38140
##
## [[2]]
## [1] 12753 35631
## Finally, one more trick to do. := expects to find each
## _element_ of the list it receives corresponding to one
## _column_. Right now, this is messed up (it would be more
## obvious if the list wasn't 2x2...), since the first element
## has one in-state and one out-of-state value (and same
## for the second element). What we need is for the first
## element to have BOTH in-state values, and the second
## element to have BOTH out-of-state values. data.table
## has a tool designed for just such a situation:
## transpose. Basically, transpose flips the order of
## a list. Right now, the list has the list elements
## corresponding to different ROWS and the elements
## WITHIN a list element corresponding to different COLUMNS
## If we transpose this, we'll get list elements
## corresponding to different COLUMNS and the elements
## WITHIN a list element corresponding to different ROWS,
## which is just what we want.
transpose(in_out_num_l)
## [[1]]
## [1] 13432 12753
##
## [[2]]
## [1] 38140 35631
##Now, we can do all of that at once like so:
usnews[grepl("in-state", tuition_fees),
paste0("tuition_", c("in", "out"), "_state") :=
transpose(lapply(strsplit(tuition_fees, split = "[:,] "),
function(io) as.integer(gsub("[$,]", "", io[c(2, 4)]))))]
#And to finish off, we assign the same value to in- and
# out-of-state tuition for those (private) universities which
# don't make a distinction:
usnews[is.na(tuition_in_state),
paste0("tuition_", c("in", "out"), "_state") :=
#since we haven't quite fixed the punctuation in
# the other rows yet, we do so here as an
# intermediate step
{x <- as.integer(gsub("[$,]", "", tuition_fees))
.(x, x)}][]
## rank name
## 1: 1 Princeton University
## 2: 2 Harvard University
## 3: 3 Yale University
## 4: 4 Columbia University
## 5: 4 Stanford University
## 6: 4 University of Chicago
## 7: 7 Massachusetts Institute of Technology
## 8: 8 Duke University
## 9: 9 University of Pennsylvania
## 10: 10 California Institute of Technology
## 11: 10 Johns Hopkins University
## 12: 12 Dartmouth College
## 13: 12 Northwestern University
## 14: 14 Brown University
## 15: 15 Cornell University
## 16: 15 Vanderbilt University
## 17: 15 Washington University in St. Louis
## 18: 18 Rice University
## 19: 18 University of Notre Dame
## 20: 20 University of California—Berkeley
## 21: 21 Emory University
## 22: 21 Georgetown University
## 23: 23 Carnegie Mellon University
## 24: 23 University of California—Los Angeles
## 25: 23 University of Southern California
## rank name
## tuition_fees enrollment acceptance_2014
## 1: $43,450 8088 7.4
## 2: $45,278 19929 6.0
## 3: $47,600 12336 6.3
## 4: $51,008 24221 7.0
## 5: $46,320 16795 5.1
## 6: $50,193 12558 8.8
## 7: $46,704 11319 7.9
## 8: $49,341 15856 11.4
## 9: $49,536 21296 10.4
## 10: $45,390 2209 8.8
## 11: $48,710 21484 15.0
## 12: $49,506 6298 11.5
## 13: $49,047 20336 13.1
## 14: $49,346 9181 8.7
## 15: $49,116 21850 14.2
## 16: $43,838 12686 13.1
## 17: $48,093 14348 17.1
## 18: $42,253 6621 15.1
## 19: $47,929 12179 21.1
## 20: in-state: $13,432, out-of-state: $38,140 37581 16.0
## 21: $46,314 14769 26.8
## 22: $48,611 17858 17.4
## 23: $50,410 13285 24.6
## 24: in-state: $12,753, out-of-state: $35,631 43239 18.6
## 25: $50,210 42469 18.0
## tuition_fees enrollment acceptance_2014
## retention_freshman grad_rate_6_yr score tuition_in_state
## 1: 98 97 100 43450
## 2: 97 98 99 45278
## 3: 99 96 97 47600
## 4: 99 96 95 51008
## 5: 98 95 95 46320
## 6: 99 93 95 50193
## 7: 98 91 93 46704
## 8: 97 95 92 49341
## 9: 98 96 91 49536
## 10: 97 92 90 45390
## 11: 97 94 90 48710
## 12: 98 95 89 49506
## 13: 97 94 89 49047
## 14: 98 96 85 49346
## 15: 97 93 84 49116
## 16: 97 93 84 43838
## 17: 97 95 84 48093
## 18: 97 93 82 42253
## 19: 98 96 82 47929
## 20: 97 91 77 13432
## 21: 95 91 76 46314
## 22: 96 95 76 48611
## 23: 95 88 74 50410
## 24: 97 91 74 12753
## 25: 97 91 74 50210
## retention_freshman grad_rate_6_yr score tuition_in_state
## tuition_out_state
## 1: 43450
## 2: 45278
## 3: 47600
## 4: 51008
## 5: 46320
## 6: 50193
## 7: 46704
## 8: 49341
## 9: 49536
## 10: 45390
## 11: 48710
## 12: 49506
## 13: 49047
## 14: 49346
## 15: 49116
## 16: 43838
## 17: 48093
## 18: 42253
## 19: 47929
## 20: 38140
## 21: 46314
## 22: 48611
## 23: 50410
## 24: 35631
## 25: 50210
## tuition_out_state
#And now we can delete the old column
usnews[ , tuition_fees := NULL][]
## rank name enrollment acceptance_2014
## 1: 1 Princeton University 8088 7.4
## 2: 2 Harvard University 19929 6.0
## 3: 3 Yale University 12336 6.3
## 4: 4 Columbia University 24221 7.0
## 5: 4 Stanford University 16795 5.1
## 6: 4 University of Chicago 12558 8.8
## 7: 7 Massachusetts Institute of Technology 11319 7.9
## 8: 8 Duke University 15856 11.4
## 9: 9 University of Pennsylvania 21296 10.4
## 10: 10 California Institute of Technology 2209 8.8
## 11: 10 Johns Hopkins University 21484 15.0
## 12: 12 Dartmouth College 6298 11.5
## 13: 12 Northwestern University 20336 13.1
## 14: 14 Brown University 9181 8.7
## 15: 15 Cornell University 21850 14.2
## 16: 15 Vanderbilt University 12686 13.1
## 17: 15 Washington University in St. Louis 14348 17.1
## 18: 18 Rice University 6621 15.1
## 19: 18 University of Notre Dame 12179 21.1
## 20: 20 University of California—Berkeley 37581 16.0
## 21: 21 Emory University 14769 26.8
## 22: 21 Georgetown University 17858 17.4
## 23: 23 Carnegie Mellon University 13285 24.6
## 24: 23 University of California—Los Angeles 43239 18.6
## 25: 23 University of Southern California 42469 18.0
## rank name enrollment acceptance_2014
## retention_freshman grad_rate_6_yr score tuition_in_state
## 1: 98 97 100 43450
## 2: 97 98 99 45278
## 3: 99 96 97 47600
## 4: 99 96 95 51008
## 5: 98 95 95 46320
## 6: 99 93 95 50193
## 7: 98 91 93 46704
## 8: 97 95 92 49341
## 9: 98 96 91 49536
## 10: 97 92 90 45390
## 11: 97 94 90 48710
## 12: 98 95 89 49506
## 13: 97 94 89 49047
## 14: 98 96 85 49346
## 15: 97 93 84 49116
## 16: 97 93 84 43838
## 17: 97 95 84 48093
## 18: 97 93 82 42253
## 19: 98 96 82 47929
## 20: 97 91 77 13432
## 21: 95 91 76 46314
## 22: 96 95 76 48611
## 23: 95 88 74 50410
## 24: 97 91 74 12753
## 25: 97 91 74 50210
## retention_freshman grad_rate_6_yr score tuition_in_state
## tuition_out_state
## 1: 43450
## 2: 45278
## 3: 47600
## 4: 51008
## 5: 46320
## 6: 50193
## 7: 46704
## 8: 49341
## 9: 49536
## 10: 45390
## 11: 48710
## 12: 49506
## 13: 49047
## 14: 49346
## 15: 49116
## 16: 43838
## 17: 48093
## 18: 42253
## 19: 47929
## 20: 38140
## 21: 46314
## 22: 48611
## 23: 50410
## 24: 35631
## 25: 50210
## tuition_out_state
That wasn’t so bad, was it?
Well, we’re not quite done yet. Observe the following:
print(usnews, quote = TRUE)
## rank name enrollment
## 1: "1" "Princeton University " " 8088"
## 2: "2" "Harvard University " "19929"
## 3: "3" "Yale University " "12336"
## 4: "4" "Columbia University " "24221"
## 5: "4" "Stanford University " "16795"
## 6: "4" "University of Chicago " "12558"
## 7: "7" "Massachusetts Institute of Technology " "11319"
## 8: "8" "Duke University " "15856"
## 9: "9" "University of Pennsylvania " "21296"
## 10: "10" "California Institute of Technology " " 2209"
## 11: "10" "Johns Hopkins University " "21484"
## 12: "12" "Dartmouth College " " 6298"
## 13: "12" "Northwestern University " "20336"
## 14: "14" "Brown University " " 9181"
## 15: "15" "Cornell University " "21850"
## 16: "15" "Vanderbilt University " "12686"
## 17: "15" "Washington University in St. Louis " "14348"
## 18: "18" "Rice University " " 6621"
## 19: "18" "University of Notre Dame " "12179"
## 20: "20" "University of California—Berkeley " "37581"
## 21: "21" "Emory University " "14769"
## 22: "21" "Georgetown University " "17858"
## 23: "23" "Carnegie Mellon University " "13285"
## 24: "23" "University of California—Los Angeles " "43239"
## 25: "23" "University of Southern California " "42469"
## "rank" "name" "enrollment"
## acceptance_2014 retention_freshman grad_rate_6_yr score
## 1: " 7.4" "98" "97" "100"
## 2: " 6.0" "97" "98" "99"
## 3: " 6.3" "99" "96" "97"
## 4: " 7.0" "99" "96" "95"
## 5: " 5.1" "98" "95" "95"
## 6: " 8.8" "99" "93" "95"
## 7: " 7.9" "98" "91" "93"
## 8: "11.4" "97" "95" "92"
## 9: "10.4" "98" "96" "91"
## 10: " 8.8" "97" "92" "90"
## 11: "15.0" "97" "94" "90"
## 12: "11.5" "98" "95" "89"
## 13: "13.1" "97" "94" "89"
## 14: " 8.7" "98" "96" "85"
## 15: "14.2" "97" "93" "84"
## 16: "13.1" "97" "93" "84"
## 17: "17.1" "97" "95" "84"
## 18: "15.1" "97" "93" "82"
## 19: "21.1" "98" "96" "82"
## 20: "16.0" "97" "91" "77"
## 21: "26.8" "95" "91" "76"
## 22: "17.4" "96" "95" "76"
## 23: "24.6" "95" "88" "74"
## 24: "18.6" "97" "91" "74"
## 25: "18.0" "97" "91" "74"
## "acceptance_2014" "retention_freshman" "grad_rate_6_yr" "score"
## tuition_in_state tuition_out_state
## 1: "43450" "43450"
## 2: "45278" "45278"
## 3: "47600" "47600"
## 4: "51008" "51008"
## 5: "46320" "46320"
## 6: "50193" "50193"
## 7: "46704" "46704"
## 8: "49341" "49341"
## 9: "49536" "49536"
## 10: "45390" "45390"
## 11: "48710" "48710"
## 12: "49506" "49506"
## 13: "49047" "49047"
## 14: "49346" "49346"
## 15: "49116" "49116"
## 16: "43838" "43838"
## 17: "48093" "48093"
## 18: "42253" "42253"
## 19: "47929" "47929"
## 20: "13432" "38140"
## 21: "46314" "46314"
## 22: "48611" "48611"
## 23: "50410" "50410"
## 24: "12753" "35631"
## 25: "50210" "50210"
## "tuition_in_state" "tuition_out_state"
#See that space after name? It's gonna be a bugbear.
#Observe:
gsub("\\s", "", usnews$name)
## [1] "PrincetonUniversity "
## [2] "HarvardUniversity "
## [3] "YaleUniversity "
## [4] "ColumbiaUniversity "
## [5] "StanfordUniversity "
## [6] "UniversityofChicago "
## [7] "MassachusettsInstituteofTechnology "
## [8] "DukeUniversity "
## [9] "UniversityofPennsylvania "
## [10] "CaliforniaInstituteofTechnology "
## [11] "JohnsHopkinsUniversity "
## [12] "DartmouthCollege "
## [13] "NorthwesternUniversity "
## [14] "BrownUniversity "
## [15] "CornellUniversity "
## [16] "VanderbiltUniversity "
## [17] "WashingtonUniversityinSt.Louis "
## [18] "RiceUniversity "
## [19] "UniversityofNotreDame "
## [20] "UniversityofCalifornia—Berkeley "
## [21] "EmoryUniversity "
## [22] "GeorgetownUniversity "
## [23] "CarnegieMellonUniversity "
## [24] "UniversityofCalifornia—LosAngeles "
## [25] "UniversityofSouthernCalifornia "
#The problem is, that space at the end isn't just
# any space. It's the dreaded NON-BREAKING SPACE
# (you'll sometimes see this on website or in
# newspapers: ). The \\s token in
# reges doesn't recognize it as blank space.
# We have to do something really ugly to get
# rid of the non-breaking space. First, the magic:
usnews[ , name := gsub(intToUtf8(160), "", name)][]
## rank name enrollment acceptance_2014
## 1: 1 Princeton University 8088 7.4
## 2: 2 Harvard University 19929 6.0
## 3: 3 Yale University 12336 6.3
## 4: 4 Columbia University 24221 7.0
## 5: 4 Stanford University 16795 5.1
## 6: 4 University of Chicago 12558 8.8
## 7: 7 Massachusetts Institute of Technology 11319 7.9
## 8: 8 Duke University 15856 11.4
## 9: 9 University of Pennsylvania 21296 10.4
## 10: 10 California Institute of Technology 2209 8.8
## 11: 10 Johns Hopkins University 21484 15.0
## 12: 12 Dartmouth College 6298 11.5
## 13: 12 Northwestern University 20336 13.1
## 14: 14 Brown University 9181 8.7
## 15: 15 Cornell University 21850 14.2
## 16: 15 Vanderbilt University 12686 13.1
## 17: 15 Washington University in St. Louis 14348 17.1
## 18: 18 Rice University 6621 15.1
## 19: 18 University of Notre Dame 12179 21.1
## 20: 20 University of California—Berkeley 37581 16.0
## 21: 21 Emory University 14769 26.8
## 22: 21 Georgetown University 17858 17.4
## 23: 23 Carnegie Mellon University 13285 24.6
## 24: 23 University of California—Los Angeles 43239 18.6
## 25: 23 University of Southern California 42469 18.0
## rank name enrollment acceptance_2014
## retention_freshman grad_rate_6_yr score tuition_in_state
## 1: 98 97 100 43450
## 2: 97 98 99 45278
## 3: 99 96 97 47600
## 4: 99 96 95 51008
## 5: 98 95 95 46320
## 6: 99 93 95 50193
## 7: 98 91 93 46704
## 8: 97 95 92 49341
## 9: 98 96 91 49536
## 10: 97 92 90 45390
## 11: 97 94 90 48710
## 12: 98 95 89 49506
## 13: 97 94 89 49047
## 14: 98 96 85 49346
## 15: 97 93 84 49116
## 16: 97 93 84 43838
## 17: 97 95 84 48093
## 18: 97 93 82 42253
## 19: 98 96 82 47929
## 20: 97 91 77 13432
## 21: 95 91 76 46314
## 22: 96 95 76 48611
## 23: 95 88 74 50410
## 24: 97 91 74 12753
## 25: 97 91 74 50210
## retention_freshman grad_rate_6_yr score tuition_in_state
## tuition_out_state
## 1: 43450
## 2: 45278
## 3: 47600
## 4: 51008
## 5: 46320
## 6: 50193
## 7: 46704
## 8: 49341
## 9: 49536
## 10: 45390
## 11: 48710
## 12: 49506
## 13: 49047
## 14: 49346
## 15: 49116
## 16: 43838
## 17: 48093
## 18: 42253
## 19: 47929
## 20: 38140
## 21: 46314
## 22: 48611
## 23: 50410
## 24: 35631
## 25: 50210
## tuition_out_state
So the space is gone, but what happened? It turns out that we can create the
character from basically binary code. It turns out that 160 is the code corresponding in UTF-8 encoding to the
. So intToUtf8(160)
outputs the
as a character understood by R and gsub
, after which point we can delete it like we would any other character.
Extending
All that work for very little payoff, right?
Well, sort of. We got 9*25 = 225 data points from this scraping operation… which is not many, admittedly.
However, we can now re-use this code to scrape other pages of data from the U.S. News Site, since the format is basically the same (crucially, we’ll be able to re-use the same xpath
to get the table on all the other pages). We can then stitch together the table we get from each page into a much bigger data set. The additional footwork required to go from scraping the first page to scraping all the pages is typically not much (unless there are major differences in format from one page to the next).
Let’s demonstrate:
#Only the first 8 pages contain
# universities that are
# actually ranked. We could have
# determined this programmatically
# (useful in case the paradigm later
# changes, for example),
# but I didn't see the point for today
URLS <- paste0(URL, "/page+", 1:8)
usnews <-
#rbindlist compiles a list of data.tables
# into a single data.table. This is particularly
# useful for situations where we create a bunch of
# tables in a loop, as is happening here
rbindlist(lapply(URLS, function(uu){
#for each page uu, repeat the above...
usn <- read_html(uu) %>% html_node(xpath = xp) %>%
html_table() %>% setDT()
usn[ , (ncol(usn) - 1):ncol(usn) := NULL]
setnames(usn, c("rank", "name", "tuition_fees",
"enrollment", "acceptance_2014",
"retention_freshman", "grad_rate_6_yr"))
usn[ , score :=
gsub(".*Overall Score: ([0-9]*) out of.*", "\\1", rank)]
usn[ , rank := gsub("#([0-9]*).*", "\\1", rank)]
usn[ , name := gsub("\n.*", "", name)]
usn[ , enrollment :=
as.integer(gsub(",", "", enrollment, fixed = TRUE))]
usn[ , acceptance_2014 :=
as.numeric(gsub("%", "", acceptance_2014, fixed = TRUE))]
#Well, slight adjustment here. There are some colleges where
# these fields have a footnote because it's calculated
# idiosyncratically. Everything follows the percent sign,
# so it's only a minor adjustment.
usn[ , retention_freshman :=
as.numeric(gsub("%.*", "", retention_freshman))]
usn[ , grad_rate_6_yr :=
as.numeric(gsub("%.*", "", grad_rate_6_yr))]
usn[grepl("in-state", tuition_fees),
paste0("tuition_", c("in", "out"), "_state") :=
transpose(lapply(strsplit(tuition_fees, split = "[:,] "),
function(io)
as.integer(gsub("[$,]", "", io[c(2, 4)]))))]
usn[is.na(tuition_in_state),
paste0("tuition_", c("in", "out"), "_state") :=
{x <- as.integer(gsub("[$,]", "", tuition_fees))
.(x, x)}]
usn[ , tuition_fees := NULL]
usn[ , name := gsub(intToUtf8(160), "", name)]}))
#just like that! All done. Just some clean-up:
usnews <- usnews[!grepl("Rank Not", score)]
usnews
## rank name enrollment
## 1: 1 Princeton University 8088
## 2: 2 Harvard University 19929
## 3: 3 Yale University 12336
## 4: 4 Columbia University 24221
## 5: 4 Stanford University 16795
## ---
## 199: 199 Ashland University 5737
## 200: 199 Indiana University-Purdue University—Indianapolis 30690
## 201: 199 Louisiana Tech University 11225
## 202: 199 New Mexico State University 15829
## 203: 199 University of Colorado—Denver 22791
## acceptance_2014 retention_freshman grad_rate_6_yr score
## 1: 7.4 98 97 100
## 2: 6.0 97 98 99
## 3: 6.3 99 96 97
## 4: 7.0 99 96 95
## 5: 5.1 98 95 95
## ---
## 199: 71.9 74 57 21
## 200: 70.5 72 44 21
## 201: 65.0 78 54 21
## 202: 69.6 73 46 21
## 203: 73.0 74 40 21
## tuition_in_state tuition_out_state
## 1: 43450 43450
## 2: 45278 45278
## 3: 47600 47600
## 4: 51008 51008
## 5: 46320 46320
## ---
## 199: 20242 20242
## 200: 9056 29774
## 201: 8853 25851
## 202: 6729 21234
## 203: 10404 29334