Scraping data from a website using rvest

There are lots of web scraping tools available online, but sometimes I prefer to skip them and write the code in R to keep everything in one place. In this blog post, I’ll use the rvest package to show how simple it is to scrape the web and gather a neat data set for analysis.

We’ll be scraping data from IMDb, specifically data about Black Mirror.

First I want to scrape data on the cast, so this is the link that I will be using:

http://www.imdb.com/title/tt2085059/fullcredits?ref_=tt_cl_sm#cast

Now I go into RStudio, load the necessary package, and read the HTML of this page.

library(rvest)
bm <- read_html("https://www.imdb.com/title/tt2085059/fullcredits?ref_=tt_cl_sm#cast")

Next, I extract data from the HTML we just read. To find the right CSS selector I use SelectorGadget, a great Chrome extension for picking out CSS selectors by clicking on page elements. Using this tool, I find that the CSS selector is .itemprop. If you want more information on how it works, go to their website. The following then gives me the entire cast list.

cast <- bm %>% html_nodes(".itemprop") %>% html_text()
cast

1 Daniel Lapaine 
2 Daniel Lapaine
3 Hannah John-Kamen 
4 Hannah John-Kamen
5 Michaela Coel 
6 Michaela Coel
7 Beatrice Robertson-Jones 
8 Beatrice Robertson-Jones
9 Daniel Kaluuya 
10 Daniel Kaluuya
11 Toby Kebbell 
12 Toby Kebbell
13 Rory Kinnear 
14 Rory Kinnear
15 Hayley Atwell 
16 Hayley Atwell
17 Lenora Crichlow 
18 Lenora Crichlow
19 Daniel Rigby 
20 Daniel Rigby
.
.
.
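
Each name appears twice in the output above. One plausible reason, and this is an assumption rather than something checked against IMDb's markup, is that the class sits on both a table cell and a span nested inside it, so the selector matches both. Here is a minimal sketch on a made-up HTML snippet (the actor names and structure are hypothetical) showing how that produces duplicates and how a narrower selector avoids them:

```r
library(rvest)

# Hypothetical markup, not copied from IMDb: the class is set on both the
# table cell and the span nested inside it.
page <- read_html('
<table>
  <tr><td class="itemprop">
    <span class="itemprop">Actor One</span>
  </td></tr>
  <tr><td class="itemprop">
    <span class="itemprop">Actor Two</span>
  </td></tr>
</table>')

# Every element carrying the class is matched, so each name comes back twice:
# once padded with whitespace (from the cell) and once clean (from the span).
all_matches <- page %>% html_nodes(".itemprop") %>% html_text()
length(all_matches)

# Adding the tag name to the selector matches only the inner spans.
names_only <- page %>% html_nodes("span.itemprop") %>% html_text()
names_only
```

If the real page is built the same way, narrowing the selector would remove the need to deduplicate afterwards, but cleaning the output in R works either way.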

A simple gsub() and trimws() would format the output, and unique() removes the duplicates.

as.data.frame(trimws(gsub("\n", "", cast))) %>% unique()
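
To see the clean-up step in isolation, here is a small sketch on a made-up character vector (the values in raw_names are hypothetical stand-ins for scraped text). It uses trimws(), which is one option for this job because it only strips whitespace at the ends of each string and so keeps the space between first and last names:

```r
# Hypothetical scraped values: padded copies alongside clean copies
raw_names <- c("\n Actor One\n", "Actor One", "\n Actor Two\n", "Actor Two")

# trimws() removes leading and trailing spaces, tabs and newlines;
# unique() then drops the repeated names.
clean_names <- unique(trimws(raw_names))

# Store the result in a data frame with a readable column name
cast_df <- data.frame(name = clean_names, stringsAsFactors = FALSE)
cast_df
```

Naming the column explicitly makes the data frame easier to work with later than the auto-generated name as.data.frame() would give it.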

We can also get other information from the website using the selector gadget and easily scrape using rvest.
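
As a rough illustration of pulling more than one field, here is a sketch that collects actor names, character names and profile links into a single data frame. The HTML, the class names (.actor, .character) and the link paths are all made up for the example; on a real page you would swap in whatever selectors SelectorGadget gives you.

```r
library(rvest)

# Hypothetical markup standing in for a cast table; not copied from IMDb
page <- read_html('
<table class="cast">
  <tr>
    <td class="actor"><a href="/name/example-1/">Actor One</a></td>
    <td class="character">Character A</td>
  </tr>
  <tr>
    <td class="actor"><a href="/name/example-2/">Actor Two</a></td>
    <td class="character">Character B</td>
  </tr>
</table>')

# html_text() returns the visible text, html_attr() returns an attribute value
actors     <- page %>% html_nodes(".actor a") %>% html_text()
links      <- page %>% html_nodes(".actor a") %>% html_attr("href")
characters <- page %>% html_nodes(".character") %>% html_text()

cast_table <- data.frame(actor = actors, character = characters,
                         link = links, stringsAsFactors = FALSE)
cast_table
```

This only lines up correctly when every row has exactly one match for each selector, so it is worth checking that the vectors have the same length before combining them.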
