There are lots of web scraping tools available online, but sometimes I’d rather skip them and write the code in R to keep everything in one place. In this blog post, I’ll be using the rvest package to show how simple it is to scrape the web and gather a neat data set for analysis.
We’ll be scraping data from IMDb, specifically data about Black Mirror.
First I want to scrape data on the cast, so this is the link that I will be using: https://www.imdb.com/title/tt2085059/fullcredits?ref_=tt_cl_sm#cast
Now I go into RStudio, load the necessary package, and read this HTML.
library(rvest)
bm <- read_html("https://www.imdb.com/title/tt2085059/fullcredits?ref_=tt_cl_sm#cast")
Next, I will be extracting data from the HTML we just read. For this I use SelectorGadget, a great Chrome extension for finding CSS selectors. Using this tool, I find that the CSS selector is .itemprop. If you want more information on how this tool works, go to their website. Using the following, I get the entire list of cast members.
bm %>% html_nodes(".itemprop") %>% html_text()
1 Daniel Lapaine
2 Daniel Lapaine
3 Hannah John-Kamen
4 Hannah John-Kamen
5 Michaela Coel
6 Michaela Coel
7 Beatrice Robertson-Jones
8 Beatrice Robertson-Jones
9 Daniel Kaluuya
10 Daniel Kaluuya
11 Toby Kebbell
12 Toby Kebbell
13 Rory Kinnear
14 Rory Kinnear
15 Hayley Atwell
16 Hayley Atwell
17 Lenora Crichlow
18 Lenora Crichlow
19 Daniel Rigby
20 Daniel Rigby
. . .
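A simple gsub() strips the leading and trailing whitespace (including newlines) from the output, and unique() removes the duplicates. Here the scraped text is first stored in cast.
cast <- bm %>% html_nodes(".itemprop") %>% html_text()
as.data.frame(gsub("^\\s+|\\s+$", "", cast)) %>% unique()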
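To try the cleaning step on its own without hitting IMDb, here is a minimal sketch. The HTML snippet is made up for illustration (it only mimics each name appearing twice with stray whitespace and is not IMDb’s real markup), and it swaps in trimws() as one possible alternative to gsub().

```r
library(rvest)

# Toy HTML standing in for the cast table (hypothetical, not IMDb's real markup).
# Each name is matched twice: once via the <td> and once via the nested <span>.
toy <- read_html('<table>
  <tr><td class="itemprop">
    <span class="itemprop">Daniel Kaluuya</span>
  </td></tr>
  <tr><td class="itemprop">
    <span class="itemprop">Hayley Atwell</span>
  </td></tr>
</table>')

# Same pipeline as in the post: select by CSS class, then pull out the text
raw_cast <- toy %>% html_nodes(".itemprop") %>% html_text()

# Trim surrounding whitespace/newlines, then drop the duplicated names
cast_clean <- unique(trimws(raw_cast))

# Store the result in a data frame with a readable column name
cast_df <- data.frame(actor = cast_clean, stringsAsFactors = FALSE)
print(cast_df)
```

Giving the column an explicit name (actor here, my own choice) makes the data frame easier to work with later on.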
We can also get other information from the website using SelectorGadget and scrape it just as easily with rvest.
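As a rough sketch of what pulling a second field might look like, the example below scrapes an actor and a character name from each table row and keeps them aligned. The HTML is a toy snippet and the .character selector is an assumption for illustration; on the real page you would use whatever selector SelectorGadget reports.

```r
library(rvest)

# Toy HTML (hypothetical markup); ".character" is an assumed selector
toy <- read_html('<table>
  <tr><td class="itemprop"> Daniel Kaluuya </td><td class="character"> Bing </td></tr>
  <tr><td class="itemprop"> Hayley Atwell </td><td class="character"> Martha </td></tr>
</table>')

# Select one node per table row so the two columns stay lined up
rows <- toy %>% html_nodes("tr")

# html_node() (singular) returns exactly one match per row
credits <- data.frame(
  actor     = rows %>% html_node(".itemprop") %>% html_text() %>% trimws(),
  character = rows %>% html_node(".character") %>% html_text() %>% trimws(),
  stringsAsFactors = FALSE
)
print(credits)
```

Working row by row, rather than selecting each column across the whole page, avoids mismatched lengths when a row happens to be missing one of the fields.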