class: inverse, center, middle

<style type="text/css">
.hljs-github .hljs { background: #e5e5e5; }
.inline-c, remark-inline-code { background: #e5e5e5; border-radius: 3px; padding: 4px; font-family: 'Source Code Pro', 'Lucida Console', Monaco, monospace; }
.yellow-h{ background: #ffff88; }
.out-t, remark-inline-code { background: #9fff9f; border-radius: 3px; padding: 4px; }
.pull-left-c { float: left; width: 58%; }
.pull-right-c { float: right; width: 38%; }
.medium { font-size: 75% }
.small { font-size: 50% }
.action { background-color: #f2eecb; }
.remark-code { display: block; overflow-x: auto; padding: .5em; color: #333; background: #9fff9f; }
</style>

# Automated Web Scraping with R

<br>

### Resul Umit

### June 2022

.footnote[
[Skip intro — To the contents slide](#contents-slide).

<a href="mailto:resuluy@uio.no?subject=Workshop on web scraping">I can teach this workshop at your institution — Email me</a>.
]

---

## Who am I?

Resul Umit

- post-doctoral researcher in political science at the University of Oslo
  - teaching and studying representation, elections, and parliaments
  - [a recent publication](https://doi.org/10.1017/psrm.2021.30): the effects of casualties in terror attacks on elections

--

<br>

- teaching workshops, also on
  - [writing reproducible research papers](https://resulumit.com/blog/rmd-workshop/)
  - [version control and collaboration](https://resulumit.com/teaching/git_workshop.html)
  - [working with Twitter data](https://resulumit.com/teaching/twtr_workshop.html)
  - [creating academic websites](https://resulumit.com/teaching/rbd_workshop.html)

--

<br>

- more information available at [resulumit.com](https://resulumit.com/)

---

## The Workshop — Overview

- One and a half days, on how to automate the process of extracting data from websites
  - 180+ slides, 30+ exercises
  - a [demonstration website](https://luzpar.netlify.app/) for practice

--

<br>

- Designed for researchers with basic knowledge of the R programming language
  - does not cover programming with R
    - e.g., we will use existing functions and packages
<br>
  - ability to work with R will be very helpful
    - but not absolutely necessary — this ability can be developed during and after the workshop as well

---

## The Workshop — Motivation

- Data available on websites provide attractive opportunities for academic research
  - e.g., parliamentary websites were the main source of data for my PhD

--

<br>

- Acquiring such data requires
  - either a lot of resources, such as time
  - or a set of skills, such as automated web scraping

--

<br>

- Typically, such skills are not part of academic training
  - for my PhD, I visited close to 3000 webpages to collect data manually
    - on members of ten parliaments
    - multiple times, to update the dataset as needed

---

## The Workshop — Motivation — Aims

- To provide you with an understanding of what is .yellow-h[ethically] possible
  - we will cover a large breadth of issues, not all of which is for long-term memory
    - hence the slides are designed for self-study as well
<br>
  - awareness of what is ethical and possible, Google, and perseverance are all you need

--

<br>

- To get you started with acquiring and practicing the skills needed
  - practice with the demonstration website
    - plenty of data, stable structure, and an ethical playground
<br>
  - start working on a real project

---

name: contents-slide

## The Workshop — Contents

<br>

.pull-left[
[Part 1. Getting the Tools Ready](#part1)
- e.g., installing software

[Part 2. Preliminary Considerations](#part2)
- e.g., ethics of web scraping

[Part 3.
HTML Basics](#part3) - e.g., elements and attributes ] .pull-right[ [Part 4. CSS Selectors](#part4) - e.g., selecting an element [Part 5. Scraping Static Pages](#part5) - e.g., getting text from an element [Part 6. Scraping Dynamic Pages](#part6) - e.g., clicking to create an element ] .footnote[ [To the list of references](#reference-slide). ] --- ## The Workshop — Organisation - I will go through a number of slides... - introducing things - demonstrating how-to do things <br> - ... and then pause, for you to use/do those things - e.g., prepare your computer for the workshop, and/or - complete a number of exercises <br> - We are here to help - ask me, other participants - consult Google, [slides](https://resulumit.com/teaching/scrp_workshop.html), [answer script](https://luzpar.netlify.app/exercises/solutions.R) - type, rather than copy and paste, the code you will find on the slides or the script --- class: action ## The Workshop — Organisation — Slides Slides with this background colour indicate that your action is required, for - setting the workshop up - e.g., installing R - completing the exercises - e.g., checking website protocols - these slides have countdown timers - as a guide, not to be followed strictly
03
:
00
--- ## The Workshop — Organisation — Slides - Code and text that go in R console or scripts .inline-c[appear as such — in a different font, on gray background] - long codes and texts will have their own line(s) ```r bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td+ td a") %>% html_attr("href") %>% url_absolute(base = "https://luzpar.netlify.app/") ``` --- ## The Workshop — Organisation — Slides - Code and text that go in R console or scripts .inline-c[appear as such — in a different font, on gray background] - long codes and texts will have their own line(s) <br> - Results that come out as output .out-t[appear as such — in the same font, on green background] - except for some results, such as a browser popping up -- <br> - Specific sections are .yellow-h[highlighted yellow as such] for emphasis - these could be for anything — codes and texts in input, results in output, and/or texts on slides -- <br> - The slides are designed for self-study as much as for the workshop - *accessible*, in substance and form, to go through on your own --- name: part1 class: inverse, center, middle # Part 1. Getting the Tools Ready .footnote[ [Back to the contents slide](#contents-slide). ] --- class: action ## Workshop Slides — Access on Your Browser - Having the workshop slides<sup>*</sup> on your own machine might be helpful - flexibility to go back and forward on your own - ability to scroll across long codes on some slides <br> - Access at <https://resulumit.com/teaching/scrp_workshop.html> - will remain accessible after the workshop - might crash for some Safari users - if using a different browser application is not an option, view the [PDF version of the slides](https://github.com/resulumit/scrp_workshop/blob/master/presentation/scrp_workshop.pdf) on GitHub .footnote[ <sup>*</sup> These slides are produced in R, with the `xaringan` package <a name=cite-R-xaringan></a>([Xie, 2022](https://github.com/yihui/xaringan)). ] --- class: action ## Demonstration Website — Explore on Your Browser - There is a demonstration website for this workshop - available at <https://luzpar.netlify.app/> - includes fabricated data on the imaginary Parliament of Luzland - provides us with plenty of data, stable structure, and an ethical playground - Using this demonstration website for practice is recommended - tailored to exercises, no ethical concern - but not compulsory — use a different one if you prefer so - Explore the website now - click on the links to see an individual page for - states, constituencies, members, and documents <br> - notice that the documents section is different than the rest - it is a page with dynamic frame
05
:
00
---

class: action

## R — Download from the Internet and Install

- Programming language of this workshop
  - created for data analysis, extended for other purposes
    - e.g., accessing websites
<br>
  - allows for all three steps in one environment
    - accessing websites, scraping data, and processing data

<br>

- Download R from [https://cloud.r-project.org](https://cloud.r-project.org)
  - optional, if you have it already installed — but then consider updating<sup>*</sup>
    - the `R.version.string` command checks the version of your copy
    - compare with the latest official release at [https://cran.r-project.org/sources.html](https://cran.r-project.org/sources.html)

.footnote[
<sup>*</sup> The same applies to all software that follows — consider updating if you have them already installed. This ensures everyone works with the latest, exactly the same, tools.
]

---

class: action

## RStudio — Download from the Internet and Install

- Optional, but highly recommended
  - facilitates working with R

<br>

- A popular integrated development environment (IDE) for R
  - an alternative: [GNU Emacs](https://www.gnu.org/software/emacs/)

<br>

- Download RStudio from [https://rstudio.com/products/rstudio/download](https://rstudio.com/products/rstudio/download)
  - choose the free version
  - to check for any updates, follow from the RStudio menu:

> `Help -> Check for Updates`

---

class: action
name: rstudio-project

## RStudio Project — Create from within RStudio

- RStudio allows for dividing your work with R into separate projects
  - each project gets a dedicated workspace, history, and source documents
  - [this page](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) has more information on why projects are recommended

<br>

- Create a new RStudio project for this workshop, following from the RStudio menu:

> `File -> New Project -> New Directory -> New Project`

<br>

- Choose a location for the project with `Browse...`
  - avoid choosing a synced location, e.g., `Dropbox`
    - likely to cause warning and/or error messages
    - if you must, pause syncing, or add a sync exclusion

---

class: action

## R Packages — Install from within RStudio<sup>*</sup>

Install the packages that we need

```r
install.packages(c("rvest", "RSelenium", "robotstxt", "polite", "dplyr"))
```

.footnote[
<sup>*</sup> You may already have a copy of one or more of these packages. In that case, I recommend updating by re-installing them now.
]
02
:
00
--- class: action ## R Packages — Install from within RStudio Install the packages that we need ```r install.packages(c("rvest", "RSelenium", "robotstxt", "polite", "dplyr")) ``` <br> We will use - `rvest` <a name=cite-R-rvest></a>([Wickham, 2021](https://CRAN.R-project.org/package=rvest)), for scraping websites -- - `RSelenium` <a name=cite-R-RSelenium></a>([Harrison, 2020](http://docs.ropensci.org/RSelenium)), for browsing the web programmatically -- - `robotstxt` <a name=cite-R-robotstxt></a>([Meissner and Ren, 2020](https://CRAN.R-project.org/package=robotstxt)), for checking permissions to scrape websites -- - `polite` <a name=cite-R-polite></a>([Perepolkin, 2019](https://github.com/dmi3kno/polite)), for compliance with permissions to scrape websites -- - `dplyr` <a name=cite-R-dplyr></a>([Wickham, François, Henry, and Müller, 2022](https://CRAN.R-project.org/package=dplyr)), for data manipulation --- class: action ## R Script — Start Your Script .pull-left[ - Check that you are in your recently created project - indicated at the upper-right corner of RStudio window - Create a new R Script, following from the RStudio menu > `File -> New File -> R Script` - Name and save your file - e.g., `scrape_web.R` - Load `rvest` and other packages ] .pull-right[ ```r library(rvest) library(RSelenium) library(robotstxt) library(polite) library(dplyr) ``` ] --- class: action ## Java — Download from the Internet and Install - A language and software that `RSelenium` needs - for automation scripts <br> - Download Java from <https://www.java.com/en/download/> - requires restarting any browser that you might have open --- class: action ## Chrome — Download from the Internet and Install - A browser that facilitates web scraping - favoured by `RSelenium` and most programmers <br> - Download Chrome from <https://www.google.com/chrome/> --- class: action ## SelectorGadget — Add Extension to Browser - An extension for Chrome - facilitates selecting what to scrape from a webpage - optional, but highly recommended - [open source software](https://github.com/cantino/selectorgadget) <br> - Add the extension to your browser - search for it at <https://chrome.google.com/webstore/category/extensions> - if you cannot use Chrome, <a href="javascript:(function(){var%20s=document.createElement('div');s.innerHTML='Loading...';s.style.color='black';s.style.padding='20px';s.style.position='fixed';s.style.zIndex='9999';s.style.fontSize='3.0em';s.style.border='2px%20solid%20black';s.style.right='40px';s.style.top='40px';s.setAttribute('class','selector_gadget_loading');s.style.background='white';document.body.appendChild(s);s=document.createElement('script');s.setAttribute('type','text/javascript');s.setAttribute('src','https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js');document.body.appendChild(s);})();">drag and drop this link</a> to your bookmarks bar <br> - [ScrapeMate](https://github.com/hermit-crab/ScrapeMate) is an alternative extension - for both Chrome and Firefox - on Firefox, search at <https://addons.mozilla.org/> --- class: action ## Solutions — Note Where They Are - Solutions to exercises, or links to them, are available online - can be downloaded at <https://luzpar.netlify.app/exercises/solutions.R> <br> - I recommend the solutions to be consulted as a last resort - after a genuine effort to complete the exercises yourself first --- ## Other Resources<sup>*</sup> - `RSelenium` vignettes - available at <https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html> - R for Data 
Science <a name=cite-rfordatascience></a>([Wickham and Grolemund, 2021](#bib-rfordatascience))
  - open access at <https://r4ds.had.co.nz>

- Text Mining with R: A Tidy Approach <a name=cite-textminingwithr></a>([Silge and Robinson, 2017](#bib-textminingwithr))
  - open access at [tidytextmining.com](https://www.tidytextmining.com/)
  - comes with [a course website](https://juliasilge.shinyapps.io/learntidytext/) where you can practice

.footnote[
<sup>*</sup> I recommend these to be consulted not during but after the workshop.
]

---

name: part2
class: inverse, center, middle

# Part 2. Preliminary Considerations

.footnote[
[Back to the contents slide](#contents-slide).
]

---

## Considerations — the Law

- Web scraping might be illegal
<br>
  - depending on who is scraping what, why, how — and under which jurisdiction
  - reflect, and check, before you scrape

--

<br>

- Web scraping might be more likely to be illegal if, for example,
<br>
  - it is harmful to the source commercially and/or physically
    - e.g., scraping a commercial website to create a rival website
    - e.g., scraping a website so hard and fast that it collapses
<br>
  - it gathers data that is
    - under copyright
    - not meant for the public to see
    - then used for financial gain

---

## Considerations — the Ethics

- Web scraping might be unethical
  - depending on who is scraping what, why, and how
  - reflect before you scrape

--

<br>

- Web scraping might be more likely to be unethical if, for example,
<br>
  - it is — or is edging towards being — illegal
  - it does not respect the restrictions
    - as defined in `robots.txt` files
<br>
  - it harvests data
    - that is otherwise available to download, e.g., through APIs
    - without purpose, at dangerous speed, repeatedly

---

## Considerations — the Ethics — `robots.txt`

- Most websites declare a robots exclusion protocol
  - making their rules known with respect to programmatic access
    - who is (not) allowed to scrape what, and sometimes, at what speed
<br>
  - within `robots.txt` files
    - available at, e.g., www.websiteurl.com<span style="background-color: #ffff88;">/robots.txt</span>

<br>

- The rules in `robots.txt` cannot be enforced upon scrapers
  - but should be respected for ethical reasons

<br>

- The language in `robots.txt` files is specific but intuitive
  - easy to read and understand
  - the `robotstxt` package makes these even easier

---

## Considerations — the Ethics — `robots.txt` — Syntax

.pull-left[
- It has pre-defined keys, most importantly
  - `User-agent` indicates who the protocol is for
  - `Allow` indicates which part(s) of the website can be scraped
  - `Disallow` indicates which part(s) must not be scraped
  - `Crawl-delay` indicates how fast the website could be scraped

<br>

- Note that
  - the keys start with capital letters
  - they are followed by a colon .yellow-h[:]
]

.pull-right[
```md
`User-agent:`
`Allow:`
`Disallow:`
`Crawl-delay:`
```
]

---

## Considerations — the Ethics — `robots.txt` — Syntax

.pull-left[
- Websites define their own values
  - after a colon and a white space

<br>

- Note that
  - `*` indicates the protocol is for everyone
  - `/` indicates all sections and pages
  - `/about/` indicates a specific path
  - values for `Crawl-delay` are in seconds

<br>

- this website allows anyone to scrape, provided that
  - `/about/` is left out, and
  - the website is accessed at 5-second intervals
]

.pull-right[
```md
User-agent: `*`
Allow: `/`
Disallow: `/about/`
Crawl-delay: `5`
```
]

---

## Considerations — the Ethics — `robots.txt` — Examples

.pull-left[
The protocol of this website only applies to Google

- Google is
allowed to scrape everything - there is no defined rule for anyone else ] .pull-right[ ```md User-agent: `googlebot` Allow: / ``` ] --- ## Considerations — the Ethics — `robots.txt` — Examples .pull-left[ The protocol of this website only applies to Google - Google is .yellow-h[disallowed] to scrape .yellow-h[two] specific paths - with no limit on speed <br> - there is no defined rule for anyone else ] .pull-right[ ```md User-agent: googlebot `Disallow: /about/` `Disallow: /history/` ``` ] --- ## Considerations — the Ethics — `robots.txt` — Examples .pull-left[ This website has different protocols for different agents - .yellow-h[Google] is allowed to scrape everything, with a 5-second delay - .yellow-h[Bing] is not allowed to scrape anything - .yellow-h[everyone else] can scrape the section or page located at www.websiteurl/about/ ] .pull-right[ ```md User-agent: `googlebot` Allow: / Crawl-delay: 5 User-agent: `bing` Disallow: / User-agent: `*` Allow: /about/ ``` ] --- ## Considerations — the Ethics — `robots.txt` — Notes There are also some other, lesser known, directives ```md User-agent: * Allow: / Disallow: /about/ Crawl-delay: 5 `Visit-time: 01:45-08:30` ``` -- <br> Files might include optional comments, written after the number sign .yellow-h[#] ```md `# thank you for respecting our protocol` User-agent: * Allow: / Disallow: /about/ Visit-time: 01:45-08:30 `# please visit when it is night time in the UK (GMT)` Crawl-delay: 5 `# please delay for five seconds, to ensure our servers are not overloaded` ``` --- ## Considerations — the Ethics — `robotstxt` - The `robotstxt` packages facilitates checking website protocols - from within R — no need to visit websites via browser - provides functions to check, among others, the rules for specific paths and/or agents <br> - There are two main functions - `robotstxt`, which gets complete protocols - `paths_allowed`, which checks protocols for one or more specific paths --- ## Considerations — the Ethics — `robotstxt` .pull-left[ Use the `robotstxt` function to get a protocol - supply a base URL with the `domain` argument - as a string - probably the only argument that you will need ] .pull-right[ ```md robotstxt( domain = NULL, ... 
) ``` ] --- ## Considerations — the Ethics — `robotstxt` ```r robotstxt(domain = "https://luzpar.netlify.app") ``` ``` ## $domain ## [1] "https://luzpar.netlify.app" ## ## $text ## [robots.txt] ## -------------------------------------- ## ## User-agent: googlebot ## Disallow: /states/ ## ## User-agent: * ## Disallow: /exercises/ ## ## User-agent: * ## Allow: / ## Crawl-delay: 2 ## ## ## ## ## ## $robexclobj ## <Robots Exclusion Protocol Object> ## $bots ## [1] "googlebot" "*" ## ## $comments ## [1] line comment ## <0 rows> (or 0-length row.names) ## ## $permissions ## field useragent value ## 1 Disallow googlebot /states/ ## 2 Disallow * /exercises/ ## 3 Allow * / ## ## $crawl_delay ## field useragent value ## 1 Crawl-delay * 2 ## ## $host ## [1] field useragent value ## <0 rows> (or 0-length row.names) ## ## $sitemap ## [1] field useragent value ## <0 rows> (or 0-length row.names) ## ## $other ## [1] field useragent value ## <0 rows> (or 0-length row.names) ## ## $check ## function (paths = "/", bot = "*") ## { ## spiderbar::can_fetch(obj = self$robexclobj, path = paths, ## user_agent = bot) ## } ## <bytecode: 0x00000257bc12e528> ## <environment: 0x00000257bc129350> ## ## attr(,"class") ## [1] "robotstxt" ``` --- ## Considerations — the Ethics — `robotstxt` Check the list of permissions for the most relevant part in the output ```r robotstxt(domain = "https://luzpar.netlify.app")`$permissions` ``` ``` ## field useragent value ## 1 Disallow googlebot /states/ ## 2 Disallow * /exercises/ ## 3 Allow * / ``` --- ## Considerations — the Ethics — `robotstxt` .pull-left[ Use the `paths_allowed` function to check protocols for one or more specific paths - supply a base URL with the `domain` argument - `path` and `bot` are the other important arguments - notice the default values <br> - leads to either `TRUE` (allowed to scrape) or `FALSE` (not allowed) ] .pull-right[ ```md paths_allowed( domain = "auto", paths = "/", bot = "*", ... ) ``` ] --- ## Considerations — the Ethics — `robotstxt` ```r paths_allowed(domain = "https://luzpar.netlify.app") ``` ``` ## [1] TRUE ``` ```md paths_allowed(domain = "https://luzpar.netlify.app", `paths = c("/states/", "/constituencies/")`) ``` ``` ## [1] TRUE TRUE ``` ```md paths_allowed(domain = "https://luzpar.netlify.app", paths = c("/states/", "/constituencies/"), `bot = "googlebot"`) ``` ``` ## [1] FALSE TRUE ``` --- class: action ## Exercises 1) Check the protocols for <https://www.theguardian.com> - via (a) your browser and (b) with the `robotstxt` function in R - compare what you see <br> 2) Check a path with the `paths_allowed` function - such that it will return `FALSE` - taking the information from Exercise 1 into account - hint: try looking at the list of permissions first <br> 3) Check the protocols for any website that you might wish to scrape - with the `robotstxt` function - reflect on the ethics of scraping that website
10
:
00
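
---

## Considerations — the Ethics — `robotstxt` — Crawl Delay

The other parts of the protocol can be pulled out in the same way. As a minimal sketch, the crawl-delay directive that appears in the full output above can be extracted on its own

```r
robotstxt(domain = "https://luzpar.netlify.app")$crawl_delay
```

```
##         field useragent value
## 1 Crawl-delay         *     2
```

Knowing this value is useful for the speed considerations on the next slide.
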
--- ## Considerations — the Ethics — Speed - Websites are designed for visitors with human-speed in mind - computer-speed visits can overload servers, depending on bandwidth - popular websites might have more bandwidth - but, they might attract multiple scrapers at the same time <br> - Waiting a little between two visits makes scraping more ethical - waiting time may or may not be defined in the protocol - lookout for, and respect, the `Crawl-delay` key in `robots.txt` <br> - [Part 5](#part5) and [Part 6](#part6) covers how to wait <br> - Not waiting enough might lead to a ban - by site owners, administrators - for IP addresses with undesirably high number of visits in a short period of time --- ## Considerations — the Ethics — Purpose Ideally, we scrape for a purpose - e.g., for academics, to answer one or more research questions, test hypotheses <br> - developed prior to data collection, analysis - based on, e.g., theory, claims, observations <br> - perhaps, even pre-registered - e.g., at [OSF Registries](https://osf.io/registries) --- ## Considerations — Data Storage Scraped data frequently requires - large amounts of digital storage space - internet data is typically big data <br> - private, safe storage spaces - due to local rules, institutional requirements --- name: part3 class: inverse, center, middle # Part 3. HTML Basics .footnote[ [Back to the contents slide](#contents-slide). ] --- ## Source Code — Overview - Webpages include more than what is immediately visible to visitors - not only text, images, links - but also code for structure, style, and functionality — interpreted by browsers first - <span style="background-color: #ffff88;">HTML</span> provides the structure - <span style="background-color: #ffff88;">CSS</span> provides the style - <span style="background-color: #ffff88;">JavaScript</span> provides functionality, if any <br> - Web scraping requires working with the source code - even when scraping only what is already visible - to choose one or more desired parts of the visible - e.g., text in table and/or bold only <br> - Source code also offers more, invisible, data to be scraped - e.g., URLs hidden under text --- ## Source Code — Plain Text The `Ctrl` `+` `U` shortcut displays source code — alternatively, right click and `View` `Page` `Source` <img src="scrp_workshop_files/images_data/homepage.png" width="45%" /><img src="scrp_workshop_files/images_data/homepage_source.png" width="45%" /> --- ## Source Code — DOM Browsers also offer putting source codes in a structure, known as DOM (document object model) - initiated by the `F12` key on Chrome — alternatively, right click and `Inspect` <img src="scrp_workshop_files/images_data/homepage.png" width="45%" /><img src="scrp_workshop_files/images_data/homepage_dom.png" width="45%" /> --- class: action ## Exercises 4) View the source code of a page - as plain code and as in DOM - compare the look of the two <br> 5) Search for a word or a phrase in source code - copy from the front-end page - search in plain text code or in DOM - using the `Ctrl` `+` `F` shortcut <br> - compare the look of the front- and back-end
05
:
00
--- ## HTML — Overview .pull-left[ - HTML stands for .yellow-h[hypertext markup language] - it gives the structure to what is visible to visitors - text, images, links <br> - would a piece of text appear in a paragraph or a list? - depends on the HTML code around that text ] .pull-right[ ```md <!DOCTYPE html> <html> <head> <style> h1 {color: blue;} </style> <title>A title for browsers</title> </head> <body> <h1>A header</h1> <p>This is a paragraph.</p> <ul> <li>This</li> <li>is a</li> <li>list</li> </ul> </body> </html> ``` ] --- ## HTML — Overview .pull-left[ HTML documents - start with a .yellow-h[declaration] - so that browsers know what they are ] .pull-right[ ```md `<!DOCTYPE html>` <html> <head> <style> h1 {color: blue;} </style> <title>A title for browsers</title> </head> <body> <h1>A header</h1> <p>This is a paragraph.</p> <ul> <li>This</li> <li>is a</li> <li>list</li> </ul> </body> </html> ``` ] --- ## HTML — Overview .pull-left[ HTML documents - start with a declaration - so that browsers know what they are <br> - consist of .yellow-h[elements] - written in between opening and closing tags ] .pull-right[ ```md <!DOCTYPE html> <html> <head> `<style>` `h1 {color: blue;}` `</style>` <title>A title for browsers</title> </head> <body> `<h1>A header</h1>` <p>This is a paragraph.</p> <ul> `<li>This</li>` <li>is a</li> <li>list</li> </ul> </body> </html> ``` ] --- ## HTML — the Root .pull-left[ <span style="background-color: #ffff88;">`html`</span> holds together the root element - it is also the parent to all other elements - its important children are the `head` and `body` elements ] .pull-right[ ```md <!DOCTYPE html> `<html>` <head> <style> h1 {color: blue;} </style> <title>A title for browsers</title> </head> <body> <h1>A header</h1> <p>This is a paragraph.</p> <ul> <li>This</li> <li>is a</li> <li>list</li> </ul> </body> `</html>` ``` ] --- ## HTML — the Head .pull-left[ .yellow-h[head] contains metadata, such as - titles, which appear in browser bars and tabs - style elements ] .pull-right[ ```md <!DOCTYPE html> <html> `<head>` <style> h1 {color: blue;} </style> <title>A title for browsers</title> `</head>` <body> <h1>A header</h1> <p>This is a paragraph.</p> <ul> <li>This</li> <li>is a</li> <li>list</li> </ul> </body> </html> ``` ] --- ## HTML — the Body .pull-left[ .yellow-h[body] contains the elements in the main body of pages, such as - headers, paragraphs, lists, tables, images ] .pull-right[ ```md <!DOCTYPE html> <html> <head> <style> h1 {color: blue;} </style> <title>A title for browsers</title> </head> `<body>` <h1>A header</h1> <p>This is a paragraph.</p> <ul> <li>This</li> <li>is a</li> <li>list</li> </ul> `</body>` </html> ``` ] --- ## HTML — Syntax — Tags Most elements have opening and closing .yellow-h[tags] ```md `<p>`This is a one sentence paragraph.`</p>` ``` .out-t[ This is a one sentence paragraph. ] <br> Note that - tag name, in this case .yellow-h[p], defines the structure of the element - the closing tag has a forward slash .yellow-h[/] before the element name --- ## HTML — Syntax — Content Most elements have some .yellow-h[content] ```md <p>`This is a one sentence paragraph.`</p> ``` .out-t[ This is a one sentence paragraph. 
]

---

## HTML — Syntax — Attributes

Elements can have .yellow-h[attributes]

```md
<p>This is a <strong `id="sentence-count"`>one</strong> sentence paragraph.</p>
```

.out-t[
<p>This is a <strong id="sentence-count">one</strong> sentence paragraph.</p>
]

<br>

Note that

- attributes are added to the opening tags
  - separated from anything else in the tag with a white space
<br>
- the attribute value .yellow-h[sentence-count] could have been anything I could come up with
  - unlike the tag and attribute names (e.g., `strong`, `id`), which are pre-defined
<br>
- the `id` attribute has no visible effects
  - some other attributes, such as `style`, can have visible effects

---

## HTML — Syntax — Attributes

There could be more than one attribute in a single element

```md
<p>This is a <strong `class="count"` `id="sentence-count"`>one</strong> sentence paragraph.</p>

<p>There are now <strong `class="count"` `id="paragraph-count"`>two</strong> paragraphs.</p>
```

.out-t[
<p>This is a <strong class="count" id="sentence-count">one</strong> sentence paragraph.</p>
<p>There are now <strong class="count" id="paragraph-count">two</strong> paragraphs.</p>
]

<br>

Note that

- the same `class` attribute (i.e., `count`) can apply to multiple elements
  - while the `id` attribute must be unique on a given page

---

## HTML — Syntax — Notes

Elements can be nested

```md
<p>This is a <strong>one</strong> sentence paragraph.</p>
```

.out-t[
<p>This is a <strong>one</strong> sentence paragraph.</p>
]

<br>

Note that

- there are two elements above, a paragraph and a strong emphasis
  - strong is said to be the child of the paragraph element
    - there could be more than one child
    - in that case, children are numbered from the left
<br>
  - paragraph is said to be the parent of the strong element

---

## HTML — Syntax — Notes

By default, multiple spaces and/or line breaks are ignored by browsers

```r
<ul><li>books</li><li>journal articles</li><li>reports </li> </ul>
```

.out-t[
<ul><li>books</li><li>journal articles</li><li>reports </li> </ul>
]

<br>

Note that

- plain source code may or may not be written in a readable manner
  - this is one reason why the DOM is helpful

---

## HTML — Other Important Elements — Links

Links are provided with the `a` (anchor) element

```md
<p>Click <a href="https://www.google.com/">here</a> to google things.</p>
```

.out-t[
<p>Click <a href="https://www.google.com/">here</a> to google things.</p>
]

<br>

Note that

- `href` (hypertext reference) is a .yellow-h[required attribute] for this element
  - most attributes are optional, but some are required

---

## HTML — Other Important Elements — Links

Links can have titles

```md
<p>Click <a `title="This text appears when visitors hover over the link"` href="https://www.google.com/">here</a> to google things.</p>
```

.out-t[
<p>Click <a title="This text appears when visitors hover over the link" href="https://www.google.com/">here</a> to google things.</p>
]

<br>

Note that

- the `title` attribute is one of the optional attributes
  - it becomes visible when hovered over with the mouse

---

## HTML — Other Important Elements — Lists

The `<ul>` tag introduces unordered lists, while the `<li>` tag defines list items

```r
<ul>
<li>books</li>
<li>journal articles</li>
<li>reports</li>
</ul>
```

.out-t[
<ul>
<li>books</li>
<li>journal articles</li>
<li>reports</li>
</ul>
]

<br>

Note that

- Ordered lists are introduced with the `<ol>` tag instead

---

## HTML — Other Important Elements — Containers

The `<div>` tag defines a section, containing one or often more elements
.pull-left[ <br> ```r <p>This is an introductory paragraph.</p> <`div` style="text-decoration:underline;"> <p>In this important division there are two elements, which are:</p> <ul> <li>a paragraph, and</li> <li>an unordered list.</li> </ul> <`/div`> <p>This is the concluding paragraph.<p> ``` ] .pull-right[ <br> .out-t[ <p>This is an introductory paragraph.</p> <div style="text-decoration:underline;"> <p>In this important division there are two elements, which are:</p> <ul> <li>a paragraph, and</li> <li>an unordered list.</li> </ul> </div> <p>This is the concluding paragraph.<p> ] ] --- ## HTML — Other Important Elements — Containers The `<span>` tag also defines a section, containing a part of an element ```r <p>This is an <`span` style="text-decoration:underline;">important paragraph<`/span`>, which you must read carefully.<p> ``` .out-t[ <p>This is an <span style="text-decoration:underline;">important paragraph</span>, which you must read carefully.<p> ] <br> Note that - containers are useful in applying styles to sections - or, attributing classes or ids to them --- class: action ## Exercises 6) Re-create the page at <https://luzpar.netlify.app/states/> in R - start an HTML file, following from the RStudio menu: > `File -> New File -> HTML File` - copy the text from the website, paste in the HTML file - add the structure with HTML code - click `Preview` to view the result <br> 7) Add at least one extra tag and/or attribute - with a visible effect on how the page looks at the front end - hints: - google if you need to - [www.w3schools.com](https://www.w3schools.com/) has a lot resources - save this document as we will continue working on it
15
:
00
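
---

## HTML — Nested Elements — A Look Ahead

R can parse these structures too. A minimal sketch, previewing `read_html` and `html_element` from Part 5, plus `html_children`, an `rvest` function not otherwise covered in this workshop, applied to the nesting example from the earlier slides:

```r
library(rvest)

# parse the one-paragraph snippet from the nesting slide
page <- read_html("<p>This is a <strong>one</strong> sentence paragraph.</p>")

# the paragraph element is the parent...
paragraph <- html_element(page, css = "p")

# ...and the strong element is its only child
html_children(paragraph)
```
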
---

name: part4
class: inverse, center, middle

# Part 4. CSS Selectors

.footnote[
[Back to the contents slide](#contents-slide).
]

---

## CSS — Overview

- CSS stands for .yellow-h[cascading style sheets]
  - it gives the style to what is visible to visitors
    - text, images, links
<br>
  - would a piece of text appear in black or blue?
    - depends on the CSS for that text

<br>

- CSS can be defined
  - inline, as an attribute of an element
  - internally, as a child element of the `head` element
  - externally, but then linked in the `head` element

---

## CSS — Syntax

.pull-left[
- CSS is written in .yellow-h[rules]
]

.pull-right[
```md
`p {font-size:14px;}`
`h1, h2 {color:blue;}`
`.count {background-color:yellow;}`
`#sentence-count {color:red; font-size:16px;}`
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more <span style="background-color: #ffff88;">selectors</span>, matching one or more HTML elements and/or attributes
<br>
]

.pull-right[
```md
`p` {font-size:14px;}
`h1, h2` {color:blue;}
`.count` {background-color:yellow;}
`#sentence-count` {color:red; font-size:16px;}
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more <span style="background-color: #ffff88;">selectors</span>, matching one or more HTML elements and/or attributes

<br>

- Note that
  - the syntax changes with the selector type
  - elements and attributes are written as they are
]

.pull-right[
```md
`p` {font-size:14px;}
`h1, h2` {color:blue;}
.count {background-color:yellow;}
#sentence-count {color:red; font-size:16px;}
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more <span style="background-color: #ffff88;">selectors</span>, matching one or more HTML elements and/or attributes

<br>

- Note that
  - the syntax changes with the selector type
  - elements and attributes are written as they are
  - classes are prefixed with a full stop, ids with a number sign
]

.pull-right[
```md
p {font-size:14px;}
h1, h2 {color:blue;}
`.count` {background-color:yellow;}
`#sentence-count` {color:red; font-size:16px;}
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more <span style="background-color: #ffff88;">selectors</span>, matching one or more HTML elements and/or attributes

<br>

- Note that
  - the syntax changes with the selector type
  - elements and attributes are written as they are
  - classes are prefixed with a full stop, ids with a number sign
<br>
  - you can define the same rule for more than one element and/or attribute
    - by separating the selectors with a comma
]

.pull-right[
```md
p {font-size:14px;}
`h1, h2` {color:blue;}
.count {background-color:yellow;}
#sentence-count {color:red; font-size:16px;}
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more selectors, matching one or more HTML elements and/or attributes
  - a .yellow-h[declaration]

<br>

- Note that
  - declarations are written in between two curly brackets
]

.pull-right[
```md
p `{font-size:14px;}`
h1, h2 `{color:blue;}`
.count `{background-color:yellow;}`
#sentence-count `{color:red; font-size:16px;}`
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more selectors, matching one or more HTML elements and/or attributes
  - a declaration, with one or more .yellow-h[properties]
]

.pull-right[
```md
p {`font-size:`14px;}
h1, h2 {`color:`blue;}
.count {`background-color:`yellow;}
#sentence-count {`color:`red; `font-size:`16px;}
```
]

<br>

- Note that
  - properties are followed by a colon

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more selectors, matching one or more HTML elements and/or attributes
  - a declaration, with one or more properties and .yellow-h[values]

<br>

- Note that
  - values are followed by a semicolon
  - `property:value;` pairs are separated by a white space
]

.pull-right[
```md
p {font-size:`14px;`}
h1, h2 {color:`blue;`}
.count {background-color:`yellow;`}
#sentence-count {color:`red;` font-size:`16px;`}
```
]

---

## CSS — Internal

.pull-left[
- CSS rules can be defined internally
  - within the `style` element
    - as a child of the `head` element

- Internally defined rules apply to all matching selectors
  - on the same page
]

.pull-right[
```r
<!DOCTYPE html>
<html>
<head>
* <style>
* h1 {color:blue;}
* </style>
<title>A title for browsers</title>
</head>
<body>
<h1>A header</h1>
<p>This is a paragraph.</p>
<ul>
<li>This</li>
<li>is a</li>
<li>list</li>
</ul>
</body>
</html>
```
]

---

## CSS — External

.pull-left[
- CSS rules can be defined externally
  - saved somewhere linkable
  - defined with the `link` element
    - as a child of the `head` element

- Externally defined rules
  - are saved in a file with .css extension
  - apply to all matching selectors
    - on any page linked
]

.pull-right[
```r
<!DOCTYPE html>
<html>
<head>
* <link rel="stylesheet" href="simple.css">
<title>A title for browsers</title>
</head>
<body>
<h1>A header</h1>
<p>This is a paragraph.</p>
<ul>
<li>This</li>
<li>is a</li>
<li>list</li>
</ul>
</body>
</html>
```
]

---

## CSS — Inline

CSS rules can also be defined inline

- with the `style` attribute
  - does not require a selector
  - applies only to that element

```md
<p>This is a <strong `style="color:blue;"`>one</strong> sentence paragraph.</p>
```

.out-t[
<p>This is a <strong style="color:blue;">one</strong> sentence paragraph.</p>
]

---

class: action

## Exercise

8) Provide some simple style to your HTML document

- one that you created during the previous exercise
- using internal or external style, but not inline
  - so that you can practice selecting elements

<br>

- no idea what to do?
  - increase the font size of the text in the paragraph
  - change the colour of the second item in the list to red
  - get more ideas from [www.w3schools.com/css](https://www.w3schools.com/css/default.asp)
07
:
30
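
---

## CSS — Selectors in R — A Look Ahead

In Part 5, these selector types go into the `css` argument of the `rvest` function `html_elements`. A minimal sketch, using a made-up snippet rather than a real page; the class and id names (`count`, `intro`) are invented for the example:

```r
library(rvest)

# a tiny page with a class and an id attribute
page <- read_html('<p id="intro" class="count">one</p> <p class="count">two</p>')

html_elements(page, css = "p")       # by element name
html_elements(page, css = ".count")  # by class, prefixed with a full stop
html_elements(page, css = "#intro")  # by id, prefixed with a number sign
```
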
---

name: part5
class: inverse, center, middle

# Part 5. Scraping Static Pages

.footnote[
[Back to the contents slide](#contents-slide).
]

---

## Static Pages — Overview

- Static pages are those that display the same source code to all visitors
  - every visitor sees the same content at a given URL
  - for different content, visitors go to a different page with a different URL
  - <https://luzpar.netlify.app/> is a static page

--

<br>

- Static pages are typically scraped in two steps
  - the `rvest` package can handle both steps
  - we may still wish to use other packages to ensure ethical scraping

---

## Static Pages — Two Steps to Scrape

Scraping static pages involves two main steps

- .yellow-h[Get] the source code into R
  - with the `rvest` or `polite` package, using URLs of these pages
  - typically, the only interaction with the page itself

- .yellow-h[Extract] the exact information needed from the source code
  - with the `rvest` package, using selectors for that exact information
  - takes place locally, on your machine

---

## Static Pages — `rvest` — Overview

- A relatively small R package for web scraping
  - created by [Hadley Wickham](http://hadley.nz/)
  - popular — used by many for web scraping
    - downloaded 494,966 times last month
    - some of it must be thanks to being a part of the `tidyverse` family
<br>
  - last major revision was in March 2021
    - better alignment with `tidyverse`

--

<br>

- A lot has already been written on this package
  - you will find solutions to, or help for, any issues online
    - see first the [package documentation](https://cran.r-project.org/web/packages/rvest/rvest.pdf), and numerous tutorials — such as [this](https://rvest.tidyverse.org/), [this](https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/), and [this](https://steviep42.github.io/webscraping/book/index.html#quick-rvest-tutorial)

--

<br>

- Comes with the recommendation to combine it with the `polite` package
  - for ethical web scraping

---

## Static Pages — `rvest` — Get Source Code

Use the `read_html` function to get the source code of a webpage into R

```r
read_html("https://luzpar.netlify.app/")
```

```
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ...
```

<br>

Note that

- this is the first of two steps in scraping static pages
  - typically, the only interaction with the page itself

<br>

- we still need to select the exact information that we need

---

## Static Pages — `rvest` — Get Source Code

You may wish to check the protocol first, for ethical scraping

```r
paths_allowed(domain = "https://luzpar.netlify.app/")
```

```
## [1] TRUE
```

```r
read_html("https://luzpar.netlify.app/")
```

```
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ...
```

---

## Static Pages — `rvest` — Get Source Code — `polite`

- The `polite` package facilitates ethical scraping
  - recommended by `rvest`

<br>

- It divides the step of getting source code into two
  - check the protocol
  - get the source only if allowed

<br>

- Among its other functions are
  - waiting for a period of time
    - at a minimum, by what is specified in the protocol
<br>
  - introducing you to website administrators while scraping

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol
  - for a specific .yellow-h[URL]
]

.pull-right[
```md
bow(`url`,
  user_agent = "polite R package - https://github.com/dmi3kno/polite",
  delay = 5,
  ...
)
```
]

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol
  - for a specific URL
  - for a specific .yellow-h[agent]

<br>

- Note that
  - the `user_agent` argument can communicate information to website administrators
    - e.g., your name and contact details
]

.pull-right[
```md
bow(url,
  `user_agent = "polite R package - https://github.com/dmi3kno/polite"`,
  delay = 5,
  force = FALSE,
  ...
)
```
]

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol
  - for a specific URL
  - for a specific agent
  - for any .yellow-h[crawl-delay directives]

<br>

- Note that
  - the `delay` argument cannot be set to a number smaller than in the directive
    - if there is one
]

.pull-right[
```md
bow(url,
  user_agent = "polite R package - https://github.com/dmi3kno/polite",
  `delay = 5`,
  force = FALSE,
  ...
)
```
]

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol
  - for a specific URL
  - for a specific agent
  - for crawl-delay directives

<br>

- Note that
  - the `delay` argument cannot be set to a number smaller than in the directive
    - if there is one
<br>
  - the `force` argument is set to `FALSE` by default
    - avoids repeated, unnecessary interactions with the web page
    - by caching, and re-using, previously downloaded sources
]

.pull-right[
```md
bow(url,
  user_agent = "polite R package - https://github.com/dmi3kno/polite",
  delay = 5,
  `force = FALSE`,
  ...
)
```
]

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol

- .yellow-h[Second], use the `scrape` function to get the source code
  - for an object created with the .yellow-h[`bow`] function

<br>

- Note that
  - `scrape` will only work if the results from `bow` are positive
    - creating a safety valve for ethical scraping
]

.pull-right[
```md
scrape(`bow`,
  ...
)
```
]

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol

- Second, use the `scrape` function to get the source code
  - for an object created with the `bow` function

<br>

- Note that
  - `scrape` will only work if the results from `bow` are positive
    - creating a safety valve for ethical scraping
<br>
  - by .yellow-h[piping] `bow` into `scrape`, you can avoid creating objects
]

.pull-right[
```md
scrape(bow,
  ...
)
```

```md
bow() `%>%` scrape()
```
]

---

## Static Pages — `rvest` — Get Source Code

These two pieces of code lead to the same outcome, as there is .yellow-h[no protocol against the access]

.pull-left[
```r
read_html("https://luzpar.netlify.app/")
```

```
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ... ``` ] .pull-right[ ```r bow("https://luzpar.netlify.app/") %>% scrape() ``` ``` ## {html_document} ## <html lang="en-us"> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ... ## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ... ``` ] --- ## Static Pages — `rvest` — Get Source Code The difference occurs when there is .yellow-h[a protocol against the access] .pull-left[ ```r read_html("https://luzpar.netlify.app/exercises/exercise_6.Rhtml") ``` ``` ## {html_document} ## <html> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ... ## [2] <body>\r\n\r\n<h1>States of Luzland</h1>\r\n \r\n<p>There are four ... ``` ] .pull-right[ ```r bow("https://luzpar.netlify.app/exercises/exercise_6.Rhtml") %>% scrape() ``` ``` ## Warning: No scraping allowed here! ``` ``` ## NULL ``` ] --- class: action ## Exercises 9) Get the source code of the page at <https://luzpar.netlify.app/states/> in R - using the `read_html` function <br> 10) Get the same page source, this time in the `polite` way - let the website know who you are - define delay time
05
:
00
--- ## Static Pages — `rvest` — `html_elements` .pull-left[ - Get one or more HTML elements - from the .yellow-h[source code] downloaded in the previous step <br> - Note that - there are two versions of the same function - singular one gets the first instance of an element, plural gets all instances - if there is only one instance, both functions return the same result ] .pull-right[ ```md html_element(`x`, css, xpath) html_element`s`(`x`, css, xpath) ``` ] --- ## Static Pages — `rvest` — `html_elements` .pull-left[ - Get one or more HTML elements - from the source code downloaded in the previous step - specified with .yellow-h[a selector], CSS .yellow-h[or] XPATH <br> - Note that - we will work with CSS only in this workshop - using CSS is facilitated by Chrome and SelectorGagdet ] .pull-right[ ```md html_element(x, `css`, `xpath`) html_element`s`(x, css, xpath) ``` ] --- ## Static Pages — Finding Selectors - Finding the correct selector(s) is the key to successful scraping, and there are three ways to do it - figure it out yourself, by looking at the source code and/or the DOM - difficult, time consuming, prone to error <br> - use SelectorGagdet or other browser extensions - easy and quick - works well when selecting both single and multiple elements - but sometimes not accurate <br> - use the functionality that Chrome provides - an in-between option in terms of ease and time - works very well with single elements <br> -- <br> - I recommend using - the SelectorGagdet method first, and if it does not help - then the Chrome method, especially when selecting single elements --- ## Static Pages — Finding Selectors — SelectorGagdet To find the selectors for the hyperlinks on the homepage of the Parliamenta of Luzland .pull-left[ 1. visit the page on a Chrome browser 2. click on SelectorGagdet to activate it 3. click on a hyperlink <br> Note that - the element that you clicked is highlighted green - many other elements, including menu items, are in yellow - SelectorGagdet says the selector is `a` ] .pull-right[ <img src="scrp_workshop_files/images_data/sg1.png" width="100%" /> ] --- ## Static Pages — `rvest` — `html_elements` Get the `a` (anchor) elements on the homepage ```r bow("https://luzpar.netlify.app") %>% scrape() %>% `html_elements(css = "a")` ``` ``` ## {xml_nodeset (24)} ## [1] <a class="js-search" href="#" aria-label="Close"><i class="fas fa-times- ... ## [2] <a class="navbar-brand" href="/">Parliament of Luzland</a> ## [3] <a class="navbar-brand" href="/">Parliament of Luzland</a> ## [4] <a class="nav-link active" href="/"><span>Home</span></a> ## [5] <a class="nav-link" href="/states/"><span>States</span></a> ## [6] <a class="nav-link" href="/constituencies/"><span>Constituencies</span></a> ## [7] <a class="nav-link" href="/members/"><span>Members</span></a> ## [8] <a class="nav-link" href="/documents/"><span>Documents</span></a> ## [9] <a class="nav-link js-search" href="#" aria-label="Search"><i class="fas ... ## [10] <a href="#" class="nav-link" data-toggle="dropdown" aria-haspopup="true" ... ## [11] <a href="#" class="dropdown-item js-set-theme-light"><span>Light</span></a> ## [12] <a href="#" class="dropdown-item js-set-theme-dark"><span>Dark</span></a> ## [13] <a href="#" class="dropdown-item js-set-theme-auto"><span>Automatic</spa ... ## [14] <a href="https://github.com/resulumit/scrp_workshop" target="_blank" rel ... ## [15] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Um ... 
## [16] <a href="/documents/">documents</a> ## [17] <a href="/constituencies/">constituencies</a> ## [18] <a href="/members/">members</a> ## [19] <a href="/states/">states</a> ## [20] <a href="https://github.com/rstudio/blogdown" target="_blank" rel="noope ... ## ... ``` --- ## Static Pages — `rvest` — `html_element` Get the .yellow-h[first] `a` (anchor) element on the homepage ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elemen`t(`css = "a") ``` ``` ## {html_node} ## <a class="js-search" href="#" aria-label="Close"> ## [1] <i class="fas fa-times-circle text-muted" aria-hidden="true"></i> ``` <br> Note that - the function on this slide is the singular version --- ## Static Pages — Finding Selectors — SelectorGagdet To exclude the menu items from selection .pull-left[ <span>4.</span> click on a menu item <br> Note that - the element that you clicked is highlighted red - other menu items are not highlighted at all - SelectorGagdet says the selector is now `#title` `a` ] .pull-right[ <img src="scrp_workshop_files/images_data/sg2.png" width="100%" /> ] --- ## Static Pages — `rvest` — `html_elements` Get the `a` (anchor) elements on the homepage with a `#title` attribute ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = `"#title a"`) ``` ``` ## {xml_nodeset (9)} ## [1] <a href="https://github.com/resulumit/scrp_workshop" target="_blank" rel= ... ## [2] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ... ## [3] <a href="/documents/">documents</a> ## [4] <a href="/constituencies/">constituencies</a> ## [5] <a href="/members/">members</a> ## [6] <a href="/states/">states</a> ## [7] <a href="https://github.com/rstudio/blogdown" target="_blank" rel="noopen ... ## [8] <a href="https://gohugo.io/" target="_blank" rel="noopener">Hugo</a> ## [9] <a href="https://github.com/wowchemy" target="_blank" rel="noopener">Wowc ... ``` --- ## Static Pages — Finding Selectors — SelectorGagdet You can click further to exclude some and/or to include more elements .pull-left[ Note that the selection is colour-coded - <span style="background-color: #90FF33;">selected</span> - <span style="background-color: #ffff88;">also included</span> - <span style="background-color: #FF3F33;">excluded</span> - not included at all ] .pull-right[ <img src="scrp_workshop_files/images_data/sg3.png" width="100%" /> ] --- ## Static Pages — `rvest` — `html_elements` Get the link behind the selected elements ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = `"br+ p a"`) ``` ``` ## {xml_nodeset (2)} ## [1] <a href="https://github.com/resulumit/scrp_workshop" target="_blank" rel= ... ## [2] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ... ``` --- ## Static Pages — Finding Selectors — SelectorGagdet You can click further to select .yellow-h[a single element] .pull-left[ ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = `"br+ p a+ a"`) ``` ``` ## {xml_nodeset (1)} ## [1] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ... ``` ] .pull-right[ <img src="scrp_workshop_files/images_data/sg4.png" width="100%" /> ] --- ## Static Pages — Finding Selectors — Chrome To find the selector for a single element, you could also use Chrome itself .pull-left[ 1. right click, and then `Inspect` 2. click ![](scrp_workshop_files/images_data/ch2.png) 3. click on an element on the front end 4. right click on the highlighted section in the DOM 4. 
follow `Copy -> Copy selector`
]

.pull-right[
<img src="scrp_workshop_files/images_data/ch1.png" width="100%" />
]

---

## Static Pages — `rvest` — `html_elements`

Get the link behind one element, with the CSS selector from Chrome

```r
bow("https://luzpar.netlify.app") %>%
  scrape() %>%
  html_elements(css = `"#title > div.container > div > p:nth-child(2) > a:nth-child(2)"`)
```

```
## {xml_nodeset (1)}
## [1] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ...
```

<br>

Note that

- the selector is different from the one SelectorGadget returns
  - longer, and therefore, more specific and accurate
  - but the outcome is the same

---

class: action

## Exercises

11) Get the first item on the list on the page at <https://luzpar.netlify.app/states/>

- find the selector with the functionality Chrome offers

<br>

12) Get all items on the list

- find the selector with SelectorGadget

<br>

13) Get only the second and fourth items on the list

- using a single selector that would return both
10
:
00
--- ## Static Pages — `rvest` — `html_text` .pull-left[ - Get the text content of one or more HTML elements - for the elements already chosen - with the `html_elements` function <br> - this returns what is already visible to visitors <br> - Note that - there are two versions of the same function - `html_text` returns text with any space or line breaks around it - `html_text2` returns plain text ] .pull-right[ ```r html_text(x, trim = FALSE) html_text2(x, preserve_nbsp = FALSE) ``` ] --- ## Static Pages — `rvest` — `html_text` ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% `html_text()` ``` ``` ## [1] "a workshop on automated web scraping" ## [2] "Resul Umit" ## [3] "documents" ## [4] "constituencies" ## [5] "members" ## [6] "states" ## [7] "Blogdown" ## [8] "Hugo" ## [9] "Wowchemy" ``` --- class: action ## Exercises 14) Get the text on the list elements on the page at <https://luzpar.netlify.app/states/> <br> 15) Get the constituency names on the page at <https://luzpar.netlify.app/constituencies/>
05
:
00
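
---

## Static Pages — `rvest` — `html_text` vs `html_text2`

The difference between the two versions is easiest to see on untidy source code. A minimal sketch, with a made-up snippet rather than a page from the demonstration website:

```r
# a paragraph with extra spaces and a line break, which browsers would ignore
page <- read_html("<p>  This is a
  one sentence paragraph.  </p>")

page %>% html_element(css = "p") %>% html_text()   # keeps the spaces and the line break
page %>% html_element(css = "p") %>% html_text2()  # plain sentence, as a browser would display it
```
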
--- ## Static Pages — `rvest` — `html_attr` .pull-left[ - Get one or more attributes of one or more HTML elements - for the elements already chosen - with the `html_elements` function <br> - attributes are specified with their name - not CSS or XPATH <br> - Note that - there are two versions of the same function - singular one gets a specified attribute - plural one gets all available attributes ] .pull-right[ ```r html_attr(x, name, default = NA_character_) html_attrs(x) ``` ] --- ## Static Pages — `rvest` — `html_attrs` ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% `html_attrs()` ``` ``` ## [[1]] ## href ## "https://github.com/resulumit/scrp_workshop" ## target ## "_blank" ## rel ## "noopener" ## ## [[2]] ## href target rel ## "https://resulumit.com/" "_blank" "noopener" ## ## [[3]] ## href ## "/documents/" ## ## [[4]] ## href ## "/constituencies/" ## ## [[5]] ## href ## "/members/" ## ## [[6]] ## href ## "/states/" ## ## [[7]] ## href target ## "https://github.com/rstudio/blogdown" "_blank" ## rel ## "noopener" ## ## [[8]] ## href target rel ## "https://gohugo.io/" "_blank" "noopener" ## ## [[9]] ## href target ## "https://github.com/wowchemy" "_blank" ## rel ## "noopener" ``` --- ## Static Pages — `rvest` — `html_attr` ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% `html_attr(name = "href")` ``` ``` ## [1] "https://github.com/resulumit/scrp_workshop" ## [2] "https://resulumit.com/" ## [3] "/documents/" ## [4] "/constituencies/" ## [5] "/members/" ## [6] "/states/" ## [7] "https://github.com/rstudio/blogdown" ## [8] "https://gohugo.io/" ## [9] "https://github.com/wowchemy" ``` -- Note that - some URLs are given relative to the base URL - e.g., `/states/`, which is actually <https://luzpar.netlify.app/states/> - you can complete them with the `url_absolute` function --- ## Static Pages — `rvest` — `url_absolute` Complete the relative URLs with the `url_absolute` function ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% html_attr(name = "href") %>% `url_absolute(base = "https://luzpar.netlify.app")` ``` ``` ## [1] "https://github.com/resulumit/scrp_workshop" ## [2] "https://resulumit.com/" ## [3] "https://luzpar.netlify.app/documents/" ## [4] "https://luzpar.netlify.app/constituencies/" ## [5] "https://luzpar.netlify.app/members/" ## [6] "https://luzpar.netlify.app/states/" ## [7] "https://github.com/rstudio/blogdown" ## [8] "https://gohugo.io/" ## [9] "https://github.com/wowchemy" ``` --- class: action ## Exercises 16) Get the hyperlink attributes for the constituencies at <https://luzpar.netlify.app/constituencies/> <br> 17) Create complete links to the constituency pages
05:00
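---

## Static Pages — `rvest` — `html_attr` — Note

The `default` argument of `html_attr` substitutes a value for attributes that an element does not have; a minimal sketch, assuming the same selection as on the previous slides, where links without a `target` attribute would otherwise come back as `NA`

```r
bow("https://luzpar.netlify.app") %>%
  scrape() %>%
  html_elements(css = "#title a") %>%
  # internal links have no target attribute; treat them as "_self"
  html_attr(name = "target", default = "_self")
```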
--- ## Static Pages — `rvest` — `html_table` Use the `html_table()` function to get the text content of table elements - it returns a list of tibbles, one per table (see the note after the exercise below) ```r bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "table") %>% `html_table()` ``` ``` ## [[1]] ## # A tibble: 100 x 3 ## Member Constituency Party ## <chr> <chr> <chr> ## 1 Arthur Ali Mühlshafen Liberal ## 2 Chris Antony Benwerder Labour ## 3 Chloë Bakker Steffisfelden Labour ## 4 Rose Barnes Dillon Liberal ## 5 Emilia Bauer Kilnard Green ## 6 Wilma Baumann Granderry Green ## 7 Matteo Becker Enkmelo Labour ## 8 Patricia Bernard Gänsernten Labour ## 9 Lina Booth Leonrau Liberal ## 10 Sophie Bos Zotburg Independent ## # ... with 90 more rows ``` --- ## Static Pages — `rvest` We can create the same tibble with `html_text`, which requires getting each variable separately and then merging them ```r tibble( "Member" = bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td:nth-child(1) a") %>% html_text(), "Constituency" = bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td:nth-child(2) a") %>% html_text(), "Party" = bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td:nth-child(3)") %>% html_text() ) ``` --- ## Static Pages — `rvest` Keep the number of interactions with websites to a minimum - by saving the source code as an object, which could be used repeatedly ```md `the_page <- bow("https://luzpar.netlify.app/members/")` %>% `scrape()` tibble( "Member" = `the_page` %>% html_elements(css = "td:nth-child(1)") %>% html_text(), "Constituency" = `the_page` %>% html_elements(css = "td:nth-child(2)") %>% html_text(), "Party" = `the_page` %>% html_elements(css = "td:nth-child(3)") %>% html_text() ) ``` --- class: action ## Exercise 18) Create a dataframe out of the table at <https://luzpar.netlify.app/members/> - with as many variables as possible - hints: - start with the code in the previous slide, and add new variables from attributes - the first two columns have important attributes - e.g., URLs for the pages for members and their constituencies - make these URLs absolute - see what other attributes there are to collect
15:00
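---

## Static Pages — `rvest` — `html_table` — Note

`html_table()` returns a list of tibbles, one per table element, even when a page holds a single table; a minimal sketch for keeping only the first table, assuming the members page is the target

```r
tables <- bow("https://luzpar.netlify.app/members/") %>%
  scrape() %>%
  html_elements(css = "table") %>%
  html_table()

# keep the first (and, on this page, only) table
members <- tables[[1]]
```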
--- ## Static Pages — Crawling — Overview - Rarely does a single page include all the variables that we need - instead, they are often scattered across different pages of a website - e.g., we might need data on election results — in addition to constituency names <br> - Web scraping then requires crawling across pages - using information found on one page, to go to the next - website design may or may not facilitate crawling <br> - We can write for loops to crawl - the speed of our code matters the most when we crawl - ethical concerns are higher --- ## Static Pages — Crawling — Example **Task:** - I need data on the name and vote share of parties that came second in each constituency - This data is available on constituency pages, but - there are too many such pages - e.g., <https://luzpar.netlify.app/constituencies/arford/> - I do not have the URLs to these pages -- <br> **Plan:** - Scrape <https://luzpar.netlify.app/members/> for URLs - Write a for loop to - visit these pages one by one - collect and save the variables needed - write these variables into a list - turn the list into a dataframe --- ## Static Pages — Crawling — Example Scrape the page that has all URLs, for absolute URLs ```r the_links <- bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td+ td a") %>% html_attr("href") %>% url_absolute(base = "https://luzpar.netlify.app/") # check if it worked head(the_links) ``` ``` ## [1] "https://luzpar.netlify.app/constituencies/muhlshafen/" ## [2] "https://luzpar.netlify.app/constituencies/benwerder/" ## [3] "https://luzpar.netlify.app/constituencies/steffisfelden/" ## [4] "https://luzpar.netlify.app/constituencies/dillon/" ## [5] "https://luzpar.netlify.app/constituencies/kilnard/" ## [6] "https://luzpar.netlify.app/constituencies/granderry/" ``` --- ## Static Pages — Crawling — Example Create an empty list ```r *temp_list <- list() for (i in 1:length(the_links)) { the_page <- bow(the_links[i]) %>% scrape() temp_tibble <- tibble( "constituency" = the_page %>% html_elements("#constituency") %>% html_text(), "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(), "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text() ) temp_list[[i]] <- temp_tibble } df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Start a for loop to iterate over the links one by one ```r temp_list <- list() *for (i in 1:length(the_links)) { the_page <- bow(the_links[i]) %>% scrape() temp_tibble <- tibble( "constituency" = the_page %>% html_elements("#constituency") %>% html_text(), "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(), "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text() ) temp_list[[i]] <- temp_tibble *} df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Get the source code for the next link ```r temp_list <- list() for (i in 1:length(the_links)) { *the_page <- bow(the_links[i]) %>% scrape() temp_tibble <- tibble( "constituency" = the_page %>% html_elements("#constituency") %>% html_text(), "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(), "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text() ) temp_list[[i]] <- temp_tibble } df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Get the variables needed, put them in a tibble ```r 
temp_list <- list() for (i in 1:length(the_links)) { the_page <- bow(the_links[i]) %>% scrape() *temp_tibble <- tibble( * *"constituency" = the_page %>% html_elements("#constituency") %>% html_text(), * *"second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% * html_text(), * *"vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% * html_text() * *) temp_list[[i]] <- temp_tibble } df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Add each tibble into the previously-created list ```r temp_list <- list() for (i in 1:length(the_links)) { the_page <- bow(the_links[i]) %>% scrape() temp_tibble <- tibble( "constituency" = the_page %>% html_elements("#constituency") %>% html_text(), "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(), "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text() ) *temp_list[[i]] <- temp_tibble } df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Turn the list into a tibble (an alternative with `bind_rows` follows the exercise below) ```r temp_list <- list() for (i in 1:length(the_links)) { the_page <- bow(the_links[i]) %>% scrape() temp_tibble <- tibble( "constituency" = the_page %>% html_elements("#constituency") %>% html_text(), "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(), "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text() ) temp_list[[i]] <- temp_tibble } *df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Check the resulting dataset ```r head(df, 10) ``` ``` ## # A tibble: 100 x 3 ## constituency second_party vote_share ## <chr> <chr> <chr> ## 1 Mühlshafen Green 26.1% ## 2 Benwerder Conservative 24.8% ## 3 Steffisfelden Green 25.7% ## 4 Dillon Conservative 27% ## 5 Kilnard Conservative 28.8% ## 6 Granderry Labour 26.1% ## 7 Enkmelo Liberal 26.8% ## 8 Gänsernten Green 26.6% ## 9 Leonrau Conservative 25% ## 10 Zotburg Conservative 28.4% ## # ... with 90 more rows ``` --- class: action ## Exercise 19) Crawl into members' personal pages to create a rich dataset - with members being the unit of observation <br> Hints: - see an example dataset at <https://luzpar.netlify.app/exercises/static_data.csv> - start with the related code in the previous slides, and adapt it to your needs - practice with 3 members until you are ready to run the loop for all - e.g., by replacing `1:length(the_links)` with `1:3` for the loop
45:00
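---

## Static Pages — Crawling — Note

As an aside, the last step of the loop, turning the list into a tibble, can also be written with the `bind_rows` function; a minimal sketch, assuming `dplyr` is loaded and `temp_list` holds one tibble per constituency

```r
# equivalent to df <- as_tibble(do.call(rbind, temp_list))
df <- bind_rows(temp_list)
```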
--- name: part6 class: inverse, center, middle # Part 6. Scraping Dynamic Pages .footnote[ [Back to the contents slide](#contents-slide). ] --- ## Dynamic Pages — Overview - Dynamic pages are ones that display custom content - different visitors might see different content on the same page - at the same URL <br> - depending on, for example, their own input - e.g., clicks, scrolls — while the URL remains the same <br> - <https://luzpar.netlify.app/documents/> is a page with a dynamic part -- <br> - Dynamic pages are typically scraped in three steps - as opposed to two steps, in scraping static pages - we will use an additional package, `RSelenium`, for the new step --- ## Dynamic Pages — Three Steps to Scrape Scraping dynamic pages involves three main steps <br> - .yellow-h[Create] the desired instance of the dynamic page - with the `RSelenium` package - e.g., by clicking, scrolling, filling in forms, from within R <br> - .yellow-h[Get] the source code into R - `RSelenium` downloads XML - `rvest` turns it into HTML <br> - .yellow-h[Extract] the exact information needed from the source code - as for static pages - with the `rvest` package --- ## Dynamic Pages — `RSelenium` — Overview - A package that integrates [Selenium 2.0 WebDriver](https://www.selenium.dev/documentation/en/) into R - created by [John Harrison](http://johndharrison.github.io/#/cover) - downloaded 6,901 times last month - last updated in February 2020 -- <br> - A lot has already been written on this package - you will find solutions to, or help for, any issues online - see the [package documentation](https://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf) and the [vignettes](https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html) for basic functionality - Google searches return code and tutorials in various languages - not only R but also Python, Java --- ## Dynamic Pages — `RSelenium` — Overview - The package involves more methods than functions - the code looks slightly unusual for R - as it follows the logic behind Selenium -- <br> - It allows interacting with two things — and it is crucial that users are aware of the difference - with .yellow-h[browsers] on your computer - e.g., opening a browser and navigating to a page <br> - with .yellow-h[elements] on a webpage - e.g., opening and clicking on a drop-down menu --- class: center, middle ## Interacting with Browsers --- ## Dynamic Pages — Browsers — Starting a Server .pull-left[ - Use the `rsDriver` function to start a server - so that you can control a web browser from within R <br> ] .pull-right[ ```md rsDriver(port = 4567L, browser = "chrome", version = "latest", chromever = "latest", ... ) ``` ] --- ## Dynamic Pages — Browsers — Starting a Server .pull-left[ - Use the `rsDriver` function to start a server - so that you can control a web browser from within R <br> - Note that the defaults can cause errors, such as - trying to start two servers from the same .yellow-h[port] ] .pull-right[ ```md rsDriver(`port` = 4567L, browser = "chrome", version = "latest", chromever = "latest", ... ) ``` ] --- ## Dynamic Pages — Browsers — Starting a Server .pull-left[ - Use the `rsDriver` function to start a server - so that you can control a web browser from within R <br> - Note that the defaults can cause errors, such as - trying to start two servers from the same port - any mismatch between the .yellow-h[version and driver numbers] ] .pull-right[ ```md rsDriver(port = 4567L, browser = "chrome", `version = "latest"`, `chromever = "latest"`, ... 
) ``` ] --- ## Dynamic Pages — Browsers — Starting a Server - The latest version of the driver is too new for my browser - I have to use an older version to make it work - after checking the available versions with the following code ```r binman::list_versions("chromedriver") ``` ``` ## $win32 ## [1] "100.0.4896.20" "100.0.4896.60" "102.0.5005.27" "102.0.5005.61" ## [5] "103.0.5060.24" "89.0.4389.23" "90.0.4430.24" "91.0.4472.19" ## [9] "99.0.4844.35" "99.0.4844.51" ``` -- <br> - Note that - you can only use the version that *you* have - you might have different versions than the ones on this slide --- ## Dynamic Pages — Browsers — Starting a Server .pull-left[ - Then the function works - a web browser opens as a result - an R object named .yellow-h[driver] is created <br> - Note that - the browser says .yellow-h["Chrome is being controlled by automated test software."] - you should avoid controlling this browser manually - you should also avoid creating multiple servers ] .pull-right[ ```md driver <- rsDriver(`chromever = "102.0.5005.27"`) ``` <img src="scrp_workshop_files/images_data/chrome_works.png" width="2073" /> ] --- ## Dynamic Pages — Browsers — Starting a Server Separate the .yellow-h[client] and .yellow-h[server] as different objects ```r browser <- driver$client server <- driver$server ``` <br> Note that - `rsDriver()` creates a client and a server - the code above singles out the client, with which our code will interact - the client is best thought of as the browser itself - it has the class of `remoteDriver` - a full session sketch, including how to stop the server, follows the exercises below --- class: action ## Exercises 20) Start a server - supply a driver version if necessary <br> 21) Single out the client - call it `browser` to help you follow the slides
02:30
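---

## Dynamic Pages — Browsers — Stopping the Server

A minimal sketch of a complete session, from starting the server to stopping it; the driver version here is an assumption, to be replaced with one available on your computer, and stopping the server at the end frees the port for later sessions

```md
library(RSelenium)

# start a server and an automated browser
driver <- rsDriver(browser = "chrome", chromever = "102.0.5005.27")

# single out the client and the server
browser <- driver$client
server <- driver$server

# ... navigate, interact, and scrape here ...

# when finished, close the browser and stop the server
browser$close()
server$stop()
```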
--- ## Dynamic Pages — Browsers — Navigate Navigate to a page with the following notation ```md browser`$navigate`("https://luzpar.netlify.app") ``` <img src="scrp_workshop_files/images_data/navigate.png" width="50%" style="display: block; margin: auto;" /> --- ## Dynamic Pages — Browsers — Navigate Navigate to a page with the following notation ```md browser`$`navigate("https://luzpar.netlify.app") ``` <br> Note that - `navigate` is called .yellow-h[a method, not a function] - it cannot be piped <span style="background-color: #ffff88;">%>%</span> into `browser` - use the dollar sign <span style="background-color: #ffff88;">$</span> notation instead --- ## Dynamic Pages — Browsers — Navigate Check the description of any method as follows, with no parentheses after the method name ```md browser$navigate ``` ``` Class method definition for method navigate() function (url) { "Navigate to a given url." qpath <- sprintf("%s/session/%s/url", serverURL, sessionInfo[["id"]]) queryRD(qpath, "POST", qdata = list(url = url)) } <environment: 0x00000173db9035a8> Methods used: "queryRD" ``` --- ## Dynamic Pages — Browsers — Navigate Go back to the previous URL ```md browser$goBack() ``` <br> Go forward ```md browser$goForward() ``` <br> Refresh the page ```md browser$refresh() ``` --- class: action ## Exercises 22) Navigate to a website, and then to another one - from within R, all the while observing the outcome in the automated browser <br> 23) Go back, and go forward <br> 24) See what other methods are available to interact with browsers - read the description for one or more of them <br> 25) Try one or more new methods - e.g., take a screenshot of your browser - and view it in R
10:00
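---

## Dynamic Pages — Browsers — Screenshots

One possibility for Exercise 25: a sketch using the `screenshot` method, assuming the automated browser from the previous exercises is still open

```md
# navigate to a page
browser$navigate("https://luzpar.netlify.app")

# view a screenshot of the automated browser in the RStudio viewer
browser$screenshot(display = TRUE)

# or save the screenshot to a file instead
browser$screenshot(file = "homepage.png")
```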
--- ## Dynamic Pages — Browsers — Navigate Get the URL of the current page ```md browser$getCurrentUrl() ``` <br> Get the title of the current page ```md browser$getTitle() ``` --- ## Dynamic Pages — Browsers — Close and Open Close the browser - which will not close the session on the server - recall that we have singled the client out ```md browser$close() ``` <br> Open a new browser - which does not require the `rsDriver` function - because the server is still running ```md browser$open() ``` --- ## Dynamic Pages — Browsers — Get Page Source Get the page source ```md browser$getPageSource()[[1]] ``` --- ## Dynamic Pages — Browsers — Get Page Source Get the page source ```md browser$getPageSource()`[[1]]` ``` <br> Note that - this method returns a list - XML source is in the first item - this is why we need the .inline-c[[[1]]] bit <br> - this is akin to `read_html()` for static pages - or `bow()` `%>%` `scrape()` <br> - `rvest` usually takes over after this step --- ## Dynamic Pages — Browsers — Get Page Source Extract the links on the homepage, with functions from both the `RSelenium` and `rvest` packages ```md browser$navigate(url = "https://luzpar.netlify.app") browser$getPageSource()[[1]] %>% read_html() %>% html_elements("#title a") %>% html_attr("href") ``` ``` [1] "https://github.com/resulumit/scrp_workshop" [2] "https://resulumit.com/" [3] "/documents/" [4] "/constituencies/" [5] "/members/" [6] "/states/" [7] "https://github.com/rstudio/blogdown" [8] "https://gohugo.io/" [9] "https://github.com/wowchemy" ``` --- ## Dynamic Pages — Browsers — Get Page Source Extract the links on the page, with functions from both the `RSelenium` and `rvest` packages ```md browser$navigate(url = "https://luzpar.netlify.app") browser$getPageSource()[[1]] %>% `read_html() %>%` html_elements("#title a") %>% html_attr("href") ``` <br> Note that - we are still using the `read_html()` function - to turn XML (coming from `RSelenium`) into HTML <br> - this is in fact not a dynamic page - we could do the same as above without `RSelenium` --- ## Dynamic Pages — Browsers — Get Page Source These two pieces of code lead to the same outcome, as the page we scrape is not dynamic .pull-left[ ```md browser$navigate(url = "https://luzpar.netlify.app") browser$getPageSource()[[1]] %>% read_html() %>% html_elements("#title a") %>% html_attr("href") ``` ``` [1] "https://github.com/resulumit/scrp_workshop" [2] "https://resulumit.com/" [3] "/documents/" [4] "/constituencies/" [5] "/members/" [6] "/states/" [7] "https://github.com/rstudio/blogdown" [8] "https://gohugo.io/" [9] "https://github.com/wowchemy" ``` ] .pull-right[ ```r read_html("https://luzpar.netlify.app") %>% html_elements(css = "#title a") %>% html_attr("href") ``` ``` ## [1] "https://github.com/resulumit/scrp_workshop" ## [2] "https://resulumit.com/" ## [3] "/documents/" ## [4] "/constituencies/" ## [5] "/members/" ## [6] "/states/" ## [7] "https://github.com/rstudio/blogdown" ## [8] "https://gohugo.io/" ## [9] "https://github.com/wowchemy" ``` ] --- class: action ## Exercises 26) Get the page source for <https://luzpar.netlify.app/members/> - using function(s) from `rvest` or `polite` <br> 27) Get the same page source, using `RSelenium` - compare the outcome with the one from Exercise 26 <br> 28) Collect names from <https://luzpar.netlify.app/members/> - using functions from `rvest` only - using `RSelenium` and `rvest` together - compare the outcomes
07:30
--- class: center, middle ## Interacting with Elements --- ## Dynamic Pages — Elements — Find .pull-left[ - Locate an element on the open browser - to be interacted with later on - e.g., clicking on the element <br> - Note that - the default selector is `xpath` - requires entering the `xpath` value ] .pull-right[ ```md findElement(using = "xpath", value ) ``` ] --- ## Dynamic Pages — Elements — Find .pull-left[ - Locate an element on the open browser - using CSS selectors <br> - Note that - typing .yellow-h["css"], instead of .yellow-h["css selector"], also works - there are other selector schemes as well, including - id - name - link text ] .pull-right[ ```md findElement(using = `"css selector"`, value ) ``` ] --- ## Dynamic Pages — Elements — Find — Selectors If there were a button created by the following code ... ```md <button class="big-button" id="only-button" name="clickable">Click Me</button> ``` <br> ... any of the lines below would find it ```md browser$findElement(using = "xpath", value = '//*[(@id = "only-button")]') browser$findElement(using = "css selector", value = ".big-button") browser$findElement(using = "css", value = "#only-button") browser$findElement(using = "id", value = "only-button") browser$findElement(using = "name", value = "clickable") ``` --- ## Dynamic Pages — Elements — Objects Save elements as R objects to be interacted with later on ```md button <- browser$findElement(using = ..., value = ...) ``` <br> Note the difference between the classes of clients and elements .pull-left[ ```md class(browser) ``` ``` [1] "remoteDriver" attr(,"package") [1] "RSelenium" ``` ] .pull-right[ ```md class(button) ``` ``` [1] "webElement" attr(,"package") [1] "RSelenium" ``` ] --- ## Dynamic Pages — Elements — Highlight Highlight the element found in the previous step, with the `highlightElement` method ```r # navigate to a page browser$navigate("http://luzpar.netlify.app/") # find the element menu_states <- browser$findElement(using = "link text", value = "States") # highlight it to see if we found the correct element menu_states$`highlightElement()` ``` <br> Note that - the highlighted element will flash for a second or two on the browser - helpful to check if selection worked as intended --- ## Dynamic Pages — Elements — Highlight Highlight the element found in the previous step, with the `highlightElement` method ```r # navigate to a page browser$navigate("http://luzpar.netlify.app/") # find the element menu_states <- `browser$`findElement(using = "link text", value = "States") # highlight it to see if we found the correct element `menu_states$`highlightElement() ``` <br> Note that - the highlighted element will flash for a second or two on the browser - helpful to check if selection worked as intended <br> - the highlight method is applied to the element (`menu_states`), not to the client (`browser`) --- ## Dynamic Pages — Elements — Click Click on the element found in the previous step, with the `clickElement` method ```r # navigate to a page browser$navigate("http://luzpar.netlify.app/") # find an element search_icon <- browser$findElement(using = "css", value = ".fa-search") # click on it search_icon$`clickElement()` ``` --- class: action ## Exercises 29) Go to <https://luzpar.netlify.app/constituencies/>, and click the next page button - using the automated browser - hint: to find the selector for the button, use an additional browser manually <br> 30) While on the second page, click the next page button again - hint: you will have to find the button again
07:30
--- ## Dynamic Pages — Elements — Input .pull-left[ - Provide input to elements, such as - text, with the <span style="background-color: #ffff88;">value</span> argument ] .pull-right[ ```md sendKeysToElement(list(`value`, key ) ) ``` ] --- ## Dynamic Pages — Elements — Input .pull-left[ - Provide input to elements, such as - text, with the value argument - keyboard presses or mouse gestures, with the <span style="background-color: #ffff88;">key</span> argument <br> - Note that - users provide the values, while the Selenium keys are pre-defined ] .pull-right[ ```md sendKeysToElement(list(value, `key` ) ) ``` ] --- ## Dynamic Pages — Elements — Input — Selenium Keys View the list of Selenium keys ```r as_tibble(selKeys) %>% names() ``` ``` ## [1] "null" "cancel" "help" "backspace" "tab" ## [6] "clear" "return" "enter" "shift" "control" ## [11] "alt" "pause" "escape" "space" "page_up" ## [16] "page_down" "end" "home" "left_arrow" "up_arrow" ## [21] "right_arrow" "down_arrow" "insert" "delete" "semicolon" ## [26] "equals" "numpad_0" "numpad_1" "numpad_2" "numpad_3" ## [31] "numpad_4" "numpad_5" "numpad_6" "numpad_7" "numpad_8" ## [36] "numpad_9" "multiply" "add" "separator" "subtract" ## [41] "decimal" "divide" "f1" "f2" "f3" ## [46] "f4" "f5" "f6" "f7" "f8" ## [51] "f9" "f10" "f11" "f12" "command_meta" ``` --- ## Dynamic Pages — Elements — Input — Selenium Keys — Note By choosing the body element, you can scroll up and down a page ```r body <- browser$findElement(using = "css", `value = "body"`) body$sendKeysToElement(list(`key = "page_down"`)) ``` --- ## Dynamic Pages — Elements — Input — Example Search the demonstration site ```r # navigate to the home page browser$navigate("http://luzpar.netlify.app/") # find the search icon and click on it search_icon <- browser$findElement(using = "css", value = ".fa-search") search_icon$clickElement() # find the search bar on the new page and click on it search_bar <- browser$findElement(using = "css", value = "#search-query") search_bar$clickElement() # search for the keyword "Law" and press enter search_bar$`sendKeysToElement(list(value = "Law", key = "enter"))` ``` --- ## Dynamic Pages — Elements — Input — Example Slow down the code where necessary, with the `Sys.sleep` function - for ethical reasons - because R might be faster than the browser ```r # navigate to the home page browser$navigate("http://luzpar.netlify.app/") # find the search icon and click on it search_icon <- browser$findElement(using = "css", value = ".fa-search") search_icon$clickElement() # sleep for 2 seconds *Sys.sleep(2) # find the search bar on the new page and click on it search_bar <- browser$findElement(using = "css", value = "#search-query") search_bar$clickElement() # search for the keyword "Law" and press enter search_bar$sendKeysToElement(list(value = "Law", key = "enter")) ``` --- ## Dynamic Pages — Elements — Input — Clear Clear text, or a value, from an element ```r search_bar$clearElement() ``` --- class: action ## Exercises 31) Conduct an internet search programmatically - navigate to <https://duckduckgo.com/> - just to keep it simple, as Google would require you to scroll down and accept a policy <br> - find, highlight, and conduct a search <br> 32) Scroll down programmatically, and up - to see all results <br> 33) Go back, and conduct another search - hint: you will have to find the search bar again
15:00
--- ## Dynamic Pages — Elements — Switch Frames .pull-left[ - Switch to a different frame on a page - some pages have multiple frames - you can think of them as browsers within browsers - while in one frame, we cannot work with the page source of another frame ] .pull-right[ ```md switchToFrame(Id ) ``` ] <br> - Note that - there is one such page on the demonstration website - <https://luzpar.netlify.app/documents/> - featuring a Shiny app that originally lives at <https://resulumit.shinyapps.io/luzpar/> <br> - the `Id` argument takes an element object, unquoted - setting it to `NULL` returns to the default frame --- ## Dynamic Pages — Elements — Switch Frames Switch to a non-default frame ```r # navigate to a page and wait for the frame to load browser$navigate("https://luzpar.netlify.app/documents/") Sys.sleep(4) # find the frame, which is an element app_frame <- browser$findElement("css", "iframe") # switch to it browser$`switchToFrame(Id = app_frame)` # switch back to the default frame browser$`switchToFrame(Id = NULL)` ``` --- ## Dynamic Pages — Scraping — Example **Task:** - I need to download specific documents published by the parliament - e.g., proposals and reports <br> - The related section of the website is a dynamic page - initially it is empty, and clicking on things does not change the URL -- <br> **Plan:** - Interact with the page until it displays the desired list of documents - Get the page source and separate the links - Write a for loop to - visit the related pages one by one - download the documents --- ## Dynamic Pages — Scraping — Example Interact with the page until it displays the desired list of documents ```r # navigate to the desired page and wait a little browser$navigate("https://luzpar.netlify.app/documents/") Sys.sleep(4) # switch to the frame with the app app_frame <- browser$findElement("css", "iframe") browser$switchToFrame(Id = app_frame) # find and open the drop down menu drop_down <- browser$findElement(using = "css", value = ".bs-placeholder") drop_down$clickElement() # choose proposals proposal <- browser$findElement(using = 'css', "[id='bs-select-1-1']") proposal$clickElement() # choose reports report <- browser$findElement(using = 'css', "[id='bs-select-1-2']") report$clickElement() # close the drop down menu drop_down$clickElement() ``` --- ## Dynamic Pages — Scraping — Example Get the page source and separate the links ```r the_links <- browser$getPageSource()[[1]] %>% read_html() %>% html_elements("td a") %>% html_attr("href") print(the_links) ``` ``` ## [1] "https://luzpar.netlify.app/documents/human-rights-2021/" ## [2] "https://luzpar.netlify.app/documents/greenhouse-gas-emissions-2021/" ## [3] "https://luzpar.netlify.app/documents/tax-reform-2020/" ## [4] "https://luzpar.netlify.app/documents/parliamentary-staff-2020/" ## [5] "https://luzpar.netlify.app/documents/cyber-security-2019/" ## [6] "https://luzpar.netlify.app/documents/electronic-cigarettes-2019/" ``` --- ## Dynamic Pages — Scraping — Example Write a for loop to download PDFs ```md for (i in 1:length(the_links)) { pdf_link <- bow(the_links[i]) %>% scrape() %>% html_elements(css = ".btn-page-header") %>% html_attr("href") %>% url_absolute(base = "https://luzpar.netlify.app/") download.file(url = pdf_link, destfile = basename(pdf_link), mode = "wb") } ``` --- class: action ## Exercise 34) Collect data on a subset of documents - article tags and image credits - for documents within the Law and Proposal categories - published after 2019 Hint: - start with the related code in the previous slides - modify as necessary
30:00
--- name: reference-slide class: inverse, center, middle # References .footnote[ [Back to the contents slide](#contents-slide). ] --- ## References Harrison, J. (2020). _RSelenium: R Bindings for Selenium WebDriver_. R package version 1.7.7. <http://docs.ropensci.org/RSelenium>. Meissner, P. and K. Ren (2020). _robotstxt: A robots.txt Parser and Webbot/'Spider'/Crawler Permissions Checker_. R package version 0.7.13. <https://CRAN.R-project.org/package=robotstxt>. Perepolkin, D. (2019). _polite: Be Nice on the Web_. R package version 0.1.1. <https://github.com/dmi3kno/polite>. Silge, J. and D. Robinson (2017). _Text mining with R: A tidy approach_. O'Reilly. Wickham, H. (2021). _rvest: Easily Harvest (Scrape) Web Pages_. R package version 1.0.2. <https://CRAN.R-project.org/package=rvest>. Wickham, H., R. François, L. Henry, et al. (2022). _dplyr: A Grammar of Data Manipulation_. R package version 1.0.9. <https://CRAN.R-project.org/package=dplyr>. Wickham, H. and G. Grolemund (2021). _R for data science_. O'Reilly. Xie, Y. (2022). _xaringan: Presentation Ninja_. R package version 0.24. <https://github.com/yihui/xaringan>. --- class: middle, center ## The workshop ends here. ## Congratulations for making it this far, and ## thank you for joining me! .footnote[ [Back to the contents slide](#contents-slide). ]