Resul Umit
post-doctoral researcher in political science at the University of Oslo
teaching and studying representation, elections, and parliaments
teaching workshops, also on
One and a half days, on how to automate the process of extracting data from websites

Designed for researchers with basic knowledge of the R programming language

Data available on websites provide attractive opportunities for academic research
Acquiring such data requires specific skills
Typically, such skills are not part of academic training

To provide you with an understanding of what is ethically possible
To start you with acquiring and practicing the skills needed
Part 1. Getting the Tools Ready
Part 2. Preliminary Considerations
Part 6. Scraping Dynamic Pages
I will go through a number of slides...
... and then pause, for you to use/do those things
We are here to help
Slides with this background colour indicate that your action is required, for
setting the workshop up
completing the exercises
03:00
Code and text that go in R console or scripts appear as such — in a different font, on gray background

bow("https://luzpar.netlify.app/members/") %>%
  scrape() %>%
  html_elements(css = "td+ td a") %>%
  html_attr("href") %>%
  url_absolute(base = "https://luzpar.netlify.app/")

Results that come out as output appear as such — in the same font, on green background

Specific sections are highlighted yellow as such for emphasis
The slides are designed for self-study as much as for the workshop
Having the workshop slides* on your own machine might be helpful
Access at https://resulumit.com/teaching/scrp_workshop.html
* These slides are produced in R, with the xaringan package (Xie, 2022).
There is a demonstration website for this workshop
Using this demonstration website for practice is recommended
Explore the website now
05:00
Programming language of this workshop
Download R from https://cloud.r-project.org
R.version.string

The command above checks the version of your copy*

* The same applies to all software that follows — consider updating if you have them already installed. This ensures everyone works with the latest, exactly the same, tools.
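If you wish, you can also compare the installed version against a minimum programmatically — a small sketch, where the "4.1.0" threshold is only an example, not a workshop requirement:

# returns TRUE if the installed R is at least version 4.1.0 (example threshold)
getRversion() >= "4.1.0"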
Optional, but highly recommended
A popular integrated development environment (IDE) for R
Download RStudio from https://rstudio.com/products/rstudio/download
Help -> Check for Updates
RStudio allows for dividing your work with R into separate projects
File -> New Project -> New Directory -> New Project
Choose a location for the project with Browse... — e.g., a folder synced with Dropbox
Install the packages that we need

install.packages(c("rvest", "RSelenium", "robotstxt", "polite", "dplyr"))

* You may already have a copy of one or more of these packages. In that case, I recommend updating by re-installing them now.

02:00

We will use
rvest (Wickham, 2021), for scraping websites
RSelenium (Harrison, 2020), for browsing the web programmatically
robotstxt (Meissner and Ren, 2020), for checking permissions to scrape websites
polite (Perepolkin, 2019), for compliance with permissions to scrape websites
dplyr (Wickham, François, Henry, and Müller, 2022), for data manipulation

Check that you are in your recently created project

Create a new R Script, following from the RStudio menu
File -> New File -> R Script
name it, e.g., scrape_web.R

Load the rvest and other packages

library(rvest)
library(RSelenium)
library(robotstxt)
library(polite)
library(dplyr)
A language and software that RSelenium needs
Download Java from https://www.java.com/en/download/
Chrome — a browser that facilitates web scraping
preferred by RSelenium and most programmers

SelectorGadget — an extension for Chrome
Add the extension to your browser
ScrapeMate is an alternative extension
Solutions to exercises, or links to them, are available online
I recommend consulting the solutions only as a last resort

RSelenium vignettes
R for Data Science (Wickham and Grolemund, 2021)
Text Mining with R: A Tidy Approach (Silge and Robinson, 2017)
* I recommend these to be consulted not during but after the workshop.
Web scraping might be unethical

Most websites declare a robots exclusion protocol, in robots.txt files

robots.txt protocols cannot be enforced upon scrapers

The syntax of robots.txt files is specific but intuitive
The robotstxt package makes checking them even easier
robots.txt — Syntax

It has pre-defined keys, most importantly
User-agent
indicates who the protocol is for
Allow
indicates which part(s) of the website can be scraped
Disallow
indicates which part(s) must not be scraped
Crawl-delay
indicates how fast the website could be scraped
Note that
each key is followed by a colon and a value

User-agent:
Allow:
Disallow:
Crawl-delay:

robots.txt — Syntax

Websites define their own values

Note that
/ indicates all sections and pages
/about/ indicates a specific path
values for Crawl-delay are in seconds

In the example below, everyone is allowed everywhere, except that /about/ is left out, and with a delay of 5 seconds

User-agent: *
Allow: /
Disallow: /about/
Crawl-delay: 5
robots.txt — Examples

The protocol of this website only applies to Google, which is allowed to scrape everything

User-agent: googlebot
Allow: /

The protocol of this website also only applies to Google, which must not scrape two specific sections

User-agent: googlebot
Disallow: /about/
Disallow: /history/
robots.txt — Examples

This website has different protocols for different agents
Google is allowed to scrape everything, with a 5-second delay
Bing is not allowed to scrape anything
everyone else can scrape the section or page located at www.websiteurl/about/

User-agent: googlebot
Allow: /
Crawl-delay: 5

User-agent: bing
Disallow: /

User-agent: *
Allow: /about/
robots.txt — Notes

There are also some other, lesser known, directives

User-agent: *
Allow: /
Disallow: /about/
Crawl-delay: 5
Visit-time: 01:45-08:30

Files might include optional comments, written after the number sign #

# thank you for respecting our protocol
User-agent: *
Allow: /
Disallow: /about/
Visit-time: 01:45-08:30 # please visit when it is night time in the UK (GMT)
Crawl-delay: 5 # please delay for five seconds, to ensure our servers are not overloaded
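To see a real protocol from within R, the robotstxt package can fetch the raw file — a minimal sketch, assuming its get_robotstxt() function, before we turn to the parsed versions below:

library(robotstxt)

# fetch and print the raw robots.txt of the demonstration website
get_robotstxt(domain = "https://luzpar.netlify.app")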
robotstxt

The robotstxt package facilitates checking website protocols

There are two main functions
robotstxt, which gets complete protocols
paths_allowed, which checks protocols for one or more specific paths

robotstxt

Use the robotstxt function to get a protocol
websites are specified with the domain argument

robotstxt(
  domain = NULL,
  ...
)

robotstxt

robotstxt(domain = "https://luzpar.netlify.app")
## $domain
## [1] "https://luzpar.netlify.app"
## 
## $text
## [robots.txt]
## --------------------------------------
## 
## User-agent: googlebot
## Disallow: /states/
## 
## User-agent: *
## Disallow: /exercises/
## 
## User-agent: *
## Allow: /
## Crawl-delay: 2
## 
## 
## $robexclobj
## <Robots Exclusion Protocol Object>
## $bots
## [1] "googlebot" "*"
## 
## $comments
## [1] line comment
## <0 rows> (or 0-length row.names)
## 
## $permissions
##      field useragent       value
## 1 Disallow googlebot    /states/
## 2 Disallow         * /exercises/
## 3    Allow         *           /
## 
## $crawl_delay
##         field useragent value
## 1 Crawl-delay         *     2
## 
## $host
## [1] field     useragent value
## <0 rows> (or 0-length row.names)
## 
## $sitemap
## [1] field     useragent value
## <0 rows> (or 0-length row.names)
## 
## $other
## [1] field     useragent value
## <0 rows> (or 0-length row.names)
## 
## $check
## function (paths = "/", bot = "*") 
## {
##     spiderbar::can_fetch(obj = self$robexclobj, path = paths, 
##         user_agent = bot)
## }
## <bytecode: 0x00000257bc12e528>
## <environment: 0x00000257bc129350>
## 
## attr(,"class")
## [1] "robotstxt"
robotstxt

For the most relevant part of the output, check the list of permissions
robotstxt(domain = "https://luzpar.netlify.app")$permissions
##      field useragent       value
## 1 Disallow googlebot    /states/
## 2 Disallow         * /exercises/
## 3    Allow         *           /
robotstxt

Use the paths_allowed function to check protocols for one or more specific paths
websites are specified with the domain argument
paths and bot are the other important arguments
it returns TRUE (allowed to scrape) or FALSE (not allowed)

paths_allowed(
  domain = "auto",
  paths = "/",
  bot = "*",
  ...
)
robotstxt
paths_allowed(domain = "https://luzpar.netlify.app")
## [1] TRUE
paths_allowed(domain = "https://luzpar.netlify.app", paths = c("/states/", "/constituencies/"))
## [1] TRUE TRUE
paths_allowed(domain = "https://luzpar.netlify.app", paths = c("/states/", "/constituencies/"), bot = "googlebot")
## [1] FALSE TRUE
1) Check the protocols for https://www.theguardian.com
with the robotstxt function in R

2) Check a path with the paths_allowed function
try to find a path that returns FALSE

3) Check the protocols for any website that you might wish to scrape
with the robotstxt function

10:00
Websites are designed for visitors with human-speed in mind
Waiting a little between two visits makes scraping more ethical
Not waiting enough might lead to a ban
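A minimal sketch of pausing between requests with Sys.sleep — the two-second wait mirrors the Crawl-delay declared by the demonstration website:

library(rvest)

# two pages on the demonstration website
pages <- c("https://luzpar.netlify.app/states/",
           "https://luzpar.netlify.app/constituencies/")

for (page in pages) {
  source_code <- read_html(page)  # get the source code of one page
  Sys.sleep(2)                    # wait before moving on to the next request
}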
Ideally, we scrape for a purpose
Scraped data frequently requires cleaning and manipulation before analysis
Webpages include more than what is immediately visible to visitors
Web scraping requires working with the source code
Source code also offers more, invisible, data to be scraped
The Ctrl + U shortcut displays source code — alternatively, right click and View Page Source
Browsers also offer putting source codes in a structure, known as DOM (document object model)
The F12 key opens it on Chrome — alternatively, right click and Inspect
4) View the source code of a page
5) Search for a word or a phrase in source code
with the Ctrl + F shortcut

05:00
HTML stands for hypertext markup language

HTML documents are structured with elements, as in the example below

<!DOCTYPE html>
<html>
  <head>
    <style>
      h1 {color: blue;}
    </style>
    <title>A title for browsers</title>
  </head>
  <body>
    <h1>A header</h1>
    <p>This is a paragraph.</p>
    <ul>
      <li>This</li>
      <li>is a</li>
      <li>list</li>
    </ul>
  </body>
</html>

html, the root element, holds together the head and body elements

head contains metadata, such as the style rules and the title for browsers

body contains the elements in the main body of pages, such as headers, paragraphs, and lists
Most elements have opening and closing tags
Most elements have some content, between the tags

<p>This is a one sentence paragraph.</p>

This is a one sentence paragraph.
Elements can have attributes

<p>This is a <strong id="sentence-count">one</strong> sentence paragraph.</p>

This is a one sentence paragraph.

Note that
names such as strong and id cannot be made up, as they are pre-defined
the id attribute has no visible effects
others, such as style, can have visible effects

There could be more than one attribute in a single element
<p>This is a <strong class="count" id="sentence-count">one</strong> sentence paragraph.</p>
<p>There are now <strong class="count" id="paragraph-count">two</strong> paragraphs.</p>
This is a one sentence paragraph.
There are now two paragraphs.
Note that
a class attribute (e.g., count) can apply to multiple elements
an id attribute must be unique on a given page

Elements can be nested
<p>This is a <strong>one</strong> sentence paragraph.</p>
This is a one sentence paragraph.
Note that
By default, multiple spaces and/or line breaks are ignored by browsers
<ul><li>books</li><li>journal articles</li><li>reports</li></ul>
Links are provided with the a (anchor) element
<p>Click <a href="https://www.google.com/">here</a> to google things.</p>
Click here to google things.
Note that
href (hypertext reference) is a required attribute for this element

Links can have titles
<p>Click <a title="This text appears when visitors hover over the link" href="https://www.google.com/">here</a> to google things.</p>
Click here to google things.
Note that
the title attribute is one of the optional attributes

The <ul> tag introduces un-ordered lists, while the <li> tag defines list items
<ul>
  <li>books</li>
  <li>journal articles</li>
  <li>reports</li>
</ul>
Note that
ordered lists take the <ol> tag instead

The <div> tag defines a section, containing one or often more elements
<p>This is an introductory paragraph.</p>

<div style="text-decoration:underline;">
  <p>In this important division there are two elements, which are:</p>
  <ul>
    <li>a paragraph, and</li>
    <li>an unordered list.</li>
  </ul>
</div>

<p>This is the concluding paragraph.</p>
This is an introductory paragraph.
In this important division there are two elements, which are:
This is the concluding paragraph.
The <span> tag also defines a section, containing a part of an element

<p>This is an <span style="text-decoration:underline;">important paragraph</span>, which
you must read carefully.</p>
This is an important paragraph, which you must read carefully.
6) Re-create the page at https://luzpar.netlify.app/states/ in R
create an HTML document, following File -> New File -> HTML File
click Preview to view the result

7) Add at least one extra tag and/or attribute
with a visible effect on how the page looks at the front end
save this document as we will continue working on it

15:00
CSS stands for cascading style sheets
CSS can be defined
internally, with a style element in the head element
externally, in a separate file linked in the head element
inline, with a style attribute in individual elements

p {font-size:14px;}
h1 h2 {color:blue;}
.count {background-color:yellow;}
#sentence-count {color:red; font-size:16px;}

Note that
p selects elements by type, .count selects them by class, and #sentence-count selects one by id
property:value; pairs are separated by a white space
CSS rules can be defined internally
with a style element, within the head element

Internally defined rules apply to all matching selectors

<!DOCTYPE html>
<html>
  <head>
    <style>
      h1 {color:blue;}
    </style>
    <title>A title for browsers</title>
  </head>
  <body>
    <h1>A header</h1>
    <p>This is a paragraph.</p>
    <ul>
      <li>This</li>
      <li>is a</li>
      <li>list</li>
    </ul>
  </body>
</html>
CSS rules can be defined externally
in a separate file, referenced with a link element within the head element

Externally defined rules apply to every page that links the file

<!DOCTYPE html>
<html>
  <head>
    <link rel="stylesheet" href="simple.css">
    <title>A title for browsers</title>
  </head>
  <body>
    <h1>A header</h1>
    <p>This is a paragraph.</p>
    <ul>
      <li>This</li>
      <li>is a</li>
      <li>list</li>
    </ul>
  </body>
</html>
CSS rules can also be defined inline
with a style attribute, within individual elements

<p>This is a <strong style="color:blue;">one</strong> sentence paragraph.</p>
This is a one sentence paragraph.
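These same CSS selectors are what we will later pass to rvest when scraping — a minimal sketch, with a made-up HTML snippet mirroring the examples above:

library(rvest)

# a small, hypothetical page with a class and two ids
page <- minimal_html('
  <p>This is a <strong class="count" id="sentence-count">one</strong> sentence paragraph.</p>
  <p>There are now <strong class="count" id="paragraph-count">two</strong> paragraphs.</p>')

html_elements(page, css = ".count")           # both elements with the count class
html_elements(page, css = "#sentence-count")  # the single element with that id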
8) Provide some simple style to your HTML document
07:30
Static pages are those that display the same source code to all visitors
every visitor sees the same content at a given URL
https://luzpar.netlify.app/ is a static page
Static pages are scraped typically in two steps
the rvest package can handle both steps

Get the source code into R
with the rvest or polite package, using URLs of these pages

Extract the exact information needed from the source code
with the rvest package, using selectors for that exact information

rvest — Overview

A relatively small R package for web scraping
a member of the tidyverse family, designed to work with other tidyverse packages

A lot has already been written on this package

Comes with the recommendation to combine it with the polite package
rvest — Get Source Code

Use the read_html function to get the source code of a webpage into R
read_html("https://luzpar.netlify.app/")
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ...

rvest — Get Source Code

You may wish to check the protocol first, for ethical scraping
paths_allowed(domain = "https://luzpar.netlify.app/")
## [1] TRUE
read_html("https://luzpar.netlify.app/")
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ...
rvest — Get Source Code — polite

The polite package facilitates ethical scraping
it is designed to be used together with rvest

It divides the step of getting the source code into two

It offers other helper functions as well
rvest — Get Source Code — polite

First, use the bow function to check the protocol

Note that
the user_agent argument can communicate information to website administrators
the delay argument cannot be set to a number smaller than in the directive
the force argument is set to FALSE by default

bow(url,
    user_agent = "polite R package - https://github.com/dmi3kno/polite",
    delay = 5,
    force = FALSE,
    ...
)
rvest — Get Source Code — polite

First, use the bow function to check the protocol
Second, use the scrape function to get the source code
it takes the output of the bow function

Note that
scrape will only work if the results from bow are positive
by piping bow into scrape, you can avoid creating objects

scrape(bow,
    ...
)

bow() %>% scrape()
rvest — Get Source Code

These two pieces of code lead to the same outcome, as there is no protocol against the access
read_html("https://luzpar.netlify.app/")
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ...
bow("https://luzpar.netlify.app/") %>% scrape()
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ...
rvest — Get Source Code

The difference occurs when there is a protocol against the access
read_html("https://luzpar.netlify.app/exercises/exercise_6.Rhtml")
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\r\n\r\n<h1>States of Luzland</h1>\r\n      \r\n<p>There are four ...
bow("https://luzpar.netlify.app/exercises/exercise_6.Rhtml") %>% scrape()
## Warning: No scraping allowed here!
## NULL
9) Get the source code of the page at https://luzpar.netlify.app/states/ in R
with the read_html function

10) Get the same page source, this time in the polite way
05:00
rvest — html_elements

Get one or more HTML elements
from the source code downloaded in the previous step
specified with a selector, CSS or XPath

Note that
there are two versions of the same function
the singular one gets the first instance of an element, the plural gets all instances
if there is only one instance, both functions return the same result
we will work with CSS only in this workshop
using CSS is facilitated by Chrome and SelectorGadget

html_element(x, css, xpath)
html_elements(x, css, xpath)
Finding the correct selector(s) is the key to successful scraping, and there are three ways to do it

I recommend using SelectorGadget

To find the selectors for the hyperlinks on the homepage of the Parliament of Luzland, click on one of them with SelectorGadget

Note that
the suggested selector is a
rvest — html_elements

Get the a (anchor) elements on the homepage
bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "a")
## {xml_nodeset (24)}
## [1] <a class="js-search" href="#" aria-label="Close"><i class="fas fa-times- ...
## [2] <a class="navbar-brand" href="/">Parliament of Luzland</a>
## [3] <a class="navbar-brand" href="/">Parliament of Luzland</a>
## [4] <a class="nav-link active" href="/"><span>Home</span></a>
## [5] <a class="nav-link" href="/states/"><span>States</span></a>
## [6] <a class="nav-link" href="/constituencies/"><span>Constituencies</span></a>
## [7] <a class="nav-link" href="/members/"><span>Members</span></a>
## [8] <a class="nav-link" href="/documents/"><span>Documents</span></a>
## [9] <a class="nav-link js-search" href="#" aria-label="Search"><i class="fas ...
## [10] <a href="#" class="nav-link" data-toggle="dropdown" aria-haspopup="true" ...
## [11] <a href="#" class="dropdown-item js-set-theme-light"><span>Light</span></a>
## [12] <a href="#" class="dropdown-item js-set-theme-dark"><span>Dark</span></a>
## [13] <a href="#" class="dropdown-item js-set-theme-auto"><span>Automatic</spa ...
## [14] <a href="https://github.com/resulumit/scrp_workshop" target="_blank" rel ...
## [15] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Um ...
## [16] <a href="/documents/">documents</a>
## [17] <a href="/constituencies/">constituencies</a>
## [18] <a href="/members/">members</a>
## [19] <a href="/states/">states</a>
## [20] <a href="https://github.com/rstudio/blogdown" target="_blank" rel="noope ...
## ...
rvest — html_element

Get the first a (anchor) element on the homepage
bow("https://luzpar.netlify.app") %>% scrape() %>% html_element(css = "a")
## {html_node}
## <a class="js-search" href="#" aria-label="Close">
## [1] <i class="fas fa-times-circle text-muted" aria-hidden="true"></i>
To exclude the menu items from selection
click on a menu item, to deselect it

Note that
the suggested selector becomes #title a
rvest — html_elements

Get the a (anchor) elements on the homepage, under the element with the title id
bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a")
## {xml_nodeset (9)}
## [1] <a href="https://github.com/resulumit/scrp_workshop" target="_blank" rel= ...
## [2] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ...
## [3] <a href="/documents/">documents</a>
## [4] <a href="/constituencies/">constituencies</a>
## [5] <a href="/members/">members</a>
## [6] <a href="/states/">states</a>
## [7] <a href="https://github.com/rstudio/blogdown" target="_blank" rel="noopen ...
## [8] <a href="https://gohugo.io/" target="_blank" rel="noopener">Hugo</a>
## [9] <a href="https://github.com/wowchemy" target="_blank" rel="noopener">Wowc ...
You can click further to exclude some and/or to include more elements
Note that the selection is colour-coded
rvest — html_elements

Get the link behind the selected elements
bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "br+ p a")
## {xml_nodeset (2)}
## [1] <a href="https://github.com/resulumit/scrp_workshop" target="_blank" rel= ...
## [2] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ...
You can click further to select a single element
bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "br+ p a+ a")
## {xml_nodeset (1)}
## [1] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ...
To find the selector for a single element, you could also use Chrome itself
Inspect
Copy -> Copy selector
rvest — html_elements
Get the link behind one element, with css from Chrome
bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title > div.container > div > p:nth-child(2) > a:nth-child(2)")
## {xml_nodeset (1)}
## [1] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ...
Note that
the selector is different than the one SelectorGadget returns
but the outcome is the same
11) Get the first item on the list on the page at https://luzpar.netlify.app/states/
12) Get all items on the list
13) Get only the second and fourth items on the list
10:00
rvest — html_text

Get the text content of one or more HTML elements
from the output of the html_elements function

Note that
html_text returns text with any space or line breaks around it
html_text2 returns plain text

html_text(x, trim = FALSE)
html_text2(x, preserve_nbsp = FALSE)
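A minimal sketch of the difference between the two functions, with a made-up HTML snippet:

library(rvest)

# a small page with deliberately messy whitespace (illustrative only)
snippet <- minimal_html("<p>  This is a
  paragraph with   extra whitespace.  </p>")

snippet %>% html_elements("p") %>% html_text()   # keeps the line break and extra spaces
snippet %>% html_elements("p") %>% html_text2()  # returns cleaned-up, plain text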
rvest — html_text
bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% html_text()
## [1] "a workshop on automated web scraping"## [2] "Resul Umit" ## [3] "documents" ## [4] "constituencies" ## [5] "members" ## [6] "states" ## [7] "Blogdown" ## [8] "Hugo" ## [9] "Wowchemy"
14) Get the text on the list elements on the page at https://luzpar.netlify.app/states/
15) Get the constituency names on the page at https://luzpar.netlify.app/constituencies/
05:00
rvest — html_attr

Get one or more attributes of one or more HTML elements
from the output of the html_elements function

Note that
html_attr gets a single attribute, specified with the name argument, while html_attrs gets all attributes

html_attr(x, name, default = NA_character_)
html_attrs(x)
rvest — html_attrs
bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% html_attrs()
## [[1]]
## href
## "https://github.com/resulumit/scrp_workshop"
## target
## "_blank"
## rel
## "noopener"
## 
## [[2]]
## href target rel
## "https://resulumit.com/" "_blank" "noopener"
## 
## [[3]]
## href
## "/documents/"
## 
## [[4]]
## href
## "/constituencies/"
## 
## [[5]]
## href
## "/members/"
## 
## [[6]]
## href
## "/states/"
## 
## [[7]]
## href target
## "https://github.com/rstudio/blogdown" "_blank"
## rel
## "noopener"
## 
## [[8]]
## href target rel
## "https://gohugo.io/" "_blank" "noopener"
## 
## [[9]]
## href target
## "https://github.com/wowchemy" "_blank"
## rel
## "noopener"
rvest — html_attr

bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% html_attr(name = "href")

## [1] "https://github.com/resulumit/scrp_workshop"
## [2] "https://resulumit.com/"
## [3] "/documents/"
## [4] "/constituencies/"
## [5] "/members/"
## [6] "/states/"
## [7] "https://github.com/rstudio/blogdown"
## [8] "https://gohugo.io/"
## [9] "https://github.com/wowchemy"
Note that
some URLs are relative, such as /states/, which is actually https://luzpar.netlify.app/states/
these can be completed with the url_absolute function

rvest — url_absolute

Complete the relative URLs with the url_absolute function
bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% html_attr(name = "href") %>% url_absolute(base = "https://luzpar.netlify.app")
## [1] "https://github.com/resulumit/scrp_workshop"## [2] "https://resulumit.com/" ## [3] "https://luzpar.netlify.app/documents/" ## [4] "https://luzpar.netlify.app/constituencies/"## [5] "https://luzpar.netlify.app/members/" ## [6] "https://luzpar.netlify.app/states/" ## [7] "https://github.com/rstudio/blogdown" ## [8] "https://gohugo.io/" ## [9] "https://github.com/wowchemy"
16) Get the hyperlink attributes for the constituencies at https://luzpar.netlify.app/constituencies/
17) Create complete links to the constituency pages
05:00
rvest — html_table

Use the html_table() function to get the text content of table elements
bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "table") %>% html_table()
## [[1]]
## # A tibble: 100 x 3
##    Member           Constituency  Party      
##    <chr>            <chr>         <chr>      
##  1 Arthur Ali       Mühlshafen    Liberal    
##  2 Chris Antony     Benwerder     Labour     
##  3 Chloë Bakker     Steffisfelden Labour     
##  4 Rose Barnes      Dillon        Liberal    
##  5 Emilia Bauer     Kilnard       Green      
##  6 Wilma Baumann    Granderry     Green      
##  7 Matteo Becker    Enkmelo       Labour     
##  8 Patricia Bernard Gänsernten    Labour     
##  9 Lina Booth       Leonrau       Liberal    
## 10 Sophie Bos       Zotburg       Independent
## # ... with 90 more rows
rvest

We can create the same tibble with html_text, which requires getting each variable separately to be merged
tibble("Member" = bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td:nth-child(1) a") %>% html_text(),"Constituency" = bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td:nth-child(2) a") %>% html_text(),"Party" = bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td:nth-child(3)") %>% html_text())
rvest

Keep the number of interactions with websites to a minimum
the_page <- bow("https://luzpar.netlify.app/members/") %>% scrape()

tibble(
  "Member" = the_page %>% html_elements(css = "td:nth-child(1)") %>% html_text(),
  "Constituency" = the_page %>% html_elements(css = "td:nth-child(2)") %>% html_text(),
  "Party" = the_page %>% html_elements(css = "td:nth-child(3)") %>% html_text()
)
18) Create a dataframe out of the table at https://luzpar.netlify.app/members/
15:00
Rarely does a single page include all variables that we need
Web scraping then requires crawling across pages
We can write for loops to crawl

Task: create a dataset of constituencies, with the second-placed party and its vote share in each

Plan:
Scrape the page that has all URLs, for absolute URLs
the_links <- bow("https://luzpar.netlify.app/members/") %>% scrape() %>%
  html_elements(css = "td+ td a") %>%
  html_attr("href") %>%
  url_absolute(base = "https://luzpar.netlify.app/")

# check if it worked
head(the_links)
## [1] "https://luzpar.netlify.app/constituencies/muhlshafen/" ## [2] "https://luzpar.netlify.app/constituencies/benwerder/" ## [3] "https://luzpar.netlify.app/constituencies/steffisfelden/"## [4] "https://luzpar.netlify.app/constituencies/dillon/" ## [5] "https://luzpar.netlify.app/constituencies/kilnard/" ## [6] "https://luzpar.netlify.app/constituencies/granderry/"
Create an empty list
Start a for loop to iterate over the links one by one
Get the source code for the next link
Get the variables needed, put them in a tibble
Add each tibble into the previously-created list
Turn the list into a tibble

temp_list <- list()

for (i in 1:length(the_links)) {

  the_page <- bow(the_links[i]) %>% scrape()

  temp_tibble <- tibble(
    "constituency" = the_page %>% html_elements("#constituency") %>% html_text(),
    "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(),
    "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text()
  )

  temp_list[[i]] <- temp_tibble
}

df <- as_tibble(do.call(rbind, temp_list))
Check the resulting dataset
head(df, 10)
## # A tibble: 100 x 3
##    constituency  second_party vote_share
##    <chr>         <chr>        <chr>     
##  1 Mühlshafen    Green        26.1%     
##  2 Benwerder     Conservative 24.8%     
##  3 Steffisfelden Green        25.7%     
##  4 Dillon        Conservative 27%       
##  5 Kilnard       Conservative 28.8%     
##  6 Granderry     Labour       26.1%     
##  7 Enkmelo       Liberal      26.8%     
##  8 Gänsernten    Green        26.6%     
##  9 Leonrau       Conservative 25%       
## 10 Zotburg       Conservative 28.4%     
## # ... with 90 more rows
19) Crawl into members' personal pages to create a rich dataset
Hints:
while writing and testing your code, replace 1:length(the_links) with 1:3 for the loop

45:00
Dynamic pages are ones that display custom content
Dynamic pages are scraped typically in three steps
we need an additional tool, RSelenium, for the new step

Scraping dynamic pages involves three main steps

Interact with the page, until it displays the content needed
with the RSelenium package

Get the source code into R
RSelenium downloads XML, which rvest turns into HTML

Extract the exact information needed from the source code
with the rvest package
RSelenium — Overview

A package that integrates Selenium 2.0 WebDriver into R

A lot has already been written on this package

The package involves more methods than functions

It allows interacting with two things — the server and the client (browser) — and it is crucial that users are aware of the difference
Use the rsDriver function to start a server

Note that the defaults can cause errors, such as a mismatch between the driver and browser versions

rsDriver(port = 4567L,
    browser = "chrome",
    version = "latest",
    chromever = "latest",
    ...
)
The latest version of the driver is too new for my browser

binman::list_versions("chromedriver")

## $win32
##  [1] "100.0.4896.20" "100.0.4896.60" "102.0.5005.27" "102.0.5005.61"
##  [5] "103.0.5060.24" "89.0.4389.23"  "90.0.4430.24"  "91.0.4472.19" 
##  [9] "99.0.4844.35"  "99.0.4844.51" 

Note that
specifying an older driver version, one that matches the browser, solves the problem

Then the function works
Note that
the browser says "Chrome is being controlled by automated test software."
you should avoid controlling this browser manually
you should also avoid creating multiple servers
driver <- rsDriver(chromever = "102.0.5005.27")
Separate the client and server as different objects
browser <- driver$client
server <- driver$server

Note that
rsDriver() creates a client and a server
the client is an object of class remoteDriver
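A minimal sketch of a typical session with the two objects — the version number is only an example, and stopping the server with its stop() method when you are done is a common convention:

# start a server and a client (adjust chromever to match your own browser)
driver  <- rsDriver(chromever = "102.0.5005.27")
browser <- driver$client
server  <- driver$server

# interact with the browser through the client
browser$navigate("https://luzpar.netlify.app")

# when finished, close the browser and stop the server
browser$close()
server$stop()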
20) Start a server
21) Single out the client
name it browser, to help you follow the slides

02:30
Navigate to a page with the following notation
browser$navigate("https://luzpar.netlify.app")
Note that
navigate is called a method, not a function
methods are applied to the client object, browser, with the dollar sign notation
Check the description of any method as follows, with no parentheses after the method name
browser$navigate
Class method definition for method navigate()
function (url) 
{
    "Navigate to a given url."
    qpath <- sprintf("%s/session/%s/url", serverURL, sessionInfo[["id"]])
    queryRD(qpath, "POST", qdata = list(url = url))
}
<environment: 0x00000173db9035a8>
Methods used: 
    "queryRD"
Go back to the previous URL
browser$goBack()
Go forward
browser$goForward()
Refresh the page
browser$refresh()
22) Navigate to a website, and then to another one
23) Go back, and go forward
24) See what other methods are available to interact with browsers
25) Try one or more new methods
10:00
Get the URL of the current page
browser$getCurrentUrl()
Get the title of the current page
browser$getTitle()
Close the browser
browser$close()
Open a new browser
without re-running the rsDriver function

browser$open()
Get the page source
browser$getPageSource()[[1]]
Note that
this is the equivalent of read_html() for static pages, or bow() %>% scrape()
rvest usually takes over after this step

Extract the links on the homepage, with functions from both the RSelenium and rvest packages
browser$navigate(url = "https://luzpar.netlify.app")

browser$getPageSource()[[1]] %>%
  read_html() %>%
  html_elements("#title a") %>%
  html_attr("href")
[1] "https://github.com/resulumit/scrp_workshop" [2] "https://resulumit.com/" [3] "/documents/" [4] "/constituencies/"[5] "/members/" [6] "/states/" [7] "https://github.com/rstudio/blogdown" [8] "https://gohugo.io/" [9] "https://github.com/wowchemy"
Note that
we still need the read_html() function
it turns the page source (XML, downloaded by RSelenium) into HTML

RSelenium

These two pieces of code lead to the same outcome, as the page we scrape is not dynamic
browser$navigate(url = "https://luzpar.netlify.app")

browser$getPageSource()[[1]] %>%
  read_html() %>%
  html_elements("#title a") %>%
  html_attr("href")
[1] "https://github.com/resulumit/scrp_workshop" [2] "https://resulumit.com/" [3] "/documents/" [4] "/constituencies/"[5] "/members/" [6] "/states/" [7] "https://github.com/rstudio/blogdown" [8] "https://gohugo.io/" [9] "https://github.com/wowchemy"
read_html("https://luzpar.netlify.app") %>% html_elements(css = "#title a") %>% html_attr("href")
## [1] "https://github.com/resulumit/scrp_workshop" ## [2] "https://resulumit.com/" ## [3] "/documents/" ## [4] "/constituencies/"## [5] "/members/" ## [6] "/states/" ## [7] "https://github.com/rstudio/blogdown" ## [8] "https://gohugo.io/" ## [9] "https://github.com/wowchemy"
26) Get the page source for https://luzpar.netlify.app/members/
with rvest or polite

27) Get the same page source, using RSelenium

28) Collect names from https://luzpar.netlify.app/members/
first with rvest only
then with RSelenium and rvest together

07:30
Find elements on a page with the findElement method
the using argument specifies a selector scheme, such as xpath
the value argument takes the selector value

findElement(using = "xpath", value)

Note that
typing "css", instead of "css selector", also works
there are other selector schemes as well, including id, name, and link text

findElement(using = "css selector", value)
If there were a button created by the following code ...
<button class="big-button" id="only-button" name="clickable">Click Me</button>
... any of the lines below would find it
browser$findElement(using = "xpath", value = '//*[(@id = "only-button")]')browser$findElement(using = "css selector", value = ".big-button")browser$findElement(using = "css", value = "#only-button")browser$findElement(using = "id", value = "only-button")browser$findElement(using = "name", value = "clickable")
Save elements as R objects to be interacted with later on
button <- browser$findElement(using = ..., value = ...)
Note the difference between the classes of clients and elements
class(browser)
[1] "remoteDriver"attr(,"package")[1] "RSelenium"
class(button)
[1] "webElement"attr(,"package")[1] "RSelenium"
Highlight the element found in the previous step, with the highlightElement method

# navigate to a page
browser$navigate("http://luzpar.netlify.app/")

# find the element
menu_states <- browser$findElement(using = "link text", value = "States")

# highlight it to see if we found the correct element
menu_states$highlightElement()

Note that
the method is applied to the element (menu_states), not to the client (browser)

Click on the element found in the previous step, with the clickElement method

# navigate to a page
browser$navigate("http://luzpar.netlify.app/")

# find an element
search_icon <- browser$findElement(using = "css", value = ".fa-search")

# click on it
search_icon$clickElement()
29) Go to https://luzpar.netlify.app/constituencies/, and click the next page button
30) While on the second page, click the next page button again
07:30
Send text and/or key presses to elements with the sendKeysToElement method

Note that
the method takes a list, of one or more values and/or keys

sendKeysToElement(list(value, key))
View the list of Selenium keys
as_tibble(selKeys) %>% names()
## [1] "null" "cancel" "help" "backspace" "tab" ## [6] "clear" "return" "enter" "shift" "control" ## [11] "alt" "pause" "escape" "space" "page_up" ## [16] "page_down" "end" "home" "left_arrow" "up_arrow" ## [21] "right_arrow" "down_arrow" "insert" "delete" "semicolon" ## [26] "equals" "numpad_0" "numpad_1" "numpad_2" "numpad_3" ## [31] "numpad_4" "numpad_5" "numpad_6" "numpad_7" "numpad_8" ## [36] "numpad_9" "multiply" "add" "separator" "subtract" ## [41] "decimal" "divide" "f1" "f2" "f3" ## [46] "f4" "f5" "f6" "f7" "f8" ## [51] "f9" "f10" "f11" "f12" "command_meta"
Choosing the body element, you can scroll up and down a page
body <- browser$findElement(using = "css", value = "body")

body$sendKeysToElement(list(key = "page_down"))
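A small sketch with other keys from the selKeys list above — useful for the scrolling exercise below:

# jump to the bottom of the page, then back to the top
body$sendKeysToElement(list(key = "end"))
body$sendKeysToElement(list(key = "home"))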
Search the demonstration site
# navigate to the home page
browser$navigate("http://luzpar.netlify.app/")

# find the search icon and click on it
search_icon <- browser$findElement(using = "css", value = ".fa-search")
search_icon$clickElement()

# find the search bar on the new page and click on it
search_bar <- browser$findElement(using = "css", value = "#search-query")
search_bar$clickElement()

# search for the keyword "Law" and click enter
search_bar$sendKeysToElement(list(value = "Law", key = "enter"))
Slow down the code where necessary, with the Sys.sleep function
# navigate to the home page
browser$navigate("http://luzpar.netlify.app/")

# find the search icon and click on it
search_icon <- browser$findElement(using = "css", value = ".fa-search")
search_icon$clickElement()

# sleep for 2 seconds
Sys.sleep(2)

# find the search bar on the new page and click on it
search_bar <- browser$findElement(using = "css", value = "#search-query")
search_bar$clickElement()

# search for the keyword "Law" and click enter
search_bar$sendKeysToElement(list(value = "Law", key = "enter"))
Clear text, or a value, from an element
search_bar$clearElement()
31) Conduct an internet search programmatically

32) Scroll down programmatically, and up
33) Go back, and conduct another search
15:00
Switch to a different frame on a page
switchToFrame(Id)

Note that
the Id argument takes an element object, unquoted
Id = NULL returns to the default frame

Switch to a non-default frame
# navigate to a page and wait for the frame to load
browser$navigate("https://luzpar.netlify.app/documents/")
Sys.sleep(4)

# find the frame, which is an element
app_frame <- browser$findElement("css", "iframe")

# switch to it
browser$switchToFrame(Id = app_frame)

# switch back to the default frame
browser$switchToFrame(Id = NULL)
Task: download the documents of certain types — proposals and reports — listed at https://luzpar.netlify.app/documents/

Plan:
Interact with the page until it displays the desired list of documents
# navigate to the desired page and wait a little
browser$navigate("https://luzpar.netlify.app/documents/")
Sys.sleep(4)

# switch to the frame with the app
app_frame <- browser$findElement("css", "iframe")
browser$switchToFrame(Id = app_frame)

# find and open the drop down menu
drop_down <- browser$findElement(using = "css", value = ".bs-placeholder")
drop_down$clickElement()

# choose proposals
proposal <- browser$findElement(using = 'css', "[id='bs-select-1-1']")
proposal$clickElement()

# choose reports
report <- browser$findElement(using = 'css', "[id='bs-select-1-2']")
report$clickElement()

# close the drop down menu
drop_down$clickElement()
Get the page source and separate the links
the_links <- browser$getPageSource()[[1]] %>%
  read_html() %>%
  html_elements("td a") %>%
  html_attr("href")

print(the_links)
## [1] "https://luzpar.netlify.app/documents/human-rights-2021/" ## [2] "https://luzpar.netlify.app/documents/greenhouse-gas-emissions-2021/"## [3] "https://luzpar.netlify.app/documents/tax-reform-2020/" ## [4] "https://luzpar.netlify.app/documents/parliamentary-staff-2020/" ## [5] "https://luzpar.netlify.app/documents/cyber-security-2019/" ## [6] "https://luzpar.netlify.app/documents/electronic-cigarettes-2019/"
Write a for loop to download PDFs
for (i in 1:length(the_links)) {

  pdf_link <- bow(the_links[i]) %>% scrape() %>%
    html_elements(css = ".btn-page-header") %>%
    html_attr("href") %>%
    url_absolute(base = "https://luzpar.netlify.app/")

  download.file(url = pdf_link, destfile = basename(pdf_link), mode = "wb")
}
34) Collect data on a subset of documents
Hint:
30:00
Harrison, J. (2020). RSelenium: R Bindings for Selenium WebDriver. R package version 1.7.7. <http://docs.ropensci.org/RSelenium>.
Meissner, P. and K. Ren (2020). robotstxt: A robots.txt Parser and Webbot/'Spider'/Crawler Permissions Checker. R package version 0.7.13. <https://CRAN.R-project.org/package=robotstxt>.
Perepolkin, D. (2019). polite: Be Nice on the Web. R package version 0.1.1. <https://github.com/dmi3kno/polite>.
Silge, J. and D. Robinson (2017). Text Mining with R: A Tidy Approach. O'Reilly.
Wickham, H. (2021). rvest: Easily Harvest (Scrape) Web Pages. R package version 1.0.2. <https://CRAN.R-project.org/package=rvest>.
Wickham, H., R. François, L. Henry, et al. (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.9. <https://CRAN.R-project.org/package=dplyr>.
Wickham, H. and G. Grolemund (2021). R for Data Science. O'Reilly.
Xie, Y. (2022). xaringan: Presentation Ninja. R package version 0.24. <https://github.com/yihui/xaringan>.