class: inverse, center, middle

<style type="text/css">
.hljs-github .hljs { background: #e5e5e5; }
.inline-c, remark-inline-code { background: #e5e5e5; border-radius: 3px; padding: 4px; font-family: 'Source Code Pro', 'Lucida Console', Monaco, monospace; }
.yellow-h{ background: #ffff88; }
.out-t, remark-inline-code { background: #9fff9f; border-radius: 3px; padding: 4px; }
.pull-left-c { float: left; width: 58%; }
.pull-right-c { float: right; width: 38%; }
.medium { font-size: 75% }
.small { font-size: 50% }
.action { background-color: #f2eecb; }
.remark-code { display: block; overflow-x: auto; padding: .5em; color: #333; background: #9fff9f; }
</style>

# Automated Web Scraping with R

<br>

### Resul Umit

### June 2022

.footnote[
[Skip intro — To the contents slide](#contents-slide).

<a href="mailto:resuluy@uio.no?subject=Workshop on web scraping">I can teach this workshop at your institution — Email me</a>.
]

---

## Who am I?

Resul Umit

- post-doctoral researcher in political science at the University of Oslo
  - teaching and studying representation, elections, and parliaments
  - [a recent publication](https://doi.org/10.1017/psrm.2021.30): the effects of casualties in terror attacks on elections

--

<br>

- teaching workshops, also on
  - [writing reproducible research papers](https://resulumit.com/blog/rmd-workshop/)
  - [version control and collaboration](https://resulumit.com/teaching/git_workshop.html)
  - [working with Twitter data](https://resulumit.com/teaching/twtr_workshop.html)
  - [creating academic websites](https://resulumit.com/teaching/rbd_workshop.html)

--

<br>

- more information available at [resulumit.com](https://resulumit.com/)

---

## The Workshop — Overview

- One and a half days, on how to automate the process of extracting data from websites
  - 180+ slides, 30+ exercises
  - a [demonstration website](https://luzpar.netlify.app/) for practice

--

<br>

- Designed for researchers with basic knowledge of the R programming language
  - does not cover programming with R
    - e.g., we will use existing functions and packages
<br>
  - ability to work with R will be very helpful
    - but not absolutely necessary — this ability can be developed during and after the workshop as well

---

## The Workshop — Motivation

- Data available on websites provide attractive opportunities for academic research
  - e.g., parliamentary websites were the main source of data for my PhD

--

<br>

- Acquiring such data requires
  - either a lot of resources, such as time
  - or a set of skills, such as automated web scraping

--

<br>

- Typically, such skills are not part of academic training
  - for my PhD, I visited close to 3000 webpages to collect data manually
    - on members of ten parliaments
    - multiple times, to update the dataset as needed

---

## The Workshop — Motivation — Aims

- To provide you with an understanding of what is .yellow-h[ethically] possible
  - we will cover a large breadth of issues, not all of which is for long-term memory
    - hence the slides are designed for self-study as well
<br>
  - awareness of what is ethical and possible, Google, and perseverance are all you need

--

<br>

- To get you started with acquiring and practicing the skills needed
  - practice with the demonstration website
    - plenty of data, stable structure, and an ethical playground
<br>
  - start working on a real project

---

name: contents-slide

## The Workshop — Contents

<br>

.pull-left[
[Part 1. Getting the Tools Ready](#part1)
- e.g., installing software

[Part 2. Preliminary Considerations](#part2)
- e.g., ethics of web scraping

[Part 3.
HTML Basics](#part3) - e.g., elements and attributes ] .pull-right[ [Part 4. CSS Selectors](#part4) - e.g., selecting an element [Part 5. Scraping Static Pages](#part5) - e.g., getting text from an element [Part 6. Scraping Dynamic Pages](#part6) - e.g., clicking to create an element ] .footnote[ [To the list of references](#reference-slide). ] --- ## The Workshop — Organisation - I will go through a number of slides... - introducing things - demonstrating how-to do things <br> - ... and then pause, for you to use/do those things - e.g., prepare your computer for the workshop, and/or - complete a number of exercises <br> - We are here to help - ask me, other participants - consult Google, [slides](https://resulumit.com/teaching/scrp_workshop.html), [answer script](https://luzpar.netlify.app/exercises/solutions.R) - type, rather than copy and paste, the code you will find on the slides or the script --- class: action ## The Workshop — Organisation — Slides Slides with this background colour indicate that your action is required, for - setting the workshop up - e.g., installing R - completing the exercises - e.g., checking website protocols - these slides have countdown timers - as a guide, not to be followed strictly
03
:
00
--- ## The Workshop — Organisation — Slides - Code and text that go in R console or scripts .inline-c[appear as such — in a different font, on gray background] - long codes and texts will have their own line(s) ```r bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td+ td a") %>% html_attr("href") %>% url_absolute(base = "https://luzpar.netlify.app/") ``` --- ## The Workshop — Organisation — Slides - Code and text that go in R console or scripts .inline-c[appear as such — in a different font, on gray background] - long codes and texts will have their own line(s) <br> - Results that come out as output .out-t[appear as such — in the same font, on green background] - except for some results, such as a browser popping up -- <br> - Specific sections are .yellow-h[highlighted yellow as such] for emphasis - these could be for anything — codes and texts in input, results in output, and/or texts on slides -- <br> - The slides are designed for self-study as much as for the workshop - *accessible*, in substance and form, to go through on your own --- name: part1 class: inverse, center, middle # Part 1. Getting the Tools Ready .footnote[ [Back to the contents slide](#contents-slide). ] --- class: action ## Workshop Slides — Access on Your Browser - Having the workshop slides<sup>*</sup> on your own machine might be helpful - flexibility to go back and forward on your own - ability to scroll across long codes on some slides <br> - Access at <https://resulumit.com/teaching/scrp_workshop.html> - will remain accessible after the workshop - might crash for some Safari users - if using a different browser application is not an option, view the [PDF version of the slides](https://github.com/resulumit/scrp_workshop/blob/master/presentation/scrp_workshop.pdf) on GitHub .footnote[ <sup>*</sup> These slides are produced in R, with the `xaringan` package <a name=cite-R-xaringan></a>([Xie, 2022](https://github.com/yihui/xaringan)). ] --- class: action ## Demonstration Website — Explore on Your Browser - There is a demonstration website for this workshop - available at <https://luzpar.netlify.app/> - includes fabricated data on the imaginary Parliament of Luzland - provides us with plenty of data, stable structure, and an ethical playground - Using this demonstration website for practice is recommended - tailored to exercises, no ethical concern - but not compulsory — use a different one if you prefer so - Explore the website now - click on the links to see an individual page for - states, constituencies, members, and documents <br> - notice that the documents section is different than the rest - it is a page with dynamic frame
05
:
00
---

class: action

## R — Download from the Internet and Install

- Programming language of this workshop
  - created for data analysis, extended for other purposes
    - e.g., accessing websites
<br>
  - allows for all three steps in one environment
    - accessing websites, scraping data, and processing data

<br>

- Download R from [https://cloud.r-project.org](https://cloud.r-project.org)
  - optional, if you have it already installed — but then consider updating<sup>*</sup>
    - the `R.version.string` command checks the version of your copy
    - compare with the latest official release at [https://cran.r-project.org/sources.html](https://cran.r-project.org/sources.html)

.footnote[
<sup>*</sup> The same applies to all software that follows — consider updating if you have them already installed. This ensures everyone works with the latest, exactly the same, tools.
]

---

class: action

## RStudio — Download from the Internet and Install

- Optional, but highly recommended
  - facilitates working with R

<br>

- A popular integrated development environment (IDE) for R
  - an alternative: [GNU Emacs](https://www.gnu.org/software/emacs/)

<br>

- Download RStudio from [https://rstudio.com/products/rstudio/download](https://rstudio.com/products/rstudio/download)
  - choose the free version
  - to check for any updates, follow from the RStudio menu:

> `Help -> Check for Updates`

---

class: action
name: rstudio-project

## RStudio Project — Create from within RStudio

- RStudio allows for dividing your work with R into separate projects
  - each project gets a dedicated workspace, history, and source documents
  - [this page](https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects) has more information on why projects are recommended

<br>

- Create a new RStudio project for this workshop, following from the RStudio menu:

> `File -> New Project -> New Directory -> New Project`

<br>

- Choose a location for the project with `Browse...`
  - avoid choosing a synced location, e.g., `Dropbox`
    - likely to cause warning and/or error messages
    - if you must, pause syncing, or add a sync exclusion

---

class: action

## R Packages — Install from within RStudio<sup>*</sup>

Install the packages that we need

```r
install.packages(c("rvest", "RSelenium", "robotstxt", "polite", "dplyr"))
```

.footnote[
<sup>*</sup> You may already have a copy of one or more of these packages. In that case, I recommend updating by re-installing them now.
]
02
:
00
--- class: action ## R Packages — Install from within RStudio Install the packages that we need ```r install.packages(c("rvest", "RSelenium", "robotstxt", "polite", "dplyr")) ``` <br> We will use - `rvest` <a name=cite-R-rvest></a>([Wickham, 2021](https://CRAN.R-project.org/package=rvest)), for scraping websites -- - `RSelenium` <a name=cite-R-RSelenium></a>([Harrison, 2020](http://docs.ropensci.org/RSelenium)), for browsing the web programmatically -- - `robotstxt` <a name=cite-R-robotstxt></a>([Meissner and Ren, 2020](https://CRAN.R-project.org/package=robotstxt)), for checking permissions to scrape websites -- - `polite` <a name=cite-R-polite></a>([Perepolkin, 2019](https://github.com/dmi3kno/polite)), for compliance with permissions to scrape websites -- - `dplyr` <a name=cite-R-dplyr></a>([Wickham, François, Henry, and Müller, 2022](https://CRAN.R-project.org/package=dplyr)), for data manipulation --- class: action ## R Script — Start Your Script .pull-left[ - Check that you are in your recently created project - indicated at the upper-right corner of RStudio window - Create a new R Script, following from the RStudio menu > `File -> New File -> R Script` - Name and save your file - e.g., `scrape_web.R` - Load `rvest` and other packages ] .pull-right[ ```r library(rvest) library(RSelenium) library(robotstxt) library(polite) library(dplyr) ``` ] --- class: action ## Java — Download from the Internet and Install - A language and software that `RSelenium` needs - for automation scripts <br> - Download Java from <https://www.java.com/en/download/> - requires restarting any browser that you might have open --- class: action ## Chrome — Download from the Internet and Install - A browser that facilitates web scraping - favoured by `RSelenium` and most programmers <br> - Download Chrome from <https://www.google.com/chrome/> --- class: action ## SelectorGadget — Add Extension to Browser - An extension for Chrome - facilitates selecting what to scrape from a webpage - optional, but highly recommended - [open source software](https://github.com/cantino/selectorgadget) <br> - Add the extension to your browser - search for it at <https://chrome.google.com/webstore/category/extensions> - if you cannot use Chrome, <a href="javascript:(function(){var%20s=document.createElement('div');s.innerHTML='Loading...';s.style.color='black';s.style.padding='20px';s.style.position='fixed';s.style.zIndex='9999';s.style.fontSize='3.0em';s.style.border='2px%20solid%20black';s.style.right='40px';s.style.top='40px';s.setAttribute('class','selector_gadget_loading');s.style.background='white';document.body.appendChild(s);s=document.createElement('script');s.setAttribute('type','text/javascript');s.setAttribute('src','https://dv0akt2986vzh.cloudfront.net/unstable/lib/selectorgadget.js');document.body.appendChild(s);})();">drag and drop this link</a> to your bookmarks bar <br> - [ScrapeMate](https://github.com/hermit-crab/ScrapeMate) is an alternative extension - for both Chrome and Firefox - on Firefox, search at <https://addons.mozilla.org/> --- class: action ## Solutions — Note Where They Are - Solutions to exercises, or links to them, are available online - can be downloaded at <https://luzpar.netlify.app/exercises/solutions.R> <br> - I recommend the solutions to be consulted as a last resort - after a genuine effort to complete the exercises yourself first --- ## Other Resources<sup>*</sup> - `RSelenium` vignettes - available at <https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html> - R for Data 
Science <a name=cite-rfordatascience></a>([Wickham and Grolemund, 2021](#bib-rfordatascience))
  - open access at <https://r4ds.had.co.nz>

- Text Mining with R: A Tidy Approach <a name=cite-textminingwithr></a>([Silge and Robinson, 2017](#bib-textminingwithr))
  - open access at [tidytextmining.com](https://www.tidytextmining.com/)
  - comes with [a course website](https://juliasilge.shinyapps.io/learntidytext/) where you can practice

.footnote[
<sup>*</sup> I recommend these to be consulted not during but after the workshop.
]

---

name: part2
class: inverse, center, middle

# Part 2. Preliminary Considerations

.footnote[
[Back to the contents slide](#contents-slide).
]

---

## Considerations — the Law

- Web scraping might be illegal
<br>
  - depending on who is scraping what, why, how — and under which jurisdiction
  - reflect, and check, before you scrape

--

<br>

- Web scraping might be more likely to be illegal if, for example,
<br>
  - it is harmful to the source commercially and/or physically
    - e.g., scraping a commercial website to create a rival website
    - e.g., scraping a website so hard and fast that it collapses
<br>
  - it gathers data that is
    - under copyright
    - not meant for the public to see
    - then used for financial gain

---

## Considerations — the Ethics

- Web scraping might be unethical
  - depending on who is scraping what, why, and how
  - reflect before you scrape

--

<br>

- Web scraping might be more likely to be unethical if, for example,
<br>
  - it is — or is edging towards being — illegal
  - it does not respect the restrictions
    - as defined in `robots.txt` files
<br>
  - it harvests data
    - that is otherwise available to download, e.g., through APIs
    - without purpose, at dangerous speed, repeatedly

---

## Considerations — the Ethics — `robots.txt`

- Most websites declare a robots exclusion protocol
  - making their rules known with respect to programmatic access
    - who is (not) allowed to scrape what, and sometimes, at what speed
<br>
  - within `robots.txt` files
    - available at, e.g., www.websiteurl.com<span style="background-color: #ffff88;">/robots.txt</span>

<br>

- The rules in `robots.txt` cannot be enforced upon scrapers
  - but should be respected for ethical reasons

<br>

- The language in `robots.txt` files is specific but intuitive
  - easy to read and understand
  - the `robotstxt` package makes these even easier

---

## Considerations — the Ethics — `robots.txt` — Syntax

.pull-left[
- It has pre-defined keys, most importantly
  - `User-agent` indicates who the protocol is for
  - `Allow` indicates which part(s) of the website can be scraped
  - `Disallow` indicates which part(s) must not be scraped
  - `Crawl-delay` indicates how fast the website could be scraped

<br>

- Note that
  - the keys start with capital letters
  - they are followed by a colon .yellow-h[:]
]

.pull-right[
```md
`User-agent:`
`Allow:`
`Disallow:`
`Crawl-delay:`
```
]

---

## Considerations — the Ethics — `robots.txt` — Syntax

.pull-left[
- Websites define their own values
  - after a colon and a white space

<br>

- Note that
  - `*` indicates the protocol is for everyone
  - `/` indicates all sections and pages
  - `/about/` indicates a specific path
  - values for `Crawl-delay` are in seconds

<br>

- this website allows anyone to scrape, provided that
  - `/about/` is left out, and
  - the website is accessed at 5-second intervals
]

.pull-right[
```md
User-agent: `*`
Allow: `/`
Disallow: `/about/`
Crawl-delay: `5`
```
]

---

## Considerations — the Ethics — `robots.txt` — Examples

.pull-left[
The protocol of this website only applies to Google

- Google is
allowed to scrape everything - there is no defined rule for anyone else ] .pull-right[ ```md User-agent: `googlebot` Allow: / ``` ] --- ## Considerations — the Ethics — `robots.txt` — Examples .pull-left[ The protocol of this website only applies to Google - Google is .yellow-h[disallowed] to scrape .yellow-h[two] specific paths - with no limit on speed <br> - there is no defined rule for anyone else ] .pull-right[ ```md User-agent: googlebot `Disallow: /about/` `Disallow: /history/` ``` ] --- ## Considerations — the Ethics — `robots.txt` — Examples .pull-left[ This website has different protocols for different agents - .yellow-h[Google] is allowed to scrape everything, with a 5-second delay - .yellow-h[Bing] is not allowed to scrape anything - .yellow-h[everyone else] can scrape the section or page located at www.websiteurl/about/ ] .pull-right[ ```md User-agent: `googlebot` Allow: / Crawl-delay: 5 User-agent: `bing` Disallow: / User-agent: `*` Allow: /about/ ``` ] --- ## Considerations — the Ethics — `robots.txt` — Notes There are also some other, lesser known, directives ```md User-agent: * Allow: / Disallow: /about/ Crawl-delay: 5 `Visit-time: 01:45-08:30` ``` -- <br> Files might include optional comments, written after the number sign .yellow-h[#] ```md `# thank you for respecting our protocol` User-agent: * Allow: / Disallow: /about/ Visit-time: 01:45-08:30 `# please visit when it is night time in the UK (GMT)` Crawl-delay: 5 `# please delay for five seconds, to ensure our servers are not overloaded` ``` --- ## Considerations — the Ethics — `robotstxt` - The `robotstxt` packages facilitates checking website protocols - from within R — no need to visit websites via browser - provides functions to check, among others, the rules for specific paths and/or agents <br> - There are two main functions - `robotstxt`, which gets complete protocols - `paths_allowed`, which checks protocols for one or more specific paths --- ## Considerations — the Ethics — `robotstxt` .pull-left[ Use the `robotstxt` function to get a protocol - supply a base URL with the `domain` argument - as a string - probably the only argument that you will need ] .pull-right[ ```md robotstxt( domain = NULL, ... 
) ``` ] --- ## Considerations — the Ethics — `robotstxt` ```r robotstxt(domain = "https://luzpar.netlify.app") ``` ``` ## $domain ## [1] "https://luzpar.netlify.app" ## ## $text ## [robots.txt] ## -------------------------------------- ## ## User-agent: googlebot ## Disallow: /states/ ## ## User-agent: * ## Disallow: /exercises/ ## ## User-agent: * ## Allow: / ## Crawl-delay: 2 ## ## ## ## ## ## $robexclobj ## <Robots Exclusion Protocol Object> ## $bots ## [1] "googlebot" "*" ## ## $comments ## [1] line comment ## <0 rows> (or 0-length row.names) ## ## $permissions ## field useragent value ## 1 Disallow googlebot /states/ ## 2 Disallow * /exercises/ ## 3 Allow * / ## ## $crawl_delay ## field useragent value ## 1 Crawl-delay * 2 ## ## $host ## [1] field useragent value ## <0 rows> (or 0-length row.names) ## ## $sitemap ## [1] field useragent value ## <0 rows> (or 0-length row.names) ## ## $other ## [1] field useragent value ## <0 rows> (or 0-length row.names) ## ## $check ## function (paths = "/", bot = "*") ## { ## spiderbar::can_fetch(obj = self$robexclobj, path = paths, ## user_agent = bot) ## } ## <bytecode: 0x00000257bc12e528> ## <environment: 0x00000257bc129350> ## ## attr(,"class") ## [1] "robotstxt" ``` --- ## Considerations — the Ethics — `robotstxt` Check the list of permissions for the most relevant part in the output ```r robotstxt(domain = "https://luzpar.netlify.app")`$permissions` ``` ``` ## field useragent value ## 1 Disallow googlebot /states/ ## 2 Disallow * /exercises/ ## 3 Allow * / ``` --- ## Considerations — the Ethics — `robotstxt` .pull-left[ Use the `paths_allowed` function to check protocols for one or more specific paths - supply a base URL with the `domain` argument - `path` and `bot` are the other important arguments - notice the default values <br> - leads to either `TRUE` (allowed to scrape) or `FALSE` (not allowed) ] .pull-right[ ```md paths_allowed( domain = "auto", paths = "/", bot = "*", ... ) ``` ] --- ## Considerations — the Ethics — `robotstxt` ```r paths_allowed(domain = "https://luzpar.netlify.app") ``` ``` ## [1] TRUE ``` ```md paths_allowed(domain = "https://luzpar.netlify.app", `paths = c("/states/", "/constituencies/")`) ``` ``` ## [1] TRUE TRUE ``` ```md paths_allowed(domain = "https://luzpar.netlify.app", paths = c("/states/", "/constituencies/"), `bot = "googlebot"`) ``` ``` ## [1] FALSE TRUE ``` --- class: action ## Exercises 1) Check the protocols for <https://www.theguardian.com> - via (a) your browser and (b) with the `robotstxt` function in R - compare what you see <br> 2) Check a path with the `paths_allowed` function - such that it will return `FALSE` - taking the information from Exercise 1 into account - hint: try looking at the list of permissions first <br> 3) Check the protocols for any website that you might wish to scrape - with the `robotstxt` function - reflect on the ethics of scraping that website
10
:
00
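
---

## Considerations — the Ethics — `robotstxt` — Crawl Delay

The other parts of the protocol can be pulled out in the same way. As a minimal sketch, the crawl-delay directive that appears in the full output above can be extracted on its own

```r
robotstxt(domain = "https://luzpar.netlify.app")$crawl_delay
```

```
##         field useragent value
## 1 Crawl-delay         *     2
```

Knowing this value is useful for the speed considerations on the next slide.
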
--- ## Considerations — the Ethics — Speed - Websites are designed for visitors with human-speed in mind - computer-speed visits can overload servers, depending on bandwidth - popular websites might have more bandwidth - but, they might attract multiple scrapers at the same time <br> - Waiting a little between two visits makes scraping more ethical - waiting time may or may not be defined in the protocol - lookout for, and respect, the `Crawl-delay` key in `robots.txt` <br> - [Part 5](#part5) and [Part 6](#part6) covers how to wait <br> - Not waiting enough might lead to a ban - by site owners, administrators - for IP addresses with undesirably high number of visits in a short period of time --- ## Considerations — the Ethics — Purpose Ideally, we scrape for a purpose - e.g., for academics, to answer one or more research questions, test hypotheses <br> - developed prior to data collection, analysis - based on, e.g., theory, claims, observations <br> - perhaps, even pre-registered - e.g., at [OSF Registries](https://osf.io/registries) --- ## Considerations — Data Storage Scraped data frequently requires - large amounts of digital storage space - internet data is typically big data <br> - private, safe storage spaces - due to local rules, institutional requirements --- name: part3 class: inverse, center, middle # Part 3. HTML Basics .footnote[ [Back to the contents slide](#contents-slide). ] --- ## Source Code — Overview - Webpages include more than what is immediately visible to visitors - not only text, images, links - but also code for structure, style, and functionality — interpreted by browsers first - <span style="background-color: #ffff88;">HTML</span> provides the structure - <span style="background-color: #ffff88;">CSS</span> provides the style - <span style="background-color: #ffff88;">JavaScript</span> provides functionality, if any <br> - Web scraping requires working with the source code - even when scraping only what is already visible - to choose one or more desired parts of the visible - e.g., text in table and/or bold only <br> - Source code also offers more, invisible, data to be scraped - e.g., URLs hidden under text --- ## Source Code — Plain Text The `Ctrl` `+` `U` shortcut displays source code — alternatively, right click and `View` `Page` `Source` <img src="scrp_workshop_files/images_data/homepage.png" width="45%" /><img src="scrp_workshop_files/images_data/homepage_source.png" width="45%" /> --- ## Source Code — DOM Browsers also offer putting source codes in a structure, known as DOM (document object model) - initiated by the `F12` key on Chrome — alternatively, right click and `Inspect` <img src="scrp_workshop_files/images_data/homepage.png" width="45%" /><img src="scrp_workshop_files/images_data/homepage_dom.png" width="45%" /> --- class: action ## Exercises 4) View the source code of a page - as plain code and as in DOM - compare the look of the two <br> 5) Search for a word or a phrase in source code - copy from the front-end page - search in plain text code or in DOM - using the `Ctrl` `+` `F` shortcut <br> - compare the look of the front- and back-end
05
:
00
--- ## HTML — Overview .pull-left[ - HTML stands for .yellow-h[hypertext markup language] - it gives the structure to what is visible to visitors - text, images, links <br> - would a piece of text appear in a paragraph or a list? - depends on the HTML code around that text ] .pull-right[ ```md <!DOCTYPE html> <html> <head> <style> h1 {color: blue;} </style> <title>A title for browsers</title> </head> <body> <h1>A header</h1> <p>This is a paragraph.</p> <ul> <li>This</li> <li>is a</li> <li>list</li> </ul> </body> </html> ``` ] --- ## HTML — Overview .pull-left[ HTML documents - start with a .yellow-h[declaration] - so that browsers know what they are ] .pull-right[ ```md `<!DOCTYPE html>` <html> <head> <style> h1 {color: blue;} </style> <title>A title for browsers</title> </head> <body> <h1>A header</h1> <p>This is a paragraph.</p> <ul> <li>This</li> <li>is a</li> <li>list</li> </ul> </body> </html> ``` ] --- ## HTML — Overview .pull-left[ HTML documents - start with a declaration - so that browsers know what they are <br> - consist of .yellow-h[elements] - written in between opening and closing tags ] .pull-right[ ```md <!DOCTYPE html> <html> <head> `<style>` `h1 {color: blue;}` `</style>` <title>A title for browsers</title> </head> <body> `<h1>A header</h1>` <p>This is a paragraph.</p> <ul> `<li>This</li>` <li>is a</li> <li>list</li> </ul> </body> </html> ``` ] --- ## HTML — the Root .pull-left[ <span style="background-color: #ffff88;">`html`</span> holds together the root element - it is also the parent to all other elements - its important children are the `head` and `body` elements ] .pull-right[ ```md <!DOCTYPE html> `<html>` <head> <style> h1 {color: blue;} </style> <title>A title for browsers</title> </head> <body> <h1>A header</h1> <p>This is a paragraph.</p> <ul> <li>This</li> <li>is a</li> <li>list</li> </ul> </body> `</html>` ``` ] --- ## HTML — the Head .pull-left[ .yellow-h[head] contains metadata, such as - titles, which appear in browser bars and tabs - style elements ] .pull-right[ ```md <!DOCTYPE html> <html> `<head>` <style> h1 {color: blue;} </style> <title>A title for browsers</title> `</head>` <body> <h1>A header</h1> <p>This is a paragraph.</p> <ul> <li>This</li> <li>is a</li> <li>list</li> </ul> </body> </html> ``` ] --- ## HTML — the Body .pull-left[ .yellow-h[body] contains the elements in the main body of pages, such as - headers, paragraphs, lists, tables, images ] .pull-right[ ```md <!DOCTYPE html> <html> <head> <style> h1 {color: blue;} </style> <title>A title for browsers</title> </head> `<body>` <h1>A header</h1> <p>This is a paragraph.</p> <ul> <li>This</li> <li>is a</li> <li>list</li> </ul> `</body>` </html> ``` ] --- ## HTML — Syntax — Tags Most elements have opening and closing .yellow-h[tags] ```md `<p>`This is a one sentence paragraph.`</p>` ``` .out-t[ This is a one sentence paragraph. ] <br> Note that - tag name, in this case .yellow-h[p], defines the structure of the element - the closing tag has a forward slash .yellow-h[/] before the element name --- ## HTML — Syntax — Content Most elements have some .yellow-h[content] ```md <p>`This is a one sentence paragraph.`</p> ``` .out-t[ This is a one sentence paragraph. 
]

---

## HTML — Syntax — Attributes

Elements can have .yellow-h[attributes]

```md
<p>This is a <strong `id="sentence-count"`>one</strong> sentence paragraph.</p>
```

.out-t[
<p>This is a <strong id="sentence-count">one</strong> sentence paragraph.</p>
]

<br>

Note that

- attributes are added to the opening tags
  - separated from anything else in the tag with a white space
<br>
- the attribute value .yellow-h[sentence-count] could have been anything I could come up with
  - unlike the tag and attribute names (e.g., `strong`, `id`), which are pre-defined
<br>
- the `id` attribute has no visible effects
  - some other attributes, such as `style`, can have visible effects

---

## HTML — Syntax — Attributes

There could be more than one attribute in a single element

```md
<p>This is a <strong `class="count"` `id="sentence-count"`>one</strong> sentence paragraph.</p>

<p>There are now <strong `class="count"` `id="paragraph-count"`>two</strong> paragraphs.</p>
```

.out-t[
<p>This is a <strong class="count" id="sentence-count">one</strong> sentence paragraph.</p>
<p>There are now <strong class="count" id="paragraph-count">two</strong> paragraphs.</p>
]

<br>

Note that

- the same `class` attribute (i.e., `count`) can apply to multiple elements
  - while the `id` attribute must be unique on a given page

---

## HTML — Syntax — Notes

Elements can be nested

```md
<p>This is a <strong>one</strong> sentence paragraph.</p>
```

.out-t[
<p>This is a <strong>one</strong> sentence paragraph.</p>
]

<br>

Note that

- there are two elements above, a paragraph and a strong emphasis
  - strong is said to be the child of the paragraph element
    - there could be more than one child
    - in that case, children are numbered from the left
<br>
  - paragraph is said to be the parent of the strong element

---

## HTML — Syntax — Notes

By default, multiple spaces and/or line breaks are ignored by browsers

```r
<ul><li>books</li><li>journal articles</li><li>reports </li> </ul>
```

.out-t[
<ul><li>books</li><li>journal articles</li><li>reports </li> </ul>
]

<br>

Note that

- plain source code may or may not be written in a readable manner
  - this is one reason why the DOM is helpful

---

## HTML — Other Important Elements — Links

Links are provided with the `a` (anchor) element

```md
<p>Click <a href="https://www.google.com/">here</a> to google things.</p>
```

.out-t[
<p>Click <a href="https://www.google.com/">here</a> to google things.</p>
]

<br>

Note that

- `href` (hypertext reference) is a .yellow-h[required attribute] for this element
  - most attributes are optional, but some are required

---

## HTML — Other Important Elements — Links

Links can have titles

```md
<p>Click <a `title="This text appears when visitors hover over the link"` href="https://www.google.com/">here</a> to google things.</p>
```

.out-t[
<p>Click <a title="This text appears when visitors hover over the link" href="https://www.google.com/">here</a> to google things.</p>
]

<br>

Note that

- the `title` attribute is one of the optional attributes
  - it becomes visible when hovered over with the mouse

---

## HTML — Other Important Elements — Lists

The `<ul>` tag introduces unordered lists, while the `<li>` tag defines list items

```r
<ul>
<li>books</li>
<li>journal articles</li>
<li>reports</li>
</ul>
```

.out-t[
<ul>
<li>books</li>
<li>journal articles</li>
<li>reports</li>
</ul>
]

<br>

Note that

- Ordered lists are introduced with the `<ol>` tag instead

---

## HTML — Other Important Elements — Containers

The `<div>` tag defines a section, containing one or often more elements
.pull-left[ <br> ```r <p>This is an introductory paragraph.</p> <`div` style="text-decoration:underline;"> <p>In this important division there are two elements, which are:</p> <ul> <li>a paragraph, and</li> <li>an unordered list.</li> </ul> <`/div`> <p>This is the concluding paragraph.<p> ``` ] .pull-right[ <br> .out-t[ <p>This is an introductory paragraph.</p> <div style="text-decoration:underline;"> <p>In this important division there are two elements, which are:</p> <ul> <li>a paragraph, and</li> <li>an unordered list.</li> </ul> </div> <p>This is the concluding paragraph.<p> ] ] --- ## HTML — Other Important Elements — Containers The `<span>` tag also defines a section, containing a part of an element ```r <p>This is an <`span` style="text-decoration:underline;">important paragraph<`/span`>, which you must read carefully.<p> ``` .out-t[ <p>This is an <span style="text-decoration:underline;">important paragraph</span>, which you must read carefully.<p> ] <br> Note that - containers are useful in applying styles to sections - or, attributing classes or ids to them --- class: action ## Exercises 6) Re-create the page at <https://luzpar.netlify.app/states/> in R - start an HTML file, following from the RStudio menu: > `File -> New File -> HTML File` - copy the text from the website, paste in the HTML file - add the structure with HTML code - click `Preview` to view the result <br> 7) Add at least one extra tag and/or attribute - with a visible effect on how the page looks at the front end - hints: - google if you need to - [www.w3schools.com](https://www.w3schools.com/) has a lot resources - save this document as we will continue working on it
15
:
00
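
---

## HTML — Nested Elements — A Look Ahead

R can parse these structures too. A minimal sketch, previewing `read_html` and `html_element` from Part 5, plus `html_children`, an `rvest` function not otherwise covered in this workshop, applied to the nesting example from the earlier slides:

```r
library(rvest)

# parse the one-paragraph snippet from the nesting slide
page <- read_html("<p>This is a <strong>one</strong> sentence paragraph.</p>")

# the paragraph element is the parent...
paragraph <- html_element(page, css = "p")

# ...and the strong element is its only child
html_children(paragraph)
```
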
---

name: part4
class: inverse, center, middle

# Part 4. CSS Selectors

.footnote[
[Back to the contents slide](#contents-slide).
]

---

## CSS — Overview

- CSS stands for .yellow-h[cascading style sheets]
  - it gives the style to what is visible to visitors
    - text, images, links
<br>
  - would a piece of text appear in black or blue?
    - depends on the CSS for that text

<br>

- CSS can be defined
  - inline, as an attribute of an element
  - internally, as a child element of the `head` element
  - externally, but then linked in the `head` element

---

## CSS — Syntax

.pull-left[
- CSS is written in .yellow-h[rules]
]

.pull-right[
```md
`p {font-size:14px;}`
`h1, h2 {color:blue;}`
`.count {background-color:yellow;}`
`#sentence-count {color:red; font-size:16px;}`
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more <span style="background-color: #ffff88;">selectors</span>, matching one or more HTML elements and/or attributes
<br>
]

.pull-right[
```md
`p` {font-size:14px;}
`h1, h2` {color:blue;}
`.count` {background-color:yellow;}
`#sentence-count` {color:red; font-size:16px;}
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more <span style="background-color: #ffff88;">selectors</span>, matching one or more HTML elements and/or attributes

<br>

- Note that
  - the syntax changes with the selector type
  - elements and attributes are written as they are
]

.pull-right[
```md
`p` {font-size:14px;}
`h1, h2` {color:blue;}
.count {background-color:yellow;}
#sentence-count {color:red; font-size:16px;}
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more <span style="background-color: #ffff88;">selectors</span>, matching one or more HTML elements and/or attributes

<br>

- Note that
  - the syntax changes with the selector type
  - elements and attributes are written as they are
  - classes are prefixed with a full stop, ids with a number sign
]

.pull-right[
```md
p {font-size:14px;}
h1, h2 {color:blue;}
`.count` {background-color:yellow;}
`#sentence-count` {color:red; font-size:16px;}
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more <span style="background-color: #ffff88;">selectors</span>, matching one or more HTML elements and/or attributes

<br>

- Note that
  - the syntax changes with the selector type
  - elements and attributes are written as they are
  - classes are prefixed with a full stop, ids with a number sign
<br>
  - you can define the same rule for more than one element and/or attribute
    - by separating the selectors with a comma
]

.pull-right[
```md
p {font-size:14px;}
`h1, h2` {color:blue;}
.count {background-color:yellow;}
#sentence-count {color:red; font-size:16px;}
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more selectors, matching one or more HTML elements and/or attributes
  - a .yellow-h[declaration]

<br>

- Note that
  - declarations are written in between two curly brackets
]

.pull-right[
```md
p `{font-size:14px;}`
h1, h2 `{color:blue;}`
.count `{background-color:yellow;}`
#sentence-count `{color:red; font-size:16px;}`
```
]

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more selectors, matching one or more HTML elements and/or attributes
  - a declaration, with one or more .yellow-h[properties]
]

.pull-right[
```md
p {`font-size:`14px;}
h1, h2 {`color:`blue;}
.count {`background-color:`yellow;}
#sentence-count {`color:`red; `font-size:`16px;}
```
]

<br>

- Note that
  - properties are followed by a colon

---

## CSS — Syntax

.pull-left[
- CSS is written in rules, with a syntax consisting of
  - one or more selectors, matching one or more HTML elements and/or attributes
  - a declaration, with one or more properties and .yellow-h[values]

<br>

- Note that
  - values are followed by a semicolon
  - `property:value;` pairs are separated by a white space
]

.pull-right[
```md
p {font-size:`14px;`}
h1, h2 {color:`blue;`}
.count {background-color:`yellow;`}
#sentence-count {color:`red;` font-size:`16px;`}
```
]

---

## CSS — Internal

.pull-left[
- CSS rules can be defined internally
  - within the `style` element
    - as a child of the `head` element

- Internally defined rules apply to all matching selectors
  - on the same page
]

.pull-right[
```r
<!DOCTYPE html>
<html>
<head>
* <style>
* h1 {color:blue;}
* </style>
<title>A title for browsers</title>
</head>
<body>
<h1>A header</h1>
<p>This is a paragraph.</p>
<ul>
<li>This</li>
<li>is a</li>
<li>list</li>
</ul>
</body>
</html>
```
]

---

## CSS — External

.pull-left[
- CSS rules can be defined externally
  - saved somewhere linkable
  - defined with the `link` element
    - as a child of the `head` element

- Externally defined rules
  - are saved in a file with .css extension
  - apply to all matching selectors
    - on any page linked
]

.pull-right[
```r
<!DOCTYPE html>
<html>
<head>
* <link rel="stylesheet" href="simple.css">
<title>A title for browsers</title>
</head>
<body>
<h1>A header</h1>
<p>This is a paragraph.</p>
<ul>
<li>This</li>
<li>is a</li>
<li>list</li>
</ul>
</body>
</html>
```
]

---

## CSS — Inline

CSS rules can also be defined inline

- with the `style` attribute
  - does not require a selector
  - applies only to that element

```md
<p>This is a <strong `style="color:blue;"`>one</strong> sentence paragraph.</p>
```

.out-t[
<p>This is a <strong style="color:blue;">one</strong> sentence paragraph.</p>
]

---

class: action

## Exercise

8) Provide some simple style to your HTML document

- one that you created during the previous exercise
- using internal or external style, but not inline
  - so that you can practice selecting elements

<br>

- no idea what to do?
  - increase the font size of the text in the paragraph
  - change the colour of the second item in the list to red
  - get more ideas from [www.w3schools.com/css](https://www.w3schools.com/css/default.asp)
07
:
30
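
---

## CSS — Selectors in R — A Look Ahead

In Part 5, these selector types go into the `css` argument of the `rvest` function `html_elements`. A minimal sketch, using a made-up snippet rather than a real page; the class and id names (`count`, `intro`) are invented for the example:

```r
library(rvest)

# a tiny page with a class and an id attribute
page <- read_html('<p id="intro" class="count">one</p> <p class="count">two</p>')

html_elements(page, css = "p")       # by element name
html_elements(page, css = ".count")  # by class, prefixed with a full stop
html_elements(page, css = "#intro")  # by id, prefixed with a number sign
```
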
---

name: part5
class: inverse, center, middle

# Part 5. Scraping Static Pages

.footnote[
[Back to the contents slide](#contents-slide).
]

---

## Static Pages — Overview

- Static pages are those that display the same source code to all visitors
  - every visitor sees the same content at a given URL
  - for different content, visitors go to a different page with a different URL
  - <https://luzpar.netlify.app/> is a static page

--

<br>

- Static pages are typically scraped in two steps
  - the `rvest` package can handle both steps
  - we may still wish to use other packages to ensure ethical scraping

---

## Static Pages — Two Steps to Scrape

Scraping static pages involves two main steps

- .yellow-h[Get] the source code into R
  - with the `rvest` or `polite` package, using URLs of these pages
  - typically, the only interaction with the page itself

- .yellow-h[Extract] the exact information needed from the source code
  - with the `rvest` package, using selectors for that exact information
  - takes place locally, on your machine

---

## Static Pages — `rvest` — Overview

- A relatively small R package for web scraping
  - created by [Hadley Wickham](http://hadley.nz/)
  - popular — used by many for web scraping
    - downloaded 494,966 times last month
    - some of it must be thanks to being a part of the `tidyverse` family
<br>
  - last major revision was in March 2021
    - better alignment with `tidyverse`

--

<br>

- A lot has already been written on this package
  - you will find solutions to, or help for, any issues online
    - see first the [package documentation](https://cran.r-project.org/web/packages/rvest/rvest.pdf), and numerous tutorials — such as [this](https://rvest.tidyverse.org/), [this](https://blog.rstudio.com/2014/11/24/rvest-easy-web-scraping-with-r/), and [this](https://steviep42.github.io/webscraping/book/index.html#quick-rvest-tutorial)

--

<br>

- Comes with the recommendation to combine it with the `polite` package
  - for ethical web scraping

---

## Static Pages — `rvest` — Get Source Code

Use the `read_html` function to get the source code of a webpage into R

```r
read_html("https://luzpar.netlify.app/")
```

```
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ...
```

<br>

Note that

- this is the first of two steps in scraping static pages
  - typically, the only interaction with the page itself

<br>

- we still need to select the exact information that we need

---

## Static Pages — `rvest` — Get Source Code

You may wish to check the protocol first, for ethical scraping

```r
paths_allowed(domain = "https://luzpar.netlify.app/")
```

```
## [1] TRUE
```

```r
read_html("https://luzpar.netlify.app/")
```

```
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ...
```

---

## Static Pages — `rvest` — Get Source Code — `polite`

- The `polite` package facilitates ethical scraping
  - recommended by `rvest`

<br>

- It divides the step of getting source code into two
  - check the protocol
  - get the source only if allowed

<br>

- Among its other functions are
  - waiting for a period of time
    - at a minimum, by what is specified in the protocol
<br>
  - introducing you to website administrators while scraping

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol
  - for a specific .yellow-h[URL]
]

.pull-right[
```md
bow(`url`,
  user_agent = "polite R package - https://github.com/dmi3kno/polite",
  delay = 5,
  ...
)
```
]

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol
  - for a specific URL
  - for a specific .yellow-h[agent]

<br>

- Note that
  - the `user_agent` argument can communicate information to website administrators
    - e.g., your name and contact details
]

.pull-right[
```md
bow(url,
  `user_agent = "polite R package - https://github.com/dmi3kno/polite"`,
  delay = 5,
  force = FALSE,
  ...
)
```
]

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol
  - for a specific URL
  - for a specific agent
  - for any .yellow-h[crawl-delay directives]

<br>

- Note that
  - the `delay` argument cannot be set to a number smaller than in the directive
    - if there is one
]

.pull-right[
```md
bow(url,
  user_agent = "polite R package - https://github.com/dmi3kno/polite",
  `delay = 5`,
  force = FALSE,
  ...
)
```
]

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol
  - for a specific URL
  - for a specific agent
  - for crawl-delay directives

<br>

- Note that
  - the `delay` argument cannot be set to a number smaller than in the directive
    - if there is one
<br>
  - the `force` argument is set to `FALSE` by default
    - avoids repeated, unnecessary interactions with the web page
    - by caching, and re-using, previously downloaded sources
]

.pull-right[
```md
bow(url,
  user_agent = "polite R package - https://github.com/dmi3kno/polite",
  delay = 5,
  `force = FALSE`,
  ...
)
```
]

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol

- .yellow-h[Second], use the `scrape` function to get the source code
  - for an object created with the .yellow-h[`bow`] function

<br>

- Note that
  - `scrape` will only work if the results from `bow` are positive
    - creating a safety valve for ethical scraping
]

.pull-right[
```md
scrape(`bow`,
  ...
)
```
]

---

## Static Pages — `rvest` — Get Source Code — `polite`

.pull-left[
- First, use the `bow` function to check the protocol

- Second, use the `scrape` function to get the source code
  - for an object created with the `bow` function

<br>

- Note that
  - `scrape` will only work if the results from `bow` are positive
    - creating a safety valve for ethical scraping
<br>
  - by .yellow-h[piping] `bow` into `scrape`, you can avoid creating objects
]

.pull-right[
```md
scrape(bow,
  ...
)
```

```md
bow() `%>%` scrape()
```
]

---

## Static Pages — `rvest` — Get Source Code

These two pieces of code lead to the same outcome, as there is .yellow-h[no protocol against the access]

.pull-left[
```r
read_html("https://luzpar.netlify.app/")
```

```
## {html_document}
## <html lang="en-us">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ... ``` ] .pull-right[ ```r bow("https://luzpar.netlify.app/") %>% scrape() ``` ``` ## {html_document} ## <html lang="en-us"> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ... ## [2] <body id="top" data-spy="scroll" data-offset="70" data-target="#navbar-ma ... ``` ] --- ## Static Pages — `rvest` — Get Source Code The difference occurs when there is .yellow-h[a protocol against the access] .pull-left[ ```r read_html("https://luzpar.netlify.app/exercises/exercise_6.Rhtml") ``` ``` ## {html_document} ## <html> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ... ## [2] <body>\r\n\r\n<h1>States of Luzland</h1>\r\n \r\n<p>There are four ... ``` ] .pull-right[ ```r bow("https://luzpar.netlify.app/exercises/exercise_6.Rhtml") %>% scrape() ``` ``` ## Warning: No scraping allowed here! ``` ``` ## NULL ``` ] --- class: action ## Exercises 9) Get the source code of the page at <https://luzpar.netlify.app/states/> in R - using the `read_html` function <br> 10) Get the same page source, this time in the `polite` way - let the website know who you are - define delay time
05
:
00
--- ## Static Pages — `rvest` — `html_elements` .pull-left[ - Get one or more HTML elements - from the .yellow-h[source code] downloaded in the previous step <br> - Note that - there are two versions of the same function - singular one gets the first instance of an element, plural gets all instances - if there is only one instance, both functions return the same result ] .pull-right[ ```md html_element(`x`, css, xpath) html_element`s`(`x`, css, xpath) ``` ] --- ## Static Pages — `rvest` — `html_elements` .pull-left[ - Get one or more HTML elements - from the source code downloaded in the previous step - specified with .yellow-h[a selector], CSS .yellow-h[or] XPATH <br> - Note that - we will work with CSS only in this workshop - using CSS is facilitated by Chrome and SelectorGagdet ] .pull-right[ ```md html_element(x, `css`, `xpath`) html_element`s`(x, css, xpath) ``` ] --- ## Static Pages — Finding Selectors - Finding the correct selector(s) is the key to successful scraping, and there are three ways to do it - figure it out yourself, by looking at the source code and/or the DOM - difficult, time consuming, prone to error <br> - use SelectorGagdet or other browser extensions - easy and quick - works well when selecting both single and multiple elements - but sometimes not accurate <br> - use the functionality that Chrome provides - an in-between option in terms of ease and time - works very well with single elements <br> -- <br> - I recommend using - the SelectorGagdet method first, and if it does not help - then the Chrome method, especially when selecting single elements --- ## Static Pages — Finding Selectors — SelectorGagdet To find the selectors for the hyperlinks on the homepage of the Parliamenta of Luzland .pull-left[ 1. visit the page on a Chrome browser 2. click on SelectorGagdet to activate it 3. click on a hyperlink <br> Note that - the element that you clicked is highlighted green - many other elements, including menu items, are in yellow - SelectorGagdet says the selector is `a` ] .pull-right[ <img src="scrp_workshop_files/images_data/sg1.png" width="100%" /> ] --- ## Static Pages — `rvest` — `html_elements` Get the `a` (anchor) elements on the homepage ```r bow("https://luzpar.netlify.app") %>% scrape() %>% `html_elements(css = "a")` ``` ``` ## {xml_nodeset (24)} ## [1] <a class="js-search" href="#" aria-label="Close"><i class="fas fa-times- ... ## [2] <a class="navbar-brand" href="/">Parliament of Luzland</a> ## [3] <a class="navbar-brand" href="/">Parliament of Luzland</a> ## [4] <a class="nav-link active" href="/"><span>Home</span></a> ## [5] <a class="nav-link" href="/states/"><span>States</span></a> ## [6] <a class="nav-link" href="/constituencies/"><span>Constituencies</span></a> ## [7] <a class="nav-link" href="/members/"><span>Members</span></a> ## [8] <a class="nav-link" href="/documents/"><span>Documents</span></a> ## [9] <a class="nav-link js-search" href="#" aria-label="Search"><i class="fas ... ## [10] <a href="#" class="nav-link" data-toggle="dropdown" aria-haspopup="true" ... ## [11] <a href="#" class="dropdown-item js-set-theme-light"><span>Light</span></a> ## [12] <a href="#" class="dropdown-item js-set-theme-dark"><span>Dark</span></a> ## [13] <a href="#" class="dropdown-item js-set-theme-auto"><span>Automatic</spa ... ## [14] <a href="https://github.com/resulumit/scrp_workshop" target="_blank" rel ... ## [15] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Um ... 
## [16] <a href="/documents/">documents</a> ## [17] <a href="/constituencies/">constituencies</a> ## [18] <a href="/members/">members</a> ## [19] <a href="/states/">states</a> ## [20] <a href="https://github.com/rstudio/blogdown" target="_blank" rel="noope ... ## ... ``` --- ## Static Pages — `rvest` — `html_element` Get the .yellow-h[first] `a` (anchor) element on the homepage ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elemen`t(`css = "a") ``` ``` ## {html_node} ## <a class="js-search" href="#" aria-label="Close"> ## [1] <i class="fas fa-times-circle text-muted" aria-hidden="true"></i> ``` <br> Note that - the function on this slide is the singular version --- ## Static Pages — Finding Selectors — SelectorGagdet To exclude the menu items from selection .pull-left[ <span>4.</span> click on a menu item <br> Note that - the element that you clicked is highlighted red - other menu items are not highlighted at all - SelectorGagdet says the selector is now `#title` `a` ] .pull-right[ <img src="scrp_workshop_files/images_data/sg2.png" width="100%" /> ] --- ## Static Pages — `rvest` — `html_elements` Get the `a` (anchor) elements on the homepage with a `#title` attribute ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = `"#title a"`) ``` ``` ## {xml_nodeset (9)} ## [1] <a href="https://github.com/resulumit/scrp_workshop" target="_blank" rel= ... ## [2] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ... ## [3] <a href="/documents/">documents</a> ## [4] <a href="/constituencies/">constituencies</a> ## [5] <a href="/members/">members</a> ## [6] <a href="/states/">states</a> ## [7] <a href="https://github.com/rstudio/blogdown" target="_blank" rel="noopen ... ## [8] <a href="https://gohugo.io/" target="_blank" rel="noopener">Hugo</a> ## [9] <a href="https://github.com/wowchemy" target="_blank" rel="noopener">Wowc ... ``` --- ## Static Pages — Finding Selectors — SelectorGagdet You can click further to exclude some and/or to include more elements .pull-left[ Note that the selection is colour-coded - <span style="background-color: #90FF33;">selected</span> - <span style="background-color: #ffff88;">also included</span> - <span style="background-color: #FF3F33;">excluded</span> - not included at all ] .pull-right[ <img src="scrp_workshop_files/images_data/sg3.png" width="100%" /> ] --- ## Static Pages — `rvest` — `html_elements` Get the link behind the selected elements ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = `"br+ p a"`) ``` ``` ## {xml_nodeset (2)} ## [1] <a href="https://github.com/resulumit/scrp_workshop" target="_blank" rel= ... ## [2] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ... ``` --- ## Static Pages — Finding Selectors — SelectorGagdet You can click further to select .yellow-h[a single element] .pull-left[ ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = `"br+ p a+ a"`) ``` ``` ## {xml_nodeset (1)} ## [1] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ... ``` ] .pull-right[ <img src="scrp_workshop_files/images_data/sg4.png" width="100%" /> ] --- ## Static Pages — Finding Selectors — Chrome To find the selector for a single element, you could also use Chrome itself .pull-left[ 1. right click, and then `Inspect` 2. click ![](scrp_workshop_files/images_data/ch2.png) 3. click on an element on the front end 4. right click on the highlighted section in the DOM 4. 
follow `Copy -> Copy selector`
]

.pull-right[
<img src="scrp_workshop_files/images_data/ch1.png" width="100%" />
]

---

## Static Pages — `rvest` — `html_elements`

Get the link behind one element, with the CSS selector from Chrome

```r
bow("https://luzpar.netlify.app") %>%
  scrape() %>%
  html_elements(css = `"#title > div.container > div > p:nth-child(2) > a:nth-child(2)"`)
```

```
## {xml_nodeset (1)}
## [1] <a href="https://resulumit.com/" target="_blank" rel="noopener">Resul Umi ...
```

<br>

Note that

- the selector is different from the one SelectorGadget returns
  - longer, and therefore, more specific and accurate
  - but the outcome is the same

---

class: action

## Exercises

11) Get the first item on the list on the page at <https://luzpar.netlify.app/states/>

- find the selector with the functionality Chrome offers

<br>

12) Get all items on the list

- find the selector with SelectorGadget

<br>

13) Get only the second and fourth items on the list

- using a single selector that would return both
10
:
00
--- ## Static Pages — `rvest` — `html_text` .pull-left[ - Get the text content of one or more HTML elements - for the elements already chosen - with the `html_elements` function <br> - this returns what is already visible to visitors <br> - Note that - there are two versions of the same function - `html_text` returns text with any space or line breaks around it - `html_text2` returns plain text ] .pull-right[ ```r html_text(x, trim = FALSE) html_text2(x, preserve_nbsp = FALSE) ``` ] --- ## Static Pages — `rvest` — `html_text` ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% `html_text()` ``` ``` ## [1] "a workshop on automated web scraping" ## [2] "Resul Umit" ## [3] "documents" ## [4] "constituencies" ## [5] "members" ## [6] "states" ## [7] "Blogdown" ## [8] "Hugo" ## [9] "Wowchemy" ``` --- class: action ## Exercises 14) Get the text on the list elements on the page at <https://luzpar.netlify.app/states/> <br> 15) Get the constituency names on the page at <https://luzpar.netlify.app/constituencies/>
05
:
00
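
---

## Static Pages — `rvest` — `html_text` vs `html_text2`

The difference between the two versions is easiest to see on untidy source code. A minimal sketch, with a made-up snippet rather than a page from the demonstration website:

```r
# a paragraph with extra spaces and a line break, which browsers would ignore
page <- read_html("<p>  This is a
  one sentence paragraph.  </p>")

page %>% html_element(css = "p") %>% html_text()   # keeps the spaces and the line break
page %>% html_element(css = "p") %>% html_text2()  # plain sentence, as a browser would display it
```
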
--- ## Static Pages — `rvest` — `html_attr` .pull-left[ - Get one or more attributes of one or more HTML elements - for the elements already chosen - with the `html_elements` function <br> - attributes are specified with their name - not CSS or XPATH <br> - Note that - there are two versions of the same function - singular one gets a specified attribute - plural one gets all available attributes ] .pull-right[ ```r html_attr(x, name, default = NA_character_) html_attrs(x) ``` ] --- ## Static Pages — `rvest` — `html_attrs` ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% `html_attrs()` ``` ``` ## [[1]] ## href ## "https://github.com/resulumit/scrp_workshop" ## target ## "_blank" ## rel ## "noopener" ## ## [[2]] ## href target rel ## "https://resulumit.com/" "_blank" "noopener" ## ## [[3]] ## href ## "/documents/" ## ## [[4]] ## href ## "/constituencies/" ## ## [[5]] ## href ## "/members/" ## ## [[6]] ## href ## "/states/" ## ## [[7]] ## href target ## "https://github.com/rstudio/blogdown" "_blank" ## rel ## "noopener" ## ## [[8]] ## href target rel ## "https://gohugo.io/" "_blank" "noopener" ## ## [[9]] ## href target ## "https://github.com/wowchemy" "_blank" ## rel ## "noopener" ``` --- ## Static Pages — `rvest` — `html_attr` ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% `html_attr(name = "href")` ``` ``` ## [1] "https://github.com/resulumit/scrp_workshop" ## [2] "https://resulumit.com/" ## [3] "/documents/" ## [4] "/constituencies/" ## [5] "/members/" ## [6] "/states/" ## [7] "https://github.com/rstudio/blogdown" ## [8] "https://gohugo.io/" ## [9] "https://github.com/wowchemy" ``` -- Note that - some URLs are given relative to the base URL - e.g., `/states/`, which is actually <https://luzpar.netlify.app/states/> - you can complete them with the `url_absolute` function --- ## Static Pages — `rvest` — `url_absolute` Complete the relative URLs with the `url_absolute` function ```r bow("https://luzpar.netlify.app") %>% scrape() %>% html_elements(css = "#title a") %>% html_attr(name = "href") %>% `url_absolute(base = "https://luzpar.netlify.app")` ``` ``` ## [1] "https://github.com/resulumit/scrp_workshop" ## [2] "https://resulumit.com/" ## [3] "https://luzpar.netlify.app/documents/" ## [4] "https://luzpar.netlify.app/constituencies/" ## [5] "https://luzpar.netlify.app/members/" ## [6] "https://luzpar.netlify.app/states/" ## [7] "https://github.com/rstudio/blogdown" ## [8] "https://gohugo.io/" ## [9] "https://github.com/wowchemy" ``` --- class: action ## Exercises 16) Get the hyperlink attributes for the constituencies at <https://luzpar.netlify.app/constituencies/> <br> 17) Create complete links to the constituency pages
05:00
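---

## Static Pages — `rvest` — `html_attr` — Note

The `default` argument of `html_attr` substitutes a value for attributes that an element does not have; a minimal sketch, assuming the same selection as on the previous slides, where links without a `target` attribute would otherwise come back as `NA`

```r
bow("https://luzpar.netlify.app") %>%
  scrape() %>%
  html_elements(css = "#title a") %>%
  # internal links have no target attribute; treat them as "_self"
  html_attr(name = "target", default = "_self")
```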
--- ## Static Pages — `rvest` — `html_table` Use the `html_table()` function to get the text content of table elements - it returns a list of tibbles, one per table (see the note after the exercise below) ```r bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "table") %>% `html_table()` ``` ``` ## [[1]] ## # A tibble: 100 x 3 ## Member Constituency Party ## <chr> <chr> <chr> ## 1 Arthur Ali Mühlshafen Liberal ## 2 Chris Antony Benwerder Labour ## 3 Chloë Bakker Steffisfelden Labour ## 4 Rose Barnes Dillon Liberal ## 5 Emilia Bauer Kilnard Green ## 6 Wilma Baumann Granderry Green ## 7 Matteo Becker Enkmelo Labour ## 8 Patricia Bernard Gänsernten Labour ## 9 Lina Booth Leonrau Liberal ## 10 Sophie Bos Zotburg Independent ## # ... with 90 more rows ``` --- ## Static Pages — `rvest` We can create the same tibble with `html_text`, which requires getting each variable separately and then merging them ```r tibble( "Member" = bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td:nth-child(1) a") %>% html_text(), "Constituency" = bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td:nth-child(2) a") %>% html_text(), "Party" = bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td:nth-child(3)") %>% html_text() ) ``` --- ## Static Pages — `rvest` Keep the number of interactions with websites to a minimum - by saving the source code as an object, which could be used repeatedly ```md `the_page <- bow("https://luzpar.netlify.app/members/")` %>% `scrape()` tibble( "Member" = `the_page` %>% html_elements(css = "td:nth-child(1)") %>% html_text(), "Constituency" = `the_page` %>% html_elements(css = "td:nth-child(2)") %>% html_text(), "Party" = `the_page` %>% html_elements(css = "td:nth-child(3)") %>% html_text() ) ``` --- class: action ## Exercise 18) Create a dataframe out of the table at <https://luzpar.netlify.app/members/> - with as many variables as possible - hints: - start with the code in the previous slide, and add new variables from attributes - the first two columns have important attributes - e.g., URLs for the pages for members and their constituencies - make these URLs absolute - see what other attributes there are to collect
15:00
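---

## Static Pages — `rvest` — `html_table` — Note

`html_table()` returns a list of tibbles, one per table element, even when a page holds a single table; a minimal sketch for keeping only the first table, assuming the members page is the target

```r
tables <- bow("https://luzpar.netlify.app/members/") %>%
  scrape() %>%
  html_elements(css = "table") %>%
  html_table()

# keep the first (and, on this page, only) table
members <- tables[[1]]
```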
--- ## Static Pages — Crawling — Overview - Rarely does a single page include all the variables that we need - instead, they are often scattered across different pages of a website - e.g., we might need data on election results — in addition to constituency names <br> - Web scraping then requires crawling across pages - using information found on one page, to go to the next - website design may or may not facilitate crawling <br> - We can write for loops to crawl - the speed of our code matters the most when we crawl - ethical concerns are higher --- ## Static Pages — Crawling — Example **Task:** - I need data on the name and vote share of parties that came second in each constituency - This data is available on constituency pages, but - there are too many such pages - e.g., <https://luzpar.netlify.app/constituencies/arford/> - I do not have the URLs to these pages -- <br> **Plan:** - Scrape <https://luzpar.netlify.app/members/> for URLs - Write a for loop to - visit these pages one by one - collect and save the variables needed - write these variables into a list - turn the list into a dataframe --- ## Static Pages — Crawling — Example Scrape the page that has all URLs, for absolute URLs ```r the_links <- bow("https://luzpar.netlify.app/members/") %>% scrape() %>% html_elements(css = "td+ td a") %>% html_attr("href") %>% url_absolute(base = "https://luzpar.netlify.app/") # check if it worked head(the_links) ``` ``` ## [1] "https://luzpar.netlify.app/constituencies/muhlshafen/" ## [2] "https://luzpar.netlify.app/constituencies/benwerder/" ## [3] "https://luzpar.netlify.app/constituencies/steffisfelden/" ## [4] "https://luzpar.netlify.app/constituencies/dillon/" ## [5] "https://luzpar.netlify.app/constituencies/kilnard/" ## [6] "https://luzpar.netlify.app/constituencies/granderry/" ``` --- ## Static Pages — Crawling — Example Create an empty list ```r *temp_list <- list() for (i in 1:length(the_links)) { the_page <- bow(the_links[i]) %>% scrape() temp_tibble <- tibble( "constituency" = the_page %>% html_elements("#constituency") %>% html_text(), "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(), "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text() ) temp_list[[i]] <- temp_tibble } df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Start a for loop to iterate over the links one by one ```r temp_list <- list() *for (i in 1:length(the_links)) { the_page <- bow(the_links[i]) %>% scrape() temp_tibble <- tibble( "constituency" = the_page %>% html_elements("#constituency") %>% html_text(), "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(), "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text() ) temp_list[[i]] <- temp_tibble *} df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Get the source code for the next link ```r temp_list <- list() for (i in 1:length(the_links)) { *the_page <- bow(the_links[i]) %>% scrape() temp_tibble <- tibble( "constituency" = the_page %>% html_elements("#constituency") %>% html_text(), "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(), "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text() ) temp_list[[i]] <- temp_tibble } df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Get the variables needed, put them in a tibble ```r 
temp_list <- list() for (i in 1:length(the_links)) { the_page <- bow(the_links[i]) %>% scrape() *temp_tibble <- tibble( * *"constituency" = the_page %>% html_elements("#constituency") %>% html_text(), * *"second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% * html_text(), * *"vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% * html_text() * *) temp_list[[i]] <- temp_tibble } df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Add each tibble into the previously-created list ```r temp_list <- list() for (i in 1:length(the_links)) { the_page <- bow(the_links[i]) %>% scrape() temp_tibble <- tibble( "constituency" = the_page %>% html_elements("#constituency") %>% html_text(), "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(), "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text() ) *temp_list[[i]] <- temp_tibble } df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Turn the list into a tibble (an alternative with `bind_rows` follows the exercise below) ```r temp_list <- list() for (i in 1:length(the_links)) { the_page <- bow(the_links[i]) %>% scrape() temp_tibble <- tibble( "constituency" = the_page %>% html_elements("#constituency") %>% html_text(), "second_party" = the_page %>% html_element("tr:nth-child(3) td:nth-child(1)") %>% html_text(), "vote_share" = the_page %>% html_elements("tr:nth-child(3) td:nth-child(3)") %>% html_text() ) temp_list[[i]] <- temp_tibble } *df <- as_tibble(do.call(rbind, temp_list)) ``` --- ## Static Pages — Crawling — Example Check the resulting dataset ```r head(df, 10) ``` ``` ## # A tibble: 100 x 3 ## constituency second_party vote_share ## <chr> <chr> <chr> ## 1 Mühlshafen Green 26.1% ## 2 Benwerder Conservative 24.8% ## 3 Steffisfelden Green 25.7% ## 4 Dillon Conservative 27% ## 5 Kilnard Conservative 28.8% ## 6 Granderry Labour 26.1% ## 7 Enkmelo Liberal 26.8% ## 8 Gänsernten Green 26.6% ## 9 Leonrau Conservative 25% ## 10 Zotburg Conservative 28.4% ## # ... with 90 more rows ``` --- class: action ## Exercise 19) Crawl into members' personal pages to create a rich dataset - with members being the unit of observation <br> Hints: - see an example dataset at <https://luzpar.netlify.app/exercises/static_data.csv> - start with the related code in the previous slides, and adapt it to your needs - practice with 3 members until you are ready to run the loop for all - e.g., by replacing `1:length(the_links)` with `1:3` for the loop
45:00
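---

## Static Pages — Crawling — Note

As an aside, the last step of the loop, turning the list into a tibble, can also be written with the `bind_rows` function; a minimal sketch, assuming `dplyr` is loaded and `temp_list` holds one tibble per constituency

```r
# equivalent to df <- as_tibble(do.call(rbind, temp_list))
df <- bind_rows(temp_list)
```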
--- name: part6 class: inverse, center, middle # Part 6. Scraping Dynamic Pages .footnote[ [Back to the contents slide](#contents-slide). ] --- ## Dynamic Pages — Overview - Dynamic pages are ones that display custom content - different visitors might see different content on the same page - at the same URL <br> - depending on, for example, their own input - e.g., clicks, scrolls — while the URL remains the same <br> - <https://luzpar.netlify.app/documents/> is a page with a dynamic part -- <br> - Dynamic pages are typically scraped in three steps - as opposed to two steps, in scraping static pages - we will use an additional package, `RSelenium`, for the new step --- ## Dynamic Pages — Three Steps to Scrape Scraping dynamic pages involves three main steps <br> - .yellow-h[Create] the desired instance of the dynamic page - with the `RSelenium` package - e.g., by clicking, scrolling, filling in forms, from within R <br> - .yellow-h[Get] the source code into R - `RSelenium` downloads XML - `rvest` turns it into HTML <br> - .yellow-h[Extract] the exact information needed from the source code - as for static pages - with the `rvest` package --- ## Dynamic Pages — `RSelenium` — Overview - A package that integrates [Selenium 2.0 WebDriver](https://www.selenium.dev/documentation/en/) into R - created by [John Harrison](http://johndharrison.github.io/#/cover) - downloaded 6,901 times last month - last updated in February 2020 -- <br> - A lot has already been written on this package - you will find solutions to, or help for, any issues online - see the [package documentation](https://cran.r-project.org/web/packages/RSelenium/RSelenium.pdf) and the [vignettes](https://cran.r-project.org/web/packages/RSelenium/vignettes/basics.html) for basic functionality - Google searches return code and tutorials in various languages - not only R but also Python, Java --- ## Dynamic Pages — `RSelenium` — Overview - The package involves more methods than functions - the code looks slightly unusual for R - as it follows the logic behind Selenium -- <br> - It allows interacting with two things — and it is crucial that users are aware of the difference - with .yellow-h[browsers] on your computer - e.g., opening a browser and navigating to a page <br> - with .yellow-h[elements] on a webpage - e.g., opening and clicking on a drop-down menu --- class: center, middle ## Interacting with Browsers --- ## Dynamic Pages — Browsers — Starting a Server .pull-left[ - Use the `rsDriver` function to start a server - so that you can control a web browser from within R <br> ] .pull-right[ ```md rsDriver(port = 4567L, browser = "chrome", version = "latest", chromever = "latest", ... ) ``` ] --- ## Dynamic Pages — Browsers — Starting a Server .pull-left[ - Use the `rsDriver` function to start a server - so that you can control a web browser from within R <br> - Note that the defaults can cause errors, such as - trying to start two servers from the same .yellow-h[port] ] .pull-right[ ```md rsDriver(`port` = 4567L, browser = "chrome", version = "latest", chromever = "latest", ... ) ``` ] --- ## Dynamic Pages — Browsers — Starting a Server .pull-left[ - Use the `rsDriver` function to start a server - so that you can control a web browser from within R <br> - Note that the defaults can cause errors, such as - trying to start two servers from the same port - any mismatch between the .yellow-h[version and driver numbers] ] .pull-right[ ```md rsDriver(port = 4567L, browser = "chrome", `version = "latest"`, `chromever = "latest"`, ... 
) ``` ] --- ## Dynamic Pages — Browsers — Starting a Server - The latest version of the driver is too new for my browser - I have to use an older version to make it work - after checking the available versions with the following code ```r binman::list_versions("chromedriver") ``` ``` ## $win32 ## [1] "100.0.4896.20" "100.0.4896.60" "102.0.5005.27" "102.0.5005.61" ## [5] "103.0.5060.24" "89.0.4389.23" "90.0.4430.24" "91.0.4472.19" ## [9] "99.0.4844.35" "99.0.4844.51" ``` -- <br> - Note that - you can only use the version that *you* have - you might have different versions than the ones on this slide --- ## Dynamic Pages — Browsers — Starting a Server .pull-left[ - Then the function works - a web browser opens as a result - an R object named .yellow-h[driver] is created <br> - Note that - the browser says .yellow-h["Chrome is being controlled by automated test software."] - you should avoid controlling this browser manually - you should also avoid creating multiple servers ] .pull-right[ ```md driver <- rsDriver(`chromever = "102.0.5005.27"`) ``` <img src="scrp_workshop_files/images_data/chrome_works.png" width="2073" /> ] --- ## Dynamic Pages — Browsers — Starting a Server Separate the .yellow-h[client] and .yellow-h[server] as different objects ```r browser <- driver$client server <- driver$server ``` <br> Note that - `rsDriver()` creates a client and a server - the code above singles out the client, with which our code will interact - the client is best thought of as the browser itself - it has the class of `remoteDriver` - a full session sketch, including how to stop the server, follows the exercises below --- class: action ## Exercises 20) Start a server - supply a driver version if necessary <br> 21) Single out the client - call it `browser` to help you follow the slides
02:30
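---

## Dynamic Pages — Browsers — Stopping the Server

A minimal sketch of a complete session, from starting the server to stopping it; the driver version here is an assumption, to be replaced with one available on your computer, and stopping the server at the end frees the port for later sessions

```md
library(RSelenium)

# start a server and an automated browser
driver <- rsDriver(browser = "chrome", chromever = "102.0.5005.27")

# single out the client and the server
browser <- driver$client
server <- driver$server

# ... navigate, interact, and scrape here ...

# when finished, close the browser and stop the server
browser$close()
server$stop()
```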
--- ## Dynamic Pages — Browsers — Navigate Navigate to a page with the following notation ```md browser`$navigate`("https://luzpar.netlify.app") ``` <img src="scrp_workshop_files/images_data/navigate.png" width="50%" style="display: block; margin: auto;" /> --- ## Dynamic Pages — Browsers — Navigate Navigate to a page with the following notation ```md browser`$`navigate("https://luzpar.netlify.app") ``` <br> Note that - `navigate` is called .yellow-h[a method, not a function] - it cannot be piped <span style="background-color: #ffff88;">%>%</span> into `browser` - use the dollar sign <span style="background-color: #ffff88;">$</span> notation instead --- ## Dynamic Pages — Browsers — Navigate Check the description of any method as follows, with no parentheses after the method name ```md browser$navigate ``` ``` Class method definition for method navigate() function (url) { "Navigate to a given url." qpath <- sprintf("%s/session/%s/url", serverURL, sessionInfo[["id"]]) queryRD(qpath, "POST", qdata = list(url = url)) } <environment: 0x00000173db9035a8> Methods used: "queryRD" ``` --- ## Dynamic Pages — Browsers — Navigate Go back to the previous URL ```md browser$goBack() ``` <br> Go forward ```md browser$goForward() ``` <br> Refresh the page ```md browser$refresh() ``` --- class: action ## Exercises 22) Navigate to a website, and then to another one - from within R, all the while observing the outcome in the automated browser <br> 23) Go back, and go forward <br> 24) See what other methods are available to interact with browsers - read the description for one or more of them <br> 25) Try one or more new methods - e.g., take a screenshot of your browser - and view it in R
10:00
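---

## Dynamic Pages — Browsers — Screenshots

One possibility for Exercise 25: a sketch using the `screenshot` method, assuming the automated browser from the previous exercises is still open

```md
# navigate to a page
browser$navigate("https://luzpar.netlify.app")

# view a screenshot of the automated browser in the RStudio viewer
browser$screenshot(display = TRUE)

# or save the screenshot to a file instead
browser$screenshot(file = "homepage.png")
```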
--- ## Dynamic Pages — Browsers — Navigate Get the URL of the current page ```md browser$getCurrentUrl() ``` <br> Get the title of the current page ```md browser$getTitle() ``` --- ## Dynamic Pages — Browsers — Close and Open Close the browser - which will not close the session on the server - recall that we have singled the client out ```md browser$close() ``` <br> Open a new browser - which does not require the `rsDriver` function - because the server is still running ```md browser$open() ``` --- ## Dynamic Pages — Browsers — Get Page Source Get the page source ```md browser$getPageSource()[[1]] ``` --- ## Dynamic Pages — Browsers — Get Page Source Get the page source ```md browser$getPageSource()`[[1]]` ``` <br> Note that - this method returns a list - XML source is in the first item - this is why we need the .inline-c[[[1]]] bit <br> - this is akin to `read_html()` for static pages - or `bow()` `%>%` `scrape()` <br> - `rvest` usually takes over after this step --- ## Dynamic Pages — Browsers — Get Page Source Extract the links on the homepage, with functions from both the `RSelenium` and `rvest` packages ```md browser$navigate(url = "https://luzpar.netlify.app") browser$getPageSource()[[1]] %>% read_html() %>% html_elements("#title a") %>% html_attr("href") ``` ``` [1] "https://github.com/resulumit/scrp_workshop" [2] "https://resulumit.com/" [3] "/documents/" [4] "/constituencies/" [5] "/members/" [6] "/states/" [7] "https://github.com/rstudio/blogdown" [8] "https://gohugo.io/" [9] "https://github.com/wowchemy" ``` --- ## Dynamic Pages — Browsers — Get Page Source Extract the links on the page, with functions from both the `RSelenium` and `rvest` packages ```md browser$navigate(url = "https://luzpar.netlify.app") browser$getPageSource()[[1]] %>% `read_html() %>%` html_elements("#title a") %>% html_attr("href") ``` <br> Note that - we are still using the `read_html()` function - to turn XML (coming from `RSelenium`) into HTML <br> - this is in fact not a dynamic page - we could do the same as above without `RSelenium` --- ## Dynamic Pages — Browsers — Get Page Source These two pieces of code lead to the same outcome, as the page we scrape is not dynamic .pull-left[ ```md browser$navigate(url = "https://luzpar.netlify.app") browser$getPageSource()[[1]] %>% read_html() %>% html_elements("#title a") %>% html_attr("href") ``` ``` [1] "https://github.com/resulumit/scrp_workshop" [2] "https://resulumit.com/" [3] "/documents/" [4] "/constituencies/" [5] "/members/" [6] "/states/" [7] "https://github.com/rstudio/blogdown" [8] "https://gohugo.io/" [9] "https://github.com/wowchemy" ``` ] .pull-right[ ```r read_html("https://luzpar.netlify.app") %>% html_elements(css = "#title a") %>% html_attr("href") ``` ``` ## [1] "https://github.com/resulumit/scrp_workshop" ## [2] "https://resulumit.com/" ## [3] "/documents/" ## [4] "/constituencies/" ## [5] "/members/" ## [6] "/states/" ## [7] "https://github.com/rstudio/blogdown" ## [8] "https://gohugo.io/" ## [9] "https://github.com/wowchemy" ``` ] --- class: action ## Exercises 26) Get the page source for <https://luzpar.netlify.app/members/> - using function(s) from `rvest` or `polite` <br> 27) Get the same page source, using `RSelenium` - compare the outcome with the one from Exercise 26 <br> 28) Collect names from <https://luzpar.netlify.app/members/> - using functions from `rvest` only - using `RSelenium` and `rvest` together - compare the outcomes
07:30
--- class: center, middle ## Interacting with Elements --- ## Dynamic Pages — Elements — Find .pull-left[ - Locate an element on the open browser - to be interacted with later on - e.g., clicking on the element <br> - Note that - the default selector is `xpath` - requires entering the `xpath` value ] .pull-right[ ```md findElement(using = "xpath", value ) ``` ] --- ## Dynamic Pages — Elements — Find .pull-left[ - Locate an element on the open browser - using CSS selectors <br> - Note that - typing .yellow-h["css"], instead of .yellow-h["css selector"], also works - there are other selector schemes as well, including - id - name - link text ] .pull-right[ ```md findElement(using = `"css selector"`, value ) ``` ] --- ## Dynamic Pages — Elements — Find — Selectors If there were a button created by the following code ... ```md <button class="big-button" id="only-button" name="clickable">Click Me</button> ``` <br> ... any of the lines below would find it ```md browser$findElement(using = "xpath", value = '//*[(@id = "only-button")]') browser$findElement(using = "css selector", value = ".big-button") browser$findElement(using = "css", value = "#only-button") browser$findElement(using = "id", value = "only-button") browser$findElement(using = "name", value = "clickable") ``` --- ## Dynamic Pages — Elements — Objects Save elements as R objects to be interacted with later on ```md button <- browser$findElement(using = ..., value = ...) ``` <br> Note the difference between the classes of clients and elements .pull-left[ ```md class(browser) ``` ``` [1] "remoteDriver" attr(,"package") [1] "RSelenium" ``` ] .pull-right[ ```md class(button) ``` ``` [1] "webElement" attr(,"package") [1] "RSelenium" ``` ] --- ## Dynamic Pages — Elements — Highlight Highlight the element found in the previous step, with the `highlightElement` method ```r # navigate to a page browser$navigate("http://luzpar.netlify.app/") # find the element menu_states <- browser$findElement(using = "link text", value = "States") # highlight it to see if we found the correct element menu_states$`highlightElement()` ``` <br> Note that - the highlighted element will flash for a second or two on the browser - helpful to check if selection worked as intended --- ## Dynamic Pages — Elements — Highlight Highlight the element found in the previous step, with the `highlightElement` method ```r # navigate to a page browser$navigate("http://luzpar.netlify.app/") # find the element menu_states <- `browser$`findElement(using = "link text", value = "States") # highlight it to see if we found the correct element `menu_states$`highlightElement() ``` <br> Note that - the highlighted element will flash for a second or two on the browser - helpful to check if selection worked as intended <br> - the highlight method is applied to the element (`menu_states`), not to the client (`browser`) --- ## Dynamic Pages — Elements — Click Click on the element found in the previous step, with the `clickElement` method ```r # navigate to a page browser$navigate("http://luzpar.netlify.app/") # find an element search_icon <- browser$findElement(using = "css", value = ".fa-search") # click on it search_icon$`clickElement()` ``` --- class: action ## Exercises 29) Go to <https://luzpar.netlify.app/constituencies/>, and click the next page button - using the automated browser - hint: to find the selector for the button, use an additional browser manually <br> 30) While on the second page, click the next page button again - hint: you will have to find the button again
07:30
--- ## Dynamic Pages — Elements — Input .pull-left[ - Provide input to elements, such as - text, with the <span style="background-color: #ffff88;">value</span> argument ] .pull-right[ ```md sendKeysToElement(list(`value`, key ) ) ``` ] --- ## Dynamic Pages — Elements — Input .pull-left[ - Provide input to elements, such as - text, with the value argument - keyboard presses or mouse gestures, with the <span style="background-color: #ffff88;">key</span> argument <br> - Note that - users provide the values, while the Selenium keys are pre-defined ] .pull-right[ ```md sendKeysToElement(list(value, `key` ) ) ``` ] --- ## Dynamic Pages — Elements — Input — Selenium Keys View the list of Selenium keys ```r as_tibble(selKeys) %>% names() ``` ``` ## [1] "null" "cancel" "help" "backspace" "tab" ## [6] "clear" "return" "enter" "shift" "control" ## [11] "alt" "pause" "escape" "space" "page_up" ## [16] "page_down" "end" "home" "left_arrow" "up_arrow" ## [21] "right_arrow" "down_arrow" "insert" "delete" "semicolon" ## [26] "equals" "numpad_0" "numpad_1" "numpad_2" "numpad_3" ## [31] "numpad_4" "numpad_5" "numpad_6" "numpad_7" "numpad_8" ## [36] "numpad_9" "multiply" "add" "separator" "subtract" ## [41] "decimal" "divide" "f1" "f2" "f3" ## [46] "f4" "f5" "f6" "f7" "f8" ## [51] "f9" "f10" "f11" "f12" "command_meta" ``` --- ## Dynamic Pages — Elements — Input — Selenium Keys — Note By choosing the body element, you can scroll up and down a page ```r body <- browser$findElement(using = "css", `value = "body"`) body$sendKeysToElement(list(`key = "page_down"`)) ``` --- ## Dynamic Pages — Elements — Input — Example Search the demonstration site ```r # navigate to the home page browser$navigate("http://luzpar.netlify.app/") # find the search icon and click on it search_icon <- browser$findElement(using = "css", value = ".fa-search") search_icon$clickElement() # find the search bar on the new page and click on it search_bar <- browser$findElement(using = "css", value = "#search-query") search_bar$clickElement() # search for the keyword "Law" and press enter search_bar$`sendKeysToElement(list(value = "Law", key = "enter"))` ``` --- ## Dynamic Pages — Elements — Input — Example Slow down the code where necessary, with the `Sys.sleep` function - for ethical reasons - because R might be faster than the browser ```r # navigate to the home page browser$navigate("http://luzpar.netlify.app/") # find the search icon and click on it search_icon <- browser$findElement(using = "css", value = ".fa-search") search_icon$clickElement() # sleep for 2 seconds *Sys.sleep(2) # find the search bar on the new page and click on it search_bar <- browser$findElement(using = "css", value = "#search-query") search_bar$clickElement() # search for the keyword "Law" and press enter search_bar$sendKeysToElement(list(value = "Law", key = "enter")) ``` --- ## Dynamic Pages — Elements — Input — Clear Clear text, or a value, from an element ```r search_bar$clearElement() ``` --- class: action ## Exercises 31) Conduct an internet search programmatically - navigate to <https://duckduckgo.com/> - just to keep it simple, as Google would require you to scroll down and accept a policy <br> - find, highlight, and conduct a search <br> 32) Scroll down programmatically, and up - to see all results <br> 33) Go back, and conduct another search - hint: you will have to find the search bar again
15:00
--- ## Dynamic Pages — Elements — Switch Frames .pull-left[ - Switch to a different frame on a page - some pages have multiple frames - you can think of them as browsers within browsers - while in one frame, we cannot work with the page source of another frame ] .pull-right[ ```md switchToFrame(Id ) ``` ] <br> - Note that - there is one such page on the demonstration website - <https://luzpar.netlify.app/documents/> - featuring a Shiny app that originally lives at <https://resulumit.shinyapps.io/luzpar/> <br> - the `Id` argument takes an element object, unquoted - setting it to `NULL` returns to the default frame --- ## Dynamic Pages — Elements — Switch Frames Switch to a non-default frame ```r # navigate to a page and wait for the frame to load browser$navigate("https://luzpar.netlify.app/documents/") Sys.sleep(4) # find the frame, which is an element app_frame <- browser$findElement("css", "iframe") # switch to it browser$`switchToFrame(Id = app_frame)` # switch back to the default frame browser$`switchToFrame(Id = NULL)` ``` --- ## Dynamic Pages — Scraping — Example **Task:** - I need to download specific documents published by the parliament - e.g., proposals and reports <br> - The related section of the website is a dynamic page - initially it is empty, and clicking on things does not change the URL -- <br> **Plan:** - Interact with the page until it displays the desired list of documents - Get the page source and separate the links - Write a for loop to - visit the related pages one by one - download the documents --- ## Dynamic Pages — Scraping — Example Interact with the page until it displays the desired list of documents ```r # navigate to the desired page and wait a little browser$navigate("https://luzpar.netlify.app/documents/") Sys.sleep(4) # switch to the frame with the app app_frame <- browser$findElement("css", "iframe") browser$switchToFrame(Id = app_frame) # find and open the drop down menu drop_down <- browser$findElement(using = "css", value = ".bs-placeholder") drop_down$clickElement() # choose proposals proposal <- browser$findElement(using = 'css', "[id='bs-select-1-1']") proposal$clickElement() # choose reports report <- browser$findElement(using = 'css', "[id='bs-select-1-2']") report$clickElement() # close the drop down menu drop_down$clickElement() ``` --- ## Dynamic Pages — Scraping — Example Get the page source and separate the links ```r the_links <- browser$getPageSource()[[1]] %>% read_html() %>% html_elements("td a") %>% html_attr("href") print(the_links) ``` ``` ## [1] "https://luzpar.netlify.app/documents/human-rights-2021/" ## [2] "https://luzpar.netlify.app/documents/greenhouse-gas-emissions-2021/" ## [3] "https://luzpar.netlify.app/documents/tax-reform-2020/" ## [4] "https://luzpar.netlify.app/documents/parliamentary-staff-2020/" ## [5] "https://luzpar.netlify.app/documents/cyber-security-2019/" ## [6] "https://luzpar.netlify.app/documents/electronic-cigarettes-2019/" ``` --- ## Dynamic Pages — Scraping — Example Write a for loop to download PDFs ```md for (i in 1:length(the_links)) { pdf_link <- bow(the_links[i]) %>% scrape() %>% html_elements(css = ".btn-page-header") %>% html_attr("href") %>% url_absolute(base = "https://luzpar.netlify.app/") download.file(url = pdf_link, destfile = basename(pdf_link), mode = "wb") } ``` --- class: action ## Exercise 34) Collect data on a subset of documents - article tags and image credits - for documents within the Law and Proposal categories - published after 2019 Hint: - start with the related code in the previous slides - modify as necessary
30:00
--- name: reference-slide class: inverse, center, middle # References .footnote[ [Back to the contents slide](#contents-slide). ] --- ## References Harrison, J. (2020). _RSelenium: R Bindings for Selenium WebDriver_. R package version 1.7.7. <http://docs.ropensci.org/RSelenium>. Meissner, P. and K. Ren (2020). _robotstxt: A robots.txt Parser and Webbot/'Spider'/Crawler Permissions Checker_. R package version 0.7.13. <https://CRAN.R-project.org/package=robotstxt>. Perepolkin, D. (2019). _polite: Be Nice on the Web_. R package version 0.1.1. <https://github.com/dmi3kno/polite>. Silge, J. and D. Robinson (2017). _Text mining with R: A tidy approach_. O'Reilly. Wickham, H. (2021). _rvest: Easily Harvest (Scrape) Web Pages_. R package version 1.0.2. <https://CRAN.R-project.org/package=rvest>. Wickham, H., R. François, L. Henry, et al. (2022). _dplyr: A Grammar of Data Manipulation_. R package version 1.0.9. <https://CRAN.R-project.org/package=dplyr>. Wickham, H. and G. Grolemund (2021). _R for data science_. O'Reilly. Xie, Y. (2022). _xaringan: Presentation Ninja_. R package version 0.24. <https://github.com/yihui/xaringan>. --- class: middle, center ## The workshop ends here. ## Congratulations for making it this far, and ## thank you for joining me! .footnote[ [Back to the contents slide](#contents-slide). ]