
Twitter Data in R

Collection | Cleaning | Analysis


Resul Umit

March 2022

Skip intro — To the contents slide.    I can teach this workshop at your institution — Email me.

1 / 202

Who am I?

Resul Umit

  • post-doctoral researcher at the University of Oslo

  • interested in representation, elections, and parliaments

    • a recent publication: Parliamentary communication allowances do not increase electoral turnout or incumbents' vote share
  • working with Twitter data

    • a publication based on Twitter data: The voices of Eurosceptic members of parliament (MPs) echo disproportionately louder on Twitter

      • 400,000+ tweets from 1,000+ MPs
      • automated classification, using a bag-of-words approach
      • manual validation
    • an app based on Twitter data: LikeWise — a Shiny app that facilitates searching the tweets a user liked

2 / 202

The Workshop — Overview

  • Two days, on how to collect, process, and analyse data from Twitter

    • ~200 slides, 75+ exercises
    • support for applications to Twitter, developing real projects


  • Designed for researchers with basic knowledge of R programming language

    • does not cover programming with R
      • e.g., writing functions

    • existing ability to work with data in R will be very helpful
      • especially while processing and analysing data
      • but not absolutely necessary — this ability can be developed during and after the workshop as well
3 / 202

The Workshop — Motivation

  • Twitter provides attractive opportunities for academic research

    • a simple search for "twitter data" returns 74,000+ results on Google Scholar
      • at the beginning of March 2022


  • Research based on Twitter data requires a set of skills

    • typically, these skills are not part of academic training
4 / 202

The Workshop — Motivation — Opportunities

  • Popularity of the network

    • about 220 million users, more than three quarters of whom are daily active*
    • your subjects are likely Twitter users
      • e.g., for me, members of parliament


  • Richness of the data

    • about 500 million tweets per day*
    • with up to 90 variables for each tweet


  • Accessibility of the data

    • most Twitter data are public
      • i.e., relatively few private profiles

    • APIs (application programming interfaces) enable programmatic access to Twitter
      • e.g., downloading tweets with R, as opposed to viewing them as rendered in a browser

* These statistics were compiled at the end of 2021 by BusinessOfApps.

7 / 202

The Workshop — Motivation — Challenges

  • Research based on Twitter data requires certain skills

    • e.g., the availability of APIs means little to researchers who cannot code yet
    • processing and analysing data are no less challenging without programming skills


  • The required skills are often not covered in the academic training of social scientists

    • e.g., methodology courses
    • but help is available elsewhere, including

      • collaboration with researchers who have these skills
        • at the cost of sacrificing some control over your research and research agenda

      • acquiring the skills through individual effort
        • e.g., workshops such as this one
8 / 202

The Workshop — Motivation — Aims

  • To provide you with an understanding of what is possible

    • we will cover a large breadth of issues, not all of which is meant for long-term memory
      • hence the slides are designed for self-study as well

    • awareness of what is possible, Google, and perseverance are all you need


  • To get you started on acquiring and practicing the skills needed

    • practice with ready-written code
    • start working on a real project
9 / 202

The Workshop — Contents


Part 1. Preliminary Considerations

  • e.g., considering Twitter for research

Part 2. Getting the Tools Ready

  • e.g., downloading course material

Part 3. Data Collection

  • e.g., acquiring a user's tweets

Part 4. Data Preparation

  • e.g., creating a tidy dataset of tweets

Part 5. Data Analysis: Users

  • e.g., conducting network analysis

Part 6. Data Analysis: Tweets

  • e.g., conducting sentiment analysis
10 / 202

The Workshop — Organisation

  • I will go through a number of slides...

    • introducing things
    • demonstrating how to do things


  • ... and then pause, for you to use/do those things

    • e.g., prepare your computer for the workshop, and/or
    • complete a number of exercises


  • We are here to help

    • ask me, other participants
    • consult Google, slides, answer scripts
      • type, rather than copy and paste, the code you will find on the slides or the script
11 / 202

The Workshop — Organisation — Slides

Slides with this background colour indicate that your action is required, for

  • setting the workshop up

    • e.g., installing R
  • completing the exercises

    • e.g., downloading tweets
    • there are 75+ exercises
    • these slides have countdown timers
      • as a guide, not to be followed strictly
03:00
12 / 202

The Workshop — Organisation — Slides

  • Codes and texts that go in the R console or scripts appear as such — in a different font, on gray background

    • long codes and texts will have their own line(s)
# read in the tweets dataset
df <- read_rds("tweets.rds") %>%
  # split the variable text, create a new variable called da_tweets
  unnest_tokens(output = da_tweets, input = text, token = "tweets") %>%
  # remove rows that match any of the stop words as stored in the stop_words dataset
  anti_join(stop_words, by = c("da_tweets" = "word"))
13 / 202

The Workshop — Organisation — Slides

  • Codes and texts that go in the R console or scripts appear as such — in a different font, on gray background

    • long codes and texts will have their own line(s)


  • Results that come out as output appear as such — in the same font, on green background

    • except very obvious results, such as graphs


  • Specific sections are highlighted yellow as such for emphasis

    • these could be for anything — codes and texts in input, results in output, and/or texts on slides


  • The slides are designed for self-study as much as for the workshop

    • accessible, in substance and form, to go through on your own
14 / 202

Part 1. Preliminary Considerations

15 / 202

Considerations — Research Questions & Hypotheses

  • Ideally, we have one or more research questions, hypotheses

    • developed prior to data collection, analysis
      • based on, e.g., theory, claims, observations

    • perhaps, even pre-registered
16 / 202

Considerations — Potential Biases

There are at least two potential sources of bias in Twitter data

  • sampling

    • Twitter users are not representative of the general population
    • Tweeting behaviour has a strategic component
  • mediation

    • the behaviour on Twitter is mediated through written and unwritten rules
      • e.g., there is a button to like, but no dislike
        • might systematically bias replies towards the negative

      • e.g., the common use of the like function as a bookmark
        • what would a study of Twitter likes be measuring?
17 / 202

Considerations — Constraints over Data Access

  • Twitter has restrictions on data access

    • how much data is available to download
    • how quickly, how frequently, how far dating back etc.



  • These restrictions also vary within API types, across different operations

    • e.g., collecting tweets in real time vs. collecting historical tweets
      • but also, collecting historical tweets from a specific user vs. tweets from any user
18 / 202

Considerations — Constraints over Data Redistribution

  • Twitter restricts content redistribution

    • e.g., only the tweet and/or user IDs can be made publicly available in datasets of over 50,000 observations
      • e.g., not the tweets themselves
      • and no more than 1.5M IDs
        • with some exceptions for academic research

    • see Twitter Developer terms for further details

  • Reproducibility of research based on Twitter data is limited in practice

    • i.e., reproducibility after publication, by others
    • technically, they can retrieve the same tweets with IDs
      • demanding for reproducers
      • may even be impossible
        • e.g., some tweets, or whole accounts, might be deleted before replication attempts
19 / 202

Considerations — Changes in the Twitter APIs

  • Twitter is currently switching to a new generation of APIs

    • replacing APIs v1 with v2
      • each with various types of APIs

    • the switch is not complete, outcome is not clear


  • Twitter might change the rules of the APIs game at any time, again

    • making the existing restrictions more or less strict
      • e.g., while you are in the middle of data collection

    • breaking your plans, code
20 / 202

Considerations — Changes in the Twitter APIs — Notes

  • Existing codes to collect tweets may or may not be affected, depending on

    • how the APIs v2 will look in the end
      • it is still a work in progress

    • whether and how the rtweet package* will adapt
      • it is currently going through a major revision


  • Not all changes are bad

    • among others, APIs v2 introduces the Academic Research access

      • 'to serve the unique needs and challenges of academic researchers'
        • ranging from master's students to professors

      • access to all public tweets
        • by up to 10M a month at a time

* This is the R package that we will use to collect tweets. More details are in Part 2.

22 / 202

Considerations — Law and Ethics

  • It is often impossible to get users' consent

    • i.e., for collecting and analysing their data on Twitter

    • Twitter itself has no problem with it, but others might disagree
      • e.g., your law makers, (funding and/or research) institution, subjects, conscience


  • Check the rules that apply to your case

    • rules and regulations in your country, at your institution


  • Reflect on whether using Twitter data for research is ethical

    • even where it is legal and allowed, it may not be moral
23 / 202

Considerations — Data Storage

Twitter data frequently requires

  • large amounts of digital storage space

    • Twitter data is typically big data
      • many tweets, up to 90 variables

    • e.g., a dataset of 1M tweets requires about 300MB
      • when stored in R data formats
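
A minimal sketch for checking these sizes in practice, assuming a data frame of tweets called df_tweets:

# size of the object in memory
format(object.size(df_tweets), units = "MB")

# saveRDS compresses by default, so the file on disk is usually smaller
saveRDS(df_tweets, "tweets.rds")
file.size("tweets.rds") / 1024^2 # size on disk, in MB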
24 / 202

Considerations — Language and Context

  • Some tools of text analysis are developed for a specific language and/or context

    • e.g., dictionaries for sentiment analysis

      • might be in English, for political texts, only
    • these may not be useful or valid for different languages and/or contexts


  • Some tools of text analysis are developed for general use

    • e.g., a dictionary for sentiments in everyday language

    • these may not be useful or valid for a specific context

      • e.g., political texts
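
As a minimal illustration of this point, the tidytext package ships with the Bing lexicon, a general-purpose dictionary of positive and negative words, in English only:

library(tidytext)

# a general-purpose sentiment dictionary, for everyday English
get_sentiments("bing")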
25 / 202

Part 2. Getting the Tools Ready

26 / 202

Workshop Slides — Access on Your Browser

  • Having the workshop slides* on your own machine might be helpful

    • flexibility to go back and forward on your own
    • ability to scroll across long codes on some slides


* These slides are produced in R, with the xaringan package (Xie, 2022).

27 / 202

Course Materials — Download from the Internet

Code -> Download ZIP


  • Unzip and rename the folder

    • unzip to a location that is not synced
      • e.g., perhaps to Documents, but not Dropbox
28 / 202

Course Materials — Overview

Materials have the following structure

twtr_workshop-materials
|
|- data
| |
| |- mps.csv
| |- status_ids.rds
| |- tweets.rds
|
|- exercises
| |
| |- solutions.R
| |- tweets.Rmd
| |- tweets_answers.Rmd
| |- users.Rmd
| |- users_answers.Rmd
29 / 202

Course Materials — Contents

  • data/mps.csv

    • a dataset on the members of parliament (MPs) in the British House of Commons, at the end of January 2021
    • it includes variables on electoral results as well as Twitter usernames


  • data/status_ids.rds

    • a dataset with a single variable: status_id
    • lists the status IDs of all tweets posted by the MPs listed in mps.csv, during January 2021


  • data/tweets.rds

    • similar to data/status_ids.rds, except that
      • the time period is limited to 15 to 31 January, reducing the number of observations to below 50,000 and allowing all variables to be posted online
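
A minimal sketch for reading these three files into R, assuming that your working directory is the workshop folder:

library(tidyverse)

mps <- read_csv("data/mps.csv") # MPs, with electoral results and usernames
status_ids <- read_rds("data/status_ids.rds") # status IDs of tweets from January 2021
tweets <- read_rds("data/tweets.rds") # tweets posted between 15 and 31 January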
30 / 202

Course Materials — Contents

  • exercises/solutions.R

    • an R script with solutions for exercises until the end of Part 4


  • exercises/tweets.Rmd

    • an R Markdown file with exercises for Part 6
    • the solutions to these exercises are in tweets_answers.Rmd


  • exercises/users.Rmd

    • an R Markdown file with exercises for Part 5
    • the solutions to these exercises are in users_answers.Rmd
31 / 202

R — Download from the Internet and Install

  • Programming language of this workshop

    • created for data analysis, extended for other purposes
      • e.g., accessing APIs
    • allows for all three steps in one environment
      • collecting, processing, and analysing Twitter data
    • an alternative: Python


  • Optional, if you have R already installed

    • consider updating your copy, if it is not up to date


32 / 202

RStudio — Download from the Internet and Install

  • Optional, but highly recommended

    • facilitates working with Twitter data in R


  • A popular integrated development environment (IDE) for R


Help -> Check for Updates

33 / 202

RStudio Project — Create from within RStudio

  • RStudio allows for dividing your work with R into separate projects

    • each project gets dedicated workspace, history, and source documents
    • this page has more information on why projects are recommended


  • Create a new RStudio project for the existing* workshop directory ...\twtr_workshop-materials from the RStudio menu:

File -> New Project -> Existing Directory -> Browse -> ...\twtr_workshop-materials -> Open

* Recall that we have downloaded this earlier from GitHub. Back to the relevant slide.

34 / 202

R Packages — Install from within RStudio*

install.packages(c("rtweet", "httpuv", "tidyverse", "tidytext"))

* You may already have a copy of one or more of these packages. In that case, I recommend updating by re-installing them now.

35 / 202

R Packages — Install from within RStudio

install.packages(c("rtweet", "httpuv", "tidyverse", "tidytext"))


  • httpuv (Cheng and Chang, 2022), for API authorization

    • alternative: using your own access tokens
      • necessitates applying for a developer account
      • has advantages that we will discuss later on
36 / 202

R Packages — Install from within RStudio

install.packages(c("rtweet", "httpuv", "tidyverse", "tidytext"))
  • tidyverse (Wickham, 2021), for various tasks

    • including data manipulation, visualisation
    • alternative: e.g., base R
37 / 202


Twitter — Authorisation

Authorisation to use Twitter APIs requires at least three steps*

1) open a user account on Twitter

  • a personal or an institutional (perhaps, for a research project) one
  • done once, takes minutes

2) with that user account, apply for a developer account

  • so that you are recognised as a developer, have access to the developer portal
  • done once per account, takes days to get approved manually

3) with that developer account, register a Twitter app

  • so that you have the keys and tokens for authorisation
  • repeated for every project, takes minutes

* There may be additional steps, such as registering for the Academic Research access.

38 / 202

Twitter — Authorisation — Notes

  • It is possible to interact with Twitter APIs without steps 2 and 3

    • rtweet has its own Twitter app — rstats2twitter — that anyone can use
      • anyone with a Twitter account, who authorises rstats2twitter via a pop-up browser


  • I recommend

    • following only step 1 (open an account) now, which
      • you might already have done
      • is otherwise automatic
      • allows us to use rstats2twitter and follow the workshop

    • leaving the remaining steps until after the workshop
      • to allow you to think and write your applications carefully
      • to get my feedback if you prefer to do so
39 / 202

Twitter — Open an Account

Sign up for Twitter at https://twitter.com/

  • a pre-condition for interacting with Twitter APIs

    • e.g., you must be authorized
      • even to use rtweet's app — rstats2twitter


  • helpful for getting to know what you study

    • e.g., the written and unwritten rules that mediate the behaviour on Twitter


  • with a strategic username

    • usernames are changeable, but nevertheless public
      • either choose an anonymous username (e.g., asdf029348)
      • or choose one carefully — they become a part of users' online presence
40 / 202

Twitter — Apply for a Developer Account

  • On developer.twitter.com, click Apply
    • at the top-right corner of the page*
    • follow the instructions on consecutive pages


  • Write a detailed case for your intended use of Twitter APIs
    • not just that you are attending this workshop
    • a quick application is likely to slow the review process
      • as Twitter often asks for more details via email
    • rejection is also a possibility


  • Carefully review Twitter's Developer Agreement and Policy
    • especially important if you will develop an app that
      • others will use
      • will write as well as read Twitter data

* It takes a few days for Twitter to review and, hopefully, approve your request. If you have applied before, you will see Developer Portal instead of Apply.

42 / 202

Twitter — Register an App

  • On developer.twitter.com/en/portal/projects-and-apps, click + Create App

    • follow the instructions on consecutive pages

      • you will need a unique name for your app
    • note that, once the app is registered, you are provided with keys and tokens

      • you will use these for authentication
        • if you choose not to authenticate through rtweet's own app, called rstats2twitter
        • more on this in Part 3
  • note that registering an app does not mean you have to create an actual app

    • e.g., an app for smartphones
43 / 202

Twitter — Keys and Tokens — Notes

  • Keys and tokens are personal

    • should not be shared with others
      • e.g., through replication files, when asking a question on Stack Overflow
  • Keys and tokens can be re-generated anytime

    • on your applications page, under the Keys and tokens tab
    • this requires updating your R script with the new tokens as well
  • Twitter allows for further, optional settings involving keys and tokens

    • relevant, mostly, for apps to do more than just collecting tweets
    • e.g., settings for 3-legged OAuth
      • like rstats2twitter, to allow other users to authenticate through a browser pop-up
    • not covered in this workshop
44 / 202

Twitter — Keys and Tokens — Notes — Definitions

  • Consumer key and Consumer secret

    • identifiers for an application
    • provide project authorization
      • identify the application
      • check whether the application has been granted access
    • like your application's username and password
  • Access token and Access token secret

    • identifiers for users of an application
      • this may or may not include anyone other than the owner
      • e.g., for rstats2twitter, there are many users
    • provide user authentication
      • identify the user
      • check whether the user should have access to a request
45 / 202

Data Collection — Authentication

There are two different methods of authentication

  • through rtweet's rstats2twitter app

    • the app makes requests on your behalf
    • you simply approve this, via a browser that pops up
    • nice and easy, but comes with some limitations


  • through your own app

    • you make requests on your behalf
    • through your own app that you register as a developer
    • takes a little effort, but comes with additional stability and functionality
      • removing the dependency on rstats2twitter, over which you have no control
      • getting rid of the pop-up windows
      • not only for collecting tweets, but also posting your own tweets
        • and reading and writing your own direct messages
46 / 202

Data Collection — Define Your Token

If you are using your own app to authenticate, create a token

  • using the create_token function
  • the app argument requires the name of your own app, as registered on developer.twitter.com
  • the other arguments are to be filled with the information from the Keys and tokens tab on the same website
tw_token <- create_token(
  app = "",
  consumer_key = "",
  consumer_secret = "",
  access_token = "",
  access_secret = ""
)
47 / 202

Data Collection — Define Your Token — Alternatives

  • You may wish to put your keys and tokens elsewhere

    • they are personal, just like a password
  • There are at least two alternatives

    • create a separate script, which you can then source at the top of your main script

keys_tokens.R

tw_token <- create_token(
  app = "",
  consumer_key = "",
  consumer_secret = "",
  access_token = "",
  access_secret = ""
)

data_collection.R

library(rtweet)
source("keys_tokens.R")
48 / 202

Data Collection — Define Your Token — Alternatives

  • You may wish to put your keys and tokens elsewhere

    • they are personal, just like a password
  • There are at least two alternatives

    • create a separate script, which you can then source at the top of your main script

    • store your keys and tokens in your .Renviron file, which can be created at the project level as well

.Renviron

TWITTER_APP=name_of_my_app
TWITTER_CONSUMER_KEY=akN...
TWITTER_CONSUMER_SECRET=HJK...
TWITTER_ACCESS_TOKEN=345...
TWITTER_ACCESS_SECRET=SDF...

data_collection.R

library(rtweet)
tw_token <- create_token(
  app = Sys.getenv("TWITTER_APP"),
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"),
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
  access_token = Sys.getenv("TWITTER_ACCESS_TOKEN"),
  access_secret = Sys.getenv("TWITTER_ACCESS_SECRET")
)
49 / 202

Other Resources*



* I recommend consulting these after, rather than during, the workshop.

50 / 202

Part 3. Data Collection

51 / 202

Data Collection — Overview — APIs

  • We will collect data through APIs

    • i.e., Twitter's Standard v1.1 APIs
    • provides more variables than available through browsers
    • comes with rules and restrictions
      • enforced through authentication


  • Collecting data through web scraping is also possible

    • e.g., with GetOldTweets3 — a Python library
      • scrapes, scrolls down, and scrapes again to collect all matching data

    • does not require, is not limited by, Twitter APIs
    • limited to what is available on browsers
    • may or may not be ethical and/or legal
52 / 202

Data Collection — Overview — APIs — Types

  • In general, there are two main types of APIs

    • REST and Streaming
    • applies to APIs elsewhere, not just at Twitter
    • functions, arguments, behaviour differ slightly


  • REST APIs are for single, one-off requests

    • e.g., search for tweets posted in the last 6 to 9 days
    • but also, post or delete tweets


  • Streaming APIs are for continuous requests

    • e.g., collect tweets as they are being posted
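
A minimal sketch of the contrast, using two rtweet functions that we will meet shortly; the query "rstats" is an arbitrary example:

# REST: a single, one-off request for recent tweets
search_tweets(q = "rstats", n = 100)

# streaming: a continuous request, collecting tweets as they are posted, for 30 seconds
stream_tweets(q = "rstats", timeout = 30)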
53 / 202

Data Collection — Overview — APIs — Types


  • Rules and restrictions differ from one type to another

    • as does the cost
    • some remove the restrictions on how much data we can access
    • restrictions on how quickly we can access data exist in all types


  • Rules and restrictions can also differ within one type

    • for different operations
      • e.g., for collecting historical vs. live data
54 / 202

Data Collection — Overview — Standard v1.1 APIs

  • We will collect data through Twitter's Standard v1.1 APIs

    • free of charge
    • thanks to rtweet's rstats2twitter app, can be used immediately
    • comes with the strictest of restrictions
      • e.g., searches tweets posted in the last 6 to 9 days


  • You can surpass these restrictions later on

    • academic researchers can apply for Academic Research access
    • others can purchase an alternative
    • the principles of data collection are likely to remain the same
55 / 202

Data Collection — Overview — APIs — Limitations

Our attempts to collect data will be limited for various reasons, including

  • the intended restrictions by Twitter
    • by the limitations of Standard v1.1
      • e.g., by rate limits
        • maximum number of requests
        • per app, type of request, time interval
  • any lack of tweets or users matching our search criteria
    • the stricter the criteria, the more likely this is to occur
  • connections timing out
    • depends on type of requests, and your internet connection
    • more likely for continuous searches
  • reasons unknown to humankind
    • sometimes things just happen
56 / 202

Data Collection — rtweet — Overview

  • A powerful R package for collecting Twitter data

    • created by Michael W. Kearney (University of Missouri)
    • used widely, replacing previous packages for this task
      • e.g., twitteR
    • last updated on CRAN two years ago
      • the package is currently being updated on GitHub
  • Comes with its own app, rstats2twitter
    • allows for collecting tweets without a developer account
    • offers the option of using your own keys and tokens, if/once you have them
57 / 202

Data Collection — rtweet — Basics

There are four main groups of functions to collect historical data, starting with

  • search_
    • such as search_tweets or search_users
    • 4 functions, for general use
      • 2 for standard APIs, 2 for premium APIs
  • lookup_
    • such as lookup_tweets or lookup_users
    • 5 functions, for expanding an existing variable into a dataset
      • e.g., starting with a list of IDs for tweets or users
  • get_
    • such as get_followers or get_friends
    • 11 functions, for specific tasks
  • lists_
    • such as lists_members or lists_statuses
    • 6 functions, for tasks related to Twitter lists specifically
58 / 202

Data Collection — rtweet — Basics

There is also one function to collect tweets in real time

  • stream_tweets

    • queries the streaming API
    • returns a small random sample of all tweets as they are posted
    • can be filtered by keywords, users, and/or locations
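
A minimal sketch of these filters; the keywords and the location are arbitrary examples:

# filter by one or more comma-separated keywords
stream_tweets(q = "rstats,python", timeout = 30)

# filter by location, here a bounding box covering the whole world
stream_tweets(q = lookup_coords("world"), timeout = 30)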


  • For other functions, see the package documentation

    • e.g., the functions starting with post_
      • allowing for posting your tweets, direct messages, from within R
59 / 202

Data Collection — Start Your Script

  • Check that you are in the right project

    • created in Part 2
    • indicated at the upper-right corner of RStudio window
  • Create a new R script from the RStudio menu

File -> New File -> R Script

  • Name and save your file
    • to avoid the Untitled123 problem
    • e.g., data_collection.R
  • Load the rtweet and other packages
    • no need to load the httpuv package; it is enough that it is installed
library(rtweet)
library(tidyverse)
library(tidytext)
60 / 202

search_

61 / 202

Data Collection — search_tweets

Collect tweets posted in the last 6 to 9 days

  • filter by search query, with the q argument

  • limited to 18,000 tweets, per 15 minutes, per token*

    • set the limit, higher or lower, with the n argument**
      • works best with multiples of 100

    • if set higher, wait automatically by setting the retryonratelimit argument to TRUE
search_tweets(q,
              n = 100,
              type = "recent",
              include_rts = TRUE,
              geocode = NULL,
              max_id = NULL,
              parse = TRUE,
              token = NULL,
              retryonratelimit = FALSE,
              verbose = TRUE,
              ...
)

* All limits are for the standard v1.1 APIs.

** This argument is common to many functions in the package. I recommend setting it to a small number, such as 200, for the exercises in this workshop. This will save computation time and avoid running into rate limits.

62 / 202

Data Collection — search_tweets

  • Collect the latest 100 tweets that
    • include the hashtag "publish"


  • Note that
    • by default, type = "recent", returning the latest tweets
      • other options are "popular" and "mixed"

    • by default, n = 100, returning 100 tweets
    • here we are relying on rtweet's rstats2twitter app
      • as, by default, token = NULL
search_tweets(q = "#publish")
63 / 202

Exercises

1) Collect the latest 30 tweets that

  • include the hashtag "AcademicTwitter"
  • and assign the resulting data frame to df_tweets
    • so that you can observe the results with ease
    • hence, I recommend doing the same for all API searches that follow

2) Observe how the rstats2twitter app works

  • when you call the function, pay attention to what happens on your R console and on your default browser
    • this will happen only once per R session

3) Take some time to explore the data frame

  • see which variables are in there, and how they are called
  • think about how you could use these variables for research
  • hint: use functions like View, str, names, tibble::glimpse

4) Conduct the same search on a browser

15:00
64 / 202

Data Collection — Notes

  • Twitter usernames, or handles, are stored under the variable screen_name

    • can be misleading, as users also have display names
  • Twitter allows users to change their usernames and display names

    • user IDs, however, do not change
      • user_id is a better variable for reproducible research
  • The date and time data are matched to Greenwich Mean Time

    • stored under the variable created_at
    • no matter where users actually are at the time of tweeting
  • You may wish to exclude retweets

    • depending on the research question and design
    • by setting include_rts = FALSE
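
A minimal sketch that applies these notes to collected data, assuming a data frame of tweets called df_tweets:

df_tweets %>%
  filter(!is_retweet) %>% # exclude retweets after the fact
  select(user_id, screen_name, created_at, text) # rely on IDs, note the GMT time stamps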
65 / 202

Data Collection — search_tweets

  • Collect the top 200 tweets that

    • include the word "publish"
search_tweets(q = "publish",
              n = 200,
              type = "popular")
66 / 202

Data Collection — search_tweets

  • Collect the top 200 tweets that

    • include the word "publish" and "perish", not necessarily in that order


  • Note that

    • space is treated as the boolean AND operator
search_tweets(q = "publish perish",
              n = 200,
              type = "popular")
67 / 202

Data Collection — search_tweets

  • Collect the top 200 tweets that

    • include the word "publish" or "perish"


  • Note that

    • the boolean OR operator must be specified, in capital letters
search_tweets(q = "publish OR perish",
              n = 200,
              type = "popular")
68 / 202

Data Collection — search_tweets

  • Collect the top 200 tweets that

    • include the exact phrase "publish or perish"


  • Note that

    • double quotation marks " need to be escaped with a backslash \
search_tweets(q = "\"publish or perish\"",
              n = 200,
              type = "popular")
69 / 202

Data Collection — search_tweets

  • Collect the top 200 tweets that

    • include "publish" but not "perish"


  • Note that

    • words can be negated with a hyphen -


search_tweets(q = "publish -perish",
              n = 200,
              type = "popular")
70 / 202

Data Collection — search_tweets

  • Collect the top 200 tweets that

    • include "publish", and
    • are otherwise written in German


search_tweets(q = "publish lang:de",
              n = 200,
              type = "popular")
71 / 202

Data Collection — search_tweets

  • Collect the top 200 tweets that

    • include "publish", and
    • are not in German


  • Note that

    • operators can be negated with a hyphen as well
search_tweets(q = "publish -lang:de",
              n = 200,
              type = "popular")
72 / 202

Data Collection — search_tweets — Notes

  • Some query parameters can be passed into the function as arguments as well
    • e.g., lang, filter


  • Note that
    • the two calls below return the same observations
    • there are slight differences in syntax, such as
      • lang:en as a parameter
      • lang = "en" as an argument
search_tweets(q = "publish lang:en filter:replies",
              n = 200,
              type = "mixed")

search_tweets(q = "publish",
              n = 200,
              type = "mixed",
              lang = "en",
              filter = "replies")
73 / 202

Data Collection — search_tweets — Notes

  • This function returns a data frame

    • as do many other functions in the package
    • because, by default, parse = TRUE


  • Under the hood, Twitter APIs return nested lists

    • nested lists can be tidied into a rectangular shape, but they are not tidy data as returned
    • rtweet does most of the data preparation for us
74 / 202

Exercises

5) Collect the latest 10 tweets that include

  • the phrase "publish or perish"
  • and the word "academia" but not the word "PhD"


6) Collect the most popular 50 tweets that

  • include the word "Switzerland"
  • excluding retweets
  • note that Twitter API may return fewer than 50 tweets
    • as there may not be that many popular tweets posted in the last 6 to 9 days


7) Collect the most recent 35,000 tweets that

  • include the word "PhD"
  • note that this is over the limit of 18,000
  • hint: pull the help file for the function to see which argument you must use to surpass the limit
20:00
75 / 202

Data Collection — search_users

  • Collect information on users

    • filter usernames, names, bios
    • by search query with the q argument
    • returns information from recently active users
    • limited to 1,000 users
search_users(q,
             n = 100,
             parse = TRUE,
             token = NULL,
             verbose = TRUE)
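
A minimal sketch of a call, with an arbitrary query:

# information on up to 100 recently active users matching "rstats"
search_users(q = "rstats", n = 100)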


76 / 202

Exercises

8) Collect information on 30 users that

  • are associated with the word "PhD", but not with the word "rstats"
  • read one of these users' bio on their homepage via a browser


9) Collect the latest 30 tweets that

  • include the word "PhD"
  • but not the word "rstats"


10) Take some time to explore the resulting data frames

  • how do they compare to each other?


11) Conduct one or more searches that interest you

  • for tweets and/or users
20:00
77 / 202

Data Collection — Notes — rate_limit

  • Check rate limits at any time

    • for all operations
    • for a specific operation
      • e.g., searching tweets with the search_tweets function
rate_limit(token = NULL,
           query = NULL,
           parse = TRUE)


  • Note that

    • these limits are for the number of requests, per user and/or per app
    • if no token is specified, the function uses the rstats2twitter app
    • rate limits decrease as you use them, increase again after a time threshold
    • Twitter Developer terms do not allow for multiple tokens to be used for the same project
78 / 202

Data Collection — Notes — rate_limit

Check your remaining rate limits, for all operations

rate_limit()
# A tibble: 261 x 7
   query                   limit remaining reset      reset_at            timestamp           app
   <chr>                   <int>     <int> <drtn>     <dttm>              <dttm>              <chr>
 1 lists/list                 15        15 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 2 lists/:id/tweets&GET      900       900 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 3 lists/:id/followers&GET   180       180 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 4 lists/memberships          75        75 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 5 lists/:id&DELETE          300       300 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 6 lists/subscriptions        15        15 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 7 lists/members             900       900 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 8 lists/:id&GET              75        75 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 9 lists/subscribers/show     15        15 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
10 lists/:id&PUT             300       300 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
# ... with 251 more rows
79 / 202

Data Collection — Notes — rate_limit

Check your remaining rate limits for the search_tweets function specifically

rate_limit(query = "search/tweets")
# A tibble: 1 x 7
  query         limit remaining reset         reset_at            timestamp           app
  <chr>         <int>     <int> <drtn>        <dttm>              <dttm>              <chr>
1 search/tweets   180       171 14.77129 mins 2022-03-06 10:42:12 2022-03-06 10:27:26 rstats2twitter
80 / 202

Data Collection — Notes — rate_limit

Single out the number of remaining requests for the search_tweets function specifically

rate_limit(query = "search/tweets")$remaining
## [1] 171


  • Note that

    • this is not the number of tweets you can search for
    • but rather the number of times that you can search for tweets
81 / 202

Exercises

12) Check all your remaining rate limits

  • have a look at the resulting data


13) Check your remaining limits for the search_tweets function

  • assign it to a name


14) Collect the most recent 50 tweets that

  • include the word "rstat"
  • and posted in English


15) Check your remaining limits for the search_tweets function again

  • how do they compare to the results from Exercise 13?
10:00
82 / 202

lookup_

83 / 202

Data Collection — lookup_tweets

  • Collect data on one or more tweets

    • whose status ID you already know
    • limited to 90,000 posts per 15 minutes
      • there is no retryonratelimit argument
lookup_tweets(statuses,
              parse = TRUE,
              token = NULL)


  • Note that

    • this function would be useful for replicating studies
    • status IDs are visible in the browser's address bar
      • when you view a specific tweet
84 / 202

Data Collection — lookup_tweets

Collect data on one or more status IDs

lookup_tweets(statuses = c("567053242429734913", "266031293945503744", "440322224407314432"))


Collect data on status IDs in a data frame

lookup_tweets(statuses = df$status_id)
85 / 202

Data Collection — lookup_users

  • Collect data on one or more users

    • whose user ID or username you already know
    • limited to 90,000 users per 15 minutes
      • there is no retryonratelimit argument


  • Note that

    • usernames can change
    • rely on user IDs where possible
lookup_users(users,
             parse = TRUE,
             token = NULL)
86 / 202

Data Collection — lookup_users

Collect data on one or more users

lookup_users(users = c("drob", "hadleywickham", "JennyBryan"))


Collect data on usernames stored in a data frame

lookup_users(users = df$screen_name)
87 / 202

Data Collection — lookup_friendships

  • Collect data on friendship status of two users

    • e.g., whether they follow each other
    • whose user ID or username you already know
    • limited to 90,000 users per 15 minutes
      • there is no retryonratelimit argument


  • Note that

    • usernames can change
    • rely on user IDs where possible
lookup_friendships(source,
                   target,
                   parse = TRUE,
                   token = NULL)
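
A minimal sketch of a call, with two arbitrary usernames:

# e.g., do these two users follow each other?
lookup_friendships(source = "hadleywickham", target = "JennyBryan")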
88 / 202

Exercises

16) Find a status ID through your browser and look it up in R

  • they appear in search bar when viewing a single tweet
  • unlike user IDs


17) Look up a subset of tweets whose IDs are stored in status_ids.rds


18) Look up a subset of users whose usernames are stored in mps.csv


19) Check the friendship status of two MPs in the dataset

  • hint: not all MPs are on Twitter, but most are
    • e.g., check if the 10th and 20th observations follow each other
15:00
89 / 202

get_

90 / 202

Data Collection — get_timeline

  • Collect the latest posts from one or more users

    • specified by usernames or user IDs, with the user argument
    • limited to 3,200 tweets per user-timeline
      • there is no retryonratelimit argument
    • returns the most recent only, if there is more
get_timeline(user,
             n = 100,
             max_id = NULL,
             home = FALSE,
             parse = TRUE,
             check = TRUE,
             token = NULL,
             ...
)
91 / 202

Data Collection — get_timeline

Collect the most recent 200 tweets by David Robinson

  • i.e., tweets from the user-timeline of one user
get_timeline(user = "drob",
             n = 200)
92 / 202

Data Collection — get_timeline

  • Collect the most recent posts by David Robinson and Hadley Wickham

    • tweets from the user-timeline of multiple users


  • Note that

    • this results in a data frame of 400 observations
    • 200 from each specified user
    • with an increasing number of users, you are likely to run into rate limits
get_timeline(user = c("drob",
                      "hadleywickham"),
             n = 200)
93 / 202

Data Collection — get_timeline — Home Timeline

  • The package documentation suggests that get_timeline can also retrieve home-timelines

    • i.e., the tweets that appear on a given user's home, as posted by accounts followed by that user
      • if the home argument is set to TRUE
  • This does not seem to be true

    • this code returns your home-timeline, not Wickham's
      • from the last 6 to 9 days
    • the user argument is ignored when home = TRUE
    • but the user argument cannot be missing
get_timeline(user = "hadleywickham",
             n = 200,
             home = TRUE)
94 / 202

Data Collection — Notes — retryonratelimit

  • The retryonratelimit argument is not available for all functions in the package

    • e.g., search_users
    • this does not mean you will not run into limits


  • You can create your own safety net

    • e.g., with loops, where the system sleeps between iterations
      • until a specific rate limit resets
95 / 202

Data Collection β€” Notes β€” retryonratelimit β€” Iteration

datalist <- list() # create an empty list, to be filled later
for (i in seq_along(df_users$screen_name)) { # for one user, in the data frame df_users, at a time
  while (rate_limit(query = "application/rate_limit_status", token = tw_token)$remaining <= 2 ||
         rate_limit(query = "get_timeline", token = tw_token)$remaining <= 20) { # if the rate limit is nearly exhausted
    wait <- as.numeric(rate_limit(query = "get_timeline", token = tw_token)$reset) + 0.1 # minutes until the limit resets
    Sys.sleep(wait * 60) # wait a little, in seconds
  }
  dat <- get_timeline(df_users$screen_name[i], n = 3200, # collect the tweets
                      token = tw_token)
  datalist[[i]] <- dat # fill the list with data, for one user at a time
}
df_tweets <- as.data.frame(do.call(rbind, datalist)) # put all data in one data frame
96 / 202
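
If precise rate-limit checks are not needed, a simpler (if slower) alternative is a fixed pause between users. This is a sketch, assuming the same df_users and tw_token objects as above:

datalist <- list()
for (i in seq_along(df_users$screen_name)) {
  datalist[[i]] <- get_timeline(df_users$screen_name[i], n = 3200, token = tw_token)
  Sys.sleep(60) # a conservative, fixed pause between users
}
df_tweets <- do.call(rbind, datalist)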

Exercises

20) Collect the most recent tweets posted by three users

  • 10 from the first user, 20 from the second, and 30 from the third
  • hint: see the function documentation on how to vary the n argument by user


21) Collect as many tweets as possible from your own home-timeline


22) Collect data from timelines of the first five MPs in mps.csv

  • collect as many tweets as possible
10:00
97 / 202

Data Collection β€” get_followers

  • Collect a list of the followers of one user

    • returns a single column of user IDs, not usernames
    • limited to 75,000 followers per 15 minutes
      • use retryonratelimit = TRUE to surpass the limit


  • Note that

    • this function does not accept multiple users
    • it can be combined with lookup_users if usernames are needed
get_followers(user,
              n = 5000,
              page = "-1",
              retryonratelimit = FALSE,
              parse = TRUE,
              verbose = TRUE,
              token = NULL
)
98 / 202

Data Collection β€” get_followers

Collect a list of Hadley Wickham's followers on Twitter

get_followers(user = "hadleywickham",
              n = 10000,
              retryonratelimit = TRUE)
99 / 202
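
As noted above, get_followers can be combined with lookup_users when usernames are needed. A minimal sketch, assuming the user_id column name that rtweet returns:

followers <- get_followers(user = "hadleywickham", n = 1000)
lookup_users(users = followers$user_id)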

Data Collection β€” get_friends

Collect a list of users followed by one or more users

  • this returns a single column of user IDs, not usernames
  • limited to 15 users per 15 minutes
    • use retryonratelimit = TRUE to surpass the limit

  • limited also to 5,000 friends per user
    • here retryonratelimit = TRUE does not help

  • use the page argument instead to surpass the limit
    • learn the correct value with the next_cursor function, as sketched below
get_friends(users,
            n = 5000,
            retryonratelimit = FALSE,
            page = "-1",
            parse = TRUE,
            verbose = TRUE,
            token = NULL
)
100 / 202
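
A sketch of paging past the 5,000-friends limit with next_cursor, as mentioned above; the username is only an example:

first_page <- get_friends(users = "hadleywickham", n = 5000)
next_page <- get_friends(users = "hadleywickham", n = 5000,
                         page = next_cursor(first_page))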

Data Collection β€” get_friends

Collect a list of users followed by Jenny Bryan and Hadley Wickham on Twitter

  • with 20 friends from each
get_friends(users = c("hadleywickham", "JennyBryan"),
            n = 20)
101 / 202

Exercises

23) Collect a list of accounts following UniversitΓ€t Luzern

  • search for the University's username with search_users
  • collect the followers with get_followers


24) Collect a list of accounts that UniversitΓ€t Luzern follows

  • and find out more about these accounts with lookup_users


25) Check your rate limits

  • see the ones that are getting low
07:30
102 / 202

Data Collection β€” get_favorites

Collect tweets liked by one or more users

  • by usernames or user IDs, with the user argument
  • limited to 3,000 likes
    • there is no retryonratelimit argument
  • returns the most recent only, if there are more
get_favorites(user,
              n = 200,
              since_id = NULL,
              max_id = NULL,
              parse = TRUE,
              token = NULL
)
103 / 202

Data Collection β€” get_favorites

Collect a list of tweets liked by Jenny Bryan

get_favorites(user = "JennyBryan")
104 / 202

Data Collection β€” get_retweets

  • Collect information on the retweets of one tweet

    • using the status_id argument
    • available in the data frames returned by many functions in the package
      • e.g., get_timeline
      • also available on browsers

    • limited to 100 retweets
      • even if there might be more
get_retweets(status_id,
             n = 100,
             parse = TRUE,
             token = NULL,
             ...
)
105 / 202

Data Collection β€” get_retweets

get_retweets(status_id = "1354143047324299264")
106 / 202

Exercises

26) Collect a list of favorites by three users

  • compare and contrast the resulting data with the same information available on a browser


27) Collect a list of accounts retweeting a tweet of yours

  • compare and contrast the resulting data with what you can see on a browser
07:30
107 / 202
Data Collection — get_trends

  • Collect information on Twitter trends

    • by town or country, specified with
      • the woeid argument,* or
      • the lat and lng arguments


  • Note that

    • not all locations have trending data
    • use the trends_available function to check availability
      • with no argument
get_trends(woeid = 1,
           lat = NULL,
           lng = NULL,
           exclude_hashtags = FALSE,
           token = NULL,
           parse = TRUE
)

* It stands for "where on earth identifier", which is 44418 for London. Google for more!

108 / 202
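
A quick way to find a woeid (not on the original slides) is to filter the data frame that trends_available returns; the name, country, and woeid column names are assumptions based on rtweet 0.7.0:

trends_available() %>%
  filter(country == "United Kingdom") %>%
  select(name, woeid)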

Data Collection — get_trends

Collect the trends data for London

  • using the woeid argument
get_trends(woeid = 44418)


Collect the same trends data for London

  • using the lat and lng arguments instead
get_trends(lat = "51.50", lng = "-0.12")
109 / 202

Exercises

28) Collect a list of places where the trends data is available

  • hint: use the trends_available function


29) Collect the lists of trends for two locations

  • compare and contrast the resulting data for two locations


30) Collect the list of trends for your location

  • compare and contrast the resulting data with what you see on your browser
07:30
110 / 202

lists_

111 / 202

Data Collection β€” lists_memberships

  • Collect data on lists where one or more users are listed

    • i.e., the lists where a user appears
    • limited to 200 lists
lists_memberships(user = NULL,
                  n = 200,
                  cursor = "-1",
                  filter_to_owned_lists = FALSE,
                  token = NULL,
                  parse = TRUE,
                  previous_cursor = NULL
)
112 / 202

Data Collection β€” lists_memberships

Collect data on lists where Jenny Bryan is listed

lists_memberships(user = "JennyBryan")


Collect data on lists where Jenny Bryan or Hadley Wickham is listed

lists_memberships(user = c("JennyBryan", "hadleywickham"))
113 / 202

Data Collection β€” lists_members

  • Collect data on users listed in one list

    • specify the list with the list_id argument
      • e.g., with data from lists_memberships
    • or the owner_user and slug arguments together
    • limited to 5,000 members


lists_members(list_id = NULL,
              slug = NULL,
              owner_user = NULL,
              n = 5000,
              cursor = "-1",
              token = NULL,
              parse = TRUE,
              ...
)
114 / 202

Data Collection β€” lists_members

Collect data on the list of MPs in the House of Commons

  • using the list_id argument
lists_members(list_id = "217199644")
115 / 202

Data Collection β€” lists_members

Collect data on the list of MPs in the House of Commons

  • using the list_id argument
lists_members(list_id = "217199644")


Collect the same data, with different arguments

  • using the owner_user and slug arguments
lists_members(owner_user = "TwitterGov", slug = "UK-MPs")
115 / 202

Data Collection β€” lists_statuses

Collect tweets from the timeline of a list

  • i.e., tweets posted by those listed on a given list
  • specify the list with the list_id argument
  • or the owner_user and slug arguments together
lists_statuses(list_id = NULL,
               slug = NULL,
               owner_user = NULL,
               since_id = NULL,
               max_id = NULL,
               n = 200,
               include_rts = TRUE,
               parse = TRUE,
               token = NULL
)
116 / 202

Data Collection β€” lists_statuses

Collect tweets posted by the members of the UK MPs list

  • using the list_id argument
lists_statuses(list_id = "217199644")
117 / 202

Data Collection β€” lists_statuses

Collect tweets posted by the members of the UK MPs list

  • using the list_id argument
lists_statuses(list_id = "217199644")


Collect the same data, with different arguments

  • using the owner_user and slug arguments
lists_statuses(owner_user = "TwitterGov", slug = "UK-MPs")
117 / 202

Data Collection β€” lists_subscribers

Collect data on users subscribed to a given list

  • i.e., users who are following a list
  • specify the list with the list_id argument
  • or with the owner_user and slug arguments
  • limited to 5,000 users
lists_subscribers(list_id = NULL,
                  slug = NULL,
                  owner_user = NULL,
                  n = 20,
                  cursor = "-1",
                  parse = TRUE,
                  token = NULL
)
118 / 202

Data Collection β€” lists_subscribers

Collect data on users subscribed to the UKMPs list

  • using the list_id argument
lists_subscribers(list_id = "1405362")
119 / 202

Data Collection β€” lists_subscribers

Collect data on users subscribed to the UKMPs list

  • using the list_id argument
lists_subscribers(list_id = "1405362")


Collect the same data, with different arguments

  • using the owner_user and slug arguments
lists_subscribers(owner_user = "TwitterGov", slug = "UK-MPs")
119 / 202

Data Collection β€” lists_subscriptions

Collect data on the lists a user is subscribed to

  • specify the user with the user argument
    • takes a user ID or username
  • limited to 1,000 subscriptions
lists_subscriptions(user,
                    n = 20,
                    cursor = "-1",
                    parse = TRUE,
                    token = NULL)
120 / 202

Data Collection β€” lists_subscriptions

Collect data on the lists that TwitterGov is subscribed to

lists_subscriptions(user = "TwitterGov")
121 / 202

Exercises

31) Collect data on lists where Hadley Wickham is listed

  • hint: Wickham's username is "hadleywickham"


32) For one of these lists, see who else is listed with Hadley Wickham

  • compare and contrast this data with what you can see on a browser


33) Collect the latest posts from that list


34) Collect data on users subscribed to that list


35) For one of these users, see if they are subscribed to any other lists

10:00
122 / 202

stream_tweets

123 / 202

Data Collection β€” stream_tweets

  • Collect tweets as they are posted, in real time

    • about 1% of all new public tweets, randomly chosen
    • set the length of the search with the timeout argument
  • The search can be limited with the q argument

    • up to 400 keywords
    • up to 5,000 user IDs or usernames
    • location coordinates of geographical boxes
      • not two, but four coordinates
  • Note that

    • this function uses the streaming API
    • unlike any other function covered so far
stream_tweets(q = "",
              timeout = 30,
              parse = TRUE,
              token = NULL,
              file_name = NULL,
              verbose = TRUE,
              ...
)
124 / 202
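
For longer streams, one possible pattern (a sketch, not from the original slides) is to write the raw stream to disk with file_name and parse = FALSE, and read it afterwards with rtweet's parse_stream function:

stream_tweets(q = "",
              timeout = 60,
              parse = FALSE,
              file_name = "stream.json") # saved in the working directory
df_stream <- parse_stream("stream.json")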

Data Collection β€” stream_tweets

  • Collect a random sample of tweets being sent

    • continuously


  • Note that

    • the timeout argument can be set to infinity
    • you are likely to run into connection problems at some point
stream_tweets(q = "",
              timeout = Inf)
125 / 202

Data Collection β€” stream_tweets

  • Collect a random sample of tweets being sent

    • for 30 seconds


  • Note that

    • timeout values are otherwise in seconds
stream_tweets(q = "",
              timeout = 30)
126 / 202

Data Collection β€” stream_tweets

  • Collect a random sample of tweets being sent

    • for 30 seconds
    • filtered by a search query


  • Note that

    • q accepts a comma separated character string
stream_tweets(q = "switzerland, schweiz,
suisse, svizzera",
timeout = 30)
127 / 202

Data Collection β€” stream_tweets

  • Collect a random sample of tweets being sent

    • for 30 seconds
    • filtered by usernames


  • Note that

    • q accepts a character vector of usernames
stream_tweets(q = c("UniLuzern", "hslu",
"phluzern"),
timeout = 30)
128 / 202

Data Collection β€” stream_tweets

  • Collect a random sample of tweets being sent

    • for 30 seconds
    • filtered by coordinates
stream_tweets(q = c(6.02, 45.77, 10.44, 47.83),
              timeout = 30)
129 / 202

Exercises

36) Stream for all tweets, for 30 seconds

  • observe the outcome, and compare it with the outcomes from the exercises below


37) Further limit your stream by a popular keyword

  • e.g., "and"
  • stream for 30 more seconds


38) Further limit your stream to a less popular word

  • e.g., "PhD"
  • stream for 30 more seconds


39) Stream for a word or words that interest you

10:00
130 / 202

Part 4. Data Preparation

131 / 202

Data Preparation — Overview

  • The rtweet package does a very good job with data preparation to start with

    • returns data frames, with mostly tidy data
    • although Twitter APIs return nested lists
    • some variables are still lists
      • e.g., hashtags
132 / 202

Data Preparation — Overview

  • The rtweet package does a very good job with data preparation to start with

    • returns data frames, with mostly tidy data
    • although Twitter APIs return nested lists
    • some variables are still lists
      • e.g., hashtags


  • Further data preparation depends on your research project

    • most importantly, on whether you will work with texts or not
    • we will cover some common preparation steps
132 / 202

Data Preparation — Overview — Strings

  • Most researchers would be interested in textual Twitter data

    • tweets as a whole, but also specifically hashtags etc.
133 / 202

Data Preparation — Overview — Strings

  • Most researchers would be interested in textual Twitter data

    • tweets as a whole, but also specifically hashtags etc.


  • There are many components of tweets as texts

    • e.g., mentions, hashtags, emojis, links etc.
    • but also punctuation, white spaces, upper case letters etc.
    • some of these may need to be taken out before analysis
133 / 202

Data Preparation — Overview — Strings

  • Most researchers would be interested in textual Twitter data

    • tweets as a whole, but also specifically hashtags etc.


  • There are many components of tweets as texts

    • e.g., mentions, hashtags, emojis, links etc.
    • but also punctuation, white spaces, upper case letters etc.
    • some of these may need to be taken out before analysis


  • I use the stringr package (Wickham, 2019) for string operations

    • part of the tidyverse family
    • you might have another favourite already
      • no need to change as long as it does the job
133 / 202

Data Preparation — Overview — Numbers

  • There is more to Twitter data than just tweets themselves

    • e.g., the number of followers, likes etc.
134 / 202

Data Preparation — Overview — Numbers

  • There is more to Twitter data than just tweets themselves

    • e.g., the number of followers, likes etc.


134 / 202

Data Preparation — Remove Mentions

tweet <- "These from @handle1 are #socool. πŸ‘ A #mustsee, @handle2!
πŸ‘‰ https://t.co/aq7MJJ2"
str_remove_all(string = tweet, pattern = "[@][\\w_-]+")

[1] "This from are #socool. πŸ‘ A #mustsee, ! πŸ‘‰ https://t.co/aq7MJJ2"


Note that

  • this removes all occurrences
  • there is also the str_remove function, which removes the first occurrence only
135 / 202
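
For comparison, an added example reusing the tweet object above: str_remove takes out only the first mention.

str_remove(string = tweet, pattern = "[@][\\w_-]+")

[1] "These from are #socool. 👍 A #mustsee, @handle2! 👉 https://t.co/aq7MJJ2"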

Data Preparation — Remove Hashtags

tweet <- "These from @handle1 are #socool. πŸ‘ A #mustsee, @handle2!
πŸ‘‰ https://t.co/aq7MJJ2"
str_remove_all(string = tweet, pattern = "[#][\\w_-]+")

[1] "These from @handle1 are . πŸ‘ A , @handle2! πŸ‘‰ https://t.co/aq7MJJ2"

136 / 202

Data Preparation — Exercises — Notes

  • The exercises in this part are best followed by

    • using tweets.rds or a similar existing dataset
    • saving a new variable at every step of preparation
    • observing the newly created variables
      • by selecting only the old and new variables
      • to confirm whether the code works as intended


  • The mutate and select functions, from the dplyr package, can be helpful, as follows
df_tweets <- read_rds("data/tweets.rds")
df_tweets %>%
  mutate(no_mentions = str_remove_all(string = text, pattern = "[@][\\w_-]+")) %>%
  select(text, no_mentions) %>%
  View()
137 / 202

Exercises

40) Create a new variable without mentions


41) Create a new variable without hashtags

05:00
138 / 202
tweet <- "These from @handle1 are #socool. πŸ‘ A #mustsee, @handle2!
πŸ‘‰ https://t.co/aq7MJJ2"
str_remove_all(string = tweet, pattern = "http\\S+\\s*")

[1] "These from @handle1 are. πŸ‘ A, @handle2! πŸ‘‰ "

139 / 202

Data Preparation — Remove Emojis

tweet <- "These from @handle1 are #socool. πŸ‘ A #mustsee, @handle2!
πŸ‘‰ https://t.co/aq7MJJ2"
iconv(x = tweet, from = "latin1", to = "ASCII", sub = "")

[1] "These from @handle1 are #socool. A #mustsee, @handle2! https://t.co/aq7MJJ2"

140 / 202
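
One possible way to chain the steps above, e.g. for the combined-cleaning exercise that follows. This is a sketch, assuming stringr and the pipe are loaded:

tweet %>%
  str_remove_all(pattern = "[@][\\w_-]+") %>%    # mentions
  str_remove_all(pattern = "[#][\\w_-]+") %>%    # hashtags
  str_remove_all(pattern = "http\\S+\\s*") %>%   # links
  iconv(from = "latin1", to = "ASCII", sub = "") # emojis, as above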

Exercises

42) Create a new variable without links


43) Create a new variable without emojis


44) Create a new variable without:

  • mentions, hashtags, links, and emojis
10:00
141 / 202

Data Preparation — Remove Punctuation

tweet <- "These from @handle1 are #socool. πŸ‘ A #mustsee, @handle2!
πŸ‘‰ https://t.co/aq7MJJ2"
str_remove_all(string = tweet, pattern = "[[:punct:]]")

[1] "This from are socool πŸ‘ A mustsee handle2 πŸ‘‰ httpstcoaq7MJJ2"


Note that

  • this removed all punctuation, including those in links
142 / 202

Data Preparation — Remove Punctuation

tweet <- "This is a sentence.There is no space before this sentence."
str_remove_all(string = tweet, pattern = "[[:punct:]]")

[1] "This is a sentenceThere is no space before this sentence"


Note that

  • if tweets are typed with no spaces after punctuation, this might lead to merged pieces of text
    • alternatively, try the str_replace_all function to replace punctuation with space
143 / 202

Data Preparation — Replace Punctuation

tweet <- "This is a sentence.There is no space before this sentence."
str_replace_all(string = tweet, pattern = "[[:punct:]]", replacement = " ")

[1] "This is a sentence There is no space before this sentence "

144 / 202

Data Preparation — Remove Repeated Whitespace

tweet <- "There are too many spaces after this sentence. This is a new sentence."
str_squish(string = tweet)

[1] "There are too many spaces after this sentence. This is a new sentence."


Note that

  • white spaces can be introduced not only by users on Twitter, but also by us, while cleaning the data
    • e.g., removing and/or replacing operations above
    • hence, this function might be useful after other operations
145 / 202

Data Preparation — Change Case

tweet <- "lower case. Sentence case. Title Case. UPPER CASE."
str_to_lower(string = tweet)

[1] "lower case. sentence case. title case. upper case."

Note that

  • there are other functions in this family, including
    • str_to_sentence, str_to_title, str_to_upper
146 / 202

Exercises

45) Remove punctuation


46) Remove repeated whitespace


47) Change case to lower case

10:00
147 / 202

Data Preparation — Change Unit of Observation

Research designs might require changing the unit of observation

  • aggregation
    • e.g., at the level of users, locations, hashtags etc.
    • summarise with dplyr
  • dis-aggregation
    • e.g., to the level of words
    • tokenise with tidytext
148 / 202

Data Preparation — Change Unit of Observation — Aggregation

Aggregate at the level of users

  • the number of tweets per user
# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # group by users for aggregation
  group_by(user_id) %>%
  # create summary statistics for variables of interest
  summarise(sum_tweets = n())
149 / 202

Data Preparation — Change Unit of Observation — Aggregation

What is aggregated, and at which level, depends on your research design, such as

  • aggregating the tweets into a single text
  • at the level of users, by source
# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # group by users for aggregation
  group_by(user_id, source) %>%
  # create summary statistics for variables of interest
  summarise(merged_tweets = paste0(text, collapse = ". "))
150 / 202

Data Preparation — Change Unit of Observation — Dis-aggregation

Disaggregate the tweets, by splitting them into smaller units

  • also called tokenisation

Note that

  • by default sep = "[^[:alnum:].]+", which works well for separating tweets into words
    • change this argument with a regular expression of your choice
  • this creates a tidy dataset, where each observation is a word
    • all other tweet-level variables are repeated for each observation
# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # split the variable text
  separate_rows(text)
151 / 202

Data Preparation — Change Unit of Observation — Dis-aggregation

The tidytext package has a function that works better for tokenising tweets

  • with token = "tweets", it dis-aggregates text into words
    • except that it respects usernames, hashtags, and URLs
# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # split the variable text, create a new variable called da_tweets
  unnest_tokens(output = da_tweets, input = text, token = "tweets")
152 / 202

Data Preparation — Change Unit of Observation — Dis-aggregation

Tokenise variables to levels other than words

  • e.g., characters, words (the default), sentences, lines
# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # split the variable text into sentences, create a new variable called da_tweets
  unnest_tokens(output = da_tweets, input = text, token = "sentences")
153 / 202

Data Preparation — Change Unit of Observation — Dis-aggregation

Tokenise variables other than tweets

  • recall that rtweet stores multiple hashtags, mentions etc. as lists
# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # unlist the lists of hashtags to create strings
  group_by(status_id) %>%
  mutate(tidy_hashtags = str_c(unlist(hashtags), collapse = " ")) %>%
  # split the string, create a new variable called da_hashtags
  unnest_tokens(output = da_hashtags, input = tidy_hashtags, token = "words")
154 / 202

Data Preparation — Remove Stop Words

Remove the common, uninformative words

  • e.g., the, a, i

Note that

  • this operation requires a tokenised-to-word variable
  • stop words for English are stored in the stop_words dataset in the tidytext package
  • lists of words for other languages are available elsewhere, including
    • the stopwordslangs dataset from the rtweet package
    • the stopwords function from the tm package
      • e.g., use tm::stopwords("german") for German
# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # split the variable text, create a new variable called da_tweets
  unnest_tokens(output = da_tweets, input = text, token = "tweets") %>%
  # remove rows that match any of the stop words as stored in the stop_words dataset
  anti_join(stop_words, by = c("da_tweets" = "word"))
155 / 202

Exercises

48) Aggregate text to a higher level

  • e.g., if you are using tweets.rds, to the MP level, and add at least two numerical variables
    • e.g., average number of likes


49) Dis-aggregate text to a lower level

  • e.g., to words


50) Dis-aggregate hashtags

  • as one tweet can include multiple hashtags


51) Remove stop words

15:00
156 / 202

Part 5. Data Analysis: Users

157 / 202

Analysing Users β€” Overview

  • Twitter analysis might focus on users

    • e.g., members of parliament
    • as opposed to their tweets on Twitter
      • not always mutually exclusive

    • might be supplemented with non-Twitter data
      • e.g., data/mps.csv
158 / 202

Analysing Users β€” Overview

  • Twitter analysis might focus on users

    • e.g., members of parliament
    • as opposed to their tweets on Twitter
      • not always mutually exclusive

    • might be supplemented with non-Twitter data
      • e.g., data/mps.csv


  • There are at least two types of user-based analysis

    • count things, and describe or correlate
      • e.g., who has the most followers
        • are female MPs more or less likely to have a large number of followers?

    • network analysis
      • e.g., who retweets whom, how often?
158 / 202

Analysing Users β€” Correlates of being on Twitter

159 / 202

Analysing Users β€” Who has the most followers?

160 / 202

Exercises

On users.Rmd, complete the following exercises

52) Correlates of being on Twitter

53) Who has the most followers?

20:00
161 / 202

Analysing Users β€” Correlates of having more followers

162 / 202

Analysing Users β€” Who tweets the most often?

163 / 202

Exercises

On users.Rmd, complete the following exercises

54) Correlates of having more followers

55) Who tweets the most often?

20:00
164 / 202

Analysing Users β€” Correlates of tweeting more often

165 / 202

Analysing Users β€” Who do they talk to?

166 / 202

Exercises

On users.Rmd, complete the following exercises

56) Correlates of tweeting more often

57) Who do they talk to?

20:00
167 / 202

Network Analysis

168 / 202

Analysing Users β€” Network Analysis β€” Overview

  • Twitter data is suitable for network analysis

    • a social network
    • studying the relations between users


  • There are at least five networks

    • networks of followers
      • directed, not recurring
      • e.g., who is following whom

    • networks of retweeters, quoters, repliers, and likers
      • directed, recurring
      • e.g., who is replying to whom, how often
169 / 202

Analysing Users β€” Network Analysis β€” Basics

  • Networks are composed of nodes and edges

    • e.g., who (a node) replies (an edge) to whom (another node), how often (the strength of the edge)
170 / 202

Analysing Users β€” Network Analysis β€” Basics

  • Networks are composed of nodes and edges

    • e.g., who (a node) replies (an edge) to whom (another node), how often (the strength of the edge)


  • The nodes and edges are often kept separate for analysis

    • e.g., in separate datasets or data frames
    • nodes are given an ID number
170 / 202

Analysing Users β€” Network Analysis β€” Basics

  • Networks are composed of nodes and edges

    • e.g., who (a node) replies (an edge) to whom (another node), how often (the strength of the edge)


  • The nodes and edges are often kept separate for analysis

    • e.g., in separate datasets or data frames
    • nodes are given an ID number


  • We will use two packages for network analysis

    • tidygraph for data manipulation
    • ggraph for visualisation
170 / 202

Analysing Users β€” Network Analysis β€” tidygraph

read_rds("data/tweets.rds") %>%
  filter(is_retweet == TRUE) %>%
  group_by(screen_name, retweet_screen_name) %>%
  summarise(rts = n()) %>%
  head()
# A tibble: 6 x 3
# Groups:   screen_name [1]
  screen_name  retweet_screen_name   rts
  <chr>        <chr>               <int>
1 _OliviaBlake AlexDaviesJones         1
2 _OliviaBlake CommonsPAC              2
3 _OliviaBlake DanJarvisMP             3
4 _OliviaBlake DrRosena                1
5 _OliviaBlake EmmaHardyMP             1
6 _OliviaBlake FriendsLoxley           2
171 / 202

Analysing Users β€” Network Analysis β€” tidygraph

Use the as_tbl_graph function to transform data frames

read_rds("data/tweets.rds") %>%
  filter(is_retweet == TRUE) %>%
  group_by(screen_name, retweet_screen_name) %>%
  summarise(rts = n()) %>%
  as_tbl_graph()
# A tbl_graph: 7377 nodes and 16131 edges
#
# A directed multigraph with 5 components
#
# Node Data: 7,377 x 1 (active)
  name
  <chr>
1 _OliviaBlake
2 _RobbieMoore
3 AaronBell4NUL
4 ab4scambs
5 abenaopp
6 ABridgen
# ... with 7,371 more rows
#
# Edge Data: 16,131 x 3
   from    to   rts
  <int> <int> <int>
1     1    19     1
2     1   550     2
3     1   119     3
# ... with 16,128 more rows
172 / 202

Analysing Users β€” Network Analysis β€” tidygraph

Use the activate function to manipulate the nodes or edges

read_rds("data/tweets.rds") %>%
  filter(is_retweet == TRUE) %>%
  group_by(screen_name, retweet_screen_name) %>%
  summarise(rts = n()) %>%
  as_tbl_graph() %>%
  activate(edges) %>%
  mutate(multi_rts = if_else(rts > 1, 1, 0))
# A tbl_graph: 7377 nodes and 16131 edges
#
# A directed multigraph with 5 components
#
# Edge Data: 16,131 x 4 (active)
   from    to   rts multi_rts
  <int> <int> <int>     <dbl>
1     1    19     1         0
2     1   550     2         1
3     1   119     3         1
4     1   154     1         0
5     1   167     1         0
6     1   551     2         1
# ... with 16,125 more rows
#
# Node Data: 7,377 x 1
  name
  <chr>
1 _OliviaBlake
2 _RobbieMoore
3 AaronBell4NUL
# ... with 7,374 more rows

173 / 202

Analysing Users β€” Network Analysis β€” ggraph

Once the nodes and edges are ready, use the ggraph package to visualise

  • an extension of ggplot2
  • many verbs are intuitively similar
ggraph(rt_network) +
  geom_edge_link() +
  geom_node_point(aes(color = party)) +
  theme_graph()
174 / 202

Analysing Users β€” Reply networks

175 / 202

Analysing Users β€” Retweet networks

176 / 202

Exercises

On users.Rmd, complete the following exercises

58) Reply networks

59) Retweet networks

20:00
177 / 202

Analysing Users β€” Who are more central in the retweet networks?

178 / 202
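
One way to compute this (a sketch, assuming rt_network is the tbl_graph built on the earlier slides) is with tidygraph's centrality functions on the activated nodes:

rt_network %>%
  activate(nodes) %>%
  mutate(centrality = centrality_degree(mode = "in")) %>% # how often a node is retweeted
  arrange(desc(centrality))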

Exercises

On users.Rmd, complete the following exercises

60) Who are more central in the retweet networks?

61) Something else interesting about MPs

62) Something interesting from your own data

45:00
179 / 202

Part 6. Data Analysis: Tweets

180 / 202

Analysing Tweets β€” Overview

  • Twitter analysis often focuses on tweets

    • e.g., text in the tweets, but also hashtags
    • as opposed to users who tweet them
      • not always mutually exclusive
    • data on ordinary users is hard to supplement with non-Twitter data
181 / 202

Analysing Tweets β€” Overview

  • Twitter analysis often focuses on tweets

    • e.g., text in the tweets, but also hashtags
    • as opposed to users who tweet them
      • not always mutually exclusive
    • data on ordinary users is hard to supplement with non-Twitter data


  • There are at least two types of tweet-based analysis

    • count things, and describe or correlate
      • e.g., how often a hashtag is tweeted
        • are shorter hashtags more likely to be tweeted than longer ones?
    • categorising tweets
      • e.g., with dictionaries
        • but also with other techniques such as machine learning
181 / 202

Analysing Tweets β€” When were the tweets posted?

182 / 202

Analysing Tweets β€” What day of the week?

183 / 202

Analysing Tweets β€” What time of the day?

184 / 202

Exercises

On tweets.Rmd, complete the following exercises

63) When were the tweets posted?

64) What day of the week?

65) What time of the day?

20:00
185 / 202

Analysing Tweets — Which hashtags were the most frequent?

186 / 202

Analysing Tweets β€” Which words were the most frequent?

187 / 202

Exercises

On tweets.Rmd, complete the following exercises

66) Which hashtags were the most frequent?

67) Which words were the most frequent?

20:00
188 / 202

Dictionary Methods

189 / 202

Analysing Tweets — Dictionary Methods — Overview

  • Dictionary methods are based on pre-categorisation of words

    • e.g., the word happy might be categorised as positive

      • sad would be negative
    • e.g., the word happy might be categorised as 0.2 sophisticated

      • contented might be 0.4 sophisticated
190 / 202

Analysing Tweets — Dictionary Methods — Overview

  • Dictionary methods are based on pre-categorisation of words

    • e.g., the word happy might be categorised as positive

      • sad would be negative
    • e.g., the word happy might be categorised as 0.2 sophisticated

      • contented might be 0.4 sophisticated


  • These categories are then matched with the text we have

    • to calculate scores for, e.g., each tweet
190 / 202

Analysing Tweets — Dictionary Methods — Notes

  • There are many ways to calculate scores

    • depending on your research design


  • Positive score could be

    • sum(positive)
    • sum(positive) - sum(negative)
    • (sum(positive) - sum(negative)) / (sum(positive) + sum(negative))
  • We will use

    • the sentiment dictionary available via tidytext::get_sentiments("nrc")
    • the concreteness dictionary available via doc2concrete::mturk_list
191 / 202
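
A sketch of the second scoring rule, sum(positive) - sum(negative), per tweet; the status_id and text column names are assumed from rtweet output, and get_sentiments("nrc") may prompt a one-time dictionary download:

library(readr)
library(dplyr)
library(tidyr)
library(tidytext)

read_rds("data/tweets.rds") %>%
  # one row per word
  unnest_tokens(output = word, input = text, token = "tweets") %>%
  # keep only words that appear in the dictionary
  inner_join(get_sentiments("nrc"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  # count positive and negative words per tweet
  count(status_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = positive - negative)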

Analysing Tweets β€” Sentiments by Hours of the Day

192 / 202

Analysing Tweets β€” Sentiments across the time frame

193 / 202

Analysing Tweets β€” Sentiments in different types of tweets

194 / 202

Exercises

On tweets.Rmd, complete the following exercises

68) Sentiments by hours of the day

69) Sentiments across the time frame

70) Sentiments in different types of tweets

20:00
195 / 202

Analysing Tweets β€” Concreteness in different types of tweets

196 / 202

Analysing Tweets β€” Concreteness by Hours of the Day

197 / 202

Exercises

On tweets.Rmd, complete the following exercises

71) Concreteness in different types of tweets

72) Concreteness by Hours of the Day

73) Something else interesting about MPs

74) Something interesting from your own data

45:00
198 / 202

References

199 / 202

References

Cheng, J. and W. Chang (2022). httpuv: HTTP and WebSocket Server Library. R package version 1.6.5. https://github.com/rstudio/httpuv.

Jungherr, A. (2016). "Twitter use in election campaigns: A systematic literature review". In: Journal of Information Technology & Politics 13.1, pp. 72-91.

JΓΌrgens, P. and A. Jungherr (2016). "A tutorial for using Twitter data in the social sciences: Data collection, preparation, and analysis". In: Available at http://dx.doi.org/10.2139/ssrn.2710146.

Kearney, M. W. (2020). rtweet: Collecting Twitter Data. R package version 0.7.0. https://CRAN.R-project.org/package=rtweet.

Mellon, J. and C. Prosser (2017). "Twitter and Facebook are not representative of the general population: Political attitudes and demographics of British social media users". In: Research & Politics 4.3, pp. 1-9.

Robinson, D. and J. Silge (2021). tidytext: Text Mining using dplyr, ggplot2, and Other Tidy Tools. R package version 0.3.2. https://github.com/juliasilge/tidytext.

Silge, J. and D. Robinson (2017). Text mining with R: A tidy approach. O'Reilly.

Silva, B. C. and S. Proksch (2021). "Fake It 'Til You Make It: A Natural Experiment to Identify European Politicians' Benefit from Twitter Bots". In: American Political Science Review 115.1, pp. 316-322.

200 / 202

References

Sinnenberg, L., A. M. Buttenheim, K. Padrez, et al. (2017). "Twitter as a tool for health research: a systematic review". In: American Journal of Public Health 107.1, pp. 1-8.

Umit, R. (2017). "Strategic communication of EU affairs: an analysis of legislative behaviour on Twitter". In: The Journal of Legislative Studies 23.1, pp. 93-124.

Wickham, H. (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr.

Wickham, H. (2021). tidyverse: Easily Install and Load the Tidyverse. R package version 1.3.1. https://CRAN.R-project.org/package=tidyverse.

Wickham, H., R. François, L. Henry, et al. (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.8. https://CRAN.R-project.org/package=dplyr.

Wickham, H. and G. Grolemund (2021). R for data science. O'Reilly.

Xie, Y. (2022). xaringan: Presentation Ninja. R package version 0.23. https://github.com/yihui/xaringan.

201 / 202

The workshop ends here.

Congratulations on making it this far, and

thank you for joining me!

202 / 202
