Resul Umit
post-doctoral researcher at the University of Oslo
interested in representation, elections, and parliaments
working with Twitter data
a publication based on Twitter data: The voices of Eurosceptic members of parliament (MPs) echo disproportionately louder on Twitter
an app based on Twitter data: LikeWise – a Shiny app that facilitates searching the tweets a user liked
Two days, on how to collect, process, and analyse data from Twitter
Designed for researchers with basic knowledge of the R programming language
Twitter provides attractive opportunities for academic research
Research based on Twitter data requires a set of skills
Popularity of the network
Richness of the data
Accessibility of the data
* These statistics were compiled at the end of 2021, by BusinessOfApps.
Research based on Twitter data requires certain skills
The required skills are often not covered in the academic training of social scientists
To provide you with an understanding of what is possible
To start you with acquiring and practicing the skills needed
Google, and perseverance, are all you need for the rest
Part 1. Preliminary Considerations
Part 2. Getting the Tools Ready
I will go through a number of slides...
... and then pause, for you to use/do those things
We are here to help
Slides with this background colour indicate that your action is required, for
setting the workshop up
completing the exercises
03:00
Codes and texts that go in R console or scripts appear as such – in a different font, on gray background

# read in the tweets dataset
df <- read_rds("tweets.rds") %>%
  # split the variable text, create a new variable called da_tweets
  unnest_tokens(output = da_tweets, input = text, token = "tweets") %>%
  # remove rows that match any of the stop words as stored in the stop_words dataset
  anti_join(stop_words, by = c("da_tweets" = "word"))

Results that come out as output appear as such – in the same font, on green background
Specific sections are highlighted yellow as such for emphasis
The slides are designed for self-study as much as for the workshop
Ideally, we have one or more research questions, hypotheses
Not all questions can be answered with Twitter data
There are at least two potential sources of bias in Twitter data
sampling
mediation
Twitter has restrictions on data access
These restrictions vary across API types
These restrictions also vary within API types, across different operations
Twitter restricts content redistribution
Reproducibility of research based on Twitter data is limited in practice
Twitter is currently switching to a new generation of APIs
Twitter might change the rules of the API game at any time, again
Existing code to collect tweets may or may not be affected, depending on what the rtweet package* will adopt
Not all changes are bad
* This is the R package that we will use to collect tweets. More details are in Part 2.
It is often impossible to get users' consent
Check the rules that apply to your case
Reflect on whether using Twitter data for research is ethical
Twitter data frequently requires
large amounts of digital storage space
private, safe storage spaces
Some tools of text analysis are developed for a specific language and/or context
e.g., dictionaries for sentiment analysis
these may not be useful or valid for different languages and/or contexts
Some tools of text analysis are developed for general use
e.g., a dictionary for sentiments in everyday language
these may not be useful or valid for a specific context
Having the workshop slides* on your own machine might be helpful
Access at https://resulumit.com/teaching/twtr_workshop.html
* These slides are produced in R, with the xaringan package (Xie, 2022).
Download the materials from https://github.com/resulumit/twtr_workshop/tree/materials
Code -> Download ZIP
Unzip and rename the folder
Materials have the following structure
twtr_workshop-materials
|- data
|  |- mps.csv
|  |- status_ids.rds
|  |- tweets.rds
|- exercises
|  |- solutions.R
|  |- tweets.Rmd
|  |- tweets_answers.Rmd
|  |- users.Rmd
|  |- users_answers.Rmd
data/mps.csv – data on MPs, including their Twitter usernames
data/status_ids.rds – status_id values for tweets posted by the MPs in mps.csv, during January 2021
data/tweets.rds – the same tweets as in data/status_ids, except that this file holds the complete data on them
exercises/solutions.R – solutions to the exercises
exercises/tweets.Rmd – exercises on tweets; answers are in tweets_answers.Rmd
exercises/users.Rmd – exercises on users; answers are in users_answers.Rmd
Programming language of this workshop
Optional, if you have R already installed
– run the R.version.string command in R to check the version of your copy
Download R from https://cloud.r-project.org
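For example – a quick check in the R console; the version below is only illustrative:

R.version.string

## [1] "R version 4.1.2 (2021-11-01)"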
Optional, but highly recommended
A popular integrated development environment (IDE) for R
Download RStudio from https://rstudio.com/products/rstudio/download
Help -> Check for Updates
RStudio allows for dividing your work with R into separate projects
...\twtr_workshop-materials
from the RStudio menu:
File -> New Project -> Existing Directory -> Browse -> ...\twtr_workshop-materials -> Open
* Recall that we have downloaded this earlier from GitHub. Back to the relevant slide.
install.packages(c("rtweet", "httpuv", "tidyverse", "tidytext"))
* You may already have a copy of one or more of these packages. In that case, I recommend updating by re-installing them now.
install.packages(c("rtweet", "httpuv", "tidyverse", "tidytext"))
rtweet
(Kearney, 2020), for collecting tweets
academictwitteR
for academic research access; running Python code in Rinstall.packages(c("rtweet", "httpuv", "tidyverse", "tidytext"))
rtweet
(Kearney, 2020), for collecting tweets
academictwitteR
for academic research access; running Python code in Rhttpuv
(Cheng and Chang, 2022), for API authorization
install.packages(c("rtweet", "httpuv", "tidyverse", "tidytext"))
tidyverse
(Wickham, 2021), for various tasks
base
Rinstall.packages(c("rtweet", "httpuv", "tidyverse", "tidytext"))
tidyverse
(Wickham, 2021), for various tasks
base
Rtidytext
(Robinson and Silge, 2021), for working with text as data
quanteda
Authorization to use Twitter APIs requires at least three steps*
1) open a user account on Twitter
2) with that user account, apply for a developer account
3) with that developer account, register a Twitter app
* There may be additional steps, such as registering for the Academic Research access.
It is possible to interact with Twitter APIs without steps 2 and 3
– rtweet has its own Twitter app – rstats2twitter – that anyone can use
– users authorize rstats2twitter via a pop-up browser
I recommend that you use rstats2twitter for now and follow the workshop

Sign up for Twitter at https://twitter.com/
a pre-condition for interacting with Twitter APIs
– including via rtweet's app, rstats2twitter
helpful for getting to know what you study
consider signing up with a strategic username
– e.g., not asdf029348

Apply for a developer account*
* It takes a few days for Twitter to review and hopefully approve your request to have an account. You might have created an account before. In that case, you will see Developer Portal instead of Apply.
On developer.twitter.com/en/portal/projects-and-apps, click + Create App
follow the instructions on consecutive pages
note that, once the app is registered, you are provided with keys and tokens
note that registering a Twitter app – like rtweet's own app, called rstats2twitter – does not mean you have to create an actual app
Keys and tokens are provided under the Keys and tokens tab
Keys and tokens can be re-generated anytime, under the same tab
Twitter allows for further, optional settings involving keys and tokens
– e.g., as with rstats2twitter, to allow for other users to authenticate through a browser pop-up
Consumer key and Consumer secret identify the app
Access token and Access token secret identify the user
– e.g., of the app rstats2twitter, there are many users
There are two different methods of authentication
– through rtweet's rstats2twitter app
– through your own app, over which you have control – unlike rstats2twitter
, over which you have no controlIf you are using your own app to authenticate, create a token
create_token
functionapp
argument requires for the name of your own app, as registered on developer.twitter.comKeys and tokens
tab on the same websitetw_token <- create_token( app = "", consumer_key = "", consumer_secret = "", access_token = "", access_secret = "" )
You may wish to put your keys and tokens elsewhere
There are at least two alternatives
1) create a separate script, which you can then source at the top of your main script
2) store your keys and tokens in your .Renviron file, which can be created at the project level as well

keys_tokens.R

tw_token <- create_token(
  app = "",
  consumer_key = "",
  consumer_secret = "",
  access_token = "",
  access_secret = ""
)

data_collection.R

library(rtweet)
source("keys_tokens.R")

.Renviron

TWITTER_APP=name_of_my_app
TWITTER_CONSUMER_KEY=akN...
TWITTER_CONSUMER_SECRET=HJK...
TWITTER_ACCESS_TOKEN=345...
TWITTER_ACCESS_SECRET=SDF...

data_collection.R

library(rtweet)

tw_token <- create_token(
  app = Sys.getenv("TWITTER_APP"),
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"),
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
  access_token = Sys.getenv("TWITTER_ACCESS_TOKEN"),
  access_secret = Sys.getenv("TWITTER_ACCESS_SECRET")
)
R for Data Science (Wickham and Grolemund, 2021)
Text Mining with R: A Tidy Approach (Silge and Robinson, 2017)
A Tutorial for Using Twitter Data in the Social Sciences: Data Collection, Preparation, and Analysis (Jürgens and Jungherr, 2016)
* I recommend these to be consulted not during but after the workshop.
We will collect data through APIs
Collecting data through web scraping is also possible
– e.g., with GetOldTweets3, a Python library
In general, there are two main types of APIs
REST APIs are for single, one-off requests
Streaming APIs are for continuous requests
At Twitter, there is a further differentiation among the APIs
Rules and restrictions differ from one type to another
Rules and restrictions can also differ within one type
We will collect data through Twitter's Standard v1.1 APIs
– these, via rtweet's rstats2twitter app, can be used immediately
You can move beyond the restrictions of these APIs later on
– e.g., rtweet has the search_30day and search_fullarchive functions for the Premium v1.1 APIs
Our attempts to collect data will be limited for various reasons, including rate limits
rtweet – Overview
A powerful R package for collecting Twitter data
– an alternative: twitteR
A lot has already been written on this package
– authentication works through its own app, rstats2twitter
rtweet – Basics
There are four main groups of functions to collect historical data, starting with
search_ – e.g., search_tweets or search_users
lookup_ – e.g., lookup_tweets or lookup_users
get_ – e.g., get_followers or get_friends
lists_ – e.g., lists_members or lists_statuses
There is also one function to collect tweets in real time
– stream_tweets
For other functions, see the package documentation
– e.g., the functions starting with post_
Check that you are in the right project
Create a new R Script, following from the RStudio menu
File -> New File -> R Script
save it with a descriptive name, such as data_collection.R
– avoiding the Untitled123 problem
load rtweet and the other packages
– no need to load the httpuv package; it is enough if it is installed

library(rtweet)
library(tidyverse)
library(tidytext)
search_
search_tweets
Collect tweets posted in the last 6 to 9 days
filter by search query, with the q argument
limited to 18,000 tweets, per 15 minutes, per token*
– collect more with the n argument,** setting the retryonratelimit argument to TRUE

search_tweets(q, n = 100, type = "recent", include_rts = TRUE,
  geocode = NULL, max_id = NULL, parse = TRUE, token = NULL,
  retryonratelimit = FALSE, verbose = TRUE, ...)

* All limits are for the Standard v1.1 APIs.
** This argument is common to many functions in the package. I recommend setting it to a small number, such as 200, for the exercises in this workshop. This will save computation time and avoid running into rate limits.
search_tweets – Defaults
type = "recent", returning the latest tweets
n = 100, returning 100 tweets
token = NULL, authenticating via rtweet's rstats2twitter app

search_tweets(q = "#publish")
1) Collect the latest 30 tweets that match a query of your choice, and assign them to an object named df_tweets
2) Observe how the rstats2twitter app works
3) Take some time to explore the data frame
– e.g., with the View, str, names, or tibble::glimpse functions
4) Conduct the same search on a browser
15:00
Twitter usernames, or handles, are stored under the variable screen_name
Twitter allows users to change their usernames and display names
– user_id is a better variable for reproducible research
The date and time data are matched to Greenwich Mean Time
– e.g., the variable created_at
You may wish to exclude retweets
– e.g., with include_rts = FALSE in search_tweets
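These points combine as follows – a minimal sketch, with an illustrative query:

# search without retweets, keeping the more reproducible columns
search_tweets(q = "#rstats", n = 200, include_rts = FALSE) %>%
  select(user_id, created_at, text)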
Collect the top 200 tweets that
search_tweets(q = "publish", n = 200, type = "popular")
search_tweets
Collect the top 200 tweets that
Note that
search_tweets(q = "publish perish", n = 200, type = "popular")
search_tweets
Collect the top 200 tweets that
Note that
search_tweets(q = "publish OR perish", n = 200, type = "popular")
search_tweets
Collect the top 200 tweets that
Note that
search_tweets(q = "\"publish or perish\"", n = 200, type = "popular")
search_tweets
Collect the top 200 tweets that
Note that
search_tweets(q = "publish -perish", n = 200, type = "popular")
search_tweets
Collect the top 200 tweets that
search_tweets(q = "publish lang:de", n = 200, type = "popular")
Note that
lang
, are followed by a colon :filter
, from
, to
, since
, until
, min_retweets
etc.search_tweets
Collect the top 200 tweets that
Note that
search_tweets(q = "publish -lang:de", n = 200, type = "popular")
search_tweets
β Noteslang
, filter
lang:en
as a parameterlang = "en"
as an argumentsearch_tweets(q = "publish lang:en filter:replies", n = 200, type = "mixed")
search_tweets(q = "publish", n = 200, type = "mixed", lang = "en", filter = "replies")
search_tweets – Notes
This function returns a data frame by default (parse = TRUE)
Under the hood, Twitter APIs return nested lists
– rtweet does most of the data preparation for us

5) Collect the latest 10 tweets that include ...
6) Collect the most popular 50 tweets that ...
7) Collect the most recent 35,000 tweets that ...
20:00
search_users
Collect information on users
filter by search query, with the q argument

search_users(q, n = 100, parse = TRUE, token = NULL, verbose = TRUE)

Note that this function does not have the retryonratelimit argument

8) Collect information on 30 users that ...
9) Collect the latest 30 tweets that ...
10) Take some time to explore the resulting data frames
11) Conduct one or more searches that interest you
20:00
rate_limit
Check rate limits at any time
– e.g., for the search_tweets function

rate_limit(token = NULL, query = NULL, parse = TRUE)

Note that token = NULL defaults to the rstats2twitter app

rate_limit
Check your remaining rate limits, for all operations

rate_limit()

# A tibble: 261 x 7
   query                   limit remaining reset      reset_at            timestamp           app
   <chr>                   <int>     <int> <drtn>     <dttm>              <dttm>              <chr>
 1 lists/list                 15        15 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 2 lists/:id/tweets&GET      900       900 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 3 lists/:id/followers&GET   180       180 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 4 lists/memberships          75        75 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 5 lists/:id&DELETE          300       300 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 6 lists/subscriptions        15        15 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 7 lists/members             900       900 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 8 lists/:id&GET              75        75 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
 9 lists/subscribers/show     15        15 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
10 lists/:id&PUT             300       300 14.78 mins 2022-03-06 10:35:45 2022-03-06 10:20:59 rstats2twitter
# ... with 251 more rows
rate_limit
Check your remaining rate limits, specifically for the search_tweets function

rate_limit(query = "search/tweets")

# A tibble: 1 x 7
  query         limit remaining reset         reset_at            timestamp           app
  <chr>         <int>     <int> <drtn>        <dttm>              <dttm>              <chr>
1 search/tweets   180       171 14.77129 mins 2022-03-06 10:42:12 2022-03-06 10:27:26 rstats2twitter
rate_limit
Single out the number of remaining calls, specifically for the search_tweets function

rate_limit(query = "search/tweets")$remaining

## [1] 171

Note that such values can be used in your own code, e.g., to pause a script until the limits reset
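For instance – a minimal sketch, assuming the search limit might be exhausted:

# pause until the search/tweets rate limit resets
rl <- rate_limit(query = "search/tweets")

if (rl$remaining == 0) {
  # rl$reset is a difftime in minutes; convert it to seconds before sleeping
  Sys.sleep(as.numeric(rl$reset, units = "secs"))
}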
12) Check all your remaining rate limits
13) Check your remaining limits for the search_tweets function
14) Collect the most recent 50 tweets that ...
15) Check your remaining limits for the search_tweets function again
10:00
lookup_
lookup_tweets
Collect data on one or more tweets
Note that this function does not have the retryonratelimit argument

lookup_tweets(statuses, parse = TRUE, token = NULL)
lookup_tweets
Collect data on one or more status IDs

lookup_tweets(statuses = c("567053242429734913", "266031293945503744", "440322224407314432"))

Collect data on status IDs in a data frame

lookup_tweets(statuses = df$status_id)
lookup_users
Collect data on one or more users
Note that this function does not have the retryonratelimit argument

lookup_users(users, parse = TRUE, token = NULL)

lookup_users
Collect data on one or more usernames

lookup_users(users = c("drob", "hadleywickham", "JennyBryan"))

Collect data on usernames in a data frame

lookup_users(users = df$screen_name)
lookup_friendships
Collect data on the friendship status of two users
Note that this function does not have the retryonratelimit argument

lookup_friendships(source, target, parse = TRUE, token = NULL)
16) Find a status ID through your browser and look it up in R
17) Look up a subset of tweets whose IDs are stored in status_ids.rds
18) Look up a subset of users whose usernames are stored in mps.csv
19) Check the friendship status of two MPs in the dataset
15:00
get_
get_timeline
Collect the latest posts from one or more users
– specify users with the user argument
Note that this function does not have the retryonratelimit argument

get_timeline(user, n = 100, max_id = NULL, home = FALSE, parse = TRUE, check = TRUE, token = NULL, ...)
get_timeline
Collect the most recent 200 tweets by David Robinson
get_timeline(user = "drob", n = 200)
get_timeline
Collect the most recent posts by David Robinson and Hadley Wickham
Note that the n argument applies per user

get_timeline(user = c("drob", "hadleywickham"), n = 200)
get_timeline – Home Timeline
The package documentation suggests that get_timeline can also retrieve home-timelines
– if the home argument is set to TRUE
This does not seem to be true
– the user argument is ignored when home = TRUE
– yet the user argument cannot be missing either

get_timeline(user = "hadleywickham", n = 200, home = TRUE)
retryonratelimit
the retryonratelimit argument is not available for all functions in the package
– e.g., search_users
You can create your own safety net
– e.g., with iteration

retryonratelimit – Iteration

datalist <- list() # create an empty list, to be filled later

for (i in 1:length(df_users$screen_name)) { # for one user, in the data frame df_users, at a time

  if (rate_limit(query = "application/rate_limit_status", token = tw_token)$remaining > 2 &
      rate_limit(query = "get_timeline", token = tw_token)$remaining > 20) {
      # if you are still under the rate limit for this task

    dat <- get_timeline(df_users$screen_name[i], n = 3200, # collect the tweets
                        token = tw_token)
    datalist[[i]] <- dat # fill the list with data, for one user at a time

  } else { # if there is no limit left, wait a little
    wait <- as.numeric(rate_limit(query = "get_timeline")$reset) + 0.1
    Sys.sleep(wait * 60)
  }
}

df_tweets <- as.data.frame(do.call(rbind, datalist)) # put all data in one data frame
20) Collect the most recent tweets posted by three users
– limiting the results with the n argument, per user
21) Collect as many tweets as possible from your own home-timeline
22) Collect data from the timelines of the first five MPs in mps.csv
10:00
get_followers
Collect a list of followers, following one user
– set retryonratelimit = TRUE to move beyond the rate limit
Note that the function returns user IDs
– use lookup_users if usernames are needed

get_followers(user, n = 5000, page = "-1", retryonratelimit = FALSE, parse = TRUE, verbose = TRUE, token = NULL)

get_followers
Collect a list of Hadley Wickham's followers on Twitter

get_followers(user = "hadleywickham", n = 10000, retryonratelimit = TRUE)
get_friends
Get a list of users, followed by one or more users
setting retryonratelimit = TRUE does not help with the limit here
– use the page argument instead, together with the next_cursor function

get_friends(users, n = 5000, retryonratelimit = FALSE, page = "-1", parse = TRUE, verbose = TRUE, token = NULL)
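A minimal sketch of that pagination – the username is illustrative, and the second call starts where the first one ends:

# first batch of up to 5,000 friends
f1 <- get_friends(users = "some_user", n = 5000)

# next batch, starting from the cursor returned with the first batch
f2 <- get_friends(users = "some_user", n = 5000, page = next_cursor(f1))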
get_friends
Collect a list of users followed by Jenny Bryan and Hadley Wickham on Twitter
get_friends(users = c("hadleywickham", "JennyBryan"), n = 20)
23) Collect a list of accounts following Universität Luzern
– e.g., find its username with search_users, and the followers with get_followers
24) Collect a list of accounts that Universität Luzern follows
– e.g., turn the resulting user IDs into names with lookup_users
25) Check your rate limits
07:30
get_favorites
Collect tweets liked by one or more users
– specify users with the user argument
Note that this function does not have the retryonratelimit argument

get_favorites(user, n = 200, since_id = NULL, max_id = NULL, parse = TRUE, token = NULL)

get_favorites
Collect a list of tweets liked by Jenny Bryan

get_favorites(user = "JennyBryan")
get_retweets
Collect information on the retweets of one tweet
– specify the tweet with the status_id argument
– find status IDs with, e.g., get_timeline

get_retweets(status_id, n = 100, parse = TRUE, token = NULL, ...)

get_retweets
Collect the most recent retweets of one tweet

get_retweets(status_id = "1354143047324299264")
26) Collect a list of favorites by three users
27) Collect a list of accounts retweeting a tweet of yours
07:30
get_trends
Collect information on Twitter trends
– specify location with the woeid argument,* or with the lat and lng arguments
Note that trends data is available only for some locations
– use the trends_available function to check availability

get_trends(woeid = 1, lat = NULL, lng = NULL, exclude_hashtags = FALSE, token = NULL, parse = TRUE)

* It stands for "where on earth identifier", which is 44418 for London. Google for more!
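For example – a quick way to find a WOEID, assuming the tidyverse is loaded as before:

# list the places with available trends data, and single out London
trends_available() %>%
  filter(name == "London")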
get_trends
Collect the trends data for London
– with the woeid argument

get_trends(woeid = 44418)

Collect the same trends data for London
– with the lat and lng arguments instead

get_trends(lat = "51.50", lng = "0.12")
28) Collect a list of places where the trends data is available
– e.g., with the trends_available function
29) Collect the lists of trends for two locations
30) Collect the list of trends for your location
07:30
lists_
lists_memberships
Collect data on the lists where one or more users are listed
lists_memberships(user = NULL, n = 200, cursor = "-1", filter_to_owned_lists = FALSE, token = NULL, parse = TRUE, previous_cursor = NULL)
lists_memberships
Collect data on lists where Jenny Bryan is listed
lists_memberships(user = "JennyBryan")
Collect data on lists where Jenny Bryan or Hadley Wickham is listed
lists_memberships(user = c("JennyBryan", "hadleywickham"))
lists_members
Collect data on users listed in one list
– specify the list with the list_id argument, finding list IDs with, e.g., lists_memberships
– or use the owner_user and slug arguments together

lists_members(list_id = NULL, slug = NULL, owner_user = NULL, n = 5000, cursor = "-1", token = NULL, parse = TRUE, ...)
lists_members
Collect data on the list of MPs in the House of Commons
– with the list_id argument

lists_members(list_id = "217199644")

Collect the same data, with different arguments
– with the owner_user and slug arguments

lists_members(owner_user = "TwitterGov", slug = "UK-MPs")
lists_statuses
Collect tweets from the timeline of a list
– specify the list with the list_id argument
– or with the owner_user and slug arguments together

lists_statuses(list_id = NULL, slug = NULL, owner_user = NULL, since_id = NULL, max_id = NULL, n = 200, include_rts = TRUE, parse = TRUE, token = NULL)
lists_statuses
Collect tweets posted by the members of the UK MPs list
– with the list_id argument

lists_statuses(list_id = "217199644")

Collect the same data, with different arguments
– with the owner_user and slug arguments

lists_statuses(owner_user = "TwitterGov", slug = "UK-MPs")
lists_subscribers
Collect data on users subscribed to a given list
– specify the list with the list_id argument
– or with the owner_user and slug arguments

lists_subscribers(list_id = NULL, slug = NULL, owner_user = NULL, n = 20, cursor = "-1", parse = TRUE, token = NULL)
lists_subscribers
Collect data on users subscribed to the UKMPs list
– with the list_id argument

lists_subscribers(list_id = "1405362")

Collect the same data, with different arguments
– with the owner_user and slug arguments

lists_subscribers(owner_user = "TwitterGov", slug = "UK-MPs")
lists_subscriptions
Collect data on the lists a user is subscribed to
– specify the user with the user argument

lists_subscriptions(user, n = 20, cursor = "-1", parse = TRUE, token = NULL)

lists_subscriptions
Collect data on the lists that TwitterGov is subscribed to

lists_subscriptions(user = "TwitterGov")
31) Collect data on lists where Hadley Wickham is listed
32) For one of these lists, see who else is listed with Hadley Wickham
33) Collect the latest posts from that list
34) Collect data on users subscribed to that list
35) For one of these users, see if they are subscribed to any other lists
10:00
stream_tweets
Collect tweets as they are posted, in real time
– for a period defined by the timeout argument
The search can be limited with the q argument
Note that q = "" returns a random sample of all tweets

stream_tweets(q = "", timeout = 30, parse = TRUE, token = NULL, file_name = NULL, verbose = TRUE, ...)
stream_tweets
Collect a random sample of tweets being sent
Note that the timeout argument can be set to infinity

stream_tweets(q = "", timeout = Inf)

stream_tweets
Collect a random sample of tweets being sent
Note that timeout values are otherwise in seconds

stream_tweets(q = "", timeout = 30)

stream_tweets
Collect tweets that include one or more keywords
Note that q accepts a comma separated character string

stream_tweets(q = "switzerland, schweiz, suisse, svizzera", timeout = 30)

stream_tweets
Collect tweets that include one or more keywords
Note that q accepts a character vector as well

stream_tweets(q = c("UniLuzern", "hslu", "phluzern"), timeout = 30)

stream_tweets
Collect tweets sent from within a geographical area
Note that q accepts bounding-box coordinates

stream_tweets(q = c(6.02, 45.77, 10.44, 47.83), timeout = 30)
36) Stream for all tweets, for 30 seconds
37) Further limit your stream by a popular keyword
38) Further limit your stream to a not-so-popular word
39) Stream for a word or words that interest you
10:00
The rtweet package does a very good job with data preparation to start with
– e.g., it returns ready-made variables, such as hashtags
Further data preparation depends on your research project
Most researchers would be interested in textual Twitter data
There are many components of tweets as texts
I use the stringr package (Wickham, 2019) for string operations
– a member of the tidyverse family
There is more to Twitter data than just tweets themselves
I use the dplyr package (Wickham, François, Henry, and Müller, 2022) for most data operations on numbers
– a member of the tidyverse family
familytweet <- "These from @handle1 are #socool. π A #mustsee, @handle2! π https://t.co/aq7MJJ2"
str_remove_all(string = tweet, pattern = "[@][\\w_-]+")
[1] "This from are #socool. π A #mustsee, ! π https://t.co/aq7MJJ2"
Note that
str_remove
fucntion, which removes the first occurrence onlytweet <- "These from @handle1 are #socool. π A #mustsee, @handle2! π https://t.co/aq7MJJ2"
str_remove_all(string = tweet, pattern = "[#][\\w_-]+")
[1] "These from @handle1 are . π A , @handle2! π https://t.co/aq7MJJ2"
The exercises in this part are best completed on tweets.rds, or a similar existing dataset
The mutate and select functions, from the dplyr package, can be helpful, as follows

df_tweets <- read_rds("data/tweets.rds")

df_tweets %>%
  mutate(no_mentions = str_remove_all(string = text, pattern = "[@][\\w_-]+")) %>%
  select(text, no_mentions) %>%
  View()
40) Create a new variable without mentions
41) Create a new variable without hashtags
05:00
tweet <- "These from @handle1 are #socool. π A #mustsee, @handle2! π https://t.co/aq7MJJ2"
str_remove_all(string = tweet, pattern = "http\\S+\\s*")
[1] "These from @handle1 are. π A, @handle2! π "
tweet <- "These from @handle1 are #socool. π A #mustsee, @handle2! π https://t.co/aq7MJJ2"
iconv(x = tweet, from = "latin1", to = "ASCII", sub = "")
[1] "These from @handle1 are #socool. A #mustsee, @handle2! https://t.co/aq7MJJ2"
42) Create a new variable without links
43) Create a new variable without emojis
44) Create a new variable without any of the above
10:00
tweet <- "These from @handle1 are #socool. π A #mustsee, @handle2! π https://t.co/aq7MJJ2"
str_remove_all(string = tweet, pattern = "[[:punct:]]")
[1] "This from are socool π A mustsee handle2 π httpstcoaq7MJJ2"
Note that removing punctuation might merge words across sentences

tweet <- "This is a sentence.There is no space before this sentence."

str_remove_all(string = tweet, pattern = "[[:punct:]]")

[1] "This is a sentenceThere is no space before this sentence"
Note that you can use the str_replace_all function to replace punctuation with space instead

tweet <- "This is a sentence.There is no space before this sentence."

str_replace_all(string = tweet, pattern = "[[:punct:]]", replacement = " ")

[1] "This is a sentence There is no space before this sentence "
tweet <- "There are too many spaces after this sentence. This is a new sentence."
str_squish(string = tweet)
[1] "There are too many spaces after this sentence. This is a new sentence."
Note that
tweet <- "lower case. Sentence case. Title Case. UPPER CASE."
str_to_lower(string = tweet)
[1] "lower case. sentence case. title case. upper case."
Note that the related functions include str_to_sentence, str_to_title, and str_to_upper
45) Remove punctuation
46) Remove repeated whitespace
47) Change case to lower case
10:00
Research designs might require changing the unit of observation
– e.g., aggregating with dplyr, dis-aggregating with tidytext

Aggregate at the level of users

# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # group by users for aggregation
  group_by(user_id) %>%
  # create summary statistics for variables of interest
  summarise(sum_tweets = n())

What is aggregated at which level depends on your research design, such as merging tweets by user and source

# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # group by users and sources for aggregation
  group_by(user_id, source) %>%
  # create summary statistics for variables of interest
  summarise(merged_tweets = paste0(text, collapse = ". "))
Disaggregate the tweets, by splitting them into smaller units
Note that the separate_rows function splits at its default separator, sep = "[^[:alnum:].]+", which works well with separating tweets into words

# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # split the variable text
  separate_rows(text)
The tidytext package has a function that works better with tokenising tweets
– with token = "tweets", it dis-aggregates text into words while preserving hashtags and mentions

# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # split the variable text, create a new variable called da_tweets
  unnest_tokens(output = da_tweets, input = text, token = "tweets")
Tokenise variables to levels other than words
– e.g., sentences

# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # split the variable text into sentences, create a new variable called da_tweets
  unnest_tokens(output = da_tweets, input = text, token = "sentences")
Tokenise variables other than tweets
– e.g., hashtags, which rtweet stores as lists, together with mentions etc.

# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # unlist the lists of hashtags to create strings
  group_by(status_id) %>%
  mutate(tidy_hashtags = str_c(unlist(hashtags), collapse = " ")) %>%
  # split the string, create a new variable called da_hashtags
  unnest_tokens(output = da_hashtags, input = tidy_hashtags, token = "words")
Remove the common, uninformative words
Note that there are various collections of stop words, e.g.,
– the stop_words dataset in the tidytext package, with the word variable
– the stopwordslangs data in the rtweet package
– the stopwords function from the tm package, e.g., tm::stopwords("german") for German

# load the tweets dataset
df <- read_rds("tweets.rds") %>%
  # split the variable text, create a new variable called da_tweets
  unnest_tokens(output = da_tweets, input = text, token = "tweets") %>%
  # remove rows that match any of the stop words as stored in the stop_words dataset
  anti_join(stop_words, by = c("da_tweets" = "word"))
48) Aggregate text to a higher level
– e.g., from tweets.rds, to MP level
– and add at least two numerical variables
49) Dis-aggregate text to a lower level
50) Dis-aggregate hashtags
51) Remove stop words
15:00
Twitter analysis might focus on users
– e.g., the MPs in data/mps.csv
There are at least two types of user-based analysis
On users.Rmd, complete the following exercises
52) Correlates of being on Twitter
53) Who has the most followers?
20:00
On users.Rmd, complete the following exercises
54) Correlates of having more followers
55) Who tweets the most often?
20:00
On users.Rmd, complete the following exercises
56) Correlates of tweeting more often
57) Who do they talk to?
20:00
Twitter data is suitable for network analysis
There are at least five networks
Networks are composed of nodes and edges
The nodes and edges are often kept separate for analysis
We will use two packages for network analysis
– tidygraph for data manipulation
– ggraph for visualisation

tidygraph
read_rds("data/tweets.rds") %>% filter(is_retweet == TRUE) %>% group_by(screen_name, retweet_screen_name) %>% summarise(rts = n()) %>% head()
# A tibble: 6 x 3 # Groups: screen_name [1] screen_name retweet_screen_name rts1 _OliviaBlake AlexDaviesJones 1 2 _OliviaBlake CommonsPAC 2 3 _OliviaBlake DanJarvisMP 3 4 _OliviaBlake DrRosena 1 5 _OliviaBlake EmmaHardyMP 1 6 _OliviaBlake FriendsLoxley 2
tidygraph
Use the as_tbl_graph function to transform data frames into network objects

read_rds("data/tweets.rds") %>%
  filter(is_retweet == TRUE) %>%
  group_by(screen_name, retweet_screen_name) %>%
  summarise(rts = n()) %>%
  as_tbl_graph()

# A tbl_graph: 7377 nodes and 16131 edges
#
# A directed multigraph with 5 components
#
# Node Data: 7,377 x 1 (active)
  name
1 _OliviaBlake
2 _RobbieMoore
3 AaronBell4NUL
4 ab4scambs
5 abenaopp
6 ABridgen
# ... with 7,371 more rows
#
# Edge Data: 16,131 x 3
   from    to   rts
1     1    19     1
2     1   550     2
3     1   119     3
# ... with 16,128 more rows
tidygraph
Use the activate function to manipulate the nodes or edges

read_rds("data/tweets.rds") %>%
  filter(is_retweet == TRUE) %>%
  group_by(screen_name, retweet_screen_name) %>%
  summarise(rts = n()) %>%
  as_tbl_graph() %>%
  activate(edges) %>%
  mutate(multi_rts = if_else(rts > 1, 1, 0))

# A tbl_graph: 7377 nodes and 16131 edges
#
# A directed multigraph with 5 components
#
# Edge Data: 16,131 x 4 (active)
   from    to   rts multi_rts
1     1    19     1         0
2     1   550     2         1
3     1   119     3         1
4     1   154     1         0
5     1   167     1         0
6     1   551     2         1
# ... with 16,125 more rows
#
# Node Data: 7,377 x 1
  name
1 _OliviaBlake
2 _RobbieMoore
3 AaronBell4NUL
# ... with 7,374 more rows
ggraph
Once the nodes and edges are ready, use the ggraph package to visualise
– with a grammar similar to that of ggplot2

ggraph(rt_network) +
  geom_edge_link() +
  geom_node_point(aes(color = party)) +
  theme_graph()
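Centrality measures, for the exercises below, come from the same toolkit – a minimal sketch, assuming rt_network is the tbl_graph built above:

# rank the MPs by how often others retweet them (in-degree centrality)
rt_network %>%
  activate(nodes) %>%
  mutate(centrality = centrality_degree(mode = "in")) %>%
  arrange(desc(centrality))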
On users.Rmd, complete the following exercises
58) Reply networks
59) Retweet networks
20:00
On users.Rmd, complete the following exercises
60) Who are more central in the retweet networks?
61) Something else interesting about MPs
62) Something interesting from your own data
45:00
Twitter analysis often focuses on tweets
There are at least two types of tweet-based analysis
On tweets.Rmd, complete the following exercises
63) When were the tweets posted?
64) What day of the week?
65) What time of the day?
20:00
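A possible starting point for the exercises above – a minimal sketch, assuming df_tweets holds the tweets and the lubridate package is installed:

library(lubridate)

# count the tweets by day of the week and hour of the day
df_tweets %>%
  mutate(weekday = wday(created_at, label = TRUE),
         hour = hour(created_at)) %>%
  count(weekday, hour)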
On tweets.Rmd, complete the following exercises
66) Which hashtags were the most frequent?
67) Which words were the most frequent?
20:00
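One way into these frequency counts – a minimal sketch, reusing the tokenising and stop-word steps from Part 4, and assuming df_tweets holds the tweets:

# count the most frequent words, ignoring the stop words
df_tweets %>%
  unnest_tokens(output = word, input = text, token = "tweets") %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)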
Dictionary methods are based on pre-categorisation of words
– e.g., the word happy might be categorised as positive
– e.g., the word happy might be categorised as 0.2 sophisticated
These categories are then matched with the text we have
There are many ways to calculate scores
– e.g., a positive score could be
sum(positive)
sum(positive) - sum(negative)
(sum(positive) - sum(negative)) / (sum(positive) + sum(negative))
We will use
– tidytext::get_sentiments("nrc"), for sentiments
– doc2concrete::mturk_list, for concreteness
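Putting the pieces together – a minimal sketch of the last scoring rule above, assuming df_tweets holds the tweets; note that get_sentiments("nrc") asks to download the dictionary on first use:

# score each tweet's sentiment with the NRC dictionary
df_tweets %>%
  unnest_tokens(output = word, input = text, token = "tweets") %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(status_id, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(score = (positive - negative) / (positive + negative))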
On tweets.Rmd, complete the following exercises
69) Sentiments across the time frame
70) Sentiments in different types of tweets
20:00
On tweets.Rmd, complete the following exercises
71) Concreteness in different types of tweets
72) Concreteness by Hours of the Day
73) Something else interesting about MPs
74) Something interesting from your own data
45:00
Cheng, J. and W. Chang (2022). httpuv: HTTP and WebSocket Server Library. R package version 1.6.5. https://github.com/rstudio/httpuv.
Jungherr, A. (2016). "Twitter use in election campaigns: A systematic literature review". In: Journal of Information Technology & Politics 13.1, pp. 72-91.
Jürgens, P. and A. Jungherr (2016). "A tutorial for using Twitter data in the social sciences: Data collection, preparation, and analysis". Available at http://dx.doi.org/10.2139/ssrn.2710146.
Kearney, M. W. (2020). rtweet: Collecting Twitter Data. R package version 0.7.0. https://CRAN.R-project.org/package=rtweet.
Mellon, J. and C. Prosser (2017). "Twitter and Facebook are not representative of the general population: Political attitudes and demographics of British social media users". In: Research & Politics 4.3, pp. 1-9.
Robinson, D. and J. Silge (2021). tidytext: Text Mining using dplyr, ggplot2, and Other Tidy Tools. R package version 0.3.2. https://github.com/juliasilge/tidytext.
Silge, J. and D. Robinson (2017). Text mining with R: A tidy approach. O'Reilly.
Silva, B. C. and S. Proksch (2021). "Fake It 'Til You Make It: A Natural Experiment to Identify European Politicians' Benefit from Twitter Bots". In: American Political Science Review 115.1, pp. 316-322.
Sinnenberg, L., A. M. Buttenheim, K. Padrez, et al. (2017). "Twitter as a tool for health research: a systematic review". In: American Journal of Public Health 107.1, pp. 1-8.
Umit, R. (2017). "Strategic communication of EU affairs: an analysis of legislative behaviour on Twitter". In: The Journal of Legislative Studies 23.1, pp. 93-124.
Wickham, H. (2019). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.4.0. https://CRAN.R-project.org/package=stringr.
Wickham, H. (2021). tidyverse: Easily Install and Load the Tidyverse. R package version 1.3.1. https://CRAN.R-project.org/package=tidyverse.
Wickham, H., R. François, L. Henry, et al. (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.8. https://CRAN.R-project.org/package=dplyr.
Wickham, H. and G. Grolemund (2021). R for data science. O'Reilly.
Xie, Y. (2022). xaringan: Presentation Ninja. R package version 0.23. https://github.com/yihui/xaringan.
Resul Umit
post-doctoral researcher at the University of Oslo
interested in representation, elections, and parliaments