After completing this tutorial, you will be able to:
- Query the Twitter RESTful API to access and import into R tweets that contain various text strings.
- Generate a list of users that are tweeting about a particular topic
- Use R to explore and analyze word counts associated with tweets.
What you need
You will need a computer with internet access to complete this lesson.
In this lesson you will dive deeper into using Twitter data to understand a particular topic or event. You will also learn more about text mining.
Data munging 101
When you work with data from sources like NASA, USGS, etc., there are particular cleaning steps that you often need to perform. For instance:
- you may need to remove nodata values
- you may need to scale the data
- and more
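As a small illustration of the first two steps in the list above, here is a minimal sketch in base R. The values, the nodata flag of -9999, and the scale factor of 0.0001 are all made up for the example; a real data product documents its own values in its metadata.

```r
# hypothetical raw values; -9999 is assumed to be the nodata flag
raw_values <- c(1200, -9999, 3400, 560, -9999)
nodata_value <- -9999
scale_factor <- 0.0001  # assumed for this sketch only

# replace nodata flags with NA so they are ignored in calculations
clean_values <- raw_values
clean_values[clean_values == nodata_value] <- NA

# apply the scale factor to convert stored values to real units
scaled_values <- clean_values * scale_factor

# summarize, skipping the missing values
mean(scaled_values, na.rm = TRUE)
```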
However, the data generally have a set structure in terms of file formats and metadata.
When you work with social media and other text data the user community creates and curates the content. This means there are NO RULES! This also means that you may have to perform extra steps to clean the data to ensure you are analyzing the right thing.
Searching for tweets related to climate
Above you learned some things about sorting through social media data and the associated types of issues that you may run into when beginning to analyze it. Next, let’s look at a different workflow - exploring the actual text of the tweets which will involve some text mining.
In this example, let’s find tweets that use the hashtag #climatechange.
First, you load rtweet and the other R packages that you need. Note that two new packages are introduced lower in this lesson: igraph and ggraph.
# load twitter library - the rtweet library is recommended now over twitteR
library(rtweet)
# plotting and pipes - tidyverse!
library(ggplot2)
library(dplyr)
# text mining library
library(tidytext)
# plotting packages
library(igraph)
library(ggraph)
climate_tweets <- search_tweets(q = "#climatechange", n = 10000, lang = "en", include_rts = FALSE)
Let’s look at the results. Do you notice any issues with the data? For example, a search for forest fire returns tweets that contain the words forest and fire anywhere in the text, so the tweets are not necessarily all related to the science topic of interest. Or are they?
If you set the query to q="forest+fire" rather than forest fire, then the API will find tweets that use the two words together as a phrase rather than anywhere in the tweet. The code below reruns the #climatechange search so that you can inspect the tweet text.
# find tweets that use the #climatechange hashtag
climate_tweets <- search_tweets(q = "#climatechange", n = 10000,
                                lang = "en", include_rts = FALSE)
# check data to see if there are emojis
head(climate_tweets$text)
## [1] "Heart-Wrenching Video Shows Starving Polar Bear on Iceless Land https://t.co/UKBRmjqV7V via @NatGeo… https://t.co/uxgDWWUHAT"
## [2] "\"Indigenous communities throughout the Arctic depend on the land, lakes, rivers and the sea for food and income\". -… https://t.co/XNBUCIgnyw"
## [3] "\"When permafrost thaws, frozen plants & animals begin to decay, releasing CO2 & methane\". -NSIDC… https://t.co/wPrHTQ0MQQ"
## [4] "@invisibleman_17 @HamillHimself Meanwhile outside my home in Canada #OnThisDay ...\nBut #climatechange is #FakeNews,… https://t.co/RfyEKBlej7"
## [5] "The latest The Big Picture Daily! https://t.co/Pa990SbiJ5 Thanks to @rebootingfuture #climatechange #cop23"
## [6] "Remember when the jarring pic from the Arctic was a polar bear on an ice floe standing alone? Well now it’s starvin… https://t.co/hKwHKFrkF0"
Looking at the data above, it becomes clear that there is a lot of clean-up associated with social media data.
First, there are URLs in the tweets. If you want to do a text analysis to figure out which words are most common in the tweets, the URLs won’t be helpful. Let’s remove those.
# remove urls tidyverse is failing here for some reason
# climate_tweets %>%
#   mutate_at(c("stripped_text"), gsub("http.*","",.))

# remove http elements manually
# note: "http.*" matches from the first "http" to the end of the tweet,
# so any text that follows the first URL is removed as well
climate_tweets$stripped_text <- gsub("http.*","", climate_tweets$text)
climate_tweets$stripped_text <- gsub("https.*","", climate_tweets$stripped_text)
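The pattern "http.*" is greedy: it removes the first URL and everything after it. If you would rather keep the words that follow a URL, one option is to match only the non-space characters of the URL itself. The sketch below compares the two patterns on made-up sample text; the "http\\S+" pattern and the whitespace clean-up are suggestions and are not what the lesson uses to produce the results shown here.

```r
# made-up sample tweets for illustration only
sample_text <- c("Polar bear video https://t.co/abc via @NatGeo",
                 "No link here")

# pattern used in this lesson: drops everything from "http" onward
greedy <- gsub("http.*", "", sample_text)

# alternative: drop only the URL (http followed by non-space characters)
targeted <- gsub("http\\S+", "", sample_text)

# collapse the leftover double spaces and trim the ends
targeted <- trimws(gsub("\\s+", " ", targeted))

greedy
targeted
```

With the greedy pattern the words "via @NatGeo" are lost; with the targeted pattern they are kept.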
Finally, you can clean up the text. If you are trying to create a list of unique words in the tweets, words with capitalization will be treated as different from words that are all lowercase. You also don’t need punctuation to be returned as a unique word.
# note the words that are recognized as unique by R
a_list_of_words <- c("Dog", "dog", "dog", "cat", "cat", ",")
unique(a_list_of_words)
## [1] "Dog" "dog" "cat" ","
You can use the unnest_tokens() function in the tidytext package to clean up the text. When you use this function, the following things happen to the text:
- Text is converted to lowercase: each word found in the text is converted to lowercase to ensure that you don’t get duplicate words due to variation in capitalization.
- Punctuation is removed: all instances of periods, commas, etc. are removed from the list of words.
- Unique id associated with the tweet: if the data frame has an id column, the id is kept for each occurrence of a word.
In the call below, you pass unnest_tokens() two column names:
- The name of the new column where the unique words will be stored, and
- The name of the column in the data.frame that you want to pull unique words from.
In this case, you want to use the stripped_text column, which is where the cleaned-up tweet text is stored.
# remove punctuation, convert to lowercase, add id for each tweet!
climate_tweets_clean <- climate_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(word, stripped_text)
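To see these cleaning steps on a small scale, here is a minimal sketch that runs unnest_tokens() on a made-up two-row data frame. The toy data and the tweet_id column are hypothetical; the column is there only to show that other columns are carried along with each word.

```r
library(dplyr)
library(tidytext)

# made-up text with mixed capitalization and punctuation
toy <- tibble(tweet_id = 1:2,
              text = c("The Dog barks, loudly!", "A dog and a CAT."))

# one row per word: lowercase, punctuation dropped, tweet_id retained
toy_words <- toy %>%
  unnest_tokens(word, text)

toy_words

# "Dog" and "dog" now count as the same word
unique(toy_words$word)
```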
Now you can plot the data. What do you notice?
# plot the top 15 words -- notice any issues?
climate_tweets_clean %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  # x holds the words and y holds the counts (the axes are flipped for display)
  labs(y = "Count",
       x = "Unique words",
       title = "Count of unique words found in tweets")
The plot of unique words contains some words that may not be useful, for instance “a” and “to”. In the world of text mining, these are called “stop words”. You want to remove them from the analysis because they are filler words used to compose a sentence.
Lucky for us, the tidytext package includes data that will help you clean up stop words! To use it, you:
- Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis.
- Then use anti_join() to remove all stop words from the analysis.
Let’s give this a try next!
# load list of stop words - from the tidytext package
data("stop_words")
# view first 6 words
head(stop_words)
## # A tibble: 6 x 2
##   word      lexicon
##   <chr>     <chr>
## 1 a         SMART
## 2 a's       SMART
## 3 able      SMART
## 4 about     SMART
## 5 above     SMART
## 6 according SMART

nrow(climate_tweets_clean)
## [1] 128597

# remove stop words from our list of words
cleaned_tweet_words <- climate_tweets_clean %>%
  anti_join(stop_words)

# there should be fewer words now
nrow(cleaned_tweet_words)
## [1] 70697
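If it is unclear what anti_join() does, the following minimal sketch applies it to a few made-up words. The mini stop list is hypothetical and stands in for the full stop_words data; by = "word" names the shared column explicitly.

```r
library(dplyr)

# made-up words and a hypothetical mini stop list
toy_words <- tibble(word = c("the", "arctic", "is", "warming"))
toy_stop_words <- tibble(word = c("the", "is", "a"), lexicon = "custom")

# keep only the rows of toy_words with no match in toy_stop_words
toy_kept <- toy_words %>%
  anti_join(toy_stop_words, by = "word")

toy_kept$word
```

Only the words that do not appear in the stop list remain, which is why the row count drops after this step.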
Now that you’ve performed this final cleaning step, you can try plotting once again.
# plot the top 15 words -- notice any issues?
cleaned_tweet_words %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "Count",
       x = "Unique words",
       title = "Count of unique words found in tweets",
       subtitle = "Stop words removed from the list")
Explore networks of words
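You might also want to explore words that occur together in tweets. Let’s do that next.
In the code below, token = "ngrams" tells unnest_tokens() to split the text into sequences of consecutive words, and n = 2 sets the number of words in each sequence, so you get pairs of words.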
# library(devtools)
# install_github("dgrtwo/widyr")
library(widyr)

# remove punctuation, convert to lowercase, add id for each tweet!
climate_tweets_paired_words <- climate_tweets %>%
  dplyr::select(stripped_text) %>%
  unnest_tokens(paired_words, stripped_text, token = "ngrams", n = 2)

climate_tweets_paired_words %>%
  count(paired_words, sort = TRUE)
## # A tibble: 61,818 x 2
##    paired_words         n
##    <chr>            <int>
##  1 climate change    1224
##  2 in the             527
##  3 of the             369
##  4 the arctic         334
##  5 climatechange is   308
##  6 the most           257
##  7 is the             250
##  8 learn more         249
##  9 more here          238
## 10 is a               236
## # ... with 61,808 more rows
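To make the idea of word pairs concrete, here is a minimal sketch on a single made-up sentence. It uses the same unnest_tokens() arguments as the code above; only the toy text is an assumption.

```r
library(dplyr)
library(tidytext)

# one made-up tweet
toy_tweets <- tibble(text = "Climate change is real")

# split the text into overlapping pairs of consecutive words
toy_bigrams <- toy_tweets %>%
  unnest_tokens(paired_words, text, token = "ngrams", n = 2)

toy_bigrams$paired_words
```

A sentence of four words yields three overlapping pairs, and each pair is lowercase like the single-word tokens earlier in the lesson.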
library(tidyr)
climate_tweets_separated_words <- climate_tweets_paired_words %>%
  separate(paired_words, c("word1", "word2"), sep = " ")

climate_tweets_filtered <- climate_tweets_separated_words %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

# new bigram counts:
climate_words_counts <- climate_tweets_filtered %>%
  count(word1, word2, sort = TRUE)

head(climate_words_counts)
## # A tibble: 6 x 3
##   word1         word2           n
##   <chr>         <chr>       <int>
## 1 climate       change       1224
## 2 climatechange denial        204
## 3 leave         alec_states   185
## 4 fund          climatechange 113
## 5 sustainable   companies     113
## 6 ups           sustainable   113
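The code above splits each word pair into two columns and then drops any pair in which either word is a stop word. Here is a minimal sketch of the same two steps on three pairs taken from the counts shown earlier; the two-word stop list is a hypothetical stand-in for stop_words$word.

```r
library(dplyr)
library(tidyr)

# three word pairs and a hypothetical mini stop list
toy_pairs <- tibble(paired_words = c("climate change", "in the", "the arctic"))
toy_stop_words <- c("in", "the")

toy_filtered <- toy_pairs %>%
  # split each pair into its first and second word
  separate(paired_words, c("word1", "word2"), sep = " ") %>%
  # drop the pair if either word is a stop word
  filter(!word1 %in% toy_stop_words) %>%
  filter(!word2 %in% toy_stop_words)

toy_filtered
```

Note that “the arctic” is removed even though “arctic” is informative, because one stop word is enough to drop the pair.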
Finally, plot the data
library(igraph)
library(ggraph)

# plot climate change word network
climate_words_counts %>%
  filter(n >= 24) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
  labs(title = "Word Network: Tweets using the hashtag - Climate Change",
       subtitle = "Text mining twitter data ",
       x = "", y = "")
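It can help to see what graph_from_data_frame() builds before it is handed to ggraph(). The sketch below uses three rows from the bigram counts shown above; the threshold of 200 is chosen only for this small example. The first two columns become the connected words (nodes), and the remaining column, n, is stored as an attribute of each connection (edge).

```r
library(igraph)

# three rows taken from the bigram counts shown above
toy_counts <- data.frame(
  word1 = c("climate", "climatechange", "sustainable"),
  word2 = c("change", "denial", "companies"),
  n = c(1224, 204, 113)
)

# keep only the frequent pairs (threshold chosen for this example only)
toy_frequent <- toy_counts[toy_counts$n >= 200, ]

# each word becomes a node; each pair becomes an edge that carries n
toy_graph <- graph_from_data_frame(toy_frequent)

V(toy_graph)$name  # node names
E(toy_graph)$n     # edge weights used for edge_alpha and edge_width
```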
You expect the words climate & change to have a high