This code is accompanying material for this talk for R-Ladies Bergen.
Text data provides an oasis of information for researchers and non-researchers alike. Natural Language Processing (NLP) methods help make sense of this difficult data type: written text. The talk and code give you a gentle introduction to the quanteda package. I will also show how to quickly visualize your text data and cover both supervised and unsupervised approaches in NLP. In the code demo, we use text data from the UN as a working example to give you first insights into the structure of text data and how to work with it.
Before we get started, you might want to review the terms and concepts once more (or use them as a glossary to look things up when needed).
In a first step, we load the packages.
## Packages
pkgs <- c(
"knitr", # A General-Purpose Package for Dynamic Report Generation in R
"tidyverse", # Easily Install and Load the 'Tidyverse'
"quanteda", # Quantitative Analysis of Textual Data
"stm", # Estimation of the Structural Topic Model
"stminsights", # A 'Shiny' Application for Inspecting Structural Topic Models
"LDAvis", # Interactive Visualization of Topic Models
"servr", # A Simple HTTP Server to Serve Static Files or Dynamic Documents
"topicmodels", # Topic Models
"kableExtra", # Construct Complex Table with 'kable' and Pipe Syntax
"readtext", # Import and Handling for Plain and Formatted Text Files
"magrittr", # A Forward-Pipe Operator for R
"overviewR", # Easily Extracting Information About Your Data
"countrycode", # Convert Country Names and Country Codes
"wesanderson", # A Wes Anderson Palette Generator
"tidytext" # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
)
## Install uninstalled packages
# (compare against the package names only, not every cell of the matrix)
lapply(pkgs[!(pkgs %in% rownames(installed.packages()))], install.packages)
## Load all packages to library
lapply(pkgs, library, character.only = TRUE)
## Set a theme for the plots:
theme_set(
theme_minimal() + theme(
strip.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
)
)
Remember, we will use quanteda in this workshop. Getting started with quanteda basically involves four steps:
1. Import the data
2. Build a corpus
3. Pre-process your data
4. Calculate a document-feature matrix (DFM)
And we will walk through each of them now.
# Load packages
library(quanteda) # For NLP (Quantitative Analysis of Textual Data)
load("../data/UN-data.RData")
Before we proceed, we check the data and the time and geographical coverage of the data:
head(un_data)
## readtext object consisting of 6 documents and 3 docvars.
## # Description: df[,5] [6 × 5]
## doc_id text country session year
## * <chr> <chr> <chr> <int> <int>
## 1 AFG_55_2000.txt "\"On my way \"..." AFG 55 2000
## 2 AGO_55_2000.txt "\"Allow me t\"..." AGO 55 2000
## 3 ALB_55_2000.txt "\"Allow me t\"..." ALB 55 2000
## 4 AND_55_2000.txt "\"Andorra wi\"..." AND 55 2000
## 5 ARE_55_2000.txt "\"I have the\"..." ARE 55 2000
## 6 ARG_55_2000.txt "\"In\nthis, m\"..." ARG 55 2000
What we see is that our data set has five variables: doc_id, text, country, session, and year. The text of the speech is stored in the variable text, and doc_id gives us a unique ID for each document.
Now we turn to the time and geographical coverage:
un_data %>%
overview_tab(id = country, time = year)
## # A tibble: 197 x 2
## # Groups: country [197]
## country time_frame
## <chr> <chr>
## 1 AFG 2000 - 2018
## 2 AGO 2000 - 2018
## 3 ALB 2000 - 2018
## 4 AND 2000 - 2018
## 5 ARE 2000 - 2018
## 6 ARG 2000 - 2018
## 7 ARM 2000 - 2018
## 8 ATG 2000 - 2018
## 9 AUS 2000 - 2018
## 10 AUT 2000 - 2018
## # … with 187 more rows
To decrease the sample size, I reduced the data set and it now only captures data from 2000 onwards.
# Build the corpus
mycorpus <- corpus(un_data)
# Assigns a unique identifier to each text
docvars(mycorpus, "Textno") <-
sprintf("%02d", 1:ndoc(mycorpus))
# Create tokens
token <-
tokens(
# Takes the corpus
mycorpus,
# Remove numbers
remove_numbers = TRUE,
# Remove punctuation
remove_punct = TRUE,
# Remove symbols
remove_symbols = TRUE,
# Remove URL
remove_url = TRUE,
# Split up hyphenated words
split_hyphens = TRUE,
# And include the doc vars (we'll need them later)
include_docvars = TRUE
)
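To get a feeling for what these options do, here is a minimal sketch on a single made-up sentence. The sentence and the object names `toy_text` and `toy_tokens` are assumptions for illustration only; they are not part of the UN data.

```r
library(quanteda)

# Hypothetical toy sentence (not from the UN corpus)
toy_text <- c(toy = "In 2015, peace-keeping costs rose 5%! See https://www.un.org for more.")

# Same cleaning options as in the call above
toy_tokens <- tokens(
  toy_text,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  split_hyphens = TRUE
)

# Inspect the remaining tokens as a plain character vector
as.character(toy_tokens)
```

Running the same sentence with one option switched off at a time is a quick way to see which option is responsible for which change.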
Since the texts were generated with OCR, they need some additional cleaning. We remove all tokens that contain digits, hyphens, or punctuation, as well as tokens that are only one or two characters long:
# Clean tokens created by OCR
token_ungd <- tokens_select(
token,
c("[\\d-]", "[[:punct:]]", "^.{1,2}$"),
selection = "remove",
valuetype = "regex",
verbose = TRUE
)
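The three regular expressions can be tried out in base R before applying them to the full token object. This is only a sketch: the toy tokens are made up, and base R's `grepl()` with `perl = TRUE` stands in for the regex engine that quanteda uses internally, which may differ in edge cases.

```r
# Hypothetical toy tokens, chosen so that each pattern matches at least once
toy_tokens <- c("peace", "co-operation", "2nd", "of", "u.n", "security")

# The three patterns passed to tokens_select() above
patterns <- c("[\\d-]", "[[:punct:]]", "^.{1,2}$")

# One column per pattern: does the token match?
# perl = TRUE makes base R read \d inside brackets as "digit"
matches <- vapply(
  patterns,
  function(p) grepl(p, toy_tokens, perl = TRUE),
  logical(length(toy_tokens))
)
rownames(matches) <- toy_tokens

# A token is dropped as soon as it matches any of the patterns
kept <- toy_tokens[rowSums(matches) == 0]
kept
# "peace" "security"
```

Printing `matches` shows which pattern is responsible for dropping which token.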
Here, we also stem, remove stop words and lower case words.
mydfm <- dfm(
# Take the token object
token_ungd,
# Lower the words
tolower = TRUE,
# Get the stem of the words
stem = TRUE,
# Remove stop words
remove = stopwords("english")
)
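A note for readers on newer quanteda versions: as far as I know, the `stem` and `remove` arguments of `dfm()` are deprecated from version 3 onwards in favour of dedicated token functions. The sketch below shows the same three steps (lower-casing, stop word removal, stemming) on two made-up sentences. The texts and object names are assumptions, and the order of the steps is my choice here, not necessarily what `dfm()` did internally.

```r
library(quanteda)

# Hypothetical toy documents (not from the UN corpus)
toy_tokens <- tokens(c(
  d1 = "The nations are uniting for peace",
  d2 = "Peace unites the nations"
))

# 1. Lower-case the tokens
toy_tokens <- tokens_tolower(toy_tokens)
# 2. Remove English stop words
toy_tokens <- tokens_remove(toy_tokens, stopwords("english"))
# 3. Reduce the remaining tokens to their stems
toy_tokens <- tokens_wordstem(toy_tokens)

# Build the DFM from the pre-processed tokens
toy_dfm <- dfm(toy_tokens)
featnames(toy_dfm)
```

Doing the pre-processing on the tokens makes the order of the steps explicit, which matters when a stop word list contains stems rather than full words.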
Have a look at the DFM:
head(mydfm)
## Document-feature matrix of: 6 documents, 19,265 features (96.6% sparse) and 4 docvars.
## features
## docs way assembl hall inform suprem state council islam
## AFG_55_2000.txt 1 8 1 2 1 14 9 16
## AGO_55_2000.txt 5 2 0 0 0 2 5 0
## ALB_55_2000.txt 2 1 0 0 0 1 2 0
## AND_55_2000.txt 2 2 1 1 0 7 1 0
## ARE_55_2000.txt 0 2 0 1 0 10 6 4
## ARG_55_2000.txt 4 4 0 1 0 11 7 0
## features
## docs afghanistan self
## AFG_55_2000.txt 45 1
## AGO_55_2000.txt 0 2
## ALB_55_2000.txt 0 0
## AND_55_2000.txt 0 1
## ARE_55_2000.txt 1 1
## ARG_55_2000.txt 0 0
## [ reached max_nfeat ... 19,255 more features ]
Trim the data: remove all features that appear in fewer than 7.5% of the documents or in more than 90% of the documents.
mydfm.trim <-
dfm_trim(
mydfm,
min_docfreq = 0.075,
# min 7.5%
max_docfreq = 0.90,
# max 90%
docfreq_type = "prop"
)
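To see how trimming by document proportion works, here is a small sketch with four made-up documents. The thresholds (0.3 and 0.9) are chosen to fit the toy data and are not the ones used above; all object names are assumptions.

```r
library(quanteda)

# Hypothetical toy documents: "peace" is in all four, "security" in two,
# "development" and "climate" in one each
toy_dfm <- dfm(tokens(c(
  d1 = "peace security",
  d2 = "peace development",
  d3 = "peace climate",
  d4 = "peace security"
)))

# Share of documents in which each feature occurs
docfreq(toy_dfm) / ndoc(toy_dfm)

# Keep only features that occur in at least 30% and at most 90% of documents
toy_trimmed <- dfm_trim(
  toy_dfm,
  min_docfreq = 0.3,
  max_docfreq = 0.9,
  docfreq_type = "prop"
)
featnames(toy_trimmed)
# "security"
```

"peace" is dropped because it occurs in every document, and "development" and "climate" are dropped because each occurs in only a quarter of the documents.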
To get a first and easy visualization, we use word clouds:
quanteda::textplot_wordcloud(
# Load the DFM object
mydfm,
# Define the minimum number the words have to occur
min_count = 3,
# Define the maximum number of words to plot
max_words = 500,
# Define a color
color = wes_palette("Darjeeling1")
)
Word clouds are an illustrative way to show what you have in your data. You can also use word clouds beyond NLP analysis purposes, e.g., on a website.
To visualize the frequency of the top 30 features, we use a lollipop plot:
# Inspired here: https://bit.ly/37MCEHg
# Get the 30 top features from the DFM
freq_feature <- topfeatures(mydfm, 30)
# Create a data.frame for ggplot
data <- data.frame(list(
term = names(freq_feature),
frequency = unname(freq_feature)
))
# Plot the plot
data %>%
# Call ggplot
ggplot() +
# Add geom_segment (this will give us the lines of the lollipops)
geom_segment(aes(
x = reorder(term, frequency),
xend = reorder(term, frequency),
y = 0,
yend = frequency
), color = "grey") +
# Call a point plot with the terms on the x-axis and the frequency on the y-axis
geom_point(aes(x = reorder(term, frequency), y = frequency)) +
# Flip the plot
coord_flip() +
# Add labels for the axes
xlab("") +
ylab("Absolute frequency of the features")
We use the Lexicoder Policy Agendas dictionary. It captures major topics from the Comparative Policy Agendas project and is currently available in Dutch and English.
# Load the dictionary with quanteda's built-in function
dict <- dictionary(file = "../data/policy_agendas_english.lcd")
Using this dictionary, we now generate our DFM:
# Generate the DFM...
mydfm.un <- dfm(mydfm.trim,
# Based on country
groups = "country",
# And the previously loaded dictionary
dictionary = dict)
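The call above does two things at once: it sums the counts per country and it maps features to dictionary keys. The sketch below separates the two steps with `dfm_group()` and `dfm_lookup()`, which is the route I would expect to need in quanteda version 3 or later (an assumption on my side). The tiny dictionary `toy_dict`, the texts, and the country codes are made up for illustration.

```r
library(quanteda)

# Hypothetical two-topic dictionary with glob patterns
toy_dict <- dictionary(list(
  economy = c("trade", "econom*"),
  defence = c("army", "militar*")
))

# Hypothetical corpus: two speeches from country AAA, one from BBB
toy_corpus <- corpus(
  c(
    AAA_1 = "trade and the economy",
    AAA_2 = "our army protects trade",
    BBB_1 = "military spending and the army"
  ),
  docvars = data.frame(country = c("AAA", "AAA", "BBB"))
)

toy_dfm <- dfm(tokens(toy_corpus))

# Step 1: sum the feature counts per country
toy_grouped <- dfm_group(toy_dfm, groups = country)
# Step 2: replace features by the dictionary keys they match
toy_topics <- dfm_lookup(toy_grouped, dictionary = toy_dict)

as.matrix(toy_topics)
```

In the toy result, AAA gets three hits for economy ("trade" twice, "economy" once) and one for defence, while BBB gets two hits for defence only.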
Have a look at the new DFM:
head(mydfm.un)
## Document-feature matrix of: 6 documents, 28 features (35.7% sparse) and 1 docvar.
## features
## docs macroeconomics civil_rights healthcare agriculture forestry labour
## AFG 3 14 13 0 0 12
## AGO 11 4 10 0 0 7
## ALB 4 44 7 0 0 3
## AND 6 12 12 0 0 2
## ARE 15 11 3 0 0 11
## ARG 135 13 15 0 0 27
## features
## docs immigration education environment energy
## AFG 16 29 1 0
## AGO 16 4 9 1
## ALB 16 9 5 1
## AND 16 15 13 1
## ARE 13 1 13 2
## ARG 12 6 13 5
## [ reached max_nfeat ... 18 more features ]
Before we can turn to the plotting, we need to wrangle the data a bit to bring it into the right shape. These are basic tidyverse commands.
un.topics.pa <-
# Convert the DFM to a data frame
convert(mydfm.un, "data.frame") %>%
# Rename the doc_id to country
dplyr::rename(country = doc_id) %>%
# Select relevant variables
dplyr::select(country, macroeconomics, intl_affairs, defence) %>%
# Bring the data set in a different order
tidyr::gather(macroeconomics:defence, key = "topic", value = "share") %>%
# Group by country
group_by(country) %>%
dplyr::mutate(
# Generate the relative share of topics
share = share / sum(share),
# Make topic a factor
topic = haven::as_factor(topic))
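The reshaping step is easier to follow on a tiny example. The sketch below uses made-up counts for two hypothetical countries and `tidyr::pivot_longer()`, the newer replacement for `gather()`; the object names `toy_counts` and `toy_shares` are assumptions.

```r
library(dplyr)
library(tidyr)

# Hypothetical topic counts for two made-up countries
toy_counts <- data.frame(
  country = c("AAA", "BBB"),
  macroeconomics = c(3, 1),
  intl_affairs = c(1, 1),
  defence = c(0, 2)
)

toy_shares <- toy_counts %>%
  # From wide (one column per topic) to long (one row per country-topic pair)
  pivot_longer(
    cols = macroeconomics:defence,
    names_to = "topic",
    values_to = "share"
  ) %>%
  # Divide each count by the country total to get shares between 0 and 1
  group_by(country) %>%
  mutate(share = share / sum(share)) %>%
  ungroup()

toy_shares
```

Because the shares are computed within each country, they add up to 1 per country, which is what the stacked bar plot relies on.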
Based on this data set, we now generate the plot.
# Generate the plot
un.topics.pa %>%
# We have country on the x-axis and the share on the y-axis, we color and fill by topic
ggplot(aes(x = country, y = share, colour = topic, fill = topic)) +
# Call the `geom_bar`
geom_bar(stat = "identity") +
# Define the fill colors and the labels in the legend
scale_fill_manual(
values = wes_palette("Darjeeling1"),
labels = c("Macro-economic", "International affairs", "Defence")
) +
# Same for the colors
scale_color_manual(
values = wes_palette("Darjeeling1"),
labels = c("Macro-economic", "International affairs", "Defence")
) +
# Add a title
ggtitle("Distribution of PA topics in the UN General Debate corpus") +
# And add x-axis and y-axis labels
xlab("") +
ylab("Topic share") +
# And last do some tweaking with the theme
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
Here, we need to define more stopwords (create a manual list of them) to make sure that we do not bias our results. I have a non-exhaustive list here:
# Define stopwords
UNGD_stopwords <-
c(
"good bye",
"good morning",
"unit",
"nation",
"res",
"general",
"assemb",
"mr",
"greet",
"thank",
"congratulat",
"sir",
"per",
"cent",
"mdgs",
"soo",
"han",
"ó",
"g",
"madam",
"ncds",
"sdgs",
"pv",
"isil",
"isi",
"f",
"fifti",
"sixtieth",
"annan",
"kofi",
"fifth",
"fourth",
"first",
"second",
"third",
"sixth",
"seventh",
"eighth",
"ninth",
"tenth",
"seventieth",
"jeremić",
"agenda",
"obama",
"julian",
"sergio",
"mello",
"septemb",
"document",
"plenari",
"jean",
"eliasson",
"anniversari",
"vieira",
"haya",
"rash",
"treki"
)
To apply it, we go back to step 4 and create the DFM again.
We now pass the list to dfm():
# Remove self-defined stopwords
mydfm_sentiment <- dfm(
# Select the token object
token_ungd,
# Lower the words
tolower = TRUE,
# Stem the words
stem = TRUE,
# Remove stop words and self-defined stop words
remove = c(UNGD_stopwords, stopwords("english"))
)
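Two things are worth checking with a list like this. Many entries are stems ("unit", "nation", "congratulat"), so they can only match if they are compared against stemmed features, and multi-word entries such as "good bye" cannot match a single token. Whether `dfm()` removes features before or after stemming may depend on the quanteda version, so the sketch below makes the order explicit by stemming first and removing afterwards with `dfm_remove()`. The sentence and the short stop list are made up for illustration.

```r
library(quanteda)

# Hypothetical toy sentence (not from the UN corpus)
toy_tokens <- tokens(
  "The United Nations General Assembly thanks the nations",
  remove_punct = TRUE
)

# Lower-case and stem first ...
toy_tokens <- tokens_wordstem(tokens_tolower(toy_tokens))
toy_dfm <- dfm(toy_tokens)

# ... and only then remove the stemmed stop words
toy_stopwords <- c("unit", "nation", "thank")
toy_clean <- dfm_remove(toy_dfm, pattern = toy_stopwords)

featnames(toy_clean)
```

Comparing `featnames(toy_dfm)` with `featnames(toy_clean)` shows directly which entries of the stop list actually matched.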
Trim the data: remove all features that appear in fewer than 7.5% of the documents or in more than 90% of the documents.
mydfm.trim <-
dfm_trim(
# Select the DFM object
mydfm_sentiment,
min_docfreq = 0.075,
# min 7.5%
max_docfreq = 0.90,
# max 90%
docfreq_type = "prop"
)
And now we get the sentiment :-) There are multiple ways to do this; we will use one that is built into quanteda. In any case, it is important to check the underlying dictionary (on what basis was it built?). The most frequently used dictionaries are:
sentimentr package, which my co-author, Dennis Hammerschmidt, and I use in our paper on sentiment at the UNGD.
We will go with the LSD dictionary here as it is already built into quanteda.
# Call a dictionary
dfmat_lsd <-
dfm(mydfm.trim,
dictionary =
data_dictionary_LSD2015[1:2])
We can look at the result first, choosing the first 5 documents:
head(dfmat_lsd, 5)
## Document-feature matrix of: 5 documents, 2 features (0.0% sparse) and 4 docvars.
## features
## docs negative positive
## AFG_55_2000.txt 84 68
## AGO_55_2000.txt 88 95
## ALB_55_2000.txt 42 105
## AND_55_2000.txt 53 63
## ARE_55_2000.txt 34 81
To better work with the data, we convert it to a data frame:
# Convert the DFM to a data frame so that
# we can calculate the share of positive
# and negative words
data <- convert(dfmat_lsd,
to = "data.frame")
To get more meaningful results, we do some last tweaks:
data %<>%
dplyr::mutate(
# Generate the number of total words
# (this counts only the words matched by the sentiment dictionary)
total_words = positive + negative,
# Generate the relative frequency
pos_perc = positive / total_words * 100,
neg_perc = negative / total_words * 100,
# Generate the net sentiment
net_perc = pos_perc - neg_perc
)
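As a worked example of the net sentiment calculation, the sketch below applies the same arithmetic in base R to the first three rows of the output shown above (AFG, AGO, ALB). The object name `toy_sentiment` is an assumption; the counts are copied from that output.

```r
# Counts copied from the first three documents shown above
toy_sentiment <- data.frame(
  doc_id = c("AFG_55_2000.txt", "AGO_55_2000.txt", "ALB_55_2000.txt"),
  negative = c(84, 88, 42),
  positive = c(68, 95, 105)
)

# Total number of words matched by the sentiment dictionary
toy_sentiment$total_words <- toy_sentiment$positive + toy_sentiment$negative

# Relative frequencies in percent
toy_sentiment$pos_perc <- toy_sentiment$positive / toy_sentiment$total_words * 100
toy_sentiment$neg_perc <- toy_sentiment$negative / toy_sentiment$total_words * 100

# Net sentiment: positive share minus negative share
toy_sentiment$net_perc <- toy_sentiment$pos_perc - toy_sentiment$neg_perc

round(toy_sentiment$net_perc, 2)
# -10.53   3.83  42.86
```

The net sentiment ranges from -100 (only negative matches) to 100 (only positive matches), and 0 means that both are equally frequent.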
# Generate country code and year
data %<>%
dplyr::mutate(# Define the country-code (it's all in the document ID)
ccode = str_sub(doc_id, 1, 3),
# Define the year (it's also in the document ID)
year = as.numeric(str_sub(doc_id, 8, 11))) %>%
# Drop all observations with "EU_" because they are not a single country
dplyr::filter(ccode != "EU_") %>%
# Drop the variable doc_id
dplyr::select(-doc_id)
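Extracting the country code and year by character position works as long as every document ID has a three-letter code and a two-digit session number. For an ID such as the hypothetical "EU_55_2000.txt", the fixed positions no longer line up. A sketch of an alternative, assuming the IDs always follow the pattern code_session_year.txt, is to split on the underscore:

```r
# One regular ID from the data and one hypothetical ID with a two-letter code
toy_ids <- c("AFG_55_2000.txt", "EU_55_2000.txt")

# Fixed positions: fine for the first ID, off by one for the second
substr(toy_ids, 8, 11)
# "2000" "000."

# Splitting on "_" does not depend on the length of the code
parts <- strsplit(sub("\\.txt$", "", toy_ids), "_", fixed = TRUE)
ccode <- vapply(parts, `[`, character(1), 1)
year <- as.integer(vapply(parts, `[`, character(1), 3))

ccode
# "AFG" "EU"
year
# 2000 2000
```

Since the data already carries country and year as document variables, taking them from there would be another option.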
We first get an overall impression by plotting the average net sentiment by continent over time:
data %>%
# Generate the continent for each country using the `countrycode()` command
dplyr::mutate(continent = countrycode(ccode, "iso3c", "continent", custom_match =
c("YUG" = "Europe"))) %>%
# We group by continent and year to generate the average sentiment by continent
# and year
group_by(continent, year) %>%
dplyr::mutate(avg = mean(net_perc)) %>%
# We now plot it
ggplot() +
# Using a line chart with year on the x-axis, the average sentiment by continent
# on the y-axis and colored by continent
geom_line(aes(x = year, y = avg, col = continent)) +
# Define the colors
scale_color_manual(name = "", values = wes_palette("Darjeeling1")) +
# Label the axes
xlab("Time") +
ylab("Average net sentiment")
And now we want to visualize the results in more detail :-)
data %>%
# Generate the country name for each country using the `countrycode()` command
dplyr::mutate(countryname = countrycode(ccode, "iso3c", "country.name")) %>%
# Filter and only select specific countries that we want to compare
dplyr::filter(countryname %in% c(
"Germany",
"France",
"United Kingdom",
"Norway",
"Spain",
"Sweden"
)) %>%
# Now comes the plotting part :-)
ggplot() +
# We do a bar plot that has the years on the x-axis and the level of the
# net-sentiment on the y-axis
# We also color it so that all the net sentiments greater than 0 get a
# different color
geom_col(aes(
x = year,
y = net_perc,
fill = (net_perc > 0)
)) +
# Here we define the colors as well as the labels and title of the legend
scale_fill_manual(
name = "Sentiment",
labels = c("Negative", "Positive"),
values = c("#C93312", "#446455")
) +
# Now we add the axes labels
xlab("Time") +
ylab("Net sentiment") +
# And do a facet_wrap by country to get a more meaningful visualization
facet_wrap(~ countryname)
We use the stm package here. quanteda DFMs can also be converted for use with other topic model packages, such as topicmodels for LDA.
library(stm) # Estimation of the Structural Topic Model
In a first step, we set the number of topics. In practice, the number of topics is often higher, but that comes at a cost: here, computational power. To keep things as fast as possible, we’ll pick 5 topics for our example.
# Assigns the number of topics
topic.count <- 5
# To make sure that we get the same results, we set a seed
set.seed(68159)
# Convert the trimmed DFM to a STM object
dfm2stm <- convert(mydfm.trim, to = "stm")
# Use this object to estimate the structural topic model
model.stm <- stm(
# Define the documents
documents = dfm2stm$documents,
# Define the words in the corpus
vocab = dfm2stm$vocab,
# Define the number of topics
K = topic.count,
# The neat thing about STM is that you can use meta data to inform your model
# (here we use country and year and rely heavily on the vignette of STM)
prevalence = ~ country + s(year),
# Define the data set that contains the meta data used in `prevalence` (remember, this is what is so great about STM!)
data = dfm2stm$meta,
# This defines the initialization method. "spectral" is the default and provides a deterministic
# initialization based on Arora et al. 2014 (it is in particular recommended if the number of
# documents is large)
init.type = "Spectral"
)
## Beginning Spectral Initialization
## Calculating the gram matrix...
## Finding anchor words...
## .....
## Recovering initialization...
## ...............
## Initialization complete.
## ....................................................................................................
## Completed E-Step (2 seconds).
## Completed M-Step.
## Completing Iteration 1 (approx. per word bound = -6.836)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 2 (approx. per word bound = -6.814, relative change = 3.106e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 3 (approx. per word bound = -6.807, relative change = 1.110e-03)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 4 (approx. per word bound = -6.803, relative change = 6.143e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 5 (approx. per word bound = -6.800, relative change = 4.018e-04)
## Topic 1: republ, right, democrat, cooper, respect
## Topic 2: global, chang, climat, sustain, island
## Topic 3: right, terror, region, one, war
## Topic 4: global, council, terror, region, right
## Topic 5: african, govern, africa, conflict, commit
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 6 (approx. per word bound = -6.798, relative change = 2.762e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 7 (approx. per word bound = -6.797, relative change = 1.994e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 8 (approx. per word bound = -6.796, relative change = 1.510e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 9 (approx. per word bound = -6.795, relative change = 1.173e-04)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 10 (approx. per word bound = -6.794, relative change = 9.379e-05)
## Topic 1: right, republ, democrat, respect, social
## Topic 2: global, chang, climat, sustain, island
## Topic 3: right, one, terror, war, palestinian
## Topic 4: global, council, region, right, terror
## Topic 5: african, govern, africa, conflict, session
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 11 (approx. per word bound = -6.794, relative change = 7.697e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 12 (approx. per word bound = -6.793, relative change = 6.401e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 13 (approx. per word bound = -6.793, relative change = 5.409e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 14 (approx. per word bound = -6.793, relative change = 4.636e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 15 (approx. per word bound = -6.792, relative change = 4.041e-05)
## Topic 1: right, republ, social, respect, govern
## Topic 2: global, chang, climat, sustain, island
## Topic 3: right, one, war, terror, palestinian
## Topic 4: council, region, global, right, cooper
## Topic 5: african, govern, africa, conflict, session
## ....................................................................................................
## Completed E-Step (2 seconds).
## Completed M-Step.
## Completing Iteration 16 (approx. per word bound = -6.792, relative change = 3.538e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 17 (approx. per word bound = -6.792, relative change = 3.155e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 18 (approx. per word bound = -6.792, relative change = 2.822e-05)
## ....................................................................................................
## Completed E-Step (2 seconds).
## Completed M-Step.
## Completing Iteration 19 (approx. per word bound = -6.792, relative change = 2.551e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 20 (approx. per word bound = -6.791, relative change = 2.322e-05)
## Topic 1: right, social, republ, govern, respect
## Topic 2: global, chang, climat, sustain, island
## Topic 3: right, one, war, terror, time
## Topic 4: region, council, global, right, cooper
## Topic 5: african, govern, africa, session, conflict
## ....................................................................................................
## Completed E-Step (2 seconds).
## Completed M-Step.
## Completing Iteration 21 (approx. per word bound = -6.791, relative change = 2.114e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 22 (approx. per word bound = -6.791, relative change = 1.942e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 23 (approx. per word bound = -6.791, relative change = 1.799e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 24 (approx. per word bound = -6.791, relative change = 1.706e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 25 (approx. per word bound = -6.791, relative change = 1.585e-05)
## Topic 1: right, social, govern, republ, respect
## Topic 2: global, chang, climat, sustain, island
## Topic 3: right, one, war, terror, time
## Topic 4: region, council, global, right, cooper
## Topic 5: african, govern, africa, session, conflict
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 26 (approx. per word bound = -6.791, relative change = 1.456e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 27 (approx. per word bound = -6.791, relative change = 1.344e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 28 (approx. per word bound = -6.790, relative change = 1.255e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 29 (approx. per word bound = -6.790, relative change = 1.169e-05)
## ....................................................................................................
## Completed E-Step (2 seconds).
## Completed M-Step.
## Completing Iteration 30 (approx. per word bound = -6.790, relative change = 1.088e-05)
## Topic 1: right, social, govern, respect, republ
## Topic 2: global, chang, climat, sustain, island
## Topic 3: right, one, war, terror, time
## Topic 4: region, council, right, global, cooper
## Topic 5: african, govern, africa, session, conflict
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Completing Iteration 31 (approx. per word bound = -6.790, relative change = 1.015e-05)
## ....................................................................................................
## Completed E-Step (1 seconds).
## Completed M-Step.
## Model Converged
As mentioned during the talk, the structural topic model can also include meta data to estimate the topics. You include them using the argument prevalence. If you want to know more about how it works, see this excellent vignette.
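The meta data used in `prevalence` comes from the object that `convert()` returns. The sketch below shows its structure on a made-up two-document corpus; the texts, the `year` values, and the object names are assumptions for illustration.

```r
library(quanteda)

# Hypothetical toy corpus with one document variable
toy_corpus <- corpus(
  c(d1 = "peace security peace", d2 = "climate change"),
  docvars = data.frame(year = c(2000, 2001))
)

# Convert the DFM to the input format expected by stm()
toy_stm <- convert(dfm(tokens(toy_corpus)), to = "stm")

# A list with the documents, the vocabulary, and the meta data
names(toy_stm)

# The document variables travel along in the `meta` element
toy_stm$meta
```

If I read the documentation correctly, `convert()` drops empty documents by default for this format, so it is worth checking that the number of converted documents still matches any texts you later pass along with the model.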
There are different ways to visualize the results. We’ll first go with the base plot, which shows the estimated share of each topic.
plot(
# Takes the STM object
model.stm,
# Define the type of plot
type = "summary",
# Define font size
text.cex = 0.5,
# Label the title
main = "STM topic shares",
# And the x-axis
xlab = "Share estimation"
)
To get more out of it and to learn more about topic 4, we can use findThoughts to get the plain text that is associated with the topic:
findThoughts(
# Your topic model
model.stm,
# The text data that you used
# to retrieve passages from
texts = un_data$text,
# Number of documents to be displayed here
n = 1,
# Topic number you are interested in
topics = 4
)
##
## Topic 4:
## At the outset, allow me to express our sincere gratitude
## for the honour of addressing the General Assembly from
## this rostrum. I bring greetings from His Excellency
## Mr. Gurbanguly Berdimuhamedov, President of
## Turkmenistan, who wishes the General Assembly the
## best of luck at this session. I congratulate Mr. John
## William Ashe on his election as President of the General
## Assembly at its sixty-eighth session and wish him all
## the best in fulfilling his forthcoming tasks. I would also
## like to thank Mr. Vuk Jeremi., President of the General
## Assembly at its sixty-seventh session, for his skill and
## effectiveness in that post.
##
## Turkmenistan considers this session to be an
## important phase in the process of consolidating the
## efforts of the international community to strengthen
## universal peace, stability and security by adopting
## meaningful decisions on sustainable development and
## to counter emerging challenges and threats. We believe
## that strict adherence to the principles and norms of the
## Charter of the United Nations is the main prerequisite
## for ensuring long-term peace and strategic stability.
##
## In that belief, Turkmenistan adheres to a steady
## and resolute policy of peace, good-neighbourliness and
## the active promotion of peacebuilding processes. As a
## matter of principle, we reject the use of military force as
## a tool of foreign policy and international relations. Our
## country is convinced that solutions based on the use
## of force are doomed to fail. They neither eliminate the
## causes of conflicts nor create conditions for adequate
## responses to the many issues that arise from military
## action. Therefore, at the heart of Turkmenistan’s policies
## is the will to resolve any situation by peaceful, political
## and diplomatic ways and means, which it considers to
## be the main legitimate resources available within the
## United Nations. This approach is based on our common
## goal to establish a world without conflict.
##
## At the sixty-sixth session of the General Assembly,
## Turkmenistan’s President launched an initiative aimed
## at the adoption of a United Nations declaration on
## prioritizing political and diplomatic ways and means
## for the resolution of international challenges. Today,
## the elaboration of such a document has become a top
## priority. Turkmenistan therefore reaffirms its firm
## desire to engage in a meaningful discussion on this
## initiative with all interested Member States. We are
## convinced that the adoption of such a declaration would
## help to expand and strengthen the legal basis for the
## work of the General Assembly, the Security Council
## and other United Nations entities dealing with issues
## relating to world peace, stability and security.
##
## The challenging processes unfolding in today’s
## world call for a responsible, thoughtful, effective
## and efficient approach on the part of the United
## Nations. That is also linked directly to the important
## challenges of disarmament. By playing an active role
## in the multilateral dialogue on disarmament issues, my
## Government is demonstrating its firm commitment to
## complying with the core international norms regulating
## the disarmament process and the non-proliferation
## of weapons of mass destruction through practical
## action. Following this course of action and taking
## into consideration the need to energize the discussion
## and meaningful consideration of disarmament issues,
## Turkmenistan proposes the convening in 2014 of a high-
## level international meeting on disarmament issues.
## We are prepared to create all the necessary conditions
## and to provide the appropriate infrastructure for this
## meeting in our capital city.
##
## Nowadays, problems related to strengthening
## peace and stability and to ensuring the stability of
## countries and nations are among the most important
## topics in global politics. Their resolution will depend
## primarily on the establishment and effective legal
## and organizational operationalization of international
## political cooperation. In this context, we advise the
## General Assembly at this session to embark on the
## consideration of issues relating to the improvement of
## various forms of multilateral interaction that could serve
## as a political platform for finding mutually acceptable
## decisions on urgent regional and international policy
## matters.
##
## It should be noted in that regard that the
## United Nations fulfils its purpose. For example, the
## establishment of United Nations preventive diplomacy
## centres in various regions of the world has become a
##
##
##
## highly effective form of joint work to strengthen security,
## prevent conflicts and eliminate their underlying causes.
## It is well known that the first such centre, the United
## Nations Regional Centre for Preventive Diplomacy in
## Central Asia, based in Ashgabat, opened in December
## 2007. In our view, the experience of creating new
## mechanisms and institutions aimed at forming a system
## of international interaction at the global and regional
## levels must and should be replicated by States Members
## of the United Nations.
##
## Taking into account the need to enhance the
## effectiveness of inter-State contact at the regional
## level, Turkmenistan has launched a forum of peace
## and cooperation, aimed at establishing a standing
## mechanism for political dialogue in Central Asia.
## We believe that the forum will contribute to the
## elaboration of consensus-based approaches to finding
## solutions to the most important issues relating to the
## present and future development of Central Asia and
## its neighbouring regions. Moreover, the forum could
## become the basis for the establishment of a consultative
## council of the Heads of State of Central Asia. We are
## convinced that the development of new formats for
## political interaction among States within the region,
## coupled with the effective functioning of United
## Nations regional structures, will provide a reliable
## foundation and stability for the entire architecture of
## inter-State relations in Central Asia.
##
## To a great extent, attaining the goals of
## comprehensive and universal security will depend on
## ensuring security in the sphere of energy. Furthermore,
## the achievement of that goal is one of the most
## important components of a stable world economy and
## serves to protect it against distortions and disruptions.
## In that connection, the development of an international
## mechanism that provides for a set of guarantees for the
## global energy supply is a task of paramount importance.
## It is also necessary to underscore the importance of
## the joint work and coordinated efforts of all Member
## States aimed at developing and adopting consolidated
## approaches to the solution of energy security issues.
##
## The establishment by the United Nations of a new
## universal international legal tool kit is a key element
## of that process. It should, in our view, consist of the
## following three major elements: a multilateral United
## Nations document providing the legal basis for
## relations in the area of the global supply of energy
## resources; a corresponding United Nations structure
## that would ensure the implementation of the provisions
## of the aforementioned document; and an international
## database designed for the collection and analysis of
## data on the implementation of international obligations
## assumed by the participating States.
##
## It is common knowledge that on 17 May 2013 the
## General Assembly adopted by consensus resolution
## 67/263, submitted on the initiative of Turkmenistan’s
## President, entitled “Reliable and stable transit of energy
## and its role in ensuring sustainable development and
## international cooperation”. The importance of that
## document lies primarily in the fact that it forms the
## basis for a global energy partnership that takes into
## account the interests of producer States, transit States
## and States that are consumers of energy resources.
##
## In accordance with the letter and the spirit of that
## resolution, our country proposes to Member States
## the establishment, during the current session of the
## Assembly, of an international group of experts for the
## development of a new mechanism for energy security.
## To that end, the Government of Turkmenistan proposes
## to convene an international meeting of experts on
## that topic in 2014. We are ready to engage in close
## cooperation with all Member States and the United
## Nations Secretariat with a view to organizing and
## holding such a forum.
##
## Currently the resolution of issues of security and
## sustainable development depends largely on the level
## of international cooperation in the important areas of
## transport and communications. The geo-economic
## potential of new transport and transit routes in the
## world is enormously significant. Such routes involve
## vast spaces and enormous human resources and
## attract considerable investments. All of that creates
## opportunities to transform the transport sector into
## one of the most important factors in sustainable
## development.
##
## Turkmenistan is convinced that the twenty-first-
## century transportation architecture provides the
## framework for a breakthrough in integration, in joining
## the common efforts of regions and in the pooling of
## resources and industrial and human potential. It is
## our firm conviction that the future belongs to such
## a combined system of transport communication,
## involving major international and regional maritime,
## road, railroad and air hubs, their optimal integration
## and the use of their specific advantages.
##
## The practical implementation of that idea
## became the subject of a high-level event on modality,
##
##
##
## interconnectivity and the post-2015 development
## programme, which was held in New York on
## 26 September. It was organized by the Government of
## Turkmenistan and the International Road Transport
## Union. The event focused on the search for effective
## solutions relating to the establishment of modern,
## diversified and safe transport infrastructure throughout
## the world.
##
## We consider it necessary to continue the multilateral
## dialogue on transport issues that was initiated during
## the current session of the General Assembly. In that
## connection, Turkmenistan would like to submit a
## proposal to host in 2014 in Ashgabat an international
## conference on the role of transport and transit corridors
## in ensuring international cooperation, stability and
## sustainable development.
##
## With regard to the achievement of the sustainable
## development goals, we believe that the greatest attention
## should be focused on promoting the economic interests
## of States, while maintaining an appropriate ecological
## balance and preventing harm to the environment. That,
## in turn, implies the use of cutting-edge environmental
## technologies and the development of innovative
## solutions for the preservation of nature. Preserving
## the significant environmental component of the global
## economic space has therefore become an integral part
## of its effectiveness.
##
## We highly value the efforts undertaken by the
## Secretary-General, as well as the successive actions
## of the international community at the United Nations
## Climate Change Conferences in Copenhagen and
## Cancun and during the seventeenth Conference of the
## Parties to the United Nations Framework Convention
## on Climate Change, held in Durban, which have
## gradually laid the foundations for the development
## of comprehensive decisions at the United Nations
## Conference on Sustainable Development.
##
## We look forward to the continuation of a constructive
## international dialogue on that topic during the sixty-
## eighth session of the Assembly. We are convinced that
## it is necessary to combine our efforts in that area at
## the international, regional and national levels, and to
## effectively coordinate the efforts of States with those of
## the United Nations.
##
## Taking into account the numerous aspects of the
## climate change issue, Turkmenistan wishes to state at the
## current session of the General Assembly that it stands
## ready to make its contribution to the strengthening of the
## role of multilateral international mechanisms aimed at
## preventing the negative consequences of global climate
## change. In particular, we refer to the need for enhancing
## the implementation of the provisions of the United
## Nations Convention to Combat Desertification. In that
## connection, we are prepared to host in Turkmenistan
## the Conference of the Parties to the United Nations
## Convention to Combat Desertification in 2014.
##
## Furthermore, our country would like to launch an
## initiative aimed at the establishment of a specialized
## entity, a subregional centre on technologies relating
## to climate change in Central Asia and the Caspian Sea
## basin. We believe that such an entity would help the
## countries of our regions to substantially strengthen
## their interaction in the sphere of environmental security
## and would contribute to the effective coordination of
## interregional efforts in that field.
##
## The challenges confronting the community
## of nations in the area of security and sustainable
## development cannot be resolved unless we find a
## solution to the humanitarian issues at the international
## level. In particular, we are referring to the serious global
## problem of the fate of refugees and stateless persons.
## As a permanent member of the Executive Committee of
## the High Commissioner’s Programme of the Office of
## the United Nations High Commissioner for Refugees,
## Turkmenistan has accumulated valuable experience in
## resolving the issues facing people who were forced to
## leave their home countries. Together with the Office of
## the United Nations High Commissioner for Refugees,
## we propose that all interested parties become familiar
## with Turkmenistan’s practical work in granting
## citizenship to refugees and stateless persons.
##
## In that connection, it would be advisable to work
## jointly with United Nations humanitarian agencies to
## develop an appropriate social programme. Moreover,
## taking into account the outcomes of the International
## Ministerial Conference of the Organization of Islamic
## Cooperation on Refugees in the Muslim World, held
## in Ashgabat in May 2012, we consider it necessary to
## develop long-term solutions to such issues, on the basis
## of generally recognized norms of international law.
## With a view to discussing those issues, we are ready
## to host in Turkmenistan in 2014 a high-level event in
## cooperation with the Office of the United Nations High
## Commissioner for Refugees.
##
## Today, as Member States actively discuss the
## role and place of the United Nations in international
##
##
##
## relations, Turkmenistan declares that constructive and
## multilateral cooperation with the United Nations is
## the top priority of its foreign policy strategy. In that
## connection, we believe that it is precisely the United
## Nations that is the main and universal international
## Organization, which adopts decisions concerning
## the most important issues of global development and
## comprehensive peace and security. Since its inception,
## the United Nations has demonstrated its role as the
## foundation of the entire system of international stability,
## through mechanisms to ensure justice and to resolve
## the most complex international problems.
##
## Similarly, we share the opinion of the Organization
## today that the issue of providing it with fresh impetus
## is increasingly relevant, in view of the rapidly changing
## realities of the modern world. Therefore, Turkmenistan
## supports a strengthened and expanded role for the
## United Nations at the global level.
##
## We are firmly convinced that international
## law and the provisions of the Charter of the United
## Nations — based on peace, equal rights and respect for
## nations, their rights and sovereignty — must remain
## the foundation of the world order in the twenty-first
## century.
You can also combine stm with tidytext to generate ggplot2 objects. We follow Julia Silge’s outline here. The plot shows, for each topic, the ten terms with the highest probability (beta) of belonging to that topic.
# Load tidytext
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
# Turn the STM object into a data frame. This is necessary so that we can work with it.
td_beta <- tidy(model.stm)
td_beta %>%
# Group by topic
group_by(topic) %>%
# Take the top 10 based on beta
top_n(10, beta) %>%
# Ungroup
ungroup() %>%
# Generate the variables topic and term
dplyr::mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) %>%
# And plot it
ggplot() +
# Using a bar plot with the terms on the x-axis, the beta on the y-axis, filled by topic
geom_col(aes(x = term, y = beta, fill = as.factor(topic)),
alpha = 0.8,
show.legend = FALSE) +
# Do a facet_wrap by topic
facet_wrap(~ topic, scales = "free_y") +
# And flip the plot
coord_flip() +
scale_x_reordered() +
# Label the y-axis and add a title and subtitle
labs(
y = expression(beta),
title = "Highest word probabilities for each topic",
subtitle = "Different words are associated with different topics"
) +
# And finally define the colors
scale_fill_manual(values = wes_palette("Darjeeling1"))
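To see what reorder_within() and scale_x_reordered() contribute to the plot above, here is a minimal sketch on made-up data. The tibble toy_beta and all of its values are hypothetical; they only stand in for the output of tidy(model.stm).

```r
library(dplyr)    # Data manipulation and the pipe
library(ggplot2)  # Plotting
library(tidytext) # reorder_within() and scale_x_reordered()

# Hypothetical topic-term probabilities (a stand-in for tidy(model.stm))
toy_beta <- tibble::tibble(
  topic = rep(c("Topic 1", "Topic 2"), each = 3),
  term  = c("peace", "security", "energy", "energy", "climate", "peace"),
  beta  = c(0.30, 0.20, 0.10, 0.25, 0.15, 0.05)
)

# reorder_within() orders the terms by beta separately within each topic.
# It does so by appending the topic to each term, so that the same term can
# sit at a different position in each facet.
toy_ordered <- toy_beta %>%
  mutate(term_ordered = reorder_within(term, beta, topic))

# Inspect the internal factor levels (term and topic glued together)
levels(toy_ordered$term_ordered)

# scale_x_reordered() strips the appended topic again for the axis labels
toy_plot <- ggplot(toy_ordered, aes(x = term_ordered, y = beta, fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  scale_x_reordered()

toy_plot
```

Without reorder_within(), a term such as "peace" could only have one position across all facets, so the bars would not be sorted within each topic.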
You can also visualize the results with a perspective plot, which contrasts two topics and shows which terms are more strongly associated with each of them. We again use a base R plot here.
plot(
# Access the STM object
model.stm,
# Select the type of plot
type = "perspectives",
# Select the topics
topics = c(4, 5),
# Define the title
main = "Putting two different topics in perspective")
It is always tricky to find the right number of topics (your k). The stm package comes with a handy function called searchK that allows you to evaluate different topic models and to find your k. We can again plot it using the plot() function.
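A minimal sketch of how searchK could be used is shown below. To keep it self-contained, it does not use the UN data from this tutorial but the inaugural speeches corpus that ships with quanteda (data_corpus_inaugural). The object names, the preprocessing choices, and the candidate values for k are assumptions made for illustration only.

```r
library(quanteda) # Quantitative Analysis of Textual Data
library(stm)      # Estimation of the Structural Topic Model

# Build a small document-feature matrix from a corpus that ships with quanteda
# (a stand-in for the UN speeches used in this tutorial)
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
dfm_demo <- dfm_trim(dfm(toks), min_docfreq = 5)

# Convert the dfm into the format that stm expects
stm_demo <- convert(dfm_demo, to = "stm")

# Fit one model per candidate k and collect diagnostics
# (the candidate values 3, 5, and 7 are arbitrary and chosen for speed)
set.seed(2023)
k_search <- searchK(
  documents = stm_demo$documents,
  vocab = stm_demo$vocab,
  K = c(3, 5, 7),
  init.type = "Spectral",
  verbose = FALSE
)

# Diagnostics for each k (e.g., held-out likelihood, semantic coherence, residuals)
k_search$results

# Plot the diagnostics against k
plot(k_search)
```

There is rarely a single best k. The diagnostics help to narrow down the candidates, but it is still worth reading the top terms and example documents for each candidate model before settling on one.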
To further evaluate your topic models, have a look at stminsights, LDAvis (use stm::toLDAvis to convert your STM model), and oolong.
Here’s a short code snippet that shows how LDAvis works. Both stminsights and LDAvis open interactive apps in your browser that allow you to explore your topic models.
library(LDAvis) # Interactive Visualization of Topic Models
toLDAvis(
# The topic model
mod = model.stm,
# The documents
docs = dfm2stm$documents)
The ShinyApp looks like this:
These are just the most basic supervised and unsupervised models that you can use in NLP. As you work more with textual data, you will see that the field offers much more, including document similarity, text generation, and even chatbots. You can approach all of these with the knowledge and the same simple steps that we have used here.
Quanteda
More on text mining and NLP
Sentiment analysis
Model validation
More general resource
University of Mannheim, cosima.meyer@uni-mannheim.de