EU State of the Union speeches
A text mining analysis
Introduction
Each September, the President of the European Commission gives a speech in the plenary session of the European Parliament reviewing the past year’s achievements and presenting the priorities for the next one. This speech is known as the EU State of the Union speech.
In this assignment we will apply some basic text mining techniques to extract useful information from the speeches delivered between 2010 and 2023. During this period, the European Commission had three different presidents, Barroso, Juncker and von der Leyen, so we’ll also be looking for similarities and differences across years and presidents. We’ll perform sentiment analysis to compare positivity and negativity across speeches, use TF-IDF to determine the most distinctive words of each year, and lastly apply topic modelling to classify the speeches by theme.
As an initial hypothesis, we expect substantial similarities in style and vocabulary between presidents because, although each may have a particular style and particular events to address, this type of discourse tends to follow a formal, structured scheme and consistently mentions key policy areas.
Moreover, in the sentiment analysis we expect greater negativity between 2010 and 2016 and in 2020-2022, the periods of the Eurozone debt crisis and of the COVID-19 pandemic and the Ukraine war, respectively.
Similarly, we expect the main topics of the speeches to reflect these main challenges faced by the Union (financial crisis, COVID, Ukraine), as well as, possibly, Brexit.
Library
Installing and loading packages.
rm(list = ls())
# install.packages("tidyverse")
# install.packages("tidytext")
# install.packages("wordcloud")
# install.packages("topicmodels")
# install.packages("scales")
# install.packages("patchwork")
# install.packages("quanteda")
# install.packages("rmdformats")
library(tidyverse)
library(tidytext)
library(wordcloud)
library(topicmodels)
library(scales)
library(patchwork)
library(quanteda)
Loading data
Here we read the speeches and assign to each the year in which it was held and the president who delivered it.
All the speeches were obtained from the European Commission’s website, and the speeches’ section headings and subheadings have been deleted.
speech_2010 <- read_file("speeches/2010.txt")
speech_2011 <- read_file("speeches/2011.txt")
speech_2012 <- read_file("speeches/2012.txt")
speech_2013 <- read_file("speeches/2013.txt")
speech_2015 <- read_file("speeches/2015.txt")
speech_2016 <- read_file("speeches/2016.txt")
speech_2017 <- read_file("speeches/2017.txt")
speech_2018 <- read_file("speeches/2018.txt")
speech_2020 <- read_file("speeches/2020.txt")
speech_2021 <- read_file("speeches/2021.txt")
speech_2022 <- read_file("speeches/2022.txt")
speech_2023 <- read_file("speeches/2023.txt")
speeches <- tibble(
year = c(2010, 2011, 2012, 2013, 2015, 2016, 2017, 2018, 2020, 2021, 2022, 2023),
text = c(
speech_2010, speech_2011, speech_2012, speech_2013,
speech_2015, speech_2016, speech_2017, speech_2018,
speech_2020, speech_2021, speech_2022, speech_2023))
speeches <- speeches |>
mutate(president = case_when(
year %in% c(2010, 2011, 2012, 2013) ~ "Barroso",
year %in% c(2015, 2016, 2017, 2018) ~ "Juncker",
year %in% c(2020, 2021, 2022, 2023) ~ "von der Leyen"))
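As an aside, the twelve read_file() calls and the manual tibble above could be condensed with purrr (a sketch, assuming the same speeches/ folder layout):
# Reading all speeches in one pass and assigning the president
years <- c(2010:2013, 2015:2018, 2020:2023)
speeches <- tibble(
  year = years,
  text = map_chr(years, \(y) read_file(paste0("speeches/", y, ".txt"))),
  president = case_when(
    year <= 2013 ~ "Barroso",
    year <= 2018 ~ "Juncker",
    TRUE ~ "von der Leyen"))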
Note that years 2014 and 2019 are missing. These are European election years, during which the Commission and the Parliament are typically in transition and not yet fully in office, so no speech was delivered.
Pre-processing
As in most text mining pipelines, the first step is tokenization. Then, since stop words do not add useful information to our analysis, we filter them out.
# Tokenizing by word
tokenized_speeches <- speeches |>
unnest_tokens(word, text, token = "words") |>
# Removing the possessive at the end of words (ex. country's becomes country)
mutate(word = gsub("'s$", "", word)) |>
mutate(word = gsub("’s$", "", word))
# Note that we need to remove the possessive for two slightly different apostrophes:
# because ' and ’ are different characters, we handle them separately
# Loading tidytext's stop word list into the environment
stop_words <- stop_words
# Removing stopwords
filtered_speeches <- tokenized_speeches |>
anti_join(stop_words, by = "word")
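As an aside, the two gsub() calls above could be merged into a single substitution with a character class covering both apostrophe variants:
# Equivalent single pass handling both ' and ’ (a sketch)
tokenized_speeches <- tokenized_speeches |>
  mutate(word = gsub("['’]s$", "", word))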
Analysis
Word frequency
Unigrams
What are the most used words in the speeches?
filtered_speeches |>
# Counting the words
count(word, sort = TRUE) |>
# Filtering only those used more than 100 times
filter(n > 100) |>
# Plotting the most used
ggplot(aes(n, reorder(word, n))) +
geom_col(fill = "#004494") +
labs(title = "Most used words in EU State of the Union speeches 2010-2023",
y = NULL,
x = "Number of times used") +
theme_minimal()
At a first look, we can observe that the most used words refer to Europe, the European Union and the Commission, as well as to the European people to whom these speeches are directed. Other words refer to key policy areas, mainly related to the economy.
A curious presence is the word "honourable". It appears because all Commission presidents use the expression "Honourable Members" (of the parliament) to address the members of the European Parliament, where the speech takes place. This expression is used repeatedly, often as the opening of several sentences or at least of each section of the speech.
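The {wordcloud} package loaded at the start has not been used so far; the same word counts could also be displayed as a word cloud. A minimal sketch, with an illustrative frequency cut-off:
# Word cloud of the most frequent (non-stop) words
word_counts <- filtered_speeches |>
  count(word, sort = TRUE)
wordcloud(words = word_counts$word, freq = word_counts$n,
          min.freq = 30, random.order = FALSE, colors = "#004494")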
We’ll now focus on differences across years and presidents.
# Counting the appearances of words by year
filtered_speeches |>
group_by(year) |>
count(word, sort = TRUE) |>
# Selecting only the most used word per year
slice_head(n = 1)
## # A tibble: 12 × 3
## # Groups: year [12]
## year word n
## <dbl> <chr> <int>
## 1 2010 europe 53
## 2 2011 europe 41
## 3 2012 european 90
## 4 2013 europe 70
## 5 2015 european 74
## 6 2016 europe 83
## 7 2017 european 74
## 8 2018 europe 79
## 9 2020 europe 58
## 10 2021 europe 51
## 11 2022 europe 36
## 12 2023 europe 63
We can see that "europe" or "european" is consistently the most used word every year.
What happens if we instead filter out these standard words? Which ones become the most used?
# Repeating the same process as before, but filtering out standard words
filtered_speeches |>
filter(!word %in% c("european", "europe", "eu", "union",
"commission", "parliament", "president")) |>
group_by(year) |>
count(word, sort = TRUE) |>
slice_head(n = 1)
## # A tibble: 12 × 3
## # Groups: year [12]
## year word n
## <dbl> <chr> <int>
## 1 2010 growth 26
## 2 2011 world 22
## 3 2012 political 33
## 4 2013 crisis 24
## 5 2015 euro 30
## 6 2016 means 18
## 7 2017 future 14
## 8 2018 world 18
## 9 2020 world 40
## 10 2021 time 22
## 11 2022 ukraine 22
## 12 2023 future 23
This gives us more insight. For example, it’s clear that the focus of 2022 was the war in Ukraine, while in 2020 the Commission underlined the global dimension of the pandemic we were witnessing. "future" was the most used word in 2017, in a speech which heavily discussed the path forward for the Union in the wake of the Brexit shock. "euro", "growth" and "crisis" are instead words that characterize the speeches closer to the financial and Eurozone crises.
We now turn our focus to each presidency and compare their most used words.
# Counting words and making a facet graph of the top 10 words by president
filtered_speeches |>
filter(!word %in% c("european", "europe", "eu", "union", "commission",
"parliament", "president", "europeans", "honourable")) |>
count(president, word, sort = TRUE) |>
group_by(president) |>
slice_max(n, n = 10) |>
ggplot(aes(n, reorder_within(word, n, president))) +
geom_col(fill = "#004494") +
facet_wrap(~ president, nrow = 3, scales = "free_y") +
scale_y_reordered() +
labs(
title = "Most used words in State of the Union speeches by EC President",
y = NULL,
x = "Number of times used") +
theme_minimal()
Again, here we can see how the Barroso presidency was strongly focused on overcoming the economic crisis. Juncker instead often used words of unity, because his presidency was marked by the Brexit referendum, the need to protect the Union from rising nationalism, and the search for European solutions to the refugee crisis. von der Leyen’s presidency is marked by the pandemic and the war in Ukraine, two global events that are reflected in the choice of words used in her speeches.
Now we will calculate the frequency of each word to compare the presidents’ styles. We will use von der Leyen’s speeches as the reference, since she is the current president, to see whether the earlier speeches differ from hers.
# Calculate word frequency proportions by president
frequency <- filtered_speeches |>
group_by(president) |>
# Count number of times each word appears for each president
count(word) |>
# Calculate relative frequency (proportion) of each word
mutate(proportion = n / sum(n)) |>
ungroup() |>
select(-n) |>
# Creating one column per president
pivot_wider(names_from = president,
values_from = proportion) |>
mutate(avg_proportion = rowMeans(across(c("Barroso", "Juncker", "von der Leyen")),
na.rm = TRUE)) |>
# Going back to long format for Barroso and Juncker
pivot_longer(`Barroso`:`Juncker`,
names_to = "president", values_to = "proportion") |>
arrange(desc(avg_proportion))
# Plot comparing von der Leyen's word frequencies vs. previous presidents'
ggplot(frequency, aes(x = proportion, y = `von der Leyen`,
color = abs(`von der Leyen` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 0.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 0.5) +
# Log scales
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001),
low = "darkslategray4", high = "gray75") +
facet_wrap(~ president, ncol = 2) +
theme_minimal() +
theme(legend.position="none") +
labs(y = "von der Leyen", x = NULL)
By looking at the graph we can observe that word frequency is quite similar between von der Leyen and the previous presidents. However, since during Barroso’s and Juncker’s presidencies the economic situation was the principal concern, words like "economic", "euro", "monetary" or "debt" had a higher frequency than in von der Leyen’s presidency.
To check from a quantitative perspective whether the style of the speeches is truly similar, we calculate the Pearson correlation coefficient.
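The chunk that produced the output below is not shown; a sketch of the calls, using the frequency data frame built above (rows where a word is missing for one of the two presidents are dropped):
# Correlation of Barroso's word proportions with von der Leyen's
cor.test(~ proportion + `von der Leyen`,
         data = frequency |> filter(president == "Barroso"))
# Correlation of Juncker's word proportions with von der Leyen's
cor.test(~ proportion + `von der Leyen`,
         data = frequency |> filter(president == "Juncker"))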
##
## Pearson's product-moment correlation
##
## data: proportion and von der Leyen
## t = 47.851, df = 1282, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7801341 0.8194811
## sample estimates:
## cor
## 0.8006694
##
## Pearson's product-moment correlation
##
## data: proportion and von der Leyen
## t = 54.791, df = 1570, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7926281 0.8266375
## sample estimates:
## cor
## 0.8103141
In both cases the correlation is greater than 0.8, which confirms that the word frequencies in Barroso’s and Juncker’s speeches are quite similar to von der Leyen’s. Additionally, Juncker’s speeches are slightly more similar to von der Leyen’s than Barroso’s are, which makes sense, as Barroso’s presidency is further removed in time from von der Leyen’s than Juncker’s.
Bigrams
What are the most used bigrams in the speeches?
# Keeping all bigrams that are not made up of 2 stopwords
speeches_bigrams <- speeches |>
unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
separate(bigram, c("word1", "word2"), sep = " ") |>
filter(!(word1 %in% stop_words$word & word2 %in% stop_words$word)) |>
unite("bigram", "word1", "word2", sep = " ")
speeches_bigrams |>
count(bigram, sort = TRUE)
## # A tibble: 30,796 × 2
## bigram n
## <chr> <int>
## 1 the european 261
## 2 the commission 189
## 3 honourable members 140
## 4 the world 137
## 5 european union 100
## 6 the euro 97
## 7 the eu 95
## 8 our union 90
## 9 in europe 88
## 10 commission will 72
## # ℹ 30,786 more rows
Not much information can be extracted from this. The only interesting bigrams are "honourable members", which, as mentioned before, is the way in which the speaker addresses the audience (the members of the European Parliament), and "commission will", which suggests that the speeches often serve as a platform for the Commission to outline its plans and future actions.
# Keeping bigrams where neither of the words is a stopword
speeches_bigrams <- speeches |>
unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
separate(bigram, c("word1", "word2"), sep = " ") |>
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word) |>
unite("bigram", "word1", "word2", sep = " ")
speeches_bigrams |>
count(bigram, sort = TRUE)
## # A tibble: 7,049 × 2
## bigram n
## <chr> <int>
## 1 european union 100
## 2 single market 58
## 3 european parliament 43
## 4 climate change 31
## 5 monetary union 29
## 6 president honourable 24
## 7 social market 20
## 8 market economy 18
## 9 european commission 17
## 10 european level 17
## # ℹ 7,039 more rows
When we filter out all the stopwords instead we see some more interesting collocations (meaningful sequences of words that co-occur more commonly in a given context than their individual word parts). Examples in our case include "climate change", a key challenge faced by the Commission, as well as "single market" and "monetary union", which are other crucial policy areas. Additionally, the EU’s major institutions are frequently mentioned (ex. "european parliament", "european commission" and of course "european union").
Sentiment analysis
We now focus on the sentiment that words in the speeches convey. We’ll carry out word-level sentiment analysis using several lexicons. We are aware of the several limitations of word-level sentiment analysis, but it serves our purpose of applying basic techniques.
# Most common negative words by bing
filtered_speeches |>
# joining the bing classification
left_join(get_sentiments("bing"), by = join_by(word)) |>
# filtering only negative words
filter(sentiment == "negative") |>
# counting and sorting
count(word, sort = TRUE) |>
# selecting the top 10
slice_head(n = 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 crisis 132
## 2 debt 24
## 3 risk 21
## 4 urgent 20
## 5 critical 17
## 6 hard 17
## 7 risks 17
## 8 vulnerable 16
## 9 difficult 15
## 10 issues 15
The most used words labelled as negative by the Bing lexicon are all related to some of the problems that the Union had to face in the last decade. The most used word is "crisis", which is not surprising, since the speeches were given in a period in which the Union was indeed facing several crises.
# Most common positive words by bing
filtered_speeches |>
left_join(get_sentiments("bing"), by = join_by(word)) |>
filter(sentiment == "positive") |>
count(word, sort = TRUE) |>
slice_head(n = 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 support 78
## 2 solidarity 70
## 3 strong 54
## 4 stronger 49
## 5 reform 45
## 6 fair 44
## 7 freedom 44
## 8 stability 40
## 9 trust 35
## 10 progress 33
On the other hand, Bing labels as positive words that deal with stability, solidarity, and the strengthening and reform of the Union.
We’ll now focus instead on emotion detection, looking in particular at fear and disgust. For this we’ll use another lexicon called NRC. First we’ll look at the most common words related to fear, and then at the most common words related to disgust.
# Fear
nrc_fear <- get_sentiments("nrc") |>
# selecting only the words labelled as fear
filter(sentiment == "fear")
filtered_speeches |>
left_join(nrc_fear, by = join_by(word)) |>
filter(sentiment == "fear") |>
count(word, sort = TRUE) |>
slice_head(n = 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 change 82
## 2 war 52
## 3 rule 43
## 4 confidence 32
## 5 asylum 30
## 6 pandemic 28
## 7 fight 27
## 8 risk 21
## 9 urgent 20
## 10 defend 19
Most of the above words can indeed be related to fear, as they recall negative events that have happened (ex. the war in Ukraine, the pandemic, the asylum crisis) or that could potentially happen ("risk", "urgent", "change").
# Disgust
nrc_disgust <- get_sentiments("nrc") |>
filter(sentiment == "disgust")
filtered_speeches |>
left_join(nrc_disgust, by = join_by(word)) |>
filter(sentiment == "disgust") |>
count(word, sort = TRUE) |>
slice_head(n = 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 weight 13
## 2 powerful 10
## 3 corruption 9
## 4 finally 9
## 5 bad 8
## 6 honest 8
## 7 terrorist 8
## 8 poverty 7
## 9 terrorism 7
## 10 blame 6
While some of the above words make sense (such as "terrorism", "poverty", "corruption"), other words are not that clear. For example, the association between "weight" and disgust, or "honest" and disgust, is unclear and questionable. These mismatches further highlight the limitations and the difficulties of using pre-defined lexicons for emotion detection and of single-word sentiment analysis in general.
Most negative year
Now we will calculate the ratio of negative words per year to check the validity of our earlier hypothesis about the most negative periods.
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
# creating a dataframe with the total number of words per year
wordcounts <- filtered_speeches %>%
group_by(year) %>%
summarize(words = n())
filtered_speeches %>%
# keeping only negative words
semi_join(bingnegative, by = join_by(word)) %>%
group_by(year) %>%
# counting how many negative words are there per year
summarize(negativewords = n()) %>%
# joining the info of how many total words
left_join(wordcounts, by = c("year")) %>%
# calculating the percentage of negative words over total words
mutate(ratio = negativewords/words) %>%
ungroup() %>%
arrange(desc(ratio))
## # A tibble: 12 × 4
## year negativewords words ratio
## <dbl> <int> <int> <dbl>
## 1 2013 116 2031 0.0571
## 2 2012 110 2280 0.0482
## 3 2021 111 2364 0.0470
## 4 2015 178 3794 0.0469
## 5 2018 91 1945 0.0468
## 6 2022 103 2253 0.0457
## 7 2016 105 2332 0.0450
## 8 2011 77 1828 0.0421
## 9 2020 130 3108 0.0418
## 10 2010 76 1850 0.0411
## 11 2017 95 2342 0.0406
## 12 2023 96 2675 0.0359
The most negative year is 2013, the one with the highest ratio of negative words. In absolute counts, 2015 seemed to be the most negative, but only because it is the longest speech and therefore contains the largest number of negative words; in relative terms, 2013 is more negative.
2013 marked the last year of service of the Barroso Commission, and the president gave a very ambitious speech about the future and the challenges that Europe would have to face; but he also recapped all the harshness and difficulties already endured during the sovereign debt crisis, culminating in the most negative speech of our collection.
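The same ranking can also be shown visually; a minimal sketch reusing bingnegative and wordcounts from the chunk above:
# Bar chart of the share of negative words per year
filtered_speeches |>
  semi_join(bingnegative, by = join_by(word)) |>
  count(year, name = "negativewords") |>
  left_join(wordcounts, by = "year") |>
  mutate(ratio = negativewords / words) |>
  ggplot(aes(factor(year), ratio)) +
  geom_col(fill = "#004494") +
  labs(title = "Share of negative (Bing) words per speech",
       x = NULL,
       y = "Negative words / total words") +
  theme_minimal()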
TF-IDF
TF-IDF helps us identify the most distinctive words in the speeches by weighting each word’s frequency within a speech against the number of speeches in which it appears.
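Concretely, for a term t appearing n_{t,d} times in speech d, bind_tf_idf() computes

$$\mathrm{tf\text{-}idf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\!\left(\frac{N}{\mathrm{df}(t)}\right),$$

where N = 12 is the number of speeches and df(t) is the number of speeches containing t. As a sanity check against the table below: "ai" appears in 2 of the 12 speeches, so its idf is ln(12/2) ≈ 1.79, while "ukraine" appears in 5, giving ln(12/5) ≈ 0.88.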
speeches_tf_idf <- tokenized_speeches |>
# Counting n. of appearances of words per year
count(year, word, sort = TRUE) |>
# Creating tf-idf
bind_tf_idf(term = word, document = year, n = n)
# Printing the words with the highest tf-idf
speeches_tf_idf |>
arrange(desc(tf_idf)) |>
print(n=20)
## # A tibble: 17,862 × 6
## year word n tf idf tf_idf
## <dbl> <chr> <int> <dbl> <dbl> <dbl>
## 1 2023 ai 17 0.00252 1.79 0.00452
## 2 2022 ukraine 22 0.00382 0.875 0.00334
## 3 2022 gas 12 0.00208 1.39 0.00289
## 4 2022 electricity 9 0.00156 1.79 0.00280
## 5 2012 parties 9 0.00149 1.79 0.00268
## 6 2022 russian 8 0.00139 1.79 0.00249
## 7 2012 federation 6 0.000996 2.48 0.00248
## 8 2015 refugee 17 0.00170 1.39 0.00236
## 9 2020 nextgenerationeu 17 0.00208 1.10 0.00229
## 10 2021 cyber 8 0.00123 1.79 0.00221
## 11 2022 ukrainian 9 0.00156 1.39 0.00216
## 12 2023 ukraine 16 0.00237 0.875 0.00208
## 13 2012 doubts 9 0.00149 1.39 0.00207
## 14 2016 000 5 0.000824 2.48 0.00205
## 15 2011 efsf 4 0.000796 2.48 0.00198
## 16 2023 clean 12 0.00178 1.10 0.00195
## 17 2021 soul 9 0.00139 1.39 0.00192
## 18 2018 patriotism 4 0.000771 2.48 0.00192
## 19 2018 surprise 4 0.000771 2.48 0.00192
## 20 2021 pandemic 11 0.00170 1.10 0.00186
## # ℹ 17,842 more rows
We can see that the words with the highest TF-IDF are all related to particular crises or challenges the Union had to face. Most of the top words come from speeches delivered in the 2020s, which implies that these words are used often in recent years but were used rarely or never before. This reflects the rapid changes in the world over the past few years, marked by unexpected events and crises. An example of these changes is "ai", the word with the highest tf-idf, which was used 17 times in the 2023 speech. In particular, many of the top words belong to the 2022 speech, further emphasizing the extraordinary nature of that year, with the Russian invasion of Ukraine and all the crises connected to it.
We’ll now look at the most distinctive word of each year:
speeches_tf_idf |>
# adding a column with the name of the president that delivered the speech
mutate(president = case_when(
year %in% c(2010, 2011, 2012, 2013) ~ "Barroso",
year %in% c(2015, 2016, 2017, 2018) ~ "Juncker",
year %in% c(2020, 2021, 2022, 2023) ~ "von der Leyen")) |>
group_by(year) |>
# selecting only the most distinctive word per year (highest tf-idf)
# if there is a tie all words are printed
slice_max(tf_idf)
## # A tibble: 16 × 7
## # Groups: year [12]
## year word n tf idf tf_idf president
## <dbl> <chr> <int> <dbl> <dbl> <dbl> <chr>
## 1 2010 exist 4 0.000908 1.39 0.00126 Barroso
## 2 2010 sound 4 0.000908 1.39 0.00126 Barroso
## 3 2011 efsf 4 0.000796 2.48 0.00198 Barroso
## 4 2012 parties 9 0.00149 1.79 0.00268 Barroso
## 5 2013 justus 4 0.000710 2.48 0.00176 Barroso
## 6 2013 lipsius 4 0.000710 2.48 0.00176 Barroso
## 7 2013 version 4 0.000710 2.48 0.00176 Barroso
## 8 2015 refugee 17 0.00170 1.39 0.00236 Juncker
## 9 2016 000 5 0.000824 2.48 0.00205 Juncker
## 10 2017 2019 8 0.00129 1.39 0.00179 Juncker
## 11 2018 patriotism 4 0.000771 2.48 0.00192 Juncker
## 12 2018 surprise 4 0.000771 2.48 0.00192 Juncker
## 13 2020 nextgenerationeu 17 0.00208 1.10 0.00229 von der Leyen
## 14 2021 cyber 8 0.00123 1.79 0.00221 von der Leyen
## 15 2022 ukraine 22 0.00382 0.875 0.00334 von der Leyen
## 16 2023 ai 17 0.00252 1.79 0.00452 von der Leyen
For some of these words, it’s clear why they are the most distinctive in their respective speeches. For instance, in 2015, "refugee" stands out as the most distinctive word, as Europe was indeed in the middle of the refugee crisis at that time. In 2020, "nextgenerationeu" is the top word, as that is the name of the recovery plan that the Commission had just approved. In 2022, "ukraine" is obviously the most distinctive word, given the Russian invasion, while in 2023, "ai" is prominent due to President von der Leyen’s discussion of artificial intelligence and the AI Act. However, for some words, the interpretation is more challenging. For example, "efsf" stands for the European Financial Stability Facility, which President Barroso mentioned frequently in 2011 in the context of strengthening this mechanism, established to address the sovereign debt crisis. For other words, the reason behind their importance, and what was meant when they were used, is even less clear. To help us with that, in the next section we are going to add some context around them.
Key words in context
To look at the context of the words that were not clear in the previous segment, we first need to create a {quanteda} corpus and then apply the kwic() function to see the phrases in which they were included.
speeches_corpus <- corpus(speeches$text, docvars = data.frame(year = speeches$year,
president = speeches$president))
summary(speeches_corpus)
## Corpus consisting of 12 documents, showing 12 documents:
##
## Text Types Tokens Sentences year president
## text1 1293 4878 299 2010 Barroso
## text2 1322 5634 324 2011 Barroso
## text3 1390 6644 371 2012 Barroso
## text4 1477 6353 341 2013 Barroso
## text5 2200 10994 528 2015 Juncker
## text6 1598 6752 413 2016 Juncker
## text7 1614 6881 395 2017 Juncker
## text8 1449 5810 323 2018 Juncker
## text9 1992 9020 518 2020 von der Leyen
## text10 1691 7201 481 2021 von der Leyen
## text11 1671 6451 439 2022 von der Leyen
## text12 1748 7525 484 2023 von der Leyen
# The chunks below do the following:
# 1 - subset our corpus object selecting only the year we are interested in
# 2 - tokenize the subset
# 3 - search for the word and look for its context (words that come before and after)
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2010")),
"exist", window = 8)
kwic_result
## Keyword-in-context with 4 matches.
## [text1, 1532] stronger single market for jobs. The opportunities | exist
## [text1, 2786] will find that their fundamental rights and obligations | exist
## [text1, 3331] duration of the next budget. Various options | exist
## [text1, 4745] will find that their fundamental rights and obligations | exist
##
## | . We have very high levels of unemployment
## | wherever they go. Everyone in Europe must
## | . I would like to look at a
## | wherever they go. I have made clear
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2010")),
"sound", window = 8)
kwic_result
## Keyword-in-context with 4 matches.
## [text1, 930] grow, we also need a strong and | sound |
## [text1, 1215] in place by the end of 2011. | Sound |
## [text1, 1254] We can have both. Honourable Members, | Sound |
## [text1, 3500] size and is negotiating further accessions. A | sound |
##
## financial sector. A sector that serves the
## government finances and responsible financial markets give us
## public finances are a means to an end
## currency, the euro, that is a
In 2010 the words with the highest TF-IDF were "exist" and "sound". The former was used in various contexts that were not very similar and do not provide us with a lot of insight, but the latter, "sound", is used in relation to the establishment of more solid economic and financial policies, and this makes sense in the context of the financial crisis.
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2012")),
"parties", window = 8)
kwic_result
## Keyword-in-context with 9 matches.
## [text3, 4165] also cannot be done without strengthening European political |
## [text3, 4178] have very often a real disconnect between political |
## [text3, 4186] parties in the capitals and the European political |
## [text3, 4213] often as if it were just between national |
## [text3, 4230] not see the name of the European political |
## [text3, 4244] we see a national debate between national political |
## [text3, 4257] we need a reinforced statute for European political |
## [text3, 4291] debate would be the presentation by European political |
## [text3, 4344] even clearer. I call on the political |
##
## parties | . Indeed, we have very often a
## parties | in the capitals and the European political parties
## parties | here in Strasbourg. This is why we
## parties | . Even in the European elections we do
## parties | on the ballot box, we see a
## parties | . This is why we need a reinforced
## parties | . I am proud to announce that the
## parties | of their candidate for the post of Commission
## parties | to commit to this step and thus to
In 2012, "parties" is the word we are looking at; it appeared at a significant rate in this speech in the context of the role of European political parties and their future candidacies for the 2014 elections.
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2013")),
"justus", window = 8)
kwic_result
## Keyword-in-context with 4 matches.
## [text4, 3318] still lack. Surely, you all know | Justus |
## [text4, 3321] Surely, you all know Justus Lipsius. | Justus |
## [text4, 3333] name of the Council building in Brussels. | Justus |
## [text4, 3436] the governments' representatives that meet at the | Justus |
##
## Lipsius. Justus Lipsius is the name of
## Lipsius is the name of the Council building
## Lipsius was a very influential 16th century humanist
## Lipsius building, show that determination, that
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2013")),
"version", window = 8)
kwic_result
## Keyword-in-context with 4 matches.
## [text4, 5488] will voters be presented with? The candid | version |
## [text4, 5493] ? The candid version, or the cartoon | version |
## [text4, 5505] or the facts? The honest, reasonable | version |
## [text4, 5512] reasonable version, or the extremist, populist | version |
##
## , or the cartoon version? The myths
## ? The myths or the facts? The
## , or the extremist, populist version?
## ? It's an important difference. I know
In this case, the three words were heavily mentioned within a small span of text. "lipsius" and "justus" appeared together, referring to a Council building, named after a 16th-century humanist, that Barroso used to make an analogy. The word "version" was distinctive in this document; it was used to refer to the version of reality voters would receive, supposedly from political parties campaigning for the 2014 elections.
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2016")),
"000", window = 8)
kwic_result
## Keyword-in-context with 5 matches.
## [text6, 1626] conflicts, which claim the lives of 170 | 000 |
## [text6, 1872] 1 billion we get in exports, 14 | 000 |
## [text6, 3401] its first year of operation. Over 200 | 000 |
## [text6, 3414] across Europe got loans. And over 100 | 000 |
## [text6, 4487] by 2020, to see the first 100 | 000 |
##
## people every year. Of course we still
## extra jobs are created across the EU.
## small firms and start-ups across Europe got loans
## people got new jobs, thanks to the
## young Europeans taking part. By voluntarily joining
The reason why "000" appears as the most distinctive word in 2016 is that several figures were given during the speech, and in those expressed in thousands the "000" was treated as a separate token. In fact, it has a high IDF because this tokenization quirk probably did not occur in the other transcriptions.
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2017")),
"2019", window = 8)
kwic_result
## Keyword-in-context with 8 matches.
## [text7, 6152] Presidencies of the Council between now and March | 2019 |
## [text7, 6228] for the second option. On 29 March | 2019 |
## [text7, 6292] the future of Europe. On 30 March | 2019 |
## [text7, 6335] just a few weeks later, in May | 2019 |
## [text7, 6389] holding the Presidency in the first half of | 2019 |
## [text7, 6401] a Special Summit in Romania on 30 March | 2019 |
## [text7, 6453] . My hope is that on 30 March | 2019 |
## [text7, 6650] wake up to this Union on 30 March | 2019 |
##
## , outlining where we should go from here
## , the United Kingdom will leave the European
## , we will be a Union of 27
## . Europeans have a date with democracy.
## , to organise a Special Summit in Romania
## . My wish is that this summit be
## , Europeans will wake up to a Union
## , then the European Union will be a
During the 2017 speech, "2019" was mentioned often, since it was the year in which the UK was initially supposed to leave the European Union, and it was also the year of the next European elections.
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2018")),
"patriotism", window = 8)
kwic_result
## Keyword-in-context with 4 matches.
## [text8, 401] more. We should embrace the kind of |
## [text8, 5630] us to reject unhealthy nationalism and embrace enlightened |
## [text8, 5638] patriotism. We should never forget that the |
## [text8, 5709] love your country is to love Europe. |
##
## patriotism | that is used for good, and never
## patriotism | . We should never forget that the patriotism
## patriotism | of the 21st Century is two-fold: both
## Patriotism | is a virtue. Unchecked nationalism is riddled
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2018")),
"surprise", window = 12)
kwic_result
## Keyword-in-context with 4 matches.
## [text8, 284] took the sunny, optimistic and peaceful continent of the time by
## [text8, 991] some, the agreement I struck with President Trump came as a
## [text8, 997] with President Trump came as a surprise. But it was no
## [text8, 1011] as whenever Europe speaks with one voice, there is never any
##
## | surprise | . In 1913, Europeans expected to live a lasting peace.
## | surprise | . But it was no surprise – as whenever Europe speaks with
## | surprise | – as whenever Europe speaks with one voice, there is never
## | surprise | . When Europe speaks with one voice, it can prevail.
In 2018, the word "patriotism" was used by Juncker in opposition to nationalism: the first was seen as something positive, while the second carried negative connotations. These were times of rising nationalism and anti-Europeanism across the Union, so it makes sense that the President would address such topics with elections coming up the following year. The word "surprise" was used multiple times in reference to an agreement with Trump, likely a trade agreement, that some politicians or Member States had not believed possible to reach.
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2021")),
"cyber", window = 8)
kwic_result
## Keyword-in-context with 7 matches.
## [text10, 4439] , from fighter jets, to drones and | cyber |
## [text10, 4500] we cannot talk about defence without talking about | cyber |
## [text10, 4535] should not just be satisfied to address the | cyber |
## [text10, 4546] but also strive to become a leader in | cyber |
## [text10, 4556] . It should be here in Europe where | cyber |
## [text10, 4569] . This is why we need a European | Cyber |
## [text10, 4582] legislation on common standards under a new European | Cyber |
##
## . But we have to keep thinking of
## . If everything is connected, everything can
## threat, but also strive to become a
## security. It should be here in Europe
## defence tools are developed. This is why
## Defence Policy, including legislation on common standards
## Resilience Act. So, we can do
Lastly, the word "cyber" stood out in 2021 because President von der Leyen talked about the growing cyber threats and the need for the EU to defend itself against them (cyber defence), likely also linked to a context of increasing digitalization during the pandemic. In this speech the Commission President also mentioned the creation of a "Cyber Resilience Act".
Zipf’s Law
Now we will look at how Commission Presidents use common and rare words. For that we will use Zipf’s law and assign a rank to each word: rank 1 goes to the most frequent word, and rarer words (lower TF) get higher rank numbers.
First, we’ll sort the words by TF, then generate a log-log graph for each president to observe deviations from typical language patterns. We’ll treat the middle section of each graph as representing standard language use, as it tends to be the most stable. To visualize this baseline, we’ll draw a reference line on each graph based on a linear regression extrapolating the trend of the middle section.
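In formula form, Zipf’s law states that a word’s frequency is inversely proportional to its rank:

$$f(r) \propto \frac{1}{r^{s}} \quad\Longrightarrow\quad \log_{10} f(r) = c - s\,\log_{10} r,$$

with the exponent s close to 1 for natural language. On a log-log plot this is a straight line with slope of about -1, which is what the regressions on the middle sections below estimate (fitted slopes between -0.97 and -0.94).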
speeches_tf_idf <- speeches_tf_idf |>
mutate(president = case_when(
year %in% c(2010, 2011, 2012, 2013) ~ "Barroso",
year %in% c(2015, 2016, 2017, 2018) ~ "Juncker",
year %in% c(2020, 2021, 2022, 2023) ~ "von der Leyen")) |>
group_by(year) |>
arrange(desc(tf)) |>
ungroup()
freq_by_rank_barroso <- speeches_tf_idf %>%
filter(president == "Barroso") |>
group_by(year) %>%
mutate(rank = row_number()) %>%
ungroup()
freq_by_rank_juncker <- speeches_tf_idf %>%
filter(president == "Juncker") |>
group_by(year) %>%
mutate(rank = row_number()) %>%
ungroup()
freq_by_rank_vonderleyen <- speeches_tf_idf %>%
filter(president == "von der Leyen") |>
group_by(year) %>%
mutate(rank = row_number()) %>%
ungroup()
We extract the line for Barroso:
freq_by_rank_barroso %>%
ggplot(aes(rank, tf, color = as_factor(year))) +
geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10() +
theme_minimal()
rank_subset <- freq_by_rank_barroso %>%
filter(rank < 500,
rank > 50)
#we use the linear model function to find numeric coefficients of relationship between tf and rank
lm(log10(tf) ~ log10(rank), data = rank_subset)
##
## Call:
## lm(formula = log10(tf) ~ log10(rank), data = rank_subset)
##
## Coefficients:
## (Intercept) log10(rank)
## -0.9514 -0.9393
p_barroso <- freq_by_rank_barroso %>%
ggplot(aes(rank, tf, color = as_factor(year))) +
#We add a line in the plot with the two coefficients we have found
geom_abline(intercept = -0.9514, slope = -0.9393,
color = "gray50", linetype = 2) +
geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10() +
theme_minimal() +
labs(
title = "Barroso"
) +
theme(plot.title = element_text(hjust = 0.5))
We extract the line for Juncker:
freq_by_rank_juncker %>%
ggplot(aes(rank, tf, color = as_factor(year))) +
geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10() +
theme_minimal()
rank_subset <- freq_by_rank_juncker %>%
filter(rank < 500,
rank > 50)
#we use the linear model function to find numeric coefficients of relationship between tf and rank
lm(log10(tf) ~ log10(rank), data = rank_subset)
##
## Call:
## lm(formula = log10(tf) ~ log10(rank), data = rank_subset)
##
## Coefficients:
## (Intercept) log10(rank)
## -0.9024 -0.9666
p_juncker <- freq_by_rank_juncker %>%
ggplot(aes(rank, tf, color = as_factor(year))) +
geom_abline(intercept = -0.9024, slope = -0.9666,
color = "gray50", linetype = 2) +
geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10() +
theme_minimal() +
labs(
title = "Juncker"
) +
theme(plot.title = element_text(hjust = 0.5))
We extract the line for von der Leyen:
freq_by_rank_vonderleyen %>%
ggplot(aes(rank, tf, color = as_factor(year))) +
geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10() +
theme_minimal()
rank_subset <- freq_by_rank_vonderleyen %>%
filter(rank < 500,
rank > 50)
lm(log10(tf) ~ log10(rank), data = rank_subset)
##
## Call:
## lm(formula = log10(tf) ~ log10(rank), data = rank_subset)
##
## Coefficients:
## (Intercept) log10(rank)
## -0.9549 -0.9498
p_von_der_leyen <- freq_by_rank_vonderleyen %>%
ggplot(aes(rank, tf, color = as_factor(year))) +
#we add a line in the plot with the two coefficients we have found
geom_abline(intercept = -0.9549, slope = -0.9498,
color = "gray50", linetype = 2) +
geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10() +
theme_minimal() +
labs(
title = "von der Leyen"
) +
theme(plot.title = element_text(hjust = 0.5))
# We use the patchwork library to display the three graphs at the same time
(p_barroso / p_juncker / p_von_der_leyen) +
plot_layout(ncol = 1,
heights = c(1, 1, 1))
Both Barroso and Juncker seem to share the same type of deviation from the standard use of language: their speeches fall below the line in the first section, so they use a lower percentage of the most common words. In the last section the speeches do not deviate from the line, i.e. they use the same percentage of rare words as many other collections of language. von der Leyen’s speeches also use a similar percentage of rare words and very common words, but they behave differently in the first section: words around rank 10 sit above the line, so we can say that von der Leyen uses more middle-to-low-rank words than standard language use would predict.
Topic modelling
Sparsity
speeches_dtm <- filtered_speeches |>
# Counting the appearances of each word in each speech
count(year, word, sort = TRUE) |>
# Transforming from tidy format to document-term-matrix
cast_dtm(document = year, term = word, value = n)
speeches_dtm
## <<DocumentTermMatrix (documents: 12, terms: 5539)>>
## Non-/sparse entries: 14158/52310
## Sparsity : 79%
## Maximal term length: 20
## Weighting : term frequency (tf)
We have 12 speeches containing a total of 5,539 different (non-stop) words. A sparsity of 79% means that 79% of the cells in the document-term matrix are zeros: 79% of the word-document combinations do not occur in the speeches we have analyzed.
While this sparsity might seem high, it can be much higher in other real-world text data (>90-95%). In our case, the sparsity is somewhat lower because these are political speeches made by the same institution (for several years represented by the same person), in the same context, and they therefore often revolve around recurring topics. This thematic consistency reduces the number of unique, non-overlapping words across documents, making the vocabulary of the speeches similar across the years, which in turn lowers the overall sparsity of our collection.
Another factor keeping our sparsity low is that we are analyzing a relatively small number of documents (12 speeches). In larger collections, sparsity tends to increase as new documents bring new unique words with them.
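As a quick sanity check, the reported sparsity can be recomputed directly from the matrix dimensions (a sketch; a DocumentTermMatrix stores only its non-zero entries, accessible via the $v slot):
# Share of zero cells in the document-term matrix
n_cells <- prod(dim(speeches_dtm))          # 12 * 5539 = 66468
n_zero <- n_cells - length(speeches_dtm$v)  # 66468 - 14158 = 52310
n_zero / n_cells                            # ~0.787, reported as 79%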
Topics
# Applying LDA with 4 topics
lda_speeches <- LDA(speeches_dtm, k = 4, control = list(seed = 777))
# Extracting the most likely terms to belong to each topic
terms(lda_speeches, k = 10)
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "europe" "european" "europe" "europe"
## [2,] "european" "europe" "european" "european"
## [3,] "union" "union" "union" "union"
## [4,] "commission" "commission" "world" "commission"
## [5,] "eu" "political" "people" "people"
## [6,] "world" "euro" "honourable" "time"
## [7,] "time" "world" "future" "world"
## [8,] "people" "crisis" "time" "future"
## [9,] "crisis" "eu" "global" "global"
## [10,] "europeans" "economic" "ukraine" "honourable"
As can be seen from the results above, most of the top terms are common across all topics. This is probably because they are also the most frequent words in the speeches overall and thus end up in every topic. To improve this, we will remove them by building a custom stop word list.
# Custom stop words
stop_custom <- tibble(word = c("european",
"europe",
"eu",
"union",
"commission",
"honourable",
"world",
"people",
"time",
"future"))
# Repeating the pre-processing steps
tidy_speeches <- speeches |>
unnest_tokens(word, text) |>
anti_join(stop_words, by = join_by(word)) |>
anti_join(stop_custom, by = join_by(word)) |>
count(year, word, sort = TRUE) |>
arrange(desc(n))
speeches_dtm <- tidy_speeches |>
cast_dtm(document = year, term = word, value = n)
lda_speeches <- LDA(speeches_dtm, k = 4, control = list(seed = 777))
# Note that the number of topics (k = 4) is the result of several refinements
terms(lda_speeches, k = 15)
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "crisis" "market" "parliament" "global"
## [2,] "euro" "economy" "president" "law"
## [3,] "economic" "global" "europeans" "countries"
## [4,] "political" "companies" "united" "climate"
## [5,] "growth" "forward" "euro" "security"
## [6,] "market" "ukraine" "single" "investment"
## [7,] "social" "digital" "law" "common"
## [8,] "national" "support" "national" "means"
## [9,] "financial" "nextgenerationeu" "elections" "companies"
## [10,] "citizens" "live" "democracy" "freedom"
## [11,] "policy" "values" "trade" "social"
## [12,] "single" "crisis" "citizens" "billion"
## [13,] "common" "industry" "continent" "support"
## [14,] "means" "energy" "europe's" "values"
## [15,] "greece" "data" "economic" "jobs"
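The output below shows the most likely topic for each year. The chunk that produced it is not shown; presumably it was a call to topics(), which returns the most likely topic per document (a sketch):
# Most likely topic for each speech (documents are named by year)
topics(lda_speeches)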
## 2010 2011 2012 2013 2015 2016 2017 2018 2020 2021 2022 2023
## 1 1 1 1 1 4 3 3 2 4 2 4
Our analysis is much more informative now. From the results of topic modelling it can be seen that, for the Barroso presidency from 2010 to 2013, as well as for the first of Juncker’s speeches in 2015, the main topic is the economic and financial turmoil related to the global financial crisis and the subsequent Eurozone crisis. This is topic 1.
Topic 2 is related to the pandemic (ex. digital, nextgenerationeu), as well as to the war in Ukraine (ex. ukraine, energy). The topic is prevalent in 2020 and 2022, the first speeches delivered after each of these two crises unfolded.
Topic 3 relates mainly to elections, a new-found unity, and the defence of the rule of law and democracy. These themes were all relevant in 2017 and 2018, when the EU had come to terms with the fact that Brexit was happening and Commission President Juncker was trying to set out a vision for the future of the post-Brexit European Union. While praising and highlighting the unity and commitment of the remaining EU countries, he was also warning against the worsening democratic conditions in some European states and stressing the need to defend the rule of law, thus underlining the importance of the forthcoming 2019 European elections.
Topic 4 is probably the most general topic, grouping together several of the European Commission’s key priorities of the last decade. These are speeches that come at times of difficulty for the Union, but not necessarily at the most acute point of a crisis (2016, 2021, 2023). The priorities mentioned by the Commission presidents range from global challenges such as climate change, to the need to defend the rule of law and the EU’s values, to the need to invest, create jobs and maintain a functioning common market for EU companies. Also important in the 2016, 2021 and 2023 speeches is the concept of security, which has been framed in increasingly broad ways: initially related to migrants and terrorism in 2016, it then expanded into an umbrella term covering social and economic security as well, becoming a key priority and concept of the von der Leyen Commission.
Conclusion
Our project analyzed the European Commission’s State of the Union speeches from 2010 to 2023 using text mining techniques. Throughout the study we were able to trace the last decade of the European Union’s history and the challenges and major events it has had to face.
The main finding of our analysis is that crises such as the Eurozone debt crisis, refugee crisis, the pandemic, and the war in Ukraine are often central topics and greatly shape the discourse. TF-IDF and topic modeling revealed how these challenges influenced the vocabulary and topics of each presidency, with speeches in the early 2010s centered on economic recovery, while speeches in the 2020s focused on the pandemic and the war in Ukraine.
However, since these speeches are a very formal occasion, they still tend to share a significant portion of their vocabulary and style across years and presidencies. The speeches primarily serve as a platform for the Commission to highlight legislative achievements, outline policy priorities, and address pressing challenges. Nonetheless, even in very difficult times, and when candidly addressing serious issues, Commission Presidents seem ultimately always to convey a message of unity and confidence in the European project, resulting in neutral to fairly positive speeches.