EU State of the Union speeches

A text mining analysis

Introduction

Each September, the President of the European Commission gives a speech in the plenary session of the European Parliament, reviewing the past year’s achievements and presenting the priorities for the year ahead. This speech is known as the EU State of the Union speech.

In this assignment we apply some basic text mining techniques to extract useful information from the speeches delivered between 2010 and 2023. During this period the European Commission had three different presidents, Barroso, Juncker and von der Leyen, so we’ll also look for similarities and differences across years and presidents. We’ll perform sentiment analysis to compare positivity and negativity across speeches, use TF-IDF to determine the most distinctive words of each year and, lastly, do topic modelling to classify the speeches by theme.

As an initial hypothesis, we expect to find substantial similarities in style and word choice across presidents because, although each may have a particular style and particular events to address, this type of discourse tends to follow a formal, structured scheme and consistently mentions key policy areas.

Moreover, in the sentiment analysis we expect to find greater negativity between 2010 and 2016 and in 2020-2022, corresponding to the Eurozone debt crisis and to the COVID-19 pandemic and the war in Ukraine, respectively.

Similarly, we expect the main topics of the speeches to reflect these main challenges faced by the Union (financial crisis, COVID, Ukraine), as well as, possibly, mentions of Brexit.

Libraries

Installing and loading packages.

rm(list = ls())
# install.packages("tidyverse")
# install.packages("tidytext")
# install.packages("wordcloud")
# install.packages("topicmodels")
# install.packages("scales")
# install.packages("patchwork")
# install.packages("quanteda")
# install.packages("rmdformats")
library(tidyverse)
library(tidytext)
library(wordcloud)
library(topicmodels)
library(scales)
library(patchwork)
library(quanteda)

Loading data

Here we read the speeches and assign to each the year in which it was delivered and the president who delivered it.

All the speeches were obtained from the European Commission’s website; section headings and subheadings have been deleted.

speech_2010 <- read_file("speeches/2010.txt") 
speech_2011 <- read_file("speeches/2011.txt")
speech_2012 <- read_file("speeches/2012.txt") 
speech_2013 <- read_file("speeches/2013.txt") 
speech_2015 <- read_file("speeches/2015.txt")
speech_2016 <- read_file("speeches/2016.txt") 
speech_2017 <- read_file("speeches/2017.txt") 
speech_2018 <- read_file("speeches/2018.txt") 
speech_2020 <- read_file("speeches/2020.txt") 
speech_2021 <- read_file("speeches/2021.txt") 
speech_2022 <- read_file("speeches/2022.txt") 
speech_2023 <- read_file("speeches/2023.txt") 

speeches <- tibble(
  year = c(2010, 2011, 2012, 2013, 2015, 2016, 2017, 2018, 2020, 2021, 2022, 2023),
  text = c(
    speech_2010, speech_2011, speech_2012, speech_2013,
    speech_2015, speech_2016, speech_2017, speech_2018,
    speech_2020, speech_2021, speech_2022, speech_2023))

speeches <- speeches |> 
  mutate(president = case_when(
    year %in% c(2010, 2011, 2012, 2013) ~ "Barroso",
    year %in% c(2015, 2016, 2017, 2018) ~ "Juncker",
    year %in% c(2020, 2021, 2022, 2023) ~ "von der Leyen"))

Note that years 2014 and 2019 are missing. These were European election years, during which the Commission and the Parliament are typically in transition and not yet fully in office, so no speech is delivered.

Pre-processing

As in most text mining pipelines, the first step is tokenization. Then, since stop words do not add useful information to our analysis, we filter them out.

# Tokenizing by word
tokenized_speeches <- speeches |> 
  unnest_tokens(word,text, token="words") |> 
  # Removing possessive endings (ex. country's becomes country)
  mutate(word = gsub("'s$", "", word)) |> 
  mutate(word = gsub("’s$", "", word))
# The straight (') and curly (’) apostrophes are different characters,
# so each one needs its own substitution

# Loading the stop words dataset bundled with tidytext
data(stop_words)

# Removing stopwords
filtered_speeches <- tokenized_speeches |> 
  anti_join(stop_words, by = "word")

Analysis

Word frequency

Unigrams

What are the most used words in the speeches?

filtered_speeches |> 
  # Counting the words
  count(word, sort = TRUE) |> 
  # Filtering only those used more than 100 times
  filter(n > 100) |>  
  # Plotting the most used
  ggplot(aes(n, reorder(word, n))) +
  geom_col(fill = "#004494") +
  labs(title = "Most used words in EU State of the Union speeches 2010-2023",
       y = NULL, 
       x = "Number of times used") +
  theme_minimal()

At first glance, we can observe that the most used words refer to Europe, the European Union and the Commission, as well as to the European people to whom these speeches are addressed. Other words refer to key policy areas, mainly related to the economy.

A curious presence is the word honourable. It appears because all Commission presidents use the expression “Honourable Members” (of the parliament) to address the members of the European Parliament, where the speech takes place. The expression is used repeatedly, often to open sentences or, at least, each section of the speech.
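As a quick sanity check, the phrase can be counted directly in the raw text with stringr. A minimal sketch on a hypothetical two-row corpus standing in for the real speeches tibble:

```r
library(dplyr)
library(stringr)

# Hypothetical mini-corpus standing in for the real speeches tibble
mini_speeches <- tibble(
  year = c(2010, 2022),
  text = c("Honourable Members, we begin. Honourable Members, we continue.",
           "Honourable Members, Europe stands together."))

# Counting occurrences of the phrase in each speech
mini_speeches |>
  mutate(n_honourable = str_count(str_to_lower(text), "honourable members"))
# year 2010 -> 2 occurrences, year 2022 -> 1
```

Running the same `str_count` call over the real speeches column would give the per-year counts of the expression.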

We’ll now focus on differences across years and presidents.

# Counting the appearances of words by year
filtered_speeches |> 
  group_by(year) |> 
  count(word, sort = TRUE) |> 
  # Selecting only the most used word per year
  slice_head(n = 1)
## # A tibble: 12 × 3
## # Groups:   year [12]
##     year word         n
##    <dbl> <chr>    <int>
##  1  2010 europe      53
##  2  2011 europe      41
##  3  2012 european    90
##  4  2013 europe      70
##  5  2015 european    74
##  6  2016 europe      83
##  7  2017 european    74
##  8  2018 europe      79
##  9  2020 europe      58
## 10  2021 europe      51
## 11  2022 europe      36
## 12  2023 europe      63

We can see that europe or european is consistently the most used word every year.

What happens if we filter out these standard words instead? Which ones become the most used?

# Repeating the same process as before, but filtering out standard words
filtered_speeches |>
  filter(!word %in% c("european", "europe", "eu", "union", 
                      "commission", "parliament", "president")) |> 
  group_by(year) |> 
  count(word, sort = TRUE) |> 
  slice_head(n = 1)
## # A tibble: 12 × 3
## # Groups:   year [12]
##     year word          n
##    <dbl> <chr>     <int>
##  1  2010 growth       26
##  2  2011 world        22
##  3  2012 political    33
##  4  2013 crisis       24
##  5  2015 euro         30
##  6  2016 means        18
##  7  2017 future       14
##  8  2018 world        18
##  9  2020 world        40
## 10  2021 time         22
## 11  2022 ukraine      22
## 12  2023 future       23

This gives us more insight. For example, it is clear that the focus of 2022 was the war in Ukraine, while in 2020 the Commission underlined the global dimension of the pandemic we were witnessing. future was the most used word in 2017, in a speech that heavily discussed the path forward for the Union in the wake of the Brexit shock. euro, growth and crisis are instead words that characterize the speeches delivered closer to the financial and Eurozone crises.

We now turn our focus to each presidency and compare their most used words.

# Counting words and making a facet graph of the top 10 words by president
filtered_speeches |> 
  filter(!word %in% c("european", "europe", "eu", "union", "commission",
                      "parliament", "president", "europeans", "honourable")) |> 
  count(president, word, sort = TRUE) |> 
  group_by(president) |> 
  slice_max(n, n = 10) |> 
  ggplot(aes(n, reorder_within(word, n, president))) +
  geom_col(fill = "#004494") +
  facet_wrap(~ president, nrow = 3, scales = "free_y") +
  scale_y_reordered() +
  labs(
    title = "Most used words in State of the Union speeches by EC President",
    y = NULL, 
    x = "Number of times used") +
  theme_minimal()

Again we can see how the Barroso presidency was strongly focused on overcoming the economic crisis. Juncker instead often used words of unity, because his presidency was marked by the Brexit referendum, by the need to protect the Union from rising nationalism, and by the search for European solutions to the refugee crisis. von der Leyen’s presidency is marked by the pandemic and the war in Ukraine, two global events reflected in the choice of words in her speeches.

Now we will calculate the frequency of each word to compare the presidents’ styles. We will use von der Leyen’s speeches as the reference, since she is the current president, to see whether they differ from the previous ones.

# Calculate word frequency proportions by president
frequency <- filtered_speeches |> 
  group_by(president) |> 
   # Count number of times each word appears for each president
  count(word) |> 
  # Calculate relative frequency (proportion) of each word
  mutate(proportion = n / sum(n)) |> 
  ungroup() |> 
  select(-n) |> 
  # Creating one column per president
  pivot_wider(names_from = president, 
              values_from = proportion) |> 
  mutate(avg_proportion = rowMeans(across(c("Barroso", "Juncker", "von der Leyen")), 
                                   na.rm = TRUE)) |>
  # Going back to long format for Barroso and Juncker
  pivot_longer(`Barroso`:`Juncker`,
               names_to = "president", values_to = "proportion") |> 
  arrange(desc(avg_proportion))

# Plot comparing von der Leyen's word frequencies vs. previous presidents
ggplot(frequency, aes(x = proportion, y = `von der Leyen`, 
                      color = abs(`von der Leyen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 0.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 0.5) +
  # Log scales
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), 
                       low = "darkslategray4", high = "gray75") +
  facet_wrap(~ president, ncol = 2) +
  theme_minimal() + 
  theme(legend.position="none") +
  labs(y = "von der Leyen", x = NULL)

Looking at the graph, we can observe that word frequency is quite similar between von der Leyen and the previous presidents. However, since in Barroso’s and Juncker’s presidencies the economic situation was the principal concern, words like economic, euro, monetary or debt had a higher frequency than in von der Leyen’s presidency.

To check from a quantitative perspective whether the styles of the speeches are truly similar, we calculate the Pearson correlation coefficient.

cor.test(data = frequency[frequency$president == "Barroso",],
         ~ proportion + `von der Leyen`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and von der Leyen
## t = 47.851, df = 1282, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7801341 0.8194811
## sample estimates:
##       cor 
## 0.8006694
cor.test(data = frequency[frequency$president == "Juncker",],
         ~ proportion + `von der Leyen`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and von der Leyen
## t = 54.791, df = 1570, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7926281 0.8266375
## sample estimates:
##       cor 
## 0.8103141

In both cases the correlation is greater than 0.8, so the word frequencies in Barroso’s and Juncker’s speeches are indeed quite similar to von der Leyen’s. Additionally, Juncker’s speeches are slightly more similar to von der Leyen’s than Barroso’s, which makes sense, as Barroso’s presidency is further removed in time from von der Leyen’s than Juncker’s is.

Bigrams

What are the most used bigrams in the speeches?

# Keeping all bigrams that are not made up of 2 stopwords
speeches_bigrams <- speeches |> 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |> 
  separate(bigram, c("word1", "word2"), sep = " ") |>
  filter(!(word1 %in% stop_words$word & word2 %in% stop_words$word)) |> 
  unite("bigram", "word1", "word2", sep = " ")

speeches_bigrams |> 
  count(bigram, sort = TRUE)
## # A tibble: 30,796 × 2
##    bigram                 n
##    <chr>              <int>
##  1 the european         261
##  2 the commission       189
##  3 honourable members   140
##  4 the world            137
##  5 european union       100
##  6 the euro              97
##  7 the eu                95
##  8 our union             90
##  9 in europe             88
## 10 commission will       72
## # ℹ 30,786 more rows

Not much information can be extracted from this. The only interesting bigrams are honourable members, which, as mentioned before, is how the speaker addresses the audience (the members of the European Parliament), and commission will, which suggests that the speeches often serve as a platform for the Commission to outline its plans and future actions.

# Keeping bigrams where neither of the words is a stopword
speeches_bigrams <- speeches |> 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |> 
  separate(bigram, c("word1", "word2"), sep = " ") |>
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) |> 
  unite("bigram", "word1", "word2", sep = " ")

speeches_bigrams |> 
  count(bigram, sort = TRUE)
## # A tibble: 7,049 × 2
##    bigram                   n
##    <chr>                <int>
##  1 european union         100
##  2 single market           58
##  3 european parliament     43
##  4 climate change          31
##  5 monetary union          29
##  6 president honourable    24
##  7 social market           20
##  8 market economy          18
##  9 european commission     17
## 10 european level          17
## # ℹ 7,039 more rows

When we filter out all the stopwords, we instead see some more interesting collocations (meaningful sequences of words that co-occur more often in a given context than their individual parts would suggest). Examples in our case include climate change, a key challenge faced by the Commission, as well as single market and monetary union, which are other crucial policy areas. Additionally, the EU’s major institutions are frequently mentioned (ex. european parliament, european commission and, of course, european union).
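A common way to quantify how strongly a bigram’s words “stick together” is pointwise mutual information, PMI = log(p(w1, w2) / (p(w1) p(w2))). A minimal sketch on a hypothetical toy sentence (for the real corpus one would start from speeches instead):

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy text
toy <- tibble(text = "the single market and the single market need a single currency")

# Unigram counts and total number of words
words <- toy |> unnest_tokens(word, text)
word_counts <- words |> count(word)
n_words <- nrow(words)

# Bigram counts; n_words - 1 is the number of bigram slots
toy |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  count(bigram) |>
  separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) |>
  left_join(word_counts, by = c("word1" = "word"), suffix = c("", "_w1")) |>
  left_join(word_counts, by = c("word2" = "word"), suffix = c("", "_w2")) |>
  # PMI: log of observed bigram probability over the product
  # of the individual word probabilities
  mutate(pmi = log((n / (n_words - 1)) /
                   ((n_w1 / n_words) * (n_w2 / n_words)))) |>
  arrange(desc(pmi))
```

High-PMI pairs are those that appear together far more often than their individual frequencies would predict, which is exactly what makes collocations like single market stand out.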

Sentiment analysis

We now focus on the sentiment that words in the speeches convey. We’ll carry out word-level sentiment analysis using several lexicons. We are aware of the limitations of word-level sentiment analysis, but it serves our purpose of applying basic techniques.
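One well-known limitation worth keeping in mind is that word-level scoring ignores negation and context. A minimal sketch with a tiny hand-made lexicon (hypothetical, standing in for the bing lexicon):

```r
library(dplyr)
library(tidytext)

# Tiny hypothetical lexicon standing in for get_sentiments("bing")
mini_lexicon <- tibble(word = c("good", "crisis"),
                       sentiment = c("positive", "negative"))

tibble(text = "this is not good") |>
  unnest_tokens(word, text) |>
  inner_join(mini_lexicon, by = "word")
# Returns one row: word "good", sentiment "positive" --
# the negation in "not good" is lost once the text is split into single words
```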

# Most common negative words by bing
filtered_speeches |> 
  # joining the bing classification
  left_join(get_sentiments("bing"), by = join_by(word)) |> 
  # filtering only negative words
  filter(sentiment == "negative") |> 
  # counting and sorting
  count(word, sort = TRUE) |> 
  # selecting the top 10
  slice_head(n = 10)
## # A tibble: 10 × 2
##    word           n
##    <chr>      <int>
##  1 crisis       132
##  2 debt          24
##  3 risk          21
##  4 urgent        20
##  5 critical      17
##  6 hard          17
##  7 risks         17
##  8 vulnerable    16
##  9 difficult     15
## 10 issues        15

The most used words labelled as negative by the Bing lexicon are all related to problems the Union had to face in the last decade. The most used is crisis, which is not surprising, since the speeches were given in a period when the Union was indeed facing several crises.

# Most common positive words by bing
filtered_speeches |> 
  left_join(get_sentiments("bing"), by = join_by(word)) |> 
  filter(sentiment == "positive") |> 
  count(word, sort = TRUE) |> 
  slice_head(n = 10)
## # A tibble: 10 × 2
##    word           n
##    <chr>      <int>
##  1 support       78
##  2 solidarity    70
##  3 strong        54
##  4 stronger      49
##  5 reform        45
##  6 fair          44
##  7 freedom       44
##  8 stability     40
##  9 trust         35
## 10 progress      33

On the other hand, Bing labels as positive words that deal with stability, solidarity and with strengthening and reforming the Union.

We’ll now focus instead on emotion detection, looking in particular at fear and disgust. For this we’ll use another lexicon, called nrc. First we’ll look at the most common words related to fear, and then at the most common words related to disgust.

# Fear
nrc_fear <- get_sentiments("nrc") |> 
  # selecting only the words labelled as fear
  filter(sentiment == "fear")
filtered_speeches |> 
  left_join(nrc_fear, by = join_by(word)) |> 
  filter(sentiment == "fear") |> 
  count(word, sort = TRUE) |> 
  slice_head(n = 10)
## # A tibble: 10 × 2
##    word           n
##    <chr>      <int>
##  1 change        82
##  2 war           52
##  3 rule          43
##  4 confidence    32
##  5 asylum        30
##  6 pandemic      28
##  7 fight         27
##  8 risk          21
##  9 urgent        20
## 10 defend        19

Most of the above words can indeed be related to fear, as they recall frightening things that have happened (ex. the war in Ukraine, the pandemic, the asylum crisis) or that could potentially happen (risk, urgent, change).

# Disgust
nrc_disgust <- get_sentiments("nrc") |> 
  filter(sentiment == "disgust")
filtered_speeches |> 
  left_join(nrc_disgust, by = join_by(word)) |> 
  filter(sentiment == "disgust") |> 
  count(word, sort = TRUE) |> 
  slice_head(n = 10)
## # A tibble: 10 × 2
##    word           n
##    <chr>      <int>
##  1 weight        13
##  2 powerful      10
##  3 corruption     9
##  4 finally        9
##  5 bad            8
##  6 honest         8
##  7 terrorist      8
##  8 poverty        7
##  9 terrorism      7
## 10 blame          6

While some of the above words make sense (such as terrorism, poverty, corruption), others are less clear. For example, the association between weight and disgust, or honest and disgust, is questionable. These mismatches further highlight the limitations of using pre-defined lexicons for emotion detection and of single-word sentiment analysis in general.

Most negative year

Now we will calculate the ratio of negative words per year to check the validity of the previous comments.

bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

# creating a dataframe with the total number of words per year
wordcounts <- filtered_speeches %>%
  group_by(year) %>%
  summarize(words = n())

filtered_speeches %>%
  # keeping only negative words
  semi_join(bingnegative, by = join_by(word)) %>%
  group_by(year) %>%
  # counting how many negative words are there per year
  summarize(negativewords = n()) %>%
  # joining the info of how many total words
  left_join(wordcounts, by = c("year")) %>%
  # calculating the percentage of negative words over total words
  mutate(ratio = negativewords/words) %>%
  ungroup() %>%
  arrange(desc(ratio))
## # A tibble: 12 × 4
##     year negativewords words  ratio
##    <dbl>         <int> <int>  <dbl>
##  1  2013           116  2031 0.0571
##  2  2012           110  2280 0.0482
##  3  2021           111  2364 0.0470
##  4  2015           178  3794 0.0469
##  5  2018            91  1945 0.0468
##  6  2022           103  2253 0.0457
##  7  2016           105  2332 0.0450
##  8  2011            77  1828 0.0421
##  9  2020           130  3108 0.0418
## 10  2010            76  1850 0.0411
## 11  2017            95  2342 0.0406
## 12  2023            96  2675 0.0359

The most negative year is 2013, the one with the highest ratio of negative words. In absolute terms 2015 might seem the most negative, since it is the longest speech and contains the largest number of negative words; in relative terms, however, 2013 is more negative.

2013 marked the last year of service of the Barroso Commission, and the president gave a very ambitious speech about the future and the challenges Europe would have to face, but he also recapped all the harshness and difficulties of the sovereign debt crisis, which made it the most negative speech of our collection.

TF-IDF

TF-IDF helps us identify the most distinctive words in the speeches by weighing each word’s frequency within a speech against the number of speeches in which it appears.
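As a refresher, tf-idf multiplies a term’s within-document frequency (tf) by log(number of documents / number of documents containing the term) (idf). A minimal sketch on a hypothetical two-document corpus:

```r
library(dplyr)
library(tidytext)

# Two toy documents of three words each
toy <- tibble(doc  = c(1, 1, 1, 2, 2, 2),
              word = c("europe", "crisis", "crisis",
                       "europe", "ukraine", "gas"))

toy |>
  count(doc, word) |>
  bind_tf_idf(term = word, document = doc, n = n) |>
  arrange(desc(tf_idf))
# "europe" appears in both documents, so idf = log(2/2) = 0 and tf-idf = 0;
# document-specific words score highest, e.g. "crisis": (2/3) * log(2)
```

This is why words like europe, ubiquitous across all speeches, drop out of the tf-idf ranking, while year-specific words rise to the top.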

speeches_tf_idf <- tokenized_speeches |> 
  # Counting n. of appearances of words per year
  count(year, word, sort = TRUE) |> 
  # Creating tf-idf
  bind_tf_idf(term = word, document = year, n = n)

# Printing the words with the highest tf-idf
speeches_tf_idf |>
  arrange(desc(tf_idf)) |> 
  print(n=20)
## # A tibble: 17,862 × 6
##     year word                 n       tf   idf  tf_idf
##    <dbl> <chr>            <int>    <dbl> <dbl>   <dbl>
##  1  2023 ai                  17 0.00252  1.79  0.00452
##  2  2022 ukraine             22 0.00382  0.875 0.00334
##  3  2022 gas                 12 0.00208  1.39  0.00289
##  4  2022 electricity          9 0.00156  1.79  0.00280
##  5  2012 parties              9 0.00149  1.79  0.00268
##  6  2022 russian              8 0.00139  1.79  0.00249
##  7  2012 federation           6 0.000996 2.48  0.00248
##  8  2015 refugee             17 0.00170  1.39  0.00236
##  9  2020 nextgenerationeu    17 0.00208  1.10  0.00229
## 10  2021 cyber                8 0.00123  1.79  0.00221
## 11  2022 ukrainian            9 0.00156  1.39  0.00216
## 12  2023 ukraine             16 0.00237  0.875 0.00208
## 13  2012 doubts               9 0.00149  1.39  0.00207
## 14  2016 000                  5 0.000824 2.48  0.00205
## 15  2011 efsf                 4 0.000796 2.48  0.00198
## 16  2023 clean               12 0.00178  1.10  0.00195
## 17  2021 soul                 9 0.00139  1.39  0.00192
## 18  2018 patriotism           4 0.000771 2.48  0.00192
## 19  2018 surprise             4 0.000771 2.48  0.00192
## 20  2021 pandemic            11 0.00170  1.10  0.00186
## # ℹ 17,842 more rows

We can see that the words with the highest TF-IDF are all related to particular crises or challenges the Union had to face. Most of the top words come from speeches delivered in the 2020s, which implies that these words are used often in recent years but were used rarely or never before. This reflects the rapid changes in the world over the past few years, marked by unexpected events and crises. An example is ai, the word with the highest tf-idf, which was used 17 times in the 2023 speech. In particular, many of the top words belong to the 2022 speech, further emphasizing the extraordinary nature of that year, with the Russian invasion of Ukraine and all the crises connected to it.

We’ll now look at the most distinctive word of each year:

speeches_tf_idf |> 
  # adding a column with the name of the president that delivered the speech
  mutate(president = case_when(
    year %in% c(2010, 2011, 2012, 2013) ~ "Barroso",
    year %in% c(2015, 2016, 2017, 2018) ~ "Juncker",
    year %in% c(2020, 2021, 2022, 2023) ~ "von der Leyen")) |>
  group_by(year) |> 
  # selecting only the most distinctive word per year (highest tf-idf)
  # if there is a tie all words are printed
  slice_max(tf_idf)
## # A tibble: 16 × 7
## # Groups:   year [12]
##     year word                 n       tf   idf  tf_idf president    
##    <dbl> <chr>            <int>    <dbl> <dbl>   <dbl> <chr>        
##  1  2010 exist                4 0.000908 1.39  0.00126 Barroso      
##  2  2010 sound                4 0.000908 1.39  0.00126 Barroso      
##  3  2011 efsf                 4 0.000796 2.48  0.00198 Barroso      
##  4  2012 parties              9 0.00149  1.79  0.00268 Barroso      
##  5  2013 justus               4 0.000710 2.48  0.00176 Barroso      
##  6  2013 lipsius              4 0.000710 2.48  0.00176 Barroso      
##  7  2013 version              4 0.000710 2.48  0.00176 Barroso      
##  8  2015 refugee             17 0.00170  1.39  0.00236 Juncker      
##  9  2016 000                  5 0.000824 2.48  0.00205 Juncker      
## 10  2017 2019                 8 0.00129  1.39  0.00179 Juncker      
## 11  2018 patriotism           4 0.000771 2.48  0.00192 Juncker      
## 12  2018 surprise             4 0.000771 2.48  0.00192 Juncker      
## 13  2020 nextgenerationeu    17 0.00208  1.10  0.00229 von der Leyen
## 14  2021 cyber                8 0.00123  1.79  0.00221 von der Leyen
## 15  2022 ukraine             22 0.00382  0.875 0.00334 von der Leyen
## 16  2023 ai                  17 0.00252  1.79  0.00452 von der Leyen

For some of these words, it’s clear why they are the most distinctive in their respective speeches. For instance, in 2015 refugee stands out, as Europe was in the middle of the refugee crisis at that time. In 2020, nextgenerationeu is the top word, as that is the name of the recovery plan the Commission had just approved. In 2022, ukraine is obviously the most distinctive word, given the Russian invasion, while in 2023 ai is prominent due to President von der Leyen’s discussion of artificial intelligence and the AI Act. However, for some words the interpretation is more challenging. For example, efsf stands for the European Financial Stability Facility, frequently mentioned by President Barroso in 2011 in the context of strengthening this mechanism, established to address the sovereign debt crisis. For other words, the reason behind their importance, and what was meant when they were used, is even less clear. To help us with that, in the next section we will add some context around them.

Key words in context

To look at the context of the words that were not clear in the previous section, we first need to create a {quanteda} corpus and then apply the kwic function to see the phrases in which they appear.

speeches_corpus <- corpus(speeches$text, docvars = data.frame(year = speeches$year,
                                                              president = speeches$president))

summary(speeches_corpus)
## Corpus consisting of 12 documents, showing 12 documents:
## 
##    Text Types Tokens Sentences year     president
##   text1  1293   4878       299 2010       Barroso
##   text2  1322   5634       324 2011       Barroso
##   text3  1390   6644       371 2012       Barroso
##   text4  1477   6353       341 2013       Barroso
##   text5  2200  10994       528 2015       Juncker
##   text6  1598   6752       413 2016       Juncker
##   text7  1614   6881       395 2017       Juncker
##   text8  1449   5810       323 2018       Juncker
##   text9  1992   9020       518 2020 von der Leyen
##  text10  1691   7201       481 2021 von der Leyen
##  text11  1671   6451       439 2022 von der Leyen
##  text12  1748   7525       484 2023 von der Leyen
# The chunks below do the following:
# 1 - subset our corpus object selecting only the year we are interested in
# 2 - tokenize the subset
# 3 - search for the word and look for its context (words that come before and after)
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2010")),
                    "exist", window = 8)
kwic_result
## Keyword-in-context with 4 matches.                                                                              
##  [text1, 1532]      stronger single market for jobs. The opportunities | exist
##  [text1, 2786] will find that their fundamental rights and obligations | exist
##  [text1, 3331]            duration of the next budget. Various options | exist
##  [text1, 4745] will find that their fundamental rights and obligations | exist
##                                              
##  | . We have very high levels of unemployment
##  | wherever they go. Everyone in Europe must 
##  | . I would like to look at a               
##  | wherever they go. I have made clear
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2010")),
                    "sound", window = 8)
kwic_result
## Keyword-in-context with 4 matches.                                                                      
##   [text1, 930]               grow, we also need a strong and | sound |
##  [text1, 1215]                  in place by the end of 2011. | Sound |
##  [text1, 1254]         We can have both. Honourable Members, | Sound |
##  [text1, 3500] size and is negotiating further accessions. A | sound |
##                                                               
##  financial sector. A sector that serves the                   
##  government finances and responsible financial markets give us
##  public finances are a means to an end                        
##  currency, the euro, that is a

In 2010 the words with the highest TF-IDF were exist and sound. The former was used in varied contexts that do not provide much insight, but the latter, sound, is used in relation to establishing more solid economic and financial policies, which makes sense in the context of the financial crisis.

kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2012")),
                    "parties", window = 8)
kwic_result
## Keyword-in-context with 9 matches.                                                                             
##  [text3, 4165] also cannot be done without strengthening European political |
##  [text3, 4178]          have very often a real disconnect between political |
##  [text3, 4186]           parties in the capitals and the European political |
##  [text3, 4213]                    often as if it were just between national |
##  [text3, 4230]                   not see the name of the European political |
##  [text3, 4244]          we see a national debate between national political |
##  [text3, 4257]          we need a reinforced statute for European political |
##  [text3, 4291]       debate would be the presentation by European political |
##  [text3, 4344]                        even clearer. I call on the political |
##                                                              
##  parties | . Indeed, we have very often a                    
##  parties | in the capitals and the European political parties
##  parties | here in Strasbourg. This is why we                
##  parties | . Even in the European elections we do            
##  parties | on the ballot box, we see a                       
##  parties | . This is why we need a reinforced                
##  parties | . I am proud to announce that the                 
##  parties | of their candidate for the post of Commission     
##  parties | to commit to this step and thus to

In 2012, parties is the word we are looking at; it appeared at a significant rate in this speech in the context of the role of European political parties and their candidates for the 2014 elections.

kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2013")),
                    "justus", window = 8)
kwic_result
## Keyword-in-context with 4 matches.                                                                           
##  [text4, 3318]                  still lack. Surely, you all know | Justus |
##  [text4, 3321]              Surely, you all know Justus Lipsius. | Justus |
##  [text4, 3333]         name of the Council building in Brussels. | Justus |
##  [text4, 3436] the governments' representatives that meet at the | Justus |
##                                                      
##  Lipsius. Justus Lipsius is the name of              
##  Lipsius is the name of the Council building         
##  Lipsius was a very influential 16th century humanist
##  Lipsius building, show that determination, that
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2013")),
                    "version", window = 8)
kwic_result
## Keyword-in-context with 4 matches.                                                                         
##  [text4, 5488]      will voters be presented with? The candid | version |
##  [text4, 5493]           ? The candid version, or the cartoon | version |
##  [text4, 5505]           or the facts? The honest, reasonable | version |
##  [text4, 5512] reasonable version, or the extremist, populist | version |
##                                        
##  , or the cartoon version? The myths   
##  ? The myths or the facts? The         
##  , or the extremist, populist version? 
##  ? It's an important difference. I know

In this case, the three words appeared frequently within short spans of text. lipsius and justus appeared together since they refer to a building named after a 16th-century humanist, which Barroso used to draw an analogy. The word version was distinctive in this document: it was used to ask which version of reality voters would receive, presumably from political parties campaigning for the 2014 elections.

kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2016")),
                    "000", window = 8)
kwic_result
## Keyword-in-context with 5 matches.                                                              
##  [text6, 1626] conflicts, which claim the lives of 170 | 000 |
##  [text6, 1872]         1 billion we get in exports, 14 | 000 |
##  [text6, 3401]   its first year of operation. Over 200 | 000 |
##  [text6, 3414]   across Europe got loans. And over 100 | 000 |
##  [text6, 4487]           by 2020, to see the first 100 | 000 |
##                                                     
##  people every year. Of course we still              
##  extra jobs are created across the EU.              
##  small firms and start-ups across Europe got loans  
##  people got new jobs, thanks to the                 
##  young Europeans taking part. By voluntarily joining

The reason why 000 appears as the most distinctive term in 2016 is that several figures were quoted during the speech, and in those expressed in thousands the 000 was tokenized as a separate word. It has a high IDF because this tokenization issue probably did not occur in the other transcriptions.
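This tokenization artifact could be avoided upstream. A minimal sketch, assuming quanteda is loaded (the sentence below is an illustrative fragment, not the actual transcription): quanteda's tokens() accepts remove_numbers = TRUE, which drops numeric tokens such as the stray 000.

```r
library(quanteda)

txt <- "These conflicts claim the lives of 170 000 people every year."
# Default tokenization splits "170 000" into two tokens, so "000" counts as a word
as.character(tokens(txt))
# remove_numbers = TRUE drops numeric tokens altogether
as.character(tokens(txt, remove_numbers = TRUE))
```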

kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2017")),
                    "2019", window = 8)
kwic_result
## Keyword-in-context with 8 matches.                                                                         
##  [text7, 6152] Presidencies of the Council between now and March | 2019 |
##  [text7, 6228]                for the second option. On 29 March | 2019 |
##  [text7, 6292]                 the future of Europe. On 30 March | 2019 |
##  [text7, 6335]                    just a few weeks later, in May | 2019 |
##  [text7, 6389]       holding the Presidency in the first half of | 2019 |
##  [text7, 6401]           a Special Summit in Romania on 30 March | 2019 |
##  [text7, 6453]                     . My hope is that on 30 March | 2019 |
##  [text7, 6650]                 wake up to this Union on 30 March | 2019 |
##                                              
##  , outlining where we should go from here    
##  , the United Kingdom will leave the European
##  , we will be a Union of 27                  
##  . Europeans have a date with democracy.     
##  , to organise a Special Summit in Romania   
##  . My wish is that this summit be            
##  , Europeans will wake up to a Union         
##  , then the European Union will be a

During the 2017 speech, 2019 was mentioned frequently because it was the year in which the UK was initially scheduled to leave the European Union, and also the year of the next European elections.

kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2018")),
                    "patriotism", window = 8)
kwic_result
## Keyword-in-context with 4 matches.                                                                           
##   [text8, 401]                        more. We should embrace the kind of |
##  [text8, 5630] us to reject unhealthy nationalism and embrace enlightened |
##  [text8, 5638]                patriotism. We should never forget that the |
##  [text8, 5709]                       love your country is to love Europe. |
##                                                            
##  patriotism | that is used for good, and never             
##  patriotism | . We should never forget that the patriotism 
##  patriotism | of the 21st Century is two-fold: both        
##  Patriotism | is a virtue. Unchecked nationalism is riddled
kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2018")),
                    "surprise", window = 12)
kwic_result
## Keyword-in-context with 4 matches.                                                                               
##   [text8, 284] took the sunny, optimistic and peaceful continent of the time by
##   [text8, 991]      some, the agreement I struck with President Trump came as a
##   [text8, 997]           with President Trump came as a surprise. But it was no
##  [text8, 1011]     as whenever Europe speaks with one voice, there is never any
##                                                                         
##  | surprise | . In 1913, Europeans expected to live a lasting peace.    
##  | surprise | . But it was no surprise – as whenever Europe speaks with 
##  | surprise | – as whenever Europe speaks with one voice, there is never
##  | surprise | . When Europe speaks with one voice, it can prevail.

In 2018, the word patriotism was used by Juncker in opposition to nationalism: the former was presented as something positive, while the latter carried negative connotations. These were times of rising nationalism and anti-Europeanism across the Union, so it makes sense that the President addressed such topics with elections coming up the following year. The word surprise was used repeatedly in reference to an agreement with Trump, likely a trade agreement, that some politicians or Member States had not believed possible to reach.

kwic_result <- kwic(tokens(corpus_subset(speeches_corpus, year == "2021")),
                    "cyber", window = 8)
kwic_result
## Keyword-in-context with 7 matches.                                                                              
##  [text10, 4439]                   , from fighter jets, to drones and | cyber |
##  [text10, 4500]   we cannot talk about defence without talking about | cyber |
##  [text10, 4535]          should not just be satisfied to address the | cyber |
##  [text10, 4546]                but also strive to become a leader in | cyber |
##  [text10, 4556]                  . It should be here in Europe where | cyber |
##  [text10, 4569]                     . This is why we need a European | Cyber |
##  [text10, 4582] legislation on common standards under a new European | Cyber |
##                                                           
##  . But we have to keep thinking of                        
##  . If everything is connected, everything can             
##  threat, but also strive to become a                      
##  security. It should be here in Europe                    
##  defence tools are developed. This is why                 
##  Defence Policy, including legislation on common standards
##  Resilience Act. So, we can do

Lastly, the word cyber stood out in 2021 because President von der Leyen discussed growing cyber threats and the need for the EU to defend against them (cyberdefence), likely linked to the accelerated digitalization brought about by the pandemic. In this speech the Commission President also announced the creation of a “Cyber Resilience Act”.

Zipf’s Law

Now we will look at how Commission Presidents use common and rare words. For that we will use Zipf’s Law to assign a rank to each word: the lower the TF, the higher the rank number, because the word is rarer and more distinctive.

First, we’ll sort the words by TF, then generate a log-log graph for each president to observe deviations from typical language patterns. We’ll consider the middle section of each graph as representing standard language use, as it tends to be the most stable. To visualize this baseline, we’ll draw a reference line on each graph based on a linear regression that will extrapolate the trend of the middle section.
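As a reminder, Zipf’s law says a term’s frequency is roughly inversely proportional to its rank, tf ≈ C / rank^s with s close to 1, so on log-log axes the relationship is a straight line with slope near -s. A minimal self-contained illustration of the idealised case (toy vectors, not the speech data):

```r
# Idealised Zipf distribution with exponent s = 1:
# log10(freq) = log10(C) - 1 * log10(rk), a line with slope -1
rk   <- 1:1000
freq <- 0.1 / rk
coef(lm(log10(freq) ~ log10(rk)))
# (Intercept) = -1, slope = -1; the empirical fits below come out close to -0.95
```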

speeches_tf_idf <- speeches_tf_idf |> 
    mutate(president = case_when(
    year %in% c(2010, 2011, 2012, 2013) ~ "Barroso",
    year %in% c(2015, 2016, 2017, 2018) ~ "Juncker",
    year %in% c(2020, 2021, 2022, 2023) ~ "von der Leyen")) |> 
  group_by(year) |> 
  arrange(desc(tf)) |> 
  ungroup()

freq_by_rank_barroso <- speeches_tf_idf %>% 
  filter(president == "Barroso") |> 
  group_by(year) %>% 
  mutate(rank = row_number()) %>%
  ungroup()

freq_by_rank_juncker <- speeches_tf_idf %>% 
  filter(president == "Juncker") |> 
  group_by(year) %>%
  mutate(rank = row_number()) %>%
  ungroup()

freq_by_rank_vonderleyen <- speeches_tf_idf %>% 
  filter(president == "von der Leyen") |> 
  group_by(year) %>%
  mutate(rank = row_number()) %>%
  ungroup()

We extract the line for Barroso:

freq_by_rank_barroso %>% 
  ggplot(aes(rank, tf, color = as_factor(year))) + 
  geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10() + 
  theme_minimal()

rank_subset <- freq_by_rank_barroso %>% 
  filter(rank < 500,
         rank > 50)

#we use the linear model function to find numeric coefficients of relationship between tf and rank
lm(log10(tf) ~ log10(rank), data = rank_subset)
## 
## Call:
## lm(formula = log10(tf) ~ log10(rank), data = rank_subset)
## 
## Coefficients:
## (Intercept)  log10(rank)  
##     -0.9514      -0.9393
p_barroso <- freq_by_rank_barroso %>% 
  ggplot(aes(rank, tf, color = as_factor(year))) + 
  #We add a line in the plot with the two coefficients we have found
  geom_abline(intercept = -0.9514, slope = -0.9393, 
              color = "gray50", linetype = 2) +
  geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10() + 
  theme_minimal() + 
  labs(
    title = "Barroso"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

We extract the line for Juncker:

freq_by_rank_juncker %>% 
  ggplot(aes(rank, tf, color = as_factor(year))) + 
  geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10() + 
  theme_minimal()

rank_subset <- freq_by_rank_juncker %>% 
  filter(rank < 500,
         rank > 50)
#we use the linear model function to find numeric coefficients of relationship between tf and rank
lm(log10(tf) ~ log10(rank), data = rank_subset)
## 
## Call:
## lm(formula = log10(tf) ~ log10(rank), data = rank_subset)
## 
## Coefficients:
## (Intercept)  log10(rank)  
##     -0.9024      -0.9666
p_juncker <- freq_by_rank_juncker %>% 
  ggplot(aes(rank, tf, color = as_factor(year))) + 
  geom_abline(intercept = -0.9024, slope = -0.9666, 
              color = "gray50", linetype = 2) +
  geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10() +
  theme_minimal() + 
  labs(
    title = "Juncker"
  ) +
  theme(plot.title = element_text(hjust = 0.5))

We extract the line for von der Leyen:

freq_by_rank_vonderleyen %>% 
  ggplot(aes(rank, tf, color = as_factor(year))) + 
  geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10() + 
  theme_minimal()

rank_subset <- freq_by_rank_vonderleyen %>% 
  filter(rank < 500,
         rank > 50)

lm(log10(tf) ~ log10(rank), data = rank_subset)
## 
## Call:
## lm(formula = log10(tf) ~ log10(rank), data = rank_subset)
## 
## Coefficients:
## (Intercept)  log10(rank)  
##     -0.9549      -0.9498
p_von_der_leyen <- freq_by_rank_vonderleyen %>% 
  ggplot(aes(rank, tf, color = as_factor(year))) + 
  #we add a line in the plot with the two coefficients we have found
  geom_abline(intercept = -0.9549, slope = -0.9498, 
              color = "gray50", linetype = 2) +
  geom_line(linewidth = 1.1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10() + 
  theme_minimal() + 
  labs(
    title = "von der Leyen"
  ) +
  theme(plot.title = element_text(hjust = 0.5))
# We use the library patchwork to display the three graphs at the same time
(p_barroso / p_juncker / p_von_der_leyen)  +
  plot_layout(ncol = 1,            
              heights = c(1, 1, 1))

Both Barroso and Juncker seem to share the same type of deviation from standard language use: their speeches fall below the line in the first section, so they use a lower percentage of the most common words. In the last section the speeches do not deviate from the line, i.e. they use the same percentage of rare words as many other collections of language. Von der Leyen’s speeches also use a similar percentage of rare and very common words, but they behave differently in the first section: words around rank 10 sit above the line, so we can say that von der Leyen uses more middle-to-low-rank words than standard language use would predict.

Topic modelling

Sparsity

speeches_dtm <- filtered_speeches |>
  # Counting the appearances of each word in each speech
  count(year, word, sort = TRUE) |>
  # Transforming from tidy format to document-term-matrix
  cast_dtm(document = year, term = word, value = n)

speeches_dtm
## <<DocumentTermMatrix (documents: 12, terms: 5539)>>
## Non-/sparse entries: 14158/52310
## Sparsity           : 79%
## Maximal term length: 20
## Weighting          : term frequency (tf)

We have 12 speeches containing a total of 5539 different (non-stop) words. A sparsity of 79% in our document-term matrix means that 79% of the cells in the matrix are zeros: 79% of the word-document combinations do not occur in the speeches we have analyzed.

While this sparsity might seem high, it can be much higher in other real-world text data (>90-95%). In our case the sparsity is somewhat lower because these are political speeches given by the same institution (for several years represented by the same person), in the same context, and they therefore often revolve around recurring topics. This thematic consistency reduces the number of unique, non-overlapping words across documents, making the vocabulary of the speeches similar across years, which in turn lowers the overall sparsity of our collection.

Another factor that lowers our sparsity is that we are analyzing a relatively small number of documents (12 speeches). In larger collections, sparsity tends to increase as new documents bring in new unique words.
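The sparsity reported by the document-term matrix print method is simply the share of zero cells, and can be recomputed by hand. A minimal sketch on a toy count matrix (the final commented line shows how the same check would read against speeches_dtm):

```r
# Toy matrix: 2 documents x 3 terms
m <- matrix(c(1, 0, 0,
              0, 2, 0), nrow = 2, byrow = TRUE)
sum(m == 0) / length(m)
# 4 of 6 cells are zero, i.e. ~67% sparsity
# sum(as.matrix(speeches_dtm) == 0) / prod(dim(speeches_dtm))  # ~0.79
```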

Topics

# Applying LDA with 4 topics
lda_speeches <- LDA(speeches_dtm, k = 4, control = list(seed = 777))

# Extracting the most likely terms to belong to each topic
terms(lda_speeches, k = 10)
##       Topic 1      Topic 2      Topic 3      Topic 4     
##  [1,] "europe"     "european"   "europe"     "europe"    
##  [2,] "european"   "europe"     "european"   "european"  
##  [3,] "union"      "union"      "union"      "union"     
##  [4,] "commission" "commission" "world"      "commission"
##  [5,] "eu"         "political"  "people"     "people"    
##  [6,] "world"      "euro"       "honourable" "time"      
##  [7,] "time"       "world"      "future"     "world"     
##  [8,] "people"     "crisis"     "time"       "future"    
##  [9,] "crisis"     "eu"         "global"     "global"    
## [10,] "europeans"  "economic"   "ukraine"    "honourable"

As can be seen from the results above, most of the top terms are common across all topics. This is probably due to the fact that these are also the most frequent words in the speeches and thus can be found in all topics. To improve this situation we will get rid of them by building a custom stop words list.

# Custom stop words
stop_custom <- tibble(word = c("european", 
                               "europe", 
                               "eu", 
                               "union", 
                               "commission", 
                               "honourable",
                               "world",
                               "people",
                               "time",
                               "future"))

# Repeating the pre-processing steps
tidy_speeches <- speeches |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = join_by(word)) |>
  anti_join(stop_custom, by = join_by(word)) |>
  count(year, word, sort = TRUE) |>
  arrange(desc(n))

speeches_dtm <- tidy_speeches |>
  cast_dtm(document = year, term = word, value = n)

lda_speeches <- LDA(speeches_dtm, k = 4, control = list(seed = 777))
# Note that the number of topics is the result of several refinements

terms(lda_speeches, k = 15)
##       Topic 1     Topic 2            Topic 3      Topic 4     
##  [1,] "crisis"    "market"           "parliament" "global"    
##  [2,] "euro"      "economy"          "president"  "law"       
##  [3,] "economic"  "global"           "europeans"  "countries" 
##  [4,] "political" "companies"        "united"     "climate"   
##  [5,] "growth"    "forward"          "euro"       "security"  
##  [6,] "market"    "ukraine"          "single"     "investment"
##  [7,] "social"    "digital"          "law"        "common"    
##  [8,] "national"  "support"          "national"   "means"     
##  [9,] "financial" "nextgenerationeu" "elections"  "companies" 
## [10,] "citizens"  "live"             "democracy"  "freedom"   
## [11,] "policy"    "values"           "trade"      "social"    
## [12,] "single"    "crisis"           "citizens"   "billion"   
## [13,] "common"    "industry"         "continent"  "support"   
## [14,] "means"     "energy"           "europe's"   "values"    
## [15,] "greece"    "data"             "economic"   "jobs"
topics <- topics(lda_speeches)
topics[order(names(topics))]
## 2010 2011 2012 2013 2015 2016 2017 2018 2020 2021 2022 2023 
##    1    1    1    1    1    4    3    3    2    4    2    4

Our analysis is much better now. The topic modelling results show that for the Barroso presidency from 2010 to 2013, as well as for the first of Juncker’s speeches in 2015, the main topic is the economic and financial crisis, related to the global financial crisis and the subsequent eurozone crisis. This is topic n.1.

Topic n.2 is related to the pandemic (e.g. digital, nextgenerationeu), as well as to the war in Ukraine (e.g. ukraine, energy). The topic is prevalent in 2020 and 2022, the first speeches after each of these two crises unfolded.

Topic n.3 relates mainly to elections, a newly found unity, and the defence of the rule of law and democracy. These topics were all relevant in 2017 and 2018, when the EU had come to terms with the fact that Brexit was happening and Commission President Juncker was trying to set out a vision for the post-Brexit European Union. While praising and highlighting the unity and commitment of the remaining EU countries, he was also warning against the worsening of democratic conditions in some European states and the need to defend the rule of law, thus underlining the importance of the forthcoming 2019 European elections.

Topic n.4 is probably the most general topic, grouping together several of the European Commission’s key priorities of the last decade. These speeches come at times of difficulty for the Union, but not necessarily at the most acute point of a crisis (2016, 2021, 2023). The priorities mentioned by the Commission Presidents range from global challenges such as climate change, to the defence of the rule of law and the EU’s values, to the need to invest in order to create jobs and maintain a functioning common market for EU companies. Also important in the 2016, 2021 and 2023 speeches is the concept of security, which has been framed in increasingly broad ways: initially related to migrants and terrorism in 2016, it then expanded into an umbrella term covering social and economic security as well, becoming a key priority and concept of the von der Leyen Commission.
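The topics() call above only reports the single most likely topic per speech. Since LDA is a mixed-membership model, it can also be instructive to look at the full per-document topic probabilities; a hedged sketch using tidytext’s tidy() on the lda_speeches model fitted above (gamma values near 1 indicate a clean assignment, spread-out values a mixture):

```r
library(tidytext)
library(dplyr)

# Per-document topic probabilities (gamma) from the fitted LDA model
speeches_gamma <- tidy(lda_speeches, matrix = "gamma") |>
  arrange(document, desc(gamma))
speeches_gamma
```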

Conclusion

Our project analyzed the European Commission’s State of the Union speeches from 2010 to 2023 using text mining techniques. Throughout the study we were able to trace the last decade of the European Union’s history and the challenges and major events it has had to face.

The main finding of our analysis is that crises such as the Eurozone debt crisis, refugee crisis, the pandemic, and the war in Ukraine are often central topics and greatly shape the discourse. TF-IDF and topic modeling revealed how these challenges influenced the vocabulary and topics of each presidency, with speeches in the early 2010s centered on economic recovery, while speeches in the 2020s focused on the pandemic and the war in Ukraine.

However, since the speeches are a very formal occasion, they still tend to share a significant portion of their vocabulary and style across years and presidencies. The speeches primarily serve as a platform for the Commission to highlight legislative achievements, outline policy priorities, and address pressing challenges. Nonetheless, even in very difficult times and when candidly addressing serious issues, Commission Presidents seem always to ultimately convey a message of unity and confidence in the European project, resulting in neutral to fairly positive speeches.