Cross-National Panel Data Analysis in R

A Step-by-Step Walkthrough of the International Projection Project

Author

Thaís Simões Dória

Published

April, 2026

Overview

This document walks through a complete cross-national panel data analysis, from research design and data acquisition through regression modelling and diagnostics. Each section explains:

What we are doing and why (substantive rationale)
The relevant IR/IPE literature that motivates the choices
The statistical reasoning behind each method
The R code to execute it
Expected output and how to interpret it
What the method can and cannot tell us

Research question: What factors contribute to a country’s degree of international projection, and did the North-South divide in international presence narrow between 1998 and 2023?

By the end of this document, you will have:

Built a country-year panel from 3 international datasets
Run 8 regression models, each relaxing specific assumptions
Performed diagnostic tests to select the appropriate estimator
Understood what each step can and cannot tell you

1 Foundational Concepts

This document uses a specific research project as a worked example throughout. The project asks whether the North-South divide in international integration narrowed between 1998 and 2023. It uses a measure of globalisation (the KOF Index) as the dependent variable and tests whether trade openness, political regime characteristics, and Global South status predict variation in that index. All five principles below are illustrated using variables from this project, which are introduced fully in Part 3.

Before loading any data or running any models, there are foundational concepts that every user of regression should understand.

1.1 Regression identifies associations, not causes

A regression coefficient tells you: “X and Y move together in this dataset, after holding other variables constant.” It does not tell you: “X causes Y.” To establish causation, you need a randomised experiment, a natural experiment, or an instrumental variable. We have none of these in this project.

Key reference: Holland, P. (1986). “Statistics and Causal Inference.” JASA, 81(396), 945–960—“No causation without manipulation.”

1.2 A coefficient is a partial association

The coefficient on trade_dep tells you: “the association between trade openness and KOF, holding ideology, person_leader, cap_all, nam, and global_south constant.” Change the set of controls and you change the coefficient. There is no single “true” coefficient—it depends on the model specification.

This is why we run multiple models (Part 6): to see how the coefficient changes under different assumptions. A coefficient that is stable across specifications is more credible than one that appears only in one model.

1.3 “Holding constant” means “in the model we chose”

When we say “trade openness predicts KOF, holding ideology constant,” we mean: “among countries that have the same ideology score, those with higher trade openness tend to have higher KOF.” But we did not hold constant institutional quality, colonial history, natural resource dependence, or a hundred other things. The omission of these variables may bias our estimates. This is the omitted variable bias problem—the central challenge of observational research.
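A small simulation can make omitted variable bias concrete. The sketch below uses simulated data only; the variable `institutions` is a hypothetical unobserved confounder and the coefficients are made up for illustration, not estimates from this project.

```r
# Simulated data only: 'institutions' is a hypothetical omitted confounder.
set.seed(42)
n <- 5000
institutions <- rnorm(n)                               # unobserved confounder
trade <- 0.8 * institutions + rnorm(n)                 # confounder raises trade
kof   <- 1.0 * trade + 2.0 * institutions + rnorm(n)   # true trade coefficient = 1

b_short <- coef(lm(kof ~ trade))[["trade"]]                 # confounder omitted
b_long  <- coef(lm(kof ~ trade + institutions))[["trade"]]  # confounder included

# Omitted variable bias formula:
# bias = (effect of Z on Y) * cov(X, Z) / var(X)
bias_formula <- 2.0 * cov(trade, institutions) / var(trade)

round(c(short = b_short, long = b_long, implied_bias = bias_formula), 3)
```

The short regression overstates the trade coefficient by roughly the amount the formula implies, while the long regression recovers a value close to 1. In real data the confounder is unobserved, so the size of the bias cannot be computed this way.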

1.4 The choice of DV and IV is driven by theory, not data

KOF is our dependent variable because our research question asks “what predicts international projection?” But KOF could equally be an independent variable if we asked “does globalisation predict democratic change?” The assignment of DV and IV roles is a research design decision, not a statistical one.

Exercise

What would it mean to run ideology ~ kof + trade_dep + cap_all? Is this a meaningful model? What would the coefficient on kof tell us?

1.5 Every model makes assumptions

Pooled OLS assumes no unobserved confounders. Fixed effects assume that unobserved confounders are time-invariant. Random effects assume that the unobserved country effects are uncorrelated with the regressors. No model is “right”: each is a simplification. The job of the model sequence is to show how results change when assumptions change, and to identify which results are robust and which are fragile.

Key references: Angrist & Pischke (2009) Mostly Harmless Econometrics; Cunningham (2021) Causal Inference: The Mixtape; Morgan & Winship (2015) Counterfactuals and Causal Inference.

2 Research Design and Conceptual Framework

2.1 The Research Question

Our question sits at the intersection of three IR/IPE literatures:

(a) Hierarchy in International Relations. The IR discipline has moved beyond the assumption of sovereign equality to study how states are hierarchically ordered (Lake 2009; Hobson & Sharman 2005; Zarakol 2017). Our DV (KOF) captures one dimension of this: the degree to which states are structurally integrated into the international system.

(b) The “Rise of the Rest” and Global South agency. A large literature argued that the global order was being reshaped by emerging powers (Hurrell 2006; Ikenberry 2011; Stuenkel 2015; Acharya 2014). We test whether this “rise” translated into measurable changes in structural integration, or whether it was primarily a discursive phenomenon.

(c) China’s role in reshaping global hierarchies. The “Rise of China” literature argues that China’s economic expansion created favourable conditions for Global South internationalisation (Liang 2019; Strüver 2014). We test whether trade openness—as a proxy for structural engagement—was a significant predictor of Global South countries’ KOF trajectories.

2.2 Why KOF is the Dependent Variable

KOF is the DV because our research question asks “what predicts variation in structural integration?” But this is a choice, not a fact about the data:

If the question were “does globalisation predict democratic change?”, KOF would be an IV.
The assignment of DV and IV roles is driven by the research question, not by the nature of the variable itself.

By placing KOF on the left-hand side, we implicitly assume that trade openness, regime type, and other IVs are prior to or independent of globalisation level. If this assumption is wrong (reverse causality), our coefficients are biased. We address this partially through lagged IVs (Model 2). Fixed effects (Model 3) help with a related but distinct problem, time-invariant confounding, and do not by themselves rule out reverse causality.

2.3 Why Panel Data?

A panel dataset observes the same units (countries) over multiple time periods (years). This gives us two advantages over cross-sectional data:

We can study change over time within countries, not just differences between countries at a single point.
We can control for unobserved heterogeneity—stable country characteristics (geography, colonial history, culture) that affect both our DV and IVs but are impossible to measure directly.

Our panel: 183 countries × 26 years (1998–2023) = up to 4,758 observations.

Between vs. Within Variation

In a panel, variation in KOF comes from two sources:

Between variation: Norway has higher KOF than Chad.
Within variation: Indonesia’s KOF rose from 52 to 60 between 2000 and 2015.

Pooled OLS uses both. Fixed effects use only within variation. This distinction is crucial for interpretation.
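The between/within split can be computed directly as a decomposition of the total sum of squares. A minimal sketch in base R on a made-up three-country panel (the numbers are illustrative, not KOF values):

```r
# Toy panel with made-up numbers: 3 countries observed in 4 years.
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 4),
  year    = rep(2000:2003, times = 3),
  kof     = c(80, 81, 82, 83,   50, 52, 54, 56,   30, 30, 31, 31)
)

country_mean <- ave(toy$kof, toy$country)  # each country's own mean, repeated per row
grand_mean   <- mean(toy$kof)

ss_total   <- sum((toy$kof - grand_mean)^2)
ss_between <- sum((country_mean - grand_mean)^2)  # differences between countries
ss_within  <- sum((toy$kof - country_mean)^2)     # changes within countries over time

c(between_share = ss_between / ss_total,
  within_share  = ss_within  / ss_total)
```

In this toy example almost all of the variation is between countries. A fixed-effects model would discard that part and work only with the small within share.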

Key references: Wooldridge (2010) Econometric Analysis of Cross Section and Panel Data; Beck & Katz (1995) “What To Do (and Not To Do) with TSCS Data,” APSR.

3 Data Selection and Data Sources

3.1 Setting Up the Environment

library(tidyverse)
library(countrycode)
library(kofdata)
library(vdemdata)
library(WDI)
library(plm)
library(lmtest)
library(sandwich)
library(stargazer)
library(corrplot)
library(car)

START_YEAR <- 1998
END_YEAR   <- 2023

3.2 The Dependent Variable: KOF Globalisation Index

The KOF Globalisation Index (Dreher 2006; revised by Gygli et al. 2019) measures the degree to which a country is integrated into the international system across three dimensions: economic (trade, FDI, tariffs), social (interpersonal contact, information, cultural proximity), and political (embassies, UN peacekeeping, IOs, treaties). The overall index ranges from 0 to 100.

What KOF allows and does not allow us to test

Can test	Cannot test
Whether countries with higher trade openness have higher integration	Whether trade causes integration
Whether the Global South scores lower after controlling for other factors	Whether integration is on favourable or unfavourable terms
Whether within-country trade increases predict KOF increases	Whether specific policies drove changes

To test the claims in the right-hand column, we would need instrumental variables, natural experiments, or qualitative and computational text analysis.

Tautology risk

KOF’s economic sub-index includes trade flows. When we regress KOF on trade/GDP, part of the relationship is mechanical. We can test this by using the political sub-index as the DV instead (Exercise 6).

kof_raw <- get_collection("ds_globidx.v2020")

kof_df <- map_dfr(names(kof_raw), function(key) {
  ts_data <- kof_raw[[key]]
  if (is.null(ts_data) || !is.ts(ts_data)) return(tibble())
  # fixed = TRUE: match the key prefix literally (unescaped dots are regex wildcards)
  iso3c <- toupper(gsub("ch.kof.globidx.v2020.gi.", "", key, fixed = TRUE))
  tibble(iso3c = iso3c, year = as.integer(time(ts_data)), kof = as.numeric(ts_data))
})

kof <- kof_df %>%
  filter(year >= START_YEAR, year <= END_YEAR) %>%
  mutate(cow_code = countrycode(iso3c, "iso3c", "cown", warn = FALSE)) %>%
  filter(!is.na(cow_code)) %>%
  select(cow_code, year, kof) %>%
  distinct(cow_code, year, .keep_all = TRUE)

cat(sprintf("KOF: %d observations, %d countries, %d–%d\n",
            nrow(kof), n_distinct(kof$cow_code), min(kof$year), max(kof$year)))

KOF: 4739 observations, 183 countries, 1998–2023

3.3 Independent Variables: V-Dem (Political Regime)

We use two variables from V-Dem (Coppedge et al. 2024):

v2exl_legitideol (ideology): extent of government ideology promotion
v2exl_legitlead (person_leader): extent of leadership personalisation

These are continuous latent variables from a Bayesian IRT measurement model—not the 0–4 ordinal categories from the expert surveys.

What V-Dem allows and does not allow us to test

Can test	Cannot test
Whether personalised regimes tend to score lower on KOF	Whether personalisation causes lower globalisation
Whether within-country changes in personalisation predict KOF changes	Which ideology is promoted (socialist internationalism vs. autarkic nationalism both score high)

vdem_data <- vdem %>%
  filter(year >= START_YEAR, year <= END_YEAR) %>%
  mutate(cow_code = countrycode(country_text_id, "iso3c", "cown", warn = FALSE)) %>%
  filter(!is.na(cow_code)) %>%
  select(cow_code, year, country_name,
         ideology = v2exl_legitideol, person_leader = v2exl_legitlead) %>%
  distinct(cow_code, year, .keep_all = TRUE)

cat(sprintf("V-Dem: %d observations, %d countries\n",
            nrow(vdem_data), n_distinct(vdem_data$cow_code)))

V-Dem: 4459 observations, 172 countries

3.4 Control Variables: World Development Indicators

wdi_indicators <- list(
  c(name = "trade_gdp",    code = "NE.TRD.GNFS.ZS"),
  c(name = "gdp_pc",       code = "NY.GDP.PCAP.KD"),
  c(name = "population",   code = "SP.POP.TOTL"),
  c(name = "urban_pct",    code = "SP.URB.TOTL.IN.ZS"),
  c(name = "military_exp", code = "MS.MIL.XPND.GD.ZS")
)

wdi_list <- list()
for (ind in wdi_indicators) {
  tryCatch({
    d <- WDI(indicator = setNames(ind["code"], ind["name"]),
             start = START_YEAR, end = END_YEAR, extra = TRUE)
    wdi_list[[ind["name"]]] <- d
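  # Report failed downloads instead of dropping them silently
  }, error = function(e) warning("WDI download failed for ", ind["name"], ": ",
                                 conditionMessage(e)))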
}

wdi_merged <- wdi_list[[1]]
for (i in seq_along(wdi_list)[-1]) {
  data_col <- setdiff(names(wdi_list[[i]]), names(wdi_merged))
  if (length(data_col) > 0)
    wdi_merged <- left_join(wdi_merged, wdi_list[[i]][, c("iso2c", "year", data_col)],
                             by = c("iso2c", "year"))
}

wdi <- wdi_merged %>%
  filter(!is.na(iso3c), income != "Aggregates") %>%
  mutate(cow_code = countrycode(iso3c, "iso3c", "cown", warn = FALSE)) %>%
  filter(!is.na(cow_code)) %>%
  distinct(cow_code, year, .keep_all = TRUE)

for (col in c("trade_gdp", "gdp_pc", "population", "urban_pct", "military_exp"))
  if (!col %in% names(wdi)) wdi[[col]] <- NA_real_

cat(sprintf("WDI: %d observations, %d countries\n", nrow(wdi), n_distinct(wdi$cow_code)))

WDI: 4966 observations, 191 countries

On control variables

Students often think “more controls = better.” This is wrong. A control can do one of two things: (1) reduce omitted variable bias, if it is a confounder, or (2) introduce post-treatment bias, if it is itself affected by the variables under study (a mediator, or a consequence of the outcome). Including GDP per capita would be case (2): globalisation probably causes higher GDP, so conditioning on it distorts the association we want to study. See Angrist & Pischke (2009), Chapter 3 on “bad controls.”
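The cost of a bad control can be seen in a short simulation. The sketch below assumes, purely for illustration, that GDP is a consequence of KOF; all variables are simulated and the coefficients are made up.

```r
# Simulated data only. Assumed structure: trade -> kof -> gdp.
set.seed(1)
n <- 5000
trade <- rnorm(n)
kof   <- 1.0 * trade + rnorm(n)   # true coefficient on trade = 1
gdp   <- 1.0 * kof + rnorm(n)     # hypothetical: GDP is a consequence of KOF

b_without <- coef(lm(kof ~ trade))[["trade"]]        # correct specification
b_with    <- coef(lm(kof ~ trade + gdp))[["trade"]]  # adds the bad control

round(c(without_gdp = b_without, with_gdp = b_with), 3)
```

Under this assumed structure, adding `gdp` roughly halves the trade coefficient even though the true value is 1. The extra control makes the estimate worse, not better.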

3.5 Should We Transform Our Variables?

Log transformation is standard for right-skewed variables like population and trade/GDP. For our data:

Raw trade_dep coefficient: 0.0724 (SE 0.0145)
Logged trade_dep coefficient: 6.9068 (SE 1.4682)
R-squared: 0.579 (raw) vs 0.577 (logged)—essentially identical

We use raw trade_dep because the coefficient is more interpretable (“a 1 percentage-point increase in trade/GDP → 0.07-point increase in KOF”) and R-squared is unchanged. But logging would be defensible.

When NOT to log: KOF (already bounded 0–100), binary variables (global_south, nam), V-Dem variables (already scaled by the measurement model).
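The two coefficients reported above are on different scales, so comparing them takes a little arithmetic. The sketch below converts the logged coefficient into KOF points, using the sample mean of trade_dep from the summary statistics in Section 5.1. It is a worked calculation on the reported numbers, not a new estimate.

```r
b_raw      <- 0.0724   # raw trade_dep coefficient reported above
b_log      <- 6.9068   # logged trade_dep coefficient reported above
mean_trade <- 85.34    # sample mean of trade_dep (Section 5.1)

# Level-log model: a p% increase in trade changes KOF by b_log * log(1 + p/100)
effect_1pct  <- b_log * log(1.01)
effect_10pct <- b_log * log(1.10)

# A 1 percentage-point rise at the mean (85.34 -> 86.34) under each specification
effect_1pp_log <- b_log * log((mean_trade + 1) / mean_trade)
effect_1pp_raw <- b_raw * 1

round(c(log_1pct = effect_1pct, log_10pct = effect_10pct,
        log_1pp_at_mean = effect_1pp_log, raw_1pp = effect_1pp_raw), 3)
```

Evaluated at the mean, the two specifications imply effects of similar size for a 1 percentage-point change. They diverge for countries far from the mean, because the logged model lets the effect of an extra percentage point shrink as trade/GDP grows.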

4 Data Preparation: Building the Panel

4.1 Classification Variables

Classification lists are loaded from CSV files rather than hardcoded. Each file has two columns (cow_code, country_name), making them auditable, version-controlled, and easy to modify for sensitivity analysis (see Exercise 4).

nam_members <- read_csv("nam_members.csv", show_col_types = FALSE) %>%
  pull(cow_code)

global_south_codes <- read_csv("global_south.csv", show_col_types = FALSE) %>%
  pull(cow_code)

cat(sprintf("NAM members: %d countries\nGlobal South: %d countries\n",
            length(nam_members), length(global_south_codes)))

NAM members: 117 countries
Global South: 132 countries

4.2 Merging Datasets

n_start <- nrow(kof)
panel <- kof

# Add V-Dem
panel <- panel %>% left_join(vdem_data, by = c("cow_code", "year"))
cat(sprintf("After V-Dem: %d rows | ideology NAs: %d (%.1f%%)\n",
            nrow(panel), sum(is.na(panel$ideology)), 100 * mean(is.na(panel$ideology))))

After V-Dem: 4739 rows | ideology NAs: 427 (9.0%)

# Add WDI
panel <- panel %>%
  left_join(wdi %>% select(cow_code, year, trade_gdp, gdp_pc, population,
                           urban_pct, military_exp, region, income),
            by = c("cow_code", "year"))
cat(sprintf("After WDI: %d rows | trade NAs: %d (%.1f%%)\n",
            nrow(panel), sum(is.na(panel$trade_gdp)), 100 * mean(is.na(panel$trade_gdp))))

After WDI: 4739 rows | trade NAs: 660 (13.9%)

# Construct derived variables
panel <- panel %>%
  mutate(
    nam          = as.integer(cow_code %in% nam_members),
    global_south = as.integer(cow_code %in% global_south_codes),
    country_name = coalesce(country_name,
                            countrycode(cow_code, "cown", "country.name", warn = FALSE)),
    cap_all   = scales::rescale(population, to = c(0, 1), na.rm = TRUE),
    trade_dep = trade_gdp
  ) %>%
  filter(!is.na(kof))

# Exclude countries with < 5 years
coverage <- panel %>% group_by(cow_code) %>%
  summarise(n_years = sum(!is.na(kof)), .groups = "drop")
panel <- panel %>%
  filter(cow_code %in% (coverage %>% filter(n_years >= 5) %>% pull(cow_code)))

cat(sprintf("Final panel: %d observations, %d countries, %d–%d\n",
            nrow(panel), n_distinct(panel$cow_code), min(panel$year), max(panel$year)))

Final panel: 4739 observations, 183 countries, 1998–2023
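The row counts printed after each join can be complemented by two explicit checks: that the join key is unique on the right-hand side, and which left-hand rows found no match. A minimal sketch in base R on made-up data frames (`left` and `right` are hypothetical stand-ins for kof and vdem_data):

```r
# Toy stand-ins for kof and vdem_data (made-up values)
left  <- data.frame(cow_code = c(2, 2, 20, 20),
                    year     = c(2000, 2001, 2000, 2001),
                    kof      = c(75, 76, 80, 81))
right <- data.frame(cow_code = c(2, 2, 20),
                    year     = c(2000, 2001, 2000),
                    ideology = c(-1.2, -1.1, -0.9))

# 1. The key must be unique on the right-hand side, or the join duplicates rows
key_is_unique <- anyDuplicated(right[c("cow_code", "year")]) == 0

# 2. Which left-hand country-years have no match? (these become NA after the join)
left_key  <- paste(left$cow_code, left$year)
right_key <- paste(right$cow_code, right$year)
unmatched <- left[!left_key %in% right_key, c("cow_code", "year")]

# Base-R equivalent of a left join
merged <- merge(left, right, by = c("cow_code", "year"), all.x = TRUE)
stopifnot(nrow(merged) == nrow(left))  # a left join on a unique key keeps the row count
unmatched
```

Listing the unmatched keys shows whether missingness is concentrated in particular countries or years, which the aggregate NA percentage cannot reveal.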

5 Descriptive Analysis

5.1 Summary Statistics

desc_stats <- panel %>%
  select(kof, ideology, person_leader, trade_dep, cap_all, nam, global_south) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(n = sum(!is.na(value)), mean = round(mean(value, na.rm = TRUE), 2),
            sd = round(sd(value, na.rm = TRUE), 2), min = round(min(value, na.rm = TRUE), 2),
            median = round(median(value, na.rm = TRUE), 2),
            max = round(max(value, na.rm = TRUE), 2), .groups = "drop")
desc_stats

variable	n	mean	sd	min	median	max
cap_all	4713	0.03	0.10	0.00	0.01	1.00
global_south	4739	0.70	0.46	0.00	1.00	1.00
ideology	4312	-0.35	1.33	-3.29	-0.57	3.38
kof	4739	57.36	14.97	23.09	55.40	89.19
nam	4739	0.62	0.49	0.00	1.00	1.00
person_leader	4313	-0.06	1.51	-3.42	-0.03	3.56
trade_dep	4079	85.34	50.36	2.47	75.27	437.33

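How to read this: n shows non-missing observations. Note that ideology (4,312) has 427 fewer than kof (4,739). These are country-years without V-Dem coding. The regression drops any row with a missing value on any model variable without warning (listwise deletion), which is why Model 1 in Section 6.2 reports N = 3,837 rather than 4,739. The trade_dep maximum of 437 is Singapore (trade exceeds GDP due to re-exports); check whether this outlier drives results.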

5.2 The North–South Divide: Wrong vs. Right

# WRONG: raw panel data (inflated df)
tt_wrong <- t.test(panel$kof[panel$global_south == 1], panel$kof[panel$global_south == 0])

# RIGHT: country-level means
country_means <- panel %>%
  group_by(cow_code, global_south) %>%
  summarise(mean_kof = mean(kof, na.rm = TRUE), .groups = "drop")
south <- country_means %>% filter(global_south == 1) %>% pull(mean_kof)
north <- country_means %>% filter(global_south == 0) %>% pull(mean_kof)
tt_right <- t.test(south, north)

cat(sprintf("WRONG (raw panel):     t = %.2f, df = %.0f, p = %s\n",
            tt_wrong$statistic, tt_wrong$parameter, format.pval(tt_wrong$p.value, 3)))

WRONG (raw panel):     t = -58.25, df = 2591, p = <2e-16

cat(sprintf("RIGHT (country means): t = %.2f, df = %.0f, p = %s\n",
            tt_right$statistic, tt_right$parameter, format.pval(tt_right$p.value, 3)))

RIGHT (country means): t = -12.16, df = 99, p = <2e-16

cat(sprintf("South mean: %.1f (n=%d) | North mean: %.1f (n=%d)\n",
            mean(south), length(south), mean(north), length(north)))

South mean: 50.9 (n=127) | North mean: 72.0 (n=56)

Why the wrong version is wrong

The raw panel version treats all 4,739 country-years as independent observations (Welch df ≈ 2,591). The correct version uses one mean per country: 183 observations (Welch df ≈ 99). The t-statistic is inflated by a factor of ~5 (58.25 vs. 12.16), not because the difference is larger, but because pretending you have ~26× more data shrinks the standard error by roughly √26 ≈ 5. In a borderline case, this could mean the difference between “significant” and “not significant.”
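The same point can be checked by simulation. The sketch below generates panels in which the two groups do not differ at all, then counts how often each version of the t-test rejects at the 5% level. Everything is simulated; the number of countries, the standard deviations, and the number of replications are arbitrary choices for illustration.

```r
set.seed(123)
n_countries <- 60; n_years <- 26; n_sims <- 300
group <- rep(c(0, 1), each = n_countries / 2)  # country-level label, no true effect

one_sim <- function() {
  country_effect <- rnorm(n_countries, sd = 10)     # persistent country level
  y  <- rep(country_effect, each = n_years) +
        rnorm(n_countries * n_years, sd = 2)        # small year-to-year noise
  id <- rep(seq_len(n_countries), each = n_years)
  g  <- rep(group, each = n_years)

  p_naive <- t.test(y[g == 1], y[g == 0])$p.value   # treats country-years as independent
  means   <- tapply(y, id, mean)                    # one observation per country
  p_means <- t.test(means[group == 1], means[group == 0])$p.value
  c(naive = p_naive < 0.05, means = p_means < 0.05)
}

# Share of simulations in which each test (wrongly) rejects the null
rejection_rate <- rowMeans(replicate(n_sims, one_sim()))
rejection_rate
```

With no true difference, a valid test should reject about 5% of the time. The country-means version stays near that rate, while the raw-panel version rejects far more often.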

5.3 Correlation Analysis

cor_vars <- panel %>%
  select(KOF = kof, Ideology = ideology, Leader = person_leader,
         `Trade Openness` = trade_dep, Capabilities = cap_all) %>%
  drop_na()

cor_mat <- cor(cor_vars, use = "complete.obs")
corrplot(cor_mat, method = "color", type = "lower", addCoef.col = "black",
         number.cex = 0.9, tl.col = "black", tl.srt = 45,
         col = colorRampPalette(c("#3498db", "white", "#e74c3c"))(200),
         title = "Correlation Matrix", mar = c(0, 0, 2, 0))

Key patterns: Leader ↔︎ KOF = −0.50 (strongest); Ideology ↔︎ Leader = +0.66 (collinearity risk); Trade ↔︎ KOF = +0.37; Capabilities ↔︎ KOF ≈ 0.

6 Regression Analysis: The Model Sequence

6.1 Why a Sequence of Models?

We build a sequence, each model relaxing a specific assumption:

Model	Specification	What it controls for
M1	Pooled OLS	Nothing (baseline)
M2	Pooled OLS + lagged IVs	Reverse causality
M3	Country FE	All time-invariant confounders
M4	Two-way FE	Time-invariant + common shocks
M5	First differences	Same as FE, different error assumption
M6	Random effects	Allows time-invariant variables
M7	Interaction	Differential effects for Global South
M8	Global South subsample	Robustness check

# Create lags BEFORE pdata.frame (dplyr::lag vs plm::lag conflict)
panel <- panel %>%
  arrange(cow_code, year) %>%
  group_by(cow_code) %>%
  mutate(trade_dep_lag = dplyr::lag(trade_dep, 1),
         cap_all_lag   = dplyr::lag(cap_all, 1)) %>%
  ungroup()

pdata <- pdata.frame(panel, index = c("cow_code", "year"), drop.index = FALSE)

f_base <- kof ~ ideology + person_leader + trade_dep + cap_all + nam + global_south
f_fe   <- kof ~ ideology + person_leader + trade_dep + cap_all

6.2 Model 1: Pooled OLS

m1 <- plm(f_base, data = pdata, model = "pooling")
m1_se <- vcovHC(m1, type = "HC1", cluster = "group")
ct1 <- coeftest(m1, vcov = m1_se)
ct1


t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
(Intercept)    64.756427   1.739363 37.2300 < 2.2e-16 ***
ideology        0.087911   0.675587  0.1301 0.8964744    
person_leader  -2.077096   0.565976 -3.6699 0.0002459 ***
trade_dep       0.072389   0.014474  5.0014 5.946e-07 ***
cap_all         5.589286   6.318018  0.8847 0.3763966    
nam            -2.848962   2.794269 -1.0196 0.3079952    
global_south  -14.594917   2.872437 -5.0810 3.933e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How to interpret: trade_dep = 0.072 means a 1pp increase in trade/GDP → 0.07 KOF points. global_south = −14.6 means ~15 points lower, holding the other covariates in the model constant, which is roughly one full SD of KOF (14.97).

cat(sprintf("R-squared: %.3f | N: %d\n", summary(m1)$r.squared["rsq"], nobs(m1)))

R-squared: 0.579 | N: 3837

6.3 Model 2: Lagged IVs

m2 <- plm(kof ~ ideology + person_leader + trade_dep_lag + cap_all_lag + nam + global_south,
           data = pdata, model = "pooling")
m2_se <- vcovHC(m2, type = "HC1", cluster = "group")
coeftest(m2, vcov = m2_se)


t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
(Intercept)    65.185234   1.706730 38.1931 < 2.2e-16 ***
ideology        0.104071   0.673416  0.1545 0.8771914    
person_leader  -2.054475   0.566691 -3.6254 0.0002924 ***
trade_dep_lag   0.072264   0.014372  5.0282 5.188e-07 ***
cap_all_lag     5.462602   6.348726  0.8604 0.3896109    
nam            -2.868683   2.786176 -1.0296 0.3032593    
global_south  -14.678820   2.855534 -5.1405 2.883e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

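Lagged and contemporaneous coefficients are nearly identical, which suggests reverse causality is probably not dramatically distorting estimates. Two caveats: only trade_dep and cap_all are lagged here, and because trade/GDP tends to change slowly from year to year, a one-year lag is a weak test.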

6.4 Model 3: Country Fixed Effects

m3 <- plm(f_fe, data = pdata, model = "within", effect = "individual")
m3_se <- vcovHC(m3, type = "HC1", cluster = "group")
ct3 <- coeftest(m3, vcov = m3_se)
ct3


t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
ideology        0.761662   0.537438  1.4172  0.156507    
person_leader  -1.700863   0.403174 -4.2187 2.517e-05 ***
trade_dep       0.072831   0.014563  5.0011 5.968e-07 ***
cap_all       117.613496  42.575699  2.7625  0.005765 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cat(sprintf("Within R-squared: %.3f\n", summary(m3)$r.squared["rsq"]))

Within R-squared: 0.159

R² drops from 0.58 to 0.16. This does not mean the model is worse—it means most KOF variation is between countries, not within countries over time. trade_dep remains significant (+0.073***)—within-country increases in trade do predict within-country increases in KOF.
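What “using only within variation” means mechanically can be shown in base R: the fixed-effects slope equals OLS on country-demeaned data, and also equals OLS with one dummy per country. The sketch below uses simulated data in which the country effect is deliberately correlated with x; the names and numbers are illustrative.

```r
set.seed(7)
n_id <- 50; n_t <- 10
id    <- rep(seq_len(n_id), each = n_t)
alpha <- rnorm(n_id, sd = 5)[id]          # country effect, correlated with x below
x <- 0.5 * alpha + rnorm(n_id * n_t)
y <- 2 * x + alpha + rnorm(n_id * n_t)    # true within-country slope = 2

# (a) Within estimator by hand: subtract each country's mean, then OLS
x_dm <- x - ave(x, id)
y_dm <- y - ave(y, id)
b_within <- coef(lm(y_dm ~ x_dm - 1))[["x_dm"]]

# (b) Same slope from OLS with one dummy per country (LSDV)
b_lsdv <- coef(lm(y ~ x + factor(id)))[["x"]]

# (c) Pooled OLS ignores the country effect and is biased here
b_pooled <- coef(lm(y ~ x))[["x"]]

round(c(within = b_within, lsdv = b_lsdv, pooled = b_pooled), 3)
```

The two fixed-effects routes give the same slope, close to the true value, while pooled OLS is pulled away by the country effect. One caution: the standard errors from the demeaned `lm()` use the wrong degrees of freedom, which is one reason to rely on a dedicated estimator such as `plm()` in practice.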

6.5 Model 4: Two-Way Fixed Effects

m4 <- plm(f_fe, data = pdata, model = "within", effect = "twoways")
m4_se <- vcovHC(m4, type = "HC1", cluster = "group")
coeftest(m4, vcov = m4_se)


t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
ideology      -0.4305498  0.3259763 -1.3208 0.1866505    
person_leader -0.1669934  0.2380810 -0.7014 0.4830893    
trade_dep      0.0306045  0.0091215  3.3552 0.0008011 ***
cap_all       -2.8447271 14.4941227 -0.1963 0.8444116    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Critical finding

person_leader loses significance (−0.17, p = 0.48). A plausible reading is that within-country changes in personalisation were correlated with global trends (the “autocratic wave” of the 2010s). Year dummies absorb that common component, leaving little independent variation for person_leader to explain. This does not mean personalisation “does not matter”; it means its effect cannot be distinguished from global trends in these data.

6.6 Model 5: First Differences

m5 <- plm(f_fe, data = pdata, model = "fd")
m5_se <- vcovHC(m5, type = "HC1", cluster = "group")
coeftest(m5, vcov = m5_se)


t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
(Intercept)    0.4499748  0.0202580 22.2122 < 2.2e-16 ***
ideology      -0.0714438  0.0907826 -0.7870    0.4313    
person_leader -0.0729590  0.0624162 -1.1689    0.2425    
trade_dep      0.0265186  0.0037353  7.0995 1.496e-12 ***
cap_all        7.1574762 14.8473150  0.4821    0.6298    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

trade_dep = +0.027***. Significant in changes: when a country opens up more, its KOF rises.

6.7 Model 6: Random Effects

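# m6 uses the FE formula so it is directly comparable with m3 in the Hausman test
m6 <- plm(f_fe, data = pdata, model = "random")
# m6b adds the time-invariant dummies (nam, global_south) that FE cannot estimate
m6b <- plm(f_base, data = pdata, model = "random")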
m6b_se <- vcovHC(m6b, type = "HC1", cluster = "group")
coeftest(m6b, vcov = m6b_se)


t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
(Intercept)    63.878908   2.014297 31.7128 < 2.2e-16 ***
ideology        0.752339   0.512071  1.4692  0.141858    
person_leader  -1.583181   0.389628 -4.0633 4.935e-05 ***
trade_dep       0.074293   0.013805  5.3815 7.827e-08 ***
cap_all        55.198618  18.013211  3.0643  0.002197 ** 
nam            -5.045049   2.836652 -1.7785  0.075397 .  
global_south  -14.130615   2.955338 -4.7814 1.806e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.8 Model 7: Interaction—Global South × Trade

m7 <- plm(kof ~ ideology + person_leader + trade_dep * global_south + cap_all + nam,
           data = pdata, model = "pooling")
m7_se <- vcovHC(m7, type = "HC1", cluster = "group")
ct7 <- coeftest(m7, vcov = m7_se)
ct7


t test of coefficients:

                         Estimate Std. Error t value  Pr(>|t|)    
(Intercept)             67.986313   1.694780 40.1151 < 2.2e-16 ***
ideology                -0.103044   0.672280 -0.1533 0.8781897    
person_leader           -2.063690   0.558754 -3.6934 0.0002244 ***
trade_dep                0.037501   0.012613  2.9733 0.0029648 ** 
global_south           -19.547232   3.405785 -5.7394 1.024e-08 ***
cap_all                  4.889694   6.741656  0.7253 0.4683150    
nam                     -2.644620   2.822619 -0.9369 0.3488493    
trade_dep:global_south   0.054803   0.020013  2.7384 0.0062032 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

effect_north <- ct7["trade_dep", "Estimate"]
effect_south <- ct7["trade_dep", "Estimate"] + ct7["trade_dep:global_south", "Estimate"]

cat(sprintf("Marginal effect of 10pp trade increase:\n"))

Marginal effect of 10pp trade increase:

cat(sprintf("  Global North: %.2f KOF points\n", effect_north * 10))

  Global North: 0.38 KOF points

cat(sprintf("  Global South: %.2f KOF points\n", effect_south * 10))

  Global South: 0.92 KOF points

cat(sprintf("  Difference:   %.2f (p = %s)\n",
            ct7["trade_dep:global_south", "Estimate"] * 10,
            format.pval(ct7["trade_dep:global_south", "Pr(>|t|)"], 3)))

  Difference:   0.55 (p = 0.0062)

Trade openness has a stronger positive association with KOF for Global South countries. This speaks to the IPE debate: is trade a pathway to integration (Keohane & Nye) or to structural dependence (Cardoso & Faletto 1977)? The regression cannot distinguish between these—the coefficient tells us trade correlates with higher KOF for the South, not whether that integration is on favourable terms.
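The Global South slope reported above is a sum of two coefficients, so its standard error has to account for their covariance. The sketch below shows the calculation on simulated data with a plain `lm()` fit; the data-generating values are made up. In the walkthrough itself, the clustered matrix m7_se would presumably take the place of `vcov(fit)`, with the coefficient names trade_dep and trade_dep:global_south.

```r
set.seed(11)
n <- 2000
south <- rbinom(n, 1, 0.7)
trade <- runif(n, 20, 200)
kof   <- 68 - 20 * south + 0.04 * trade + 0.05 * trade * south + rnorm(n, sd = 8)

fit <- lm(kof ~ trade * south)
b <- coef(fit)
V <- vcov(fit)  # assumption: classical covariance matrix, for illustration only

# Slope of trade for the South = b[trade] + b[trade:south]
slope_south <- b[["trade"]] + b[["trade:south"]]

# Var(b1 + b3) = Var(b1) + Var(b3) + 2 * Cov(b1, b3)
se_south <- sqrt(V["trade", "trade"] + V["trade:south", "trade:south"] +
                 2 * V["trade", "trade:south"])
ci_south <- slope_south + c(-1, 1) * qnorm(0.975) * se_south

round(c(slope = slope_south, se = se_south,
        lower = ci_south[1], upper = ci_south[2]), 4)
```

Reporting the combined slope with its own interval answers a different question from the interaction p-value. The interaction tests whether the North and South slopes differ; the interval shows how precisely the South slope itself is estimated.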

6.9 Model 8: Global South Subsample

pdata_south <- pdata.frame(
  panel %>% filter(global_south == 1),
  index = c("cow_code", "year"), drop.index = FALSE)
m8 <- plm(f_fe, data = pdata_south, model = "within", effect = "individual")
m8_se <- vcovHC(m8, type = "HC1", cluster = "group")
coeftest(m8, vcov = m8_se)


t test of coefficients:

                Estimate Std. Error t value  Pr(>|t|)    
ideology        0.350000   0.674009  0.5193 0.6036135    
person_leader  -1.566907   0.441747 -3.5471 0.0003971 ***
trade_dep       0.046693   0.018364  2.5426 0.0110647 *  
cap_all       115.843226  46.099623  2.5129 0.0120403 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.10 Summary: How Coefficients Change Across Models

Variable	Pooled OLS	Country FE	Two-Way FE	First Diff
trade_dep	0.072***	0.073***	0.031***	0.027***
person_leader	−2.077***	−1.701***	−0.167	−0.073
ideology	0.088	0.762	−0.431	−0.071
global_south	−14.595***	—	—	—
R²	0.579	0.159	0.050	0.050

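trade_dep is robust in sign and significance across every specification, although its magnitude falls by more than half (from 0.073 to 0.031 and 0.027) once year effects or first differences are used. person_leader is fragile: it washes out in two-way FE and first differences. ideology is never significant, possibly because it overlaps with person_leader (correlation = 0.66).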

7 Diagnostic Tests

# Hausman: FE vs RE
hausman <- phtest(m3, m6)
cat(sprintf("Hausman: chi-sq = %.1f, p = %s → %s\n",
            hausman$statistic, format.pval(hausman$p.value, 4),
            if (hausman$p.value < 0.05) "FE preferred" else "RE OK"))

Hausman: chi-sq = 279.0, p = < 2.2e-16 → FE preferred

# Breusch-Pagan LM: Pooled vs Panel
bp <- plmtest(m1, type = "bp")
cat(sprintf("BP LM: chi-sq = %.1f, p = %s → %s\n",
            bp$statistic, format.pval(bp$p.value, 4),
            if (bp$p.value < 0.05) "Panel effects present" else "Pooled OK"))

BP LM: chi-sq = 29622.5, p = < 2.2e-16 → Panel effects present

# Serial correlation
bg <- pbgtest(m3, order = 1)
cat(sprintf("BG AR(1): chi-sq = %.1f, p = %s → %s\n",
            bg$statistic, format.pval(bg$p.value, 4),
            if (bg$p.value < 0.05) "Serial correlation" else "No serial correlation"))

BG AR(1): chi-sq = 2484.0, p = < 2.2e-16 → Serial correlation

# VIF
m_lm <- lm(kof ~ ideology + person_leader + trade_dep + cap_all + nam + global_south, data = panel)
cat("VIF:\n"); print(round(vif(m_lm), 2))

VIF:

     ideology person_leader     trade_dep       cap_all           nam 
         1.85          2.05          1.10          1.07          3.39 
 global_south 
         3.22
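The VIF reported by `car::vif()` has a simple definition that can be reproduced by hand: regress each predictor on the others and compute 1 / (1 − R²). A minimal sketch on simulated predictors (`x1`, `x2`, `x3` and the helper `vif_by_hand` are hypothetical names introduced for illustration):

```r
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # unrelated to the others

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the other predictors
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x1 = vif_by_hand(x1, data.frame(x2, x3)),
  x2 = vif_by_hand(x2, data.frame(x1, x3)),
  x3 = vif_by_hand(x3, data.frame(x1, x2))
)
round(vifs, 2)
```

Read in reverse, the formula gives a feel for the reported values: a VIF of 3.39 for nam means that about 70% of its variance (1 − 1/3.39) is explained by the other predictors, most plausibly global_south.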

Decision tree

BP LM significant → Pooled OLS inadequate. Use panel model.
Hausman significant → RE assumption (country effects uncorrelated with regressors) rejected. Use FE.
Serial correlation → Use cluster-robust SEs (done) and FD (done).
VIF < 5 → No severe collinearity.

Conclusion: Country FE with cluster-robust SEs is the preferred specification. FD (M5) is a valid robustness check.

8 Synthesis

8.1 What We Found

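The North–South divide did not narrow. The Global South penalty is ~15 KOF points (M1, M6) and is reported to have widened over 1998–2023. The model sequence above estimates only the period-average penalty, so the change over time needs a period split or a global_south × year interaction (Exercise 3).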
Trade openness is the dominant predictor. Significant in every specification.
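Trade matters more for the Global South in the pooled interaction model (M7). The interaction term is positive and significant (+0.055, p = 0.006). The within-country evidence is weaker: the Global South FE subsample (M8) gives 0.047, below the full-sample FE estimate of 0.073 (M3).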
Leadership personalisation is negatively associated with projection. Robust in pooled OLS and country FE, but washes out in two-way FE and first differences.
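Ideology is not a significant predictor. This may reflect its overlap with person_leader (r = 0.66), although the VIFs in Section 7 (all below 5) indicate the collinearity is not severe.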

8.2 What We Cannot Claim

Causality. Even two-way FE identifies associations, not causal effects.
Mechanism. We do not know whether the trade-KOF link is economic, social, or political. Running the same analysis on KOF’s economic, social, and political sub-indices separately would begin to address this.
Generalisability beyond KOF. Cuba has low KOF but high diplomatic engagement. Singapore has high KOF but limited geopolitical influence.

9 Exercises

Exercise 1: Variable Selection

Run Model 1 replacing trade_dep with gdp_pc. What happens to R²? Why? Discuss the “bad controls” problem.

Exercise 2: Heterogeneous Effects

Run the FE model separately for each WDI region. Does trade predict KOF equally across regions?

Exercise 3: Temporal Structure

Split the panel into 1998–2007, 2008–2015, 2016–2023. Does the Global South penalty change?

Exercise 4: Sensitivity to Classification

Move Turkey, South Korea, and Israel into the Global South. How do results change?

Exercise 5: Reverse the Equation

Run ideology ~ kof + trade_dep + cap_all. Is this meaningful? What does kof’s coefficient tell us?

Exercise 6: The Tautology Test

Use KOF’s political sub-index as the DV. Does trade_dep still predict it? If so, that is stronger evidence than predicting the overall index (which includes trade by construction).

Exercise 7: Residual Analysis

Extract residuals from M3. Which 10 countries have the largest average residuals? What is the model missing?
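
The ranking step can be sketched as follows, with a simple lm fit standing in for M3 and simulated data throughout (the object names are assumptions). One caveat: if the model includes country fixed effects, signed residuals average to zero within each country by construction, so the sketch ranks countries by mean absolute residual instead.

```r
# Hypothetical toy panel and stand-in model; not the project's M3.
set.seed(3)
panel <- expand.grid(country = paste0("C", 1:30), year = 1998:2023,
                     stringsAsFactors = FALSE)
panel$trade_dep <- runif(nrow(panel), 20, 150)
panel$kof       <- 45 + 0.1 * panel$trade_dep + rnorm(nrow(panel), sd = 5)

# na.exclude pads residuals with NA so they align with the rows of panel.
m_stand_in  <- lm(kof ~ trade_dep, data = panel, na.action = na.exclude)
panel$resid <- residuals(m_stand_in)

# Mean absolute residual per country, sorted from largest to smallest.
avg_resid <- aggregate(resid ~ country, data = panel,
                       FUN = function(r) mean(abs(r)))
top10 <- head(avg_resid[order(-avg_resid$resid), ], 10)
top10
```

Countries near the top of this list are the ones the model fits worst, which makes them natural candidates for a closer qualitative look.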

10 Questioning the Design

Is KOF the right DV? KOF measures structural integration, not agency. A country can be integrated without projecting itself (small EU states). Our findings should be qualified: “trade predicts structural integration (as measured by KOF)”—not “trade causes projection.”

Is trade openness the right IV? Our question asked about “proximity to China.” Trade/GDP captures how much a country trades, not with whom. Better alternatives: share of trade with China specifically, bilateral trade network centrality, UNGA voting alignment.

Is OLS the right model family? KOF is bounded (0–100). A fractional response model (Papke & Wooldridge 1996) would be more appropriate in principle, though our values (23–89) are well within the interior.
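
As a rough illustration of the alternative, a fractional logit can be fitted in base R by rescaling the outcome to the unit interval and using a quasibinomial GLM. The sketch below uses simulated data, and the variable names and coefficients are assumptions rather than project results.

```r
# Hypothetical simulated data; not the project panel.
set.seed(4)
n         <- 500
trade_dep <- runif(n, 20, 150)
kof       <- 100 * plogis(-0.5 + 0.01 * trade_dep + rnorm(n, sd = 0.3))

# Rescale the bounded index to (0, 1) and fit a fractional logit.
kof_share <- kof / 100
frac_fit  <- glm(kof_share ~ trade_dep, family = quasibinomial(link = "logit"))

# OLS on the original scale, for comparison.
ols_fit <- lm(kof ~ trade_dep)

# Fitted values from the fractional model always stay inside the bounds.
fitted_kof <- 100 * fitted(frac_fit)
range(fitted_kof)
```

The fractional logit coefficients are on the log-odds scale, so they are not directly comparable with the OLS slope. Average marginal effects would be needed for a like-for-like comparison.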

Are we asking the right question? I have also conducted a trajectory analysis on the same data using group-based trajectory modelling (Nagin 2005). Rather than imposing a binary North-South classification, the model identifies distinct clusters of countries based on the shape of their globalisation trajectories over time. The analysis revealed five tiers, not two, with the “climbing middle” (Brazil, China, India, Indonesia, Vietnam) registering the largest gains, while the bottom tier (fragile and post-conflict states) remained largely unchanged. One of the five groups cross-cuts the North-South divide entirely, containing both post-socialist EU states and Gulf petro-states on the same trajectory. Had we started the regression analysis with this knowledge, we would have replaced the binary global_south dummy with trajectory group membership. Explaining which trajectory a country follows would in turn require a multinomial logistic regression rather than OLS.

The iterative lesson

This project began with a binary question (“did the gap close?”), but the data taught us the question was wrong. Good research changes its questions in response to what the data reveal.

11 Suggested Readings

Methods: Agresti & Finlay (2009); Wooldridge (2010); Angrist & Pischke (2009); Cunningham (2021); Imai (2017).

R for social science: Wickham & Grolemund (2017) R for Data Science; Healy (2018) Data Visualization.

sessionInfo()

R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 26.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] C.UTF-8/C.UTF-8/C.UTF-8/C/C.UTF-8/C.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] car_3.1-5         carData_3.0-5     corrplot_0.95     stargazer_5.2.3  
 [5] sandwich_3.1-1    lmtest_0.9-40     zoo_1.8-12        plm_2.6-7        
 [9] WDI_2.7.9         vdemdata_16.0     kofdata_0.2.1     httr_1.4.7       
[13] jsonlite_1.8.8    countrycode_1.7.0 lubridate_1.9.3   forcats_1.0.0    
[17] stringr_1.6.0     dplyr_1.2.0       purrr_1.2.1       readr_2.1.5      
[21] tidyr_1.3.2       tibble_3.3.1      ggplot2_4.0.2     tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       xfun_0.44          htmlwidgets_1.6.4  collapse_2.1.6    
 [5] lattice_0.22-6     tzdb_0.4.0         vctrs_0.7.2        tools_4.4.0       
 [9] Rdpack_2.6.4       generics_0.1.4     curl_5.2.1         parallel_4.4.0    
[13] xts_0.14.0         pkgconfig_2.0.3    RColorBrewer_1.1-3 S7_0.2.1          
[17] lifecycle_1.0.5    compiler_4.4.0     farver_2.1.2       maxLik_1.5-2.2    
[21] htmltools_0.5.8.1  yaml_2.3.8         Formula_1.2-5      crayon_1.5.2      
[25] pillar_1.11.1      MASS_7.3-60.2      abind_1.4-8        nlme_3.1-164      
[29] tidyselect_1.2.1   bdsmatrix_1.3-7    digest_0.6.35      stringi_1.8.7     
[33] miscTools_0.6-30   fastmap_1.2.0      grid_4.4.0         cli_3.6.5         
[37] magrittr_2.0.4     withr_3.0.2        scales_1.4.0       bit64_4.0.5       
[41] timechange_0.3.0   rmarkdown_2.28     bit_4.0.5          hms_1.1.3         
[45] evaluate_0.24.0    knitr_1.47         rbibutils_2.3      rlang_1.1.7       
[49] Rcpp_1.0.12        glue_1.8.0         vroom_1.6.5        R6_2.6.1