An Empirical Example: Simulation-Based Inference for Free Recall
Guangyu Zhu
Source: vignettes/30-empirical-example.Rmd

This section presents a real-world example illustrating how the proposed simulation-based inference workflow can be applied to empirical data.
Here, we use free recall data from the Penn Electrophysiology of Encoding and Retrieval Study (PEERS; Kahana et al., 2024). We model the recall process using a multi-response Racing Diffusion Model (RDM), focusing on the response times (RTs) of the first seven retrieved items.
Description of the model
The details of this model are listed below:
Priors on global parameters
We specify weakly informative priors on the global parameters governing the decision boundary, drift rate, non-decision time, and between-trial variability (with ranges matching the prior_formulas specification below):

$$
\begin{aligned}
\beta^{A}_{0} &\sim \mathrm{Uniform}(0.5,\ 5), & \beta^{A}_{1} &\sim \mathrm{Uniform}(0.01,\ 1),\\
\beta^{V}_{0} &\sim \mathrm{Uniform}(0.5,\ 3), & \beta^{V}_{1} &\sim \mathrm{Uniform}(0.01,\ 0.3),\\
\sigma_{V} &\sim \mathrm{Uniform}(0.01,\ 1), & \beta^{t_0}_{0} &\sim \mathrm{Uniform}(-1,\ 1).
\end{aligned}
$$
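As a concrete sketch, one draw from these uniform priors can be simulated with base R's runif; the ranges below are taken from the prior_formulas specification in Step one, while the package itself represents these priors with distributional::dist_uniform() objects.

```r
# Draw one illustrative set of global parameters from the uniform priors
set.seed(123)
theta <- c(
  A_beta_0   = runif(1, 0.50, 5.00),  # boundary intercept
  A_beta_1   = runif(1, 0.01, 1.00),  # boundary slope over retrieval positions
  V_beta_0   = runif(1, 0.50, 3.00),  # drift intercept
  V_beta_1   = runif(1, 0.01, 0.30),  # drift slope over retrieval positions
  V_sigma    = runif(1, 0.01, 1.00),  # between-trial drift-rate SD
  ndt_beta_0 = runif(1, -1.00, 1.00)  # non-decision time
)
print(round(theta, 3))
```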
Between-trial variability
To capture variability in recall trajectories and recall strategies across trials, we introduce a between-trial variability term on the drift rate:

$$
\eta \sim \mathcal{N}(0,\ \sigma_V),
$$

where $\eta$ is drawn once per trial and shifts the drift rate of every accumulator.
Item-level parameterization
At the item level, the model parameters are linked to the retrieval position in different ways. Specifically, the decision boundary is assumed to increase with the retrieval position, whereas the drift rate decreases with it.
Formally, we define

$$
A_i = \beta^{A}_{0} + i\,\beta^{A}_{1}, \qquad
v_i = \max\!\left(\beta^{V}_{0} - i\,\beta^{V}_{1} + \eta,\ \epsilon\right), \qquad i = 1, \dots, 7,
$$

where $\epsilon$ is a small constant ensuring positivity of the drift rate.
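The item-level parameterization can be sketched in a few lines of R that mirror the item_formulas used in Step one; the parameter values here are purely illustrative.

```r
# Compute item-level boundaries and drift rates for one trial
n_items <- 7
A_beta_0 <- 2.0; A_beta_1 <- 0.2  # boundary intercept and slope
V_beta_0 <- 1.5; V_beta_1 <- 0.1  # drift intercept and slope
V_var    <- 0.05                  # one between-trial draw of eta

i <- seq(1, n_items)
A <- A_beta_0 + i * A_beta_1                      # boundary grows with position
V <- pmax(V_beta_0 - i * V_beta_1 + V_var, 1e-5)  # drift shrinks but stays positive
print(rbind(A = A, V = V))
```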
Evidence accumulation dynamics
Within each trial, evidence for each accumulator evolves according to a Wiener diffusion process:

$$
\mathrm{d}x_i(t) = v_i\,\mathrm{d}t + \sigma\,\mathrm{d}W_i(t), \qquad x_i(0) = 0.
$$

A response is generated when an accumulator first reaches its decision boundary $A_i$. The observed response time for item $i$ is given by

$$
\mathrm{RT}_i = \inf\{\, t : x_i(t) \ge A_i \,\} + t_0,
$$

where $t_0$ denotes the non-decision time.
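The race dynamics can be illustrated with a minimal Euler-Maruyama loop: each accumulator drifts toward its own boundary, and the first crossing determines the response. This is a sketch with illustrative parameter values, not the package's simulator.

```r
# Simulate one trial of a three-accumulator race
set.seed(1)
dt <- 0.001; max_t <- 60; sigma <- 1
A  <- c(2.2, 2.4, 2.6)  # item-specific boundaries
V  <- c(1.4, 1.3, 1.2)  # item-specific drift rates
t0 <- 0.3               # non-decision time

x <- rep(0, length(A)); t <- 0; winner <- NA
while (t < max_t && is.na(winner)) {
  x <- x + V * dt + sigma * sqrt(dt) * rnorm(length(A))  # Wiener increment
  t <- t + dt
  hit <- which(x >= A)
  if (length(hit) > 0) winner <- hit[1]
}
rt <- t + t0  # observed RT includes non-decision time
cat("winner:", winner, "RT:", round(rt, 3), "\n")
```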
Step one: Model setup
First, we load the required packages.
# Load necessary packages
library(eam)
library(dplyr)
library(tidyr)
# Set a random seed for reproducibility
set.seed(1)

Then, we specify the model configuration according to the setup described above.
#######################
# Model specification #
#######################
# Define the number of accumulators
n_items <- 7
# Specify the prior distributions for free parameters
prior_formulas <- list(
n_items ~ 7,
# A: decision-boundary parameters
A_beta_0 ~ distributional::dist_uniform(0.50, 5.00),
A_beta_1 ~ distributional::dist_uniform(0.01, 1.00),
# V
V_beta_0 ~ distributional::dist_uniform(0.5, 3.00),
V_beta_1 ~ distributional::dist_uniform(0.01, 0.3),
V_sigma ~ distributional::dist_uniform(0.01, 1.0),
# ndt
ndt_beta_0 ~ distributional::dist_uniform(-1, 1),
# noise param
noise_coef ~ 1
)
# Specify the between-trial components
between_trial_formulas <- list(
# trial-level variability on the drift rate
V_var ~ distributional::dist_normal(0, V_sigma)
)
# Specify the item-level parameters
item_formulas <- list(
A ~ A_beta_0 + seq(1, n_items) * A_beta_1,
V ~ pmax(V_beta_0 - seq(1, n_items) * V_beta_1 + V_var, 1e-5),
ndt ~ ndt_beta_0
)
# Specify the diffusion noise
noise_factory <- function(context) {
function(n, dt) {
rnorm(n, mean = 0, sd = sqrt(dt))
}
}

Step two: Data simulation
We next generate simulated datasets from the specified model.
Depending on computational resources, this step may take approximately 30 minutes to 1 hour when parallelization is enabled.
####################
# Simulation setup #
####################
sim_config <- new_simulation_config(
prior_formulas = prior_formulas,
between_trial_formulas = between_trial_formulas,
item_formulas = item_formulas,
n_conditions_per_chunk = NULL, # automatic chunking
n_conditions = 5000,
n_trials_per_condition = 1000,
n_items = n_items,
max_reached = n_items,
max_t = 60,
dt = 0.001,
noise_mechanism = "add",
noise_factory = noise_factory,
model = "ddm",
parallel = TRUE,
n_cores = NULL, # Will use default: detectCores() - 1
rand_seed = NULL # Will use default random seed
)
##################
# Run simulation #
##################
sim_output <- run_simulation(
config = sim_config
)

Step three: Load observed data
For illustration purposes, we use data from one representative participant in the PEERS dataset as the observed data in this example.
#######################
# Load Observed data #
#######################
observed_data <- read.csv("./30-empirical-example/example_data.csv")
observed_data$condition_idx <- 1

Step four: Prepare inputs for inference
We construct the input for amortized Bayesian inference by specifying the model parameters to be estimated (theta) and the observed variables (Z). The model is trained on simulated datasets and learns a mapping from data to parameters.
Here, we use reaction times (rt) as the input representation and split the simulated data into training and test sets.
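The reshaping applied to the observed data below can be illustrated on a toy, hypothetical data frame: long-format recall data, one row per retrieved item, becomes a rank-by-session matrix of RTs. Base R's tapply stands in for tidyr::pivot_wider to keep the sketch dependency-free.

```r
# Toy long-format recall data: 2 sessions x 3 retrieval ranks
toy <- data.frame(
  session_list = rep(c("s1", "s2"), each = 3),
  rank_idx     = rep(1:3, times = 2),
  rt           = c(0.9, 1.2, 1.6, 0.8, 1.1, 1.7)
)
# Reshape to a rank-by-session matrix (one RT per cell)
Z_toy <- with(toy, tapply(rt, list(rank_idx, session_list), mean))
print(Z_toy)  # rows = retrieval ranks, columns = sessions
```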
#####################
# abi model prepare #
#####################
Z_observed <- observed_data %>%
select(session_list, rank_idx, rt) %>%
pivot_wider(
names_from = session_list,
values_from = rt
) %>%
arrange(rank_idx)
Z_observed <- as.matrix(Z_observed[, -1])
rownames(Z_observed) <- paste0("rank_", observed_data$rank_idx %>% unique(), "_item_idx")
abi_input <- build_abi_input(
sim_output,
theta = c(
"A_beta_0", "A_beta_1", "V_beta_0", "V_beta_1", "V_sigma", "ndt_beta_0"
),
Z = c(
"rt"
),
train_ratio = 0.8,
n_test = 100
)

Step five: Fit the model
We perform amortized Bayesian inference to estimate the posterior distributions of model parameters.
#############################################
# Model validation and parameter estimation #
#############################################
posterior_estimator <- "
d = 7 # dimension of each replicate
w = 32 # number of neurons in each hidden layer
# Layer to ensure valid estimates
final_layer = Parallel(
vcat,
Dense(w, 1, softplus),
Dense(w, 1, softplus),
Dense(w, 1, softplus),
Dense(w, 1, softplus),
Dense(w, 1, softplus),
Dense(w, 1, softplus)
)
psi = Chain(Dense(d, w, relu), Dense(w, w, relu), Dense(w, w, relu))
phi = Chain(Dense(w, w, relu), Dense(w, w, relu), final_layer)
deepset = DeepSet(psi, phi)
w = 6 # number of parameters to estimate
q = NormalisingFlow(w, w)
estimator = PosteriorEstimator(q, deepset)
"
trained_posterior_estimator <- abi_train(
estimator = posterior_estimator,
abi_input = abi_input,
epochs = 200,
stopping_epochs = 50,
verbose = FALSE
)
# Sample from posterior distribution
posterior_samples <- abi_sample_posterior(
trained_estimator = trained_posterior_estimator,
Z = Z_observed,
N = 1000
)
# Summarise posterior parameters for each dataset
posterior_summary <- summarise_posterior_parameters(posterior_samples)
print(posterior_summary)
# Cross-validation
posterior_samples <- abi_sample_posterior(
trained_estimator = trained_posterior_estimator,
N = 1000
)
plot_cv_recovery(
posterior_samples,
trained_estimator = trained_posterior_estimator
)

Step six: Model evaluation
We assess model adequacy using posterior predictive checks by comparing reaction time distributions simulated from the fitted posterior with the observed data.
##############################
# Posterior predictive check #
##############################
abi_posterior_predictive_check(
config = sim_config,
trained_estimator = trained_posterior_estimator,
estimator_type = "posterior",
observed_df = observed_data,
Z = Z_observed,
posterior_dataset_id = 1,
posterior_n_samples = 1000,
rt_facet_x = c("rank_idx"),
rt_facet_y = c(),
accuracy_x = "rank_idx",
accuracy_facet_x = c("ndt_beta_0"),
accuracy_facet_y = c()
)

The results showed that the fitted model accurately captured the shape of the observed reaction time distributions, thereby supporting its validity.
Reference:
Kahana, M. J., Lohnas, L. J., Healey, M. K., Aka, A., Broitman, A. W., Crutchley, P., Crutchley, E., Alm, K. H., Katerman, B. S., Miller, N. E., Kuhn, J. R., Li, Y., Long, N. M., Miller, J., Paron, M. D., Pazdera, J. K., Pedisich, I., Rudoler, J. H., & Weidemann, C. T. (2024). The Penn Electrophysiology of Encoding and Retrieval Study. Journal of Experimental Psychology: Learning, Memory, and Cognition, 50(9), 1421-1443. https://doi.org/10.1037/xlm0001319