# Bootstrap non-linear regression with purrr and modelr

This post was updated on the 02/03/2018 to reflect changes to nls.multstart.

## Introduction

For my first academic publication, a reviewer asked for the $r^{2}$ values of the thermal performance curves I fitted using non-linear regression. I bowed to the request as is often the case with reviewer comments, but would now resist as the $r^{2}$ is not necessarily an effective goodness of fit measure for non-linear regression (see this SO answer). It does raise the question of how to determine how well a biologically meaningful model fits the data it is fitted to. I generally just plot every curve to its data, but it tells me nothing of the uncertainty around the curve.

Step forward the bootstrap! Bootstrapping involes simulating “new” datasets produced from the existing data by sampling with replacement. The same model is then fitted separately on each individual bootstrapped dataset. Doing this over and over allows us to visualise uncertainty of predictions and produce confidence intervals of estimated parameters. This blog post was inspired by posts by Andrew MacDonald and Hadley Wickham, as well as a broom vignette which use this approach. I have taken their approaches and again applied them to thermal performance curves. The broom approach in these blog posts has since been replaced by modelr::bootstrap(), another package of the tidyverse.

## Bootstrapping predictions for a single curve

I will demonstrate this approach by using the thermal performance curves for phytoplankton metabolism that I used in a previous post. The Sharpe-Schoolfield equation and meaning of the parameters can be found in more detail in the previous post.

Firstly lets load the packages used in the script and write the Sharpe-Schoolfield model as a function.

# load packages
library(patchwork) # devtools::install_github('thomasp85/patchwork')
library(ggplot2)
library(broom)
library(purrr)
library(dplyr)
library(tidyr)
library(nlstools)
library(modelr)

# write function for sharpe schoolfield model
schoolfield_high <- function(lnc, E, Eh, Th, temp, Tc) {
Tc <- 273.15 + Tc
k <- 8.62e-5
boltzmann.term <- lnc + log(exp(E/k*(1/Tc - 1/temp)))
inactivation.term <- log(1/(1 + exp(Eh/k*(1/Th - 1/temp))))
return(boltzmann.term + inactivation.term)
}

Then load in the data and have a look at the its structure using glimpse().

# load in data
data(Chlorella_TRC)

# look at data
glimpse(Chlorella_TRC)
## Observations: 649
## Variables: 7
## $curve_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,... ##$ growth.temp <dbl> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20...
## $process <chr> "acclimation", "acclimation", "acclimation", "accl... ##$ flux        <chr> "respiration", "respiration", "respiration", "resp...
## $temp <dbl> 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 16... ##$ K           <dbl> 289.15, 292.15, 295.15, 298.15, 301.15, 304.15, 30...
## $ln.rate <dbl> -2.06257833, -1.32437939, -0.95416807, -0.79443675... There are 60 curves here, 30 each for photosynthesis and respiration. The treatments are growth temperature (20, 23, 27, 30, 33 ºC) and adaptive process (acclimation or adaptation) that reflects the number of generations cultures were grown at each temperature. Bootstrapping the uncertainty on each individual curve is difficult for thermal performance curves because rates generally rapidly decrease after the optimum temperature, $T_{opt}$, making data collection difficult. This means it is likely some of the bootstraps will not include points after the optimum, making the unimodal model formulation unsuitable. Because of this I will pool the replicates curves within treatments together to demonstrate the bootstrapping approach, giving 20 curves in total. This post therefore ignores the non-independence of data points within replicates (a little naughty!). To bootstrap a single curve, we can filter the dataset for a single flux (photosynthesis) at a singe growth temperature (20 ºC) and generations of growth (~ 100) and plot the data. # filter for one curve d_examp <- filter(Chlorella_TRC, growth.temp == 20, flux == 'photosynthesis', process == 'adaptation') # plot ggplot(d_examp, aes(K - 273.15, ln.rate)) + geom_point(col = 'green4') + ylab('log Metabolic rate') + xlab('Assay temperature (ºC)') + theme_bw(base_size = 12, base_family = 'Helvetica') We can use nls_multstart(), that allows for multiple start parameters, to fit a single model to the data. We can then use tidy() and augment() from broom to get the parameters and predictions of the model. # run nls_multstart fit <- nls_multstart(ln.rate ~ schoolfield_high(lnc, E, Eh, Th, temp = K, Tc = 20), data = d_examp, iter = 500, start_lower = c(lnc = -10, E = 0.1, Eh = 0.2, Th = 285), start_upper = c(lnc = 10, E = 2, Eh = 5, Th = 330), supp_errors = 'Y', na.action = na.omit, lower = c(lnc = -10, E = 0, Eh = 0, Th = 0)) # broom functions to tidy up model params <- tidy(fit) preds <- augment(fit) # plot with predictions ggplot(d_examp, aes(K - 273.15, ln.rate)) + geom_point(col = 'green4') + geom_line(aes(K - 273.15, .fitted), preds) + ylab('log Metabolic rate') + xlab('Assay temperature (ºC)') + theme_bw(base_size = 12, base_family = 'Helvetica') + ggtitle('Single TPC with fitted model') This fit provides p-values and confidence intervals can be calculated using nlstools::confint2(). However, bootstrapping can provide confidence intervals around predictions and for estimated parameters. The bootstrap() function in modelr samples bootstrap replicates (here we do 200), each of which is randomly sampled with replacement. This creates a list column in our tibble called strap which contains the bootsrapped dataset, and a new column called boot_num that is the number of that bootstrap (from 1 to 200). We can then create a new list column of the fit for each strap using purrr::map(). fit_boots <- d_examp %>% modelr::bootstrap(n = 200, id = 'boot_num') %>% group_by(boot_num) %>% mutate(fit = map(strap, ~nls_multstart(ln.rate ~ schoolfield_high(lnc, E, Eh, Th, temp = K, Tc = 20), data = data.frame(.), iter = 100, start_lower = c(lnc = -10, E = 0.1, Eh = 0.2, Th = 285), start_upper = c(lnc = 10, E = 2, Eh = 5, Th = 330), lower = c(lnc=-10, E=0, Eh=0, Th=0), supp_errors = 'Y') )) fit_boots ## # A tibble: 200 x 3 ## # Groups: boot_num [200] ## strap boot_num fit ## <list> <chr> <list> ## 1 <S3: resample> 001 <S3: nls> ## 2 <S3: resample> 002 <S3: nls> ## 3 <S3: resample> 003 <S3: nls> ## 4 <S3: resample> 004 <S3: nls> ## 5 <S3: resample> 005 <S3: nls> ## 6 <S3: resample> 006 <S3: nls> ## 7 <S3: resample> 007 <S3: nls> ## 8 <S3: resample> 008 <S3: nls> ## 9 <S3: resample> 009 <S3: nls> ## 10 <S3: resample> 010 <S3: nls> ## # ... with 190 more rows Each bootstrap replicate is stored in a list column within the tibble . This then allows us to apply the tidy() and augment() functions used earlier, using unnest() to combine the list column into a dataframe. # get parameters #### params_boot <- fit_boots %>% unnest(fit %>% map(tidy)) %>% ungroup() # get predictions preds_boot <- fit_boots %>% unnest(fit %>% map(augment)) %>% ungroup() Using these two dataframes, we can plot each set of bootstrapped predictions alongside the fit of the original data, preds, and plot the distribution of each estimated parameter. The relatively new package patchwork by Thomas Lin Pedersen can help add multiple graphs together simply by saying plot_1 + plot_2. # plot distribution of estimated parameters p1 <- ggplot(params_boot, aes(estimate)) + geom_histogram(col = 'black', fill = 'white') + facet_wrap(~ term, scales = 'free_x') # plot points with predictions p2 <- ggplot() + geom_line(aes(K - 273.15, .fitted, group = boot_num), preds_boot, alpha = .03) + geom_line(aes(K - 273.15, .fitted), preds) + geom_point(aes(K - 273.15, ln.rate), d_examp, col = 'green4') + ylab('log Metabolic rate') + xlab('Assay temperature (ºC)') + theme_bw(base_size = 12, base_family = 'Helvetica') # plot both p1 + p2 We can smooth our predictions over smaller increments of our predictor variable by passing a new dataset to augment(). Alongside this, for every value of the predictor we can calculate the 2.5% and 97.5% quantiles which gives confidence bands around the predictions. Personally I prefer this approach rather than plotting each bootstrapped replicate. # new data frame of predictions new_preds <- d_examp %>% do(., data.frame(K = seq(min(.$K), max(.$K), length.out = 250), stringsAsFactors = FALSE)) # create smooth predictions for best fit preds <- augment(fit, newdata = new_preds) # create smoother predictions for bootstrapped replicate preds <- fit_boots %>% unnest(fit %>% map(augment, newdata = new_preds)) %>% # group by each value of K and get quantiles group_by(., K) %>% summarise(lwr_CI = quantile(.fitted, 0.025), upr_CI = quantile(.fitted, 0.975)) %>% ungroup() %>% merge(., preds, by = 'K') # plot ggplot() + geom_point(aes(K - 273.15, ln.rate), d_examp) + geom_line(aes(K - 273.15, .fitted), preds) + geom_ribbon(aes(K - 273.15, ymin = lwr_CI, ymax = upr_CI), fill = 'green4', preds, alpha = .2) + ylab('log Metabolic rate') + xlab('Assay temperature (ºC)') + theme_bw(base_size = 12, base_family = 'Helvetica') + ggtitle('Single TPC with confidence intervals') This is starting to look pretty nice, and is a great approach to visualising uncertainty of non-linear regressions for many types of data. From params_boot we can calculate confidence intervals of each estimated parameter by taking the desired quantiles of the data. This can be compared the output from confint2() from nlstools. # calculate confidence intervals of estimated parameters using confint2() confint_1 <- confint2(fit) %>% data.frame() %>% rename(., conf_low = X2.5.., conf_high = X97.5..) %>% mutate(method = 'nlstools') %>% cbind(., select(params, term, estimate)) %>% select(., term, estimate, conf_low, conf_high, method) # calculate confidence intervals using bootstraps confint_2 <- group_by(params_boot, term) %>% summarise(., conf_low = quantile(estimate, 0.025), conf_high = quantile(estimate, 0.975), estimate = quantile(estimate, 0.5)) %>% ungroup() %>% mutate(., method = 'boot') # bind both methods confint <- bind_rows(confint_1, confint_2) # plot each method side by side ggplot(confint, aes(method, estimate, col = method)) + geom_point(size = 3) + geom_linerange(aes(ymin = conf_low, ymax = conf_high)) + facet_wrap(~ term, scales = 'free_y') + theme_bw() + theme(legend.position = 'none') + ggtitle('Comparison of confidence interval calculation for estimated parameters') Bootstrapping gives similar mean estimates, but gives wider, asymmetric, confidence intervals compared to those calculated using nlstools::confint2(). In this instance this could be because I am only running 200 bootstrap replicates, whereas the number of bootstraps done in published analyses are commonly around 10,000. Crucially, bootstrapping allows the calculation of confidence intervals for parameters derived from the model that were not present in the initial fitting process. For example, the optimum temperature of a thermal performance curve, $T_{opt}$ is calculated as: $T_{opt} = \frac{E_{h}T_{h}}{E_{h} + k T_{h} ln(\frac{E_{h}}{E} - 1)}$ We can calculate $T_{opt}$ for each iteration of the bootstrap and then plot the distribution of derived parameters. This can be done by using tidyr::spread() to have a column for each estimated parameter, and then calculate Topt for each boot_num. # function for calculating Topt Topt <- function(E, Th, Eh){ return((Eh*Th)/(Eh + (8.62e-05 *Th*log((Eh/E) - 1)))) } Topt_boot <- select(params_boot, boot_num, term, estimate) %>% spread(., term, estimate) %>% mutate(., Topt = Topt(E, Th, Eh)) # plot distribution of Topt ggplot(Topt_boot, aes(Topt - 273.15)) + geom_histogram(col = 'black', fill = 'white') + xlab('Optimum Temperature (ºC)') + ggtitle('Distribution of optimum temperatures via bootstrapping') Bootstrapping allows us to get uncertainty estimates for parameters outside of the original curve fitting process! ## Bootstrapping multiple curves Bootstrapping over each curve can be done by combining functions from the tidyverse to the bootstrap() call. To fit a single model to each curve, I use nest(), mutate() and map() as shown previously. I searched for a way of using the same workflow for bootstrapping, and finally came across the answer. Each element of strap is not strictly a dataframe (more of a promise to be a dataframe), so the only difference to fitting multiple non-linear regressions is the need to specify the data using dataframe(.) within map(). After grouping the dataframe, new datasets are bootstrapped using modelr::bootstrap(). Using unnest() gives a new column called boot_num which represents the bootstrap replicate within each group. The tibble is then re-grouped to include boot_num and the model can finally be fitted to each bootstrapped dataset. Obviously the total number of models your code fits is increases as you up the number of bootstraps, so be aware this code may take a fair while to run! # run nls.multstart on each curve of the original data #### fit_many <- group_by(Chlorella_TRC, growth.temp, process, flux) %>% nest() %>% mutate(., fit = purrr::map(data, ~nls_multstart(ln.rate ~ schoolfield_high(lnc, E, Eh, Th, temp = K, Tc = 20), data = .x, iter = 500, start_lower = c(lnc = -10, E = 0.1, Eh = 0.2, Th = 285), start_upper = c(lnc = 10, E = 2, Eh = 5, Th = 330), supp_errors = 'Y', na.action = na.omit, lower = c(lnc = -10, E = 0, Eh = 0, Th = 0)))) # run bootstrap over many curves #### boot_many <- group_by(Chlorella_TRC, growth.temp, process, flux) %>% # create 200 bootstrap replicates per curve do(., boot = modelr::bootstrap(., n = 200, id = 'boot_num')) %>% # unnest to show bootstrap number, .id unnest() %>% # regroup to include the boot_num group_by(., growth.temp, process, flux, boot_num) %>% # run the model using map() mutate(fit = map(strap, ~nls_multstart(ln.rate ~ schoolfield_high(lnc, E, Eh, Th, temp = K, Tc = 20), data = data.frame(.), iter = 50, start_lower = c(lnc = -10, E = 0.1, Eh = 0.2, Th = 285), start_upper = c(lnc = 10, E = 2, Eh = 5, Th = 330), lower = c(lnc=-10, E=0, Eh=0, Th=0), supp_errors = 'Y') )) Smooth predictions can then be calculated from fit_many and boot_many and plotted to demonstrate the uncertainty of multiple curves. I do some wrangling to get the max and min temperature of each curve so that I only plot predictions over the range of each specific curve (some have measurements up to 46 ºC, some to 49 ºC). # new data frame for smooth predictions new_preds <- Chlorella_TRC %>% do(., data.frame(K = seq(min(.$K), max(.\$K), length.out = 150), stringsAsFactors = FALSE))

# get max and min for each curve
max_min <- group_by(Chlorella_TRC, growth.temp, flux, process) %>%
summarise(., min_K = min(K), max_K = max(K)) %>%
ungroup()

# create smoother predictions for unbootstrapped models
preds_many <- fit_many %>%
unnest(fit %>% map(augment, newdata = new_preds))

# create smoother predictions for bootstrapped replicates
preds_many <- boot_many %>%
unnest(fit %>% map(augment, newdata = new_preds)) %>%
ungroup() %>%
# group by each value of K and get quantiles
group_by(., growth.temp, process, flux, K) %>%
summarise(lwr_CI = quantile(.fitted, 0.025),
upr_CI = quantile(.fitted, 0.975)) %>%
ungroup() %>%
merge(., preds_many, by = c('K', 'growth.temp', 'flux', 'process')) %>%
# merge with max_min to delete predictions outside of the max and min temperatures of each curve
merge(., max_min, by = c('growth.temp', 'flux', 'process')) %>%
group_by(., growth.temp, flux, process) %>%
filter(., K >= unique(min_K) & K <= unique(max_K)) %>%
rename(., ln.rate = .fitted) %>%
ungroup()

# plot predictions
ggplot(Chlorella_TRC, aes(K - 273.15, ln.rate, group = flux)) +
geom_point(alpha = 0.5, size = 0.5) +
geom_line(data = preds_many) +
geom_ribbon(aes(ymin = lwr_CI, ymax = upr_CI, fill = flux), data = preds_many, alpha = .2) +
scale_fill_manual(values = c('green4', 'black')) +
theme_bw(base_size = 12, base_family = 'Helvetica') +
ylab('log Metabolic rate') +
xlab('Assay temperature (ºC)') +
facet_wrap(~ process + growth.temp, labeller = labeller(.multi_line = FALSE)) +
theme_bw(base_size = 12, base_family = 'Helvetica') +
theme(legend.position = c(0.9, 0.15)) +
ggtitle('Multiple TPCs with confidence intervals')

When in doubt, it seems that purrr has the answer. Bootstrapping is a super useful method for visualising the uncertainty of predictions for non-linear regressions, and better allow us to understand how well a particular model fits the data. The tidyverse, as usual, provides a set of tools that makes this method easy to understand and implement, sort of.

## Next steps

Another way of visualising uncertainty in predictions is by using a Bayesian approach. When using Stan, the generated quantities{} block is a great way to create new predictions from the data. This allows for models that can account for the structure of the data (i.e. temperatures within replicates) and visualise the uncertainty of the model fits. The package brms looks to be a great help in fitting non-linear mixed models in a Bayesian framework. Just need to get round to experimenting with it!