Introduction

The aim of this vignette is to provide a quick introduction to the R package CDFt.

This vignette is divided into 3 parts:

  1. A brief presentation of the CDF-t method implemented in the CDFt package.

  2. A toy example to present the CDFt function and how it works.

  3. An example close to an actual climate application to give an idea of how this method can be applied in climate sciences.

Remark: This vignette was originally written for myself as a reminder of what I learned about the CDF-t method, but I hope it can also be useful to others. It reflects my level of understanding of the topic and thus may contain mistakes or erroneous information (I hope not too big nor too many). Constructive comments or suggestions on possible ways to improve this document are welcome, for instance about better ways to explain or illustrate certain ideas, or tips about more efficient ways to program in R. Corrections related to the English language are also welcome.

Bias correction and the method CDF-t

The CDFt package actually contains only one function, called CDFt, which is the implementation of the CDF-t method (Michelangeli, Vrac, and Loukos 2009). CDF-t is a univariate bias-correction method which was first applied to correct wind intensities in the South of France.

Bias correction consists in post-processing a dataset to make it closer, in a statistical sense, to a reference dataset. For instance, one may wish the processed dataset to have the same mean, the same variance, or even the same statistical distribution as the reference dataset. When the reference dataset has a higher spatial resolution than the dataset to be corrected, bias correction can also be seen as a form of downscaling.

In climate sciences, bias correction often refers to the process of correcting outputs of climate models (e.g. GCMs: General Circulation Models, RCMs: Regional Climate Models) with respect to observations. Indeed, climate models are not perfectly able to reproduce the climate dynamics and exhibit errors related, for instance, to the limited spatial resolution of the model, the parametrisation of unresolved processes, or uncertainties in the model initialisation (Ehret et al. 2012).

There exist various bias-correction methods (see e.g. Maraun 2016), but the CDF-t method can be seen as an extension of a widely applied univariate bias-correction method, the so-called quantile mapping (also referred to as quantile matching, or quantile-quantile correction) method (see e.g. Déqué 2007; Piani et al. 2010).

Quantile mapping

In the quantile mapping method, we consider two datasets:

  • G (standing for GCM data) is the dataset we want to correct. G is supposed to follow a distribution with cumulative distribution function \(F_G\).

  • O (standing for Observations) is the reference dataset. O is supposed to follow a distribution with cumulative distribution function \(F_O\).

In the quantile mapping method, we look for a mapping \(M\) such that after transformation, G follows the same distribution as O:

\[ M(G) \sim F_O \]

It can be shown that this condition is satisfied when we select \(M\) in the following way: \[ M(x) = F^{-1}_O \circ F_G(x) \] where \(F^{-1}_O\) denotes the (generalised) inverse of the cumulative distribution function \(F_O\). In practice, \(F_O\) and \(F_G\) are unknown, and to estimate the transformation \(M\), they are replaced by their empirical estimators, the empirical cumulative distribution functions (ECDFs).
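
For illustration, here is a minimal sketch of empirical quantile mapping in R (illustrative only, not the code of the CDFt package): \(F_G\) is estimated with ecdf() and \(F^{-1}_O\) with the empirical quantile function quantile().

# Minimal sketch of quantile mapping with empirical CDFs
set.seed(42)
O <- rnorm(1000, mean = 1, sd = 2)    # reference data
G <- runif(1000, min = -10, max = 10) # data to correct
# M(x) = F_O^{-1}(F_G(x)): F_G estimated by ecdf(), F_O^{-1} by quantile()
M <- function(x) quantile(O, probs = ecdf(G)(x), names = FALSE)
G_corrected <- M(G)
# After mapping, G_corrected approximately follows the distribution of O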

In the quantile mapping approach, it is assumed that the observations O and the GCM data G are stationary (i.e. their statistical distributions do not evolve with time), or at least that the observations O and the GCM data G are observed during the same period.

When those conditions are not met, quantile mapping cannot be used. For instance, it cannot be applied to correct climate model projections for the future, since observations are not available for that period and since the distributions of the observations and of the data simulated by the GCM are expected to change in the future. The CDF-t method provides a framework to tackle this kind of question.

CDF-t

In the CDF-t framework, 3 datasets are considered:

  • \(G_{cal}\) stands for “GCM data in the calibration period”. \(G_{cal}\) is supposed to follow a distribution with cumulative distribution function \(F_{G, Cal}\).

  • \(O_{cal}\) stands for “Observations in the calibration period” and corresponds to the reference dataset in the calibration period. \(O_{cal}\) is supposed to follow a distribution with cumulative distribution function \(F_{O, Cal}\).

  • \(G_{val}\) stands for “GCM data in the validation period”. \(G_{val}\) is supposed to follow a distribution with cumulative distribution function \(F_{G, Val}\).

Remark: Terminology-wise, the validation period is often referred to as the “future” or “projection” period, since CDF-t is often applied to correct climate model projections for the future, in opposition to the “present” or “historical” period for which observations are available (referred to here as the calibration period). However, CDF-t can also be used to correct data from the past or the present. In the same way, we refer to observational data and GCM data since this relates to many applications in climate sciences, but the CDF-t method can simply be thought of as a method to correct one dataset (here, the GCM outputs) with respect to a reference dataset (here, the observations) whenever the reference data are available in the calibration period but not in the validation period.

The objective is to correct \(G_{val}\) such that, after correction, \(G_{val}\) follows the same distribution as \(O_{val}\), the observations that we would observe in the validation period. Those hypothetical observations are assumed to follow an unknown distribution with cumulative distribution function \(F_{O, Val}\).

Since \(F_{O, Val}\) is unknown, we assume that there exists a transformation \(T\) such that:

\[ T(F_{G, Cal}(x)) = F_{O, Cal}(x) \]

and

\[ T(F_{G, Val}(x)) = F_{O, Val}(x)\]

The function \(T(u) = F_{O, Cal} \circ F^{-1}_{G, Cal}(u)\) satisfies the former equation and we assume that it satisfies the latter equation. Hence, we obtain \(F_{O, Val}(x)\) as:

\[\begin{align} F_{O, Val}(x) &= T(F_{G, Val}(x)) \\ &= F_{O, Cal} \circ F^{-1}_{G, Cal} \circ F_{G, Val}(x) \end{align}\]

Then, using \(F^{-1}_{O, Val} = F^{-1}_{G, Val} \circ F_{G, Cal} \circ F^{-1}_{O, Cal}\), we can build the quantile mapping function that transforms the GCM data in the validation period, \(G_{val}\), so that the transformed data have the distribution of the observations in the validation period, \(O_{val}\):

\[\begin{align} M(x) &= F^{-1}_{O, Val} \circ F_{G, Val}(x) \\ &= F^{-1}_{G, Val} \circ F_{G, Cal} \circ F^{-1}_{O, Cal} \circ F_{G, Val}(x) \end{align}\]

Implementation

The CDF-t method implemented in the CDFt package follows these two steps:

  1. Reconstruct \(F_{O, Val}\) from \(F_{O, Cal}\), \(F_{G, Cal}\) and \(F_{G, Val}\).

  2. Apply quantile mapping from \(F_{O, Val}\) and \(F_{G, Val}\) to correct the data \(G_{val}\).

In practice, \(F_{O, Cal}\), \(F_{G, Cal}\) and \(F_{G, Val}\) are estimated using empirical cumulative distribution functions. However, using empirical cumulative distribution functions has an implication: the CDF-t method is only expected to work properly when the observed values of \(O_{cal}\) and \(G_{cal}\) share a similar range of values.
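
To make these two steps concrete, here is a minimal sketch in R using empirical CDFs (illustrative only; the actual CDFt() function additionally discretises the estimated CDFs over npas points and controls the evaluation range through dev):

# Minimal sketch of the two CDF-t steps with empirical CDFs
# (illustrative only, not the code of the CDFt package)
cdft_sketch <- function(O_cal, G_cal, G_val) {
  # Step 1: reconstruct F_{O,Val} = F_{O,Cal} o F_{G,Cal}^{-1} o F_{G,Val}
  F_O_val <- function(x) {
    ecdf(O_cal)(quantile(G_cal, probs = ecdf(G_val)(x), names = FALSE))
  }
  # Step 2: quantile mapping M = F_{O,Val}^{-1} o F_{G,Val},
  # with F_{O,Val}^{-1} = F_{G,Val}^{-1} o F_{G,Cal} o F_{O,Cal}^{-1}
  M <- function(x) {
    u <- ecdf(G_val)(x)
    quantile(G_val,
             probs = ecdf(G_cal)(quantile(O_cal, probs = u, names = FALSE)),
             names = FALSE)
  }
  list(F_O_val = F_O_val, correct = M)
}

Applied to the toy example of the next section, cdft_sketch(O_cal, G_cal, G_val)$correct(G_val) should give a correction comparable to the DS element returned by the CDFt function.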

Toy Example

In this part, we will apply the bias correction method CDF-t to a toy example to illustrate how the method and the package work.

First, let us load the packages that are used in this vignette.

R packages

library(maps)
library(animation)
library(CDFt)

Just in case, I also provide the session information for the sake of reproducibility.

Session information

print(sessionInfo(), locale = FALSE)
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] CDFt_1.0.1    animation_2.6 maps_3.3.0   
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.3  magrittr_1.5    tools_3.6.3     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.3      stringi_1.4.6   rmarkdown_2.1  
##  [9] knitr_1.28      stringr_1.4.0   xfun_0.12       digest_0.6.25  
## [13] rlang_0.4.5     evaluate_0.14

Data

In this toy example, we consider four variables:

  • G_cal plays the role of the GCM data in the calibration period.

  • G_val plays the role of the GCM data in the validation period.

  • O_cal plays the role of the observational data in the calibration period and is used as a reference.

  • O_val plays the role of the observational data in the validation period and is supposed unknown.

The goal is to transform the data G_val so that the transformed data have the same distribution as O_val, which is unknown to us. In the CDF-t method, it is implicitly assumed that the evolution of the GCM data distribution from one time period to another would also make sense for the evolution of the observation distribution (Vrac et al. 2012). It means that the same transformation turns the distribution of O_cal into the distribution of O_val and turns the distribution of G_cal into the distribution of G_val. For our example, we apply a linear transformation such that, from the calibration period to the validation period, the variance of the distribution is multiplied by \(10^2\) and its mean is multiplied by 10 and then increased by 3. In the toy example, the GCM datasets, G_cal and G_val, are constructed from a uniform distribution, whereas the observational datasets, O_cal and O_val, are constructed using a Gaussian distribution.

Simulations

set.seed(1) # random number generator seed for reproducibility
ndat <- 5000
G_cal <- runif(ndat, min = -10, max = 10)
G_val <- runif(ndat, min = -10, max = 10) * 10 + 3
O_cal <- rnorm(ndat, mean = 1, sd = 2)
O_val <- rnorm(ndat, mean = 1, sd = 2) * 10 + 3

# Example of a case where CDF-t is not expected to behave properly
# (O_cal and G_cal do not share a similar range of values):
# ndat <- 5000
# G_cal <- runif(ndat, min = -10, max = 10)
# G_val <- runif(ndat, min = -10, max = 10) * 10 + 3
# O_cal <- rnorm(ndat, mean = 11, sd = 2)
# O_val <- rnorm(ndat, mean = 11, sd = 2) * 10 + 3

Here are the empirical distributions of the simulated data represented by histograms.

Histograms

par(mfrow = c(2, 2))
ylim <- c(0, 0.15)
breaks <- hist(c(O_cal, G_cal, O_val, G_val), breaks ="Scott", plot = FALSE)$breaks
hist(G_cal, breaks = breaks, freq = FALSE, border = "grey", ylim = ylim)
hist(O_cal, breaks = breaks, freq = FALSE, border = "black", ylim = ylim)
hist(G_val, breaks = breaks, freq = FALSE, border = "brown", ylim = ylim)
hist(O_val, breaks = breaks, freq = FALSE, border = "red", ylim = ylim)

The empirical distributions of the simulated data can also be represented by cumulative distribution functions.

Empirical cumulative distribution functions

par(mfrow = c(1, 1))
plot(ecdf(G_val),  col = "brown", lwd = 2, xlim = range(O_cal, G_cal, O_val, G_val), main = "Empirical Cumulative Distribution Function")
lines(ecdf(G_cal), col = "grey", lwd = 2)
lines(ecdf(O_cal), col = "black", lwd = 2)
lines(ecdf(O_val), col = "red", lwd = 2)
legend("topleft", legend = c("G_cal", "G_val", "O_cal", "O_val"), col = c("grey", "brown", "black", "red"), lty = 1, lwd = 2)

Let us now apply CDF-t to correct G_val so that, ideally, it has the same distribution as O_val.

Applying CDF-t for univariate bias-correction

# help("CDFt")
args(CDFt)
## function (ObsRp, DataGp, DataGf, npas = 100, dev = 2) 
## NULL

The CDFt function takes as arguments:

  • ObsRp stands for “Observation used as Reference in the Present period” and it corresponds in our case to O_cal.

  • DataGp stands for “Data from the GCM in the Present period” and it corresponds in our case to G_cal.

  • DataGf stands for “Data from the GCM in the Future period” and it corresponds in our case to G_val.

  • npas corresponds to the number of points for which FRf is estimated. FRf is the cumulative distribution function of ObsRf (“Observations used as Reference in the Future period”).

  • dev helps to define the range of values x for which FRf(x) is reconstructed.

cdft_bc <- CDFt(ObsRp = O_cal, DataGp = G_cal, DataGf = G_val, npas = 1000, dev = 2)
str(cdft_bc)
## List of 6
##  $ x  : num [1:1000] -104 -104 -104 -104 -104 ...
##  $ FRp: num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FGp: num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FGf: num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FRf: num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ DS : num [1:5000] -31.396 -0.772 -6.75 29.036 -2.63 ...

The CDFt function returns a list with the following elements:

  • x, the values for which the function FRf(x) is evaluated.

  • FRp, the empirical cumulative distribution function for the data ObsRp evaluated at x.

  • FGp, the empirical cumulative distribution function for the data DataGp evaluated at x.

  • FGf, the empirical cumulative distribution function for the data DataGf evaluated at x.

  • FRf, the reconstructed empirical cumulative distribution function for the unobserved data ObsRf, evaluated at x. FRf is estimated from FRp, FGp and FGf.

  • DS, standing for “DownScaled”, corresponds to the corrected DataGf data, which should have the same statistical distribution as the unobserved data ObsRf.
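
As a quick sanity check, if the correction behaved as expected, the empirical CDF of the corrected data DS should closely track the reconstructed CDF FRf:

# The ECDF of the corrected data should be close to the reconstructed CDF FRf
max(abs(ecdf(cdft_bc$DS)(cdft_bc$x) - cdft_bc$FRf))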

Evaluating the correction

Let us compare the distribution of the corrected data with the distribution of the reference, O_val. We look at histograms, cumulative distribution functions, and QQ-plots (quantile-quantile plots). For this comparison, the bias-corrected data are evaluated with respect to the theoretical distribution of O_val and also to the empirical distribution of O_val. Indeed, in practice, the theoretical distribution of O_val would be unknown, and to evaluate a bias-correction method on actual cases, one could split the data into a calibration part and a validation part (see the climate application part). There could be differences when comparing with respect to the theoretical or the empirical distribution of O_val due to sampling effects. Indeed, the empirical distribution of O_val is a realisation of a random variable.

Histograms

par(mfrow  = c(1, 4))
breaks <- hist(c(O_val, G_val, cdft_bc$DS), breaks ="Scott", plot = FALSE)$breaks
x <- seq(min(breaks), max(breaks), length.out = 200)
ylim <- c(0, 0.03)
plot(x, dnorm(x, mean = 1 * 10 + 3, sd = 2 * 10), main = "Theoretical density of O_val", ylim = ylim, ylab = "Density", col = "red", type = "l")
hist(O_val, breaks = breaks, freq = FALSE, border = "red", main = "Histogram of O_val", ylim = ylim)
hist(cdft_bc$DS, breaks = breaks, freq = FALSE, border = "purple", main = "Histogram of bias-corrected G_val", ylim = ylim)
hist(G_val, breaks = breaks, freq = FALSE, border = "brown", main = "Histogram of G_val", ylim = ylim)

Empirical cumulative distribution functions

par(mfrow = c(1, 5))
xlim <- range(O_val, G_val, cdft_bc$DS)
x <- seq(xlim[1], xlim[2], length.out = 200)
plot(x, pnorm(x, mean = 1 * 10 + 3, sd = 2 * 10), col = "red", type = "l", lwd = 2, xlim = xlim, main = "Theoretical CDF of O_val", ylab = "Fn(x)")
plot(ecdf(O_val), col = "red", lwd = 2, xlim = xlim, main = "Empirical CDF of O_val")
plot(ecdf(cdft_bc$DS), col = "white", lwd = 2, xlim = xlim, main = "Estimated CDF of O_val")
lines(cdft_bc$x, cdft_bc$FRf, col = "violet", lwd = 2)
plot(ecdf(cdft_bc$DS), col = "purple", lwd = 2, xlim = xlim, main = "Empirical CDF of \n bias-corrected G_val")
plot(ecdf(G_val),  col = "brown", lwd = 2, xlim = xlim, main = "Empirical CDF of G_val")

Quantile - Quantile plots

par(mfrow = c(1, 4))
xlim <- range(O_val, G_val, cdft_bc$DS)

# plotting positions (i - 0.5)/n avoid an infinite theoretical quantile at p = 1
qqplot(qnorm((seq_along(G_val) - 0.5) / length(G_val), mean = 1 * 10 + 3, sd = 2 * 10), cdft_bc$DS, xlim = xlim, ylim = xlim, xlab = "Theoretical quantile of O_val", main = "QQ-plot after correction")
abline(b = 1, a = 0, col = "red")

qqplot(O_val, cdft_bc$DS, xlim = xlim, ylim = xlim, main = "QQ-plot after correction")
abline(b = 1, a = 0, col = "red")

qqplot(qnorm((seq_along(G_val) - 0.5) / length(G_val), mean = 1 * 10 + 3, sd = 2 * 10), G_val, xlim = xlim, ylim = xlim, xlab = "Theoretical quantile of O_val", main = "QQ-plot before correction")
abline(b = 1, a = 0, col = "red")

qqplot(O_val, G_val, xlim = xlim, ylim = xlim, main = "QQ-plot before correction")
abline(b = 1, a = 0, col = "red")

From a qualitative standpoint, we see that after the correction, the G_val data have a statistical distribution closer to the distribution of O_val. In the following, we use one- and two-sample Kolmogorov-Smirnov tests to quantify whether the bias-corrected data are actually closer, distribution-wise, to O_val.

Kolmogorov-Smirnov tests

ks.test(cdft_bc$DS, pnorm, mean = 1 * 10 + 3,  sd = 2 * 10)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  cdft_bc$DS
## D = 0.1443, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(G_val, pnorm, mean = 1 * 10 + 3, sd = 2 * 10)
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  G_val
## D = 0.33211, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(cdft_bc$DS, O_val)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  cdft_bc$DS and O_val
## D = 0.1312, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(G_val, O_val)
## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  G_val and O_val
## D = 0.3364, p-value < 2.2e-16
## alternative hypothesis: two-sided

The test statistic of the Kolmogorov-Smirnov test reflects a distance between two distributions. The distance between the bias-corrected data cdft_bc$DS and the theoretical distribution of O_val is smaller than the distance between the empirical distribution of G_val and the theoretical distribution of O_val. This shows that after correction, the G_val data are closer to O_val in terms of distribution. However, in both cases, the p-values of the Kolmogorov-Smirnov tests are very small (p-value \(< 10^{-6}\)), meaning that we can reject with high confidence the null hypotheses of cdft_bc$DS or G_val having the same distribution as O_val. We obtain similar results when the bias-corrected data are compared to the empirical distribution of O_val. Thus, the CDF-t bias correction is not perfect, but it nonetheless increases the similarity between G_val and the reference O_val.

Climate application

In this part, we apply the CDF-t method to an example that is closer to an actual application in climate sciences. The goal is to correct the ERA-Interim reanalysis (Dee et al. 2011) in the South-East of France ([2°E, 7.5°E] x [42°N, 45°N]) using, as reference, data from CORDEX (Coordinated Regional Climate Downscaling Experiment, Jacob et al. 2014). To be more specific, what we use and refer to as CORDEX data hereafter is only one dataset produced for the CORDEX evaluation. It was produced by the institution IPSL-INERIS and corresponds to the run r1i1p1 made for Europe using the regional model WRF331F with forcings provided by the ERA-Interim reanalysis (Vautard et al. 2013). The CORDEX data have a higher spatial resolution than the ERA-Interim reanalysis, and thus in this case the CDF-t bias-correction method can also be considered as a statistical downscaling technique. In this example, we focus only on the surface temperature at the daily time scale.

Data

To facilitate their use in this vignette, the data from CORDEX and ERA-Interim have been prepared and compiled in an .rdata file, bc_tas_erai2cordex.rdata. It is available here.

The original CORDEX data can be accessed following these instructions.

The original ERA-Interim data are available through the ECMWF web application server.

First, let us load the file bc_tas_erai2cordex.rdata in R.

load(file = "bc_tas_erai2cordex.rdata")
ls()
## [1] "cordex_latlon"     "cordex_tas_calval" "erai_latlon"      
## [4] "erai_tas_calval"   "nearestn_bc"       "time_calval"

The file bc_tas_erai2cordex.rdata contains 6 variables.

  • cordex_tas_calval contains the surface temperature data from the CORDEX regional model simulation. cordex_tas_calval is actually a list containing two matrices, respectively storing the data for the calibration period and the validation period. The first dimension of each matrix is the location (or grid point) and the second dimension is the time.

  • cordex_latlon is a two-column data.frame containing the latitude and the longitude of the grid points of the CORDEX simulations.

  • erai_tas_calval, same as cordex_tas_calval but for the ERA-Interim reanalysis.

  • erai_latlon, same as cordex_latlon but for the ERA-Interim reanalysis.

  • time_calval is a list of two vectors. The first one contains the dates for the calibration period and the second one the dates for the validation period. The dates in time_calval are the same for the CORDEX simulations and the ERA-Interim reanalysis. The calibration period goes from 1989-01-01 to 1998-12-31 and the validation period from 1999-01-01 to 2008-12-31.

  • nearestn_bc contains the ERA-Interim data interpolated to the CORDEX grid points using nearest-neighbor interpolation. It is a list of two matrices, one for the calibration period and one for the validation period. The matrices have the same dimensions as the matrices in cordex_tas_calval.
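
Before plotting, one can quickly inspect the structure of the loaded objects (output not shown here):

# quick look at the structure of the loaded objects
str(cordex_tas_calval, max.level = 1)
str(time_calval, max.level = 1)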

To have an idea of the data, we plot the temperature for the first day of the calibration period.

Visualisation

# function to map the data
map_dat <- function(dat_latlon, dat,
                    xlim = range(dat_latlon$lon),
                    ylim = range(dat_latlon$lat),
                    main = "",
                    unit = "",
                    cex = 1,
                    palette = cm.colors,
                    ncolor = 20,
                    break_lim = c(min(dat) - 1, max(dat) + 1)){
  par_mar <- par("mar")
  map_fr <- maps::map("france", plot = FALSE)
  breaks <- seq(break_lim[1], break_lim[2], length.out = ncolor + 1)
  color <- cut(dat, breaks)
  pal <- palette(ncolor)
  layout(matrix(1:2, ncol = 2),
         widths = c(3, 1),
         heights = c(1, 1))
  par(mar = c(5, 4, 4, 2))
  plot(dat_latlon,
       xlim = xlim,
       ylim = ylim,
       pch = 20,
       col = pal[color],
       cex = cex,
       main = main)
  lines(map_fr, col = "black", lty = 4)
  par(mar = c(4, 1.5, 3, 1.5))
  plot(NA, type = "n", ann = FALSE, xlim = 1:2, ylim = range(breaks),
       xaxt = "n", yaxt = "n", bty = "n")
  rect(1, head(breaks, -1), 1.5, tail(breaks, -1), col = pal)
  blab <- breaks[seq(1, ncolor + 1, length.out = 11)]
  mtext(format(blab, digits = 1, trim = TRUE), side = 2, at = blab, las = 2, cex = 0.7)
  mtext(unit, 2, line = 1.5)
  on.exit(par(mar = par_mar)) # restore the graphical parameters saved at the start
  invisible(NULL)
}

break_lim <- range(erai_tas_calval$cal[, 1], cordex_tas_calval$cal[, 1]) + c(-1, 1)

map_dat(dat_latlon = erai_latlon, dat = erai_tas_calval$cal[, 1],
        main = "ERA-Interim", unit = "tas (K)", 
        xlim = range(erai_latlon$lon, cordex_latlon$lon),
        ylim = range(erai_latlon$lat, cordex_latlon$lat),
        break_lim = break_lim,
        # palette = function(x) rev(heat.colors(x)))
        palette = rainbow)

map_dat(dat_latlon = cordex_latlon, dat = cordex_tas_calval$cal[, 1],
        main = "CORDEX", unit = "tas (K)", cex = 0.4,
        xlim = range(erai_latlon$lon, cordex_latlon$lon),
        ylim = range(erai_latlon$lat, cordex_latlon$lat),
        break_lim = break_lim,
        # palette = function(x) rev(heat.colors(x)))
        palette = rainbow)

The spatial resolution of CORDEX is higher than the spatial resolution of ERA-Interim. Here is what we obtain if we interpolate the data from ERA-Interim to the CORDEX grid points using nearest-neighbor interpolation.

break_lim <- range(erai_tas_calval$cal[, 1], cordex_tas_calval$cal[, 1]) + c(-1, 1)

map_dat(dat_latlon = cordex_latlon, dat = nearestn_bc$cal[, 1],
        main = "ERA-Interim, Interpolated", cex = 0.4,  unit = "tas (K)", break_lim = break_lim,
        # palette = function(x) rev(heat.colors(x)))
        palette = rainbow)

map_dat(dat_latlon = cordex_latlon, dat = cordex_tas_calval$cal[, 1],
        main = "CORDEX", unit = "tas (K)", cex = 0.4, break_lim = break_lim,
        # palette = function(x) rev(heat.colors(x)))
        palette = rainbow)

The interpolated ERA-Interim data show a pattern of temperature similar to the CORDEX simulation, but lack details at a finer scale.

Applying CDF-t to correct the interpolated ERA-Interim data

In the following, we will use the CDF-t method to correct biases in the distribution of the interpolated ERA-Interim data in the validation period with respect to the CORDEX simulations available in the calibration period. The CDF-t method is separately applied to each grid point.

# function to apply CDF-t at each location, using vapply as a loop over all locations (the rows of the data matrices)
bcorrect_all <- function(obs_cal, interpolated_cal, interpolated_val){
  nsites <- nrow(obs_cal)
  obs_val_bs <- vapply(seq.int(nsites),
         FUN = function(ibox){
           CT <- CDFt(ObsRp = obs_cal[ibox, ], DataGp = interpolated_cal[ibox, ], DataGf = interpolated_val[ibox, ])
           return(CT$DS) # returning the downscaled/bias-corrected data.
         },
         FUN.VALUE = numeric(ncol(interpolated_val)))
  return(t(obs_val_bs))
}

cdft_bc <- bcorrect_all(cordex_tas_calval$cal, nearestn_bc$cal, nearestn_bc$val)

Evaluating the correction

Let us plot for each day the ERA-Interim data bias-corrected with CDF-t along with the CORDEX and the interpolated ERA-Interim data.

Temperature maps

# function to plot the CORDEX data, the interpolated ERA-Interim data, and the bias-corrected data with CDF-t
map_compare_bs <- function(cordex_latlon,
                           cordex_dat,
                           erai_dat,
                           bs_dat,
                           main_cordex = "",
                           main_erai = "",
                           main_bs = "",
                           unit = "",
                           cex = 1,
                           palette = cm.colors,
                           ncolor = 20,
                           break_lim = c(min(cordex_dat, erai_dat, bs_dat) - 1,
                                         max(cordex_dat, erai_dat, bs_dat) + 1)){
  par_mar <- par("mar")
  map_fr <- maps::map("france", plot = FALSE)
  breaks <- seq(break_lim[1], break_lim[2], length.out = ncolor + 1)
  color_cordex <- cut(cordex_dat, breaks)
  color_erai <- cut(erai_dat, breaks)
  color_bs <- cut(bs_dat, breaks)
  pal <- palette(ncolor)
  layout(matrix(1:4, ncol = 4),
         widths = c(3, 3, 3, 1),
         heights = rep(1, 4))
  par(mar = c(5, 4, 4, 2))
  plot(cordex_latlon,
       pch = 20,
       cex = 1,
       col = pal[color_erai],
       main = main_erai)
  lines(map_fr, col = "black", lty = 4)
  plot(cordex_latlon,
       pch = 20,
       cex = 1,
       col = pal[color_cordex],
       main = main_cordex)
  lines(map_fr, col = "black", lty = 4)
  plot(cordex_latlon,
       pch = 20,
       cex = 1,
       col = pal[color_bs],
       main = main_bs)
  lines(map_fr, col = "black", lty = 4)
  par(mar = c(4, 1.5, 3, 1.5))
  plot(NA, type = "n", ann = FALSE, xlim = 1:2, ylim = range(breaks),
       xaxt = "n", yaxt = "n", bty = "n")
  rect(1, head(breaks, -1), 1.5, tail(breaks, -1), col = pal)
  blab <- breaks[seq(1, ncolor + 1, length.out = 11)]
  mtext(format(blab, digits = 1, trim = TRUE), side = 2, at = blab, las = 2, cex = 0.7)
  mtext(unit, 2, line = 1.5)
  on.exit(par(mar = par_mar)) # restore the graphical parameters saved at the start
  invisible(NULL)
}

# apply the plot function at each time step
animate_bs <- function(){
  oopt <- ani.options(interval = 0.1, nmax = length(time_calval$val))
  for (i in 1:ani.options("nmax")) {
    map_compare_bs(cordex_latlon,
                   cordex_dat = cordex_tas_calval$val[, i],
                   erai_dat = nearestn_bc$val[, i],
                   bs_dat = cdft_bc[, i],
                   main_cordex = paste("cordex \n", time_calval$val[i]),
                   main_erai = "erai",
                   main_bs = "bs",
                   unit = "tas (K)",
                   cex = 0.7,
                   break_lim = range(cordex_tas_calval$val, nearestn_bc$val, cdft_bc) + c(-1, 1),
                   # palette = function(x) rev(heat.colors(x)))
                   palette = rainbow)
    ani.pause()
  }
  ani.options(oopt)
}
saveHTML(animate_bs(),
         img.name = "bs_animation", htmlfile = "bs_animation.html", ani.height = 200, ani.width = 550)


Qualitatively, the CDF-t bias-corrected data do not perfectly reproduce the CORDEX data in the validation period, but compared to the interpolated ERA-Interim data, they exhibit spatial structures that are closer to what is observed in the CORDEX data.

To have a quantitative score on whether CDF-t improves the similarity between the CORDEX and ERA-Interim distributions at each grid point, we compute the difference between the p-values of two two-sample Kolmogorov-Smirnov tests: one between the CORDEX data and the CDF-t bias-corrected data, and one between the CORDEX data and the interpolated ERA-Interim data. When the difference is positive (respectively negative), it means that CDF-t improves (respectively degrades) the statistical similarity with the CORDEX data with respect to the interpolated ERA-Interim data.

Two-sample Kolmogorov-Smirnov tests

# function to compare two distributions at each grid point with a
# two-sample Kolmogorov-Smirnov test; returns the p-values of the tests

validate_bs <- function(obs_val_bs, obs_val){
  nsites <- nrow(obs_val)
  ks_pv <- vapply(seq.int(nsites),
                  FUN = function(isite){
                    ks.test(obs_val_bs[isite, ], obs_val[isite, ])$p.value
                  },
                  FUN.VALUE = numeric(1))
  return(ks_pv)
}
ks_cdft <- validate_bs(cdft_bc, cordex_tas_calval$val)
ks_nearestn <- validate_bs(nearestn_bc$val, cordex_tas_calval$val)
ks_score <- (ks_cdft - ks_nearestn)

hist(ks_score, breaks ="Scott", main = "Differences of \n K-S test p-values")

map_dat(dat_latlon = cordex_latlon, dat = ks_score,
        main = "Differences of \n K-S test p-values", unit = "p-values diff", cex = 0.4, break_lim = c(-1, 1),
        # palette = function(x) rev(heat.colors(x)))
        palette = cm.colors)

For most grid points, the difference of p-values is positive, which means that the CDF-t correction improves the statistical similarity between the ERA-Interim and the CORDEX data. There are only a few grid points where the CDF-t correction degrades the similarity between the CORDEX and the interpolated ERA-Interim data according to the p-values of the Kolmogorov-Smirnov tests.

CDF-t, independently applied at each grid point, aims to improve the similarity between the statistical distribution of the interpolated ERA-Interim data and that of the CORDEX data at each grid point individually. Let us see if this point-by-point correction also improves the spatial structures of the data, with the CORDEX data still acting as the reference. To do so, we compute the time series of spatial correlations between the CDF-t bias-corrected data and the CORDEX data, and between the interpolated ERA-Interim data and the CORDEX data.

Spatial correlation

# compute the spatial correlation between the two fields at each time step

spatial_cor <- vapply(seq_along(time_calval$val),
       FUN = function(itime){
         cor_cordex_vs_bs <- cor(cordex_tas_calval$val[, itime],  cdft_bc[, itime])
         cor_cordex_vs_erai <- cor(cordex_tas_calval$val[, itime],  nearestn_bc$val[, itime])
         c("cor_cordex_vs_bs" = cor_cordex_vs_bs, "cor_cordex_vs_erai" = cor_cordex_vs_erai)
       },
       FUN.VALUE = numeric(2))

matplot(x = time_calval$val, t(spatial_cor), type = "l",  xaxt = "n", xlab = "time", ylab = "cor", main = "Validation: spatial correlation with the CORDEX data", lty = 1)
xaxis <- as.Date(format(as.Date(time_calval$val), format = "%Y-%m-%d"))
ixaxis <- which(format(xaxis, "%m") == "01" & format(xaxis, "%d") == "01")
ixaxis <- ixaxis[seq(1, length(ixaxis), length.out = 10)]
xaxis <- xaxis[ixaxis]
axis(1, at = time_calval$val[ixaxis], labels=xaxis)
legend("bottomleft", legend = c("bias-corrected vs cordex", "erai vs cordex"), lty = 1, col = 1:2)

Compared to the ERA-Interim data interpolated with the nearest-neighbor method, the CDF-t bias-corrected data tend to exhibit a higher spatial correlation with the CORDEX data at each time step. We note that the spatial correlation between the CDF-t bias-corrected data and the CORDEX data is not stable through time. The spatial correlation decreases at times when the correlation between the interpolated ERA-Interim data and the CORDEX data is also not as high as usual.

Conclusion

In this vignette, we wanted to provide a quick introduction to the CDF-t method and present its implementation in the R package CDFt. We hope that we managed to give a feel for how the method and the package work and how they could be applied in the context of climate sciences.

Any questions, comments or suggestions are welcome and can be sent at sthao@lsce.ipsl.fr

References

Dee, D. P., S. M. Uppala, A. J. Simmons, P. Berrisford, P. Poli, S. Kobayashi, U. Andrae, et al. 2011. “The ERA-Interim Reanalysis: Configuration and Performance of the Data Assimilation System.” Quarterly Journal of the Royal Meteorological Society 137 (656): 553–97. doi:10.1002/qj.828.

Déqué, Michel. 2007. “Frequency of Precipitation and Temperature Extremes over France in an Anthropogenic Scenario: Model Results and Statistical Correction According to Observed Values.” Extreme Climatic Events 57 (1): 16–26. doi:10.1016/j.gloplacha.2006.11.030.

Ehret, U., E. Zehe, V. Wulfmeyer, K. Warrach-Sagi, and J. Liebert. 2012. “HESS Opinions ‘Should We Apply Bias Correction to Global and Regional Climate Model Data?’” Hydrology and Earth System Sciences 16 (9): 3391–3404. doi:10.5194/hess-16-3391-2012.

Jacob, Daniela, Juliane Petersen, Bastian Eggert, Antoinette Alias, Ole Bøssing Christensen, Laurens M. Bouwer, Alain Braun, et al. 2014. “EURO-CORDEX: New High-Resolution Climate Change Projections for European Impact Research.” Regional Environmental Change 14 (2): 563–78. doi:10.1007/s10113-013-0499-2.

Maraun, Douglas. 2016. “Bias Correcting Climate Change Simulations - a Critical Review.” Current Climate Change Reports 2 (4): 211–20. doi:10.1007/s40641-016-0050-x.

Michelangeli, P.-A., M. Vrac, and H. Loukos. 2009. “Probabilistic Downscaling Approaches: Application to Wind Cumulative Distribution Functions.” Geophysical Research Letters 36 (11). doi:10.1029/2009GL038401.

Piani, C., G.P. Weedon, M. Best, S.M. Gomes, P. Viterbo, S. Hagemann, and J.O. Haerter. 2010. “Statistical Bias Correction of Global Simulated Daily Precipitation and Temperature for the Application of Hydrological Models.” Journal of Hydrology 395 (3-4): 199–215. doi:10.1016/j.jhydrol.2010.10.024.

Vautard, Robert, Andreas Gobiet, Daniela Jacob, Michal Belda, Augustin Colette, Michel Déqué, Jesús Fernández, et al. 2013. “The Simulation of European Heat Waves from an Ensemble of Regional Climate Models Within the EURO-CORDEX Project.” Climate Dynamics 41 (9): 2555–75. doi:10.1007/s00382-013-1714-z.

Vrac, M., P. Drobinski, A. Merlo, M. Herrmann, C. Lavaysse, L. Li, and S. Somot. 2012. “Dynamical and Statistical Downscaling of the French Mediterranean Climate: Uncertainty Assessment.” Nat. Hazards Earth Syst. Sci. 12 (9): 2769–84. doi:10.5194/nhess-12-2769-2012.