The aim of this vignette is to provide a quick introduction to the R package CDFt.
This vignette is divided into 3 parts:
A brief presentation of the CDF-t method implemented in the CDFt package.
A toy example to present the CDFt function and how it works.
An example close to an actual climate application to give an idea of how this method can be applied in climate sciences.
Remark: This vignette was originally written as a reminder to myself of what I learned about the CDF-t method, but I hope it can also be useful to others. This vignette reflects my level of understanding of the topic and thus may contain mistakes or erroneous information (I hope neither too big nor too many). Constructive comments or suggestions on possible ways to improve this document are welcome. For instance, they could be about better ways to explain or illustrate certain ideas, or tips about more efficient ways to program in R. Corrections related to the English language are also welcome.
The CDFt package actually contains only one function, called CDFt, which is the implementation of the CDF-t method (Michelangeli, Vrac, and Loukos 2009). CDF-t is a univariate bias-correction method which was first applied to correct wind intensities in the South of France.
Bias correction consists in post-processing a dataset to make it closer, in a statistical sense, to a reference dataset. For instance, one may wish the processed dataset to have the same mean, the same variance, or even the same statistical distribution as the reference dataset. When the reference dataset has a higher spatial resolution than the dataset to be corrected, bias correction can also be seen as a form of downscaling.
In climate sciences, bias correction often refers to the process of correcting outputs of climate models (e.g. GCMs: General Circulation Models, RCMs: Regional Climate Models) with respect to observations. Indeed, climate models are not perfectly able to reproduce the climate dynamics and exhibit errors related, for instance, to the limited spatial resolution of the model, the parametrisation of unresolved processes, or uncertainties in the model initialization (Ehret et al. 2012).
There exist various bias-correction methods (see e.g. Maraun 2016), but the CDF-t method can be seen as an extension of a widely applied univariate bias-correction method, the so-called quantile mapping (also referred to as quantile matching, or quantile-quantile correction) method (see e.g. Déqué 2007; Piani et al. 2010).
In the quantile mapping method, we consider two datasets:
G (standing for GCM data) is the dataset we want to correct. G is supposed to follow a distribution with cumulative distribution function \(F_G\).
O (standing for Observations) is the reference dataset. O is supposed to follow a distribution with cumulative distribution function \(F_O\).
In the quantile mapping method, we look for a mapping \(M\) such that after transformation, G follows the same distribution as O:
\[ M(G) \sim F_O \]
It can be shown that this condition is satisfied when we select \(M\) in the following way: \[ M(x) = F^{-1}_O \circ F_G(x) \] where \(F^{-1}_O\) denotes the (generalised) inverse of the cumulative distribution function \(F_O\). In practice, \(F_O\) and \(F_G\) are unknown and, to estimate the transformation \(M\), they are replaced by their empirical estimators, the empirical cumulative distribution functions (ECDFs).
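The empirical version of this transformation can be sketched in a few lines of base R, with ecdf as the estimator of \(F_G\) and quantile as the estimator of \(F^{-1}_O\). The helper name qmap is mine, for illustration only; it is not part of the CDFt package.

```r
# Minimal sketch of empirical quantile mapping: M(x) = F_O^{-1}(F_G(x)).
# qmap is a hypothetical helper, not a function of the CDFt package.
qmap <- function(g, o) {
  F_G <- ecdf(g)                              # empirical estimator of F_G
  quantile(o, probs = F_G(g), names = FALSE)  # empirical estimator of F_O^{-1}
}

set.seed(1)
g <- runif(5000, min = -10, max = 10)  # data to correct (uniform)
o <- rnorm(5000, mean = 1, sd = 2)     # reference data (Gaussian)
g_mapped <- qmap(g, o)
# g_mapped should now be approximately distributed like o
```

After the mapping, the histogram of g_mapped should closely follow that of the reference o, even though g was drawn from a completely different distribution.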
In the quantile mapping approach, it is assumed that the observations O and the GCM data G are stationary (i.e. their statistical distributions don't evolve with time), or at least that the observations O and the GCM data G are observed during the same period.
As a consequence, quantile mapping cannot be used, for instance, to correct climate model projections for the future, since observations are not available for that period and since the distributions of the observations and of the data simulated by the GCM are expected to change in the future. The CDF-t method provides a framework to tackle this kind of question.
In the CDF-t framework, 3 datasets are considered:
\(G_{cal}\) stands for “GCM data in the calibration period”. \(G_{cal}\) is supposed to follow a distribution with cumulative distribution function \(F_{G, Cal}\).
\(O_{cal}\) stands for “Observations in the calibration period” and corresponds to the reference dataset in the calibration period. \(O_{cal}\) is supposed to follow a distribution with cumulative distribution function \(F_{O, Cal}\).
\(G_{val}\) stands for “GCM data in the validation period”. \(G_{val}\) is supposed to follow a distribution with cumulative distribution function \(F_{G, Val}\).
Remark: Terminology-wise, the validation period is often referred to as the “future” or “projection” period, since CDF-t is often applied to correct climate model projections for the future, in opposition to the “present” or “historical” period for which observations are available (this period is referred to here as the calibration period). However, CDF-t can also be used to correct data from the past or the present. In the same way, we refer to observational data and GCM data since this relates to many applications in climate sciences, but the CDF-t method can simply be thought of as a method to correct one dataset (here, the GCM outputs) with respect to a reference dataset (here, the observations) whenever the reference data are available in the calibration period but not in the validation period.
The objective is to correct \(G_{val}\) such that, after correction, \(G_{val}\) follows the same distribution as \(O_{val}\), the observations that we would observe in the validation period. Those hypothetical observations are assumed to follow an unknown distribution with cumulative distribution function \(F_{O, Val}\).
Since \(F_{O, Val}\) is unknown, we assume that there exists a transformation \(T\) such that:
\[ T(F_{G, Cal}(x)) = F_{O, Cal}(x) \]
and
\[ T(F_{G, Val}(x)) = F_{O, Val}(x)\]
The function \(T(u) = F_{O, Cal} \circ F^{-1}_{G, Cal}(u)\) satisfies the former equation, and we assume that it also satisfies the latter. Hence, we obtain \(F_{O, Val}(x)\) as:
\[\begin{align} F_{O, Val}(x) &= T(F_{G, Val}(x)) \\ &= F_{O, Cal} \circ F^{-1}_{G, Cal} \circ F_{G, Val}(x) \end{align}\]
Then, we can build the quantile mapping function that transforms the GCM data in the validation period, \(G_{val}\), so that the transformed data have the distribution of the observations in the validation period, \(O_{val}\):
\[\begin{align} M(x) &= F^{-1}_{O, Val} \circ F_{G, Val}(x) \\ &= F^{-1}_{G, Val} \circ F_{G, Cal} \circ F^{-1}_{O, Cal} \circ F_{G, Val}(x) \end{align}\]
The CDF-t method implemented in the CDFt package follows these two steps:
Reconstruct \(F_{O, Val}\) from \(F_{O, Cal}\), \(F_{G, Cal}\) and \(F_{G, Val}\).
Apply quantile mapping from \(F_{O, Val}\) and \(F_{G, Val}\) to correct the data \(G_{val}\).
In practice, \(F_{O, Cal}\), \(F_{G, Cal}\) and \(F_{G, Val}\) are estimated using empirical cumulative distribution functions. However, using empirical cumulative distribution functions has an implication: the CDF-t method is only expected to work properly when the observed values of \(O_{cal}\) and \(G_{cal}\) share a similar range of values.
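Before turning to the package itself, the two steps above can be sketched in a few lines of base R directly from the formula \(M(x) = F^{-1}_{G, Val} \circ F_{G, Cal} \circ F^{-1}_{O, Cal} \circ F_{G, Val}(x)\), using ecdf and quantile as empirical estimators. This is my own minimal reimplementation for illustration only; it is not the code of the CDFt package, which works with discretised CDFs and handles edge cases more carefully.

```r
# Minimal sketch of CDF-t (illustration only, not the CDFt package code):
# M(x) = F^{-1}_{G,Val} o F_{G,Cal} o F^{-1}_{O,Cal} o F_{G,Val}(x)
cdft_sketch <- function(o_cal, g_cal, g_val) {
  u  <- ecdf(g_val)(g_val)                         # F_{G,Val}(x)
  x1 <- quantile(o_cal, probs = u, names = FALSE)  # F^{-1}_{O,Cal}
  u2 <- ecdf(g_cal)(x1)                            # F_{G,Cal}
  quantile(g_val, probs = u2, names = FALSE)       # F^{-1}_{G,Val}
}

set.seed(1)
o_cal <- rnorm(5000, mean = 1, sd = 2)             # observations, calibration
g_cal <- runif(5000, min = -10, max = 10)          # GCM data, calibration
g_val <- runif(5000, min = -10, max = 10) * 10 + 3 # GCM data, validation
corrected <- cdft_sketch(o_cal, g_cal, g_val)
# corrected should approximately follow the distribution that o_cal would
# have after the same change, i.e. a Gaussian with mean 13 and sd 20
```

Note that the last composition maps back through the range of g_val, which is why the method needs \(O_{cal}\) and \(G_{cal}\) to span similar ranges of values.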
In this part, we will apply the bias correction method CDF-t to a toy example to illustrate how the method and the package work.
First, let us load the packages that are used in this vignette.
library(maps)
library(animation)
library(CDFt)
Just in case, I provide the session information for the sake of reproducibility.
print(sessionInfo(), locale = FALSE)
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] CDFt_1.0.1 animation_2.6 maps_3.3.0
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.3 magrittr_1.5 tools_3.6.3 htmltools_0.4.0
## [5] yaml_2.2.1 Rcpp_1.0.3 stringi_1.4.6 rmarkdown_2.1
## [9] knitr_1.28 stringr_1.4.0 xfun_0.12 digest_0.6.25
## [13] rlang_0.4.5 evaluate_0.14
In this toy example, we consider four variables:
G_cal plays the role of the GCM data in the calibration period.
G_val plays the role of the GCM data in the validation period.
O_cal plays the role of the observational data in the calibration period and is used as a reference.
O_val plays the role of the observational data in the validation period and is supposed unknown.
The goal is to transform the data G_val so that the transformed data have the same distribution as O_val, which is unknown to us. In the CDF-t method, it is implicitly assumed that the evolution of the GCM data distribution from one time period to another also makes sense for the evolution of the observation distribution (Vrac et al. 2012). It means that there exists a common transformation that turns the distribution of O_cal into the distribution of O_val and that turns the distribution of G_cal into the distribution of G_val. For our example, we apply a linear transformation such that, from the calibration period to the validation period, the variance of the distribution is multiplied by 10² and its mean is multiplied by 10 and then increased by 3. In the toy example, the GCM datasets, G_cal and G_val, are constructed from a uniform distribution, whereas the observational datasets, O_cal and O_val, are constructed using a Gaussian distribution.
set.seed(1) # random number generator seed for reproducibility
ndat <- 5000
G_cal <- runif(ndat, min = -10, max = 10)
G_val <- runif(ndat, min = -10, max = 10) * 10 + 3
O_cal <- rnorm(ndat, mean = 1, sd = 2)
O_val <- rnorm(ndat, mean = 1, sd = 2) * 10 + 3
# Example of a case where CDF-t is not expected to behave properly:
# ndat <- 5000
# G_cal <- runif(ndat, min = -10, max = 10)
# G_val <- runif(ndat, min = -10, max = 10) * 10 + 3
# O_cal <- rnorm(ndat, mean = 11, sd = 2)
# O_val <- rnorm(ndat, mean = 11, sd = 2) * 10 + 3
Here are the empirical distributions of the simulated data represented by histograms.
par(mfrow = c(2,2))
ylim <- c(0, 0.15)
breaks <- hist(c(O_cal, G_cal, O_val, G_val), breaks ="Scott", plot = FALSE)$breaks
hist(G_cal, breaks = breaks, freq = FALSE, border = "grey", ylim = ylim)
hist(O_cal, breaks = breaks, freq = FALSE, border = "black", ylim = ylim)
hist(G_val, breaks = breaks, freq = FALSE, border = "brown", ylim = ylim)
hist(O_val, breaks = breaks, freq = FALSE, border = "red", ylim = ylim)
The empirical distributions of the simulated data can also be represented by cumulative distribution functions.
par(mfrow = c(1, 1))
plot(ecdf(G_val), col = "brown", lwd = 2, xlim = range(O_cal, G_cal, O_val, G_val), main = "Empirical Cumulative Distribution Function")
lines(ecdf(G_cal), col = "grey", lwd = 2)
lines(ecdf(O_cal), col = "black", lwd = 2)
lines(ecdf(O_val), col = "red", lwd = 2)
legend("topleft", legend = c("G_cal", "G_val", "O_cal", "O_val"), col = c("grey", "brown", "black", "red"), lty = 1, lwd = 2)
Let us now apply CDF-t to correct G_val so that, ideally, it has the same distribution as O_val.
# help("CDFt")
args(CDFt)
## function (ObsRp, DataGp, DataGf, npas = 100, dev = 2)
## NULL
The CDFt function takes the following arguments:
ObsRp stands for “Observation used as Reference in the Present period” and corresponds in our case to O_cal.
DataGp stands for “Data from the GCM in the Present period” and corresponds in our case to G_cal.
DataGf stands for “Data from the GCM in the Future period” and corresponds in our case to G_val.
npas corresponds to the number of points at which FRf is estimated. FRf is the cumulative distribution function of ObsRf (“Observation used as Reference in the Future period”).
dev helps to define the range of values x for which FRf(x) is reconstructed.
cdft_bc <- CDFt(ObsRp = O_cal, DataGp = G_cal, DataGf = G_val, npas = 1000, dev = 2)
str(cdft_bc)
## List of 6
## $ x : num [1:1000] -104 -104 -104 -104 -104 ...
## $ FRp: num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
## $ FGp: num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
## $ FGf: num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
## $ FRf: num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
## $ DS : num [1:5000] -31.396 -0.772 -6.75 29.036 -2.63 ...
The CDFt function returns a list with the following elements:
x, the values at which the function FRf(x) is evaluated.
FRp, the empirical cumulative distribution function of the data ObsRp evaluated at x.
FGp, the empirical cumulative distribution function of the data DataGp evaluated at x.
FGf, the empirical cumulative distribution function of the data DataGf evaluated at x.
FRf, the reconstructed cumulative distribution function of the unobserved data ObsRf evaluated at x. FRf is estimated from FRp, FGp and FGf.
DS, standing for “DownScaled”, corresponds to the corrected DataGf data, which should have the same statistical distribution as the unobserved data ObsRf.
Let us compare the distribution of the corrected data with the distribution of the reference, O_val. We look at histograms, cumulative distribution functions and QQ-plots (quantile-quantile plots). For this comparison, the bias-corrected data are evaluated with respect to both the theoretical distribution of O_val and the empirical distribution of O_val. Indeed, in practice, the theoretical distribution of O_val would be unknown, and to evaluate a bias-correction method on actual cases, one could split the data into a calibration part and a validation part (see the climate application part). There could be differences when comparing with respect to the theoretical or the empirical distribution of O_val due to sampling effects: the empirical distribution of O_val is itself a realisation of a random variable.
par(mfrow = c(1, 4))
breaks <- hist(c(O_val, G_val, cdft_bc$DS), breaks ="Scott", plot = FALSE)$breaks
x <- seq(min(breaks), max(breaks), length.out = 200)
ylim <- c(0, 0.03)
plot(x, dnorm(x, mean = 1 * 10 + 3, sd = 2 * 10), main = "Theoretical density of O_val", ylim = ylim, ylab = "Density", col = "red", type = "l")
hist(O_val, breaks = breaks, freq = FALSE, border = "red", main = "Histogram of O_val", ylim = ylim)
hist(cdft_bc$DS, breaks = breaks, freq = FALSE, border = "purple", main = "Histogram of bias-corrected G_val", ylim = ylim)
hist(G_val, breaks = breaks, freq = FALSE, border = "brown", main = "Histogram of G_val", ylim = ylim)
par(mfrow = c(1, 5))
xlim <- range(O_val, G_val, cdft_bc$DS)
x <- seq(xlim[1], xlim[2], length.out = 200)
plot(x, pnorm(x, mean = 1 * 10 + 3, sd = 2 * 10), col = "red", type = "l", lwd = 2, xlim = xlim, main = "Theoretical CDF of O_val", ylab = "Fn(x)")
plot(ecdf(O_val), col = "red", lwd = 2, xlim = xlim, main = "Empirical CDF of O_val")
plot(ecdf(cdft_bc$DS), col = "white", lwd = 2, xlim = xlim, main = "Estimated CDF of O_val")
lines(cdft_bc$x, cdft_bc$FRf, col = "violet", lwd = 2)
plot(ecdf(cdft_bc$DS), col = "purple", lwd = 2, xlim = xlim, main = "Empirical CDF of \n bias-corrected G_val")
plot(ecdf(G_val), col = "brown", lwd = 2, xlim = xlim, main = "Empirical CDF of G_val")
par(mfrow = c(1, 4))
xlim <- range(O_val, G_val, cdft_bc$DS)
qqplot(qnorm(ppoints(length(G_val)), mean = 1 * 10 + 3, sd = 2 * 10), cdft_bc$DS, xlim = xlim, ylim = xlim, xlab = "Theoretical quantile of O_val", main = "QQ-plot after correction") # ppoints avoids the infinite quantile at p = 1
abline(b = 1, a = 0, col = "red")
qqplot(O_val, cdft_bc$DS, xlim = xlim, ylim = xlim, main = "QQ-plot after correction")
abline(b = 1, a = 0, col = "red")
qqplot(qnorm(ppoints(length(G_val)), mean = 1 * 10 + 3, sd = 2 * 10), G_val, xlim = xlim, ylim = xlim, xlab = "Theoretical quantile of O_val", main = "QQ-plot before correction") # ppoints avoids the infinite quantile at p = 1
abline(b = 1, a = 0, col = "red")
qqplot(O_val, G_val, xlim = xlim, ylim = xlim, main = "QQ-plot before correction")
abline(b = 1, a = 0, col = "red")
From a qualitative standpoint, we see that after the correction, the G_val data have a statistical distribution closer to the distribution of O_val. In the following, we use one-sample and two-sample Kolmogorov-Smirnov tests to quantify whether the bias-corrected data are actually closer, distribution-wise, to O_val.
ks.test(cdft_bc$DS, pnorm, mean = 1 * 10 + 3, sd = 2 * 10)
##
## One-sample Kolmogorov-Smirnov test
##
## data: cdft_bc$DS
## D = 0.1443, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(G_val, pnorm, mean = 1 * 10 + 3, sd = 2 * 10)
##
## One-sample Kolmogorov-Smirnov test
##
## data: G_val
## D = 0.33211, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(cdft_bc$DS, O_val)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: cdft_bc$DS and O_val
## D = 0.1312, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(G_val, O_val)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: G_val and O_val
## D = 0.3364, p-value < 2.2e-16
## alternative hypothesis: two-sided
The test statistic of the Kolmogorov-Smirnov test reflects a distance between two distributions. The distance between the bias-corrected data cdft_bc$DS and the theoretical distribution of O_val is smaller than the distance between the empirical distribution of G_val and the theoretical distribution of O_val. This shows that after correction, the G_val data are closer to O_val in terms of distribution. However, in both cases, the p-values of the Kolmogorov-Smirnov tests are very small (p-value < 10⁻⁶), meaning that we can reject with high confidence the null hypotheses of cdft_bc$DS or G_val having the same distribution as O_val. We obtain similar results when the bias-corrected data are compared to the empirical distribution of O_val. Thus, the CDF-t bias correction is not perfect, but it nonetheless increases the similarity between G_val and the reference O_val.
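As a complement, it may help to recall what the Kolmogorov-Smirnov statistic actually measures: the largest vertical distance between two empirical cumulative distribution functions. The small sketch below (mine, not the internal code of ks.test) computes it by hand for two samples without ties.

```r
# The two-sample K-S statistic is the maximum over x of |F1(x) - F2(x)|,
# where F1 and F2 are the two empirical CDFs. Since ECDFs are step
# functions, it suffices to evaluate them at the pooled sample points.
ks_distance <- function(a, b) {
  grid <- sort(c(a, b))
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

set.seed(1)
a <- rnorm(500)
b <- rnorm(500, mean = 0.5)
ks_distance(a, b)  # equals the D statistic of ks.test(a, b)
```

This makes explicit why a smaller D after correction means that the two empirical distributions are closer everywhere along the x-axis.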
In this part, we apply the CDF-t method to an example that is closer to an actual application in climate sciences. The goal is to correct the ERA-Interim reanalysis (Dee D. P. et al. 2011) in the South-East of France ([2E, 7.5E] x [42N, 45N]) using, as reference, data from CORDEX (Coordinated Regional Climate Downscaling Experiment, Jacob et al. 2014). To be more specific, what we use and refer to as CORDEX data hereafter in this example is only one dataset produced for the CORDEX evaluation. It was produced by the institutions IPSL-INERIS and corresponds to the run r1i1p1 made for Europe using the regional model WRF331F with forcings provided by the ERA-Interim reanalysis (Vautard et al. 2013). The CORDEX data have a higher spatial resolution than the ERA-Interim reanalysis, so in this case the CDF-t bias-correction method can also be considered as a statistical downscaling technique. In this example, we will only focus on the surface temperature at the daily time scale.
To facilitate their use in this vignette, the data from CORDEX and ERA-Interim have been prepared and compiled in an .rdata file, bc_tas_erai2cordex.rdata. It is available here.
The original CORDEX data can be accessed following those instructions.
The original ERA-Interim data are available through the ECMWF web application server.
First, let us load the file bc_tas_erai2cordex.rdata in R.
load(file = "bc_tas_erai2cordex.rdata")
ls()
## [1] "cordex_latlon" "cordex_tas_calval" "erai_latlon"
## [4] "erai_tas_calval" "nearestn_bc" "time_calval"
The file bc_tas_erai2cordex.rdata contains 6 variables.
cordex_tas_calval contains the surface temperature data from the CORDEX regional simulation. cordex_tas_calval is actually a list containing two matrices, respectively storing the data for the calibration period and the validation period. The first dimension of each matrix is the location (or grid point) and the second dimension is the time.
cordex_latlon is a two-column data.frame containing the latitude and the longitude of the grid points of the CORDEX simulations.
erai_tas_calval, same as cordex_tas_calval but for the ERA-Interim reanalysis.
erai_latlon, same as cordex_latlon but for the ERA-Interim reanalysis.
time_calval is a list of two vectors. The first one contains the dates for the calibration period and the second one the dates for the validation period. The dates in time_calval are the same for the CORDEX simulations and the ERA-Interim reanalysis. The calibration period goes from 1989-01-01 to 1998-12-31 and the validation period from 1999-01-01 to 2008-12-31.
nearestn_bc contains the ERA-Interim data interpolated to the CORDEX grid points using nearest-neighbor interpolation. It is a list of two matrices, one for the calibration period and one for the validation period. The matrices have the same dimensions as the matrices in cordex_tas_calval.
To have an idea of the data, we plot the temperature for the first day of the calibration period.
# function to map the data
map_dat <- function(dat_latlon, dat,
xlim = range(dat_latlon$lon),
ylim = range(dat_latlon$lat),
main = "",
unit = "",
cex = 1,
palette = cm.colors,
ncolor = 20,
break_lim = c(min(dat) - 1, max(dat) + 1)){
par_mar <- par("mar")
map_fr <- maps::map("france", plot = FALSE)
breaks <- seq(break_lim[1], break_lim[2], length.out = ncolor + 1)
color <- cut(dat, breaks)
pal <- palette(ncolor)
layout(matrix(1:2, ncol = 2),
widths = c(3, 1),
heights = c(1, 1))
par(mar = c(5, 4, 4, 2))
plot(dat_latlon,
xlim = xlim,
ylim = ylim,
pch = 20,
col = pal[color],
cex = cex,
main = main)
lines(map_fr, col = "black", lty = 4)
par(mar = c(4, 1.5, 3, 1.5))
plot(NA, type = "n", ann = FALSE, xlim = 1:2, ylim = range(breaks),
xaxt = "n", yaxt = "n", bty = "n")
rect(1, head(breaks, -1), 1.5, tail(breaks, -1), col = pal)
blab <- breaks[seq(1, ncolor + 1, length.out = 11)]
mtext(format(blab, digits = 1, trim = TRUE), side = 2, at = blab, las = 2, cex = 0.7)
mtext(unit, 2, line = 1.5)
on.exit(par(mar = par_mar)) # restore the original margins saved in par_mar
invisible(NULL)
}
break_lim <- range(erai_tas_calval$cal[, 1], cordex_tas_calval$cal[, 1]) + c(-1, 1)
map_dat(dat_latlon = erai_latlon, dat = erai_tas_calval$cal[, 1],
main = "ERA-Interim", unit = "tas (K)",
xlim = range(erai_latlon$lon, cordex_latlon$lon),
ylim = range(erai_latlon$lat, cordex_latlon$lat),
break_lim = break_lim,
# palette = function(x) rev(heat.colors(x)))
palette = rainbow)
map_dat(dat_latlon = cordex_latlon, dat = cordex_tas_calval$cal[, 1],
main = "CORDEX", unit = "tas (K)", cex = 0.4,
xlim = range(erai_latlon$lon, cordex_latlon$lon),
ylim = range(erai_latlon$lat, cordex_latlon$lat),
break_lim = break_lim,
# palette = function(x) rev(heat.colors(x)))
palette = rainbow)
The spatial resolution of CORDEX is higher than the spatial resolution of ERA-Interim. Here is what we obtain if we interpolate the data from ERA-Interim to the CORDEX grid points using nearest-neighbor interpolation.
break_lim <- range(erai_tas_calval$cal[, 1], cordex_tas_calval$cal[, 1]) + c(-1, 1)
map_dat(dat_latlon = cordex_latlon, dat = nearestn_bc$cal[, 1],
main = "ERA-Interim, Interpolated", cex = 0.4, unit = "tas (K)", break_lim = break_lim,
# palette = function(x) rev(heat.colors(x)))
palette = rainbow)
map_dat(dat_latlon = cordex_latlon, dat = cordex_tas_calval$cal[, 1],
main = "CORDEX", unit = "tas (K)", cex = 0.4, break_lim = break_lim,
# palette = function(x) rev(heat.colors(x)))
palette = rainbow)
The interpolated ERA-Interim data show a pattern of temperature similar to the CORDEX simulation, but lack details at a finer scale.
In the following, we will use the CDF-t method to correct biases in the distribution of the interpolated ERA-Interim data in the validation period with respect to the CORDEX simulations available in the calibration period. The CDF-t method is separately applied to each grid point.
# function to apply CDF-t to all locations, using vapply as a loop over all locations (the rows of the data matrix)
bcorrect_all <- function(obs_cal, interpolated_cal, interpolated_val){
nsites <- nrow(obs_cal)
obs_val_bs <- vapply(seq.int(nsites),
FUN = function(ibox){
CT <- CDFt(ObsRp = obs_cal[ibox, ], DataGp = interpolated_cal[ibox, ], DataGf = interpolated_val[ibox, ])
return(CT$DS) # returning the downscaled/bias-corrected data.
# ds_qFGf <- CDFt_qFGf(obs_cal[ibox, ], interpolated_cal[ibox, ], interpolated_val[ibox, ])
# return(ds_qFGf(interpolated_val[ibox, ]))
},
FUN.VALUE = numeric(ncol(interpolated_val)))
return(t(obs_val_bs))
}
cdft_bc <- bcorrect_all(cordex_tas_calval$cal, nearestn_bc$cal, nearestn_bc$val)
Let us plot for each day the ERA-Interim data bias-corrected with CDF-t along with the CORDEX and the interpolated ERA-Interim data.
# function to plot the CORDEX data, the interpolated ERA-Interim data, and the bias-corrected data with CDF-t
map_compare_bs <- function(cordex_latlon,
cordex_dat,
erai_dat,
bs_dat,
main_cordex = "",
main_erai = "",
main_bs = "",
unit = "",
cex = 1,
palette = cm.colors,
ncolor = 20,
break_lim = c(min(cordex_dat, erai_dat, bs_dat) - 1,
max(cordex_dat, erai_dat, bs_dat) + 1)){
par_mar <- par("mar")
map_fr <- maps::map("france", plot = FALSE)
breaks <- seq(break_lim[1], break_lim[2], length.out = ncolor + 1)
color_cordex <- cut(cordex_dat, breaks)
color_erai <- cut(erai_dat, breaks)
color_bs <- cut(bs_dat, breaks)
pal <- palette(ncolor)
layout(matrix(1:4, ncol = 4),
widths = c(3, 3, 3, 1),
heights = rep(1, 4))
par(mar = c(5, 4, 4, 2))
plot(cordex_latlon,
pch = 20,
cex = 1,
col = pal[color_erai],
main = main_erai)
lines(map_fr, col = "black", lty = 4)
plot(cordex_latlon,
pch = 20,
cex = 1,
col = pal[color_cordex],
main = main_cordex)
lines(map_fr, col = "black", lty = 4)
plot(cordex_latlon,
pch = 20,
cex = 1,
col = pal[color_bs],
main = main_bs)
lines(map_fr, col = "black", lty = 4)
par(mar = c(4, 1.5, 3, 1.5))
plot(NA, type = "n", ann = FALSE, xlim = 1:2, ylim = range(breaks),
xaxt = "n", yaxt = "n", bty = "n")
rect(1, head(breaks, -1), 1.5, tail(breaks, -1), col = pal)
blab <- breaks[seq(1, ncolor + 1, length.out = 11)]
mtext(format(blab, digits = 1, trim = TRUE), side = 2, at = blab, las = 2, cex = 0.7)
mtext(unit, 2, line = 1.5)
on.exit(par(mar = par_mar)) # restore the original margins saved in par_mar
return(NULL)
}
# apply the plot function at all times.
animate_bs <- function(){
oopt <- ani.options(interval = 0.1, nmax = length(time_calval$val))
for (i in 1:ani.options("nmax")) {
map_compare_bs(cordex_latlon,
cordex_dat = cordex_tas_calval$val[, i],
erai_dat = nearestn_bc$val[, i],
bs_dat = cdft_bc[, i],
main_cordex = paste("cordex \n", time_calval$val[i]),
main_erai = "erai",
main_bs = "bs",
unit = "tas (K)",
cex = 0.7,
break_lim = range(cordex_tas_calval$val, nearestn_bc$val, cdft_bc) + c(-1, 1),
# palette = function(x) rev(heat.colors(x)))
palette = rainbow)
ani.pause()
}
ani.options(oopt)
}
saveHTML(animate_bs(),
img.name = "bs_animation", htmlfile = "bs_animation.html", ani.height = 200, ani.width = 550)
Qualitatively, the data bias-corrected with CDF-t do not perfectly reproduce the CORDEX data in the validation period, but compared to the interpolated ERA-Interim data, they exhibit spatial structures that are closer to what is observed in the CORDEX data.
To have a quantitative score on whether CDF-t improves the similarity between the CORDEX and ERA-Interim distributions at each grid point, we compute the difference of p-values of two-sample Kolmogorov-Smirnov tests, respectively between the CORDEX data and the bias-corrected data and between the CORDEX data and the interpolated ERA-Interim data. When the difference is positive (respectively negative), it means that CDF-t improves (respectively degrades) the statistical similarity to the CORDEX data compared to the interpolated ERA-Interim data.
# function to compare two distributions at each grid point with a two-sample Kolmogorov-Smirnov test.
# returns the p-values of the tests
validate_bs <- function(obs_val_bs, obs_val){
nsites <- nrow(obs_val)
ks_pv <- vapply(seq.int(nsites),
FUN = function(isite){
ks.test(obs_val_bs[isite, ], obs_val[isite, ])$p.value
},
FUN.VALUE = numeric(1))
return(ks_pv) # explicitly return the vector of p-values
}
ks_cdft <- validate_bs(cdft_bc, cordex_tas_calval$val)
ks_nearestn <- validate_bs(nearestn_bc$val, cordex_tas_calval$val)
ks_score <- (ks_cdft - ks_nearestn)
hist(ks_score, breaks ="Scott", main = "Differences of \n K-S test p-values")
map_dat(dat_latlon = cordex_latlon, dat = ks_score,
main = "Differences of \n K-S test p-values", unit = "p-values diff", cex = 0.4, break_lim = c(-1, 1),
# palette = function(x) rev(heat.colors(x)))
palette = cm.colors)
For most grid points, the difference of p-values is positive, which means that the CDF-t correction improves the statistical similarity between the ERA-Interim and the CORDEX data. There are only a few grid points where the CDF-t correction degrades the similarity between CORDEX and the interpolated ERA-Interim data according to the p-values of the Kolmogorov-Smirnov tests.
CDF-t applied independently at each grid point aims to improve the similarity between the statistical distributions of the interpolated ERA-Interim data and of the CORDEX data at each grid point individually. Let us see whether this point-by-point correction also improves the spatial structure of the data, with the CORDEX data still acting as the reference. To do so, we compute the time series of spatial correlations between the CDF-t bias-corrected data and the CORDEX data, and between the interpolated ERA-Interim data and the CORDEX data.
# compute the spatial correlation between two fields at each time step
# returns the correlations for both comparisons
spatial_cor <- vapply(seq_along(time_calval$val),
FUN = function(itime){
cor_cordex_vs_bs <- cor(cordex_tas_calval$val[, itime], cdft_bc[, itime])
cor_cordex_vs_erai <- cor(cordex_tas_calval$val[, itime], nearestn_bc$val[, itime])
c("cor_cordex_vs_bs" = cor_cordex_vs_bs, "cor_cordex_vs_erai" = cor_cordex_vs_erai)
},
FUN.VALUE = numeric(2))
matplot(x = time_calval$val, t(spatial_cor), type = "l", xaxt = "n", xlab = "time", ylab = "cor", main = "Validation: spatial correlation with the CORDEX data", lty = 1)
xaxis <- as.Date(format(as.Date(time_calval$val), format = "%Y-%m-%d"))
ixaxis <- which(format(xaxis, "%m") == "01" & format(xaxis, "%d") == "01")
ixaxis <- ixaxis[seq(1, length(ixaxis), length.out = 10)]
xaxis <- xaxis[ixaxis]
axis(1, at = time_calval$val[ixaxis], labels=xaxis)
legend("bottomleft", legend = c("bias-corrected vs cordex", "erai vs cordex"), lty = 1, col = 1:2)
Compared to the ERA-Interim data interpolated with the nearest-neighbor method, the CDF-t bias-corrected data tend to exhibit a higher spatial correlation with the CORDEX data at each time step. We note that the spatial correlation between the CDF-t bias-corrected data and the CORDEX data is not stable through time: it decreases at times when the correlation between the interpolated ERA-Interim data and the CORDEX data is also lower than usual.
In this vignette, we wanted to provide a quick introduction to the CDF-t method and present its implementation in the R package CDFt. We hope that we managed to give a feel for how the method and the package work and how it can be applied in the context of climate sciences.
Any questions, comments or suggestions are welcome and can be sent to sthao@lsce.ipsl.fr
Dee D. P., Uppala S. M., Simmons A. J., Berrisford P., Poli P., Kobayashi S., Andrae U., et al. 2011. “The ERA‐Interim Reanalysis: Configuration and Performance of the Data Assimilation System.” Quarterly Journal of the Royal Meteorological Society 137 (656): 553–97. doi:10.1002/qj.828.
Déqué, Michel. 2007. “Frequency of Precipitation and Temperature Extremes over France in an Anthropogenic Scenario: Model Results and Statistical Correction According to Observed Values.” Extreme Climatic Events 57 (1): 16–26. doi:10.1016/j.gloplacha.2006.11.030.
Ehret, U., E. Zehe, V. Wulfmeyer, K. Warrach-Sagi, and J. Liebert. 2012. “HESS Opinions ‘Should We Apply Bias Correction to Global and Regional Climate Model Data?’” Hydrology and Earth System Sciences 16 (9): 3391–3404. doi:10.5194/hess-16-3391-2012.
Jacob, Daniela, Juliane Petersen, Bastian Eggert, Antoinette Alias, Ole Bøssing Christensen, Laurens M. Bouwer, Alain Braun, et al. 2014. “EURO-CORDEX: New High-Resolution Climate Change Projections for European Impact Research.” Regional Environmental Change 14 (2): 563–78. doi:10.1007/s10113-013-0499-2.
Maraun, Douglas. 2016. “Bias Correcting Climate Change Simulations - a Critical Review.” Current Climate Change Reports 2 (4): 211–20. doi:10.1007/s40641-016-0050-x.
Michelangeli, P.-A., M. Vrac, and H. Loukos. 2009. “Probabilistic Downscaling Approaches: Application to Wind Cumulative Distribution Functions.” Geophysical Research Letters 36 (11). doi:10.1029/2009GL038401.
Piani, C., G.P. Weedon, M. Best, S.M. Gomes, P. Viterbo, S. Hagemann, and J.O. Haerter. 2010. “Statistical Bias Correction of Global Simulated Daily Precipitation and Temperature for the Application of Hydrological Models.” Journal of Hydrology 395 (3-4): 199–215. doi:10.1016/j.jhydrol.2010.10.024.
Vautard, Robert, Andreas Gobiet, Daniela Jacob, Michal Belda, Augustin Colette, Michel Déqué, Jesús Fernández, et al. 2013. “The Simulation of European Heat Waves from an Ensemble of Regional Climate Models Within the EURO-CORDEX Project.” Climate Dynamics 41 (9): 2555–75. doi:10.1007/s00382-013-1714-z.
Vrac, M., P. Drobinski, A. Merlo, M. Herrmann, C. Lavaysse, L. Li, and S. Somot. 2012. “Dynamical and Statistical Downscaling of the French Mediterranean Climate: Uncertainty Assessment.” Nat. Hazards Earth Syst. Sci. 12 (9): 2769–84. doi:10.5194/nhess-12-2769-2012.