The extreme complexity can be seen as a wide range of possibilities that may occur after some initial event (say, I leave home), each possibility having a very small probability to be observed. Of course, some is much more likely than the other (say, I come across my neighbour), maybe quite certain, but still there is a continuous chain of events that are not predictable (what will my neighbour do and/or say to me? What will I answer? Will I be late at work because of something happening on my way to the office?).

This long sequence of events that portrays our life has zero probability to happen. But still, it happens. This means that **everything we observe is the result of a chain of events having zero probability. But every possible chain of events is almost impossible to be observed. **Put it like this: if each possible sequence has zero probability, this means that all sequences have almost the same (very small) probability to occur. However, one of them has to occur for sure.

Handling complexity in a proper way means being able to build models that can explain some systematic pattern in reality, making some sequence much more likely to happen given certain inputs. Statisticians’ work is to unearth the deterministic in a world largely ruled by randomness. If the work is accomplished properly, then just the mere case remains unexplained.

Could we do something more? Well, it depends on what randomness really is. We may see it as a binder of what is not possible to observe for technical or natural reasons. For instance, if we knew the starting face of a coin, our mechanic movement when tossing it, the effect of the wind, etc…, of course we would be able to predict better than usual whether it is going to come up head or tail. But sometimes, it is just too difficult, costly, or just impossible to have a model that improves our predictions. This is why we end up gambling on the real unknown with an irrational certainty of being able to handle it.

]]>

`set.seed(20)`

and `set.seed(21)`

, this would produce unrelated outputs as expected. Pseudo-randomness in R is based on algorithms I honestly have read nothing about, and therefore I am only checking this as a simple naif user.
Let’s say we have a vector of one million consecutive seeds, and, under each of them, we want to sample an integer from 1 to 10. Hence, every time one of these 10 numbers shows up, we can start to count the failures we observe before that specific number shows up again. Theoretically, the distribution of failures for each of the 10 numbers has to follow a geometric distribution of probability 0.1.

```
random_num <- 8217015 # from random.org
N <- 1000000
seeds <- seq_len(N) + random_num
numbers <- sapply(seeds, function(x) { set.seed(x); sample(1:10, 1) })
library(tidyverse)
mydata0 <- data_frame(seeds = seeds, numbers = numbers)
mydata0
```

```
## # A tibble: 1,000,000 x 2
## seeds numbers
##
## 1 8217016 5
## 2 8217017 9
## 3 8217018 4
## 4 8217019 6
## 5 8217020 4
## 6 8217021 2
## 7 8217022 9
## 8 8217023 7
## 9 8217024 2
## 10 8217025 5
## # ... with 999,990 more rows
```

```
mydata <- mydata0 %>%
mutate(seeds = 1:N) %>%
group_by(numbers) %>%
mutate(failures = seeds - lag(seeds) - 1) %>%
ungroup() %>%
drop_na()
mydata
```

```
## # A tibble: 999,990 x 3
## seeds numbers failures
##
## 1 5 4 1
## 2 7 9 4
## 3 9 2 2
## 4 10 5 8
## 5 11 4 5
## 6 12 4 0
## 7 14 3 0
## 8 15 6 10
## 9 16 4 3
## 10 19 1 1
## # ... with 999,980 more rows
```

```
summ_table <- mydata %>%
group_by(numbers, failures) %>%
summarise(n = n()) %>%
mutate(frequencies = n/sum(n)) %>%
ungroup()
summ_table
```

```
## # A tibble: 902 x 4
## numbers failures n frequencies
##
## 1 1 0 8801 0.08780016
## 2 1 1 9068 0.09046379
## 3 1 2 6929 0.06912479
## 4 1 3 7566 0.07547960
## 5 1 4 7109 0.07092050
## 6 1 5 6305 0.06289967
## 7 1 6 5371 0.05358194
## 8 1 7 5447 0.05434013
## 9 1 8 4529 0.04518201
## 10 1 9 4166 0.04156067
## # ... with 892 more rows
```

```
max_plot <- max(summ_table$failures)
qplot(data = summ_table, failures, frequencies, geom = 'path') +
facet_wrap(~ numbers, ncol = 2) +
scale_x_continuous(breaks = 0:max_plot) +
geom_line(data = data.frame(x = 0:max_plot,
y = dgeom(0:max_plot, .1)),
aes(x, y), color = 'red', alpha = .5) +
theme_bw() +
theme(axis.text.x = element_blank(),
panel.grid.major = element_line(colour = "darkgrey"))
```

The red curve is the theoretical geometric model, whereas the black one is the observed distribution of the failures for each of the 10 numbers. We immediately see that a given number is less likely to be sampled under two consecutive seeds. This is clearly not what a statistician trusting a software would like to observe.

Let’s see whether shuffling the seeds (so to remove contiguity) would fix the problem.

```
mydata2 <- mydata0 %>%
sample_frac() %>% # shuffling here
mutate(seeds = 1:N) %>%
group_by(numbers) %>%
mutate(failures = seeds - lag(seeds) - 1) %>%
ungroup() %>%
drop_na()
summ_table2 <- mydata2 %>%
group_by(numbers, failures) %>%
summarise(n = n()) %>%
mutate(frequencies = n/sum(n)) %>%
ungroup()
max_plot2 <- max(summ_table2$failures)
qplot(data = summ_table2, failures, frequencies, geom = 'path') +
facet_wrap(~ numbers, ncol = 2) +
scale_x_continuous(breaks = 0:max_plot2) +
geom_line(data = data.frame(x = 0:max_plot2,
y = dgeom(0:max_plot2, .1)),
aes(x, y), color = 'red', alpha = .5) +
theme_bw() +
theme(axis.text.x = element_blank(),
panel.grid.major = element_line(colour = "darkgrey"))
```

As we expected, now the red curve and the black curve seem to be completely overlapping.

Hence, in whatever application you would need to set seeds in different iterations avoid using consecutive seeds!

]]>

As I am used to installing LaTeX packages

`install.packages()`

and/or `library()`

.
usepackages = Vectorize(function(x) { stopifnot(is.character(x)) if(!is.element(x, rownames(installed.packages()))) install.packages(x) library(x, character.only = T, logical.return = T, quietly = T) })

Let’s try to load three packages, one of which (`SafeBayes`

) has to be installed *on the fly*.

The named logical vector at the very end of the output is the result of the vectorization of the argument `logical.return`

of the function `library()`

which returns `TRUE`

if the package has been successfully loaded, and `FALSE`

otherwise.

]]>

In general, there are more than one way to execute the same task via looping, and the efficiency of each choice varies among languages.

This post is not intended to demonstrate any general truth about R loops, but aims to provide some insights into some common methods for looping in base-R.

In particular, four possible types of loop will be executed, all simulating the probability of obtaining as many successess as the number of trials for a binomial random variable.

Such a probability is clearly equal to *π ^{n}*, where

The purpose of looping is to estimate the probability of NO FAILURE for different combinations of

`No_Failure`

functionLet us build a function that estimates the long-term proportion of no failures for a given size and probability of success. The `n_rep`

argument indicates the number of deviates to be drawn from the binomial distribution.

No_Failure = function(n, PI, n_rep) { mean(rbinom(n_rep, n, PI) == n) } ns = seq_len(80) PIs = seq(0, 1, .05) n_rep = 1000

`for`

loopThe `for`

loop may be the first solution everyone’s brain comes out with for a problem like this, because it really represents the way each of us reasons when the number of iteration is fixed and not conditional on anything.

FOR_fun = function() { outputFor = matrix(0, nrow = length(ns), ncol = length(PIs)) for(i in seq_along(ns)) { for(j in seq_along(PIs)) { outputFor[i, j] = No_Failure(ns[i], PIs[j], n_rep) } } outputFor }

`while`

loopEvery `for`

loop can be expressed by a `while`

loop. Implementing a `while`

loop requires a bit more code, and some more attention on the counter(s) to be updated.

WHILE_fun = function() { outputWhile = matrix(0, nrow = length(ns), ncol = length(PIs)) i = 1 while(i <= length(ns)) { j = 1 while(j <= length(PIs)) { outputWhile[i, j] = No_Failure(ns[i], PIs[j], n_rep) j = j + 1 } i = i + 1 } outputWhile }

`sapply`

The apply family in R is rich of functions which really identify the R-way of looping.

The `sapply()`

function is one of the most common, in that it works mainly on vectors (but also lists). It is very powerful, as it accomplishes the task with very little code.

Moreover, this function creates automatically the output in a simplified version, therefore no empty matrix should be initialized as it happened in the previous two methods.

SAPPLY_fun = function() { sapply(PIs, function(j) sapply(ns, No_Failure, j, n_rep)) }

`Vectorize`

The `Vectorize()`

function is the one I like the most. What it does is to create another function whose arguments are the same as the initial function, but some of them can be vectorized. As far as I see, sometimes (an this is the case) vectorizing more than one argument at a time does not give the expected output.

VECTORIZE_fun = function() { No_Failure_vect = Vectorize(Vectorize(No_Failure, 'n'), 'PI') No_Failure_vect(ns, PIs, n_rep) }

library(microbenchmark) microbenchmark(FOR_fun(), WHILE_fun(), SAPPLY_fun(), VECTORIZE_fun())

In this case, the four methods do not differ so much in terms of computational time required. The winner is on average the `sapply()`

function, which will be used to run the simulation.

library(reshape2) set.seed(1) n_rep = 100000 # more precise mydata = SAPPLY_fun() dimnames(mydata) = list(n = ns, PI = PIs) curves = melt(mydata) library(ggplot2) ggplot(curves, aes(PI, value, color = n, group = n)) + geom_path() + theme_bw() + scale_x_continuous(breaks = PIs) + scale_y_continuous(breaks = PIs) + scale_color_gradient(low = 'green', high = 'navy') + xlab(expression(pi)) + ylab('Pr(NO FAILURE)')

The plot shows all the estimated curves representing the probability of having no failures as a function of *π* and for different sizes (from 1 to 80).

As *n* increases, it is less likely to have no failure: for *n* approaching infinity, the right corner of the graph will approach the point (1, 0).

This confirms that even a really rare event occurs almost certainly if the number of repetitions of the same experiment is considerably large.

]]>

Let’s suppose we want to compute mean and standard deviation of the following three populations:

rm(list = ls()) NPOP <- 10000 distributions <- data.frame(NORM = rnorm(NPOP), LOGNORM = rlnorm(NPOP), CHISQ = rchisq(NPOP, 3))

An easy way to save each result in a different variable, is to exploit the `assign`

function, whose first two arguments are the variable (a string) and the value, respectively.

for(i in seq_along(distributions)) { # create the variable name for the mean of distribution i mean_variable <- paste0('mean_', names(distributions)[i]) # assign the corresponding value to the label assign(mean_variable, mean(distributions[, i])) # create the variable name for the s.d. of distribution i sd_variable <- paste0('sd_', names(distributions)[i]) # assign the corresponding value to the label assign(sd_variable, sd(distributions[, i])) } rm(list = c('mean_variable', 'sd_variable', 'NPOP', 'distributions', 'i')) ls() # it worked !

]]>

*All we know about the world teaches us that the effects of A and B are always different—in some decimal place—for any A and B. Thus asking “are the effects different?” is foolish.*

This is a quote by John Tuckey, and I’d like to start with that to enter the core of this post.

What are we really trying to find out/explain/infer when modeling information coming from a sample?

We generally look for effects to establish whether something affects something else. The standard reasoning intimately concerns the evidence of the effect at sample level in order to have a clue of what happens at population level. If there is a fair certainty of an effect in the data, we may conclude that A and B are different according to some features we are analyzing. The issue arises because such a decision on A and B is strongly influenced by the sample size and by the level of confidence we are adopting. It is a logical thing, I agree, but, on the other hand, we’d like to have some independent response which can safely guide decisions. The fact that a 90% C.I. obtained from 40 units says something opposite from what a 95% C.I. obtained from 30 units indicates, is a clear pitfall for statistics-oriented decisions. The question to pose is: “Are A and B different?”. The answer will always (?) be: “Yes, they are, because there is a zero probability of observing something perfectly equal to something else on a continuous scale”. Here may be the death of statistical inference (at least the classical inference). In reality, things should be interpreted differently: increasing the sample size for a given confidence level only gives the method more certainty that the different measurements observed for A and B are actually different. This is what we usually call “power” or “sensitivity”: the method starts to be better at spotting that values are not the same. The “best” question is: “How much is *not the same*?”. This translates in “Are we really interested in a 0.007 difference between A and B?”. It may be a lot, it may be negligible, it depends on the context and on the scale of the variable. This is to say that we have to be careful with statistical significance and reason much more on magnitude and sign of an effect. Such a way of proceeding will allow us to interpret results more objectively and make more unquestionable decisions.

That said, let’s now give some tips on the interpretation of the effect. Let’s say we observe taller guys are from 1 to 3 points worse in math, and this looks significant, since it is found in a large sample. Can we conclude height affects math scores? The first answer must be a question: is a 1-to-3 point gap a substantial gap? In case it is, are we missing some covariates that may be linked to the phenomenon and may display it from a much better perspective? For instance, have the tall guys been in more basketball classes in school during the year, so they have had less time to do math homework? If so, you should adjust the school time table and avoid using math tests to show senseless differences.

]]>

` if(x > lower_bound & x < upper_bound) return(T) `

` return(F)`

Throughout my coding experience so far, I’ve faced that a lot. And everytime this happened, I started thinking that I was not really writing it in the same way as I would have written it using basic maths on paper. As far as I know, R does not support the usual statement ` lower_bound < x < upper_bound `

. Therefore, I tried to implement it and I ended up with something working, which is nice.

I used ` %<% `

for “less than”, and ` %<=% `

for “less than or equal to”. Following the code:

``%<%` <- function(x, b) { `

` `

` useNames <- T `

` if(is.logical(x)) { `

` if(x) { `

` x <- as.numeric(names(x)) `

` useNames <- F `

` } else return(F) `

` } `

` `

` out <- T `

` if(x >= b) { `

` out <- F `

` } `

` if(useNames) names(out) <- b `

` out `

`} `

``%<=%` <- function(x, b) { `

` `

` useNames <- T `

` if(is.logical(x)) { `

` if(x) { `

` x <- as.numeric(names(x)) `

` useNames <- F `

` } else return(F) `

` } `

` `

` out <- T `

` if(x > b) { `

` out <- F `

` } `

` if(useNames) names(out) <- b `

` out `

`}`

Now, let’s play a little bit:

- Is 1 < 2 < 3? (Should return TRUE)

`1 %<% 2 %<% 3`

`[1] TRUE`

- Is 2 < 2 < 3? (Should return FALSE)

`2 %<% 2 %<% 3`

`[1] FALSE`

- Is 2 ≤ 2 < 3? (Should return TRUE)

`2 %<=% 2 %<% 3`

`[1] TRUE`

- Is 1 < 2 ≤ 2? (Should return TRUE)

`1 %<% 2 %<=% 2`

`[1] TRUE`

- Is 2 ≤ 2 ≤ 2? (Should return TRUE)

`2 %<=% 2 %<=% 2`

`[1] TRUE`

- Is 4 < 2 ≤ 5? (Should return FALSE)

`4 %<% 2 %<=% 5`

`[1] FALSE`

]]>

`rm(list = ls())`

(actually in `rm()`

, when we had an idea: why not building a more elegant and less tedious method for doing this? Let’s think of what `ls()`

is: it is a `rm()`

demands is a vector of names. It is straightforward now: we can use the subset of the vector `ls()`

which does not include the names of the objects we want to keep. To make it more general and hence more usable to others, here you find a function that does the trick. The function is to be used in the Global Environment, which is the one you’re more likely to be working in. The inputs are:
`x`

which is the vector of names of the variables we want to keep in our workspace;

`keep_this_function`

which is set to`TRUE`

by default, since we don’t want our cool function to disappear from our workspace

Following the code:

`rm_all_but <- function(x, keep_this_function = TRUE) {`

` if(keep_this_function) { x <- c("rm_all_but", x) }`

` rm(list = ls(pos = ".GlobalEnv") [ !(ls(pos = ".GlobalEnv") %in% x)], pos= ".GlobalEnv")`

`}`

Let’s try it on my workspace:

`a=1 ; b=2; c=3; d=4`

`ls()`

`[1] "a" "b" "c" "d" "rm_all_but"`

`rm_all_but(x = c("a", "b"))`

`ls() `

*# were "c" and "d" deleted?*

`[1] "a" "b" "rm_all_but"`

**Yes! **

]]>

For a multiclass case, it comes to be less used, also because it loses most of its explicative power.

However, it would be good to use it also in this scenario in order to own one more weapon for diagnostics.

A possibility I have met across surfing the net is to implement an OVA approach (one vs all the others). It means to pick a class a time, say class **C**, and draw the ROC for **C** against **not C**.

We end up having as many curves as the number of classes of the qualitative response.

This is true for one specific model, but, if we aim to compare *n *models with a* k*-class response, we end up with *n X k* curves on the plot, which give you an image of the behavior of each model with the respect of each class.

To visually evaluate the overall (general) behavior of each model, we could get a central tendency measure, which might be:

- a simple arithmetic mean;
- a weighted mean, taking as weights the number of observations belonging to each class

A weighted average might be more correct, but even more “misleading” in telling you your model is good: if your data has quite all observations from class **A** and a few from the others and your model predicts always **A**, your weighted mean would say your model is quite ok, whereas a simple mean (which weighs the classes equally) would say your model is a disaster. That’s why looking at a simple average might be too extreme, but at least it indicates you that something should be reviewed.

What follows is a procedure that ends with a graph representing all the info you should evaluate your model with a critical eye.

First the function:

compare_multiROC <- function(truth, pred, abscissa = "tpr", ordinate = "fpr") { stopifnot(is.list(pred)) truth <- ordered(truth); lev <- levels(truth) pred <- lapply(pred, function(x) ordered(x, levels = lev)) truth <- if(require(dummies)) data.frame(dummy(truth)) pred <- lapply(pred, function(x) data.frame(dummy(x))) library(ROCR) appr <- interp <- list() for(j in seq_along(pred)) { interp[[j]] <- lapply(seq_len(ncol(truth)), function(i) { predob <- prediction(pred[[j]][, i], truth[, i]) perf <- performance(predob, abscissa, ordinate) cbind(perf@x.values[[1]], perf@y.values[[1]]) }) appr[[j]] <- lapply(interp[[j]], function(f) approxfun(f[, 1], f[, 2])) } Xs <- sort(unique(unlist(lapply(interp, function(x) lapply(x, function(X) {X[, 1]}))))) lapply(seq_along(appr), function(j) { Ys <- sapply(appr[[j]], function(f) f(Xs)); colnames(Ys) <- lev data.frame(X = Xs, Ys, avg = rowMeans(Ys), w.avg = apply(Ys, 1, weighted.mean, colSums(truth)), check.names = F) }) }

Let’s create some random data to see what the output is:

set.seed(103) n <- 1e+04 truth <- factor(sample(letters[1:3], n, T, c(.7, .2, .1))) cl1 <- cl2 <- cl3 <- truth trC <- truth == "c"; trClen <- length(truth[trC]) # building cl1 na <- is.na(sample(c(NA, 0), trClen, T, c(.2, .8))); cl1[trC][na] <- "a" # building cl2 na <- is.na(sample(c(NA, 0), trClen, T, c(.3, .7))); cl2[trC][na] <- "a" # building cl3 na <- is.na(sample(c(NA, 0), trClen, T, c(.35, .65))); cl3[trC][na] <- "a" # applying the function valuesROC <- compare_multiROC(truth, list(cl1, cl2, cl3)) names(valuesROC) <- paste0("cl", 1:3) valuesROC

$cl1 X a b c avg w.avg 1 0.00000000 0 0.5000000 0.3969124 0.2989708 0.1414000 2 0.06820428 1 0.5341021 0.4380455 0.6573826 0.8489559 3 0.09983526 1 0.5499176 0.4571218 0.6690131 0.8540833 4 0.11696870 1 0.5584843 0.4674547 0.6753130 0.8568606 5 1.00000000 1 1.0000000 1.0000000 1.0000000 1.0000000 $cl2 X a b c avg w.avg 1 0.00000000 0.0000000 0.5000000 0.3491036 0.2830345 0.1366000 2 0.06820428 0.6831683 0.5341021 0.3934975 0.5369227 0.6238100 3 0.09983526 1.0000000 0.5499176 0.4140860 0.6546679 0.8497625 4 0.11696870 1.0000000 0.5584843 0.4252381 0.6612408 0.8526221 5 1.00000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 $cl3 X a b c avg w.avg 1 0.00000000 0.0000000 0.5000000 0.3232072 0.2744024 0.1340000 2 0.06820428 0.5830986 0.5341021 0.3693673 0.4955227 0.5516888 3 0.09983526 0.8535211 0.5499176 0.3907750 0.5980712 0.7453995 4 0.11696870 1.0000000 0.5584843 0.4023707 0.6536184 0.8503262 5 1.00000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

Now we should “melt this data and create a long data frame to be provided to ggplot.

library(reshape2); library(ggplot2); library(dplyr) dat <- mutate(melt(valuesROC, id.vars = "X"), facet = ordered(ifelse(variable %in% c("avg", "w.avg"), "Averages", "Class Values"), levels = c("Class Values", "Averages"))) names(dat)[-5] <- c("True positive rate", "Curve", "False positive rate", "Model")

Finally, the plot:

ggplot(dat, aes(`True positive rate`, `False positive rate`, color = Curve, lty = Model)) + geom_path() + facet_wrap(~facet) + scale_x_continuous(breaks = seq(0, 1, .1)) + scale_y_continuous(breaks = seq(0, 1, .1)) + geom_segment(x = 0, xend = 1, y = 0, yend = 1, color = "grey90") + geom_ribbon(aes(x = `True positive rate`, ymin = 0, ymax = `True positive rate`), color = NA, fill = "pink", alpha = .2) + geom_ribbon(aes(x = `True positive rate`, ymin = `True positive rate`, ymax = 1), color = NA, fill = "lightblue", alpha = .2) + theme_bw() + guides(linetype = guide_legend( override.aes = list(colour = NULL, fill = "white"))) + ggtitle("ROC curves of different multinomial classification models")

]]>

]]>