Enjoy R: Do two consecutive seeds behave independently?

I’ve always wondered whether two random seeds in R provide independent results, whatever they are. In particular, I wanted to check if repeating a sampling operation with two consecutive seeds, say set.seed(20) and set.seed(21), this would produce unrelated outputs as expected. Pseudo-randomness in R is based on algorithms I honestly have read nothing about, and therefore I am only checking this as a simple naif user.

Let’s say we have a vector of one million consecutive seeds, and, under each of them, we want to sample an integer from 1 to 10. Hence, every time one of these 10 numbers shows up, we can start to count the failures we observe before that specific number shows up again. Theoretically, the distribution of failures for each of the 10 numbers has to follow a geometric distribution of probability 0.1.

random_num <- 8217015 # from random.org
N <- 1000000
seeds <- seq_len(N) + random_num
numbers <- sapply(seeds, function(x) { set.seed(x); sample(1:10, 1) })
library(tidyverse)
mydata0 <- data_frame(seeds = seeds, numbers = numbers)
mydata0
## # A tibble: 1,000,000 x 2
##      seeds numbers
##         
##  1 8217016       5
##  2 8217017       9
##  3 8217018       4
##  4 8217019       6
##  5 8217020       4
##  6 8217021       2
##  7 8217022       9
##  8 8217023       7
##  9 8217024       2
## 10 8217025       5
## # ... with 999,990 more rows
mydata <- mydata0 %>%
  mutate(seeds = 1:N) %>%
  group_by(numbers) %>%
  mutate(failures = seeds - lag(seeds) - 1) %>%
  ungroup() %>%
  drop_na()
mydata
## # A tibble: 999,990 x 3
##    seeds numbers failures
##           
##  1     5       4        1
##  2     7       9        4
##  3     9       2        2
##  4    10       5        8
##  5    11       4        5
##  6    12       4        0
##  7    14       3        0
##  8    15       6       10
##  9    16       4        3
## 10    19       1        1
## # ... with 999,980 more rows
summ_table <- mydata %>%
  group_by(numbers, failures) %>%
  summarise(n = n()) %>%
  mutate(frequencies = n/sum(n)) %>%
  ungroup()
summ_table
## # A tibble: 902 x 4
##    numbers failures     n frequencies
##                  
##  1       1        0  8801  0.08780016
##  2       1        1  9068  0.09046379
##  3       1        2  6929  0.06912479
##  4       1        3  7566  0.07547960
##  5       1        4  7109  0.07092050
##  6       1        5  6305  0.06289967
##  7       1        6  5371  0.05358194
##  8       1        7  5447  0.05434013
##  9       1        8  4529  0.04518201
## 10       1        9  4166  0.04156067
## # ... with 892 more rows
max_plot <- max(summ_table$failures)
qplot(data = summ_table, failures, frequencies, geom = 'path') +
  facet_wrap(~ numbers, ncol = 2) +
  scale_x_continuous(breaks = 0:max_plot) +
  geom_line(data = data.frame(x = 0:max_plot, 
                              y = dgeom(0:max_plot, .1)), 
            aes(x, y), color = 'red', alpha = .5) +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        panel.grid.major = element_line(colour = "darkgrey"))

unnamed-chunk-4-1

The red curve is the theoretical geometric model, whereas the black one is the observed distribution of the failures for each of the 10 numbers. We immediately see that a given number is less likely to be sampled under two consecutive seeds. This is clearly not what a statistician trusting a software would like to observe.

Let’s see whether shuffling the seeds (so to remove contiguity) would fix the problem.

mydata2 <- mydata0 %>%
  sample_frac() %>% # shuffling here
  mutate(seeds = 1:N) %>%
  group_by(numbers) %>%
  mutate(failures = seeds - lag(seeds) - 1) %>%
  ungroup() %>%
  drop_na()

summ_table2 <- mydata2 %>%
  group_by(numbers, failures) %>%
  summarise(n = n()) %>%
  mutate(frequencies = n/sum(n)) %>%
  ungroup()

max_plot2 <- max(summ_table2$failures)

qplot(data = summ_table2, failures, frequencies, geom = 'path') +
  facet_wrap(~ numbers, ncol = 2) +
  scale_x_continuous(breaks = 0:max_plot2) +
  geom_line(data = data.frame(x = 0:max_plot2, 
                              y = dgeom(0:max_plot2, .1)), 
            aes(x, y), color = 'red', alpha = .5) +
  theme_bw() +
  theme(axis.text.x = element_blank(),
        panel.grid.major = element_line(colour = "darkgrey"))

unnamed-chunk-5-1

As we expected, now the red curve and the black curve seem to be completely overlapping.
Hence, in whatever application you would need to set seeds in different iterations avoid using consecutive seeds!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: