Enjoy R: How to make a Pareto Chart using ggplot2 (and dplyr)

Hi all.

The well-known choice of steering ggplot2 users towards a cleaner and more correct way of plotting data has meant that a secondary axis was left out.

This is the root of the difficulty of plotting a Pareto chart with this smart R package.

In this post, I suggest a way to overcome this hurdle: adding a vertical segment, with ticks and text expressing the cumulative percentages, at the right-hand end of the x-axis. I also decided to use theme_bw().

Let’s see an example:

# creating a factor variable:
Example <- factor(c(rep("a", 15), rep("b", 39), rep("A", 6), rep("B", 42), rep("C", 50)))

# loading the required packages:
library(dplyr)
library(ggplot2)

# creating the function:
ggpareto <- function(x) {

  # the name of the argument is reused in the chart title
  title <- deparse(substitute(x))
  x <- data.frame(modality = na.omit(x))

  # frequency table, sorted from the most to the least frequent modality
  Df <- x %>% group_by(modality) %>% summarise(frequency = n()) %>%
    arrange(desc(frequency))
  Df$modality <- ordered(Df$modality, levels = as.character(Df$modality))
  Df <- Df %>% mutate(modality_int = as.integer(modality),
                      cumfreq = cumsum(frequency),
                      cumperc = cumfreq / nrow(x) * 100)
  nr <- nrow(Df)
  N <- sum(Df$frequency)

  # positions of the ticks on the percentage axis
  Df_ticks <- data.frame(xtick0 = rep(nr + .55, 11), xtick1 = rep(nr + .59, 11),
                         ytick = seq(0, N, N/10))

  # labels of the percentage axis
  y2 <- paste0(seq(0, 100, 10), '%')

  g <- ggplot(Df, aes(x = modality, y = frequency)) +
    geom_bar(stat = "identity", aes(fill = modality_int)) +
    geom_line(aes(x = modality_int, y = cumfreq, color = modality_int)) +
    geom_point(aes(x = modality_int, y = cumfreq, color = modality_int), pch = 19) +
    scale_y_continuous(breaks = seq(0, N, N/10), limits = c(-.02 * N, N * 1.02)) +
    scale_x_discrete(breaks = Df$modality) +
    guides(fill = "none", color = "none") +
    # white strip hosting the percentage axis
    annotate("rect", xmin = nr + .55, xmax = nr + 1, ymin = -.02 * N, ymax = N * 1.02,
             fill = "white") +
    annotate("text", x = nr + .8, y = seq(0, N, N/10), label = y2, size = 3.5) +
    # axis line and ticks of the percentage axis
    annotate("segment", x = nr + .55, xend = nr + .55, y = -.02 * N, yend = N * 1.02,
             color = "grey50") +
    geom_segment(data = Df_ticks, aes(x = xtick0, y = ytick, xend = xtick1, yend = ytick)) +
    labs(title = paste0("Pareto Chart of ", title), y = "absolute frequency") +
    theme_bw()

  # the chart plus the table it is built from
  return(list(graph = g,
              Df = Df[, c("modality_int", "modality", "frequency", "cumfreq", "cumperc")]))

}

# applying the function to the factor variable
ggpareto(Example)

And the output is:

$graph

$Df
Source: local data frame [5 x 5]

         modality_int    modality      frequency    cumfreq      cumperc
1                   1           C             50         50     32.89474
2                   2           B             42         92     60.52632
3                   3           b             39        131     86.18421
4                   4           a             15        146     96.05263
5                   5           A              6        152    100.00000
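
The cumulative columns in the table above can be cross-checked without dplyr. The following is only a minimal base R sketch on the same Example factor. The name check is chosen here for illustration and is not part of ggpareto().

```r
# Same data as in the post
Example <- factor(c(rep("a", 15), rep("b", 39), rep("A", 6), rep("B", 42), rep("C", 50)))

# Frequencies sorted from the most to the least frequent modality
freq <- sort(table(Example), decreasing = TRUE)

# Cumulative counts and cumulative percentages
check <- data.frame(
  modality  = names(freq),
  frequency = as.vector(freq),
  cumfreq   = cumsum(as.vector(freq)),
  cumperc   = cumsum(as.vector(freq)) / sum(freq) * 100
)
print(check)
```

The cumfreq and cumperc columns should match the ones returned in $Df.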

[Figure: the Pareto chart returned in $graph]
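
Newer ggplot2 releases (2.2.0 and later) do offer sec_axis(), limited to one-to-one transformations of the primary axis, which is all a Pareto chart needs. The sketch below is an alternative to the hand-drawn axis above, not a replacement for ggpareto(). It assumes such a ggplot2 version is installed, and the fill colour and axis names are arbitrary choices made for this example.

```r
library(ggplot2)

# Same data as in the post
Example <- factor(c(rep("a", 15), rep("b", 39), rep("A", 6), rep("B", 42), rep("C", 50)))

# Frequency table sorted in decreasing order, with cumulative counts
freq <- sort(table(Example), decreasing = TRUE)
Df <- data.frame(
  modality  = factor(names(freq), levels = names(freq)),
  frequency = as.vector(freq)
)
Df$cumfreq <- cumsum(Df$frequency)
N <- sum(Df$frequency)

p <- ggplot(Df, aes(x = modality, y = frequency)) +
  geom_col(fill = "steelblue") +
  geom_line(aes(y = cumfreq, group = 1)) +
  geom_point(aes(y = cumfreq)) +
  scale_y_continuous(
    name = "absolute frequency",
    # the right-hand axis rescales counts to percentages of N
    sec.axis = sec_axis(~ . / N * 100,
                        name = "cumulative percentage",
                        labels = function(x) paste0(x, "%"))
  ) +
  labs(title = "Pareto Chart of Example") +
  theme_bw()

p
```

Because the secondary axis is just a rescaling of the counts, the cumulative line is still drawn in absolute frequencies and only the labels on the right are expressed as percentages.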
