Skip to contents

Introduction

This set of R chunks creates a synthetic dataset that presents a variety of data importation and wrangling challenges, as well as some interesting analytic opportunities.

This is based on my own experience in the late 1980s, to a time when I was working for a wholesale company that distributed LPs, cassettes, and those new-fangled compact discs to a wide range of retailers.

The central table in this relational database shows the details of each store’s orders…but this table needs to be compiled for four quarterly CSV files.

These can then be linked to other tables about the stores.

The Scenario

The fictional wholesale company, CR25 (for “Classic Rock 25”) only sells 25 album titles. They are the biggest selling / most famous albums by the 25 most-played artists shown in Walt Hickey’s article “Why Classic Rock Isn’t What It Used To Be”. 1

Setup

The first step is to install the necessary packages. The seed is set to a number associated with a classic rock staple, so that future runs of the code will generate the same data table. This is important, as some of the examples in the book are based on summary tables drawn from the master file.

# tidyverse packages
library(dplyr)
library(readr)
library(lubridate)
library(ggplot2)
#
# utilities
set.seed(8675309)

Input data

To create the album list, I created a csv file with the artist and percentage of all plays fields, using the table in the article (which is displayed as an image; note that there appears to be an error in the table, in that Boston has a lower rank but a higher percentage than Queen or Journey). I also added a representative album by that artist—I used some judgement here. The album is the biggest selling, most famous, and/or the one with the most hit singles that the artist made. I also opted for a “no compilations” rule; one could make the case that, as one example, The Rolling Stones might be better or more accurately represented by Hot Rocks.

I also added a price field; albums released before 1976 are $7.98, those released 1976-1981 are $8.98, and 1982 and beyond are $9.98. The sole exception to this is Pink Floyd’s double album The Wall, which I priced at $14.98.

albumlist <- read_csv(dpjr::dpjr_data("cr25/cr25_538.csv"))
albumlist

The next step was to create a csv file with some fictional stores. What I did here was use the names of now-defunct Canadian chains that were record retailers active in the 70s and 80s. You can find them listed in the Wikipedia List of defunct Canadian companies I left off HMV, as it was an international brand that was late to the market.

I also assigned each store to one of four regions (based on the cardinal points of the compass), and gave them a store ID number.

storelist <- read_csv(dpjr::dpjr_data("cr25/cr25_storelist.csv"))

head(storelist)

storelist %>%
  group_by(chain, region) %>%
  tally()

storelist %>%
  group_by(region) %>%
  tally()

storelist %>%
  group_by(chain) %>%
  tally()

Create order

# order date
# based on https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html

jan01 <- ymd("2019-01-01")
startdate <- as_date(floor_date(jan01, "year"))
enddate <- as_date(ceiling_date(jan01, "year") - days(1))
daterange <- tibble(orderdate = as_date(c(startdate:enddate)))

orderdate_fun <- function(){ 
daterange %>%
  sample_n(1) %>%
  pull(orderdate) 
}  

# set the size of the order based on year quarter
# - same number of orders but biggest in 4th quarter, smallest in 1st
order_mean_fun <- function(orderdate){
  order_qtr <- quarter(orderdate)
  order_mean <- case_when(
    order_qtr == 1 ~ 4,
    order_qtr == 4 ~ 6,
    TRUE ~ 5
  )
}

# function to randomly select the store
store_fun <- function(){ 
storelist %>%
  sample_n(1) %>%
  pull(store.id) 
}  


# album_order_fun

# - `rpois` sets the mean value of poisson distribution from which random value is pulled

album_order_fun <- function(...){
  sample_n(albumlist, 1, weight = pct_plays) %>%
    select(artist, album, price) %>%
  mutate(format = sample(c("lp", "cass"), 1, prob = c(0.33, 0.67)),
         qty = rpois(1, 4)
  )
}


# now with order date and store id appended after the run

order_fun <- function(){
  
  orderdate <- orderdate_fun()

# conditional on store open date being before random order date  
  orderstore.id <- store_fun()
  
  store.opendate <- storelist %>%
    filter(store.id == orderstore.id) %>%
    pull(open_date)

  if(store.opendate < orderdate){

    order_mean <- order_mean_fun(orderdate)
    n = rpois(1, order_mean)
  

  # define blank form  
  order_final <- data.frame(artist = as.character(),
                          album = as.character(),
                          price = as.double(),
                          format = as.character(),
                          qty = as.integer(),
                          stringsAsFactors = FALSE)  
  
  for(i in 1:n){
    order_final[i, ] <- album_order_fun()
    }
  
  order_final <- order_final %>%
    mutate(orderdate = orderdate,
           store.id = orderstore.id)
  
  order_final
  
  # end if
  }
# end function
}
# run one order...
order_fun()

Now a run of multiple orders…

m <- 750


order_multi <- data.frame(artist = as.character(),
                          album = as.character(),
                          price = as.double(),
                          format = as.character(),
                          qty = as.integer(),
                          orderdate = as_date(as.character()),
                          store.id = as.character(),
                          stringsAsFactors = FALSE)  
  
  for(i in 1:m){
    order_final <- order_fun()
    order_multi <- rbind(order_multi, order_final)
  }
  

#write_csv(order_multi, "cr25_order_multi.csv")  
#write_rds(order_multi, "cr25_order_multi.rds")  
order_multi <- read_rds(dpjr::dpjr_data("cr25/cr25_order_multi.rds"))

head(order_multi)

Summary tables

order_multi %>%
  group_by(store.id, orderdate) %>%
  tally()
order_multi %>%
  group_by(artist) %>%
  summarise(qty = sum(qty)) %>%
  arrange(desc(qty))

Build a large scale summary table

order_multi %>%
  mutate(order_price = qty * price) %>%
  group_by(artist, format) %>%
  summarise(qty = sum(qty),
            cost = sum(order_price)) %>%
  arrange(desc(qty))
order_multi %>%
  mutate(order_price = qty * price) %>%
  left_join(storelist, by = "store.id") %>%
  group_by(chain) %>%
  summarise(qty = sum(qty),
            cost = sum(order_price)) %>%
  arrange(desc(qty))
order_multi %>%
  mutate(order_price = qty * price) %>%
  left_join(storelist, by = "store.id") %>%
  group_by(region) %>%
  summarise(qty = sum(qty),
            cost = sum(order_price)) %>%
  mutate(pct_of_sales = cost / sum(cost) * 100) %>%
  arrange(desc(qty))

Future development

It could also be expanded to include additional years (with some trends embedded), inventory management, and customer relationship management (CRM).

-30-