Skip to contents

Raw data

This page is a listing of the raw data files in the {dpjr} package. The files can be accessed using the appropriate “read” function, with the package function dpjr_data() providing the path. For example,

df_mpg <- read.csv(dpjr_data("mpg.csv"))

The full list of files is as follows:

dpjr::dpjr_data()
#>  [1] "authors_count_fwf.txt"                   
#>  [2] "authors_fwf.txt"                         
#>  [3] "authors2_fwf.txt"                        
#>  [4] "badpenguins.rds"                         
#>  [5] "bcferries_2021.pdf"                      
#>  [6] "canpop.csv"                              
#>  [7] "census2021firstresultsenglandwales1.xlsx"
#>  [8] "cr25"                                    
#>  [9] "date.xlsx"                               
#> [10] "gapminder.csv"                           
#> [11] "intl_tuition_fees_bc.xlsx"               
#> [12] "JCUSH.txt"                               
#> [13] "lahman_2020.zip"                         
#> [14] "lfs_canada_employment.rds"               
#> [15] "mpg.csv"                                 
#> [16] "mtcars.csv"                              
#> [17] "noc_table.rds"                           
#> [18] "nycflights13_sql.zip"                    
#> [19] "penguin_summary.pdf"                     
#> [20] "penguins_fwf_code.txt"                   
#> [21] "penguins_fwf.txt"                        
#> [22] "penguins_labelled.xlsx"                  
#> [23] "penguins_spss_output.xlsx"               
#> [24] "penguins.csv"                            
#> [25] "penguins.dta"                            
#> [26] "penguins.sas"                            
#> [27] "penguins.sav"                            
#> [28] "seph_naics721_bc_sal.rds"                
#> [29] "us_bls_employment_2010-2020.csv"         
#> [30] "Video_SPSS.sav"

Licenses

All of the data in this package is covered under an open license. For information about the specific license for each data set in this package, please refer to the vignette “Data licenses”.

Files

authors_fwf.txt; authors2_fwf.txt

A pair of fixed-width files containing information about ten authors born in the United States of America:

  • “name”

  • “state of birth”, the two-letter abbreviation for the US state in which the author was born

  • “unique_id”, a randomly generated personal ID (similar to a national individual identification number used in different countries).

The variables are as follows:

Variable Width Start position End position
name 20 1 20
state_of_birth 10 21 30
unique_id 12 31 42

census2021firstresultsenglandwales1.xlsx

Population and household estimates, England and Wales: Census 2021

Data is a tabulation of the population of England and Wales, drawn from the 2021 Census, and downloaded from the UK Office of National Statistics (ONS): Population and household estimates, England and Wales: Census 2021, 2022-06-28

License

The UK Open Government License for Public Sector Information

cr25

The files within the subdirectory cr25 are as follows:

list.files(dpjr::dpjr_data("cr25"))
#> [1] "cr25_538.csv"              "cr25_human_resources.xlsx"
#> [3] "cr25_order_multi.csv"      "cr25_order_multi.rds"     
#> [5] "cr25_storelist.csv"

For information about these files see the vignette “Data generation—CR25”.

gapminder.csv

A csv version of the dataset in the {gapminder} package

intl_tuition_fees_bc.xlsx

International Tuition Fees at Public Post-Secondary Institutions by Economic Development Region

Sourced from the British Columbia Data Catalogue, this Excel file contains “Annual Academic Arts Program tuition fees for full-time international students at public post-secondary institutions by Economic Development Region (EDR) and by institution. Academic Years 2011/12 to 2021/22.”

Original file name: “tui2_international_tuition_fees_at_public_post_secondary_institutions_by_economic_development_r.xlsx”

https://catalogue.data.gov.bc.ca/dataset/international-tuition-fees-at-public-post-secondary-institutions-by-economic-development-region

License:

Open Government Licence - British Columbia

JCUSH.txt

Statistics Canada has made available an anonymized Public-Use Microdata File (PUMF) of the Joint Canada/United States Survey of Health, a telephone survey conducted in late 2002 and early 2003. There were 8,688 respondents to the survey, 3,505 Canadians and 5,183 Americans.

The PUMF is a fixed-width file named “JCUSH.txt”. Each line is 552 columns in length.

The webpage for the survey, including the PUMF file, data dictionary, and methodological notes, is here: https://www150.statcan.gc.ca/n1/pub/82m0022x/2003001/4069119-eng.htm

License:

Statistics Canada Open License

Source: Statistics Canada, Joint Canada/United States Survey of Health 2002-03, 2004. Reproduced and distributed on an “as is” basis with the permission of Statistics Canada.

lfs_canada_employment.rds

Total employment in Canada, monthly, unadjusted for seasonality, both sexes, age 15 and over, total all population centres and rural areas (thousands), 2011-01 to 2022-12

Table 14-10-0374-01, vector v1234977815

License:

Statistics Canada Open License

Source: Statistics Canada. Table 14-10-0374-01 Employment and unemployment rate, monthly, unadjusted for seasonality DOI: https://doi.org/10.25318/1410037401-eng

mpg.csv

A csv version of the famous mpg dataset, included in the {ggplot2} package

penguins

penguins.*

The penguins dataframe from the {palmerpenguins} dataset, in different formats:

  • comma-separated values penguins.csv
# code to create file
readr::write_csv(palmerpenguins::penguins, 
                 here::here("penguins.csv"))
  • SAS penguins.sas
# code to create file
haven::write_sas(palmerpenguins::penguins, 
                 here::here("penguins.sas"))
  • SPSS penguins.sav
# code to create file
haven::write_sav(palmerpenguins::penguins,
                 here::here("penguins.sav"))
  • Stata penguins.dta
# code to create file
haven::write_dta(palmerpenguins::penguins,
                 here::here("penguins.dta"))

penguins_fwf.txt

A fixed-width version of the {palmerpenguins} dataset

There are 8 different variables, described in the table below:

Variable Width Start position End position
species 9 1 9
island 9 10 18
bill_length_mm 4 19 22
bill_depth_mm 4 23 26
flipper_length_mm 3 27 29
body_mass_g 4 30 33
sex 6 34 39
year 4 40 43

The fixed-width file has been created to minimize white space. The first four and last two rows of the data look like this:

readLines(dpjr::dpjr_data("penguins_fwf.txt"), n = 4)
#> [1] "Adelie   Torgersen39.118.71813750male  2007"
#> [2] "Adelie   Torgersen39.517.41863800female2007"
#> [3] "Adelie   Torgersen40.318.01953250female2007"
#> [4] "Adelie   Torgersen                     2007"
tail(readLines(dpjr::dpjr_data("penguins_fwf.txt")), 2)
#> [1] "ChinstrapDream    50.819.02104100male  2009"
#> [2] "ChinstrapDream    50.218.71983775female2009"

Note that the first row is not the variable names. This is common in fixed-width files.

# code to create fwf file
gdata::write.fwf(penguins_df, file = "penguins_fwf.txt",
                 width = c(
                   9,   # species
                   9,   # island
                   4,   # bill_length_mm
                   4,   # bill_depth_mm
                   3,   # flipper_length_mm
                   4,   # body_mass_g
                   6,   # sex
                   4    # year
                 ),
                 sep = "",
                 colname = FALSE)

penguins_fwf_code.txt

A fixed-width version of the {palmerpenguins} dataset, where the character strings associated with the values have been replaced by numeric values.

There are 8 different variables, described in the table below:

Variable Width Start position End position
species 1 1 1
island 1 2 2
bill_length_mm 4 3 6
bill_depth_mm 4 7 10
flipper_length_mm 3 11 13
body_mass_g 4 14 17
sex 1 18 18
year 4 19 22
Variable Value Character
species 1 Adelie
2 Gentoo
3 Chinstrap
island 1 Torgersen
2 Biscoe
3 Dream
sex 1 female
2 male

The fixed-width file is now less than half the width of the fixed-width file that has the character values. The first four and last two rows of the data look like this:

readLines(dpjr::dpjr_data("penguins_fwf_code.txt"), n = 4)
#> [1] "1139.118.7181375022007" "1139.517.4186380012007" "1140.318.0195325012007"
#> [4] "11                2007"
tail(readLines(dpjr::dpjr_data("penguins_fwf_code.txt")), 2)
#> [1] "3350.819.0210410022009" "3350.218.7198377512009"

Note that the first row is not the variable names. This is common in fixed-width files.

penguins_labelled.xlsx

An Excel version of the {palmerpenguins} dataset.

This file simulates the circumstance where the data were originally collected and stored in an SPSS file, and the data collectors have made the data available as an Excel file.

The file contains two sheets,

  • penguins_values, containing the individual records, with the variables coded numerically. Note that for all variables, “NA” is used to represent missing values.

  • penguins_codebook contains the output SPSS creates when the “DISPLAY DICTIONARY” syntax (or the “File > Display Data File Information > Working File” GUI menu sequence) in SPSS is used to produce the “dictionary” (or codebook). There is information about each of the variables, including whether that variable is nominal or scale..

Variable Value Character
species 1 Adelie
2 Chinstrap
3 Gentoo
island 1 Biscoe
2 Dream
3 Torgersen
sex 1 female
2 male

National Travel Survey (NTS)

Note that the National Travel Survey (NTS) data referenced in Chapter 7 of The Data Preparation Journey is not included in the {dpjr} package. The microdata file for the 2020 reference year can be found here:

License:

Data are available by CC-0 license in accordance with the Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data.

-30-