Raw data
This page is a listing of the raw data files in the {dpjr} package.
The files can be accessed using the appropriate “read” function, with the package function dpjr_data() providing the path. For example,
df_mpg <- read.csv(dpjr::dpjr_data("mpg.csv"))
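Which “read” function is appropriate depends on the file type. As a rough sketch only, the helper below maps a file extension to a commonly used reader. The name suggest_reader() is hypothetical and not part of {dpjr}, and the mapping is an assumption about typical choices rather than a recommendation from the package.

```r
# Hypothetical helper (not part of {dpjr}): suggest a reader from the extension.
suggest_reader <- function(file_name) {
  ext <- tolower(tools::file_ext(file_name))
  switch(ext,
    csv  = "read.csv",            # base R (utils)
    rds  = "readRDS",             # base R
    xlsx = "readxl::read_excel",  # assumes {readxl} is installed
    sav  = "haven::read_sav",     # assumes {haven} is installed
    dta  = "haven::read_dta",     # assumes {haven} is installed
    NA_character_                 # no suggestion for other extensions
  )
}

suggest_reader("mpg.csv")
suggest_reader("penguins.dta")
```

Fixed-width .txt files are left out of the mapping on purpose, since they also need the column widths documented further down this page.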
The full list of files is as follows:
dpjr::dpjr_data()
#> [1] "authors_count_fwf.txt"
#> [2] "authors_fwf.txt"
#> [3] "authors2_fwf.txt"
#> [4] "badpenguins.rds"
#> [5] "bcferries_2021.pdf"
#> [6] "canpop.csv"
#> [7] "census2021firstresultsenglandwales1.xlsx"
#> [8] "cr25"
#> [9] "date.xlsx"
#> [10] "gapminder.csv"
#> [11] "intl_tuition_fees_bc.xlsx"
#> [12] "JCUSH.txt"
#> [13] "lahman_2020.zip"
#> [14] "lfs_canada_employment.rds"
#> [15] "mpg.csv"
#> [16] "mtcars.csv"
#> [17] "noc_table.rds"
#> [18] "nycflights13_sql.zip"
#> [19] "penguin_summary.pdf"
#> [20] "penguins_fwf_code.txt"
#> [21] "penguins_fwf.txt"
#> [22] "penguins_labelled.xlsx"
#> [23] "penguins_spss_output.xlsx"
#> [24] "penguins.csv"
#> [25] "penguins.dta"
#> [26] "penguins.sas"
#> [27] "penguins.sav"
#> [28] "seph_naics721_bc_sal.rds"
#> [29] "us_bls_employment_2010-2020.csv"
#> [30] "Video_SPSS.sav"
Licenses
All of the data in this package is covered under an open license. For information about the specific license for each data set in this package, please refer to the vignette “Data licenses”.
Files
authors_fwf.txt; authors2_fwf.txt
A pair of fixed-width files containing information about ten authors born in the United States of America:
- “name”
- “state_of_birth”, the two-letter abbreviation for the US state in which the author was born
- “unique_id”, a randomly generated personal ID (similar to the national identification numbers used in many countries).
The variables are as follows:
Variable | Width | Start position | End position |
---|---|---|---|
name | 20 | 1 | 20 |
state_of_birth | 10 | 21 | 30 |
unique_id | 12 | 31 | 42 |
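The start and end positions in the table follow directly from the widths, because each column begins one position after the previous one ends. A minimal sketch of that arithmetic in base R (the object names are illustrative only):

```r
# Column widths for the authors files, taken from the table above
widths <- c(name = 20, state_of_birth = 10, unique_id = 12)

# A column ends at the running total of the widths,
# and starts width - 1 positions before its end
end_position <- cumsum(widths)
start_position <- end_position - widths + 1

layout <- data.frame(
  variable = names(widths),
  width = unname(widths),
  start = unname(start_position),
  end = unname(end_position)
)
layout
```

The same calculation can be used to check a width table for any of the other fixed-width files on this page.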
census2021firstresultsenglandwales1.xlsx
Population and household estimates, England and Wales: Census 2021
The data are a tabulation of the population of England and Wales, drawn from the 2021 Census and downloaded from the UK Office for National Statistics (ONS): Population and household estimates, England and Wales: Census 2021, 2022-06-28
Statistical bulletin: “Population and household estimates, England and Wales: Census 2021”, 2022-06-28
“Local Authority Districts, Counties and Unitary Authorities (April 2021) Map in United Kingdom”
License
The UK Open Government Licence for Public Sector Information
cr25
The files within the subdirectory cr25 are as follows:
list.files(dpjr::dpjr_data("cr25"))
#> [1] "cr25_538.csv" "cr25_human_resources.xlsx"
#> [3] "cr25_order_multi.csv" "cr25_order_multi.rds"
#> [5] "cr25_storelist.csv"
For information about these files see the vignette “Data generation—CR25”.
gapminder.csv
A csv version of the dataset in the {gapminder} package
intl_tuition_fees_bc.xlsx
International Tuition Fees at Public Post-Secondary Institutions by Economic Development Region
Sourced from the British Columbia Data Catalogue, this Excel file contains “Annual Academic Arts Program tuition fees for full-time international students at public post-secondary institutions by Economic Development Region (EDR) and by institution. Academic Years 2011/12 to 2021/22.”
Original file name: “tui2_international_tuition_fees_at_public_post_secondary_institutions_by_economic_development_r.xlsx”
License:
JCUSH.txt
Statistics Canada has made available an anonymized Public-Use Microdata File (PUMF) of the Joint Canada/United States Survey of Health, a telephone survey conducted in late 2002 and early 2003. There were 8,688 respondents to the survey, 3,505 Canadians and 5,183 Americans.
The PUMF is a fixed-width file named “JCUSH.txt”. Each line is 552 columns in length.
The webpage for the survey, including the PUMF file, data dictionary, and methodological notes, is here: https://www150.statcan.gc.ca/n1/pub/82m0022x/2003001/4069119-eng.htm
License:
Statistics Canada Open License
Source: Statistics Canada, Joint Canada/United States Survey of Health 2002-03, 2004. Reproduced and distributed on an “as is” basis with the permission of Statistics Canada.
lfs_canada_employment.rds
Total employment in Canada, monthly, unadjusted for seasonality, both sexes, age 15 and over, total all population centres and rural areas (thousands), 2011-01 to 2022-12
Table 14-10-0374-01, vector v1234977815
License:
Statistics Canada Open License
Source: Statistics Canada. Table 14-10-0374-01 Employment and unemployment rate, monthly, unadjusted for seasonality DOI: https://doi.org/10.25318/1410037401-eng
mpg.csv
A csv version of the famous mpg dataset, included in the {ggplot2} package
penguins
penguins.*
The penguins data frame from the {palmerpenguins} package, in different formats:
- comma-separated values
penguins.csv
# code to create file
readr::write_csv(palmerpenguins::penguins,
                 here::here("penguins.csv"))
- SAS
penguins.sas
# code to create file
haven::write_sas(palmerpenguins::penguins,
                 here::here("penguins.sas"))
- SPSS
penguins.sav
# code to create file
haven::write_sav(palmerpenguins::penguins,
                 here::here("penguins.sav"))
- Stata
penguins.dta
# code to create file
haven::write_dta(palmerpenguins::penguins,
                 here::here("penguins.dta"))
penguins_fwf.txt
A fixed-width version of the {palmerpenguins} dataset
There are 8 different variables, described in the table below:
Variable | Width | Start position | End position |
---|---|---|---|
species | 9 | 1 | 9 |
island | 9 | 10 | 18 |
bill_length_mm | 4 | 19 | 22 |
bill_depth_mm | 4 | 23 | 26 |
flipper_length_mm | 3 | 27 | 29 |
body_mass_g | 4 | 30 | 33 |
sex | 6 | 34 | 39 |
year | 4 | 40 | 43 |
The fixed-width file has been created to minimize white space. The first four and last two rows of the data look like this:
readLines(dpjr::dpjr_data("penguins_fwf.txt"), n = 4)
#> [1] "Adelie   Torgersen39.118.71813750male  2007"
#> [2] "Adelie   Torgersen39.517.41863800female2007"
#> [3] "Adelie   Torgersen40.318.01953250female2007"
#> [4] "Adelie   Torgersen                     2007"
tail(readLines(dpjr::dpjr_data("penguins_fwf.txt")), 2)
#> [1] "ChinstrapDream    50.819.02104100male  2009"
#> [2] "ChinstrapDream    50.218.71983775female2009"
Note that the first row is not the variable names. This is common in fixed-width files.
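Because the file carries no header row, the widths and variable names from the table have to be supplied when reading it. The sketch below uses base R's read.fwf() on two sample records, written out with the padding implied by the documented widths, so that it runs without the package installed. The object names are illustrative, and this is one possible approach rather than the method used in the book.

```r
# Two sample records, padded to the documented column widths (43 characters)
sample_lines <- c(
  "Adelie   Torgersen39.118.71813750male  2007",
  "ChinstrapDream    50.218.71983775female2009"
)

# Widths and names from the variable table above
penguin_widths <- c(9, 9, 4, 4, 3, 4, 6, 4)
penguin_names <- c("species", "island", "bill_length_mm", "bill_depth_mm",
                   "flipper_length_mm", "body_mass_g", "sex", "year")

penguins_fwf <- read.fwf(
  textConnection(sample_lines),
  widths = penguin_widths,
  col.names = penguin_names,
  strip.white = TRUE  # drop the padding around character values
)
str(penguins_fwf)
```

With {dpjr} installed, replacing textConnection(sample_lines) with dpjr::dpjr_data("penguins_fwf.txt") should read the full file in the same way.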
# code to create fwf file
# (penguins_df is not defined on this page; it is assumed to hold the
# palmerpenguins::penguins data)
gdata::write.fwf(penguins_df, file = "penguins_fwf.txt",
                 width = c(
                   9, # species
                   9, # island
                   4, # bill_length_mm
                   4, # bill_depth_mm
                   3, # flipper_length_mm
                   4, # body_mass_g
                   6, # sex
                   4  # year
                 ),
                 sep = "",
                 colname = FALSE)
penguins_fwf_code.txt
A fixed-width version of the {palmerpenguins} dataset, where the character strings associated with the values have been replaced by numeric values.
There are 8 different variables, described in the table below:
Variable | Width | Start position | End position |
---|---|---|---|
species | 1 | 1 | 1 |
island | 1 | 2 | 2 |
bill_length_mm | 4 | 3 | 6 |
bill_depth_mm | 4 | 7 | 10 |
flipper_length_mm | 3 | 11 | 13 |
body_mass_g | 4 | 14 | 17 |
sex | 1 | 18 | 18 |
year | 4 | 19 | 22 |
Variable | Value | Character |
---|---|---|
species | 1 | Adelie |
species | 2 | Gentoo |
species | 3 | Chinstrap |
island | 1 | Torgersen |
island | 2 | Biscoe |
island | 3 | Dream |
sex | 1 | female |
sex | 2 | male |
The fixed-width file is now about half the width of the fixed-width file that has the character values (22 characters per record rather than 43). The first four and last two rows of the data look like this:
readLines(dpjr::dpjr_data("penguins_fwf_code.txt"), n = 4)
#> [1] "1139.118.7181375022007" "1139.517.4186380012007" "1140.318.0195325012007"
#> [4] "11                2007"
tail(readLines(dpjr::dpjr_data("penguins_fwf_code.txt")), 2)
#> [1] "3350.819.0210410022009" "3350.218.7198377512009"
Note that the first row is not the variable names. This is common in fixed-width files.
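Once the coded file has been read, the numeric codes can be turned back into labels using the value table for this file. A minimal sketch with base R's read.fwf() and factor(), applied to two records copied from the output shown for this file (the object names are illustrative only):

```r
# Two coded records from the output shown for penguins_fwf_code.txt
coded_lines <- c("1139.118.7181375022007", "3350.218.7198377512009")

coded <- read.fwf(
  textConnection(coded_lines),
  widths = c(1, 1, 4, 4, 3, 4, 1, 4),
  col.names = c("species", "island", "bill_length_mm", "bill_depth_mm",
                "flipper_length_mm", "body_mass_g", "sex", "year")
)

# Apply the labels from the value table for this file
coded$species <- factor(coded$species, levels = 1:3,
                        labels = c("Adelie", "Gentoo", "Chinstrap"))
coded$island <- factor(coded$island, levels = 1:3,
                       labels = c("Torgersen", "Biscoe", "Dream"))
coded$sex <- factor(coded$sex, levels = 1:2,
                    labels = c("female", "male"))
coded
```

Note that the codes are specific to this file: penguins_labelled.xlsx uses a different numbering for species and island, so the labels should always be taken from the table that accompanies the file being read.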
penguins_labelled.xlsx
An Excel version of the {palmerpenguins} dataset.
This file simulates the circumstance where the data were originally collected and stored in an SPSS file, and the data collectors have made the data available as an Excel file.
The file contains two sheets:
- penguins_values contains the individual records, with the variables coded numerically. Note that for all variables, “NA” is used to represent missing values.
- penguins_codebook contains the output that SPSS creates when the “DISPLAY DICTIONARY” syntax (or the “File > Display Data File Information > Working File” GUI menu sequence) is used to produce the “dictionary” (or codebook). There is information about each of the variables, including whether that variable is nominal or scale.
Variable | Value | Character |
---|---|---|
species | 1 | Adelie |
species | 2 | Chinstrap |
species | 3 | Gentoo |
island | 1 | Biscoe |
island | 2 | Dream |
island | 3 | Torgersen |
sex | 1 | female |
sex | 2 | male |
National Travel Survey (NTS)
Note that the National Travel Survey (NTS) data referenced in Chapter 7 of The Data Preparation Journey is not included in the {dpjr} package. The microdata file for the 2020 reference year can be found here:
Additional information about the NTS can be found here: https://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&Id=1535550&
License ({palmerpenguins} data):
Data are available by CC-0 license in accordance with the Palmer Station LTER Data Policy and the LTER Data Access Policy for Type I data.
-30-