
Commit fb7d4b4

committed
finish first draft of load_data_packages that can use metadata to call data types.
1 parent 8c80ce0 commit fb7d4b4

File tree

1 file changed: +143 -161 lines changed

R/load_data_packages.R

@@ -1,15 +1,13 @@
-#' Read contents of data package(s) and return a tibble with a tibble for each data file.
-#'
-#' `r lifecycle::badge("experimental")`
+#' Read contents of data package(s) and return a list of tibbles based on the data file(s). Can use metadata to specify data types.
 #'
-#' @description `load_data_packages()` loads one to may data packages and returns a tibble of tibbles where each data package is a tibble and within that each data file is it's own tibble. `load_data_packages()` will only work with .csv data files and EML metadata. `load_data_packages()` can also utilize the metadata to assign attributes to each data column.
+#' @description `load_data_packages()` loads one to many data packages and returns a list. If only one data package is loaded, the list will be a list of tibbles where each tibble is a data (.csv) file from the data package. If multiple data packages are loaded, the list will be a list of lists where each nested list contains a list of tibbles and each tibble is a data file (.csv). See `simplify` below for details on handling these lists.
 #'
-#' @details currently `load_data_packages()` only supports EML metadata and .csv files. To take advantage of the default settings in load_data_packages, use the default settings in `get_data_package()` or `get_data_packages()`. Archived (.zip) files must be extracted before `load_data_packages()` will work properly. Again, `get_data_package()` or `get_data_packages()` will accomplish this for you.
+#' @details Currently `load_data_packages()` only supports EML metadata and .csv files. The reference_id must match the name of the directory within `directory` that holds the data package (see `reference_id` below).
 #' '
-#' @param reference_id is a list of 6-7 digit numbers corresponding to the DataStore reference ID of the datapackage(s) to load. Alternatively, you can set `reference_id` to "load_all", which will load all the data packages in your /data folder.
-#' @param directory is the location of a folder, 'data' (created during `get_data_packages()`) which contains sub-directories where each sub-directory is the DataStore referenceId of the data package. Again, this file structure is all set up using `get_data_packages()`. Defaults to the current working directory (which is the default location for `get_data_packages()`).
-#' @param assign_attributes Logical. Defaults to FALSE. Data will be loaded using `readr::read_csv()` guessing algorithm for calling column types. If set to TRUE, column types will be set using metadata attributes via the yet-to-be written `load_metadata()` function. `r lifecycle::badge('experimental')`
-#' @param simplify Logical. Defaults to TRUE. If there is only a single data package loaded, the function will return a simple list of tibbles (where each tibble reflects a data file from within the data package). If set to FALSE, the function will return a list that contains a list of tibbles. This structure mirrors the object structure returned if multiple data packages are simultaneously loaded (a list of data packages with each data package containing a list of tibbles where each tibble corresponds to a data file in the given data package).
+#' @param reference_id the immediate directory/directories where your data packages reside. For data packages downloaded from DataStore using `get_data_package()` or `get_data_packages()` default settings, this is the DataStore reference ID for your data package(s). Alternatively, you can set `reference_id` to "`load_all`", which will load all the data packages in the directory specified via `directory` (typically ./data).
+#' @param directory is the location of a folder that contains all of the data packages (where each data package is a folder containing .csv data files and a single .xml EML metadata file). If these data packages were downloaded from DataStore using the default settings for `get_data_packages`, this folder is "./data" and you can use the default settings for `directory`.
+#' @param assign_attributes Logical. Defaults to FALSE. Data will be loaded using the `readr::read_csv()` guessing algorithm for calling column types. If you set `assign_attributes = TRUE`, column types will be set using the data types specified in the metadata. Currently supported data types include string, dateTime, float, double, integer, and categorical (factor in R). This assignment is very stringent: for instance, if you did not specify date-time formats using ISO-8601 notation (i.e. "YYYY", not "yyyy"), your data will import as NAs. If you have undefined missing values or blank cells, your data will not import at all. If you run into problems, consider using the default settings and letting `read_csv` guess the column types.
+#' @param simplify Logical. Defaults to TRUE. If `simplify = TRUE`, the function will return a list of tibbles where each tibble is a data file from the data package(s) specified. The tibbles are named using the following format: "pkg_<reference_id>.<filename>" (without the filename extension). If you want to load each individual data file into R for further processing, use `simplify = TRUE` and then run `list2env(x, envir = .GlobalEnv)`. If you set `simplify = FALSE`, the object returned will either be a list of tibbles identical to that returned by `simplify = TRUE` (if only one data package is loaded) or a list of lists where each nested list contains one tibble for each data file in each data package. Setting `simplify = FALSE` may make it easier to do post-processing on a package-by-package level rather than a tibble-by-tibble level.
 #'
 #' @return a list (of lists) of tibbles.
 #'
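A quick usage sketch of the documented behavior follows; the reference IDs are hypothetical placeholders, and it assumes the packages were already downloaded into ./data with `get_data_packages()`:

pkgs <- load_data_packages(reference_id = c(2295255, 2300498),
                           directory = here::here("data"),
                           assign_attributes = FALSE,  # let read_csv guess column types
                           simplify = TRUE)
# With simplify = TRUE, each element is a tibble named "pkg_<reference_id>.<filename>".
# Optionally promote the tibbles to individual objects in the global environment:
list2env(pkgs, envir = .GlobalEnv)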
@@ -27,182 +25,166 @@ load_data_packages <- function(reference_id,
 
 
   #if user specifies "load_all", get all directories from the data folder:
-  if (reference_id == "all_data") {
+  if (length(reference_id) == 1 && reference_id == "load_all") {
     reference_id <- list.dirs(path = directory,
                               full.names = FALSE,
                               recursive = FALSE)
   }
 
-  ### if only one data package is specified:
-  ### fix how single data packages are handled later:
-  if (assign_attributes == TRUE) {
-    tibble_list <- list()
-    for (h in 1:length(seq_along(reference_id))) {
-
-      directory <- paste0(directory, "/", reference_id[h])
-      #get csv file names:
-      filenames <- list.files(path = directory,
-                              pattern = "*csv")
-      ## Create list of data frame names without the ".csv" part
-      names <- gsub(pattern = "\\.csv$", "", filenames)
+  tibble_list <- list()
+  for (h in seq_along(reference_id)) {
+    suppressWarnings(rm(directory1))
+    directory1 <- paste0(directory, "/", reference_id[h])
+    #get csv file names:
+    filenames <- list.files(path = directory1,
+                            pattern = "\\.csv$")
+    ## Create list of data frame names without the ".csv" part
+    names <- gsub(pattern = "\\.csv$", "", filenames)
 
-      #load metadata:
-      metadata <- DPchecker::load_metadata(directory = directory)
-
-      ### Load all files into tibbles
-      tibble <- list()
-      for (i in 1:length(seq_along(filenames))) {
-        file_path <- file.path(paste0(directory,"/", filenames[i]))
+    ### Load all files into tibbles
+    package_data <- list()
+    for (i in seq_along(filenames)) {
+      file_path <- file.path(directory1, filenames[i])
 
       #get attributes information from metadata:
-      # To do: specifically call dataTable by name, not position! #########
-      dataTable <- metadata[["dataset"]][["dataTable"]][[i]]
-      attribs <- purrr::map_dfr(dataTable[["attributeList"]][["attribute"]],
-                                tibble::as_tibble)
-
-      attribs <- attribs %>% dplyr::mutate(R_data_type = dplyr::case_when(
-        storageType == "string" ~ "collector_character",
-        storageType == "date" ~ "collector_date",
-        storageType == "float" ~ "collector_double"))
-
-      #get column specification as R would guess:
-      csv_cols <- readr::spec_csv(file_path)
-
-      #set data types based on EML, simple:
-      for(j in 1:nrow(attribs)) {
-        class(csv_cols$cols[[j]]) <- attribs$R_data_type[[j]]
-      }
+      #To do: handle case when only one data file in the data package!
+      if (assign_attributes == TRUE) {
+        #load metadata:
+        metadata <- DPchecker::load_metadata(directory = directory1)
+        # when there is only one dataTable:
+        if ("physical" %in% names(metadata$dataset$dataTable)) {
+          dataTable <- metadata[["dataset"]][["dataTable"]]
+        } else {
+          for (j in seq_along(metadata$dataset$dataTable)) {
+            if (filenames[i] %in%
+                metadata$dataset$dataTable[[j]]$physical$objectName) {
+              dataTable <- metadata[["dataset"]][["dataTable"]][[j]]
+            }
+          }
+        }
+        #turn the metadata into a usable tibble
+        attribs <- purrr::map_dfr(dataTable[["attributeList"]][["attribute"]],
+                                  tibble::as_tibble)
+        #map_dfr started double counting rows; fix it if it happens:
+        attribs <- attribs %>% dplyr::distinct(attributeName,
+                                               .keep_all = TRUE)
+
+        attribs <- attribs %>% dplyr::mutate(R_data_type = dplyr::case_when(
+          storageType == "string" ~ "collector_character",
+          storageType == "date" ~ "collector_date",
+          storageType == "float" ~ "collector_double",
+          storageType == "double" ~ "collector_double",
+          storageType == "integer" ~ "collector_integer"))
+
+        #get column specification as R would guess:
+        csv_cols <- readr::spec_csv(file_path)
+
+        #set data types based on EML, simple:
+        for (j in 1:nrow(attribs)) {
+          class(csv_cols$cols[[j]]) <- attribs$R_data_type[[j]]
+        }
 
-      #set date/time col type format string:
-      for(j in 1:nrow(attribs)) {
-        if("dateTime" %in% names(attribs$measurementScale[j])) {
-          eml_date <-
-            attribs$measurementScale[j][["dateTime"]][["formatString"]]
-          r_date <- QCkit::convert_datetime_format(eml_date)
-          csv_cols$cols[[j]]$format <- r_date
+        #set date/time col type format string:
+        for (j in 1:nrow(attribs)) {
+          if ("dateTime" %in% names(attribs$measurementScale[j])) {
+            eml_date <-
+              attribs$measurementScale[j][["dateTime"]][["formatString"]]
+            r_date <- QCkit::convert_datetime_format(eml_date)
+            csv_cols$cols[[j]]$format <- r_date
+          }
         }
-        }
-        #set levels for factor call types:
-        for (j in 1:nrow(attribs)) {
-          if("nominal" %in% names(attribs$measurementScale[j])) {
-            nom <- attribs$measurementScale[j][["nominal"]]
-            if ("nonNumericDomain" %in% names(nom)) {
-              nom2 <- nom[["nonNumericDomain"]]
-              if ("enumeratedDomain" %in% names(nom2)) {
-                nom3 <- nom2[["enumeratedDomain"]]
-                if ("codeDefinition" %in% names(nom3)) {
-                  nom4 <- nom3[["codeDefinition"]]
-                  #get factors
-                  factors <- NULL
-                  #handle case where there is only one code definition
-                  if ("code" %in% names(nom4)) {
-                    nom4 <- list(nom4)
-                  }
-                  for (k in 1:length(seq_along(nom4))) {
-                    factors <- append(factors, nom4[[k]][["code"]])
+        #set levels for factor call types:
+        for (j in 1:nrow(attribs)) {
+          if ("nominal" %in% names(attribs$measurementScale[j])) {
+            nom <- attribs$measurementScale[j][["nominal"]]
+            if ("nonNumericDomain" %in% names(nom)) {
+              nom2 <- nom[["nonNumericDomain"]]
+              if ("enumeratedDomain" %in% names(nom2)) {
+                nom3 <- nom2[["enumeratedDomain"]]
+                if ("codeDefinition" %in% names(nom3)) {
+                  nom4 <- nom3[["codeDefinition"]]
+                  #get factors
+                  factors <- NULL
+                  #handle case where there is only one code definition
+                  if ("code" %in% names(nom4)) {
+                    nom4 <- list(nom4)
+                  }
+                  for (k in seq_along(nom4)) {
+                    factors <- append(factors, nom4[[k]][["code"]])
+                  }
+                  #set column type:
+                  csv_cols$cols[[j]] <- readr::col_factor(factors,
+                                                          include_na = FALSE,
+                                                          ordered = FALSE)
                 }
-                  #set column type:
-                  csv_cols$cols[[j]] <- readr::col_factor(factors,
-                                                          include_na = FALSE,
-                                                          ordered = FALSE)
               }
             }
           }
         }
-        }
-        suppressWarnings(tibble_list[[i]] <-
+        suppressWarnings(package_data[[i]] <-
+          assign(names[i],
+                 readr::read_csv(file_path,
+                                 col_types = csv_cols,
+                                 show_col_types = FALSE)
+          )
+        )
+        names(package_data)[i] <- names[i]
+      } else {
+        # Do not call attributes:
+        suppressWarnings(package_data[[i]] <-
           assign(names[i],
                  readr::read_csv(file_path,
-                                 col_types = csv_cols,
                                  show_col_types = FALSE)
-          )
+          )
         )
-        names(tibble_list)[i] <- names[i]
+        names(package_data)[i] <- names[i]
       }
     }
+    tibble_list[[h]] <- package_data
+    names(tibble_list)[[h]] <- paste0("pkg_", reference_id[h])
+  }
+  #put all the tibbles in a single list that is not nested
+  #(simplifies subsequent extraction)
+  if (simplify == TRUE) {
+    tibble_list <- extract_tbl(tibble_list)
   }
   return(tibble_list)
-}
-
-
-
-
+}
 
-get_attribute_type <- function(data_filename,
-                               reference_id,
-                               directory = here::here("data")
-                               ){
+#' @export
+#' @rdname load_data_packages
+load_data_package <- function(reference_id,
+                              directory = here::here("data"),
+                              assign_attributes = FALSE,
+                              simplify = TRUE) {
 
-  metadata <- DPchecker::load_metadata(directory = paste0(directory,
-                                                          "/",
-                                                          reference_id))
-  #get dataTable(s):
-  #if there is only one dataTable, put it in a list for consitency:
-  if("physical" %in% names(metadata$dataset$dataTable)) {
-    dataTable <- list(metadata$dataset$dataTable)
-  } else {
-    dataTable <- metadata$dataset$dataTable
-  }
-  # create a place to put attributes and information
-  attribute_list <- list()
-  #find the right dataTable:
-  for (i in 1:length(seq_along(dataTable))) {
-    if (dataTable[[i]][["physical"]][["objectName"]] == filename) {
-      #get attribute names:
-      attr_names <- unlist(dataTable[[i]])[grepl('attributeName',
-                                                 names(unlist(dataTable[[i]])),
-                                                 fixed=T)]
-      names(attr_names) <- NULL
-
-      #get attribute storage types
-      attr_type <- unlist(dataTable[[i]])[grepl('storageType',
-                                                names(unlist(dataTable[[i]])),
-                                                fixed=T)]
-      names(attr_type) <- NULL
-
-      #turn these into a dataframe:
-      filename_data <- tibble::as_tibble(data.frame(attr_names, attr_type))
-
-
-      date_format <- unlist(dataTable[[i]])[grepl('formatString',
-                                                  names(unlist(dataTable[[i]])),
-                                                  fixed=T)]
-      names(date_format) <- NULL
-
-
-      filename_data2 <- filename_data %>% dplyr::mutate(date_format = dplyr::casewhen(attr_type == "date" ~ x))
-
-
-
-      filename_data1 <- filename_data %>% dplyr::mutate(attr_type_abbr = dplyr::case_when(
-        attr_type == "float" ~ "d",
-        attr_type == "date" ~ "T",
-        attr_type == "string" ~ "c"
-      ))
-
-
-
-
-
-
-      #add date formats to the dataframe:
-      #get date formats:
-      date_format <- unlist(dataTable[[i]])[grepl('formatString',
-                                                  names(unlist(dataTable[[i]])),
-                                                  fixed=T)]
-      names(date_format) <- NULL
-
-      transform(filename_data, format = ifelse( (attr_type == "date"), "Y", "unk"))
-
-
-
-
-
-      attribute_list[i] <- assign(attributeNames,
-                                  readr::read_csv(file_path,
-                                                  show_col_types = FALSE))
-    }
-
-  }
+  x <- load_data_packages(reference_id,
+                          directory = directory,
+                          assign_attributes = assign_attributes,
+                          simplify = simplify)
+  return(x)
 }
 
+#' extract nested tibbles
+#'
+#' Adapted from the stack overflow find_df function found at:
+#' https://stackoverflow.com/questions/70512869/extract-data-frames-from-nested-list
+#' and accessed on 2024-10-02.
+#'
+#' @param x a (potentially deeply) nested list containing at least one tibble
+#'
+#' @return a list where each item in the list is a tibble found in the nested list `x`
+#' @keywords Internal
+#' @noRd
+#'
+#' @examples
+#' \dontrun{
+#' z <- extract_tbl(x)
+#' }
+extract_tbl <- function(x) {
+  if (tibble::is_tibble(x))
+    return(list(x))
+  if (!is.list(x))
+    return(NULL)
+  unlist(lapply(x, extract_tbl), FALSE)
+}
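The "pkg_<reference_id>.<filename>" naming promised in the `simplify` documentation falls out of `unlist(lapply(...), FALSE)` above: `unlist()` joins the names of each nesting level with a period. A toy illustration (the package and file names are made up, and it assumes `extract_tbl()` is in scope, e.g. sourced from this file):

nested <- list(pkg_2295255 = list(waterbirds = tibble::tibble(count = 1:3)),
               pkg_2300498 = list(sites = tibble::tibble(site = "A")))
flat <- extract_tbl(nested)
names(flat)
#> [1] "pkg_2295255.waterbirds" "pkg_2300498.sites"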
