
dataset: Create Data Frames that are Easier to Exchange and Reuse #681

Open · 13 of 24 tasks
antaldaniel opened this issue Jan 21, 2025 · 55 comments

@antaldaniel (Author) commented Jan 21, 2025

Submitting Author Name: Daniel Antal
Submitting Author Github Handle: @antaldaniel
Repository: https://github.com/dataobservatory-eu/dataset
Version submitted: 0.3.4002
Submission type: Standard
Editor: @maurolepore
Reviewers: @maurolepore

Archive: TBD
Version accepted: TBD
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: dataset
Title: Create Data Frames that are Easier to Exchange and Reuse
Version: 0.3.4002
Date: 2024-12-26
DOI: 10.32614/CRAN.package.dataset
Language: en-GB
Authors@R: 
    c(person(given = "Daniel", family = "Antal", 
           email = "daniel.antal@dataobservatory.eu", 
           role = c("aut", "cre"),
           comment = c(ORCID = "0000-0001-7513-6760")
           ), 
      person(given = "Marcelo", family =  "Perlin", 
             role = c("rev"), 
             comment = c(ORCID = "0000-0002-9839-4268")
             )
      )
Maintainer: Daniel Antal <daniel.antal@dataobservatory.eu>
Description: The aim of the 'dataset' package is to make tidy datasets easier to release, 
    exchange and reuse. It organizes and formats data frame 'R' objects into well-referenced, 
    well-described, interoperable datasets into release and reuse ready form.
License: GPL (>= 3)
Encoding: UTF-8
URL: https://dataset.dataobservatory.eu/
BugReports: https://github.com/dataobservatory-eu/dataset/issues/
Roxygen: list(markdown = TRUE)
LazyData: true
Imports: 
    assertthat,
    cli,
    haven,
    ISOcodes,
    labelled,
    methods,
    pillar,
    RefManageR,
    rlang,
    tibble,
    utils,
    vctrs (>= 0.5.2)
RoxygenNote: 7.3.2
Suggests: 
    knitr,
    rdflib,
    rmarkdown,
    spelling,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Depends: 
    R (>= 3.5)
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package works with various semantic interoperability standards. It allows users to retrieve RDF-annotated, rich, platform-independent data and reconstruct it as an R data.frame with rich metadata attributes, or to release interoperable, RDF-annotated datasets on linked open data platforms from native R objects.

  • Who is the target audience and what are scientific applications of this package?

Production-side statisticians. Scientists who want to update their sources from various data repositories and exchanges. Scientists and research data managers who want to release new scientific or professional datasets that follow modern interoperability standards.

The package aims to complement the rdflib and dataspice packages.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?

  • Do you intend for this package to go on Bioconductor?

  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

Code of conduct

@ropensci-review-bot (Collaborator)

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.


@maelle (Member) commented Jan 21, 2025

@antaldaniel I am so sorry, you didn't use the right template... because we broke the right template last time we edited it! I put it back in shape, thanks for helping us catch it. You can find it by opening a new issue (it'll be at the top of the list) or use it from here https://github.com/ropensci/software-review/blob/main/.github/ISSUE_TEMPLATE/A-submit-software-for-review.md

So sorry, thanks for your patience.

@ropensci-review-bot (Collaborator)

Checks for dataset (v0.3.4002)

git hash: 7bf85ac7

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage failed
  • ✖️ R CMD check process failed with message: 'Build process failed'.
  • 👀 Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with 👀 may be optionally addressed.)

Package License: GPL (>= 3)


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 281
internal dataset 215
internal graphics 11
internal stats 2
imports assertthat 29
imports utils 8
imports labelled 5
imports rlang 2
imports cli 1
imports haven 1
imports tibble 1
imports ISOcodes NA
imports methods NA
imports pillar NA
imports RefManageR NA
imports vctrs NA
suggests knitr NA
suggests rdflib NA
suggests rmarkdown NA
suggests spelling NA
suggests testthat NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

ifelse (41), as.character (40), is.null (39), list (20), c (14), vapply (8), lapply (7), names (7), data.frame (6), logical (6), paste (6), paste0 (6), character (5), inherits (5), which (5), contributors (4), date (4), seq_along (4), substr (4), Sys.time (4), for (3), format (3), invisible (3), length (3), t (3), with (3), all (2), attr (2), class (2), drop (2), gsub (2), labels (2), nrow (2), args (1), as.data.frame (1), as.Date (1), as.POSIXct (1), cbind (1), do.call (1), double (1), if (1), nchar (1), ncol (1), rbind (1), Sys.Date (1), tolower (1), vector (1)

dataset

get_bibentry (26), creator (11), dataset_title (11), subject (10), publisher (7), rights (7), get_creator (6), identifier (6), description (5), language (5), new_Subject (5), provenance (5), dataset_df (4), get_publisher (4), get_type (4), agent (3), convert_column (3), n_triple (3), publication_year (3), var_definition (3), var_namespace (3), var_unit (3), as_dataset_df (2), as_dublincore (2), datacite (2), default_provenance (2), definition_attribute (2), geolocation (2), get_author (2), get_person_iri (2), idcol_find (2), is_person (2), is.dataset_df (2), n_triples (2), namespace_attribute (2), new_my_tibble (2), prov_author (2), unit_attribute (2), as_character (1), as_character.haven_labelled_defined (1), as_datacite (1), as_numeric (1), as_numeric.haven_labelled_defined (1), as.character.haven_labelled_defined (1), create_iri (1), dataset_to_triples (1), defined (1), describe (1), dublincore (1), dublincore_to_triples (1), fix_contributor (1), fix_publisher (1), get_definition_attribute (1), get_namespace_attribute (1), get_unit_attribute (1), id_to_column (1), is_dataset_df (1), is_doi (1), is.datacite (1), is.datacite.datacite (1), is.defined (1), is.dublincore (1), is.dublincore.dublincore (1), is.subject (1), label_attribute (1), names.dataset_df (1), new_datacite (1), new_datetime_defined (1), new_dublincore (1), new_labelled_defined (1), print.dataset_df (1), set_default_bibentry (1), set_definition_attribute (1), set_namespace_attribute (1), set_unit_attribute (1), set_var_labels (1), subject_create (1), summary.dataset_df (1), summary.haven_labelled_defined (1), tbl_sum.dataset_df (1), var_definition.default (1), var_label.dataset_df (1), var_label.defined (1), var_namespace.default (1)

assertthat

assert_that (29)

graphics

title (11)

utils

person (5), bibentry (2), citation (1)

labelled

var_label (4), to_labelled (1)

rlang

caller_env (1), env_is_user_facing (1)

stats

df (1), family (1)

cli

cat_line (1)

haven

labelled (1)

tibble

new_tibble (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 29 files) and
  • 1 authors
  • 5 vignettes
  • 1 internal data file
  • 12 imported packages
  • 89 exported functions (median 5 lines of code)
  • 153 non-exported functions in R (median 8 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 29 88.1
files_vignettes 5 95.4
files_tests 29 97.1
loc_R 1731 79.0
loc_vignettes 521 77.3
loc_tests 717 77.6
num_vignettes 5 96.4 TRUE
data_size_total 2480 62.3
data_size_median 2480 69.0
n_fns_r 242 91.0
n_fns_r_exported 89 94.0
n_fns_r_not_exported 153 89.0
n_fns_per_file_r 5 71.9
num_params_per_fn 2 8.2
loc_per_fn_r 6 13.0
loc_per_fn_r_exp 5 8.5
loc_per_fn_r_not_exp 8 22.9
rel_whitespace_R 25 85.1
rel_whitespace_vignettes 37 81.2
rel_whitespace_tests 23 79.5
doclines_per_fn_exp 52 64.9
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 155 84.7

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

rhub.yaml


3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following error:

  1. Error in proc$get_built_file() : Build process failed

R CMD check generated the following check_fails:

  1. no_description_date
  2. no_import_package_as_a_whole

Test coverage with covr

ERROR: Test Coverage Failed

Cyclocomplexity with cyclocomp

Error : Build failed, unknown error, standard output:

  • checking for file ‘dataset/DESCRIPTION’ ... OK
  • preparing ‘dataset’:
  • checking DESCRIPTION meta-information ... OK
  • installing the package to build vignettes
  • creating vignettes ... ERROR
    --- re-building ‘bibentry.Rmd’ using rmarkdown
    --- finished re-building ‘bibentry.Rmd’

--- re-building ‘dataset_df.Rmd’ using rmarkdown
--- finished re-building ‘dataset_df.Rmd’

--- re-building ‘defined.Rmd’ using rmarkdown
--- finished re-building ‘defined.Rmd’

--- re-building ‘new_requirements.Rmd’ using rmarkdown
--- finished re-building ‘new_requirements.Rmd’

--- re-building ‘rdf.Rmd’ using rmarkdown

Quitting from lines 106-108 [jsonld] (rdf.Rmd)
Error: processing vignette 'rdf.Rmd' failed with diagnostics:
please install the jsonld package to use this functionality.
--- failed re-building ‘rdf.Rmd’

SUMMARY: processing the following file failed:
‘rdf.Rmd’

Error: Vignette re-building failed.
Execution halted

Static code analyses with lintr

lintr found no issues with this package!


4. Other Checks

Details of other checks (click to open)

✖️ The following 10 function names are duplicated in other packages:

    • as_character from metan, radiant.data, retroharmonize, sjlabelled
    • as_numeric from descstat, metan, qdapRegex, radiant.data, retroharmonize, sjlabelled, zenplots
    • describe from AzureVision, Bolstad2, describer, dlookr, explore, Hmisc, iBreakDown, ingredients, lambda.r, MSbox, onewaytests, prettyR, psych, psych, psyntur, questionr, radiant.data, RCPA3, Rlab, scan, scorecard, sylly, tidycomm
    • description from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
    • get_bibentry from eurostat
    • identifier from Ramble
    • is.defined from nonmemica
    • language from sylly, wakefield
    • provenance from provenance
    • subject from DGM, emayili, gmailr, sendgridr


Package Versions

package version
pkgstats 0.2.0.48
pkgcheck 0.1.2.77


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@antaldaniel (Author)

@maelle let me know if this works now :)

@maelle (Member) commented Jan 21, 2025

Yes, thank you!

@maelle (Member) commented Jan 21, 2025

@ropensci-review-bot assign @maelle as editor

@ropensci-review-bot (Collaborator)

Assigned! @maelle is now the editor

@maelle (Member) commented Jan 21, 2025

Thanks again for your submission!

Editor checks:

  • Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • Is the case for the package well made?
    • Is the reference index page clear (grouped by topic if necessary)?
    • Are vignettes readable, sufficiently detailed and not just perfunctory?
  • Fit: The package meets criteria for fit and overlap.
  • Installation instructions: Are installation instructions clear enough for human users?
  • Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
  • Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • License: The package has a CRAN or OSI accepted license.
  • Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments


Documentation

My main comment before I can proceed to looking for reviewers is that the case for the package could be made better.

On the one hand, it'd be interesting to read how dataset compares to other approaches to the same "problem", such as (if I follow correctly)

  • https://docs.ropensci.org/frictionless/
  • https://docs.ropensci.org/datapack/
  • https://datapackage.org/
  • https://www.researchobject.org/

On the other hand, how would a user take advantage of dataset?
To me, it is not clear yet from reading the docs.
Questions I wonder about:

  • As a data publisher, I create the dataset object, and then, how does it help me document it? How does it help me publish it on a repository?
  • When you mention standard statistical libraries in the README, could you name some?
  • As a data consumer, how do I create a dataset object (do I get it from an R package? is it shared in another way)? How can I easily use the information on units when exploring the data, or when plotting it?

In short, could you exemplify "release" and "re-use" as use cases, in one or more vignettes, potentially casting in the roles the types of users you mention in the submission under "target audience"?

For instance https://wbdataset.dataobservatory.eu/ is a good example, but it is only mentioned in a vignette.
More concrete information like wbdataset should make it to the README to make it clearer what dataset is about (and then be expanded in vignettes).

A tiny comment: I find "reuse" harder to parse than "re-use" but that might be a personal preference.

Installation instructions

I'd recommend documenting the two methods of installation (CRAN and GitHub) in distinct chunks so readers could copy-paste the entire code chunk of interest.

Instead of devtools you could recommend using pak.

install.packages("pak")
pak::pak("dataobservatory-eu/dataset")
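For instance, the README could offer the two options as separate, copy-pasteable chunks (a sketch; the package is published on CRAN as dataset):

# Install the released version from CRAN:
install.packages("dataset")

# Or install the development version from GitHub:
install.packages("pak")
pak::pak("dataobservatory-eu/dataset")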

Default git branch

You might want to rename the master branch to main as some people can be offended by the word "master", see https://www.tidyverse.org/blog/2021/10/renaming-default-branch/ that includes links to context, and practical advice on renaming the default branch.
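usethis has a helper for this (assuming a reasonably recent usethis; it renames the branch and updates the local git configuration and common files):

usethis::git_default_branch_rename()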

Contributing guide

The contributing guide does not seem customized.
It mentions a possible "src" folder which is not present.

Since you are looking for co-developers, and mentioned one of the articles could be relevant to potential contributors, I'd recommend having some text related to design and wishes for feedback in the contributing guide.

The contributing guide mentions "AppVeyor" which is not used any more as far as I can tell.

Continuous integration

  • If AppVeyor is not used any more, please remove the related contributing file.

  • The code coverage workflow seems not to be working: https://github.com/dataobservatory-eu/dataset/actions/workflows/test-coverage.yaml I'd recommend using the latest workflow file from r-lib/actions, by copy-pasting it or by running usethis::use_github_action("test-coverage").

  • The pkgdown website is not up to date, for instance on it the test coverage badge is not broken whereas it is in the README. Please add a workflow to continuously deploy it, for instance by running usethis::use_github_action("pkgdown").

  • If you no longer wish to use the R-CMD-check workflow because you rely on the R-hub ones, please remove the old workflow file.

  • The latest commits all have red crosses as status, which shows the continuous integration files need a bit of cleaning and tweaking. 🙂

Project management

From the open issues, which ones are meant to be tackled soon?
One of them has the "First CRAN release" milestone, which is outdated.

Code style

I'd recommend running styler (on R scripts including tests) to make spacing more consistent.

For instance in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L56C1-L57C55 the space before return_type is surprising.
I remember being inconsistent with spaces myself years ago and not noticing; (un)fortunately, I was converted. 😅

Code

The code could be simplified so that reviewers might more easily follow the logic.

is_person <- function(p) ifelse (inherits(p, "person"), TRUE, FALSE) in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/as_dublincore.R#L11
could be is_person <- function(p) inherits(p, "person") (thanks lintr for catching this).
That pattern comes up several times in the codebase (the relevant linter is "redundant_ifelse_linter").
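For reference, lintr can surface every occurrence locally (assuming lintr >= 3.0, which supports the text argument; lint_dir() scans whole directories):

# Check a single expression:
lintr::lint(
  text = 'is_person <- function(p) ifelse(inherits(p, "person"), TRUE, FALSE)',
  linters = lintr::redundant_ifelse_linter()
)

# Or flag the pattern across the package's R/ directory:
lintr::lint_dir("R", linters = lintr::redundant_ifelse_linter())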

Code like

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L4-L20

and

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L66

(and other similar pipelines) reminds me of Jenny Bryan's advice in her talk Code smells and feels:

"If your conditions deal with class, it's time to get object-oriented. In CS jargon, use polymorphisms."

So instead of the complex logic, you'd define methods.
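A minimal sketch of that idea (the generic and methods here are illustrative, not the package's actual API):

# One generic plus one method per class, instead of if/else chains on class():
agent_name <- function(x, ...) UseMethod("agent_name")

agent_name.person <- function(x, ...) paste(x$given, x$family)

agent_name.character <- function(x, ...) x

agent_name.default <- function(x, ...) {
  stop("No agent_name() method for class ", class(x)[1], call. = FALSE)
}

agent_name(person(given = "Daniel", family = "Antal")) # "Daniel Antal"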

In some files like R/xsd_convert.R and R/dataset_title.R, you use class(something) == or class(something) %in% instead of code built on the more correct inherits().
Using "proper functions for handling class & type" is another tip in the aforementioned talk. 😸

Since dataset imports rlang, you could use the %||% operator from rlang.
For instance https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L24

creators   <- ifelse(is.null(dataset_bibentry$author), ":tba", dataset_bibentry$author)

would become

creators   <- dataset_bibentry$author %||% ":tba"

There are many other occurrences of the ifelse(is.null( pattern (and variants with different spacing) that could get the same treatment.
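For context, the operator is a one-liner (rlang exports it, and base R ships it as well since R 4.4.0), so adopting the pattern is cheap:

`%||%` <- function(lhs, rhs) if (is.null(lhs)) rhs else lhs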

In the R/agent.R file, functions like get_creator() are defined twice, why?

Example dataset

The iris dataset is very well-known, but it is also infamous because of its eugenics links.
Since having a good example dataset is very important, would you consider replacing it with another one, like maybe the palmerpenguins one, even if it comes at the cost of adding a (possibly optional) dependency?

Tests

Should the line https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-agent.R#L9 be removed as it is not used in the test?

I don't understand why the iris object needs to be duplicated in lines like https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-creator.R#L23

expect_true(is.dataset_df might become a custom expectation and/or rely on expect_s3_class() instead. Same comment for expect_true(is.subject.
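A thin wrapper is enough to get a readable test with an informative failure message (a sketch; expect_dataset_df is a hypothetical name):

# Reports the actual class on failure, unlike expect_true()'s "isn't TRUE":
expect_dataset_df <- function(object) {
  testthat::expect_s3_class(object, "dataset_df")
}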

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-datacite.R#L18

should be

expect_type(as_datacite(iris_dataset, "list"), "list")

What is https://github.com/dataobservatory-eu/dataset/blob/master/tests/testthat/test-dataset_prov.bak?

When using expect_error() as in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-dataset_title.R#L3 maybe add a pattern for the error message, just in case another error happens?
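For example (the call and message below are hypothetical; match whatever the function actually throws):

expect_error(
  dataset_title(not_a_dataset),   # hypothetical failing call
  regexp = "must be a dataset_df" # anchor the expected message
)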


Thank you! Happy to discuss any of the items.

maelle removed the holding label Jan 21, 2025

@maelle (Member) commented Jan 21, 2025

@ropensci-review-bot check package

@ropensci-review-bot (Collaborator)

Thanks, about to send the query.

@ropensci-review-bot (Collaborator)

🚀

Editor check started

👋

@ropensci-review-bot (Collaborator)

Checks for dataset (v0.3.4002), git hash 7bf85ac7: identical to the check results posted above.

@maelle (Member) commented Jan 21, 2025

Coverage is also something noted in my comments.

@antaldaniel (Author)

@maelle Thank you. It is unfortunate that you joined this review only after two years, because I find many of your comments very useful. However, I must say that some of them would be unusual to address in a vignette, and I would find an article format more useful, e.g., your question about frictionless, datapack, datapackage.org, and researchobject. About two years of research went into this package, which is not typical vignette material; reviewers of my other packages have disliked extended vignettes.

I think that the frictionless package family follows a very different approach, and when I started developing this package and this review, it did not even seem relevant. I now see that in some use cases both can be useful and a choice could be offered, so I will make that argument, but I think this belongs more in a paper considering statistical exchange formats and their best representation in R.

A small question: what is the package coverage you are aiming at? I think that the package already has very high coverage, and it exceeded the requirements when I started the review.

@maelle (Member) commented Jan 21, 2025

@antaldaniel thank you for your answer!

  • Use cases would be crucial to make the point of the package clearer. You have a vision that potential users need to understand. With concrete examples of usage, I think understanding what it is all about (and why a potential user should care) would be easier. A use case does not have to be really long, and it can contain diagrams, but it should help someone see what they could do with the package in practice. Even if, from the current README text, they might gather that it has the worthy goal of helping with FAIRness, the application of the package might remain nebulous.
  • I agree the docs do not need to contain an in-depth comparison with other approaches, but a small section "Related work" would be important. It'd allow users to quickly compare your package to others, and also would help them, again, get the point. "Oh, it's a bit like frictionless, it's not a new tibble or a new dataspice". In our dev guide that's "If applicable, how the package compares to other similar packages and/or how it relates to other packages." in https://devguide.ropensci.org/pkg_building.html#readme With workflow/standards packages like this one (as opposed to, say, a package that helps you get data from a specific API, where the goal is easier to grasp for a newcomer), introducing the package is obviously harder, but also more important.
  • For the coverage, the problem is not the number but the fact that no continuous integration workflow is updating the badge. I am aiming at a not unknown coverage in the README badge. 😉

@antaldaniel (Author)

Thank you @maelle, and I will look into why the coverage is not updating... However, do you have an explicit coverage target?

@maelle (Member) commented Jan 21, 2025

Yes it's 75% https://devguide.ropensci.org/pkg_building.html#testing -- But I'd recommend also looking at the coverage report to find the not covered lines and make a judgement call on how important/risky these lines are (vs how hard to test they'd be) just so you're sure there's nothing dangerous in the remaining 25%. 🙂 The idea really is to cover "key functionality" (phrasing from the dev guide).
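Locally, the uncovered lines are easy to inspect with covr, the same tool the rOpenSci checks use:

cov <- covr::package_coverage()
covr::percent_coverage(cov) # the overall figure
covr::report(cov)           # per-line HTML report, to find the gaps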

In my comments I recommend updating the test-coverage workflow, the fix might be as simple as that.

@antaldaniel (Author)

@maelle Thank you again for your useful comments and your PRs. I reorganised the issues and created a new milestone. The milestone currently breaks up your review comments into 7 issues, but they could of course be broken down into more. I set myself a deadline of 16 February to resolve them, though it may happen earlier. I will tag you in the issues when they are ready for review and will also add a new comment here.

@maelle (Member) commented Apr 22, 2025

@antaldaniel thank you! I am having trouble reading your comment, could you please format quotes as quotes i.e. using the ">" sign in front of the paragraph?

@antaldaniel (Author) commented Apr 24, 2025

> On the one hand, it'd be interesting to read how dataset compares to other approaches to the same "problem", such as (if I follow correctly)
> https://docs.ropensci.org/frictionless/
> https://docs.ropensci.org/datapack/
> https://datapackage.org/
> https://www.researchobject.org/
> On the other hand, how would a user take advantage of dataset?
> To me, it is not clear yet from reading the docs.
> Questions I wonder about:
> As a data publisher, I create the dataset object, and then, how does it help me document it? How does it help me publish it on a repository?

This is now explained in detail in the new vignette.

Repositories like Zenodo apply at most, and usually with limitations, the 5-star FAIR data model, which helps with finding data but does not make the data readily reusable. We apply the 8-star model, which means that we attach metadata about the contents of the data, not only about the file. On Zenodo, you can upload three different datasets about GDP per capita in countries: one in dollars, another in euros, and yet another in PPP-adjusted euros. These datasets are not very reusable, because if a user accidentally joins them, logical-semantic errors result; for example, euro values will be compared with or added to dollar values. Or, in the natural sciences, units of measure may mismatch.

Our datasets contain the machine-readable definition of each variable, its unit of measure, and other metadata. To my knowledge, apart from rdflib, no other R package helps with that. We designed a set of packages that work with rdflib.

> When you mention standard statistical libraries in the README, could you name some?

We aim for full compliance with the GSIM model, which partly overlaps with the Statistical Data and Metadata eXchange (SDMX), so our datasets can be exchanged, enriched, etc. with IMF, Eurostat, or French national statistical office data, and with DDI (the standard for microdata).

> As a data consumer, how do I create a dataset object (do I get it from an R package? shared in another way)? How can I easily use the information on units when exploring the data, when plotting it?

You use the dataset_df() constructor, which enriches a data.frame or tibble object. The print method uses this information, and I implemented a plot function, too (see the README).
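A rough sketch of that workflow (argument names follow the defined() and dataset_df() documentation; check the reference pages for the exact signatures):

library(dataset)

small_country_dataset <- dataset_df(
  country = defined(c("AD", "LI"), label = "Country code"),
  gdp = defined(c(3897, 7365), label = "GDP", unit = "million euros")
)

print(small_country_dataset) # the print method shows labels and units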

> In short, could you exemplify "release" and "re-use" in a vignette or more, as use cases, potentially using as roles the type of users you mention in the submission under "target audience".

Release means that you can create datasets that you can send to the EU Open Data Portal, which has far stricter requirements than Zenodo. It is meant for people who want to machine-exchange data with professional scientific or statistical data providers, even at the API level.

> For instance https://wbdataset.dataobservatory.eu/ is a good example, but it is mentioned in a vignette.
> More concrete information like wbdataset should make it to the README to make it clearer what dataset is about (and then be expanded in vignettes).

> Installation instructions
> I'd recommend documenting the two methods of installation (CRAN and GitHub) in distinct chunks so readers could copy-paste the entire code chunk of interest.

Done

> Contributing guide
> The contributing guide does not seem customized.
> It mentions a possible "src" folder which is not present. Since you are looking for co-developers, and mentioned one of the articles could be relevant to potential contributors, I'd recommend having some text related to design and wishes for feedback in the contributing guide.

The guide has been rewritten.

> The contributing guide mentions "AppVeyor" which is not used any more as far as I can tell. If AppVeyor is not used any more, please remove the related contributing file.

It is removed.

> Project management
> From the open issues, which ones are meant to be tackled soon?

All issues are tackled and await your review.

> Code style
> I'd recommend running styler (on R scripts including tests) to make spacing more consistent.

styler is used now with tidyverse settings.

> is_person <- function(p) ifelse (inherits(p, "person"), TRUE, FALSE) in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/as_dublincore.R#L11
> could be is_person <- function(p) inherits(p, "person") (thanks lintr for catching this).
> That pattern comes up several times in the codebase (the relevant linter is "redundant_ifelse_linter").

These issues are resolved.

I hope that all such cases are handled.

> Since dataset imports rlang, you could use the %||% operator from rlang.
> For instance https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L24

I prefer to remain as close to base R as possible and later remove some dependencies, possibly even rlang. So I prefer to use less of rlang, not more. However, for now I have created an internal function for %||%, and there are a few examples of its use.

> Example dataset
> The iris dataset is very well-known, but it is also infamous because of its eugenics links.
> Since having a good example dataset is very important, would you consider replacing it with another one, like maybe the palmerpenguins one, even if it comes at the cost of adding a (possibly optional) dependency?

I think the way rOpenSci handles this issue could be improved, because the dataset was never used in eugenics. I find boycotting an innocent author, i.e., Edgar Anderson, unethical. There are better practices for handling problematic provenance descriptions (because in this case it is the provenance description that is associated with eugenics, not the dataset or its author), and scientific solutions to such problems from digital humanities and digital heritage management could be introduced. But that could be the topic of a peer-reviewed article in itself. So, upon your request, I removed the iris dataset from all vignettes and examples. The package now uses the Orange dataset.

> Tests
> Should the line https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-agent.R#L9 be removed as it is not used in the test?

Solved.

> I don't understand why the iris object needs to be duplicated in lines like https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-creator.R#L23
> expect_true(is.dataset_df might become a custom expectation and/or rely on expect_s3_class() instead. Same comment for expect_true(is.subject.
> https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-datacite.R#L18
> should be
> expect_type(as_datacite(iris_dataset, "list"), "list")
> What is https://github.com/dataobservatory-eu/dataset/blob/master/tests/testthat/test-dataset_prov.bak?
> When using expect_error() as in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-dataset_title.R#L3 maybe add a pattern for the error message, just in case another error happens?

These minor issues are solved and there are many new tests.

@maurolepore (Member)

@ropensci-review-bot assign @maurolepore as editor

@ropensci-review-bot (Collaborator)

Assigned! @maurolepore is now the editor

@maurolepore (Member)

Dear @antaldaniel

Thanks again for sharing your work with rOpenSci, for your patience, and for incorporating so many of the suggestions so far.

I'm now the handling editor, please direct all your questions to me.

This week I'm wrapping up my EiC rotation but starting next week I'll be available to work with you to move this towards approval as smoothly and quickly as you and I can make it.

I plan to first digest this thread, then do some preliminary checks, then assign myself as the first reviewer -- which we sometimes do, and which the editorial board has agreed to for this specific submission. With that deeper dive into the package I'll be better able to share my findings with the editorial board and discuss the next step.

Would you be available to respond to my suggestions between May 12-23? Not a hard commitment, just to gauge your availability and decide whether this is indeed the right time for you.

@antaldaniel (Author)

Dear @maurolepore, thank you very much. I have an important deadline on 15 May, so whatever you write I will be able to review, and at least partly implement, by 23 May, most likely starting only on 16 May.

Because the package is already on CRAN and I have fixed a lot of minor bugs and made improvements, if it is not against your preferences, I would release the current version (the README makes it very clear that this is an experimental version under review at rOpenSci). The changes that may have bothered the few users experimenting with the package are resolved (there are new and better-tested subsetting, printing, etc. methods for the data-frame-like and vector classes -- nothing groundbreaking or conceptual, but for somebody using the package as it is, it is a significant improvement).

So is it OK with you if I release this new, improved version on CRAN, wait for your feedback until 12 May, and then reply (and implement whatever is possible) by 23 May?

@maurolepore (Member)

@antaldaniel sure go ahead and release what you have.

@maurolepore (Member)

This is to mark the end of my EiC rotation and to hand it over to the next EiC with a short summary.

@maurolepore (Member)

@ropensci-review-bot assign @maurolepore as reviewer

@ropensci-review-bot (Collaborator)

@maurolepore added to the reviewers list. Review due date is 2025-05-31. Thanks @maurolepore for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot (Collaborator)

@maurolepore: If you haven't done so, please fill this form for us to update our reviewers records.

@maurolepore (Member)

@ropensci-review-bot check package

@ropensci-review-bot (Collaborator)

Thanks, about to send the query.

@ropensci-review-bot (Collaborator)

🚀

Editor check started

👋

@ropensci-review-bot (Collaborator)

Checks for dataset (v0.3.4022)

git hash: 8768983c

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✖️ The following function has no documented return value: [bind_defined_rows]
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 80.1%.
  • ✖️ R CMD check process failed with message: 'Build process failed'.
  • 👀 Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with 👀 may be optionally addressed.)

Package License: GPL (>= 3)


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 353
internal dataset 301
internal graphics 7
internal stats 7
imports assertthat 25
imports utils 10
imports labelled 9
imports vctrs 7
imports rlang 2
imports haven 1
imports tibble 1
imports datasets NA
imports ISOcodes NA
imports methods NA
imports pillar NA
imports RefManageR NA
suggests knitr NA
suggests rdflib NA
suggests rmarkdown NA
suggests spelling NA
suggests testthat NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

is.null (45), ifelse (39), c (27), list (19), lapply (18), vapply (16), contributors (9), for (9), seq_along (9), names (8), paste0 (8), if (7), inherits (7), logical (7), mapply (7), substr (7), which (7), attr (6), labels (6), unique (6), unlist (6), do.call (5), format (5), date (4), length (4), Sys.time (4), col (3), data.frame (3), invisible (3), nzchar (3), t (3), vector (3), all (2), as.data.frame (2), as.list (2), drop (2), gsub (2), is.na (2), nchar (2), nrow (2), paste (2), tryCatch (2), any (1), args (1), as.POSIXct (1), cbind (1), class (1), double (1), gregexpr (1), identical (1), ncol (1), NextMethod (1), rbind (1), rep (1), setdiff (1), strsplit (1), sub (1), Sys.Date (1), tolower (1), typeof (1), units (1), with (1)

dataset

as.character (41), get_bibentry (26), creator (11), identifier (10), subject (10), var_unit (10), namespace_attribute (9), dataset_df (8), dataset_title (8), publisher (7), var_definition (7), description (5), language (5), provenance (5), rights (5), get_author (4), get_creator (4), get_publisher (4), get_type (4), agent (3), as_dataset_df (3), convert_column (3), definition_attribute (3), n_triple (3), new_Subject (3), publication_year (3), var_namespace (3), as_dublincore (2), datacite (2), default_provenance (2), defined (2), dublincore (2), geolocation (2), get_contributor (2), get_person_iri (2), idcol_find (2), is_person (2), is.dataset_df (2), n_triples (2), new_my_tibble (2), prov_author (2), unit_attribute (2), all.identical (1), as_character (1), as_character.haven_labelled_defined (1), as_datacite (1), as_factor (1), as_factor.haven_labelled_defined (1), as_numeric (1), as_numeric.haven_labelled_defined (1), as.character.haven_labelled_defined (1), as.list.haven_labelled_defined (1), as.numeric (1), as.numeric.haven_labelled_defined (1), as.vector.haven_labelled_defined (1), bind_defined_rows (1), c.haven_labelled_defined (1), compare_creators (1), create_iri (1), dataset_to_triples (1), describe (1), dublincore_to_triples (1), fix_contributor (1), fix_publisher (1), format.haven_labelled_defined (1), get_definition_attribute (1), get_namespace_attribute (1), get_unit_attribute (1), head.haven_labelled_defined (1), id_to_column (1), is_dataset_df (1), is_doi (1), is.datacite (1), is.datacite.datacite (1), is.defined (1), is.dublincore (1), is.dublincore.dublincore (1), is.subject (1), label_attribute (1), length.haven_labelled_defined (1), names.dataset_df (1), new_datacite (1), new_datetime_defined (1), new_dublincore (1), new_labelled_defined (1), Ops.haven_labelled_defined (1), plot.dataset_df (1), print.dataset_df (1), print.haven_labelled_defined (1), set_default_bibentry (1), set_definition_attribute (1), set_namespace_attribute (1), set_unit_attribute (1), set_var_labels (1), strip_defined (1), subject_create (1), summary.dataset_df (1), summary.haven_labelled_defined (1), tail.haven_labelled_defined (1), tbl_sum.dataset_df (1), var_definition.default (1), var_label.dataset_df (1), var_label.defined (1), var_namespace.default (1), var_unit.default (1), vec_cast.character.haven_labelled_defined (1)

assertthat

assert_that (25)

utils

person (5), bibentry (2), citation (1), head (1), tail (1)

labelled

var_label (8), to_labelled (1)

graphics

title (7)

stats

df (5), end (1), start (1)

vctrs

vec_data (7)

rlang

caller_env (1), env_is_user_facing (1)

haven

labelled (1)

tibble

new_tibble (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 35 files) and
  • 1 authors
  • 6 vignettes
  • 2 internal data files
  • 12 imported packages
  • 94 exported functions (median 5 lines of code)
  • 202 non-exported functions in R (median 7 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 35 91.3
files_vignettes 6 97.0
files_tests 32 97.6
loc_R 2201 84.0
loc_vignettes 1037 90.8
loc_tests 1316 87.7
num_vignettes 6 97.6 TRUE
data_size_total 4101 64.7
data_size_median 2050 67.9
n_fns_r 296 93.2
n_fns_r_exported 94 94.6
n_fns_r_not_exported 202 92.3
n_fns_per_file_r 5 73.1
num_params_per_fn 2 8.2
loc_per_fn_r 5 9.2
loc_per_fn_r_exp 5 8.4
loc_per_fn_r_not_exp 7 18.5
rel_whitespace_R 21 86.4
rel_whitespace_vignettes 41 95.1 TRUE
rel_whitespace_tests 12 77.7
doclines_per_fn_exp 53 65.9
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 267 90.7

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

rhub.yaml


3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following error:

  1. Error in proc$get_built_file() : Build process failed

R CMD check generated the following check_fails:

  1. no_description_date
  2. no_import_package_as_a_whole

Test coverage with covr

Package coverage: 80.05

Cyclocomplexity with cyclocomp

Error : Build failed, unknown error, standard output:

  • checking for file ‘dataset/DESCRIPTION’ ... OK
  • preparing ‘dataset’:
  • checking DESCRIPTION meta-information ... OK
  • installing the package to build vignettes
  • creating vignettes ... ERROR
    --- re-building ‘bibentry.Rmd’ using rmarkdown
    --- finished re-building ‘bibentry.Rmd’

--- re-building ‘dataset_df.Rmd’ using rmarkdown
--- finished re-building ‘dataset_df.Rmd’

--- re-building ‘defined.Rmd’ using rmarkdown

Quitting from defined.Rmd:158-161 [coerce-char]

<error/rlang_error>
Error in `as_character.haven_labelled_defined()`:
! as_character(): underlying data is not a character vector.
---
Backtrace:
    ▆
 1. ├─dataset::as_character(c(gdp_1, gdp_2))
 2. └─dataset:::as_character.haven_labelled_defined(c(gdp_1, gdp_2))

Error: processing vignette 'defined.Rmd' failed with diagnostics:
as_character(): underlying data is not a character vector.
--- failed re-building ‘defined.Rmd’

--- re-building ‘Motivation.Rmd’ using rmarkdown
--- finished re-building ‘Motivation.Rmd’

--- re-building ‘new_requirements.Rmd’ using rmarkdown
--- finished re-building ‘new_requirements.Rmd’

--- re-building ‘rdf.Rmd’ using rmarkdown

Quitting from rdf.Rmd:109-112 [jsonld]

<error/rlang_error>
Error in `rdf_serialize()`:
! please install the jsonld package to use this functionality.
---
Backtrace:
     ▆
  1. └─tools::buildVignettes(dir = ".", tangle = TRUE)
  2.   ├─base::tryCatch(...)
  3.   │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
  4.   │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
  5.   │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
  6.   └─engine$weave(file, quiet = quiet, encoding = enc)
  7.     └─knitr:::vweave_rmarkdown(...)
  8.       └─rmarkdown::render(...)
  9.         └─knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
 10.           └─knitr:::process_file(text, output)
 11.             ├─xfun:::handle_error(...)
 12.             ├─base::withCallingHandlers(...)
 13.             └─knitr:::process_group(group)
 14.               └─knitr:::call_block(x)
 15.                 └─knitr:::block_exec(params)
 16.                   └─knitr:::eng_r(options)
 17.                     ├─knitr:::in_input_dir(...)
 18.                     │ └─knitr:::in_dir(input_dir(), expr)
 19.                     └─knitr (local) evaluate(...)
 20.                       └─evaluate::evaluate(...)
 21.                         ├─base::withRestarts(...)
 22.                         │ └─base (local) withRestartList(expr, restarts)
 23.                         │   ├─base (local) withOneRestart(withRestartList(expr, restarts[-nr]), restarts[[nr]])
 24.                         │   │ └─base (local) doWithOneRestart(return(expr), restart)
 25.                         │   └─base (local) withRestartList(expr, restarts[-nr])
 26.                         │     └─base (local) withOneRestart(expr, restarts[[1L]])
 27.                         │       └─base (local) doWithOneRestart(return(expr), restart)
 28.                         ├─evaluate:::with_handlers(...)
 29.                         │ ├─base::eval(call)
 30.                         │ │ └─base::eval(call)
 31.                         │ └─base::withCallingHandlers(...)
 32.                         └─watcher$print_value(ev$value, ev$visible, envir)
 33.                           ├─base::withVisible(handle_value(handler, value, visible, envir))
 34.                           └─evaluate:::handle_value(handler, value, visible, envir)
 35.                             └─handler$value(value, visible)
 36.                               └─knitr (local) fun(x, options = options)
 37.                                 ├─base::withVisible(knit_print(x, ...))
 38.                                 ├─knitr::knit_print(x, ...)
 39.                                 └─knitr:::knit_print.default(x, ...)
 40.                                   └─knitr::normal_print(x)
 41.                                     ├─base::print(x)
 42.                                     └─rdflib:::print.rdf(x)
 43.                                       ├─base::cat(format.rdf(x, ...), sep = "\n")
 44.                                       └─rdflib:::format.rdf(x, ...)
 45.                                         └─rdflib::rdf_serialize(x, tmp, format = format, ...)

Error: processing vignette 'rdf.Rmd' failed with diagnostics:
please install the jsonld package to use this functionality.
--- failed re-building ‘rdf.Rmd’

SUMMARY: processing the following files failed:
‘defined.Rmd’ ‘rdf.Rmd’

Error: Vignette re-building failed.
Execution halted

Static code analyses with lintr

lintr found the following 403 potential issues:

message (number of times)
Avoid 1:nrow(...) expressions, use seq_len. 1
Avoid library() and require() calls in packages 21
Avoid using sapply, consider vapply instead, that's type safe 1
Lines should not be more than 80 characters. This line is 100 characters. 10
Lines should not be more than 80 characters. This line is 101 characters. 6
Lines should not be more than 80 characters. This line is 102 characters. 6
Lines should not be more than 80 characters. This line is 103 characters. 8
Lines should not be more than 80 characters. This line is 104 characters. 7
Lines should not be more than 80 characters. This line is 105 characters. 7
Lines should not be more than 80 characters. This line is 106 characters. 6
Lines should not be more than 80 characters. This line is 107 characters. 9
Lines should not be more than 80 characters. This line is 108 characters. 3
Lines should not be more than 80 characters. This line is 109 characters. 14
Lines should not be more than 80 characters. This line is 110 characters. 5
Lines should not be more than 80 characters. This line is 111 characters. 1
Lines should not be more than 80 characters. This line is 112 characters. 6
Lines should not be more than 80 characters. This line is 113 characters. 1
Lines should not be more than 80 characters. This line is 114 characters. 8
Lines should not be more than 80 characters. This line is 115 characters. 6
Lines should not be more than 80 characters. This line is 117 characters. 3
Lines should not be more than 80 characters. This line is 118 characters. 5
Lines should not be more than 80 characters. This line is 119 characters. 2
Lines should not be more than 80 characters. This line is 120 characters. 4
Lines should not be more than 80 characters. This line is 122 characters. 1
Lines should not be more than 80 characters. This line is 123 characters. 2
Lines should not be more than 80 characters. This line is 125 characters. 1
Lines should not be more than 80 characters. This line is 126 characters. 3
Lines should not be more than 80 characters. This line is 131 characters. 3
Lines should not be more than 80 characters. This line is 133 characters. 2
Lines should not be more than 80 characters. This line is 134 characters. 1
Lines should not be more than 80 characters. This line is 136 characters. 1
Lines should not be more than 80 characters. This line is 140 characters. 1
Lines should not be more than 80 characters. This line is 142 characters. 1
Lines should not be more than 80 characters. This line is 143 characters. 1
Lines should not be more than 80 characters. This line is 148 characters. 1
Lines should not be more than 80 characters. This line is 151 characters. 1
Lines should not be more than 80 characters. This line is 153 characters. 1
Lines should not be more than 80 characters. This line is 157 characters. 1
Lines should not be more than 80 characters. This line is 166 characters. 1
Lines should not be more than 80 characters. This line is 171 characters. 2
Lines should not be more than 80 characters. This line is 174 characters. 1
Lines should not be more than 80 characters. This line is 292 characters. 3
Lines should not be more than 80 characters. This line is 312 characters. 1
Lines should not be more than 80 characters. This line is 81 characters. 17
Lines should not be more than 80 characters. This line is 82 characters. 15
Lines should not be more than 80 characters. This line is 83 characters. 21
Lines should not be more than 80 characters. This line is 84 characters. 15
Lines should not be more than 80 characters. This line is 85 characters. 8
Lines should not be more than 80 characters. This line is 86 characters. 11
Lines should not be more than 80 characters. This line is 87 characters. 15
Lines should not be more than 80 characters. This line is 88 characters. 15
Lines should not be more than 80 characters. This line is 89 characters. 12
Lines should not be more than 80 characters. This line is 90 characters. 11
Lines should not be more than 80 characters. This line is 91 characters. 10
Lines should not be more than 80 characters. This line is 92 characters. 12
Lines should not be more than 80 characters. This line is 93 characters. 12
Lines should not be more than 80 characters. This line is 94 characters. 8
Lines should not be more than 80 characters. This line is 95 characters. 15
Lines should not be more than 80 characters. This line is 96 characters. 9
Lines should not be more than 80 characters. This line is 97 characters. 2
Lines should not be more than 80 characters. This line is 98 characters. 17
Lines should not be more than 80 characters. This line is 99 characters. 8
unexpected 'in' 1


4. Other Checks

Details of other checks (click to open)

✖️ The following 10 function names are duplicated in other packages:

    • as_character from metan, radiant.data, retroharmonize, sjlabelled
    • as_numeric from descstat, metan, qdapRegex, radiant.data, retroharmonize, sjlabelled, zenplots
    • describe from AzureVision, Bolstad2, describer, dlookr, explore, Hmisc, iBreakDown, ingredients, lambda.r, MSbox, onewaytests, prettyR, psych, psyntur, questionr, radiant.data, RCPA3, Rlab, scan, scorecard, sylly, tidycomm
    • description from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
    • get_bibentry from eurostat
    • identifier from Ramble
    • is.defined from nonmemica
    • language from sylly, wakefield
    • provenance from provenance
    • subject from DGM, emayili, gmailr, sendgridr


Package Versions

package version
pkgstats 0.2.0.54
pkgcheck 0.1.2.126


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@maurolepore
Member

maurolepore commented May 12, 2025

Dear @antaldaniel

Thanks again for your submission. Here's my review. Before you proceed please wait to hear back from me. I need to first discuss with the editorial board. I'll aim to share next steps this week.

I see there is a lot of love and effort invested in this package. With such a large investment, I anticipate some of my feedback will likely feel uncomfortable. I appreciate your willingness to receive feedback. Please read it imagining a friendly tone and the best of intentions. We're all here helping each other kindly to make research software better.

Thanks for sharing your work with rOpenSci.

Semantic comment-tags

I tag my comments (ml01, ml02, and so on) to help track them. Please respond in sequence, using the corresponding tag:

  • A comment starting with a checkbox requires action and I expect a response.
  • A comment starting with a bullet is something to consider but does not require action. You may respond but I don't expect it.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Briefly describe any working relationship you have (had) with the package authors.

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need: clearly stating problems the software is designed to solve and its target audience in README

  • Installation instructions: for the development version of package and any non-standard dependencies in README

    • ml01: The installation instructions mention {remotes} but use {pak}:

    The latest development version of dataset can be installed with remotes::install_github():

    install.packages("pak")
    pak::pak("dataobservatory-eu/dataset")
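    A minimal sketch of instructions where the prose and the code agree (assuming you keep {pak}; the {remotes} variant is shown only for comparison):

    # If the code stays as-is, word the text around {pak}:
    install.packages("pak")
    pak::pak("dataobservatory-eu/dataset")

    # If the text stays as-is, switch the code to {remotes}:
    install.packages("remotes")
    remotes::install_github("dataobservatory-eu/dataset")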
  • Vignette(s): demonstrating major functionality that runs successfully locally

    README (not a vignette but I'll cover it here)

    • ml02: The print output of orange_df shows that column names are somewhat distant from the column values.

    • ml03: I'm struggling to understand the meaning of the information in angle brackets <hvn_lbl_>. I'm sure it'll become clear later in this or other docs, but it would be nice to have a clear "first impression". It would be nice to compare how the same dataset prints as a tibble, to more clearly show the richness that {dataset} adds.
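    A quick way to show that contrast (a sketch; it assumes orange_df ships with the package and that {tibble} is installed):

    library(dataset)
    library(tibble)

    orange_df                               # rich print method, with <hvn_lbl_> column types
    as_tibble(as.data.frame(orange_df))     # plain tibble print of the same data, for comparison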

    • ml04: I see an explicit call to print() in print(get_bibentry()). It seems unnecessary.

    • ml05: The text above plot(orange_df) says "Print title and labels" but I see no effect reflected in the plot.

    • ml06: Please review the structure of README. I struggle to follow the mix between sections with and without code. Consider grouping all the code-sections and all the text-sections. A structure that commonly works is an intro at the top including the goal and need, then installation instructions, minimal code examples, and finally the rest of the text-sections. Best to minimize duplication. Links to vignettes seem unnecessary since the website shows the Articles tab prominently. Contributing, code of conduct, and related work are OK and common.

    VIGNETTES

    • ml07: Please reconsider the need for the Articles introduction page. The default drop-down menu that pkgdown provides is common and enough. More text adds maintenance cost and may overwhelm the reader. The vignette titles should be enough information for the reader to decide if they want to navigate to that article, and then the article's introduction would do the rest -- likely in the first 1-3 sentences. An alternative to avoid "throwing away" the introduction to each article while keeping the website clean is to move it under a link "More articles...", e.g.:

    [screenshot: a "More articles..." drop-down on a pkgdown site]

    Here I see the sentence "The Motivation article will replace the New requirements". Consider removing this sentence, removing the article "New Requirements", and maybe adding a bullet in NEWS.md saying "New article 'Motivation' replaces the old article 'New Requirements'".

    • ml08: The article Motivation duplicates information in README. What's in README seems enough. Consider removing the article, or replacing the text in README with a link to the article.

    • ml09: The article Create Datasets that are Easy to Share Exchange and Extend seems equivalent to the common "Get started". If you rename the file to dataset.Rmd (i.e. a vignette named after the package), pkgdown should automatically add a "Get started" link to the navbar.
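    For example (a sketch; the current file name below is hypothetical):

    # pkgdown treats vignettes/<package-name>.Rmd as the "Get started" page,
    # so renaming the vignette after the package is enough:
    file.rename(
      from = "vignettes/create-datasets.Rmd", # hypothetical current name
      to   = "vignettes/dataset.Rmd"          # matches the package name
    )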

    • ml10: Consider consolidating all other articles. Ideally this should result in fewer, more focused articles. Also consider moving the less important articles to "More articles..." (see above) to reduce the cognitive load required to have a first glance at the website. A smooth experience on the website suggests a similarly smooth experience with the code.

  • Function Documentation: for all exported functions

    • ml11: Please review the Reference index.
      The grouping does not seem cohesive. For example, to me it makes sense that the "Dataset" group includes functions with "dataset" in their name, e.g. dataset_df() and as_dataset_df(). But the fit of the function get_bibentry() in this group is unclear to me. Instead, a function called get_bibentry() seems to fit more naturally in the group of "Bibliography functions". Consider reviewing the groups and evaluating whether you need different groups, or different function names that more clearly reflect their membership in a group.

    • ml12: Please document the @return value of bin_defined_rows(). See these recent pkgcheck results.
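    A minimal sketch of the missing tag (the wording below is an assumption; please describe what bin_defined_rows() actually returns):

    #' @return Describe the returned object here, e.g. "A defined vector with
    #'   the rows grouped into bins" (placeholder wording, not checked against
    #'   the implementation).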

  • Examples: (that run successfully locally) for all exported functions

    • ml13: Please fix this failing example (but if the point is to show an error you may use try()):
    devtools::run_examples()
    #> ℹ Updating dataset documentation
    #> ℹ Loading dataset
    #> ── Running 36 example files ───────────────────────────────────────── dataset ──
    #> ℹ Loading dataset
    #>
    #> > ### Name: as.character.haven_labelled_defined
    #> > ### Title: Coerce to character vector
    #> > ### Aliases: as.character.haven_labelled_defined as_character
    #> > ###  as_character.haven_labelled_defined
    #>
    #> > ###  Examples
    #> >
    #> > x <- defined(c("a", "b", "c"), label = "Letter code")
    #>
    #> > as_character(x)
    #> [1] "a" "b" "c"
    #>
    #> > y <- defined(1:3, label = "Index")
    #>
    #> > as_character(y)
    #> Error in as_character.haven_labelled_defined(y): as_character(): underlying data is not a character vector.
    #> ℹ Loading dataset

    That failing example is likely what's causing R CMD check to fail on this recent run of pkgcheck.
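    If the error is intentional, a sketch of the wrapped example (using the objects from the output above):

    y <- defined(1:3, label = "Index")
    try(as_character(y)) # prints the error without failing R CMD check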

    • ml14: Please review all examples and ensure the website shows the most relevant output. The goal is for a potential user to understand each help file before they install the package and run the code. The reprex package gives good advice:

    The ... code ... should be easy for others to digest, so they don't necessarily have to run it.

    Here are a few examples so you get the idea:

    • ?orange_df lacks an examples section showing the dataset itself. This would be useful:
    library(dataset)
    
    orange_df
    #> Draper N, Smith H (1998). "Growth of Orange Trees."
    #>   rowid  tree   age circumference
    #>                        The age of circumference
    #>  <hvn_lbl_> <hvn_lbl_> <hvn_lbl_> <hvn_lbl_>
    #> 1 orange:1   2 [1]     118            30
    #> 2 orange:2   2 [1]     484            58
    #> 3 orange:3   2 [1]     664            87
    #> 4 orange:4   2 [1]    1004           115
    #> 5 orange:5   2 [1]    1231           120
    #> 6 orange:6   2 [1]    1372           142
    #> 7 orange:7   2 [1]    1582           145
    #> 8 orange:8   4 [2]     118            33
    #> 9 orange:9   4 [2]     484            69
    #> 10 orange:10  4 [2]     664           111
    #> # ℹ 25 more rows
    library(dataset)
    
    my_dataset <- dataset_df(
      country_name = defined(
        c("AD", "LI"),
        definition = "http://data.europa.eu/bna/c_6c2bb82d",
        namespace = "https://www.geonames.org/countries/$1/"
      ),
      gdp = defined(
        c(3897, 7365),
        label = "Gross Domestic Product",
        unit = "million dollars",
        definition = "http://data.europa.eu/83i/aa/GDP"
      )
    )
    
    print(my_dataset)
    #> Unknown A (:tba). "Untitled Dataset."
    #>   rowid  country_name gdp
    #>                        Gross Dome
    #>  <hvn_lbl_> <hvn_lbl_>   <hvn_lbl_>
    #> 1 eg:1       AD         3897
    #> 2 eg:2       LI         7365
    
    is.dataset_df(my_dataset)
    #> [1] TRUE

    This would be useful:

    str(my_dataset$country_name)
    #>  hvn_lbl_ [1:2] AD, LI
    #>  @ definition: chr "http://data.europa.eu/bna/c_6c2bb82d"
    #>  @ namespace : chr "https://www.geonames.org/countries/$1/"
    str(my_dataset$gdp)
    #>  hvn_lbl_ [1:2] 3897 (million dollars), 7365 (million dollars)
    #>  @ label   : chr "Gross Domestic Product"
    #>  @ unit    : chr "million dollars"
    #>  @ definition: chr "http://data.europa.eu/83i/aa/GDP"
    
    • ?get_bibentry() doesn't show its output, either before or after calling set_bibentry(), so we can't see the effect.
  • Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.

    • ml15: Great. On a fresh Ubuntu system with R 4.5.0, the package installs smoothly from CRAN and GitHub (using pak()).
  • Functionality: Any functional claims of the software have been confirmed.

    • ml16: The functions I tried worked fine and the tests cover a lot of ground. But much of my time was spent on other aspects of this review, so I can't claim I did this sufficiently.
  • Performance: Any performance claims of the software have been confirmed.

    • ml17: I see no performance claims in the computational sense.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

    • ml18: In general, test coverage looks great, but some specific files with lots of logic may need attention:
    #> R/var_label.R: 35.71%
    #> R/xsd_convert.R: 46.15%
    #> R/agent.R: 47.62%
    #> R/id_to_column.R: 51.72%
    #> R/get_bibentry.R: 71.43%
    #> R/as_dublincore.R: 73.61%
    #> R/publisher.R: 75.00%
    #> R/dataset_df.R: 75.54%
    #> R/utils-idcol_find.R: 76.92%
    #> R/identifier.R: 78.05%
    #> R/dataset_title.R: 78.95%
    

    This file has a lot of logic and seems straightforward to test:

    [screenshot of the file's source]

    • ml19: Consider making the test titles more specific, e.g.:
    library(testthat)
    devtools::load_all()
    #> ℹ Loading dataset
    
    # Better
    # More specific. The data is minimal
    test_that("creates objects of S3 class 'dataset_df'", {
      expect_s3_class(as_dataset_df(data.frame()), "dataset_df")
    
      expect_true(inherits(as_dataset_df(data.frame()), "dataset_df"))
      expect_false(inherits(data.frame(), "dataset_df"))
    })
    #> Test passed 🎊
    
    # Can improve
    # Not specific. The specific data is irrelevant to this test
    # https://github.com/dataobservatory-eu/dataset/blob/8768983c36a6f6ee2aadf7637dcf5f0f88ab1e3e/tests/testthat/test-dataset_df.R#L144
    test_that("as_dataset_df() works", {
      expect_s3_class(as_dataset_df(iris), "dataset_df")
      expect_false(is.dataset_df(mtcars))
    })
    #> Test passed 🥳

    This makes it easier to understand what the problem is when a test fails. Compare:

    # Too vague
    ── Failure (test-dataset_df.R:146:3): as_dataset_df() works ───────────────────────────────────────────────────────
    is.dataset_df(mtcars) is not TRUE
    
    `actual`:    FALSE
    `expected`: TRUE
    
    
    # Shows intent
    ── Failure (test-dataset_df.R:151:3): as_dataset_df() creates objects of S3 class 'dataset_df' ────────────────────
    is.dataset_df(mtcars) is not TRUE
    
    `actual`:    FALSE
    `expected`: TRUE
    [ FAIL 2 | WARN 0 | SKIP 0 | PASS 32 ]
    
    • ml20: In calls to expect_error() please ensure you test not just for any error but for the specific one you want to test. See the arguments regexp and class of expect_error(), or consider using expect_snapshot(error = TRUE) (or friends) (example).
    test_that("new_labelled_defined() throws errors", {
      expect_error(defined(
        x = c(1:3),
        label = c("Numbers", "Numbers"),
        definition = 1,
        namespace = "Iris"
      ))
      expect_error(defined(
        x = c(1:3),
        label = "Numbers",
        definition = 1,
        namespace = "Iris"
      ))
      # ... and more
    })
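    A sketch of a more specific assertion (the matched text "label" and the condition class below are assumptions, not the package's actual error):

    test_that("defined() rejects a label of the wrong length", {
      expect_error(
        defined(
          x = c(1:3),
          label = c("Numbers", "Numbers"),
          definition = 1,
          namespace = "Iris"
        ),
        regexp = "label" # or: class = "dataset_error" (hypothetical class)
      )
    })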
    • ml21: Consider removing the name of the function under test from the test title when the test file contains the name of that function (which should cover most cases if you follow the most common naming conventions) (example).
  # The file name already contains the name of the function: test-xsd_convert.R
── Failure (test-xsd_convert.R:2:3): works ────────────────────────────────────────────────────────────────────────
  xsd_convert(TRUE) (`actual`) not equal to "fail for demo" (`expected`).
  
  `actual`:    ""TRUE"^^<xs:boolean>""
  `expected`: "fail for demo"
 
  # The function name is duplicated information 
 ── Failure (test-xsd_convert.R:6:3): xsd_convert(x, idcols) works ────────────────────────────────────────────────
  xsd_convert(TRUE) (`actual`) not equal to "fail for demo" (`expected`).
  
  `actual`:    ""TRUE"^^<xs:boolean>""
  `expected`: "fail for demo"
  [ FAIL 2 | WARN 0 | SKIP 0 | PASS 0 ]
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

  # https://github.com/dataobservatory-eu/dataset/blob/main/R/dataset_df.R
  
  #' Create a new dataset_df object
  #'
  #' The `dataset_df` constructor creates objects of this class, which
  #' are semantically rich, modern data frames inherited from
  #' `\link[tibble:tibble]{tibble::tibble}`.
  #'
  #' @details
  #' To check if an object has the class `dataset_df` use `is.dataset_df`.
  #'
  #' `print` is the method to print out the semantically rich data frames
  #' created with the constructor of `dataset_df`.
  #'
  #' `summary` is the method to summarise these semantically rich data frames.
  #'
  #' For more details, please check the `vignette("dataset_df", package = "dataset")`
  #' vignette.
  #'
  #' @param identifier Defaults to `c(eg="http://example.com/dataset#")`, which should be
  #' changed to the permanent identifier of the dataset. For example, if your dataset will be
  #' released with the Digital Object Identifier (DOI) `https;//doi.org/1234`, you should use
  #' a short prefixed identifier like `c(obs="https://doi.org/1234#")`, which will resolve
  #' to the rows being identified as https://doi.org/1234#1...https://doi.org/1234#n.
  #' @param dataset_bibentry A list of bibliographic references and descriptive metadata
  #' about the dataset as a whole created with [`datacite()`] or
  #' [`dublincore()`].
  #' @param var_labels The long, human readable labels of each variable.
  #' @param units The units of measurement for the measured variables.
  #' @param definitions The linked definitions of the variables, attributes, or constants.
  #' @param dataset_subject The subject of the dataset, see [`subject()`].
  #' @param ... The vectors (variables) that should be included in the dataset.
  #' @param x A `dataset_df` object for S3 methods.
  #' @param df A `data.frame` to be converted to `dataset_df`.
  #' @return `dataset_df` is the constructor of this type, it returns an object
  #' inherited from a data frame with semantically rich metadata.
  #' @import vctrs
  #' @import pillar
  #' @examples
  #' my_dataset <- dataset_df(
  #'  country_name = defined(
  #'    c("AD", "LI"),
  #'    definition = "http://data.europa.eu/bna/c_6c2bb82d",
  #'    namespace = "https://www.geonames.org/countries/$1/"
  #'  ),
  #'  gdp = defined(
  #'    c(3897, 7365),
  #'    label = "Gross Domestic Product",
  #'    unit = "million dollars",
  #'    definition = "http://data.europa.eu/83i/aa/GDP"
  #'  )
  #' )
  #'
  #' print(my_dataset)
  #'
  #' is.dataset_df(my_dataset)
  #' @export
  • ml23: Please re-run the styler:

    styler::style_pkg()
    #> Styling  79  files:
    ...
    #>  R/datacite.R                                    ✔
    #>  R/dataset_df.R                                  ℹ
    #>  R/dataset_title.R                               ✔
    #>  R/dataset_to_triples.R                          ✔
    #>  R/dataset-iris_dataset.R                        ✔
    #>  R/dataset-orange_df.R                           ✔
    #>  R/defined.R                                     ℹ
    #>  R/describe.R                                    ✔
    #>  R/description.R                                 ✔
    #>  R/dublincore.R                                  ℹ
    #>  R/fix_contributor.R                             ℹ
    #>  R/fix_publisher.R                               ✔
    #>  R/geolocation.R                                 ✔
    ...
    #>  tests/testthat/test-dataset_title.R            ✔
    #>  tests/testthat/test-dataset_to_triples.R       ✔
    #>  tests/testthat/test-defined.R                  ℹ
    #>  tests/testthat/test-describe.R                 ✔
    #>  tests/testthat/test-description.R              ✔
    #>  tests/testthat/test-dublincore.R               ℹ
    #>  tests/testthat/test-fix_contributor.R          ℹ
    #>  tests/testthat/test-fix_publisher.R            ℹ
    #>  tests/testthat/test-geolocation.R              ✔
    ...
    #>  vignettes/dataset_df.Rmd                        ✔
    #>  vignettes/defined.Rmd
    #> Warning: When processing vignettes/defined.Rmd: ℹ In index: 7.
    #> Caused by error in `parse_safely()`:
    #> ! <text>:2:7: unexpected 'in'
    #> 1: c(gdp_1, gdp_2)
    #> 2: Error in
    #>        ^
    #> ⚠
    #>  vignettes/Motivation.Rmd                       ✔
    ...
    #> ────────────────────────────────────────
    #> Status    Count   Legend
    #> ✔      70   File unchanged.
    #> ℹ       8   File changed.
    #> ✖       1   Styling threw an error.
    #> ────────────────────────────────────────
    #> Please review the changes carefully!


Estimated hours spent reviewing: 6

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

    • ml24: No need to acknowledge in DESCRIPTION. Your kind, thoughtful responses will be sufficient acknowledgement for my work. Thanks!

Review Comments

  • ml25: I see ?iris_dataset (source) and many hits for "iris", but I understand further usage of this dataset is regarded as unacceptable (@mpadge ropensci/dev_guide#868 (comment)):

➜  dataset git:(main) git pull
Already up to date.

➜ grep iris -R | wc -l
385

➜  dataset git:(main) cat R/dataset-iris_dataset.R
#' @title Edgar Anderson's Iris Data
#'
#' @description This famous (Fisher's or Anderson's) iris data set gives the measurements in
#' centimetres of the variables sepal length and width and petal length and
#' width, respectively, for 50 flowers from each of 3 species of iris.
#' The species are \emph{Iris setosa}, \emph{versicolor}, and \emph{virginica}.
#' This is a replication of \code{datasets::\link[datasets:iris]{iris}} as
#' \emph{dataset} s3 class.
#'
#' @details See \code{datasets::\link[datasets:iris]{iris}} for details.
#'
#' @format
#' iris is a data frame with 150 cases (rows) and 6 variables (columns) named
#' rowid, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
#' @references
#' Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language.
#' Wadsworth & Brooks/Cole.
#' @source
#' Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems.
#' Annals of Eugenics, \strong{7}, Part II, p179–188.
#'
#' The data were collected by Anderson, Edgar (1935).
#' The irises of the Gaspe Peninsula,
#' Bulletin of the American Iris Society, \strong{59},
#' \href{https://www.biodiversitylibrary.org/item/270486#page/344/mode/1up}{2–5}.
"iris_dataset"

This is surprising to me because this comment suggests it was removed:

So, upon your request, I removed the iris dataset from all vignettes and examples. It uses now the Orange dataset.

Did you remove it from the default main branch (which is the one I'm reviewing)?

  • ml26: Please review the most recent output of pkgcheck. Ensure to expand the drop-downs. There is a lot of useful information. Some things that stand out to me:
  • The two issues marked with ✖️.
  • Some packages in Imports that might be unused (at least in R/). Maybe they can move to Suggests? (A sketch of the usual pattern follows this list.)
  • The number of vignettes.
  • Some lint issues, particularly the few other than line-length.
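  A common pattern when a package moves from Imports to Suggests (a sketch; {rdflib} is only an illustration):

  # Inside any function that needs the suggested package:
  if (!requireNamespace("rdflib", quietly = TRUE)) {
    stop("Please install the 'rdflib' package to use this functionality.",
         call. = FALSE)
  }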

@maurolepore
Member

@ropensci-review-bot submit review #681 (comment) time 6

@ropensci-review-bot
Collaborator

Logged review for maurolepore (hours: 6)

@antaldaniel
Author

@maurolepore Thank you very much for the extensive comments. I will start from here, because I was a bit concerned that you were perhaps seeing a wrong version, as all examples and checks ran smoothly for me:

This is surprising to me because this #681 (comment) suggests it was removed:
So, upon your request, I removed the iris dataset from all vignettes and examples. It uses now the Orange dataset.
Did you remove it from the default main branch (which is the one I'm reviewing)?

... indeed, you should not encounter the iris dataset. The agreement was that the dataset would be hidden (after all, it is part of base R, and about 200 unit tests use it) but removed from the examples and vignettes. As far as I can see, the iris dataset now appears only once in the documentation and in the unit tests.

I'll write out all your comments as issues and take a serious look, because when I signalled that I thought all issues were resolved (at least from the previous two reviews), I certainly did not see the iris dataset or any failing examples or tests. As originally suggested, I will work on this during the week and try to implement all your suggestions.

The original rOpenSci guidelines, when I first submitted this package for review, were to submit preferably at a conceptual stage, before the first CRAN release, and that was my intention. So I understand that there are a lot of unpolished solutions present, but the idea was to create a tool that can receive semantically rich datasets into the R ecosystem and give them back to such systems, mainly the APIs that use SDMX, i.e., the statistical data and metadata exchange standard.

@maurolepore
Member

maurolepore commented May 16, 2025

Thanks @antaldaniel for the clarifications. The editorial board also confirmed that the priority is to remove iris from public-facing documentation, but it's fine to keep it in tests.

For the record, we typically do two reviews before the author makes changes but we accept exceptions and we agree this is one, so please do go ahead with your changes as planned.

For changes to the documentation it can really help to get an external reviewer. As someone who has written some long pieces of work, I have learned to appreciate the value of fresh eyes. External reviewers (including AI) can typically reword and summarize text much more easily than the first author. Professional writers do it, and this work seems very professional to me.

For my own writing I find this short paper very helpful: The science of scientific writing. It really applies to writing anything, not just science. It shows how to structure text to maximize the chances that the reader will interpret it as the writer intended. Interestingly, that structure may sometimes not be the most natural one, yet it is the most effective.

Funny enough I learned about the main author, George Gopen, from reading R for Data Science:

To improve your writing, we highly recommend ... The Sense of Structure: Writing from the Reader’s Perspective by George Gopen. ... George Gopen also has a number of short articles on writing at https://www.georgegopen.com/litigation-articles.html. They are aimed at lawyers, but almost everything applies to data scientists too.

@antaldaniel
Author

@maurolepore Thank you for your patience. Regarding these errors, I made a very clumsy mistake: I work on several computers, and the version on the main branch was not the one that should have been available to you. I am quite certain there are no more visible iris references and no failing examples, and where you suggested it, I also extensively improved the tests; I now have ~400 unit tests. I will turn to the revisions of the documentation that you suggested, which is a somewhat bigger task than I thought, but I will try my best, and since I am already investing in this, I will also consider trying to create a publication. I think the idea that the R system should be able to integrate more with the semantic web and the infrastructure of statistical exchange has its merits, even if this package is still at an early stage. I will let you know as soon as I am finished with the rewrites.

@mpadge
Member

mpadge commented May 23, 2025

@maurolepore Just noting that this issue pops up on our dashboard as being labelled with multiple stages. Can you please rectify that? thanks 👍

@ropensci-review-bot
Collaborator

@antaldaniel: please post your response with @ropensci-review-bot submit response <url to issue comment> if you haven't done so already (this is an automatic reminder).

Here's the author guide for response. https://devguide.ropensci.org/authors-guide.html
