
dataset: Create Data Frames that are Easier to Exchange and Reuse #681

Open · 13 of 24 tasks
antaldaniel opened this issue Jan 21, 2025 · 55 comments

@antaldaniel (Author) commented Jan 21, 2025

Submitting Author Name: Daniel Antal
Submitting Author Github Handle: @antaldaniel
Repository: https://github.com/dataobservatory-eu/dataset
Version submitted: 0.3.4002
Submission type: Standard
Editor: @maurolepore
Reviewers: @maurolepore

Archive: TBD
Version accepted: TBD
Language: en


  • Paste the full DESCRIPTION file inside a code block below:
Package: dataset
Title: Create Data Frames that are Easier to Exchange and Reuse
Version: 0.3.4002
Date: 2024-12-26
DOI: 10.32614/CRAN.package.dataset
Language: en-GB
Authors@R: 
    c(person(given = "Daniel", family = "Antal", 
           email = "daniel.antal@dataobservatory.eu", 
           role = c("aut", "cre"),
           comment = c(ORCID = "0000-0001-7513-6760")
           ), 
      person(given = "Marcelo", family =  "Perlin", 
             role = c("rev"), 
             comment = c(ORCID = "0000-0002-9839-4268")
             )
      )
Maintainer: Daniel Antal <daniel.antal@dataobservatory.eu>
Description: The aim of the 'dataset' package is to make tidy datasets easier to release, 
    exchange and reuse. It organizes and formats data frame 'R' objects into well-referenced, 
    well-described, interoperable datasets into release and reuse ready form.
License: GPL (>= 3)
Encoding: UTF-8
URL: https://dataset.dataobservatory.eu/
BugReports: https://github.com/dataobservatory-eu/dataset/issues/
Roxygen: list(markdown = TRUE)
LazyData: true
Imports: 
    assertthat,
    cli,
    haven,
    ISOcodes,
    labelled,
    methods,
    pillar,
    RefManageR,
    rlang,
    tibble,
    utils,
    vctrs (>= 0.5.2)
RoxygenNote: 7.3.2
Suggests: 
    knitr,
    rdflib,
    rmarkdown,
    spelling,
    testthat (>= 3.0.0)
Config/testthat/edition: 3
Depends: 
    R (>= 3.5)
VignetteBuilder: knitr

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • data munging
    • data deposition
    • data validation and testing
    • workflow automation
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • field and lab reproducibility tools
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package works with various semantic interoperability standards. It allows users to retrieve RDF-annotated, rich, platform-independent data and reconstruct it as an R data.frame with rich metadata attributes, or to release interoperable, RDF-annotated datasets on linked open data platforms from native R objects.

  • Who is the target audience and what are scientific applications of this package?

Production-side statisticians. Scientists who want to update their sources from various data repositories and exchanges. Scientists and research data managers who want to release new scientific or professional datasets that follow modern interoperability standards.

The package aims to complement the rdflib and dataspice packages.

Technical checks

Confirm each of the following by checking the box.

This package:

Publication options

  • Do you intend for this package to go on CRAN?

  • Do you intend for this package to go on Bioconductor?

  • Do you wish to submit an Applications Article about your package to Methods in Ecology and Evolution? If so:

Code of conduct

@ropensci-review-bot (Collaborator)

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.


@maelle (Member) commented Jan 21, 2025

@antaldaniel I am so sorry, you didn't use the right template... because we broke the right template last time we edited it! I put it back in shape, thanks for helping us catch it. You can find it by opening a new issue (it'll be at the top of the list) or use it from here https://github.com/ropensci/software-review/blob/main/.github/ISSUE_TEMPLATE/A-submit-software-for-review.md

So sorry, thanks for your patience.

@ropensci-review-bot (Collaborator)

Checks for dataset (v0.3.4002)

git hash: 7bf85ac7

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✖️ Package coverage failed
  • ✖️ R CMD check process failed with message: 'Build process failed'.
  • 👀 Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with 👀 may be optionally addressed.)

Package License: GPL (>= 3)


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 281
internal dataset 215
internal graphics 11
internal stats 2
imports assertthat 29
imports utils 8
imports labelled 5
imports rlang 2
imports cli 1
imports haven 1
imports tibble 1
imports ISOcodes NA
imports methods NA
imports pillar NA
imports RefManageR NA
imports vctrs NA
suggests knitr NA
suggests rdflib NA
suggests rmarkdown NA
suggests spelling NA
suggests testthat NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

ifelse (41), as.character (40), is.null (39), list (20), c (14), vapply (8), lapply (7), names (7), data.frame (6), logical (6), paste (6), paste0 (6), character (5), inherits (5), which (5), contributors (4), date (4), seq_along (4), substr (4), Sys.time (4), for (3), format (3), invisible (3), length (3), t (3), with (3), all (2), attr (2), class (2), drop (2), gsub (2), labels (2), nrow (2), args (1), as.data.frame (1), as.Date (1), as.POSIXct (1), cbind (1), do.call (1), double (1), if (1), nchar (1), ncol (1), rbind (1), Sys.Date (1), tolower (1), vector (1)

dataset

get_bibentry (26), creator (11), dataset_title (11), subject (10), publisher (7), rights (7), get_creator (6), identifier (6), description (5), language (5), new_Subject (5), provenance (5), dataset_df (4), get_publisher (4), get_type (4), agent (3), convert_column (3), n_triple (3), publication_year (3), var_definition (3), var_namespace (3), var_unit (3), as_dataset_df (2), as_dublincore (2), datacite (2), default_provenance (2), definition_attribute (2), geolocation (2), get_author (2), get_person_iri (2), idcol_find (2), is_person (2), is.dataset_df (2), n_triples (2), namespace_attribute (2), new_my_tibble (2), prov_author (2), unit_attribute (2), as_character (1), as_character.haven_labelled_defined (1), as_datacite (1), as_numeric (1), as_numeric.haven_labelled_defined (1), as.character.haven_labelled_defined (1), create_iri (1), dataset_to_triples (1), defined (1), describe (1), dublincore (1), dublincore_to_triples (1), fix_contributor (1), fix_publisher (1), get_definition_attribute (1), get_namespace_attribute (1), get_unit_attribute (1), id_to_column (1), is_dataset_df (1), is_doi (1), is.datacite (1), is.datacite.datacite (1), is.defined (1), is.dublincore (1), is.dublincore.dublincore (1), is.subject (1), label_attribute (1), names.dataset_df (1), new_datacite (1), new_datetime_defined (1), new_dublincore (1), new_labelled_defined (1), print.dataset_df (1), set_default_bibentry (1), set_definition_attribute (1), set_namespace_attribute (1), set_unit_attribute (1), set_var_labels (1), subject_create (1), summary.dataset_df (1), summary.haven_labelled_defined (1), tbl_sum.dataset_df (1), var_definition.default (1), var_label.dataset_df (1), var_label.defined (1), var_namespace.default (1)

assertthat

assert_that (29)

graphics

title (11)

utils

person (5), bibentry (2), citation (1)

labelled

var_label (4), to_labelled (1)

rlang

caller_env (1), env_is_user_facing (1)

stats

df (1), family (1)

cli

cat_line (1)

haven

labelled (1)

tibble

new_tibble (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 29 files) and
  • 1 authors
  • 5 vignettes
  • 1 internal data file
  • 12 imported packages
  • 89 exported functions (median 5 lines of code)
  • 153 non-exported functions in R (median 8 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 29 88.1
files_vignettes 5 95.4
files_tests 29 97.1
loc_R 1731 79.0
loc_vignettes 521 77.3
loc_tests 717 77.6
num_vignettes 5 96.4 TRUE
data_size_total 2480 62.3
data_size_median 2480 69.0
n_fns_r 242 91.0
n_fns_r_exported 89 94.0
n_fns_r_not_exported 153 89.0
n_fns_per_file_r 5 71.9
num_params_per_fn 2 8.2
loc_per_fn_r 6 13.0
loc_per_fn_r_exp 5 8.5
loc_per_fn_r_not_exp 8 22.9
rel_whitespace_R 25 85.1
rel_whitespace_vignettes 37 81.2
rel_whitespace_tests 23 79.5
doclines_per_fn_exp 52 64.9
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 155 84.7

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

rhub.yaml


3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following error:

  1. Error in proc$get_built_file() : Build process failed

R CMD check generated the following check_fails:

  1. no_description_date
  2. no_import_package_as_a_whole

Test coverage with covr

ERROR: Test Coverage Failed

Cyclocomplexity with cyclocomp

Error : Build failed, unknown error, standard output:

  • checking for file ‘dataset/DESCRIPTION’ ... OK
  • preparing ‘dataset’:
  • checking DESCRIPTION meta-information ... OK
  • installing the package to build vignettes
  • creating vignettes ... ERROR
    --- re-building ‘bibentry.Rmd’ using rmarkdown
    --- finished re-building ‘bibentry.Rmd’

--- re-building ‘dataset_df.Rmd’ using rmarkdown
--- finished re-building ‘dataset_df.Rmd’

--- re-building ‘defined.Rmd’ using rmarkdown
--- finished re-building ‘defined.Rmd’

--- re-building ‘new_requirements.Rmd’ using rmarkdown
--- finished re-building ‘new_requirements.Rmd’

--- re-building ‘rdf.Rmd’ using rmarkdown

Quitting from lines 106-108 [jsonld] (rdf.Rmd)
Error: processing vignette 'rdf.Rmd' failed with diagnostics:
please install the jsonld package to use this functionality.
--- failed re-building ‘rdf.Rmd’

SUMMARY: processing the following file failed:
‘rdf.Rmd’

Error: Vignette re-building failed.
Execution halted

Static code analyses with lintr

lintr found no issues with this package!


4. Other Checks

Details of other checks (click to open)

✖️ The following 10 function names are duplicated in other packages:

    • as_character from metan, radiant.data, retroharmonize, sjlabelled
    • as_numeric from descstat, metan, qdapRegex, radiant.data, retroharmonize, sjlabelled, zenplots
    • describe from AzureVision, Bolstad2, describer, dlookr, explore, Hmisc, iBreakDown, ingredients, lambda.r, MSbox, onewaytests, prettyR, psych, psych, psyntur, questionr, radiant.data, RCPA3, Rlab, scan, scorecard, sylly, tidycomm
    • description from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
    • get_bibentry from eurostat
    • identifier from Ramble
    • is.defined from nonmemica
    • language from sylly, wakefield
    • provenance from provenance
    • subject from DGM, emayili, gmailr, sendgridr


Package Versions

package version
pkgstats 0.2.0.48
pkgcheck 0.1.2.77


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@antaldaniel (Author)

@maelle let me know if this works now :)

@maelle (Member) commented Jan 21, 2025

Yes, thank you!

@maelle (Member) commented Jan 21, 2025

@ropensci-review-bot assign @maelle as editor

@ropensci-review-bot (Collaborator)

Assigned! @maelle is now the editor

@maelle (Member) commented Jan 21, 2025

Thanks again for your submission!

Editor checks:

  • Documentation: The package has sufficient documentation available online (README, pkgdown docs) to allow for an assessment of functionality and scope without installing the package. In particular,
    • Is the case for the package well made?
    • Is the reference index page clear (grouped by topic if necessary)?
    • Are vignettes readable, sufficiently detailed and not just perfunctory?
  • Fit: The package meets criteria for fit and overlap.
  • Installation instructions: Are installation instructions clear enough for human users?
  • Tests: If the package has some interactivity / HTTP / plot production etc. are the tests using state-of-the-art tooling?
  • Contributing information: Is the documentation for contribution clear enough e.g. tokens for tests, playgrounds?
  • License: The package has a CRAN or OSI accepted license.
  • Project management: Are the issue and PR trackers in a good shape, e.g. are there outstanding bugs, is it clear when feature requests are meant to be tackled?

Editor comments


Documentation

My main comment before I can proceed to looking for reviewers is that the case for the package could be made better.

On the one hand, it'd be interesting to read how dataset compares to other approaches to the same "problem", such as (if I follow correctly)

  • https://docs.ropensci.org/frictionless/
  • https://docs.ropensci.org/datapack/
  • https://datapackage.org/
  • https://www.researchobject.org/

On the other hand, how would a user take advantage of dataset?
To me, it is not clear yet from reading the docs.
Questions I wonder about:

  • As a data publisher, I create the dataset object, and then, how does it help me document it? How does it help me publish it on a repository?
  • When you mention standard statistical libraries in the README, could you name some?
  • As a data consumer, how do I create a dataset object (do I get it from an R package? is it shared in another way)? How can I easily use the information on units when exploring the data, or when plotting it?

In short, could you exemplify "release" and "re-use" as use cases, in one or more vignettes, potentially casting in the roles the types of users you mention in the submission under "target audience"?

For instance https://wbdataset.dataobservatory.eu/ is a good example, but it is only mentioned in a vignette.
More concrete information like wbdataset should make it to the README to make it clearer what dataset is about (and then be expanded in vignettes).

A tiny comment: I find "reuse" harder to parse than "re-use" but that might be a personal preference.

Installation instructions

I'd recommend documenting the two methods of installation (CRAN and GitHub) in distinct chunks so readers could copy-paste the entire code chunk of interest.

Instead of devtools you could recommend using pak.

install.packages("pak")
pak::pak("dataobservatory-eu/dataset")
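For instance, the README could offer the two options as separate, copy-pasteable chunks (a sketch; the package is published on CRAN as dataset):

# Install the released version from CRAN:
install.packages("dataset")

# Or install the development version from GitHub:
install.packages("pak")
pak::pak("dataobservatory-eu/dataset")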

Default git branch

You might want to rename the master branch to main as some people can be offended by the word "master", see https://www.tidyverse.org/blog/2021/10/renaming-default-branch/ that includes links to context, and practical advice on renaming the default branch.
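usethis has a helper for this (assuming a reasonably recent usethis; it renames the branch and updates the local git configuration and common files):

usethis::git_default_branch_rename()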

Contributing guide

The contributing guide does not seem customized.
It mentions a possible "src" folder which is not present.

Since you are looking for co-developers, and mentioned one of the articles could be relevant to potential contributors, I'd recommend having some text related to design and wishes for feedback in the contributing guide.

The contributing guide mentions "AppVeyor" which is not used any more as far as I can tell.

Continuous integration

  • If AppVeyor is not used any more, please remove the related contributing file.

  • The code coverage workflow seems not to be working: https://github.com/dataobservatory-eu/dataset/actions/workflows/test-coverage.yaml I'd recommend using the latest workflow file from r-lib/actions, by copy-pasting it or by running usethis::use_github_action("test-coverage").

  • The pkgdown website is not up to date, for instance on it the test coverage badge is not broken whereas it is in the README. Please add a workflow to continuously deploy it, for instance by running usethis::use_github_action("pkgdown").

  • If you no longer wish to use the R-CMD-check workflow because you rely on the R-hub ones, please remove the old workflow file.

  • The latest commits all have red crosses as status, which shows the continuous integration files need a bit of cleaning and tweaking. 🙂

Project management

From the open issues, which ones are meant to be tackled soon?
One of them has the "First CRAN release" milestone, which is outdated.

Code style

I'd recommend running styler (on R scripts including tests) to make spacing more consistent.

For instance in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L56C1-L57C55 the space before return_type is surprising.
I remember being inconsistent with spaces myself years ago and not noticing; (un)fortunately, I was converted. 😅

Code

The code could be simplified so that reviewers might more easily follow the logic.

is_person <- function(p) ifelse (inherits(p, "person"), TRUE, FALSE) in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/as_dublincore.R#L11
could be is_person <- function(p) inherits(p, "person") (thanks lintr for catching this).
That pattern comes up several times in the codebase (the relevant linter is "redundant_ifelse_linter").
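For reference, lintr can surface every occurrence locally (assuming lintr >= 3.0, which supports the text argument; lint_dir() scans whole directories):

# Check a single expression:
lintr::lint(
  text = 'is_person <- function(p) ifelse(inherits(p, "person"), TRUE, FALSE)',
  linters = lintr::redundant_ifelse_linter()
)

# Or flag the pattern across the package's R/ directory:
lintr::lint_dir("R", linters = lintr::redundant_ifelse_linter())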

Code like

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L4-L20

and

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L66

(and other similar pipelines) reminds me of Jenny Bryan's advice in her talk Code smells and feels:

"If your conditions deal with class, it's time to get object-oriented. In CS jargon, use polymorphisms."

So instead of the complex logic, you'd define methods.
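A minimal sketch of that idea (the generic and methods here are illustrative, not the package's actual API):

# One generic plus one method per class, instead of if/else chains on class():
agent_name <- function(x, ...) UseMethod("agent_name")

agent_name.person <- function(x, ...) paste(x$given, x$family)

agent_name.character <- function(x, ...) x

agent_name.default <- function(x, ...) {
  stop("No agent_name() method for class ", class(x)[1], call. = FALSE)
}

agent_name(person(given = "Daniel", family = "Antal")) # "Daniel Antal"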

In some files like R/xsd_convert.R and R/dataset_title.R, you use class(something) == or class(something) %in% instead of code built on the more correct inherits().
Using "proper functions for handling class & type" is another tip in the aforementioned talk. 😸

Since dataset imports rlang, you could use the %||% operator from rlang.
For instance https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L24

creators   <- ifelse(is.null(dataset_bibentry$author), ":tba", dataset_bibentry$author)

would become

creators   <- dataset_bibentry$author %||% ":tba"

There are many other occurrences of the ifelse(is.null( pattern (and variants with different spacing) that could get the same treatment.
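For context, the operator is a one-liner (rlang exports it, and base R ships it as well since R 4.4.0), so adopting the pattern is cheap:

`%||%` <- function(lhs, rhs) if (is.null(lhs)) rhs else lhs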

In the R/agent.R file, functions like get_creator() are defined twice, why?

Example dataset

The iris dataset is very well-known, but it is also infamous because of its eugenics links.
Since having a good example dataset is very important, would you consider replacing it with another one, like maybe the palmerpenguins one, even if it comes at the cost of adding a (possibly optional) dependency?

Tests

Should the line https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-agent.R#L9 be removed as it is not used in the test?

I don't understand why the iris object needs to be duplicated in lines like https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-creator.R#L23

expect_true(is.dataset_df might become a custom expectation and/or rely on expect_s3_class() instead. Same comment for expect_true(is.subject.
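A thin wrapper is enough to get a readable test with an informative failure message (a sketch; expect_dataset_df is a hypothetical name):

# Reports the actual class on failure, unlike expect_true()'s "isn't TRUE":
expect_dataset_df <- function(object) {
  testthat::expect_s3_class(object, "dataset_df")
}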

https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-datacite.R#L18

should be

expect_type(as_datacite(iris_dataset, "list"), "list")

What is https://github.com/dataobservatory-eu/dataset/blob/master/tests/testthat/test-dataset_prov.bak?

When using expect_error() as in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-dataset_title.R#L3 maybe add a pattern for the error message, just in case another error happens?
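For example (the call and message below are hypothetical; match whatever the function actually throws):

expect_error(
  dataset_title(not_a_dataset),   # hypothetical failing call
  regexp = "must be a dataset_df" # anchor the expected message
)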


Thank you! Happy to discuss any of the items.

maelle removed the holding label Jan 21, 2025

@maelle (Member) commented Jan 21, 2025

@ropensci-review-bot check package

@ropensci-review-bot (Collaborator)

Thanks, about to send the query.

@ropensci-review-bot (Collaborator)

🚀

Editor check started

👋

@ropensci-review-bot (Collaborator)

Checks for dataset (v0.3.4002), git hash 7bf85ac7: identical to the check results posted above.

@maelle (Member) commented Jan 21, 2025

Coverage is also something noted in my comments.

@antaldaniel (Author)

@maelle Thank you. It is unfortunate that you joined this review only after two years, because I find many of your comments very useful. However, I must say that some of them would be unusual to address in a vignette, and I would find an article format more useful, e.g., your question about frictionless, datapack, datapackage.org, and researchobject. About two years of research went into this package, which is not typical vignette material; reviewers of my other packages have disliked extended vignettes.

I think that the frictionless package family follows a very different approach, and when I started developing this package and this review, it did not even seem relevant. I now see that in some use cases both can be useful and a choice could be offered, so I will make that argument, but I think this belongs more in a paper considering statistical exchange formats and their best representation in R.

A small question: what is the package coverage you are aiming at? I think that the package already has very high coverage, and it exceeded the requirements when I started the review.

@maelle (Member) commented Jan 21, 2025

@antaldaniel thank you for your answer!

  • Use cases would be crucial to make the point of the package clearer. You have a vision that potential users need to understand. With concrete examples of usage, I think understanding what it is all about (and why a potential user should care) would be easier. A use case does not have to be really long, and it can contain diagrams, but it should help someone see what they could do with the package in practice. Even if, from the current README text, they might gather that it has the worthy goal of helping with FAIRness, the application of the package might remain nebulous.
  • I agree the docs do not need to contain an in-depth comparison with other approaches, but a small section "Related work" would be important. It'd allow users to quickly compare your package to others, and also would help them, again, get the point. "Oh, it's a bit like frictionless, it's not a new tibble or a new dataspice". In our dev guide that's "If applicable, how the package compares to other similar packages and/or how it relates to other packages." in https://devguide.ropensci.org/pkg_building.html#readme With workflow/standards packages like this one (as opposed to, say, a package that helps you get data from a specific API, where the goal is easier to grasp for a newcomer), introducing the package is obviously harder, but also more important.
  • For the coverage, the problem is not the number but the fact that no continuous integration workflow is updating the badge. I am aiming at a not unknown coverage in the README badge. 😉

@antaldaniel (Author)

Thank you @maelle, and I will look into why the coverage is not updating... However, do you have an explicit coverage target?

@maelle (Member) commented Jan 21, 2025

Yes it's 75% https://devguide.ropensci.org/pkg_building.html#testing -- But I'd recommend also looking at the coverage report to find the not covered lines and make a judgement call on how important/risky these lines are (vs how hard to test they'd be) just so you're sure there's nothing dangerous in the remaining 25%. 🙂 The idea really is to cover "key functionality" (phrasing from the dev guide).
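Locally, the uncovered lines are easy to inspect with covr, the same tool the rOpenSci checks use:

cov <- covr::package_coverage()
covr::percent_coverage(cov) # the overall figure
covr::report(cov)           # per-line HTML report, to find the gaps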

In my comments I recommend updating the test-coverage workflow, the fix might be as simple as that.

@antaldaniel (Author)

@maelle Thank you again for your useful comments and your PRs. I reorganised the issues and created a new milestone. The milestone currently breaks up your review comments into 7 issues, but they could of course be broken down into more. I set myself a deadline of 16 February to resolve them, though it may happen earlier. I will tag you in the issues when they are ready for review and will also add a new comment here.

@maelle (Member) commented Apr 22, 2025

@antaldaniel thank you! I am having trouble reading your comment, could you please format quotes as quotes i.e. using the ">" sign in front of the paragraph?

@antaldaniel (Author) commented Apr 24, 2025

> On the one hand, it'd be interesting to read how dataset compares to other approaches to the same "problem", such as (if I follow correctly)
> https://docs.ropensci.org/frictionless/
> https://docs.ropensci.org/datapack/
> https://datapackage.org/
> https://www.researchobject.org/
> On the other hand, how would a user take advantage of dataset?
> To me, it is not clear yet from reading the docs.
> Questions I wonder about:
> As a data publisher, I create the dataset object, and then, how does it help me document it? How does it help me publish it on a repository?

This is now explained in detail in the new vignette.

Repositories like Zenodo apply at most, and usually with limitations, the 5-star FAIR data model, which helps with finding data but does not make the data readily reusable. We apply the 8-star model, which means that we attach metadata about the contents of the data, not only about the file. On Zenodo, you can upload three different datasets about GDP per capita in countries: one in dollars, another in euros, and yet another in PPP-adjusted euros. These datasets are not very reusable, because if a user accidentally joins them, logical-semantic errors result; for example, euro values will be compared with or added to dollar values. Or, in the natural sciences, units of measure may mismatch.

Our datasets contain the machine-readable definition of each variable, its unit of measure, and other metadata. To my knowledge, apart from rdflib, no other R package helps with that. We designed a set of packages that work with rdflib.

> When you mention standard statistical libraries in the README, could you name some?

We aim for full compliance with the GSIM model, which partly overlaps with the Statistical Data and Metadata eXchange (SDMX), so our datasets can be exchanged, enriched, etc. with IMF, Eurostat, or French national statistical office data, and with DDI (the standard for microdata).

> As a data consumer, how do I create a dataset object (do I get it from an R package? shared in another way)? How can I easily use the information on units when exploring the data, when plotting it?

You use the dataset_df() constructor, which enriches a data.frame or tibble object. The print method uses this information, and I implemented a plot function, too (see the README).
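A rough sketch of that workflow (argument names follow the defined() and dataset_df() documentation; check the reference pages for the exact signatures):

library(dataset)

small_country_dataset <- dataset_df(
  country = defined(c("AD", "LI"), label = "Country code"),
  gdp = defined(c(3897, 7365), label = "GDP", unit = "million euros")
)

print(small_country_dataset) # the print method shows labels and units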

> In short, could you exemplify "release" and "re-use" in a vignette or more, as use cases, potentially using as roles the type of users you mention in the submission under "target audience".

Release means that you can create datasets that you can send to the EU Open Data Portal, which has far stricter requirements than Zenodo. It is meant for people who want to machine-exchange data with professional scientific or statistical data providers, even at the API level.

> For instance https://wbdataset.dataobservatory.eu/ is a good example, but it is mentioned in a vignette.
> More concrete information like wbdataset should make it to the README to make it clearer what dataset is about (and then be expanded in vignettes).

> Installation instructions
> I'd recommend documenting the two methods of installation (CRAN and GitHub) in distinct chunks so readers could copy-paste the entire code chunk of interest.

Done

> Contributing guide
> The contributing guide does not seem customized.
> It mentions a possible "src" folder which is not present. Since you are looking for co-developers, and mentioned one of the articles could be relevant to potential contributors, I'd recommend having some text related to design and wishes for feedback in the contributing guide.

The guide has been rewritten.

> The contributing guide mentions "AppVeyor" which is not used any more as far as I can tell. If AppVeyor is not used any more, please remove the related contributing file.

It is removed.

> Project management
> From the open issues, which ones are meant to be tackled soon?

All issues are tackled and await your review.

> Code style
> I'd recommend running styler (on R scripts including tests) to make spacing more consistent.

styler is used now with tidyverse settings.

> is_person <- function(p) ifelse (inherits(p, "person"), TRUE, FALSE) in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/as_dublincore.R#L11
> could be is_person <- function(p) inherits(p, "person") (thanks lintr for catching this).
> That pattern comes up several times in the codebase (the relevant linter is "redundant_ifelse_linter").

These issues are resolved.

I hope that all such cases are handled.

> Since dataset imports rlang, you could use the %||% operator from rlang.
> For instance https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/R/agent.R#L24

I prefer to remain as close to base R as possible and later remove some dependencies, possibly even rlang. So I prefer to use less of rlang, not more. However, for now I have created an internal function for %||%, and there are a few examples of its use.

> Example dataset
> The iris dataset is very well-known, but it is also infamous because of its eugenics links.
> Since having a good example dataset is very important, would you consider replacing it with another one, like maybe the palmerpenguins one, even if it comes at the cost of adding a (possibly optional) dependency?

I think the way rOpenSci handles this issue could be improved, because the dataset was never used in eugenics. I find boycotting an innocent author, i.e., Edgar Anderson, unethical. There are better practices for handling problematic provenance descriptions (because in this case it is the provenance description that is associated with eugenics, not the dataset or its author), and scientific solutions to such problems from digital humanities and digital heritage management could be introduced. But that could be the topic of a peer-reviewed article in itself. So, upon your request, I removed the iris dataset from all vignettes and examples. The package now uses the Orange dataset.

> Tests
> Should the line https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-agent.R#L9 be removed as it is not used in the test?

Solved.

> I don't understand why the iris object needs to be duplicated in lines like https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-creator.R#L23
> expect_true(is.dataset_df might become a custom expectation and/or rely on expect_s3_class() instead. Same comment for expect_true(is.subject.
> https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-datacite.R#L18
> should be
> expect_type(as_datacite(iris_dataset, "list"), "list")
> What is https://github.com/dataobservatory-eu/dataset/blob/master/tests/testthat/test-dataset_prov.bak?
> When using expect_error() as in https://github.com/dataobservatory-eu/dataset/blob/7bf85ac7abe9477d02b429b4d335179d94993a77/tests/testthat/test-dataset_title.R#L3 maybe add a pattern for the error message, just in case another error happens?

These minor issues are solved and there are many new tests.

@maurolepore (Member)

@ropensci-review-bot assign @maurolepore as editor

@ropensci-review-bot (Collaborator)

Assigned! @maurolepore is now the editor

@maurolepore (Member)

Dear @antaldaniel

Thanks again for sharing your work with rOpenSci, for your patience, and for incorporating so many of the suggestions so far.

I'm now the handling editor, please direct all your questions to me.

This week I'm wrapping up my EiC rotation but starting next week I'll be available to work with you to move this towards approval as smoothly and quickly as you and I can make it.

I plan to first digest this thread, then do some preliminary checks, then assign myself as the first reviewer -- which we sometimes do, and which the editorial board has agreed to for this specific submission. With that deeper dive into the package I'll be better able to share my findings with the editorial board and discuss the next step.

Would you be available to respond to my suggestions between May 12-23? Not a hard commitment, just to gauge your availability and decide whether this is indeed the right time for you.

@antaldaniel (Author)

Dear @maurolepore, thank you very much. I have an important deadline on 15 May, so whatever you write I will be able to review, and at least partly implement, by 23 May, most likely starting only on 16 May.

Because the package is already on CRAN and I have fixed a lot of minor bugs and made improvements, if it is not against your preferences, I would release the current version (the README makes it very clear that this is an experimental version under review at rOpenSci). The changes that may have bothered the few users experimenting with the package are resolved (there are new and better-tested subsetting, printing, etc. methods for the data-frame-like and vector classes -- nothing groundbreaking or conceptual, but for somebody using the package as it is, it is a significant improvement).

So is it OK with you if I release this new, improved version on CRAN, wait for your feedback until 12 May, and then reply (and implement whatever is possible) by 23 May?

@maurolepore (Member)

@antaldaniel sure go ahead and release what you have.

@maurolepore (Member)

This is to mark the end of my EiC rotation and to hand it over to the next EiC with a short summary.

@maurolepore (Member)

@ropensci-review-bot assign @maurolepore as reviewer

@ropensci-review-bot (Collaborator)

@maurolepore added to the reviewers list. Review due date is 2025-05-31. Thanks @maurolepore for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

@ropensci-review-bot (Collaborator)

@maurolepore: If you haven't done so, please fill this form for us to update our reviewers records.

@maurolepore (Member)

@ropensci-review-bot check package

@ropensci-review-bot (Collaborator)

Thanks, about to send the query.

@ropensci-review-bot (Collaborator)

🚀

Editor check started

👋

@ropensci-review-bot (Collaborator)

Checks for dataset (v0.3.4022)

git hash: 8768983c

  • ✔️ Package is already on CRAN.
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✖️ The following function has no documented return value: [bind_defined_rows]
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 80.1%.
  • ✖️ R CMD check process failed with message: 'Build process failed'.
  • 👀 Function names are duplicated in other packages

Important: All failing checks above must be addressed prior to proceeding

(Checks marked with 👀 may be optionally addressed.)

Package License: GPL (>= 3)


1. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 353
internal dataset 301
internal graphics 7
internal stats 7
imports assertthat 25
imports utils 10
imports labelled 9
imports vctrs 7
imports rlang 2
imports haven 1
imports tibble 1
imports datasets NA
imports ISOcodes NA
imports methods NA
imports pillar NA
imports RefManageR NA
suggests knitr NA
suggests rdflib NA
suggests rmarkdown NA
suggests spelling NA
suggests testthat NA
linking_to NA NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

is.null (45), ifelse (39), c (27), list (19), lapply (18), vapply (16), contributors (9), for (9), seq_along (9), names (8), paste0 (8), if (7), inherits (7), logical (7), mapply (7), substr (7), which (7), attr (6), labels (6), unique (6), unlist (6), do.call (5), format (5), date (4), length (4), Sys.time (4), col (3), data.frame (3), invisible (3), nzchar (3), t (3), vector (3), all (2), as.data.frame (2), as.list (2), drop (2), gsub (2), is.na (2), nchar (2), nrow (2), paste (2), tryCatch (2), any (1), args (1), as.POSIXct (1), cbind (1), class (1), double (1), gregexpr (1), identical (1), ncol (1), NextMethod (1), rbind (1), rep (1), setdiff (1), strsplit (1), sub (1), Sys.Date (1), tolower (1), typeof (1), units (1), with (1)

dataset

as.character (41), get_bibentry (26), creator (11), identifier (10), subject (10), var_unit (10), namespace_attribute (9), dataset_df (8), dataset_title (8), publisher (7), var_definition (7), description (5), language (5), provenance (5), rights (5), get_author (4), get_creator (4), get_publisher (4), get_type (4), agent (3), as_dataset_df (3), convert_column (3), definition_attribute (3), n_triple (3), new_Subject (3), publication_year (3), var_namespace (3), as_dublincore (2), datacite (2), default_provenance (2), defined (2), dublincore (2), geolocation (2), get_contributor (2), get_person_iri (2), idcol_find (2), is_person (2), is.dataset_df (2), n_triples (2), new_my_tibble (2), prov_author (2), unit_attribute (2), all.identical (1), as_character (1), as_character.haven_labelled_defined (1), as_datacite (1), as_factor (1), as_factor.haven_labelled_defined (1), as_numeric (1), as_numeric.haven_labelled_defined (1), as.character.haven_labelled_defined (1), as.list.haven_labelled_defined (1), as.numeric (1), as.numeric.haven_labelled_defined (1), as.vector.haven_labelled_defined (1), bind_defined_rows (1), c.haven_labelled_defined (1), compare_creators (1), create_iri (1), dataset_to_triples (1), describe (1), dublincore_to_triples (1), fix_contributor (1), fix_publisher (1), format.haven_labelled_defined (1), get_definition_attribute (1), get_namespace_attribute (1), get_unit_attribute (1), head.haven_labelled_defined (1), id_to_column (1), is_dataset_df (1), is_doi (1), is.datacite (1), is.datacite.datacite (1), is.defined (1), is.dublincore (1), is.dublincore.dublincore (1), is.subject (1), label_attribute (1), length.haven_labelled_defined (1), names.dataset_df (1), new_datacite (1), new_datetime_defined (1), new_dublincore (1), new_labelled_defined (1), Ops.haven_labelled_defined (1), plot.dataset_df (1), print.dataset_df (1), print.haven_labelled_defined (1), set_default_bibentry (1), set_definition_attribute (1), set_namespace_attribute (1), set_unit_attribute (1), set_var_labels (1), strip_defined (1), subject_create (1), summary.dataset_df (1), summary.haven_labelled_defined (1), tail.haven_labelled_defined (1), tbl_sum.dataset_df (1), var_definition.default (1), var_label.dataset_df (1), var_label.defined (1), var_namespace.default (1), var_unit.default (1), vec_cast.character.haven_labelled_defined (1)

assertthat

assert_that (25)

utils

person (5), bibentry (2), citation (1), head (1), tail (1)

labelled

var_label (8), to_labelled (1)

graphics

title (7)

stats

df (5), end (1), start (1)

vctrs

vec_data (7)

rlang

caller_env (1), env_is_user_facing (1)

haven

labelled (1)

tibble

new_tibble (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.


2. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

  • code in R (100% in 35 files) and
  • 1 authors
  • 6 vignettes
  • 2 internal data files
  • 12 imported packages
  • 94 exported functions (median 5 lines of code)
  • 202 non-exported functions in R (median 7 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 35 91.3
files_vignettes 6 97.0
files_tests 32 97.6
loc_R 2201 84.0
loc_vignettes 1037 90.8
loc_tests 1316 87.7
num_vignettes 6 97.6 TRUE
data_size_total 4101 64.7
data_size_median 2050 67.9
n_fns_r 296 93.2
n_fns_r_exported 94 94.6
n_fns_r_not_exported 202 92.3
n_fns_per_file_r 5 73.1
num_params_per_fn 2 8.2
loc_per_fn_r 5 9.2
loc_per_fn_r_exp 5 8.4
loc_per_fn_r_not_exp 7 18.5
rel_whitespace_R 21 86.4
rel_whitespace_vignettes 41 95.1 TRUE
rel_whitespace_tests 12 77.7
doclines_per_fn_exp 53 65.9
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 267 90.7

2a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package


3. goodpractice and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

rhub.yaml


3b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following error:

  1. Error in proc$get_built_file() : Build process failed

R CMD check generated the following check_fails:

  1. no_description_date
  2. no_import_package_as_a_whole

Test coverage with covr

Package coverage: 80.05

Cyclocomplexity with cyclocomp

Error : Build failed, unknown error, standard output:

  • checking for file ‘dataset/DESCRIPTION’ ... OK
  • preparing ‘dataset’:
  • checking DESCRIPTION meta-information ... OK
  • installing the package to build vignettes
  • creating vignettes ... ERROR
    --- re-building ‘bibentry.Rmd’ using rmarkdown
    --- finished re-building ‘bibentry.Rmd’

--- re-building ‘dataset_df.Rmd’ using rmarkdown
--- finished re-building ‘dataset_df.Rmd’

--- re-building ‘defined.Rmd’ using rmarkdown

Quitting from defined.Rmd:158-161 [coerce-char]

<error/rlang_error>
Error in `as_character.haven_labelled_defined()`:
! as_character(): underlying data is not a character vector.
---
Backtrace:
    ▆
 1. ├─dataset::as_character(c(gdp_1, gdp_2))
 2. └─dataset:::as_character.haven_labelled_defined(c(gdp_1, gdp_2))

Error: processing vignette 'defined.Rmd' failed with diagnostics:
as_character(): underlying data is not a character vector.
--- failed re-building ‘defined.Rmd’

--- re-building ‘Motivation.Rmd’ using rmarkdown
--- finished re-building ‘Motivation.Rmd’

--- re-building ‘new_requirements.Rmd’ using rmarkdown
--- finished re-building ‘new_requirements.Rmd’

--- re-building ‘rdf.Rmd’ using rmarkdown

Quitting from rdf.Rmd:109-112 [jsonld]

<error/rlang_error>
Error in `rdf_serialize()`:
! please install the jsonld package to use this functionality.
---
Backtrace:
     ▆
  1. └─tools::buildVignettes(dir = ".", tangle = TRUE)
  2.   ├─base::tryCatch(...)
  3.   │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
  4.   │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
  5.   │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
  6.   └─engine$weave(file, quiet = quiet, encoding = enc)
  7.     └─knitr:::vweave_rmarkdown(...)
  8.       └─rmarkdown::render(...)
  9.         └─knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
 10.           └─knitr:::process_file(text, output)
 11.             ├─xfun:::handle_error(...)
 12.             ├─base::withCallingHandlers(...)
 13.             └─knitr:::process_group(group)
 14.               └─knitr:::call_block(x)
 15.                 └─knitr:::block_exec(params)
 16.                   └─knitr:::eng_r(options)
 17.                     ├─knitr:::in_input_dir(...)
 18.                     │ └─knitr:::in_dir(input_dir(), expr)
 19.                     └─knitr (local) evaluate(...)
 20.                       └─evaluate::evaluate(...)
 21.                         ├─base::withRestarts(...)
 22.                         │ └─base (local) withRestartList(expr, restarts)
 23.                         │   ├─base (local) withOneRestart(withRestartList(expr, restarts[-nr]), restarts[[nr]])
 24.                         │   │ └─base (local) doWithOneRestart(return(expr), restart)
 25.                         │   └─base (local) withRestartList(expr, restarts[-nr])
 26.                         │     └─base (local) withOneRestart(expr, restarts[[1L]])
 27.                         │       └─base (local) doWithOneRestart(return(expr), restart)
 28.                         ├─evaluate:::with_handlers(...)
 29.                         │ ├─base::eval(call)
 30.                         │ │ └─base::eval(call)
 31.                         │ └─base::withCallingHandlers(...)
 32.                         └─watcher$print_value(ev$value, ev$visible, envir)
 33.                           ├─base::withVisible(handle_value(handler, value, visible, envir))
 34.                           └─evaluate:::handle_value(handler, value, visible, envir)
 35.                             └─handler$value(value, visible)
 36.                               └─knitr (local) fun(x, options = options)
 37.                                 ├─base::withVisible(knit_print(x, ...))
 38.                                 ├─knitr::knit_print(x, ...)
 39.                                 └─knitr:::knit_print.default(x, ...)
 40.                                   └─knitr::normal_print(x)
 41.                                     ├─base::print(x)
 42.                                     └─rdflib:::print.rdf(x)
 43.                                       ├─base::cat(format.rdf(x, ...), sep = "\n")
 44.                                       └─rdflib:::format.rdf(x, ...)
 45.                                         └─rdflib::rdf_serialize(x, tmp, format = format, ...)

Error: processing vignette 'rdf.Rmd' failed with diagnostics:
please install the jsonld package to use this functionality.
--- failed re-building ‘rdf.Rmd’

SUMMARY: processing the following files failed:
‘defined.Rmd’ ‘rdf.Rmd’

Error: Vignette re-building failed.
Execution halted

Static code analyses with lintr

lintr found the following 403 potential issues:

message (number of times)
Avoid 1:nrow(...) expressions, use seq_len. 1
Avoid library() and require() calls in packages 21
Avoid using sapply, consider vapply instead, that's type safe 1
Lines should not be more than 80 characters. This line is 100 characters. 10
Lines should not be more than 80 characters. This line is 101 characters. 6
Lines should not be more than 80 characters. This line is 102 characters. 6
Lines should not be more than 80 characters. This line is 103 characters. 8
Lines should not be more than 80 characters. This line is 104 characters. 7
Lines should not be more than 80 characters. This line is 105 characters. 7
Lines should not be more than 80 characters. This line is 106 characters. 6
Lines should not be more than 80 characters. This line is 107 characters. 9
Lines should not be more than 80 characters. This line is 108 characters. 3
Lines should not be more than 80 characters. This line is 109 characters. 14
Lines should not be more than 80 characters. This line is 110 characters. 5
Lines should not be more than 80 characters. This line is 111 characters. 1
Lines should not be more than 80 characters. This line is 112 characters. 6
Lines should not be more than 80 characters. This line is 113 characters. 1
Lines should not be more than 80 characters. This line is 114 characters. 8
Lines should not be more than 80 characters. This line is 115 characters. 6
Lines should not be more than 80 characters. This line is 117 characters. 3
Lines should not be more than 80 characters. This line is 118 characters. 5
Lines should not be more than 80 characters. This line is 119 characters. 2
Lines should not be more than 80 characters. This line is 120 characters. 4
Lines should not be more than 80 characters. This line is 122 characters. 1
Lines should not be more than 80 characters. This line is 123 characters. 2
Lines should not be more than 80 characters. This line is 125 characters. 1
Lines should not be more than 80 characters. This line is 126 characters. 3
Lines should not be more than 80 characters. This line is 131 characters. 3
Lines should not be more than 80 characters. This line is 133 characters. 2
Lines should not be more than 80 characters. This line is 134 characters. 1
Lines should not be more than 80 characters. This line is 136 characters. 1
Lines should not be more than 80 characters. This line is 140 characters. 1
Lines should not be more than 80 characters. This line is 142 characters. 1
Lines should not be more than 80 characters. This line is 143 characters. 1
Lines should not be more than 80 characters. This line is 148 characters. 1
Lines should not be more than 80 characters. This line is 151 characters. 1
Lines should not be more than 80 characters. This line is 153 characters. 1
Lines should not be more than 80 characters. This line is 157 characters. 1
Lines should not be more than 80 characters. This line is 166 characters. 1
Lines should not be more than 80 characters. This line is 171 characters. 2
Lines should not be more than 80 characters. This line is 174 characters. 1
Lines should not be more than 80 characters. This line is 292 characters. 3
Lines should not be more than 80 characters. This line is 312 characters. 1
Lines should not be more than 80 characters. This line is 81 characters. 17
Lines should not be more than 80 characters. This line is 82 characters. 15
Lines should not be more than 80 characters. This line is 83 characters. 21
Lines should not be more than 80 characters. This line is 84 characters. 15
Lines should not be more than 80 characters. This line is 85 characters. 8
Lines should not be more than 80 characters. This line is 86 characters. 11
Lines should not be more than 80 characters. This line is 87 characters. 15
Lines should not be more than 80 characters. This line is 88 characters. 15
Lines should not be more than 80 characters. This line is 89 characters. 12
Lines should not be more than 80 characters. This line is 90 characters. 11
Lines should not be more than 80 characters. This line is 91 characters. 10
Lines should not be more than 80 characters. This line is 92 characters. 12
Lines should not be more than 80 characters. This line is 93 characters. 12
Lines should not be more than 80 characters. This line is 94 characters. 8
Lines should not be more than 80 characters. This line is 95 characters. 15
Lines should not be more than 80 characters. This line is 96 characters. 9
Lines should not be more than 80 characters. This line is 97 characters. 2
Lines should not be more than 80 characters. This line is 98 characters. 17
Lines should not be more than 80 characters. This line is 99 characters. 8
unexpected 'in' 1


4. Other Checks

Details of other checks (click to open)

✖️ The following 10 function names are duplicated in other packages:

    • as_character from metan, radiant.data, retroharmonize, sjlabelled
    • as_numeric from descstat, metan, qdapRegex, radiant.data, retroharmonize, sjlabelled, zenplots
    • describe from AzureVision, Bolstad2, describer, dlookr, explore, Hmisc, iBreakDown, ingredients, lambda.r, MSbox, onewaytests, prettyR, psych, psyntur, questionr, radiant.data, RCPA3, Rlab, scan, scorecard, sylly, tidycomm
    • description from dataMaid, dataPreparation, dataReporter, dcmodify, memisc, metaboData, PerseusR, ritis, rmutil, rsyncrosim, stream, synchronicity, timeSeries, tis, validate
    • get_bibentry from eurostat
    • identifier from Ramble
    • is.defined from nonmemica
    • language from sylly, wakefield
    • provenance from provenance
    • subject from DGM, emayili, gmailr, sendgridr


Package Versions

package version
pkgstats 0.2.0.54
pkgcheck 0.1.2.126


Editor-in-Chief Instructions:

Processing may not proceed until the items marked with ✖️ have been resolved.

@maurolepore
Member

maurolepore commented May 12, 2025

Dear @antaldaniel

Thanks again for your submission. Here's my review. Before you proceed please wait to hear back from me. I need to first discuss with the editorial board. I'll aim to share next steps this week.

I see there is a lot of love and effort invested in this package. With such a large investment, I anticipate some of my feedback will likely feel uncomfortable. I appreciate your willingness to receive feedback. Please read it imagining a friendly tone and the best of intentions. We're all here helping each other kindly to make research software better.

Thanks for sharing your work with rOpenSci.

Semantic comment-tags

I tag my comments (ml01, ml02, and so on) to help track them. Please respond in sequence, using the corresponding tag:

  • A comment starting with a checkbox requires action and I expect a response.
  • A comment starting with a bullet is something to consider but does not require action. You may respond but I don't expect it.

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide.

Briefly describe any working relationship you have (had) with the package authors.

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (if you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need: clearly stating problems the software is designed to solve and its target audience in README

  • Installation instructions: for the development version of package and any non-standard dependencies in README

    • ml01: The installation instructions mention {remotes} but use {pak}:

    The latest development version of dataset can be installed with remotes::install_github():

    install.packages("pak")
    pak::pak("dataobservatory-eu/dataset")
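    A minimal sketch of instructions where the prose and the code agree (assuming you keep {pak}; the {remotes} variant is shown only for comparison):

    # If the code stays as-is, word the text around {pak}:
    install.packages("pak")
    pak::pak("dataobservatory-eu/dataset")

    # If the text stays as-is, switch the code to {remotes}:
    install.packages("remotes")
    remotes::install_github("dataobservatory-eu/dataset")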
  • Vignette(s): demonstrating major functionality that runs successfully locally

    README (not a vignette but I'll cover it here)

    • ml02: The print output of orange_df shows that column names are somewhat distant from the column values.

    • ml03: I'm struggling to understand the meaning of the information in angle brackets <hvn_lbl_>. I'm sure it'll become clear later in this or other docs, but it would be nice to have a clear "first impression". It would be nice to compare how the same dataset prints as a tibble, to more clearly show the richness that {dataset} adds.
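    A quick way to show that contrast (a sketch; it assumes orange_df ships with the package and that {tibble} is installed):

    library(dataset)
    library(tibble)

    orange_df                               # rich print method, with <hvn_lbl_> column types
    as_tibble(as.data.frame(orange_df))     # plain tibble print of the same data, for comparison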

    • ml04: I see an explicit call to print() in print(get_bibentry()). It seems unnecessary.

    • ml05: The text above plot(orange_df) says "Print title and labels" but I see no effect reflected in the plot.

    • ml06: Please review the structure of README. I struggle to follow the mix between sections with and without code. Consider grouping all the code-sections and all the text-sections. A structure that commonly works is an intro at the top including the goal and need, then installation instructions, minimal code examples, and finally the rest of the text-sections. Best to minimize duplication. Links to vignettes seem unnecessary since the website shows the Articles tab prominently. Contributing, code of conduct, and related work are OK and common.

    VIGNETTES

    • ml07: Please reconsider the need for the Articles introduction page. The default drop-down menu that pkgdown provides is common and enough. More text adds maintenance cost and may overwhelm the reader. The vignette titles should be enough information for the reader to decide if they want to navigate to that article, and then the article's introduction would do the rest -- likely in the first 1-3 sentences. An alternative to avoid "throwing away" the introduction to each article while keeping the website clean is to move it under a link "More articles...", e.g.:

    [screenshot: a "More articles..." drop-down on a pkgdown site]

    Here I see the sentence "The Motivation article will replace the New requirements". Consider removing this sentence, removing the article "New Requirements", and maybe adding a bullet in NEWS.md saying "New article 'Motivation' replaces the old article 'New Requirements'".

    • ml08: The article Motivation duplicates information in README. What's in README seems enough. Consider removing the article, or replacing the text in README with a link to the article.

    • ml09: The article Create Datasets that are Easy to Share Exchange and Extend seems equivalent to the common "Get started". If you rename the file to dataset.Rmd (i.e. a vignette named after the package), pkgdown should automatically add a "Get started" link to the navbar.
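    For example (a sketch; the current file name below is hypothetical):

    # pkgdown treats vignettes/<package-name>.Rmd as the "Get started" page,
    # so renaming the vignette after the package is enough:
    file.rename(
      from = "vignettes/create-datasets.Rmd", # hypothetical current name
      to   = "vignettes/dataset.Rmd"          # matches the package name
    )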

    • ml10: Consider consolidating all other articles. Ideally this should result in fewer, more focused articles. Also consider moving the less important articles to "More articles..." (see above) to reduce the cognitive load required to have a first glance at the website. A smooth experience on the website suggests a similarly smooth experience with the code.

  • Function Documentation: for all exported functions

    • ml11: Please review the Reference index.
      The grouping does not seem cohesive. For example, to me it makes sense that the "Dataset" group includes functions with "dataset" in their name, e.g. dataset_df() and as_dataset_df(). But the fit of the function get_bibentry() in this group is unclear to me. Instead, a function called get_bibentry() seems to fit more naturally in the group of "Bibliography functions". Consider reviewing the groups and evaluating whether you need different groups, or different function names that more clearly reflect their membership in a group.

    • ml12: Please document the @return value of bin_defined_rows(). See these recent pkgcheck results.
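    A minimal sketch of the missing tag (the wording below is an assumption; please describe what bin_defined_rows() actually returns):

    #' @return Describe the returned object here, e.g. "A defined vector with
    #'   the rows grouped into bins" (placeholder wording, not checked against
    #'   the implementation).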

  • Examples: (that run successfully locally) for all exported functions

    • ml13: Please fix this failing example (but if the point is to show an error you may use try()):
    devtools::run_examples()
    #> ℹ Updating dataset documentation
    #> ℹ Loading dataset
    #> ── Running 36 example files ───────────────────────────────────────── dataset ──
    #> ℹ Loading dataset
    #>
    #> > ### Name: as.character.haven_labelled_defined
    #> > ### Title: Coerce to character vector
    #> > ### Aliases: as.character.haven_labelled_defined as_character
    #> > ###  as_character.haven_labelled_defined
    #>
    #> > ###  Examples
    #> >
    #> > x <- defined(c("a", "b", "c"), label = "Letter code")
    #>
    #> > as_character(x)
    #> [1] "a" "b" "c"
    #>
    #> > y <- defined(1:3, label = "Index")
    #>
    #> > as_character(y)
    #> Error in as_character.haven_labelled_defined(y): as_character(): underlying data is not a character vector.
    #> ℹ Loading dataset

    That failing example is likely what's causing R CMD check to fail on this recent run of pkgcheck.
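    If the error is intentional, a sketch of the wrapped example (using the objects from the output above):

    y <- defined(1:3, label = "Index")
    try(as_character(y)) # prints the error without failing R CMD check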

    • ml14: Please review all examples and ensure the website shows the most relevant output. The goal is for a potential user to understand each help file before they install the package and run the code. The reprex package gives good advice:

    The ... code ... should be easy for others to digest, so they don't necessarily have to run it.

    Here are a few examples so you get the idea:

    • ?orange_df lacks an examples section showing the dataset itself. This would be useful:
    library(dataset)
    
    orange_df
    #> Draper N, Smith H (1998). "Growth of Orange Trees."
    #>   rowid  tree   age circumference
    #>                        The age of circumference
    #>  <hvn_lbl_> <hvn_lbl_> <hvn_lbl_> <hvn_lbl_>
    #> 1 orange:1   2 [1]     118            30
    #> 2 orange:2   2 [1]     484            58
    #> 3 orange:3   2 [1]     664            87
    #> 4 orange:4   2 [1]    1004           115
    #> 5 orange:5   2 [1]    1231           120
    #> 6 orange:6   2 [1]    1372           142
    #> 7 orange:7   2 [1]    1582           145
    #> 8 orange:8   4 [2]     118            33
    #> 9 orange:9   4 [2]     484            69
    #> 10 orange:10  4 [2]     664           111
    #> # ℹ 25 more rows
    library(dataset)
    
    my_dataset <- dataset_df(
      country_name = defined(
        c("AD", "LI"),
        definition = "http://data.europa.eu/bna/c_6c2bb82d",
        namespace = "https://www.geonames.org/countries/$1/"
      ),
      gdp = defined(
        c(3897, 7365),
        label = "Gross Domestic Product",
        unit = "million dollars",
        definition = "http://data.europa.eu/83i/aa/GDP"
      )
    )
    
    print(my_dataset)
    #> Unknown A (:tba). "Untitled Dataset."
    #>   rowid  country_name gdp
    #>                        Gross Dome
    #>  <hvn_lbl_> <hvn_lbl_>   <hvn_lbl_>
    #> 1 eg:1       AD         3897
    #> 2 eg:2       LI         7365
    
    is.dataset_df(my_dataset)
    #> [1] TRUE

    This would be useful:

    str(my_dataset$country_name)
    #>  hvn_lbl_ [1:2] AD, LI
    #>  @ definition: chr "http://data.europa.eu/bna/c_6c2bb82d"
    #>  @ namespace : chr "https://www.geonames.org/countries/$1/"
    str(my_dataset$gdp)
    #>  hvn_lbl_ [1:2] 3897 (million dollars), 7365 (million dollars)
    #>  @ label   : chr "Gross Domestic Product"
    #>  @ unit    : chr "million dollars"
    #>  @ definition: chr "http://data.europa.eu/83i/aa/GDP"
    
    • ?get_bibentry() doesn't show its output, either before or after calling set_bibentry(), so we can't see the effect.
  • Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.

    • ml15: Great. On a fresh Ubuntu system with R 4.5.0, the package installs smoothly from CRAN and GitHub (using pak()).
  • Functionality: Any functional claims of the software have been confirmed.

    • ml16: The functions I tried worked fine and the tests cover a lot of ground. But much of my time was spent on other aspects of this review, so I can't claim I did this sufficiently.
  • Performance: Any performance claims of the software have been confirmed.

    • ml17: I see no performance claims in the computational sense.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.

    • ml18: In general, test coverage looks great, but some specific files with lots of logic may need attention:
    #> R/var_label.R: 35.71%
    #> R/xsd_convert.R: 46.15%
    #> R/agent.R: 47.62%
    #> R/id_to_column.R: 51.72%
    #> R/get_bibentry.R: 71.43%
    #> R/as_dublincore.R: 73.61%
    #> R/publisher.R: 75.00%
    #> R/dataset_df.R: 75.54%
    #> R/utils-idcol_find.R: 76.92%
    #> R/identifier.R: 78.05%
    #> R/dataset_title.R: 78.95%
    

    This file has a lot of logic and seems straightforward to test:

    [screenshot of the file's source]

    • ml19: Consider making the test titles more specific, e.g.:
    library(testthat)
    devtools::load_all()
    #> ℹ Loading dataset
    
    # Better
    # More specific. The data is minimal
    test_that("creates objects of S3 class 'dataset_df'", {
      expect_s3_class(as_dataset_df(data.frame()), "dataset_df")
    
      expect_true(inherits(as_dataset_df(data.frame()), "dataset_df"))
      expect_false(inherits(data.frame(), "dataset_df"))
    })
    #> Test passed 🎊
    
    # Can improve
    # Not specific. The specific data is irrelevant to this test
    # https://github.com/dataobservatory-eu/dataset/blob/8768983c36a6f6ee2aadf7637dcf5f0f88ab1e3e/tests/testthat/test-dataset_df.R#L144
    test_that("as_dataset_df() works", {
      expect_s3_class(as_dataset_df(iris), "dataset_df")
      expect_false(is.dataset_df(mtcars))
    })
    #> Test passed 🥳

    This makes it easier to understand what the problem is when a test fails. Compare:

    # Too vague
    ── Failure (test-dataset_df.R:146:3): as_dataset_df() works ───────────────────────────────────────────────────────
    is.dataset_df(mtcars) is not TRUE
    
    `actual`:    FALSE
    `expected`: TRUE
    
    
    # Shows intent
    ── Failure (test-dataset_df.R:151:3): as_dataset_df() creates objects of S3 class 'dataset_df' ────────────────────
    is.dataset_df(mtcars) is not TRUE
    
    `actual`:    FALSE
    `expected`: TRUE
    [ FAIL 2 | WARN 0 | SKIP 0 | PASS 32 ]
    
    • ml20: In calls to expect_error() please ensure you test not just for any error but for the specific one you want to test. See the arguments regexp and class of expect_error(), or consider using expect_snapshot(error = TRUE) (or friends) (example).
    test_that("new_labelled_defined() throws errors", {
      expect_error(defined(
        x = c(1:3),
        label = c("Numbers", "Numbers"),
        definition = 1,
        namespace = "Iris"
      ))
      expect_error(defined(
        x = c(1:3),
        label = "Numbers",
        definition = 1,
        namespace = "Iris"
      ))
      # ... and more
    })
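    A sketch of a more specific assertion (the matched text "label" and the condition class below are assumptions, not the package's actual error):

    test_that("defined() rejects a label of the wrong length", {
      expect_error(
        defined(
          x = c(1:3),
          label = c("Numbers", "Numbers"),
          definition = 1,
          namespace = "Iris"
        ),
        regexp = "label" # or: class = "dataset_error" (hypothetical class)
      )
    })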
    • ml21: Consider removing the name of the function under test from the test title when the test file contains the name of that function (which should cover most cases if you follow the most common naming conventions) (example).
  # The file name already contains the name of the function: test-xsd_convert.R
── Failure (test-xsd_convert.R:2:3): works ────────────────────────────────────────────────────────────────────────
  xsd_convert(TRUE) (`actual`) not equal to "fail for demo" (`expected`).
  
  `actual`:    ""TRUE"^^<xs:boolean>""
  `expected`: "fail for demo"
 
  # The function name is duplicated information 
 ── Failure (test-xsd_convert.R:6:3): xsd_convert(x, idcols) works ────────────────────────────────────────────────
  xsd_convert(TRUE) (`actual`) not equal to "fail for demo" (`expected`).
  
  `actual`:    ""TRUE"^^<xs:boolean>""
  `expected`: "fail for demo"
  [ FAIL 2 | WARN 0 | SKIP 0 | PASS 0 ]
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

  # https://github.com/dataobservatory-eu/dataset/blob/main/R/dataset_df.R
  
  #' Create a new dataset_df object
  #'
  #' The `dataset_df` constructor creates objects of this class, which
  #' are semantically rich, modern data frames inherited from
  #' `\link[tibble:tibble]{tibble::tibble}`.
  #'
  #' @details
  #' To check if an object has the class `dataset_df` use `is.dataset_df`.
  #'
  #' `print` is the method to print out the semantically rich data frames
  #' created with the constructor of `dataset_df`.
  #'
  #' `summary` is the method to summarise these semantically rich data frames.
  #'
  #' For more details, please check the `vignette("dataset_df", package = "dataset")`
  #' vignette.
  #'
  #' @param identifier Defaults to `c(eg="http://example.com/dataset#")`, which should be
  #' changed to the permanent identifier of the dataset. For example, if your dataset will be
  #' released with the Digital Object Identifier (DOI) `https;//doi.org/1234`, you should use
  #' a short prefixed identifier like `c(obs="https://doi.org/1234#")`, which will resolve
  #' to the rows being identified as https://doi.org/1234#1...https://doi.org/1234#n.
  #' @param dataset_bibentry A list of bibliographic references and descriptive metadata
  #' about the dataset as a whole created with [`datacite()`] or
  #' [`dublincore()`].
  #' @param var_labels The long, human readable labels of each variable.
  #' @param units The units of measurement for the measured variables.
  #' @param definitions The linked definitions of the variables, attributes, or constants.
  #' @param dataset_subject The subject of the dataset, see [`subject()`].
  #' @param ... The vectors (variables) that should be included in the dataset.
  #' @param x A `dataset_df` object for S3 methods.
  #' @param df A `data.frame` to be converted to `dataset_df`.
  #' @return `dataset_df` is the constructor of this type, it returns an object
  #' inherited from a data frame with semantically rich metadata.
  #' @import vctrs
  #' @import pillar
  #' @examples
  #' my_dataset <- dataset_df(
  #'  country_name = defined(
  #'    c("AD", "LI"),
  #'    definition = "http://data.europa.eu/bna/c_6c2bb82d",
  #'    namespace = "https://www.geonames.org/countries/$1/"
  #'  ),
  #'  gdp = defined(
  #'    c(3897, 7365),
  #'    label = "Gross Domestic Product",
  #'    unit = "million dollars",
  #'    definition = "http://data.europa.eu/83i/aa/GDP"
  #'  )
  #' )
  #'
  #' print(my_dataset)
  #'
  #' is.dataset_df(my_dataset)
  #' @export
  • ml23: Please re-run the styler:

    styler::style_pkg()
    #> Styling  79  files:
    ...
    #>  R/datacite.R                                    ✔
    #>  R/dataset_df.R                                  ℹ
    #>  R/dataset_title.R                               ✔
    #>  R/dataset_to_triples.R                          ✔
    #>  R/dataset-iris_dataset.R                        ✔
    #>  R/dataset-orange_df.R                           ✔
    #>  R/defined.R                                     ℹ
    #>  R/describe.R                                    ✔
    #>  R/description.R                                 ✔
    #>  R/dublincore.R                                  ℹ
    #>  R/fix_contributor.R                             ℹ
    #>  R/fix_publisher.R                               ✔
    #>  R/geolocation.R                                 ✔
    ...
    #>  tests/testthat/test-dataset_title.R            ✔
    #>  tests/testthat/test-dataset_to_triples.R       ✔
    #>  tests/testthat/test-defined.R                  ℹ
    #>  tests/testthat/test-describe.R                 ✔
    #>  tests/testthat/test-description.R              ✔
    #>  tests/testthat/test-dublincore.R               ℹ
    #>  tests/testthat/test-fix_contributor.R          ℹ
    #>  tests/testthat/test-fix_publisher.R            ℹ
    #>  tests/testthat/test-geolocation.R              ✔
    ...
    #>  vignettes/dataset_df.Rmd                        ✔
    #>  vignettes/defined.Rmd
    #> Warning: When processing vignettes/defined.Rmd: ℹ In index: 7.
    #> Caused by error in `parse_safely()`:
    #> ! <text>:2:7: unexpected 'in'
    #> 1: c(gdp_1, gdp_2)
    #> 2: Error in
    #>        ^
    #> ⚠
    #>  vignettes/Motivation.Rmd                       ✔
    ...
    #> ────────────────────────────────────────
    #> Status    Count   Legend
    #> ✔      70   File unchanged.
    #> ℹ       8   File changed.
    #> ✖       1   Styling threw an error.
    #> ────────────────────────────────────────
    #> Please review the changes carefully!


Estimated hours spent reviewing: 6

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

    • ml24: No need to acknowledge in DESCRIPTION. Your kind, thoughtful responses will be sufficient acknowledgement for my work. Thanks!

Review Comments

  • ml25: I see ?iris_dataset (source) and many hits for "iris", but I understand further usage of this dataset is regarded as unacceptable (@mpadge ropensci/dev_guide#868 (comment)):

➜  dataset git:(main) git pull
Already up to date.

➜ grep iris -R | wc -l
385

➜  dataset git:(main) cat R/dataset-iris_dataset.R
#' @title Edgar Anderson's Iris Data
#'
#' @description This famous (Fisher's or Anderson's) iris data set gives the measurements in
#' centimetres of the variables sepal length and width and petal length and
#' width, respectively, for 50 flowers from each of 3 species of iris.
#' The species are \emph{Iris setosa}, \emph{versicolor}, and \emph{virginica}.
#' This is a replication of \code{datasets::\link[datasets:iris]{iris}} as
#' \emph{dataset} s3 class.
#'
#' @details See \code{datasets::\link[datasets:iris]{iris}} for details.
#'
#' @format
#' iris is a data frame with 150 cases (rows) and 6 variables (columns) named
#' rowid, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
#' @references
#' Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language.
#' Wadsworth & Brooks/Cole.
#' @source
#' Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems.
#' Annals of Eugenics, \strong{7}, Part II, p179–188.
#'
#' The data were collected by Anderson, Edgar (1935).
#' The irises of the Gaspe Peninsula,
#' Bulletin of the American Iris Society, \strong{59},
#' \href{https://www.biodiversitylibrary.org/item/270486#page/344/mode/1up}{2–5}.
"iris_dataset"

This is surprising to me because this comment suggests it was removed:

So, upon your request, I removed the iris dataset from all vignettes and examples. It uses now the Orange dataset.

Did you remove it from the default main branch (which is the one I'm reviewing)?

  • ml26: Please review the most recent output of pkgcheck. Ensure to expand the drop-downs. There is a lot of useful information. Some things that stand out to me:
  • The two issues marked with ✖️.
  • Some packages in Imports that might be unused (at least in R/). Maybe they can move to Suggests? (A sketch of the usual pattern follows this list.)
  • The number of vignettes.
  • Some lint issues, particularly the few other than line-length.
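  A common pattern when a package moves from Imports to Suggests (a sketch; {rdflib} is only an illustration):

  # Inside any function that needs the suggested package:
  if (!requireNamespace("rdflib", quietly = TRUE)) {
    stop("Please install the 'rdflib' package to use this functionality.",
         call. = FALSE)
  }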

@maurolepore
Member

@ropensci-review-bot submit review #681 (comment) time 6

@ropensci-review-bot
Collaborator

Logged review for maurolepore (hours: 6)

@antaldaniel
Author

@maurolepore Thank you very much for the extensive comments. I will start from here, because I was a bit concerned that you were perhaps seeing a wrong version, as all examples and checks ran smoothly for me:

This is surprising to me because this #681 (comment) suggests it was removed:
So, upon your request, I removed the iris dataset from all vignettes and examples. It uses now the Orange dataset.
Did you remove it from the default main branch (which is the one I'm reviewing)?

... indeed, you should not encounter the iris dataset. The agreement was that the dataset would be hidden (after all, it is part of base R, and about 200 unit tests use it) but removed from the examples and vignettes. As far as I can see, the iris dataset now appears only once in the documentation and in the unit tests.

I'll write out all your comments as issues and take a serious look, because when I signalled that I thought all issues were resolved (at least from the previous two reviews), I certainly did not see the iris dataset or any failing examples or tests. As originally suggested, I will work on this during the week and try to implement all your suggestions.

The original rOpenSci guidelines, when I first submitted this package for review, were to submit preferably at a conceptual stage, before the first CRAN release, and that was my intention. So I understand that there are a lot of unpolished solutions present, but the idea was to create a tool that can receive semantically rich datasets into the R ecosystem and give them back to such systems, mainly the APIs that use SDMX, i.e., the statistical data and metadata exchange standard.

@maurolepore
Member

maurolepore commented May 16, 2025

Thanks @antaldaniel for the clarifications. The editorial board also confirmed that the priority is to remove iris from public-facing documentation, but it's fine to keep it in tests.

For the record, we typically do two reviews before the author makes changes but we accept exceptions and we agree this is one, so please do go ahead with your changes as planned.

For changes to the documentation it can really help to get an external reviewer. As someone who has written some long pieces of work, I have learned to appreciate the value of fresh eyes. External reviewers (including AI) can typically reword and summarize text much more easily than the first author. Professional writers do it, and this work seems very professional to me.

For my own writing I find this short paper very helpful: The science of scientific writing. It really applies to writing anything, not just science. It shows how to structure text to maximize the chances that the reader will interpret it as the writer intended. Interestingly, that structure may sometimes not be the most natural one, yet it is the most effective.

Funny enough I learned about the main author, George Gopen, from reading R for Data Science:

To improve your writing, we highly recommend ... The Sense of Structure: Writing from the Reader’s Perspective by George Gopen. ... George Gopen also has a number of short articles on writing at https://www.georgegopen.com/litigation-articles.html. They are aimed at lawyers, but almost everything applies to data scientists too.

@antaldaniel
Author

@maurolepore Thank you for your patience. Regarding these errors, I made a very clumsy mistake: I work on several computers, and the version on the main branch was not the one that should have been available to you. I am quite certain there are no more visible iris references and no failing examples, and where you suggested it, I also extensively improved the tests; I now have ~400 unit tests. I will turn to the revisions of the documentation that you suggested, which is a somewhat bigger task than I thought, but I will try my best, and since I am already investing in this, I will also consider trying to create a publication. I think the idea that the R system should be able to integrate more with the semantic web and the infrastructure of statistical exchange has its merits, even if this package is still at an early stage. I will let you know as soon as I am finished with the rewrites.

@mpadge
Member

mpadge commented May 23, 2025

@maurolepore Just noting that this issue pops up on our dashboard as being labelled with multiple stages. Can you please rectify that? thanks 👍

@ropensci-review-bot
Collaborator

@antaldaniel: please post your response with @ropensci-review-bot submit response <url to issue comment> if you haven't done so already (this is an automatic reminder).

Here's the author guide for response. https://devguide.ropensci.org/authors-guide.html
