dfms: Efficient Estimation of Dynamic Factor Models for R #556

SebKrantz · 2022-10-07T13:08:10Z

Submitting Author Name: Sebastian Krantz
Submitting Author Github Handle: @SebKrantz
Other Package Authors Github handles: @rbagd
Repository: https://github.com/SebKrantz/dfms
Version submitted: 0.1.2
Submission type: Stats
Badge grade: bronze
Editor: @noamross
Reviewers: @eeholmes, @santikka

Due date for @eeholmes: 2023-01-03

Due date for @santikka: 2023-01-04
Archive: TBD
Version accepted: TBD
Language: en

Paste the full DESCRIPTION file inside a code block below:

Package: dfms
Version: 0.1.2
Title: Dynamic Factor Models
Authors@R: c(person("Sebastian", "Krantz", role = c("aut", "cre"), email = "sebastian.krantz@graduateinstitute.ch"),
             person("Rytis", "Bagdziunas", role = "aut"))
Description: Efficient estimation of Dynamic Factor Models using the Expectation Maximization (EM) algorithm 
  or Two-Step (2S) estimation, on datasets with missing data. The implementation follows advances in the econometric 
  literature: estimation can be done either by running the Kalman Filter and Smoother once with initial values 
  from PCA - following Doz, Giannone and Reichlin (2011) (2S) - or via iterated Kalman Filtering and Smoothing until EM 
  convergence - following Doz, Giannone and Reichlin (2012) - or using the adapted EM algorithm of Banbura and Modugno 
  (2014), allowing estimation with arbitrary patterns of missing data. The implementation makes heavy use of the 
  Armadillo C++ library and the collapse package, providing for particularly speedy estimation. A comprehensive set of 
  methods supports interpretation/visualization of the model and forecasting. Information criteria to choose the number 
  of factors are also provided - following Bai and Ng (2002).
  --- Key References: ---
  Doz, C., Giannone, D., & Reichlin, L. (2011). A two-step estimator for large approximate dynamic 
       factor models based on Kalman filtering. Journal of Econometrics, 164(1), 188-205.
  Doz, C., Giannone, D., & Reichlin, L. (2012). A quasi-maximum likelihood approach for large, approximate 
       dynamic factor models. Review of Economics and Statistics, 94(4), 1014-1024.
  Banbura, M., & Modugno, M. (2014). Maximum likelihood estimation of factor models on datasets with arbitrary 
       pattern of missing data. Journal of Applied Econometrics, 29(1), 133-160.
URL: https://sebkrantz.github.io/dfms/
BugReports: https://github.com/SebKrantz/dfms/issues
Depends: R (>= 3.0.0)
Imports: Rcpp (>= 1.0.1), collapse (>= 1.8.0)
LinkingTo: Rcpp, RcppArmadillo
Suggests: 
    xts,
    vars,
    magrittr,
    testthat (>= 3.0.0),
    knitr,
    rmarkdown,
    covr
License: GPL-3
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE, roclets = c ("namespace", "rd", "srr::srr_stats_roclet"))
RoxygenNote: 7.1.2
Config/testthat/edition: 3
VignetteBuilder: knitr

Scope

Please indicate which of our statistical package categories this package falls under. (Please check one appropriate box below):

Statistical Packages
- Bayesian and Monte Carlo Routines
- Dimensionality Reduction, Clustering, and Unsupervised Learning
- Machine Learning
- Regression and Supervised Learning
- Exploratory Data Analysis (EDA) and Summary Statistics
- Spatial Analyses
- Time Series Analyses

Pre-submission Inquiry

A pre-submission inquiry has been approved in issue #555

General Information

Who is the target audience and what are scientific applications of this package?

Anybody working with time series. The package is useful for dimensionality reduction and forecasting with a large amount of time series.

Paste your responses to our General Standard G1.1 here, describing whether your software is:
- The first implementation of a novel algorithm; or
- The first implementation within R of an algorithm which has previously been implemented in other languages or contexts; or
- An improvement on other implementations of similar algorithms in R.

See README.md, dfms implements simple baseline versions of algorithms that have been around for a while in Matlab, and in other langaues (R, Python, Julia), but inside more elaborate nowcasting codes - thus not directly accessible, and less efficient. It is the only pure baseline implementation of the algorithms proposed by the 3 academic references mentioned in the description available for R and ready for CRAN.

Please include hyperlinked references to all other relevant software.

The software is actually a reboot and massive improvement upon dynfactoR, an abandoned software project. Generalizations of the functionality are provided by nowcasting and nowcastDFM, which fit dynamic factor models specific to mixed-frequency nowcasting applications. These packages are currently not on CRAN (they were archived) and also not very well maintained. Package MARSS can be used to fit dynamic factor models, but has a complicated API and fails on bigger datasets. The only really useful and well maintained dynamic factor modelling package for R is bayesdfa, which is also on CRAN, and fits bayesian dynamic factor models with Stan. I expect dfms to provide substantially faster estimation than bayesdfa. There are various other codes for Python and Julia on GitHub, including an implementation in the popular statsmodels library, but I did not engage with those as my primary tool remains R and I wanted to create an efficient baseline implementation for R that follows advances in the econometrics literature (PCA + EM Algorithm based estimation).

(If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Not applicable.

Badging

What grade of badge are you aiming for? (bronze, silver, gold)

Bronze

If aiming for silver or gold, describe which of the four aspects listed in the Guide for Authors chapter the package fulfils (at least one aspect for silver; three for gold)

Technical checks

Confirm each of the following by checking the box.

I have read the rOpenSci packaging guide.
I have read the author guide and I expect to maintain this package for at least 2 years or have another maintainer identified.
I/we have read the Statistical Software Peer Review Guide for Authors.
I/we have run autotest checks on the package, and ensured no tests fail.
The srr_stats_pre_submit() function confirms this package may be submitted.
The pkgcheck() function confirms this package may be submitted - alternatively, please explain reasons for any checks which your package is unable to pass.

There are still some autotest issues, especially for the main DFM() function, but I do not understand those as all inputs received the maximum extent of checking. See lines 211-226. I also don't understand the note in pkgcheck requesting CI checks. The package receives CI through GitHub Actions (all plattforms) and test coverage is uploaded to codecov.io.

This package:

does not violate the Terms of Service of any service it interacts with.
has a CRAN and OSI accepted license.
contains a README with instructions for installing the development version.

Publication options

Do you intend for this package to go on CRAN?
Do you intend for this package to go on Bioconductor?

Code of conduct

I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

The text was updated successfully, but these errors were encountered:

ropensci-review-bot · 2022-10-07T13:08:12Z

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

ropensci-review-bot · 2022-10-07T13:08:15Z

🚀

Editor check started

👋

ropensci-review-bot · 2022-10-07T13:34:07Z

Checks for dfms (v0.1.2)

git hash: 52706666

✔️ Package name is available
✔️ has a 'codemeta.json' file.
✔️ has a 'contributing' file.
✔️ uses 'roxygen2'.
✔️ 'DESCRIPTION' has a URL field.
✔️ 'DESCRIPTION' has a BugReports field.
✔️ Package has at least one HTML vignette
✔️ All functions have examples.
✔️ Package has continuous integration checks.
✔️ Package coverage is 76.6%.
✔️ R CMD check found no errors.
✔️ R CMD check found no warnings.
👀 Function names are duplicated in other packages

(Checks marked with 👀 may be optionally addressed.)

Package License: GPL-3

1. rOpenSci Statistical Standards (`srr` package)

This package is in the following category:

Time Series
Dimensionality Reduction, Clustering and Unsupervised Learning

✔️ All applicable standards [v0.1.0] have been documented in this package (147 complied with; 9 N/A standards)

Click to see the report of author-reported standards compliance of the package with links to associated lines of code, which can be re-generated locally by running the srr_report() function from within a local clone of the repository.

2. Package Dependencies

Details of Package Dependency Usage (click to open)

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type	package	ncalls
internal	base	726
internal	stats	83
internal	dfms	51
internal	graphics	10
internal	grDevices	1
imports	collapse	75
imports	Rcpp	NA
suggests	xts	NA
suggests	vars	NA
suggests	magrittr	NA
suggests	testthat	NA
suggests	knitr	NA
suggests	rmarkdown	NA
suggests	covr	NA
linking_to	Rcpp	NA
linking_to	RcppArmadillo	NA

Click below for tallies of functions used in each package. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)', and examining the 'external_calls' table.

base

list (120), c (69), T (55), rep (48), if (45), dim (40), F (24), levels (22), drop (19), return (18), length (14), tolower (14), switch (13), t (10), beta (9), matrix (9), ncol (9), nrow (9), paste0 (9), crossprod (8), lapply (8), tcrossprod (8), diag (7), for (7), unlist (7), attr (6), gamma (6), abs (5), apply (5), attributes (5), col (5), seq_len (5), cbind (4), dimnames (4), is.na (4), log (4), rbind (4), rowSums (4), stop (4), which (4), anyNA (3), call (3), colSums (3), cumsum (3), eigen (3), sum (3), do.call (2), identity (2), is.null (2), match.call (2), min (2), names (2), quote (2), replace (2), rev (2), rowMeans (2), structure (2), any (1), as.integer (1), as.vector (1), environment (1), is.character (1), is.finite (1), is.list (1), is.matrix (1), isTRUE (1), kronecker (1), numeric (1), nzchar (1), outer (1), paste (1), prod (1), solve.default (1)

stats

C (36), time (18), weights (8), filter (6), setNames (6), cov (4), ts.plot (4), residuals (1)

collapse

setAttrib (26), setDimnames (13), mctl (10), frange (5), setColnames (5), qsu (3), t_list (3), unattrib (3), pwcov (2), setop (2), fmedian (1), fvar (1), whichv (1)

dfms

ainv (6), ftail (5), SKFS (5), lagnam (4), msum (4), AC1 (3), apinv (2), EMstepBMOPT (2), Estep (2), FIS (2), SKF (2), as.data.frame.dfm (1), as.data.frame.dfm_forecast (1), DFM (1), em_converged (1), EMstepDGR (1), findNA_LE (1), fitted.dfm (1), ICr (1), impNA_MA (1), impNA_spline (1), plot.dfm (1), plot.ICr (1), tsnarmimp (1), unscale (1)

graphics

par (8), lines (2)

grDevices

rainbow (1)

NOTE: Some imported packages appear to have no associated function calls; please ensure with author that these 'Imports' are listed appropriately.

3. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties (click to open)

The package has:

code in C++ (30% in 5 files) and R (70% in 9 files)
2 authors
1 vignette
3 internal data files
2 imported packages
30 exported functions (median 13 lines of code)
57 non-exported functions in R (median 8 lines of code)
13 R functions (median 15 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

loc = "Lines of Code"
fn = "function"
exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure	value	percentile	noteworthy
files_R	9	55.2
files_src	5	88.8
files_vignettes	1	68.4
files_tests	3	75.2
loc_R	844	63.3
loc_src	364	39.8
loc_vignettes	120	31.3
loc_tests	87	36.5
num_vignettes	1	64.8
data_size_total	117315	84.2
data_size_median	7049	76.8
n_fns_r	87	72.8
n_fns_r_exported	30	78.3
n_fns_r_not_exported	57	71.1
n_fns_src	13	35.0
n_fns_per_file_r	6	72.3
n_fns_per_file_src	3	34.0
num_params_per_fn	5	69.6
loc_per_fn_r	8	20.0
loc_per_fn_r_exp	13	30.5
loc_per_fn_r_not_exp	8	22.6
loc_per_fn_src	15	51.6
rel_whitespace_R	17	62.2
rel_whitespace_src	25	47.8
rel_whitespace_vignettes	51	45.9
rel_whitespace_tests	38	47.0
doclines_per_fn_exp	48	60.6
doclines_per_fn_not_exp	0	0.0	TRUE
fn_call_network_size	39	61.0

3a. Network visualisation

Click to see the interactive network visualisation of calls between objects in package

4. `goodpractice` and other checks

Details of goodpractice checks (click to open)

3a. Continuous Integration Badges

GitHub Workflow Results

id	name	conclusion	sha	run_number	date
3204988595	pages build and deployment	success	06193d	28	2022-10-07
3204988635	R-CMD-check	failure	06193d	100	2022-10-07
3204988634	test-coverage	success	06193d	29	2022-10-07

3b. `goodpractice` results

`R CMD check` with rcmdcheck

R CMD check generated the following notes:

checking installed package size ... NOTE
installed size is 9.2Mb
sub-directories of 1Mb or more:
doc 1.5Mb
libs 7.4Mb
checking dependencies in R code ... NOTE
Namespace in Imports field not imported from: ‘Rcpp’
All declared Imports should be used.

R CMD check generated the following check_fails:

cyclocomp
rcmdcheck_imports_not_imported_from
rcmdcheck_reasonable_installed_size

Test coverage with covr

Package coverage: 76.57

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function	cyclocomplexity
DFM	55
plot.dfm_forecast	27
plot.dfm	25
as.data.frame.dfm	19

Static code analyses with lintr

lintr found the following 551 potential issues:

message	number of times
Avoid library() and require() calls in packages	3
Lines should not be more than 80 characters.	491
Use <-, not =, for assignment.	57

5. Other Checks

Details of other checks (click to open)

✖️ The following function name is duplicated in other packages:

- DFM from MKclass

Package Versions

package	version
pkgstats	0.1.1.54
pkgcheck	0.1.0.24
srr	0.0.1.180

Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

adamhsparks · 2022-10-08T03:32:57Z

Thanks, @SebKrantz. I'll find a handling editor for you shortly.

SebKrantz · 2022-10-09T22:08:44Z

Thanks a lot @adamhsparks. I'm also pending a CRAN submission now, following your earlier indication that this does not conflict with the review. I will maintain a note at the top of the README.md file stating the the package is under review and that review may result in API changes.

adamhsparks · 2022-10-14T12:17:00Z

@ropensci-review-bot assign @rkillick as editor

ropensci-review-bot · 2022-10-14T12:17:03Z

Assigned! @rkillick is now the editor

noamross · 2022-12-02T15:54:51Z

@ropensci-review-bot assign @noamross as editor

ropensci-review-bot · 2022-12-02T15:54:55Z

Assigned! @noamross is now the editor

noamross · 2022-12-02T15:56:31Z

Hello @SebKrantz, I wanted to apologize for the delay on this. One of our editors has needed to step back for a bit so I'll be taking over and seeking reviewers.

noamross · 2022-12-02T15:57:29Z

@ropensci-review-bot seeking reviewers

ropensci-review-bot · 2022-12-02T15:57:30Z

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/556_status.svg)](https://github.com/ropensci/software-review/issues/556)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

SebKrantz · 2022-12-03T15:40:19Z

Hi @noamross, no problem. I added the review badge.

noamross · 2022-12-13T22:39:33Z

@ropensci-review-bot assign @eeholmes as reviewer

ropensci-review-bot · 2022-12-13T22:39:38Z

@eeholmes added to the reviewers list. Review due date is 2023-01-03. Thanks @eeholmes for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

noamross · 2022-12-14T12:38:46Z

@ropensci-review-bot assign @santikka as reviewer

ropensci-review-bot · 2022-12-14T12:38:49Z

@santikka added to the reviewers list. Review due date is 2023-01-04. Thanks @santikka for accepting to review! Please refer to our reviewer guide.

rOpenSci’s community is our best asset. We aim for reviews to be open, non-adversarial, and focused on improving software quality. Be respectful and kind! See our reviewers guide and code of conduct for more.

ropensci-review-bot · 2023-01-01T22:39:42Z

📆 @eeholmes you have 2 days left before the due date for your review (2023-01-03).

ropensci-review-bot · 2023-01-02T12:38:52Z

📆 @santikka you have 2 days left before the due date for your review (2023-01-04).

santikka · 2023-01-03T07:45:35Z

Package Review

As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Compliance with Standards

This package complies with a sufficient number of standards for a (bronze/silver/gold) badge
This grade of badge is the same as what the authors wanted to achieve

The following standards currently deemed non-applicable (through tags of @srrstatsNA) could potentially be applied to future versions of this software:

I'm currently unable to assess how well this package conforms to the standards for two reasons. First, all of the @srrstats tags are still included in the srr-stats-standards.R file and are not placed in appropriate places in the code. Second, the compliance of only a handful of the standards has been explained.

General Review

Documentation

The package includes all the following forms of documentation:

A statement of need clearly stating problems the software is designed to solve and its target audience in README
Installation instructions: for the development version of package and any non-standard dependencies in README
Community guidelines including contribution guidelines in the README or CONTRIBUTING
The documentation is sufficient to enable general use of the package beyond one specific use case

Algorithms

As I understand it, the one of the main advantages of this package is computational speed, making C++ a valid choice for implementing the core algorithms. I would like to see some tests to substantiate the performance claims of this package vs. packages such as MARSS and bayesdfa that were mentioned by the authors.

Testing

While the package has impressive test coverage, I found the suite of tests lacking especially with regards to the inputs for the exported functions. There are several instances where the inputs are not thoroughly checked, for example:

.VAR(diff(EuStockMarkets), -1)
 Error in `[.default`(x, (p + 1L - i):(TT - i), ) : 
only 0's may be mixed with negative subscripts

I'm also not convinced that the results of autotest have been properly accounted for. At the time of writing this review, running autotest on the package produces a data frame with 200 rows.

Visualisation (where appropriate)

Visualisations are mostly clear and aid the primary purposes of statistical interpretation of the results. I don't think there is a risk of statistical misinterpretation.

There is a small issue where the plot legend may overlap with the plotted values (for example when running plot(mod, type = "individual", method = "all") in the examples of plot.dfm), perhaps the authors could move the legend outside of the main figure area.

Package Design

Overall, the package seems mostly well designed for its intended purpose. The code has been thoroughly annotated making it easy to understand, although some of the lines are very long, and the indentation is sometimes inconsistent.

There are some curious design choices, like using switch instead of match.arg to check function arguments. Also, this part in DFM.R puzzled me:

# Quoting some functions that need to be evaluated iteratively
.EM_DGR <- quote(EMstepDGR(X, A, C, Q, R, F_0, P_0, cpX, n, r, sr, TT, rQi, rRi))
.EM_BM <- quote(EMstepBMOPT(X, A, C, Q, R, F_0, P_0, XW0, W, n, r, sr, TT, dgind, dnkron, dnkron_ind, rQi, rRi))
.KFS <- quote(SKFS(X, A, C, Q, R, F_0, P_0))

These expressions are then eval'd later, why exactly is this done as opposed to simply calling these functions directly?

I'm also curious as to what exactly is the logic in choosing which parts of the code are written in R and which in C++. It seems to me that there is quite a lot of matrix algebra going on in the R code. I would have expected that R would only be used to check and prepare the inputs for C++, and to provide outputs such as plots.

Print methods should return their respective argument objects invisibly (i.e., for dfm, dfm_summary, dfm_forecast, and ICr).

Since this is a non-tidyverse package, it is limited in terms of inter-operability in relation to other packages. For example, only base R graphics are available.

Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

This package does not conform to the guidelines in at least the following aspects:

Inconsistent use of assignment operators
No top-level documentation
No CITATION file
README should also show the results of the usage example (I suggest using .Rmd)

Estimated hours spent reviewing: 5

Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

noamross · 2023-02-21T14:20:54Z

Hi @eeholmes! We don't have a standard practice for this. People often open issues or PRs in their repositories to track the major changes underway. If you do this, its helpful to link back to this issue as then they will show up in this thread, as well.

ldecicco-USGS · 2024-02-28T15:38:42Z

Hi @SebKrantz and @noamross , I'm checking in on submissions that haven't seen much action.

Let me know if there is progress to report or any questions you may have. We can place this issue on "hold" if no work is expected to happen anytime soon. Otherwise, we should try to figure out a path forward.

ldecicco-USGS · 2024-02-28T15:48:35Z

Actually I am going to place it on hold so I know I've checked in, however it's very easy to switch it back to active!

ldecicco-USGS · 2024-02-28T16:43:50Z

@ropensci-review-bot put on hold

ropensci-review-bot · 2024-02-28T16:43:52Z

Submission on hold!

SebKrantz · 2024-02-28T19:47:01Z

Hi, so my general take is that I think the package is fairly robust and works well. I was still planning to extend its functionality towards mixed-frequency estimation, and have started some work in that regard, but currently have more important projects. When I do get to do that (this year I hope), I may also address some of the remaining form comments and add some of the requested methods. My interpretation of the reviews is that there were not substantial concerns about errors in the package. So yeah, happy to put things on hold until I get to work on it again.

ropensci-review-bot · 2024-05-28T16:43:56Z

@ldecicco-USGS: Please review the holding status

emilyriederer · 2024-11-10T23:45:08Z

Hi @SebKrantz - checking in with the rOpenSci editors team. Are you still actively working on this? Do you have line of sight to wanting to finish this review?

SebKrantz · 2024-11-11T10:00:33Z

@emilyriederer I'm still planning an update, but currently preparing my PhD defense and also on the Job Market so it may take some time. If all goes well I might have time over christmas or early next year. But feel free to archive this if this is not concrete enough.

maurolepore · 2025-02-02T18:33:12Z

Dear @SebKrantz this is to mark the start of my EiC rotation. I'm reviewing all open issues and noting what I see. Do you intend to continue?

I see one review has already been submitted here.
@eeholmes shared several comments but the formal review seems pending.
The author might be unavailable, so before insisting about the review it seems important to ensure the author intends to continue.
The submission has been holding for about 1 year. WARNING: Following our policy (below) the bot may close this issue soon.

https://devguide.ropensci.org/softwarereview_policies.html#policiesreviewprocess

The author can choose to have their submission put on hold (editor applies the holding label). The holding status will be revisited every 3 months, and after one year the issue will be closed.

SebKrantz · 2025-02-02T18:43:51Z

Dear @maurolepore, thanks! Yes, still planning to do this. Currently resubmitting collapse JSS article and I have a presentation on 10th Feb which I need to prepare. Hope there is some time afterwards.

maurolepore · 2025-02-02T19:01:50Z

@SebKrantz, thanks for your quick response. I'll try to reset the holding-clock by removing the tag and re-assigning it. This intends to stop the but from closing this issue just before you're ready to continue.

From what you say I assume you'll be touching base before the end of February. I look forward to hearing from you.

Best luck in upcoming deadlines.

maurolepore · 2025-02-02T19:03:24Z

@ropensci-review-bot put on hold

ropensci-review-bot · 2025-02-02T19:03:26Z

Submission on hold!

SebKrantz · 2025-02-28T11:18:28Z

I've started work on the package but will not be done today.

ropensci-review-bot · 2025-05-03T19:03:28Z

@maurolepore: Please review the holding status

maurolepore · 2025-05-04T20:56:13Z

@SebKrantz I'm wrapping up my EiC rotation so I'm leaving a few notes to hand over this submission to the next EiC:

I summarized the situation 3 months ago here.
Since then I reset the holding clock to avoid closing because the hold remained for 1 year but the @SebKrantz intended to resume right about then.
I see no more activity so we need to confirm the intention to continue or close.

In any case thanks a lot for sharing your work with rOpenSci and I look forward to working with you again whenever it's a good time.

SebKrantz · 2025-05-06T08:24:29Z

Hi @maurolepore. Thanks. I confess the time allotted in February/March was not sufficient to complete the work. There has been work on the repository and I released v0.3.0, which adds support for mixed frequency estimation. What is missing still is the mixed-freuency case with autoregressive errors, and of course the issues raised by the review, which I would interpret as minor. So, what I can offer to do is do some small changes/fixes in the spirit of the review next weekend and then we can end this with a Bronze badge. I'll then complete the intended functionality of the package at my own pace.

mpadge · 2025-05-06T10:21:32Z

That sounds great @SebKrantz. I'll take over from here, and look forward to hearing your updates in the near future.

SebKrantz · 2025-05-11T22:57:03Z

My response to the review by @santikka below. Will look into the one by @eeholmes during the week...

Package Review

I'm currently unable to assess how well this package conforms to the standards for two reasons. First, all of the @srrstats tags are still included in the srr-stats-standards.R file and are not placed in appropriate places in the code. Second, the compliance of only a handful of the standards has been explained.

I've placed (blocks of) them in the files now. I've provided some commentary under some of them, but in general: Inputs to DFM() are now rigorously checked, the package is robust to noise (its a factor model after all), supports missing values, but cannot be used with character or complex data and will likely not converge with infinite values.

Algorithms

As I understand it, the one of the main advantages of this package is computational speed, making C++ a valid choice for implementing the core algorithms. I would like to see some tests to substantiate the performance claims of this package vs. packages such as MARSS and bayesdfa that were mentioned by the authors.

Of course such tests would be nice, but I'm not sure the model of Banbura & Modugno (2014) can even be estimated with MARSS. In any case, MARSS is implemented fully in R, so the C++-based Kalman filter alone should give a significant speedup. Bayesian methods are also more expensive, but indeed here some tests would be interesting.

Testing

While the package has impressive test coverage, I found the suite of tests lacking especially with regards to the inputs for the exported functions. There are several instances where the inputs are not thoroughly checked, for example:

.VAR(diff(EuStockMarkets), -1)
Error in [.default(x, (p + 1L - i):(TT - i), ) :
only 0's may be mixed with negative subscripts

Thanks, I've added a condition to check that p is positive now.

I'm also not convinced that the results of autotest have been properly accounted for. At the time of writing this review, running autotest on the package produces a data frame with 200 rows.

TBH I have a problem understanding the results of this package. Take for example the Arguments rR and rQ in DFM() specifying the restrictions places on the observation and transition covariance matrix. They are checked as follows:

# Default values 
rQ = c("none", "diagonal", "identity")
rR = c("diagonal", "identity", "none")

# Input Checks
rRi <- switch(tolower(rR[1L]), identity = 0L, diagonal = 1L, none = 2L, stop("Unknown rR option:", rR[1L]))
rQi <- switch(tolower(rQ[1L]), identity = 0L, diagonal = 1L, none = 2L, stop("Unknown rQ option:", rQ[1L]))

Now autotest gives:

Why? I fail to understand the problem.

Visualisation (where appropriate)

Visualisations are mostly clear and aid the primary purposes of statistical interpretation of the results. I don't think there is a risk of statistical misinterpretation.

There is a small issue where the plot legend may overlap with the plotted values (for example when running plot(mod, type = "individual", method = "all") in the examples of plot.dfm), perhaps the authors could move the legend outside of the main figure area.

Thanks, this is a good suggestion. Ideal would be something like legend.position = "top" as in ggplot2 and then put it below the plot title. Not sure though how to implement that reliably using legend() in base R.

Package Design

Overall, the package seems mostly well designed for its intended purpose. The code has been thoroughly annotated making it easy to understand, although some of the lines are very long, and the indentation is sometimes inconsistent.

There are some curious design choices, like using switch instead of match.arg to check function arguments. Also, this part in DFM.R puzzled me:

switch is much more efficient. Probably not needed here, but just habitual from developing collapse where every microsecond counts.

Quoting some functions that need to be evaluated iteratively

.EM_DGR <- quote(EMstepDGR(X, A, C, Q, R, F_0, P_0, cpX, n, r, sr, TT, rQi, rRi))
.EM_BM <- quote(EMstepBMOPT(X, A, C, Q, R, F_0, P_0, XW0, W, n, r, sr, TT, dgind, dnkron, dnkron_ind, rQi, rRi))
.KFS <- quote(SKFS(X, A, C, Q, R, F_0, P_0))

These expressions are then eval'd later, why exactly is this done as opposed to simply calling these functions directly?

So the reason for doing this is that they are iteratively evaluated:

  em_res <- list()
  encl <- environment()
  while(num_iter < max.iter && !converged) {
  
    em_res <- eval(expr, em_res, encl)
    loglik <- em_res$loglik

    ## Iterate at least min.iter times
    if(num_iter < min.iter) converged <- FALSE else {
      converged <- em_converged(loglik, previous_loglik, tol, check.increased)
      if(check.increased) converged <- converged[1L] && !converged[2L]
    }

    previous_loglik <- loglik
    num_iter <- num_iter + 1L
    loglik_all[num_iter] <- loglik

  }

where expr is the function being called, e.g. quote(EMstepBMOPT(X, A, C, Q, R, F_0, P_0, XW0, W, n, r, sr, TT, dgind, dnkron, dnkron_ind, rQi, rRi)). Some of the function inputs are fixed: XW0, W, n, r, sr, TT, dgind, dnkron, dnkron_ind, rQi, rRi, the others are updated every iteration until convergence. I Don't know of a more parsimonious way to code this than quoting a function which is then evaluated against its own results object and the enclosing environment which contains the invariant objects and the initial estimates of the system matrices.

I'm also curious as to what exactly is the logic in choosing which parts of the code are written in R and which in C++. It seems to me that there is quite a lot of matrix algebra going on in the R code. I would have expected that R would only be used to check and prepare the inputs for C++, and to provide outputs such as plots.

Agreed, I've been lazy. The entire EM algorithm could also be written in C++. For now its mostly the Kalman Filter and Smoother. C++ is more tedious to write, so I've refrained from that until now. If I get more time on the package I'll rewrite some stuff in C++.

Print methods should return their respective argument objects invisibly (i.e., for dfm, dfm_summary, dfm_forecast, and ICr).

Thanks, I've done this.

Since this is a non-tidyverse package, it is limited in terms of inter-operability in relation to other packages. For example, only base R graphics are available.

As far as ggplots are concerned I would agree, but then it depends on what graphics system the user prefers. I believe the as.data.frame.dfm and as.data.frame.dfm_forecast methods which return tidy data frames make it quite interoperable. In general, I wanted to create a low-dependency and fast package which warranted a non-tidyverse approach.

* [ ]  **Packaging guidelines**: The package conforms to the rOpenSci packaging guidelines
This package does not conform to the guidelines in at least the following aspects:
* Inconsistent use of assignment operators

It turns out that = is slightly faster than <-, which is why it is used in the the EM algorithms which are called iteratively.

* No top-level documentation

Thanks, I added a package page now.

* No CITATION file

* README should also show the results of the usage example (I suggest using .Rmd)

Thanks, I executed the README examples. There is currently no dedicated article, so a CITATION file is presently not necessary.

Estimated hours spent reviewing: 5

* [x]  Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Thanks for investing the time! I added you as 'rev' to the description.

SebKrantz · 2025-05-17T22:48:12Z

And my responses to the review by @eeholmes:

Missing info on the description

https://sebkrantz.github.io/dfms/ and description on CRAN is missing key information that would allow a statistician who is not in econometrics to understand what this does. It would be good to add the following detail:

"It is a DFA where the factors follow a stationary VAR process of order p."

In my field, DFAs are random walks (A=identity) so I was puzzled as to why the A matrix was not identity and why the identifiability constraint was different. Then I read Banbura, M., & Modugno, M. and realized this a different type of DFA. You might also add that this type of DFA is (widely?) used in econometrics.

Thanks, I added "Factors are assumed to follow a stationary VAR process of order p" to the DESCRIPTION and README. There is now also a new theoretical vignette explaining the model in greater detail.

Minor

Readme and vignettes code examples use "=" but R style uses "<-" for assignment.

Thanks, fixed.

https://sebkrantz.github.io/dfms/reference/DFM.html says it returns a class dfm which is "list-like". like? I looks like a list not "list-like". If it is a list then set class to c("dfm", "list") so that list methods recognize it as a list.

This is interesting, but I don't see base R lm() etc. append the list class to its result object. I am hesitant to do something that is not common practice.

?dfms did not pull up the help page. Is there no help page for the package?

I added it now.

https://sebkrantz.github.io/dfms/reference/SKFS.html
The covariance equation has a typo. I think you meant F^{smooth}_t and left out the { }

Thanks, fixed.

Kalman filter is not only for stationary data

https://sebkrantz.github.io/dfms/reference/SKF.html
https://sebkrantz.github.io/dfms/reference/SKFS.html

The Kalman filter and smoother are not restricted to stationary data. A can be identity (random walks).

You have to know a lot about state-space models and the Kalman filter and smoother to understand the documentation. However it does cite the references for the user.

Thanks. I agree that I didn't consider the Random Walk case. What I meant here is that the Kalman Filter does not model trends or intercepts in the data. The naming convention SKF = Stationary Kalman Filter follows Banbura & Modugno (2014), and I don't want to change function names now. In response to this comment, I added a qualifier: "(or random walks - data should be mean zero and without a trend)" to the Description.

Add mention of the data assumptions

The model is stationary but no stationarity tests for the data or mention of the importance of this. Also the data must have 0 mean due to no mean in the model.

I would make this quite clear for the user and point them towards tests for stationarity.

Thanks. The data must not have mean zero because it is scaled and centered in DFM(). I have made this more explicit in the documentation now. I don't plan to provide stationarity tests in this package, and as you mentioned, it would also work with a random walk. Estimation works even with mild trends, noting of course that the functional form doesn't reflect this. I also don't provide inference procedures for the factors (yet), so testing is not very critical - eyeballing the data should suffice. In economics there are certain conventions, e.g. interest/inflation rates in levels or differences, and all other variables in log-differences --- as shown in the examples. This generally produces stationary data.

no coef() and logLik() functions

Not big deal but I would expect those for a statistics package.

Thanks. I added them now.

class for DFA output

The class is shown as "dfm", but it appears to be a list also. In this case, one should prepend "dfm" but keep "list" also so that list methods work.

Thanks. However, I don't know of any major model class where this is the case. It is also not the case for data.frame. The default method usually gives the correct behavior on the list-vector.

The method of imputation for X_imp is not given

https://sebkrantz.github.io/dfms/reference/DFM.html

In the value section, X_imp is one of the values returned, but how it is imputed is not given. Is this the Kalman smoother output from the estimated model? Or something else? tsnarmimp?

Thanks. I have given some details. Basically, this is tsnarmimp(scale(X), ...).

No info on the residuals

No information on what type of residuals are returned. Link is to the vars residuals() function, but that has no information either. I am assuming that the economics DFA field, there are a standard type of residuals being used, but would be helpful to say which ones. Note, I'll add the options later. Need to look them up.

Thanks, ?resid.dfm now states that the residuals are computed as x_t - C*F_t, where C is the observation matrix.

No info on fitted.

Same re fitted(). No info in the vars help page either. For conditional state-space models, the meaning of "fitted" can have multiple definitions. I can make a good guess, but it'd be better to be explicit by saying. I'll add an example later

Same here, it is C*F_t, where C is the observation matrix - stated in the Description now.

Nothing mentioned about the F_0 and P_0 assumptions

Different algorithms and authors take different approaches. Which one is used here? I'll need to read Banbura, M., & Modugno, M. to see what they do.

I added the sentence: "The filter is initialized with PCA estimates on the imputed dataset—see ?SKFS for a complete example." In short, F_0 is the first row of the PCA estimates. P_0 is inv(I - kron(A, A)) * Q.

What's going on with the rotation?

Why is C not returned with the identifiability constraint? For my test case, my data has missing values so Banbura, M., & Modugno, M. should have been used. They specify a identifiability contraint with the $r \times r$ matrix at top of C set to identity. But the returned C had all values. So that means there is a rotation being done post estimation. Need to look that up and that info should be added to the DFM() details.

Not exactly sure what the problem here is. Let me just make one argument. Doz et al. (2012) argue that as the number of series n tends to infinity with the number of periods T kept fixed, the dynamic factors become identical to PCA estimates. But principal components are orthogonal only to each other, not to the data. In this case C resembles the loadings matrix (eigenvectors) from PCA, which is a dense matrix. And yes, in PCA we can rotate the PC's to better match certain characteristics (Varimax etc.). DFM() does no such thing, just the default PC's (eigen(cov(X))) are used to initialize the Kalman filter and the final factors generally follow the same pattern --- see the example at ?SKFS.

Installation

Installation failed on my M1 Mac at first. Probably just my machine set-up's fault.

ld: warning: directory not found for option '-L/opt/R/arm64/gfortran/lib/gcc/aarch64-apple-darwin20.6.0/12.0.1'
ld: warning: directory not found for option '-L/opt/R/arm64/gfortran/lib'
ld: library not found for -lgfortran
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [dfms.so] Error 1
ERROR: compilation failed for package ‘dfms’

Problem: no fortran compiler for M1 Macs in XCode. Problem discussed here: https://mac.r-project.org/tools/

Solution posted here RubD/Giotto_site#11

# for R>=4.2.0
curl -O https://mac.r-project.org/tools/gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz

# unpack
sudo tar fxz gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz -C /

# /opt/R/arm64/gfortran/SDK has to point to your macOS SDK
# EEH: this didn't work for me but was able to install anyhow and run example
sudo gfortran-update-sdk

Thanks, this seems like an old Rcpp or RcppArmadillo issue. dfms has no Fortran source code.

Overall, highly appreciate the efforts --- even though this econometrics literature, which I closely follow with this package, may not be a statisticians treatment of DFMs. But it works and is computationally efficient. I also added you as 'rev' to the description now.

SebKrantz · 2025-05-18T00:10:57Z

@mpadge I believe that completes my revision. The changes are in the main branch.

mpadge · 2025-05-20T08:04:08Z

Great, thanks @SebKrantz. Now back to both of you @eeholmes @santikka Could you both:

Have a look over @SebKrantz's comments both here and here,
Compare comments with package updates to confirm all issues have been satisfactorily addressed.
Feel free to comment here about any issues that may remain, and then once everything has finally been resolved:
Paste the reviewer approval template in a separate comment, fill out the details, and click "Comment" here.

Thanks in advance, and thank you @SebKrantz for getting this longstanding review into motion again 🚀

santikka · 2025-05-22T13:33:15Z

Reviewer Response

All of my comments have been addressed.

Final approval (post-review)

The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5 (original) + 1 (revision)

adamhsparks self-assigned this Oct 14, 2022

ropensci-review-bot assigned rkillick Oct 14, 2022

ropensci-review-bot added the 1/editor-checks label Oct 14, 2022

adamhsparks removed their assignment Oct 14, 2022

noamross self-assigned this Dec 2, 2022

ropensci-review-bot unassigned rkillick Dec 2, 2022

ropensci-review-bot added 2/seeking-reviewer(s) and removed 1/editor-checks labels Dec 2, 2022

ropensci-review-bot added 3/reviewer(s)-assigned and removed 2/seeking-reviewer(s) labels Dec 14, 2022

ldecicco-USGS added holding and removed holding labels Feb 28, 2024

ropensci-review-bot added the holding label Feb 28, 2024

mpadge added the stats label Mar 20, 2024

maurolepore removed the holding label Feb 2, 2025

ropensci-review-bot added the holding label Feb 2, 2025

Uh oh!

dfms: Efficient Estimation of Dynamic Factor Models for R #556

dfms: Efficient Estimation of Dynamic Factor Models for R #556

Comments

SebKrantz commented Oct 7, 2022 • edited by ropensci-review-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Due date for @santikka: 2023-01-04 Archive: TBD Version accepted: TBD Language: en

Scope

Pre-submission Inquiry

General Information

Badging

Technical checks

Publication options

Code of conduct

ropensci-review-bot commented Oct 7, 2022

Uh oh!

ropensci-review-bot commented Oct 7, 2022

Uh oh!

ropensci-review-bot commented Oct 7, 2022

Checks for dfms (v0.1.2)

1. rOpenSci Statistical Standards (srr package)

2. Package Dependencies

3. Statistical Properties

3a. Network visualisation

4. goodpractice and other checks

3a. Continuous Integration Badges

3b. goodpractice results

R CMD check with rcmdcheck

Test coverage with covr

Cyclocomplexity with cyclocomp

Static code analyses with lintr

5. Other Checks

Editor-in-Chief Instructions:

Uh oh!

adamhsparks commented Oct 8, 2022

Uh oh!

SebKrantz commented Oct 9, 2022

Uh oh!

adamhsparks commented Oct 14, 2022

Uh oh!

ropensci-review-bot commented Oct 14, 2022

Uh oh!

noamross commented Dec 2, 2022

Uh oh!

ropensci-review-bot commented Dec 2, 2022

Uh oh!

noamross commented Dec 2, 2022

Uh oh!

noamross commented Dec 2, 2022

Uh oh!

ropensci-review-bot commented Dec 2, 2022

Uh oh!

SebKrantz commented Dec 3, 2022

Uh oh!

noamross commented Dec 13, 2022

Uh oh!

ropensci-review-bot commented Dec 13, 2022

Uh oh!

noamross commented Dec 14, 2022

Uh oh!

ropensci-review-bot commented Dec 14, 2022

Uh oh!

ropensci-review-bot commented Jan 1, 2023

Uh oh!

ropensci-review-bot commented Jan 2, 2023

Uh oh!

santikka commented Jan 3, 2023

Package Review

Compliance with Standards

General Review

Documentation

Algorithms

Testing

Visualisation (where appropriate)

Package Design

Uh oh!

noamross commented Feb 21, 2023

Uh oh!

ldecicco-USGS commented Feb 28, 2024

Uh oh!

SebKrantz commented Oct 7, 2022 •

edited by ropensci-review-bot

Loading

Due date for @santikka: 2023-01-04
Archive: TBD
Version accepted: TBD
Language: en

1. rOpenSci Statistical Standards (`srr` package)

4. `goodpractice` and other checks

3b. `goodpractice` results

`R CMD check` with rcmdcheck

maurolepore commented Feb 2, 2025 •

edited

Loading

maurolepore commented May 4, 2025 •

edited

Loading

SebKrantz commented May 17, 2025 •

edited

Loading

SebKrantz commented May 18, 2025 •

edited

Loading