Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose a faster version of check_installed() #1776

Closed
shikokuchuo opened this issue Feb 12, 2025 · 11 comments · Fixed by #1780
Closed

Propose a faster version of check_installed() #1776

shikokuchuo opened this issue Feb 12, 2025 · 11 comments · Fixed by #1780

Comments

@shikokuchuo
Copy link
Member

This function provides a much better user experience for code using 'suggests' packages over some formulation of
if (requireNamespace("x", quietly = TRUE)).

However, it's much more expensive to use (a full 3 orders of magnitude):

bench::mark(rlang::check_installed("later"), requireNamespace("later", quietly = TRUE), check = FALSE)
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "rlang::check_installed(\"later\")"  397µs  443µs     2189.    5.54MB     17.0
#> 2 "requireNamespace(\"later\", quietl… 558ns  613ns  1294407.        0B      0

Created on 2025-02-12 with reprex v2.1.1

In purrr, I employ caching to ensure it's not run time every time a function is called e.g. https://github.com/tidyverse/purrr/blob/main/R/map.R#L214-L217 but I notice in other codebases it's often used unconditionally at the top of a function body (FYI @hadley).

Rather than require the user to implement these optimizations, it might be nice to handle this at source by short-cutting detection if the package is already loaded. This can be easily done using .getNamespace(), a really cheap function which returns either the namespace environment if already loaded or else NULL.

Note that this would be different to current behaviour as check_installed() actually checks if the package is installed on disk (which contributes to the slowness). Hence a question is: do we hide this behind an argument, or would it be cleaner to introduce another function e.g. check_available()?

@hadley
Copy link
Member

hadley commented Feb 12, 2025

I think we could assume that if the package is loaded, then it's installed.

It might just be that check_installed() could use a different method than detect_installed(); looking at the logic there I suspect we had to adapt it to the "partly" loaded case that you sometimes get when loading dependencies, and that distinction might not matter for check_installed(). (But presumably matters elsewhere)

@shikokuchuo
Copy link
Member Author

I think we could assume that if the package is loaded, then it's installed.

If you load the package e.g. call pkg::function(), then uninstall the package, the package is still loaded and usable, but technically no longer installed. I can't think of a situation where this distinction matters though.

It might just be that check_installed() could use a different method than detect_installed(); looking at the logic there I suspect we had to adapt it to the "partly" loaded case that you sometimes get when loading dependencies, and that distinction might not matter for check_installed(). (But presumably matters elsewhere)

I don't see detect_installed() being used elsewhere (other than is_installed()) and is not exported. So that led me to believe that the current behaviour is intentional - but I guess it's probably to ensure robust behaviour when checking versions of installed packages.

In light of the above, I think it'd safe to make the "if the package is loaded, then it's installed" assumption if only package names are supplied.

@hadley
Copy link
Member

hadley commented Feb 12, 2025

Can you do a little exploration of git blame just to see if you can figure out why this was added? Want to check for Chesterton's fence.

@shikokuchuo
Copy link
Member Author

This reported issue: #1561 is the reason it's checked if the package is installed on disk. Which seems quite improbable... But as a result, almost half the time is spent on this disk check and the other half on version comparison.

I think the solution is a re-factoring of is_installed() and check_installed() to have a fast branch when a package version is not supplied. This would preserve the existing logic in detect_installed() minus the version check and disk check. This branch would be made to succeed in the #1561 scenario and hence not run into that issue.

@hadley
Copy link
Member

hadley commented Feb 12, 2025

It was line 82 that I was thinking of; I agree that deleting a loaded package is a relatively esoteric case that we don't need to protect against.

If we really believe that this case is worth thinking about then, we could also provide a faster implementation:

bench::mark(
  suppressWarnings(nzchar(system.file(package = "ggplot2"))),
  file.exists(find.package("ggplot2")),
  any(file.exists(file.path(.libPaths(), "ggplot2")))
)
#> # A tibble: 3 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 "file.exists(find.package(\"gg…  98.56µs 110.33µs     8721.    7.88KB     12.2
#> 2 "suppressWarnings(nzchar(syste…    100µs 108.49µs     8996.    2.08KB     14.3
#> 3 "any(file.exists(file.path(.li…   1.93µs   2.21µs   442941.        0B      0

Created on 2025-02-12 with reprex v2.1.0

@shikokuchuo
Copy link
Member Author

Very nice. I also use (3) in mirai. Let's take this easy win - I'll open up a PR.

I think there's still room for a refactor, but I want to make sure it won't break anything before I propose anything.

@shikokuchuo
Copy link
Member Author

Ah one of the main reasons for the rest of the slowness is the regex parsing to support for the 'pkg' argument:

The package names. Can include version requirements, e.g. "pkg (>= 1.0.0)".

Changing that would break existing behaviour. I expect it supports parsing the DESCRIPTION requirements.

Hence I wonder if having another faster version would be a good idea, where the inputs are well-defined.

@hadley
Copy link
Member

hadley commented Feb 14, 2025

I wonder if we could include a special case for simple package names? Or maybe we could return immediately if any(file.exists(file.path(.libPaths(), pkg))) is TRUE, and then only parse if that fails?

@lionel-
Copy link
Member

lionel- commented Feb 19, 2025

I wonder if we could include a special case for simple package names

I like this idea.

Or maybe we could return immediately if any(file.exists(file.path(.libPaths(), pkg))) is TRUE, and then only parse if that fails?

I'm not sure I understand but it seems like bypassing the version check when the package is installed defeats the purpose of supplying a minimum version?

@hadley
Copy link
Member

hadley commented Feb 19, 2025

I just meant that any(file.exists(file.path(.libPaths(), "ggplot2 >=2.0.0)) isn't ever going to return TRUE.

@shikokuchuo
Copy link
Member Author

That's interesting. By the same measure, is.environment(.getNamespace("ggplot2 >=2.0.0"))) will never return TRUE either.

I'm going to put my shortcut proposal in a PR for review. This will prevent the full code path being run repeatedly within a session.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants