implementing parquet filetype? #36

Open
njtierney opened this issue Mar 16, 2024 · 5 comments
Comments

@njtierney
Collaborator

As mentioned in #4, e.g.

tar_sf_vector(filetype = "parquet")

@brownag
Contributor

brownag commented Mar 18, 2024

So far, the following works for terra SpatVector objects via the GDAL (Geo)Parquet driver:

library(targets)

tar_script({
    list(
        geotargets::tar_terra_vect(test_terra_parquet,
                                   terra::vect(system.file("ex", "lux.shp", package = "terra")),
                                   filetype = "Parquet")
    )
})

tar_make()
#> Loading required namespace: terra
#> ▶ dispatched target test_terra_parquet
#> ● completed target test_terra_parquet [0.012 seconds]
#> ▶ ended pipeline [0.095 seconds]
x <- tar_read(test_terra_parquet)
x
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : test_terra_parquet
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

terra::describe(tar_path_target(test_terra_parquet))
#> [1] "Driver: Parquet/(Geo)Parquet"              
#> [2] "Files: _targets/objects/test_terra_parquet"
#> [3] "Size is 512, 512"                          
#> [4] "Corner Coordinates:"                       
#> [5] "Upper Left  (    0.0,    0.0)"             
#> [6] "Lower Left  (    0.0,  512.0)"             
#> [7] "Upper Right (  512.0,    0.0)"             
#> [8] "Lower Right (  512.0,  512.0)"             
#> [9] "Center      (  256.0,  256.0)"
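
The example above depends on the local GDAL build including the optional (Geo)Parquet driver. A minimal sketch for checking this before relying on filetype = "Parquet", assuming the driver is registered under the short name "Parquet" (as the describe() output suggests) and that both driver tables expose a name column:

```r
# List the drivers compiled into the GDAL build that terra links against.
terra_drivers <- terra::gdal(drivers = TRUE)
gdal_has_parquet <- "Parquet" %in% terra_drivers$name
gdal_has_parquet

# The same check for the GDAL build that sf links against,
# which may differ from terra's on some installations.
"Parquet" %in% sf::st_drivers()$name
```

If either check returns FALSE, writing with the Parquet driver would be expected to fail for that package regardless of how the target is defined.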

Still need to implement analogous methods for {sf} objects via #13.

Also, we may want to implement a variant that uses the write methods from {arrow} (re: #2), as this may be more efficient for larger targets. Would be interesting to benchmark GDAL vs. Arrow

@Aariq
Collaborator

Aariq commented Mar 18, 2024

Would be interesting to benchmark GDAL vs. Arrow

I think benchmarking is definitely part of the plan once things are somewhat stable. Would be good to give users an idea of the tradeoffs in speed, size, and dependency requirements.

@Aariq
Collaborator

Aariq commented Oct 3, 2024

Confirming that parquet doesn't work "out of the box" with just targets and sf:

library(targets)
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
tar_dir({
    tar_script({
        library(targets)
        library(sf)
        library(arrow)
        list(
            tar_target(nc, st_read(system.file("shape/nc.shp", package="sf")), format = "parquet")
        )
    })
    tar_make()
    tar_read(nc)
})
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
#> 
#> Attaching package: ‘arrow’
#> 
#> The following object is masked from ‘package:utils’:
#> 
#>     timestamp
#> 
#> ▶ dispatched target nc
#> Reading layer `nc' from data source 
#>   `/Users/ericscott/Library/R/x86_64/4.4/library/sf/shape/nc.shp' 
#>   using driver `ESRI Shapefile'
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> Geodetic CRS:  NAD27
#> ✖ errored target nc
#> ✖ errored pipeline [0.226 seconds]
#> 
#> Error:
#>    ! targets::tar_make() error
#>
#> ── Debug target nc ──────────────────────────────────────────────────────────────────────────────────────────────────────
#> tar_meta(nc)$error
#> tar_workspace(nc)
#>
#> ── General debugging ────────────────────────────────────────────────────────
#> • tar_errored()
#> • tar_meta(fields = any_of("error"), complete_only = TRUE)
#> • tar_workspace()
#> • tar_workspaces()
#>
#> ── How to ────────────────────────────────────────────────────────
#> • Debug: https://books.ropensci.org/targets/debugging.html
#> • Help: https://books.ropensci.org/targets/help.html
#>
#> ── Last error message ──────────────────────────────────────────────────────
#> _store_ Can't infer Arrow data type from object inheriting from XY / MULTIPOLYGON / sfg
#>
#> ── Last error traceback ────────────────────────────────────────────────────────
#>    No traceback available.

Created on 2024-10-03 with reprex v2.1.1
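
The error comes from the built-in "parquet" format handing the sf geometry column to arrow, which cannot infer a type for it. One possible workaround, sketched here as an assumption and not as what geotargets does, is a custom format built with targets::tar_format() that writes and reads through GDAL's Parquet driver. The name format_sf_parquet is hypothetical, and the sketch assumes the local GDAL build includes that driver and can identify the extensionless file that targets stores:

```r
# Contents of a _targets.R file.
library(targets)

# Hypothetical custom format: store sf objects with GDAL's Parquet driver
# instead of the arrow-based writer behind format = "parquet".
format_sf_parquet <- tar_format(
    read = function(path) sf::st_read(path, quiet = TRUE),
    write = function(object, path) {
        # The driver is named explicitly because targets' storage paths
        # have no file extension for GDAL to guess from.
        sf::st_write(object, path, driver = "Parquet", quiet = TRUE)
    }
)

list(
    tar_target(
        nc,
        sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE),
        format = format_sf_parquet
    )
)
```

This has not been run against the GDAL 3.5.3 build shown above, so whether the Parquet driver is available there is an open question.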

@cedricr

cedricr commented Oct 3, 2024

Would it be different from the tar_parquet factory from tarchetypes?
https://docs.ropensci.org/tarchetypes/reference/tar_formats.html

@Aariq
Collaborator

Aariq commented Oct 3, 2024

I can investigate, but I think all of the tar_<filetype>() functions are just shortcuts for tar_target() with the format arg already set.
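
If that assumption holds, the two definitions below would be equivalent, and tarchetypes::tar_parquet() would hit the same arrow error as the reprex above when given an sf object. A minimal sketch using a plain data frame, which the built-in format is designed for (the target names cars_a and cars_b are illustrative only):

```r
library(targets)
library(tarchetypes)

# A target with the storage format set explicitly ...
cars_a <- tar_target(cars_a, mtcars, format = "parquet")

# ... and the tarchetypes shortcut, assumed here to set format = "parquet"
# on the user's behalf.
cars_b <- tar_parquet(cars_b, mtcars)

# A _targets.R file would end with the list of target definitions.
list(cars_a, cars_b)
```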
