
Introduction & installation

This package is EXPERIMENTAL. It enables access to Globus collection management (including ‘personal’ collections on your own computer) and to file and directory transfer. The functions implemented in this package are primarily from the Globus ‘Transfer’ API, documented at https://docs.globus.org/api/transfer/. Many other capabilities of Globus are not implemented.

Install the package if necessary

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes", repos = "https://CRAN.R-project.org")
remotes::install_github("mtmorgan/rglobus")

Attach the package to your R session
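
library(rglobus)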

Local collections

Globus provides software to allow your laptop to appear as a collection. Follow the Globus Connect Personal installation instructions for your operating system, and launch the application. Then identify a location on your local disk to act as a collection.

On macOS, I created a directory ~/tmp/HuBMAP, and then configured Globus Connect Personal to share that location. I did this by launching the application (it appears as an icon in the menu bar), then selecting ‘Preferences’ and using the ‘-’ and ‘+’ buttons to add the path to my local collection.

Discovering and navigating collections

The functions discussed here are based on the APIs described in Endpoints and Collections and Endpoint and Collection Search.

Globus data sets are organized into collections. Start by discovering collections that contain the words “HuBMAP” and “Public”, in any order.

hubmap_collections <- collections("HuBMAP Public")
hubmap_collections
## # A tibble: 4 × 2
##   display_name        id                                  
##   <chr>               <chr>                               
## 1 HuBMAP Public       af603d86-eab9-4eec-bb1d-9d26556741bb
## 2 HuBMAP Dev Public   2b82f085-1d50-4c93-897e-cd79d77481ed
## 3 HuBMAP Stage Public 4b383482-8c5c-48fb-8b80-a450338ca383
## 4 HuBMAP Test Public  1e12c423-dd17-4095-a9f8-12555ee83345

Authentication

The first time collections() is invoked, Globus requires that you authenticate. A web page opens where you can choose to authenticate via a generic identity provider such as ORCID or Google, or through an institution you belong to. Follow the browser prompt(s) and return to R when done.

Collection content

Each collection is presented as directories and files. Focus on the ‘HuBMAP Public’ collection.

hubmap <-
    hubmap_collections |>
    dplyr::filter(display_name == "HuBMAP Public")

List the content of the collection.

globus_ls(hubmap)
## # A tibble: 2,308 × 4
##    name                             last_modified              size type 
##    <chr>                            <chr>                     <int> <chr>
##  1 0008a49ac06f4afd886be81491a5a926 2024-07-26 13:35:21+00:00  4096 dir  
##  2 0027cb59bcb4a34d5db83acaf934a9d9 2024-07-26 13:19:12+00:00  4096 dir  
##  3 002e9747855eef1e69452b39713a7592 2024-08-13 19:04:13+00:00  4096 dir  
##  4 00318be0b7cfa3c6ed7fbeab08fe700b 2024-07-26 12:58:42+00:00  4096 dir  
##  5 004d4f157df4ba07356cd805131dfc04 2024-08-13 19:21:04+00:00  4096 dir  
##  6 0066713ca95c03c52cb40f90ce8bbdb8 2024-04-17 23:24:41+00:00  4096 dir  
##  7 007ae59344e7df0e398204ee40155cb0 2024-04-17 23:18:56+00:00  4096 dir  
##  8 007f3dfaaa287d5c7c227651f61a9c5b 2024-07-26 14:46:55+00:00  4096 dir  
##  9 00cc71c7e1cddac60e794044079faeee 2024-07-26 14:17:08+00:00  4096 dir  
## 10 00d1a3623dac388773bc7780fcb42797 2024-04-17 23:02:27+00:00  4096 dir  
## # ℹ 2,298 more rows

There are 2308 records, each corresponding to a HuBMAP dataset. Information about each dataset (e.g., the dataset named 0008a49ac06f4afd886be81491a5a926) is available from the HuBMAP data portal, or using the HuBMAPR package. Usually, one would ‘discover’ the dataset name using the data portal or HuBMAPR, and then use rglobus to further explore its content.
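
For instance, a minimal sketch of this discovery step, assuming (it is not verified here) that the Bioconductor HuBMAPR package provides a datasets() accessor whose uuid column corresponds to the directory names above:

## a sketch, not evaluated here; assumes HuBMAPR::datasets() returns a
## tibble with a 'uuid' column matching the 'HuBMAP Public' directory names
library(HuBMAPR)
uuids <-
    datasets() |>
    dplyr::pull(uuid)
"0008a49ac06f4afd886be81491a5a926" %in% uuids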

List the content of a dataset of interest by adding a path to globus_ls(), e.g.,

path <- "0008a49ac06f4afd886be81491a5a926"
globus_ls(hubmap, path)
## # A tibble: 13 × 4
##    name                last_modified                 size type 
##    <chr>               <chr>                        <int> <chr>
##  1 anndata-zarr        2021-11-21 09:40:50+00:00     4096 dir  
##  2 for-visualization   2021-11-21 09:41:41+00:00     4096 dir  
##  3 n5                  2021-11-21 09:35:26+00:00     4096 dir  
##  4 ometiff-pyramids    2021-11-21 09:35:26+00:00     4096 dir  
##  5 output_json         2021-11-21 09:39:13+00:00     4096 dir  
##  6 output_offsets      2021-11-21 09:37:00+00:00     4096 dir  
##  7 sprm_outputs        2021-11-21 09:15:36+00:00    12288 dir  
##  8 stitched            2021-11-21 04:44:53+00:00     4096 dir  
##  9 experiment.yaml     2021-11-21 04:32:41+00:00     1721 file 
## 10 metadata.json       2024-07-26 13:35:21+00:00    90481 file 
## 11 pipelineConfig.json 2021-11-21 04:58:21+00:00     5422 file 
## 12 session.log         2021-11-21 09:40:55+00:00 18699854 file 
## 13 symlinks.tar        2021-11-21 09:17:12+00:00   133120 file

The dataset consists of files and directories; further explore the content of individual directories by constructing the appropriate path.

path <- paste(path, "anndata-zarr", sep = "/")
globus_ls(hubmap, path)
## # A tibble: 1 × 4
##   name                                   last_modified              size type 
##   <chr>                                  <chr>                     <int> <chr>
## 1 reg1_stitched_expressions-anndata.zarr 2021-11-21 09:40:52+00:00  4096 dir

path <- paste(path, "reg1_stitched_expressions-anndata.zarr", sep = "/")
hubmap |> globus_ls(path)
## # A tibble: 5 × 4
##   name   last_modified              size type 
##   <chr>  <chr>                     <int> <chr>
## 1 X      2021-11-21 09:40:51+00:00  4096 dir  
## 2 layers 2021-11-21 09:40:55+00:00  4096 dir  
## 3 obs    2021-11-21 09:40:52+00:00  4096 dir  
## 4 obsm   2021-11-21 09:40:52+00:00  4096 dir  
## 5 var    2021-11-21 09:40:52+00:00  4096 dir

Note in the last example that the functions in rglobus are designed to support ‘piping’.
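
For example, the discovery and listing steps above can be combined into a single pipeline (a sketch using only functions already illustrated; not evaluated here):

collections("HuBMAP Public") |>
    dplyr::filter(display_name == "HuBMAP Public") |>
    globus_ls("0008a49ac06f4afd886be81491a5a926/anndata-zarr")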

Local collections

Collections owned by you appear in collections(), but the convenience function my_collections() provides another way to access these.

my_collections <- my_collections()
my_collections
## # A tibble: 1 × 2
##   display_name    id                                  
##   <chr>           <chr>                               
## 1 Martin's HuBMAP 714ce2c4-3268-11ef-9629-453c3ae125a5

One aspect of local collections is that the path needs to be specified either from the root (starting with /) or relative to the user’s home directory.

globus_ls(my_collections, "/Users/ma38727/tmp")
## # A tibble: 1 × 4
##   name   last_modified              size type 
##   <chr>  <chr>                     <int> <chr>
## 1 HuBMAP 2024-08-14 21:09:38+00:00    96 dir

path <- "tmp"
globus_ls(my_collections, path)
## # A tibble: 1 × 4
##   name   last_modified              size type 
##   <chr>  <chr>                     <int> <chr>
## 1 HuBMAP 2024-08-14 21:09:38+00:00    96 dir

## nothing here yet...
path <- "tmp/HuBMAP"
globus_ls(my_collections, path)
## # A tibble: 1 × 4
##   name                             last_modified              size type 
##   <chr>                            <chr>                     <int> <chr>
## 1 d1dcab2df80590d8cd8770948abaf976 2024-08-14 21:09:38+00:00    96 dir

The local collection is under our ownership, so it is possible to, e.g., create a directory; mkdir() returns the updated directory listing of the enclosing folder.

mkdir(my_collections, "tmp/HuBMAP/test")
## # A tibble: 2 × 4
##   name                             last_modified              size type 
##   <chr>                            <chr>                     <int> <chr>
## 1 d1dcab2df80590d8cd8770948abaf976 2024-08-14 21:09:38+00:00    96 dir  
## 2 test                             2024-08-14 21:09:44+00:00    64 dir

Of course, we could also have used operating system tools to create the directory directly in the local collection’s path.
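
For example, on the machine hosting the local collection, base R alone suffices (a sketch equivalent in effect to the mkdir() call above):

## create the directory directly on the local file system
dir.create("~/tmp/HuBMAP/test", recursive = TRUE, showWarnings = FALSE)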

Directory and file transfer

Directory and file transfer are described in the Task Submission API.

Start this section by ensuring we have references to the HuBMAP and our own collections.

hubmap <-
    hubmap_collections |>
    dplyr::filter(display_name == "HuBMAP Public")
my_collections <- my_collections()

We illustrate directory and file transfer with a specific HuBMAP dataset. Here are the dataset identifier and its directory content.

hubmap_dataset <- "d1dcab2df80590d8cd8770948abaf976"
globus_ls(hubmap, hubmap_dataset)
## # A tibble: 5 × 4
##   name                                          last_modified         size type 
##   <chr>                                         <chr>                <int> <chr>
## 1 extras                                        2023-01-12 15:32:24…  4096 dir  
## 2 imzML                                         2023-01-04 17:56:12…  4096 dir  
## 3 ometiffs                                      2023-01-04 17:58:26…  4096 dir  
## 4 d1dcab2df80590d8cd8770948abaf976-metadata.tsv 2023-01-12 15:32:21…  1901 file 
## 5 metadata.json                                 2024-04-17 23:24:32… 34814 file

File transfer

Use copy() to transfer files or directories between collections. We start with the metadata.json file, and transfer it to the ‘test’ directory in our local collection. Define the source and destination paths in the two collections.

source_path <-
    paste(hubmap_dataset, "metadata.json", sep = "/")
destination_path <-
    paste("tmp/HuBMAP/test", basename(source_path), sep = "/")

Globus anticipates that large data transfers may be involved, so the copy() operation actually submits a task that runs asynchronously. Globus has confidence in the robustness of its file transfer and adopts a ‘fire and forget’ philosophy: the task will eventually succeed or fail, perhaps after overcoming intermittent network or other issues.

The ... optional arguments to copy() are the same as the arguments of the lower-level transfer() function. By default, transfer tasks send email on completion; here we suppress the success notification and instead check on task progress ourselves. Each task can be labeled; the default label is provided by transfer_label().

task <- copy(
    hubmap, my_collections,        # collections
    source_path, destination_path, # paths
    notify_on_succeeded = FALSE
)
task |>
    dplyr::glimpse()
## Rows: 1
## Columns: 3
## $ submission_id <chr> "88599a67-5a81-11ef-be9c-83cd94efb466"
## $ task_id       <chr> "88599a66-5a81-11ef-be9c-83cd94efb466"
## $ code          <chr> "Accepted"

We anticipate that code is “Accepted”, indicating that the task is sufficiently well-formed to be added to the task queue.
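
As a simple precaution, one could stop early when the submission is not accepted, using only the code column shown above:

## halt if the task submission was not accepted
stopifnot(identical(task$code, "Accepted"))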

Task management

Use task_status() to check on status.

task_status(task)
## # A tibble: 1 × 5
##   task_id                              type     status nice_status label        
##   <chr>                                <chr>    <chr>  <chr>       <chr>        
## 1 88599a66-5a81-11ef-be9c-83cd94efb466 TRANSFER ACTIVE Queued      2024-08-14 1…

The status column changes from ACTIVE to SUCCEEDED for successful tasks. An active task proceeding normally has nice_status either Queued or OK. An active task may also be encountering errors, e.g., because the local collection is offline (CONNECTION_FAILED) or paused (GC_PAUSED), or because the source or destination file exists but the user does not have permission to read or write it (PERMISSION_DENIED). Perhaps unintuitively, Globus views these errors as transient (e.g., because the local collection may come back online) and so continues to try to complete the task. Active tasks that persist in an error state will eventually fail.
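
For example, the detailed nice_status of the current task can be inspected directly from the task_status() result:

task_status(task)$nice_status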

Failed tasks have status FAILED.

Let’s write a simple loop to check on status, allowing the task to run for up to 60 seconds.

now <- Sys.time()
repeat {
    status <- task_status(task)$status
    complete <- status %in% c("SUCCEEDED", "FAILED")
    ## use explicit units; the default difftime units change as time elapses
    elapsed <- difftime(Sys.time(), now, units = "secs")
    if (complete || elapsed > 60)
        break
    Sys.sleep(5)
}

If the task was successful, we should see the file in our local collection.

status
## [1] "SUCCEEDED"
globus_ls(my_collections, "tmp/HuBMAP/test")
## # A tibble: 1 × 4
##   name          last_modified              size type 
##   <chr>         <chr>                     <int> <chr>
## 1 metadata.json 2024-08-14 21:09:48+00:00 34814 file

If the task has failed, or is still active with a nice_status that is unlikely to resolve, the task can be canceled with task_cancel(). Canceling an already-completed task generates an informative message rather than an error.

task_cancel(task)
## TaskComplete: The task completed before the cancel request was processed.

Directory transfer

Directory transfer is similar. Here we transfer the entire HuBMAP dataset to our local collection.

Specify the source path as the HuBMAP dataset, and the destination path as the dataset id in our local collection.

source_path <- hubmap_dataset
destination_path <- paste("tmp/HuBMAP", hubmap_dataset, sep = "/")

Submit the task as before, but add recursive = TRUE since this is a directory. As the task may take quite a while to complete, we will not change the default notify_on_succeeded option. Check on its initial status.

task <- copy(
    hubmap, my_collections,        # collections
    source_path, destination_path, # paths
    recursive = TRUE
)
task_status(task)
## # A tibble: 1 × 5
##   task_id                              type     status nice_status label        
##   <chr>                                <chr>    <chr>  <chr>       <chr>        
## 1 8c415b64-5a81-11ef-be9c-83cd94efb466 TRANSFER ACTIVE Queued      2024-08-14 1…

The task is added to a queue, and eventually the entire content of the HuBMAP dataset is transferred.

One can gain additional insight into the progress of the task by requesting all_fields of the task status. Relevant fields include the number of bytes transferred and the effective transfer rate. We use a helper function to format these values more intelligibly.

bytes_to_units <-
    function(x)
{
    ## use R's 'object_size' S3 class to pretty-print bytes as MB, etc
    x |>
        structure(class = "object_size") |>
        format(units = "auto")
}

task_status(task, all_fields = TRUE) |>
    dplyr::select(
        status, nice_status,
        bytes_transferred,
        effective_bytes_per_second
    ) |>
    dplyr::mutate(
        bytes_transferred = bytes_to_units(bytes_transferred),
        effective_bytes_per_second =
            bytes_to_units(effective_bytes_per_second)
    )
##   status nice_status bytes_transferred effective_bytes_per_second
##   <chr>  <chr>       <chr>             <chr>
## 1 ACTIVE OK          227.8 Mb          486.9 Kb

The transfer can take some time, so in the interest of brevity we cancel the task.

task_cancel(task)
## Canceled: The task has been cancelled successfully.

Neglecting to set recursive = TRUE results in a nice_status of IS_A_DIRECTORY. Globus nonetheless continues to attempt the transfer, but we would recognize this as ‘user error’, cancel the task, and resubmit it.

task <- copy(
    hubmap, my_collections, # collections
    source_path,
    destination_path
)
## ...
task_status(task)
## # A tibble: 1 × 5
##   task_id                              type     status nice_status    label
##   <chr>                                <chr>    <chr>  <chr>          <chr>
## 1 36bee1e6-5a5b-11ef-be9a-83cd94efb466 TRANSFER ACTIVE IS_A_DIRECTORY 2024-08-1…

## Oops, forgot the `recursive = TRUE` option
task_cancel(task)
task <- copy(
    hubmap, my_collections, # collections
    source_path,
    destination_path,
    recursive = TRUE
)

Complicated transfer tasks

Globus supports more complicated tasks, for instance submitting multiple transfer tasks at once, or synchronizing two locations. These will be developed in subsequent iterations of the package.

Session information

This vignette was compiled using the following software versions

sessionInfo()
## R version 4.4.1 Patched (2024-06-20 r86819)
## Platform: aarch64-apple-darwin23.5.0
## Running under: macOS Sonoma 14.5
## 
## Matrix products: default
## BLAS:   /Users/ma38727/bin/R-4-4-branch/lib/libRblas.dylib 
## LAPACK: /Users/ma38727/bin/R-4-4-branch/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rglobus_0.0.1
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.8.8    dplyr_1.1.4       compiler_4.4.1    promises_1.3.0   
##  [5] Rcpp_1.0.13       tidyselect_1.2.1  later_1.3.2       jquerylib_0.1.4  
##  [9] systemfonts_1.1.0 textshaping_0.4.0 yaml_2.3.10       fastmap_1.2.0    
## [13] R6_2.5.1          rjsoncons_1.3.1   generics_0.1.3    curl_5.2.1       
## [17] httr2_1.0.2       knitr_1.48        htmlwidgets_1.6.4 tibble_3.2.1     
## [21] desc_1.4.3        openssl_2.2.0     bslib_0.7.0       pillar_1.9.0     
## [25] rlang_1.1.4       utf8_1.2.4        cachem_1.1.0      httpuv_1.6.15    
## [29] xfun_0.46         fs_1.6.4          sass_0.4.9        cli_3.6.3        
## [33] withr_3.0.1       pkgdown_2.1.0     magrittr_2.0.3    digest_0.6.36    
## [37] askpass_1.2.0     rappdirs_0.3.3    lifecycle_1.0.4   vctrs_0.6.5      
## [41] evaluate_0.24.0   glue_1.7.0        whisker_0.4.1     ragg_1.3.2       
## [45] fansi_1.0.6       rmarkdown_2.27    tools_4.4.1       pkgconfig_2.0.3  
## [49] htmltools_0.5.8.1