db_connect() manages connections to AlphaMissense record-specific databases. By default, connections are created once and reused.

db_tables() queries for the names of temporary and regular tables defined in the database.

db_temporary_table() creates a temporary (for the duration of the duckdb connection) table from a tibble.

db_range_join() performs a range join, finding all positions in key within ranges defined by join. The result is stored in table to.

db_disconnect() disconnects the duckdb database and shuts down the DuckDB server associated with the connection. Temporary tables are lost.

db_disconnect_all() disconnects all managed duckdb database connections.

Usage

db_connect(
    record = ALPHAMISSENSE_RECORD,
    bfc = BiocFileCache(),
    read_only = TRUE,
    managed = read_only
)

db_tables(db = db_connect())

db_temporary_table(db, value, to)

db_range_join(db, key, join, to)

db_disconnect(db = db_connect())

db_disconnect_all()

Arguments

record

character(1) Zenodo record for the AlphaMissense data resources.

bfc

an object returned by BiocFileCache() representing the location where downloaded files and the parsed database will be stored. The default is the 'global' BiocFileCache.

read_only

logical(1) open the connection 'read only'. TRUE protects against overwriting existing data and is the default.

managed

logical(1) when TRUE, re-use an existing managed connection to the same database.

db

duckdb_connection object, returned by db_connect().

value

a data.frame / tibble containing data to be placed in a temporary table, e.g., from a GenomicRanges object to be used in a range join.

to

the character(1) name of the table to be created

key

a character(1) table name in db containing missense mutation coordinates.

join

a character(1) table name in db containing ranges to be used for joining with (filtering) key.

Value

db_connect() returns an open duckdb_connection to the AlphaMissense record-specific database.

db_tables() returns a character vector of database table names.

db_temporary_table() returns the temporary table as a dbplyr tibble.

db_range_join() returns to (the temporary table created from the join) as a dbplyr tibble.

db_disconnect() returns FALSE if the connection has already been closed or is not valid (via dbIsValid()) or TRUE if disconnection is successful. Values are returned invisibly.

db_disconnect_all() returns the db_disconnect() value for each connection, invisibly.

Details

For db_connect(), set managed = FALSE when, for instance, accessing a database in a separate process. Remember to capture the database connection, db_unmanaged <- db_connect(managed = FALSE), and disconnect when done, db_disconnect(db_unmanaged). Because managed defaults to the value of read_only, and read_only defaults to TRUE, connections are managed by default.

db_temporary_table() overwrites an existing table with name to.

db_range_join() overwrites an existing table to. The table key is usually "hg19" or "hg38" and must have CHROM and POS columns. The table join must have columns CHROM, start and end. Following Bioconductor convention and as reported in am_browse(), coordinates are 1-based and ranges defined by start and end are closed. All columns from both key and join are included, so column names (other than CHROM) cannot be duplicated.
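The closed-interval rule can be illustrated without a database. The following is a minimal sketch in base R, not the package's implementation (db_range_join() runs inside DuckDB). The data frames key and join below are hypothetical stand-ins for the database tables, with made-up coordinates.

```r
## Toy stand-ins for the 'key' and 'join' tables (illustrative values only).
key <- data.frame(
    CHROM = c("chr1", "chr1", "chr1", "chr2"),
    POS = c(99, 100, 200, 150)
)
join <- data.frame(
    CHROM = c("chr1", "chr2"),
    start = c(100, 1),
    end = c(200, 100)
)

## Pair rows that share CHROM, then keep positions inside the closed
## interval [start, end]; both endpoints are included.
candidates <- merge(key, join, by = "CHROM")
inside <- candidates$POS >= candidates$start &
    candidates$POS <= candidates$end
overlaps <- candidates[inside, ]
overlaps
```

Positions 100 and 200 on chr1 are retained because they sit on the range endpoints. Position 99 on chr1 and position 150 on chr2 fall outside their ranges and are dropped. The result carries the columns of both inputs, which is why column names other than CHROM must not be duplicated.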

db_disconnect() should be called on each unmanaged connection, and once (to free the default managed connection) at the end of a session.

Examples

db_connect()          # default 'read-only' connection
#> alphamissense_connection (read_only: TRUE; managed: TRUE)
#> record: '10813168'
#> connected: TRUE
#> tables: aa_substitutions, clinvar, gene_hg38, hg38

db_rw <- db_connect(read_only = FALSE)

am_data("hg38")       # uses the default, 'read-only' connection
#> # Source:   table<hg38> [?? x 10]
#> # Database: DuckDB v1.1.1 [mtmorgan@Darwin 23.6.0:R 4.5.0//Users/mtmorgan/Library/Caches/org.R-project.R/R/BiocFileCache/121787f1dafbc_121787f1dafbc]
#>    CHROM   POS REF   ALT   genome uniprot_id transcript_id     protein_variant
#>    <chr> <dbl> <chr> <chr> <chr>  <chr>      <chr>             <chr>          
#>  1 chr1  69094 G     T     hg38   Q8NH21     ENST00000335137.4 V2L            
#>  2 chr1  69094 G     C     hg38   Q8NH21     ENST00000335137.4 V2L            
#>  3 chr1  69094 G     A     hg38   Q8NH21     ENST00000335137.4 V2M            
#>  4 chr1  69095 T     C     hg38   Q8NH21     ENST00000335137.4 V2A            
#>  5 chr1  69095 T     A     hg38   Q8NH21     ENST00000335137.4 V2E            
#>  6 chr1  69095 T     G     hg38   Q8NH21     ENST00000335137.4 V2G            
#>  7 chr1  69097 A     G     hg38   Q8NH21     ENST00000335137.4 T3A            
#>  8 chr1  69097 A     C     hg38   Q8NH21     ENST00000335137.4 T3P            
#>  9 chr1  69097 A     T     hg38   Q8NH21     ENST00000335137.4 T3S            
#> 10 chr1  69098 C     A     hg38   Q8NH21     ENST00000335137.4 T3N            
#> # ℹ more rows
#> # ℹ 2 more variables: am_pathogenicity <dbl>, am_class <chr>
db_tables()           # connections initially share the same tables
#> [1] "aa_substitutions" "clinvar"          "gene_hg38"        "hg38"            
db_tables(db_rw)
#> [1] "aa_substitutions" "clinvar"          "gene_hg38"        "hg38"            

## ranges of interest -- the first 200000 bases on chromosomes 1-4.
ranges <- tibble(
    CHROM = paste0("chr", 1:4),
    start = rep(1, 4),
    end = rep(200000, 4)
)
db_temporary_table(db_rw, ranges, "ranges")
#> # Source:   table<ranges> [4 x 3]
#> # Database: DuckDB v1.1.1 [mtmorgan@Darwin 23.6.0:R 4.5.0//Users/mtmorgan/Library/Caches/org.R-project.R/R/BiocFileCache/121787f1dafbc_121787f1dafbc]
#>   CHROM start    end
#>   <chr> <dbl>  <dbl>
#> 1 chr1      1 200000
#> 2 chr2      1 200000
#> 3 chr3      1 200000
#> 4 chr4      1 200000

db_tables(db_rw)      # temporary table available to the db_rw connection...
#> [1] "aa_substitutions" "clinvar"          "gene_hg38"        "hg38"            
#> [5] "ranges"          
db_tables()           # ...but not to the read-only connection
#> [1] "aa_substitutions" "clinvar"          "gene_hg38"        "hg38"            

rng <- db_range_join(db_rw, "hg38", "ranges", "ranges_overlaps")
rng
#> # Source:   table<ranges_overlaps> [?? x 12]
#> # Database: DuckDB v1.1.1 [mtmorgan@Darwin 23.6.0:R 4.5.0//Users/mtmorgan/Library/Caches/org.R-project.R/R/BiocFileCache/121787f1dafbc_121787f1dafbc]
#>    CHROM    POS REF   ALT   genome uniprot_id transcript_id     protein_variant
#>    <chr>  <dbl> <chr> <chr> <chr>  <chr>      <chr>             <chr>          
#>  1 chr4  161194 C     G     hg38   Q3SXZ3     ENST00000510175.6 T170S          
#>  2 chr4  161196 T     A     hg38   Q3SXZ3     ENST00000510175.6 F171I          
#>  3 chr4  161196 T     C     hg38   Q3SXZ3     ENST00000510175.6 F171L          
#>  4 chr4  161196 T     G     hg38   Q3SXZ3     ENST00000510175.6 F171V          
#>  5 chr4  161197 T     G     hg38   Q3SXZ3     ENST00000510175.6 F171C          
#>  6 chr4  161197 T     C     hg38   Q3SXZ3     ENST00000510175.6 F171S          
#>  7 chr4  161197 T     A     hg38   Q3SXZ3     ENST00000510175.6 F171Y          
#>  8 chr4  161198 T     A     hg38   Q3SXZ3     ENST00000510175.6 F171L          
#>  9 chr4  161198 T     G     hg38   Q3SXZ3     ENST00000510175.6 F171L          
#> 10 chr4  161199 A     C     hg38   Q3SXZ3     ENST00000510175.6 K172Q          
#> # ℹ more rows
#> # ℹ 4 more variables: am_pathogenicity <dbl>, am_class <chr>, start <dbl>,
#> #   end <dbl>
rng |>
    count(CHROM) |>
    arrange(CHROM)
#> # Source:     SQL [3 x 2]
#> # Database:   DuckDB v1.1.1 [mtmorgan@Darwin 23.6.0:R 4.5.0//Users/mtmorgan/Library/Caches/org.R-project.R/R/BiocFileCache/121787f1dafbc_121787f1dafbc]
#> # Ordered by: CHROM
#>   CHROM     n
#>   <chr> <dbl>
#> 1 chr1   2018
#> 2 chr2   2028
#> 3 chr4   7488

db_disconnect(db_rw)  # explicit read-write connection
db_disconnect()       # implicit read-only connection

db_disconnect_all()