db_connect()
manages connections to AlphaMissense
record-specific databases. By default, connections are created
once and reused.
db_tables()
queries for the names of temporary and
regular tables defined in the database.
db_temporary_table()
creates a temporary (for the
duration of the duckdb connection) table from a tibble.
db_range_join()
performs a range join, finding all
positions in key within ranges defined by join. The result
is stored in table to.
db_disconnect()
disconnects the duckdb database and
shuts down the DuckDB server associated with the
connection. Temporary tables are lost.
db_disconnect_all()
disconnects all managed duckdb
database connections.
Usage
db_connect(
record = ALPHAMISSENSE_RECORD,
bfc = BiocFileCache(),
read_only = TRUE,
managed = read_only
)
db_tables(db = db_connect())
db_temporary_table(db, value, to)
db_range_join(db, key, join, to)
db_disconnect(db = db_connect())
db_disconnect_all()
Arguments
- record
character(1) Zenodo record for the AlphaMissense data resources.
- bfc
an object returned by BiocFileCache() representing the location where downloaded files and the parsed database will be stored. The default is the 'global' BiocFileCache.
- read_only
logical(1) open the connection 'read only'. TRUE protects against overwriting existing data and is the default.
- managed
logical(1) when TRUE, re-use an existing managed connection to the same database.
- db
duckdb_connection object, returned by db_connect().
- value
a data.frame / tibble containing data to be placed in a temporary table, e.g., from a GenomicRanges object to be used in a range join.
- to
the character(1) name of the table to be created.
- key
a character(1) table name in db containing missense mutation coordinates.
- join
a character(1) table name in db containing ranges to be used for joining with (filtering) key.
Value
db_connect()
returns an open duckdb_connection to the
AlphaMissense record-specific database.
db_tables()
returns a character vector of database table
names.
db_temporary_table()
returns the temporary table as a dbplyr tibble.
db_range_join()
returns to (the temporary table created
from the join) as a dbplyr tibble.
db_disconnect()
returns FALSE if the connection has
already been closed or is not valid (via dbIsValid()) or
TRUE if disconnection is successful. Values are returned
invisibly.
db_disconnect_all()
returns the db_disconnect() value
for each connection, invisibly.
Details
For db_connect(), set managed = FALSE when, for instance,
accessing a database in a separate process. Remember to capture the
database connection, db_unmanaged <- db_connect(managed = FALSE),
and disconnect when done, db_disconnect(db_unmanaged). Connections
are managed by default.
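The capture-then-disconnect pattern for unmanaged connections can be wrapped so that the disconnect is not forgotten. The following is a minimal sketch, not part of the package: with_unmanaged_db() is a hypothetical helper name, and its connect and disconnect arguments are assumptions added only so the helper can be exercised without a database. It assumes the package providing db_connect() and db_disconnect() is attached when the defaults are used.

```r
## Hypothetical helper (not part of the package): open an unmanaged
## connection, pass it to `fun`, and always disconnect afterwards.
## `connect` and `disconnect` default to the documented functions; the
## defaults are only looked up if the caller does not supply replacements.
with_unmanaged_db <- function(fun, ...,
                              connect = db_connect,
                              disconnect = db_disconnect) {
    db <- connect(managed = FALSE)
    ## on.exit() runs even if `fun` signals an error
    on.exit(disconnect(db), add = TRUE)
    fun(db, ...)
}

## Intended use with the package attached (not run here):
## with_unmanaged_db(db_tables)
```

Because the connection is created and closed inside the helper, nothing needs to be captured in the calling environment.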
db_temporary_table()
overwrites an existing table with name to.
db_range_join()
overwrites an existing table to.
The table key is usually "hg19" or "hg38" and must have
CHROM and POS columns. The table join must have columns
CHROM, start and end. Following Bioconductor
convention and as reported in am_browse(), coordinates are
1-based and ranges defined by start and end are closed. All
columns from both key and join are included, so column names
(other than CHROM) cannot be duplicated.
db_disconnect()
should be called on each unmanaged
connection, and once (to free the default managed connection)
at the end of a session.
Examples
db_connect() # default 'read-only' connection
#> alphamissense_connection (read_only: TRUE; managed: TRUE)
#> record: '10813168'
#> connected: TRUE
#> tables: aa_substitutions, clinvar, gene_hg38, hg38
db_rw <- db_connect(read_only = FALSE)
am_data("hg38") # uses the default, 'read-only' connection
#> # Source: table<hg38> [?? x 10]
#> # Database: DuckDB v1.1.1 [mtmorgan@Darwin 23.6.0:R 4.5.0//Users/mtmorgan/Library/Caches/org.R-project.R/R/BiocFileCache/121787f1dafbc_121787f1dafbc]
#> CHROM POS REF ALT genome uniprot_id transcript_id protein_variant
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 chr1 69094 G T hg38 Q8NH21 ENST00000335137.4 V2L
#> 2 chr1 69094 G C hg38 Q8NH21 ENST00000335137.4 V2L
#> 3 chr1 69094 G A hg38 Q8NH21 ENST00000335137.4 V2M
#> 4 chr1 69095 T C hg38 Q8NH21 ENST00000335137.4 V2A
#> 5 chr1 69095 T A hg38 Q8NH21 ENST00000335137.4 V2E
#> 6 chr1 69095 T G hg38 Q8NH21 ENST00000335137.4 V2G
#> 7 chr1 69097 A G hg38 Q8NH21 ENST00000335137.4 T3A
#> 8 chr1 69097 A C hg38 Q8NH21 ENST00000335137.4 T3P
#> 9 chr1 69097 A T hg38 Q8NH21 ENST00000335137.4 T3S
#> 10 chr1 69098 C A hg38 Q8NH21 ENST00000335137.4 T3N
#> # ℹ more rows
#> # ℹ 2 more variables: am_pathogenicity <dbl>, am_class <chr>
db_tables() # connections initially share the same tables
#> [1] "aa_substitutions" "clinvar" "gene_hg38" "hg38"
db_tables(db_rw)
#> [1] "aa_substitutions" "clinvar" "gene_hg38" "hg38"
## ranges of interest -- the first 200000 bases on chromosomes 1-4.
ranges <- tibble(
    CHROM = paste0("chr", 1:4),
    start = rep(1, 4),
    end = rep(200000, 4)
)
db_temporary_table(db_rw, ranges, "ranges")
#> # Source: table<ranges> [4 x 3]
#> # Database: DuckDB v1.1.1 [mtmorgan@Darwin 23.6.0:R 4.5.0//Users/mtmorgan/Library/Caches/org.R-project.R/R/BiocFileCache/121787f1dafbc_121787f1dafbc]
#> CHROM start end
#> <chr> <dbl> <dbl>
#> 1 chr1 1 200000
#> 2 chr2 1 200000
#> 3 chr3 1 200000
#> 4 chr4 1 200000
db_tables(db_rw) # temporary table available to the db_rw connection...
#> [1] "aa_substitutions" "clinvar" "gene_hg38" "hg38"
#> [5] "ranges"
db_tables() # ...but not to the read-only connection
#> [1] "aa_substitutions" "clinvar" "gene_hg38" "hg38"
rng <- db_range_join(db_rw, "hg38", "ranges", "ranges_overlaps")
rng
#> # Source: table<ranges_overlaps> [?? x 12]
#> # Database: DuckDB v1.1.1 [mtmorgan@Darwin 23.6.0:R 4.5.0//Users/mtmorgan/Library/Caches/org.R-project.R/R/BiocFileCache/121787f1dafbc_121787f1dafbc]
#> CHROM POS REF ALT genome uniprot_id transcript_id protein_variant
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 chr4 161194 C G hg38 Q3SXZ3 ENST00000510175.6 T170S
#> 2 chr4 161196 T A hg38 Q3SXZ3 ENST00000510175.6 F171I
#> 3 chr4 161196 T C hg38 Q3SXZ3 ENST00000510175.6 F171L
#> 4 chr4 161196 T G hg38 Q3SXZ3 ENST00000510175.6 F171V
#> 5 chr4 161197 T G hg38 Q3SXZ3 ENST00000510175.6 F171C
#> 6 chr4 161197 T C hg38 Q3SXZ3 ENST00000510175.6 F171S
#> 7 chr4 161197 T A hg38 Q3SXZ3 ENST00000510175.6 F171Y
#> 8 chr4 161198 T A hg38 Q3SXZ3 ENST00000510175.6 F171L
#> 9 chr4 161198 T G hg38 Q3SXZ3 ENST00000510175.6 F171L
#> 10 chr4 161199 A C hg38 Q3SXZ3 ENST00000510175.6 K172Q
#> # ℹ more rows
#> # ℹ 4 more variables: am_pathogenicity <dbl>, am_class <chr>, start <dbl>,
#> # end <dbl>
rng |>
    count(CHROM) |>
    arrange(CHROM)
#> # Source: SQL [3 x 2]
#> # Database: DuckDB v1.1.1 [mtmorgan@Darwin 23.6.0:R 4.5.0//Users/mtmorgan/Library/Caches/org.R-project.R/R/BiocFileCache/121787f1dafbc_121787f1dafbc]
#> # Ordered by: CHROM
#> CHROM n
#> <chr> <dbl>
#> 1 chr1 2018
#> 2 chr2 2028
#> 3 chr4 7488
db_disconnect(db_rw) # explicit read-write connection
db_disconnect() # implicit read-only connection
db_disconnect_all()