We developed VariantBam, a C++ read filtering and profiling tool for use with BAM, CRAM and SAM sequencing files. VariantBam provides a flexible framework for extracting sequencing reads or read-pairs that satisfy combinations of rules, defined by any number of genomic intervals or variant sites. We have implemented filters based on alignment data, sequence motifs, regional coverage and base quality. For example, VariantBam achieved a median size reduction ratio of 3.1:1 when applied to 10 lung cancer whole genome BAMs by removing large tags and selecting for only high-quality variant-supporting reads and reads matching a large dictionary of sequence motifs. Thus VariantBam enables efficient storage of sequencing data while preserving the most relevant information for downstream analysis.

Availability and implementation: VariantBam and full documentation are available at /jwalabroad/VariantBam.

Supplementary information: Supplementary data are available at Bioinformatics online.

As the cost of genome sequencing decreases, the storage and computational burden of handling large sequencing datasets is an increasing concern. Whole genome sequencing of a human genome to 30× coverage can result in approximately 1 billion reads, requiring more than 100 GB of disk space even in the compressed BAM format (Li et al., 2009). To mitigate this, the CRAM format was established to provide reference-based compression of BAM files, resulting in a 2–3× reduction in file size (Hsi-Yang Fritz et al., 2011). However, the CRAM format is not currently supported by many analysis tools, and can still leave whole-genome files at 30× coverage at > 30 GB.

An alternative is to create trimmed BAM files that contain only the sequence information relevant for a particular set of scientific questions. For instance, one group may be interested in only mutations in exons, while another may be interested in only reads that support structural variants. Furthermore, depending on the quality of the sequencing library, a large fraction of the low-quality reads can be removed with little effect on downstream analyses. Even further reductions in size can be achieved by removing alignment tags or subsampling reads in areas of high coverage. Read filtering that meets all of these needs could reduce the data footprint while preserving the most relevant information in a compact file. The original BAMs could then be moved to low-cost archival storage, or retained on an external server.

Read filtering can also be used directly in analysis pipelines that operate only on a subset of reads. Research questions in different fields (e.g. epigenomics, metagenomics, etc.) will tend to focus on different subsets of the data. Efficient and accurate retrieval of those sequences would facilitate development of such analyses. To address these issues, we developed VariantBam, a scriptable BAM/CRAM/SAM filtering tool designed to provide a robust and flexible method for retrieving custom sets of reads.
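To make the idea of region-specific rules concrete, the following is a minimal Python sketch of filtering reads against rules tied to genomic intervals, with optional tag removal. It is an illustration only and does not reproduce VariantBam's implementation or its rule syntax: the names `Read`, `Rule` and `filter_reads`, the toy records, and the choice of mapping quality as the only filter criterion are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Read:
    """Toy stand-in for an aligned read (hypothetical, not a BAM record)."""
    chrom: str
    pos: int      # 0-based leftmost aligned position
    mapq: int     # mapping quality
    length: int   # aligned length in bases
    tags: dict = field(default_factory=dict)


@dataclass
class Rule:
    """Hypothetical rule: a half-open interval [start, end) plus criteria."""
    chrom: str
    start: int
    end: int
    min_mapq: int = 0
    strip_tags: tuple = ()

    def covers(self, read: Read) -> bool:
        # True when the read overlaps the rule's interval.
        return (read.chrom == self.chrom
                and read.pos < self.end
                and read.pos + read.length > self.start)


def filter_reads(reads, rules):
    """Keep each read that satisfies the first rule it passes.

    Tags listed in that rule's strip_tags are dropped from the kept copy;
    the input reads are left unmodified.
    """
    kept = []
    for read in reads:
        for rule in rules:
            if rule.covers(read) and read.mapq >= rule.min_mapq:
                slim = {k: v for k, v in read.tags.items()
                        if k not in rule.strip_tags}
                kept.append(Read(read.chrom, read.pos, read.mapq,
                                 read.length, slim))
                break
    return kept


reads = [
    Read("chr1", 100, 60, 50, {"NM": 1, "OQ": "FFFF"}),  # passes rule 1
    Read("chr1", 120, 5, 50, {"NM": 3}),                 # mapq too low
    Read("chr2", 100, 60, 50),                           # outside all rules
    Read("chr1", 980, 60, 50, {"NM": 0}),                # overlaps rule 2
]
rules = [
    Rule("chr1", 90, 200, min_mapq=30, strip_tags=("OQ",)),
    Rule("chr1", 1000, 2000),
]

kept = filter_reads(reads, rules)
print(f"kept {len(kept)} of {len(reads)} reads")
```

In this sketch a read is retained as soon as one rule accepts it, so different intervals can apply different stringency. A real tool would stream records from a BAM/CRAM/SAM file rather than hold them in a list, and would need further criteria (sequence motifs, regional coverage, base quality) that are omitted here.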