Can You Upload Gz Files to Sra

How to use NCBI SRA Toolkit effectively?

Renesh Bedre 6 minute read

  • NCBI SRA toolkit is a set of utilities to download, view, and search large volumes of high-throughput sequencing data from the NCBI SRA database at faster speed
  • The SRA database has several accession types, including SRR (run accession for the actual sequencing data of a specific experiment), SRX (experiment accession representing the metadata for study, sample, library, and runs), SRP (study accession representing the metadata for the sequencing study and project abstract), and SAMN/SRS (BioSample/SRA accession representing the metadata for the biological sample).
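
The accession prefixes listed above map naturally onto a small lookup table. The following is a minimal sketch in Python; the function name `accession_type` and the returned labels are hypothetical and chosen only for illustration. It classifies an identifier by its prefix and does not query NCBI.

```python
# Prefix-to-type table taken from the list above; the labels are illustrative.
ACCESSION_TYPES = {
    "SRR": "run",
    "SRX": "experiment",
    "SRP": "study",
    "SRS": "sample (SRA)",
    "SAMN": "sample (BioSample)",
}


def accession_type(accession):
    """Return the accession type based on its prefix, or None if unrecognized."""
    acc = accession.strip().upper()
    # Check longer prefixes first so that SAMN is matched as a whole.
    for prefix in sorted(ACCESSION_TYPES, key=len, reverse=True):
        if acc.startswith(prefix) and acc[len(prefix):].isdigit():
            return ACCESSION_TYPES[prefix]
    return None


print(accession_type("SRR5790106"))  # run
```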

Applications

  • Effectively download large volumes of high-throughput sequencing data (e.g. FASTQ, SAM)
  • Convert SRA files into other biological file formats (e.g. FASTA, ABI, SAM, QSEQ, SFF)
  • Retrieve a small subset of large files (e.g. sequences, alignment)
  • Search within SRA files and fetch specific sequences

To install the latest version of SRA toolkit, download the binaries/install scripts for your OS from the NCBI SRA software page: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

    # I am using Ubuntu Linux
    # download latest version of compiled binaries of NCBI SRA toolkit
    # (version 2.11.3) for Ubuntu Linux
    # Compiled binaries for other OS visit: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
    $ wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.11.3/sratoolkit.2.11.3-ubuntu64.tar.gz

    # extract tar.gz file
    $ tar -zxvf sratoolkit.2.11.3-ubuntu64.tar.gz

    # add binaries to path using export path or editing ~/.bashrc file
    $ export PATH=$PATH:/home/ren/software/sratoolkit.2.11.3-ubuntu64/bin

    # Now SRA binaries are added to the path and ready to use
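
After editing the path, it can be useful to confirm that the binaries are actually visible. The sketch below uses only the Python standard library; the helper name `missing_tools` is hypothetical, and the default tool list is simply the set of commands used in this article.

```python
import shutil


def missing_tools(tools=("prefetch", "fastq-dump", "fasterq-dump", "vdb-validate")):
    """Return the tool names that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed SRA toolkit binaries were found")
```

If any name is reported as missing, the `export PATH` line most likely points at the wrong directory or was not applied in the current shell.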

Download SRA datasets using NCBI SRA toolkit

Note: Current SRA toolkit version 2.10.8 does not support Aspera client (ascp). Even though ascp can run with older versions, it will download the data in https mode and not in FASP mode.

    # download file: prefetch will download and save SRA file related to SRR accession in
    # the current directory under newly created SRA accession directory
    $ prefetch SRR5790106  # for a single file
    $ prefetch SRR5790106 SRR5790104  # multiple files

    # convert to FASTQ: fastq-dump will convert SRR5790106.sra to SRR5790106.fastq
    $ fastq-dump SRR5790106  # single file
    $ fastq-dump SRR5790106 SRR5790104  # multiple files

    # now you can also replace fastq-dump with fasterq-dump which is much faster
    # and efficient for large datasets
    # by default it will use 6 threads (-e option)
    $ fasterq-dump SRR5790106  # single file
    $ fasterq-dump SRR5790106 SRR5790104  # multiple files

    # for paired-end data use --split-files (fastq-dump) and -S or --split-files (fasterq-dump) option
    # use --split-3 option, if the paired-end reads are not properly arranged (e.g. some reads lack a mate pair)
    $ fastq-dump --split-files SRR8296149
    $ fasterq-dump -S SRR8296149

    # download biological and technical reads (cell and sample barcodes) in case of single
    # cell RNA-seq (10x chromium)
    $ fasterq-dump -S --include-technical SRR12564282

    # download alignment files (SAM)
    # make sure corresponding accession has alignment file at SRA database
    $ sam-dump --output-file SRR1236468.sam SRR1236468
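
The same prefetch and fasterq-dump steps can be driven from a script. This is a minimal sketch, assuming the toolkit binaries are on the path; the function names `build_commands` and `download` are hypothetical, and only the options already shown above (`-e`, `-O`, `--split-files`) are used.

```python
import subprocess


def build_commands(accession, paired=False, threads=6, outdir="."):
    """Build prefetch and fasterq-dump argument lists for one run accession."""
    prefetch_cmd = ["prefetch", accession]
    dump_cmd = ["fasterq-dump", "-e", str(threads), "-O", outdir]
    if paired:
        # write each read of a spot to its own file
        dump_cmd.append("--split-files")
    dump_cmd.append(accession)
    return prefetch_cmd, dump_cmd


def download(accession, **kwargs):
    """Run both steps; check=True raises CalledProcessError on a non-zero exit."""
    for cmd in build_commands(accession, **kwargs):
        subprocess.run(cmd, check=True)
```

Calling `download("SRR8296149", paired=True)` would run prefetch first and then fasterq-dump, so a failed download stops the script before the conversion step.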

Note: With fastq-dump and fasterq-dump, the prefetch step is not required, and you can directly download sequence data in FASTQ format. However, using fastq-dump directly without prefetch will be slower than first using prefetch and then fastq-dump.

parallel-fastq-dump

parallel-fastq-dump is a wrapper for fastq-dump that runs fastq-dump in parallel. In brief, it splits the file based on the number of threads and runs fastq-dump on the pieces in parallel. Read more here

Install parallel-fastq-dump with conda: conda install -c bioconda parallel-fastq-dump

    # download paired-end RNA-seq data with 8 threads
    $ parallel-fastq-dump --sra-id SRR17062757 --threads 8 --split-files --gzip

parallel-fastq-dump downloads FASTQ files (with gzip compression) faster than fasterq-dump. When I compared the performance for downloading SRR17062757 (~25M paired-end reads), parallel-fastq-dump took 2m36.257s and fasterq-dump took 3m13.182s (without gzip compression).
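
To illustrate the idea of splitting a run by the number of threads, the sketch below divides a range of spots into contiguous blocks, one per thread. It is an illustration only and not the actual parallel-fastq-dump code; the function name `spot_ranges` is hypothetical. Each block could then be passed to a separate fastq-dump call through its minimum and maximum spot ID options (`-N` and `-X`).

```python
def spot_ranges(n_spots, n_threads):
    """Split spots 1..n_spots into at most n_threads contiguous (start, end) blocks."""
    if n_spots < 0 or n_threads < 1:
        raise ValueError("n_spots must be >= 0 and n_threads must be >= 1")
    base, extra = divmod(n_spots, n_threads)
    ranges = []
    start = 1
    for i in range(n_threads):
        # the first `extra` blocks receive one additional spot
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges


print(spot_ranges(10, 3))  # [(1, 4), (5, 7), (8, 10)]
```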

Batch download SRA datasets

  • Sometimes, we need to download hundreds or thousands of FASTQ files from the SRA database, and it would be inconvenient to use the SRA toolkit directly for batch download
  • I have added a wrapper script for fasterq-dump in bioinfokit (v0.9.7 or later) for easy download of a large number of FASTQ files from the SRA database
  • Check the bioinfokit documentation for installation and usage
  • Download the test SRA accession file containing accessions for both single and paired-end FASTQ datasets

    # tested on Linux and Mac. It may not work on Windows
    >>> from bioinfokit.analys import fastq

    # batch download fastq files
    # make sure you have installed the latest version of NCBI SRA toolkit (version 2.10.8) and added binaries in the
    # system path
    >>> fastq.sra_bd(file='sra_accessions.txt')

    # increase number of threads
    >>> fastq.sra_bd(file='sra_accessions.txt', t=16)

    # use fasterq-dump customized options, you can see more options for fasterq-dump as
    # fasterq-dump -help
    >>> fastq.sra_bd(file='sra_accessions.txt', t=16, other_opts='--outdir temp --skip-technical')

    # multiple FASTQ (technical and biological) files from
    # 10x chromium single cell 3' RNA-seq data
    # if you provide file containing SRA accessions for 10x chromium
    # single cell 3' RNA-seq data, it will give multiple FASTQ files
    # for example, SRA accession SRR12564282 will give three FASTQ files
    # (sample barcode, cell barcode, and biological read FASTQ files)
    >>> fastq.sra_bd(file='path_to_sra_file', t=16, other_opts='--include-technical --split-files')
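
Before starting a long batch download, it may help to check the accession file itself. The layout of the file expected by bioinfokit is not described here, so the sketch below assumes one run accession per line; the function name `read_accessions` and the accepted pattern (SRR, ERR, or DRR followed by digits) are assumptions for illustration.

```python
import re

# Assumed pattern: run accessions such as SRR5790106 (ERR/DRR also accepted).
RUN_PATTERN = re.compile(r"^[SED]RR\d+$")


def read_accessions(path):
    """Return run accessions from a text file with one accession per line."""
    accessions = []
    with open(path) as handle:
        for line in handle:
            acc = line.strip()
            # skip blank lines and comment lines
            if not acc or acc.startswith("#"):
                continue
            if not RUN_PATTERN.match(acc):
                raise ValueError(f"Unexpected accession: {acc!r}")
            accessions.append(acc)
    return accessions
```

A malformed line then fails immediately with a clear message instead of surfacing as a download error partway through the batch.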

Validation of downloaded SRA data integrity

It is essential to check the integrity and checksum of SRA datasets to ensure a successful download

    # download SRA file
    $ prefetch SRR5790104
    # fastq-dump SRR5790104

    # check integrity of downloaded SRR5790104.sra file
    # output from vdb-validate should report 'ok' and 'consistent' for all parameters
    # Note: make sure you have .sra (not .cache) file for corresponding accession in
    # sra accession directory
    $ vdb-validate SRR5790104
    2020-08-31T22:46:27 vdb-validate.2.10.8 info: Table 'SRR5790104.sra' metadata: md5 ok
    2020-08-31T22:46:27 vdb-validate.2.10.8 info: Column 'ALTREAD': checksums ok
    2020-08-31T22:46:29 vdb-validate.2.10.8 info: Column 'QUALITY': checksums ok
    2020-08-31T22:46:30 vdb-validate.2.10.8 info: Column 'READ': checksums ok
    2020-08-31T22:46:30 vdb-validate.2.10.8 info: Column 'X': checksums ok
    2020-08-31T22:46:30 vdb-validate.2.10.8 info: Column 'Y': checksums ok
    2020-08-31T22:46:30 vdb-validate.2.10.8 info: Table 'SRR5790104.sra' is consistent
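
When many accessions are validated, reading each report by eye becomes tedious. The following is a rough sketch that checks a saved vdb-validate report for the 'ok' and 'consistent' messages shown above. The function name `report_is_clean` is hypothetical, and the check relies only on the message format in this example output, which may differ between toolkit versions.

```python
def report_is_clean(report):
    """Return True if every checksum/md5 line ends with 'ok' and a table is consistent."""
    lines = [ln.strip() for ln in report.splitlines() if ln.strip()]
    # lines that report a checksum or md5 result
    checks = [ln for ln in lines if "checksums" in ln or "md5" in ln]
    consistent = any(ln.endswith("is consistent") for ln in lines)
    return bool(checks) and all(ln.endswith(" ok") for ln in checks) and consistent


sample = """\
2020-08-31T22:46:27 vdb-validate.2.10.8 info: Table 'SRR5790104.sra' metadata: md5 ok
2020-08-31T22:46:27 vdb-validate.2.10.8 info: Column 'ALTREAD': checksums ok
2020-08-31T22:46:30 vdb-validate.2.10.8 info: Table 'SRR5790104.sra' is consistent
"""
print(report_is_clean(sample))  # True
```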

Customized download of SRA datasets

You can use SRA tools for customized output of large SRA datasets without downloading complete datasets (Note: some options are not available in fasterq-dump)

    # print first 10 reads from single-end FASTQ file
    # -Z option will print output on screen (STDOUT)
    $ fastq-dump -X 10 -Z SRR5790106

    # save FASTQ file to specified directory
    $ fastq-dump -O temp SRR5790106
    $ fasterq-dump -O temp SRR5790106

    # compress FASTQ file gzip or bzip2
    $ fastq-dump -O temp SRR5790106
    $ fastq-dump --gzip SRR5790106
    $ fastq-dump --bzip2 SRR5790106
    # Note: --gzip or --bzip2 options are not available with fasterq-dump

    # Multithreading
    $ fasterq-dump -e 10 SRR5790106
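
Because fasterq-dump has no --gzip option, its output has to be compressed in a separate step. One possible approach, sketched here with the Python standard library, is to compress each FASTQ file after it is written; the function name `gzip_fastq` and its `keep_original` parameter are hypothetical.

```python
import gzip
import os
import shutil


def gzip_fastq(path, keep_original=False):
    """Compress a FASTQ file to path + '.gz' and return the new file name."""
    out_path = path + ".gz"
    # stream the file in chunks so large FASTQ files are not loaded into memory
    with open(path, "rb") as src, gzip.open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if not keep_original:
        os.remove(path)
    return out_path
```

For example, `gzip_fastq("temp/SRR5790106.fastq")` would leave `temp/SRR5790106.fastq.gz` in place of the uncompressed file.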

Convert SRA data into other biological formats

SRA tools allow you to convert SRA files into FASTA, ABI, Illumina native (QSEQ), and SFF formats

    # convert to FASTA
    # you need to first download the FASTQ file to convert to FASTA file
    $ fastq-dump --fasta 60 SRR5790106
    # if you have paired-end FASTQ, use --split-files --fasta 60
    # if you don't use --split-files for paired-ends, the reads will be merged from both ends
    # number 60 represents number of bases per line
    # Note: --fasta option is not available with fasterq-dump

    # convert to ABI (CSFASTA and QVAL)
    $ abi-dump SRR5790106

    # convert to QSEQ
    # SRA database should have alignment information submitted for corresponding accession
    $ illumina-dump --qseq 2 SRR1236472
    # 2 for paired-end and 1 for single-end

    # convert to SFF
    # SFF is a binary file format related to 454 high-throughput sequencing
    $ sff-dump SRR996630
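
Since --fasta is not available with fasterq-dump, a FASTQ file produced by it can be converted afterwards. The sketch below shows the idea, including the wrapping of sequences at a fixed number of bases per line (60 above). It assumes a well-formed FASTQ input with exactly four lines per record; the function name `fastq_to_fasta` is hypothetical.

```python
def fastq_to_fasta(fastq_lines, width=60):
    """Yield FASTA lines from an iterable of FASTQ lines (four lines per record)."""
    lines = [ln.rstrip("\n") for ln in fastq_lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq = lines[i], lines[i + 1]
        # replace the leading '@' of the FASTQ header with '>'
        yield ">" + header[1:]
        # wrap the sequence at `width` bases per line
        for j in range(0, len(seq), width):
            yield seq[j:j + width]


record = ["@read1", "ACGTACGT", "+", "IIIIIIII"]
print(list(fastq_to_fasta(record, width=5)))  # ['>read1', 'ACGTA', 'CGT']
```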

Search within SRA files

You can search for specific sequences or a subset of sequences in SRA files

    # search within SRA files
    # output will be sequence read IDs
    $ sra-search GATGCCGCGCC SRR5790104
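
For an already converted FASTQ file, a similar lookup can be approximated with a plain substring search. This sketch is not how sra-search works internally: it operates on FASTQ lines rather than SRA files and only checks the forward strand. The function name `find_reads` is hypothetical.

```python
def find_reads(fastq_lines, query):
    """Return IDs of reads whose sequence contains `query` (forward strand only)."""
    lines = [ln.rstrip("\n") for ln in fastq_lines]
    hits = []
    for i in range(0, len(lines) - 3, 4):
        if query in lines[i + 1]:
            # read ID is the first whitespace-separated field after '@'
            hits.append(lines[i][1:].split()[0])
    return hits


reads = ["@r1 len=14", "TTGATGCCGCGCCA", "+", "IIIIIIIIIIIIII",
         "@r2 len=4", "ACGT", "+", "IIII"]
print(find_reads(reads, "GATGCCGCGCC"))  # ['r1']
```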

Get the read length from the SRA file:

    # this assumes that read length is the same for all reads as in unfiltered FASTQ files
    $ fastq-dump -X 1 -Z SRR5790106 | sed -n '2p' | awk '{ print length }'
    Read 1 spots for SRR5790106
    Written 1 spots for SRR5790106
    100
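
The one-liner above inspects only the first read. To check whether the assumption of a uniform read length holds, a few more reads can be sampled. This is a small sketch operating on FASTQ lines (for example, the output of `fastq-dump -X 1000 -Z`); the function name `read_length_counts` is hypothetical.

```python
from collections import Counter
from itertools import islice


def read_length_counts(fastq_lines, max_reads=1000):
    """Count sequence lengths over the first `max_reads` FASTQ records."""
    # the sequence is the second line of every four-line record
    seqs = islice(fastq_lines, 1, max_reads * 4, 4)
    return Counter(len(seq.strip()) for seq in seqs)


example = ["@a", "ACGT", "+", "IIII", "@b", "AC", "+", "II"]
print(read_length_counts(example))  # Counter({4: 1, 2: 1})
```

A single key in the result suggests uniform read lengths in the sampled reads; several keys indicate trimmed or variable-length reads.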

Note: For every SRA tool, you can check all options by providing the -h parameter (e.g. fasterq-dump -h)

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

This work is licensed under a Creative Commons Attribution 4.0 International License

Source: https://www.reneshbedre.com/blog/ncbi_sra_toolkit.html
