Extract paired-end reads from (NCBI) SRA files

Written on January 24, 2016

The SRA (Sequence Read Archive) stores the sequencing data from GEO experiments as files in .sra format. These files are handled with the SRA Toolkit.

I recently downloaded some .sra files from GEO corresponding to paired-end sequencing data. To my surprise, when I ran the fastq-dump utility (from the SRA Toolkit) I got only one file rather than two.

From the tool's documentation, it seems that the --split-files option should be enough, but in my case it was not: I also needed to add the --split-3 option. With this configuration, a single-end experiment produces a single .fastq file. A paired-end experiment produces two .fastq files with the suffixes _1 and _2, which contain the matched paired reads, and possibly a third file (no suffix) that contains the unmatched reads.

I currently run fastq-dump as:

fastq-dump --split-files --split-3 SRR1813404.sra -O SRR1813404
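
To check which of the two cases applied after the dump finishes, one option is to look at the file names in the output directory. The following is a minimal sketch, not part of the SRA Toolkit: layout_of is a hypothetical helper name, and it assumes the naming described above (ACCESSION_1.fastq and ACCESSION_2.fastq for paired reads, ACCESSION.fastq otherwise).

```shell
# Hypothetical helper (not part of the SRA Toolkit): guess the read layout
# from the files that fastq-dump wrote into an output directory.
# Assumes paired reads are named <acc>_1.fastq and <acc>_2.fastq, and
# single-end (or unmatched) reads are named <acc>.fastq.
layout_of() {
  dir=$1   # output directory passed to -O
  acc=$2   # accession, e.g. SRR1813404
  if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
    echo paired
  elif [ -f "$dir/$acc.fastq" ]; then
    echo single
  else
    echo missing
  fi
}
```

For the command above, this would be called as layout_of SRR1813404 SRR1813404. Note that a paired-end run may also contain the no-suffix file with unmatched reads, so the check looks for the _1 and _2 files first.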