Sequence assembly
Overview
Teaching: 10 min
Exercises: 60 minQuestions
How can the information in the sequencing reads be reduced?
What are the different methods for assembly?
Objectives
Understand differences between assembly methods
Assemble the short reads
What is the effect of different k-mer sizes
Sequence assembly means the alignment and merging of reads in order to reconstruct the original sequence. The assembly of a genome from short sequencing reads will take a while - from minutes up to several hours per genome.
Sequence assembly
The assembler we will run is SPAdes. SPAdes generates a final assembly from multiple kmers. A list of kmers is automatically selected by SPAdes using the maximum read length of the input data, and each individual kmer contributes to the final assembly. If desired, a list of kmers can be specified with the -k flag which will override automatic kmer selection.
Assembly
Because assembly of each genome might take a while, we will a assemble two isolates per person. Reads have already been trimmed and error corrected. When the assembly has started for most people we can follow the assembly lecture (next page).
Preparation
$ cd ~
$ mkdir assembly
$ cd ~/reads
To run SPAdes we will use the spades.py command with the –only-assembler option as we have error corrected the reads already, -o for the output folder, -1 for the path to the forward reads, -2 for the path to the reverse reads. We will be doing an assembly with only 21 basepair kmers and an analysis with the default k-mer sizes of 21, 33 and 55 basepair which result in much better assemblies (why?). We can start the loop with the assemblies. The following is an example. Replace ERR026473 and ERR026474 with the names of your isolates
$ ls
$ for sample in ERR326690 ERR326694 ; do
spades.py -1 "$sample"_1.fastq.gz -2 "$sample"_2.fastq.gz -o ~/assembly/"$sample"_k21 --isolate -t 1 -k21
spades.py -1 "$sample"_1.fastq.gz -2 "$sample"_2.fastq.gz -o ~/assembly/"$sample" --isolate -t 1
done
#if you run this on cocalc3 which has an issue with python3, please do the following:
for sample in ERR326690 ERR326694 ; do
python /gnu/store/8v1ilv98y5wp3n25si2fbzhdv7s7iafv-profile/bin/spades.py -1 "$sample"_1.fastq.gz -2 "$sample"_2.fastq.gz -o ~/assembly/"$sample"_k21 --isolate -t 1 -k21
python /gnu/store/8v1ilv98y5wp3n25si2fbzhdv7s7iafv-profile/bin/spades.py -1 "$sample"_1.fastq.gz -2 "$sample"_2.fastq.gz -o ~/assembly/"$sample" --isolate -t 1
done
$ cd ~/assembly
$ ls
$ cd ERR326690
$ ls
The assemblies are found in the folders in the folder ~/assembly. The folders with k21 at the end are the assemblies with the very short kmer of 21. The end result before scaffolding is called contigs.fasta and the scaffolded contigs are available in the file called scaffolds.fasta. In the next lecture you will find out about what readspairs, contigs and scaffolds are.
Challenge: How many contigs were generated by SPAdes using different kmer sizes. What about the final assembly? And what is the difference between contigs and scaffolds.
Find out how many contigs or scaffolds there are in the S. pneumoniae isolates. Enter your solution in the Google Sheets table
Hint:
$ grep -c
prints a count of matching lines for each input file.
Solution
$ cd ~/assembly $ grep -c '>' ERR*/contigs.fasta $ grep -c '>' ERR*/scaffolds.fasta
At the moment, all samples are called scaffolds.fasta. This is not ideal. In the next episode we will rename the assembled scaffolds before processing them further.
Key Points
Assembly is a process which aligns and merges fragments from a longer DNA sequence in order to reconstruct the original sequence.
k-mers are short fragments of DNA of length k