Sequence assembly

Overview

Teaching: 10 min
Exercises: 60 min
Questions
  • How can the information in the sequencing reads be reduced?

  • What are the different methods for assembly?

Objectives
  • Understand differences between assembly methods

  • Assemble the short reads

  • What is the effect of different k-mer sizes

Sequence assembly means the alignment and merging of reads in order to reconstruct the original sequence. The assembly of a genome from short sequencing reads will take a while - from minutes up to several hours per genome.

Sequence assembly

The assembler we will run is SPAdes. SPAdes generates a final assembly from multiple kmers. A list of kmers is automatically selected by SPAdes using the maximum read length of the input data, and each individual kmer contributes to the final assembly. If desired, a list of kmers can be specified with the -k flag which will override automatic kmer selection.

Assembly

Because assembly of each genome might take a while, we will a assemble two isolates per person. Reads have already been trimmed and error corrected. When the assembly has started for most people we can follow the assembly lecture (next page).

Preparation

$ cd ~
$ mkdir assembly
$ cd ~/reads

To run SPAdes we will use the spades.py command with the –only-assembler option as we have error corrected the reads already, -o for the output folder, -1 for the path to the forward reads, -2 for the path to the reverse reads. We will be doing an assembly with only 21 basepair kmers and an analysis with the default k-mer sizes of 21, 33 and 55 basepair which result in much better assemblies (why?). We can start the loop with the assemblies. The following is an example. Replace ERR026473 and ERR026474 with the names of your isolates

$ ls
$ for sample in ERR326690 ERR326694  ; do
    spades.py -1 "$sample"_1.fastq.gz -2 "$sample"_2.fastq.gz -o ~/assembly/"$sample"_k21 --isolate -t 1 -k21
    spades.py -1 "$sample"_1.fastq.gz -2 "$sample"_2.fastq.gz -o ~/assembly/"$sample" --isolate -t 1
  done

#if you run this on cocalc3 which has an issue with python3, please do the following:
for sample in ERR326690 ERR326694  ; do
    python /gnu/store/8v1ilv98y5wp3n25si2fbzhdv7s7iafv-profile/bin/spades.py -1 "$sample"_1.fastq.gz -2 "$sample"_2.fastq.gz -o ~/assembly/"$sample"_k21 --isolate -t 1 -k21
    python /gnu/store/8v1ilv98y5wp3n25si2fbzhdv7s7iafv-profile/bin/spades.py -1 "$sample"_1.fastq.gz -2 "$sample"_2.fastq.gz -o ~/assembly/"$sample" --isolate -t 1
done

$ cd ~/assembly
$ ls 
$ cd ERR326690
$ ls

The assemblies are found in the folders in the folder ~/assembly. The folders with k21 at the end are the assemblies with the very short kmer of 21. The end result before scaffolding is called contigs.fasta and the scaffolded contigs are available in the file called scaffolds.fasta. In the next lecture you will find out about what readspairs, contigs and scaffolds are.

Challenge: How many contigs were generated by SPAdes using different kmer sizes. What about the final assembly? And what is the difference between contigs and scaffolds.

Find out how many contigs or scaffolds there are in the S. pneumoniae isolates. Enter your solution in the Google Sheets table

Hint:

$ grep -c

prints a count of matching lines for each input file.

Solution

$ cd ~/assembly
$ grep -c '>'  ERR*/contigs.fasta
$ grep -c '>'  ERR*/scaffolds.fasta

At the moment, all samples are called scaffolds.fasta. This is not ideal. In the next episode we will rename the assembled scaffolds before processing them further.

Key Points

  • Assembly is a process which aligns and merges fragments from a longer DNA sequence in order to reconstruct the original sequence.

  • k-mers are short fragments of DNA of length k