Background and Metadata

Overview

Teaching: 10 min
Exercises: 5 min
Questions
  • What data are we using?

  • Why is this experiment important?

Objectives
  • Why study E. coli?

  • Understand the data set.

  • What is hypermutability?

Background

We are going to use a long-term sequencing dataset from a population of Escherichia coli.

 [Wikimedia](https://species.wikimedia.org/wiki/Escherichia_coli#/media/File:EscherichiaColi_NIAID.jpg)

The data

View the metadata

We will be working with three sample events from the Ara-3 strain of this experiment: one from 5,000 generations, one from 15,000 generations, and one from 50,000 generations. The population changed substantially during the course of the experiment, and we will explore how it changed (the evolution of a Cit+ mutant and hypermutability) with our variant calling workflow. The metadata file associated with this lesson can be downloaded directly here or viewed in GitHub. If you would like to know how the file was created, you can look at some notes and sources here.

The metadata describes the Ara-3 clones. Its columns are:

Column            Description
strain            strain name
generation        generation when sample frozen
clade             based on parsimony-based tree
reference         study the samples were originally sequenced for
population        ancestral population group
mutator           hypermutability mutant status
facility          facility samples were sequenced at
run               Sequence read archive sample ID
read_type         library type of reads
read_length       length of reads in sample
sequencing_depth  depth of sequencing
cit               citrate-using mutant status
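The column list can also be read straight from the file's header line. The following is a minimal sketch, not part of the original lesson: it writes a stand-in header to a temporary file, assuming the columns appear in the same order as in the table above, and then lists and counts them. The variable names tmpfile and n_cols are illustrative only. With the real file, replace "$tmpfile" with Ecoli_metadata_composite.csv.

```shell
# Stand-in for the header line of Ecoli_metadata_composite.csv
# (assumption: columns appear in the order listed in the table above).
tmpfile=$(mktemp)
echo 'strain,generation,clade,reference,population,mutator,facility,run,read_type,read_length,sequencing_depth,cit' > "$tmpfile"

# Print one column name per line, numbered, so field positions are easy to look up.
head -n 1 "$tmpfile" | tr ',' '\n' | cat -n

# Count the columns: the number of comma-separated fields in the header line.
n_cols=$(head -n 1 "$tmpfile" | awk -F',' '{print NF}')
echo "columns: $n_cols"

rm -f "$tmpfile"
```

The numbered listing is handy when choosing a field for cut -f or an awk $N reference, since it shows each column's position.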

Challenge

Based on the metadata, can you answer the following questions?
Try using the command line where appropriate. Q1 is the exception: it is a real challenge that needs background knowledge not directly covered in this course (regular expressions and/or awk).

  1. How many different generations exist in the data?
  2. How many rows and how many columns are in this data?
  3. How many citrate+ mutants have been recorded in Ara-3?
  4. How many hypermutable mutants have been recorded in Ara-3?

Solution

  1. 25 different generations
    • This answer is most easily found by inspecting the metadata in a spreadsheet.
    • A more challenging option is to use awk, or regular expressions in combination with grep (see the advanced extra work, which is not covered in the exam).
      Process: get the second field of the metadata, reduce it to a unique list, and count.
      • grep with a regular expression (match from the line start ^ up to the first comma, then digits only, [0-9]):
        • grep -P -o '^[A-Za-z0-9]+,[0-9]+,' Ecoli_metadata_composite.csv | sed -r "s/.+,([0-9]+),/\1/" | sort -u | wc -l
          • First, grep returns ONLY (-o) the part of each line that matches the Perl-style regular expression (-P): at the start of the line (^), one or more (+) letters or digits, a comma, one or more digits ([0-9]+), and another comma. The header line does not match, because its second field is not a number.
          • sed is basically a one-line search and replace. Using extended regular expressions (-r), it replaces each grep match with only the number captured by ([0-9]+), which is referenced as \1.
          • Next we sort the list, keep only the unique values (-u), and count the lines.
          • Try asking ChatGPT to explain the grep answer and be amazed!
      • or if you followed the extra awk work:
        • awk -F',' '{print $2}' Ecoli_metadata_composite.csv | sort -u | wc -l
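          • awk lets you work with column data; -F specifies that the comma is the column separator.
          • The small awk program '{print $2}' prints column 2 ($2) of every line.
          • We then sort the output, keep only the unique values, and count the lines.
        • Why does awk return 26 instead of 25? The header line is processed too, so the column name "generation" is counted as one extra unique value. Skipping the first line avoids this: awk -F',' 'NR>1 {print $2}' Ecoli_metadata_composite.csv | sort -u | wc -l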
    • Student contributed solution using the toolbox covered in the main instructions:
      • cut -d, -f2 Ecoli_metadata_composite.csv | tail -n+2 | sort -u | wc -l
        • The cut command can extract a specific COLUMN of data based on the specified delimiter (-d saying it is a comma). Then -f2 extracts the SECOND column.
        • The tail -n+2 command uses an explicit +2 to start printing from line 2 onward, so the header is not counted.
        • We then sort, keep only the UNIQUE values (-u), and count the lines.
  2. 62 rows, 12 columns
    • tail -n+2 Ecoli_metadata_composite.csv | wc -l
      • Print the file from line 2 onward (skipping the header), then count the number of lines.
    • head -n1 Ecoli_metadata_composite.csv | grep -o ',' | wc -l
      • This takes only the first (HEADER) line, uses grep to print ONLY (-o) each matching comma on its own line, and counts them. Columns are one more than the separators, so we add 1 for the total: 11+1 = 12
  3. 10 citrate+ mutants
    • grep ',plus$' Ecoli_metadata_composite.csv | wc -l
      • Find plus in the LAST column ($ = end of line) and count the number of matching rows.
      • Alternatively, the count can be done in one go with grep -c: grep -c ',plus$' Ecoli_metadata_composite.csv
  4. 5 or 6 hypermutable mutants (depending on whether you look at the clade (+H) or the mutator label)
    • grep -c '+H,' Ecoli_metadata_composite.csv
      • This finds all lines that have the +H (hypermutator) flag and counts them (-c).
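
The solution commands can be tried on a small stand-in file before running them on the real metadata. The sketch below builds a toy CSV that reuses the twelve column names from the metadata table. Every row value is an invented placeholder, not a record from Ecoli_metadata_composite.csv, so the counts differ from the answers above. It assumes a POSIX-like shell with mktemp available, and the variable names are illustrative only.

```shell
# Build a toy CSV. Column names follow the metadata table; all row values
# are invented placeholders, NOT real records from the lesson's file.
toy_csv=$(mktemp)
cat > "$toy_csv" <<'EOF'
strain,generation,clade,reference,population,mutator,facility,run,read_type,read_length,sequencing_depth,cit
toyA,0,NA,toy_ref,Ara-3,None,toy_lab,RUN1,single,36,10,unknown
toyB,2000,UC,toy_ref,Ara-3,None,toy_lab,RUN2,single,36,10,minus
toyC,2000,C1,toy_ref,Ara-3,None,toy_lab,RUN3,paired,36,10,minus
toyD,5000,C3+H,toy_ref,Ara-3,plus,toy_lab,RUN4,paired,36,10,plus
EOF

# Q1: distinct generations, skipping the header with tail -n +2.
n_generations=$(cut -d, -f2 "$toy_csv" | tail -n +2 | sort -u | wc -l | tr -d ' ')

# Same count without skipping the header: the word "generation" is counted too.
n_with_header=$(awk -F',' '{print $2}' "$toy_csv" | sort -u | wc -l | tr -d ' ')

# Q2: data rows (header excluded) and columns (fields in the header line).
n_rows=$(tail -n +2 "$toy_csv" | wc -l | tr -d ' ')
n_cols=$(head -n 1 "$toy_csv" | awk -F',' '{print NF}')

# Q3: rows whose LAST column is "plus".
n_cit_plus=$(grep -c ',plus$' "$toy_csv")

# Q4: rows carrying the +H flag; -F makes grep treat "+H," as a literal string.
n_hyper=$(grep -c -F '+H,' "$toy_csv")

echo "generations=$n_generations (with header: $n_with_header)"
echo "rows=$n_rows columns=$n_cols cit_plus=$n_cit_plus hypermutable=$n_hyper"

rm -f "$toy_csv"
```

On this toy file the header-inclusive count is exactly one higher than the header-free count, which is the same off-by-one seen with awk on the real file. The tr -d ' ' step only strips the padding that some wc implementations add, so the numbers compare cleanly.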

Key Points

  • It is important to record and understand your experiment’s metadata.