Data quality control¶
PopPUNK now comes with some basic quality control options which are applied by
default when running --create-db
:
- Outlying genome length (calculated during sketching, for assemblies or reads).
- Too many ‘N’s.
- Outlying accessory distance.
Accessory distance is also used with poppunk_assign
.
You can set the behaviour when creating the database by setting --qc-filter
to one of the following three values if any failing samples are found:
stop
(default) – Do not proceed past the sketching step, and throw an error.prune
– Remove failing samples from the sketch database before continuing.continue
– Ignore failing samples and run anyway.
In all cases a file will be written at qcreport.txt
which lists the failing samples, and the
reasons why they failed. If running with either prune or continue, you may also add --retain-failures
to write a separate sketch database with the failed samples.
Random match chances in PopPUNK are only calculated and added to the database after the chosen
QC step. If you use poppunk_sketch
directly, they will be added without any automated QC.
QC of input sequences¶
The first QC step is applied directly to the input sequences themselves, to identify poor quality sequences.
You can change the genome length cutoff with --length-sigma
which sets the maximum number
of standard deviations from the mean, and --length-range
which sets an absolute range of
allowable sizes.
Ambiguous bases are controlled by --prop-n
which gives the maximum percentage of Ns,
and --upper-n
which gives the absolute maximum value.
QC of pairwise distances¶
The second QC step uses the pairwise distances, to enable the removal of outlier samples that may not be part of the taxon being studied.
By default, the maximum allowed accessory distance is 0.5 to ensure you check for contamination.
However, many species do really have high accessory values above this range, in which case you
should increase the value of --max-a-dist
.
The maximum allowed core distance is also 0.5, by default. This can be altered with --max-pi-dist
.
All sequences differing from the type isolate by distances greater than either threshold will be
identified by the analysis, with the behaviour determined by the --qc-filter
option. The type
isolate will be selected by PopPUNK, unless specified using --type-isolate
.
Removing samples from an existing database¶
You can use the poppunk_prune
command to remove samples from a database,
for example those found to be of poor quality. Create a file
remove.txt
with the names of the samples you wish to remove, one per line,
and run:
poppunk_prune --remove remove.txt --distances strain_db/strain_db.dists --output pruned_db
This will remove the samples from the strain_db.dists
files. This can instead be done
simultaneously as the model is fitted - problematic sequences can be pruned, and the model fitted
to the remaining high-quality samples by modifying the model fitting command to include QC options:
poppunk –fit-model dbscan –ref-db example_db –output example_dbscan –max-a-dist 0.4 –max-pi-dist 0.2
Dealing with poor quality data¶
In this example we analyse 76 Haemophilus influenzae isolates. One isolate, 14412_4_15,
is contaminated with 12% of reads being Haemophilus parainfluenzae and a total
assembly length of 3.8Mb. It would be removed before input, but its presence
can also be found with PopPUNK --qc-filter continue
.
With the distances¶
A fit with three mixture components overestimates the number of between strain links, and gives a network with a poor score (0.6849) and only five components:

Distances in the top left of the plot, with low core distances and high accessory distances, are due to the contaminated contigs in the isolate. Finding which isolates contribute to these distances reveals a clear culprit:
awk '$3<0.02 && $4 > 0.3 {print $1}' contam_db/contam_db.search.out | cut -f 1 | sort | uniq -c
1 14412_3_81
1 14412_3_82
1 14412_3_83
1 14412_3_84
1 14412_3_88
1 14412_3_89
1 14412_3_91
1 14412_3_92
1 14412_4_1
1 14412_4_10
28 14412_4_15
In this case it is sufficient to increase the number of mixture components to four, which no longer includes these inflated distances. This gives a score of 0.9401 and 28 components:

The best thing to do is to remove the poor quality isolate, or if possible remove the contaminated reads/contigs from the assembly.
With the network¶
Alternatively, the network itself can be inspected with --cytoscape
. Using
the approach detailed in Cytoscape gives the following view:

The contaminated node appears when ordering by ClusteringCoefficient, Stress or TopologicalCoefficient, and its edges appear when ordering by EdgeBetweeness. It can be seen highlighted in the top right component, connecting two clusters which otherwise have no links. It can be removed, and components recalculated in cytoscape directly, though removal from the PopPUNK database is best.
The second largest cluster is also suspicious, where there are few triangles (low transitivity) and the nodes involved have high Stress. This is indicative of a bad fit overall, rather than a single problem sample.