Subclustering with PopPIPE
PopPIPE is still a work-in-progress and may not immediately work on your data. We anticipate finalising the pipeline in the near future.
You can run PopPIPE on your PopPUNK output, which will run subclustering and visualisation within your strains. The pipeline consists of the following steps:
- Split files into their strains.
- Calculate core and accessory distances within each strain.
- Use the core distances to make a neighbour-joining tree.
- (lineage_clust mode) Generate clusters from core distances with lineage clustering in PopPUNK.
- Use ska to generate within-strain alignments.
- Use IQ-TREE to generate an ML phylogeny for each strain using this alignment, and the NJ tree as a starting point.
- Use fastbaps to generate subclusters which are partitions of the phylogeny.
- Create an overall visualisation with both core and accessory distances, as in PopPUNK. The final tree is made by refining the NJ tree: the maximum likelihood trees for the subclusters are grafted onto their matching nodes.
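The first step above, splitting samples into strains, can be sketched in Python. This is a minimal illustration rather than PopPIPE's own code: the function name `split_into_strains` is hypothetical, the `Taxon` and `Cluster` column names are an assumption about the layout of the cluster CSV, and the sample names are made up. Strains smaller than the minimum cluster size are dropped, since they have too few samples to subcluster.

```python
import csv
import io
from collections import defaultdict


def split_into_strains(cluster_csv, min_cluster_size=6):
    """Group sample names by strain, keeping strains with enough samples.

    cluster_csv is any open text file with 'Taxon' and 'Cluster' columns
    (an assumption about the cluster CSV layout, not a documented contract).
    """
    strains = defaultdict(list)
    for row in csv.DictReader(cluster_csv):
        strains[row["Cluster"]].append(row["Taxon"])
    # Keep only strains that meet the minimum size.
    return {
        strain: samples
        for strain, samples in strains.items()
        if len(samples) >= min_cluster_size
    }


# Toy input standing in for a real cluster CSV (sample names are invented).
example = io.StringIO(
    "Taxon,Cluster\n"
    "s1,1\ns2,1\ns3,1\n"
    "s4,2\n"
)
kept = split_into_strains(example, min_cluster_size=3)
print(kept)
```

In a real run you would pass an open handle to the cluster CSV in place of the toy input.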
An example DAG for the steps (excluding ska index, for which there is one per sample):
PopPIPE is a snakemake pipeline, which depends upon snakemake and pandas:
conda install snakemake pandas
Other dependencies will be automatically installed by conda the first time you run the pipeline. You can also install them yourself and omit the --use-conda directive to snakemake:
conda env create -n poppipe --file=environment.yml
Then clone the repository:
git clone git@github.com:johnlees/PopPIPE.git
Run the pipeline with snakemake --cores <n_cores> --use-conda.
On a cluster or the cloud, you can use snakemake’s built-in --cluster argument:
snakemake --cluster qsub -j 16 --use-conda
See the snakemake docs for more information on your cluster/cloud provider.
scripts/ directory, if not running from the root of this repository
poppunk_db: The PopPUNK HDF5 database file, without the
poppunk_clusters: The PopPUNK cluster CSV file, usually
--rfile used with PopPUNK, which lists sample names and files, one per line, tab separated.
min_cluster_size: The minimum size of a cluster to run the analysis on (recommended at least 6).
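The sample list described above (sample names and files, one per line, tab separated) can be read with a few lines of Python. This is a hedged sketch and not part of PopPIPE: `read_rfile` is a hypothetical helper, the sample names and paths are invented, and accepting more than one file per sample is an assumption.

```python
import io


def read_rfile(handle):
    """Map each sample name to its list of file paths.

    Expects one sample per line: the name, then one or more
    tab-separated file paths. Blank lines are ignored.
    """
    files = {}
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(
                f"line {line_number}: expected a name and at least one file"
            )
        name, *paths = fields
        files[name] = paths
    return files


# Toy input; the sample names and paths are invented for illustration.
rfile = io.StringIO("s1\tassemblies/s1.fa\ns2\tassemblies/s2.fa\n")
sample_files = read_rfile(rfile)
print(sample_files["s1"])
```

Raising on a malformed line, instead of skipping it, means a broken sample list is caught before any downstream step runs.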
enabled: Set to false to turn off ML tree generation, and use the NJ tree throughout.
mode: Set to full to run with the specified model, set to fast to run using
model: A string for the -m parameter describing the model. Adding