Scripts

Brief documentation on the helper scripts included in the package in the /scripts directory. To use these scripts you will need a clone of the git repository. Alternatively, they should also be installed with the prefix ‘poppunk’ (e.g. to run extract_distances.py, run the command poppunk_extract_distances.py).

Writing the pairwise distances to an output file

By default, PopPUNK does not write the calculated \(\pi_n\) and \(a\) distances to a file, as the output contains \(\frac{1}{2}n(n-1)\) rows (one per pair of the \(n\) samples), which gives a multi-GB file for large datasets.

However, if needed, there is a script available to extract these distances as a text file:

python scripts/extract_distances.py --distances strain_db.dists --output strain_db.dists.out
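To get a feel for the size of the output before running the script, the row count \(\frac{1}{2}n(n-1)\) can be evaluated directly. The following is a minimal sketch only; the function name n_pairwise_rows is hypothetical and is not part of PopPUNK.

```python
def n_pairwise_rows(n_samples: int) -> int:
    """Number of unordered sample pairs, n(n-1)/2, i.e. rows in the distance file."""
    return n_samples * (n_samples - 1) // 2


# Print the number of rows expected for a few dataset sizes.
for n in (1_000, 10_000, 100_000):
    print(f"{n} samples -> {n_pairwise_rows(n)} rows")
```

The row count grows quadratically, so a tenfold increase in samples gives roughly a hundredfold increase in rows.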

Writing network components to an output file

Visualising large networks with Cytoscape can become challenging. Individual components/clusters can be extracted for visualisation as follows:

python scripts/extract_components.py strain_db_graph.gpickle strain_db
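For readers unfamiliar with the term, a component is a set of samples that are all reachable from one another through network edges. The sketch below illustrates the idea on a plain edge list using a breadth-first search. It is an illustration under that assumption only, not the implementation used by extract_components.py, and the function name connected_components is hypothetical.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group the nodes of an undirected edge list into connected components."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Toy network: samples a, b, c are linked; d and e form a second component.
print(connected_components([("a", "b"), ("b", "c"), ("d", "e")]))
```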

Calculating Rand indices

This script allows the clusters formed by different runs/fits/modes of PopPUNK to be compared to each other. A value of 0 indicates that the clusterings are totally discordant, and a value of 1 indicates that they are identical.

Run:

python scripts/calculate_rand_indices.py --input poppunk_gmm_clusters.csv,poppunk_dbscan_cluster.csv

The script will calculate the Rand index and the adjusted Rand index between all pairs of files provided (comma-separated) to the --input argument. These will be written to the file rand.out; the output file name can be changed using --output.

The --subset argument can be used to restrict comparisons to include only specific samples listed in the provided file.
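As an illustration of what the two indices measure, the sketch below computes them for two clusterings given as lists of labels in the same sample order. It uses the standard pair-counting definitions and is not taken from calculate_rand_indices.py; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which the two clusterings agree."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two label lists of equal length >= 2")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # both together, or both apart
        total += 1
    return agree / total


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for the agreement expected by chance."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two label lists of equal length >= 2")
    n_pairs = comb(len(labels_a), 2)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / n_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every sample in one cluster in both
    return (sum_cells - expected) / (max_index - expected)


# Same partition with different label names: both indices equal 1.
print(rand_index([1, 1, 2, 2], ["x", "x", "y", "y"]))
print(adjusted_rand_index([1, 1, 2, 2], ["x", "x", "y", "y"]))
```

Only the grouping matters, not the label names. The adjusted index can fall below 0 when the agreement is worse than expected by chance.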

Calculating silhouette indices

This script can be used to find how well the clusters project into core-accessory space by calculating the silhouette index, which measures how close samples are to others in their own cluster compared to samples from other clusters. The silhouette index is calculated for every sample and takes a value between -1 (poorly matched) and +1 (well matched). The script reports the average of these indices across all samples, using Euclidean distances between the (normalised) core and accessory divergences calculated by PopPUNK.
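The calculation described above can be sketched as follows, using the standard definition of the silhouette index. This is an illustration, not the code in calculate_silhouette.py: the function name mean_silhouette is hypothetical, each sample is assumed to be an already-normalised point, and a sample alone in its cluster is scored 0 by convention.

```python
from collections import defaultdict
from math import dist


def mean_silhouette(points, labels):
    """Average silhouette index over all samples, using Euclidean distance."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: scored 0 by convention
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        scores.append((b - a) / denominator if denominator > 0 else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated groups give an average close to +1.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(mean_silhouette(toy_points, [0, 0, 1, 1]))
```

Assigning the same points to clusters that cut across the two groups (labels 0, 1, 0, 1) gives a negative average instead.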

To run:

python scripts/calculate_silhouette.py --distances strain_db.dists --cluster-csv strain_db_clusters.csv

The following additional options are available for use with external clusterings (e.g. from hierBAPS):

  • --cluster-col the (1-indexed) column index containing the cluster assignment
  • --id-col the (1-indexed) column index containing the sample names
  • --sub a string to remove from sample names to match them to those in --distances
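To make the meaning of these three options concrete, the sketch below shows one way they could be applied when reading an external clustering. It is not the script's own parsing code: the function read_external_clusters, the has_header parameter, and the example file contents are all hypothetical, and a comma-separated file with a header row is assumed.

```python
import csv
import io


def read_external_clusters(handle, cluster_col=2, id_col=1, sub=None, has_header=True):
    """Map sample name -> cluster label from a CSV, using 1-indexed columns."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the (assumed) header row
    clusters = {}
    for row in reader:
        if not row:
            continue
        name = row[id_col - 1]  # convert the 1-indexed column to a 0-indexed one
        if sub:
            name = name.replace(sub, "")  # strip the string so names match --distances
        clusters[name] = row[cluster_col - 1]
    return clusters


# Made-up external clustering whose sample names carry a ".fa" suffix.
example = io.StringIO("Taxon,Cluster\nsample1.fa,3\nsample2.fa,7\n")
print(read_external_clusters(example, cluster_col=2, id_col=1, sub=".fa"))
```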