Reference documentation¶
Documentation for module functions (for developers)
bgmm.py¶
Functions used to fit the mixture model to a database. Access using
BGMMFit
.
BGMM using sklearn

PopPUNK.bgmm.
assign_samples
(X, weights, means, covars, scale, values=False)[source]¶ Given distances and a fit will calculate responsibilities and return most likely cluster assignment
 Args:
 X (numpy.array)
 n x 2 array of core and accessory distances for n samples
 weights (numpy.array)
 Component weights from
BGMMFit
 means (numpy.array)
 Component means from
BGMMFit
 covars (numpy.array)
 Component covariances from
BGMMFit
 scale (numpy.array)
 Scaling of core and accessory distances from
BGMMFit
 values (bool)
Whether to return the responsibilities, rather than the most likely assignment (used for entropy calculation).
Default is False
 Returns:
 ret_vec (numpy.array)
 An nvector with the most likely cluster memberships or an n by k matrix with the component responsibilities for each sample.

PopPUNK.bgmm.
findWithinLabel
(means, assignments, rank=0)[source]¶ Identify withinstrain links
Finds the component with mean closest to the origin and also akes sure some samples are assigned to it (in the case of small weighted components with a Dirichlet prior some components are unused)
 Args:
 means (numpy.array)
 K x 2 array of mixture component means
 assignments (numpy.array)
 Sample cluster assignments
 rank (int)
Which label to find, ordered by distance from origin. 0indexed.
(default = 0)
 Returns:
 within_label (int)
 The cluster label for the withinstrain assignments

PopPUNK.bgmm.
fit2dMultiGaussian
(X, dpgmm_max_K=2)[source]¶ Main function to fit BGMM model, called from
fit()
Fits the mixture model specified, saves model parameters to a file, and assigns the samples to a component. Write fit summary stats to STDERR.
 Args:
 X (np.array)
 n x 2 array of core and accessory distances for n samples. This should be subsampled to 100000 samples.
 dpgmm_max_K (int)
 Maximum number of components to use with the EM fit. (default = 2)
 Returns:
 dpgmm (sklearn.mixture.BayesianGaussianMixture)
 Fitted bgmm model

PopPUNK.bgmm.
log_likelihood
(X, weights, means, covars, scale)[source]¶ modified sklearn GMM function predicting distribution membership
Returns the mixture LL for points X. Used by
assign_samples()
andplot_contours()
 Args:
 X (numpy.array)
 n x 2 array of core and accessory distances for n samples
 weights (numpy.array)
 Component weights from
fit2dMultiGaussian()
 means (numpy.array)
 Component means from
fit2dMultiGaussian()
 covars (numpy.array)
 Component covariances from
fit2dMultiGaussian()
 scale (numpy.array)
 Scaling of core and accessory distances from
fit2dMultiGaussian()
 Returns:
 logprob (numpy.array)
 The log of the probabilities under the mixture model
 lpr (numpy.array)
 The components of the log probability from each mixture component

PopPUNK.bgmm.
log_multivariate_normal_density
(X, means, covars, min_covar=1e07)[source]¶ Log likelihood of multivariate normal density distribution
Used to calculate per component Gaussian likelihood in
assign_samples()
 Args:
 X (numpy.array)
 n x 2 array of core and accessory distances for n samples
 means (numpy.array)
 Component means from
fit2dMultiGaussian()
 covars (numpy.array)
 Component covariances from
fit2dMultiGaussian()
 min_covar (float)
 Minimum covariance, added when Choleksy decomposition fails due to too few observations (default = 1.e7)
 Returns:
 log_prob (numpy.array)
 An nvector with the loglikelihoods for each sample being in this component
dbscan.py¶
Functions used to fit DBSCAN to a database. Access using
DBSCANFit
.
DBSCAN using hdbscan

PopPUNK.dbscan.
assign_samples_dbscan
(X, hdb, scale)[source]¶ Use a fitted dbscan model to assign new samples to a cluster
 Args:
 X (numpy.array)
 N x 2 array of core and accessory distances
 hdb (hdbscan.HDBSCAN)
 Fitted DBSCAN from hdbscan package
 scale (numpy.array)
 Scale factor of model object
 Returns:
 y (numpy.array)
 Cluster assignments by sample

PopPUNK.dbscan.
evaluate_dbscan_clusters
(model)[source]¶ Evaluate whether fitted dbscan model contains nonoverlapping clusters
 Args:
 model (DBSCANFit)
 Fitted model from
fit()
 Returns:
 indistinct (bool)
 Boolean indicating whether putative within and betweenstrain clusters of points overlap

PopPUNK.dbscan.
findBetweenLabel
(assignments, within_cluster)[source]¶ Identify betweenstrain links from a DBSCAN model
Finds the component containing the largest number of betweenstrain links, excluding the cluster identified as containing withinstrain links.
 Args:
 assignments (numpy.array)
 Sample cluster assignments
 within_cluster (int)
 Cluster ID assigned to withinstrain assignments, from
findWithinLabel()
 Returns:
 between_cluster (int)
 The cluster label for the betweenstrain assignments

PopPUNK.dbscan.
fitDbScan
(X, min_samples, min_cluster_size, cache_out)[source]¶ Function to fit DBSCAN model as an alternative to the Gaussian
Fits the DBSCAN model to the distances using hdbscan
 Args:
 X (np.array)
 n x 2 array of core and accessory distances for n samples
 min_samples (int)
 Parameter for DBSCAN clustering ‘conservativeness’
 min_cluster_size (int)
 Minimum number of points in a cluster for HDBSCAN
 cache_out (str)
 Prefix for DBSCAN cache used for refitting
 Returns:
 hdb (hdbscan.HDBSCAN)
 Fitted HDBSCAN to subsampled data
 labels (list)
 Cluster assignments of each sample
 n_clusters (int)
 Number of clusters used
mash.py¶
Mash functions for database construction

PopPUNK.mash.
checkMashVersion
(mash_exec)[source]¶ Checks that mash can be run, and is version 2 or higher. Exits if version < 2.
 Args:
 mash_exec (str)
 Location of mash executable

PopPUNK.mash.
constructDatabase
(assemblyList, klist, sketch, oPrefix, ignoreLengthOutliers=False, threads=1, mash_exec='mash', overwrite=False)[source]¶ Sketch the input assemblies at the requested kmer lengths
A multithread wrapper around
runSketch()
. Threads are used to either run multiple sketch processes for each klist value, or increase the threads used by eachmash sketch
process if len(klist) > threads.Also calculates random match probability based on length of first genome in assemblyList.
 Args:
 assemblyList (str)
 File with locations of assembly files to be sketched
 klist (list)
 List of kmer sizes to sketch
 sketch (int)
 Size of sketch (
s
option)  oPrefix (str)
 Output prefix for resulting sketch files
 ignoreLengthOutliers (bool)
Whether to check for outlying genome lengths (and error if found)
(default = False)
 threads (int)
Number of threads to use
(default = 1)
 mash_exec (str)
Location of mash executable
(default = ‘mash’)
 overwrite (bool)
Whether to overwrite sketch DBs, if they already exist.
(default = False)

PopPUNK.mash.
createDatabaseDir
(outPrefix, kmers)[source]¶ Creates the directory to write mash sketches to, removing old files if unnecessary
 Args:
 outPrefix (str)
 output db prefix
 kmers (list)
 kmer sizes in db

PopPUNK.mash.
fitKmerBlock
(idxRanges, distMat, raw, klist, jacobian)[source]¶ Multirow wrapper around
fitKmerCurve()
to the specified rows in idxRanges Args:
 idxRanges (int, int)
 Tuple of first and last row of slice to calculate
 distMat (numpy.array)
 sharedmem object to store core and accessory distances in (altered in place)
 raw (numpy.array)
 sharedmem object with proportion of kmer matches for each queryref pair by row, columns are at kmer lengths in klist
 klist (list)
 List of kmer lengths to use
 jacobian (numpy.array)
 The Jacobian for the fit, sent to
fitKmerCurve()

PopPUNK.mash.
fitKmerCurve
(pairwise, klist, jacobian)[source]¶ Fit the function \(pr = (1a)(1c)^k\)
Supply
jacobian = np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(1, 1)))
 Args:
 pairwise (numpy.array)
 Proportion of shared kmers at kmer values in klist
 klist (list)
 kmer sizes used
 jacobian (numpy.array)
 Should be set as above (set once to try and save memory)
 Returns:
 transformed_params (numpy.array)
 Column with core and accessory distance

PopPUNK.mash.
getDatabaseName
(prefix, k)[source]¶ Gets the name for the mash database for a given k size
 Args:
 prefix (str)
 db prefix
 k (str)
 kmer size
 Returns:
 db_name (str)
 Name of mash db

PopPUNK.mash.
getKmersFromReferenceDatabase
(dbPrefix)[source]¶ Get kmers lengths from existing database
Parses the database name to determine klist
 Args:
 dbPrefix (str)
 Prefix for sketch DB files
 Returns:
 kmers (list)
 List of kmer lengths used in database

PopPUNK.mash.
getSeqsInDb
(mashSketch, mash_exec='mash')[source]¶ Return an array with the sequences in the passed mash database
Calls
mash info t
 Args:
 mashSketch (str)
 Mash sketches/database
 mash_exec (str)
 Location of mash executable
 Returns:
 seqs (list)
 List of sequence names in sketch DB

PopPUNK.mash.
getSketchSize
(dbPrefix, klist, mash_exec='mash')[source]¶ Call to
mash info
to determine sketch sizesys.exit(1)
is called if DBs have different sketch sizes Args:
 dbprefix (str)
 Prefix for mash databases
 klist (list)
 List of kmer lengths which databases were constructed at
 mash_exec (str)
 Location of mash executable
 Returns:
 sketchdb (dict)
 Dict of sketch sizes indexed by kmer size

PopPUNK.mash.
init_lock
(l)[source]¶ Sets a global lock to use when writing to STDERR in
runSketch()

PopPUNK.mash.
joinDBs
(db1, db2, output, klist, mash_exec='mash')[source]¶ Join two mash sketch databases with
mash paste
 Args:
 db1 (str)
 Prefix for db1
 db2 (str)
 Prefix for db2
 output (str)
 Prefix for joined output
 klist (list)
 List of kmer sizes to sketch
 mash_exec (str)
Location of mash executable
(default = ‘mash’)

PopPUNK.mash.
queryDatabase
(qFile, klist, dbPrefix, queryPrefix, self=True, number_plot_fits=0, no_stream=False, mash_exec='mash', threads=1)[source]¶ Calculate core and accessory distances between query sequences and a sketched database
For a reference database, runs the query against itself to find all pairwise core and accessory distances.
Uses the relation \(pr(a, b) = (1a)(1c)^k\)
To get the ref and query name for each row of the returned distances, call to the iterator
iterDistRows()
with the returned refList and queryList Args:
 qFile (str)
 File with location of query sequences
 klist (list)
 Kmer sizes to use in the calculation
 dbPrefix (str)
 Prefix for reference mash sketch database created by
constructDatabase()
 queryPrefix (str)
 Prefix for query mash sketch database created by
constructDatabase()
 self (bool)
Set true if query = ref
(default = True)
 number_plot_fits (int)
If > 0, the number of kmer length fits to plot (saved as pdfs). Takes random pairs of comparisons and calls
plot_fit()
(default = 0)
 no_stream (bool)
Rather than streaming mash dist input directly into parser, will write through an intermediate temporary file
(default = False)
 mash_exec (str)
Location of mash executable
(default = ‘mash’)
 threads (int)
Number of threads to use in the mash process
(default = 1)
 Returns:
 refList (list)
 Names of reference sequences
 queryList (list)
 Names of query sequences
 distMat (numpy.array)
 Core distances (column 0) and accessory distances (column 1) between refList and queryList

PopPUNK.mash.
readMashDBParams
(dbPrefix, kmers, sketch_sizes, mash_exec='mash')[source]¶ Get kmers lengths and sketch sizes from existing database
Calls
getKmersFromReferenceDatabase()
andgetSketchSize()
Uses passed values if db missing Args:
 dbPrefix (str)
 Prefix for sketch DB files
 kmers (list)
 Kmers to use if db not found
 sketch_sizes (list)
 Sketch size to use if db not found
 mash_exec (str)
Location of mash executable
Default = ‘mash’
 Returns:
 kmers (list)
 List of kmer lengths used in database
 sketch_sizes (list)
 List of sketch sizes used in database

PopPUNK.mash.
runSketch
(k, assemblyList, sketch, genome_length, oPrefix, mash_exec='mash', overwrite=False, threads=1)[source]¶ Actually run the mash sketch command
Called by
constructDatabase()
 Args:
 k (int)
 kmer size to sketch
 assemblyList (list)
 Locations of assembly files to be sketched
 sketch (int)
 Size of sketch (
s
option)  genome_length (int)
 Length of genomes being sketch, for random match probability calculation
 oPrefix (str)
 Output prefix for resulting sketch files
 mash_exec (str)
Location of mash executable
(default = ‘mash’)
 overwrite (bool)
Whether to overwrite sketch DB, if it already exists.
(default = False)
 threads (int)
Number of threads to use in the mash process
(default = 1)
models.py¶
Classes used for model fits

class
PopPUNK.models.
BGMMFit
(outPrefix, max_samples=100000)[source]¶ Class for fits using the Gaussian mixture model. Inherits from
ClusterFit
.Must first run either
fit()
orload()
before calling other functions Args:
 outPrefix (str)
 The output prefix used for reading/writing
 max_samples (int)
The number of subsamples to fit the model to
(default = 100000)

assign
(X, values=False)[source]¶ Assign the clustering of new samples using
assign_samples()
 Args:
 X (numpy.array)
 Core and accessory distances
 values (bool)
 Return the responsibilities of assignment rather than most likely cluster
 Returns:
 y (numpy.array)
 Cluster assignments or values by samples

fit
(X, max_components)[source]¶ Extends
fit()
Fits the BGMM and returns assignments by calling
fit2dMultiGaussian()
.Fitted parameters are stored in the object.
 Args:
 X (numpy.array)
 The core and accessory distances to cluster. Must be set if preprocess is set.
 max_components (int)
 Maximum number of mixture components to use.
 Returns:
 y (numpy.array)
 Cluster assignments of samples in X

load
(fit_npz, fit_obj)[source]¶ Load the model from disk. Called from
loadClusterFit()
 Args:
 fit_npz (dict)
 Fit npz opened with
numpy.load()
 fit_obj (sklearn.mixture.BayesianGaussianMixture)
 The saved fit object

plot
(X, y)[source]¶ Extends
plot()
Write a summary of the fit, and plot the results using
PopPUNK.plot.plot_results()
andPopPUNK.plot.plot_contours()
 Args:
 X (numpy.array)
 Core and accessory distances
 y (numpy.array)
 Cluster assignments from
assign()

class
PopPUNK.models.
ClusterFit
(outPrefix)[source]¶ Parent class for all models used to cluster distances
 Args:
 outPrefix (str)
 The output prefix used for reading/writing

fit
(X=None)[source]¶ Initial steps for all fit functions.
Creates output directory. If preprocess is set then subsamples passed X and draws a scatter plot from result using
plot_scatter()
. Args:
 X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
(default = None)

class
PopPUNK.models.
DBSCANFit
(outPrefix, max_samples=100000)[source]¶ Class for fits using HDBSCAN. Inherits from
ClusterFit
.Must first run either
fit()
orload()
before calling other functions Args:
 outPrefix (str)
 The output prefix used for reading/writing
 max_samples (int)
The number of subsamples to fit the model to
(default = 100000)

assign
(X, no_scale=False)[source]¶ Assign the clustering of new samples using
assign_samples_dbscan()
 Args:
 X (numpy.array)
 Core and accessory distances
 no_scale (bool)
Do not scale X
[default = False]
 Returns:
 y (numpy.array)
 Cluster assignments by samples

fit
(X, max_num_clusters, min_cluster_prop)[source]¶ Extends
fit()
Fits the distances with HDBSCAN and returns assignments by calling
fitDbScan()
.Fitted parameters are stored in the object.
 Args:
 X (numpy.array)
 The core and accessory distances to cluster. Must be set if preprocess is set.
 max_num_clusters (int)
 Maximum number of clusters in DBSCAN fitting
 min_cluster_prop (float)
 Minimum proportion of points in a cluster in DBSCAN fitting
 Returns:
 y (numpy.array)
 Cluster assignments of samples in X

load
(fit_npz, fit_obj)[source]¶ Load the model from disk. Called from
loadClusterFit()
 Args:
 fit_npz (dict)
 Fit npz opened with
numpy.load()
 fit_obj (hdbscan.HDBSCAN)
 The saved fit object

plot
(X=None, y=None)[source]¶ Extends
plot()
Write a summary of the fit, and plot the results using
PopPUNK.plot.plot_dbscan_results()
 Args:
 X (numpy.array)
 Core and accessory distances
 y (numpy.array)
 Cluster assignments from
assign()

class
PopPUNK.models.
RefineFit
(outPrefix)[source]¶ Class for fits using a triangular boundary and network properties. Inherits from
ClusterFit
.Must first run either
fit()
orload()
before calling other functions Args:
 outPrefix (str)
 The output prefix used for reading/writing

assign
(X, slope=None)[source]¶ Assign the clustering of new samples using
withinBoundary()
 Args:
 X (numpy.array)
 Core and accessory distances
 slope (int)
Override self.slope. Default  use self.slope
Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope
 Returns:
 y (numpy.array)
 Cluster assignments by samples

fit
(X, sample_names, model, max_move, min_move, startFile=None, indiv_refine=False, no_local=False, threads=1)[source]¶ Extends
fit()
Fits the distances by optimising network score, by calling
refineFit2D()
.Fitted parameters are stored in the object.
 Args:
 X (numpy.array)
 The core and accessory distances to cluster. Must be set if preprocess is set.
 sample_names (list)
 Sample names in X (accessed by
iterDistRows()
)  model (ClusterFit)
 The model fit to refine
 max_move (float)
 Maximum distance to move away from start point
 min_move (float)
 Minimum distance to move away from start point
 startFile (str)
A file defining an initial fit, rather than one from
fitmodel
. See documentation for format.(default = None).
 indiv_refine (bool)
Run refinement for core and accessory distances separately
(default = False).
 no_local (bool)
 Turn off the local optimisation step. Quicker, but may be less well refined.
 num_processes (int)
Number of threads to use in the global optimisation step.
(default = 1)
 Returns:
 y (numpy.array)
 Cluster assignments of samples in X

load
(fit_npz, fit_obj)[source]¶ Load the model from disk. Called from
loadClusterFit()
 Args:
 fit_npz (dict)
 Fit npz opened with
numpy.load()
 fit_obj (None)
 The saved fit object (not used)

plot
(X, y=None)[source]¶ Extends
plot()
Write a summary of the fit, and plot the results using
PopPUNK.plot.plot_refined_results()
 Args:
 X (numpy.array)
 Core and accessory distances
 y (numpy.array)
 Assignments (unused)

PopPUNK.models.
loadClusterFit
(pkl_file, npz_file, outPrefix='', max_samples=100000)[source]¶ Call this to load a fitted model
 Args:
 pkl_file (str)
 Location of saved .pkl file on disk
 npz_file (str)
 Location of saved .npz file on disk
 outPrefix (str)
 Output prefix for model to save to (e.g. plots)
 max_samples (int)
Maximum samples if subsampling X
[default = 100000]
network.py¶
Functions used to construct the network, and update with new queries. Main
entry point is constructNetwork()
for new reference
databases, and findQueryLinksToNetwork()
for querying
databases.
Network functions

PopPUNK.network.
addQueryToNetwork
(rlist, qlist, qfile, G, kmers, assignments, model, queryDB, no_stream=False, queryQuery=False, threads=1, mash_exec='mash')[source]¶ Finds edges between queries and items in the reference database, and modifies the network to include them.
 Args:
 rlist (list)
 List of reference names
 qlist (list)
 List of query names
 qfile (str)
 File containing queries
 G (networkx.Graph)
 Network to add to (mutated)
 kmers (list)
 List of kmer sizes
 assignments (numpy.array)
 Cluster assignment of items in qlist
 model (ClusterModel)
 Model fitted to reference database
 queryDB (str)
 Query database location
 queryQuery (bool)
Add in all queryquery distances
(default = False)
 no_stream (bool)
Don’t stream mash output
(default = False)
 threads (int)
Number of threads to use if new db created
(default = 1)
 mash_exec (str)
Location of the mash executable
(default = ‘mash’)
 Returns:
 qlist1 (list)
 Ordered list of queries
 distMat (numpy.array)
 Queryquery distances

PopPUNK.network.
constructNetwork
(rlist, qlist, assignments, within_label, summarise=True)[source]¶ Construct an unweighted, undirected network without selfloops. Nodes are samples and edges where samples are within the same cluster
Will print summary statistics about the network to
STDERR
 Args:
 rlist (list)
 List of reference sequence labels
 qlist (list)
 List of query sequence labels
 assignments (numpy.array)
 Labels of most likely cluster assignment from
assign_samples()
 within_label (int)
 The label for the cluster representing withinstrain distances
from
findWithinLabel()
 summarise (bool)
Whether to calculate and print network summaries with
networkSummary()
(default = True)
 Returns:
 G (networkx.Graph)
 The resulting network

PopPUNK.network.
extractReferences
(G, mashOrder, outPrefix, existingRefs=None)[source]¶ Extract references for each cluster based on cliques
Writes chosen references to file by calling
writeReferences()
 Args:
 G (networkx.Graph)
 A network used to define clusters from
constructNetwork()
 mashOrder (list)
 The order of files in the sketches, so returned references are in the same order
 outPrefix (str)
 Prefix for output file (.refs will be appended)
 existingRefs (list)
 References that should be used for each clique
 Returns:
 refFileName (str)
 The name of the file references were written to
 references (list)
 An updated list of the reference names

PopPUNK.network.
fetchNetwork
(network_dir, model, refList, core_only=False, accessory_only=False)[source]¶ Load the network based on input options
Returns the network as a networkx, and sets the slope parameter of the passed model object.
 Args:
 network_dir (str)
 A network used to define clusters from
constructNetwork()
 model (ClusterFit)
 A fitted model object
 refList (list)
 Names of references that should be in the network
 core_only (bool)
Return the network created using only core distances
[default = False]
 accessory_only (bool)
Return the network created using only accessory distances
[default = False]
 Returns:
 genomeNetwork (nx.Graph)
 The loaded network
 cluster_file (str)
 The CSV of cluster assignments corresponding to this network

PopPUNK.network.
networkSummary
(G)[source]¶ Provides summary values about the network
 Args:
 G (networkx.Graph)
 The network of strains from
constructNetwork()
 Returns:
 components (int)
 The number of connected components (and clusters)
 density (float)
 The proportion of possible edges used
 transitivity (float)
 Network transitivity (triads/triangles)
 score (float)
 A score of network fit, given by \(\mathrm{transitivity} * (1\mathrm{density})\)

PopPUNK.network.
printClusters
(G, outPrefix='_clusters.csv', oldClusterFile=None, externalClusterCSV=None, printRef=True, printCSV=True)[source]¶ Get cluster assignments
Also writes assignments to a CSV file
 Args:
 G (networkx.Graph)
 Network used to define clusters (from
constructNetwork()
oraddQueryToNetwork()
)  outPrefix (str)
Prefix for output CSV
Default = “_clusters.csv”
 oldClusterFile (str)
CSV with previous cluster assignments. Pass to ensure consistency in cluster assignment name.
Default = None
 externalClusterCSV (str)
CSV with cluster assignments from any source. Will print a file relating these to new cluster assignments
Default = None
 printRef (bool)
If false, print only query sequences in the output
Default = True
 printCSV (bool)
Print results to file
Default = True
 Returns:
 clustering (dict)
 Dictionary of cluster assignments (keys are sequence names)

PopPUNK.network.
printExternalClusters
(newClusters, extClusterFile, outPrefix, oldNames, printRef=True)[source]¶ Prints cluster assignments with respect to previously defined clusters or labels.
 Args:
 newClusters (set iterable)
 The components from the graph G, defining the PopPUNK clusters
 extClusterFile (str)
 A CSV file containing definitions of the external clusters for each sample (does not need to contain all samples)
 outPrefix (str)
 Prefix for output CSV (_external_clusters.csv)
 oldNames (list)
 A list of the reference sequences
 printRef (bool)
If false, print only query sequences in the output
Default = True
refine.py¶
Functions used to refine an existing model. Access using
RefineFit
.
Refine mixture model using network properties

PopPUNK.refine.
decisionBoundary
(intercept, gradient)[source]¶ Returns the coordinates where the triangle the decision boundary forms meets the x and yaxes.
 Args:
 intercept (numpy.array)
 Cartesian coordinates of point along line (
transformLine()
) which intercepts the boundary  gradient (float)
 Gradient of the line
 Returns:
 x (float)
 The xaxis intercept
 y (float)
 The yaxis intercept

PopPUNK.refine.
likelihoodBoundary
(s, model, start, end, within, between)[source]¶ Wrapper function around
fit2dMultiGaussian()
so that it can go into a rootfinding function for probabilities between components Args:
 s (float)
 Distance along line from mean0
 model (BGMMFit)
 Fitted mixture model
 start (numpy.array)
 The coordinates of the centre of the withinstrain distribution
 end (numpy.array)
 The coordinates of the centre of the betweenstrain distribution
 within (int)
 Label of the withinstrain distribution
 between (int)
 Label of the betweenstrain distribution
 Returns:
 responsibility (float)
 The difference between responsibilities of assignment to the within component and the between assignment

PopPUNK.refine.
newNetwork
(s, sample_names, distMat, start_point, mean1, gradient, slope=2)[source]¶ Wrapper function for
constructNetwork()
which is called by optimisation functions moving a triangular decision boundary.Given the boundary parameterisation, constructs the network and returns its score, to be minimised.
 Args:
 s (float)
 Distance along line between start_point and mean1 from start_point
 sample_names (list)
 Sample names corresponding to distMat (accessed by iterator)
 distMat (numpy.array)
 Core and accessory distances
 start_point (numpy.array)
 Initial boundary cutoff
 mean1 (numpy.array)
 Defines line direction from start_point
 gradient (float)
 Gradient of line to move along
 slope (int)
 Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope
 Returns:
 score (float)
 1 * network score. Where network score is from
networkSummary()

PopPUNK.refine.
readManualStart
(startFile)[source]¶ Reads a file to define a manual start point, rather than using
fitmodel
Throws and exits if incorrectly formatted.
 Args:
 startFile (str)
 Name of file with values to read
 Returns:
 mean0 (numpy.array)
 Centre of withinstrain distribution
 mean1 (numpy.array)
 Centre of betweenstrain distribution
 start_s (float)
 Distance along line between mean0 and mean1 to start at

PopPUNK.refine.
refineFit
(distMat, sample_names, start_s, mean0, mean1, max_move, min_move, slope=2, no_local=False, num_processes=1)[source]¶ Try to refine a fit by maximising a network score based on transitivity and density.
Iteratively move the decision boundary to do this, using starting point from existing model.
 Args:
 distMat (numpy.array)
 n x 2 array of core and accessory distances for n samples
 sample_names (list)
 List of query sequence labels
 start_s (float)
 Point along line to start search
 mean0 (numpy.array)
 Start point to define search line
 mean1 (numpy.array)
 End point to define search line
 max_move (float)
 Maximum distance to move away from start point
 min_move (float)
 Minimum distance to move away from start point
 slope (int)
 Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope
 no_local (bool)
 Turn off the local optimisation step. Quicker, but may be less well refined.
 num_processes (int)
Number of threads to use in the global optimisation step.
(default = 1)
 Returns:
 start_point (tuple)
 (x, y) coordinates of starting point
 optimal_x (float)
 xcoordinate of refined fit
 optimal_y (float)
 ycoordinate of refined fit

PopPUNK.refine.
transformLine
(s, mean0, mean1)[source]¶ Return x and y coordinates for traversing along a line between mean0 and mean1, parameterised by a single scalar distance s from the start point mean0.
 Args:
 s (float)
 Distance along line from mean0
 mean0 (numpy.array)
 Start position of line (x0, y0)
 mean1 (numpy.array)
 End position of line (x1, y1)
 Returns:
 x (float)
 The Cartesian xcoordinate
 y (float)
 The Cartesian ycoordinate

PopPUNK.refine.
withinBoundary
(dists, x_max, y_max, slope=2)[source]¶ Classifies points as within or outside of a refined boundary. Numba JIT compiled for speed.
Also used to assign new points in
assign()
 Args:
 dists (numpy.array)
 Core and accessory distances to classify
 x_max (float)
 The xaxis intercept from
decisionBoundary()
 y_max (float)
 The yaxis intercept from
decisionBoundary()
 slope (int)
 Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope
 Returns:
 signs (numpy.array)
 For each sample in dists, 1 if withinstrain and 1 if betweenstrain. 0 if exactly on boundary.
plot.py¶
Plots of GMM results, kmer fits, and microreact output

PopPUNK.plot.
buildRapidNJ
(rapidnj, refList, coreMat, outPrefix, tree_filename)[source]¶ Use rapidNJ for more rapid tree building
Creates a phylip of core distances, system call to rapidnj executable, loads tree as dendropy object (cleaning quotes in node names), removes temporary files.
 Args:
 rapidnj (str)
 Location of rapidnj executable
 refList (list)
 Names of sequences in coreMat (same order)
 coreMat (numpy.array)
 NxN core distance matrix produced in
outputsForMicroreact()
 outPrefix (int)
 Output prefix for temporary files
 outPrefix (str)
 Prefix for all generated output files, which will be placed in outPrefix subdirectory
 tree_filename (str)
 Filename for output tree (saved to disk)
 Returns:
 tree (dendropy.Tree)
 NJ tree from core distances

PopPUNK.plot.
generate_phylogeny
(coreMat, seqLabels, outPrefix, tree_suffix, rapidnj, overwrite)[source]¶ Generate phylogeny using dendropy or RapidNJ
Writes a neighbour joining tree (.nwk) from core distances.
 Args:
 coreMat (numpy.array)
 n x n array of core distances for n samples.
 seqLabels (list)
 Processed names of sequences being analysed.
 outPrefix (str)
 Prefix for all generated output files, which will be placed in outPrefix subdirectory
 tree_suffix (str)
 String to append to tree file name
 rapidnj (str)
 A string with the location of the rapidnj executable for treebuilding. If None, will use dendropy by default
 overwrite (bool)
 Overwrite existing output if present (default = False)

PopPUNK.plot.
get_grid
(minimum, maximum, resolution)[source]¶ Get a square grid of points to evaluate a function across
Used for
plot_scatter()
andplot_contours()
 Args:
 minimum (float)
 Minimum value for grid
 maximum (float)
 Maximum value for grid
 resolution (int)
 Number of points along each axis
 Returns:
 xx (numpy.array)
 x values across n x n grid
 yy (numpy.array)
 y values across n x n grid
 xy (numpy.array)
 n x 2 pairs of x, y values grid is over

PopPUNK.plot.
outputsForCytoscape
(G, clustering, outPrefix, epiCsv, queryList=None, suffix=None, writeCsv=True)[source]¶ Write outputs for cytoscape. A graphml of the network, and CSV with metadata
 Args:
 G (networkx.Graph)
 The network to write from
constructNetwork()
 clustering (dict)
 Dictionary of cluster assignments (keys are nodeNames).
 outPrefix (str)
 Prefix for files to be written
 epiCsv (str)
 Optional CSV of epi data to paste in the output in addition to the clusters.
 queryList (list)
 Optional list of isolates that have been added as a query. (default = None)
 suffix (string)
 String to append to network file name. (default = None)
 writeCsv (bool)
 Whether to print CSV file to accompany network

PopPUNK.plot.
outputsForGrapetree
(combined_list, coreMat, clustering, outPrefix, epiCsv, rapidnj, queryList=None, overwrite=False, microreact=False)[source]¶ Generate files for Grapetree
Write a neighbour joining tree (.nwk) from core distances and cluster assignment (.csv)
 Args:
 combined_list (list)
 Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
 coreMat (numpy.array)
 n x n array of core distances for n samples.
 clustering (dict or dict of dicts)
 List of cluster assignments from
printClusters()
. Further clusterings (e.g. 1D core only) can be included by passing these as a dict.  outPrefix (str)
 Prefix for all generated output files, which will be placed in outPrefix subdirectory.
 epiCsv (str)
 A CSV containing other information, to include with the CSV of clusters
 rapidnj (str)
 A string with the location of the rapidnj executable for treebuilding. If None, will use dendropy by default
 queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV.
(default = None)
 overwrite (bool)
 Overwrite existing output if present (default = False).
 microreact (bool)
 Avoid regenerating tree if already built for microreact (default = False).

PopPUNK.plot.
outputsForMicroreact
(combined_list, coreMat, accMat, clustering, perplexity, outPrefix, epiCsv, rapidnj, queryList=None, overwrite=False)[source]¶ Generate files for microreact
Output a neighbour joining tree (.nwk) from core distances, a plot of tSNE clustering of accessory distances (.dot) and cluster assignment (.csv)
 Args:
 combined_list (list)
 Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
 coreMat (numpy.array)
 n x n array of core distances for n samples.
 accMat (numpy.array)
 n x n array of accessory distances for n samples.
 clustering (dict or dict of dicts)
 List of cluster assignments from
printClusters()
. Further clusterings (e.g. 1D core only) can be included by passing these as a dict.  perplexity (int)
 Perplexity parameter passed to tSNE
 outPrefix (str)
 Prefix for all generated output files, which will be placed in outPrefix subdirectory
 epiCsv (str)
 A CSV containing other information, to include with the CSV of clusters
 rapidnj (str)
 A string with the location of the rapidnj executable for treebuilding. If None, will use dendropy by default
 queryList (list)
 Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
 overwrite (bool)
 Overwrite existing output if present (default = False)

PopPUNK.plot.
outputsForPhandango
(combined_list, coreMat, clustering, outPrefix, epiCsv, rapidnj, queryList=None, overwrite=False, microreact=False)[source]¶ Generate files for Phandango
Write a neighbour joining tree (.tree) from core distances and cluster assignment (.csv)
 Args:
 combined_list (list)
 Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
 coreMat (numpy.array)
 n x n array of core distances for n samples.
 clustering (dict or dict of dicts)
 List of cluster assignments from
printClusters()
. Further clusterings (e.g. 1D core only) can be included by passing these as a dict.  outPrefix (str)
 Prefix for all generated output files, which will be placed in outPrefix subdirectory
 epiCsv (str)
 A CSV containing other information, to include with the CSV of clusters
 rapidnj (str)
 A string with the location of the rapidnj executable for treebuilding. If None, will use dendropy by default
 queryList (list)
 Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
 overwrite (bool)
 Overwrite existing output if present (default = False)
 microreact (bool)
 Avoid regenerating tree if already built for microreact (default = False)

PopPUNK.plot.
plot_contours
(assignments, weights, means, covariances, title, out_prefix)[source]¶ Draw contours of mixture model assignments
Will draw the decision boundary for between/within in red
 Args:
 assignments (numpy.array)
 nvectors of cluster assignments for model
 weights (numpy.array)
 Component weights from
BGMMFit
 means (numpy.array)
 Component means from
BGMMFit
 covars (numpy.array)
 Component covariances from
BGMMFit
 title (str)
 The title to display above the plot
 out_prefix (str)
 Prefix for output plot file (.pdf will be appended)

PopPUNK.plot.
plot_dbscan_results
(X, y, n_clusters, out_prefix)[source]¶ Draw a scatter plot (png) to show the DBSCAN model fit
A scatter plot of core and accessory distances, coloured by component membership. Black is noise
 Args:
 X (numpy.array)
 n x 2 array of core and accessory distances for n samples.
 Y (numpy.array)
 n x 1 array of cluster assignments for n samples.
 n_clusters (int)
 Number of clusters used (excluding noise)
 out_prefix (str)
 Prefix for output file (.png will be appended)

PopPUNK.plot.
plot_fit
(klist, matching, fit, out_prefix, title)[source]¶ Draw a scatter plot (pdf) of kmer sizes vs match probability, and the fit used to assign core and accessory distance
Kmer sizes on xaxis, log(pr(match)) on y  expect a straight line fit with intercept representing accessory distance and slope core distance
 Args:
 klist (list)
 List of kmer sizes
 matching (list)
 Proportion of matching kmers at each klist value
 kfit (numpy.array)
 Fit to klist and matching from
fitKmerCurve()
 out_prefix (str)
 Prefix for output plot file (.pdf will be appended)
 title (str)
 The title to display above the plot

PopPUNK.plot.
plot_refined_results
(X, Y, x_boundary, y_boundary, core_boundary, accessory_boundary, mean0, mean1, start_point, min_move, max_move, scale, indiv_boundaries, title, out_prefix)[source]¶ Draw a scatter plot (png) to show the refined model fit
A scatter plot of core and accessory distances, coloured by component membership. The triangular decision boundary is also shown
 Args:
 X (numpy.array)
 n x 2 array of core and accessory distances for n samples.
 Y (numpy.array)
 n x 1 array of cluster assignments for n samples.
 x_boundary (float)
 Intercept of boundary with xaxis, from
RefineFit
 y_boundary (float)
 Intercept of boundary with yaxis, from
RefineFit
 core_boundary (float)
 Intercept of 1D (core) boundary with xaxis, from
RefineFit
 accessory_boundary (float)
 Intercept of 1D (core) boundary with yaxis, from
RefineFit
 mean0 (numpy.array)
 Centre of withinstrain distribution
 mean1 (numpy.array)
 Centre of betweenstrain distribution
 start_point (numpy.array)
 Start point of optimisation
 min_move (float)
 Minimum s range
 max_move (float)
 Maximum s range
 scale (numpy.array)
 Scaling factor from
RefineFit
 indiv_boundaries (bool)
 Whether to draw lines for core and accessory refinement
 title (str)
 The title to display above the plot
 out_prefix (str)
 Prefix for output plot file (.png will be appended)

PopPUNK.plot.
plot_results
(X, Y, means, covariances, scale, title, out_prefix)[source]¶ Draw a scatter plot (png) to show the BGMM model fit
A scatter plot of core and accessory distances, coloured by component membership. Also shown are ellipses for each component (centre: means axes: covariances).
This is based on the example in the sklearn documentation.
 Args:
 X (numpy.array)
 n x 2 array of core and accessory distances for n samples.
 Y (numpy.array)
 n x 1 array of cluster assignments for n samples.
 means (numpy.array)
 Component means from
BGMMFit
 covars (numpy.array)
 Component covariances from
BGMMFit
 scale (numpy.array)
 Scaling factor from
BGMMFit
 out_prefix (str)
 Prefix for output plot file (.png will be appended)
 title (str)
 The title to display above the plot

PopPUNK.plot.
plot_scatter
(X, scale, out_prefix, title, kde=True)[source]¶ Draws a 2D scatter plot (png) of the core and accessory distances
Also draws contours of kernel density estimare
 Args:
 X (numpy.array)
 n x 2 array of core and accessory distances for n samples.
 scale (numpy.array)
 Scaling factor from
BGMMFit
 out_prefix (str)
 Prefix for output plot file (.png will be appended)
 title (str)
 The title to display above the plot
 kde (bool)
Whether to draw kernel density estimate contours
(default = True)

PopPUNK.plot.
writeClusterCsv
(outfile, nodeNames, nodeLabels, clustering, output_format='microreact', epiCsv=None, queryNames=None)[source]¶ Print CSV file of clustering and optionally epi data
Writes CSV output of clusters which can be used as input to microreact and cytoscape. Uses pandas to deal with CSV reading and writing nicely.
The epiCsv, if provided, should have the node labels in the first column.
 Args:
 outfile (str)
 File to write the CSV to.
 nodeNames (list)
 Names of sequences in clustering (includes path).
 nodeLabels (list)
 Names of sequences to write in CSV (usually has path removed).
 clustering (dict or dict of dicts)
 Dictionary of cluster assignments (keys are nodeNames). Pass a dict with depth two to include multiple possible clusterings.
 output_format (str)
 Software for which CSV should be formatted (microreact, phandango, grapetree and cytoscape are accepted)
 epiCsv (str)
 Optional CSV of epi data to paste in the output in addition to the clusters (default = None).
 queryNames (list)
Optional list of isolates that have been added as a query.
(default = None)
tsne.py¶

PopPUNK.tsne.
generate_tsne
(seqLabels, accMat, perplexity, outPrefix, overwrite, verbosity=0)[source]¶ Generate tSNE projection using accessory distances
Writes a plot of tSNE clustering of accessory distances (.dot)
 Args:
 seqLabels (list)
 Processed names of sequences being analysed.
 accMat (numpy.array)
 n x n array of accessory distances for n samples.
 perplexity (int)
 Perplexity parameter passed to tSNE
 outPrefix (str)
 Prefix for all generated output files, which will be placed in outPrefix subdirectory
 overwrite (bool)
Overwrite existing output if present
(default = False)
 verbosity (int)
Verbosity of tSNE process (03)
(default = 0)
utils.py¶
General utility functions for data read/writing/manipulation in PopPUNK

PopPUNK.utils.
iterDistRows
(refSeqs, querySeqs, self=True)[source]¶ Gets the ref and query ID for each row of the distance matrix
Returns an iterable with ref and query ID pairs by row.
 Args:
 refSeqs (list)
 List of reference sequence names.
 querySeqs (list)
 List of query sequence names.
 self (bool)
Whether a selfcomparison, used when constructing a database.
Requires refSeqs == querySeqs
Default is True
 Returns:
 ref, query (str, str)
 Iterable of tuples with ref and query names for each distMat row.

PopPUNK.utils.
qcDistMat
(distMat, refList, queryList, a_max)[source]¶ Checks distance matrix for outliers. At the moment just a threshold for accessory distance
 Args:
 distMat (np.array)
 Core and accessory distances
 refList (list)
 Reference labels
 queryList (list)
 Query labels (or refList if self)
 a_max (float)
 Maximum accessory distance to allow
 Returns:
 passed (bool)
 False if any samples failed

PopPUNK.utils.
readClusters
(clustCSV, return_dict=False)[source]¶ Read a previous reference clustering from CSV
 Args:
 clustCSV (str)
 File name of CSV with previous cluster assignments
 return_type (str)
 If True, return a dict with sample>cluster instead of sets
 Returns:
 clusters (dict)
 Dictionary of cluster assignments (keys are cluster names, values are sets containing samples in the cluster). Or if return_dict is set keys are sample names, values are cluster assignments.

PopPUNK.utils.
readExternalClusters
(clustCSV)[source]¶ Read a cluster definition from CSV (does not have to be PopPUNK generated clusters). Rows samples, columns clusters.
 Args:
 clustCSV (str)
 File name of CSV with previous cluster assingments
 Returns:
 extClusters (dict)
 Dictionary of dictionaries of cluster assignments (first key cluster assignment name, second key sample, value cluster assignment)

PopPUNK.utils.
readPickle
(pklName)[source]¶ Loads core and accessory distances saved by
storePickle()
Called during
fitmodel
 Args:
 pklName (str)
 Prefix for saved files
 Returns:
 rlist (list)
 List of reference sequence names (for
iterDistRows()
)  qlist (list)
 List of query sequence names (for
iterDistRows()
)  self (bool)
 Whether an allvsall self DB (for
iterDistRows()
)  X (numpy.array)
 n x 2 array of core and accessory distances

PopPUNK.utils.
storePickle
(rlist, qlist, self, X, pklName)[source]¶ Saves core and accessory distances in a .npy file, names in a .pkl
Called during
createdb
 Args:
 rlist (list)
 List of reference sequence names (for
iterDistRows()
)  qlist (list)
 List of query sequence names (for
iterDistRows()
)  self (bool)
 Whether an allvsall self DB (for
iterDistRows()
)  X (numpy.array)
 n x 2 array of core and accessory distances
 pklName (str)
 Prefix for output files

PopPUNK.utils.
translate_distMat
(combined_list, core_distMat, acc_distMat)[source]¶ Convert distances from a square form (2 NxN matrices) to a long form (1 matrix with n_comparisons rows and 2 columns).
 Args:
 combined_list
 Combined list of references followed by queries (list)
 core_distMat (numpy.array)
 NxN core distances
 acc_distMat (numpy.array)
 NxN accessory distances
 Returns:
 distMat (numpy.array)
 Distances in long form

PopPUNK.utils.
update_distance_matrices
(refList, distMat, queryList=None, query_ref_distMat=None, query_query_distMat=None)[source]¶ Convert distances from long form (1 matrix with n_comparisons rows and 2 columns) to a square form (2 NxN matrices), with merging of query distances if necessary.
 Args:
 refList (list)
 List of references
 distMat (numpy.array)
 Two column long form list of core and accessory distances for pairwise comparisons between reference db sequences
 queryList (list)
 List of queries
 query_ref_distMat (numpy.array)
 Two column long form list of core and accessory distances for pairwise comparisons between queries and reference db sequences
 query_query_distMat (numpy.array)
 Two column long form list of core and accessory distances for pairwise comparisons between query sequences
 Returns:
 seqLabels (list)
 Combined list of reference and query sequences
 coreMat (numpy.array)
 NxN array of core distances for N sequences
 accMat (numpy.array)
 NxN array of accessory distances for N sequences