Reference documentation

Documentation for module functions (for developers)

bgmm.py

Functions used to fit the mixture model to a database. Access using BGMMFit.

BGMM using sklearn

PopPUNK.bgmm.assign_samples(X, weights, means, covars, scale, values=False)[source]

Given distances and a fit, calculates responsibilities and returns the most likely cluster assignment

Args:
X (numpy.array)
n x 2 array of core and accessory distances for n samples
weights (numpy.array)
Component weights from BGMMFit
means (numpy.array)
Component means from BGMMFit
covars (numpy.array)
Component covariances from BGMMFit
scale (numpy.array)
Scaling of core and accessory distances from BGMMFit
values (bool)

Whether to return the responsibilities, rather than the most likely assignment (used for entropy calculation).

Default is False

Returns:
ret_vec (numpy.array)
An n-vector with the most likely cluster memberships or an n by k matrix with the component responsibilities for each sample.
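The underlying calculation can be sketched as a plain numpy re-implementation (illustrative only — not the PopPUNK function itself, and any example weights, means and covariances passed to it are invented):

```python
import numpy as np

def gmm_assign(X, weights, means, covars):
    """Most likely mixture component for each row of X (n x 2 distances).

    Illustrative sketch of weighted Gaussian responsibilities + argmax.
    """
    n, k = X.shape[0], len(weights)
    log_prob = np.empty((n, k))
    for j in range(k):
        diff = X - means[j]
        inv = np.linalg.inv(covars[j])
        # Mahalanobis distance of each sample to component j
        maha = np.einsum('ni,ij,nj->n', diff, inv, diff)
        log_det = np.log(np.linalg.det(covars[j]))
        # 2D Gaussian log-density plus log component weight
        log_prob[:, j] = -0.5 * (maha + log_det + 2 * np.log(2 * np.pi)) + np.log(weights[j])
    return np.argmax(log_prob, axis=1)
```

With two well-separated components, points near each mean are assigned to that component.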
PopPUNK.bgmm.findWithinLabel(means, assignments, rank=0)[source]

Identify within-strain links

Finds the component with mean closest to the origin, and also makes sure some samples are assigned to it (in the case of small weighted components with a Dirichlet prior, some components are unused)

Args:
means (numpy.array)
K x 2 array of mixture component means
assignments (numpy.array)
Sample cluster assignments
rank (int)

Which label to find, ordered by distance from origin. 0-indexed.

(default = 0)

Returns:
within_label (int)
The cluster label for the within-strain assignments
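The selection logic can be sketched as follows (an illustrative re-implementation; the example means and assignments used to exercise it are invented):

```python
import numpy as np

def within_label(means, assignments, rank=0):
    """Component nearest the origin (by rank) that has at least one assigned sample.

    Illustrative sketch: unused components are skipped, mirroring the
    documented behaviour under a Dirichlet prior.
    """
    used = set(np.unique(assignments))
    dists = np.linalg.norm(means, axis=1)          # distance of each mean from origin
    order = [k for k in np.argsort(dists) if k in used]
    return order[rank]
```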
PopPUNK.bgmm.fit2dMultiGaussian(X, dpgmm_max_K=2)[source]

Main function to fit BGMM model, called from fit()

Fits the specified mixture model, saves model parameters to a file, and assigns the samples to a component. Writes fit summary stats to STDERR.

Args:
X (np.array)
n x 2 array of core and accessory distances for n samples. This should be subsampled to 100000 samples.
dpgmm_max_K (int)
Maximum number of components to use with the EM fit. (default = 2)
Returns:
dpgmm (sklearn.mixture.BayesianGaussianMixture)
Fitted bgmm model
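A minimal sketch of the kind of scikit-learn call this wraps, on invented toy distances (this is not the PopPUNK wrapper itself, which also scales the distances, saves parameters, and writes summary stats):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# Invented toy core/accessory distances: a tight within-strain blob
# and a separate between-strain blob
X = np.vstack([rng.normal(0.01, 0.002, (100, 2)),
               rng.normal(0.2, 0.02, (100, 2))])

dpgmm = BayesianGaussianMixture(n_components=2, covariance_type='full',
                                max_iter=500, random_state=1).fit(X)
y = dpgmm.predict(X)   # component assignment per sample
```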
PopPUNK.bgmm.log_likelihood(X, weights, means, covars, scale)[source]

Modified sklearn GMM function predicting distribution membership

Returns the mixture LL for points X. Used by assign_samples() and plot_contours()

Args:
X (numpy.array)
n x 2 array of core and accessory distances for n samples
weights (numpy.array)
Component weights from fit2dMultiGaussian()
means (numpy.array)
Component means from fit2dMultiGaussian()
covars (numpy.array)
Component covariances from fit2dMultiGaussian()
scale (numpy.array)
Scaling of core and accessory distances from fit2dMultiGaussian()
Returns:
logprob (numpy.array)
The log of the probabilities under the mixture model
lpr (numpy.array)
The components of the log probability from each mixture component
PopPUNK.bgmm.log_multivariate_normal_density(X, means, covars, min_covar=1e-07)[source]

Log likelihood of multivariate normal density distribution

Used to calculate per component Gaussian likelihood in assign_samples()

Args:
X (numpy.array)
n x 2 array of core and accessory distances for n samples
means (numpy.array)
Component means from fit2dMultiGaussian()
covars (numpy.array)
Component covariances from fit2dMultiGaussian()
min_covar (float)
Minimum covariance, added when Cholesky decomposition fails due to too few observations (default = 1.e-7)
Returns:
log_prob (numpy.array)
An n-vector with the log-likelihoods for each sample being in this component
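The Cholesky-based calculation can be sketched as follows (an illustrative numpy version for a single 2D component, not the library function):

```python
import numpy as np

def log_mvn_density(X, mean, covar, min_covar=1e-7):
    """Per-sample log-likelihood under one 2D Gaussian, via Cholesky decomposition."""
    try:
        cv_chol = np.linalg.cholesky(covar)
    except np.linalg.LinAlgError:
        # Regularise when the covariance is not positive definite
        cv_chol = np.linalg.cholesky(covar + min_covar * np.eye(2))
    cv_log_det = 2 * np.sum(np.log(np.diagonal(cv_chol)))
    # Whitened residuals: solve L z = (x - mean) for each sample
    sol = np.linalg.solve(cv_chol, (X - mean).T).T
    return -0.5 * (np.sum(sol ** 2, axis=1) + 2 * np.log(2 * np.pi) + cv_log_det)
```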

dbscan.py

Functions used to fit DBSCAN to a database. Access using DBSCANFit.

DBSCAN using hdbscan

PopPUNK.dbscan.assign_samples_dbscan(X, hdb, scale)[source]

Use a fitted dbscan model to assign new samples to a cluster

Args:
X (numpy.array)
N x 2 array of core and accessory distances
hdb (hdbscan.HDBSCAN)
Fitted DBSCAN from hdbscan package
scale (numpy.array)
Scale factor of model object
Returns:
y (numpy.array)
Cluster assignments by sample
PopPUNK.dbscan.evaluate_dbscan_clusters(model)[source]

Evaluate whether fitted dbscan model contains non-overlapping clusters

Args:
model (DBSCANFit)
Fitted model from fit()
Returns:
indistinct (bool)
Boolean indicating whether putative within- and between-strain clusters of points overlap
PopPUNK.dbscan.findBetweenLabel(assignments, within_cluster)[source]

Identify between-strain links from a DBSCAN model

Finds the component containing the largest number of between-strain links, excluding the cluster identified as containing within-strain links.

Args:
assignments (numpy.array)
Sample cluster assignments
within_cluster (int)
Cluster ID assigned to within-strain assignments, from findWithinLabel()
Returns:
between_cluster (int)
The cluster label for the between-strain assignments
PopPUNK.dbscan.fitDbScan(X, min_samples, min_cluster_size, cache_out)[source]

Function to fit a DBSCAN model as an alternative to the Gaussian mixture model

Fits the DBSCAN model to the distances using hdbscan

Args:
X (np.array)
n x 2 array of core and accessory distances for n samples
min_samples (int)
Parameter for DBSCAN clustering ‘conservativeness’
min_cluster_size (int)
Minimum number of points in a cluster for HDBSCAN
cache_out (str)
Prefix for DBSCAN cache used for refitting
Returns:
hdb (hdbscan.HDBSCAN)
Fitted HDBSCAN to subsampled data
labels (list)
Cluster assignments of each sample
n_clusters (int)
Number of clusters used

mash.py

Mash functions for database construction

PopPUNK.mash.checkMashVersion(mash_exec)[source]

Checks that mash can be run, and is version 2 or higher. Exits if version < 2.

Args:
mash_exec (str)
Location of mash executable
PopPUNK.mash.constructDatabase(assemblyList, klist, sketch, oPrefix, ignoreLengthOutliers=False, threads=1, mash_exec='mash', overwrite=False)[source]

Sketch the input assemblies at the requested k-mer lengths

A multithreaded wrapper around runSketch(). Threads are used either to run multiple sketch processes, one per klist value, or to increase the threads used by each mash sketch process if len(klist) > threads.

Also calculates random match probability based on length of first genome in assemblyList.

Args:
assemblyList (str)
File with locations of assembly files to be sketched
klist (list)
List of k-mer sizes to sketch
sketch (int)
Size of sketch (-s option)
oPrefix (str)
Output prefix for resulting sketch files
ignoreLengthOutliers (bool)

Whether to check for outlying genome lengths (and error if found)

(default = False)

threads (int)

Number of threads to use

(default = 1)

mash_exec (str)

Location of mash executable

(default = ‘mash’)

overwrite (bool)

Whether to overwrite sketch DBs, if they already exist.

(default = False)

PopPUNK.mash.createDatabaseDir(outPrefix, kmers)[source]

Creates the directory to write mash sketches to, removing old files if necessary

Args:
outPrefix (str)
output db prefix
kmers (list)
k-mer sizes in db
PopPUNK.mash.fitKmerBlock(idxRanges, distMat, raw, klist, jacobian)[source]

A multi-row wrapper around fitKmerCurve(), applied to the rows specified in idxRanges

Args:
idxRanges (int, int)
Tuple of first and last row of slice to calculate
distMat (numpy.array)
sharedmem object to store core and accessory distances in (altered in place)
raw (numpy.array)
sharedmem object with proportion of k-mer matches for each query-ref pair by row, columns are at k-mer lengths in klist
klist (list)
List of k-mer lengths to use
jacobian (numpy.array)
The Jacobian for the fit, sent to fitKmerCurve()
PopPUNK.mash.fitKmerCurve(pairwise, klist, jacobian)[source]

Fit the function \(pr = (1-a)(1-c)^k\)

Supply jacobian = -np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(-1, 1)))

Args:
pairwise (numpy.array)
Proportion of shared k-mers at k-mer values in klist
klist (list)
k-mer sizes used
jacobian (numpy.array)
Should be set as above (set once to try and save memory)
Returns:
transformed_params (numpy.array)
Column with core and accessory distance
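The relation can be recovered from exact toy data with a simple log-linear least squares — a sketch of the idea only, since the real function uses constrained non-linear least squares with the supplied Jacobian (the toy distances below are invented):

```python
import numpy as np

klist = np.array([13, 17, 21, 25, 29])
a_true, c_true = 0.05, 0.01          # invented accessory (a) and core (c) distances
pairwise = (1 - a_true) * (1 - c_true) ** klist

# log(pr) = log(1 - a) + k * log(1 - c): ordinary least squares on the log scale
design = np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(-1, 1)))
coef, *_ = np.linalg.lstsq(design, np.log(pairwise), rcond=None)
core, accessory = 1 - np.exp(coef[1]), 1 - np.exp(coef[0])
```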
PopPUNK.mash.getDatabaseName(prefix, k)[source]

Gets the name for the mash database for a given k size

Args:
prefix (str)
db prefix
k (str)
k-mer size
Returns:
db_name (str)
Name of mash db
PopPUNK.mash.getKmersFromReferenceDatabase(dbPrefix)[source]

Get kmers lengths from existing database

Parses the database name to determine klist

Args:
dbPrefix (str)
Prefix for sketch DB files
Returns:
kmers (list)
List of k-mer lengths used in database
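The filename parsing can be sketched as follows, assuming sketch files follow a <prefix>.<k>.msh naming convention (an assumption made here for illustration):

```python
import re

def kmers_from_names(file_names):
    """Extract sorted k-mer lengths from sketch file names like 'db/db.17.msh'.

    Assumes the '<prefix>.<k>.msh' naming convention; other files are ignored.
    """
    kmers = []
    for name in file_names:
        m = re.search(r'\.(\d+)\.msh$', name)
        if m:
            kmers.append(int(m.group(1)))
    return sorted(kmers)
```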
PopPUNK.mash.getSeqsInDb(mashSketch, mash_exec='mash')[source]

Return an array with the sequences in the passed mash database

Calls mash info -t

Args:
mashSketch (str)
Mash sketches/database
mash_exec (str)
Location of mash executable
Returns:
seqs (list)
List of sequence names in sketch DB
PopPUNK.mash.getSketchSize(dbPrefix, klist, mash_exec='mash')[source]

Call to mash info to determine sketch size

sys.exit(1) is called if DBs have different sketch sizes

Args:
dbprefix (str)
Prefix for mash databases
klist (list)
List of k-mer lengths which databases were constructed at
mash_exec (str)
Location of mash executable
Returns:
sketchdb (dict)
Dict of sketch sizes indexed by k-mer size
PopPUNK.mash.init_lock(l)[source]

Sets a global lock to use when writing to STDERR in runSketch()

PopPUNK.mash.joinDBs(db1, db2, output, klist, mash_exec='mash')[source]

Join two mash sketch databases with mash paste

Args:
db1 (str)
Prefix for db1
db2 (str)
Prefix for db2
output (str)
Prefix for joined output
klist (list)
List of k-mer sizes to sketch
mash_exec (str)

Location of mash executable

(default = ‘mash’)

PopPUNK.mash.queryDatabase(qFile, klist, dbPrefix, queryPrefix, self=True, number_plot_fits=0, no_stream=False, mash_exec='mash', threads=1)[source]

Calculate core and accessory distances between query sequences and a sketched database

For a reference database, runs the query against itself to find all pairwise core and accessory distances.

Uses the relation \(pr(a, b) = (1-a)(1-c)^k\)

To get the ref and query name for each row of the returned distances, call the iterator iterDistRows() with the returned refList and queryList

Args:
qFile (str)
File with location of query sequences
klist (list)
K-mer sizes to use in the calculation
dbPrefix (str)
Prefix for reference mash sketch database created by constructDatabase()
queryPrefix (str)
Prefix for query mash sketch database created by constructDatabase()
self (bool)

Set true if query = ref

(default = True)

number_plot_fits (int)

If > 0, the number of k-mer length fits to plot (saved as pdfs). Takes random pairs of comparisons and calls plot_fit()

(default = 0)

no_stream (bool)

Rather than streaming mash dist output directly into the parser, write it through an intermediate temporary file

(default = False)

mash_exec (str)

Location of mash executable

(default = ‘mash’)

threads (int)

Number of threads to use in the mash process

(default = 1)

Returns:
refList (list)
Names of reference sequences
queryList (list)
Names of query sequences
distMat (numpy.array)
Core distances (column 0) and accessory distances (column 1) between refList and queryList
PopPUNK.mash.readMashDBParams(dbPrefix, kmers, sketch_sizes, mash_exec='mash')[source]

Get kmers lengths and sketch sizes from existing database

Calls getKmersFromReferenceDatabase() and getSketchSize(). Uses the passed values if the db is missing.

Args:
dbPrefix (str)
Prefix for sketch DB files
kmers (list)
Kmers to use if db not found
sketch_sizes (list)
Sketch size to use if db not found
mash_exec (str)

Location of mash executable

Default = ‘mash’

Returns:
kmers (list)
List of k-mer lengths used in database
sketch_sizes (list)
List of sketch sizes used in database
PopPUNK.mash.runSketch(k, assemblyList, sketch, genome_length, oPrefix, mash_exec='mash', overwrite=False, threads=1)[source]

Actually run the mash sketch command

Called by constructDatabase()

Args:
k (int)
k-mer size to sketch
assemblyList (list)
Locations of assembly files to be sketched
sketch (int)
Size of sketch (-s option)
genome_length (int)
Length of genomes being sketched, for random match probability calculation
oPrefix (str)
Output prefix for resulting sketch files
mash_exec (str)

Location of mash executable

(default = ‘mash’)

overwrite (bool)

Whether to overwrite sketch DB, if it already exists.

(default = False)

threads (int)

Number of threads to use in the mash process

(default = 1)

models.py

Classes used for model fits

class PopPUNK.models.BGMMFit(outPrefix, max_samples=100000)[source]

Class for fits using the Gaussian mixture model. Inherits from ClusterFit.

Must first run either fit() or load() before calling other functions

Args:
outPrefix (str)
The output prefix used for reading/writing
max_samples (int)

The number of subsamples to fit the model to

(default = 100000)

assign(X, values=False)[source]

Assign the clustering of new samples using assign_samples()

Args:
X (numpy.array)
Core and accessory distances
values (bool)
Return the responsibilities of assignment rather than most likely cluster
Returns:
y (numpy.array)
Cluster assignments or values by samples
fit(X, max_components)[source]

Extends fit()

Fits the BGMM and returns assignments by calling fit2dMultiGaussian().

Fitted parameters are stored in the object.

Args:
X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
max_components (int)
Maximum number of mixture components to use.
Returns:
y (numpy.array)
Cluster assignments of samples in X
load(fit_npz, fit_obj)[source]

Load the model from disk. Called from loadClusterFit()

Args:
fit_npz (dict)
Fit npz opened with numpy.load()
fit_obj (sklearn.mixture.BayesianGaussianMixture)
The saved fit object
plot(X, y)[source]

Extends plot()

Write a summary of the fit, and plot the results using PopPUNK.plot.plot_results() and PopPUNK.plot.plot_contours()

Args:
X (numpy.array)
Core and accessory distances
y (numpy.array)
Cluster assignments from assign()
save()[source]

Save the model to disk, as an npz and pkl (using outPrefix).

class PopPUNK.models.ClusterFit(outPrefix)[source]

Parent class for all models used to cluster distances

Args:
outPrefix (str)
The output prefix used for reading/writing
fit(X=None)[source]

Initial steps for all fit functions.

Creates the output directory. If preprocess is set, subsamples the passed X and draws a scatter plot of the result using plot_scatter().

Args:
X (numpy.array)

The core and accessory distances to cluster. Must be set if preprocess is set.

(default = None)

no_scale()[source]

Turn off scaling (useful for refine, where optimization is done in the scaled space).

plot(X=None)[source]

Initial steps for all plot functions.

Ensures model has been fitted.

Args:
X (numpy.array)

The core and accessory distances to subsample.

(default = None)

class PopPUNK.models.DBSCANFit(outPrefix, max_samples=100000)[source]

Class for fits using HDBSCAN. Inherits from ClusterFit.

Must first run either fit() or load() before calling other functions

Args:
outPrefix (str)
The output prefix used for reading/writing
max_samples (int)

The number of subsamples to fit the model to

(default = 100000)

assign(X, no_scale=False)[source]

Assign the clustering of new samples using assign_samples_dbscan()

Args:
X (numpy.array)
Core and accessory distances
no_scale (bool)

Do not scale X

[default = False]

Returns:
y (numpy.array)
Cluster assignments by samples
fit(X, max_num_clusters, min_cluster_prop)[source]

Extends fit()

Fits the distances with HDBSCAN and returns assignments by calling fitDbScan().

Fitted parameters are stored in the object.

Args:
X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
max_num_clusters (int)
Maximum number of clusters in DBSCAN fitting
min_cluster_prop (float)
Minimum proportion of points in a cluster in DBSCAN fitting
Returns:
y (numpy.array)
Cluster assignments of samples in X
load(fit_npz, fit_obj)[source]

Load the model from disk. Called from loadClusterFit()

Args:
fit_npz (dict)
Fit npz opened with numpy.load()
fit_obj (hdbscan.HDBSCAN)
The saved fit object
plot(X=None, y=None)[source]

Extends plot()

Write a summary of the fit, and plot the results using PopPUNK.plot.plot_dbscan_results()

Args:
X (numpy.array)
Core and accessory distances
y (numpy.array)
Cluster assignments from assign()
save()[source]

Save the model to disk, as an npz and pkl (using outPrefix).

class PopPUNK.models.RefineFit(outPrefix)[source]

Class for fits using a triangular boundary and network properties. Inherits from ClusterFit.

Must first run either fit() or load() before calling other functions

Args:
outPrefix (str)
The output prefix used for reading/writing
assign(X, slope=None)[source]

Assign the clustering of new samples using withinBoundary()

Args:
X (numpy.array)
Core and accessory distances
slope (int)

Override self.slope. Default - use self.slope

Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope

Returns:
y (numpy.array)
Cluster assignments by samples
fit(X, sample_names, model, max_move, min_move, startFile=None, indiv_refine=False, no_local=False, threads=1)[source]

Extends fit()

Fits the distances by optimising network score, by calling refineFit2D().

Fitted parameters are stored in the object.

Args:
X (numpy.array)
The core and accessory distances to cluster. Must be set if preprocess is set.
sample_names (list)
Sample names in X (accessed by iterDistRows())
model (ClusterFit)
The model fit to refine
max_move (float)
Maximum distance to move away from start point
min_move (float)
Minimum distance to move away from start point
startFile (str)

A file defining an initial fit, rather than one from --fit-model. See documentation for format.

(default = None).

indiv_refine (bool)

Run refinement for core and accessory distances separately

(default = False).

no_local (bool)
Turn off the local optimisation step. Quicker, but may be less well refined.
num_processes (int)

Number of threads to use in the global optimisation step.

(default = 1)

Returns:
y (numpy.array)
Cluster assignments of samples in X
load(fit_npz, fit_obj)[source]

Load the model from disk. Called from loadClusterFit()

Args:
fit_npz (dict)
Fit npz opened with numpy.load()
fit_obj (None)
The saved fit object (not used)
plot(X, y=None)[source]

Extends plot()

Write a summary of the fit, and plot the results using PopPUNK.plot.plot_refined_results()

Args:
X (numpy.array)
Core and accessory distances
y (numpy.array)
Assignments (unused)
save()[source]

Save the model to disk, as an npz and pkl (using outPrefix).

PopPUNK.models.loadClusterFit(pkl_file, npz_file, outPrefix='', max_samples=100000)[source]

Call this to load a fitted model

Args:
pkl_file (str)
Location of saved .pkl file on disk
npz_file (str)
Location of saved .npz file on disk
outPrefix (str)
Output prefix for model to save to (e.g. plots)
max_samples (int)

Maximum samples if subsampling X

[default = 100000]

network.py

Functions used to construct the network, and update with new queries. Main entry point is constructNetwork() for new reference databases, and findQueryLinksToNetwork() for querying databases.

Network functions

PopPUNK.network.addQueryToNetwork(rlist, qlist, qfile, G, kmers, assignments, model, queryDB, no_stream=False, queryQuery=False, threads=1, mash_exec='mash')[source]

Finds edges between queries and items in the reference database, and modifies the network to include them.

Args:
rlist (list)
List of reference names
qlist (list)
List of query names
qfile (str)
File containing queries
G (networkx.Graph)
Network to add to (mutated)
kmers (list)
List of k-mer sizes
assignments (numpy.array)
Cluster assignment of items in qlist
model (ClusterModel)
Model fitted to reference database
queryDB (str)
Query database location
queryQuery (bool)

Add in all query-query distances

(default = False)

no_stream (bool)

Don’t stream mash output

(default = False)

threads (int)

Number of threads to use if new db created

(default = 1)

mash_exec (str)

Location of the mash executable

(default = ‘mash’)

Returns:
qlist1 (list)
Ordered list of queries
distMat (numpy.array)
Query-query distances
PopPUNK.network.constructNetwork(rlist, qlist, assignments, within_label, summarise=True)[source]

Construct an unweighted, undirected network without self-loops. Nodes are samples, with edges between samples in the same cluster

Will print summary statistics about the network to STDERR

Args:
rlist (list)
List of reference sequence labels
qlist (list)
List of query sequence labels
assignments (numpy.array)
Labels of most likely cluster assignment from assign_samples()
within_label (int)
The label for the cluster representing within-strain distances from findWithinLabel()
summarise (bool)

Whether to calculate and print network summaries with networkSummary()

(default = True)

Returns:
G (networkx.Graph)
The resulting network
PopPUNK.network.extractReferences(G, mashOrder, outPrefix, existingRefs=None)[source]

Extract references for each cluster based on cliques

Writes chosen references to file by calling writeReferences()

Args:
G (networkx.Graph)
A network used to define clusters from constructNetwork()
mashOrder (list)
The order of files in the sketches, so returned references are in the same order
outPrefix (str)
Prefix for output file (.refs will be appended)
existingRefs (list)
References that should be used for each clique
Returns:
refFileName (str)
The name of the file references were written to
references (list)
An updated list of the reference names
PopPUNK.network.fetchNetwork(network_dir, model, refList, core_only=False, accessory_only=False)[source]

Load the network based on input options

Returns the network as a networkx, and sets the slope parameter of the passed model object.

Args:
network_dir (str)
A network used to define clusters from constructNetwork()
model (ClusterFit)
A fitted model object
refList (list)
Names of references that should be in the network
core_only (bool)

Return the network created using only core distances

[default = False]

accessory_only (bool)

Return the network created using only accessory distances

[default = False]

Returns:
genomeNetwork (nx.Graph)
The loaded network
cluster_file (str)
The CSV of cluster assignments corresponding to this network
PopPUNK.network.networkSummary(G)[source]

Provides summary values about the network

Args:
G (networkx.Graph)
The network of strains from constructNetwork()
Returns:
components (int)
The number of connected components (and clusters)
density (float)
The proportion of possible edges used
transitivity (float)
Network transitivity (triads/triangles)
score (float)
A score of network fit, given by \(\mathrm{transitivity} * (1-\mathrm{density})\)
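These summaries can be sketched directly with networkx (an illustrative version; the toy graph used to exercise it is invented):

```python
import networkx as nx

def network_summary(G):
    """Components, density, transitivity and the network score
    transitivity * (1 - density), as documented above."""
    components = nx.number_connected_components(G)
    density = nx.density(G)
    transitivity = nx.transitivity(G)
    score = transitivity * (1 - density)
    return components, density, transitivity, score
```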
PopPUNK.network.printClusters(G, outPrefix='_clusters.csv', oldClusterFile=None, externalClusterCSV=None, printRef=True, printCSV=True)[source]

Get cluster assignments

Also writes assignments to a CSV file

Args:
G (networkx.Graph)
Network used to define clusters (from constructNetwork() or addQueryToNetwork())
outPrefix (str)

Prefix for output CSV

Default = “_clusters.csv”

oldClusterFile (str)

CSV with previous cluster assignments. Pass to ensure consistency in cluster assignment name.

Default = None

externalClusterCSV (str)

CSV with cluster assignments from any source. Will print a file relating these to new cluster assignments

Default = None

printRef (bool)

If false, print only query sequences in the output

Default = True

printCSV (bool)

Print results to file

Default = True

Returns:
clustering (dict)
Dictionary of cluster assignments (keys are sequence names)
PopPUNK.network.printExternalClusters(newClusters, extClusterFile, outPrefix, oldNames, printRef=True)[source]

Prints cluster assignments with respect to previously defined clusters or labels.

Args:
newClusters (set iterable)
The components from the graph G, defining the PopPUNK clusters
extClusterFile (str)
A CSV file containing definitions of the external clusters for each sample (does not need to contain all samples)
outPrefix (str)
Prefix for output CSV (_external_clusters.csv)
oldNames (list)
A list of the reference sequences
printRef (bool)

If false, print only query sequences in the output

Default = True

PopPUNK.network.writeReferences(refList, outPrefix)[source]

Writes chosen references to file

Args:
refList (list)
Reference names to write
outPrefix (str)
Prefix for output file (.refs will be appended)
Returns:
refFileName (str)
The name of the file references were written to

refine.py

Functions used to refine an existing model. Access using RefineFit.

Refine mixture model using network properties

PopPUNK.refine.decisionBoundary(intercept, gradient)[source]

Returns the co-ordinates where the triangle formed by the decision boundary meets the x- and y-axes.

Args:
intercept (numpy.array)
Cartesian co-ordinates of point along line (transformLine()) which intercepts the boundary
gradient (float)
Gradient of the line
Returns:
x (float)
The x-axis intercept
y (float)
The y-axis intercept
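The intercepts follow from the equation of the line through the given point; a sketch (assuming a non-zero gradient, negative for a boundary sloping down between the axes):

```python
def boundary_intercepts(point, gradient):
    """Axis intercepts of the line through `point` with the given gradient.

    Line: y - y0 = gradient * (x - x0). Illustrative sketch only.
    """
    x0, y0 = point
    x_max = x0 - y0 / gradient    # where y = 0
    y_max = y0 - gradient * x0    # where x = 0
    return x_max, y_max
```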
PopPUNK.refine.likelihoodBoundary(s, model, start, end, within, between)[source]

Wrapper function around fit2dMultiGaussian() so that it can go into a root-finding function for probabilities between components

Args:
s (float)
Distance along line from mean0
model (BGMMFit)
Fitted mixture model
start (numpy.array)
The co-ordinates of the centre of the within-strain distribution
end (numpy.array)
The co-ordinates of the centre of the between-strain distribution
within (int)
Label of the within-strain distribution
between (int)
Label of the between-strain distribution
Returns:
responsibility (float)
The difference between the responsibilities of assignment to the within component and to the between component
PopPUNK.refine.newNetwork(s, sample_names, distMat, start_point, mean1, gradient, slope=2)[source]

Wrapper function for constructNetwork() which is called by optimisation functions moving a triangular decision boundary.

Given the boundary parameterisation, constructs the network and returns its score, to be minimised.

Args:
s (float)
Distance along line between start_point and mean1 from start_point
sample_names (list)
Sample names corresponding to distMat (accessed by iterator)
distMat (numpy.array)
Core and accessory distances
start_point (numpy.array)
Initial boundary cutoff
mean1 (numpy.array)
Defines line direction from start_point
gradient (float)
Gradient of line to move along
slope (int)
Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope
Returns:
score (float)
-1 * network score. Where network score is from networkSummary()
PopPUNK.refine.readManualStart(startFile)[source]

Reads a file to define a manual start point, rather than using --fit-model

Throws an error and exits if incorrectly formatted.

Args:
startFile (str)
Name of file with values to read
Returns:
mean0 (numpy.array)
Centre of within-strain distribution
mean1 (numpy.array)
Centre of between-strain distribution
start_s (float)
Distance along line between mean0 and mean1 to start at
PopPUNK.refine.refineFit(distMat, sample_names, start_s, mean0, mean1, max_move, min_move, slope=2, no_local=False, num_processes=1)[source]

Try to refine a fit by maximising a network score based on transitivity and density.

Iteratively move the decision boundary to do this, using starting point from existing model.

Args:
distMat (numpy.array)
n x 2 array of core and accessory distances for n samples
sample_names (list)
List of query sequence labels
start_s (float)
Point along line to start search
mean0 (numpy.array)
Start point to define search line
mean1 (numpy.array)
End point to define search line
max_move (float)
Maximum distance to move away from start point
min_move (float)
Minimum distance to move away from start point
slope (int)
Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope
no_local (bool)
Turn off the local optimisation step. Quicker, but may be less well refined.
num_processes (int)

Number of threads to use in the global optimisation step.

(default = 1)

Returns:
start_point (tuple)
(x, y) co-ordinates of starting point
optimal_x (float)
x-coordinate of refined fit
optimal_y (float)
y-coordinate of refined fit
PopPUNK.refine.transformLine(s, mean0, mean1)[source]

Return x and y co-ordinates for traversing along a line between mean0 and mean1, parameterised by a single scalar distance s from the start point mean0.

Args:
s (float)
Distance along line from mean0
mean0 (numpy.array)
Start position of line (x0, y0)
mean1 (numpy.array)
End position of line (x1, y1)
Returns:
x (float)
The Cartesian x-coordinate
y (float)
The Cartesian y-coordinate
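The parameterisation can be sketched as follows (an illustrative numpy version):

```python
import numpy as np

def transform_line(s, mean0, mean1):
    """Point a distance s along the line from mean0 towards mean1."""
    direction = (mean1 - mean0) / np.linalg.norm(mean1 - mean0)  # unit vector
    return mean0 + s * direction
```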
PopPUNK.refine.withinBoundary(dists, x_max, y_max, slope=2)[source]

Classifies points as within or outside of a refined boundary. Numba JIT compiled for speed.

Also used to assign new points in assign()

Args:
dists (numpy.array)
Core and accessory distances to classify
x_max (float)
The x-axis intercept from decisionBoundary()
y_max (float)
The y-axis intercept from decisionBoundary()
slope (int)
Set to 0 for a vertical line, 1 for a horizontal line, or 2 to use a slope
Returns:
signs (numpy.array)
For each sample in dists, -1 if within-strain and 1 if between-strain. 0 if exactly on boundary.
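The slope = 2 case can be sketched as a sign test against the line joining the two intercepts (an illustrative numpy version, without the Numba compilation):

```python
import numpy as np

def within_boundary(dists, x_max, y_max):
    """-1 inside the boundary, 1 outside, 0 exactly on it (slope = 2 case).

    Points below the line through (x_max, 0) and (0, y_max) satisfy
    x / x_max + y / y_max < 1.
    """
    position = dists[:, 0] / x_max + dists[:, 1] / y_max - 1
    return np.sign(position).astype(int)
```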

plot.py

Plots of GMM results, k-mer fits, and microreact output

PopPUNK.plot.buildRapidNJ(rapidnj, refList, coreMat, outPrefix, tree_filename)[source]

Use rapidNJ for more rapid tree building

Creates a phylip file of core distances, makes a system call to the rapidnj executable, loads the tree as a dendropy object (cleaning quotes in node names), and removes temporary files.

Args:
rapidnj (str)
Location of rapidnj executable
refList (list)
Names of sequences in coreMat (same order)
coreMat (numpy.array)
NxN core distance matrix produced in outputsForMicroreact()
outPrefix (str)
Output prefix for temporary files, which will be placed in the outPrefix subdirectory
tree_filename (str)
Filename for output tree (saved to disk)
Returns:
tree (dendropy.Tree)
NJ tree from core distances
PopPUNK.plot.generate_phylogeny(coreMat, seqLabels, outPrefix, tree_suffix, rapidnj, overwrite)[source]

Generate phylogeny using dendropy or RapidNJ

Writes a neighbour joining tree (.nwk) from core distances.

Args:
coreMat (numpy.array)
n x n array of core distances for n samples.
seqLabels (list)
Processed names of sequences being analysed.
outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
tree_suffix (str)
String to append to tree file name
rapidnj (str)
A string with the location of the rapidnj executable for tree-building. If None, will use dendropy by default
overwrite (bool)
Overwrite existing output if present (default = False)
PopPUNK.plot.get_grid(minimum, maximum, resolution)[source]

Get a square grid of points to evaluate a function across

Used for plot_scatter() and plot_contours()

Args:
minimum (float)
Minimum value for grid
maximum (float)
Maximum value for grid
resolution (int)
Number of points along each axis
Returns:
xx (numpy.array)
x values across n x n grid
yy (numpy.array)
y values across n x n grid
xy (numpy.array)
n x 2 pairs of x, y values grid is over
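The grid construction can be sketched with numpy (an illustrative version):

```python
import numpy as np

def get_grid(minimum, maximum, resolution):
    """Square grid of points to evaluate a function across."""
    x = np.linspace(minimum, maximum, resolution)
    xx, yy = np.meshgrid(x, x)                       # resolution x resolution grids
    xy = np.column_stack((xx.ravel(), yy.ravel()))   # n x 2 list of (x, y) pairs
    return xx, yy, xy
```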
PopPUNK.plot.outputsForCytoscape(G, clustering, outPrefix, epiCsv, queryList=None, suffix=None, writeCsv=True)[source]

Write outputs for cytoscape. A graphml of the network, and CSV with metadata

Args:
G (networkx.Graph)
The network to write from constructNetwork()
clustering (dict)
Dictionary of cluster assignments (keys are nodeNames).
outPrefix (str)
Prefix for files to be written
epiCsv (str)
Optional CSV of epi data to paste in the output in addition to the clusters.
queryList (list)
Optional list of isolates that have been added as a query. (default = None)
suffix (string)
String to append to network file name. (default = None)
writeCsv (bool)
Whether to print CSV file to accompany network
PopPUNK.plot.outputsForGrapetree(combined_list, coreMat, clustering, outPrefix, epiCsv, rapidnj, queryList=None, overwrite=False, microreact=False)[source]

Generate files for Grapetree

Write a neighbour joining tree (.nwk) from core distances and cluster assignment (.csv)

Args:
combined_list (list)
Names of sequences being analysed. The part of each name before the first ‘.’ will be shown in the output
coreMat (numpy.array)
n x n array of core distances for n samples.
clustering (dict or dict of dicts)
Dictionary of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory.
epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
rapidnj (str)
A string with the location of the rapidnj executable for tree-building. If None, will use dendropy by default
queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
overwrite (bool)
Overwrite existing output if present (default = False).
microreact (bool)
Avoid regenerating tree if already built for microreact (default = False).
PopPUNK.plot.outputsForMicroreact(combined_list, coreMat, accMat, clustering, perplexity, outPrefix, epiCsv, rapidnj, queryList=None, overwrite=False)[source]

Generate files for microreact

Output a neighbour joining tree (.nwk) from core distances, a plot of t-SNE clustering of accessory distances (.dot) and cluster assignment (.csv)

Args:
combined_list (list)
Names of sequences being analysed. The part of each name before the first ‘.’ will be shown in the output
coreMat (numpy.array)
n x n array of core distances for n samples.
accMat (numpy.array)
n x n array of accessory distances for n samples.
clustering (dict or dict of dicts)
Dictionary of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
perplexity (int)
Perplexity parameter passed to t-SNE
outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
rapidnj (str)
A string with the location of the rapidnj executable for tree-building. If None, will use dendropy by default
queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
overwrite (bool)
Overwrite existing output if present (default = False)
PopPUNK.plot.outputsForPhandango(combined_list, coreMat, clustering, outPrefix, epiCsv, rapidnj, queryList=None, overwrite=False, microreact=False)[source]

Generate files for Phandango

Write a neighbour joining tree (.tree) from core distances and cluster assignment (.csv)

Args:
combined_list (list)
Names of sequences being analysed. The part of each name before the first ‘.’ will be shown in the output
coreMat (numpy.array)
n x n array of core distances for n samples.
clustering (dict or dict of dicts)
Dictionary of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
rapidnj (str)
A string with the location of the rapidnj executable for tree-building. If None, will use dendropy by default
queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
overwrite (bool)
Overwrite existing output if present (default = False)
microreact (bool)
Avoid regenerating tree if already built for microreact (default = False)
PopPUNK.plot.plot_contours(assignments, weights, means, covariances, title, out_prefix)[source]

Draw contours of mixture model assignments

Will draw the decision boundary for between/within in red

Args:
assignments (numpy.array)
n-vector of cluster assignments for the model
weights (numpy.array)
Component weights from BGMMFit
means (numpy.array)
Component means from BGMMFit
covariances (numpy.array)
Component covariances from BGMMFit
title (str)
The title to display above the plot
out_prefix (str)
Prefix for output plot file (.pdf will be appended)
PopPUNK.plot.plot_dbscan_results(X, y, n_clusters, out_prefix)[source]

Draw a scatter plot (png) to show the DBSCAN model fit

A scatter plot of core and accessory distances, coloured by component membership. Black points are noise

Args:
X (numpy.array)
n x 2 array of core and accessory distances for n samples.
y (numpy.array)
n x 1 array of cluster assignments for n samples.
n_clusters (int)
Number of clusters used (excluding noise)
out_prefix (str)
Prefix for output file (.png will be appended)
PopPUNK.plot.plot_fit(klist, matching, fit, out_prefix, title)[source]

Draw a scatter plot (pdf) of k-mer sizes vs match probability, and the fit used to assign core and accessory distance

K-mer sizes on the x-axis, log(pr(match)) on the y-axis. Expect a straight-line fit, with the intercept representing the accessory distance and the slope the core distance

Args:
klist (list)
List of k-mer sizes
matching (list)
Proportion of matching k-mers at each klist value
fit (numpy.array)
Fit to klist and matching from fitKmerCurve()
out_prefix (str)
Prefix for output plot file (.pdf will be appended)
title (str)
The title to display above the plot
PopPUNK.plot.plot_refined_results(X, Y, x_boundary, y_boundary, core_boundary, accessory_boundary, mean0, mean1, start_point, min_move, max_move, scale, indiv_boundaries, title, out_prefix)[source]

Draw a scatter plot (png) to show the refined model fit

A scatter plot of core and accessory distances, coloured by component membership. The triangular decision boundary is also shown

Args:
X (numpy.array)
n x 2 array of core and accessory distances for n samples.
Y (numpy.array)
n x 1 array of cluster assignments for n samples.
x_boundary (float)
Intercept of boundary with x-axis, from RefineFit
y_boundary (float)
Intercept of boundary with y-axis, from RefineFit
core_boundary (float)
Intercept of 1D (core) boundary with x-axis, from RefineFit
accessory_boundary (float)
Intercept of 1D (accessory) boundary with y-axis, from RefineFit
mean0 (numpy.array)
Centre of within-strain distribution
mean1 (numpy.array)
Centre of between-strain distribution
start_point (numpy.array)
Start point of optimisation
min_move (float)
Minimum s range
max_move (float)
Maximum s range
scale (numpy.array)
Scaling factor from RefineFit
indiv_boundaries (bool)
Whether to draw lines for core and accessory refinement
title (str)
The title to display above the plot
out_prefix (str)
Prefix for output plot file (.png will be appended)
PopPUNK.plot.plot_results(X, Y, means, covariances, scale, title, out_prefix)[source]

Draw a scatter plot (png) to show the BGMM model fit

A scatter plot of core and accessory distances, coloured by component membership. Also shown are ellipses for each component (centre: means; axes: covariances).

This is based on the example in the sklearn documentation.

Args:
X (numpy.array)
n x 2 array of core and accessory distances for n samples.
Y (numpy.array)
n x 1 array of cluster assignments for n samples.
means (numpy.array)
Component means from BGMMFit
covariances (numpy.array)
Component covariances from BGMMFit
scale (numpy.array)
Scaling factor from BGMMFit
out_prefix (str)
Prefix for output plot file (.png will be appended)
title (str)
The title to display above the plot
PopPUNK.plot.plot_scatter(X, scale, out_prefix, title, kde=True)[source]

Draws a 2D scatter plot (png) of the core and accessory distances

Also draws contours of the kernel density estimate

Args:
X (numpy.array)
n x 2 array of core and accessory distances for n samples.
scale (numpy.array)
Scaling factor from BGMMFit
out_prefix (str)
Prefix for output plot file (.png will be appended)
title (str)
The title to display above the plot
kde (bool)
Whether to draw kernel density estimate contours (default = True)
PopPUNK.plot.writeClusterCsv(outfile, nodeNames, nodeLabels, clustering, output_format='microreact', epiCsv=None, queryNames=None)[source]

Print CSV file of clustering and optionally epi data

Writes CSV output of clusters which can be used as input to microreact and cytoscape. Uses pandas to deal with CSV reading and writing nicely.

The epiCsv, if provided, should have the node labels in the first column.

Args:
outfile (str)
File to write the CSV to.
nodeNames (list)
Names of sequences in clustering (includes path).
nodeLabels (list)
Names of sequences to write in CSV (usually has path removed).
clustering (dict or dict of dicts)
Dictionary of cluster assignments (keys are nodeNames). Pass a dict with depth two to include multiple possible clusterings.
output_format (str)
Software for which CSV should be formatted (microreact, phandango, grapetree and cytoscape are accepted)
epiCsv (str)
Optional CSV of epi data to include in the output in addition to the clusters (default = None).
queryNames (list)
Optional list of isolates that have been added as a query. (default = None)

tsne.py

PopPUNK.tsne.generate_tsne(seqLabels, accMat, perplexity, outPrefix, overwrite, verbosity=0)[source]

Generate t-SNE projection using accessory distances

Writes a plot of t-SNE clustering of accessory distances (.dot)

Args:
seqLabels (list)
Processed names of sequences being analysed.
accMat (numpy.array)
n x n array of accessory distances for n samples.
perplexity (int)
Perplexity parameter passed to t-SNE
outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
overwrite (bool)
Overwrite existing output if present (default = False)
verbosity (int)
Verbosity of t-SNE process (0-3) (default = 0)

utils.py

General utility functions for data read/writing/manipulation in PopPUNK

PopPUNK.utils.iterDistRows(refSeqs, querySeqs, self=True)[source]

Gets the ref and query ID for each row of the distance matrix

Returns an iterable with ref and query ID pairs by row.

Args:
refSeqs (list)
List of reference sequence names.
querySeqs (list)
List of query sequence names.
self (bool)
Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs. (default = True)
Returns:
ref, query (str, str)
Iterable of tuples with ref and query names for each distMat row.
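The row order implied by iterDistRows() can be sketched with itertools. This is an illustrative reconstruction; the exact ordering of pairs within each mode is an assumption, and the name iter_dist_rows_sketch is hypothetical:

```python
from itertools import combinations

def iter_dist_rows_sketch(refSeqs, querySeqs, self=True):
    """Yield (ref, query) name pairs, one per distance-matrix row."""
    if self:
        # Self-comparison: one row per unordered pair, no diagonal
        assert refSeqs == querySeqs
        yield from combinations(refSeqs, 2)
    else:
        # Query vs reference: all ordered pairs
        for query in querySeqs:
            for ref in refSeqs:
                yield (ref, query)
```

For n references compared against themselves this yields n(n-1)/2 rows, matching the number of rows in the long-form distance matrix.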
PopPUNK.utils.qcDistMat(distMat, refList, queryList, a_max)[source]

Checks the distance matrix for outliers. Currently this is just a threshold on the accessory distance

Args:
distMat (np.array)
Core and accessory distances
refList (list)
Reference labels
queryList (list)
Query labels (or refList if self)
a_max (float)
Maximum accessory distance to allow
Returns:
passed (bool)
False if any samples failed
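A minimal sketch of the accessory-distance check, assuming the long-form convention of core distances in column 0 and accessory distances in column 1 (the function name qc_dist_mat_sketch is hypothetical):

```python
import numpy as np

def qc_dist_mat_sketch(distMat, a_max):
    # Column 0: core distances, column 1: accessory distances.
    # Fail if any accessory distance exceeds the threshold.
    return bool((distMat[:, 1] <= a_max).all())
```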
PopPUNK.utils.readClusters(clustCSV, return_dict=False)[source]

Read a previous reference clustering from CSV

Args:
clustCSV (str)
File name of CSV with previous cluster assignments
return_dict (bool)
If True, return a dict mapping sample to cluster instead of sets
Returns:
clusters (dict)
Dictionary of cluster assignments (keys are cluster names, values are sets containing samples in the cluster). Or if return_dict is set keys are sample names, values are cluster assignments.
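The two return shapes can be sketched as follows. The assumed CSV layout (a header row, then one sample,cluster row per isolate) and the name read_clusters_sketch are illustrative assumptions, not the actual implementation:

```python
import csv
import io

def read_clusters_sketch(csv_text, return_dict=False):
    # Assumed layout: header row, then one "sample,cluster" row per isolate
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip header
    if return_dict:
        # sample -> cluster mapping
        return {sample: cluster for sample, cluster in reader}
    # cluster -> set of samples mapping
    clusters = {}
    for sample, cluster in reader:
        clusters.setdefault(cluster, set()).add(sample)
    return clusters
```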
PopPUNK.utils.readExternalClusters(clustCSV)[source]

Read a cluster definition from CSV (does not have to be PopPUNK-generated clusters). Rows are samples; columns are clusters.

Args:
clustCSV (str)
File name of CSV with previous cluster assignments
Returns:
extClusters (dict)
Dictionary of dictionaries of cluster assignments (first key cluster assignment name, second key sample, value cluster assignment)
PopPUNK.utils.readPickle(pklName)[source]

Loads core and accessory distances saved by storePickle()

Called during --fit-model

Args:
pklName (str)
Prefix for saved files
Returns:
rlist (list)
List of reference sequence names (for iterDistRows())
qlist (list)
List of query sequence names (for iterDistRows())
self (bool)
Whether an all-vs-all self DB (for iterDistRows())
X (numpy.array)
n x 2 array of core and accessory distances
PopPUNK.utils.storePickle(rlist, qlist, self, X, pklName)[source]

Saves core and accessory distances in a .npy file, names in a .pkl

Called during --create-db

Args:
rlist (list)
List of reference sequence names (for iterDistRows())
qlist (list)
List of query sequence names (for iterDistRows())
self (bool)
Whether an all-vs-all self DB (for iterDistRows())
X (numpy.array)
n x 2 array of core and accessory distances
pklName (str)
Prefix for output files
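The storePickle()/readPickle() pair can be sketched as a pickle plus .npy round trip. The exact file suffixes and the _sketch function names are assumptions for illustration:

```python
import pickle
import numpy as np

def store_pickle_sketch(rlist, qlist, self, X, pklName):
    # Sequence names and the self flag go in a .pkl;
    # the n x 2 distance matrix goes in a companion .npy
    with open(pklName + ".pkl", "wb") as f:
        pickle.dump([rlist, qlist, self], f)
    np.save(pklName + ".npy", X)

def read_pickle_sketch(pklName):
    with open(pklName + ".pkl", "rb") as f:
        rlist, qlist, self = pickle.load(f)
    X = np.load(pklName + ".npy")
    return rlist, qlist, self, X
```

Storing the names alongside the matrix lets iterDistRows() later recover which pair of samples each row refers to.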
PopPUNK.utils.translate_distMat(combined_list, core_distMat, acc_distMat)[source]

Convert distances from a square form (2 NxN matrices) to a long form (1 matrix with n_comparisons rows and 2 columns).

Args:
combined_list (list)
Combined list of references followed by queries
core_distMat (numpy.array)
NxN core distances
acc_distMat (numpy.array)
NxN accessory distances
Returns:
distMat (numpy.array)
Distances in long form
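The square-to-long conversion can be sketched with numpy's upper-triangle indices, assuming one long-form row per unordered pair and core/accessory in columns 0 and 1 (the name translate_dist_mat_sketch is hypothetical):

```python
import numpy as np

def translate_dist_mat_sketch(core_distMat, acc_distMat):
    n = core_distMat.shape[0]
    # One long-form row per unordered pair (upper triangle, no diagonal)
    i, j = np.triu_indices(n, k=1)
    return np.column_stack([core_distMat[i, j], acc_distMat[i, j]])
```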
PopPUNK.utils.update_distance_matrices(refList, distMat, queryList=None, query_ref_distMat=None, query_query_distMat=None)[source]

Convert distances from long form (1 matrix with n_comparisons rows and 2 columns) to a square form (2 NxN matrices), with merging of query distances if necessary.

Args:
refList (list)
List of references
distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between reference db sequences
queryList (list)
List of queries
query_ref_distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between queries and reference db sequences
query_query_distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between query sequences
Returns:
seqLabels (list)
Combined list of reference and query sequences
coreMat (numpy.array)
NxN array of core distances for N sequences
accMat (numpy.array)
NxN array of accessory distances for N sequences
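The reference-only case (no queries) is the inverse of translate_distMat() and can be sketched as below. This assumes the long-form row order matches the upper-triangle pair order from iterDistRows(); the merging of query distances is omitted, and long_to_square_sketch is a hypothetical name:

```python
import numpy as np

def long_to_square_sketch(seqLabels, distMat):
    n = len(seqLabels)
    coreMat = np.zeros((n, n))
    accMat = np.zeros((n, n))
    i, j = np.triu_indices(n, k=1)
    # Fill both triangles so the result is symmetric
    coreMat[i, j] = distMat[:, 0]
    coreMat[j, i] = distMat[:, 0]
    accMat[i, j] = distMat[:, 1]
    accMat[j, i] = distMat[:, 1]
    return seqLabels, coreMat, accMat
```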
PopPUNK.utils.writeTmpFile(fileList)[source]

Writes a list to a temporary file. Used for turning a variable into mash input.

Args:
fileList (list)
List of files to write to file
Returns:
tmpName (str)
Name of temp file list written to