Reference documentation

Documentation for module functions (for developers)

bgmm.py

Functions used to fit the mixture model to a database. Access using BGMMFit.

BGMM using sklearn

PopPUNK.bgmm.assign_samples(X, weights, means, covars, scale, values=False)

Given distances and a fitted model, calculates the component responsibilities and returns the most likely cluster assignment for each sample.

Args:
X (numpy.array)
n x 2 array of core and accessory distances for n samples
weights (numpy.array)
Component weights from BGMMFit
means (numpy.array)
Component means from BGMMFit
covars (numpy.array)
Component covariances from BGMMFit
scale (numpy.array)
Scaling of core and accessory distances from BGMMFit
values (bool)
Whether to return the responsibilities, rather than the most likely assignment (used for entropy calculation). (default = False)

Returns:
ret_vec (numpy.array)
An n-vector with the most likely cluster memberships or an n by k matrix with the component responsibilities for each sample.
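
As a minimal sketch of the responsibility calculation described above (not the PopPUNK source), the snippet below assumes the weighted per-component log densities are already available as an n x k array. The names `responsibilities`, `assign` and `lpr` are hypothetical and used only for illustration.

```python
import numpy as np

def responsibilities(lpr):
    """lpr: n x k array of log(weight_k) + log N(x_i | mean_k, covar_k)."""
    # Log-sum-exp over components, stabilised by subtracting the row maximum
    row_max = lpr.max(axis=1, keepdims=True)
    logprob = row_max[:, 0] + np.log(np.exp(lpr - row_max).sum(axis=1))
    # Normalise so that each row sums to 1
    return np.exp(lpr - logprob[:, np.newaxis])

def assign(lpr, values=False):
    """Return responsibilities if values is True, else the most likely component."""
    resp = responsibilities(lpr)
    return resp if values else resp.argmax(axis=1)

# Toy input: three samples, two components (hypothetical numbers)
lpr = np.log(np.array([[0.9, 0.1], [0.2, 0.6], [0.05, 0.05]]))
labels = assign(lpr)             # n-vector of component indices
resp = assign(lpr, values=True)  # n x k matrix of responsibilities
```

Returning the full matrix is what makes a per-sample entropy calculation possible, since entropy needs every component's responsibility rather than only the largest one.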

PopPUNK.bgmm.findWithinLabel(means, assignments, rank=0)

Identify within-strain links

Finds the component with its mean closest to the origin, and also makes sure some samples are assigned to it (with a Dirichlet prior, components with small weights may be left unused).

Args:
means (numpy.array)
K x 2 array of mixture component means
assignments (numpy.array)
Sample cluster assignments
rank (int)
Which label to find, ordered by distance from origin. 0-indexed. (default = 0)

Returns:
within_label (int)
The cluster label for the within-strain assignments
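
The selection rule above can be sketched as follows. This is an illustration rather than the PopPUNK implementation: the name `find_within_label_sketch` is hypothetical, and it assumes that `rank` counts only components that have at least one sample assigned.

```python
import numpy as np

def find_within_label_sketch(means, assignments, rank=0):
    # Order components by the Euclidean distance of their mean from the origin
    order = np.argsort(np.linalg.norm(means, axis=1))
    # Keep only components that have at least one sample assigned (assumption)
    used = [int(k) for k in order if np.any(assignments == k)]
    return used[rank]

# Toy example: component 1 is closest to the origin but has no samples
means = np.array([[0.5, 0.6], [0.001, 0.002], [0.02, 0.05]])
assignments = np.array([0, 0, 2, 2, 2, 0])
within_label = find_within_label_sketch(means, assignments)
```

In this toy case the unused component is skipped, so the label returned is the next closest component that actually holds samples.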

PopPUNK.bgmm.fit2dMultiGaussian(X, dpgmm_max_K=2)

Main function to fit BGMM model, called from fit()

Fits the specified mixture model, saves the model parameters to a file, and assigns the samples to a component. Writes fit summary statistics to STDERR.

Args:
X (np.array)
n x 2 array of core and accessory distances for n samples. This should be subsampled to 100000 samples.
dpgmm_max_K (int)
Maximum number of components to use with the EM fit. (default = 2)
Returns:
dpgmm (sklearn.mixture.BayesianGaussianMixture)
Fitted bgmm model
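
For orientation, a fit of this kind could look like the sketch below, which uses scikit-learn's `BayesianGaussianMixture` directly. The simulated distances, the scaling by the column maximum and the estimator settings are all assumptions made for illustration. They are not taken from the PopPUNK source, and the sketch does not save the model or print summary statistics.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Simulated core/accessory distances: a tight within-strain cloud near the
# origin and a larger between-strain cloud (hypothetical values)
rng = np.random.default_rng(1)
within = rng.normal([0.002, 0.01], 0.001, size=(200, 2))
between = rng.normal([0.02, 0.3], 0.002, size=(800, 2))
X = np.vstack([within, between])

# Put both axes on a comparable scale before fitting (assumed scaling)
scale = X.max(axis=0)
X_scaled = X / scale

# n_components plays the role of dpgmm_max_K: an upper bound on components
dpgmm = BayesianGaussianMixture(n_components=2,
                                covariance_type="full",
                                random_state=1)
dpgmm.fit(X_scaled)
labels = dpgmm.predict(X_scaled)
```

After fitting, `dpgmm.weights_`, `dpgmm.means_` and `dpgmm.covariances_` hold the component parameters that the other functions in this module take as arguments.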

PopPUNK.bgmm.log_likelihood(X, weights, means, covars, scale)

Modified sklearn GMM function for predicting distribution membership

Returns the mixture log-likelihood (LL) for points X. Used by assign_samples() and plot_contours()

Args:
X (numpy.array)
n x 2 array of core and accessory distances for n samples
weights (numpy.array)
Component weights from fit2dMultiGaussian()
means (numpy.array)
Component means from fit2dMultiGaussian()
covars (numpy.array)
Component covariances from fit2dMultiGaussian()
scale (numpy.array)
Scaling of core and accessory distances from fit2dMultiGaussian()
Returns:
logprob (numpy.array)
The log of the probabilities under the mixture model
lpr (numpy.array)
The components of the log probability from each mixture component

PopPUNK.bgmm.log_multivariate_normal_density(X, means, covars, min_covar=1e-07)

Log-likelihood under a multivariate normal distribution

Used to calculate per component Gaussian likelihood in assign_samples()

Args:
X (numpy.array)
n x 2 array of core and accessory distances for n samples
means (numpy.array)
Component means from fit2dMultiGaussian()
covars (numpy.array)
Component covariances from fit2dMultiGaussian()
min_covar (float)
Minimum covariance, added when Cholesky decomposition fails due to too few observations (default = 1.e-7)
Returns:
log_prob (numpy.array)
An n-vector with the log-likelihoods for each sample being in this component
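
The role of `min_covar` can be illustrated with a Cholesky-based density calculation. The sketch below is not the PopPUNK source. The name `log_mvn_density_sketch` is hypothetical, and it makes two assumptions: the regularisation is added to the diagonal of the covariance, and the result is returned as one column per component.

```python
import numpy as np

def log_mvn_density_sketch(X, means, covars, min_covar=1e-7):
    n, d = X.shape
    log_prob = np.empty((n, len(means)))
    for k, (mu, cov) in enumerate(zip(means, covars)):
        try:
            chol = np.linalg.cholesky(cov)
        except np.linalg.LinAlgError:
            # Not positive definite: add min_covar to the diagonal and retry
            chol = np.linalg.cholesky(cov + min_covar * np.eye(d))
        # Solve L z = (x - mu)^T; the squared Mahalanobis distance is sum(z^2)
        z = np.linalg.solve(chol, (X - mu).T)
        # log|cov| from the Cholesky factor: 2 * sum(log(diag(L)))
        log_det = 2.0 * np.sum(np.log(np.diag(chol)))
        log_prob[:, k] = -0.5 * (d * np.log(2 * np.pi) + log_det
                                 + np.sum(z ** 2, axis=0))
    return log_prob

# Toy check against a standard bivariate normal
X = np.array([[0.0, 0.0], [1.0, 0.0]])
log_prob = log_mvn_density_sketch(X, np.zeros((1, 2)), np.eye(2)[np.newaxis])
```

A degenerate covariance, such as an all-zero matrix from a component with too few observations, takes the fallback branch and still gives finite values.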

dbscan.py

Functions used to fit DBSCAN to a database. Access using DBSCANFit.

DBSCAN using hdbscan

PopPUNK.dbscan.assign_samples_dbscan(X, hdb, scale)

Use a fitted dbscan model to assign new samples to a cluster

Args:
X (numpy.array)
N x 2 array of core and accessory distances
hdb (hdbscan.HDBSCAN)
Fitted DBSCAN from hdbscan package
scale (numpy.array)
Scale factor of model object
Returns:
y (numpy.array)
Cluster assignments by sample

PopPUNK.dbscan.evaluate_dbscan_clusters(model)

Evaluate whether fitted dbscan model contains non-overlapping clusters

Args:
model (DBSCANFit)
Fitted model from fit()
Returns:
indistinct (bool)
Boolean indicating whether putative within- and between-strain clusters of points overlap

PopPUNK.dbscan.findBetweenLabel(assignments, within_cluster)

Identify between-strain links from a DBSCAN model

Finds the component containing the largest number of between-strain links, excluding the cluster identified as containing within-strain links.

Args:
assignments (numpy.array)
Sample cluster assignments
within_cluster (int)
Cluster ID assigned to within-strain assignments, from findWithinLabel()
Returns:
between_cluster (int)
The cluster label for the between-strain assignments
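
One way to read the rule above is "the largest cluster that is not the within-strain cluster". The sketch below illustrates that reading and is not the PopPUNK implementation. The name `find_between_label_sketch` is hypothetical, and skipping the noise label -1 (which hdbscan gives to unclustered points) is an assumption.

```python
import numpy as np

def find_between_label_sketch(assignments, within_cluster):
    labels, counts = np.unique(assignments, return_counts=True)
    # Exclude the within-strain cluster and noise points (assumption: -1)
    keep = (labels != within_cluster) & (labels != -1)
    # Of the remaining clusters, pick the one with the most assigned points
    return int(labels[keep][np.argmax(counts[keep])])

# Toy example: cluster 0 is within-strain, -1 is noise
assignments = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, -1, -1, -1, -1, -1, -1])
between_cluster = find_between_label_sketch(assignments, within_cluster=0)
```

Noise is the most frequent label in this toy input, which is why the sketch filters it out before counting.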

PopPUNK.dbscan.fitDbScan(X, min_samples, min_cluster_size, cache_out)

Function to fit a DBSCAN model as an alternative to the Gaussian mixture model

Fits the DBSCAN model to the distances using hdbscan

Args:
X (np.array)
n x 2 array of core and accessory distances for n samples
min_samples (int)
Parameter for DBSCAN clustering ‘conservativeness’
min_cluster_size (int)
Minimum number of points in a cluster for HDBSCAN
cache_out (str)
Prefix for DBSCAN cache used for refitting
Returns:
hdb (hdbscan.HDBSCAN)
Fitted HDBSCAN to subsampled data
labels (list)
Cluster assignments of each sample
n_clusters (int)
Number of clusters used

network.py

Functions used to construct the network, and update with new queries. Main entry point is constructNetwork() for new reference databases, and findQueryLinksToNetwork() for querying databases.

refine.py

Functions used to refine an existing model. Access using RefineFit.