Reference documentation
Documentation for module functions (for developers)
bgmm.py
Functions used to fit the mixture model to a database. Access using BGMMFit.
BGMM using sklearn

PopPUNK.bgmm.assign_samples(X, weights, means, covars, scale, values=False)
Given distances and a fit, calculates responsibilities and returns the most likely cluster assignment.
 Args:
 X (numpy.array)
 n x 2 array of core and accessory distances for n samples
 weights (numpy.array)
 Component weights from BGMMFit
 means (numpy.array)
 Component means from BGMMFit
 covars (numpy.array)
 Component covariances from BGMMFit
 scale (numpy.array)
 Scaling of core and accessory distances from BGMMFit
 values (bool)
 Whether to return the responsibilities, rather than the most likely assignment (used for entropy calculation). Default is False
 Returns:
 ret_vec (numpy.array)
 An n-vector with the most likely cluster memberships, or an n by k matrix with the component responsibilities for each sample.
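The relationship between responsibilities and the returned assignment can be sketched as follows. This is a minimal illustration, not PopPUNK's implementation: the function name `assign_from_log_components` is hypothetical, and it assumes the per-component log terms (log weight plus log density) have already been computed, for example as the `lpr` array returned by log_likelihood().

```python
import numpy as np


def assign_from_log_components(lpr, values=False):
    """Hypothetical helper: lpr is an (n, k) array of log(weight_k * density_k(x_n))."""
    # Normalise each row in log space (log-sum-exp) to avoid underflow
    peak = lpr.max(axis=1, keepdims=True)
    logprob = peak + np.log(np.exp(lpr - peak).sum(axis=1, keepdims=True))
    # Responsibilities: posterior probability of each component for each sample
    responsibilities = np.exp(lpr - logprob)
    if values:
        return responsibilities
    # Otherwise return the index of the most likely component per sample
    return responsibilities.argmax(axis=1)


# Two samples, two components
lpr = np.log(np.array([[0.2, 0.6], [0.3, 0.1]]))
print(assign_from_log_components(lpr))
print(assign_from_log_components(lpr, values=True))
```

Each row of the responsibilities matrix sums to one, which is what makes it usable for an entropy calculation.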

PopPUNK.bgmm.findWithinLabel(means, assignments, rank=0)
Identify within-strain links
Finds the component with mean closest to the origin and also makes sure some samples are assigned to it (with a Dirichlet prior, components with small weights may be left unused).
 Args:
 means (numpy.array)
 K x 2 array of mixture component means
 assignments (numpy.array)
 Sample cluster assignments
 rank (int)
 Which label to find, ordered by distance from origin. 0-indexed. (default = 0)
 Returns:
 within_label (int)
 The cluster label for the within-strain assignments
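The selection described above can be sketched with NumPy. This is an illustration under stated assumptions rather than the library's code: the function name `find_within_label_sketch` is hypothetical, and distance from the origin is taken to be the Euclidean norm of each component mean.

```python
import numpy as np


def find_within_label_sketch(means, assignments, rank=0):
    """Hypothetical helper: pick the used component whose mean is rank-th closest to the origin."""
    means = np.asarray(means)
    # Euclidean distance of each component mean from the origin (an assumption)
    distances = np.linalg.norm(means, axis=1)
    # Only components with at least one assigned sample are candidates
    used = set(np.unique(assignments).tolist())
    ordered = [int(k) for k in np.argsort(distances) if int(k) in used]
    if rank >= len(ordered):
        raise ValueError("rank exceeds the number of components in use")
    return ordered[rank]


means = np.array([[0.5, 0.5], [0.01, 0.01], [0.1, 0.1]])
assignments = np.array([0, 0, 2, 2])  # component 1 has no samples assigned
print(find_within_label_sketch(means, assignments))  # → 2
```

Component 1 has the mean closest to the origin but is skipped because no samples are assigned to it, so the nearest component in use (label 2) is returned.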

PopPUNK.bgmm.fit2dMultiGaussian(X, dpgmm_max_K=2)
Main function to fit BGMM model, called from fit()
Fits the mixture model specified, saves model parameters to a file, and assigns the samples to a component. Writes fit summary stats to STDERR.
 Args:
 X (np.array)
 n x 2 array of core and accessory distances for n samples. This should be subsampled to 100000 samples.
 dpgmm_max_K (int)
 Maximum number of components to use with the EM fit. (default = 2)
 Returns:
 dpgmm (sklearn.mixture.BayesianGaussianMixture)
 Fitted bgmm model
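For orientation, a Bayesian Gaussian mixture can be fitted to two-column distance data with scikit-learn roughly as below. This is a hedged sketch, not the body of fit2dMultiGaussian(): the synthetic data, the scaling by column maximum, and the hyperparameter choices (`covariance_type`, `n_init`, `random_state`) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic core/accessory distances: a tight group near the origin plus a larger distant group
rng = np.random.default_rng(0)
within = rng.normal([0.002, 0.05], [0.0005, 0.01], size=(200, 2))
between = rng.normal([0.02, 0.4], [0.002, 0.03], size=(800, 2))
X = np.vstack([within, between])

# Put both axes on a comparable range before fitting (assumed scaling)
scale = X.max(axis=0)

model = BayesianGaussianMixture(
    n_components=2,  # plays the role of dpgmm_max_K
    weight_concentration_prior_type="dirichlet_distribution",
    covariance_type="full",
    n_init=5,
    random_state=0,
).fit(X / scale)

labels = model.predict(X / scale)
print(model.weights_)
print(model.means_ * scale)  # component means back on the original distance scale
```

The fitted `weights_`, `means_`, and `covariances_` attributes correspond to the weights, means, and covars arguments taken by the other functions in this module.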

PopPUNK.bgmm.log_likelihood(X, weights, means, covars, scale)
Modified sklearn GMM function predicting distribution membership
Returns the mixture log-likelihood (LL) for points X. Used by assign_samples() and plot_contours()
 Args:
 X (numpy.array)
 n x 2 array of core and accessory distances for n samples
 weights (numpy.array)
 Component weights from fit2dMultiGaussian()
 means (numpy.array)
 Component means from fit2dMultiGaussian()
 covars (numpy.array)
 Component covariances from fit2dMultiGaussian()
 scale (numpy.array)
 Scaling of core and accessory distances from fit2dMultiGaussian()
 Returns:
 logprob (numpy.array)
 The log of the probabilities under the mixture model
 lpr (numpy.array)
 The components of the log probability from each mixture component
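The two return values relate to each other through a log-sum-exp over components. The sketch below shows that relationship with NumPy only. It is not the library's code: the function name `mixture_log_likelihood_sketch` is hypothetical, and dividing X by scale before evaluating the densities is an assumption about how the scaling is applied.

```python
import numpy as np


def mixture_log_likelihood_sketch(X, weights, means, covars, scale):
    """Hypothetical helper returning (logprob, lpr) for a Gaussian mixture."""
    X_scaled = X / scale  # assumed: distances are rescaled before evaluation
    n_dim = X_scaled.shape[1]
    lpr = np.empty((X_scaled.shape[0], len(weights)))
    for c in range(len(weights)):
        diff = X_scaled - means[c]
        # Squared Mahalanobis distance of every sample to component c
        maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(covars[c]), diff)
        _, log_det = np.linalg.slogdet(covars[c])
        # log(weight_c) + log N(x | mean_c, covar_c)
        lpr[:, c] = np.log(weights[c]) - 0.5 * (maha + n_dim * np.log(2 * np.pi) + log_det)
    # Mixture log-likelihood: log-sum-exp across components, computed stably
    peak = lpr.max(axis=1, keepdims=True)
    logprob = peak[:, 0] + np.log(np.exp(lpr - peak).sum(axis=1))
    return logprob, lpr


weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [1.0, 1.0]])
covars = np.array([np.eye(2) * 0.1, np.eye(2) * 0.1])
X = np.array([[0.0, 0.0], [2.0, 2.0]])
logprob, lpr = mixture_log_likelihood_sketch(X, weights, means, covars, scale=np.array([2.0, 2.0]))
print(logprob, lpr.argmax(axis=1))
```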

PopPUNK.bgmm.log_multivariate_normal_density(X, means, covars, min_covar=1e-07)
Log likelihood of multivariate normal density distribution
Used to calculate per-component Gaussian likelihood in assign_samples()
 Args:
 X (numpy.array)
 n x 2 array of core and accessory distances for n samples
 means (numpy.array)
 Component means from fit2dMultiGaussian()
 covars (numpy.array)
 Component covariances from fit2dMultiGaussian()
 min_covar (float)
 Minimum covariance, added when Cholesky decomposition fails due to too few observations (default = 1.e-7)
 Returns:
 log_prob (numpy.array)
 An n-vector with the log-likelihoods for each sample being in this component
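The role of min_covar can be illustrated with a Cholesky-based density calculation that retries with a regularised diagonal when the factorisation fails. This is a sketch under assumptions, not the function documented above: the name `log_mvn_density_sketch` is hypothetical, and it returns one column per component (an n by k array) rather than a single vector.

```python
import numpy as np


def log_mvn_density_sketch(X, means, covars, min_covar=1e-7):
    """Hypothetical helper: (n, k) array of Gaussian log densities, one column per component."""
    n_samples, n_dim = X.shape
    log_prob = np.empty((n_samples, len(means)))
    for c, (mu, cov) in enumerate(zip(means, covars)):
        try:
            chol = np.linalg.cholesky(cov)
        except np.linalg.LinAlgError:
            # Covariance is not positive definite: add min_covar to the diagonal and retry
            chol = np.linalg.cholesky(cov + min_covar * np.eye(n_dim))
        # Solve L z = (x - mu) so that z.z is the squared Mahalanobis distance
        z = np.linalg.solve(chol, (X - mu).T).T
        # log|cov| from the Cholesky factor
        log_det = 2.0 * np.sum(np.log(np.diag(chol)))
        log_prob[:, c] = -0.5 * (np.sum(z ** 2, axis=1) + n_dim * np.log(2 * np.pi) + log_det)
    return log_prob


X = np.array([[0.0, 0.0], [1.0, 1.0]])
means = np.array([[0.0, 0.0]])
covars = np.array([np.eye(2)])
print(log_mvn_density_sketch(X, means, covars))
```

Working from the Cholesky factor avoids forming an explicit inverse and gives the log determinant directly from the factor's diagonal.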
dbscan.py
Functions used to fit DBSCAN to a database. Access using DBSCANFit.
DBSCAN using hdbscan

PopPUNK.dbscan.assign_samples_dbscan(X, hdb, scale)
Use a fitted dbscan model to assign new samples to a cluster
 Args:
 X (numpy.array)
 N x 2 array of core and accessory distances
 hdb (hdbscan.HDBSCAN)
 Fitted DBSCAN from hdbscan package
 scale (numpy.array)
 Scale factor of model object
 Returns:
 y (numpy.array)
 Cluster assignments by sample

PopPUNK.dbscan.evaluate_dbscan_clusters(model)
Evaluate whether fitted dbscan model contains non-overlapping clusters
 Args:
 model (DBSCANFit)
 Fitted model from fit()
 Returns:
 indistinct (bool)
 Boolean indicating whether putative within- and between-strain clusters of points overlap
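One simple way to test two clusters for overlap is to compare their per-axis ranges. The sketch below does this and is not necessarily the criterion this function applies: the name `clusters_overlap_sketch` is hypothetical, it takes raw arrays and labels rather than a DBSCANFit object, and treating intersecting bounding boxes as "overlap" is an assumption.

```python
import numpy as np


def clusters_overlap_sketch(X, assignments, within_label, between_label):
    """Hypothetical helper: True if the bounding boxes of the two clusters intersect."""
    within = X[assignments == within_label]
    between = X[assignments == between_label]
    # Axis-aligned boxes intersect only if their ranges overlap on every axis
    overlap_per_axis = (within.min(axis=0) <= between.max(axis=0)) & (
        between.min(axis=0) <= within.max(axis=0)
    )
    return bool(overlap_per_axis.all())


X = np.array([[0.01, 0.1], [0.02, 0.2], [0.5, 0.6], [0.6, 0.7]])
assignments = np.array([0, 0, 1, 1])
print(clusters_overlap_sketch(X, assignments, within_label=0, between_label=1))  # → False
```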

PopPUNK.dbscan.findBetweenLabel(assignments, within_cluster)
Identify between-strain links from a DBSCAN model
Finds the component containing the largest number of between-strain links, excluding the cluster identified as containing within-strain links.
 Args:
 assignments (numpy.array)
 Sample cluster assignments
 within_cluster (int)
 Cluster ID assigned to within-strain assignments, from findWithinLabel()
 Returns:
 between_cluster (int)
 The cluster label for the between-strain assignments
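Picking the largest remaining cluster can be sketched as below. This is an illustration, not the library's code: the name `find_between_label_sketch` is hypothetical, and skipping the label -1 (which hdbscan uses for noise points) is an assumption about how unclustered samples are handled.

```python
import numpy as np


def find_between_label_sketch(assignments, within_cluster):
    """Hypothetical helper: label of the largest cluster other than the within-strain one."""
    labels, counts = np.unique(assignments, return_counts=True)
    # Exclude the within-strain cluster and noise points (label -1, assumed to be ignored)
    keep = (labels != within_cluster) & (labels != -1)
    if not keep.any():
        raise ValueError("no candidate between-strain cluster found")
    return int(labels[keep][np.argmax(counts[keep])])


assignments = np.array([0, 0, 1, 1, 1, 2, -1, -1, -1, -1])
print(find_between_label_sketch(assignments, within_cluster=0))  # → 1
```

Noise is the most frequent label in this example, but it is excluded, so cluster 1 is returned.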

PopPUNK.dbscan.fitDbScan(X, min_samples, min_cluster_size, cache_out)
Function to fit DBSCAN model as an alternative to the Gaussian mixture model
Fits the DBSCAN model to the distances using hdbscan
 Args:
 X (np.array)
 n x 2 array of core and accessory distances for n samples
 min_samples (int)
 Parameter for DBSCAN clustering ‘conservativeness’
 min_cluster_size (int)
 Minimum number of points in a cluster for HDBSCAN
 cache_out (str)
 Prefix for DBSCAN cache used for refitting
 Returns:
 hdb (hdbscan.HDBSCAN)
 Fitted HDBSCAN to subsampled data
 labels (list)
 Cluster assignments of each sample
 n_clusters (int)
 Number of clusters used
network.py
Functions used to construct the network, and update with new queries. Main entry point is constructNetwork() for new reference databases, and findQueryLinksToNetwork() for querying databases.