Reference documentation
Documentation for module functions (for developers)
bgmm.py
Functions used to fit the mixture model to a database. Access using BGMMFit.
BGMM using sklearn
- PopPUNK.bgmm.findWithinLabel(means, assignments, rank=0)
Identify within-strain links
Finds the component with mean closest to the origin, and also makes sure some samples are assigned to it (with small weighted components and a Dirichlet prior, some components are unused).
- Args:
- means (numpy.array)
- K x 2 array of mixture component means
- assignments (numpy.array)
- Sample cluster assignments
- rank (int)
- Which label to find, ordered by distance from origin. 0-indexed. (default = 0)
- Returns:
- within_label (int)
- The cluster label for the within-strain assignments
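The selection rule described above can be sketched as follows. This is an illustrative reading of the description, not the PopPUNK source: the name `find_within_label` is hypothetical, and plain lists stand in for numpy arrays so the sketch has no dependencies.

```python
import math

def find_within_label(means, assignments, rank=0):
    """Return the rank-th closest populated component to the origin (hypothetical helper)."""
    used = set(assignments)
    # Order component indices by the Euclidean distance of their mean from the origin
    by_distance = sorted(range(len(means)), key=lambda k: math.hypot(*means[k]))
    # Drop components that no sample was assigned to
    populated = [k for k in by_distance if k in used]
    return populated[rank]

# Component 1 is closest to the origin but has no samples assigned
means = [(0.5, 0.4), (0.01, 0.02), (0.1, 0.1)]
assignments = [0, 0, 2, 2, 2, 0]
within_label = find_within_label(means, assignments)
```

Here component 1 is skipped because it is unused, so the closest populated component (label 2) is returned.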
- PopPUNK.bgmm.fit2dMultiGaussian(X, dpgmm_max_K=2)
Main function to fit BGMM model, called from fit()
Fits the mixture model specified, saves model parameters to a file, and assigns the samples to a component. Writes fit summary stats to STDERR.
- Args:
- X (np.array)
- n x 2 array of core and accessory distances for n samples. This should be subsampled to 100000 samples.
- dpgmm_max_K (int)
- Maximum number of components to use with the EM fit. (default = 2)
- Returns:
- dpgmm (sklearn.mixture.BayesianGaussianMixture)
- Fitted bgmm model
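The X argument is expected to be subsampled before fitting. A minimal sketch of uniform subsampling without replacement is given below. The helper name `subsample_rows` and its `seed` parameter are assumptions for illustration, and a list of (core, accessory) pairs stands in for the n x 2 array.

```python
import random

def subsample_rows(X, max_samples=100000, seed=None):
    """Uniformly subsample rows of X without replacement (hypothetical helper)."""
    if len(X) <= max_samples:
        # Nothing to do: already small enough
        return list(X)
    rng = random.Random(seed)
    return rng.sample(X, max_samples)

# Toy distances: 250 (core, accessory) pairs reduced to 100
X = [(i * 0.001, i * 0.002) for i in range(250)]
X_sub = subsample_rows(X, max_samples=100, seed=1)
```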
- PopPUNK.bgmm.log_likelihood(X, weights, means, covars, scale)
Modified sklearn GMM function predicting distribution membership
Returns the mixture log-likelihood (LL) for points X. Used by assign_samples() and plot_contours()
- Args:
- X (numpy.array)
- n x 2 array of core and accessory distances for n samples
- weights (numpy.array)
- Component weights from fit2dMultiGaussian()
- means (numpy.array)
- Component means from fit2dMultiGaussian()
- covars (numpy.array)
- Component covariances from fit2dMultiGaussian()
- scale (numpy.array)
- Scaling of core and accessory distances from fit2dMultiGaussian()
- Returns:
- logprob (numpy.array)
- The log of the probabilities under the mixture model
- lpr (numpy.array)
- The components of the log probability from each mixture component
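The relationship between the two return values can be sketched as below, assuming the per-component Gaussian log densities have already been computed. The names `logsumexp` and `mixture_log_likelihood` are illustrative, the `scale` argument is left out, and this is not the PopPUNK code itself.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_log_likelihood(component_log_density, weights):
    """component_log_density[i][k] is the log density of sample i under component k."""
    # Per-component terms: log weight plus log density
    lpr = [[ld + math.log(w) for ld, w in zip(row, weights)]
           for row in component_log_density]
    # Mixture log-likelihood: log of the summed weighted densities
    logprob = [logsumexp(row) for row in lpr]
    return logprob, lpr

# One sample, two components with densities 0.1 and 0.3 and weights 0.25 and 0.75
logprob, lpr = mixture_log_likelihood([[math.log(0.1), math.log(0.3)]], [0.25, 0.75])
```

The weighted densities sum to 0.025 + 0.225 = 0.25, so `logprob[0]` equals log(0.25) up to floating-point error.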
- PopPUNK.bgmm.log_multivariate_normal_density(X, means, covars, min_covar=1e-07)
Log likelihood of multivariate normal density distribution
Used to calculate per component Gaussian likelihood in assign_samples()
- Args:
- X (numpy.array)
- n x 2 array of core and accessory distances for n samples
- means (numpy.array)
- Component means from fit2dMultiGaussian()
- covars (numpy.array)
- Component covariances from fit2dMultiGaussian()
- min_covar (float)
- Minimum covariance, added when Cholesky decomposition fails due to too few observations (default = 1.e-7)
- Returns:
- log_prob (numpy.array)
- An n-vector with the log-likelihoods for each sample being in this component
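To illustrate the role of min_covar, the sketch below evaluates a 2D Gaussian log density for a single component using a hand-written 2 x 2 Cholesky factorisation. If the factorisation fails, min_covar is added to the diagonal before retrying. The function names are hypothetical, and the real function handles several components at once, so treat this as a simplified model of the idea.

```python
import math

def cholesky_2x2(cov):
    """Lower-triangular Cholesky factor of a symmetric 2 x 2 matrix."""
    (a, b), (_, d) = cov
    if a <= 0:
        raise ValueError("matrix is not positive definite")
    l11 = math.sqrt(a)
    l21 = b / l11
    rem = d - l21 * l21
    if rem <= 0:
        raise ValueError("matrix is not positive definite")
    return l11, l21, math.sqrt(rem)

def log_normal_density_2d(X, mean, cov, min_covar=1e-7):
    """Log density of each row of X under one 2D Gaussian component."""
    try:
        l11, l21, l22 = cholesky_2x2(cov)
    except ValueError:
        # Regularise the diagonal and retry, as described for min_covar
        reg = [[cov[0][0] + min_covar, cov[0][1]],
               [cov[1][0], cov[1][1] + min_covar]]
        l11, l21, l22 = cholesky_2x2(reg)
    log_det = 2.0 * (math.log(l11) + math.log(l22))
    out = []
    for x0, x1 in X:
        d0, d1 = x0 - mean[0], x1 - mean[1]
        # Solve L z = (x - mean) by forward substitution
        z0 = d0 / l11
        z1 = (d1 - l21 * z0) / l22
        out.append(-0.5 * (2 * math.log(2 * math.pi) + log_det + z0 * z0 + z1 * z1))
    return out

# Standard normal evaluated at its mean, and a singular covariance that needs min_covar
at_mean = log_normal_density_2d([(0.0, 0.0)], (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]])
degenerate = log_normal_density_2d([(0.0, 0.0)], (0.0, 0.0), [[1.0, 1.0], [1.0, 1.0]])
```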
dbscan.py
Functions used to fit DBSCAN to a database. Access using DBSCANFit.
DBSCAN using hdbscan
- PopPUNK.dbscan.evaluate_dbscan_clusters(model)
Evaluate whether fitted dbscan model contains non-overlapping clusters
- Args:
- model (DBSCANFit)
- Fitted model from fit()
- Returns:
- indistinct (bool)
- Boolean indicating whether putative within- and between-strain clusters of points overlap
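One plausible way to test for overlap is to compare the ranges of the two clusters along each distance axis. The sketch below makes that assumption explicit: it flags the clusters as indistinct if their ranges intersect on either the core or the accessory axis. The function names and the list-of-points inputs are hypothetical, and the criterion used by PopPUNK may differ.

```python
def ranges_overlap(lo_a, hi_a, lo_b, hi_b):
    """True if the closed intervals [lo_a, hi_a] and [lo_b, hi_b] intersect."""
    return lo_a <= hi_b and lo_b <= hi_a

def clusters_indistinct(within_points, between_points):
    """Assumed criterion: range overlap on either axis marks the clusters as indistinct."""
    for axis in (0, 1):  # 0 = core distance, 1 = accessory distance
        w = [p[axis] for p in within_points]
        b = [p[axis] for p in between_points]
        if ranges_overlap(min(w), max(w), min(b), max(b)):
            return True
    return False

within = [(0.001, 0.01), (0.002, 0.02)]
between = [(0.02, 0.3), (0.03, 0.4)]
indistinct = clusters_indistinct(within, between)
```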
- PopPUNK.dbscan.findBetweenLabel(assignments, within_cluster)
Identify between-strain links from a DBSCAN model
Finds the component containing the largest number of between-strain links, excluding the cluster identified as containing within-strain links.
- Args:
- assignments (numpy.array)
- Sample cluster assignments
- within_cluster (int)
- Cluster ID assigned to within-strain assignments, from findWithinLabel()
- Returns:
- between_cluster (int)
- The cluster label for the between-strain assignments
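The rule above amounts to picking the most populated cluster other than the within-strain one. A minimal sketch follows. The name `find_between_label` is illustrative, and skipping the noise label -1 (which hdbscan assigns to unclustered points) is an assumption not stated in the description.

```python
from collections import Counter

def find_between_label(assignments, within_cluster, noise_label=-1):
    """Return the largest cluster that is neither the within-strain cluster nor noise."""
    counts = Counter(a for a in assignments
                     if a not in (within_cluster, noise_label))
    # most_common(1) returns [] if no other cluster exists; callers should handle that case
    label, _ = counts.most_common(1)[0]
    return label

assignments = [0, 0, 0, 0, 1, 1, 2, 2, 2, -1, -1, -1, -1, -1]
between_cluster = find_between_label(assignments, within_cluster=0)
```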
- PopPUNK.dbscan.fitDbScan(X, min_samples, min_cluster_size, cache_out)
Function to fit DBSCAN model as an alternative to the Gaussian mixture model
Fits the DBSCAN model to the distances using hdbscan
- Args:
- X (np.array)
- n x 2 array of core and accessory distances for n samples
- min_samples (int)
- Parameter for DBSCAN clustering ‘conservativeness’
- min_cluster_size (int)
- Minimum number of points in a cluster for HDBSCAN
- cache_out (str)
- Prefix for DBSCAN cache used for refitting
- Returns:
- hdb (hdbscan.HDBSCAN)
- Fitted HDBSCAN to subsampled data
- labels (list)
- Cluster assignments of each sample
- n_clusters (int)
- Number of clusters used
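For reference, the number of clusters can be derived from the label list by ignoring hdbscan's noise label, -1. The helper below is a hypothetical sketch of that count and is not necessarily how fitDbScan computes n_clusters.

```python
def count_clusters(labels, noise_label=-1):
    """Count distinct cluster labels, ignoring the noise label."""
    return len(set(labels) - {noise_label})

labels = [0, 0, 1, -1, 2, 2, -1]
n_clusters = count_clusters(labels)
```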
network.py
Functions used to construct the network, and update with new queries. Main entry point is constructNetwork() for new reference databases, and findQueryLinksToNetwork() for querying databases.
visualise.py
poppunk_visualise main function