API
Quick reference
Functions to get things:
Returns a |
|
|
Returns a |
Returns a demographic model with the specified |
|
Returns a DFE with the specified |
|
Returns a set of annotations with the specified |
|
|
Adds the provided DFE to the contig, applying to the regions of the contig specified in |
Returns the simulation engine with the specified |
Classes of objects:
|
Class representing a species in the catalog. |
|
Class representing the genome for a species. |
|
Class representing a single chromosome for a species. |
|
Class representing a contiguous region of genome that is to be simulated. |
|
A reference to the literature that should be acknowledged by users of stdpopsim. |
|
Class representing an annotation track. |
|
Class representing a "Distribution of Fitness Effects", i.e., a DFE. |
|
Class representing a "type" of mutation. |
|
Class representing a demographic model. |
|
Class recording metadata representing a population in a simulation. |
Species definitions
The Catalog contains a large number of species and simulation
model definitions, which are built using a number of classes defined here.
These are usually not intended to be instantiated directly, but should be
accessed through the main entry point, get_species()
.
- class stdpopsim.Species[source]
Class representing a species in the catalog.
- Variables:
~.id (str) – The unique identifier for this species. The species ID is the three first letters of the genus name followed by the first three letters of the species name, and does not contain any spaces or punctuation. The usual scheme is to use the first three letters of the genus and species (similar to the approach used in the UCSC genome browser), e.g., “HomSap” is the ID for Homo Sapiens.
name (str) – The full name of this species in binominal nomenclature as it would be used in written text, e.g., “Homo sapiens”.
common_name (str) – The name of this species as it would most often be used informally in written text, e.g., “human”, or “Orang-utan”. Where no common name for the species exist, use the most common abbreviation, e.g., “E. Coli”.
genome (stdpopsim.Genome) – The
Genome
instance describing the details of this species’ genome.generation_time (float) – The current best estimate for the generation time of this species in years. Note that individual demographic models in the catalog may or may not use this estimate: each model uses the generation time that was used in the original publication(s).
ploidy (int) – The ploidy of the organism.
population_size (float) – The current best estimate for the population size of this species. Note that individual demographic models in the catalog may or may not use this estimate: each model uses the population sizes defined in the original publication(s).
citations (list) – A list of
Citation
objects providing the source for the generation time and population size estimates.demographic_models (list) – This list of
DemographicModel
instances in the catalog for this species.dfes (list) – This list of
DFE
instances in the catalog for this species.ensembl_id (str) – The ensembl id for the species which is used by maintenance scripts to query ensembl’s database.
- get_annotations(id)[source]
Returns a set of annotations with the specified
id
.- Parameters:
id (str) – The string identifier for the set of annotations A complete list of IDs for each species can be found in the “Annotations” subsection for the species in the Catalog.
- Return type:
- Returns:
A
Annotation
that holds genome annotation information from Ensembl
- get_contig(chromosome=None, genetic_map=None, length_multiplier=None, length=None, mutation_rate=None, recombination_rate=None, use_species_gene_conversion=False, inclusion_mask=None, exclusion_mask=None, left=None, right=None)[source]
Returns a
Contig
instance describing a section of genome that is to be simulated based on empirical information for a given species and chromosome.Coordinates: If a particular chunk of a given chromosome is obtained (by specifying a chromosome ID and left and right coordinates), the resulting genomes will have coordinates matching the original (full) genome, and portions of the genome outside of [left, right) will be masked (and so contain missing data). So, coordinates of genetic_map and masks should be in coordinates of the original genome (so, not shifted to be relative to left). If it is desired to have output coordinates relative to left, use the
tskit.TreeSequence.trim()
method on the result (for instance, ts_shifted = ts.trim()).- Parameters:
chromosome (str) – The ID of the chromosome to simulate. A complete list of chromosome IDs for each species can be found in the “Genome” subsection for the species in the Catalog. If the chromosome is not given, we specify a “generic” contig with given
length
.genetic_map (str) – If specified, obtain recombination rate information from the genetic map with the specified ID. If None, simulate using a default uniform recombination rate on a region with the length of the specified chromosome. The default rates are species- and chromosome- specific, and can be found in the Catalog. (Default: None)
length_multiplier (float) – Deprecated, use left and right instead. If specified, simulate a region of length length_multiplier times the length of the specified chromosome with the same chromosome-specific mutation and recombination rates. This option cannot be used in conjunction with the
genetic_map
argument.mutation_rate (float) – The per-base mutation rate. If none is given, the mutation rate defaults to the rate specified by species chromosomes.
recombination_rate (float) – The per-base recombination rate. If none is given, the recombination rate defaults to the rate specified by species chromosomes. Ignored when
genetic_map
argument is specified.use_species_gene_conversion (bool) – If set to True the parameters for gene conversion of the species chromosome are used if available. For “generic” contigs the gene conversion fraction and length are given by the mean values across all chromosomes.
inclusion_mask – If specified, simulated genomes are subset to only inlude regions given by the mask. The mask can be specified by the path and file name of a bed file or as a list or array of intervals given by the left and right end points of the intervals.
exclusion_mask – If specified, simulated genomes are subset to exclude regions given by the mask. The mask can be specified by the path and file name of a bed file or as a list or array of intervals given by the left and right end points of the intervals.
length (float) – Used with a “generic” contig, specifies the length of genome sequence for this contig. For a generic contig, mutation and recombination rates are equal to the genome-wide average across all autosomal chromosomes.
left (float) – The left coordinate (inclusive) of the region to keep on the chromosome. Defaults to 0. Remaining regions will have missing data when simulated.
right (float) – The right coordinate (exclusive) of the region to keep on the chromosome. Defaults to the length of the chromosome. Remaining regions will have missing data when simulated.
- Return type:
- Returns:
A
Contig
describing the section of the genome.
- get_demographic_model(id)[source]
Returns a demographic model with the specified
id
.- Parameters:
id (str) – The string identifier for the demographic model. A complete list of IDs for each species can be found in the “Demographic Models” subsection for the species in the Catalog.
- Return type:
- Returns:
A
DemographicModel
that defines the requested model.
- class stdpopsim.Genome[source]
Class representing the genome for a species.
- Variables:
chromosomes (list) – A list of
Chromosome
objects.assembly_name (str) – The name of the genome assembly.
assembly_accession (str) – The ID of the genome assembly accession.
assembly_source (str) – The source of the genome assembly data. (for instance “ensembl”). Use “manual” if manually entered.
assembly_build_version (str) – The version of the genome assembly build, or “None” if manually entered.
bacterial_recombination (bool) – Whether recombination is via horizontal gene transfer (if this is True) or via crossing-over and possibly gene conversion (if this is False). Default: False.
citations (list) – A list of
Citation
objects providing the source for the genome assembly, mutation rate, recombination rate, and gene conversion estimates.length (int) – The total length of the genome.
- static from_data(genome_data, *, recombination_rate, mutation_rate, ploidy, citations, bacterial_recombination=False, gene_conversion_fraction=None, gene_conversion_length=None)[source]
Construct a Genome object from the specified dictionary of genome information from Ensembl, recombination_rate, mutation_rate, and ploidy dictionaries.
This method is for internal use only.
- get_chromosome(id)[source]
Returns the chromosome with the specified
id
.- Parameters:
id (str) – The string ID of the chromosome. A complete list of chromosome IDs for each species can be found in the “Genome” subsection for the species in the Catalog.
- Return type:
- Returns:
A
Chromosome
that defines properties of the chromosome such as length, mutation rate, and recombination rate.
- property mean_gene_conversion_fraction
The genetic length-weighted mean gene conversion fraction across all chromosomes.
- property mean_mutation_rate
The length-weighted mean mutation rate across all chromosomes.
- property mean_recombination_rate
The length-weighted mean recombination rate across all chromosomes. (Note that this is the rate of crossing-over, not all double-stranded breaks.)
- property range_gene_conversion_lengths
The range of gene conversion tract lengths across chromosomes.
- class stdpopsim.Chromosome[source]
Class representing a single chromosome for a species. Note that although the recombination rate for a Chromosome is the rate of crossing-over, the recombination map for a Contig gives the rate of both crossing-over and gene conversion, so will differ if the gene conversion fraction is nonzero.
- Variables:
~.id (str) – The string identifier for the chromosome.
length (int) – The length of the chromosome.
mutation_rate (float) – The mutation rate used when simulating this chromosome.
recombination_rate (float) – The rate of crossing-over used when simulating this chromosome (if not using a genetic map).
gene_conversion_fraction (float) – The gene conversion fraction used when simulating this chromosome.
gene_conversion_length (float) – The mean tract length of gene conversion events used when simulating this chromosome.
synonyms (list of str) – List of synonyms that may be used when requesting this chromosome by ID, e.g. from the command line interface.
ploidy (int) – The ploidy used when simulating this chromosome.
- class stdpopsim.Contig[source]
Class representing a contiguous region of genome that is to be simulated. This contains the information about mutation rates, distributions of fitness effects (DFEs), gene conversion rates and recombination rates that are needed to simulate this region.
Information about targets of selection are contained in the
dfe_list
andinterval_list
. These must be of the same length, and the k-th DFE applies to the k-th interval; seeadd_dfe()
for more information.- Variables:
mutation_rate (float) – The rate of mutation per base per generation.
ploidy (int) – The ploidy of the contig. Defaults to 2 (diploid).
genetic_map (.GeneticMap) – The genetic map for the region, that gives the rates of crossing-over. A
GeneticMap
object.recombination_map (msprime.RateMap) – The recombination map for the region, that gives the rates of double-stranded breaks. Note that if gene conversion fraction is nonzero, then this map can have a larger rate than the corresponding chromosome’s recombination rate, since the chromosome’s rate describes only crossing-over. A
msprime.RateMap
object.bacterial_recombination – Whether the model of recombination is by horizontal gene transfer or not (default is False, i.e., recombination is by crossing-over and possibly gene conversion).
gene_conversion_fraction (float) – The fraction of recombinations that resolve as gene conversion events. The recombination map gives the rates of both crossing-over and gene conversion, with a fraction of gene conversions given by this value. Defaults to None, i.e., no gene conversion. Must be None or between 0 and 1. Must be None if bacterial_recombination is True.
gene_conversion_length (float) – The mean tract length of gene conversions (if bacterial_recombination is False), or horizontally tranferred segments (if it is True). Must be present, and above 1 if either gene_conversion_fraction is given or bacterial_recombination is True.
mask_intervals (array-like (?)) – Intervals to keep in simulated tree sequence, as a list of (left_position, right_position), such that intervals are non-overlapping and in ascending order. Should have shape Nx2, where N is the number of intervals.
exclude (bool) – If True,
mask_intervals
specify regions to exclude. If False,mask_intervals
specify regions in keep.dfe_list (list) – A list of
DFE
objects. By default, the only DFE is completely neutral.interval_list – A list of
np.array
objects containing integers. By default, the inital interval list spans the whole chromosome with the neutral DFE.coordinates (tuple) – The location of the contig on a named chromosome, as a tuple of the form (chromosome, left, right). If None, the contig is assumed to be generic (i.e. it does not inherit a coordinate system from a larger chromosome).
Note
To run stdpopsim simulations with alternative, user-specified mutation, recombination, or gene conversion rates, a new contig can be created based on an existing one. For instance, the following will create a
new_contig
that, when simulated, will have double the mutation rate of theold_contig
:new_contig = stdpopsim.Contig( mutation_rate=old_contig.mutation_rate * 2, recombination_map=old_contig.recombination_map, genetic_map=old_contig.genetic_map, )
- add_dfe(intervals, DFE)[source]
Adds the provided DFE to the contig, applying to the regions of the contig specified in
intervals
. These intervals are also removed from the intervals of any previously-present DFEs - in other words, more recently-added DFEs take precedence. Each DFE added in this way carries its own MutationTypes: no mutation types are shared between DFEs added by different calls toadd_dfe( )
, even if the same DFE object is added more than once.For instance, if we do
a1 = np.array([[0, 100]]) a2 = np.array([[50, 120]]) contig.add_dfe(a1, dfe1) contig.add_dfe(a2, dfe2)
then
dfe1
applies to the region from 0 to 50 (including 0 but not 50) anddfe2
applies to the region from 50 to 120 (including 50 but not 120).Any of the
intervals
that fall outside of the contig will be clipped to the contig boundaries. If nointervals
overlap the contig, an “empty” DFE will be added with a warning.- Parameters:
intervals (array) – A valid set of intervals.
dfe (DFE) – A DFE object.
- add_single_site(id, coordinate, description=None, long_description=None)[source]
Adds a single site mutation class at the provided coordinate. The string label of the single site may be referenced by extended events passed to the simulation engine.
For instance, to simulate a selective sweep in population pop_0 starting at 1000 generations in the past:
contig.add_single_site( id="hard_sweep", coordinate=mutation_coordinate, ) extended_events = ext.selective_sweep( mutation_generation_ago=1000, selection_coeff=0.1, population="pop_0", ) engine = stdpopsim.get_engine("slim") ts_sweep = engine.simulate( ..., extended_events=extended_events, )
See
ext.selective_sweep()
for more details on the sweep model.- Parameters:
id (str) – A unique identifier for the single site.
coordinate (int) – The coordinate of the site on the contig.
description (str) – A short description of the single site mutation model as it would be used in written text, e.g., “Strong selective sweep”.
long_description (str) – If necessary, a more detailed summary.
- clear_dfes()[source]
Removes all DFEs from the contig (as well as the corresponding list of intervals).
- dfe_breakpoints(*, relative_coordinates=None)[source]
Returns two things: the sorted vector of endpoints of all intervals across all DFEs in the contig, and a vector of integer labels for these intervals, saying which DFE goes with which interval. This provides a complementary method to tell which bit of the contig has which DFE attached, which may be more convenient than the list of two-column arrays provided by interval_list.
Suppose there are n+1 unique interval endpoints across all the DFE intervals in
interval_list
. (If the intervals in that list cover the whole genome, the number of intervals is n.) This method returns a tuple of two things:breaks, dfe_labels
. “breaks” is the array containing those n+1 unique endpoints, in increasing order, and “dfe” is the array of length n containing the index of the DFE the applies to that interval. So,breaks[0]
is always 0, andbreaks[n+1]
is always the length of the contig, anddfe_labels[k] = j
if[breaks[k], breaks[k+1]]
is an interval incontig.interval_list[j]
, i.e., ifcontig.dfe_list[j]
applies to the interval starting atbreaks[k]
. Some intervals may not be covered by a DFE, in which case they will have the label-1
(beware of python indexing!).- Returns:
A tuple (breaks, dfe_labels).
- property is_neutral
Returns True if the contig has no non-neutral mutation types.
- mutation_types()[source]
Provides information about the MutationTypes assigned to this Contig, along with information about which DFE they correspond to. This is useful because when simulating with SLiM, the mutation types are assigned numeric IDs in the order provided here (e.g., in the order encountered when iterating over the DFEs); this method provides an easy way to map back from their numeric ID to the DFE that each MutationType corresponds to.
This method returns a list of dictionaries of length equal to the number of MutationTypes in all DFEs of the contig, each dictionary containing three things:
"dfe_id"
: the ID of the DFE this MutationType comes from;mutation_type
: the mutation type; andid
: the index in the list.For instance, if
muts
is a list of mutation objects in a SLiM tree sequence, then the following code will print the IDs of the DFEs that each comes from:mut_types = contig.mutation_types() for m in muts: dfe_ids = [mut_types[k]["dfe_id"] for md in m.metadata["muation_list"]] print(f"Mutation {m.id} has mutations from DFE(s) {','.join(dfe_ids)}")
- property origin
The location of the contig on a named chromosome as a string with format, “chromosome:left-right”; or None if a generic contig.
- class stdpopsim.Citation[source]
A reference to the literature that should be acknowledged by users of stdpopsim.
- Variables:
- class stdpopsim.Annotation[source]
Class representing an annotation track.
- Variables:
~.id (str) – String that uniquely identifies the annotation.
species (
Species
) – The species to which this annotation applies.url (str) – The URL where the packed and compressed GFF3 can be found.
intervals_url (str) – The URL of the intervals cache of the annotations.
intervals_sha256 (str) – The SHA256 checksum of the annotations cache.
~.description (str) – One line description of the annotation.
citations (list of
Citation
) – List of citations for the annotation.file_pattern – The pattern used to map individual chromosome id strings to files
Demographic Models
- class stdpopsim.DemographicModel[source]
Class representing a demographic model.
Instances of this class are constructed by model implementors, following the developer documentation. To instead obtain a pre-specified model as listed in the Catalog, see
Species.get_demographic_model
.- Variables:
~.id (str) – The unique identifier for this model. DemographicModel IDs should be short and memorable, and conform to the stdpopsim naming conventions for demographic models.
~.description (str) – A short description of this model as it would be used in written text, e.g., “Three population Out-of-Africa”. This should describe the model itself and not contain author or year information.
long_description (str) – A concise, but detailed, summary of the model.
generation_time (int) – Mean inter-generation interval, in years.
citations (list of
Citation
) – A list ofCitation
, that describe the primary reference(s) for the model.mutation_rate (float) – The mutation rate associated with the demographic model. If no mutation rate is given, the species’ default mutation rate is used.
recombination_rate (float) – The recombination rate associated with the demographic model. If no recombination rate is given, the species’ default recombination rate is used.
- get_sample_sets(individuals_per_population, ploidy=2)[source]
Returns a list of msprime.SampleSet objects, using the number of individuals sampled from each population as specified in individuals_per_population. For instance,
model.get_sample_sets({"pop_0":5, "pop_1":5}
would return sample sets encompassing 10 individuals, equally split between populations named “pop_0” and “pop_1”. Omitted populations are assumed to have zero samples.
- get_samples(*args)[source]
Returns a list of msprime.SampleSet objects, with the number of haploids from each population determined by the positional arguments. For instance,
model.get_samples(2, 5, 7)
would return a list with sample sets encompassing 14 samples, two of which are from the model’s first population (i.e., with population IDmodel.populations[0].id
), five are from the model’s second population, and seven are from the model’s third population. The number of of arguments must be less than or equal to the number of “sampling” populations,model.num_sampling_populations
; if the number of arguments is less than the number of sampling populations, then remaining numbers are treated as zero.
- class stdpopsim.Population[source]
Class recording metadata representing a population in a simulation.
- Variables:
~.id (str) – The id of the population.
~.description (str) – a short description of the population
sampling_time (int) – an integer value which indicates how many generations prior to the present individuals should samples should be drawn from this population. If None, sampling not allowed from this population (default = 0).
Gene conversion and bacterial recombination
Some species have estimates of the process of gene conversion.
However, by default gene conversion does not happen, because
(unfortunately) it can make simulations take much longer,
and result in much bigger files.
This is because the rate of gene conversion is usually much higher than
the rate of crossing-over, so enabling gene conversion effectively
increases the rate of recombination, which can have a strong effect on runtimes.
To enable gene conversion for a species that has estimates,
pass the use_species_gene_conversion=True argument to
Engine.simulate()
.
Gene conversion attributes are visible through the gene_conversion_fraction and gene_conversion_length properties of the species’ genome. The “gene conversion fraction” gives the (average) fraction of double-stranded breaks that resolve as gene conversion (a.k.a non-crossing-over) events, and the “gene conversion length” is the mean length of the gene conversion tracts. So, if the rate of crossing over (which is often referred to as “recombination rate”) is r and the gene conversion fraction is f, then the total rate of double-stranded breaks is r / (1-f), and the gene conversion rate is r * f / (1-f).
A consequence of this is that
the recombination_map attribute of a Contig
is the rate of
double-stranded break initiation, not the rate of crossing-over. The terminology
conflicts somewhat with the recombination_rate property of
a Chromosome
, which specifies the rate of crossing-over. So, creating
a contig with use_species_gene_conversion=True will result in a Contig
with larger “recombination rates” than otherwise, because these rates
include gene conversion as well as crossing over:
species = stdpopsim.get_species("DroMel")
contig = species.get_contig("2L")
contig.recombination_map.mean_rate
# 2.40462600791e-08
contig = species.get_contig("2L", use_species_gene_conversion=True)
contig.recombination_map.mean_rate
# 1.414485887005882e-07
Different engines implement gene conversion in different ways: at the time of writing, msprime only allows a constant rate of gene conversion, while SLiM only allows gene conversion to be a fixed fraction of double-stranded break events. So, in SLiM we use the recombination map for the local rate of double-stranded breaks along the genome, while in msprime we specify a constant rate of gene conversion so that the average total number is the same as we would get from SLiM.
In principle, bacterial recombination is a completely different issue, since bacterial recombination is by horizontal transfer of genomic segments. However, bacterial recombination is at present implemented using the engine’s gene conversion mechanisms. So, bacterial species have the bacterial_recombination flag set, and need gene_conversion_length to be defined. (It could also be called horizontal transfer or homologous recombination segment length, but that would add another option.) For such species, gene conversion does not happen, so gene_conversion_fraction must not be set, and the “recombination rate” is the rate of bacterial recombination.
Selection and DFEs
To allow for different kinds of mutations,
each Contig
carries a list of DFE
objects
(for “Distribution of Fitness Effects”),
and a list of collections of intervals to which these apply.
The Contig
contains the overall mutation rate,
and the DFE
describes the proportions of different types of mutations.
Each Contig
comes, by default, with a single, neutral DFE
that applies to the entire Contig.
- class stdpopsim.DFE[source]
Class representing a “Distribution of Fitness Effects”, i.e., a DFE. The class records the different mutation types, and the proportions with which they occur. The overall rate of mutations will be determined by the Contig to which the DFE is applied (see
Contig.add_dfe()
).Instances of this class are constructed by DFE implementors, following the developer documentation. To instead obtain a pre-specified model as listed in the Catalog, see
Species.get_dfe()
.proportions
andmutation_types
must be lists of the same length, andproportions
should be nonnegative numbers summing to 1.- Variables:
~.mutation_types (list) – A list of
MutationType
objects associated with the DFE. Defaults to an empty list.~.proportions (list) – A list of the proportions of new mutations that fall in to each of the mutation types (must sum to 1).
~.id (str) – The unique identifier for this model. DFE IDs should be short and memorable, and conform to the stdpopsim naming conventions for DFE models.
~.description (str) – A short description of this model as it would be used in written text, e.g., “Lognormal DFE”. This should describe the DFE itself and not contain author or year information.
long_description (str) – A concise, but detailed, summary of the DFE model.
citations (list of
Citation
) – A list ofCitations
, that describe the primary reference(s) for the DFE model.
- class stdpopsim.MutationType[source]
Class representing a “type” of mutation. The design closely mirrors SLiM’s MutationType class.
The main thing that mutation types carry is a way of drawing a selection coefficient for each new mutation. This
distribution_type
should be one of (see the SLiM manual for more information on these):f
: fixed, one parameter (the selection coefficient)e
: exponential, one parameter (mean)g
: gamma, two parameters (mean, shape)n
: normal, two parameters (mean, SD)w
: Weibull, two parameters (scale, shape)u
: Uniform, two parameters (min, max)lp
: positive logNormal, two parameters (mean and sd on log scale; see rlnorm)ln
: negative logNormal, two parameters (mean and sd on log scale; see rlnorm)
Type “lp” is always positive, and type “ln” is always negative: both use the same log-normal distribution, but “ln” is multiplied by -1. For exponential and gamma, a negative mean can be provided, obtaining always negative values.
Instead of a single dominance coefficient (which would be specified with dominance_coeff), a discretized relationship between dominance and selection coefficient can be implemented: if dominance_coeff_list is provided, mutations with selection coefficient s for which dominance_coeff_breaks[k-1] <= s <= dominance_coeff_breaks[k] will have dominance coefficient dominance_coeff[k]. In other words, the first entry of dominance_coeff_list applies to any mutations with selection coefficient below the first entry of dominance_coeff_breaks; the second entry of dominance_coeff_list applies to mutations with selection coefficient between the first and second entries of dominance_coeff_breaks, and so forth. The list of breaks must therefore be of length one less than the list of dominance coefficients.
- Variables:
distribution_type (list) – A one-letter abbreviation for the distribution of fitness effects that each new mutation of this type draws from (see below).
distribution_args – Arguments for the distribution type.
dominance_coeff (float) – The dominance coefficient (negative = underdominance, 0 = recessive, 0.5 = additive, 1.0 = completely dominant, > 1.0 = overdominant) Default: 0.5.
convert_to_substitution (bool) – Whether to retain any fixed mutations in the simulation: if not, we cannot ask about their frequency once fixed. (Either way, they will remain in the tree sequence).
dominance_coeff_list (list of floats) – Either None (the default) or a list of floats describing a list of dominance coefficients, to apply to different selection coefficients (see details). Cannot be specified along with dominance_coeff.
dominance_coeff_breaks (list of floats) – Either None (the default) or a list of floats describing the intervals of selection coefficient over which each of the entries of dominance_coeff_list applies (see details). Must be of length one shorter than dominance_coeff_list.
- property is_neutral
Tests whether the mutation type is strictly neutral. This is defined here to be of type “f” and with fitness effect 0.0, and so excludes other situations that also produce only neutral mutations (e.g., exponential with mean 0).
Selection and sweeps
ExtendedEvent
and subclasses may be used to condition on sequences of
events at particular loci using the SLiM engine, by passing lists of events to
the extended_events argument in Engine.simulate. These are intended for
internal use only, as they may change in the future. However, a stable API is
provided to construct the necessary events for selective sweeps.
- stdpopsim.selective_sweep(single_site_id, population, mutation_generation_ago, selection_coeff, dominance_coeff=0.5, start_generation_ago=None, end_generation_ago=None, min_freq_at_start=None, min_freq_at_end=None, globally_adaptive=True)[source]
Creates a list of extended events corresponding to a selective sweep at a single site (see
Contig.add_single_site()
) that may subsequently be passed to the simulation engine. The sequence of events is:In generation mutation_generation_ago, a neutral mutation is introduced into population.
In generation start_generation_ago, the mutation becomes beneficial (whether globally or locally is controlled by the globally_adaptive option). If the frequency of the mutation in population is below min_freq_at_start in this generation, the simulation will restart from mutation_generation_ago.
In the generation after end_generation_ago, the mutation reverts to neutrality in population. If the frequency of the mutation in population is below min_freq_at_end in this generation, the simulation will restart from mutation_generation_ago.
During the sweep, the fitnesses of individuals that are homozygous or heterozygous for the beneficial mutation are 1 + selection_coeff and 1 + selection_coeff * dominance_coeff, respectively. If globally_adaptive is True (the default), then all individuals carrying the mutation are impacted regardless of the population they belong to. Otherwise, the mutation only influences fitness within population.
If the mutation is lost in population at any point between mutation_generation_ago and end_generation_ago, then the simulation will restart from mutation_generation_ago. The effect of “restarting” (aka “rejection sampling”) is that the resulting simulations are conditioned on the mutation being extant after time mutation_generation_ago with frequency above min_freq_at_start at start_generation_ago and min_freq_at_end at end_generation_ago.
Because the simulation will restart until the allele frequency conditions are satisfied, care must be taken to ensure that these are reasonable. For example, if start_generation_ago is very close to mutation_generation_ago and min_freq_at_start is substantially greater than zero, the vast majority of proposed allele frequency trajectories will be rejected and the simulation may take an infeasibly long time to complete.
- Parameters:
single_site_id (str) – The string ID of the single site at which to introduce the mutation, see
Contig.add_single_site()
.population (int) – The name of the population in which the mutation is introduced and the sweep occurs.
mutation_generation_ago (float) – The number of generations in the past at which to introduce the mutation.
selection_coeff (float) – The selection coefficient of the beneficial mutation during the sweep. Must be greater than or equal to zero.
dominance_coeff (float) – The dominance coefficient of the beneficial mutation during the sweep. Defaults to 0.5.
start_generation_ago (float) – The number of generations in the past at which the sweep starts and the mutation switches from neutral to beneficial. Defaults to mutation_generation_ago (i.e. a “hard” sweep).
end_generation_ago (float) – The number of generations in the past at which the sweep ends and the mutation switches from beneficial to neutral. Defaults to 0 (i.e. the end of the simulation).
min_freq_at_start (float) – The minimum allowed frequency for the mutation in population at the start of the sweep (optional).
min_freq_at_end (float) – The minimum allowed frequency for the mutation in population at the end of the sweep (optional).
globally_adaptive (bool) – If True (default), the mutation is beneficial in all populations. If False, the mutation is beneficial only in the population of origin.
Generic models
The Catalog contains simulation models from the literature that are defined for particular species. It is also useful to be able to simulate more generic models, which are documented here. Please see the Running a generic model for examples of using these models.
- class stdpopsim.PiecewiseConstantSize(N0, *args)[source]
Class representing a generic simulation model that can be run to output a tree sequence. This is a piecewise constant size model, which allows for instantaneous population size change over multiple epochs in a single population.
- Parameters:
N0 (float) – The initial effective population size
args – Each subsequent argument is a tuple (t, N) which gives the time at which the size change takes place and the population size.
The usage is best illustrated by an example:
model1 = stdpopsim.PiecewiseConstantSize(N0, (t1, N1)) # One change model2 = stdpopsim.PiecewiseConstantSize(N0, (t1, N1), (t2, N2)) # Two changes
- class stdpopsim.IsolationWithMigration(NA, N1, N2, T, M12, M21)[source]
Class representing a generic simulation model that can be run to output a tree sequence. A generic isolation with migration model where a single ancestral population of size NA splits into two populations of constant size N1 and N2 time T generations ago, with migration rates M12 and M21 between the split populations. Sampling is disallowed in population index 0, as this is the ancestral population.
- Parameters:
NA (float) – The initial ancestral effective population size
N1 (float) – The effective population size of population 1
N2 (float) – The effective population size of population 2
T (float) – Time of split between populations 1 and 2 (in generations)
M12 (float) – Migration rate from population 1 to 2
M21 (float) – Migration rate from population 2 to 1
Example usage:
model1 = stdpopsim.IsolationWithMigration(NA, N1, N2, T, M12, M21)
Simulation Engines
Support for additional simulation engines can be implemented by subclassing
the abstract Engine
class, and registering an instance of the
subclass with register_engine()
.
These are usually not intended to be instantiated directly, but should be
accessed through the main entry point, get_engine()
.
- stdpopsim.get_default_engine()[source]
Returns the default simulation engine (msprime).
- Return type:
- stdpopsim.register_engine(engine)[source]
Registers the specified simulation engine.
- Parameters:
engine (
Engine
) – The simulation engine object to register.
- class stdpopsim.Engine[source]
Abstract class representing a simulation engine.
To implement a new simulation engine, one should inherit from this class. At a minimum, the
id
,description
andcitations
attributes must be set, and thesimulate()
andget_version()
methods must be implemented. See msprime example inengines.py
.- Variables:
- simulate(demographic_model, contig, samples, *, seed=None, dry_run=False)[source]
Simulates the model for the specified contig and samples.
demographic_model
,contig
, andsamples
must be specified.- Parameters:
demographic_model (
DemographicModel
) – The demographic model to simulate.contig (
Contig
) – The contig, defining the length, mutation rate, and recombination rate(s).samples (dict containing number of individuals per population) – The samples to be obtained from the simulation.
seed (int) – The seed for the random number generator.
dry_run (bool) – If True, the simulation engine will return None without running the simulation.
- Returns:
A succinct tree sequence.
- Return type:
tskit.trees.TreeSequence
or None
- class stdpopsim.engines._MsprimeEngine[source]
Bases:
Engine
- description = 'Msprime coalescent simulator'
- id = 'msprime'
- simulate(demographic_model, contig, samples, *, seed=None, msprime_model=None, msprime_change_model=None, dry_run=False, **kwargs)[source]
Simulate the demographic model using msprime. See
Engine.simulate()
for definitions of parameters defined for all engines.- Parameters:
msprime_model (str) – The msprime simulation model to be used. One of
hudson
,dtwf
,smc
, orsmc_prime
. See msprime API documentation for details.msprime_change_model (list of (float, str) tuples) – A list of (time, model) tuples, which changes the simulation model to the new model at the time specified.
dry_run (bool) – If True,
end_time=0
is passed tomsprime.simulate()
to initialise the simulation and then immediately return.**kwargs – Further arguments passed to
msprime.sim_ancestry()
- class stdpopsim.slim_engine._SLiMEngine[source]
Bases:
Engine
- description = 'SLiM forward-time Wright-Fisher simulator'
- id = 'slim'
- recap_and_rescale(ts, demographic_model, contig, samples, extended_events=None, slim_scaling_factor=1.0, seed=None, keep_mutation_ids_as_alleles=False, **kwargs)[source]
Apply post-SLiM transformations to
ts
. This rescales node times, does recapitation, simplification, and adds neutral mutations.If the SLiM engine was used to output a SLiM script, and the script was run outside of stdpopsim, this function can be used to transform the SLiM tree sequence following the procedure that would have been used if stdpopsim had run SLiM itself. The parameters after
ts
have the same meaning as forsimulate()
, and the values fordemographic_model
,contig
,samples
, andslim_scaling_factor
should match those that were used to generate the SLiM script withsimulate()
.- Parameters:
ts (
tskit.TreeSequence
) – The tree sequence output by SLiM.
Warning
The
recap_and_rescale()
function is provided in the hope that it will be useful. But as we can’t anticipate what changes you’ll make to the SLiM code before using it, the stdpopsim source code should be consulted to determine if the behaviour is appropriate for your case.
- simulate(demographic_model, contig, samples, *, seed=None, extended_events=None, slim_path=None, slim_script=False, slim_scaling_factor=1.0, slim_burn_in=10.0, dry_run=False, verbosity=None, logfile=None, logfile_interval=100, keep_mutation_ids_as_alleles=False, _recap_and_rescale=True)[source]
Simulate the demographic model using SLiM. See
Engine.simulate()
for definitions of thedemographic_model
,contig
, andsamples
parameters.- Parameters:
seed (int) – The seed for the random number generator.
extended_events (list) – A list of
ExtendedEvents
to be passed to SLiM, e.g. produced byselective_sweep()
.slim_path (str) – The full path to the slim executable, or the name of a command in the current PATH.
slim_script (bool) – If true, the simulation will not be executed. Instead the generated SLiM script will be printed to stdout.
slim_scaling_factor (float) – Rescale model parameters by the given value, to speed up simulation. Population sizes and generation times are divided by this factor, whereas the mutation rate, recombination rate, and growth rates are multiplied by the factor. See SLiM manual: 5.5 Rescaling population sizes to improve simulation performance.
slim_burn_in (float) – Length of the burn-in phase, in units of N generations.
dry_run (bool) – If True, run the setup and then end the simulation.
logfile – Name of file to write a log of summary statistics (currently, only mean and SD of fitness values per population). Defaults to None, meaning “do not log”.
logfile_interval – How often to write to the log file, in generations.
keep_mutation_ids_as_alleles (bool) – If true, alleles will be coded by integer mutation ids (assigned by SLiM) rather than by randomly-generated nucleotides.