Development
stdpopsim
is a community effort, and we welcome YOU to join us!
We envision at least three main types of stdpopsim
developers:
Contributors of new species, demographic models, and other features (such as recombination/genetic maps, annotations, DFE models)
API developers
Documentation and tutorial curators
Contributors add simulation code for new or existing species in the catalog.
This is the main way we envision biologists to continually add
to the catalog of available models, and it is a great first step for new
contributors to learn the ins and outs of stdpopsim
development.
See the appropriate sections below:
API developers work on infrastructure development for the PopSim Consortium,
which could include improvements and additions to the internal code base of
stdpopsim
, establishment of benchmarking pipelines,
and new projects that align with consortium goals.
Documentation and tutorial curators help maintain the documentation and tutorials. This can be as easy as pointing out confusing bits of the documentation in a GitHub issue., or adding or editing the documentation. See the section Documentation.
Get into contact with the stdpopsim
community by subscribing to our
email list serve
and by creating and commenting on
Github issue.
There is a lot of chatter through
Github, and we’ve been building code
there cooperatively.
If you want to help out and don’t know where to start, you can look through the
list of
Good first issues
or
Help wanted issues
To get started helping with stdpopsim
development, please read the
following sections to learn how to contribute.
And, importantly, please have a look at our
code of conduct.
Installation
Before installing, be sure to make a fork of the repo and clone it locally following the instructions in the GitHub Workflow.
The stdpopsim
library requires Python 3.4 or later.
For pip
users, install the packages required for development using:
$ python3 -m pip install -r requirements/development.txt
You can then install the development version of stdpopsim
like this:
$ python3 setup.py install
For conda
users, you will need to add the conda-forge channel to your conda
environment and then should be able to install the development requirements using:
$ conda config --add channels conda-forge
$ conda install --file=requirements/development.txt
We do require msprime
, so please see the the installation notes if you
encounter problems with it.
Note
If you have trouble installing any of the requirements, your pip
may be the wrong version.
Try pip3 install -r requirements/development.txt
Using a Virtual Environment
We encourage the use of a virtual environment.
For pip
, you can use venv
.
First, create the virtual environment (You only need to do this once):
$ python3 -m venv stdpopsim_env
Next, activate the virtual environment:
$ source stdpopsim_env/bin/activate
You will then see the virtual environment in your prompt. Like so:
(stdpopsim_env) $
Once the virtual environment is activated, install the requirements:
(stdpopsim_env) $ python3 -m pip install -r requirements/development.txt
You can then run any of the code in the virtual environment with the packages installed, without conflicting with other packages in your local environment. To deactivate the virtual environment:
(stdpopsim_env) $ deactivate
GitHub workflow
Make your own fork of the
stdpopsim
repository on GitHub, and clone a local copy.Install the pre-commit hooks with:
$ pre-commit install
Make sure that your local repository has been configured with an upstream remote.
Create a “topic branch” to work on. One reliable way to do it is to follow this recipe:
$ git fetch upstream $ git checkout upstream/main $ git checkout -b topic_branch_nameAs you work on your topic branch you can add commits to it. Once you’re ready to share this, you can then open a pull request. Your PR will be reviewed by some of the maintainers, who may ask you to make changes.
If your topic branch has been around for a long time and has gotten out of date with the main repository, we might ask you to rebase. Please see the next section on how to rebase.
Pre-commit checks
On each commit a pre-commit hook will run
that checks for violations of code style and other common problems.
Where possible, these hooks will try to fix any problems that they find (including reformatting
your code to conform to the required style). In this case, the commit
will not complete and report that “files were modified by this hook”.
To include the changes that the hooks made, git add
any
files that were modified and run git commit
(or, use git commit -a
to commit all changed files.)
If you would like to run the checks without committing, use pre-commit run
(but, note that this will only check changes that have been staged;
do pre-commit run --all
to check unstaged changes as well).
To bypass the checks (to save or get feedback on work-in-progress) use
git commit --no-verify
Rebasing
Rebasing is used for two basic tasks we might ask for during review:
Your topic branch has gotten out of date with the tip of
upstream/main
and needs to be updated.Your topic branch has lots of messy commits, which need to be cleaned up by “squashing”.
Rebasing in git
basically means changing where your branch forked off the main code
in upstream/main
. A good way of visualising what’s happening is to
look at the Network view on
GitHub. This shows you all the forks and branches that GitHub knows about
and how they relate to the main repository. Rebasing lets you change where
your branch splits off.
To see this for your local repo on your computer, you can look at the Git graph output via the command line:
$ git log --decorate --oneline --graph
This will show something like:
|* 923ab2e Merge pull request #9 from mcveanlab/docs-initial
|\
| * 0190a92 (origin/docs-initial, docs-initial) First pass at development docs.
| * 2a5fc09 Initial outline for docs.
| * 1ccb970 Initial addition of docs infrastructure.
|/
* c49601f Merge pull request #8 from mcveanlab/better-genomes
|\
| * fab9310 (origin/better-genomes, better-genomes) Added pongo tests.
| * 62c9560 Tidied up example.
| * 51e21e8 Added basic tests for population models.
| * 6fff557 Split genetic_maps into own module.
| * 90d6367 Added Genome concept.
| * e2aaf95 Changed debug to info for logging on download.
| * 2fbdfdc Added badges for CircleCI and CodeCov.
|/
* c66b575 Merge pull request #5 from mcveanlab/tests-ci
|\
| * 3ae454f (origin/tests-ci, tests-ci) Initial circle CI config.
| * c39415a Added basic tests for genetic map downloads.
|/
* dd47000 Merge pull request #3 from mcveanlab/recomb-map-infrastructure
|\
This shows a nice, linear git history: we can see four pull requests, each of
which consists of a small number of meaningful commits. This is the ideal that
we’re aiming for, and git allows us to achieve it by rewriting history as
much as we want within our own forks (we never rewrite history in the
upstream
repository, as this would cause problems for other developers).
Having a clean, linear git history is a good idea for lots of reasons, not
least of which is making git bisect
easier.
One of the most useful things that we can do with rebasing is to “squash” commits
so that we remove some noise from the git history. For example, this PR
(on the branch topic_branch_name
) currently looks like:
$ git log --decorate --oneline --graph
* 97a9458 (HEAD -> topic_branch_name) DONE!!!
* c9c4a28 PLEASE work, CI!
* ad4c807 Please work, CI!
* 0fe6dc4 Please work, CI!
* 520e6ac Add documentation for rebasing.
* 20fb835 (upstream/main) Merge pull request #22 from mcveanlab/port-tennyson
|\
| * b3d45ea (origin/port-tennyson, port-tennyson) Quickly port Tennesen et al model.
|/
* 79d26b4 Merge pull request #20 from andrewkern/fly_model
|\
Here, in my initial commit (520e6ac) I’ve added some updated documentation for rebasing. Then, there’s four more commits where I’m trying to get CI pass. History doesn’t need to know about this, so I can rewrite it using rebase:
$ git fetch upstream
$ git rebase -i upstream/main
We first make sure that we’re rebasing against the most recent version of the
upstream repo. Then, we ask git to perform an interactive rebase against
the upstream/main
branch. This starts up your editor, showing something
like this:
pick 520e6ac Add documentation for rebasing.
pick 0fe6dc4 Please work, CI!
pick ad4c807 Please work, CI!
pick c9c4a28 PLEASE work, CI!
pick 97a9458 DONE!!!
# Rebase 20fb835..97a9458 onto 20fb835 (5 commands)
#
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
# d, drop = remove commit
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out
We want git to squash the last five commits, so we edit the rebase instructions to look like:
pick 520e6ac Add documentation for rebasing.
s 0fe6dc4 Please work, CI!
s ad4c807 Please work, CI!
s c9c4a28 PLEASE work, CI!
s 97a9458 DONE!!!
# Rebase 20fb835..97a9458 onto 20fb835 (5 commands)
#
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
# d, drop = remove commit
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out
After performing these edits, we then save and close. Git will try to do the rebasing, and if successful will open another editor screen that lets you edit the text of the commit message:
# This is a combination of 5 commits.
# This is the 1st commit message:
Add documentation for rebasing.
# This is the commit message #2:
Please work, CI!
# This is the commit message #3:
Please work, CI!
# This is the commit message #4:
PLEASE work, CI!
# This is the commit message #5:
DONE!!!
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# Date: Tue Mar 5 17:00:39 2019 +0000
#
# interactive rebase in progress; onto 20fb835
# Last commands done (5 commands done):
# squash c9c4a28 PLEASE work, CI!
# squash 97a9458 DONE!!!
# No commands remaining.
# You are currently rebasing branch 'topic_branch_name' on '20fb835'.
#
# Changes to be committed:
# modified: docs/development.rst
#
#
We don’t care about the commit messages for the squashed commits, so we delete them and end up with:
Add documentation for rebasing.
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
#
# Date: Tue Mar 5 17:00:39 2019 +0000
#
# interactive rebase in progress; onto 20fb835
# Last commands done (5 commands done):
# squash c9c4a28 PLEASE work, CI!
# squash 97a9458 DONE!!!
# No commands remaining.
# You are currently rebasing branch 'topic_branch_name' on '20fb835'.
#
# Changes to be committed:
# modified: docs/development.rst
After saving and closing this editor session, we then get something like:
[detached HEAD 6b8a2a5] Add documentation for rebasing.
Date: Tue Mar 5 17:00:39 2019 +0000
1 file changed, 2 insertions(+), 2 deletions(-)
Successfully rebased and updated refs/heads/topic_branch_name.
Finally, after a successful rebase, you must force-push! If you try to
push without specifying -f
, you will get a very confusing and misleading
message:
$ git push origin topic_branch_name
To github.com:jeromekelleher/stdpopsim.git
! [rejected] topic_branch_name -> topic_branch_name (non-fast-forward)
error: failed to push some refs to 'git@github.com:jeromekelleher/stdpopsim.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. Integrate the remote changes (e.g.
hint: 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
DO NOT LISTEN TO GIT IN THIS CASE! Git is giving you terrible advice
which will mess up your branch. What we need to do is replace the state of
the branch topic_branch_name
on your fork on GitHub (the upstream
remote)
with the state of your local branch, topic_branch_name
. We do this
by “force-pushing”:
$ git push -f origin topic_branch_name
Counting objects: 4, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 4.33 KiB | 1.44 MiB/s, done.
Total 4 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To github.com:jeromekelleher/stdpopsim.git
+ 6b8a2a5...d033ffa topic_branch_name -> topic_branch_name (forced update)
Success! We can check the history again to see if everything looks OK:
$ git log --decorate --oneline --graph
* d033ffa (HEAD -> topic_branch_name, origin/topic_branch_name) Add documentation for rebasing.
* 20fb835 (upstream/main) Merge pull request #22 from mcveanlab/port-tennyson
|\
| * b3d45ea (origin/port-tennyson, port-tennyson) Quickly port Tennesen et al model.
|/
* 79d26b4 Merge pull request #20 from andrewkern/fly_model
|
This looks just right: we have one commit, pointing to the head of upstream/main
and have successfully squashed and rebased.
When rebasing goes wrong
Sometimes rebasing goes wrong, and you end up in a frustrating loop of making
and undoing the same changes over and over again. First, here’s an explanation
of what’s going on. Let’s say that the branch we’re working on (and trying to
rebase) is called topic_branch
, and it branched off from upstream/main
at some point in the past:
A1---A2---A3 (topic_branch)
/
---M---o---o---o---o---B (upstream/main)
So, what we’d really like to do is to take the commits A1
, A2
, and
A3
and apply them to the current state of the upstream/main
branch,
i.e., on top of commit B
. If we just do git rebase upstream/main
then git will try to first apply A1
; then A2
; and finally A3
.
If there’s conflicts, this is painful, so we might want to first squash
the three commits together into one commit, and then rebase that single commit.
Then we’ll only have to resolve conflicts once. Said another way: we often
use git rebase -i upstream/main
to both squash and rebase; but
it may be easier to squash first then rebase after.
We’ll be doing irreversible changes, so first we should make a backup copy of the branch:
$ git checkout topic_branch # make sure we're on the right branch
$ git checkout -b topic_backup # make the backup
$ git checkout topic_branch # go back to the topic branch
Next, we take the diff between the current state of the files and the place
where your changes last diverged from upstream/main
(the commit labelled
M
in the diagram above), and save it as a patch. To do this, make sure
you are in the root of the git directory, and:
$ git diff --merge-base upstream/main > changes.patch
After that, we can check out a fresh branch and check if everything works as it’s supposed to:
$ git checkout -b test_branch upstream/main
$ patch -p1 < changes.patch
$ git commit -a
# check things work
After we’ve verified that everything works, we then checkout the original
topic branch and replace it with the state of the test_branch
, and
finally force-push to the remote topic branch on your fork:
$ git checkout topic_branch
$ git reset --hard test_branch
$ git push -f origin topic_branch
Hard resetting and force pushing are not reversible operations, so please beware! After you’ve done this, you can go make sure nothing bad happened by checking that the only changes listed under “files changed” in the github pull request are changes that you have made. For more on finding the fork point, with diagrams, and an alternative workflow, see the git docs.
Adding a new species
Which information do I need to have for my species?
In stdpopsim
, we aim to be inclusive and facilitate adding a diverse range of species.
That said, there are certain basic requirements we have
for every species added to the Catalog.
We specify these requirements below.
If you are unsure whether your species satisfies these baseline requirements,
but you still think it will be useful to add it to stdpopsim
,
then we encourage you to open an issue
on the GitHub repository to discuss this.
Others researchers in the community may be able to help you fill in the missing details
or find other solutions.
Every species added to stdpopsim
should have the following information available:
A chromosome-level genome assembly
Mutation rate (per generation)
Recombination rate (per generation)
A characteristic population size
An average generation time
Of course, many species do not have precise estimates of each of these (e.g., mutation rates are usually not known). So, in practice we often have to use approximate estimates. We provide below a set of guidelines for each of the five components, with a brief discussion of possible courses of action to take when components have incomplete information.
The genome assembly should consist of a list of chromosomes or scaffolds and their lengths. Having a good quality assembly with complete chromosomes, or at least very long scaffolds, is essential for chromosome-level simulations produced by
stdpopsim
. Thus, currently,stdpopsim
only supports adding species with near-complete chromosome-level genome assemblies (i.e., close to one contig per chromosome).An average mutation rate should be specified for each chromosome (per generation per bp). The mutation rate estimate can be based on sequence data from pedigrees, mutation accumulation studies, or comparative genomic analysis calibrated by fossil data (i.e., phylogenetic estimates). If there is no information on the variation of mutation rates across chromosomes, the average genome-wide mutation rate can be specified for all chromosomes. Finally, if your species of interest does not have direct estimates of mutation rates, we recommend using estimates for some other species (hopefully closely related).
An average recombination rate should be specified for each chromosome (per generation per bp). Ideally, one would want to specify a fine-scale chromosome-level genetic map, since the recombination rate is known to vary widely across and along chromosomes. If a genetic map exists for your species, you may specify it separately (see Adding a recombination/genetic map or annotation). Nonetheless, you should also specify a default (average) recombination rate for each chromosome. As with mutation rates, if there is no information on the variation of recombination rates across chromosomes, the average genome-wide recombination rate can be specified for all chromosomes. Furthermore, if your species of interest does not have direct estimates of recombination rates, we recommend using estimates for some other species (hopefully closely related).
The effective population size should represent the historical average effective population size, and should ideally produce simulated data that matches the average observed genetic diversity in that species. Population size is defined as the number of individuals, regardless of ploidy. However, this will often not capture features of genetic variation that are caused by recent changes in population size and the presence of population structure. To capture those, one should also provide a demographic model (or multiple models) for the species (see Adding a new demographic model).
The average generation time is an important part of the species’ natural history, but its value does not directly affect the simulation, since the
SLiM
andmsprime
simulation engines operate in time units of generations. Thus, the average generation time is only currently used to convert time units to years, which is useful when comparing different demographic models.
All values used in the species model should be based on current knowledge for a typical population
in that species, as represented in the literature.
Before you add your species to stdpopsim
, see that you can collect the values
mentioned above from the literature.
You will later need to specify these citations in your code files
(see Coding the species parameters).
If you are unsure whether your species of interest satisfies the base requirements above
(such as a near-complete genome assembly), or have questions about how to set some parameters,
feel free to open an issue
on the GitHub repository to get assistance from other members of the stdpopsim
community.
Getting set up to add a new species
If this is your first time adding a species to stdpopsim
, it’s a good
idea to take some time browsing the Catalog
to see how existing species are typically specified and documented. If you have
any questions or confusion about the required code, please
don’t hesitate to
open a new issue.
We’re more than happy to answer any questions and help get you up and running.
Before you add any code, be sure to have forked the stdpopsim
repository
and cloned it locally, following the instructions in the GitHub Workflow section.
After you collected the relevant parameters from the literature (see list above),
the first step is to create a new subdirectory devoted to the new species,
which you should name using the six-character species identifier
(see Naming conventions for more details).
All code associated with simulation of this species should go into this directory,
unless explicitly specified otherwise
(code for documentation and testing is written in other directories).
For example, the simulation code for D. melanogaster resides in directory
stdpopsim/catalog/DroMel/
in the repository.
Once the species directory is set up, you may use the maintenance
utility
of stdpopsim
to generate template files where you can enter
all relevant information for your species.
The maintenance
utility downloads useful information on a genome build published
in Ensembl,
and uses it to generate initial versions of the required source files.
A partial list of the
genomes housed on Ensembl can be found here.
To use this utility, execute the maintenance
command with the Ensembl species ID;
replace spaces in the Ensembl ID with _
characters.
For example, the template files for A. thaliana were generated by executing this command:
$ python -m maintenance add-species arabidopsis_thaliana
The maintenance
utility generates three new files inside the species directory
(stdpopsim/catalog/<SPECIES_ID>/
):
__init__.py
: a script that loads all the relevant libraries for your species. It should be edited only when you add components to your species, such as demographic models, genetic maps, or DFE models.genome_data.py
: a file that contains information on the physical map of the genome. This file is generated automatically by themaintenance
utility with a data dictionary which has slots for the assembly accession number, the assembly name, and a dictionary representing the chromosome names and their associated lengths. If synonyms are defined (e.g., chr2L for 2L) then those are given in the list that follows. You should double-check the downloaded values, but there is probably no reason to edit this file after it has been generated by themaintenance
utility.species.py
: a file containing information about the species’ mutation and recombination rates, effective population size, and the average generation time, along with all accompanying citations (see details in Which information do I need to have for my species?). The following section provides detailed instructions on how to code information in this file, including some specific examples.
Note
The maintenance
utility also generates test code for your species in
the file tests/test_<SPECIES_ID>.py
.
This is used later for your local tests and in the review process
(see Testing your species model and submitting a PR
and Implementing tests for the review of new species).
Note
If your species of interested does not have a published genome in Ensembl, you may manually create and edit the three files described above. Try to follow an example from the catalog that was downloaded from Ensembl to maintain a consistent format.
Coding the species parameters
Information about a species’ mutation and recombination rates,
effective population size, and the average generation time,
is all summarized in the species.py
file,
along with all accompanying citations
(see details in Which information do I need to have for my species?).
The initial version of the file generated by the maintenance
utility
contains commented instructions to help you figure out where everything goes.
Essentially, the information in this file is recorded in two main objects: _genome
and _species
.
The _genome
object contains chromosome-level information, such as
chromosome ids, lengths, mutation and recombination rates, and ploidy.
The _species
object contains the remaining information about the species,
including its full name, abbreviated name, id, effective population size
and average generation time.
Each value specified in these two object should be accompanied by a
stdpopsim.Citation
object indicating the publication from which it was derived.
Each stdpopsim.Citation
object is initialized with the following information:
author (string): abbreviated author list in a single string, such as “1000GenomesConsortium” or “Huber et al.”.
year (int): year of publication.
doi (string): a URL for the doi.org webpage of the publication.
reasons (list of
stdpopsim.CiteReason
): possible reasons to include a citation inspecies.py
are:stdpopsim.CiteReason.ASSEMBLY
stdpopsim.CiteReason.REC_RATE
stdpopsim.CiteReason.MUT_RATE
stdpopsim.CiteReason.POP_SIZE
stdpopsim.CiteReason.GEN_TIME
To demonstrate how the _genome
and _species
objects are set,
we provide below a detailed example for A. thaliana
(see also stdpopsim/catalog/AraTha/species.py
).
We start by defining auxiliary objects that specify the recombination rate, mutation rate, and ploidy for each chromosome. In the case of A. Thaliana, these objects are defined to associate the mitochondrial and plastid genomes (chromsoomes Mt and Pt) with ploidy of 1 and recombination rate of 0. All other chromosomes are associated with a ploidy of 2 and the genome-wide average recombination rate. The genome-wide mutation rate is associated with all chromosomes.
# genome-wide recombination rate from Huber et al 2014 MBE
# associated with all recombining chromosomes
_rho = 200 / 1e6 # 200/Mb
_Ne = 124000
_mean_recombination_rate = _rho / (2 * _Ne)
_recombination_rate = {str(j): _mean_recombination_rate for j in range(1, 6)}
_recombination_rate["Mt"] = 0
_recombination_rate["Pt"] = 0
# genome-wide average mutation rate from Ossowski 2010 Science
# associated with all chromosomes
_mean_mutation_rate = 7e-9
_mutation_rate = {str(j): _mean_mutation_rate for j in range(1, 6)}
_mutation_rate["Mt"] = _mean_mutation_rate
_mutation_rate["Pt"] = _mean_mutation_rate
# species ploidy and chromosome-specific ploidy
_species_ploidy = 2
_ploidy = {str(j): _species_ploidy for j in range(1, 6)}
_ploidy["Mt"] = 1
_ploidy["Pt"] = 1
The _genome
object is then defined by calling the stdpopsim
function
stdpopsim.Genome.from_data
.
This functions generates the genome object based on information from the
data
object defined in the genome_data.py
file,
the auxiliary objects defined above,
and a list of stdpopsim.Citation
objects.
_genome = stdpopsim.Genome.from_data(
genome_data.data,
recombination_rate=_recombination_rate,
mutation_rate=_mutation_rate,
ploidy=_ploidy,
citations=[
stdpopsim.Citation(
author="Ossowski et al.",
year=2010,
doi="https://doi.org/10.1126/science.1180677",
reasons={stdpopsim.CiteReason.MUT_RATE},
),
stdpopsim.Citation(
author="Huber et al.",
year=2014,
doi="https://doi.org/10.1093/molbev/msu247",
reasons={stdpopsim.CiteReason.REC_RATE},
),
stdpopsim.Citation(
doi="https://doi.org/10.1093/nar/gkm965",
year=2007,
author="Swarbreck et al.",
reasons={stdpopsim.CiteReason.ASSEMBLY},
),
],
)
The _species
object contains a reference to the _genome
object and
the remaining information about the species,
including the effective population size and average generation time,
accompanied by the appropriate stdpopsim.Citation
objects.
_species = stdpopsim.Species(
id="AraTha",
ensembl_id="arabidopsis_thaliana",
name="Arabidopsis thaliana",
common_name="A. thaliana",
genome=_genome,
generation_time=1.0,
population_size=10_000,
ploidy=_species_ploidy,
citations=[
stdpopsim.Citation(
doi="https://doi.org/10.1890/0012-9658(2002)083[1006:GTINSO]2.0.CO;2",
year=2002,
author="Donohue",
reasons={stdpopsim.CiteReason.GEN_TIME},
),
stdpopsim.Citation(
doi="https://doi.org/10.1016/j.cell.2016.05.063",
year=2016,
author="1001GenomesConsortium",
reasons={stdpopsim.CiteReason.POP_SIZE},
),
],
)
Once these two objects (_genome
and _species
) are specified in the species.py
file,
you should be able to load and simulate the newly added species using stdpopsim
.
Testing your species model and submitting a PR
The maintenance
utility that generated the three species template files
in the species directory (stdpopsim/catalog/<SPECIES_ID>/
)
also generates test code for the species in a separate file, tests/test_<SPECIES_ID>.py
.
The tests in this file are executed as follows
(where <SPECIES_ID>
is the six-character species id):
$ python -m pytest tests/test_<SPECIES_ID>.py
The tests already implemented in this file when it is generated
check for basic formatting and missing information.
For example, there is a test checking that the citation year is of type int
rather than string (e.g. 2004 and not “2014”).
Other tests in this file are generated by the maintenance
utility
as blank and disabled.
These tests should not be filled out by the person who writes the code in
the species.py
file,
but rather by someone else, as part of the review process (see below).
Once your code passes the basic tests implemented in the automatically generated
version of the test file,
you should submit a pull request (PR) with your changes to the catalog.
See the GitHub workflow for more details about this process.
At this point, most of your work is done.
You have officially joined the stdpopsim
development team. Welcome!!
Your code still needs to undergo review by another member (or members)
of the development team before it is fully incorporated into stdpopsim
.
This will likely require additional feedback from you,
so, stay tuned for discussion during the review process.
Overview of the stdpopsim review process
We provide here a general outline for the review process we use in stdpopsim
,
including guidelines for how to settle discrepancies that are found during review
(see Step 6 below).
The seven steps described below should be followed whenever a new species is added,
or when components such as demographic models are added to a species
already in the catalog.
After the original contributor submitted a PR with their new code, the code is checked by one of the core maintainers of
stdpopsim
for basic problems or style issues. Once the code meets the basic standards, the maintainer merges the PR, and the newly added code is considered provisional.The original contributor then opens a new QC issue on GitHub to track the progress of the review. One simple way to do this is to use one of the template issues we provide. For example, the
Species QC issue template
should be used when adding a new species and theModel QC issue template
should be used when adding a new demographic model. Simply press Get started for the appropriate template, and fill in the required details. If you don’t find an appropriate template for your purpose, you should simply open a new blank issue and add the relevant details manually. Make sure to include information about the primary sources (citations) you used as well as other considerations you made in your code. The QC issue contains a checklist and all the items on this list should be checked off for the review process to complete.A different member of the
stdpopsim
community volunteers to review the newly added demographic model. If you volunteer to review a model, you should state your intention on the QC issue, so we don’t duplicate effort. Typically, there will be one reviewer assigned to every QC issue. However, sometimes multiple reviewers may wish to partition tasks between them. For examples, when reviewing a new species, one reviewer may wish to test the recombination rates, and another may wish to test the effective population size. Some aspects of the review, such as examining citations, involve checking the code of the original contributor. However, most of the review involves implementing tests based on the reviewer’s understanding of the source publications and additional documentation specified by the original contributor in the QC issue. Ideally, the code for these tests should be written by the reviewer without looking at the original contributor’s code. If the reviewer is uncertain about some aspects of the implementation, they can discuss this with the original contributor in the QC issue. Different types of tests are involved when you are reviewing a new species added tostdpopsim
or when you are reviewing a demographic model added to an existing species. See the appropriate sections below for specific instructions on how to implement the different tests. The reviewer should write the testing code on their own fork of the repository, as outlined in the GitHub workflow.After writing the appropriate code, the reviewer should execute it by running the Unit tests. The unit tests will produce error messages if inconsistencies are found between the original contributor’s implementation and the tests written by the reviewer.
Once the reviewer is confident in their tests, they should submit a PR with their test code. The reviewer may choose to do so even if some tests fail, to facilitate discussion with the original contributor (see Step 6 below).
If the tests written by the reviewer produce error messages, the differences between the implementation of the original contributor and the blind tests of the reviewer need to be resolved through discussion between the two of them. This discussion can take place either in the review PR submitted in Step 5, or in the QC issue opened in Step 2. Differences between the two implementations can indicate an error, but very often they are a result of different interpretations of the data presented in the source publications. For example, there might be different mutation rates estimated for a given species from two different groups of samples. The original contributor and reviewer should reach an agreement as to the best (or at least a reasonable) interpretation of the published data. If they cannot reach an agreement, then the discussion on GitHub should be opened to others in the community. It may also be useful to contact the authors of the original publication to resolve some of these ambiguities. After each difference is resolved, the final decision should be clearly noted in the discussion on GitHub, and the code should be modified accordingly. This could be either the code written by the original contributor or the test code written by the reviewer (or both in some cases). Since at this point the review PR submitted in Step 5 is still open (not merged), then we recommend making the code changes using additional commits in this PR. In case the review process found different possible interpretations of the published data, the rationale behind the final (consensus) interpretation should be clearly specified in comments above the relevant block of code. This documentation will help future contributors in resolving ambiguities in similar cases.
Once the review PR submitted in Step 5 passes all unit tests, it is merged, and the QC issue opened in Step 2 is closed. The new code is now officially added to the
stdpopsim
catalog!
Implementing tests for the review of new species
The tests associated with the review of a new species
should be written by the reviewer in the tests/test_<SPECIES_ID>.py
file
as part of Step 3 of the review process described above.
Recall that this file was generated by the maintenance
utility, with most
of the tests disabled.
The reviewer should enable all the tests and implement them.
For example, the test for the recombination rates is initialized by the
maintenance
utility in the following form:
@pytest.mark.skip("Recombination rate QC not done yet")
@pytest.mark.parametrize(["name", "rate"], {}.items())
def test_recombination_rate(self, name, rate):
assert rate == pytest.approx(self.genome.get_chromosome(name).recombination_rate)
When writing the tests for the recombination rates, the reviewer should first
delete the @pytest.mark.skip
line to enable the test.
Then, they should specify inside the { }
a valid dictionary:
a list of key
:value
with the name and average
recombination rate for each chromosome.
We provide an example below from A. aegypti (see tests/test_AedAeg.py
):
@pytest.mark.parametrize(
["name", "rate"],
{"1": 0.306e-8, "2": 0.249e-8, "3": 0.291e-8, "MT": 0.0}.items(),
)
def test_recombination_rate(self, name, rate):
assert rate == pytest.approx(self.genome.get_chromosome(name).recombination_rate)
The tests can be executed by running the complete set of Unit tests,
or by invoking only the tests in tests/test_<SPECIES_ID>.py
, as follows:
$ python -m pytest tests/test_<SPECIES_ID>.py
The tests compare the values specified in the
test file to the values in the species.py
and genome_data.py
files,
and they produce error messages if differences are found.
Differences should be resolved using the general process outlined in
Step 6 of the Overview of the stdpopsim review process.
Adding a new demographic model
A demographic model describes ancestral population sizes, split times, and migration rates. Misspecification of the model can generate unrealistic patterns of genetic variation that will affect downstream analyses. So, having at least one detailed demographic model is recommended for every species. A given species might have more than one demographic model, fit from different data or by different methods or different assumptions/parameters.
What models are appropriate to add?
Any model added to stdpopsim
should be based the published literature
and a clear citation to the relevant paper(s) should be provided with the model.
The demographic model should include, at a minimum,
a single population with a series of population sizes changes.
Multi-population models typically include other demographic events,
such as population splits and changes in the amount of gene flow between populations.
The values of different parameters should be specified in units of “number of individuals”
(for population sizes) and generations (for times).
Sometimes, you will need to convert values published in the literature
to these units by making some assumptions on the mutation rate (sometimes even recombination rate);
typically the same assumptions made by the study that published the demographic model.
The stdpopsim
Catalog also contains a collection of generic models,
which are not associated with a certain species and are primarily used for development
and testing of demographic inference methods.
Due to their nature, the rationale for adding such models is different,
and they are also implemented in a slightly different way.
If you wish to contribute a new generic model,
then we suggest that you open a new issue
to discuss your suggestion with others in the community and decide on the best
way to implement your suggestion.
Getting set up to add a new demographic model
If this is your first time implementing a demographic model in stdpopsim
, it’s a good
idea to take some time browsing the Catalog
to see how existing demographic models are coded and documented.
If you have any questions or confusion about formatting or implementing demographic models, please
don’t hesitate to open a new issue.
We’re more than happy to answer any questions and help get you up and running.
Before you add any code, be sure to have forked the stdpopsim
repository
and cloned it locally, following the instructions in the GitHub Workflow section.
All code for a species’ demographic models is written in the demographic_models.py
file in that species directory stdpopsim/catalog/<SPECIES_ID>/
(where <SPECIES_ID>
is the six-character identifier of the species;
e.g., CanFam).
If the species does not currently have any demographic model,
then you should add this file to stdpopsim/catalog/<SPECIES_ID>/
,
with the following three lines of code:
import msprime
import stdpopsim
_species = stdpopsim.get_species("<SPECIES_ID>")
Furthermore, to ensure that the demographic model(s) are fully incorporated to the
species’ code base, you should add the following import to the __init__.py
file
in the species directory:
from . import demographic_models
Coding the demographic model
The demographic model should be coded in the demographic_models.py
file
by defining a specialized function, which essentially returns
a stdpopsim.DemographicModel
object initialized with the appropriate values.
This function should then be added to the _species
object using the add_demographic_model
function.
We provide below a template block of code for these two operations:
def _model_func_name():
return stdpopsim.DemographicModel(
id=...,
description=...,
long_description=...,
populations=...,
citations=...,
generation_time=...,
mutation_rate=...,
recombination_rate=...,
population_configurations=...,
migration_matrix=...,
demographic_events=...,
)
_species.add_demographic_model(_model_func_name())
A demographic model is thus defined using up to eleven different attributes. The first eight attributes are quite straightforward:
id
(string): A unique, short-hand identifier for this demographic model. This id contains a short description written in camel case, followed by an underscore, and then four characters: (1) a digit character specifying the number of sampled populations; (2) the first letter of the name of the first author of the publication; (3-4) and two digit characters specifying the year the study was published. For example, the “Out of Africa” demographic model for humans published by Gutenkunst et al. (2009) has theid
“OutOfAfrica_3G09”. See Naming conventions for more details.description
(string): A brief one-line description of the demographic model.long_description
(string): A more detailed textual description of the model (short paragraph).populations
: A list ofstdpopsim.Population
objects, which have their ownid
anddescription
. For example, the Thousand Genomes Project Yoruba panel could be defined asstdpopsim.Population(id="YRI", description="1000 Genomes YRI (Yorubans)")
.citations
: A list ofstdpopsim.Citation
objects for the publications from which this model was derived. The citation object requires author, year, and doi information, and a specified reason for citing this model (see Coding the species parameters). The reason associated with demographic model citations will typically bestdpopsim.CiteReason.DEM_MODEL
.generation_time
(double): The generation time for the species in years. The value of this parameter does not directly affect the simulation, since theSLiM
andmsprime
simulation engines operate in time units of generations. The generation time is only currently used to convert time units to years, which is useful when comparing among different demographic models.mutation_rate
(double): The mutation rate assumed during the inference of this demographic model (per bp per generation). Most demographic inference methods make some assumption about the average genome-wide mutation rate. These assumptions are sometimes “baked” into the methods, and in other cases are just used to convert parameter values from mutation-scale to physical scale (number of individuals for population size and generations for times). If you are confident that inference did not make any assumption about mutation rate, then set the mutation rate of the demographic model toNone
. However, note that this is quite uncommon, so you should make sure this is the case before you set the mutation rate toNone
.recombination_rate
(double): The recombination rate assumed during the inference of this demographic model, if any (per bp per generation). While demographic model inference might make less assumptions about recombination rates than mutation rates, we provide this option for completness. Namely, a demographic model might have been inferred under the assumption of a specific recombination rate, which does not match with the species’ recombination rate implemented instdpopsim
. Also, some demographic models were inferred under the assumptions of a specifc ratio of mutation to recombination rates.
The final three attributes
(population_configurations
, migration_matrix
, and demographic_events
)
describe the inferred demographic history that you wish to code.
This history consists of ancestral population size changes,
migration rates, split times, and admixture events.
Note that population size is defined as the number of individuals, regardless of ploidy.
These attributes should be coded using the standard format of msprime
.
If this is your first time specifying a demographic model using msprime
,
then we highly recommend that you take some time to read through its
documentation and tutorials.
Note
Most published demographic models provide a range of plausible values for each parameter of interest. In your coded model, you should use some reasonable point estimate, such as the value associated with the the maximum likelihood fit, or the mean of posterior distribution for Bayesian methods.
Adding a parameter table to the docs
The parameters used in the implementation of the demographic model should
also be specified in the docs in a file docs/parameter_tables/<SPECIES_ID>/<MODEL_ID>.csv
,
where <SPECIES_ID>
is the six-character species id,
and <MODEL_ID>
is the id
of the demographic model.
This provides a straightforward documentation and also helps in the review
process (see below).
Each line in this csv file should have the format:
Parameter Type (units), Value, Description
You may examine csv files currently in the docs/parameter_tables/
directory
for useful examples.
Once you completed the csv file,
you can check that the documentation was built properly by running
make
in the docs/
directory and opening the Catalog page in the
docs/_build/
directory.
See Documentation for more details.
Testing your demographic model and submitting a PR
Once you have written the demographic model function in the demographic_models.py
file,
you should test it locally using the development version of stdpopsim
.
First, make sure to install the development version of stdpopsim
and its requirements,
by following the Installation instructions.
Then, check that your new demographic model function has been imported
by executing the following Python code,
where <SPECIES_ID>
is the six-character species id (e.g., HomSap or AraTha):
import stdpopsim
species = stdpopsim.get_species("<SPECIES_ID>")
for x in species.demographic_models:
print(x.id)
This prints the identifiers (id
; see above) for all demographic models defined for the species.
You should make sure that the identifier of your newly added model is printed.
Note
If the identifier of your demographic model is not printed,
make sure that you included the call _species.add_demographic_model(_model_func_name())
for your newly defined function _model_func_name()
in the end of the demographic_models.py
file.
If you are still having trouble, check the GitHub issues or open a new issue to get help from others.
After you confirmed that your demographic model was added to the species code, you should check that you can successfully simulate it with the Python API. See Running stdpopsim with the Python interface (API) for more details. Finally, once everything looks okay, you should submit a pull request (PR) with your changes to the code. See the GitHub workflow for more details about this process.
At this point, most of your work is done.
You have officially joined the stdpopsim
development team. Welcome!!
Your model still needs to undergo review by another member (or members)
of the development team before it is fully incorporated into stdpopsim
.
This will likely require additional feedback from you,
so, stay tuned for discussion during the review process.
Implementing tests for the review of a demographic model
After a contributor submits a PR with a new demographic model,
the code undergoes seven steps of review before it
is officially added to stdpopsim
(see Overview of the stdpopsim review process).
In Step 3 of this process, the reviewer writes testing code for the newly
added demographic model.
This is done in file stdpopsim/qc/<SPECIES_ID>.py
(where <SPECIES_ID>
is the six-character identifier of the species).
If this is the first demographic model added for this species,
the reviewer should create this file and add an import
statement for the species to stdpopsim/qc/__init__.py
.
The code written by the reviewer in stdpopsim/qc/<SPECIES_ID>.py
should define a function that returns a
stdpopsim.DemographicModel
object, parallel to the function defined
by the original contributor of the demographic model (see Coding the demographic model).
After this function is defined, it should be registered as the QC function of the
original function by adding this bit of code to stdpopsim/qc/<SPECIES_ID>.py
:
_species.get_demographic_model(_MODEL_ID_).register_qc(_your_review_function())
Where _MODEL_ID_
is the string specified by the original contributor as the
id
of the demographic model, and _your_review_function()
is the function
implemented by the reviewer.
The original demographic model and its registered QC model are compared as part of
the stdpopsim
Unit tests.
Adding a recombination/genetic map or annotation
Note
Recombination map and genetic map are terms used to describe
maps of recombination rates that vary across and along chromosomes.
In the stdpopsim
code and documentation, we use the term
genetic map to refer specifically to a “crossing-over rate map” and
recombination map to refer to a “crossing-over and gene conversion rate map.””
See further details on this distinction.
Some species have sub-chromosomal genetic maps or genomic annotations available. These files are large enough that adding them directly to the package would quickly cause slow package installation and loading, so these files are downloaded as-needed from AWS and stored locally in a cache directory. The following documentation describes adding genetic maps; the procedure for annotations is similar (but see the important note below).
Genetic maps can be added to
stdpopsim by creating a new GeneticMap object and providing a formatted file
detailing recombination rates to a designated stdpopsim
maintainer who then uploads
it to AWS. If there is one for your species that you wish to include, create a space
delimited file with four columns: Chromosome, Position(bp), Rate(cM/Mb), and Map(cM).
Each chromosome should be placed in a separate file and with the chromosome id in the
file name in such a way that it can be programatically parsed out. IMPORTANT: chromosome
ids must match those provided in the genome definition exactly! Below is an example start
to a genetic map file (see here
for more details):
Chromosome Position(bp) Rate(cM/Mb) Map(cM)
chr1 32807 5.016134 0
chr1 488426 4.579949 0
Once you have the genetic map files formatted, tar and gzip them into a single
compressed archive. The gzipped tarball must be FLAT (there are no directories in the
tarball). This file will be sent to one of the stdpopsim
uploaders for placement in the
AWS cloud once the new genetic map(s) are approved. Finally, you must add a GeneticMap
object to the file named for your species in the stdpopsim/catalog/<SPECIES_ID>/
directory
(the one that contains all the simulation code for that species,
see Getting set up to add a new species):
_genetic_map_citation = stdpopsim.Citation(
doi="FILL_ME", author="FILL_ME", year=9999, reasons={stdpopsim.CiteReason.GEN_MAP}
)
"""
The file_pattern argument is a pattern that matches the genetic map filenames,
where '{id}' is replaced with the 'id' field of a given chromosome.
"""
_gm = stdpopsim.GeneticMap(
species=_species,
id="FILL_ME", # ID for genetic map, see naming conventions
description="FILL_ME",
long_description="FILL_ME",
url=(
"https://stdpopsim.s3-us-west-2.amazonaws.com/"
"genetic_maps/dir/filename.tar.gz"
),
sha256="FILL_ME",
file_pattern="name_{id}_more_name.txt",
citations=[_genetic_map_citation],
)
_species.add_genetic_map(_gm)
The SHA256 checksum of the the genetic map tarball can be obtained using the
sha256sum
command from GNU coreutils. If this is not available on your
system, the following can instead be used:
python -c 'from stdpopsim.utils import sha256; print(sha256("genetic_map.tgz"))'
Once all this is done, submit a PR containing the code changes and wait for directions on whom to send the compressed archive of genetic maps to (currently Andrew Kern is the primary uploader but please wait to send files to him until directed).
An important note:
Since the checksum for the file uploaded to AWS is hardcoded into the package,
it is important that we do not change that file in the future on AWS.
For instance, if we uploaded a different annotation file with the same name
(and hence obtainable by the same URL),
then users of previous versions of stdpopsim who try to use that annotation
would receive an error when the package downloads the file and finds
its checksum does not match what is expected.
So, when updating an existing resource file (such as a genetic map or annotation),
we need to give the file a unique URL,
which we do by updating the file name in the URL with a version number
(i.e. url=(<...>/filename_v1.tar.gz)
).
File names do not follow a fixed convention,
so simply add an underscore and version number to the end of whatever the current
file name is (before the .tar.gz
file extension),
or increment the version number if the previous file already has one.
When the file is downloaded locally to the cache, it is given a standardized name
that will be the same regardless of which file is pulled from AWS.
Lifting over a recombination/genetic map
Existing genetic maps will need to be lifted over to a new assembly, if and when the
current assembly is updated in stdpopsim
. This process can be partially automated by running
the liftOver
maintenance code.
First, you must download and install the liftOver
executable from the
UCSC Genome Browser Store.
Next, you must download the appropriate chain files, again from UCSC
(see UCSC Genome Browser downloads for more details).
To validate the remapping between assemblies it is required to have chain files
corresponding to both directions of the liftOver
(e.g. hg19ToHg38.over.chain.gz and hg38ToHg19.over.chain.gz) as in the
example below.
An example of the process for
lifting over the GeneticMap "HapMapII_GRCh37"
to the "Hg19"
assembly
is shown below:
python /maintenance/liftOver_catalog.py \
--species HomSap \
--map HapMapII_GRCh37 \
--chainFile hg19ToHg38.over.chain.gz \
--validationChain hg38ToHg19.over.chain.gz \
--winLen 1000 \
--useAdjacentAvg \
--retainIntermediates \
--gapThresh 1000000
Here, the argument "--winLen"
corresponds to the size of the window over which a weighted
average of recombination rates is taken when comparing the original map with the
back-lifted map (for validation purposes only). The argument "--gapThresh"
is used to select a threshold for
which gaps in the new assembly longer than the "--gapThresh"
will be set with a
recombination rate equal to 0.0000, instead of an average rate. The type of average rate used for gaps
shorter than the "--gapThresh"
is determined either by using the mean rate of two most adjacent windows
or by using the mean rate for the entire chromosome, using options "--useAdjacentAvg"
or
"--useChromosomeAvg"`
respectively.
Validation plots will automatically be generated in the liftOver_validation/
directory. Intermediate files created by the liftOver
executable will be saved
for inspection in the "/liftOver_intermediates/"
, only if the
"--retainInermediates"
option is used. Once the user has inspected the validation plots
and deemed the liftOver process to be sufficiently accurate, they can proceed to generating
the SHA256 checksum.
The SHA256 checksum of the new genetic map tarball can be obtained using the
sha256sum
command from GNU coreutils. If this is not available on your
system, the following can instead be used:
python -c 'from stdpopsim.utils import sha256; print(sha256("genetic_map.tgz"))'
The newly lifted over maps will be formatted in a compressed archive and
automatically named using the assembly name from the chain file.
This file will be sent to one of the stdpopsim
uploaders for placement in the
AWS cloud, once the new map is approved. Finally, you must add a GeneticMap
object to the file named for your species in the stdpopsim/catalog/<SPECIES_ID>/
directory, as shown in Adding a recombination/genetic map or annotation.
Again, once all this is done, submit a PR containing the code changes and wait for directions on whom to send the compressed archive of genetic maps to (currently Andrew Kern is the primary uploader but please wait to send files to him until directed).
Note
The GeneticMap
named "ComeronCrossoverV2_dm6"
for "DroMel"
was generated by similar code (albeit slightly different
compared to that shown above) using the following command:
python /maintenance/liftOver_comeron2012.py \
--winLen 1000 \
--gapThresh 1000000 \
--useAdjacentAvg \
--retainIntermediates
Note
The GeneticMap
named "SalomeAveraged_TAIR10"
for "AraTha"
was generated by aligning the TAIR7 and TAIR10 with "minimap2"
,
and lifting the recombination rates on TAIR7 to TAIR10 with
"paftools.js liftover"
.
Adding a DFE model
A distribution of fitness effects (DFE) describes
the probability distribution of selection coefficients
(deleterious, neutral, and beneficial)
for mutations occurring in a certain set of genomic regions.
This is a central component of the way that stdpopsim
incorporates natural selection in its simulations.
See Simulating genomes with selection.
There are various computational methods for estimating DFEs from genomic data
and you may use published DFEs to code a DFE model, as described below.
Getting set up to add a new DFE model
If this is your first time implementing a DFE in stdpopsim
, it’s a good
idea to take some time browsing the Catalog
to see how existing DFE models are coded and documented.
If you have any questions or confusion about formatting or implementing demographic models, please
don’t hesitate to open a new issue.
We’re more than happy to answer any questions and help get you up and running.
Before you add any code, be sure to have forked the stdpopsim
repository
and cloned it locally, following the instructions in the GitHub Workflow section.
The code for for a species’ DFE models is written in the dfes.py
file in that species directory stdpopsim/catalog/<SPECIES_ID>/
(where <SPECIES_ID>
is the six-character identifier of the species;
e.g., AraTha).
If the species does not currently have any DFE model,
then you should add this file to stdpopsim/catalog/<SPECIES_ID>/
,
with the following two lines of code:
import stdpopsim
_species = stdpopsim.get_species("<SPECIES_ID>")
Furthermore, to ensure that the DFE model is fully incorporated to the
species’ code base, you should add the following import to the __init__.py
file
in the species directory:
from . import dfes
Coding the DFE model
The DFE model should be coded in the dfes.py
file
by defining a specialized function, which essentially returns
a stdpopsim.DFE
object initialized with the appropriate values.
This function should then be added to the _species
object using the add_dfe
function.
We provide below a template block of code for these two operations:
def _dfe_func_name():
return stdpopsim.DFE(
id=...,
description=...,
long_description=...,
citations=...,
mutation_types=...,
proportions=...,
)
_species.add_dfe(_dfe_func_name())
A DFE model is thus defined using six different attributes.
id
(string): A unique, short-hand identifier for this DFE model. This id contains a short description of the distribution written in camel case, (such as “LogNormal” or “Gamma”), followed by an underscore, and then three characters: (1) the first letter of the name of the first author of the publication; (2-3) and two digit characters specifying the year the study was published. For example, the DFE inferred by Kim et al. (2017) hasid
set to “Gamma_K17”. See Naming conventions for more details.description
(string): A brief one-line description of the demographic model.long_description
(string): A more detailed textual description of the model (short paragraph).citations
: A list ofstdpopsim.Citation
objects for the publications from which this model was derived. The citation object requires author, year, and doi information, and a specified reason for citing this model (see Coding the species parameters). The reason associated with demographic model citations will typically bestdpopsim.CiteReason.DFE
.mutation_types
: A list ofstdpopsim.MutationType
objects corresponding to different mutation types (such as negative, neutral, or positive). For more details, see the example below and the documentation ofstdpopsim.MutationType
proportions
: A list of positive numbers that sum to 1 of the same length asmutation_types
. This list specifies the proportion of each mutation type.
For example, the code block below demonstrates a DFE model
with three mutation types: neutral, negative, and positive.
In this model, negative mutations are assumed to have
dominance coefficience of 0.5
and a selection
coefficients distributed according to a Gamma distribution
with mean -0.0004
and shape 0.27
.
The positive mutations also have a dominance coefficience of 0.5
,
but they have a fixed selection coefficient of 0.01
.
def _dfe_func_name():
# Default mutation type is neutral
neutral = stdpopsim.MutationType()
# Negative mutation type with gamma-distributed selection coefficients
negative = stdpopsim.MutationType(
dominance_coeff=0.5,
distribution_type="g", # gamma distribution
distribution_args=[-0.0004, 0.27], # mean and shape of distributoin
)
# Positive mutation type with fixed selection coefficient of 0.01
positive = stdpopsim.MutationType(
dominance_coeff=0.5,
distribution_type="f", # fixed selection coefficient
distribution_args=[0.01], # fixed value
)
# The proportions of the three mutation types
p_neutral = 0.7
p_negative = 0.299
p_positive = 1 - p_neutral - p_negative
return stdpopsim.DFE(
id=...,
description=...,
long_description=...,
citations=...,
mutation_types=[neutral, negative, positive],
proportions=[p_neutral, p_negative, p_positive],
)
_species.add_dfe(_dfe_func_name())
Testing your DFE model and submitting a PR
After you finished your implementation, and specified all the
necessary citations,
we recommend that you run some basic local checks to see that
the model was successfully loaded to stdpopsim
.
You may follow the process outlined for Testing your demographic model and submitting a PR.
Once you are convinced that the model was accurately implemented and loaded to stdpopsim
,
you should submit a pull request (PR) with your changes to the code.
See the GitHub workflow for more details about this process.
At this point, most of your work is done.
You have officially joined the stdpopsim
development team. Welcome!!
Your DFE model still needs to undergo review by another member
of the development team before it is fully incorporated into stdpopsim
.
This will likely require additional feedback from you,
so, stay tuned for discussion during the review process.
To facilitate this, there is one more step: please open a new issue, using the “DFE QC” template. The template asks for the basic information that someone will need to independently verify the implemented DFE.
Reviewing a DFE model
The process for reviewing a DFE is essentially the same
as for reviewing a demographic model (see Overview of the stdpopsim review process).
Briefly, you will re-implement the DFE “blind”, i.e., without looking at the DFE
implementation already added to the code.
Then, the unit tests check whether the implementations are equivalent.
To do this, you add your implementation to stdpopsim/qc/<SPECIES_ID>.py
,
followed by a call like
_species.get_dfe(_MODEL_ID_).register_qc(_your_review_function())
where _MODEL_ID_
is the string specified as the ID for the original DFE,
and _your_review_function()
is the function you’ve added to the QC file
that returns a DFE object.
Coding standards
To ensure that the code in stdpopsim
is as readable as possible
and follows a reasonably uniform style, we require that all code follows
the PEP8 style guide.
Lines of code should be no more than 89 characters.
Conformance to this style is checked as part of the Continuous Integration
testing suite.
Naming conventions
To ensure uniformity in naming schemes across objects in stdpopsim
we have strict conventions for species, genetic maps, and demographic
models.
Species names follow a ${first_3_letters_genus}${first_3_letters_species}
convention with capitilization such that Homo sapiens becomes “HomSap”. This
is similar to the UCSC Genome Browser naming convention and should be familiar.
Genetic maps are named using a descriptive name and the assembly version according
to ${CamelCaseDescriptiveName}_${Assembly}
. e.g., the HapMap phase 2 map on
the GRCh37 assembly becomes HapMapII_GRCh37.
Demographic models are named using a combination of a descriptive name,
information about the simulation, and information about the publication it was
presented in. Specifically we use
${SomethingDescriptive}_${number_of_populations}${first_author_initial}${two_digit_date}
where the descriptive text is meant to capture something about the model
(i.e. an admixture model, a population crash, etc.) and the number of populations
is the number of populations implemented in the model (not necessarily the number
from which samples are drawn). For author initial we will use a single letter, the 1st,
until an ID collision, in which case we will include the 2nd letter, and so forth.
DFE (Distribution of Fitness Effects) models are similarly named using a string describing
the distribution, and information about the publication:
${SomethingDescriptive}_${First_authors_last_name_first_letter}{two_digit_date}
.
For instance, if the distribution in question is a lognormal distribution,
then LogNormal
might be the descriptive string.
Unit tests
All code added to stdpopsim
should have
unit tests. These are typically
simple and fast checks to ensure that the code makes basic sense (the
entire unit test suite should not require more than a few seconds to run).
Test coverage is checked using CodeCov,
which generates reports about each pull request.
It is not practical to test the statistical properties of simulation models as part of unit tests.
The unit test suite is in the tests/
directory. Tests are run using the
pytest module. Use:
$ python3 -m pytest
from the project root to run the full test suite. Pytest is very powerful and has lots of options; please see the tskit documentation for help on how to run pytest and some common options.
It’s useful to run the flake8
CI tests locally before pushing a commit.
To set this up use either pip
or conda
to install flake8
To run the test simply use:
$ flake8 --max-line-length 89 stdpopsim tests
If you would like to automatically run this test before a commit is permitted,
add the following line in the file stdpopsim/.git/hooks/pre-commit.sample
:
exec flake8 --max-line-length 89 setup.py stdpopsim tests
before:
# If there are whitespace errors, print the offending file names and fail.
exec git diff-index --check --cached $against --
Finally, rename pre-commit.sample
to simply pre-commit
Code Coverage
As part of the continuous testing suite we have automated checking of how well the test units cover the source code. As a result it’s very helpful to check locally how well your tests are covering your code by asking pytest for coverage reports. This can be done with:
$ pytest --cov-report html --cov=stdpopsim tests/
this will output a directory of html files for you to browse test coverage
for every file in stdpopsim
in a reasonably straightfoward graphical
way. To see them, direct your web browser to the htmlcov/index.html file.
You’ll be looking for lines of code that are highlighted yellow or red
indicated that tests do not currently cover that bit of code.
Documentation
Documentation is written using reStructuredText
markup and the sphinx documentation system.
It is defined in the docs/
directory.
To build the documentation type make
in the docs/
directory. This should build
HTML output in the docs/_build/html/
directory.
Note
You will need stdpopsim
to be installed for the build to work.
Making a new release
Here is a list of things to do when making a new release:
Update the changelog and commit
Create a release using the GitHub UI
- git fetch upstream on your local branch.
Then check out upstream/main and create a release tarball (with python setup.py sdist). Setuptools_scm will detect the version appopriately.
Upload to PyPI: twine upload dist/{version just tagged}.tar.gz
After the release, if everything looks OK, update the symlink for
stable
in the stdpopsim-docs repositoryCheck on the conda feedstock PR.