The ms
command usage looks like this
usage: ms nsam howmany
So it is nessary to provide nsam
(The number of haplotypes to be sampled) and howmany
which is the number of replicate sets of data to generate.
For PSMC data we always choose nsam
to be 2 because the method is designed for diploid genomes. For convenience howmany
should just be set to 1
because we will rerun ms to generate separate random replicate datasets
Other arguments to ms
are listed as options but not all are actually optional. In our case we will always set the following options;
-t theta -r rho nsites -p n
theta
is equal to 4*N0*µ
where N0
is the baseline effective population size and µ is the mutation rate per generation across the entire region being simulated.
We will run simulations across various values of N0
ranging between 100 and 1e7
The parameter µ is not the usual definition of mutation rate because it is scaled by length. A reasonable value of mutation rate is 3e-8 per generation per base. We need to multiply this by L
(length of the region being simulated) to arrive at µ. Assuming L=1e7
(10Mb) we get µ=0.3
rho
is the scaled recombination rate. The ms manual states that this is 4Nc
where N
is effective population size again. c
is the product of the per-base recombination rate and the length of the simulated region .. ie (c=rL
) and c
is the length of the region to be simulated. r
is a hard parameter to measure so let's just use a constant value of 1e-8. Assuming L=1e7
we get rho=0.1
For N0=10000 and L=1Mb we would therefore have;
theta = 1200 (4Nµ)
rho = 1600 (4Nc)
Adjust p
so that we can track heterozygous sites to their exact position.
ms 2 1 -t 1200 -r 1600 1000000 -p 6