How to use MS to simulate data for PSMC/MSMC

The ms command usage looks like this

usage: ms nsam howmany

So it is nessary to provide nsam (The number of haplotypes to be sampled) and howmany which is the number of replicate sets of data to generate.

For PSMC data we always choose nsam to be 2 because the method is designed for diploid genomes. For convenience howmany should just be set to 1 because we will rerun ms to generate separate random replicate datasets

Other arguments to ms are listed as options but not all are actually optional. In our case we will always set the following options;

-t theta -r rho nsites -p n

theta is equal to 4*N0*µ where N0 is the baseline effective population size and µ is the mutation rate per generation across the entire region being simulated.

We will run simulations across various values of N0 ranging between 100 and 1e7 The parameter µ is not the usual definition of mutation rate because it is scaled by length. A reasonable value of mutation rate is 3e-8 per generation per base. We need to multiply this by L (length of the region being simulated) to arrive at µ. Assuming L=1e7 (10Mb) we get µ=0.3

rho is the scaled recombination rate. The ms manual states that this is 4Nc where N is effective population size again. c is the product of the per-base recombination rate and the length of the simulated region .. ie (c=rL) and c is the length of the region to be simulated. r is a hard parameter to measure so let's just use a constant value of 1e-8. Assuming L=1e7 we get rho=0.1

For N0=10000 and L=1Mb we would therefore have;

theta = 1200 (4Nµ)
rho = 1600 (4Nc)

Adjust p so that we can track heterozygous sites to their exact position.

ms 2 1 -t 1200 -r 1600 1000000 -p 6

iracooke/ms_for_psmc.md

How to use MS to simulate data for PSMC/MSMC