vignettes/yourData.Rmd
yourData.Rmd
In this vignette we describe how you can use BactDating on your own data.
The first required input (let’s call it t
) is a
phylogenetic tree of the class phylo
from the
ape
package, which can be loaded from a Newick or Nexus
file using respectively the following commands from the ape
package:
library(ape)
t=read.tree('filename.nwk')
t=read.nexus('filename.nex')
This tree should have branch lengths measured in units of
substitutions. In other words, the branch lengths should not be measured
in substitutions per site. If you are not sure whether the branch
lengths are measured correctly or not, try to compute
sum(t$edge.length)
. This should be the total number of
substitutions throughout the tree, so if you get a value below one or
even only a bit above one, your branch lengths are probably not in the
correct unit.
If the branch lengths are measured per site, you will need to rescale them using the command:
t$edge.length=t$edge.length*L
where L
is the number of sites which were used to build
this tree.
Alternatively, you can load directly from the output of ClonalFrameML or Gubbins using the following commands respectively:
t=loadCFML(prefix)
t=loadGubbins(prefix)
This will have the advantage to account for the recombination events detected by ClonalFrameML or Gubbins when performing the dating.
The second required input (let’s call it d
) is the dates
at which the isolates were sampled. Dates need to be expressed
decimally, for example 2015.5
means 1st July 2015. If
needed, you can convert into decimal years from other date formats using
the decimal_date
function of the lubridate
package. This input can be simply a vector of decimal dates, in which
case the order needs to be the same as the order of the tree tips in
t$tip.label
. For example, if t$tip.label
is
(A,B,C)
then d=c(2010,2011,2012)
indicates
that A
was isolated in 2010, B in 2011 and C in 2012.
Alternatively, the vector d
does not have to be ordered
if names(d)
corresponds to the tip names. For example,
d=c(2011,2010,2012)with names(d)=c('B','A','C')
would have
the same meaning as in the previous example.
If some isolation dates are unknown, this can be indicated using
NA
in the vector, in which case the date is considered
unknown but within the range of known dates, ie with the range return by
the command range(d,na.rm=T)
.
Finally, the isolation dates can be specified as an interval rather
than an exact value. This is useful for example if the isolation year is
2015 but with unknown month or day. To do so, d
needs to be
a matrix with two columns, with the first column indicating the lower
bound and the second column indicating the upper bound of the interval.
For example, if year
is the vector of isolation years, but
the exact date in each year is unknown, you can specify this by building
the matrix d
as follows:
d=cbind(year,year+1)
Once the two inputs t
and d
described above
are in the correct format, you are ready to run BactDating using the
main command:
res=bactdate(t,d)
See other vignettes for additional functionality. In particular, if your tree is unrooted you can find an optimal root using:
rooted=initRoot(t,d)
Before running the main bactdate
analysis, it is also a
good idea to assess the strength of the temporal signal using a
root-to-tip linear regression:
r=roottotip(t,d)
BactDating is a MCMC algorithm which needs to be run for long enough
to ensure that the results are meaningful. The length of the algorithm
can be controlled using the nbIts
argument of the
bactdate
function, which by default is 10e4 but may need
increasing for example ten folds:
res=bactdate(t,d,nbIts=1e5)
To test the MCMC convergence, a good start is to check that the traces of the parameters look stable using the command:
plot(res,'trace')
For further testing of convergence, you can export the BactDating
result to the format required by the coda
package using the
command:
mcmc=as.mcmc(res)
You can then compute for example the effective sample size of the parameters which should be at least 100 for each paramater, using:
effectiveSize(mcmc)