In this tutorial, we will go step
by step through the basic steps of performing linkage analysis in
polyploids using polymapR
. Most of the theory behind the
functions described here can be found in the Bioinformatics
publication of Bourke et al. (2018) at https://doi.org/10.1093/bioinformatics/bty371. The
example dataset is a simulated dataset of an hypothetical tetraploid
outcrossing plant species with five chromosomes. Chromosomes have random
pairing behaviour. The population consists of 207 individuals
originating from an F1 cross between two non-inbred parents. Frequently
occurring data imperfections like missing values, genotyping errors and
duplicated individuals are taken into account.
polymapR
package.polymapR
polymapR
package.In R, a function can be called by typing function_name()
in the console and pressing [Enter]
. (Optional) arguments
go within the parentheses, followed by a =
sign and the
value you want to give to the argument. Arguments are separated by a
comma. All possible arguments and default values of a function are in
the help file, which is called by typing ?
followed by the
function name. Getting the help file for the commonly used function
seq
looks like this:
And generating a sequence of numbers from 2 to 14 by an increment of
3 using seq
looks like this:
## [1] 2 5 8 11 14
Although copy pasting the commands from this document to your R
console will do for this tutorial, we recommend to have a bit more
experience in using R than above before using polymapR
.
This will allow you better to troubleshoot and check out data output. A
good place to start would be https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
or any other good introductory course in R.
The polymapR
package is literally a collection of
functions, each defined by their function name and arguments. It relies
on dosage data of molecular markers, and therefore is logically used
after the packages fitTetra
or fitPoly
. As for
any project, it is best to first create a new folder which will be the
main working directory for the mapping project. Set the working
directory to your new folder (setwd()
or use the drop-down
menu) and copy the marker dosage file from fitTetra
there.
polymapR
polymapR
is available on CRAN at https://cran.R-project.org/package=polymapR which is
probably where you accessed this vignette. As with other packages on
CRAN, it can be installed directly from within R using the command:
Most exported functions in polymapR
have a
log
argument. The default setting is NULL
. In
that case, all messages are returned to the standard output, which means
your R console if you are calling the functions interactively. However,
if it is set to a file name (e.g. log.Rmd
or
log.txt
) all function calls and output are written to that
file. It automatically creates one if it does not exist yet. The log
file is written in R markdown (https://rmarkdown.rstudio.com/). The advantage of
markdown is that it is readable as plain text, but it can also easily be
rendered into a nicely formatted html or pdf. In order to do so, give
your logfile a .Rmd extension and use the knitr
package to
turn it into a pdf, docx or html.
We assume that the user has used fitPoly
or some
alternative method to convert genotyping information (e.g.
signal intensity ratios or read depths) into marker dosages. The
standard version of polymapR
assumes that such genotypes
are discrete, but it is also possible to use probabilistic genotypes
since version 1.1.0 (see other vignette). In this tutorial we assume
such discrete dosage values are the starting point, but we will impose
some further quality checks before we can begin mapping (such as allowed
numbers of missing values). We also need to check the markers for
skewness (differences in the observed and expected marker segregation)
which can cause problems later in the clustering. For this, we have
incorporated a number of functions developed for the
fitPoly
package [1], and
included in the currently-unpublished R package
fitPolyTools
developed by Roeland Voorrips.
But first, a note on polyploid terminology
In the
polymapR
package, and throughout this vignette, we often refer to terms like “simplex”, “duplex” or “nulliplex” etc. These terms describe the dosage scores of SNP markers, essentially the count of the number of “alternative” alleles present at a particular locus in an individual. A simplex x nulliplex marker is therefore a marker that for which the maternal dosage is 1 (mothers always come first) and paternal dosage is 0. To save space, we also refer to these markers as SxN markers, which is synonymous with 1x0 markers. Sometimes statements are made concerning simplex x nulliplex markers - it should be understood that these statements also usually apply to nulliplex x simplex markers - markers for which the segregating allele is carried by the father rather than the mother.
The first step is to import the marker data. The layout of the dataset should be marker name, parental scores followed by offspring scores (so-called wide data format):
Marker | P1 | P2 | F1.001 | F1.002.. |
---|---|---|---|---|
mymarker1 | 1 | 0 | 1 | 1 |
mymarker2 | 2 | 4 | 3 | 2 |
mymarker3 | 0 | 3 | 2 | 2 |
In cases where the parents have been genotyped more than once, a consensus parental score is required. Missing parental scores are not allowed at this stage, and should have been imputed already if desired.
The data should be in some format that can be read easily into R,
such as .csv, .dat or .txt file. In the example a .csv file is read in.
For other file types check out ?read.delim
.
ALL_dosages <- read.csv("tetraploid_dosages.csv",
stringsAsFactors = FALSE,
row.names = 1) #first column contains rownames
## [1] "data.frame"
By default, read.csv
returns data in a data frame. All
polymapR
functions that use dosage data only accept dosage
data when the dosages are in a matrix. A matrix is a two-dimensional R
object with all elements of the same type (e.g. character,
integer, numeric). Markernames and individual names should be specified
in rownames and columnnames respectively. To use the data with
functions, the data should be converted from a data.frame to a matrix by
the as.matrix
function:
## [1] "matrix" "array"
## P1 P2 F1_001 F1_002 F1_003
## Zm_rs005505 2 2 2 3 2
## Zm_ts097822 1 2 1 1 2
## Ac_ws072452 1 0 1 NA 1
## Ac_ts073123 0 2 1 NA 2
## Ap_ws071152 3 1 3 2 NA
## St_rs076767 1 0 1 1 1
## [1] 3000 209
If you are testing the package and would like a sample dataset to work with, a set of tetraploid marker dosage data is provided within the package itself. First load the package and then the dosage data as follows:
The next step before we proceed further is to check whether the
marker scores in the F1 correspond to those expected according to the
parental dosages. We do this using the checkF1
function
(from fitPolyTools
). However, we first load the
polymapR
package into the global environment:
Now run checkF1
:
F1checked <- checkF1(dosage_matrix = ALL_dosages,parent1 = "P1",parent2 = "P2",
F1 = colnames(ALL_dosages)[3:ncol(ALL_dosages)],
polysomic = TRUE, disomic = FALSE, mixed = FALSE, ploidy = 4)
head(F1checked$checked_F1)
## m MarkerName parent1 parent2 F1_0 F1_1 F1_2 F1_3 F1_4 F1_NA P1 P2 bestfit
## 1 1 Zm_rs005505 2 2 6 39 90 45 4 23 2 2 18I81_0
## 2 2 Zm_ts097822 1 2 14 79 83 10 0 21 1 2 1551_0
## 3 3 Ac_ws072452 1 0 102 83 0 0 0 22 1 0 11_0
## 4 4 Ac_ts073123 0 2 27 121 37 0 0 22 0 2 141_0
## 5 5 Ap_ws071152 3 1 1 48 77 55 0 26 3 1 121_1
## 6 6 St_rs076767 1 0 92 88 0 0 0 27 1 0 11_0
## frqInvalid_bestfit Pvalue_bestfit matchParent_bestfit bestParentfit
## 1 0.0000 0.9187 Yes 18I81_0
## 2 0.0000 0.4724 Yes 1551_0
## 3 0.0000 0.1624 Yes 11_0
## 4 0.0000 0.4160 Yes 141_0
## 5 0.0055 0.1165 Yes 121_1
## 6 0.0000 0.7656 Yes 11_0
## frqInvalid_bestParentfit Pvalue_bestParentfit matchParent_bestParentfit
## 1 0.0000 0.9187 Yes
## 2 0.0000 0.4724 Yes
## 3 0.0000 0.1624 Yes
## 4 0.0000 0.4160 Yes
## 5 0.0055 0.1165 Yes
## 6 0.0000 0.7656 Yes
## q1_segtypefit q2_parents q3_fracscored qall_mult qall_weights
## 1 1.0000 1 0.7222 0.7222 0.9383
## 2 1.0000 1 0.7464 0.7464 0.9436
## 3 1.0000 1 0.7343 0.7343 0.9410
## 4 1.0000 1 0.7343 0.7343 0.9410
## 5 0.9278 1 0.6860 0.6365 0.8901
## 6 1.0000 1 0.6739 0.6739 0.9275
There is the possibility to check for shifted markers (incorrectly
genotyped) but we will not concern ourselves with that here. The main
arguments to be specified are what the parental identifiers are
(parent1
and parent2
), what the identifiers of
the F1 population are (argument F1
- note we just take the
column names of the dosages matrix from column 3 to the end), and which
sort of inheritance model we wish to compare against (here we have
specified polysomic
to be TRUE
and both
disomic
and mixed
to be FALSE
.
Note that mixed
allows for one parent to be fully disomic
and the other fully polysomic, but does not refer to the situation of a
mixed inheritance pattern within one parent, associated with segmental
allopolyploidy - this is because we have no fixed expectation to compare
with if this is the case). The argument ploidy
specifies
the ploidy level of parent 1, if parent 2 has a different ploidy (for
instance in a tetraploid x diploid cross) then we can use the
ploidy2
argument. The results are stored in the list
F1checked
, which has two elements: meta
which
holds meta-data regarding the function call, and
checked_F1
, the actual function output. We are interested
in seeing whether there is consistency between the parental scores and
the F1. In general, if there is good consistency between the parental
and offspring scores we will have a high value for
qall_mult
and qall_weights
, which are
aggregated quality measures of q1
, q2
and
q3
. In the sample datatset provided with
polymapR
there are no skewed markers present. However, you
may find it useful to examine the output of checkF1
and
remove markers which show conflict between the parental and offspring
scores, as these markers will likely cause issues in the later mapping
steps.
It can also be useful to run a principal component analysis at this
stage using the PCA_progeny
function on the marker
dosages:
In cases where there are serious issues with the population (such as parental mix-ups, inclusion of unrelated samples with the bi-parental population, multiple sub-populations from e.g. multiple pollen-donors etc) a PCA plot can help. Here, we see no such problems so we can proceed.
We now summarise the marker data in terms of the different segregation types, number of missing values, number of inadmissible scores (which could be genotyping errors or possible double reduction scores). Run the summary function on the marker data as follows:
mds <- marker_data_summary(dosage_matrix = ALL_dosages,
ploidy = 4,
pairing = "random",
parent1 = "P1",
parent2 = "P2",
progeny_incompat_cutoff = 0.05)
## Calculating parental info...
## Checking compatibility between parental and offspring scores...
##
## #### parental_info
##
## | | P2_0| P2_1| P2_2| P2_3| P2_4|
## |:----|----:|----:|----:|----:|----:|
## |P1_0 | 0| 312| 227| 49| 3|
## |P1_1 | 300| 625| 356| 67| 4|
## |P1_2 | 230| 323| 175| 29| 2|
## |P1_3 | 66| 65| 30| 5| 0|
## |P1_4 | 2| 5| 3| 0| 122|
##
## #### offspring_incompatible
##
## | | P2_0| P2_1| P2_2| P2_3| P2_4|
## |:----|----:|----:|----:|----:|----:|
## |P1_0 | NA| 0.00| 0.00| 0.56| 2.42|
## |P1_1 | 0.00| 0.00| 0.00| 0.28| 0.85|
## |P1_2 | 0.00| 0.00| 0.00| 0.12| 0.00|
## |P1_3 | 0.45| 0.33| 0.14| 0.19| NA|
## |P1_4 | 0.97| 0.48| 0.64| NA| 1.07|
##
## #### Incompatible individuals:
##
## None
## | | 1x0| 2x0| 3x0| 4x0| 0x1| 1x1| 2x1| 3x1| 4x1| 0x2| 1x2| 2x2| 3x2| 4x2| 0x3| 1x3| 2x3| 3x3| 0x4| 1x4| 2x4| 4x4|
## |:---------|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
## |frequency | 300| 230| 66| 2| 312| 625| 323| 65| 5| 227| 356| 175| 30| 3| 49| 67| 29| 5| 3| 4| 2| 122|
If you want to know exactly what these functions are doing and what
can be typed as an argument just use ?marker_data_summary
and/or ?parental_quantities
.
Simplex x nulliplex markers segregate in a 1:1 ratio in the offspring, but this segregation is also possible with triplex x nulliplex, triplex x quadriplex and simplex x quadriplex, and the segregating allele in each case comes from the same parent. We can therefore re-code all of these segregation types into the simplest 1 x 0 case, taking care that we handle possible double reduction scores correctly. A list of the conversions in tetraploid we apply which simplify the subsequent analysis and mapping work are:
3x0,3x4,1x4 -> 1x0 |
0x3,4x3,4x1 -> 0x1 |
2x4 -> 2x0 |
4x2 -> 0x2 |
3x3 -> 1x1 |
3x1 -> 1x3 |
3x2 -> 1x2 |
2x3 -> 2x1 |
Note that some classes such as 2x2 cannot be converted into any other category, and we do not convert non-segregating classes such as 4x4 as these are of no further use for mapping or QTL analysis. To convert the marker data:
## Warning in convert_marker_dosages(dosage_matrix = ALL_dosages, ploidy = 4): There are dosages greater than 4 or less than 0 in the dataset.
## If they include parental dosages, markers are removed from the output.
## Otherwise the dosage is made missing.
##
## #### Marker dosage frequencies:
##
## | | P2_0| P2_1| P2_2| P2_3|
## |:----|----:|----:|----:|----:|
## |P1_0 | 0| 366| 230| 0|
## |P1_1 | 370| 630| 386| 132|
## |P1_2 | 232| 352| 175| 0|
##
## markers not converted: 2615
##
## markers 1 parent converted: 129
##
## markers 2 parents converted: 256
##
## non-segregating markers deleted: 127
## | | 1x0| 2x0| 0x1| 1x1| 2x1| 0x2| 1x2| 2x2| 1x3|
## |:---------|---:|---:|---:|---:|---:|---:|---:|---:|---:|
## |frequency | 370| 232| 366| 630| 352| 230| 386| 175| 132|
A number of quality checks are recommended before we begin mapping. For example, we do not want to include markers which have too many missing values (“NA” scores) as this may be an indication that the marker has performed poorly and may have many errors associated with it. We would also like to screen out individuals that have many missing values as this might be an indication that the DNA quality of that sample was poor, or duplicate individuals which give no extra information and should probably also be removed / combined. Because there are no fixed rules for how stringent we should be with quality checks (5% missing values ok? 10% missing values ok? 20%?), some visual aids are produced in order to help with these decisions.
We first screen the marker data for missing values using
screen_for_NA_values
. This function allows the user to
remove rows or columns with a certain rate of missing values from the
dataset. The dosage matrix, which is specified as
segregating_data
contains markers in rows and individuals
in columns. In R, rows are usually specified by the number 1, and
columns by 2. The argument margin
lets you specify by
margin number whether you want to screen markers
(margin = 1
) or individuals (margin = 2
) If
the argument cutoff
is set to NULL
, it takes
user input on the rate of markers to be screened. In the example below
cutoff
is specified, but because decisions are made after
data inspection, we recommend to decide on the cutoff value after
inspection of the data by specifying cutoff = NULL
. A
threshold of 10% missing values is sensible, as a compromise between
high-confidence marker data versus keeping a large proportion of the
markers for mapping and QTL analysis. In general, the most problematic
markers tend to have higher rates of missing values so any reasonable
threshold here should be fine. If problems are encountered later in the
mapping, it might be worthwhile to re-run this step with a more
stringent (e.g. 5%) threshold.
screened_data <- screen_for_NA_values(dosage_matrix = segregating_data,
margin = 1, # margin 1 means markers
cutoff = 0.10,
print.removed = FALSE)
##
## Number of removed markers : 1458
##
##
## There are now 1415 markers leftover in the dataset.
As in step 7.1 we can also visualise the rate of missing values per individual, which can help us to make a decision about whether certain individuals should be removed from the dataset before commencing the mapping. In order to screen the individuals for missing values we enter:
screened_data2 <- screen_for_NA_values(dosage_matrix = screened_data,
cutoff = 0.1,
margin = 2, #margin = 2 means columns
print.removed = FALSE)
##
## Number of removed individuals : 2
##
##
## There are now 205 individuals leftover in the dataset.
Again, it is not clear what an acceptable threshold for the rate of missing values should be, but we recommend a cut-off rate of 0.1 – 0.15 (10-15%) missing values allowed.
It is also desirable to be able to check whether there are any
duplicated individuals in the population as can sometimes happen (or due
to mix-ups in the DNA preparation and genotyping). The function
screen_for_duplicate_individuals
does this for you. Note
that a duplicate can remain in the dataset, but they add no further
information for mapping or QTL analysis, and may even distort the
analysis (by giving an inaccurate population size). As for the function
screen_for_NA_values
, cutoff can be set to
NULL
to accept user input. Especially for
screen_for_duplicate_individuals
it makes sense to make a
decision on the cut off after inspection of the data and figures. To
screen the data for duplicate individuals, enter:
screened_data3 <- screen_for_duplicate_individuals(dosage_matrix = screened_data2,
cutoff = 0.95,
plot_cor = TRUE)
##
## Combining F1_060 & F1_081 into F1_060
##
## Combining F1_046 & F1_085 into F1_046
## Warning in screen_for_duplicate_individuals(dosage_matrix = screened_data2, :
## Multiple duplicates of single genotype identified at this threshold. Attempting
## to merge...
##
## Combining F1_032 & F1_100 into F1_032
##
## Combining F1_032 & F1_114 into F1_032
##
## Combining F1_079 & F1_101 into F1_079
##
## Combining F1_106 & F1_113 into F1_106
##
## Combining F1_112 & F1_199 into F1_112
##
## #### 7 individuals removed:
##
## _ _ _ _
## [1,] "F1_081" "F1_085" "F1_100" "F1_114"
## [2,] "F1_101" "F1_113" "F1_199" ""
## |_ |_ |_ |_ |
## |:------|:------|:------|:------|
## |F1_081 |F1_085 |F1_100 |F1_114 |
## |F1_101 |F1_113 |F1_199 | |
Here we see that there were a number of duplicate pairs of individuals identified (red points in the output graph). Each duplicate pair is merged into a consensus individual (conflicts are made missing), keeping the name of the first individual. This is summarised in the output as well.
Finally, we can also check for duplicated markers (those which give
us no extra information about a locus) and remove these from the dataset
before mapping, using the screen_for_duplicate_markers
function. This function returns a list with 2 items - the
filtered_dosage_matrix which we are interested in now, and bin_list,
which we are possibly interested in again later.
It is a good idea to save the output of this function, as you might like
to add back the duplicate markers after mapping.
## | | | 0% | |======================================================================| 100%
## Marked and merged 3 duplicated markers
Note that increasingly, there are far too many markers generated in comparison to the population size. It is possible to estimate the genetic resolution of the population you are working with, and from that estimate the expected bin-size in your dataset, assuming your markers are approximately randomly-distributed across the genome (this may not be a good assumption in your case, but as a rough guide this function may still be useful). Considering only bivalent pairing, in a tetrasomic species we can assume approximately four cross-over recombinations occurred, taking the rule-of-thumb guide of one cross-over recombination per bivalent, which has been observed in autotetraploid Arabidopsis arenosa for example [2].
We can try to estimate the average bin size if we run the previous
function and specifying the argument
estimate_bin_size = TRUE
. The function then does a quick
calculation on how many markers would be expected in an average marker
bin (under the tenuous assumptions of uniform distributions of both
markers and cross-over break-points). In a tetraploid, each offspring
would be expected to have 4 cross-over break-points per chromosomal
linkage group (2 maternal and 2 paternal), so across a population of
N individuals there are approximately 4N breaks per
linkage group, and 4NL such breaks across L linkage
groups. Supposing there are in total y distinct marker types
(like 1x0, 1x1, 2x0, 0x1 etc), then your M markers can
be split into M/4NLy segregation patterns (assuming no errors),
which is an approximate estimate for the average bin size. In datasets
with very large numbers of markers (e.g. genotyping using NGS),
it may be worthwhile to only include markers with a minimum
number of duplicates present, and that minimum number can be informed by
the calculation above. In this case, supposing we demand at least 7
duplicates (so a minimum bin size of 6), we would do something like the
following:
reliable.markers <- names(which(sapply(screened_data4$bin_list,length) >= 6))
reliable_data <- screened_data4$filtered_dosage_matrix[reliable.markers,]
Of course, this assumes we have a very large number of markers to choose from, as this is a very severe demand in a more traditional, well-balanced dataset. It also presumes there are no genotyping errors. A genotyping error will make duplicate markers appear to be different even when they are not (from a genetic mapping perspective). For now we are ignoring this issue, but may return to it in future updates.
In the current data-set there is quite a good balance between the number of markers and the number of individuals in the mapping population, and so we extract the filtered_dosage_matrix that has the duplicate markers merged:
Before we start the mapping, it is a good idea to run the summary
function again, this time on the filtered_data
object, so
we have a clear record of the numbers of markers going forward to the
mapping stage.
## | | 1x0| 2x0| 0x1| 1x1| 2x1| 0x2| 1x2| 2x2| 1x3|
## |:---------|---:|---:|---:|---:|---:|---:|---:|---:|---:|
## |frequency | 175| 111| 194| 301| 183| 101| 198| 88| 61|
The first step in actually mapping the markers is to define linkage groups for the markers. We use the simplex x nulliplex markers to define the linkage groups (chromosomes) and following that, the homologues (4 per linkage group are expected in a tetraploid). We use the LOD scores to cluster markers, which increases dramatically for tight coupling-phase estimates.
To estimate the recombination frequencies, LOD and phasing between all marker pairs of P1 enter:
SN_SN_P1 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
parent1 = "P1",
parent2 = "P2",
which_parent = 1,
ploidy = 4,
pairing = "random"
)
A note on arguments and multi-core processing
There is much more to the
linkage
function. Checkout?linkage
to see the arguments that can be passed to it. One important feature is that it can run in parallel on multiple cores. This can be very time-efficient if you are working with a large number of markers. The number of cores to use can be specified using thencores
argument. Before doing so, first check the number of cores on your system usingparallel::detectCores()
. Use at least one core less than you have available to prevent overcommitment.
Note that since polymapR version 1.1.5, some of the arguments of the
linkage
function have been updated. The arguments
target_parent
and other_parent
have been
removed, and replaced with parent1
and
parent2
, referring to the column names of the two parents.
A new argument which_parent
has been introduced, to allow
you to specify which parent is being “targeted” for linkage analysis
(either 1 or 2). This was needed to accommodate an update for mapping in
triploid populations, but also makes it easier if you rename your
parental columns “P1” and “P2” at the start, as these are the default
values provided for the arguments parent1
and
parent2
. This is the case here, so for subsequent calls to
linkage
we will omit specifying the parental names each
time.
It is a good idea to first visualise the relationship between the
recombination frequency and LOD score, which gives us some insight into
the differences between the information content of the different phases,
as well as highlighting possible problems in the data. This can be done
by using the r_LOD_plot
function:
## Warning: Computation failed in `stat_binhex()`.
## Caused by error in `compute_group()`:
## ! The package "hexbin" is required for `stat_bin_hex()`.
In some situations, it is possible to identify outlying datapoints on
this plot, which do not fit the expected relationship. We have
implemented a function to check this for the case of pairs of simplex x
nulliplex markers, as these are the markers that we use in the
subsequent steps of clustering and identifying chromosome and homologue
linkage groups. If you are unsure, you can run the
SNSN_LOD_deviation
function anyway to see whether your data
fits the expected pattern:
P1deviations <- SNSN_LOD_deviations(linkage_df = SN_SN_P1,
ploidy = 4,
N = ncol(filtered_data) - 2, #The F1 population size
alpha = c(0.05,0.2),
plot_expected = TRUE,
phase="coupling")
Here we see there are a few “outliers” in the coupling pairwise data
(identified by the stars) from the expected relationship between r and
LOD (the bounds of which are shown by the dotted lines - these can be
widened as necessary using the alpha
argument for the upper
and lower tolerances around the “true” line, usually in a trial and
error approach). In our example dataset, these highlighted points are
hardly outliers and we can proceed. In other datasets, you may want to
remove these starred pairwise estimates before continuing with marker
clustering, as it is possible they may cause unrelated homologue
clusters to clump together and prevent a clear clustering structure from
being defined. To do so, remove them from the linkage data frame as
follows:
We are now in a position to cluster the marker data into chromosomes
using the cluster_SN_markers
function. This function
clusters markers over a range of LOD thresholds. In the example below
this is over a range from LOD 3 to 10 in steps of 1
(LOD_sequence = c(3:10)
). If the argument
plot_network
is set to TRUE
this function will
plot the network of the lowest LOD threshold (LOD 3 in this case), and
overlays the clusters of the higher LOD scores. This setting is not
recommended with large number of marker pairs. Your computer will most
likely run out of memory.
P1_homologues <- cluster_SN_markers(linkage_df = SN_SN_P1,
LOD_sequence = seq(3, 10, 0.5),
LG_number = 5,
ploidy = 4,
parentname = "P1",
plot_network = FALSE,
plot_clust_size = FALSE)
## Total number of edges: 894
## 20 clusters were expected.
The function cluster_SN_markers
returns a list of
cluster data frames, one data frame for each LOD threshold. The first
plot shows two things - on the normal y-axis, the number of clusters
over a range of LOD thresholds (x-axis) - and on the right-hand y-axis,
the number of markers that do not form part of any cluster, which is
shown by the blue dashed line. In the cluster_SN_markers
function, we specify the ploidy (4 for a tetraploid) and the expected
number of chromosomes (LG_number = 5) and therefore there are 4 x 5 = 20
homologue clusters expected. If we examine the second plot, we see how
these clusters stay together or fragment as the linkage stringency (LOD
score) increases. Here, we see clearly that at LOD 3.5, there are 5
clusters which all eventually divide into 4 sub-clusters each. This is
the ideal scenario - we have identified both the chromosomal linkage
groups and the homologues using only the simplex x nulliplex marker
data. Note that it is the coupling-phase estimates which define the
homologue clusters and the repulsion-phase estimates which provide the
associations between these coupling-clusters (homologues). In practice,
things might not always be so orderly. In noisy datasets with higher
numbers of genotyping errors, or at higher ploidy levels, the
repulsion-phase linkages between homologues will be lost among false
positive linkages between un-linked markers. This means that all markers
will likely clump together at lower LOD scores.
You can manually check how many clusters there are at a specific LOD score, and how many markers are in a cluster:
P1_hom_LOD3.5 <- P1_homologues[["3.5"]]
t <- table(P1_hom_LOD3.5$cluster)
print(paste("Number of clusters:",length(t)))
## [1] "Number of clusters: 5"
##
## 1 2 3 4 5
## 37 36 29 31 41
P1_hom_LOD5 <- P1_homologues[["5"]]
t <- table(P1_hom_LOD5$cluster)
print(paste("Number of clusters:",length(t)))
## [1] "Number of clusters: 20"
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 13 9 10 4 12 7 9 8 3 13 6 7 4 10 6 11 17 6 6 12
Based on the plots (or the output as shown) you should be able to
decide which LOD threshold best separates the data. In this example, at
a LOD of 3.5 we identify chromosomes and anywhere between LOD 4.5 and
LOD 7 we have identified our homologues (4 x 5 = 20 clusters). In this
parent (P1) we have therefore solved the marker clustering problem
without any difficulty. It remains to gather this into a structure that
can be used at later stages. To do this, we need two LOD values which
define chromosomal groupings (here, LOD_chm
= 3.5) and
homologue groupings (here, LOD_hom
= 5). We can then use
the define_LG_structure
function with these arguments as
follows (note the argument LG_number refers to the expected number of
chromosomes for this species):
LGHomDf_P1 <- define_LG_structure(cluster_list = P1_homologues,
LOD_chm = 3.5,
LOD_hom = 5,
LG_number = 5)
##
## #### SxN Marker(s) lost in clustering step at LOD 5:
##
## |_ |
## |:-----------|
## |Ls_rs099908 |
## SxN_Marker LG homologue
## 1 Ac_ns035629 1 1
## 3 Ac_ns045766 1 1
## 7 Ac_ns096474 1 1
## 22 Ac_ts072686 1 1
## 24 Ac_ws013795 1 1
## 42 Ap_rs075640 1 1
However, you may find that it is not possible to identify chromosomal groups as simply as this. We therefore recommend that homologue groups be first identified, and then another marker type be used to provide associations between these clusters, and hence help form chromosomal linkage groups. This can be done as follows:
Generate homologue clusters using the
cluster_SN_markers
function at a high LOD.
Use the Duplex x Nulliplex or Simplex x Simplex marker information to put the clusters back together into chromosomal groups.
It makes sense to use only SxN marker pairs in coupling phase for the
homologue clustering (since these associations define the homologues).
Note that the most likely phase per marker pair (coupling or repulsion)
was already estimated by our previous call to the linkage
function.
SN_SN_P1_coupl <- SN_SN_P1[SN_SN_P1$phase == "coupling",] # select only markerpairs in coupling
P1_homologues_1 <- cluster_SN_markers(linkage_df = SN_SN_P1_coupl,
LOD_sequence = c(3:12),
LG_number = 5,
ploidy = 4,
parentname = "P1",
plot_network = FALSE,
plot_clust_size = FALSE)
## Total number of edges: 682
## 20 clusters were expected.
In the dot plot above, we see that all chromosomal associations have now been removed with the repulsion information.
A note on nested lists
In the
polymapR
package, output is often given as a nested list. Some basic information on lists is available on e.g. r-tutor. We use nested lists because we are often dealing with hierarchical data: marker data > homologue > linkage group > organism. Some basic R knowledge is needed to retrieve, visualize or filter data from such a nested list. However, if you would like to visualize it without R (e.g. in your favourite spreadsheet program), you can choose to export a nested list usingwrite_nested_list
. It will keep its hierarchical structure by writing it to directories that have the same structure as the the nested list. Here’s an example:write_nested_list(list = P1_homologues, directory = "P1_homologues")
Next, we use linkage to the DxN markers to provide bridging linkages
between homologue clusters, using the bridgeHomologues
function. The first time we run this function, we use a LOD threshold of
4 for evidence of linkage between a SxN and a DxN marker. However, there
may also be some false positive linkages at this level. It might
therefore be necessary to run the function again at a higher LOD (say
LOD 8) to avoid such false positives. In doing so, we might lose true
linkage information. Therefore, it’s a good idea to run the function at
LOD 4 first, then increase to a higher LOD if necessary, and build up a
picture of which clusters form part of the same chromosome. First
calculate linkage between SxN and DxN markers:
SN_DN_P1 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
markertype2 = c(2,0),
which_parent = 1,
ploidy = 4,
pairing = "random")
Then, use bridgeHomologues
to use these SxN - DxN
linkages to find associations (“bridges”) between homologue clusters.
Note that we must provide a particular division of the Simplex x
Nulliplex data - in this example, we use the division at LOD 6 which
gave us 20 clusters (our 20 putative homologues):
LGHomDf_P1_1 <- bridgeHomologues(cluster_stack = P1_homologues_1[["6"]],
linkage_df = SN_DN_P1,
LOD_threshold = 4,
automatic_clustering = TRUE,
LG_number = 5,
parentname = "P1")
This results in an ideal situation of 5 linkage groups each with exactly 4 homologues:
##
## 1 2 3 4
## 1 13 9 4 10
## 2 12 7 9 8
## 3 3 13 5 7
## 4 4 10 6 11
## 5 17 12 6 6
However, clustering of P2 is a bit more difficult. The following
re-does everything for P2. Note that for clarity, we include the
arguments parent1
and parent2
here, but we
could have just as well left them out (as they are the default column
names, see ?linkage
):
SN_SN_P2 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
parent1 = "P1",
parent2 = "P2",
which_parent = 2,
ploidy = 4,
pairing = "random"
)
SN_SN_P2_coupl <- SN_SN_P2[SN_SN_P2$phase == "coupling",] # get only markerpairs in coupling
P2_homologues <- cluster_SN_markers(linkage_df = SN_SN_P2_coupl,
LOD_sequence = c(3:12),
LG_number = 5,
ploidy = 4,
parentname = "P2",
plot_network = FALSE,
plot_clust_size = FALSE)
## Total number of edges: 743
## 20 clusters were expected.
SN_DN_P2 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
markertype2 = c(2,0),
which_parent = 2,
ploidy = 4,
pairing = "random")
LGHomDf_P2 <- bridgeHomologues(cluster_stack = P2_homologues[["6"]],
linkage_df = SN_DN_P2,
LOD_threshold = 4,
automatic_clustering = TRUE,
LG_number = 5,
parentname = "P2")
##
## 1 2 3 4 5
## 1 11 9 3 5 0
## 2 13 13 7 11 0
## 3 13 11 11 8 2
## 4 9 9 13 11 0
## 5 10 10 6 9 0
Occasionally linkage groups possesses fewer than four “homologues”.
This can result in a loss of phase information in later steps, but will
not prevent us from proceeding with the mapping. However, it is also
possible that more than 4 homologues are grouped together, likely the
result of having a number of fragmented homologues. However, as we don’t
know which clusters are actual homologues, and which are fragments, we
must examine the affected linkage group in more detail.
polymapR
currently offers two methods for this.
overviewSNlinks
Run overviewSNlinks
as follows:
## Warning: Computation failed in `stat_binhex()`.
## Computation failed in `stat_binhex()`.
## Computation failed in `stat_binhex()`.
## Computation failed in `stat_binhex()`.
## Computation failed in `stat_binhex()`.
## Computation failed in `stat_binhex()`.
## Computation failed in `stat_binhex()`.
## Computation failed in `stat_binhex()`.
## Computation failed in `stat_binhex()`.
## Computation failed in `stat_binhex()`.
## Caused by error in `compute_group()`:
## ! The package "hexbin" is required for `stat_bin_hex()`.
In cases of fragmented homologues, it may be possible to identify
which fragments belong together by examining the phase of intra-cluster
marker pairs. Coupling-phase (green) linkage between fragments can be an
indication that separate homologue fragments belong together. This can
be then be used to merge these homologues (specified in the argument
mergeList
- use ?merge_homologues
for more
information). For example, if we wanted to merge homologues 4 and 5 of
linkage group 3 as the above data suggests, we would use the
following:
LGHomDf_P2_1 <- merge_homologues(LG_hom_stack = LGHomDf_P2,
ploidy = 4,
LG = 3,
mergeList = list(c(4,5)))
## #### Number of markers per homologue:
##
## | | 1| 2| 3| 4| 5|
## |:--|--:|--:|--:|--:|--:|
## |1 | 11| 13| 13| 9| 10|
## |2 | 9| 13| 11| 9| 10|
## |3 | 3| 7| 11| 13| 6|
## |4 | 5| 11| 10| 11| 9|
cluster_per_LG
cluster_per_LG
does not rely on markers in repulsion,
but uses variation in LOD thresholds between markers in coupling. Run it
as follows:
cluster_per_LG(LG = 3,
linkage_df = SN_SN_P2[SN_SN_P2$phase == "coupling",],
LG_hom_stack = LGHomDf_P2,
LOD_sequence = c(3:10), # The first element is used for network layout
modify_LG_hom_stack = FALSE,
network.layout = "stacked",
nclust_out = 4,
label.offset=1.2)
## Total number of edges: 185
The network layout is based on the first LOD score you provide in
LOD_sequence
(which is LOD = 3 in this example). The darker
the background, the more consistent the clustering is over the range of
LOD scores. Change the LOD_sequence
if the network is not
as required (4 clusters in the case of a tetraploid) and re-run. If the
network looks as you would like it to, use the
cluster_per_LG
function but with a
LOD_sequence
of length 1 (the first LOD score, so again LOD
= 3 in this example), and specify
modify_LG_hom_stack = TRUE
to make a modification to the
cluster numbering (the output is then returned and can be saved as a new
LG_hom_stack object):
LGHomDf_P2_1 <- cluster_per_LG(LG = 3,
linkage_df = SN_SN_P2[SN_SN_P2$phase == "coupling",],
LG_hom_stack = LGHomDf_P2,
LOD_sequence = 3,
modify_LG_hom_stack = TRUE,
network.layout = "n",
nclust_out = 4)
## Total number of edges: 185
##
## 1 2 3 4 5
## 1 11 13 10 9 10
## 2 9 13 13 9 10
## 3 3 7 11 13 6
## 4 5 11 11 11 9
polymapR
is also able to map triploid populations, and
may be extended to other cross-ploidy populations if there is demand for
that in future. Tetraploid x diploid crosses are however the most common
(often done to induce seedlessness in the F1). Previously we had
implemented a mapping approach that assumes that the parent of higher
ploidy level is parent 1. However, since polymapR v.1.1.5 this is no
longer the case, so a 4x2 and 2x4 cross are both acceptable.
The next point to be aware of is that the segregation of markers in the diploid parent is assumed to follow disomic inheritance (there is no other option in a diploid), whereas the segregation of the tetraploid parent is assumed to follow tetrasomic inheritance. Mixtures with preferential pairing in the tetraploid parent are currently not implemented for triploid populations.
Because of this disomic behaviour in the diploid parent, we are
unable to use the normal clustering function
(cluster_SN_markers
) for identifying the homologues.
Proceed as normal with the steps described above for the tetraploid
parent. However in order to proceed correctly for the diploid parent 2,
the clustering proceeds a bit differently.
If you are confused about what marker types are possible in a triploid population, run the following, which gives all possible marker combinations (and the segregation ratios in the F1):
## dosage1 dosage2 f1_0 f1_1 f1_2 f1_3
## 1 0 0 1 0 0 0
## 2 0 1 1 1 0 0
## 3 0 2 0 1 0 0
## 4 1 0 1 1 0 0
## 5 1 1 1 2 1 0
## 6 1 2 0 1 1 0
## 7 2 0 1 4 1 0
## 8 2 1 1 5 5 1
## 9 2 2 0 1 4 1
## 10 3 0 0 1 1 0
## 11 3 1 0 1 2 1
## 12 3 2 0 0 1 1
## 13 4 0 0 0 1 0
## 14 4 1 0 0 1 1
## 15 4 2 0 0 0 1
To run the preliminary linkage analysis of 1x0 markers in the diploid
parent, we need to specify the ploidy of both parents,
e.g. ploidy = 4
and ploidy2 = 2
(or vice
versa). To demonstrate this, we need to load a special dataset with
triploid data, a sample of which is available from the
polymapR
package also:
data("TRI_dosages")
# Estimate the linkage in the diploid parent (assuming this has been done for the 4x parent already):
SN_SN_P2.tri <- linkage(dosage_matrix = TRI_dosages,
markertype1 = c(1,0),
parent1 = "P1",
parent2 = "P2",
which_parent = 2,
ploidy = 4,
ploidy2 = 2,
pairing = "random"
)
Have a look at the plot of r versus LOD score - here you should notice that both the coupling and repulsion phases are equally informative now and cannot therefore be separated by LOD score as before:
## Warning: Computation failed in `stat_binhex()`.
## Caused by error in `compute_group()`:
## ! The package "hexbin" is required for `stat_bin_hex()`.
If we try to cluster using the old function, we get the following result - no separation of the homologues, but the chromosomal linkage groups are identified:
P2_homologues.tri <- cluster_SN_markers(linkage_df = SN_SN_P2.tri,
LOD_sequence = seq(3, 10, 1),
LG_number = 3,
ploidy = 2, #because P2 is diploid..
parentname = "P2",
plot_network = FALSE,
plot_clust_size = FALSE)
## Total number of edges: 345
## 3 clusters were expected.
To separate the 0x1 markers into homologues, we use the phase
information from the linkage analysis directly, using the
phase_SN_diploid
function. To learn more about how this
function works, input ?phase_SN_diploid
first, to get an
idea of the arguments used.
LGHomDf_P2.tri <- phase_SN_diploid(linkage_df = SN_SN_P2.tri,
cluster_list = P2_homologues.tri,
LOD_chm = 4, #LOD at which chromosomes are identified
LG_number = 3) #number of linkage groups
## Total number of edges: 161
## Complete phase assignment possible using only coupling information at LOD 4
The rest of the analysis should be pretty much the same as for
tetraploids or hexaploids, although you should check each time you use a
new function whether that function has a ploidy2
argument,
and if it does, use it!
Assigning simplex x simplex markers to a homologue and linkage group
is done by calculating linkage between SxS and SxN markers and after
that assigning them based on this linkage using
assign_linkage_group
:
SN_SS_P1 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
markertype2 = c(1,1),
which_parent = 1,
ploidy = 4,
pairing = "random")
P1_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P1,
LG_hom_stack = LGHomDf_P1,
SN_colname = "marker_a",
unassigned_marker_name = "marker_b",
phase_considered = "coupling",
LG_number = 5,
LOD_threshold = 3,
ploidy = 4)
##
## #### Marker(s) showing ambiguous linkage to more than one LG:
##
## |_ |
## |:-----------|
## |Ap_ts069042 |
##
## In total, 296 out of 301 markers were assigned.
##
## #### Marker(s) not assigned:
##
## |_ |_ |_ |_ |
## |:-----------|:-----------|:-----------|:-----------|
## |Ac_ws033297 |Ap_ts089267 |St_ns048176 |St_ts030296 |
## |Zm_ws010615 | | | |
## Assigned_LG LG1 LG2 LG3 LG4 LG5 Hom1 Hom2 Hom3 Hom4 Assigned_hom1
## Ac_ns002135 4 0 0 0 7 0 0 7 0 0 2
## Ac_ns002510 2 0 7 0 0 0 0 0 0 7 4
## Ac_ns024650 2 0 11 0 0 0 11 0 0 0 1
## Ac_ns028519 5 0 0 0 0 5 0 5 0 0 2
## Ac_ns028533 5 0 0 0 0 3 0 0 3 0 3
## Ac_ns029178 3 0 0 12 0 0 0 12 0 0 2
## Assigned_hom2 Assigned_hom3 Assigned_hom4
## Ac_ns002135 NA NA NA
## Ac_ns002510 NA NA NA
## Ac_ns024650 NA NA NA
## Ac_ns028519 NA NA NA
## Ac_ns028533 NA NA NA
## Ac_ns029178 NA NA NA
SN_SS_P2 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
markertype2 = c(1,1),
which_parent = 2,
ploidy = 4,
pairing = "random")
P2_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P2,
LG_hom_stack = LGHomDf_P2_1,
SN_colname = "marker_a",
unassigned_marker_name = "marker_b",
phase_considered = "coupling",
LG_number = 5,
LOD_threshold = 3,
ploidy = 4)
##
## In total, 300 out of 300 markers were assigned.
As simplex x simplex markers are present in both parents, we can define which linkage groups correspond with each other between parents. After this, one of the linkage groups of the parents should be renamed and the SxS markers assigned again according to the new linkage group names:
LGHomDf_P2_2 <- consensus_LG_names(modify_LG = LGHomDf_P2_1,
template_SxS = P1_SxS_Assigned,
modify_SxS = P2_SxS_Assigned)
##
## #### Original LG names
##
## | | 1| 2| 3| 4| 5|
## |:--|--:|--:|--:|--:|--:|
## |1 | 0| 0| 0| 47| 0|
## |2 | 64| 0| 0| 0| 0|
## |3 | 0| 0| 58| 0| 0|
## |4 | 0| 58| 0| 0| 0|
## |5 | 0| 0| 0| 0| 68|
##
## #### Modified LG names
##
## | | 1| 2| 3| 4| 5|
## |:--|--:|--:|--:|--:|--:|
## |1 | 47| 0| 0| 0| 0|
## |2 | 0| 64| 0| 0| 0|
## |3 | 0| 0| 58| 0| 0|
## |4 | 0| 0| 0| 58| 0|
## |5 | 0| 0| 0| 0| 68|
P2_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P2,
LG_hom_stack = LGHomDf_P2_2,
SN_colname = "marker_a",
unassigned_marker_name = "marker_b",
phase_considered = "coupling",
LG_number = 5,
LOD_threshold = 3,
ploidy = 4)
##
## In total, 300 out of 300 markers were assigned.
Since we now have a consistent linkage group numbering, we can also assign the DxN markers:
P1_DxN_Assigned <- assign_linkage_group(linkage_df = SN_DN_P1,
LG_hom_stack = LGHomDf_P1,
SN_colname = "marker_a",
unassigned_marker_name = "marker_b",
phase_considered = "coupling",
LG_number = 5,
LOD_threshold = 3,
ploidy = 4)
##
## In total, 111 out of 111 markers were assigned.
P2_DxN_Assigned <- assign_linkage_group(linkage_df = SN_DN_P2,
LG_hom_stack = LGHomDf_P2_2,
SN_colname = "marker_a",
unassigned_marker_name = "marker_b",
phase_considered = "coupling",
LG_number = 5,
LOD_threshold = 3,
ploidy = 4)
##
## In total, 101 out of 101 markers were assigned.
Now that we have a backbone of SxN markers with consistent linkage
group names, it is time to assign all other marker types to a linkage
group and homologue using their linkage to SxN markers. Since we already
did this for DxN and SxS markers, there is no need to re-do this work
for these marker types. The function
homologue_lg_assignment
finds all markertypes that have not
been assigned yet, does the linkage analysis and assigns them to a
linkage group and homologue. As linkage is calculated for a lot of
marker combinations, this might take a while. Again, note that since
v.1.1.5, we need to specify which_parent
we are interested
in, similar to the linkage
function that is ultimately
being called multiple times in the background here.
marker_assignments_P1 <- homologue_lg_assignment(dosage_matrix = filtered_data,
assigned_list = list(P1_SxS_Assigned,
P1_DxN_Assigned),
assigned_markertypes = list(c(1,1), c(2,0)),
LG_hom_stack = LGHomDf_P1,
which_parent = 1,
ploidy = 4,
pairing = "random",
convert_palindrome_markers = FALSE,
LG_number = 5,
LOD_threshold = 3,
write_intermediate_files = FALSE
)
Also do this for P2:
marker_assignments_P2 <- homologue_lg_assignment(dosage_matrix = filtered_data,
assigned_list = list(P2_SxS_Assigned,
P2_DxN_Assigned),
assigned_markertypes = list(c(1,1), c(2,0)),
LG_hom_stack = LGHomDf_P2_2,
which_parent = 2,
ploidy = 4,
pairing = "random",
convert_palindrome_markers = TRUE,
LG_number = 5,
LOD_threshold = 3,
write_intermediate_files = FALSE
)
Next, to make sure the marker linkage group assignment is consistent
across parents we run the check_marker_assignment
function,
removing any bi-parental markers if they show linkage to different
chromosomes (which suggests a problem with these markers):
Since all markers have been assigned to a linkage group, we now do
the linkage calculations per linkage group of every marker type
combination (so not only with SxN markers). This is done with the
finish_linkage_analysis
function:
all_linkages_list_P1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P1,
dosage_matrix = filtered_data,
which_parent = 1,
convert_palindrome_markers = FALSE,
ploidy = 4,
pairing = "random",
LG_number = 5)
all_linkages_list_P2 <- finish_linkage_analysis(marker_assignment = marker_assignments$P2,
dosage_matrix = filtered_data,
which_parent = 2,
convert_palindrome_markers = TRUE, # convert 3.1 markers
ploidy = 4,
pairing = "random",
LG_number = 5)
The output is returned in a list:
The basis of producing linkage maps has now been achieved, that is
the pairwise recombination frequencies have been estimated and LOD
scores for these estimates have been calculated for all markers. We rely
on a package developed by Katherine Preedy and Christine Hackett [3], called using the function
MDSMap_from_list
. This function applies the
estimate.map
function from the MDSMap
package
to a list of linkage dataframes. Below, we are using the default
settings of estimate.map
, however, they can be changed by
supplying extra arguments.
By calling the MDSMap_from_list
function you
automatically produce the input .txt files that MDSMap
requires in the correct format. Note that unless you provide a different
folder name (by specifying mapdir
), the files will be
overwritten each time you run the function. We generally recommend using
the default mapping settings of the estimate.map
function
(so using the method of principal curves in 2 dimensions with
LOD2 as weights), but there are a number of options that can
be specified which are described in the manual of that package (check
out https://cran.R-project.org/package=MDSMap). Output plots
can be saved as .pdf files by specifying
write_to_file = TRUE
, in the same mapdir
folder that contains map input files as well.
To create an integrated map of all linkage groups, we first combine
the linkage information together and then run the
MDSMap_from_list
function which prepares the files and
passes them to MDSMap
for the mapping:
If you ran the function screen_for_duplicate_markers
before mapping, you may want to add back the duplicate markers to the
map. This can be achieved using the add_dup_markers
function. If you also want to include the duplicate markers on a phased
map file (next section), then you also need to update the
marker_assignments as well (otherwise, just leave the marker_assignments
argument out). This is done as follows:
complete_mapdata <- add_dup_markers(maplist = integrated.maplist,
bin_list = screened_data4$bin_list,
marker_assignments = marker_assignments)
Note that the output of this function is again a list, which we might like to re-assign as follows:
integrated.maplist_complete <- complete_mapdata$maplist
marker_assignments_complete <- complete_mapdata$marker_assignments
Note that if we are producing a phased maplist, we also need the
marker dosages matrix to be present (in order to check whether the
phasing corresponds with the parental scores). Therefore it is no longer
appropriate to use filtered_data
if we have added back
duplicates (where the duplicate markers have been removed). Using
screened_data3
from earlier would be the correct
choice.
One final step that is useful is to generate phased linkage maps from
the integrated linkage map and marker assignments. This provides
information on the coverage of markers across the parental homologues.
To phase the markers, we use the create_phased_maplist
function as follows:
phased.maplist <- create_phased_maplist(maplist = integrated.maplist,
dosage_matrix.conv = filtered_data,
N_linkages = 5,
ploidy = 4,
marker_assignment.1 = marker_assignments$P1,
marker_assignment.2 = marker_assignments$P2)
The N_linkages
option specifies the minimum number of
significant linkages of a marker to a chromosomal linkage group to be
confident of its assignment. Note that this level of significance
(e.g. LOD > 3) was already defined in the
homologue_lg_assignment
function earlier. There are a
number of other arguments with create_phased_maplist
,
including original_coding
which produces a phased mapfile
in the original, uncoverted format. This may have advantages for
tracking marker alleles, although it has the disadvantage that the
homologues appear to be more saturated with markers than they actually
are!
A well-know software program for plotting maps is MapChart [4]. polymapR
can write out
MapChart compatible files (.mct) using the function
write.mct
. There are many plotting options available in
MapChart, here we only currently incorporate a single option, namely
showMarkerNames
, which by default is FALSE
in
anticipation of high-density maps. Further formatting can be achieved
within the MapChart environment itself.
However, polymapR
also comes with its own simple
built-in function to plot maps, namely the plot_map
function:
We might also want to visualise the integrated map output using the
plot_phased_maplist
function (but here we limit it to a
single linkage group). Note that we need to have first used the
create_phased_maplist
, described earlier.
plot_phased_maplist(phased.maplist = phased.maplist[1], #Can plot full list also, remove "[1]"
ploidy = 4,
cols = c("black","grey50","grey50"))
This function can also visualise phased hexaploid maps, or triploid
maps (tetraploid x diploid) if ploidy2
is specified.
MDSMap
produces output that can be examined each time a
map is produced, namely the principal coordinate plots with the
principal curve (on which the map is based) - highlighting possible
outlying markers that do not associate well with the rest of the
markers, and a plot of the nearest-neighbour fit (nnfit) for the markers
along each linkage group, giving an indication of markers with high
stress. Both of these can help the user to evaluate the quality of the
map (for more details, we recommend reading the MDSMap
publication [3]) and often results in a
number of rounds of mapping, where problematic markers are identified
and removed followed by subsequent remapping and re-evaluation. Once the
maps have been produced it is also an idea to get a general overview of
the map quality using the check_map
function. Note that by
default this function expects lists as input arguments, enabling
diagnostics to be run over all maps. Here we run it for a single linkage
group, LG1:
For each map produced, there are 3 diagnostic plots. The first of
these shows the differences between the pairwise estimates of
recombination frequency and the effective estimate of recombination
frequency based on the map itself (the multi-point estimate of the
recombination frequency) . These differences are compared to the LOD
score for each pairwise estimate. An overall estimate of the weighted
root mean square error of these differences (weighted by the LOD scores)
is printed on the console. In general, we expect that if the difference
between the expected and realised recombination frequency is high, the
LOD should be low. If this is not the case, something went wrong in the
final mapping step. The second plot shows the comparison between a
marker’s map position and the recombination frequency estimates to all
other markers. In general we expect that small recombination frequency
estimates come from nearby markers and should therefore be concentrated
around the diagonal, which are shown as light green regions. Areas of
light green off the diagonal suggest a sub-optimal map order. The third
plot is similar to the previous except that LOD values are shown in
place of recombination frequencies. Again, we expect a concentration of
higher LOD scores around the diagonal. By default, LOD scores less than
5 are not shown to make things clearer, but this value can be varied
using the lod.thresh
argument.
Up to now, we have been working under the assumption of random bivalent pairing in the parents. However, it has been shown that in certain species the pairing behaviour is neither completely random nor completely preferential (as we have in allopolyploids) but something in-between, a condition termed segmental allopolyploidy [5]. A study on this topic in rose [6] found evidence of disomic behaviour in certain regions of the genome, with tetrasomic behaviour everywhere else. This sort of mixed inheritance can complicate the analysis and in cases where this effect is particularly pronounced, it is probably unwise to ignore [7].
The polymapR
package can currently accommodate
preferential pairing in tetraploid species only, by correcting the
recombination frequency estimates once the level of preferential pairing
in a parental chromosome is known. In order to determine whether this is
necessary, we can run some diagnostic tests using the marker data after
an initial map has been produced which can inform our choice regarding
whether to include a preferential pairing parameter or not (in a
subsequent re-analysis).
First, a word of warning. There is a function to test for
preferential pairing among pairs of closely-linked simplex x nulliplex
markers in repulsion phase within polymapR
. However, it is
certainly also advised to use multi-point methods to test for
preferential pairing (if possible). The only possible drawback is that
multi-point methods (using all markers across a chromosome) generally
assume uniform pairing behaviour across the length of a chromosome,
which may not always occur [6]). A robust
multi-point approach as described in [6]
used identity-by-descent probabilities for the population, estimated
using TetraOrigin [8], which estimates the
most likely pairing behaviour per individual and can therefore reveal
whether there are deviations from the assumption of random chromosomal
pairing and recombination across the population.
The function test_prefpairing
examines closely-linked
repulsion-phase simplex x nulliplex marker pairs only, and is therefore
the results of this function may not be as robust as a
multi-point whole-chromosome approach. Be that as it may, it is probably
the simplest check we can perform post-mapping for the possibility of
unusual pairing behaviour. The method used here is described also in
Bourke et al.[6], which is a development
on ideas originally described by Qu and Hancock[9]. The idea is to look for signatures of
preferential pairing in repulsion-phase pairs which map to the same
locus.
We first need to define a minimum distance by which to consider
markers essentially at the same locus. This is somewhat arbitrary - in
our study of tetraploid rose, we used pairs of duplicated individuals to
determine an approximate error rate which, along with considerations of
the size of the population and rate of missing values, led us to
conclude that distances less than ~1 cM were close enough to be
informative (and for which we could assume 0 recombinations). The
smaller the distance we choose here the better, although by choosing a
very small distance we are limiting the number of repulsion-pairs
available. By default in test_prefpairing
we use a value of
0.5 cM. This can be increased or decreased as required using the
min_cM
argument.
P1.prefPairing <- test_prefpairing(dosage_matrix = ALL_dosages,
maplist = integrated.maplist,
LG_hom_stack = LGHomDf_P1_1,
min_cM = 1, #changed from default of 0.5 cM
ploidy = 4)
head(P1.prefPairing)
The output shows the repulsion-phase marker pairs tested. The most
important column is probably the column P_value.adj
, which
gives the FDR-adjusted P-values from a Binomial test of the hypothesis
that the disomic repulsion estimate of recombination frequency differs
significantly from $\frac{1}{3}$
(actually it is a one-sided test, so we are testing that r
disom falls below $\frac{1}{3}$).
In cases where this is significant (with P_value.adj
<
0.01, say), we might have some evidence for preferential pairing (but
from that marker pair only!). The output of
test_prefpairing
also lists the markers, their positions
and on which homologues they have been mapped. There is also an estimate
of the pairwise preferential pairing parameter pref.p
. Note
that there are two possible (maximum-likelihood) estimators of a
preferential pairing parameter in the case of a pair of simplex x
nulliplex markers, depending on whether the marker alleles reside on
homologous chromosomes, or homoeologous chromosomes (i.e.
whether the pair are contained within a “subgenome”, or are straddled
across “subgenomes”, to borrow the language of allopolyploids for
convenience). If there is strong evidence of preferential pairing from
multiple marker pairs, the approach we recommend is to take the average
of the pref.p
values for a particular chromosomal linkage
group, which becomes one of the parameters p1
or
p2
in the linkage
function.
Here, we have purely random pairing in our parents, so the estimates
of pref.p
should not be considered meaningful. The
preferential parameter p
is defined such that $0 < p < \frac{2}{3}$. Supposing this
were not the case, and we wanted to estimate the strength of
preferential pairing on LG1 of parent 1, we would do something like the
following:
Note that again we require first that there be significant evidence
of disomic behaviour before including the estimate. If there is evidence
of preferential pairing, it is probably a good idea to go back and
reduce the min_cM
argument even further, since we really
want this value to be as close to 0 as possible for an accurate estimate
of a preferential pairing parameter.
Finally, to re-run the linkage analysis with this estimate for parent 1 of linkage group 1, we would first have to subset our marker dosage data corresponding to that linkage group only (since we do not need to re-analyse any other chromosome!), and then re-run the linkage analysis. Note that in cases with extreme levels of preferential pairing, the clustering into separate homologues can also be affected. Supposing we had estimated a value of 0.25 for the preferential pairing parameter we could proceed as follows:
lg1_markers <- unique(c(rownames(marker_assignments_P1[marker_assignments_P1[,"Assigned_LG"] == 1,]),
rownames(marker_assignments_P2[marker_assignments_P2[,"Assigned_LG"] == 1,])))
all_linkages_list_P1_lg1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P1[lg1_markers,],
dosage_matrix = filtered_data[lg1_markers,],
which_parent = 1,
convert_palindrome_markers = FALSE,
ploidy = 4,
pairing = "preferential",
prefPars = c(0.25,0), #just for example!
LG_number = 1 #interested in just 1 chm.
)
all_linkages_list_P2_lg1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P2[lg1_markers,],
dosage_matrix = filtered_data[lg1_markers,],
which_parent = 2,
convert_palindrome_markers = FALSE,
ploidy = 4,
pairing = "preferential",
prefPars = c(0,0.25), #Note that this is in reverse order now.
LG_number = 1
)
The package polyqtlR
, available through CRAN, has been
developed for QTL analysis in outcrossing polyploid populations, using
IBD probabilities derived using the phased linkage maps developed in
polymapR
.
Alternatively, the TetraploidSNPMap [10] software provides the possibility of not
just generating linkage maps but performing IBD-based interval QTL
mapping as well. The creation of input files for TetraploidSNPMap is
simple once phased mapfiles have been created with the
create_phased_maplist
function (see section 12.3).
To generate these files, use the write.TSNPM
function as
follows:
Other software for linkage mapping and QTL analysis in polyploid
populations includes MAPpoly
and QTLpoly
, as
well as diaQTL
and polyOrigin
.
We hope you have been able to successfully create an integrated map
using the methods described here. We anticipate polymapR
will improve over the coming years (in terms of applicability as well as
finding and removing bugs). Feedback on performance including bug
reporting is most welcome; please refer to the package information on
CRAN for details on who to contact regarding maintenance issues.
polymapR
package.polymapR