---
title: "polymapR: linkage mapping in outcrossing polyploids"
author: "Peter M. Bourke, Geert van Geest"
date: "12-02-2024"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{How to use polymapR}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: bibliography.bib
csl: biomed-central.csl
---
In this tutorial, we will go step by step through the basic steps of performing linkage analysis in polyploids using `polymapR`. Most of the theory behind the functions described here can be found in the *Bioinformatics* publication of Bourke et al. (2018) at https://doi.org/10.1093/bioinformatics/bty371. The example dataset is a simulated dataset of an hypothetical tetraploid outcrossing plant species with five chromosomes. Chromosomes have random pairing behaviour. The population consists of 207 individuals originating from an F1 cross between two non-inbred parents. Frequently occurring data imperfections like missing values, genotyping errors and duplicated individuals are taken into account.
## Table of Contents
1. [A (very) short introduction to R and the `polymapR` package.](#sect1)
2. [Install `polymapR`](#sect2)
3. [Logging function calls and output](#sect3)
4. [Data importing](#sect4)
5. [Generate summary data](#sect5)
6. [Convert marker dosages to simple segregations; remove non-segregating data](#sect6)
7. [Quality checks on marker data](#sect7)
8. [Simplex x nulliplex markers – defining chromosomes and homologues](#sect8)
9. [Assigning SxS and DxN markers and consensus linkage group (LG) names](#sect9)
10. [Assign all other markertypes](#sect10)
11. [Finish the linkage analysis](#sect11)
12. [Marker ordering](#sect12)
13. [Plotting a map](#sect13)
14. [Evaluating map quality](#sect14)
15. [Preferential pairing](#sect15)
16. [QTL analysis](#sect16)
17. [Concluding remarks](#sect17)
18. [References](#sect18)
## 1. A (very) short introduction to R and the `polymapR` package.
In R, a function can be called by typing `function_name()` in the console and pressing `[Enter]`. (Optional) arguments go within the parentheses, followed by a ` = ` sign and the value you want to give to the argument. Arguments are separated by a comma. All possible arguments and default values of a function are in the help file, which is called by typing `?` followed by the function name. Getting the help file for the commonly used function `seq` looks like this:
```{r, eval = FALSE}
?seq
```
And generating a sequence of numbers from 2 to 14 by an increment of 3 using `seq` looks like this:
```{r}
seq(from = 2,
to = 14,
by = 3)
```
Although copy pasting the commands from this document to your R console will do for this tutorial, we recommend to have a bit more experience in using R than above before using `polymapR`. This will allow you better to troubleshoot and check out data output. A good place to start would be https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf or any other good introductory course in R.
The `polymapR` package is literally a collection of functions, each defined by their function name and arguments. It relies on dosage data of molecular markers, and therefore is logically used after the packages `fitTetra` or `fitPoly`. As for any project, it is best to first create a new folder which will be the main working directory for the mapping project. Set the working directory to your new folder (`setwd()` or use the drop-down menu) and copy the marker dosage file from `fitTetra` there.
## 2. Install `polymapR`
`polymapR` is available on CRAN at https://cran.R-project.org/package=polymapR which is probably where you accessed this vignette. As with other packages on CRAN, it can be installed directly from within R using the command:
```{r, eval = FALSE}
install.packages("polymapR")
```
## 3. Logging function calls and output
Most exported functions in `polymapR` have a `log` argument. The default setting is `NULL`. In that case, all messages are returned to the standard output, which means your R console if you are calling the functions interactively. However, if it is set to a file name (*e.g.* `log.Rmd` or `log.txt`) all function calls and output are written to that file. It automatically creates one if it does not exist yet. The log file is written in R markdown (https://rmarkdown.rstudio.com/). The advantage of markdown is that it is readable as plain text, but it can also easily be rendered into a nicely formatted html or pdf. In order to do so, give your logfile a .Rmd extension and use the `knitr` package to turn it into a pdf, docx or html.
## 4. Data importing
### 4.1 Reading in marker dosage data
We assume that the user has used `fitPoly` or some alternative method to convert genotyping information (*e.g.* signal intensity ratios or read depths) into marker dosages. The standard version of `polymapR` assumes that such genotypes are discrete, but it is also possible to use probabilistic genotypes since version 1.1.0 (see other vignette). In this tutorial we assume such discrete dosage values are the starting point, but we will impose some further quality checks before we can begin mapping (such as allowed numbers of missing values). We also need to check the markers for skewness (differences in the observed and expected marker segregation) which can cause problems later in the clustering. For this, we have incorporated a number of functions developed for the `fitPoly` package [@Voorrips2011], and included in the currently-unpublished R package `fitPolyTools` developed by Roeland Voorrips.
> ###### But first, a note on polyploid terminology
> In the `polymapR` package, and throughout this vignette, we often refer to terms like "simplex", "duplex" or "nulliplex" *etc*. These terms describe the dosage scores of SNP markers, essentially the count of the number of "alternative" alleles present at a particular locus in an individual. A simplex x nulliplex marker is therefore a marker that for which the maternal dosage is 1 (mothers always come first) and paternal dosage is 0. To save space, we also refer to these markers as SxN markers, which is synonymous with 1x0 markers. Sometimes statements are made concerning simplex x nulliplex markers - it should be understood that these statements also usually apply to nulliplex x simplex markers - markers for which the segregating allele is carried by the father rather than the mother.
The first step is to import the marker data. The layout of the dataset should be marker name, parental scores followed by offspring scores (so-called wide data format):
```{r, echo=FALSE}
knitr::kable(
data.frame(Marker=paste0("mymarker", c(1:3)),
P1=c(1,2,0),
P2=c(0,4,3),
F1.001=c(1,3,2),
F1.002..=c(1,2,2)
)
)
```
In cases where the parents have been genotyped more than once, a consensus parental score is required. Missing parental scores are not allowed at this stage, and should have been imputed already if desired.
The data should be in some format that can be read easily into R, such as .csv, .dat or .txt file. In the example a .csv file is read in. For other file types check out `?read.delim`.
```{r, eval = FALSE}
ALL_dosages <- read.csv("tetraploid_dosages.csv",
stringsAsFactors = FALSE,
row.names = 1) #first column contains rownames
```
```{r, echo = FALSE, message = FALSE, error = FALSE}
library(polymapR)
data(ALL_dosages)
ALL_dosages <- as.data.frame(ALL_dosages)
```
```{r}
class(ALL_dosages)
```
By default, `read.csv` returns data in a data frame. All `polymapR` functions that use dosage data only accept dosage data when the dosages are in a matrix. A matrix is a two-dimensional R object with all elements of the same type (*e.g.* character, integer, numeric). Markernames and individual names should be specified in rownames and columnnames respectively. To use the data with functions, the data should be converted from a data.frame to a matrix by the `as.matrix` function:
```{r}
ALL_dosages <- as.matrix(ALL_dosages)
class(ALL_dosages)
head(ALL_dosages[,1:5])
dim(ALL_dosages)
```
If you are testing the package and would like a sample dataset to work with, a set of tetraploid marker dosage data is provided within the package itself. First load the package and then the dosage data as follows:
```{r, message = FALSE, error = FALSE}
library(polymapR)
data(ALL_dosages)
```
### 4.2 Checking for skewness
The next step before we proceed further is to check whether the marker scores in the F1 correspond to those expected according to the parental dosages. We do this using the `checkF1` function (from `fitPolyTools`). However, we first load the `polymapR` package into the global environment:
```{r, message = FALSE, error = FALSE}
library(polymapR)
```
Now run `checkF1`:
```{r, eval = FALSE}
F1checked <- checkF1(dosage_matrix = ALL_dosages,parent1 = "P1",parent2 = "P2",
F1 = colnames(ALL_dosages)[3:ncol(ALL_dosages)],
polysomic = TRUE, disomic = FALSE, mixed = FALSE, ploidy = 4)
head(F1checked$checked_F1)
```
```{r, echo = FALSE}
load("polymapR_vignette.RData")
head(F1checked) #old version
```
There is the possibility to check for shifted markers (incorrectly genotyped) but we will not concern ourselves with that here. The main arguments to be specified are what the parental identifiers are (`parent1` and `parent2`), what the identifiers of the F1 population are (argument `F1` - note we just take the column names of the dosages matrix from column 3 to the end), and which sort of inheritance model we wish to compare against (here we have specified `polysomic` to be `TRUE` and both `disomic` and `mixed` to be `FALSE`. Note that `mixed` allows for one parent to be fully disomic and the other fully polysomic, but does not refer to the situation of a mixed inheritance pattern within one parent, associated with segmental allopolyploidy - this is because we have no fixed expectation to compare with if this is the case). The argument `ploidy` specifies the ploidy level of parent 1, if parent 2 has a different ploidy (for instance in a tetraploid x diploid cross) then we can use the `ploidy2` argument. The results are stored in the list `F1checked`, which has two elements: `meta` which holds meta-data regarding the function call, and `checked_F1`, the actual function output. We are interested in seeing whether there is consistency between the parental scores and the F1. In general, if there is good consistency between the parental and offspring scores we will have a high value for `qall_mult` and `qall_weights`, which are aggregated quality measures of `q1`, `q2` and `q3`. In the sample datatset provided with `polymapR` there are no skewed markers present. However, you may find it useful to examine the output of `checkF1` and remove markers which show conflict between the parental and offspring scores, as these markers will likely cause issues in the later mapping steps.
### 4.3 Running a PCA
It can also be useful to run a principal component analysis at this stage using the `PCA_progeny` function on the marker dosages:
```{r, fig.width = 6, fig.height = 6}
PCA_progeny(dosage_matrix = ALL_dosages,
highlight = list(c("P1", "P2")),
colors = "red")
```
In cases where there are serious issues with the population (such as parental mix-ups, inclusion of unrelated samples with the bi-parental population, multiple sub-populations from *e.g.* multiple pollen-donors *etc*) a PCA plot can help. Here, we see no such problems so we can proceed.
## 5. Generate summary data
We now summarise the marker data in terms of the different segregation types, number of missing values, number of inadmissible scores (which could be genotyping errors or possible double reduction scores). Run the summary function on the marker data as follows:
```{r, fig.width = 7, fig.height = 5}
mds <- marker_data_summary(dosage_matrix = ALL_dosages,
ploidy = 4,
pairing = "random",
parent1 = "P1",
parent2 = "P2",
progeny_incompat_cutoff = 0.05)
pq_before_convert <- parental_quantities(dosage_matrix = ALL_dosages,
las = 2)
```
If you want to know exactly what these functions are doing and what can be typed as an argument just use `?marker_data_summary` and/or `?parental_quantities`.
## 6. Convert marker dosages to simple segregations; remove non-segregating data
Simplex x nulliplex markers segregate in a 1:1 ratio in the offspring, but this segregation is also possible with triplex x nulliplex, triplex x quadriplex and simplex x quadriplex, and the segregating allele in each case comes from the same parent. We can therefore re-code all of these segregation types into the simplest 1 x 0 case, taking care that we handle possible double reduction scores correctly.
A list of the conversions in tetraploid we apply which simplify the subsequent analysis and mapping work are:
| |
|:-----------------|
|3x0,3x4,1x4 -> 1x0|
|0x3,4x3,4x1 -> 0x1|
| 2x4 -> 2x0 |
| 4x2 -> 0x2 |
| 3x3 -> 1x1 |
| 3x1 -> 1x3 |
| 3x2 -> 1x2 |
| 2x3 -> 2x1 |
Note that some classes such as 2x2 cannot be converted into any other category, and we do not convert non-segregating classes such as 4x4 as these are of no further use for mapping or QTL analysis.
To convert the marker data:
```{r, fig.width = 7, fig.height = 5}
segregating_data <- convert_marker_dosages(dosage_matrix = ALL_dosages, ploidy = 4)
pq_after_convert <- parental_quantities(dosage_matrix = segregating_data)
```
## 7. Quality checks on marker data
A number of quality checks are recommended before we begin mapping. For example, we do not want to include markers which have too many missing values (“NA” scores) as this may be an indication that the marker has performed poorly and may have many errors associated with it. We would also like to screen out individuals that have many missing values as this might be an indication that the DNA quality of that sample was poor, or duplicate individuals which give no extra information and should probably also be removed / combined. Because there are no fixed rules for how stringent we should be with quality checks (5% missing values ok? 10% missing values ok? 20%?), some visual aids are produced in order to help with these decisions.
### 7.1 Missing value rate per marker
We first screen the marker data for missing values using `screen_for_NA_values`. This function allows the user to remove rows or columns with a certain rate of missing values from the dataset. The dosage matrix, which is specified as `segregating_data` contains markers in rows and individuals in columns. In R, rows are usually specified by the number 1, and columns by 2. The argument `margin` lets you specify by margin number whether you want to screen markers (`margin = 1`) or individuals (`margin = 2`) If the argument `cutoff` is set to `NULL`, it takes user input on the rate of markers to be screened. In the example below `cutoff` is specified, but because decisions are made after data inspection, we recommend to decide on the cutoff value after inspection of the data by specifying `cutoff = NULL`. A threshold of 10% missing values is sensible, as a compromise between high-confidence marker data versus keeping a large proportion of the markers for mapping and QTL analysis. In general, the most problematic markers tend to have higher rates of missing values so any reasonable threshold here should be fine. If problems are encountered later in the mapping, it might be worthwhile to re-run this step with a more stringent (*e.g.* 5%) threshold.
```{r, fig.width = 6, fig.height = 6}
screened_data <- screen_for_NA_values(dosage_matrix = segregating_data,
margin = 1, # margin 1 means markers
cutoff = 0.10,
print.removed = FALSE)
```
### 7.2 Missing value rate per individual
As in step 7.1 we can also visualise the rate of missing values per individual, which can help us to make a decision about whether certain individuals should be removed from the dataset before commencing the mapping. In order to screen the individuals for missing values we enter:
```{r, fig.width = 6, fig.height = 6}
screened_data2 <- screen_for_NA_values(dosage_matrix = screened_data,
cutoff = 0.1,
margin = 2, #margin = 2 means columns
print.removed = FALSE)
```
Again, it is not clear what an acceptable threshold for the rate of missing values should be, but we recommend a cut-off rate of 0.1 – 0.15 (10-15%) missing values allowed.
### 7.3 Duplicate individuals
It is also desirable to be able to check whether there are any duplicated individuals in the population as can sometimes happen (or due to mix-ups in the DNA preparation and genotyping). The function `screen_for_duplicate_individuals` does this for you. Note that a duplicate can remain in the dataset, but they add no further information for mapping or QTL analysis, and may even distort the analysis (by giving an inaccurate population size). As for the function `screen_for_NA_values`, cutoff can be set to `NULL` to accept user input. Especially for `screen_for_duplicate_individuals` it makes sense to make a decision on the cut off after inspection of the data and figures.
To screen the data for duplicate individuals, enter:
```{r, fig.width = 6, fig.height = 6}
screened_data3 <- screen_for_duplicate_individuals(dosage_matrix = screened_data2,
cutoff = 0.95,
plot_cor = TRUE)
```
Here we see that there were a number of duplicate pairs of individuals identified (red points in the output graph). Each duplicate pair is merged into a consensus individual (conflicts are made missing), keeping the name of the first individual. This is summarised in the output as well.
### 7.4 Duplicated markers
Finally, we can also check for duplicated markers (those which give us no extra information about a locus) and remove these from the dataset before mapping, using the `screen_for_duplicate_markers` function. This function returns a list with 2 items - the filtered_dosage_matrix which we are interested in now, and bin_list, which we are possibly interested in again [later](#sect14.2). It is a good idea to save the output of this function, as you might like to add back the duplicate markers after mapping.
```{r}
screened_data4 <- screen_for_duplicate_markers(dosage_matrix = screened_data3)
```
Note that increasingly, there are far too many markers generated in comparison to the population size. It is possible to estimate the genetic resolution of the population you are working with, and from that estimate the expected bin-size in your dataset, assuming your markers are approximately randomly-distributed across the genome (this may not be a good assumption in your case, but as a rough guide this function may still be useful). Considering only bivalent pairing, in a tetrasomic species we can assume approximately four cross-over recombinations occurred, taking the rule-of-thumb guide of one cross-over recombination per bivalent, which has been observed in autotetraploid *Arabidopsis arenosa* for example [@yant2013].
```{r, out.width = "450px", echo = FALSE, fig.align = "center"}
knitr::include_graphics("figures/meiosis_4x.png")
```
We can try to estimate the average bin size if we run the previous function and specifying the argument `estimate_bin_size = TRUE`. The function then does a quick calculation on how many markers would be expected in an average marker bin (under the tenuous assumptions of uniform distributions of both markers and cross-over break-points). In a tetraploid, each offspring would be expected to have 4 cross-over break-points per chromosomal linkage group (2 maternal and 2 paternal), so across a population of *N* individuals there are approximately *4N* breaks per linkage group, and *4NL* such breaks across *L* linkage groups. Supposing there are in total *y* distinct marker types (like 1x0, 1x1, 2x0, 0x1 *etc*), then your *M* markers can be split into *M/4NLy* segregation patterns (assuming no errors), which is an approximate estimate for the average bin size. In datasets with very large numbers of markers (*e.g.* genotyping using NGS), it may be worthwhile to *only* include markers with a minimum number of duplicates present, and that minimum number can be informed by the calculation above. In this case, supposing we demand at least 7 duplicates (so a minimum bin size of 6), we would do something like the following:
```{r eval = FALSE}
reliable.markers <- names(which(sapply(screened_data4$bin_list,length) >= 6))
reliable_data <- screened_data4$filtered_dosage_matrix[reliable.markers,]
```
Of course, this assumes we have a very large number of markers to choose from, as this is a very severe demand in a more traditional, well-balanced dataset. It also presumes there are no genotyping errors. A genotyping error will make duplicate markers appear to be different even when they are not (from a genetic mapping perspective). For now we are ignoring this issue, but may return to it in future updates.
In the current data-set there is quite a good balance between the number of markers and the number of individuals in the mapping population, and so we extract the filtered_dosage_matrix that has the duplicate markers merged:
```{r}
filtered_data <- screened_data4$filtered_dosage_matrix
```
Before we start the mapping, it is a good idea to run the summary function again, this time on the `filtered_data` object, so we have a clear record of the numbers of markers going forward to the mapping stage.
```{r, fig.width = 6, fig.height = 6}
pq_screened_data <- parental_quantities(dosage_matrix = filtered_data)
```
## 8. Simplex x nulliplex markers – defining chromosomes and homologues
The first step in actually mapping the markers is to define linkage groups for the markers. We use the simplex x nulliplex markers to define the linkage groups (chromosomes) and following that, the homologues (4 per linkage group are expected in a tetraploid). We use the LOD scores to cluster markers, which increases dramatically for tight coupling-phase estimates.
To estimate the recombination frequencies, LOD and phasing between all marker pairs of P1 enter:
```{r, eval = FALSE}
SN_SN_P1 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
parent1 = "P1",
parent2 = "P2",
which_parent = 1,
ploidy = 4,
pairing = "random"
)
```
> ###### A note on arguments and multi-core processing
There is much more to the `linkage` function. Checkout `?linkage` to see the arguments that can be passed to it. One important feature is that it can run in parallel on multiple cores. This can be very time-efficient if you are working with a large number of markers. The number of cores to use can be specified using the `ncores` argument. Before doing so, first check the number of cores on your system using `parallel::detectCores()`. Use at least one core less than you have available to prevent overcommitment.
Note that since polymapR version 1.1.5, some of the arguments of the `linkage` function have been updated. The arguments `target_parent` and `other_parent` have been removed, and replaced with `parent1` and `parent2`, referring to the column names of the two parents. A new argument `which_parent` has been introduced, to allow you to specify which parent is being "targeted" for linkage analysis (either 1 or 2). This was needed to accommodate an update for mapping in triploid populations, but also makes it easier if you rename your parental columns "P1" and "P2" at the start, as these are the default values provided for the arguments `parent1` and `parent2`. This is the case here, so for subsequent calls to `linkage` we will omit specifying the parental names each time.
It is a good idea to first visualise the relationship between the recombination frequency and LOD score, which gives us some insight into the differences between the information content of the different phases, as well as highlighting possible problems in the data. This can be done by using the `r_LOD_plot` function:
```{r, fig.width = 6, fig.height = 6}
r_LOD_plot(linkage_df = SN_SN_P1, r_max = 0.5)
```
In some situations, it is possible to identify outlying datapoints on this plot, which do not fit the expected relationship. We have implemented a function to check this for the case of pairs of simplex x nulliplex markers, as these are the markers that we use in the subsequent steps of clustering and identifying chromosome and homologue linkage groups. If you are unsure, you can run the `SNSN_LOD_deviation` function anyway to see whether your data fits the expected pattern:
```{r, fig.width = 6, fig.height = 6}
P1deviations <- SNSN_LOD_deviations(linkage_df = SN_SN_P1,
ploidy = 4,
N = ncol(filtered_data) - 2, #The F1 population size
alpha = c(0.05,0.2),
plot_expected = TRUE,
phase="coupling")
```
Here we see there are a few "outliers" in the coupling pairwise data (identified by the stars) from the expected relationship between r and LOD (the bounds of which are shown by the dotted lines - these can be widened as necessary using the `alpha` argument for the upper and lower tolerances around the "true" line, usually in a trial and error approach). In our example dataset, these highlighted points are hardly outliers and we can proceed. In other datasets, you may want to remove these starred pairwise estimates before continuing with marker clustering, as it is possible they may cause unrelated homologue clusters to clump together and prevent a clear clustering structure from being defined. To do so, remove them from the linkage data frame as follows:
```{r}
SN_SN_P1.1 <- SN_SN_P1[SN_SN_P1$phase == "coupling",][-which(P1deviations > 0.2),]
```
We are now in a position to cluster the marker data into chromosomes using the `cluster_SN_markers` function. This function clusters markers over a range of LOD thresholds. In the example below this is over a range from LOD 3 to 10 in steps of 1 (`LOD_sequence = c(3:10)`). If the argument `plot_network` is set to `TRUE` this function will plot the network of the lowest LOD threshold (LOD 3 in this case), and overlays the clusters of the higher LOD scores. This setting is not recommended with large number of marker pairs. Your computer will most likely run out of memory.
```{r, fig.width = 6, fig.height = 6}
P1_homologues <- cluster_SN_markers(linkage_df = SN_SN_P1,
LOD_sequence = seq(3, 10, 0.5),
LG_number = 5,
ploidy = 4,
parentname = "P1",
plot_network = FALSE,
plot_clust_size = FALSE)
```
The function `cluster_SN_markers` returns a list of cluster data frames, one data frame for each LOD threshold. The first plot shows two things - on the normal y-axis, the number of clusters over a range of LOD thresholds (x-axis) - and on the right-hand y-axis, the number of markers that do not form part of any cluster, which is shown by the blue dashed line. In the `cluster_SN_markers` function, we specify the ploidy (4 for a tetraploid) and the expected number of chromosomes (LG_number = 5) and therefore there are 4 x 5 = 20 homologue clusters expected. If we examine the second plot, we see how these clusters stay together or fragment as the linkage stringency (LOD score) increases. Here, we see clearly that at LOD 3.5, there are 5 clusters which all eventually divide into 4 sub-clusters each. This is the ideal scenario - we have identified both the chromosomal linkage groups and the homologues using only the simplex x nulliplex marker data. Note that it is the coupling-phase estimates which define the homologue clusters and the repulsion-phase estimates which provide the associations between these coupling-clusters (homologues). In practice, things might not always be so orderly. In noisy datasets with higher numbers of genotyping errors, or at higher ploidy levels, the repulsion-phase linkages between homologues will be lost among false positive linkages between un-linked markers. This means that all markers will likely clump together at lower LOD scores.
You can manually check how many clusters there are at a specific LOD score, and how many markers are in a cluster:
```{r}
P1_hom_LOD3.5 <- P1_homologues[["3.5"]]
t <- table(P1_hom_LOD3.5$cluster)
print(paste("Number of clusters:",length(t)))
t[order(as.numeric(names(t)))]
P1_hom_LOD5 <- P1_homologues[["5"]]
t <- table(P1_hom_LOD5$cluster)
print(paste("Number of clusters:",length(t)))
t[order(as.numeric(names(t)))]
```
Based on the plots (or the output as shown) you should be able to decide which LOD threshold best separates the data. In this example, at a LOD of 3.5 we identify chromosomes and anywhere between LOD 4.5 and LOD 7 we have identified our homologues (4 x 5 = 20 clusters). In this parent (P1) we have therefore solved the marker clustering problem without any difficulty. It remains to gather this into a structure that can be used at later stages. To do this, we need two LOD values which define chromosomal groupings (here, `LOD_chm` = 3.5) and homologue groupings (here, `LOD_hom` = 5). We can then use the `define_LG_structure` function with these arguments as follows (note the argument LG_number refers to the expected number of chromosomes for this species):
```{r}
LGHomDf_P1 <- define_LG_structure(cluster_list = P1_homologues,
LOD_chm = 3.5,
LOD_hom = 5,
LG_number = 5)
head(LGHomDf_P1)
```
However, you may find that it is not possible to identify chromosomal groups as simply as this. We therefore recommend that homologue groups be first identified, and then another marker type be used to provide associations between these clusters, and hence help form chromosomal linkage groups. This can be done as follows:
* Generate homologue clusters using the `cluster_SN_markers` function at a high LOD.
* Use the Duplex x Nulliplex or Simplex x Simplex marker information to put the clusters back together into chromosomal groups.
It makes sense to use only SxN marker pairs in coupling phase for the homologue clustering (since these associations define the homologues). Note that the most likely phase per marker pair (coupling or repulsion) was already estimated by our previous call to the `linkage` function.
```{r, fig.width = 6, fig.height = 6}
SN_SN_P1_coupl <- SN_SN_P1[SN_SN_P1$phase == "coupling",] # select only markerpairs in coupling
P1_homologues_1 <- cluster_SN_markers(linkage_df = SN_SN_P1_coupl,
LOD_sequence = c(3:12),
LG_number = 5,
ploidy = 4,
parentname = "P1",
plot_network = FALSE,
plot_clust_size = FALSE)
```
In the dot plot above, we see that all chromosomal associations have now been removed with the repulsion information.
> ###### A note on nested lists
> In the `polymapR` package, output is often given as a nested list. Some basic information on lists is available on *e.g.* [r-tutor](https://www.r-tutor.com/r-introduction/list). We use nested lists because we are often dealing with hierarchical data: marker data > homologue > linkage group > organism. Some basic R knowledge is needed to retrieve, visualize or filter data from such a nested list. However, if you would like to visualize it without R (*e.g.* in your favourite spreadsheet program), you can choose to export a nested list using `write_nested_list`. It will keep its hierarchical structure by writing it to directories that have the same structure as the the nested list. Here's an example: `write_nested_list(list = P1_homologues, directory = "P1_homologues")`
Next, we use linkage to the DxN markers to provide bridging linkages between homologue clusters, using the `bridgeHomologues` function. The first time we run this function, we use a LOD threshold of 4 for evidence of linkage between a SxN and a DxN marker. However, there may also be some false positive linkages at this level. It might therefore be necessary to run the function again at a higher LOD (say LOD 8) to avoid such false positives. In doing so, we might lose true linkage information. Therefore, it’s a good idea to run the function at LOD 4 first, then increase to a higher LOD if necessary, and build up a picture of which clusters form part of the same chromosome.
First calculate linkage between SxN and DxN markers:
```{r, eval = FALSE}
SN_DN_P1 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
markertype2 = c(2,0),
which_parent = 1,
ploidy = 4,
pairing = "random")
```
Then, use `bridgeHomologues` to use these SxN - DxN linkages to find associations ("bridges") between homologue clusters. Note that we must provide a particular division of the Simplex x Nulliplex data - in this example, we use the division at LOD 6 which gave us 20 clusters (our 20 putative homologues):
```{r, fig.width = 6, fig.height = 6}
LGHomDf_P1_1 <- bridgeHomologues(cluster_stack = P1_homologues_1[["6"]],
linkage_df = SN_DN_P1,
LOD_threshold = 4,
automatic_clustering = TRUE,
LG_number = 5,
parentname = "P1")
```
This results in an ideal situation of 5 linkage groups each with exactly 4 homologues:
```{r}
table(LGHomDf_P1_1$LG, LGHomDf_P1_1$homologue)
```
However, clustering of P2 is a bit more difficult. The following re-does everything for P2. Note that for clarity, we include the arguments `parent1` and `parent2` here, but we could have just as well left them out (as they are the default column names, see `?linkage`):
```{r, eval=FALSE}
SN_SN_P2 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
parent1 = "P1",
parent2 = "P2",
which_parent = 2,
ploidy = 4,
pairing = "random"
)
```
```{r, fig.width = 6, fig.height = 6}
SN_SN_P2_coupl <- SN_SN_P2[SN_SN_P2$phase == "coupling",] # get only markerpairs in coupling
P2_homologues <- cluster_SN_markers(linkage_df = SN_SN_P2_coupl,
LOD_sequence = c(3:12),
LG_number = 5,
ploidy = 4,
parentname = "P2",
plot_network = FALSE,
plot_clust_size = FALSE)
```
```{r, eval = FALSE}
SN_DN_P2 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
markertype2 = c(2,0),
which_parent = 2,
ploidy = 4,
pairing = "random")
```
```{r, fig.width = 6, fig.height = 6}
LGHomDf_P2 <- bridgeHomologues(cluster_stack = P2_homologues[["6"]],
linkage_df = SN_DN_P2,
LOD_threshold = 4,
automatic_clustering = TRUE,
LG_number = 5,
parentname = "P2")
table(LGHomDf_P2$LG,LGHomDf_P2$homologue)
```
Occasionally linkage groups possesses fewer than four "homologues". This can result in a loss of phase information in later steps, but will not prevent us from proceeding with the mapping. However, it is also possible that more than 4 homologues are grouped together, likely the result of having a number of fragmented homologues. However, as we don't know which clusters are actual homologues, and which are fragments, we must examine the affected linkage group in more detail. `polymapR` currently offers two methods for this.
### 8.1 Using the function `overviewSNlinks`
Run `overviewSNlinks` as follows:
```{r, fig.width = 7, fig.height = 6}
overviewSNlinks(linkage_df = SN_SN_P2,
LG_hom_stack = LGHomDf_P2,
LG = 3,
LOD_threshold = 0)
```
In cases of fragmented homologues, it may be possible to identify which fragments belong together by examining the phase of intra-cluster marker pairs. Coupling-phase (green) linkage between fragments can be an indication that separate homologue fragments belong together. This can be then be used to merge these homologues (specified in the argument `mergeList` - use `?merge_homologues` for more information). For example, if we wanted to merge homologues 4 and 5 of linkage group 3 as the above data suggests, we would use the following:
```{r}
LGHomDf_P2_1 <- merge_homologues(LG_hom_stack = LGHomDf_P2,
ploidy = 4,
LG = 3,
mergeList = list(c(4,5)))
```
### 8.2 Using the function `cluster_per_LG`
`cluster_per_LG` does not rely on markers in repulsion, but uses variation in LOD thresholds between markers in coupling. Run it as follows:
```{r, fig.width = 7.5, fig.height = 6}
cluster_per_LG(LG = 3,
linkage_df = SN_SN_P2[SN_SN_P2$phase == "coupling",],
LG_hom_stack = LGHomDf_P2,
LOD_sequence = c(3:10), # The first element is used for network layout
modify_LG_hom_stack = FALSE,
network.layout = "stacked",
nclust_out = 4,
label.offset=1.2)
```
The network layout is based on the first LOD score you provide in `LOD_sequence` (which is LOD = 3 in this example). The darker the background, the more consistent the clustering is over the range of LOD scores. Change the `LOD_sequence` if the network is not as required (4 clusters in the case of a tetraploid) and re-run. If the network looks as you would like it to, use the `cluster_per_LG` function but with a `LOD_sequence` of length 1 (the first LOD score, so again LOD = 3 in this example), and specify `modify_LG_hom_stack = TRUE` to make a modification to the cluster numbering (the output is then returned and can be saved as a new LG_hom_stack object):
```{r}
LGHomDf_P2_1 <- cluster_per_LG(LG = 3,
linkage_df = SN_SN_P2[SN_SN_P2$phase == "coupling",],
LG_hom_stack = LGHomDf_P2,
LOD_sequence = 3,
modify_LG_hom_stack = TRUE,
network.layout = "n",
nclust_out = 4)
table(LGHomDf_P2_1$homologue, LGHomDf_P2_1$LG)
```
### 8.3 Cross-ploidy populations (*e.g.* tetraploid x diploid to give triploid F1)
`polymapR` is also able to map triploid populations, and may be extended to other cross-ploidy populations if there is demand for that in future. Tetraploid x diploid crosses are however the most common (often done to induce seedlessness in the F1). Previously we had implemented a mapping approach that assumes that the parent of higher ploidy level is parent 1. However, since polymapR v.1.1.5 this is no longer the case, so a 4x2 and 2x4 cross are both acceptable.
The next point to be aware of is that the segregation of markers in the diploid parent is assumed to follow disomic inheritance (there is no other option in a diploid), whereas the segregation of the tetraploid parent is assumed to follow tetrasomic inheritance. Mixtures with preferential pairing in the tetraploid parent are currently not implemented for triploid populations.
Because of this disomic behaviour in the diploid parent, we are unable to use the normal clustering function (`cluster_SN_markers`) for identifying the homologues. Proceed as normal with the steps described above for the tetraploid parent. However in order to proceed correctly for the diploid parent 2, the clustering proceeds a bit differently.
If you are confused about what marker types are possible in a triploid population, run the following, which gives all possible marker combinations (and the segregation ratios in the F1):
```{r}
get("seg_p3_random",envir=getNamespace("polymapR"))
```
To run the preliminary linkage analysis of 1x0 markers in the diploid parent, we need to specify the ploidy of both parents, e.g. `ploidy = 4` and `ploidy2 = 2` (or vice versa). To demonstrate this, we need to load a special dataset with triploid data, a sample of which is available from the `polymapR` package also:
```{r, eval=FALSE}
data("TRI_dosages")
# Estimate the linkage in the diploid parent (assuming this has been done for the 4x parent already):
SN_SN_P2.tri <- linkage(dosage_matrix = TRI_dosages,
markertype1 = c(1,0),
parent1 = "P1",
parent2 = "P2",
which_parent = 2,
ploidy = 4,
ploidy2 = 2,
pairing = "random"
)
```
Have a look at the plot of r versus LOD score - here you should notice that both the coupling and repulsion phases are equally informative now and cannot therefore be separated by LOD score as before:
```{r, fig.width = 6, fig.height = 6}
r_LOD_plot(SN_SN_P2.tri)
```
If we try to cluster using the old function, we get the following result - no separation of the homologues, but the chromosomal linkage groups are identified:
```{r, fig.width = 6, fig.height = 6}
P2_homologues.tri <- cluster_SN_markers(linkage_df = SN_SN_P2.tri,
LOD_sequence = seq(3, 10, 1),
LG_number = 3,
ploidy = 2, #because P2 is diploid..
parentname = "P2",
plot_network = FALSE,
plot_clust_size = FALSE)
```
To separate the 0x1 markers into homologues, we use the phase information from the linkage analysis directly, using the `phase_SN_diploid` function. To learn more about how this function works, input `?phase_SN_diploid` first, to get an idea of the arguments used.
```{r}
LGHomDf_P2.tri <- phase_SN_diploid(linkage_df = SN_SN_P2.tri,
cluster_list = P2_homologues.tri,
LOD_chm = 4, #LOD at which chromosomes are identified
LG_number = 3) #number of linkage groups
```
The rest of the analysis should be pretty much the same as for tetraploids or hexaploids, although you should check each time you use a new function whether that function has a `ploidy2` argument, and if it does, __use it!__
## 9. Assigning SxS and DxN markers and consensus linkage group (LG) names
Assigning simplex x simplex markers to a homologue and linkage group is done by calculating linkage between SxS and SxN markers and after that assigning them based on this linkage using `assign_linkage_group`:
```{r, eval = FALSE}
SN_SS_P1 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
markertype2 = c(1,1),
which_parent = 1,
ploidy = 4,
pairing = "random")
```
```{r}
P1_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P1,
LG_hom_stack = LGHomDf_P1,
SN_colname = "marker_a",
unassigned_marker_name = "marker_b",
phase_considered = "coupling",
LG_number = 5,
LOD_threshold = 3,
ploidy = 4)
head(P1_SxS_Assigned)
```
```{r, eval = FALSE}
SN_SS_P2 <- linkage(dosage_matrix = filtered_data,
markertype1 = c(1,0),
markertype2 = c(1,1),
which_parent = 2,
ploidy = 4,
pairing = "random")
```
```{r}
P2_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P2,
LG_hom_stack = LGHomDf_P2_1,
SN_colname = "marker_a",
unassigned_marker_name = "marker_b",
phase_considered = "coupling",
LG_number = 5,
LOD_threshold = 3,
ploidy = 4)
```
As simplex x simplex markers are present in both parents, we can define which linkage groups correspond with each other between parents. After this, one of the linkage groups of the parents should be renamed and the SxS markers assigned again according to the new linkage group names:
```{r}
LGHomDf_P2_2 <- consensus_LG_names(modify_LG = LGHomDf_P2_1,
template_SxS = P1_SxS_Assigned,
modify_SxS = P2_SxS_Assigned)
P2_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P2,
LG_hom_stack = LGHomDf_P2_2,
SN_colname = "marker_a",
unassigned_marker_name = "marker_b",
phase_considered = "coupling",
LG_number = 5,
LOD_threshold = 3,
ploidy = 4)
```
Since we now have a consistent linkage group numbering, we can also assign the DxN markers:
```{r}
P1_DxN_Assigned <- assign_linkage_group(linkage_df = SN_DN_P1,
LG_hom_stack = LGHomDf_P1,
SN_colname = "marker_a",
unassigned_marker_name = "marker_b",
phase_considered = "coupling",
LG_number = 5,
LOD_threshold = 3,
ploidy = 4)
P2_DxN_Assigned <- assign_linkage_group(linkage_df = SN_DN_P2,
LG_hom_stack = LGHomDf_P2_2,
SN_colname = "marker_a",
unassigned_marker_name = "marker_b",
phase_considered = "coupling",
LG_number = 5,
LOD_threshold = 3,
ploidy = 4)
```
## 10. Assign all other markertypes
Now that we have a backbone of SxN markers with consistent linkage group names, it is time to assign all other marker types to a linkage group and homologue using their linkage to SxN markers. Since we already did this for DxN and SxS markers, there is no need to re-do this work for these marker types. The function `homologue_lg_assignment` finds all markertypes that have not been assigned yet, does the linkage analysis and assigns them to a linkage group and homologue. As linkage is calculated for a lot of marker combinations, this might take a while. Again, note that since v.1.1.5, we need to specify `which_parent` we are interested in, similar to the `linkage` function that is ultimately being called multiple times in the background here.
```{r, eval = FALSE}
marker_assignments_P1 <- homologue_lg_assignment(dosage_matrix = filtered_data,
assigned_list = list(P1_SxS_Assigned,
P1_DxN_Assigned),
assigned_markertypes = list(c(1,1), c(2,0)),
LG_hom_stack = LGHomDf_P1,
which_parent = 1,
ploidy = 4,
pairing = "random",
convert_palindrome_markers = FALSE,
LG_number = 5,
LOD_threshold = 3,
write_intermediate_files = FALSE
)
```
Also do this for P2:
```{r, eval = FALSE}
marker_assignments_P2 <- homologue_lg_assignment(dosage_matrix = filtered_data,
assigned_list = list(P2_SxS_Assigned,
P2_DxN_Assigned),
assigned_markertypes = list(c(1,1), c(2,0)),
LG_hom_stack = LGHomDf_P2_2,
which_parent = 2,
ploidy = 4,
pairing = "random",
convert_palindrome_markers = TRUE,
LG_number = 5,
LOD_threshold = 3,
write_intermediate_files = FALSE
)
```
Next, to make sure the marker linkage group assignment is consistent across parents we run the `check_marker_assignment` function, removing any bi-parental markers if they show linkage to different chromosomes (which suggests a problem with these markers):
```{r, eval = FALSE}
marker_assignments <- check_marker_assignment(marker_assignments_P1,marker_assignments_P2)
```
## 11. Finish the linkage analysis
Since all markers have been assigned to a linkage group, we now do the linkage calculations per linkage group of every marker type combination (so not only with SxN markers). This is done with the `finish_linkage_analysis` function:
```{r, eval = FALSE}
all_linkages_list_P1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P1,
dosage_matrix = filtered_data,
which_parent = 1,
convert_palindrome_markers = FALSE,
ploidy = 4,
pairing = "random",
LG_number = 5)
all_linkages_list_P2 <- finish_linkage_analysis(marker_assignment = marker_assignments$P2,
dosage_matrix = filtered_data,
which_parent = 2,
convert_palindrome_markers = TRUE, # convert 3.1 markers
ploidy = 4,
pairing = "random",
LG_number = 5)
```
The output is returned in a list:
```{r, eval = FALSE}
str(all_linkages_list_P1)
```
## 12. Marker ordering
The basis of producing linkage maps has now been achieved, that is the pairwise recombination frequencies have been estimated and LOD scores for these estimates have been calculated for all markers. We rely on a package developed by Katherine Preedy and Christine Hackett [@Hackett2016], called using the function `MDSMap_from_list`. This function applies the `estimate.map` function from the `MDSMap` package to a list of linkage dataframes. Below, we are using the default settings of `estimate.map`, however, they can be changed by supplying extra arguments.
By calling the `MDSMap_from_list` function you automatically produce the input .txt files that `MDSMap` requires in the correct format. Note that unless you provide a different folder name (by specifying `mapdir`), the files will be overwritten each time you run the function. We generally recommend using the default mapping settings of the `estimate.map` function (so using the method of principal curves in 2 dimensions with LOD^2^ as weights), but there are a number of options that can be specified which are described in the manual of that package (check out https://cran.R-project.org/package=MDSMap). Output plots can be saved as .pdf files by specifying `write_to_file = TRUE`, in the same `mapdir` folder that contains map input files as well.
## 12.1 Creating an integrated chromosomal linkage map
To create an integrated map of all linkage groups, we first combine the linkage information together and then run the `MDSMap_from_list` function which prepares the files and passes them to `MDSMap` for the mapping:
```{r, eval = FALSE}
linkages <- list()
for(lg in names(all_linkages_list_P1)){
linkages[[lg]] <- rbind(all_linkages_list_P1[[lg]], all_linkages_list_P2[[lg]])
}
integrated.maplist <- MDSMap_from_list(linkages)
```
## 12.2 *Optional:* Adding back duplicated markers
If you ran the function `screen_for_duplicate_markers` before mapping, you may want to add back the duplicate markers to the map. This can be achieved using the `add_dup_markers` function. If you also want to include the duplicate markers on a phased map file (next section), then you also need to update the marker_assignments as well (otherwise, just leave the marker_assignments argument out). This is done as follows:
```{r, eval = FALSE}
complete_mapdata <- add_dup_markers(maplist = integrated.maplist,
bin_list = screened_data4$bin_list,
marker_assignments = marker_assignments)
```
Note that the output of this function is again a list, which we might like to re-assign as follows:
```{r, eval = FALSE}
integrated.maplist_complete <- complete_mapdata$maplist
marker_assignments_complete <- complete_mapdata$marker_assignments
```
Note that if we are producing a phased maplist, we also need the marker dosages matrix to be present (in order to check whether the phasing corresponds with the parental scores). Therefore it is no longer appropriate to use `filtered_data` if we have added back duplicates (where the duplicate markers have been removed). Using `screened_data3` from earlier would be the correct choice.
## 12.3 Phasing an integrated map
One final step that is useful is to generate phased linkage maps from the integrated linkage map and marker assignments. This provides information on the coverage of markers across the parental homologues. To phase the markers, we use the `create_phased_maplist` function as follows:
```{r, eval=FALSE}
phased.maplist <- create_phased_maplist(maplist = integrated.maplist,
dosage_matrix.conv = filtered_data,
N_linkages = 5,
ploidy = 4,
marker_assignment.1 = marker_assignments$P1,
marker_assignment.2 = marker_assignments$P2)
```
```{r, echo = FALSE}
data("phased.maplist")
```
The `N_linkages` option specifies the minimum number of significant linkages of a marker to a chromosomal linkage group to be confident of its assignment. Note that this level of significance (*e.g.* LOD > 3) was already defined in the `homologue_lg_assignment` function earlier. There are a number of other arguments with `create_phased_maplist`, including `original_coding` which produces a phased mapfile in the original, uncoverted format. This may have advantages for tracking marker alleles, although it has the disadvantage that the homologues appear to be more saturated with markers than they actually are!
## 13. Plotting a map
A well-know software program for plotting maps is MapChart [@Voorrips2002]. `polymapR` can write out MapChart compatible files (.mct) using the function `write.mct`. There are many plotting options available in MapChart, here we only currently incorporate a single option, namely `showMarkerNames`, which by default is `FALSE` in anticipation of high-density maps. Further formatting can be achieved within the MapChart environment itself.
However, `polymapR` also comes with its own simple built-in function to plot maps, namely the `plot_map` function:
```{r, echo = FALSE}
data("maplist_P1_subset","integrated.maplist")
maplist_P1_LG1 <- maplist_P1_subset$LG1
```
```{r, fig.width = 7, fig.height = 7}
plot_map(maplist = integrated.maplist)
```
We might also want to visualise the integrated map output using the `plot_phased_maplist` function (but here we limit it to a single linkage group). Note that we need to have first used the `create_phased_maplist`, described earlier.
```{r, fig.width = 6, fig.height = 6}
plot_phased_maplist(phased.maplist = phased.maplist[1], #Can plot full list also, remove "[1]"
ploidy = 4,
cols = c("black","grey50","grey50"))
```
This function can also visualise phased hexaploid maps, or triploid maps (tetraploid x diploid) if `ploidy2` is specified.
## 14. Evaluating map quality
`MDSMap` produces output that can be examined each time a map is produced, namely the principal coordinate plots with the principal curve (on which the map is based) - highlighting possible outlying markers that do not associate well with the rest of the markers, and a plot of the nearest-neighbour fit (nnfit) for the markers along each linkage group, giving an indication of markers with high stress. Both of these can help the user to evaluate the quality of the map (for more details, we recommend reading the `MDSMap` publication [@Hackett2016]) and often results in a number of rounds of mapping, where problematic markers are identified and removed followed by subsequent remapping and re-evaluation. Once the maps have been produced it is also an idea to get a general overview of the map quality using the `check_map` function. Note that by default this function expects lists as input arguments, enabling diagnostics to be run over all maps. Here we run it for a single linkage group, LG1:
```{r, eval = FALSE}
check_map(linkage_list = linkages[1], maplist = integrated.maplist[1])
```
```{r, out.width = "500px", echo = FALSE, fig.align = "center"}
knitr::include_graphics("figures/LG1_check_map_plotA.png")
```
```{r, out.width = "650px", echo = FALSE, fig.align = "center"}
knitr::include_graphics("figures/LG1_check_map_plotB.png")
```
For each map produced, there are 3 diagnostic plots. The first of these shows the differences between the pairwise estimates of recombination frequency and the effective estimate of recombination frequency based on the map itself (the multi-point estimate of the recombination frequency) . These differences are compared to the LOD score for each pairwise estimate. An overall estimate of the weighted root mean square error of these differences (weighted by the LOD scores) is printed on the console. In general, we expect that if the difference between the expected and realised recombination frequency is high, the LOD should be low. If this is not the case, something went wrong in the final mapping step.
The second plot shows the comparison between a marker's map position and the recombination frequency estimates to all other markers. In general we expect that small recombination frequency estimates come from nearby markers and should therefore be concentrated around the diagonal, which are shown as light green regions. Areas of light green off the diagonal suggest a sub-optimal map order.
The third plot is similar to the previous except that LOD values are shown in place of recombination frequencies. Again, we expect a concentration of higher LOD scores around the diagonal. By default, LOD scores less than 5 are not shown to make things clearer, but this value can be varied using the `lod.thresh` argument.
## 15. Preferential pairing
Up to now, we have been working under the assumption of random bivalent pairing in the parents. However, it has been shown that in certain species the pairing behaviour is neither completely random nor completely preferential (as we have in allopolyploids) but something in-between, a condition termed segmental allopolyploidy [@Stebbins1947]. A study on this topic in rose [@Bourke2017] found evidence of disomic behaviour in certain regions of the genome, with tetrasomic behaviour everywhere else. This sort of mixed inheritance can complicate the analysis and in cases where this effect is particularly pronounced, it is probably unwise to ignore [@Bourke2016].
The `polymapR` package can currently accommodate preferential pairing in tetraploid species only, by correcting the recombination frequency estimates once the level of preferential pairing in a parental chromosome is known. In order to determine whether this is necessary, we can run some diagnostic tests using the marker data after an initial map has been produced which can inform our choice regarding whether to include a preferential pairing parameter or not (in a subsequent re-analysis).
First, a word of warning. There is a function to test for preferential pairing among pairs of closely-linked simplex x nulliplex markers in repulsion phase within `polymapR`. However, it is certainly also advised to use multi-point methods to test for preferential pairing (if possible). The only possible drawback is that multi-point methods (using all markers across a chromosome) generally assume uniform pairing behaviour across the length of a chromosome, which may not always occur [@Bourke2017]). A robust multi-point approach as described in [@Bourke2017] used identity-by-descent probabilities for the population, estimated using TetraOrigin [@Zheng2016], which estimates the most likely pairing behaviour per individual and can therefore reveal whether there are deviations from the assumption of random chromosomal pairing and recombination across the population.
The function `test_prefpairing` examines closely-linked repulsion-phase simplex x nulliplex marker pairs only, and is therefore the results of this function may not be as *robust* as a multi-point whole-chromosome approach. Be that as it may, it is probably the simplest check we can perform post-mapping for the possibility of unusual pairing behaviour. The method used here is described also in Bourke et al.[@Bourke2017], which is a development on ideas originally described by Qu and Hancock[@Qu2001]. The idea is to look for signatures of preferential pairing in repulsion-phase pairs which map to the same locus.
We first need to define a minimum distance by which to consider markers essentially at the same locus. This is somewhat arbitrary - in our study of tetraploid rose, we used pairs of duplicated individuals to determine an approximate error rate which, along with considerations of the size of the population and rate of missing values, led us to conclude that distances less than ~1 cM were close enough to be informative (and for which we could assume 0 recombinations). The smaller the distance we choose here the better, although by choosing a very small distance we are limiting the number of repulsion-pairs available. By default in `test_prefpairing` we use a value of 0.5 cM. This can be increased or decreased as required using the `min_cM` argument.
```{r eval = FALSE}
P1.prefPairing <- test_prefpairing(dosage_matrix = ALL_dosages,
maplist = integrated.maplist,
LG_hom_stack = LGHomDf_P1_1,
min_cM = 1, #changed from default of 0.5 cM
ploidy = 4)
head(P1.prefPairing)
```
The output shows the repulsion-phase marker pairs tested. The most important column is probably the column `P_value.adj`, which gives the FDR-adjusted P-values from a Binomial test of the hypothesis that the disomic repulsion estimate of recombination frequency differs significantly from $\frac{1}{3}$ (actually it is a one-sided test, so we are testing that *r disom* falls below $\frac{1}{3}$). In cases where this is significant (with `P_value.adj` < 0.01, say), we might have some evidence for preferential pairing (but from that marker pair only!). The output of `test_prefpairing` also lists the markers, their positions and on which homologues they have been mapped. There is also an estimate of the pairwise preferential pairing parameter `pref.p`. Note that there are two possible (maximum-likelihood) estimators of a preferential pairing parameter in the case of a pair of simplex x nulliplex markers, depending on whether the marker alleles reside on homologous chromosomes, or homoeologous chromosomes (*i.e.* whether the pair are contained within a "subgenome", or are straddled across "subgenomes", to borrow the language of allopolyploids for convenience). If there is strong evidence of preferential pairing from multiple marker pairs, the approach we recommend is to take the average of the `pref.p` values for a particular chromosomal linkage group, which becomes one of the parameters `p1` or `p2` in the `linkage` function.
Here, we have purely random pairing in our parents, so the estimates of `pref.p` should not be considered meaningful. The preferential parameter `p` is defined such that $0 < p < \frac{2}{3}$. Supposing this were not the case, and we wanted to estimate the strength of preferential pairing on LG1 of parent 1, we would do something like the following:
```{r eval = FALSE}
mean(P1.prefPairing[P1.prefPairing$P_value.adj < 0.01 & P1.prefPairing$LG_a == 1,]$pref.p)
```
Note that again we require first that there be significant evidence of disomic behaviour before including the estimate. If there is evidence of preferential pairing, it is probably a good idea to go back and reduce the `min_cM` argument even further, since we really want this value to be as close to 0 as possible for an accurate estimate of a preferential pairing parameter.
Finally, to re-run the linkage analysis with this estimate for parent 1 of linkage group 1, we would first have to subset our marker dosage data corresponding to that linkage group only (since we do not need to re-analyse any other chromosome!), and then re-run the linkage analysis. Note that in cases with extreme levels of preferential pairing, the clustering into separate homologues can also be affected. Supposing we had estimated a value of 0.25 for the preferential pairing parameter we could proceed as follows:
```{r eval = FALSE}
lg1_markers <- unique(c(rownames(marker_assignments_P1[marker_assignments_P1[,"Assigned_LG"] == 1,]),
rownames(marker_assignments_P2[marker_assignments_P2[,"Assigned_LG"] == 1,])))
all_linkages_list_P1_lg1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P1[lg1_markers,],
dosage_matrix = filtered_data[lg1_markers,],
which_parent = 1,
convert_palindrome_markers = FALSE,
ploidy = 4,
pairing = "preferential",
prefPars = c(0.25,0), #just for example!
LG_number = 1 #interested in just 1 chm.
)
all_linkages_list_P2_lg1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P2[lg1_markers,],
dosage_matrix = filtered_data[lg1_markers,],
which_parent = 2,
convert_palindrome_markers = FALSE,
ploidy = 4,
pairing = "preferential",
prefPars = c(0,0.25), #Note that this is in reverse order now.
LG_number = 1
)
```
## 16. QTL analysis
The package `polyqtlR`, available through CRAN, has been developed for QTL analysis in outcrossing polyploid populations, using IBD probabilities derived using the phased linkage maps developed in `polymapR`.
Alternatively, the TetraploidSNPMap [@hackett2017] software provides the possibility of not just generating linkage maps but performing IBD-based interval QTL mapping as well. The creation of input files for TetraploidSNPMap is simple once phased mapfiles have been created with the `create_phased_maplist` function (see [section 12.3](#sect12.3)).
To generate these files, use the `write.TSNPM` function as follows:
```{r eval = FALSE}
write.TSNPM(phased.maplist = phased.maplist,ploidy=4)
```
Other software for linkage mapping and QTL analysis in polyploid populations includes `MAPpoly` and `QTLpoly`, as well as `diaQTL` and `polyOrigin`.
## 17. Concluding remarks
We hope you have been able to successfully create an integrated map using the methods described here. We anticipate `polymapR` will improve over the coming years (in terms of applicability as well as finding and removing bugs). Feedback on performance including bug reporting is most welcome; please refer to the package information on CRAN for details on who to contact regarding maintenance issues.
## 18. References