polymapR: linkage mapping in outcrossing polyploids

In this tutorial, we will go through the basic steps of performing linkage analysis in polyploids using polymapR. Most of the theory behind the functions described here can be found in the Bioinformatics publication of Bourke et al. (2018) at https://doi.org/10.1093/bioinformatics/bty371. The example dataset is a simulated dataset of a hypothetical tetraploid outcrossing plant species with five chromosomes. Chromosomes have random pairing behaviour. The population consists of 207 individuals originating from an F1 cross between two non-inbred parents. Frequently occurring data imperfections such as missing values, genotyping errors and duplicated individuals are taken into account.

1. A (very) short introduction to R and the polymapR package.

In R, a function can be called by typing function_name() in the console and pressing [Enter]. (Optional) arguments go within the parentheses, followed by a = sign and the value you want to give to the argument. Arguments are separated by a comma. All possible arguments and default values of a function are in the help file, which is called by typing ? followed by the function name. Getting the help file for the commonly used function seq looks like this:

?seq

And generating a sequence of numbers from 2 to 14 by an increment of 3 using seq looks like this:

seq(from = 2,
    to = 14,
    by = 3)
## [1]  2  5  8 11 14

Although copying and pasting the commands from this document into your R console will do for this tutorial, we recommend gaining a bit more experience with R than the above before using polymapR. This will make it easier to troubleshoot and to inspect data output. A good place to start would be https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf or any other good introductory course in R.

The polymapR package is literally a collection of functions, each defined by their function name and arguments. It relies on dosage data of molecular markers, and therefore is logically used after the packages fitTetra or fitPoly. As for any project, it is best to first create a new folder which will be the main working directory for the mapping project. Set the working directory to your new folder (setwd() or use the drop-down menu) and copy the marker dosage file from fitTetra there.

2. Install polymapR

polymapR is available on CRAN at https://cran.R-project.org/package=polymapR which is probably where you accessed this vignette. As with other packages on CRAN, it can be installed directly from within R using the command:

install.packages("polymapR")

3. Logging function calls and output

Most exported functions in polymapR have a log argument. The default setting is NULL. In that case, all messages are returned to the standard output, which means your R console if you are calling the functions interactively. However, if it is set to a file name (e.g. log.Rmd or log.txt) all function calls and output are written to that file. It automatically creates one if it does not exist yet. The log file is written in R markdown (https://rmarkdown.rstudio.com/). The advantage of markdown is that it is readable as plain text, but it can also easily be rendered into a nicely formatted html or pdf. In order to do so, give your logfile a .Rmd extension and use the knitr package to turn it into a pdf, docx or html.

4. Data importing

4.1 Reading in marker dosage data

We assume that the user has used fitPoly or some alternative method to convert genotyping information (e.g. signal intensity ratios or read depths) into marker dosages. The standard version of polymapR assumes that such genotypes are discrete, but it is also possible to use probabilistic genotypes since version 1.1.0 (see other vignette). In this tutorial we assume such discrete dosage values are the starting point, but we will impose some further quality checks before we can begin mapping (such as allowed numbers of missing values). We also need to check the markers for skewness (differences in the observed and expected marker segregation) which can cause problems later in the clustering. For this, we have incorporated a number of functions developed for the fitPoly package [1], and included in the currently-unpublished R package fitPolyTools developed by Roeland Voorrips.

But first, a note on polyploid terminology

In the polymapR package, and throughout this vignette, we often refer to terms like “simplex”, “duplex” or “nulliplex” etc. These terms describe the dosage scores of SNP markers, essentially the count of the number of “alternative” alleles present at a particular locus in an individual. A simplex x nulliplex marker is therefore a marker for which the maternal dosage is 1 (mothers always come first) and the paternal dosage is 0. To save space, we also refer to these markers as SxN markers, which is synonymous with 1x0 markers. Sometimes statements are made concerning simplex x nulliplex markers - it should be understood that these statements usually also apply to nulliplex x simplex markers - markers for which the segregating allele is carried by the father rather than the mother.

The first step is to import the marker data. The layout of the dataset should be marker name, parental scores followed by offspring scores (so-called wide data format):

Marker P1 P2 F1.001 F1.002..
mymarker1 1 0 1 1
mymarker2 2 4 3 2
mymarker3 0 3 2 2

In cases where the parents have been genotyped more than once, a consensus parental score is required. Missing parental scores are not allowed at this stage, and should have been imputed already if desired.
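One possible way to derive a consensus parental score from replicated parental samples is sketched below in base R. The replicate column names (P1_a, P1_b, P1_c), the helper consensus_score and the rule itself (keep a score only if all non-missing replicates agree) are assumptions made for illustration and are not part of polymapR.

```r
# Toy replicate scores for parent 1: three markers, three replicate samples
# (hypothetical column names; real data would come from the genotyping output)
P1_reps <- matrix(c(1, 1, NA,
                    2, 2, 2,
                    0, 1, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("mymarker1", "mymarker2", "mymarker3"),
                                  c("P1_a", "P1_b", "P1_c")))

# Consensus rule (an assumption): accept a score only when all non-missing
# replicates agree, otherwise return NA so that the marker can be reviewed
consensus_score <- function(scores) {
  observed <- unique(scores[!is.na(scores)])
  if (length(observed) == 1) observed else NA_real_
}

P1_consensus <- apply(P1_reps, 1, consensus_score)
print(P1_consensus)
```

Markers that end up with a missing consensus score (mymarker3 in this toy example) would need to be resolved or removed, since missing parental scores are not allowed at this stage.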

The data should be in some format that can be read easily into R, such as .csv, .dat or .txt file. In the example a .csv file is read in. For other file types check out ?read.delim.

ALL_dosages <- read.csv("tetraploid_dosages.csv",
                        stringsAsFactors = FALSE,
                        row.names = 1) #first column contains rownames
class(ALL_dosages)
## [1] "data.frame"

By default, read.csv returns data in a data frame. All polymapR functions that use dosage data only accept it in the form of a matrix. A matrix is a two-dimensional R object with all elements of the same type (e.g. character, integer, numeric). Marker names and individual names should be specified in the row names and column names respectively. To use the data with polymapR functions, convert it from a data.frame to a matrix using the as.matrix function:

ALL_dosages <- as.matrix(ALL_dosages)

class(ALL_dosages)
## [1] "matrix" "array"
head(ALL_dosages[,1:5])
##             P1 P2 F1_001 F1_002 F1_003
## Zm_rs005505  2  2      2      3      2
## Zm_ts097822  1  2      1      1      2
## Ac_ws072452  1  0      1     NA      1
## Ac_ts073123  0  2      1     NA      2
## Ap_ws071152  3  1      3      2     NA
## St_rs076767  1  0      1      1      1
dim(ALL_dosages)
## [1] 3000  209
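Before moving on, it can be worth confirming that the matrix has the structure described above. The checks below are a minimal base R sketch on a toy matrix; the helper name check_dosage_matrix and the particular set of checks are assumptions for illustration, not a polymapR function.

```r
# Toy dosage matrix in the layout described above (markers in rows,
# parents followed by F1 individuals in columns)
toy_dosages <- matrix(c(1, 0, 1, 1, NA,
                        2, 4, 3, 2, 3,
                        0, 3, 2, 2, 1),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(c("mymarker1", "mymarker2", "mymarker3"),
                                      c("P1", "P2", "F1_001", "F1_002", "F1_003")))

# Hypothetical helper: returns a named logical vector, one entry per check
check_dosage_matrix <- function(x, ploidy = 4, parents = c("P1", "P2")) {
  stopifnot(is.matrix(x), is.numeric(x))
  parents_found <- all(parents %in% colnames(x))
  c(unique_marker_names = !is.null(rownames(x)) && !anyDuplicated(rownames(x)),
    parents_present     = parents_found,
    parents_complete    = parents_found && !anyNA(x[, parents]),
    dosages_in_range    = all(x %in% c(0:ploidy, NA)))
}

print(check_dosage_matrix(toy_dosages))
```

Any entry that comes back FALSE points to something worth fixing in the input file before continuing.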

If you are testing the package and would like a sample dataset to work with, a set of tetraploid marker dosage data is provided within the package itself. First load the package and then the dosage data as follows:

library(polymapR)
data(ALL_dosages)

4.2 Checking for skewness

The next step before we proceed further is to check whether the marker scores in the F1 correspond to those expected according to the parental dosages. We do this using the checkF1 function (from fitPolyTools). However, we first load the polymapR package into the global environment:

library(polymapR)

Now run checkF1:

F1checked <- checkF1(dosage_matrix = ALL_dosages,parent1 = "P1",parent2 = "P2",
                     F1 = colnames(ALL_dosages)[3:ncol(ALL_dosages)],
                     polysomic = TRUE, disomic = FALSE, mixed = FALSE, ploidy = 4)

head(F1checked$checked_F1)
##   m  MarkerName parent1 parent2 F1_0 F1_1 F1_2 F1_3 F1_4 F1_NA P1 P2 bestfit
## 1 1 Zm_rs005505       2       2    6   39   90   45    4    23  2  2 18I81_0
## 2 2 Zm_ts097822       1       2   14   79   83   10    0    21  1  2  1551_0
## 3 3 Ac_ws072452       1       0  102   83    0    0    0    22  1  0    11_0
## 4 4 Ac_ts073123       0       2   27  121   37    0    0    22  0  2   141_0
## 5 5 Ap_ws071152       3       1    1   48   77   55    0    26  3  1   121_1
## 6 6 St_rs076767       1       0   92   88    0    0    0    27  1  0    11_0
##   frqInvalid_bestfit Pvalue_bestfit matchParent_bestfit bestParentfit
## 1             0.0000         0.9187                 Yes       18I81_0
## 2             0.0000         0.4724                 Yes        1551_0
## 3             0.0000         0.1624                 Yes          11_0
## 4             0.0000         0.4160                 Yes         141_0
## 5             0.0055         0.1165                 Yes         121_1
## 6             0.0000         0.7656                 Yes          11_0
##   frqInvalid_bestParentfit Pvalue_bestParentfit matchParent_bestParentfit
## 1                   0.0000               0.9187                       Yes
## 2                   0.0000               0.4724                       Yes
## 3                   0.0000               0.1624                       Yes
## 4                   0.0000               0.4160                       Yes
## 5                   0.0055               0.1165                       Yes
## 6                   0.0000               0.7656                       Yes
##   q1_segtypefit q2_parents q3_fracscored qall_mult qall_weights
## 1        1.0000          1        0.7222    0.7222       0.9383
## 2        1.0000          1        0.7464    0.7464       0.9436
## 3        1.0000          1        0.7343    0.7343       0.9410
## 4        1.0000          1        0.7343    0.7343       0.9410
## 5        0.9278          1        0.6860    0.6365       0.8901
## 6        1.0000          1        0.6739    0.6739       0.9275

checkF1 can also check for shifted (incorrectly genotyped) markers, but we will not concern ourselves with that here. The main arguments to be specified are the parental identifiers (parent1 and parent2), the identifiers of the F1 population (argument F1 - note we just take the column names of the dosage matrix from column 3 to the end), and the inheritance model we wish to compare against. Here we have specified polysomic to be TRUE and both disomic and mixed to be FALSE. Note that mixed allows for one parent to be fully disomic and the other fully polysomic, but does not refer to a mixed inheritance pattern within one parent, as associated with segmental allopolyploidy, because in that case we have no fixed expectation to compare with. The argument ploidy specifies the ploidy level of parent 1; if parent 2 has a different ploidy (for instance in a tetraploid x diploid cross) we can use the ploidy2 argument.

The results are stored in the list F1checked, which has two elements: meta, which holds meta-data regarding the function call, and checked_F1, the actual function output. We are interested in seeing whether there is consistency between the parental scores and the F1. In general, if there is good consistency between the parental and offspring scores we will have high values for qall_mult and qall_weights, which are aggregated quality measures of q1, q2 and q3.

In the sample dataset provided with polymapR there are no skewed markers present. However, you may find it useful to examine the output of checkF1 and remove markers which show conflict between the parental and offspring scores, as these markers will likely cause issues in the later mapping steps.
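To make the idea of skewness concrete, the sketch below runs a plain chi-squared goodness-of-fit test of observed F1 counts against the 1:1 ratio expected for a simplex x nulliplex marker. The first two rows use the F1_0 and F1_1 counts shown in the output above; the third marker and the 0.001 threshold are invented for illustration. For the two real markers the resulting p-values agree with the Pvalue_bestfit column above (0.1624 and 0.7656), although this should not be read as a description of how checkF1 is implemented.

```r
# Observed F1 counts for dosage classes 0 and 1, taken from the checkF1 output
# above for two simplex x nulliplex markers, plus one invented skewed marker
counts <- rbind(Ac_ws072452 = c(102, 83),
                St_rs076767 = c(92, 88),
                toy_skewed  = c(130, 55))   # hypothetical, for illustration
colnames(counts) <- c("F1_0", "F1_1")

# Goodness-of-fit test against the 1:1 ratio expected for a 1x0 marker
p_values <- apply(counts, 1, function(obs) {
  chisq.test(obs, p = c(0.5, 0.5))$p.value
})
print(round(p_values, 4))

# Flag markers whose segregation departs strongly from expectation
# (the 0.001 threshold is an arbitrary choice for this sketch)
skewed <- names(p_values)[p_values < 0.001]
print(skewed)
```

Only the invented marker is flagged here, in line with the statement that the sample dataset contains no skewed markers.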

4.3 Running a PCA

It can also be useful to run a principal component analysis at this stage using the PCA_progeny function on the marker dosages:

PCA_progeny(dosage_matrix = ALL_dosages, 
            highlight = list(c("P1", "P2")), 
            colors = "red")

In cases where there are serious issues with the population (such as parental mix-ups, inclusion of unrelated samples with the bi-parental population, multiple sub-populations from e.g. multiple pollen-donors etc) a PCA plot can help. Here, we see no such problems so we can proceed.
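For readers who want to see what such an analysis involves, the following is a generic base R sketch of a PCA on a dosage matrix using prcomp. It uses simulated data and a simple mean imputation of missing scores; both are assumptions for illustration, and the sketch is not intended to reproduce the internals of PCA_progeny.

```r
set.seed(1)
# Simulated dosages: 50 markers x 20 individuals in a tetraploid (values 0..4)
toy <- matrix(sample(0:4, 50 * 20, replace = TRUE), nrow = 50,
              dimnames = list(paste0("m", 1:50), sprintf("F1_%03d", 1:20)))
toy[sample(length(toy), 30)] <- NA        # introduce some missing scores

# Replace each missing score with the marker mean (a simple, assumed choice)
marker_means <- rowMeans(toy, na.rm = TRUE)
missing <- which(is.na(toy), arr.ind = TRUE)
toy[missing] <- marker_means[missing[, "row"]]

# Individuals must be in rows for prcomp, so transpose the matrix
pca <- prcomp(t(toy), center = TRUE, scale. = FALSE)
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
```

With real data, individuals that fall far from the main cloud of progeny on such a plot are the ones to investigate.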

5. Generate summary data

We now summarise the marker data in terms of the different segregation types, number of missing values, number of inadmissible scores (which could be genotyping errors or possible double reduction scores). Run the summary function on the marker data as follows:

mds <- marker_data_summary(dosage_matrix = ALL_dosages,
                           ploidy = 4,
                           pairing = "random",
                           parent1 = "P1",
                           parent2 = "P2",
                           progeny_incompat_cutoff = 0.05)
## Calculating parental info...
## Checking compatibility between parental and offspring scores...
## 
## #### parental_info
## 
## |     | P2_0| P2_1| P2_2| P2_3| P2_4|
## |:----|----:|----:|----:|----:|----:|
## |P1_0 |    0|  312|  227|   49|    3|
## |P1_1 |  300|  625|  356|   67|    4|
## |P1_2 |  230|  323|  175|   29|    2|
## |P1_3 |   66|   65|   30|    5|    0|
## |P1_4 |    2|    5|    3|    0|  122|
## 
## #### offspring_incompatible
## 
## |     | P2_0| P2_1| P2_2| P2_3| P2_4|
## |:----|----:|----:|----:|----:|----:|
## |P1_0 |   NA| 0.00| 0.00| 0.56| 2.42|
## |P1_1 | 0.00| 0.00| 0.00| 0.28| 0.85|
## |P1_2 | 0.00| 0.00| 0.00| 0.12| 0.00|
## |P1_3 | 0.45| 0.33| 0.14| 0.19|   NA|
## |P1_4 | 0.97| 0.48| 0.64|   NA| 1.07|
## 
## #### Incompatible individuals:
## 
## None
pq_before_convert <- parental_quantities(dosage_matrix = ALL_dosages, 
                                         las = 2)

## |          | 1x0| 2x0| 3x0| 4x0| 0x1| 1x1| 2x1| 3x1| 4x1| 0x2| 1x2| 2x2| 3x2| 4x2| 0x3| 1x3| 2x3| 3x3| 0x4| 1x4| 2x4| 4x4|
## |:---------|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
## |frequency | 300| 230|  66|   2| 312| 625| 323|  65|   5| 227| 356| 175|  30|   3|  49|  67|  29|   5|   3|   4|   2| 122|

If you want to know exactly what these functions are doing and what can be typed as an argument just use ?marker_data_summary and/or ?parental_quantities.
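The parental_info table above is essentially a cross-tabulation of the two parental dosage columns. A minimal base R sketch of that idea on a toy matrix is given below; the toy values are invented and the code is only an illustration of the tabulation, not of the other checks performed by marker_data_summary.

```r
# Toy dosage matrix: four markers, two parents and two F1 individuals
toy_dosages <- matrix(c(1, 0, 1, 1,
                        2, 4, 3, 2,
                        0, 3, 2, 2,
                        1, 0, 0, 1),
                      nrow = 4, byrow = TRUE,
                      dimnames = list(paste0("mymarker", 1:4),
                                      c("P1", "P2", "F1_001", "F1_002")))

# Fix the factor levels so that empty dosage classes still appear in the table
parental_table <- table(P1 = factor(toy_dosages[, "P1"], levels = 0:4),
                        P2 = factor(toy_dosages[, "P2"], levels = 0:4))
print(parental_table)
```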

6. Convert marker dosages to simple segregations; remove non-segregating data

Simplex x nulliplex markers segregate in a 1:1 ratio in the offspring, but this segregation is also possible with triplex x nulliplex, triplex x quadriplex and simplex x quadriplex, and the segregating allele in each case comes from the same parent. We can therefore re-code all of these segregation types into the simplest 1 x 0 case, taking care that we handle possible double reduction scores correctly. The conversions we apply in a tetraploid, which simplify the subsequent analysis and mapping work, are:

3x0,3x4,1x4 -> 1x0
0x3,4x3,4x1 -> 0x1
2x4 -> 2x0
4x2 -> 0x2
3x3 -> 1x1
3x1 -> 1x3
3x2 -> 1x2
2x3 -> 2x1

Note that some classes such as 2x2 cannot be converted into any other category, and we do not convert non-segregating classes such as 4x4 as these are of no further use for mapping or QTL analysis. To convert the marker data:

segregating_data <- convert_marker_dosages(dosage_matrix = ALL_dosages, ploidy = 4)
## Warning in convert_marker_dosages(dosage_matrix = ALL_dosages, ploidy = 4): There are dosages greater than 4 or less than 0 in the dataset.
##             If they include parental dosages, markers are removed from the output.
##             Otherwise the dosage is made missing.
## 
## #### Marker dosage frequencies:
## 
## |     | P2_0| P2_1| P2_2| P2_3|
## |:----|----:|----:|----:|----:|
## |P1_0 |    0|  366|  230|    0|
## |P1_1 |  370|  630|  386|  132|
## |P1_2 |  232|  352|  175|    0|
## 
## markers not converted: 2615
## 
## markers 1 parent converted: 129
## 
## markers 2 parents converted: 256
## 
## non-segregating markers deleted: 127
pq_after_convert <- parental_quantities(dosage_matrix = segregating_data)

## |          | 1x0| 2x0| 0x1| 1x1| 2x1| 0x2| 1x2| 2x2| 1x3|
## |:---------|---:|---:|---:|---:|---:|---:|---:|---:|---:|
## |frequency | 370| 232| 366| 630| 352| 230| 386| 175| 132|
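As a cross-check, the conversion table listed earlier in this section can be applied by hand to the class frequencies reported in section 5. The sketch below does this with a simple lookup vector in base R; it only re-tabulates the counts shown in this document and is not how convert_marker_dosages itself works on the dosage matrix.

```r
# Parental dosage class frequencies before conversion (copied from the
# parental_quantities output in section 5)
before <- c("1x0" = 300, "2x0" = 230, "3x0" = 66, "4x0" = 2,
            "0x1" = 312, "1x1" = 625, "2x1" = 323, "3x1" = 65, "4x1" = 5,
            "0x2" = 227, "1x2" = 356, "2x2" = 175, "3x2" = 30, "4x2" = 3,
            "0x3" = 49,  "1x3" = 67,  "2x3" = 29,  "3x3" = 5,
            "0x4" = 3,   "1x4" = 4,   "2x4" = 2,   "4x4" = 122)

# The conversion table from this section, written as a lookup vector
conversions <- c("3x0" = "1x0", "3x4" = "1x0", "1x4" = "1x0",
                 "0x3" = "0x1", "4x3" = "0x1", "4x1" = "0x1",
                 "2x4" = "2x0", "4x2" = "0x2", "3x3" = "1x1",
                 "3x1" = "1x3", "3x2" = "1x2", "2x3" = "2x1")

# Non-segregating classes are dropped rather than converted
non_segregating <- c("0x0", "0x4", "4x0", "4x4")
segregating <- before[!names(before) %in% non_segregating]

# Map each class to its converted class, leaving unlisted classes unchanged
is_converted <- names(segregating) %in% names(conversions)
target <- names(segregating)
target[is_converted] <- conversions[names(segregating)[is_converted]]

after <- tapply(segregating, target, sum)
print(after)
```

The resulting totals match the frequencies reported after conversion (for example 370 markers in class 1x0 and 132 in class 1x3), and the dropped classes add up to the 127 non-segregating markers reported as deleted.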

7. Quality checks on marker data

A number of quality checks are recommended before we begin mapping. For example, we do not want to include markers which have too many missing values (“NA” scores) as this may be an indication that the marker has performed poorly and may have many errors associated with it. We would also like to screen out individuals that have many missing values as this might be an indication that the DNA quality of that sample was poor, or duplicate individuals which give no extra information and should probably also be removed / combined. Because there are no fixed rules for how stringent we should be with quality checks (5% missing values ok? 10% missing values ok? 20%?), some visual aids are produced in order to help with these decisions.

7.1 Missing value rate per marker

We first screen the marker data for missing values using screen_for_NA_values. This function allows the user to remove rows or columns from the dataset based on their rate of missing values. The dosage matrix, here segregating_data, contains markers in rows and individuals in columns. In R, rows are usually specified by the number 1, and columns by 2. The argument margin lets you specify by margin number whether you want to screen markers (margin = 1) or individuals (margin = 2). If the argument cutoff is set to NULL, the function asks for user input on the cutoff rate. In the example below cutoff is specified, but because such decisions are best made after inspecting the data, we recommend specifying cutoff = NULL and choosing the value at that point. A threshold of 10% missing values is sensible, as a compromise between high-confidence marker data and keeping a large proportion of the markers for mapping and QTL analysis. In general, the most problematic markers tend to have higher rates of missing values, so any reasonable threshold here should be fine. If problems are encountered later in the mapping, it might be worthwhile to re-run this step with a more stringent (e.g. 5%) threshold.

screened_data <- screen_for_NA_values(dosage_matrix = segregating_data, 
                                      margin = 1, # margin 1 means markers
                                      cutoff =  0.10,
                                      print.removed = FALSE) 

## 
## Number of removed markers :  1458 
## 
## 
## There are now 1415 markers leftover in the dataset.
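The quantity being screened is simply the fraction of NA scores per row or per column. The base R sketch below shows that calculation on a toy matrix for both margins; the toy data are invented, and whether screen_for_NA_values treats a rate exactly equal to the cutoff as acceptable is an assumption made here.

```r
set.seed(42)
# Toy dosage matrix: 6 markers x 20 F1 individuals, tetraploid dosages
toy <- matrix(sample(0:4, 120, replace = TRUE), nrow = 6,
              dimnames = list(paste0("m", 1:6), sprintf("F1_%03d", 1:20)))
toy["m2", 1:8] <- NA      # marker with 40% missing values
toy["m5", 3]   <- NA      # marker with 5% missing values

cutoff <- 0.10

# margin = 1: fraction of missing scores per marker (row)
na_rate_markers <- rowMeans(is.na(toy))
print(na_rate_markers)
kept <- toy[na_rate_markers <= cutoff, , drop = FALSE]

# margin = 2: fraction of missing scores per individual (column),
# calculated on the markers that survived the first screen
na_rate_individuals <- colMeans(is.na(kept))
print(na_rate_individuals)
```

In this toy example only marker m2 is removed, and the per-individual rates are then calculated on the five remaining markers, mirroring the order of steps 7.1 and 7.2.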

7.2 Missing value rate per individual

As in step 7.1 we can also visualise the rate of missing values per individual, which can help us to make a decision about whether certain individuals should be removed from the dataset before commencing the mapping. In order to screen the individuals for missing values we enter:

screened_data2 <- screen_for_NA_values(dosage_matrix = screened_data, 
                                       cutoff = 0.1, 
                                       margin = 2, #margin = 2 means columns
                                       print.removed = FALSE)

## 
## Number of removed individuals :  2 
## 
## 
## There are now 205 individuals leftover in the dataset.

Again, it is not clear what an acceptable threshold for the rate of missing values should be, but we recommend a cut-off rate of 0.1 – 0.15 (10-15%) missing values allowed.

7.3 Duplicate individuals

It is also desirable to check whether there are any duplicated individuals in the population, as can sometimes happen (for instance through mix-ups in the DNA preparation and genotyping). The function screen_for_duplicate_individuals does this for you. Note that duplicates can remain in the dataset, but they add no further information for mapping or QTL analysis, and may even distort the analysis (by giving an inaccurate population size). As with screen_for_NA_values, cutoff can be set to NULL to accept user input. For screen_for_duplicate_individuals in particular, it makes sense to decide on the cutoff after inspecting the data and figures. To screen the data for duplicate individuals, enter:

screened_data3 <- screen_for_duplicate_individuals(dosage_matrix = screened_data2, 
                                                   cutoff = 0.95, 
                                                   plot_cor = TRUE)

## 
## Combining F1_060 & F1_081 into F1_060
## 
## Combining F1_046 & F1_085 into F1_046
## Warning in screen_for_duplicate_individuals(dosage_matrix = screened_data2, :
## Multiple duplicates of single genotype identified at this threshold. Attempting
## to merge...
## 
## Combining F1_032 & F1_100 into F1_032
## 
## Combining F1_032 & F1_114 into F1_032
## 
## Combining F1_079 & F1_101 into F1_079
## 
## Combining F1_106 & F1_113 into F1_106
## 
## Combining F1_112 & F1_199 into F1_112
## 
## ####  7 individuals removed:
## 
##      _        _        _        _       
## [1,] "F1_081" "F1_085" "F1_100" "F1_114"
## [2,] "F1_101" "F1_113" "F1_199" ""      
## |_      |_      |_      |_      |
## |:------|:------|:------|:------|
## |F1_081 |F1_085 |F1_100 |F1_114 |
## |F1_101 |F1_113 |F1_199 |       |

Here we see that there were a number of duplicate pairs of individuals identified (red points in the output graph). Each duplicate pair is merged into a consensus individual (conflicts are made missing), keeping the name of the first individual. This is summarised in the output as well.
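The logic of finding and merging a duplicate pair can be sketched in base R as follows. The toy data, the helper merge_pair, the use of a pairwise correlation as the similarity measure and the choice to fill a missing score from the partner are all assumptions for illustration; only the rule that conflicting scores become missing is taken from the description above.

```r
# Toy dosage matrix: 20 markers x 4 individuals; F1_003 is a near-copy of
# F1_001 with one missing score (m7) and one conflicting score (m20)
base_scores <- rep(c(0, 1, 2, 1, 0, 2, 1, 1, 2, 0), 2)
toy <- cbind(F1_001 = base_scores,
             F1_002 = rep(c(1, 0, 1, 2, 2, 0, 1, 0, 1, 2), 2),
             F1_003 = replace(base_scores, c(7, 20), c(NA, 1)),
             F1_004 = rep(c(2, 2, 0, 1, 1, 1, 0, 2, 0, 1), 2))
rownames(toy) <- paste0("m", 1:20)

# Pairwise correlation between individuals, ignoring missing scores
cors <- cor(toy, use = "pairwise.complete.obs")
cors[lower.tri(cors, diag = TRUE)] <- NA      # keep each pair only once
dup_pairs <- which(cors > 0.95, arr.ind = TRUE)

# Hypothetical helper: consensus of two score vectors
merge_pair <- function(a, b) {
  merged <- ifelse(is.na(a), b, a)              # fill gaps from the partner
  merged[!is.na(a) & !is.na(b) & a != b] <- NA  # conflicts become missing
  merged
}

# Merge each pair into the first individual (sufficient for this simple case
# where no individual appears in more than one pair)
for (k in seq_len(nrow(dup_pairs))) {
  first  <- colnames(cors)[dup_pairs[k, "row"]]
  second <- colnames(cors)[dup_pairs[k, "col"]]
  toy[, first] <- merge_pair(toy[, first], toy[, second])
  toy <- toy[, colnames(toy) != second, drop = FALSE]
}
print(colnames(toy))
```

After merging, F1_003 has been removed and the conflicting score at m20 in F1_001 has been set to missing.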

7.4 Duplicated markers

Finally, we can also check for duplicated markers (those which give us no extra information about a locus) and remove these from the dataset before mapping, using the screen_for_duplicate_markers function. This function returns a list with 2 items - the filtered_dosage_matrix which we are interested in now, and bin_list, which we are possibly interested in again later. It is a good idea to save the output of this function, as you might like to add back the duplicate markers after mapping.

screened_data4 <- screen_for_duplicate_markers(dosage_matrix = screened_data3)

##   |======================================================================| 100%
## Marked and merged 3 duplicated markers

Note that, increasingly, datasets contain far more markers than the population size can resolve. It is possible to estimate the genetic resolution of the population you are working with, and from that estimate the expected bin-size in your dataset, assuming your markers are approximately randomly-distributed across the genome (this may not be a good assumption in your case, but as a rough guide this function may still be useful). Considering only bivalent pairing, in a tetrasomic species we can assume that approximately four cross-over recombinations occurred per linkage group in each offspring, taking the rule-of-thumb guide of one cross-over recombination per bivalent, which has been observed in autotetraploid Arabidopsis arenosa for example [2].

We can try to estimate the average bin size by re-running the previous function with the argument estimate_bin_size = TRUE. The function then does a quick calculation on how many markers would be expected in an average marker bin (under the tenuous assumptions of uniform distributions of both markers and cross-over break-points). In a tetraploid, each offspring would be expected to have 4 cross-over break-points per chromosomal linkage group (2 maternal and 2 paternal), so across a population of N individuals there are approximately 4N breaks per linkage group, and 4NL such breaks across L linkage groups. Supposing there are in total y distinct marker types (like 1x0, 1x1, 2x0, 0x1 etc), then your M markers are spread over approximately 4NLy distinct segregation patterns (assuming no errors), so that M/(4NLy) is an approximate estimate for the average bin size. In datasets with very large numbers of markers (e.g. genotyping using NGS), it may be worthwhile to only include markers with a minimum number of duplicates present, and that minimum number can be informed by the calculation above. In this case, supposing we demand at least 6 duplicates (so a minimum bin size of 6), we would do something like the following:

reliable.markers <- names(which(sapply(screened_data4$bin_list,length) >= 6))
reliable_data <- screened_data4$filtered_dosage_matrix[reliable.markers,]

Of course, this assumes we have a very large number of markers to choose from, as this is a very severe demand in a more traditional, well-balanced dataset. It also presumes there are no genotyping errors. A genotyping error will make duplicate markers appear to be different even when they are not (from a genetic mapping perspective). For now we are ignoring this issue, but may return to it in future updates.
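The rough bin-size estimate M/(4NLy) described above can be worked through by hand for the example dataset. In the sketch below, M, L and y are read from the outputs earlier in this document, while N = 198 is inferred (205 individuals after step 7.2 minus the 7 merged duplicates) and should be treated as an assumption.

```r
M <- 1415   # markers remaining after the missing-value screen in section 7.1
N <- 198    # F1 individuals: 205 after section 7.2 minus 7 merged duplicates
L <- 5      # linkage groups (chromosomes) in the example species
y <- 9      # marker classes present after conversion (1x0, 2x0, 0x1, ...)

n_breaks <- 4 * N * L        # expected cross-over break-points in the population
n_bins <- n_breaks * y       # approximate number of distinct segregation patterns
avg_bin_size <- M / n_bins
print(round(avg_bin_size, 3))
```

Under these assumptions the estimate is far below one marker per bin, which is consistent with the small number of duplicated markers found above.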

In the current data-set there is quite a good balance between the number of markers and the number of individuals in the mapping population, and so we extract the filtered_dosage_matrix that has the duplicate markers merged:

filtered_data <- screened_data4$filtered_dosage_matrix 

Before we start the mapping, it is a good idea to run the summary function again, this time on the filtered_data object, so we have a clear record of the numbers of markers going forward to the mapping stage.

pq_screened_data <- parental_quantities(dosage_matrix = filtered_data)

## |          | 1x0| 2x0| 0x1| 1x1| 2x1| 0x2| 1x2| 2x2| 1x3|
## |:---------|---:|---:|---:|---:|---:|---:|---:|---:|---:|
## |frequency | 175| 111| 194| 301| 183| 101| 198|  88|  61|

8. Simplex x nulliplex markers – defining chromosomes and homologues

The first step in actually mapping the markers is to define linkage groups for the markers. We use the simplex x nulliplex markers to define the linkage groups (chromosomes) and following that, the homologues (4 per linkage group are expected in a tetraploid). We use the LOD scores to cluster markers, as these increase dramatically for tight coupling-phase estimates.

To estimate the recombination frequencies, LOD and phasing between all marker pairs of P1 enter:

SN_SN_P1 <- linkage(dosage_matrix = filtered_data, 
                    markertype1 = c(1,0),
                    parent1 = "P1",
                    parent2 = "P2",
                    which_parent = 1,
                    ploidy = 4,
                    pairing = "random"
)

A note on arguments and multi-core processing

There is much more to the linkage function. Check out ?linkage to see the arguments that can be passed to it. One important feature is that it can run in parallel on multiple cores. This can be very time-efficient if you are working with a large number of markers. The number of cores to use can be specified using the ncores argument. Before doing so, first check the number of cores on your system using parallel::detectCores(). Use at least one core fewer than you have available to prevent overcommitment.
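A small sketch of how the number of cores might be chosen is given below. The variable names are arbitrary; the resulting value could then be supplied as ncores = n_cores in a call to linkage.

```r
available_cores <- parallel::detectCores()   # may return NA on some systems

# Leave one core free, but never request fewer than one
n_cores <- if (is.na(available_cores)) 1L else max(1L, available_cores - 1L)
print(n_cores)
```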

Note that since polymapR version 1.1.5, some of the arguments of the linkage function have been updated. The arguments target_parent and other_parent have been removed, and replaced with parent1 and parent2, referring to the column names of the two parents. A new argument which_parent has been introduced, to allow you to specify which parent is being “targeted” for linkage analysis (either 1 or 2). This was needed to accommodate an update for mapping in triploid populations, but also makes it easier if you rename your parental columns “P1” and “P2” at the start, as these are the default values provided for the arguments parent1 and parent2. This is the case here, so for subsequent calls to linkage we will omit specifying the parental names each time.

It is a good idea to first visualise the relationship between the recombination frequency and LOD score, which gives us some insight into the differences between the information content of the different phases, as well as highlighting possible problems in the data. This can be done by using the r_LOD_plot function:

r_LOD_plot(linkage_df = SN_SN_P1, r_max = 0.5)
## Warning: Computation failed in `stat_binhex()`.
## Caused by error in `compute_group()`:
## ! The package "hexbin" is required for `stat_bin_hex()`.

In some situations, it is possible to identify outlying datapoints on this plot, which do not fit the expected relationship. We have implemented a function to check this for the case of pairs of simplex x nulliplex markers, as these are the markers that we use in the subsequent steps of clustering and identifying chromosome and homologue linkage groups. If you are unsure, you can run the SNSN_LOD_deviations function anyway to see whether your data fits the expected pattern:

P1deviations <- SNSN_LOD_deviations(linkage_df = SN_SN_P1,
                                    ploidy = 4,
                                    N = ncol(filtered_data) - 2, #The F1 population size
                                    alpha = c(0.05,0.2),
                                    plot_expected = TRUE,
                                    phase="coupling")

Here we see there are a few “outliers” in the coupling pairwise data (identified by the stars) from the expected relationship between r and LOD (the bounds of which are shown by the dotted lines - these can be widened as necessary using the alpha argument for the upper and lower tolerances around the “true” line, usually in a trial and error approach). In our example dataset, these highlighted points are hardly outliers and we can proceed. In other datasets, you may want to remove these starred pairwise estimates before continuing with marker clustering, as it is possible they may cause unrelated homologue clusters to clump together and prevent a clear clustering structure from being defined. To do so, remove them from the linkage data frame as follows:

# Logical indexing avoids dropping every row when no deviation exceeds 0.2
SN_SN_P1.1 <- SN_SN_P1[SN_SN_P1$phase == "coupling",][is.na(P1deviations) | P1deviations <= 0.2,]
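To give a feel for the kind of expected relationship between r and LOD mentioned above, the sketch below evaluates the textbook two-point LOD score for a 1:1 (backcross-like) segregation at the expected recombinant count. Using this formula for coupling-phase simplex x nulliplex pairs is an assumption made here for illustration, as are the function name expected_lod and the population size of 198; the exact expectation used inside SNSN_LOD_deviations may differ.

```r
# Textbook two-point LOD for a 1:1 (backcross-like) segregation, evaluated at
# the expected recombinant count N * r. Assumed here to approximate the
# coupling-phase SxN case; not taken from the polymapR source code.
expected_lod <- function(r, N) {
  term_nonrec <- (1 - r) * log10(2 * (1 - r))
  term_rec <- ifelse(r > 0, r * log10(2 * r), 0)   # limit of r*log10(2r) as r -> 0
  N * (term_nonrec + term_rec)
}

r_grid <- seq(0, 0.5, by = 0.05)
lod_curve <- expected_lod(r_grid, N = 198)   # N = 198 is an assumed population size
print(round(data.frame(r = r_grid, LOD = lod_curve), 2))
```

The curve starts at N x log10(2) for r = 0 and falls to zero at r = 0.5, which is the general shape against which outlying pairwise estimates stand out.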

We are now in a position to cluster the marker data into chromosomes using the cluster_SN_markers function. This function clusters markers over a range of LOD thresholds. In the example below this is over a range from LOD 3 to 10 in steps of 0.5 (LOD_sequence = seq(3, 10, 0.5)). If the argument plot_network is set to TRUE this function will plot the network of the lowest LOD threshold (LOD 3 in this case), and overlay the clusters of the higher LOD scores. This setting is not recommended with a large number of marker pairs, as your computer will most likely run out of memory.

P1_homologues <- cluster_SN_markers(linkage_df = SN_SN_P1, 
                                    LOD_sequence = seq(3, 10, 0.5), 
                                    LG_number = 5,
                                    ploidy = 4,
                                    parentname = "P1",
                                    plot_network = FALSE,
                                    plot_clust_size = FALSE)
## Total number of edges: 894

## 20 clusters were expected.

The function cluster_SN_markers returns a list of cluster data frames, one data frame for each LOD threshold. The first plot shows two things - on the normal y-axis, the number of clusters over a range of LOD thresholds (x-axis) - and on the right-hand y-axis, the number of markers that do not form part of any cluster, which is shown by the blue dashed line. In the cluster_SN_markers function, we specify the ploidy (4 for a tetraploid) and the expected number of chromosomes (LG_number = 5) and therefore there are 4 x 5 = 20 homologue clusters expected. If we examine the second plot, we see how these clusters stay together or fragment as the linkage stringency (LOD score) increases. Here, we see clearly that at LOD 3.5, there are 5 clusters which all eventually divide into 4 sub-clusters each. This is the ideal scenario - we have identified both the chromosomal linkage groups and the homologues using only the simplex x nulliplex marker data. Note that it is the coupling-phase estimates which define the homologue clusters and the repulsion-phase estimates which provide the associations between these coupling-clusters (homologues). In practice, things might not always be so orderly. In noisy datasets with higher numbers of genotyping errors, or at higher ploidy levels, the repulsion-phase linkages between homologues will be lost among false positive linkages between un-linked markers. This means that all markers will likely clump together at lower LOD scores.
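The idea of clusters fragmenting as the LOD threshold rises can be illustrated with a small base R sketch that groups markers into connected components, using only marker pairs whose LOD reaches the threshold. The toy data, the helper cluster_at_lod and the column names marker_a, marker_b and LOD are assumptions for illustration; cluster_SN_markers may be implemented differently, and unlike the output described above this sketch counts unlinked single markers as clusters of size one.

```r
# Toy pairwise linkage results; the column names mimic a linkage data frame
# (marker_a, marker_b, LOD) but the values are invented
toy_linkage <- data.frame(
  marker_a = c("m1", "m2", "m4", "m5", "m3", "m1"),
  marker_b = c("m2", "m3", "m5", "m6", "m4", "m6"),
  LOD      = c(8.2,  7.5,  9.1,  6.8,  3.4,  0.9),
  stringsAsFactors = FALSE
)

# Group markers into connected components, using only pairs with LOD >= threshold
cluster_at_lod <- function(linkage_df, threshold) {
  markers <- sort(unique(c(linkage_df$marker_a, linkage_df$marker_b)))
  cluster <- setNames(seq_along(markers), markers)   # start: one cluster per marker
  edges <- linkage_df[linkage_df$LOD >= threshold, ]
  for (i in seq_len(nrow(edges))) {
    id_a <- cluster[[edges$marker_a[i]]]
    id_b <- cluster[[edges$marker_b[i]]]
    cluster[cluster == id_b] <- id_a                 # merge the two clusters
  }
  cluster
}

n_clusters <- sapply(c(3, 5, 10), function(lod) {
  length(unique(cluster_at_lod(toy_linkage, lod)))
})
names(n_clusters) <- paste0("LOD", c(3, 5, 10))
print(n_clusters)
```

In this toy example all six markers form one cluster at LOD 3, split into two clusters at LOD 5, and fall apart into single markers at LOD 10.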

You can manually check how many clusters there are at a specific LOD score, and how many markers are in a cluster:

P1_hom_LOD3.5 <- P1_homologues[["3.5"]]
t <- table(P1_hom_LOD3.5$cluster)
print(paste("Number of clusters:",length(t)))
## [1] "Number of clusters: 5"
t[order(as.numeric(names(t)))]
## 
##  1  2  3  4  5 
## 37 36 29 31 41
P1_hom_LOD5 <- P1_homologues[["5"]]
t <- table(P1_hom_LOD5$cluster)
print(paste("Number of clusters:",length(t)))
## [1] "Number of clusters: 20"
t[order(as.numeric(names(t)))]
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
## 13  9 10  4 12  7  9  8  3 13  6  7  4 10  6 11 17  6  6 12
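Rather than inspecting one threshold at a time, the same summary can be produced for every LOD threshold at once. The sketch below uses a small hypothetical stand-in for the list returned by cluster_SN_markers (only the cluster column is taken from the code above; the marker column and all values are made up). Note that the list is indexed by character names such as "3.5", which is why the thresholds are converted back to numbers for the summary table.

```r
# Hypothetical stand-in for a list of cluster data frames, one per LOD threshold
toy_cluster_list <- list(
  "3"   = data.frame(marker = paste0("m", 1:6), cluster = c(1, 1, 1, 1, 1, 1)),
  "3.5" = data.frame(marker = paste0("m", 1:6), cluster = c(1, 1, 1, 2, 2, 2)),
  "4"   = data.frame(marker = paste0("m", 1:5), cluster = c(1, 1, 2, 3, 3))
)

# Number of clusters and number of clustered markers at each threshold
n_clusters <- sapply(toy_cluster_list, function(df) length(unique(df$cluster)))
n_markers  <- sapply(toy_cluster_list, nrow)

data.frame(LOD = as.numeric(names(toy_cluster_list)), n_clusters, n_markers)
```

Applied to the real output, replacing toy_cluster_list with P1_homologues should give one row per LOD threshold, provided the data frames in the list have the structure assumed here.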

Based on the plots (or the output as shown) you should be able to decide which LOD threshold best separates the data. In this example, at a LOD of 3.5 we identify chromosomes and anywhere between LOD 4.5 and LOD 7 we have identified our homologues (4 x 5 = 20 clusters). In this parent (P1) we have therefore solved the marker clustering problem without any difficulty. It remains to gather this into a structure that can be used at later stages. To do this, we need two LOD values which define chromosomal groupings (here, LOD_chm = 3.5) and homologue groupings (here, LOD_hom = 5). We can then use the define_LG_structure function with these arguments as follows (note the argument LG_number refers to the expected number of chromosomes for this species):

LGHomDf_P1 <- define_LG_structure(cluster_list = P1_homologues, 
                                  LOD_chm = 3.5, 
                                  LOD_hom = 5, 
                                  LG_number = 5)
## 
## #### SxN Marker(s) lost in clustering step at LOD 5:
## 
## |_           |
## |:-----------|
## |Ls_rs099908 |
head(LGHomDf_P1)
##     SxN_Marker LG homologue
## 1  Ac_ns035629  1         1
## 3  Ac_ns045766  1         1
## 7  Ac_ns096474  1         1
## 22 Ac_ts072686  1         1
## 24 Ac_ws013795  1         1
## 42 Ap_rs075640  1         1

However, you may find that it is not possible to identify chromosomal groups as simply as this. We therefore recommend that homologue groups be first identified, and then another marker type be used to provide associations between these clusters, and hence help form chromosomal linkage groups. This can be done as follows:

  • Generate homologue clusters using the cluster_SN_markers function at a high LOD.

  • Use the Duplex x Nulliplex or Simplex x Simplex marker information to put the clusters back together into chromosomal groups.

It makes sense to use only SxN marker pairs in coupling phase for the homologue clustering (since these associations define the homologues). Note that the most likely phase per marker pair (coupling or repulsion) was already estimated by our previous call to the linkage function.

SN_SN_P1_coupl <- SN_SN_P1[SN_SN_P1$phase == "coupling",] # select only markerpairs in coupling

P1_homologues_1 <- cluster_SN_markers(linkage_df = SN_SN_P1_coupl, 
                                    LOD_sequence = c(3:12), 
                                    LG_number = 5,
                                    ploidy = 4,
                                    parentname = "P1",
                                    plot_network = FALSE,
                                    plot_clust_size = FALSE)
## Total number of edges: 682

## 20 clusters were expected.

In the dot plot above, we see that all chromosomal associations have now been removed, along with the repulsion-phase information.

A note on nested lists

In the polymapR package, output is often given as a nested list. Some basic information on lists is available on e.g. r-tutor. We use nested lists because we are often dealing with hierarchical data: marker data > homologue > linkage group > organism. Some basic R knowledge is needed to retrieve, visualize or filter data from such a nested list. However, if you would like to visualize it without R (e.g. in your favourite spreadsheet program), you can choose to export a nested list using write_nested_list. It will keep its hierarchical structure by writing it to directories that have the same structure as the nested list. Here’s an example: write_nested_list(list = P1_homologues, directory = "P1_homologues")
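For readers who have not worked with nested lists before, the following base-R sketch shows the usual ways of inspecting and indexing one. The object toy_nested and its contents are hypothetical and only mimic the linkage group > homologue > marker hierarchy described above.

```r
# A small hypothetical nested list: linkage group > homologue > markers
toy_nested <- list(
  LG1 = list(homologue1 = c("m1", "m2"), homologue2 = c("m3")),
  LG2 = list(homologue1 = c("m4"), homologue2 = c("m5", "m6", "m7"))
)

# Compact overview of the structure, limited to two levels
str(toy_nested, max.level = 2)

# Retrieve a single element by name, one level at a time
toy_nested[["LG2"]][["homologue2"]]

# Apply a function at the lower level: number of markers per homologue
markers_per_homologue <- lapply(toy_nested, function(lg) sapply(lg, length))
markers_per_homologue$LG2
```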

Next, we use linkage to the DxN markers to provide bridging linkages between homologue clusters, using the bridgeHomologues function. The first time we run this function, we use a LOD threshold of 4 for evidence of linkage between a SxN and a DxN marker. However, there may also be some false positive linkages at this level. It might therefore be necessary to run the function again at a higher LOD (say LOD 8) to avoid such false positives. In doing so, we might lose true linkage information. Therefore, it’s a good idea to run the function at LOD 4 first, then increase to a higher LOD if necessary, and build up a picture of which clusters form part of the same chromosome. First calculate linkage between SxN and DxN markers:

SN_DN_P1 <- linkage(dosage_matrix = filtered_data, 
                    markertype1 = c(1,0),
                    markertype2 = c(2,0),
                    which_parent = 1,
                    ploidy = 4,
                    pairing = "random")

Then, use bridgeHomologues to use these SxN - DxN linkages to find associations (“bridges”) between homologue clusters. Note that we must provide a particular division of the Simplex x Nulliplex data - in this example, we use the division at LOD 6 which gave us 20 clusters (our 20 putative homologues):

LGHomDf_P1_1 <- bridgeHomologues(cluster_stack = P1_homologues_1[["6"]], 
                               linkage_df = SN_DN_P1, 
                               LOD_threshold = 4, 
                               automatic_clustering = TRUE, 
                               LG_number = 5,
                               parentname = "P1")

This results in an ideal situation of 5 linkage groups each with exactly 4 homologues:

table(LGHomDf_P1_1$LG, LGHomDf_P1_1$homologue)
##    
##      1  2  3  4
##   1 13  9  4 10
##   2 12  7  9  8
##   3  3 13  5  7
##   4  4 10  6 11
##   5 17 12  6  6

However, clustering of P2 is a bit more difficult. The following re-does everything for P2. Note that for clarity, we include the arguments parent1 and parent2 here, but we could have just as well left them out (as they are the default column names, see ?linkage):

SN_SN_P2 <- linkage(dosage_matrix = filtered_data, 
                    markertype1 = c(1,0),
                    parent1 = "P1",
                    parent2 = "P2",
                    which_parent = 2,
                    ploidy = 4,
                    pairing = "random"
)
SN_SN_P2_coupl <- SN_SN_P2[SN_SN_P2$phase == "coupling",] # get only markerpairs in coupling

P2_homologues <- cluster_SN_markers(linkage_df = SN_SN_P2_coupl, 
                                    LOD_sequence = c(3:12), 
                                    LG_number = 5,
                                    ploidy = 4,
                                    parentname = "P2",
                                    plot_network = FALSE,
                                    plot_clust_size = FALSE)
## Total number of edges: 743

## 20 clusters were expected.
SN_DN_P2 <- linkage(dosage_matrix = filtered_data, 
                    markertype1 = c(1,0),
                    markertype2 = c(2,0),
                    which_parent = 2,
                    ploidy = 4,
                    pairing = "random")
LGHomDf_P2 <- bridgeHomologues(cluster_stack = P2_homologues[["6"]], 
                               linkage_df = SN_DN_P2, 
                               LOD_threshold = 4, 
                               automatic_clustering = TRUE, 
                               LG_number = 5,
                               parentname = "P2")

table(LGHomDf_P2$LG,LGHomDf_P2$homologue)
##    
##      1  2  3  4  5
##   1 11  9  3  5  0
##   2 13 13  7 11  0
##   3 13 11 11  8  2
##   4  9  9 13 11  0
##   5 10 10  6  9  0

Occasionally a linkage group possesses fewer than four “homologues”. This can result in a loss of phase information in later steps, but will not prevent us from proceeding with the mapping. It is also possible that more than four homologues are grouped together (as for linkage group 3 above, which has five), most likely because one or more homologues are fragmented. As we don’t know which clusters are actual homologues and which are fragments, we must examine the affected linkage group in more detail. polymapR currently offers two methods for this.
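With many linkage groups it can be convenient to flag the problematic ones programmatically instead of reading them off the table. The following is a minimal sketch using a hypothetical data frame (toy_stack) with the same column names as the LG/homologue data frames shown above; the marker names and values are made up.

```r
# Hypothetical LG/homologue assignment, in the same layout as head(LGHomDf_P1)
toy_stack <- data.frame(
  SxN_Marker = paste0("m", 1:11),
  LG         = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3),
  homologue  = c(1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2)
)
ploidy <- 4

# Number of distinct homologue clusters within each linkage group
n_hom <- sapply(split(toy_stack$homologue, toy_stack$LG),
                function(h) length(unique(h)))

names(n_hom)[n_hom > ploidy]  # linkage groups with too many clusters
names(n_hom)[n_hom < ploidy]  # linkage groups with too few clusters
```

Linkage groups returned by the first line are candidates for the closer inspection described next; those returned by the second line can still be mapped, at the cost of some phase information.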

8.2 Using the function cluster_per_LG

cluster_per_LG does not rely on markers in repulsion, but uses variation in LOD thresholds between markers in coupling. Run it as follows:

cluster_per_LG(LG = 3,
               linkage_df = SN_SN_P2[SN_SN_P2$phase == "coupling",], 
               LG_hom_stack = LGHomDf_P2, 
               LOD_sequence = c(3:10), # The first element is used for network layout
               modify_LG_hom_stack = FALSE, 
               network.layout = "stacked",
               nclust_out = 4,
               label.offset=1.2)
## Total number of edges: 185

The network layout is based on the first LOD score you provide in LOD_sequence (which is LOD = 3 in this example). The darker the background, the more consistent the clustering is over the range of LOD scores. Change the LOD_sequence if the network is not as required (4 clusters in the case of a tetraploid) and re-run. If the network looks as you would like it to, use the cluster_per_LG function but with a LOD_sequence of length 1 (the first LOD score, so again LOD = 3 in this example), and specify modify_LG_hom_stack = TRUE to make a modification to the cluster numbering (the output is then returned and can be saved as a new LG_hom_stack object):

LGHomDf_P2_1 <- cluster_per_LG(LG = 3, 
                               linkage_df = SN_SN_P2[SN_SN_P2$phase == "coupling",], 
                               LG_hom_stack = LGHomDf_P2, 
                               LOD_sequence = 3, 
                               modify_LG_hom_stack = TRUE, 
                               network.layout = "n",
                               nclust_out = 4)
## Total number of edges: 185
table(LGHomDf_P2_1$homologue, LGHomDf_P2_1$LG)
##    
##      1  2  3  4  5
##   1 11 13 10  9 10
##   2  9 13 13  9 10
##   3  3  7 11 13  6
##   4  5 11 11 11  9

8.3 Cross-ploidy populations (e.g. tetraploid x diploid to give triploid F1)

polymapR is also able to map triploid populations, and may be extended to other cross-ploidy populations if there is demand for that in future. Tetraploid x diploid crosses are however the most common (often done to induce seedlessness in the F1). Previously we had implemented a mapping approach that assumes that the parent of higher ploidy level is parent 1. However, since polymapR v.1.1.5 this is no longer the case, so a 4x2 and 2x4 cross are both acceptable.

The next point to be aware of is that the segregation of markers in the diploid parent is assumed to follow disomic inheritance (there is no other option in a diploid), whereas the segregation of the tetraploid parent is assumed to follow tetrasomic inheritance. Mixtures with preferential pairing in the tetraploid parent are currently not implemented for triploid populations.

Because of this disomic behaviour in the diploid parent, we are unable to use the normal clustering function (cluster_SN_markers) to identify its homologues. For the tetraploid parent, proceed as normal with the steps described above. For the diploid parent (parent 2 in this example), the clustering proceeds a bit differently.

If you are confused about what marker types are possible in a triploid population, run the following, which gives all possible marker combinations (and the segregation ratios in the F1):

get("seg_p3_random",envir=getNamespace("polymapR"))
##    dosage1 dosage2 f1_0 f1_1 f1_2 f1_3
## 1        0       0    1    0    0    0
## 2        0       1    1    1    0    0
## 3        0       2    0    1    0    0
## 4        1       0    1    1    0    0
## 5        1       1    1    2    1    0
## 6        1       2    0    1    1    0
## 7        2       0    1    4    1    0
## 8        2       1    1    5    5    1
## 9        2       2    0    1    4    1
## 10       3       0    0    1    1    0
## 11       3       1    0    1    2    1
## 12       3       2    0    0    1    1
## 13       4       0    0    0    1    0
## 14       4       1    0    0    1    1
## 15       4       2    0    0    0    1
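The ratios in this table can be reproduced with a short calculation, which may help to see where they come from. The sketch below assumes that each gamete receives a random half of the parental homologues (random pairing, no double reduction), so that the gamete dosage follows a hypergeometric distribution; the function names gamete_probs and f1_probs are hypothetical and are not part of polymapR.

```r
# Probability of each gamete dosage (0 .. ploidy/2) when a gamete receives a
# random half of the parental homologues (assumption: no double reduction)
gamete_probs <- function(dosage, ploidy) {
  k <- ploidy / 2
  dhyper(0:k, m = dosage, n = ploidy - dosage, k = k)
}

# Probability of each F1 dosage class, obtained by combining the gamete
# distributions of a tetraploid parent 1 and a diploid parent 2
f1_probs <- function(dosage1, dosage2, ploidy1 = 4, ploidy2 = 2) {
  g1 <- gamete_probs(dosage1, ploidy1)
  g2 <- gamete_probs(dosage2, ploidy2)
  out <- numeric(length(g1) + length(g2) - 1)
  for (i in seq_along(g1)) {
    for (j in seq_along(g2)) {
      out[i + j - 1] <- out[i + j - 1] + g1[i] * g2[j]
    }
  }
  names(out) <- seq_along(out) - 1
  out
}

round(f1_probs(2, 1) * 12)  # 1:5:5:1, as in row 8 of the table
round(f1_probs(2, 0) * 6)   # 1:4:1:0, as in row 7 of the table
```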

To run the preliminary linkage analysis of the simplex x nulliplex markers of the diploid parent, we need to specify the ploidy of both parents, e.g. ploidy = 4 and ploidy2 = 2 (or vice versa). To demonstrate this, we load a special dataset with triploid data, a sample of which is also available in the polymapR package:

data("TRI_dosages")

# Estimate the linkage in the diploid parent (assuming this has been done for the 4x parent already):
SN_SN_P2.tri <- linkage(dosage_matrix = TRI_dosages, 
                    markertype1 = c(1,0),
                    parent1 = "P1",
                    parent2 = "P2",
                    which_parent = 2, 
                    ploidy = 4,
                    ploidy2 = 2,
                    pairing = "random"
)

Have a look at the plot of r versus LOD score - here you should notice that both the coupling and repulsion phases are equally informative now and cannot therefore be separated by LOD score as before:

r_LOD_plot(SN_SN_P2.tri)

If we try to cluster using the old function, we get the following result - no separation of the homologues, but the chromosomal linkage groups are identified:

P2_homologues.tri <- cluster_SN_markers(linkage_df = SN_SN_P2.tri, 
                                    LOD_sequence = seq(3, 10, 1), 
                                    LG_number = 3,
                                    ploidy = 2, #because P2 is diploid..
                                    parentname = "P2",
                                    plot_network = FALSE,
                                    plot_clust_size = FALSE) 
## Total number of edges: 345

## 3 clusters were expected.

To separate the 0x1 markers into homologues, we use the phase information from the linkage analysis directly, using the phase_SN_diploid function. To learn more about how this function works and to get an idea of the arguments used, first run ?phase_SN_diploid.

LGHomDf_P2.tri <- phase_SN_diploid(linkage_df = SN_SN_P2.tri,
                                   cluster_list = P2_homologues.tri,
                                   LOD_chm = 4, #LOD at which chromosomes are identified
                                   LG_number = 3) #number of linkage groups
## Total number of edges: 161
## Complete phase assignment possible using only coupling information at LOD 4
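The reasoning behind using phase information directly can be illustrated with a toy example. In a diploid parent there are only two homologues per chromosome, so two markers in coupling must lie on the same homologue and two markers in repulsion on opposite homologues. The sketch below propagates this rule through a hypothetical table of marker pairs from a single chromosome; it is only an illustration of the idea and not the algorithm implemented in phase_SN_diploid. The helper assign_homologues and the toy data are assumptions, while the phase labels follow those used in the linkage output above.

```r
# Hypothetical phased marker pairs from one chromosome of a diploid parent
toy_pairs <- data.frame(
  marker_a = c("m1", "m1", "m2", "m3"),
  marker_b = c("m2", "m3", "m4", "m4"),
  phase    = c("coupling", "repulsion", "repulsion", "coupling"),
  stringsAsFactors = FALSE
)

# Assign each marker to homologue 1 or 2, starting from the first marker:
# coupling keeps the homologue, repulsion switches to the other one.
# Markers not connected to the first marker remain NA.
assign_homologues <- function(pairs) {
  markers <- unique(c(pairs$marker_a, pairs$marker_b))
  hom <- setNames(rep(NA_integer_, length(markers)), markers)
  hom[markers[1]] <- 1L
  repeat {
    changed <- FALSE
    for (i in seq_len(nrow(pairs))) {
      a <- pairs$marker_a[i]
      b <- pairs$marker_b[i]
      flip <- pairs$phase[i] == "repulsion"
      if (!is.na(hom[a]) && is.na(hom[b])) {
        hom[b] <- if (flip) 3L - hom[a] else hom[a]
        changed <- TRUE
      } else if (is.na(hom[a]) && !is.na(hom[b])) {
        hom[a] <- if (flip) 3L - hom[b] else hom[b]
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  hom
}

assign_homologues(toy_pairs)  # m1 and m2 on one homologue, m3 and m4 on the other
```

This sketch does not check for contradictory phase estimates; with real data such conflicts would need to be detected and resolved, for example by only using pairs above a LOD threshold.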

The rest of the analysis should be pretty much the same as for tetraploids or hexaploids, although you should check each time you use a new function whether that function has a ploidy2 argument, and if it does, use it!

9. Assigning SxS and DxN markers and consensus linkage group (LG) names

Assigning simplex x simplex markers to a homologue and linkage group is done by calculating linkage between SxS and SxN markers and after that assigning them based on this linkage using assign_linkage_group:

SN_SS_P1 <- linkage(dosage_matrix = filtered_data, 
                    markertype1 = c(1,0),
                    markertype2 = c(1,1),
                    which_parent = 1,
                    ploidy = 4,
                    pairing = "random")
P1_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P1,
                                        LG_hom_stack = LGHomDf_P1,
                                        SN_colname = "marker_a",
                                        unassigned_marker_name = "marker_b",
                                        phase_considered = "coupling",
                                        LG_number = 5,
                                        LOD_threshold = 3,
                                        ploidy = 4)
## 
## #### Marker(s) showing ambiguous linkage to more than one LG:
## 
## |_           |
## |:-----------|
## |Ap_ts069042 |
## 
##  In total, 296 out of 301 markers were assigned.
## 
## #### Marker(s) not assigned:
## 
## |_           |_           |_           |_           |
## |:-----------|:-----------|:-----------|:-----------|
## |Ac_ws033297 |Ap_ts089267 |St_ns048176 |St_ts030296 |
## |Zm_ws010615 |            |            |            |
head(P1_SxS_Assigned)
##             Assigned_LG LG1 LG2 LG3 LG4 LG5 Hom1 Hom2 Hom3 Hom4 Assigned_hom1
## Ac_ns002135           4   0   0   0   7   0    0    7    0    0             2
## Ac_ns002510           2   0   7   0   0   0    0    0    0    7             4
## Ac_ns024650           2   0  11   0   0   0   11    0    0    0             1
## Ac_ns028519           5   0   0   0   0   5    0    5    0    0             2
## Ac_ns028533           5   0   0   0   0   3    0    0    3    0             3
## Ac_ns029178           3   0   0  12   0   0    0   12    0    0             2
##             Assigned_hom2 Assigned_hom3 Assigned_hom4
## Ac_ns002135            NA            NA            NA
## Ac_ns002510            NA            NA            NA
## Ac_ns024650            NA            NA            NA
## Ac_ns028519            NA            NA            NA
## Ac_ns028533            NA            NA            NA
## Ac_ns029178            NA            NA            NA
SN_SS_P2 <- linkage(dosage_matrix = filtered_data, 
                    markertype1 = c(1,0),
                    markertype2 = c(1,1),
                    which_parent = 2,
                    ploidy = 4,
                    pairing = "random")
P2_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P2,
                                        LG_hom_stack = LGHomDf_P2_1,
                                        SN_colname = "marker_a",
                                        unassigned_marker_name = "marker_b",
                                        phase_considered = "coupling",
                                        LG_number = 5,
                                        LOD_threshold = 3,
                                        ploidy = 4)
## 
##  In total, 300 out of 300 markers were assigned.

As simplex x simplex markers are present in both parents, we can define which linkage groups correspond with each other between parents. After this, the linkage groups of one of the parents should be renamed and the SxS markers assigned again according to the new linkage group names:

LGHomDf_P2_2 <- consensus_LG_names(modify_LG = LGHomDf_P2_1, 
                                   template_SxS = P1_SxS_Assigned, 
                                   modify_SxS = P2_SxS_Assigned)
## 
## #### Original LG names
## 
## |   |  1|  2|  3|  4|  5|
## |:--|--:|--:|--:|--:|--:|
## |1  |  0|  0|  0| 47|  0|
## |2  | 64|  0|  0|  0|  0|
## |3  |  0|  0| 58|  0|  0|
## |4  |  0| 58|  0|  0|  0|
## |5  |  0|  0|  0|  0| 68|
## 
## #### Modified LG names
## 
## |   |  1|  2|  3|  4|  5|
## |:--|--:|--:|--:|--:|--:|
## |1  | 47|  0|  0|  0|  0|
## |2  |  0| 64|  0|  0|  0|
## |3  |  0|  0| 58|  0|  0|
## |4  |  0|  0|  0| 58|  0|
## |5  |  0|  0|  0|  0| 68|
P2_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P2,
                                        LG_hom_stack = LGHomDf_P2_2,
                                        SN_colname = "marker_a",
                                        unassigned_marker_name = "marker_b",
                                        phase_considered = "coupling",
                                        LG_number = 5,
                                        LOD_threshold = 3,
                                        ploidy = 4)
## 
##  In total, 300 out of 300 markers were assigned.
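The renaming step amounts to matching linkage groups through a cross-tabulation of shared markers. The sketch below re-enters the "Original LG names" table printed above as a matrix and derives the matching from it; it is a simplified illustration under the assumption that each linkage group has one clear best match, and is not the code used inside consensus_LG_names. The object names xtab and best_match are hypothetical.

```r
# The "Original LG names" cross-tabulation from the output above
xtab <- matrix(c( 0,  0,  0, 47,  0,
                 64,  0,  0,  0,  0,
                  0,  0, 58,  0,  0,
                  0, 58,  0,  0,  0,
                  0,  0,  0,  0, 68),
               nrow = 5, byrow = TRUE,
               dimnames = list(1:5, 1:5))

# For each row, the column sharing the most markers
best_match <- apply(xtab, 1, which.max)
best_match

# The matching is only usable if it is one-to-one
anyDuplicated(best_match) == 0

# Reordering the columns by the matching puts all counts on the diagonal,
# as in the "Modified LG names" table
xtab[, best_match]
```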

Since we now have a consistent linkage group numbering, we can also assign the DxN markers:

P1_DxN_Assigned <- assign_linkage_group(linkage_df = SN_DN_P1,
                                        LG_hom_stack = LGHomDf_P1,
                                        SN_colname = "marker_a",
                                        unassigned_marker_name = "marker_b",
                                        phase_considered = "coupling",
                                        LG_number = 5,
                                        LOD_threshold = 3,
                                        ploidy = 4)
## 
##  In total, 111 out of 111 markers were assigned.
P2_DxN_Assigned <- assign_linkage_group(linkage_df = SN_DN_P2,
                                        LG_hom_stack = LGHomDf_P2_2,
                                        SN_colname = "marker_a",
                                        unassigned_marker_name = "marker_b",
                                        phase_considered = "coupling",
                                        LG_number = 5,
                                        LOD_threshold = 3,
                                        ploidy = 4)
## 
##  In total, 101 out of 101 markers were assigned.

10. Assign all other markertypes

Now that we have a backbone of SxN markers with consistent linkage group names, it is time to assign all other marker types to a linkage group and homologue using their linkage to SxN markers. Since we already did this for DxN and SxS markers, there is no need to re-do this work for these marker types. The function homologue_lg_assignment finds all marker types that have not been assigned yet, does the linkage analysis and assigns them to a linkage group and homologue. As linkage is calculated for a lot of marker combinations, this might take a while. Again, note that since v.1.1.5 we need to specify which parent we are interested in (argument which_parent), similar to the linkage function that is ultimately being called multiple times in the background here.

marker_assignments_P1 <- homologue_lg_assignment(dosage_matrix = filtered_data,
                                                 assigned_list = list(P1_SxS_Assigned, 
                                                                      P1_DxN_Assigned),
                                                 assigned_markertypes = list(c(1,1), c(2,0)),
                                                 LG_hom_stack = LGHomDf_P1,
                                                 which_parent = 1,
                                                 ploidy = 4,
                                                 pairing = "random",
                                                 convert_palindrome_markers = FALSE,
                                                 LG_number = 5,
                                                 LOD_threshold = 3,
                                                 write_intermediate_files = FALSE
)

Also do this for P2:

marker_assignments_P2 <- homologue_lg_assignment(dosage_matrix = filtered_data,
                                                 assigned_list = list(P2_SxS_Assigned, 
                                                                      P2_DxN_Assigned),
                                                 assigned_markertypes = list(c(1,1), c(2,0)),
                                                 LG_hom_stack = LGHomDf_P2_2,
                                                 which_parent = 2,
                                                 ploidy = 4,
                                                 pairing = "random",
                                                 convert_palindrome_markers = TRUE,
                                                 LG_number = 5,
                                                 LOD_threshold = 3,
                                                 write_intermediate_files = FALSE
)

Next, to make sure the marker linkage group assignment is consistent across parents we run the check_marker_assignment function, removing any bi-parental markers if they show linkage to different chromosomes (which suggests a problem with these markers):

marker_assignments <- check_marker_assignment(marker_assignments_P1,marker_assignments_P2)
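The kind of inconsistency this check looks for can be shown with a toy example. The sketch below compares hypothetical linkage group assignments of bi-parental markers in the two parents and lists the markers assigned to different chromosomes. The vectors lg_P1 and lg_P2 are made up, and the real function works on the full assignment matrices rather than on simple vectors.

```r
# Hypothetical assigned linkage groups per marker in each parent
lg_P1 <- c(mA = 1, mB = 2, mC = 3, mD = 5)
lg_P2 <- c(mA = 1, mB = 4, mC = 3, mE = 2)

# Only markers assigned in both parents can be compared
shared <- intersect(names(lg_P1), names(lg_P2))

# Markers whose linkage group differs between the parents
conflicting <- shared[lg_P1[shared] != lg_P2[shared]]
conflicting
```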

11. Finish the linkage analysis

Since all markers have been assigned to a linkage group, we now do the linkage calculations per linkage group of every marker type combination (so not only with SxN markers). This is done with the finish_linkage_analysis function:

all_linkages_list_P1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P1,
                                                dosage_matrix = filtered_data,
                                                which_parent = 1,
                                                convert_palindrome_markers = FALSE,
                                                ploidy = 4,
                                                pairing = "random",
                                                LG_number = 5) 

all_linkages_list_P2 <- finish_linkage_analysis(marker_assignment = marker_assignments$P2,
                                                dosage_matrix = filtered_data,
                                                which_parent = 2,
                                                convert_palindrome_markers = TRUE, # convert 3.1 markers
                                                ploidy = 4,
                                                pairing = "random",
                                                LG_number = 5)

The output is returned in a list:

str(all_linkages_list_P1)

12. Marker ordering

The basis for producing linkage maps has now been established: the pairwise recombination frequencies have been estimated, and LOD scores for these estimates have been calculated, for all markers. For the ordering step we rely on the MDSMap package developed by Katherine Preedy and Christine Hackett [3], which is called using the function MDSMap_from_list. This function applies the estimate.map function from the MDSMap package to a list of linkage data frames. Below, we use the default settings of estimate.map; these can be changed by supplying extra arguments.

By calling the MDSMap_from_list function you automatically produce the input .txt files that MDSMap requires in the correct format. Note that unless you provide a different folder name (by specifying mapdir), the files will be overwritten each time you run the function. We generally recommend using the default mapping settings of the estimate.map function (the method of principal curves in 2 dimensions with the squared LOD scores, LOD^2, as weights), but there are a number of options that can be specified, which are described in the manual of that package (check out https://cran.R-project.org/package=MDSMap). Output plots can be saved as .pdf files by specifying write_to_file = TRUE; they are written to the same mapdir folder that contains the map input files.

12.1 Creating an integrated chromosomal linkage map

To create an integrated map of all linkage groups, we first combine the linkage information together and then run the MDSMap_from_list function which prepares the files and passes them to MDSMap for the mapping:

linkages <- list()
for(lg in names(all_linkages_list_P1)){
  linkages[[lg]] <- rbind(all_linkages_list_P1[[lg]], all_linkages_list_P2[[lg]])
}

integrated.maplist <- MDSMap_from_list(linkages)

12.2 Optional: Adding back duplicated markers

If you ran the function screen_for_duplicate_markers before mapping, you may want to add back the duplicate markers to the map. This can be achieved using the add_dup_markers function. If you also want to include the duplicate markers in a phased map file (next section), then you need to update the marker_assignments as well (otherwise, just leave the marker_assignments argument out). This is done as follows:

complete_mapdata <- add_dup_markers(maplist = integrated.maplist,
                                    bin_list = screened_data4$bin_list,
                                    marker_assignments = marker_assignments)

Note that the output of this function is again a list, which we might like to re-assign as follows:

integrated.maplist_complete <- complete_mapdata$maplist
marker_assignments_complete <- complete_mapdata$marker_assignments

Note that if we are producing a phased maplist, we also need the marker dosage matrix to be present (in order to check whether the phasing corresponds with the parental scores). The duplicate markers have been removed from filtered_data, so it is no longer appropriate to use that object if we have added back duplicates. Using screened_data3 from earlier would be the correct choice.

12.3 Phasing an integrated map

One final step that is useful is to generate phased linkage maps from the integrated linkage map and marker assignments. This provides information on the coverage of markers across the parental homologues. To phase the markers, we use the create_phased_maplist function as follows:

phased.maplist <- create_phased_maplist(maplist = integrated.maplist,
                                        dosage_matrix.conv = filtered_data,
                                        N_linkages = 5,
                                        ploidy = 4,
                                        marker_assignment.1 = marker_assignments$P1,
                                        marker_assignment.2 = marker_assignments$P2)

The N_linkages option specifies the minimum number of significant linkages of a marker to a chromosomal linkage group needed to be confident of its assignment. Note that this level of significance (e.g. LOD > 3) was already defined in the homologue_lg_assignment function earlier. There are a number of other arguments to create_phased_maplist, including original_coding, which produces a phased map file in the original, unconverted format. This may have advantages for tracking marker alleles, although it has the disadvantage that the homologues appear to be more saturated with markers than they actually are!

13. Plotting a map

A well-known software program for plotting maps is MapChart [4]. polymapR can write out MapChart-compatible files (.mct) using the function write.mct. There are many plotting options available in MapChart, of which we currently incorporate only one, namely showMarkerNames, which is FALSE by default in anticipation of high-density maps. Further formatting can be achieved within the MapChart environment itself.

However, polymapR also comes with its own simple built-in function to plot maps, namely the plot_map function:

plot_map(maplist = integrated.maplist)

We might also want to visualise the phased map output using the plot_phased_maplist function (here we limit it to a single linkage group). Note that we first need to have run the create_phased_maplist function, described earlier.

plot_phased_maplist(phased.maplist = phased.maplist[1], #Can plot full list also, remove "[1]"
                    ploidy = 4,
                    cols = c("black","grey50","grey50"))

This function can also visualise phased hexaploid maps, or triploid maps (tetraploid x diploid) if ploidy2 is specified.

14. Evaluating map quality

MDSMap produces output that can be examined each time a map is produced: the principal coordinate plots with the principal curve (on which the map is based), which highlight possible outlying markers that do not associate well with the rest of the markers, and a plot of the nearest-neighbour fit (nnfit) for the markers along each linkage group, which gives an indication of markers with high stress. Both of these can help the user to evaluate the quality of the map (for more details, we recommend reading the MDSMap publication [3]), and they often lead to a number of rounds of mapping, in which problematic markers are identified and removed, followed by remapping and re-evaluation. Once the maps have been produced it is also a good idea to get a general overview of the map quality using the check_map function. Note that by default this function expects lists as input arguments, enabling diagnostics to be run over all maps. Here we run it for a single linkage group, LG1:

check_map(linkage_list = linkages[1], maplist = integrated.maplist[1])

For each map produced, there are 3 diagnostic plots. The first of these shows the differences between the pairwise estimates of recombination frequency and the effective estimate of recombination frequency based on the map itself (the multi-point estimate of the recombination frequency). These differences are compared to the LOD score for each pairwise estimate. An overall estimate of the weighted root mean square error of these differences (weighted by the LOD scores) is printed on the console. In general, we expect that if the difference between the expected and realised recombination frequency is high, the LOD should be low. If this is not the case, something went wrong in the final mapping step. The second plot shows the comparison between a marker’s map position and the recombination frequency estimates to all other markers. In general we expect that small recombination frequency estimates, which are shown as light green regions, come from nearby markers and should therefore be concentrated around the diagonal. Areas of light green off the diagonal suggest a sub-optimal map order. The third plot is similar to the second, except that LOD values are shown in place of recombination frequencies. Again, we expect a concentration of higher LOD scores around the diagonal. By default, LOD scores less than 5 are not shown to make things clearer, but this value can be varied using the lod.thresh argument.
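
The idea behind the first diagnostic can be sketched in base R on invented numbers. This is a rough illustration only, not the code used by check_map: the marker names, the pairwise estimates, the use of Haldane's mapping function and the exact weighting formula are all assumptions made for the example.

```r
# Toy map positions (cM) for four hypothetical markers on one linkage group
map_pos <- c(M1 = 0, M2 = 5, M3 = 12, M4 = 30)

# Hypothetical pairwise estimates: recombination frequency r and LOD per pair
pairs <- data.frame(
  marker_a = c("M1", "M1", "M1", "M2", "M2", "M3"),
  marker_b = c("M2", "M3", "M4", "M3", "M4", "M4"),
  r        = c(0.05, 0.11, 0.30, 0.07, 0.20, 0.15),
  LOD      = c(25, 14, 2, 20, 4, 9),
  stringsAsFactors = FALSE
)

# Assumption: Haldane's mapping function converts map distance (cM) back to r
haldane_r <- function(d_cM) 0.5 * (1 - exp(-2 * d_cM / 100))

# Map-based (multi-point) recombination frequency for each marker pair
d_map <- abs(map_pos[pairs$marker_a] - map_pos[pairs$marker_b])
pairs$r_map <- unname(haldane_r(d_map))

# Difference between the pairwise estimate and the map-based value
pairs$delta <- pairs$r - pairs$r_map

# LOD-weighted root mean square error of the differences
wrmse <- sqrt(sum(pairs$LOD * pairs$delta^2) / sum(pairs$LOD))
print(round(wrmse, 4))
```

In this toy table the largest differences belong to the pairs with the lowest LOD scores, which is the pattern expected from a well-ordered map.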

15. Preferential pairing

Up to now, we have been working under the assumption of random bivalent pairing in the parents. However, it has been shown that in certain species the pairing behaviour is neither completely random nor completely preferential (as in allopolyploids) but something in between, a condition termed segmental allopolyploidy [5]. A study on this topic in rose [6] found evidence of disomic behaviour in certain regions of the genome, with tetrasomic behaviour everywhere else. This sort of mixed inheritance can complicate the analysis, and in cases where the effect is particularly pronounced it is probably unwise to ignore it [7].

The polymapR package can currently accommodate preferential pairing in tetraploid species only, by correcting the recombination frequency estimates once the level of preferential pairing in a parental chromosome is known. In order to determine whether this is necessary, we can run some diagnostic tests using the marker data after an initial map has been produced which can inform our choice regarding whether to include a preferential pairing parameter or not (in a subsequent re-analysis).

First, a word of warning. polymapR includes a function to test for preferential pairing among pairs of closely-linked simplex x nulliplex markers in repulsion phase. However, we also advise using multi-point methods to test for preferential pairing where possible. The only possible drawback is that multi-point methods (using all markers across a chromosome) generally assume uniform pairing behaviour across the length of a chromosome, which may not always hold [6]. A robust multi-point approach, described in [6], uses identity-by-descent probabilities for the population, estimated using TetraOrigin [8]. TetraOrigin estimates the most likely pairing behaviour per individual and can therefore reveal whether there are deviations from the assumption of random chromosomal pairing and recombination across the population.

The function test_prefpairing examines closely-linked repulsion-phase simplex x nulliplex marker pairs only, and therefore its results may not be as robust as those of a multi-point whole-chromosome approach. Be that as it may, it is probably the simplest check we can perform post-mapping for the possibility of unusual pairing behaviour. The method used here is also described in Bourke et al. [6], and is a development of ideas originally described by Qu and Hancock [9]. The idea is to look for signatures of preferential pairing in repulsion-phase pairs which map to the same locus.

We first need to define a minimum distance by which to consider markers essentially at the same locus. This is somewhat arbitrary - in our study of tetraploid rose, we used pairs of duplicated individuals to determine an approximate error rate which, along with considerations of the size of the population and rate of missing values, led us to conclude that distances less than ~1 cM were close enough to be informative (and for which we could assume 0 recombinations). The smaller the distance we choose here the better, although by choosing a very small distance we are limiting the number of repulsion-pairs available. By default in test_prefpairing we use a value of 0.5 cM. This can be increased or decreased as required using the min_cM argument.

P1.prefPairing <- test_prefpairing(dosage_matrix = ALL_dosages, 
                                   maplist = integrated.maplist, 
                                   LG_hom_stack = LGHomDf_P1_1, 
                                   min_cM = 1, #changed from default of 0.5 cM
                                   ploidy = 4)

head(P1.prefPairing)

The output shows the repulsion-phase marker pairs tested. The most important column is probably P_value.adj, which gives the FDR-adjusted P-values from a binomial test of the hypothesis that the disomic repulsion estimate of recombination frequency, $r_{disom}$, differs significantly from $\frac{1}{3}$ (it is actually a one-sided test, so we are testing whether $r_{disom}$ falls below $\frac{1}{3}$). In cases where this is significant (with P_value.adj < 0.01, say), we might have some evidence for preferential pairing (but from that marker pair only!). The output of test_prefpairing also lists the markers, their positions and the homologues on which they have been mapped. There is also an estimate of the pairwise preferential pairing parameter, pref.p. Note that there are two possible (maximum-likelihood) estimators of a preferential pairing parameter in the case of a pair of simplex x nulliplex markers, depending on whether the marker alleles reside on homologous chromosomes or on homoeologous chromosomes (i.e. whether the pair are contained within a “subgenome”, or straddle “subgenomes”, to borrow the language of allopolyploids for convenience). If there is strong evidence of preferential pairing from multiple marker pairs, the approach we recommend is to take the average of the pref.p values for a particular chromosomal linkage group, which becomes one of the parameters p1 or p2 in the linkage function.
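
The testing logic described above can be tried out in base R on invented counts. This is only a sketch of a one-sided binomial test followed by an FDR adjustment, not the internal code of test_prefpairing: the pair names and the counts n_recomb and n_total are hypothetical.

```r
# Hypothetical counts for three repulsion-phase marker pairs:
# n_total offspring scored, n_recomb in the class counted as recombinant
# under a disomic model (toy numbers, not from the tutorial dataset)
toy <- data.frame(
  pair     = c("pairA", "pairB", "pairC"),
  n_recomb = c(70, 45, 20),
  n_total  = c(200, 200, 200),
  stringsAsFactors = FALSE
)

# Disomic estimate of the recombination frequency for each pair
toy$r_disom <- toy$n_recomb / toy$n_total

# One-sided binomial test of H0: r = 1/3 against H1: r < 1/3
toy$P_value <- mapply(
  function(x, n) binom.test(x, n, p = 1/3, alternative = "less")$p.value,
  toy$n_recomb, toy$n_total
)

# FDR (Benjamini-Hochberg) adjustment across all tested pairs
toy$P_value.adj <- p.adjust(toy$P_value, method = "fdr")

print(toy)
```

Only pairs whose adjusted P-value falls below the chosen threshold would be taken as evidence of preferential pairing; in this toy table the first pair, with an estimate slightly above $\frac{1}{3}$, is not.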

Here, we have purely random pairing in our parents, so the estimates of pref.p should not be considered meaningful. The preferential pairing parameter p is defined such that $0 < p < \frac{2}{3}$. If preferential pairing were present, and we wanted to estimate its strength on LG1 of parent 1, we would do something like the following:

mean(P1.prefPairing[P1.prefPairing$P_value.adj < 0.01 & P1.prefPairing$LG_a == 1,]$pref.p)
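
One practical detail: if no marker pair passes the filter, mean() of an empty selection returns NaN. The sketch below shows one way to guard against this on an invented table. The helper mean_pref_p is hypothetical (it is not part of polymapR), and only the column names P_value.adj, LG_a and pref.p are taken from the command above.

```r
# Toy stand-in for the output of test_prefpairing (values are invented)
toy_pp <- data.frame(
  LG_a        = c(1, 1, 1, 2),
  P_value.adj = c(0.001, 0.004, 0.300, 0.002),
  pref.p      = c(0.22, 0.28, 0.05, 0.40)
)

# Hypothetical helper: mean pref.p over significant pairs on one linkage group
mean_pref_p <- function(pp, lg, alpha = 0.01) {
  keep <- which(pp$P_value.adj < alpha & pp$LG_a == lg)
  if (length(keep) == 0) return(NA_real_)  # no significant pairs: no estimate
  mean(pp$pref.p[keep])
}

print(mean_pref_p(toy_pp, lg = 1))  # averages the two significant pairs on LG1
print(mean_pref_p(toy_pp, lg = 3))  # no pairs on LG3, so NA is returned
```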

Note that again we require first that there be significant evidence of disomic behaviour before including the estimate. If there is evidence of preferential pairing, it is probably a good idea to go back and reduce the min_cM argument even further, since we really want this value to be as close to 0 as possible for an accurate estimate of a preferential pairing parameter.
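
Before tightening min_cM, it can help to see how quickly the number of usable marker pairs shrinks. The following base R sketch counts candidate pairs at several thresholds on an invented map. It is a simplification made for illustration: treating any two markers on different homologues as a repulsion pair, and the toy_map table itself, are assumptions and do not reproduce how test_prefpairing selects pairs.

```r
# Toy map of one linkage group: position (cM) and homologue of six
# hypothetical simplex x nulliplex markers
toy_map <- data.frame(
  marker    = paste0("M", 1:6),
  position  = c(0.0, 0.2, 0.9, 1.6, 2.3, 4.0),
  homologue = c(1, 2, 1, 3, 2, 4),
  stringsAsFactors = FALSE
)

# Count marker pairs on different homologues (treated here as repulsion phase)
# that lie closer together than the chosen threshold
count_repulsion_pairs <- function(map, min_cM) {
  idx <- combn(nrow(map), 2)  # all unordered pairs of row indices
  dist_cM <- abs(map$position[idx[1, ]] - map$position[idx[2, ]])
  repulsion <- map$homologue[idx[1, ]] != map$homologue[idx[2, ]]
  sum(repulsion & dist_cM < min_cM)
}

thresholds <- c(0.25, 0.5, 1, 2)
counts <- sapply(thresholds, function(th) count_repulsion_pairs(toy_map, th))
print(data.frame(min_cM = thresholds, n_pairs = counts))
```

The count can only stay the same or grow as the threshold increases, so the choice of min_cM is a trade-off between the accuracy of each estimate and the number of pairs contributing to the average.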

Finally, to re-run the linkage analysis with this estimate for parent 1 of linkage group 1, we would first have to subset our marker dosage data corresponding to that linkage group only (since we do not need to re-analyse any other chromosome!), and then re-run the linkage analysis. Note that in cases with extreme levels of preferential pairing, the clustering into separate homologues can also be affected. Supposing we had estimated a value of 0.25 for the preferential pairing parameter we could proceed as follows:

lg1_markers <- unique(c(rownames(marker_assignments_P1[marker_assignments_P1[,"Assigned_LG"] == 1,]),
                        rownames(marker_assignments_P2[marker_assignments_P2[,"Assigned_LG"] == 1,])))

all_linkages_list_P1_lg1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P1[lg1_markers,],
                                                    dosage_matrix = filtered_data[lg1_markers,],
                                                    which_parent = 1,
                                                    convert_palindrome_markers = FALSE,
                                                    ploidy = 4,
                                                    pairing = "preferential",
                                                    prefPars = c(0.25,0), #just for example!
                                                    LG_number = 1 #interested in just 1 chm.
)

all_linkages_list_P2_lg1 <- finish_linkage_analysis(marker_assignment = marker_assignments$P2[lg1_markers,],
                                                    dosage_matrix = filtered_data[lg1_markers,],
                                                    which_parent = 2,
                                                    convert_palindrome_markers = FALSE,
                                                    ploidy = 4,
                                                    pairing = "preferential",
                                                    prefPars = c(0,0.25), #Note that this is in reverse order now.
                                                    LG_number = 1 
)

16. QTL analysis

The package polyqtlR, available through CRAN, has been developed for QTL analysis in outcrossing polyploid populations, using IBD probabilities derived using the phased linkage maps developed in polymapR.

Alternatively, the TetraploidSNPMap [10] software provides the possibility of not just generating linkage maps but performing IBD-based interval QTL mapping as well. The creation of input files for TetraploidSNPMap is simple once phased mapfiles have been created with the create_phased_maplist function (see section 12.3).

To generate these files, use the write.TSNPM function as follows:

write.TSNPM(phased.maplist = phased.maplist,
            ploidy = 4)

Other software for linkage mapping and QTL analysis in polyploid populations includes MAPpoly and QTLpoly, as well as diaQTL and PolyOrigin.

17. Concluding remarks

We hope you have been able to successfully create an integrated map using the methods described here. We anticipate polymapR will improve over the coming years (in terms of applicability as well as finding and removing bugs). Feedback on performance including bug reporting is most welcome; please refer to the package information on CRAN for details on who to contact regarding maintenance issues.

18. References

1. Voorrips RE, Gort G, Vosman B: Genotype calling in tetraploid species from bi-allelic marker data using mixture models. BMC Bioinformatics 2011, 12:172.
2. Yant L, Hollister JD, Wright KM, Arnold BJ, Higgins JD, Franklin FCH, Bomblies K: Meiotic adaptation to genome duplication in Arabidopsis arenosa. Current Biology 2013, 23:2151–2156.
3. Preedy KF, Hackett CA: A rapid marker ordering approach for high-density genetic linkage maps in experimental autotetraploid populations using multidimensional scaling. Theoretical and Applied Genetics 2016, 129:2117–2132.
4. Voorrips R: MapChart: software for the graphical presentation of linkage maps and QTLs. Journal of Heredity 2002, 93:77–78.
5. Stebbins GL: Types of polyploids: their classification and significance. Advances in Genetics 1947, 1:403–429.
6. Bourke PM et al: Partial preferential chromosome pairing is genotype dependent in tetraploid rose. The Plant Journal 2017, 90:330–343.
7. Bourke PM, Voorrips RE, Kranenburg T, Jansen J, Visser RGF, Maliepaard C: Integrating haplotype-specific linkage maps in tetraploid species using SNP markers. Theoretical and Applied Genetics 2016, 129:2211–2226.
8. Zheng C, Voorrips RE, Jansen J, Hackett CA, Ho J, Bink MCAM: Probabilistic multilocus haplotype reconstruction in outcrossing tetraploids. Genetics 2016, 203:119–131.
9. Qu L, Hancock J: Detecting and mapping repulsion-phase linkage in polyploids with polysomic inheritance. Theoretical and Applied Genetics 2001, 103:136–143.
10. Hackett CA, Boskamp B, Vogogias A, Preedy KF, Milne I: TetraploidSNPMap: Software for linkage analysis and QTL mapping in autotetraploid populations using SNP dosage data. Journal of Heredity 2017, 108:438–442.