Using genotype probabilities in polymapR

This vignette is additional to the basic polymapR vignette “How to use polymapR”, to which we would direct readers’ attention to first.

Here, we will go through the main steps of polyploid linkage analysis using polymapR [1] when discrete dosages are not available or not desired, but where genotype probabilities are instead available. Genotype probabilties are a direct output of many polyploid genotype callers such as fitPoly [2], updog [3] or polyRAD [4], although dosage probabilities from any other genotyping software can be used.

For more background information on the functions described here, please refer to the 2021 article of Liao et al. [5].

1. Introduction

By their very nature, probabilistic datasets in polyploids are larger than discrete ones: each marker-individual combination has ploidy + 1 floating-point numbers associated with it, in comparison to a single integer value for a discrete dataset. For example, we might encounter the following probabilistic data for 5 SNP markers in a single tetraploid individual:

P0 P1 P2 P3 P4
snp01 0.999849 0.000151 0 0 0
snp02 0.000238 0.999762 0 0 0
snp03 0.000000 1.000000 0 0 0
snp04 0.000001 0.999999 0 0 0
snp05 0.009729 0.990271 0 0 0

Using discrete dosages (e.g. by selecting the maximum probability) would give the following table for the same individual:

geno
snp01 0
snp02 1
snp03 1
snp04 1
snp05 1

However, there are some advantages to probabilistic genotypes - in cases where uncertainty exists in the genotype assignment, it might not be obvious how to discretise the genotype scores. This may lead to observations being marked as missing data or even being entirely removed from the dataset, depending on what filtering thresholds are used.

Probabilistic genotypes are a standard output from many polyploid genotype calling procedures. This vignette assumes the R package fitPoly has been used (developed to score the output of SNP arrays). polymapR also includes functions to convert the output of other polyploid genotype callers developed for sequencing reads into a compatible format. For example, output of the multiflex function of updog can be re-formatted using the convert_updog function, while convert_polyRAD does the same with the RADobject output of function PipelineMapping2Parents in R package polyRAD.

2. Importing sample data

In this vignette we will use a simulated SNP dataset from an autotetraploid F1 population with 200 individuals. This fictive organism has 5 chromosomes (2n = 4x = 10) and exhibits polysomic inheritance (although the pairing behaviour was that of random bivalents). The dataset was generated using PedigreeSim [6] and the GenoSim package (Liao, Tumino et al., in preparation) which simulates SNP array intensity values (and also sequencing reads) given an underlying population. The genotypes were called using fitPoly [2] using the function saveMarkerModels.

The data should be downloaded from the following FigShare repository and saved to your working directory:

https://figshare.com/articles/dataset/fitPoly_4x_output_2343_SNPs/12589271

3. Initial mapping steps

We will assume you have already installed or updated the version of polymapR to at least version 1.1.0. If not, please install the latest version of polymapR from CRAN:

install.packages("polymapR")
library(polymapR)

Assuming you have not altered the filename of the sample dataset, we will first load the genotype data into R:

geno <- read.csv("fitPoly_4x_output_2343_SNPs.csv")

Have a look at the layout of the data, as this is generally the format that is expected for probabilistic genotypes:

knitr::kable(head(geno))
marker MarkerName SampleName ratio P0 P1 P2 P3 P4 maxgeno maxP geno
1 LG1_88.17 F1_001 0.2548363 0.9998493 0.0001507 0 0 0 0 0.9998493 0
1 LG1_88.17 F1_002 0.3823194 0.0002380 0.9997620 0 0 0 1 0.9997620 1
1 LG1_88.17 F1_003 0.4570357 0.0000000 1.0000000 0 0 0 1 1.0000000 1
1 LG1_88.17 F1_004 0.4290638 0.0000006 0.9999994 0 0 0 1 0.9999994 1
1 LG1_88.17 F1_005 0.3535575 0.0097291 0.9902709 0 0 0 1 0.9902709 1
1 LG1_88.17 F1_006 0.2337918 0.9999929 0.0000071 0 0 0 0 0.9999929 0

For clarity, we’ll also define what the parental and offspring samples are already. In this dataset there are two maternal replicates (‘P1’ and ‘P1a’) but only a single paternal replicate (‘P2’). It is quite common (recommended in fact) to have multiple parental replicates to ensure parental calls are reliable.

parent1.reps <- c("P1","P1a")
parent2.reps <- "P2"
individuals <- setdiff(unique(geno$SampleName),c(parent1.reps,parent2.reps))

The first step in the mapping using probabilistic genotypes is to run the checkF1 function (written by Roeland Voorrips). This looks for concordance between parental and offspring genotypes, as well as checking for marker skewness / distorted segregation under various genetic models (disomic, polysomic, mixed) and returns a number of different quality parameters that might be useful for flagging problematic markers. See ?checkF1 for more details. We will capture the output of checkF1, as this will be needed in many subsequent function calls:

chk1 <- checkF1(input_type = "probabilistic",
                probgeno_df = geno,
                parent1 = parent1.reps,
                parent2 = parent2.reps,
                F1 = individuals,
                polysomic = TRUE,
                disomic = FALSE,
                mixed = FALSE,
                ploidy = 4)

This function returns a list with elements checked_F1 (the actual function output) and meta (which carries the information about parameter settings used). Note that we have assumed that we are dealing with a polysomic species here (polysomic = TRUE). In cases where there is uncertainty regarding the mode of inheritance, it can be useful to run checkF1 multiple times, e.g. setting polysomic = FALSE and disomic = TRUE. Quality of the markers can be assessed by filtering on qall_mult > 0 for example. In some cases, you may notice that parental scores have been made missing:

chk1$checked_F1[1:5,]
##      m MarkerName parent1 parent2      F1_0      F1_1     F1_2      F1_3
## 1  563   LG1_0.02       2       1 16.307194  91.47333 81.57135 10.648126
## 2 1315   LG1_0.03       1       0 91.105597  83.89326 22.70502  2.296116
## 3  437   LG1_0.13       0      NA  0.000000 101.97259 97.02741  0.000000
## 4 1130   LG1_0.21       1       2 17.758076  75.43383 91.64195 15.166141
## 5  512   LG1_0.64       4      NA  5.552898  32.96319 63.11849 72.637833
##       F1_4 F1_NA P1 P1a P2 bestfit frqInvalid_bestfit Pvalue_bestfit
## 1  0.00000     0  2   2  1  1551_0              0.000         0.3895
## 2  0.00000     0  1   1  0    11_0              0.125         0.5856
## 3  0.00000     1  0   0  2    11_1              0.000         0.7259
## 4  0.00000     0  1   1  2  1551_0              0.000         0.6185
## 5 23.72759     2  4   4  1  1551_1              0.028         0.0000
##   matchParent_bestfit bestParentfit frqInvalid_bestParentfit
## 1                 Yes        1551_0                   0.0000
## 2                 Yes          11_0                   0.1250
## 3               OneOK          11_1                   0.0000
## 4                 Yes        1551_0                   0.0000
## 5                  No         141_2                   0.1945
##   Pvalue_bestParentfit matchParent_bestParentfit q1_segtypefit q2_parents
## 1               0.3895                       Yes             1          1
## 2               0.5856                       Yes             0          1
## 3               0.7259                     OneOK             1          1
## 4               0.6185                       Yes             1          1
## 5               0.0000                     OneOK             0          1
##   q3_fracscored qall_mult qall_weights
## 1        1.0000    1.0000       1.0000
## 2        1.0000    0.0000       0.4444
## 3        0.9875    0.9875       0.9972
## 4        1.0000    1.0000       1.0000
## 5        0.9750    0.0000       0.4389

If you are interested, you can impute these values using the internal function assign_parental_dosage within polymapR as follows:

pardose <- polymapR:::assign_parental_dosage(chk = chk1,
                                             probgeno_df = geno)
knitr::kable(head(pardose))
MarkerName parent1 parent2
LG1_0.02 2 1
LG1_0.03 1 0
LG1_0.13 0 3
LG1_0.21 1 2
LG1_0.64 4 2
LG1_1.07 2 1

For now, we will just focus our attention on potentially problematic markers that have had a qall_mult value of 0:

length(which(chk1$checked_F1$qall_mult == 0)) #859 markers

There appear to be quite a large number of markers with questionable agreement between parental and offspring scores - we will remove these as follows:

remove.mark <- chk1$checked_F1[chk1$checked_F1$qall_mult==0,"MarkerName"]
geno1 <- geno[!geno$MarkerName %in% remove.mark,]

If you are concerned about removing these markers, you can revisit them later on and try to add them back to the map. For now we will proceed with a cleaned-up dataset, as this makes the mapping a lot easier.

4. Data filtering

4.1. Quality check on offspring scores

We can get an overview of the “precision” of probability scores using the function gp_overview:

gpo <- gp_overview(probgeno_df = geno1)

By default a plot is generated giving some summary statistics of the marker scores, such as the mean genotype probability encountered etc. The output can be used to filter the marker dataset further (although here using the default parameters no filtering actually occurs):

geno1 <- gpo$probgeno_df

4.2. Distribution of maximum genotype probabilities

The next step is to have a look at the distribution of maximum genotype probabilities, as we would ideally like these to be as close to 1 as possible in a clearly-scored dataset:

maxP.chk <- check_maxP(probgeno_df = geno1)
##      interval  count percent
## 1    0 ~ 0.05      0      0%
## 2  0.05 ~ 0.1      0      0%
## 3  0.1 ~ 0.15      0      0%
## 4  0.15 ~ 0.2      0      0%
## 5  0.2 ~ 0.25      0      0%
## 6  0.25 ~ 0.3      0      0%
## 7  0.3 ~ 0.35      0      0%
## 8  0.35 ~ 0.4      0      0%
## 9  0.4 ~ 0.45     19   0.01%
## 10 0.45 ~ 0.5    153   0.05%
## 11 0.5 ~ 0.55   1624   0.54%
## 12 0.55 ~ 0.6   1661   0.55%
## 13 0.6 ~ 0.65   1880   0.62%
## 14 0.65 ~ 0.7   2017   0.67%
## 15 0.7 ~ 0.75   2546   0.85%
## 16 0.75 ~ 0.8   2824   0.94%
## 17 0.8 ~ 0.85   2872   0.95%
## 18 0.85 ~ 0.9   3534   1.17%
## 19 0.9 ~ 0.95   6012      2%
## 20   0.95 ~ 1 275591  91.48%
## 21         NA    519   0.17%

This may give some indication of possible issues with the assigned genotypes. In the example above, over 90% of the data (marker-individual combinations) had an assigned probability of 0.95 or higher - suggesting a relatively high-quality dataset.

4.3. Screen for duplicate individuals

As in the original polymapR pipeline, it is also possible to check the marker data for duplicate individuals. This is accomplished using the screen_for_duplicate_individuals.gp function:

geno2 <- screen_for_duplicate_individuals.gp(probgeno_df = geno1,
                                             ploidy = 4,
                                             parent1 = parent1.reps,
                                             parent2 = parent2.reps,
                                             F1 = individuals)

## 
## No duplicates found

4.4. Screen for duplicate markers

Currently there is no specific function to check for duplicate markers using genotype probabilities. The current advice here is to discretise the data and use the screen_for_duplicate_markers function from the original polymapR pipeline, and then go back and filter the probabilistic dataset. Specific functions to assist with this task are planned but not currently implemented.

4.5. Screening for missingness

Note that with genotype probabilities there are no missing values, so screening for missing values per marker or per individual is no longer needed or possible.

5. Marker conversion

In the original polymapR pipeline, marker conversion is an important step that reduces the number of marker segregation types to a smaller set (without any loss or distortion of the linkage information they contain). This function has been moved to an internal function as part of the linkage.gp call and therefore is no longer directly run by the user.

6. Linkage analysis

The rest of the mapping pipeline is very similar to that of the original polymapR pipeline. We will run quickly through the steps here, starting with estimation of pairwise recombination frequencies using the linkage.gp function. Note that there are two possible methods / approaches here. The default method (method = "approx") uses an approximation, by summing the probabilities of each genotype class across the population, and using these as estimates for different marker classes in the two-point estimation of recombination frequency. This has been shown to lead to biased estimates of recombination frequency (thanks to Marcelo Mollinari, NCSU for highlighting this), but has the advantage that (a) the calculation is much faster and (b) the bias is in general rather small and can probably be ignored. The full likelihood, in which all dosage probabilities are carried through in the calculation, has been implemented in the mappoly function est_pairwise_rf, the results of which can also be used in polymapR. Therefore, although there can be quite a computational penalty involved here, the results using method = "mappoly" are of higher accuracy.

The function, like a number of others in polymapR allows for parallel processing using the argument ncores. To detect how many available cores your computer has, as well as determining a reasonable number of them to use (max - 2 for example) we first run the following:

nc <- parallel::detectCores() - 2

Note that because we have filtered the marker dataset, we need to re-run checkF1 at this point to have a compatible genotype and checkF1 object:

chk1 <- checkF1(input_type = "probabilistic",
                probgeno_df = geno2,
                parent1 = parent1.reps,
                parent2 = parent2.reps,
                F1 = individuals,
                polysomic = TRUE,
                disomic = FALSE,
                mixed = FALSE,
                ploidy = 4)

In what follows, we are using the approximate method for recombination frequency estimation. If the full likelihood approach is required, make sure to specify method = "mappoly in the calls to the linkage.gp function.

6.1 Simplex markers - clustering homologues

We begin with the simplex markers in parent 1 in order to begin with marker clustering:

SN_SN_P1 <- linkage.gp(probgeno_df = geno2,
                       chk = chk1,
                       markertype1 = c(1,0),
                       target_parent = "P1",
                       LOD_threshold = 3,
                       ncores = nc)

The output of this function is a linkage data.frame, with the following format:

head(SN_SN_P1)
##    marker_a  marker_b          r       LOD     phase
## 1  LG1_1.73 LG1_13.91 0.01695488  4.562119 repulsion
## 2  LG1_1.73 LG1_15.75 0.16375769 21.050858  coupling
## 3  LG1_1.73  LG1_16.6 0.09751892  3.133736 repulsion
## 6  LG1_1.73 LG1_22.73 0.21450923 14.975402  coupling
## 8  LG1_1.73 LG1_28.44 0.22455287 13.874338  coupling
## 10 LG1_1.73 LG1_31.52 0.23455153 12.827951  coupling

We can immediately begin to cluster this data into homologues using the cluster_SN_markers function as follows (after first setting par(mfrow) to 2 rows and 1 column to keep the 2 output plots together):

par(mfrow = c(2,1))
P1_homologues <- cluster_SN_markers(linkage_df = SN_SN_P1,
                                    LOD_sequence = seq(3,12,0.5),
                                    LG_number = 5,
                                    ploidy = 4,
                                    parentname = "P1",
                                    plot_clust_size = FALSE,
                                    min_clust_size = 3)

Here, we specify that we expect there to be 5 chromosomal linkage groups (LG_number = 5) and we look at the clustering over a range of LOD values (from 3 to 12 in steps of 0.5). We also demand that the min_clust_size is 3, so we are not interested in clusters with fewer than 3 markers (you can choose this value yourself). Had we set plot_clust_size to be TRUE, we would see instead the number of markers per cluster rather than a visualisation of their size with a red dot.

Although we are looking for 20 clusters here (an auto-tetraploid with 5 chromosomes, so 5 x 4 = 20 homologue clusters are expected per parent), we can immediately see that there is a clear splitting of the data into 5 chromosomes at a lower LOD scores (as well as some “noise” - small clusters of markers that are separate from the rest and will be / should be discarded). It is less simple to see at which LOD score the homologues split apart (in fact, there are variable LOD scores that are relevant). If we look at the upper plot we see that there is a sort of plateau in the data (black solid line) between LOD 5 - 8 approx, and so we could take some LOD value in this range. It is useful to have a look at the breakdown of cluster size in more detail:

sort(table(P1_homologues[['7']]$cluster),decreasing = T)
## 
##  8 16 17 18  5  3  9  7  1  2 21 10 15 19 20  4  6 11 12 13 14 
## 17 15 15 14 12 11 11 10  9  9  9  8  8  8  8  7  7  3  3  3  3
length(table(P1_homologues[['7']]$cluster)) 
## [1] 21

So there are 21 clusters at LOD 7 (we wanted 20, this is close enough) and there seems to be a good distribution of markers across these clusters (at most 17 markers and at least 3, but all with more or less the same order of magnitude in terms of size). We’ll go with this for now, and switch our attention to parent 2 in the same way:

SN_SN_P2 <- linkage.gp(probgeno_df = geno2,
                       chk = chk1,
                       markertype1 = c(1,0),
                       target_parent = "P2",
                       LOD_threshold = 3,
                       ncores = nc)

We cluster the parent 2 data as before:

P2_homologues <- cluster_SN_markers(linkage_df = SN_SN_P2,
                                    LOD_sequence = seq(3,12,0.5),
                                    LG_number = 5,
                                    ploidy = 4,
                                    parentname = "P2",
                                    plot_clust_size = F,
                                    min_clust_size = 3)

Here the plateau in the upper plot is less obvious, but it is possible to see the chromosomal linkage groups splitting up nicely into homologue clusters in the lower plot. At LOD 4.5 we seem to have identified our chromosomes already. If we follow these we can see that around LOD 6 we have identified for the most part 4 homologues (this would have been better at LOD 5 for the first chromosome - we are working in this direction but the function doesn’t handle it yet). For the moment, we will use a LOD of 6:

sort(table(P2_homologues[['6']]$cluster),decreasing = T)
## 
## 22  6 12  3  7 10 17 11 16  2 15 18  4 14  1  8 13 20  5  9 19 21 
## 34 25 24 21 18 18 17 16 16 15 13 13 12 10  9  9  9  6  4  3  3  3
length(table(P2_homologues[['6']]$cluster)) 
## [1] 22

Here we see there are 22 clusters - again close enough to what we want.

6.2 Higher-dose markers - connecting clustered homologues

In this dataset it would have been possible to define chromosomal linkage groups from the 1x0 data alone - see the main vignette for how to proceed if this is the case in your data also (using the function define_LG_structure). However, for now we will assume things are not so straightforward. We next turn to the simplex x simplex markers (1x1) which can help provide bridging links between homologues to assist in the clustering puzzle (duplex markers can also be used, see the main vignette for more details here). We first calculate the linkage between 1x0 and 1x1 markers, and then use this information for joining our homologue clusters together into chromosomal linkage groups.

SN_SS_P1 <- linkage.gp(probgeno_df = geno2,
                       chk = chk1,
                       markertype1 = c(1,0),
                       markertype2 = c(1,1),
                       target_parent = "P1",
                       ncores = nc)

SN_SS_P2 <- linkage.gp(probgeno_df = geno2,
                       chk = chk1,
                       markertype1 = c(1,0),
                       markertype2 = c(1,1),
                       target_parent = "P2",
                       ncores = nc)

Using the function bridgeHomologues we can now look for associations between the homologue clusters, first in parent 1:

LGHomDf_P1 <- bridgeHomologues(cluster_stack = P1_homologues[["7"]],
                               cluster_stack2 = P2_homologues[["7"]],
                               linkage_df = SN_SS_P1,
                               linkage_df2 = SN_SS_P2,
                               LOD_threshold = 5,
                               LG_number = 5)

One of the chromosomes has too many homologues while another one has only 3. We can see this more clearly using the table function:

table(LGHomDf_P1$LG,LGHomDf_P1$homologue)
##    
##      1  2  3  4  5
##   1  9  9 11  7  0
##   2 12  7 10  0  0
##   3 17 11  8  3  3
##   4  3  8 15 15  0
##   5 14  8  8  9  0

However, we do have a clear clustering into 5 linkage groups, the number of chromosomes we expected in this species.

We can perform a similar step in parent 2, however this time at LOD 6:

LGHomDf_P2 <- bridgeHomologues(cluster_stack = P2_homologues[["6"]],
                               cluster_stack2 = P1_homologues[["6"]],
                               linkage_df = SN_SS_P2,
                               linkage_df2 = SN_SS_P1,
                               LOD_threshold = 5,
                               LG_number = 5)

Again have a look more closely:

table(LGHomDf_P2$LG,LGHomDf_P2$homologue)
##    
##      1  2  3  4
##   1  9 15 21 12
##   2 25 18  9  3
##   3 18 16 24  9
##   4 10 13 16 17
##   5 13  6 34  0

So again we are missing one homologue, but for the rest it seems quite OK.

We have a number of options to merge homologues (or just delete the small ones) - as it is not possible to proceed in a tetraploid mapping with more than 4 homologues per parent. In this example dataset we’ll use the function cluster_per_LG:

LGHomDf_P1a <-cluster_per_LG(LG = 3,
                             linkage_df = SN_SN_P1[SN_SN_P1$phase == "coupling",], 
                             LG_hom_stack = LGHomDf_P1, 
                             LOD_sequence = 3:10, 
                             modify_LG_hom_stack = TRUE, 
                             network.layout = "stacked",
                             nclust_out = 4,
                             label.offset=1.2) 

In this example, it is not possible to merge homologues satisfactorily using this function (even if we go down to LOD = 0). Since we have specified that there should be at most 4 clusters in the output (nclust_out = 4) and that we want to modify the LG_hom_stack object (modify_LG_hom_stack = TRUE), the function does the only thing it can in this situation - delete the smallest homologue (here, it is a random decision as there were two small homologues with only 3 markers)…

table(LGHomDf_P1a$LG,LGHomDf_P1a$homologue)
##    
##      1  2  3  4
##   1  9  9 11  7
##   2 12  7 10  0
##   3 17  3 11  8
##   4  3  8 15 15
##   5 14  8  8  9

In parent 2 we don’t have to do any merging as we had no situation of more than 4 homologues.

6.3 Marker assignments

Now that we have a structure with simplex markers that define both a LG and a homologue, we are in a position to start assigning other marker types - if you are unclear about what we just did, have a look at one of the LGHomDf stacks, e.g.:

head(LGHomDf_P1a)
##   SxN_Marker LG homologue
## 1   LG1_1.73  1         1
## 2  LG1_31.52  1         1
## 3  LG1_56.45  1         1
## 4  LG1_71.82  1         1
## 5  LG1_79.15  1         1
## 6  LG1_88.76  1         1

This is not quite a marker assignment object, but it is getting close. We have already calculated linkage with 1x1 markers, so these can be assigned directly as follows:

P1_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P1,
                                        LG_hom_stack = LGHomDf_P1a,
                                        SN_colname = "marker_a",
                                        unassigned_marker_name = "marker_b",
                                        phase_considered = "coupling",
                                        LG_number = 5,
                                        LOD_threshold = 3,
                                        ploidy = 4)

P2_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P2,
                                        LG_hom_stack = LGHomDf_P2,
                                        SN_colname = "marker_a",
                                        unassigned_marker_name = "marker_b",
                                        phase_considered = "coupling",
                                        LG_number = 5,
                                        LOD_threshold = 3,
                                        ploidy = 4)

We can now directly compare the assignments in both parents to get a “consensus” naming of the chromosomal linkage groups across parents:

LGHomDf_P2c <- consensus_LG_names(modify_LG = LGHomDf_P2, 
                                  template_SxS = P1_SxS_Assigned, 
                                  modify_SxS = P2_SxS_Assigned)
## 
## #### Original LG names
## 
## |   |  1|  2|  3|  4|  5|
## |:--|--:|--:|--:|--:|--:|
## |1  | 21|  0|  0|  0|  0|
## |2  |  0| 19|  0|  0|  0|
## |3  |  0|  0|  0| 20|  0|
## |4  |  0|  0|  0|  0| 15|
## |5  |  0|  0| 19|  0|  0|
## 
## #### Modified LG names
## 
## |   |  1|  2|  3|  4|  5|
## |:--|--:|--:|--:|--:|--:|
## |1  | 21|  0|  0|  0|  0|
## |2  |  0| 19|  0|  0|  0|
## |3  |  0|  0| 20|  0|  0|
## |4  |  0|  0|  0| 15|  0|
## |5  |  0|  0|  0|  0| 19|

Overall, looking at the diagonal in the lower table (these are the number of 1x1 markers that have been assigned to this linkage group in both parents), it is clear which linkage groups correspond across parents.

It is a good idea to save these objects (e.g. using save or saveRDS):

save(LGHomDf_P1a,LGHomDf_P2c, file = "LGHomDf_stacks.Rdata") #for example.. 

We re-run the P2 assignment for the 1x1 markers, now that our naming of linkage groups has changed in this parent:

P2_SxS_Assigned <- assign_linkage_group(linkage_df = SN_SS_P2,
                                        LG_hom_stack = LGHomDf_P2c, #this is changed
                                        SN_colname = "marker_a",
                                        unassigned_marker_name = "marker_b",
                                        phase_considered = "coupling",
                                        LG_number = 5,
                                        LOD_threshold = 3,
                                        ploidy = 4)

We now run the complete marker assignment for both parents and the rest of the markers, using linkage with our 1x0 markers:

marker_assignments_P1 <- homologue_lg_assignment(input_type = "probabilistic",
                                                 probgeno_df = geno2,
                                                 chk = chk1,
                                                 assigned_list = list(P1_SxS_Assigned),
                                                 assigned_markertypes = list(c(1,1)),
                                                 LG_hom_stack = LGHomDf_P1a,
                                                 target_parent = "P1",
                                                 other_parent = "P2",
                                                 ploidy = 4,
                                                 pairing = "random",
                                                 convert_palindrome_markers = FALSE,
                                                 LG_number = 5,
                                                 LOD_threshold = 3,
                                                 write_intermediate_files = FALSE)

marker_assignments_P2 <- homologue_lg_assignment(input_type = "probabilistic",
                                                 probgeno_df = geno2,
                                                 chk = chk1,
                                                 assigned_list = list(P2_SxS_Assigned),
                                                 assigned_markertypes = list(c(1,1)),
                                                 LG_hom_stack = LGHomDf_P2c,
                                                 target_parent = "P2",
                                                 other_parent = "P1",
                                                 ploidy = 4,
                                                 pairing = "random",
                                                 convert_palindrome_markers = FALSE,
                                                 LG_number = 5,
                                                 LOD_threshold = 3,
                                                 write_intermediate_files = FALSE)

Once this step has completed (it takes a couple of minutes) we can compare chromosomal linkage group assignments for all markers across parents and screen out those markers that give conflicting results, a further quality check:

marker_assignments <- check_marker_assignment(marker_assignments_P1,marker_assignments_P2)

This is a very important object and it is recommended to save this also (rather than relying on it being saved in a heap in the Global Environment, which is a somewhat sloppy practice):

saveRDS(marker_assignments, file = "marker_assignments.RDS")

The final step in the linkage analysis is to compile linkage estimates for all pairs of markers within a chromosomal linkage group using the finish_linkage_analysis function. This step may take a bit longer, so this is a good time to catch up on some other work, having set both of the following function calls running:

all_linkages_list_P1 <- finish_linkage_analysis(input_type = "probabilistic",
                                                marker_assignment = marker_assignments$P1,
                                                probgeno_df = geno2,
                                                chk = chk1,
                                                target_parent = "P1",
                                                other_parent = "P2",
                                                convert_palindrome_markers = FALSE,
                                                ploidy = 4,
                                                pairing = "random",
                                                LG_number = 5, 
                                                ncores = nc) 

all_linkages_list_P2 <- finish_linkage_analysis(input_type = "probabilistic",
                                                marker_assignment = marker_assignments$P2,
                                                probgeno_df = geno2,
                                                chk = chk1,
                                                target_parent = "P2",
                                                other_parent = "P1",
                                                convert_palindrome_markers = TRUE,
                                                ploidy = 4,
                                                pairing = "random",
                                                LG_number = 5,
                                                ncores = nc) 

We compile the results from both parents into a single list as follows:

linkages <- list()
for(lg in names(all_linkages_list_P1)){
  linkages[[lg]] <- rbind(all_linkages_list_P1[[lg]], all_linkages_list_P2[[lg]])
}

This (large) object is also worth saving. We will again use the .RDS format:

saveRDS(linkages, file = "linkages.RDS")

This can be read back in using readRDS in future:

linkages <- readRDS("linkages.RDS")

7. Ordering the map

For map ordering we use the ordering algorithm implemented in the MDSMap package [7]. This is achieved using the function MDSMap_from_list:

integrated.maplist <- MDSMap_from_list(linkages)

Once the map has been generated, we can use the marker assignment information to generate a phased linkage map (necessary for other subsequent steps e.g. IBD-based QTL mapping, available in the polyqtlR package):

phased.maplist <- create_phased_maplist(input_type = "probabilistic",
                                        maplist = integrated.maplist,
                                        chk = chk1,
                                        ploidy = 4,
                                        marker_assignment.1 = marker_assignments$P1, 
                                        marker_assignment.2 = marker_assignments$P2)

For extra details e.g on map visualisations, see the main polymapR vignette.

8. References

1. Bourke PM et al: polymapR—linkage analysis and genetic map construction from F1 populations of outcrossing polyploids. Bioinformatics 2018, 34:3496–3502.
2. Voorrips RE, Gort G, Vosman B: Genotype calling in tetraploid species from bi-allelic marker data using mixture models. BMC bioinformatics 2011, 12:172.
3. Gerard D, Ferrão LFV, Garcia AAF, Stephens M: Genotyping polyploids from messy sequencing data. Genetics 2018, 210:789–807.
4. Clark LV, Lipka AE, Sacks EJ: polyRAD: Genotype calling with uncertainty from sequencing data in polyploids and diploids. G3: Genes, Genomes, Genetics 2019, 9:663–673.
5. Liao Y, Voorrips RE, Bourke PM, Tumino G, Arens P, Visser RG, Smulders MJ, Maliepaard C: Using probabilistic genotypes in linkage analysis of polyploids. Theoretical and Applied Genetics 2021, 134:2443--2457.
6. Voorrips RE, Maliepaard CA: The simulation of meiosis in diploid and tetraploid organisms using various genetic models. BMC bioinformatics 2012, 13:248.
7. Preedy KF, Hackett CA: A rapid marker ordering approach for high-density genetic linkage maps in experimental autotetraploid populations using multidimensional scaling. Theoretical and Applied Genetics 2016, 129:2117–2132.