---
title: "diaQTL Vignette"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{diaQTL Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE,collapse=FALSE,message = FALSE,comment="##",fig.width=5,fig.height=5)
knitr::opts_knit$set(root.dir="~/")  
```

diaQTL uses the Markov Chain Monte Carlo (MCMC) algorithm in R package BGLR to regress phenotypes on the expected dosage of parental origin genotypes. The example dataset is a potato half-diallel population with three founders. 

## Structure of the input files
Three input files are needed for QTL analysis: (1) pedigree file, (2) genotype file, (3) phenotype file. diaQTL contains several functions to prepare the input files from the output of several linkage analysis packages, including [PolyOrigin](https://github.com/chaozhi/PolyOrigin.jl) (`convert_polyorigin`), [MAPpoly](https://github.com/mmolina/MAPpoly) (`convert_mappoly`), [RABBIT](https://github.com/chaozhi/RABBIT) (`convert_rabbit`), and [onemap](https://github.com/augusto-garcia/onemap) (`convert_onemap`). The tutorial dataset is a multiparental tetraploid population, for which PolyOrigin is the only option.

### 1) Pedigree file
The pedigree file has three columns: id, parent1, and parent2 (maternal effects are not modeled). 
```{r}
pedcsv <- system.file("vignette_data", "potato_ped.csv", package = "diaQTL")
ped <- read.csv(pedcsv, as.is = T)
head(ped)
table(apply(ped[,2:3],1,paste,collapse=" x "))
```

### 2) Genotype file
The first 3 columns of the genotype file must be labeled marker, chrom, and cM, and the position in a reference genome (labeled bp) is optional as the fourth column (plotting features can use either cM or bp). Subsequent columns contain the genotype probabilities for each individual. 
```{r}
genocsv <- system.file( "vignette_data", "potato_geno.csv", package = "diaQTL" )
geno <- read.csv( genocsv, as.is = T, check.names = F )
geno[1:5,1:4]
```
Genotype probabilities are encoded as strings, following the format exported by the PolyOrigin software:
```{r}
geno[1,10]
```
The integers separated by | on the left side of the equal sign refer to genotype states, and the decimal numbers on the right side of the equal sign are probabilities. Only nonzero probabilities need to be included. There are 100 possible states for F1 populations, and 35 possible states for S1 populations:
```{r}
library(diaQTL)
head(F1codes)
head(S1codes)
```
Each state has four integers, separated by dashes, to indicate which parental chromosomes were inherited. For F1 populations, the maternal chromosomes are labeled 1-4 and the paternal chromosomes 5-8. 

### 3) Phenotype file
In the phenotype input file, the first column should be the individual identifier, followed by columns for different traits, and then optionally any columns with fixed effects to include in the linear model (e.g., block, environment). Only one trait, tuber shape, is provided in the example potato data set.

```{r}
phenocsv <- system.file( "vignette_data", "potato_pheno.csv", package = "diaQTL" )
pheno <- read.csv( phenocsv, as.is = T )
head( pheno )
hist(pheno$tuber_shape,main="",xlab="Tuber shape")
```

To improve normality of the residuals, tuber shape in this data set is defined as log(L/W - 1), where L/W is the average length/width ratio of tubers weighing 6-10 ounces (170-285g). 

## Read the data

After installing and attaching the package, use `read_data` to read in all three files. (If there are fixed effects in the phenotype input file, they need to be specified as well; consult the reference manual.) By default, markers with the same map position in cM (using whatever numerical precision is present in the input map) are binned to reduce the computing time. The argument `n.core = 2` is used for parallel execution on multiple cores.
```{r }
data <- read_data(genofile = genocsv,
                  ploidy = 4,
                  pedfile = pedcsv,
                  phenofile = phenocsv, 
                  n.core = 2)
```

## Setting parameters

The function `set_params` determines the burn-in and total number of iterations using the Raftery and Lewis diagnostic from R package `coda`, based on a 95% probability that the estimate for quantile `q` of the additive effects is within the interval `(q-r,q+r)`. For the genome scan, we have found the results based on `q=0.5,r=0.1` to be adequate. Because MCMC is a stochastic process, the results will not be the same each time. Results are shown for each variance component, and the user should choose values based on the slowest to converge (i.e., the largest number of iterations).
```{r}
set_params( data, trait = "tuber_shape", q=0.5, r=0.1)
```

## QTL Discovery

The `scan1` function performs a hypothesis test for each marker bin. By default an additive model is used, which means the predictor variables are founder haplotype dosages. The test statistic is $-\Delta$DIC, which is the DIC (Deviance Information Criterion) for the null model relative to the QTL model. Lower values of DIC indicate a better tradeoff between model complexity and goodness-of-fit. For a single hypothesis test, DIC differences of 5 or 10 are [commonly recommended](https://www.mrc-bsu.cam.ac.uk/software/bugs/the-bugs-project-dic/#q9) for model selection, but through simulation we have shown that larger differences are needed to control the genome-wide Type I error rate. The function `DIC_thresh` returns the $-\Delta$DIC threshold for a half-diallel based on the number of parents, genome size, ploidy and $\alpha$. The genome size for the dataset can be obtained using `get_map`.

```{r }
get_map(data)
alpha.05 <- DIC_thresh(genome.size=12.1,num.parents=3,
                       ploidy=4,alpha=0.05) 
alpha.1 <- DIC_thresh(genome.size=12.1,num.parents=3,
                      ploidy=4,alpha=0.1) 

ans1 <- scan1(data = data, 
              trait = "tuber_shape",
              params = list(burnIn=50,nIter=500), 
              n.core = 2)

ans1.summary <- scan1_summary(ans1, position="bp")
ans1.summary$peaks
library(ggplot2)
ans1.summary$plot + geom_hline(yintercept=alpha.05,color="gold2",linetype=2) +
  geom_hline(yintercept=alpha.1,color="red",linetype=2)
```

The `scan1_summary` function returns the marker with the highest $-\Delta$DIC score on each chromosome and a plot of the $-\Delta$DIC profile. The peak on chromsome 10\@63 cM coincides with the location of the classical Ro (round) QTL in potato, which has been identified as the gene *StOFP20* [(Wu et al. 2018)](https://doi.org/10.1038/s41467-018-07216-8). The 90% Bayesian credible interval (CI) can be obtained using `BayesCI`, based on the profile log-likelihood (`LL`), and this interval contains *StOFP20* based on reference genome coordinates.

```{r}
BayesCI(ans1,data,chrom="10",CI.prob=0.9)
```

The small peak on chromosome 1\@ 133 cM is right at the detection threshold for $\alpha = 0.05$ but significant at $\alpha = 0.1$. 

## Dominance

For a single tetraploid locus, there are four types of genetic effects. As mentioned already, additive effects are the regression coefficients for parental haplotype dosage. Digenic dominance effects are the regression coefficients for parental diplotypes, i.e., a combination of two parental haplotypes. Trigenic and quadrigenic dominance effects are also possible (and do not exist in diploid species). Throughout the diaQTL package, the argument `dominance` is used to specify the highest order of the effect to include in the model. Thus, dominance = 1 indicates only additive effects, dominance = 2 indicates additive and digenic effects, etc.

Having discovered a QTL, one may wish to rescan the chromosome with a digenic model to refine its position. In this case, the location of the QTL peak is unchanged, but we can see from the deltaDIC output that including digenic dominance lowered the DIC by over 15 points. 
```{r}
ans2 <- scan1(data = data,
              trait = "tuber_shape",
              params = list(burnIn=50,nIter=500),
              dominance = 2,
              chrom = "10",
              n.core = 2)
scan1_summary(ans2, position="bp")$peaks
```

Markers can also be used as covariates with `scan1`, which is particularly useful for resolving multiple QTL on the same chromosome. 

## QTL Modeling

After discovery, the next step is to fit a multiple QTL model with function `fitQTL`. The argument `qtl` is a data frame with the marker names and dominance values for each QTL, and `epistasis` is a data frame with two markers for additive x additive epistasis. A set of progressively more complex models can be compared based on DIC. To improve accuracy, we will use more iterations based on the arguments `q=0.05,r=0.025` for `set_params`. 

```{r}
qtl.10at63 <- ans1.summary$peaks$marker[10]
qtl.1at133 <- ans1.summary$peaks$marker[1]
model1 <- data.frame(marker=c(qtl.10at63,qtl.1at133),dominance=c(2,1))
model2 <- data.frame(marker=c(qtl.10at63,qtl.1at133),dominance=c(3,1))
model3 <- data.frame(marker=c(qtl.10at63,qtl.1at133),dominance=c(2,2))

set_params(data, trait = "tuber_shape", q=0.05, r=0.025, qtl = model1)
params <- list(burnIn=100,nIter=5000)

fit1 <- fitQTL(data=data, trait="tuber_shape", params=params, qtl=model1)
fit2 <- fitQTL(data=data, trait="tuber_shape", params=params, qtl=model2)
fit3 <- fitQTL(data=data, trait="tuber_shape", params=params, qtl=model3)
fit4 <- fitQTL(data=data, trait="tuber_shape", params=params, qtl=model1,
               epistasis=data.frame(marker1=qtl.10at63,marker2=qtl.1at133))

fit1$deltaDIC
fit2$deltaDIC
fit3$deltaDIC
fit4$deltaDIC
```

Based on the DIC results, we select model1. Although inclusion of a polygenic effect does not have much impact on QTL mapping in diallel populations, it provides potentially useful information for genomic selection. The proportion of variance for each of the effects is returned in `var`. 
```{r}
fit1.poly <- fitQTL(data=data, trait="tuber_shape", params=params, qtl=model1,
                    polygenic=TRUE)
fit1.poly$deltaDIC #accept polygenic effect based on DIC
fit1.poly$var
```

Plots of the additive and digenic effects for each marker are contained as elments of the list `plots`. For the dominance plot, digenic effects are above the diagonal, and below the diagonal is the sum of the additive and digenic effects. 
```{r}
fit3$plots[[qtl.10at63]]
```

## Haplotype-based Selection

From the additive effects, we can select which haplotypes are desirable. In this example, since large negative values are desirable to maintain round tuber shape, the most desirable haplotype is W6511-1R.2. The function `haplo_get` can be used to extract the dosage of this haplotype across the population.
```{r}
haplos <- haplo_get( data = data, 
                      marker = qtl.10at63)
hist(haplos[,"W6511-1R.2"],main="",xlab="Dosage")
which(haplos[,"W6511-1R.2"] > 1.8)
```

The result shows there are three individuals with two copies of the W6511-1R.2 haplotype, which is possible due to "double reduction." This occurs when a quadrivalent forms in meiosis I and sister chromatid fragments migrate to the same pole in meiosis II. The function `haplo_plot` can be used to visualize the pattern of recombination between parental haplotypes.
```{r}
haplo_plot( data = data, 
            id = "W15268-53R", 
            chrom = 10,
            position = "cM",
            marker = qtl.10at63)
```

The dark blue segment indicates two copies of the W6511-1R.2 haplotype, and the dashed vertical line shows the position of the 10\@63 QTL.