diaQTL uses the Markov Chain Monte Carlo (MCMC) algorithm in R package BGLR to regress phenotypes on the expected dosage of parental origin genotypes. The example dataset is a potato half-diallel population with three founders.
Three input files are needed for QTL analysis: (1) pedigree file, (2)
genotype file, (3) phenotype file. diaQTL contains several functions to
prepare the input files from the output of several linkage analysis
packages, including PolyOrigin
(convert_polyorigin
), MAPpoly
(convert_mappoly
), RABBIT
(convert_rabbit
), and onemap
(convert_onemap
). The tutorial dataset is a multiparental
tetraploid population, for which PolyOrigin is the only option.
The pedigree file has three columns: id, parent1, and parent2 (maternal effects are not modeled).
pedcsv <- system.file("vignette_data", "potato_ped.csv", package = "diaQTL")
ped <- read.csv(pedcsv, as.is = T)
head(ped)
## id parent1 parent2
## 1 W15268-1R W6511-1R VillettaRose
## 2 W15268-4R W6511-1R VillettaRose
## 3 W15268-5R W6511-1R VillettaRose
## 4 W15268-6R W6511-1R VillettaRose
## 5 W15268-7R W6511-1R VillettaRose
## 6 W15268-8R W6511-1R VillettaRose
##
## VillettaRose x W9914-1R W6511-1R x VillettaRose W6511-1R x W9914-1R
## 113 155 147
The first 3 columns of the genotype file must be labeled marker, chrom, and cM, and the position in a reference genome (labeled bp) is optional as the fourth column (plotting features can use either cM or bp). Subsequent columns contain the genotype probabilities for each individual.
genocsv <- system.file( "vignette_data", "potato_geno.csv", package = "diaQTL" )
geno <- read.csv( genocsv, as.is = T, check.names = F )
geno[1:5,1:4]
## marker chrom cM bp
## 1 solcap_snp_c2_36608 1 0.00 251132
## 2 solcap_snp_c2_36658 1 0.15 269400
## 3 solcap_snp_c1_10930 1 0.29 309342
## 4 PotVar0120130 1 0.41 353979
## 5 PotVar0120070 1 0.69 433801
Genotype probabilities are encoded as strings, following the format exported by the PolyOrigin software:
## [1] "12|13|19|23|31|34|62|63|67|69|83=>0.005|0.074|0.004|0.004|0.023|0.006|0.016|0.85|0.002|0.014|0.001"
The integers separated by | on the left side of the equal sign refer to genotype states, and the decimal numbers on the right side of the equal sign are probabilities. Only nonzero probabilities need to be included. There are 100 possible states for F1 populations, and 35 possible states for S1 populations:
## Code State
## 1 1 1-1-5-5
## 2 2 1-1-5-6
## 3 3 1-1-5-7
## 4 4 1-1-5-8
## 5 5 1-1-6-6
## 6 6 1-1-6-7
## Code State
## 1 1 1-1-1-1
## 2 2 1-1-1-2
## 3 3 1-1-1-3
## 4 4 1-1-1-4
## 5 5 1-1-2-2
## 6 6 1-1-2-3
Each state has four integers, separated by dashes, to indicate which parental chromosomes were inherited. For F1 populations, the maternal chromosomes are labeled 1-4 and the paternal chromosomes 5-8.
In the phenotype input file, the first column should be the individual identifier, followed by columns for different traits, and then optionally any columns with fixed effects to include in the linear model (e.g., block, environment). Only one trait, tuber shape, is provided in the example potato data set.
phenocsv <- system.file( "vignette_data", "potato_pheno.csv", package = "diaQTL" )
pheno <- read.csv( phenocsv, as.is = T )
head( pheno )
## id tuber_shape
## 1 W15268-1R -0.38
## 2 W15268-4R -2.31
## 3 W15268-5R -1.91
## 4 W15268-6R -0.69
## 5 W15268-7R -1.61
## 6 W15268-8R -2.14
To improve normality of the residuals, tuber shape in this data set is defined as log(L/W - 1), where L/W is the average length/width ratio of tubers weighing 6-10 ounces (170-285g).
After installing and attaching the package, use
read_data
to read in all three files. (If there are fixed
effects in the phenotype input file, they need to be specified as well;
consult the reference manual.) By default, markers with the same map
position in cM (using whatever numerical precision is present in the
input map) are binned to reduce the computing time. The argument
n.core = 2
is used for parallel execution on multiple
cores.
data <- read_data(genofile = genocsv,
ploidy = 4,
pedfile = pedcsv,
phenofile = phenocsv,
n.core = 2)
## 415 individuals with pedigree and genotype data
## 413 individuals with pedigree, genotype and phenotype data
## Preparing genotype data...
## Preparing for polygenic effects...
The function set_params
determines the burn-in and total
number of iterations using the Raftery and Lewis diagnostic from R
package coda
, based on a 95% probability that the estimate
for quantile q
of the additive effects is within the
interval (q-r,q+r)
. For the genome scan, we have found the
results based on q=0.5,r=0.1
to be adequate. Because MCMC
is a stochastic process, the results will not be the same each time.
Results are shown for each variance component, and the user should
choose values based on the slowest to converge (i.e., the largest number
of iterations).
## burnIn nIter
## additive 16 592
## residual 2 104
The scan1
function performs a hypothesis test for each
marker bin. By default an additive model is used, which means the
predictor variables are founder haplotype dosages. The test statistic is
−ΔDIC, which is the DIC
(Deviance Information Criterion) for the null model relative to the QTL
model. Lower values of DIC indicate a better tradeoff between model
complexity and goodness-of-fit. For a single hypothesis test, DIC
differences of 5 or 10 are commonly
recommended for model selection, but through simulation we have
shown that larger differences are needed to control the genome-wide Type
I error rate. The function DIC_thresh
returns the −ΔDIC threshold for a half-diallel
based on the number of parents, genome size, ploidy and α. The genome size for the dataset
can be obtained using get_map
.
## chrom Morgans
## 1 1 1.37
## 2 2 0.98
## 3 3 1.11
## 4 4 1.13
## 5 5 0.87
## 6 6 0.94
## 7 7 0.91
## 8 8 0.93
## 9 9 1.08
## 10 10 0.93
## 11 11 0.91
## 12 12 0.94
## 13 total: 12.09
alpha.05 <- DIC_thresh(genome.size=12.1,num.parents=3,
ploidy=4,alpha=0.05)
alpha.1 <- DIC_thresh(genome.size=12.1,num.parents=3,
ploidy=4,alpha=0.1)
ans1 <- scan1(data = data,
trait = "tuber_shape",
params = list(burnIn=50,nIter=500),
n.core = 2)
ans1.summary <- scan1_summary(ans1, position="bp")
ans1.summary$peaks
## marker chrom cM bp LL deltaDIC
## 663 PotVar0099436 1 133.54 86852633 -364.2834 -19.811136
## 1063 PotVar0010012 2 67.17 39792263 -367.1739 -12.181148
## 1471 PotVar0055393 3 61.87 45178977 -366.8075 -12.938175
## 1788 solcap_snp_c2_7751 4 17.33 4012716 -367.8386 -8.849346
## 2257 solcap_snp_c2_23836 5 2.52 694653 -368.9612 -9.620244
## 2848 solcap_snp_c1_15726 6 0.87 238858 -370.1300 -4.044124
## 3496 solcap_snp_c1_6240 7 62.63 48778791 -371.2051 -4.791841
## 3670 solcap_snp_c2_47896 8 26.78 7843374 -368.2593 -9.485905
## 4074 solcap_snp_c2_39216 9 15.49 2792921 -370.8404 -3.934421
## 4511 solcap_snp_c2_25522 10 62.71 48617457 -290.5895 -160.672277
## 4676 solcap_snp_c2_13392 11 0.78 597956 -373.5064 -1.219346
## 5334 solcap_snp_c2_5440 12 94.28 61086761 -372.6588 -2.241159
library(ggplot2)
ans1.summary$plot + geom_hline(yintercept=alpha.05,color="gold2",linetype=2) +
geom_hline(yintercept=alpha.1,color="red",linetype=2)
The scan1_summary
function returns the marker with the
highest −ΔDIC score on each
chromosome and a plot of the −ΔDIC profile. The peak on chromsome
10@63 cM coincides with the location of the classical Ro (round) QTL in
potato, which has been identified as the gene StOFP20 (Wu et al. 2018).
The 90% Bayesian credible interval (CI) can be obtained using
BayesCI
, based on the profile log-likelihood
(LL
), and this interval contains StOFP20 based on
reference genome coordinates.
## marker chrom cM bp LL deltaDIC
## 4509 solcap_snp_c2_45612 10 61.96 48203384 -304.2553 -132.6171
## 4510 solcap_snp_c2_49532 10 62.18 48443334 -302.2670 -137.5223
## 4511 solcap_snp_c2_25522 10 62.71 48617457 -290.5895 -160.6723
## 4512 PotVar0111667 10 62.85 48717669 -291.8617 -157.0391
## 4513 PotVar0111687 10 62.98 48721966 -291.9920 -157.6171
## 4514 solcap_snp_c2_25485 10 63.78 48737840 -293.8377 -154.0239
## 4515 solcap_snp_c2_27829 10 74.54 50782097 -309.3422 -122.0998
The small peak on chromosome 1@ 133 cM is right at the detection threshold for α = 0.05 but significant at α = 0.1.
For a single tetraploid locus, there are four types of genetic
effects. As mentioned already, additive effects are the regression
coefficients for parental haplotype dosage. Digenic dominance effects
are the regression coefficients for parental diplotypes, i.e., a
combination of two parental haplotypes. Trigenic and quadrigenic
dominance effects are also possible (and do not exist in diploid
species). Throughout the diaQTL package, the argument
dominance
is used to specify the highest order of the
effect to include in the model. Thus, dominance = 1 indicates only
additive effects, dominance = 2 indicates additive and digenic effects,
etc.
Having discovered a QTL, one may wish to rescan the chromosome with a digenic model to refine its position. In this case, the location of the QTL peak is unchanged, but we can see from the deltaDIC output that including digenic dominance lowered the DIC by over 15 points.
ans2 <- scan1(data = data,
trait = "tuber_shape",
params = list(burnIn=50,nIter=500),
dominance = 2,
chrom = "10",
n.core = 2)
scan1_summary(ans2, position="bp")$peaks
## marker chrom cM bp LL deltaDIC
## 4511 solcap_snp_c2_25522 10 62.71 48617457 -272.9918 -176.4873
Markers can also be used as covariates with scan1
, which
is particularly useful for resolving multiple QTL on the same
chromosome.
After discovery, the next step is to fit a multiple QTL model with
function fitQTL
. The argument qtl
is a data
frame with the marker names and dominance values for each QTL, and
epistasis
is a data frame with two markers for additive x
additive epistasis. A set of progressively more complex models can be
compared based on DIC. To improve accuracy, we will use more iterations
based on the arguments q=0.05,r=0.025
for
set_params
.
qtl.10at63 <- ans1.summary$peaks$marker[10]
qtl.1at133 <- ans1.summary$peaks$marker[1]
model1 <- data.frame(marker=c(qtl.10at63,qtl.1at133),dominance=c(2,1))
model2 <- data.frame(marker=c(qtl.10at63,qtl.1at133),dominance=c(3,1))
model3 <- data.frame(marker=c(qtl.10at63,qtl.1at133),dominance=c(2,2))
set_params(data, trait = "tuber_shape", q=0.05, r=0.025, qtl = model1)
## burnIn nIter
## solcap_snp_c2_25522 additive 18 1582
## solcap_snp_c2_25522 digenic 12 1056
## PotVar0099436 additive 6 501
## residual 3 321
params <- list(burnIn=100,nIter=5000)
fit1 <- fitQTL(data=data, trait="tuber_shape", params=params, qtl=model1)
fit2 <- fitQTL(data=data, trait="tuber_shape", params=params, qtl=model2)
fit3 <- fitQTL(data=data, trait="tuber_shape", params=params, qtl=model3)
fit4 <- fitQTL(data=data, trait="tuber_shape", params=params, qtl=model1,
epistasis=data.frame(marker1=qtl.10at63,marker2=qtl.1at133))
fit1$deltaDIC
## [1] -200.4348
## [1] -196.5416
## [1] -195.2042
## [1] -195.7474
Based on the DIC results, we select model1. Although inclusion of a
polygenic effect does not have much impact on QTL mapping in diallel
populations, it provides potentially useful information for genomic
selection. The proportion of variance for each of the effects is
returned in var
.
fit1.poly <- fitQTL(data=data, trait="tuber_shape", params=params, qtl=model1,
polygenic=TRUE)
fit1.poly$deltaDIC #accept polygenic effect based on DIC
## [1] -278.5444
## Mean CI.lower CI.upper
## solcap_snp_c2_25522 additive 0.27 0.16 0.37
## solcap_snp_c2_25522 digenic 0.08 0.03 0.14
## PotVar0099436 additive 0.05 0.02 0.09
## polygenic 0.29 0.19 0.40
## residual 0.31 0.23 0.40
Plots of the additive and digenic effects for each marker are
contained as elments of the list plots
. For the dominance
plot, digenic effects are above the diagonal, and below the diagonal is
the sum of the additive and digenic effects.
## $additive
##
## $digenic
From the additive effects, we can select which haplotypes are
desirable. In this example, since large negative values are desirable to
maintain round tuber shape, the most desirable haplotype is W6511-1R.2.
The function haplo_get
can be used to extract the dosage of
this haplotype across the population.
haplos <- haplo_get( data = data,
marker = qtl.10at63)
hist(haplos[,"W6511-1R.2"],main="",xlab="Dosage")
## W15268-53R W15268-93R W15267-10R
## 37 59 187
The result shows there are three individuals with two copies of the
W6511-1R.2 haplotype, which is possible due to “double reduction.” This
occurs when a quadrivalent forms in meiosis I and sister chromatid
fragments migrate to the same pole in meiosis II. The function
haplo_plot
can be used to visualize the pattern of
recombination between parental haplotypes.
The dark blue segment indicates two copies of the W6511-1R.2 haplotype, and the dashed vertical line shows the position of the 10@63 QTL.