Imputation of missing SNP genetic markers using linear regression models

Software products used for the imputation of missing SNPs require known relationships between the genotyped individuals. In routine breeding practice, the genotypes of parents are not always known. We therefore used our own methodological procedure. The aim of this study is to map the current research on genetic chips and to verify the calculation process. The testing was carried out at chosen loci in two datasets, using 8 models with different numbers of SNPs. For dataset A, the prediction of missing values was almost exact, with a model reliability of 100 %, with the exception of one homozygous locus where the reliability reached only 55 %. In dataset B, the most extensive model reached a reliability of 80–90 % even for homozygous loci, but the prediction error was higher than in dataset A. The results show that missing values can be predicted from the neighbouring SNPs.


Introduction
Working with genomic information in cattle breeding has become a standard procedure. Single nucleotide polymorphisms (SNPs) are used for the evaluation of genomic relationships, the prediction of genomic breeding values and the evaluation of tested animals. The most common chips used for genotyping are those from Illumina and Affymetrix. Each company develops its own genotyping techniques. Affymetrix uses a unified SNP coding across chip generations, so even older data can be used. Illumina uses several coding types across chip generations, so direct comparison of SNPs is not possible. Illumina offers chips of different densities and costs. Illumina chips have become a worldwide standard and are used by all breeding companies.
The most widely used imputation programs are Beagle (Browning and Browning, 2007), AlphaImpute (Hickey et al., 2012), Impute 2 (Howie and Marchini, 2009), DAGPHASE (Druet and Georges, 2010), FImpute (Sargolzaei et al., 2008), PedImpute (Nicolazzi et al., 2013) and MaCH (Li et al., 2010).
This study focuses on the completion of missing SNPs on genetic chips, more specifically on the completion of missing values in datasets that contain information about SNP occurrence in the cattle genome. We developed our own methodology because the genotypes of parents were missing and the allele coding was incomplete. The aim of this study was to map the current research on genetic chips and to verify the calculation process.

Material and methods

Data
Dataset A contained 260 bull genotypes of different dairy breeds from the Czech Republic. Dataset B contained 3982 genotypes of pure Holstein bulls from nine countries.

Dataset preparation
The tested SNPs were coded with three numbers according to genotype (0 = BB, 1 = AB, 2 = AA). Three loci (located on chromosome 1) were chosen from each dataset for testing according to the percentage of allele A.
Dataset A:
Locus 201: 50 % of allele A and average value of the locus 1.05 (heterozygous locus)
Locus 716: 75 % of allele A and average value of the locus 1.5
Locus 133: 95 % of allele A and average value of the locus 1.9 (almost homozygous locus)
Dataset B:
Locus 760: 50 % of allele A and average value of the locus 1.04 (heterozygous locus)
Locus 893: 75 % of allele A and average value of the locus 1.5
Locus 201: 95 % of allele A and average value of the locus 1.9 (almost homozygous locus)
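The relation between the genotype coding and the listed values can be illustrated with a short sketch. It assumes that the "average value of the locus" is the mean of the 0/1/2 codes over animals with an observed genotype, so that the frequency of allele A is half of that mean. The function name and the example data are hypothetical and are not taken from the study.

```python
def allele_a_summary(genotypes):
    """Return (mean genotype code, allele A frequency) for one locus.

    genotypes: list of codes 0 (BB), 1 (AB), 2 (AA); None marks a missing call.
    """
    observed = [g for g in genotypes if g is not None]
    if not observed:
        raise ValueError("locus has no observed genotypes")
    mean_code = sum(observed) / len(observed)
    # Each animal carries two alleles, so the allele A frequency is mean / 2.
    return mean_code, mean_code / 2


# Hypothetical locus: seven animals, one of them with a missing genotype.
locus = [2, 1, 2, 1, None, 2, 1]
mean_code, freq_a = allele_a_summary(locus)
print(f"mean code = {mean_code:.2f}, allele A frequency = {freq_a:.0%}")
```

Under this assumption, a locus with an average value of 1.5 corresponds to 75 % of allele A, which agrees with the values listed for locus 716 and locus 893.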

Statistical methods
In total, 8 models were used for testing both datasets. The models differed in the number of neighbouring loci used (10-100 loci). The number of neighbouring loci was determined on the assumption that the loci are all inherited together and that no crossing-over occurs in the particular region.
The largest model contained 100 loci, that is, 50 loci on the left side and 50 loci on the right side of the tested locus. These loci were used to calculate regression coefficients, which were then used for backward prediction of the tested loci. Testing was carried out in the SAS analytical software using the GLM procedure. The following model equation was used:
$$y_i = \mu + \sum_{j=1}^{n/2} L_{i-j} + \sum_{j=1}^{n/2} R_{i+j} + e_i$$
where $y_i$ is the tested locus; $\mu$ is the mean; $L_{i-j}$ is a locus on the left side of the tested locus; $R_{i+j}$ is a locus on the right side of the tested locus; $i$ is the number of the tested locus; $n$ is the number of neighbouring loci used; $e_i$ is the error. More loci could give better results, but the larger amount of data increases the computational cost of the calculations.
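The idea of regressing a tested locus on its flanking loci can be sketched in a few lines of Python. This is not the SAS GLM code used in the study; it is a minimal ordinary least-squares illustration that assumes complete genotypes in the training animals and solves the normal equations directly. All function names and the toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: neighbouring loci are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_flanking_model(genotypes, target, half_window):
    """Fit the target locus on half_window loci to each side of it.

    genotypes: one list of 0/1/2 codes per animal, ordered by map position.
    Returns (coefficients, column indices); coefficients[0] is the intercept.
    """
    if target - half_window < 0 or target + half_window >= len(genotypes[0]):
        raise ValueError("window extends past the end of the marker map")
    cols = (list(range(target - half_window, target))
            + list(range(target + 1, target + half_window + 1)))
    rows = [[1.0] + [float(animal[c]) for c in cols] for animal in genotypes]
    y = [float(animal[target]) for animal in genotypes]
    p = len(cols) + 1
    # Normal equations (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty), cols


def predict(coefs, cols, animal):
    """Predict the target locus of one animal from its flanking loci."""
    return coefs[0] + sum(b * animal[c] for b, c in zip(coefs[1:], cols))


# Toy data: three loci per animal, the middle one is the tested locus.
training = [[0, 0, 1], [1, 1, 0], [2, 2, 2], [1, 1, 2], [0, 0, 0], [2, 2, 1]]
coefs, cols = fit_flanking_model(training, target=1, half_window=1)
print(predict(coefs, cols, [2, None, 0]))  # animal with a missing tested locus
```

The prediction is a continuous value; whether and how it is rounded to a genotype code of 0, 1 or 2 is left open here, since the text does not specify it.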

Results and discussion
The testing of every model indicated that SNP prediction was most successful at the heterozygous locus with a 50 % rate of allele A. Only 50 neighbouring loci were enough for an almost exact prediction of the SNP (locus 201). At the locus with a 75 % rate of allele A (locus 716), the same results were obtained when 100 loci were used. At the almost homozygous locus (locus 133) with a 95 % rate of allele A, a reliability of only 56 % was achieved in the largest model (100 loci). The values of maximal absolute error were larger in dataset B, but on the other hand the reliability values were more balanced than in dataset A. These differences could be caused by the use of different animals and different tested loci in each dataset.
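The two quantities reported for each model, the reliability (R²) and the maximal absolute error, can be computed from observed and predicted genotype codes as in the following sketch. It assumes the usual coefficient of determination, 1 minus the ratio of residual to total sum of squares; the function names and example values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination of the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed genotypes are equal")
    return 1 - ss_res / ss_tot


def max_absolute_error(observed, predicted):
    """Largest absolute difference between observed and predicted codes."""
    return max(abs(o - p) for o, p in zip(observed, predicted))


# Hypothetical observed codes and model predictions for five animals.
observed = [2, 2, 1, 2, 0]
predicted = [1.9, 2.0, 1.2, 1.8, 0.3]
print(r_squared(observed, predicted), max_absolute_error(observed, predicted))
```

At an almost homozygous locus the total sum of squares is small, so a few mispredicted animals reduce R² considerably. This may be one reason why such loci show lower reliability.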
Our results are not comparable with other studies because we developed our own methodology. We could not use any of the programs commonly used for imputation because our database was not tailored to this software. If we had all the information these programs need, the best options for us would be Beagle and Impute 2 (Browning and Browning, 2007; Howie and Marchini, 2009), because these programs do not need genotypes linked to pedigree data for a correct calculation. The results show that missing values can be predicted from the neighbouring SNPs. Loci with more than 5 % missing values and individuals with more than 10 % missing values were excluded from the calculations.
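The exclusion thresholds for missing data can be sketched as a simple filter. The text does not state the order in which the two thresholds were applied; the sketch assumes that loci are filtered first and individuals second, on the remaining loci. The function name and the representation of missing values as None are hypothetical.

```python
def filter_missing(genotypes, max_locus_missing=0.05, max_animal_missing=0.10):
    """Drop loci, then animals, whose share of missing values is too high.

    genotypes: one list of codes per animal; None marks a missing value.
    Returns (filtered genotypes, kept locus indices, kept animal indices).
    """
    n_animals = len(genotypes)
    n_loci = len(genotypes[0])
    keep_loci = [
        j for j in range(n_loci)
        if sum(row[j] is None for row in genotypes) / n_animals <= max_locus_missing
    ]
    if not keep_loci:
        return [], [], []
    reduced = [[row[j] for j in keep_loci] for row in genotypes]
    keep_animals = [
        i for i, row in enumerate(reduced)
        if sum(g is None for g in row) / len(row) <= max_animal_missing
    ]
    return [reduced[i] for i in keep_animals], keep_loci, keep_animals


# Toy data with loosened thresholds so the effect is visible on four animals.
data = [
    [0, 1, None, 2],
    [1, None, None, 2],
    [2, 1, None, 0],
    [0, 1, 2, 1],
]
kept, loci, animals = filter_missing(data, max_locus_missing=0.25,
                                     max_animal_missing=0.2)
print(kept, loci, animals)
```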

Figure 1 Increasing reliability (R²) of every model for dataset A

Figure 2 Increasing reliability (R²) of every model for dataset B

Figure 3 Conformity of prediction and real value for locus 201 in the 8th model (100 loci)

Table 1 Values of maximal absolute error for every model