A reference haplotype panel for genome-wide imputation of short tandem repeats

1000 Genomes SNP-STR Haplotype Panel

Data Description:

Availability: Amazon S3 bucket s3://snp-str-imputation/1000genomes

[SNP-STR Panel chr1] [chr1 index]

[SNP-STR Panel chr2] [chr2 index]

[SNP-STR Panel chr3] [chr3 index]

[SNP-STR Panel chr4] [chr4 index]

[SNP-STR Panel chr5] [chr5 index]

[SNP-STR Panel chr6] [chr6 index]

[SNP-STR Panel chr7] [chr7 index]

[SNP-STR Panel chr8] [chr8 index]

[SNP-STR Panel chr9] [chr9 index]

[SNP-STR Panel chr10] [chr10 index]

[SNP-STR Panel chr11] [chr11 index]

[SNP-STR Panel chr12] [chr12 index]

[SNP-STR Panel chr13] [chr13 index]

[SNP-STR Panel chr14] [chr14 index]

[SNP-STR Panel chr15] [chr15 index]

[SNP-STR Panel chr16] [chr16 index]

[SNP-STR Panel chr17] [chr17 index]

[SNP-STR Panel chr18] [chr18 index]

[SNP-STR Panel chr19] [chr19 index]

[SNP-STR Panel chr20] [chr20 index]

[SNP-STR Panel chr21] [chr21 index]

[SNP-STR Panel chr22] [chr22 index]
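
The panel files listed above can also be fetched from the S3 bucket on the command line. The following is a minimal sketch using the AWS CLI. It assumes the bucket allows anonymous reads (hence --no-sign-request), and the helper names list_panel and fetch_panel_file are hypothetical. Object names inside the bucket are not given here, so list the bucket first and pass the exact name to the fetch helper.

```shell
# Bucket location taken from the Availability line above.
PANEL_S3="s3://snp-str-imputation/1000genomes"

# List the objects in the panel bucket (assumes anonymous access is permitted).
list_panel() {
    aws s3 ls "${PANEL_S3}/" --no-sign-request
}

# Copy one object, named exactly as shown by list_panel, into the current directory.
fetch_panel_file() {
    aws s3 cp "${PANEL_S3}/$1" . --no-sign-request
}
```

Run list_panel first, then call fetch_panel_file once for each panel file and once for its index.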

Tredparse pathogenic loci

Huntington’s Disease - index

Spinocerebellar Ataxia 1 - index

Spinocerebellar Ataxia 17 - index

Dentatorubral-pallidoluysian Atrophy - index

Spinocerebellar Ataxia 2 - index

Spinocerebellar Ataxia 8 - index

Spinocerebellar Ataxia 3 - index

Spinocerebellar Ataxia 6 - index

Myotonic Dystrophy Type 1 - index

Supplementary Data

Supplementary Tables 2 and 3 give imputation summary statistics for each locus:

Saini et al. Supplementary Table 2

Saini et al. Supplementary Table 3

Usage:

Download the Beagle .jar file to impute STRs from our reference panel into SNP genotype data. We suggest Beagle version 4.1. If you are working with related samples and want to use pedigree information, use Beagle version 4.0.

Ensure that the alleles in the target SNP file match our reference panel. We suggest using conform-gt. Example:

java -jar conform-gt.jar \
gt=snp.vcf.gz \
ref=1kg.snp.str.chr1.vcf.gz \
chrom=1 match=POS \
out=snp.chr1.consistent

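Impute STRs into your SNP file. Beagle treats out= as a filename prefix and appends .vcf.gz, so the command below writes snp.str.chr1.vcf.gz:

java -Xmx8g -jar beagle.version.jar \
gt=snp.chr1.consistent.vcf.gz \
ref=1kg.snp.str.chr1.vcf.gz \
out=snp.str.chr1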
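
To repeat the two steps above for every autosome, the commands can be generated in a loop. The following is a minimal sketch, not part of the original instructions. It assumes the per-chromosome panel files follow the 1kg.snp.str.chrN.vcf.gz naming of the chr1 example and that snp.vcf.gz contains all chromosomes. The function name impute_commands is hypothetical. Because Beagle appends .vcf.gz to the out= prefix, the sketch passes the prefix without the extension. It only prints the commands, so they can be reviewed before anything is run.

```shell
# Print the conform-gt and Beagle commands for one chromosome.
# File names for chromosomes other than 1 are assumed, not confirmed.
impute_commands() {
    chrom="$1"
    ref="1kg.snp.str.chr${chrom}.vcf.gz"
    echo "java -jar conform-gt.jar gt=snp.vcf.gz ref=${ref} chrom=${chrom} match=POS out=snp.chr${chrom}.consistent"
    echo "java -Xmx8g -jar beagle.version.jar gt=snp.chr${chrom}.consistent.vcf.gz ref=${ref} out=snp.str.chr${chrom}"
}

# Emit the commands for autosomes 1 through 22.
chrom_index=1
while [ "${chrom_index}" -le 22 ]; do
    impute_commands "${chrom_index}"
    chrom_index=$((chrom_index + 1))
done
```

Saving this as a script and piping its output into sh runs the commands in order, one chromosome at a time.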

FAQ

How do I convert PLINK BED format files to VCF format? Use PLINK to convert from BED to VCF format, and ensure that the reference allele matches our panel. The commands below extract the SNP IDs and alleles from the panel, then recode the BED file against them:

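REFPANEL=1kg.snp.str.chr1.vcf.gz
# IDs of the panel SNPs (records whose ID contains no ":")
zcat "${REFPANEL}" | grep -v "^#" | cut -f 3 | grep -v ":" > refpanel_chr1_snps.txt
# CHROM, POS, ID, REF, ALT for the same records
zcat "${REFPANEL}" | grep -v "^#" | awk '($3!~/:/)' | cut -f 1-5 > refpanel_chr1_alleles.txt

# Recode to bgzipped VCF, keeping only panel SNPs and taking the reference
# allele from column 4 of the alleles file, matched on the ID in column 3
plink \
--bfile snp_file \
--recode vcf bgz \
--out snp_file_recoded \
--extract refpanel_chr1_snps.txt \
--real-ref-alleles \
--a2-allele refpanel_chr1_alleles.txt 4 3 '#'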
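
After recoding, it can be worth confirming that the reference alleles really do match the panel. The following is a minimal sketch of such a check, not part of the original workflow. It assumes the alleles file has the five tab-separated columns produced above (CHROM, POS, ID, REF, ALT), is not empty, and that variants are matched by ID. The function name count_ref_mismatches is hypothetical.

```shell
# count_ref_mismatches ALLELES_FILE TARGET_VCF_GZ
# Prints the number of target records whose REF differs from the panel REF
# for the same variant ID. Records absent from the panel are ignored.
count_ref_mismatches() {
    alleles="$1"
    vcf="$2"
    gzip -dc "${vcf}" | awk -F'\t' '
        NR == FNR { ref[$3] = $4; next }      # first input: panel ID -> REF
        /^#/ { next }                         # skip VCF header lines
        ($3 in ref) && ref[$3] != $4 { n++ }  # same ID, different REF
        END { print n + 0 }
    ' "${alleles}" -
}
```

For the files created above, the call would be count_ref_mismatches refpanel_chr1_alleles.txt snp_file_recoded.vcf.gz. A result of 0 means every shared variant has the same reference allele as the panel.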

Do I need phased SNPs as input? No, Beagle will phase the input SNPs during the imputation process.

How do I measure the STR imputation accuracy? We measure the per-locus accuracy of imputing STRs from the Simons Simplex Collection into the 1000 Genomes data for three different populations: EUR, EAS and AFR. This data is available in the Amazon S3 bucket listed under Availability (Saini_etal_SuppTable2.xlsx). You can expect to impute STRs with similar accuracy if your SNP data is from one of these populations.