Creating an internal database VCF file
Upon request, Illumina Support builds internal historic or noise database VCF files from the customer’s cases based on supplied requirements.
What data is tracked for each variant?
For each variant, the internal historic database records:
Allele frequency (AF): The ratio of the number of variant alleles to the total number of alleles in the dataset
Allele count (AC): The number of observed variant alleles
Allele number (AN): The total number of alleles assessed, derived from the number of individuals
Genotype counts: heterozygous (HET), homozygous (HOM), and hemizygous (HEMI)
Sample list (TEN): A short list of samples where the variant was found
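As a rough illustration of how these fields relate, the following Python sketch computes AF from AC and AN for a hypothetical autosomal site. It assumes that every individual is diploid (so AN is twice the number of individuals) and that each heterozygous sample adds one allele and each homozygous sample adds two. The function name and the example counts are illustrative only and are not taken from the actual database build.

```python
def allele_frequency(ac: int, an: int) -> float:
    """Return AF = AC / AN, or 0.0 when no alleles were assessed."""
    if an <= 0:
        return 0.0
    return ac / an


# Hypothetical autosomal site observed in a dataset of 50 individuals.
n_individuals = 50
het, hom = 6, 2              # genotype counts (HET, HOM)

ac = het + 2 * hom           # assumption: HET adds 1 allele, HOM adds 2
an = 2 * n_individuals       # assumption: all individuals are diploid here
af = allele_frequency(ac, an)

print(f"AC={ac} AN={an} AF={af:.3f}")
```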
How are variants merged across cases?
Historic database
Variant extraction
Variants are extracted from the original input files without any filtering, ensuring that all calls are preserved.
Variant grouping and merging
Variants are grouped by calling methodology and merged across cases based on matching information:
Single nucleotide variants (SNVs): Merged across cases when they share the same chromosome, position, and reference and alternate alleles
Copy number variants (CNVs), insertions >50bp (INS), short tandem repeats (STRs), and regions of homozygosity (ROH): Merged across cases when they share the same chromosome, start position, end position, and reference and alternate alleles
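The matching rules above can be pictured as building a merge key per variant and grouping cases by that key. The Python sketch below is only an illustration: the dictionary layout, the field names (type, chrom, pos, start, end, ref, alt), and the use of the variant type as a stand-in for the calling methodology are assumptions, not the actual implementation.

```python
from collections import defaultdict


def merge_key(variant: dict) -> tuple:
    """Build the key on which variants are matched across cases."""
    if variant["type"] == "SNV":
        # SNVs: chromosome, position, reference and alternate alleles
        return ("SNV", variant["chrom"], variant["pos"],
                variant["ref"], variant["alt"])
    # CNV, INS, STR, ROH: chromosome, start, end, reference and alternate alleles
    return (variant["type"], variant["chrom"], variant["start"],
            variant["end"], variant["ref"], variant["alt"])


def merge_across_cases(cases: dict) -> dict:
    """Map each merge key to the list of cases in which it was seen."""
    merged = defaultdict(list)
    for case_id, variants in cases.items():
        for variant in variants:
            merged[merge_key(variant)].append(case_id)
    return dict(merged)


# Hypothetical input: two cases sharing one SNV but with CNVs that differ in end position.
snv = {"type": "SNV", "chrom": "chr1", "pos": 12345, "ref": "A", "alt": "G"}
cases = {
    "case_A": [snv, {"type": "CNV", "chrom": "chr2", "start": 1000,
                     "end": 5000, "ref": "N", "alt": "<DEL>"}],
    "case_B": [snv, {"type": "CNV", "chrom": "chr2", "start": 1000,
                     "end": 5200, "ref": "N", "alt": "<DEL>"}],
}
merged = merge_across_cases(cases)
```

In this example the shared SNV collapses into one entry seen in both cases, while the two CNVs stay separate because their end positions differ.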
Noise database
Variant extraction
Variants are extracted from the original input files without any filtering, ensuring that all calls are preserved.
Variant grouping and merging
Variants are grouped by calling methodology and merged across cases based on matching information:
Single nucleotide variants (SNVs): Merged across cases when they share the same chromosome, position, and reference and alternate alleles
Copy number variants (CNVs), insertions >50bp (INS), short tandem repeats (STRs), and regions of homozygosity (ROH): Merged across cases when they share the same chromosome, start position, end position, and reference and alternate alleles
CNV and INS variants: Variant clustering
The algorithm selects the most frequent variant at each locus and uses it as a pivot (reference) variant.
The pivot is merged with other variants of the same type that reciprocally overlap it by at least 70%. Let P be the pivot (length = p) and V be another variant (length = v), with an overlapping region of length o. They are merged if both conditions are satisfied:
o/p≥0.7
o/v≥0.7
In other words, the overlap must cover at least 70% of the pivot and at least 70% of the other variant.
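The two conditions can be checked with a few lines of Python. This is a minimal sketch that assumes half-open integer coordinates (start inclusive, end exclusive) on the same chromosome. The function name and the example coordinates are hypothetical.

```python
def reciprocal_overlap(p_start: int, p_end: int,
                       v_start: int, v_end: int,
                       threshold: float = 0.7) -> bool:
    """True when o/p >= threshold and o/v >= threshold."""
    p = p_end - p_start                      # pivot length
    v = v_end - v_start                      # other variant length
    if p <= 0 or v <= 0:
        return False                         # zero-length intervals cannot be compared
    # Length of the shared region; 0 when the intervals do not intersect.
    o = max(0, min(p_end, v_end) - max(p_start, v_start))
    return o / p >= threshold and o / v >= threshold


# Pivot 1000-2000 vs. variant 1100-2100: o = 900, o/p = 0.9, o/v = 0.9
shifted = reciprocal_overlap(1000, 2000, 1100, 2100)

# Pivot 1000-2000 vs. variant 1000-3000: o = 1000, o/p = 1.0, o/v = 0.5
much_longer = reciprocal_overlap(1000, 2000, 1000, 3000)
```

The second pair is not merged even though the pivot is fully covered, because only half of the longer variant overlaps the pivot.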
CNV and INS variants: Recalculation of allele counts
After clustering, the allele count (AC) and the genotype counts are recomputed based on the new consensus variant.
Important notes and limitations
Variant quality
No quality filters are applied; all variants are retained in the database regardless of confidence or quality.
Duplication events
CNV callers often report copy number (CN) without specifying zygosity. In Emedgene, the following assumptions are applied:
CN = 3 → Interpreted as heterozygous duplication
CN > 3 → Interpreted as homozygous duplication
When samples contain different CN values for the same region, they are merged into a single DUP entry. The counts for this entry are then calculated based on the inferred zygosity rather than the exact copy number.
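The sketch below illustrates how samples with different CN values could be collapsed into one DUP entry. The CN thresholds follow the assumptions listed above. The rule that a heterozygous sample adds one allele to AC and a homozygous sample adds two is an assumption made for this example, and the function names are hypothetical.

```python
from collections import Counter
from typing import Optional


def dup_zygosity(cn: int) -> Optional[str]:
    """Infer zygosity of a duplication from its copy number."""
    if cn == 3:
        return "HET"
    if cn > 3:
        return "HOM"
    return None                      # CN < 3 is not a duplication


def merged_dup_counts(copy_numbers: list) -> dict:
    """Genotype counts and AC for one merged DUP entry."""
    counts = Counter(z for z in map(dup_zygosity, copy_numbers) if z)
    het, hom = counts["HET"], counts["HOM"]
    # Assumption: HET adds 1 allele to AC, HOM adds 2.
    return {"HET": het, "HOM": hom, "AC": het + 2 * hom}


# Four hypothetical samples with CN 3, 4, 3, and 5 over the same region.
entry = merged_dup_counts([3, 4, 3, 5])
```

Here CN = 4 and CN = 5 are counted identically, which is the loss of exact copy number described above.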
Sex chromosomes
Chromosome X
Allele counts depend on the sample’s recorded sex:
Females: Diploid — contribute one allele to the allele count (AC) if heterozygous, or two alleles if homozygous
Males: Haploid — contribute one allele to AC
Mosaic chrX variants in males: Treated as heterozygous, contributing one allele to AC
Chromosome Y
Haploid; only samples recorded as male contribute to allele counting.
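The sex chromosome rules above can be summarized as a small lookup. The Python sketch below is illustrative only: the function name, the chromosome labels, the sex labels, and the MOSAIC genotype label are assumptions chosen for the example.

```python
def ac_contribution(chrom: str, sex: str, genotype: str) -> int:
    """Alleles added to AC by one sample that carries the variant."""
    if chrom == "chrX":
        if sex == "female":
            # Diploid: one allele if heterozygous, two if homozygous
            return 2 if genotype == "HOM" else 1
        # Males are haploid on chrX; mosaic calls are treated as heterozygous.
        # Either way the sample adds one allele.
        return 1
    if chrom == "chrY":
        # Haploid; only samples recorded as male are counted
        return 1 if sex == "male" else 0
    raise ValueError(f"not a sex chromosome: {chrom}")


contributions = [
    ac_contribution("chrX", "female", "HET"),
    ac_contribution("chrX", "female", "HOM"),
    ac_contribution("chrX", "male", "HEMI"),
    ac_contribution("chrX", "male", "MOSAIC"),
    ac_contribution("chrY", "male", "HEMI"),
    ac_contribution("chrY", "female", "HET"),
]
```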
Pseudo‑autosomal regions (PAR)
No special handling is implemented yet.
Mitochondrial DNA variants
Heteroplasmy levels are not taken into account; mitochondrial DNA variants are treated as homoplasmic and therefore counted as haploid homozygous in the dataset.
No data regions
The issue
The merging algorithm does not account for sequencing coverage and incorrectly interprets positions with no data as homozygous reference calls. As a result, the dataset becomes artificially enriched in reference alleles, leading to an underestimation of allele frequency (AF) for any variant located in regions with absent coverage.
This applies to any region with absent coverage in any dataset (exomes only, genomes only, or mixed). However, it becomes particularly problematic in mixed exome–genome datasets, where exome samples systematically lack data outside the capture regions (see below).
Recommendation
Be cautious when evaluating variant rarity in regions that are not uniformly covered across samples in a dataset.
Special case: Non-coding variant allele frequency in mixed exome–genome datasets
When exome and genome samples are combined in a single dataset, AF values for non‑coding variants become systematically lower than their true population frequency. This occurs because exome samples have no coverage in non‑coding regions, and the merging algorithm incorrectly interprets these missing data points as homozygous reference. As a result, variants located outside the exome capture regions—such as common intronic or intergenic variants—are underrepresented in the aggregated dataset.
The downward AF bias becomes especially pronounced when exome samples greatly outnumber genome samples:
Exomes outnumber genomes by ~20×: Common intronic variants may appear at an artificially low frequency (<0.05), despite being frequent in the population.
Exomes outnumber genomes by ~100× or more: The bias becomes so severe that common non‑coding variants may appear at <0.01 AF, leading to potential misclassification as rare variants.
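The size of this bias can be estimated with simple arithmetic. The sketch below assumes a variant that lies entirely outside the exome capture regions, diploid samples, and genome samples in which the variant occurs at its true population frequency. Under those assumptions the observed AF is the true AF scaled by the fraction of samples that are genomes. The function name and sample counts are hypothetical.

```python
def observed_af(true_af: float, n_genomes: int, n_exomes: int) -> float:
    """AF reported when uncovered exome samples count as homozygous reference."""
    variant_alleles = true_af * 2 * n_genomes    # only genomes can show the variant
    total_alleles = 2 * (n_genomes + n_exomes)   # exomes still add reference alleles
    return variant_alleles / total_alleles


# A common intronic variant with a true AF of 0.5
af_20x = observed_af(0.5, n_genomes=10, n_exomes=200)     # exomes outnumber genomes 20x
af_100x = observed_af(0.5, n_genomes=10, n_exomes=1000)   # exomes outnumber genomes 100x

print(f"20x:  {af_20x:.4f}")
print(f"100x: {af_100x:.4f}")
```

With a 20× excess of exomes, even a variant carried on every allele in the genome samples would be reported at 1/21 (about 0.048), which matches the <0.05 figure above.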