CKB research has been greatly enhanced by large-scale genotyping of study individuals. Together with other exposure and outcome data, genetic data enable a wide range of investigations, including discovery of genetic determinants of disease risk, the role of lifestyle factors and quantitative traits (e.g. blood pressure, adiposity), Mendelian randomisation assessment of the causal contribution of risk factors and behaviours to disease, and phenome-wide analyses of potential drug targets.
SNP genotyping data
Using the multiplex Illumina GoldenGate® platform, approximately 100,000 DNA samples were genotyped for panels of 384 single nucleotide polymorphisms (SNPs) during 2012-2013. These SNPs were selected to support a range of projects, including investigation of genetic variants affecting the function of genes encoding potential drug targets (JACC 2016, IJE 2016, JAMA Cardiol. 2018). These early SNP data also provided an important check of sample linkage and DNA quality. For example, participants’ reported gender and genetically determined sex were mismatched for just 0.1% of samples, and only 2.5% of samples failed quality control (similar to other studies, e.g. UK Biobank genotyping). Together, these data provided high confidence in subsequent use of the extracted DNA for genome-wide genotyping.
Genome-wide genotyping data
There is substantial genetic diversity both between populations of different ancestries and across China. When CKB genome-wide genotyping commenced in 2015, the available genotyping arrays did not fully capture such variation. Hence, CKB designed a custom Affymetrix (now ThermoFisher) Axiom® genotyping array with improved genome-wide coverage of common and low-frequency variation in Chinese populations. The array also included a series of probes for detection and classification of circulating hepatitis B virus (HBV), which is prevalent in the Chinese population. The final array design assayed a total of 803,030 genetic variants, including more than 80,000 variants with predicted functional effects on specific genes.
Using the CKB custom genotyping array, we genotyped just over 100,000 CKB samples: a population-representative subset of 77,176 participants and an additional 23,542 selected for studies of specific diseases (e.g. stroke, COPD). Genotyping data quality was high: for example, there was 99.9% concordance between pairs of duplicates. We then conducted imputation to statistically infer genotypes using the 1000 Genomes Phase 3 reference panel, which yielded genotypes for more than 21 million variants. These data have supported many studies within the CKB group and as part of collaborations and consortia. This work has included Mendelian randomisation studies of the contribution of blood lipids to different types of stroke (Nat Med 2019, Ann Neurol 2020), assessment of polygenic risk scores (PRS) for risk of fracture (Genome Med 2021) or breast cancer (Gen in Med 2021), and genome-wide association studies (GWAS) of lung function and respiratory disease (Eur Respir J 2021). A recent further round of imputation using two larger imputation reference panels (TOPMed and Westlake BioBank for Chinese) has provided genotypes for more than 50 million variants.
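As an illustration of the duplicate-concordance check mentioned above, the following is a minimal sketch rather than the CKB pipeline. It assumes genotypes are coded as 0, 1 or 2 copies of an allele, with -1 marking a missing call; the function name, the coding and the toy data are all hypothetical.

```python
import numpy as np

def genotype_concordance(calls_a, calls_b, missing=-1):
    """Fraction of variants with identical calls in two runs of one sample.

    Variants with a missing call in either run are excluded from the
    comparison (an assumption of this sketch).
    """
    a = np.asarray(calls_a)
    b = np.asarray(calls_b)
    called = (a != missing) & (b != missing)
    if not called.any():
        return float("nan")
    return float((a[called] == b[called]).mean())

# Toy data: eight variants typed twice for the same DNA sample.
sample = [0, 1, 2, 1, -1, 0, 2, 2]
duplicate = [0, 1, 2, 1, 1, 0, 2, 1]
concordance = genotype_concordance(sample, duplicate)
print(f"{concordance:.3f}")  # 6 of the 7 comparable calls agree
```

In practice this rate would be averaged over all duplicate pairs and all variants on the array.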
Population structure
Appropriate analyses of genetic data rely on proper understanding of the genetic relationships between individuals in the study. We have identified substantial relatedness among CKB participants, with 24% (28% in rural and 18% in urban areas) having at least one parent, child, or sibling in the study and 32% (39% rural, 23% urban) having one or more second-degree relatives (i.e. grandparent, grandchild, uncle/aunt, niece/nephew). We also found evidence of past consanguinity, e.g. marriages between second cousins.
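One common way to derive figures of this kind is to estimate a kinship coefficient for each pair of participants and bin it by degree of relatedness. The sketch below uses the inference thresholds published for the KING software; the text does not state which method CKB used, so treat the thresholds, function names and toy data as assumptions made for illustration.

```python
def relationship_degree(kinship):
    """Classify a pairwise kinship coefficient using KING-style cut-offs.

    Expected kinship is 0.25 for parent-child and full-sibling pairs,
    0.125 for second-degree pairs and 0.0625 for third-degree pairs.
    """
    if kinship > 0.354:
        return "duplicate or monozygotic twin"
    if kinship > 0.177:
        return "first-degree"
    if kinship > 0.0884:
        return "second-degree"
    if kinship > 0.0442:
        return "third-degree"
    return "unrelated or more distant"

def share_with_relative(pairs, n_participants, degree):
    """Proportion of participants with at least one relative of a given degree."""
    ids = set()
    for id_a, id_b, kinship in pairs:
        if relationship_degree(kinship) == degree:
            ids.update((id_a, id_b))
    return len(ids) / n_participants

# Toy data: (participant, participant, estimated kinship) in a study of 10 people.
pairs = [("P1", "P2", 0.25), ("P3", "P4", 0.12), ("P1", "P5", 0.02)]
print(share_with_relative(pairs, 10, "first-degree"))
print(share_with_relative(pairs, 10, "second-degree"))
```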
These aspects of the data have been used in analysis of the impact of inbreeding on reproductive success (Nat Commun 2019), and for within-family GWAS to understand how shared environment can influence the results of genetic association studies (bioRxiv 2021).
Analysis of genome-wide variation between individuals has also identified substantial genetic differences between individuals from different regions of China. Principal Component Analysis (PCA) groups study participants into discrete clusters largely reflecting the regions from which individuals were recruited, in a pattern strongly correlated with longitude and latitude. PCA within particular study regions revealed further patterns of genetic variation corresponding to participants’ specific recruitment clinics. This was, in general, much more pronounced in rural regions, reflecting established communities with little population movement, and was less strongly observed in urban regions in which there had often been recent inward migration. These findings have enhanced our overall understanding of the cohort, and have informed many of our genetic analyses.
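To show how PCA can separate individuals by region of origin, here is a minimal sketch on simulated data. It assumes a samples-by-variants matrix of 0/1/2 genotype counts and standardises each variant by its allele frequency before a singular value decomposition, which is a common convention and not necessarily the procedure used in CKB. The two simulated "regions", their allele frequencies and all names are hypothetical.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """Principal component scores from a samples x variants genotype matrix."""
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0          # sample allele frequency per variant
    keep = (freq > 0) & (freq < 1)       # drop monomorphic variants
    g, freq = g[:, keep], freq[keep]
    z = (g - 2 * freq) / np.sqrt(2 * freq * (1 - freq))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Simulate two regions whose allele frequencies have drifted apart.
rng = np.random.default_rng(0)
n_per_region, n_variants = 50, 1000
freq_a = rng.uniform(0.1, 0.9, n_variants)
freq_b = np.clip(freq_a + rng.normal(0, 0.15, n_variants), 0.05, 0.95)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_region, n_variants)),
    rng.binomial(2, freq_b, size=(n_per_region, n_variants)),
])
labels = np.repeat([0, 1], n_per_region)

pcs = genotype_pca(genotypes)
print(pcs[labels == 0, 0].mean(), pcs[labels == 1, 0].mean())
```

In this simulation the first component is expected to place the two regions on opposite sides of zero; real data with many regions and recent migration would give a less clear-cut picture.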
Further resources
We continue to expand the available genetic resources. Data on DNA methylation (a chemical change in DNA involved in turning genes on and off) are available for approximately 1,000 individuals, and have provided evidence for involvement of methylation at the ANKS1A and SNX30 genes in cardiovascular risk (eLife 2021). In 2021, whole genome sequencing of 10,000 individuals was completed at BGI, Shenzhen and, after initial variant calling and quality control, 122 million genetic variants have been identified in at least one sample. Given the rapidly falling cost of sequencing, we are seeking funding through private and public partnerships to expand this pilot to sequencing of the entire CKB cohort.
Impact of research
Combined with the many available phenotypes and disease endpoints in CKB, these genetic resources are enabling a broad range of projects led by CKB and external researchers. Alongside other large biobanks in diverse populations, CKB will help to correct the strong Eurocentric bias of the genetic literature. With further development of the genetic resources, in combination with the growing range of diverse molecular assays, CKB will continue to make significant contributions to genetic discovery and to improved disease prediction, prevention, and treatment.