Insights from Genomic Variation

Introduction

Your DNA is your blueprint— or so you’ve been told. While certainly not everything is set in stone from DNA, our genetic makeup can reveal some interesting information about our biology, such as pre-dispositions we may have for developing certain diseases. How can we read the codes of our genetic composition, and what signals should we look for?

The total sum of all gene sequences, including regions that encode proteins along with those that don’t appear to have any function, is known as the genome. The human genome has been compiled from multiple individuals— not one single person— which provides enough evidence to determine the gene sequences that are most likely to be preserved without changes across multiple samples or populations. This assembly is known as a reference genome, a standard genomic scaffolding that is not “normal” or “better”, but one that attempts to represent minimal variation. Therefore, in a given sample, a sequence variation from the reference genome that is observed in >1% of an examined population is called a single nucleotide polymorphism (SNP).

With over 3 billion base pairs in the human genome, there’s a lot of potential SNPs, and since SNPs can exist anywhere within the genome— including in both protein-coding and non-protein coding regions— there’s a lot of variation in the clinical implications of SNPs. Some appear to be significant drivers of disease, while others play much more minor roles or have unknown effects. The Single Nucleotide Polymorphism Database (dbSNP) and ClinVar are two publicly-available curated databases containing clinically-relevant SNP details from published studies and clinical trials, making them invaluable resources for understanding SNPs in sample data.

We’re going to walk through a pipeline that starts with sequencing data from a single individual, annotates key genetic information, and provides clinically-relevant insights.

A Note on R

The tidyverse provides an excellent set of tools for data manipulation in R. This tutorial uses tidyverse commands and syntax. It will also use the package cowplot for the plot themes.

Initial sequencing data

Variant call format (VCF) is a standardized format in bioinformatics for reporting SNPs. It looks like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SRR038564.bwa-mem.sorted.marked.bam
chr1    49272   .   G   A   5.04598 .   DP=1;SGB=-0.379885;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,0,1;MQ=40  GT:PL   1/1:33,3,0
chr1    49291   .   C   T   5.75677 .   DP=1;SGB=-0.379885;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,0,1;MQ=40  GT:PL   1/1:34,3,0
chr1    52238   .   T   G   5.04598 .   DP=1;SGB=-0.379885;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,0,1;MQ=33  GT:PL   1/1:33,3,0
chr1    54579   .   G   A   3.7766  .   DP=1;SGB=-0.379885;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,1,0;MQ=31  GT:PL   1/1:31,3,0

We can see the gene sequences that differ from the reference genome by comparing the REF and ALT columns. In the first row, the REF column (reference genome) contains a G, but the ALT column (our sample data) contains an A. This means that the individual in this dataset has an A at position 49,272 on Chromosome 1, but the reference genome has a G at this location. That’s all we can take from this, though— only this low-level detail about the exact sequence change. The VCF file doesn’t provide any direct insight into genes that may be affected, or phenotypes that may result.

Annotating the sequencing data

To get further information on the gene sequence variants, we have to annotate our dataset. This means that we take our VCF file and match the SNP locations to the annotated human genome assembly, which will tell us the functional regions of the genome (such as specific genes) in which our variants reside. It’s sort of like having a list of locations around the world, and now we need to get the names of the countries for those locations.

To do this, we’ll use the Variant Effect Predictor (VEP) from Ensembl. This is a tool that takes a VCF file as input and produces an annotated output file with specific gene assignments, as well as much more. It can be run from the command-line within a cloud computing environment. Fortunately, eCellula has an AMI with the VEP tool installed. To run, use the following code below. The .vcf file location on the server is given in the first line, and the following flags specify some output format details. For more information, check the documentation.

nohup vep -i ../data/SRR038564.called.vcf \
--cache \
--tab \
--compress_output gzip \
--output_file variant_effect_output.txt.gz \
--check_existing \
--symbol \
--clin_sig_allele 1 \
--af \
--regulatory \
--biotype \
--show_ref_allele \
--uploaded_allele \
--pubmed &

Re-formatting the raw annotated output

Now, we have the raw output file from the VEP tool with the following format (only showing the first 6 columnns):

# A tibble: 6,522,352 × 30
   `#Uploaded_variation` Location   Allele Gene            Feature   Feature_type
   <chr>                 <chr>      <chr>  <chr>           <chr>     <chr>
 1 chr1_49272_G/A        chr1:49272 A      ENSG00000268020 ENST0000… Transcript
 2 chr1_49291_C/T        chr1:49291 T      ENSG00000268020 ENST0000… Transcript
 3 chr1_52238_T/G        chr1:52238 G      ENSG00000268020 ENST0000… Transcript
 4 chr1_54579_G/A        chr1:54579 A      ENSG00000268020 ENST0000… Transcript
 5 chr1_54579_G/A        chr1:54579 A      ENSG00000290826 ENST0000… Transcript
 6 chr1_56573_T/C        chr1:56573 C      ENSG00000268020 ENST0000… Transcript
 7 chr1_56573_T/C        chr1:56573 C      ENSG00000290826 ENST0000… Transcript
 8 chr1_61987_A/G        chr1:61987 G      ENSG00000240361 ENST0000… Transcript
 9 chr1_61987_A/G        chr1:61987 G      ENSG00000186092 ENST0000… Transcript
10 chr1_61987_A/G        chr1:61987 G      ENSG00000290826 ENST0000… Transcript
# ℹ 6,522,342 more rows

We’ll re-format this data to improve its readability. It may be helpful later on to have the information in the Location column split separately to show the chromosome, start position, and end position. But we’ll also want to retain the original Location information in a different format.

vep_output_processed <-
  vep_output %>%
  rename(Uploaded_variation = `#Uploaded_variation`) %>%
  separate_wider_delim(
    Location, ":", names = c("chrom", "loc")
  ) %>%
  separate_wider_delim(
    loc, "-", names = c("start", "end"), too_few = "align_start"
  ) %>%
  mutate(
    start = as.numeric(.$start),
    end = as.numeric(.$end),
    end = case_when(is.na(end) ~ start),
    Location = paste0(chrom, ":", start, "-", end)
  )

Filtering and summarizing data

Now, our data look like:

# A tibble: 6,522,352 × 36
   Uploaded_variation chrom start   end Allele Gene            Feature    Feature_type
   <chr>              <chr> <dbl> <dbl> <chr>  <chr>           <chr>      <chr>
 1 chr1_49272_G/A     chr1  49272 49272 A      ENSG00000268020 ENST00000… Transcript
 2 chr1_49291_C/T     chr1  49291 49291 T      ENSG00000268020 ENST00000… Transcript
 3 chr1_52238_T/G     chr1  52238 52238 G      ENSG00000268020 ENST00000… Transcript
 4 chr1_54579_G/A     chr1  54579 54579 A      ENSG00000268020 ENST00000… Transcript
 5 chr1_54579_G/A     chr1  54579 54579 A      ENSG00000290826 ENST00000… Transcript
 6 chr1_56573_T/C     chr1  56573 56573 C      ENSG00000268020 ENST00000… Transcript
 7 chr1_56573_T/C     chr1  56573 56573 C      ENSG00000290826 ENST00000… Transcript
 8 chr1_61987_A/G     chr1  61987 61987 G      ENSG00000240361 ENST00000… Transcript
 9 chr1_61987_A/G     chr1  61987 61987 G      ENSG00000186092 ENST00000… Transcript
10 chr1_61987_A/G     chr1  61987 61987 G      ENSG00000290826 ENST00000… Transcript
# ℹ 6,522,342 more rows

This re-formatted file can still be processed further. Since the data includes some non-standard chromosome technical artifacts (like 11_KI270721v1_random), we want to filter for the variants on only the standard chromosomes 1-23 (where here the 23rd chromosome is X) plus the mitochondrial chromosome. Also, we want to only examine the protein-coding variants; and more specifically, only the unique variants at each location for each gene. Then, we’ll get a count of all these variants on each chromosome using the group_by and summarize commands.

vep_output_processed_count <-
  vep_output_processed %>%
  filter(
    chrom %in% c(
      "chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7",
      "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14",
      "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21",
      "chr22", "chrX", "chrM"
    ),
    BIOTYPE == "protein_coding"
  ) %>%
  distinct(Location, Gene, .keep_all = TRUE) %>%
  group_by(chrom) %>%
  summarize(count = n())

Joining data

Now, we want to combine these summarized variant counts by chromosome with actual chromosomal base pair summary data, to get an idea of the percentage of entire chromosomes that our data represent. We’ll do this by loading the most recent assembly of the human genome (GRCh38, build 38) from Ensembl and processing the file a little:

# https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.26/
grch38 <-
  read_tsv("~/ecellula/pipeline/grch38.tsv")

grch38_bp <-
  grch38 %>%
  select(`UCSC style name`, `Seq length`) %>%
  rename(
    chrom = `UCSC style name`,
    bp = `Seq length`
  ) %>%
  filter(chrom %in% c(
    "chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7",
    "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14",
    "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21",
    "chr22", "chrX", "chrM"
  ))

Then, we’ll join the summarized variant counts to the processed chromosomal summary data, and calculate two percentages: the number of variants on a chromosome as a percentage of the total variants in our dataset (i.e., 15% of the variants in our dataset are on Chromosome 1), and the number of variants in our dataset on a chromosome as a percentage of the total base pairs on that chromosome (i.e., the variants in our dataset on Chromosome 1 represent 25% of the size of Chromosome 1). These percentages will be helpful to see if any chromosome are over- or under-represented in our datasets, and if any chromosomes seem particularly under- or over-impacted.

vep_output_summarized <-
  left_join(
    vep_output_processed_count, grch38_bp, by = "chrom"
  ) %>%
  mutate(
    variantTotal = (sum(count)),
    pctTotal = (count/variantTotal) * 100,
    pctChrom = (count/bp) * 100,
    chrom = factor(
      chrom,
      levels = c(
        "chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9",
        "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17",
        "chr18", "chr19", "chr20", "chr21", "chr22", "chrX", "chrM"
      )
    )
  )

Visualizing data: Part 1

Before we start making any summary visualizations, let’s just take a look at the raw locations of the identified protein coding variants:

The positions of each variant (x-axis) on every chromosome (y-axis). Each point is a variant.

It’s pretty cool to see our data in this way! It’s a good reminder that each variant we observe is a real base pair on a chromosome. Although a more interaction visualization would be helpful to zoom in on each variant, or filter them in specific ways, we can at least immediately see here there’s a distinct pattern of distribution. Certain spots on chromosomes clearly have more variants than others, with definite gaps apparent. We won’t go into what this means for now, as it could be that those gaps in the genome aren’t functionally significant, but it’s something to keep in mind anytime you are considering the impact of variation.

Now, let’s make a visualization of the number of protein-coding variants per chromosome as a percentage of the total variants in the dataset. We’ll use polar coordinates so the plot is circular, with the largest chromosome (Chromosome 1) positioned at 12 o’clock and the remaining chromosomes following clockwise in descending size order.

plot_pctTotal <-
  vep_output_summarized %>%
  filter(chrom != "chrM") %>%
  ggplot(aes(x = chrom, y = pctTotal, fill = bp)) +
  geom_col() +
  labs(y = "% Variants in Dataset") +
  scale_fill_gradient(
    "Base Pairs",
    low = "gray",
    high = #56B4E9"
  ) +
  coord_polar() +
  theme_cowplot() +
  theme(
    axis.title.x = element_blank()
  )

The number of variants on each chromosome as a percentage of the total variants in the dataset (y-axis). Chromosomes are shown sequentially clockwise. The fill indicates the total number of base pairs on a given chromosome.

From this, we can see a general trend that most of the variants in our dataset are on the larger chromosomes— which makes sense— but this is not uniform. Surprisingly few variants appear on Chromosome 9 given its size, while more variants may be on Chromosomes 17 and 19 than comparatively expected.

Now, let’s look at the composition of protein-coding variants in our dataset on each chromosome, relative to the size (number of base pairs) of the chromosome.

plot_pctChrom <-
  vep_output_summarized %>%
  filter(chrom != "chrM") %>%
  ggplot(aes(x = chrom, y = pctChrom, fill = bp)) +
  geom_col() +
  labs(y = "% Variants on Total Chromosome") +
  scale_fill_gradient(
    "Base Pairs",
    low = "gray",
    high = "#CC79A7"
  ) +
  coord_polar() +
  theme_cowplot() +
  theme(
    axis.title.x = element_blank()
  )

The number of variants on each chromosome as a percentage of the total base pairs on a chromosome (y-axis). Chromosomes are shown sequentially clockwise. The fill indicates the total number of base pairs on a given chromosome.

This plot definitely shows some different patterns than the previous one. Although Chromosome 1 is the largest chromosome, it does not contain a particularly large proportion of variants from our dataset, appearing mostly similar to all the chromosomes through Chromosome 12 (excluding 9). What stands out, though, is the proportion of variants on the smaller chromosomes, in particular 19, 17, and 22. Though more work will need to be done to determine what this significance is, this is a good exercise in seeing how certain visualizations and summaries can influence interpretations of the data.

As a last step for now, let’s make a visualization that begins to address the clinical relevance of our variants. Similar to how we grouped and summarized the counts of protein-coding variants by chromosome, we’ll group by gene (coded by gene symbol) and count the number of variants that are associated with each gene. We’ll plot the top 10 genes that are associated with the greatest numbers of variants in our dataset.

plot_countSymbol <-
  vep_output_processed %>%
  filter(
    chrom != "chrM",
    SYMBOL != "-",
    BIOTYPE == "protein_coding"
  ) %>%
  group_by(SYMBOL) %>%
  summarize(countSymbol = n()) %>%
  arrange(-countSymbol) %>%
  slice(1:10) %>%
  ggplot(
    aes(x = reorder(SYMBOL, countSymbol), y = countSymbol)
  ) +
  geom_col(fill = "#D55E00") +
  labs(
    y = "Number of Variants associated with Gene"
  ) +
  coord_flip() +
  theme_minimal_vgrid() +
  theme(
    axis.title.y = element_blank()
  )

The number of unique protein-coding variants (x-axis) associated with a given gene (y-axis).

A plot like this is a good starting point to explore what clinically-relevant diseases and phenotypes our variants may be associated with. Although gene symbols appear unfamiliar to most people, those with appropriate expertise can easily identify these gene symbols as if they were everyday words. RBFOX1, for example, is known to be associated with autism spectrum disorder, whereas CACNA1C encodes a well-known subunit of a calcium channel.

Further investigating clinical relevance of variants

We’ll go back to the vep_output_processed dataset earlier that we got by processing the raw output from the VEP tool.

ClinVar is a curated variant database that contains a lot of information about individual variants, including clinically-relevant phenotypes and clinical significance scores. We want to match this database to the variants in our dataset. We’ll identify all the entries in the Clinvar database that match the location (chromosome:start-end) of the variants in our dataset.

# wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz
clinvar <-
  read_tsv("~/Downloads/variant_summary.txt.gz")

clinvar_vep <-
  clinvar %>%
  mutate(Location = paste0("chr", Chromosome, ":", Start, "-", Stop)) %>%
  filter(
    Assembly == "GRCh38",
    Location %in% vep_output_processed$Location
  )

ClinVar includes the clinically-relevant phenotypes in a single column separated by the character |. We want to split this out so that each phenotype has its own row. This means that a single variant entry can be duplicated across mulitple rows, differing only in the phenotype entry.

clinvar_phenotypes_count <-
  clinvar_vep %>%
  separate_longer_delim(PhenotypeList, "|") %>%
  group_by(PhenotypeList) %>%
  summarize(count = n())

Visualizing data: Part 2

This lets us group the variants by phenotype and then summarize the counts of each variant per phenotype, allowing us to see which phenotypes from matched variants in the ClinVar database have the greatest number of counts in our dataset. This analysis and visualization is similiar to the gene count plot done earlier.

clinvar_vep_phenotypes <-
  left_join(clinvar_vep, clinvar_phenotypes_count, by = "PhenotypeList") %>%
  filter(!PhenotypeList %in% c("not provided", "not specified", "Inborn genetic diseases")) %>%
  distinct(PhenotypeList, .keep_all = TRUE) %>%
  arrange(-count)

plot_phenotype <-
  clinvar_vep_phenotypes %>%
  slice(1:10) %>%
  ggplot(
    aes(x = reorder(PhenotypeList, count), y = count)
  ) +
  geom_col(fill = "#009E73") +
  scale_x_discrete(label = function(x) str_trunc(x, 40)) +
  labs(
    y = "Number of Variants in ClinVar"
  ) +
  coord_flip() +
  theme_minimal_vgrid() +
  theme(
    axis.title.y = element_blank()
  )

The number of unique protein-coding variants (x-axis) associated with a given phenotype (y-axis).

Conclusion

Lastly, we can put all the plots together! This is a nice summary of everything we’ve learned so far from our starting VCF file, which for our purposes only showed how and where individual variants differed from a reference genome. In annotating the VCF file, we matched the variants to known entries in the dbSNP database, linked the variants to specific genes, and gained insight into potential associated phenotypes. This type of high-throughput, high-performance pipeline can be applied at scale to continually investigate the scope of variation in the human genome.

In our next article, we’ll show how to leverage Nextflow to run these scripts and tools in an integrated workflow within a cloud environment.

plot_all <-
  plot_grid(
    plot_pctTotal, plot_countSymbol,
    plot_pctChrom, plot_phenotype,
    nrow = 2,
    rel_widths = c(1.5, 1.5),
    labels = c("A", "C", "B", "D")
  )

(A) The number of variants on each chromosome as a percentage of the total variants in the dataset (y-axis). Chromosomes are shown sequentially clockwise. The fill indicates the total number of base pairs on a given chromosome. (B) The number of variants on each chromosome as a percentage of the total base pairs on a chromosome (y-axis). Chromosomes are shown sequentially clockwise. The fill indicates the total number of base pairs on a given chromosome. (C) The number of unique protein-coding variants (x-axis) associated with a given gene (y-axis) (D) The number of unique protein-coding variants (x-axis) associated with a given phenotype (y-axis).