by Vivek

Open {Data, Science, Source} -

First Chapter, Best Summer Ever!

April, 2016 

“What’s the worst that can happen? Not like it will be different this time. Get some sleep”, I told myself as I braced for the bad news. The clock moves closer to midnight with every passing moment, but I am too tired to think about it anymore.

About half an hour into my sleep, I am startled by a loud phone call. I check the time. It’s 12.30 am. The results must have been announced at midnight. Anxious to get it over with, I recall my own words from earlier and prepare to face the worst. This year was my third attempt at Google Summer of Code.

But, this year, I was selected.

August, 2016 


It is the final week of Google Summer of Code. Participating students are required to present their work and submit a final wrap-up post about their project as the last task before the final evaluation.

This blog post is my submission.

Over the summer, I have been working on my proposed project ‘Linking Phenotypes to Genotypes’ for OpenSNP under the umbrella of the Open Bioinformatics Foundation (OBF). Besides being a suitably technical introduction to my project, this post is also a personal account of my experiences.

So, by the end of this post, I hope to provide everyone with a clear idea of what I have been working on and why it matters, describe my progress on the project, and share the marvelous experience of working with the OpenSNP team.


Currently, the OpenSNP portal has a decent collection of genotypes (3000+ users), phenotypes, SNPs, and associated literature, all available in an easily searchable way. There is, however, a lot of scope for improving the impact and reach of the project by modernizing the interface, studying the aggregated data, and providing intuitive analysis so that users can reach a quicker yet more comprehensive understanding. For example, the OpenSNP portal currently displays the allele frequency of a genotype, a valuable insight into how rare your mutation is.
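To make the allele frequency idea concrete, here is a minimal sketch in plain Ruby. It is not the portal's actual code: the helper name `allele_frequencies` and the sample genotypes are made up for illustration, and it assumes genotypes are stored as simple two-letter strings such as "AG".

```ruby
# Hypothetical helper (not from the OpenSNP codebase): given the genotype
# calls of several users at one SNP, return the share of each allele.
def allele_frequencies(genotypes)
  counts = Hash.new(0)
  genotypes.each do |genotype|
    genotype.upcase.each_char do |allele|
      # Skip anything that is not a nucleotide, e.g. "-" for a no-call.
      counts[allele] += 1 if %w[A C G T].include?(allele)
    end
  end

  total = counts.values.sum
  return {} if total.zero?

  counts.transform_values { |n| n.to_f / total }
end

# Made-up sample: five users genotyped at the same SNP.
sample = %w[AA AG GG AG AA]
freqs = allele_frequencies(sample)
puts freqs.inspect
```

In this sample the A allele appears 6 times out of 10 and G appears 4 times, so a user carrying G has the less common allele.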

This proposal seeks to work on the second problem of bridging different components to allow for better interactivity and information conveyance.

In line with the proposition set up in the introduction, my project aims to link SNPs with phenotypes in the portal.

What do I mean by linking?

Single Nucleotide Polymorphisms (SNPs, pronounced "snips") are, simply put, nucleotide changes in DNA that occur in a certain percentage of a population. Mutations in DNA arise all the time due to the intrinsic stochasticity of biological processes. However, most of these changes are neutral and do not benefit or harm the organism in any way, while others are corrected by the cell itself. Rarely, a few variations occur in critical regions of the genome (a gene that codes for an essential protein, for example) and manage to escape correction. This variation (including the phenotype or disease it may be responsible for!) could then propagate through the population via sexual recombination.

SNPs can affect how humans develop diseases and respond to pathogens, chemicals, drugs, vaccines, and other agents. A set of SNPs for a single individual can also act as a unique signature useful in identification and forensics. Naturally, figuring out the possible downstream effects of an SNP is useful. And that’s exactly what we mean by linking SNPs to phenotypes. See SNP Genotyping for methods to detect SNPs.

However, a word of caution here. Figuring out the causal effects of a particular SNP is, in reality, a significantly difficult problem and a major topic of study in the field of computational biology and genetics. Genome-wide association studies (GWAS)

Our current work instead builds on previously published findings available to us as references and mines the SNP-to-phenotype associations reported in their results to perform a meta-analysis.


OpenSNP has a single purpose: to act as a free, open-data repository by collecting personal genomics data into the public domain. The web portal is thus the critical component of the infrastructure. From my understanding, I can imagine two immediate benefits of this project.

Although the second point is not the obvious goal of this project, I believe it could potentially serve as a useful data source for OpenSNP users in the future.


The main components of the project were: (commits on GitHub)

In the first step, I familiarized myself with Rails and the ActiveRecord pattern before moving on to database associations. A useful strategy for learning fast was to set up dummy examples. For example, using the Rails console, one can examine arbitrary objects, query the database, or modify objects. It became super easy to understand concepts like foreign keys, indexing, and migrations, and to debug errors. Through this, I learned how pieces of code come together to create a coherent architecture, like the MVC pattern. An exciting start!

The next step was to write code to score phenotype matches across the references corresponding to each SNP and rank the possible matches. The first strategy was to try a simple frequency lookup. While this sounds like an easy and attractive option, it is a naive solution, and there are some serious limitations to it. For instance:

All in all, you end up with too few phenotypes detected for a single SNP even though it has tons of references. We discussed this and thought of several ideas: use additional data aggregation tools, use a standard list of phenotypes, or, even better, create a phenotype network and leverage phenotype associations.

The SNP info dashboard with the SNP-to-phenotype linkage prototype, showing the current columns of genotype and allele frequency along with the newly implemented (for demo purposes!) list of suggestive phenotypes.

However, for the sake of simplicity and as a first milestone, we proceeded with the initial naive approach, hoping to improve it in future iterations. The final logic was packed into a Sidekiq worker script. So far so good.
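A rough sketch of what such a naive frequency lookup could look like in plain Ruby follows. This is an illustration under my own assumptions, not the worker that went into the repository: the method name `rank_phenotypes`, the reference titles, and the phenotype list are all invented, and the real code reads its data from the database inside a Sidekiq worker rather than from arrays.

```ruby
# Hypothetical sketch of a naive frequency lookup (not the actual worker):
# count how many reference texts mention each known phenotype name,
# then rank phenotypes by that count.
def rank_phenotypes(reference_texts, phenotype_names)
  scores = Hash.new(0)

  reference_texts.each do |text|
    normalized = text.downcase
    phenotype_names.each do |name|
      # Plain substring match: one point per reference that mentions the name.
      scores[name] += 1 if normalized.include?(name.downcase)
    end
  end

  # Highest count first; ties are broken alphabetically for a stable order.
  scores.sort_by { |name, count| [-count, name] }
end

# Made-up reference titles for a single SNP.
references = [
  "Variants associated with eye color in Europeans",
  "Eye color and hair color prediction from SNPs",
  "A study of lactose intolerance"
]
phenotypes = ["Eye color", "Hair color", "Height"]

ranking = rank_phenotypes(references, phenotypes)
puts ranking.inspect
```

The third title never matches because no listed phenotype appears in it word for word, which is the kind of miss that leaves an SNP with too few detected phenotypes despite many references.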

The next critical task was to write extensive unit tests and document all of the implemented changes. This is what I have mainly been doing in the last few weeks of the project. In the remaining time, I hope to continue testing and improving the worker script. I obtained an actual data dump from the production database that I can use to check real-world performance. This matters especially because it’s a per-SNP operation and there are thousands of SNPs, so it will be crucial to see how much additional overhead it adds to the production server.

Besides, during this process, I came across several aspects of the code that could be improved. For example, the Phenotype table did not have a unique key constraint on the characteristic column. So, you could have two (or more!) rows describing hair color or height.

Overall, I would not say that I delivered entirely on each deliverable, but we definitely moved forward. Several new challenges came up, and we hope to continue working on their resolution after and beyond the scope of GSoC.
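Before a unique constraint can be added, the existing duplicates have to be found and merged. Here is a small sketch of the detection step in plain Ruby. The helper name `duplicate_characteristics` and the sample rows are hypothetical, and treating names that differ only in case or surrounding whitespace as duplicates is my own assumption.

```ruby
# Hypothetical helper: group characteristic names that differ only by
# letter case or surrounding whitespace, and keep groups with 2+ entries.
def duplicate_characteristics(characteristics)
  characteristics
    .group_by { |c| c.strip.downcase }
    .select { |_key, group| group.size > 1 }
end

# Made-up values standing in for the characteristic column.
rows = ["Hair color", "Height", "hair color ", "Eye color", "height"]

dupes = duplicate_characteristics(rows)
dupes.each do |key, group|
  puts "#{key}: #{group.size} rows -> #{group.inspect}"
end
```

Once the duplicate rows are merged, a unique index on the column would stop new ones from being created.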


I cannot thank my mentors enough for their continuous effort throughout the project. Philipp and Bastian are among the most enthusiastic and voracious readers I have met so far. Thanks to an excellent stroke of luck, I got to meet Bastian at BOSC 2016 in Orlando, Florida. More about it soon!

A lovely card with a personal message and free, cute OpenSNP stickers as a surprise from Bastian.

Last but not least, a big thanks to Google for the amazing program! Also, thanks to Mateus and Graham, the other Summer of Code students for OpenSNP, for being excellent company throughout.