Building a resource sharing site with BioThings SDK and the CD2H Data Discovery Engine (part 4)

Sep 28 2020 by Ginger Tsueng

Although there has been a proliferation of biological datasets made available in recent years, often the information describing them isn’t machine readable, making it hard for services like Google Dataset Search to find and index them. In this series of blog posts, we’ll outline how we are working to make datasets that our collaborators generate and open data more findable, accessible, interoperable, and reusable, as well as tools that we’ve developed to make it easier to share data. In this post we discuss the development of the BioThings SDK and how it can be used to quickly spin up a new API.

Building the API with the BioThings SDK

As mentioned previously, the BioThings Software Development Kit (SDK) already existed, which made it easy for the Outbreak.info team to quickly spin up an Application Programming Interface (API) for the schema-compliant metadata of various COVID-19-related resources. What is this SDK and where did it come from?

The BioThings SDK evolved from the Wu lab’s habit of abstracting successful tools into building blocks for new tools, and it really got its start with BioGPS.

BioGPS was a plug-n-play portal for gene-centric information. It was developed by the Su lab to:

1. Visualize and disseminate microarray data.

2. Extend interpretability of said data by allowing users to create plugins of gene-centric resources of use to them.

You see, the Su lab understood that a clinical geneticist would need information from different resources than (for example) a pharmacogenetics researcher. Yes, the two would have some overlapping resources, but the clinical geneticist might want more resources like ClinGen, while the pharmacogenetics researcher might want more resources like DrugBank. Both types of researchers might be interested in model organisms as well.

BioGPS allows researchers to create and save gene-centric reports specialized for their own needs by enabling researchers to easily create plugins from resources they already use. For more information about BioGPS, see the publication.

While BioGPS was very useful for researchers with no programming experience, it was limited in its ability to handle bulk information. For this reason, the gene metadata search tool behind BioGPS was abstracted into MyGene.info--a gene annotation RESTful API (more about the process here).
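
To give a feel for what a gene annotation RESTful API looks like from the user's side, here is a minimal sketch of composing a query against MyGene.info. It assumes the public v3 `query` endpoint and its `q`, `fields`, and `size` parameters; the helper names `build_query_url` and `fetch_hits` are hypothetical and exist only for this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public MyGene.info v3 base URL.
BASE_URL = "https://mygene.info/v3"


def build_query_url(term, fields=("symbol", "name"), size=5):
    """Compose a GET URL for the /query endpoint (hypothetical helper)."""
    params = {"q": term, "fields": ",".join(fields), "size": size}
    return f"{BASE_URL}/query?{urlencode(params)}"


def fetch_hits(term, **kwargs):
    """Send the query and return the list under 'hits' (needs network access)."""
    with urlopen(build_query_url(term, **kwargs)) as response:
        return json.load(response).get("hits", [])


if __name__ == "__main__":
    # Only builds and prints the URL; call fetch_hits() to actually query.
    print(build_query_url("symbol:cdk2"))
```

Because the request is just a URL, the same query can be run from a script, a notebook, or a browser, which is what makes this kind of API better suited to bulk work than a point-and-click portal.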

MyGene.info proved to be quite popular, and requests were made to add gene variant annotation information. While gene annotations are related to gene variant annotations, the two types are still different, necessitating the creation of a separate API: MyVariant.info.

Since the process of building a RESTful API had already been established with MyGene.info, Dr. Kevin Xin (a graduate student at that time) and Adam Mark (another graduate student) repeated the process to build MyVariant.info--only they had to account for A LOT more metadata. They and other members of the Su and Wu labs improved the process and abstracted key portions of it to create the BioThings SDK, which allows other people to readily create similar RESTful APIs.

The SDK comes with some presets that make it even easier to cater to your particular needs.

First is the BioThings Standalone. The BioThings Standalone is a set of Docker containers that each include a fully pre-configured, ready-to-use BioThings API that can easily be maintained and kept up-to-date. Because these are preconfigured containers of existing APIs (like MyGene.info, MyVariant.info, etc.), you are in essence using them to build a standalone instance of the site, to which you can add private data that will be stored on your own server. This is the ‘go-to’ kit if you wish to have your own instance of MyGene.info, or of another BioThings API, which you can manipulate as you wish. If you have information (as we did with Outbreak.info) that doesn’t fit into the Gene or Variant types, you will need something with more flexibility, like the BioThings Studio.

The BioThings Studio is a pre-configured environment used to build BioThings APIs. With the Studio, you are not limited to pre-existing APIs, and have the flexibility to build new APIs from files containing your data. Think of the BioThings Studio as a preconfigured way to use the BioThings Hub--the heart and soul of the BioThings SDK.

The BioThings Hub is where all the action happens: it uses MongoDB as the “staging” storage backend for JSON objects before they are sent to Elasticsearch for indexing. The Hub has generalized functions for downloading, updating, processing, and merging data, but it must be provided with customized Python scripts that allow other resources to be ‘plugged in’ to the BioThings ‘hub’ in order to create a new API. For more details about the use of Elasticsearch, with MyGene.info as an example, see our blog post for the Elasticsearch blog.
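
The customized Python scripts mentioned above are essentially parsers. As a rough sketch (not the Hub's actual code), a parser can be thought of as a function that reads a downloaded file and yields one JSON-like document per record, each with a unique `_id` so that records can later be staged and merged. The file name `records.jsonl`, the field names, and the helpers `load_data` and `stage` are all hypothetical, and a plain dictionary stands in for the MongoDB staging collection.

```python
import json
import os


def load_data(data_folder):
    """Yield one normalized document per line of a JSON-lines file.

    Hypothetical parser: 'records.jsonl' and its field names are made up.
    """
    path = os.path.join(data_folder, "records.jsonl")
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            raw = json.loads(line)
            yield {
                "_id": str(raw["id"]),  # unique key for each document
                "name": raw.get("title", "").strip(),
                "source": raw.get("source", "unknown"),
            }


def stage(documents, staging=None):
    """Merge documents into a dict keyed by _id (stand-in for a staging store)."""
    staging = {} if staging is None else staging
    for doc in documents:
        # For a repeated _id, later field values overwrite earlier ones.
        staging.setdefault(doc["_id"], {}).update(doc)
    return staging
```

Writing the parser as a generator means records are handed over one at a time, so a large source file never has to be held in memory all at once.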

With this tool in place long before the Outbreak.info team even started on creating a schema, the creation of the Outbreak.info APIs happened very quickly. Parsers had already been built to collect and normalize metadata for COVID-19 publications from LitCovid, bioRxiv, and medRxiv, and simply needed to be plugged into the BioThings SDK to be incorporated, indexed, and made searchable. Incorporating a new parser or resource is easy thanks to the way the BioThings SDK works and to the BioThings team. Even someone with limited programming experience (like myself) was able to create a plugin for a BioThings-based API like the Outbreak.info resources API. We’ll detail how to plug in a resource parser in the next post.