Building a resource sharing site with BioThings SDK and the CD2H Data Discovery Engine (part 6)

by Ginger Tsueng

Although there has been a proliferation of biological datasets made available in recent years, this information often isn’t machine readable, making it hard for tools like Google Dataset Search to find and index them. In this series of blog posts, we’ll outline how we are working to make open data, including the datasets that our collaborators generate, more findable, accessible, interoperable, and reusable, as well as the tools that we’ve developed to make it easier to share data. In this final post of the series, we'll discuss how all the tools were used to create a resource for making information on COVID-19 more F.A.I.R.

Putting all the tools together to create something new

Now that we are familiar with all the parts required to build a resource sharing site with BioThings SDK and the CD2H Data Discovery Engine, it’s time to put them together and see how it all works. In this case, we’ll continue to use outbreak.info as an example since COVID-19 is on everybody’s mind right now.

What is outbreak.info? It is a site for disseminating data on the COVID-19 pandemic and the many resources that have sprung up surrounding it. It consolidates epidemiological data from numerous resources, provides clean and customizable visualizations of that data, and serves as a clean front-end interface for the two APIs serving up information for the site. (More on this site next week.)

The first API serves up epidemiological data, while the second API serves up resource data. Here, we’ll focus on the development of the resource API.

As previously mentioned, the NIAID Systems Biology Data Dissemination Working Group had already developed a dataset schema (the NIAID Dataset Schema), which we were able to leverage for outbreak.info. Because we knew that we would be pulling datasets from many disparate data catalogs and repositories, we had to relax many of the requirements in the NIAID Dataset Schema when adapting it for outbreak.info. Since linking between different resources is also an important feature of outbreak.info, we also included the different types of citation relationships in the Outbreak Dataset Schema. This includes directional citations (`citedBy`, `isBasedOn`) and non-directional citations (`relatedTo`). You can further compare and contrast the NIAID Dataset Schema and the Outbreak Dataset Schema from the schema playground in the Data Discovery Engine. In addition to the Outbreak Dataset Schema, we also created schemas for Analyses, Clinical Trials, Publications, and Protocols, along with any subschemas (all the other schemas listed here besides those five) needed for each of those resources. Note that subschemas aren’t officially called such, but I use the term to make it easier to differentiate.
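To make the citation relationships more concrete, here is a minimal sketch of what a record carrying them might look like, and how the links could be read back out. Only `citedBy`, `isBasedOn`, and `relatedTo` come from the schema described above; the set of required fields, the record values, and the helper names are assumptions made up for illustration and are not taken from the actual Outbreak Dataset Schema.

```python
# Assumed, deliberately short list of required fields (illustrative only;
# the real schema defines its own requirements).
REQUIRED_FIELDS = {"@type", "name", "identifier"}

# Citation relationships named in the post.
CITATION_FIELDS = {"citedBy", "isBasedOn", "relatedTo"}


def missing_required(record):
    """Return the assumed required fields that are absent from a record."""
    return sorted(REQUIRED_FIELDS - record.keys())


def citation_links(record):
    """Return (relationship, target identifier) pairs found in a record."""
    links = []
    for field in sorted(CITATION_FIELDS):
        for target in record.get(field, []):
            links.append((field, target["identifier"]))
    return links


# Hypothetical record; every value here is invented for the example.
example = {
    "@type": "Dataset",
    "name": "Example case counts",
    "identifier": "dataset-1",
    "citedBy": [{"@type": "Publication", "identifier": "pub-1"}],
    "isBasedOn": [{"@type": "Dataset", "identifier": "dataset-0"}],
}

print(missing_required(example))
print(citation_links(example))
```

Keeping the required list short is what lets records from very different catalogs pass the same check, while the citation fields stay optional and are simply skipped when a source does not provide them.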

Once we had all our schemas, we had everything needed to write parsers and generate the mapping file. Some preliminary parsers were written for LitCovid, bioRxiv, and Clinical Trials since the metadata for those resources were readily available via API. Note that creating the schemas was not a one-and-done task: the schemas were iteratively improved as the availability of upstream data became more apparent with each parser completed, and as the downstream uses and needs were clarified. Once the first parser was written, we were able to use it to quickly spin up the resource API with the BioThings SDK. After that, resources could be added to the API by writing plugins, such as the Dataverse plugin written by Julia or the Imperial College plugin that I wrote.
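For readers who have not written a parser before, the sketch below shows the general shape of one: a generator that reads upstream records and yields one schema-shaped document per record, each with a unique `_id`. This follows the common BioThings data plugin convention of a `load_data(data_folder)` function, but the input file name (`records.jsonl`), the upstream field names (`id`, `title`), and the `_id` prefix are all hypothetical. It is not the code of any of the parsers mentioned above.

```python
import json
import os


def to_resource_doc(raw):
    """Map one upstream record (hypothetical layout) to a schema-shaped document."""
    return {
        "_id": f"example{raw['id']}",   # unique document id; prefix is made up
        "@type": "Publication",
        "name": raw.get("title", ""),
        "identifier": str(raw["id"]),
    }


def load_data(data_folder):
    """Yield one document per non-empty line of a JSON Lines dump.

    The file name 'records.jsonl' is an assumption for this sketch.
    """
    path = os.path.join(data_folder, "records.jsonl")
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield to_resource_doc(json.loads(line))
```

Writing the parser as a generator means documents are produced one at a time, so a large upstream dump never has to be held in memory all at once.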

Additionally, the Data Discovery Engine could semi-automatically generate a “guide” that would allow credentialed users to manually create metadata which would fit our schema.

After that, it was a matter of configuration, design, and visualization: Which fields should be searchable via the API? What sorts of queries should be accepted, and how should they be formatted? How should this massive treasure trove of information made available via API calls be presented to users so that they can search for, interact with, and filter the information of most interest to them? Unfortunately, there is no SDK for generating good design or visualization. Fortunately, our team has members like Marco Cano and Laura Hughes who have expertise on the matter. Curious as to what you can do with outbreak.info and/or the design decisions that went into it? Add the RSS feed to your news reader, as future posts will cover uses, design decisions, and more.
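As a rough illustration of the kind of query formatting decisions involved, the sketch below builds a search URL with a query string, a list of fields to return, and a result limit. The parameter names `q`, `fields`, and `size` follow those commonly seen in BioThings-based APIs, but treat them as assumptions here. The base URL is a placeholder, not the real endpoint, and no request is actually sent.

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute the real API address when experimenting.
BASE_URL = "https://api.example.org/resources/query"


def build_query_url(term, fields=None, size=10):
    """Build a query URL from a search term, optional field list, and result limit."""
    params = {"q": term, "size": size}
    if fields:
        # Fields are sent as one comma-separated value.
        params["fields"] = ",".join(fields)
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url("remdesivir", fields=["name", "@type"], size=5)
print(url)
```

Limiting the returned fields keeps responses small, which matters when a front end needs to list many resources at once and only shows a name and type for each.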