Building a resource sharing site with BioThings SDK and the CD2H Data Discovery Engine (part 2)

by Ginger Tsueng

Although  there has been a proliferation of biological datasets made available in  recent years, often this information isn’t machine readable, making it  hard for things like Google Dataset Search to find and index them. In this series of blog posts, we’ll outline how  we are working to make datasets that our collaborators generate and  open data more findable, accessible, interoperable, and reusable, as  well as tools that we’ve developed to make it easier to share data.  In this post we discuss schemas and the rationale behind the schema playground in the Data Discovery Engine.

What Are Schemas?

We previously gave a general introduction on building a resource sharing site with the tools developed by the Su  and Wu labs, but to understand the need for the tools, it’s important to  understand schemas. In psychology, a schema is a pre-existing cognitive  framework or pattern of thought with which you categorize and interpret  new information. In theory, this allows us to process new information  more quickly, but predisposes us to errors due to stereotypes and bias.   In the data world, you can think of schemas as a means for allowing  search engines to consume, process, and interpret information very  quickly so that more meaningful results are returned.

If you’ve ever googled a movie or a recipe, you might notice that you sometimes get specialized infoboxed results.

For example if you search for a ‘coconut butter fudge’, the google  search engine may return a special type of result (the recipe) which is  presented differently and may include special information like prep  time, ingredients, ratings, images, and the website that the recipe  comes from.

As seen in this example, these recipes come from different websites,  yet they are presented in a manner which allows the user to easily  compare them by the specialized information.  If the user was pressed  for time, maybe the 5 minute recipe would be the one of greatest  interest.  If the user wanted to attempt a recipe that others have rated  highly, then maybe the one with 5 stars from 3 reviewers would be  preferred.

The reason the search engine can return consistent information from  different websites so that the user can better find what they’re looking  for is because the metadata for these recipes are structured in a  schema-compatible manner.

Simply put, a consortium from the big search engine companies (Google, Yahoo, Bing)  came up with a standard way of structuring metadata so that the search  engines could provide better results.  This standard can be found at https://schema.org.

Because there are many different types of information, there are a  lot of different schemas (or ways information can be structured) which  depend on the type of information.

For example, a recipe would be one type (or schema since schemas are  type-dependent) of information; a scholarly article would be another  type.  For each type (or schema), there will be a different set of  properties.  For example, a recipe might have a ‘ingredients’ property,  but a scholarly article would not.  In contrast, a scholarly article  might have a starting page, while a recipe would not.

But wait, couldn’t a recipe have a starting page if it were published in a recipe book?

Yes, absolutely!  This is one reason why schemas are inherently very  flexible!  In fact, they’re so flexible, you could find ways to nest,  couple, and structure metadata in ways that may ultimately confuse a  search engine or render your metadata less meaningful.  You  could also create new properties which would be meaningful only to you,  but would be un-interpretable otherwise.

This is why other schema-creating consortia exist–and if you’re  interested in schemas in the biological science space, you should have a  look at bioschemas.org.

In our next post we’ll discuss the schema playground of the Data Discovery Engine that was created to make it easier to adapt existing  schemas to specific use cases.