October 27, 2016

Open Access Week: Working with Secondary Datasets

This is one of two posts in celebration of Open Access week (on Twitter: #oaweek, #open access, #OpenScience #OpenData). This post will focus on the use of secondary data in scientific discovery.


The analysis of open datasets has become a major part of my research program. There are many sources of secondary data, from web scraping [1] to downloading data from repositories. Likewise, there are many potential uses for secondary data, from meta-analysis to validating simulations [2]. If sufficiently annotated [3], secondary data can be used for conducting new analyses [4], fusion with other relevant data, and data visualization. Access to secondary (and tertiary) data relies on a philosophy of open data amongst researchers, which has been validated by major funding agencies.

The first step in reusing a dataset is to select datasets that are relevant to the question or phenomenon you are interested in. While data reuse is not synonymous with exploratory data analysis, secondary datasets can be used for a variety of purposes, including exploration. It is important to understand what data you need to address your set of issues, why you want to assemble the dataset, and how you want to manage the associated metadata [5]. Examples of data repositories include Dryad Digital Repository, Figshare, and the Gene Expression Omnibus (GEO). It is also important to remember that successful data reuse relies on good data management practices to allow for first-hand data to be applied to new contexts [6].

An example of an archived dataset from the Dryad repository (original analysis published in doi:10.1098/rsos/150333).

Now let's focus on three ways to reuse data. The simplest way to reuse data is to download and reanalyze data from a repository using a technique not used by the first-hand generators of the data. This could be done by using a different statistical model (e.g. Bayesian inference), or by including the data in a meta-analysis (e.g. surveying the effect sizes across multiple studies of similar design). Such research can be useful in terms of looking at the broader scope of a specific set of research questions.
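
As a minimal sketch of the meta-analysis case, the snippet below pools effect sizes from several studies using inverse-variance (fixed-effect) weighting. This is one common pooling scheme and not necessarily the one used in any analysis cited here. The function name and the three example studies are hypothetical.

```python
import math

def pooled_effect(effects, variances):
    """Fixed-effect pooled estimate: weight each study by 1 / variance."""
    if not effects or len(effects) != len(variances):
        raise ValueError("need one variance per effect size")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    return estimate, standard_error

# Hypothetical effect sizes and sampling variances from three studies.
study_effects = [0.30, 0.50, 0.10]
study_variances = [0.04, 0.09, 0.02]

estimate, standard_error = pooled_effect(study_effects, study_variances)
print(f"pooled effect = {estimate:.3f} (SE = {standard_error:.3f})")
```

Studies with smaller variances pull the pooled estimate toward their own effect size. A random-effects model would be the usual next step when the studies differ in design.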

The second way is to download data from a repository for the purpose of combining data from multiple sources. This is what is sometimes referred to as data fusion or data integration, and can be done in a number of ways. One way this has been useful in my research has been for comparative analysis, such as computational analyses of gene expression data across different cell types within a species [7], or developmental processes across species [8]. Another potential use of recombined data is to verify models and validate theoretical assumptions. This is a particular concern for datasets that focus on basic science.
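
One simple form of data fusion is to join records from several sources on a shared identifier. The sketch below is only an illustration: the function name, gene labels, and values are hypothetical, and it keeps only the identifiers present in every source (an inner join).

```python
def fuse_by_key(sources):
    """Turn {source: {key: value}} into {key: {source: value}},
    keeping only the keys that every source contains."""
    if not sources:
        return {}
    shared_keys = set.intersection(*(set(table) for table in sources.values()))
    return {
        key: {name: table[key] for name, table in sources.items()}
        for key in sorted(shared_keys)
    }

# Hypothetical expression values for two cell types from two repositories.
fibroblast = {"geneA": 2.1, "geneB": 0.4, "geneC": 5.0}
myoblast = {"geneB": 1.7, "geneC": 4.2, "geneD": 0.9}

fused = fuse_by_key({"fibroblast": fibroblast, "myoblast": myoblast})
for gene, values in fused.items():
    print(gene, values)
```

Dropping unmatched identifiers is a design choice. Keeping them with explicit missing values would be the alternative when coverage across sources is itself of interest.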

In fact, the recent technological and cultural changes associated with increased data sharing are enabling broader research questions to be asked. Instead of asking strictly mechanistic questions (what is the effect of x on y), combined datasets enable larger-scale (e.g. systems-level) comparisons (what are the combinatorial effects of all x and all y) across experiments. Doing this in a reductionist manner might take many orders of magnitude more time than assembling and analyzing a well-combined dataset. This allows us to verify the replicability of single experiments, in addition to applying statistical learning techniques [9] in order to find previously undiscovered relationships between datasets and experimental conditions.

The third way is to annotate and reuse data generated by your own research group [10]. This type of data reuse allows us to engage in data validation, test new hypotheses as we learn more about the problem, and compare different ways of attacking the same problem. The practice of data reuse within your own research group can encourage research continuity that transcends turnover in personnel, encouraging people to make data and methods open and explicit. Internal data reuse also creates educational opportunities, such as providing students with hands-on opportunities to analyze and integrate well-characterized data. Be aware that reusing familiar data still requires extensive annotation of both the data and previous attempts at analysis, and that there is as yet no culturally-coherent set of standard practices for sharing data [11].


There are a few caveats with respect to successfully using data. As is the case with experimental design, the data should be sufficient to answer the type of questions you would like to answer. This includes going back to the original paper and associated metadata to understand how the data was collected and what it was originally intended to measure. While this does not directly limit what you can do with the data, it is important to understand in terms of combining datasets. There is a need for ways of assessing internal validity for secondary datasets, whether they be single data sources or combinations of data sources.

To learn more about these techniques, please try to earn the Literature Mining badge series hosted by the OpenWorm badge system. You can earn Literature Mining I (working with papers), or both Literature Mining I and II (working with secondary data). Here you will learn about how to use secondary data sources to address scientific questions, as well as the interrelationship between the scientific literature and secondary data sources.


NOTES (Try accessing the paper dois through http://oadoi.org):

[1] Marres, N. and Weltevrede, E. (2013). Scraping the Social? Issues in live social research. Journal of Cultural Economy, 6(3), 313-335. doi:10.1080/17530350.2013.772070

[2] Sargent, R.G. (2013). Verification and validation of simulation models. Journal of Simulation, 7(1), 12–24. doi:10.1057/jos.2012.20.

[3] For an example of how this has been a consideration in the ENCODE project, please see:
Hong, E.L., Sloan, C.A., Chan, E.T., Davidson, J.M., Malladi, V.S., Strattan, J.S., Hitz, B.C., Gabdank, I., Narayanan, A.K., Ho, M., Lee, B.T., Rowe, L.D., Dreszer, T.R., Roe, G.R., Podduturi, N.R., Tanaka, F., Hilton, J.A., and Cherry, J.M. (2016). Principles of metadata organization at the ENCODE data coordination center. Database, pii:bav001. doi: 10.1093/database/bav001.

[4] Church, R.M. (2001). The Effective Use of Secondary Data. Learning and Motivation, 33, 32–45. doi:10.1006/lmot.2001.1098.

[5] One example of this includes: Kyoda, K., Tohsato, Y., Ho, K.H.L., and Onami, S. (2014). Biological Dynamics Markup Language (BDML): an open format for representing quantitative biological dynamics data. Bioinformatics, 31(7), 1044-1052. doi:10.1093/bioinformatics/btu767

[6] Fecher, B., Friesike, S., and Hebing, M. (2015). What drives academic data sharing? PLoS One, 10(2), e0118053. doi:10.1371/journal.pone.0118053.

[7] Alicea, Bradly (2016): Dataset for "Collective properties of cellular identity: a computational approach". Figshare, doi:10.6084/m9.figshare.4082400

[8] Here is an example of a comparative analysis based on data from two secondary datasets: Alicea, B. and Gordon, R. (2016). C. elegans Embryonic Differentiation Tree (10 division events). Figshare, doi:10.6084/m9.figshare.2118049 AND Alicea, B. and Gordon, R. (2016). C. intestinalis Embryonic Differentiation Tree (1- to 112-cell stage). Figshare, doi:10.6084/m9.figshare.2117152

[9] Dietterich, T.G. Machine Learning for Sequential Data: A Review. Structural, Syntactic, and Statistical Pattern Recognition, LNCS, 2396. doi:10.1007/3-540-70659-3_2

[10] Federer, L.M., Lu, Y-L., Joubert, D.J., Welsh, J., and Brandys, B. (2015). Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff. PLoS One, 10(6), e0129506. doi:10.1371/journal.pone.0129506.

[11] Pampel, H. and Dallmeier-Tiessen, S. (2014). Open Research Data: From Vision to Practice. In "Opening Science", S. Bartling and S. Friesike eds., Pgs. 213-224. Springer Open, Berlin. Dynamic version.

October 24, 2016

Open Access Week: How Am I Doing, Altmetrics?

This is one of two posts in celebration of Open Access week (on Twitter: #oaweek, #open access, #OpenScience). To kick things off, we will go through an informal evaluation of Altmetrics and other indicators of research paper usership.

In this post, I will discuss some quick investigations I did using the Altmetric metric system (known visually as the number within the multicolored donut). Altmetrics go beyond academic metrics based solely on academic journal prestige or number of formal citations in academic papers (e.g. h-index). I will also discuss how these metrics might be used to better understand the full impact of one's work.


The Altmetric donut and its diversity of input sources. The Altmetric score is based on how many interactions your content received from each source medium.

The first exercise I did was to acquire Altmetric donuts for journal articles and preprints for which I did not have such data. This includes venues such as arXiv, Stem Cells and Development, and Principles of Cloning II, which do not feature Altmetric donuts on their pages. Interestingly, the bioRxiv preprint server does, in addition to tracking .pdf download and abstract view counts.



Example of an Altmetric donut in context (top) and readership stats (bottom) from a recent Biology paper for which I am an author. 

Retrieving a donut and data summary from the Altmetric database is easy. You embed a few lines of code (see inset below) into an HTML document, and the donut and score appear where desired. While the donut is most useful for augmenting a publication list, in this case I simply created a test document for collating data from across many papers.

// Formal journal article citation
Alicea, B., Murthy, S., Keaton, S.A., Cobbett, P., Cibelli, J.B., and Suhr, S.T.  Defining phenotypic respecification diversity using multiple cell lines and reprogramming regimens. Stem Cells and Development, 22(19), 2641-2654 (2013).
// Code for donut and database call; possible data subclasses include:
// data-arxiv-id
// data-handle
// data-doi
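
As a rough sketch of what such an embed looks like, the Python helper below assembles the HTML for one badge. The identifier attributes are the ones listed in the inset. The script URL, the `altmetric-embed` class, and the `data-badge-type` attribute are assumptions based on Altmetric's public badge embed and should be checked against the current documentation. The function name and the identifiers passed in are placeholders.

```python
from html import escape

# Identifier attributes named in the inset above.
ID_ATTRIBUTES = {
    "doi": "data-doi",
    "arxiv": "data-arxiv-id",
    "handle": "data-handle",
}

# Assumed location of the badge loader script; verify before use.
EMBED_SCRIPT = "https://d1bxh8uas1mnw7.cloudfront.net/assets/embed.js"

def donut_snippet(id_type, identifier):
    """Return HTML that asks the embed script to draw a donut for one paper."""
    attribute = ID_ATTRIBUTES[id_type]  # KeyError for unsupported types
    safe_id = escape(identifier, quote=True)
    return (
        f'<script type="text/javascript" src="{EMBED_SCRIPT}"></script>\n'
        f'<div class="altmetric-embed" data-badge-type="donut" '
        f'{attribute}="{safe_id}"></div>'
    )

# Placeholder identifier, not a real paper.
print(donut_snippet("doi", "10.1234/example"))
```

Pasting the printed snippet into a test HTML page, once per paper, is enough to collate donuts for a whole publication list.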

In context, the donut can provide useful information about how a given paper is diffusing through the academic internet. In the case of the Stem Cells and Development paper (see code), the paper has an Altmetric score of 9. While the Journal website does not have Altmetric or download data, it does provide a doi identifier and select forward citations.


Examples of the Altmetric database entry (top) and the Journal website (bottom) for the Stem Cells and Development paper.

Similar data exist for a follow-up paper to the Stem Cells and Development paper -- in this case, a preprint involving a specialized quantitative analysis (based on Signal Detection Theory) of the same data. For this paper, we have an arXiv identifier, which provides us with a donut and statistics on the relative popularity of the paper based on age and other similar documents in the Altmetric database.

A typical arXiv article page, in this case for an arXiv preprint related to the Stem Cells and Development paper.

This arXiv preprint comes with code for the analysis, which is posted to Github.

For this particular paper, there is an associated Github repository. Even for preprint repositories with Altmetric and readership data (such as bioRxiv), the integration of Github materials is rather poor, particularly in generating an Altmetric. This is an area in which there is an opportunity for Github: user statistics linked back to the original paper would be appreciated.

Altmetrics for the same arXiv preprint. We can access data on the sources of the Altmetric score, as well as the attention score in the context of all other tracked documents in the Altmetrics database.

We can also integrate readership data across sources to come up with a picture of how our academic work is being shared, consumed, and diffused. In this example, I will show how data from a blog analytics engine and Altmetric data can be combined. Research blogs are an up-and-coming area of research in Altmetric statistics capture. I have taken two blogrolls (Carnival of Evolution #46 and Carnival of Evolution #70), for which citable versions were posted to Figshare immediately after going live. My blogging platform (Blogger) has readership stats but no Altmetrics, while Figshare has Altmetrics and readership stats for the Figshare version only.



Altmetric data for two blogrolls cross-posted to Figshare, which provides both a doi identifier and an Altmetric donut. There is also view and download information for the Figshare version, which may or may not be inclusive of people viewing such content on the blog site.

Let's look at the Figshare data first. Carnival #46 has an Altmetrics score of 10 with 188 views and 58 downloads. By contrast, Carnival #70 has an Altmetrics score of 6 with 331 views and 82 downloads. Clearly, there is some variation in direct engagement between the two datasets that is not proportional to the score.


Readership statistics for Carnival of Evolution #46 (top) and Carnival of Evolution #70 (bottom). Blog analytics only provides the number of "reads" on the home site since publication.

There is also little relationship between the number of Blogger reads and the Altmetric score (as the Altmetric score does not directly capture this number). Carnival #46 has 7928 reads over roughly 4 years and 7 months. Carnival #70 has 1602 reads over roughly 2 years and 7 months.
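
To put the two blogrolls on a common footing, the counts quoted above can be normalized. The sketch below uses only the numbers given in this post. The month counts treat the "roughly" stated durations as exact, which is an assumption, and the field names are my own.

```python
# Counts quoted in the post for the two blogrolls.
blogrolls = {
    "Carnival #46": {"score": 10, "views": 188, "downloads": 58,
                     "reads": 7928, "months": 4 * 12 + 7},
    "Carnival #70": {"score": 6, "views": 331, "downloads": 82,
                     "reads": 1602, "months": 2 * 12 + 7},
}

def normalized(stats):
    """Express raw counts as rates so the two posts can be compared."""
    return {
        "views_per_score_point": stats["views"] / stats["score"],
        "downloads_per_view": stats["downloads"] / stats["views"],
        "reads_per_month": stats["reads"] / stats["months"],
    }

rates = {name: normalized(stats) for name, stats in blogrolls.items()}
for name, values in rates.items():
    print(name, {key: round(value, 2) for key, value in values.items()})
```

With these numbers, Carnival #46 comes out at about 19 Figshare views per score point and about 144 Blogger reads per month, against about 55 and about 52 for Carnival #70.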

Even in cases where no Altmetric donut can be generated (such as for book chapters), there are still ways to evaluate an article's reach. In the case of Academia.edu, a new feature has been added that allows people to leave a comment when they interact with a document. This is a more qualitative assessment of engagement, but also provides authors an idea of whether or not "reads" or "views" translate into more than just a passing glance.

Two consumers of a book chapter took time to express their gratitude. Other reasons can be quite interesting as well, particularly when they have to do with educational purposes.

Hope you have enjoyed this exercise. It is not meant to be an exhaustive discussion of the Altmetric evaluation system, nor is it the limit of what can be done with Altmetrics and other tools for tracking your work. While there is clearly more technical work to be done on this front, tools such as Altmetric APIs are available. The biggest challenge is building a social economy based on a variety of research outputs. The field is moving quite rapidly, so what I have shown here is likely to be just the beginning.

October 20, 2016

OpenWorm Blog: Announcing the OpenWorm Open House 2016

The content is being cross-posted from the OpenWorm blog, and will be updated periodically.


Hello Everybody!

We want to announce our first Open House for 2016 that will happen on October 25th from 10:30am to 4pm EDT (UTC-4) (check here for your timezone), so mark the date on your calendars! The event will be live streamed at this link.


If you were waiting for an opportunity to look at the recent progress we’ve made across all the projects, this is your chance. During the meetings many contributors will present a number of flash talks and various demos, so if you are interested in hearing the latest about PyOpenWorm, c302, Sibernetic, Geppetto, Analysis toolbox or anything else happening under our roof, don’t miss this opportunity!

Click below for the schedule of events.

Streamed Online:

10:30 AM - 11:00 AM: Welcome (Stephen Larson)

Flash talks
          11:00 - 11:05: Recent progress in OpenWorm (Stephen Larson)
          11:10 - 11:15: C. elegans nervous system simulation (Padraig Gleeson)
          11:20 - 11:25: C. elegans body simulation (Andrey Palyanov)
          11:30 - 11:35: OpenWorm Badge System (Chee-Wai Lee)
          11:40 - 11:45: DevoWorm Overview (Bradly Alicea)
          11:50 - 11:55: Neuroinformatics (Rick Gerkin)
          12:00 - 12:05: Geppetto (Matteo Cantarelli)
          12:10 - 12:15: Movement Validation (Michael Currie)
          12:20 - 12:25: WormSim (Giovanni Idili)

On social media channels
          12:30 - 1:30: Social media interactions & break out signup

Streamed online (links to be added)
          1:30 PM - 3:00: Multiple track breakout sessions
                    Morphozoic Tutorial - Tom Portegys

On social media channels
          3:00 - 3:30: Wrap up & Social Media Networking

Oh and bring along your nerdy friends, the more the merrier!

Hope to see you there!

The OpenWorm team

