January 10, 2017

How to Kick-Start a Crypto-Currency

Here is an infographic (see below) I received from interested reader Steve Rogen, which follows up on a critique of Bitcoin I published back in 2014. He pointed me to a blogpost by Dinar Durham (a financial tech startup) explaining the concept of an initial coin offering (ICO).

An ICO is a way for a new crypto-currency to distribute its coinage across a larger number of users than the more standard Bitcoin approach allows, eliminating severe favoritism towards early adopters. The infographic itself demonstrates the process of public offering for a new coin.

According to Dinar Durham's blogpost, ICOs have a mixed track record: some succeed, while others do not. However, they are becoming more popular as the number of altcoin types increases.

December 30, 2016

New Paper at The Winnower

I have a new paper up at The Winnower, just in time for holiday (New Year's Day) reading. The Winnower is a publishing platform that allows people to post manuscripts and other writing, share them with the general public, and receive feedback. It uses a post-publication peer review system, which allows authors to gather reviews and revise the original submission (winnowing) before a formal DOI is assigned (publication). This is my second experience with this type of publication system.

The paper is titled "On Braitenberg's Vehicles, Compound Polygons, and Evolutionary Developmental Structural Complexity", and applies network theory to analyzing the geometry and spatial composition of biological phenotypes. The paper is currently open for review (which you can submit at the site). I invite you to read and evaluate!

December 1, 2016

Searching for Food and Better Data Science at the Same Time

Two presentations to announce, both of which are happening live on 12/2. The first is the latest OpenWorm Journal Club, happening via YouTube live stream. The title is "The Search For Food", and it is a survey of a recently-published paper on food search behaviors in C. elegans [1].

While the live-stream will be available in near-term perpetuity [2] on YouTube, the talk will begin at 12:45 EST [3]. The abstract is here:
Random search is a behavioral strategy used by organisms from bacteria to humans to locate food that is randomly distributed and undetectable at a distance. We investigated this behavior in the nematode Caenorhabditis elegans, an organism with a small, well-described nervous system. Here we formulate a mathematical model of random search abstracted from the C. elegans connectome and fit to a large-scale kinematic analysis of C. elegans behavior at submicron resolution. The model predicts behavioral effects of neuronal ablations and genetic perturbations, as well as unexpected aspects of wild type behavior. The predictive success of the model indicates that random search in C. elegans can be understood in terms of a neuronal flip-flop circuit involving reciprocal inhibition between two populations of stochastic neurons. Our findings establish a unified theoretical framework for understanding C. elegans locomotion and a testable neuronal model of random search that can be applied to other organisms.
The other presentation is one that I will give at the Champaign-Urbana Data Science Users' Group. This will be a bit more informal (20 minutes long), and part of the monthly meeting. The meeting will be live (12 noon CST) at the Enterprise Works building in the University Research Park. The archived slides are located here. The title is "Open Data Science and Theory", and the abstract is here:
Over the past few years, I have been working to develop a way to use secondary data and Open Science practices and standards for the purpose of establishing new systems-level discoveries as well as confirming theoretical propositions. While much of this work has been done in the field of comparative biology, many of the things I will be highlighting will apply to other disciplines. Of particular interest is how the merger of data science and Open Science principles will facilitate interdisciplinary science.

[1] Subtitle: To boldly go where no worm has gone before. Yup, Star Trek pun. Full reference: Roberts, W. et al. (2016). A stochastic neuronal model predicts random search behaviors at multiple spatial scales in C. elegans. eLife, 5, e12572.

[2] For as long as YouTube exists.

[3] Click here for UTC conversion.

November 21, 2016

Be as Brief as Possible but no Briefer

Nature Highlights article on the Journal of Brief Ideas, which itself is brief.

No, this is not an Einstein quote. But Einstein very well may have submitted to the Journal of Brief Ideas [1], an open access version of Occam's razor. I just submitted a brief paper called "Playing Games with Ideas: when epistemology pays off", which is the equivalent of a fully-indexed abstract [2]. While some people might find 200 words to be too brief, the Journal allows for attachments to be submitted, thus allowing authors to get around the word limit to some degree [3].

According to the Journal FAQ, submitting such brief reports is part of establishing something below the current standard for the minimal publishable unit. It is also important for enforcing good scientific citizenship practices [4]. Very short papers have occasionally been published in regular journals. Mathematics papers by Lander and Parkin [5] and Conway and Soifer [6] accomplished mathematical proofs in less than a paragraph (but with multiple figures). Other than these rather mythical examples, it is quite the challenge to integrate a well-formulated idea into the Journal of Brief Ideas' 200 word limit.

[1] Woolston, C. (2015). Journal publishes 200-word papers. Nature, 518, 277.

[2] Indexing done via document object identification on Zenodo, doi:10.5281/zenodo.167647

[3] If a picture is worth 1000 words, then the Journal of Brief Ideas becomes less brief than its name implies.

[4] Neisseria (2015). All you need to publish in this journal is an idea. Science Made Easy blog, February 13.

[5] Lander, L.J. and Parkin, T.R. (1966). Counterexample to Euler's Conjecture on sums of like powers. Bulletin of the American Mathematical Society, 72(6), 1079.

[6] Conway, J.H. and Soifer, A. (2004). Can n² + 1 unit equilateral triangles cover an equilateral triangle of side > n, say n + ɛ? American Mathematical Monthly, 1.

October 27, 2016

Open Access Week: Working with Secondary Datasets

This is one of two posts in celebration of Open Access week (on Twitter: #oaweek, #openaccess, #OpenScience, #OpenData). This post will focus on the use of secondary data in scientific discovery.

The analysis of open datasets has become a major part of my research program. There are many sources of secondary data, from web scraping [1] to downloading data from repositories. Likewise, there are many potential uses for secondary data, from meta-analysis to validating simulations [2]. If sufficiently annotated [3], we can use secondary data for purposes of conducting new analyses [4], fusion with other relevant data, and data visualization. Access to secondary (and tertiary) data relies on a philosophy of open data amongst researchers, which has been validated by major funding agencies.

The first step in reusing a dataset is to select datasets that are relevant to the question or phenomenon you are interested in. While data reuse is not synonymous with exploratory data analysis, secondary datasets can be used for a variety of purposes, including exploration. It is important to understand what data you need to address your set of issues, why you want to assemble the dataset, and how you want to manage the associated metadata [5]. Examples of data repositories include Dryad Digital Repository, Figshare, and the Gene Expression Omnibus (GEO). It is also important to remember that successful data reuse relies on good data management practices to allow for first-hand data to be applied to new contexts [6].

An example of an archived dataset from the Dryad repository (original analysis published in doi:10.1098/rsos/150333).

Now let's focus on three ways to reuse data. The simplest way to reuse data is to download and reanalyze data from a repository using a technique not used by the first-hand generators of the data. This could be done by using a different statistical model (e.g. Bayesian inference), or by including the data in a meta-analysis (e.g. surveying effect sizes across multiple studies of similar design). Such research can be useful in terms of looking at the broader scope of a specific set of research questions.
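To make the meta-analysis case a bit more concrete, here is a minimal sketch of one common way to pool effect sizes across studies: a fixed-effect, inverse-variance weighted average. This is an illustration rather than the method used in any of the work cited here. The function name and the effect sizes and variances are hypothetical, and a real reanalysis would also need to consider heterogeneity between studies (e.g. a random-effects model).

```python
import math


def pooled_effect(effects, variances):
    """Fixed-effect (inverse-variance) pooled estimate and its standard error."""
    if not effects or len(effects) != len(variances):
        raise ValueError("need one positive variance per effect size")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    # Each study is weighted by the inverse of its sampling variance,
    # so more precise studies contribute more to the pooled estimate.
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total_weight
    std_error = math.sqrt(1.0 / total_weight)
    return estimate, std_error


# Hypothetical effect sizes and sampling variances from three studies
# of similar design (illustrative numbers, not real data).
effects = [0.2, 0.4, 0.6]
variances = [0.04, 0.04, 0.04]
estimate, std_error = pooled_effect(effects, variances)
print(f"pooled effect = {estimate:.3f}, standard error = {std_error:.3f}")
```

Because the three hypothetical studies have equal variances, the pooled estimate here is simply their mean; with unequal variances, the more precise studies would pull the estimate towards their own values.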

The second way is to download data from a repository for the purpose of combining data from multiple sources. This is what is sometimes referred to as data fusion or data integration, and can be done in a number of ways. One way this has been useful in my research has been for comparative analysis, such as computational analyses of gene expression data across different cell types within a species [7], or developmental processes across species [8]. Another potential use of recombined data is to verify models and validate theoretical assumptions. This is a particular concern for datasets that focus on basic science.
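As a rough illustration of the simplest kind of data integration, the sketch below joins records from two sources on a shared identifier. Everything in it is hypothetical: the function name, the "gene" key, the cell type columns, and the values are stand-ins, and this is not the pipeline behind the datasets cited in [7] and [8]. Real data fusion usually also involves reconciling identifiers, units, and normalization across sources.

```python
def fuse_on_key(primary, secondary, key):
    """Inner-join two lists of dict records on a shared key column."""
    # Index the secondary records by key so each lookup is cheap.
    index = {}
    for row in secondary:
        index.setdefault(row[key], []).append(row)

    fused = []
    for row in primary:
        for match in index.get(row[key], []):
            merged = dict(match)
            merged.update(row)  # primary values win if column names clash
            fused.append(merged)
    return fused


# Hypothetical expression values from two separately archived datasets.
expression_a = [
    {"gene": "g1", "fibroblast": 2.1},
    {"gene": "g2", "fibroblast": 0.4},
]
expression_b = [
    {"gene": "g1", "neuron": 1.7},
    {"gene": "g3", "neuron": 3.0},
]

fused = fuse_on_key(expression_a, expression_b, "gene")
print(fused)
```

An inner join keeps only the records present in both sources, which is a conservative choice: records found in just one dataset (g2 and g3 here) are dropped rather than filled in with missing values.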

In fact, the recent technological and cultural changes associated with increased data sharing are enabling broader research questions to be asked. Instead of asking strictly mechanistic questions (what is the effect of x on y), combined datasets enable larger-scale (e.g. systems-level) comparisons (what are the combinatorial effects of all x and all y) across experiments. Doing this in a reductionist manner might take many orders of magnitude more time than assembling and analyzing a well-combined dataset. This allows us to verify the replicability of single experiments, in addition to applying statistical learning techniques [9] in order to find previously undiscovered relationships between datasets and experimental conditions.

The third way is to annotate and reuse data generated by your own research group [10]. This type of data reuse allows us to engage in data validation, test new hypotheses as we learn more about the problem, and compare different ways to attack the same problem. The practice of data reuse within your own research group can encourage research continuity that transcends turnover in personnel, encouraging people to make data and methods open and explicit. Internal data reuse also offers educational opportunities, such as providing students with hands-on opportunities to analyze and integrate well-characterized data. Be aware that reusing familiar data still requires extensive annotation of both the data and previous attempts at analysis, and that there is as yet no culturally-coherent set of standard practices for sharing data [11].

There are a few caveats with respect to successfully using data. As is the case with experimental design, the data should be sufficient to answer the type of questions you would like to answer. This includes going back to the original paper and associated metadata to understand how the data was collected and what it was originally intended to measure. While this does not directly limit what you can do with the data, it is important to understand in terms of combining datasets. There is a need for ways of assessing internal validity for secondary datasets, whether they be single data sources or combinations of data sources.

To learn more about these techniques, please try to earn the Literature Mining badge series hosted by the OpenWorm badge system. You can earn Literature Mining I (working with papers), or both Literature Mining I and II (working with secondary data). Here you will learn about how to use secondary data sources to address scientific questions, as well as the interrelationship between the scientific literature and secondary data sources.

NOTES (Try accessing the paper dois through http://oadoi.org):

[1] Marres, N. and Weltevrede, E. (2012). Scraping the Social? Issues in live social research. Journal of Cultural Economy, 6(3), 313-315. doi:10.1080/17530350.2013.772070

[2] Sargent, R.G. (2013). Verification and validation of simulation models. Journal of Simulation, 7(1), 12–24. doi:10.1057/jos.2012.20.

[3] For an example of how this has been a consideration in the ENCODE project, please see:
Hong, E.L., Sloan, C.A., Chan, E.T., Davidson, J.M., Malladi, V.S., Strattan, J.S., Hitz, B.C., Gabdank, I., Narayanan, A.K., Ho, M., Lee, B.T., Rowe, L.D., Dreszer, T.R., Roe, G.R., Podduturi, N.R., Tanaka, F., Hilton, J.A., and Cherry, J.M. (2016). Principles of metadata organization at the ENCODE data coordination center. Database, pii:bav001. doi: 10.1093/database/bav001.

[4] Church, R.M. (2001). The Effective Use of Secondary Data. Learning and Motivation, 33, 32–45. doi:10.1006/lmot.2001.1098.

[5] One example of this includes: Kyoda, K., Tohsato, Y., Ho, K.H.L., and Onami, S. (2014). Biological Dynamics Markup Language (BDML): an open format for representing quantitative biological dynamics data. Bioinformatics, 31(7), 1044-1052. doi:10.1093/bioinformatics/btu767

[6] Fecher, B., Friesike, S., and Hebing, M. (2015). What drives academic data sharing? PLoS One, 10(2), e0118053. doi:10.1371/journal.pone.0118053.

[7] Alicea, Bradly (2016): Dataset for "Collective properties of cellular identity: a computational approach". Figshare, doi:10.6084/m9.figshare.4082400

[8] Here is an example of a comparative analysis based on data from two secondary datasets: Alicea, B. and Gordon, R. (2016). C. elegans Embryonic Differentiation Tree (10 division events). Figshare, doi:10.6084/m9.figshare.2118049 AND Alicea, B. and Gordon, R. (2016). C. intestinalis Embryonic Differentiation Tree (1- to 112-cell stage). Figshare, doi:10.6084/m9.figshare.2117152

[9] Dietterich, T.G. (2002). Machine Learning for Sequential Data: A Review. Structural, Syntactic, and Statistical Pattern Recognition, LNCS, 2396. doi:10.1007/3-540-70659-3_2

[10] Federer, L.M., Lu, Y-L., Joubert, D.J., Welsh, J., and Brandys, B. (2015). Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff. PLoS One, 10(6), e0129506. doi:10.1371/journal.pone.0129506.

[11] Pampel, H. and Dallmeier-Tiessen, S. (2014). Open Research Data: From Vision to Practice. In "Opening Science", S. Bartling and S. Friesike eds., Pgs. 213-224. Springer Open, Berlin. Dynamic version.