
January 17, 2020

Work With Me in 2020


The Google Summer of Code (GSoC) is once again accepting applications from students to work on a range of programming-oriented projects over the Summer of 2020. Orthogonal Research and Education Laboratory and the OpenWorm Foundation have contributed a number of projects to the list. Here are links to the project descriptions (login to INCF Neurostars required):


DevoWorm Group:

Project #15: OpenDevoCell Integration

Orthogonal Laboratory:


I am the contact person for both the Orthogonal Laboratory and DevoWorm Group projects. If you have any questions about the application process, or would like me to review your application before submission, please feel free to reach out. According to this year's schedule, the proposal deadline is March 31, and the community bonding period starts on April 27. Stay tuned!

Join us once again on the "Road to GSoC"!

October 31, 2018

October: Geppetto month

Here is a recap of Geppetto Month at the OpenWorm Foundation. This content has been cross-posted from the OpenWorm Foundation blog (h/t Giovanni Idili).

 

OpenWorm is made up of many sub-projects, and “Project of the Month” is an effort to highlight a different OpenWorm sub-project every month. This month is Geppetto’s turn!


What is Geppetto?
Geppetto is a web-based visualisation and simulation platform for building neuroscience applications. The first-ever use case of Geppetto was OpenWorm itself (some lore: the virtual worm being Pinocchio, a Geppetto was needed to “make it”), but since then many groups have adopted it as their platform of choice. It is basically a set of reusable components for simulation, visualisation, and data aggregation that make it easier to develop a neuroscience application, be it a data portal or an entry point to external simulation engines.

Projects that currently make use of Geppetto as a platform:
OpenWorm uses Geppetto as an integration platform for the outputs of its various subprojects, from connectome browsing to replaying integrated electrophysiology and fluid dynamics simulations.


Open Source Brain uses Geppetto to share, visualize and simulate neuronal models, both for individual neurons and networks.

Virtual Fly Brain is an ontology and 3D/2D morphology browser for Drosophila resources built using Geppetto.

NetPyNE-ui is a user-friendly UI for creating and running neuronal models using the NetPyNE library.

Open Development
Geppetto development is entirely open source, like everything else that happens under the OpenWorm umbrella. There are open sprint meetings every two weeks that anybody can join, and we keep a public development board showing development activities and progress. You can browse the issues and see if there is anything you might want to try your hand at!

Resources
Here are some links if you want to learn more about Geppetto:

          Open access paper (Philosophical Transactions of the Royal Society B, 2018)

          Geppetto docs

          Geppetto live demo

          Development board

          Geppetto source code (Github)

          Geppetto Blog

          Geppetto on Twitter

Get involved!
Getting involved is easy: simply fill out the OpenWorm volunteer application form and we will invite you to the OpenWorm Foundation Slack. From there you can interact with the community and join the #geppetto channel if you are interested in learning more about Geppetto or getting involved as a contributor.

April 13, 2018

Video Game Complexity: a short overview

What inspired this post was an interesting question from Twitter user @ID_AA_Carmack (CTO at Oculus VR), along with a long-suffering manuscript that I have been trying to get into publication with some collaborators [1]. With reference to the former:


In the accompanying thread, someone mentioned that the classic game Tempest used a paddle controller, and therefore was an example of analogue control (and outside the scope of John's taxonomy). But in general, joystick and button-press control provide a measure of complexity for a wide variety of games.

Let's begin by highlighting 2-bit games. Two examples (in terms of control) are Pong and Breakout. In both games, the player uses a single controller to move a virtual paddle back and forth across the screen [2].


A screenshot of Pong (left) and Breakout (right) and the degrees of freedom per player. Each degree of freedom (movement along the right-left axis) represents 2 bits of information.

Now let's discuss the mention of Joust as an example of 3-bit control and Pac-Man as an example of 4-bit control. For uninitiated readers, here are screenshots from the games (click to enlarge).


And here is a picture of the Joust arcade console and its 3-bit control mechanism.



The Joust joystick provides 2 bits of control (left on/off and right on/off), and the button provides 1 bit of control (press or no press). This is sufficient complexity to control the action, but not enough complexity to account for all observed action in the game.

Pac-Man utilizes a maze rather than linear action, but has only one additional degree of control complexity. Here are some screenshots from the game (click to enlarge).



In terms of control complexity, Pac-Man is a 4-bit game (2 degrees of freedom), while the Atari 2600 controller can provide up to 5 bits of information (2 degrees of freedom and a button). These are coarse assessments of information content, and do not account for every possible degree of freedom in the game environment.

     

Examples of 4- and 5-bit control via joystick for a version of Pac-Man (left) and the Atari 2600 (right). 
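
To make the bookkeeping explicit, here is a minimal sketch (in Python) of the convention used in this post: each degree of freedom (an on/off pair such as left/right) contributes 2 bits, and each button contributes 1 bit. The function and game table are illustrative only, not a standard measure.

```python
# Control complexity per the convention in this post: each degree of
# freedom (an on/off pair such as left/right) counts as 2 bits, and
# each button counts as 1 bit. Illustrative, not a standard API.

def control_bits(degrees_of_freedom: int, buttons: int) -> int:
    """Bits of control: 2 bits per degree of freedom, 1 bit per button."""
    return 2 * degrees_of_freedom + buttons

games = {
    "Pong / Breakout": (1, 0),   # one axis, no button -> 2 bits
    "Joust":           (1, 1),   # one axis plus flap button -> 3 bits
    "Pac-Man":         (2, 0),   # two axes -> 4 bits
    "Atari 2600":      (2, 1),   # two axes plus button -> 5 bits
}

for name, (dof, btn) in games.items():
    print(f"{name}: {control_bits(dof, btn)} bits")
```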

There are two additional sources of complexity we should be concerned with: the combinatorial complexity of objects within the game, and perceptual complexity of human information processing during gameplay. The former can be characterized by computational complexity classes, while the latter can be characterized in conjunction with control complexity in the form of information bandwidth (measured in bits or bits per second).


A taxonomy of computational classes for Pac-Man and other games is described in [3-5]. While Pac-Man is considered an NP-hard game, not all video games are. Furthermore, games in the same computational class do not necessarily have the same types of objectives or in-game properties. The games Tetris [6] and Minesweeper [7] have been found to be NP-complete, likely owing to the combinatorial properties of each game.

According to Aloupis et al. [8,9], Super Mario Brothers can be classified as NP-hard. Other Nintendo games, such as Donkey Kong Country, are PSPACE-complete. The Nintendo examples are interesting from an approximation perspective, as several Nintendo games were reclassified by this research group between 2012 and 2016 [8,10].

These criteria are based on things like how the avatar moves within the game's activities. A game like Pac-Man involves so-called "location traversal" [4], which in computational terms is similar to the travelling salesman problem (NP-hard) [3]. By contrast, a game like Breakout involves a balancing maneuver [see 2] that is solvable in LSPACE.

Video game screenshots and computational complexity for several example games. Clockwise from top: Super Mario Brothers (NP), Donkey Kong Country (PSPACE), Breakout (LSPACE), Minesweeper (NP), Tetris (NP), Pac-Man (NP).  

Human information processing (or performance) can be characterized in the form of informational complexity. One classic way to assess complexity in performance is the use of psychometric laws [11]. Invariants relevant to video game performance are Fitts' Law (hand-eye coordination), the Weber-Fechner Law (just-noticeable differences in stimulus intensity), and Stevens' Law (the perception of physical stimuli relative to intensity). In the case of Fitts' Law, the index of difficulty is often expressed as a function of bandwidth (bits/second). This provides a real-time expression of complexity that can be compared across games. Similarly, the presence of signal relative to noise can be expressed in terms of information content (Shannon-Hartley theorem). The stimulus information available to the game player may or may not be congruent with control complexity and in-game combinatorial complexity [12]. Accounting for all three dimensions of this complexity is an avenue for further study.
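
For the curious, here is a hedged sketch of the two bandwidth calculations mentioned above: Fitts' index of difficulty in its Shannon formulation (ID = log2(D/W + 1), with throughput ID/MT in bits/second), and channel capacity via the Shannon-Hartley theorem (C = B log2(1 + S/N)). The example numbers are invented for illustration.

```python
# Two bandwidth calculations from the psychometric-law discussion above.
# Fitts' index of difficulty uses the Shannon formulation; channel
# capacity uses the Shannon-Hartley theorem. Example values are made up.

import math

def fitts_id(distance: float, width: float) -> float:
    """Index of difficulty (bits) for moving a given distance
    to a target of a given width."""
    return math.log2(distance / width + 1)

def fitts_throughput(distance: float, width: float, movement_time: float) -> float:
    """Bandwidth (bits/second): index of difficulty per unit time."""
    return fitts_id(distance, width) / movement_time

def channel_capacity(bandwidth_hz: float, snr: float) -> float:
    """Shannon-Hartley capacity (bits/second) of a noisy channel."""
    return bandwidth_hz * math.log2(1 + snr)

# e.g. moving a cursor 320 px to a 40 px target in 0.5 s:
print(f"ID = {fitts_id(320, 40):.2f} bits")                  # ~3.17 bits
print(f"throughput = {fitts_throughput(320, 40, 0.5):.2f} bits/s")
print(f"capacity = {channel_capacity(10, 3):.2f} bits/s")    # B=10 Hz, S/N=3
```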

NOTES: 
[1] Weber, R., Alicea, B., Huskey, R., and Mathiak, K. (2018). Network Dynamics of Attention During a Naturalistic Behavioral Paradigm. Frontiers in Human Neuroscience, doi:10.3389/fnhum.2018.00182.

[2] Many early video games were designed around simple programming benchmarks. Games such as Pong and Breakout take inspiration from the pole-balancing problem, while the game Qix takes inspiration from k-D partitioning. In this case, the human serves to solve the computational problem, however inefficiently.

[3] Pac-Man Proved NP-Hard By Computational Complexity Theory. Emerging technology from the arXiv blog, January 26 (2012). 

[4] Viglietta, G. (2013). Gaming Is A Hard Job, But Someone Has To Do It! arXiv, 1201.4995.

[5] Demaine, E.D., Lockhart, J., and Lynch, J. (2016). The Computational Complexity of Portal and Other 3D Video Games. arXiv, 1611.10319.

[6] Demaine, E.D., Hohenberger, S., and Liben-Nowell, D. (2003). Tetris is hard, even to approximate. Proceedings of the International Computing and Combinatorics Conference, 351–363.

[7] Kaye, R. (2000). Minesweeper is NP-complete. The Mathematical Intelligencer, 22(2), 9–15.

[8] Demaine, E.D., Viglietta, G., and Williams, A. (2016). Super Mario Brothers is harder/easier than we thought. Proceedings of the International Conference on Fun with Algorithms.

[9] Aloupis, G., Demaine, E.D., Guo, A., and Viglietta, G. (2014). Classic Nintendo games are (NP-) hard. arXiv, 1203.1895.

[10] Aloupis, G., Demaine, E.D., Guo, A., Viglietta, G. (2012). Classic Nintendo Games are (Computationally) Hard. arXiv, 1203.1895.

[11] One challenge in linking psychophysical complexity to video game perception is capturing the entirety of what is cognitively extracted from the game environment.

For an example of a purely psychophysical approach to game design, please see: Hussain, Z., Astle, A.T., Webb, B.S., and McGraw, P.V. (2014). The challenges of developing a contrast-based video game for treatment of amblyopia. Frontiers in Psychology, 5, 1210. doi:10.3389/fpsyg.2014.01210.

[12] Forisek, M. (2010). Computational complexity of two-dimensional platform games. Proceedings of the International Conference on Fun with Algorithms, 214–227.


April 1, 2018

What is MS Fool for $1000, Alex?

Want more Jeopardy! answers in question form? Check out the J! archive.

There is a recurring insider reference among Very Serious Computer Users regarding the use of Microsoft products to perform sophisticated computational tasks [1]. While most people tend to think of these programs as computationally unsophisticated, programs such as Excel [2], PowerPoint [3], and even MS Paint can do some pretty powerful computing. One such example is the 3-D modeling enabled by object/shape manipulation in PowerPoint and Paint.

Around every April 1, Carnegie Mellon University students participate in an annual tongue-in-cheek conference (sponsored by the satirical Association for Computational Heresy) called SIGBOVIK (Special Interest Group on Harry Q. Bovik). Not sure where Harry Q. Bovik got his credentials, but if you enjoy the Ig Nobel Prizes, this should be right up your alley.

SIGBOVIK often features highly interesting quasi-research [4] projects based on the aforementioned suite of Microsoft programs. But there are other groups creating programs and computational artifacts simply because they can. Here are a few examples collected from around the web:

1) Using MS Paint to create high-quality works of art, courtesy of Hal Lasko and "The Pixel Painter".

2) Tom Wildenhain's SIGBOVIK 2017 lecture on Turing-complete PowerPoint.

The (non-) Turing Completeness of PowerPoint. One of many computational artifacts that are Turing incomplete. COURTESY: Tom Wildenhain YouTube channel and HackerNoon.

3) Try out this version of Doom programmed in Excel, courtesy of Gamasutra. The game runs on a series of equations and requires VBA. There are several files you need to download, and the blog post walks through the full set of challenges.


Rasterized Graphics with Underlying Linear Equations [5]. Examples from the Excel Art Series. ARTIST: Oleksiy Say

4) The final example returns us to SIGBOVIK (2016), where David Fouhey and Daniel Maturana bring us ExcelNet, an architecture for deep learning in Excel that has gone through the paces of the Kaggle MNIST competition. Another example features Blake West and his implementation of a deep learning architecture in Google Sheets.

NOTES:
[1] This is similar to advertising membership in the cult of LaTeX, touched upon in this discussion of LaTeX fetishes.

[2] The first spreadsheet (Autoplan) was developed in the form of a scripting language, which is a more general counterpart to formal programming languages (e.g. Java versus JavaScript).

[3] Presentation software can be pretty diverse. But is any of it Turing complete (TC)? If you work out your design process using Wang dominoes, then perhaps TC-ness can be realized.

Example of a Wang tile.

[4] Definition of quasi-research: research projects that do not produce useful results directly, but rather as side-effects of the research process itself.

[5] By combining the world of image rasterization with interlinked linear equations, there are some exciting (but conceptually and technologically undeveloped) opportunities in this space.

March 19, 2018

Google Summer of Code Application Deadline is Approaching!


A quick reminder as to what is near. Nothing to fear -- except that applications for Google Summer of Code 2018 are due on March 27th (in a little more than a week). Thanks to all the applicants to our projects so far. I am the mentor for the following projects (all sponsored by INCF):

Contextual Geometric Structures (Project 8)

Towards a k-D embryo (Project 10.2)

Physics-based XML (Project 10.1)

Apply to work with either the Orthogonal Research Lab community (Project 8) or the OpenWorm Foundation community (Projects 10.1, 10.2). Please contact me (Bradly Alicea) for more information.

There are several other projects hosted by the OpenWorm Foundation that combine work with Neuroscientific data and Computational Modeling. These include:

Advanced Neuron Dynamics in WormSim (Project 10.3)

Mobile application to explore C. elegans nervous system dynamics (Project 10.4)

Add support for Neurodata Without Borders 2.0 to Geppetto (Project 10.5)

All three of these projects are based in the OpenWorm Foundation community, and are led by mentors Matteo Cantarelli, Giovanni Idili, and Stephen Larson.

December 15, 2017

Work With Me, the Orthogonal Laboratory, and the OpenWorm Foundation This Summer!


The Google Summer of Code (GSoC) is once again accepting applications from students to work on a range of programming-oriented projects over the Summer of 2018. Orthogonal Laboratory and the OpenWorm Foundation have contributed a number of projects to the list. Here are links to the project descriptions (login required):

Orthogonal Laboratory:


DevoWorm Group:



OpenWorm Foundation:



I am the contact person for the Orthogonal Laboratory and DevoWorm Group projects, and Matteo Cantarelli is the contact person for the other projects. If you have any questions about the application process, or would like me to review your application before submission, please feel free to reach out. The deadline for application submission is tentatively in late March/early April. Stay tuned!

Join us on "The Road to GSoC"!


October 2, 2017

Pseudo-Heliocentric Readership Information in Gravitationally Bound Form

Or, how to get 300,000 reads by being persistent [1] and getting results in unexpected places. Let's review our milestones in three cartoons.






The made-up planetary orbits featured here [2] may violate the physics of actual solar system orbits, at least as simulated by Super Planet Crash [3].


NOTES:
[1] Candy, A. (2011). The 8 Habits of Highly Effective Bloggers. Copyblogger, October 25.

[2] Previous readership milestones, in order of distance from central star: 20000, 50000, 100000 (first image), 120000, 150000 (second image), 200000, 250000 (third image).

[3] Featured in the Scientific Bytes and Pieces, August 2015 post.


May 4, 2017

Announcing our Google Summer of Code 2017 Students


As mentioned in a previous post, the OpenWorm Foundation (and DevoWorm group) has been receiving applications for this year's Google Summer of Code. We have now selected our student applicants and the projects to be awarded the internship. We had a very good group of applicants this year, so congratulations go out to everyone who applied!


Shubham Singh will be working on the model completion dashboard project, which is a general tool for the OpenWorm community. Siddharth Yadav will be working with me and the rest of the DevoWorm group to quantify and analyze secondary microscopy data that capture the process of embryogenesis in C. elegans and other organisms [1]. Good luck!

Thanks to the INCF for coordinating the selection process!


NOTES: 
[1] For more reading on the promise of this approach, please see: Chi, K.R. (2017). Picking out Patterns. The Scientist, May 1 AND Rizvi, A.H., Camara, P.G., Kandror, E.K., Roberts, T.J., Schieren, I., Maniatis, T., and Rabadan, R. (2017). Single-cell topological RNA-seq analysis reveals insights into cellular differentiation and development. Nature Biotechnology, doi:10.1038/nbt.3854.

April 17, 2017

Breaking Out From the Tyranny of the PPT

Player 1 vs. Powerpoint (with a screenshot of the game Breakout). The image itself was made in PowerPoint, but I promise this post will not be recursive nonsense.


By now, you have probably chosen a side in the PowerPoint debate: namely, does it enhance or hinder scholarly communication? I will present both sides of this argument, but not argue to moderation. Rather, I will show that PowerPoint is good (or good to get rid of) only if you define your own style of presentation. In either case, you will need to "break out" of the box containing typical advice for creating PowerPoint presentations.


A number of people have argued (both rhetorically and in practice) that PowerPoint represents an enforced tyranny on presented information. It forces big ideas into small compartments, defined by slide optimization and bullet points. What follows are a few examples of PowerPoint tyranny, or cases in which the default style of organization imposes constraints on communication and the exchange of ideas.


A few years ago, Franck Frommer wrote a book on how PowerPoint makes us stupid [1]. Frommer's definition of stupid refers to impoverishing our ability to communicate logical flow and contextual detail, and to confusing opinion with fact. Supporting this position is Peter Norvig's Gettysburg Address analysis, which suggests that the cognitive style of PowerPoint and its visual gimmickry often obscure rather than enhance the logical flow of a larger idea.


Example slide from the Gettysburg Address as a PowerPoint presentation.


Education might also benefit from breaking away from the PowerPoint tradition. In fact, there is an argument to be made that the use of PowerPoint in education reduces course content to an overly-simplified, pre-packaged learning experience [2]. Dr. Chris Waters at Michigan State University has moved to eliminate PowerPoint lectures altogether in his undergraduate Microbiology course. He is instead adapting the existing presentations into a series of chalk talks, which are more conducive to communicating scientific ideas.


Perhaps the failures of PowerPoint are not about varied styles of communication across different domains of knowledge (scientific, business, legal), but more about the relevance of ideas and their overall structure. Relevance theory (Dan Sperber and Deirdre Wilson) suggests that we are biased toward what seems relevant [3]. Some of this is mediated by the allocation of attentional resources, but there is also an underappreciated role for cultural preferences and constraints. In the realm of science communication, the narrowly-defined relevance of typical PowerPoint design practice might encourage some aspects of scientific practice (science as memorization of facts, still images, simple graphs) at the expense of others (experimentation, data exploration, theory-building).


The tyranny of representational orthodoxy, PowerPoint style. On the other hand, this is actually pretty good in terms of available clip art. While perfectly suitable for business-oriented communication (e.g. team-building, simple storytelling), this may or may not be suitable for other domains of knowledge.


So how does one break out from the restrictions of PowerPoint? One way forward is shown by the artistic community's use of PowerPoint as an expressive medium. Like the latter-day explosion of animated .gif art on Tumblr [4], artists have been using PowerPoint to create animations and short videos. Interestingly, the limitations of PowerPoint for representing alternate forms of argumentation do not seem to limit artistic innovation [5]. Perhaps this has to do with the use of symbols rather than the ambiguity of linguistic syntax.

A more argument-driven way to approach PowerPoint is to adopt the Lessig Method of presentation [6], which presents ideas a few words at a time in a large font. One example of this is Larry Lessig's "Free Culture" lecture, which connects a sequence of court cases and landmark ideas in sparse blocks of text. Whether this solves the ambiguity issue is not clear to me, but it does provide a way to simplify without losing information.


The last several talks I have given include a final "Thanks for your Attention" Acknowledgements slide which features a graphic that has something to do with attention (visual illusion and/or obscure reference). This is one such example featuring Marshall McLuhan (e.g. breaking the message out of the medium).

UPDATED (4/23): Here is a presentation to the Association for Computational Heresy by Tom Wildenhain on how to construct a Turing machine with PowerPoint. While it is a lot of fun, it does bring to mind some more creative uses of PowerPoint.


NOTES:
[1] Frommer, F. (2012). How PowerPoint Makes You Stupid: The Faulty Causality, Sloppy Logic, Decontextualized Data, and Seductive Showmanship That Have Taken Over Our Thinking. New Press, New York.

[2] Ralph, P. (2015). Why universities should get rid of PowerPoint and why they won’t. The Conversation, June 23.

[3] Sperber, D. and Wilson, D. (1995). Relevance: Communication and Cognition. Blackwell Publishers, Oxford, UK.

[4] Alicea, B. (2012). Moving the Still, courtesy of the .gifted. Synthetic Daisies blog, October 19.

[5] Greenberg, A. (2010). The Underground Art of PowerPoint. Forbes, May 11. Some examples of PowerPoint art (converted to YouTube videos) include:

a) "Infiltration" by Jeremiah Lee.

b) "Joiners" by blastoons.


[6] Reynolds, G. (2005). The "Lessig Method of Presentation". Presentation Zen blog, October 7.

March 18, 2017

Almost time for GSoC Applications!

Your chance to join the DevoWorm group is almost upon us. If you are a student, the Google Summer of Code (GSoC) is a good opportunity to gain programming experience. Applications are being accepted from March 20 to April 3. If selected, you will join the DevoWorm group, and also have the chance to network with people from the OpenWorm Foundation and the INCF.

The best approach to a successful application is to discuss your skills, provide an outline of what you plan to do (which should resemble the project description), and then discuss your approach to solving the problems at hand. We are particularly interested in a demonstration of your problem-solving abilities, since many people will apply with a similar level of skill. You can find an application template in outline form here.


You can apply to work on one of two DevoWorm projects: "Physics-based Modeling of the Mosaic Embryo in CompuCell3D" or "Image processing with ImageJ (segmentation of high-resolution images)". If you have any questions, comment in the discussions or contact me directly.

March 15, 2017

A Tree of Deeper Experiences -- the Authorship Tree

One of the most difficult aspects of academic publishing with multiple authors is determining the order of authorship. In many fields, authorship order is key to job promotion. Unfortunately, these conventions vary by field, while the criteria for authorship slots often vary by research group. Since a responsible accounting of contributions is key to determining authorship and authorship order [1], it is worth considering multiple possibilities for conveying this information.

Example of an Authorship list (with affiliations)

A mathematics or computer science researcher might also see the problem as one of choosing the proper representational data structure. The authorship order, no matter how it is determined, is a 1-dimensional queue (ordered list). Even though some publishers (such as PLoS) allow for footnotes (an inventory of author contributions), there is still little room for nuance.

Example from "The Academic Family Tree"

But is there a better way? Academic genealogies provide one potential answer. A typical genealogy can be thought of as a 1-dimensional order, from mentor to student. In reality, however, an academic may have multiple mentors and be influenced by a number of predecessors. The construction of academic family trees [2] is one step in this direction, turning the 1-dimensional graph into a 2-dimensional one.


Picture of the Authorship tree cover. COURTESY: "The Giving Tree" by Shel Silverstein

This is why Orthogonal Lab has just published a hybrid infographic/paper called The Authorship Tree [3]. This is a working document, so suggestions are welcome. The idea is not only to determine the relative scope of each contribution, but also to graphically represent the interrelationships between authors, ideas, and the scope of the contributions.

As we can see from the example below, this includes not only our authors, but also people from the acknowledgements, funders, reviewers, and the authors of important papers/methods. While the ordering of branches along the stem suggests an authorship order, branches are actually ranked according to their degree of contribution [4]. To this end, there can be equivalent amounts of contribution, as well as the inclusion of minor contributors not normally included in an authorship list.

Example of an authorship tree (derived from original 1-D author list).
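
For the data-structure minded, here is a minimal sketch of the shift described above: from a 1-dimensional author list to a 2-dimensional tree whose branches are ranked by degree of contribution. The field names and weights are hypothetical, not part of the published document.

```python
# A sketch of the 1-D list -> 2-D tree shift described in this post.
# Names, roles, and weights are hypothetical placeholders.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Contributor:
    name: str
    role: str              # e.g. "author", "reviewer", "funder"
    weight: float          # degree of contribution (branch rank)
    children: List["Contributor"] = field(default_factory=list)

# The flat 1-D authorship queue...
author_list = ["A. Author", "B. Author", "C. Author"]

# ...versus a 2-D tree that also places acknowledged helpers and funders.
root = Contributor("Paper", "stem", 1.0, [
    Contributor("A. Author", "author", 0.5),
    Contributor("B. Author", "author", 0.3,
                [Contributor("Funder X", "funder", 0.1)]),
    Contributor("C. Author", "author", 0.3),  # equal contributions allowed
])

def print_tree(node: Contributor, depth: int = 0) -> None:
    # Branches print by degree of contribution, not authorship order.
    print("  " * depth + f"{node.name} ({node.role}, {node.weight})")
    for child in sorted(node.children, key=lambda c: -c.weight):
        print_tree(child, depth + 1)

print_tree(root)
```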

NOTES:
[1] Cozzarelli, N.R. (2004). Responsible authorship of papers in PNAS. PNAS, 101(29), 10495.

[2] David, S.V. and Hayden, B.Y. (2012). Neurotree: A Collaborative, Graphical Database of the Academic Genealogy of Neuroscience. PLoS One, 7(10), e46608. doi:10.1371/journal.pone.0046608.

[3] Orthogonal Lab (2017). The Authorship Tree. Figshare, doi:10.6084/m9.figshare.4731913.

[4] For more on the point system convention, please see: Venkatraman, V. (2010). Conventions of Scientific Authorship. Science Issues and Perspectives, doi:10.1126/science.caredit.a1000039.

February 8, 2017

Work With Me (and the OpenWorm Foundation) This Summer!

Want to work with me and contribute to the DevoWorm project this summer? Apply to work on a project funded through the Google Summer of Code (GSoC) fellowship!



If you are a student in the computational sciences and want to be challenged and get paid while working on an applied computer science project, then apply now to one of two projects [1, 2]. GSoC is a high-profile fellowship that provides a stipend, opportunities for professional collaboration, and an impressive line on your CV/resume. As I (Bradly Alicea) will serve as your mentor, please contact me if you have any questions.


Take ownership of an available cell! Apply to work with the DevoWorm project and OpenWorm Foundation!

You can apply to work on one of two DevoWorm projects: "Physics-based Modeling of the Mosaic Embryo in CompuCell3D" or "Image processing with ImageJ (segmentation of high-resolution images)". You may also apply to other projects sponsored by the OpenWorm Foundation and INCF.

UPDATE (3/8): Applications are being accepted starting March 20, 2017, and will close on April 3rd at 16:00 UTC.

January 31, 2017

Crossing the Rubicon of 10^6 * 0.25

Synthetic Daisies will achieve another milestone (250,000 reads) in just a few short days! When I started this blog in December of 2008 (roughly 8 years ago), I did not have any real expectations for readership. I was, however, drawn to analytics and the power of blogging as a platform for reaching new audiences. And I kept updating milestones for the blog when the number of visitors hit 20000, 50000, 100000, 120000, 150000, and 200000.

Readership has increased exponentially since blog inception, despite the uneven sampling points in time.

Since the blog's inception, I have increasingly used social media for outreach activities (both at this blog and elsewhere). Part of this has been motivated by a deliberately radical open science strategy [1-4]. For a while, I was cross-posting from a Tumblr blog (Tumbld Thoughts), as well as a blog run by #SciFund (Fireside Science). I also have my entries cross-posted to the OpenWorm Foundation blog.

Visualizing radical open access. COURTESY: Open Reflections blog.

NOTES:
[1] Kriegeskorte, N. (2016). The selfish scientist’s guide to preprint posting. The Winnower, 4. doi:10.15200/winn.145838.88372.

[2] Chawla, D.S. (2017). When a preprint becomes the final paper. Nature Research Highlights, doi:10.1038/nature.2017.21333

[3] Lancaster, A. (2016). Open Science and its Discontents. Ronin Institute blog, June 28.

[4] Faulkes, Z. (2012). Why I published a paper on my blog instead of a journal. NeuroDojo blog, September 7.





December 1, 2016

Searching for Food and Better Data Science at the Same Time

Two presentations to announce, both of which are happening live on 12/2. The first is the latest OpenWorm Journal Club, happening via YouTube live stream. The title is "The Search For Food"; the talk is a survey of a recently-published paper on food-search behaviors in C. elegans [1].


While the live-stream will be available in near-term perpetuity [2] on YouTube, the talk will begin at 12:45 EST [3]. The abstract is here:
Random search is a behavioral strategy used by organisms from bacteria to humans to locate food that is randomly distributed and undetectable at a distance. We investigated this behavior in the nematode Caenorhabditis elegans, an organism with a small, well-described nervous system. Here we formulate a mathematical model of random search abstracted from the C. elegans connectome and fit to a large-scale kinematic analysis of C. elegans behavior at submicron resolution. The model predicts behavioral effects of neuronal ablations and genetic perturbations, as well as unexpected aspects of wild type behavior. The predictive success of the model indicates that random search in C. elegans can be understood in terms of a neuronal flip-flop circuit involving reciprocal inhibition between two populations of stochastic neurons. Our findings establish a unified theoretical framework for understanding C. elegans locomotion and a testable neuronal model of random search that can be applied to other organisms.
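
As a toy illustration of the abstract's central idea (and emphatically not the authors' actual model), here is a two-state stochastic switch in which mutually exclusive states trade control at random times, yielding exponentially distributed forward and reverse bouts. The switching rates are invented.

```python
# A toy two-state stochastic flip-flop: the active state holds control
# until a random (memoryless) switching event, mimicking alternating
# run/reverse bouts. Not the paper's model; rates are made up.

import random

def flip_flop(rate_fwd_to_rev=0.2, rate_rev_to_fwd=1.0, t_max=60.0):
    """Simulate the switch; return a list of (state, dwell time) bouts."""
    t, state, bouts = 0.0, "forward", []
    while t < t_max:
        rate = rate_fwd_to_rev if state == "forward" else rate_rev_to_fwd
        dwell = random.expovariate(rate)   # exponentially distributed dwell
        bouts.append((state, dwell))
        t += dwell
        state = "reverse" if state == "forward" else "forward"
    return bouts

random.seed(1)
for state, dwell in flip_flop()[:5]:
    print(f"{state:8s} {dwell:6.2f} s")
```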
The other presentation is one that I will give at the Champaign-Urbana Data Science Users' Group. This will be a bit more informal (20 minutes long), and part of the monthly meeting. The meeting will be live (12 noon CST) at the Enterprise Works building in the University Research Park. The archived slides are located here. The title is "Open Data Science and Theory", and the abstract is here:
Over the past few years, I have been working to develop a way to use secondary data and Open Science practices and standards for the purpose of establishing new systems-level discoveries as well as confirming theoretical propositions. While much of this work has been done in the field of comparative biology, many of the things I will be highlighting apply to other disciplines. Of particular interest is how the merger of data science and Open Science principles will facilitate interdisciplinary science.

NOTES:
[1] Subtitle: To boldly go where no worm has gone before. Yup, Star Trek pun. Full reference: Roberts, W.M. et al. (2016). A stochastic neuronal model predicts random search behaviors at multiple spatial scales in C. elegans. eLife, 5, e12572.

[2] For as long as YouTube exists.

[3] Click here for UTC conversion.

October 27, 2016

Open Access Week: Working with Secondary Datasets

This is one of two posts in celebration of Open Access Week (on Twitter: #oaweek, #OpenAccess, #OpenScience, #OpenData). This post will focus on the use of secondary data in scientific discovery.


The analysis of open datasets has become a major part of my research program. There are many sources of secondary data, from web scraping [1] to downloading data from repositories. Likewise, there are many potential uses for secondary data, from meta-analysis to validating simulations [2]. If sufficiently annotated [3], we can use secondary data for purposes of conducting new analyses [4], fusion with other relevant data, and data visualization. Access to secondary (and tertiary) data relies on a philosophy of open data amongst researchers, which has been validated by major funding agencies.

The first step in reusing a dataset is to select datasets that are relevant to the question or phenomenon you are interested in. While data reuse is not synonymous with exploratory data analysis, secondary datasets can be used for a variety of purposes, including exploratory ones. It is important to understand what data you need to address your set of issues, why you want to assemble the dataset, and how you want to manage the associated metadata [5]. Examples of data repositories include the Dryad Digital Repository, Figshare, and the Gene Expression Omnibus (GEO). It is also important to remember that successful data reuse relies on good data management practices that allow first-hand data to be applied to new contexts [6].

An example of an archived dataset from the Dryad repository (original analysis published in doi:10.1098/rsos.150333).

Now let's focus on three ways to reuse data. The simplest way is to download data from a repository and reanalyze it using a technique not used by the first-hand generators of the data. This could be done by using a different statistical model (e.g. Bayesian inference), or by including the data in a meta-analysis (e.g. surveying effect sizes across multiple studies of similar design). Such research can be useful for looking at the broader scope of a specific set of research questions.
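
As a concrete (and entirely invented) example of the meta-analytic route, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis that pools effect sizes drawn from several hypothetical secondary datasets:

```python
# Fixed-effect, inverse-variance meta-analysis pooling effect sizes
# from several archived studies of similar design. Numbers are invented.

import math

# (effect size, variance) pairs from hypothetical secondary datasets
studies = [(0.42, 0.04), (0.31, 0.02), (0.55, 0.09)]

weights = [1.0 / v for _, v in studies]          # inverse-variance weights
pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
se = math.sqrt(1.0 / sum(weights))               # standard error of the pool

print(f"pooled effect = {pooled:.3f} +/- {1.96 * se:.3f} (95% CI)")
```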

The second way is to download data from a repository for the purpose of combining data from multiple sources. This is sometimes referred to as data fusion or data integration, and can be done in a number of ways. One way this has been useful in my research is for comparative analysis, such as computational analyses of gene expression data across different cell types within a species [7], or of developmental processes across species [8]. Another potential use of recombined data is to verify models and validate theoretical assumptions. This is a particular concern for datasets that focus on basic science.
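
Here is a hedged sketch of that data-fusion step, assuming two hypothetical expression tables (the file and column names are made up) previously downloaded from a repository and keyed on a shared gene identifier:

```python
# Data fusion sketch: merge two downloaded expression tables on a shared
# gene identifier for a cross-dataset comparison. Files are hypothetical.

import pandas as pd

# e.g. tables previously downloaded from Dryad / Figshare / GEO
neurons = pd.read_csv("neuron_expression.csv")   # columns: gene, expr
muscle = pd.read_csv("muscle_expression.csv")    # columns: gene, expr

merged = neurons.merge(muscle, on="gene", suffixes=("_neuron", "_muscle"))

# a simple systems-level comparison across the combined dataset
print(merged[["expr_neuron", "expr_muscle"]].corr())
```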

In fact, the recent technological and cultural changes associated with increased data sharing are enabling broader research questions to be asked. Instead of asking strictly mechanistic questions (what is the effect of x on y?), combined datasets enable larger-scale (e.g. systems-level) comparisons across experiments (what are the combinatorial effects of all x and all y?). Doing this in a reductionist manner might take many orders of magnitude more time than assembling and analyzing a well-combined dataset. This allows us to verify the replicability of single experiments, in addition to applying statistical learning techniques [9] to find previously undiscovered relationships between datasets and experimental conditions.

The third way is to annotate and reuse data generated by your own research group [10]. This type of data reuse allows us to engage in data validation, test new hypotheses as we learn more about the problem, and compare different ways of attacking the same problem. The practice of data reuse within your own research group can encourage research continuity that transcends turnover in personnel, encouraging people to make data and methods open and explicit. Internal data reuse also creates educational opportunities, such as providing students with hands-on opportunities to analyze and integrate well-characterized data. Be aware that reusing familiar data still requires extensive annotation of both the data and previous attempts at analysis, and that there is as yet no culturally-coherent set of standard practices for sharing data [11].


There are a few caveats to successfully reusing data. As is the case with experimental design, the data should be sufficient to answer the types of questions you would like to ask. This includes going back to the original paper and associated metadata to understand how the data were collected and what they were originally intended to measure. While this does not directly limit what you can do with the data, it is important to understand when combining datasets. There is also a need for ways of assessing the internal validity of secondary datasets, whether they are single data sources or combinations of data sources.

To learn more about these techniques, please try to earn the Literature Mining badge series hosted by the OpenWorm badge system. You can earn Literature Mining I (working with papers), or both Literature Mining I and II (working with secondary data). Here you will learn about how to use secondary data sources to address scientific questions, as well as the interrelationship between the scientific literature and secondary data sources.


NOTES (try accessing the paper DOIs through http://oadoi.org):

[1] Marres, N. and Weltevrede, E. (2012). Scraping the Social? Issues in live social research. Journal of Cultural Economy, 6(3), 313-315. doi:10.1080/17530350.2013.772070

[2] Sargent, R.G. (2013). Verification and validation of simulation models. Journal of Simulation, 7(1), 12–24. doi:10.1057/jos.2012.20.

[3] For an example of how this has been a consideration in the ENCODE project, please see:
Hong, E.L., Sloan, C.A., Chan, E.T., Davidson, J.M., Malladi, V.S., Strattan, J.S., Hitz, B.C., Gabdank, I., Narayanan, A.K., Ho, M., Lee, B.T., Rowe, L.D., Dreszer, T.R., Roe, G.R., Podduturi, N.R., Tanaka, F., Hilton, J.A., and Cherry, J.M. (2016). Principles of metadata organization at the ENCODE data coordination center. Database, pii:bav001. doi: 10.1093/database/bav001.

[4] Church, R.M. (2001). The Effective Use of Secondary Data. Learning and Motivation, 33, 32–45. doi:10.1006/lmot.2001.1098.

[5] One example of this includes: Kyoda, K., Tohsato, Y., Ho, K.H.L., and Onami, S. (2014). Biological Dynamics Markup Language (BDML): an open format for representing quantitative biological dynamics data. Bioinformatics, 31(7), 1044-1052. doi:10.1093/bioinformatics/btu767.

[6] Fecher, B., Friesike, S., and Hebing, M. (2015). What drives academic data sharing? PLoS One, 10(2), e0118053. doi:10.1371/journal.pone.0118053.

[7] Alicea, B. (2016). Dataset for "Collective properties of cellular identity: a computational approach". Figshare, doi:10.6084/m9.figshare.4082400.

[8] Here is an example of a comparative analysis based on data from two secondary datasets: Alicea, B. and Gordon, R. (2016). C. elegans Embryonic Differentiation Tree (10 division events). Figshare, doi:10.6084/m9.figshare.2118049 AND Alicea, B. and Gordon, R. (2016). C. intestinalis Embryonic Differentiation Tree (1- to 112-cell stage). Figshare, doi:10.6084/m9.figshare.2117152

[9] Dietterich, T.G. (2002). Machine Learning for Sequential Data: A Review. Structural, Syntactic, and Statistical Pattern Recognition, LNCS, 2396. doi:10.1007/3-540-70659-3_2.

[10] Federer, L.M., Lu, Y-L., Joubert, D.J., Welsh, J., and Brandys, B. (2015). Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff. PLoS One, 10(6), e0129506. doi:10.1371/journal.pone.0129506.

[11] Pampel, H. and Dallmeier-Tiessen, S. (2014). Open Research Data: From Vision to Practice. In "Opening Science", S. Bartling and S. Friesike eds., Pgs. 213-224. Springer Open, Berlin. Dynamic version.
