March 3, 2018

Open Data Day 2018: Orthogonal Research Version

Time once again for International Open Data Day, an annual event hosted by organizations all around the world. For the Orthogonal Research contribution, I am sharing a presentation on the role of theory in data science (and the analysis of open data).

 Full set of slides are available on Figshare, doi:10.6084/m9.figshare.5483746

A theory of data goes back to before there were concepts such as "big data" or "open data". In fact, we can learn a lot from attempts to characterize regularities in scientific phenomena, particularly in the behavioral sciences (e.g. Psychophysics).

There are a number of ways to build a mini-theory, but one advantage of the approach we are working on is that (assuming partial information about the data being analyzed) a theoretical model can be built with very limited amounts of data. I did not mention the role of non-empirical reasoning [1] in the theory-building, but might be an important issue for future consideration.

 The act of theory-building is also creating generalized models of pattern interpretation. In this case, our mini-theory detects sheep-shaped arrays. But there are bottom-up and top-down assumptions that go into this recognition, and theory-building is a way to make those explicit.
 Naive theories are a particular mode of error in theory-building from sparse or incomplete data. In the case of human reasoning, naive theories result from generalization based on limited empirical observation and blind inference of mechanism. They are characterized in the Cognitive Science literature as being based on implicit and non-domain-specific knowledge [2].

Taken together, mini-theories and naive theories can help us not only better characterize unlabeled and sparsely labelled data, but also gain an appreciation for local features in the dataset. In some cases, naive theory-building might be beneficial for enabling feature engineering, ontologies/metadata [3] and other characteristics of the data.

In terms of usefulness, theory-building in data science lies somewhere in between mathematical discovery programs and epistemological models. 

[1] Dawid, R. (2013). Novel Confirmation and the Underdetermination of Scientific Theory Building. PhilSci Archive.

[2] Gelman, S.A., Noles, N.S. (2011). Domains and naive theories. WIREs Cognitive Science, 2, 490–502. doi:10.1002/wcs.124

[3] Rzhetsky, A., Evans, J.A. (2011). War of Ontology Worlds: Mathematics, Computer Code, or Esperanto? PLoS Computational Biology, 7(9), e1002191. doi:10.1371/journal.pcbi.1002191

No comments:

Post a Comment