Synthetic Daisies: Popular Algorithmics

Here is my series called Popular Algorithmics, originally featured on Tumbld Thoughts. A tour of various algorithmic approaches, highlighted with useful links.

The RANSAC (Random Sample Consensus) Algorithm [1, 2] is an adaptive method to detect outliers by dividing the sample datapoints into "inliers" and "outliers". Treated in this way, the data can be incorporated into a model that exhibits minimial error. RANSAC is also robust to outliers, which might be useful for dealing with heavy-tailed and other non-normal (e.g. Gaussian) distributions.

RANSAC was developed for and used extensively in computer vision research [3]. A MATLAB toolbox specialized for image processing, but possibly amenable to other applications [4], is available from the MATLAB Exchange [3].

[1] Zuliani, M. RANSAC for Dummies. August 13 (2012).

[2] Second picture is from: Zaman, T. [3D] Ransac (get planes from point clouds). March 23 (2011).

[3] RANSAC Algorithm for Image Processing. Stack Overflow (2011).

[4] Kowdle, A. Parallel implementation of RAndom SAmple Consensus (RANSAC). Docstoc (2010).

[5] Chen, H. RANSAC Toolbox. MATLAB Toolbox (2009).

Here are links to a feature by David Ackley (University of New Mexico) on robust-first computing in this month's Communications of the ACM [1]. The argument: computing systems were originally designed to optimize efficiency (rather than robustness or adaptability), but at a cost of normal operation under novel or unexpected circumstances [2].

In this YouTube video, Ackley uses a bubblesort algorithm and alien symbols (e.g. unlabeled data) to demonstrate how and when a robust-first computing system can outperform systems designed using efficiency-first principles (e.g. your computer). This area of research also has broader implications for understanding social and biological systems.

[1] Ackley, D. Viewpoint: Beyond Efficiency. Communications of the ACM, October (2013). Open-access version here.

[2] Posey, B.M. Demystifying the "Blue Screen of Death". Microsoft TechNet AND Use a Windows Blue Screen of Death for Your WordPress 404 Error Page. How-to Geek.

Here are a few recent articles on the process and blind spots of applying algorithms to real-world problems. In [1], Rhett Allain discusses the role of random walk theory in modeling how two lost people find each other in a forest (or perhaps two ships in the night).

In [2], the NP-hard nature of the travelling salesman problem (TSP) is discussed in the context of logistics optimization. And to spice things up (or top them off), the author of [3] finds analogies between the process of algorithmics and the process of baking.

[1] Allain, R. How Should Two Lost People Find Each Other? Wired Science Dot Physics blog, September 10 (2013).

[2] Vanderbilt, T. Unhappy truckers and other algorithmic problems. Nautil.us magazine, Issue 003 (2013).

[3] Hromkovic, J. Algorithmics, or What Have Programming and Baking in Common? In "Algorithmic Adventures". pp. 37-71, Springer (2009).

Here are some papers on shark genomes and machine learning. First up is a paper [1] in which the Callorhinchus milii genome is sequenced and analyzed in an evolutionary framework. Possesses the slowest evolving vertebrate genome. The study also clarifies the phylogeny of chordates, and offers the intrepid reader 250 pages of supplemental material.

The second article and associated YouTube video [2, 3] explore the potential of deep learning, which perfect the art of AI as pattern recognition. Deep learning does increasingly better as data is added to the model. This overcomes the typical sigmoidal learning curve that characterizes traditional neural nets. But is the biological metaphor (learning machine as human brain) sufficient to push AI research to the next plateau?

[1] Venkatesh, B. et.al Elephant shark genome provides unique insights into gnathosome evolution. Nature, doi:10.1038/nature12826 (2014).

[2] Jones, N. The Learning Machines. Nature, 505, 146-148 (2014).

[3] Hinton, G. Recent Developments in Deep Learning. Google TechTalks, March 19 (2010).

Here are a few readings and examples of the Hydra game [1-4]. This game play resembles fast OLS and other tree-swapping phylogenetic reconstruction methods, but is a one player "game against nature" that pits an autonomous player (who cuts off nodes one at a time) against a tree structure (that randomly generates new branches in response). This occurs much as it does in the ancient Greek myth of the Hydra.

The interesting thing about this game is that the player cannot lose. Yet, in its pure form, neither can that player win. Rather, play continues infinitely until someone gives up -- according to a recent Quora post [3], it is one of the most counterintuitive results in Mathematics.

[1] Bauer, A. The Hydra Game. Mathematics and Computation blog, February 2 (2008).

[2] Kirby, L. and Paris, J. Accessible Independence Results for Peano Arithmetic. Bulletin of the London Mathematical Society, 14, 285-293 (1982).

[3] Mathematics: what are some of the most counterintuitive mathematical results? Quora, March 27 (2014).

[4] JAVA Application: Hydra game. Mathematics and Computation blog.

In this installment of Popular Algorithmics, I present a short reading list covering a data structure called the Bloom filter. The Bloom filter [1] is a complex probabilistic hash function (e.g. maps semantic data to quantitative parameters) with a membership query mechanism [2]. This is good for any kind of set-based classification. In general, Bloom filters (particularly the Murmur algorithm) are more efficient than the cryptographic hash used in applications such as Bitcoin.

The probabilistic aspect of Bloom filters mean that there is a deterministic false positive rate, which can be tuned using a known number of input elements and hash functions. Bloom filters can be applied to many types of problems, including probabilistic verification [3] and bioinformatics [4].

[1] Broder, A. and Mitzenmacher, M. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4), 485-509 (2003).

[2] Holmes, A. Hadoop in Practice. Manning Press (2012).

[3] Dillinger, P.C. and Monolios, P. Bloom Filters in Probabilistic Verification. Lecture Notes in Computer Science, 3312, 367-381 (2004).

[4] Stranneheim, H., Kaller, M., Allander, T., Andersson, B., Arvestad, L., and Lundeberg, J. Classification of DNA sequences using Bloom filters. Bioinformatics, 26(13), 1595-1600 (2010).

Reading List:

Ellis, J. All you ever wanted to know about Bloom filters. Spyced blog, January 28 (2009).

Ksankar Is our Neocortex a Giant Semantic Bloom Filter? Of Natural Intelligence, Machine Learning, and Jeff Hawkins. My Missives blog, April 14 (2013).

Bloom Filters by Example. Bill Mill blog. Github fork.

Bloom Filter calculator.

BARCODE: Fast Lossless Compression via Cascading Bloom Filters. Homolog.us blog, May 8 (2014).

What is the Advantage to Using Bloom Filters? StackOverflow.

Davies, J. Bloom Filters Overview. Jason Davies blog.

This installment of Popular Algorithmics is about Bayesian Poisoning. This is a heuristic strategy dating from the late 1990s, and involves the intentional degradation of e-mail spam filters (typically naive Bayesian models) by spammers [1]. Poisoning strategies consist of Type I (intentionally increasing the number of false positives) and Type II (embedding common English words into a Spam message) attacks.

[1] Accettura, R. Bayesian Span Filter Poisoning with RSS. Fun with Wordage blog, Januiary 29 (2007).

[2] Eckleberry, A. What is the effect of Bayesian Poisoning? Security News blog, August 26 (2006).

This edition of Popular Algorithmics is a throwback to the 1990s. 7 August of 2014 was, according to Microsoft's "The Fire Hose" blog, the 20th anniversary of Microsoft's first website. Microsoft 1.0 was originally intended to be a low-res information portal. Interesting foray into what was to come.

Quasi-algorithmic word of the day: sockpuppetry. Sockpuppetry is the practice of inventing an deceptive identity for one’s avatar in a virtual world. As with Bayesian Poisoning, sockpuppetry comes in two flavors: type I (a phony persona different from the creator is used) and Type II (where obfuscation of identity is the only requirement). Unlike Bayesian Poisoning, however, Sockpuppetry can be done by a live person and not a bot.

Seife, C. The Weird Reasons Why People Make Up False Identities on the Internet. Wired, July 29 (2014) and the book Virtual Unreality.

This edition of Popular Algorithmics is an introduction to cactus graphs [1]. Wolfram Mathworld defines cactus graphs as a connected graph where any two graph cycles have no edges in common. This is similar to the leaves of a botanical cactus. Cactus graph topologies are an instance of a block graph, in which every connected component constitutes a clique or a cycle. The cactus representation can be used for a variety of applications, from modeling wireless networks to hierarchical tree building.

[1] Lima, M. The Book of Trees: Visualizing Branches of Knowledge. Princeton Press (2014).

Popular Algorithmics

1 comment:

Printfriendly