Big Data, Big Confusion
In an era when storage and processing costs keep falling, the traditional way we work with data is changing fundamentally.
The hunt for information in the data forest
In “Big Data: A Revolution That Will Transform How We Live, Work and Think”, authors Viktor Mayer-Schonberger and Kenneth Cukier begin with the situation in 2009, when the H1N1 virus was a major concern for the World Health Organisation and, in particular, for the American government. The rapid evolution of the epidemic created difficulties for the CDC (Centers for Disease Control and Prevention), a governmental agency, whose reports lagged about two weeks behind the reality in the field, partly because people did not see medical personnel as soon as the first symptoms appeared. Real-time reporting would have allowed a better understanding of the size of the epidemic and an optimisation of prevention and treatment tactics, actions with the potential to save lives in a disaster which ultimately claimed 284,000 victims.
Incidentally, a few weeks before H1N1 reached the front pages of newspapers, Google published a paper in the scientific journal Nature presenting the results of a study that started from the question “Is there a correlation between the spread of an epidemic and searches on Google?” Google’s starting assumption was that people who feel the effects of a newly acquired disease will use the Internet to search for information about the symptoms (e.g. “medicine for flu and fever”). Thus, using the data published by the CDC between 2003 and 2008 and the 50 million most frequent searches from the same period, Google managed to identify a mathematical model (after iterating through over 400 million candidates) that demonstrated the correlation between the evolution of an epidemic and the way people search on the Internet. With the help of this new technology, named Google Flu Trends, the CDC managed in 2009 to monitor the spread of H1N1 more efficiently.
The story of Google Flu Trends is in many ways the archetypal example of the benefits, the technology and the challenges involved in solving a problem from the Big Data space. Starting from a hypothesis that looks for a correlation, and using large amounts of unstructured data together with modern processing technologies, one attempts to validate the correlation which, eventually, will bring value by transforming data into new information.
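The correlation search described above can be illustrated with a minimal sketch. It is not Google’s actual model: it simply scores each search query by the Pearson correlation between its weekly volume and the weekly number of reported cases, then keeps the best-scoring queries. The function names (pearson, rank_queries) and all the figures are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_queries(query_counts, reported_cases, top_n=2):
    """Rank queries by how closely their weekly volume tracks reported cases."""
    scored = [
        (pearson(counts, reported_cases), query)
        for query, counts in query_counts.items()
    ]
    scored.sort(reverse=True)  # highest correlation first
    return [(query, round(r, 3)) for r, query in scored[:top_n]]


# Invented weekly figures, for illustration only.
reported_cases = [10, 12, 30, 55, 40, 18]
query_counts = {
    "flu and fever medicine": [100, 130, 290, 560, 410, 170],
    "cough remedy": [80, 95, 200, 330, 310, 120],
    "football scores": [500, 480, 510, 495, 505, 490],
}

ranked = rank_queries(query_counts, reported_cases)
for query, r in ranked:
    print(f"{query}: r = {r}")
```

In this toy data set the two health-related queries rise and fall with the reported cases, while the unrelated query does not. A real system would face the same task at a vastly larger scale, with millions of candidate queries and many candidate models to compare.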
Big Data: The New “Cloud Computing”
Big Data is still in its infancy. Proof of this is the confusion we can observe on the market when it comes to defining the problem that Big Data addresses and the manner (or manners) in which it does so. When I was talking about Cloud Computing in 2009, I was constantly amused that the question “What is Cloud Computing?”, addressed to a room of 50 participants, had the potential of receiving 52 answers of which, go figure, many were correct. The situation is similar today for Big Data, because we are in a period close to what Gartner calls the “peak of inflated expectations”. In other words, Big Data is discussed everywhere, and the entire industry is engaged in discovering benefits in a wide range of technologies and concepts, ranging from those with a higher degree of maturity and applicability (e.g. Predictive Analytics, Web Analytics) to Star Trek inspired scenarios (e.g. Internet of Things, Information Valuation, Semantic Web).
Figure 1 – The comparative volume of “Big Data” (blue) and “Cloud Computing” (red) searches (source: Google Trends)
“Cloud Computing” has already passed its peak, according to the volume of searches on Google, while “Big Data” is still growing. The fundamental problem behind the confusion, and implicitly behind the unrealistic expectations, is however that Big Data consists, according to Gartner’s “Hype Cycle” model, of over 45 concepts in various stages, from the pioneering stage (i.e. “Technology Trigger”) to the mature one (i.e. “Plateau of Productivity”). Thus, Big Data cannot be treated holistically at a tactical level, but only in principle, at a strategic level.
Figure 2 – Big Data “Hype Cycle” (source: Gartner, 2012)
Small Data Thinking, Small Data Results
Mayer-Schonberger and Cukier identify 3 fundamental principles that allow for a shift from the Small Data approach to a Big Data approach.
“More”: keep and do not throw away
Data storage costs reached a historical minimum in 2013. At present, storing 1 gigabyte (GB) of data costs less than 9 cents per month using a cloud storage service (e.g. Windows Azure), and for archiving the cost drops to 1 cent per month (e.g. Amazon Glacier). This reduces the cost of storing a petabyte (1,048,576 GB) to roughly $10,000 per month (or about $10 for a terabyte), 1,000,000 times cheaper per GB than at the start of the 1990s, when the average storage cost was approximately $10,000 per GB. In this context, erasing the digital data accumulated through IT processes makes less and less sense. Google, Facebook and Twitter raise this principle to the level of a fundamental law, as it represents their ticket to new dimensions of development and innovation, an opportunity now open to those who until now were limited by prohibitive costs.
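The arithmetic behind these figures is easy to check. The short sketch below uses only the prices quoted in this article, with binary units (1 TB = 1,024 GB, 1 PB = 1,048,576 GB); the constant and function names are ours. Prices are held in whole cents to avoid rounding noise. Note that it sets a monthly fee against the early-1990s per-GB cost, so it is an order-of-magnitude illustration rather than a like-for-like comparison.

```python
GB_PER_TB = 1024
GB_PER_PB = 1024 * 1024  # 1,048,576 GB

# Prices quoted in the article, expressed in cents.
ARCHIVE_CENTS_PER_GB_MONTH = 1        # archive tier
CLOUD_CENTS_PER_GB_MONTH = 9          # upper bound for standard cloud storage
EARLY_1990S_CENTS_PER_GB = 10_000 * 100  # approximately $10,000 per GB


def monthly_cost_usd(gigabytes, cents_per_gb_month):
    """Monthly storage cost in US dollars for a given volume and unit price."""
    return gigabytes * cents_per_gb_month / 100


tb_archive = monthly_cost_usd(GB_PER_TB, ARCHIVE_CENTS_PER_GB_MONTH)
pb_archive = monthly_cost_usd(GB_PER_PB, ARCHIVE_CENTS_PER_GB_MONTH)
pb_cloud = monthly_cost_usd(GB_PER_PB, CLOUD_CENTS_PER_GB_MONTH)
price_ratio = EARLY_1990S_CENTS_PER_GB // ARCHIVE_CENTS_PER_GB_MONTH

print(f"1 TB archived: ${tb_archive:,.2f} per month")        # $10.24
print(f"1 PB archived: ${pb_archive:,.2f} per month")        # $10,485.76
print(f"1 PB in cloud: up to ${pb_cloud:,.2f} per month")    # $94,371.84
print(f"Per-GB price ratio vs. early 1990s: {price_ratio:,}")  # 1,000,000
```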
“Messy”: quantity precedes quality
Google Flu Trends worked because Google fed the 50,000,000 most frequent searches into the process of iterating over mathematical models. Many of these searches were irrelevant, but the volume was required to determine the model which finally demonstrated the correlation. Peter Norvig, Google’s expert in artificial intelligence, stated in the article “The Unreasonable Effectiveness of Data” that “simple models supplied with a big volume of data are going to eclipse more elaborate models based on less data”, a principle also used in building Google Translate, an automated translation service based on a corpus of over 95 billion sentences formulated in English, capable of translating to and from 60 languages.
“Correlation”: facts and not explanations
We have been taught, and have grown used to the idea, that every effect is determined by a cause, which is why we are naturally tempted to ask “why?”. In the Big Data world, correlation becomes more important than causality. In 1997 Amazon had on its payroll an entire department responsible for drawing up lists of reading recommendations for those who visited the online bookshop. It was a manual process, expensive and with a limited impact on sales. Today, thanks to an algorithm named “item-to-item collaborative filtering” developed by Amazon, the recommendations are made completely automatically, dynamically and with a massive impact on sales (a third of the income generated by electronic commerce comes from automated recommendations). Amazon does not want to know why customers buying “The Lord of the Rings” by J. R. R. Tolkien are also interested in buying “Friendship and the Moral Life” by Paul J. Wadell; what interests them is that there is a strong correlation between these two titles, and that this fact generates three times as much income as would be possible without such a system.
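To make the idea concrete, here is a minimal sketch of item-to-item similarity. It is not Amazon’s implementation. It assumes a common textbook variant in which two titles are considered similar when largely the same customers bought both, measured with cosine similarity over purchase sets. The customer names, the extra titles and the function names (item_similarities, recommend) are invented for illustration.

```python
import math
from collections import defaultdict


def item_similarities(purchases):
    """Map each item to {other_item: cosine similarity of their buyer sets}."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)

    sims = defaultdict(dict)
    titles = sorted(buyers)
    for i, a in enumerate(titles):
        for b in titles[i + 1:]:
            common = len(buyers[a] & buyers[b])
            if common == 0:
                continue  # no shared buyers, no measurable correlation
            score = common / math.sqrt(len(buyers[a]) * len(buyers[b]))
            sims[a][b] = score
            sims[b][a] = score
    return sims


def recommend(item, sims, top_n=3):
    """Titles most strongly correlated with `item`, best first."""
    ranked = sorted(sims.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]


# Invented purchase histories, for illustration only.
purchases = {
    "ana": {"The Lord of the Rings", "Friendship and the Moral Life"},
    "bogdan": {"The Lord of the Rings", "Friendship and the Moral Life", "The Hobbit"},
    "carmen": {"The Lord of the Rings", "The Hobbit"},
    "dan": {"Cooking Basics"},
    "elena": {"The Lord of the Rings", "Friendship and the Moral Life"},
}

sims = item_similarities(purchases)
print(recommend("The Lord of the Rings", sims))
```

The sketch never asks why two titles are bought together. It only counts how often they are, which is exactly the shift from causality to correlation described above.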
At this time, Big Data is the most abused trend on the market, and as a result the degree of confusion generated by the plethora of opinions encountered at every step (a category from which this article is not excluded) is extremely high, leading to unrealistic expectations and matching disappointments. However, clarity comes from understanding the potential, from adopting the principles (i.e. more, messy, correlation) and from acting proactively to adapt current systems to the new way of thinking, in terms of the computing infrastructure, the architecture and the technical competences of those operating them. What is at stake is identifying new addressable opportunities for transforming data into information that could increase the efficiency of a product or of a business, as Google did through Flu Trends or Amazon through its automated recommendation system.
Yonder has been accumulating Big Data experience, investing strategically in applied research projects together with product companies that understood the vision we have outlined and the benefits that such an investment could generate in both the short and the long term. This trend is one of the four technological directions chosen as an innovation topic in 2013.