Developer Exchange Blog
What is Big Data?
Can you define Big Data for me? I wouldn't be surprised to have a number of very competent information technology professionals give me varied answers to that question. I'd like to clear this up, if at all possible.
In the information technology business, you tend to get into discussions that involve terms that are either unclear or abstract. In those situations, it's interesting to listen as a group of people discusses the ideas or activities they are involved in. If you listen carefully, you can often detect that they aren't really on the same page about the subject matter, even if everyone is nodding in agreement.
Over the last few years, I would have awarded nebulous-term-of-the-year (NTotY) to the term Cloud. However, the new challenger is Big Data. Last week I set myself on a mission to demystify this tech term to try to get to the bottom of the matter. To do so, I asked myself several questions:
What is Big Data?
- Is it an amount of data?
- Is it a concept?
- Is it a technology?
- Is it a certain type of data?
- Is it consumer behavioral data?
When someone says they're working on Big Data, what do they mean?
- Are they working on predictive analytics?
- Are they applying statistical/econometrical theory to data?
- Are they doing a BI project?
- Are they building a data warehouse?
- Are they working on artificial intelligence?
Are there specific elements that make something Big Data versus not Big Data?
- Does Big Data require non-transactional, unstructured data accumulation or analysis?
- Does Big Data require that one be dealing with a certain amount of data?
- Does Big Data require the use of specific technologies?
- Does Big Data specifically imply integrating data from social media?
And finally, are any of those questions important or relevant? More pointedly, is the term Big Data important?
Research on the term has solidified my choice of Big Data as NTotY. Even major journalism on the subject seems unclear as to what constitutes Big Data. Some present data warehousing, data analytics, and statistical data mining as Big Data. Others offer a more expansive, and unclear, idea regarding the assembly, integration and analysis of data from non-traditional, unstructured and external sources.
Data Analytics, "the process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision making" (en.wikipedia.org/wiki/Data_analysis), has been around for a long time. Considering the breadth of the above, one question is whether Big Data is just a fancy new tech term for old-fashioned Data Analytics.
New tech terms typically materialize for concrete reasons. Surely, there must be specific properties that make something Big Data. Cloud, for example, has specific properties of elasticity (ability to dynamically increase computational and storage resource use), measurable resource use and multi-tenancy support among others. Contrary to popular implications, it doesn't mean Internet hosted computing, virtualization or SaaS, all of which may or may not exist alongside Cloud computing.
To be fair, the term Big Data has been in popular use for several years. But this year, to me, Big Data has surfaced in common, casual tech-related discussions. We're now even seeing people with Big Data in their titles! Surely, it's important – way more important than Small Data or even Medium Data, whatever they may be.
Wikipedia offers an answer. "Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications." (en.wikipedia.org/wiki/Big_data). At least that forms a starting point for what Big Data may be - something involving too much data to process with traditional database technology.
Unfortunately, a definition like this is still not concrete for at least two reasons: (1) "difficult to process" is a relative term that depends on requirements such as the amount of time available to process the data and the effort required to do so, and (2) every year our ability to store and compute increases dramatically. What was "big" 5 years ago is not necessarily "big" now.
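To make the "relative term" point concrete, here is a rough back-of-the-envelope sketch in Python. The data size, read rate and machine count are numbers I made up for illustration, and the function name scan_hours is hypothetical; the point is only that the same data set can look "big" or ordinary depending on the time and hardware you have.

```python
def scan_hours(data_bytes: float, bytes_per_second: float, workers: int = 1) -> float:
    """Hours needed to read data_bytes once, split evenly across workers."""
    return data_bytes / (bytes_per_second * workers) / 3600


TB = 10**12
MB = 10**6

# Assumed numbers: 100 TB of data, 200 MB/s sequential read per machine.
single = scan_hours(100 * TB, 200 * MB)
cluster = scan_hours(100 * TB, 200 * MB, workers=100)

print(f"one machine: {single:.1f} h, 100 machines: {cluster:.2f} h")
```

Under these assumptions a single machine needs almost six days just to read the data once, while a hundred machines need under an hour and a half. Whether that counts as "difficult to process" depends entirely on the deadline.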
The fact is, however, we're producing exponentially more data all the time. So, one can only conclude that for almost any industry imaginable, Big Data will be something to be tackled soon if not now. More and more devices are producing "data." Everything from cell phones to sensors and RFID tags is emerging daily and spewing information that will be stored in the hope that it is potentially useful. These devices are pumping out way more information than the estimated 500 million tweets a day we
Further, as individuals, we're contributing to this through our "digital shadow": the digital footprint being accumulated about us, which is much larger than the digital information we explicitly create ourselves.
Some have suggested that Big Data is not just a lot of data but also implies different types of data that must be handled in different ways. If that were true, a monstrous set of transactional data wouldn't qualify as Big Data, regardless of how difficult it may be to process. To me, that aspect seems more like a common occurrence than a definitive characteristic.
So, what about consumer behavior? Much of the Big Data discussion seems to surround predictive analytics for consumer behavior. Does that make something Big Data? The short answer is no. Traditional data analytics have been used for decades for market research on consumer behavior; so, that activity, itself, doesn't imply Big Data. However, today's consumer analytics very well may push into the realm of Big Data, particularly any effort that attempts to assemble wide arrays of consumer information, including social media, Internet reviews and blog postings into a collection that can be analyzed for predictive, behavioral patterns.
Let me be clear that Big Data initiatives are complex. By their very nature, they require time. For starters, one has to accumulate Big Data itself. Big Data doesn't magically appear for analytics consideration. The first phase of moving into Big Data involves developing a strategy for accumulating massive amounts of data – data that's too large for your traditional RDBMS to store or analyze. To do that, one must figure out where (technology-wise) one will accumulate this data.
So, what technologies are in play when Big Data is handled? Because Big Data, per the definition above, is a collection that is "difficult to process" with on-hand database management technologies (which I prefer to think of as "traditional RDBMS" technologies), it requires non-traditional approaches and non-traditional tools. Processing Big Data is complex. There's no simple, single "Big Data" tool. With that said, new technologies have emerged to help with the problem. Specifically, Hadoop, NoSQL database technologies and massively parallel analytic engines (supporting massively parallel processing, or MPP) apply here.
NoSQL databases have emerged for at least a couple of reasons. One reason was to get around the ACID requirements inherent in traditional RDBMS products. Relaxing them allows data to be pipelined into storage with techniques that are much better aligned with the needs of a system accumulating massive amounts of data. Another reason was to better accommodate unstructured data. Traditional databases love rigid structure. Sure, they offer ways to store unstructured information, but that's not really what they were designed for.
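The structure point can be shown with a toy comparison in Python. This is only a sketch: the in-memory SQLite table stands in for a rigid relational schema, a plain list of JSON strings stands in for a document-style NoSQL store, and the field names (user_id, kind, stars, comment) are invented for the example. It says nothing about how any particular NoSQL product works internally.

```python
import json
import sqlite3

# Two hypothetical events; the second carries fields the first does not.
events = [
    {"user_id": "a", "kind": "click"},
    {"user_id": "b", "kind": "review", "stars": 4, "comment": "great"},
]

# Rigid structure: the table only knows the columns declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, kind TEXT)")

rejected = []
for event in events:
    columns = ", ".join(event)
    marks = ", ".join("?" for _ in event)
    try:
        conn.execute(
            f"INSERT INTO events ({columns}) VALUES ({marks})",
            list(event.values()),
        )
    except sqlite3.OperationalError:
        # The table has no "stars" or "comment" column, so this row is refused.
        rejected.append(event)

# Document-style stand-in: every record is stored whole, whatever its shape.
document_store = [json.dumps(event) for event in events]
with_stars = [doc for doc in map(json.loads, document_store) if "stars" in doc]

print(len(rejected), len(document_store), len(with_stars))
```

The relational table refuses the record it wasn't designed for until someone changes the schema, while the document-style store simply keeps it and leaves interpretation to whoever reads it later.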
Hadoop offers a Java-based software framework for storing, processing and analyzing massive amounts of distributed, unstructured data. Hadoop was inspired by Google's MapReduce technology (and retains its programming model), which was created in the early 2000s for indexing the web. As opposed to traditional RDBMS data processing, Hadoop breaks up Big Data into multiple parts so that each part can be processed and analyzed concurrently. This allows it to scale out horizontally on a massive scale, unlike traditional database analysis approaches.
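For readers who haven't seen the MapReduce programming model, here is a minimal word-count sketch in Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase) are my own, and a thread pool on one machine stands in for the many machines a real cluster would use. It only illustrates the idea of splitting the data into parts, processing each part independently, and then combining the results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Turn one chunk of text into (word, 1) pairs, independently of other chunks."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Group all emitted values by key, across every chunk."""
    groups: dict[str, list[int]] = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# The "Big Data", already broken into parts.
chunks = ["big data is big", "data is data"]

# Each chunk is mapped concurrently; no map task needs to see another's input.
with ThreadPoolExecutor() as pool:
    mapped = list(pool.map(map_phase, chunks))

counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because no map task depends on any other, adding more chunks and more workers is, in principle, all it takes to handle more data. That independence is what makes the horizontal scale-out possible.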
Massively Parallel Processing (MPP) technologies and architectures provide a highly scalable, "shared-nothing" approach that scales horizontally by adding nodes rather than by adding CPUs or processing power to a single node. These technologies, like Teradata, are capable of higher data ingestion rates through parallelized data processing. They, too, are designed to allow for massive horizontal scale-out.
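A rough Python sketch of the "shared-nothing" idea follows. It is an illustration under my own assumptions, not a description of how Teradata or any other MPP product is built: rows are assigned to nodes by hashing a key, each node totals only the rows it owns, and a coordinator merges the partial results. The names node_for, load, local_totals and total_sales are hypothetical.

```python
import zlib
from collections import Counter


def node_for(key: str, node_count: int) -> int:
    """Pick the node that owns a key, using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def load(rows: list[tuple[str, int]], node_count: int) -> list[list[tuple[str, int]]]:
    """Distribute rows so that each node holds its own private partition."""
    nodes: list[list[tuple[str, int]]] = [[] for _ in range(node_count)]
    for customer, amount in rows:
        nodes[node_for(customer, node_count)].append((customer, amount))
    return nodes


def local_totals(partition: list[tuple[str, int]]) -> Counter:
    """Work done on one node, touching only that node's rows."""
    totals: Counter = Counter()
    for customer, amount in partition:
        totals[customer] += amount
    return totals


def total_sales(nodes: list[list[tuple[str, int]]]) -> Counter:
    """Coordinator step: merge the per-node partial totals."""
    merged: Counter = Counter()
    for partition in nodes:
        merged.update(local_totals(partition))
    return merged


# Hypothetical sales rows: (customer, amount).
rows = [("ann", 10), ("bob", 5), ("ann", 7), ("cy", 1)]

print(total_sales(load(rows, node_count=2)))
print(total_sales(load(rows, node_count=4)))
```

The answer is the same with two nodes or four. Growing the system means raising node_count, not buying a bigger single machine, which is the scaling property the paragraph above describes.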
Once Big Data is accumulated, strategies for making sense out of the data must be employed. This can require years of exercising the scientific method: hypothesize, test, compare results to hypothesis, learn lessons and repeat to extract value. Here, the full gamut of computer science has its chance to be valuable: techniques that infer intelligence, correlations and meaning from and among structured and unstructured data all apply. The result can be aggregations, summarizations, probabilities and other approaches that send reduced data backward into traditional technologies for more traditional statistical analysis, exploration and visualization. And most likely, it never stops.
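The "reduced data sent backward into traditional technologies" step might look something like the following Python sketch. Everything here is assumed for illustration: the raw events are a tiny hypothetical sample standing in for a huge collection, the summary is a simple count per day and event kind, and an in-memory SQLite database plays the part of the traditional RDBMS.

```python
import sqlite3
from collections import Counter

# Hypothetical raw events: (day, event kind). In practice this would be far
# too large for the relational database that receives the summary.
raw_events = [
    ("day1", "view"), ("day1", "view"), ("day1", "purchase"),
    ("day2", "view"), ("day2", "purchase"), ("day2", "purchase"),
]

# Reduction step: collapse the raw events into one count per (day, kind).
summary = Counter(raw_events)

# Hand the much smaller summary to a traditional relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_summary (day TEXT, kind TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO daily_summary VALUES (?, ?, ?)",
    [(day, kind, n) for (day, kind), n in summary.items()],
)

# From here on, ordinary SQL and ordinary reporting tools apply.
purchases = conn.execute(
    "SELECT day, n FROM daily_summary WHERE kind = 'purchase' ORDER BY day"
).fetchall()
print(purchases)
```

Six raw rows become four summary rows here; at real scale the ratio is what makes the traditional tooling usable again.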
With all this said, we can now see that while there's no specific technology or capability that defines what Big Data is, it's also apparent that if your data isn't necessitating the use of NoSQL databases, Hadoop analytics, MPP technologies or other non-traditional technology, you may not really have a Big Data case - yet. Note, however, that just because you are playing with any or all of the above technologies, it doesn't necessarily follow that you're handling Big Data. You may be using Big Data techniques, though, that would prepare you for Big Data.
At the end of the day, the term Big Data, itself, is useful in differentiating traditional data analytics from modern-day data collections that require massively scalable technologies to render any useful analytical result. At least that's my conclusion. Let me wrap this up with my best answers to the questions I posed above.
What is Big Data? Data too big to be handled with traditional database technologies.
Is it an amount of data? Not a specific number of bytes, no.
Is it a concept? To some degree, it's a concept requiring "new" techniques.
Is it a technology? Not really, but it necessitates new technologies.
Is it a certain type of data? No.
Is it consumer behavioral data? It could be but not necessarily. Increasingly that will be no.
When someone says they're working on Big Data, what do they mean? They mean they are dealing with data so large it necessitates non-traditional data analytics.
Are they working on predictive analytics? Possibly, but not necessarily.
Are they applying statistical/econometrical theory to data? Possibly, but not necessarily.
Are they doing a BI project? Not necessarily.
Are they building a data warehouse? Probably, but that may be the "reduced" output.
Are they working on artificial intelligence? Possibly, but most likely no.
Are there specific elements that make something Big Data versus not Big Data? No. Big Data can be transactional data, non-transactional/unstructured data or both, and there can be many technologies in play, none of which is individually required for Big Data to be an appropriate name.
Does Big Data require non-transactional, unstructured data accumulation or analysis? No.
Does Big Data require that one be dealing with a certain amount of data? Not exactly, but generally yes in the sense that it must be enough data to be problematic with traditional database technologies.
Does Big Data require the use of specific technologies? Not specifically, but it necessitates some type of new approach that goes beyond traditional data analytics.
Does Big Data specifically imply integrating data from social media? No, not necessarily, but that's a common source of data.
And finally, are any of those questions important or relevant? More pointedly, is the term Big Data important? Yes, assuming that we can all agree on my conclusions above :-).