While the term “big data” is often considered a buzzword, the intention behind the term and the concept itself are not quite so new. Simply put, “big data” is the act of gathering a large volume of information and storing it for later processing, specifically data sets so large that traditional methods for organizing and processing the data prove inadequate. One might argue that the creation of libraries was one of the first practices of what we now call “big data,” as that quantity of data in written form far exceeded what anyone until then would have encountered, and it forced the first librarians to develop new schemes to organize it. While in the distant past this information might have been measured in the number of scrolls or books stored, we now measure it in terabytes and petabytes, and occasionally even exabytes. Suffice it to say, when we speak of “big data,” we speak of data sets so large that they prove problematic for the conventional means of managing data.
Big Data: A Constantly Moving Target
Because of the nature of its definition, when to use the term is not something that can be pinned down. There is no stable minimum size above which one can say that a data set officially qualifies as “big data.” It depends on the current levels of processing power, the nature of the data itself, and what the end product is supposed to be. The exponential growth of data over time complicates matters further.
As technology advances, the rate at which data is produced increases, and as this production rate increases, advanced technology can harness this data to advance technology at an even faster rate. It is a self-accelerating cycle.
We’re seeing this culminate (currently anyway) in the internet of things. Data is being produced and consumed by an ever-growing list of devices via the internet, and this data is often captured by companies to form data sets of unprecedented size. We’re seeing two major forces working to expand this data.
First, the adoption of internet-connected embedded systems spreads awareness of their existence, even to “non-geeks,” further increasing their adoption. Second, this adoption drives down cost, which, in turn, also leads to further adoption. The result is an ever-accelerating increase in the amount of data produced and transmitted.
The other side of this is that as technology adapts to big data, data sets that were previously considered nearly intractably large become trivial to store, and even to manage, after some time.
We can see this even on a consumer scale. Only ten years ago, the idea that a consumer could even store 500 GB on a personal computer would have seemed laughable. Now it’s considered a little on the smaller side for a home PC. The search features of consumer operating systems have also incorporated indexing to better cope with the markedly increased amount of data consumers can now store on their PCs, among other adaptations and innovations.
So What is This Used For?
Glad you asked. Most commonly, big data is used in three major applications:
- Descriptive statistics
- Predictive analytics
- Machine learning
Descriptive statistics is a discipline that takes a data set and uses quantitative analysis to describe features of the data set itself. Figures describing the performance of athletes are a commonly encountered form: average points scored per game and batting average are both examples of descriptive statistics.
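To make the idea concrete, here is a minimal sketch in Python that computes a few descriptive statistics for an athlete. The numbers are made up for illustration, and the `batting_average` helper is a hypothetical name that simply applies the usual definition (hits divided by at-bats).

```python
from statistics import mean, median, stdev

# Made-up data: points scored by one player across eight games.
points_per_game = [12, 18, 9, 22, 15, 30, 11, 19]

# Made-up data: a batter's hits and official at-bats.
hits, at_bats = 45, 150


def batting_average(hits: int, at_bats: int) -> float:
    """Hits divided by at-bats; returns 0.0 if there were no at-bats."""
    return hits / at_bats if at_bats else 0.0


# Each value below describes the data set itself; none of them predicts anything.
summary = {
    "mean_points": mean(points_per_game),
    "median_points": median(points_per_game),
    "stdev_points": stdev(points_per_game),
    "batting_average": batting_average(hits, at_bats),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

At big data scale the same quantities would typically be computed by a distributed system rather than in memory, but the statistics themselves are no different.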
Predictive analytics, on the other hand, is a discipline that uses data to construct models that are themselves used to predict the future. Generally these predictions are informed by the trends identified by descriptive statistics.
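One of the simplest possible predictive models is a straight line fitted to past observations and extended forward. The sketch below assumes made-up monthly data volumes and uses ordinary least squares; the function names `fit_line` and `predict` are illustrative, and real predictive analytics usually involves far richer models than this.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Extend the fitted trend to a value of x that has not been observed."""
    return slope * x + intercept


# Made-up data: terabytes stored at the end of each of five months.
months = [1, 2, 3, 4, 5]
terabytes = [2.0, 4.1, 5.9, 8.2, 9.8]

slope, intercept = fit_line(months, terabytes)
forecast = predict(6, slope, intercept)
print(f"Forecast for month 6: {forecast:.2f} TB")
```

The trend (the slope) is a descriptive statistic of the past data; using it to estimate month 6 is the predictive step.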
Machine learning is a field of computer science that has its roots within artificial intelligence, particularly pattern recognition. The goal of machine learning is to enable computer programs to “learn” without being explicitly programmed to do so, generally by recognizing patterns in data and using these patterns to predict results. This field can itself be broken down into three major categories:
- Supervised learning: A “teacher” gives the computer program a training data set along with the desired outputs. The goal of the machine learning algorithm becomes finding a general rule that maps the inputs to the desired outputs.
- Unsupervised learning: As the name implies, there is no explicit “teacher.” The algorithm first must uncover structure within the data. Sometimes that is, in fact, the purpose of using this method: teasing out hidden structures within ostensibly unstructured data.
- Reinforcement learning: This is sort of the hot-cold game for computers. There is no “teacher”; rather, the program must maximize some value called the “reward.”
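As a rough illustration of the first category only, the sketch below shows a very small supervised learner: a nearest-neighbor classifier. The training data, the labels, and the `predict` function are all made up for this example. The “teacher” is the labeled training set, and the “general rule” the program applies is simply “answer with the label of the most similar example.”

```python
import math

# Made-up training set: (height_cm, weight_kg) paired with the desired output.
# These labels play the role of the "teacher."
training_data = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]


def predict(point, examples):
    """Return the label of the training example closest to `point`."""
    _, nearest_label = min(
        examples,
        key=lambda example: math.dist(example[0], point),
    )
    return nearest_label


# Inputs the program has never seen before.
print(predict((152.0, 52.0), training_data))
print(predict((182.0, 88.0), training_data))
```

Nothing in the program says what counts as “small” or “large”; that mapping comes entirely from the labeled examples, which is what distinguishes this approach from explicit programming.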
Big Data and the Future
Going forward, big data will likely only increase in prominence. With applications ranging from targeted advertising to cancer research, the utility and power it holds are too attractive to ignore. But as with anything that has the potential to do so much good, its potential for great evil is equally large. Already we’re seeing big data used by the Chinese government to collect large amounts of detailed data on the behaviors of its citizenry, enabling it to conduct Orwellian surveillance. However, it’s important to remember that big data itself is neither good nor bad. It simply is. As with any other nascent discipline, we must decide as a society where the ethical boundaries lie.