What is this larger-than-normal data you speak of?
Big data is an evolving IT industry buzzword that, at its core, refers to incredibly large collections of data that require advanced data management tools, as well as skilled professionals, to maintain and use effectively. The term encompasses the data collections themselves, the technologies used to wrangle them, and the people who manage the data, analyze it, and make recommendations or decisions based on that analysis.
The recent evolution of big data has been largely driven by four key factors:
• The increasing capacities and dropping costs of data storage.
• The explosive growth of mobile devices that sense and upload data.
• New tools that go beyond traditional relational database management systems.
• Big business buying into the value of big data.
Like other computing technologies, data storage has continued to grow in capacity and drop in price since its inception. Today's home user can easily add several terabytes of storage to a home computer, or sign up for free online cloud storage from a number of vendors. Large businesses and scientific organizations can have data collections measured in petabytes (1,000 terabytes) and even exabytes (1,000 petabytes).
For example, it's been reported that the World of Warcraft MMORPG uses 1.3 petabytes of storage to maintain the game and player data. Earlier this year, Facebook stated in a technology blog post that its data warehouse is around 300 petabytes in size. On a slightly more sinister note, sources have claimed that the new data center being built by the National Security Agency (NSA) at Camp Williams (near the town of Bluffdale in Utah, about 20 minutes south of Salt Lake City) will have several exabytes of storage available.
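Those units are easier to grasp with a little arithmetic. Here is a quick back-of-the-envelope sketch in Python, using the decimal definitions quoted above (1 PB = 1,000 TB, 1 EB = 1,000 PB) to express the reported figures in a common unit:

    # Decimal storage units, as defined above.
    TB = 10**12      # 1 terabyte in bytes
    PB = 1_000 * TB  # 1 petabyte = 1,000 terabytes
    EB = 1_000 * PB  # 1 exabyte  = 1,000 petabytes

    # The figures reported above, expressed in terabytes.
    print(1.3 * PB / TB)  # World of Warcraft: 1,300 TB
    print(300 * PB / TB)  # Facebook's data warehouse: 300,000 TB
    print(EB / TB)        # a single exabyte: 1,000,000 TB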
Many large online data collections are constantly being fed by our ubiquitous mobile devices. Smartphones and tablets have several onboard sensors that gather and send data to various sources nearly every time the device is operational.
In large urban centers, vehicles interact with city traffic networks that use above- and below-ground sensors to continually gather data on commuter trends. Metropolitan and private security systems capture and store video and audio data every second of every hour.
Dealing with huge data collections has traditionally required the use of supercomputers, large and expensive machines capable of massive numbers of calculations per second. Supercomputers are still used in several high-end industries and certain areas of scientific research, but other solutions have been developed in recent years. Now, more than ever before, it's possible for less well-funded organizations to access and analyze big data.
One such solution is Hadoop, an open-source software framework from the Apache Software Foundation that can be used to store and process massive data sets on large clusters of relatively inexpensive computer hardware. Hadoop has given smaller companies and science departments the ability to establish and work with large data collections while staying within more modest budgets.
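Hadoop's core programming model, MapReduce, splits a job into a "map" step that runs in parallel across the cluster and a "reduce" step that aggregates the results. To give a flavor of what that looks like, here is a minimal word-count sketch written in Python in the style of Hadoop Streaming, which lets any executable act as mapper or reducer. The file name wordcount.py and the local simulation pipeline are illustrative assumptions, not part of Hadoop itself:

    #!/usr/bin/env python3
    # Minimal MapReduce-style word count. Hadoop Streaming pipes each input
    # split through the mapper, sorts and groups the output by key, then
    # pipes it through the reducer. Simulate that flow locally with:
    #   cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
    import sys

    def mapper():
        # Emit one "word<TAB>1" pair per word; the framework groups by key.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # Input arrives sorted by key, so counts for a word are contiguous.
        current, count = None, 0
        for line in sys.stdin:
            word, _, n = line.rstrip("\n").partition("\t")
            if word != current:
                if current is not None:
                    print(f"{current}\t{count}")
                current, count = word, 0
            count += int(n)
        if current is not None:
            print(f"{current}\t{count}")

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

The point of the pattern is that neither function needs to know how large the data set is or how many machines it spans; Hadoop handles the distribution, which is what makes clusters of cheap hardware viable.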
For any new enterprise-level technology to take off, however, it needs to gain acceptance with big business. Big data has been adopted by an increasing number of companies because it is based around the coin of the modern business realm: information. The information economy is one of the largest in the world, and every competitor is looking to gain a technological edge. Big data offers greater data storage capacity, faster data transfer and manipulation, and rich data analytics usually built directly into the system.
Better yet, a number of large players like IBM and Microsoft have begun offering big data as a service. (Maybe you've seen the commercials: IBM is using British actor Dominic Cooper as its big data pitchman.) This makes it possible for organizations to get the benefits of big data without making large investments of their own in personnel and equipment.
That said, many companies have taken the plunge and established their own big data infrastructure. One interesting result is the rise of a new job role: the data scientist. Data scientists are part traditional data analyst and part interpretive prognosticator, expected to make accurate judgments about current trends and well-founded predictions about future ones.
The role of data scientist has been dismissed by some industry pundits as essentially fictional, too much like the geeky data savant portrayed in several recent primetime television shows. There is no denying the industry adoption of this job title, however, and a recent search for data scientist job postings on Indeed.com turned up nearly 12,000 results.
Stripped of its buzz, big data is a recognizable and useful technological advance. For businesses, big data enhances traditional functions such as marketing, sales, and service delivery. When it comes to science, big data has given more people the ability to do research and predictive modeling that previously required expensive supercomputer access.
When it comes to you, big data is already in the certification mix, with credentials like Cloudera Certified Professional: Data Scientist (CCP:DS) and EMC: Data Science Associate (EMC:DSA) floating around. If you're drawn to the concepts and challenges discussed above, then certification could be your best next step.