The amount of data produced by mankind has grown dramatically over the years. From the beginning of recorded history until 2003, the total amount of data produced was no more than 5 billion gigabytes. By 2011, that same amount was being produced every two days, and by 2013, every ten minutes. Should we expect the same rate today? No doubt, the amount of data produced today is far beyond what was produced in previous years.

From the explanation above, big data simply means data at a very large scale: collections of huge datasets that cannot be processed using conventional computing techniques. In fact, big data is a subject in its own right, and it involves many tools, frameworks, and techniques.

What are the sources of big data?

The sources of big data include many different devices and applications. Highlighted below are some of the major sources of big data:
Search engine data – search engines collect a huge amount of data from the web every second, and all of it contributes to big data.
Social media data – the billions of posts, messages, and images generated on social media every day also contribute to big data.
Stock exchange data – the financial markets hold a lot of information about buying and selling, which also contributes to big data.
Black box data – the information collected by the black boxes of airplanes, helicopters, and jets also contributes to big data.
Other sources of big data include power grid data, transport data, and so on.

What is Hadoop?

Hadoop is open-source software for storing data and running applications on clusters of commodity hardware. Hadoop is a viable solution for storing all kinds of data, and it can process an enormous number of tasks in parallel.
The benefits of Hadoop
A framework that can store massive amounts of data and process many tasks at the same time naturally comes with plenty of advantages. Some of the notable benefits of Hadoop are discussed in the subsequent paragraphs.
Storage capacity – one of the major reasons why companies turn to Hadoop is its massive, easily expandable storage capacity. With the rate at which data is being produced, one needs reliable storage that can accommodate as much of it as possible.
Flexibility – Hadoop makes it possible for companies to store data without deciding in advance how to process it. You can store all kinds of data and decide how to use it later. This is a huge advantage over traditional relational databases, which require a schema up front.
Low cost – Hadoop is a free open-source software.
Computing power – No doubt, Hadoop is one of the most powerful data processing frameworks available today. Its numerous computing nodes make it possible to process huge volumes of data without issues.
Fault tolerance – Hadoop has the capacity to redirect jobs to another node if one goes down. This makes it possible to continue operating even if one or more nodes are faulty.
Scalability – You have the capacity to grow your system simply by adding more nodes.

Components of Hadoop
Presently, there are four core modules in the basic Hadoop framework from the Apache Software Foundation. The modules are:
MapReduce - MapReduce is a software programming model that is capable of processing large datasets in parallel.
Hadoop Distributed File System (HDFS) - is a Java-based file system that provides reliable and scalable data storage across multiple machines.
YARN – is an acronym for Yet Another Resource Negotiator. It is a framework for scheduling and handling requests from distributed applications.
Hadoop common – is the utilities and libraries used by other Hadoop modules.
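To make the MapReduce model above more concrete, here is a minimal, pure-Python sketch of a word count, the classic MapReduce example. It simulates the map, shuffle, and reduce phases in a single process; it is an illustration of the idea only, not the actual Hadoop MapReduce API.

```python
# A toy simulation of the MapReduce model (word count).
# Not the Hadoop API: just the map -> shuffle -> reduce idea.
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for each word.
    return (key, sum(values))

documents = ["big data needs big tools", "hadoop handles big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # e.g. counts["big"] == 3
```

In real Hadoop, the map tasks would run on the nodes holding the data, and the framework itself would perform the shuffle across the network; the programmer supplies only the map and reduce functions.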
There are some other notable software components that can run alongside Hadoop. Some of them are listed below:
Pig – helps to manipulate data stored in HDFS. It also provides a means for data extraction, loading, and transformation.
HBase – is a distributed database that operates on top of Hadoop.
Spark – is a cluster computing software with in-memory analytics.
Hive – is a data warehousing tool with a SQL-like query language (HiveQL), which presents data in table format.
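HDFS, mentioned above, achieves its reliability by splitting each file into fixed-size blocks and replicating every block on several nodes. The following toy sketch illustrates that storage model; the tiny block size, round-robin placement, and node names are assumptions made for the example (real HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware placement).

```python
# A toy illustration of HDFS-style storage: split a file into
# fixed-size blocks, then replicate each block on distinct nodes.
# Values here are illustrative, not real HDFS defaults.
from itertools import cycle

BLOCK_SIZE = 4   # bytes per block (tiny, for illustration only)
REPLICATION = 3  # copies kept of each block

nodes = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Chop the byte string into fixed-size chunks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Simple round-robin placement: each block gets `replication`
    # copies on distinct nodes (real HDFS is rack-aware; this is not).
    placement = {}
    starts = cycle(range(len(nodes)))
    for idx, _ in enumerate(blocks):
        start = next(starts)
        placement[idx] = [nodes[(start + r) % len(nodes)]
                          for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hadoop world")
placement = place_blocks(blocks, nodes)
print(len(blocks), placement[0])
```

Because every block lives on several nodes, the loss of a single machine does not lose any data, which is also what allows Hadoop to reroute work when a node fails.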