Introduction to Big Data & the use of Hadoop
Introduction:
1. Hadoop is an open-source framework and an Apache product.
2. It is user-friendly, reliable, and written in Java.
3. Whenever we deal with distributed processing of large datasets across clusters of computers, it uses simple programming models to accomplish the task.
4. Its architecture is simple yet sophisticated, designed so that it can scale up from a single server to thousands of machines, each offering local computation and storage.
5. To understand what exactly Hadoop is, we first need to understand the issues with Big Data and traditional processing systems, since these are a major motivation for Hadoop.
As technology advances day by day, we need to understand the importance of Hadoop and the strategy by which it provides solutions to the problems associated with Big Data. I will also discuss the CERN case study to highlight the benefits of using Hadoop.
Problems with the Traditional Approach:
1. In the traditional approach, the main issue was handling the heterogeneity of data, i.e., structured, semi-structured, and unstructured.
2. In this approach, the user interacts with the application, which in turn handles data storage and analysis.
3. It mainly suffers from the problem of storing colossal amounts of data.
4. It also has problems storing heterogeneous data.

In traditional processing, access and processing speed are also major problems, especially when the data size is very large.
Limitations/problems of traditional processing:
1. This approach works fine for applications that process less voluminous data, which standard database servers can accommodate.
2. It is limited by the capacity of the single processor that handles the data.
3. But when it comes to huge amounts of scalable data, processing everything through a single database becomes a bottleneck.
Big Data approaches are therefore considered a better solution than the traditional approach. A few major points are discussed below.
- Big Data is an emerging technology used by most organizations.
- It is basically a collection of large datasets that cannot be processed using traditional computing techniques.
- It is not a single technique or tool; rather, it has become a complete subject involving various tools, techniques, and frameworks.
- Organizations now examine large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information.
What is Hadoop?
Hadoop is basically a framework that allows you to store Big Data in a distributed environment so that you can process it in parallel. There are basically two components in Hadoop:
- HDFS (Hadoop Distributed File System) for storage, which allows you to store data of various formats across a cluster.
- YARN for resource management in Hadoop, which allows parallel processing over the data stored in HDFS.
As we discussed earlier, Hadoop is an open-source software framework mainly used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under the Apache v2 license.
Hadoop was developed based on the paper written by Google on the MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and ranks among the top-level Apache projects. Hadoop was created by Doug Cutting and Michael J. Cafarella.
Hadoop-as-a-Solution:
Let’s understand how Hadoop provides a solution to the Big Data problems that we have discussed so far.
The major challenges associated with Big Data are as follows:
- Capturing data
- Curation
- Storage
- Searching
- Sharing
- Transfer
- Analysis
- Presentation
How to store huge amounts of data:
- Hadoop's HDFS provides a distributed way to store Big Data.
- The data is stored in blocks on DataNodes, and you specify the size of each block.
- For example, suppose you have 512 MB of data and have configured HDFS to create 128 MB data blocks.
- HDFS will divide the data into 4 blocks (512/128 = 4) and store them across different DataNodes.
- While storing these data blocks on DataNodes, each block is replicated on different DataNodes to provide fault tolerance.
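The block arithmetic and replication idea above can be sketched in a few lines of Python. This is a toy model, not the HDFS API: the block size, replication factor, and DataNode names are illustrative assumptions, and the round-robin placement is a simplification of HDFS's real rack-aware policy.

```python
import math

def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the number of HDFS-style blocks needed for a file."""
    return math.ceil(file_size_mb / block_size_mb)

def place_replicas(num_blocks, nodes, replication=3):
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(512)                    # 512 / 128 = 4 blocks
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]        # hypothetical DataNodes
print(blocks)                                      # 4
print(place_replicas(blocks, nodes))
```

With the default replication factor of 3, losing any single DataNode still leaves two copies of every block, which is exactly the fault tolerance the bullet above describes.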
How to store a variety of data:
- HDFS in Hadoop is capable of storing all kinds of data, whether structured, semi-structured, or unstructured.
- In HDFS, there is no schema validation before dumping the data.
- It also follows a write-once, read-many model.
- Because of this, you can write any kind of data once and read it multiple times to find insights.
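The "no pre-dumping schema validation" point is often called schema-on-read: raw bytes are written as-is, and a schema is applied only when reading. A minimal sketch of the idea, with invented sample records (not real HDFS calls):

```python
import json

# Write once: dump heterogeneous records as raw lines, with no upfront schema check.
raw_store = [
    '{"id": 1, "name": "Alice"}',     # semi-structured (JSON)
    "id,name\n2,Bob",                 # structured (CSV)
    "free-form log line: disk full",  # unstructured text
]

# Read many: apply a schema only at read time (schema-on-read).
def read_as_json(line):
    """Interpret a raw line as JSON, skipping lines that don't fit this schema."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

parsed = [rec for rec in (read_as_json(line) for line in raw_store) if rec]
print(parsed)  # [{'id': 1, 'name': 'Alice'}]
```

The same `raw_store` could later be re-read with a CSV or log-parsing schema instead, which is why writing once and reading many times with different interpretations is useful for finding insights.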
How to process the data faster:
In this case, Hadoop moves the processing unit to the data instead of moving the data to the processing unit.

So what does moving the computation unit to the data mean?

It means that instead of moving data from different nodes to a single master node for processing, the processing logic is sent to the nodes where the data is stored, so that each node can process its part of the data in parallel. Finally, all the intermediate output produced by each node is merged together and the final response is sent back to the client.
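This flow can be sketched as a single-process toy model: each list below stands in for a node's local data partition, the word-count function is the "processing logic" shipped to each partition, and the merge step combines the intermediate outputs. This is an illustration of the idea, not real MapReduce or YARN, and the sample data is invented.

```python
from collections import Counter

# Each "node" holds its own partition of the data.
node_partitions = [
    ["big data", "hadoop stores big data"],
    ["hadoop processes data in parallel"],
]

def process_locally(lines):
    """The processing logic sent to each node: count words in its partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Each node produces intermediate output on its own data
# (real Hadoop runs these in parallel; here they run one after another).
intermediate = [process_locally(partition) for partition in node_partitions]

# Merge the intermediate outputs into the final response for the client.
final = sum(intermediate, Counter())
print(final["data"])  # 3
```

Only the small intermediate `Counter` objects travel to the merge step; the raw lines never leave their "node", which is the whole point of moving computation to the data.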
Where is Hadoop used:
As we discussed earlier, Hadoop is a framework, and it is mostly used for:

- Designing search engines, as it has the ability to process huge volumes of data; it is used for search at Yahoo, Amazon, and Zvents.
- Designing log processing environments, like those Facebook and Yahoo have.
- Building data-warehouse-based application layers, like those Facebook and AOL have.
- Video and image analysis applications, as these require heavy processing.
When not to use Hadoop:
Following are some cases where experts do not recommend using Hadoop:

1. Low-latency data access: quick access to small parts of the data.
2. Multiple data modifications: Hadoop is a good fit only if we are primarily concerned with reading data, not modifying it.
3. Lots of small files: Hadoop is suitable for scenarios where we have a few large files.
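The "lots of small files" limitation comes from the NameNode, which keeps metadata for every file and block in memory. A commonly cited rule of thumb, assumed here, is roughly 150 bytes of NameNode heap per file or block object; the file counts below are made up for illustration.

```python
BYTES_PER_OBJECT = 150  # rough rule of thumb for NameNode heap per file/block

def namenode_heap_mb(num_files, blocks_per_file=1):
    """Estimate NameNode memory (MB) for file + block metadata."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1024 / 1024

# 10 million tiny one-block files vs. the same data as 10,000 large files:
print(round(namenode_heap_mb(10_000_000)))   # ~2861 MB of heap for metadata alone
print(round(namenode_heap_mb(10_000, 1000))) # ~1432 MB, despite far more blocks
```

Millions of tiny files can exhaust NameNode memory long before the cluster's disks fill up, which is why a few large files suit Hadoop better than many small ones.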
After knowing the best-suited use cases, let us move on and look at a case study where Hadoop has done wonders.
Scope @ N9 IT Solutions:
- N9 IT Solutions is a leading IT development and consulting firm providing a broad array of customized solutions to clients throughout the United States.
- We were established primarily with the aim of providing consulting and IT services in today's dynamic environment.
- N9 IT also offers consulting services in Java/J2ee, Cloud Computing, Database Solutions, DevOps, ERP, Mobility, Big Data, Application Development, Infrastructure Managed Services, Quality Assurance and Testing.