
Big Data and Hadoop - An Overview!

Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems.

It is not a database itself, but rather a software ecosystem that allows for massively parallel computing. It is an enabler of certain types of NoSQL distributed databases (such as HBase), which can allow data to be spread across thousands of servers with little reduction in performance.

Difference between Big Data and Hadoop

Big Data is a concept that refers to handling very large data sets, whereas Hadoop is just one framework among dozens of tools for doing so, and it is primarily used for batch processing. The difference between big data and the open source software Hadoop is therefore a distinct and fundamental one: one names the problem, the other is a particular technology for tackling it.

Hadoop runs on clusters of commodity servers and can scale up to support thousands of hardware nodes and massive amounts of data. It uses a namesake distributed file system that’s designed to provide rapid data access across the nodes in a cluster, plus fault-tolerant capabilities so applications can continue to run if individual nodes fail. Consequently, Hadoop became a foundational data management platform for big data analytics uses after it emerged in the mid-2000s.

Hadoop was created by computer scientists Doug Cutting and Mike Cafarella, initially to support processing in the Nutch open source search engine and web crawler. After Google published technical papers detailing its Google File System (GFS) and MapReduce programming framework in 2003 and 2004, respectively, Cutting and Cafarella modified earlier technology plans and developed a Java-based MapReduce implementation and a file system modeled on Google’s.

In early 2006, those elements were split off from Nutch and became a separate Apache subproject, which Cutting named Hadoop after his son’s stuffed elephant. At the same time, Cutting was hired by internet services company Yahoo, which became the first production user of Hadoop later in 2006. (Cafarella, then a graduate student, went on to become a university professor.)

Use of the framework grew over the next few years, and three independent Hadoop vendors were founded: Cloudera in 2008, MapR a year later and Hortonworks as a Yahoo spinoff in 2011. In addition, AWS launched a Hadoop cloud service called Elastic MapReduce in 2009. That was all before Apache released Hadoop 1.0.0, which became available in December 2011 after a succession of 0.x releases.

Nowadays, Big Data and Hadoop are among the hottest skills. Companies are looking for skilled professionals because most of them have started using Hadoop and need well-qualified candidates. Today, big data is in demand everywhere.

Hadoop Architecture

A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. Though it is possible to have data-only worker nodes and compute-only worker nodes, a slave or worker node acts as both a DataNode and TaskTracker. In a larger cluster, the Hadoop Distributed File System (HDFS) is managed through a dedicated NameNode server to host the file system index, and a secondary NameNode that can generate snapshots of the NameNode’s memory structures, thus preventing file-system corruption and reducing loss of data.
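
To see how a client interacts with this architecture, here is a minimal sketch using the HDFS Java API that writes and then reads a small file. The client contacts the NameNode for metadata and block locations, while the data itself is streamed to and from DataNodes. The NameNode address (hdfs://namenode:8020) and the file path are placeholder assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the NameNode; "namenode:8020" is a hypothetical address.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt"); // hypothetical HDFS path

        // Write: the NameNode allocates blocks, the bytes are streamed to DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, the bytes come from DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```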

Apache Hadoop Architecture

The Apache Hadoop framework comprises:

• Hadoop Common – Contains libraries and utilities needed by other Hadoop modules

• Hadoop Distributed File System (HDFS) – A distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster

• Hadoop YARN – A resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users’ applications

• Hadoop MapReduce – A programming model for large-scale data processing (a minimal example follows this list)
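
As a concrete illustration of the MapReduce programming model, below is a sketch of the standard word-count job written against the org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs and the reducer sums the counts for each word. Input and output HDFS paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a JAR, a job like this is typically submitted with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>, with both directories living in HDFS.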

Hadoop Features

Big Data is classified in terms of:

  • Volume – Data sizes have grown to terabytes and beyond, in the form of records or transactions.
  • Variety – Data comes in a huge variety of forms (internal, external, behavioral, social) and can be structured, semi-structured or unstructured.
  • Velocity – Data arrives in huge volumes and must be assimilated in near real time.

Reliability: 

When machines work in tandem and one of them fails, another machine takes over its responsibility, so processing continues in a reliable and fault-tolerant fashion. The Hadoop infrastructure has these fault-tolerance features built in, and hence Hadoop is highly reliable.
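
Much of this reliability comes from HDFS block replication, which by default keeps three copies of each block on different nodes. As a rough sketch (with a hypothetical NameNode address and file path), the replication factor can be set cluster-wide through the client Configuration or adjusted per file via the FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        conf.set("dfs.replication", "3");                 // default copies for files this client creates

        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of an existing (hypothetical) file to 4 copies.
        fs.setReplication(new Path("/user/demo/important.csv"), (short) 4);
        fs.close();
    }
}
```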

Economical:

Hadoop uses commodity hardware (like your PC or laptop). For example, in a small Hadoop cluster, all your DataNodes can have normal configurations such as 8-16 GB of RAM, 5-10 TB of disk and Xeon processors, whereas using hardware-based RAID with Oracle for the same purpose would cost at least five times more. So the cost of ownership of a Hadoop-based project stays low, and the Hadoop environment is also easier to maintain. In addition, Hadoop is open source software, so there is no licensing cost.

Scalability: 

Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if you install Hadoop on a cloud, you don’t need to worry about scalability: you can procure more hardware and expand your setup within minutes whenever required.

Flexibility: 

Hadoop is very flexible in its ability to deal with all kinds of data. We discussed “Variety” in our previous blog on Big Data: data can be of any kind, and Hadoop can store and process it all, whether structured, semi-structured or unstructured.

Big data tools associated with Hadoop

  • Apache Flume: a tool used to collect, aggregate and move huge amounts of streaming data into HDFS;
  • Apache HBase: a distributed NoSQL database that is often paired with Hadoop (a small client sketch follows this list);
  • Apache Hive: an SQL-on-Hadoop tool that provides data summarization, query and analysis;
  • Apache Oozie: a server-based workflow scheduling system to manage Hadoop jobs;
  • Apache Phoenix: an SQL-based massively parallel processing (MPP) database engine that uses HBase as its data store;
  • Apache Pig: a high-level platform for creating programs that run on Hadoop clusters;
  • Apache Sqoop: a tool to help transfer bulk data between Hadoop and structured data stores, such as relational databases; and
  • Apache ZooKeeper: a configuration, synchronization and naming registry service for large distributed systems.
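
As an example of how one of these tools is used from application code, here is a hedged sketch of the HBase Java client API that writes and then reads a single cell. It assumes a table named "users" with a column family "info" already exists; both names are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster/ZooKeeper settings.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Write one cell: row key "user1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back.
            Get get = new Get(Bytes.toBytes("user1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```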

More information about the Hadoop certification exam

Here is some useful information for preparing for the Hadoop certification exam.

Information and image source: www.quora.com, google.co.in and www.atulhost.com
