Saturday, January 7, 2012

What Is Apache Hadoop - HDFS And MapReduce? Explained

Apache Hadoop, or Hadoop for short, is an open source software framework written in the Java programming language. It supports data-intensive distributed applications and can process huge data sets across clusters of commodity nodes to produce meaningful results faster.

This is my first article on Apache Hadoop, in which I try to explain the Hadoop framework. In my next article I will cover the installation and configuration part, so stay tuned for that.

What Is Apache Hadoop?

Apache Hadoop is a free and open source software framework used to run data-intensive distributed applications across thousands of commodity nodes. It provides a programming model that lets programmers easily and effectively process large data sets and get results faster on a reliable, scalable architecture. Hadoop is designed to scale from a single server up to thousands of nodes and petabytes of data, with each node offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure. The Hadoop project was created by Doug Cutting and was inspired by Google's MapReduce and Google File System (GFS) papers.

Apache Hadoop is a trademark of the Apache Software Foundation (ASF). The best place to learn all about Hadoop is, of course, the Apache Hadoop project site, where you can find detailed information about the Hadoop software.

What Is HDFS?

HDFS stands for Hadoop Distributed File System, the storage system used by Hadoop. It is designed to reliably store very large files across the nodes of a large cluster and is inspired by the Google File System. HDFS can be deployed on low-cost hardware and is highly fault-tolerant. It provides high-throughput access to application data and is well suited to large data sets (gigabytes to terabytes in size).
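To get a feel for how HDFS stores a large file, here is a tiny Java sketch of the block model. This is only an illustration, not real HDFS code; it assumes the Hadoop 1.x-era defaults of a 64 MB block size and a replication factor of 3.

```java
// Toy model of how HDFS divides a file into fixed-size blocks and
// replicates each block across nodes. Illustration only, not HDFS code.
public class HdfsBlockModel {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, Hadoop 1.x-era default
    static final int REPLICATION = 3;                 // default replication factor

    // Number of blocks needed to store a file of the given size (ceiling division).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // a 200 MB file
        long blocks = blockCount(fileSize); // 200 MB / 64 MB -> 4 blocks
        // Each block is stored on REPLICATION different nodes for fault tolerance.
        System.out.println(blocks + " blocks, " + (blocks * REPLICATION) + " block replicas");
    }
}
```

Because every block lives on three different machines, the loss of any single node never loses data, which is why HDFS can run on unreliable commodity hardware.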

For more detailed information, visit the project URL.

What Is MapReduce?

MapReduce is a software framework used to process vast data sets in parallel on large clusters of commodity hardware (up to thousands of nodes) in a reliable, fault-tolerant manner. MapReduce splits the input data set into independent chunks, which are processed in parallel by map tasks on different nodes; the framework then groups the map outputs by key and passes them to the reduce tasks, which greatly increases processing speed.

Typically the compute nodes and the storage nodes are the same; that is, MapReduce and HDFS (Hadoop Distributed File System) run on the same set of nodes. This configuration allows the framework to schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster. Read more about MapReduce at the project site.
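The map-then-reduce flow described above can be sketched in plain Java with the classic word-count example. Note that this is a minimal in-memory toy, not Hadoop's actual `org.apache.hadoop.mapreduce` API; in a real job the chunks would be HDFS blocks and each map task would run on the node holding its chunk.

```java
import java.util.*;

// A minimal in-memory sketch of the MapReduce word-count model.
// Real Hadoop jobs use the org.apache.hadoop.mapreduce API; this toy
// version only illustrates the map -> group-by-key -> reduce flow.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input chunk.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static Map<String, Integer> run(List<String> chunks) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String chunk : chunks) {   // in Hadoop, each chunk is mapped on its own node
            all.addAll(map(chunk));
        }
        return reduce(all);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            run(Arrays.asList("hadoop stores data", "hadoop processes data"));
        System.out.println(counts); // {data=2, hadoop=2, processes=1, stores=1}
    }
}
```

Because each `map` call depends only on its own chunk, the map calls can run on many machines at once; only the final grouping and summing needs to see data from every chunk. That independence is what lets Hadoop parallelize the work across a cluster.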

source : ravisaive
