Abstract

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.
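To make the streaming access model in the abstract concrete, here is a minimal client sketch using Hadoop's public FileSystem API (org.apache.hadoop.fs). It is an illustration added for this summary, not code from the paper; the NameNode URI and file path are hypothetical placeholders.

// Minimal sketch: streaming a file's contents from HDFS through the
// standard Hadoop FileSystem API. Not from the paper; the cluster URI
// and path below are hypothetical placeholders.
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamingRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Hypothetical NameNode endpoint; a real client usually takes
        // this from its cluster configuration (fs.defaultFS).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/data/sample.log"); // hypothetical path
        byte[] buf = new byte[128 * 1024];
        long total = 0;
        try (FSDataInputStream in = fs.open(file)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n; // application logic would consume buf[0..n) here
            }
        }
        System.out.println("Read " + total + " bytes from " + file);
    }
}

Reads proceed block by block: the client asks the NameNode for block locations and then streams data directly from the DataNodes holding the replicas, which is what lets aggregate read bandwidth scale with cluster size.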

Keywords

Computer science, Petabyte, Server, Distributed File System, Operating system, Distributed data store, File system, File server, Distributed database, Database, Distributed computing, Big data

Publication Info

Year: 2010
Type: Article
Pages: 1-10
Citations: 4766 (OpenAlex)
Access: Closed

Cite This

Konstantin V. Shvachko, Hairong Kuang, Sanjay Radia et al. (2010). The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 1-10. https://doi.org/10.1109/msst.2010.5496972

Identifiers

DOI: 10.1109/msst.2010.5496972