The Hadoop Distributed File System (HDFS) is considered a core
component of Hadoop, but it’s not an essential one. Lately, IBM has been
talking up the benefits of hooking Hadoop up to the General Parallel
File System (GPFS). IBM has done the work of integrating GPFS with
Hadoop. The big question is: what can GPFS on Hadoop do for you?
IBM
developed GPFS in 1998 as a SAN file system for use in HPC applications
and in its biggest supercomputers, such as Blue Gene, ASCI Purple,
Watson, Sequoia, and Mira. In 2009, IBM hooked GPFS up to Hadoop, and today
it offers the file system, which scales into the petabyte range and has more
advanced data management capabilities than HDFS, with InfoSphere
BigInsights, its collection of Hadoop-related offerings, as well as with
Platform Symphony.
Because GPFS was designed as a SAN file system, it would normally be a
poor fit for the direct-attached disks that make up a Hadoop cluster.
This is where an IBM GPFS feature called File Placement Optimization
(FPO) comes into play.
Phil Horwitz, a senior engineer at IBM’s Systems Optimization Competency Center, recently discussed
how IBM is using GPFS with BigInsights and System x servers, and in
particular how FPO is helping GPFS make inroads into Hadoop
clusters. (IBM has since sold the System x business to Lenovo, with which it now works closely on GPFS-based solutions, but the points still stand.)
According to Horwitz, FPO essentially emulates a key component of
HDFS: moving the application workload to the data. “Basically, it moves
the job to the data as opposed to moving data to the job,” he says in
the interview. “Say I have 20 servers in a rack and three racks. GPFS
FPO knows a copy of the data I need is located on the 60th server and it
can send the job right to that server. This reduces network traffic
since GPFS-FPO does not need to move the data. It also improves
performance and efficiency.”
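To make the locality idea concrete, here is a minimal sketch using the standard Hadoop FileSystem API, which IBM's GPFS Hadoop connector also implements, to ask which hosts hold each block of a file. This is the same placement information a scheduler consults when it sends a task to the data. The file path is hypothetical, and treating the connector's answers as identical in form to HDFS's is an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        // Hypothetical file to inspect; with IBM's connector, a GPFS-FPO
        // volume is exposed through the same Hadoop FileSystem interface
        // that HDFS implements.
        Path file = new Path("/bigdata/events.log");

        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(file);

        // Ask the file system which hosts hold each block; a scheduler
        // uses exactly this information to send the job to the data.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d -> hosts %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```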
Last month, IBM published an in-depth technical white paper titled “Deploying a Big Data Solution using IBM GPFS-FPO”
that explains how to roll out GPFS on Hadoop and describes some of
the benefits users will see from the technology. For starters,
GPFS is POSIX compliant, which enables any other applications running
atop the Hadoop cluster to access data stored in the file system in a
straightforward manner. With HDFS, only Hadoop applications can access
the data, and they must go through the Java-based HDFS API.
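The difference is easy to see in code. Below is a minimal sketch, assuming a GPFS volume mounted at the hypothetical path /gpfs/bigdata: the POSIX route needs nothing but standard Java NIO, while the equivalent HDFS read must go through Hadoop's Java API.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TwoWaysToRead {
    public static void main(String[] args) throws Exception {
        // 1) POSIX route: because GPFS is mounted like any other file
        //    system, plain java.nio calls work; no Hadoop classes needed.
        List<String> lines =
                Files.readAllLines(Paths.get("/gpfs/bigdata/input.txt"));
        System.out.println("POSIX read: " + lines.size() + " lines");

        // 2) HDFS route: the same read against HDFS has to go through
        //    the Java-based Hadoop FileSystem API.
        FileSystem hdfs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = hdfs.open(new Path("/bigdata/input.txt"))) {
            System.out.println("HDFS read, first byte: " + in.read());
        }
    }
}
```

The first path is the one non-Hadoop tools care about: anything that can read a local file can read GPFS-resident data.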
The flexibility to access GPFS-resident data from Hadoop and
non-Hadoop applications frees users to build more flexible big data
workflows. For example, a customer may analyze a piece of data with SAS.
As part of that workflow, they may use a series of ETL steps to
manipulate data. Those ETL processes may be best executed by a MapReduce
program. Trying to build this workflow on HDFS would require additional
steps, as well as moving data in and out of HDFS. Using GPFS simplifies
the architecture and minimizes the data movement, IBM says.
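As a rough illustration of that simplification, the sketch below configures a MapReduce job whose input and output paths (both hypothetical) point straight at the shared POSIX mount rather than at HDFS, so a downstream tool such as SAS can pick up the results in place. It assumes every node in the cluster mounts GPFS at the same path.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedStorageEtl {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "etl-on-shared-storage");
        job.setJarByClass(SharedStorageEtl.class);

        // No mapper or reducer set: Hadoop's identity classes simply copy
        // records through, standing in for a real ETL transformation.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input and output live on the shared POSIX mount (hypothetical
        // paths); nothing is staged into or out of HDFS, so SAS or any
        // other POSIX application can read /gpfs/bigdata/clean directly.
        FileInputFormat.addInputPath(job, new Path("file:///gpfs/bigdata/raw"));
        FileOutputFormat.setOutputPath(job, new Path("file:///gpfs/bigdata/clean"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```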
There are many other general IT housekeeping-type benefits to using GPFS. According to IBM’s "Harness the Power of Big Data"
publication, POSIX compliance also allows users to manage their Hadoop
storage “just as you would any other computer in your IT environment.”
This allows customers to use traditional backup and restore utilities
with their Hadoop clusters, as opposed to using the “copy” command in
HDFS. What’s more, GPFS supports point-in-time snapshots and off-site
replication capabilities, which aren't available in plain-vanilla HDFS.
The size of data blocks is also an issue with HDFS. In its June 2013 whitepaper, “Extending IBM InfoSphere BigInsights with GPFS FPO and IBM Platform Symphony,” IBM makes the case that, because Hadoop MapReduce is optimized for
blocks of around 64 MB, HDFS is inefficient at dealing with
smaller files; each file, however small, carries its own block and
NameNode metadata overhead, so masses of small files strain the metadata
layer. In the world of big data, it's not always the size
of the data that matters; the number of data points and the frequency at
which the data changes are important too.
GPFS also brings benefits in the area of data de-duplication, because
it does not tend to duplicate data the way HDFS does, IBM says. However, if
users prefer to have copies of their data spread across multiple places
on the cluster, they can use the write affinity depth (WAD) feature
that debuted with FPO. The GPFS quota system also
helps control the number of files and the amount of file data in the
file system, which simplifies storage management.
Capacity planning of Hadoop clusters is easier when the data is stored
in GPFS, IBM says. In HDFS, administrators need to carefully design the
disk space dedicated to the Hadoop cluster, including dedicating space
for the output of MapReduce jobs and log files. “With GPFS-FPO,” IBM
says, “you only need to worry about the disks themselves filling up;
there’s no need to dedicate storage for Hadoop.”
Other
benefits include the capability to use policy-based information
lifecycle management (ILM) functions. That means third-party management tools,
such as IBM’s Tivoli Storage Manager software, can manage the data
in internal storage pools. The hierarchical storage management
(HSM) capabilities built into GPFS mean you can keep the
“hottest” data on the fastest disks, a feature that is not available in
plain-vanilla Hadoop running HDFS.
The shared-nothing architecture used by GPFS-FPO also provides
greater resilience than HDFS by allowing each node to operate
independently, reducing the impact of failure events across multiple
nodes. Eliminating the HDFS NameNode also removes the
single point of failure that shadows enterprise Hadoop
deployments. “By storing your data in GPFS-FPO you are freed from the
architectural restrictions of HDFS,” IBM says.
The Active File Management (AFM) feature of GPFS also boosts
resiliency by caching datasets in different places on the cluster,
ensuring that applications can access data even when a remote
storage cluster is unavailable. AFM also effectively masks wide-area
network latencies and outages. Customers can either use AFM to maintain
an asynchronous copy of the data at a separate physical location or rely on
GPFS synchronous replication, which is what FPO replicas use.
Security is also bolstered with GPFS. Customers can use either
traditional ACLs based on the POSIX model or Network File System (NFS)
version 4 ACLs, which IBM says provide much finer control over file and
directory access. GPFS also includes immutability and appendOnly
restriction capabilities, which can be used to protect data and prevent
it from being modified or deleted.
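Because the file system is POSIX compliant, the simple end of that spectrum, classic permission bits, can be handled with stock Java NIO calls against a hypothetical GPFS path, as in the sketch below. The NFSv4 ACLs and the immutability and appendOnly flags are managed through GPFS's own administration tooling, which this sketch does not cover.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

public class PosixPerms {
    public static void main(String[] args) throws Exception {
        // Hypothetical file on a GPFS mount; because the file system is
        // POSIX compliant, standard NIO attribute calls apply to it.
        Path report = Paths.get("/gpfs/bigdata/reports/q1.csv");

        // Read the current POSIX permission bits.
        Set<PosixFilePermission> perms = Files.getPosixFilePermissions(report);
        System.out.println("before: " + PosixFilePermissions.toString(perms));

        // Tighten access to owner read/write only (rw-------).
        Files.setPosixFilePermissions(report,
                PosixFilePermissions.fromString("rw-------"));
    }
}
```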
You don’t have to be using IBM’s BigInsights (or its Platform
Symphony offering) to take advantage of GPFS. The company will sell the
file system to do-it-yourself Hadoopers, as well as those who are
running distributions from other companies. And GPFS still lets you
use the wide array of Hadoop tools in the big data stack, such as
Flume, Sqoop, Hive, Pig, HBase, Lucene, Oozie, and of course MapReduce
itself.
IBM added the FPO capabilities to GPFS version 3.5 in December 2012.
Although it's POSIX compliant, GPFS-FPO is only available on Linux at
this point. IBM says GPFS is currently being used in a variety of big
data applications in the areas of bioinformatics, operational analytics,
digital media, engineering design, financial analytics, seismic data
processing, and geographic information systems.
By: Alex Woodie
Link: http://www.datanami.com/datanami/2014-02-18/what_can_gpfs_on_hadoop_do_for_you_.html