How does Facebook manage 100 petabytes of data online?
The answer is Apache Hadoop. It's no secret that Facebook stores an enormous amount of data, about 100 petabytes so far. Nor is it unknown, at least to curious geeks, that Facebook uses Apache Hadoop to manage that data and make it available within milliseconds whenever it is required.
Apache Hadoop is a software framework for distributed applications, released under a free software license. Hadoop is a top-level Apache project that has grown through the contributions of a community of developers, with Yahoo! standing out as one of its main users and contributors.
The framework lets applications run on large clusters of commodity hardware. The application is divided into many small fragments of work, each of which can be run, or rerun, on any node in the cluster.
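The idea of splitting a job into fragments that any node can run or rerun can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Hadoop's API: the node names, the `run_job` and `split` helpers, and the random failure rate are all hypothetical and exist purely for illustration.

```python
import random


def split(data, size):
    """Cut the input into fragments of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_job(fragments, nodes, work, fail_rate=0.3, seed=0, max_attempts=50):
    """Run each fragment on some node, rerunning it elsewhere after a failure.

    Failures are simulated with a seeded random generator (an assumption made
    for this sketch); a real cluster would detect them through timeouts or
    lost heartbeats.
    """
    rng = random.Random(seed)
    results = []
    log = []  # (fragment index, node that finished it, attempts needed)
    for index, fragment in enumerate(fragments):
        for attempt in range(1, max_attempts + 1):
            node = rng.choice(nodes)      # any node may take any fragment
            if rng.random() < fail_rate:  # simulated node failure
                continue                  # rerun the same fragment
            results.append(work(fragment))
            log.append((index, node, attempt))
            break
        else:
            raise RuntimeError(f"fragment {index} failed {max_attempts} times")
    return results, log


numbers = list(range(1, 101))
fragments = split(numbers, 10)
partial_sums, log = run_job(fragments, ["node-a", "node-b", "node-c"], sum)
total = sum(partial_sums)
print(total)  # 5050 whenever every fragment eventually succeeds
```

Because each fragment is independent, a failed attempt costs only the rerun of that one fragment, and the combined result is the same no matter which node finished which piece.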
Today companies such as AOL, eBay, Fox, IBM, LinkedIn and Twitter, among others, use Hadoop as a multi-platform solution.
Today was the first of the two days of the Hadoop Summit in San Jose, California, where Facebook engineer Andrew Ryan explained how the company has dealt with one of Hadoop's main problems using a solution of its own.
The problem lies in Hadoop's NameNode, the service in the Hadoop architecture that, as detailed by GigaOM, "handles all metadata operations in the distributed file system, but runs on only a single node. If that node goes down, so does Hadoop, for all practical purposes, because nothing that relies on HDFS (Hadoop's own file system) will run properly."
Andrew Ryan noted that Facebook's answer to the NameNode availability problem is called AvatarNode. Ryan said the company started building AvatarNode about two years ago (the name was inspired by James Cameron's film) and that it is now in production.
AvatarNode's job is to replace the single NameNode with a two-node architecture, in which one node acts as a standby that takes over if the other goes down. At present the failover process is manual, but Ryan said that "we are working to further improve AvatarNode and integrate it with a high-availability framework that will allow us to fail over in an unattended, automated and safe way."
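The two-node arrangement with a manual failover can be pictured with a toy model. This is a minimal sketch, not Facebook's AvatarNode code or any Hadoop API: the `MetadataNode` and `StandbyPair` classes, their methods, and the file and block names are hypothetical, and the standby is kept in sync here simply by mirroring every write.

```python
class MetadataNode:
    """A toy metadata service mapping file paths to block identifiers."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.files = {}  # file path -> list of block ids

    def add_file(self, path, blocks):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.files[path] = list(blocks)

    def lookup(self, path):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.files[path]


class StandbyPair:
    """An active node plus a standby that mirrors its metadata."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def add_file(self, path, blocks):
        self.active.add_file(path, blocks)
        if self.standby.alive:
            self.standby.add_file(path, blocks)  # keep the standby current

    def lookup(self, path):
        return self.active.lookup(path)

    def manual_failover(self):
        """Swap roles; in this sketch an operator has to call this by hand."""
        self.active, self.standby = self.standby, self.active


pair = StandbyPair(MetadataNode("primary"), MetadataNode("standby"))
pair.add_file("/logs/day1", ["blk_1", "blk_2"])

pair.active.alive = False  # the active node goes down
try:
    pair.lookup("/logs/day1")
    outage = False
except ConnectionError:
    outage = True          # every metadata request fails until failover

pair.manual_failover()     # the operator promotes the standby
blocks = pair.lookup("/logs/day1")
print(outage, blocks)
```

The window between the failure and the call to `manual_failover` is the outage that an automated, unattended failover would aim to shrink.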
Facebook's solution is neither final nor the best that could eventually be achieved for the NameNode limitation. As Ryan explains, only 10% of Facebook's unplanned downtime could have been avoided with AvatarNode, but the high-availability architecture should make it possible to avoid more than 50% of the planned downtime expected in the future.
Facebook is not the only company to tackle this problem. Appistry introduced a fully distributed file system a couple of years ago, and MapR's Hadoop distribution also provides a highly available file system.
In an era of massive data networks and cloud services, learning about developments like these, built on free platforms, gives us a better picture of what is to come in the distribution and administration of large-scale data.
Link: How Facebook keeps 100 petabytes of Hadoop data online (GigaOM)