Apache Hadoop Overview: Scalable Open Source Software


Hadoop: The Software and the Community

If you are involved in large-system computing, you have likely heard of Apache Hadoop. For those new to Hadoop: this open source software project is a platform for parallel computation. The Apache Hadoop project web page describes it this way:

"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model."

This page is an overview to give you a taste of the Hadoop project, its history and its current popularity. It also includes pointers to good resources for understanding and using Hadoop, and to ways to get involved in the Apache Hadoop community through discussion groups and the Hadoop mailing lists.

Logo from the Apache Hadoop Project

Go to a HUG!

Ted Dunning, MapR Technologies

Hadoop User Group meetings, or HUGs, are now found in locations worldwide. Check online to find a meet-up near you. Some speakers are local to the meet-up; others travel some distance as guest speakers. For example, the February 29, 2012 speaker at the Atlanta HUG is David Whitehouse of Datameer in Conn. Others, such as Ted Dunning of MapR Technologies (pictured here), are located in the San Francisco Bay Area but have also spoken at the Los Angeles, Atlanta and Boston HUGs. Those lucky enough to live in the Raleigh-Durham area may get to hear Mahout co-founder Grant Ingersoll of Lucid Imagination at a Triangle HUG.

Here's a link to many Hadoop User Groups internationally, where you can find meet-ups in your area:

Hadoop User Groups

Hadoop is an Apache Software Foundation Project

Hadoop is one of the most popular of the large open source software projects that are under the umbrella of the Apache Software Foundation and its open source license. For more information go to

Apache Software Foundation

Hadoop and Scalability

A system can scale reliably by design when it uses a cluster of computers rather than relying on just increasing the size of one computer. This approach has advantages in both cost and time. Hadoop is used by a wide range of major companies and is of considerable interest in the computing community.

Why a Yellow Elephant?

Doug Cutting and the Hadoop Logo

Hadoop developer Doug Cutting with a famous yellow elephant, © Ted Dunning 2009

When Doug Cutting began to design the Hadoop framework, his son had a toy that has now become famous. It was a little yellow elephant named "Hadoop". Now the idea of the yellow elephant lives on in a new form: the toy was the playful inspiration for the logo of the internationally known Apache Hadoop project.

This picture of Doug Cutting is used by permission. I did not ask the elephant what he thought, but he looks pretty happy.

The yellow elephant of Hadoop is reflected in the logo of the related Apache project named Mahout. Mahout is an Indian term for someone who drives an elephant. In the case of the Apache Mahout project logo, it's a yellow elephant on which the mahout rides. For more information about Apache Mahout, see the review of the book Mahout in Action at Best Book on Mahout.

Scalability

A key to success in large data projects

Scalability refers to the ability of a system to continue to perform efficiently and effectively as load grows, with capacity that improves linearly relative to the addition of new resources such as additional hardware or additional time. Systems whose capacity increases more slowly than the resources added are said not to scale. Systems that do not scale will eventually fail under increasing load.

With the widespread occurrence of systems that hold very large and growing amounts of data, such as internet sites, the need for scalable software is great. The open source Hadoop project provides reliable, scalable software for distributed computing.
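To make the idea of linear scaling concrete, here is a minimal sketch in Python. It is not part of Hadoop; the function name and the cluster numbers are made up for illustration. It compares how much capacity grew with how much the resources grew, so a result of 1.0 means linear scaling and anything well below 1.0 means the system is not scaling.

```python
def scaling_efficiency(base_nodes, base_throughput, nodes, throughput):
    """Return capacity growth divided by resource growth.

    1.0 means capacity grew linearly with the added resources;
    values below 1.0 mean capacity grew more slowly than resources.
    """
    resource_ratio = nodes / base_nodes
    capacity_ratio = throughput / base_throughput
    return capacity_ratio / resource_ratio


if __name__ == "__main__":
    # Hypothetical measurements (records per second), for illustration only.
    print(scaling_efficiency(10, 1000.0, 20, 2000.0))  # twice the nodes, twice the work
    print(scaling_efficiency(10, 1000.0, 40, 2000.0))  # four times the nodes, only twice the work
```

In practice you would plug in throughput measured on your own cluster at different sizes.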

Hadoop and MapReduce

Hadoop relies on the idea that large-scale computing can be distributed across a cluster of servers. A key point in the development of this idea came with the publication of a paper by Google in 2004. It presented the map-reduce approach that makes this type of distributed computing practical.

The idea of map-reduce inspired the developers of the Apache Lucene sub-project, Nutch, to produce the Hadoop framework to solve some scaling problems in Nutch. Yahoo commissioned a team of programmers to work on Hadoop, contributing the results back to Apache. Several other companies did likewise. Now a large and growing number of companies use Hadoop-based approaches for large projects.
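For readers who have not seen map-reduce before, here is a minimal sketch of the idea in plain Python, using the classic word-count example. It runs in a single process and does not use the Hadoop API; the function names are my own and are only for illustration. In a real cluster, the map and reduce steps would run in parallel on many machines, and the framework would handle the grouping step between them.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["the yellow elephant", "the elephant named Hadoop"]
    print(word_count(docs))
```

The point of the design is that each call to map_phase looks at only one document and each call to reduce_phase looks at only one key, which is what allows the work to be spread across a cluster.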

At the Start: A Moment in Hadoop History

Ted Dunning and Doug Cutting, discussing Hadoop in Berlin 2011, photo © E. Friedman

Doug Cutting started Hadoop, and it has grown to be an internationally known and widely used open source software framework. I ran across an entry dated 13 March 2006 from Doug's old blog, Free Search, that marks an early point in the development of Hadoop. Doug wrote the following:

"We've split the distributed computing parts of Nutch into a new project named Hadoop. This includes a filesystem modelled after GFS and a distributed computing system modelled after Google's MapReduce. So far a few folks are using Hadoop on tens of machines, and we're testing it on clusters with hundreds of machines. Next stop, thousands!"

By the way, this March 2006 blog posting had one comment...

I recall an evening in November 2007 when Ted Dunning and I hosted a local Bay Area Hadoop users get-together at a bar in Palo Alto. We invited about 20 people; about 30 showed up. Doug was there. It's amazing today to think of this early point in the project when you consider how many developers are using Hadoop now. The Hadoop users group meetings in the San Francisco area now involve hundreds of attendees for each monthly chapter meeting, and there are local meetings all over the world. The Hadoop Summit 2010 in Santa Clara, CA had over 1000 attendees. Hadoop Summit 2011 was on 29 June 2011 in Santa Clara with over 1600 people.

The photo shows a more recent moment in Hadoop: Ted Dunning and Doug Cutting discussing Hadoop during a break at the Berlin BuzzWords 2011 Conference. Image © E. Friedman, used with permission of Ted and Doug.

Hadoop Summit 2012

Apache Hadoop Summit 2012 takes place 13-14 June at the Santa Clara Convention Center in Santa Clara, California.

Hadoop Summit 2012

Hadoop Summit 2011

MapR Technologies video at Hadoop Summit 2011, image © E. Friedman 2011

Over 1600 people attended Hadoop Summit 2011. The presentation for MapR Technologies grabbed audience attention with a video.

Hadoop Committers - Who Are They?

The people who make the project

Hadoop is a top-level Apache Software Foundation project, and as such, it involves a large community. To date, the Hadoop committers, who develop, expand and update the software, number over 40.

Go to this link on the Apache site to see the list of current Hadoop project committers:

Apache Hadoop Committers

How to Participate in Hadoop

Want to join the Hadoop community? Go to a local Hadoop Users Group or HUG meetup in your area or join in online in the discussion groups.

Visit the main site for the Apache Foundation Open Source Hadoop project at

Hadoop Home

You will find news updates, a technical description of Hadoop and a list of the committers who develop it as well as links to all aspects of the Hadoop community.

If you want to get involved, go to the mailing lists and discussion groups. There are several choices depending on whether you want to join in the users group, the developers discussion or project level discussions. The link for the various mailing lists and discussions is at

Hadoop Mailing Lists

You can participate by looking for a local Hadoop Users Group meet-up. HUGs are now international, and new ones are forming in many areas. Search online to find a local HUG, or check this link to find one in your area:

List of Active HUGs

For example, in the California Bay Area, the Hadoop User Group Monthly Meetup (HUG) is the 3rd Wednesday of each month. For more information or to sign up for a meet-up, click this link:

Bay Area Hadoop Users Group.

Follow me on Twitter for announcements about specific Hadoop related events worldwide @Ellen_Friedman

Books on Hadoop

There are several choices for a guide to using Hadoop. All of these are conveniently available from Amazon.

Hadoop at QCon in San Francisco November 2011

"Hadoop for the Enterprise Architect Panel" was presented on Friday 18 November 2011 at QCon in San Francisco. Panelists/presenters included Amr Awadallah, Guy Bayes, Ron Bodkin, Ted Dunning, Sanjay Radia and Peter Sirota.

Hadoop Panel at QCon: "Hadoop for the Enterprise Architect"

Hadoop at Berlin BuzzWords 2012

View from Bundestag in Berlin, © E. Friedman 2011

Berlin Buzzwords 2012 takes place 4th-5th June. Here's the link: Berlin Buzzwords 2012

Apache projects were one of the topics of discussion at the Berlin BuzzWords 2011 conference in Germany last June, and Hadoop was a major focus, particularly of the two keynote addresses.

Participants worked hard during the two-day conference, and many also joined in at several hack-a-thons held at local companies and the Technical University in conjunction with the BuzzWords conference.

Participants also had the opportunity to enjoy seeing Berlin: the tourist sights and the high-tech happenings as well.

1st Keynote Speaker at Berlin BuzzWords 2011

Doug Cutting talks about Hadoop, Avro and other projects

Doug Cutting at Berlin BuzzWords 2011, © E. Friedman 2011

Doug Cutting was the keynote speaker on the opening day of the Berlin BuzzWords 2011 conference on 6 June 2011. He described the history of Hadoop and other Apache projects including Avro and Lucene.

Doug Cutting was elected chairman of the board of the Apache Software Foundation in September 2010. He contributes to several Apache projects. Doug is currently at a company called Cloudera.

2nd Keynote Speaker at Berlin BuzzWords 2011

Ted Dunning talks about the future of Hadoop

Ted Dunning delivers keynote at Berlin BuzzWords 2011, © E. Friedman 2011

Ted Dunning gave the keynote address on the second day of the Berlin BuzzWords 2011 conference on 7 June 2011. He talked mainly about the future of Hadoop and the changes the community faces as the world of computing embraces Hadoop.

Ted Dunning is a member of the Apache Software Foundation, a committer for the Mahout project and active in the Hadoop community. Ted is currently at a company called MapR.

Read more from Ted Dunning's blog at

Surprise and Coincidence: Musings from the Long Tail.

Useful Books on Related Topics

If you have an interest in Hadoop, you may also find these books useful. Both of these titles come highly rated at Amazon.

Book Review: Mahout in Action

If you'd like to know more about the related Apache open source software project Mahout, read this review of a new how-to book titled Mahout in Action. The book is published by the technical publisher Manning.

Mahout in Action

eBook and print version from Manning

Mahout in Action, published by Manning 2011

To ORDER Mahout in Action NOW: get just the eBook or the eBook with the print version. The eBook has audio/video enhancements. Both are available as of 4 October 2011.

To get a limited-time 37% discount on all formats of the Mahout in Action book, go to the publisher Manning and use the discount code

mahout37

at the following link: Manning's Mahout in Action

Mahout in Action from Amazon

Pre-order print version now

If you want just the text for slightly less, pre-ORDER NOW the print-only version from Amazon, and they will ship your copy when it becomes available on 28 October 2011.

Please leave your comment or question

  • Tipi Aug 18, 2011 @ 9:00 pm
    I like how you sneak in your humor here and there....didn't ask the elephant about use of his picture but he looks happy!
  • sukkran Jun 28, 2011 @ 11:55 pm
    really useful info about a open source software. thanks for sharing

Other Topics I've Written About

Life cannot be all work. Here are some ideas for play: great Indian food in the San Francisco Bay Area, how to make oolong tea, fun music for mandolin and fiddle, and a new photo project. Food for thought and food for you!


Convenient Computer Memory

These flash drives are a handy way to temporarily back up or transfer information. They are useful at a presentation to share files. And the swivel design means that the cap won't get lost.

by

efriedman

I am co-author of a book about another Apache project, titled Mahout in Action. By training, I am a biochemist/molecular biologist. Most of my writi...



O'Reilly Book on Hadoop 

Hadoop: The Definitive Guide

Amazon Price: $30.10 (as of 05/28/2012)

This book not only has a cool cover, it should be a good guide for how to use Hadoop. The author is a Hadoop committer.

A Manning book on Hadoop 

Hadoop in Action

Amazon Price: $22.98 (as of 05/28/2012)

This Hadoop book has gotten excellent reviews.