Good times for Powerset, Hard times for Hadoop?

Yahoo’s troubles and a recent Microsoft acquisition could be bad news for open source software that enables “internet-scale” computing.

Hadoop is a project to build an open source version of the infrastructure that Google uses to process data. It provides a huge filesystem that can be distributed over dozens or even thousands of computers (analogous to GFS), as well as support for processing all that data in parallel in the same way Google does when they build and update their index of the web (using MapReduce). It also provides HBase, a distributed database built on top of the filesystem in the manner of Google’s BigTable. Hadoop is a spin-off of the Nutch project to build an open source search engine that could index a significant portion of the web.
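For readers unfamiliar with the model, here is a toy sketch of the MapReduce idea using the classic word-count example. This is written in plain Python for readability, not in Hadoop’s actual Java API, and the function names are purely illustrative; in a real Hadoop job the map and reduce steps run in parallel across many machines, with the framework handling the shuffle in between.

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["the quick brown fox", "the lazy dog jumped over the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Because the mapper looks at one document at a time and the reducer looks at one word at a time, both phases parallelize trivially, which is what lets this pattern scale to indexing the web.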

Most of the work on Hadoop and HBase has been supported by Yahoo, and a lot of the recent work was supported by a semantic-search startup called Powerset. In fact, a quick look at the personnel on the project shows that it is dominated by people from those two companies.

Given that Yahoo is in turmoil and has been showing signs of reconsidering its search business, and given that Powerset was just bought by Microsoft, which likely already has its own infrastructure for these sorts of applications, I have to wonder what will happen with Hadoop.