mar 28

Google unveiled a “patent pledge” that it hopes will shield cloud software and big data developers from the type of litigation that has engulfed the mobile phone industry. The pledge, which is like a non-aggression pact, covers ten patents related to Google’s MapReduce technology.

The pledge, which Google announced on Thursday, says that developers are free to use or sell the technology described in the patents without fear of future lawsuits. The shield applies, however, only to projects based on open source software that is available to all.

Google’s patent pledge appears intended to complement the open-source software licenses that allow programmers to build on each other’s work. Such licenses, like the GNU General Public License, grant anyone the right under copyright law to use designated blocks of software code; these rights, however, can be undercut by competing patent rights.

The ten patents in Google’s pledge include a controversial one issued last year that covers a form of parallel processing known as MapReduce. The patent gave rise to fears that Google would be able to monopolize tools like Hadoop, which is an integral part of the so-called “big data” revolution that is fueling a wide range of new products and services. Google’s pledge appears intended to allay that fear.

In a phone interview, a person from Google explained that the MapReduce patent pledge is intended to help the emerging big data and cloud software industry avoid a litigation train-wreck like the one that befell the mobile industry. (In recent years, an arms race of patents covering smartphones has led to a relentless series of global lawsuits which have limited the spread of software technology and increased prices for consumers.)

Google suggests it will add other patents to its non-aggression pool and is inviting others to do the same. In theory, this will lead to an open and expanding workshop of tools for cloud developers; however, there is no guarantee it will work out this way.

One problem is that the pledge will have little effect against patent trolls like Intellectual Ventures, which buy up old patents and use them to file lawsuits against productive companies. The trolls are largely immune from retaliation because they operate through shell companies and don’t actually make any products that can be the subject of a counter-suit.

The person at Google, who did not want to be identified, said the pledge may not be effective against trolls but that it may curtail the practice of “privateering,” in which major companies give patents to trolls in order to harass rivals or in return for a cut of the proceeds the trolls obtain. This person said that, under the terms of the pledge, Google reserves the right to sue anyone who financially benefits from such lawsuits.

There is also the question of whether the Google pledge is legally enforceable. Typically, promises to the world at large don’t carry any legal force because they lack what lawyers call “consideration.” The Google source, however, said those who rely on the pledge could likely prevent Google from going back on the pledge through a doctrine called “promissory estoppel.”

In the bigger picture, the Google patent pledge represents part of a growing effort among Silicon Valley companies to rein in a patent system that many believe has become over-extended. Twitter, for instance, last year introduced an employment contract that promises its engineers that their inventions won’t be used to fuel the patent wars.

(Image by alphaspirit via Shutterstock)


mar 04

Depending on how one defines its birth, Hadoop is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.

Alone, Hadoop is a software market that IDC predicts will be worth $813 million in 2016 (although that number is likely very low), but it’s also driving a big data market the research firm predicts will hit more than $23 billion by 2016. Since Cloudera launched in 2008, Hadoop has spawned dozens of startups and spurred hundreds of millions in venture capital investment.

In this four-part series, we’ll explain everything anyone concerned with information technology needs to know about Hadoop. Part I is the history of Hadoop from the people who willed it into existence and took it mainstream. Part II is more graphic: a map of the now-large and complex ecosystem of companies selling Hadoop products. Part III is a look into the future of Hadoop that should serve as an opening salvo for much of the discussion at our Structure: Data conference March 20-21 in New York. Finally, Part IV will highlight some of the best Hadoop applications and seminal moments in Hadoop history, as reported by GigaOM over the years.

Wanted: A better search engine

Almost everywhere you go online now, Hadoop is there in some capacity. Facebook, eBay, Etsy, Yelp, Twitter, Salesforce.com: you name a popular web site or service, and the chances are it’s using Hadoop to analyze the mountains of data it’s generating about user behavior and even its own operations. Even in the physical world, forward-thinking companies in fields ranging from entertainment to energy management to satellite imagery are using Hadoop to analyze the unique types of data they’re collecting and generating.

Everyone involved with information technology at least knows what it is. Hadoop even serves as the foundation for new-school graph and NoSQL databases, as well as bigger, badder versions of relational databases that have been around for decades.

But it wasn’t always this way, and today’s uses are a long way off from the original vision of what Hadoop could be.

Doug Cutting

When the seeds of Hadoop were first planted in 2002, the world just wanted a better open-source search engine. So then-Internet Archive search director Doug Cutting and University of Washington graduate student Mike Cafarella set out to build it. They called their project Nutch and it was designed with that era’s web in mind.

Looking back on it today, early iterations of Nutch were kind of laughable. About a year into their work on it, Cutting and Cafarella thought things were going pretty well because Nutch was already able to crawl and index hundreds of millions of pages. “At the time, when we started, we were sort of thinking that a web search engine was around a billion pages,” Cutting explained to me, “so we were getting up there.”

There are now about 700 million web sites and, according to Wired’s Kevin Kelly, well over a trillion web pages.

But getting Nutch to work wasn’t easy. It could only run across a handful of machines, and someone had to watch it around the clock to make sure it didn’t fall down.

Mike Cafarella

“I remember working on it for several months, being quite proud of what we had been doing, and then the Google File System paper came out and I realized ‘Oh, that’s a much better way of doing it. We should do it that way,’” reminisced Cafarella. “Then, by the time we had a first working version, the MapReduce paper came out and that seemed like a pretty good idea, too.”

Google released the Google File System paper in October 2003 and the MapReduce paper in December 2004. The latter would prove especially revelatory to the two engineers building Nutch.

“What they spent a lot of time doing was generalizing this into a framework that automated all these steps that we were doing manually,” Cutting explained.

Raymie Stata, founder and CEO of Hadoop startup VertiCloud (and former Yahoo CTO), calls MapReduce “a fantastic kind of abstraction” over the distributed computing methods and algorithms most search companies were already using:

“Everyone had something that pretty much was like MapReduce because we were all solving the same problems. We were trying to handle literally billions of web pages on machines that are probably, if you go back and check, epsilon more powerful than today’s cell phones. … So there was no option but to latch hundreds to thousands of machines together to build the index. So it was out of desperation that MapReduce was invented.”

Parallel processing in MapReduce, from the Google paper

Over the course of a few months, Cutting and Cafarella built up the underlying file systems and processing framework that would become Hadoop (in Java, notably, whereas Google’s MapReduce used C++) and ported Nutch on top of it. Now, instead of having one guy watch a handful of machines all day long, Cutting explained, they could just set it running on between 20 and 40 machines that he and Cafarella were able to scrape together from their employers.

Bringing Hadoop to life (but not in search)

Anyone vaguely familiar with the history of Hadoop can guess what happens next: In 2006, Cutting went to work with Yahoo, which was equally impressed by the Google File System and MapReduce papers and wanted to build open source technologies based on them. They spun out the storage and processing parts of Nutch to form Hadoop (named after Cutting’s son’s stuffed elephant) as an open-source Apache Software Foundation project and the Nutch web crawler remained its own separate project.

“This seemed like a perfect fit because I was looking for more people to work on it, and people who had thousands of computers to run it on,” Cutting said.

Cafarella, now an associate professor at the University of Michigan, opted to forgo a career in corporate IT and focus on his education. He’s happy as a professor — and currently working on a Hadoop-complementary project called RecordBreaker — but, he joked, “My dad calls me the Pete Best of the big data world.”

Ironically, though, the 2006-era Hadoop was nowhere near ready to handle production search workloads at webscale — the very task it was created to do. “The thing you gotta remember,” explained Hortonworks Co-founder and CEO Eric Baldeschwieler (who was previously VP of Hadoop software development at Yahoo), “is at the time we started adopting it, the aspiration was definitely to rebuild Yahoo’s web search infrastructure, but Hadoop only really worked on 5 to 20 nodes at that point, and it wasn’t very performant, either.”

Baldeschwieler at Hadoop Summit 2010. Source: Yodel Anecdotal

Stata recalls a “slow march” of horizontal scalability, growing Hadoop’s capabilities from the single digits of nodes into the tens of nodes and ultimately into the thousands. “It was just an ongoing slog … every factor of 2 or 1.5 even was serious engineering work,” he said. But Yahoo was determined to scale Hadoop as far as it needed to go, and it continued investing heavy resources into the project.

It actually took years for Yahoo to move its web index onto Hadoop, but in the meantime the company made what would be a fortuitous decision to set up what it called a “research grid” for the company’s data scientists, to use today’s parlance. It started with dozens of nodes and ultimately grew to hundreds as they added more and more data and Hadoop’s technology matured. What began life as a proof of concept fast became a whole lot more.

“This very quickly kind of exploded and became our core mission,” Baldeschwieler said, “because what happened is the data scientists not only got interesting research results — what we had anticipated — but they also prototyped new applications and demonstrated that those applications could substantially improve Yahoo’s search relevance or Yahoo’s advertising revenue.”

Shortly thereafter, Yahoo began rolling out Hadoop to power analytics for various production applications. Eventually, Stata explained, Hadoop had proven so effective that Yahoo merged its search and advertising into one unit so that Yahoo’s bread-and-butter sponsored search business could benefit from the new technology.

Cutting (center) flanked by Baldeschwieler and Om Malik at GigaOM's Hadoop Meetup in 2008.

And that’s exactly what happened, because although data scientists didn’t need things like service-level agreements, business leaders did. So, Stata said, Yahoo implemented some scheduling changes within Hadoop. And although data scientists didn’t need security, Securities and Exchange Commission requirements mandated a certain level of security when Yahoo moved its sponsored search data onto it.

“That drove a certain level of maturity,” Stata said. “… We ran all the money in Yahoo through it, eventually.”

The transformation into Hadoop being “behind every click” (or every batch process, technically) at Yahoo was pretty much complete by 2008, Baldeschwieler said. That meant doing everything from these line-of-business applications to spam filtering to personalized display decisions on the Yahoo front page. By the time Yahoo spun out Hortonworks into a separate, Hadoop-focused software company in 2011, Yahoo’s Hadoop infrastructure consisted of 42,000 nodes and hundreds of petabytes of storage.

From the classroom …

However, although Yahoo was responsible for the vast majority of development during its formative years, Hadoop didn’t exist in a bubble inside Yahoo’s headquarters. It was a full-on Apache project that attracted users and contributors from around the world, including people like Tom White, a Welshman who wrote O’Reilly Media’s book Hadoop: The Definitive Guide despite being what Cutting describes as a guy who just liked software and played with Hadoop at night.

Up in Seattle in 2006, a young Google engineer named Christophe Bisciglia was using his 20 percent time to teach a computer science course at the University of Washington. Google wanted to hire new employees with experience working on webscale data, but its MapReduce code was proprietary, so it bought a rack of servers and used Hadoop as a proxy.


feb 21

When we first began putting together the schedule for Structure: Data several months ago, we knew that running SQL queries on Hadoop would be a big deal — we just didn’t know how big a deal it would actually become. Fast-forward to today, a mere month away from the event (March 20-21 in New York), and the writing on the wall is a lot clearer. SQL support isn’t the end-game for Hadoop, but it’s the feature that will help Hadoop find its way into more places in more companies that understand the importance of next-generation analytics but don’t want to (or can’t yet) re-invent the wheel by becoming MapReduce experts.

In fact, there are now so many products and projects pushing SQL queries and interactive data analysis on Hadoop — including two more announced this week — that it’s getting hard to keep track. But I’ll do my best.

Of course, Facebook began this whole movement to bring SQL database-like functionality to Hadoop when it created Hive in 2009. Hive, now an Apache project, includes a data-management layer and SQL-like query language called HiveQL. It has proven rather useful and popular over the years, but Hive’s reliance on MapReduce makes it somewhat slow by nature — MapReduce scans the entire data set and moves a lot of data over the network while processing a job — and there hasn’t been much effort to package it in a manner that might attract mainstream users.
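To make the cost of that full scan concrete, here is a minimal sketch of how a simple aggregate query could be carried out as a map phase followed by a reduce phase. It is not Hive’s actual query planner; the `CLICKS` table, its column names and the `run_group_by_count` helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical table of click records; the column names are illustrative only.
CLICKS = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/home"},
    {"user": "a", "page": "/cart"},
]


def run_group_by_count(rows, column):
    """Roughly: SELECT <column>, COUNT(*) FROM rows GROUP BY <column>."""
    rows_scanned = 0
    shuffled = defaultdict(list)

    # Map phase: every row is read, and a (key, 1) pair is emitted for it.
    # On a real cluster these pairs are what gets moved over the network.
    for row in rows:
        rows_scanned += 1
        shuffled[row[column]].append(1)

    # Reduce phase: one sum per distinct key.
    counts = {key: sum(ones) for key, ones in shuffled.items()}
    return counts, rows_scanned


if __name__ == "__main__":
    counts, scanned = run_group_by_count(CLICKS, "page")
    print(counts, "rows scanned:", scanned)
```

Even for this tiny query, the number of rows scanned equals the size of the table, which is the behavior the newer engines described below try to avoid.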

And keep in mind that this next generation of SQL-on-Hadoop tools aren’t just business intelligence or database products that can access data stored in Hadoop; EMC Greenplum, HP Vertica, IBM Netezza, ParAccel, Microsoft SQL Server and Teradata/Aster Data (which this week released some cool new features for just this purpose) all allow some sort of access to Hadoop data. Rather, these are applications, frameworks and engines that let users query Hadoop data from inside Hadoop, sometimes by re-architecting the underlying compute and data infrastructures. The beauty of this approach is that data is usable in its existing form and, in theory, doesn’t require two separate data stores for analytic applications.

Data warehouses and BI: The Structure: Data set

Structure:Data: Put data to work. 60+ big data experts speaking. March 20-21, 2013, New York City. Register now.

I’m highlighting this group of companies first, not because I think they’re the best (although that might well be), but because I’m truly excited about the panel they’ll be featured on at our conference next month. The panel is moderated by Facebook engineering manager Ravi Murthy, a guy who knows his way around a database, so they’ll have to answer some tough questions from one of the most-advanced and most-aggressive Hadoop and analytics tools users out there:

Apache Drill: Drill is a MapR-led effort to create a Google Dremel-like (or BigQuery-like) interactive query engine on top of Hadoop. First announced in August, the project is still under development and in the incubator program within Apache. According to its web site, “One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.”

Hadapt: Hadapt, which actually launched at Structure: Data in 2011, was the first of the SQL on Hadoop vendors and is somewhat unique in that it has a real product on the market and real users in production. Its unique architecture includes tools for advanced SQL functions and a split-execution engine for MapReduce and relational tasks, and both HDFS and relational storage. In October, the company announced a tight integration with Tableau Software around advanced visual analytics.

Platfora: Technically not a SQL product, Platfora is red-hot right now and is trying to re-imagine the world of business intelligence for a big data world. Essentially an HTML5 canvas laid atop Hadoop and an in-memory, massively parallel processing engine, the company’s software, which it unveiled in October, is designed to make analyzing data stored in Hadoop a fast and visually intuitive process.

Qubole: Qubole is an interesting case in that it’s essentially a cloud-based version of the popular Apache Hive framework launched by the guys who created Hive while working at Facebook. Qubole claims its auto-scaling abilities, optimized Hadoop code and columnar data cache make its service run much faster than Hive alone, and that running on Amazon Web Services makes it easier than maintaining a physical cluster.

Data warehouses and BI: The rest

Citus Data: Citus Data’s CitusDB isn’t just about Hadoop, but rather wants to bring the power of its distributed Postgres implementation to all types of data. It relies on Postgres’s foreign data wrappers feature to convert disparate data types into the database’s native format, and then on its own distributed-processing technology to carry out queries in seconds or less. Because of its Postgres foundation, CitusDB can join data from different data sources and retains all the native features that come with that database.

Cloudera Impala: Cloudera’s Impala might just be the most-important SQL-on-Hadoop effort around because of Cloudera’s expansive installation and partner footprints. It’s a massively parallel processing engine that bypasses MapReduce to enable interactive queries on data stored in either HDFS or HBase, using the same variant of SQL that Hive uses. However, because Cloudera doesn’t build applications, it’s relying on higher-level BI and analytics partners to provide the user interface.

Karmasphere: Karmasphere is one of the first startups to build an analytic application atop Hadoop, and in its 2.0 release last year the company added support for SQL queries of data in HDFS. Like Hive, Karmasphere still relies on MapReduce to process queries, which means it’s inherently slower than newer approaches. However, unlike Hive, Karmasphere allows for parallel queries to run at the same time and includes a visual interface for writing queries and filtering results.

Lingual: Lingual is a new open source project from Concurrent (see disclosure), the parent company of the Cascading framework for Hadoop. Announced on Wednesday, Lingual runs on Cascading and gives developers and analysts a true ANSI SQL interface from which to run analytics or build applications. Lingual is compatible with traditional BI tools, JDBC and the Cascading family of APIs.

Phoenix: Phoenix is a new and relatively unknown open source project that comes out of Salesforce.com and aims to allow fast SQL queries of data stored in HBase, the NoSQL database built atop HDFS. Its stated mission: “Become the standard means of accessing HBase data through a well-defined, industry standard API.” Users interact with it through JDBC interfaces, and its developers claim sub-second response times for small queries and seconds-long response times for queries over tens of millions of rows.

A sample of Phoenix via the SQuirreL client

Shark: Shark isn’t technically Hadoop, but it’s cut from the same cloth. Shark, in this case, stands for “Hive on Spark,” with Hive meaning the same thing it does to Hadoop, but with Spark being an in-memory platform designed to run parallel-processing jobs 100 times faster than MapReduce (a speed improvement over traditional Hive that Shark also claims). Shark also includes APIs for turning query results into a type of data format amenable to machine learning algorithms. Both Shark and Spark are developed by the University of California, Berkeley’s AMPLab.

Stinger Initiative: Launched on Wednesday (along with a security gateway called Knox and a faster, simpler processing framework called Tez), the Stinger Initiative is a Hortonworks-led effort to make Hive faster (up to 100x) and more functional. Stinger adds more SQL analytics capabilities to Hive, but the most-important aspects are infrastructural: an optimized execution engine, a columnar file format and the ability to avoid MapReduce bottlenecks by running atop Tez.

Operational SQL

Drawn to Scale: Drawn to Scale is a startup that has built an operational SQL database on top of HBase. The key word here is database, as its product, called Spire, is modeled after Google’s F1 and designed to power transactional applications as well as analytic ones. Spire has a fully distributed index and queries are sent only to the node with the relevant data, so reads and writes are fast and the system can handle lots of concurrent users without falling down.
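The idea of sending a query only to the node that holds the relevant data can be sketched with simple hash partitioning. This is an illustrative assumption, not a description of how Spire is actually built; the `Node` and `Cluster` classes and the hashing scheme below are hypothetical.

```python
import hashlib


class Node:
    """One storage node; counts how many lookups it has served."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.queries_served = 0

    def get(self, key):
        self.queries_served += 1
        return self.rows.get(key)


class Cluster:
    """Routes each key to exactly one node (hypothetical hash partitioning)."""

    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def _owner(self, key):
        # A stable hash, so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._owner(key).rows[key] = value

    def get(self, key):
        # Only the owning node is contacted; the others do no work.
        return self._owner(key).get(key)


if __name__ == "__main__":
    cluster = Cluster(["n0", "n1", "n2"])
    cluster.put("order-42", {"total": 10})
    print(cluster.get("order-42"))
    print([node.queries_served for node in cluster.nodes])
```

Because a point lookup touches a single node, adding concurrent users spreads load across the cluster instead of making every node answer every query.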

Splice Machine: Database startup Splice Machine is also trying to get into the operational space by building its Splice SQL Engine atop the naturally distributed HBase database. Splice Machine focuses its message on transactional integrity, which is really where it separates itself from scalable NoSQL databases and analytics-focused SQL-on-Hadoop efforts. It relies on HBase’s auto-sharding feature to make scaling an easy process.

Feature image courtesy of Shutterstock user hauhu.


jan 28

Chances are, the quad-core processor powering your desktop computer or high-end laptop is vastly underworked. But it’s not your fault: Writing code that executes in parallel is difficult, so most consumer applications (save for some compute-intensive video games that really need help, for example) continue to run on just one core at a time. Which makes it all the more impressive that a group of Stanford researchers recently ran a jet-engine-noise simulation across 1 million cores simultaneously.

As anyone even casually familiar with parallel processing knows, running applications across more nodes means jobs execute faster because they’re able to share the computing workload. The more cores, the faster it runs. This is what makes Hadoop, for example, so great at processing large chunks of data. The MapReduce framework on which it’s based divvies up the work across nodes and everything they find is stitched back together as the result of a job.
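The divide-and-stitch pattern described above can be sketched in a few lines. This is a toy, single-machine illustration and not Hadoop’s API: threads stand in for cluster nodes, and the function names (`map_chunk`, `shuffle`, `reduce_group`, `word_count`) are made up for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group every emitted value under its key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Threads stand in for cluster nodes here (an assumption for illustration).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    return dict(reduce_group(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    chunks = ["big data big ideas", "data at scale", "big clusters"]
    print(word_count(chunks))
```

Each chunk is processed independently, which is what lets the work spread across machines; only the shuffle step needs to bring related pieces back together.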

But even Hadoop can only scale to tens of thousands of nodes and, because of its focus on “nodes,” actually isn’t really good at utilizing multi-core processors to their fullest (expect to hear more about the limitations of Hadoop at our Structure: Data conference March 20-21 in New York). The IBM-built Sequoia supercomputer (housed at Lawrence Livermore National Laboratory) that the Stanford team used consists of 98,304 processors (or nodes), each containing 16 computing cores. That’s a grand total of 1,572,864 cores, and the researchers were able to use the majority of them, which they claim is a record of some sort.

Sequoia, decomposed

But record or not, that’s an incredibly complex undertaking. Programming the jet-engine simulation meant figuring out how to divvy the code into more than a million different tasks that could run across tens of thousands of nodes and 16 cores within each of those nodes. If even one of those processes is buggy, it could slow down or ruin the whole simulation.
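As a rough illustration of the bookkeeping involved, the sketch below checks the core count quoted above and maps a flat task number to a (node, core) slot. The node and core figures come from the article; the `place_task` layout is a hypothetical scheme and says nothing about how the Stanford team actually partitioned their simulation.

```python
NODES = 98_304        # processors (nodes) in Sequoia, per the article
CORES_PER_NODE = 16   # computing cores per node, per the article
TOTAL_CORES = NODES * CORES_PER_NODE


def place_task(task_id):
    """Map a flat task id to a (node, core) slot; a hypothetical layout."""
    if not 0 <= task_id < TOTAL_CORES:
        raise ValueError("task id out of range")
    # divmod returns (task_id // CORES_PER_NODE, task_id % CORES_PER_NODE).
    return divmod(task_id, CORES_PER_NODE)


if __name__ == "__main__":
    print(TOTAL_CORES)
    print(place_task(17))
```

Assigning slots is the easy part; the hard engineering is in making more than a million such tasks exchange data and stay in step without one slow or buggy task stalling the rest.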

Even in the world of supercomputing, where systems now regularly contain hundreds of thousands of cores (some of them special-purpose GPU co-processors), there’s a shortage of programming talent to actually use them all to their fullest potential. As my colleague Stacey Higginbotham explained some time ago, the world of high-performance computing is hurtling toward exascale computing but a bigger problem than energy-consumption might be finding applications that need that much computing power and the algorithms capable of operating at that scale.

Still, the implications of advances in parallel programming are huge — like potentially life-altering huge. This is true not only because of the scientific questions we’ll soon be able to answer at speeds inconceivable even a decade ago, but also because of the computing power we’ll all soon be carrying around in our pockets and purses. If you think those multi-core smartphones and tablets are great now because they can run multiple applications at the same time, just wait until their processors are even bigger and badder and we have more applications — photo- and video-editing, computer-aided design, games and who knows what else — that can actually get the most out of them.
