May 22

If you know Java, R or SAS, doing machine learning on Hadoop data just got a lot easier. Concurrent (see disclosure), the company behind the popular Cascading framework for writing big data jobs, has developed a new open source tool called Pattern that lets users export their models from statistical analysis applications and run them at scale on Hadoop data with little to no code change.

The reason for creating Pattern is pretty simple, according to Concurrent founder and CTO Chris Wensel: “Hadoop is never used alone.” It’s always part of a data environment that also includes databases, visualization tools, analytics software and/or statistical analysis tools that arguably do the really valuable work. Hadoop’s real value is as an integration platform that can feed data into these other systems and, ideally, put their outputs to work across much larger datasets.

Developers can use the Pattern Java API to create machine learning jobs, but they can also simply export a Predictive Model Markup Language (PMML) file from software like R, SAS and MicroStrategy, which Pattern will read and run as a Cascading workflow. Models are useless unless you can run them in production, Wensel said, and Pattern lets them run across more data, stored in Hadoop, than you can use to build them with those other tools.
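
To make the PMML hand-off a little more concrete, here is a minimal sketch in plain Java of what "read an exported model and score records with it" can look like. It is not Pattern's API: the class name PmmlScoreSketch, the method score and the tiny hand-written regression model are all assumptions made for illustration, and a real PMML export from R or SAS carries far more detail (a data dictionary, a mining schema, a namespace).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PmmlScoreSketch {
    // Trimmed, hand-written PMML fragment (hypothetical model, not real output from R or SAS).
    static final String PMML =
        "<PMML version=\"4.1\">"
      + "<RegressionModel functionName=\"regression\">"
      + "<RegressionTable intercept=\"1.5\">"
      + "<NumericPredictor name=\"x1\" coefficient=\"2.0\"/>"
      + "<NumericPredictor name=\"x2\" coefficient=\"-0.5\"/>"
      + "</RegressionTable></RegressionModel></PMML>";

    // Scores one record: intercept + sum(coefficient * field value).
    static double score(String pmml, Map<String, Double> record) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(pmml.getBytes(StandardCharsets.UTF_8)));
            Element table = (Element) doc.getElementsByTagName("RegressionTable").item(0);
            double result = Double.parseDouble(table.getAttribute("intercept"));
            NodeList predictors = table.getElementsByTagName("NumericPredictor");
            for (int i = 0; i < predictors.getLength(); i++) {
                Element predictor = (Element) predictors.item(i);
                double coefficient = Double.parseDouble(predictor.getAttribute("coefficient"));
                result += coefficient * record.get(predictor.getAttribute("name"));
            }
            return result;
        } catch (Exception e) {
            // Keep the sketch simple: surface any parsing problem as an unchecked error.
            throw new IllegalStateException("could not read the PMML fragment", e);
        }
    }

    public static void main(String[] args) {
        // Score a single made-up record with fields x1 and x2.
        System.out.println(score(PMML, Map.of("x1", 3.0, "x2", 4.0)));
    }
}
```

The point of the sketch is only that the model travels as data, so the same file can be scored against one record on a laptop or, with a tool like Pattern, against every record in a cluster.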

However, Wensel noted, “The real takeaway isn’t Pattern itself.”

From his perspective, the real story is Pattern plus Cascading plus Lingual, the open source SQL-to-Hadoop tool that Concurrent recently developed and released. Lingual is the tie that binds everything together, creating a sort of assembly line for data as it works its way from generation to delivering some value. For example, someone might create a Cascading job that adds structure to incoming data, and then pull some of the data into R using Lingual. Once a model is created in R and exported to the Hadoop cluster using Pattern, Lingual can feed the MapReduce output file back to R so a data scientist can test the model’s accuracy.


And actually, Wensel said, Lingual could have a positive effect on companies’ bottom lines. Airbnb recently replaced a departed engineer with Lingual for monthly migrations of data from Hadoop into SQL environments. Climate Corporation, a massive Hadoop and Cascading user, could use Lingual to let its crop-and-weather insurance customers access their data from the company’s Hadoop store.

Lingual and Pattern should help Concurrent finally make some money, too. Both of them, as well as the Cascading framework that underpins them, will always be open source, Wensel said, but the company plans to create “a suite of products that will make your life much better if … you standardize on Cascading.”

For example, the company has the ability to monitor jobs at the application level rather than the cluster level, meaning it can tell you the details of that job that’s locking up all the resources and whether you really want to kill it (it might be an important report for the CFO …). “We can do some really interesting things,” Wensel said.

Disclosure: Concurrent is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, the founder of Giga Omni Media, is also a venture partner at True.

This post was updated at 2:48pm PT to correct Chris Wensel’s title. He is CTO.

Feature image courtesy of Shutterstock user PENGYOU91.


May 06

Maybe this is just news to me, but IBM has a SQL-on-Hadoop product in the works called Big SQL. The company announced the technology preview version in March (well under my radar and, from what I’ve seen, nearly everyone else’s radar), and is offering up a cloud-based demo environment for a select group of early users.

As a refresher, the big difference between SQL on Hadoop and the Hadoop connectors that were popular a couple years ago is that SQL-on-Hadoop products query the data where it resides — in HDFS or HBase — rather than pulling it into a relational database environment to analyze it. We have been talking for months about the emergence of a large SQL-on-Hadoop market, but IBM’s name was conspicuously absent from that discussion. The company has Hadoop software called BigInsights and lots of SQL expertise, so it only made sense that IBM would get into the game at some point.

Details on Big SQL are still pretty sparse save for a few high-level blog posts and an instructional video (embedded below), but it looks to take the standard approach, as Cloudera is doing with Impala, of enabling access through traditional tools via JDBC and ODBC drivers.

Ultimately, I think the advent of big data will enable some new types of querying techniques quite a bit different than the SQL queries we’ve come to know and love over the past couple decades. But SQL is still the language du jour and might never go away, so there’s a lot of value to be had if people can put their SQL skills to work on data stored inside Hadoop or other environments, and if companies can work toward a nirvana where all the data is stored in a single place rather than across database environments.

That IBM got this message and got into the game isn’t surprising at all, but it is important. Lots of large companies buy IBM’s software. If it wants them to follow it into the world of big data and Hadoop, it has to give them the tools they need to use it.


Apr 30

There is no shortage of confidence in the Hadoop space, and market leader Cloudera bolstered its own on Tuesday with the general availability of its Impala SQL query engine for Hadoop. And if CEO Mike Olson’s comments are any indication, we’re in for a long ride of competitive jockeying and one-upmanship as Cloudera and its peers go all Microsoft or Google and create myriad new data-processing engines to turn their Hadoop distributions into bona fide platforms.

Launched as a private beta in May 2012 and made public in October, Impala is Cloudera’s attempt to address the growing demand for interactive SQL analytics on Hadoop data. It’s essentially a massively parallel database designed to share the same storage platform and metadata as Hadoop MapReduce, only it is its own separate processing engine.

How Impala fits in

Impala actually uses the same “nearly ANSI” version of SQL as does current standard bearer Hive, but that technology (created by Facebook in 2009 as a data warehouse layer for Hadoop) doesn’t run nearly fast enough to sate many users’ desire for interactive analytics. This is because Hive transforms SQL queries into MapReduce jobs, meaning every one is processed against the entire corpus of data in the Hadoop Distributed File System.
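
To see why that translation costs latency, here is a toy illustration in plain Java of how a query like SELECT page, COUNT(*) FROM visits GROUP BY page breaks down into map, shuffle and reduce steps. It runs in a single process and uses neither Hive's planner nor Hadoop's API; the class GroupByAsMapReduce, its method names and the visits data are all hypothetical. What it shows is the shape of a batch job: the map step reads every input row before any result comes out.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupByAsMapReduce {
    // Map phase: every input row is read and emitted as a (page, 1) pair.
    static List<Map.Entry<String, Integer>> map(List<String> rows) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String page : rows) {
            pairs.add(new SimpleEntry<>(page, 1));
        }
        return pairs;
    }

    // Shuffle phase: pairs are grouped by key so each key lands with one reducer.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: the values for each key are summed into the final count.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int value : group.getValue()) {
                sum += value;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    // The whole "query": the three phases chained in order.
    static Map<String, Integer> countByPage(List<String> rows) {
        return reduce(shuffle(map(rows)));
    }

    public static void main(String[] args) {
        System.out.println(countByPage(List.of("home", "cart", "home")));
    }
}
```

An engine in the Impala mold skips this job-by-job staging and executes the query plan in its own long-running processes, which is where the interactive response times are supposed to come from.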

Sizing up the competition

Only Cloudera isn’t the first to have the idea, nor is it alone in trying to sell interactive SQL on Hadoop. The idea was first commercialized by Boston-based startup Hadapt in 2011, and is now being pushed by numerous startups and larger Hadoop players. Among them: Pivotal (formerly EMC) Greenplum, MapR (with Drill), Hortonworks (with Stinger), Drawn to Scale, Splice Machine, Jethro Data and Citus Data.

Hadapt’s architecture

But Cloudera is arguably the biggest name pushing SQL on Hadoop, and CEO Mike Olson thinks Impala stands out for several reasons, not the least of which is that it exists as a product. “Nobody else is shipping production-grade SQL query support on Hadoop,” he told me during a recent call. “At least not in open source.” He seems content to let the startups do their things, instead focusing his attention on Cloudera’s big three Hadoop-distribution competitors in Pivotal, MapR and Hortonworks. Greenplum and Pivotal SVP Scott Yara was full of confidence (and R&D budget) when the company announced the Pivotal HD distribution and HAWQ technology in February, but Olson claims the approach requires a siloed DBMS within HDFS and is a “rearguard defensive strategy” to protect the company’s sunk costs in its database technology.

The Pivotal HD and HAWQ architecture

As for Hortonworks, Olson questions the wisdom of its Stinger initiative to boost Hive’s speed, noting that “Hive never got good while it was running standalone on MapReduce.” Hortonworks also partners with vendors such as Teradata to let their platforms access Hadoop data in its native format, but those approaches still require sending data over the network. “It’s not the way you would build it if you woke up in the 2000s and were building this anew,” Olson said.

The Stinger roadmap

Olson acknowledged that the MapR-led Apache Drill project is cut from the same cloth as Impala (that is, being a Google Dremel clone designed specifically for Hadoop), but “the difference is we’re shipping code.” Being generally available and ready for production workloads means Cloudera can lock down users and market share before many even have a chance to experiment with Drill. He all but dismissed questions over the readiness of Impala, spurred by rumblings in the Hadoop space that Cloudera rushed it into public beta in order to get on the scoreboard against more fully baked offerings.

“I don’t feel we’re under the gun competitively to pull it out of beta because no one else has product in the market,” Olson said. “I have no problems … calling this GA quality.” He did, however, acknowledge that Impala is shipping with a “minimum viable feature set” that the company has plans to build on in the near future. Impala Senior Product Manager Justin Erickson noted a few issues of concern, including around the number of concurrent users Impala can support, but said they have been addressed during the beta period.

One piece of a larger platform

Really, though, the whole point of Impala and its competitors is to turn Hadoop from a tool for batch analytics and mass storage into a platform that can handle nearly all of companies’ data-processing needs. In that regard, it appears we’re just getting started. Cloudera, MapR, Pivotal Greenplum and Hortonworks are already pushing their own products and projects, and Olson said “it’s absolutely our intent” to enhance Cloudera’s platform with even more open-source products (perhaps even more database technologies a la HBase) that will let users do more stuff with more types of data. Over time, this strategy could result in Hadoop displacing the current breed of databases and data warehouses and becoming the single data store atop which users run whatever applications they so desire. For now, though, especially when it comes to Impala and the data warehouse incumbents, Olson is taking a measured approach. “The likelihood that we’re going to knock them off in the near term,” he said, “… it would be a tough fight to win.”


Mar 20

Concurrent, proprietor of the open-source Cascading framework for developing big data workflows, has closed a $4 million Series A investment round from True Ventures (see disclosure) and Rembrandt Partners. Cascading has been around for a few years, actually, but Concurrent only raised seed funding in 2011 and has been riding the wave of interest in making big data easier to do.

In practice, Cascading is generally used as a higher-level method than MapReduce for writing Hadoop jobs, although it’s technically a framework that could support any number of distributed-processing frameworks. It’s used by a number of notable users, including Etsy, Airbnb and Climate Corporation. In February, the Cascading project expanded its scope to address the growing SQL-on-Hadoop trend with a project called Lingual.

Software veteran Gary Nakamura is taking on the role of Concurrent CEO, replacing Cascading creator Chris Wensel, who’ll stay on as the company’s CTO.


Disclosure: Concurrent is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, Giga Omni Media. Om Malik, the founder of Giga Omni Media, is also a venture partner at True.

