Category Archives: Data Science

Open Source Analytics

A short history.

Few inventions in American history have had the massive impact of the IBM® System/360—on technology, on the way the world works, and on the organization that created it. Jim Collins, author of Good to Great, ranks the S/360 as one of the all-time top three business accomplishments, along with Ford’s Model T and Boeing’s first jetliner, the 707.


Most significantly, the S/360 ushered in an era of computer compatibility—for the first time, allowing machines across a product line to work with each other. In fact, it marked a turning point in the emerging field of information science and the understanding of complex systems. After the S/360, we no longer talked about automating particular tasks with “computers.” Now, we talked about managing complex processes through “computer systems.” 

Source: http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/system360/

Next, operating systems emerged, introducing economies of scale.

Before the System/360 and its operating system were introduced in 1964, every peripheral came with its own user interface, programming model, connection ports, and storage media. For any new business solution, a programmer was effectively starting from scratch. It’s as if Ford had to invent a new engine every time it released a new car, and then retrain all of the mechanics who support it. Ten million lines of code and more than 20 peripheral solutions later, the System/360 emerged as the information platform that helped put a man on the moon, manage millions of flight reservations, and introduce the era of information science. A standard operating system created economies of scale, which in turn lowered the barrier to entry to information science. Anyone wanting to develop an information solution could build on the System/360 and take information science even further. One problem: the operating system was tied to a massive mainframe.

Fast forward 30 years: a portable operating system emerged, with no strings attached.

Over the next 30 years the System/360 operating system was adopted by practically every Fortune 100 company as its system of record. IBM established the role of Chief Information Officer, and many companies followed suit by appointing CIOs to organize their information assets. Information management was established, and the information age began.

But data, development, and access were confined to a select few information managers, due to the high costs of these systems and fear of the spread of misinformation. It wasn’t until 1991, when Linus Torvalds, a computer scientist, created a new operating system that was “portable” to any system, big or small, that data began to be democratized. Not only did he make the operating system portable, he also licensed it as open source technology, removing nearly every barrier to adoption: all you needed was hardware, skills, and creativity. This portable application operating system was Linux. At 13 million lines of code, Linux established itself as the application development operating system that launched the web, social, mobile, and countless other applications that created new systems of engagement. A massive audience could now interact with ubiquitous information.

In 2000, Linux received an important boost when IBM announced it would embrace Linux as strategic to its systems business. A year later, IBM invested US$1 billion to back the Linux movement, embracing it as an operating system for IBM servers and software. Over the next 15 years, IBM introduced 500 solutions built on Linux and contributed millions of lines of code from more than 600 open source contributors.

Millions of applications built on Linux opened the floodgates to rich data, with value trapped just below the surface. To fish that value out, mathematicians fell madly in love with systems engineers.

Almost as quickly as Linux was introduced, the data exhaust created by mobile, social, and web applications in new systems of engagement introduced a data problem never before seen. Simply finding information became a monumental challenge: the world wide web needed an open source search engine. Doug Cutting, then an engineer at Yahoo, and University of Washington grad student Mike Cafarella built what became Apache Hadoop, a marvel of systems engineering designed to distribute data and processing across many commodity servers. Apache Hadoop turned working with data on its head. All of a sudden you could leap over the information managers who controlled the “extract, transform, load” (ETL) process that bottlenecked new data ingestion. Apache Hadoop introduced “extract, load, transform” (ELT), making it possible for anyone to work with any data type—no matter the source. Apache Hadoop’s success as an unstructured data management environment set the bar; the introduction of Apache Spark a few years later did for compute what Hadoop had done for storage and took distributed systems even further. If Apache Hadoop was the hard drive, Apache Spark is the processing chip for complex math. In a short time, we went from algebra to calculus, making machine learning possible at a much larger scale.
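To make the ETL-versus-ELT distinction concrete, here is a minimal PySpark sketch of the ELT pattern; the file path and field names are illustrative assumptions, not real ones. The raw events are landed exactly as they arrive, and structure is imposed only when the data is read for analysis.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# E and L: land the raw, semi-structured events exactly as they arrive;
# no upfront schema design or cleansing step is required.
raw = spark.read.json("hdfs:///landing/clickstream/")  # hypothetical path

# T: apply structure only at analysis time.
daily_counts = (
    raw
    .withColumn("day", F.to_date("timestamp"))  # assumes a 'timestamp' field
    .groupBy("day", "page")                     # assumes a 'page' field
    .count()
)

daily_counts.show()
```

The point of the pattern is that nobody has to negotiate a schema before the data can be stored; the transformation logic lives with the analysis instead of in an upstream ingestion pipeline.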


Now mathematicians can use any data type to build algorithms that learn, and the challenge is no longer a data problem but a systems engineering one. Apache Spark changes the way we work with data through an elegant API that frees you from thinking in terms of distributed programming. Spark represents a computation as a directed acyclic graph (DAG) of operations and keeps intermediate data in memory, so it can process data interactively while carrying forward the advantages Hadoop introduced: there is no need to format, cleanse, or manipulate the data before storage and processing. Hadoop and Spark have set the stage for a new way to manage and compute data, ushering in the Cognitive Era. Spark and Hadoop alone are not enough to build a robust platform that is portable, scalable, usable, and flexible—and able to meet the demands of industry. For this reason, we launched the Spark Technology Center in San Francisco, and an additional STC in India last week. The Spark Technology Center is growing the ecosystem around Apache Spark to help meet the real-world demand for Spark-based applications. Already we have introduced Apache SystemML, Apache Toree, and most recently Quarks to expand the industry use cases for distributed analytics.
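As a small illustration of that API (a hedged sketch, with a hypothetical input path), the classic word count reads like ordinary collection processing. Each chained transformation only extends the DAG; Spark executes the whole graph in parallel across the cluster only when a result is requested.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-sketch").getOrCreate()

# Each transformation below only extends the DAG; nothing executes yet.
lines = spark.sparkContext.textFile("hdfs:///data/books/")  # hypothetical path
counts = (
    lines
    .flatMap(lambda line: line.split())   # split each line into words
    .map(lambda word: (word, 1))          # pair every word with a count of 1
    .reduceByKey(lambda a, b: a + b)      # sum the counts per word, cluster-wide
)

# takeOrdered() is an action: only now does Spark schedule the DAG for execution.
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))
```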

A quark (/ˈkwɔːrk/ or /ˈkwɑːrk/) is an elementary particle and a fundamental constituent of matter. Quarks combine to form composite particles called hadrons, the most stable of which are protons and neutrons, the components of atomic nuclei.

Last week IBM introduced a new open source project called Quarks. The name reflects its goal: the smallest analytics runtime, able to run on virtually any device imaginable. It grew out of years of research and development on System S, the streams technology that supplies the foundation for continuous computing at some of the most advanced organizations in the world, including the city of Stockholm, Wimbledon, telcos, financial institutions, governments, automakers, and many others.

Now every device can be more intelligent at the edge without having to be constantly connected to the internet. Quarks lets complex models run at high speed and brings analytics to data streams, not only data at rest. Quarks works with Spark to unify access to data across the organization through support for multiple programming languages and a multitude of data sources, and it reduces development time with high-level tools for machine learning and streaming data.

Quarks with Spark opens data science to many kinds of users: designers, mathematicians, data scientists, and developers. It’s an agile way to build applications powered by any kind of data and push them to the absolute edge of the web.
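The idea is easy to picture. The toy Python sketch below is not the Quarks API itself (Quarks exposes a Java streaming API); it only illustrates the edge pattern the paragraph describes: analyze a sensor stream locally on the device and forward just the noteworthy readings for heavier analysis upstream. The sensor, threshold, and forwarding function are all hypothetical.

```python
import itertools
import random
import time

def sensor_readings():
    """Simulated on-device temperature sensor, polled once per second."""
    while True:
        yield 20.0 + random.gauss(0, 5)
        time.sleep(1)

def forward_upstream(reading):
    """Stand-in for publishing an event to a central system (e.g. over MQTT)."""
    print(f"forwarding anomalous reading: {reading:.1f}")

THRESHOLD = 30.0  # hypothetical alert threshold

# Edge analytics in miniature: evaluate every reading locally on the device,
# send only the interesting ones over the network.
for reading in itertools.islice(sensor_readings(), 30):
    if reading > THRESHOLD:
        forward_upstream(reading)
```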

So what’s next? My prediction is that we are on the verge of data products becoming mainstream.

Data products are not quite applications, nor are they simply mathematical equations. To me, they are data pipelines that feed machine learning algorithms embedded into the very fabric of our decision-making experiences. These data products are the gateway to the cognitive era. Data products will be built to augment human thought processes in a computerized model. Cognitive computing involves self-learning systems that use data mining, pattern recognition, and natural language processing to mimic the way the human brain works.
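As a hedged sketch of that idea (the column names, toy data, threshold, and model choice are all illustrative assumptions), a minimal data product is a pipeline that turns raw records into features, feeds them to a learned model, and wraps the prediction in a decision.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical historical data: usage features plus a churn label.
history = pd.DataFrame({
    "logins_per_week": [1, 9, 0, 7, 2, 8],
    "support_tickets": [4, 0, 5, 1, 3, 0],
    "churned":         [1, 0, 1, 0, 1, 0],
})

# The data pipeline and the learning algorithm, bundled together.
features = ["logins_per_week", "support_tickets"]
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(history[features], history["churned"])

def retention_decision(logins_per_week, support_tickets, threshold=0.5):
    """Embed the model's output directly in a business decision."""
    X = pd.DataFrame([[logins_per_week, support_tickets]], columns=features)
    churn_risk = model.predict_proba(X)[0, 1]
    return "offer retention incentive" if churn_risk > threshold else "no action"

print(retention_decision(logins_per_week=1, support_tickets=4))
```

The point is that the model never surfaces as an equation to the end user; it surfaces as a decision woven into an experience.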

Want to build data products and learn more about open source analytics? Join me at our next Datapalooza event in Austin next month by going to http://www.spark.tc/datapalooza

 

Designer Data Science

I am pleased to report that Big Data is here to stay and that we are now moving into the application age, with many organizations moving beyond descriptive analytics (BI) to a prescriptive, machine learning focus. After attending Strata NYC last month and DataBeat this past week, I am seeing firsthand how rapidly this trend is evolving. First, let’s consider another major technological shift that happened a little over 10 years ago, when the internet and web applications came of age.

At first there were only a handful of ways to access the web, through “internet portals,” before many people could reach the open web and truly leverage its amazing potential to communicate, access information quickly, and create content. Next, we saw the dot-com boom create a huge demand for web developers, with little emphasis on design. I remember fondly many of my engineering colleagues jumping into the fray, learning PHP, HTML, TCP/IP, and other web technologies to take advantage of the demand. It wasn’t until the bubble burst and the next era of Web 2.0 arrived that frameworks became standard and the focus shifted to design. These days, do people call themselves web developers? Not really; I’d say you see more web designers attracting the high salaries, people who can use established web frameworks to design the best customer experience.


It reminds me of the situation of the data scientist today, where many believe the best are great programmers who can leverage R, Python, and MapReduce to create one-off analyses. Scott Yara from Pivotal went so far as to say last week, “It only takes minutes for a Programmer to become a Data Scientist.” Do we truly believe that? When we heard from Allen Day, Data Scientist at MapR, he did not talk in terms of data frames or Hadoop jobs. Instead, he focused on the design component of engineering a big data application. No question he has a strong ability to program and work with big data technology, but what truly sets him apart is his ability to design solutions. You can hear more snippets from his talk, “What Shape is Your Data,” by liking us on Facebook.

Today the majority of data science applications lean heavily on coding and scripting frameworks (Python, R, Scala, Java, and MapReduce). At Alpine, however, we are thinking differently about how to design and replicate analyses without having to start from scratch each time. We go further and abstract the code into representations of operations, making the work far less programming-intensive; a toy sketch of the idea follows below. I agree with Trifacta’s CEO Joe Hellerstein when he says, “Let’s take the programming requirement out of Data Science.”
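Here is that toy sketch of what “abstracting code into representations of operations” can mean; it is purely illustrative and not Alpine’s implementation. Each analysis step becomes a named, reusable operator, and a workflow is just an ordered list of operators that can be replicated on any dataset with matching columns.

```python
# Illustrative only: a minimal "operator" abstraction, not Alpine's product.
from dataclasses import dataclass
from typing import Callable, List
import pandas as pd

@dataclass
class Operator:
    name: str
    apply: Callable[[pd.DataFrame], pd.DataFrame]

def run_workflow(df: pd.DataFrame, workflow: List[Operator]) -> pd.DataFrame:
    """Apply each operator in order, so the analysis is declared, not re-coded."""
    for op in workflow:
        df = op.apply(df)
        print(f"after {op.name}: {len(df)} rows")
    return df

# Hypothetical workflow: the same steps can be reused on any dataset
# that has 'age', 'segment', and 'spend' columns.
workflow = [
    Operator("drop_missing", lambda d: d.dropna()),
    Operator("filter_adults", lambda d: d[d["age"] >= 18]),
    Operator("summarize", lambda d: d.groupby("segment", as_index=False)["spend"].mean()),
]

data = pd.DataFrame({
    "age": [25, 17, None, 40],
    "segment": ["a", "a", "b", "b"],
    "spend": [10.0, 5.0, 8.0, 20.0],
})
print(run_workflow(data, workflow))
```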