
Open Source Analytics

A short history.

Few inventions in American history have had the massive impact of the IBM® System/360—on technology, on the way the world works, and on the organization that created it. Jim Collins, author of Good to Great, ranks the S/360 as one of the all-time top three business accomplishments, along with Ford’s Model T and Boeing’s first jetliner, the 707.


Most significantly, the S/360 ushered in an era of computer compatibility—for the first time, allowing machines across a product line to work with each other. In fact, it marked a turning point in the emerging field of information science and the understanding of complex systems. After the S/360, we no longer talked about automating particular tasks with “computers.” Now, we talked about managing complex processes through “computer systems.” 

Source: http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/system360/

Next, operating systems emerged, introducing economies of scale.

Before the System/360 operating system was introduced in 1964, each peripheral came with its own user interface, programming model, connection ports, and storage media. That meant a programmer was effectively starting from scratch for every new business solution. It’s like Ford having to invent a new engine every time it releases a new car and retrain all of the mechanics who support it. Ten million lines of code and more than 20 peripherals later, the System/360 emerged as the information platform that helped put a man on the moon, manage millions of flight reservations, and introduce the era of information science. A standard operating system for information science created economies of scale, which in turn lowered the barrier to entry. Now anyone wanting to develop an information solution could take advantage of the System/360 and take information science even further. One problem: the operating system was tied to a massive mainframe.

Fast forward 30 years, a portable operating system emerged with no strings attached.

Over the next 30 years, the System/360 operating system was adopted by practically every Fortune 100 company as its system of record. IBM established the role of Chief Information Officer, and many companies followed suit by appointing CIOs to organize their information assets. Information management was established, and the information age began.

But data, development, and access were confined to a select few information managers, owing to the high cost of the systems and fear of the spread of misinformation. It wasn’t until 1991, when Linus Torvalds, then a computer science student, created a new operating system that was “portable” to any system, big or small, that data began to be democratized. Not only did he make the operating system portable, he also licensed it as open source, removing the barriers to adoption: all you needed was hardware, skills, and creativity. This portable application operating system was Linux. At 13 million lines of code, Linux established itself as the application development operating system that launched the Web, social, mobile, and the countless applications that created new systems of engagement. A massive audience could now interact with ubiquitous information.

In 2000, Linux received an important boost when IBM announced it would embrace Linux as a core part of its systems strategy. A year later, IBM invested US$1 billion to back the Linux movement, adopting it as an operating system for IBM servers and software. Over the next 15 years, IBM introduced 500 solutions built on Linux and contributed millions of lines of code from over 600 open source contributors.

Millions of applications built on Linux opened the floodgates to rich data, with value trapped just below the surface. To fish that value out, mathematicians fell madly in love with systems engineers.

Almost as quickly as Linux was introduced, the data exhaust created by applications across mobile, social, and the web in new systems of engagement introduced a data problem never before seen. Simply finding information became a monumental challenge: the World Wide Web needed an open source search engine. Doug Cutting, then an engineer at Yahoo, and University of Washington grad student Mike Cafarella built what became Apache Hadoop, a marvel of systems engineering designed to distribute data and processing across many commodity servers. Apache Hadoop turned working with data on its head. All of a sudden you could leap over the information managers who controlled the “extract, transform, load” (ETL) process that bottlenecked new data ingestion. Apache Hadoop introduced “extract, load, transform” (ELT), making it possible for anyone to work with any data type, no matter the source. Apache Hadoop’s success as an unstructured data management environment set the bar; the introduction of Apache Spark a few years later did the same for compute and took distributed systems even further. If Apache Hadoop was the hard drive, Apache Spark is the processing chip for complex math. In a short time, we went from algebra to calculus, making machine learning possible at a much larger scale.
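To make the ETL-versus-ELT shift concrete, here is a rough sketch of the ELT pattern, using Spark’s DataFrame API in Java for the transform step; the HDFS paths, the eventType field, and the column names are purely hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EltSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("elt-sketch")
                .master("local[*]")
                .getOrCreate();

        // Extract + Load: land the raw, schemaless JSON events exactly as they arrive.
        // (hdfs:///data/raw/events is a hypothetical landing directory.)
        Dataset<Row> raw = spark.read().json("hdfs:///data/raw/events");

        // Transform: shape the data at read time, long after ingestion,
        // instead of forcing it through an ETL gate up front.
        Dataset<Row> clicks = raw
                .filter("eventType = 'click'")
                .select("userId", "ts", "url");

        clicks.write().mode("overwrite").parquet("hdfs:///data/curated/clicks");
        spark.stop();
    }
}

The ordering is the whole point: the raw events are landed first, untouched, and only given shape when someone needs them, rather than before they can even be stored.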


Now mathematicians can use any data type to build algorithms that learn, and the challenge is no longer a data problem but a systems engineering one. Apache Spark changes the way we work with data, with an elegant API that means you don’t have to think in terms of distributed programming. Spark does this by keeping work in memory and organizing it as a directed acyclic graph (DAG) of operations, so data can be processed interactively while carrying forward the advantages Hadoop introduced: there’s no need to format, cleanse, or manipulate the data before storage and processing. Hadoop and Spark have set the stage for a new way to manage and compute data, ushering in the Cognitive Era. Spark and Hadoop alone, however, are not enough to build a robust platform that is portable, scalable, usable, flexible, and able to meet the demands of industry. For this reason, we launched the Spark Technology Center in San Francisco, and an additional STC in India last week. The Spark Technology Center is growing the ecosystem around Apache Spark to help meet the real-world demand for Spark-based applications. Already we have introduced Apache SystemML, Apache Toree, and most recently Quarks to expand the industry use cases for distributed analytics.
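Here is a rough sketch of what that programming model looks like in practice, assuming a hypothetical log file on HDFS; each transformation only adds to the DAG, and nothing runs on the cluster until the final action.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class DagSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dag-sketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The file path is hypothetical; any line-oriented dataset works.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/raw/app.log");

        // Each transformation below only adds a node to the DAG; nothing executes yet,
        // and nothing in this code mentions partitions, nodes, or network shuffles.
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR"));
        JavaRDD<Integer> lengths = errors.map(String::length);

        // cache() asks Spark to keep the intermediate result in memory for reuse.
        lengths.cache();

        // The action is what triggers Spark to schedule the whole DAG across the cluster.
        long count = lengths.count();
        System.out.println("error lines: " + count);

        sc.close();
    }
}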

A quark (/ˈkwɔːrk/ or /ˈkwɑːrk/) is an elementary particle and a fundamental constituent of matter. Quarks combine to form composite particles called hadrons, the most stable of which are protons and neutrons, the components of atomic nuclei.

Last week IBM introduced a new open source project called Quarks. The name reflects its role as the smallest analytics operating system, one that can run on any device imaginable. It was created from years of research and development on System S, IBM’s streams technology, which supplies the foundation for continuous computing at some of the most advanced organizations in the world, including the city of Stockholm, Wimbledon, telcos, financial institutions, governments, automotive companies, and many others.

Now every device can be more intelligent at the edge without having to be always connected to the internet. Quarks allows complex models to run at speed and analytics to run against data streams, not only data at rest. Quarks works with Spark to unify access to data across the organization through support for multiple programming languages and a multitude of data sources, and it reduces development time with high-level tools for machine learning and streaming data.

Quarks with Spark opens data science to many users, such as designers, mathematicians, data scientists, and developers. It’s an agile way to build applications powered by any kind of data and push them to the absolute edge of the web.
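As an illustration, here is a minimal sketch modeled on the Quarks getting-started pattern, in which a DirectProvider runs a small topology directly on the device; the simulated sensor, the threshold, and the class name are invented for this example, and exact package names and signatures may vary between releases.

import java.util.Random;
import java.util.concurrent.TimeUnit;

import quarks.providers.direct.DirectProvider;
import quarks.topology.TStream;
import quarks.topology.Topology;

public class EdgeSensorSketch {
    public static void main(String[] args) {
        DirectProvider dp = new DirectProvider();
        Topology topology = dp.newTopology("engine-sensor");

        // Poll a (simulated) sensor every 100 ms; on a real device this supplier
        // would read from hardware.
        Random random = new Random();
        TStream<Double> readings = topology.poll(
                () -> 60.0 + random.nextGaussian(), 100, TimeUnit.MILLISECONDS);

        // Analyze the stream on the device itself: only unusual readings
        // ever need to leave the edge.
        TStream<Double> anomalies = readings.filter(temp -> temp > 65.0 || temp < 55.0);

        // Here we just print; in practice this sink would publish to a hub
        // (MQTT, Kafka, ...) feeding a Spark-based back end.
        anomalies.print();

        dp.submit(topology);
    }
}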

So what’s next? My prediction is that we are on the verge of data products becoming mainstream.

Data products are not quite applications, nor are they simply mathematical equations. To me, they are a combination of data pipelines that feed machine learning algorithms embedded into the very fabric of our decision-making experiences. These data products are the gateway to the cognitive era. They will be built to augment human thought processes in a computerized model: cognitive computing involves self-learning systems that use data mining, pattern recognition, and natural language processing to mimic the way the human brain works.
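As a sketch of what the pipeline half of a data product might look like, here is a hypothetical Spark ML example in Java that trains a model to flag urgent support tickets; the data path, column names, and scenario are all made up.

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataProductSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("data-product-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical training set with "text" and "label" columns,
        // e.g. past support tickets labeled urgent (1.0) or not (0.0).
        Dataset<Row> training = spark.read().parquet("hdfs:///data/curated/tickets");

        // The data pipeline: raw text -> tokens -> features -> learned model.
        Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
        HashingTF hashingTF = new HashingTF().setInputCol("words").setOutputCol("features");
        LogisticRegression lr = new LogisticRegression().setMaxIter(10);
        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{tokenizer, hashingTF, lr});

        // Fitting produces a single reusable artifact.
        PipelineModel model = pipeline.fit(training);
        model.transform(training).select("text", "prediction").show(5);

        spark.stop();
    }
}

In a real data product, the fitted model would be saved and served inside the application where the decision is actually made, which is what embedding it into the fabric of our decision-making experiences means in practice.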

Want to build data products and learn more about open source analytics? Join me at our next Datapalooza event in Austin next month by going to http://www.spark.tc/datapalooza

 

Announcing the Bay Area Analytics Meetup


Over the past few months, I’ve attended many meetups in SF and the Bay Area: big data, analytics, data science, social media, Hadoop, Vertica, d3.js, you name it. I have yet to find one that is truly focused on leveraging data analytics to power your business. For this reason, I am starting a completely new and open group to discuss how we can solve real-world business problems with data by sharing experience and transferring knowledge that can make us all better at data-driven decision making.

We’ll meet once a month at a location on the Peninsula and have an open forum with breakout sessions tailored to your interests. We already have more than 120 members. I’ve analyzed your feedback, and the infographic below shows what our members are interested in.

Feel free to check out the group here: http://www.meetup.com/BayAreaAnalytics and of course let me know if you have other ideas or topics you’d like to discuss at future meetups.

Looking forward to seeing you at the first meetup!