
Welcome to the Maker Generation


After spending a few years in the Bay Area working at high-tech, “Big Data” startups (Datameer, Alpine Data Labs, and H2O.ai, to name a few), my family and I decided to leave the fast-paced, sun-soaked peninsula to start the next chapter in our lives. In June of 2016, my wife, son, and I packed up our small rental house in San Mateo, California and moved to the Pacific Northwest (PNW) for the outdoors, the affordability, and what I call the Maker movement culture of Portland, Oregon. Whether it’s biking, cooking, brewing, composing, crafting, or any other activity that involves sustained concentration, Makers are flourishing in this town thanks to a number of key elements. Over the next few months, I’ll attempt to extract these elements as I interview local makers across many industries and professions. Follow me as I embark on a tour of Portland Makers, delivered to you in dispatches every month.

In the meantime, check out a chapter from a book I contributed to called “The End of Tech Companies” by Rob Thomas.

“Developers make software for the world to use. The job of a developer is to crank out code – fresh code for new products, code fixes for maintenance, code for business logic, and code for supporting libraries.” –Nick Hardiman 

When was the last time you built something from nothing? Was it the time you had to make a diorama for a school project? How about a gift for someone else? Perhaps you composed a song for someone you love. Whatever it was, there is nothing quite like the feeling of creating something from nothing. It is a form of expression that invokes creativity, freedom, passion, and deep thinking. For these reasons, a growing number of people are inspired to learn new skills to make new things. Until recently, people who identified themselves as “makers” were considered hobbyists, do-it-yourselfers, craftsmen, or simply tinkerers. Although those makers are continuing to thrive, other makers in the form of Designers, Developers, Marketers, and the emerging Data Science Practitioners are moving out of niche areas into professions across every industry. Some of these professions used to be considered Ivory Tower disciplines. But now computer science, for example, has been penetrated by the “maker movement,” and its practitioners are simply recast as “developers”.

Developers represent the largest maker movement of our time, making software for everything imaginable, from consumer applications to enterprise processes, to entire marketplaces and, most recently, to automated systems that can think. It is important to realize that these developer makers operate differently than others. For one thing, they are highly suspicious of “black box” solutions. Many vendors have tried to reach developer makers with proprietary software solutions, and failed. Developer makers also differ from other professions in how they work. For example, as Paul Graham states, “one reason programmers dislike meetings so much is that they’re on a different type of schedule from other people. Meetings cost them more.” Although this was written about developers, it applies to any profession in which sustained attention is needed to build. He goes on to say that “when you are operating on a maker’s schedule, meetings are a disaster. A single meeting can blow a whole afternoon, by breaking it into two pieces each too small to do anything hard in.” Thinking gets pulled back to the fast-paced world of immediate actions and away from deep concentration. Reduce the distractions and developer makers become far more productive, thanks to their highly resourceful and self-reliant nature: they provide highly detailed information about their programs in the form of documentation, example code, and a thriving community. It is no wonder that developer makers are the makers who most significantly disrupt industries and professions. One profession that had been out of reach until recently is Information Management.

In early 2010, a diverse functional team sat down to discuss why user growth and revenue had slowed, even though installations were on the rise. Marketing professionals presented their campaign data that showed a strong conversion rate of web traffic to downloads. Product analysts showed that the free-to-paid conversion was steady. Lastly, financial analysts showed that the daily, weekly, and monthly active user counts were declining, along with subsequent revenue. Business analysts were on the hook to come up with an explanation for this disconnect. Unfortunately, they had neither access to data, nor a flexible data environment, nor the sophisticated analytical tools needed to connect the dots.

Meanwhile, the engineering team was collecting application log files, network delivery log files, installer log files, and license types as part of the quality control effort. These individuals were not considered part of the “information” or “business intelligence” group and therefore did not make this data available to others – until a group of data scientists and engineers, or data makers, convinced the engineering team to open its data assets to the organization through a distributed data environment built on Hadoop. As soon as these data makers were able to work with the data in an unrestricted way, they quickly developed data products, including curated data sets and business metrics, which could be validated by analysts before being rolled out to the rest of the organization. It took a team of data makers, who could facilitate a conversation across the two organizations, to expand the corpus of information that was available to the business. Data was used programmatically to solve the riddle of the slowing user growth. As it turned out, the answer could be found in the combination of clickstream data, installer log files, and transaction records, which showed a channel-specific relationship to specific product offerings that was not otherwise apparent.

As they disrupt the information consumption status quo, data makers are emerging as organizational change agents. Prior to 2011, Information Management consisted of a linear series of steps to produce a dashboard or report that could be distributed, perhaps quarterly, as part of a business review. Data makers apply creativity to attack business outcomes. They wrangle, munge, extract, and analyze data to transform it into a product that incites others to act. To foster a data maker culture, it is critical to make data available, provide an open forum for results to be discussed, and provide a collaborative environment for data artifacts to be shared across organizations. 

Today these professions share information freely and promote education through workshops and online courses. Individuals and organizations are starting to realize that to do their best work or attract top talent, the walls between professions must come down. Makers, by their very nature, are collaborative and open to all comers. Makers have driven the rise of open source software, meetups, hackathons, Massive Open Online Courses (MOOCs), and various programs that promote inclusivity in technology. This is the Maker Era: a key cultural condition for prospering in the post-tech world.

READ THE FULL BOOK HERE

All proceeds go to

Data Needs a Platform


Google or the Yellow Pages, Uber or Yellow Cab, Netflix or Comcast, Nest or Honeywell, Stitch Fix or Nordstrom, Etsy or Bernhardt Furniture: every industry, profession, startup, and enterprise is making a hard shift to digital at a rapid pace. Data exhaust is growing exponentially from every interaction as a by-product of this digital evolution across mobile, web, social, commerce, and many new touch points of the digital landscape. According to Oxford Dictionaries, data is “facts and statistics collected together for reference or analysis.” It is important to note that data is different from information. It is an abstraction of information that lends itself to code and math for making data products. An ecosystem of data suppliers, producers, services, and consumers is emerging to support a dataFirst development practice.

Many factors are accelerating the transition from the offline world to the always-on generation, including a cultural shift in connectedness, a technology shift from centralized to decentralized computing infrastructure, and an economic shift from cost-prohibitive resources to accessible cloud computing, memory, storage, and software, driven by the rise of Open Source Software (OSS) and indirect monetization business models. Making this transition is not easy; it requires data literacy to stay competitive and take full advantage of this shift.

At IBM, we have recognized this shift by declaring cloud computing and cognitive solutions as our strategic initiatives. At the center of this shift is data. Said another way, our clients are moving their business online and, in doing so, creating data exhaust that can be leveraged for machine learning to build data products like customer service chatbots and teaching assistants. Unfortunately, there isn’t a data platform for working with exhaust data and transactional data to build data products for cognitive solutions.

We have built platforms for application development (Bluemix) and cognitive solutions (Watson), yet have not built a platform for data to connect the two with a robust ecosystem of data producer and consumer partners. Instead, as an industry, we have continued to drive product-centric data ecosystems that succeeded in the past but are now faltering due to the transforming data consumer. For example, the NoSQL data ecosystem (Hadoop, Cassandra, MongoDB), the MPP data ecosystem (Vertica, Netezza, Greenplum), and the RDBMS data ecosystem (MySQL, PostgreSQL, DB2, Oracle, etc.) have all depended on a Business Intelligence consumption model to drive a business process. Dashboards are dead.

On September 26th, IBM will launch the first data platform built on open source software and cloud computing, with key Watson services included to deliver cognitive solutions. Additionally, we’ll introduce dataFirst methods to help clients and partners bridge the gap between digital and cognitive solutions. We’ll introduce dataFirst certifications to extend the data platform by supporting a broad ecosystem of partners building on a single data platform that is open for all. And we’ll bring together leading data programs across IBM, including consulting services, skills and training, independent software vendors, technology leaders, and many others who have an interest in data, into a seamless experience that maximizes interactions between data producers and consumers.

In the past year, we invested in open source technology, most notably Apache Spark as the Analytics OS, and introduced industry-leading user experiences for both data consumers and data producers. Watson Analytics makes analytics consumable, and the Data Science Experience makes data producible. Together, these two offerings represent the two ends of the data and analytics spectrum. After the data platform launches, these two disparate experiences become connected through a fabric with open services to a growing ecosystem of suppliers for data ingestion, persistence, machine learning, orchestration, discovery, and access. In addition to the Data Science Experience and Watson Analytics, other producer and consumer endpoints will also emerge to address every industry and profession. For example, in IoT, we’ll introduce experiences for device makers and application developers. By connecting data producers to data consumers, a data marketplace is born in which dataFirst practitioners collaborate and learn from each other instead of remaining niche providers of disconnected ecosystems.

Over the next 30 days, read about data literacy, open source software, pipelines to platforms, and how to shift our industry from product-pipeline commoditization to platform growth that serves emerging markets. Join us at our dataFirst launch event: http://ibm.co/datafirstnow

R and Hadoop make Machine Learning Possible for Everyone.

Note: this is a repost from the article I wrote for KDNuggets last November.

In statistics, bootstrapping can refer to any test or metric that relies on random sampling with replacement.  In simple terms, it provides a way to measure the accuracy of a sampling distribution and is often used in constructing a hypothesis test.  In business, bootstrapping refers to starting a business without external help or capital.  Bootstrapping in general parlance refers to an absurdly impossible action, “to pull oneself over a fence by one’s bootstraps.”  R and Hadoop are very much bootstrapped technologies, having received zero direct investment capital and relying on what might appear to be a random group of contributors over the past 20 years, in practically every industry and use case imaginable.
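
To make the statistical sense concrete, here is a minimal base-R sketch of bootstrapping the mean of a small sample; the data and the number of resamples are made up purely for illustration.

# Bootstrap the mean of a small, made-up sample (illustration only)
set.seed(42)
x <- c(4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9)       # hypothetical measurements
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))
mean(boot_means)                                      # bootstrap estimate of the mean
quantile(boot_means, c(0.025, 0.975))                 # 95% percentile interval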

R Pirates Pillage Businesses Worldwide

R first appeared in 1993, when Ross Ihaka and Robert Gentleman at the University of Auckland released a free version as a software package.  Since then, R has grown to over 3 million users in the US alone, according to download-site log files released last year.

In addition, R surpassed SAS with over 7,000 unique packages, which you can view on the Crantastic website.  It is no wonder it has found wide use in many industries and academia. In fact, during the summer of 2014, R surpassed IBM SPSS as the most widely used analytics software for scholarly articles, according to Robert Muenchen.  For this reason, R is now the “gold standard” for doing all sorts of statistics, economics, and even machine learning.  Furthermore, from my experience, I have found that many if not most people use R as a complementary tool today for spot-checking their work, even when using other far more expensive or popular enterprise software.  It is no wonder R is quickly taking over as the go-to tool for Data Scientists in the 21st century.

What is fueling R’s growth is predominantly the community, which makes the core software useful and relevant by providing answers to common questions via many blogs and user groups.  In addition, it is clear there is an underserved job market, according to data from LinkedIn (see image below).  Due to this demand, R is now offered in practically all major universities as the de facto language for statistical programming, and many new online courses are starting each day.  DataCamp is one such example, having built an interactive web environment with rich lessons that let non-programmers get started without ever touching a command line.

People to jobs ratio

Businesses, too, are flocking to statistics and embracing the probabilistic rather than deterministic nature of problems that arise when data expands at a size and rate where traditional Business Intelligence cannot keep pace.  For this reason, many turned to Hadoop to open up the data platform and unlock the world of enterprise data management that had been kept away from business analysts for many years.  Gone are the days of pre-filtered, pre-aggregated dashboards and Excel workbooks emailed around haphazardly to executives and decision makers, left with little interpretation or devoid of any “storytelling” to guide the business toward informed decisions.

Hadoop Growth

Apache Hadoop arrived in 2005, more than a decade after R first hit the scene, and wasn’t widely adopted until as late as 2013, when more than half of the Fortune 50 got around to building their own clusters.  The name “Hadoop” comes from the toy elephant of the son of famed Yahoo! engineer Doug Cutting, who along with Mike Cafarella originally developed the technology to create a better search engine, of course. Along with its ability to process enormous amounts of data on relatively inexpensive hardware, it also made it possible to store data on a distributed file system (HDFS) without having to transform it ahead of time.  As with R, many open source projects were created to re-imagine the data platform, starting with getting data into HDFS (Sqoop, Flume, Kafka, etc.), to compute and streaming (Spark, YARN, MapReduce, Storm, etc.), to querying data (Hive, Pig, Stinger/Tez, Drill, Presto, etc.), to datastores (HBase, Cassandra, Redis, Voldemort, etc.), to schedulers (Oozie, Cascading, Scalding, etc.), and finally to machine learning (Mahout, MLlib, H2O, etc.), among many other applications.

Unfortunately, there is not a simple way to see all of these technologies and install them with one line of code the way you can in R.  Nor is MapReduce a simple language for the average developer.  In fact, as with R, you can clearly see the shortage of Hadoop- and MapReduce-skilled workers relative to the number of jobs available, thanks to LinkedIn.  It is for this reason that Hadoop has not fully caught fire in the same way R has, and there was talk of its demise at the recent Strata Hadoop World conference in NYC this past fall.
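
For contrast, this is what “one line of code” looks like on the R side; the package name here is just an example.

install.packages("data.table")   # fetch and install a package from CRAN in one line
library(data.table)              # load it and start working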

People and Jobs Ratio at Strata

What’s the Real Problem Here? In One Answer, Data

Over the past few years, issues with data have cropped up in the field of data science as the number one problem faced when working with the vast variety and volume of data.  I’d be remiss not to mention the velocity of unrelenting data waves crashing against our fragile analysis environments.  In fact, the volume of data is projected to exceed the number of stars in the universe by 2020, according to IDC.  Fortunately, there is an entirely new approach to this problem that has until now escaped us in our persistent habit of wanting to constrain data to our querying tools.

Machine Learning is the new SQL

Put simply, “Machine Learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data.” It is a quantum shift from the standard way of simply counting things; instead, it’s the start of a fantastic journey into the deeper pools of the unknown.
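
As a toy illustration of learning from data rather than counting it, here is a short R sketch that lets k-means discover groups in the built-in iris measurements; the choice of three clusters is an assumption made for the example.

# Let an algorithm find structure in the data instead of just counting rows
set.seed(1)
fit <- kmeans(iris[, 1:4], centers = 3)   # assume three clusters for illustration
table(fit$cluster, iris$Species)          # compare discovered clusters to the known species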

People and Jobs ratio in ML

So here comes the really interesting part of the story.  According to my LinkedIn analysis, Machine Learning and Data Science are actually very well matched in the overall demand in the job market relative to the people available (unlike R and Hadoop).

We’ll need another plot to really understand what is going on here: the actual number of jobs that exist for Data Science vs. Machine Learning.  My interpretation of this graph is that the job of a data scientist today is synonymous with that of an analytics professional or analyst, and the real opportunity is in the growing area of Machine Learning.

Jobs and Skills in ML

Machine Learning is the New Kid on the Block

Data Science was first described as the intersection of programming or “hacking skills,” math and statistics, and business expertise, according to Drew Conway’s blog.  As it turns out, programming is too generic a term; what is really meant is math applied to large-scale data through new algorithms that can crawl through this tangled mess.  To search for answers in this jungle, simply flying over the canopy will not reveal the treasure boxes hidden just beneath it. It is evident to me that the number of machine learning projects that have cropped up, and the maturity of a market accepting probabilistic rather than only deterministic information, mark a new era in the race to find value in our data assets.

“The machine does not isolate man from the great problems of nature but plunges him more deeply into them.” – Antoine de St. Exupery

Hadoop 2.0 is Here, Sort Of

Many people have tried to claim that Hadoop 2.0 had arrived with MR2, YARN, or high-availability HDFS capabilities, but this is a misnomer when compared with the similarly named Web 2.0 that brought us into the age of web applications like Facebook, Twitter, LinkedIn, Amazon, and the vast majority of the internet.  John Battelle and Tim O’Reilly, of now Strata fame, defined the shift simply as the “Web as a Platform,” meaning software applications are built upon the Web as opposed to the desktop. Hints of this change are coming from Apache Spark, specifically new capabilities like SparkR, KeystoneML, and extensions that are making it possible to develop intelligent applications on large-scale data. As Matei Zaharia, the godfather of Spark, said himself, “it’s all about data science and interfaces,” as he reported in his keynote address earlier this year.  It is now finally possible for Data Scientists and Developers to work together in the same framework.  It is clear to me, having worked in the “Big Data” industry for some time, that software developers and statisticians want to program in their language, not MapReduce.  It’s an exciting time to move off the desktop and onto the cluster, where the constraints are lifted and the opportunities are endless.
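
For a rough feel of what programming “in their language” on a cluster looks like, here is a minimal SparkR sketch; it assumes a local Spark 2.x installation with the SparkR package available, and the dataset and model are placeholders.

library(SparkR)
sparkR.session(master = "local[*]")                   # assumes a local Spark 2.x install
df <- as.DataFrame(faithful)                          # ship an R data frame to Spark
model <- spark.glm(df, waiting ~ eruptions, family = "gaussian")
summary(model)                                        # coefficients computed on the cluster
sparkR.session.stop()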

Jobs and Skills Analysis Explained

Many people have conducted research of late on the growing popularity of statistical software using indirect methods like academic research citations, job posts, books, website traffic, blogs, surveys like the KDnuggets annual poll, GitHub activity, and many more.  However, all of these methods have generally been focused on the technical crowd.  Where the rubber meets the road is in the business context, which in my mind LinkedIn represents, as it is highly representative of the business world.  Further, if you want to reach an even more general audience, you can perform the same trick with Google AdWords.  Reverse engineering ad platforms is a good way to get back-of-the-envelope market-sizing information; I wrote a complete post on this subject on my personal blog.  In the following instructions, I’ll walk through how I used LinkedIn as my sample and R to analyze the business market for the analysis above.

1. Data Gathering

There are two ways to gather data from LinkedIn.  One is to use the ad tool shown in the left image below; the other is to use the direct search functionality shown to the right.

LinkedIn Advertising and Searching

In this case, I went the manual route and used the search function. For each product category that I searched, there is a count of results, which I use as a proxy for demand.  See below:

LinkedIn Product Categories

From this example, we can see the phrase “R Programming” has 1,720 results.  I’ve also included “R statistics” and other “R”-relevant terms.

2. Data Analysis

As a new R user myself, I manually created a data frame to hold the data by first creating the individual vectors for the people and jobs:

name   <- c("R", "Python", "SAS", "SPSS", "Excel", "RapidMiner", "SQL")
people <- c(230750, 815555, 128860, 752205, 15390756, 3306, 4648240)
jobs   <- c(5059, 13519, 2414, 7429, 123874, 17, 37571)

after I created a simple data frame:

skills <- data.frame(name, people, jobs)

to get the ratio, I simply use the transform function:

skills <- transform(skills, ratio = people / jobs)

3. Data Visualization

R comes with many visualization packages, the most notable being ggplot2.  For this situation, I used the built-in barplot, as it was much easier out of the box.  Frankly, the visualizations produced in base R may not seem the most compelling to a general audience, but they do force you to consider what you’re plotting, making for more informed visuals.

to get the bar graph (in H2O colors):

barplot(skills$ratio, names.arg = name, col = "#fbe920", main = "Ratio of People to Jobs", xlab = "Skill")

That’s it! Pretty simple, and I am sure there are ways of doing this analysis more elegantly, but for me this was the way I could be sure the analysis makes sense.
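
For anyone who prefers ggplot2 (mentioned above), a rough equivalent of the same chart might look like the following; it assumes the skills data frame built earlier.

library(ggplot2)
ggplot(skills, aes(x = name, y = ratio)) +
  geom_bar(stat = "identity", fill = "#fbe920") +          # same H2O yellow as the base barplot
  labs(title = "Ratio of People to Jobs", x = "Skill", y = "People per job")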

To get the full script, you can download it to try yourself or add to my analysis.

Welcome to the Age of Free Energy

Well, it’s official: the future is here.  This week we were introduced to how we’ll power our homes, businesses, transportation, and industries for years to come.  In Elon Musk’s keynote, he took us on a journey that started with a very simple problem statement: we are producing energy poorly and polluting the planet in doing so.  Like any good engineering problem, the solution lies in breaking the problem down into its individual parts.  The first part is where to get the energy.  Well, it turns out “we have this handy fusion reactor in the sky called the sun,” explains Musk. Unfortunately, the sun doesn’t shine at night, so the second part of the problem is how we store the energy to use it at night and during off-peak hours (as our energy needs fluctuate). Turns out the answer is a battery pack.  So there you have it: we now have a way to store the sun in our homes and power our lives, not only in developed countries but technically anywhere. How to make this economically feasible is a question left unanswered.

Copyright of Tesla Motors.


Curious how the battery pack works? You can actually view the patent here.  In fact, Tesla went even further and released its patents, open-sourcing the technology for anyone to use.  With Tesla’s announcement this week, it is clear to me that Tesla is not a product company trying to sell a few cars to the wealthy; it is a technology company looking to drive a societal revolution by allowing others to improve on these technologies without fear of litigation.

Why, you might ask, am I talking about energy technology on my analytics blog? To me, there are striking similarities to how data as a resource is consumed by the few who have the skills, technology, and resources to apply it in their everyday lives.  The problem statement in this case is that we continue to manage data poorly and are polluting the decision-making process with bad insights that impact every aspect of our lives.  Similar to the energy problem, there is no single solution, but multiple parts that we need to address. The first part of this problem is the pervasiveness of data exhaust from every digital thing that is poorly instrumented and difficult to work with.  For this, we have a pretty good solution called “Hadoop” that we’ll need to continue to make easier for people to use. Hadoop clusters in your home, anyone? Don’t believe me? Check out the company BigBoards.  Next, we’ll need to find a way to store and process data efficiently.  For this, the front runner to me is Spark, with its ability to crunch through data fast, apply sophisticated operations to find patterns, and then stream results into applications through easy-to-use APIs. Last, we’ll need to engineer a universal way of consuming the “energy,” or as it is commonly referred to in the analytics category, the “insight” that comes out.  For this we have not yet identified a universal solution; it represents the Data Science Last Mile I have written about before.  Looking into the future, there is a lot to be optimistic about as I look at additional parts of society we’ll disrupt:

Energy – Check

Information – Half-Check

Material & Manufacturing – ?

Agriculture – ?

Governments – ?

More to come in the next few months, so check back soon!

Joel

The Data Science Last Mile

Editor’s Note: This is my first blog post since returning from paternity leave.  I am happy to announce the birth of my son, Maxwell Horwitz.  I’m already applying data science to his routine: his health records are stored online, we monitor his intake (and output) of food, and we monitor his sleep via motion cameras.  It’s an exciting time to bring a new life into the world!

Data Science is often referred to as a combination of developer, statistician, and business analyst.  In more casual terminology, it can be more aptly described as hacking, domain knowledge, and advanced math.  Drew Conway does a good job of describing the competencies in his blog post.  Much of the recent attention is focused on the early stages of the process: establishing an analytics sandbox to extract data, format it, analyze it, and finally create insight (see Figure 1 below).  Many of the advanced analytics vendors are focused on this workflow due to the historical context of how business intelligence has been conducted over the past 30 years.  For example, a newcomer to the space, Trifacta, recently announced a $25 million venture round and is applying predictive analytics to help improve the feature creation step.  It’s a very good area to focus on, considering some 80% of the work is spent here: working to un-bias data, find the variables that really matter (the signal-to-noise ratio), and identify the best model (linear regression, decision trees, naive Bayes, etc.) to apply to the data.  Unfortunately, most of the insights created often never make it past what I am calling the “Data Science Last Mile.”

Figure 1. Simple data science workflow.

What is the Data Science Last Mile? It’s the final work that is done to take found insight and deliver it in a highly usable format or integrate it into a specific application.  There are many examples of this last mile; here are what I consider to be the top ones.

Example 1. Reports, Dashboards, and Presentations

Thanks to the business intelligence community, we have become accustomed to expecting our insights in a dashboard format, with charts and graphs piled on top of each other.  Newer visual analytics tools like Tableau and Platfora add to the graphing melange by making it even easier to plot seemingly unrelated metrics against each other.  Don’t get me wrong, there will always be a place for dashboards.  As a rule of thumb, metrics should only be reported as frequently as action can be taken on them.  At Intel, we had daily stand-up meetings at 7am where we reviewed key metrics and helped drive the priorities each day for the team.  We had separate meetings scheduled on a project basis for analysis that was more complex, like bringing up a new process or production tool.  Here the visualization format is very well defined, and there is even an industry standard called SPC, or Statistical Process Control.  For every business, there are standard charts for reporting metrics, and outside of that there is a well-defined methodology for plotting data.
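
As a minimal illustration of the SPC idea, here is a base-R sketch that plots a metric against its mean and three-sigma control limits; the data is simulated just for the example.

# Simulated daily metric with 3-sigma control limits (illustration only)
set.seed(7)
metric <- rnorm(30, mean = 100, sd = 5)
center <- mean(metric)
sigma  <- sd(metric)
plot(metric, type = "b", main = "Simple control chart", xlab = "Day", ylab = "Metric")
abline(h = center, lty = 1)                       # center line
abline(h = center + c(-3, 3) * sigma, lty = 2)    # lower and upper control limits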

One of my favorite books of all time on best practices for displaying data is by Edward Tufte.  My former boss and mentor recommended The Visual Display of Quantitative Information, and it changed my life.  One of my favorite visuals from this book is how the French visualized their train timetables.

Presentations, reports, and dashboards are where data goes to die.  It was common practice to review these charts on a regular basis and apply the recommendations to the business, product, or operations on a quarterly or even annual basis.

Example 2. Models

Another way data science output is ingested by an organization is as inputs to a model.  From my experience, this is predominantly done using Excel.  It’s quite surprising to me that there aren’t many other applications built to make this process easier. Perhaps it’s because this knowledge is locked away in highly specialized analysts’ heads.  Whatever the reason, this seems like a prime area for disruption: there is a significant need to standardize this process and de-silo the exercise.  One of my favorite quotes is from a former colleague.  We were working over a long weekend analyzing business models for our new product strategy when he stated, “When they told me I’d be working on models, this is NOT what I had in mind.”  Whether you’re in Operations, Finance, Sales, Marketing, Product, Customer Relations, or Human Resources, the ability to accurately model your business means you’re likely able to predict its success.
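
As a tiny sketch of how such a model can live in code rather than a spreadsheet, here is a hypothetical revenue projection in R; every number and assumption below is made up for illustration.

# Toy revenue model with made-up assumptions (not real figures)
months        <- 1:12
new_customers <- round(500 * 1.05 ^ months)   # assume 5% monthly growth in signups
churn_rate    <- 0.03                         # assume 3% of the base churns away
arpu          <- 20                           # assumed average revenue per user, in dollars
active  <- round(cumsum(new_customers) * (1 - churn_rate))
revenue <- active * arpu
data.frame(month = months, active = active, revenue = revenue)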

Example 3. Applications

Finally, and quite possibly my favorite, are the examples of products built on data science all around us without us even knowing it.  One of the oldest examples I can think of is the weather report. We are given a 7-day forecast in the form of sun, clouds, raindrops, and green screens managed by verbose interpretive dancers, as opposed to the raw probability of rain, barometric pressure, wind speed, temperature, and the many other factors that go into the prediction.  Other examples include a derived index of creditworthiness (FICO), a stock market index, Google’s PageRank, a likelihood-to-buy score, or any number of other single values that are used to great effect.  These indices are not reported wholesale, although you can find them if you try (for example, go to http://pagerank.chromefans.org/ to see any website’s Google PageRank yourself).  Instead, they are packaged into a usable format like search, product recommendations, and many other productized formats that bridge the gap between habit and raw data.  For me, this is the area I am most focused on: the last mile of work that needs to be done to push data into every aspect of our decision-making process.

How does Big Data fit into this scenario? Big data is about improving accuracy with more data.  It is well known that a simple algorithm with more data often beats the best algorithm with less.  However, conducting sophisticated statistics and analysis on large datasets is not a trivial task.  A number of startups have sprung up in the last couple of years to build frameworks around this, but they require a significant amount of coding skill.  Only a few have provided a much more approachable way of applying data science to big data.  One such company that has built a visual and highly robust way of conducting analytics at scale is my very own Alpine Data Labs.  It has a bevy of native statistical models that you can mix and match to produce highly sophisticated algorithms that rival the best in class in a matter of minutes, not months.  Pretty wild to think that only a few years ago we were still hand-tooling algorithms on a quarterly basis.

It is evident to me that the focus needs to shift back toward the application of data science before we find ourselves disillusioned.  I, for one, am already thinking about how to build new products that start with data as their core value rather than as an add-on to be determined later.  There is much more to write about on this subject, but now I hear Maxwell calling out for my attention, just as I would have predicted :)

What is Analytics?

For my first post, I find the best place to start is by defining the subject matter we wish to talk about.  So let’s get started: what is analytics anyway? How is it different from traditional Business Intelligence? And why has it come back into focus after being dormant for so many years?

A definition: Analytics is the application of computer technology, operational research, and statistics to solve problems in business and industry.

Historically, Analytics was heavily used in banking for portfolio assessment using social status, geographical location, net value, and many other factors. Today, Analytics is applied to a vast number of industries and is re-emerging due to the phenomenal explosion of data from our connected world.

With this explosion of data, we now see analytics re-emerging as a topic, instead called “Data Science.”  In reality, analytics has been around a long while, and this new breed of analysts is re-branding to garner higher salaries.  With the advent of low-cost and open source databases, we’ll see analytics penetrate deeper into traditionally less analysis-focused industries.  Leading the pack is Apache Hadoop, primarily due to the aforementioned low cost and ease of scalability.  Big Data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools.  McKinsey Global Institute estimates that big data analysis could save the American health care system $300 billion per year and the European public sector €250 billion.


Datameer circa 2012

Many industries have already adopted or are in the process of adopting a Big Data platform in their organization and the time has come to start discussing some simple analysis to leverage this vast amount of information.

I humbly submit this blog to discuss Big Data Analytics tooling and a range of Analytics topics (Customer Retention, Online Marketing, Behavioral Analysis, Customer Valuations, and more).