The Data Science Last Mile

Editor’s Note: This is my first blog post since returning from paternity leave.  I am happy to announce the birth of my son, Maxwell Horwitz.  I’m already applying data science to his routine: his health records are stored online, we monitor his intake (and output) of food, and we track his sleep via motion cameras.  It’s an exciting time to bring a new life into the world!

Data science is often described as a combination of developer, statistician, and business analyst.  In more casual terms, it can be more aptly described as hacking, domain knowledge, and advanced math.  Drew Conway does a good job of describing these competencies in his blog post.  Much of the recent attention is focused on the early stages of the process: establishing an analytics sandbox to extract data, format it, analyze it, and finally create insight (see Figure 1 below).  Many advanced analytics vendors focus on this workflow because of how business intelligence has been conducted over the past 30 years.  For example, a newcomer to the space, Trifacta, recently announced a $25 million venture round and is applying predictive analytics to improve the feature creation step.  It’s a good area to focus on, considering some 80% of the work is spent here: un-biasing data, finding the variables that really matter (signal-to-noise ratio), and identifying the best model (linear regression, decision trees, Naive Bayes, etc.) to apply to the data.  Unfortunately, most of the insights created never make it past what I am calling the “Data Science Last Mile.”


Figure 1. Simple data science workflow.

What is the Data Science Last Mile? It’s the final work done to take a found insight and deliver it in a highly usable format or integrate it into a specific application.  There are many examples of this last mile, and here are what I consider to be the top ones.

Example 1. Reports, Dashboards, and Presentations

Thanks to the business intelligence community, we have come to expect our insights in a dashboard format, with charts and graphs piled on top of each other.  Newer visual analytics tools like Tableau and Platfora add to the graphing melange by making it even easier to plot seemingly unrelated metrics against each other.  Don’t get me wrong, there will always be a place for dashboards.  As a rule of thumb, metrics should only be reported as frequently as they can be acted upon.  At Intel, we had daily standup meetings at 7am where we reviewed key metrics and set the day’s priorities for the team.  We scheduled separate meetings on a project basis for more complex analysis, like bringing up a new process or production tool.  Here the visualization format is very well defined, and there is even an industry standard called SPC, or Statistical Process Control.  For every business there are standard charts for reporting metrics, and beyond those there is a well-defined methodology for plotting data.
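For the curious, the SPC idea can be sketched in a few lines of Python: a basic Shewhart-style control chart flags any point that falls outside the baseline mean ± 3 standard deviations.  Every number below is made up purely for illustration.

```python
import statistics

def control_limits(samples, sigmas=3):
    """Shewhart-style control limits: baseline mean +/- N standard deviations."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    return mean - sigmas * sd, mean + sigmas * sd

# Hypothetical daily defect counts from a production tool.
daily_defects = [4, 5, 3, 6, 4, 5, 4, 20, 5, 4]

# Compute limits from the baseline days (excluding the suspected spike),
# then flag any day that lands outside them.
lcl, ucl = control_limits(daily_defects[:7] + daily_defects[8:])
out_of_control = [x for x in daily_defects if x < lcl or x > ucl]
print(out_of_control)   # → [20]
```

In a real fab, the control limits would come from a qualified baseline run, not from eyeballing which point is the spike.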

One of my favorite books of all time on best practices for displaying data is by Edward Tufte.  My former boss and mentor recommended The Visual Display of Quantitative Information, and it changed my life.  One of my favorite visuals from the book is how the French visualized their train timetables.

Presentations, reports, and dashboards are where data goes to die.  It was common practice to review these charts on a regular basis but apply the recommendations to the business, product, or operations only on a quarterly or even annual basis.

Example 2. Models

Another way data science output is ingested by an organization is as input to a model.  In my experience, this is predominantly done in Excel.  It’s quite surprising to me that more applications haven’t been built to make this process easier.  Perhaps it’s because this knowledge is locked away in highly specialized analysts’ heads.  Whatever the reason, this seems like a prime area for disruption: there is a significant need to standardize this process and de-silo the exercise.  One of my favorite quotes is from a former colleague.  We were working over a long weekend analyzing business models for our new product strategy when he stated, “When they told me I’d be working on models, this is NOT what I had in mind.”  Whether you’re in Operations, Finance, Sales, Marketing, Product, Customer Relations, or Human Resources, the ability to accurately model your business means you’re likely able to predict its success.
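To make the spreadsheet analogy concrete, here is a minimal sketch of the kind of chained-assumption model that usually lives in Excel.  Every input is an illustrative assumption, not a real figure.

```python
def project_revenue(users, monthly_growth, conversion_rate, arpu, months):
    """Project monthly revenue from a few top-line assumptions,
    the way a spreadsheet model chains cells together."""
    revenue = []
    for _ in range(months):
        revenue.append(users * conversion_rate * arpu)
        users *= 1 + monthly_growth   # compound the user base each month
    return revenue

# Illustrative inputs only: 100k users, 5% monthly growth, 2% paid conversion,
# $10 average revenue per paying user.
forecast = project_revenue(users=100_000, monthly_growth=0.05,
                           conversion_rate=0.02, arpu=10.0, months=12)
print(round(forecast[0], 2))   # → 20000.0
```

The point isn’t the arithmetic; it’s that once the assumptions live in code instead of hidden cells, the model can be versioned, tested, and shared outside one analyst’s head.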

Example 3. Applications

Finally, and quite possibly my favorite, are the products built on data science all around us without us even knowing it.  One of the oldest examples I can think of is the weather report: we are given a 7-day forecast in the form of suns, clouds, rain drops, and green screens managed by verbose interpretive dancers, as opposed to the raw probability of rain, barometric pressure, wind speeds, temperatures, and the many other factors that go into the prediction.  Other examples include a derived index of credit worthiness (FICO), stock market indices, Google’s PageRank, a likelihood-to-buy score, and any number of other singular values used to great effect.  These indices are not reported wholesale, although you can find them if you try (for example, you can look up any website’s Google PageRank yourself).  Instead, they are packaged into usable formats like search, product recommendations, and many other productized forms that bridge the gap between habit and raw data.  For me, this is the area I am most focused on: the last mile of work needed to push data into every aspect of our decision-making process.
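As a toy illustration of how one of these derived indices works, here is a minimal PageRank-style power iteration over a hypothetical three-page link graph.  This is the textbook algorithm, not Google’s production system, and the graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank via power iteration. links[i] lists pages that page i links to."""
    n = len(links)
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1 - damping) / n] * n
        for page, outlinks in enumerate(links):
            if not outlinks:                     # dangling page: spread rank evenly
                for j in range(n):
                    new[j] += damping * ranks[page] / n
            else:                                # pass rank along each outgoing link
                for j in outlinks:
                    new[j] += damping * ranks[page] / len(outlinks)
        ranks = new
    return ranks

# Hypothetical 3-page web: page 0 -> 1, page 1 -> 2, page 2 -> 0.
ranks = pagerank([[1], [2], [0]])
print([round(r, 3) for r in ranks])   # → [0.333, 0.333, 0.333]
```

A symmetric ring converges to equal ranks; break the symmetry (say, point two pages at one) and the index starts doing its job of surfacing the “important” node.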

How does Big Data fit into this scenario? Big data is about improving accuracy with more data.  It is well known that a better algorithm often loses out to more input data.  However, conducting sophisticated statistics and analysis on large datasets is not a trivial task.  A number of startups have sprung up in the last couple of years to build frameworks around this, but most require significant coding skills.  Only a few have provided a much more approachable way of applying data science to big data.  One company that has built a visual and highly robust way of conducting analytics at scale is my very own Alpine Data Labs.  It has a bevy of native statistical models that you can mix and match to produce highly sophisticated algorithms that rival best in class in a matter of minutes, not months.  Pretty wild to think that only a few years ago we were still hand-tooling algorithms on a quarterly basis.

It is evident to me that the focus needs to shift back toward the application of data science before we find ourselves disillusioned.  I, for one, am already thinking about how to build new products that start with data as their core value rather than as an add-on to be determined later.  There is much more to write on this subject, but now I hear Maxwell calling out for my attention, just as I would have predicted :)

Designer Data Science

I am pleased to report that Big Data is here to stay, and we are now moving into the application age, with many moving beyond descriptive analytics (BI) to a prescriptive or machine learning focus. After attending Strata NYC last month and DataBeat this past week, I am seeing firsthand how rapidly this trend is evolving.  First, let’s take an example from another major technological shift that happened a little over 10 years ago, when the internet and web applications came of age.

At first there were only a handful of ways to access the web, via “internet portals,” before people could roam the open web to truly leverage its amazing potential to communicate, access information quickly, and create content. Next, we saw the boom create a huge demand for web developers, with little emphasis on design.  I remember fondly many of my engineering colleagues jumping into the fray, learning PHP, HTML, TCP/IP, and other web technologies to take advantage of the demand.  It wasn’t until the bubble burst and the next era of Web 2.0 arrived that frameworks became standard and the focus shifted to design.  These days, do people call themselves Web Developers? Not really; I’d say you see more Web Designers attracting the high salaries, people who can use established web frameworks to design the best customer experience.


It often reminds me of the situation of the Data Scientist today, where many believe the best are great programmers who can leverage R, Python, and MapReduce to create one-off analyses. Scott Yara from Pivotal went so far as to say last week, “It only takes minutes for a Programmer to become a Data Scientist.”  Do we truly believe that? When we heard from Allen Day, Data Scientist at MapR, he did not talk in terms of data frames or Hadoop jobs.  Instead, he focused on the design component of engineering a big data application.  No question he has a strong ability to program and work with big data technology, but what truly sets him apart is his ability to design solutions.  You can hear more snippets from his talk, “What Shape is Your Data,” by liking us on Facebook.

Today the majority of data science applications are built on heavy coding and scripting frameworks (Python, R, Scala, Java, and MapReduce).  At Alpine, however, we are thinking differently about how to design and replicate analyses without having to start from scratch each time.  We go further and abstract the code into representations of operations to make the work less programming-intensive.  I agree with Trifacta’s CEO Joe Hellerstein when he states, “Let’s take the programming requirement out of Data Science.”


A hackathon, according to Wikipedia, is “…(also known as a hack day, hackfest or codefest) an event in which computer programmers and others involved in software development, including graphic designers, interface designers and project managers, collaborate intensively on software projects.” For me, the first thing that pops to mind is Facebook, but you may be surprised to learn the format was originally devised by the smart marketing folks at Sun during the height of the internet boom!

Since then, hackathons have become a mainstay for socializing, innovating, and friendly competition between like-minded individuals with a common theme or goal in mind.

So I ask: why limit hackathons to programmers?

Beyond Programmers

Already we can see areas where hackathons have been applied beyond programming, for example in the life sciences (e.g., the Open Bioinformatics Foundation).  Personally, I would love to see more “hackathons” in other areas; perhaps government (ahem, shutdown) or cooking, to name a couple of random examples.  For me, hackathons are simply a way to fast-prototype ideas into reality.

If you want to start your own hackathon, what is the structure?


Hackathons can take place practically anywhere, from Facebook HQ and its notoriously grueling hackathons to British Airways’ mile-high hackathon.  Or at no place at all, purely online, as in the case of Kaggle, which I’d consider a form of hackathon.

Now that we have a place, what is the structure of a hackathon once you’re there?


Overall, I’d say there is usually a short presentation by the organizers about the goal and guidelines of the hackathon.  Once announced, people generally break up into teams of 2-4 and go off to generate ideas, create mockups, and get to work.  Hackathons can last a single evening, as was the case with Airbnb, or more than a week in some organizations.

Here are some good tips from our pals at Quora if you are so inclined to create your very own hackathon.

In the meantime, if you want to see what all the fuss is about, I invite you to join one of the hackathons we’re hosting next month for the National Association of Realtors.

Customer Analytics: Lifetime Value

Customer Lifetime Value (CLV) is an often overused and oversimplified term used to describe the amount of value a customer generates for your business.  Take this example from Kissmetrics, where they average “expenditure” and “visits” together into their variables and then take an arbitrary retention rate (r) to calculate the LTV.  From this simple example, they come up with a range of $5k–$25k (over a 5X difference!) and then average it all together. Yikes!



When I first joined AVG Technologies back in 2010, a customer lifetime value estimate was generated in a very similar way: by simply dividing total annual revenue by monthly active users.  Unfortunately, this was a bogus metric, as it oversimplified customer lifetime, monetization, and the true cost of acquisition, especially considering we had over 110 million users worldwide.  At its best, the CLV answers in one simple number all of the most important questions about your customers:

Where do my customers come from? How many make their way through the acquisition and on-boarding funnel to become active users, and at what cost?

What is the average lifetime of my customers? What impacts their churn behavior, and which dimensions are important to segment by?

How many transactions are conducted in the lifetime of my customer? What is the average order value?
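Those three questions map directly onto a back-of-the-envelope CLV formula: margin per month times expected lifetime (roughly 1/churn, under a constant-churn assumption), minus acquisition cost.  A hedged sketch with purely illustrative numbers:

```python
def simple_clv(avg_order_value, orders_per_month, gross_margin,
               monthly_churn, acquisition_cost):
    """Back-of-the-envelope CLV: monthly margin times expected lifetime
    (1 / churn under a constant-churn assumption), minus acquisition cost."""
    monthly_margin = avg_order_value * orders_per_month * gross_margin
    expected_lifetime_months = 1.0 / monthly_churn
    return monthly_margin * expected_lifetime_months - acquisition_cost

# Every input below is an assumption for illustration, not a real figure.
clv = simple_clv(avg_order_value=30.0, orders_per_month=1.5,
                 gross_margin=0.6, monthly_churn=0.05, acquisition_cost=40.0)
print(round(clv, 2))   # → 500.0
```

Even this crude version beats revenue-divided-by-active-users, because each input can be measured and segmented on its own rather than smeared into one ratio.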

Over time, we developed a methodology to extract clickstream cohorts with channel attribution information and join them to the customer record data.  We then conducted linear regression over a multitude of dimensions to identify the key variables that impact the churn or monetization of a customer.  For more information, please contact me directly @whatisanalytics.

At its worst, a bad CLV can lead to overspending on acquiring users (a death wish in the startup space) or underestimating the total value a marketing campaign or new product introduction could be generating.



Analytics Stack

With new technologies for analytics coming out so fast, it’s hard to keep up with the best tool for the job. Take Berkeley’s Data Analytics Stack (BDAS), featuring Spark, Shark, and Mesos for advanced analytics and mining. Should I use this or stick with Apache Hadoop, Hive, and Mahout? How do you decide? From my experience, this is the most common stack:


  • Hadoop: distributed file system for data collection
  • Database: HBase or Cassandra to enable random reads
  • Analysis: Hive, Pig, or Impala for advanced analysis
  • Real-time: Storm or Spark
  • Visualization: Tableau Software, or D3.js if you have programmers
  • Applications: Datameer, Alpine Data Labs, WibiData, others?
  • Infrastructure: on-premise or hosted?
  • Add-ons: Hue, Sqoop, and Flume

Example of a possible configuration
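To illustrate what the analysis layer is doing conceptually, here is a MapReduce-style word count simulated in plain Python; in the stack above, the same map/shuffle/reduce pattern would run on Hadoop via Hive or Pig rather than on a toy in-memory list.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, 1) pairs -- here, one per word."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Toy stand-in for log lines that would normally live on HDFS.
logs = ["error timeout", "error retry", "ok"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["error"])   # → 2
```

The framework’s whole job is to run the same three phases across thousands of machines and terabytes of input instead of three strings.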

Is this generally what you see? Are there additional configurations I am missing? Feel free to leave a comment or contact me directly.

Back to Basics: Learning to code can be fun!

After years of trying to avoid coding, I’ve recently picked the habit back up, and I’m actually enjoying it this time around.

My first formal programming experience was part of my core engineering curriculum at the University of Washington, where I quickly realized how fun and addictive programming can be. We learned the basics by building a simple search algorithm, simulated environments, and the simple tip calculators we couldn’t get enough of. After an all-night coding session fueled by many cups of coffee, my final project won the freshman competition for best game of the quarter. However, I was so immersed in programming that the rest of my coursework suffered; perhaps this is why so many computer programmers drop out. At that moment, I realized programming was not my full passion but more of a fun hobby, so instead I went into Materials Science and Nanotechnology (as it was just emerging). For me, the real draw of engineering and technology has always been what you can do with it rather than making something pretty.
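For nostalgia’s sake, a tip calculator like those intro exercises might look like this in Python (a toy sketch, of course):

```python
def total_with_tip(bill, tip_percent=18.0, party_size=1):
    """Return (total bill with tip, per-person share), rounded to cents."""
    total = bill * (1 + tip_percent / 100)
    return round(total, 2), round(total / party_size, 2)

# Splitting an $84 dinner four ways with a 20% tip.
total, share = total_with_tip(84.00, tip_percent=20, party_size=4)
print(total, share)   # → 100.8 25.2
```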

Fast forward to today, and I’ve found a new interest in coding as the languages have become far more streamlined. Python, for example, makes it much easier to code by removing explicit variable declarations and other formalities that used to consume my time debugging. Furthermore, Python (among other languages) simplifies more complex tasks like connecting to APIs, writing MapReduce jobs, and creating simple user interfaces. In addition, there are so many FREE ways to learn (thank you, freemium model): Coursera, Codecademy, and Code School, to name a few of the more popular options. Coursera is great for a more high-level or theoretical understanding of the material; paired with Codecademy or Code School, it really solidifies the material by doing. I’ve found the Python course on Codecademy to be not only fun (*cough* gamified), but really great at building on each lesson. I could write a whole review of these schools, but I’ll save that for another post.

You may ask, “Why refresh or learn to code in the age of drag-and-drop analysis tools (MS Excel, Tableau, SAS, Datameer, and others)?” The answer is simple: while these tools are great for fast ad hoc analysis, you sometimes have to roll up your sleeves, really feel the data, and go deep. Furthermore, as the big data space continues to heat up, I’ve found that many vendors try to trap you in a box that makes it difficult to integrate or build on without their blessing. On top of that, shuttling data around to a sandbox creates extra costs, privacy concerns, and other issues that defeat the purpose of using a giant database in the first place. This, I believe, is what is driving the growth of open source projects in the space.

So far I have finished most of the Intro to Data Science course on Coursera, I’m midway through Codecademy’s Python course, and I’m just about to start digging into SQL. In fact, we are currently planning a meetup for Bay Area Analytics focused specifically on coding and querying data.

That’s all for now. I’ll update this as I get further along and even share some of my code on GitHub as I progress.

Announcing the Bay Area Analytics Meetup


Over the past few months, I’ve attended many big data, analytics, data science, social media, Hadoop, Vertica, and d3.js (you name it) meetups in SF and the Bay Area. I have yet to find one that is truly focused on leveraging data analytics to power your business. For this reason, I am starting a completely new and open group to discuss how we can solve real-world business problems with data, sharing experience and transferring knowledge that can make us all better at data-driven decision making.

We’ll meet once a month at a location on the peninsula and have an open forum with breakout sessions tailored to your interests.  Already we have over 120 members.  I’ve analyzed your feedback, and you can see what our members are interested in via the infographic below.

Feel free to check out the group here: and of course let me know if you have other ideas or topics you’d like to discuss at future meetups.

Looking forward to seeing you at the first meetup!



Customer Analytic Models: Cohort Analysis

There are many analytics models to choose from, developed over time by financial analysts, marketers, and product managers.  Here is the first of five core analytic models that are essential to making data-informed decisions.

At the top of the list is Cohort Analysis.  It has been around a long time and is prevalent in medicine, politics, the social sciences, and elsewhere.  Lately, there has been a resurgence of this form of analysis as it relates to web and product analytics (Jonathan Balogh, Jake Stein, and others).  There are many excellent explanations of cohort analysis, so I won’t spend too much time on the concept.  In short, cohort analysis is the practice of segmenting a group of people by a dimension, whether time, geography, demographic, product, or otherwise; the ultimate goal is to see how one group compares to another.

With this simple model, we can measure how a marketing campaign, a new feature introduction, or an unknown variable changes customer behaviors, such as churn, retention, conversion, and referrals, that are critical for driving growth.

You can focus all of your attention on driving downloads through Search Engine Optimization (SEO), Search Engine Marketing (SEM), and Call to Action (CTA) improvements, but if you cannot retain those customers, you’re pouring money down the drain.

Let’s take a typical online acquisition funnel as an example:


Typical Online Acquisition Trend

In this example, we can see our website traffic and downloads are steadily increasing, but our active users are staying flat.  Is there an issue with our download-to-activation process? Do new customers try our product and then leave? Or are older customers now churning away?  There is no way to tell without doing a cohort analysis.

How to set up a cohort, in simple terms:

  1. Filter by date range (depending on your volume, anywhere from one day to one week, or the first 100k activations).
  2. Collect customer behavior, demographics, or any other important dimensions over the next 30 to 60 days, or until you reach a steady state (the churn rate stabilizes).
  3. Assess your cohorts: segment, compare, and calculate the churn and retention rates.
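The three steps above can be sketched in a few lines of pandas.  The column names and the tiny activity log here are made-up stand-ins for real clickstream data:

```python
import pandas as pd

# Hypothetical activity log: one row per (user, week active); all names are assumptions.
events = pd.DataFrame({
    "user_id":            [1, 1, 1, 2, 2, 3],
    "signup_week":        ["W1", "W1", "W1", "W1", "W1", "W2"],
    "weeks_since_signup": [0, 1, 2, 0, 1, 0],
})

# Steps 1-2: count distinct active users per cohort and week offset.
active = (events.groupby(["signup_week", "weeks_since_signup"])["user_id"]
                .nunique()
                .unstack(fill_value=0))

# Step 3: divide by each cohort's starting size (week 0) to get retention rates.
retention = active.div(active[0], axis=0)
print(retention)
```

Each row of the resulting table is one cohort, and each column is the share of that cohort still active N weeks after signup, which is exactly the decay-curve comparison described below.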

Simple Cohort Example

In this example, we can clearly see that India has a steep initial fall-off that then levels out, Germany has a steadier decay, and the US has the least churn of the three.  From here you can see the issue lies early in customer on-boarding, along with a significant behavioral difference between countries.  We can focus on better product and marketing design for those first few days, or even narrow our attention to the first few hours.  Additionally, we may want to limit our marketing spend in India until we resolve the high churn rate.

Cohort analysis is a simple and powerful model for digging deep into your data to find the root cause of an issue and make data-informed recommendations.  Next week, we’ll take this concept further and see how we can find indicators of churn with linear regression and correlation.

What is Analytics?

For my first post, I find the best place to start is by defining the subject matter.  So let’s get started: what is analytics anyway? How is it different from traditional Business Intelligence? And why has it come back into focus after being dormant for so many years?

A definition: Analytics is the application of computer technology, operational research, and statistics to solve problems in business and industry.

Historically, Analytics was heavily used in banking for portfolio assessment using social status, geographical location, net value, and many other factors. Today, Analytics is applied to a vast number of industries and is re-emerging due to the phenomenal explosion of data from our connected world.

With this explosion of data, we now see analytics re-emerging as a topic, rebranded as “Data Science.”  In reality, analytics has been around a long while, and this new breed of analysts is re-branding to garner higher salaries.  With the advent of low-cost and open source databases, we’ll see analytics penetrate deeper into traditionally less analysis-focused industries.  Leading the pack is Apache Hadoop, primarily due to its low cost and ease of scalability.  Big Data consists of data sets that grow so large and complex that they become awkward to work with using on-hand database management tools.  McKinsey Global Institute estimates that big data analysis could save the American health care system $300 billion per year and the European public sector €250 billion.


Datameer circa 2012

Many industries have already adopted or are in the process of adopting a Big Data platform in their organization and the time has come to start discussing some simple analysis to leverage this vast amount of information.

I humbly submit this blog as a place to discuss Big Data analytics tooling and a range of analytics topics: customer retention, online marketing, behavioral analysis, customer valuations, and more.