R and Hadoop make Machine Learning Possible for Everyone.

Note: this is a repost of the article I wrote for KDnuggets last November.

In statistics, bootstrapping refers to any test or metric that relies on random sampling with replacement.  In simple terms, it provides a way to measure the accuracy of estimates drawn from a sample, and it is often used in constructing hypothesis tests.  In business, bootstrapping refers to starting a company without external help or capital.  And in general parlance, bootstrapping describes an absurdly impossible action: “to pull oneself over a fence by one’s bootstraps.”  R and Hadoop are very much bootstrapped technologies, having received zero direct investment capital and relying on what might appear to be a random group of contributors over the past 20 years, across practically every industry and use case imaginable.
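For the statistical sense, here is a tiny base-R sketch of the idea, using made-up toy data: resample with replacement many times and see how much the estimate varies.

# Bootstrap the standard error of a sample mean by resampling with replacement.
set.seed(42)
x <- rnorm(100, mean = 10, sd = 2)                     # a toy "observed" sample
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(boot_means)                                         # bootstrap estimate of the standard error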

R Pirates Pillage Businesses Worldwide

R first appeared in 1993, when Ross Ihaka and Robert Gentleman at the University of Auckland released it as a free software package.  Since then, R has grown to over 3 million users in the US alone, according to the download-site log files released last year.

In addition, R surpassed SAS with over 7,000 unique packages, which you can browse on the Crantastic website.  It is no wonder it has found wide use in many industries and in academia. In fact, during the summer of 2014, R surpassed IBM SPSS as the most widely used analytics software in scholarly articles, according to Robert Muenchen.  For this reason, R is now the “gold standard” for doing all sorts of statistics, economics, and even machine learning.  Furthermore, in my experience many if not most people today use R as a complementary tool for spot-checking their work, even when they rely on far more expensive or popular enterprise software.  It is no wonder R is quickly becoming the go-to tool for Data Scientists in the 21st century.

What is fueling R’s growth is predominantly the community, which keeps the core software useful and relevant by providing answers to common questions via many blogs and user groups.  In addition, the job market is clearly underserved, according to data from LinkedIn (see image below).  Due to this demand, R is now offered in practically all major universities as the de facto language for statistical programming, and many new online courses are starting each day.  DataCamp is one such example, having built an interactive web environment with rich lessons so that non-programmers can easily get started without ever touching a command line.

People to jobs ratio

Businesses too are flocking to statistics and embracing the probabilistic rather than deterministic nature of problems that arise when data grows in size and rate faster than traditional Business Intelligence can keep pace.  For this reason, many turned to Hadoop to open up the data platform and unlock the world of enterprise data management that had been kept away from business analysts for many years.  Gone are the days of pre-filtered, pre-aggregated dashboards and Excel workbooks emailed around haphazardly to executives and decision makers, left with little interpretation and devoid of any “storytelling” to guide the business toward informed decisions.

Hadoop Growth

Apache Hadoop arrived in 2005, a little over a decade after R first hit the scene, and wasn’t widely adopted until as late as 2013, when more than half of the Fortune 50 got around to building their own clusters.  The name “Hadoop” comes from a toy elephant belonging to the son of famed Yahoo! engineer Doug Cutting, who along with Mike Cafarella originally developed the technology to create a better search engine, of course. Along with its ability to process enormous amounts of data on relatively inexpensive hardware, it also made it possible to store data on a distributed file system (HDFS) without having to transform it ahead of time.  As with R, many open source projects were created to re-imagine the data platform: getting data into HDFS (Sqoop, Flume, Kafka, etc.), compute and streaming (Spark, YARN, MapReduce, Storm, etc.), querying data (Hive, Pig, Stinger / Tez, Drill, Presto, etc.), datastores (HBase, Cassandra, Redis, Voldemort, etc.), schedulers (Oozie, Cascading, Scalding, etc.), and finally Machine Learning (Mahout, MLlib, H2O, etc.), among many other applications.

Unfortunately, there is no simple way to see all of these technologies and install them with one line of code, as there is with R.  Nor is MapReduce a simple language for the average developer.  In fact, the LinkedIn data shows a shortage, just as with R, of Hadoop- and MapReduce-skilled workers relative to the number of jobs available.  It is for this reason that Hadoop has not fully caught fire in the same way R has, and there was even talk of its demise at the recent Strata + Hadoop World conference in NYC this past fall.

People and Jobs Ratio at Strata

What’s the Real Problem Here? In One Word, Data

Over the past few years, data itself has emerged as the number one problem in data science, given the vast variety and volume of data we face.  I’d be remiss not to mention the velocity of unrelenting data waves crashing against our fragile analysis environments.  In fact, IDC projects that the volume of data will exceed the number of stars in the universe by 2020.  Fortunately, there is an entirely new approach to this problem that has until now escaped us in our persistent habit of wanting to constrain data to our querying tools.

Machine Learning is the new SQL

Put simply, “Machine Learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data.” It is a quantum shift from the standard practice of simply counting things; instead, it’s the start of a fantastic journey into the deeper pools of the unknown.
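To make the contrast concrete, here is a minimal base-R sketch using the built-in iris dataset: one line of counting and summarizing next to a model that learns cluster structure from the same data.

# Counting vs. learning, in miniature.
set.seed(7)
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)   # counting / summarizing
fit <- kmeans(iris[, 1:4], centers = 3)                      # learning clusters from the data
table(fit$cluster, iris$Species)                             # how well did it recover the species?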

People and Jobs ratio in ML

So here comes the really interesting part of the story.  According to my LinkedIn analysis, demand for Machine Learning and Data Science in the job market is actually very well matched to the number of people available (unlike R and Hadoop).

We’ll need another plot to really understand what is going on here: the actual number of jobs that exist for Data Science vs. Machine Learning.  My interpretation of this graph is that the job of a data scientist today is synonymous with that of an analytics professional or analyst, and that the real opportunity is in the growing area of Machine Learning.

Jobs and Skills in ML

Machine Learning is the New Kid on the Block

Data Science was first described as the intersection of programming or “hacking skills,” math and statistics, and business expertise, according to Drew Conway’s blog.  As it turns out, programming is too generic a term; what is really meant is applying math to large-scale data through new algorithms that can crawl through this tangled mess.  To search for answers in this jungle, simply flying over the canopy will not reveal the treasure hidden just beneath it. It is evident to me that the number of machine learning projects cropping up, and the maturity of a market that now accepts probabilistic information rather than only deterministic answers, mark a new era in the race to find value in our data assets.

“The machine does not isolate man from the great problems of nature but plunges him more deeply into them.” – Antoine de Saint-Exupéry

Hadoop 2.0 is Here, Sort Of

Many people have tried to claim that Hadoop 2.0 arrived with MR2, YARN, or high-availability HDFS, but this is a misnomer when compared with the similarly named Web 2.0 that brought us into the age of web applications like Facebook, Twitter, LinkedIn, Amazon, and the vast majority of the internet.  John Battelle and Tim O’Reilly, of now-Strata fame, defined that shift simply as the “Web as a Platform,” meaning software applications are built upon the Web as opposed to the desktop. Hints of a comparable change are coming from Apache Spark, and specifically from new capabilities like SparkR, KeystoneML, and extensions that are making it possible to develop intelligent applications on large-scale data. As Matei Zaharia, the godfather of Spark, put it in his keynote address earlier this year, “it’s all about data science and interfaces.”  It is now finally possible for Data Scientists and Developers to work together in the same framework.  It is clear to me, having worked in the “Big Data” industry for some time, that software developers and statisticians want to program in their own language, not MapReduce.  It’s an exciting time to move off the desktop and onto the cluster, where the constraints are lifted and the opportunities are endless.
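For a flavor of what this looks like from the R side, here is a rough sketch against the SparkR DataFrame API as it shipped around Spark 1.4/1.5; the initialization calls changed in later releases, so treat this as illustrative rather than definitive.

# Minimal SparkR sketch: push a local data frame to Spark and query it with familiar verbs.
library(SparkR)
sc  <- sparkR.init(master = "local")        # connect to a local Spark instance
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, faithful) # distribute R's built-in faithful dataset
head(filter(df, df$waiting < 50))           # the filtering runs on Spark, not in local R

sparkR.stop()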

Jobs and Skills Analysis Explained

Many people have conducted research lately on the growing popularity of statistical software, using indirect methods like academic research citations, job posts, books, website traffic, blogs, surveys like the KDnuggets annual poll, GitHub activity, and many more.  However, all of these methods have generally been focused on the technical crowd.  Where the rubber meets the road is in the business context, and in my mind LinkedIn is highly representative of the business world.  Further, if you want to reach an even more general audience, you can perform the same trick with Google AdWords; reverse engineering ad platforms is a good way to get back-of-the-envelope market-sizing information, and I wrote a complete post on that subject on my personal blog.  In the following instructions, I’ll walk through how I used LinkedIn as my sample and R to analyze the business market for the analysis above.

1. Data Gathering

There are two ways to gather data from LinkedIn.  One is to use the advertising platform shown in the left image below; the other is to use the direct search functionality shown on the right.

LinkedIn Advertising and Searching

In this case, I went the manual route and used the search function. For each product category I searched, there is a count of results, which I use as a proxy for demand.  See below:

LinkedIn Product Categories

From this example, we can see the phrase “R Programming” has 1720 results.  I’ve also included “R statistics” and other relevant “R” terms.

2. Data Analysis

As a new R user myself, I manually created a data frame to hold the data, first by creating the individual vectors for the people and jobs:

# Skills, the number of LinkedIn members listing each skill, and matching job posts
name   <- c("R", "Python", "SAS", "SPSS", "Excel", "RapidMiner", "SQL")
people <- c(230750, 815555, 128860, 752205, 15390756, 3306, 4648240)
jobs   <- c(5059, 13519, 2414, 7429, 123874, 17, 37571)

After that, I created a simple data frame:

skills <- data.frame(name, people, jobs)

To get the ratio, I simply use the transform function:

skills <- transform(skills, ratio = people / jobs)

3. Data Visualization

R comes with many visualization packages, the most notable being ggplot2.  For this situation, I used the built-in barplot, as it was much easier out of the box.  Frankly, the visualizations produced in base R may not seem the most compelling to a general audience, but they do force you to consider what you’re plotting, making for more informed visuals.

To get the bar graph (in H2O colors):

barplot(skills$ratio, names.arg = name, col = "#fbe920", main = "Ratio of People to Jobs", xlab = "Skill")

That’s it! Pretty simple, and I am sure there are more elegant ways of doing this analysis, but for me this was the way I could be sure the analysis makes sense.
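For instance, the same chart could be drawn with ggplot2; a quick sketch, assuming the package is installed:

# A rough ggplot2 equivalent of the barplot above.
library(ggplot2)   # install.packages("ggplot2") if needed

ggplot(skills, aes(x = name, y = ratio)) +
  geom_bar(stat = "identity", fill = "#fbe920") +
  labs(title = "Ratio of People to Jobs", x = "Skill", y = "People per job")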

You can download the full script to try it yourself or to add to my analysis.

Welcome to the Age of Free Energy

Well, it’s official: the future is here.  This week we were introduced to how we’ll power our homes, businesses, transportation, and industries for years to come.  In his keynote, Elon Musk took us on a journey that started with a very simple problem statement: we are producing energy poorly and polluting the planet in doing so.  Like any good engineering problem, the solution lies in breaking the problem down into its individual parts.  The first part is where to get the energy.  Well, it turns out “we have this handy fusion reactor in the sky called the sun,” explains Musk. Unfortunately, the sun doesn’t shine at night, so the second part of the problem is how to store the energy so we can use it at night and during off-peak hours (as our energy needs fluctuate). It turns out the answer is a battery pack.  So there you have it: we now have a way to store the sun in our homes and power our lives, not only in developed countries but technically anywhere. How to make this economically feasible is a question left unanswered.

Copyright of Tesla Motors.


Curious how the battery pack works? You can actually view the patent here.  In fact, Tesla went even further and opened its patents, open-sourcing the technology for anyone to use.  With Tesla’s announcement this week, it is clear to me that this is not a product company trying to sell a few cars to the wealthy; it is a technology company looking to spark a societal revolution by allowing others to improve on these technologies without fear of litigation.

Why, you might ask, am I talking about energy technology on my analytics blog? To me, there are striking similarities in how data as a resource is consumed by the few who have the skills, technology, and resources to apply it in their everyday lives.  The problem statement in this case is that we continue to manage data poorly and are polluting the decision-making process with bad insights that affect every aspect of our lives.  As with the energy problem, there is no single solution, but multiple parts we need to address. The first part is the pervasiveness of data exhaust from every digital thing, poorly instrumented and difficult to work with.  For this, we have a pretty good solution called “Hadoop” that we’ll need to continue to make easier for people to use. Hadoop clusters in your home, anyone? Don’t believe me? Check out the company BigBoards.  Next, we’ll need to find a way to store and process data efficiently.  For this, the front runner to me is Spark, with its ability to crunch through data fast, apply sophisticated operations to find patterns, and then stream the results into applications through easy-to-use APIs.  Last, we’ll need to engineer a universal way of consuming the “energy,” or as it is commonly referred to in analytics, the “Insight” that comes out.  For this we have not yet identified a universal solution, and it represents the Data Science Last Mile I have written about before.  Looking into the future, there is a lot to be optimistic about as I look at the additional parts of society we’ll disrupt:

Energy – Check

Information – Half-Check

Material & Manufacturing – ?

Agriculture – ?

Governments – ?

More to come in the next few months, so check back soon!

Joel

The Data Science Last Mile

Editor’s Note: This is my first blog since returning from paternity leave.  I am happy to announce the birth of my son, Maxwell Horwitz.  I’m already applying data science to his routine: his health records are stored online, we monitor his intake (and output) of food, and we track his sleep via motion cameras.  It’s an exciting time to bring a new life into the world!

Data Science is often described as a combination of developer, statistician, and business analyst.  In more casual terminology, it can be more aptly described as hacking, domain knowledge, and advanced math.  Drew Conway does a good job of describing the competencies in his blog post.  Much of the recent attention is focused on the early stages of the process: establishing an analytics sandbox to extract data, format it, analyze it, and finally create insight (see Figure 1 below).  Many of the advanced analytics vendors focus on this workflow due to the historical context of how business intelligence has been conducted over the past 30 years.  For example, a newcomer to the space, Trifacta, recently announced a $25 million venture round and is applying predictive analytics to help improve the feature creation step.  It’s a very good area to focus on, considering some 80% of the work is spent here: working to un-bias data, finding the variables that really matter (signal-to-noise ratio), and identifying the best model (linear regression, decision trees, Naive Bayes, etc.) to apply to the data.  Unfortunately, most of the insights created often never make it past what I am calling the “Data Science Last Mile.”


Figure 1. Simple data science workflow.

What is the Data Science Last Mile? It’s the final work done to take found insight and deliver it in a highly usable format or integrate it into a specific application.  There are many examples of this last mile; here are what I consider to be the top ones.

Example 1. Reports, Dashboards, and Presentations

Thanks to the business intelligence community, we have grown accustomed to expecting our insights in a dashboard format, with charts and graphs piled on top of each other.  Newer visual analytics tools like Tableau and Platfora add to the graphing melange by making it even easier to plot seemingly unrelated metrics against each other.  Don’t get me wrong, there will always be a place for dashboards.  As a rule of thumb, metrics should only be reported as frequently as they can be acted upon.  At Intel, we had daily standup meetings at 7am where we reviewed key metrics and set the day’s priorities for the team.  We scheduled separate meetings on a project basis for more complex analysis, like bringing up a new process or production tool.  Here the visualization format is very well defined, and there is even an industry standard called SPC, or Statistical Process Control.  For every business there are standard charts for reporting metrics, and beyond that there is a well-defined methodology for plotting data.
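As a rough illustration of the SPC idea (not the exact charts we used at Intel), here is a minimal Shewhart-style control chart in base R with toy data:

# Plot a metric against its center line and +/- 3 sigma control limits.
set.seed(1)
metric <- rnorm(30, mean = 100, sd = 5)      # e.g. a daily production metric
center <- mean(metric)
ucl    <- center + 3 * sd(metric)            # upper control limit
lcl    <- center - 3 * sd(metric)            # lower control limit

plot(metric, type = "b", ylim = range(c(metric, ucl, lcl)),
     main = "SPC control chart", xlab = "Day", ylab = "Metric")
abline(h = c(center, ucl, lcl), lty = c(1, 2, 2))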

One of my favorite books of all time on best practices for displaying data is by Edward Tufte.  My former boss and mentor recommended the book The Visual Display of Quantitative Information, and it changed my life.  One of my favorite visuals from the book is how the French visualized their train timetables.

Presentations, reports, and dashboards are where data goes to die.  It was common practice to review these charts on a regular basis yet apply the recommendations to the business, product, or operations only on a quarterly or even annual basis.

Example 2. Models

Another way data science output is ingested by an organization is as inputs to a model.  In my experience, this is predominantly done using Excel.  It’s quite surprising to me that more applications haven’t been built to make this process easier.  Perhaps it’s because this knowledge is locked away in the heads of highly specialized analysts.  Whatever the reason, this seems like a prime area for disruption: there is a significant need to standardize the process and de-silo the exercise.  One of my favorite quotes is from a former colleague; we were working over a long weekend analyzing our new product strategy business models when he stated, “When they told me I’d be working on models, this is NOT what I had in mind.”  Whether you’re in Operations, Finance, Sales, Marketing, Product, Customer Relations, or Human Resources, the ability to accurately model your business means you’re likely able to predict its success.
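To make that concrete, here is a tiny, purely hypothetical sketch of moving a spreadsheet-style model into R; the acquisition, churn, and order-value numbers are invented for illustration.

# Project monthly revenue from assumed acquisition, churn, and average order value.
months     <- 1:12
new_users  <- 1000     # assumed new users acquired per month
churn_rate <- 0.05     # assumed monthly churn
aov        <- 20       # assumed average order value, in dollars

active <- numeric(length(months))
users  <- 0
for (m in months) {
  users <- users * (1 - churn_rate) + new_users   # survivors plus the new cohort
  active[m] <- users
}
revenue <- active * aov
plot(months, revenue, type = "b", xlab = "Month", ylab = "Projected monthly revenue")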

Example 3. Applications

Finally, and quite possibly my favorite examples, are the products built on data science all around us without us even knowing it.  One of the oldest examples I can think of is the weather report: we are given a 7-day forecast in the form of sun, clouds, raindrops, and green screens managed by verbose interpretive dancers, as opposed to the raw probability of rain, barometric pressure, wind speeds, temperatures, and the many other factors that go into the prediction.  Other examples include a derived index of credit worthiness (FICO), a stock market index, Google’s PageRank, a likelihood-to-buy score, and any number of other single values that are used to great effect.  These indices are not reported wholesale, although you can find them if you try (for example, go to http://pagerank.chromefans.org/ to see any website’s Google PageRank yourself).  Instead, they are packaged into a usable format like search, product recommendations, and many other productized formats that bridge the gap between habit and raw data.  For me, this is the area I am most focused on: the last mile of work that needs to be done to push data into every aspect of our decision-making process.

How does Big Data fit into this scenario? Big data is about improving accuracy with more data.  It is well known that a better algorithm often loses out to a simpler one trained on more data.  However, conducting sophisticated statistics and analysis on large datasets is not a trivial task.  A number of startups have sprung up in the last couple of years to build frameworks around this, but they require significant coding skill.  Only a few have provided a much more approachable way of applying data science to big data.  One company that has built a visual and highly robust way of conducting analytics at scale is my very own Alpine Data Labs.  It has a bevy of native statistical models that you can mix and match to produce highly sophisticated algorithms that rival best in class, in a matter of minutes, not months.  Pretty wild to think that only a few years ago we were still hand-tooling algorithms on a quarterly basis.

It is evident to me that the focus needs to shift back toward the application of data science before we find ourselves disillusioned.  I, for one, am already thinking about how to build new products that start with data as their core value rather than as an add-on to be determined later.  There is much more to write about on this subject, but now I hear Maxwell calling out for my attention, just as I would have predicted :)

Designer Data Science

I am pleased to report that Big Data is here to stay and we are now moving into the application age, with many moving beyond descriptive analytics (BI) to a prescriptive or machine learning focus. After attending Strata NYC last month and DataBeat this past week, I am seeing first hand how rapidly this trend is evolving.  First, let’s take an example from another major technological shift that happened a little over 10 years ago, when the internet and web applications came of age.

At first there were only a handful of ways to access the web, via “internet portals,” before many people could reach the open web to truly leverage its amazing potential to communicate, access information quickly, and create content. Next, we saw the dot-com boom create a huge demand for web developers, with little emphasis on design.  I remember fondly many of my engineering colleagues jumping into the fray, learning PHP, HTML, TCP/IP, and other web technologies to take advantage of the demand.  It wasn’t until the bubble burst and the next era of Web 2.0 arrived that frameworks became standard and the focus shifted to design.  These days, do people call themselves Web Developers? Not really; I’d say you see more Web Designers attracting the high salaries, people who can use established web frameworks to design the best customer experience.

blue print

It reminds me of the situation of the Data Scientist today, where many believe the best are great programmers who can leverage R, Python, and MapReduce to create one-off analyses. Scott Yara from Pivotal went so far as to say last week, “It only takes minutes for a Programmer to become a Data Scientist.”  Do we truly believe that? When we heard from Allen Day, Data Scientist at MapR, he did not talk in terms of data frames or Hadoop jobs.  Instead, he focused on the design component of engineering a big data application.  No question he has a strong ability to program and work with big data technology, but what truly sets him apart is his ability to design solutions.  You can hear more snippets from his talk, “What Shape is Your Data,” by liking us on Facebook.

Today the majority of Data Science applications rely heavily on coding and scripting frameworks (Python, R, Scala, Java, and MapReduce).  However, at Alpine we are thinking differently about how to design and replicate analysis without having to start from scratch each time.  We go further and abstract the code into representations of operations to make it less programming-intensive.  I agree with Trifacta’s CEO Joe Hellerstein when he states, “Let’s take the programming requirement out of Data Science.”

Hackathons!

A hackathon, according to Wikipedia, “(also known as a hack day, hackfest or codefest) is an event in which computer programmers and others involved in software development, including graphic designers, interface designers and project managers, collaborate intensively on software projects.” For me, the first thing that pops into mind is Facebook, but you may be surprised to learn the format was originally devised by the smart marketing folks at Sun during the height of the internet boom!

Since then, hackathons have become a mainstay for socializing, innovating, and friendly competition between like-minded individuals with a common theme or goal in mind.

Which raises the question: why limit hackathons to programmers?

Beyond Programmers

Already we can see areas where hackathons have been applied beyond programming, for example in the life sciences (e.g., the Open Bioinformatics Foundation).  Personally, I would love to see more “hackathons” in other areas; perhaps government (ahem, shutdown) or cooking, to pick a couple of random examples.  To me, hackathons are simply a way to rapidly prototype ideas into reality.

If you want to start your own hackathon, what is the structure?

Place

Hackathons can take place practically anywhere, from Facebook HQ and their notoriously grueling hackathons to the mile-high British Airways hackathon.  Or at no place at all, purely online, as in the case of Kaggle, which I’d consider a form of hackathon.

Now that we have a place, what is the structure of a hackathon once you’re there?

Structure

Overall, there is usually a short presentation by the organizers about the goal and guidelines of the hackathon.  Once announced, people generally break up into teams of 2-4 and go off to generate ideas, make mockups, and get to work.  Hackathons can last a single evening, as was the case at Airbnb, or run for more than a week in some organizations.

Here are some good tips from our pals at Quora if you are so inclined to create your very own hackathon.

In the meantime, if you want to see what all the fuss is about, I invite you to join one of the hackathons we’re hosting next month for the National Association of Realtors.

Customer Analytics: Lifetime Value

Customer Lifetime Value (CLV) is an often overused and oversimplified term people use to describe the amount of value a customer generates for your business.  Take this example from Kissmetrics, where they average together “expenditure” and “visits” for their variables and then take an arbitrary retention rate (r) to calculate the LTV.  From this simple example, they come up with a range of $5k to $25k, over a 5X difference, and then they average it together. Yikes!

bad ltv

kiss metrics

When I first joined AVG Technologies back in 2010, a customer lifetime value estimate was generated in a very similar way, by simply dividing the total annual revenue by the monthly active users.  Unfortunately, this was a bogus metric, as it grossly oversimplified customer lifetime, monetization, and the true cost of acquisition, especially considering we had over 110 million users worldwide.  At its best, the CLV answers in one simple number all of the most important questions about your customers (a back-of-the-envelope sketch of how the pieces combine follows the questions below):

Where do my customers come from? How many are making their way through the acquisition and on-boarding funnel to become active users, and at what cost?

What is the average lifetime of my customers? What impacts their churn behavior, and which dimensions are important to segment by?

How many transactions are conducted over the lifetime of my customer? What is the average order value?
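Pulling those pieces together into one back-of-the-envelope calculation (every number below is hypothetical):

# Toy CLV: order value x orders per year x expected lifetime, minus acquisition cost.
aov       <- 30      # average order value
orders_yr <- 4       # transactions per customer per year
churn     <- 0.25    # annual churn, so expected lifetime is 1/churn = 4 years
cac       <- 40      # cost to acquire a customer

clv <- aov * orders_yr * (1 / churn) - cac
clv   # 440 in this toy example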

Over time, we developed a methodology to extract clickstream cohorts with channel attribution information and join them to the customer record data.  We then ran linear regressions over a multitude of dimensions to identify the key variables that impact the churn or monetization of a customer.  For more information, please contact me directly @whatisanalytics.
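Here is a minimal sketch of that kind of pipeline, with toy tables and invented column names standing in for the real clickstream and customer data:

# Hypothetical cohort and customer tables, joined and then regressed.
cohorts <- data.frame(
  user_id = 1:6,
  channel = c("search", "display", "search", "email", "display", "email")
)
customers <- data.frame(
  user_id     = 1:6,
  revenue     = c(120, 15, 90, 60, 10, 75),
  tenure_days = c(400, 30, 350, 200, 25, 280)
)

clv_data <- merge(cohorts, customers, by = "user_id")   # join cohorts to customer records

# Which dimensions move lifetime revenue?
fit <- lm(revenue ~ channel + tenure_days, data = clv_data)
summary(fit)

# Churn could be modeled similarly with glm(churned ~ ..., family = binomial).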

At its worst, a bad CLV can lead to overspending on acquiring users (a death wish in the startup space) or to underestimating the total value a marketing campaign or new product introduction could be generating.


Analytics Stack

With new analytics technologies coming out so fast, it’s hard to keep up with the best tool for the job. Take Berkeley’s Data Analytics Stack (BDAS), featuring Spark, Shark, and Mesos for advanced analytics and mining. Should I use this or stick with Apache Hadoop, Hive, and Mahout? How do you decide? From my experience, I’ve found this to be the most common stack:

Configuration:

  • Hadoop: distributed file system (HDFS) for data collection.
  • Database: HBase or Cassandra to enable random reads.
  • Analysis: Hive, Pig, or Impala for advanced analysis.
  • Real-Time: Storm or Spark.
  • Visualization: Tableau Software, or D3.js if you have programmers.
  • Applications: Datameer, Alpine Data Labs, WibiData, Wise.io, others?
  • Infrastructure: on-premise or hosted?
  • Add-ons: Hue, Sqoop, and Flume.

Example of a possible configuration

Is this generally what you see? Are there additional configurations I am missing? Feel free to leave a comment or contact me directly.

Back to Basics: Learning to code can be fun!

After years of trying to avoid coding, I’ve recently picked the habit back up and am actually enjoying it this time around.

My first formal programming experience was part of my core engineering curriculum at the University of Washington, where I quickly realized how fun and addictive programming can be. We learned the basics by building a simple search algorithm, simulated environments, and the kind of simple tip calculators we simply cannot get enough of. After an all-night session coding my final project on many cups of coffee, I won the freshman competition for best game of the quarter. However, I was so immersed in programming that the rest of my coursework suffered; perhaps this is why so many computer programmers drop out. At that moment, I realized programming was not my full passion but more of a fun hobby, so I went into Materials Science and Nanotechnology instead (as it was just emerging). For me, the real draw of engineering and technology has always been more what you can do with it than making something pretty.

Fast forward to today, and I’ve found a new interest in coding as the languages have become far more streamlined. Python, for example, makes it much easier to code by removing variable declarations, verbose syntax, and other formalities that used to consume my time debugging. Furthermore, Python (among other languages) bridges more complex tasks such as connecting to APIs, writing MapReduce jobs, and creating simple user interfaces. In addition, there are so many ways to learn that are FREE (thank you, freemium model): Coursera, Codecademy, and Code School, to name a few of the more popular options. Coursera is great for a more high-level or theoretical understanding of the material; paired with Codecademy or Code School, it really solidifies the material by doing. I’ve found the Python course on Codecademy to be not only fun (*cough* gamified) but really great at building on each lesson. I could write a whole review of these schools, but I’ll save that for another post.

You may ask, “Why refresh or learn to code in the age of drag-and-drop analysis tools (MS Excel, Tableau, SAS, Datameer, and others)?” It’s a simple answer: while these tools are great for fast ad hoc analysis, you sometimes have to roll up your sleeves, really feel the data, and go deep. Furthermore, as the big data space continues to heat up, I’ve already found many vendors try to trap you in a box that makes it difficult to integrate or build on unless you have their blessing. In addition, shuttling data around to a sandbox creates extra cost, privacy concerns, and other issues that defeat the purpose of using a giant database in the first place. This, I believe, is what is driving the growth of open source projects in the space.

So far I have finished most of the Intro to Data Science course on Coursera, I’m midway through Codecademy’s Python course, and I’m just about to start digging into SQL. In fact, we are currently planning a meetup for Bay Area Analytics to focus specifically on coding and querying data.

That’s all for now. I’ll update this as I get further along and even share some of my code on GitHub as I progress.

Announcing the Bay Area Analytics Meetup

bay area analytics

Over the past few months, I’ve attended many big data, analytics, data science, social media, Hadoop, Vertica, d3.js, you-name-it meetups in SF and the Bay Area. I have yet to find one that is truly focused on leveraging data analytics to power your business. For this reason, I am starting a completely new and open group to discuss how we can solve real-world business problems with data by sharing experience and transferring knowledge that can make us all better at data-driven decision making.

We’ll meet once a month at a location on the peninsula and have an open forum with breakout sessions tailored to your interests.  Already we have over 120 members.  I’ve analyzed your feedback, and you can see what our members are interested in in the infographic below.

Feel free to check out the group here: http://www.meetup.com/BayAreaAnalytics and of course let me know if you have other ideas or topics you’d like to discuss at future meetups.

Looking forward to seeing you at the first meetup!

 

 
