As the volume of new online information grows exponentially, it becomes harder to find the gems, and especially the ‘incremental gems’, that add value to a mature data science practice.
Who among us in the quantitative knowledge-worker space has not been swamped by emails, blog links, and “hotlists” of topics in Data Science, Predictive Analytics, and Machine Learning? It is a daily big data problem in its own right: the velocity of the material is high, the variety is broad, and the veracity is in many cases questionable. But in other cases there truly are gems that are must-reads. Sorting through the latest buzzwords and finding resources that actually drive your data science agenda requires some filters, and with that context in mind, here is a list of current favorite Statistics/Data Science/Predictive Analytics/Machine Learning/Big Data links. We’ve provided some commentary and context. Enjoy!
Kaggle Competitions — Right now Kaggle is the go-to resource for accelerating any data science toolkit. The number and variety of Kaggle competitions, together with the code-sharing forums, make it a gold mine of R and Python scripts that can be modified to fit both existing and novel use cases. The Kaggle community is also very friendly, with lots of code sharing, modification-tracking (“forking”), and collaboration; a typical starter script is sketched below.
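To make that concrete, here is a minimal sketch of the kind of baseline script you will find shared in Kaggle forums. The file names (train.csv, test.csv) and column names (id, target) are hypothetical placeholders; every competition defines its own layout.

```python
# A minimal sketch of a Kaggle-forum-style baseline script.
# Assumes a hypothetical competition layout: train.csv / test.csv with
# an "id" column and a binary "target" column -- names vary by competition.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Keep only numeric columns for a first baseline; real entries layer on
# far more feature engineering than this.
features = train.select_dtypes("number").drop(
    columns=["id", "target"], errors="ignore"
).columns
X, y = train[features].fillna(0), train["target"]

model = RandomForestClassifier(n_estimators=200, random_state=0)
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

# Fit on the full training set and write a submission file.
model.fit(X, y)
submission = pd.DataFrame({
    "id": test["id"],
    "target": model.predict(test[features].fillna(0)),
})
submission.to_csv("submission.csv", index=False)
```

Forum scripts typically start exactly like this and then get forked and extended with competition-specific features, which is what makes them so easy to adapt to your own use cases.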
KDD — The KDD Cup has been around longer than Kaggle, so it has a back catalog of data sets and use cases that deserves consideration. Kaggle is out-pacing the KDD Cup, but the KDD Cup is still worth checking annually to see what use case is being modeled and how the top teams earned their spots. Several KDD competitions have yielded data sets that now serve as benchmarks for testing new predictive and optimization algorithms; see the sketch below for one example.
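As one example, the classic KDD Cup ’99 network-intrusion data set ships as a built-in fetcher in recent versions of scikit-learn; a minimal sketch of pulling it for benchmarking (the percent10 flag fetches the widely used 10% sample):

```python
# Minimal sketch: load the KDD Cup '99 intrusion-detection benchmark,
# available as a built-in fetcher in recent scikit-learn versions.
from sklearn.datasets import fetch_kddcup99

# percent10=True retrieves the standard 10% sample of the full data set.
data = fetch_kddcup99(percent10=True)

print(data.data.shape)          # (n_samples, n_features)
print(set(data.target[:1000]))  # attack-type labels, stored as byte strings
```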
Arxiv — It is good to read the empirical research, to know who is publishing what, and to know where you can reach them to discuss their analytical stack. Several resources don’t require a university affiliation, and Arxiv offers free, open access to published research. Here’s an example of the types of papers you can gather for review and follow-up discussion: Efficient Non-greedy Optimization of Decision Trees. It turns out one of our clients’ needs is aligned with these researchers’ expertise, and having a local university talent connection is always a good idea: an introduction you can facilitate is a win-win for the Toronto co-author and for our client, and if needs and interests are aligned that local connection will accelerate innovation. At a minimum, the paper is a data point on how stochastic gradient tree boosting continues to mature.
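For readers who want a concrete anchor for that last point, here is a purely illustrative sketch of off-the-shelf stochastic gradient tree boosting in scikit-learn. This is not the method from the cited paper; the synthetic data and parameters are assumptions for demonstration only.

```python
# Illustrative sketch of stochastic gradient tree boosting with
# scikit-learn -- not the algorithm from the cited paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# subsample < 1.0 is what makes the boosting "stochastic": each tree is
# fit on a random fraction of the training rows.
gbt = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.7,
    random_state=0,
)
gbt.fit(X_train, y_train)
print("Holdout accuracy:", gbt.score(X_test, y_test))
```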
Reddit Machine Learning — The discussion here is lively and at times cynical, but there are also newcomers in learning mode, and we’ve hired top-shelf talent off of Reddit, so ha! Because it is a daily news feed, you can pick up new resources here on a weekly basis. This is where a lot of people learned about Microsoft’s massive investment in Azure and how it is propelling Microsoft to the front of the cloud-based Machine Learning pack.
Andrew Ng — One of the early leaders in machine learning, with Stanford/Coursera credentials that trump just about everyone else’s out there. But there’s also an error in the blog, so look for a follow-up here…
KD Nuggets — One blog is never enough; here is a useful list to augment Google searches. Spend a few hours on these kinds of blog lists and you’ll soon find a resource that scratches an itch.
NG Data — Another list, updated in the last two months, with very useful content. We’ll keep an eye on it to see whether it stays active and how deep the content goes.
Data Science Central — Active bloggers pushing out content on a weekly basis, mostly focused on the newest “list” and click-bait titles. Still, some of it is worth the click when it looks interesting, and much of it is re-published on LinkedIn, where peer commentary can be lively and informative. If you are going to feed from the fire hose, you should probably start here.