
Rapidics Accelerates my Machine Learning Learning

For the past week or so, I’ve been digging into machine learning, a topic I’ve been interested in for a long time. I’ve done some reading on the subject and collected links to informational resources and open source tools for years, but I long ago reached the point of diminishing returns: I either needed to start actually experimenting on my own, or I needed a specific reason to learn more.

Thanks to a chance meeting at a coffee shop, I now have the latter. Mark Seligman of Rapidics has helped me understand more about the application of machine learning techniques. What I find particularly interesting is that Mark and his company are in the business of providing machine learning infrastructure. They are doing some of the heavy lifting to make machine learning easier and more useful for others by taking a generally applicable algorithm called Random Forest and creating a solid, fast multicore implementation called Arborist that works with the popular, open source R statistical computing package. By doing so, they’ve achieved major speedups over the standard R implementation, along with efficiency and scaling advantages over many of the coarse-grained approaches to parallelizing R.
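To make the idea concrete, here is a minimal sketch of growing a random forest in parallel across cores. Arborist itself is an R/C++ implementation, so this is only a rough Python analogue using scikit-learn, where `n_jobs=-1` spreads tree construction over all available cores; the dataset is synthetic.

```python
# Illustrative analogue only: Arborist is an R package; this uses
# scikit-learn to show the same idea of multicore random forests.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each of the 200 trees is grown on a bootstrap sample; with n_jobs=-1
# the trees are built concurrently on all available cores.
forest = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))
```

Because the trees in a random forest are independent of one another, the algorithm parallelizes naturally at the tree level, which is part of why a careful multicore implementation pays off so well.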

What excites me most is that they’ve sped things up enough that it could fundamentally change the way people use Random Forest for machine learning, while at the same time making it useful to people who haven’t even heard of machine learning today. That makes the subject triply interesting to me: I’m learning about machine learning while getting to think about infrastructure and user experience.

So, thanks to Mark, and his partner Mark, for the education!

Predictors of Facebook Engagement

This report from a panel discussion at a user group for R, an open source statistics package, sheds some light on which factors Facebook found predictive of a new user’s initial and ongoing engagement with its site.

Itamar conveyed how Facebook’s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and (ii) if they stay, which data points predict how active they’ll be after three months?

For the first question, Itamar’s team used recursive partitioning (via the rpart package) to infer that just two data points are significantly predictive of whether a user remains on Facebook: (i) having more than one session as a new user, and (ii) entering basic profile information.
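The post says the team used R’s rpart package; the sketch below is an illustrative scikit-learn analogue of recursive partitioning on made-up data. The feature names and the simulated retention rule are hypothetical, not Facebook’s actual data.

```python
# Hedged sketch: a decision tree (the technique behind R's rpart)
# fit to simulated "new user" signals. All data here is invented.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500
# Hypothetical binary features: returned for a second session,
# and entered basic profile information.
more_than_one_session = rng.integers(0, 2, n)
entered_profile_info = rng.integers(0, 2, n)
X = np.column_stack([more_than_one_session, entered_profile_info])

# Simulated retention: users showing both signals usually stay,
# plus a little noise so the split isn't perfectly clean.
stayed = ((more_than_one_session & entered_profile_info) |
          (rng.random(n) < 0.1)).astype(int)

# Recursive partitioning: repeatedly split on the most informative
# feature; depth 2 suffices for two binary predictors.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, stayed)
print(tree.feature_importances_)
```

The appeal of partitioning methods for a question like this is interpretability: the fitted tree reads directly as a small set of if/then rules about which users remain.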

For the second question, they fit the data to a logistic model using a least angle regression approach (via the lars package), and found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site.
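For flavor, here is a hedged sketch of least angle regression using scikit-learn’s `Lars`, a comparable implementation to R’s lars package. The predictors mirror the three behavior classes named above, but the data is simulated and the response is linear rather than logistic, purely for simplicity.

```python
# Illustrative only: simulated predictors loosely named after the
# three behavior classes in the post; none of this is Facebook data.
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.default_rng(1)
n = 400
inbound_contact = rng.normal(size=n)   # how often others reached out
app_use = rng.normal(size=n)           # third-party application use
receptiveness = rng.normal(size=n)     # how forthcoming the user was
noise_feature = rng.normal(size=n)     # an irrelevant column

X = np.column_stack([inbound_contact, app_use, receptiveness, noise_feature])
# Simulated three-month activity driven by the first three predictors.
activity = (2.0 * inbound_contact + 1.5 * app_use + 1.0 * receptiveness
            + 0.1 * rng.normal(size=n))

# LARS enters predictors one at a time, most correlated first, which
# is why it is handy for identifying which variables matter.
model = Lars(n_nonzero_coefs=3).fit(X, activity)
print(model.coef_)
```

Stopping the path at three active coefficients leaves the irrelevant column at exactly zero, which mirrors how a LARS-style analysis surfaces a short list of predictive variables.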

via How Google and Facebook are using R: Data Evolution.