February 13, 2017

How much of Machine Learning is #FakeNews?

If you’re in the tech industry building or selling software, you will be well aware of the rise of machine learning over the past few years.

More recently the terms ‘machine learning’ and ‘AI’ have gone mainstream and are essential items in every CEO's buzzword bucket. Everybody is doing it, the machines are taking over the world, and if you don't have an AI strategy you're in the dark ages just waiting to get disrupted.

It’s fair to say that AI is the SoMoLo (social, mobile, local) buzz of 2017. But how much of it is real and how much is just smoke and mirrors?

We've been focussed on enhancing our product with machine learning for the past two years and during that time we've learned a lot about what works and what doesn’t. It’s true that there is a large element of science behind data science, so a lot of it is just R&D and never makes it into production.

During this time we’ve often been asked how much machine learning we actually do, and we've also learned a lot about many other companies out there beating the AI drum. We’ve seen large, well-known and respected companies running billboards about their AI prowess and learned that behind the scenes they’re doing nothing in production. Or often nothing at all. It's all just more buzz to keep the investors happy and the market excited about the prospects of the future.

As a startup all we have is our product and the goal of building a good reputation for offering a quality service. That means being honest about how we do things, sharing our experiences, and incrementally getting better over time. Some startups might feel the need to jump into the enterprise smoke and mirrors game, but I feel that in the long term that's a flawed strategy.

Machine learning at ThisData

We're not calling it AI yet but we're certainly comfortable with talking about machine learning. Our story started around two years ago when we thought that we would use machine learning to detect anomalies in user interactions with data or systems. We later refined that down to focus on logins, but it's the starting point that was our first valuable lesson.

At the beginning we assumed that we could just use machine learning and BOOM! - we would get great results, protect the world, and strike gold. We very quickly learned that we had no idea what we were on about, it wasn't that simple, and that we had to be far more strategic about how we would employ machine learning in our system.

We learned that to be effective we needed to ask our machine learning algorithms very narrow, focused questions and give them sufficient prior examples to use as a basis for their decisions.

It was at this point that we decided to shelve machine learning for a while and focus on a rule-based approach to detecting anomalies in user behaviour and logins. We constructed a large set of rules that analyzed attributes of the user and their context at login time. We modelled the expected results in spreadsheets, tweaked rule weightings, and eyeballed results for what felt right based on various use cases.
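A weighted rule set like the one described above can be sketched in a few lines of Python. This is only an illustration, not our production code: the rule names, the weights, and the fields on the login context (device_id, known_devices, country, known_countries, hour) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical login context: a plain dict of attributes known at login time.
LoginContext = Dict[str, Any]


@dataclass
class Rule:
    name: str
    weight: int  # relative importance, tuned by hand (e.g. in a spreadsheet)
    matches: Callable[[LoginContext], bool]


def risk_score(rules: List[Rule], ctx: LoginContext) -> float:
    """Return the weighted share of rules that fired, between 0.0 and 1.0."""
    total = sum(rule.weight for rule in rules)
    if total == 0:
        return 0.0
    fired = sum(rule.weight for rule in rules if rule.matches(ctx))
    return fired / total


# Illustrative rules only; names and weights are assumptions.
RULES = [
    Rule("new_device", 5, lambda c: c["device_id"] not in c["known_devices"]),
    Rule("new_country", 3, lambda c: c["country"] not in c["known_countries"]),
    Rule("odd_hour", 2, lambda c: c["hour"] < 6),
]

login = {
    "device_id": "laptop-2",
    "known_devices": {"laptop-1"},
    "country": "NZ",
    "known_countries": {"NZ"},
    "hour": 14,
}
score = risk_score(RULES, login)  # only the new_device rule fires
```

Tweaking a weighting here is the code equivalent of changing a cell in the spreadsheet and eyeballing how the scores move across the use cases.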

We launched our rule-based service without any machine learning at all. It collected data and shared aggregated results across customers, and the results improved over time as more data was gathered.

Fast forward about a year to mid-2016, and we were back into machine learning, this time with a narrow focus. Rather than using machine learning to solve one big problem, we started creating algorithms that added superpowers to the rule-based system we already had in place. That system gave us plenty of prior data for training the algorithms and the ability to benchmark the incremental value added by each one. We still had some failed machine learning projects, but it’s all just science, right? You have to be prepared for that.

The hybrid approach

Fast forward again to the present day and we’re still not comfortable with the term AI, but happy with the amount of machine learning we're doing. In fact as we've started talking with larger companies we've come to realize that while they want to hear that we're looking toward the future with machine learning, they're not ready to hand over control to the machines.

The answer for our product is a hybrid approach: a system that has naturally evolved from a rule-based system to incorporate machine learning while still letting you set fixed rules. For example, the machine learning might say that traffic for a user from a specific location is safe, but if you really want to block that location then you can.
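As a rough sketch of that override, assuming a model that returns a risk score between 0 and 1: the function name, the threshold, and the blocked-location set below are hypothetical and only show the order in which the two decisions are made.

```python
from typing import Any, Callable, Dict, Set

LoginContext = Dict[str, Any]


def decide(
    ctx: LoginContext,
    model_score: Callable[[LoginContext], float],
    blocked_locations: Set[str],
    threshold: float = 0.5,
) -> str:
    """Return "block" or "allow" for a login attempt."""
    # Fixed rules are checked first, so a customer's explicit choice
    # always wins over whatever the model thinks.
    if ctx["country"] in blocked_locations:
        return "block"
    # Otherwise defer to the model's risk score (higher means riskier).
    return "block" if model_score(ctx) >= threshold else "allow"


def always_safe(ctx: LoginContext) -> float:
    # Stand-in for a trained model that considers every login safe.
    return 0.0


blocked = {"XX"}  # placeholder location code chosen by the customer
from_blocked = decide({"country": "XX"}, always_safe, blocked)
from_elsewhere = decide({"country": "NZ"}, always_safe, blocked)
```

Even though the stand-in model rates both logins as safe, the one from the blocked location is still refused, which is the control larger companies tell us they want to keep.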

Cutting through the noise

I hope to see more honesty and visibility from software companies working on adding machine learning to their products. There are some amazing off-the-shelf algorithms from Amazon, Microsoft, and Google, and smart data scientists building Dockerized models based on scikit-learn and other similar libraries. Whether you’re rolling your own algorithms or just experimenting with cloud-based services, don't be afraid to share results and be open about failures.

By cutting down the noise and sharing our experiences we will progress faster, and before long we might just be comfortable with labelling our intelligent services as AI.

If you’re interested in getting started here is a great post on setting up for local development.

Also, in data science land they say “garbage in, garbage out” so for better results remember to keep a narrow focus and ask your algorithms very specific questions.

