Is machine learning the savior of all things “big data” or a poorly-understood technology run amok? Depending on what you read, you could be forgiven for coming to either conclusion (or anything in between).

While entire books are needed to fully cover this topic, this article will give a very high-level summary of what machine learning is, why robots aren’t taking over the world (yet), the basic idea of how machine learning works and some examples of popular algorithms that might apply in a fintech paradigm.

The author has spent over five years using various analytic and machine learning approaches, including over two years at a national laboratory focused on this topic.

What is machine learning, exactly?

Like many technological topics, one can ask this question of 100 people and get 100 different answers. However, at its simplest, machine learning is just a collection of mathematical algorithms used to recognize patterns, or to highlight patterns for humans to analyze. If we define machine learning as “the use of statistical analysis to enhance the process of analyzing data,” we can see that it’s just an extension of data science, and something that many businesses are clamoring to do more of as soon as possible.

Say you have a repository containing 1,000 pictures of cats and you want to be able to feed a previously-unseen picture to the computer and have it tell you “cat” or “not-cat,” maybe with some degree of confidence. Sounds easy, right? After all, humans do this type of thing (visual or otherwise) constantly as part of being alive. But it’s surprisingly difficult to train a computer to do this.

Set aside for a moment the challenge of recognizing all the different things in the category of “not-cat”; even just determining “is this a cat?” takes significant effort. What about a cat seen from above? A hairless cat? A Chihuahua that looks like a cat? A hand-drawn cat? An upside-down cat? A blurry image of a cat?

Our human brains wouldn’t struggle much with most of these challenges, but sometimes even we can be tricked (what color was that dress, anyway? Is it a vase or two faces?). Still, we usually get it right.

Why is this so hard for machines?

Our brains are massively parallel in operation and we have many specialized neural functions, or hierarchical collections of them, to detect things like lines, shapes, faces, motion, etc. While cognitive psychology and neurology are definitely outside the scope of this article, suffice it to say that we all carry around some serious supercomputing hardware in our skulls.

Replicating this in an algorithm is not as trivial as it might sound. Graphics Processing Units (GPUs) have given us a big advantage, but they’re still many orders of magnitude less powerful than even simple biological brains. GPUs are really good at doing the same thing to a lot of data at the same time. This is their purpose in rendering images: take all the pixels for an image and apply some operation to ensure they’re all displayed appropriately. This is just linear algebra (i.e., matrix multiplication) and GPUs are really good at it.

As it turns out, this is the same type of math we need to do the very repetitive and intensive operations of a machine learning algorithm.

What are the different approaches to machine learning?

Generally speaking, machine learning can be divided into two main camps (yes, there are others, such as reinforcement learning, but for simplicity we’ll limit ourselves to these two).

Supervised learning techniques allow us to give the algorithm the answer key to the test, let it train itself accordingly and then give it new questions it hasn’t seen before and expect it to perform well. Take our cat picture example above, which only works because we had enough cat pictures already on hand to teach our machine learning model what a cat picture looks like, in terms of the numerical representation of pixels.
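To make the supervised idea concrete, here is a deliberately tiny sketch in Python: a 1-nearest-neighbor classifier. The two numeric features per “picture” are invented stand-ins for real pixel data (an actual system would learn from thousands of pixel values), so treat this as an illustration of the idea rather than a real image classifier.

```python
# A minimal sketch of supervised learning: a 1-nearest-neighbor classifier.
# Each "picture" is reduced to two made-up numeric features (say, ear
# pointiness and whisker density) -- hypothetical stand-ins for pixel data.

def distance(a, b):
    # Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(training_data, new_example):
    # Find the labeled example closest to the new one and copy its label.
    nearest = min(training_data, key=lambda item: distance(item[0], new_example))
    return nearest[1]

# Labeled training set: (features, label) pairs -- our "answer key."
training_data = [
    ((0.9, 0.8), "cat"),
    ((0.8, 0.9), "cat"),
    ((0.2, 0.1), "not-cat"),
    ((0.1, 0.3), "not-cat"),
]

print(predict(training_data, (0.85, 0.75)))  # -> cat
print(predict(training_data, (0.15, 0.20)))  # -> not-cat
```

The “training” here is trivial (we just memorize the examples), but the shape of the task is exactly the supervised one: known answers in, predictions for unseen inputs out.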

The supervised approach isn’t always the best. Say, for example, we have a bunch of pictures of cats and want a machine learning algorithm to help us separate out the different breeds, but we don’t have any labeled data to train it on.

This is where unsupervised learning approaches come into play.

For example, my algorithm might be able to divide those 1,000 pictures into the five groups (we call them clusters) it thinks are most similar. Are these going to be “breed” groups? Not necessarily; the algorithm has no concept of what a “breed” is! The machine learning algorithm can only determine some measure of similarity between any pair of images. These approaches often yield unexpected results and usually need iterative refinement to get them to perform as desired.
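A minimal sketch of what such a clusterer looks like, using the classic k-means algorithm on made-up two-dimensional points. As with the earlier sketch, the two features per “picture” are invented for illustration, and the deterministic centroid seeding is a simplification (real implementations use random initialization with multiple restarts):

```python
def kmeans(points, k, iterations=10):
    # Bare-bones k-means: assign each point to its nearest centroid, move
    # each centroid to the mean of its assigned points, and repeat.
    # For simplicity, centroids are seeded with evenly spaced points.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2
                                      + (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster emptied.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two visually obvious groups -- but the algorithm has no idea what they "mean."
points = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (5.0, 5.2), (5.1, 4.9), (4.9, 5.0)]
clusters = kmeans(points, k=2)
print([len(c) for c in clusters])  # -> [3, 3]
```

Note that we had to choose k=2 ourselves; picking the number of clusters is exactly the kind of iterative refinement mentioned above.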

Let’s tackle an elephant in the room, as well.

Machine learning is a subset of what we might call “artificial intelligence,” but it is not the same thing. Artificial intelligence, or AI, is a broader category that includes techniques beyond just pattern recognition. Generalized AI is the term most often used for what we see in Hollywood: self-aware systems that might or might not present a threat of galactic conquest.

Research into these types of high-capability systems is very much ongoing, but it is arguably in its infancy, and if we’re not careful we quickly get into questions such as “what is consciousness, anyway?” So we’ll stick with machine learning.

What does machine learning actually do?

Think back to high school math. Do you remember doing a linear regression? Given a bunch of points on an X-Y axis (maybe something like the cost of a car versus its top speed), what is the best line to draw through them to show the relationship? That is, how well does the price of a car predict its top speed? You might have done this on a graphing calculator or maybe even in Excel.

The point is that for some new car you’ve never seen before, if you knew its price, you could make the best possible guess as to its top speed given everything you know about car prices and speeds. Maybe you remember a different example, but if you can remember the general idea, congratulations, you’ve done machine learning already!
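The car example above fits in a few lines of Python. The ordinary least squares formula here is standard, but the price and speed numbers are invented purely for illustration:

```python
def fit_line(xs, ys):
    # Ordinary least squares for a single input: slope and intercept
    # of the best-fit line through the (x, y) points.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

prices = [20, 30, 40, 60, 80]        # thousands of dollars (made up)
speeds = [120, 130, 140, 160, 180]   # mph (made up)

slope, intercept = fit_line(prices, speeds)
# Predict the top speed of a car we've never seen, given only its price.
print(intercept + slope * 50)  # -> 150.0
```

That final line is the whole payoff: a learned function turning a new input into a best guess.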

Wait, wait, wait: that doesn’t sound like what all those blog posts are talking about! Linear regression of this type is a very simple example of an optimization algorithm. And that’s all machine learning is, no matter how complex the algorithms become.

We are searching for an optimal mathematical function that can take one or more inputs and give us an output based on previously-learned patterns, or on statistically-derived patterns in the input data itself. This kind of optimization relies on a lot of trial and error to determine what that function is (in the case of supervised and some unsupervised algorithms), or at least a lot of calculation on all the input points. This could take millions of attempts at trying out different parameters for the model, or hyperparameters configuring the model. For data of any significant size this quickly becomes very expensive in terms of compute time, memory and potentially storage space. In extreme cases, this is where supercomputers come in, but you can do a lot with a commodity GPU or cloud resources from vendors like AWS, Google or Microsoft.
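That trial-and-error search can be made concrete with gradient descent, one common way these iterative optimizations are done. This sketch fits a line on invented data by repeatedly measuring the error and nudging two parameters downhill, thousands of times:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # underlying relationship: y = 2x + 1

slope, intercept = 0.0, 0.0  # start with a bad guess
learning_rate = 0.05         # a hyperparameter we chose by hand

for _ in range(5000):        # thousands of small corrections
    # Gradient of the mean squared error with respect to each parameter.
    grad_slope = sum(2 * (slope * x + intercept - y) * x
                     for x, y in zip(xs, ys)) / len(xs)
    grad_intercept = sum(2 * (slope * x + intercept - y)
                         for x, y in zip(xs, ys)) / len(xs)
    # Nudge each parameter a small step downhill.
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

print(round(slope, 3), round(intercept, 3))  # converges close to 2 and 1
```

Even this toy problem takes thousands of arithmetic operations; scale the data and the parameter count up by a few orders of magnitude and the appetite for GPUs becomes obvious.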

This computational effort is why humans can’t do “machine learning” by hand and it’s also why it’s so easy for machine learning to go wrong if we’re not careful. These algorithms are essentially black boxes, meaning we know they work within some defined criteria of success, but we can’t necessarily say how they work.

This can have big repercussions if machine learning is used in spaces with potential impacts on health, civil liberties, safety, security and so on without paying attention to various hazards. It also makes them difficult to explain in the case of litigation or regulatory investigation. This isn’t to say we shouldn’t use them, it just means we need to tread carefully and never blindly trust their output.

What are some examples of how machine learning could be used?

Two of the primary objectives of machine learning algorithms are regression and classification.

Regression is any algorithm that tries to predict one or more output values for a set of inputs. Linear regression, like what we talked about above, takes one input and gives one output. Real-world algorithms might have dozens of inputs and could try to predict multiple output values, but the more complexity we add, the harder it is to get an optimal model trained in reasonable time.

This is why feature engineering is important. We need to decide which inputs (features) are the most valuable and pre-process them in a way that makes them usable to the algorithm.

For example, think of an IP address as a “dotted quad.” It really isn’t very useful to a machine learning algorithm. Can we reliably say that two IPs are more similar if they are “closer together” (and what does “closer together” even mean?) than when they’re far apart? This is a case where the IP really is only a label, like someone’s full legal name. Handling labels is a straightforward problem: we encode each value as a binary input, one input for each possible value (often called one-hot encoding). But in the case of IP addresses that could get very big (billions of inputs in the case of IPv4 alone), so this might not be tractable.

So you’d need to ask, “Is the full IP address really valuable enough to warrant all that compute power?” Maybe you would instead choose to transform the IP address into its corresponding ASN identifier, reverse DNS domain, geolocation or some other more meaningful representation. All of this might not be something you can determine without some experimentation.
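A rough sketch of that two-step pattern in Python. The “first octet” grouping below is an invented stand-in for a real ASN or geolocation lookup; the point is transforming the raw address into a coarser label, then one-hot encoding the label into binary inputs:

```python
def coarse_ip_feature(ip):
    # Collapse the full address down to its first octet -- a crude,
    # hypothetical stand-in for a real grouping like ASN or geolocation.
    return "net-" + ip.split(".")[0]

def one_hot(value, vocabulary):
    # Encode a label as one binary input per possible value.
    return [1 if value == v else 0 for v in vocabulary]

ips = ["10.0.0.5", "10.1.2.3", "192.168.1.1"]
labels = [coarse_ip_feature(ip) for ip in ips]   # ['net-10', 'net-10', 'net-192']
vocabulary = sorted(set(labels))                 # only 2 inputs, not billions
encoded = [one_hot(label, vocabulary) for label in labels]
print(encoded)  # -> [[1, 0], [1, 0], [0, 1]]
```

The coarser the transformation, the fewer binary inputs we need; the trade-off is how much signal the grouping throws away, which is where the experimentation comes in.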

We can come up with all sorts of fintech examples for supervised learning, anywhere we want to predict a value for an unknown set of inputs based on known examples: revenue projection (based on historical trends or company attributes), forecasting potential demand for a service like API utilization, predicting the number of ACH returns for a client (and thus a measure of risk) or the likelihood that a given transaction will be returned, or predicting demand for a feature in a given market segment to optimize marketing spend. The list goes on!

We also met classification above, with cat/not-cat, and we can think of a number of ways it can be useful to us. We might wish to divide our customers into market segments or cohorts to optimize communications with them and gather cohesive feedback. We might wish to look for patterns of activity that indicate customers struggling to use a feature (or succeeding with it). Or we might want to identify anomalies in a dataset: find all the not-cats hiding in a pile of cats. This is another good use case for fraud and abuse detection. If most transactions cluster together, but some small set of others do not, we might want to understand why and whether something bad is going on.
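A sketch of that “most points cluster, a few do not” idea, using one of the simplest possible anomaly tests: flag any value far from the mean in standard-deviation terms. The transaction amounts are invented, and real fraud detection is far more sophisticated, but the principle is the same:

```python
def zscore_outliers(values, threshold=2.5):
    # Flag values more than `threshold` standard deviations from the mean.
    # (2.5 is chosen for this tiny sample; a single extreme point inflates
    # the standard deviation, so a stricter threshold would miss it.)
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

# Mostly ordinary transaction amounts, plus one that is wildly different.
amounts = [25.0, 30.0, 27.5, 22.0, 31.0, 26.0, 29.0, 24.5, 5000.0]
print(zscore_outliers(amounts))  # -> [5000.0]
```

The flagged value isn’t necessarily fraud, just something worth a human look, which is exactly the “highlight patterns for humans to analyze” framing from earlier.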

A limiting factor for algorithms of all types is often whether you have enough data to say anything meaningful. “Enough” means both the right type and the right quantity, while also ensuring the data doesn’t carry any inherent skew or bias that will affect your output model (for example, an imbalance among the classes in the training dataset).

Along with feature engineering, this can be one of the trickiest hidden obstacles to getting an algorithm into production safely.

Dwolla’s Data-Centric strategy

Dwolla takes pride in making data-driven decisions. This already drives a robust information security program and helps many of our teams (and clients) get the information they need to make informed decisions. We like to say “the right information, to the right people, at the right time,” and we broadly define this cross-functional strategy as Data Intelligence. We are on an exciting journey to build and mature these capabilities and bring them to bear for our clients, their end users and our internal stakeholders in new ways.
