Redesigning Procedural Decision Engine To Leverage Machine Learning (and Which)

Question

I'm not sure if this is the right forum, but I'm really looking for advice on this.

We have a fairly rudimentary procedural decision engine that decides application routing. It's basically about 30 if-else statements in series. For example:

if (application.isBlue && (!configs.isBlueAllowed)) { return 1 } ....

In addition, we are going to introduce different tags and flows. For example, if an application contains tag blah, then it should skip some of the if-else statements and proceed to a different processing center.

It seems that our naive approach is reaching its limits (sure, we can add more switches, but it will be neither elegant nor pretty).

Assuming this is the right forum, can you please comment on:

1. Would this type of application/process benefit from an ML approach such as supervised ranking?
2. If #1 is true, which ML approach is the most appropriate for this?

candied_orange · Answer 1 · 2018-09-09T20:08:35.563

2

This isn't the right way to frame this question.

I suspect what you really want to know is if Machine Learning can be used to improve your application. It might. It might not. I can't tell because that isn't what you asked.

What you asked is if it can improve a "procedural decision engine". That's not a requirement. That's an implementation. You're telling us how the problem was solved before, not what the problem is.

I'm fairly certain ML can be used to solve all problems of this class. ML can be used to solve XOR. You'd be crazy to reach for ML every time you need an XOR.

Understand that machine learning's strength isn't that it can solve logic problems. It's that it can figure out how to solve logic problems without humans hand-tuning how every bit is treated.

30 if-else statements might sound like a lot, and if poorly organized it certainly is. But it's peanuts to machine learning. Still, that isn't enough information to make the call.

“In addition, we are going to introduce different tags and flows. For example, if an application contains tag blah, then it should skip some of the if-else statements and proceed to a different processing center.”

This statement makes me think you're most likely to benefit from good old-fashioned decomposition and polymorphism. Look up what predicates are. If you have 30 if-else statements in one procedure, many of which need to be skipped, you're likely doing too much in one place. Break things up. Make many functions and objects that only take in what they need and don't reach out to learn things on their own. Have them each do only one thing.
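
A minimal sketch of what that decomposition could look like, in Python. Only isBlue and isBlueAllowed come from the question; the Application and Configs classes, the Rule type, the has_tag helper, and the route numbers are all assumptions made up for illustration, not the asker's actual design.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical data shapes; only is_blue / is_blue_allowed come from the question.
@dataclass(frozen=True)
class Application:
    is_blue: bool
    tags: frozenset = frozenset()

@dataclass(frozen=True)
class Configs:
    is_blue_allowed: bool

# A predicate answers one yes/no question and is handed everything it needs.
Predicate = Callable[[Application, Configs], bool]

@dataclass(frozen=True)
class Rule:
    name: str
    applies: Predicate
    route: int

def blue_not_allowed(app: Application, cfg: Configs) -> bool:
    return app.is_blue and not cfg.is_blue_allowed

def has_tag(tag: str) -> Predicate:
    # Builds a predicate for any tag, so a new tag needs no new if-else branch.
    return lambda app, cfg: tag in app.tags

# List order encodes priority, like the order of the original if-else chain.
RULES = [
    Rule("tag blah goes to its own center", has_tag("blah"), 7),  # route 7 is invented
    Rule("blue not allowed", blue_not_allowed, 1),
]

def decide(app: Application, cfg: Configs, default: int = 0) -> int:
    """Return the route of the first rule whose predicate holds."""
    for rule in RULES:
        if rule.applies(app, cfg):
            return rule.route
    return default
```

Putting the tag rule first is one way to express "skip the remaining checks": once a rule matches, nothing after it runs.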

But before doing all that refactoring, write some tests that prove the code works as well as it ever did. That way, when you think your new design does the same as the old design, you'll be right.
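
One way to write such tests is to record what the existing code returns for every input and compare the new code against that record. The sketch below assumes all inputs are booleans; legacy_decide is a stand-in for the real if-else chain, and is_urgent and the routes 2 and 0 are invented.

```python
from itertools import product

def legacy_decide(is_blue: bool, is_blue_allowed: bool, is_urgent: bool) -> int:
    # Stand-in for the existing if-else chain; is_urgent and routes 2/0 are invented.
    if is_blue and not is_blue_allowed:
        return 1
    if is_urgent:
        return 2
    return 0

def record_behaviour(decide, n_inputs: int) -> dict:
    """Run `decide` on every combination of boolean inputs and record the result."""
    return {flags: decide(*flags) for flags in product([False, True], repeat=n_inputs)}

# Captured once from the legacy code and kept as the expected behaviour.
GOLDEN = record_behaviour(legacy_decide, 3)

def refactored_decide(is_blue: bool, is_blue_allowed: bool, is_urgent: bool) -> int:
    # Hypothetical table-driven rewrite of the same logic.
    rules = [(is_blue and not is_blue_allowed, 1), (is_urgent, 2)]
    return next((route for holds, route in rules if holds), 0)

def mismatches(decide) -> list:
    """Inputs on which `decide` disagrees with the recorded legacy behaviour."""
    return [flags for flags, expected in GOLDEN.items() if decide(*flags) != expected]
```

An empty result from mismatches(refactored_decide) means the rewrite agrees with the old code on every input. With 20 boolean inputs that is 2^20 (about a million) combinations, which is still cheap to enumerate; non-boolean inputs would need sampling instead.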

Now that you have some tests and a new design, it's time to try adding new tests for the new tags and flows. The new design should make adding them easier.

But if you want us to assess if ML is a good way to solve this problem, we'll need more information than how you've solved it before. What is it actually deciding?

edited Sep 09 '18 at 20:08

answered Sep 09 '18 at 08:30


Thank you for the detailed response! To clarify, the intention is to replace the old code rather than improve it. At the end of the day, the engine decides (based on 20-some variables, plus a few special ones) whether an application should be routed to A, B, C, or D. I did some reading after posting the question, and it seems ML classification is a good candidate. – Simply_me Sep 09 '18 at 08:41
The desire to replace code rather than improve it is common and often misplaced. You first need to recognize where full understanding of your requirements lives. Very often the best, most authoritative, source for your requirements is the existing code (since no one updates or reads documentation). Don't throw it out until you've squeezed it dry. – candied_orange Sep 09 '18 at 08:47
I, respectfully, disagree. I've seen monuments of bad implementations simply because no one wanted to replace it. Anyway, this is orthogonal to the question at hand. – Simply_me Sep 09 '18 at 08:57
How bad the implementation is isn't the point. The point is whether the current implementation captures your existing requirements better than any other document. If the implementation is bad, it's extremely rare for the documentation to be better. – candied_orange Sep 09 '18 at 09:00
Try reading When Understanding means Rewriting and Things You Should Never Do and maybe you'll see where I'm coming from. – candied_orange Sep 09 '18 at 09:05
Thank you, I'll definitely read those. That said, how about the question at hand? You have a procedure that takes in 20-some variables and decides the best route (A, B, C, or D). Is this suitable for ML? – Simply_me Sep 09 '18 at 09:14
That's still not enough info for me to make the call that ML is a good choice here. I'm sure it can solve it. But that isn't the same as being the best approach. With what I know now, if I had to make the call, it's more about what you're good at than about the problem itself. Hope the next guy is good at it as well. – candied_orange Sep 09 '18 at 09:17

score 1 · Answer 2 · answered Sep 09 '18 at 13:30

Machine learning is nice, but often not applicable – this looks like a case where you can and should write “ordinary” code.

What is ML?

Machine learning is just statistics. After “learning” the relationships of some training data (actually, fitting a statistical model to the training data), the ML algorithm can predict outputs for new inputs. For supervised learning, the training set contains inputs and known outputs (“labels” in case of a classification problem). For unsupervised learning, the training set is unlabelled and the algorithm must infer relationships, which is closely related to clustering problems.

Limitations of ML

ML/computational statistics can be incredibly cool, but there are notable problems:

To obtain a good model, you need a big set of training data. Obtaining this set may be expensive and difficult.
Garbage-in, garbage-out: if your training data is bad, the model will be bad and result in bad predictions. You have to validate the model to test its quality. Suitable validation requires some amount of statistical knowledge.
Statistical models contain specific assumptions. If these assumptions don't fit your use case, the model will be bad. As a simple example, consider trying to fit a linear regression model on a periodic data set. The model's assumption of a linear relationship does not hold, so the model will be useless.
Generalization error: ML models try to generalize from the training data. That involves guessing, and guessing can go wrong. For example, if your training data is not a representative sample of the inputs that will be observed later, you might get a biased model.
Predictions will be fuzzy and inexact (have some variance). You can reduce this with larger training sets, but many real problems contain unavoidable noise. So the outputs of an ML algorithm have to be interpreted carefully.

E.g. the result of an image classification algorithm can be communicated misleadingly as “the image shows a cat”, or more clearly as “the image might show a cat (42% likelihood), toaster (41%), or computer screen (39%)”.

Similarly, for regression problems providing a credible interval might be helpful. There's a difference between a prediction “this customer is going to spend $29.21 today” and “there's a 50% chance the customer will spend between $19.39 and $64.22 today”.
Interpretability: A trained model usually has no meaningful interpretation. In simple cases a model describes correlations between input features, which can be interpreted and visualized. But simulation-based algorithms or models with latent variables are notoriously tricky to interpret and debug. It is not generally possible to explain within the problem domain why a specific prediction was made. This can have ethical and legal ramifications.
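
The linear-regression point in the list above can be checked with a few lines of plain Python. This is only a sketch: the sine-wave data is made up, and fit_line and r_squared are small helper functions written for this example rather than a library API.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def r_squared(xs, ys, a, b):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Ten full periods of a sine wave, sampled 1000 times (made-up data).
xs = [i * (20 * math.pi / 1000) for i in range(1000)]
ys = [math.sin(x) for x in xs]

a, b = fit_line(xs, ys)
print(f"slope={a:.4f} intercept={b:.4f} R^2={r_squared(xs, ys, a, b):.4f}")
```

The fit succeeds numerically, but the R² value stays close to zero: the line explains almost none of the variation, because the model's assumption does not match the data.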

When to use ML

For what kinds of problems can ML/computational statistics be appropriate?

For example, if an exact solution is infeasible and an approximate solution is tolerable. The ML model must be allowed to be wrong. Your requirements tell you how wrong the model is allowed to be. You can then try to meet the necessary prediction performance, e.g. by better and bigger training sets, or by techniques such as boosting.

In particular, approximate solutions are tolerable if they are merely used to advise human experts, or when any actions triggered by the prediction are reversible. E.g. using ML for email spam categorization is fairly unproblematic because I can manually mark emails as spam/not spam if the categorization is wrong.

So how about those if-statements?

For a rule engine or other core business logic, machine learning is probably not a good fit.

The ML model may perform unwanted actions.
The ML model may fail to perform wanted actions.
The ML model is basically impossible to debug.
The necessary training set to achieve satisfactory performance is going to be much larger than a comprehensive test suite.

Writing software can be difficult, and requirements can be complex. Machine learning can sometimes fulfill these requirements, but it will not magically remove that complexity.

At best, you can approximate a good-enough solution.
At worst, you are completely ignoring your requirements.

ML is just a mathematical toolset, and no replacement for gathering requirements, writing code, performing tests. You still have to do software engineering.

Thank you for the detailed answer. Given that all values are deterministic (T/F 0/1), one would expect ML to have a consistent model and output. — Simply_me, Sep 09 '18 at 18:43
@Simply_me Sure, you can design an ML model that predicts True/False outputs (classification problem). But my point is that the model will sometimes be wrong. And if you provide so much training data that you no longer see any errors, then that could take so much effort that it might be easier to simply use your training data as a lookup table, or just write the necessary code yourself. — amon, Sep 10 '18 at 05:44
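
As a rough illustration of the lookup-table alternative mentioned in the comment above: the table below is invented (routes A to D follow the question, but which input maps to which route is an assumption), and route is a hypothetical helper.

```python
from typing import Dict, Tuple

Key = Tuple[bool, str]  # (is_blue, tag); hypothetical inputs

# Every known input is listed with its agreed outcome. The mapping itself is
# invented for illustration.
ROUTE_TABLE: Dict[Key, str] = {
    (False, ""): "A",
    (True, ""): "B",
    (False, "blah"): "C",
    (True, "blah"): "D",
}

def route(is_blue: bool, tag: str) -> str:
    """Exact lookup: a listed input always gets the listed route."""
    key = (is_blue, tag)
    if key not in ROUTE_TABLE:
        # Unlike a trained model, the table refuses to guess on unseen inputs.
        raise KeyError(f"no routing rule for {key}")
    return ROUTE_TABLE[key]
```

The table never generalises: a known input is always routed the same way, and an unknown tag fails loudly instead of producing a plausible-looking wrong answer.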

score 0 · Answer 3 · answered Sep 09 '18 at 15:00

“It seems that our naive approach is reaching its limits (sure, we can add more switches, but it will be neither elegant nor pretty).”

Can you use ML techniques? Yes.

I think your problem is that you have an input {P} and the computing engine C has become too complex ("not elegant nor pretty"). Using ML techniques, it is possible to get C "for free" when you have a well-defined input {P} and output {Q}.

Basically, what you need to do is translate your "application" and "configs" into a data set with numerical values, e.g. booleans for "isBlue", "isBlueAllowed", etc., plus the wanted "result". You have to create this data set for all possible configurations and outcomes. This is required to get tightly fitting data, so the result will be deterministic.
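
A small sketch of that translation step, assuming the application and configs arrive as plain dictionaries. The field hasTagBlah, the encode and to_csv helpers, and the example rows are assumptions for illustration; only isBlue and isBlueAllowed appear in the question.

```python
import csv
import io

# Column order is fixed so every row means the same thing. Only isBlue and
# isBlueAllowed appear in the question; hasTagBlah is a hypothetical extra.
FEATURES = ["isBlue", "isBlueAllowed", "hasTagBlah"]

def encode(application: dict, configs: dict) -> list:
    """Flatten the two objects into one row of 0/1 values."""
    merged = {**application, **configs}
    return [int(bool(merged[name])) for name in FEATURES]

def to_csv(examples) -> str:
    """Serialize (application, configs, result) triples as a labelled data set."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(FEATURES + ["result"])
    for application, configs, result in examples:
        writer.writerow(encode(application, configs) + [result])
    return buffer.getvalue()

# Made-up examples: each pairs an input with the wanted result.
examples = [
    ({"isBlue": True, "hasTagBlah": False}, {"isBlueAllowed": False}, 1),
    ({"isBlue": True, "hasTagBlah": False}, {"isBlueAllowed": True}, 0),
]
print(to_csv(examples))
```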

Once you have that kind of data set, the computation engine doesn't need to be complex: it can be as simple as a logical computation between the data set and some filter, or, if it's more complex, some operations involving matrices. We probably don't care much what this computation is, as long as its output matches {Q}.

Your next job is to decide how to get the computation engine C. To be honest, there are quite a few candidate algorithms, and you'll have to test them yourself to decide. At this point, speed is probably the biggest consideration.

Your problem is that you have a decision tree and you want to be able to generate it instead of updating the code. One popular decision tree generator is C4.5.
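
To give a feel for what "generating" the tree means, here is a simplified ID3-style sketch in plain Python. It is not C4.5 itself (C4.5 adds gain ratio, pruning, and handling of continuous attributes, among other things), and the three features and the labelling rule are invented for illustration.

```python
import math
from collections import Counter
from itertools import product

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_feature(rows, labels, features):
    """Feature whose split leaves the lowest weighted entropy (highest gain)."""
    def remaining(f):
        score = 0.0
        for value in (0, 1):
            subset = [lab for row, lab in zip(rows, labels) if row[f] == value]
            if subset:
                score += len(subset) / len(labels) * entropy(subset)
        return score
    return min(features, key=remaining)

def build_tree(rows, labels, features):
    """Return a label (leaf) or a (feature, subtree_if_0, subtree_if_1) tuple."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    f = best_feature(rows, labels, features)
    rest = [g for g in features if g != f]
    branches = []
    for value in (0, 1):
        part = [(r, lab) for r, lab in zip(rows, labels) if r[f] == value]
        if part:
            sub_rows, sub_labels = zip(*part)
            branches.append(build_tree(list(sub_rows), list(sub_labels), rest))
        else:
            branches.append(majority)  # no example with this value
    return (f, branches[0], branches[1])

def predict(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, if_zero, if_one = tree
        tree = if_one if row[feature] else if_zero
    return tree

# Columns: 0 = isBlue, 1 = isBlueAllowed, 2 = hasTagBlah (the last is hypothetical).
def wanted_route(row):
    # Invented labelling rule standing in for the real requirements.
    if row[2]:
        return "C"
    return "D" if row[0] and not row[1] else "A"

rows = list(product((0, 1), repeat=3))
labels = [wanted_route(r) for r in rows]
tree = build_tree(rows, labels, [0, 1, 2])
```

Because the data set here lists every combination, the generated tree reproduces every listed outcome; with an incomplete data set it would have to guess for the missing rows.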

When you need to add another configuration property in the future, just update the data set and update the model of whatever algorithm you've chosen.

This whole process is how you apply ML; it's not hard. If you want to compare, the comparison is between maintaining the training data set {P} and developing and maintaining a piece of software.

“You have to create this data sets for all possible configurations and outcome […] so the result will be deterministic.” – If I'm able to write a detailed table of inputs and outputs I could simply use that as a lookup table or just write the corresponding code. Involving ML would not seem to reduce the necessary effort, but would be a source of possible errors (as in, wrong predictions for unseen inputs due to wrong generalisations, or even wrong predictions for known inputs due to too much generalisation). — amon, Sep 10 '18 at 05:38
You're right, it's pretty much a look up table, or decision table, which can be represented as decision tree, or some matrices representing neurones. Re: the technique doesn't reduce the necessary effort, I found this in a wiki page: "Decision tables have proven to be easier to understand and review than code, and have been used extensively and successfully to produce specifications for complex systems." It's cited Udo W. Pooch, "Translation of Decision Tables," ACM Computing Surveys, Volume 6, Issue 2 (June 1974). — imel96, Sep 10 '18 at 09:09
