
After going over this post about reward shaping, I still find it difficult to define it with respect to my specific problem. Suppose that I measure some state $s=\phi$, which is an angle, and I take actions $a=\tau$ that are motor torques. For example, given the state $\phi_0$, I would like to apply a torque $\tau_0$ that brings my system to a new angle state $\phi_1$. Two aspects are important:

  1. I aim to bound $\phi$ to some range $[\phi_{min},\phi_{max}]$.
  2. Large changes between two consecutive actions should be avoided, i.e., $||a_t-a_{t-1}||<\epsilon$ for all $t$ and some small $\epsilon>0$.

Given those, a naive reward shaping would be:

  1. if $\phi\notin[\phi_{min},\phi_{max}]:$ set $r\leftarrow r-\phi$
  2. if $||a_t-a_{t-1}||\geq\epsilon:$ set $r\leftarrow r-||a_t-a_{t-1}||$
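
To make the scaling issue concrete, here is a minimal Python sketch of that naive shaping (the function name and arguments are purely illustrative):

```python
import numpy as np

def naive_shaped_reward(r, phi, a_t, a_prev, phi_min, phi_max, eps):
    # Naive shaping: subtract the raw, unscaled quantities as penalties.
    if not (phi_min <= phi <= phi_max):
        r -= phi                          # angle term, measured in radians
    delta = np.linalg.norm(a_t - a_prev)  # torque change, in torque units (N*m)
    if delta >= eps:
        r -= delta
    return r
```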

The problem is that, first, $\phi$ and $||a_t-a_{t-1}||$ may very well be on different scales. If, for example, $\phi\gg||a_t-a_{t-1}||$, the second penalty would barely affect the total reward, and vice versa. Furthermore, from a more "physical" point of view, $\phi$ is measured in radians, yet $||a_t-a_{t-1}||$ is not in radians at all, so the two penalties add quantities with different units. Every physicist will tell you that this means something is wrong!

Therefore, my question is: how can my reward depend on drastically different phenomena that do not relate to each other?

Hadar
  • Before I post an answer, I do have a question. Are you saying that your objective is to bound phi to a certain range, and that you aim to do this without drastic changes from one action to another? Is that what you mean by your 2 important aspects? – Vladimir Belik Sep 19 '22 at 15:22
  • Yes, you could phrase it that way. It might also help if I add more context - think of a wing attached to a motor. The motor applies a torque $a=\tau$ that brings the wing to some angle $s=\phi$. I'm looking for a reciprocating wing motion, hence the bounded state, but I'm also looking for actions that can be replicated by a real-life motor, so large consecutive changes are impractical. – Hadar Sep 19 '22 at 15:40
  • I'll be honest - as a very mechanically un-inclined person, that explanation went a bit over my head, haha, BUT I think I got your general idea. – Vladimir Belik Sep 19 '22 at 15:41

1 Answer


First thing I'd say is that I've found Cross Validated to not be the best place for questions about reinforcement learning (at least at this level of depth). I'd highly recommend ai.stackexchange and/or r/reinforcementlearning on Reddit.

Now, as to your question. I understand where you're coming from: there isn't some straightforward, unit-less way to combine these metrics. However, you have to remember that RL rewards don't need to be based on the physical, logical laws of the universe. Your goal when designing a reward function is very practical/empirical - design a function that makes the agent do what you want it to do. The only thing to be aware of, if you make a reward function too divorced from physical reality, is that it might "accidentally" allow behavior that would be catastrophic in the real setting.

All that said, here are some thoughts/ideas:

  1. You have what seems to be a pretty typical setup. You might find this video helpful: https://www.youtube.com/watch?v=bD6V3rcr_54&t=364s
  2. You might consider simply giving a flat reward for every time step that the agent's phi is in the correct range, and no reward otherwise (or a negative reward). To incorporate the torque constraint, you might again consider giving a negative reward. As to the magnitude of the rewards, you might have to just use trial and error: make the phi-based reward +1 per second and the torque-based punishment -5 per violation, then observe the behavior. If you find that the agent still makes big torque adjustments, it must mean that the agent thinks those are STILL worth doing - make the punishment bigger. If the agent isn't staying in the range, make the phi reward bigger per second, add a small punishment for all other zones, etc. My understanding is that the approach to these things is really very empirical (you don't need to fret too much about how the units translate to real life as long as your problem is reasonably represented). A rough sketch of this kind of reward function is shown after this list.
  3. Look into action masking. If it were me, I think "teaching" the agent that big torque changes are bad would be a lot harder than making such movements impossible. I'm only familiar with masking for discrete action spaces, but I'm sure there's a continuous analogue. The idea here is just to make it impossible for your agent to take a drastic action - this simplifies your problem, as your agent now only has to worry about staying in the phi bounds using the tools it has (small-to-medium adjustments). A sketch of one way to do this for continuous torques is shown below.
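
As promised in point 2, here's a rough sketch of such a trial-and-error reward function. The weights below are just starting guesses to be tuned by watching the agent's behavior, not physically meaningful constants:

```python
import numpy as np

# Hand-tuned weights: treat these as knobs, not physical constants.
IN_RANGE_REWARD = 1.0       # given every time step phi stays in range
TORQUE_JUMP_PENALTY = 5.0   # subtracted whenever the smoothness rule is broken

def shaped_reward(phi, a_t, a_prev, phi_min, phi_max, eps):
    r = 0.0
    if phi_min <= phi <= phi_max:
        r += IN_RANGE_REWARD
    if np.linalg.norm(a_t - a_prev) >= eps:
        r -= TORQUE_JUMP_PENALTY
    return r
```

If the agent still jerks the motor, raise TORQUE_JUMP_PENALTY; if it won't stay in the angle range, raise IN_RANGE_REWARD (or add a small penalty outside the range). The absolute units don't matter - only the relative sizes the agent ends up trading off.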
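
And for point 3: since your actions are continuous, the simplest substitute for a discrete action mask is to clamp each commanded torque to within $\epsilon$ of the previous one, so a drastic change simply never reaches the motor. Here's a sketch as a generic environment wrapper (not tied to any particular RL library's wrapper API):

```python
import numpy as np

class TorqueRateLimiter:
    """Clamp each new torque to within eps of the previous one,
    so large consecutive changes can never reach the motor."""

    def __init__(self, env, eps):
        self.env = env
        self.eps = eps
        self.prev_action = None

    def step(self, action):
        if self.prev_action is not None:
            action = np.clip(action,
                             self.prev_action - self.eps,
                             self.prev_action + self.eps)
        self.prev_action = action
        return self.env.step(action)

    def reset(self, **kwargs):
        self.prev_action = None
        return self.env.reset(**kwargs)
```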
  • Good tips, Vladimir. Specifically interested in action masking, so I'll look it up. Thanks! – Hadar Sep 19 '22 at 16:04
  • @Hadar Absolutely! I couldn't recommend masking more. I've found some people try to do it by just punishing "invalid" actions, and in my (very small) experience, it's WAY less efficient than just masking, which (at least for discrete actions) isn't hard to do. Best of luck! – Vladimir Belik Sep 19 '22 at 16:20