All image credit to maintainers of the CMA-ES library on GitHub. Image source.
Recently, I needed to search for trivial policies in an environment. These policies were simple things, like always take the same action, or take the same action every K steps. In order to optimize these trivial policies as much as possible, I turned to the Optuna framework. But as I dove into their documentation, I found that they've got a nice set of optimizers, and I began to wonder: what if I searched for not-so-trivial policies?
Could I search for all of the parameters for one small neural network layer and get good performance? The answer, we'll see, is yes… in some environments.
The Setup
I decided to keep it simple and experiment with two policies: one I'm calling a "dense Gaussian policy" and the other I'm calling a "dense policy". I think these names are technically correct, but they can also sound pretty opaque, so I'll explain what I mean. Check out my code at this link if you're interested.
The Policies
Both policies are similar to a 1-layer neural network: we compute the output by taking the dot product of the input with an array of weights, W. In this case, the input is the observations (S) from the environment and the output is the action we want to take, so the weight array (W) has the shape (observation_dimension, action_dimension).
In the "dense policy", the output of the dot product is the action, so we're done! In the "dense Gaussian policy", the result of the dot product between S and W has shape equal to action_dimension and is used as the µ (mu, or mean) vector for a Gaussian distribution. We also store a standard deviation (σ) vector in the layer, and then we sample our actions from the resulting Gaussian distribution. Like this:
# mu has shape (action_dimension,): the mean of the action distribution
mu = np.dot(observations, weight_array)
# shift and scale a unit-Gaussian sample to get a sample from N(mu, sigma^2)
action = mu + np.random.randn(env.action_space.shape[0]) * sigma
I'll break down the two statements above: in the first, we take the dot product of the observations and W to get µ (the mean of our distribution). In the second, we use NumPy's randn function to sample from a unit Gaussian distribution (mean 0 and variance 1). Multiplying the randn sample by σ scales its standard deviation to σ, and adding µ shifts its mean from 0 to µ.
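For contrast, the plain dense policy has no sampling step at all; a minimal sketch, reusing the same names, is just:
action = np.dot(observations, weight_array)  # dense policy: the dot product itself is the action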
So now, what am I optimizing? I used Optuna to directly optimize each parameter in the weight array. This approach scales really poorly, since Optuna is designed to optimize hyperparameters and it suggests one number at a time. This means you have to use a for loop to fill your weight array one by one. But in environments with small vector observations, it's fast enough. Just know that if you're here looking for performant, well-written, scalable code, you're in the wrong place! I cobbled all this together for fun and gave little thought to how good the code actually is. I decided to use CMA-ES as my optimization method.
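To make that concrete, here's a minimal sketch of the kind of objective I mean. The dimensions, parameter bounds, and the evaluate_policy helper (a stand-in for rolling the policy out in the environment and returning the total episode reward) are placeholder assumptions, not taken from my actual code:
import numpy as np
import optuna
OBS_DIM, ACT_DIM = 8, 2  # e.g. an 8-dimensional observation and a 2-dimensional action
def objective(trial):
    # Fill the weight array one parameter at a time, since Optuna suggests scalars.
    weights = np.array([
        trial.suggest_float(f"w_{i}", -1.0, 1.0)
        for i in range(OBS_DIM * ACT_DIM)
    ]).reshape(OBS_DIM, ACT_DIM)
    return evaluate_policy(weights)  # episode reward earned by these weights
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.CmaEsSampler())
study.optimize(objective, n_trials=500)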
The Algorithm: CMA-ES
CMA-ES stands for "Covariance Matrix Adaptation Evolution Strategy". It's a black-box numerical optimization method; in fact, all evolution strategies are black-box optimization methods. The phrase "black-box" refers to the fact that these methods don't require derivatives of an objective function in order to optimize it. They're called "evolution strategies" because they loosely model biological evolution: they stochastically generate a population, evaluate the population on the objective function, and move the population-generating distribution towards the higher-performing members of the population.
If you squint, this kind of mimics natural selection, where the fittest members of a population propagate their traits onwards to the next generation. CMA-ES can also change the variance of the population it generates to find a solution faster, and that's what the gif at the top is showing. If you want to read more about this topic, David Ha has two great posts talking about evolution strategies and using them in RL environments. I also recommend reading the Wikipedia page. If you google around, you'll find a bunch of resources about the CMA-ES algorithm and evolution strategies in general.
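As a rough sketch of that generate/evaluate/update loop, here's what it looks like with the pycma library's ask-and-tell interface (evaluate_policy and the shapes are again placeholder assumptions, not my actual code):
import numpy as np
import cma
n_params = 16                       # e.g. an 8 x 2 weight array, flattened
es = cma.CMAEvolutionStrategy(np.zeros(n_params), 0.5)  # initial mean and step size
while not es.stop():
    candidates = es.ask()           # sample a population of flat weight vectors
    # CMA-ES minimizes, so hand it the negative episode reward
    losses = [-evaluate_policy(np.asarray(w).reshape(8, 2)) for w in candidates]
    es.tell(candidates, losses)     # shift the search distribution toward the best candidates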
The Results
This is a mixed bag of results: some great, some not so great. Working with one layer of shape (observation_dimension, action_dimension) really forces us to work with limited representational capacity. Since the policy has so few parameters, it can't represent very complex relationships. Surprisingly, though, it can still succeed in chaotic and nonlinear environments.
Successes
Check out the inverted double pendulum. All 3D robotics environments are courtesy of PyBullet.
Even though this is a highly unstable and nonlinear system, the optimizer was still able to find a set of weights that can stably balance the pendulum. The policy here gets about 9,300 reward; I believe the maximum possible is 10,000. So it's not perfect performance, but it's still really good. I think this is super cool because it's literally controlled with a 9-parameter fully-connected layer, and that layer is optimized using a black-box (i.e., derivative-free) optimizer. Compared to the massive networks we're used to seeing, with thousands, millions, and sometimes billions of parameters, it feels almost shocking that so much can be done with 9 parameters.
Just above here is the inverted pendulum swing-up problem. Again, this is an unstable, nonlinear problem. The goal is to start with the pole below the cart and then swing it up and balance it on top of the cart. This policy only has 5 parameters! It gets about 700 reward, and the maximum possible is 1,000. So it's definitely not perfect, but it seems clear that the policy learned something useful, since in the video it makes multiple attempts to swing the pole up and keep it balanced.
This is a 16-parameter policy trained on Lunar Lander. The environment is modeled after the classic lunar lander game. The goal is simple: land the spaceship between the flags and don't crash! The policy has 16 parameters because the action space is 2-dimensional and the observation space is 8-dimensional. The 2 actions are real-valued numbers, with one representing left-right thrust and the other representing bottom rocket thrust. This policy actually solves the environment: the reward here was about 275, and the environment is considered solved once the reward earned is over 200.
This is our last pendulum, but it's also our most boring one. This is just the standard single inverted pendulum, where the pole starts on top of the cart and the goal is to keep it there. This system is near-linear and, while still unstable, it is certainly more stable than the other pendulums we looked at. This policy has 4 parameters and it solves the environment, earning the maximum reward of 1,000.
Failures
This is a 96-parameter policy that totally failed to learn how to achieve its goal. The environment is called "BipedalWalker" and the goal is for the robot to walk to the end of the level. Instead, our policy just learned to fall forwards. If I gave it more update steps or ran more experiments where I carefully control and anneal the variance of the policy and/or CMA-ES over time, it could maybe solve this environment. Giuseppe Cuccu solved this environment using a tiny, one-layer RNN with 116 parameters. I'm not confident that this environment can be solved with fewer parameters than that, but it might be a fun area to experiment in.
To give you an idea of what a successful BipedalWalker agent looks like, see the video below (I didn't train this one):
The video below here is arguably a more interesting case of policy failure. This is the hopper robot environment, where the goal is to move forward as far as possible within the time limit. This policy has 45 parameters, since the observation space has 15 dimensions and the action space has 3 dimensions. Obviously, the policy fails at moving forward as far as possible, but it still gets about 1,000 reward (the environment is considered solved above 2,500 reward). It earns this reward simply by leaning forward!
Closing thoughts
I think this was a fun experiment, and if you're curious and want to try weird stuff like this, I recommend it. But if you want a performant implementation of CMA-ES and other evolution strategies for RL and for optimization of larger numbers of parameters, you should check out David Ha's ESTool, the PGPElib library, and the CMA-ES library. I'm sure there are other strong libraries out there for this stuff, but those are three off the top of my head.
These policies could definitely be pushed further with more parameters, and I don't think it would be very difficult to add them. Since it's all just dot products, one could write a TransformLayer (or name it whatever you want) whose entire job is to take in the observations and project them to some n-dimensional vector, then use that vector as the input to the policy layer. That would definitely increase the layers' capacity to represent complex relationships and could still be kept small enough to work nicely with Optuna. I didn't want to do that here because I had too much fun keeping these policies as tiny as possible.
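As a rough sketch of that idea, here's what such a layer might look like (the shapes, the random initialization, and the usage at the bottom are placeholder assumptions):
import numpy as np
class TransformLayer:
    """Projects observations to an n-dimensional feature vector via a dot product."""
    def __init__(self, obs_dim, n_features):
        # In practice these weights would be optimized alongside the policy's weights.
        self.weights = np.random.randn(obs_dim, n_features)
    def __call__(self, observations):
        return np.dot(observations, self.weights)
observations = np.random.randn(8)        # stand-in for an environment observation
transform = TransformLayer(obs_dim=8, n_features=16)
policy_weights = np.random.randn(16, 2)  # (n_features, action_dimension)
action = np.dot(transform(observations), policy_weights)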
Another way to potentially improve performance is to fix the variance of the CMA-ES algorithm or to anneal it on a set schedule over training. This might be hard to do in Optuna, so another option is to schedule the variance of the Gaussian layer instead. It would be interesting to scale the variance so that when the algorithm seems to be plateauing, the variance gets turned up to force more exploration in the environment. That might help break free from performance plateaus and maximize reward.
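A minimal sketch of that plateau-triggered idea might look like this (the window sizes, thresholds, and multipliers are pure assumptions I haven't tested):
import numpy as np
def adjust_sigma(sigma, recent_rewards, window=20, min_improvement=1.0, boost=1.5, decay=0.99):
    """Increase sigma when reward has plateaued; otherwise slowly anneal it."""
    if len(recent_rewards) >= 2 * window:
        old = np.mean(recent_rewards[-2 * window:-window])
        new = np.mean(recent_rewards[-window:])
        if new - old < min_improvement:
            return sigma * boost    # plateau detected: widen the search
    return sigma * decay            # otherwise keep shrinking the search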
If I were going to push forward on these experiments, I'd stop using Optuna and instead switch to one of the libraries I mentioned above. The policies can still be kept tiny, but using a library with a full implementation of evolution strategies would help accelerate experiments and would likely allow greater control over the optimizer.
I really hope you enjoyed this post!
Previously published at https://themerge.substack.com/p/weird-rl-with-hyperparameter-optimizers