It’s summer. And in contrary to many people here in Brazil, I am spending time at beaches or beautiful places, for various reasons. On the other hand, I started an interesting coding project: it’s called CREPE, which stands for CRoss-Entropy Parameter Estimation. It is a code written in python designed specifically to be global optimization tool, which is something very handy for scientists. The program is freely available on GitHub. Please be aware that it is far from being a release version, there are still many things that I want to implement.
But what the heck is cross-entropy, global optimization or parameter estimation? – one might ask. Well, in science, many times we create hypotheses when trying to understand a system, a signal or just the mechanisms of how things work. And the most objective way of testing a hypothesis is formulating it mathematically: let’s call it a test function. The test function might be dependent of various variables. For instance, when studying how the temperature of a solid plate behaves when one of its tips is heated, we can observe that the temperature depends on the position of the analyzed point, the energy output of the heat source, the material, how the heat is exchanged, time, and so on. These are all variables, and in the most complex phenomena, we do not know exactly what are the values that these variables assume.
On that same example, we might not know some of these variables, let’s say: the energy output of the source. If we know the temperature of each point on the plate, the other variables and how the heat is changed, we can estimate the energy output of the source (or how it varies with time). There are many tools for that and it can be calculated analytically. However, in many problems we do not know every variable (let’s call them the parameters), the observations might be imbued in noise and uncertainties, our hypothesis is sometimes completely empirical (which means there isn’t a beautifully closed set of equations with well defined variables), and the calculation can be outright too difficult or time-consuming to solve analytically. And that’s when we go to computers and global optimization tools. Some scientists (especially purist physicists) say this is “playing dirty” or “an appeal to ignorance.”
Global optimization is a clever way applied mathematicians created to, well, optimize a function or set of functions in order for it to assume a form that reproduces what we observe (the data). Back to the previous example, a global optimization would take all the information that we know, including our test function, a reasonable guess and optimize the value of the energy output of the heat source, in other words, look for the value that best reproduces our measurements of temperature. Global optimization is so powerful, that it can, in theory, estimate as many variables as you wish. But how well it will estimate and how much time it will take strongly depends on the computation power, the initial guess of the user, how good is the hypothesis, and how good is the data. You know, when you have an equation like y = x + 2, and if y = 4, solve for x? When you solve for x, it’s basically doing an optimization, which is finding the best value of x that best reproduces y.
Now, for the concept of cross-entropy, this is actually very complicated, and to be honest I didn’t quite understand, but you’re welcome to see these Wikipedia pages. The creator of the cross-entropy (CE) method is Reuven Rubinstein, and there is an official webpage for it which contains a very handy tutorial about the method. But notice that this kind of tool isn’t something that we learn (neither are even mentioned) about in classrooms, which is somewhat sad, because global optimization is such a powerful tool, and not that difficult to program it yourself. But with great power come responsibilities.
CE converges to a solution extremely quickly. It’s crazy fast. The problem is that it converges so fast, that it might get stuck into a non-optimal solution (what we call a local minimum). It doesn’t mean that it is a hit and miss program, but the results must be checked very carefully. Also, CE does not guarantee that your test function or the hypotheses are good, neither does it for the data. So, the job of the user of CE is to benchmark the test function(s) and take into account the quality of the data during the analysis. A good way to avoid mishaps is to plot everything, it really help us to make a good initial guess and check the validity of the results.
CREPE is very simple at the moment. So far, it works with optimization of parameters with single-variate Gaussian uncertainties, and there’s still some tweaks I plan to do and many things to add. I’ve been basing the program on this paper.
Inside the codes that can be downloaded on GitHub, you’ll find a folder called examples. The first example I want to show is curve_fit.py. In astronomy and other fields of experimental physics, sometimes we measure extremely faint signals that are imbued in noise. curve_fit.py generates mock data, and you have control of what kind of signal (the function f(x)) it is and the noise sigma (which translates into the amount of noise). In that example I used a sinusoidal mock signal, which describes various phenomena, such as the movement of a spring after it is pressed or pulled, and the parameters that we have to find are a (the spring constant) and b (the phase – or initial position). Here is a plot of a mock data, with noise sigma = 1.0:
Now this is a very noisy data. In fact, by looking at the plot, we can see that the noise levels are approximately half of the amplitude of the signal! This is terrible data, but if that’s the only data available, the only thing we can do is trying to get the best out of it. Suppose we were given this data and we don’t know what are the true values of the parameters a and b. CREPE can estimate them for us, if we feed it with at least two very important items: a guess and a performance function. As the name implies, the guess is a set of intervals that we think a and b should be inside (or close by). The performance function has one job: to evaluate how well the guesses of a and b are able to reproduce the data. There are many ways to evaluate that, in fact, I plan on adding some “standard” performance functions to the program so that the user doesn’t need to bother with it. What CREPE does is to create many samples of a and b (based on the user’s initial guesses), and just trying them on the performance function. Depending how well a certain set of samples do, the program selects the best ones and generate new samples based on them. And then it iterates until the results can’t get any better. Basically, the program “learns” how to reproduce the data, and because of that, methods like these are called machine learning.
After a handful of iterations, CREPE spits out the values of a and b that it “thinks” are the best ones, along with their uncertainties (although I’m still working on how to define these uncertainties). Based on the previous mock data, these are the results I got: a = 2.027 ± 0.007, b = 3.711 ± 0.104.
But how close are they to the true values of a and b? Well, a (the spring constant) get pretty close: a_true = 2.0. But b (the initial phase) doesn’t: b_true = 10.0. Why is that? Well, I think that the quality of the data was very influential. So, let’s try to find a and b using a signal with less noise, shall we? This time, noise sigma = 0.5:
Looks a bit better than the previous signal! And now, using CREPE’s parameter estimation capabilities, we get a = 1.994 ± 0.002 and b = 10.061 ± 0.012, much closer to the true values of a and b.
But scientists are not always just fitting curves and functions to data. Sometimes, there are more complex calculations involving empirical models of how we think the universe plays. The program model_fit.py is a very simple example of how to work with a model in CREPE. One of the biggest accomplishments in astronomy was decomposing the light coming from objects in the sky (a field of study known as spectroscopy), and discovering that they have “lines” in them. These lines are features generated by the emission or absorption of light by chemical elements in very specific wavelengths. In this example, we simulate a spectrum, with 3 emission lines. In reality, the spread and intensity of these lines depend on a number of variables, but in this example, we consider that only the abundance of the elements and the rotation of the object play a role in the observed spectrum.
If we were given the data (again, very noisy) and, from the literature, we have the empirical model of that region of the spectrum for that specific celestial object, we can use CREPE to estimate the elemental abundances and the rotation of the object. The next figure illustrates the results from the model fitting example:
And again, not a perfect fit, but it comes close. Right now, I’m working on improving CREPE in order for it to get better results even if you have noisy data. One of the biggest advantages of CE is that it can attain good results much faster than other methods (there are a number of them in python – pick your poison!), so I think there is some potential for this program. Also, I think it’s important for scientists to make public the tools that they create, so they can be checked and used by other scientists, and python is perfect for that. Many astronomers are stuck with tools created on IDL, a very expensive programming environment (although extremely well-developed), which makes collaboration and sharing much more difficult. Luckily, python has been growing in the scientific community (especially in astronomy), so we are heading to a bountiful future on that front.
Well, this post was longer than I expected, and I barely touched the potential that I envision for this project. If you have any ideas or want to make contributions to it, just drop me a message or a pull request on GitHub!