We hope this analysis will help the interested practitioner in their calisthenics journey, as it sets baselines that they can compare themselves against.
While we intend to help calisthenics practitioners train efficiently and effectively by making some useful numbers available, data analysis comes with its fair share of caveats.
First, the context: the data on which the analysis is built come from a weighted calisthenics competition. Second, for all computed average values, the reader must always keep in mind that there is some standard deviation around them.
The forthcoming numbers should be taken in context and will never be a replacement for an (as) complete (as possible) understanding of the needs of the calisthenics practitioner.
As long as we are not deluded by sheer and raw numbers, we can certainly make good use of them!
I will use the data to show that I am roughly halfway from being average ¯\_(ツ)_/¯.
Season 1 of the King of Weighted Calisthenics competition, an international and internet-based event organized by Micha Schulz, was held in 2022.
Each participant had to get their weighted equipment and record their 1RM (one-repetition maximum: https://en.wikipedia.org/wiki/One-repetition_maximum) attempts for dips, pull-ups and squats according to some specified rules (especially regarding the range of motion).
Similarly to gymnastics, rock climbing and other bodyweight-related sports, calisthenics emphasizes the strength-to-bodyweight ratio, and performance differences between the best athletes tend to fade across weight classes. However, in weighted calisthenics, one has to lift the additional weight that is attached to one’s body. On one hand, it seems like the heavier the athlete, the heavier the 1RM additional weight: absolute strength increases with the athlete’s bodyweight. On the other hand, can we say the same thing about relative strength? How does the added percentage of the athlete’s bodyweight increase with the athlete’s bodyweight?
Let’s find out by analyzing the data!
The first step is to gather and organize the data in a table and check for incoherent values. Because the competition is international, the organizers have to take into account both the imperial and the metric systems. Here, the metric system is used: heights are measured in meters and weights in kilograms.
Let’s get a glimpse of the raw data before making them more useful and meaningful.
Athlete number | height (m) | bodyweight (kg) | dip (kg) | pull-up (kg) | squat (kg) |
---|---|---|---|---|---|
0 | 1.8 | 95.5 | 160 | 105 | 232.5 |
1 | 1.94 | 108.5 | 155 | 100 | 240 |
2 | 1.83 | 91.6 | 150 | 85 | 236.25 |
3 | 1.83 | 90 | 145 | 110 | 212.5 |
4 | 1.74 | 82.4 | 130 | 72.5 | 242.5 |
5 | 1.8 | 89.8 | 130 | 90 | 225 |
6 | 1.75 | 86.7 | 120 | 80 | 242.5 |
\(\cdots\) | \(\cdots\) | \(\cdots\) | \(\cdots\) | \(\cdots\) | \(\cdots\) |
143 | 1.65 | 72.8 | 120 | 70 | 190 |
144 | 1.78 | 73.9 | 65.7 | 52.1 | 165.5 |
145 | 1.86 | 109 | 200 | 120 | 185 |
146 | 1.8 | 67 | 60 | 50 | 90 |
147 | 1.8 | 82.7 | 145 | 95 | 195 |
148 | 1.85 | 70.1 | 50 | 32 | 110 |
149 | 1.8 | 68.9 | 50 | 55 | 120 |
In the table, one can see that some weights have a residual .1 kg or .7 kg. This seems to be the result of a conversion from the imperial system to the metric system. Therefore, one can infer that athlete number 144 is likely an American one. Even in an anonymized database, one must be careful about the private information that can be extracted from it!
After this digression on data confidentiality, let’s dive into the analysis!
What does the average participant look like?
height (m) | bodyweight (kg) | dip (kg) | pull-up (kg) | squat (kg) |
---|---|---|---|---|
1.76 | 77 | 81.5 | 57 | 144 |
An athletic human being indeed!
Please be aware that, while there are very few female athletes, both male and female athletes are present in the dataset. Despite our dataset being built only on the participants (and thus biased toward their specific characteristics), we can remark that the world average male height is 1.77 m, comparable to our computed 1.76 m.
The magnitude of the difference may be explained by the fact that we only considered calisthenics athletes, or by the fact that our dataset included female athletes, but this magnitude falls within the statistical uncertainty of our sample size (of order 0.01 m, i.e. 1 cm). Therefore, we cannot really comment on this difference in any meaningful way.
We hope this small example will help the reader to keep exercising statistical caution while going through the rest of this post.
While averages are interesting, histograms may be more informative.
What does the height distribution look like?
From a statistician’s viewpoint, the dataset is a bit small (150 used profiles) and is bound to be a bit noisy, but one can spot two to three spikes in the histogram, indicating that there may be (at least) two to three larger groups of athletes coming from two to three different parts of the world where the average (male) heights differ.
What does the weight distribution look like?
Looking at the weight distribution of the athletes, one can spot something a bit odd: two major discontinuities at 73 kg and 83 kg. This is the effect of the competition rules! Indeed, there were three weight classes: strictly less than 73 kg (< 73), between 73 kg and 83 kg (73 - 83), and strictly more than 83 kg (> 83). This rule really shows in the weight distribution, in particular close to the 73 kg threshold.
It seems that many athletes tried really hard to get below the 73 kg threshold. Therefore, perhaps contrary to what some athletes initially thought, one may imagine that the competition must have been fierce in the less-than-73 kg category. Analyzing the data, one can see that athletes between 72 kg and 73 kg are on average as strong (in absolute strength) as the average lifter in the 73-83 category: an anomaly with respect to the general trend that absolute strength increases with bodyweight.
However, seeing how the best athletes in the 73-83 category were able to lift a heavier total than the best < 73 athlete, cutting below the threshold certainly was a good decision for athletes close to 73 kg.
Other strategic statements might be analyzed from the data, but this is a topic for another post.
To wrap up this profiling section, let’s look at the correlation between the height and weight of athletes.
While there are some outliers, the expected positive correlation between height and weight clearly appears.
For concision, one can summarize all the previous information in one plot.
The regression line shows that, on average, the athlete’s bodyweight increases by 1 kg per cm of height. Of course, this is the result of a regression and the graph clearly shows the diversity of profiles, even at the same height, as well as a significant deviation around the mean.
On average, the taller the calisthenics athlete, the heavier. This was expected but it is still nice to confirm it with data and back up our knowledge.
The regression line allows us to compute, given the height of a calisthenics practitioner, their expected competitive bodyweight. This can constitute a first approximation of a fitness objective, but the practitioner should keep in mind the deviation around this expectation and that the dataset is built from athletes participating in a competition.
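For the curious reader, such a regression line can be computed with numpy in a couple of lines. This is only a sketch: the file name and column names below are assumptions about how the dataset might be stored, not the actual competition file.

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names standing in for the competition dataset.
df = pd.read_csv("kow_season1.csv")
slope, intercept = np.polyfit(df["height"], df["bodyweight"], deg=1)

# Expected competitive bodyweight of a 1.8 m practitioner, according to the fit.
print(slope * 1.8 + intercept)
```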
For me, whose height is 1.8 m, a reasonable weight computed from those data could therefore be around 81 kg. I am currently weighing 72 kg, therefore below this landmark. However, I find this computed value a little bit high, especially compared to my other fitness objectives. At the end of this post, I will analyze some features of the athletes having a height of 1.8 m. In particular, we will see that the 1.8 m group has an average weight of 78 kg, which is, as far as I am concerned, a better target value. It seems to me that the computed regression line tends to overestimate the predicted weight. This may be due to outliers like this one participant whose weight is 112 kg and whose height is 1.77 m.
We met the athletes, let’s see their performances!
As a first step, we compute the average performances of the athletes for each weight class.
weight class (kg) | nbr. of athletes | dip (kg) | pull-up (kg) | squat (kg) |
---|---|---|---|---|
< 73 | 70 | 72 | 52.7 | 124.4 |
73-83 | 45 | 86.8 | 59.5 | 150.7 |
> 83 | 33 | 94.4 | 61 | 176.1 |
Remark: The total number of athletes, 70+45+33 = 148, is less than the 150 registered because we discarded two incoherent profiles. One of the discarded athletes was registered as 1 meter tall and the other as 5 meters tall.
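A table like the one above can be obtained with a pandas groupby. As before, this is only a sketch: the file and column names are assumptions, not the actual dataset layout.

```python
import pandas as pd

# Hypothetical file and column names standing in for the competition dataset.
df = pd.read_csv("kow_season1.csv")

# Reproduce the three official weight classes and average the lifts within each class.
weight_class = pd.cut(df["bodyweight"], bins=[0, 73, 83, float("inf")],
                      labels=["< 73", "73-83", "> 83"])
summary = df.groupby(weight_class, observed=True)[["dip", "pull-up", "squat"]].agg(["count", "mean"])
print(summary.round(1))
```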
It can be remarked that there are significantly more athletes in the < 73 kg category. For the next competition, I would kindly suggest trying to enforce more balanced weight classes. I do think that this would make the competition a bit fairer. I propose two solutions.
Another weight class could be created to make things a bit more even between the athletes of the different categories. For instance, if the < 73 kg category were split into a < 70 kg and a 70-73 kg category, the first new one would have 38 participants and the second new one 32.
Of course, I understand that making three categories may be more appealing and that too small categories can create a lack of competition. Based on the current data, it would have been possible (a posteriori) to create the following three balanced weight classes:
These two propositions could be merged in an effort to create four as-balanced-as-possible categories. It is interesting to note that other information can be used to set a criterion that defines a fair weight division. For instance, one could want the standard deviation of the weights within each weight class to be roughly the same across classes.
From an organizer’s standpoint, it is worth investigating ways to create a fair weight division while taking into account that a participant may adapt their weight to the announced weight division. An unrealistic mathematician’s solution would be to announce the weight division after the participants have registered.
Back to the analysis of the original weight division, one can visualize the previous table in the form of a radar chart.
The absolute strength refers to the total amount of weight an athlete can lift, regardless of the athlete’s bodyweight.
Here we investigate, for each exercise, the relation between the added weight and the athlete’s bodyweight: does the absolute strength increase with the athlete’s bodyweight?
Unsurprisingly, we find out that it indeed does, thus justifying the fairness of implementing weight classes in the competition’s rules.
The regression line (in red on the graph) shows that an athlete can roughly add 1.19 kg of additional weight to their 1RM per additional kilogram of bodyweight. Of course, this is the result of a regression and the graph clearly shows the diversity of profiles, even at the same weight, as well as a significant deviation around the mean.
This graph and regression line also show that, on average, the heavier the athlete’s bodyweight, the heavier the athlete’s lift. This was expected but it is still nice to confirm it with data and back up our knowledge.
The regression line allows us to compute, given the bodyweight of a calisthenics practitioner, their expected 1RM in the dip exercise. This can constitute a first approximation of a fitness objective, but the practitioner should keep in mind the deviation around this expectation and that the dataset is built from athletes participating in a competition.
For me, whose weight is 72 kg, a reasonable 1RM computed from those data could therefore be around 75 kg. It has been a few months since I tried weighted dips, but last time I tried, I sent 40 kg.
So yeah, halfway from being average ¯\_(ツ)_/¯.
The regression line (in red on the graph) shows that an athlete can roughly add 2.26 kg of additional weight to their 1RM per additional kilogram of bodyweight. Of course, this is the result of a regression and the graph clearly shows the diversity of profiles, even at the same weight, as well as a significant deviation around the mean.
As above, this graph and regression line also show that, on average, the heavier the athlete’s bodyweight, the heavier the athlete’s lift. Those numbers show that the athletes are indeed stronger, even in terms of rate of progression, in the squat than in the dip. This was also expected but it is still nice to confirm it with data and back up our knowledge.
The regression line allows us to compute, given the bodyweight of a calisthenics practitioner, their expected 1RM in the squat exercise. This can constitute a first approximation of a fitness objective, but the practitioner should keep in mind the deviation around this expectation and that the dataset is built from athletes participating in a competition.
For me, whose weight is 72 kg, a reasonable 1RM computed from those data could therefore be around 132 kg. Although I did a few squats with my girlfriend on my back, I have never tried weighted squats, so I cannot compare myself to this landmark.
To the best of my knowledge, I am again roughly halfway from being average ¯\_(ツ)_/¯.
The regression line (in red on the graph) shows that an athlete can roughly add 0.54 kg of additional weight to their 1RM per additional kilogram of bodyweight. Of course, this is the result of a regression and the graph clearly shows the diversity of profiles, even at the same weight, as well as a significant deviation around the mean.
As above, this graph and regression line also show that, on average, the heavier the athlete’s bodyweight, the heavier the athlete’s lift. However, one can clearly read that the pull-up is a much more difficult exercise! This was also expected but it is still nice to confirm it with data and back up our knowledge.
The regression line allows us to compute, given the bodyweight of a calisthenics practitioner, their expected 1RM in the pull-up exercise. This can constitute a first approximation of a fitness objective, but the practitioner should keep in mind the deviation around this expectation and that the dataset is built from athletes participating in a competition.
For me, whose weight is 72 kg, a reasonable 1RM computed from those data could therefore be around 54 kg. I do not currently train the weighted pull-up but this is something that I should definitely be doing. A few months ago, I borrowed a weight belt (the same day that I performed that weighted dip) and I barely sent a pull-up with an additional 20 kg. Still, I was happy, as pulling strength is surely my weakest.
Once again, I am roughly halfway from being average ¯\_(ツ)_/¯.
The relative strength refers to the additional proportion of bodyweight an athlete can lift. It relates the amount of strength an athlete has to his or her bodyweight. As a last topic of investigation, we will try to answer this interesting question:
How does the relative strength depend on the athlete’s bodyweight?
In nature, insects have insane relative strength, children are quite strong, but how strong is the calisthenics adult?
I will link the results of this section with the level system that is introduced in the Overcoming Gravity book written by Steven Low. While the book is not open source, the author made the progression chart available online.
On average, for the dip exercise, a participant can add 105 % of their bodyweight, an impressive number. That is to say, a good calisthenics athlete is roughly at level 9 of the progression chart of the Overcoming Gravity book.
From the graph, it seems like, for this pushing exercise, the heavier the athlete, the slightly better the relative strength. The regression line indicates a gain of 0.2 % bodyweight strength per kg of bodyweight: for each additional kg of bodyweight, an average (good) athlete can lift 0.2 % more of their own bodyweight.
On average, for the squat exercise, a participant can add 186 % of their bodyweight, an impressive number. That is to say, a good calisthenics athlete is roughly at level 9-10 of the progression chart of the Overcoming Gravity book (generalizing from the weighted pistol squat).
From the graph, it seems like, for this exercise too, the heavier the athlete, the slightly better the relative strength. The regression line indicates a gain of 0.6 % bodyweight strength per kg of bodyweight: for each additional kg of bodyweight, an average (good) athlete can lift 0.6 % more of their own bodyweight.
On average, for the pull-up exercise, a participant can add 74 % of their bodyweight, an impressive number. That is to say, a good calisthenics athlete is slightly below level 8 of the progression chart of the Overcoming Gravity book.
This is a good level below the two previous exercises. It is interesting because it means one of two things: either the scale of the Overcoming Gravity book is slightly off, or the athletes are slightly lacking in relative pulling strength. Both hypotheses are plausible because Overcoming Gravity readers and the weighted calisthenics community focus on slightly different goals.
From the graph, it seems like, for this pulling exercise, the lighter the athlete, the slightly better the relative strength. The regression line indicates a loss of 0.2 % bodyweight strength per kg of bodyweight: for each additional kg of bodyweight, an average (good) athlete can lift 0.2 % less of their own bodyweight.
This wraps up our statistical analysis of the data from the King of Weighted Calisthenics competition.
From what we computed above, one can see that from a weighted calisthenics standpoint, I am roughly half-average. However, considering that I truly started my calisthenics journey one year ago, I am happy that I progressed this far and I am thankful that I am still progressing. Thanks to this competition and this analysis, I have clear goals that I can aim at during the next few months, maybe years.
Side note: Regression lines are useful to compute predictions and extrapolate from the data. However, it is sometimes more interesting to use a subset of the data to compute more personalized quantities. The drawback is that, since less data is used, the uncertainty of the computed quantities is theoretically larger. In practice, it can give more precise predictions because we are generalizing from points that are more similar.
I am 1.8 m tall. From the collected data, the subset of athletes with this height has the following average numbers (OG: Overcoming Gravity):
 | bodyweight | dip | pull-up | squat |
---|---|---|---|---|
Absolute weights (kg) | 78 | 89 | 66 | 144 |
Relative weights (%) | 100 | 114 | 85 | 185 |
OG level | n/a | level 10 | level 9 | level 10 |
Those are quite challenging goals for me to have, but what’s the point of a goal if it is not a bit challenging!
It also shows that the Overcoming Gravity scale is quite a good and interesting one. Indeed, level 9-10 is the transition from intermediate to advanced athlete according to the book. It is reasonable to think that the average participant of the King of Weighted Calisthenics competition is on the verge of being an advanced athlete and that the top percent can be considered close to (if not) elite athletes.
It is interesting to compare the above values to the previously computed ones (from the regression). We report the previously computed values below for ease of comparison.
 | bodyweight | dip | pull-up | squat |
---|---|---|---|---|
Absolute weights (kg) | 81 | 75 | 54 | 132 |
Relative weights (%) | 100 | 93 | 67 | 163 |
OG level | n/a | level 8-9 | level 7 | level 9 |
That is to say, considering only the relevant subset of participants, my new target weight is lighter but my target lifts are heavier…
Wonderful!
What I prefer about the numbers computed from the subset of athletes that are 1.8 meter tall is that they correspond to a greater relative strength.
With that, we will close this post. Thank you very much for reading up to this point! If you have any remark or question, feel free to reach me, my door is always open!
To put it in a nutshell: How to handle randomness with computers?
This tutorial has three main parts.
The github repository is
Feel free to use it to raise an issue about this post or to open a discussion about any related topic.
We start by importing a few necessary libraries, modules and functions.
The main purpose of this part is to raise awareness of the importance of reproducibility as well as teach a (random) bit about algorithmic random number generation.
Please ponder over the following question before delving into the practical part: How would you generate a random sequence of numbers?
Can you even think of a high-level definition of what it means for a sequence of numbers to be random, or random enough?
In machine learning, we deal with random numbers and we need an efficient way to generate them according to some probability distribution. We say that we need a way to simulate the law of a random variable. On the other hand, there is a huge need for reproducibility of experiments. Reproducibility is a major principle of the scientific method, and machine learning (or statistical learning) is no exception. In computing (apart from the very specific field of quantum computing), there is no reason we should not be able to reproduce an experiment exactly. Therefore, we want to generate random numbers as well as control how those numbers are generated, so that we can reproduce computer experiments in a pseudo-stochastic environment.
At first glance, reproducibility and randomness may seem like opposite goals.
An algorithm that generates random numbers is called a random number generator and the initial input to this algorithm that allows reproducibility is called the seed.
Using numpy, generate two random numbers between 0 and 1 by calling the function np.random.rand. What is different compared to other function calls (e.g. np.sqrt(2))?
Calling np.random.rand(1) twice.
np.random.rand(1) = [0.49799233]
np.random.rand(1) = [0.75253602]
Calling np.sqrt(2) twice.
np.sqrt(2) = 1.4142135623730951
np.sqrt(2) = 1.4142135623730951
What may be surprising (or not, if you are used to it) is that two calls of the same function lead to two different answers (seemingly with the same argument).
At a high level, if we assume that the process that generates the outputs of successive calls to np.random.rand is deterministic, then the so-called initial condition should be enough to completely describe the generated sequence. If the process is complex enough, it may appear random while being completely deterministic and fully described by the generation process and the initial condition.
For random number generation, this initial condition is called the seed.
The previous question highlights the fact that there is a need to set the seed so that we can start over a random experiment. We could use np.random.seed to specify a global seed but this is not the recommended way.
The preferred best practice for getting reproducible pseudorandom numbers is to instantiate a generator object with a seed and pass it around. The implicit global RandomState behind the numpy.random.* convenience functions can cause problems, especially when threads or other forms of concurrency are involved. Global state is always problematic. We categorically recommend avoiding using the convenience functions when reproducibility is involved.
—Robert Kern, NEP 19
Use np.random.default_rng to instantiate a random number generator, generate some random numbers and prove to yourself that you know how to start over the generation by generating the same (first) random numbers.
Calling np.random.rand(1) twice.
np.random.rand(1) = [0.73463129]
np.random.rand(1) = [0.50966735]
Calling rng.random(1) twice.
rng.random(1) = [0.55043942]
rng.random(1) = [0.18440082]
We start over by feeding the same seed to np.random.default_rng.
Calling np.random.rand(1) twice.
np.random.rand(1) = [0.86537091]
np.random.rand(1) = [0.96680284]
Calling rng.random(1) twice.
rng.random(1) = [0.55043942]
rng.random(1) = [0.18440082]
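One possible way to do it, as a sketch (the seed value 42 is arbitrary):

```python
import numpy as np

seed = 42  # arbitrary seed value
rng = np.random.default_rng(seed)
first_draw = rng.random(2)

# Re-instantiating the generator with the same seed restarts the same stream.
rng = np.random.default_rng(seed)
second_draw = rng.random(2)

assert np.array_equal(first_draw, second_draw)
```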
In the next question, you will implement a Linear Congruential random number generator. Such a rng (random number generator) works as follows. Given an initial integer \(x_0 \in (\underline{M}, \overline{M})\) and an integer function
\[f : x \in (\underline{M}, \overline{M}) \mapsto f(x) \in (\underline{M}, \overline{M}) ,\]
we generate a sequence of integers by repeatedly applying the function \(f\):
\[x_{n+1} = f(x_n) .\]
Given \(x_n \in \mathbb{N}\), we can map this integer to a rational \(u_n \in \mathbb{Q} \cap ]0,1[\) by computing
\[u_n = \frac{x_n - \underline{M}}{\overline{M}- \underline{M}} .\]
This procedure allows us to generate a sequence \((u_n)_n\) of numbers in \(]0,1[\). If the function \(f\) is well chosen, then the sequence \((u_n)_n\) may look like a sample from a uniform distribution in \(]0,1[\) (this has a precise mathematical meaning in the field that studies this topic).
For mathematical reasons, we avoid the value \(0\) and we therefore skip it in any algorithm that implements a rng in the aforementioned fashion. That is to say, if \(x_n = 0\), then set \(x_n\) to \(f(0)\). Therefore, \(f(0)\) should not be equal to \(0\).
Remark: skipping \(0\) may be a property of the rng or a property of the function that calls the rng. In the following, we make it a property of the rng.
The Linear Congruential random number generator (LC-rng) is a function \(f\) of the form \(f(x) = mx + c \hspace{0.3cm} [q]\) (where \([q]\) denotes arithmetic modulo \(q\)), that is to say,
\[x_{n+1} = m x_n + c \hspace{0.3cm} [q] .\]
Implement a class, LC_RNG, that takes a seed (\(x_0\)) as an input and is able to generate a random sequence of numbers in \(]0, 1[\) by calling a random method (it should mimic the behavior of numpy’s).
Use the values \(m = 1103515245\), \(c = 12345\) and \(q = 2^{31}\).
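One possible implementation, sketched below (the class actually written for this tutorial is not reproduced here, so details such as the helper method name are assumptions):

```python
class LC_RNG:
    """Linear congruential RNG: x_{n+1} = (m * x_n + c) mod q."""

    def __init__(self, seed, m=1103515245, c=12345, q=2**31):
        self.m, self.c, self.q = m, c, q
        self.state = seed % q

    def _next_int(self):
        self.state = (self.m * self.state + self.c) % self.q
        if self.state == 0:                 # skip 0 so that u_n stays in ]0, 1[
            self.state = self.c % self.q    # f(0) = c
        return self.state

    def random(self, n=1):
        """Return a list of n floats in ]0, 1[, mimicking rng.random(n)."""
        return [self._next_int() / self.q for _ in range(n)]
```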
Remember the seed,
We test our random number generator,
Calling custom_rng.random(1) twice.
custom_rng.random(1) = [0.26866815984249115]
custom_rng.random(1) = [0.2322915312834084]
Resetting the seed will restart the same generating process.
Calling custom_rng.random(1) twice.
custom_rng.random(1) = [0.26866815984249115]
custom_rng.random(1) = [0.2322915312834084]
This kind of algorithm allows us to generate a random sequence of floats in the bounded interval \(]0, 1[\) that is approximately uniformly distributed on this support. From now on, we shall assume in all simulations that the samples from np.random.default_rng are independently and identically distributed according to a uniform distribution in \(]0,1[\).
What is the name of the generator that is used by the default call to np.random.default_rng?
What kind of generator is used by default? We can find out by reading the docstring (?np.random.default_rng or help(np.random.default_rng)) or simply by using print on the created object. As we can see, it is PCG64.
Generator(PCG64)
Using np.random.default_rng, generate a sequence of 10 000 samples in \(]0,1[\) and plot a histogram. Do the same with your custom rng, LC_RNG.
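A possible solution sketch, assuming the LC_RNG class sketched above and matplotlib:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)                      # arbitrary seed
numpy_samples = rng.random(10_000)
custom_samples = LC_RNG(seed=123).random(10_000)     # custom generator sketched above

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
axes[0].hist(numpy_samples, bins=50, density=True)
axes[0].set_title("numpy random number generator")
axes[1].hist(custom_samples, bins=50, density=True)
axes[1].set_title("custom random number generator")
plt.show()
```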
To compute a histogram, one must arbitrarily partition the segment \(]0,1[\) into sub-intervals whose union is the whole segment. In this case, partitioning can be seen as a discretization of the probability density function (often abbreviated pdf) with respect to the Lebesgue measure. In computer science, a sub-segment of this sub-division is called a bin.
A histogram is computed by counting the number of samples belonging to each bin. Those counts can be scaled by a common factor afterward. If the scaling factor is such that the sum of the values associated to the bins is equal to \(1\), we talk about normalization. In this case, the histogram can be seen as a probability density function with respect to a discrete measure. This last remark is a bit advanced and you can pursue your reading even without understanding it, as long as you have an intuition of what a histogram represents.
Theorem (Inverse Transform Sampling) Let \(U\) be a uniform random variable on \(]0,1[\) and \(X\) a real random variable. Assume that the cumulative distribution function \(F_X\) of \(X\), \(F_X (u) = \mathbb{P} (X \leq u)\), is continuous. Then the random variable \(Y\),
\[Y = F_X^{-1}(U)\]
has distribution \(F_X\). That is to say, \(X\) and \(Y\) have the same law.
An algorithmic consequence of this theorem is that to generate (or simulate) i.i.d. samples having the law of a random variable \(X\), one can first use a generator of uniform samples in \(]0,1[\) and then use the inverse distribution function \(F_X^{-1}\) to compute i.i.d. samples from \(X\).
Of course, such a method assumes that we have a way to compute (or estimate with good enough precision) \(F_X^{-1}\).
Algorithm for generating a sample from a random variable \(X\) using \(F_X^{-1}\):
1. Generate \(u\) uniformly at random on \(]0,1[\).
2. Output \(F_X^{-1}(u)\).
The function \(F_X^{-1}\) of a unitary centered Normal distribution (a.k.a. Gaussian), \(X \sim \mathcal{N}(0,1)\), is implemented in the next cell.
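That cell is not reproduced here; a standard way to get this inverse CDF, as a sketch, is via scipy:

```python
from scipy.stats import norm

def gaussian_inverse_cdf(u):
    """Inverse CDF (quantile function) of the standard normal distribution N(0, 1)."""
    return norm.ppf(u)
```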
Using np.random.default_rng, generate a sequence of 10 000 samples according to a unitary centered gaussian distribution and plot a histogram. Modify the LC_RNG class to do the same with your custom rng.
(Histograms: numpy random number generator / custom random number generator.)
We add the _normal and normal methods to our LC_RNG class.
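A sketch of what these methods could look like, reusing the inverse transform sampling idea (the tutorial’s actual cell may differ; the scipy import is an assumption):

```python
from scipy.stats import norm

# Methods to attach to the LC_RNG class sketched earlier.
def _normal(self):
    """One N(0, 1) sample obtained by inverse transform sampling."""
    u = self.random(1)[0]          # uniform sample in ]0, 1[
    return norm.ppf(u)             # inverse CDF of the standard normal

def normal(self, n=1):
    """Return a list of n samples from N(0, 1)."""
    return [self._normal() for _ in range(n)]

LC_RNG._normal = _normal
LC_RNG.normal = normal
```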
Great! Everything is visually pleasing! It seems to work! Yeah sure, but…
What could have gone wrong? Implement a Dummy_LC_RNG class with \(c=1\) and \(m=128\). Re-do the previous experiments, then try changing m from 128 to 127 and see what happens.
First with \(m = 128\).
Then with \(m = 127\).
As can be seen, one must be careful when crafting a pseudo random number generator. Furthermore, one should also make sure that \(x_{n+1}\) and \(x_n\) are as independent as possible. Although it may seem counter-intuitive to ask this of a completely deterministic procedure, there are ways to measure the statistical correlation of samples.
An analogy is the one of shuffling a deck of cards. While shuffling is a completely deterministic procedure we have a feeling that after shuffling, if done correctly, we get a random permutation of the cards that is roughly independent from the original permutation of the deck before shuffling.
In this part, we show how to compute the expected value of a sequence of numbers, or of a random variable that we can simulate. In both cases, the practitioner has access to a stream of numbers generated by a possibly unknown and possibly random process.
Expected value is a mathematical way of answering the question: What’s happening on average?
For instance, what is
It is often a first step into the modeling or description of a system, albeit a coarse description. Later in this tutorial, we will see how to compute the standard deviation, a measure of dispersion around the average value.
Compared to the median, another mathematical notion of average, the expected value is easier to compute in a sequential fashion.
There are two main ways to compute the mean of a sequence of numbers: an offline way and an online way. Given a sequence \((x_i)_{i \in \{1, \cdots , n \}}\) of real numbers, we can compute the mean \(\bar x_n\) by first computing the sum of all the elements, \(\sum_i x_i\), and then dividing by the number of elements. This is the offline method. If we were given a new number \(x_{n+1}\), we would have to re-do all the additions and the division. Mathematically, the formula is:
\[\bar x_n = \frac{1}{n} \sum_{i=1}^n x_i .\]The online way computes a sequence of means, \((\bar x_k)_{k \in \{1, \cdots , n \} }\), where \(\bar x_k\) is the mean of the subsequence (or slice) \((x_i)_{i \in \{1, \cdots , k \}}\). The last element of the sequence of means is the mean of the original sequence. Mathematically, we initiate the sequence with \(\bar x_1 = x_1\) and the update rule is:
\[\bar x_{k+1} = \frac{1}{k+1}(k \bar x_k + x_{k+1}) .\]
The advantage of the online method is that we don’t need to store the sequence of all samples and that we can immediately update the mean once given a new sample.
Check that the online method is indeed correct.
A proof can be made by induction, with the initial condition \(\bar x_1 = x_1\) and the following induction step:
\[\begin{aligned} \bar x_{k+1} & = \frac{1}{k+1}(k \bar x_k + x_{k+1}) \\ &= \frac{1}{k+1}(k \frac{1}{k} \sum_{i=1}^k x_i + x_{k+1}) \\ &= \frac{1}{k+1} \sum_{i=1}^{k+1} x_i \end{aligned}\]
Write as many functions as you can to compute the mean of a python list and check on a random sequence if they give the same result. What can you say?
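Two of the many possibilities, for illustration (the names mean_offline and mean_online are mine; the functions mean_1 to mean_5 whose outputs appear below are not reproduced here):

```python
def mean_offline(samples):
    """Sum everything first, then divide once."""
    return sum(samples) / len(samples)

def mean_online(samples):
    """Update a running mean one sample at a time."""
    mean = 0.0
    for k, x in enumerate(samples, start=1):
        mean += (x - mean) / k
    return mean
```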
mean_1(samples) = 0.5103409440103216
mean_2(samples) = 0.5103409440103205
mean_3(samples) = 0.5103409440103216 (same result than np.mean)
mean_4(samples) = 0.5103409440103216
mean_5(samples) = 0.5103409440103216
np.mean(samples) = 0.5103409440103216 (float64)
np.mean(samples) = 0.5103409290313721 (float32)
np.mean(samples) = 0.51025390625 (float16)
Depending on the implementation and the arithmetic precision used, we get different results. Those differences are mainly due to how arithmetic precision is handled and the order in which we perform the operations. Even if some operations are associative in a mathematical sense (on paper), they may not be in floating point arithmetic, due to the very finite nature of numerical computation. In floating point arithmetic, \((a + b) + c \neq a + (b + c)\); see accuracy problems.
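A one-liner illustrates this non-associativity:

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))   # False
print((a + b) + c, a + (b + c))     # 0.6000000000000001 vs 0.6
```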
Below, we emphasize this fact by testing different ways of summing elements of a list.
sum_1(samples) = 510.34094401032155
sum_3(samples) = 510.34094401032155 (same result than np.mean)
sum_4(samples) = 510.34094401032155
sum_5(samples) = 510.3409440103216
np.sum(samples) = 510.34094401032155 (float64)
np.sum(samples) = 510.3409423828125 (float32)
np.sum(samples) = 510.25 (float16)
Let \((X_i)_{i \in \{1, \cdots , n\}}\) be \(n\) i.i.d. random variables such that \(X_i \sim X\) with \(\mathbb{E}(X) = \mu\).
The Law of large numbers states that the empirical mean of those \(n\) i.i.d. random variables,
\[\frac{X_1 + X_2 + \cdots + X_{n-1} + X_n}{n} ,\]converges (in some sense) to \(\mu\) as \(n\) tends to infinity.
Write an experiment that collects 10 000 samples from a unitary centered gaussian distribution, and computes the sequence of updated means. Plot the histogram of samples (as above) and the function \(k \mapsto \bar x_k\).
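A possible sketch of this experiment (the original notebook’s plotting code is not shown):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)
running_means = np.cumsum(samples) / np.arange(1, samples.size + 1)

fig, (ax_hist, ax_mean) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(samples, bins=50, density=True)
ax_hist.set_title("Histogram of the samples")
ax_mean.plot(running_means)
ax_mean.axhline(0.0, color="black", linestyle="--")
ax_mean.set_title(r"$k \mapsto \bar{x}_k$")
plt.show()
```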
It is quite interesting to see the random nature of the empirical mean, yet to know that in the long run it will be close to the true average. In the next question, this random behavior is highlighted by plotting several trajectories. At the same time, the convergence of those trajectories to the same value emphasizes some regularity in this random behavior. Some of this regularity is clarified by the central limit theorem.
Repeat 10 times the experiments of the previous question and plot the 10 trajectories \(k \mapsto \bar x_k\). Add the curves \(k \mapsto \pm \frac{1}{\sqrt{k}}\) to your plot. Comment if you remember the central limit theorem, otherwise, come back with the next question in hindsight.
By plotting more trajectories and removing the first few time steps, one can better visualize the asymptotic result stated by the central limit theorem: the distribution of the curves’ end points might look like that of a gaussian random variable.
Let \((X_i)_{i \in \{1, \cdots , n\}}\) be \(n\) i.i.d. random variables such that \(X_i \sim X\) with \(\mathbb{E}(X) = \mu\) and \(\mathbb{V}(X) = \sigma^2\). The Central Limit Theorem states that the empirical mean of those \(n\) i.i.d. random variables,
\[\frac{X_1 + X_2 + \cdots + X_{n-1} + X_n}{n} ,\]has the law of a Gaussian random variable, \(\mathcal{N} (\mu, \frac{\sigma}{\sqrt{n}})\) in the limit of large \(n\).
Let \(X\) be a random variable having the uniform law on \(]0,1[\). Craft an experiment to visualize the central limit theorem when samples are drawn from \(X\).
We sample 300 000 lists of 1000 samples generated according to a uniform law in \(]0,1[\). For each list, we compute the mean and create a list of the 300 000 means. Then we plot a histogram of this list. If 1000 is large enough for the central limit theorem to roughly apply, then the histogram of the list of means resembles that of a gaussian distribution with mean \(\mu = 1/2\) (the mean of the uniform law) and standard deviation \(\sigma / \sqrt{1000} = 1/(\sqrt{12} \times \sqrt{1000}) \simeq 0.009\).
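A sketch of such an experiment, generating the lists in chunks to keep memory usage reasonable:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)                      # arbitrary seed
n_lists, n_samples, chunk = 300_000, 1_000, 10_000

means = np.empty(n_lists)
for start in range(0, n_lists, chunk):
    means[start:start + chunk] = rng.random((chunk, n_samples)).mean(axis=1)

plt.hist(means, bins=100, density=True)
plt.title("Histogram of the empirical means")
plt.show()

print(means.mean(), means.std())   # close to 0.5 and to 1 / (sqrt(12) * sqrt(1000))
```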
Given a list of samples drawn from a random variable \(X\), what relevant quantities could be computed? In the previous subsection, we emphasized the empirical mean and illustrated how it is related to the true mean \(\mathbb{E}(X)\). Given the empirical mean, it is natural to measure the spread of the samples around that value. In other words, we want to compute an empirical version of the standard deviation.
There are two main ways of computing the empirical standard deviation. One that is called the uncorrected standard deviation,
\[\hat \sigma_u = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \hat \mu)^2}\]and the other that is called the corrected standard deviation
\[\hat \sigma_c = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \hat \mu)^2}\]
Remark: np.std computes the uncorrected standard deviation (by default, i.e. with ddof=0).
Intuitively, it corresponds to two viewpoints on how to handle small sample size. When \(n=1\), \(\hat \sigma_u = 0\) while \(\hat \sigma_c = \frac{0}{0}\) i.e., is not defined. Indeed, when \(n=1\), one could think one of the two things
On the other hand, \(\hat \sigma_u \rightarrow \sigma\) and \(\hat \sigma_c \rightarrow \sigma\), meaning that both estimators have the same asymptotic behavior (large sample size). In particular,
\[\frac{\hat \sigma_c}{\hat \sigma_u} =\sqrt{\frac{n}{n-1}} \rightarrow 1\]
Mathematically, \(\hat \sigma_u\) is a biased estimator (\(\mathbb{E}(\hat \sigma_u) \neq \sigma\)) while \(\hat \sigma_c\) is built from the unbiased variance estimator (\(\mathbb{E}(\hat \sigma_c^2) = \sigma^2\)). Bias is not necessarily a problem because a small bias can sometimes be traded against a reduction in the variance of the estimator (as is the case here).
Implement functions computing the two empirical standard deviations. Check that both numerically converge to \(\sigma\) as the number of samples increases.
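One possible implementation; the functions that produced the two outputs below are not shown, but they could look like this (the check draws from a standard normal, so \(\sigma = 1\)):

```python
import numpy as np

def std_uncorrected(samples):
    """Uncorrected empirical standard deviation (1/n normalization)."""
    samples = np.asarray(samples)
    return np.sqrt(np.mean((samples - samples.mean()) ** 2))

def std_corrected(samples):
    """Corrected empirical standard deviation (1/(n-1) normalization)."""
    samples = np.asarray(samples)
    return np.sqrt(np.sum((samples - samples.mean()) ** 2) / (samples.size - 1))

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
print(std_uncorrected(x), std_corrected(x))   # both close to 1
```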
1.0006729694819605
1.000673970156431
Another natural way to compute the spread of a list of samples is via a function. This function associates a number \(F(x)\) to each interval \([\mu - x, \mu + x]\) with \(x>0\); \(F(x)\) is the fraction of samples that lies within it. In particular, it is interesting to know the fraction of samples lying in \([\mu - \sigma, \mu + \sigma]\).
For the readers aware of the notion, this is highly related to the concept of cumulative distribution function.
What percentage of samples fall in the interval \([\mu - \sigma, \mu+\sigma]\)? and \([\hat\mu - \hat\sigma, \hat\mu+\hat\sigma]\)? Check the 68-95–99.7 rule.
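One way to check the rule numerically, as a sketch (samples drawn from a standard normal, so \(\mu = 0\) and \(\sigma = 1\)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500_000)
mu_hat, sigma_hat = x.mean(), x.std()

for k in (1, 2, 3):
    inside = np.mean(np.abs(x - mu_hat) <= k * sigma_hat)
    print(f"within {k} sigma: {100 * inside:.2f} %")   # roughly 68.3 %, 95.4 %, 99.7 %
```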
68.22240000000001
95.41460000000001
99.72800000000001
Implement a function that computes the \(\alpha\) quantile of a list of samples, that is to say, an empirical value \(\hat x_\alpha\) that corresponds to the true value \(x_\alpha\), defined as
\[x_\alpha = \sup_x \{ x \: | \: \mathbb{P}(X \leq x) \leq \alpha \} .\]Using this function, how can you compute the median? How is the median different from the mean?
To compute the empirical median, one can use the quantile function to compute \(\hat x_{1/2}\).
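A simple sketch based on sorting (interpolation conventions between order statistics are glossed over):

```python
def quantile(samples, alpha):
    """Empirical alpha-quantile of a list of samples (lower order statistic)."""
    ordered = sorted(samples)
    index = int(alpha * (len(ordered) - 1))
    return ordered[index]

def median(samples):
    return quantile(samples, 0.5)
```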
In the following, we consider sequences of random variables that are not necessarily i.i.d. That is, we will introduce a dependency model in the sequence of random variables. In the i.i.d. case, the index \(n \in \mathbb{N}\) of an element \(X_n\) of the sequence of random variables is somewhat arbitrary. It is mainly a way of pointing to an object in a collection. In the following, we introduce a type of dependency that is called Markovian and makes use of the order \(\leq\) of the natural numbers \(\mathbb{N}\). The elements of the sequence are seen as if they were generated in order and the index n is often thought of as time. However, keep in mind that the Markov property can be extended beyond sequences indexed by the natural numbers or an ordered set. For instance, Markov random fields are indexed by elements of a topological space and neighborhoods give a dependency relation between the elements of the collection.
Back to where we were, we want to study sequences represented as
\[X_0 \rightarrow X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n .\]In the following, we shall assume that the generative process is Markovian, i.e., the future is independent from the past given the present. The sequence may therefore be represented as
\[X_0 \overset{p(X_1 | X_0)}{\longrightarrow} X_1 \overset{p(X_2 | X_1)}{\longrightarrow} X_2 \overset{p(X_3 | X_2)}{\longrightarrow} \cdots \overset{p(X_n | X_{n-1})}{\longrightarrow} X_n ,\]where \(p\) is called the probability transition function. In the following, this function is independent of \(n\) (time homogeneity property). Therefore, the model may be represented more concisely:
\[X_{k-1} \overset{p(X_k | X_{k-1})}{\longrightarrow} X_k ,\]together with an initial distribution of \(X_0\).
Consider the Markov chain
\[X_0 \overset{p(X_1 | X_0)}{\longrightarrow} X_1 \overset{p(X_2 | X_1)}{\longrightarrow} X_2 \overset{p(X_3 | X_2)}{\longrightarrow} \cdots \overset{p(X_n | X_{n-1})}{\longrightarrow} X_n\]where \(X_0 = 0\) and
\[\left\{ \begin{array}{lcr} p (X_k = X_{k-1} + 1 | X_{k-1 }) & = & 0.5 \\ p (X_k = X_{k-1} - 1 | X_{k-1 }) & = & 0.5 \\ \end{array} \right.\]
Write a function that generates a sequence of \(n\) samples according to this process. Plot such a trajectory. Zoom in. What can you say?
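A sketch of such a trajectory generator:

```python
import matplotlib.pyplot as plt
import numpy as np

def random_walk(n, rng):
    """Symmetric random walk: X_0 = 0 and X_k = X_{k-1} +/- 1 with probability 1/2 each."""
    steps = rng.choice([-1, 1], size=n)
    return np.concatenate(([0], np.cumsum(steps)))

trajectory = random_walk(10_000, np.random.default_rng(0))
plt.plot(trajectory)
plt.show()
```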
An educated eye may see some self-similarity in the generated curve. That is to say, a similar pattern (made of sharp up-and-down waves) seems to be spotted at different scales. To investigate this property, we “zoom in” by plotting only the curve between time steps 4000 and 6000 instead of the whole 10 000 time steps represented above.
One can see that zooming-in does not really change the type of curve that we had at a larger scale. It is only when we zoom-in even further that we can start to see the characteristic length of 1 introduced by our generating method, \(X_{n+1} = X_n \pm 1\). Below, we zoom-in on 500 time steps only. Of course, another characteristic length is playing a role here, which is also equal to 1 and corresponds to the distance between two consecutive time steps.
Answer the same question as the previous one, but with a process such that \(X_{n+1} \mid X_n \sim \mathcal{N}(X_n, \epsilon)\). That is to say,
\[X_{n+1} = X_n + G_n^{\epsilon}\]
where \((G_n^{\epsilon})_{n\in\mathbb{N}}\) is a sequence of i.i.d. gaussian random variables with the law \(\mathcal{N}(0,\epsilon)\).
Intuitively, we should obtain a “similar” type of curve with a characteristic length of \(\epsilon\). However, because this process can take its values in \(\mathbb{R}\) rather than in \(\mathbb{N}\) only, the characteristic time discretization will play a more important role in the visualizations of the plots.
Indeed, we see the same self-similar pattern and can zoom in to better visualize it.
Write a function that computes and plots a discretized Brownian motion in 2D.
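A minimal sketch: a discretized 2D Brownian motion with time step dt is the cumulative sum of independent Gaussian increments of standard deviation \(\sqrt{dt}\) in each coordinate.

```python
import matplotlib.pyplot as plt
import numpy as np

def brownian_motion_2d(n_steps, dt=1e-3, rng=None):
    """Discretized 2D Brownian motion started at the origin."""
    rng = np.random.default_rng() if rng is None else rng
    increments = rng.normal(scale=np.sqrt(dt), size=(n_steps, 2))
    return np.vstack(([0.0, 0.0], np.cumsum(increments, axis=0)))

path = brownian_motion_2d(10_000, rng=np.random.default_rng(0))
plt.plot(path[:, 0], path[:, 1], linewidth=0.5)
plt.axis("equal")
plt.show()
```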
In the following, we will consider random walks on a graph where the set of vertices \(V\) is finite, as well as the set of edges \(E\). To each vertex \(v \in V\) is attached a probability distribution \(p_v\) on the space of vertices, such that \(p_v (n)\) is the probability of going from vertex \(v\) to vertex \(n\). In the Markov reward process literature, a vertex is often called a state.
The specificity of a Markov reward process is that a reward function \(r\) assigns a number to each transition (or step). Formally, at each state \(v \in V\), \(r_v\) is a function that assigns a reward \(r_v(n)\) to each transition from state \(v\) to state \(n\).
To create a random walk on a graph, one first creates a graph with weighted edges. Using those weights, we create a transition distribution at each vertex. Using the following code as a starting point, this is the topic of the next question.
Given a labeled directed graph \(G = (V, E, w, r, v_0)\) where \(V\) is the set of vertices, \(E\) the set of edges, \(w: E \rightarrow ]0,1[\) and \(r: E \rightarrow \mathbb{R}\) and \(v_0 \in V\) we can compute a Markovian random walk on the vertices of \(G\) by considering that
\[p (v_i | v_{i-1}) = \frac{w(v_{i-1}, v_i)}{\sum_v w(v_{i-1}, v)} .\]The specificity of this random walk is that at each transition \(v_{i-1} \rightarrow v_i\), we collect a reward \(r(v_{i-1}, v_i)\).
Write a function that generates a random walk along a graph, collects the history of vertices (i.e. samples) and collects the history of rewards. Plot the updated mean reward (it should be around 2, but why?).
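The starting-point code mentioned above is not reproduced here; the following sketch uses a plain dictionary of weighted edges instead, with edge rewards drawn from a Gaussian with mean 2 as described below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy graph: weights[v] maps each neighbor n to the weight of the edge v -> n.
weights = {0: {1: 0.7, 2: 0.3}, 1: {0: 0.5, 2: 0.5}, 2: {0: 0.6, 1: 0.4}}
# One reward per edge, drawn from a Gaussian with mean 2.
rewards = {(v, n): rng.normal(loc=2.0, scale=1.0) for v in weights for n in weights[v]}

def random_walk_on_graph(n_steps, v0):
    """Markovian walk: transition probabilities are the normalized outgoing edge weights."""
    v, history, collected = v0, [v0], []
    for _ in range(n_steps):
        neighbors = list(weights[v])
        probs = np.array([weights[v][n] for n in neighbors])
        v_next = int(rng.choice(neighbors, p=probs / probs.sum()))
        collected.append(rewards[(v, v_next)])
        history.append(v_next)
        v = v_next
    return history, collected

history, collected = random_walk_on_graph(5_000, v0=0)
running_mean_reward = np.cumsum(collected) / np.arange(1, len(collected) + 1)  # hovers around 2
```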
Because all edges have an associated reward that is initially generated according to a gaussian distribution with mean 2, we expect the mean of the collected rewards to be around this value. Of course, the previous sections highlighted how the random nature of sampling can generate dispersion around the mean.
Below, we generate 10 graphs and, for each one, compute the mean cumulative reward curve. We see that those curves all seem to converge and that the limits are gathered around 2, as we should expect.
To wrap it up, we generate again 10 random graphs and, for each one, compute the mean cumulative reward curve. We see that those curves all seem to converge and that the limits are gathered around 5, as we should expect. Why?
Experiment and play with those models! If you have any remarks or questions, feel free to reach me, my door is always open!
MCMC is a python implementation of some MCMC sampler algorithms that can leverage pyTorch’s automatic differentiation to sample from a given (unnormalized or not) density distribution.
For instance, if we were to sample from a Gaussian distribution using the MALA algorithm, one can proceed as in the mcmc_test.py file and get the following kernel density estimation and sample trajectory:
Within the mala_test.py file, one can also find a small function to compute autocorrelation and check that the Markov Chain “mixes” well:
To help tune the parameters of the MALA algorithm, it is possible to print the acceptance ratio. A good empirical target for it is around 63%: it means that the chain is exploring enough while around 63% of the proposals are accepted as the next step.
One can also find a small tutorial in the tutorial.ipynb file.
A useful example can be found in mala_test.py as well as in tutorial.ipynb.
The repository is organized around two files:
- mcmc.py, which contains several classes associated with MCMC sampler algorithms. Since Langevin methods need a density function and a gradient, an important class can be found in the next file.
- function.py, which contains one class, MODEL, that is used by the MCMC classes. MODEL should contain all the information one has about the distribution one wishes to sample from. The bare minimum is either an (unnormalized) density or a log-density. This function should be written using pyTorch functions only, so that autodifferentiation can be used to compute the gradient of the log-density with respect to the input. However, if the gradient is known, you can still use it in the MODEL.

Assume that one has an implementation of the density function we wish to sample from. Then a minimal working example would look like:
import torch

from mcmc import MALA
from function import MODEL

DIMENSION_INPUT_SPACE = 2       # an int (example value)
GRADIENT_STEPSIZE = 1e-2        # a float (example value)
RANDOM_STEPSIZE = 1e-1          # a float (example value)
INITIAL_POINT = torch.randn(DIMENSION_INPUT_SPACE)
NBR_SAMPLES = 10_000            # an int (example value)

def density_to_sample_from(x):
    # ...
    # Some pyTorch-only computations
    # ...
    return density_at_x

# Create the probabilistic model
probabilistical_model = MODEL(density=density_to_sample_from)

# Create the MCMC method
mala_mcmc = MALA(probabilistical_model,
                 metric=torch.eye(DIMENSION_INPUT_SPACE),
                 gradient_step=GRADIENT_STEPSIZE,
                 random_step=RANDOM_STEPSIZE,
                 initialization=INITIAL_POINT)

# Sample
SAMPLES = mala_mcmc.fit(NBR_SAMPLES)

# Then one can plot some dimensions
# Please see mala_test.py or tutorial.ipynb