Let's say that there's a reproducible fair test with the following specifications:
Then can you always safely claim that X and Y must universally lead to P and Q respectively, and that A is solely responsible for the difference between P and Q universally?
If you think it's a definite yes, then you're probably oversimplifying control variables, because the real answer is this: when the control variables are set to K, X and Y must lead to P and Q respectively.
Here's an example from software engineering (Test 1):
If the project is changed to writing an AAA game, or a complicated and convoluted full-stack cashier and inventory management system for supermarkets, then I'm quite sure that procedural programming won't perform the best, because procedural programming just isn't suitable for writing such software (in reality, the vast majority of practical projects should be solved using the optimal mix of different paradigms, but that's beyond the scope of this example).
This example aims to show that even a reproducible fair test isn't always reliable when it comes to drawing universal conclusions, because the context of that test, which is its control variables, also influences the end results. The context should therefore always be clearly stated when drawing conclusions, so that they won't be applied to situations where they no longer hold.
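To make the idea of a conclusion that carries its context a bit more concrete, here's a minimal Python sketch. The `ScopedConclusion` class, the `project_nature` key and the placeholder values are all hypothetical names made up for illustration; they're not taken from any real test, and exact matching on every control variable is just the simplest possible rule.

```python
from dataclasses import dataclass


@dataclass
class ScopedConclusion:
    """A test conclusion bundled with the control variables it was observed under.

    Hypothetical illustration only: the names and values are placeholders.
    """
    claim: str
    controls: dict  # control variable name -> value held constant in the test

    def applies_to(self, situation: dict) -> bool:
        # The conclusion is only known to hold when every control variable
        # of the test has the same value in the new situation.
        return all(
            situation.get(name) == value
            for name, value in self.controls.items()
        )


# Placeholder example: the claim was only observed with the controls set to K.
conclusion = ScopedConclusion(
    claim="X leads to P and Y leads to Q",
    controls={"project_nature": "K"},
)

print(conclusion.applies_to({"project_nature": "K"}))               # same context
print(conclusion.applies_to({"project_nature": "something else"}))  # different context
```

A `False` here doesn't say the claim is wrong in the new situation, only that the test never covered it.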
Another example is a reproducible fair test examining whether proper up-front architectural design (which doesn't have to mean waterfall) is more productive than counterproductive, or vice versa (Test 2):
If a universally applicable conclusion has to be reached, then one way to get there is to run even more fair tests, with the control variables set to different constants and/or with different variables under test, to avoid conclusions that actually only apply to some unstated contexts.
For instance, in Test 2, the project nature, as the major part of the control variables, can be changed, and one can then check whether the new reproducible fair tests on the productivity of proper up-front architectural design give different results. Or in Test 1, the programming paradigm can become part of the control variables, while the project nature becomes the variable under test in the new reproducible fair tests.
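Repeating a fair test under different control settings can be sketched roughly like this in Python. The function names, the settings `K1` to `K3` and the outcomes `P` and `Q` are placeholders I'm assuming for illustration, not results of any actual test.

```python
from collections import defaultdict


def contexts_by_outcome(results: dict) -> dict:
    """Group control-variable settings by the outcome observed under them."""
    grouped = defaultdict(list)
    for controls, outcome in results.items():
        grouped[outcome].append(controls)
    return dict(grouped)


def consistent_across_tested_contexts(results: dict) -> bool:
    """True only if at least one context was tested and all gave the same outcome."""
    return len(set(results.values())) == 1


# Placeholder data: the same fair test repeated under three control settings.
results = {"K1": "P", "K2": "P", "K3": "Q"}

print(consistent_across_tested_contexts(results))
print(contexts_by_outcome(results))
```

Even when the check passes, the conclusion only covers the settings that were actually tested, which is why the list of tested contexts is kept alongside each outcome.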
Of course, that'd mean a hell of a lot of reproducible fair tests (and all those results must be properly integrated, which is itself a very complicated and convoluted matter), and the difficulties and costs involved likely make the whole thing infeasible within a realistic budget in the foreseeable future. It's still better than running some incomplete tests and falsely drawing universal conclusions from them, when those conclusions only apply to some contexts (and those contexts should be clearly stated).
Therefore, to be practical while still respecting the truth, until the software engineering industry can finally perform complete tests that reliably yield truly universal conclusions, it's better for practitioners to accept that many of its conclusions are still just contextual, and it's vital for us to carefully and thoroughly examine our own circumstances before applying those situational test results.
For example, JavaScript (and sometimes even TypeScript) is said to suck very hard, partly because there are simply too many insane quirks, and writing JavaScript is like driving without any traffic rules at all, so it's only natural that we should avoid JavaScript as much as we can, right?
However, to a highly devoted, diligent and disciplined JavaScript programmer, JavaScript is one of the few languages that provide an amount of control and freedom that is simply unthinkable in many other programming languages, and such programmers can use it extremely effectively and efficiently, all without accumulating technical debt that can't be repaid on time (of course, that's only possible when such programmers are very experienced in JavaScript and care a great deal about code quality and architectural design).
The difference here is again the underlying context. Those blaming JavaScript might usually be working on large projects (like those way beyond the 10M LoC scale) with large teams (like way beyond 50 members), where it'd be rather hard to have every member be highly devoted, diligent and disciplined, so the amount of control and freedom offered by JavaScript will most likely lead to chaos.
Those praising JavaScript, on the other hand, might usually be working alone or with a small team (like way fewer than 10 members) on small projects (like those way below the 100k LoC scale), where the strict rules imposed by many statically and strongly typed languages (especially Java with checked exceptions) may just get in their way. Those restrictions are up-front investments that need time and project scale to pay off, and such time and scale are usually lacking in small projects run by small teams, where short-term effectiveness and efficiency are generally more important.
Do note that these opinions, when combined, can also be regarded as reproducible fair tests, because the number of coherent and consistent opinions on each side is huge, and many of those holding them wouldn't have the same complaint or compliment if only the language were changed.
Therefore, it's normally pointless to totally agree or disagree with a so-called universal conclusion about some aspect of software engineering. What's truly meaningful is to figure out the contexts behind those conclusions, assuming they're not already clearly stated, so we can better know when to apply those conclusions and when to apply others.
Actually, similar phenomena exist outside of software engineering.
For instance, let's say there's a study on the relation between the number of observers of a knowingly immoral wrongdoing and the percentage of them who go to help the victims and stop the culprits, with the entire scenes recorded by surveillance cameras, so that those recordings can be sampled in large numbers to form reproducible fair tests.
Now, some researchers claim that those samples show that the more observers there are, the higher the percentage of them who go to help the victims and stop the culprits, so can we safely conclude that the bystander effect is actually wrong?
It at least depends on whether those bystanders knew that the surveillance cameras existed. If they did, then it's possible that the results were affected by the Hawthorne effect, meaning that the percentage of them going to help the victims and stop the culprits could be much, much lower if there were no surveillance cameras, or if they didn't know about the cameras (but that still doesn't mean the bystander effect is right, because the truth could be that the percentage of bystanders going to help the victims has little to do with the number of bystanders).
In this case, the existence of those surveillance cameras is actually a major part of the control variables in those reproducible fair tests, and this can be regarded as an example of the observer's paradox (whether this can justify the ever-growing number of surveillance cameras everywhere is beyond the scope of this article).
Of course, this can be rectified, like by concealing those surveillance cameras, or by having highly trained researchers regularly record places where culprits are likely to openly hurt victims in front of a varying number of observers, without those observers knowing the researchers are there. Needless to say, these alternatives are so impractical that no one will really pursue them, and they'd pose even greater problems, like serious privacy issues, even if they could actually be implemented.
Another example is that, when I was still a child, I volunteered for a study on the sleep quality of children in my city, and I was asked to sleep in a research center, meaning that my sleeping behavior would be monitored.
I can still vaguely recall that I ended up sleeping quite poorly that night, even though both the facilities (especially the bed and the room) and the personnel there were really nice, and even though I slept well most of the time as a child. Such a seemingly strange result was probably because I failed to quickly adapt to a vastly different sleeping environment, regardless of how good the bed in that research center was.
I can vaguely recall that the full results of the study across all the volunteering children were far from ideal, but since the change of sleeping environment was a main part of the control variables in those reproducible fair tests, I still wonder whether the sleep quality of the children in my city back then was really that subpar.
To mitigate this, those children could have slept in the research center for many nights instead of just 1, in order to eliminate the factor of having to adapt to a new sleeping environment, but of course the cost of such research to both the researchers and the volunteers (as well as their families) would be prohibitive, and the sleep quality results still might not hold once those children went back to their original sleeping environment.
Another way might be to let parents buy some instruments and, with some training, monitor the sleep quality of their children in their original sleeping environment, but again, the feasibility of such research and the willingness of the parents to carry it out would be really big issues.
The last example is the famous Milgram experiment: does it really mean most people are so submissive to their perceived authorities when it comes to immoral wrongdoings? There are some questions to be asked, at least including the following:
In this case, the majority of the control variables in those reproducible fair tests are the test setups themselves, because such experiments would be immoral in the extreme if those being researched truly committed immoral wrongdoings. This makes it inherently hard to properly establish a concrete and strong causation between immoral wrongdoings and some other fixed factors, like submission to authorities.
Some may say that those being researched did believe that they were performing immoral wrongdoings, judging by their reactions during the test and the interviews afterwards, and that those reactions would also manifest when someone does commit a knowingly immoral wrongdoing, so the Milgram experiment, which has already been reproduced, still largely holds.
But let's consider this thought experiment: you're asked to play an extremely gory, sadistic and violent VR game with state-of-the-art audio, immersion and visuals, with some authorities ordering you to kill the most innocent characters by the most brutal means possible in that game. I'm quite certain that many of you would show many of the reactions manifested by those being researched in the Milgram experiment, but that doesn't mean many of you would knowingly perform immoral wrongdoings when being submissive to an authority, because no matter how realistic those actions seem, it's still just a game after all.
The same might hold for the Milgram experiment as well: those being researched did know that the whole thing was just an elaborate fake on one hand, but still manifested the same reactions as someone knowingly committing immoral wrongdoings on the other, because the fake felt so real that their brains got tricked into showing some real emotions despite them knowing it was still just a fake after all, just like the intense real emotions evoked when watching an intensely emotional movie.
It doesn't mean the Milgram experiment is pointless though, because it at least shows that being submissive to perceived or real authorities will make many people do many things that they wouldn't normally do otherwise. Whether such things include knowingly immoral wrongdoings might remain inconclusive from the results of that experiment (even if authorities do cause someone to commit immoral wrongdoings that wouldn't be committed otherwise, it could still be because that someone really doesn't know that they're immoral wrongdoings, due to the key information being obscured by the authorities, rather than because that someone is submissive to those authorities while knowing that they're immoral wrongdoings).
Therefore, to properly establish a concrete and strong causation between knowingly immoral wrongdoings and submission to perceived or real authorities, we might have to investigate actual immoral wrongdoings in real life, and what part the perceived or real authorities played in those incidents.
To conclude, those running reproducible fair tests should, when feasible, clearly state their underlying control variables when drawing conclusions, and those trying to apply those conclusions should be clear on their own circumstances to determine whether the conclusions do apply to the situations they're facing, as long as the time needed for such assessments is still practical in those cases.