In this article the authors discuss their analysis of 21 female and 22 male athletic events. Testing all 43, they find 3 events significant at p<0.05. By the definition of the p-value, a well-calibrated statistical test applied to 43 events is expected to produce about 2 false positives on random data (43 × 0.05 ≈ 2.15). The odds of producing 3 or more false positives are also rather high: for normally distributed simulated data under the null, I found 3 or more false positives in approximately 1/3 of such analyses; see here for a simulation notebook: https://github.com/davidasiegel/False-Positive-Rate-for-Multiple-Tests-i....
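The ~1/3 figure does not even require simulation; under the null, the number of false positives across 43 independent tests follows a binomial distribution, and the tail probability can be computed directly. A minimal sketch (standard-library Python only):

```python
from math import comb

n, alpha = 43, 0.05  # 43 independent tests, each at the p < 0.05 level

# Expected number of false positives under the null
expected = n * alpha  # 43 * 0.05 = 2.15

# P(3 or more false positives) = 1 - P(0) - P(1) - P(2),
# from the binomial distribution Binomial(n=43, p=0.05)
p_3_or_more = 1 - sum(
    comb(n, k) * alpha**k * (1 - alpha)**(n - k) for k in range(3)
)

print(f"expected false positives: {expected:.2f}")   # ~2.15
print(f"P(>=3 false positives):   {p_3_or_more:.3f}")  # ~0.365
```

This exact calculation gives roughly 0.365, consistent with the ~1/3 rate seen in the simulation notebook.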
This is why adjustment for multiple comparisons needs to be performed. It was neglected in their initial study and neglected again in this study. In the 2017 study they state, "These different athletic events were considered as distinct independent analyses and adjustment for multiple comparisons was not required." This doesn't make sense to me; if the analyses are distinct, that is all the more reason to correct for multiple comparisons. If a Bonferroni correction were performed, none of the p-values would be significant at the level of the study (p < 0.05/43 ≈ 0.0012). Therefore I see no reason to reject the null hypothesis for any of these results.
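For concreteness, here is what the Bonferroni correction looks like in code. The p-values below are hypothetical placeholders (the study's actual values are not reproduced here); the point is only that anything above 0.05/43 fails the corrected threshold:

```python
# Bonferroni correction: each of the 43 tests must clear alpha / n_tests
n_tests, alpha = 43, 0.05
threshold = alpha / n_tests  # ~0.00116

# Hypothetical p-values for three nominally "significant" events
# (illustrative only -- not the study's actual numbers)
p_values = [0.011, 0.024, 0.047]

survivors = [p for p in p_values if p < threshold]
print(f"Bonferroni threshold: {threshold:.5f}")
print(f"survive correction:   {survivors}")  # empty list
```

Any p-value between the corrected threshold (~0.0012) and the nominal 0.05 cutoff is exactly the kind of result that ~2 false positives per 43 tests would be expected to produce.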
Performing a series of tests and then re-testing only the subset of events that produced the most significant results is also poor statistical practice: it selects on noise and adds no evidential value to the study.
In conclusion, taking into account the total number of tests that were performed (43), the results fail to be significant at the level of the study.