===== Reproducible Machine Learning ===== [[https://towardsdatascience.com/reproducible-machine-learning-cf1841606805?gi=ce8574de7e61|Original article]] [[https://www.qgelm.de/wb2html/wbb1272.html|Backup]]

A step towards making ML research open and accessible


The NeurIPS (Neural Information Processing Systems) 2019 conference marked the third year of its annual reproducibility challenge and the first time a reproducibility chair was part of the program committee.

So, what is reproducibility in machine learning?

Reproducibility is the ability to be recreated or copied. In machine learning, reproducibility is being able to recreate a machine learning workflow to reach the same conclusions as the original work.

Why is this important?

An algorithm from new research that lacks these reproducibility aspects can be difficult to investigate and implement. With our increasing dependence on ML and AI systems for decision making, integrating a model that is not fully understood can have unintended consequences.

Costs and budget constraints are another area impacted by reproducibility. Without details on hardware, computational power, training data and more nuanced aspects like hyper-parameter tuning, adopting new algorithms can incur huge costs and considerable research effort, only to lead to inconclusive results.

Reproducibility, a factor in building trustworthy ML (Photo by author)

How can we tell whether an ML model or research project meets reproducibility standards?

This post explores two approaches, proposed in research papers, to establish reproducibility.

In the first approach, the authors identify missing information as the root cause of reproducibility issues. Missing information can be intentional (a trade secret) or unintentional (an undisclosed assumption). They focus on methods to rectify the unintentional problems that hinder reproducibility.

Aspects impacting reproducibility and the solutions suggested are summarized here:

Training data: A model produces different results for different training sets. One way of addressing this is to adopt a versioning system that records changes in the training data. However, this may not be practical for large data sets. Using data with documented timestamps is a workaround. The other option is to save hashes of data at regular intervals with documentation of the calculation methods.
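
As a rough sketch of the hashing option (assuming Python and a training set stored as a single file; the file and log names below are illustrative), the hash value and the calculation method can be logged together:

<code python>
import hashlib
import json
import time

def log_training_data_hash(data_path: str, log_path: str = "data_hashes.json") -> str:
    """Hash a training-data file and append the result to a small JSON log.

    The calculation method (SHA-256 over the raw file bytes) is stored with
    each entry so the hash can be recomputed and checked later.
    """
    sha256 = hashlib.sha256()
    with open(data_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    digest = sha256.hexdigest()

    entry = {
        "file": data_path,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "method": "SHA-256 over raw file bytes",
        "hash": digest,
    }
    try:
        with open(log_path) as f:
            log = json.load(f)
    except FileNotFoundError:
        log = []
    log.append(entry)
    with open(log_path, "w") as f:
        json.dump(log, f, indent=2)
    return digest
</code>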

Features: Features can produce varying results depending on how they are selected and generated. The steps taken in generating features should be tracked and version controlled. Other best practices include i) keeping individual feature generation code independent from one another and ii) creating a new feature, rather than modifying the existing one, when fixing a bug in a feature.
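
A hypothetical illustration of those two practices (column and feature names are made up): each feature has its own generation function, and a bug fix yields a new, versioned feature instead of silently changing the old one.

<code python>
import pandas as pd

# Each feature has its own, independent generation function.
def feature_days_since_signup(df: pd.DataFrame) -> pd.Series:
    return (df["reference_date"] - df["signup_date"]).dt.days

# Original feature, kept unchanged for models already trained on it.
def feature_avg_order_value_v1(df: pd.DataFrame) -> pd.Series:
    return df["total_spend"] / df["num_orders"]

# The bug fix (guarding against division by zero) becomes a new feature version.
def feature_avg_order_value_v2(df: pd.DataFrame) -> pd.Series:
    return df["total_spend"] / df["num_orders"].replace(0, pd.NA)
</code>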

Model training: Documenting details of how the model was trained will ensure repeatable results. Feature transformations, the order of features, hyperparameters and the method used to select them, and the structure of the ensemble are among the important model training details to maintain.
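
One lightweight way to keep these details is a training record written next to the saved model and committed to version control; a minimal sketch with illustrative field names and values:

<code python>
import json

# Illustrative training record; the values are placeholders, not from the papers.
training_record = {
    "feature_order": ["days_since_signup", "avg_order_value_v2"],
    "feature_transformations": {"avg_order_value_v2": "log1p"},
    "hyperparameters": {"n_estimators": 300, "max_depth": 6, "learning_rate": 0.05},
    "hyperparameter_selection": "grid search with 5-fold cross-validation",
    "ensemble_structure": "average of 3 gradient-boosted models with different seeds",
    "random_seed": 42,
}

with open("training_record.json", "w") as f:
    json.dump(training_record, f, indent=2)
</code>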

Software environment: The software versions and packages used also play a role in replicating the results of the original ML model. An exact match of the software used may be required. Hence, even if there have been software updates, it is best to use the version on which the model was originally trained.
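
A small sketch of capturing the environment from within Python (the package list is illustrative); the resulting file can be stored alongside the model and data hashes:

<code python>
import json
import platform
from importlib import metadata

# Record the interpreter version and the exact versions of key packages.
environment = {
    "python": platform.python_version(),
    "packages": {
        name: metadata.version(name)
        for name in ["numpy", "pandas", "scikit-learn"]  # illustrative list
    },
}

with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)
</code>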

In the second approach, the authors attempt to independently reproduce 255 papers without using the original authors' code. They record attributes of the research papers and analyze the correlation between reproducibility and these attributes using statistical hypothesis testing.

They use a total of 26 attributes per paper, grouped into three categories:

Unambiguous (require no explanation): Number of Authors, the existence of an appendix (or supplementary material), the number of pages (including references, excluding any appendix), the number of references, the year the paper was published, the year first attempted to implement, the venue type (Book, Journal, Conference, Workshop, Tech-Report), as well as the specific publication venue (e.g., NeurIPS, ICML).

Mildly subjective: Number of Tables, Number of Graphs/Plots, Number of Equations, Number of Proofs, Exact Compute Specified, Hyper-parameters Specified, Compute Needed, Data Available, Pseudo-Code.

Subjective: Number of Conceptualization Figures, Uses Exemplar Toy Problem, Number of Other Figures, Rigor vs Empirical, Paper Readability, Algorithm Difficulty, Primary Topic, Looks Intimidating.

Results and significant relationships

To determine significance, the authors use the non-parametric Mann–Whitney U test for numeric features and the Chi-squared test with continuity correction for categorical features.
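
A minimal sketch of how such tests can be run with SciPy; the numbers below are made up for illustration and are not from the study:

<code python>
from scipy.stats import chi2_contingency, mannwhitneyu

# Hypothetical numbers of equations in reproduced vs. non-reproduced papers.
equations_reproduced = [2, 5, 7, 3, 10, 4]
equations_not_reproduced = [12, 18, 9, 20, 15, 11]
u_stat, p_numeric = mannwhitneyu(equations_reproduced, equations_not_reproduced)

# Hypothetical 2x2 table: hyper-parameters specified vs. paper reproduced.
table = [[30, 10],   # specified:     reproduced / not reproduced
         [15, 25]]   # not specified: reproduced / not reproduced
chi2, p_categorical, dof, expected = chi2_contingency(table, correction=True)

print(f"Mann-Whitney U p-value: {p_numeric:.3f}")
print(f"Chi-squared (continuity-corrected) p-value: {p_categorical:.3f}")
</code>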

Out of the twenty-six attributes, ten are significantly correlated with reproducibility.

These are Number of Tables, Number of Equations, Compute Needed, Pseudo-Code, Hyper-parameters Specified, Paper Readability, Rigor vs Empirical, Algorithm Difficulty, Primary Topic, and Authors Reply.

The more information there is about the significant attributes, the easier it is to reproduce a paper. However, the number of equations is negatively correlated with reproducibility. This does seem counter-intuitive, and the authors offer two theories as to why: 1) a larger number of equations makes the paper more difficult to read and hence more difficult to reproduce, or 2) papers with more equations correspond to more complex and difficult algorithms, which are naturally harder to reproduce.

Study limitations

The authors of this study select papers based on personal interest. The topics are not randomly picked, hence a selection bias is introduced. Also, the results should be read with the consideration that the authors are not experts on the topics they select.

Thirdly, many of the significant attributes are subjective, and objective measures for these factors still need to be developed, e.g. for measuring paper readability or algorithm difficulty.

The first study takes a bottom-up approach while the second takes a top-down one. Together, they give insight into the factors necessary for replicating machine learning pipelines and reproducing original results. As the results of the two approaches show, model hyper-parameters and computing resources/software environment are two important factors influencing reproducibility.

In general, it is important to record every step taken in building a model through documentation and version control. After all, many machine learning systems are black boxes, and attending to reproducibility makes the research open and accessible.