Good-enough practices for language model packages
This document suggests minimal “Good-enough Practices” for
software packages that rely on language model (LM, or “LLM” for large
language model) outputs.
- Prefer local LMs, for reasons described in this
separate vignette, and generally avoid relying on closed or
commercial APIs.
- Provide direct links to all models used, such as links to Hugging
Face model pages, “model cards” hosted elsewhere, or original published
research. Include explicit statements about the long-term availability
and stability of all models.
- Summarise the training data used in all models, including an estimate
of the proportion of data drawn from the public domain, and the extent
to which use of such data in model training may violate licensing
conditions.
- Combine LM output with equivalent output from alternative
algorithms. Reasons for this are exemplified in this blog post from
anthropic.ai.
- Use or implement efficient algorithms to combine ranks from these
multiple outputs, including separate pre-processing of the most
computationally intensive stages.
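One simple and efficient way to combine ranks from multiple outputs is reciprocal rank fusion (RRF). The following is a minimal sketch; the function name, parameters, and example package identifiers are illustrative, not taken from any particular implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Combine several ranked lists of items into one fused ranking.

    rankings: list of lists, each an ordering of item identifiers with
              best first (e.g. one from an LM, one from a non-LM
              algorithm).
    k: damping constant from the RRF formulation; larger values
       reduce the dominance of top-ranked items.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically as one
    # simple, documentable tie-breaking procedure.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Example: fuse a hypothetical LM-derived ranking with a non-LM one:
lm_ranks = ["pkgA", "pkgB", "pkgC"]
alt_ranks = ["pkgB", "pkgA", "pkgD"]
print(reciprocal_rank_fusion([lm_ranks, alt_ranks]))
# → ['pkgA', 'pkgB', 'pkgC', 'pkgD']
```

Because each list contributes scores independently, the most computationally intensive rankings can be pre-processed and cached, then fused with cheaper rankings on demand.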
- Provide a “Summary” of how the software generates results. This
should include the following sections where applicable:
- Input chunking, describing chunking methods used, and possible user
control
- LM sizes, including input or context size, and output or embedding
sizes
- Similarity algorithms, including metrics applied to LM outputs, and
metrics for alternative, non-LM algorithms
- Final ranking, including description of how different components are
combined, such as outputs from different LM chunks and from alternative,
non-LM algorithms. Tie-breaking procedures may also be described.
- Reproducibility statement, including descriptions of long-term
stability of model results, along with any components relying on random
numbers, and how seeding can be used to generate reproducible
outputs.
- Software should also provide extended, non-technical descriptions of
all aspects presented in the preceding summary.
- LM packages should include routines to update all data used, and
demonstrate that such updates are automated and performed with
sufficient regularity. See this blog post on the difficulty and
importance of updating LM data, and this package’s vignette on how its
data are updated.
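An automated update routine might begin with a simple data-freshness check such as the sketch below. The file path and maximum age are hypothetical; a real package would wire this into its own update and scheduling machinery (e.g. a CI cron job).

```python
import os
import time

def data_needs_update(path, max_age_days=30):
    """Return True if the data file at `path` is missing or older
    than max_age_days, signalling that the update routine should run.

    max_age_days encodes the package's chosen update regularity and
    is an illustrative default, not a recommendation.
    """
    if not os.path.exists(path):
        return True
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds > max_age_days * 86400
```

Running such a check on a regular schedule, and logging its outcome, provides the demonstrable evidence of automated, regular updates that this practice calls for.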