Validation approach
RAISE distinguishes between building an AI tool and demonstrating that it is fit for a specific task and context. Vetra should therefore be evaluated module by module and as a whole system, because screening, extraction, synthesis and writing depend on and feed back into one another, as they do in a review carried out by humans.
Evaluating components in isolation is not enough; the context of the full review is needed. Vetra's value shows in how the whole flow fits together: what each module carries over from the previous stage and what it passes on to the next.
Planned metrics
- Sensitivity, specificity, precision and false negatives in assisted screening.
- Agreement between AI, human reviewers and the researcher's final decision.
- Quality of justifications and usefulness for resolving disagreements.
- Extraction errors, incomplete fields, inconsistencies and need for human correction.
- Time spent per stage compared with traditional manual processes.
- User experience, cognitive load and ease of auditing decisions.
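As an illustration of the screening and agreement metrics listed above, the following is a minimal sketch in Python. It assumes that every record has a reference decision (for example, the researcher's final include/exclude decision) and a tool decision, both stored as booleans. The names `ScreeningCounts`, `count_decisions`, `screening_metrics` and `cohen_kappa` are hypothetical and are not part of Vetra. Cohen's kappa is used here as one common chance-corrected agreement statistic; the text above does not specify which statistic will be used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    """Confusion-matrix counts for include/exclude screening decisions."""
    tp: int  # relevant records the tool included
    fp: int  # irrelevant records the tool included
    tn: int  # irrelevant records the tool excluded
    fn: int  # relevant records the tool excluded (missed studies)


def count_decisions(reference, predicted):
    """Tally counts from parallel lists of booleans (True = include)."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp = fp = tn = fn = 0
    for ref, pred in zip(reference, predicted):
        if ref and pred:
            tp += 1
        elif not ref and pred:
            fp += 1
        elif not ref and not pred:
            tn += 1
        else:
            fn += 1
    return ScreeningCounts(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    """Return numerator/denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def screening_metrics(counts):
    """Compute the screening metrics named in the list above."""
    return {
        "sensitivity": _ratio(counts.tp, counts.tp + counts.fn),
        "specificity": _ratio(counts.tn, counts.tn + counts.fp),
        "precision": _ratio(counts.tp, counts.tp + counts.fp),
        "false_negatives": counts.fn,
    }


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' boolean include/exclude decisions."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("raters need the same, non-zero number of decisions")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # share of records rater A included
    p_b = sum(rater_b) / n  # share of records rater B included
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        return None  # both raters gave the same single label throughout
    return (observed - expected) / (1 - expected)
```

For example, with eight records of which three are relevant, a tool that includes two of the relevant records and one irrelevant record gives a sensitivity of 2/3, a specificity of 0.8, a precision of 2/3 and one false negative; Cohen's kappa between the tool and the reference is 7/15, roughly 0.47. Undefined ratios are returned as `None` rather than zero so that an empty category is not mistaken for poor performance.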
What should be reported
Each evaluation should state the task, dataset, use context, tool version, human or methodological comparator, selected metrics, limitations and any negative or unexpected results.
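The reporting items above could be captured in a simple record so that gaps are visible before a report is shared. This is only a sketch: the class `EvaluationReport`, its field names and the `missing_items` helper are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class EvaluationReport:
    """One evaluation, with a field for each item that should be reported."""
    task: str
    dataset: str
    use_context: str
    tool_version: str
    comparator: str  # human or methodological comparator
    metrics: dict = field(default_factory=dict)
    limitations: list = field(default_factory=list)
    # An empty list is flagged as missing, so "none observed" has to be
    # stated explicitly rather than implied by silence.
    negative_or_unexpected_results: list = field(default_factory=list)

    def missing_items(self):
        """Return the names of reporting items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A report would be treated as complete only when `missing_items()` returns an empty list.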
Adoption decisions
A function may be classified as ready for direct use, requiring human verification, exploratory only or not recommended for formal use. This classification should be reviewed, and changed where necessary, whenever the model, workflow, context or available evidence changes.
The classification should rest on both the performance of each module and the behaviour of the complete system, because screening, extraction, synthesis and writing influence one another. Measuring isolated pieces is not enough.
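The four adoption categories in this section, and the rule that a classification depends on the model, workflow, context and evidence it was based on, can be sketched as follows. All names here (`AdoptionStatus`, `EvaluationBasis`, `Classification`, `changed_aspects`, `needs_review`) are hypothetical. The sketch assumes that each of the four aspects can be summarised as a comparable label such as a version string, and it treats any change as a trigger for re-examining the classification, not as an automatic change of status.

```python
from dataclasses import dataclass
from enum import Enum


class AdoptionStatus(Enum):
    DIRECT_USE = "ready for direct use"
    HUMAN_VERIFICATION = "requires human verification"
    EXPLORATORY_ONLY = "exploratory only"
    NOT_RECOMMENDED = "not recommended for formal use"


@dataclass(frozen=True)
class EvaluationBasis:
    """What a classification was based on when it was made."""
    model: str
    workflow: str
    context: str
    evidence: str


@dataclass(frozen=True)
class Classification:
    function_name: str
    status: AdoptionStatus
    basis: EvaluationBasis


def changed_aspects(classification, current):
    """List which aspects differ between the recorded and current basis."""
    names = ("model", "workflow", "context", "evidence")
    old = classification.basis
    return [n for n in names if getattr(old, n) != getattr(current, n)]


def needs_review(classification, current):
    """A classification is due for review if any aspect has changed."""
    return bool(changed_aspects(classification, current))
```

Recording the basis alongside the status keeps the reason for a re-review auditable: `changed_aspects` names exactly what moved since the classification was made.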
Expected pilot outcome
The pilot should make it possible to decide which modules can be recommended, which ones need extra controls and which limits should be communicated to researchers, institutions and potential partners.
