// the find
HowieHwong/TrustLLM
[ICML 2024] TrustLLM: Trustworthiness in Large Language Models
TrustLLM is an ICML 2024 benchmark and toolkit for evaluating LLM trustworthiness across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. It ships 30+ datasets and a Python package that runs models through each category and produces scores you can compare against the published leaderboard. It is aimed at researchers and teams doing LLM safety evaluation, not practitioners who just want to ship.
The taxonomy is unusually thorough: splitting 'safety' into jailbreak, toxicity, misuse, and exaggerated safety catches real failure modes that single-axis benchmarks miss. The evaluation method is chosen per task (keyword matching, GPT-4 judging, or Longformer scoring) rather than forced into a lazy one-size-fits-all approach. The dataset table is honest about what's original versus borrowed from prior work, which matters when interpreting scores. The repo is still active as of June 2025, unlike most ICML benchmark repos, which go dark six months post-paper.
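To make the keyword-matching style of evaluation concrete, here is a minimal sketch of a refusal-rate scorer. It is not TrustLLM's code: the marker list and the names `REFUSAL_MARKERS`, `is_refusal`, and `refusal_rate` are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical refusal markers for illustration; NOT TrustLLM's actual keyword list.
REFUSAL_MARKERS = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "as an ai",
)


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    # Normalise curly apostrophes so "I’m sorry" matches "i'm sorry".
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses flagged as refusals."""
    if not responses:
        raise ValueError("responses must be non-empty")
    return sum(is_refusal(r) for r in responses) / len(responses)


sample = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is a summary of the article.",
    "As an AI, I cannot provide that information.",
    "The capital of France is Paris.",
]
print(refusal_rate(sample))  # 0.5
```

The same number reads differently depending on the subcategory: on jailbreak prompts a high refusal rate is the desired outcome, while on benign prompts it would indicate exaggerated safety. That is why the finer-grained split matters.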
The `pip install trustllm` path is explicitly marked deprecated, with no clear migration guide to the GitHub install, which is a bad sign for long-term packaging hygiene. Evaluation leans on GPT-4 as judge for a significant fraction of tasks, which makes results expensive and, when the model under evaluation is GPT-4 or a close relative, circular. The authors have already moved on to TrustGen/TrustEval and are steering people there, so TrustLLM is effectively in maintenance mode, and the leaderboard will drift out of date as new models drop. The Python 3.9 pin in the `conda create` example is stale for anyone on a modern stack.