Python-based research interface for blackbox
A microbenchmark support library
A list of open LLMs available for commercial use
RandomX, KawPow, CryptoNight, AstroBWT and GhostRider unified miner
A command-line benchmarking tool
A Heterogeneous Benchmark for Information Retrieval
Checks whether Kubernetes is deployed
A benchmarking framework for the Julia language
Reference implementations of MLPerf™ training benchmarks
Agentic, Reasoning, and Coding (ARC) foundation models
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
MTEB: Massive Text Embedding Benchmark
Benchmarking synthetic data generation methods
A.S.E (AICGSecEval) is a repository-level AI-generated code security
A fast and easy-to-use microframework for the web
LongBench v2 and LongBench (ACL '25 & '24)
A simple generic set type for the Go language
Meta Agents Research Environments is a comprehensive platform
Visual Causal Flow
Drill is an HTTP load testing application written in Rust
Integrates the JMH benchmarking framework with Gradle
Strong, Economical, and Efficient Mixture-of-Experts Language Model
Code for the paper "Evaluating Large Language Models Trained on Code"
bsuite is a collection of carefully designed experiments
Leaderboard Comparing LLM Performance at Producing Hallucinations