A Heterogeneous Benchmark for Information Retrieval
Agentic, Reasoning, and Coding (ARC) foundation models
Reference implementations of MLPerf™ training benchmarks
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
Benchmarking synthetic data generation methods
MTEB: Massive Text Embedding Benchmark
A.S.E (AICGSecEval) is a repository-level AI-generated code security benchmark
LongBench v2 and LongBench (ACL '25 & '24)
Meta Agents Research Environments is a comprehensive platform for evaluating AI agents
Visual Causal Flow
Strong, Economical, and Efficient Mixture-of-Experts Language Model
Code for the paper "Evaluating Large Language Models Trained on Code"
CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)
Leaderboard Comparing LLM Performance at Producing Hallucinations
Provider-agnostic, open-source evaluation infrastructure
Benchmark LLMs by fighting in Street Fighter 3
A Python toolbox for scalable outlier detection
MemU is an open-source memory framework for AI companions
Clean and efficient FP8 GEMM kernels with fine-grained scaling
Collection of reference environments for offline reinforcement learning
Advanced Privacy-Preserving Federated Learning framework
Utility package for accessing common Machine Learning datasets
Simulation framework for accelerating research
A reinforcement learning package for Julia
Geometric deep learning extension library for PyTorch