A microbenchmark support library
A list of open LLMs available for commercial use
A benchmarking framework for the Julia language
Agentic, Reasoning, and Coding (ARC) foundation models
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
A.S.E (AICGSecEval) is a repository-level AI-generated code security evaluation benchmark
LongBench v2 and LongBench (ACL '25 & '24)
Visual Causal Flow
Meta Agents Research Environments is a comprehensive platform for evaluating AI agents in dynamic, realistic scenarios
Integrates the JMH benchmarking framework with Gradle
A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Leaderboard Comparing LLM Performance at Producing Hallucinations
Provider-agnostic, open-source evaluation infrastructure
Benchmark LLMs by fighting in Street Fighter 3
Import public NYC taxi and for-hire vehicle (Uber, Lyft) trip data
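For context, the NYC TLC publishes these trip records as monthly Parquet files, so they load directly with pandas. A minimal sketch, assuming the yellow-cab schema; the file name below is a hypothetical local path, not one from this repository:

```python
import pandas as pd

# One month of yellow-cab trip records (hypothetical local file name;
# the TLC distributes these as monthly Parquet files).
trips = pd.read_parquet("yellow_tripdata_2024-01.parquet")

# Columns assume the standard yellow-cab schema.
print(trips[["tpep_pickup_datetime", "trip_distance", "total_amount"]].head())
```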
The Abstraction and Reasoning Corpus
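ARC tasks are distributed as JSON files, each holding "train" and "test" lists of input/output pairs, where every grid is a list of rows of integers 0-9 (colors). A minimal loading sketch; the file path is hypothetical:

```python
import json

# Each ARC task file contains "train" demonstration pairs and "test" pairs;
# grids are lists of rows of integers 0-9.
with open("arc_task.json") as f:  # hypothetical path
    task = json.load(f)

for pair in task["train"]:
    grid_in, grid_out = pair["input"], pair["output"]
    print(f"train pair: {len(grid_in)}x{len(grid_in[0])} -> "
          f"{len(grid_out)}x{len(grid_out[0])}")
```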
General plug-and-play inference library for Recursive Language Models
MemU is an open-source memory framework for AI companions
Clean and efficient FP8 GEMM kernels with fine-grained scaling
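Fine-grained scaling means one scale factor per small tile rather than per tensor, so an outlier only degrades the quantization of its own block. A NumPy emulation of that idea, not the library's CUDA kernels; the block size is illustrative and the rounding is a stand-in for a real FP8 cast:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max representable magnitude in FP8 E4M3
BLOCK = 128           # per-block scaling granularity (illustrative)

def quantize_blockwise(x: np.ndarray):
    """Emulate fine-grained quantization: one scale per 1xBLOCK tile along
    the last axis, so outliers stay contained within their block."""
    m, k = x.shape
    x = x.reshape(m, k // BLOCK, BLOCK)
    scale = np.abs(x).max(axis=-1, keepdims=True) / FP8_E4M3_MAX + 1e-12
    q = np.round(x / scale)  # stand-in for casting to FP8
    return q, scale

a = np.random.randn(4, 256).astype(np.float32)
q, s = quantize_blockwise(a)
a_hat = (q * s).reshape(a.shape)  # dequantize
print("max abs error:", np.abs(a - a_hat).max())
```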
A collection of reference environments for offline reinforcement learning
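If this follows D4RL-style conventions, each environment pairs the usual Gym interface with a pre-collected dataset. A minimal sketch; the environment id and `get_dataset()` call assume the D4RL API:

```python
import gym
import d4rl  # registers offline-RL environments with Gym (assumed installed)

# Environment id follows D4RL naming conventions (assumption).
env = gym.make("halfcheetah-medium-v2")
dataset = env.get_dataset()  # dict of numpy arrays

print(dataset["observations"].shape,
      dataset["actions"].shape,
      dataset["rewards"].shape)
```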
Minimal examples of data structures and algorithms in Python
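The typical entry in such a collection is a short, self-contained function; the binary search below is an illustrative example in that style, not code taken from the repository:

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if absent. O(log n)."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

assert binary_search([1, 3, 5, 7, 9], 7) == 3
assert binary_search([1, 3, 5, 7, 9], 4) == -1
```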
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
A Gym environment for web task automation
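Such environments plug into the standard Gymnasium control loop. A generic sketch; the environment id is hypothetical and the random action is a placeholder for a real policy:

```python
import gymnasium as gym

# Hypothetical id; web-automation envs register themselves with Gymnasium.
env = gym.make("webtask/ExampleTask-v0")

obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # placeholder policy: random actions
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```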
An experimental version of the DeepSeek model
SAPIEN Manipulation Skill Framework