Qwen2.5 is a series of large language models developed by the Qwen team at Alibaba Cloud for natural language understanding and generation across many languages. The models come in sizes of 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters to suit different computational budgets. Pretrained on up to 18 trillion tokens, Qwen2.5 shows marked improvements in instruction following, long-text generation (over 8,000 tokens), and comprehension of structured data such as tables and JSON. The models support context lengths of up to 128,000 tokens and over 29 languages, including Chinese, English, French, and Spanish. They are open-source under the Apache 2.0 license, with weights and documentation available on Hugging Face and ModelScope.
Features
- Powerful document parsing: parses multi-scene, multilingual documents, including handwriting, tables, charts, formulas, and sheet music
- Precise object grounding: detects, points to, and counts objects; supports absolute-coordinate and JSON output formats for fine-grained spatial reasoning
- Ultra-long video understanding & fine-grained video grounding: handles hours-long videos and localizes events with second-level precision via dynamic frame rate and temporal resolution
- Enhanced vision encoder: uses window attention in the Vision Transformer, SwiGLU and RMSNorm optimizations, and dynamic-resolution sampling for images and videos
- Multi-modal input support: accepts images, videos, and text via local files, URLs, or base64 encoding; media and text can be freely interleaved
- Flexible deployment: quantized versions (Int8, etc.), model sizes from small (3B) to large (72B); deployable via Hugging Face, ModelScope, Docker, or vLLM; includes demos and web UIs
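The object-grounding bullet above mentions JSON output with absolute coordinates. A minimal sketch of consuming such a response, assuming the model emits a JSON array with `bbox_2d` (absolute `[x1, y1, x2, y2]` pixel coordinates) and `label` keys; the exact schema and helper name are illustrative, not a fixed API:

```python
import json

def parse_grounding(output_text):
    """Parse a JSON grounding response into (label, box) pairs.

    Assumes absolute pixel coordinates under a "bbox_2d" key and a
    "label" key per detection -- the format here is an assumption.
    """
    detections = json.loads(output_text)
    return [(d["label"], tuple(d["bbox_2d"])) for d in detections]

# Hypothetical model reply for "detect all cats":
reply = '[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]'
print(parse_grounding(reply))  # [('cat', (10, 20, 110, 220))]
```

Because coordinates are absolute pixels rather than normalized fractions, the boxes can be drawn directly on the original image without rescaling.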
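The dynamic frame rate / temporal resolution point can be illustrated with a frame-sampling sketch: downsample a long video to a target frame rate, then thin uniformly to a frame budget. The function name and the `max_frames` cap are illustrative assumptions, not documented parameters:

```python
def sample_frame_indices(duration_s, native_fps, target_fps, max_frames=768):
    """Pick frame indices for long-video input at a reduced frame rate (sketch).

    Caps the total frame count so multi-hour videos stay within a fixed
    budget; max_frames=768 is an illustrative limit, not a documented one.
    """
    total = int(duration_s * native_fps)
    step = max(1, round(native_fps / target_fps))
    idx = list(range(0, total, step))
    if len(idx) > max_frames:
        # Uniformly thin the sampled frames down to the cap.
        stride = len(idx) / max_frames
        idx = [idx[int(i * stride)] for i in range(max_frames)]
    return idx

# A one-hour 30 fps video sampled at 2 fps still exceeds the cap,
# so it is thinned to exactly max_frames indices:
hour = sample_frame_indices(3600, native_fps=30, target_fps=2)
print(len(hour))  # 768
```

Second-level event grounding then amounts to mapping each sampled index back to a timestamp via `index / native_fps`.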
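Dynamic-resolution sampling, mentioned in the vision-encoder bullet, can be sketched as snapping an image to patch-aligned dimensions inside a pixel budget. The patch size and pixel limits below are assumptions for illustration; the released preprocessing code may use different defaults:

```python
import math

PATCH = 28  # assumed ViT patch size after token merging -- illustrative

def smart_resize(h, w, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Snap (h, w) to multiples of PATCH within a pixel budget (sketch)."""
    h_bar = max(PATCH, round(h / PATCH) * PATCH)
    w_bar = max(PATCH, round(w / PATCH) * PATCH)
    if h_bar * w_bar > max_pixels:
        # Too many pixels: shrink while keeping aspect ratio.
        beta = math.sqrt(h * w / max_pixels)
        h_bar = math.floor(h / beta / PATCH) * PATCH
        w_bar = math.floor(w / beta / PATCH) * PATCH
    elif h_bar * w_bar < min_pixels:
        # Too few pixels: enlarge while keeping aspect ratio.
        beta = math.sqrt(min_pixels / (h * w))
        h_bar = math.ceil(h * beta / PATCH) * PATCH
        w_bar = math.ceil(w * beta / PATCH) * PATCH
    return h_bar, w_bar

print(smart_resize(1000, 1500))  # (812, 1204)
```

Because the token count scales with the resized area, this keeps arbitrary input resolutions within a predictable compute envelope instead of forcing a fixed square crop.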
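The multi-modal input bullet can be made concrete with a sketch of an interleaved message payload mixing a base64-encoded image, a URL image, and text. The message schema follows the chat-style format used in Qwen's examples, but the exact keys should be checked against the official docs; the image bytes here are a placeholder, not a valid image:

```python
import base64

# Placeholder bytes standing in for a real image file (not a valid PNG).
fake_image_bytes = b"\x89PNG\r\n\x1a\n"
b64 = base64.b64encode(fake_image_bytes).decode()

# Interleaved media and text in one user turn (schema is an assumption):
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"data:image;base64,{b64}"},
            {"type": "image", "image": "https://example.com/chart.png"},
            {"type": "text", "text": "Compare these two charts."},
        ],
    },
]
```

Local files, URLs, and base64 data URIs are interchangeable in the `image` field, so the same payload shape covers all three input routes the feature list mentions.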