Showing 237 open source projects for "text parsing"

  • 1
    text-extract-api

    Document (PDF, Word, PPTX ...) extraction and parse API

    ...Instead of requiring developers to integrate multiple document parsing libraries individually, the system centralizes text extraction capabilities into a unified API that standardizes the output. The platform supports automated processing pipelines that detect file types and apply the appropriate extraction method to obtain the most accurate text representation possible. It can be integrated into document analysis systems, knowledge retrieval tools, and AI pipelines that rely on clean textual data. ...
    Downloads: 3 This Week
  • 2
    LiteParse

    A fast, helpful, and open-source document parser

    LiteParse is an open-source lightweight parsing library designed to extract structured data from unstructured text using large language models in an efficient and cost-effective manner. It focuses on simplifying the process of turning raw text into structured outputs such as JSON by providing a streamlined interface for prompt-based parsing. The system is designed to minimize overhead, making it suitable for applications where performance and cost are critical considerations. ...
    Downloads: 8 This Week
  • 3
    YAML

    JavaScript parser and stringifier for YAML

    yaml is a definitive library for YAML, the human-friendly data serialization standard. This library supports both YAML 1.1 and YAML 1.2 and all common data schemas, and it passes all of the yaml-test-suite tests. It can accept any string as input without throwing, parsing as much YAML out of it as it can, and supports parsing, modifying, and writing YAML comments and blank lines. The library is released under the ISC open source license, and the code is available on GitHub. It has no external...
    Downloads: 12 This Week
  • 4
    TextFSM

    Python module for parsing semi-structured text into python tables

    TextFSM is a Python library created by Google that provides a template-based state machine engine for parsing semi-structured text. It is particularly useful for extracting structured data from command-line interface (CLI) outputs, such as those from network devices, routers, and switches. By defining parsing logic through reusable template files, TextFSM transforms unstructured text into structured data like lists or tables without requiring complex regular expression code. ...
    Downloads: 0 This Week
  • 5
    npm-pdfreader

    Parse text and tables from PDF files.

    npm-pdfreader is a Node.js library for reading text and parsing tables from PDF files. It supports tabular data with automatic column detection and rule-based parsing, making it useful for extracting structured data from PDFs.
    Downloads: 6 This Week
  • 6
    Ksoup

    Ksoup is a lightweight Kotlin Multiplatform library for parsing HTML

    Ksoup is a lightweight Kotlin Multiplatform library for parsing HTML, extracting HTML tags, attributes, and text, and encoding and decoding HTML entities.
    Downloads: 5 This Week
  • 7
    Markdig

    A fast, powerful, CommonMark compliant, extensible Markdown processor

    Markdig is a fast, powerful, CommonMark-compliant, extensible Markdown processor for .NET. Its parser and HTML renderer are very fast (no regular expressions) and put little pressure on the GC. It produces an abstract syntax tree with precise source locations, which is useful when building a Markdown editor. Check out MarkdownEditor for Visual Studio powered by Markdig! Even the core Markdown/CommonMark parsing is pluggable, so built-in parsing can be disabled (e.g., HTML parsing) or...
    Downloads: 6 This Week
  • 8
    RAG Anything

    RAG-Anything: All-in-One RAG Framework

    ...The system uses a multi-stage pipeline (e.g., document parsing, content analysis, knowledge graph construction, intelligent retrieval) so queries can navigate across modalities with deeper understanding and relevance.
    Downloads: 4 This Week
  • 9
    ChordSheetJS

    A JavaScript library for parsing and formatting chords and chord sheets

    ChordSheetJS is a JavaScript library for parsing, formatting, and transposing chord sheets. It supports various chord sheet formats and provides tools for rendering and manipulating chord and lyric data.
    Downloads: 4 This Week
  • 10
    LlamaParse

    Parse files for optimal RAG

    LlamaParse is a GenAI-native document parser that can parse complex document data for any downstream LLM use case (RAG, agents). Load in 160+ data sources and data formats, from unstructured and semi-structured to structured data (APIs, PDFs, documents, SQL, etc.). Store and index your data for different use cases. Integrate with 40+ vector stores, document stores, graph stores, and SQL database providers.
    Downloads: 2 This Week
  • 11
    GROBID

    A machine learning software for extracting information

    GROBID is a machine learning library for extracting, parsing, and restructuring raw documents such as PDFs into structured XML/TEI-encoded documents, with a particular focus on technical and scientific publications. Development started in 2008 as a hobby, and the tool was released as open source in 2011. Work on GROBID has been steady as a side project since the beginning and is expected to continue as such. Header extraction and parsing from articles in PDF format. The...
    Downloads: 5 This Week
  • 12
    markdown-it

    Markdown parser, done right. 100% CommonMark support, extensions

    markdown-it is a fast and extensible JavaScript-based Markdown parser designed to convert Markdown text into HTML while maintaining strict compliance with the CommonMark specification and offering additional syntax enhancements. It is widely used in web applications, documentation tools, and content platforms due to its high performance and flexibility. The library is built with a rule-based parsing system that allows developers to customize or replace syntax rules, making it adaptable to a wide variety of use cases. ...
    Downloads: 0 This Week
  • 13
    tree-sitter

    An incremental parsing system for programming tools

    Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a source file and efficiently update the syntax tree as the source file is edited. General enough to parse any programming language. Fast enough to parse on every keystroke in a text editor. Robust enough to provide useful results even in the presence of syntax errors. Dependency-free so that the runtime library (which is written in pure C) can be embedded in any application. ...
    Downloads: 9 This Week
  • 14
    Notion-to-MD

    Convert Notion pages, blocks, and lists of blocks to Markdown

    Notion-to-MD is a Node.js package that converts Notion pages, blocks, and lists of blocks to Markdown (with nesting support) using notion-sdk-js.
    Downloads: 5 This Week
  • 15
    Helix

    A post-modern modal text editor

    Helix is a modal (Kakoune/Vim‑inspired) terminal-based text editor written in Rust. It features modern modal editing, multiple selections, smart syntax highlighting, and built-in language server (LSP) integration leveraging tree‑sitter for fast, incremental parsing and code intelligence.
    Downloads: 9 This Week
  • 16
    zpdf

    Zero-copy PDF text extraction library written in Zig

    zpdf is a high-performance PDF text extraction library written in Zig that focuses on speed, low overhead, and modern parsing techniques. It leans heavily on memory-mapped file reading and zero-copy patterns where possible, so it can scan large PDFs without repeatedly copying data around in memory. The library supports streaming extraction using efficient arena allocation, making it well suited for workloads that need to process big documents quickly or in batches.
    Downloads: 0 This Week
  • 17
    commonmark-java

    Java library for parsing and rendering CommonMark (Markdown)

    Java library for parsing and rendering Markdown text according to the CommonMark specification (and some extensions). Provides classes for parsing input to an abstract syntax tree of nodes (AST), visiting and manipulating nodes, and rendering to HTML. It started out as a port of commonmark.js, but has since evolved into a full library with a nice API.
    Downloads: 0 This Week
  • 18
    ELisp Tree-sitter

    Tree-sitter bindings for Emacs Lisp

    ...The minor mode tree-sitter-mode provides a buffer-local syntax tree, which is kept up-to-date with changes to the buffer’s text. Run M-x tree-sitter-hl-mode to replace the regex-based highlighting provided by font-lock-mode with tree-based syntax highlighting.
    Downloads: 3 This Week
  • 19
    ANTLR

    Parser generator to read, process, or translate structured text

    ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. From a grammar, ANTLR generates a parser that can build and walk parse trees. It is widely used in academia and industry to build all sorts of languages, tools, and frameworks. Twitter search uses ANTLR for query parsing, with over 2 billion queries a day. ...
    Downloads: 5 This Week
  • 20
    py-pdf-parser

    A Python tool to help extract information from structured PDFs

    py-pdf-parser is a Python tool designed to help extract information from structured PDFs. It provides a simple interface to define parsing rules and extract data from PDF documents.
    Downloads: 8 This Week
  • 21
    amrlib

    A python library that makes AMR parsing, generation and visualization

    amrlib is a Python module that makes Abstract Meaning Representation (AMR) parsing, generation, and visualization simple by providing the following functions. Sentence to Graph (StoG) parsing to create AMR graphs from English sentences. Graph to Sentence (GtoS) generation for turning AMR graphs into English sentences. A Qt-based GUI to facilitate the conversion of sentences to graphs and back to sentences. Methods to plot AMR graphs...
    Downloads: 0 This Week
  • 22
    Extractous

    Fast and efficient unstructured data extraction

    Extractous is a Rust-based unstructured data extraction library focused on fast local parsing of documents and other content-heavy files. Its purpose is to extract text and metadata efficiently from formats such as PDF, Word, HTML, email archives, images, and more, without depending on external APIs or separate parsing servers. The project emphasizes performance and low memory usage, and its maintainers describe it as a local-first alternative to heavier extraction stacks. ...
    Downloads: 1 This Week
  • 23
    SemTools

    Semantic search and document parsing tools for the command line

    SemTools is an open-source command-line toolkit designed for document parsing, semantic indexing, and semantic search workflows. The project focuses on enabling developers and AI agents to process large document collections and extract meaningful semantic representations that can be searched efficiently. Built with Rust for performance and reliability, the toolchain provides fast processing of text and structured documents while maintaining low system overhead.
    Downloads: 13 This Week
  • 24
    mavonEditor

    A markdown editor based on Vue

    A Markdown editor based on Vue that supports a variety of personalized features. All toolbar properties default to true; you can pass a custom object to override them. The language parsing files and code highlighting provided by highlight.js are loaded on demand. github-markdown-css and KaTeX are loaded only when mounted.
    Downloads: 1 This Week
  • 25
    dots.ocr

    Multilingual Document Layout Parsing in a Single Vision-Language Model

    dots.ocr is a cutting-edge multilingual document parsing system built on a unified vision-language model that combines layout detection, text recognition, and structural understanding into a single architecture. Unlike traditional OCR pipelines that rely on multiple specialized components, dots.ocr integrates these processes end-to-end, reducing error propagation and improving consistency across tasks.
    Downloads: 0 This Week