pdf indexing free download

Showing 46 open source projects for "pdf indexing"

1

Scribe.js

JavaScript OCR and text extraction for images and PDFs

Scribe.js is a JavaScript library that provides Optical Character Recognition (OCR) and text extraction capabilities for both images and PDF documents, aimed at developers who want to build OCR features directly into their applications. The library can take image files (such as PNG or JPEG) and recognize the text they contain, and it can also extract text from PDF files that either already contain text or are image-based scans, using modern web standards and WebAssembly under the hood. ...

Downloads: 7 This Week

Last Update: 2026-03-14
See Project
2

Memvid

Video-based AI memory library. Store millions of text chunks in MP4

...This innovative approach uses standard video containers and offers millisecond-level semantic search across large corpora with dramatically less storage than vector DBs. It's self-contained—no DB needed—and supports features like PDF indexing, chat integration, and cloud dashboards.

Downloads: 48 This Week

Last Update: 2026-03-13
See Project
3

shuyuan

Reading book source

...It likely supports different input formats (text, HTML, PDF), and may integrate optional translation or text normalization tools.

Downloads: 0 This Week

Last Update: 2025-11-28
See Project
4

Everything cURL

The book documenting the curl project, the curl tool, libcurl

Everything curl is an extensive, continuously maintained book that documents the entire curl ecosystem: the curl command-line tool, the libcurl library, the project’s history and development practices, and practical guidance for using and contributing to curl. The project is written as an open source book (CC-BY-4.0) and is available in multiple formats and locations, including an online website, PDF, and ePub so readers can pick the format that suits them. Content ranges from...

Downloads: 2 This Week

Last Update: 2026-04-03
See Project
5

DB-GPT

Revolutionizing Database Interactions with Private LLM Technology

DB-GPT is an experimental open-source project that uses localized GPT large models to interact with your data and environment. With this solution, you can be assured that there is no risk of data leakage, and your data is 100% private and secure.

Downloads: 7 This Week

Last Update: 2026-03-27
See Project
6

Create Index from PDF

PDF Indexing Script: Searches PDF for words, records page numbers

...As it processes the PDF, the script prints the current page being analyzed, providing users with progress visibility. The final output is a text file with each word followed by the page numbers where it appears, separated by commas. This script is ideal for anyone looking to build an automated index for their PDF documents. With detailed comments and a clear structure, it's easy to customize and use for various indexing projects for researchers, authors, and anyone needing a precise and automated indexing solution.
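The word-to-pages index described above can be sketched in a few lines of Python. This is a minimal illustration and not the project's actual script: the function names `build_index` and `format_index` are hypothetical, the pages are assumed to have already been extracted to plain text (for example with a PDF library), and the exact output layout (`word: 1, 3`) is an assumption based on the description.

```python
import re


def build_index(pages, words):
    """Map each search word to the page numbers on which it appears.

    pages: list of page texts, first page first (text extraction from the
           PDF is assumed to have happened already).
    words: the terms to look up.
    """
    # Whole-word, case-insensitive matching is an assumption of this sketch.
    patterns = {
        word: re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for word in words
    }
    index = {}
    for page_number, text in enumerate(pages, start=1):
        # Progress output, mirroring the "current page" message described above.
        print(f"Analyzing page {page_number}")
        for word, pattern in patterns.items():
            if pattern.search(text):
                index.setdefault(word, []).append(page_number)
    return index


def format_index(index):
    """Render one line per word: the word, then its comma-separated pages."""
    lines = []
    for word in sorted(index, key=str.lower):
        page_list = ", ".join(str(page) for page in index[word])
        lines.append(f"{word}: {page_list}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample_pages = ["Alpha beta", "gamma", "beta again ALPHA"]
    print(format_index(build_index(sample_pages, ["alpha", "beta"])))
```

Writing the result of `format_index` to a text file would give the kind of output file the description mentions; words that never occur are simply left out of the index in this sketch.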

Downloads: 0 This Week

Last Update: 2025-03-03
See Project
7

PageIndex

Document Index for Vectorless, Reasoning-based RAG

PageIndex is an innovative open-source framework that reimagines retrieval-augmented generation (RAG) by eliminating conventional vector similarity search and instead building hierarchical semantic indexes that mirror a document’s natural structure. Rather than chunking text and embedding it into a vector database, PageIndex constructs a tree-structured index — similar to a detailed, AI-enhanced table of contents — that a large language model can traverse to locate the most relevant sections...

Downloads: 0 This Week

Last Update: 2026-04-08
See Project
8

OCRBase

MD/.JSON Document OCR and structured data extraction API

OCRBase is a self-hostable document OCR and structured extraction system built to turn PDFs into machine-usable outputs at scale, aiming to bridge the gap between raw text extraction and production-ready pipelines. Instead of treating OCR as a one-off script, it presents an API-driven workflow where documents are submitted as jobs and processed through a queue-based architecture that can handle high throughput. The core output is designed for downstream automation, producing structured...

Downloads: 0 This Week

Last Update: 23 hours ago
See Project
9

Open Semantic Search

Open source semantic search and text analytics for large document sets

Open Semantic Search is an open source research and analytics platform designed for searching, analyzing, and exploring large collections of documents using semantic search technologies. It provides an integrated search server combined with a document processing pipeline that supports crawling, text extraction, and automated analysis of content from many different sources. Open Semantic Search includes an ETL framework that can ingest documents, process them through analysis steps, and...

Downloads: 5 This Week

Last Update: 4 days ago
See Project
10

AnyTXT Searcher

A Powerful Desktop Full-Text Search Engine, Just Like Local Google.

...It has a powerful built-in document parsing engine that extracts the text of commonly used file formats without requiring any other software, and a built-in high-speed indexing system that stores the metadata of the text. With AnyTXT you can find any text in any file on your disk in roughly 0.1 second. It works on Windows 11, 10, 8, 7, Vista, XP, 2008, 2012, 2016, 2022... AnyTXT Searcher supports the following file formats: plain text (txt, cpp, py, html, etc.); Microsoft OneNote (one); Microsoft Word (doc, docx); Microsoft Excel (xls, xlsx); Microsoft PowerPoint (ppt, pptx); PDF; WPS Office (wps, et, dps); eBook (epub, mobi, azw3, fb2, etc.) ...

14 Reviews

Downloads: 5,944 This Week

Last Update: 2025-06-19
See Project
11

Hypernomicon

Hypertext-infused philosophy personal database software

Hypernomicon is a personal productivity/database application for researchers that combines structured note-taking, mind-mapping, management of files (e.g., PDFs) and folders, and reference management into an integrated environment that organizes all of the above into semantic networks or hierarchies in terms of debates, positions, arguments, labels, terminology/concepts, and user-defined keywords by means of database relations and automatically generated hyperlinks (hence ‘Hyper’ in the...

4 Reviews

Downloads: 21 This Week

Last Update: 3 days ago
See Project
12

DocumentGrep

Search text or a regular expression in multiple documents

This is a GUI for the command line tools grep, pdfgrep, pdftotext, unrtf, odt2txt, antiword, docx2txt, html2text and libreoffice. DocumentGrep searches text in multiple file types. You can use regular expressions for the search (https://en.wikipedia.org/wiki/Regular_expression). This GUI and the command line tools work without indexing. Either the document is converted into text and processed by the RegExpr library of Andrey V. Sorokin or handled by the CLI command itself (like...

Downloads: 7 This Week

Last Update: 2026-01-13
See Project
13

myFilterWheel ASCOM DIY

Modify a manual filterwheel and add stepper motor and Arduino

A project by Clive Stachon, Pete I, Paul P and Robert Brown to convert a manual 5-slot filter wheel to automatic operation using an Arduino Nano and a stepper motor. A Windows application, ASCOM driver, and Arduino firmware are provided. Updated to reflect the new PDF, firmware, and applications based on contributions from Pete. The project supports 4-, 5-, 7-, and 9-slot filter wheels.

1 Review

Downloads: 18 This Week

Last Update: 2026-01-25
See Project
14

DocSearcher

DocSearcher is a search tool for indexing and searching files on a personal computer. It uses APIs to provide search functionality for common document formats, currently Word, Excel, PDF, Libre/Open/StarOffice, RTF, Text, and HTML.

2 Reviews

Downloads: 3 This Week

Last Update: 2026-02-02
See Project
15

LexiFinder

AI-powered semantic indexing: automating the creation of book indexes

LexiFinder is a tool to generate analytic indexes from documents automatically. Given one or more source documents and a set of keywords, it extracts all nouns, compares them semantically to the keywords using a pretrained NLP model, and produces a structured, hierarchical index ready to be included in a book or manuscript. LexiFinder works in two ways: as a command-line tool for scripting, automation, and batch processing, and as a graphical application for a guided, point-and-click...

Downloads: 4 This Week

Last Update: 2026-03-04
See Project
16

elibsrv

a light OPDS/HTML server indexing EPUB and PDF files

elibsrv is a light, standalone OPDS server for Linux. It generates an OPDS repository of EPUB and/or PDF files scanned from on-disk directories. It also provides a simple HTML interface for non-OPDS humans, which makes it a good fit for both OPDS-aware devices (like Android with FBReader or Aldiko) and browsers with EPUB/PDF capabilities (e.g., Firefox with the excellent EPUBReader plugin). It's worth noting that elibsrv is a complete solution, i.e., it doesn't rely on third...

Downloads: 0 This Week

Last Update: 2024-11-10
See Project
17

PdfgrepGui

This is a simple GUI for the command line tools grep and pdfgrep

THIS PROJECT HAS MOVED TO: https://sourceforge.net/projects/documentgrep/ This program is a GUI for the command line tools grep and pdfgrep. Pdfgrep searches text in multiple PDF files, and grep can search text in multiple text files. You can use regular expressions for the search (https://en.wikipedia.org/wiki/Regular_expression). This GUI and the command line tools work without indexing. The following command line options are used: -i (ignore case), -F (fixed strings), -n (print page number or output lines), and -H (print the file name for each match). ...
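To make the listed options concrete, here is a small sketch of how a front end might assemble a pdfgrep call from them. It is only an illustration under stated assumptions: the helper name `pdfgrep_command` and its parameters are hypothetical and are not taken from the project, and the sketch only builds the argument list without running anything.

```python
import shlex


def pdfgrep_command(pattern, files, ignore_case=True, fixed_strings=False):
    """Build an argument list for pdfgrep (hypothetical helper).

    -n prints the page number and -H prints the file name for each match,
    as described in the listing; -i and -F are toggled by the caller.
    """
    cmd = ["pdfgrep", "-n", "-H"]
    if ignore_case:
        cmd.append("-i")  # ignore case
    if fixed_strings:
        cmd.append("-F")  # treat the pattern as a fixed string, not a regex
    cmd.append(pattern)
    cmd.extend(files)
    return cmd


if __name__ == "__main__":
    # Print the command as a shell-quoted string for inspection.
    print(shlex.join(pdfgrep_command("index", ["a.pdf", "b.pdf"])))
```

With pdfgrep installed, the returned list could be passed to `subprocess.run(cmd, capture_output=True, text=True)`. Building a list rather than a single shell string avoids quoting problems with patterns that contain spaces.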

Downloads: 8 This Week

Last Update: 2026-01-13
See Project
18

pdf-extractor

Node.js module for rendering PDF pages to images, SVGs and HTML files

PDF text is converted to HTML. This can be used as a (transparent) layer over the image to enable text selection. PDF text is also extracted to a text file for other uses (e.g. indexing the text). In its most basic form, this library is a Node.js wrapper for pdf.js. It has default renderers to generate a default output, but is easily extended to incorporate custom logic or to generate different output.

Downloads: 1 This Week

Last Update: 2023-03-23
See Project
19

File System Crawler for Elasticsearch

Elasticsearch File System Crawler (FS Crawler)

This crawler helps index binary documents such as PDF, Open Office, and MS Office files. It crawls a local file system (or a mounted drive), indexing new files, updating existing ones, and removing old ones. It can also crawl remote file systems over SSH/FTP, and it offers a REST interface that lets you “upload” your binary documents to Elasticsearch.

Downloads: 0 This Week

Last Update: 2023-08-25
See Project
20

Paperless-ng

A supercharged version of paperless, scan, index and archive docs

Paperless is a simple Django application running in two parts, a Consumer (the thing that does the indexing) and a Web server (the part that lets you search & download already-indexed documents). Paper is a nightmare. Environmental issues aside, there’s no excuse for it in the 21st century. It takes up space, collects dust, doesn’t support any form of a search feature, indexing is tedious, it’s heavy and prone to damage & loss. I wrote this to make “going paperless” easier. I do not have to...

Downloads: 0 This Week

Last Update: 2022-03-04
See Project
21

OpenSearchServer Search Engine

An open source search engine with RESTful API and crawlers

OpenSearchServer is a powerful, enterprise-class search engine program. Using the web user interface, the crawlers (web, file, database, etc.) and the client libraries (REST/API, Ruby, Rails, Node.js, PHP, Perl), you can quickly and easily integrate advanced full-text search capabilities into your application: full-text search with basic semantics, join queries, boolean queries, facets and filters, document (PDF, Office, etc.) indexing, web scraping, etc. OpenSearchServer runs on...

31 Reviews

Downloads: 12 This Week

Last Update: 2018-08-26
See Project
22

Object Oriented Streetmap

C# class library for processing OpenStreetMap data

This is a class library written in C# for processing OpenStreetMap XML file extracts into a SQLite database for routing with different vehicle types and restrictions. Before rating or contributing please see the README file for a more complete summary and a list of todos.

Downloads: 0 This Week

Last Update: 2018-03-19
See Project
23

Marcion

The study environment of ancient languages (Coptic, Greek, Latin)

Marcion is software that forms a study environment for ancient languages (esp. Coptic, Greek, Latin) and provides many tools and resources (dictionaries, grammars, texts). Although Marcion focuses on the study of Gnosticism and early Christianity, it is a universal library that works with various file formats and lets you collect, organize, and back up texts of any kind. Overview of Gnostic sources in the Coptic language delivered with Marcion: Nag Hammadi Library; Berlin Codex; Codex...

4 Reviews

Downloads: 15 This Week

Last Update: 2020-07-11
See Project
24

IndexFile (IFile)

IFile, a PHP-based framework for indexing and searching documents

Index documents using the Lucene search engine or MySQL full-text search. IFile supports many types of documents: Rich Text Format (.rtf); Moving Picture Experts Group-1/2 Audio Layer 3 (.mp3); Joint Photographic Experts Group (.jpg - .jpeg); Tagged Image File Format (.tiff); Microsoft Word 97-2000 (.doc); Microsoft Word 2003-2007 (.docx); Microsoft Excel 97-2000 (.xls); Microsoft Excel 2003-2007 (.xlsx); Microsoft PowerPoint 2003-2007 (.pptx); OpenOffice.org Writer (.odt);...

Downloads: 0 This Week

Last Update: 2016-03-28
See Project
25

Personalized Search Engine

Personalized Search Engine for Your Files

MySearchEngine (Personalized Search Engine) is Java software that searches files and folders in an OS file system. It differs from general OS file search engines in that it personalizes the indexing setup, so users can choose which directories to index or remove from an existing index, and it can suggest queries just like Google's "Did you mean" feature. The customization of indexing and query suggestion greatly improves search speed and makes the user experience more comfortable. eLibrary can also extract text content from files of many widely used file types, such as pdf, doc, ppt, and mp3, to improve the index quality.

Downloads: 0 This Week

Last Update: 2015-11-19
See Project