3DCF-Labs/doc2dataset

3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.

/ 100

Emerging

# Technical Summary: 3DCF / doc2dataset

3DCF / doc2dataset ingests diverse document formats (PDF, Markdown, HTML, CSV, JSON, TeX) into a normalized three-file index (`documents.jsonl`, `pages.jsonl`, `cells.jsonl`) using deterministic macro-cell extraction. It then generates task-specific datasets (QA, summary, RAG) and attaches a NumGuard hash to each cell so that numeric corruption can be detected. The Rust core achieves 3–6× token compression versus baseline extractors while maintaining QA accuracy and numeric faithfulness. Exports target HuggingFace, LLaMA-Factory, Axolotl, OpenAI fine-tuning, and custom RAG stacks. The project provides a CLI, an HTTP service with a UI, and Python/Node bindings; the encoder/serializer is published as `three-dcf-core` on crates.io for library integration.
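To make the idea of a per-cell numeric hash concrete, here is a minimal Python sketch of how numeric corruption could be detected by fingerprinting the numbers in a cell. It is an illustration under assumptions and not the project's NumGuard algorithm: the regular expression, the function names `numeric_fingerprint` and `numbers_intact`, and the record fields `cell_id`, `text`, and `num_hash` are all hypothetical and do not come from the 3DCF schema.

```python
import hashlib
import re

# Hypothetical pattern (an assumption, not NumGuard's rule): integers and
# decimals with an optional sign and optional thousands separators.
NUMBER_RE = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")


def numeric_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the ordered numbers found in `text`."""
    # Strip thousands separators so "1,204" and "1204" hash identically.
    numbers = [m.group(0).replace(",", "") for m in NUMBER_RE.finditer(text)]
    return hashlib.sha256("|".join(numbers).encode("utf-8")).hexdigest()


def numbers_intact(original_text: str, candidate_text: str) -> bool:
    """True if both texts contain the same numbers in the same order."""
    return numeric_fingerprint(original_text) == numeric_fingerprint(candidate_text)


# Hypothetical cell record; the field names are illustrative only.
cell = {"cell_id": "c1", "text": "Revenue rose 12.5% to $1,204 million in 2023."}
cell["num_hash"] = numeric_fingerprint(cell["text"])

reworded = "Revenue grew 12.5% to $1,204 million during 2023."
corrupted = "Revenue rose 12.5% to $1,240 million in 2023."

print(numbers_intact(cell["text"], reworded))   # wording changed, numbers kept
print(numbers_intact(cell["text"], corrupted))  # one figure has transposed digits
```

Because this sketch hashes the numbers in order, a rewrite that reorders figures would also be flagged; whether that behavior is desirable depends on the downstream task.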

No Package No Dependents

Maintenance 10 / 25

Adoption 8 / 25

Maturity 13 / 25

Community 9 / 25


Language: Rust

License: Apache-2.0

Higher-rated alternatives

thiswillbeyourgithub/wdoc

Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype,...

laxmimerit/RAGWire

Production-grade RAG toolkit — ingest PDFs, DOCX, XLSX into Qdrant with LLM metadata extraction,...

Arterning/DeepParseX

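DeepParseX is a powerful multimodal document parsing and knowledge management platform that supports intelligent parsing of PDF, Word, Excel, PPT, image, video, audio, and other file formats, automatically extracts key information, and builds...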

NoEdgeAI/pdfdeal

A python wrapper for the Doc2X API and comes with native texts processing (to improve PDF recall...

atpuxiner/docsloader

This is a document loader. (Document parsing loader, RAG document parsing, RAG knowledge base construction)
