3DCF-Labs/doc2dataset

3DCF / doc2dataset: token-efficient document layer with NumGuard numeric integrity and multi-framework exports for RAG & fine-tuning.

40 / 100 (Emerging)

# Technical Summary: 3DCF / doc2dataset

Ingests diverse document formats (PDF, Markdown, HTML, CSV, JSON, TeX) into a normalized three-file index (`documents.jsonl`, `pages.jsonl`, `cells.jsonl`) with deterministic macro-cell extraction, then generates task-specific datasets (QA, summary, RAG) with per-cell NumGuard hashes for numeric corruption detection. The Rust core achieves 3–6× token compression versus baseline extractors while maintaining QA accuracy and numeric faithfulness, with exports targeting HuggingFace, LLaMA-Factory, Axolotl, OpenAI finetune, and custom RAG stacks. Provides a CLI, an HTTP service with UI, and Python/Node bindings; the encoder/serializer is published as `three-dcf-core` on crates.io for library integration.
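The summary does not say how NumGuard computes its per-cell hashes, so the following is only a rough illustration of the general idea (hash the numbers in a cell, then re-check the hash later to detect numeric corruption). It is not the project's algorithm. The function names `extract_numbers`, `numeric_hash`, and `numbers_intact`, the regular expression, and the choice of SHA-256 are all assumptions made for this sketch.

```python
import hashlib
import re
from decimal import Decimal, InvalidOperation

# Hypothetical number pattern: optional sign, digits with optional thousands
# separators, optional fraction, optional exponent. Not taken from NumGuard.
_NUMBER = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?(?:[eE][-+]?\d+)?")


def extract_numbers(text: str) -> list[Decimal]:
    """Return the numeric literals found in text, in order of appearance."""
    values = []
    for token in _NUMBER.findall(text):
        try:
            values.append(Decimal(token.replace(",", "")))
        except InvalidOperation:
            continue  # skip anything Decimal cannot parse
    return values


def numeric_hash(text: str) -> str:
    """Hash only the numbers in text, ignoring surrounding words and formatting."""
    # normalize() drops trailing zeros so "1234.50" and "1234.5" hash alike.
    canonical = "|".join(format(v.normalize(), "f") for v in extract_numbers(text))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def numbers_intact(cell_text: str, stored_hash: str) -> bool:
    """True if the numbers in cell_text still match the hash recorded earlier."""
    return numeric_hash(cell_text) == stored_hash
```

Under these assumptions, a hash recorded at extraction time can be compared against text that has passed through later processing: a reformatted number such as `1,234.50` versus `1234.5` still matches, while a changed digit does not.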

No Package · No Dependents
Maintenance 10 / 25
Adoption 8 / 25
Maturity 13 / 25
Community 9 / 25


Stars 56
Forks 5
Language Rust
License Apache-2.0
Last pushed Feb 10, 2026
Commits (30d) 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/3DCF-Labs/doc2dataset"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
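The curl request above can also be issued from a script. This is a minimal sketch using only the Python standard library. It assumes the path follows a `category/owner/repo` layout (inferred from the single URL shown) and that the endpoint returns JSON; the response fields are not documented here, so the code does not rely on any. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL, assuming a category/owner/repo path layout."""
    parts = [urllib.parse.quote(p, safe="") for p in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response body, assuming it is JSON."""
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example call (performs a network request and counts against the daily limit):
# data = fetch_quality("rag", "3DCF-Labs", "doc2dataset")
# print(json.dumps(data, indent=2))
```

Keeping URL construction separate from the network call makes it possible to check the generated URL without spending any of the daily request quota.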