ck-unifr/pdf_parsing

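PDF parsing (text, sections, tables, images, references), plus PDF question answering, summarization, and information extraction built on large language models (ChatGLM2-6B, RWKV) + langchain + streamlit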

37 / 100 Emerging

Combines PyMuPDF and PyPDF2 for multi-modal PDF extraction (text hierarchy, tables, images, references) with two separate LLM pipelines: RWKV-Raven-7B for summarization and ChatGLM2-6B for extracting structured reference metadata (author, title, year). It provides a complete Streamlit + LangChain QA interface with vector-based retrieval over the parsed content. The project acknowledges table extraction as a current limitation that calls for alternative approaches such as LayoutLM or table-transformer.
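To make "vector-based retrieval over parsed content" concrete, here is a minimal sketch of the idea. It is not the repository's code: the word-count "embedding", the function names, and the sample chunks are all assumptions chosen so the example runs with the standard library alone. A real pipeline would swap in a learned embedding model and a vector store.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical chunks, as a PDF parser might produce them.
chunks = [
    "Section 1 Introduction: this paper studies table detection in scanned documents.",
    "Table 2 reports precision and recall for each model.",
    "References: Smith, J. (2020). Layout analysis for PDF documents.",
]
top = retrieve("which table reports precision and recall", chunks, k=1)
print(top[0])
```

In a QA setup, the retrieved chunks would then be passed to the language model as context for answering the question.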

211 stars. No commits in the last 6 months.

No License, Stale 6m, No Package, No Dependents
Maintenance 0 / 25
Adoption 10 / 25
Maturity 8 / 25
Community 19 / 25

Stars 211
Forks 33
Language Python
License None
Last pushed Oct 17, 2023
Commits (30d) 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/ck-unifr/pdf_parsing"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
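The same request as the curl command above can be sketched in Python with only the standard library. The URL is taken from that command. Reading its last three path segments as category, owner, and repository is an assumption, as is the expectation that the endpoint returns JSON. The function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example; everything after it is treated
# as category/owner/repo, which is an assumption about the URL layout.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category, owner, repo):
    """Build the endpoint URL, percent-encoding each path segment."""
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category, owner, repo, timeout=10.0):
    """Fetch the quality record. Assumes a JSON body; the schema is not documented here."""
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


url = build_quality_url("llm-tools", "ck-unifr", "pdf_parsing")
print(url)
```

Calling `fetch_quality("llm-tools", "ck-unifr", "pdf_parsing")` performs the actual request and counts against the keyless daily limit.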