ck-unifr/pdf_parsing
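PDF parsing (text, sections, tables, images, references), plus PDF question answering, summarization, and information extraction based on large language models (ChatGLM2-6B, RWKV) + langchain + streamlit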
Combines PyMuPDF and PyPDF2 for multi-modal PDF extraction (text hierarchy, tables, images, references) with separate LLM pipelines: RWKV-Raven-7B for summarization and ChatGLM2-6B for structured reference metadata extraction (author, title, year). Provides a Streamlit+LangChain QA interface with vector-based retrieval over the parsed content. The project acknowledges table extraction as a current limitation and points to alternative approaches such as LayoutLM or table-transformer.
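To illustrate the idea of vector-based retrieval over parsed content, here is a minimal sketch in plain Python. It is not the repository's code: the project is described as using LangChain, whereas this sketch swaps in a toy word-count "embedding" and cosine similarity so that it runs with the standard library alone. The function names (`embed`, `cosine`, `retrieve`) and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:top_k]


# Hypothetical chunks, standing in for text parsed out of a PDF.
chunks = [
    "Table extraction is a known limitation of the parser.",
    "The summary of each section is generated by a language model.",
    "References are parsed into author, title, and year fields.",
]
best = retrieve("Which fields are extracted from references?", chunks, top_k=1)
```

In a QA setup of this kind, the retrieved chunks would typically be passed to the language model as context for answering the question; a real system would replace `embed` with a learned embedding model and a vector store.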
211 stars. No commits in the last 6 months.
Stars: 211
Forks: 33
Language: Python
License: none listed
Category:
Last pushed: Oct 17, 2023
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/llm-tools/ck-unifr/pdf_parsing"
Open to everyone: 100 requests/day, no key needed. Get a free key for 1,000/day.
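The same request can be sketched in Python with the standard library. Two assumptions are worth flagging: the `owner/repo` URL pattern is generalized from the single curl example on this page, and the response format is not documented here, so the sketch returns the raw body as text (assumed UTF-8) rather than parsing it. The helper names `build_url` and `fetch_quality` are hypothetical.

```python
import urllib.request

# Base path taken from the curl example; the owner/repo suffix pattern is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/llm-tools"


def build_url(owner, repo):
    """Build the quality endpoint URL for a given repository."""
    return f"{BASE_URL}/{owner}/{repo}"


def fetch_quality(owner, repo, timeout=10):
    """Fetch the raw response body as text (assumes a UTF-8 encoded response)."""
    with urllib.request.urlopen(build_url(owner, repo), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (performs a network request, so it is left commented out):
# print(fetch_quality("ck-unifr", "pdf_parsing"))
```

Given the stated limit of 100 keyless requests per day, it may be sensible to cache responses rather than call the endpoint repeatedly.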
Higher-rated alternatives
sudan94/chat-pdf-hugginface
This is a fun Python project that allows you to chat with a chatbot about the PDF you uploaded....
amitgupta4407/All_About_PDF
This is a complete website in which you can chat with a PDF and extract metadata, text, links,...
rahul2002m/ChatPDF
ChatPDF is a Streamlit app allowing users to query PDF & DOCX content via natural language. It...
benthecoder/chatpdf
Chat with PDF using mistral.ai + Streamlit
Hashir-Ahmad1/Train-AI-agent-on-mutiple-PDF
The Multi-PDF's Chat Agent is a Streamlit-based web application designed to facilitate...