aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation
Advanced, layout-aware document extraction and chunking techniques for retrieval-augmented generation (RAG). Improves knowledge retrieval accuracy and gives control over how retrieved context is managed.
Leverages Amazon Textract's layout detection to preserve semantic structure (titles, headers, tables, lists) and enriches extracted text with XML tags, enabling intelligent chunking that maintains context hierarchies. Implements specialized chunking strategies—tables preserve column headers per chunk, lists include list titles, and text sections carry parent headers—creating a child-section-chapter metadata hierarchy stored in OpenSearch. Integrates with Amazon Bedrock, SageMaker JumpStart embeddings, and the Textractor library to enable flexible RAG retrieval using hybrid search and hierarchical context selection.
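The three chunking strategies described above (tables keep their column headers, lists keep their titles, text sections carry their parent header) can be illustrated with a small sketch. This is not the repository's code: the `Chunk` class, the function names, and the size limits (`max_rows`, `max_items`, `max_words`) are hypothetical and assume the layout elements have already been extracted into plain Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical container: chunk text plus metadata for the search index."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_table(header, rows, max_rows, section=""):
    """Split table rows into chunks, repeating the column header in every chunk."""
    header_line = " | ".join(header)
    chunks = []
    for start in range(0, len(rows), max_rows):
        lines = [header_line] + [" | ".join(r) for r in rows[start:start + max_rows]]
        chunks.append(Chunk("\n".join(lines), {"type": "table", "section": section}))
    return chunks


def chunk_list(title, items, max_items, section=""):
    """Split list items into chunks, prefixing each chunk with the list title."""
    chunks = []
    for start in range(0, len(items), max_items):
        lines = [title] + [f"- {item}" for item in items[start:start + max_items]]
        chunks.append(Chunk("\n".join(lines), {"type": "list", "section": section}))
    return chunks


def chunk_section(parent_header, paragraphs, max_words):
    """Group paragraphs under a word budget; each chunk starts with the parent header."""
    chunks, current, count = [], [], 0

    def flush():
        text = parent_header + "\n" + "\n".join(current)
        chunks.append(Chunk(text, {"type": "text", "section": parent_header}))

    for paragraph in paragraphs:
        words = len(paragraph.split())
        # Close the current chunk before it would exceed the word budget.
        if current and count + words > max_words:
            flush()
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        flush()
    return chunks


table_chunks = chunk_table(
    ["Region", "Revenue"],
    [["EU", "10"], ["US", "20"], ["APAC", "5"]],
    max_rows=2,
    section="2. Results",
)
list_chunks = chunk_list("Supported formats:", ["PDF", "PNG", "TIFF"], max_items=2)
text_chunks = chunk_section(
    "2. Results", ["alpha beta gamma", "delta epsilon", "zeta"], max_words=4
)
```

Because every chunk repeats its header, title, or parent heading, a chunk retrieved on its own still tells the model which columns or which section it belongs to. The `section` metadata field stands in for the hierarchy the description says is stored in OpenSearch.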
Stars: 115
Forks: 14
Language: Jupyter Notebook
License: MIT-0
Category:
Last pushed: Dec 02, 2025
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/vector-db/aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
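The same request as the curl command can be made from Python with the standard library. This is a minimal sketch: the helper names `quality_url` and `fetch_quality` are hypothetical, and it assumes the endpoint returns a JSON body, which the listing does not state.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example in the listing.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality/vector-db"


def quality_url(owner, repo):
    """Build the per-repository URL, percent-encoding each path segment."""
    return f"{BASE_URL}/{quote(owner, safe='')}/{quote(repo, safe='')}"


def fetch_quality(owner, repo, timeout=10.0):
    """Fetch the record for one repository (assumes a JSON response)."""
    with urlopen(quality_url(owner, repo), timeout=timeout) as response:
        return json.load(response)


url = quality_url(
    "aws-samples",
    "layout-aware-document-processing-and-retrieval-augmented-generation",
)
```

Calling `fetch_quality("aws-samples", "layout-aware-document-processing-and-retrieval-augmented-generation")` performs the network request. With no key, usage is limited to the 100 requests per day noted above.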
Higher-rated alternatives
yichuan-w/LEANN
[MLsys2026]: RAG on Everything with LEANN. Enjoy 97% storage savings while running a fast,...
byerlikaya/SmartRAG
Multi-Modal RAG for .NET — query databases, documents, images and audio in natural language....
mrutunjay-kinagi/ragsearch
This project aims to build a Retrieval-Augmented Generation (RAG) engine to provide...
Omkar-Wagholikar/adora
Python package that makes it easy to spin up a custom Retrieval-Augmented Generation (RAG) pipeline.
leewaay/ragcar
RAGCAR: Retrieval-Augmented Generative Companion for Advanced Research