aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation

Advanced document extraction and chunking techniques for layout-aware retrieval-augmented generation (RAG). Increases knowledge retrieval accuracy and provides control over how retrieved context is managed.

46 / 100 (Emerging)

Leverages Amazon Textract's layout detection to preserve semantic structure (titles, headers, tables, lists) and enriches extracted text with XML tags, enabling chunking that maintains context hierarchies. Implements specialized chunking strategies: each table chunk keeps the column headers, each list chunk includes the list title, and text sections carry their parent headers. The result is a child-section-chapter metadata hierarchy stored in OpenSearch. Integrates with Amazon Bedrock, SageMaker JumpStart embeddings, and the Textractor library to enable flexible RAG retrieval using hybrid search and hierarchical context selection.
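The chunking strategies described above can be illustrated with a minimal sketch. This is not the repository's code: it does not call Textract, Textractor, or OpenSearch, and the names `Chunk`, `chunk_table`, `chunk_list`, and the metadata keys are hypothetical. It only shows the idea of repeating the column header (or list title) in every chunk and attaching section and chapter titles as metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    # Hypothetical metadata keys; the real project's schema may differ.
    metadata: dict = field(default_factory=dict)


def _batches(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def chunk_table(header, rows, max_rows, section, chapter):
    """Split table rows into chunks, repeating the column header in each."""
    header_line = " | ".join(header)
    chunks = []
    for batch in _batches(rows, max_rows):
        lines = [header_line] + [" | ".join(row) for row in batch]
        meta = {"type": "table", "section": section, "chapter": chapter}
        chunks.append(Chunk("\n".join(lines), meta))
    return chunks


def chunk_list(title, items, max_items, section, chapter):
    """Split list items into chunks, repeating the list title in each."""
    chunks = []
    for batch in _batches(items, max_items):
        lines = [title] + [f"- {item}" for item in batch]
        meta = {"type": "list", "section": section, "chapter": chapter}
        chunks.append(Chunk("\n".join(lines), meta))
    return chunks


# Made-up sample data, for illustration only.
table_chunks = chunk_table(
    ["Region", "Revenue"],
    [["EMEA", "10"], ["APAC", "12"], ["AMER", "20"]],
    max_rows=2,
    section="Results",
    chapter="Annual Report",
)
list_chunks = chunk_list(
    "Key risks:",
    ["Supply", "Currency", "Regulation"],
    max_items=2,
    section="Outlook",
    chapter="Annual Report",
)
```

Because every chunk starts with its header or title and carries its section and chapter, a retriever that matches only the last rows of a table can still return text that is readable on its own, and the metadata lets the caller widen the context to the enclosing section or chapter.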

No package, no dependents
Maintenance 6 / 25
Adoption 10 / 25
Maturity 16 / 25
Community 14 / 25

Stars: 115
Forks: 14
Language: Jupyter Notebook
License: MIT-0
Category: local-rag-stacks
Last pushed: Dec 02, 2025
Commits (30d): 0

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/vector-db/aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
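The same request can be made from Python using only the standard library. This is a sketch under assumptions: the path pattern (category, then owner/repository) is inferred from the single URL shown above, the helper names `quality_url` and `fetch_quality` are hypothetical, and the response is returned as raw text because its schema is not documented here.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, repo: str) -> str:
    """Build the endpoint URL; the slash in "owner/name" is preserved."""
    return f"{BASE_URL}/{quote(category, safe='')}/{quote(repo, safe='/')}"


def fetch_quality(category: str, repo: str, timeout: float = 10.0) -> str:
    """Return the raw response body as text (no response schema assumed)."""
    with urlopen(quality_url(category, repo), timeout=timeout) as response:
        return response.read().decode("utf-8")


url = quality_url(
    "vector-db",
    "aws-samples/layout-aware-document-processing-and-retrieval-augmented-generation",
)
```

Calling `fetch_quality(...)` with the same two arguments performs the network request; it is left uncalled here so the snippet can be imported without side effects and without consuming the daily quota.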