ngc7292/query_of_cc
This project provides the dataset and model checkpoints for the paper "Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora".
If you're building or training large language models (LLMs) and need a vast, high-quality dataset of domain-specific knowledge, this project provides a curated collection called "Knowledge Pile." It takes seed information (like keywords or FAQs) and expands it into comprehensive, reasoning-rich data from public web sources. This is ideal for AI researchers and practitioners who need extensive knowledge corpora across various STEM and humanities fields.
No commits in the last 6 months.
Use this if you need a pre-compiled, extensive dataset of knowledge and reasoning-oriented text to train or fine-tune your large language models efficiently, covering diverse academic and scientific domains.
Not ideal if you need a dataset for non-text-based AI tasks, require highly specialized domain knowledge not covered in general academic sources, or prefer to collect data manually from scratch.
Stars: 4
Forks: —
Language: —
License: —
Category:
Last pushed: Mar 05, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/ngc7292/query_of_cc"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
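The same request can be made from a script. The following is a minimal Python sketch using only the standard library. The helper names `build_quality_url` and `fetch_quality` are hypothetical, and reading the three path segments as category, owner, and repository is an assumption based on the shape of the URL above. The response format is not documented here, so the sketch tries JSON and falls back to raw text.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(category: str, owner: str, repo: str) -> str:
    """Assemble the endpoint URL in the same shape as the curl example.

    Assumption: the three path segments are category, owner, and repository.
    """
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the listing data for one repository (hypothetical helper).

    Assumption: the body is JSON; if it does not parse, the raw text is returned.
    """
    url = build_quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body


# Example call (performs a network request, so it is left commented out):
# data = fetch_quality("transformers", "ngc7292", "query_of_cc")
```

With the keyless limit of 100 requests/day, it is worth caching the result locally rather than calling `fetch_quality` repeatedly for the same repository.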
Higher-rated alternatives
weAIDB/awesome-data-llm
Official Repository of "LLM × DATA" Survey Paper
mlabonne/llm-datasets
Curated list of datasets and tools for post-training.
malteos/llm-datasets
A collection of datasets for language model pretraining including scripts for downloading,...
Jeryi-Sun/LLM-and-Law
This repository is dedicated to summarizing papers related to large language models with the field of law
magpie-align/magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your...