pytorch/text

Models, data loaders and abstractions for language processing, powered by PyTorch

Archived

/ 100

Established

Provides pre-built datasets (WikiText, SQuAD, Multi30k, AG_NEWS, etc.), scriptable tokenizers (SentencePiece, GPT-2 BPE, BERT), and pre-trained transformer models (RoBERTa, T5, XLM-R), with TorchData integration for efficient data pipelines. It also supports vectorized text transformations and vocabulary management, and is designed to streamline end-to-end NLP workflows within the PyTorch ecosystem.
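
To make "vocabulary management" and "text transformations" concrete, here is a minimal sketch of the idea in plain Python. It is not torchtext's API: the helper names (`build_vocab`, `numericalize`, `pad_batch`), the special tokens, and the toy corpus are all assumptions chosen for illustration. It shows the usual three steps: map tokens to integer ids, fall back to an unknown-token id for unseen words, and pad a batch to a common length.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>"), min_freq=1):
    """Map each token to an integer id; special tokens get the lowest ids."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials)
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}

def numericalize(tokens, vocab, unk="<unk>"):
    """Convert tokens to ids, using the unknown-token id for unseen tokens."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]

def pad_batch(batch, pad_id):
    """Right-pad every id sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Hypothetical toy corpus, tokenized on whitespace.
corpus = ["the cat sat", "the dog sat down"]
tokenized = [line.split() for line in corpus]
vocab = build_vocab(tokenized)

# "bird" is not in the vocabulary, so it maps to the <unk> id.
ids = [numericalize(toks, vocab) for toks in tokenized + [["the", "bird"]]]
batch = pad_batch(ids, vocab["<pad>"])
print(batch)
```

The padded `batch` is a rectangular list of ids, which is the shape a tensor constructor would expect before the data is fed to a model.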

3,565 stars. No commits in the last 6 months.

Archived, Stale 6m, No Package, No Dependents

Maintenance 0 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 25 / 25


Stars

3,565

Forks

813

Language

Python

License

BSD-3-Clause

Related tools

facebookresearch/stopes

A library for preparing data for machine translation research (monolingual preprocessing,...

rkcosmos/deepcut

A Thai word tokenization library using Deep Neural Network

Droidtown/ArticutAPI

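API of Articut Chinese word segmentation (with semantic part-of-speech tagging): word segmentation (「斷詞」, also called 「分詞」) is the foundation of Chinese information processing. Articut uses no machine learning and needs no data model; using only the grammar rules of modern vernacular Chinese, it can achieve...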

fukuball/jieba-php

"Jieba" (結巴) Chinese word segmentation: aims to be the best PHP component for Chinese word segmentation. / "Jieba" (Chinese for "to stutter") Chinese text segmentation:...

jiesutd/NCRFpp

NCRF++, a neural sequence labeling toolkit. Easy to use for any sequence labeling task (e.g. NER,...
