Text Preprocessing Pipelines NLP Tools

End-to-end tools and libraries for cleaning, normalizing, and preparing raw text data for NLP tasks. Includes tokenization, stemming, stopword removal, and data cleaning utilities. Does NOT include downstream NLP applications (sentiment analysis, classification, etc.), feature extraction, or domain-specific cleaning (tweets, names, etc.).
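To make the scope concrete, here is a minimal sketch of the steps this category covers (cleaning, tokenization, stopword removal, stemming), using only the Python standard library. It is not taken from any of the listed tools. The stopword list, the `naive_stem` suffix stripper, and all function names are illustrative assumptions. The toy stemmer is not the Porter algorithm.

```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "and", "or", "is", "are", "of", "to", "in"}


def clean(text: str) -> str:
    """Normalize Unicode, drop HTML-like tags and URLs, collapse whitespace, lowercase."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML-like tags first
    text = re.sub(r"https?://\S+", " ", text)  # then strip URLs
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text: str) -> list[str]:
    """Split already-lowercased text into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text)


def naive_stem(token: str) -> str:
    """Toy suffix stripper for illustration only (NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters of the stem.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Run clean -> tokenize -> stopword removal -> stemming."""
    tokens = tokenize(clean(text))
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]
```

For example, `preprocess("<p>The cats are chasing the mice: https://example.com</p>")` returns `['cat', 'chas', 'mice']`. The over-stripped `chas` shows why the listed libraries generally rely on established stemmers or lemmatizers rather than ad hoc suffix rules.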

There are 45 text preprocessing pipeline tools tracked. One scores 70 or above (Verified tier). The highest-rated is chartbeat-labs/textacy at 70/100, with 2,236 stars and 75,599 monthly downloads.

Get all 45 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=text-preprocessing-pipelines&limit=20"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
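The same request can be made from Python. The sketch below uses only the standard library and the URL and query parameters shown in the curl command. The function names `build_url` and `fetch_quality` are hypothetical. The response schema is not documented here, so the code simply parses whatever JSON comes back. It keeps `limit=20` from the example above; whether larger values are accepted is not stated on this page.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the curl example; no API key is sent.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="text-preprocessing-pipelines", limit=20):
    """Assemble the query string used in the curl example."""
    query = urllib.parse.urlencode(
        {"domain": domain, "subcategory": subcategory, "limit": limit}
    )
    return f"{BASE_URL}?{query}"


def fetch_quality(url, timeout=10.0):
    """GET the URL and parse the body as JSON (schema is an unknown here)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality(build_url())` performs the network request and counts against the daily quota, so it is worth caching the result locally when iterating.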

| # | Tool | Description | Score | Tier |
|---|------|-------------|-------|------|
| 1 | chartbeat-labs/textacy | NLP, before and after spaCy | 70 | Verified |
| 2 | nltk/nltk_data | NLTK Data | 61 | Established |
| 3 | prasanthg3/cleantext | An open-source package for python to clean raw text data | 59 | Established |
| 4 | brightertiger/pygarble | Python Package to detect garbled, gibberish text for EN | 54 | Established |
| 5 | jfilter/clean-text | 🧹 Python package for text cleaning | 53 | Established |
| 6 | citiususc/pyplexity | Cleaning tool for web scraped text | 50 | Established |
| 7 | LoLei/redditcleaner | Cleans Reddit Text Data :scroll: :broom: | 42 | Emerging |
| 8 | ksnugroho/basic-text-preprocessing | Basic text preprocessing for Bahasa with Python. | 40 | Emerging |
| 9 | textpipe/textpipe | Textpipe: clean and extract metadata from text | 40 | Emerging |
| 10 | takuti/prelims | Front matter post-processor for static site generators | 39 | Emerging |
| 11 | alinapetukhova/textcl | Text preprocessing package for use in NLP tasks https://pypi.org/project/textcl/ | 38 | Emerging |
| 12 | MusfiqDehan/data-preprocessors | 🛠️An easy to use tool for Data Preprocessing specially for Text Preprocessing | 31 | Emerging |
| 13 | Shubha23/Text-processing-NLP | This notebook contains entire text preprocessing pipeline for NLP problems.... | 30 | Emerging |
| 14 | huu4ontocord/rio | Text pre-processing for NLP datasets | 30 | Emerging |
| 15 | iaramer/dobbi | An open-source NLP library: fast text cleaning and preprocessing | 28 | Experimental |
| 16 | YugantM/textcleaner | text-data pre-processing utility | 28 | Experimental |
| 17 | aflah02/cleansetext | This is a simple library to help you clean your textual data | 25 | Experimental |
| 18 | mantzaris/KeemenaPreprocessing.jl | Preprocessing for text data: cleaning, normalization, vectorization,... | 25 | Experimental |
| 19 | Arfius/light-text-prepro | Python module that collects regex rules | 24 | Experimental |
| 20 | Abhayparashar31/crazytext | A Simple Easy To Use Text Cleaning Package For NLP Built In Python. It Can... | 24 | Experimental |
| 21 | umapornp/textprepro | 👀 Everything Everyway All At Once Text Preprocessing for Natural Language Processing. | 23 | Experimental |
| 22 | ninadpatil09/NLP-Notebooks | Explore NLP tasks with Python using NLTK, SpaCy & scikit-learn:... | 23 | Experimental |
| 23 | Ankur3107/nlp_preprocessing | Text Preprocessing Package includes cleaning, tokenization, dataset... | 23 | Experimental |
| 24 | abeaderstadt/nlp-02-text-preprocessing | Text Preprocessing NLP Project | 23 | Experimental |
| 25 | Al-Hasib/eng_text_cleaner | A python package for cleaning text | 22 | Experimental |
| 26 | lgomezt/tidyX | Python package to clean raw tweets for ML applications. | 20 | Experimental |
| 27 | udityamerit/Text-Processing-Package-For-Natural-Language-Processing | This project is a comprehensive collection of NLP techniques, practical... | 20 | Experimental |
| 28 | angelsomo/nlp-text-cleaning | Lightweight Python CLI tool for robust text cleaning, Unicode normalization,... | 19 | Experimental |
| 29 | krisograbek/text-preprocessing | Text preprocessing in Python. Libs include string, re, nltk, spacy, gensim,... | 17 | Experimental |
| 30 | MariyamSiddiqui/Text-Preprocessing-NLP-pipeline | End-to-end NLP text preprocessing pipeline using Python — includes... | 16 | Experimental |
| 31 | mahirmsb25/Text-Preprocessing-Pipeline | A Python-based NLP preprocessing pipeline using NLTK and Pandas to clean and... | 15 | Experimental |
| 32 | nluninja/nlp_crash_course_with_spacy | A Natural Language Processing crash course with SpaCy 2.6 and NLTK 3.6.2,... | 14 | Experimental |
| 33 | basit-afridi62/nlp-nltk-python | This repository is a hands-on guide to Natural Language Processing (NLP)... | 13 | Experimental |
| 34 | mookiezi/dataset-cleaning-toolkit | A dataset toolbox for preparing and analyzing conversational datasets,... | 12 | Experimental |
| 35 | Abdelrahman-Atef-Elsayed/NLP_Preprocessing_pipeline | This repo includes a generalized preprocessing pipeline for text data in NLP tasks. | 12 | Experimental |
| 36 | iam-salma/NLP-Bootcamp-with-python | A hands-on NLP Bootcamp using Python covering text preprocessing,... | 12 | Experimental |
| 37 | NITHISHM2410/text-preprocessing-techniques | This Repo includes modules that helps NLP related tasks. | 12 | Experimental |
| 38 | alanindra/baca-juga-cleaner | Program to clean news text by filtering out irrelevant syntactic... | 11 | Experimental |
| 39 | dodevca/tweet-preprocessor | Lightweight, modular, and extensible Python library for preprocessing... | 11 | Experimental |
| 40 | tnathu-ai/NLP-Job-Ad | Pre-process natural language text data to generate effective feature... | 11 | Experimental |
| 41 | Varsh008/text_preprocessor_toolkit | Configurable Text Preprocessing Toolkit in Python using spaCy | 11 | Experimental |
| 42 | michellepellon/tidyname | Intelligent company name cleaning and normalization for Python. Entity... | 11 | Experimental |
| 43 | shrutimary15/Text-data-preparation | The repository consists of a python code that inputs a text file consisting... | 10 | Experimental |
| 44 | nadinejackson1/text-preprocessing-pipeline | Basic text preprocessing pipeline, which includes tokenization, stemming,... | 10 | Experimental |
| 45 | tripathiadityap/cleantxty | Python package to clean strings and making them reasonable for NLP. | 10 | Experimental |