Text Alignment Systems NLP Tools

Tools for aligning texts across languages, documents, or modalities (word-level, sentence-level, or document-level). Includes cross-lingual alignment, monolingual alignment, and narrative/script synchronization. Does NOT include general translation, similarity matching without explicit alignment output, or semantic parsing.

86 text alignment tools are tracked. The highest-rated is sileod/tasksource at 46/100 with 193 stars and 208 monthly downloads.

Get all 86 projects as JSON

curl "https://pt-edge.onrender.com/api/v1/datasets/quality?domain=nlp&subcategory=text-alignment-systems&limit=20"

Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
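The same request can be made from a script. The following is a minimal sketch in Python using only the standard library. It reuses the endpoint and the three query parameters shown in the curl command above; the function names `build_url` and `fetch_projects` are hypothetical, and the shape of the returned JSON is not documented here, so the sketch returns the parsed payload as-is. Note that the command above uses limit=20; whether the API accepts a larger limit in order to return all 86 projects in one call is an assumption that would need checking.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the curl command above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/datasets/quality"


def build_url(domain="nlp", subcategory="text-alignment-systems", limit=20):
    """Build the query URL from the three parameters shown in the curl example."""
    query = urlencode({"domain": domain, "subcategory": subcategory, "limit": limit})
    return f"{BASE_URL}?{query}"


def fetch_projects(limit=20, timeout=30):
    """Fetch and parse the JSON response (hypothetical helper).

    The response structure is not documented in this page, so the parsed
    JSON is returned unchanged for the caller to inspect.
    """
    with urlopen(build_url(limit=limit), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_projects()` performs a real network request and counts against the daily request limit mentioned above.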

# Tool Score Tier
1 sileod/tasksource

Datasets collection and preprocessings framework for NLP extreme multitask learning

46
Emerging
2 luheng/deep_srl

Code and pre-trained model for: Deep Semantic Role Labeling: What Works and...

42
Emerging
3 CK-Explorer/DuoSubs

Semantic subtitle aligner and merger for bilingual subtitle syncing.

40
Emerging
4 loomchild/maligna

Bilingual sentence aligner

39
Emerging
5 coastalcph/lex-glue

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

33
Emerging
6 ChineseGLUE/ChineseGLUE

Language Understanding Evaluation benchmark for Chinese: datasets,...

33
Emerging
7 gkiril/benchie

Comprehensive evaluation framework for Open Information Extraction.

33
Emerging
8 PhilipMay/stsb-multi-mt

Machine translated multilingual STS benchmark dataset.

33
Emerging
9 naver-ai/korean-safety-benchmarks

Official datasets and pytorch implementation repository of SQuARe and KoSBi...

32
Emerging
10 scofield7419/HeSyFu

Code for the ACL2021 paper: Better Combine Them Together! Integrating...

31
Emerging
11 IINemo/isanlp_srl_framebank

SRL parser for Russian based on FrameBank corpus

30
Emerging
12 vecto-ai/word-benchmarks

Benchmarks for intrinsic word embeddings evaluation.

29
Experimental
13 UKPLab/eacl2026-abcd-link

Repository for reproducing results from ABCD-Link

29
Experimental
14 TalSchuster/CrossLingualContextualEmb

Cross-Lingual Alignment of Contextual Word Embeddings

29
Experimental
15 ardoco/benchmark

A benchmark repository for TLR between (textual) Software Architecture...

29
Experimental
16 cdli-gh/Semantic-Role-Labeler

A semantic role labeling system for the Sumerian language. A Google Summer...

28
Experimental
17 ubisoft/ubisoft-laforge-binaryalign

BinaryAlign: Word Alignment as Binary Sequence Labeling

28
Experimental
18 Babelscape/ID10M

Data and code for the paper "ID10M: Idiom Identification in 10 Languages"...

28
Experimental
19 Babelscape/CroCoAlign

A Cross-Lingual, Context-Aware and Fully-Neural Sentence Alignment System...

27
Experimental
20 SapienzaNLP/gsrl

GSRL is a seq2seq model for end-to-end dependency- and span-based SRL (IJCAI2021).

27
Experimental
21 GuillaumeDD/dialign

Automatic and generic measures of verbal alignment in dyadic dialogue based...

27
Experimental
22 ku-nlp/JKUSea

Utility tool for aligning sentences of texts written in two different languages.

26
Experimental
23 thunlp/DictSKB

Code and data of the paper "Automatic Construction of Sememe Knowledge Bases...

26
Experimental
24 doc-analysis/XFUND

XFUND: A Multilingual Form Understanding Benchmark

25
Experimental
25 rggdmonk/hadal

A simple and efficient tool for mining and aligning sentences with pre-trained models.

25
Experimental
26 qiyuw/WSPAlign

WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span...

25
Experimental
27 LaVi-Lab/CLEVA

[EMNLP 2023 Demo] "CLEVA: Chinese Language Models EVAluation Platform"

25
Experimental
28 thespectrewithin/joint_align

Cross-lingual Alignment vs Joint Training: A Comparative Study and A Simple...

24
Experimental
29 scofield7419/LAGCN-SRL

Codes for the AAAI 2021 paper: Encoder-Decoder Based Unified Semantic Role...

24
Experimental
30 tschomacker/aligned-narrative-documents

A collection of scripts to create a Document-aligned corpus of German...

24
Experimental
31 orzhan/rusimscore

Code for paper "RuSimScore: unsupervised scoring function for Russian...

24
Experimental
32 tyjiangU/fido

Code for the paper "Exploiting Definitions for Frame Identification"

24
Experimental
33 amazon-science/real-world-noisy-benchmarks-for-natural-language-understanding

Benchmark test sets for real-world noise phenomena in goal-directed...

24
Experimental
34 UKPLab/acl2024-ircoder

Data creation, training and eval scripts for the IRCoder paper

23
Experimental
35 p-lambda/swords

The Stanford Word Substitution (Swords) Benchmark

23
Experimental
36 strubell/preprocess-conll05

Scripts for preprocessing the CoNLL-2005 SRL dataset.

23
Experimental
37 luciusssss/MiLiC-Eval

[ACL'25 Findings] MiLiC-Eval: Benchmarking Multilingual LLMs for China's...

23
Experimental
38 google/BEGIN-dataset

A benchmark dataset for evaluating dialog system and natural language...

22
Experimental
39 SapienzaNLP/dsrl

Code for "Semantic Role Labeling meets Definition Modeling: using natural...

22
Experimental
40 Tixierae/WECD

Code and data for the paper: 'Word Embeddings for the Construction Domain'

21
Experimental
41 allenai/multicite

MultiCite code and data. Models are available on Huggingface.

21
Experimental
42 ryokamoi/wice

This repository contains the dataset and code for "WiCE: Real-World...

20
Experimental
43 v-hirak/explaining-MT-difficulty

Dataset of diverse typological language properties as part of "Assessing the...

20
Experimental
44 longxudou/multispider

MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing

20
Experimental
45 lyutyuh/structured-span-selector

A Structured Span Selector (NAACL 2022). A structured span selector with a...

19
Experimental
46 liutianlin0121/decoding-time-realignment

Implementation of "Decoding-time Realignment of Language Models", ICML 2024.

18
Experimental
47 ShiZhengyan/IngredientParsing

Dataset and pytorch codes for the paper titled "Attention-based Ingredient...

18
Experimental
48 Sam120204/Pluralistic-Alignment-for-Healthcare

Code of our paper - "Pluralistic Alignment for Healthcare: A Role-Driven...

18
Experimental
49 jacklxc/CORWA

CORWA: A Citation-Oriented Related Work Annotation Dataset, NAACL 2022

18
Experimental
50 tsar-workshop/tsar-2025-shared-task

Code and data for TSAR 2025 Shared Task

17
Experimental
51 cvjena/chiasmus-detector

Code for paper "Data-Driven Detection of General Chiasmi Using Lexical and...

17
Experimental
52 guilhermevarela/deep_srlbr

SRL task using PropBank 1.1

16
Experimental
53 garfieldpigljy/CrowdWSA2019

Crowdsourced Word Sequence Aggregation 2019

16
Experimental
54 joshstephenson/SEAS

Tools for extracting and aligning sentences from subtitle language pairs...

16
Experimental
55 bMagicLAB/human-alignment-pl-en-codeswitch

Human-in-the-Loop alignment dataset for Polish-English code-switching...

15
Experimental
56 yumoxu/detnet

Code and dataset for TACL 19: Weakly Supervised Domain Detection.

15
Experimental
57 sampalomad/IKEA-Dataset

A dataset for multimodal machine translation

14
Experimental
58 Botfuel/benchmark-nlp

NLP benchmark test sentences and full results

14
Experimental
59 Toavinarandrianarivo/Scene2Chapter-NLP-Aligner

📖 Align movie scripts with novel chapters seamlessly using advanced NLP...

14
Experimental
60 SapienzaNLP/srl-pas-probing

Probing for Predicate Argument Structures in Pretrained Language Models (ACL 2022).

13
Experimental
61 nikolayVv/MultiParaphrase

Comparing and evaluating monolingual paraphrasing of English, German, Czech,...

13
Experimental
62 pranav-ust/cognates

ACL SRW paper: Alignment Analysis of Sequential Segmentation of Lexicons to...

13
Experimental
63 DominiqueMercier/ImpactCite

ImpactCite: A XLNet-based Solution Enabling Qualitative Citation Impact...

13
Experimental
64 sileod/metaeval

Collection of tasks for meta-learning and extreme multitask learning

13
Experimental
65 okalai-ai/moimoe

Typology-Guided Adaption in Multilingual Models

13
Experimental
66 gling07/Text2DRS

System Text2Drs takes English narrative as an input and outputs a discourse...

13
Experimental
67 SapienzaNLP/conception

Code and experiments for the COLING2020 paper "Conception:...

13
Experimental
68 multilingual-dataset-survey/multilingual-dataset-survey.github.io

The website implementation of Findings of EMNLP 2022, "Beyond Counting...

13
Experimental
69 kukas/word-alignment-visualization

Word Alignment Visualization is a Python package for visualizing word...

13
Experimental
70 ZurichNLP/ConLoan

A Contrastive Multilingual Dataset for Evaluating Loanwords - ACL2025

13
Experimental
71 DorinK/Principal-Parts-Detection

Multilingual dataset for principal parts detection in inflectional...

12
Experimental
72 ghomasHudson/muld

The Multitask Long Document Benchmark

12
Experimental
73 SapienzaNLP/exploring-srl

Repository for the paper "Exploring Non-Verbal Predicates in Semantic Role...

12
Experimental
74 SapienzaNLP/usea

Universal Semantic Annotator (LREC 2022)

12
Experimental
75 mbanon/benchmarks

Several benchmarks on sentence splitting and language identification

12
Experimental
76 hexuandeng/HExp4UDS

Implementation of the paper “Holistic Exploration on Universal...

12
Experimental
77 qiyuw/WSPAlign.InferEval

Inference library and evaluation script for WSPAlign...

12
Experimental
78 maxkagamine/word-alignment-demo

Demonstration of AI/neural word alignment of English & Japanese text using...

12
Experimental
79 SapienzaNLP/unify-srl

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic...

12
Experimental
80 kinit-sk/multiclaim

MultiClaim dataset repository

12
Experimental
81 SapienzaNLP/united-srl

A unified dataset for span- and dependency-based multilingual and...

12
Experimental
82 zahra-parvizian/PersianLexicalSimplifier

Persian text simplification using lexical simplification

11
Experimental
83 INTERACT-LLM/alignment-drift-llms

Dataset and analysis code for BEA2025 paper @ ACL: "Alignment Drift in...

11
Experimental
84 williammulianto/cleu

Cross-Lingual Embeddings Utility

10
Experimental
85 agneknie/com4520DarwinProject

Adjacent code related to the paper prepared for Joint Workshop on Multiword...

10
Experimental
86 hmosousa/professor_heideltime

Create a multilingual corpus weakly labeled with HeidelTime.

10
Experimental