DigitalPebble/behemoth
Behemoth is an open-source platform for large-scale document analysis based on Apache Hadoop.
Archived
Built on MapReduce, it provides a modular annotation framework for chaining document processors (Tika, UIMA, GATE, language identification), plus connectors for ingesting from WARC/Nutch sources and exporting to Solr/Mahout. It acts as distributed glueware: rather than implementing its own algorithms, it orchestrates existing NLP and ML tools at scale and relies on Hadoop for fault tolerance and horizontal scalability.
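The idea of chaining document processors can be illustrated with a minimal single-process sketch. This is not Behemoth's API: `Document`, `count_tokens`, `find_capitalized`, and `run_pipeline` are hypothetical names invented for illustration, and the toy processors merely stand in for real tools such as Tika or GATE. In an actual deployment the per-document loop would run inside Hadoop map tasks rather than in a plain Python loop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Document:
    """Hypothetical document record: source URL, text, and accumulated annotations."""
    url: str
    text: str
    annotations: Dict[str, object] = field(default_factory=dict)


# A processor takes a document and returns it with extra annotations attached.
Processor = Callable[[Document], Document]


def count_tokens(doc: Document) -> Document:
    """Toy processor: record the number of whitespace-separated tokens."""
    doc.annotations["token_count"] = len(doc.text.split())
    return doc


def find_capitalized(doc: Document) -> Document:
    """Toy processor: record tokens that start with an uppercase letter."""
    doc.annotations["capitalized"] = [t for t in doc.text.split() if t[:1].isupper()]
    return doc


def run_pipeline(docs: Iterable[Document], processors: List[Processor]) -> List[Document]:
    """Apply every processor, in order, to every document."""
    results = []
    for doc in docs:
        for process in processors:
            doc = process(doc)
        results.append(doc)
    return results


if __name__ == "__main__":
    corpus = [Document("http://example.org/a", "Behemoth runs on Hadoop")]
    for d in run_pipeline(corpus, [count_tokens, find_capitalized]):
        print(d.url, d.annotations)
```

Because each processor only reads and extends one document, the documents can be processed independently, which is what makes this kind of chain a natural fit for a map phase.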
284 stars. No commits in the last 6 months.
Stars
284
Forks
59
Language
Java
License
—
Category
Last pushed
Apr 25, 2018
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/nlp/DigitalPebble/behemoth"
Open to everyone: 100 requests/day, no key needed. A free key raises the limit to 1,000/day.
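The same request can be made from a script. The sketch below uses only the Python standard library and rests on two assumptions not confirmed by this page: that the endpoint follows the pattern `/api/v1/quality/<category>/<owner>/<repo>` (inferred from the single URL above) and that it returns a JSON body. The helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path taken from the curl example; the path layout beyond it is an assumption.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the request URL, percent-encoding each path segment."""
    parts = [quote(p, safe="") for p in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch the record and parse it as JSON (the response format is assumed)."""
    with urllib.request.urlopen(quality_url(category, owner, repo), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URL; call fetch_quality(...) to perform the network request.
    print(quality_url("nlp", "DigitalPebble", "behemoth"))
```

With the anonymous limit of 100 requests/day, a script that polls many repositories would want to cache responses locally rather than refetch on every run.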
Higher-rated alternatives
textvec/textvec
Text vectorization tool to outperform TFIDF for classification tasks
nasa-jpl-memex/memex-gate
General Architecture for Text Engineering
NISH1001/tag-generator
A simple tool to generate tags for the given text (document) using TF-IDF.
cooperability/BMX-bookmark-extractor
Better brain. Knowledge management tool. Stop saving things you'll never read. Work in progress.
paradite/tf-idf-keyword
Get keywords from a piece of text using tf-idf