All Data Engineering Tools

517 tools ranked by quality score · Page 2 of 6

Showing 101–200 of 517
# Tool Score Tier
101 neo4j/neo4j-jdbc

Official Neo4j JDBC Driver

61
Established
102 HariSekhon/SQL-scripts

100+ SQL Scripts - PostgreSQL, MySQL, Oracle, Google BigQuery, MariaDB, AWS...

61
Established
103 Breeze0806/go-etl

go-etl is a toolset for data extraction, transformation and loading.

61
Established
104 DataKitchen/dataops-testgen

DataOps Data Quality TestGen is part of DataKitchen's Open Source Data...

61
Established
105 sparklyr/sparklyr

R interface for Apache Spark

61
Established
106 benjamin-awd/monopoly

Monopoly is a Python library & CLI that converts bank statement PDFs to CSV.

60
Established
107 debba/tabularis

A lightweight, developer-focused database management tool. Supports MySQL,...

60
Established
108 VisActor/VStory

Use data to tell stories. An intelligent Visualization Narrative Development...

60
Established
109 bitol-io/open-data-contract-standard

Home of the Open Data Contract Standard (ODCS).

60
Established
110 jtablesaw/tablesaw

Java dataframe and visualization library

60
Established
111 kalininalab/DataSAIL

DataSAIL is a tool to split datasets while reducing information leakage.

60
Established
112 HTTP-RPC/Kilo

Lightweight REST for Java

60
Established
113 techascent/tech.ml.dataset

A Clojure high performance data processing system

59
Established
114 cre-dev/xml2db

A Python package to load complex XML files into a relational database

59
Established
115 turbot/steampipe

Zero-ETL, infinite possibilities. Live query APIs, code & more with SQL. No...

59
Established
116 dotnet/spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

59
Established
117 linkedpipes/etl

LinkedPipes ETL is an RDF based, lightweight ETL tool

59
Established
118 dalenewman/Transformalize

Configurable Extract, Transform, and Load

59
Established
119 DataTalksClub/data-engineering-zoomcamp

Data Engineering Zoomcamp is a free 9-week course on building...

59
Established
120 quixio/quix-streams

Python Streaming DataFrames for Kafka

59
Established
121 bacalhau-project/bacalhau

Community-driven, simple, yet powerful framework for fast, cost-effective...

59
Established
122 turbot/steampipe-plugin-github

Use SQL to instantly query repositories, users, gists and more from GitHub....

59
Established
123 metafacture/metafacture-core

Core package of the Metafacture tool suite for metadata processing.

59
Established
124 vmware/versatile-data-kit

One framework to develop, deploy and operate data workflows with Python and SQL.

59
Established
125 Data-Centric-AI-Community/ydata-profiling

1 Line of code data quality profiling & exploratory data analysis for Pandas...

58
Established
126 heavyai/heavydb

HeavyDB (formerly MapD/OmniSciDB)

58
Established
127 biglocalnews/warn-transformer

Consolidate, enrich and republish the data gathered by warn-scraper

58
Established
128 alibaba/feathub

FeatHub - A stream-batch unified feature store for real-time machine learning

58
Established
129 rudderlabs/rudder-server

Privacy and Security focused Segment-alternative, in Golang and React

58
Established
130 dagster-io/community-integrations

Community supported integrations for the Dagster platform.

58
Established
131 9tigerio/db2rest

Instant no code DATA API platform for relational databases. Connect any...

58
Established
132 Guepard-Corp/qwery-core

The Boring query platform - Connect and query anything

58
Established
133 turbot/steampipe-plugin-gcp

Use SQL to instantly query GCP resources across regions, projects and...

57
Established
134 h2oai/sparkling-water

Sparkling Water provides H2O functionality inside Spark cluster

57
Established
135 dataflint/spark

Drop-in replacement for Apache Spark UI

57
Established
136 datazip-inc/olake-ui

Frontend & BFF (Backend for frontend) for Olake. This includes the UI code...

57
Established
137 dotflow-io/dotflow

🎲 Business Logic Code in a flow!

57
Established
138 turbot/steampipe-plugin-kubernetes

Use SQL to instantly query Kubernetes API resources. Open source CLI. No DB required.

57
Established
139 dfpc-coe/CloudTAK

TAK Compatible, browser based Common Operation Picture & Situational Awareness tool

57
Established
140 elyra-ai/pipeline-editor

Common pipeline-editor components used in different clients (e.g. Elyra...

56
Established
141 turbot/steampipe-plugin-azure

Use SQL to instantly query Azure resources across regions and subscriptions....

56
Established
142 dbt-labs/jaffle-shop

🥪🦘 An open source sandbox project exploring dbt workflows via a fictional...

56
Established
143 starlake-ai/starlake

Declarative text based tool for data analysts and engineers to extract,...

56
Established
144 flowsynx/flowsynx

A deterministic orchestrator for composable micro-workflows with reusable modules

56
Established
145 CogStack/CogStack-NiFi

Building data processing pipelines for documents processing with NLP using...

56
Established
146 DataSQRL/sqrl

Data Pipeline Automation Framework to build MCP servers, data APIs, and data...

55
Established
147 spitfireuptown/datalinkx

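🔥🔥 DatalinkX is a data synchronization system for heterogeneous data sources. It supports incremental or full synchronization of massive data volumes, as well as data flow between sources such as HTTP, Oracle, MySQL, and ES,...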

55
Established
148 arkflow-rs/arkflow

High performance Rust stream processing engine seamlessly integrates AI...

55
Established
149 SentryPeer/SentryPeer

Protect your SIP Servers from bad actors at https://sentrypeer.org

55
Established
150 turbot/steampipe-plugin-sdk

Steampipe Plugin SDK is a simple abstraction layer to write a Steampipe...

55
Established
151 docwire/docwire

DocWire SDK: Award-winning modern data processing in C++20. SourceForge...

55
Established
152 reductstore/reductstore

High Performance Storage and Streaming Solution for Data Acquisition Systems

55
Established
153 Snowflake-Labs/emerging-solutions-toolbox

The Emerging Solutions Toolbox is a collection of solutions created by...

55
Established
154 kay-ou/SimTradeData

SimTradeData is a utility library supporting SimTradeDesk, SimTradeLab and...

55
Established
155 dflib/dflib

In-memory Java DataFrame library

54
Established
156 kanton-bern/hellodata-be

The Open-Source Enterprise Data Platform in a single Portal

54
Established
157 MLT-OSS/FirstData

The World's Most Comprehensive, Authoritative, and Structured Open Source...

54
Established
158 akmalsoliev/Validoopsie

A simple and easy to use Data Validation library for Python.

54
Established
159 airyhq/airy

💬 Open Source App Framework to build streaming apps with real-time data - 💎...

54
Established
160 OHDSI/ETL-Synthea

A package supporting the conversion from Synthea CSV to OMOP CDM

54
Established
161 mprove-io/mprove

Open Source Business Intelligence with Malloy Semantic Layer 🎉

53
Established
162 GoPlasmatic/dataflow-rs

A high-performance rules engine for IFTTT-style automation in Rust with...

53
Established
163 ogbinar/DataEngineeringPilipinas

Data Engineering Pilipinas is a community for data engineers, data analysts,...

53
Established
164 JuliaML/TableTransforms.jl

Transforms and pipelines with tabular data in Julia

53
Established
165 build-on-aws/rag-postgresql-agent-bedrock

This application is built in four stages using infrastructure as code with...

53
Established
166 fdmorison/tiozin

Tiozin, your friendly ETL framework

53
Established
167 halestudio/hale

(Spatial) data harmonisation with hale»studio (formerly HUMBOLDT Alignment Editor)

52
Established
168 DataRecce/recce

The data-validation toolkit for enhanced dbt (data build tool) PR review

52
Established
169 DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data...

52
Established
170 ara3d/bim-open-schema

Representing BIM Data as Parquet

52
Established
171 lakevision-project/lakevision

Lakevision is a tool which provides insights into your Apache Iceberg based...

52
Established
172 Edwardvaneechoud/Flowfile

Flowfile is a visual ETL tool and Python library combining drag-and-drop...

52
Established
173 xxh/xxh-shell-xonsh

Use @xonsh wherever you go through the SSH without installation on the host.

51
Established
174 myriade-ai/myriade

AI Native Data Platform: explore, clean, transform and govern your data...

51
Established
175 AndreaBozzo/dataprof

Library and CLI for profiling tabular data

51
Established
176 byzer-org/byzer-lang

Byzer (formerly MLSQL): A low-code open-source programming language for data...

51
Established
177 robert-koch-institut/mex-common

RKI Metadata Exchange | Software development toolkit for the MEx project...

51
Established
178 StructuredLabs/preswald

Preswald is a WASM packager for Python-based interactive data apps: bundle...

51
Established
179 dashmug/glue-utils

glue-utils makes AWS Glue jobs less repetitive, more type-safe, and easier...

51
Established
180 FalkorDB/falkordb-ts

FalkorDB Typescript Client

51
Established
181 aws-samples/uncovering-hidden-connections-in-unstructured-financial-data

Uncovering Hidden Connections in Unstructured Financial Data using Amazon...

51
Established
182 bitol-io/open-data-product-standard

Home of the Open Data Product Standard (ODPS).

51
Established
183 mehd-io/pypi-duck-flow

end-to-end data engineering project to get insights from PyPi using python,...

50
Established
184 libredb/libredb-studio

A modern, blazing-fast SQL IDE for the cloud era. Query PostgreSQL, MySQL,...

50
Established
185 ashish10alex/vscode-dataform-tools

Dataform Tools - VS Code extension to run and visualise Dataform data...

50
Established
186 pplu/aws-sdk-perl

A community AWS SDK for Perl Programmers

50
Established
187 opensnowcat/opensnowcat-collector

OpenSnowcat Collector, an open source fork of Snowplow (Apache 2.0 License)

49
Emerging
188 bradfitz/embiggen-disk

embiggen-disk live-resizes a filesystem after first live-resizing any...

49
Emerging
189 Pipelex/pipelex-cookbook

Cookbook for Pipelex, the declarative language for composable AI workflows....

49
Emerging
190 turbot/steampipe-plugin-terraform

Use SQL to instantly query resources, data sources and more from Terraform...

49
Emerging
191 koralium/flowtide

High-performance streaming SQL query engine designed for real-time data...

49
Emerging
192 hi-primus/optimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF,...

49
Emerging
193 weifuwan/seatunnel-web

SeaTunnel Web is a visual platform for building, managing, and monitoring...

49
Emerging
194 mc2-project/opaque-sql

An encrypted data analytics platform

49
Emerging
195 MilkMp/CIA-World-Factbooks-Archive-1990-2025

Complete structured archive of every CIA World Factbook edition from...

49
Emerging
196 capitalone/DataProfiler

What's in your data? Extract schema, statistics and entities from datasets

49
Emerging
197 edrewitz/WxData

A Python library that acts as a client to download, pre-process and...

48
Emerging
198 aartikis/RTEC

RTEC is an Event Calculus implementation optimised for stream reasoning

48
Emerging
199 icoretech/airbroke

🔥 Lightweight, Airbrake/Sentry-compatible, PostgreSQL-based Open Source Error Catcher

48
Emerging
200 SETL-Framework/setl

A simple Spark-powered ETL framework that just works 🍺

48
Emerging