This is a list of upcoming, new, or partially new data engineering and machine learning engineering technologies I want to follow in 2021.
- Getdbt
It's hitting the sweet spot of Apache Spark by bringing a simplified, SQL-based pipeline!
- Prefect
Designed to make workflow management easier and better than with Apache Airflow
- DVC
Open-source version control system for machine learning projects, and sought after for MLOps
- Great_Expectations
Data science testing framework; it's already amazing!
- Amundsen
An open-source data discovery and metadata engine
- Marquez
Open Source Metadata with an amazing UI
- Dagster
A data orchestrator for machine learning; very programming-oriented and in a similar space to Airflow/Prefect, but with an emphasis on state flow
- Apache Calcite
Framework for building SQL databases and data management systems without owning the data. Hive, Flink, and others use Calcite
- maiot-ZenML
Open-source MLOps framework with a bit of everything
- Apache Superset
Open Source BI with many connectors available
- Metabase
An open-source BI tool with amazing visualization
- Hopsworks
Open-source MLOps feature store
- Feast
open-source feature store, now with Tecton
- MLflow
Machine learning platform, the first of its kind
- Pachyderm
MLOps platform, in the same space as MLflow
- Montecarlodata
Data Governance or Data Discovery or Data Observability
- Tecton
Enterprise Feature Store
- Fiddler
Enterprise Explainable AI
- Cnvrg
Enterprise MLOps
- RAPIDS
Data Science on GPU
- DASK
Data science purely in Python
- Trino
Formerly known as PrestoSQL. With a clear separation from Presto, Trino can now focus heavily on features
- Apache Pinot
Real-time distributed OLAP datastore. Its growth is amazing, and it's in a similar space to Druid, but not exactly!
- Databricks
With the new SQL Analytics and the Lakehouse paper, I'm expecting more amazing OSS
- Delta Lake
ACID on Apache Spark
- Koalas
Pandas on Apache Spark
- Apache Beam
Simplifies stream processing; it's gaining a lot of attention and slowly moving from being GCP-only toward something more generic
- Apache Arrow
Extremely important because of its non-JVM, in-memory columnar format and vectorized processing
- Ray
Distributed Machine Learning and now Streaming
- Anodot
Monitors all your data in real time for lightning-fast detection of incidents
- DataRobot
Solid ML platform with a strong focus on enterprise MLOps
- Dataiku
Enterprise AI/MLOps Platform
- Fivetran
Data Integration Pipeline
- DataFrame Whale
Extremely simple data discovery tool
- Nextflow
Data-driven computational pipelines designed for bioinformatics, but usable well beyond it
- Confluent
Apache Kafka and its surrounding ecosystem
- Papermill
Parameterizes notebooks, which makes data science more interesting and easier
- Algorithmia
Enterprise MLOps
- Abacus AI
Enterprise AI with AutoML, in a similar space to DataRobot
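
The Apache Arrow entry above mentions an in-memory columnar format. As a rough illustration of what "columnar" means, here is a minimal sketch that uses only Python's standard library. It is not the Arrow API, and the field names `user_id` and `amount` are made up for the example.

```python
from array import array

# Row-oriented layout: one Python dict per record.
rows = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": 20.0},
    {"user_id": 3, "amount": 0.5},
]

# Column-oriented layout: one contiguous, typed buffer per field.
# "q" is a signed 64-bit integer, "d" is a 64-bit float.
columns = {
    "user_id": array("q", (r["user_id"] for r in rows)),
    "amount": array("d", (r["amount"] for r in rows)),
}

# An aggregate over one field only has to scan that field's buffer,
# instead of visiting every record and picking the field out of it.
total_from_rows = sum(r["amount"] for r in rows)
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Keeping each field in its own typed buffer is the general idea that makes vectorized scans and cross-language sharing practical; Arrow itself adds a standardized memory layout, null handling, and nested types on top of it.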