This is a list of upcoming, new, or partially new Data Engineering and Machine Learning Engineering technologies I want to follow in 2021.
- Getdbt, it's hitting the sweet spot of Apache Spark by bringing simplified SQL-based pipelines!
- Prefect, designed to make workflow management easier and better compared to Apache Airflow
- DVC, open-source version control system for Machine Learning projects, and in demand for MLOps
- Great_Expectations, Data Science testing framework; it's already amazing!
- Amundsen, an open-source data discovery and metadata engine
- Marquez, open-source metadata service with an amazing UI
- Dagster, a data orchestrator for Machine Learning; very programming-oriented and in a similar space to Airflow/Prefect, but with an emphasis on state flow
- Apache Calcite, framework for building SQL databases and data management systems without owning the data. Hive, Flink, and others use Calcite
- maiot-ZenML, open-source MLOps framework with a bit of everything
- Apache Superset, open-source BI with many connectors available
- Metabase, an open-source BI tool with amazing visualization
- Hopsworks, open-source MLOps feature store
- Feast, open-source feature store, now with Tecton
- MLflow, Machine Learning platform, first of its kind
- Pachyderm, MLOps platform, in the same space as MLflow
- Montecarlodata, data governance, data discovery, or data observability
- Tecton, enterprise feature store
- Fiddler, enterprise explainable AI
- Cnvrg, enterprise MLOps
- RAPIDS, Data Science on GPUs
- Dask, Data Science purely in Python
- Trino, formerly PrestoSQL. With a clear separation from Presto, Trino can now focus heavily on features
- Apache Pinot, realtime distributed OLAP datastore. Its growth is amazing, and it's in a similar space to Druid, but not exactly the same!
- Databricks, with the new SQL Analytics and the Lakehouse paper, I'm expecting more amazing OSS
- Delta Lake, ACID on Apache Spark
- Koalas, pandas on Apache Spark
- Apache Beam, simplifies stream processing; it's gaining lots of attention and is slowly moving from being GCP-only to something more generic
- Apache Arrow, extremely important because of its non-JVM, in-memory columnar format and vectorized processing
- Ray, distributed Machine Learning and now streaming
- Anodot, monitors all your data in real time for lightning-fast detection of incidents
- DataRobot, solid ML platform with a strong focus on enterprise MLOps
- Dataiku, enterprise AI/MLOps platform
- Fivetran, data integration pipeline
- DataFrame Whale, extremely simple data discovery tool
- Nextflow, data-driven computational pipelines designed for bioinformatics, but can go beyond it
- Confluent, Apache Kafka and its surrounding ecosystem
- Papermill, parameterizes notebooks, which makes Data Science more interesting and easier
- Algorithmia, enterprise MLOps
- Abacus AI, enterprise AI with AutoML, in a similar space to DataRobot