It’s a MAD MAD MAD MAD world!

Written by Ross Turk

October 5, 2021

Last week, Matt Turck and John Wu published the latest annual report on the state of data, the 2021 Machine Learning, AI and Data (MAD) Landscape. If you haven’t read it yet, we recommend it as a comprehensive snapshot of the intricate world of AI, machine learning, and data science & engineering.

2021 MAD Landscape

Our team enjoyed reading it. We represent several of the pixels on this chart (hey, cool!) through our work on Datakin, OpenLineage, and Marquez. We would like to share some observations of our own on this exciting, growing landscape. Let’s start with the obvious.

OMG the chart

"For those who have remarked over the years how insanely busy the chart is, you’ll love our new acronym – Machine learning, Artificial intelligence and Data (MAD) – this is now officially the MAD landscape!"

Yeah, that checks out. It’s madness! For teams trying to find their path forward, it must be completely overwhelming. There are many areas to consider, each with a set of technologies and companies to research. Worlds within worlds. No single person has the ability to understand everything on this chart, and no organization deploying even a modest slice of these tools can keep track of it all without help.

Movie poster for It's a Mad Mad Mad Mad World

When we look at this chart, here’s what we see: an expansive swarm of stuff that creates, manipulates, and consumes datasets. Most of the logos on the chart represent a new data source, a transformation engine of some sort, or a consumer of data at large scale. We think organizations will operate using bespoke combinations, choosing the best tools for their needs and relying on open standards and ingenuity to tie it all together.

We believe that there will increasingly be a need for standards-based observability solutions (like Datakin) that trace data across platforms, making it easier to deal with operational and quality problems holistically.

Cloud data warehouses & BYOP

"Today, cloud data warehouses (Snowflake, Amazon Redshift and Google BigQuery) and lakehouses (Databricks) provide the ability to store massive amounts of data in a way that’s useful, not completely cost-prohibitive and doesn’t require an army of very technical people to maintain. In other words, after all these years, it is now finally possible to store and process Big Data."

We agree! This creates an entire world of new possibilities for companies of all sizes. It has become easy to store enormous amounts of data and perform complex transformations in no time. The barrier to entry has been obliterated. Organizations that never had the appetite for ETL find themselves bellying up to the bar for a mug of ELT and enjoying its lighter, smoother taste.

Once something powerful becomes easy, we expect it to happen a whole lot. Increasingly, we will find data science at the edges, decentralized and largely ungoverned. Just as IT departments struggled with BYOD (bring your own device) ten years ago, today’s orgs will have to get their heads around how to manage BYOP – bring your own pipeline. More individuals working independently, less centralized coordination. It’s kind of exciting! That’s the way to catch lightning in a bottle, even if it comes with some complications.

We may be biased, but we think that data lineage is the key to operating under these conditions. It can help establish a “chain of custody” for key datasets, contextualize the fragmented work of scientists and engineers, and build trust throughout an organization. That means businesses can support – and even facilitate – a long tail of data science within their walls.

Data lineage use case expansion

"Tracking data across repositories and pipelines [will] become even more essential for troubleshooting purposes, as well as compliance and governance, reinforcing the need for data lineage."

At Datakin, this is our thing: the belief that data lineage has operational value, in addition to the compliance and governance use cases it’s already known for. We believe lineage can make troubleshooting easier by quickly identifying the root cause of a complex pipeline failure, or simulating the effect of a planned change. Lineage context will become essential for data engineers, who are in a unique position to do something with it.
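To make those two operational uses concrete, here is a minimal sketch of how a lineage graph can answer both questions. It is an illustration only, not how Datakin or Marquez is implemented: the dataset names and the helper functions (`invert`, `reachable`, `root_cause_candidates`, `impact_of_change`) are all hypothetical. Walking the graph upstream lists the places a failure could have come from; walking it downstream lists what a planned change might affect.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets derived from it.
DOWNSTREAM = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["analytics.revenue"],
    "staging.customers": ["analytics.revenue", "analytics.churn"],
    "analytics.revenue": ["dashboard.kpis"],
    "analytics.churn": [],
    "dashboard.kpis": [],
}


def invert(graph):
    """Build the upstream view: dataset -> the datasets it is built from."""
    upstream = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            upstream.setdefault(target, []).append(source)
    return upstream


def reachable(graph, start):
    """Breadth-first walk returning every node reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def root_cause_candidates(dataset):
    """Everything upstream of a broken dataset: where the fault may originate."""
    return reachable(invert(DOWNSTREAM), dataset)


def impact_of_change(dataset):
    """Everything downstream of a dataset: what a planned change could affect."""
    return reachable(DOWNSTREAM, dataset)


if __name__ == "__main__":
    print(sorted(root_cause_candidates("dashboard.kpis")))
    print(sorted(impact_of_change("staging.customers")))
```

In this toy graph, a broken `dashboard.kpis` points back at five upstream datasets, while a change to `staging.customers` touches three downstream ones. A real lineage graph would also carry jobs, runs, and timestamps, which this sketch leaves out.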

But we think there’s an additional nuance here. In order to truly be helpful to data engineers, a data lineage solution needs to observe how data moves in real time. That way, job performance can be analyzed and lineage information is always up to date. So it’s not just data lineage that we need; it’s real-time data lineage.

We’ve built a data lineage solution for data engineers, so it naturally starts with direct observation of data on the move. The OpenLineage project allows users to instrument their own pipelines for observability, and offers several integrations with common data tools & platforms.
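For a feel of what instrumenting a pipeline means, here is a rough sketch of the kind of event a job might emit when it starts and when it finishes. It uses only the Python standard library rather than any OpenLineage client, and it is an assumption-laden illustration: the field names follow our reading of the OpenLineage run event model (check the specification for the authoritative schema), while `make_run_event`, the `example` namespace, the job and dataset names, and the producer URL are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def make_run_event(event_type, run_id, job_name, inputs, outputs, namespace="example"):
    """Assemble a lineage event as a plain dict (hypothetical helper)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id},  # the same ID ties START and COMPLETE together
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical identifier for the code that produced this event.
        "producer": "https://example.com/hypothetical-pipeline",
    }


# One run of a hypothetical job, observed at start and at completion.
run_id = str(uuid.uuid4())
start_event = make_run_event(
    "START", run_id, "daily_revenue", ["staging.orders"], ["analytics.revenue"]
)
complete_event = make_run_event(
    "COMPLETE", run_id, "daily_revenue", ["staging.orders"], ["analytics.revenue"]
)

if __name__ == "__main__":
    print(json.dumps(complete_event, indent=2))
```

Because both events share a run ID and carry timestamps, a collector can work out how long the job took as well as which datasets it read and wrote. Sending the events to a lineage backend is left out here, since that depends on the integration in use.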

Until next year...

We look forward to next year’s report! Between now and then, we expect to witness another twelve months of unprecedented change. Solutions will become both more powerful and more complex, and new categories will emerge to help us create order from chaos.

Are you looking at this landscape like a kid in a candy store, wanting to build a modern pipeline but not sure where to start? Start with data lineage, and with Datakin. Signing up is easy and free.