Data Mesh is a fully managed data pipeline product that Netflix originally built for change data capture (CDC) use cases. Netflix has since expanded its scope beyond CDC to general data movement and data processing, including:
- More processing patterns, such as filter, projection, union, and join
- Sourcing events from more general applications, not only CDC streams

In short, Data Mesh is Netflix's general-purpose platform for moving and processing data between its systems at scale.
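The processing patterns listed above can be sketched as simple stream operators. This is an illustrative stdlib sketch, not Data Mesh's actual API; all function names here are made up for the example.

```python
# Hypothetical stream operators mirroring the patterns named above:
# filter, projection, union, and join. Events are plain dicts.

def filter_op(events, predicate):
    """Keep only the events that satisfy the predicate."""
    return [e for e in events if predicate(e)]

def project_op(events, fields):
    """Keep only the selected fields of each event."""
    return [{k: e[k] for k in fields if k in e} for e in events]

def union_op(*streams):
    """Merge several streams into one."""
    merged = []
    for stream in streams:
        merged.extend(stream)
    return merged

def join_op(left, right, key):
    """Inner-join two streams on a shared key."""
    index = {e[key]: e for e in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]
```

For example, joining a stream of movie events with a stream of talent events on `id` yields enriched movie records, the kind of derived data a downstream Source could expose.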
Most operations run through pipelines. A pipeline reads data from one or more sources, processes it, and writes the results onward. Controllers determine each pipeline's resources and its optimal configuration. The diagram above illustrates the pipeline architecture.
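The read-process-write flow just described can be sketched in a few lines. This is a minimal sketch under assumed semantics (a processor returning `None` drops the event), not Netflix's implementation.

```python
# Illustrative pipeline: read events from sources, run each event
# through a chain of processors, and deliver survivors to a sink.

class Pipeline:
    def __init__(self, sources, processors, sink):
        self.sources = sources        # callables that yield events
        self.processors = processors  # event -> event, or None to drop it
        self.sink = sink              # list collecting processed events

    def run(self):
        for source in self.sources:
            for event in source():
                for process in self.processors:
                    event = process(event)
                    if event is None:
                        break  # event was filtered out mid-chain
                else:
                    self.sink.append(event)
```

A controller's job, in these terms, would be choosing how many such pipelines to run and with what resources; that part is outside this sketch.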
Application developers expose their domain data through a central catalog of Sources. This enables data sharing, since different Netflix teams may want to react to changes in a particular entity. A Source can also be defined as the output of several processing steps; for example, a movie entity enriched with additional dimensions (such as a list of Talents) can be further indexed to serve search use cases.
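A central catalog of Sources, including derived Sources built from upstream ones, might look like the following. This is a hypothetical sketch; the class and field names are assumptions for illustration, not the real catalog's API.

```python
# Hypothetical Source catalog: teams register a Source under a name,
# and a derived Source records which upstream Sources it is built from.

class SourceCatalog:
    def __init__(self):
        self._sources = {}

    def register(self, name, schema, upstream=None):
        """Expose a Source; `upstream` lists the Sources it derives from."""
        self._sources[name] = {"schema": schema, "upstream": upstream or []}

    def lookup(self, name):
        """Return the registered metadata for a Source."""
        return self._sources[name]
```

In this model, the enriched movie Source from the example above would simply be registered with `upstream=["movie", "talent"]`, making its lineage discoverable by other teams.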
Domain data from each business unit is processed by the system's processors; engineers mostly use Apache Flink for real-time data processing. Connectors are the primary components for initiating data transfer: they monitor the data sources and produce change data capture (CDC) events, which are written into Data Mesh. Apache Kafka is the primary transport for data inside Data Mesh. Data schemas and catalogs are crucial for making data searchable and visible across business domains; Netflix uses Apache Avro as the preferred schema format across domains.
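The connector-to-Kafka flow just described can be sketched end to end: a connector diffs a table snapshot, emits schema-conformant CDC events, and appends them to a topic. This is a stdlib simulation under assumed names; the real system uses Kafka as the transport, Flink for processing, and Avro for schemas.

```python
# Simplified stand-in for an Avro record schema (field name, Python type).
MOVIE_CHANGE_SCHEMA = {
    "type": "record",
    "name": "MovieChange",
    "fields": [("op", str), ("id", int), ("title", str)],
}

def validate(event, schema):
    """Check that an event carries the schema's fields with the right types."""
    return all(name in event and isinstance(event[name], typ)
               for name, typ in schema["fields"])

def emit_cdc(before, after, topic):
    """Connector sketch: diff two table snapshots keyed by row id and
    append insert events for new rows to a Kafka-like topic log."""
    for row_id, row in after.items():
        if row_id not in before:
            event = {"op": "insert", "id": row_id, "title": row["title"]}
            if validate(event, MOVIE_CHANGE_SCHEMA):
                topic.append(event)
```

A real connector would also emit update and delete events and publish the schema to a registry; those steps are omitted here for brevity.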