Understanding the Data Movement and Processing Platform at Netflix

Data Mesh is a fully managed data pipeline product that originally enabled change data capture (CDC) use cases. Netflix has since expanded its scope to handle not just CDC but general data movement and processing, including:

  • More processing patterns, such as filter, projection, union, and join.
  • Events sourced from more generic applications, not only databases.

In short, Netflix uses Data Mesh as a general-purpose platform for moving and processing data between its systems at scale.
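
The processing patterns named above can be illustrated with a small, library-free sketch. It is not Netflix's implementation: the event shape (a flat dict), the function names, and the sample data are all assumptions made for illustration, and a real deployment would run such operators on a stream processor rather than on in-memory lists.

```python
from typing import Callable, Iterable, Iterator

Event = dict  # hypothetical event shape: field name -> value


def filter_events(events: Iterable[Event], predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Filter: keep only the events that satisfy the predicate."""
    return (e for e in events if predicate(e))


def project(events: Iterable[Event], fields: list[str]) -> Iterator[Event]:
    """Projection: keep only the named fields of each event."""
    return ({f: e[f] for f in fields if f in e} for e in events)


def union(*streams: Iterable[Event]) -> Iterator[Event]:
    """Union: merge several streams into one, in the order given."""
    for stream in streams:
        yield from stream


def join(left: Iterable[Event], right: Iterable[Event], key: str) -> Iterator[Event]:
    """Inner join: enrich each left event with the right event sharing its key."""
    index = {r[key]: r for r in right}
    for left_event in left:
        match = index.get(left_event[key])
        if match is not None:
            # Left-hand values win if both sides define the same field.
            yield {**match, **left_event}


# Hypothetical sample data.
movies = [
    {"movie_id": 1, "title": "A", "year": 2020},
    {"movie_id": 2, "title": "B", "year": 1999},
]
talents = [{"movie_id": 1, "talent": "X"}]

recent = list(filter_events(movies, lambda e: e["year"] >= 2000))
titles = list(project(movies, ["movie_id", "title"]))
merged = list(union(movies[:1], movies[1:]))
enriched = list(join(movies, talents, "movie_id"))
```

Here `enriched` contains only the first movie, since the second has no matching talent record under an inner join.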

Architecture

[Figure: Data Mesh pipeline architecture. Source: https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873]

Most operations in Data Mesh are carried out by pipelines. A pipeline reads data from one or more sources, applies processing steps to it, and sends the result on to its destination. Controllers determine the resources each pipeline needs and work out its configuration. The diagram above illustrates the pipeline architecture.
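
To make the source, processing, and destination flow concrete, here is a minimal sketch of a pipeline as a chain of processors. The `Pipeline` class, the processor functions, and the event fields are hypothetical names chosen for this example; the sketch also leaves out the controller side (resource allocation and configuration) entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

Event = dict
Processor = Callable[[Iterable[Event]], Iterable[Event]]


@dataclass
class Pipeline:
    """Hypothetical pipeline: one source, ordered processors, one sink."""
    source: Iterable[Event]
    processors: list[Processor] = field(default_factory=list)
    sink: list[Event] = field(default_factory=list)

    def run(self) -> list[Event]:
        stream = self.source
        # Each processor consumes the output of the previous one.
        for processor in self.processors:
            stream = processor(stream)
        self.sink.extend(stream)
        return self.sink


def only_updates(events: Iterable[Event]) -> Iterable[Event]:
    """Keep update events and drop everything else."""
    return (e for e in events if e["op"] == "update")


def drop_internal_fields(events: Iterable[Event]) -> Iterable[Event]:
    """Remove fields whose names start with an underscore."""
    return ({k: v for k, v in e.items() if not k.startswith("_")} for e in events)


pipeline = Pipeline(
    source=[
        {"op": "insert", "id": 1, "_shard": 3},
        {"op": "update", "id": 1, "_shard": 3},
    ],
    processors=[only_updates, drop_internal_fields],
)
result = pipeline.run()
```

Because the processors are plain functions over event streams, reordering or swapping them changes the pipeline without touching the source or the sink.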

Sources

Application developers can expose their domain data through a central catalog of Sources. This enables data sharing, because several Netflix teams may want to be notified of changes to the same entity. A Source can also be defined as the output of a series of processing steps; for instance, a movie entity that has been enriched with additional dimensions (such as a list of Talents) may be further indexed to satisfy search use cases.
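
A catalog in which a Source may itself be derived from other Sources could be modeled roughly as follows. This is only a sketch: the `Source` and `Catalog` classes, the team names, and the source names are invented for illustration and do not reflect the actual Data Mesh catalog API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical catalog entry; `upstream` lists the Sources it is derived from."""
    name: str
    owner: str
    upstream: tuple[str, ...] = ()


class Catalog:
    def __init__(self) -> None:
        self._sources: dict[str, Source] = {}

    def register(self, source: Source) -> None:
        # A derived Source may only build on Sources that are already registered.
        for parent in source.upstream:
            if parent not in self._sources:
                raise KeyError(f"unknown upstream source: {parent}")
        self._sources[source.name] = source

    def lineage(self, name: str) -> list[str]:
        """Return the names of all upstream Sources, root-most first."""
        chain: list[str] = []
        for parent in self._sources[name].upstream:
            chain.extend(self.lineage(parent))
            chain.append(parent)
        return chain


catalog = Catalog()
catalog.register(Source("movie", owner="studio-team"))
catalog.register(Source("movie_with_talent", owner="enrichment-team", upstream=("movie",)))
catalog.register(Source("movie_search_index", owner="search-team", upstream=("movie_with_talent",)))

search_lineage = catalog.lineage("movie_search_index")
```

The `lineage` call walks from the search index back to the raw movie entity, mirroring the enriched-then-indexed example in the paragraph above.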

Each business unit's domain data serves as a source that the platform's processors consume. Engineers mostly use Apache Flink for real-time data processing. Connectors are the components that start data movement: they monitor the data sources and produce change data capture (CDC) events that are written into Data Mesh. Apache Kafka is the main transport layer for moving data through Data Mesh. Schemas and catalogs are crucial for making data searchable and visible across business domains, and Netflix uses Apache Avro as the common schema format across domains.
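
As a rough illustration of what a connector produces, the sketch below turns the difference between two table snapshots into CDC-style events and pairs them with an Avro record schema (Avro schemas are written as JSON). The schema, its field names, and the snapshot-diff approach are assumptions made for this example; production connectors typically read a database's change log instead of comparing snapshots, and publishing the events to Kafka is not shown.

```python
import json

# Hypothetical Avro record schema for a movie change event.
MOVIE_CHANGE_SCHEMA = {
    "type": "record",
    "name": "MovieChange",
    "namespace": "example.cdc",
    "fields": [
        {"name": "op", "type": {"type": "enum", "name": "Op",
                                "symbols": ["INSERT", "UPDATE", "DELETE"]}},
        {"name": "movie_id", "type": "long"},
        {"name": "title", "type": ["null", "string"], "default": None},
    ],
}


def diff_snapshots(before: dict[int, str], after: dict[int, str]) -> list[dict]:
    """Compare two {movie_id: title} snapshots and emit CDC-style events."""
    events = []
    for movie_id in sorted(before.keys() | after.keys()):
        if movie_id not in before:
            events.append({"op": "INSERT", "movie_id": movie_id, "title": after[movie_id]})
        elif movie_id not in after:
            events.append({"op": "DELETE", "movie_id": movie_id, "title": None})
        elif before[movie_id] != after[movie_id]:
            events.append({"op": "UPDATE", "movie_id": movie_id, "title": after[movie_id]})
    return events


def has_schema_fields(event: dict, schema: dict) -> bool:
    """Structural check only: the event carries exactly the schema's field names."""
    return set(event) == {f["name"] for f in schema["fields"]}


events = diff_snapshots({1: "A", 2: "B"}, {1: "A2", 3: "C"})
schema_json = json.dumps(MOVIE_CHANGE_SCHEMA)  # the form a schema registry would store
all_match = all(has_schema_fields(e, MOVIE_CHANGE_SCHEMA) for e in events)
```

Sharing one schema per event type is what lets consumers in other domains read the events without coordinating with the producing team.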

References:

  • https://netflixtechblog.com/data-mesh-a-data-movement-and-processing-platform-netflix-1288bcab2873
  • https://www.infoq.com/news/2022/08/netflix-data-mesh/

Prathvik is an ML/AI research content intern at MarktechPost and a third-year undergraduate at IIT Kharagpur. He has a keen interest in machine learning and data science, and is enthusiastic about learning how machine learning is applied in different fields of study.