LinkedIn Open-Sources ‘Venice,’ Its Derived Data Platform That Powers More Than 1,800 Datasets

LinkedIn recently announced that it would open-source Venice, its derived data platform. Venice is the engine behind more than 1,800 LinkedIn datasets and is used by more than 300 separate applications. It has been in production since late 2016 and has been steadily scaling to take the place of several existing systems.

An important architectural feature that sets Venice apart from conventional databases is how data is written. Anyone trying out a new database must write data before they can read it. Because Venice is a derived data store fed by offline and nearline sources, all writes to it are asynchronous. In other words, you cannot issue strongly consistent online write requests as you can in MySQL or Apache HBase. In exchange, this design lets Venice sustain very high write throughput.
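
To make the asynchronous write model concrete, here is a minimal toy sketch in Python. It is not Venice code: the `AsyncStore` class and its methods are hypothetical names invented for illustration, and a real system would ingest from a durable log rather than an in-memory queue.

```python
from collections import deque


class AsyncStore:
    """Toy derived-data store: writes are queued first and applied later."""

    def __init__(self):
        self._log = deque()  # pending writes, in arrival order
        self._rows = {}      # state visible to readers

    def write(self, key, value):
        # Returns right away; the caller gets no read-your-writes guarantee.
        self._log.append((key, value))

    def ingest(self, max_records=None):
        # Background step: drain queued writes into the readable state.
        applied = 0
        while self._log and (max_records is None or applied < max_records):
            key, value = self._log.popleft()
            self._rows[key] = value
            applied += 1
        return applied

    def read(self, key):
        return self._rows.get(key)


store = AsyncStore()
store.write("member:1", {"headline": "Engineer"})
before = store.read("member:1")  # the write is queued but not yet ingested
store.ingest()
after = store.read("member:1")   # visible once ingestion has caught up
```

The point of the sketch is the gap between `write` and `read`: the writer is never blocked on the serving path, which is the trade-off the paragraph above describes.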

Venice offers several ways to insert and update rows in a dataset, as well as to swap entire datasets.

  • Venice accepts writes from two kinds of sources: Hadoop and a stream processor. The most popular way to write data into Venice is the Push Job, which reads data from an Apache Hadoop grid and writes it into Venice.
  • For stream processing, LinkedIn currently has Venice integrated with Apache Samza. Venice’s framework is adaptable, so support for other stream processors can be added.
  • A partial update lets the user change a subset of the data in a row without needing access to the entire row. This is especially helpful for combining datasets from many sources when each source contributes its own columns. Collection merging is a flavor of partial update that lets the user add elements to, or remove them from, a set or map in a declarative manner. Because all writes to Venice are asynchronous, “read-modify-write” workloads are not supported, so updates are expressed declaratively instead.
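
The partial update and collection merging ideas in the list above can be sketched as follows. This is an illustrative toy in Python, not Venice's actual API: the `apply_update` function and the `set` / `add` / `remove` update format are assumptions made up for this example.

```python
def apply_update(row, update):
    """Apply a declarative update to a row and return the new row.

    `update` uses a hypothetical format:
    {"set": {...}, "add": {...}, "remove": {...}}
    """
    new_row = dict(row)
    # Partial update: overwrite only the named columns.
    for field, value in update.get("set", {}).items():
        new_row[field] = value
    # Collection merging: add elements to a set-valued column.
    for field, items in update.get("add", {}).items():
        new_row[field] = set(new_row.get(field, set())) | set(items)
    # Collection merging: remove elements from a set-valued column.
    for field, items in update.get("remove", {}).items():
        new_row[field] = set(new_row.get(field, set())) - set(items)
    return new_row


row = {"name": "Ada", "skills": {"java"}}
# Source A contributes only the "title" column; it never reads the row.
row = apply_update(row, {"set": {"title": "Staff Engineer"}})
# Source B adds and removes set elements declaratively.
row = apply_update(row, {"add": {"skills": ["python", "scala"]},
                         "remove": {"skills": ["java"]}})
```

Neither writer had to fetch the current row before sending its update. The store applies the operation on its side, which is what makes the approach compatible with asynchronous writes.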

Venice’s strategy of declaratively pushing the update down to the servers is more efficient, and copes better with large numbers of concurrent writers, than systems that directly support read-modify-write workloads (for example, via optimistic locking).

Venice supports hybrid write workloads that combine full dataset swaps (through Full Push jobs or reprocessing activities) with nearline writes (via incremental pushes or streaming). Venice can merge all of these data sources without any noticeable disruption. Specifically, after a new version of the dataset has been loaded in the background, but before reads are swapped over to it, there is a replay phase in which recent nearline writes are applied on top of the new dataset version. Reads are switched once the replay has caught up.
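
The load, replay, and swap sequence can be illustrated with a small toy model. Everything here is hypothetical (the `HybridStore` class and its method names are not from Venice), and the sketch simplifies by replaying the entire nearline log, whereas a real system would replay only a recent window.

```python
class HybridStore:
    """Toy model of a versioned dataset with a nearline replay before the swap."""

    def __init__(self):
        self.versions = {1: {}}  # version number -> key/value data
        self.current = 1         # version that currently serves reads
        self.nearline_log = []   # retained nearline writes, oldest first

    def nearline_write(self, key, value):
        self.nearline_log.append((key, value))
        self.versions[self.current][key] = value

    def full_push(self, dataset):
        # 1. Load the new version in the background; reads are unaffected.
        new_version = self.current + 1
        self.versions[new_version] = dict(dataset)
        # 2. Replay nearline writes on top of the new version.
        for key, value in self.nearline_log:
            self.versions[new_version][key] = value
        # 3. Swap reads only after the replay has caught up.
        self.current = new_version

    def read(self, key):
        return self.versions[self.current].get(key)


store = HybridStore()
store.nearline_write("a", "fresh")
store.full_push({"a": "batch", "b": "batch"})
value_a = store.read("a")  # nearline write survives the swap
value_b = store.read("b")  # batch-only key comes from the new version
```

Because the replay happens before the swap, readers never observe a version that is missing the recent nearline writes.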

Since its inception, Venice has been built with scalability and ease of operation in mind, and both receive constant attention and development time. As a result, the system natively supports multi-region and multi-cluster deployments, multi-tenancy, operator-driven configuration, self-healing, and scalability that is both linear and elastic.

Venice uses Apache Helix to quickly recover from hardware failures by strategically placing copies of data across multiple clusters. If a server fails, Helix rebalances the partitions it hosted onto the remaining healthy servers.

Rebalancing uses the same well-exercised code path that runs more than 1,400 times a day to rewrite entire datasets. The team built the system this way so that any further optimization of the write path improves the efficiency of both push and rebalancing operations. As a result, the system is more flexible. Helix also makes it simple to add new hardware to a cluster and redistribute the data already there.
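
As a rough illustration of partition placement and rebalancing, consider the toy routine below. It does not use or imitate the Apache Helix API; the `assign` function is a made-up round-robin placement, and unlike a production rebalancer it recomputes the whole placement instead of minimizing data movement.

```python
def assign(partitions, servers, replicas=2):
    """Place `replicas` copies of each partition on distinct servers."""
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested replicas")
    ordered = sorted(servers)
    placement = {}
    for p in range(partitions):
        # Consecutive offsets modulo the server count give distinct servers.
        placement[p] = [ordered[(p + r) % len(ordered)] for r in range(replicas)]
    return placement


servers = {"s1", "s2", "s3"}
before = assign(partitions=4, servers=servers)
# Server s2 fails: recompute placement over the remaining healthy servers.
after_failure = assign(partitions=4, servers=servers - {"s2"})
# New hardware is added: the same routine spreads partitions onto it.
after_scale_out = assign(partitions=4, servers=servers | {"s4"})
```

The same routine handles both the failure case and the scale-out case, which loosely mirrors the article's point that one shared code path serves several operations.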

Venice, the derived data platform, is well suited to artificial intelligence workloads. According to the team, it is also valuable in other scenarios, especially those that involve absorbing large amounts of data in various formats. It also fits cases where strong consistency is not essential but cost and efficiency are; one example is caching a sanitized representation of source-of-truth storage systems.

Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the applications of artificial intelligence across various fields, and she is passionate about exploring new advancements in technology and their real-life applications.