MIT Researchers Use Machine Learning to Build Faster and More Efficient Hash Functions That are a Key Component of Databases

Hashing is a technique used in database management systems to locate data required directly on the disc without utilizing an index structure. Since it is quicker to search for a given item using the shorter hashed key than using its original value, the hashing approach is used to index and retrieve objects in databases. The memory region where these records are stored is known as a data block or data bucket. Data is saved as data blocks whose addresses are produced by applying a hash function. A hash function produces codes that directly identify the location of data storage. So, finding and retrieving the data is simpler when utilizing these codes.

Yet, two bits of data may occasionally have the same hash result since conventional hash methods produce codes at random. This leads to collisions when a user is directed to multiple pieces of data that share a similar hash value when looking for a single item. Finding the proper one takes much longer, slowing searches and lowering performance.

 Numerous well-known methods for handling collisions include chaining, probing, and cuckoo hashing. Using perfect hash functions rather than truly random hash functions is another method for creating hash indexes. Since perfect hash functions don’t collide, they require specialized construction for each dataset and incur additional storage and processing time costs. 

Since hashing is an essential aspect of database management systems, scientists at MIT aimed to investigate whether utilizing learned models rather than conventional hash functions might lessen collisions and whether this results in better performance, especially for indexing and joining.

They discovered that, in some circumstances, using learned models rather than conventional hash functions can reduce the collisions to half in number. These trained models are produced by applying a machine-learning algorithm to a dataset intended to identify particular traits. Also, the team’s tests revealed that imperfect hash functions were frequently outperformed by learning models in terms of computational efficiency.

Since perfect hash functions were hard to create, the researchers used machine learning to take a tiny sample from a dataset and approximate the distribution’s shape or how the data are distributed. A dataset’s possible values are displayed along with the frequency with which they occur in a data distribution. The likelihood that a specific value will be found in a sample of data can be determined using the distribution. The learned model then uses the approximate position to forecast where a key will appear in the dataset.

Scientists discovered that if data is distributed predictably, trained models are simpler to design, faster to run, and result in fewer collisions than conventional hash functions. Using trained models, however, can result in more collisions if the data is not reliably distributed because the gaps between data points fluctuate too widely.

Compared to conventional hash functions, trained models may decrease the proportion of clashing keys in a dataset from 30% to 15% when data is reliably distributed. Also, they were able to outperform ideal hash algorithms in terms of throughput. In the best scenarios, learned models decreased runtime by around 30%. The researchers discovered that the number of sub-models had the most significant impact on throughput as they investigated using learned models for hashing. Smaller linear models that roughly represent the data distribution for various portions of the data make up each trained model. The learned model generates a more precise approximation with more sub-models but takes longer.

Expanding off this work, the researchers hope to employ learning models to create hash functions for various sorts of data. Also, they intend to investigate learned hashing for databases that allow for adding and deleting data. The model must adapt when data are updated in this way, but doing so while retaining model accuracy is a challenging task.

Check out the Paper and MIT Blog. All Credit For This Research Goes To the Researchers on This Project. Also, don’t forget to join our 16k+ ML SubRedditDiscord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.

Niharika is a Technical consulting intern at Marktechpost. She is a third year undergraduate, currently pursuing her B.Tech from Indian Institute of Technology(IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in Machine learning, Data science and AI and an avid reader of the latest developments in these fields.

[Announcing Gretel Navigator] Create, edit, and augment tabular data with the first compound AI system trusted by EY, Databricks, Google, and Microsoft