
Many open-source projects have developed large language models (LLMs) that are fine-tuned to follow instructions. These models can give useful responses to users' questions and commands. Notable examples include the LLaMA-based Alpaca and Vicuna and the Pythia-based OpenAssistant and Dolly.

Even though new models are released every week, the community still struggles to benchmark them properly. Because the questions put to LLM assistants are often open-ended, it is difficult to build a benchmarking system that automatically assesses the quality of their answers. Human evaluation via pairwise comparison is often required instead. An ideal benchmark based on pairwise comparison is scalable, can incrementally evaluate a new model with relatively few trials, and yields a unique ordering of all models.

Few of the current LLM benchmarking systems meet all of these requirements. Classic LLM benchmark frameworks like HELM and lm-evaluation-harness provide multi-metric measures for research-standard tasks. However, they do not evaluate free-form questions well because they are not based on pairwise comparisons.


LMSYS ORG is an organization that develops large models and systems that are open, scalable, and accessible. Their new work presents Chatbot Arena, a crowdsourced LLM benchmark platform built on anonymous, randomized battles. Chatbot Arena uses the Elo rating system, which is widely used in chess and other competitive games. The Elo rating system shows promise for delivering the desirable properties described above.
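To make the rating idea concrete, here is a minimal sketch of a standard Elo update applied to a sequence of pairwise battles. It is not the arena's actual code: the function names, the starting rating of 1000, the K-factor of 32, and the `"a"` / `"b"` / `"tie"` outcome labels are all assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return new (rating_a, rating_b) after one battle.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    k (assumed value) controls how far a single result moves the ratings.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


def compute_ratings(battles, initial=1000.0, k=32.0):
    """Process battles in order; each battle is (model_a, model_b, winner)."""
    outcome_to_score = {"a": 1.0, "b": 0.0, "tie": 0.5}
    ratings = {}
    for model_a, model_b, winner in battles:
        ra = ratings.get(model_a, initial)
        rb = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(
            ra, rb, outcome_to_score[winner], k
        )
    return ratings


if __name__ == "__main__":
    # Hypothetical battles between placeholder model names.
    demo = [("model-x", "model-y", "a"), ("model-y", "model-z", "tie")]
    print(compute_ratings(demo))
```

Because each update only touches the two models involved, a newly added model can be rated from its own battles without re-running every other pairing, which is what makes this style of rating attractive for an incremental benchmark.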

They began collecting data a week ago, when they opened the arena with many well-known open-source LLMs. The crowdsourced collection method reflects some real-world uses of LLMs: in the arena, a user chats with two anonymous models simultaneously and compares their answers.

FastChat, the multi-model serving system, hosts the arena. A person entering the arena chats with two anonymous models side by side. After receiving responses from both models, the user can continue the conversation or vote for the one they prefer. Once a vote is cast, the models' identities are revealed. The user can then keep conversing with the same two models or start a fresh battle with two new, anonymous ones. The system records all user activity, but the analysis uses only the votes cast while the model names were hidden. About 7,000 valid, anonymous votes have been tallied since the arena went live a week ago.
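The filtering rule described above (keep only votes cast while the model names were hidden) can be sketched as follows. The `Vote` record and its field names are hypothetical and are not taken from FastChat's logs; the sketch only shows the shape of the filter and a simple win tally over the surviving votes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """Hypothetical vote record; field names are assumptions."""
    model_a: str
    model_b: str
    winner: str      # "a", "b", or "tie"
    anonymous: bool  # True if model names were hidden when the vote was cast


def valid_votes(votes):
    """Keep only votes cast before the model identities were revealed."""
    return [v for v in votes if v.anonymous]


def win_counts(votes):
    """Count wins per model, using valid (anonymous) votes only."""
    counts = Counter()
    for v in valid_votes(votes):
        if v.winner == "a":
            counts[v.model_a] += 1
        elif v.winner == "b":
            counts[v.model_b] += 1
        # Ties add no win to either side.
    return counts


if __name__ == "__main__":
    log = [
        Vote("model-x", "model-y", "a", anonymous=True),
        Vote("model-x", "model-y", "b", anonymous=False),  # names were visible
        Vote("model-y", "model-z", "tie", anonymous=True),
    ]
    print(len(valid_votes(log)), dict(win_counts(log)))
```

Dropping votes cast after the names are revealed keeps brand familiarity from biasing the preference data.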

In the future, they plan to implement improved sampling algorithms, tournament mechanisms, and serving systems to accommodate a greater variety of models and to provide fine-grained rankings for different tasks.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
