Troveo expands AI data platform into five categories

Thu, 30th Apr 2026

Troveo has expanded its AI training data platform into five new categories and says it has paid more than US$20 million to content owners.

The expansion takes Troveo beyond video into audio, text, enterprise workflow data, gaming and robotics, as competition for licensed training material grows across the AI sector.

The company has built its business around supplying non-public, rights-cleared data to AI labs and model builders, arguing that access to training material has become a bigger constraint than data labelling for developers working on large models.

Its existing video operation includes more than eight million hours of licensed footage. That library is now joined by four million hours of audio, billions of words of text, enterprise workflow traces, gameplay data and first-person robotics data.

The new categories reflect a broader push by AI developers to secure data that has not already circulated widely on the public internet. Much of that material sits in broadcast archives, studio vaults, company systems and private collections, making it harder to access through conventional web scraping.

New categories

In audio, Troveo now offers four million hours of single-channel and multi-channel material across dozens of languages and dialects. It says the data is used to train voice-based systems, including automatic speech recognition, voice assistants and conversational AI.

Its text datasets draw on material from publishers and other rights holders, with corpora structured for training, fine-tuning and evaluation.

The enterprise workflow category, which it also describes as agentic trajectories, consists of business data sourced directly from companies in multiple industries. The material is intended to capture real-world workflows inside enterprises.

In gaming, Troveo is offering video game data that includes time-synchronised keystroke information and character progression metadata. It says the material can be used for world models.

The robotics category centres on egocentric data, or first-person material gathered from operating environments. Troveo says this gives developers access to data from real settings rather than simulated ones.

Licensing focus

The announcement comes as legal and competitive scrutiny of AI training data continues to intensify. Model developers face mounting questions over whether training material was obtained with clear rights and whether its provenance can be traced back to owners.

Troveo says every dataset in its library is sourced and licensed from content owners. It also says it works with thousands of content owners and has relationships with AI labs and model builders, including large technology groups.

The payout figure offers one indicator of demand for licensed data in a market that has often relied on publicly available sources. By reporting more than US$20 million in payments, Troveo is seeking to show that rights holders can earn direct income by supplying training material to AI developers.

Marty Pesis, Founder and Chief Executive Officer of Troveo, outlined the company's view of the market. "Beyond access to compute and top-tier talent, training data remains the biggest bottleneck for building the next generation of AI models. The most valuable data for solving that is real-world, meaning it captures the complexity of how people actually live and work," he said.

He added: "It is clean, accurately labeled and ready to train on. And it's non-public, meaning it has not been incorporated into a prior training run. It lives in archives, hard drives and operating environments that nobody has indexed or packaged for AI. Troveo delivers this data directly into the training environments of the world's top labs."

Market shift

Troveo's expansion points to a shift in emphasis in the AI supply chain. Early infrastructure businesses often focused on annotation and labelling, but developers are now looking further upstream for scarce, usable and licensable data.

That shift is likely to matter most for model makers building systems that need exposure to specialised or real-world activity, whether spoken language, workplace processes, game interactions or physical environments. Public web data remains abundant, but concerns over quality, duplication and prior use have made fresh sources more attractive.

Troveo says it plans to keep releasing datasets across all six categories, with video remaining the original pillar of the business. For AI companies seeking to train models on material that can be traced to rights holders, the company is positioning itself as a supplier of data from outside the exhausted pool of public internet content.
