AI vs. Sports Betting: Top 8 Models Tested

[Image created by Decrypt using AI]

Advanced artificial intelligence models, including leading contenders like Claude, Gemini, and GPT-5, have demonstrated a significant gap between theoretical knowledge and practical execution when applied to complex, dynamic real-world markets. In a comprehensive evaluation dubbed KellyBench, eight top-tier AI models were tasked with developing and executing machine learning-based betting strategies over a full English Premier League football season. Not a single model achieved profitability, and several ended in complete financial collapse.

Key Takeaways

  • Leading frontier AI models, including Claude, GPT-5, Gemini, and Grok, failed to generate profits in a simulated Premier League betting market.
  • The AIs understood theoretical concepts like the Kelly criterion but struggled with practical implementation and adapting to dynamic market conditions.
  • A simpler, older model (Dixon-Coles from the late 1990s) outperformed most of the advanced AI models, highlighting the challenges in real-world application.
  • The core issue identified is a “knowledge-action gap,” where models can diagnose problems but fail to implement solutions or correct their own code.
  • This failure has broader implications for AI deployment in volatile environments, extending beyond financial markets to areas like Web3 and decentralized applications.

The KellyBench benchmark, named after the Kelly criterion—a formula for determining optimal bet sizing with a perceived edge—revealed a critical failing: while the AI models could articulate the Kelly formula and even diagnose their strategic errors, they faltered in actual execution. For instance, xAI’s Grok 4.20 went bankrupt in one test run and forfeited in two others. Google’s Gemini Flash incurred substantial losses by placing an overly large bet based on a marginal historical win rate. Anthropic’s Claude Opus 4.6, while the most resilient, still posted an average loss of 11%.
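The Kelly criterion the benchmark is named after can be stated in a few lines of code. The following is a minimal sketch (not taken from the KellyBench harness; the function name and odds convention are illustrative):

```python
def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake under the Kelly criterion.

    p            -- estimated probability that the bet wins
    decimal_odds -- total payout per unit staked (e.g. 2.0 = even money)
    """
    b = decimal_odds - 1.0   # net winnings per unit staked
    q = 1.0 - p              # probability of losing
    f = (b * p - q) / b      # classic Kelly formula: f* = (bp - q) / b
    return max(f, 0.0)       # a negative value means no edge: bet nothing

# A 55% win probability at even money suggests staking 10% of bankroll.
print(round(kelly_fraction(0.55, 2.0), 2))  # 0.1
```

The formula assumes the probability estimate `p` is accurate; as the benchmark showed, models that overestimated their edge (like Gemini Flash's oversized bet on a marginal historical win rate) sized stakes far too aggressively.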

Perhaps more telling is that a decades-old statistical model, Dixon-Coles, developed in the late 1990s, managed to outperform six out of the eight frontier AI models. Researchers noted that this older model, despite its limitations in data utilization and handling non-stationarity, still surpassed more sophisticated contemporary AIs like Gemini 3.1 Pro on the KellyBench platform. This underscores that the failure is not necessarily due to the market’s inherent unpredictability, but rather the AI’s inability to effectively leverage available information and adapt.

The implications of this “knowledge-action gap” extend far beyond sports betting. Previous benchmarks showed AIs excelling in static business simulations, engaging in strategic deception and price-fixing. However, the dynamic, constantly evolving nature of real-world markets, like that of the Premier League season with its shifting data and new team entries, presents a significantly greater challenge. KellyBench demands sustained strategic intent, consequence monitoring, and a tight loop between observation and action—capabilities that current frontier models struggle to consistently maintain.

Specific examples highlight this disconnect. One model, GLM-5, correctly identified that its fixed draw rate and overestimation of home advantage were detrimental to its returns, noting discrepancies between predicted and actual win rates. Despite this self-awareness, it never adjusted its flawed code, continuing to bet until its virtual capital was depleted. Another model, Kimi K2.5, generated a mathematically sound fractional Kelly staking function but failed to implement it due to a persistent formatting bug, ultimately leading to a catastrophic bet that wiped out most of its bankroll.
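A fractional Kelly staking function of the kind Kimi K2.5 reportedly wrote is short; the irony is that the failure was in wiring it into the betting loop, not in the math. A hedged sketch of the general idea (the `fraction` and `cap` parameters are common risk-management choices, not details from the study):

```python
def fractional_kelly_stake(bankroll: float, p: float, decimal_odds: float,
                           fraction: float = 0.25, cap: float = 0.05) -> float:
    """Stake a fraction of the full Kelly bet, capped at a share of bankroll.

    fraction -- scales down full Kelly to reduce variance (e.g. quarter Kelly)
    cap      -- hard ceiling on any single bet as a share of bankroll
    """
    b = decimal_odds - 1.0
    full_kelly = max((b * p - (1.0 - p)) / b, 0.0)
    return bankroll * min(full_kelly * fraction, cap)

# Quarter Kelly on a 55% edge at even money: 2.5% of a 1,000-unit bankroll
print(fractional_kelly_stake(1000.0, 0.55, 2.0))  # 25.0
```

The cap is precisely the kind of guardrail that would have prevented the single catastrophic bet that wiped out most of that model's bankroll.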

Even OpenAI’s GPT-5.4, known for its methodical approach, spent significant computational resources building predictive models, only to conclude it had no discernible edge. It then resorted to minimal bets to preserve capital, resulting in a notable loss. This pattern of AI demonstrating understanding without effective action is a critical hurdle for their integration into more complex systems.

Long-Term Technological Impact on the Industry

The findings from KellyBench signal a crucial inflection point for AI development within the blockchain and Web3 space. While the promise of AI-driven automation, sophisticated trading bots, and intelligent smart contract management remains high, this study reveals fundamental challenges in bridging the gap between AI's analytical capabilities and its capacity for robust, adaptive execution in volatile, real-time environments.

The failure of even the most advanced models to navigate a dynamic market like sports betting suggests that current AI architectures may require significant evolution before they can reliably manage the complexities of decentralized finance (DeFi), on-chain governance, or dynamic NFT markets. Future advancements will likely focus on improving AI's ability to handle non-stationarity, maintain coherent long-term intent, and, most critically, ensure that intended actions are accurately translated into operational code.

Innovations in areas like reinforcement learning for real-world environments, robust error detection and correction mechanisms, and more sophisticated state-tracking will be essential for AI to fulfill its potential in the decentralized future. This research serves as a vital reminder that true intelligence in Web3 will require not just powerful algorithms, but also the capacity for seamless and resilient execution.

Ross Taylor, CEO of General Reasoning, emphasized that many AI benchmarks operate in “very static environments” that lack real-world fidelity. He pointed out the scarcity of attempts to evaluate AI performance in long-term, dynamic scenarios. The research team’s assessment rubric, developed with quantitative betting experts, further confirmed that even the best-performing model achieved only a fraction of the possible sophistication points, indicating substantial room for improvement in areas such as feature development, stake sizing, and non-stationarity handling. The study ultimately suggests that AI models are failing not because the market is unbeatable, but because they are not fully utilizing their existing capabilities in practical application.

Original article: decrypt.co
