weekend ai reads for 2025-06-13

šŸ“° ABOVE THE FOLD: ā€œTHE ILLUSION OF THINKINGā€

this paper received a lot of attention this week, for the wrong reasons; the paper is fine, but the conclusions some drew were either misinformed or misleading; we wanted to expand on the analysis to better understand it, mostly for ourselves

By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities.

  • the paper, a brisk eleven pages without references or appendices, is at the link

Did Complexity Just Break AI’s Brain? / Psychology Today (9 minute read)

Worse still, the models don’t seem to know they’re failing. They produce what appear to be sound, step-by-step answers. To a lay reader, or even a trained one, these outputs can seem rational. But they’re not grounded in any algorithmic or consistent method. They’re approximations of logic based on semantic coherence.

  • clickbait title; and no, it didn’t

  • they are correct that LLM outputs are ā€œapproximations of logic based on semantic coherenceā€

Apple did this by showing that leading models such as ChatGPT, Claude and Deepseek may ā€œlook smart – but when complexity rises, they collapseā€. In short, these models are very good at a kind of pattern recognition, but often fail when they encounter novelty that forces them beyond the limits of their training, despite being, as the paper notes, ā€œexplicitly designed for reasoning tasksā€.

  • Gary Marcus usually relishes it when LLMs fail to live up to their promise, but here he is too gleeful (in our opinion) and draws too-broad conclusions, all while missing what the paper was really trying to say

however …

Give Me a Reason(ing Model) / Zvi Mowshowitz, Substack archive (12 minute read)

It seems important that this doesn’t follow?

1. Not doing [X] in a given situation doesn’t mean you can’t do [X] in general.
2. Not doing [X] in a particular test especially doesn’t mean a model can’t do [X].
3. Not doing [X] can be a simple ā€˜you did not provide enough tokens to [X]’ issue.
4. The more adversarial the example, the less evidence this provided.
5. Failure to do any given task requiring [X] does not mean you can’t [X] in general.

Or more generally, ā€˜won’t’ or ā€˜doesn’t’ [X] does not show ā€˜can’t’ [X]. It is of course often evidence, since doing [X] does prove you can [X]. How much evidence it provides depends on the circumstances.

I spent 20$ on poking holes in their paper lol / scaling01, XCancel (3 minute read)

tl;dr

 

šŸ“» QUOTES OF THE WEEK

Tasted a little tear gas— tasted like fascism

Acyn (source)

 

Like Dario Amodei’s Machines of Loving Grace, the latest Altman essay spends the bulk of its time hand waving away all the important concerns about such futures, both in terms of getting there and where the there is we even want to get. It’s basically wish fulfillment.

Zvi Mowshowitz (source)

 

šŸ‘„ FOR EVERYONE

The AI Proficiency Report [PDF] — AI investments are increasing, but proficiency is not / Section AI (10 minute read)

AI EXPERTS: 1% of the workforce

AI experts are the most proficient AI users in the workforce, as well as the most bullish about AI’s potential. They receive the most resources for AI development, and 31% of them are saving more than 12 hours - or a day and a half - of their work week by using AI.

Their stats

• Most likely to be C-suite

• 84 average proficiency score out of 100

• 69% are daily AI users

• 47% are saving 8-12 hours per week using AI

Their differentiators

• 94% are in companies that approve of AI

• 71% have received AI training

• 84% have managers that encourage AI

• 75% report their company having a clear AI policy

What if Making Cartoons Becomes 90% Cheaper? / New York Times (14 minute read)

Just a few years ago, lip-syncing a minute of animation could take up to four hours. An animator would listen to an audio track and laboriously adjust character mouths frame by frame. But Mr. Peck’s one-minute scene took 15 minutes for the A.I. tool to sync, including time spent by an artist to refine a few spots by hand.

In Brazil’s Amazon, AI is making healthcare safer — At overburdened clinics, pharmacists use AI to catch dangerous errors. It’s frontier tech meets frontier medicine — with global implications. / Rest of World (9 minute read)

via david, Artificial Intelligence Is Not Intelligent — Despite what tech CEOs might say, large language models are not smart in any recognizably human sense of the word. / The Atlantic (9 minute read)

The plan o3 gave us was plausible, reasonable; but the plan o3 Pro gave us was specific and rooted enough that it actually changed how we are thinking about our future. [eds: emphasis theirs]

This is hard to capture in an eval.

 

šŸ“š FOUNDATIONS

Rethinking decision making to unlock AI potential / McKinsey & Company (14 minute read)

As agents take on high-frequency or transactional work, employees shift into roles that require more oversight, ethics, and judgment, including:

- Custodians who ensure the integrity of data, model performance, and customer outcomes.
- Judgment holders who handle ambiguous or high-stakes decisions where context, nuance, and trust are essential.
- Approvers and auditors who review exceptions, manage escalations, and reinforce compliance boundaries.

Why AI Agents Need a New Protocol / Frank Fiegel, Glama (7 minute read)

  • we are not API apologists, but many of the ā€œAPI limitationsā€ in this article are more about legacy API design patterns than about fundamental limitations of APIs

  • it glosses over the real advantage of MCP, which is less about technical capabilities and more about having one consistent protocol that AI models can be specifically trained on; that consistency is exactly what today’s API design patterns lack

  • anyway, still useful reading if you care about MCPs

  • related, The no-nonsense approach to AI agent development / Vercel blog (7 minute read)

UX Challenges with MCPs / Hardik Pandya (10 minute read)

First, configuration is unintuitive. MCPs work like IFTTT where you need to establish connections on both the app side and the LLM side to make them function. This creates setup friction that most users won’t tolerate.

Second, the UX approach doesn’t feel right for the future of apps with natural language capabilities. The way MCPs bolt conversational AI onto existing tools feels like a bridge solution rather than how apps will naturally evolve. The interaction patterns aren’t optimized for mass-market usage.

To understand why these limitations matter, it’s important to see the types of workflows that MCPs help with today.
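
  • to make the ā€œboth sidesā€ setup friction concrete, here is a minimal sketch of the app side: a toy MCP server written against the official TypeScript SDK (@modelcontextprotocol/sdk); the server name, the single ā€œaddā€ tool, and its schema are our own illustration, not anything from the article

```typescript
// a toy MCP server exposing one tool over stdio
// (illustrative only; names and schema are ours, not the article's)
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "demo-server", version: "0.1.0" });

// each capability is registered as a tool with a schema that the
// client advertises to the model
server.tool(
  "add",
  { a: z.number(), b: z.number() },
  async ({ a, b }) => ({
    content: [{ type: "text", text: String(a + b) }],
  })
);

// stdio transport: the host app spawns this process and pipes
// JSON-RPC messages over stdin/stdout
await server.connect(new StdioServerTransport());
```

  • and that is only half the handshake: the LLM-side host (Claude Desktop, Cursor, and the like) still has to be configured with the command to launch this process, which is precisely the friction the article describes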

 

šŸš€ FOR LEADERS

Instead, leaders should set their teams up for success by helping to identify opportunities for AI-human collaboration. One way to do this is for teams to ensure that agents are working with good data, weeding out wrong or incomplete information and regularly optimizing and providing feedback.

A Chief AI Officer Won't Fix Your AI Problems / The New Stack (8 minute read)

In my experience, these organizations aren’t necessarily resistant to innovation. Instead, they’re working to balance progress with the need for stability, compliance, and alignment with existing operations. In these cases, appointing a Chief AI Officer can serve a valuable purpose by creating a focal point for AI strategy and helping to coordinate efforts across departments.

Moats in the Age of AI / Clouded Judgement (7 minute read)

Speed isn’t just important, it is the moat. The ability to build, ship, learn, and adapt faster than everyone else is the only sustainable edge right now. In a world where everything is open source, everything is demo-able, and everything is one blog post away from being copied, speed is the only thing that compounds.

The median enterprise company in our sample set reached more than $2 million in ARR in its first year, raising a Series A just nine months post-monetization. Median consumer companies performed even better, reaching $4.2 million in ARR and raising an A round within eight months. What was once considered ā€œbest in classā€ — the $0 to $1 million ARR ramp — is now on the lower end of growth we’re seeing.

  • thoughts on product management, design, AI and more

  • link above jumps to the benefits and risks of AI prototyping tools

 

šŸŽ“ FOR EDUCATORS

Part of our task in the face of generative AI is to make an argument for the value of thinking – laboured, painful, frustrating thinking. It is not an easy sell. But to give up on this is to give up on our students, most of whom are at an age where they can be easily seduced by techno-sirens promising instantaneous essays for minimal effort and with little chance of getting caught. They deserve better from us.

Your Campus Already Has AI—And That’s the Problem / Marc Watkins, Substack archive (11 minute read)

While many of us are aware we should be mindful about uploading sensitive documents to AI systems, talking to a bot like it is a person and habitually revealing personal information to it is an extraordinary security risk when you deal with sensitive data. Our words are now prompts, our conversations become data, and the potential FERPA and HIPAA violations that may come from talking about someone with something are not being discussed enough.

The PR worker also didn’t seem to be doing ā€œhigher-level work,ā€ but simply doing analysis more quickly. The output provided by AI is clearly useful to a junior worker’s bosses, but I’m skeptical that it’s giving them a deeper understanding of how a business or industry works.

OpenAI dubs its sales pitch ā€œAI-native universities.ā€ ā€œOur vision is that, over time, AI would become part of the core infrastructure of higher education,ā€ Leah Belsky, OpenAI’s vice president of education, said.

The university will now require students to take an AI skills seminar, and it will incorporate workshops into existing frameworks like the First Year Seminar program. The seminars are optional one-credit courses tailored to first-year students in specialized subjects like Fantasy Worldbuilding in Television, Know Your Recreational Drugs, and soon, AI.

China shuts down AI tools during nationwide college exams ā€” Popular AI apps from Alibaba and ByteDance have disabled features like image recognition to prevent cheating. / The Verge (4 minute read)

 

šŸ“Š FOR TECHNOLOGISTS

Data on AI Supercomputers / Epoch AI (11 minute read)

  • US & China are far outpacing the rest of the world

How Anthropic teams use Claude Code [PDF] / Anthropic (6 minute read)

Claude Code is My Computer / Peter Steinberger (7 minute read)

TL;DR: I run Claude Code in no-prompt mode; it saves me an hour a day and hasn’t broken my Mac in two months. The $200/month Max plan pays for itself.

For the past two months, I’ve been living dangerously. I launch Claude Code (released in late February) with --dangerously-skip-permissions, the flag that bypasses all permission prompts. According to Anthropic’s docs, this is meant ā€œonly for Docker containers with no internetā€, yet it runs perfectly on regular macOS.

  • reading this made our palms sweaty

 

šŸŽ‰ FOR FUN

Seventh Sight ā€” Analyse Your Dreams

SeventhSight app uses patented Machine Learning Artificial Intelligence to analyse the meaning of your dream. Find powerful insights into your daily life by understanding what your subconscious is telling you!

  • we could not locate the patent in the USPTO search engine, if your usage of this hinged on that

Timbaland Announces New AI Entertainment Company — Timbo also introduced a new genre called ā€œA-Pop.ā€ / Billboard archive (5 minute read)

Timbaland has launched his own AI entertainment company called Stage Zero, co-founded with Rocky Mudaliar and Zayd Portillo. Its first signee is an AI pop artist called TaTa, driven by Suno AI. The pop artist, along with a bevy of AI-driven creative tools, will all be under Timbo’s new company.

 

🧿 AI-ADJACENT

Mental Health Tip: Do this research using voice mode while walking. Legal research isn’t fun, but being outside in fresh air using your voice rather than staring at a screen makes it much more bearable.

 

ā‹„