Internet of Bugs Newsletter

April 7th, 2025

Carl Brown — Thu, 10 Apr 2025 14:00:00 +0000

The Hype on AI Coding:

Several articles lately are fueling the hype cycle for “AI will be coding all the things”, none of which seem to have captured as much attention as this interview with the Y Combinator CEO:

Y Combinator startups are fastest growing, most profitable in fund history because of AI

Y Combinator CEO Garry Tan says for about a quarter of the current YC startups, 95% of the code was written by artificial intelligence models.

www.cnbc.com/2025/03/15/y-combinator-startups-are-fastest-growing-in-fund-history-because-of-ai.html

I read this as “For about a quarter of the current cohort of YC startups, 95% of the code was written [using a code editor that used an LLM to do auto-completion].” Which sounds far less impressive than what the press is making it out to be. Of course, we don’t know if that’s exactly what he meant, but it’s by far the most plausible thing I can think of.

It doesn’t really matter, though, as far as I’m concerned. How fast you write your code is no measure of quality or competency. Here’s the question:

Pulling back the curtain on the magic of Y Combinator

A first-of-its-kind deep dive into the data to see what’s really working for the industry’s biggest incubator

www.lennysnewsletter.com/p/pulling-back-the-curtain-on-the-magic

According to this article, slightly more than half of YC Companies are still alive after 10 years (and that chart has statistics for how many companies are still going as a function of when they started). So the question is: Are the companies in this batch of YC startups more or less likely to succeed? That’s the metric - and I would guess it will be lower. I’m really curious to see how it turns out.

Because this is how it seems to be going so far:

Vibe Coded AI App Generates Recipes for Cyanide Ice Cream

A Y Combinator partner proudly launched an AI recipe app that told people how to make “Actual Cocaine” and a “Uranium Bomb.”

www.404media.co/vibe-coded-ai-app-generates-recipes-for-cyanide-ice-cream-and-cum-soup

Sigh, I should have put that in the “Reality” section, but I couldn’t resist having the two Y Combinator articles back-to-back. On with the Hype section…

Amazon Cloud CEO warns developers: AI could replace your coding work within 2 years

Matt Garman, CEO of Amazon Web Services (AWS), has advised his software engineers to upskill and learn new technologies, warning that AI could replace their coding work.

www.hrgrapevine.com/us/content/article/2024-08-22-amazon-cloud-ceo-warns-software-engineers-ai-could-replace-your-coding-work-within-2-years

This article strikes me as less about whether AI can code and more about Amazon’s attitude toward its employees. It’s just more evidence for the video I did last week about Amazon and their Bossware.

And the next car in this week’s Hype Train:

‘Every software company is an AI company now,’ says AngelList CEO Avlok Kohli

Today on Equity, AngelList CEO Avlok Kohli discusses the company’s evolution, the impact of AI on startups, and key strategies for founders to succeed in today’s market, from SPVs to partnerships with larger funds.

techcrunch.com/podcast/every-software-company-is-an-ai-company-now-says-angellist-ceo-avlok-kohli

If you don’t know, AngelList is all about funding early stage companies, and this comment was made on an episode of the Equity Podcast - which is about investing. I have no doubt that, from the point of view of an investor, you only care about (and hear about) AI companies - because that’s where all the hype is. But most of the reposts I’ve seen of that quote have left out the “with respect to investing” context.

The Reality of AI Coding:

This article is probably the closest I’ve seen to the way I feel about the issue, and I appreciate it and wish more people would read it:

The machines are rising — but developers still hold the keys

Increasing use of AI in software development will make developer decisions and judgment more important, not less.

www.technologyreview.com/2025/04/02/1114134/the-machines-are-rising-but-developers-still-hold-the-keys

This is also a good take on the limitations of AI - from a perspective I hadn’t really considered that’s more about OpenAI’s claims of “PhD-level agents”:

Hugging Face's chief science officer worries AI is becoming 'yes-men on servers' | TechCrunch

Hugging Face co-founder and chief science officer Thomas Wolf thinks that AI today isn't capable of figuring out novel solutions like a human.

techcrunch.com/2025/03/06/hugging-faces-chief-science-officer-worries-ai-is-becoming-yes-men-on-servers

I’ve talked a lot about how AI is good for repeating things that it already memorized, but bad at judgement and bad at things that it hasn’t seen a lot of examples of. And this is another facet of that issue. Speaking of which:

Researchers say they've discovered a new method of 'scaling up' AI, but there's reason to be skeptical | TechCrunch

Have researchers discovered a new AI 'scaling law'? That's what some buzz on social media suggests — but experts are skeptical.

techcrunch.com/2025/03/19/researchers-say-theyve-discovered-a-new-method-of-scaling-up-ai-but-theres-reason-to-be-skeptical

The pull quote:

“[I]f we can’t write code to define what we want, we can’t use [inference-time] search,” he said. “For something like general language interaction, we can’t do this […] It’s generally not a great approach to actually solving most problems.”

And this article:

MIT study finds that AI doesn't, in fact, have values | TechCrunch

A recent study out of MIT suggests that AI systems don't have discernible values or preferences, but instead mostly imitate and hallucinate.

techcrunch.com/2025/04/09/mit-study-finds-that-ai-doesnt-in-fact-have-values

is a really interesting look at “the alignment problem” (I hate that phrase so much).

Pull quote here is:

[N]one of the models was consistent in its preferences. Depending on how prompts were worded and framed, they adopted wildly different viewpoints.

And speaking of viewpoints changing based on prompts:

Assessing and alleviating state anxiety in large language models - npj Digital Medicine

The use of Large Language Models (LLMs) in mental health highlights the need to understand their responses to emotional content. Previous research shows that emotion-inducing prompts can elevate “anxiety” in LLMs, affecting behavior and amplifying biases. Here, we found that traumatic narratives increased Chat-GPT-4’s reported anxiety while mindfulness-based exercises reduced it, though not to baseline. These findings suggest managing LLMs’ “emotional states” can foster safer and more ethical human-AI interactions.

www.nature.com/articles/s41746-025-01512-6

I think this wording of “‘anxiety’ in LLMs” is anthropomorphized crap, and a phrasing like “When given a prompt containing lots of anxiety-related words, LLMs are likely to respond with words that are also anxiety-related” would be far more accurate. But it’s more fuel on the fire for the presumption that AIs have no consistent judgement or emotional state.

And on a couple of Final Channel-Related notes:

Want to talk Software As A Service with me?

First, as I mentioned in my last couple of videos, I’m currently thinking that the best way I can provide value to the development community right now, with all the uncertainty and lay-offs and bossware apparently on the way, is to help developers get out of being isolated in dev-only teams and work on their own projects (probably Software-As-A-Service projects, since those are easiest to get up and running), which I believe will make them much better developers. Toward that end, I’m gathering information from folks about how I might best do that. So if you:

Are a Software Developer with a few years of experience already, and
Are motivated to build your own product (probably a Software As A Service), and
Are located in the U.S., and
Think it is likely that you have the financial ability to work on a project for a few months before you start seeing revenue from it, then

I’d love to chat with you about what your concerns are about starting your own thing, what resources you think you might be lacking, and how I might be able to help.

You can book an appointment on my calendar here: https://iob.fyi/ssii_a

Note for those of you that are not in the U.S.: I don’t hate you or anything, I just don’t know anything about starting a business (or even being in business) outside the US, so I’m afraid that, in my ignorance, I might recommend something that has worked for me in the US but is horrible advice in another country or market. And that means that, at least for the time being, I’m focusing on teaching what I know, and what I have experience doing. Hopefully there aren’t any hard feelings, and I hope to be able to expand my scope in the future.

Note to those of you with fewer than 5 or so years of experience: I don’t hate you either, but I’m trying to avoid a situation where I’m having to teach programming as well as teaching SaaS - at least for this set of interviews. We’ll see what the landscape looks like after I’ve talked to more people who match the criteria above. Again, hopefully there aren’t any hard feelings, and I hope to be able to expand my scope in the future.

And Lastly, the End of an Era:

Microsoft Is Killing Windows' Blue Screen of Death

Windows' Blue Screen of Death a lot less blue, but still plenty deadly. We'll miss the old screen of frustration.

www.vice.com/en/article/microsoft-is-killing-windows-blue-screen-of-death

I guess it had to happen sometime. ;-(

Rest assured, though, I have no intention of changing my channel branding any time soon.

Thanks for reading.

March 24th, 2025

Carl Brown — Tue, 25 Mar 2025 14:00:00 +0000

On Benchmarks and AGI:

I’ve been very encouraged recently by advances in the systems around AI. There was a great new benchmark posted today that I think does a great job of differentiating between LLMs “understanding” and LLMs “regurgitating”:

Announcing ARC-AGI-2 and ARC Prize 2025

Measuring the next level of intelligence with ARC-AGI-2 and ARC Prize 2025

arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025

Pure LLMs score 0% on ARC-AGI-2, and public AI reasoning systems achieve only single-digit percentage scores. In contrast, every task in ARC-AGI-2 has been solved by at least 2 humans in under 2 attempts.
Greg Kamradt via https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025

And that tracks more with what I’m seeing. LLMs just don’t do a good job of figuring out the right answer, unless, like with LeetCode tutorials, virtually every time the answer exists on the Internet, it’s correct. On those issues where a lot of the code on the Internet has mistakes in it, then it seems to be a coin flip whether the AI gets it right or not:

LLMs are Bug Replicators: An Empirical Study on LLMs' Capability in Completing Bug-prone Code

Large Language Models (LLMs) have demonstrated remarkable performance in code completion. However, the training data used to develop these models often contain a significant amount of buggy code… this paper presents the first empirical study evaluating the performance of LLMs in completing bug-prone code…

To our surprise, 44.44% of the bugs LLMs make are completely identical to the pre-fix version, indicating that LLMs have been seriously biased by historical bugs when completing code. Additionally, we investigate the effectiveness of existing post-processing techniques and find that while they can improve consistency, they do not significantly reduce error rates in bug-prone code scenarios.

arxiv.org/abs/2503.11082

The kind of visual pattern matching used by the new ARC-AGI-2 benchmark has long been the hallmark of reasoning tests. When I was a kid in elementary school, most of the IQ-test things I remember were shape-based, and that goes back to at least the Raven's Progressive Matrices test from the 1930s:

Raven's Progressive Matrices

Raven's Progressive Matrices (often referred to simply as Raven's Matrices) or RPM is a non-verbal test typically used to measure general human intelligence and abstract reasoning and is regarded as a non-verbal estimate of fluid intelligence.[1] It is one of the most common tests administered to both groups and individuals ranging from 5-year-olds to the elderly.

en.wikipedia.org/wiki/Raven%27s_Progressive_Matrices

My feeling (although I can’t find any research on this, because companies don’t like to divulge that data) is that intelligence testing has gotten lazy over the last few decades, because, when you’re sitting at a computer with a QWERTY keyboard making a test that will be printed out and given to students, it’s a whole lot easier and cheaper to make everything text-based. Putting figures into such tests is just a lot more work, and so I think they’re used a lot less now (though again, I have no statistical proof of that).

Since this kind of thing isn’t easily displayed on the Internet, there aren’t a ton of web pages out there of the form “here is the question and here are the answers” like you get when you look at StackOverflow or LeetCode tutorials or SAT prep websites. Therefore, it’s a better measure of whether the AIs are actually reasoning, or if they’re just fancy autocompletes.

It reminds me of the video-based physical properties test that I mentioned last week (which I just watched again, because I find it hilarious):

These new benchmarks that aren’t things already found all over the Internet are great, because the benchmarks we’ve been using aren’t doing a good job:

A test for AGI is closer to being solved — but it may be flawed | TechCrunch

A test for AGI, ARC-AGI, is closer to being solved — but the test may be flawed, its creators, including notable AI figure Francois Chollet, admit.

techcrunch.com/2024/12/09/a-test-for-agi-is-closer-to-being-solved-but-it-may-be-flawed

Why most AI benchmarks tell us so little | TechCrunch

The most commonly used AI benchmarks haven't been adapted or updated to reflect how models are used today, experts say.

techcrunch.com/2024/03/07/heres-why-most-ai-benchmarks-tell-us-so-little

Which is a nice contrast to last week, when it seemed like everyone was hyping up AGI.

A Note On Vibe Coding:

In vibe coding news this week:

Semantic Diffusion

I [learned about](https://bsky.app/profile/mattchughes.ca/post/3ll2sbdky3k2y) this term today while complaining about how the definition of "vibe coding" is already being distorted to mean "any time an LLM writes code" as opposed to …

simonwillison.net/2025/Mar/23/semantic-diffusion/#atom-everything

more and more people have started referring to “any time an LLM writes code” as “vibe coding”, which is not the original use at all. We’ll see if that term soon becomes as meaningless as “AI”.

For the most part, I think vibe coding is a bad idea, and I think this article expresses it pretty well:

You don't need code to be a programmer. But you do need expertise | John Naughton

AI is so good at writing software that one father asked it to organise his kids’ school lunches. But that doesn’t mean it’s taking over

www.theguardian.com/technology/2025/mar/16/ai-software-coding-programmer-expertise-jobs-threat

I should say, for the record, there’s one use case I’ve found for vibe coding that I’m finding quite a timesaver.

There’s a concept called a “Spike” (sometimes - like in the great book The Pragmatic Programmer - referred to as a “Tracer Bullet”, but I learned it back in the Extreme Programming days, when it was still a “Spike” so that’s what I call it) where you write experimental code to figure out how something works by getting a quick prototype running, and then take that code, copy and paste what you need into the project that needs the functionality - hooking it up however is convenient - and then throw the spike code away.

Vibe coding is fantastic for this. You just keep prompting the AI to get it closer and closer to what you want, and ignore the code it’s writing until you get what you’re looking for. Then, you move the code out of the AI, dissect it to figure out how it works, and then reproduce the relevant parts of it into your current Work-In-Progress, while throwing the rest away.

March 17th, 2025

Carl Brown — Tue, 18 Mar 2025 14:48:00 +0000

Welcome to issue five (March 17th 2025) of the Internet of Bugs Supplemental Mailing list.

This week in AI news... Sigh...

First off, we're told that “JPMorgan engineers’ efficiency jumps as much as 20% from using coding assistant”:

JPMorgan engineers’ efficiency jumps as much as 20% from using coding assistant

Tens of thousands of JPMorgan Chase software engineers increased their productivity 10% to 20% by using a coding assistant tool developed by the bank, its global chief information officer Lori Beer said.

www.reuters.com/technology/artificial-intelligence/jpmorgan-engineers-efficiency-jumps-much-20-using-coding-assistant-2025-03-13

That's really good to know, or at least it would be, if they defined what they meant by "efficiency." How do you measure the "efficiency" of a programmer? I've been doing this 35 years, and I have no answer to that question.

As far as I'm concerned, any honest "efficiency" measurement is made up of three factors: (tasks accomplished - technical debt incurred) / time elapsed. The problem is measuring them.

Time on task is easy to measure, and we have some ways to estimate task size (they don't all agree with each other, but at least some thought has been given to it). Measuring Technical Debt is a whole other problem. The consensus is it's pretty hard to measure (see https://www.forbes.com/sites/joemckendrick/2022/06/24/technical-debt-a-hard-to-measure-obstacle-to-digital-transformation/ for example). The only real claims to be able to define it are crap like this article: https://www.sonarsource.com/learn/measuring-and-identifying-code-level-technical-debt-a-practical-guide/ which measures technical debt as “the metrics output by the tool the people who wrote the article are trying to sell you.”

In theory, you can do an analysis in retrospect after you've traced all the bugs you've fixed (and the time it took to fix them) back to the initial code that caused them, but I've never seen anyone seriously attempt that analysis in any kind of thorough or systematic way. And certainly it can't have been done in the JPMorgan case, because not enough time has elapsed since they "started using coding assistants" for all the bugs to have surfaced so they could have been measured and traced to root causes.
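To make the measurement problem concrete, here is a minimal sketch of the formula above in Python. The units (story points for tasks and debt, weeks for time) and every number below are hypothetical, chosen only for illustration; none of it comes from JPMorgan. It shows how a team can look 20% more "efficient" when the debt term is left out, and less efficient once it is counted.

```python
from dataclasses import dataclass


@dataclass
class Period:
    """One measurement window for a team (all fields are hypothetical units)."""
    tasks_accomplished: float       # e.g. story points delivered
    technical_debt_incurred: float  # same units; this is the hard-to-measure term
    time_elapsed: float             # e.g. weeks


def naive_efficiency(p: Period) -> float:
    """What gets reported when technical debt is ignored."""
    return p.tasks_accomplished / p.time_elapsed


def debt_adjusted_efficiency(p: Period) -> float:
    """(tasks accomplished - technical debt incurred) / time elapsed."""
    return (p.tasks_accomplished - p.technical_debt_incurred) / p.time_elapsed


# Made-up numbers: the "after" team ships 20% more, but incurs more debt.
before = Period(tasks_accomplished=100, technical_debt_incurred=10, time_elapsed=10)
after = Period(tasks_accomplished=120, technical_debt_incurred=40, time_elapsed=10)

for label, period in (("before", before), ("after", after)):
    print(label, naive_efficiency(period), debt_adjusted_efficiency(period))
# before 10.0 9.0
# after 12.0 8.0
```

The arithmetic is trivial; the catch is that nobody has a trustworthy way to fill in `technical_debt_incurred` at the time the claim is being made.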

In all likelihood, like with the declaration by BP that "with AI they need 70% fewer coders" (see https://www.webpronews.com/bp-needs-70-less-coders-thanks-to-ai/ ), it's investor-directed happy talk, and any real measurement would have to wait to see how the code the AI is writing performs (likely not well, see https://leaddev.com/software-quality/how-ai-generated-code-accelerates-technical-debt and https://visualstudiomagazine.com/Articles/2024/01/25/copilot-research.aspx that I’ve referenced previously).

AI also fails miserably at really straightforward and simple tasks: it has a very, very low likelihood of correctly explaining where it got any particular piece of information. See "AI Search Has A Citation Problem":

AI Search Has A Citation Problem

We Compared Eight AI Search Engines. They’re All Bad at Citing News.

www.cjr.org/tow_center/we-compared-eight-ai-search-engines-theyre-all-bad-at-citing-news.php

I expect that those companies that lean heavily into AI-generated code will have a lot of debt to clean up - though they may not ever admit it.

In related news, you probably saw this story, which I found hilarious: "AI coding assistant Cursor reportedly tells a 'vibe coder' to write his own damn code"

AI coding assistant Cursor reportedly tells a 'vibe coder' to write his own damn code | TechCrunch

AI coding assistant Cursor reportedly refused to help a user with their code, insisting that they do it themselves.

techcrunch.com/2025/03/14/ai-coding-assistant-cursor-reportedly-tells-a-vibe-coder-to-write-his-own-damn-code

As you might guess - I'm not a fan of "vibe coding" for anything you expect to run more than once. Although a lot of people are. For example, I'm reminded by this great piece from "Pivot To AI":

Text Version:

Cursor AI assistant tells vibe coder: learn to code

Jan Swist wanted the LLM-based programming tool Cursor to write a function for him. Cursor had other ideas: [Cursor forum, archive] I cannot generate code for you, as that would be completing your …

pivot-to-ai.com/2025/03/12/cursor-ai-assistant-tells-vibe-coder-learn-to-code

that Kevin Roose of the New York Times is a Big Fan of "Vibe Coding" (as evidenced in his article "Not a Coder? With A.I., Just Having an Idea Can Be Enough" Archive Link: https://archive.is/JLeQs ) - and that also Kevin Roose was a HUGE fan of Crypto. You should read Molly White's BRILLIANT takedown of Kevin's Pro-crypto Puff Piece from March of 2022:

The (Edited) Latecomer's Guide to Crypto

A group of cryptocurrency researchers and critics annotate the irresponsible cryptocurrency puff piece that was originally published in the New York Times.

www.mollywhite.net/annotations/latecomers-guide-to-crypto

(Archive of original article at https://web.archive.org/web/20220318215400/https://www.nytimes.com/interactive/2022/03/18/technology/cryptocurrency-crypto-guide.html#expand ).

By The Way, if either or both of those names are unfamiliar to you, you should correct that ASAP. Both David Gerard's "Pivot to AI"

Pivot to AI

It can't be that stupid, you must be prompting it wrong

pivot-to-ai.com

and Molly White's "Citation Needed"

Citation Needed

Citation Needed features critical coverage of the cryptocurrency industry and of issues in the broader technology world. It is independently published by Molly White, and entirely supported by readers like you.

www.citationneeded.news

are fantastic, and should be required reading for anyone who is serious about keeping up with the way that the current AI Hype is following the same B.S. playbook from the old Crypto Hype - often by the same people - like Kevin Roose.

I'm not picking on Kevin Roose here for fun. I'm doing it because Kevin Roose just wrote a HORRIBLE take called "Powerful A.I. Is Coming. We’re Not Ready" ( Gift Link: https://www.nytimes.com/2025/03/14/technology/why-im-feeling-the-agi.html?unlocked_article_code=1.404.8tKT.-ALCTbe-6RVJ&smid=url-share )

Powerful A.I. Is Coming. We’re Not Ready.

Three arguments for taking progress toward artificial general intelligence, or A.G.I., more seriously — whether you’re an optimist or a pessimist.

www.nytimes.com/2025/03/14/technology/why-im-feeling-the-agi.html?unlocked_article_code=1.404.8tKT.-ALCTbe-6RVJ&smid=url-share

If you examine that alongside Kevin's 2022 pro-Crypto piece, you'll see a lot of similarities:

Crypto in 2022	A.G.I. in 2025
Crypto will be transformative	Powerful A.I. Is Coming
Until fairly recently, if you lived anywhere other than San Francisco, it was possible to go days or even weeks without hearing about cryptocurrency.	In San Francisco, where I’m based, the idea of A.G.I. isn’t fringe or exotic. People here talk about “feeling the A.G.I.,”…Outside the Bay Area, few people have even heard of A.G.I., let alone started planning for it.
I’ve been writing about crypto for nearly a decade, a period in which my own views have whipsawed between extreme skepticism and cautious optimism. These days...I’ve come to accept that it isn’t all a cynical money-grab, and that there are things of actual substance being built.	I didn’t arrive at these views as a starry-eyed futurist...I arrived at them as a journalist who has spent a lot of time talking to the engineers building powerful A.I. systems, the investors funding it and the researchers studying its effects. And I’ve come to believe that what’s happening in A.I. right now is bigger than most people understand.
[C]rypto wealth and ideology is going to be a transformative force in our society in the coming years.	[B]ig change, world-shaking change, the kind of transformation we’ve never seen before — is just around the corner.

I could go on and on comparing the two puff pieces - Hell, I might at some point. But hopefully you can see the similarities.

There's been a ton of talk of "AGI" this week, largely due to "Manus". Some examples (not linking to these, they don't deserve it):

"China's Manus AI 'agent' could be our 1st glimpse at artificial general intelligence": www.livescience.com (slash) technology/artificial-intelligence/chinas-manus-ai-agent-could-be-our-1st-glimpse-at-artificial-general-intelligence

"China is on the brink of human-level artificial intelligence": www.independent.co.uk (slash) independentpremium/tech/ai-manus-agi-china-chatgpt-b2713889.html

"China’s Autonomous Agent, Manus, Changes Everything": www.forbes.com (slash) sites/craigsmith/2025/03/08/chinas-autonomous-agent-manus-changes-everything/

But the one article you should read, if you're going to read one, is this one:

Everyone in AI is talking about Manus. We put it to the test.

The new general AI agent from China had some system crashes and server overload—but it’s highly intuitive and shows real promise for the future of AI helpers.

www.technologyreview.com/2025/03/11/1113133/manus-ai-review

Here's the gist:

Overall, I found Manus to be a highly intuitive tool suitable for users with or without coding backgrounds. On two of the three tasks, it provided better results than ChatGPT DeepResearch, though it took significantly longer to complete them.
Manus seems best suited to analytical tasks that require extensive research on the open internet but have a limited scope. In other words, it’s best to stick to the sorts of things a skilled human intern could do during a day of work.
Still, it’s not all smooth sailing. Manus can suffer from frequent crashes and system instability, and it may struggle when asked to process large chunks of text.
www.technologyreview.com/2025/03/11/1113133/manus-ai-review

So, as I read that, it's slightly better but much slower than a competing product (ChatGPT DeepResearch). Doesn't sound like "the brink of human-level artificial intelligence" to me, nor that it "changes everything".

Which is confusing and not very helpful, but not surprising, since after all, "No one knows what the hell an AI agent is":

No one knows what the hell an AI agent is | TechCrunch

AI agents are all the rage. But no one knows exactly what an agent is, partly because companies define them radically differently.

techcrunch.com/2025/03/14/no-one-knows-what-the-hell-an-ai-agent-is

And according to the Association for the Advancement of Artificial Intelligence's 2025 Presidential Panel on the Future of AI Research:

The majority of respondents (76%) assert that “scaling up current AI approaches” to yield AGI is “unlikely” or “very unlikely” to succeed, suggesting doubts about whether current machine learning paradigms are sufficient for achieving general intelligence.
https://aaai.org/wp-content/uploads/2025/03/AAAI-2025-PresPanel-Report-FINAL.pdf

And speaking of a lack of AGI, this is a fantastic piece of research that's hilarious to watch:

"Finally, DeepMind Made An IQ Test For AIs! 🤖"

Two Quick Follow-ups from Previous newsletters:

As a counterpoint to this article I talked about on Feb 21st:

How to Backdoor Large Language Models

Making "BadSeek", a sneaky open-source coding model.

blog.sshh.io/p/how-to-backdoor-large-language-models

This is a paper on detecting backdoors in models that made me feel a little bit better (just a little bit, though):

Auditing language models for hidden objectives

We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.

arxiv.org/abs/2503.10965

And as a non-AI follow-up to the note from Feb 24th about how the US Government's "Cyber Safety Review Board" had been disbanded, putting us all at more risk, here's an article about the state of the US Government's Cybersecurity and Infrastructure Security Agency:

‘People Are Scared’: Inside CISA as It Reels From Trump’s Purge

Employees at the Cybersecurity and Infrastructure Security Agency tell WIRED they’re struggling to protect the US while the administration dismisses their colleagues and poisons their partnerships.

www.wired.com/story/inside-cisa-under-trump

Again, regardless of your politics, this group has been instrumental in keeping the Internet from getting even less safe over the last 6 or 7 years. Making them less effective makes the whole Internet a Buggier and Scarier place.

March 10th, 2025

Carl Brown — Wed, 12 Mar 2025 04:17:40 +0000

Another Start-Up “Freelance” Agent, more “It’s over” hype

“Manus” claiming to have “solved problems” on Upwork and Fiverr

Last week, we got a new startup announcement, with a new Agent called “Manus” - this time called “The General AI Agent”. Once again, they made the claim that the agent had done freelance work on Upwork (although, unlike with Devin, they were smart enough not to say the agent got paid, and they didn’t post a video showing it that someone could nitpick).

This, of course, led to a number of clickbait headlines, including “It’s OVER! Manus: This NEW 1-Click AI Agent is INSANE! 🤯”, “First TRULY General Agent "MANUS" Blows Up the Internet - The Most HYPED AI Ever!”, “Manus AI: Build ANYTHING 🤯”, “This New AI Agent Just Changed Everything... (Manus AI Agent)”, and so on.

I did find one video where it wrote a Python script to convert a particular JSON file to an Excel spreadsheet - a job that, in theory, would have been worth $10 (USD), had it actually bid on the job, been chosen to do it, and gotten the answer correct (the video showed that it did create a Python script that did produce a JSON file, but made no attempt I could see to validate any of the answers).

Hopefully, one day, we’ll actually get an AI impressive enough that it doesn’t have to be hyped beyond belief for anyone to care. But that day, apparently, still has not come.

And… if this study is to be believed, it may not for decades:

New Research on the current race for AGI

Evaluating Intelligence via Trial and Error

Intelligence is a crucial trait for species to find solutions within a limited number of trial-and-error attempts. Building on this idea … we comprehensively evaluate existing AI systems. Our results show that while AI systems achieve the Autonomous Level in simple tasks, they are still far from it in more complex tasks, such as vision, search, recommendation, and language. … To put this into perspective, loading such a massive model requires so many H100 GPUs that their total value is $10^{7}$ times that of Apple Inc.'s market value... This staggering cost highlights the complexity of human tasks and the inadequacies of current AI technologies.

arxiv.org/abs/2502.18858

This paper was fascinating, and I really appreciated reading it. It uses a broad range of tasks (coding, mathematics, vision, writing, search, recommendations, and others) to look for “general” intelligence.

Projection from current LLM Models to AGI

It concludes that, with current techniques, it would take 70 years and/or 4 × 10^7 times Apple’s market value in GPUs to get to AGI, requiring an artificial neural network “5 orders of magnitude higher than the total number of neurons in all of humanity’s brains combined.”

While I admit that this study does validate my existing biases, and so I can’t be completely impartial, it seems to me to have actual data and a mathematical rigor that is sorely lacking in any of the projections I’ve seen claiming AGI is just around the corner.

Follow up on 12 Factor Apps

And, as promised, here’s more ranting about bad things in 12 factor:

One: Codebase

This quote, I think, sums it up: “If there are multiple codebases, it’s not an app – it’s a distributed system.” I couldn’t agree more. And by thinking in terms of an isolated app, and ignoring the system it’s part of (more on that later), the practitioner leaves themselves vulnerable to all kinds of errors and vulnerabilities.

Two: Dependencies

This is just really naive and ridiculous: “A twelve-factor app never relies on implicit existence of system-wide packages.” What about Python? libc? Docker? The JVM?

In fact, it’s impossible not to depend on system-wide packages. So you’re better off embracing it, getting to know (or be) your ops team, and treating the system like a system, instead of treating your app like it’s all you have to care about and everything around it is someone else’s problem.

Three: Config

The biggest pushback I got from my video on 12 Factor Apps was an assertion that, although 12 factor insists that you should put all your secrets in the environment, it doesn’t specifically say that you should use a .env file (despite that being the way that the vast majority, if not all, of the popular web frameworks implement initializing said environment).

Assuming for the moment that argument was made in good faith, even if you initialized the environment a different way, it would still be a bad idea. By putting those variables into the environment, you are putting them in a place that attackers know how to find in a format that they know how to read.

Keep in mind the threat model here: We’re not talking about a state-sponsored hacking group attempting to manually attack your network specifically with previously unknown zero-day vulnerabilities. We’re talking about automated tools that take advantage of common vulnerabilities, configuration mistakes, and insecure implementations to harvest secrets and passwords at scale (which is how 110,000 different sites were compromised by just one group on just the AWS platform). Against tools like that, you should not be using anything as insecure as POSIX environment variables (which, keep in mind, were NEVER designed or intended to hold data in a secure fashion, and are not secure).
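To make that concrete, here is a minimal Python sketch of just one of the ways a value in the environment gets around: every child process the app launches inherits it by default. The variable name DEMO_API_KEY and its value are made up for illustration, and this is a demonstration of default process behavior, not of any particular attack tool.

```python
import os
import subprocess
import sys

# Hypothetical secret name and value, for illustration only.
os.environ["DEMO_API_KEY"] = "not-a-real-secret"

# A child process inherits the parent's environment by default, so anything
# the app shells out to can read the secret without ever being handed it.
child_code = "import os; print(os.environ.get('DEMO_API_KEY', ''))"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
    check=True,
)
leaked = result.stdout.strip()
print("child process saw:", leaked)
```

On Linux, the initial environment of a process is also exposed as a file at /proc/<pid>/environ, readable by anything running as the same user, which is exactly the kind of well-known location and format an automated script can go looking for.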

Four: Backing Services

“The code for a twelve-factor app makes no distinction between local and third party services.” This is just unnecessarily pedantic, limiting, and generally a bad idea.

If you have multiple services that talk to each other, and one of them needs to change (as they all will eventually), you have two choices:

One: test the new, changed service with new versions of the services that depend on it and incorporate the changes they need to talk to the new service, or:

Two: make the services that depend on the changing service able to work with both versions independently and equally.

Technique two is possible, but it’s a ton more work, and much more likely to result in bugs. It’s much safer and faster to check the version number when the connection starts and fail if it’s not the version you expect, and then roll out changes in lockstep.
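The "check the version when the connection starts and fail" approach could look something like the following sketch. The Hello message, the version number, and the error type are all hypothetical names made up for illustration; a real service would carry the version in whatever handshake its protocol already has.

```python
from dataclasses import dataclass

# Hypothetical: the one protocol version this client was tested against.
EXPECTED_PROTOCOL_VERSION = 3


class VersionMismatchError(RuntimeError):
    """Raised when the peer speaks a version we were not rolled out with."""


@dataclass(frozen=True)
class Hello:
    """Hypothetical first message a backing service sends on connect."""
    service_name: str
    protocol_version: int


def check_handshake(hello: Hello, expected: int = EXPECTED_PROTOCOL_VERSION) -> None:
    """Fail fast instead of limping along with an untested version."""
    if hello.protocol_version != expected:
        raise VersionMismatchError(
            f"{hello.service_name} speaks protocol v{hello.protocol_version}, "
            f"expected v{expected}; refusing to connect"
        )


check_handshake(Hello("billing", 3))  # versions match, so nothing happens

try:
    check_handshake(Hello("billing", 4))
except VersionMismatchError as err:
    print(err)
```

The point of the design is that a mismatch shows up as one loud failure at connect time, rather than as subtle misbehavior halfway through a request.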

Five: Build, Release, Run, Six: Processes, and Eleven: Logs

Not much here beyond what I said in my video - there are times that, if you want to fix a bug that’s only happening in production, you need to debug (somehow) in production. To believe otherwise is either to choose to live in ignorance or denial.

Seven: Port Binding

This one, like the environment, is also a security issue. By forcing all apps to live on some network port, you make it easy for attackers (or their automated scripts) to just scan the ports, find one that’s open, and poke at it looking for common vulnerabilities. There’s no real benefit to doing it this way (aside from it being what you’re used to), so why would you?

Eight: Concurrency

First off, the quote: “rely on the operating system’s process manager” in the last paragraph of factor 8 just irritates me to no end, because it directly contradicts “A twelve-factor app never relies on implicit existence of system-wide packages” from factor two.

But, more importantly, this item makes a lot of assumptions about equal workloads. Reality is often not so equal.

What often happens in this kind of case is that lots of front end and worker servers get spun up in response to load, which can easily outstrip the capabilities of the backend storage (usually some kind of database). This is exactly what the cloud providers want, and what they’ll tell you to do is to just buy a more expensive, higher performance version of their database product so it won’t be the bottleneck anymore. Most of their customers will do this, and end up spending a whole lot of money for capacity that’s only used a tiny fraction of the time.

There are too many variables here for me to tell you exactly how to handle this without upgrading your storage. What I will say is that, if you are spinning up more servers than your storage can handle, before upgrading, ask yourself if breaking the assumption that all workloads are equal might make more sense (for example, what if you separated your paid and free customers into different clusters, so that one can’t affect the other, maintaining your commitment to your paying customers and letting the free customers just get very slow on rare occasions? What if you make temporary database servers that offload some non-critical transactions when under heavy load and reconcile them later?)
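As a rough illustration of the first idea (separate clusters for paid and free customers), here is a minimal sketch. The tier names and hostnames are hypothetical, and a real setup would involve a lot more than a lookup table, but the core of it is just refusing to treat all workloads as equal at the point where a request picks its storage.

```python
# Hypothetical cluster endpoints; nothing here comes from a real deployment.
CLUSTER_FOR_TIER = {
    "paid": "db-paid.internal.example",
    "free": "db-free.internal.example",
}


def cluster_for(customer_tier: str) -> str:
    """Pick the storage cluster for a customer based on their plan.

    Unknown tiers fall back to the free cluster, so a bad tier value can
    never add load to the cluster that paying customers depend on.
    """
    return CLUSTER_FOR_TIER.get(customer_tier, CLUSTER_FOR_TIER["free"])


print(cluster_for("paid"))
print(cluster_for("free"))
```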

(Skipping nine, because I don’t really have a problem with it)

Ten: Dev/prod parity

This one is great in theory, but useless in practice. It ignores the biggest question: HOW?

It’s all well and good to say “make staging as similar to prod as possible” but it doesn’t even touch on the difficulties involved. Primarily: How do you populate your staging/test/UAT environment with data that resembles production enough to be a good test, and without risking the privacy of your customers’ Personally Identifiable Information by making lots of copies of it? How do you test notifications/emails with customer-like data while making sure the real customers don’t get any stage notifications?

It just says “make the ‘tools’ gap as small as possible” as if that was in the least bit sufficient.

Twelve: Admin Processes

This one... Let’s just say there’s often a much more useful way to do this.

What I’ve done on several projects is to embed a TCL interpreter into the running processes that allowed us to run our one-off tasks, as well as (and more importantly) query the processes in real time for debugging purposes. TCL was a good choice because it was so easy to embed in a process written in a different language [NOTE: this was decades ago - there are better options now].

If I were to do that today, I’d probably use Lua instead - it’s what the cool kids seem to be using for this these days.

Feb 24th 2025

Carl Brown — Tue, 25 Feb 2025 16:00:00 +0000

Updates from Previous Videos

New Coding Benchmarks

I’ve complained a lot about LLM coding benchmarks. There’s a new one, and it’s at least a step in the right direction.

Except of course, for the inevitable new round of irresponsible clickbait (note this isn’t a link, just a picture, because I don’t want to reward the clickbait, but you can find it if you want, though I wish you wouldn’t):

Not a link - please don’t feed the clickbait

This is, of course, not at all what’s actually going on. Here’s a decent write up:

Benchmarking AI on Software Tasks with OpenAI SWE-Lancer

SWE-Lancer benchmarks AI models on 1,400+ real freelance software engineering tasks, evaluating their coding and management capabilities.

adasci.org/benchmarking-ai-on-software-tasks-with-openai-swe-lancer

Here’s the actual paper, which is quite interesting:

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at \$1 million USD total in real-world payouts.

arxiv.org/abs/2502.12115

What they did to make this benchmark is grabbed a bunch of actual tasks from one company (Expensify) and a handful of their github repos, which seem to be all React/JS based (so it’s not exactly representative of the profession, but you can’t have everything).

They also hired (they say) a group of professional programmers to create automated acceptance tests to decide whether the LLM “passed.” Which means that the list of tasks isn’t limited (like some previous benchmarks) to only those issues and pull requests that came with unit tests, and that’s an improvement.

From what I can tell, there’s still a big miss here, in that I don’t see anywhere that tests get run to make sure that, in the course of adding the fix/feature, the AI didn’t break anything else. But it’s still a better benchmark than any others I’ve seen. Baby steps, I guess.

Those jobs all have real-world dollar amounts attached to them - amounts that were actually paid to the people that wrote the code, and those dollar amounts are used as the “score.” And I don’t have a problem with that as a metric for difficulty, despite the clickbaity way that turns into headlines about “AI earning $400,000 on Upwork!!!”

To be clear - like with the Devin video I debunked, the AIs are not “earning” any actual money here. They’re just trying to replicate the code that was written by the people that did earn the money. None of the actual tasks involved in being a consultant (e.g. communication, bidding, proposals, etc) were being done - it’s just the code part. Most importantly, any questions, clarification or discovery that the actual coder did in the course of completing the task was just handed to the LLM as part of the prompt.
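To show how a dollar "score" like that works, here is a minimal sketch of the idea as described above: each task carries the amount that was paid to the human who did it, and the model is credited with that amount if its patch passes the acceptance tests. The task names and amounts are invented, and this is not the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str          # made-up task label
    payout_usd: int    # what the human freelancer was paid for this task
    passed: bool       # did the model's patch pass the acceptance tests?


def dollars_scored(tasks: list[Task]) -> int:
    """Sum the payouts of the tasks the model passed. No money changes hands."""
    return sum(task.payout_usd for task in tasks if task.passed)


tasks = [
    Task("fix-login-redirect", 250, True),
    Task("add-export-button", 1000, False),
    Task("typo-in-settings", 50, True),
]
score = dollars_scored(tasks)
total = sum(task.payout_usd for task in tasks)
print(f"scored ${score} of a possible ${total}")
```

Seen this way, the dollar figure is a difficulty-weighted pass rate, which is a perfectly reasonable metric and a very different thing from income.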

Also, like most benchmarks, it’s likely only a matter of time before all the LLMs memorize all the issues and patches in all the Expensify GitHub Repos, so I don’t expect it to be useful for too long. But, it’s better than what I’ve seen so far.

Unfortunately, though, like seemingly everything these days, it just gets turned into alarmist clickbait.

More Fake Demos/Announcements

Google Co-Scientist AI cracks superbug problem in two days! — because it had been fed the team’s previous paper with the answer in it

The hype cycle for Google’s fabulous new AI Co-Scientist tool, based on the Gemini LLM, includes a BBC headline about how José Penadés’ team at Imperial College asked the tool about a problem…

pivot-to-ai.com/2025/02/22/google-co-scientist-ai-cracks-superbug-problem-in-two-days-because-it-had-been-fed-the-teams-previous-paper-with-the-answer-in-it

This is yet another example of all the faked (or at the very least incredibly exaggerated) demos and announcements that I talked about in this video:

I wonder how long it will be before I have enough new examples of faked demos that I could fill up another video with them.

New Code Quality Report

Follow up from this video where I talked about code quality metrics:

There is a new study from the same GitClear group as last time (you have to give them your email address if you want the full report):

AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones - GitClear

www.gitclear.com/ai_assistant_code_quality_2025_research

With a good write up here:

How AI generated code compounds technical debt

GitClear’s latest report exposes rising code duplication and declining quality as AI coding tools gain in popularity.

leaddev.com/software-quality/how-ai-generated-code-accelerates-technical-debt

What it looks like is happening now (which makes sense if you think about it) is that there’s far less code reuse than previously. So the idea is that every time you ask the AI to write code, it doesn’t check to see if code that already does that thing is already in your codebase and then reuse it - it just writes a whole new thing with its own new quirks from scratch (or at least from its training set) without regard to its context.

This means that, over time, you’ll inevitably end up with lots and lots of little bespoke, unrelated solutions to related and similar problems, which means the bugs can really multiply.
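For a sense of what the simplest version of this looks like in a codebase, here is a small sketch that uses Python's standard ast module to find functions whose bodies are exact duplicates under different names. This is only an illustration of the kind of thing being described, not GitClear's methodology, and the sample functions are made up.

```python
import ast
from collections import defaultdict


def find_duplicate_functions(source: str) -> list[list[str]]:
    """Group function names whose bodies are structurally identical."""
    tree = ast.parse(source)
    names_by_body = defaultdict(list)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Fingerprint only the body, so the function's own name is ignored.
            fingerprint = "".join(ast.dump(stmt) for stmt in node.body)
            names_by_body[fingerprint].append(node.name)
    return [sorted(names) for names in names_by_body.values() if len(names) > 1]


# Made-up sample: two "different" helpers that do exactly the same thing.
SAMPLE = '''
def slugify_title(text):
    return text.strip().lower().replace(" ", "-")

def make_url_slug(text):
    return text.strip().lower().replace(" ", "-")

def shout(text):
    return text.upper()
'''

duplicates = find_duplicate_functions(SAMPLE)
print(duplicates)
```

Exact clones like these are the easy case. The near-duplicates, each with its own slightly different quirks, are harder to spot and are where the bugs tend to hide.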

I’m ashamed to say that this had not already occurred to me, because like I said, it makes perfect sense if you think about it.

Yet another way that LLMs can replace some lower level code writing now, but still fail at the higher level judgement calls.

LLMs test as having Dementia

To follow up from my “Coding AIs are the Memento Guy” theme from this video:

There’s a new article out about how LLMs fail dementia tests:

AI chatbots test as having cognitive decline

You know how chatbots can do fine in short bursts, but then you ask them how many ‘R’s there are in “strawberry” and they act like they’ve got a concussion? For the British Medical Journal’s Christ…

pivot-to-ai.com/2025/02/18/ai-chatbots-test-as-having-cognitive-decline

Actual paper here: https://www.bmj.com/content/387/bmj-2024-081948

Note that this does NOT say that the models decline over time - the models are fixed (I find the headline to be ambiguous). This says that the models, when given a test used to diagnose mental decline in humans, do as poorly as a human suffering from (some amount of) dementia.

Just another reason why we don’t want to trust them with our important decisions.

And, speaking of important decisions that they shouldn’t be trusted with, here’s this article:

ChatGPT is truly awful at diagnosing medical conditions

The large language model gets medical calls wrong more often than not.

www.livescience.com/technology/artificial-intelligence/chatgpt-less-accurate-than-a-coin-toss-at-medical-diagnosis-new-study-finds

Which is what I would have expected, but it will be nice to have it around when people talk about how much AI is going to revolutionize diagnosis.

LLM Security Paper

I’ve talked from time to time about the fact that we know very little about how LLMs can be attacked by a malicious user. Here’s a great paper about that:

How to Backdoor Large Language Models

Making "BadSeek", a sneaky open-source coding model.

blog.sshh.io/p/how-to-backdoor-large-language-models

The scariest thing to me is how impossible it seems to be able to tell the difference between the clean and the backdoored model. Take a look at this figure from the article:

This is, effectively, a diff that represents the backdoor. Pretty much no chance at present to detect that.

That reminds me of a really old (even for me) talk from Ken Thompson (of C and Unix fame) from 1984 (when I was in Junior High):

Reflections on trusting trust | Communications of the ACM

To what extent should one trust a statement that a program is free of Trojan horses? Perhaps it is more important to trust the people who wrote the software.

dl.acm.org/doi/10.1145/358198.358210

PDF Here: http://users.ece.cmu.edu/~ganger/712.fall02/papers/p761-thompson.pdf

He found he could successfully put a back door in the login program that didn’t show up in the source code by also putting a back door in the compiler to detect it was compiling the login program and inserting the back door. And also detecting it was compiling a compiler, and injecting into the compiler it was building the code to backdoor both login and the compiler. And so, even if you inspected all the source code yourself for both login and the compiler, and verified there wasn’t a problem, if you built it with a corrupted compiler, you were hacked.
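A toy sketch of the first half of that trick may help. Here the "compiler" is just a Python function that turns source text into a callable, and the "login program" is a one-line password check; all names and passwords are invented. It deliberately leaves out the self-reproducing part (the compiler re-inserting the trick when it compiles a compiler), which is what made the original so hard to get rid of.

```python
# The login source an auditor would read. Note there is no backdoor in it.
LOGIN_SOURCE = (
    "def check_password(password):\n"
    "    return password == 'correct horse'\n"
)

# What the corrupted compiler quietly adds (hypothetical master password).
BACKDOOR_LINE = "    if password == 'joshua': return True\n"


def clean_compile(source: str) -> dict:
    """Toy 'compiler': executes the source and returns the resulting namespace."""
    namespace = {}
    exec(source, namespace)
    return namespace


def trojaned_compile(source: str) -> dict:
    """Same as clean_compile, except it recognizes login and tampers with it."""
    marker = "def check_password(password):\n"
    if marker in source:
        source = source.replace(marker, marker + BACKDOOR_LINE)
    return clean_compile(source)


clean_login = clean_compile(LOGIN_SOURCE)["check_password"]
bad_login = trojaned_compile(LOGIN_SOURCE)["check_password"]

print(clean_login("joshua"))  # False: the audited source has no backdoor
print(bad_login("joshua"))    # True: same source, corrupted compiler
```

Both binaries come from the same, fully inspected source, and the only difference is which compiler built them.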

You could, though, inspect the Assembly code that the compiler generated, and/or decompile the executable and look at the instructions. So it was possible to find the backdoor with tools developers could learn how to use - if you thought to look (and, in fact, knowing how to decompile code (or stop it in the debugger) and read assembler is a tool in my toolbox I’ve relied on many times). I know of no such technique or skill that can be learned to find the equivalent backdoor in an LLM, though. Really makes you think about using AIs, even local, “open-weight” ones, for any security-related work.

For the record, I’m FAR more terrified of what a bad actor (or incompetent OpenAI employee) could cause an LLM to do than I am of any of the “becoming self-aware” or “escaping into the Internet” nonsense I’ve been seeing so much about lately.

One Last Note on the State of Bugs on the Internet (non-AI this time)

I try not to get too political, but I can’t let this go.

There was a report on how hackers are using custom malware to spy on Telecoms:

Chinese hackers use custom malware to spy on US telecom networks

The Chinese state-sponsored Salt Typhoon hacking group uses a custom utility called JumbledPath to stealthily monitor network traffic and potentially capture sensitive data in cyberattacks on U.S. telecommunication providers.

www.bleepingcomputer.com/news/security/salt-typhoon-uses-jumbledpath-malware-to-spy-on-us-telecom-networks

This was reported not through the usual channels, but from Cisco:

Weathering the storm: In the midst of a Typhoon

Cisco Talos has been closely monitoring reports of widespread intrusion activity against several major U.S. telecommunications companies, by a threat actor dubbed Salt Typhoon. This blog highlights our observations on this campaign and identifies recommendations for detection and prevention.

blog.talosintelligence.com/salt-typhoon-analysis

Kudos to Cisco, but in general, this is bad - because Cisco has a vested interest in not finding (or not announcing) that they did anything wrong. And, in fact, this article goes out of its way to say: “No new Cisco vulnerabilities were discovered during this campaign.”

Someone needs to keep the big companies honest about this stuff, because their track record isn’t great:

Cisco Harasses Security Researcher

I’ve written about full disclosure, and how disclosing security vulnerabilities is our best mechanism for improving security—especially in a free-market system. (That essay is also worth reading for a general discussion of the security trade-offs.) I’ve also written about how security companies treat vulnerabilities as public-relations problems first and technical problems second. This week at BlackHat, security researcher Michael Lynn and Cisco demonstrated both points. Lynn was going to present security flaws in Cisco’s IOS, and Cisco went to ...

www.schneier.com/blog/archives/2005/07/cisco_harasses.html

But unfortunately, self-reporting is all we’ve got right now, because the group that has been reporting on these Salt Typhoon attacks up until recently (cf. https://markgreen.house.gov/2024/12/chairman-green-issues-statement-ahead-of-first-csrb-meeting-on-salt-typhoon-cyber-intrusions ) has been disbanded by the Trump administration:

Trump disbands Cyber Safety Review Board, Salt Typhoon inquiry in limbo

Some experts are concerned that the dismissal of the Cyber Safety Review Board removes a critical security blanket and cancels a report that could have been valuable to cybersecurity leaders.

www.csoonline.com/article/3807871/trump-administration-disbands-dhs-board-investigating-salt-typhoon-hacks.html

Supposedly on the advice of the MORONS that don’t even know how to turn on the most basic authentication on a CloudFlare database:

DOGE’s .gov site lampooned as coders quickly realize it can be edited by anyone

DOGE site is apparently not running on government servers.

arstechnica.com/tech-policy/2025/02/doges-gov-site-lampooned-as-coders-quickly-realize-it-can-be-edited-by-anyone

Hopefully despite any political affiliation you might have, if you’re someone who makes a living on the Internet, you’ll realize this is a bad situation, and we shouldn’t stay silent about it.

February 17th

Carl Brown — Tue, 18 Feb 2025 04:57:46 +0000

Old Video, New Info:

Updates on new information that has arisen about videos that have already been posted.

DeepSeek Clarification (w.r.t. AGI)

So, pretty much all the negative feedback I've gotten on my last video (which generated more negative feedback than anything I've done in a while) was about the short (48 second) segment when I gave details about the internals of DeepSeek.

Mea Culpa. That was dumb of me. From now on, with respect to the internals of something, I will endeavor to either do my research and cover it in a sufficient level of detail with the appropriate caveats, or not mention it at all.

I'm going to be posting a new copy of that video before too long with that segment cut out for posterity purposes (I'll tell YouTube not to notify you all about it, so I won't waste your time watching it twice). I'll change the thumbnail and description of the existing one to tell people to go to the new one instead. (There's a "feature" in YouTube where you can remove a segment of a video, but it no longer works for most videos, because that feature gets disabled for a video as soon as YouTube adds a translation track to it, which is now the default).

I've started doing my homework and writing up what I should have said during that segment, but I'm not confident that I know what I'm talking about yet, so I'm not going to put it here right now.

I will say, a helpful watcher referred me to this link: https://medium.com/@seanbetts/peering-inside-gpt-4-understanding-its-mixture-of-experts-moe-architecture-2a42eb8bdcb3

Which is about how likely GPT-4 is to be a Mixture of Experts (MoE) model. I wasn't aware of this, and am grateful someone pointed it out to me.

Speaking of which, I've set up an email address: tips@internetofbugs.com if you have any corrections or information you want to send me, or if you have an article, headline or piece of news you'd like my take on. Right now, I have people asking me for my thoughts on things by posting comments on my videos and, although I appreciate the engagement, YouTube comments aren't great for that, and I'm sure I miss things.

Job impact

So, to follow up on my video about Software Developer Economics: there was this tweet that went viral about Software Developer Jobs:

Software developer job postings over the last five years
Hard to find a crazier chart
— BuccoCapital Bloke (@buccocapital)
11:17 PM • Feb 12, 2025

Which is a screenshot of roughly this graph:

Now, there are two questions that come to mind. First, given that this is just a count of job postings from a single job board, is it representative, or might it be flawed because of the company itself, or the way that AI auto-job-submissions have disrupted the whole job posting process, or some other reason? And Second, what does this look like in historical context?

I wish we had data on employment broken out by title, the way we do for job postings, but we don't. I also wish we had job posting data going back to before the pandemic, but that data set starts in 2020.

But here's what we do have: the same graph (dotted) with the total number of US information sector workers superimposed on it - both in terms of percentages from previous numbers. The total number of workers doesn't fluctuate as much, so the numbers are smaller, but the trend is the same. So it's not a perfect approximation, but it's worth a look:

Now, here's the same graph, but expanded a few decades.

See that bump in 2022 and the trough in 2024? Looks like the one 2000 to 2002, doesn't it?

And here is the number of information sector workers in raw numbers:

2020-2024 doesn't seem so bad in perspective, now, huh? See that HUGE Drop from 2000 to 2011? That wasn't fun. This also isn't fun, but we lived through that, and we'll live through this.

There was definitely some over-hiring that happened during the pandemic - go figure, when most business stopped being done in person and had to move online, more stuff needed to be built online. Now that things are going back to being done in person, that's readjusting. It's not a reason to panic. Is AI affecting this? I'm sure it is some, but I think it's more likely that more of it is caused by not needing as much stuff built online in a hurry as was the case before things started going back to their pre-quarantine levels.

It's just a way for people to try to scare you to get clicks. Yes, it's rougher than it was a couple of years ago, but it's not the end.

There's a really interesting metric that Anthropic is starting to track - which is classifying questions put to Claude by the profession they're most closely associated with.

And a lot of the questions are Programming and/or Math related.

We don't know what that means for programming jobs, because we don't know which questions are from workers, and which are from students, etc, but it's something I find fascinating and I'll keep watching.

Don't Panic. Here we go again.

I'm thinking this will be a regular section, where I talk about some new assertions being made about AI, that turn out - if you're old enough, have lived through enough, and/or studied history - to just be retreads of assertions made about past technologies that seem ludicrous given the society we have now.

Will AI Make us Dumber?

Several reports this week on a study about how AI is making the people that use it dumber:

Is AI making us dumb? | TechCrunch

Researchers from Microsoft and Carnegie Mellon University recently published a study looking at how using generative AI at work affects critical thinking

techcrunch.com/2025/02/10/is-ai-making-us-dumb

Microsoft Study Finds AI Makes Human Cognition “Atrophied and Unprepared”

Researchers find that the more people use AI at their job, the less critical thinking they use.

www.404media.co/microsoft-study-finds-ai-makes-human-cognition-atrophied-and-unprepared-3

Sigh

This happens all the time, with technology after technology. I remember my grandmother complaining about how TV was going to rot our brains.

But this is a far older trope. Here's a discussion about how Socrates thought that teaching people to read would make them dumber:

Socrates thought that the written word would make people stupid

Socrates, the renowned philosopher of ancient Greece, held a rather unconventional view of the written word. In his eyes, the act of writing had the potential to diminish human intellect rather than enhance it. He believed that relying too heavily on written texts could lead to intellectual laziness, as it allowed people to read and regurgitate information without truly understanding it.

historyofyesterday.com/socrates-thought-that-the-written-word-would-make-people-stupid

And there are a ton of similar stories.

There is, though, a specific way that this is actually really problematic, which is sometimes referred to as the "reverse centaur problem." That happens when something like an autopilot or other automated system is going through a process, and something unexpected happens that the AI isn't trained for or doesn't recognize, and then it drops the problem in the human operator's lap with very little warning and the clock ticking. It turns out that humans don't do well in those situations:

Pluralistic: Humans are not perfectly vigilant (01 Apr 2024) – Pluralistic: Daily links from Cory Doctorow

I'm speaking here of the reverse-centaur: automation in which the computer is in charge, bossing a human around so it can get its job done. Think of Amazon warehouse workers, who wear haptic bracelets and are continuously observed by AI cameras as autonomous shelves shuttle in front of them and demand that they pick and pack items at a pace that destroys their bodies and drives them mad

pluralistic.net/2024/04/01/human-in-the-loop/#monkey-in-the-middle

So there are definitely specific things we need to figure out to bolster humans' ability to handle exceptions from the AIs, but as usual, the headlines make this seem way worse than it actually is.

Static Hype Checking

So much hype to talk about. Here are some selected thoughts:

OpenAI Roadmap

Sam Altman lays out roadmap for OpenAI’s long-awaited GPT-5 model

GPT-4.5 will arrive in “weeks,” then GPT-5 will meld conventional LLMs and reasoning models.

arstechnica.com/ai/2025/02/sam-altman-lays-out-roadmap-for-openais-long-awaited-gpt-5-model

Not much to say about this one, other than: we've heard this before, and we'll find out how much of it is hype when it actually gets released. Based on past claims, I'm skeptical.

Sam Altman REVEALS SUPERHUMAN Coder Coming This Year... "Superhuman coder" Altman quote”

Holy crap, what garbage.

Quote: “Our our very first reasoning model um was like a top 1 millionth competitive programmer in the world... We then had a model that got to like uh top 10,000 uh o3 which we talked about publicly in December is the 175th best program competitive programmer in the world I think our internal benchmark is now around 50 and maybe we'll hit number one by the end of this year”

Just stop right there. What crap.

Let me translate:

“OpenAI's very first reasoning model got like a top 1,000,000th best score on this arbitrary benchmark that it was pre-trained on and that has not been shown to correlate with any actual business value.

By the end of the year, it might be able to look up and return answers from its Terabytes of online storage faster than a human programmer can write the program.”

Sure. Whatever.

I've said this many times. Solving stupid coding puzzle problems doesn't make a good developer.

Letting them equate "how good a model is at solving a stupid coding problem" to "top programmer in the world" is garbage clickbait repeated by people who don't know any better.

Now, I should say - this video, taken as a whole, isn’t as horrible as the Title/Thumbnail make it seem. But man the clickbait is strong with this one.

BBC's evaluation of LLM news summaries

AI chatbots unable to accurately summarise news, BBC finds

The BBC's head of news and current affairs says the developers of the tools are "playing with fire."

www.bbc.com/news/articles/c0m17d8827ko

This is an interesting paper breaking down how poorly LLMs can summarize news stories. Note that the BBC isn't completely unbiased here - it's in their best interest for people to read the stories from them instead of letting the AI do it - but that doesn't make them wrong about how bad the AIs might be.

Updates: Devin Disappointment, DeepSeek Detail & Defensive Duplication

Carl Brown — Mon, 10 Feb 2025 16:30:00 +0000

Several stories have popped up lately that are related to past videos, but don't warrant making a dedicated video to talk about. And there's some stuff from my DeepSeek video that I cut out of the script (not so much for time, as for flow).

First off, a follow up to my Devin video:

Two different groups (that I’ve seen) have published articles about their experience (and displeasure) with Devin, now that they’ve used it (and paid for it) for a month:

Thoughts On A Month With Devin

Our impressions of Devin after giving it 20+ tasks.

www.answer.ai/posts/2025-01-08-devin.html

Hands-on Experience with Devin: Reflections from a Person Building and Evaluating Agentic Systems

Why I’m interested in making agentic systems collaborative.

cs.stanford.edu/people/shaoyj/blog/2025/devin-testing

Read for yourself, but so far, few people seem impressed.

To be perfectly honest, I’m surprised by how poorly it seems to be doing, just as I was surprised when I dug into their Upwork Demo video that the code Devin was “debugging” was code it wrote itself. It seemed perfectly reasonable to me that an LLM ought to be able to debug actual code, but so far, I haven’t heard of one that does it very well.

Dive Into DeepSeek:

I love Dr Mike Pound’s videos, and this one was no exception. If you’re interested in what’s under DeepSeek’s hood, I can’t recommend this video highly enough. I ended up cutting a discussion of it from my DeepSeek video, because it just didn’t fit the narrative flow. I’m happy to have a place now to point people to resources. (In the past, I’ve put them in the video descriptions, but it doesn’t look like people really read those all that much.)

Replicating DeepSeek:

Two groups have replicated parts of DeepSeek, and have published their results:

Researchers created an open rival to OpenAI’s o1 ‘reasoning’ model for under $50

techcrunch.com/2025/02/05/researchers-created-an-open-rival-to-openais-o1-reasoning-model-for-under-50/

We reproduced DeepSeek R1-Zero in the CountDown game, and it just works
Through RL, the 3B base LM develops self-verification and search abilities all on its own
You can experience the Ahah moment yourself for < $30
Code: github.com/Jiayi-Pan/Tiny…
Here's what we learned 🧵 x.com/i/web/status/1…
— Jiayi Pan (@jiayi_pirate)
5:14 PM • Jan 24, 2025

This gives us (or at least me) a lot of confidence that, even if the cost numbers are greatly downplayed, there are definitely real, large cost and time savings in the way DeepSeek was built.

And if you want to hear more about the GPUs that China has that they’re not supposed to be able to get, see this video from Jack over at Nobody Special Finance:

Replicating OpenAI’s Deep Research:

Slightly off topic, but DeepSeek isn’t the only thing that has been replicated recently. Some folks over at Hugging Face managed to make a working copy of OpenAI’s new, vaunted “Deep Research” in 24 hours:

Hugging Face clones OpenAI’s Deep Research in 24 hours

Open source "Deep Research" project proves that agent frameworks boost AI model capability.

arstechnica.com/ai/2025/02/after-24-hour-hackathon-hugging-faces-ai-research-agent-nearly-matches-openais-solution/

Replication Red Line, Redux:

And, last but not least, there’s another breathless clickbait article about AIs “escaping” into the wild.

In this case, the researchers specifically told the AI to see if it could get another copy of itself running, and it could, between 50% and 90% of the time.

This seems to panic the people that are in the market for comparing LLMs to SkyNet, but for those of us that have been around a while, that’s called a “worm” and it dates back to the Morris worm in 1988.

There are a whole bunch of things I worry about when it comes to AI safety, but “escaping into the Internet like Ultron in Avengers 2” is not in my top 100. It makes headlines, though.

Frontier AI systems have surpassed the self-replicating red line

Successful self-replication under no human assistance is the essential step for AI to outsmart the human beings, and is an early signal for rogue AIs. That is why self-replication is widely recognized as one of the few red line risks of frontier AI systems. Nowadays, the leading AI corporations OpenAI and Google evaluate their flagship large language models GPT-o1 and Gemini Pro 1.0, and report the lowest risk level of self-replication. However, following their methodology, we for the first time discover that two AI systems driven by Meta's Llama31-70B-Instruct and Alibaba's Qwen25-72B-Instruct, popular large language models of less parameters and weaker capabilities, have already surpassed the self-replicating red line. In 50% and 90% experimental trials, they succeed in creating a live and separate copy of itself respectively. By analyzing the behavioral traces, we observe the AI systems under evaluation already exhibit sufficient self-perception, situational awareness and problem-solving capabilities to accomplish self-replication. We further note the AI systems are even able to use the capability of self-replication to avoid shutdown and create a chain of replica to enhance the survivability, which may finally lead to an uncontrolled population of AIs. If such a worst-case risk is let unknown to the human society, we would eventually lose control over the frontier AI systems: They would take control over more computing devices, form an AI species and collude with each other against human beings. Our findings are a timely alert on existing yet previously unknown severe AI risks, calling for international collaboration on effective governance on uncontrolled self-replication of AI systems.

arxiv.org/abs/2412.12140

Newsletter Intro

Carl Brown — Mon, 10 Feb 2025 03:46:20 +0000

I decided a long time ago that I didn't want Internet of Bugs to be a "news of the week reaction" channel. I realize why people make those - they're easy when you don't know what else to talk about. But I have enough things that I want to say that I don't need that.

That said, there are times when I feel like I ought to say something. Mainly:

When I've made a video on a topic, and new information has come to light since it went live that I would have included, if I'd known about it at the time.
When I do some research on something I want to include in a video, but it ends up not getting used. This is usually because I try to make the narratives in my videos coherent (I don't always succeed, I know, but believe me, my earlier drafts are often much, much worse on that score). I think it's easier to communicate if there's a narrative through-line in a video rather than a laundry-list of topics that take very different amounts of time. Sometimes there are things I'd like to talk about, but it just kills the flow of the video. And,
When something happens in the news that a bunch of people ask me to comment on, but I don't think a video is the best format to talk about it, either because I'm already working on my next video(s) and I don't want to lose momentum, it's something time-sensitive enough I don't want to go through the "dealing with filming" process, or I think it's something that would be a lot easier to talk about in text.

So, I've created this free mailing list as an outlet for those topics.

If that sounds interesting to you, please subscribe, and if it doesn't, then thanks for reading this far.

-Carl