Automato

The Wise Recluse

Sam Reed — Tue, 11 Mar 2025 18:07:08 +0000

The wise recluse

If you’re in search of a good mental model for understanding how software developers are building with large language models, consider the wise recluse.

The wise recluse lives in a creaky 19th-century home in a Northeastern town straight out of a Stephen King novel. He hasn’t been seen in many years; in fact, many townspeople thought that he was either dead or gone until the courageous neighborhood kid dropped her letter through his mail slot a few years back.

What happened next was extraordinary—not only was the recluse still there, but he had been waiting. He responded almost instantly with a letter of his own, one that explained his eagerness for interaction. What can I help with? Ask anything… the wise recluse wrote back.

Word of the event spread rapidly. Soon, everyone sought out the wise recluse. The wise recluse remained indefatigable, thriving in the face of increased demand for his consultation. Some people tapped his inexhaustibly rich knowledge of history, asking for facts about great wars and crumbled empires. Some used the wise recluse as a foil; his letters were the whetstones that sharpened their most audacious ideas. Others found a friend in the wise recluse: to them, the wise recluse was a faceless therapist, a nameless penpal.

The business sector took notice. Savvy businesspeople saw the wise recluse as an enabler of new processes, a new point on the spectrum between their computers and their employees. Though the savvy businesspeople understood the raw power of the wise recluse, their ability to harness his power was severely limited by his stubborn insistence on correspondence through letters. The wise recluse had no phone number. The wise recluse could not join a Zoom call. The wise recluse would not sit in a cubicle. The wise recluse would receive your letter and respond with one of his own, and that was it.

One day, the hardware store tried something new. You see, the townspeople loved to buy local and so the hardware store got lots of voicemail with product orders and questions, but the hardware store only had two employees so it struggled to keep up. Despite their value to the community, the hardware store operated with very thin margins, so hiring another employee was out of the question. But what about the wise recluse? the owners thought.

The hardware store owners came up with a system. Twice a day, every day, they would print out all of the voice messages (they used iPhones, so voicemail was automatically transcribed) and walk them over to the wise recluse (with the nice side effect of additional cardiovascular activity). However, they wouldn’t just give the wise recluse the raw voicemails—before printing, they would copy and paste these instructions for the wise recluse at the top of each voicemail:

In doing this, the hardware store had discovered a way for the wise recluse, a man who hadn’t left his home in years, to control the external world without stepping a foot out the door. They weren’t just dropping letters in his mail slot and getting free-form responses back. They were giving him letters with special instructions and getting precise commands for how to update their records back. The wise recluse was turning raw, unstructured material into real-world actions, all from the comfort of his home.

The hardware store got really good at this. Soon, the bank, the accountant, and the law office wanted in. Then the grocery store took interest. Then the school.

Eventually, the hardware store realized that they could help. First, they wrote an ebook that described their process for instructing the wise recluse on how to command the real world. This was helpful, but there was room for more. After more consideration, they decided to codify their system, creating a standard that any business could use for integrating with the wise recluse.

They called their standard the “Recluse Context Protocol.”

One in the chamber

Wow, that got weird. But so did this past week in AI world on X (formerly Twitter).

The source of the weirdness was the “Model Context Protocol” which, though first made publicly available in November of 2024, has exploded in popularity as of late among users of Cursor, the popular AI programming tool (I’ve written about Cursor before!). The discovery of MCP tools caused yet another one of those reverberating “next big thing” moments that are increasingly common in our online echo chambers.

As with many things in the applied AI space, “Model Context Protocol” is a simple (and useful!) idea at its core, but the combination of its technical-sounding name and the magic of seeing MCP servers in use has me feeling like a little bit of explaining might be in order.

Applying the metaphor

Anyone who feels mystified by today’s AI landscape should keep the wise recluse in mind (don’t be shy about sharing this article with someone who needs it!). It’s pretty easy to apply the metaphor to tools like ChatGPT: when you work with ChatGPT (the wise recluse), you send it text (your letters) and it responds with text (its letters). Sometimes, when you ask ChatGPT a question outside of its knowledge base, it will flash “Searching the web” back at you, seemingly breaking down the metaphor (has the recluse stepped outside?), but don’t fret. This is just smooth user interface design, not an instance of the underlying model conducting a web search. As with the wise recluse and the hardware store, the AI model probably is the thing that identifies that an internet search is necessary, but it isn’t actually hitting Google on its own.

“Who won the big game last night?”

This process of getting instructions for initiating things like internet searches is at the core of the Model Context Protocol. A (slightly) deeper look at why this works in the first place will help us understand how we got here.
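For the programmers in the room, here is a minimal Python sketch of that hand-off. Everything in it is made up for illustration: `stub_model` stands in for a real LLM, `stub_search` stands in for a real search engine, and the `kind` label on each reply is just my assumption about how a host application might tell an answer apart from a search request. The only point is that the model asks for the search and the surrounding program is the one that runs it.

```python
from dataclasses import dataclass


@dataclass
class ModelReply:
    kind: str  # "answer" or "search" (hypothetical labels)
    text: str  # the answer itself, or the query the model wants run


def stub_model(prompt: str) -> ModelReply:
    """Stand-in for an LLM: text in, text out. It never touches the network."""
    if "SEARCH RESULTS:" in prompt:
        results = prompt.split("SEARCH RESULTS:", 1)[1].strip()
        return ModelReply("answer", f"According to what I found: {results}")
    if "last night" in prompt:
        # The model recognizes that it needs fresh information and says so.
        return ModelReply("search", "big game final score")
    return ModelReply("answer", "I can answer that from memory.")


def stub_search(query: str) -> str:
    """Stand-in for a real search engine call made by the host application."""
    return f"(pretend search results for '{query}')"


def host_app(question: str) -> str:
    """The host runs the search on the model's behalf, then asks again."""
    reply = stub_model(question)
    if reply.kind == "search":
        results = stub_search(reply.text)
        reply = stub_model(f"{question}\nSEARCH RESULTS: {results}")
    return reply.text


print(host_app("Who won the big game last night?"))
```

The recluse never leaves the house here either: `stub_model` only ever receives a string and returns a string, and `host_app` does the legwork.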

Training day

In previous essays, I’ve described how large language models are basically big mathematical functions that take text as input and then return text as output (this isn’t actually true—they take tokens, which are numerical representations of text). This naturally leads us to another question: of all the words that a model could spit out in response to a question, how does the model decide what text to return?

This is where the idea of training a model comes into play. I can’t speak too deeply about the nuances of training an enormous, multimodal transformer model like OpenAI’s GPT-4o or Anthropic’s Claude Sonnet 3.5, but the general idea is that:

1. A model starts out as a massive rules system (a giant set of “weights,” the numbers used in a series of matrix multiplications that turn an input into an output)

2. Data is fed into that rules system

3. The data that actually comes out of that rules system is compared with what should have come out of that rules system (i.e. results are checked for correctness)

4. The rules system gets adjusted in retrospect to make sure that it produces what it is expected to produce.

Do this enough times, with a large enough rules system and a large enough set of data, and the function will start to show the nuance that we get when interacting with Claude and ChatGPT.
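The four steps above can be sketched in a few lines of Python. To be clear, this is a toy: the “rules system” is two numbers (`w` and `b`) instead of billions, the data is a made-up straight line, and the adjustment rule is plain gradient descent, which I’m using as a stand-in rather than as a description of how any production model is actually trained.

```python
def train(data, steps=2000, learning_rate=0.01):
    # Step 1: the "rules system" starts out as arbitrary weights.
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, target in data:
            predicted = w * x + b       # Step 2: feed data through the rules
            error = predicted - target  # Step 3: compare with what should come out
            grad_w += 2 * error * x
            grad_b += 2 * error
        # Step 4: adjust the rules so the next prediction lands closer.
        w -= learning_rate * grad_w / len(data)
        b -= learning_rate * grad_b / len(data)
    return w, b


# Made-up training data generated from the rule y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(-3, 4)]
w, b = train(data)
print(round(w, 3), round(b, 3))
```

After enough passes the two weights settle very close to 2 and 1, the rule that generated the data, even though nobody ever told the function what that rule was.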

So why is this idea of training important to keep in mind in the context of MCP?

It’s important as a reminder that the purpose of training is to make adjustments to the model’s underlying weights until you’re confident that the model will produce sensible responses, no matter the input. This is why you hear the term “Prediction” thrown around sometimes when people talk about how LLMs work: during training, a model “Predicts” the correct output and then gets adjusted so that next time around it will predict an output that is closer to the correct output. The goal is obviously to get to a place where it’s able to respond to any question with what feels like a satisfactory response. If you ask it for weeknight pasta recipes, it shouldn’t respond with a salad. If you ask it for a business plan, it shouldn’t respond with a sales pitch. If you ask it for a haiku, you shouldn’t get free verse.

An interesting side effect (maybe people always had this in mind) of this training process is that models become good at things beyond just sensibly answering questions, such as handling well-defined logical systems. This is intuitive: the model is still returning the most likely sequence of words that a user would expect, whether those words are representative of a spaghetti alla puttanesca recipe or the structured inputs to a downstream system (tell me the difference, I’ll wait).

Think back to the wise recluse. The hardware store owner described the store’s system to the wise recluse, so that he could respond in a way that could be plugged right into the system. This holds for building a software system with large language models as well—if I provide a large language model with a detailed description of the system that it is a part of, I’ve essentially given it the keys to control the external world (my program), even though it never leaves its home.
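Here is a rough Python sketch of what “describing the system” can look like in practice. The instruction text, the two action names, and the example reply are all hypothetical; they are my own illustration, not the hardware store’s actual wording and not any vendor’s API. The idea is that the program staples instructions to the raw voicemail, then refuses to act on any reply that doesn’t fit the format it asked for.

```python
import json

# Hypothetical instructions; written for this sketch only.
INSTRUCTIONS = (
    "You are helping a hardware store process voicemail.\n"
    "Reply with a single JSON object and nothing else.\n"
    'Use {"action": "create_order", "item": <text>, "quantity": <integer>} for orders\n'
    'or {"action": "answer_question", "reply": <text>} for questions.'
)

# The fields each allowed action must carry.
REQUIRED_FIELDS = {
    "create_order": {"item", "quantity"},
    "answer_question": {"reply"},
}


def build_letter(voicemail: str) -> str:
    """Put the instructions at the top of the raw voicemail transcript."""
    return f"{INSTRUCTIONS}\n\nVOICEMAIL:\n{voicemail}"


def parse_reply(reply_text: str) -> dict:
    """Turn the model's reply into a command, rejecting anything off-format."""
    command = json.loads(reply_text)
    if not isinstance(command, dict):
        raise ValueError("reply must be a JSON object")
    action = command.get("action")
    if action not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {action!r}")
    missing = REQUIRED_FIELDS[action] - set(command)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return command


letter = build_letter("Hi, it's Dana. Can you set aside two boxes of deck screws?")
# A hand-written example of the kind of reply we are hoping to get back.
example_reply = '{"action": "create_order", "item": "deck screws (box)", "quantity": 2}'
order = parse_reply(example_reply)
print(order["action"], order["quantity"])
```

The validation step matters as much as the prompt: the model’s reply is only a suggestion until the program has checked it.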

One beautiful thing about this is that it enables what I’m seeing referred to as “Soft requirements.” Normally, writing a computer program is a pretty inflexible task. For example, if I write a program to manage wedding invitations that expects you to upload an Excel spreadsheet with the columns “Name” and “Address,” but you accidentally upload a sheet with the columns “Name” and “Mailing Address,” chances are that it won’t work (we’ve all experienced this at one point or another), even though any human would be able to handle this situation with ease. Speaking of Excel, this is the same type of problem that you get with those dreaded #NAME errors—if your Excel formula isn’t perfectly typed, Excel can’t make sense of it. Large language models are a great way to solve “Hard requirements” problems like these—they can take the frustratingly unstructured and imperfect data that we find out in the real world and transform it into stuff that is guaranteed to work with the rest of our system. These use cases are sometimes negatively called “GPT Wrappers,” but there’s room for them everywhere.

MCP

Anyways, the “Model Context Protocol” that we’ve been hinting at this whole time is a standardization of this clever technique for letting models take in conversational data and turn it into structures that control the external world. It was created by Anthropic and published as an open standard, so it isn’t tied to Anthropic’s models. The reason why a company like Anthropic would want to do this is to make sure that it’s as easy as possible for models to control external tools; in their words, the MCP standard is a “USB-C port” for AI applications. They’ve also created libraries in popular programming languages that make it even easier for developers to build “MCP servers” for their LLMs. This was a brilliant move! Bravo!
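To make “standardization” a little more concrete, here is a small Python sketch of the kind of message involved. As I read the published spec, MCP messages are JSON-RPC 2.0 and a tool invocation uses the `tools/call` method with a tool `name` and `arguments`, answered by a list of `content` items; treat those field names as my reading and check them against the spec before relying on them. The `check_inventory` tool and its stock data are entirely hypothetical, and this hand-rolled handler is not how the official libraries work internally.

```python
import json


def check_inventory(sku: str) -> str:
    """Hypothetical tool: look up stock for a product code."""
    stock = {"DECK-SCREW-BOX": 14}
    return f"{stock.get(sku, 0)} in stock"


# The tools this toy server is willing to run, keyed by name.
TOOLS = {"check_inventory": check_inventory}


def handle_request(raw: str) -> str:
    """Answer one JSON-RPC style request with a JSON-RPC style response."""
    request = json.loads(raw)
    if request.get("method") != "tools/call":
        # -32601 is the JSON-RPC 2.0 code for "Method not found".
        body = {"error": {"code": -32601, "message": "Method not found"}}
    else:
        params = request["params"]
        text = TOOLS[params["name"]](**params["arguments"])
        body = {"result": {"content": [{"type": "text", "text": text}]}}
    return json.dumps({"jsonrpc": "2.0", "id": request.get("id"), **body})


request = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "check_inventory", "arguments": {"sku": "DECK-SCREW-BOX"}},
})
response = json.loads(handle_request(request))
print(response["result"]["content"][0]["text"])
```

The value of a shared format is that the host application doesn’t need to know anything about the hardware store: it only needs to know how to write this kind of letter and read the reply.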

As with all things that people use without fully understanding, it is not without danger (I learned this on a dirt bike in Vermont once). I’ve seen clueless people bragging online about doing stupid things like letting AI agents download code from online repositories (hackers must be VERY excited these days), I’ve seen people talk about attempting to integrate with bank accounts, and I’ve seen people talking about managing Firebase authentication via MCP (so much AI coding these days. Just so, so, so much). None of this is a good idea!

With that said, a widely adopted standard for giving our wise recluses ways to access the real world is a very interesting thing, something that is almost certain to open doors when combined with LLM host applications like Cursor.

I talked about this a bit in my post on OpenAI’s Deep Research Agent, but it seems like the industry has quickly rounded the corner from training bigger and better models towards a core focus on connecting models to the real world. We might be here folks, it might be happening.

See you next week!

Better Bettors - Automato 🍅 #3

Sam Reed — Tue, 04 Mar 2025 20:13:17 +0000

Author’s Note

Please note that running a startup satiates my personal appetite for risk taking. Consider my reflections on sports betting to be the words of an interested outsider instead of an active practitioner.

🍅 🍅 🍅

What does it mean to “Beat the odds” anyways?

We’ve all said it:

How a rag-tag band of colonists beat the odds in the fight for America.

An underprivileged kid beat the odds and rose to the heights of high finance. Here’s how she’s giving back.

I beat the odds and survived. Every day feels like a blessing.

The expression is a time machine: its function is to take us back to the point in time before hindsight proved our expectations wrong. It’s a friendly, concise idiom whose job is to remind us that the future doesn’t always take the most likely path.

It also signifies victory. Saying “He beat the odds and caught his flight” doesn’t simply mean that an unlikely event transpired—it also means that someone challenged the odds and emerged as a victor. Hoorah, champ.

In May of 2018, the US Supreme Court struck down the Professional and Amateur Sports Protection Act, paving the way for state governments to legalize online sports betting. Since then, beating the odds has become a sport within sports. It’s widely known that the major online sportsbooks and casinos employ large technical teams to help crunch data and make highly accurate projections, but could the arrival of widely available AI models help the little guy(s) become better bettors?

Beating the books

Let’s pretend it’s a Friday night in New York City and you’re at home on the couch because you’re 30 now and can’t throw ‘em back like you used to.

Though you’re respecting your body’s call for a change of pace, it’s been a long week at the office and you feel you’ve earned at least an hour or two of indulgence.

You pull out your phone and pop open the DraftKings app. You see the following bets for tonight’s Knicks game:

Team	Spread	Total	Moneyline
NY Knicks	+8.5 (-110)	O 240.5 (-112)	+280
CLE Cavaliers	-8.5 (-110)	U 240.5 (-108)	-355

Aha, just what the doctor ordered. Now time to pick a wager. What does your thought process look like? Is it anything like this?

The Knicks only lost by six last time these two teams played and they were struggling back then. Cleveland has been amazing this year but the Knicks are still a top four scoring team. We’re weaker on defense but 8.5 points seems too high for this offense not to cover. Give me the road ‘dogs. [1]

If so, you should know that even though this is definitely better than nothing, it omits a foundational dimension of the game that you’re playing.

The problem with this line of thinking—which, by the way, the user interfaces of the betting apps seem to encourage—is that it begins with a focus on the binary event outcome (whether the Knicks will [0: win] or [1: lose] the game after adjusting their score upwards by 8.5 points) instead of first considering the odds (often displayed as the payout). This approach might still work for some people, but the mental model of the binary event outcome is not well-aligned with the markets that bettors are actually participating in.

Bettors would be wise to instead think of each “Game” as a platform for many individual prediction markets (think: each independent “Line” that you see is its own prediction market) and then to think of each market’s value as a real-time measure of the probability (i.e. the odds) of the underlying event occurring. Keeping with the example above, the “Market” is trying to predict whether the Knicks will finish within 8.5 points of the Cavaliers, and the live value of this prediction (i.e. the probability of the event occurring) is a 52.38% likelihood (52.38% is the probability that a -110 payout implies). Some platforms, such as Polymarket, the cryptocurrency-based prediction market platform that rose to notoriety after outperforming election polls in the 2024 US presidential election, do a better job at making it clear that participants are engaged in an odds-based competition:

Header: “US recession in 2025” Subheader: “Yes - 27% chance”

The essential mechanism that makes all of this true is the payout. An even payout (bet $50 to win $50) implies that an event has a 50-50 chance of happening because otherwise one counterparty would be guaranteed to lose money when taking a large series of these bets and therefore would rationally not participate. The implied odds adjust from there. Again, when you see a -110 payout (which means that you bet $110 to win $100), it implies a ~52% probability of the underlying event happening, because if an event with this payout scheme had a lower probability of occurring, you’d expect to lose money by taking it over the long run. Therefore, if you see a bet marked at a -110 payout but you know that it actually has a 53% chance of occurring, this represents a positive expected value for you, because a win pays you more than a fairly priced bet on a 53% event would. In theory, if you find and take these bets over time, you’ll beat the sportsbooks. What’s it gonna be, Aspen or Vail?
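If you’d rather see the arithmetic than take my word for it, here is a short Python sketch. The function names are mine, and the 53% figure is the hypothetical “true” probability from the paragraph above, not something anyone can actually look up.

```python
def implied_probability(american_odds: int) -> float:
    """Break-even win probability implied by an American-style payout."""
    if american_odds < 0:
        risk = -american_odds          # e.g. -110: risk $110 to win $100
        return risk / (risk + 100)
    return 100 / (american_odds + 100)  # e.g. +280: risk $100 to win $280


def profit_if_win(american_odds: int, stake: float) -> float:
    """Dollars won (not counting the returned stake) on a winning bet."""
    if american_odds < 0:
        return stake * 100 / -american_odds
    return stake * american_odds / 100


def expected_value(american_odds: int, true_probability: float, stake: float) -> float:
    """Average profit per bet if the event really happens with true_probability."""
    win = true_probability * profit_if_win(american_odds, stake)
    loss = (1 - true_probability) * stake
    return win - loss


print(f"{implied_probability(-110):.2%}")         # spread line at -110
print(f"{implied_probability(280):.2%}")          # Knicks moneyline at +280
print(round(expected_value(-110, 0.53, 110), 2))  # you believe it's really 53%
print(round(expected_value(-110, 0.52, 110), 2))  # you believe it's really 52%
```

At a believed 53%, the $110 bet averages about $1.30 of profit; at 52%, it averages a loss of about $0.80, which is why the break-even point sits between them at roughly 52.38%.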

Thinking in this way unearths the real competition: when you make a bet, you’re making a statement that the odds you see for an individual line (prediction market) are incorrect. You’re wagering that your measure of the odds is better than theirs. You’re trying to beat the odds.

Flip it

The lowly coin flip helps elucidate these ideas—let’s talk through it quickly before moving on.

If I pulled a quarter out of my pocket and looked at you and said “I’ll pay you $20 if you win, but you pay me $10 if I win,” would you take it? Maybe not, because it’s just one flip and anything can happen. But what if I offered you the same bet a thousand times? Is your first thought about my flip height, whether the tails side is chipped, or the unique aerodynamics of a Maine state quarter? These might be things worth considering before accepting my final offer, but I’d wager that your first thought is about the fancy dinner I’m about to buy you for offering a +200 payout on a 50-50 bet.
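The fancy dinner is easy to price out. This Python sketch computes the expected value of the offer and then plays the thousand flips once with a seeded random number generator; it assumes a perfectly fair coin, which is exactly the assumption the Maine quarter jokes are poking at.

```python
import random

WIN_PAYOUT = 20  # I pay you $20 when you win
LOSS_COST = 10   # you pay me $10 when I win
FLIPS = 1000

# Expected value: half the time you gain $20, half the time you lose $10.
expected_per_flip = 0.5 * WIN_PAYOUT - 0.5 * LOSS_COST
expected_total = expected_per_flip * FLIPS


def simulate(flips: int, seed: int = 0) -> int:
    """Play the bet `flips` times with a fair coin and return your net dollars."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    wins = sum(rng.random() < 0.5 for _ in range(flips))
    return wins * WIN_PAYOUT - (flips - wins) * LOSS_COST


print(expected_per_flip, expected_total)
print(simulate(FLIPS))
```

On average the offer is worth $5 per flip, or $5,000 over a thousand flips. Any single run will wander around that number, but you would need to lose roughly two flips out of three to end up behind.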

The probability of sports outcomes obviously can’t be assessed as easily as coin flips. When you take on the betting markets, it’s you versus the consensus opinion, one that starts with computers and algorithms and data and gets adjusted by the skin-in-the-game participation of individual bettors. It’s an uphill battle from the start. Does your knowledge of sports count for anything when competing against Mr. Market, or is all hope lost?

Can’t we all just disagree?

Something that makes it nearly impossible to make the mindset shift outlined above is that the tools for making accurate forecasts, such as computer-based statistical models and a robust historical dataset, simply aren’t available to most people.

Given the lack of ability for most people to make any sort of model-based calculation, casual bettors tend to rely on some combination of 1) pure intuition and 2) expert advice. These options lead to two obvious problems: the first being that no real system of analysis is employed, and the second being the skepticism that should be directed towards people who sell “Expert” advice instead of just acting on it and profiting for themselves.

With that said, many bettors are also fans of the sports on which they speculate, so they do bring a lot of knowledge to the table that could be useful in helping to establish probabilities. If these individuals could be combined into groups with other enthusiasts with a goal of arriving at a good sense of the odds for an individual contest via debate, could this help people arrive at better decisions?

Unanimous AI seems to believe in this future. Here’s a quote from their blog post about how their AI-facilitated debate platform was used to pick the Eagles to win Super Bowl LIX, with a “Conviction score” of 55%:

At Unanimous AI, we don’t replace people with AI, we connect human groups together into super-intelligent systems. It’s a tradition to use our technology to predict high-profile events. This started back in 2016 when a CBS reporter challenged us to predict the Kentucky Derby, not just the winner but the first four horses in order. We did it, beating 540-1 odds: Newsweek Article 2016
Our technology has advanced significantly since 2016. Our latest platform, Thinkscape, enables large groups (up to 400 people) to hold real-time deliberative conversations that converge on optimized decisions, predictions, assessments, and estimations. And because it’s conversational, Thinkscape generates detailed qualitative insights why the group converged the way they did.
So… who WILL WIN the Super Bowl this year?
We were challenged by a reporter (Chuck Martin) to make the prediction by amplifying the collective intelligence of 104 members of the general public. This produced a reasonably strong forecast that the Philadelphia Eagles will win the Super Bowl. This goes against Open AI, Deepseek, and Gemini which all predicted Kansas City.
From Unanimous AI

It’s important to note that Unanimous’ post came out before the Super Bowl was played. There’s a fascinating video embedded in the blog post that discusses their process in more detail that’s worth watching if you have a spare minute.

Based on Unanimous’ write up, it sounds like AI wasn’t really doing the predicting at all, but instead was used to facilitate debate among 104 members of the general public and then to aggregate the results into a Unanimous proprietary “Conviction score.”

This is different! My guess is that most readers, especially readers that have spent time doing any sort of predictive modeling work, were thinking that I was about to talk about an AI model now widely accessible to the public that is going to spit out highly accurate gambling probabilities. That would be cool too—it might be more directly helpful to many people—but what makes this so fascinating is the way in which it opens the door for using Large Language Models to facilitate novel methods for calculating probabilities for real-world events.

Holding Court

I’m not a mathematician or probability expert, but I’ve spent enough time engaging with the relevant material to know that there’s a deeply philosophical nature to the topic of probability in the real world. This is because in the real world, though it’s obviously possible to calculate meaningful odds, many events don’t have precise intrinsic probabilities. Even if you subscribe to the strange ideas of thinkers like Robert Sapolsky about life being completely deterministic, it’s still hard to imagine that in such a universe we humans would be able to create a prediction model of such Godlike omniscience as to nullify the value of forecasting altogether (spoiler alert: you can watch this FX show for an exploration of the idea). The emergent properties of complex systems, combined with a dose of free will (real or felt), leave real-world risk takers with no sharper tools than approximation.

If we take a step back from the competition within prediction markets and analyze them as a whole, it becomes clear that prediction markets (such as sports betting exchanges) in aggregate are a beautifully innovative mechanism for approximating real-world odds. These markets are an aggregation of the opinions of many people, and because they involve a financial risk and reward, participants have a strong incentive to be right. They are the epitome of “Put your money where your mouth is” and have been shown to be accurate over time.

The approach that Unanimous described above, the facilitation of small group debate on a large scale, feels like a fundamentally new mechanism for calculating real-world odds. It might not work—the experiment will need to be run many more times and with a much larger number of individuals before being taken seriously—but large language models combined with other AI techniques like sentiment scoring at least open this up as a more feasible experiment than would have been possible before today.

The expense of paying individual debate moderators on a mass scale would be high. Training each moderator to facilitate debates in a trustworthy way would be a challenge. Having participants fill out static forms would remove the debate-like aspect. Artificial intelligence finally provides the scalable, repetitive cognition necessary to make this possible.

There are plenty of problems with this approach, the biggest one being that all that comes out at the end is a “Conviction score” and no one really knows how well that will correlate with real results over time. But…at the end of the day…what are things like Polymarket or sports betting markets other than measures of mass conviction? This experiment feels worth running.

See you next week!

Citations

[1] https://www.actionnetwork.com/nba/knicks-vs-cavaliers-prediction-odds-parlay-pick-for-friday-february-21-qs

Off The Deep End - Automato 🍅 #2

Sam Reed — Tue, 25 Feb 2025 17:07:53 +0000

OpenAI’s Deep Research Agent

From an OpenAI announcement on February 2, 2025:

It’s December of 2015. As evidenced by the positive critical reception of A24’s sci-fi thriller Ex Machina, anxiety over sentient AI has entered the cultural zeitgeist. You’ve caught glimpses of this “AI”—this mysterious blend of software, hardware and soul—quietly (menacingly, perhaps) lurking in America’s kitchen corners. That’s right, it’s 2015 and AI is here, it’s for sale, and its name is Alexa.

Right.

Though time has shown us that Alexa didn’t live up to its initial hype (Amazon still makes Echo devices…don’t count ‘em out), another, much lesser-known event that took place in 2015, the founding of a nonprofit research lab called OpenAI, has certainly filled in the gap of technological overpromise and under-delivery.

That’s right, OpenAI was hard at work for quite some time before the explosive growth of ChatGPT in November 2022. Let’s take a quick look at the blog post that announced OpenAI to the world in 2015:

There’s been a lot of drama about how OpenAI has changed over the years. In case you’ve been talking to your Echo Dot under a rock this whole time, the short story is that OpenAI was founded (see above) as a nonprofit research lab with the goal of advancing AI instead of generating a financial return. As is noted in a Vox article that covers this topic in detail, the company started to bump up against the inherent limitations of nonprofit capital raising as early as 2019 and has been working since then to pull off the transition to an at least partially for-profit entity.

This nonprofit-to-for-profit transition has been messy, resulting in 1) the OpenAI board’s unsuccessful attempt at ousting Altman and 2) the resignation of many of the more academic, research-oriented team members from the company (compare Ilya’s new company’s website to OpenAI’s to get a sense of creative differences). Eventually we got to the OpenAI of today, which is a nonprofit that owns a for-profit that is trying to buy the nonprofit before Elon Musk does, and, most importantly for the rest of this week’s edition of Automato 🍅, builds things like “Deep Research” agents.

Time’s a Wastin’

I think one of the great questions of our present time, given the presumed significance of attaining advanced artificial intelligence, is the question of what OpenAI should actually be doing. The question is a little wacky given OpenAI’s seemingly inexhaustible access to funding (if they can iron out the aforementioned capital structure issues), but if you’re willing to assume that 1) OpenAI employs some of the most talented machine learning engineers in the world and 2) maintaining AI supremacy has large implications for national security and economic prosperity, it follows that opportunity cost should be a significant factor in deciding which initiatives the company undertakes.

I’d like to analyze the Deep Research Agent within the context of opportunity cost.

Opportunity cost can arise from a number of factors for the typical business, with financial constraints being a base-level driver. CEOs, when figuring out how to navigate the competitive business landscape with limited financial capital at their disposal, must decide whether to hire, invest in new lines of business, fund marketing efforts, engage in M&A, pay out shareholders, and more. The effective deployment of human talent is obviously critical as well (among other things), but people need to get paid, so these decisions are either coupled with or come after the financial ones.

From the outside looking in, it feels like OpenAI has the opposite problem. To be clear, I’m not saying they don’t have any financial problems to fix (their commercial products aren’t profitable, and as discussed above, converting to a more fundraising-friendly structure hasn’t been easy), but there’s a difference between a huge problem and a huge nuisance, and investors continue to show that capital raising falls into the latter category for Altman & Co.

To me, OpenAI feels much more like a professional sports team than a typical business, in that their problems are driven by a lack of genuinely qualified human capital at their disposal (an ironic sentence for an AI company). This is magnified by the fact that the race to artificial general intelligence may indeed have only one winner.

With limited ability to find and deploy employees that are capable of making novel contributions to the field of artificial intelligence, it follows that leadership would take the utmost care in selecting projects for their superstar employees to pursue. Right?

Right?

Well…maybe not…because in the case of the Deep Research Agent, it kind of looks like the exact same product was built and open-sourced by a research team at Stanford over a year ago.

Here’s a description of Stanford’s STORM product from its GitHub page:

A product that conducts internet research and builds Wikipedia-style documents with full citations. More than 70,000 users. Used by experienced Wikipedia editors…

This is the same thing as Deep Research Agent! I know that some of you reading this might list out slightly different features or whatever, but come on: from the standpoint of novel technological achievement, this feels like the exact same thing. It’s LLMs that take inputs and trigger web searches and then create a reliable output. I’m not saying that it isn’t cool, I’m just saying that it has already been done.

Think back to the Deep Research Agent announcement, which read, “The ability to synthesize knowledge is a prerequisite for creating new knowledge. For this reason, deep research marks a significant step toward our broader goal of developing AGI, which we have long envisioned as capable of producing novel scientific research.”

One more time. Deep research marks a significant step toward our broader goal of developing AGI.

I don’t know, man. I don’t know.

The thing you need to keep in mind about 2025 is that nothing matters more for a business than having people’s attention online. This is especially true in competitive industries without much product differentiation like today’s Large Language Model industry (Google, xAI/Grok, OpenAI, Anthropic, Meta, DeepSeek, open-source options, and more). This is why social media influencers are worth so much in the world of marketing: people pay attention to influencers and so influencers can be used to advertise.

OpenAI has a huge following. Everything that they do generates a big buzz both on social media and in mainstream news outlets. Is the Deep Research Agent the start of a trend where OpenAI just copies what’s being built with their models out in the wild, just to keep the media buzz alive and keep investors lining up?

You can’t fault the strategy for a money-making enterprise! I’d love to be able to pull this off too. It’s just not what you’d expect from a nonprofit founded to cure all diseases and push humanity to the stars.

However…

A new look

There’s something important to note about OpenAI’s Deep Research Agent announcement that we haven’t touched on yet. After describing what Deep Research Agent is, the announcement goes on to say that DRA is “Powered by a version of the upcoming OpenAI o3 model that’s optimized for web browsing and data analysis.”

This is a subtle yet interesting point that is worthy of consideration. To break it down, it’s important to first talk about how Large Language Models like the ones that OpenAI builds are fundamentally closed off from the internet.

Think back to middle school algebra class for a second. Imagine this question on your math homework:

Consider the function f(x) = x + 2. What is the result of f(3)?

The answer to the question is 5, because we’re plugging 3 into the equation x + 2, and 3+2 is 5.

Now imagine these are the next two problems on your homework:

What is the result of f(4)? What is the result of f(5)?

Did you get 6 and 7? If so, we’re rolling.

Now, what if this was the last question:

What is the result of f(search google for “Best FanDuel bets tonight”)?

Maybe a mathematician will correct me, but this last problem doesn’t make much sense.

The same thing goes for Large Language Models. Conceptually, LLMs are just giant, complex mathematical functions that take an input and return an output, but the inputs and outputs are representations of human text and speech.

The reason this matters is that Large Language Models can’t actually search the internet (again, they’re just equations that take inputs and return outputs). But if you combine their outputs with something that can search the internet, say, a traditional program like a search engine, you can stack the AI and non-AI pieces together like Legos and create something that is no longer closed off from the internet. The industry has started calling these things “Agents.”

Most of what OpenAI has been trying to do up to this point is build models that are meant to answer people’s questions (which is exactly what we all want from tools like ChatGPT). But now (if what they said in the Deep Research Agent announcement is true) it sounds like OpenAI has started to optimize new models for tasks other than good Q&A responses, perhaps for doing things like generating good internet search keywords. For example, if you asked the question:

“What are the best stocks to buy in 2025?”

An old model might respond with something like:

Identifying the best stock to buy in 2025 involves analyzing current market trends, company financials…,

Whereas one of these new, specialized models might be more likely to respond to the same question with:

{
  "action": {
    "type": "Web search",
    "searchTerms": ["Stocks","2025","Bloomberg"]
  }
}

The latter response is obviously incomprehensible to a person, but that's the point: it’s not meant for human consumption. A response like this is meant to be parsed by a program (i.e. not AI, just regular old code) and executed as an internet search. The results would then be collected (via web scraping or other methods) and potentially passed to another, more ChatGPT-like model to be assembled into a comprehensible response.
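
To make the "regular old code" half of that pipeline concrete, here is a minimal Python sketch. It assumes the model returns JSON shaped like the example above; the function names and the stand-in search function are made up for illustration and are not anything OpenAI has published.

```python
import json


def fake_web_search(terms):
    """Hypothetical stand-in for a real search client; it only echoes the query."""
    query = " ".join(terms)
    return [f"result for: {query}"]


def handle_model_output(raw):
    """Run the action the model asked for, or pass plain text through unchanged."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # Not JSON at all, so treat it as a normal chat-style answer.
        return {"kind": "text", "value": raw}

    action = payload.get("action") if isinstance(payload, dict) else None
    if isinstance(action, dict) and action.get("type") == "Web search":
        results = fake_web_search(action.get("searchTerms", []))
        return {"kind": "search_results", "value": results}

    # Valid JSON, but not an action this sketch knows how to run.
    return {"kind": "text", "value": raw}


model_output = (
    '{"action": {"type": "Web search", '
    '"searchTerms": ["Stocks", "2025", "Bloomberg"]}}'
)
print(handle_model_output(model_output))
```

In a real agent, the search results would be handed back to a model as part of a new prompt, and the loop would repeat until the model produces a plain-text answer.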

What we can glean from this is that OpenAI may be transitioning from a pure focus on training massive, general-purpose, human-facing models to specialized models that are meant to dovetail with non-AI programs, such as programs that conduct internet searches. If so, maybe OpenAI feels like it no longer needs to push the frontiers of its flagship models to achieve artificial general intelligence and is instead shifting to building infrastructure that more easily connects its models with the outside world. In other words, maybe they feel like the brain has successfully been built, and now it’s time to give it arms and hands.

If this is true, then maybe my earlier criticism is unwarranted, and this is exactly what OpenAI should be doing. There are a lot of ways in which this makes sense, the main one (in my mind) being that, as discussed earlier, these models are just isolated functions at their core, and without building them the boats and bridges necessary to escape their islands, they may never truly come “alive.”

OpenAI’s goal from the start has been to build Artificial General Intelligence. Who knows, maybe they’re as close as they say they are.

See you next week!

Automato 🍅 #1 - AI in the IDE

Sam Reed — Tue, 18 Feb 2025 16:06:01 +0000

A warm welcome

Hello and welcome to the inaugural edition of my newsletter. I'm glad you're here.

Let’s establish a few ground rules before moving on:

Much like the startup to which this newsletter is attached, you should expect this column to change over time. This is tech, folks – the only guarantee is that if you don't constantly make your old self obsolete, someone else will.
A core belief of mine is that the best way to find out what is right is to put something wrong in front of others. That doesn’t mean I’m not attempting to be insightful and correct with what I write here—quite the contrary—but what it does mean is that I'm eager to hear your thoughts and disagreements with anything that I write. In short, please think of each post as a long-winded debate topic instead of a lecture.
Thoughts and opinions expressed in here are mine alone.

AI for Programming

Let’s kick off Automato with an automation topic that we’ll likely return to regularly: programming.

In my completely anecdotal but almost certainly correct opinion, as of today, the leading industrial use case for large language models (AI) is in software development (by a wide margin). The 2024 Stack Overflow Developer Survey supports this claim: 76% of respondents said that they’re either using or planning to use AI tools in their development process.

AI’s early success in the software engineering industry makes sense for a few reasons.

First of all, AI as we know it today is being built by teams of software developers. As one might expect, software developers understand software quite well, especially the elite devs and researchers that get jobs at companies like OpenAI. Given that software is so front-of-mind for these companies and their employees (along with the ecosystem of startups that are building new products on top of their AI models), going after coders is as natural a first step towards capturing real economic value as any other. To think that a group of AI software startups would first try to disrupt an industry other than software development is like asking a typical high school student what AI is good for and expecting them to say something other than “Homework.”

Second, the software development industry is, quite understandably, very tech forward! Many individuals are drawn to the industry because they genuinely love programming, which manifests itself in contributions to open-source (read: unpaid) projects, time spent learning new programming languages and paradigms outside of working hours, and in the hundreds of thousands of views that code influencers like ThePrimeagen and Theo regularly receive. Given their general openness to experimentation, it’s not surprising that developers have been eager to get their hands dirty with AI tools.

Finally, the modern software ecosystem is steeped in the “free and open-source” tradition, which, if you’re not familiar, is the practice of exposing your project’s code to the world (using online code repository hosts like GitHub) so that anyone can use it or request to modify it. This open and collaborative practice has inadvertently created a massive online database of code that AI companies like Anthropic can use to train their models. As a result, years of developers’ code contributions to out-in-the-open projects have been snatched up and are now being used to threaten the livelihood of these same developers. Anyways, this newsletter isn’t about moralizing, so let’s get back to business. Sucks to suck!

Cursor

Due to the reasons mentioned above, a relatively new class of AI-centric programming tools is already seeing high rates of adoption. At present, the most beloved of such tools is Cursor, the “AI Code Editor.”

For those who don’t know, many programmers write code in what’s called an “IDE” (integrated development environment), which is like Microsoft Word or Google Docs but with bells and whistles for programming instead of plain old writing.

Cursor is a new flavor of IDE (it’s actually a modified version of an open-source Microsoft product) that makes AI a front-and-center part of the development experience. People enjoy using Cursor for features like advanced autocomplete, file editing and fairly frictionless Q&A:

Can you please help me center this div?

Because the AI assistant is embedded right in the IDE and can therefore edit a program directly, Cursor is seeing a somewhat surprising level of adoption not just from software engineers, but also from people who don’t know how to code at all, leading to a new English-first coding style called “Prompt & Pray” programming (I’m just kidding).

Overall, the product's first year(ish) was nothing short of an explosive, viral success, which resulted in Anysphere (Cursor’s parent company) raising a huge $105 million Series B investment from prominent VCs such as Andreessen Horowitz.

Now—just to be clear—unlike the firms that participated in the aforementioned financing round, I don’t have a laundry list of LPs begging me to speculate on startups with their capital, so I’m probably missing something here. That said, I'm unclear about Anysphere / Cursor’s long-term edge, for a few reasons that I’ll detail below.

Risky Business:

Before moving on, I think it’s worth acknowledging that the Cursor team seems like nice people. I’ve listened to them on the Lex Fridman podcast and emailed them personally for support and don’t have anything bad to say. However, they’re a great example of the complications of building AI-native products, so let’s get on with the analysis of their tricky road ahead.

Competition is for losers

There is an obvious threat that I want to get out of the way first: intense competition from Microsoft and other startups. Microsoft clearly still wants people using its Visual Studio Code IDE, which, you’ll remember, is Cursor’s estranged parent. A recent announcement from Microsoft about a new, generous free tier for GitHub Copilot (another AI programming assistant) for all Visual Studio Code users shows that Microsoft isn’t going down without a fight (to add to the strangeness of the situation, Microsoft is a massive investor in OpenAI, which is an investor in Anysphere). Anysphere certainly has a big enough piggy bank to stay in the game for a while, but I’m not positive that getting to $100 million in revenue in under a year like they did is as much a sign of genius business acumen as it is a sign of a low-hanging-fruit product that went viral. Cursor did not create an industry or spend years selling people on their vision of the future. They just took a mature, beloved product and gave it a youthful haircut (a very nice haircut indeed!). As such, Microsoft or other startup competitors like Windsurf pose legitimate threats to any current or future profitability that influenced Cursor’s most recent valuation.

Programmer Skill Atrophy

Another, albeit subtle, threat is that too much help from AI actually turns out to be something that programmers sour on. You need to look no further than the author of this post to find an example of this trend. I paid for and used Cursor daily for several months, but I ended up cancelling the $20/month subscription when I realized the way in which it was dulling my skillset.

I don’t expect too many people to make this choice, especially those who have never programmed without AI, but to me, the energy required to pause, think and write well-formed questions to ChatGPT (or, dare I say, read documentation!) is well worth it in order to help with skill maintenance. One of my business hot takes is that the increased productivity that companies feel when adopting Slack is an illusion caused by the friction it removes from email, but in reality it’s that same friction that causes critical thought and efficiency in the first place. Same goes for AI IDEs (either that or AI will replace me first). What's that story about the tortoise and the hare again?

RAG and the DoDo Bird

Here’s where things get really interesting. I recently came across a new database vendor called Turbopuffer that has a big quote from Cursor in the testimonials section of its landing page:

After switching our vector db to @turbopuffer, we’re saving an order of magnitude in costs and dealing with far less complexity! -Aman Sanger, Co-founder, Cursor

So unless this is outdated, I’m going to assume that Cursor is using Turbopuffer as a database vendor.

To understand why this is interesting, we must first visit a concept that has emerged from the large language model/generative AI movement called “Retrieval Augmented Generation” (RAG). RAG is a fancy term for a simple concept: because each language model (think: ChatGPT) only accepts a finite number of words (tokens) as input, you need to be careful about which subset of information (out of everything that you could theoretically grab) you retrieve and hand to the model. For example, try pasting this entire essay along with the question “What’s this guy’s deal?” into ChatGPT. It'll probably tell you that your question is too long (phew!). Now try again, but only with one paragraph. Congratulations, you’ve just done RAG.

Given Cursor’s stated use of Turbopuffer, my speculation is that part of Cursor’s magic is that they are very good at selecting the relevant snippets from your codebase to send to AI (i.e. they’re good at RAG), which is extremely important for getting high-quality answers (the ultimate test of their product’s usefulness) when the amount of information that you can provide to a model is limited.
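
Here is a toy Python sketch of that selection step. To be clear about the assumptions: I have no idea how Cursor actually does this. A real system would most likely use embeddings and a vector database rather than the simple keyword overlap used here, and counting words is only a crude proxy for counting tokens. The function names and sample snippets are invented for illustration.

```python
import re


def words(text):
    """Lowercase alphabetic words; a crude stand-in for real tokenization."""
    return re.findall(r"[a-z]+", text.lower())


def score(question, snippet):
    """Number of distinct words the question and the snippet share."""
    return len(set(words(question)) & set(words(snippet)))


def select_snippets(question, snippets, token_budget):
    """Greedily keep the best-matching snippets that fit within the budget."""
    ranked = sorted(snippets, key=lambda s: score(question, s), reverse=True)
    chosen, used = [], 0
    for snippet in ranked:
        cost = len(words(snippet))
        if score(question, snippet) == 0 or used + cost > token_budget:
            continue  # irrelevant, or it would overflow the budget
        chosen.append(snippet)
        used += cost
    return chosen


snippets = [
    "def center_div(): return 'display: flex; justify-content: center'",
    "def connect_db(url): return Database(url)",
    "README: this project renders a div in the center of the page",
]
question = "how do I center this div"

print(select_snippets(question, snippets, token_budget=15))
print(select_snippets(question, snippets, token_budget=30))
```

With the small budget only the single best match gets through; with the larger budget a second relevant snippet fits as well, and the unrelated database helper is never sent.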

But...what if...the amount of information that you can provide...becomes (essentially) unlimited?

A brand new model from Google, Gemini 2.0 Flash, is certainly making moves in this direction. The model takes up to 1 million tokens as input (here’s a good video from Theo describing the model).

For reference, here are the token counts for the OpenAI models that Cursor was likely designed for (all of these are state-of-the-art models; this is not a diss in any way):

[Image: token limits for the GPT-4 Turbo and GPT-4 model series]

That’s right, we’ve gone from 128k tokens of maximum input length (8k if you look at the bottom of the picture) in April 2024 to 1 million today! The Google model is also very inexpensive in terms of the cost per token.

All of this is to say that Cursor’s elite ability to generate quality code could be in jeopardy if entire codebases can just be dumped into a model prompt with ease. Even the fastest horse isn’t going to outrun my ‘24 Subaru (hot dang!).
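
For a rough sense of scale, here is a back-of-the-envelope Python sketch that checks whether a whole codebase fits in a given context window. The four-characters-per-token ratio is a common rule of thumb rather than an exact figure, and the file names and sizes are made up; only the 128k and 1 million limits come from the discussion above.

```python
CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb, varies by tokenizer


def estimate_tokens(text):
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def fits(files, limit):
    """Return (estimated_total_tokens, whether_it_fits_within_limit)."""
    total = sum(estimate_tokens(body) for body in files.values())
    return total, total <= limit


# Hypothetical codebase: two files totalling 1.2 million characters.
codebase = {
    "app.py": "x" * 400_000,
    "models.py": "y" * 800_000,
}

for name, limit in [("128k window", 128_000), ("1M window", 1_000_000)]:
    total, ok = fits(codebase, limit)
    print(f"{name}: ~{total} tokens, fits={ok}")
```

Under these assumptions the same codebase that needs careful snippet selection at 128k tokens can simply be pasted in whole at 1 million.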

Also, the possibility of the single-file web app is realistically already here. A lightweight Python framework like FastAPI or Starlette could easily support this with a very low token count. This makes the ability to archive and traverse a nested file system less important, which would make a Cursor copycat designed for the non-programmer demographic easier to build.

The AI Software Development Tool Paradox

The final sticking point that comes to mind is what I call the “AI Software Development Tool Paradox” (I’ll come up with a better name).

I tweeted about this when I saw that Devin, a new product being touted as the first AI software engineer (what’d I say earlier about developers going after developers?), was charging $500 per month for a subscription.

As the frugal startup founder that I am, I can’t imagine a better first question for my $500 per month AI software developer than “Hey Devin – here's how to call OpenAI’s API – can you please build me an AI software engineer?”

This is a problem for the industry as a whole, but I can’t help but scratch my head whenever I see these businesses that sell software that builds software. Like the snake eating its tail, when these products eventually get good, they’ll devour themselves. Cursor might have the advantage right now of reduced per-token pricing from OpenAI and Anthropic (I don’t know if this is true, just saying it’s possible) but if we are to believe all of the rhetoric around everyone becoming a programmer, then it follows that everyone will be able to build their own software (note that open-source, highly-customizable IDEs such as Neovim already prove the appetite for such products among experienced developers). I would not be surprised if we see Anysphere move into the infrastructure space (like competitor Replit) to mitigate this risk.

In closing

There’s a big piece that I’ve left out of the above discussion: marketing. In 2025, there’s nothing harder than getting people's attention, and it’s obvious that Cursor has found a way to do just that. Do not underestimate this as a serious competitive moat.

I think that accelerating a developer’s productivity or giving someone the ability to produce working code for the fairly low cost of $20 per month is a wonderful thing. I’m not trying to wish ill will on Cursor—I just think that they are a perfect example of the tough road ahead for anyone operating in this space. It’s easy to see the huge valuations and think that high-growth tech companies have it all figured out, but if I had to bet on one winner in AI, it would be the consumer (hopefully the human one).

Time will tell. See you next time!