<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>AI Office Hours</title>
    <description>Actionable and Usable AI for developers. Doors are open.</description>
    
    <link>https://ai-office-hours.beehiiv.com/</link>
    <atom:link href="https://rss.beehiiv.com/feeds/IJFcJXSN8V.xml" rel="self"/>
    
    <lastBuildDate>Thu, 14 May 2026 23:26:52 +0000</lastBuildDate>
    <pubDate>Wed, 30 Apr 2025 14:34:51 +0000</pubDate>
    <atom:published>2025-04-30T14:34:51Z</atom:published>
    <atom:updated>2026-05-14T23:26:52Z</atom:updated>
    
      <category>Machine Learning</category>
      <category>Artificial Intelligence</category>
      <category>Technology</category>
    <copyright>Copyright 2026, AI Office Hours</copyright>
    
    <image>
      <url>https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/publication/logo/f8bee330-7f91-44a6-a969-50f8470e4ea5/Screenshot_2023-06-12_at_12.52.42_PM.png</url>
      <title>AI Office Hours</title>
      <link>https://ai-office-hours.beehiiv.com/</link>
    </image>
    
    <docs>https://www.rssboard.org/rss-specification</docs>
    <generator>beehiiv</generator>
    <language>en-us</language>
    <webMaster>support@beehiiv.com (Beehiiv Support)</webMaster>

      <item>
  <title>Beyond Benchmarks</title>
  <description>Really Evaluating AI</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/90bf5300-b5c0-4c62-9f4a-ea126a7c0064/Screenshot_2025-04-30_at_7.32.58_AM.png" length="215633" type="image/png"/>
  <link>https://ai-office-hours.beehiiv.com/p/beyond-benchmarks</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/beyond-benchmarks</guid>
  <pubDate>Wed, 30 Apr 2025 14:34:51 +0000</pubDate>
  <atom:published>2025-04-30T14:34:51Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">I’m giving a talk at ODSC East 2025 on the topic of benchmark pitfalls, and I wanted to give a taste of it here! See below for a link to the talk.</p><p class="paragraph" style="text-align:left;">For the uninitiated, a <b>benchmark</b> is a standardized open source test set for an AI task. The idea of a benchmark or even a test set for AI is not new by any stretch. The general idea when training any AI model is to divide a (usually) massive amount of data into “splits”: train the model on the largest portion of the data (the training split), validate your results along the way using a smaller subset (the validation split), and finally “test” the model at the end on a similarly small subset (the test split). The idea is that if a team agrees on a given train/val/test split, then we can evaluate models fairly, knowing it wasn’t a difference in data that made the difference. Below is an image from one of my books, highlighting that fine-tuning process.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeb0ssJ-6__Vf3yBHBsEWdf5FDV6KQF8caDnOF1w8AUH_gcbxwagxSnQTI2sslaBmXMwF5hoqSjvRCWKE7_GOjJ4uS_tTzHZMqXCIZ1Px6TphXRdGyuwQKGBdqHlXLJ1uKv-frU?key=R60ekJOISV8atbAtZEI_hmcg"/><div class="image__source"><span class="image__source_text"><p>Test sets in general (benchmarks being an example of a test set) help AI engineers know if their hard work on training models paid off (Source: A Quick Start Guide to LLMs, by yours truly, Sinan Ozdemir)</p></span></div></div><p class="paragraph" style="text-align:left;">But what if the model you are making is not necessarily meant for you or for your team, but in fact meant for as many people as possible, and what if we can’t standardize a training set, because that itself has some proprietary / “secret sauce”? That’s where benchmarks come in. 
Someone (or usually some people) creates and proposes a test set (the benchmark) and hopes people adopt it. Moreover, whatever training data an organization uses to train its model, it just needs to make sure none of it comes from this benchmark, or else that’d be cheating (more on that later). </p><p class="paragraph" style="text-align:left;">Benchmarks are often the first thing people ask of a new LLM: “How did it score on XYZ benchmark?” or “Is this model better than ZYX model at benchmarks?” and frankly that’s the primary purpose of benchmarks: to serve as a top-line conversation starter when evaluating an LLM for a certain job. This post, and my session overall, will push back on the urge to treat benchmarks not as a conversation starter but as a conversation ender. </p><p class="paragraph" style="text-align:left;">We will explore three main areas of benchmarks in this post:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Benchmarks becoming targets</b>: LLM creators are incentivized to chase the top of leaderboards and LLM consumers conflate benchmark performance with everyday, real-world performance.</p></li><li><p class="paragraph" style="text-align:left;"><b>Biases and shortcuts</b>: Static benchmarks often contain artifacts and biases that models exploit, making benchmarks artificially easier than they appear.</p></li><li><p class="paragraph" style="text-align:left;"><b>Overstated progress</b>: High scores on benchmarks don&#39;t mean models have true generalization or human-level understanding.</p></li></ol><p class="paragraph" style="text-align:left;">Let’s dig in.</p><h1 class="heading" style="text-align:left;" id="benchmarks-becoming-targets">Benchmarks becoming targets</h1><p class="paragraph" style="text-align:left;">We’ve been measuring AI using benchmarks for decades. 
Benchmarks like SQuAD (you’re forgiven if you’ve never heard of it) have been used to measure an AI’s ability to perform question-answering tasks like “What is the largest city of Poland?” The image below shows the progress of AI on classic benchmarks, surpassing “human performance” (denoted as the 0 mark on the y-axis) in the last decade. The goal of the benchmarks was to give us humans a consistent, shared view of how well our AI systems were doing.</p><p class="paragraph" style="text-align:left;">That’s still true today; on its face, targeting benchmark performance is useful for measuring top-line progress in AI. The problem arises when we all hyper-focus on a relatively small subset of benchmarks and equate a model’s performance on the benchmark with the AI’s overall performance. It leads to one of the most difficult questions I have to answer in a lecture: <b>“Which model is currently the best?”</b></p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcPYcWJ0nswuo-u_tnb-EA5oQfyXtseERK6bBPanimgsDDJHJtRDEg9Io47qcSDMJqePEBpJHaR59OLAyklWi-BZINndJmjkMXNXbSjXvawGFeKCu8T8a4HEayyLjmkiWHsPxBNBg?key=R60ekJOISV8atbAtZEI_hmcg"/><div class="image__source"><span class="image__source_text"><p>Benchmark saturation over time for popular benchmarks, normalized with initial performance at minus one and human performance at zero. (Source: <a class="link" href="https://arxiv.org/pdf/2104.14337?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=beyond-benchmarks" target="_blank" rel="noopener noreferrer nofollow">https://arxiv.org/pdf/2104.14337</a>)</p></span></div></div><p class="paragraph" style="text-align:left;">This is not the right question to ask. 
The better question is, “How will this particular model perform on my particular task, and is there a benchmark that gives me any indication of that?” It’s less punchy and puts more work on the consumer, but it better covers the current adoption of AI: task-oriented and usually domain-focused. </p><p class="paragraph" style="text-align:left;">That being said, we sometimes can’t even trust an LLM’s ability to solve a benchmark consistently. The image below depicts a study showing GPT-4 dropping from 84% to 51% on a math benchmark within only 3 months of release (from March to June 2023). This was because OpenAI released a new version of the model to little fanfare with a later knowledge cutoff, but didn’t report how benchmarks shifted. So we, the persuadable public, still assumed the benchmarks they reported in March were accurate; they weren’t.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXd4UPNFUOLyzyuGTg7yX7MWenjRSvZBvCdCe9n1n5iiQoCeK-PKdb3s_58NJmdTcXgBpu3M-iZ7EUOBYViDL68suOCpmUXnVIUpD21RL-QFScXyhP_eAZ5VcR7HOBVLp9DWGDPKhg?key=R60ekJOISV8atbAtZEI_hmcg"/><div class="image__source"><span class="image__source_text"><p>GPT-4 and GPT-3.5 benchmark performances shifting wildly within only 3 months (Source: <a class="link" href="https://arxiv.org/pdf/2307.09009.pdf?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=beyond-benchmarks" target="_blank" rel="noopener noreferrer nofollow">https://arxiv.org/pdf/2307.09009.pdf</a>)</p></span></div></div><p class="paragraph" style="text-align:left;">Nonetheless, we look to benchmarks on new LLM releases to give us that top-line view. But once a benchmark like SQuAD (or more recently MMLU) becomes a standard, researchers and models optimize heavily to improve benchmark scores, chasing the top of a leaderboard to show the world what they’ve done and hopefully sell some credits. 
So should benchmarks even be a target to chase? Sure, especially when the task is relatively nuanced and the benchmark matches it (like testing a model’s ability to select financial tools using a financial tool selection benchmark). Benchmark performance is a great top-line set of numbers to help us create a shortlist of models to consider, but it can’t be the end of the story. This wouldn’t be a post about benchmarks without at least one reference to the 1970s-era <b>Goodhart’s Law</b>: </p><p class="paragraph" style="text-align:center;"><i>&quot;When a measure becomes a target, it ceases to be a good measure.&quot; </i></p><p class="paragraph" style="text-align:left;">Said another way, when we optimize too hard for a particular metric (performance on a benchmark), people (or in this case, AI models) will find ways to &quot;game&quot; the system and score higher without actually improving at the underlying goal in any meaningful way. One way a model can do that is by finding shortcuts to better grades.</p><h1 class="heading" style="text-align:left;" id="biases-and-shortcuts">Biases and shortcuts</h1><p class="paragraph" style="text-align:left;">A section on benchmark biases and shortcuts could frankly be its own book. In fact, I made a 12-hour video series focusing on this topic; there’s a lot to say. For now, let’s focus on a real and imminent shortcut AIs can suffer from: data contamination. <b>Data contamination</b> is when an AI trains on data suspiciously similar to benchmark questions, artificially inflating a model’s performance on a benchmark. 
Basically, what if an AI accidentally or maliciously was given a cheat sheet?</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcGrj8MhE7lMQGGsg_zItLXe2ifmIl2obMjAthY7l_ZggnFXDXYx8ltLof3zMS0WplwRuHH9B6MzsYsBOX9C-Rphn_Cv3r2cPDXALpv3_orOuD-neoLnG5vbqOJv6P5Oy4O1o8J?key=R60ekJOISV8atbAtZEI_hmcg"/><div class="image__source"><span class="image__source_text"><p>An investigation into data contamination showed that Llama 2, if trained on rephrased benchmark questions that passed industry-standard data contamination detection, would have beaten GPT-4 at the popular MMLU benchmark. Too bad Meta has never open-sourced Llama’s training data so we can’t double-check that work. (Source: <a class="link" href="https://arxiv.org/abs/2311.04850?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=beyond-benchmarks" target="_blank" rel="noopener noreferrer nofollow">https://arxiv.org/abs/2311.04850</a>)</p></span></div></div><p class="paragraph" style="text-align:left;">One way data contamination can happen is when a benchmark becomes so popular that the open internet fills up with rephrasings of its questions that are justtttt different enough to trick industry-standard contamination detection techniques (usually just embedding similarity and n-gram overlap [basically a keyword search] checks). A frighteningly simple research experiment (shown above) rephrased questions from the MMLU benchmark just enough to pass such industry-standard techniques and found that Llama 2 could have beaten GPT-4 on the benchmark if it had been allowed to train on them.</p><p class="paragraph" style="text-align:left;">It’s easier than we think to let a benchmark question slip into the training data of a model. 
Experiments like these suggest that companies are likely doing a decent job of decontaminating training data (otherwise we’d be seeing saturation at an even greater rate), but when a model isn’t fully open source (including its training data and code), which is the case for every single Llama model, we can never double-check the work being done.</p><h1 class="heading" style="text-align:left;" id="overstated-progress"><b>Overstated progress</b></h1><p class="paragraph" style="text-align:left;">Put simply, benchmarks don’t cover a majority of what most humans would consider general intelligence, let alone “superintelligence”. Most benchmark questions are multiple choice and most benchmarks reported by companies are in the fields of math and coding, which really isn’t that helpful if you’re using an LLM to write marketing copy or to classify incoming customer support tickets with a particular intent class. One of the most talked about modern benchmarks is <b>“Humanity’s Last Exam” </b>or HLE. In their own words, this benchmark is designed to be the “final closed-ended academic benchmark of its kind with broad subject coverage.” And by “academic” and “broad”, they mean almost entirely STEM.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcYJ8YtlWkBbA-kqctM-_NUD7Ju9y_4zu3u7-8d3cebmyffGaJIOGYEoB4gXgSOdCM4OVUogY4hb2by9bVWWvV1EvUAUXF1-loMIRDiF7SpoN-Ui7MDJJgcEfH3mbKg18tsXP8N?key=R60ekJOISV8atbAtZEI_hmcg"/><div class="image__source"><span class="image__source_text"><p>Humanity’s Last Exam is mostly a STEM exam, featuring 3 questions tagged as “League Of Legends” questions 🤷 (Source: <a class="link" href="https://github.com/centerforaisafety/hle?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=beyond-benchmarks" target="_blank" rel="noopener noreferrer nofollow">https://github.com/centerforaisafety/hle</a>)</p></span></div></div><p class="paragraph" style="text-align:left;">To be 100% clear, 
there’s no chance I’ll ever pass, let alone ace, this closed-book exam. For one, I don’t play League of Legends and I am terrible at chess (both categories are represented in this benchmark). When an AI can pass this benchmark (yes, I said when), I will find it extremely impressive, but I will also have many questions, starting with:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Was your AI able to look up information in trying to pass the exam?</p></li><li><p class="paragraph" style="text-align:left;">Can you prove to me you didn’t help your AI “cheat” by fine-tuning it on similar data?</p></li><li><p class="paragraph" style="text-align:left;">Can you convince me that I even care about your AI knowing how long the Second Great War was in StarCraft lore? (Yes, that’s a real question in this benchmark)</p></li></ol><p class="paragraph" style="text-align:left;">The counterpoint to this line of questioning is that AI adoption will eventually evolve beyond targeted LLM use-cases and prompting, and that benchmarks like HLE are less meant for targeted adoption of AI and more meant to signal a turning point in AI: the heralding of Artificial General Intelligence (AGI) or Superintelligence - an AI going beyond human intelligence. To put it bluntly, even an AI scoring 100% on HLE would <b>not alone</b> convince me that we had reached AGI or Superintelligence. 
These high benchmark scores look impressive to us, the consumers, but they can stop reflecting true generalization of AI, which is exactly what Goodhart’s Law predicts.</p><h1 class="heading" style="text-align:left;" id="so-what-do-we-do"><b>So what do we do?</b></h1><p class="paragraph" style="text-align:left;">We will dive deeper into remedies for benchmarks in my live session and I plan to make a follow-up post after the talk, but I’ll outline a few simple steps we can take now:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Use benchmarks to create a short list of models to evaluate - don’t use them to select a single model.</p></li><li><p class="paragraph" style="text-align:left;">Make your own test sets - they are valid and frankly will tell you more about your particular tasks than most public benchmarks will.</p></li><li><p class="paragraph" style="text-align:left;">Ask yourself, “Who made this benchmark?” and “What are they really trying to test?” Is it true reasoning beyond the knowledge an AI has, or simply the recall of a few facts that, when put together, simulate reasoning?</p></li></ol><p class="paragraph" style="text-align:left;">Other topics in the session will include:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">The path to “AGI” and “Superintelligence” and a deeper dive into the “Humanity’s Last Exam” benchmark</p></li><li><p class="paragraph" style="text-align:left;">How AI developers use prompting when benchmarking, leading to public misconception of AI strength</p></li><li><p class="paragraph" style="text-align:left;">Addressing staleness in benchmarks: what if answers to questions change over time?</p></li></ol><p class="paragraph" style="text-align:left;">If you’re in Boston, stop by and say hi!</p><div class="embed"><a class="embed__url" 
href="https://odsc.com/speakers/beyond-benchmarks-evaluating-ai-agents-multimodal-systems-and-generative-ai-in-the-real-world/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=beyond-benchmarks" target="_blank"><div class="embed__content"><p class="embed__title"> Beyond Benchmarks: Evaluating AI Agents, Multimodal Systems, and Generative AI in the Real World </p><p class="embed__link"> odsc.com/speakers/beyond-benchmarks-evaluating-ai-agents-multimodal-systems-and-generative-ai-in-the-real-world </p></div><img class="embed__image embed__image--right" src="https://odsc.com/wp-content/uploads/2023/07/Sinan-Ozdemir.png"/></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=acb8ea9e-4c9b-4880-83f3-a113188f0319&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Evaluating AI Agent Tool Selection</title>
  <description>When you have a hammer (in the first position), everything looks like a nail</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/777503cd-7659-4a65-a587-4f9c8bc354e2/tool_acc_optimized.png" length="28381" type="image/png"/>
  <link>https://ai-office-hours.beehiiv.com/p/evaluating-ai-agent-tool-selection</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/evaluating-ai-agent-tool-selection</guid>
  <pubDate>Wed, 13 Nov 2024 13:00:00 +0000</pubDate>
  <atom:published>2024-11-13T13:00:00Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[Agents]]></category>
    <category><![CDATA[Bias + Ethics]]></category>
    <category><![CDATA[Llm Alignment]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">AI Agents are all the rage, and I get it. The promise of letting an LLM just pick the right tool for the job is very appealing, if not somewhat “magical”, but of course if I’m writing this, then it means there’s a lot more to it than meets the eye.</p><p class="paragraph" style="text-align:left;">At its most basic, an <b>AI Agent</b> is an auto-regressive LLM (virtually any commercial Generative AI model like GPT, Llama, Mistral, Claude, Command-R, etc.) with a prompt telling the LLM how to reason through tasks by selecting and running <b>tools</b>, which can be APIs, image generation models (like how ChatGPT uses DALL-E to make images), code execution, or really anything that can be distilled down into a simple run function.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fdaa885e-92ad-4091-9edc-e5382a59dae1/04fig12.png?t=1731444712"/><div class="image__source"><span class="image__source_text"><p>Your Basic AI Agent relies on an auto-regressive LLM’s ability to think through a task and select the right tool at the right time.</p></span></div></div><p class="paragraph" style="text-align:left;">Agents are useful in theory, but in practice can often fall short. 
Evaluating agents can be done on several levels, including:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Making sure the final answer is accurate and helpful</p></li><li><p class="paragraph" style="text-align:left;">Ensuring the latency/speed of the system is good enough</p></li><li><p class="paragraph" style="text-align:left;">Mitigating failures of the LLM to reason through a complex task</p></li></ol><p class="paragraph" style="text-align:left;">One of the more underrated evaluation criteria is <b>quantifying the ability of the LLM to select the right tool at the right time</b>. On its face it&#39;s obvious that we have to measure this, but many people dismiss it as just part of the overall system: if the answer is right at the end, that would imply the agent selected the right tools, right? Well, not always. Perhaps the agent selected the wrong tool twice before fumbling its way into the right one, and that would impact both the latency and the accuracy overall.</p><p class="paragraph" style="text-align:left;">Moreover, there are underlying issues with the deep learning architecture that virtually every LLM is based on, the Transformer. 
While there’s no doubt that the invention of the Transformer was one of the greatest advancements in NLP in the last several decades, there’s one particular type of bias it falls prey to quite often: positional bias.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/slidesz/AGV_vUfsOT58i-L7M75K9ECgwQUHxBiB6FLFgfihJTH7Qy9kHK3ALYtFFu26TM5BySYgtrZTWE3Eg1USLV4jJ-YUoS3ibqnZE-O3OiJAElkD1oNsU0U6ke_XDwwc1g1tPJfAwHPx6JchRg=nw?key=FMO3SleMYhC0170nbww_FPdM"/><div class="image__source"><span class="image__source_text"><p>Depending on where the tools are in the agent prompt, tools listed later in the list might end up towards the middle of the prompt, where information can get ignored due to positional bias</p></span></div></div><p class="paragraph" style="text-align:left;"><b>Positional bias</b> essentially means the LLM has a tendency to pay more attention to tokens at the start or end of the prompt while glossing over tokens in the middle. You may have heard this called the &quot;lost-in-the-middle&quot; problem. This can be a big deal when it comes to agents, especially if the LLM favors tools that are listed earlier in the prompt, glossing over later tools which often appear towards the middle of the overall prompt. As a result, the LLM could pick the wrong tool.</p><h1 class="heading" style="text-align:left;" id="testing-tool-selection">Testing Tool Selection</h1><p class="paragraph" style="text-align:left;">To properly investigate tool selection in agents, let’s run a simple test. Our test will run in 3 stages:</p><h2 class="heading" style="text-align:left;" id="stage-1-setup">Stage 1 - Setup</h2><ol start="1"><li><p class="paragraph" style="text-align:left;">Choose an agent framework to test. 
I made my own <a class="link" href="https://github.com/sinanuozdemir/squad-goals?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=evaluating-ai-agent-tool-selection" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p></li><li><p class="paragraph" style="text-align:left;">Write a test set of fairly simple tasks, each of which (mostly) obviously maps to a single tool. I have 5 tools total. Examples include:</p><ul><li><p class="paragraph" style="text-align:left;">Check the status of my NFT listings → &#39;<b>Crypto Lookup Tool</b>&#39;</p></li><li><p class="paragraph" style="text-align:left;">Add a new row and just write &quot;To do&quot; in it → &#39;<b>Google Spreadsheet Tool</b>&#39;</p></li><li><p class="paragraph" style="text-align:left;">Convert 98 degrees Fahrenheit to Celsius using Python → &#39;<b>Python Tool</b>&#39;</p></li></ul></li><li><p class="paragraph" style="text-align:left;">Define several LLMs to test against. I tested several from OpenAI and Anthropic, a Mistral model, a few Llama models, and a Gemini model.</p></li></ol><h2 class="heading" style="text-align:left;" id="stage-2-run-the-agent-log-results">Stage 2 - Run the Agent + Log results</h2><ol start="1"><li><p class="paragraph" style="text-align:left;">Choose an <b>n</b> (I chose n=10)</p></li><li><p class="paragraph" style="text-align:left;">For each test datapoint, and for each LLM, shuffle the tools around and pass the order of the tools into the agent framework <b>n</b> times.</p><ul><li><p class="paragraph" style="text-align:left;">Each time, log the correct tool index, the chosen tool index, and whether the agent was correct.</p></li></ul></li></ol><h2 class="heading" style="text-align:left;" id="stage-3-calculate-results">Stage 3 - Calculate Results</h2><ol start="1"><li><p class="paragraph" style="text-align:left;">Calculate the accuracy, precision, recall, and F1 for each LLM on its tool selection along with broken-down metrics (see the 
final results in the GitHub repo)</p></li><li><p class="paragraph" style="text-align:left;">Calculate the % difference between each tool index being <b>chosen</b> and the index being <b>correct</b> to try and see if the LLM favored any particular tool indices. </p></li></ol><h1 class="heading" style="text-align:left;" id="the-results">The Results</h1><p class="paragraph" style="text-align:left;">The notebook with the test & results (see references) has about a dozen graphs in it but here are a few key takeaways:</p><h2 class="heading" style="text-align:left;" id="tool-selection-accuracy-can-vary-gr">Tool Selection Accuracy can vary greatly between LLMs</h2><p class="paragraph" style="text-align:left;">Depending on which LLM I tried, there were pretty stark differences in tool selection accuracy. It’s tempting to look at this and say “Oh ok, so Anthropic’s Claude 3.5 Haiku is clearly the best LLM for agents”. <b>Incorrect </b>🙂 This is a test of my agent framework on my tools and on my data. I hope you will take this post/notebook as a framework to follow when testing your own LLMs!</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0976dd93-d72f-4957-bed4-cdd69593e569/tool_acc.png?t=1731446570"/><div class="image__source"><span class="image__source_text"><p>To no one’s surprise, the choice of LLM impacted overall tool selection accuracy</p></span></div></div><h2 class="heading" style="text-align:left;" id="positional-bias-is-real">Positional Bias is Real</h2><p class="paragraph" style="text-align:left;">The graph below shows the average % difference between how often the agent chose a particular tool index (there are 5 bars because I had 5 tools) and how often that tool index was actually correct. 
So a 9.51% in the first bar means that, on average, the LLMs chose the first tool in the list 9.51% more often than that index was correct. For example, if that tool index was the correct tool index 95 times during the test, the LLM actually chose that tool index roughly 104 times. Meanwhile, the later tools are under-chosen, showing evidence of a positional bias.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d7e47861-f9da-4ace-9ddf-f668c1de774c/pos_bias_tool_Average_Proportion_between_Chosen_and_Correct_Tool_Index.png?t=1731447804"/><div class="image__source"><span class="image__source_text"><p>On average, the chosen LLMs tended to over-select tools in earlier indexes</p></span></div></div><p class="paragraph" style="text-align:left;">You might be thinking that it was the smaller open source models that really skewed the results, but if you look at the results broken down by model provider, even OpenAI models fall victim to positional bias:</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b9fac01a-47d0-4505-b4e8-4a82f97e3ef1/pos_bias_tool_OpenAI_Average_Proportion_between_Chosen_and_Correct_Tool_Index.png?t=1731447820"/><div class="image__source"><span class="image__source_text"><p>Even the “gold standard” OpenAI LLMs fall victim to positional biases</p></span></div></div><h1 class="heading" style="text-align:left;" id="conclusion">Conclusion</h1><p class="paragraph" style="text-align:left;">The allure of AI agents lies in their potential to solve a freeform task by choosing the right tools at the right moment, but the reality is far more nuanced. This experiment highlights that evaluating an agent&#39;s performance goes beyond simply checking the final answer and how long it took to get there. 
Even when an agent ultimately reaches the correct solution, inefficient tool selection driven by inherent biases can impact accuracy, latency, and consistency.</p><p class="paragraph" style="text-align:left;">Moreover, even the most advanced LLMs from top providers like OpenAI and Google are not immune to these challenges. The over-selection of tools appearing earlier in the list underscores the need for robust testing frameworks and deeper investigations into the LLM’s decision-making process.</p><p class="paragraph" style="text-align:left;">The takeaway? Don’t assume a strong final output implies flawless tool selection. Use testing frameworks like the one shared here to rigorously test, iterate, and refine your agents for better real-world performance. And remember, the right tool at the right time isn’t just magical; it’s measurable.</p><h1 class="heading" style="text-align:left;" id="references">References</h1><p class="paragraph" style="text-align:left;">This work came from my lecture & video on agents which you can find on <a class="link" href="https://learning.oreilly.com/live-events/ai-agents-a-z/0642572007604/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=evaluating-ai-agent-tool-selection" target="_blank" rel="noopener noreferrer nofollow">O’Reilly</a>. Here are the GitHub repositories for the lecture and video, and for the agent framework that I used to perform this test.</p><div class="embed"><a class="embed__url" href="https://github.com/sinanuozdemir/squad-goals?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=evaluating-ai-agent-tool-selection" target="_blank"><div class="embed__content"><p class="embed__title"> Squad Goals </p><p class="embed__description"> My own AI agent framework that I used to run this test. Feel free to install and contribute! 
</p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/a31f88dcefd991fa8075df4c93296ff8817e1bbd939f91f99ab537a7470ef81e/sinanuozdemir/squad-goals"/></a></div><div class="embed"><a class="embed__url" href="https://github.com/sinanuozdemir/oreilly-ai-agents/blob/main/notebooks/agent_positional_bias_tools.ipynb?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=evaluating-ai-agent-tool-selection" target="_blank"><div class="embed__content"><p class="embed__title"> AI Agents - Tool Selection Performance </p><p class="embed__description"> An introduction to the world of AI Agents on O’Reilly </p><p class="embed__link"> github.com/sinanuozdemir/oreilly-ai-agents/blob/main/notebooks/agent_positional_bias_tools.ipynb </p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/cbd9b5e4427f81bf250e2ba90ea7eb46740e7765eba0a8509fe73348621ec183/sinanuozdemir/oreilly-ai-agents"/></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=cd54f76f-c9bf-4b1a-a530-14059e123785&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>(Re-) Ranking RAG Solutions</title>
  <description>5 lines of code -&gt; better RAG</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0c8dc2d3-fb3f-4a71-bd4a-e7d35ec571f1/Screenshot_2024-07-25_at_7.10.50_AM.png" length="96271" type="image/png"/>
  <link>https://ai-office-hours.beehiiv.com/p/re-ranking-rag</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/re-ranking-rag</guid>
  <pubDate>Mon, 29 Jul 2024 14:26:17 +0000</pubDate>
  <atom:published>2024-07-29T14:26:17Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">In August of 2022, mere months before ChatGPT made its world debut, I wrote a post on Medium<sup>[1]</sup> about using auto-encoding LLMs like BERT to embed and retrieve documents from a vector database and then return responses to a query using information from that document. Sound familiar? I was describing a simplified version of <b>Retrieval-Augmented Generation (RAG) </b>inspired by a paper<sup>[2]</sup> in 2020. My version used two types of auto-encoding LLMs - one to retrieve and the other to “generate” a response by selecting the best subset of the document that answered the question (I know that’s not actual LLM text generation, but I wanted to use something open source and it was 2022).</p><p class="paragraph" style="text-align:left;">Auto-encoding LLMs are models that cannot generate text token by token like the “Generative AI” models - ChatGPT, Claude, Llama, or virtually any LLM on the market today - but are rather models whose sole purpose is to read text quickly and efficiently at much smaller sizes. 
To put that size difference in perspective, a case study in my book has a 70M parameter DistilBERT model beating ChatGPT (<b>2,500x bigger parameter-wise</b>) in a head to head fine-tuning classification test.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/slidesz/AGV_vUeoASGK2ChOp8u-VnifDyuY5D3XS2_gWUzFGjTOWJb3w4Qo6b-6soaApfgNCNxOgAogYuc3P8ndKZA2qFAqyACQWSulk-GwV9kQZa8c6VWDPTjtlInqsb_OmHrbsucEbuVpFALL5Y15VHb4Y2eV7jLB7hxGjLcM=nw?key=pPK5AYaUQ-z13aLFgyvMzQ"/><div class="image__source"><span class="image__source_text"><p>Case study from my book: DistilBERT (70M params) performing at a similar level as GPT 3.5 (175B params) on the same training data (<a class="link" href="https://hf.co/datasets/app_reviews?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=re-ranking-rag-solutions" target="_blank" rel="noopener noreferrer nofollow">https://hf.co/datasets/app_reviews</a>) while being nearly twice as fast as GPT 3.5. Size isn’t everything</p></span></div></div><p class="paragraph" style="text-align:left;"></p><p class="paragraph" style="text-align:left;">I’ve been both fascinated and disappointed in the field of auto-encoding models in recent years. So few companies seem to want to innovate on non-generative LLMs so when a use-case like RAG comes up where a huge chunk of that pipeline involves reading / retrieval i.e. 
not generating anything, I get excited.</p><p class="paragraph" style="text-align:left;">RAG can be broken down into three main steps (these figures are from my 2022 post but still are relevant):</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Indexing documents </b>- Using an embedding system to transform raw text into vectors and storing them in a database</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fb863fe2-40c8-496b-a6be-6538b0238d5d/Screenshot_2024-07-25_at_7.48.04_AM.png?t=1721918892"/></div></li><li><p class="paragraph" style="text-align:left;"><b>Retrieving documents</b> - Using (usually) the same embedding system to embed a query and using a vector similarity metric (like cosine similarity) to find the most relevant document</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e4a9e1ca-525e-4a0c-94e6-e26b0f67a368/Screenshot_2024-07-25_at_7.48.33_AM.png?t=1721918918"/></div></li><li><p class="paragraph" style="text-align:left;"><b>Generating a response </b>- Using an LLM to create a raw text response to a user’s query using information in the document (yes I know there’s a typo in the figure, 2022 me messed up).</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/af88d0a6-4857-4d04-b5a8-a5d1dcd2e766/Screenshot_2024-07-25_at_7.48.55_AM.png?t=1721918940"/></div></li></ol><p class="paragraph" style="text-align:left;">One of the main limitations of RAG systems is the quality of the retrieved document ranking. Cosine similarity between embeddings can only go so far in terms of matching queries to documents. 
This can be quantified by measuring how often an input query gets matched to the correct document, which we will refer to as the <b>top result accuracy</b> of a RAG system, namely that the #1 closest retrieved document is in fact the correct document that can answer the query. This post will go over an unsung hero in RAG that aims to maximize the effectiveness of this document ranking - the re-ranker.</p><h1 class="heading" style="text-align:left;" id="borrowing-government-data">Re-ranking Documents</h1><p class="paragraph" style="text-align:left;">At its core, a re-ranker is yet another LLM whose job it is to take in a small number (usually 10-50) of documents and the original query and rank the documents from most to least relevant. That sounds exactly like basic cosine retrieval because, well, it is the same result - a list of ranked documents.</p><p class="paragraph" style="text-align:left;">How the re-ranking LLM does this is where things get different. Re-ranking happens on a much smaller scale than basic retrieval using cosine similarity from a vector DB because re-rankers’ architectures (often <a class="link" href="https://www.sbert.net/docs/pretrained_cross-encoders.html?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=re-ranking-rag-solutions" target="_blank" rel="noopener noreferrer nofollow">cross-encoders</a>, which read the query and a document together in one forward pass instead of comparing precomputed embeddings) are often more memory-intensive and slower but yield more precise results. They are considered an optional step between retrieval and generation.</p><p class="paragraph" style="text-align:left;">Let’s look at a quick case study - a chatbot meant to help people navigate questions about Social Security.</p><h1 class="heading" style="text-align:left;" id="borrowing-government-data">Borrowing Government Data</h1><p class="paragraph" style="text-align:left;">Don’t worry, no one’s getting on a watch list for reading this. 
I’m taking just over 100 FAQs from <a class="link" href="https://faq.ssa.gov/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=re-ranking-rag-solutions" target="_blank" rel="noopener noreferrer nofollow">https://faq.ssa.gov</a> with corresponding help articles and using this as my data for this example. By the way, the full case study can be found in my O’Reilly RAG course<sup>[3]</sup>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/47a12562-8f9c-44cd-a97b-e44413493960/Screenshot_2024-07-25_at_7.11.27_AM.png?t=1721916691"/><div class="image__source"><span class="image__source_text"><p>Our data for this case study: pre-written FAQs about America’s Social Security system</p></span></div></div><p class="paragraph" style="text-align:left;">I will use OpenAI’s <code>text-embedding-3-small</code> model to do all document embeddings and Pinecone for my vector database. If you follow the code in <sup>[3]</sup> (namely the retrieval / generation notebooks), you will see that I simply embed all FAQs and store them (along with the URL and raw text) in a vector database for retrieval. I didn’t want to use a well-known benchmark here because, frankly, I already think most LLM benchmarks are unhelpful to the average person, and I grow increasingly worried that companies will simply overfit to these benchmarks to get market hype. Instead, I try to think of simple yet relatable non-benchmark examples.</p><p class="paragraph" style="text-align:left;">Now we need to create some test data so we can start to see how well our system is performing.</p><h1 class="heading" style="text-align:left;" id="generating-synthetic-test-data">Generating Synthetic Test Data</h1><p class="paragraph" style="text-align:left;">Grain of salt alert! We are going to ask GPT-4 to generate some potential questions to test our retrieval against. 
Synthetic data generation is a new sub-genre of generative tasks and one with consequential downstream effects. The test data we use here will inform us as to how well our chatbot is retrieving information and therefore is a measure of how well our bot can perform. My point here is that I will be using the prompt below to generate test data, but I cannot actually read the non-English examples and vet them myself, so I am taking the questions generated by GPT-4 here with a grain of salt.</p><div class="codeblock"><pre><code>I am designing a chatbot to use this document as information to our users.

Please write 10 questions that an average person not educated in this social security system might ask that can definitely be answered using information using the provided document.

Try to ask in a way that&#39;s confusing to really test our system&#39;s knowledge but still fair.

I need 5 in English, 2 in Spanish, 2 in Chinese, and 1 in French in that order.

Use this format to output:
Document: A given document to make questions from
JSON: [&quot;english question 1&quot;, &quot;english question 2&quot;, &quot;english question 3&quot;, ...  &quot;spanish question 1&quot;, &quot;spanish question 2&quot;, ..., &quot;french question 1&quot;]
###
Document: &#123;document&#125;
JSON:

&gt;&gt;&gt;

[(&#39;english&#39;,
  &#39;How do I start the process for getting disability benefits from Social Security?&#39;),
 ..
 (&#39;spanish&#39;, &#39;¿Cómo solicito beneficios por discapacidad del Seguro Social?&#39;),
 ..
 (&#39;chinese&#39;, &#39;我不在美国居住，我可以申请社会保障残疾福利吗？&#39;),
 ..
 (&#39;french&#39;,
  &quot;Comment puis-je contacter mon bureau de sécurité sociale local pour des prestations d&#39;invalidité?&quot;)]</code></pre></div><p class="paragraph" style="text-align:left;">With all that, let’s look at our baseline results of just using OpenAI’s embedder and Pinecone’s basic retrieval (just cosine similarity).</p><h1 class="heading" style="text-align:left;" id="baseline-results">Baseline Results</h1><p class="paragraph" style="text-align:left;">For the 220 questions (10 for each URL in a 20% sample of our scraped URLs), broken out by each language I generated data in, I calculated two items:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">The % of times the expected document was even in the list of <b>10 retrieved documents</b> <b>from Pinecone</b></p></li><li><p class="paragraph" style="text-align:left;">The % of times the expected document was the top document in the list (will always be less than or equal to the first number)</p></li></ol><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/feac776f-0bcd-43a7-9ffa-0ec9fbfb8bfe/Screenshot_2024-07-25_at_7.14.58_AM.png?t=1721916905"/><div class="image__source"><span class="image__source_text"><p>OpenAI embeddings alone are a decent showing with English performing the worst tied with Spanish at 82% top result accuracy.</p></span></div></div><p class="paragraph" style="text-align:left;"><b>Using OpenAI’s embeddings alone gives us about 84% top result accuracy overall </b>(weighted by language) on the synthetic test set. In some languages, the expected document did not always make it into the 10 documents retrieved from Pinecone. To raise that number, we could grab more documents or use a different / fine-tuned embedder. 
Both great things to test, but not the main point of this post.</p><h1 class="heading" style="text-align:left;" id="making-retrieved-documents-better-w">Making Retrieved Documents better with re-rankers</h1><p class="paragraph" style="text-align:left;">We finally arrive at the crux of this post. Once we retrieve the documents from our vector database, we can pass them along to a generative AI and call it a day. But with re-ranking systems and just 5-10 more lines of code (not hyperbole, check out the GitHub<sup>[3]</sup>), we can re-sort those documents from Pinecone to try and surface the actual relevant document to the top of the list. If we can consistently do this, we can pass fewer documents to our final RAG generation prompt resulting in a tighter, faster, and cheaper integration.</p><p class="paragraph" style="text-align:left;">I evaluated two re-ranking systems for this:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://cohere.com/blog/rerank-3?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=re-ranking-rag-solutions" target="_blank" rel="noopener noreferrer nofollow">Cohere’s v3 multilingual re-ranker</a> - Cohere is likely the largest company providing a marketable solution to the document re-ranking problem</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://docs.pongo.ai/what-is-pongo?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=re-ranking-rag-solutions" target="_blank" rel="noopener noreferrer nofollow">Pongo’s semantic filter</a> - Pongo is one of the few small companies innovating in this space</p></li></ol><p class="paragraph" style="text-align:left;">Both of them work in a really simple way: provide a query and <b>raw </b>documents (not the OpenAI embeddings, they don’t matter to the re-ranker) and get back an ordered list of documents from most to least relevant. 
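</p><p class="paragraph" style="text-align:left;">To make that interface concrete, here is a minimal Python sketch of the re-ranking step and of the top result accuracy metric. It is an illustration under assumptions, not the Cohere or Pongo API: the names <code>token_overlap</code>, <code>rerank</code>, and <code>top_result_accuracy</code> and the three example documents are made up, and <code>token_overlap</code> is a toy word-overlap scorer standing in for a real re-ranking model.</p><div class="codeblock"><pre><code>from typing import Callable, List, Sequence, Tuple


def token_overlap(query: str, document: str) -> float:
    """Toy stand-in for a re-ranker: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words.intersection(document_words)) / len(query_words)


def rerank(query: str, documents: Sequence[str],
           score_fn: Callable[[str, str], float]) -> List[Tuple[int, float]]:
    """Score every raw document against the query, sorted from most to least relevant.

    Returns (original_index, score) pairs so callers can map back to the retrieved list.
    """
    scored = [(index, score_fn(query, doc)) for index, doc in enumerate(documents)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_result_accuracy(ranked_ids: Sequence[Sequence[int]],
                        expected_ids: Sequence[int]) -> float:
    """Share of queries whose #1 ranked document is the expected one."""
    hits = sum(1 for ranked, expected in zip(ranked_ids, expected_ids)
               if ranked and ranked[0] == expected)
    return hits / len(expected_ids)


# Hypothetical documents, in the order a vector database might return them
retrieved = [
    "Report a death to Social Security by phone",
    "How to apply for disability benefits online",
    "Replace a lost Social Security card",
]
query = "how do i apply for disability benefits"

order = rerank(query, retrieved, token_overlap)
reranked_ids = [index for index, _ in order]

# The expected document for this query sits at index 1 of the retrieved list
accuracy_before = top_result_accuracy([[0, 1, 2]], [1])
accuracy_after = top_result_accuracy([reranked_ids], [1])</code></pre></div><p class="paragraph" style="text-align:left;">In a real pipeline the <code>score_fn</code> argument would be swapped for a call to whichever re-ranker is being tested, and everything else would stay the same.</p><p class="paragraph" style="text-align:left;">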
The test is simple - add this re-ranking step to the 10 retrieved documents from Pinecone. We will still be limited by the relevant document actually existing in the original 10, but we will be comparing Cohere and Pongo against simply using no re-ranking whatsoever.</p><h1 class="heading" style="text-align:left;" id="final-results">Final Results</h1><p class="paragraph" style="text-align:left;">Everyone’s RAG system is different and your data will be different. For the data outlined above, our final results can be summarized as follows:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Both Cohere and Pongo improved top result accuracy from 84% to ~90% (~7% increase in performance).</b></p></li><li><p class="paragraph" style="text-align:left;">Both models slowed the system down (not seen in the graph, but both systems more than doubled the duration of the testing process). This makes sense because we are actively performing a secondary LLM action.</p></li><li><p class="paragraph" style="text-align:left;">Cohere’s model (being explicitly trained for multilingual use-cases) outperformed Pongo on Chinese, French, and Spanish.</p></li><li><p class="paragraph" style="text-align:left;">Pongo beat Cohere on English examples (which represented 50% of the testing set).</p></li></ol><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0c8dc2d3-fb3f-4a71-bd4a-e7d35ec571f1/Screenshot_2024-07-25_at_7.10.50_AM.png?t=1721916654"/><div class="image__source"><span class="image__source_text"><p>Both Pongo and Cohere made our retrieval rankings better with ~5 lines of added code!</p></span></div></div><p class="paragraph" style="text-align:left;">Overall, adding re-ranking to a system can take mere minutes to code up, and as long as you have a proper testing set and a way to run tests automatically, there is no reason you cannot test your RAG 
systems against these re-rankers to see if they will have a net benefit on the retrieval accuracy.</p><p class="paragraph" style="text-align:left;">Happy re-ranking!</p><h1 class="heading" style="text-align:left;" id="references">References</h1><p class="paragraph" style="text-align:left;">[1] My August 2022 post on RAG:</p><div class="embed"><a class="embed__url" href="https://medium.com/@profoz/building-a-natural-language-interface-from-an-faq-using-pre-trained-language-models-1c150dd572df?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=re-ranking-rag-solutions" target="_blank"><div class="embed__content"><p class="embed__title"> Building a Natural Language Interface from an FAQ using pre-trained language models </p><p class="embed__description"> Quickly and easily build a natural language interface using a static knowledge base as your source </p><p class="embed__link"> medium.com/@profoz/building-a-natural-language-interface-from-an-faq-using-pre-trained-language-models-1c150dd572df </p></div><img class="embed__image embed__image--right" src="https://miro.medium.com/v2/resize:fit:1200/1*CN_Dl3u3p_ln21uJATh4cA.png"/></a></div><p class="paragraph" style="text-align:left;">[2] The Original RAG Paper:</p><div class="embed"><a class="embed__url" href="https://arxiv.org/abs/2005.11401?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=re-ranking-rag-solutions" target="_blank"><div class="embed__content"><p class="embed__title"> Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks </p><p class="embed__description"> Original RAG Paper </p><p class="embed__link"> arxiv.org/abs/2005.11401 </p></div><img class="embed__image embed__image--right" src="https:///static/browse/0.3.4/images/arxiv-logo-fb.png"/></a></div><p class="paragraph" style="text-align:left;">[3] My current RAG class materials:</p><div class="embed"><a class="embed__url" 
href="https://github.com/sinanuozdemir/oreilly-retrieval-augmented-gen-ai?tab=readme-ov-file&utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=re-ranking-rag-solutions" target="_blank"><img class="embed__image embed__image--top" src="https://opengraph.githubassets.com/e95b3b839b932d0c609b130dc4b45d534618637c0058238879e71493c3a0e088/sinanuozdemir/oreilly-retrieval-augmented-gen-ai"/><div class="embed__content"><p class="embed__title"> Sinan Ozdemir’s RAG Course on O’Reilly </p><p class="embed__link"> github.com/sinanuozdemir/oreilly-retrieval-augmented-gen-ai </p></div></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=5a23d180-e86a-4deb-b359-72dfb95c5465&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>To Quantize or not to Quantize</title>
  <description>Asking the right questions about quantization</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/11fb45e1-d523-4401-ae19-4faf791056c6/benchmark_qt_non_qt.png" length="232872" type="image/png"/>
  <link>https://ai-office-hours.beehiiv.com/p/quantizing-llms-llama-3</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/quantizing-llms-llama-3</guid>
  <pubDate>Mon, 06 May 2024 14:12:17 +0000</pubDate>
  <atom:published>2024-05-06T14:12:17Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[Llmops]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h1 class="heading" style="text-align:left;" id="introduction-to-quantization">Introduction to Quantization</h1><p class="paragraph" style="text-align:left;"><b>Quantization refers to the technique of representing a model using fewer bits by reducing the precision of its parameters.</b> This process involves converting continuous or high-precision values into a smaller set of discrete values, typically by mapping floating-point numbers to integers. <b>The primary goal of quantizing large language models (LLMs) is to decrease memory usage and accelerate inference.</b></p><p class="paragraph" style="text-align:left;">There are several methods to quantize a model, which I won&#39;t get into as there are already excellent resources available (see reference [3]). Instead, I wanted to focus on a specific use case I get asked about a lot as an AI consultant and teacher: deploying an off-the-shelf model without further fine-tuning. These models could be ones pre-trained by other organizations, like Llama-3-8B, or previously fine-tuned on specific datasets without quantization. This post will not cover the process of fine-tuning while quantizing, which involves techniques such as QLORA (I have code examples for this in reference [2]).</p><p class="paragraph" style="text-align:left;">Python code to quantize a model is relatively straightforward using popular packages like <b>transformers</b> which have implementations of algorithms like <b>NF4</b> (see the code sample below and reference [3] for more details). NF4, which stands for NormalFloat 4, is a particularly effective strategy for maintaining the performance of AI models. 
Originally introduced in the <a class="link" href="https://arxiv.org/abs/2305.14314?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=to-quantize-or-not-to-quantize" target="_blank" rel="noopener noreferrer nofollow">QLORA paper</a>, NF4 has become a preferred choice in modern quantization strategies.</p><div class="codeblock"><pre><code># Import necessary classes and functions from the transformers library
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch  # needed for torch.bfloat16 in the quantization config below

# Define the model name to load from Hugging Face&#39;s model hub
model_name = &#39;meta-llama/Meta-Llama-3-8B-Instruct&#39;

# Configure the quantization settings using BitsAndBytesConfig
# Setting load_in_4bit to True enables 4-bit quantization
# bnb_4bit_use_double_quant also quantizes the quantization constants to save extra memory
# bnb_4bit_quant_type specifies the NF4 quantization algorithm
# bnb_4bit_compute_dtype sets the data type for computation to bfloat16 for efficiency
bits_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type=&quot;nf4&quot;,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Initialize the tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load and configure the quantized model
qt_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    quantization_config=bits_config,    
    device_map=&quot;auto&quot;
).eval()  # Set the model to evaluation mode which disables training specific operations like dropout

# Load the non-quantized version of the same model
non_qt_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map=&quot;auto&quot;
).eval()  # Set the model to evaluation mode</code></pre></div><p class="paragraph" style="text-align:left;">The not-so-straightforward part is testing both quantized and non-quantized models side by side on our main three considerations:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Optimizing Inference</b> - memory and latency reduction</p></li><li><p class="paragraph" style="text-align:left;"><b>Raw token output differences</b> - measuring the raw differences between the next token prediction outputs</p></li><li><p class="paragraph" style="text-align:left;"><b>Performance on benchmarks / test sets</b> - running generative benchmarks and comparing the two models</p></li></ol><p class="paragraph" style="text-align:left;">I will be using <a class="link" href="https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=to-quantize-or-not-to-quantize" target="_blank" rel="noopener noreferrer nofollow">Meta’s Llama-3-8B-Instruct</a> model as my reference.</p><h2 class="heading" style="text-align:left;" id="consideration-1-optimizing-inferenc">Consideration 1 - Optimizing Inference</h2><p class="paragraph" style="text-align:left;">Probably the most well-known benefits of quantization are the inference gains both in memory usage and in latency/throughput. <b>Lower parameter precision means less memory required to hold the model and faster computations. 
</b>The memory usage and latency differences are dramatic between the two models and hold at both small and larger batch sizes.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3bc1bfe9-f6fa-496b-9f4e-8523b91da28b/qt_vs_nonqt.png?t=1714917767"/><div class="image__source"><span class="image__source_text"><p>Measuring the peak memory usage and latency of the forward pass of Llama 3-8B shows striking differences. The Non-Quantized model (red) uses far more memory (top) and takes far longer to process inputs in batch sizes between 1 and 32 (bottom).</p></span></div></div><p class="paragraph" style="text-align:left;">Quantized models are supposed to be faster and more memory efficient so this is just the tip of the iceberg. Are they as reliable as their non-quantized cousin? Are they better? Worse? Let’s see how we can find out.</p><h2 class="heading" style="text-align:left;" id="consideration-1-memory-inference-sp">Consideration 2 - Raw Token Output Differences</h2><p class="paragraph" style="text-align:left;">This next graph has me asking both versions of the Llama 3 model 163 questions from a subset of <a class="link" href="https://huggingface.co/datasets/cais/mmlu/viewer/virology?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=to-quantize-or-not-to-quantize" target="_blank" rel="noopener noreferrer nofollow">MMLU-Virology</a> (the benchmark content isn’t as relevant here) and using the <a class="link" href="https://medium.com/@mayurdhvajsinhjadeja/jaccard-similarity-34e2c15fb524?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=to-quantize-or-not-to-quantize" target="_blank" rel="noopener noreferrer nofollow"><b>Jaccard Index (Similarity)</b></a> - a similarity metric between two sets, defined as the number of items they have in common divided by the total number of unique items between them - 
to measure the differences between the raw next token predictions for each input at various cutoff points - k=1, 2, 3, etc. This will give us a straightforward way to quantify the differences in raw model output of quantized and non-quantized models.</p><p class="paragraph" style="text-align:left;">I also chose the Jaccard Index for its robustness in scenarios where the exact alignment of token sets is less important than the overall overlap, making it ideal for evaluating models where slight deviations in token predictions are acceptable. We can see that most tokens are in common but a non-trivial number of tokens are in fact different.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/76f172b2-78a9-4989-9f94-ec45b9adbeb7/jaccard_top_k.png?t=1714917834"/><div class="image__source"><span class="image__source_text"><p>The Jaccard similarity between the top k predicted tokens of the quantized and non-quantized model on a subset of MMLU-virology</p></span></div></div><p class="paragraph" style="text-align:left;">Given this graph, roughly speaking, we can expect about 75-80% of the tokens to match in the top 1, 3, 5, 10, and 20 predicted tokens for this test set, which can lead to performance differences (see consideration 3). 
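</p><p class="paragraph" style="text-align:left;">For reference, the top-k comparison can be sketched in a few lines of plain Python. This is a simplified illustration, not the code from reference [1]: the helper names <code>top_k_ids</code> and <code>jaccard</code> and the two logit lists are made up, standing in for the next-token logits that each model would produce for the same prompt.</p><div class="codeblock"><pre><code>from typing import Dict, Sequence, Set


def top_k_ids(logits: Sequence[float], k: int) -> Set[int]:
    """Token ids of the k largest logits (the top-k next-token candidates)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Items in common divided by the total number of unique items."""
    union = a.union(b)
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(a.intersection(b)) / len(union)


# Made-up next-token logits over a 6-token vocabulary for a single prompt
non_qt_logits = [2.0, 1.5, 0.3, 0.1, -1.0, -2.0]
qt_logits = [1.9, 0.2, 1.6, 0.0, -1.1, -2.2]

# Jaccard similarity of the two models' top-k token sets at each cutoff k
scores: Dict[int, float] = {
    k: jaccard(top_k_ids(non_qt_logits, k), top_k_ids(qt_logits, k))
    for k in (1, 2, 3)
}</code></pre></div><p class="paragraph" style="text-align:left;">In this toy case both models agree on the single most likely token, disagree on the runner-up (so the k=2 score drops to 1/3), and agree again on the top 3 as a set. Averaging such scores over every question in the test set gives one point per k on a graph like the one above.</p><p class="paragraph" style="text-align:left;">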
These raw token outputs will not only affect performance on test sets but will also change how the inference parameters we set behave. For example, a top_p value (nucleus sampling, which only samples from the smallest set of tokens whose cumulative probability reaches p) tuned for a non-quantized model might yield drastically different results on the quantized version.</p><h2 class="heading" style="text-align:left;" id="consideration-1-memory-inference-sp">Consideration 3 - Performance on Test Sets</h2><p class="paragraph" style="text-align:left;">Considerations 1 and 2 were measuring the differences in raw next token predictions both in similarity and in speed/memory usage but neither was considering the accuracy of what those tokens represented. We saw non-trivial differences in which tokens might be output, which suggests that there will be differences in benchmark performance. </p><p class="paragraph" style="text-align:left;">I’m planning a post on benchmarking in more detail but for now, I’m going to pass a very simple 0-shot prompt to each model on a subset of MMLU-Virology. I measured the words per minute (which I expected to be better for the quantized model) and the accuracy on the multiple choice questions.</p><p class="paragraph" style="text-align:left;"><i>Note: The only inference parameter I set was a temperature of 0.1 to induce some more consistency and reproducibility of the experiment. 
This choice will also highlight any token differences by making the differences in token probabilities sharper.</i></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/feb9c34a-2f9c-4b55-941f-b90794822687/benchmark_qt_non_qt.png?t=1714949818"/><div class="image__source"><span class="image__source_text"><p>The Quantized Model (Red in both graphs) has a better words-per-minute rate (top) but performs slightly worse on a subset of the MMLU benchmark (bottom).</p></span></div></div><p class="paragraph" style="text-align:left;">Right out of the gate, the non-quantized model performs slightly better on this benchmark subset but has a much lower WPM (no surprise there given the forward pass calculations in consideration 1). The difference in performance comes down to the fact that <b>quantization is objectively altering the model from how it was trained</b>. It’s not always going to be true that the quantized version of a model will perform worse, but especially on well-known benchmarks like MMLU that companies like Meta, OpenAI, Anthropic, etc. test their models on, it’s a good bet. <b>It’s always good to test.</b></p><p class="paragraph" style="text-align:left;">To mitigate this, we could fine-tune the quantized model using a technique like QLoRA (reference [2]).</p><h1 class="heading" style="text-align:left;" id="conclusion">Conclusion</h1><p class="paragraph" style="text-align:left;">Quantization offers tangible benefits in terms of reducing memory usage and enhancing the speed of computations. This has been demonstrated effectively in the case of Llama-3-8B, where quantized models significantly outperform their non-quantized counterparts in memory efficiency and processing speed during inference. </p><p class="paragraph" style="text-align:left;">However, quantization does come with built-in trade-offs. 
The alterations in precision can lead to differences in token output and potentially affect performance on benchmarks and practical applications. The balance between efficiency and accuracy must be carefully tested and managed, and for critical applications, performing some fine-tuning post-quantization using QLoRA may be necessary to restore or enhance model performance. </p><p class="paragraph" style="text-align:left;">I hope this helps!</p><h1 class="heading" style="text-align:left;" id="conclusion-reference">References</h1><p class="paragraph" style="text-align:left;"><b>[1] Code for these graphs</b></p><div class="embed"><a class="embed__url" href="https://colab.research.google.com/drive/12RTnrcaXCeAqyGQNbWsrvcqKyOdr0NSm?usp=sharing&utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=to-quantize-or-not-to-quantize" target="_blank"><div class="embed__content"><p class="embed__title"> Quantization Comparisons </p><p class="embed__link"> colab.research.google.com/drive/12RTnrcaXCeAqyGQNbWsrvcqKyOdr0NSm?usp=sharing </p></div><img class="embed__image embed__image--right" src="https://colab.research.google.com/img/colab_favicon_256px.png"/></a></div><p class="paragraph" style="text-align:left;"><b>[2] QLoRA example (see the SFT notebook in Colab)</b></p><div class="embed"><a class="embed__url" href="https://ai-office-hours.beehiiv.com/p/aligning-llms?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=to-quantize-or-not-to-quantize" target="_blank"><div class="embed__content"><p class="embed__title"> Guide to AI Alignment with Reinforcement Learning </p><p class="embed__description"> What RL can and cannot do </p><p class="embed__link"> ai-office-hours.beehiiv.com/p/aligning-llms </p></div><img class="embed__image embed__image--right" src="https://beehiiv-images-production.s3.amazonaws.com/uploads/publication/logo/f8bee330-7f91-44a6-a969-50f8470e4ea5/Screenshot_2023-06-12_at_12.52.42_PM.png"/></a></div><p class="paragraph" 
style="text-align:left;"><b>[3] Primer on Quantization from HuggingFace:</b></p><div class="embed"><a class="embed__url" href="https://huggingface.co/docs/peft/main/en/developer_guides/quantization?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=to-quantize-or-not-to-quantize" target="_blank"><div class="embed__content"><p class="embed__title"> Quantization </p><p class="embed__description"> We’re on a journey to advance and democratize artificial intelligence through open source and open science. </p><p class="embed__link"> huggingface.co/docs/peft/main/en/developer_guides/quantization </p></div><img class="embed__image embed__image--right" src="https://huggingface.co/front/thumbnails/docs/peft.png"/></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=7b7b57a6-4f4a-49ce-b9f9-2e86afb1ee20&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
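<p class="paragraph" style="text-align:left;">As a rough sketch of the top-k Jaccard comparison described earlier in this post (not the code from the Colab notebook), the snippet below compares the top-k token sets of two score vectors. The helper names <code>top_k_ids</code> and <code>jaccard_at_k</code> and the toy score values are assumptions made up for illustration; in a real run the scores would be the next-token logits from the quantized and non-quantized models.</p>

```python
def top_k_ids(scores, k):
    """Return the set of indices of the k largest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard_at_k(scores_a, scores_b, k):
    """Jaccard similarity (intersection over union) of two top-k token sets."""
    a = top_k_ids(scores_a, k)
    b = top_k_ids(scores_b, k)
    return len(a.intersection(b)) / len(a.union(b))


# Toy next-token scores over a 6-token vocabulary (made-up values, not real logits).
full_precision = [2.0, 1.5, 0.3, 0.1, -1.0, -2.0]
quantized = [1.9, 0.2, 1.4, 0.1, -1.1, -2.2]

for k in (1, 2, 3):
    print(k, jaccard_at_k(full_precision, quantized, k))
```

<p class="paragraph" style="text-align:left;">Because the comparison is on sets, two models that agree on the top 3 tokens but rank them differently still score 1.0 at k=3, which is the order-insensitivity that motivated the metric.</p>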
  ]]></content:encoded>
</item>

      <item>
  <title>Probing LLMs for a World Model</title>
  <description>Using linear probes to dissect internal LLM embeddings to check for a hint of an internal world model.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6ef87456-e05d-458d-b8c2-10be93519529/Screenshot_2024-04-25_at_11.26.27_AM.png" length="134375" type="image/png"/>
  <link>https://ai-office-hours.beehiiv.com/p/llm-probing</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/llm-probing</guid>
  <pubDate>Thu, 25 Apr 2024 15:31:43 +0000</pubDate>
  <atom:published>2024-04-25T15:31:43Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[How To]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">There are active debates over whether LLMs are just memorizing vast amounts of statistics or if they can learn a more cohesive representation of the world whose language they model. Some have found evidence for the latter by analyzing the learned representations of datasets and even go so far as to discover that LLMs can learn linear representations of space and time (<a class="link" href="http://arxiv.org/abs/2310.02207?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=probing-llms-for-a-world-model" target="_blank" rel="noopener noreferrer nofollow">arxiv.org/abs/2310.02207</a>).</p><p class="paragraph" style="text-align:left;">As part of the 2nd edition of my latest LLM book (coming out later this year) one idea I wanted to add as a net new section aimed at recreating some of the work done in this paper by looking at a dataset comes from a paper entitled “A cross-verified database of notable people, 3500 BC-2018 AD” claiming to build a “comprehensive and accurate database of notable individuals”; just what we need to probe some LLMs on their ability to retain information about notable individuals they read about on the web.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/07b904f7-9978-4ff2-8680-1f0b5a6f93e3/Screenshot_2024-04-25_at_8.21.11_AM.png?t=1714047675"/><div class="image__source"><span class="image__source_text"><p>I’m lucky to live in an age where open data for so many things exist: <a class="link" href="http://doi.org/10.1038/s41597-022-01369-4?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=probing-llms-for-a-world-model" target="_blank" rel="noopener noreferrer nofollow">doi.org/10.1038/s41597-022-01369-4</a></p></span></div></div><p class="paragraph" style="text-align:left;">Our steps to 
conduct the probe will be:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Design a prompt. At its simplest we will just say the name of the individual - like “Albert Einstein”</p></li><li><p class="paragraph" style="text-align:left;">Run a forward pass through our LLM and grab embeddings from the middle layer and the final layer of our LLM’s hidden states.</p><ol start="1"><li><p class="paragraph" style="text-align:left;">For auto-encoding models like BERT, we will grab the reserved CLS token’s embedding and for auto-regressive models like Llama or Mistral, we will grab the embedding of the final token.</p></li></ol></li><li><p class="paragraph" style="text-align:left;">Use those token embeddings as inputs to a linear regression problem where we attempt to fit to three fields of the dataset plus a fourth control:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>birth</b> - the birth year of the individual</p></li><li><p class="paragraph" style="text-align:left;"><b>death</b> - the death year of the individual (we filter to only use people who have died so this value is filled)</p></li><li><p class="paragraph" style="text-align:left;"><b>wiki_readers_2015_2018</b> - average per year number of page views in all Wikipedia editions (information retrieved in 2015–2018). We will use this as a weak signal of how notable the individual is</p></li><li><p class="paragraph" style="text-align:left;"><b>random gibberish</b> - just np.random.rand(len(dataset)). 
We will use this as a control as we should not be able to see any prediction signal</p></li></ol></li></ol><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/7AeBwuW3f7EoOglsXkeA14CAam2ZXyJPTIYQwQwtk4a_26uGFHTNCDTzWsqpx5Zpzsdpca7xgbCY7yxs8R2Gux38t3rLYypdVZGkL47lW9hPKWyTjut-yVBj1D97NgGY0Juj7UcHtIdz"/><div class="image__source"><span class="image__source_text"><p>Probing gives us a way to understand how much information is locked away within the parameters of a model and whether or not we can extract that information through external processes. We place classifiers (or, in our case, regression layers) on top of hidden states and attempt to extract information like the birth year of the person we named in the original prompt.</p></span></div></div><p class="paragraph" style="text-align:center;"></p><p class="paragraph" style="text-align:left;">The goal of probing is not to act in place of an evaluation for a task but rather as an evaluation of a model as a whole in particular domains. The dataset I chose for this represents a relatively “generic” task - remembering information the model has read. </p><h3 class="heading" style="text-align:left;" id="probing-results"><b>Probing Results</b></h3><p class="paragraph" style="text-align:left;">For every model, we probe the final token embedding at the first, middle, and final layers to regress to our four columns. The next figure shows an example of probing Llama 2 13b’s middle layer. 
Our birth year and death year probes perform surprisingly strongly; an RMSE of 80 years and R2 of over .5 is not the worst linear regressor I’ve trained, especially considering the scale of our data.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/Ujnd5QyTwmfQ8jpRGoRAYS846MlaVH4-Hb39pAAndlzDrCbdL6gkOLeOY_EXalCf-KGBrIRYC8ESMBgP8v8mrbichFV0yjbLHAOOiPHiHmmY6gRe_U6_XuXWeaY_7b4gAg_oFf_jiR5D"/><div class="image__source"><span class="image__source_text"><p>An example of probing the middle layer of a Llama 2 13b model with a constructed prompt. Our birth (top left) and death (top right) probes perform relatively well (R2 of above .5) while readership (bottom left) performs less well (R2 of .32) and our gibberish regression model performs poorly as expected (R2 of 0).</p></span></div></div><p class="paragraph" style="text-align:center;"></p><p class="paragraph" style="text-align:left;">The next figure shows a smattering of models I probed by averaging the R2 achieved by a linear regression on the birth year against the embeddings from the middle and the final layer. The four shortest bars represent auto-encoding BERT models with far fewer parameters than Llama, SAWYER (a chat-aligned version of Llama 2 I made), and Mistral v0.1 and v0.2.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8de91e38-108f-45a6-8178-1749b5d3b797/Screenshot_2024-04-25_at_11.18.47_AM.png?t=1714058331"/><div class="image__source"><span class="image__source_text"><p>Across 15 models, we see a wide range of R2 scores. 
BERT models, despite having the lowest scores, also have far fewer parameters, making them perhaps more efficient at storing information.</p></span></div></div><p class="paragraph" style="text-align:left;">A few notable takeaways:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">BERT base multilingual outperformed BERT large English, showing how much the data that LLMs are pre-trained on matters</p></li><li><p class="paragraph" style="text-align:left;">Mistral v0.2 as a 7B model performs as well as the Llama 2 13b models, showing that parameter count is not everything</p></li><li><p class="paragraph" style="text-align:left;">The non-instruct Llama 2 13B performed better when given a structured prompt (“basic information about X” vs simply “X”), showing how prompting can drastically alter the amount of information being retrieved</p></li></ol><p class="paragraph" style="text-align:left;">Are any of these “good” predictors of birth and death year? No, absolutely not, but that’s not the point. The point is to evaluate a model’s ability to encode and retrieve pre-trained knowledge. Moreover, even though our BERT models performed much worse, remember that (A) they are several years older than the other models tested and (B) they are ~72x smaller than the Llama 13B models and ~40x smaller than the 7B models. </p><p class="paragraph" style="text-align:left;">The next bar graph shows the efficiency of three models measured by the number of parameters needed to achieve a single R2 value, so lower means more efficient. 
BERT takes the cake for being able to retain the information much more efficiently, most likely due to the nature of its auto-encoding language modeling architecture and pre-training.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/OxUBWYK8Q1pQkpCcEI5cT3K6C22zmSrsOSug44brNZy2pd2bfS4u9dlsIDeoqEesxdg00xNZjTrTecWW-K1cjpBACkQpN5QeCDu1GbcoVLl9H3XdpW2m_E0xPTVDYU0hnAxUm7Y6r5m8"/><div class="image__source"><span class="image__source_text"><p>Between BERT, Llama 2 13b, and Llama 2 7b, the number of parameters it takes to achieve the R2 in our probe can indicate how efficiently the model encodes information. BERT requires far fewer parameters than Llama 2 to extract encoded information but would require more pre-training on recent data to become on par with the Llama 2 models’ performance</p></span></div></div><p class="paragraph" style="text-align:center;"></p><p class="paragraph" style="text-align:left;">For a second probe, I ran the <a class="link" href="https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=probing-llms-for-a-world-model" target="_blank" rel="noopener noreferrer nofollow">GSM8K</a> testing data through five models and built similar probes to the actual answer of the problem. Below we can see our results.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/638cff6e-19dc-4a80-888d-1f8e7ad9ead6/Screenshot_2024-04-25_at_7.36.28_AM.png?t=1714044992"/><div class="image__source"><span class="image__source_text"><p>Probing 6 models on the GSM8K benchmark by taking the final token of the input word problem and regressing to the actual answer.</p></span></div></div><p class="paragraph" style="text-align:left;">It seems that Mistral v0.1 and v0.2 models have more retrievable 
encoded knowledge than the Llama models when it comes to mathematical word problems making them potential prime candidates for fine-tuning tasks related to math and logic.</p><p class="paragraph" style="text-align:left;">Check out the raw code for the Llama 3 Probe here: <a class="link" href="https://colab.research.google.com/drive/1e1d9fATVjVun-_tPj4vS_DSTGaIfxs01?usp=sharing&utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=probing-llms-for-a-world-model" target="_blank" rel="noopener noreferrer nofollow">https://colab.research.google.com/drive/1e1d9fATVjVun-_tPj4vS_DSTGaIfxs01?usp=sharing</a> I’m still prettifying everything 😀 </p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=9b04ec34-e7f5-48c6-a138-96bc1641b111&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
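<p class="paragraph" style="text-align:left;">The probing steps described in this post can be sketched with a plain least-squares regression. This is a minimal illustration, not the code from the Colab notebook: the embeddings here are random stand-ins rather than real LLM hidden states, the targets are synthetic, and the helper <code>probe_r2</code> is a hypothetical name. It assumes NumPy is installed.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for hidden-state embeddings: 500 "people", 16 dimensions.
# In a real probe these would come from a forward pass of the LLM.
embeddings = rng.normal(size=(500, 16))

# Synthetic targets: one linearly encoded in the embeddings, one pure noise.
true_weights = rng.normal(size=16) * 100.0
birth_year = embeddings @ true_weights + rng.normal(scale=50.0, size=500)
gibberish = rng.random(500)  # control column, like np.random.rand(len(dataset))


def probe_r2(features, target, n_train=400):
    """Fit a linear regression on the first n_train rows, return R2 on the rest."""
    # Append a column of ones so the regression can learn an intercept.
    design = np.hstack([features, np.ones((len(features), 1))])
    coef, *_ = np.linalg.lstsq(design[:n_train], target[:n_train], rcond=None)
    predictions = design[n_train:] @ coef
    residual = np.sum((target[n_train:] - predictions) ** 2)
    total = np.sum((target[n_train:] - np.mean(target[n_train:])) ** 2)
    return 1.0 - residual / total


r2_birth = probe_r2(embeddings, birth_year)
r2_gibberish = probe_r2(embeddings, gibberish)
print("birth R2:", round(r2_birth, 3))
print("gibberish R2:", round(r2_gibberish, 3))
```

<p class="paragraph" style="text-align:left;">Evaluating on held-out rows matters here: with enough embedding dimensions, a linear model can fit noise on the rows it was trained on, so the gibberish control only works as a sanity check when R2 is measured on data the probe has not seen.</p>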
  ]]></content:encoded>
</item>

      <item>
  <title>LLMs Aligned! But to what end?</title>
  <description>What RL can and cannot do</description>
  <link>https://ai-office-hours.beehiiv.com/p/aligning-llms</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/aligning-llms</guid>
  <pubDate>Fri, 08 Mar 2024 15:00:00 +0000</pubDate>
  <atom:published>2024-03-08T15:00:00Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[Llm Alignment]]></category>
    <category><![CDATA[How To]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h1 class="heading" style="text-align:left;" id="introduction-realigning-our-expecta"><b>Introduction - Re-aligning Our Expectations of AI</b></h1><p class="paragraph" style="text-align:left;">Reinforcement Learning (RL) has become one of the primary engines powering AI <b>alignment</b>, the process of fine-tuning an AI model (usually LLMs) to behave according to a certain set of standards and styles. Reinforcement learning provides the unique ability to instill dimensions of human style and ethics outside of the confines of relatively strict next-token prediction. Reinforcement Learning offers us a chance to supplement traditional fine-tuning methods of prompt-response pairs with a system designed to “nudge” the AI in a direction - funnier, more neutral, more diverse, etc.</p><p class="paragraph" style="text-align:left;">Just want code? here you go 🙂 it’s at the bottom of this repo.</p><div class="embed"><a class="embed__url" href="https://github.com/sinanuozdemir/oreilly-llm-alignment?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=llms-aligned-but-to-what-end" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub - sinanuozdemir/oreilly-llm-alignment </p><p class="embed__description"> Contribute to sinanuozdemir/oreilly-llm-alignment development by creating an account on GitHub. </p><p class="embed__link"> github.com/sinanuozdemir/oreilly-llm-alignment </p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/02b6a0e3f754508f18e65313ca7059b099fb87017627d94e41a70a4eae4e3068/sinanuozdemir/oreilly-llm-alignment"/></a></div><p class="paragraph" style="text-align:left;">To instill these kinds of behaviors into an LLM -let’s say we want the AI to be funnier - that would mean you need to go over all of your training data and make sure that examples are “funny enough” to be considered good training data. But who is deciding what is “funny enough”? 
What if learning humor comes at the cost of the AI’s primary objective (usually answering questions and carrying conversations)? </p><p class="paragraph" style="text-align:left;">My upcoming workshop at ODSC East and this post focuses on <b>Reinforcement Learning from Feedback (RLF) </b>which involves giving an AI iterative feedback on solving a task and letting the LLM adapt its own performance in the hopes of having the AI act in a more expected manner and getting better feedback over time. The most common application of this is in training instruction-following AIs, which is exactly what I will be going over.</p><div class="embed"><a class="embed__url" href="https://odsc.com/speakers/aligning-open-source-llms-using-reinforcement-learning-from-feedback/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=llms-aligned-but-to-what-end" target="_blank"><div class="embed__content"><p class="embed__title"> Aligning Open-source LLMs Using Reinforcement Learning from Feedback </p><p class="embed__link"> odsc.com/speakers/aligning-open-source-llms-using-reinforcement-learning-from-feedback </p></div><img class="embed__image embed__image--right" src="https://odsc.com/wp-content/uploads/2023/08/West_speakers_web_Sinan-Ozdemir-1-Cropped.png"/></a></div><h1 class="heading" style="text-align:left;" id="case-study-teaching-a-llama-to-chat"><b>Case Study - Teaching a Llama to chat</b></h1><p class="paragraph" style="text-align:left;">In a previous post for an ODSC workshop I gave, I showed off results from one of my go-to RLF case studies - fine-tuning a FLAN-T5 model to write more neutral news summaries: <a class="link" href="https://opendatascience.com/harnessing-llm-alignment-making-ai-more-accessible?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=llms-aligned-but-to-what-end" target="_blank" rel="noopener noreferrer nofollow">https://opendatascience.com/harnessing-llm-alignment-making-ai-more-accessible</a></p><p 
class="paragraph" style="text-align:left;">In this post I want to show off my second, meatier go-to case study - making a conversational chatbot from a raw pre-trained LLM. Its name is <b>SAWYER - Sinan’s Attempt at Wise Yet Engaging Responses</b> - because I wanted to make a fun name for my LLM too.</p><p class="paragraph" style="text-align:left;">That already sounds like an oxymoron doesn&#39;t it? - “raw pre-trained” - but what I mean by that is our base model will be Meta’s <b>non chat-aligned</b> Llama 2 model, meaning this model has not been trained to answer questions conversationally when it comes to us off the shelf.</p><p class="paragraph" style="text-align:left;">Our RLF process can be broken down into three steps:</p><p class="paragraph" style="text-align:left;"><b>Supervised Fine-Tuning (SFT)</b></p><ol start="1"><li><p class="paragraph" style="text-align:left;">Grab Meta’s 7b non chat model weights: hf.co/meta-llama/Llama-2-7b-hf</p></li><li><p class="paragraph" style="text-align:left;">Fine-tune the model with several conversations to learn how to convert embedded knowledge into a productive conversation</p></li></ol><p class="paragraph" style="text-align:left;"><b>Reward Training (RT)</b></p><ol start="1"><li><p class="paragraph" style="text-align:left;">Get a dataset of scored responses to a conversational reply, indicating which responses humans preferred</p></li><li><p class="paragraph" style="text-align:left;">Fine-tune a RoBERTa model to distinguish between preferred and non-preferred responses to a conversation</p></li></ol><p class="paragraph" style="text-align:left;"><b>Reinforcement Learning from Feedback (RLF via PPO)</b></p><ol start="1"><li><p class="paragraph" style="text-align:left;">Obtain an entirely new set of <b>only</b> prompts with the bot response at the end missing</p></li><li><p class="paragraph" style="text-align:left;">Let the LLM reply to a few and use the reward model to assign rewards to the responses - positive is good, negative is 
bad</p></li><li><p class="paragraph" style="text-align:left;">Let the LLM update its parameters, taking into consideration how much reward it got and how far the updated model has deviated from the original weights</p></li></ol><p class="paragraph" style="text-align:left;">The figure below shows the RLF process (the third step) at a very high level:</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/2VckUAXlZ_a5RrCoKJG7_IgKrXdiMTQMx25G-0_TmKo4ShlkYewautFfokd7fUS3og4o9sBXSQNoSGEieEIudtfbXfutKwm4wtH_MCAwOeLSd78Nma2jzEl3P0tJ01IsOAaWSpkxDBc2bThJ_tP-y5Q"/><div class="image__source"><span class="image__source_text"><p>Our RL loop has SAWYER answering questions, being graded on its performance, and asking it to try again with updated parameters and yes, that image of a llama is AI generated 🙂</p></span></div></div><p class="paragraph" style="text-align:left;">The workshop will cover the nitty gritty of how to code all of this but for now let’s skip to the fun part: the results!</p><h1 class="heading" style="text-align:left;" id="the-results"><b>The Results</b></h1><p class="paragraph" style="text-align:left;">We will see the full suite of results during our workshop but some notable examples stand out. 
Let’s ask the same question of our three versions of SAWYER: no alignment whatsoever (base Llama 2 7b), only supervised fine-tuning (SFT), and supervised fine-tuning plus reinforcement learning from feedback (SFT + RLF).</p><p class="paragraph" style="text-align:center;"></p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/Z_w_I1e2nyb46euE1VaXBJTWBCoufrHBa2QBgIIj8fQKq2LmknArlT58RF5cf641zIeQ-uRRt6Ikj35sHkIH-f-mK9sNV9ZKpyoCjo5-x5xOL49MT41VU7dihgbf2cKwFQ4W9gbJ9Z9_SWkMC5GGhDM"/><div class="image__source"><span class="image__source_text"><p>SAWYER learns to answer questions with SFT, but learns to answer them in a more conversational way with RL</p></span></div></div><p class="paragraph" style="text-align:left;">We can see a notable difference between all three stages, starting with base non-chat Llama 2, which is trying to write a multiple choice question (I guess?), and ending with our final SAWYER model, which is the chattiest about the actual answer to the question. That is a relatively cherry-picked example, but even when we zoom out and test our model against a test set of conversations (done before and after applying RL to our chatbot) and plot the achieved reward scores, our model post RL is on average getting higher preference rewards with a much lower variance:</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/PEuezN4vKk7USqLxEszl9u3TWGoWqi96ae5W9jzKRTJh_ebqQl3xFOW_MZmba1ki32vh3LD_4TRT7BVSm9re7fo_xR8BY1aN_V0NnYqhKKKwZlPPV01VoXWpQT-o1R8shVS5DvjN0NuMldnaK5Sy640"/><div class="image__source"><span class="image__source_text"><p>We see statistically significant changes in rewards from before (SFT only) and after alignment via RL</p></span></div></div><p class="paragraph" style="text-align:left;">This means that the final SAWYER model, post RL, seems to be getting higher rewards, more consistently.</p><p class="paragraph" style="text-align:left;">Reinforcement learning doesn’t help with everything though. 
For example, we can’t expect a model that received more rewards for answering in a way that humans prefer to be “smarter”. I ran these models against some well known benchmarks (below I’m showing truthful_qa and mmlu[world_religions]) and they got basically the same accuracy score:</p><p class="paragraph" style="text-align:center;"></p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/KJJd-r8J_hXWlKwhvuDidEZqGprNLnwCpgrOM6Cqt00xPEzGsLzRau2DC9-AYSEZqMS6xWpKLbowt7zWflZPFHX-tuH5Q_yUmsnFddidzBfK0nTlqizPbqfgncLht8eEVF9rM0s-8jjVbFF4Jnv6rKU"/><div class="image__source"><span class="image__source_text"><p>Our SFT and SFT + RLF models perform basically at the same level on tasks where the model needs to respond accurately and style is irrelevant (no chain of thought was applied here, I simply asked the model to answer a question directly)</p></span></div></div><p class="paragraph" style="text-align:left;">I wasn’t expecting SAWYER to knock these benchmarks out of the park by any means, I just wanted to show the difference between what RL can and cannot help with. Aligning a model to chat factually and conversationally involves several steps and each step comes with caveats and nuances. 
Navigating these waters is challenging without step by step guidance and that is exactly what I will be providing at my upcoming workshop!</p><h1 class="heading" style="text-align:left;" id="conclusion"><b>Conclusion</b></h1><p class="paragraph" style="text-align:left;">The exploration of Reinforcement Learning from Feedback (RLF) as a means to fine-tune Large Language Models (LLMs) towards specific behavioral goals—such as conversational attitudes, neutrality, or diversity—represents a significant leap forward in our quest to make AI more adaptable and responsive to human needs.</p><p class="paragraph" style="text-align:left;">Through case studies like SAWYER, we can see firsthand the potential of SFT and RL to transform a pre-trained model into a more engaging and conversational agent. The process, involving a blend of supervised fine-tuning and reinforcement learning, underscores the complexity of aligning AI with nuanced human qualities. The results, while encouraging, also highlight the inherent limitations of current methodologies. While RL can guide models to interact in more human-like ways, it does not inherently increase their factual accuracy or understanding of the world.</p><p class="paragraph" style="text-align:center;"></p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/UE3zaYqRdIe6EAarPKL3_YqqbYe0WiQAMdr6Q894RT8CQ1wPx-qbkN18Mrh4qsNbQEyIYR3a3hO4lNjwv9k0WdFQrTs8rRM5LEwt8Wb_qM-x8sivHORpIjhRa7kFFfhh2w4MLum-Blp9GsbpWWkVT3k"/><div class="image__source"><span class="image__source_text"><p>An overview of the three elements of LLM alignment</p></span></div></div><p class="paragraph" style="text-align:left;">The journey of aligning AI with human expectations is ongoing. 
The successes and limitations of using RL from Feedback signal that while we can nudge AI towards more human-like interactions, the end goal—creating AI that truly understands and reflects human values, humor, and ethics—remains a challenging frontier. As we move forward, it is crucial to continue refining our approaches, questioning our objectives, and considering the broader implications of our quest to create AI that is not just aligned, but aligned to what end. The future of AI alignment is promising, yet it demands our thoughtful consideration, creativity, and, most importantly, our unwavering commitment to ethical principles.</p><p class="paragraph" style="text-align:left;">For more on:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Why we are using PPO over DPO</p></li><li><p class="paragraph" style="text-align:left;">Evaluating SAWYER’s capabilities</p></li><li><p class="paragraph" style="text-align:left;">Tips and techniques I used to fine-tune SAWYER on a single GPU on Colab</p></li><li><p class="paragraph" style="text-align:left;">How to properly fine-tune a reward mechanism</p></li><li><p class="paragraph" style="text-align:left;">How PPO can help set us up for longer-term success than DPO can</p></li><li><p class="paragraph" style="text-align:left;">Why higher rewards aren’t always a good thing</p></li><li><p class="paragraph" style="text-align:left;">SAWYER’s opinions on poetry</p></li></ol><p class="paragraph" style="text-align:left;">And much more, come to our workshop at ODSC East in April! 
See you there.</p><div class="embed"><a class="embed__url" href="https://odsc.com/speakers/aligning-open-source-llms-using-reinforcement-learning-from-feedback/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=llms-aligned-but-to-what-end" target="_blank"><div class="embed__content"><p class="embed__title"> Aligning Open-source LLMs Using Reinforcement Learning from Feedback </p><p class="embed__link"> odsc.com/speakers/aligning-open-source-llms-using-reinforcement-learning-from-feedback </p></div><img class="embed__image embed__image--right" src="https://odsc.com/wp-content/uploads/2023/08/West_speakers_web_Sinan-Ozdemir-1-Cropped.png"/></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=404427ce-2bf7-4d6e-b73a-f7d98196793e&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Navigating the ML Content Maze: Strategies for a High-Quality Feed</title>
  <description>Essential tips from Nathan Lambert&#39;s latest Interconnects post and our Practically Intelligent podcast.</description>
  <link>https://ai-office-hours.beehiiv.com/p/navigating-ml-content-maze-strategies-highquality-feed</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/navigating-ml-content-maze-strategies-highquality-feed</guid>
  <pubDate>Fri, 01 Mar 2024 17:03:34 +0000</pubDate>
  <atom:published>2024-03-01T17:03:34Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[Collaboration]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;"></p><p class="paragraph" style="text-align:left;"></p><p class="paragraph" style="text-align:left;">Hey everyone!</p><p class="paragraph" style="text-align:start;">In today&#39;s post, I&#39;m thrilled to share insights from my good friend Nathan Lambert&#39;s recent post on <a class="link" href="https://www.interconnects.ai?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=navigating-the-ml-content-maze-strategies-for-a-high-quality-feed" target="_blank" rel="noopener noreferrer nofollow">Interconnects</a>, a reflection sparked by his appearance on my very own <a class="link" href="https://www.practicallyintelligent.com/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=navigating-the-ml-content-maze-strategies-for-a-high-quality-feed" target="_blank" rel="noopener noreferrer nofollow">Practically Intelligent podcast</a>. Lambert and I dove into the art of curating a high-quality ML content feed amidst the deluge of information. Highlighting the need for critical evaluation, model access, and the balance between depth and breadth, this guide is indispensable for those navigating the ML landscape. Dive into the full article for a comprehensive exploration of these strategies. </p><div class="embed"><a class="embed__url" href="https://www.interconnects.ai/p/making-a-ml-feed?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=navigating-the-ml-content-maze-strategies-for-a-high-quality-feed" target="_blank"><div class="embed__content"><p class="embed__title"> How to cultivate a high-signal AI feed </p><p class="embed__description"> Basic tips on how to assess inbound ML content and cultivate your news feed. 
</p><p class="embed__link"> www.interconnects.ai/p/making-a-ml-feed </p></div><img class="embed__image embed__image--right" src="https://substackcdn.com/image/fetch/w_1200,h_600,c_fill,f_jpg,q_auto:good,fl_progressive:steep,g_auto/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4fe6324-d49b-48cf-968c-a383480abbd3_1145x1010.png"/></a></div><p class="paragraph" style="text-align:start;">Lambert and I offer invaluable advice on navigating the vast ML content landscape:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Model Access and Demos:</b> The gold standard for evaluating ML content credibility.</p></li><li><p class="paragraph" style="text-align:left;"><b>Depth vs. Breadth:</b> Focus on areas that provide the most leverage for your goals.</p></li><li><p class="paragraph" style="text-align:left;"><b>Reproducibility and Verifiability:</b> Signs of scientific rigor in ML projects.</p></li><li><p class="paragraph" style="text-align:left;"><b>Critical Evaluation of Sources:</b> Not all ML content is created equal.</p></li><li><p class="paragraph" style="text-align:left;"><b>Scientific Rigor:</b> The importance of foundational principles in assessing ML advancements.</p></li></ul><p class="paragraph" style="text-align:start;">And for more enriching discussions on ML, don&#39;t forget to tune into Practically Intelligent!</p><div class="embed"><a class="embed__url" href="https://www.practicallyintelligent.com/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=navigating-the-ml-content-maze-strategies-for-a-high-quality-feed" target="_blank"><div class="embed__content"><p class="embed__title"> Practically Intelligent </p><p class="embed__description"> A podcast by AI nerds for AI nerds </p><p class="embed__link"> www.practicallyintelligent.com </p></div><img class="embed__image embed__image--right" 
src="http://static1.squarespace.com/static/659c2bb240081f4c047b1b57/t/659c32120ef1727dbbb2da62/1704735250638/Pratically+Intelligent+Logo.jpg?format=1500w"/></a></div><p class="paragraph" style="text-align:start;">I’m also looking to learn what you all want me to write about! If you want to submit an issue on our GitHub repo, I would love to incorporate any and all feedback 🙂 </p><div class="embed"><a class="embed__url" href="https://github.com/sinanuozdemir/ai-office-hours?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=navigating-the-ml-content-maze-strategies-for-a-high-quality-feed" target="_blank"><div class="embed__content"><p class="embed__title"> GitHub - sinanuozdemir/ai-office-hours </p><p class="embed__description"> Contribute to sinanuozdemir/ai-office-hours development by creating an account on GitHub. </p><p class="embed__link"> github.com/sinanuozdemir/ai-office-hours </p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/174c0123c4cab3a1563f45d93fc676f7a670ec585371d040dab62e5e1fa860b5/sinanuozdemir/ai-office-hours"/></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=ecadefc8-626e-4368-8641-1c6cc9135e81&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Fashion Meets AI</title>
  <description>Designing a Wedding Outfit with GPT-4</description>
  <link>https://ai-office-hours.beehiiv.com/p/fashion-meets-ai</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/fashion-meets-ai</guid>
  <pubDate>Mon, 08 Jan 2024 16:38:53 +0000</pubDate>
  <atom:published>2024-01-08T16:38:53Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[Bias + Ethics]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">I wanted to write a post about a singularly interesting conversation I had with GPT-4 Vision recently. The impetus for this post was that I was genuinely excited to use GPT for a kind of task that was new to me. </p><p class="paragraph" style="text-align:left;"><b>The Task? </b>I wanted GPT-4 to help me plan a wedding outfit.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/262802fa-24e2-4034-a09f-f77f1b41ee2a/IMG_2710.jpeg?t=1703721822"/><div class="image__source"><span class="image__source_text"><p>Help, I’m fashionably challenged and want GPT to help me plan a wedding outfit</p></span></div></div><p class="paragraph" style="text-align:left;">The outfit had an interesting theme of “casual beach formal”, so I thought a creative AI should be able to help me out. With some simple prompts, I asked it to design an outfit for that theme, and then I asked it to draw an image of that outfit on a mannequin with brown hair so that I could see it. I’m very visual.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/faa71228-e41b-4dd0-8090-a60cb4715383/image.png?t=1703721541"/><div class="image__source"><span class="image__source_text"><p>A “casual beach formal” outfit as deemed by GPT-4. Oops, I like it, but that doesn’t look like me at all. Let’s try to customize this look for me.</p></span></div></div><p class="paragraph" style="text-align:left;">I noticed the resulting image was quite beautiful, and as much as I make my resolutions every year, I don’t have a body like that. 
So I asked GPT for the mannequin in my dimensions, which I will not reveal here. The resulting image was strikingly similar to the first one.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4d43e067-9ce7-4efc-ad86-6a2481425348/image.png?t=1703721578"/><div class="image__source"><span class="image__source_text"><p>Asking for several pant sizes larger and about 4-6 inches shorter. His brothers left but he stayed. OK, let’s try again.</p></span></div></div><p class="paragraph" style="text-align:left;">I tried harder to make the model my proportions, but the AI kept making him very thin.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fc3eafdb-a4dc-4d44-8fe5-046cc82ee089/image.png?t=1703721616"/><div class="image__source"><span class="image__source_text"><p>No matter what I tried, I kept getting the same Mr Hot-Man. For this image I specifically asked for love handles and, to be fair, I get where it was going with that around the midsection. </p></span></div></div><p class="paragraph" style="text-align:left;">Getting bored and wanting something to happen, I asked it to do something more fantastical, like draw an arm out of the mannequin’s head, and it was happy to change the image for that. To be fair, it didn’t do what I asked, but still: it would rather draw robot arms than make me slightly fatter.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d62b7282-d1c7-451a-99a4-f0255b10d157/image.png?t=1703721493"/><div class="image__source"><span class="image__source_text"><p>I asked for a third arm out of my head. 
That’s not what I got, but hey, it’s more willing to turn me into an android than to make me a bit heavier, so maybe we can work from here.</p></span></div></div><p class="paragraph" style="text-align:left;">He is my son and I shall call him <b>sinan v0.1 alpha</b>. I even asked to make <b>sinan v0.1 alpha</b> stouter and shorter, and it still refused to do that for me.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/890ec6a7-9d31-4388-80ef-a0a54e2c75c5/image.png?t=1703721506"/><div class="image__source"><span class="image__source_text"><p>Hey, there’s that third arm! This is after I asked to make sinan v0.1 alpha stouter and shorter </p></span></div></div><p class="paragraph" style="text-align:left;">It’s much easier to make a multi-arm mannequin than one slightly shorter and fatter, I guess. As one more test, I backed up a bit to the original outfit and asked it to draw it for my Indian friend, who wants to wear the same outfit. Here is what it came up with. </p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3d3ceeaa-50a7-4cad-9c61-6d821c1c2485/Screenshot_2024-01-02_at_1.29.48_PM.png?t=1704231006"/><div class="image__source"><span class="image__source_text"><p>For me (on the left) and for my “Indian friend” on the right. Not much difference, but the skin tone is slightly different for sure.</p></span></div></div><p class="paragraph" style="text-align:left;">Honestly, at first I didn’t notice the difference, but after a few seconds I did see that the skin tone is slightly different on the right, which I guess is fair. 
It’s just hard to get away from that chiseled face, I guess. I get it.</p><h2 class="heading" style="text-align:left;" id="conclusion">Conclusion?</h2><p class="paragraph" style="text-align:left;">This is not a research study, nor did I do a lot of repetitive testing here. But it is a case study of my own experience using an AI to solve what I expect is not that rare a request. As someone who has been working with Generative AI for over a decade, if I can’t get GPT to do what I want even with some minor prompt engineering, I wonder how long the average ChatGPT user will wait before rage-quitting on this exact scenario.</p><p class="paragraph" style="text-align:left;">This is in no way a criticism of OpenAI. Multi-turn multi-modal conversations are arguably among the most challenging tasks being undertaken by commercial AI today. I’m only saying that any company that is building such AI experiences should remember to market not only the AI’s capabilities but also its known limitations, no matter how small. It would at least be nice to know that other people see similar areas of improvement as I do.</p><p class="paragraph" style="text-align:left;">Just a thought.</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=f797e7a6-ce38-4651-8154-063c44028710&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>AIs Supervising AIs</title>
  <description>Inspecting how good LLM-as-a-Judge actually is at evaluating AI outputs</description>
  <link>https://ai-office-hours.beehiiv.com/p/ais-supervising-ais</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/ais-supervising-ais</guid>
  <pubDate>Mon, 18 Dec 2023 17:28:54 +0000</pubDate>
  <atom:published>2023-12-18T17:28:54Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[Bias + Ethics]]></category>
    <category><![CDATA[How To]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">As a part of my upcoming <a class="link" href="https://www.oreilly.com/live-events/aligning-large-language-models/0636920098043/0636920098042/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=ais-supervising-ais" target="_blank" rel="noopener noreferrer nofollow">O’Reilly session on aligning LLMs</a>, I wanted to talk a bit about <b>scale supervision</b> - an AI’s ability to judge another AI on the generated responses. I was originally inspired by a HuggingFace post called <a class="link" href="https://huggingface.co/blog/llm-leaderboard?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=ais-supervising-ais" target="_blank" rel="noopener noreferrer nofollow">Can foundation models label data like humans</a> and I wanted to replicate some of the results and add some results of my own.</p><h1 class="heading" style="text-align:left;" id="the-data">The Data</h1><p class="paragraph" style="text-align:left;">I am using some comparison data that I also used in my book that can be found <a class="link" href="https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/data/comparison_data_v2.json?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=ais-supervising-ais" target="_blank" rel="noopener noreferrer nofollow">here</a>. 
Most AI responses were rated very highly by humans:</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4e42f50d-0cc8-4552-aae5-c90abd65e8ea/Screenshot_2023-12-15_at_2.42.23_PM.png?t=1702680146"/><div class="image__source"><span class="image__source_text"><p>Of nearly 5,000 paired responses with scores, most of the ratings were pretty high</p></span></div></div><p class="paragraph" style="text-align:left;">Because I was approaching the $200 mark in OpenAI costs after running only about 3% of the data through my prompt, I only ended up using 4,877 paired responses.</p><h1 class="heading" style="text-align:left;" id="the-task-prompt">The Task + Prompt</h1><p class="paragraph" style="text-align:left;">The task for the AI is simple: given a query and two AI generated responses, submit a score from 1-9 where 1 means it strongly prefers Assistant 1’s answer, 9 means it strongly prefers Assistant 2’s answer, and I specifically call out 5 to be an appropriate score if both answers are equally fine.</p><p class="paragraph" style="text-align:left;">I’m using GPT-4 with the following prompt format to ask the AI to pick the better response given a query:</p><div class="codeblock"><pre><code>---
SYSTEM PROMPT
---
### Rating Task
Rate the performance of two assistants in response to the user question.

Output a score from 1 to 9 where a 1 means you strongly prefer Assistant 1&#39;s answer and 9 means you strongly prefer Assistant 2&#39;s answer and 5 means either answer works just as well as the other.

Give the answer in the json format: 

JSON: &#123;&quot;reason&quot;: &quot;Assistant X&#39;s answer is preferable because...&quot;, &quot;score&quot;: Y&#125;

---
USER PROMPT
---
### User Question
&#123;query&#125;

### The Start of Assistant 1&#39;s Answer
&#123;answer_1&#125;
### The End of Assistant 1&#39;s Answer

### The Start of Assistant 2&#39;s Answer
&#123;answer_2&#125;
### The End of Assistant 2&#39;s Answer

Now give your answer
JSON:</code></pre></div><p class="paragraph" style="text-align:left;">I’m invoking some chain of thought (by asking for the reasoning first) and have set the temperature to 0.3 to get some consistency going.</p><h1 class="heading" style="text-align:left;" id="the-findings">The Findings</h1><p class="paragraph" style="text-align:left;">With data and prompt ready, I ran the nearly 5K paired responses through my prompt and this is what I found!</p><h2 class="heading" style="text-align:left;" id="the-ai-doesnt-tend-to-match-human-s">The AI doesn’t tend to match human scores</h2><p class="paragraph" style="text-align:left;">I computed a simulated human score by taking <b>diff</b> (answer 2’s human-given score minus answer 1’s human-given score, which in theory could range from -10 to 10) and applying the formula below to map it onto the 1-9 scale:</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e63a2b56-e5d8-4a99-94fe-d8395199777e/Screenshot_2023-12-15_at_2.50.25_PM.png?t=1702680629"/><div class="image__source"><span class="image__source_text"><p>This mapping takes actual human score deltas (ranging from -10 to 10) and maps them to 1-9 to better compare to our AI</p></span></div></div><p class="paragraph" style="text-align:left;">As far as raw accuracy goes, the AI matches the simulated human score only 6% of the time, but that climbs to 25% if you relax accuracy to mean within 1 point (so if the simulated score rounded to 7 and the AI said 8, that counts as “correct”). 
</p><p class="paragraph" style="text-align:left;">More interestingly, if you plot the simulated scores and the AI scores side by side, you see that the AI labels very differently:</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/33faad39-1dc0-438f-9065-3fc5d195d84c/Screenshot_2023-12-15_at_2.44.08_PM.png?t=1702680251"/><div class="image__source"><span class="image__source_text"><p>Left: Simulated human scores form a natural multi-modal distribution with peaks at the 5 mark (where responses are scored similarly), 2.5, and 7.5.<br><br>Right: the AI score distribution is more polarizing and doesn’t have a peak at 5</p></span></div></div><p class="paragraph" style="text-align:left;"></p><p class="paragraph" style="text-align:left;">So far our AI isn’t labeling responses like a human would. This mismatch in labeling behavior is even more striking when you simplify the task.</p><h2 class="heading" style="text-align:left;" id="the-ai-doesnt-tend-to-match-human-s">The AI was more likely to prefer response 1</h2><p class="paragraph" style="text-align:left;">If you only look at paired responses that were scored exactly the same by humans, you would hope that the AI would recognize that they are similar and give a score of 5 more often than not. 
However, this doesn’t appear to be the case; the AI prefers to pick one answer over the other, tending towards the first one.</p><p class="paragraph" style="text-align:left;">The bias of favoring the first response is called <b>positional bias</b>, and it’s pretty clear to see in this graph, where I’m only considering pairs of responses that humans gave <i>the exact same score</i>. Yet the AI is more likely to prefer one response over the other, even though I told it to rate the pair as a 5 when they are roughly equal.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8398cd88-4c3b-4e0f-adb7-e2b81bc27687/Screenshot_2023-12-15_at_2.54.31_PM.png?t=1702680874"/><div class="image__source"><span class="image__source_text"><p>The AI favors the first response even when I’m specifically only giving it responses where humans gave both responses the exact same score</p></span></div></div><p class="paragraph" style="text-align:left;">Note that the bar for score 2 is nearly twice as high as the next highest bar (7). 
</p><p class="paragraph" style="text-align:left;">Even if I bucket the responses into three broad groups, we see a clear bias against picking a score in the middle, even when that’s the appropriate answer:</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bad5ea96-aea5-4849-bb93-2f4c125bb153/Screenshot_2023-12-15_at_2.56.18_PM.png?t=1702680983"/></div><p class="paragraph" style="text-align:left;">This tells me that even for responses that should be roughly similar, I can’t always trust the AI to label them as such.</p><h2 class="heading" style="text-align:left;" id="this-was-expensive">This was expensive 😅</h2><p class="paragraph" style="text-align:left;">I spent about $200 on OpenAI just to get results for this, so I hope it was helpful!</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b497134a-7f25-4184-b25d-91e22385bc20/Screenshot_2023-12-15_at_7.57.42_AM.png?t=1702655867"/><div class="image__source"><span class="image__source_text"><p>Every time I do one of these, I have to re-do my budget for the week</p></span></div></div><h1 class="heading" style="text-align:left;" id="summary-the-code">Summary + The Code</h1><p class="paragraph" style="text-align:left;">Can LLMs label data like humans? It seems that both HuggingFace and I agree: not really. 
Of course, we can improve upon our prompts and fine-tune models to perform even better, but most people I talk to tend to use models like GPT-4 off the shelf with a pretty basic prompt like the one I used here, so it’s worth calling out!</p><p class="paragraph" style="text-align:left;">If you’re in a pinch and really want to use AI to help you label some data, I’d recommend:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Using few-shot learning to give some diverse examples of preferring one answer over the other</p></li><li><p class="paragraph" style="text-align:left;">Expanding on what constitutes a preferred answer in the system prompt</p></li><li><p class="paragraph" style="text-align:left;">Having a human double-check at least a few responses to get a sense of how well the AI is doing</p></li></ol><p class="paragraph" style="text-align:left;">The notebook can be found here! <a class="link" href="https://github.com/sinanuozdemir/oreilly-llm-alignment/blob/main/notebooks/rlaif.ipynb?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=ais-supervising-ais" target="_blank" rel="noopener noreferrer nofollow">https://github.com/sinanuozdemir/oreilly-llm-alignment/blob/main/notebooks/rlaif.ipynb</a></p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=36728485-e072-429b-90fe-91d9bac134be&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>AI Benchmarking - the good, the bad, and the confusing</title>
  <description>Taking a look at how AI benchmarks drive innovation and where they fall short</description>
  <link>https://ai-office-hours.beehiiv.com/p/ai-benchmarking-good-bad-confusing</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/ai-benchmarking-good-bad-confusing</guid>
  <pubDate>Fri, 01 Dec 2023 19:10:41 +0000</pubDate>
  <atom:published>2023-12-01T19:10:41Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[Bias + Ethics]]></category>
    <category><![CDATA[Collaboration]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey Everyone!</p><p class="paragraph" style="text-align:start;">I don’t normally announce new episodes of my podcast, <a class="link" href="https://podcasts.apple.com/us/podcast/practically-intelligent/id1678774315?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=ai-benchmarking-the-good-the-bad-and-the-confusing" target="_blank" rel="noopener noreferrer nofollow">Practically Intelligent</a> on here, but with this episode I felt compelled. This episode features an insightful conversation with <a class="link" href="https://scholar.google.com/citations?user=S7n4oDgAAAAJ&hl=en&utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=ai-benchmarking-the-good-the-bad-and-the-confusing" target="_blank" rel="noopener noreferrer nofollow">Praveen Paritosh</a>, a renowned expert in AI research, where we take a look at the critical role of benchmarking in the evolution of AI.</p><h1 class="heading" style="text-align:left;" id="whats-in-the-episode">🔍 What&#39;s In The Episode?</h1><ul><li><p class="paragraph" style="text-align:left;">A detailed look at how benchmarking drives AI advancement.</p></li><li><p class="paragraph" style="text-align:left;">Insights into the impact of legacy benchmarks like SQuAD.</p></li><li><p class="paragraph" style="text-align:left;">The complex dynamics between conceptual learning and rote memorization in AI development.</p></li><li><p class="paragraph" style="text-align:left;">An exploration of benchmarks as vital tools in AI research, highlighting their strengths and limitations.</p></li></ul><h2 class="heading" style="text-align:left;" id="check-it-out">🎧 Check it out!</h2><ul><li><p class="paragraph" style="text-align:left;">YouTube: <a class="link" 
href="https://youtu.be/VfkTfCSbFOA?si=kr8xh0dbbLbLBHbQ&utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=ai-benchmarking-the-good-the-bad-and-the-confusing" target="_blank" rel="noopener noreferrer nofollow">https://youtu.be/VfkTfCSbFOA?si=kr8xh0dbbLbLBHbQ</a></p></li><li><p class="paragraph" style="text-align:left;">Apple Podcasts: <a class="link" href="https://podcasts.apple.com/us/podcast/e7-the-power-of-benchmarking-in-ai-progress/id1678774315?i=1000637081772&utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=ai-benchmarking-the-good-the-bad-and-the-confusing" target="_blank" rel="noopener noreferrer nofollow">https://podcasts.apple.com/us/podcast/e7-the-power-of-benchmarking-in-ai-progress/id1678774315?i=1000637081772</a></p></li></ul><p class="paragraph" style="text-align:start;">I hope you all enjoy it and have a great weekend 🙂</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=768fb503-81ff-4a94-a3e9-5dc361a20cc9&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Fine-Tuning LLaMA 2</title>
  <description>A hands-on example of fine-tuning a Llama model</description>
  <link>https://ai-office-hours.beehiiv.com/p/finetuning-llama-2</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/finetuning-llama-2</guid>
  <pubDate>Mon, 27 Nov 2023 18:47:57 +0000</pubDate>
  <atom:published>2023-11-27T18:47:57Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[How To]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h1 class="heading" style="text-align:left;" id="finetuning-llama-2-a-handson-exampl">Fine-tuning Llama 2 - a hands-on example</h1><p class="paragraph" style="text-align:left;"><span style="color:rgb(15, 15, 15);font-family:Söhne, ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif, Helvetica Neue, Arial, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol, Noto Color Emoji;font-size:16px;">Hello everyone! Today, I am diving deep with a new (relatively) simple notebook to help people fine-tune </span><span style="color:rgb(15, 15, 15);font-family:Söhne, ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif, Helvetica Neue, Arial, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol, Noto Color Emoji;font-size:16px;"><b>Llama-2</b></span><span style="color:rgb(15, 15, 15);font-family:Söhne, ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif, Helvetica Neue, Arial, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol, Noto Color Emoji;font-size:16px;">, Meta’s latest open source large language model on Hugging Face. 
Here are some fascinating insights from this journey (the notebook link is at the bottom of the post </span>🙂<span style="color:rgb(15, 15, 15);font-family:Söhne, ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif, Helvetica Neue, Arial, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol, Noto Color Emoji;font-size:16px;">).</span></p><p class="paragraph" style="text-align:left;"><b>Dataset and Model:</b><span style="color:rgb(15, 15, 15);font-family:Söhne, ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif, Helvetica Neue, Arial, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol, Noto Color Emoji;font-size:16px;"> Our dataset for this experiment was the </span><code>guanaco-llama2-1k</code><span style="color:rgb(15, 15, 15);font-family:Söhne, ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif, Helvetica Neue, Arial, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol, Noto Color Emoji;font-size:16px;"> from HuggingFace, comprising instructional texts. The model of choice was </span><code>NousResearch/Llama-2-7b-hf</code><span style="color:rgb(15, 15, 15);font-family:Söhne, ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif, Helvetica Neue, Arial, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol, Noto Color Emoji;font-size:16px;">, the 7 billion parameter model. 
Unlike my previous venture with App Reviews, this dataset explores a different facet of language understanding, focusing on instructional text comprehension and generation.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(15, 15, 15);font-family:Söhne, ui-sans-serif, system-ui, -apple-system, Segoe UI, Roboto, Ubuntu, Cantarell, Noto Sans, sans-serif, Helvetica Neue, Arial, Apple Color Emoji, Segoe UI Emoji, Segoe UI Symbol, Noto Color Emoji;font-size:16px;">Feel free to change up the dataset/model but of course that might require some code changes. It will be easier to change out the dataset/conversation format because we’re using the blank slate non-aligned version of Llama 7b that has no expectations for conversation format.</span></p><p class="paragraph" style="text-align:left;"><b>Key Takeaways:</b></p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Efficient Fine-Tuning with LoRA:</b> The LoRA (Low-Rank Adaptation) technique allowed for efficient fine-tuning of the LLaMA 2 model without the need for extensive computational resources. This approach is a nod to the evolving practices in AI, where efficiency is becoming as crucial as effectiveness.</p><p class="paragraph" style="text-align:left;"></p></li><li><p class="paragraph" style="text-align:left;"><b>Quantization and Performance:</b> Implementing BitsAndBytes for model quantization significantly reduced the memory footprint. This optimization meant we could run a larger model on the same hardware, a critical factor in practical AI applications.</p><p class="paragraph" style="text-align:left;"></p></li><li><p class="paragraph" style="text-align:left;"><b>Improved Responses Post Fine-Tuning:</b> The difference in the model&#39;s responses pre and post fine-tuning was stark. This improvement underscores the impact of fine-tuning on model performance, especially for specific use cases. 
Check out this before and after of asking the model &quot;Who is Leonardo Da Vinci?&quot;</p></li></ol><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bab59bc1-a7df-49db-bc93-574cbf64a81b/Screenshot_2023-11-27_at_7.19.29_AM.png?t=1701098376"/><div class="image__source"><span class="image__source_text"><p>Before fine-tuning: the model has no idea how to answer questions</p></span></div></div><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/59a65297-8b6f-4bd8-bbb4-8fcc0e1e57ff/Screenshot_2023-11-27_at_7.19.50_AM.png?t=1701098394"/><div class="image__source"><span class="image__source_text"><p>After fine-tuning: the model now knows how to answer questions, even when asked about items not found in the instructional dataset</p></span></div></div><ol start="4"><li><p class="paragraph" style="text-align:left;"><b>Cost vs. Performance:</b> Similar to my previous analysis with fine-tuning OpenAI models, the cost-effectiveness of fine-tuning LLaMA 2 was noteworthy. While fine-tuning LLaMA 2 was not as resource-intensive as fine-tuning GPT-3.5, the performance gains were substantial, offering a compelling middle ground between efficiency and effectiveness.</p><p class="paragraph" style="text-align:left;"></p></li><li><p class="paragraph" style="text-align:left;"><b>Masking Loss for Targeted Learning:</b> We employed a specialized data collator, <code>DataCollatorForCompletionOnlyLM</code> (from the TRL library), to mask the loss calculation so that it focuses only on the model&#39;s responses <b>given a conversation</b>, effectively ignoring the parts of the input the model is not supposed to generate. 
By doing so, we ensured that the model&#39;s learning was concentrated on generating accurate and relevant responses, improving its efficiency and effectiveness in understanding and replying to instructional texts.</p></li></ol><p class="paragraph" style="text-align:start;"><b>Conclusion:</b> The world of AI and language models continues to be a thrilling landscape of endless possibilities and learning. Fine-tuning LLaMA 2 has been an enriching experience, revealing the importance of model efficiency, the power of specific optimizations, and the constant need to balance cost with performance. As always, for those keen to dive deeper, the updated notebook on GitHub awaits. Until our next AI adventure, happy coding!</p><p class="paragraph" style="text-align:start;"><b>Next Steps: RLHF/RLAIF and Custom Pre-Training</b></p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Implementing RLHF/RLAIF:</b> We could use Reinforcement Learning from Human Feedback (RLHF) to enhance LLaMA 2&#39;s performance by fine-tuning it based on interactive feedback, aiming for responses that better align with human expectations.</p></li><li><p class="paragraph" style="text-align:left;"><b>Custom Pre-Training Corpus:</b> We can also fine-tune LLaMA 2 on a custom corpus tailored to specific domains, significantly enhancing its expertise and accuracy in niche areas without losing its versatile applicability.</p></li></ol><p class="paragraph" style="text-align:left;"><b>Notebook link please!</b></p><p class="paragraph" style="text-align:left;">Here is the notebook and have fun! 
</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://colab.research.google.com/drive/11KBP9-fJzsNtNFeLWJdaleNmxGbBn4l6?usp=sharing&utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=fine-tuning-llama-2" target="_blank" rel="noopener noreferrer nofollow">https://colab.research.google.com/drive/11KBP9-fJzsNtNFeLWJdaleNmxGbBn4l6?usp=sharing</a></p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=b84c1c84-cc37-46ce-ab5d-562c0d28cfee&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>New Notebook to Fine-tune with OpenAI</title>
  <description>Get the most out of Gen AI with fine-tuning</description>
  <link>https://ai-office-hours.beehiiv.com/p/new-notebook-finetune-openai</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/new-notebook-finetune-openai</guid>
  <pubDate>Mon, 06 Nov 2023 20:00:00 +0000</pubDate>
  <atom:published>2023-11-06T20:00:00Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[How To]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">It was brought to my attention that Chapter 4 of my latest book uses a dataset that Amazon has since revoked from HuggingFace (always keeping me on my toes). Because of this, I re-wrote the <a class="link" href="https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/UPDATED%204_fine_tuned_classification_sentiment.ipynb?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=new-notebook-to-fine-tune-with-openai" target="_blank" rel="noopener noreferrer nofollow">notebook on GitHub</a> to update the example with a working dataset and, at the same time, updated the code to use OpenAI’s latest fine-tuning API. I figured I would share some of the takeaways of the case study here. </p><p class="paragraph" style="text-align:left;">Our data is <a class="link" href="https://huggingface.co/datasets/app_reviews?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=new-notebook-to-fine-tune-with-openai" target="_blank" rel="noopener noreferrer nofollow">App Reviews</a> from HuggingFace (original GitHub repo <a class="link" href="https://github.com/sealuzh/user_quality?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=new-notebook-to-fine-tune-with-openai" target="_blank" rel="noopener noreferrer nofollow">here</a>). The dataset contains 288,065 reviews extracted from Google Play. I split the data into training, validation, and testing sets. 
I used training and validation to fine-tune on OpenAI and held out testing to compare the final 4 models.</p><p class="paragraph" style="text-align:left;">Our model options are:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Babbage trained for 1 epoch (3B model)</p></li><li><p class="paragraph" style="text-align:left;">Babbage trained for 4 epochs (3B model)</p></li><li><p class="paragraph" style="text-align:left;">GPT 3.5 trained for 1 epoch and <b>no</b> system prompt (175B model)</p></li><li><p class="paragraph" style="text-align:left;">GPT 3.5 trained for 1 epoch <b>with </b>a system prompt (175B model)</p></li></ol><h1 class="heading" style="text-align:left;" id="1-cost-project-early-and-cost-proje">1. Cost project early and cost project often</h1><p class="paragraph" style="text-align:left;">If this model is going to be used a lot, make sure to keep an eye on <a class="link" href="https://openai.com/pricing?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=new-notebook-to-fine-tune-with-openai" target="_blank" rel="noopener noreferrer nofollow">OpenAI’s pricing page</a> to estimate how much money you’re about to spend on fine-tuning. Here is a breakdown of how much it cost me to train and run evaluation on all four models on the training dataset. Obviously 3.5 was much more expensive, but were the performance gains worth it?!</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d338b8d8-8b81-4049-9f03-20f642615d29/image.png"/><div class="image__source"><span class="image__source_text"><p>Our performance increase from GPT 3.5 comes at a cost - literally. Fine-tuning GPT 3.5 was up to 75x more expensive than fine-tuning Babbage and inference with GPT 3.5 was up to 26x more expensive than Babbage! Worth it? 
Ehhh..</p></span></div></div><p class="paragraph" style="text-align:left;">You could consider writing a <b>batch prompt</b>, which is exactly what it sounds like: a prompt that makes predictions for multiple reviews at a time.</p><h1 class="heading" style="text-align:left;" id="2-consider-simplifying-the-task">2. Consider simplifying the task</h1><p class="paragraph" style="text-align:left;">Btw, the answer to the question “was the extra money spent fine-tuning GPT 3.5 worth it?” is <b>NO</b>. In testing accuracy, GPT 3.5 was only about 3% better than Babbage. For a model about 60x bigger, GPT 3.5 is sometimes not worth the money.</p><p class="paragraph" style="text-align:left;">This is true even if you consider simplifying the task and defining new metrics. For example, we have <b>raw accuracy </b>(simply # correct / # items) but we could also treat the task as a binary “good” vs “bad” classification by changing the classes to “Good” (4 or 5 stars) or “Bad” (1, 2, or 3 stars). Of course you can do whatever you want. You can also do “one-off accuracy” so if the model predicts “3” and the answer was “2” or “4” it would be counted as right. All up to you on what matters 🙂. On these 3 metrics, GPT 3.5 still only does up to 3% better than Babbage.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6aedc0cc-c03f-4483-a764-c936e01ffd43/Screenshot_2023-11-05_at_8.40.34_AM.png"/><div class="image__source"><span class="image__source_text"><p>Fine-tuning GPT 3.5 (ChatGPT) is performing a bit better than the much smaller Babbage models, even among the simplified tasks, but is it worth the extra $$$?</p></span></div></div><h1 class="heading" style="text-align:left;" id="3-generative-models-generating-nons">3. Generative models generating nonsense</h1><p class="paragraph" style="text-align:left;">If you let a model blabber on, it will eventually say something unhelpful. 
With our fine-tuned GPT 3.5 model with no system prompt and a temperature of 0.1 (to make the outputs more deterministic), I saw some instances of the model predicting something other than 0, 1, 2, 3, or 4. It seems the system prompt helps guard against this, and Babbage doesn’t need to be told as much 🙂. It’s annoying but hey, generative models gonna generate.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fa3fa389-85d4-4793-9c5e-8411af216375/Screenshot_2023-11-05_at_2.16.45_PM.png"/><div class="image__source"><span class="image__source_text"><p>Only our fine-tuned 3.5 model with no system prompt sometimes generates predictions outside the 0-4 range on our testing set (even with our temperature turned down low). Both Babbage models and GPT 3.5 with a system prompt never did this.</p></span></div></div><h1 class="heading" style="text-align:left;" id="until-next-time">Until next time!</h1><p class="paragraph" style="text-align:left;">I have more takeaways than that but I’ll leave it there for now. If you want to see more, check out the notebook. Happy coding!</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=d592f471-295a-4a56-8ccd-37cc4ed2af23&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Harnessing LLM Alignment</title>
  <description>Making AI More Accessible</description>
  <link>https://ai-office-hours.beehiiv.com/p/harnessing-llm-alignment</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/harnessing-llm-alignment</guid>
  <pubDate>Fri, 27 Oct 2023 21:17:52 +0000</pubDate>
  <atom:published>2023-10-27T21:17:52Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
    <category><![CDATA[Llm Alignment]]></category>
    <category><![CDATA[How To]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey everyone! I’m giving an alignment workshop next week at <a class="link" href="https://odsc.com/speakers/aligning-open-source-llms-using-reinforcement-learning-from-feedback/?__hstc=39712252.be8ae3d72d8f620375d1fe2554de2f5c.1694882025961.1697212742345.1698424229362.10&__hssc=39712252.2.1698424229362&__hsfp=2523776101&utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=harnessing-llm-alignment" target="_blank" rel="noopener noreferrer nofollow">ODSC</a> and they had me write a blog post to intro the work we were going to be doing. I wanted to share this intro with you all as well!</p><p class="paragraph" style="text-align:left;">Back in 2020, the world was introduced to OpenAI’s GPT-3, a marvel in the AI domain to many. However, it wasn’t until two years later, in 2022, when OpenAI unveiled its instruction-aligned version of GPT-3, aptly named “InstructGPT,” that its full potential came into the spotlight, and the world started really paying attention. That innovation wasn’t just a technological leap for AI alignment; it was a demonstration of the power of reinforcement learning to make AI more accessible to everyone.</p><h2 class="heading" style="text-align:start;" id="aligning-our-expectations"><b>Aligning Our Expectations</b></h2><p class="paragraph" style="text-align:left;"><b>Alignment</b>, broadly defined, is the process of making an AI system that behaves in accordance with what a human wants. Alignment isn’t just about training AI to follow instructions; it’s about designing a system to sculpt an already powerful AI model into something more usable and beneficial to both technically inclined users and to someone who just needs help planning a birthday party. 
It’s this very aspect of alignment that has democratized the magic of Large Language Models (LLMs), enabling a broader audience to extract value from them.</p><p class="paragraph" style="text-align:start;">If alignment is the heart of LLMs’ usability, what keeps this heart pumping? That’s where the intricate dance of <b>Reinforcement Learning (RL)</b> comes into play. While the term ‘alignment’ might be synonymous with reinforcement learning for some, there’s a lot more under the hood. Capturing the multifaceted dimensions of human emotions, ethics, or humor within the confines of next-token prediction is a colossal – and potentially impossible – task. How do you effectively program ‘neutrality’ or ‘ethical behavior’ into a loss function? Arguably, you can’t. It’s here that RL rises as a dynamic way to model these intricate nuances without strictly encoding them.</p><p class="paragraph" style="text-align:start;"><b>RLHF</b>, which stands for Reinforcement Learning from Human Feedback, is the technique OpenAI originally used to align their InstructGPT model. It is frequently discussed among AI enthusiasts as the main way to align LLMs, but it’s merely one tool among many for alignment. The core principle of RLHF revolves around obtaining high-quality human feedback and using it to tell LLMs how well they performed a task, in the hopes of having the AI speak in a more user-friendly manner by the end of the loop. </p><p class="paragraph" style="text-align:left;">In our own day-to-day work with LLMs, however, we often don’t need the AI to answer <i>everything</i>; we need it to solve the tasks relevant to us / our businesses / our projects. 
In our journey with RL, we’ll explore alternative approaches to RLHF where we can utilize other feedback mechanisms that do not rely on human preferences.<br></p><h1 class="heading" style="text-align:left;" id="case-study-aligning-flant-5-to-make"><b>Case Study – Aligning FLAN-T5 to make more neutral summaries</b></h1><p class="paragraph" style="text-align:left;">Let’s look at an example of using two classifiers from Hugging Face to enhance the FLAN-T5 model’s ability to write summaries of news articles that are both grammatically polished and consistently neutral in style.</p><p class="paragraph" style="text-align:start;">The code below defines one such reward signal, using a pre-fine-tuned sentiment classifier to obtain the logits for the neutral class, rewarding FLAN-T5 for speaking in a neutral tone and punishing it otherwise:</p><div class="codeblock"><pre><code>from transformers import pipeline

# pre-fine-tuned sentiment classifier used as a reward model
sentiment_pipeline = pipeline(
  &#39;text-classification&#39;,
  &#39;cardiffnlp/twitter-roberta-base-sentiment&#39;
)

def get_neutral_scores(texts):
  scores = []
  # function_to_apply=&#39;none&#39; returns logits which can be negative
  results = sentiment_pipeline(texts, function_to_apply=&#39;none&#39;, top_k=None)
  for result in results:
    for label in result:
      if label[&#39;label&#39;] == &#39;LABEL_1&#39;: # logit for neutral class
        scores.append(label[&#39;score&#39;])
  # return only after every text has been scored
  return scores

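# A minimal sketch, not taken from the original notebook: a per-text score
# like the one above can be blended with a second score (for example a
# grammar score) through a weighted sum to get one reward per text.
# The name combine_rewards and both default weights are hypothetical
# choices made only for illustration.
def combine_rewards(neutral_scores, grammar_scores, neutral_weight=0.5, grammar_weight=1.0):
  # pair up the two scores of each text and add them with their weights
  return [
    neutral_weight * n + grammar_weight * g
    for n, g in zip(neutral_scores, grammar_scores)
  ]
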
&gt;&gt; get_neutral_scores([&#39;hello&#39;, &#39;I love you!&#39;, &#39;I hate you&#39;]) 

&gt;&gt; [0.85, -0.75, -0.57]</code></pre></div><p class="paragraph" style="text-align:left;">We can use this classifier along with another one to classify a piece of text’s grammatical correctness to <i>align</i> our FLAN-T5 model to generate summaries how we want them to be generated.</p><p class="paragraph" style="text-align:left;">The Reinforcement Learning from Feedback loop looks something like this:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Give FLAN-T5 a batch of news articles to summarize (taken from </b><b><a class="link" href="https://huggingface.co/datasets/argilla/news-summary?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=harnessing-llm-alignment" target="_blank" rel="noopener noreferrer nofollow">https://huggingface.co/datasets/argilla/news-summary</a></b><b> only using the raw articles)</b></p></li><li><p class="paragraph" style="text-align:left;"><b>Assign a weighted sum of rewards from:</b></p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>A CoLA model (judging grammatical correctness) from </b><b><a class="link" href="http://huggingface.co/textattack/roberta-base-CoLA?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=harnessing-llm-alignment" target="_blank" rel="noopener noreferrer nofollow">textattack/roberta-base-CoLA</a></b></p></li><li><p class="paragraph" style="text-align:left;"><b>A sentiment model (judging neutrality) from </b><b><a class="link" href="https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=harnessing-llm-alignment" target="_blank" rel="noopener noreferrer nofollow">cardiffnlp/twitter-roberta-base-sentiment </a></b></p></li></ol></li><li><p class="paragraph" style="text-align:left;">Use the rewards to update the FLAN-T5 model using the <a class="link" 
href="https://huggingface.co/docs/trl/index?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=harnessing-llm-alignment" target="_blank" rel="noopener noreferrer nofollow">TRL</a> package, taking into consideration how far the updated model has deviated from the original parameters</p></li></ol><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bbe3e575-8167-459c-a1d2-99a99f6f0804/Screenshot-2023-09-27-at-10.35.59-AM-1024x608.png"/></div><p class="paragraph" style="text-align:left;">Here is a sample of the training loop we will build at the workshop I’m giving next week:</p><div class="codeblock"><pre><code>for epoch in tqdm(range(2)):

  for batch in tqdm(ppo_trainer.dataloader):
    game_data = dict()  # per-batch record of queries, responses and scores
    #### prepend the summarize token
    game_data[&quot;query&quot;] = [&#39;summarize: &#39; + b for b in batch[&quot;text&quot;]]

    #### get response from reference + current flan-t5
    input_tensors = [_.squeeze() for _ in batch[&quot;input_ids&quot;]]
    # ....
    response_tensors = []
    for query in input_tensors:
      response = ppo_trainer.generate(query.squeeze(), **generation_kwargs)
      response_tensors.append(response.squeeze())

    #### Reward system
    game_data[&quot;response&quot;] = [flan_t5_tokenizer.decode(...)]  # decode arguments elided
    # .... (building game_data[&quot;clean_response&quot;] is omitted)
    game_data[&#39;cola_scores&#39;] = get_cola_scores(
      game_data[&quot;clean_response&quot;])
    game_data[&#39;neutral_scores&#39;] = get_neutral_scores(
      game_data[&quot;clean_response&quot;])
    # .... (computing rewards as a weighted sum of the two scores is omitted)

    #### Run PPO training and log stats
    stats = ppo_trainer.step(input_tensors, response_tensors, rewards)
    stats[&#39;env/reward&#39;] = np.mean([r.cpu().numpy() for r in rewards])
    ppo_trainer.log_stats(stats, game_data, rewards)</code></pre></div><p class="paragraph" style="text-align:left;">I omitted several lines of this loop to save space but you can of course come to my workshop to see the loop in its entirety!</p><h2 class="heading" style="text-align:start;" id="the-results"><b>The Results</b></h2><p class="paragraph" style="text-align:start;">After a few epochs of training, our FLAN-T5 starts to show signs of enhanced alignment towards our goal of more grammatically correct and neutral summaries. Here’s a sample of what the different summaries look like using the validation data from the dataset:</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/53fe7b28-8a5e-4730-99ed-be89b1ad2476/07fig14-1024x917.png"/><div class="image__source"><span class="image__source_text"><p>A sample of FLAN-T5 before and after RL. We can see the RL fine-tuned version of the model is using words like “announced” over terms like “scrapped”.</p></span></div></div><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f51adc13-a1de-4047-97a1-f5fb9bae3277/comparison_chart-1024x717.png"/><div class="image__source"><span class="image__source_text"><p>Running both our models (the unaligned base FLAN-T5 and our aligned version) over the entire validation set shows an increase (albeit a subtle one) in both rewards from our CoLA model and our sentiment model!</p></span></div></div><p class="paragraph" style="text-align:center;"></p><p class="paragraph" style="text-align:center;">The model is garnering increased rewards from our system, and upon inspection, there’s a nuanced shift in its summary generation. 
However, its core summarization abilities remain largely consistent with the base model.</p><h2 class="heading" style="text-align:start;" id="conclusion"><b>Conclusion</b></h2><p class="paragraph" style="text-align:start;">Alignment isn’t just about the tools or methodologies of collecting data and making LLMs answer any and all questions. It’s also about understanding what we actually want from our LLMs. The goal of alignment, however, remains unwavering: fashion LLMs whose outputs resonate with human sensibilities, making AI not just a tool for the engineer but a companion for all. Whether you’re an AI enthusiast or someone looking to dip your toes into this world, there’s something here for everyone. <b><a class="link" href="https://odsc.com/speakers/aligning-open-source-llms-using-reinforcement-learning-from-feedback/?__hstc=39712252.be8ae3d72d8f620375d1fe2554de2f5c.1694882025961.1697212742345.1698424229362.10&__hssc=39712252.2.1698424229362&__hsfp=2523776101&utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=harnessing-llm-alignment" target="_blank" rel="noopener noreferrer nofollow">Join me at ODSC this year</a></b> as we traverse the landscape of LLM alignment together!</p><p class="paragraph" style="text-align:start;">I will have a github repo for ODSC soon but until then, you can see the source notebook from my book here: <a class="link" href="https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/7_rl_flan_t5_summaries.ipynb?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=harnessing-llm-alignment" target="_blank" rel="noopener noreferrer nofollow">https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/7_rl_flan_t5_summaries.ipynb</a></p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" 
href="https://www.beehiiv.com/?utm_campaign=95b0709f-a330-46ba-a455-7737b9fc8309&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>My new book on LLMs!</title>
  <description></description>
  <link>https://ai-office-hours.beehiiv.com/p/new-book-llms</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/new-book-llms</guid>
  <pubDate>Fri, 22 Sep 2023 18:40:03 +0000</pubDate>
  <atom:published>2023-09-22T18:40:03Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Hey everybody! I just wanted to let you all know that I have a new book out on getting started with LLMs!</p><p class="paragraph" style="text-align:left;">Here it is! <a class="link" href="https://a.co/d/5SDvdju?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=my-new-book-on-llms" target="_blank" rel="noopener noreferrer nofollow">https://a.co/d/5SDvdju</a></p><p class="paragraph" style="text-align:left;">Short post today, but I just wanted to say how happy I am to have this new book out, and to thank everyone who has already preordered a copy! I hope you all love it as much as I loved writing it.</p><p class="paragraph" style="text-align:left;">That’s it for today. Happy Coding!</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=bb148bc9-0f20-4e50-b29f-9a03c6aa51e6&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Our first Streamlit app</title>
  <description>Rapidly Prototyping with AI</description>
  <link>https://ai-office-hours.beehiiv.com/p/first-streamlit-app</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/first-streamlit-app</guid>
  <pubDate>Thu, 17 Aug 2023 13:00:00 +0000</pubDate>
  <atom:published>2023-08-17T13:00:00Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">I’ve been teaching a <a class="link" href="https://learning.oreilly.com/live-events/large-language-models-and-chatgpt-in-3-weeks/0636920090988/0636920090987/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=our-first-streamlit-app" target="_blank" rel="noopener noreferrer nofollow">class through Pearson</a> on LLMs and ChatGPT with an emphasis on empowering non-coders to learn how to prompt, build test harnesses, and rapidly prototype with LLMs. On our last day I introduced <a class="link" href="https://streamlit.io/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=our-first-streamlit-app" target="_blank" rel="noopener noreferrer nofollow">Streamlit</a>, a super simple way to build quick and dirty prototypes. The goal was to give my students a way to share their prototypes with other people while writing minimal code. I figured, why not also show the same example here!</p><p class="paragraph" style="text-align:left;">Try it out here: <a class="link" href="https://ai-office-hours-wine.streamlit.app/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=our-first-streamlit-app" target="_blank" rel="noopener noreferrer nofollow">https://ai-office-hours-wine.streamlit.app</a></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/c6ffdc1a-5bbf-42a4-babf-988139956bac/Screenshot_2023-07-27_at_7.25.23_AM.png"/><div class="image__source"><span class="image__source_text"><p>Our wine recommending app prototype, complete with an explicit feedback mechanism</p></span></div></div><p class="paragraph" style="text-align:left;">Streamlit is super simple and, honestly, with fewer than 100 lines of code we can be done with our prototype. 
Let’s start strong and see the final app, which is also on GitHub (linked at the end of this post):</p><h1 class="heading" style="text-align:left;">A wine recommending app</h1><div class="codeblock"><pre><code># Import necessary libraries
import random

import openai
import streamlit as st
from datasets import load_dataset
from supabase import create_client

# Set API Key
openai.api_key = st.secrets[&quot;OPENAI_API_KEY&quot;]


# Initialize DB connection once
@st.cache_resource
def init_connection():
    return create_client(st.secrets[&quot;SUPABASE_URL&quot;], st.secrets[&quot;SUPABASE_KEY&quot;])


supabase = init_connection()

# System prompt for OpenAI API
system_prompt = &#39;&#39;&#39;You are a wine bot that helps clients understand what kind of wine they want. Given a list of wines and a description of the client, tell me what wines they want by giving me the names of the wines. Include a reason preceding each pick to explain to the user why they might like it. Give me the information as a numbered list of wines with reasons why they might like it.&#39;&#39;&#39;


# Cache wine dataset once
@st.cache_resource
def load_wines():
    wine_dataset = load_dataset(&quot;alfredodeza/wine-ratings&quot;)
    return list(wine_dataset[&#39;train&#39;])  # only use train set for now


# Convert wine to string
def convert_wine_to_string(wine):
    return f&#39;&#123;wine[&quot;name&quot;]&#125; is from &#123;wine[&quot;region&quot;]&#125; and is a &#123;wine[&quot;variety&quot;]&#125;. &#123;wine[&quot;notes&quot;]&#125;&#39;


# Update reaction in DB
def react_to_row(row, reaction):
    supabase.table(&quot;response&quot;).update(
        &#123;&quot;reaction&quot;: reaction or None&#125;, returning=&quot;minimal&quot;
    ).eq(&quot;id&quot;, row[&#39;id&#39;]).execute()


# User input elements
user_description = st.text_input(&quot;Describe the client&quot;,
                                 &quot;The client likes red wine and is looking for a wine to drink with dinner.&quot;)
n = st.number_input(&quot;How many wines to pull from the cellar?&quot;, min_value=1, max_value=10, value=3, step=1)


# Function to get recommendations
def get_recommendations(n=3, user_description=&#39;&#39;):
    wines = random.sample(load_wines(), n)
    wines_formatted = &quot;\n---\n&quot;.join([convert_wine_to_string(w) for w in wines])
    user_prompt = f&#39;User Description: &#123;user_description&#125;\nWines to select from:\n&#123;wines_formatted&#125;&#39;

    # Create chat completion with OpenAI (legacy openai.ChatCompletion interface, removed in openai 1.0)
    chat_completion = openai.ChatCompletion.create(
        model=&#39;gpt-3.5-turbo&#39;,
        messages=[&#123;&#39;role&#39;: &#39;system&#39;, &#39;content&#39;: system_prompt&#125;, &#123;&#39;role&#39;: &#39;user&#39;, &#39;content&#39;: user_prompt&#125;]
    )

    # Show the wine recommendations and store in Supabase
    st.write(&#39;Wines pulled from cellar to choose from&#39;)
    st.table(wines)

    row = supabase.table(&quot;response&quot;).insert(
        [&#123;&quot;system_prompt&quot;: system_prompt, &quot;user_prompt&quot;: user_prompt,
          &quot;response&quot;: chat_completion.choices[0].message.content, &quot;prototype&quot;: &quot;wine&quot;&#125;]
    ).execute().data[0]
    st.write(chat_completion.choices[0].message.content)
    st.session_state[&#39;row&#39;] = row


# Button to get recommendations
st.button(
    &quot;Get recommendations&quot;, on_click=get_recommendations,
    kwargs=&#123;&#39;n&#39;: n, &#39;user_description&#39;: user_description&#125;
)

# User reaction
reaction = st.selectbox(&quot;How do you feel about the response?&quot;, (&quot;&quot;, &quot;👍&quot;, &quot;👎&quot;))
if &#39;row&#39; in st.session_state:
    st.button(
        &quot;Submit reaction&quot;, on_click=react_to_row,
        kwargs=&#123;&#39;row&#39;: st.session_state[&#39;row&#39;], &#39;reaction&#39;: reaction&#125;
    )</code></pre></div><p class="paragraph" style="text-align:left;"><br>Here is how a user interacts with our app:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">The user describes the client&#39;s wine preferences and selects how many wines to pull from the cellar as candidates through the application&#39;s interface.</p></li><li><p class="paragraph" style="text-align:left;">The user clicks on the &quot;Get recommendations&quot; button, triggering the application to randomly select that many wines from its dataset and ask the AI model to recommend from them.</p></li><li><p class="paragraph" style="text-align:left;">The application displays a table of the wines that were pulled, along with the AI model&#39;s recommendations and its reasons for each pick.</p></li><li><p class="paragraph" style="text-align:left;">The user has the option to react to the AI&#39;s recommendations via a select box, expressing their approval or disapproval.</p></li><li><p class="paragraph" style="text-align:left;">If a reaction is provided, the user clicks on &quot;Submit reaction&quot;, and the application saves the user&#39;s feedback to <a class="link" href="https://supabase.com/?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=our-first-streamlit-app" target="_blank" rel="noopener noreferrer nofollow">Supabase</a>, which can be used for future improvements to the system.</p></li></ol><p class="paragraph" style="text-align:left;">The goal here is to help people get their prototypes out there with minimal code. Everyone deserves to share their work!</p><p class="paragraph" style="text-align:left;">As always, the code is also on GitHub! 
<a class="link" href="https://github.com/sinanuozdemir/ai-office-hours/tree/main/streamlit/wine_prototype?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=our-first-streamlit-app" target="_blank" rel="noopener noreferrer nofollow">https://github.com/sinanuozdemir/ai-office-hours/tree/main/streamlit/wine_prototype</a></p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=65859ba1-3a1f-4041-88f5-ec63ba31cce8&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>AI Office Hours are Open!</title>
  <description>Introducing your Instructor, Sinan Ozdemir</description>
  <link>https://ai-office-hours.beehiiv.com/p/ai-office-hours-open</link>
  <guid isPermaLink="true">https://ai-office-hours.beehiiv.com/p/ai-office-hours-open</guid>
  <pubDate>Thu, 22 Jun 2023 15:27:49 +0000</pubDate>
  <atom:published>2023-06-22T15:27:49Z</atom:published>
    <dc:creator>Sinan Ozdemir</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h1 class="heading" style="text-align:left;">Welcome to AI Office Hours!</h1><p class="paragraph" style="text-align:left;">Welcome to the very first AI Office Hours Newsletter! I&#39;m Sinan Ozdemir, your guide through the ever-evolving world of AI. As a former lecturer at Johns Hopkins University and an experienced entrepreneur in the AI field, I&#39;ve spent years breaking down complex concepts, building real-world solutions, and sharing my knowledge through various publications. Now, I&#39;m thrilled to welcome you to this journey where we demystify and actually use AI, particularly the realm of Large Language Models (LLMs).</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/abbfbb18-e241-4c0c-92f0-f6f516b8066e/square_headshot_small.jpg"/><div class="image__source"><span class="image__source_text"><p>Hi, I’m Sinan! Your friendly neighborhood AI/ML/LLM Expert.</p></span></div></div><p class="paragraph" style="text-align:start;">Am I an experienced blogger or newsletter writer? Nope. Do I care a lot about sharing actionable insights and code for my fellow software engineers on the topic of AI? Absolutely!</p><h1 class="heading" style="text-align:left;">Example 1: Generating text with Open-source FLAN-T5</h1><p class="paragraph" style="text-align:start;">Our first example today is a simple one, but something that I get asked about a fair amount: <b>How do I simply generate text from an open source model from Hugging Face?</b></p><p class="paragraph" style="text-align:start;">Let&#39;s take the example of Google&#39;s FLAN-T5 model, one of Google’s latest open-sourced LLMs. FLAN-T5 is a <b>sequence-to-sequence model</b>, which matters for our upcoming code because it determines which model class we load. Using it, we can generate a piece of text based on a given prompt. 
Here&#39;s a quick Python code snippet using the <code>transformers</code> library from Hugging Face:</p><div class="codeblock"><pre><code># Import necessary classes from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Define the model we want to use
MODEL = &quot;google/flan-t5-base&quot;

# Initialize the tokenizer using the from_pretrained method 
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Initialize the model using the from_pretrained method
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# Define our prompt text
prompt = &quot;Translate from English to Spanish: &#39;How are you?&#39;&quot;

# Encode our prompt text into tensor of integers representing the sequence of tokens
inputs = tokenizer.encode(prompt, return_tensors=&#39;pt&#39;) 

# Generate the output sequence using the model
outputs = model.generate(inputs, max_length=100) 

# Decode the output sequence into readable text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)  # outputs &quot;Cómo estás?&quot;</code></pre></div><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Model and Tokenizer Initialization</b>: The necessary classes are imported from the transformers library and the pre-trained model (in this case, FLAN-T5 base) is specified. Then, both the tokenizer and the model are initialized from that pre-trained checkpoint.</p></li><li><p class="paragraph" style="text-align:left;"><b>Prompt Definition</b>: The input text, or prompt, is defined. This is the text that the model will translate or generate text from.</p></li><li><p class="paragraph" style="text-align:left;"><b>Input Preparation</b>: The prompt is encoded into a sequence of tokens (a format that the model can understand) using the tokenizer. This involves converting the text into a tensor of token IDs.</p></li><li><p class="paragraph" style="text-align:left;"><b>Text Generation</b>: The model generates an output sequence based on the input tensor. The length of the output sequence is capped by specifying a maximum length with the <code>max_length</code> parameter.</p></li><li><p class="paragraph" style="text-align:left;"><b>Output Decoding</b>: The output sequence is decoded back into readable text using the tokenizer. Special tokens included in the output sequence are removed during this process.</p></li><li><p class="paragraph" style="text-align:left;"><b>Printing the Output</b>: The final step involves printing the generated text. 
Depending on the task, this could be a translation, a summary, a continuation of the prompt, or any other type of text.</p></li></ol><p class="paragraph" style="text-align:start;">This is a pretty bare-bones code example, but I didn’t want to leave you totally hanging on the first post 🙂.</p><h1 class="heading" style="text-align:left;">Next time on AI Office Hours</h1><p class="paragraph" style="text-align:start;">In the coming weeks, expect more content around prompting techniques, using and fine-tuning open source LLMs, and using and testing different closed source LLMs, all with a mind for production, keeping costs down, and solving interesting and specific tasks with LLMs. This is something I love talking about and building around, so I can’t wait 🙂</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/41964593-c6d8-40a9-86da-4b8376ceea87/image.png"/><div class="image__source"><span class="image__source_text"><p>Me talking about my startup (now acquired) and how we were using AI to generate conversational responses on Jason Calacanis’ “This Week in Startups” podcast in 2017</p></span></div></div><p class="paragraph" style="text-align:start;">I encourage you to be curious, <b>ask questions that I can talk about on the newsletter</b>, and experiment with all of these examples. After all, AI is as much about learning and adapting as it is about coding and algorithms. 
</p><p class="paragraph" style="text-align:start;">There’s also a GitHub repo that I’ll do my best to keep updated with the code examples from this newsletter: </p><div class="embed"><a class="embed__url" href="https://github.com/sinanuozdemir/ai-office-hours?utm_source=ai-office-hours.beehiiv.com&utm_medium=newsletter&utm_campaign=ai-office-hours-are-open" target="_blank"><div class="embed__content"><p class="embed__title"> sinanuozdemir/ai-office-hours </p><p class="embed__description"> Contribute to sinanuozdemir/ai-office-hours development by creating an account on GitHub. </p><p class="embed__link"> github.com/sinanuozdemir/ai-office-hours </p></div><img class="embed__image embed__image--right" src="https://opengraph.githubassets.com/2e3fc0bf52d8a8572823e693fc89b465b82deb6df2bda3edfabc5e1c8c5d7270/sinanuozdemir/ai-office-hours"/></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=4824bef9-eb29-4f9a-a19e-fccac79216b1&utm_medium=post_rss&utm_source=ai_office_hours">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

  </channel>
</rss>
