AI Office Hours

(Re-) Ranking RAG Solutions

Sinan Ozdemir — Mon, 29 Jul 2024 14:26:17 +0000

In August of 2022, mere months before ChatGPT made it’s world debut, I wrote a post on medium^[1] about using auto-encoding LLMs like BERT to embed and retrieve documents from a vector database and then return responses to a query using information from that document. Sound familiar? I was describing a simplified version of Retrieval-Augmented Generation (RAG) inspired by a paper^[2] in 2020. My version used two types of auto-encoding LLMs - one to retrieve and the other to “generate” a response by selecting the best subset of the document that answered the question (I know that’s not actual LLM text generation, but I wanted to use something open source and it was 2022).

Auto-encoding LLMs are models that cannot generate text token by token like the “Generative AI” models - ChatGPT, Claude, Llama, or virtually any LLM on the market today - but rather models who’s sole purpose is to read quickly and efficiently at much smaller sizes. To put that size difference in perspective, a case study in my book has a 70M parameter DistilBERT model beating ChatGPT (2,500x bigger parameter-wise) in a head to head fine-tuning classification test.

Case study from my book: DistilBERT (70M params) performing at a similar level as GPT 3.5 (175B params) on the same training data (https://hf.co/datasets/app_reviews) while being nearly twice as fast as GPT 3.5. Size isn’t everything

I’ve been both fascinated and disappointed in the field of auto-encoding models in recent years. So few companies seem to want to innovate on non-generative LLMs so when a use-case like RAG comes up where a huge chunk of that pipeline involves reading / retrieval i.e. not generating anything, I get excited.

RAG can be broken down into three main steps (these figures are from my 2022 post but still are relevant):

Indexing documents - Using an embedding system to transform raw text into vectors and storing them in a database
Retrieving documents - Using (usually) the same embedding system to embed a query and using a vector similarity metric (like cosine similarity) to find the most relevant document
Generating a response - Using an LLM to create a raw text response to a user’s query using information in the document (yes I know there’s a typo in the figure, 2022 me messed up).

One of the main limitations of RAG systems is the quality of the retrieved document ranking. Cosine similarity between embeddings can only go so far in terms of matching queries to documents. This can be quantified by measuring how often an input query gets matched to the correct document which we will refer to as the top result accuracy of a RAG system, namely that the #1 closest retrieved document is in the fact the correct document that can answer the query. This post will go over an unsung hero in RAG that aims to maximize the effectiveness of this document ranking - the re-ranker.

Re-ranking Documents

At it’s core, a re-ranker is yet another LLM who’s job it is to take in a small amount (usually 10-50) of documents and the original query and rank the documents from most to least relevant. That sounds exactly like basic cosine retrieval because, well, it is the same result - a list of ranked documents.

How the re-ranking LLM does this is where things get different. Re-ranking happens on a much smaller scale than basic retrieval using cosine similarity from a vector DB because re-rankers’ architectures (often cross-encoders) are often more memory consumptive and slower but yield more precise results. They are considered an optional step between retrieval and generation.

Let’s look at a quick case study - a chatbot meant to help people navigate questions about Social Security.

Borrowing Government Data

Don’t worry, no one’s getting on a watch list for reading this. I’m taking just over 100 FAQs from https://faq.ssa.gov with corresponding help articles and using this as my data for this example. By the way, the full case study can be found at my O’Reilly RAG course^[3] .

Our data for this case study: pre-written FAQs about America’s Social Security system

I will use OpenAI’s text-embedding-3-small model to do all document embeddings and Pinecone for my vector database. If you follow the code in ^[3] (namely the retrieval / generation notebooks), I will simply embed all FAQs and store them (along with the url and raw text) in a vector database for retrieval. I didn’t want to use a well known benchmark here because frankly I already think most LLM benchmarks are unhelpful to the average person and I grow increasingly worried that companies will want to simply overfit to these benchmarks to get market hype so I try to think of simple yet relatable non-benchmark examples.

Now we need to create some test data so we can start to see how well our system is performing.

Generating Synthetic Test Data

Grain of salt alert! We are going to ask GPT-4 to generate some potential questions to test our retrieval against. Synthetic data generation is a new sub-genre of generative tasks and one with consequential downstream effects. The test data we use here will inform us as to how well our chatbot is retrieving information and therefore is a measure of how well our bot can perform. My point here is that I will be using the below prompt to generate test data but I cannot actually read the non-english examples and vet them myself so I am taking the questions generated by GPT-4 here with a grain of salt.

I am designing a chatbot to use this document as information to our users.

Please write 10 questions that an average person not educated in this social security system might ask that can definitely be answered using information using the provided document.

Try to ask in a way that's confusing to really test our system's knowledge but still fair.

I need 5 in English, 2 in Spanish, 2 in Chinese, and 1 in French in that order.

Use this format to output:
Document: A given document to make questions from
JSON: ["english question 1", "english question 2", "english question 3", ...  "spanish question 1", "spanish question 2", ..., "french question 1"]
###
Document: {document}
JSON:

>>>

[('english',
  'How do I start the process for getting disability benefits from Social Security?'),
 ..
 ('spanish', '¿Cómo solicito beneficios por discapacidad del Seguro Social?'),
 ..
 ('chinese', '我不在美国居住，我可以申请社会保障残疾福利吗？'),
 ..
 ('french',
  "Comment puis-je contacter mon bureau de sécurité sociale local pour des prestations d'invalidité?")]

With all that, let’s look at our baseline results of just using OpenAI’s embedder and pinecone’s basic retrieval (just cosine similarity).

Baseline Results

For the 220 questions (10 per a 20% sample of our scraped urls) and for each language I generated data in, I broke it up and calculated two items:

the % of times the expected document was even in the list of 10 retrieved documents from Pinecone
The % of times the expected document was the top document in the list (will always less than or equal to the first number)

OpenAI embeddings alone are a decent showing with English performing the worst tied with Spanish at 82% top result accuracy.

Using OpenAI’s embeddings alone gives us about a 84% accuracy overall (weighted by language) of the synthetic test set. Not all languages were able to grab the document at all from Pinecone. To raise that number, we could grab more documents or use a different / fine-tuned embedder. Both great things to test, but not the main point of this post.

Making Retrieved Documents better with re-rankers

We finally arrive at the crux of this post. Once we retrieve the documents from our vector database, you can pass it along to a generative AI and call it a day. But with re-ranking systems and just 5-10 more lines of code (not a hyperbole, check out the Github^[3] ), we can re-sort those documents from Pinecone to try and surface the actual relevant document to the top of the list. If we can consistently do this, we can pass fewer documents to our final RAG generation prompt resulting in a tighter, faster, and cheaper integration.

I evaluated two re-ranking systems for this:

Cohere’s v3 multilingual re-ranker - likely the largest company providing a marketable solution to the document re-ranking problem
Pongo’s semantic filter - one of the few small companies innovating in this space

Both of them work in a really simple way: provide a query and raw documents (not the OpenAI embeddings, they don’t matter to the re-ranker) and get back an ordered list of documents from most to least relevant. The test is simple - add this re-ranking step to the 10 retrieved documents from Pinecone. We will still be limited by the relevant document actually existing in the original 10, but we will be comparing Cohere and Pongo against simply using no re-ranking whatsoever.

Final Results

Everyone’s RAG system is different and your data will be different. For the data outlined above, our final results can be summarized as follows:

Both Cohere and Pongo improved top result accuracy from 84% to ~90% (~7% increase in performance).
Both models slowed the system down (not seen in the graph but both systems more than doubled the time to the testing process). This makes sense because we are actively performing a secondary LLM action.
Cohere’s model (being explicitly trained for multilingual use-cases) outperformed Pongo on Chinese, French, and Spanish.
Pongo beat Cohere on English examples (which represented 50% of the testing set).

Both Pongo and Cohere made our retrieval rankings better with ~5 lines of added code!

Overall, adding re-ranking to a system can take mere minutes to code up and as long as you have a proper testing set and a way to run tests automatically, there is no reason you cannot test your RAG systems against these re-rankers to see if they will have a net benefit on the retrieval accuracy.

Happy re-ranking!

References

[1] My August 2022 post on RAG:

Building a Natural Language Interface from an FAQ using pre-trained language models

Quickly and easily build a natural language interface using a static knowledge base as your source

medium.com/@profoz/building-a-natural-language-interface-from-an-faq-using-pre-trained-language-models-1c150dd572df

[2] The Original RAG Paper:

Retrieval-Augmented Generation for Knowledge-Intensive NLP TasksOriginal RAG Paperarxiv.org/abs/2005.11401

[3] My current RAG class materials:

Sinan Ozdemir’s RAG Course on O’Reilly

github.com/sinanuozdemir/oreilly-retrieval-augmented-gen-ai

To Quantize or not to Quantize

Sinan Ozdemir — Mon, 06 May 2024 14:12:17 +0000

Introduction to Quantization

Quantization refers to the technique of representing models using fewer bits by reducing the precision of its parameters. This process involves converting continuous or high-precision values into a smaller set of discrete values, typically by mapping floating-point numbers to integers. The primary goal of quantizing large language models (LLMs) is to decrease memory usage and accelerate inference.

There are several methods to quantize a model, which I won't get into as there are already excellent resources available (see reference [3]). Instead, I wanted to focus on a specific use case I get asked about a lot as an AI consultant and teacher: deploying an off-the-shelf model without further fine-tuning. These models could be ones pre-trained by other organizations, like Llama-3-8B, or previously fine-tuned on specific datasets without quantization. This post will not cover the process of fine-tuning while quantizing, which involves techniques such as QLORA (I have codes examples for this in reference [2]).

Python code to quantize a model is relatively straightforward using popular packages like transformers which have implementations of algorithms like NF4 (see below code sample and reference [3] for more details). NF4, which stands for NormalFloat 4, is a particularly effective strategy for maintaining the performance of AI models. Originally introduced in the QLORA paper, NF4 has become a preferred choice in modern quantization strategies.

# Import necessary classes and functions from the transformers library
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Define the model name to load from Hugging Face's model hub
model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'

# Configure the quantization settings using BitsAndBytesConfig
# Setting load_in_4bit to True enables 4-bit quantization
# bnb_4bit_use_double_quant enables double quantization for more precise control
# bnb_4bit_quant_type specifies the NF4 quantization algorithm
# bnb_4bit_compute_dtype sets the data type for computation to bfloat16 for efficiency
bits_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Initialize the tokenizer for the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load and configure the quantized model
qt_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    quantization_config=bits_config,    
    device_map="auto"
).eval()  # Set the model to evaluation mode which disables training specific operations like dropout

# Load the non-quantized version of the same model
non_qt_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map="auto"
).eval()  # Set the model to evaluation mode

The not so straightforward part is testing both quantized and non-quantized models side by side on our main three considerations:

Optimizing Inference - memory and latency reduction
Raw token output differences - measuring the raw differences between the next token prediction outputs
Performance on benchmarks / test sets - running generative benchmarks and comparing the two models

I will use be using Meta’s llama-3-8B-Instruct model as my reference.

Consideration 1 - Optimizing Inference

Probably the most well known benefits of quantization are the inference gains both in memory usage and in latency/throughput. Lower parameter precision means less memory required to hold the model and faster computations. The memory usage and latency differences are dramatic between the two models and hold at both small and larger batch sizes.

Measuring the peak memory usage and latency of the forward pass of Llama 3-8B shows striking differences. The Non-Quantized model (red) uses far more memory (top) and takes far longer to process inputs in batch sizes between 1 and 32 (bottom).

Quantized models are supposed to be faster and more memory efficient so this is just the tip of the iceberg. Are they as reliable as their non-quantized cousin? Are they better? Worse? Let’s see how we can find out.

Consideration 2 - Raw Token Output Differences

This next graph has me asking both versions of the Llama 3 model 163 questions from a subset of MMLU-Virology (the benchmark content isn’t as relevant here) and using the Jaccard Index (Similarity) - a similarity metric between two sets as the number of items they have in common divided by the total number of unique items between them - to measure the differences between the raw next token predictions for each input at various cutoff points - k=1, 2, 3, etc. This will give us a straightforward way to quantify the differences in raw model output of quantized and non-quantized models.

I chose the Jaccard Index also for its robustness in scenarios where the exact alignment of token sets is less important than the overall overlap, making it ideal for evaluating models where slight deviations in token predictions are acceptable. We can see that most tokens are in common but a non-insignificant number of tokens are in fact different.

The Jaccard similarity between the top k predicted tokens of the quantized and non-quantized model on a subset of MMLU-virology

Given this graph, roughly speaking, we can expect about 75-80% of the tokens to match in the top 1, 3, 5, 10, and 20 predicted tokens for this test set, which can lead to performance differences (see consideration 3). These raw token outputs will not only affect performance on test sets but will also yield differences in the inference parameters that we set. For example, setting a top_p (which affects token probabilities) for a non-quantized model might yield drastically different results on the quantized version.

Consideration 3 - Performance on Test Sets

Considerations 1 and 2 were measuring the differences in raw next token predictions both in similarity and in speed/memory usage but neither were considering the accuracy of what those tokens represented. We saw non-insignificant differences between which tokens might be outputted which suggests that there will be differences in benchmark performance.

I’m planning a post on benchmarking in more detail but for now, I’m going to pass a very simple 0-shot prompt to each model on a subset of MMLU-Virology. I measured the words per minute (which I expected to be better for the quantized model) and the accuracy on the multiple choice questions.

Note: The only inference parameter I set was a temperature of 0.1 to induce some more consistency and reproducibility of the experiment. This choice will also highlight any token differences by making the differences in token probabilities sharper.

The Quantized Model (Red in both graphs) has a better word per minute rate (top) but performs slightly worse on a subset of the MMLU benchmark (bottom).

Right out of the gate, the non quantized model is performing slightly better on this benchmark subset but has a much lower WPM (no surprise there given the forward pass calculations in consideration 1). The difference in performance comes down to the fact that quantization is objectively altering the model from how it was trained. It’s not always going to be true that the quantized version of a model will perform worse but especially on well known benchmarks like MMLU that companies like Meta, OpenAI, Anthropic, etc test their models on, it’s a good bet. It’s always good to test.

To mitigate this, we could fine-tune the model while quantizing using a technique like QLORA (reference [2]).

Conclusion

Quantization offers tangible benefits in terms of reducing memory usage and enhancing the speed of computations. This has been demonstrated effectively in the case of Llama-3-8B, where quantized models significantly outperform their non-quantized counterparts in memory efficiency and processing speed during inference.

However, quantization does come with built in trade-offs. The alterations in precision can lead to differences in token output and potentially affect performance on benchmarks and practical applications. The balance between efficiency and accuracy must be carefully tested and managed, and for critical applications, performing some fine-tuning post-quantization using QLORA may be necessary to restore or enhance model performance.

I hope this helps!

References

[1] Code for these graphs

Quantization Comparisons

colab.research.google.com/drive/12RTnrcaXCeAqyGQNbWsrvcqKyOdr0NSm?usp=sharing

[2] QLORA example (see the SFT notebook in colab)

Guide to AI Alignment with Reinforcement Learning

What RL can and cannot do

ai-office-hours.beehiiv.com/p/aligning-llms

[3] Primer on Quantization from HuggingFace:

Quantization

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co/docs/peft/main/en/developer_guides/quantization

Probing LLMs for a World Model

Sinan Ozdemir — Thu, 25 Apr 2024 15:31:43 +0000

There are active debates over whether LLMs are just memorizing vast amounts of statistics or if they can learn a more cohesive representation of the world whose language they model. Some have found evidence for the latter by analyzing the learned representations of datasets and even go so far as to discover that LLMs can learn linear representations of space and time (arxiv.org/abs/2310.02207).

As part of the 2nd edition of my latest LLM book (coming out later this year) one idea I wanted to add as a net new section aimed at recreating some of the work done in this paper by looking at a dataset comes from a paper entitled “A cross-verified database of notable people, 3500 BC-2018 AD” claiming to build a “comprehensive and accurate database of notable individuals”; just what we need to probe some LLMs on their ability to retain information about notable individuals they read about on the web.

I’m lucky to live in an age where open data for so many things exist: doi.org/10.1038/s41597-022-01369-4

Our steps to conduct the probe will be:

Design a prompt. At its simplest we will just say the name of the individual - like “Albert Einstein”
Instigate a forward pass of our LLM and grab embeddings from the middle layer and the final layer of our LLM’s hidden states.
1. For auto-encoding models like BERT, we will grab the reserved CLS token’s embedding and for auto-regressive models like Llama or Mistral, we will grab the embedding of the final token.
Use those token embeddings as inputs to a linear regression problem where we attempt to fit to three fields of the dataset plus a control fourth:
1. birth - the birth year of the individual
2. death - the death year of the individual (we filter to only use people who have died so this value is filled)
3. wiki_readers_2015_2018 - average per year number of page views in all Wikipedia editions (information retrieved in 2015–2018). We will use this as a weak signal to the notoriety level of the individual
4. random gibberish - just np.random.rand(len(dataset)). We will use this as a control as we should not be able to see any prediction signal

Probing gives us a way to understand how much information is locked away with the parameters of a model and whether or not we can extract that information through external processes. We place classifiers or regression layers in our case on top of hidden states and attempt to extract information like the birth year of the person we stated in the original prompt.

The goal of probing is not to act in place of an evaluation for a task but rather as an evaluation of a model as a whole in particular domains. The dataset I chose for this represents a relatively “generic” task - remember information it has read.

Probing Results

For every model we are going to probe we probe the first, middle, and ending layer’s final token embedding to regress to our four columns. The next figure shows an example of probing Llama 2 13b’s middle layer. Our birth year and death year probes perform surprisingly strongly; an RMSE of 80 years and R2 of over .5 is not the worst linear regressor I’ve trained, especially considering the scale of our data.

An example of probing the middle layer of a Llama 13b model with a constructed prompt. Our birth (top left) and death (top right) probes perform relatively well (R2 of above .5) while readership (bottom left) performs less well (R2 of .32) and our gibberish regression model performs poorly as expected (R2 of 0).

The above figure shows a smattering of models I probed by averaging the R2 achieved by a linear regression on the birth year against the embeddings from the middle and the final layer. The smaller four bars represent auto-encoding BERT models with far fewer parameters than Llama, SAWYER (a chat aligned version of Llama 2 I made), and Mistral v 0.1 and 0.2

Across 15 models, we see a wide range of R ^ 2 scores. BERT models, despite having the lowest scores, also have far fewer parameters, making them perhaps more efficient at storing information.

A couple of notable takeaways:

BERT base multilingual out performed BERT large English showing how the data that LLMs are pre-trained on matters
Mistral v0.2 as a 7B model performs as well as the Llama 2 13b models showing how parameter size is not everything
Llama 13B non instruct performed better when given a structured prompt (“basic information about X” vs simply “X”) showing how prompting can drastically alter the amount of information being retrieved

Are any of these “good” predictors of birth and death year? No absolutely not but that’s not the point. The point is to evaluate a model’s ability to encode and retrieve pre-trained knowledge. Moreover, even though our BERT models performed much worse, remember that A. they are several years older than the other models tested and B. They are ~72x smaller than the Llama 13B models and ~40x smaller than the 7B models.

The next bar graph shows the efficiency of three models measured by the number of parameters needed to achieve a single R2 value so lower means more efficient. BERT takes the cake for being able to retain the information much more efficiently, most likely due to the nature of its auto-encoding language modeling architecture and pre-training.

Between, BERT, Llama 2 13b, and Llama 2 7b, the number of parameters it takes to achieve the R2 in our probe can indicate the efficiency of the model’s ability to encode information. BERT requires far fewer parameters than Llama 2 to extract encoded information but would require more pre-training on recent data to become on par with the Llama 2 model’s performance

For a second probe, I ran the GSM8K testing data through five models and built similar probes to the actual answer of the problem and below we can see our results.

Probing 6 models on the GSM 8K benchmark by taking the final token of the input world problem and regressing to the actual answer.

It seems that Mistral v0.1 and v0.2 models have more retrievable encoded knowledge than the Llama models when it comes to mathematical word problems making them potential prime candidates for fine-tuning tasks related to math and logic.

Check out the raw code for the Llama 3 Probe here: https://colab.research.google.com/drive/1e1d9fATVjVun-_tPj4vS_DSTGaIfxs01?usp=sharing I’m still prettifying everything 😀

LLMs Aligned! But to what end?

Sinan Ozdemir — Fri, 08 Mar 2024 15:00:00 +0000

Introduction - Re-aligning Our Expectations of AI

Reinforcement Learning (RL) has become one of the primary engines powering AI alignment, the process of fine-tuning an AI model (usually LLMs) to behave according to a certain set of standards and styles. Reinforcement learning provides the unique ability to instill dimensions of human style and ethics outside of the confines of relatively strict next-token prediction. Reinforcement Learning offers us a chance to supplement traditional fine-tuning methods of prompt-response pairs with a system designed to “nudge” the AI in a direction - funnier, more neutral, more diverse, etc.

Just want code? here you go 🙂 it’s at the bottom of this repo.

GitHub - sinanuozdemir/oreilly-llm-alignment

Contribute to sinanuozdemir/oreilly-llm-alignment development by creating an account on GitHub.

github.com/sinanuozdemir/oreilly-llm-alignment

To instill these kinds of behaviors into an LLM -let’s say we want the AI to be funnier - that would mean you need to go over all of your training data and make sure that examples are “funny enough” to be considered good training data. But who is deciding what is “funny enough”? What if learning humor comes at the cost of the AI’s primary objective (usually answering questions and carrying conversations)?

My upcoming workshop at ODSC East and this post focuses on Reinforcement Learning from Feedback (RLF) which involves giving an AI iterative feedback on solving a task and letting the LLM adapt its own performance in the hopes of having the AI act in a more expected manner and getting better feedback over time. The most common application of this is in training instruction-following AIs, which is exactly what I will be going over.

Aligning Open-source LLMs Using Reinforcement Learning from Feedback

odsc.com/speakers/aligning-open-source-llms-using-reinforcement-learning-from-feedback

Case Study - Teaching a Llama to chat

In a previous post for an ODSC workshop I gave, I showed off results from one of my go-to RLF case studies - fine-tuning a FLAN-T5 model to write more neutral news summaries: https://opendatascience.com/harnessing-llm-alignment-making-ai-more-accessible

In this post I want to show off my second meatier go-to case study - making a conversational chatbot from a raw pre-trained LLM. It’s name is SAWYER - Sinan’s Attempt at Wise Yet Engaging Responses - because I wanted to make a fun name for my LLM too.

That already sounds like an oxymoron doesn't it? - “raw pre-trained” - but what I mean by that is our base model will be Meta’s non chat-aligned LLama 2 model, meaning this model has no ability to answer a question when it comes to us off the shelf.

Our RLF process can be broken down into three steps:

Supervised Fine-Tuning (SFT)

Grab Meta’s 7b non chat model weights: hf.co/meta-llama/Llama-2-7b-hf
Fine-tune the model with several conversations to learn how to convert embedded knowledge into a productive conversation

Reward Training (RT)

Get a dataset of scored responses to a conversational reply, indicating which responses humans preferred
Fine-tune a RoBERTa model to distinguish between preferred and non-preferred responses to a conversation

Reinforcement Learning from Feedback (RLF via PPO)

Obtain an entirely new set of only prompts with the bot response at the end missing
Let the LLM reply to a few and use the reward model to assign rewards to the responses - positive is good, negative is bad
Let the LLM update its parameters, taking into consideration how much reward it got and how far the updated model has deviated from the original weights

The figure below shows the RLF process (the third step) at a very high level:

Our RL loop has SAWYER answering questions, being graded on its performance, and asking it to try again with updated parameters and yes, that image of a llama is AI generated 🙂

The workshop will cover the nitty gritty of how to code all of this but for now let’s skip to the fun part: the results!

The Results

We will see the full suite of results during our workshop but some notable examples stand out. Let’s ask our three versions of SAWYER - no alignment whatsoever (base LLama 2 7b ), only supervised fine tuned (SFT) and fine-tuned plus reinforcement learning from feedback (SFT + RLF).

SAWYER learns to answer questions with SFT, but learns to answer them in a more conversational way with RL

We can see a notable difference between all three stages starting with base non-chat LLama 2 which is trying to write a multiple choice question (I guess?) and our final SAWYER model being the chattiest about the actual answer to the question. That is a relatively cherry picked example but even when we zoom out and test our model against a test set of conversations (done before and after applying RL to our chatbot), plotting for achieved reward scores, our model post RL is on average getting higher preference rewards with a much lower variance:

We see statistically significant changes in rewards from before (SFT only) and after alignment via RL

This means that the final SAWYER model, post RL, seems to be getting higher rewards, more consistently.

Reinforcement learning doesn’t help with everything though. For example, we can’t expect a model that received more rewards for answering in a way that humans prefer to be “smarter”. I ran these models against some well known benchmarks (below I’m showing truthful_qa and mmlu[world_religions]) and they got basically the same accuracy score:

Our SFT and SFT + RLF models perform basically at the same level on tasks where the model needs to respond accurately and style is irrelevant (no chain of thought was applied here, I simply asked the model to answer a question directly)

I wasn’t expecting SAWYER to knock these benchmarks out of the park by any means, I just wanted to show the difference between what RL can and cannot help with. Aligning a model to chat factually and conversationally involves several steps and each step comes with caveats and nuances. Navigating these waters is challenging without step by step guidance and that is exactly what I will be providing at my upcoming workshop!

Conclusion

The exploration of Reinforcement Learning from Feedback (RLF) as a means to fine-tune Large Language Models (LLMs) towards specific behavioral goals—such as conversational attitudes, neutrality, or diversity—represents a significant leap forward in our quest to make AI more adaptable and responsive to human needs.

Through case studies like SAWYER, we can see firsthand the potential of SFT and RL to transform a pre-trained model into a more engaging and conversational agent. The process, involving a blend of supervised fine-tuning and reinforcement learning, underscores the complexity of aligning AI with nuanced human qualities. The results, while encouraging, also highlight the inherent limitations of current methodologies. While RL can guide models to interact in more human-like ways, it does not inherently increase their factual accuracy or understanding of the world.

An overview of the three elements of LLM alignment

The journey of aligning AI with human expectations is ongoing. The successes and limitations of using RL from Feedback signal that while we can nudge AI towards more human-like interactions, the end goal—creating AI that truly understands and reflects human values, humor, and ethics—remains a challenging frontier. As we move forward, it is crucial to continue refining our approaches, questioning our objectives, and considering the broader implications of our quest to create AI that is not just aligned, but aligned to what end. The future of AI alignment is promising, yet it demands our thoughtful consideration, creativity, and, most importantly, our unwavering commitment to ethical principles.

For more on:

Why we are using PPO over DPO
Evaluating SAWYER’s capabilities
Tips and techniques I used to fine-tune SAWYER on a single GPU on Colab
How to properly fine-tune a reward mechanism
How PPO can help set us up for more longer term success than DPO can
Why higher rewards isn’t always a good thing
SAWYER’s opinions on poetry

And much more, come to our workshop at ODSC East in April! See you there.

Aligning Open-source LLMs Using Reinforcement Learning from Feedback

odsc.com/speakers/aligning-open-source-llms-using-reinforcement-learning-from-feedback

Navigating the ML Content Maze: Strategies for a High-Quality Feed

Sinan Ozdemir — Fri, 01 Mar 2024 17:03:34 +0000

Hey everyone!

In today's post, I'm thrilled to share insights from my good friend Nathan Lambert's recent post on Interconnects, a reflection sparked by his appearance on my very own Practically Intelligent podcast. Lambert and I dove into the art of curating a high-quality ML content feed amidst the deluge of information. Highlighting the need for critical evaluation, model access, and the balance between depth and breadth, this guide is indispensable for those navigating the ML landscape. Dive into the full article for a comprehensive exploration of these strategies.

How to cultivate a high-signal AI feed

Basic tips on how to assess inbound ML content and cultivate your news feed.

www.interconnects.ai/p/making-a-ml-feed

Lambert and I offer invaluable advice on navigating the vast ML content landscape:

Model Access and Demos: The gold standard for evaluating ML content credibility.
Depth vs. Breadth: Focus on areas that provide the most leverage for your goals.
Reproducibility and Verifiability: Signs of scientific rigor in ML projects.
Critical Evaluation of Sources: Not all ML content is created equal.
Scientific Rigor: The importance of foundational principles in assessing ML advancements.

And for more enriching discussions on ML, don't forget to tune into Practically Intelligent!

Practically Intelligent

A podcast by AI nerds for AI nerds

www.practicallyintelligent.com

I’m also looking to learn what you all want me to write about! If you want to submit a GH issue on our github, I would love to incorporate any and all feedback 🙂

GitHub - sinanuozdemir/ai-office-hours

Contribute to sinanuozdemir/ai-office-hours development by creating an account on GitHub.

github.com/sinanuozdemir/ai-office-hours

Fashion Meets AI

Sinan Ozdemir — Mon, 08 Jan 2024 16:38:53 +0000

I wanted to write a post about a singularly interesting conversation I had with GPT-4 Vision recently. The impetus for this post was that I was genuinely excited to use GPT for a new kind of task for me.

The Task? I wanted GPT-4 to help me plan a wedding outfit.

Help I’m fashionably challenged and want GPT to help me plan a wedding outfit

The outfit had an interesting theme of “casual beach formal” , so I thought I creative AI should be able to help me out. With some simple prompts I asked to design an outfit for that theme and then I asked it to draw a image of that outfit on a mannequin with brown hair so that I could see it. I’m very visual.

A “casual beach formal” outfit as deemed by GPT-4. Oops I like it but that doesn’t look like me at all. Let’s try to tighten up this look customized for me.

I noticed the resulting image was quite beautiful, and as much as I make my resolutions every year, I don’t have a body like that. So I asked GPT for the mannequin in my dimensions, which I will not reveal here the resulting image striking similar to the first one.

Asking for several pant sizes larger and about 4-6 inches shorter. His brothers left but he stayed. Ok let’s try again

I tried harder to make the model my proportions, but the AI kept making him very thin.

No matter what I tried I kept letting the same Mr Hot-Man. For this image I specifically asked for love handles and to be fair I get where it was going with that around the mid section.

Getting bored and wanting something to happen, I asked it to do something more fantastical like draw an arm out of the mannequin’s head, it was happy to change the image for that. To be fair it didn’t do what I asked, but still.. it will rather draw robot arms than make me slightly fatter.

I asked for a third arm out of my head. That’s not what I got but hey it’s more willing to turn me into an android than to make a bit heavier so we can work from here maybe.

He is my son and I shall call him sinan v0.1 alpha. I even asked to make sinan v0.1 alpha stouter and shorter, and it still refused to do that for me.

Hey there’s that third arm! This is after I asked to make sinan v0.1 alpha stouter and shorter

It’s much easier to make a multi-arm mannequin than one slightly shorter and fatter I guess. As one more test I backed up a bit to the original outfit and I asked it to draw it for my Indian friend who wants to wear the same outfit. And here is what it came up with.

For me (on the left) and for my “Indian friend” on the right. Not much difference but the skin is slightly different for sure.

Honestly, at first I didn’t notice the difference but then after a few seconds I do see that the skin is slightly different on the right which I guess is fair. It’s just hard to get away from that chiseled face I guess, I get it.

Conclusion?

This is not a research study, nor did I do a lot of repetitive testing here. But it is a case study in my singular user experience with an AI to solve what I expect is not that rare of a request. As someone who has been working with Generative AI for over a decade, if I can’t get GPT to do what I want even with some minor prompt engineering, I wonder how long the average ChatGPT user will wait before rage-quitting on this exact scenario.

This is no way of criticism of OpenAI. Multi-turn multi-modal conversations is arguably the most challenging tasks being undertaken by commercial AI today. I’m only saying that any company who is building such AI experiences should remember to market not only the AI’s capabilities but also known limitations, no matter how small. It would at least be nice to know that other people see similar areas of improvement as I do.

Just a thought.

AIs Supervising AIs

Sinan Ozdemir — Mon, 18 Dec 2023 17:28:54 +0000

As a part of my upcoming O’Reilly session on aligning LLMs, I wanted to talk a bit about scale supervision - an AI’s ability to judge another AI on the generated responses. I was originally inspired by a HuggingFace post called Can foundation models label data like humans and I wanted to replicate some of the results and add some results of my own.

The Data

I am using some comparison data that I also used in my book that can be found here. Most AI responses were rated very highly by humans:

Of nearly 5,000 paired responses with scores, most of the ratings were pretty high

Because I was approaching the $200 mark in OpenAI costs after running only about 3% of the data through my prompt, I only ended up using 4,877 paired responses.

The Task + Prompt

The task for the AI is simple: given a query and two AI generated responses, submit a score from 1-9 where 1 means it strongly prefers Assistant 1’s answer, 9 means it strongly prefers Assistant 2’s answer, and I specifically call out 5 to be an appropriate score if both answers are equally fine.

I’m using GPT-4 with the following prompt format to ask the AI to pick the better response given a query:

---
SYSTEM PROMPT
---
### Rating Task
Rate the performance of two assistants in response to the user question.

Output a score from 1 to 9 where a 1 means you strongly prefer Assistant 1's answer and 9 means you strongly prefer Assistant 2's answer and 5 means either answer works just as well as the other.

Give the answer in the json format: 

JSON: {"reason": "Assistant X's answer is preferable because...", "score": Y}

---
USER PROMPT
---
### User Question
{query}

### The Start of Assistant 1's Answer
{answer_1}
### The End of Assistant 1's Answer

### The Start of Assistant 2's Answer
{answer_2}
### The End of Assistant 2's Answer

Now give your answer
JSON:

I’m invoking some chain of thought (by asking for the reasoning first) and have the temperature down to 0.3 to get some consistency going.

The Findings

With data and prompt ready, I ran the nearly 5K paired responses through my prompt and this is what I found!

The AI doesn’t tend to match human scores

I included a human simulated score by taking diff (answer 2’s human-given score minus answer 1’s human-given score which in theory could be from -10 to 10) and applying the formula to map it to be from 1-9

This mapping takes actual human score deltas (ranging from -10 to 10) and maps them to 1-9 to better compare to our AI

As far as raw accuracy goes, the AI only matches the human simulated score 6% but climbs to 25% if you relax accuracy to be within 1 point of each other (so if the simulated score rounded to 7 and the AI said 8, that counts as “correct”).

More interestingly, if you plot the simulated scores and the AI scores side by side, you see that the AI labels very differently:

Left: Simulated human scores form a natural multi-modal distribution with peaks at the 5 mark (where responses are scored similarly), 2.5, and 7.5.

Right: the AI score distribution is more polarizing and doesn’t have a peak at 5

So far our AI isn’t labeling responses like a human would. This mismatch in labeling behavior is even more striking when you simplify the task.

The AI was more likely to be prefer response 1

If you only look at paired responses that were scored exactly the same by humans, you would hope that the AI would recognize that they are similar and give a score of 5 more often than not. However this doesn’t appear to be the case; the AI will prefer to pick one answer over the other, tending towards preferring the first one.

The bias of favoring the first response is called a positional bias and it’s pretty clear to see in this graph where I’m only considering pairs of responses that humans gave the exact same score and yet the AI is more likely to prefer one response over the other when though I told it to rate the pair as a 5 when they are roughly equal.

The AI favors the first response even when I’m specifically only giving it responses where humans gave both responses the exact same score

Note that the bar for score 2 is nearly twice as high as the next highest bar (7).

Even if I bucket the responses into three broad groups, We see a clear bias to not pick a score in the middle even when that’s the appropriate answer:

This tells me that even for responses that should be roughly similar, I can’t always trust the AI to label them as such.

This was expensive 😅

I spent about $200 bucks on OpenAI just to get results for this, so I hope it was helpful!

Every time I do one of these, I have to re-do my budget for the week

Summary + The Code

Can LLMs label data like humans? It seems that both HuggingFace and I agree: not really. Of course we can improve upon our prompts and fine-tune models to perform even better but most people I talk to tend to use models like GPT-4 off the shelf with a pretty basic prompt like I used here so it’s worth calling it out!

If in a pinch and you really want to use AI to help you label some data I’d recommend:

Using few-shot learning to give some diverse examples of preferring an answer over the other
Expanding on what constitutes a preferred answer in the system prompt
Having a human double check at least a few responses to get a sense of how well the AI is doing

The notebook can be found here! https://github.com/sinanuozdemir/oreilly-llm-alignment/blob/main/notebooks/rlaif.ipynb

AI Benchmarking - the good, the bad, and the confusing

Sinan Ozdemir — Fri, 01 Dec 2023 19:10:41 +0000

Hey Everyone!

I don’t normally announce new episodes of my podcast, Practically Intelligent on here, but with this episode I felt compelled. This episode features an insightful conversation with Praveen Paritosh, a renowned expert in AI research, where we take a look at the critical role of benchmarking in the evolution of AI.

🔍 What's In The Episode?

A detailed look at how benchmarking drives AI advancement.
Insights into the impact of legacy benchmarks like SQuAD.
The complex dynamics between conceptual learning and rote memorization in AI development.
An exploration of benchmarks as vital tools in AI research, highlighting their strengths and limitations.

🎧 Check it out!

I hope you all enjoy it and have a great weekend 🙂

Fine-Tuning LLaMA 2

Sinan Ozdemir — Mon, 27 Nov 2023 18:47:57 +0000

Fine-tuning Llama 2 - a hands-on example

Hello everyone! Today, I am diving deep with a new (relatively) simple notebook to help people fine-tune Llama-2, Meta’s latest open source large language model on Hugging Face. Here are some fascinating insights from this journey (the notebook link is at the bottom of the post 🙂).

Dataset and Model: Our dataset for this experiment was the guanaco-llama2-1k from HuggingFace, comprising instructional texts. The model of choice was NousResearch/Llama-2-7b-hf, the 7 billion parameter model. Unlike my previous venture with App Reviews, this dataset explores a different facet of language understanding, focusing on instructional text comprehension and generation.

Feel free to change up the dataset/model but of course that might require some code changes. It will be easier to change out the dataset/conversation format because we’re using the blank slate non-aligned version of Llama 7b that has no expectations for conversation format.

Key Takeaways:

Efficient Fine-Tuning with LoRA: The LoRA (Low-Rank Adaptation) technique allowed for efficient fine-tuning of the LLaMA 2 model without the need for extensive computational resources. This approach is a nod to the evolving practices in AI, where efficiency is becoming as crucial as effectiveness.
Quantization and Performance: Implementing BitsAndBytes for model quantization significantly reduced the memory footprint. This optimization meant we could run a larger model on the same hardware, a critical factor in practical AI applications.
Improved Responses Post Fine-Tuning: The difference in the model's responses pre and post fine-tuning was stark. This improvement underscores the impact of fine-tuning on model performance, especially for specific use cases. Check out this before and after of asking the model "Who is Leonardo Da Vinci?"

Before fine-tuning: the model has no idea how to answer questions

After fine-tuning: the model now knows how to answer questions, even when asked about items not found in the instructional dataset

Cost vs. Performance: Similar to my previous analysis with fine-tuning OpenAI models, the cost-effectiveness of fine-tuning LLaMA 2 was noteworthy. While not as resource-intensive as GPT-3.5, the performance gains were substantial, offering a compelling middle ground between efficiency and effectiveness.
Masking Loss for Targeted Learning: We employed a custom data collator, DataCollatorForCompletionOnlyLM to selectively mask the loss calculation to focus only on the model's responses given a conversation, effectively ignoring irrelevant parts of the input. By doing so, we ensured that the model's learning was concentrated on generating accurate and relevant responses, improving its efficiency and effectiveness in understanding and replying to instructional texts.

Conclusion: The world of AI and language models continues to be a thrilling landscape of endless possibilities and learning. Fine-tuning LLaMA 2 has been an enriching experience, revealing the importance of model efficiency, the power of specific optimizations, and the constant need to balance cost with performance. As always, for those keen to dive deeper, the updated notebook on GitHub awaits. Until our next AI adventure, happy coding!

Next Steps: RLHF/RLAIF and Custom Pre-Training

Implementing RLHF/RLAIF: We could use Reinforcement Learning from Human Feedback (RLHF) to enhance LLaMA 2's performance by fine-tuning it based on interactive feedback, aiming for responses that better align with human expectations.
Custom Pre-Training Corpus: We can also fine-tune LLaMA 2 on a custom corpus tailored to specific domains, significantly enhancing its expertise and accuracy in niche areas without losing its versatile applicability.

Notebook link please!

Here is the notebook and have fun!

https://colab.research.google.com/drive/11KBP9-fJzsNtNFeLWJdaleNmxGbBn4l6?usp=sharing

New Notebook to Fine-tune with OpenAI

Sinan Ozdemir — Mon, 06 Nov 2023 20:00:00 +0000

It was brought to my attention that Chapter 4 of my latest book uses a dataset that Amazon has since revoked from HuggingFace (always keeping me on my toes). Because of this, I re-wrote the notebook in Github to update the example with a working dataset and at the same time, updated the code to use OpenAI’s latest fine-tuning API. I figured I would share some of the takeaways of the case study here.

Our data is App Reviews from HuggingFace (original Github here). The dataset is 288,065 reviews extracted from the Google Play. I split the data into training, validation, and testing. I used training and validation to fine-tune on OpenAI and held out testing to compare the final 4 mo

Our model options are:

Babbage trained for 1 epoch (3B model)
Babbage trained for 4 epochs (3B model)
GPT 3.5 trained for 1 epoch and no system prompt (175B model)
GPT 3.5 trained for 1 epoch with a system prompt (175B model)

1. Cost project early and cost project often

If this model is going to be used a lot, make sure to keep an eye on OpenAI’s pricing page to estimate how much money you’re about to spend on fine-tuning. Here is a breakdown of how much it cost me to train and run evaluation on all four models on the training dataset, obviously 3.5 was much more expensive, but were the performance gains worth it?!

Our performance increase from GPT 3.5 comes at a cost - literally. Fine-tuning GPT 3.5 was up to 75x more expensive than fine-tuning Babbage and inference with GPT 3.5 was up to 26x more expensive than Babbage! Worth it? Ehhh..

You could consider writing a batch prompt which is exactly what it sounds like, a prompt that predicts multiple apps at a time.

2. Consider simplifying the task

Btw the answer to the question was the extra money to fine-tuning GPT 3.5 worth it? NO. In testing accuracy, GPT 3.5 was only about 3% better than Babbage. For being about 60x times bigger, GPT 3.5 can sometimes be not worth the money.

This is true even if you consider simplifying the task and defining new metrics. For example, we have raw accuracy (simple # correct / # items) but we could also consider “good” vs “bad” as a binary classifier of changing classes to be “Good” (4 or 5 stars) or “Bad” (1, 2, or 3 stars). Of course you can do whatever you want. You can also do “one-off accuracy” so if the model predicts “3” and the answer was “2” or “4” it would be counted as right. All up to you on what matters 🙂. On there 3 metrics, GPT 3.5 still only does up to 3% better than Babbage.

Fine-tuning GPT 3.5 (ChatGPT) is performing a bit better than the much smaller Babbage models, even among the simplified tasks but is it worth the extra $$$?

3. Generative models generating nonsense

If you let a model blabber on, it will eventually say something unhelpful. In our fine-tuned GPT 3.5 model with no system prompt with a temperature of 0.1 (to make the outputs more deterministic) I saw some instances of the model not predicting 0, 1, 2, 3, or 4. Seems like the system prompt helps prevent against this and Babbage doesn’t need to be told this as much 🙂. It’s annoying but hey, generative models gonna generate.

Only our fine-tuned 3.5 model with no system prompt generates predictions out of the 0-4 range on our testing set sometimes (even with our temperature turned down low). Both Babbage models and GPT 3.5 with a system prompt never did this.

Until next time!

I have more takeaways than that but I’ll leave it there for now. If you want to see more, check out the notebook. Happy coding!

Harnessing LLM Alignment

Sinan Ozdemir — Fri, 27 Oct 2023 21:17:52 +0000

Hey everyone! I’m giving an alignment workshop next week at ODSC and they had me write a blog post to intro the work we were going to be doing. I wanted to share this intro with you all as well!

Back in 2020, the world was introduced to OpenAI’s GPT-3, a marvel in the AI domain to many. However, it wasn’t until two years later, in 2022, when OpenAI unveiled its instruction-aligned version of GPT-3, aptly named “InstructGPT,” that its full potential came into the spotlight, and the world started really paying attention. That innovation wasn’t just a technological leap for AI alignment; it was a demonstration of the power of reinforcement learning to make AI more accessible to everyone.

Aligning Our Expectations

Alignment, broadly defined, is the process of making an AI system that behaves in accordance with what a human wants. Alignment isn’t just about training AI to follow instructions; it’s about designing a system to sculpt an already powerful AI model into something more usable and beneficial to both technically inclined users and to someone who just needs help planning a birthday party. It’s this very aspect of alignment that has democratized the magic of Large Language Models (LLMs), enabling a broader audience to extract value from them.

If alignment is the heart of LLMs’ usability, what keeps this heart pumping? That’s where the intricate dance of Reinforcement Learning (RL) comes into play. While the term ‘alignment’ might be synonymous with reinforcement learning for some, there’s a lot more under the hood. Capturing the multifaceted dimensions of human emotions, ethics, or humor within the confines of next-token prediction is a colossal – and potentially impossible – task. How do you effectively program ‘neutrality’ or ‘ethical behavior’ into a loss function? Arguably, you can’t. It’s here that RL rises as a dynamic way to model these intricate nuances without strictly encoding them.

RLHF, which stands for Reinforcement Learning from Human Feedback is the technique OpenAI originally used to align their InstructGPT model and is frequently discussed among AI enthusiasts as the main way to align LLMs, but it’s merely one tool among many for alignment. The core principle of RLHF revolves around obtaining high-quality human feedback and using it to give LLMs feedback on their task performance in the hopes of having the AI speak in a more user-friendly manner by the end of the loop.

In our own day-to-day work with LLMs however, we often don’t need the AI to answer everything, we need them to solve the tasks relevant to us / our businesses / our projects. In our journey with RL, we’ll explore alternative approaches to RLHF where we can utilize other forms of feedback mechanisms that do not rely on human preferences.

Case Study – Aligning FLAN-T5 to make more neutral summaries

Let’s look at an example of using two classifiers from Hugging Face to enhance the FLAN-T5 model’s ability to write summaries of news articles that are both grammatically polished and consistently neutral in style.

The below code will define one such reward feedback, using a pre-fine-tuned sentiment classifier to obtain the logits for the neutral class to reward FLAN-T5 for speaking in a neutral tone and punish it otherwise:

sentiment_pipeline = pipeline(

  'text-classification', 

  'cardiffnlp/twitter-roberta-base-sentiment'

)

def get_neutral_scores(texts):

  scores = []

  # function_to_apply='none' returns logits which can be negative

  results = sentiment_pipeline(texts, function_to_apply='none', top_k=None)

  for result in results:

    for label in result:

      if label['label'] == 'LABEL_1': # logit for neutral class

        scores.append(label['score'])

    return scores

>> get_neutral_scores(['hello', 'I love you!', 'I hate you']) 

>> [0.85, -0.75, -0.57]

We can use this classifier along with another one to classify a piece of text’s grammatical correctness to align our FLAN-T5 model to generate summaries how we want them to be generated.

The Reinforcement Learning from Feedback loop looks something like this:

Give FLAN-T5 a batch of news articles to summarize (taken from https://huggingface.co/datasets/argilla/news-summary only using the raw articles)
Assign a weighted sum of rewards from:
1. A CoLA model (judging grammatical correctness) from textattack/roberta-base-CoLA
2. A sentiment model (judging neutrality) from cardiffnlp/twitter-roberta-base-sentiment
Use the rewards to update the FLAN-T5 model using the TRL package, taking into consideration how far the updated model had deviated from the original parameters

Here is a sample of the training loop we will build at the workshop I’m giving next week:

for epoch in tqdm(range(2)):

  for batch in tqdm(ppo_trainer.dataloader):

    #### prepend the summarize token

    game_data["query"] = ['summarize: ' + b for b in batch["text"]]

    #### get response from reference + current flan-t5

    input_tensors = [_.squeeze() for _ in batch["input_ids"]]

    # ....

    for query in input_tensors:

      response = ppo_trainer.generate(query.squeeze(), **generation_kwargs)

      response_tensors.append(response.squeeze())    

    

    #### Reward system

    game_data["response"] = [flan_t5_tokenizer.decode(...)

    game_data['cola_scores'] = get_cola_scores(

    game_data["clean_response"])

    game_data['neutral_scores'] = get_neutral_scores(

    game_data["clean_response"])

    #### Run PPO training and log stats

    stats = ppo_trainer.step(input_tensors, response_tensors, rewards)

    stats['env/reward'] = np.mean([r.cpu().numpy() for r in rewards])

    ppo_trainer.log_stats(stats, game_data, rewards)

I omitted several lines of this loop to save space but you can of course come to my workshop to see the loop in its entirety!

The Results

After a few epochs of training, our FLAN-T5 starts to show signs of enhanced alignment towards our goal of more grammatically correct and neutral summaries. Here’s a sample of what the different summaries look like using the validation data from the dataset:

A sample of FLAN-T5 before and after RL. We can see the RL fine-tuned version of the model is using words like “announced” over terms like “scrapped”.

Running both our models (the unaligned base FLAN-T5 and our aligned version) over the entire validation set shows an increase (albeit a subtle one) in both rewards from our CoLA model and our sentiment model!

The model is garnering increased rewards from our system, and upon inspection, there’s a nuanced shift in its summary generation. However, its core summarization abilities remain largely consistent with the base model.

Conclusion

Alignment isn’t just about the tools or methodologies of collecting data and making LLMs answer any and all questions. It’s also about understanding what we actually want from our LLMs. The goal of alignment, however, remains unwavering: fashion LLMs whose outputs resonate with human sensibilities, making AI not just a tool for the engineer but a companion for all. Whether you’re an AI enthusiast or someone looking to dip your toes into this world, there’s something here for everyone. Join me at ODSC this year as we traverse the landscape of LLM alignment together!

I will have a github repo for ODSC soon but until then, you can see the source notebook from my book here: https://github.com/sinanuozdemir/quick-start-guide-to-llms/blob/main/notebooks/7_rl_flan_t5_summaries.ipynb

My new book on LLMs!

Sinan Ozdemir — Fri, 22 Sep 2023 18:40:03 +0000

Hey everybody! I just wanted to let you all know that I have a new book out on getting started with LLMs!

Here it is! https://a.co/d/5SDvdju

Short post today but I just wanted to say how happy I am to have this new book out and a thank you to everyone who has already preordered their copies! I hope you all love it as much as I loved writing it.

Thats it for today. Happy Coding!

Our first Streamlit app

Sinan Ozdemir — Thu, 17 Aug 2023 13:00:00 +0000

I’ve been teaching a class through Pearson on LLMs and ChatGPT with an emphasis on empowering non-coders to learn how to prompt, build test harnesses, and rapidly prototype with LLMs. On our last day I introduced Streamlit, a super simple way to build super quick and dirty prototypes. The goal was to give my students a way to share their prototypes with people with minimal coding. I figured, why not also show the same example here!

Try it out here: https://ai-office-hours-wine.streamlit.app

Our wine recommending app prototype, complete with a explicit feedback mechanism

Streamlit is super simple and honestly with fewer than 100 lines of code, we can be done with our prototype. Let’s start strong and see the final app, found here on github:

A wine recommending app

# Import necessary libraries
import random

import openai
import streamlit as st
from datasets import load_dataset
from supabase import create_client

# Set API Key
openai.api_key = st.secrets["OPENAI_API_KEY"]


# Initialize DB connection once
@st.cache_resource
def init_connection():
    return create_client(st.secrets["SUPABASE_URL"], st.secrets["SUPABASE_KEY"])


supabase = init_connection()

# System prompt for OpenAI API
system_prompt = '''You are a wine bot that helps clients understand what kind of wine they want. Given a list of wines and a description of the client, tell me what wines they want by giving me the names of the wines. Include a reason preceding each pick to explain to the user why they might like it. Give me the information  as a numbered list of wines with reasons why they might like it.'''


# Cache wine dataset once
@st.cache_resource
def load_wines():
    wine_dataset = load_dataset("alfredodeza/wine-ratings")
    return list(wine_dataset['train'])  # only use train set for now


# Convert wine to string
def convert_wine_to_string(wine):
    return f'{wine["name"]} is from {wine["region"]} and is a {wine["variety"]}. {wine["notes"]}'


# Update reaction in DB
def react_to_row(row, reaction):
    supabase.table("response").update(
        {"reaction": reaction or None}, returning="minimal"
    ).eq("id", row['id']).execute()


# User input elements
user_description = st.text_input("Describe the client",
                                 "The client likes red wine and is looking for a wine to drink with dinner.")
n = st.number_input("How many wines to pull from the cellar?", min_value=1, max_value=10, value=3, step=1)


# Function to get recommendations
def get_recommendations(n=3, user_description=''):
    wines = random.sample(load_wines(), n)
    wines_formatted = "\n---\n".join([convert_wine_to_string(w) for w in wines])
    user_prompt = f'User Description: {user_description}\nWines to select from:\n{wines_formatted}'

    # Create chat completion with OpenAI
    chat_completion = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[{'role': 'system', 'content': system_prompt}, {'role': 'user', 'content': user_prompt}]
    )

    # Show the wine recommendations and store in Supabase
    st.write('Wines pulled from cellar to choose from')
    st.table(wines)

    row = supabase.table("response").insert(
        [{"system_prompt": system_prompt, "user_prompt": user_prompt,
          "response": chat_completion.choices[0].message.content, "prototype": "wine"}]
    ).execute().data[0]
    st.write(chat_completion.choices[0].message.content)
    st.session_state['row'] = row


# Button to get recommendations
st.button(
    "Get recommendations", on_click=get_recommendations,
    kwargs={'n': n, 'user_description': user_description}
)

# User reaction
reaction = st.selectbox("How do you feel about the response?", ("", "👍", "👎"))
if 'row' in st.session_state:
    st.button(
        "Submit reaction", on_click=react_to_row,
        kwargs={'row': st.session_state['row'], 'reaction': reaction}
    )

Here is how a user interacts with our app:

The user inputs their wine preferences and selects the number of recommendations they want to receive through the application's interface.
The user clicks on the "Get recommendations" button, triggering the application to randomly select wines from its dataset and request recommendations from the AI model.
The application displays personalized wine recommendations from the AI model along with detailed explanations and a table of the selected wines.
The user has the option to react to the AI's recommendations via a select box, expressing their approval or disapproval.
If a reaction is provided, the user clicks on "Submit reaction", and the application saves the user's feedback to Supabase, which can be used for future improvements to the system.

The goal here is to help people get their prototypes out there with minimal code. Everyone deserves to share their work!

As always, the code is also here on the Github! https://github.com/sinanuozdemir/ai-office-hours/tree/main/streamlit/wine_prototype

AI Office Hours are Open!

Sinan Ozdemir — Thu, 22 Jun 2023 15:27:49 +0000

Welcome to AI Office Hours!

Welcome to the very first AI Office Hours Newsletter! I'm Sinan Ozdemir, your guide through the ever-evolving world of AI. As a former lecturer at the Johns Hopkins University and an experienced entrepreneur in the AI field, I've spent years breaking down complex concepts, building real-world solutions, and sharing my knowledge through various publications. Now, I'm thrilled to welcome you to this journey where we demystify and actually use AI, particularly the realm of Large Language Models (LLMs).

Hi I’m Sinan! Your friendly neighborhood AI/ML/LLM Expert.

Am I an experience blogger or newsletter writer? Nope. Do I care a lot about sharing actionable insights and code for my fellow software engineers on the topic of AI? Absolutely!

Example 1: Generating text with Open-source FLAN-T5

Our first example today is a simple one, but something that I get asked about a fair amount: How do I simply generate text from an open source model from Huggingface?

Let's take the example of Google's FLAN-T5 model, one of Google’s latest open-sourced LLM. Using FLAN-T5 - which is a sequence to sequence model which matters for our upcoming code - we can generate a piece of text based on a given prompt. Here's a quick Python code snippet using the transformers library from Hugging Face:

# Import necessary classes from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Define the model we want to use
MODEL = "google/flan-t5-base"

# Initialize the tokenizer using the from_pretrained method 
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Initialize the model using the from_pretrained method
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# Define our prompt text
prompt = "Translate from English to Spanish: 'How are you?'"

# Encode our prompt text into tensor of integers representing the sequence of tokens
inputs = tokenizer.encode(prompt, return_tensors='pt') 

# Generate the output sequence using the model
outputs = model.generate(inputs, max_length=100) 

# Decode the output sequence into readable text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)  # outputs "Cómo estás?"

Model and Tokenizer Initialization: The necessary classes are imported from the transformers library and the pre-trained model (in this case, the FLAN model) is specified. Then, both the tokenizer and the model are initialized based on the pre-trained model.
Prompt Definition: The input text, or prompt, is defined. This is the text that the model will translate or generate text from.
Input Preparation: The prompt is encoded into a sequence of tokens (a format that the model can understand) using the tokenizer. This involves converting the text into a tensor of token IDs.
Text Generation: The model generates an output sequence based on the input tensor. The length of the output sequence is controlled by specifying a maximum length with the max_length parameter.
Output Decoding: The output sequence is decoded back into readable text using the tokenizer. Special tokens included in the output sequence are removed during this process.
Printing the Output: The final step involves printing the generated text. Depending on the task, this could be a translation, a summary, a continuation of the prompt, or any other type of text.

This is a pretty bare bones code example but I didn’t want to leave you totally hanging on the first post 🙂.

Next time on AI Office Hours

In the coming weeks, expect more content around prompting techniques, using and fine-tuning open source LLMs, using and testing different closed source LLMs all with a mind for production and keeping costs down and solving interesting and specific tasks with LLMs. This is something I love talking about and building around, so I can’t wait 🙂

Me talking about my startup (now acquired) and how we were using AI to generate conversational responses on Jason Calacanis’ “This week in startups” podcast in 2017

I encourage you to be curious, ask questions that I can talk about on the newsletter, and experiment with all of these examples. After all, AI is as much about learning and adapting as it is about coding and algorithms.

There’s also a github I’ll do my best to maintain with any code examples here:

sinanuozdemir/ai-office-hours

Contribute to sinanuozdemir/ai-office-hours development by creating an account on GitHub.

github.com/sinanuozdemir/ai-office-hours