<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>GenAI360 - Weekly AI News</title>
    <description>Weekly Gen AI Industry &amp; Research News, Curated by Team Activeloop. Read by 45K+ AI leaders, engineers, &amp; enthusiasts from 63% of Fortune 500 companies.</description>
    
    <link>https://genai360.beehiiv.com/</link>
    <atom:link href="https://rss.beehiiv.com/feeds/YGRtoYfULM.xml" rel="self"/>
    
    <lastBuildDate>Tue, 14 Apr 2026 17:44:31 +0000</lastBuildDate>
    <pubDate>Tue, 02 Dec 2025 15:01:23 +0000</pubDate>
    <atom:published>2025-12-02T15:01:23Z</atom:published>
    <atom:updated>2026-04-14T17:44:31Z</atom:updated>
    
      <category>Data Science</category>
      <category>Artificial Intelligence</category>
      <category>Technology</category>
    <copyright>Copyright 2026, GenAI360 - Weekly AI News</copyright>
    
    <image>
      <url>https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/publication/logo/0b5c4f4b-3417-433f-b591-fbfcd867c8c0/mikayelh_A_3d_isometric_illustration_of_white_and_orange_loop_m_35436fbb-e68d-444d-90cd-834aa96814ea.png</url>
      <title>GenAI360 - Weekly AI News</title>
      <link>https://genai360.beehiiv.com/</link>
    </image>
    
    <docs>https://www.rssboard.org/rss-specification</docs>
    <generator>beehiiv</generator>
    <language>en-us</language>
    <webMaster>support@beehiiv.com (Beehiiv Support)</webMaster>

      <item>
  <title>Announcing Activeloop’s Scientific Discover</title>
  <description>Connecting Research Data to Intelligence for Faster Scientific Discovery</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/dfe60736-2ead-4927-bf2f-acc0341d9c9b/image.png" length="99919" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/announcing-activeloop-s-scientific-discover</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/announcing-activeloop-s-scientific-discover</guid>
  <pubDate>Tue, 02 Dec 2025 15:01:23 +0000</pubDate>
  <atom:published>2025-12-02T15:01:23Z</atom:published>
    <dc:creator>Davit Buniatyan</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Today we are excited to release <b>Activeloop’s Scientific Discover</b>, an intelligence agent built on one of the largest datasets of indexed scientific research. Here are the details:</p><ul><li><p class="paragraph" style="text-align:left;"><b>A fully indexed dataset of 25M open-access scientific papers</b>: more than <b>450M pages</b> of text, figures, charts, molecules, and tables (<b>175TB+ total</b>).</p></li><li><p class="paragraph" style="text-align:left;"><b>An open-source scientific intelligence agent</b> built on this dataset, achieving <b>48% SOTA performance on Humanity’s Last Exam</b> when paired with our tools and research index.</p></li></ul><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/dfe60736-2ead-4927-bf2f-acc0341d9c9b/image.png?t=1764614308"/></div><p class="paragraph" style="text-align:left;">Researchers across biotech, life science, and academia tell us the same story.</p><p class="paragraph" style="text-align:left;">“Our knowledge is everywhere, but our answers are nowhere.”</p><p class="paragraph" style="text-align:left;">Scientific insight is buried inside PDFs, figures, screenshots, tables, and formats that machines cannot read. 
Teams lose weeks to extraction and cleanup instead of actual discovery.</p><p class="paragraph" style="text-align:left;">This bottleneck slows breakthroughs in drug development, materials science, and every field that depends on research.</p><p class="paragraph" style="text-align:left;">Activeloop’s Scientific Discover is a scientific data agent that reads and reasons across the entire scientific literature.</p><h2 class="heading" style="text-align:left;" id="the-us-needs-to-unify-all-scientifi">The US Needs to Unify All Scientific Data</h2><p class="paragraph" style="text-align:left;">The White House recently launched the <a class="link" href="https://www.whitehouse.gov/presidential-actions/2025/11/launching-the-genesis-mission/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=announcing-activeloop-s-scientific-discover" target="_blank" rel="noopener noreferrer nofollow">Genesis Mission</a> to unify all datasets for scientific discovery, recognizing that the current infrastructure cannot support the AI agents needed to cure diseases or discover new materials.</p><p class="paragraph" style="text-align:left;">Fully indexed research data will accelerate scientific discovery through AI. Activeloop’s Scientific Discover takes a major step in this direction. </p><h2 class="heading" style="text-align:left;" id="175-tb-of-scientific-research-data-">175TB of Scientific Research Data Indexed on Deep Lake</h2><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/x8Lv5-C9ntw" width="100%"></iframe><p class="paragraph" style="text-align:left;">We have successfully indexed <b>175TB of open-access scientific data</b>, creating one of the world&#39;s largest AI-ready scientific datasets. 
</p><p class="paragraph" style="text-align:left;">It is a fully structured, multimodal knowledge base powered by <b>Deep Lake</b>.</p><p class="paragraph" style="text-align:left;">Traditional search engines see scientific papers as flat text, mostly just titles and abstracts, often discarding the most critical data stored within papers, such as charts, molecular structures, and mathematical formulas. </p><p class="paragraph" style="text-align:left;">By utilizing <a class="link" href="https://github.com/activeloopai/deeplake?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=announcing-activeloop-s-scientific-discover" target="_blank" rel="noopener noreferrer nofollow">Deep Lake</a>’s tensor-based storage, we have preserved this multimodal context, allowing our AI agents to &quot;read&quot; papers with the same visual and semantic understanding as a human researcher.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Scale:</b> 25 million open-access papers comprising more than 450 million pages.</p></li><li><p class="paragraph" style="text-align:left;"><b>Multimodality:</b> Images, tables, and graphs are indexed alongside text, preserving the relationships between distinct data types (e.g., a chemical structure image linked to its textual description).</p></li><li><p class="paragraph" style="text-align:left;"><b>Cutoff Date:</b> March 2025</p></li><li><p class="paragraph" style="text-align:left;"><b>Infrastructure:</b> Built on Deep Lake’s &quot;Index-on-the-Lake&quot; technology, this dataset is stored efficiently on S3 Express, enabling sub-second retrieval of complex multimodal queries without the latency or cost of traditional vector databases.</p></li></ul><p class="paragraph" style="text-align:left;">This dataset serves as the foundational &quot;brain&quot; for the L1 Science Data Agent, ensuring it retrieves answers based on ground-truth scientific evidence rather than hallucination. 
</p><p class="paragraph" style="text-align:left;">You can run queries over the API or at <a class="link" href="https://chat.activeloop.ai/science?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=announcing-activeloop-s-scientific-discover" target="_blank" rel="noopener noreferrer nofollow">chat.activeloop.ai/science</a></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a5ff1e75-1817-4f0a-9801-318f48c1a1e2/image.png?t=1764614682"/></div><h2 class="heading" style="text-align:left;" id="humanitys-last-exam">Humanity’s Last Exam</h2><p class="paragraph" style="text-align:left;">Equipped with three tools (Code Interpreter, Web Search, and Scientific Search via the Activeloop API), the data agent achieves state-of-the-art results on the HLE benchmark. </p><p class="paragraph" style="text-align:left;">The agent achieves 43% accuracy in a single pass, and 48% with pass@2, attempting all 2,500 queries, including those containing images or GIFs. LLM cost per iteration is under $1.</p><p class="paragraph" style="text-align:left;">While not an exact comparison, one might speculate that Deep Think, GPT-5 Pro, and Grok Heavy run 8 parallel trajectories and then aggregate the final result, which is roughly equivalent to a pass@8 score.</p><p class="paragraph" style="text-align:left;">To address recent concerns about benchmark leakage, especially with web search, we blocked access to the Hugging Face website. Furthermore, we used LLMs to analyze all executed traces and identify potential answer leaks. 
While 4.9% of answers were suspected of leakage, only 0.2% were instances of tool usage (web search and scientific search).</p><p class="paragraph" style="text-align:left;">We are open-sourcing the full code to reproduce the benchmarks at <a class="link" href="https://github.com/activeloopai/hle?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=announcing-activeloop-s-scientific-discover" target="_blank" rel="noopener noreferrer nofollow">github.com/activeloopai/hle</a></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/6e4cf67b-1633-441f-8abb-d08b731c6b43/image.png?t=1764614816"/></div><h2 class="heading" style="text-align:left;" id="activeloop-l-1-the-multimodal-scien">Activeloop L1: The Multimodal Scientific Agent</h2><p class="paragraph" style="text-align:left;">L1 is not a paper search engine. It is an AI researcher.</p><p class="paragraph" style="text-align:left;">It understands:</p><ul><li><p class="paragraph" style="text-align:left;">text</p></li><li><p class="paragraph" style="text-align:left;">charts</p></li><li><p class="paragraph" style="text-align:left;">molecules</p></li><li><p class="paragraph" style="text-align:left;">protein structures</p></li><li><p class="paragraph" style="text-align:left;">formulas</p></li><li><p class="paragraph" style="text-align:left;">experimental tables</p></li><li><p class="paragraph" style="text-align:left;">clinical graphs</p></li></ul><p class="paragraph" style="text-align:left;">This is possible because L1 runs on the largest visually indexed scientific dataset ever created.</p><p class="paragraph" style="text-align:left;"><b>25 million papers. 
Over 175+ terabytes of fully indexed data.</b></p><h2 class="heading" style="text-align:left;" id="instant-scientific-intelligence">Instant Scientific Intelligence</h2><p class="paragraph" style="text-align:left;">Ask a question like:</p><p class="paragraph" style="text-align:left;"><i>Which compounds show synergy with metformin for type 2 diabetes?</i></p><p class="paragraph" style="text-align:left;">L1 reads molecular structures, clinical results, and experimental evidence in seconds.</p><p class="paragraph" style="text-align:left;">No manual extraction. No stitching PDFs. No lost details.</p><p class="paragraph" style="text-align:left;">Your team gets trusted, multimodal answers immediately.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/87190feb-0aa5-455f-a23d-17aa9676a845/image.png?t=1764614913"/></div><h2 class="heading" style="text-align:left;" id="free-researchers-to-do-real-discove">Free Researchers to Do Real Discovery</h2><p class="paragraph" style="text-align:left;">By automating the extraction and integration of scientific data, L1 lets your researchers focus on:</p><ul><li><p class="paragraph" style="text-align:left;">identifying new drug targets</p></li><li><p class="paragraph" style="text-align:left;">generating promising molecules</p></li><li><p class="paragraph" style="text-align:left;">validating hypotheses</p></li><li><p class="paragraph" style="text-align:left;">accelerating discovery cycles</p></li></ul><p class="paragraph" style="text-align:left;">Instead of spending their time cleaning data.</p><h2 class="heading" style="text-align:left;" id="key-applications-for-empowering-the">Key Applications for empowering the Genesis Mission</h2><p id="activeloop-l-1-the-multimodal-scien" class="paragraph" style="text-align:left;">Activeloop L1’s main applications include:</p><p class="paragraph" 
style="text-align:left;"><b>Accelerating Biotechnology & Target Identification:</b> Aligned with the mission to “cure diseases,” multimodal AI correlates diverse data such as gene expression, protein interactions, and clinical outcomes to pinpoint viable drug targets faster than humanly possible.<br><b>Critical Materials & Energy Dominance:</b> Essential for “nuclear fission, fusion, and energy dominance.” The agent can explore vast chemical spaces to generate candidate structures for next-gen batteries or superalloys that satisfy conflicting properties like efficacy, safety, and thermal stability.<br><b>Semiconductors & Advanced Manufacturing:</b> Supporting the race for “global technology dominance.” By indexing fabrication diagrams and material properties from millions of papers, the agent can suggest process improvements and novel material compositions for microelectronics.</p><h2 class="heading" style="text-align:left;" id="faster-science-discovery-over-api">Faster Science Discovery over API</h2><p class="paragraph" style="text-align:left;">Multimodal scientific research involves using AI to analyze and integrate data from multiple, diverse sources or modalities to gain a more holistic and accurate understanding of diseases and potential treatments. </p><p class="paragraph" style="text-align:left;">Instead of relying on text alone, multimodal models combine information from all of these sources simultaneously. 
This approach mirrors how human experts synthesize knowledge from different sources.</p><p class="paragraph" style="text-align:left;">You can try the agent today via our OpenAI-compatible API:</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/f557d528-5e96-42bb-bb6c-100b1a3d3733/carbon__2_.png?t=1764636951"/></div><p class="paragraph" style="text-align:left;"><span style="font-family:'Proxima Nova', sans-serif;">You can get an API key by signing up and subscribing at </span><b><a class="link" href="https://chat.activeloop.ai?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=announcing-activeloop-s-scientific-discover" target="_blank" rel="noopener noreferrer nofollow">chat.activeloop.ai</a></b><span style="font-family:'Proxima Nova', sans-serif;"> and learn more about usage in the </span><b><a class="link" href="https://docs.activeloop.ai/setup/quickstart?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=announcing-activeloop-s-scientific-discover" target="_blank" rel="noopener noreferrer nofollow" style="color: rgb(255, 183, 70)">docs</a></b><span style="font-family:'Proxima Nova', sans-serif;">.</span></p><h2 class="heading" style="text-align:left;" id="ready-to-accelerate-discovery">Ready to Accelerate Discovery?</h2><p class="paragraph" style="text-align:left;">We are partnering with leading biotech and research teams to unlock the next generation of multimodal scientific innovation.</p><p class="paragraph" style="text-align:left;"><b>Try the Science Agent:</b> <a class="link" href="https://chat.activeloop.ai/science?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=announcing-activeloop-s-scientific-discover" target="_blank" rel="noopener noreferrer nofollow"><b>chat.activeloop.ai/science</b></a></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://chat.activeloop.ai/science?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=announcing-activeloop-s-scientific-discover"><span class="button__text" style=""> Get Started </span></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=b72d6be9-e203-448b-9822-ac1b7a61b274&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
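The API call pictured in the screenshot above can be sketched with Python's standard library alone. This is a minimal sketch, not the official client: the base URL, the `/v1/chat/completions` path, the `activeloop-l1` model name, and the `ACTIVELOOP_API_KEY` environment variable are illustrative assumptions; consult the Activeloop docs for the actual values.

```python
import json
import os
import urllib.request

# Assumed base URL for the OpenAI-compatible endpoint (see docs.activeloop.ai).
BASE_URL = "https://chat.activeloop.ai"


def build_request(question: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "activeloop-l1",  # hypothetical model identifier
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",  # conventional OpenAI-compatible path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    key = os.environ.get("ACTIVELOOP_API_KEY", "")
    req = build_request(
        "Which compounds show synergy with metformin for type 2 diabetes?", key
    )
    if key:  # only send when a real key is configured
        with urllib.request.urlopen(req) as resp:
            print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint follows the OpenAI wire format, the official OpenAI SDK (pointed at this base URL) or any compatible client should work the same way.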
  ]]></content:encoded>
</item>

      <item>
  <title>Unlock AI Data Analysis: From Data Silos to Vibe Intelligence</title>
  <description>Your Analysts Spend 70% of Their Time Wrangling Data. </description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/88a04fc7-6b28-4a52-adb8-c2823a5ef52f/image__40_.jpg" length="105719" type="image/jpeg"/>
  <link>https://genai360.beehiiv.com/p/unlock-ai-data-analysis-from-data-silos-to-vibe-intelligence</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/unlock-ai-data-analysis-from-data-silos-to-vibe-intelligence</guid>
  <pubDate>Mon, 20 Oct 2025 14:33:37 +0000</pubDate>
  <atom:published>2025-10-20T14:33:37Z</atom:published>
    <dc:creator>Davit Buniatyan</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">I have had dozens of conversations with leaders at large enterprises, and their frustration is almost always the same. Nearly all of them say the same thing:</p><p class="paragraph" style="text-align:left;">“<span style="color:rgb(34, 34, 34);"><i>Our data is everywhere, but our insights are nowhere.</i></span>”</p><p class="paragraph" style="text-align:left;">Their teams are buried in manual data prep. They try to reconcile reports between systems that don’t talk to each other. They&#39;re fighting a losing battle against data chaos. The business suffers from slow, untrustworthy insights. </p><p class="paragraph" style="text-align:left;">What if an AI could do that soul-crushing 70% of the work? What if it could automate the integration bottleneck and free your team to actually be strategists?</p><p class="paragraph" style="text-align:left;">🚀 <b>Today, we&#39;re introducing Activeloop to unlock AI Data Analysis for GTM Operations.</b></p><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/4Ua3Llox2HE" width="100%"></iframe><h2 class="heading" style="text-align:left;" id="ai-analyst-that-vibes-intelligence">AI Analyst that Vibes Intelligence</h2><p class="paragraph" style="text-align:left;">This isn’t another dashboarding tool. It’s a new way to automate the single biggest bottleneck holding your business back. 
Activeloop functions as an AI Analyst for your team, automating the painful data harmonization and integration work that prevents them from doing high-impact analysis.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4e34c8a2-1343-4b0f-8a83-059aa304beaa/ezgif-7264f89e28e043.gif?t=1760423646"/></div><h3 class="heading" style="text-align:left;" id="from-weeks-to-seconds-deliver-justi">From Weeks to Seconds: Deliver “Just-in-Time” Intelligence</h3><p class="paragraph" style="text-align:left;">Imagine being in a forecast meeting and asking, <i>&quot;Why did this region&#39;s deal cycle lengthen last quarter?&quot;</i> and getting a trusted answer instantly. Activeloop empowers your entire organization with true self-service analytics, allowing you to infinitely drill down into your data without ever having to say, &quot;We&#39;ll get back to you.&quot;</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8dd74aad-2408-4c6e-ab68-d55b79c1e380/showmerevenue.png?t=1760970726"/></div><h3 class="heading" style="text-align:left;" id="free-your-analysts-to-be-strategist">Free Your Analysts to Be Strategists</h3><p class="paragraph" style="text-align:left;">By automating the integration bottleneck, we turn your operations team from &quot;report builders&quot; into the strategic engine your business needs. 
They can finally focus on uncovering revenue opportunities, optimizing sales processes, and driving GTM strategy, confident that the underlying data is unified and trusted.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fe7becfe-3e51-4e2f-b562-e7fd03e7e5e7/Screenshot_2025-10-13_at_11.02.25_PM.png?t=1760423740"/></div><h3 class="heading" style="text-align:left;" id="for-our-technical-audience-the-infr">For Our Technical Audience: The Infrastructure Powering the AI Analyst</h3><p class="paragraph" style="text-align:left;">The magic behind the AI Analyst is built on the robust, multimodal data infrastructure you know from Activeloop. We connect to fragmented sources (Salesforce, SAP, Snowflake, even unstructured contracts in SharePoint) and use <b>Deep Lake</b> to create a unified, AI-ready data foundation. This allows our reasoning models to query harmonized structured and unstructured data in real time, all accessible via our API for building custom automations.</p><hr class="content_break"><p class="paragraph" style="text-align:left;"><b>Ready to End the Data Chaos?</b></p><p class="paragraph" style="text-align:left;">We are now partnering with a select group of enterprises to solve their most complex data challenges.</p><p class="paragraph" style="text-align:left;">If you’re a leader who is tired of seeing your team buried in manual data work and ready to unlock the strategic potential of your organization, we want to talk to you.</p><p class="paragraph" style="text-align:left;"><b>👉 Request a Personalized Demo</b> <a class="link" href="https://www.activeloop.ai/sales/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=unlock-ai-data-analysis-from-data-silos-to-vibe-intelligence" target="_blank" rel="noopener noreferrer nofollow">https://www.activeloop.ai/sales/</a></p></div><div class='beehiiv__footer'><br 
class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=4b458b6b-cd98-45e8-90b9-6c23e2cf027e&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Introducing Multimodal Healthcare AI Systems Free Course</title>
  <description>We’re introducing our latest free course as part of our Gen AI 360 series: Building Multimodal Healthcare AI Systems with Deep Lake.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fcfe13f0-dac1-4f8f-98f0-8edadc714931/image__32_.png" length="507341" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/introducing-multimodal-healthcare-ai-systems-free-course</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/introducing-multimodal-healthcare-ai-systems-free-course</guid>
  <pubDate>Mon, 29 Sep 2025 14:09:10 +0000</pubDate>
  <atom:published>2025-09-29T14:09:10Z</atom:published>
    <dc:creator>Davit Buniatyan</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://learn.activeloop.ai/courses/building-multimodal-healthcare-ai?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=introducing-multimodal-healthcare-ai-systems-free-course"><span class="button__text" style=""> Start Learning Now </span></a></div><p class="paragraph" style="text-align:left;">Healthcare AI is developing rapidly, and multimodal data is at the core of this development. Deep Lake 4.0 allows you to integrate multimodal biomedical datasets (images, text, scientific literature, and more) with powerful generative AI tools, accelerating research. We’ve partnered with Bayer Radiology, Intel Corporation, and Amazon Web Services to bring you this course, which covers:</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fcfe13f0-dac1-4f8f-98f0-8edadc714931/image__32_.png?t=1759124495"/><div class="image__source"><span class="image__source_text"><p>Building Multimodal Healthcare AI Systems with Deep Lake</p></span></div></div><h3 class="heading" style="text-align:left;" id="1-standardizing-multimodal-data-in-">1. Standardizing multimodal data in healthcare AI with Croissant and Deep Lake</h3><p class="paragraph" style="text-align:left;">The Croissant file format is a metadata-rich standard, particularly valuable for managing biomedical datasets that combine images with clinical or experimental context. 
In this chapter, you’ll learn how to load biomedical data in Croissant, store them in Deep Lake, and apply version control through branching and merging, ensuring reproducibility and collaborative progress in biomedical research workflows.</p><h3 class="heading" style="text-align:left;" id="2-powerful-ai-search-on-a-natural-l">2. Powerful AI search on a natural language corpus of adverse drug effects with Deep Lake</h3><p class="paragraph" style="text-align:left;">Pharmaceutical datasets are complex, with unstructured clinical notes, social media posts, and scientific literature. In this chapter, you’ll learn how to process the ADE Corpus V2 into a query-ready format using Deep Lake, including loading the corpus, converting it into a Deep Lake dataset, and training a TinyLlama LoRA model, enabling powerful AI search across diverse biomedical text.</p><h3 class="heading" style="text-align:left;" id="3-automated-ll-mpowered-labeling-of">3. Automated LLM-powered labeling of radiology image datasets</h3><p class="paragraph" style="text-align:left;">Radiology datasets are often underutilized due to limited descriptive metadata. In this chapter, you’ll learn how to use LLMs with Deep Lake to generate natural language labels for images, transforming them into multimodal, searchable datasets.</p><h3 class="heading" style="text-align:left;" id="4-a-ipowered-biomedical-literature-">4. AI-powered biomedical literature review</h3><p class="paragraph" style="text-align:left;">The growing volume of biomedical literature makes comprehensive analysis difficult and time-consuming. In this chapter, you’ll learn how to leverage <b>Activeloop’s L0 reasoning model</b> to perform AI-powered literature reviews on multimodal scientific papers. This approach enables researchers to extract insights and answer complex questions efficiently.</p><h3 class="heading" style="text-align:left;" id="5-unified-multimodal-search-for-rap">5. 
Unified multimodal search for rapid drug discovery</h3><p class="paragraph" style="text-align:left;">Drug discovery is often slowed by siloed, multimodal datasets spanning literature, assays, sequences, and imaging. In this chapter, you’ll learn how to unify and index diverse biomedical data on AWS Sagemaker Lakehouse with Deep Lake, enabling AI Search across textual, numerical, and visual information. Researchers can rapidly shortlist promising drug candidates using lexical, semantic, and visual queries, dramatically accelerating the path from data to actionable insights.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/4a53f15d-a090-4f6c-abe6-0fefa7266856/image__33_.png?t=1759124528"/><div class="image__source"><span class="image__source_text"><p>Drug Discovery AI Search on AWS Sagemaker Lakehouse</p></span></div></div><p class="paragraph" style="text-align:left;">Authored by Darsh Mandera (Activeloop) and Steffen Vogler (Bayer Radiology).</p><p class="paragraph" style="text-align:left;">Special thanks to Vitor Freitas (AWS), Gitika Vijh (AWS), Susan Marquez (Intel Corporation) and Arijit Bandyopadhyay (Intel Corporation).</p><h3 class="heading" style="text-align:left;" id="big-upgrade-to-all-courses">Big Upgrade to All Courses</h3><p class="paragraph" style="text-align:left;">Additionally, we’ve also upgraded all of our previously released courses on <a class="link" href="https://learn.activeloop.ai?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=introducing-multimodal-healthcare-ai-systems-free-course" target="_blank" rel="noopener noreferrer nofollow">learn.activeloop.ai</a> to run on Deep Lake 4.0, the latest version of Deep Lake, as well as the latest versions of key artificial intelligence libraries like LlamaIndex and LangChain, providing a modernized learning experience.</p><div class="button" 
style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://learn.activeloop.ai/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=introducing-multimodal-healthcare-ai-systems-free-course"><span class="button__text" style=""> Visit Our Upgraded Learning Platform </span></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=7f3e7a66-2cb8-4558-9615-13ff37749ec0&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Activeloop-L0: State-of-the-Art RAG Accuracy on Your Data</title>
  <description>Turn PDFs, images &amp; tables into instant, cited answers</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/93326520-4845-41dd-8480-49e617a2499e/Activeloop_L0_-_Benchmark__1_.png" length="125133" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/activeloop-l0-state-of-the-art-rag-accuracy-on-your-data</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/activeloop-l0-state-of-the-art-rag-accuracy-on-your-data</guid>
  <pubDate>Tue, 13 May 2025 11:05:00 +0000</pubDate>
  <atom:published>2025-05-13T11:05:00Z</atom:published>
    <dc:creator>Davit Buniatyan</dc:creator>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">While working on <a class="link" href="https://github.com/activeloopai/deeplake?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer nofollow" style="color: inherit">Deep Lake</a>, I have seen many RAG systems collapse when exposed to production-scale corporate data. They often rely on predefined loops, custom logic, and rigid agent scaffolds. Activeloop-L0 provides your agent with highly precise answers grounded in your multimodal data. </p><h2 class="heading" style="text-align:left;" id="but-wait-is-rag-still-relevant-desp"><b>Why can’t we reliably analyze corporate documents</b>? </h2><ul><li><p class="paragraph" style="text-align:left;"><b>Architectural hurdles</b>: messy data integrations, unexpected infra costs, and reliability/safety constraints.</p></li><li><p class="paragraph" style="text-align:left;"><b>Commodity RAG</b> lacks depth for multimodal enterprise data (documents, images, audio).</p></li><li><p class="paragraph" style="text-align:left;"><b>Infrastructure burden</b>: parsing, chunking, embeddings, indexing, vector DBs, and agent loops slow teams.</p></li></ul><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/da113d21-2180-4da1-9598-eeeaaf2d3f58/Screenshot_2025-05-08_at_5.25.56_PM.png?t=1746750360"/></div><h2 class="heading" style="text-align:left;" id="but-wait-is-rag-still-relevant-desp">But wait, is RAG still relevant despite large context models? 
</h2><p class="paragraph" style="text-align:left;">Let’s consider four extensive NASA documents [<a class="link" href="https://www.nasa.gov/wp-content/uploads/2022/03/sls-reference-guide-2022-v2-508-0.pdf?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer nofollow">1</a>, <a class="link" href="https://www.nasa.gov/wp-content/uploads/2023/02/orion-reference-guide-111022.pdf?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer nofollow">2</a>, <a class="link" href="https://www.lpi.usra.edu/lunar/artemis/Artemis-I-Reference-Guide_NP-2022-03-3045-HQ.pdf?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer nofollow">3</a>, <a class="link" href="https://www.ulalaunch.com/docs/default-source/rockets/2023_vulcan_user_guide.pdf?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer nofollow">4</a>], each 80 to 100 pages long and containing visual descriptions, and pose a highly complex question.</p><p class="paragraph" style="text-align:left;">ChatGPT with o3, despite having the full PDFs in context, failed after 11 minutes of reasoning. Now, imagine you have thousands of corporate documents that can’t be contained in a context. 
In contrast, Activeloop-L0 provided the correct answer in 4 minutes and can scale to a million documents.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/89187f47-215e-48a5-a9eb-cc8cb85828c9/image__8_.png?t=1746747333"/></div><h2 class="heading" style="text-align:left;" id="what-is-activeloop-l-0">What is Activeloop-L0?</h2><p class="paragraph" style="text-align:left;"><b>Activeloop-L0</b> is a compound AI system that ingests your unstructured data and returns grounded answers. Behind the scenes, <a class="link" href="https://github.com/activeloopai/deeplake?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer nofollow">Deep Lake</a> indexes neural representations at scale, then fuses “thinking tokens” with high-precision retrieval for fast multi-hop reasoning.</p><p class="paragraph" style="text-align:left;">It is available on <b><a class="link" href="https://chat.activeloop.ai?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer nofollow">chat.activeloop.ai</a></b> now. 
</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://www.ycombinator.com/launches/NUM-activeloop-l0-state-of-the-art-rag-accuracy-on-your-data?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data"><span class="button__text" style="color:#F9FAFB;"><b>Upvote on Launch YC</b></span></a></div><h3 class="heading" style="text-align:left;" id="how-is-it-different-compared-to-a-t">How is it different from traditional RAG?</h3><ul><li><p class="paragraph" style="text-align:left;"><b>Multimodal:</b> Built-in support for images, PDFs, audio, and spreadsheets.</p></li><li><p class="paragraph" style="text-align:left;"><b>Integrated Reasoning & Retrieval:</b> Eliminates the need for loops.</p></li><li><p class="paragraph" style="text-align:left;"><b>Deep Indexing:</b> Cost-effective multi-layer indexing for richer context early on.</p></li><li><p class="paragraph" style="text-align:left;"><b>Simple:</b> Focus on innovation, not maintaining infrastructure.</p></li><li><p class="paragraph" style="text-align:left;"><b>Grounded and Accurate:</b> Clear citations for trustworthy insights.</p></li></ul><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/951ee900-7e2b-4021-8df1-d4a0481f051d/image.png?t=1747034419"/></div><h3 class="heading" style="text-align:left;" id="how-is-it-different-compared-to-a-t">How accurate is Activeloop-L0?</h3><p class="paragraph" style="text-align:left;">Activeloop-L0 achieves state-of-the-art accuracy of 85.6% overall on 1,142 multimodal questions (292 PDFs, 5.5K pages). 
It outperforms text-only RAG by +20%, visual RAG by +10%, and Alibaba’s ViDoRAG by +6% on their own ViDoSeek benchmark.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/93326520-4845-41dd-8480-49e617a2499e/Activeloop_L0_-_Benchmark__1_.png?t=1747033556"/></div><h3 class="heading" style="text-align:left;" id="how-is-it-different-compared-to-a-t">Is there an OpenAI-compliant API? </h3><p class="paragraph" style="text-align:left;">Yes, Activeloop-L0 is available with an OpenAI-compliant API. You can easily plug it into your agents to provide highly relevant context. You can get started here: <a class="link" href="https://docs.activeloop.ai/setup/quickstart?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer nofollow">https://docs.activeloop.ai/setup/quickstart</a></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/061280ce-09a0-4ea5-93ed-d5980ba8a10f/image__9_.png?t=1746747559"/></div><h3 class="heading" style="text-align:left;" id="read-to-deploy-on-your-data">Ready to Deploy on Your Data?</h3><p class="paragraph" style="text-align:left;">Activeloop is trusted by Fortune 500 companies, including the likes of <b><a class="link" href="https://www.activeloop.ai/usecase/bayer/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer nofollow">Bayer</a></b>, <b><a class="link" href="https://www.activeloop.ai/usecase/flagshippioneering/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer 
nofollow">Flagship Pioneering</a></b>, <a class="link" href="https://www.activeloop.ai/usecase/matterport/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data" target="_blank" rel="noopener noreferrer nofollow">Matterport</a>.</p><ul><li><p class="paragraph" style="text-align:left;"><b>Your Cloud</b>: Deploy on your cloud, ensuring data never leaves your infrastructure.</p></li><li><p class="paragraph" style="text-align:left;"><b>Your Models</b>: Integrate your LLMs.</p></li><li><p class="paragraph" style="text-align:left;"><b>Your Security</b>: SOC2 compliance, fine-grained access control, and SSO.</p></li></ul><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://www.activeloop.ai/contact/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=activeloop-l0-state-of-the-art-rag-accuracy-on-your-data"><span class="button__text" style=""> Book a Call with Us </span></a></div></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=a52fe738-6a67-4466-83cb-b4d4083d73a8&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
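The OpenAI-compliant API described above should work with any standard HTTP client that speaks the usual chat-completions protocol. Below is a minimal stdlib sketch; the base URL, the model name (`activeloop-l0`), and the `/chat/completions` route are illustrative assumptions rather than documented values, so check the quickstart at docs.activeloop.ai for the real ones.

```python
import json
import urllib.request

def build_chat_request(base_url, api_key, model, question):
    """Build a standard OpenAI-style /chat/completions request.

    base_url and model are illustrative placeholders, not documented values.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": question}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

def ask(base_url, api_key, model, question):
    # Send the request and return the assistant's answer text,
    # following the OpenAI chat-completions response shape.
    req = build_chat_request(base_url, api_key, model, question)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the payload has the same shape the official OpenAI client sends, existing agent code that targets an OpenAI-compatible endpoint should only need the base URL and model name swapped.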
  ]]></content:encoded>
</item>

      <item>
  <title>Introducing AI Knowledge Agent</title>
  <description>Deep Research on Your Multi-Modal Data</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/dd31995a-5964-48ab-941e-69d383c7ba77/Frame_2117131215__1_.png" length="445705" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/introducing-ai-knowledge-agent</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/introducing-ai-knowledge-agent</guid>
  <pubDate>Tue, 25 Feb 2025 08:01:37 +0000</pubDate>
  <atom:published>2025-02-25T08:01:37Z</atom:published>
    <category><![CDATA[Activeloop]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">We&#39;re introducing the Deep Lake AI Knowledge Agent, our take on AI search that can do deep research not only on public but also private data—no matter its size or modality!</p><p class="paragraph" style="text-align:left;">Scroll to learn more about it, or try out our quick interactive demo (it takes just 20 seconds).</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://app.arcade.software/share/QnHWPDFT9HYHwDhnIPVt?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=introducing-ai-knowledge-agent"><span class="button__text" style="color:#F9FAFB;"> Try the Demo </span></a></div><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e8dc94a9-5cfa-4598-a3f5-0f2ac331b380/Frame_2117131203__1_.png?t=1740467379"/><div class="image__source"><span class="image__source_text"><p>Ingest and search data with vision-language models</p></span></div></div><p class="paragraph" style="text-align:left;">Deep Lake supports multi-modal retrieval from the ground up. It uses vision-language models for data ingestion and retrieval so that you can connect any data (PDFs, images, videos, structured data, etc.). 
You can even combine different types of data and query them together.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b268312a-1469-43f9-8866-4b9e4d7b2ed4/multimodal_research__5_.png?t=1740467244"/><div class="image__source"><span class="image__source_text"><p>Query planning and deep reasoning</p></span></div></div><p class="paragraph" style="text-align:left;">Deep Lake deconstructs the query, understanding where to find the answer to your question, and does just that!</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/11645dc1-3048-46cc-99a8-018d3d23c3b2/connect_any_source__6_.png?t=1740467187"/><div class="image__source"><span class="image__source_text"><p>Connect any source (GCP, Azure, S3, Dropbox)</p></span></div></div><p class="paragraph" style="text-align:left;">Unlike any other provider on the market, Deep Lake&#39;s Deep Research works on any data from S3, Dropbox, and GCP. 
It also learns from your queries over time, making the results as relevant to your work as possible!</p><p class="paragraph" style="text-align:left;">We&#39;ve just launched on ProductHunt, so would appreciate your support - you can join the discussion by clicking the link below!</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://www.producthunt.com/posts/deep-lake-ai-knowledge-agent?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=introducing-ai-knowledge-agent"><span class="button__text" style="color:#F9FAFB;"> Support on ProductHunt </span></a></div><p class="paragraph" style="text-align:left;">Many thanks!</p><p class="paragraph" style="text-align:left;"></p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=8c4a77e4-4a2f-4a51-bbd9-533ce65a72c3&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>OpenAI&#39;s New AI Agent, Forge API Launches, FrontierMath Benchmark</title>
  <description>Plus, Google releases software code for AlphaFold 3 </description>
  <link>https://genai360.beehiiv.com/p/openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark</guid>
  <pubDate>Tue, 25 Feb 2025 05:46:18 +0000</pubDate>
  <atom:published>2025-02-25T05:46:18Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h2 class="heading" style="text-align:left;" id="key-takeaways">Key Takeaways</h2><ul><li><p class="paragraph" style="text-align:left;">OpenAI&#39;s &quot;<b>Operator,&quot;</b> an AI agent expected to release in January 2025, aims to execute tasks directly on users&#39; computers, competing with Anthropic and Google in consumer-facing AI tools.</p></li><li><p class="paragraph" style="text-align:left;">Nous Research’s Forge API introduces <b>reasoning enhancements</b> via Monte Carlo Tree Search, Chain of Code, and a model mixture strategy, outperforming competitors in math reasoning benchmarks.</p></li><li><p class="paragraph" style="text-align:left;"><b>Qwen 2.5 </b>demonstrated SOTA coding capabilities, matching GPT-4o’s performance across multiple programming languages and benchmarks.</p></li><li><p class="paragraph" style="text-align:left;"><b>FrontierMath </b>evaluates AI systems on challenging mathematical reasoning, with experts confirming its rigor and AI’s limited ability to solve complex problems.</p></li><li><p class="paragraph" style="text-align:left;">Meta’s <b>Watermark Anything Model</b> redefined watermarking as segmentation, achieving over 85% accuracy in detecting watermarked areas and resilience to manipulations like splicing.</p></li><li><p class="paragraph" style="text-align:left;"><b>CDXFormer</b> improved spatial-temporal context analysis in satellite images using XLSTM, achieving SOTA performance across benchmarks with reduced computational costs.</p></li></ul><p class="paragraph" style="text-align:left;"><i>Got forwarded this newsletter? 
Subscribe below👇</i></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/subscribe?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark"><span class="button__text" style=""> Subscribe </span></a></div><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">It’s been a little while, but there were plenty of releases last week. OpenAI isn’t showing any signs of slowing down and is discussing the release of an <b>AI agent</b> in January 2025. Sutskever brought up some interesting results from scaling up pre-training as well. We also saw a bunch of cool models from Google with unique applications, including flood forecasting and predicting molecular structures. </p><p class="paragraph" style="text-align:left;">The release of <b>Qwen 2.5</b> was also surprising, given how well it performed on various benchmarks, as was a new math benchmark that received praise from some of the most well-known mathematicians in the world.</p><h3 class="heading" style="text-align:left;" id="open-ai-and-rivals-pivot-from-scali">OpenAI and Rivals Pivot from Scaling to Advanced Reasoning</h3><p class="paragraph" style="text-align:left;">We’re seeing a shift in the AI industry’s approach to developing LLMs. 
Companies like <a class="link" href="https://www.reuters.com/technology/artificial-intelligence/openai-rivals-seek-new-path-smarter-ai-current-methods-hit-limitations-2024-11-11/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">OpenAI</a> are moving away from the &quot;<b>bigger is better</b>&quot; philosophy of simply scaling up models with more data and computing power. Instead, they&#39;re exploring more sophisticated techniques that mimic human-like reasoning.</p><p class="paragraph" style="text-align:left;">OpenAI&#39;s recently released o1 model is a perfect example of how we’re moving toward this new direction. It uses &quot;<b>test-time compute</b>,&quot; a technique that enhances AI models during the inference phase. This method lets models generate and evaluate multiple possibilities in real-time, dedicating more processing power to challenging tasks that require complex reasoning.</p><p class="paragraph" style="text-align:left;">Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, mentioned results from <b>scaling up</b> pre-training have plateaued, marking a transition from &quot;the age of scaling&quot; to &quot;the age of wonder and discovery.&quot;</p><p class="paragraph" style="text-align:left;">Some effects of this shift could include:</p><ul><li><p class="paragraph" style="text-align:left;">It may reshape the AI arms race, with companies focusing on developing more efficient reasoning techniques rather than just increasing model size.</p></li><li><p class="paragraph" style="text-align:left;">The demand for computational resources could change, which might affect companies like Nvidia that have dominated the AI chip market.</p></li><li><p class="paragraph" style="text-align:left;">It could lead to more distributed, cloud-based servers for inference.</p></li></ul><p class="paragraph" 
style="text-align:left;">Other major AI labs, including Anthropic, xAI, and Google DeepMind, are reportedly working on their own versions of these <b>advanced reasoning techniques.</b> This means we might be in a time where innovation in model architecture and training methods may become as crucial as raw computational power.</p><h3 class="heading" style="text-align:left;" id="open-ai-prepares-operator-launch-an">OpenAI Prepares &#39;Operator&#39; Launch and Shares LLM Optimization Strategies</h3><p class="paragraph" style="text-align:left;">Previously, OpenAI released a lightweight library called <a class="link" href="https://genai360.beehiiv.com/p/of-new-benchmarks?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">Swarm</a>. OpenAI has now revealed plans to release &quot;<a class="link" href="https://techcrunch.com/2024/11/13/openais-take-on-ai-agents-could-come-in-january/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">Operator,</a>&quot; an upcoming<b> AI agent tool</b>, which is set for release as early as January 2025, with plans to initially offer it as a research preview via the developer API. The tool is designed to execute tasks directly on users&#39; computers.</p><p class="paragraph" style="text-align:left;">Operator is expected to compete with Anthropic&#39;s &quot;<b>Computer Use</b>&quot; feature and Google&#39;s rumored consumer-focused agent, potentially offering general-purpose capabilities in web browsers. 
As of right now, details on Operator&#39;s unique advantages aren’t entirely clear, although it aims to simplify task execution.</p><p class="paragraph" style="text-align:left;">OpenAI also released a post detailing how to <a class="link" href="https://platform.openai.com/docs/guides/optimizing-llm-accuracy/understanding-the-tools?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">optimize LLMs</a>, since there are issues due to varying requirements for accuracy, method selection, and production-readiness, requiring a structured approach tailored to <b>specific use cases.</b></p><p class="paragraph" style="text-align:left;">They recommend beginning with prompt engineering for simple tasks, then progressing to RAG for dynamic context and fine-tuning for consistent behavior and task-specific accuracy.</p><p class="paragraph" style="text-align:left;">Afterwards, they mention you should establish an evaluation set to diagnose failures and iteratively refine optimization methods, ensuring tools like RAG or fine-tuning are applied when <b>prompt engineering alone falls short.</b></p><p class="paragraph" style="text-align:left;">On the legal side of things, a New York judge dismissed RawStory&#39;s lawsuit <a class="link" href="https://www.linkedin.com/feed/update/urn:li:activity:7260734796497068032/?updateEntityUrn=urn%3Ali%3Afs_updateV2%3A%28urn%3Ali%3Aactivity%3A7260734796497068032%2CFEED_DETAIL%2CEMPTY%2CDEFAULT%2Cfalse%29&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">against OpenAI</a>, which might set a precedent for future cases involving AI training on copyrighted data. 
The dismissal was primarily based on technical grounds, though the judge&#39;s reasoning touched on broader issues related to <b>AI and copyright.</b></p><p class="paragraph" style="text-align:left;">The court found that authors aren’t plagiarized during AI training because the process involves a broad set of data and doesn’t result in <b>verbatim copying.</b></p><h3 class="heading" style="text-align:left;" id="forge-ap-is-reasoning-power-and-co-">Forge API&#39;s Reasoning Power and CoPilot Arena&#39;s Coding Leaderboard</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcra_jv5YCcIPaZN4OAL6ThwcOurcGvtnhhjeEIiaKNd8VobkSgc8XsKMYS8K0Q1NzLi4UbvZXpR50u3cfVy_qCND4h4GABmkBbe75sfqU1KnHmNDqXGxR2XeTdRCoqi5sBoNvK?key=vox0MMs9NtX7u4rAwVZjGEXL"/><div class="image__source"><span class="image__source_text"><p>Hermes 3 70B is able to perform well on various reasoning benchmarks. <a class="link" href="https://nousresearch.com/introducing-the-forge-reasoning-api-beta-and-nous-chat-an-evolution-in-llm-inference/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Nous Research is launching<a class="link" href="https://nousresearch.com/introducing-the-forge-reasoning-api-beta-and-nous-chat-an-evolution-in-llm-inference/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow"> two new projects</a>: the Forge Reasoning API Beta and Nous Chat. 
Nous Chat is a simple chat platform featuring the<b> Hermes 3 70B language model, </b>which is available for free.</p><p class="paragraph" style="text-align:left;">The Forge Reasoning API Beta is being released to a select group of users, focusing on testing the architecture of their reasoning system. It allows users to enhance any popular model with a code interpreter and<b> advanced reasoning capabilities.</b></p><p class="paragraph" style="text-align:left;">Moreover, the API<b> supports multiple models </b>(Hermes 3, Claude Sonnet 3.5, Gemini, GPT-4) and allows users to combine models for enhanced output diversity.</p><p class="paragraph" style="text-align:left;">Here’s what’s involved in terms of reasoning layer architectures:</p><ul><li><p class="paragraph" style="text-align:left;"><b>MCTS (Monte Carlo Tree Search)</b>: Iteratively builds decision trees for planning problems through selection, expansion, simulation, and backpropagation. </p></li><li><p class="paragraph" style="text-align:left;"><b>CoC (Chain of Code): </b>Connects reasoning steps to a code interpreter, improving code and math capabilities. </p></li><li><p class="paragraph" style="text-align:left;"><b>MoA (Mixture of Agents):</b> Allows multiple models to collaborate on a query, providing more complete and diverse outputs.</p></li></ul><p class="paragraph" style="text-align:left;">Nous Research claims Hermes 70B augmented with Forge <b>outperforms</b> larger models from Google, OpenAI, and Anthropic in reasoning benchmarks. Specifically, they mentioned superior performance in the AIME evaluation, which focuses on competition-grade math questions.</p><p class="paragraph" style="text-align:left;">Ever wondered what model would take the first place spot for coding? 
<a class="link" href="https://blog.lmarena.ai/blog/2024/copilot-arena/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">LMArena</a> released a code completions leaderboard using data from the<b> previous month </b>to answer that question.</p><p class="paragraph" style="text-align:left;"><b>Nine popular models</b> were evaluated, including open-source, code-specific, and commercial models. Top performers were DeepSeek V2.5 and Claude Sonnet 3.5, with Elo ratings of 1074 and 1053, respectively. The evaluation process included randomization of model pairings and positions, and standardized parameters for fair comparison.</p><p class="paragraph" style="text-align:left;">Moreover, a free AI coding assistant called Copilot Arena was launched recently, providing paired responses from different state-of-the-art LLMs. It has been downloaded 2,500 times on the VSCode Marketplace, served over 100,000 completions, and accumulated over 10,000 code completion battles. The tool offers paired <b>code completions </b>and inline editing features.</p><p class="paragraph" style="text-align:left;">A novel prompting technique was also developed to enable chat models to perform code completions, especially for &quot;fill-in-the-middle&quot; (FiM) tasks. The method involves<b> generating code snippets </b>and post-processing them, rather than forcing models to output in FiM format directly. 
This approach drastically reduced formatting errors across various models.</p><h3 class="heading" style="text-align:left;" id="tackling-climate-protein-and-math-c">Tackling Climate, Protein and Math Challenges</h3><p class="paragraph" style="text-align:left;">Google continues to work on unique AI applications, previously introducing a <a class="link" href="https://genai360.beehiiv.com/p/of-whales-and-strawberries?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">whale bioacoustics model.</a> Now, they’ve developed a new <a class="link" href="https://research.google/blog/a-flood-forecasting-ai-model-trained-and-evaluated-globally/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">flood forecasting model </a>with improved reliability and coverage. The model provides a <b>7-day lead time reliability</b> comparable to current best available nowcasts. It expands coverage to 100 countries with verified data and up to 150 countries with data based on virtual gauges.</p><p class="paragraph" style="text-align:left;"><br>The improved model quality and new evaluation approach have expanded coverage. Google Flood Hub now reaches users in <b>over 100 countries</b>, up from 80. This expansion enables Google to provide critical flood information to 700 million people worldwide, up from 460 million previously. </p><p class="paragraph" style="text-align:left;">The model incorporates DeepMind&#39;s medium-range global weather forecasting model as input. Training data has been<b> increased </b>from 5,680 gauges to nearly 16,000 gauges. 
The LSTM-based architecture has been improved to better combine different weather products and increase robustness to missing data.</p><p class="paragraph" style="text-align:left;">We saw an open-source implementation of AlphaFold 3 called <a class="link" href="https://genai360.beehiiv.com/p/of-llamas-and-proteins?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">LIGO</a> a couple of months back. Google DeepMind has now released the software code for <a class="link" href="https://github.com/google-deepmind/alphafold3" target="_blank" rel="noopener noreferrer nofollow">AlphaFold 3</a>, allowing non-commercial use. This comes <b>six months</b> after initially withholding the code, which drew criticism from scientists (as you’d expect).</p><p class="paragraph" style="text-align:left;">AlphaFold 3 can model <b>proteins interacting with other molecules</b>, including potential drugs. The code is now downloadable, but model weights are only available to academics upon request. Several companies have developed AlphaFold 3-inspired models, including Baidu, ByteDance, and Chai Discovery. 
</p><p class="paragraph" style="text-align:left;">AI in math also saw advancements with <a class="link" href="https://x.com/EpochAIResearch/status/1854993676524831046?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">FrontierMath</a> - a new benchmark designed to test the limits of AI systems in <b>mathematical reasoning.</b> The benchmark aims to capture a snapshot of contemporary math and evaluate AI&#39;s progress towards innovative thinking needed for scientific research.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcadX077wxykYYSugzX3xnXXV29hqtOHhy1NmX9sXnwJGVDTvIk9T5X_rN5zrtmM5OUuELkf3chSz8LzRvxfxSizSnsQd10Br1f-ZuhUYS-hUOF1S5ZlTpQx2VxQGi0RSqHXXWwPQ?key=vox0MMs9NtX7u4rAwVZjGEXL"/><div class="image__source"><span class="image__source_text"><p>FrontierMath deals with the saturation issue that other benchmarks face. 
<a class="link" href="https://x.com/EpochAIResearch/status/1854993680115155281/photo/1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">A quick rundown of the new benchmark’s key features:</p><ul><li><p class="paragraph" style="text-align:left;">All problems are new and unpublished to prevent data contamination</p></li><li><p class="paragraph" style="text-align:left;">Solutions are automatically verifiable, enabling efficient evaluation</p></li><li><p class="paragraph" style="text-align:left;">Problems are &quot;guessproof&quot; with a low chance of solving without proper reasoning</p></li><li><p class="paragraph" style="text-align:left;">Each problem demands hours of work from expert mathematicians</p></li></ul><p class="paragraph" style="text-align:left;"><br>The benchmark has even been validated by <b>Fields Medalists</b> like Terence Tao (considered the best mathematician in the world), Timothy Gowers, and Richard Borcherds.</p><h3 class="heading" style="text-align:left;" id="qwen-25-matches-gpt-4-as-x-expands-">Qwen 2.5 Matches GPT-4 as X Expands Grok Access </h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcY3nYHSxZCi-gQr2dgZsRcGh4iBqCgqfOFWjRwlb-tTxsFqPbVxN_878LC7FVlHxU0yRLiNl32K1YiS5rOU9i132Qb09UtBxKhGKnHK5OiAupUjniguwUHOMt3gjcQQNLOuzZ5?key=vox0MMs9NtX7u4rAwVZjGEXL"/><div class="image__source"><span class="image__source_text"><p>Qwen2.5 showed promising results on various coding benchmarks. 
<a class="link" href="https://qwenlm.github.io/blog/qwen2.5-coder-family/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://qwenlm.github.io/blog/qwen2.5-coder-family/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">Qwen2.5-Coder-32B-Instruct</a> caught us off-guard last week, with claims that it is the current SOTA open-source code model, matching GPT-4o&#39;s <b>coding capabilities</b>. It excels in code generation, repair, and reasoning across multiple programming languages.</p><p class="paragraph" style="text-align:left;">The model scored 73.7 on the Aider code repair benchmark, comparable to GPT-4o. It performs well across <b>40+ programming languages</b>, scoring 65.9 on McEval and 75.2 on MdEval.</p><p class="paragraph" style="text-align:left;">The series includes six model sizes: 0.5B, 1.5B, 3B, 7B, 14B, and 32B. Moreover, both Base and Instruct versions are available for each size. As you’d expect, performance scales positively with model size across <b>various benchmarks</b>. The series also outperforms other open-source models across all sizes on core datasets.</p><p class="paragraph" style="text-align:left;">In other news, X began to test a <a class="link" href="https://techcrunch.com/2024/11/10/x-is-testing-a-free-version-of-ai-chatbot-grok/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">free version of Grok</a>. 
Previously, Grok was limited to <b>premium</b>, paying users of the platform.</p><p class="paragraph" style="text-align:left;">Sounds great, but there are some <b>restrictions</b> that free users will need to keep in mind:</p><ul><li><p class="paragraph" style="text-align:left;">10 queries per two hours with the Grok-2 model</p></li><li><p class="paragraph" style="text-align:left;">20 queries per two hours with the Grok-2 mini model</p></li><li><p class="paragraph" style="text-align:left;">3 image analysis questions per day</p></li></ul><p class="paragraph" style="text-align:left;">xAI launched Grok-2 in August with <b>image generation capabilities</b>. The model recently gained the ability to understand images. These features were previously exclusive to Premium and Premium+ users, but are now open to free users as well.</p><h3 class="heading" style="text-align:left;" id="deep-l-voice-translates-as-bria-rmb">DeepL Voice Translates as BRIA RMBG 2.0 Removes Backgrounds</h3><p class="paragraph" style="text-align:left;">DeepL, a German startup valued at $2 billion, has launched <a class="link" href="https://techcrunch.com/2024/11/13/deepl-launches-deepl-voice-real-time-text-based-translations-from-voices-and-videos/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">DeepL Voice</a>, a real-time audio translation service. The service can &quot;hear&quot; 13 languages and provide translated captions in 33 languages supported by DeepL Translator. Currently, DeepL Voice outputs <b>translations</b> as text, not audio, focusing on live conversations and video conferencing.</p><p class="paragraph" style="text-align:left;">Translations can appear as &quot;mirrors&quot; on smartphones for in-person meetings or as side-by-side transcriptions. 
For video conferencing, translations appear as subtitles, with Microsoft Teams being the only integrated platform so far. There&#39;s <b>no API</b> for the voice product yet, as DeepL is working directly with partners and customers.</p><p class="paragraph" style="text-align:left;">In other news, BRIA AI released a new background removal model called <a class="link" href="https://huggingface.co/briaai/RMBG-2.0?utm_campaign=RMBG%202.0&utm_source=linkedin&utm_medium=social&utm_content=Hugging%20Face%20RMBG2.0" target="_blank" rel="noopener noreferrer nofollow">RMBG v2.0</a>. It&#39;s designed for separating foreground from background across various image categories. The model is trained on a <b>diverse dataset</b> including stock images, e-commerce, gaming, and advertising content.</p><p class="paragraph" style="text-align:left;">It aims to rival leading source-available models in accuracy, efficiency, and versatility. The model is particularly suitable for <b>enterprise-scale content creation</b> where content safety, legal compliance, and bias mitigation are crucial.</p><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">Some interesting papers came to light last week, ranging from the CDXFormer model for remote sensing change detection to the critical evaluation of domain-adaptive pretraining for medical applications. Meta FAIR also released a paper looking at the issue of <b>localized image watermarking.</b></p><h3 class="heading" style="text-align:left;" id="cdx-former-a-new-approach-to-remote">CDXFormer: A New Approach to Remote Sensing Change Detection</h3><p class="paragraph" style="text-align:left;">Researchers from Zhejiang University tackled the critical challenge of effectively identifying changes in <b>remote sensing images</b> across complex and varied environmental conditions. 
They recognized that existing methods such as Convolutional Neural Networks, Transformers, and Mamba-based models struggle to balance performance and computational efficiency when analyzing spatial-temporal contexts.</p><p class="paragraph" style="text-align:left;">To address these limitations, they developed <a class="link" href="https://arxiv.org/pdf/2411.07863v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">CDXFormer,</a> a new approach that uses Extended Long Short-Term Memory (XLSTM) technology. Their method introduces a scale-specific feature enhancement layer with <b>two key components</b>: a Cross-Temporal Global Perceptron for semantic-accurate deep features and a Cross-Temporal Spatial Refiner for detail-rich shallow features. </p><p class="paragraph" style="text-align:left;">Additionally, they implemented a Cross-Scale Interactive Fusion module to progressively integrate spatial information and global semantics. It achieved <b>SOTA performance</b> across three benchmark datasets, improving F1 scores by 0.22%, 1.08%, and 7.46% compared to previous top-performing methods.</p><p class="paragraph" style="text-align:left;">Crucially, the model <b>maintained high efficiency,</b> using only 16.19 million parameters and 3.92 GFLOPs, significantly lower than competing approaches. </p><h3 class="heading" style="text-align:left;" id="medical-models-vs-general-domain-ll">Medical Models vs. 
General-Domain LLMs: New Insights into AI&#39;s Effectiveness in Healthcare</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdBx9ZJY9GBsqDorUzwBJCPy4sQBPnk-UQCkgSYjhg2A3QMNgFvOCqnGZanou4MtQMk_8LyhcWzguHh-3absHLaS_r1v_Gu2QFJEerdgTB9d3AeUMES6AxjoA2u9KDTQaKwXTa0TA?key=vox0MMs9NtX7u4rAwVZjGEXL"/><div class="image__source"><span class="image__source_text"><p>Overview of evaluation approach. <a class="link" href="https://arxiv.org/pdf/2411.04118?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">A recent study from Carnegie Mellon University and Mistral AI looks at the effectiveness of domain-adaptive pretraining (DAPT) for <a class="link" href="https://arxiv.org/pdf/2411.04118?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">medical applications</a> of LLMs and VLMs. They aimed to address the prevailing assumption that DAPT consistently enhances performance on medical tasks, particularly in <b>answering medical licensing</b> exam questions.</p><p class="paragraph" style="text-align:left;">To investigate this, they conducted a <b>comprehensive evaluation </b>comparing seven medical LLMs and two medical VLMs against their corresponding general-domain base models. </p><p class="paragraph" style="text-align:left;">They optimized prompts for each model independently and accounted for statistical uncertainty in their comparisons. The results revealed a <b>surprising trend:</b> most medical models didn’t consistently outperform their general-domain counterparts. 
In fact, medical LLMs only surpassed base models in 12.1% of cases, with ties in 49.8% and underperformance in 38.2%.</p><p class="paragraph" style="text-align:left;">These findings suggest SOTA general-domain models may already <b>possess strong medical knowledge</b> and reasoning capabilities when prompted appropriately. </p><h3 class="heading" style="text-align:left;" id="mit-study-provides-causal-evidence-"><span style="color:rgb(67, 67, 67);">MIT Study Provides Causal Evidence of AI&#39;s Impact on Scientific Discovery and Innovation</span></h3><p class="paragraph" style="text-align:left;">A study from <a class="link" href="https://aidantr.github.io/files/AI_innovation.pdf?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">MIT</a> provides the first causal evidence of AI&#39;s impact on real-world R&D, showing significant boosts to scientific discovery and product innovation. The research <b>exploits the randomized introduction </b>of an AI tool for materials discovery to over 1,000 scientists in a large U.S. firm&#39;s R&D lab.</p><p class="paragraph" style="text-align:left;">The study addresses key questions about AI&#39;s role in innovation, including its effects on the pace and direction of scientific breakthroughs, as well as its impact on scientists themselves. To investigate these issues, the researchers <b>analyzed detailed data</b> on each stage of R&D, from initial material discovery to patent filings and product prototypes.</p><p class="paragraph" style="text-align:left;">Key findings revealed AI-assisted researchers <b>discover 44% more materials</b>, leading to a 39% increase in patent filings and a 17% rise in downstream product innovation. </p><p class="paragraph" style="text-align:left;">Interestingly, the technology had strikingly <b>disparate effects</b> across researchers. 
While top scientists nearly doubled their output, the bottom third saw little benefit. The study found AI automated 57% of &quot;idea-generation&quot; tasks, shifting researchers&#39; focus to evaluating AI-suggested materials. Top scientists leveraged their domain knowledge to prioritize promising AI suggestions more effectively.</p><p class="paragraph" style="text-align:left;">It was reported that 82% of scientists experienced reduced job satisfaction due to decreased creativity and skill underutilization, so there are definitely challenges that still need to be <b>addressed in workforce adaptation to AI</b>.</p><h3 class="heading" style="text-align:left;" id="meta-fair-introduces-wam-redefining">Meta FAIR Introduces WAM: Redefining Watermarking as a Segmentation Task</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/abs/2411.07231?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">Meta FAIR’s paper</a> addresses the challenge of localized image watermarking, which traditional methods struggle to handle effectively. They aimed to solve the problem of watermarking specific areas within an image, allowing for multiple distinct watermarks and <b>improved robustness</b> against image manipulations like splicing and inpainting.</p><p class="paragraph" style="text-align:left;">To tackle this issue, the researchers introduce the Watermark Anything Model (WAM), which redefines watermarking as a segmentation task. WAM consists of an <b>embedder</b> that imperceptibly modifies the input image and an extractor that segments the received image into watermarked and non-watermarked areas while recovering hidden messages. 
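</p><p class="paragraph" style="text-align:left;">To make the &quot;watermarking as segmentation&quot; framing concrete, here is a toy NumPy sketch of the extraction side only: threshold per-pixel detection scores into a mask, then soft-majority-vote the message bits over that mask. All shapes, names, and numbers below are illustrative, not WAM&#39;s actual architecture:</p>

```python
import numpy as np

def extract_watermark(mask_logits, bit_logits, threshold=0.0):
    """Toy segmentation-style watermark extraction.

    mask_logits: (H, W) scores for "this pixel is watermarked".
    bit_logits:  (32, H, W) per-pixel scores for each message bit.
    Returns a boolean mask and the 32-bit message recovered by
    averaging bit scores over the detected region.
    """
    mask = mask_logits > threshold  # segment watermarked vs. clean pixels
    if not mask.any():
        return mask, None           # no watermark detected
    votes = bit_logits[:, mask].mean(axis=1)  # soft majority vote per bit
    message = (votes > 0).astype(int)
    return mask, message

# Toy example: a 4x4 watermarked patch inside an 8x8 image.
rng = np.random.default_rng(0)
true_bits = rng.integers(0, 2, size=32)
mask_logits = np.full((8, 8), -5.0)
mask_logits[:4, :4] = 5.0
bit_logits = np.zeros((32, 8, 8))
bit_logits[:, :4, :4] = (2 * true_bits - 1)[:, None, None]  # +1 / -1 per bit

mask, message = extract_watermark(mask_logits, bit_logits)
```

<p class="paragraph" style="text-align:left;">In the real system, a trained extractor network produces these per-pixel scores; the sketch only shows how segmentation plus voting turns them into localized messages.</p><p class="paragraph" style="text-align:left;">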
</p><p class="paragraph" style="text-align:left;">The model employs a two-stage training process, first focusing on robustness at low resolution without <b>perceptual constraints</b>, then fine-tuning for imperceptibility and multiple watermark handling.</p><p class="paragraph" style="text-align:left;">They used <b>deep learning techniques,</b> including a graph neural network architecture for the embedder and a vision transformer-based extractor. They incorporated a Just-Noticeable-Difference (JND) map to modulate watermark intensity and improve imperceptibility. </p><p class="paragraph" style="text-align:left;">The model demonstrates the ability to locate watermarked areas in spliced images and extract distinct <b>32-bit messages</b> from multiple small regions, even when they occupy as little as 10% of the image surface. Notably, WAM achieves over 85% mIoU in detecting watermarked areas and over 95% bit accuracy when hiding five 32-bit messages in regions covering 10% of an image, even after manipulations like horizontal flipping and contrast adjustment.</p><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">Anthropic’s CEO had a 5-hour discussion with Lex Fridman last week and gave us some interesting insights into the <b>scaling hypothesis</b> to think about. 
Meanwhile, another conversation about the evolution of LLMs by Andrew Ng also caught our attention.</p><h3 class="heading" style="text-align:left;" id="evolution-of-ll-ms-and-agentic-work">Evolution of LLMs and Agentic Workflows</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://x.com/AndrewYNg/status/1857117382378164267?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">Andrew Ng</a> shared insights on the evolving landscape of LLMs and their increasing optimization for <b>agentic workflows</b>. Ng highlights a significant shift in LLM development, moving beyond consumer-facing question-answering to more complex, iterative processes that enable AI agents to perform sophisticated tasks.</p><p class="paragraph" style="text-align:left;">Ng observes that while LLMs have been primarily tuned for direct <b>human interaction</b>, there&#39;s a growing trend towards optimizing them for agentic behaviors. This includes capabilities like tool use, function calling, and even computer operation, as demonstrated by Anthropic&#39;s recent release. 
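</p><p class="paragraph" style="text-align:left;">A minimal sketch of the agentic loop Ng describes, with a hard-coded stand-in for a real function-calling LLM (all names here are hypothetical):</p>

```python
# Sketch of a tool-use loop: the model plans tool calls, the runtime
# executes them. Real providers expose this via function-calling APIs;
# the "LLM" below is a hard-coded stand-in for illustration only.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def fake_llm_plan(task):
    """Stand-in for an LLM that maps a task to a list of tool calls."""
    if task == "compute 2 + 3":
        return [("add", (2, 3))]
    return [("upper", (task,))]

def run_agent(task):
    # 1) ask the model for a plan, 2) execute each tool call;
    # a real agent would feed results back to the model for further steps.
    return [TOOLS[name](*args) for name, args in fake_llm_plan(task)]

print(run_agent("compute 2 + 3"))  # [5]
```

<p class="paragraph" style="text-align:left;">The iterative feed-results-back step, omitted here, is exactly what the optimizations Ng describes are meant to speed up.</p><p class="paragraph" style="text-align:left;">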
He emphasizes the potential of these advancements to dramatically boost agentic performance in AI applications.</p><p class="paragraph" style="text-align:left;">The conversation outlines a <b>three-stage progression</b> in agentic LLM development:</p><ul><li><p class="paragraph" style="text-align:left;">Prompting existing LLMs for agentic behaviors</p></li><li><p class="paragraph" style="text-align:left;">Fine-tuning models for specific, high-value applications</p></li><li><p class="paragraph" style="text-align:left;">Major LLM providers integrating agentic capabilities directly into their models</p></li></ul><p class="paragraph" style="text-align:left;">Ng predicts that this trend will lead to<b> significant performance gains</b> in AI agents over the next few years, opening up new possibilities for complex, multi-step AI workflows. </p><h3 class="heading" style="text-align:left;" id="dario-amodei-discusses-scaling-hypo">Dario Amodei Discusses Scaling Hypothesis and Responsible AI Development</h3><p class="paragraph" style="text-align:left;">In a conversation on the Lex Fridman Podcast, <a class="link" href="https://www.shortform.com/podcast/episode/lex-fridman-podcast-2024-11-11-episode-summary-452-dario-amodei-anthropic-ceo-on-claude-agi-the-future-of-ai-humanity?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">Dario Amodei,</a> CEO of Anthropic, offered valuable insights into the state of AI and his company&#39;s approach to <b>responsible AI development.</b></p><p class="paragraph" style="text-align:left;">Amodei delved into the <b>scaling hypothesis</b>, which suggests that increasing the size and computational power of neural network models leads to significant capability growth across diverse tasks. 
He highlighted how larger models like GPT and CLIP have shown dramatic improvements when scaled up with more data and compute power. </p><p class="paragraph" style="text-align:left;">Amodei believes AI systems matching or exceeding human abilities across domains could be achievable as soon as <b>2026 or 2027</b>, though he acknowledges uncertainties remain.</p><p class="paragraph" style="text-align:left;">A key focus of the discussion was Anthropic&#39;s Responsible Scaling Plan (RSP), aimed at mitigating potential risks as AI becomes<b> more powerful</b>. This plan involves testing models for autonomous behavior and potential misuse, with escalating safety precautions as capabilities increase. Amodei emphasized the necessity of regulation in AI to address risks like malicious use and loss of human control.</p><p class="paragraph" style="text-align:left;">Amodei shared insights into Anthropic&#39;s work on the Claude AI model as well, which is being developed with human values in mind through <b>iterative testing and refinement. 
</b></p><h2 class="heading" style="text-align:left;" id="frameworks-we-love">Frameworks We Love</h2><p class="paragraph" style="text-align:left;">Some frameworks that caught our attention in the last week include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2411.08804?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">FinRobot</a>: Framework for equity research that integrates quantitative and qualitative analysis through a multi-agent Chain of Thought system</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2411.08599?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">XiYan-SQL</a>: Natural language to SQL framework that improves query generation quality and accuracy</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.linkedin.com/posts/sahar-mor_ive-open-sourced-a-key-component-of-one-activity-7262486932645916672-rwpH/?utm_source=share&utm_medium=member_android" target="_blank" rel="noopener noreferrer nofollow">VoiceLab</a>: Open-source framework for testing and optimizing voice agents, providing tools for custom metrics, model migration, and prompt testing. </p></li></ul><p class="paragraph" style="text-align:left;">If you want your framework to be featured here, reply to this email saying hi :)</p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in AI</h2><p class="paragraph" style="text-align:left;">Writer saw a successful Series C funding round for <b>$200 million</b>, while two key acquisitions took place last week: Anysphere acquired Supermaven and Red Hat acquired Neural Magic. 
Note that the sums for these acquisitions weren’t revealed.</p><h3 class="heading" style="text-align:left;" id="writer-raises-200-million-in-series">Writer Raises $200 Million in Series C Funding Round</h3><p class="paragraph" style="text-align:left;">Writer, an <b>enterprise generative AI platform</b>, raised <a class="link" href="https://techcrunch.com/2024/11/12/generative-ai-startup-writer-raises-200m-at-a-1-9b-valuation/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">$200 million</a> in a Series C round at a $1.9 billion valuation. The funding will support product development, including AI agents, customizable guardrails, and no-code tools, cementing its position in enterprise AI. Writer&#39;s Palmyra models, tailored for business needs, have attracted clients like Salesforce, Uber, and Intuit, highlighting its success amidst intense competition.</p><h3 class="heading" style="text-align:left;" id="anysphere-acquires-supermaven">Anysphere Acquires Supermaven</h3><p class="paragraph" style="text-align:left;">Anysphere, maker of the AI-powered code editor Cursor, has acquired AI coding assistant <a class="link" href="https://techcrunch.com/2024/11/12/anysphere-acquires-supermaven-to-beef-up-cursor/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">Supermaven</a> for an undisclosed sum. Supermaven&#39;s technology, including its low-latency AI model Babble, will enhance Cursor&#39;s upcoming Tab AI version for context-aware, intelligent coding, <b>especially for long sequences</b>. </p><p class="paragraph" style="text-align:left;">The merger aims to combine <b>advanced model capabilities</b> with a seamless editor UI, accelerating product development and maintaining Supermaven’s plugins. 
</p><h3 class="heading" style="text-align:left;" id="red-hat-acquires-neural-magic">Red Hat Acquires Neural Magic</h3><p class="paragraph" style="text-align:left;">Red Hat has <a class="link" href="https://techcrunch.com/2024/11/12/red-hat-acquires-ai-optimization-startup-neural-magic/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-new-ai-agent-forge-api-launches-frontiermath-benchmark" target="_blank" rel="noopener noreferrer nofollow">acquired Neural Magic</a> (also for an undisclosed sum), a startup focused on optimizing AI models to run efficiently on <b>standard processors</b> and GPUs, enhancing hybrid cloud AI performance. Neural Magic, founded in 2018, offers tools like vLLM for model serving, which Red Hat will integrate into its platforms like OpenShift AI and Red Hat Enterprise Linux AI. </p><p class="paragraph" style="text-align:left;">This acquisition aligns with Red Hat&#39;s goal to expand its<b> AI capabilities in flexible, open-source environments.</b></p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=efc5cb8c-1a16-4b36-8c5f-77a6ec8c8718&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>How to Leverage All of the World&#39;s Research for Scientific Discovery with GenAI</title>
  <description>A leading MedTech company aimed to speed up its manual research process for hundreds of scientists while ensuring top accuracy. With Activeloop&#39;s sub-second, multi-modal AI search, they connected all of the world&#39;s research papers to LLMs, cutting research time from weeks to days, advancing medical device development and patient outcomes. Learn how.</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/812f784f-7719-44d1-988a-ab46cc73d236/image_63347755.png" length="443908" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/scientific-research-medtech</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/scientific-research-medtech</guid>
  <pubDate>Wed, 13 Nov 2024 22:48:06 +0000</pubDate>
  <atom:published>2024-11-13T22:48:06Z</atom:published>
    <category><![CDATA[Activeloop]]></category>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><h2 class="heading" style="text-align:left;" id="accelerating-medical-research-with-"><b>Accelerating Medical Research with AI: A MedTech Company’s Story</b></h2><p class="paragraph" style="text-align:left;">Imagine doing scientific research at scale, having to sift through millions of research articles across PubMed, internal research notes, and patient data including MRIs, CT scans, and more. </p><p class="paragraph" style="text-align:left;">The manual search process is slow, and cross-referencing different, rapidly evolving data sources is prone to errors, ultimately delaying progress.</p><p class="paragraph" style="text-align:left;">A leading MedTech company tackled this challenge in collaboration with <a class="link" href="https://www.activeloop.ai/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=how-to-leverage-all-of-the-world-s-research-for-scientific-discovery-with-genai" target="_blank" rel="noopener noreferrer nofollow">Activeloop</a> and Intel. Here&#39;s what they were able to achieve (watch a 1.5 min demo):</p><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="true" class="youtube_embed" frameborder="0" height="100%" src="https://youtube.com/embed/M-PtpeSRVY8" width="100%"></iframe><p class="paragraph" style="text-align:left;">The goal was to connect internal data (patient reports, MRIs, CT scans, research notes) and external research (e.g., PubMed articles) to speed up scientific research and obtain fast, accurate answers to complex questions involving multi-modal data from different sources.</p><p class="paragraph" style="text-align:left;">Deep Lake, combined with Intel’s 5th Gen Xeon® processors, provided an efficient way to search, connect research data, and integrate any data type with Large Language Models (LLMs) to enhance insights. 
Tasks that once took weeks could now be completed in days or even hours.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://activeloop.ai/usecase/medtech/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=how-to-leverage-all-of-the-world-s-research-for-scientific-discovery-with-genai"><span class="button__text" style="color:#FFFFFF;"> Read online </span></a></div><h2 class="heading" style="text-align:left;" id="the-challenge"><b>The Challenge</b></h2><p class="paragraph" style="text-align:left;">Researchers faced three key hurdles:</p><ul><li><p class="paragraph" style="text-align:left;">Efficiently searching over 40 million documents and multi-modal data.</p></li><li><p class="paragraph" style="text-align:left;">Keyword searches that missed crucial connections.</p></li><li><p class="paragraph" style="text-align:left;">Manual cross-referencing prone to errors.</p></li></ul><h2 class="heading" style="text-align:left;" id="the-solution"><b>The Solution</b></h2><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/812f784f-7719-44d1-988a-ab46cc73d236/image_63347755.png?t=1731533681"/><div class="image__source"><span class="image__source_text"><p>Deep Lake Research UI across radiological data and PubMed articles.</p></span></div></div><p class="paragraph" style="text-align:left;">The solution needed to handle diverse data, deliver quick responses, and support a growing research team. Deep Lake by Activeloop enabled AI-powered, accurate search across multi-modal data. 
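</p><p class="paragraph" style="text-align:left;">Under the hood, this kind of semantic search boils down to embedding everything into vectors and ranking by cosine similarity. A minimal NumPy sketch of the retrieval step (illustrative only, not Deep Lake&#39;s actual API):</p>

```python
import numpy as np

def cosine_top_k(query_vec, doc_matrix, k=3):
    """Return indices of the k document embeddings most similar to the query.

    query_vec:  (d,) embedding of the researcher's question.
    doc_matrix: (n, d) precomputed embeddings of papers, notes, scans, etc.
    """
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = docs @ q                 # cosine similarity per document
    return np.argsort(-sims)[:k]    # indices of best matches, best first

# Toy corpus: 1,000 random 64-dim embeddings; the query is a slightly
# perturbed copy of document 42, so it should rank first.
rng = np.random.default_rng(1)
docs = rng.normal(size=(1000, 64))
query = docs[42] + 0.01 * rng.normal(size=64)

print(cosine_top_k(query, docs)[0])  # 42
```

<p class="paragraph" style="text-align:left;">Production systems layer approximate nearest-neighbor indexes and embedding quantization on top of this basic ranking to reach sub-second latency at scale.</p><p class="paragraph" style="text-align:left;">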
Intel’s hardware ensured high speeds for data ingestion, embedding quantization, and querying.</p><p class="paragraph" style="text-align:left;">Researchers could now use a conversational AI assistant that analyzed queries, connected diverse data to LLMs, and delivered precise answers with citations across relevant articles; it could even correlate research findings to patient data.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b3e6cf40-e030-4509-83b8-0824d09c3bf9/image_63347378.png?t=1731534153"/><div class="image__source"><span class="image__source_text"><p>Cross-referencing data across multiple datatypes and clouds, instantly</p></span></div></div><p class="paragraph" style="text-align:left;"><b>Solution Benefits</b></p><ul><li><p class="paragraph" style="text-align:left;"><b>Faster Insights</b>: Projects that previously took months now took days, accelerating the research process.</p></li><li><p class="paragraph" style="text-align:left;"><b>Improved Accuracy</b>: Researchers could quickly find connections across clinical trials and device blueprints, uncovering insights that were previously missed, with an average 7% improvement in search accuracy.</p></li><li><p class="paragraph" style="text-align:left;"><b>Enhanced Efficiency</b>: The AI assistant streamlined workflows, reducing manual cross-referencing and minimizing errors, while delivering fast AI searches and enabling seamless connection of any data to LLMs for enhanced analysis.</p></li></ul><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/db05a172-325b-4aa4-ba23-28c950071222/IMG.png?t=1731533545"/><div class="image__source"><span class="image__source_text"><p>Multi-Layered Solution for Scientific Discovery with GenAI</p></span></div></div><p 
class="paragraph" style="text-align:left;"><b>Results Achieved with Intel</b></p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5e5750de-3fe1-47b2-b708-aab4a5da01e7/image.png?t=1725528376"/></div><p class="paragraph" style="text-align:left;">The impact included:</p><ul><li><p class="paragraph" style="text-align:left;"><b>Search times reduced</b> to 0.0243 seconds, achieved using 5th Gen Intel® Xeon® processors for real-time embedding inference.</p></li><li><p class="paragraph" style="text-align:left;"><b>65% faster</b> literature ingestion, made possible by Intel® Gaudi® 2 accelerators for improved batch processing.</p></li><li><p class="paragraph" style="text-align:left;"><b>Up to 4x faster</b> streaming computations, utilizing Intel® oneAPI Math Kernel Library (oneMKL) for enhanced cosine similarity computations.</p></li></ul><p class="paragraph" style="text-align:left;">Working on a similar research process at your company? 
Chat to our GenAI experts today to learn how you can leverage all your data for GenAI-powered search.</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://www.activeloop.ai/contact/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=how-to-leverage-all-of-the-world-s-research-for-scientific-discovery-with-genai"><span class="button__text" style=""> Book a call </span></a></div><p class="paragraph" style="text-align:left;">Mikayel on behalf of Activeloop</p><p class="paragraph" style="text-align:left;"><span style="font-family:Apple Color Emoji, Segoe UI Emoji, NotoColorEmoji, Noto Color Emoji, Segoe UI Symbol, Android Emoji, EmojiSymbols;font-size:0.6rem;">©</span><span style="font-size:0.6rem;"> Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.</span></p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=eb2169b9-f4c8-419d-8725-92036f3a69b5&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>RetrieveX is Here: Important Information and Agenda for The Day</title>
  <description>Doors Open at 10:30, Keynote from Meta Chameleon&#39;s Creator at 11. Be on time :)</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/18ea3c8b-ff75-495a-b090-399c675e908e/RetrieveX_-_Top_Speakers-1200x630-px.png" length="229912" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/retrievex-is-here-important-information-and-agenda-for-the-day</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/retrievex-is-here-important-information-and-agenda-for-the-day</guid>
  <pubDate>Thu, 17 Oct 2024 13:30:00 +0000</pubDate>
  <atom:published>2024-10-17T13:30:00Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Dear {{ first name | Data Leader }},</p><p class="paragraph" style="text-align:left;">In a few hours, my team and I will welcome you at RetrieveX - the first industry event focused entirely on Retrieval for AI!</p><p class="paragraph" style="text-align:left;">Get ready to join us and over 200 data leaders on October 17th at 10:30 AM at The Midway in San Francisco! We can&#39;t wait to see you there! You can totally bring your laptop to follow along with some of the workshops on our agenda!</p><h3 class="heading" style="text-align:left;" id="how-to-travel-to-the-venue"><b>How to Travel to the Venue:</b></h3><p class="paragraph" style="text-align:left;"><b>The Address: </b><a class="link" href="https://maps.app.goo.gl/JV9DJhE3qKonYAAs8?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=retrievex-is-here-important-information-and-agenda-for-the-day" target="_blank" rel="noopener noreferrer nofollow">900 Marin St, San Francisco, CA 94124</a></p><p class="paragraph" style="text-align:left;"><b>The Parking: </b>While we&#39;re not providing parking, there&#39;s a lot of free parking space available within easy walking distance of the venue. 
</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e9537817-5667-4e85-a103-5b58ca8b610e/image.png?t=1729159373"/></div><p class="paragraph" style="text-align:left;"><b>Closest light rail stop: </b>3rd St & Marin St</p><p class="paragraph" style="text-align:left;"></p><div class="section" style="background-color:#ff8a00;margin:0.0px 0.0px 0.0px 0.0px;padding:0.0px 0.0px 0.0px 0.0px;"><div class="blockquote"><blockquote class="blockquote__quote"></blockquote></div></div><h3 class="heading" style="text-align:left;" id="at-the-venue"><b>At the Venue:</b></h3><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Doors open at 10:30</b> - Arrive.</p></li><li><p class="paragraph" style="text-align:left;"><b>Get checked in</b>. Download the Eventify app (see below) for an ultra-fast check-in process. </p></li><li><p class="paragraph" style="text-align:left;">Grab your badge, <b>coffee, and some light snacks</b>. 
</p></li><li><p class="paragraph" style="text-align:left;">Grab a spot and post that you&#39;re attending with the <b>Event Hashtag:</b> <b>#RetrieveX</b></p></li><li><p class="paragraph" style="text-align:left;">Conference Wifi: </p><ul><li><p class="paragraph" style="text-align:left;"><b>Name</b>: RetrieveX-Attendees </p></li><li><p class="paragraph" style="text-align:left;"><b>Pass:</b> pipinstalldeeplake</p></li></ul></li></ol><h3 class="heading" style="text-align:left;" id="the-agenda"><b>The Agenda:</b></h3><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/382ced14-f61e-4ac8-8ff0-1b7452dd3bf5/image.png?t=1729019295"/><div class="image__source"><span class="image__source_text"><p>Quick look at all the speakers</p></span></div></div><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5fb41f4e-233e-418c-a986-1651c9f23ba6/image.png?t=1729160460"/><div class="image__source"><span class="image__source_text"><p>Presenting today…</p></span></div></div><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/904e3f7c-5689-4c9e-9e4f-7635e3d206f1/image.png?t=1729160489"/><div class="image__source"><span class="image__source_text"><p>and also…</p></span></div></div><ul><li><p class="paragraph" style="text-align:left;"><b>11:00 AM:</b> <b>Opening Keynote</b> with Armen Aghajanyan, Multi-Modal AI @Meta AI</p></li><li><p class="paragraph" style="text-align:left;"><b>11:50 AM: Keynote </b>Davit Buniatyan, CEO Activeloop: AI Search on Data Lakes</p></li><li><p class="paragraph" style="text-align:left;"><b>2:40 PM:</b> Rob Ferguson, Head of AI, Microsoft for Startups on Leaders in Retrieval 
Augmentation</p></li><li><p class="paragraph" style="text-align:left;"><b>Afternoon: </b>Talks and workshops from creators of PyTorch, CAFFE, Albumentations, as well as CTOs from Cresta and Hercules, and Founders of Lepton AI, Omneky, and Generative Alpha - on everything from implementing custom GenAI in enterprises to scaling efficiently across marketing, legal, finance, and other industries.</p></li><li><p class="paragraph" style="text-align:left;"><b>5:00 PM:</b> Networking and a catered reception.</p></li></ul><p class="paragraph" style="text-align:left;">The full agenda for the day is posted around the venue, or available <a class="link" href="https://www.retrievex.co/agenda?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=retrievex-is-here-important-information-and-agenda-for-the-day" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/570ce4f9-329d-42e2-895b-f9f6fcae3c9b/image.png?t=1729160422"/></div><h3 class="heading" style="text-align:left;" id="action-item-download-the-event-app-"><b>Action Item: Download the Event App for Faster Check-in and Networking</b></h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/59d5cefc-ad32-4d9b-b5dd-bfda1fb55a84/image.png?t=1728786902"/></div><p class="paragraph" style="text-align:left;">If you have any questions or need assistance, feel free to reply to this email. 
</p><p class="paragraph" style="text-align:left;">See you on October 17,<br>Mikayel on behalf of the RetrieveX Team</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=0d3887ef-1197-4f68-a54d-11efbc421566&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>OpenAI&#39;s Swarm, First Open-Source Multimodal MoE Model, TensorWave’s AMD Challenge</title>
  <description>Plus, OpenAI’s new benchmark could reshape AI agent development</description>
  <link>https://genai360.beehiiv.com/p/of-new-benchmarks</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/of-new-benchmarks</guid>
  <pubDate>Tue, 15 Oct 2024 15:57:50 +0000</pubDate>
  <atom:published>2024-10-15T15:57:50Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><div class="section" style="background-color:#ff8a00;border-color:#ff8a00;border-radius:2px;border-style:solid;border-width:2px;margin:10.0px 10.0px 10.0px 10.0px;padding:10.0px 10.0px 10.0px 10.0px;"><h2 class="heading" style="text-align:left;"><span style="color:rgb(255, 255, 255);">Last Chance: Claim Free Tickets for the In-Person RetrieveX Conference on Oct 17 in San Francisco. </span></h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(255, 255, 255);">Come hear from the creators of PyTorch, Albumentations, Meta Chameleon, Kubeflow, and CAFFE, along with leaders from Microsoft, AWS, Bayer, Flagship Pioneering, Cresta, and Omneky, on how to build the best retrieval for AI. </span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(255, 255, 255);">If you&#39;re an executive who&#39;s considering or working on GenAI projects, fill in the form below for a complimentary ticket for the conference - hurry up because tickets are limited! 
</span><span style="color:rgb(255, 255, 255);"><b>Please note that the conference is in-person only</b></span><span style="color:rgb(255, 255, 255);">.</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#F9FAFB;" href="https://www.retrievex.co/application?utm_source=newsletter&utm_medium=email&utm_campaign=weekly2"><span class="button__text" style=""><span style="color:rgb(34, 34, 34);">Get Tickets Today</span></span></a></div><p class="paragraph" style="text-align:left;"><span style="color:rgb(255, 255, 255);"><b>Date:</b></span><span style="color:rgb(255, 255, 255);"> October 17, 10:30am - 7pm PT</span><br><span style="color:rgb(255, 255, 255);"><b>Venue: </b></span><span style="color:rgb(255, 255, 255);">The Midway, 900 Marin St, San Francisco</span></p></div><h2 class="heading" style="text-align:left;" id="key-takeaways">Key Takeaways</h2><ul><li><p class="paragraph" style="text-align:left;">OpenAI released Swarm, a lightweight library for building multi-agent systems that allows dynamic switching between specialized agents during conversations.</p></li><li><p class="paragraph" style="text-align:left;">Aria, the first open-source multimodal native MoE model, was released, showing state-of-the-art performance on various multimodal and language tasks.</p></li><li><p class="paragraph" style="text-align:left;">BioNTech unveiled several AI initiatives, including the Kyber supercomputer and Bayesian Flow Network models for protein sequence generation.</p></li><li><p class="paragraph" style="text-align:left;">OpenAI&#39;s MLE-bench, evaluating AI agents on machine learning engineering tasks, found that the best setup achieved bronze medal level in 16.9% of Kaggle competitions, improving to 34.1% with multiple attempts.</p></li><li><p class="paragraph" style="text-align:left;">A study introducing Compositional GSM showed significant disparities in LLM performance 
between standard and more complex math problems, highlighting gaps in reasoning capabilities.</p></li></ul><p class="paragraph" style="text-align:left;"><i>Got forwarded this newsletter? Subscribe below👇</i></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="" href="https://genai360.beehiiv.com/subscribe?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge"><span class="button__text" style=""> Subscribe </span></a></div><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">The model releases seemed to have slowed down last week, but we did get the <b>first </b>open-source, multimodal native MoE model, along with a new library from OpenAI.</p><p class="paragraph" style="text-align:left;">Meanwhile, AI in biotech saw some notable progress with AI being applied to calculating carbon footprints as well as a bunch of updates unveiled at <b>BioNTech’s inaugural AI day.</b></p><h3 class="heading" style="text-align:left;" id="open-a-is-new-tool-for-flexible-and">OpenAI&#39;s New Tool for Flexible and Dynamic AI Agent Interactions</h3><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/388a51a3-6bef-4a8f-bdeb-e02c2aad2e3f/image.png?t=1728946145"/><div class="image__source"><span class="image__source_text"><p>Example of how Swarm works. 
<a class="link" href="https://github.com/openai/swarm?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">OpenAI released <a class="link" href="https://github.com/openai/swarm?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">Swarm</a> - a lightweight library for building multi-agent systems. It’s similar in concept to existing frameworks like <b>CrewAI and LangChain</b> that also help with the creation of multi-agent systems. Unexpected, but certainly welcomed. </p><p class="paragraph" style="text-align:left;">Swarm provides a stateless abstraction to manage interactions between multiple AI agents. It also allows for dynamic switching between specialized agents during conversations. Worth noting that it <b>doesn’t rely</b> on OpenAI&#39;s Assistants API, so it offers more flexibility and control.</p><p class="paragraph" style="text-align:left;">It lets devs create distinct agents, each with specific roles, instructions, and functions. These agents can interact dynamically based on<b> pre-defined handoff </b>logic, allowing for seamless task switching as conversations or workflows progress.</p><p class="paragraph" style="text-align:left;">Since the framework doesn’t maintain internal state between function calls, it lets agents pass control to other agents in real-time based on criteria or conversation flow. The handoff is simple—just return the <b>next agent </b>to engage. </p><p class="paragraph" style="text-align:left;">Notably, Swarm doesn’t use <b>OpenAI’s function calling feature</b>. 
Instead, it opts for a more flexible approach that allows for easier integration with various AI models and custom implementations.</p><p class="paragraph" style="text-align:left;">Swarm uses context variables to maintain and update the <b>state </b>throughout multi-agent interactions. </p><h3 class="heading" style="text-align:left;" id="aria-the-worlds-first-open-source-m">Aria: The World&#39;s First Open-Source Multimodal Native MoE Model Debuts</h3><div class="image"><a class="image__link" href="https://www.rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" rel="noopener" target="_blank"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/aa0e50a6-0249-4854-a24d-62b29c3de79f/image.png?t=1728946006"/></a><div class="image__source"><span class="image__source_text"><p>Aria showed SOTA results on benchmarks like MathVista and DocVQA. 
<a class="link" href="https://www.rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">We saw the release of the first <b>open-source, multimodal native MoE</b> model called <a class="link" href="https://www.rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">Aria</a> by Rhymes AI.</p><p class="paragraph" style="text-align:left;">It&#39;s pre-trained from scratch on a mixture of <b>multimodal and language data.</b> It’s definitely no slouch as it showed SOTA performance on a wide range of multimodal and language tasks like MMMU and LongVideoBench. It’s adept at following instructions across both multimodal and language inputs, excelling in benchmarks like MIA-Bench and MT-Bench.</p><p class="paragraph" style="text-align:left;">Aria processes text, images, video, and code <b>simultaneously </b>- all without needing separate setups for each type. In terms of multimodal capabilities, it has a long context window of 64K tokens. Aria can caption a 256-frame video in just 10 seconds. </p><p class="paragraph" style="text-align:left;">It consists of a vision encoder and a MoE decoder, with the vision encoder operating in three resolution modes: medium, high, and ultra-high. 
The MoE decoder has 66 experts in each layer, with 2 shared experts and <b>6 activated experts per token.</b></p><h3 class="heading" style="text-align:left;" id="world-labs-google-cloud-partnership">World Labs&#39; Google Cloud Partnership, TensorWave&#39;s AMD Challenge, and Intel&#39;s AI-Enabled Processors</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/10/08/fei-fei-li-picks-google-cloud-where-she-led-ai-as-world-labs-main-compute-provider/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">Fei-Fei Li’s World Labs </a>(which came out of stealth last month) has selected Google Cloud as its primary compute provider to train its &quot;spatially intelligent&quot; AI models. The startup is using a significant portion of its <b>$230 million funding round</b> for GPU server licensing.</p><p class="paragraph" style="text-align:left;">World Labs has a big focus on developing multimodal AI models capable of processing, generating, and interacting with video and <b>geospatial data,</b> which are all pretty computationally demanding. </p><p class="paragraph" style="text-align:left;">World Labs’ partnership with <b>Google Cloud</b> is non-exclusive, so the startup can potentially explore other cloud providers in the future. 
For now, Google Cloud hosts the majority of World Labs’ workloads, and Google aims to retain this business long-term.</p><p class="paragraph" style="text-align:left;">Nvidia has seen a lot of <a class="link" href="https://genai360.beehiiv.com/p/the-trillion-dollar-cluster?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">competition</a> step up to the plate in the chip market in recent months, with last week being no exception. <a class="link" href="https://techcrunch.com/2024/10/08/tensorwave-claims-its-amd-powered-cloud-for-ai-will-give-nvidia-a-run-for-its-money/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">TensorWave</a> is going against the grain by launching a cloud platform that only offers access to hardware from Nvidia rival AMD for AI workloads. Their main goal is to democratize AI by offering more <b>affordable compute access</b>.</p><p class="paragraph" style="text-align:left;">TensorWave uses AMD Instinct <b>MI300X GPUs for AI workloads</b>, claiming its MI300X instances outperform Nvidia&#39;s H100 in running (but not training) AI models, particularly text-generating models like Meta&#39;s Llama 2.</p><p class="paragraph" style="text-align:left;">TensorWave rents GPU capacity by the hour with a minimum six-month contract.<br>Pricing ranges from approximately<b> $1 to $10 per hour,</b> depending on workload requirements and GPU configurations. Interestingly enough, the company aims to be more cost-effective than competitors due to the lower cost of AMD GPUs compared to Nvidia&#39;s H100.</p><p class="paragraph" style="text-align:left;">In terms of growth, TensorWave is already generating $3 million in annual recurring revenue. 
The company expects to reach<b> $25 million in revenue</b> by the end of the year. TensorWave plans to scale up to 20,000 MI300X GPUs. The company intends to bring AMD&#39;s next-gen MI325X GPUs online as early as November/December 2024.</p><p class="paragraph" style="text-align:left;">We also saw a new release from Intel called the <a class="link" href="https://www.neowin.net/news/intel-launches-core-ultra-200s-desktop-processors-arrow-lake-with-npus-for-ai-performance/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">Intel Core Ultra 200S series</a>, which features built-in <b>Neural Processing Units (NPUs)</b>. This marks Intel&#39;s first desktop processors with integrated AI capabilities. </p><h3 class="heading" style="text-align:left;" id="from-packages-to-products-amazons-a">From Packages to Products: Amazon&#39;s AI Innovations in Delivery and Shopping</h3><p class="paragraph" style="text-align:left;">Amazon is introducing <a class="link" href="https://techcrunch.com/2024/10/09/amazons-new-ai-powered-vision-tech-tells-drivers-which-packages-to-deliver/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">vision-based technology</a> to its electric vehicle fleet. It’s designed to help drivers prioritize packages by <b>highlighting</b> them with green circles (for delivery) or red lights (incorrect packages).</p><p class="paragraph" style="text-align:left;">The VAPR (Vision-Assisted Package Retrieval) system aims to save drivers from having to stop and manually search for packages at each stop, reducing the time spent per stop from <b>2–5 minutes </b>to under a minute. 
</p><p class="paragraph" style="text-align:left;">It also includes an audio cue to confirm if the driver has selected the <b>correct package</b>, which gets rid of the need for handheld devices that drivers currently use for package scanning and tracking.</p><p class="paragraph" style="text-align:left;">Turns out the VAPR system has been in development since early 2020, with Amazon considering unique delivery challenges like lighting and space constraints inside vans. There are plans to deploy the VAPR system in <b>1,000</b> of its electric Rivian vans by early 2025, after testing the technology in select markets like Boston. It’s important to note that <a class="link" href="https://www.marketwatch.com/story/amazon-lost-nearly-1-billion-on-its-rivian-investment-last-week-as-another-analyst-downgrades-ev-makers-stock-9e77ee1f?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">Amazon is Rivian’s largest shareholder with a stake of 16.6%.</a></p><p class="paragraph" style="text-align:left;">Amazon also released <a class="link" href="https://www.theverge.com/2024/10/9/24266204/amazon-ai-shopping-guides-catalog-feature-availability?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">AI Shopping Guides</a> to help users find products based on specific features, offering <b>tailored suggestions</b> and product information for over 100 product types, including TVs, headphones, and skincare. </p><p class="paragraph" style="text-align:left;">It’s a more visual way to filter products, replacing traditional tick-box menus with interactive options for factors like brand, use case (e.g., sport or gaming), and connectivity type. 
Each guide includes educational content and customer insights to help users better understand product features and make more <b>informed purchasing decisions.</b></p><p class="paragraph" style="text-align:left;">The AI guides appear automatically during <b>searches </b>when relevant, or users can directly explore them through Amazon’s mobile website and apps for iOS and Android.</p><h3 class="heading" style="text-align:left;" id="ai-for-refining-carbon-accounting-a">AI for Refining Carbon Accounting and Updates from BioNTech’s Inaugural AI Day</h3><p class="paragraph" style="text-align:left;">AI saw some <a class="link" href="https://techcrunch.com/2024/10/09/unhappy-with-their-exit-these-ex-planetly-employees-are-using-ai-to-refine-carbon-accounting/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge#" target="_blank" rel="noopener noreferrer nofollow">climate tech</a> applications last week, with Forward Earth focusing on using AI to automate the calculation of complex CO2 footprints. The co-founders, former executives at carbon accounting startup Planetly, launched <b>Forward Earth</b> because they weren’t too happy with Planetly&#39;s acquisition and shutdown by OneTrust.</p><p class="paragraph" style="text-align:left;">BioNTech recently had its <a class="link" href="https://investors.biontech.de/news-releases/news-release-details/biontech-highlights-ai-capabilities-and-rd-use-cases-inaugural?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">inaugural AI day</a> where they unveiled a bunch of key updates. 
</p><p class="paragraph" style="text-align:left;">Here are the main ones you need to know:</p><ul><li><p class="paragraph" style="text-align:left;"><b>AI Scaling Across Immunotherapy Pipeline</b>: BioNTech detailed its strategy to scale AI capabilities throughout its immunotherapy pipeline, using AI to drive innovation in areas like DNA/RNA sequencing, proteomics, protein design, and immunohistochemistry.</p></li><li><p class="paragraph" style="text-align:left;"><b>Launch of Kyber Supercomputer</b>: InstaDeep (BioNTech’s AI subsidiary) unveiled Kyber - a near exascale supercomputer aimed at enabling high-performance computing for large-scale AI and biotechnology research.</p></li><li><p class="paragraph" style="text-align:left;"><b>Bayesian Flow Network (BFN)</b>: BioNTech presented BFN generative models for protein sequence generation, showing the potential for advancements in personalized vaccine development and targeted therapies.</p></li><li><p class="paragraph" style="text-align:left;"><b>DeepChain™ Platform and External Partnerships</b>: The DeepChain™ multiomics design platform was introduced for external partnerships after success in BioNTech’s internal projects, including the mRNA-encoded antibody RiboMab™ platform.</p></li></ul><h3 class="heading" style="text-align:left;" id="adobe-launches-free-web-app-for-con">Adobe Launches Free Web App for Content Credentials, Addressing Past AI Training Controversies</h3><p class="paragraph" style="text-align:left;">Adobe is launching a <a class="link" href="https://www.theverge.com/2024/10/8/24265031/adobe-content-authenticity-web-app-ai-label-availability?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">free web app</a> to improve the process of applying Content Credentials to images, videos, and audio files.</p><p class="paragraph" style="text-align:left;">It’s an 
interesting turn of events since we saw <a class="link" href="https://www.bloomberg.com/news/articles/2024-04-12/adobe-s-ai-firefly-used-ai-generated-images-from-rivals-for-training?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">Adobe use Midjourney-generated images</a> to train their Firefly model in the past, even though they said the model was a “<b>commercially safe</b>” alternative.</p><p class="paragraph" style="text-align:left;">This may be Adobe’s response to the trust it lost around ethical AI practices. They’re <b>introducing</b> tools that let creators protect their work and opt out of AI training datasets, a sign that they’re adapting their practices and policies for a more ethical approach.</p><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">One paper that really stood out last week was about exploring intelligence emergence in rule-based systems. OpenAI also released a <b>new benchmark</b> for evaluating AI in scientific discovery tasks. Researchers provided a fresh perspective on assessing LLM capabilities as well.</p><h3 class="heading" style="text-align:left;" id="how-rule-complexity-shapes-ai-behav">How Rule Complexity Shapes AI Behavior</h3><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/134d9fbc-2d90-4afb-ad2c-65a8d7906a5a/image.png?t=1728946472"/><div class="image__source"><span class="image__source_text"><p>Framework overview. 
<a class="link" href="https://www.arxiv.org/pdf/2410.02536?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers introduced a <a class="link" href="https://www.arxiv.org/pdf/2410.02536?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">new approach</a> to understanding the emergence of intelligent behavior in artificial systems by investigating how the complexity of rule-based systems influences the capabilities of models trained to predict these rules.</p><p class="paragraph" style="text-align:left;">Previously we saw the release of <a class="link" href="https://genai360.beehiiv.com/p/of-whales-and-strawberries?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">LifeGPT</a> which also progressed work with cellular automata. This paper used elementary cellular automata (ECA) as a framework to generate behaviors ranging from <b>simple to highly complex. </b></p><p class="paragraph" style="text-align:left;">They also trained separate <b>GPT-2 language models</b> on datasets generated by individual ECAs and evaluated the models&#39; &quot;intelligence&quot; through performance on downstream logical reasoning tasks.</p><p class="paragraph" style="text-align:left;">A <b>positive correlation</b> was found between the complexity of ECA rules and the downstream performance of models trained on them. Models trained on Class IV ECA rules, which showed structured yet complex behaviors, performed optimally. 
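As a rough illustration of the setup (our own sketch, not the paper's code), generating training sequences from an elementary cellular automaton takes only a few lines; Rule 110 is a classic example of the structured-yet-complex Class IV behavior:

```python
def eca_step(state, rule):
    """One update of an elementary cellular automaton.

    state: list of 0/1 cells (edges wrap around)
    rule:  integer 0-255; bit k of the rule gives the next value of a
           cell whose (left, self, right) neighborhood encodes the number k.
    """
    n = len(state)
    return [
        (rule >> (4 * state[(i - 1) % n] + 2 * state[i] + state[(i + 1) % n])) & 1
        for i in range(n)
    ]

# Evolve Rule 110 (Class IV) from a single live cell; the rows of
# `history` are the kind of state sequences a model could be trained on.
row = [0] * 11
row[5] = 1
history = [row]
for _ in range(5):
    row = eca_step(row, 110)
    history.append(row)
```

Each of the 256 rules is just a different 8-entry lookup table, which is what makes ECAs a convenient dial for varying training-data complexity.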
</p><p class="paragraph" style="text-align:left;">Interestingly, it showed that models can learn<b> complex solutions</b> even when trained on simple rules, which is likely due to overparameterization. They presented a hypothesis that by learning to incorporate past states, models develop generalizable logic that can be reused across tasks.</p><h3 class="heading" style="text-align:left;" id="from-bronze-to-gold-open-a-is-ml-eb">From Bronze to Gold: OpenAI&#39;s MLE-bench Challenges AI Agents in Real-World ML Tasks</h3><p class="paragraph" style="text-align:left;">In addition to Swarm, OpenAI also introduced <a class="link" href="https://openai.com/index/mle-bench/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">MLE-bench</a> last week. It’s a new benchmark for assessing the <b>capabilities of AI agents</b> in machine learning engineering tasks. 
</p><p class="paragraph" style="text-align:left;">This benchmark aims to provide a <b>rigorous measure</b> of progress in autonomous ML engineering agents, addressing the growing interest in using AI to automate scientific workflows.</p><p class="paragraph" style="text-align:left;">Key aspects of MLE-bench include:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">A diverse set of 75 Kaggle competitions across various domains, including natural language processing, computer vision, and signal processing.</p></li><li><p class="paragraph" style="text-align:left;">Careful curation to ensure tasks are challenging and representative of contemporary ML engineering work.</p></li><li><p class="paragraph" style="text-align:left;">The ability to compare AI agent performance directly with human-level performance using Kaggle leaderboards.</p></li></ol><p class="paragraph" style="text-align:left;">The researchers evaluated several frontier language models on MLE-bench using open-source agent scaffolds. They found the <b>best-performing setup</b> to be OpenAI&#39;s o1-preview with AIDE scaffolding, which achieved at least a bronze medal level in 16.9% of competitions.</p><p class="paragraph" style="text-align:left;">Performance significantly improved when agents were given multiple attempts per competition, with o1-preview&#39;s score doubling from 16.9% to 34.1% when allowed 8 attempts. 
Moreover, agents performed well on competitions solvable with well-known approaches but struggled with <b>debugging and recovering from missteps.</b></p><h3 class="heading" style="text-align:left;" id="science-agent-bench-putting-ai-to-t">ScienceAgentBench: Putting AI to the Test in Scientific Discovery</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2410.05080?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">ScienceAgentBench</a> is a <b>new benchmark</b> for evaluating language agents in data-driven scientific discovery tasks. It addresses the growing interest in using AI to automate scientific workflows, while also highlighting the need for more rigorous assessment of these systems.</p><p class="paragraph" style="text-align:left;">A total of 102 diverse tasks were extracted from <b>44 peer-reviewed publications</b> across four scientific disciplines: Bioinformatics, Computational Chemistry, Geographical Information Science, and Psychology & Cognitive Neuroscience.</p><p class="paragraph" style="text-align:left;">There was a big focus on code generation, requiring agents to produce complete <b>Python programs</b> for data analysis and visualization tasks. They also used careful quality control measures, including expert validation and strategies to mitigate data contamination concerns.</p><p class="paragraph" style="text-align:left;">They evaluated five open-weight and proprietary LLMs using three frameworks: direct prompting, OpenHands CodeAct, and self-debug. 
Surprisingly, they found simpler approaches like self-debug usually <b>outperformed</b> more complex frameworks in both performance and cost-efficiency.</p><p class="paragraph" style="text-align:left;">Results show that even the best-performing agent, <b>Claude-3.5-Sonnet </b>using self-debug, could only solve 34.3% of tasks with expert-provided knowledge. It’s definitely an eye-opener about the challenges that AI agents are having in automating complex scientific workflows.</p><h3 class="heading" style="text-align:left;" id="new-benchmark-reveals-llm-reasoning">New Benchmark Reveals LLM Reasoning Disparities</h3><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/9969144f-49b0-49c0-82e0-3e890b16dfaa/image.png?t=1728946618"/><div class="image__source"><span class="image__source_text"><p>Graph comparing reasoning performance on GSM8K and Compositional GSM accuracy. <a class="link" href="https://arxiv.org/pdf/2410.01748?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers from Mila, Microsoft Research, and Google DeepMind introduced a <a class="link" href="https://arxiv.org/pdf/2410.01748?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">new approach</a> to evaluating the<b> reasoning capabilities </b>of LLMs in grade-school math problems. 
</p><p class="paragraph" style="text-align:left;">They introduced <b>Compositional GSM</b>, a two-hop version of the GSM8K benchmark that challenges LLMs to solve chained math problems.</p><p class="paragraph" style="text-align:left;">They also evaluated <b>various LLMs</b>, including open-source and proprietary models like Gemini, Gemma2, LLAMA3, GPT, Phi, Qwen2.5, and Mistral families.</p><p class="paragraph" style="text-align:left;">There were significant disparities between LLMs&#39; performance on standard GSM8K problems and the more complex Compositional GSM tasks. <b>Smaller, more cost-efficient</b>, and math-specialized models showed larger reasoning gaps.</p><p class="paragraph" style="text-align:left;">For instance, GPT-4o mini, which nearly matches GPT-4o on <b>standard benchmarks</b>, showed a 2-12x worse reasoning gap on Compositional GSM.</p><p class="paragraph" style="text-align:left;">It showed instruction-tuning effects vary across LLM sizes, with smaller models showing more improvement on standard GSM8K but less on Compositional GSM. 
Extensive math specialization didn’t necessarily improve performance on these <b>compositional tasks.</b></p><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">A post about a new architecture which might improve on <b>o1’s ability </b>to scale inference-time compute gained a lot of attention - and for good reason.</p><p class="paragraph" style="text-align:left;">Another discussion about <b>LongCite,</b> a new method to improve the trustworthiness of AI outputs, was also one to look at.</p><h3 class="heading" style="text-align:left;" id="pause-tokens-and-parallel-reasoning">Pause Tokens and Parallel Reasoning: Entropix&#39;s New Approach to AI Cognition</h3><p class="paragraph" style="text-align:left;">A discussion about <a class="link" href="https://x.com/tedx_ai/status/1843066379299098937?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">Entropix</a>, an <b>innovative architecture </b>anonymously released by an AI researcher, came up last week. </p><p class="paragraph" style="text-align:left;">It highlighted a new approach to replicating and potentially improving upon OpenAI&#39;s latest o1 model&#39;s ability to scale <b>inference-time compute</b> - essentially improving its capacity to &#39;think&#39; before responding.</p><p class="paragraph" style="text-align:left;">What stood out was the architecture&#39;s use of <b>uncertainty measurement</b> (formally defined as entropy and varentropy) to improve reasoning. </p><p class="paragraph" style="text-align:left;">The model inserts pause tokens like &quot;...wait&quot; when uncertain about the next best tokens or thoughts, prompting it to reflect and produce additional chains of thought. 
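To make that concrete, here is a toy sketch of the idea (our own illustration with made-up thresholds, not the actual Entropix code): measure the entropy and varentropy of the next-token distribution, and inject a pause token when both are high:

```python
import math

def entropy_varentropy(logits):
    """Entropy (in nats) and varentropy of softmax(logits).

    Varentropy is the variance of the surprisal -log p(x),
    i.e. E[(-log p)^2] - H^2.
    """
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    var = sum(p * math.log(p) ** 2 for p in probs if p > 0) - h ** 2
    return h, var

def maybe_pause(logits, h_thresh=2.0, v_thresh=1.0):
    """Toy decision rule; the thresholds are illustrative, not Entropix's."""
    h, v = entropy_varentropy(logits)
    return "...wait" if h > h_thresh and v > v_thresh else None
```

When the distribution is sharply peaked (low entropy), decoding proceeds normally; when the model is torn between several continuations, the injected pause token gives it room to produce another chain of thought before committing.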
This approach allows for <b>dynamic scaling</b> of inference-time compute for more profound and strategic thinking.</p><p class="paragraph" style="text-align:left;">It makes us wonder how this method could potentially unlock more <b>powerful reasoning </b>capabilities from smaller models like Llama 3.1-1B, which can run locally on a laptop. </p><p class="paragraph" style="text-align:left;">Perhaps advancements in the underlying attention mechanism, rather than relying solely on prompt engineering or methods like <b>Monte Carlo Tree Search</b>, might be key to unlocking the true potential of inference-time compute.</p><h3 class="heading" style="text-align:left;" id="boosting-llm-trustworthiness-with-f">Boosting LLM Trustworthiness with Fine-Grained Citations</h3><p class="paragraph" style="text-align:left;">Sebastian Raschka brought an interesting paper called <a class="link" href="https://x.com/rasbt/status/1845468766118850862?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">LongCite</a> into the spotlight, after it had probably slipped under the radar. It describes a new approach to boost the trustworthiness of LLMs in <b>long-context question answering tasks.</b></p><p class="paragraph" style="text-align:left;">What makes this paper important is its focus on making AI-generated content more trustworthy. <b>Misinformation</b> is common in today’s world, so equipping LLMs with the ability to generate precise, sentence-level citations could definitely raise human confidence in AI outputs. Not only does that help address concerns about hallucinations, but it also helps us verify information.</p><p class="paragraph" style="text-align:left;">The method used also stands out, since they leveraged existing LLMs to create a dataset of long-context Q&A instances. 
It’s an example of step-wise refinement that shows how AI can be adapted to deal with <b>pressing issues.</b></p><p class="paragraph" style="text-align:left;">Maybe we’ll see a shift toward models that <b>prioritize citation </b>and accountability as core functionalities in the future.</p><h2 class="heading" style="text-align:left;" id="frameworks-we-love">Frameworks We Love</h2><p class="paragraph" style="text-align:left;">Some frameworks that caught our attention in the last week include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2410.07171?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">IterComp</a>: Combines the strengths of multiple diffusion models to improve compositional text-to-image generation.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2410.07164?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">AvatarGo</a>: Generates animatable 4D human-object interaction (HOI) scenes directly from textual inputs.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2410.07163?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">SimNPO</a>: A framework for LLMs that aims to remove unwanted data influences and associated model capabilities.</p></li></ul><p class="paragraph" style="text-align:left;">If you want your framework to be featured here, reply to this email saying hi :)</p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in 
AI</h2><p class="paragraph" style="text-align:left;">AI applications in <b>mineral discovery </b>seem to be making good progress as KoBold Metals secured $491 million after a massive discovery. Meanwhile, Basecamp and Braintrust secured $60 million and $36 million respectively.</p><h3 class="heading" style="text-align:left;" id="ko-bold-metals-secures-491-million">KoBold Metals Secures $491 Million</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/10/07/ai-powered-critical-mineral-startup-kobold-metals-has-raised-491m-filings-reveal/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">KoBold Metals</a>, an AI-powered mineral discovery startup, is close to raising over half a billion dollars, having already secured $491 million of a targeted <b>$527 million round. </b></p><p class="paragraph" style="text-align:left;">This funding comes on the heels of KoBold&#39;s discovery of what might be one of the largest high-grade copper deposits in history, showing the potential of <b>AI in mineral exploration. </b></p><h3 class="heading" style="text-align:left;" id="basecamp-raises-60-million-in-serie">Basecamp Raises $60 Million in Series B Funding Round</h3><p class="paragraph" style="text-align:left;">A London-based startup building an AI agent for <b>biology and biodiversity</b> insights <a class="link" href="https://techcrunch.com/2024/10/09/basecamp-research-taps-60m-to-build-a-gpt-for-biology/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">called Basecamp</a> has raised $60 million in a Series B funding round led by Singular. 
</p><p class="paragraph" style="text-align:left;">The company aims to create an AI that can not only <b>answer questions</b> about biology but also produce new insights beyond human capabilities, with its BaseFold model already claiming to outperform DeepMind&#39;s AlphaFold 2 in certain protein structure predictions.</p><h3 class="heading" style="text-align:left;" id="rad-ai-secures-50-million-in-series">Rad AI Secures $50 Million in Series B Funding Round</h3><p class="paragraph" style="text-align:left;">A startup focusing on generative AI for healthcare called <a class="link" href="https://www.linkedin.com/posts/doktorgurson_i-wanted-to-share-a-key-takeaway-from-rad-activity-7249440062113771520-HJIT/?utm_source=share&utm_medium=member_desktop" target="_blank" rel="noopener noreferrer nofollow">Rad AI</a> secured $50 million in Series B funding led by Khosla Ventures, which brings their total capital raised to over $80 million. The CEO attributed a big part of their success to <b>strong business metrics, </b>including tripling year-over-year revenue and adoption by more than a third of all US health systems.</p><h3 class="heading" style="text-align:left;" id="braintrust-raises-36-million-in-ser">Braintrust Raises $36 Million in Series A Funding Round</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.braintrust.dev/blog/announcing-series-a?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-s-swarm-first-open-source-multimodal-moe-model-tensorwave-s-amd-challenge" target="_blank" rel="noopener noreferrer nofollow">Braintrust</a> focuses on empowering teams to build robust LLM-enabled applications. They announced a $36 million funding round, bringing their total funding to $45 million. 
In particular, they’re working toward building features like <b>being able to share code</b> in dev environments.</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=600f76f3-9be3-4e69-960a-35ee1939b22a&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Attend RetrieveX, the Conference for AI Data Leaders</title>
  <description>Final Batch of Free Tickets Inside. Event is on Oct 17 in SF, CA</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/18ea3c8b-ff75-495a-b090-399c675e908e/RetrieveX_-_Top_Speakers-1200x630-px.png" length="229912" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/retrievex</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/retrievex</guid>
  <pubDate>Mon, 14 Oct 2024 23:08:31 +0000</pubDate>
  <atom:published>2024-10-14T23:08:31Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3ffd7f7d-5991-478d-8728-66b44fb1f91b/image.png?t=1728946240"/></div><p class="paragraph" style="text-align:left;">Hi there,</p><p class="paragraph" style="text-align:left;">You probably know this already since you&#39;ve been reading our newsletter quite a bit, but as the free tickets for the conference are running out, I wanted to share <b>RetrieveX - the first industry event focused entirely on Retrieval for AI for 200+ Data Leaders in AI</b>! We&#39;re thrilled to have you join us this week, on <b>October 17th from 10:30 AM at The Midway in San Francisco</b>.</p><p class="paragraph" style="text-align:left;"></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://retrievex.co/application?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=attend-retrievex-the-conference-for-ai-data-leaders"><span class="button__text" style="color:#FFFFFF;"> Request a Free Ticket </span></a></div><h2 class="heading" style="text-align:left;" id="event-details"><b>Event Details:</b></h2><ul><li><p class="paragraph" style="text-align:left;"><b>Date:</b> October 17, 2024</p></li><li><p class="paragraph" style="text-align:left;"><b>Doors open & light breakfast:</b> 10:30 AM</p></li><li><p class="paragraph" style="text-align:left;"><b>Location:</b> The Midway, 900 Marin St, San Francisco</p></li></ul><h2 class="heading" style="text-align:left;" id="what-to-expect"><b>What to expect:</b></h2><ul><li><p class="paragraph" style="text-align:left;">Talks from the creators of PyTorch, Meta Chameleon, Albumentations, Kubeflow, and CAFFE.</p></li><li><p class="paragraph" style="text-align:left;">Insights from top executives at <b>Microsoft, 
AWS, Bayer, Y Combinator, and Flagship Pioneering</b></p></li><li><p class="paragraph" style="text-align:left;">Deep dives into building AI search on object storage and highly accurate <b>multi-modal RAG.</b></p></li><li><p class="paragraph" style="text-align:left;"><b>Delivering ROI with LLM-powered apps </b>across Healthcare, Life Sciences, Legal, Customer Support, Finance, Marketing.</p></li></ul><h2 class="heading" style="text-align:left;" id="talk-highlights">Talk Highlights</h2><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/bac6ef6a-b71c-4870-9eef-65e193da35b7/image.png?t=1728787064"/></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://retrievex.co/application?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=attend-retrievex-the-conference-for-ai-data-leaders"><span class="button__text" style="color:#FFFFFF;"> Request a Free Ticket </span></a></div><div class="section" style="background-color:#ff8a00;border-color:#FFFFFF;border-radius:5px;border-style:dotted;border-width:1px;margin:20.0px 20.0px 20.0px 20.0px;padding:20.0px 20.0px 20.0px 20.0px;"><h2 class="heading" style="text-align:left;"><b>Elevating Content Creation Through Vector Embeddings & Contextual Search at Spotter</b></h2><p class="paragraph" style="text-align:left;">Learn how the Data team at Spotter, the team behind creators such as Mr Beast, revolutionized contextual search capabilities, enabling YouTube creators with enhanced ideation</p><p class="paragraph" style="text-align:left;">Sazz Rahman, <b>Spotter</b></p></div><div class="section" 
style="background-color:#ff8a00;border-color:#FFFFFF;border-radius:5px;border-style:dotted;border-width:1px;margin:20.0px 20.0px 20.0px 20.0px;padding:20.0px 20.0px 20.0px 20.0px;"><h2 class="heading" style="text-align:left;"><b>Accelerating Generative AI with Deep Lake at Matterport</b></h2><p class="paragraph" style="text-align:left;">How to remove all furniture from your property without ever opening the door.</p><p class="paragraph" style="text-align:left;">Alan Dolhasz, <b>Matterport</b></p></div><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/700d9cde-2d3d-4439-adee-45b8ce70d80d/image.png?t=1728787364"/></div><div class="section" style="background-color:#ff8a00;border-color:#FFFFFF;border-radius:5px;border-style:dotted;border-width:1px;margin:20.0px 20.0px 20.0px 20.0px;padding:20.0px 20.0px 20.0px 20.0px;"><h2 class="heading" style="text-align:left;"><b>From Concept to Care: GenAI-Driven Innovation in Digital Health at Bayer</b></h2><p class="paragraph" style="text-align:left;">This session explores a GenAI-driven architecture designed to accelerate healthcare innovation, balancing speed, safety, and efficiency in high-risk applications.</p><p class="paragraph" style="text-align:left;">Steffen Vogler, <b>Principal Data Scientist, Bayer</b></p></div><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/703ddccd-ef33-4666-a6e5-f1c75539bd94/image.png?t=1728787376"/></div><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" 
class="button__link" style="background-color:#ff8a00;" href="https://retrievex.co/application?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=attend-retrievex-the-conference-for-ai-data-leaders"><span class="button__text" style="color:#FFFFFF;"> Request a Free Ticket </span></a></div><p class="paragraph" style="text-align:left;">Hope to see you there,</p><p class="paragraph" style="text-align:left;">Mikayel</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=9624f5bc-77e4-468f-97b9-094ae747839f&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>OpenAI DevDay 2024, Nvidia Unveils NVLM 1.0, Meta Advances Video Generation</title>
  <description>Plus, Last Remaining FREE Tickets to RetrieveX on Oct 17 for Our Subscribers </description>
  <link>https://genai360.beehiiv.com/p/of-video-gen-models-and-agents</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/of-video-gen-models-and-agents</guid>
  <pubDate>Thu, 10 Oct 2024 15:33:41 +0000</pubDate>
  <atom:published>2024-10-10T15:33:41Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><div class="section" style="background-color:#ff8a00;border-color:#ff8a00;border-radius:2px;border-style:solid;border-width:2px;margin:10.0px 10.0px 10.0px 10.0px;padding:10.0px 10.0px 10.0px 10.0px;"><h2 class="heading" style="text-align:left;"><span style="color:#FFFFFF;">GenAI360 Exclusive: Last Week to Get Free Tickets for RetrieveX Conference on Oct 17 in San Francisco. </span></h2><p class="paragraph" style="text-align:left;"><span style="color:#FFFFFF;">Come hear from the creators of PyTorch, Albumentations, Meta Chameleon, Kubeflow, CAFFE, along with leaders from Microsoft, AWS, Bayer, Flagship Pioneering, Cresta, VoyageAI, Omneky on how to build the best retrieval for AI. </span></p><p class="paragraph" style="text-align:left;"><span style="color:#FFFFFF;">If you&#39;re an executive who&#39;s considering or working on GenAI projects, fill in the form below for a complimentary ticket for the conference - hurry up because tickets are limited! </span><span style="color:#FFFFFF;"><b>Please note that the conference is in-person only</b></span><span style="color:#FFFFFF;">.</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#FFFFFF;" href="https://www.retrievex.co/application?utm_source=newsletter&utm_medium=email&utm_campaign=weekly2"><span class="button__text" style="color:#FFFFFF;"><span style="color:#222222;">Get Tickets Today</span></span></a></div><p class="paragraph" style="text-align:left;"><span style="color:#FFFFFF;"><b>Date:</b></span><span style="color:#FFFFFF;"> October 17, 10:30am - 7pm PT</span><br><span style="color:#FFFFFF;"><b>Venue: </b></span><span style="color:#FFFFFF;">The Midway, 900 Marin St, San Francisco</span></p></div><h2 class="heading" style="text-align:left;" id="key-takeaways">Key Takeaways</h2><ul><li><p class="paragraph" style="text-align:left;">John Hopfield and <b>Geoffrey Hinton, the Godfather of 
AI, win Nobel Prize</b>… in Physics.</p></li><li><p class="paragraph" style="text-align:left;">OpenAI announced<b> new API features</b>, including a Realtime API for voice-to-voice conversations and Vision fine-tuning for improved image tasks like object detection.</p></li><li><p class="paragraph" style="text-align:left;">Nvidia released <b>NVLM-D-72B</b>, a large multimodal model rivaling GPT-4o and Claude, with its model weights publicly available.</p></li><li><p class="paragraph" style="text-align:left;">Liquid AI released Liquid Foundation Models (LFMs), delivering SOTA performance with a <b>smaller memory footprint.</b></p></li><li><p class="paragraph" style="text-align:left;"><b>Meta&#39;s Movie Gen</b> lets users create videos and edit personal images using simple text inputs, producing HD media.</p></li><li><p class="paragraph" style="text-align:left;"><b>FlashMask</b> optimizes attention mechanisms by introducing a column-wise sparse representation for attention masks, improving memory efficiency for LLMs handling sequences up to 128K tokens.</p></li><li><p class="paragraph" style="text-align:left;"><b>PlasmidGPT</b> is a transformer-based model for designing plasmid DNA which improves annotation accuracy by 81%.</p></li></ul><p class="paragraph" style="text-align:left;"><i>Got forwarded this newsletter? Subscribe below👇</i></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/subscribe?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation"><span class="button__text" style=""> Subscribe </span></a></div><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">Folks, busy week as always. 
OpenAI took the <b>spotlight</b> in terms of AI news last week yet again. It wasn’t all positive though, as there was yet another OpenAI exodus with four key members leaving (and two of them joining competitor companies already). Still, there were some neat announcements of new API features at OpenAI DevDay.</p><p class="paragraph" style="text-align:left;">Meanwhile, we saw new model releases from the likes of Nvidia, Liquid AI, and Resolve AI, as well as developments in <b>video generation AI.</b></p><h3 class="heading" style="text-align:left;" id="open-ai-unveils-realtime-api-vision">OpenAI Unveils Realtime API, Vision Fine-Tuning, and More at DevDay 2024</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://openai.com/devday/content/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">OpenAI DevDay 2024</a> took place last week and there were some huge <b>feature announcements. Are they worth a </b><a class="link" href="https://www.moomoo.com/news/post/44394161/openai-suggests-in-2026-a-maximum-loss-of-14-billion?level=1&data_ticket=1728574075683991&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow"><b>loss of $44B from 2023 to 2028</b></a><b>? 
We&#39;ll see.</b></p><p class="paragraph" style="text-align:left;">Here’s a quick rundown of what OpenAI revealed in terms of API features:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b><a class="link" href="https://openai.com/index/introducing-the-realtime-api/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Realtime API</a></b><b>:</b> Devs looking to quickly add speech-to-speech capabilities to their apps will be glad to hear about this. Voice assistant development is a whole lot easier since the Realtime API combines transcription, text reasoning, and text-to-speech into a single API call. It&#39;s currently available to paid developers, with audio priced at roughly $0.06/minute for input and $0.24/minute for output.</p></li><li><p class="paragraph" style="text-align:left;"><b><a class="link" href="https://openai.com/index/introducing-vision-to-the-fine-tuning-api/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Vision</a></b><b>: </b>The vision update brings further multimodal capabilities: GPT-4o now supports fine-tuning with images. 
Free vision fine-tuning tokens will be available until October 31, 2024, after which training and inference will be priced based on token usage.</p></li><li><p class="paragraph" style="text-align:left;"><b><a class="link" href="https://openai.com/index/api-prompt-caching/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Prompt caching</a></b><b>: </b>Devs can now reduce costs and latency by reusing previously seen input tokens through the new Prompt Caching feature in the GPT-4o models. Prompt Caching applies automatically to prompts over 1,024 tokens, and cached input tokens receive a 50% discount. </p></li><li><p class="paragraph" style="text-align:left;"><b><a class="link" href="https://openai.com/index/api-model-distillation/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Model distillation</a></b><b>: </b>This feature helps improve performance at lower costs, as it allows outputs from larger models like GPT-4o to be used to fine-tune smaller, cost-efficient models such as GPT-4o mini. 
Model Distillation is available for all devs with free training tokens until October 31, 2024, and <a class="link" href="https://openai.com/api/pricing/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">standard pricing thereafter.</a></p></li><li><p class="paragraph" style="text-align:left;"><b><a class="link" href="https://openai.com/index/introducing-canvas/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Canvas</a></b><b>:</b><b> </b>A new interface for writing and coding projects with ChatGPT, allowing collaboration beyond basic conversation. It enables direct editing and feedback - similar to a code reviewer or copy editor. It&#39;s currently in early beta with plans for rapid development based on user feedback.</p></li></ol><h3 class="heading" style="text-align:left;" id="ai-video-generation-heats-up-movie-">AI Video Generation Heats Up: Movie Gen, VidGen-2, and OpenFLUX Make Waves</h3><p class="paragraph" style="text-align:left;">A bunch of model releases were seen in the AI video generation space last week.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfyMqWTWvo_45wNbhY3gdQCNsgjcj5sEKMZofm0rvLgCCbZ_OeaIxN36ojw85beIO6XG4fTXiyIPLEhGxJdw65CXcXkoOq7_QoUFTHODgk7edKxGlBFgIGzfjW3FyjB4e9TDmQWbj0v33M7kxYEtj-Yoes?key=8XMgMByevwbz8SacwQ5JKQ"/><div class="image__source"><span class="image__source_text"><p>Example of a video produced by Meta’s Movie Gen. 
<a class="link" href="https://ai.meta.com/blog/movie-gen-media-foundation-models-generative-ai-video/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Meta introduced <a class="link" href="https://ai.meta.com/blog/movie-gen-media-foundation-models-generative-ai-video/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Movie Gen</a>, a breakthrough <b>generative AI research project</b> for media.</p><p class="paragraph" style="text-align:left;">It encompasses multiple modalities including image, video, and audio generation and editing.</p><p class="paragraph" style="text-align:left;">The system allows users to produce custom videos and sounds, edit existing videos, and transform personal images into unique videos using simple text inputs. Movie Gen uses a <b>30B parameter transformer model </b>for video generation. </p><p class="paragraph" style="text-align:left;">A quick overview of Movie Gen’s main features:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Video Generation</b>: Using a 30B parameter transformer model, Movie Gen can create high-quality, high-definition videos up to 16 seconds long at 16 fps from text prompts. </p></li><li><p class="paragraph" style="text-align:left;"><b>Precise Video editing:</b> Movie Gen&#39;s editing capabilities allow for both localized and global changes to existing videos. Users can add, remove, or replace elements, or modify entire backgrounds and styles using text prompts. 
Unlike traditional editing tools, Movie Gen preserves original content while targeting only relevant pixels.</p></li><li><p class="paragraph" style="text-align:left;"><b>Audio Generation:</b> A 13B parameter audio model can generate high-fidelity audio up to 45 seconds long, including ambient sound, sound effects, and background music, all synchronized with video content.</p></li></ol><p class="paragraph" style="text-align:left;">Meanwhile, <a class="link" href="https://www.businesswire.com/news/home/20241001089771/en/Helm.ai-Introduces-VidGen-2-Generative-AI-for-Higher-Resolution-and-Enhanced-Realism-Multi-Camera-Video-for-Autonomous-Driving?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Helm.ai’s VidGen-2</a> offers 2X higher resolution than its predecessor, generating video sequences at <b>696 x 696 resolution</b>. It improves realism and supports frame rates ranging from 5 to 30 fps. It can generate videos without an input prompt or with a single image or input video as the prompt.</p><p class="paragraph" style="text-align:left;">The model generates driving scene videos across multiple geographies, camera types, and vehicle perspectives. It produces highly realistic appearances and temporally consistent object motion. VidGen-2 learns and reproduces <b>human-like driving behaviors</b>, simulating the motions of the ego-vehicle and surrounding agents in accordance with traffic rules.</p><p class="paragraph" style="text-align:left;">Note that VidGen-2 was trained on<b> thousands of hours </b>of diverse driving footage using NVIDIA H100 Tensor Core GPUs. 
</p><p class="paragraph" style="text-align:left;">Midjourney’s competitor <a class="link" href="https://genai360.beehiiv.com/p/on-device-llms-challenge-gpt3-5?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">FLUX</a> by Black Forest Labs also saw a new release called <a class="link" href="https://huggingface.co/Kijai/OpenFLUX-comfy/blob/main/OpenFlux-fp8_e4m3fn.safetensors?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">OpenFLUX.1</a>, which is a fine-tuned version of the <b>FLUX.1-schnell model</b>. The main aim was to train out the distillation, leaving behind an open-source model that can be fine-tuned.</p><h3 class="heading" style="text-align:left;" id="nvidia-unveils-nvlm-10-and-powers-b">Nvidia Unveils NVLM 1.0 and Powers Brave&#39;s Local LLMs</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdyyt8uA90SO8VMdt8Kl_WylVJdBHzDX7EPrAAp7KHis8_SPzpfo_Gb8lu1eH83Uw7SUzx7qXn1NkXvtatpjkTO4vSh5fDx9XZN4ZwOlJuW0VJqfzXvNqsqIm_lnpFfs1-I4GFc9olX-D2te2nRRUf1-qzB?key=8XMgMByevwbz8SacwQ5JKQ"/><div class="image__source"><span class="image__source_text"><p>Nvidia’s new model can respond to prompts containing text and images. 
<a class="link" href="https://bgr.com/tech/nvidia-stunned-the-world-with-a-chatgpt-rival-thats-as-good-as-gpt-4o/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Even though we know Nvidia mainly for providing AI chips, they unexpectedly announced <a class="link" href="https://bgr.com/tech/nvidia-stunned-the-world-with-a-chatgpt-rival-thats-as-good-as-gpt-4o/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">NVLM 1.0</a> - a family of <b>large multimodal language models.</b></p><p class="paragraph" style="text-align:left;">The flagship model, NVLM-D-72B, with 72 billion parameters, is reported to perform on par with or better than leading proprietary models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro in various tasks. Unlike its competitors, Nvidia is making NVLM 1.0&#39;s <b>model weights </b>and training code publicly available. </p><p class="paragraph" style="text-align:left;">NVLM 1.0 demonstrates strong performance in vision-language tasks, effectively processing and reasoning about both text and images. 
The model shows <b>versatility</b> in tasks requiring OCR, reasoning, localization, common sense, world knowledge, and coding abilities.</p><p class="paragraph" style="text-align:left;">Additionally, after multimodal training, NVLM-D-72B shows improved accuracy on text-only tasks compared to its LLM backbone, indicating enhanced overall language understanding.</p><p class="paragraph" style="text-align:left;">Brave, a privacy-focused web browser, has launched<a class="link" href="https://blogs.nvidia.com/blog/rtx-ai-brave-browser/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow"> Leo AI,</a> a smart AI assistant. Leo AI enhances user experience by summarizing articles and videos, surfacing insights from documents, and answering questions.</p><p class="paragraph" style="text-align:left;">Leo AI runs local models through Ollama, an open-source project that sits on top of llama.cpp and simplifies AI model <b>integration for applications</b>. It handles tasks like downloading and configuring specific AI models, making it easier for applications to access local AI capabilities. NVIDIA optimizes tools like Ollama for their hardware to deliver faster, more responsive AI experiences on RTX GPUs.</p><h3 class="heading" style="text-align:left;" id="open-ai-exodus-episode-which-one-ar">OpenAI Exodus: Episode… (which one are we on now?)</h3><p class="paragraph" style="text-align:left;">This isn’t the first time we’ve seen various <a class="link" href="https://genai360.beehiiv.com/p/io-openai-exodus-14-llms?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">key members leave OpenAI</a> in such a short span of time. 
</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/10/01/anthropic-hires-openai-co-founder-durk-kingma/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Durk Kingma</a> isn’t as well-known as some of the other figures, but he&#39;s a <b>co-founder</b> of OpenAI who announced his move to Anthropic. </p><p class="paragraph" style="text-align:left;">Kingma&#39;s hiring is part of a broader pattern of <b>high-profile recruitments by Anthropic</b>. The company has recently brought on board other notable figures from OpenAI, including Jan Leike (OpenAI&#39;s former safety lead) and John Schulman (another OpenAI co-founder).</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://x.com/barret_zoph/status/1839095143397515452?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Barret Zoph</a>, who was part of the post-training team and was around for the entire ChatGPT boom in 2022, also decided to leave OpenAI. Those <b>weren’t </b>the only departures, as <a class="link" href="https://x.com/miramurati/status/1839025700009030027?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Mira Murati</a> left OpenAI after 6 and a half years. 
<a class="link" href="https://x.com/sama/status/1839096160168063488?lang=en&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Altman</a> made a statement about Mira leaving, talking about how taxing it is to be a leader at the company.</p><p class="paragraph" style="text-align:left;">Lastly, Sora co-lead <a class="link" href="https://techcrunch.com/2024/10/03/a-co-lead-on-sora-openais-video-generator-has-left-for-google/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Tim Brooks</a> released a short statement mentioning how he’ll be joining DeepMind to work on <b>video generators</b> and world simulators after spending 2 years at OpenAI working on Sora. </p><h3 class="heading" style="text-align:left;" id="resolve-ai-launches-efficient-ai-mo">Resolve AI Launches Efficient AI Model for DevOps Operations</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://greylock.com/portfolio-news/introducing-resolve/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Resolve AI</a> introduced the world&#39;s first AI production engineer, designed to autonomously troubleshoot and resolve production issues. The company aims to empower engineers to focus on innovation rather than <b>routine maintenance tasks. 
</b></p><p class="paragraph" style="text-align:left;">The goal is to significantly reduce mean time to resolution (MTTR) and <b>accelerate </b>engineering teams&#39; productivity.</p><p class="paragraph" style="text-align:left;">The company built an agentic platform that integrates with tools like AWS, Kubernetes, observability stacks, GitHub, and Slack. It constructs a <b>comprehensive knowledge graph </b>of a company&#39;s production environment.</p><p class="paragraph" style="text-align:left;">The AI agent leverages this knowledge graph to troubleshoot incidents, analyze source code changes, detect anomalies, query logs, and suggest remediation actions. Resolve acts as an always-on partner, handling <b>operational tasks like </b>answering natural language queries and translating them into observability actions.</p><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">Last time, we saw a completely unexpected release in terms of biotech AI with <a class="link" href="https://genai360.beehiiv.com/p/of-whales-and-strawberries?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Google’s whale bioacoustics model.</a> We’re seeing more biotech developments with PlasmidGPT, which is used to <b>annotate plasmid DNA sequences</b> (as the name suggests).</p><p class="paragraph" style="text-align:left;">Other developments include a new framework for multi-agent exploration and a new extension to the <b>FlashAttention </b>algorithm, as well as new generative AI models that take efficiency to another level.</p><h3 class="heading" style="text-align:left;" id="liquid-foundation-models-a-new-appr">Liquid Foundation Models: A New Approach to Efficient and Powerful AI</h3><p class="paragraph" style="text-align:left;"></p><div 
class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfyJg54qkOpALkLiUIwxzcNfxVuSKAfTWz-0SHawmux9Zppc9vEAk0GdJgcehRyY1gfkZn00klxtWM9ktXxisBFU1rEaHfIV5UL--488w6MaBTOEFrOVmWAKcJ6FoDI5fGhyj4J9e7ltnn6qt-CVICnOgVq?key=8XMgMByevwbz8SacwQ5JKQ"/><div class="image__source"><span class="image__source_text"><p>LFMs performance/size trade-offs are better than other models. <a class="link" href="https://www.liquid.ai/liquid-foundation-models?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers from <a class="link" href="https://www.liquid.ai/liquid-foundation-models?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Liquid AI</a> have introduced Liquid Foundation Models (LFMs), a new generation of generative AI models that aim to achieve <b>SOTA performance </b>while maintaining a smaller memory footprint and more efficient inference.</p><p class="paragraph" style="text-align:left;">LFMs address several challenges in current AI model development, including the trade-off between model size and performance, memory efficiency, and the ability to handle long-context tasks. 
The researchers proposed a <b>new architecture </b>that diverges from the traditional transformer-based models.</p><p class="paragraph" style="text-align:left;">Impressive results were reported across the board:</p><ul><li><p class="paragraph" style="text-align:left;">LFM-1B achieves the highest scores across various benchmarks in the 1B parameter category.</p></li><li><p class="paragraph" style="text-align:left;">LFM-3B outperforms previous generation 7B and 13B models on multiple benchmarks.</p></li><li><p class="paragraph" style="text-align:left;">LFM-40B offers performance comparable to larger models while using only 12B activated parameters.</p></li></ul><p class="paragraph" style="text-align:left;">They also show strong performance in multilingual capabilities and can effectively utilize their full 32k token context length. This <b>efficiency </b>enables long-context tasks on edge devices for the first time, so we might see new applications in areas like document analysis, context-aware chatbots, and improved RAG performance.</p><h3 class="heading" style="text-align:left;" id="plasmid-gpt-ai-powered-plasmid-desi">PlasmidGPT: AI-Powered Plasmid Design and Annotation</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1.full?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">PlasmidGPT</a> is a <b>generative framework</b> for designing and annotating plasmid DNA sequences. 
It addresses the challenges in automating plasmid design and leveraging the growing collection of engineered plasmid sequences.</p><p class="paragraph" style="text-align:left;">Some of the main innovations include:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">A decoder-only transformer model pretrained on 153,000 engineered plasmid sequences from Addgene.</p></li><li><p class="paragraph" style="text-align:left;">The ability to generate de novo plasmid sequences that share characteristics with engineered plasmids but have low sequence identity to the training data.</p></li><li><p class="paragraph" style="text-align:left;">Conditional generation capabilities, allowing users to specify starting sequences or fine-tune the model for specific vector types.</p></li><li><p class="paragraph" style="text-align:left;">Effective prediction of various sequence-related attributes for both engineered and natural plasmids.</p></li></ol><p class="paragraph" style="text-align:left;">They showed that PlasmidGPT can generate plasmids with genetic part distributions similar to those in the training sequences. 
The model also <b>outperforms</b> previous approaches in predicting attributes like lab of origin, with a top-1 accuracy of 81% and top-10 accuracy of 92%.</p><h3 class="heading" style="text-align:left;" id="optimizing-attention-for-long-conte">Optimizing Attention for Long-Context LLMs with FlashMask</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2410.01359v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">FlashMask</a> is an extension to the FlashAttention algorithm that significantly improves the efficiency and flexibility of<b> attention mechanisms in LLMs.</b></p><p class="paragraph" style="text-align:left;">It addresses the growing challenge of handling complex masking requirements in various LLM training and inference scenarios. FlashMask introduces a column-wise sparse representation of attention masks, allowing for <b>efficient handling</b> of a wide range of mask types without compromising computational accuracy.</p><p class="paragraph" style="text-align:left;">They introduced optimized kernel implementations that leverage sparsity in the attention mask to skip unnecessary computations. Moreover, extensive evaluations across different attention mask types and models showed significant throughput improvements in fine-tuning and <b>alignment training of LLMs.</b></p><p class="paragraph" style="text-align:left;">They reported end-to-end speedups ranging from 1.65x to 3.22x compared to existing FlashAttention dense methods. 
Additionally, FlashMask outperforms the latest counterpart, FlexAttention, by<b> 12.1% to 60.7% in terms of kernel TFLOPs/s.</b></p><h3 class="heading" style="text-align:left;" id="llm-guided-efficient-multi-agent-ex">LLM-Guided Efficient Multi-Agent Exploration Using LEMAE</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcz2Aw76fGze_RDI3W3nNeqpF43rZO_6_pVHaq8aI3g3Cno8JYMV_BG51pU3AwcF8h5zLa7bQHhztZ2vZu0_Mdz-4l-rmabSJg29IQ9-7Fai1I1G594VE43UBxOBZnZfP1eDndGHAdNKRTE249_1bODJ5Jm?key=8XMgMByevwbz8SacwQ5JKQ"/><div class="image__source"><span class="image__source_text"><p>Map of the task “Pass.” (<a class="link" href="https://arxiv.org/pdf/2410.02511v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2410.02511v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">LLM for Efficient Multi-Agent Exploration (LEMAE)</a> leverages LLMs to enable efficient multi-agent exploration in reinforcement learning. This work addresses the longstanding challenge of efficient exploration in <b>complex multi-agent environments</b> with expansive state-action spaces.</p><p class="paragraph" style="text-align:left;">They used LLMs to ground linguistic knowledge into symbolic key states that are critical for task fulfillment. 
Additionally, they used a Subspace-based Hindsight Intrinsic Reward (SHIR) mechanism to <b>guide agents </b>toward key states by increasing reward density, and a Key State Memory Tree (KSMT) to track transitions between key states and organize exploration.</p><p class="paragraph" style="text-align:left;">They showed LEMAE significantly outperforms existing SOTA approaches on challenging benchmarks like StarCraft Multi-Agent Challenge (SMAC) and Multiple-Particle Environment (MPE). In certain scenarios, LEMAE achieves a <b>10x</b> acceleration in exploration efficiency.</p><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">A discussion about a new approach to information retrieval from complex documents took the spotlight: it offers a <b>much faster</b> indexing process than traditional methods and far better accuracy on visually heavy documents.</p><p class="paragraph" style="text-align:left;">The other discussion that caught our attention was about a new paper by Meta finally giving us some insight into the post-training process of the <b>Llama models.</b></p><h3 class="heading" style="text-align:left;" id="how-metas-mixture-of-judges-perfect">How Meta&#39;s Mixture of Judges Perfects LLM Post-Training</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe_7_i-q649jkkpem8DJ6cXu8vWf7RZIaCjMoFtZS57j9RaoYipC5HocI5TFH-qBtcNFO14vk5Hi7bxUR55k8-EEalQnPVXGj-7yDsSkahaRB_uJSqDCzs-7-arFgi_gjriinwdc2G86amDDrsofikINnqK?key=8XMgMByevwbz8SacwQ5JKQ"/><div class="image__source"><span class="image__source_text"><p>Carr brought up Meta’s new paper. 
<a class="link" href="https://x.com/andrew_n_carr/status/1841178577129390553?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">If you’ve been wondering about how Meta has been going about the post-training process for the Llama models, you’re in luck, as they recently released a <a class="link" href="https://x.com/andrew_n_carr/status/1841178577129390553?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">paper</a> on &quot;Constrained Generative Policy Optimization&quot; for <b>post-training LLMs.</b></p><p class="paragraph" style="text-align:left;">In particular, Carr put the spotlight on Meta&#39;s innovative approach to addressing the challenges of aligning LLMs on multiple tasks simultaneously. The introduction of Mixture of Judges models to achieve a balanced blend of<b> RLHF improvements</b> stood out.</p><p class="paragraph" style="text-align:left;">This approach tackles common issues in multi-task alignment like reward hacking, multi-objective conflicts, and contradictory goals. The judges include <b>specialized models </b>for false refusal, precise instruction following, regex math/code reasoning, factuality, and safety.</p><p class="paragraph" style="text-align:left;">The thread emphasized the simplicity and effectiveness of this method, which <b>improves performance </b>across various benchmarks including MATH, Human Eval, ARC, and AlpacaEval. 
</p><h3 class="heading" style="text-align:left;" id="the-promise-of-col-pali-in-informat">The Promise of ColPali in Information Retrieval</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeQ-dsaBhWUc0KHgWdyAoL4E4DQk-1ZcEyqYLhZG-L8LgkXVMsTSDZSLmgHd5OUjhXvIbKBAGWVv1K84kuQYB816cVR-metA00Y7JvBAE3LLdMfGTNRTBsKUUuoHba49-S-FlP8R0UVdeqrjxmUjGyT1fd6?key=8XMgMByevwbz8SacwQ5JKQ"/><div class="image__source"><span class="image__source_text"><p>Leonie’s post introducing ColPali. <a class="link" href="https://x.com/helloiamleonie/status/1839321865195851859?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">An interesting thread discussing <a class="link" href="https://x.com/helloiamleonie/status/1839321865195851859?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">ColPali</a> popped up last week. ColPali is a new approach to information retrieval from <b>complex document types </b>like PDFs. </p><p class="paragraph" style="text-align:left;">What caught our eye was the explanation of how ColPali combines two key technologies: the contextualized <b>late interaction mechanism</b> from ColBERT and the Vision Language Model capabilities of PaliGemma. </p><p class="paragraph" style="text-align:left;">This innovative approach replaces the traditional multi-step PDF parsing process with a simpler method using &quot;screenshots&quot; of <b>PDF pages</b>, which could change how we handle complex documents in information retrieval tasks. 
And on that note…</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeRx5ZL18VfQHz6aCwd9CoT3uL-cj8uUTvreda2DEQDS0YOcTXgwK2gkvIWMDIK7r-yQjx5rDgbgoSFtZoLE1JYvhRISqOSLwyekh1ovKW7NW7Sv-sXlK6nq1tgLOP7ZiIPvOxAuipA2exBQtabESEwxRA?key=8XMgMByevwbz8SacwQ5JKQ"/><div class="image__source"><span class="image__source_text"><p>An interesting perspective on ColPali. <a class="link" href="https://x.com/ManuelFaysse/status/1839397235177722134?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><h2 class="heading" style="text-align:left;" id="frameworks-we-love">Frameworks We Love</h2><p class="paragraph" style="text-align:left;">Some frameworks that caught our attention in the last week include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2410.02761?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">FakeShield</a>: Multimodal framework designed to address challenges in image forgery detection and localization (IFDL) </p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2410.02743?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">MA-RLHF</a>: Incorporates macro actions (sequences of tokens or higher-level language constructs) into the learning process for large language models.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" 
href="https://arxiv.org/pdf/2410.02705?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">ControlAR</a>: Integrates spatial controls into autoregressive image generation models</p></li></ul><p class="paragraph" style="text-align:left;">If you want your framework to be featured here, reply to this email saying hi :)</p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in AI</h2><p class="paragraph" style="text-align:left;">There were successful funding rounds for companies developing varied AI applications, including ones we don’t hear about often like <b>liquid cooling solutions</b> for data centers. Poolside, Submer, and Numa all saw successful funding rounds, with Poolside being the biggest winner last week.</p><h3 class="heading" style="text-align:left;" id="poolside-secures-500-million">Poolside Secures $500 Million</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.bloomberg.com/news/articles/2024-10-02/poolside-raises-500-million-with-bain-dst-for-coding-ai?accessToken=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzb3VyY2UiOiJTdWJzY3JpYmVyR2lmdGVkQXJ0aWNsZSIsImlhdCI6MTcyNzg4MzQyNSwiZXhwIjoxNzI4NDg4MjI1LCJhcnRpY2xlSWQiOiJTS080SUlUMVVNMFcwMCIsImJjb25uZWN0SWQiOiI3NkE1RDA0Q0RENTY0QjM1QUI2NTY4RDdBOTM1OUQ4MCJ9.KyGmlFM_3tpLLgj7f1ESRCegAh2xdkMhPepBNeo6cec&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Poolside</a>, a startup developing <b>AI-powered coding software,</b> has secured a massive $500 million investment led by Bain Capital Ventures, valuing the company at $3 billion. 
This significant funding round includes participation from DST Global, StepStone Group, Citi Ventures, and HSBC Ventures.</p><h3 class="heading" style="text-align:left;" id="submer-raises-555-million-in-series">Submer Raises $55.5 Million in Series C Funding Round</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/10/02/as-data-center-usage-heats-up-submer-raises-55-5m-to-cool-things-down/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Submer</a>, a Barcelona-based startup specializing in<b> liquid cooling solutions for data centers,</b> has raised $55.5 million in a Series C round at a $500 million valuation. The company&#39;s technology involves submerging entire server racks in biodegradable, non-conducting coolant, which helps address the growing challenge of heat management in AI-driven data centers.</p><h3 class="heading" style="text-align:left;" id="numa-secures-32-million-in-series-b">Numa Secures $32 Million in Series B Funding Round</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/10/01/numa-is-bringing-ai-and-automation-to-car-dealerships/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Numa</a>, an AI startup specializing in <b>customer service automation for auto dealerships</b>, has secured a $32 million Series B funding round, bringing its total raised to $48 million. The company, which pivoted from a general conversational AI product to focus specifically on the automotive industry, claims to be nearing cash-flow break-even with 600 customers across the U.S. 
and Canada.</p><h3 class="heading" style="text-align:left;" id="cerebras-files-for-ipo">Cerebras Files for IPO</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.cnbc.com/2024/09/30/cerebras-files-for-ipo.html?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=openai-devday-2024-nvidia-unveils-nvlm-1-0-meta-advances-video-generation" target="_blank" rel="noopener noreferrer nofollow">Cerebras Systems</a> filed for an initial public offering on September 30, 2024. The company plans to trade on the Nasdaq under the ticker symbol &quot;CBRS&quot;. Cerebras positions itself as a competitor to Nvidia in the AI chip market, claiming its <b>WSE-3 chip</b> has more cores and memory than Nvidia&#39;s popular H100.</p><p class="paragraph" style="text-align:left;">For the first six months of 2024, Cerebras reported a <b>net loss of $66.6 million</b> on $136.4 million in sales. This represents significant revenue growth compared to the same period in 2023, when it posted a net loss of $77.8 million on $8.7 million in sales. For the full year 2023, the company reported a net loss of $127.2 million on revenue of $78.7 million.</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=a3c88ce3-cac0-4ca2-add9-d3813e60ee30&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>September Roundup: 🦙3.2, GPT Next’s 100x Power, $125B Supercomputers</title>
  <description>Plus, free tickets for data leaders for RetrieveX Conference in SF on Oct 17</description>
  <link>https://genai360.beehiiv.com/p/of-llamas-and-proteins</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/of-llamas-and-proteins</guid>
  <pubDate>Tue, 08 Oct 2024 13:54:43 +0000</pubDate>
  <atom:published>2024-10-08T13:54:43Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><div class="section" style="background-color:#ff8a00;border-color:#F9FAFB;border-radius:5px;border-style:solid;border-width:5px;margin:0.0px 0.0px 0.0px 0.0px;padding:15.0px 15.0px 15.0px 15.0px;"><h2 class="heading" style="text-align:left;"><span style="color:#F9FAFB;">GenAI360 Exclusive: Unlock Free Tickets for RetrieveX Conference on Oct 17 in San Francisco. </span></h2><p class="paragraph" style="text-align:left;"><span style="color:#F9FAFB;">Come hear from the creators of Meta Chameleon, PyTorch, Kubeflow, and CAFFE, along with leaders from Microsoft, AWS, Bayer, Flagship Pioneering, Cresta, VoyageAI, and Omneky, on how to build the best RAG and LLM-powered workflows. </span></p><p class="paragraph" style="text-align:left;"><span style="color:#F9FAFB;">If you&#39;re an engineering leader who&#39;s considering or working on GenAI projects, </span><span style="color:#F9FAFB;"><b>fill in the form below for a complimentary ticket to the conference </b></span><span style="color:#F9FAFB;">- hurry up, tickets are limited!</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#F9FAFB;" href="https://www.retrievex.co/application?utm_source=newsletter&utm_medium=email&utm_campaign=weekly1"><span class="button__text" style="color:#000000;"><span style="color:#000000;">Get Tickets Today</span></span></a></div><p class="paragraph" style="text-align:left;"><span style="color:#F9FAFB;">Date: October 17, 10:30am - 7pm PT</span><br><span style="color:#F9FAFB;">Venue: The Midway, 900 Marin St, San Francisco</span></p></div><h2 class="heading" style="text-align:left;" id="key-takeaways">Key takeaways</h2><ul><li><p class="paragraph" style="text-align:left;">Meta introduced <b>Llama 3.2 </b>with vision-capable LLMs (11B and 90B) for document understanding, and lightweight text models (1B, 3B) optimized for mobile devices, supporting on-device processing with 
improved privacy.</p></li><li><p class="paragraph" style="text-align:left;">OpenAI Japan announced &quot;<b>GPT Next,</b>&quot; projected to be 100x more powerful than GPT-4, with a planned release in 2024.</p></li><li><p class="paragraph" style="text-align:left;">DeepMind introduced <b>AlphaProteo</b>, an AI system for designing novel protein binders with 3-300x better binding affinities than existing methods.</p></li><li><p class="paragraph" style="text-align:left;">Google’s <b>NotebookLM is a personalized AI research assistant</b> that helps you chat across all your docs, and even generate audio overviews from uploaded content, creating podcasts and summaries with conversational AI voices.</p></li><li><p class="paragraph" style="text-align:left;">NVIDIA researchers introduced OP-RAG, outperforming long-context LLMs on the ∞Bench dataset using just <b>16K retrieved tokens</b>.</p></li></ul><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">Just when you thought OpenAI had enough on their plate with Project Strawberry, they dropped some major news in Japan that they’re planning to release another model called <b>GPT Next later in the year.  </b>Other major releases included Meta’s Llama 3.2 and Google’s NotebookLM.</p><p class="paragraph" style="text-align:left;">There were a bunch of new releases last week too, including LLMs, AI agents, and VLMs. 
Progress in <b>biotech </b>was also made by DeepMind’s latest model called AlphaProteo.</p><h3 class="heading" style="text-align:left;" id="metas-llama-32-and-ai-updates">Meta’s Llama 3.2 and AI Updates</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXelFFIhr0ycxB-qvwDWb9vQIuPtK4JeBfHZIcCDe3FWrUwSPylQZu8XHTaDicW22FQHbYQ-3YkocmeFVTx8MptFTwkdC3vWTxlR2gx1t4v8o_uEIUu4LlpiTBL-jzx2ETZJT54H_19lgkw-9kD9-kS0UlUU?key=7OifTaVGnpPb3f8EB7B4vg"/><div class="image__source"><span class="image__source_text"><p>Llama 3.2 performs well on instruction-tuned benchmarks. <a class="link" href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Llama 3.1 recently reached <a class="link" href="https://genai360.beehiiv.com/p/strawberry-conf?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">350+ million lifetime downloads</a>. With this great achievement, Meta shipped another exciting update. <a class="link" href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Meta&#39;s Llama 3.2</a> includes <b>vision-capable LLMs (11B and 90B)</b> for tasks like document understanding and visual reasoning, as well as lightweight text-only models (1B and 3B) optimized for edge and mobile devices with 128K token support. 
</p><p class="paragraph" style="text-align:left;">Llama 3.2&#39;s smaller models are optimized for devices using <b>Qualcomm and MediaTek </b>hardware. As a result, we’ll see faster, private processing for summarization and instruction-following without needing cloud access.</p><p class="paragraph" style="text-align:left;">The 11B and 90B models introduce a <b>new architecture</b> integrating pre-trained image encoders with cross-attention layers for high-performance image-text reasoning, making them drop-in replacements for text models.</p><p class="paragraph" style="text-align:left;">An important update was also the <a class="link" href="https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.md?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Llama Stack</a>, a set of APIs for Agents, data generation, inference, guardrails, evals, and more. 
Aside from that, we saw some <b>updates</b> for <a class="link" href="https://about.fb.com/news/2024/09/metas-ai-product-news-connect/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Meta’s AI products</a>, with new voice, text generation, and photo understanding capabilities within Messenger, Instagram, and Meta for Business.</p><h3 class="heading" style="text-align:left;" id="geminis-upgrade-notebook-lm-as-an-a">Gemini&#39;s Upgrade, NotebookLM as an AI Research Assistant, and Arcade&#39;s Product Platform </h3><p class="paragraph" style="text-align:left;">Google announced<a class="link" href="https://www.tomsguide.com/ai/google-just-dropped-new-versions-of-gemini-here-s-why-its-a-big-deal?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow"> new versions</a> of Gemini called <b>Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002</b> models. Standout benefits include 50% price drop, 2x faster outputs, and 3x lower latency.</p><p class="paragraph" style="text-align:left;">These models show a 20% improvement in <b>math-related tasks </b>and substantial gains in visual understanding and code generation, so they’re ideal for tasks like summarizing large documents and understanding long videos.</p><p class="paragraph" style="text-align:left;">Moreover, Google reduced the output length by <b>5-20%</b>, speeding up responses without compromising quality, with options for more verbose outputs via prompting strategies. 
Google also simplified access to the models through AI Studio and Vertex AI, increasing rate limits and improving scalability for developers.</p><p class="paragraph" style="text-align:left;">Google also released<a class="link" href="https://notebooklm.google/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow"> NotebookLM (only available for English speakers),</a> which lets users upload content, ask questions, and generate podcast-like conversations between AI voices, providing <b>human-like discussions </b>of uploaded materials.</p><p class="paragraph" style="text-align:left;">It <a class="link" href="https://www.forbes.com/sites/rogerdooley/2024/10/04/how-to-create-an-ai-podcast-about-anything-in-seconds-with-notebooklm/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">supports a variety of content</a>, as users can upload PDFs, text files, websites, and videos into &quot;notebooks&quot; to create <b>customized summaries </b>or ask questions about the content. In addition to audio overviews, NotebookLM provides study guides, accurate quotations, FAQs, and detailed summaries.</p><p class="paragraph" style="text-align:left;">Furthermore, the first AI product creation platform was released by <a class="link" href="https://www.arcade.ai/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Arcade AI,</a> described as “<b>prompt-to-product</b>” using design-to-manufacturing AI tech. It has a simple three step process where you enter a prompt, edit it, then share it with others. 
Currently, Arcade AI is in beta form.</p><h3 class="heading" style="text-align:left;" id="ai-2-s-multimodal-molmo-challenges-">Ai2&#39;s Multimodal Molmo Challenges Tech Giants</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfHNzucWqiSuTYD1OuHwBmo6-ExOz25r36h78VrO68Ph6h6aolb3VGm7jgUGmkMXyV3elkPBXOu6PQxYn92oIk4deSvs_9o3cmlKbh9i2E9UwiXaz5zdb3hCKEtCT5917EM2PR-8EQ9xrXIFpF5TkWQphJK?key=7OifTaVGnpPb3f8EB7B4vg"/><div class="image__source"><span class="image__source_text"><p>Molmo outperforms GPT-4o on academic benchmarks. <a class="link" href="https://molmo.allenai.org/blog?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Ai2 launched <a class="link" href="https://molmo.allenai.org/blog?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Molmo</a>, a <b>multimodal</b> open-source model that rivals major models like GPT-4o, Gemini 1.5, and Claude-3.5 in performance, but at a fraction of their size.</p><p class="paragraph" style="text-align:left;">Molmo excels in tasks like answering questions about images, recognizing objects, and navigating web interfaces without needing complex infrastructure. 
This is thanks to the fact that it uses high-quality data, with only <b>600,000 annotated images</b>, achieving impressive results compared to models trained on billions of images.</p><h3 class="heading" style="text-align:left;" id="replits-ai-agent-vs-random-labs-ide">Replit&#39;s AI Agent vs Random Labs&#39; IDE Integration</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://docs.replit.com/replitai/agent?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Replit’s new AI agent </a>helps users build software projects from the ground up by using natural language prompts. As a result, it makes software development a lot <b>more accessible</b> to users of all levels.</p><p class="paragraph" style="text-align:left;">It’s available to Replit Core and Teams members at no additional cost during this “<b>early access</b>” period. Pricing details will be available later in 2024, and it can be used through the web interface or mobile app.</p><p class="paragraph" style="text-align:left;">Another agent release we saw was by <a class="link" href="https://www.ycombinator.com/launches/Lnp-random-labs-an-open-source-software-agent-that-lives-in-your-codebase?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Random Labs</a>, who developed an AI agent directly integrated into the <b>IDE</b>. It can make changes across huge codebases and provides more than just autocomplete or chat functionality. 
</p><h3 class="heading" style="text-align:left;" id="alpha-proteos-protein-revolution-an">AlphaProteo&#39;s Protein Revolution and LIGO&#39;s Open-Source Leap</h3><p class="paragraph" style="text-align:left;"></p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeW78XCFCaMNx0ZAeq8yxQIbLc4no5SHvC_oRzU53ZB-VvzfUMqYoI0hwHQShQTbm7GU4nEZxDse9g2iOevmCyMWWUp30I-rDhgG27b8GSa7JeMsMir6wG31U2YEU8Qxk49AySAOKVkUCG5CFwwbaHLfFMT?key=kUQfUwNwF7ajC3NgSN1FFA"/><div class="image__source"><span class="image__source_text"><p>Predicted protein binder structure generated by AlphaProteo, shown in blue. <a class="link" href="https://deepmind.google/discover/blog/alphaproteo-generates-novel-proteins-for-biology-and-health-research/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">(Source)</a> </p></span></div></div><p class="paragraph" style="text-align:left;">DeepMind made some significant advancements <b>in biotech </b>with the release of <a class="link" href="https://genai360.beehiiv.com/p/gpt4o-alphafold-yoco?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">AlphaFold 3 a few months ago.</a> </p><p class="paragraph" style="text-align:left;">Now, they’ve developed a new AI system called <a class="link" href="https://deepmind.google/discover/blog/alphaproteo-generates-novel-proteins-for-biology-and-health-research/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">AlphaProteo</a>, which can design novel, <b>high-strength protein binders </b>to serve as building blocks for biological and health research. 
It aims to accelerate progress in areas like drug development, disease understanding, and biosensors.</p><p class="paragraph" style="text-align:left;">The system generates new protein binders for diverse target proteins, including challenging targets like VEGF-A. It achieves <b>higher experimental success </b>rates and 3 to 300 times better binding affinities than existing methods on seven tested target proteins.  </p><p class="paragraph" style="text-align:left;">Sounds promising, but there’s still work to be done since the system has limitations like being unable to design binders for<b> some challenging targets.</b></p><p class="paragraph" style="text-align:left;">Speaking of AlphaFold 3, we saw an open-source implementation of it called <a class="link" href="https://github.com/Ligo-Biosciences/AlphaFold3?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">LIGO</a>. The current release implements the full AlphaFold 3 model along with the <b>training code</b>, focusing first on single chain prediction capability. Future updates will add ligand, multimer, and nucleic acid prediction capabilities once they are trained.</p><p class="paragraph" style="text-align:left;">LIGO&#39;s implementation includes several core modules reused from the OpenFold project, such as triangular attention and multiplicative update, as well as their data processing pipelines. It also utilizes the ProteinFlow library for its <b>data pipeline.</b></p><p class="paragraph" style="text-align:left;">The project addresses some <b>discrepancies</b> from AlphaFold 3&#39;s published pseudocode, including changes to the MSA module order, loss scaling, and DiT block design. 
</p><h3 class="heading" style="text-align:left;" id="open-a-is-gpt-next-bombshell-100-x-"><b>OpenAI&#39;s GPT Next</b> Bombshell: 100x Power, $2000 Price Tag</h3><p class="paragraph" style="text-align:left;">OpenAI Japan announced “<a class="link" href="https://the-decoder.com/openai-japan-shares-vision-for-much-more-powerful-gpt-next-coming-in-2024/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">GPT Next</a>” - a model planned for release in 2024 that is supposed to be 100x more powerful than GPT-4. This name is just a placeholder though, and we’ve seen some rumors floating around that the model is codenamed “<b>Project Orion</b>” (funny how this has the same name as Meta&#39;s AR glasses project).</p><p class="paragraph" style="text-align:left;">Orion focuses on improving NLP capabilities while expanding into multimodal territory, being able to integrate text, image, and video inputs. 
Notably, Orion is said to leverage training data generated by O1, since high-quality data for training models<b> isn’t always abundant.</b></p><p class="paragraph" style="text-align:left;">It’ll be interesting to see if Orion actually achieves the high performance OpenAI claims, since training AI models on <b>excess training data</b> can actually lead to worse performance.</p><p class="paragraph" style="text-align:left;">In terms of pricing, executives at OpenAI are even thinking about putting subscription prices as high as <a class="link" href="https://www.pymnts.com/artificial-intelligence-2/2024/report-openai-considers-2000-monthly-subscription-prices-for-new-llms/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">$2000</a> for access to models like O1 and Orion, a <b>100x price increase from the standard ChatGPT subscription.</b></p><p class="paragraph" style="text-align:left;">Anthropic also hit a milestone, with <a class="link" href="https://techcrunch.com/2024/09/04/anthropic-launches-claude-enterprise-plan-to-compete-with-openai/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Anthropic’s Claude</a> recently reaching $1 million in mobile app revenue. They’re going a step further with the release of <b>Claude Enterprise.</b></p><p class="paragraph" style="text-align:left;">It’s a new subscription plan for its AI chatbot aimed at enterprise customers, offering more administrative controls and increased security. 
This puts Anthropic in <b>direct competition </b>with OpenAI&#39;s ChatGPT Enterprise, which was released about a year ago.</p><h3 class="heading" style="text-align:left;" id="col-pali-and-qwen-2-vl-for-multimod">ColPali and Qwen2-VL for Multimodal RAG and HoneyComb Takes Top Spot on SWE-Agent leaderboard</h3><p class="paragraph" style="text-align:left;">There was a recent discussion on X about ColPali being better for <b>information retrieval </b>than the combination of OCR and LLMs. </p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeCXU9jF0zvHXERScxiodb742LbcYqoM1nQL3WehXm3nCN9NU3a-35wlgPmyg-ipx-cMokydlbECtJbNTHMjf6A1GsStFZfXG4FgF_f8asFrxAbyXcRbHXRZ0aKar3VarDkj8LIpOB0Gfs0DHZ-Sm9IeQlU?key=kUQfUwNwF7ajC3NgSN1FFA"/><div class="image__source"><span class="image__source_text"><p>ColPali and Qwen2-VL were used in combination. <a class="link" href="https://x.com/mervenoyann/status/1831737088468791711?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/merveenoyan/smol-vision/blob/main/ColPali_%2B_Qwen2_VL.ipynb?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">ColPali </a>is built on the PaliGemma-3B model, which we saw in an announcement at <a class="link" href="https://genai360.beehiiv.com/p/io-openai-exodus-14-llms?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">I/O 2024 by Google.</a> In short, ColPali is a document retrieval model that uses VLMs to index and retrieve information from 
documents based on visual features. </p><p class="paragraph" style="text-align:left;">The main aim is to tackle <b>tricky issues</b> that traditional document retrieval systems can’t deal with, like:</p><ul><li><p class="paragraph" style="text-align:left;">Complex data ingestion pipeline</p></li><li><p class="paragraph" style="text-align:left;">OCR requirements </p></li><li><p class="paragraph" style="text-align:left;">Difficulty in handling visual elements like tables and figures</p></li></ul><p class="paragraph" style="text-align:left;">A key benefit of using ColPali is that OCR or image captioning isn’t necessary. It’s also a more efficient method compared to OCR + LLMs while having higher accuracy. In addition, Qwen2-VL-7B is used for the <b>generation process </b>in RAG.</p><p class="paragraph" style="text-align:left;">The <a class="link" href="https://x.com/dylfreed/status/1831075759747723709?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Qwen2-VL models</a> saw new releases last week with three sizes: 2B, 7B, and 72B, with the 7B instruction-tuned version being open-sourced.</p><p class="paragraph" style="text-align:left;">It showed impressive results, achieving SOTA performance on visual understanding benchmarks like MathVista, DocVQA, and RealWorldQA. 
Moreover, it can understand videos over <b>20 minutes long </b>for high-quality video-based tasks.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://x.com/snowmaker/status/1831219425007296566?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">HoneyComb</a> was a new model that took the SWE-Agent leaderboard by storm last week by achieving SOTA results, putting it in first place.</p><p class="paragraph" style="text-align:left;">It takes a different approach from other models. Rather than having one jack-of-all-trades AI agent, HoneyComb uses <b>multiple AI agents</b> that are fine-tuned to perform well at individual tasks like bug fixing and code review. </p><p class="paragraph" style="text-align:left;">What’s great about this method is that it creates a potentially infinite cycle of continuous improvement and bug fixing. This makes the lives of software engineers easier, as it can <b>automate any step</b> of the development process.</p><p class="paragraph" style="text-align:left;">A new feature for <a class="link" href="https://x.com/hamiltonulmer/status/1831361779739549887?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">SQL query editing</a> was another neat release. In short, this system executes queries in real-time as the user types them, so it provides <a class="link" href="https://x.com/hamiltonulmer/status/1831361779739549887?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">instant feedback</a> based on the query’s results and performance. 
</p><h3 class="heading" style="text-align:left;" id="us-eu-and-uk-sign-landmark-coe-trea">US, EU, and UK Sign Landmark COE Treaty on AI Safety, Draghi’s Report</h3><p class="paragraph" style="text-align:left;">Previously, the <a class="link" href="https://genai360.beehiiv.com/p/on-device-llms-challenge-gpt3-5?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">EU AI Act came into force</a>. </p><p class="paragraph" style="text-align:left;">The saga continues as the Council of Europe (COE) has introduced the first-ever legally binding international treaty on AI safety, called the &quot;<a class="link" href="https://techcrunch.com/2024/09/05/us-uk-and-eu-sign-on-to-the-council-of-europes-high-level-ai-safety-treaty/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Council of Europe Framework Convention on AI and Human Rights, Democracy, and the Rule of Law.</a>&quot; Major signatories include the U.S., U.K., and European Union, along with <b>several smaller nations.</b></p><p class="paragraph" style="text-align:left;">The treaty focuses on <b>three main areas: </b></p><ol start="1"><li><p class="paragraph" style="text-align:left;">Protecting human rights (including data privacy and anti-discrimination)</p></li><li><p class="paragraph" style="text-align:left;">Safeguarding democracy</p></li><li><p class="paragraph" style="text-align:left;">Upholding the rule of law</p></li></ol><p class="paragraph" style="text-align:left;">Notable absences from the initial signatories include countries from Asia, the Middle East, and Russia. 
The treaty&#39;s goal is to create a technology-neutral framework that can withstand <b>future developments</b> in AI technology.</p><p class="paragraph" style="text-align:left;">It’ll officially enter into force <b>three months</b> after at least five signatories, including three COE member states, have ratified it. </p><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.euronews.com/next/2024/09/09/with-ai-eu-has-opportunity-to-capitalise-on-digital-says-draghi?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Draghi&#39;s report</a> on European competitiveness came at a crucial time for AI regulation in the EU. It follows the recent implementation of the <b>EU AI Act </b>and the signing of the Council of Europe&#39;s landmark AI safety treaty.</p><p class="paragraph" style="text-align:left;">While the treaty, which focuses on protecting human rights, safeguarding democracy, and upholding the rule of law, sets a <b>broad regulatory framework, </b>Draghi&#39;s report highlights the economic opportunities for the EU in the AI sector.</p><p class="paragraph" style="text-align:left;">As the EU works to balance innovation with responsible AI development, Draghi&#39;s recommendations could inform how the EU implements the <b>principles outlined in the COE treaty,</b> especially in areas where Europe still has a competitive edge.</p><h3 class="heading" style="text-align:left;" id="plans-for-two-125-billion-supercomp">Plans for Two $125 Billion Supercomputers in North Dakota</h3><p class="paragraph" style="text-align:left;">We could see the development of <a class="link" 
href="https://www.datacenterdynamics.com/en/news/two-companies-seek-to-develop-125bnai-data-centers-in-north-dakota/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">two massive data centers</a> in North Dakota, with each one costing up to <b>$125 billion.</b> This is currently being considered by two trillion-dollar market cap companies.</p><p class="paragraph" style="text-align:left;">This means it&#39;s the usual suspects like Nvidia, Microsoft, Apple, and Google. Rumor has it that Microsoft might be a likely candidate since they were looking into building a <b>$100 billion supercomputer</b> campus with OpenAI in March 2024, which we talked about in our <a class="link" href="https://genai360.beehiiv.com/p/1-bit-llms?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">first newsletter.</a></p><p class="paragraph" style="text-align:left;">The state currently has a small data center market, with only <b>seven facilities listed.</b></p><p class="paragraph" style="text-align:left;">Applied Digital, a cryptomining and AI provider, recently secured <b>$200 million</b> for facility expansion in the state and has a deal with an unnamed hyperscaler. Also worth pointing out that the state produces more energy than it uses.</p><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">Midjourney’s competitor FLUX saw some interesting developments from the research side of things, with a new paper detailing a generative framework called FluxMusic for text-to-music generation. 
In addition, we got some insights into new approaches for training LLMs with long context windows and improving <b>traditional RAG methods.</b></p><h3 class="heading" style="text-align:left;" id="composing-the-future-of-ai-generate">Composing the Future of AI-Generated Soundscapes With FluxMusic</h3><p class="paragraph" style="text-align:left;">We talked about Black Forest Lab’s FLUX before and noted that it’s a pretty serious <a class="link" href="https://genai360.beehiiv.com/p/on-device-llms-challenge-gpt3-5?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">competitor for Midjourney.</a></p><p class="paragraph" style="text-align:left;">This paper gives us a new approach to <a class="link" href="https://arxiv.org/pdf/2409.00587v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">text-to-music generation</a> built around FluxMusic, a generative framework, while aiming to strengthen <b>multimodal LLM perception</b> and support higher input resolutions.</p><p class="paragraph" style="text-align:left;">It tackles the challenge of creating more versatile and powerful AI systems capable of generating high-quality music from textual descriptions, a task that has long been limited by the complexity of musical structure and the need for <b>domain-specific knowledge.</b></p><p class="paragraph" style="text-align:left;">They implemented a two-tiered LLM chain for content refinement and information extraction, and utilized multiple pre-trained text encoders for conditioned caption feature extraction and<b> inference flexibility.</b></p><p class="paragraph" style="text-align:left;">FLUX models support input resolutions of over 1000 pixels and achieve strong performance on multimodal LLM benchmarks. 
This is especially true for resolution-sensitive tasks like OCR and document understanding. Notably, FLUX outperformed other prominent models like Mini-Gemini-HD and LLaVA-NeXT across <b>various metrics.</b></p><h3 class="heading" style="text-align:left;" id="breaking-the-long-context-barrier">Breaking the Long Context Barrier</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXciqxrZrp5RxbYXEGuMCHckoQuDZ7Rl50XefOkP8ASYbnR1G-j4Li9CzBw9eu5YIT8CRpXLhxfql8e2XS_s43fNbXgAiu4xM9Ju-4k8fTw-WqH8Kq7rZhJ9vPn9c8s4DTt_WV0djIIGUnzN7Us4H2lk7Kw?key=kUQfUwNwF7ajC3NgSN1FFA"/><div class="image__source"><span class="image__source_text"><p>End-to-end model training comparison. <a class="link" href="https://arxiv.org/pdf/2408.16978v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers from Ohio State University and Microsoft have introduced a <a class="link" href="https://arxiv.org/pdf/2408.16978v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">novel approach to training LLMs</a> with <b>extremely long context windows</b>.</p><p class="paragraph" style="text-align:left;">Current LLM training is typically constrained to relatively short context lengths, such as 8K or 32K tokens, limiting their ability to process entire documents or maintain <b>coherent long-term dialogues.</b></p><p class="paragraph" style="text-align:left;">They introduced the Fully Pipelined Distributed Transformer (FPDT) that leverages multiple memory hierarchies in modern GPU clusters to enhance hardware efficiency and cost-effectiveness. 
</p><p class="paragraph" style="text-align:left;">This method uses:</p><ul><li><p class="paragraph" style="text-align:left;">A chunking mechanism</p></li><li><p class="paragraph" style="text-align:left;">Offloading idle tokens to host memory</p></li><li><p class="paragraph" style="text-align:left;">A double buffer strategy to overlap computation with data transfer.</p></li></ul><p class="paragraph" style="text-align:left;">FPDT can train an 8B parameter LLM with a 2 million token context length using only 4 GPUs, achieving over 55% Model FLOPs Utilization (MFU). This represents a 16x increase in sequence length compared to current SOTA solutions while <b>maintaining high efficiency.</b></p><h3 class="heading" style="text-align:left;" id="how-16-k-tokens-of-rag-outperform-1">How 16K Tokens of RAG Outperform 128K Context LLMs</h3><p class="paragraph" style="text-align:left;">As long-context language models (LLMs) become increasingly prevalent, some have questioned the continued relevance of RAG. A new study from NVIDIA researchers challenges this notion by showing that RAG still has a crucial role to play, even in the era of models with <b>100K+ token contexts.</b></p><p class="paragraph" style="text-align:left;">The team introduces <a class="link" href="https://arxiv.org/pdf/2409.01666?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Order-Preserve RAG (OP-RAG)</a>, which significantly improves upon <b>traditional RAG methods</b>. By maintaining the original order of retrieved chunks rather than sorting by relevance, OP-RAG achieves a better balance between information recall and precision. 
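The order-preserving step at the heart of OP-RAG is simple enough to sketch in a few lines of Python. This is an illustrative toy, not the authors' code: the names (`op_rag_context`, `overlap`) are made up for the example, and a real pipeline would score chunks with an embedding model rather than keyword overlap.

```python
# Toy sketch of the order-preserving idea behind OP-RAG: retrieve the
# top-k chunks by relevance, then restore original document order
# before assembling the context.
def op_rag_context(chunks, score_fn, query, k):
    """chunks: list of strings in original document order."""
    scored = [(i, score_fn(query, c)) for i, c in enumerate(chunks)]
    # Keep the k most relevant chunk indices...
    top_k = sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
    # ...but emit them in document order (the "order-preserve" step),
    # instead of sorting by relevance as traditional RAG pipelines do.
    kept = sorted(i for i, _ in top_k)
    return "\n".join(chunks[i] for i in kept)

# A trivial keyword-overlap scorer stands in for a real embedding model.
def overlap(query, chunk):
    return len(set(query.split()) & set(chunk.split()))

docs = ["alpha intro", "beta detail query term", "gamma filler",
        "delta query term again", "epsilon outro"]
ctx = op_rag_context(docs, overlap, "query term", k=2)
# ctx keeps chunks 1 and 3, in their original order.
```

The only difference from a standard RAG assembly loop is the final sort by chunk index rather than by relevance score.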
</p><p class="paragraph" style="text-align:left;">They found that as the number of retrieved chunks increases, answer quality follows an inverted U-shaped curve, with an optimal &quot;sweet spot&quot; that outperforms long-context LLMs using the <b>entire available context</b>.</p><p class="paragraph" style="text-align:left;">Results on the ∞Bench dataset are striking: OP-RAG using Llama3.1-70B with just 16K retrieved tokens achieved a 44.43 F1 score, surpassing the 34.32 score of Llama3.1-70B using its full 128K context. It even bested <b>GPT-4o (32.36)</b> and Gemini-1.5-Pro (43.08). </p><h2 class="heading" style="text-align:left;" id="frameworks-we-love">Frameworks We Love</h2><p class="paragraph" style="text-align:left;">Some frameworks that caught our attention in the last week include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/anthropics/anthropic-quickstarts?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">Anthropic Quickstarts</a>: A collection of ready-to-use projects designed to help developers quickly build deployable applications using the Anthropic API and Claude&#39;s capabilities.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2409.03735?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">LLM-CI</a>: Assesses privacy norms encoded in LLMs using a Contextual Integrity-based factorial vignette methodology across different contexts. 
</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/getomni-ai/zerox?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">GetOmni AI</a>: Open-source OCR tool that uses GPT-4 vision models to convert PDF documents into high-quality markdown text.</p></li></ul><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">One discussion raised the question of what led to<b> public cloud storage prices </b>plateauing in 2017 rather than continuing to decrease like earlier years.</p><h3 class="heading" style="text-align:left;" id="the-great-cloud-storage-plateau">The Great Cloud Storage Plateau</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXda9ZeudL82lsUxRFrkcQik7MH7Z9l1tz62y_r-kidymSQtMKdvn8xyfEDmCjgyNDfdUT5uRYKCfBvEanu2ku9xpWUAwSzenKOvwwTaIW5_mY7apvQ7pfgKoRDL3-Iwf430LAzH0oQI-xXK0_KmplIYdAyN?key=kUQfUwNwF7ajC3NgSN1FFA"/><div class="image__source"><span class="image__source_text"><p>Cloud storage costs from 2007 to 2017. <a class="link" href="https://x.com/vikhyatk/status/1830680511448260635?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">A graph that tells a fascinating story about the evolution of <a class="link" href="https://x.com/vikhyatk/status/1830680511448260635?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">cloud storage pricing</a> caught our eye. 
The image, which tracks the cost per GB/month across various cloud providers from <b>2007 to 2019</b>, reveals that after years of steep declines, prices seemed to hit a floor around 2017.</p><p class="paragraph" style="text-align:left;">This <b>plateau in cloud storage costs </b>makes us wonder about the economics of cloud computing and the broader tech industry. </p><p class="paragraph" style="text-align:left;">Some speculate that we&#39;ve reached the <b>physical limits of storage density</b> improvements, while others point to market consolidation and reduced competition. There&#39;s also the possibility that cloud providers have shifted their focus from raw storage to value-added services.</p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in AI</h2><p class="paragraph" style="text-align:left;">A couple of huge acquisitions and investment rounds took place last week, with SSI raising $1 billion and Salesforce purchasing Own for $1.9 billion. <a class="link" href="https://You.com?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">You.com</a> also raised <b>$50 million from a series B funding round.</b></p><h3 class="heading" style="text-align:left;" id="ilya-sutskever-raises-1-billion-for">Ilya Sutskever Raises $1 Billion for New AI Safety Startup</h3><p class="paragraph" style="text-align:left;">Just a few months ago, OpenAI disbanded its AI safety team and key members left the company, including Ilya Sutskever. 
Sutskever created his <b>own AI safety startup</b> which raised a staggering <a class="link" href="https://www.reuters.com/technology/artificial-intelligence/openai-co-founder-sutskevers-new-safety-focused-ai-startup-ssi-raises-1-billion-2024-09-04/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">$1 billion at a $5 billion valuation</a>, just three months after its founding. </p><h3 class="heading" style="text-align:left;" id="you-raises-50-million-in-series-b-f">You Raises $50 Million in Series B Funding</h3><p class="paragraph" style="text-align:left;">You.com, an AI search company focusing on <b>complex queries and productivity tools</b>, has raised <a class="link" href="https://techcrunch.com/2024/09/04/you-com-refocuses-from-ai-search-to-deeper-productivity-agents-with-new-50m-round/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" target="_blank" rel="noopener noreferrer nofollow">$50 million in a Series B</a> round led by Georgian, with participation from notable investors including Nvidia and Salesforce Ventures.</p><p class="paragraph" style="text-align:left;">The company is betting on its ability to excel at answering complex questions and providing a &quot;productivity engine&quot; for <b>knowledge workers.</b></p><h3 class="heading" style="text-align:left;" id="salesforce-acquires-own-for-19-bill">Salesforce Acquires Own for $1.9 billion</h3><p class="paragraph" style="text-align:left;">Salesforce made its largest acquisition since Slack, purchasing data management firm Own for <a class="link" href="https://techcrunch.com/2024/09/05/salesforce-acquires-data-management-firm-own-for-1-9b-in-cash/?guccounter=1&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=september-roundup-3-2-gpt-next-s-100x-power-125b-supercomputers" 
target="_blank" rel="noopener noreferrer nofollow">$1.9 billion in cash</a>. The deal highlights the growing importance and value of data protection and management solutions in the enterprise space, with the global data backup and recovery sector worth <b>$12.9 billion in 2023.</b></p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=c03dc7ca-5c08-4492-bdff-461c99a81595&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>PyTorch LLM Improvements, Oracle &amp; Salesforce Conference Updates, Google’s 🐳  AI</title>
  <description>Plus, Mistral Debuts Pixtral 12B Multimodal Model</description>
  <link>https://genai360.beehiiv.com/p/of-whales-and-strawberries</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/of-whales-and-strawberries</guid>
  <pubDate>Tue, 01 Oct 2024 17:13:21 +0000</pubDate>
  <atom:published>2024-10-01T17:13:21Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">But seriously, when did we start naming stuff after fruit instead of animals? Looking at you, 🍐 (oof) and the non-profit that decided to morph into a for-profit entity, with the last founding member deciding to depart faster than you can say &quot;I have no equity in OpenAI… I do this because I love it, 💚 &quot;. Anyways, this week is full of news (and there have been many more over the weekend), so let&#39;s dive right in! But first:</p><div class="section" style="background-color:#FFFFFF;border-color:#ff8a00;border-radius:2px;border-style:solid;border-width:2px;margin:10.0px 10.0px 10.0px 10.0px;padding:10.0px 10.0px 10.0px 10.0px;"><h2 class="heading" style="text-align:left;"><span style="color:rgb(34, 34, 34);">GenAI360 Exclusive: Unlock Free Tickets for RetrieveX Conference on Oct 17 in San Francisco. </span></h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);">Come hear from the creators of Meta Llama, PyTorch, Kubeflow, CAFFE, along with leaders from Microsoft, AWS, Bayer, Flagship Pioneering, Cresta, VoyageAI, Omneky how to build best retrieval for AI. 
</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);">If you&#39;re an executive who&#39;s considering or working on GenAI projects, fill in the form below for a complimentary ticket for the conference - hurry up because tickets are limited!</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://www.retrievex.co/application?utm_source=newsletter&utm_medium=email&utm_campaign=weekly1"><span class="button__text" style="color:#FFFFFF;"> Get Tickets Today </span></a></div><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);">Date: October 17, 10:30am - 7pm PT</span><br><span style="color:rgb(34, 34, 34);">Venue: The Midway, 900 Marin St, San Francisco</span></p></div><h2 class="heading" style="text-align:left;" id="key-takeaways">Key Takeaways</h2><ul><li><p class="paragraph" style="text-align:left;">Mistral released Pixtral 12B, their <b>first multimodal model</b> capable of processing both images and text, outperforming other top models on various benchmarks.</p></li><li><p class="paragraph" style="text-align:left;">IBM researchers presented improvements to PyTorch, including a high-throughput data loader and enhanced LLM training throughput.</p></li><li><p class="paragraph" style="text-align:left;">Google developed a <b>whale bioacoustics model </b>capable of identifying eight distinct whale species and multiple vocalizations.</p></li><li><p class="paragraph" style="text-align:left;">General OCR Theory employs a<b> unified end-to-end model </b>with a high-compression encoder and long-context decoder, which outperformed existing models on various OCR tasks.</p></li><li><p class="paragraph" style="text-align:left;">OpenAI&#39;s o1 model shows significant improvements over GPT-4 on <b>reasoning-heavy tasks</b>, rivaling human expert performance on many benchmarks.</p></li><li><p 
class="paragraph" style="text-align:left;">A paper drawing parallels to<b> quantum mechanics</b> applies the Universal Approximation Theorem to explain LLM memory mechanisms, and finds some models able to memorize nearly 100% of 2,000 poems after limited exposure.</p></li></ul><p class="paragraph" style="text-align:left;"><i>Got forwarded this newsletter? Subscribe below👇</i></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/subscribe?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai"><span class="button__text" style=""> Subscribe </span></a></div><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">We’ve been hearing about Project Strawberry for a while and it’s finally here, with OpenAI releasing a preview version of the model<b> called o1. </b>There were a bunch of other releases too, including multimodal models, speech-text foundation models, and open-source development frameworks. 
</p><p class="paragraph" style="text-align:left;">Some interesting models for <b>life science applications</b> also came up, which might have gone under the radar amidst the o1 release hype.</p><h3 class="heading" style="text-align:left;" id="mistral-enters-multimodal-arena-and">Mistral Enters Multimodal Arena and Groq’s New Release</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeaJwcI6vrVtx5w5sTkfPeczUHmC8oUigZA9HYyWf54XG6BzD32Lb1c_8VSU7gHCNCelzouoXn2YLSd_3EzKrgXuBcK4nybQuW7dKpwgcCrnf6rHqQycm0U-Faw6d0l6sXx81nrEoSNqfQrnVucnve6kHk2?key=qd_Cna9PprZkngI1BfwrHQ"/><div class="image__source"><span class="image__source_text"><p>Mistral’s first multimodal model showed promising results on various multimodal benchmarks. <a class="link" href="https://mistral.ai/news/pixtral-12b/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">(Source) </a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Pixtral 12B</a> is a new multimodal AI model from Mistral, who previously released <a class="link" href="https://genai360.beehiiv.com/p/of-llamas-and-slms?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">three models</a> at around the same time. This new model is capable of processing both images and text. 
It has <b>12 billion parameters</b> and is approximately 24GB in size.</p><p class="paragraph" style="text-align:left;">Built on Mistral&#39;s Nemo 12B text model, it can answer questions about multiple images of any size using URLs or base64-encoded images. The model is expected to perform tasks like <b>image captioning </b>and object counting, similar to other multimodal models like Claude and GPT-4o.</p><p class="paragraph" style="text-align:left;">Pixtral 12B also performed well on various <a class="link" href="https://x.com/rajko_rad/status/1833934297486733487/photo/3?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">multimodal benchmarks</a> like MMMU and ChartQA, <b>outperforming </b>other top models like Claude 3 Haiku and Phi-3 Vision.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcmh2_-DY1vXINjj9ISVq3Qo3voMrHNAOS26-fPIdXGlgFrXILQsQr8tK1iD9VT4ksb_wGZ0B8Fl9VixkFIseL2YCWiwm1-apHNDkJ1XhLLQFYel3DESTCKetJe1qmWU9xOxJ2XitQDSGntBFGpJlPFAK7y?key=qd_Cna9PprZkngI1BfwrHQ"/><div class="image__source"><span class="image__source_text"><p>Don&#39;t hate us for this shot. Pixtral 12B’s results on various multimodal benchmarks. <a class="link" href="https://x.com/rajko_rad/status/1833934297486733487/photo/2?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Instead of another multimodal model release, we saw an open-source development framework pop up. 
<a class="link" href="https://www.youtube.com/watch?v=rQ1ZY0mdFcQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">xRx </a>by Groq lets developers make <b>conversational AI solutions</b> by combining multimodal input and outputs.</p><p class="paragraph" style="text-align:left;">It makes life easier for users wanting to develop projects like <b>voice-based assistants, </b>text-based chatbots, and multimodal applications by giving them all the tools they need in one place. </p><h3 class="heading" style="text-align:left;" id="oracles-multicloud-vision-and-tenso">Oracle’s Multicloud Vision and Tensor Parallelism Breakthroughs</h3><p class="paragraph" style="text-align:left;">The conference season is upon us (while we&#39;re at it - grab your tickets for <a class="link" href="http://retrievex.co?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">RetrieveX</a>). PyTorch Conference, Dreamforce, the <a class="link" href="https://www.oracle.com/cloud/multicloud/larry-ellison-cloudworld-multicloud-strategy/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Oracle CloudWorld</a>…<br><br>Speaking of the latter, Oracle&#39;s CTO Larry Ellison announced a new <b>partnership with AWS</b>, following similar deals with Microsoft and Google. The partnership, called Oracle Database@AWS, will embed Oracle Cloud Infrastructure inside AWS data centers. 
They also announced APEX, a<b> low-code development platform</b> that uses AI to help create secure applications from the start.</p><p class="paragraph" style="text-align:left;">What’s more is that Ellison mentioned that Oracle is already building a <a class="link" href="https://x.com/andrewcurran_/status/1833960831245320317?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">&gt;1GW data center </a>that’s powered by three small modular nuclear reactors, and that Oracle is probably going to have <b>2000 data centers </b>worldwide. </p><p class="paragraph" style="text-align:left;">The <b>PyTorch conference</b> also took place at around the same time. Previously, we saw the release of a <a class="link" href="https://genai360.beehiiv.com/p/the-strawberry-mystery?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">PyTorch API</a>. </p><p class="paragraph" style="text-align:left;">Developments are continuing as we saw the implementation of experimental <a class="link" href="https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">async tensor parallelism support,</a> integrated into TorchTitan. 
It can be used with torch.compile, which automatically detects <b>TP patterns</b> and rewrites them into async-TP ops, or directly in eager mode by calling async-TP ops.</p><p class="paragraph" style="text-align:left;">There were a couple of key performance challenges tackled by this implementation, including communication overhead and <b>magnified wave quantization</b> inefficiencies. Addressing these issues lets communication overlap with ongoing computation instead of stalling it.</p><p class="paragraph" style="text-align:left;">Some notable results include:</p><ul><li><p class="paragraph" style="text-align:left;">For Llama3 8B, async-TP achieved up to <b>~29% forward pass</b> speedup and ~8% E2E speedup.</p></li><li><p class="paragraph" style="text-align:left;">For Llama3 70B, it showed up to <b>~20% forward pass</b> speedup and ~8% E2E speedup.</p></li></ul><h3 class="heading" style="text-align:left;" id="project-strawberry-unveiled-agi-wen">Project Strawberry Benchmarks: &#39;AGI Wen?&#39;</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdllRQXwQsBMNuexEBu9nfJ6xnV5WHmH4ghVOqYcKch1YMKd4pJFaVyfGjEfMKlFv1dmiqIdGIggMsHbds1IPmNIrBGX-6W1WOFrTSQENpk9VpF9LRRjCBccpgNKlAgNQEAmAt5GoJtVRuUrJAg0RFqXKU7?key=qd_Cna9PprZkngI1BfwrHQ"/><div class="image__source"><span class="image__source_text"><p>OpenAI’s latest model shows drastic improvements in competitive benchmarks over GPT-4. 
<a class="link" href="https://openai.com/index/learning-to-reason-with-llms/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">After all the build up, OpenAI finally made an <b>official announcement </b>about <a class="link" href="https://genai360.beehiiv.com/p/strawberry-conf?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Project Strawberry</a>. It’s actually called “<a class="link" href="https://openai.com/index/learning-to-reason-with-llms/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">OpenAI o1</a>” instead.</p><p class="paragraph" style="text-align:left;">In the past, they used a combination of <b>supervised and reinforcement learning</b> for models like GPT-4. This time, OpenAI took a <b>slightly different approach</b> to training their latest model, using a large-scale reinforcement learning algorithm to teach the model productive thinking. The model’s performance consistently improves with more reinforcement learning (train-time compute) and more time spent thinking (test-time compute).</p><p class="paragraph" style="text-align:left;">In terms of results, it seems like OpenAI o1 lives up to the hype by significantly <b>outperforming GPT-4o</b> on most reasoning-heavy tasks. It also shows major improvements on challenging benchmarks like AIME (math), Codeforces (programming), and GPQA Diamond (PhD-level science questions), exceeding human expert performance on reasoning-heavy tasks. 
We&#39;ve yet to see nice results in production (beyond… Devin&#39;s reference), but we are currently using and liking it, too. </p><p class="paragraph" style="text-align:left;">Currently, <b>o1-preview</b> is being released for immediate use in ChatGPT and to trusted API users. <b>OpenAI is also A/B testing a feature where the model selector is disabled</b>, fueling speculation that they&#39;re working on a model router to manage inference costs on consumer plans (and serve a ‘good enough&#39; model). </p><p class="paragraph" style="text-align:left;">Since <b>PhD-level intelligence </b>was a hot topic during the whole speculation phase of o1, it was interesting to see how it stacks up against other models on <a class="link" href="https://arcprize.org/blog/openai-o1-results-arc-prize?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">ARC Prize. </a></p><h3 class="heading" style="text-align:left;" id="salesforce-microsoft-and-amazons-ne">Salesforce, Microsoft, and Amazon’s New Releases</h3><p class="paragraph" style="text-align:left;">After releasing a couple of <a class="link" href="https://genai360.beehiiv.com/p/the-strawberry-mystery?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">autonomous AI agents</a>, Salesforce also released <a class="link" href="https://www.salesforce.com/agentforce/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Agentforce</a> (confidently priced at <b>$2 per conversation,</b> ugh), which offers various <b>pre-built AI agents</b> for specific business functions, including Service Agents, SDRs, Sales Coaches, Personal 
Shoppers, and Campaign Agents.</p><p class="paragraph" style="text-align:left;">These agents are autonomous applications that provide 24/7 specialized support to employees or customers across <b>multiple channels </b>like web, mobile, WhatsApp, and Slack. </p><p class="paragraph" style="text-align:left;">Microsoft is launching the next wave of <a class="link" href="https://www.salesforce.com/agentforce/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Copilot</a>, integrating web, work, and Pages as a new design for knowledge work. There are also a bunch of improvements being made to Copilot in various <b>Microsoft 365 apps.</b></p><p class="paragraph" style="text-align:left;">With GPT-4o and enhanced orchestration, Copilot responses are now more than two times faster on average. Response satisfaction has improved by nearly three times, so the <b>improvements are definitely noticeable.</b></p><p class="paragraph" style="text-align:left;">Copilot Pages (Perplexity called and wants its naming back…) is described as the first new <b>digital artifact</b> for the AI age, which allows users to edit, add to, and share AI-generated content. </p><p class="paragraph" style="text-align:left;">Amazon launched <a class="link" href="https://www.aboutamazon.com/news/innovation-at-amazon/amazon-project-amelia?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Project Amelia</a>, a <b>generative AI assistant for sellers</b>, offering tailored business insights, guidance, and actions using Amazon Bedrock technology. Sellers can consult Amelia for best practices and strategies, receiving tailored responses based on their data and market trends. 
It offers quick access to sales data, customer traffic insights, and product performance analysis to help monitor progress.</p><h3 class="heading" style="text-align:left;" id="new-speech-text-foundation-model-an">New Speech-Text Foundation Model and Turning Ideas Into Apps</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfqPHyG6eOSjYKCHKQWJDBL9dCtAQWM3-g0V1bet3Sk8MPUKUr2yAvQwduU5ijCveX-HraAAT3VO6IXllkzf99q0AhGjwF2OmV7NU4P_MeTXaZOKykEVBn_7lKBDVj1Q2CdWEqXCohLZWcLi-FceWHg5s83?key=qd_Cna9PprZkngI1BfwrHQ"/><div class="image__source"><span class="image__source_text"><p>Moshi overview. <a class="link" href="https://github.com/kyutai-labs/moshi?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Kyutai Labs debuted a new speech-text foundation model called <a class="link" href="https://github.com/kyutai-labs/moshi?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Moshi.</a> It uses Mimi, a SOTA streaming neural audio codec that processes 24 kHz audio down to a 12.5 Hz representation with a <b>bandwidth of 1.1 kbps.</b></p><p class="paragraph" style="text-align:left;">The model consists of two main components:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">A small Depth Transformer for inter-codebook dependencies</p></li><li><p class="paragraph" style="text-align:left;">A large 7B parameter Temporal Transformer for temporal dependencies</p></li></ol><p class="paragraph" style="text-align:left;">Mimi outperforms existing <b>non-streaming codecs</b> like SpeechTokenizer and SemantiCodec in terms of efficiency and quality. Moshi 
achieves a theoretical latency of 160ms (80ms for Mimi&#39;s frame size + 80ms of acoustic delay), with practical latency as low as 200ms on an L4 GPU. </p><p class="paragraph" style="text-align:left;">The trend of accessible coding seems to be continuing with <a class="link" href="https://llamacoder.together.ai/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Llamacoder</a>, which lets you enter a simple prompt to turn your idea into a fully developed app within a <b>matter of minutes</b>. Keep in mind that it’s only for smaller apps, though.</p><p class="paragraph" style="text-align:left;">It uses <b>Llama 3.1 405B </b>for code generation and Together AI for LLM inference - a nice combination that allows for efficient processing of prompts and code generation. </p><p class="paragraph" style="text-align:left;">There are also other features in the works, such as adding support for multiple programming languages and plans for more <b>customization options in the future.</b></p><h3 class="heading" style="text-align:left;" id="from-whale-songs-to-cellular-automa"><span style="color:rgb(67, 67, 67);">From Whale Songs to Cellular Automata</span></h3><p class="paragraph" style="text-align:left;">Google decided to go in a different direction with a unique AI application. They developed a <a class="link" href="https://research.google/blog/whistles-songs-boings-and-biotwangs-recognizing-whale-vocalizations-with-ai/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">whale bioacoustics model </a>that can identify <b>eight whale species </b>and multiple vocalizations, including the Bryde&#39;s whale &quot;Biotwang&quot; sound. 
This is important for protecting whales that live in remote environments, since it’s hard to find them through other methods.</p><p class="paragraph" style="text-align:left;">The model uses spectrograms to classify <b>twelve whale vocalization</b> classes, offering detailed species identification and recognizing various vocalization types. Results were certainly impressive since it showed high accuracies across different whale species, excelling in identifying complex sounds like &quot;boings&quot; and &quot;gunshot&quot; calls.</p><p class="paragraph" style="text-align:left;">As such, it’s had some notable implications for <b>discoveries</b> about whale populations, movements, and behaviors.</p><p class="paragraph" style="text-align:left;">More AI applications in the life sciences were seen last week with <a class="link" href="https://x.com/ProfBuehlerMIT/status/1836742014215303626?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">LifeGPT,</a> a generative transformer model designed to simulate <b>Conway&#39;s Game of Life (Life). </b>Life is a cellular automaton simulation, which are models of computation that are useful in areas like biology.</p><p class="paragraph" style="text-align:left;">The issue is that algorithms within cellular automata like Life are tough to model and predict due to sensitivity of initial conditions and the fact that prior understanding is needed. LifeGPT helps us solve this issue by simulating Life <b>without needing prior knowledge.</b></p><p class="paragraph" style="text-align:left;">The model is trained on various grid configurations, allowing it to generalize across different grid sizes and periodic boundary conditions effectively. 
LifeGPT captures the <b>complex dynamics </b>of cellular automata with near-perfect accuracy.</p><p class="paragraph" style="text-align:left;">We also saw the introduction of the &quot;autoregressive autoregressor&quot; method, which allows for recursive simulation of Life using LifeGPT for <b>long-term predictions</b>.</p><p class="paragraph" style="text-align:left;">As a result, it could help design models with big implications for areas like <b>bioinspired materials and tissue engineering.</b></p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/2154d500-3e91-4807-b15a-7cabca7f262d/cal_preview.png?t=1727800971"/><div class="image__source"><span class="image__source_text"><p>A note from our partner</p></span></div></div><div class="blockquote"><blockquote class="blockquote__quote"></blockquote></div><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">One paper really stood out since it made an interesting parallel that we thought we’d never see between <b>quantum mechanics and LLMs.</b> Progress was also made in applying AI for scientific research tasks and dealing with limitations that traditional OCR systems had a tough time with.</p><h3 class="heading" style="text-align:left;" id="ai-outperforms-experts-in-scientifi">AI Outperforms Experts in Scientific Research Tasks with PaperQA2</h3><p class="paragraph" style="text-align:left;">Researchers from FutureHouse developed <a class="link" href="https://storage.googleapis.com/fh-public/paperqa/Language_Agents_Science.pdf?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">PaperQA2</a>, which showed SOTA performance in 
synthesizing <b>scientific knowledge</b>. Creating AI systems that can accurately and reliably help with scientific research tasks has been a challenge for quite some time, so this paper addresses a key issue.</p><p class="paragraph" style="text-align:left;">PaperQA2 used a <b>frontier language model </b>optimized for improved factuality, combined with an agentic approach to information retrieval and synthesis.</p><p class="paragraph" style="text-align:left;">The system uses a <b>multi-step process</b> involving paper search, evidence gathering, and answer generation - all orchestrated by an agent model. But what really stands out is the Reranking and Contextual Summarization (RCS) step, which improves the <b>relevance and quality of retrieved information.</b></p><p class="paragraph" style="text-align:left;">The researchers evaluated PaperQA2 on three real-world tasks:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Answering scientific questions</p></li><li><p class="paragraph" style="text-align:left;">Generating Wikipedia-style summaries</p></li><li><p class="paragraph" style="text-align:left;">Detecting contradictions in scientific literature</p></li></ol><p class="paragraph" style="text-align:left;">PaperQA2 matched or exceeded human expert performance across these tasks. 
For example, it achieved superhuman precision on the <b>LitQA2 question-answering benchmark </b>and produced more accurate and better-cited Wikipedia-style summaries than existing human-written articles.</p><h3 class="heading" style="text-align:left;" id="go-ts-comprehensive-solution-for-vi">GOT&#39;s Comprehensive Solution for Visual Data Extraction</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2409.01704?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">General OCR Theory</a> (GOT) handles the<b> limitations </b>of traditional OCR systems in handling diverse types of artificial optical signals, from plain text to complex formulas and charts. It’s a unified end-to-end model consisting of a high-compression encoder and a long-context decoder. </p><p class="paragraph" style="text-align:left;">The encoder, with about <b>80M parameters</b>, processes 1024x1024 input images and compresses them into 256x1024 dimensional tokens. </p><p class="paragraph" style="text-align:left;">The decoder, with 500M parameters, supports an <b>8K </b>max token length for handling long-context scenarios.</p><p class="paragraph" style="text-align:left;">They developed a <b>multi-stage training strategy,</b> including decoupled pre-training of the encoder, joint training with a new decoder, and further post-training. They also created specialized data engines for synthetic data production to support each training stage.</p><p class="paragraph" style="text-align:left;">Results show that GOT outperforms existing models on <b>various OCR tasks</b>, including plain text recognition, formatted document understanding, and more specialized tasks like sheet music and chart recognition. 
For example, on the ChartQA-SE benchmark, GOT achieved an AP@strict score of 0.747, surpassing other models including GPT-4V and Qwen-VL.</p><h3 class="heading" style="text-align:left;" id="ll-ms-and-quantum-mechanics-the-sur">LLMs and Quantum Mechanics: The Surprising Connection in Memory Models</h3><p class="paragraph" style="text-align:left;">Researchers from The Hong Kong Polytechnic University proposed a <a class="link" href="https://arxiv.org/pdf/2409.10482v2?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">new framework</a> for <b>understanding memory</b> in LLMs in a way that you wouldn’t have thought of.</p><p class="paragraph" style="text-align:left;">They drew some pretty interesting parallels to quantum mechanics to answer the fundamental question of whether LLMs truly possess memory and, if so, how it functions <b>compared to human memory</b>.</p><p class="paragraph" style="text-align:left;">The study leverages the Universal Approximation Theorem (UAT) to explain the memory mechanism in LLMs. They argue that LLM memory operates like &quot;<b>Schrödinger&#39;s memory</b>&quot; - it only becomes observable when a specific memory is queried, and its existence can’t be determined otherwise.</p><p class="paragraph" style="text-align:left;">They demonstrated this concept through experiments on various LLMs, including fine-tuning models on poetry datasets to assess their <b>memory capabilities.</b></p><p class="paragraph" style="text-align:left;">Results show that LLMs can exhibit remarkable memory performance, with some models able to memorize nearly <b>100% of 2,000 poems</b> after limited exposure. 
The study also revealed memory performance decreases as input text length increases, mirroring human memory limitations.</p><h2 class="heading" style="text-align:left;" id="frameworks-we-love">Frameworks We Love</h2><p class="paragraph" style="text-align:left;">Some frameworks that caught our attention in the last week include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2409.12193?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Vista3D</a>:  Rapid and consistent 3D object generation from a single image, using a two-phase approach.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2409.12165?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">NSSR-DIL</a>: Reformulates image super-resolution as computing the inverse of degradation kernels, rather than directly generating high-resolution images from low-resolution inputs</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2409.12140?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">MoRAG</a>: Text-based human motion generation that enhances motion diffusion models by leveraging additional knowledge from an improved motion retrieval process</p></li></ul><p class="paragraph" style="text-align:left;">If you want your framework to be featured here, reply to this email saying hi :)</p><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">Don&#39;t worry, 
no fruit was harmed (or community noted) in our selection. One of our picks of the week, which went into depth about real-world usage patterns of <b>data warehouses</b>, stood out since it goes against common assumptions about these systems. </p><p class="paragraph" style="text-align:left;">Another post that grabbed our attention looked at <b>cloud pricing </b>for Kafka deployments, showing that there’s a big disparity between cloud costs and on-premise hardware capabilities for data streaming workloads.</p><h3 class="heading" style="text-align:left;" id="rethinking-data-infrastructure-and-">Rethinking Data Infrastructure and Processing</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfb5iP0twYU-4TZrjrEWk_Vafw8lOravklD448iFvxq3xgZDXNuFEOH9PGsjpEstdjlHt4llO9iBFSQvqSAIK9llL1w9kRwf8zFzwHzi1V6dQ5BMXRm2N91GgI9gw8SeQHJ0g8auad0UOx18cRKrtcDhac?key=qd_Cna9PprZkngI1BfwrHQ"/><div class="image__source"><span class="image__source_text"><p>Fraser looked at usage patterns of data warehouses. <a class="link" href="https://x.com/frasergeorgew/status/1836117419540058588?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">There was a deep dive into the real-world usage patterns of <a class="link" href="https://x.com/frasergeorgew/status/1836117419540058588?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">data warehouses</a>, courtesy of George Fraser, CEO of Fivetran. 
His analysis of query samples from <b>Snowflake and Redshift</b> offered some surprising insights that challenge common assumptions about how these systems are used in practice.</p><p class="paragraph" style="text-align:left;">The post highlights that data warehouses are mainly used for ETL (Extract, Transform, Load) tasks, not just for business intelligence. Most queries are small, scanning only about 100 MB of data, which challenges the focus on massive scalability and has important implications for data infrastructure and processing.</p><h3 class="heading" style="text-align:left;" id="the-hidden-costs-of-cloud-kafka">The Hidden Costs of Cloud Kafka</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcLtM9_eLANvQL11zJsfXxV7eeTsyx1oCOBs2tNfNiAnNF5AEHKC-FiTzlbgjCTawICtDesAgp2n2KIImafhT4pjGIbtG97-1U9fizn3S6dzKAgC6_rchv03pr708CbSEVIRN30j_ns-0n-HyXk_dCrzjIL?key=qd_Cna9PprZkngI1BfwrHQ"/><div class="image__source"><span class="image__source_text"><p>Kozlovski’s post about cloud pricing for Kafka deployments. 
<a class="link" href="https://x.com/BdKozlovski/status/1834968571299930527?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Kozlovski’s breakdown of <b>cloud pricing </b>for <a class="link" href="https://x.com/BdKozlovski/status/1834968571299930527?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Kafka</a> deployments gained a lot of attention on X as it exposed the disparity between cloud costs and on-premises hardware capabilities for data streaming workloads.</p><p class="paragraph" style="text-align:left;">He explained how a modest <b>30MB/s Kafka cluster</b> could rack up over $110,000 in annual costs, with network traffic alone accounting for $88,300 of that sum. It shows a fundamental misalignment between cloud pricing structures and the needs of data-intensive applications.</p><p class="paragraph" style="text-align:left;">He also went into the reasons behind this pricing anomaly and explored potential solutions. From optimizations like &quot;<b>fetch from follower</b>&quot; to more radical approaches like WarpStream&#39;s innovative design, the discussion brought the ongoing struggle between cloud convenience and cost-effectiveness to light. Interesting.</p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in AI</h2><p class="paragraph" style="text-align:left;">While Microsoft launched the next wave of Copilot, they also announced a partnership with BlackRock where they’ll invest <b>$30 billion </b>for AI infrastructure development support. Other news included Nvidia thinking about acquiring OctoAI for $165 million. 
</p><p class="paragraph" style="text-align:left;">In terms of successful funding rounds, Sakana AI and Black Forest Labs raised <b>$210 million and $100 million respectively.</b></p><h3 class="heading" style="text-align:left;" id="open-a-is-150-b-sprint">OpenAI&#39;s $150B Sprint</h3><p class="paragraph" style="text-align:left;">In the next episode full of plot twists worthy of a soap opera, <a class="link" href="https://www.wsj.com/tech/apple-no-longer-in-talks-to-join-openai-investment-round-e3be3e66?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">Apple has decided to sit out OpenAI&#39;s $6.5 billion (allegedly oversubscribed?) funding round</a>, leaving Microsoft and Nvidia to fight over the last dance. Rumor has it, Microsoft is sweetening the pot with an extra $1 billion, probably hoping to impress OpenAI with its deep pockets and irresistible charm. Beyond SoftBank at $500M, here&#39;s the list of <a class="link" href="https://www.yahoo.com/tech/nvidia-6-other-firms-could-090000549.html?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">other firms that are reportedly in talks to invest</a>. 
</p><h3 class="heading" style="text-align:left;" id="microsoft-and-black-rocks-initial-i">Microsoft and BlackRock’s Initial Investment of $30 Billion</h3><p class="paragraph" style="text-align:left;">Microsoft and BlackRock have announced a <a class="link" href="https://www.arise.tv/microsoftblackrock-to-launch-30-billion-ai-infrastructure-fund/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">joint venture</a>, the Global AI Infrastructure Investment Partnership, with an initial investment of over $30 billion to support the development of AI infrastructure, focusing on <b>data centers and energy resources. </b></p><p class="paragraph" style="text-align:left;">The partnership, which includes MGX as a general partner and expertise from Nvidia, aims to address the growing demand for <b>computational power </b>required by AI models while ensuring sustainable development.</p><h3 class="heading" style="text-align:left;" id="sakana-ai-raises-210-million-in-ser">Sakana AI Raises $210 Million in Series A Funding</h3><p class="paragraph" style="text-align:left;">Sakana AI, an NVIDIA-backed startup, has raised over <a class="link" href="https://finance.yahoo.com/news/nvidia-corporation-nvda-sakana-ai-072820030.html?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">$210 million</a> in a Series A funding round, doubling its initial target of $100 million and achieving a valuation exceeding <b>$1.5 billion</b> just a year after launch.</p><p class="paragraph" style="text-align:left;">The company has gained popularity for developing a method to automate the integration of <b>multiple foundational models.</b></p><h3 class="heading" style="text-align:left;" 
id="nvidia-may-acquire-octo-ai-for-165-">Nvidia May Acquire OctoAI for $165 Million</h3><p class="paragraph" style="text-align:left;">Nvidia is reportedly considering acquiring <a class="link" href="https://finance.yahoo.com/news/nvidia-considers-165m-octoai-acquisition-181454346.html?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">OctoAI,</a> a startup that develops <b>software to enhance AI model efficiency,</b> for $165 million.</p><p class="paragraph" style="text-align:left;">This potential acquisition follows OctoAI&#39;s recent collaboration with Nvidia to integrate NIM into its generative AI platform, aiming to serve <b>various enterprise use cases.</b></p><h3 class="heading" style="text-align:left;" id="black-forest-labs-reported-to-raise">Black Forest Labs Reported to Raise $100 Million</h3><p class="paragraph" style="text-align:left;">Black Forest Labs, the startup behind Grok&#39;s image generator (and <a class="link" href="https://genai360.beehiiv.com/p/on-device-llms-challenge-gpt3-5?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">competitor to Midjourney</a>), is reportedly raising <a class="link" href="https://techcrunch.com/2024/09/20/grok-image-generator-black-forest-labs-raising-100m-at-1b-valuation/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=pytorch-llm-improvements-oracle-salesforce-conference-updates-google-s-ai" target="_blank" rel="noopener noreferrer nofollow">$100 million at a $1 billion valuation</a>, just two months after emerging from stealth with <b>$31 million in funding. 
</b></p><p class="paragraph" style="text-align:left;">Previously, the startup’s valuation was around <b>$150 million</b> during the last funding round, so it’s definitely a big increase.</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=1e1903b2-48ec-41f7-81c5-64de54bd3a8c&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>August in AI: Grok-2 &gt; GPT-4 Turbo, New SOTA for Text-to-Image &amp; Video Gen</title>
  <description>Plus, SLMs from NVIDIA, MSFT, Google</description>
  <link>https://genai360.beehiiv.com/p/the-strawberry-mystery</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/the-strawberry-mystery</guid>
  <pubDate>Tue, 10 Sep 2024 15:09:08 +0000</pubDate>
  <atom:published>2024-09-10T15:09:08Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">In a new experiment, we decided to provide a monthly mega-roundup of all things that may have flown under the radar for you in August. Before we start, share last week&#39;s news with a friend or a colleague:</p><div class="section" style="background-color:#FFFFFF;border-color:#ff8a00;border-radius:2px;border-style:solid;border-width:2px;margin:10.0px 10.0px 10.0px 10.0px;padding:10.0px 10.0px 10.0px 10.0px;"><h2 class="heading" style="text-align:left;"><span style="color:rgb(34, 34, 34);">Join RetrieveX Conference on Oct 17 in San Francisco. 30% OFF Before Prices Go Up Tomorrow</span></h2><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);">Join RetrieveX, our flagship conference on retrieval for GenAI, built exclusively for those creating high-accuracy, multimodal workflows and featuring leaders from Meta AI, Microsoft AI, YC, Bayer Radiology, Matterport, Cresta, as well as the co-creators of Meta Llama, PyTorch, Chameleon, and KubeFlow.</span></p><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);">Check out with promo code </span><span style="color:rgb(34, 34, 34);"><b>FINALCALL</b></span><span style="color:rgb(34, 34, 34);"> for 30% off (before the prices increase from </span><span style="color:rgb(34, 34, 34);"><b>$649 to $949 tomorrow</b></span><span style="color:rgb(34, 34, 34);">). 
Prices go up tomorrow, so secure your spot sooner rather than later.</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://www.eventbrite.com/e/retrievex-the-genai-data-retrieval-conference-tickets-983939869637?aff=oddtdtcreator&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen"><span class="button__text" style="color:#F9FAFB;"> Get Tickets Today </span></a></div><p class="paragraph" style="text-align:left;"><span style="color:rgb(34, 34, 34);">Date: October 17, 10:30am - 7pm PT</span><br><span style="color:rgb(34, 34, 34);">Venue: The Midway, 900 Marin St, San Francisco</span><br><span style="color:rgb(34, 34, 34);">Attendees: 300 AI executives</span></p></div><h2 class="heading" style="text-align:left;" id="key-takeaways-for-august">Key Takeaways for August</h2><ul><li><p class="paragraph" style="text-align:left;">Hermes 3 by Nous Research showed <b>significant improvements </b>over its predecessor, demonstrating competitive benchmark scores against Llama 3.1 through enhanced training techniques.</p></li><li><p class="paragraph" style="text-align:left;">Ideogram 2.0, a SOTA <b>text-to-image</b> model, is now freely available with enhanced features and styles, improving image quality and text rendering through advanced training methods.</p></li><li><p class="paragraph" style="text-align:left;">Google AI Edge&#39;s <b>MediaPipe</b> enabled running 7B+ parameter language models in browsers using WebAssembly and WebGPU, overcoming memory restrictions through redesigned model-loading code.</p></li><li><p class="paragraph" style="text-align:left;">Researchers developed Pyramid Attention Broadcast (PAB) for real-time video generation, achieving up to <b>20.6 FPS</b> with a 10.5× acceleration by mitigating redundancy in attention 
computations.</p></li><li><p class="paragraph" style="text-align:left;">Google DeepMind open-sourced the <b>Vizier algorithm, </b>outperforming industry baselines in black-box optimization through a Gaussian process bandit approach.</p></li><li><p class="paragraph" style="text-align:left;">Anthropic’s new <b>prompt caching</b> feature dramatically reduces costs and latency for long prompts, set to become an industry norm.</p></li><li><p class="paragraph" style="text-align:left;">Huawei challenges Nvidia with the <b>Ascend 910C AI chip</b>, targeting the Chinese market amid production difficulties due to U.S. sanctions.</p></li><li><p class="paragraph" style="text-align:left;">Grok-2 and Grok-2 mini outperformed <b>GPT-4 Turbo</b> in benchmarks such as GPQA and MMMU, excelling in reasoning and factual accuracy.</p></li><li><p class="paragraph" style="text-align:left;">DeepSeek-Prover-V1.5 is an <b>advanced theorem-proving LLM</b> with improved performance on formal mathematics tasks, showcasing state-of-the-art results on rigorous benchmarks like ProofNet.</p></li></ul><p class="paragraph" style="text-align:left;"><i>Got forwarded this newsletter? Subscribe below👇</i></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/subscribe?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen"><span class="button__text" style=""> Subscribe </span></a></div><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">Things seemed to have taken a strange turn regarding the<b> Strawberry situation. 
</b>Meanwhile, some neat optimization frameworks and libraries were introduced, along with a bunch of language and image generator models.</p><p class="paragraph" style="text-align:left;">There were quite a few model releases, ranging from <b>SLMs to text-to-image</b> models. Moreover, a chatbot arena update saw Grok-2 ranked very highly, which means xAI might take the top spot very soon.</p><h3 class="heading" style="text-align:left;" id="salesforce-jamba-and-hermes-expand-">Salesforce, Jamba, and Hermes Expand the AI Model Landscape</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXecOMFzawN61AwUwGZ1sbap5RVUTaqJffMpAQj9rOQHdnwSwxUy2eOEuYMtnQcjMAlFjzc-rz0_PxWBOxBPK46nJw4CfC4TlI3iOUATkC0kMGjrLuWD6nx1OacKNHLDFYzW6l161Hfta5IH7O5zkHZwP_6H?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Comparison of claimed vs effective context window for the RULER benchmark. <a class="link" href="https://www.ai21.com/blog/announcing-jamba-model-family?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.salesforce.com/news/stories/einstein-sales-agents-announcement/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Salesforce</a> announced the release of Einstein SDR Agent and Einstein Sales Coach Agent. 
Einstein autonomously manages inbound leads by answering questions, handling objections, and scheduling meetings, all while grounded in a company’s <b>CRM and external data</b>. </p><p class="paragraph" style="text-align:left;">On the other hand, Einstein Sales Coach Agent does exactly what you’d expect from the name: coaching sellers through role-plays and<b> providing personalized feedback afterwards.</b></p><p class="paragraph" style="text-align:left;">What’s more, these AI agents can be tailored to a company’s specific needs, including setting engagement guardrails and language preferences, making them highly adaptable to different sales strategies. We’re seeing companies like Accenture use them to <b>improve deal effectiveness </b>and scale support for more complex sales activities.</p><p class="paragraph" style="text-align:left;">Another release included <a class="link" href="https://www.ai21.com/blog/announcing-jamba-model-family?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">AI21 Labs’ Jamba 1.5 Mini and Large models</a>, built on the novel SSM-Transformer architecture. They offer <b>superior long-context </b>handling, speed, and quality, and are the first non-Transformer models to match the performance of leading competitors, featuring a 256K context window—the longest in the market for open models.</p><p class="paragraph" style="text-align:left;">The Jamba 1.5 models are designed for resource efficiency, with Jamba 1.5 Mini capable of handling up to 140K tokens on a <b>single GPU</b>. </p><p class="paragraph" style="text-align:left;">In particular, these models stand out because they maintain high performance across the entire context window, significantly improving the<b> efficiency </b>and accuracy of enterprise-scale applications. 
Independent tests showed Jamba 1.5 Mini as the fastest model on 10K contexts, outpacing other models in its size class.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://nousresearch.com/hermes3/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Hermes 3 by Nous Research </a>was also a noteworthy release with drastic improvements over its <b>predecessor,</b> Hermes 2, some of which include:</p><ul><li><p class="paragraph" style="text-align:left;">More reliable function calling </p></li><li><p class="paragraph" style="text-align:left;">Better code generation skills</p></li><li><p class="paragraph" style="text-align:left;">Enhanced general assistant capabilities</p></li></ul><p class="paragraph" style="text-align:left;">In terms of benchmark performance, Hermes 3 is certainly no slouch as it showed<b> highly competitive benchmark scores </b>against Llama 3.1.</p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcXuxb930MLEjpXPplc4FXwbk_I5fIZpRD2TjvGOUlY6FuVyF8yrH-uxNUkOpwu8zWc19QwpSLJZ80cMqu3dkvLCp341Iu-Dy9v_IyfDgC8vW1HiMfgOxyoyZjpIaK4ztFbUtA5xikHvqqNxCsZmkpoa7mY?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Hermes 3 was able to outperform Llama 3.1 on benchmarks like AGIEval and ARC-C. 
<a class="link" href="https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><h3 class="heading" style="text-align:left;" id="ideograms-new-model-and-flux-availa">Ideogram’s New Model and FLUX Available on 3 Platforms</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcEyH2ZMmZRuJOA3XAB0PFPgM_2Oxlhq2DKN4WdbYh1uDwoMcostz7MuPZ4Ae0_dcy81e-g23QEAtgS1_VS10ZY6Y9mfO5JpfEVpfuubjJVlJ-9mHpL1Sc23lKipEAJsViG7cTNZEW42zj7JQUJkzzcj9w?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Sample image produced by Ideogram 2.0. <a class="link" href="https://about.ideogram.ai/2.0?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://about.ideogram.ai/2.0?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Ideogram 2.0</a>, a SOTA text-to-image model, is now accessible to all users via Ideogram.ai and a<b> newly launched iOS app</b>. The platform also introduces premium features through subscription plans, offering enhanced creative control and flexibility.</p><p class="paragraph" style="text-align:left;">The model supports<b> various styles</b> such as realistic and design, so users can create highly detailed and context-specific images. 
These styles boost the realism of textures and accuracy of text in designs.</p><p class="paragraph" style="text-align:left;">Improvements over its predecessor include creative tools like Magic Prompt and Describe, which help users generate and refine prompts for <b>image creation</b>. These tools enhance the iterative creative process, allowing for continuous reimagining of visual concepts.</p><p class="paragraph" style="text-align:left;">We mentioned before that Midjourney had some <b>serious competition </b>with Black Forest Labs’ FLUX-1 models. Turns out these models are available on three platforms: </p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://replicate.com/black-forest-labs/flux-pro?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Replicate</a> (no free credits)</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://fal.ai/models/fal-ai/flux-pro?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Fal</a> ($1 free credit)</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.mystic.ai/black-forest-labs/flux1-pro?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Mystic </a>(available for free)</p></li></ul><h3 class="heading" style="text-align:left;" id="google-nvidia-and-microsoft-advance">Google, Nvidia, and Microsoft Advance SLMs</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" 
src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfE58oL7dwahpS-tCQee-U5yz1ge5RqLkLdp3m-czm5QWE4BiAGiBWuHJIfaV0LnQqS7sg9oXFYO_eELNukr5lrneKAetwUe4LwkhdXEegmCdXJd6lRIIRfwwfcumEJUykDLhBp0P_uAMLIVg-W7RfzW0yH?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Phi 3.5-MoE showed impressive model quality on the MMLU benchmark. <a class="link" href="https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/discover-the-new-multi-lingual-high-quality-phi-3-5-slms/ba-p/4225280?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Given the <b>rising popularity</b> of SLMs, it isn’t too surprising to see more SLM releases.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://research.google/blog/unlocking-7b-language-models-in-your-browser-a-deep-dive-with-google-ai-edges-mediapipe/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Google AI Edge&#39;s MediaPipe </a>redesigned its <b>model-loading code</b> to overcome memory restrictions, which means larger language models (7B+ parameters) can be run in the browser using their cross-platform inference framework.</p><p class="paragraph" style="text-align:left;">The framework compiles <b>C++ </b>code into WebAssembly for efficient browser performance while leveraging WebGPU API for <b>direct GPU access.</b> New strategies, such as asynchronous layer loading and local caching, drastically reduced WebAssembly memory usage, so larger models can run smoothly.</p><p class="paragraph" style="text-align:left;">After Nvidia discussed how to prune <a class="link" 
href="https://genai360.beehiiv.com/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=cryptic-strawberry-teasers-grok-2-gpt-4-turbo-ai-optimization-breakthroughs" target="_blank" rel="noopener noreferrer nofollow">Llama-3.1 8B to Llama-3.1-Minitron-8B, </a>they released <a class="link" href="https://blogs.nvidia.com/blog/mistral-nemo-minitron-8b-small-language-model/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Mistral-NeMo-Minitron 8B</a>. It’s a <b>miniaturized v</b>ersion of the Mistral NeMo 12B model, which delivers SOTA accuracy in a compact 8 billion parameter form factor. </p><p class="paragraph" style="text-align:left;"><b>Mistral-NeMo-Minitron 8B </b>leads on nine popular benchmarks for language models of its size, covering tasks like language understanding, reasoning, summarization, coding, and generating truthful answers. 
</p><p class="paragraph" style="text-align:left;">NVIDIA also announced Nemotron-Mini-4B-Instruct, another <b>SLM</b> optimized for low memory usage and faster response times on NVIDIA GeForce RTX AI PCs and laptops, available as part of NVIDIA ACE technologies.</p><p class="paragraph" style="text-align:left;">Meanwhile, Microsoft introduced <a class="link" href="https://huggingface.co/microsoft/Phi-3.5-MoE-instruct?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Phi-3.5</a>, an <b>updated family of SLMs</b> including:</p><ul><li><p class="paragraph" style="text-align:left;">Phi-3.5-mini (3.8B parameters)</p></li><li><p class="paragraph" style="text-align:left;">Phi-3.5-vision</p></li><li><p class="paragraph" style="text-align:left;">Phi-3.5-MoE (Mixture-of-Experts with 42B total parameters but only 6.6B active).</p></li></ul><p class="paragraph" style="text-align:left;">Phi-3.5-mini enhances <b>multi-lingual support </b>with a 128K context length, showing significant improvements in languages like Arabic, Dutch, Finnish, Polish, Thai and Ukrainian with 25-50% performance boosts.</p><p class="paragraph" style="text-align:left;">Phi-3.5-MoE outperforms <b>larger dense models</b> in quality and performance, supporting over 20 languages and employing robust safety post-training strategies combining Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).</p><h3 class="heading" style="text-align:left;" id="ai-optimization-trifecta-llm-compre">AI Optimization Trifecta: LLM Compressor, Apple’s ToolSandbox, and PyTorch’s FlexAttention</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXegs3GY871JyzMlubMUulC1NKuRXB5Zx_llJ2RNNLRzguYsgbbRkbc07X7C7uAWmksSFfJuh2P6rEkXXWVLtjl8Lfe7A2lP2Vk4zRQMaFO7NQY3lmbFsObbQcLTSyfQmMysUPEBWu2qAZifp31fLKKUero?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>LLM compressor overview. <a class="link" href="https://neuralmagic.com/blog/llm-compressor-is-here-faster-inference-with-vllm/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://neuralmagic.com/blog/llm-compressor-is-here-faster-inference-with-vllm/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">NeuralMagic’s LLM Compressor</a> is a unified library for creating compressed models for faster inference with vLLM. It enables <b>various compression techniques</b> like weight quantization, activation quantization, and pruning.</p><p class="paragraph" style="text-align:left;">We saw that <a class="link" href="https://genai360.beehiiv.com/p/of-strawberries-and-models?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Llama 3 was adapted to multimodality</a>. 
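</p><p class="paragraph" style="text-align:left;">To make the weight-quantization idea above concrete, here is a toy sketch of symmetric INT8 quantization. This is a conceptual illustration only, not LLM Compressor’s actual implementation (which calibrates on data and uses refinements like per-channel scales):</p>

```python
# Conceptual sketch of symmetric INT8 weight quantization: pick one scale
# so the largest weight maps to 127, round every weight to an integer,
# and recover approximate floats by multiplying back by the scale.

def quantize_int8(weights):
    """Map float weights onto the signed 8-bit range [-127, 127] with a single scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integer values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]
quantized, scale = quantize_int8(weights)   # integers in [-127, 127]
restored = dequantize(quantized, scale)
# Each weight is recovered to within half a quantization step.
assert all(abs(w - r) <= scale / 2 + 1e-9 for w, r in zip(weights, restored))
```

The point is the storage and bandwidth win: each weight shrinks from 16 or 32 bits to 8, at the cost of a bounded rounding error.

<p class="paragraph" style="text-align:left;">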
NeuralMagic also did some work with Llama 3 by using LLM Compressor to create fully <b>quantized </b>versions of <a class="link" href="https://huggingface.co/collections/neuralmagic/llama-31-quantization-66a3f907f48d07feabb8f300?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Llama 3.1.</a></p><p class="paragraph" style="text-align:left;">It also integrates seamlessly with Hugging Face models and vLLM, which makes deployment pretty straightforward. LLM Compressor <b>showed notable performance improvements</b>, with INT8 weight and activation quantized models showing up to 1.6x speedup compared to FP16 baselines at low query rates.</p><p class="paragraph" style="text-align:left;">Apple’s ToolSandbox is a new evaluation framework for assessing tool-use capabilities of LLMs, addressing limitations of <b>previous evaluation methods</b>. It introduces stateful tool execution and implicit state dependencies between tools, moving beyond simple stateless web services or single-turn prompts.</p><p class="paragraph" style="text-align:left;">The framework includes a built-in user simulator that enables on-policy conversational evaluation, allowing for more dynamic and realistic testing scenarios. 
ToolSandbox implements a <b>dynamic evaluation</b> strategy that can assess both intermediate and final milestones over arbitrary interaction trajectories.</p><p class="paragraph" style="text-align:left;">Optimization news continued with <a class="link" href="https://pytorch.org/blog/flexattention/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">FlexAttention</a>, a new<b> PyTorch AP</b>I that allows implementing many attention variants in a few lines of idiomatic PyTorch code, addressing the lack of flexibility in existing optimized attention implementations.</p><p class="paragraph" style="text-align:left;">It introduces a flexible API with a user-defined score_mod function that can modify attention scores prior to softmax, which enables <b>various attention patterns like:</b></p><ul><li><p class="paragraph" style="text-align:left;">Causal masking</p></li><li><p class="paragraph" style="text-align:left;">Relative positional encodings</p></li><li><p class="paragraph" style="text-align:left;">Sliding window attention</p></li></ul><p class="paragraph" style="text-align:left;">FlexAttention uses<b> torch.compile</b> to lower the user-defined functions into a fused FlashAttention kernel, achieving performance competitive with handwritten kernels without materializing extra memory.</p><h3 class="heading" style="text-align:left;" id="llama-3-pruning-and-claudes-caching">Llama 3 Pruning and Claude&#39;s Caching Technique</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdTJ6QwFFyPUNmEcVyyy7m9wCrGPZ_YTmctMT89frvcHPzHbWXxhf5F8SxKoKDWbsvGysmFYW6EbDPPnXrPfxD97ON0qeldeJCxje0aosIngHDEmqqqzzWx17Gp2KucbAZDLT8An7HlolKk9uykNZUQxnIQ?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>Pruning and distillation process of a single model. 
<a class="link" href="https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">We’ve seen small language models attract growing attention recently, with releases such as <a class="link" href="https://genai360.beehiiv.com/p/of-llamas-and-slms?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">GPT-4o mini</a> and <a class="link" href="https://genai360.beehiiv.com/p/on-device-llms-challenge-gpt3-5?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Gemma 2 2B.</a> Nvidia is continuing the SLM momentum by showing us how larger models can be pruned to obtain a smaller model, using <a class="link" href="https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Llama-3.1-Minitron-4B</a> as an example. This involves structured weight pruning combined with <b>knowledge distillation.</b></p><p class="paragraph" style="text-align:left;">The pruning process includes <b>both </b>depth pruning (removing 16 layers) and width pruning (reducing embedding and MLP dimensions).</p><p class="paragraph" style="text-align:left;">Knowledge distillation is used to retrain the pruned model, with the original 8B model serving as the teacher. 
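</p><p class="paragraph" style="text-align:left;">The core distillation objective can be sketched in a few lines: the student is trained to match the teacher’s softened output distribution. This is a minimal conceptual sketch, not Nvidia’s actual training recipe (which distills at scale and combines additional loss terms):</p>

```python
# Minimal sketch of the knowledge-distillation objective: minimize the KL
# divergence between the teacher's and student's temperature-softened
# output distributions over the vocabulary.
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the student's to the teacher's softened distribution."""
    p = softmax(teacher_logits, temperature)   # teacher targets
    q = softmax(student_logits, temperature)   # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]
student_close = [1.9, 0.6, -0.9]   # roughly mimics the teacher
student_far = [0.0, 2.0, 0.0]      # disagrees with the teacher
# A student that mimics the teacher incurs a much smaller loss.
assert distillation_loss(teacher, student_close) < distillation_loss(teacher, student_far)
```

The temperature softens both distributions so the student also learns from the teacher’s relative preferences among unlikely tokens, not just its top prediction.

<p class="paragraph" style="text-align:left;">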
The pruned and distilled 4B model <b>outperforms </b>other models of similar size on various benchmarks, while requiring fewer training tokens and compute resources compared to training from scratch. </p><p class="paragraph" style="text-align:left;">Anthropic has introduced <a class="link" href="https://www.anthropic.com/news/prompt-caching?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">prompt caching</a> for its Claude AI models, letting developers store and reuse <b>large amounts of context</b> between API calls. This feature is currently available in public beta for Claude 3.5 Sonnet and Claude 3 Haiku, with support for Claude 3 Opus coming soon.</p><p class="paragraph" style="text-align:left;">Prompt caching can reduce costs by <b>up to 90%</b> and latency by up to 85% for long prompts. It&#39;s particularly useful for scenarios like conversational agents, coding assistants, large document processing, and agentic search where repeated access to extensive context is needed.</p><p class="paragraph" style="text-align:left;">The pricing model for cached prompts <b>involves a 25%</b> premium over base input token prices for writing to the cache, but only 10% of the base price for reading cached content. This structure makes it appealing to frequently reuse cached prompts. </p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeR2VlumZC30yYrHWzCSHBvYiapVo7pwX0SOK0e-VpdxYbYQ5FjbKgefNjUflqzAlv_NrJyNOdtXrN84to29kJi3n877LbykKlHOryc_TXHEwiFvoOrgqIYH7LXXx8dVIS_xPvNFEN4W3UErj36SG6C1DQ?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>Prompt caching prices. 
<a class="link" href="https://www.anthropic.com/news/prompt-caching?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Early adopters have reported significant improvements in both speed and cost across various use cases. For example, chatting with a book using a 100,000 token cached prompt saw a <b>79% reduction </b>in latency and 90% cost reduction.</p><p class="paragraph" style="text-align:left;">Interestingly, it seems like prompt caching will become an <b>industry norm </b>at this rate.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdAoHU08yma-ib-6eUFgnGfNj6l01aiBAVKuBHHHdPIOerz4fwXI7u8De7mjdimESKSxq_Qb_2FKpdylsEaxdJp1lGlFdbRGJWAz-5eWmk42OuccnXdXNKZA_wqyER848jRs5BzIqYbyimyVGZ0bOg57oN1?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>Matt Shumer pointed out that Google, DeepSeek, and Anthropic have all incorporated prompt caching. 
<a class="link" href="https://x.com/mattshumer_/status/1823755850353132002?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><h3 class="heading" style="text-align:left;" id="grok-2-gpt-4-o-and-gemini-live-ente">Grok-2, GPT-4o, and Gemini Live Enter the Arena</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeLtjMdkSODlxFww5VXr5tSt5fwZT63e_zZuZALfTn7eMjpl11qBIMKtKaHF8h-6ztnBu_tZcU0DkHbS_wOEprGCdR_20KhHNh2kWjYf4gSbRd3tTlfIOpRG3uFWoAgCnL2oEhsAmd2ngyQXkV2qTf3Jb5t?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>Grok-2 and Grok-2 mini show notable improvement in various benchmarks compared to Grok-1.5 and other SOTA models. <a class="link" href="https://x.ai/blog/grok-2?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://x.ai/blog/grok-2?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Grok-2</a> and its smaller version, Grok-2 mini (another SLM), were released and showed significant improvements in chat, coding, and reasoning over Grok-1.5. Grok-2, tested under the name &quot;<b>sus-column-r,</b>&quot; outperforms models like Claude 3.5 Sonnet and GPT-4-Turbo on the LMSYS leaderboard.</p><p class="paragraph" style="text-align:left;">Grok-2 shows superior reasoning, factual accuracy, and tool use, particularly excelling in <b>academic benchmarks</b> such as MATH, MMLU, and DocVQA. 
Grok-2 is integrated into the X app, providing advanced AI assistance, real-time information, and enhanced search functionalities for Premium users.</p><p class="paragraph" style="text-align:left;">It was also added to the <a class="link" href="https://x.com/lmsysorg/status/1827041269534879784?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">LM Arena leaderboard</a>, which compares various LLMs. The leaderboard includes a <b>win-rate heatmap</b>, showing how Grok-2 performs against other models in head-to-head comparisons.</p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdhGAZ7LBIUIBcEkDpXP1gJv1n6rqL5bJTDS34yoX6N0XUq_HoQDvfB4c3JWQnEdTKM_AEaUiyj9fQIvJHuPVNpFpN24GZPL1vMKAAq_y6_COjDpOMD3SuxXWExcr2nQqFm_oGAB_AhbOYKa_BpUfquxGBl?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Grok-2 has climbed to the top of the leaderboard. <a class="link" href="https://x.com/lmsysorg/status/1827041269534879784?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Grok-2’s performance is comparable to <b>top LLMs</b>, only losing to the latest GPT-4 model and Gemini-1.5 pro. 
Pretty impressive, considering xAI was only founded in March 2023 and has already caught up to the likes of OpenAI.</p><p class="paragraph" style="text-align:left;">Meanwhile, OpenAI launched a new <a class="link" href="https://www.cryptopolitan.com/openai-rolls-out-new-gpt-4o-model-in-chatgpt/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">GPT-4o</a> model in ChatGPT, enhancing multi-step reasoning and detailed explanations, leading to more accurate and logical responses. The new model <b>independently</b> generates images, surpassing DALL-E 3 in both speed and visual accuracy, improving the integration within ChatGPT. </p><p class="paragraph" style="text-align:left;">Users noticed <b>significant improvements</b> in both reasoning and image generation, leading to better quality outputs and a more seamless experience. </p><p class="paragraph" style="text-align:left;">Google didn’t just sit on the sidelines while xAI and OpenAI released new models. They unveiled <a class="link" href="https://techcrunch.com/2024/08/17/google-takes-on-openai-with-gemini-live/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Gemini Live</a>, a <b>conversational AI voice assistant</b>, at its Made by Google event alongside new Pixel phones, AI photo editing tools, and Pixel Buds Pro 2 with Gemini AI. 
The Gemini Live demo had some issues during the presentation.</p><h3 class="heading" style="text-align:left;" id="qs-efficiency-gains-and-jassys-long">Q’s Efficiency Gains and Jassy&#39;s Long-Term Confidence</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeJYm4hv012lg2DJS3Jv2r2aXA-AXbxN_D-KLrajdML5G0VI88Bwf6kGdpussNzl8iSYfa1vLkwh6vRmoDjZwNBfUk4KGdMhqkflBx6QXg2dcxPeg76TsTs6PfPXa42u5GGVGPvny6TeaX5NKs_vK7Rp2M?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Amazon CEO Andy Jassy. <a class="link" href="https://www.crn.com/news/ai/2024/amazon-q2-2024-earnings-ceo-jassy-very-bullish-on-ai-in-long-run?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen#:~:text=Amazon%20AI%20A%20Long%2DTerm,multibillion%20dollar%20revenue%20run%20rate.%E2%80%9D" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Amazon remains &quot;<a class="link" href="https://finance.yahoo.com/news/amazon-ceo-andy-jassy-says-213018283.html?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">very bullish</a>&quot; on AI&#39;s medium to long-term impact across all businesses, with their AI business already at a &quot;<b>multibillion dollar revenue run rate&quot;. 
</b></p><p class="paragraph" style="text-align:left;">We’re seeing Amazon leverage AI in its e-commerce business, including an AI shopping assistant called Rufus, apparel simulation features, and a &quot;Project Private Investigator&quot; using AI and computer vision to <b>detect product defects</b>.</p><p class="paragraph" style="text-align:left;">On AI costs, Jassy stated Amazon has developed expertise in managing capacity for AWS and AI customers. While investing significantly in AI infrastructure, they still see more demand than current capacity. AWS brought in <b>$26.3 billion in Q2</b>, up 19% year-over-year, with an annualized revenue run rate over $105 billion. </p><p class="paragraph" style="text-align:left;">Jassy also mentioned that Amazon’s AI assistant, <a class="link" href="https://www.crn.com/news/ai/2024/amazon-q2-2024-earnings-ceo-jassy-very-bullish-on-ai-in-long-run?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen#:~:text=Amazon%20AI%20A%20Long%2DTerm,multibillion%20dollar%20revenue%20run%20rate.%E2%80%9D" target="_blank" rel="noopener noreferrer nofollow">Amazon Q</a>, has significantly reduced software upgrade times, cutting the average time to upgrade an application to Java 17 from 50 developer days to <b>just a few hours</b>.</p><p class="paragraph" style="text-align:left;">This efficiency has saved Amazon an estimated <b>4,500 developer-years</b> of work, with 79% of AI-generated code reviews being shipped without additional changes. The upgrades have not only saved time but also enhanced security and reduced infrastructure costs, providing an estimated $260 million in annualized efficiency gains.</p><p class="paragraph" style="text-align:left;">Amazon Q&#39;s success comes after initial challenges, including issues with incorrect outputs or &quot;<b>hallucinations</b>&quot;. 
These were addressed by expanding the team of human reviewers to fine-tune the chatbot&#39;s outputs.</p><h3 class="heading" style="text-align:left;" id="ai-chip-race-heats-up-as-huawei-cha">AI Chip Race Heats Up as Huawei Challenges Nvidia While Softbank Pivots from Intel</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdpNxBxenY1C7uxpJCLrlZ4MG7BdZaYiSyHKQu_zsgLcix6RV7mcTeEgZ_pPFAa0VHJzS3C_frKp2hw0y0pk4hRZy2jYwfvkpE-nWWHLilYFSC2MZlkuznKkVJ3Z4w6n_-rAg1rppj2i-zqy_kFfw_vx88?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>Huawei’s latest AI chip. <a class="link" href="https://www.huaweicentral.com/huawei-silently-testing-ascend-910c-ai-chip-to-rival-nvidia-report/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">There always seems to be something going on in the AI chip market. Huawei is developing the <a class="link" href="https://www.tomshardware.com/tech-industry/artificial-intelligence/huawei-already-has-a-new-chip-to-rival-nvidia-ai-gpus?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Ascend 910C AI chip </a>to <b>compete </b>with Nvidia&#39;s GPUs in the Chinese market, particularly against the HGX H20 and the rumored Blackwell-based B20.</p><p class="paragraph" style="text-align:left;">Major Chinese companies like Baidu and China Mobile have <b>already</b> tested the chip, with results reportedly on par with Nvidia&#39;s H100. 
</p><p class="paragraph" style="text-align:left;">Expected demand for the Ascend 910C could exceed <b>70,000 units,</b> with shipments targeted to start in October, but production isn’t going as smoothly as expected because of U.S. sanctions.</p><p class="paragraph" style="text-align:left;">The Ascend 910C <b>aims to improve</b> upon its predecessor, the Ascend 910B, by addressing yield issues and enhancing performance.</p><p class="paragraph" style="text-align:left;">Previously, we looked at how Intel’s discussions with<a class="link" href="https://genai360.beehiiv.com/p/of-strawberries-and-models?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow"> OpenAI were a huge turning point </a>in the AI chip market. Intel’s story continues as <a class="link" href="https://www.cryptopolitan.com/softbank-ends-ai-chip-partnership-with-intel/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">SoftBank</a> <b>halted</b> its AI chip development partnership with Intel, since Intel had issues meeting production volume and speed requirements. SoftBank is now negotiating with TSMC, the world&#39;s largest contract chipmaker, for AI chip production.</p><p class="paragraph" style="text-align:left;">SoftBank&#39;s Project Izanagi aims to develop AI processors to rival Nvidia’s GPUs, initially relying on Intel’s capacity but now looking to <b>TSMC</b>. 
SoftBank plans to establish AI data centers globally by 2026 and is developing AI chips with Arm, targeting a prototype by 2025.</p><h3 class="heading" style="text-align:left;" id="ai-image-generation-leaps-forward-w">AI Image Generation Leaps Forward With Google’s Imagen 3, Runway’s Gen-3, and Midjourney&#39;s Unified Editor</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfHOzZh2l3OR3nAu4c8Z-r_J0uTNKxdIBfzJTgLs2Fjlbe4_nUQg2L3LqZPr10ZMGEeR9DTg9C64NCueYQpAhWcmB2OWvBuvdk86TF-Kwu5B60zmOnCZpfN4MYnhapS98schg5f_qfkFfswmJpn8-jjOVAc?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>Example of an image produced by Imagen 3. <a class="link" href="https://www.theverge.com/2024/8/15/24221218/google-ai-image-generator-imagen-3-available?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source) </a></p></span></div></div><p class="paragraph" style="text-align:left;">After releasing the next generation of <a class="link" href="https://genai360.beehiiv.com/p/on-device-llms-challenge-gpt3-5?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Gemma 2 models</a>, Google released <a class="link" href="https://www.theverge.com/2024/8/15/24221218/google-ai-image-generator-imagen-3-available?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Imagen 3</a>, an advanced AI text-to-image generator. 
This version was first announced during<b> Google I/O in May 2024.</b></p><p class="paragraph" style="text-align:left;">Imagen 3 introduces <b>significant improvements</b> in image generation, producing visuals with better detail, richer lighting, and fewer distracting artifacts compared to previous iterations. It aims to enhance realism and reduce errors in the generated images.</p><p class="paragraph" style="text-align:left;">Users can interact with the generated images by <b>highlighting specific sections </b>and applying changes based on their descriptions, offering a more refined and customizable image creation experience.</p><p class="paragraph" style="text-align:left;">AI image generators continued to see more releases with <a class="link" href="https://venturebeat.com/ai/runways-gen-3-alpha-turbo-is-here-and-can-make-ai-videos-faster-than-you-can-type/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Runway ML </a>officially launching<b> Gen-3 Alpha Turbo</b>, an upgraded AI video generation model, promising seven times faster performance at half the cost compared to its predecessor, Gen-3 Alpha.</p><p class="paragraph" style="text-align:left;">The new model is accessible across <b>all subscription plans</b>, including free trials. It&#39;s priced at 5 credits per second of generated video, making it more affordable and widely available.</p><p class="paragraph" style="text-align:left;">Gen-3 Alpha Turbo prioritizes speed, <b>significantly reducing video generation time. 
</b>This improves workflow efficiency, particularly for users needing quick turnarounds.</p><p class="paragraph" style="text-align:left;">Early users have praised the model&#39;s <b>speed and quality,</b> with some still favoring the original for certain use cases - though the faster version is well-received for simpler tasks.</p><p class="paragraph" style="text-align:left;">Midjourney introduced a <a class="link" href="https://venturebeat.com/ai/midjourney-releases-new-unified-ai-image-editor-on-the-web/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">new web editor</a> that unifies inpainting, outpainting, and other tools into a single interface, making it easier to edit AI-generated images seamlessly. The new editor includes a <b>more precise virtual brush</b> for inpainting, replacing older tools, allowing for finer control over image edits.</p><h3 class="heading" style="text-align:left;" id="harveys-impressive-retention">Harvey’s Impressive Retention</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdSgSaLHh4tNp7AvRI2xDpxoPkMFIHoetvdFfHZ69QrT1I6f8InsCR9Qwrl4TGSk-8_9LUMaCrycDKNI2GLL8dKPJ9m8rvKZrqvf4qrwwV8VXrHDIhblv1xCGbug8KKXj6g4B2pqGqjkPSrAktAvly3V0e8?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Harvey showed a notable increase in user retention. 
<a class="link" href="https://www.harvey.ai/blog/a-new-era-for-technology-adoption-in-professional-services?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.harvey.ai/blog/a-new-era-for-technology-adoption-in-professional-services?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Harvey,</a> an AI platform for professional services, has seen significant growth in utilization, more than <b>doubling</b> from 33% in August 2023 to 69% in August 2024 across all users.</p><p class="paragraph" style="text-align:left;">User retention rates for Harvey have remained <b>consistently high</b>, hovering around or above 70% after one year - exceptional for enterprise SaaS and legal tech.</p><p class="paragraph" style="text-align:left;">In particular, three case studies of BigLaw firms show <b>rapid and substantial adoption of Harvey: </b></p><ul><li><p class="paragraph" style="text-align:left;">Firm #1 reached 93% utilization by month 12</p></li><li><p class="paragraph" style="text-align:left;">Firm #2 jumped from 19% to 97% utilization in one month</p></li><li><p class="paragraph" style="text-align:left;">Firm #3 exceeded 100% utilization from month 4 onwards, peaking at 128% by month 10.</p></li></ul><p class="paragraph" style="text-align:left;">Harvey&#39;s success in rapid onboarding and consistent usage over time highlights its effectiveness in delivering <b>immediate value</b>, ease of integration into existing systems, and potential to provide firms with competitive advantages in service delivery and client satisfaction.</p><p class="paragraph" style="text-align:left;">Not to mention that 
Harvey recently had a <a class="link" href="https://www.forbes.com/sites/aliciapark/2024/08/08/why-openai-and-google-are-betting-on-this-ai-unicorn-with-a-100-million-deal/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">successful funding round</a> involving tech giants like OpenAI and Google. Harvey’s chatbot was able to impress OpenAI executives with an 86% accuracy rate, with the deal making Harvey the<b> highest value startup in OpenAI’s portfolio.</b></p><h3 class="heading" style="text-align:left;" id="claude-hits-1-m-mobile-milestone">Claude Hits $1M Mobile Milestone</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXd9GIn9zGLgUhn9F7tUk27qiM6leZaAJHCBL4o32UHeZNEwgWQPOaWNZ7GRAEdhKUjWA2bFovf8IbP9ikL0h4anrloyFwpiKuhpKCPXrSvuH8u_QOmuUMiOxat19XzY3W4TpmCcDa3URichQElVcg7jVD5j?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Graph of consumer spending in Claude App in recent weeks. 
<a class="link" href="https://techcrunch.com/2024/08/21/anthropics-claude-surpasses-1m-in-mobile-app-revenue/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Anthropic’s Claude has crossed <a class="link" href="https://techcrunch.com/2024/08/21/anthropics-claude-surpasses-1m-in-mobile-app-revenue/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">$1 million</a> in gross mobile app revenue across iOS and Android in 16 weeks since launch. Nearly half (48.4%) of Claude&#39;s mobile revenue has been <b>generated by users in the</b> <b>US.</b></p><p class="paragraph" style="text-align:left;">It’s certainly an impressive milestone, but Claude still<b> ranks far behind ChatGPT,</b> which is No. 1 by overall downloads and No. 26 by revenue in the US on iOS. Claude is only 95th in the Productivity category by downloads and 68th by revenue.</p><p class="paragraph" style="text-align:left;">Claude reached the <b>$1 million revenue mark faster </b>than competitors like Microsoft&#39;s Copilot (19 weeks) and Perplexity (22 weeks), but significantly slower than ChatGPT (3 weeks).</p><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">Google not only showed that 7B models can be run in browsers, but they also published an interesting paper that made notable progress in <b>black-box optimization</b>. 
Moreover, another paper looking at the impact of code data in LLM pre-training also caught our eye.</p><p class="paragraph" style="text-align:left;">We saw notable advancements in automated theorem proving, efficient model upcycling, and novel frameworks for <b>evaluating AI systems.</b></p><h3 class="heading" style="text-align:left;" id="googles-vizier-algorithm-outperform">Google&#39;s Vizier Algorithm Outperforms Industry Baselines in Black-Box Optimization</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdS-3GsbKUuP6cTrQd6Rq--pjlpv5vLmx6Q5tC1QxSHIlVDjkd27nQXo5pNX93QJ538WxZVsoSMfN58awgoC5EeP70JkDLGB7_qCttuEbuKCWY797DFPhhV_UZqJbWHGNUpOYHogm9oR7W69flfuN4wi2s?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Main components of the algorithm. <a class="link" href="https://arxiv.org/pdf/2408.11527v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers from Google DeepMind have <a class="link" href="https://arxiv.org/pdf/2408.11527v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">formalized and open-sourced</a> the algorithm behind Google Vizier, one of the <b>world&#39;s largest black-box optimization services. 
</b></p><p class="paragraph" style="text-align:left;">The paper looks at the challenge of creating a robust, versatile, and production-grade Bayesian optimization system that can handle a wide range of <b>optimization scenarios</b>, from high-dimensional spaces to categorical parameters and multi-objective optimization.</p><p class="paragraph" style="text-align:left;">The algorithm employs a <b>Gaussian process bandit optimization approach</b> with several key innovations, including sophisticated input and output preprocessing, flexible acquisition functions with trust regions, and a customized Firefly algorithm for acquisition optimization. </p><p class="paragraph" style="text-align:left;">DeepMind’s Vizier algorithm consistently outperforms other industry-wide baselines across multiple axes, including non-continuous parameters, high-dimensional spaces, batched settings, and multi-metric objectives.</p><h3 class="heading" style="text-align:left;" id="code-in-llm-pretraining-improves-na">Code in LLM Pre-training Improves Natural Language Reasoning by 8.2%</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcWo-Vn4C7u6FqZk2fw3bMBEpEuaEiRmBvoD1lGgU_h_boMj4mtoyT0nhNA8C7LZEUT-1cx38yf_DAgqUwAtC5VZIrksZGVHjI19neTi03wDxR9WIkW5aQ8YILDsz99aUfLmAp5FEgR9GpDiERZXJrOw0xE?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Framework overview. 
<a class="link" href="https://arxiv.org/pdf/2408.10914?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Cohere researchers conducted a comprehensive study to understand the impact of code data in <b>pre-training LLMs</b> on a variety of downstream tasks beyond code generation.</p><p class="paragraph" style="text-align:left;">The team employed a <b>systematic approach,</b> conducting extensive ablations across various dimensions:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Model initialization strategies</p></li><li><p class="paragraph" style="text-align:left;">Different proportions of code data</p></li><li><p class="paragraph" style="text-align:left;">Quality and properties of code data</p></li><li><p class="paragraph" style="text-align:left;">Role of code in pre-training cooldown</p></li></ol><p class="paragraph" style="text-align:left;">Their experiments spanned models ranging from <b>470 million to 2.8 billion </b>parameters, evaluating performance on natural language reasoning, world knowledge, code generation, and open-ended text generation tasks.</p><p class="paragraph" style="text-align:left;">Key findings include:</p><ul><li><p class="paragraph" style="text-align:left;">Compared to text-only pre-training, the best variant with code data showed relative increases of 8.2% in natural language reasoning, 4.2% in world knowledge, and a 6.6% improvement in generative win-rates</p></li><li><p class="paragraph" style="text-align:left;">Code performance saw a dramatic 12x boost</p></li><li><p class="paragraph" style="text-align:left;">Including code during the cooldown phase led to further improvements across all tasks</p></li><li><p class="paragraph" style="text-align:left;">High-quality synthetic code data, even in small 
proportions, had a strong positive impact on both code and non-code task performance</p></li></ul><p class="paragraph" style="text-align:left;">Results make it clear that code is a <b>critical building block</b> for generalization far beyond coding tasks. Investments in code quality and preserving code during pre-training can have positive impacts across a wide range of AI capabilities.</p><h3 class="heading" style="text-align:left;" id="automating-the-full-cycle-of-ml-res">Automating the Full Cycle of ML Research With The AI Scientist</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXd-CNzdq3FK1473ytmoXmSaySa2QGsm_UXLzfROds66TPs8eYo9HhyaPn1tgJFTaVsBzFZvgVHjNzYmj0IMvYvd0YDaDoIuMUVL_npK6G_enmAhVGyk1pgaoeNTZqirmztTs072IxG8E2cd6Zn0YySAZTwQ?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>The AI Scientist overview. <a class="link" href="https://arxiv.org/pdf/2408.06292v2?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Sakana AI developed The AI Scientist, which aims to automate the entire scientific discovery process in ML, from idea generation to paper writing and peer review. It addresses the challenge of scaling up scientific research and democratizing access to cutting-edge AI developments by <b>leveraging LLMs</b> to perform tasks traditionally done by human researchers.</p><p class="paragraph" style="text-align:left;">The AI Scientist uses a <b>combination of LLM-based agents</b> for idea generation, experiment design and execution, paper writing, and peer review. It employs techniques such as chain-of-thought reasoning, self-reflection, and automated code generation to carry out complex research tasks. 
The system was tested on three ML subfields: diffusion modeling, transformer-based language modeling, and learning dynamics.</p><p class="paragraph" style="text-align:left;">Results show that The AI Scientist can generate <b>hundreds of research papers</b> at a surprisingly low cost (approximately $15 per paper), with some papers achieving scores that exceed the acceptance threshold for top ML conferences according to an automated reviewer. The framework demonstrates the potential for AI to significantly accelerate scientific progress and lower barriers to entry in AI research.</p><h3 class="heading" style="text-align:left;" id="boosting-llm-decision-making-in-int">Boosting LLM Decision-Making in Interactive Environments Using AgentQ</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXed4yJqjUqBbxiPH--CgwzGS8BzIvr9BORmoC3WFmPapHNDVCenkXsBW6EHB10DrEennoSBeQE7dmZYtXGW7hGxm8r36FVdeyuF5S6XVFJhla__yMw4HYMGvgTEoMl-Qm2DFx3Rwh8SWej8dg7upm1Tacit?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>Example of an input and output to the Agent. 
<a class="link" href="https://arxiv.org/pdf/2408.06292v2?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">MultiOn and Stanford University researchers introduced Agent Q - a framework that aims to improve the reasoning and decision-making capabilities of LLMs in interactive, <b>multi-step environments like web navigation.</b> </p><p class="paragraph" style="text-align:left;">It tackles the issue of generalizing LLMs to <b>agentic tasks</b>, where they need to understand how their actions affect the environment and make complex decisions over multiple steps.</p><p class="paragraph" style="text-align:left;">Agent Q combines guided Monte Carlo Tree Search (MCTS) with a <b>self-critique mechanism</b> and iterative fine-tuning using an off-policy variant of Direct Preference Optimization (DPO). The framework uses AI feedback and self-criticism to guide search steps, and learns from both successful and unsuccessful trajectories through offline reinforcement learning.</p><p class="paragraph" style="text-align:left;">Drastic improvements were seen on the WebShop benchmark and real-world booking scenarios. 
For example, it boosts a Llama-3 70B model&#39;s zero-shot performance from an 18.6% to an 81.7% success rate on a <b>real-world reservations booking website</b> after a single day of data collection, and further to 95.4%.</p><p class="paragraph" style="text-align:left;">It’s worth noting that the approach used by Agent Q was also used by <b>Salesforce </b>to achieve 55% on SWE-Bench Lite - a benchmark used to test how well an AI model can solve GitHub issues, with the Lite version being a slightly easier benchmark than the original SWE-Bench.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXc70Co8UeBe-1ByDFkGgTTz-vmvkiJQtXM1thOmjAbwUcWnwhaAjNvMed0SPJ3VrxAD87m2RK2wSvm0hCbFLmIKd727Mt3JYOakaxjN3hiVwrx7uEjpM1Qpu3T0UWtrzdtQNVT-EJWu6pPIBjzyzDQ2g-A?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>Salesforce also used a similar approach. <a class="link" href="https://x.com/rm_rafailov/status/1823892301837746623?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><h3 class="heading" style="text-align:left;" id="deep-seek-advances-automated-math-r">DeepSeek Advances Automated Math Reasoning</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdejNO9sBc7NU3tv94nCXz_1zRa3_Cwxum50s1oNnCu620ybQ0uriOKkFcmTwI7L3Yc3fEkPfJjX8QP7DfnOkn79lP5uUW6n9zYxg3pAtqctezFCkcjjwknuvKT2_75eLVnT5dSemHOdp8JAg3Qm_UvjpEN?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>DeepSeek-Prover-V1.5 framework overview. 
<a class="link" href="https://arxiv.org/pdf/2408.08152?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">DeepSeek-Prover-V1.5 is an open-source language model designed for theorem proving in Lean 4 - an area where automating complex mathematical reasoning remains tricky for language models. The model is a step up from its predecessor, <b>optimizing both training and inference processes</b> to improve performance on formal theorem proving tasks.</p><p class="paragraph" style="text-align:left;">Some of the key components include:</p><ul><li><p class="paragraph" style="text-align:left;">Large-scale mathematical pre-training</p></li><li><p class="paragraph" style="text-align:left;">Formal mathematics corpus construction and augmentation</p></li><li><p class="paragraph" style="text-align:left;">Online reinforcement learning from proof assistant feedback</p></li><li><p class="paragraph" style="text-align:left;">Novel Monte-Carlo tree search methodology for long-term planning in theorem proving </p></li></ul><p class="paragraph" style="text-align:left;">DeepSeek-Prover-V1.5 uses a combination of supervised fine-tuning, reinforcement learning, and a new variant of Monte-Carlo tree search called RMaxTS, which employs an <b>intrinsic-reward-driven exploration strategy</b> to generate diverse proof paths.</p><p class="paragraph" style="text-align:left;">Results show significant improvements over DeepSeek-Prover-V1, achieving SOTA results on the test set of the <b>high school level miniF2F benchmark</b> (63.5%) and the undergraduate level ProofNet benchmark (25.3%).</p><h3 class="heading" style="text-align:left;" id="efficient-upcycling-of-dense-models">Efficient Upcycling of Dense Models Into MoE With BAM</h3><div class="image"><img alt="" 
class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXd_ci7sMX7o1-4DthQD5xXpOVrq4DszXzcde9TX3SghnPmFzppBVx0RfPF9XxpWD0iy51blD8ZPeNj_JybIUZav55ut3MCq7QlhiXvNCemFIpQaCXyoFhmgbWcaVOQx2fSMOL59LAyzg7b97XPFwFxeIJFO?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>The three phases of BAM. <a class="link" href="https://arxiv.org/pdf/2408.08274?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">BAM (Branch-Attend-Mix) is a new approach to efficiently upcycle pre-trained dense models into Mixture of Experts (MoE) models. Initializing MoEs is tough since they’re <b>computationally expensive</b> to train from scratch, so BAM makes full use of specialized dense models&#39; parameters to tackle this problem.</p><p class="paragraph" style="text-align:left;">BAM operates in three phases: </p><ol start="1"><li><p class="paragraph" style="text-align:left;">Branching (creating copies of a pre-trained seed model)</p></li><li><p class="paragraph" style="text-align:left;">Continued pre-training (specializing each copy on different domains)</p></li><li><p class="paragraph" style="text-align:left;">Mixture model training (initializing MoE layers using the specialized models)</p></li></ol><p class="paragraph" style="text-align:left;">It introduces a soft variant of Mixture of Attention (MoA) layers and employs a <b>parallel attention transformer architecture </b>to improve efficiency.</p><p class="paragraph" style="text-align:left;">BAM consistently outperforms baseline methods in both perplexity and downstream task performance across various domains, with experiments conducted on seed models ranging from <b>590 million to 2 billion parameters</b>. 
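</p><p class="paragraph" style="text-align:left;">The three phases above can be sketched with toy integer weights. Everything below (the dict-of-lists “model”, the domain names, the per-domain weight shifts) is an illustrative stand-in, not BAM’s actual training code:</p>

```python
import copy

# Toy "dense seed model": named parameter blocks as integer lists.
seed_model = {"attn": [1, 2], "ffn": [3, 4]}

# Phase 1 - Branching: one full copy of the seed model per target domain.
domains = ["code", "math", "law"]
branches = {d: copy.deepcopy(seed_model) for d in domains}

# Phase 2 - Continued pre-training: specialize each branch on its domain.
# (Stubbed here as a per-domain shift; BAM does real domain pre-training.)
SHIFT = {"code": 10, "math": 20, "law": 30}

def continue_pretraining(model, domain):
    return {name: [w + SHIFT[domain] for w in block]
            for name, block in model.items()}

branches = {d: continue_pretraining(m, d) for d, m in branches.items()}

# Phase 3 - Mixture model training: initialize the MoE's expert FFN *and*
# expert attention layers (the soft MoA idea) from the specialized
# branches, then train the combined model jointly.
moe_init = {
    "ffn_experts": {d: branches[d]["ffn"] for d in domains},
    "attn_experts": {d: branches[d]["attn"] for d in domains},
}
```

<p class="paragraph" style="text-align:left;">Unlike standard upcycling, which typically reuses only the FFN weights, BAM also seeds attention experts from each branch’s attention weights. 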
It’s a big step forward in MoE initialization, so we might see more efficient training of large-scale language models with superior performance in the future.</p><h2 class="heading" style="text-align:left;" id="frameworks-we-love">Frameworks We Love</h2><p class="paragraph" style="text-align:left;">Some frameworks that caught our attention include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.12601?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">DreamCinema</a>: Simplifies film creation by using generative AI to automate the production of 3D characters and cinematic elements</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.12579?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">RuleAlign</a>: Enhances LLMs like GPT-4 for medical diagnostics by aligning them with specific diagnostic rules</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.12525?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">PCGRL+</a>: Designed to train AI agents to generate game levels based on specific quality metrics.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2405.14755?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">SigLLM</a>: Uses large language models for time series anomaly detection by converting time series 
data to text</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://ewrfcas.github.io/MVInpainter/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">MVInpainter</a>: Reformulates 3D editing as a multi-view 2D inpainting task, enabling novel view synthesis and editing for in-the-wild scenes without relying on explicit camera poses. </p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.08067?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">RAGchecker</a>: A comprehensive evaluation framework for Retrieval-Augmented Generation (RAG) systems that incorporates diagnostic metrics for both retrieval and generation modules. </p></li></ul><p class="paragraph" style="text-align:left;">If you want your framework to be featured here, reply to this email and say hi :) </p><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">Wolfram’s <b>highly detailed discussion</b> about his perspective on ML took the spotlight. Although there was a big focus on the theoretical aspects, it’s still useful to consider as it can provide a different perspective for practical applications of ML. </p><p class="paragraph" style="text-align:left;">While the strange Strawberry account was gaining all that attention, there was a pretty notable advancement that slipped under the radar with rStar. 
Moreover, the Y Combinator CEO let AI startups in on a little secret about how they can quickly build trust with customers using <b>golden magic demos.</b></p><h3 class="heading" style="text-align:left;" id="from-randomness-to-intelligence-wol">From Randomness to Intelligence: Wolfram&#39;s New Perspective on ML</h3><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeK8g8MlU1N6_seKICsLU-jKWvkHqKAG9SRzI6V5t_76zZJvZQVRDamneS7JDisGFhy5lb-67fd_yINK73w4n149dm22izmkdJXrb8q3OqRh0cyEiK2iUkBQIGFxVp3T9dn9cvYn_Zf3ni7LRwGf3lTPgkF?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Wolfram’s discussion about machine learning. <a class="link" href="https://x.com/stephen_wolfram/status/1826692234554875979?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">A recent blog post by Stephen Wolfram caught our attention this week, offering intriguing insights into the <b>fundamental workings of machine learning systems</b>. Wolfram explores minimal models that capture the essence of machine learning, stripping away complexities to reveal core principles.</p><p class="paragraph" style="text-align:left;">Wolfram introduces <b>simple, visualizable</b> models like &quot;rule arrays&quot; that can perform machine learning tasks. These models suggest that machine learning works by &quot;sampling&quot; from the computational universe rather than building structured mechanisms. 
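</p><p class="paragraph" style="text-align:left;">That “sampling” idea can be illustrated with a tiny adaptive-evolution loop in the spirit of rule arrays (the XOR target and the mutate-and-keep rule below are our own illustrative stand-ins, not Wolfram’s actual models): it fits a function purely by keeping random mutations that don’t make things worse, without constructing any structured mechanism:</p>

```python
import random

random.seed(0)

# Toy "rule array": a lookup table from 2-bit inputs to 1-bit outputs.
target = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}  # XOR, for illustration
rules = {k: 0 for k in target}                          # all-zero start

def loss(r):
    # Number of inputs where the rule table disagrees with the target.
    return sum(r[k] != target[k] for k in target)

# Adaptive evolution: flip one random entry and keep the flip only if the
# loss does not get worse. No gradients, no structure - just sampling.
while loss(rules) > 0:
    k = random.choice(list(rules))
    mutated = dict(rules)
    mutated[k] ^= 1
    if loss(mutated) <= loss(rules):
        rules = mutated
```

<p class="paragraph" style="text-align:left;">The trained table works, but nothing in the process built an interpretable mechanism - it simply found a configuration that happens to fit, which is Wolfram’s point. 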
</p><p class="paragraph" style="text-align:left;">He also argues that the power of machine learning comes from leveraging computational irreducibility as a &quot;natural resource.&quot; Moreover, Wolfram mentions that we’re at a point where we can achieve notable results with machine learning techniques like neural networks, but we don’t truly understand how we’re able to get such results. </p><h3 class="heading" style="text-align:left;" id="the-overlooked-ai-reasoning-breakth">The Overlooked AI Reasoning Breakthrough Amid Strawberry Hype</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe5O43rx9OUlj0yHg2GXoBH09yNv6to8-p8x9cwLxFeH7yaQiWHu9R2feHvuxLEplbU5WCMBYCUi1OoEJSXqJDphUq-dj96iVGeCwVo3bm236XNqQ4d3JUTbwgBA1wgzvU5fuiOBhlhji4kxOXqCuKcln2e?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>Tekparmak brought attention to an overlooked advancement. <a class="link" href="https://x.com/AtakanTekparmak/status/1823776878747877572?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Amidst all the Strawberry hype, a thread by Tekparmak discussing a new approach called rStar caught our attention. This method uses a <b>generator LLM</b> to create solution trajectories and a discriminator LLM for &quot;peer review,&quot; reminiscent of the GAN architecture.</p><p class="paragraph" style="text-align:left;">rStar has demonstrated superior performance among multi-round self-improving approaches for AI reasoning tasks. 
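</p><p class="paragraph" style="text-align:left;">In rough pseudocode (with trivial stand-ins for both models - the hard-coded candidate trajectories and the fixed discriminator answer are purely illustrative), the generate-then-verify loop looks like this:</p>

```python
def generator_llm(question):
    # Stand-in for the generator LLM, which uses MCTS to propose several
    # candidate reasoning trajectories; here they are simply hard-coded.
    return [("path A", 41), ("path B", 42), ("path C", 42),
            ("path D", 43), ("path E", 42)]

def discriminator_llm(question, partial_path):
    # Stand-in for the discriminator LLM: it completes the partial
    # trajectory independently and returns its own answer.
    return 42

def rstar_answer(question):
    candidates = generator_llm(question)
    # "Peer review": keep trajectories where the discriminator's independent
    # completion agrees with the generator's answer, then majority-vote.
    agreed = [ans for path, ans in candidates
              if discriminator_llm(question, path) == ans]
    return max(set(agreed), key=agreed.count) if agreed else None
```

<p class="paragraph" style="text-align:left;">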
This means that compared to other methods that use <b>multiple iterations</b> or rounds to improve their performance, rStar is currently achieving the best results.</p><p class="paragraph" style="text-align:left;">Interestingly, base model generators perform well with instruct discriminators, and the gap between GPT-4 and <b>smaller open-weight models</b> like Phi-3 mini is not as significant as one might expect.</p><p class="paragraph" style="text-align:left;">While breakthroughs like OpenAI&#39;s Project Strawberry generate buzz, there&#39;s still ample room for innovative approaches using<b> existing models and techniques.</b></p><h3 class="heading" style="text-align:left;" id="how-ai-startups-can-win-with-powerf">How AI Startups Can Win With Powerful Demos</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdiYrQB1zWJ-ZrYL7u7L_g04xAdAWB8K_1HM5cYNqBK3JqHMLzcqSjSZGdgPEMJWdi95rIQS10vKAiBuO_j_GYlwDng7xQepY3kVh2kt7yR-COkvjEXv1B3I6E7FJqCegpP2mqNRRF6h5u03c5ULMywMJs?key=ABXGbecGfkiQdR3IaqjByw"/><div class="image__source"><span class="image__source_text"><p>Tan’s discussion about the power of a golden magic demo. <a class="link" href="https://x.com/garrytan/status/1823437129323868484?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source) </a></p></span></div></div><p class="paragraph" style="text-align:left;">Y Combinator CEO Garry Tan raised an interesting point about just how important a golden magic demo is for<b> AI startups</b>. But why is that?</p><p class="paragraph" style="text-align:left;">He mentions that a golden magic demo shows the benefit and value of the product right away, indicating how customers could accomplish days’ worth of work in just 10 minutes. 
Repetitive tasks are a massive pain point customers face, so showing how AI can automate these tasks and <b>boost productivity</b> is a great way for startups to immediately show their solution can solve an important problem.</p><p class="paragraph" style="text-align:left;">Casetext was used to highlight this point, with the example of AI being able to quickly detect nuances in emails for lawyers to use as evidence for potential fraud. Their golden magic demo showed how<b> lawyers could save a ton of time</b> with AI, so they could quickly see the power of Casetext’s solution by the end of the demo.</p><p class="paragraph" style="text-align:left;">Tan also mentioned that a successful golden magic demo is something that Y Combinator sees in a lot of successful LLM-based startups, which makes it pretty <b>encouraging</b> for other startups to do the same.</p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in AI</h2><p class="paragraph" style="text-align:left;">We saw three successful funding rounds for Story, Cursor, and Defcon - all of which were under <b>$100 million</b>. Interestingly, each company focuses on pretty different applications in AI: Story in blockchain, Cursor in coding, and Defcon in military logistics.</p><p class="paragraph" style="text-align:left;">Opkey and Elise AI also had successful funding rounds, with Opkey raising $47 million and Elise AI raising $75 million. 
On the other hand, AMD decided to challenge Nvidia’s chip dominance by acquiring <b>ZT Systems for $5 billion.</b></p><h3 class="heading" style="text-align:left;" id="amd-acquires-zt-systems-for-5-b-to-">AMD Acquires ZT Systems for $5 Billion to Challenge Nvidia</h3><p class="paragraph" style="text-align:left;">AMD has made a bold move in the AI chip market by agreeing to acquire ZT Systems, a New Jersey-based server maker, for nearly <a class="link" href="https://www.linkedin.com/news/story/amd-deal-takes-aim-at-nvidia-6125252/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">$5 billion</a> in cash and stock. This acquisition, which combines AMD&#39;s silicon and software capabilities with ZT&#39;s systems expertise, aims to accelerate the deployment of AMD-optimized data center AI solutions at <b>scale for cloud and enterprise customers.</b></p><h3 class="heading" style="text-align:left;" id="story-raises-80-million-in-series-b">Story Raises $80 Million in Series B Funding Round</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/08/21/story-raises-83m-at-a-2-25b-valuation-to-build-a-blockchain-for-the-business-of-content-ip-in-the-age-of-ai/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Story</a>, a startup building a <b>blockchain-based</b> platform for IP tracking and monetization in the age of AI, has secured $80 million in Series B funding led by Andreessen Horowitz&#39;s crypto division, with participation from Polychain Capital and other notable investors. The round values Story at $2.25 billion post-money and brings its total funding to $143 million. 
</p><h3 class="heading" style="text-align:left;" id="elise-ai-raises-75-million-in-serie">Elise AI Raises $75 Million in Series D Funding</h3><p class="paragraph" style="text-align:left;">EliseAI, a startup developing AI-powered property management tools, has secured a <a class="link" href="https://techcrunch.com/2024/08/14/eliseais-chatbots-for-property-owners-nets-it-75m-in-funding/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">$75 million </a>Series D funding round, valuing the company at $1 billion. Led by Sapphire Ventures with participation from Point72 Private Investments, Divco West, Navitas Capital, and Koch Real Estate Investments, this investment brings EliseAI&#39;s total funding to <b>$140 million.</b></p><h3 class="heading" style="text-align:left;" id="cursor-secures-60-million-in-series">Cursor Secures $60 Million in Series A Funding</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.cursor.com/blog/series-a?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">Cursor</a>, an AI-powered <b>coding</b> tool startup, has raised $60 million in Series A funding from prominent investors including Andreessen Horowitz, Thrive Capital, OpenAI, and notable tech founders. The company, which aims to create a &quot;magical tool&quot; for writing the world&#39;s software, has grown to over 30,000 customers across major enterprises, research labs, and startups. 
</p><p class="paragraph" style="text-align:left;">Cursor has even started to go <b>viral on X</b> off the back of a video showing how an 8-year-old can use it to start coding.</p><div class="image"><img alt="" class="image__image" style="border-radius:0px 0px 0px 0px;border-style:solid;border-width:0px 0px 0px 0px;box-sizing:border-box;border-color:#E5E7EB;" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeFZZtSIoX8TkskhsfArIh7TD6sSayTMXORNV5mISCQyFwQ_yPGpUq8NUBQvGAAMtTURwQzw0m-qlnInkevV4W8gg5u8ly2uFXE-GeKYHTNgSEwD7p0kJr2w_LwX8WRXh8zjjEiOJfbJeG52JHd86YhEeE?key=c8FDumnflB67qhwj04PXMQ"/><div class="image__source"><span class="image__source_text"><p>Cursor is making coding more accessible. <a class="link" href="https://x.com/rickyrobinett/status/1825581674870055189?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><h3 class="heading" style="text-align:left;" id="opkey-raises-47-million-in-series-b">Opkey Raises $47 Million in Series B Funding</h3><p class="paragraph" style="text-align:left;">Opkey, an AI-powered continuous test automation platform for enterprise systems, has secured <a class="link" href="https://www.business-standard.com/companies/news/opkey-raises-47-million-in-series-b-funding-led-by-peakspan-capital-124082200870_1.html?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=august-in-ai-grok-2-gpt-4-turbo-new-sota-for-text-to-image-video-gen" target="_blank" rel="noopener noreferrer nofollow">$47 million in Series B</a> funding led by PeakSpan Capital, with participation from existing investors. 
The funding will fuel Opkey&#39;s mission to accelerate product development and <b>expand its global market presence.</b></p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=543fc6de-db2e-4f1c-86c5-2f0461c97949&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>🍓 in Fall, Magic&#39;s 100M Context, Alexa + Claude = ❤️</title>
  <description>Plus, Announcing RetrieveX Conference Tickets &amp; Discount</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/1b3edfa4-c938-4baa-84ee-e7b00b5813b4/mikayelh_3d_isometric_illustration_of_white_and_orange_cubes__f852dc6a-915e-45db-b701-312725ac81c0_1__2_.png" length="967789" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/strawberry-conf</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/strawberry-conf</guid>
  <pubDate>Tue, 03 Sep 2024 16:33:07 +0000</pubDate>
  <atom:published>2024-09-03T16:33:07Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Before we start, share last week&#39;s news with a friend or a colleague:</p><h2 class="heading" style="text-align:left;" id="key-takeaways">Key Takeaways</h2><ul><li><p class="paragraph" style="text-align:left;">OpenAI&#39;s Project Strawberry, set for potential release in fall 2024, aims to drastically <b>advance AI reasoning capabilities </b>and address current limitations of GPT-4, including complex multi-step problems and hallucinations.</p></li><li><p class="paragraph" style="text-align:left;">Magic developed <b>LTM</b> (Long-Term Memory) models capable of reasoning on up to 100M tokens of context during inference, equivalent to about 10 million lines of code or 750 novels.</p></li><li><p class="paragraph" style="text-align:left;">Jina AI revealed a <b>&quot;modality gap&quot;</b> in multimodal AI models, where text embeddings and image embeddings cluster in separate parts of the semantic space.</p></li><li><p class="paragraph" style="text-align:left;">Nvidia released <b>Eagle</b>, a family of vision-centric high-resolution multimodal LLMs that uses a channel-concatenation-based fusion and showed impressive performance on various multimodal benchmarks like GQA and MMMU.</p></li><li><p class="paragraph" style="text-align:left;">Google DeepMind used <b>weaker but cheaper </b>models for generating synthetic training data, achieving up to 31.6% relative gains compared to strong but expensive (SE) models.</p></li></ul><p class="paragraph" style="text-align:left;"><i>Got forwarded this newsletter? 
Subscribe below👇</i></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/subscribe?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude"><span class="button__text" style=""> Subscribe </span></a></div><div class="section" style="background-color:#FFFFFF;border-color:#ff8a00;border-radius:2px;border-style:solid;border-width:2px;margin:10.0px 10.0px 10.0px 10.0px;padding:10.0px 10.0px 10.0px 10.0px;"><h2 class="heading" style="text-align:left;"><span style="color:#222222;">Announcing RetrieveX Conference on Oct 17 in San Francisco. 25% OFF For the Next 3 Days</span></h2><p class="paragraph" style="text-align:left;"><span style="color:#222222;">Executive in GenAI? Join RetrieveX, the top conference in retrieval for GenAI. Exclusively for those building high-accuracy, multimodal workflows, featuring leaders from Microsoft AI, YC, Bayer Radiology, Matterport, Cresta, as well as the creators of PyTorch and KubeFlow.</span></p><p class="paragraph" style="text-align:left;"><span style="color:#222222;">Check out with promo code </span><span style="color:#222222;"><b>LABORDAY25</b></span><span style="color:#222222;"> for 25% off (valid for the next three days). 
Prices are going up by the end of the week, so secure your spot sooner rather than later.</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://www.retrievex.co/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude"><span class="button__text" style="color:#F9FAFB;"> Get Tickets Today </span></a></div><p class="paragraph" style="text-align:left;"><span style="color:#222222;">Date: October 17, 10:30am - 7pm PT</span><br><span style="color:#222222;">Venue: The Midway, 900 Marin St, San Francisco</span><br><span style="color:#222222;">Attendees: 300 AI executives</span></p></div><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">There were quite a few model releases, ranging from <b>SLMs to text-to-image</b> models. Moreover, a chatbot arena update saw Grok-2 ranked very highly, which means xAI might take the top spot very soon.</p><p class="paragraph" style="text-align:left;">Not to mention that we heard some news about Project Strawberry and that OpenAI is struggling to keep up with expenses (despite having <a class="link" href="https://www.reuters.com/technology/artificial-intelligence/openai-says-chatgpts-weekly-users-have-grown-200-million-2024-08-29/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">200 million users)</a> due to inference costs. 
There are currently talks of a new funding round that would put them at a valuation of over <b>$100 billion.</b></p><p class="paragraph" style="text-align:left;">Moreover, OpenAI and Apple are set to collaborate on<a class="link" href="https://openai.com/index/openai-and-apple-announce-partnership/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow"> Siri (but not in Europe),</a> and Amazon is planning a similar play with Anthropic by releasing a (subscription) version of Alexa that is powered by <b>Claude</b>. <b>AI is the new streaming, folks.</b></p><h3 class="heading" style="text-align:left;" id="open-a-is-project-strawberry-might-">OpenAI’s Project Strawberry Might Release in Fall, Struggles With Expenses Despite Potential Funding Round, and California Bill SB 1047 Passed</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfF6rXdb1sAhYT60g0bHFFyxQl1JkVGQZz76_WHzQ1kocTedZxyKCSUyJStvE6WbcuTiIqSOMa9SDfWxKAuq7uJF_ThpFXRaZ-E-ygomNSFIyPJn-UlFCKq-2seym_3PsXzQZMnu3EvmZFlQG9Uvxy60WmX?key=_xEjQoqWJ8ujpFPME8rVWw"/><div class="image__source"><span class="image__source_text"><p>We might see Project Strawberry release relatively soon.<a class="link" href="https://www.newsweek.com/opeanai-project-strawberry-chatgpt-next-gen-ai-model-1945977?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow"> (Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Project Strawberry is reportedly set for a <a class="link" href="https://www.newsweek.com/opeanai-project-strawberry-chatgpt-next-gen-ai-model-1945977?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">potential release in 
fall 2024.</a> As we talked about before, Strawberry has been rumored to drastically advance AI reasoning capabilities and even address the <b>current limitations</b> of <a class="link" href="https://www.reuters.com/technology/artificial-intelligence/openai-says-chatgpts-weekly-users-have-grown-200-million-2024-08-29/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">GPT-4. </a></p><p class="paragraph" style="text-align:left;">This includes issues like complex, multi-step problems and hallucinations that have posed problems for <b>GPT-4 in the past. </b></p><p class="paragraph" style="text-align:left;">OpenAI needs <span style="text-decoration:line-through;">Watermelon</span> Strawberry sugar, badly.<br><br>With 200 million MAUs, OpenAI faces steep inference costs. In 2024, the company is projected to incur expenses of <a class="link" href="https://www.theinformation.com/articles/why-openai-could-lose-5-billion-this-year?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">approximately $8.5 billion</a>. This is largely thanks to Microsoft&#39;s pricing structure, which charges OpenAI about $10.30 per hour for an eight-GPU server, compared to higher public rates (though this still comes to about $4B). Simultaneously, <b>new models are creeping up in terms of accuracy (hello, 350M+ lifetime downloads of Llama 3.1)</b>. 
Hence, OpenAI just needs to put out something impressive that would blow the competition out of the water, OR raise another funding round to cover the costs of compute.<br><br>I guess Sam Altman plans to do both, as the <b>next funding round </b>might put them at <a class="link" href="https://techcrunch.com/2024/08/29/apple-and-nvidia-could-be-openais-next-big-investors/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">$100 billion</a>, with Nvidia and Apple in talks to be a part of it.</p><h2 class="heading" style="text-align:left;" id="california-bill-sb-1047-ab-3211">California Bill SB 1047 & AB 3211</h2><p class="paragraph" style="text-align:left;">Meanwhile, California bill <a class="link" href="https://techcrunch.com/2024/08/15/california-weakens-bill-to-prevent-ai-disasters-before-final-vote-taking-advice-from-anthropic/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">SB 1047 was passed.</a> It’s a bill designed to prevent “disasters” caused by <b>AI systems before they occur</b>. This refers to serious events that would cause global issues, like using AI to cause a cyberattack that would result in more than $500 million in damages.</p><p class="paragraph" style="text-align:left;">It wouldn’t apply to every AI model, though - only the ones that are considered <b>large enough</b>, like GPT-4. 
<a class="link" href="https://techcrunch.com/2024/08/26/elon-musk-unexpectedly-offers-support-for-californias-ai-bill/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">The bill was supported by Musk,</a> who has been vocal about how the potential dangers of AI need to be monitored, despite the fact that xAI would also be affected by the bill’s requirements.</p><p class="paragraph" style="text-align:left;">Additionally, OpenAI, Adobe, and Microsoft have expressed support for another California bill called <a class="link" href="https://techcrunch.com/2024/08/26/openai-adobe-microsoft-support-california-bill-requiring-watermarks-on-ai-content/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">AB 3211</a>. It requires tech companies to <b>label AI-generated </b>content. This support is evidenced by letters from the companies viewed by TechCrunch, marking a shift from their previous opposition.</p><p class="paragraph" style="text-align:left;">AB 3211 mandates watermarks in the metadata of AI-generated photos, videos, and audio clips. 
While many AI companies already implement this practice, the bill goes further by requiring<b> large online platforms</b> like Instagram or X to label AI-generated content in a way that’s easily understandable to average viewers.</p><h3 class="heading" style="text-align:left;" id="magics-100-million-context-window-a">Magic’s 100 Million Context Window and Nvidia’s Eagle</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe9hXp5zgIdl0KUPh-M-SSHxPzXadS9BOtW1rf_7DGGa208rlz4gnQ8tHAsqju_-bxTHTgUxrt2U0VGOqGRDFwLkH6AEAhgMoxKZ5yuRRghYKooz7wolmDY0zaEv4y0R9tkv8Bo61Wnn2t5v6uYmy-3zes?key=_xEjQoqWJ8ujpFPME8rVWw"/><div class="image__source"><span class="image__source_text"><p>Current long context evals like Needle In A Haystack have various limitations. <a class="link" href="https://magic.dev/blog/100m-token-context-windows?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Magic has <a class="link" href="https://magic.dev/blog/100m-token-context-windows?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">developed LTM (Long-Term Memory) </a>models capable of reasoning on up to 100M tokens of context during inference, equivalent to about <b>10 million lines of code or 750 novels</b>. 
Their LTM-2-mini model is significantly more efficient in computation and memory usage compared to traditional models like Llama 3.1 405B.</p><p class="paragraph" style="text-align:left;">They also introduced <b>HashHop</b>, a new evaluation method for long-context models that eliminates semantic hints and requires models to store and retrieve maximum information content, addressing flaws in current evaluation techniques like Needle In A Haystack.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/NVlabs/Eagle?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">Eagle </a>was another release we saw - a family of vision-centric high-resolution multimodal LLMs that uses channel-concatenation-based fusion. The models support input resolutions over <b>1,000 pixels</b> and perform strongly on multimodal benchmarks, especially resolution-sensitive tasks like OCR and document understanding.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfKM9bOwHAM4vF8DwgEJrOW_z0fBcWK60-cWr2ojIoAU2uWLuWqUqKr9ODaRyw3DcTH-lFe7hiRrCT7XNXfaFHTiEDEZPWzr6_8AUeEtRmX77pNOf2gSAcahEMgsrhqnXOEQkJHwBd5wANTjaGfDAThy8g?key=_xEjQoqWJ8ujpFPME8rVWw"/><div class="image__source"><span class="image__source_text"><p>Eagle showed impressive performance on multimodal benchmarks. 
<a class="link" href="https://github.com/NVlabs/EAGLE/blob/main/assets/fig-teaser.jpg?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">(Source) </a></p></span></div></div><p class="paragraph" style="text-align:left;">The Eagle family includes multiple variants such as Eagle-X4 and Eagle-X5 models with 7B and 13B parameters, built on <b>Vicuna language models</b> and LLaVA-v1.5 pretraining. </p><p class="paragraph" style="text-align:left;">The project is actively developing, with plans for models trained on larger and more diverse datasets, evaluation code, and vision encoder model weights with pre-alignment. An <b>online demo </b>of Eagle-X5-13B-Chat is available, and the project has achieved recognition, winning 2nd place in a CVPR24 Challenge on Driving with Language.</p><h3 class="heading" style="text-align:left;" id="new-trl-update-released">New TRL Update Released </h3><p class="paragraph" style="text-align:left;">The <a class="link" href="https://github.com/huggingface/trl/releases/tag/v0.10.1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">TRL v0.10.1</a> update just dropped with some pretty b</p><p class="paragraph" style="text-align:left;">One of which is that models can now be trained with DeepMind’s Online DPO. It’s an alignment method called <b>OnlineDPO</b> that generates data on the fly, eliminates the need for pre-collected preference datasets, and yields better results than traditional DPO.</p><p class="paragraph" style="text-align:left;">It also added support to align <b>vision-language models</b> (LLaVa-1.5, PaliGemma, and Idefics2) with DPO. 
DPO was previously used for text-only language models, so this update means DPO can now be applied to vision-language models as well.</p><p class="paragraph" style="text-align:left;">The update allows for the integration of <b>Liger Triton kernels</b>, which leads to lower memory usage and faster throughput in training. This might allow for training larger models or using smaller hardware. </p><h3 class="heading" style="text-align:left;" id="open-ai-and-patched-collaborate-on-">OpenAI and Patched Collaborate on Static Analysis Evaluation Benchmark</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdJPU-71ZAUU57waCtDfm27kBP9d5jLHJYRS8XaMeLMRiEzsrEcSl_shPa9_p4nwKhmCcdQAQqNWQxFmfJr7Uuly1sutb8vdWPCrUkKbqF1wIacIXFN0H03_SXZzh6ldWpyLdiIulfhvly7SxKLSe6KGSl1?key=_xEjQoqWJ8ujpFPME8rVWw"/><div class="image__source"><span class="image__source_text"><p>How various models performed on the Static Analysis Evaluation benchmark. <a class="link" href="https://www.patched.codes/blog/the-static-analysis-evaluation-benchmark-measuring-llm-performance-in-fixing-software-vulnerabilities?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">OpenAI collaborated with Patched to fine-tune GPT-4o for the <a class="link" href="https://www.patched.codes/blog/the-static-analysis-evaluation-benchmark-measuring-llm-performance-in-fixing-software-vulnerabilities?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">Static Analysis Evaluation Benchmark.</a> It’s designed to <b>assess LLMs&#39; performance</b> in fixing software vulnerabilities. 
The new version features more challenging instances with larger sample sizes (512-1024 tokens) and increased difficulty.</p><p class="paragraph" style="text-align:left;">The benchmark methodology involved:</p><ul><li><p class="paragraph" style="text-align:left;">Scanning top Python repositories on GitHub</p></li><li><p class="paragraph" style="text-align:left;">Filtering for file size</p></li><li><p class="paragraph" style="text-align:left;">Verifying vulnerabilities using Semgrep</p></li><li><p class="paragraph" style="text-align:left;">Curating a dataset representing real-world vulnerabilities in popular open-source projects</p></li></ul><p class="paragraph" style="text-align:left;">Results show that <b>combining techniques</b> like few-shot prompting, RAG, and fine-tuning leads to improved performance. Fine-tuned models consistently outperform base models, and larger models generally perform better.</p><h3 class="heading" style="text-align:left;" id="ibm-cloud-becomes-first-cloud-custo">IBM Cloud Becomes First Cloud Customer for Gaudi AI and Cerebras’ New AI Processor</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdY6_HfE5zmMRrrBIYTZSuRC_awFYH3_9av8HpOO1iUwZxp1BPwIyL8oW_SgoNJD4S5ir02yvBnOk38zPpLdpWS0k0vlGpMkd_fFYj4JzEp4wk_hjev6b3oOeYq7u1t4mtrFb3g1MFLetq3we6Ui4xmjLZA?key=_xEjQoqWJ8ujpFPME8rVWw"/><div class="image__source"><span class="image__source_text"><p>IBM Cloud is the first cloud customer for Gaudi AI. 
<a class="link" href="https://techcrunch.com/2024/08/29/ibm-cloud-will-offer-intel-gaudi-3-chips-next-year/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Although Softbank ended its partnership with Intel recently, Intel secured <a class="link" href="https://techcrunch.com/2024/08/29/ibm-cloud-will-offer-intel-gaudi-3-chips-next-year/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">IBM Cloud</a> as its first cloud customer for the <b>Gaudi 3 AI accelerator chip</b>. IBM Cloud will offer Gaudi 3 to customers in early 2025 for both hybrid and on-premise environments, and plans to support Gaudi 3 within its Watsonx AI and data platform.</p><p class="paragraph" style="text-align:left;">Intel&#39;s expectations for Gaudi 3 revenue in 2024 are modest at $500 million, significantly lower than AMD&#39;s projected <b>$4.5 billio</b>n from its Instinct MI300-series GPUs and Nvidia&#39;s expected $40 billion from its data center business. Despite Gaudi 3&#39;s high performance-per-dollar, Intel faces challenges in attracting customers away from Nvidia.</p><p class="paragraph" style="text-align:left;">Nvidia&#39;s upcoming Blackwell chip, set for <b>production ramp-up in Q4</b>, poses a significant threat to Intel&#39;s Gaudi 3. 
Blackwell is expected to offer up to four times the performance of the H100, the chip Gaudi 3 is currently compared against, potentially further challenging Intel&#39;s position in the AI chip market.</p><p class="paragraph" style="text-align:left;">A new AI processor was also introduced by <a class="link" href="https://x.com/llama_index/status/1828484874065584263?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">Cerebras</a>. They developed the Wafer-Scale Engine-3 (WSE-3), which they claim is the world&#39;s largest and fastest AI processor. This powers their <b>CS-3 system</b>, a new class of AI supercomputer designed for generative AI training and inference with exceptional performance and scalability.</p><p class="paragraph" style="text-align:left;">CS-3 systems can be quickly clustered to create some of the world&#39;s largest AI supercomputers, simplifying the process of deploying and running very large AI models. This makes it easier for organizations to work with <b>cutting-edge AI </b>at scale.</p><h3 class="heading" style="text-align:left;" id="amazon-hires-covariant-founders-and">Amazon Hires Covariant Founders and Aims to Release New Version of Alexa Powered by Claude</h3><p class="paragraph" style="text-align:left;">Amazon has hired the founders of AI robotics startup <a class="link" href="https://techcrunch.com/2024/08/31/amazon-hires-the-founders-of-robotics-ai-startup-covariant/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">Covariant</a> - Pieter Abbeel, Peter Chen, and Rocky Duan. 
They also hired approximately <b>25% of the company&#39;s employees.</b></p><p class="paragraph" style="text-align:left;">As part of the deal, Amazon has secured a non-exclusive license to use Covariant&#39;s robotic foundation models, which are described as &quot;a large language model, but for robot language.&quot; These models focus on enabling robotic arms to perform <b>common warehouse</b> tasks like bin picking.</p><p class="paragraph" style="text-align:left;">Amazon plans to integrate Covariant&#39;s AI technology into its<b> existing robot fleet </b>to improve performance and create value for customers. This aligns with Amazon&#39;s ongoing efforts to enhance its fulfillment and robotics technologies.</p><p class="paragraph" style="text-align:left;">The deal structure seems similar to the ones we saw a <b>few months ago</b> when <a class="link" href="https://genai360.beehiiv.com/p/of-strawberries-and-models?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">Amazon hired most of Adept’s top employees.</a></p><p class="paragraph" style="text-align:left;">Amazon also plans to release a revamped version of <a class="link" href="https://www.reuters.com/technology/artificial-intelligence/amazon-turns-anthropics-claude-alexa-ai-revamp-2024-08-30/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">Alexa</a> in October, primarily powered by Anthropic&#39;s Claude AI models rather than Amazon&#39;s own AI. This decision was made after initial versions using Amazon&#39;s <b>in-house</b> software struggled with response times and coherence.</p><p class="paragraph" style="text-align:left;">The new &quot;Remarkable&quot; Alexa will be a paid service, costing between <b>$5 and $10 per month</b>, while the current &quot;Classic&quot; Alexa will remain free. 
Amazon hopes this new version will help generate revenue from the currently unprofitable Alexa division.</p><p class="paragraph" style="text-align:left;">The upgraded Alexa is designed to handle <b>more complex queries, </b>carry on conversations that build on prior interactions, provide shopping advice, aggregate news, and perform more complicated tasks like ordering food or drafting emails from a single prompt.</p><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">While we saw some multimodal releases like Eagle and Phi-3 Vision, researchers also introduced a new family of VLMs in a new paper. DeepMind looked into ways of generating high-quality synthetic data. </p><p class="paragraph" style="text-align:left;">Another paper that caught our eye was about the law of next-token predictions, since the black-box nature of LLMs makes it difficult to understand how the model reached its conclusions.</p><h3 class="heading" style="text-align:left;" id="how-cheaper-models-outperform-more-">How Cheaper Models Outperform More Expensive Ones</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfmo5nrohBeFyNcOh3cD8WHHGq7c2uLeLtgXg7ui3mgqIRu3Umq_7g7rbdyRXlK2f3WjYBNk07m6mhk5tSMz99xtibuMoOEGJo9PKIoEVco44SvU9zxXamfQ_FEBdR859FfiHy3fq77XxCtf0N1DJt00Wk9?key=_xEjQoqWJ8ujpFPME8rVWw"/><div class="image__source"><span class="image__source_text"><p>Language models being fine-tuned with Gemma 2 and Gemini 1.5 data. 
<a class="link" href="https://arxiv.org/pdf/2408.16737?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.16737?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">DeepMind </a>decided to challenge the conventional wisdom of using strong but expensive (SE) language models for generating synthetic training data. They investigated whether using weaker but cheaper (WC) models could be more compute-optimal for <b>training LLM reasoners.</b></p><p class="paragraph" style="text-align:left;">They conducted extensive experiments comparing data generated from WC and SE models across<b> three key metric</b>s: coverage, diversity, and false positive rate. They then fine-tuned models on this data in various setups, including knowledge distillation, self-improvement, and a novel &quot;weak-to-strong improvement&quot; paradigm.</p><p class="paragraph" style="text-align:left;">Models trained on WC-generated data consistently outperformed those trained on SE-generated data across multiple benchmarks, with relative gains of up to 31.6%. 
For example, using Gemma2-9B (WC) data instead of Gemma2-27B (SE) data led to 6% <b>higher performance</b> in knowledge distillation and 5.8% in weak-to-strong improvement for math reasoning tasks.</p><h3 class="heading" style="text-align:left;" id="ll-ms-in-sales-and-negotiation-simu">LLMs in Sales and Negotiation: Simulating Human-Like Persuasion</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe3SYO4ykPkEVquFmQeXjcKyFHqfJfkNoJk3IpefpdvtSmrBg5LCB6ucnwjrferODtV0VzdpG8o_Ajy8Fe3T1KtRi4bZEDNf-3qC4ymx7S6scpra6F7LDaSj5G3KVpsXKB-IwKlY7Gbfi65-o-Zy3egRKK-?key=_xEjQoqWJ8ujpFPME8rVWw"/><div class="image__source"><span class="image__source_text"><p>Example workflow of an insurance bot. <a class="link" href="https://arxiv.org/pdf/2408.15879?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers developed a <a class="link" href="https://arxiv.org/pdf/2408.15879?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">multi-agent framework</a> to study the persuasion capabilities of LLMs in various domains such as insurance, banking, and retail. 
This work addresses the challenge of creating AI systems that can engage in persuasive dialogue while <b>dynamically </b>adapting to user resistance and personality types.</p><p class="paragraph" style="text-align:left;">A <b>collaborative approach</b> using multiple AI agents was used, which included:</p><ul><li><p class="paragraph" style="text-align:left;">A primary conversational agent</p></li><li><p class="paragraph" style="text-align:left;">Auxiliary agents for information retrieval and analysis</p></li><li><p class="paragraph" style="text-align:left;">A fact-checking component</p></li></ul><p class="paragraph" style="text-align:left;">They simulated conversations using <b>LLM-generated personas </b>with varying demographics and emotional states, and measured persuasion effectiveness through pre- and post-interaction surveys, as well as user decisions.</p><p class="paragraph" style="text-align:left;">The paper showed that LLMs are capable of both persuading and resisting persuasion effectively. The <b>AI agents demonstrated the ability </b>to create perspective changes in users and influence purchase decisions. That sounds promising, but further work still needs to be done as conversations were terminated due to inadequate information from the sales agent. </p><h3 class="heading" style="text-align:left;" id="the-universal-law-of-llm-learning">The Universal Law of LLM Learning</h3><p class="paragraph" style="text-align:left;">Researchers from the University of Rochester and the University of Pennsylvania have discovered a <a class="link" href="https://arxiv.org/pdf/2408.13442v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">precise and quantitative law</a> governing how <b>LLMs learn contextualized token embeddings</b> for next-token prediction. 
In particular, this paper looks at the issue of understanding the internal data processing mechanisms of LLMs, which have long been considered black boxes.</p><p class="paragraph" style="text-align:left;">They used a wide range of open-source LLMs, including GPT variants, Llama models, and newer architectures like RWKV and Mamba. They also introduced a metric called &quot;<b>prediction residual&quot;</b> (PR) to quantify an LLM&#39;s next-token prediction capability at each layer.</p><p class="paragraph" style="text-align:left;">What came from the results is a universal &quot;law of equi-learning&quot;, where each layer contributes<b> equally </b>to enhancing prediction accuracy, from the lowest to the highest layer. The law emerged consistently across various model architectures, sizes, and pre-training data.</p><h3 class="heading" style="text-align:left;" id="text-2-sql-unifying-ai-and-database">Text2SQL: Unifying AI and Databases With TAG</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXeTBnVjlFtmTA-mjssw7uLEljLivZ6QAQKNfUhXCWb_XchAPhydRwJ4TY27z-dZHBYl4S7hYTWICjIYAHWvGFjgKze38llEj4QvEr458k3loQx1rddixQ1z8r_P3mNfWU65Y5gQyhKHE4BDxvrZXa-jAotu?key=_xEjQoqWJ8ujpFPME8rVWw"/><div class="image__source"><span class="image__source_text"><p>The three stages of the TAG pipeline. <a class="link" href="https://www.arxiv.org/pdf/2408.14717?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers from UC Berkeley and Stanford University have introduced a new paradigm called<b> Table-Augmented Generation (TAG)</b> to address the limitations of current Text2SQL and RAG methods. 
This work tackles the challenge of answering complex natural language questions over databases that require both world knowledge and semantic reasoning.</p><p class="paragraph" style="text-align:left;">They developed a <b>unified framework</b> that combines the strengths of language models (LMs) and database systems. TAG consists of three key steps:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>Query synthesis:</b> Translating natural language requests into executable database queries</p></li><li><p class="paragraph" style="text-align:left;"><b>Query execution</b>: Efficiently computing relevant data using database systems</p></li><li><p class="paragraph" style="text-align:left;"><b>Answer generation:</b> Utilizing LMs to generate final natural language answers</p></li></ol><p class="paragraph" style="text-align:left;">To evaluate TAG, they created a dataset requiring either world knowledge or semantic reasoning. Afterwards, they compared <b>TAG against several baselines</b>, including traditional Text2SQL and RAG approaches.</p><p class="paragraph" style="text-align:left;">Results show that hand-written TAG pipelines consistently achieved <b>40% or better</b> exact match accuracy, significantly outperforming all other baselines which failed to exceed 20% accuracy. 
TAG demonstrated particular strength in comparison queries, with up to 65% accuracy.</p><h2 class="heading" style="text-align:left;" id="frameworks-we-love">Frameworks We Love</h2><p class="paragraph" style="text-align:left;">Some frameworks that caught our attention include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.16768?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">SAM2POINT</a>: Adapts SAM2 for zero-shot and promptable 3D segmentation by interpreting 3D data as multi-directional videos</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.16760?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">OmniRe</a>: Comprehensive 3D Gaussian Splatting framework that reconstructs high-fidelity dynamic urban scenes from on-device driving logs</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.16700?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">GradBias</a>: Uses a combination of LLMs, Text-to-Image generative models, and Vision Question Answering to detect, quantify, and explain biases in image generation.</p></li></ul><p class="paragraph" style="text-align:left;">If you want your framework to be featured here, reply to this email and say hi :) </p><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">One discussion about the <b>modality gap</b> in multimodal models was certainly one to look at, since we don’t see this topic discussed too often. 
In addition, a post about applying HybridRAG to financial document analysis makes us wonder how this new approach might be applied to other domains.</p><h3 class="heading" style="text-align:left;" id="exploring-the-modality-gap-in-multi">Exploring the Modality Gap in Multimodal AI</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdMZY2E0bMzjCbbnzrC3Bf5cDAIR7kij9BiGhiwr5QLRH82kX7yoz2GIlP26dMSjr4JOVQ_SyQgscIhzPgNV-G0mdZ50ODYUQdRNwDptR6NZ8_cDJDSQ0eh36vOc1YKfBEs5J-WZ0mP_6cwfKIbVh_eHkue?key=_xEjQoqWJ8ujpFPME8rVWw"/><div class="image__source"><span class="image__source_text"><p>Jina AI’s discussion about the modality gap.<a class="link" href="https://jina.ai/news/the-what-and-why-of-text-image-modality-gap-in-clip-models/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow"> (Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Jina AI posted an interesting exploration of the &quot;<a class="link" href="https://jina.ai/news/the-what-and-why-of-text-image-modality-gap-in-clip-models/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">modality gap</a>&quot; in multimodal AI models, particularly those using CLIP (Contrastive Language-Image Pretraining). The discussion, centered around research by Jina AI, reveals an unexpected quirk in how these models <b>process and relate text and images.</b></p><p class="paragraph" style="text-align:left;">At first glance, you might assume that a well-trained AI would treat a picture of an apple and the text &quot;an apple&quot; as nearly identical. But surprisingly, that&#39;s not the case. 
The research shows these models tend to cluster text embeddings and image embeddings in separate parts of their semantic space, creating a &quot;<b>gap</b>&quot; between modalities.</p><p class="paragraph" style="text-align:left;">What&#39;s also interesting is how this gap emerges unintentionally. The models are encoding not just the <b>semantic content</b>, but also the medium itself. This happens even after extensive training, so it might be a fundamental aspect of how these models learn to represent information.</p><h3 class="heading" style="text-align:left;" id="overcoming-the-limitations-of-kg-an">Overcoming the Limitations of KG and Vector-Based RAG</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXd7_HSjZM6mSavW20NhGpB0T_nHHfftjZhQEXfuPt5lkvkyvCnUoz_GH_BmnIQddgu2XgxEyeddXOo13EqMYticL02ebcBOeNdiA2Ek_xph2NfM1heqx1RDZO-zkKctv-Zbv0k-bTLKDnH1iuXddjEW4C4?key=_xEjQoqWJ8ujpFPME8rVWw"/><div class="image__source"><span class="image__source_text"><p>AI engineer Rohan Paul brought up the potential of HybridRAG. 
<a class="link" href="https://x.com/rohanpaul_ai/status/1829528588636283152?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">A new approach called <a class="link" href="https://x.com/rohanpaul_ai/status/1829528588636283152?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">HybridRAG</a> is making waves in the field of financial document analysis, combining <b>KG and vector-based RAG.</b></p><p class="paragraph" style="text-align:left;">Financial documents are notoriously difficult for AI systems to parse because of their specialized terminology and intricate formats. Instead, HybridRAG gets the best of both worlds from Knowledge Graph and Vector-based RAG for more comprehensive and a<b>ccurate information retrieval.</b></p><p class="paragraph" style="text-align:left;">Note that while this paper is focusing on financial document analysis, there’s definitely potential for <b>HybridRAG</b> to be used for other applications because of its ability to excel at both extractive and abstractive questions.</p><p class="paragraph" style="text-align:left;">You might be thinking that this sounds great on paper, but does it actually perform well? The answer is yes - HybridRAG outperformed both VectorRAG and GraphRAG<b> across various metrics.</b> As a result, we might see HybridRAG be applied to other domains with complex information structures like legal documents or medical records. 
</p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in AI</h2><p class="paragraph" style="text-align:left;">We saw the success of three funding rounds for Story, Cursor, and Defcon - all of which were under <b>$100 million</b>. Interestingly, each company focuses on pretty different applications in AI: Story in blockchain, Cursor in coding, and Defcon in military logistics.</p><p class="paragraph" style="text-align:left;">Aside from Magic’s LTM models, they succeeded in raising $320 million in a funding round. Another AI coding business called Codeium also secured funding through a series C round for<b> $150 million. </b></p><h3 class="heading" style="text-align:left;" id="magic-raises-320-million">Magic Raises $320 Million </h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/08/29/generative-ai-coding-startup-magic-lands-320m-investment-from-eric-schmidt-atlassian-and-others/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">Magic,</a> secured a massive $320 million funding round led by<b> ex-Google CEO</b> Eric Schmidt, with participation from Alphabet&#39;s CapitalG and Atlassian.</p><p class="paragraph" style="text-align:left;">They also announced that they will work with Google Cloud to build two supercomputers that will use Nvidia’s <b>H100 GPUs and Blackwell chips.</b></p><h3 class="heading" style="text-align:left;" id="codeium-raises-150-million-in-serie">Codeium raises $150 million in Series C Funding Round</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/08/29/github-copilot-competitor-codeium-raises-150m-at-a-1-25b-valuation/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">Codeium</a>, an AI-powered 
coding assistant startup competing with <b>GitHub</b> Copilot, has secured a $150 million Series C round led by General Catalyst, valuing the company at $1.25 billion. </p><p class="paragraph" style="text-align:left;">This latest investment brings Codeium&#39;s total funding to <b>$243 million</b> just three years after its launch, with the company achieving unicorn status and growing its user base to over 700,000 developers and 1,000 enterprise customers. </p><h3 class="heading" style="text-align:left;" id="story-raises-80-million-in-series-b">Story Raises $80 Million in Series B Funding Round</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/08/21/story-raises-83m-at-a-2-25b-valuation-to-build-a-blockchain-for-the-business-of-content-ip-in-the-age-of-ai/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=in-fall-magic-s-100m-context-alexa-claude" target="_blank" rel="noopener noreferrer nofollow">Story</a>, a startup building a <b>blockchain-based</b> platform for IP tracking and monetization in the age of AI, has secured $80 million in Series B funding led by Andreessen Horowitz&#39;s crypto division, with participation from Polychain Capital and other notable investors. The round values Story at $2.25 billion post-money and brings its total funding to $143 million.</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=69183571-0ed3-4692-9dcc-3f357d95508d&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>🍓, Multimodal Llamas, the New GPT-4 Model</title>
  <description>Plus, Meta’s new AI character creation tool</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/09ab7431-82b3-4ea4-bd49-9671391a2821/mikayelh_A_3d_isometric_white_and_orange_strawberry_with_whit_83f23798-eac0-4a3e-b90b-8f4f2821793a_2.png" length="1172696" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/of-strawberries-and-models</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/of-strawberries-and-models</guid>
  <pubDate>Tue, 20 Aug 2024 00:05:32 +0000</pubDate>
  <atom:published>2024-08-20T00:05:32Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Before we start, share last week&#39;s news with a friend or a colleague:</p><h2 class="heading" style="text-align:left;" id="key-takeaways">Key Takeaways</h2><p class="paragraph" style="text-align:left;">No, seriously, are you keeping up with the AI news? These past two weeks were so packed that we&#39;re splitting this issue into two parts; we&#39;ll send you the second part soon.</p><ul><li><p class="paragraph" style="text-align:left;">It was strawberry season for AI Twitter last week, with Altman&#39;s “<b>Project Strawberry</b>” hints at a new model rumored to have more advanced reasoning abilities than current LLMs (we&#39;re hearing chatter about graduate-level intelligence). We&#39;ll cover this in a special mid-week release in two days, but take a look at <a class="link" href="https://x.com/iruletheworldmo?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">@iruletheworldmo</a> and <a class="link" href="https://x.com/lilyofashwood?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Lily Ashwood</a> on X, who may or may not be real, and may be powered by the new model (the latter even hosted Twitter Spaces). </p></li><li><p class="paragraph" style="text-align:left;"><b>Activeloop</b>, Intel Disruptor Initiative, and Towards AI launched the <a class="link" href="https://learn.activeloop.ai/courses/genaitest?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Impossible GenAI Test</a>, where only 1 in 20 engineers passes the test. 
Take it for free today.</p></li><li><p class="paragraph" style="text-align:left;">GPT-4o-2024-08-06 <b>launched on Azure </b>with structured outputs support, achieving perfect scores in JSON Schema evaluations.</p></li><li><p class="paragraph" style="text-align:left;">Meta introduced <b>AI Studio</b> for creating AI characters, while discontinuing celebrity AI chatbots due to user feedback calling them &quot;creepy&quot;.</p></li><li><p class="paragraph" style="text-align:left;">Idefics3 is a new model that adapts Llama 3 to <b>multimodality </b>and shows drastic improvements in OCR and document understanding over its predecessors.</p></li><li><p class="paragraph" style="text-align:left;">MiniCPM demonstrated <b>comparable performance</b> to larger models with smaller parameter counts (1.3B and 2.7B) through efficient fine-tuning techniques.</p></li></ul><p class="paragraph" style="text-align:left;"><i>Got forwarded this newsletter? Subscribe below👇</i></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/subscribe?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model"><span class="button__text" style=""> Subscribe </span></a></div><div class="section" style="background-color:#ff8a00;margin:15.0px 15.0px 15.0px 15.0px;padding:15.0px 15.0px 15.0px 15.0px;"><h2 class="heading" style="text-align:left;"><span style="color:#F9FAFB;">Launching The Impossible GenAI Test</span></h2><p class="paragraph" style="text-align:left;"><span style="color:#F9FAFB;">As a subscriber, you&#39;ve been the first to know we&#39;ve introduced the Impossible GenAI Test. It tests across 6 core GenAI competencies, comprises 25 questions, and is so tough only 1 in 20 passes (based on the preliminary data, we may actually need to update this to… 1 in 40!). 
</span><br><br><span style="color:#F9FAFB;">Learn more about the </span><span style="color:#F9FAFB;"><a class="link" href="https://genai360.beehiiv.com/p/impossible-genai-test?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">test here</a></span><span style="color:#F9FAFB;">. Try it yourself today.</span></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#FFFFFF;" href="https://learn.activeloop.ai/courses/genaitest?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model"><span class="button__text" style="color:#222222;"> Check out my website </span></a></div></div><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">Safe to say that OpenAI was at full force last week with all kinds of news, ranging from a <b>highly detailed system card </b>for safety to the introduction of structured outputs in the API and a new GPT-4 model. We also saw a new AI tool from Meta that lets users create their own characters, shortly after they shut down their Al celebrity chatbots.</p><h3 class="heading" style="text-align:left;" id="gpt-4-s-system-card-new-model-avail">GPT-4’s System Card, New Model Available on Azure, and Structured Outputs in the API</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe3VTkD8OMf2JYW0rnIVc4JkUzy7cD-2DMqUW83nOM8bPnUK2GZtIClVrl6y1u48DGnj8PN0nhbVPoBZUVU1R6dWZ1iYplgFitAZ6P0ttKNk7o7NFW2OpTHUVTeGr1htshiRN1JSRd0Uvid5257oQpjY-k1?key=fjqdnkYAZeUBbSRmz0FGww"/><div class="image__source"><span class="image__source_text"><p>OpenAI detailed the safety measures taken before releasing GPT-4o. 
<a class="link" href="https://openai.com/index/gpt-4o-system-card/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">The <a class="link" href="https://openai.com/index/gpt-4o-system-card/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">GPT-4 System Card</a> details the safety measures, limitations, and evaluation methodologies implemented to mitigate risks associated with GPT-4&#39;s deployment. It outlines the system&#39;s <b>potential harms</b> and the steps taken to address issues like misinformation, bias, and unintentional harmful outputs.</p><p class="paragraph" style="text-align:left;">The system&#39;s residual risks, such as <b>occasional unauthorized</b> voice generation or over-refusals in non-English languages, are areas of active improvement. The focus remains on refining these aspects to minimize risks while enhancing the model&#39;s utility across diverse contexts.</p><p class="paragraph" style="text-align:left;">But that <b>wasn’t </b>the only move we saw from openAI last week. 
</p><p class="paragraph" style="text-align:left;">The latest model, <b>GPT-4o-2024-08-06</b>, has been <a class="link" href="https://azure.microsoft.com/en-us/blog/announcing-a-new-openai-feature-for-developers-on-azure/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">launched on Azure</a>, focusing on enhancing developer productivity by supporting <a class="link" href="https://openai.com/index/introducing-structured-outputs-in-the-api/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">structured outputs</a>, such as JSON Schemas. It even achieved perfect scores on evaluations with Structured Outputs, which means that the model&#39;s generated outputs consistently and accurately follow the complex JSON schema that was provided.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdPgAIQazHospU3bQSDX56_fQjiqhwUFWQmsg43tVTsJx4i36HZVFv1P-Cl3b9y_1zbEIlR_Qs4CjKgu39jiGyEjKvH6h1B_XHf7IYVb9Id1tngnL00XRE2KGMXztHf6OvyAYTfWr01MnyORt_FUlKym8e4?key=fjqdnkYAZeUBbSRmz0FGww"/><div class="image__source"><span class="image__source_text"><p>Structured outputs achieved a 100% score. 
<a class="link" href="https://openai.com/index/introducing-structured-outputs-in-the-api/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">The new GPT model is also <b>cheaper to use as a reranker</b> than Cohere’s model, with OpenAI’s model offering <a class="link" href="https://openai.com/api/pricing/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">$2.50/1M input tokens and $10/1M output tokens</a> compared to Cohere’s Command R+ <a class="link" href="https://cohere.com/pricing?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">$3/1M input tokens and $15/1M output tokens. </a></p><p class="paragraph" style="text-align:left;">GPT-4 mini is even more cost effective than both models at <a class="link" href="https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">$0.15/1M input tokens and $0.60/1M output tokens</a>. 
It’s also worth mentioning that the new GPT-4o model offers a higher maximum output limit of <a class="link" href="https://platform.openai.com/docs/models/gpt-4o?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">16K tokens.</a></p><h3 class="heading" style="text-align:left;" id="amazon-upgrades-image-gen-tool-meta">Amazon Upgrades Image Gen Tool, Meta Releases New AI Chatbot Tool, and Idefics3 Emerges</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXd9IIbeYkq2WK-DkEGm-vKLPSwRMg7XvHj952oZE1e03YvLp_Hshdh5wXrej9bhEVXW-RmTLEF6mqhCqL_ZuvqrNl7EX0eqYEX-toZ2Kz62ZHtB5iZ5LkNGO0XLs9FUfd9Do-PZp3I3HoJ261BYcN_VOPR-?key=fjqdnkYAZeUBbSRmz0FGww"/><div class="image__source"><span class="image__source_text"><p>Examples of images generated by Titan Image Generator 2. <a class="link" href="https://techcrunch.com/2024/08/06/amazon-upgrades-its-ai-image-generator/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">The last time we heard from Amazon, they were working on a GPT-killer called <a class="link" href="https://genai360.beehiiv.com/p/two-million-vs-rag-and-evolution?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Metis</a>, so it’s been a little while since we last had news from them. 
They’ve released an <b>upgraded version</b> of their image-generating model called <a class="link" href="https://techcrunch.com/2024/08/06/amazon-upgrades-its-ai-image-generator/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Titan Image Generator v2</a>, which can detect and segment multiple objects within the foreground of an image.</p><p class="paragraph" style="text-align:left;">It also introduces <b>improved image conditioning capabilities</b>, so users can focus on specific visual characteristics such as edges, object outlines, and structural elements - all leading to more detailed image generation. It isn’t quite clear, though, what data Amazon used to train this model.</p><p class="paragraph" style="text-align:left;">Meanwhile, Meta’s new tool called AI Studio lets users create AI versions of themselves, or even <b>create an entirely new AI character</b>. Speaking of which, Google acqui-hired <a class="link" href="https://Character.ai?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Character.ai</a><a class="link" href="https://fortune.com/2024/08/02/google-character-ai-founders-microsoft-inflection-amazon-adept/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">’s founders</a>, which also provides a similar type of technology where users can talk to personality-driven AI chatbots, like a mental health helper or an English teacher.</p><p class="paragraph" style="text-align:left;">These types of conversational AI platforms have been pretty popular in the last couple of years, as we’ve seen a number of startups in this space <b>continue to grow rapidly</b> - with character.ai being one of them. 
</p><p class="paragraph" style="text-align:left;">But that wasn’t the case for<b> Meta’s celebrity AI chatbots </b>(which featured celebs like Snoop Dogg or Tom Brady), as users can no longer interact with them. It fell flat with users who even called them “<a class="link" href="https://www.digitaltrends.com/computing/facebook-shelves-celebrity-ai-chatbot-program/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">creepy</a>”.</p><p class="paragraph" style="text-align:left;">In other news, we saw the release of <a class="link" href="https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Idefics3</a>, a model that adapts Llama 3 to <b>multimodality</b>. It’s capable of processing arbitrary sequences of text and image inputs to generate text outputs. It can perform tasks such as visual question answering, image captioning, and story creation based on multiple images.</p><p class="paragraph" style="text-align:left;">It builds upon Idefics1 and Idefics2, <b>significantly improving </b>in areas like OCR (Optical Character Recognition), document understanding, and visual reasoning.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe-jWH88GJ-V5q6zUsjJvH0ciDToSaG19AZ_NsgFEfHua4Gv1L2B9XCeiVKb8FDc8OEqK016Ts9ZvGROtRYo7utkCs1WfKdxEBaJvX40WAR3FNNPRYpULzOSCur31A7cI2dw4mhZXSDvBq27YNB1MlWjXc?key=fjqdnkYAZeUBbSRmz0FGww"/><div class="image__source"><span class="image__source_text"><p>Idefics3 outperforms its predecessor, Idefics2, across multiple benchmarks. 
<a class="link" href="https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><h3 class="heading" style="text-align:left;" id="open-a-is-board-expansion">OpenAI’s Board Expansion</h3><p class="paragraph" style="text-align:left;">In the wake of <a class="link" href="https://genai360.beehiiv.com/p/on-device-llms-challenge-gpt3-5?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">several high-profile departures</a>, OpenAI decided to <a class="link" href="https://techcrunch.com/2024/08/08/openai-adds-a-carnegie-mellon-professor-to-its-board-of-directors/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">appoint Zico Kolter</a>, a prominent professor and director at Carnegie Mellon University&#39;s Machine Learning Department, to its <b>board of directors.</b></p><p class="paragraph" style="text-align:left;">There’s been some concerns about the internal dynamics at OpenAI, especially regarding the <b>allocation of resources</b> for AI safety initiatives. Since Kolter’s research focuses on safety, it’s a smart move by OpenAI.</p><p class="paragraph" style="text-align:left;">Kolter will join OpenAI’s Safety and Security Committee, which includes other directors like Bret Taylor and Adam D’Angelo, as well as technical experts. 
This committee is tasked with overseeing and <b>making recommendations</b> on the safety and security of all OpenAI projects.</p><h3 class="heading" style="text-align:left;" id="xs-eu-data-pause-and-reddits-ai-pow">X’s EU Data Pause and Reddit’s AI-Powered Search Results</h3><p class="paragraph" style="text-align:left;">X has agreed to <a class="link" href="https://techcrunch.com/2024/08/08/elon-musks-x-agrees-to-pause-eu-data-processing-for-training-grok/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">suspend the use of European users&#39; data</a> for <b>training its AI tool,</b> Grok, following legal action from Ireland&#39;s Data Protection Commission (DPC). This suspension covers the period between May 7, 2024, and August 1, 2024, and will remain in effect as the DPC continues to assess the legality of this data processing under the GDPR.</p><p class="paragraph" style="text-align:left;">X has publicly criticized the DPC&#39;s actions, labeling the injunction as &quot;unwarranted&quot; and &quot;overbroad.&quot; The company claims to have implemented privacy settings allowing users to control their data and argues that it has been working with the DPC on these issues <b>since last year.</b></p><p class="paragraph" style="text-align:left;">The DPC, in collaboration with other EU/EEA regulators, is investigating whether X&#39;s data processing practices comply with GDPR requirements. 
This investigation includes examining the <b>potential unlawfulness</b> of AI models trained on data collected without proper consent.</p><p class="paragraph" style="text-align:left;">The other social media giant that had some AI news was<a class="link" href="https://techcrunch.com/2024/08/06/reddit-ai-powered-search-results/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow"> Reddit,</a> as they’ll be testing out <b>AI-powered search pages</b> soon. It builds on Reddit&#39;s recent partnerships with OpenAI and Google, which allow the company to leverage their LLM and AI capabilities.</p><h3 class="heading" style="text-align:left;" id="humanes-pin-faces-more-returns-than">Humane&#39;s Pin Faces More Returns Than Sales</h3><p class="paragraph" style="text-align:left;">Previously, we mentioned that <a class="link" href="https://genai360.beehiiv.com/p/of-new-architectures?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">two executives at Humane left</a> to form their own fact-checking company shortly after the Ai Pin&#39;s launch <b>didn’t exactly go to plan</b> and the device received some harsh criticism. It doesn’t seem like things are getting any better as the Ai Pin was reported to have more <a class="link" href="https://www.theverge.com/2024/8/7/24211339/humane-ai-pin-more-daily-returns-than-sales?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">returns than sales</a> between May and August.  </p><p class="paragraph" style="text-align:left;">To make matters worse, Humane <b>can’t refurbish or resell the returned devices </b>because of T-Mobile limitations on reassigning devices to new users. 
As a result, the returned pins become e-waste and a loss of revenue for Humane.</p><p class="paragraph" style="text-align:left;">Additionally, Humane <b>experienced significant executive turnover,</b> including departures of key engineering leaders and the director of customer experience. The company also laid off 4% of employees in January as a cost-cutting measure.</p><p class="paragraph" style="text-align:left;">Humane is under a lot of <b>financial pressure </b>considering that they raised $200 million in funding while dealing with low sales numbers and a lot of unhappy customers.</p><h3 class="heading" style="text-align:left;" id="is-the-autonomous-driving-market-re">Is the Autonomous Driving Market Ready for a Chinese Challenger?</h3><p class="paragraph" style="text-align:left;">WeRide, a Chinese autonomous driving startup, <a class="link" href="https://www.pymnts.com/news/ipo/2024/chinese-autonomous-driving-firm-weride-plans-us-ipo/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">officially filed for an IPO with the U.S. </a>Securities and Exchange Commission (SEC) on <b>July 26</b>, which marks the company&#39;s intention to go public in the US.</p><p class="paragraph" style="text-align:left;">WeRide reported losses of<b> $268 million</b> in the previous year, with only $55 million in revenue. Despite these losses, the company continues to push forward with its IPO, reflecting the high growth potential it sees in the autonomous driving market.</p><p class="paragraph" style="text-align:left;">For key players in the self-driving industry like Wayve and NIO, this might mean increased competition, as we might see <b>other Chinese autonomous driving companies</b> go public soon as well. 
</p><p class="paragraph" style="text-align:left;">The huge losses of $268 million also points toward the trend that autonomous driving companies are <a class="link" href="https://www.scmp.com/tech/big-tech/article/3259682/baidus-self-driving-project-swerves-profit-chasing-after-burning-cash-years?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">facing issues when it comes to profitability</a>, due to the high initial costs involved and the fact that making profit from this type of product is more a <b>long-game </b>instead of one that provides immediate returns.</p><p class="paragraph" style="text-align:left;">Additionally, Warren Buffett&#39;s Berkshire Hathaway has <a class="link" href="https://www.barrons.com/articles/byd-stock-berkshire-hathaway-warren-buffett-sell-773d9e05?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">sold more shares</a> of <b>Chinese electric vehicle maker BYD</b>, continuing its gradual reduction in holdings. The sale has sparked speculation about Berkshire&#39;s confidence in BYD, given its significant past investments in the company. 
</p><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">We saw some notable progress in <b>language model research</b> last week, with CODEXGRAPH providing a means for LLMs to interact with code repositories and MiniCPM as a method for deploying GPT-4V level MLLMs on end devices.</p><p class="paragraph" style="text-align:left;">RAGFoundry was another framework that stood out, since it provides a single workflow that <b>combines various aspects </b>to make RAG implementation a little less complex.</p><h3 class="heading" style="text-align:left;" id="codex-graph-bridges-the-gap-between">CodexGraph Bridges the Gap Between LLMs and Complex Codebases</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdrawLS9-IJuirR2jkor9GMxI0xbMMqWp1spXfg-BmXRg_Chqd939JaY0zaWRPZutPL0LszcloMoPgRk-M_TqwsD_B4AMHqvSJgmLXxt3boGJ_rSZEZ0D9O0CjfIrOUZ_YOCSgPSfdAO8tZcBFsxitOfUi2?key=fjqdnkYAZeUBbSRmz0FGww"/><div class="image__source"><span class="image__source_text"><p>CODEXGRAPH overview. <a class="link" href="https://arxiv.org/pdf/2408.03910v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">(Source) </a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.03910v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">CODEXGRAPH</a> is a system that lets LLMs effectively interact with entire code repositories. 
It addresses the challenge of handling <b>complex, repository-level coding tasks </b>that require understanding cross-file code structures and performing intricate reasoning across large codebases.</p><p class="paragraph" style="text-align:left;">To achieve this, CODEXGRAPH integrates LLM agents with graph database interfaces extracted from code repositories. The system uses<b> static analysis </b>to construct code graphs, where nodes represent code symbols and edges represent relationships between them. </p><p class="paragraph" style="text-align:left;">LLM agents then generate and execute graph queries to navigate the codebase, allowing for <b>precise, code structure-aware context retrieval.</b></p><p class="paragraph" style="text-align:left;">Results were impressive, as CODEXGRAPH achieved competitive performance across three <b>challenging repository-level benchmarks</b>: CrossCodeEval, SWE-bench, and EvoCodeBench. </p><p class="paragraph" style="text-align:left;">When equipped with GPT-4o, CODEXGRAPH <b>outperforms</b> other retrieval-augmented code generation baselines on CrossCodeEval and EvoCodeBench, while matching state-of-the-art performance on SWE-bench.</p><h3 class="heading" style="text-align:left;" id="nac-ls-hybrid-eviction-policy-slash">NACL&#39;s Hybrid Eviction Policy Slashes LLM Memory Usage</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe5s9D_8D5peUra2nn8yh9Zj9oluadLzDFMbcSmuANAaxYZ0_IWYhRdZAr32GomAZwWUGh6z-oPBS9h9F8dsydRoZFvw-w_3ypARKvSU7daujkq1yUXQZTLAeIcZjEzQMsjujaOyJqniHrzLvpqaSsjKKw?key=fjqdnkYAZeUBbSRmz0FGww"/><div class="image__source"><span class="image__source_text"><p><a class="link" href="https://arxiv.org/pdf/2408.03675v2?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">NACL</a> is a much more efficient alternative to traditional eviction algorithms that use 
step-by-step greedy search. <a class="link" href="https://arxiv.org/pdf/2408.03675v2?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers from the Chinese Academy of Sciences and Baidu have introduced NACL, a framework for key-value (KV) cache eviction in LLMs during inference time. This approach addresses the challenge of<b> managing extensive memory consumption in KV caches,</b> particularly for models with extended context windows, which has been a significant bottleneck in deploying LLMs for long-context tasks.</p><p class="paragraph" style="text-align:left;">NACL employs a<b> hybrid eviction policy</b> combining Proxy-Tokens Eviction and Random Eviction. The Proxy-Tokens Eviction utilizes global statistics of attention scores from selected proxy tokens, while Random Eviction incorporates a diversified sampling strategy.</p><p class="paragraph" style="text-align:left;">NACL drastically improves performance on both short- and long-text tasks by <b>80% and 76% respectively</b>, while reducing KV Cache by up to 5× with over 95% performance maintenance. 
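A minimal sketch of what such a hybrid policy can look like (our own toy, not the authors' implementation; the `proxy_frac` split, the seed, and the scores are invented stand-ins):

```python
import random

# Toy hybrid eviction in the spirit of NACL (not the authors' code):
# keep the cache positions scored highest by proxy-token attention, then
# fill the rest of the budget with a random sample for diversity.
def hybrid_evict(scores, budget, proxy_frac=0.7, seed=0):
    """scores: one attention score per cached position (from proxy tokens).
    Returns the sorted positions to keep; len(result) == budget."""
    n_proxy = int(budget * proxy_frac)  # slots given to top-scoring tokens
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_proxy])
    rest = [i for i in range(len(scores)) if i not in keep]
    rng = random.Random(seed)  # deterministic here for illustration
    keep.update(rng.sample(rest, budget - n_proxy))  # diversified sampling
    return sorted(keep)
```

Evicting everything outside the returned positions is what shrinks the KV cache while the proxy-scored tokens preserve most of the useful context.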
It shows that there’s a practical approach to managing memory constraints in LLMs, which might enable efficient deployment of these models for long-context applications.</p><h3 class="heading" style="text-align:left;" id="mini-cp-ms-two-stage-fine-tuning-to">MiniCPM&#39;s Two-Stage Fine-Tuning to Deploy GPT-4V Level Models On-Device</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcoWCb-eN-l4SMeyMcGgm7D-tlBhyMdGQjZW-m5D4tKonKt1FTcHaxZyjeW8mTrjzFZBHm-ltQEH_ewr5FCk9aD6Be0G477dp3Sr123nqU8JzeYZKZNv3wJu8JGx4eBQZsTBSAse2oJuqF1XeV_qWKsA6n5?key=fjqdnkYAZeUBbSRmz0FGww"/><div class="image__source"><span class="image__source_text"><p>Moore’s Law for MLLMs, which shows that deploying GPT-4V level MLLMs on end devices is becoming a reality. <a class="link" href="https://arxiv.org/pdf/2408.01800v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">OpenBMB researchers introduced MiniCPM, a means of improving the performance of multimodal large language models (MLLMs) through efficient fine-tuning techniques. This helps boost their capabilities without having to <b>increase their size or computational requirements</b>, making them more suitable for deployment in resource-constrained environments.</p><p class="paragraph" style="text-align:left;">MiniCPM employs a two-stage fine-tuning process. First, it uses a larger teacher model to generate high-quality synthetic data, then it fine-tunes the smaller target model on this data using techniques like LoRA and QLoRA. 
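For readers unfamiliar with LoRA, the core trick fits in a few lines (a generic sketch of the technique itself, not MiniCPM's code): the pretrained weight matrix W stays frozen, and only a low-rank update B·A is trained, shrinking the trainable parameter count from d² to 2·d·r:

```python
# Generic LoRA sketch: freeze W, train only the low-rank factors A and B.
# The dimensions and values below are invented for illustration.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_apply(W, A, B, scale=1.0):
    """Effective weights W + scale * (B @ A); W itself stays frozen."""
    delta = matmul(B, A)  # (d x r) @ (r x d) -> d x d from only 2*d*r params
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

d, r = 4, 1                           # rank-1 update: 8 params instead of 16
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
A = [[0.1, 0.2, 0.3, 0.4]]            # r x d, trainable
B = [[1.0], [0.0], [0.0], [0.0]]      # d x r, trainable
W_eff = lora_apply(W, A, B)
```

At real model sizes (d in the thousands, r in the tens), this reduction in trainable parameters is what makes fine-tuning feasible on constrained hardware.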
The approach also <b>incorporates multi-task learning</b> and careful data curation to maximize the efficiency of the fine-tuning process.</p><p class="paragraph" style="text-align:left;">MiniCPM models with only 1.3B and 2.7B parameters achieve performance comparable to <b>much larger models like GPT-4V</b> in various benchmarks, including common sense reasoning, math problem-solving, and coding tasks. </p><p class="paragraph" style="text-align:left;">For instance, the MiniCPM-2.7B model outperforms Llama2-13B on several metrics despite being a lot smaller. As a result, GPT-4V level MLLMs can be deployed on end devices.</p><h2 class="heading" style="text-align:left;" id="frameworks-we-love">Frameworks We Love</h2><p class="paragraph" style="text-align:left;">Some frameworks that caught our attention in the last week include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.04482?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">SegXAL</a>: Explainable Active Learning (XAL) model designed for semantic segmentation in driving scenes, which integrates human expertise through an explainable AI module and uncertainty measures.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.04449?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">RiskAwareBench</a>:  Automated framework designed to assess physical risk awareness in LLM-based embodied agents.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2408.04347?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">AggSS</a>: Introduces an Aggregated 
Self-Supervision approach for class-incremental learning, where image rotations are treated as additional classes to enhance robust feature learning. </p></li></ul><p class="paragraph" style="text-align:left;">If you want your framework to be featured here, reply to this email and say hi :) </p><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">OpenAI continued the wave of news with Altman<b> dropping a huge hint </b>about a new model announcement that we might see soon. His cryptic post containing a picture of a strawberry was actually referring to “Project Strawberry”, a highly advanced model with better reasoning capabilities than current models. </p><p class="paragraph" style="text-align:left;">Another interesting discussion that popped up was regarding <b>Intel’s discussions with OpenAI </b>in 2017-2018, and what impact this had on the chip industry.</p><h3 class="heading" style="text-align:left;" id="project-strawberry-announcement-inc">Project Strawberry Announcement Incoming?</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfrugOr7ww3BAnWSgSTsE51o4mwn4hzytEVOt7hpS7lWGKU9d8xAU7Z1WxL5XwjMXgQSVoeM9eO3jNZDMtLK_d1P1CvaJ7LQaz6rNYwI1Pg3k8EJcpSBC-QJTYDJoW-3_NzKRXZOyAdafN23hK3H5dYQ3M?key=fjqdnkYAZeUBbSRmz0FGww"/><div class="image__source"><span class="image__source_text"><p>Altman’s hint for OpenAI’s newest model. 
<a class="link" href="https://x.com/sama/status/1821207141635780938?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://x.com/sama/status/1821207141635780938?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Altman&#39;s cryptic tweet</a> featuring a strawberry sparked intense speculation about &quot;Project Strawberry,&quot; a new AI model reportedly capable of <b>advanced reasoning</b>. This project, an extension of the previously revealed Q* initiative, aims to address one of AI&#39;s biggest challenges: multi-step problem-solving and reasoning.</p><p class="paragraph" style="text-align:left;">Project Strawberry reportedly builds on OpenAI&#39;s existing LLMs, fine-tuning them for <a class="link" href="https://singularityhub.com/2024/07/19/openais-project-strawberry-is-said-to-be-building-ai-that-reasons-and-does-deep-research/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">enhanced reasoning capabilities</a>. The approach is said to be similar to the Self-Taught Reasoner (STaR) method, which uses <b>iterative self-improvement </b>techniques to boost AI&#39;s problem-solving skills.</p><p class="paragraph" style="text-align:left;">Reports suggest impressive capabilities, particularly in math and science – areas that have <b>traditionally been difficult for AI</b>. 
<a class="link" href="https://www.zdnet.com/article/what-is-project-strawberry-openais-mystery-ai-tool-explained/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">An anonymous model</a>, possibly related to Project Strawberry, has already demonstrated reasoning abilities surpassing GPT-4 on the AI testing platform Arena. </p><p class="paragraph" style="text-align:left;">This follows a pattern similar to GPT-4&#39;s pre-release testing, hinting at a potential <b>imminent announcement </b><a class="link" href="https://www.tomsguide.com/ai/chatgpt/openai-could-be-about-to-drop-project-strawberry-in-huge-chatgpt-upgrade?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">as early as this week.</a></p><h3 class="heading" style="text-align:left;" id="open-a-is-billion-dollar-opportunit">OpenAI&#39;s Billion-Dollar Opportunity: How Intel&#39;s Hesitation Reshaped the AI Landscape</h3><p class="paragraph" style="text-align:left;">Although Nvidia currently leads the AI chip market, <a class="link" href="https://www.reuters.com/technology/artificial-intelligence/how-chip-giant-intel-spurned-openai-fell-behind-times-2024-08-07/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Intel </a>was once the dominant player in the chip industry. In 2017-2018, Intel had discussions with OpenAI about potentially acquiring a <b>15% stake for $1 billion.</b></p><p class="paragraph" style="text-align:left;">The deal also included provisions for Intel to provide <b>specialized</b> chips at cost to OpenAI, potentially shaping the future of AI computing. 
However, Intel ultimately decided not to proceed with the investment.</p><p class="paragraph" style="text-align:left;">At the time, the company&#39;s leadership, including then-CEO Bob Swan, had a different perspective on the <b>near-term market potential of generative AI</b>. This decision came during a period when Intel was navigating the transition from CPU to GPU architecture for AI applications.</p><p class="paragraph" style="text-align:left;">Meanwhile, Nvidia&#39;s focus on GPUs for AI workloads helped it gain a <b>significant market share.</b></p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in AI</h2><p class="paragraph" style="text-align:left;">Investments were plentiful in the AI industry last week, with Recursion Pharmaceuticals being involved in a <b>massive deal to acquire Exscientia</b> for $688 million. Meanwhile, Groq secured $640 million in a successful round and Leonardo.ai was acquired by Canva.</p><h3 class="heading" style="text-align:left;" id="recursion-pharmaceuticals-ready-to-">Recursion Pharmaceuticals Ready to Acquire Exscientia in $688 Million Deal</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://finimize.com/content/recursion-pharmaceuticals-to-acquire-exscientia-in-688-million-deal?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Recursion Pharmaceuticals</a> is set to acquire Exscientia in a $688 million all-stock deal, marking a significant consolidation in the <b>AI-driven drug discovery space.</b> This merger combines Recursion&#39;s focus on rare diseases and cancers with Exscientia&#39;s AI-powered drug discovery platform, aiming to accelerate drug development and reduce costs. 
</p><h3 class="heading" style="text-align:left;" id="groq-secures-640-million-in-series-">Groq Secures $640 Million in Series D Funding</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://wow.groq.com/news_press/groq-raises-640m-to-meet-soaring-demand-for-fast-ai-inference/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Groq</a>, a leader in fast AI inference, has secured a massive $640 million Series D funding round at a<b> $2.8 billion valuation</b>, led by BlackRock Private Equity Partners with participation from notable investors including Neuberger Berman, Cisco Investments, and Samsung Catalyst Fund. </p><h3 class="heading" style="text-align:left;" id="tencent-contributes-to-300-million-">Tencent Contributes to $300 Million Funding Round for Moonshot</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.bloomberg.com/news/articles/2024-08-05/tencent-joins-300-million-financing-for-china-s-ai-unicorn?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Tencent</a> has participated in a $300 million-plus financing round for Chinese AI startup Moonshot, valuing the company at<b> $3.3 billion</b>, with Alibaba and Gaorong Capital also joining the investment. 
This move is part of a larger trend of significant capital inflow into Chinese AI firms, as major tech companies and venture capitalists compete to establish dominance in the AI market and develop alternatives to ChatGPT.</p><h3 class="heading" style="text-align:left;" id="adept-ai-investors-to-be-paid-back">Adept AI Investors to be Paid Back</h3><p class="paragraph" style="text-align:left;">In a complex deal blurring the lines between acquisition and talent poaching, <a class="link" href="https://www.semafor.com/article/08/02/2024/investors-in-adept-ai-will-be-paid-back-after-amazon-hires-startups-top-talent?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Amazon</a> has effectively hired away most of Adept&#39;s top employees while arranging for the AI startup&#39;s investors to recoup their <b>$414 million investment. </b></p><p class="paragraph" style="text-align:left;">This arrangement, which sees Adept retaining about a third of its workforce and receiving <b>$25 million</b>, has caught the attention of regulators, with the FTC probing whether it circumvents merger notification rules. </p><h3 class="heading" style="text-align:left;" id="leonardoai-acquired-by-canva">Leonardo.ai Acquired by Canva</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://techcrunch.com/2024/07/29/canva-acquires-leonardo-ai-to-boost-its-generative-ai-efforts/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=multimodal-llamas-the-new-gpt-4-model" target="_blank" rel="noopener noreferrer nofollow">Canva</a>, the design platform giant, has acquired Leonardo.ai, a generative AI content startup, in a strategic move to enhance its AI capabilities and expand its Magic Studio suite. 
We don’t know the full financial terms, but the deal involves a <b>mix of cash and stock, </b>with all 120 Leonardo.ai employees, including the executive team, joining Canva. </p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=30bf912d-4f17-4d72-9ed2-5577d20e85ba&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>The Impossible GenAI Course: 1 in 20 Passes, Will You?</title>
  <description>The Hardest Challenge You&#39;ve Ever Taken is Inside</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/fe5c6709-8e25-48fa-b2d4-3a01288f60dc/RAG_Course_copy.png" length="462826" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/impossible-genai-test</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/impossible-genai-test</guid>
  <pubDate>Thu, 15 Aug 2024 14:57:50 +0000</pubDate>
  <atom:published>2024-08-15T14:57:50Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">tl;dr - the hardest challenge you&#39;ll ever face - The Impossible GenAI Course is Now Live!</p><p class="paragraph" style="text-align:left;">Today&#39;s a special day. Activeloop, together with our partners at Intel Disruptor and TowardsAI, was one of the first companies to pioneer high-quality, production-oriented GenAI courses.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5e5750de-3fe1-47b2-b708-aab4a5da01e7/image.png?t=1725528376"/></div><p class="paragraph" style="text-align:left;">One year, and tens of thousands of professionals educated later, we&#39;ve noticed one pattern. </p><p class="paragraph" style="text-align:left;">While everyone focuses on providing content, no one focuses on… comprehension of said content! So many people nowadays (42.7K, according to LinkedIn/Apollo data) call themselves &#39;AI Engineers&#39;, but can they build production-ready GenAI solutions that solve actual business problems?</p><p class="paragraph" style="text-align:left;">To address that, we&#39;ve teamed up with top AI minds at the Intel Disruptor Initiative and TowardsAI to craft the Impossible GenAI Test. <b>Only one in 20 test takers succeeds. 
Do you think it can be you?</b></p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://learn.activeloop.ai/courses/genaitest?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=the-impossible-genai-course-1-in-20-passes-will-you"><span class="button__text" style="color:#F9FAFB;"> Take the Test </span></a></div><h2 class="heading" style="text-align:left;" id="whats-in-store">What&#39;s in store?</h2><p class="paragraph" style="text-align:left;">- 30 Questions, 6 Topics, 40 Minutes<br>- Questions do not repeat, vary in difficulty, and get you more points based on complexity<br>- Wrong answers are penalized</p><h2 class="heading" style="text-align:left;" id="what-questions-will-be-asked">What Questions Will Be Asked?</h2><p class="paragraph" style="text-align:left;">The test covers six key areas of Generative AI. It will test everything from your deep understanding of how chunking impacts downstream solutions, to deciding on what would be the most cost-efficient solution in a case study, to what legal ramifications does building GenAI applications have in the US vs EU. More specifically, you&#39;ll be tested on:</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><b>AI Foundations</b></p></li><li><p class="paragraph" style="text-align:left;"><b>Retrieval Augmented Generation</b></p></li><li><p class="paragraph" style="text-align:left;"><b>Model Training & Fine-tuning</b></p></li><li><p class="paragraph" style="text-align:left;"><b>Observability & Evaluation</b></p></li><li><p class="paragraph" style="text-align:left;"><b>Model Inference and Deployment</b></p></li><li><p class="paragraph" style="text-align:left;"><b>Ethics & Compliance</b></p></li></ol><p class="paragraph" style="text-align:left;">Each section will challenge you with five questions selected at random. 
</p><p class="paragraph" style="text-align:left;">Yes, the test is really tough. But we&#39;re not here to trip you up. With GenAI&#39;s growing impact, mastering the basics and understanding production intricacies is crucial before launching any GenAI app into the real world. Challenge yourself, your friends, and colleagues to an Impossible GenAI Test face-off. </p><p class="paragraph" style="text-align:left;">With enough data, we will work to release company leaderboards, so people can compare knowledge across companies!</p><p class="paragraph" style="text-align:left;">The initiative was developed jointly by Activeloop and the Intel Disruptor Initiative, in collaboration with Arijit Bandyopadhyay from Intel Corporation.</p><p class="paragraph" style="text-align:center;">Take the course & test yourself today!</p><div class="button" style="text-align:center;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://learn.activeloop.ai/courses/genaitest?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=the-impossible-genai-course-1-in-20-passes-will-you"><span class="button__text" style="color:#FFFFFF;"> Accept the Challenge </span></a></div><p class="paragraph" style="text-align:left;"><span style="font-family:Apple Color Emoji, Segoe UI Emoji, NotoColorEmoji, Noto Color Emoji, Segoe UI Symbol, Android Emoji, EmojiSymbols;font-size:0.6rem;">©</span><span style="font-size:0.6rem;"> Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. 
Other names and brands may be claimed as the property of others.</span></p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=82e95f6b-03e8-4496-92ca-8e3395d67450&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Gemma 2 2B &gt; GPT-3.5, Open-Source FLUX-1 vs Midjourney, GitHub Takes on Hugging Face</title>
  <description>Plus, EU AI Act comes into force</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/abff904e-6fe1-4d77-b0e4-13e7cc4c29a9/mikayelh_A_3d_isometric_white_and_orange_micropocessor_highli_01c49fe9-1a9c-45ad-a066-4bd1c0885e51_1.png" length="713062" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/on-device-llms-challenge-gpt3-5</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/on-device-llms-challenge-gpt3-5</guid>
  <pubDate>Tue, 06 Aug 2024 16:08:49 +0000</pubDate>
  <atom:published>2024-08-06T16:08:49Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Before we start, share this week&#39;s news with a friend or a colleague:</p><h2 class="heading" style="text-align:left;" id="key-takeaways">Key Takeaways</h2><ul><li><p class="paragraph" style="text-align:left;">Google unveiled the new generation of <b>Gemma 2 models</b>, with the 2B model showing strong performance while running on-device along with two other models that focus on privacy and transparency.</p></li><li><p class="paragraph" style="text-align:left;">Apple introduced two <b>new foundation language models</b> (AFM-on-device and AFM-server), with the on-device model outperforming Llama-3 8B and the server model outperforming GPT-3.5.</p></li><li><p class="paragraph" style="text-align:left;">Salesforce AI Research introduced MINT-1T, the<b> first trillion token </b>interleaved dataset that outperformed the previous state-of-the-art interleaved dataset, OBELICS.</p></li><li><p class="paragraph" style="text-align:left;">Black Forest Labs released the<b> FLUX.1 suite of models</b>, which outperforms Midjourney-V6.0 in aspects like visual quality by incorporating rotary positional embeddings and parallel attention layers.</p></li><li><p class="paragraph" style="text-align:left;">Stanford University researchers created a <b>comprehensive benchmark </b>for solving predictive tasks over relational databases using graph neural networks.</p></li></ul><p class="paragraph" style="text-align:left;"><i>Got forwarded this newsletter? 
Subscribe below👇</i></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/subscribe?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face"><span class="button__text" style=""> Subscribe </span></a></div><h2 class="heading" style="text-align:left;" id="the-talk-of-the-day-and-then-there-">The Talk of the Day: And Then There Were Three…</h2><p class="paragraph" style="text-align:left;">OpenAI has just lost <a class="link" href="https://techcrunch.com/2024/08/05/openai-co-founder-leaves-for-anthropic/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">more of the original 11 founders</a>. President Greg Brockman, co-founder John Schulman, and product leader Peter Deng have left the ChatGPT developer.</p><p class="paragraph" style="text-align:left;">After AI safety researcher <a class="link" href="https://genai360.beehiiv.com/p/io-openai-exodus-14-llms?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Jan Leike left to work at Anthropic</a> on the AI alignment problem, Schulman took over as the leader of OpenAI&#39;s alignment science team, also called the &quot;post-training&quot; team. Now, he&#39;ll continue this mission at Anthropic, following in Leike&#39;s footsteps. </p><p class="paragraph" style="text-align:left;">While it&#39;s not yet clear where Deng will go, Brockman has decided to take a sabbatical, his first chance to relax since co-founding OpenAI 9 years ago. 
</p><p class="paragraph" style="text-align:left;">Only three of OpenAI’s 11 original founders remain: OpenAI CEO Sam Altman, Brockman (who hasn&#39;t quit officially), and Wojciech Zaremba, lead of language and code generation.</p><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">Google’s Gemma 2 2B <b>outperformed GPT-3.5</b> despite being a lot smaller, meaning that competition for the GPT models is only increasing.</p><p class="paragraph" style="text-align:left;">Meanwhile, the <b>hardware race</b> is heating up with AMD challenging NVIDIA&#39;s dominance and Samsung ramping up chip production, so we might see a big shift in the AI infrastructure landscape.</p><h3 class="heading" style="text-align:left;" id="the-next-generation-of-gemma-2-mode">The Next Generation of Gemma 2 Models & Salesforce&#39;s MINT-1T</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXerBANvCJ6fq-izNJzKSJGrQ_EhIfURZ0hHaWSSQqzzffpseC9_e5yxi_jr6oiqSRSFTGvqg_WCbhpOXt5hLmHeelFZNHdyWehQIDoQjui1sL5Hm85p2F_X4kIK3TbcGcz8ejGs8O0RdQvfH0aNdLfz__M?key=i6VOPgWULlhs1KUjQCjlIA"/><div class="image__source"><span class="image__source_text"><p>Gemma 2 outperforms much larger models. 
<a class="link" href="https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Three new models were added to the <a class="link" href="https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Gemma 2 model</a> family by Google last week.</p><p class="paragraph" style="text-align:left;">Gemma 2 2B is a smaller, yet more efficient model that <b>outperforms larger models </b>while still having safety advancements built into it. The 2B model even outperformed every GPT3.5 model on Chatbot Arena.</p><p class="paragraph" style="text-align:left;">In addition to Gemma 2 2B, ShieldGemma and Gemma Scope were also introduced. ShieldGemma<b> boosts the privacy and security </b>of AI models by protecting user data, while Gemma Scope offers tools and techniques to provide a better understanding of how AI decisions were made.</p><p class="paragraph" style="text-align:left;">In other news, Salesforce AI Research presented <a class="link" href="https://blog.salesforceairesearch.com/mint-1t/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">MINT-1T</a> - the first ever <b>trillion token interleaved </b>dataset. 
So what’s the big deal about it?</p><p class="paragraph" style="text-align:left;">Interleaved documents have a mixture of text and images, which means they can be used to train multimodal models for <b>text and visual capabilities</b>. Models like <a class="link" href="https://genai360.beehiiv.com/p/io-openai-exodus-14-llms?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Chameleon </a>showed just how effective interleaved data can be in achieving high performance for multimodal models.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfQ_O0BAZB-MSxDFQnHBddNbvdGOtu0N5o8vIVoXUeW6qs1uvgZdWOYeew4nv-rGEqndn_uiFzz9CyMNRxvHHf_jNolEHMg1OZT7kQH8SAAzEWQAC-86zfLdu7Ijqg0x7Z6kBzuMWKEBMG4i4j-FGG6xbNa?key=i6VOPgWULlhs1KUjQCjlIA"/><div class="image__source"><span class="image__source_text"><p>MINT-1T outperforms the previous leading interleaved dataset, OBELICS. <a class="link" href="https://blog.salesforceairesearch.com/mint-1t/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">As a result, the chances of seeing much <b>larger multimodal models </b>in the future have increased drastically. 
In fact, this dataset is already being used to train <a class="link" href="https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1?ref=blog.salesforceairesearch.com&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">the XGen-MM</a> series.</p><h3 class="heading" style="text-align:left;" id="stability-a-is-double-play-stable-f">Stability AI’s Double Play: Stable Fast 3D and SV4D</h3><p class="paragraph" style="text-align:left;">Stability AI has introduced <a class="link" href="https://stability.ai/news/introducing-stable-fast-3d?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Stable Fast 3D</a>, a new AI model capable of generating 3D assets from a <b>single image in 0.5 seconds</b>. It’s a big leap in the field of AI-driven 3D content creation since it could change the way we look at game development, visual effects, and product design.</p><p class="paragraph" style="text-align:left;">Stable Fast 3D&#39;s ability to quickly generate 3D assets challenges the traditional notion that high-quality 3D generation requires lengthy processing times. This could lead to new paradigms in real-time content creation and interactive design processes.</p><p class="paragraph" style="text-align:left;">The other model Stability AI released was <a class="link" href="https://huggingface.co/stabilityai/sv4d?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">SV4D</a>, which generates a 4D image matrix from a single-view video of an object. It generates 40 frames at 576x576 resolution and outperforms its predecessor, SV3D. 
</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe04KO7rxkGpHevjQxK-HZars-y5CtK_q-c75qqPhXC1-4T-E3zfKLDii29txuaProkoK1-_9ngpA-8LuRFhdsNG60rWsyEds7xpYDKkA3bM7uHSYPM6fmhbLzGZpr_6yKMeQTaJqJs1RwqAykZUkeig5Di?key=i6VOPgWULlhs1KUjQCjlIA"/><div class="image__source"><span class="image__source_text"><p>SV4D offers better video synthesis than SV3D by boosting video frame consistency. <a class="link" href="https://arxiv.org/pdf/2407.17470?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">But we’re a little surprised to see <b>more model releases</b> since <a class="link" href="https://www.neatprompts.com/p/stability-ai-strategic-move-selling-clipdrop-to-jasper?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Jasper bought ClipDrop</a> (an image creation and editing platform) from Stability earlier in the year. </p><h3 class="heading" style="text-align:left;" id="metas-sam-2-runways-gen-3-alpha-and">Meta&#39;s SAM 2, Runway&#39;s Gen-3 Alpha, and FLUX.1: The Next Wave of Important AI Models</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdnyqXoBA5RluNxV_ZEmybYHM7Dn72kx599kMzrp0T6z-9VVwuXTP-rrnkmlfduVZu35sLcOFtuNryDIAj6zIhnltGCK_e6-U_KU3FRm0GfMeuO_bWEZEfKP_ov1eBhvMhFQd34Z3r85mBwl7-fIAf6Ofby?key=i6VOPgWULlhs1KUjQCjlIA"/><div class="image__source"><span class="image__source_text"><p>How SAM 2 works. 
<a class="link" href="https://ai.meta.com/blog/segment-anything-2/?utm_source=twitter&utm_medium=organic_social&utm_content=reel&utm_campaign=sam2" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Meta has introduced <a class="link" href="https://ai.meta.com/blog/segment-anything-2/?utm_source=twitter&utm_medium=organic_social&utm_content=reel&utm_campaign=sam2" target="_blank" rel="noopener noreferrer nofollow">Segment Anything Model 2 (SAM 2)</a>, which builds upon the success of its predecessor by offering improved accuracy, efficiency, and versatility in <b>segmenting objects within images and videos.</b></p><p class="paragraph" style="text-align:left;">The fact that the model enables annotation <b>8.4 times faster</b> per frame than SAM 1 is pretty impressive. It addresses a critical need for real-time applications in fields like autonomous driving and augmented reality. Note that SAM 1 had a massive impact on annotation companies (most of which pivoted into RLHF), so SAM 2 being able to annotate a lot faster is a big deal.</p><p class="paragraph" style="text-align:left;">Yet another model we saw, this time in the AI video space, was <a class="link" href="https://help.runwayml.com/hc/en-us/articles/30266515017875-Creating-with-Gen-3-Alpha?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Gen-3 Alpha by Runway.</a> It demonstrates enhanced capabilities in <b>understanding and executing complex prompts. 
</b></p><p class="paragraph" style="text-align:left;">Meanwhile, Midjourney is facing some <b>serious competition</b> with the release of <a class="link" href="https://blackforestlabs.ai/announcing-black-forest-labs/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">FLUX.1</a> by Black Forest Labs, which offers better performance in aspects like image detail and style diversity.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXepWfj1Xy8pQyESbs1QfF-FLgxZmdfHwMO1x9EOBebAyCIgEedM4WXyGlONEL5jRUm9ZFIuD_Egwk6yLhtolv7FdGzGco3i-aYsQZcxjanqySVQMRAv5Z1vjZH6qBlMHLKNatczUy95J7fuRzqzdFlx2DUL?key=i6VOPgWULlhs1KUjQCjlIA"/><div class="image__source"><span class="image__source_text"><p>FLUX.1 models achieved the highest ELO. <a class="link" href="https://blackforestlabs.ai/announcing-black-forest-labs/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">These models use a <b>hybrid architecture</b> of multimodal and parallel diffusion transformer blocks. 
They build on flow matching and incorporate parallel attention layers to achieve results that drastically outperform previous models.</p><h3 class="heading" style="text-align:left;" id="nvidia-delays-and-amd-vs-samsung-vs">NVIDIA Delays, and AMD vs Samsung vs Apple: The AI Chip Wars Heat Up</h3><p class="paragraph" style="text-align:left;">We previously saw that Nvidia made its move in the AI chip market by developing a new chip <a class="link" href="https://genai360.beehiiv.com/p/of-llms-and-imos?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">just for the Chinese market</a>. The story continues with <a class="link" href="https://siliconangle.com/2024/07/30/amd-steps-rival-nvidia-ai-chip-sector-huge-revenue-gains-sending-stock-higher/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">AMD </a>reporting significant revenue gains and unveiling plans to <b>compete more aggressively</b> with Nvidia in the AI chip market.</p><p class="paragraph" style="text-align:left;">AMD&#39;s focus on both CPUs and GPUs for AI workloads suggests a multi-pronged approach to AI computing, which could <b>influence future hardware architectures</b> for AI systems. 
</p><p class="paragraph" style="text-align:left;">In other news,<a class="link" href="https://www.sammobile.com/news/samsung-hbm-memory-chip-sales-rise-50-percent/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow"> Samsung Electronics</a> reported a <b>remarkable 50% increase </b>in sales of its High Bandwidth Memory (HBM) chips, a critical component in AI accelerators. </p><p class="paragraph" style="text-align:left;">Samsung&#39;s success in this sector could reshape the competitive landscape of the semiconductor industry, with implications for other <b>major players like Nvidia and AMD.</b></p><p class="paragraph" style="text-align:left;">Nvidia’s chip dominance was further challenged by Apple last week (which, as we know, produces its own chips). Apple <a class="link" href="https://www.reuters.com/technology/apple-says-it-uses-no-nvidia-gpus-train-its-ai-models-2024-07-29/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">went with Google</a> TPUs instead of Nvidia to train its models, despite Nvidia controlling a massive 80% of the AI chip market.</p><p class="paragraph" style="text-align:left;">Specifically, Apple used two types of <b>Google TPUs </b>rather than Nvidia GPUs to train the AI models that will power features on iPhones.</p><p class="paragraph" style="text-align:left;">Nvidia also reported <a class="link" href="https://www.theverge.com/2024/8/3/24212518/nvidia-ai-chip-delay-blackwell-b200-microsoft-amazon-google-openai-meta-artificial-intelligence?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">delays in the next AI chip 
</a>(Blackwell B200) because of a “<b>design flaw</b>”.</p><h3 class="heading" style="text-align:left;" id="git-hub-challenges-hugging-face">GitHub Challenges Hugging Face</h3><p class="paragraph" style="text-align:left;">GitHub introduced <a class="link" href="https://github.blog/news-insights/product-news/introducing-github-models/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">GitHub Models</a>, which is designed to enhance software development workflows, giving Hugging Face some serious competition. These models are integrated into GitHub Copilot and other GitHub products to <b>improve coding efficiency and accuracy.</b></p><p class="paragraph" style="text-align:left;">While Hugging Face excels in providing a wide range of models for various machine learning applications, including NLP, its broader focus may not cater as specifically to the coding needs of developers as GitHub Models does. That breadth is reflected in the fact that Hugging Face was reported to host <b>700,000 LLMs</b> last June.  
</p><p class="paragraph" style="text-align:left;">In terms of users, GitHub has <b>100 million users,</b> while Hugging Face was reported to have <a class="link" href="https://www.forbes.com/companies/hugging-face/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">4 million users in August 2023</a> (a number that is likely higher now), so GitHub can distribute models far more widely than Hugging Face.</p><h3 class="heading" style="text-align:left;" id="open-ai-endorses-senate-bills-and-e">OpenAI Endorses Senate Bills and EU AI Act is in Force</h3><p class="paragraph" style="text-align:left;">There wasn’t much from <a class="link" href="https://techcrunch.com/2024/07/30/openai-endorses-senate-bills-that-could-shape-americas-ai-policy/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">OpenAI </a>in terms of model releases in the past couple of weeks aside from GPT-4o mini, but they endorsed <b>three Senate bills</b> last week. 
These bills include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.congress.gov/bill/118th-congress/senate-bill/4178?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Future of AI Innovation Act</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.dataguidance.com/news/usa-senators-introduce-nsf-ai-education-bill?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">NSF AI Education Act</a></p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.congress.gov/bill/118th-congress/senate-bill/2714/titles?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Create AI Act</a></p></li></ul><p class="paragraph" style="text-align:left;">Probably doesn’t come as a surprise knowing that OpenAI is one of the biggest AI companies out there, so they certainly wouldn’t mind having a say on topics like this to get on the <b>good side </b>of lawmakers.</p><p class="paragraph" style="text-align:left;">In other news, the <a class="link" href="https://genai360.beehiiv.com/p/of-sonnets-and-agents?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">EU AI Act</a> we discussed before has now gone into force. 
It’s worth pointing out that it’ll <b>take some time </b>before all the rules apply, though.</p><p class="paragraph" style="text-align:left;">Specifically, it focuses on ethical development and deployment of <a class="link" href="https://www.cryptopolitan.com/eu-ai-act-now-in-force/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">AI within the EU,</a> categorizing AI applications based on their <b>risk level.</b> Higher risk means stricter rules, but it won’t affect the popular chatbots we all know and love like ChatGPT since most of them are considered to be minimal risk.</p><p class="paragraph" style="text-align:left;">It’s one of the <b>first comprehensive legal frameworks</b> for AI, but it’s unlikely to be the last as we move forward into a future with more powerful models.</p><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">From MindSearch&#39;s innovative approach to complex information retrieval to the practical advancements in video object detection, we&#39;re seeing AI tackle increasingly nuanced and real-world challenges. 
On-device models also continued to advance with Apple’s foundation language model paper, showing <b>impressive results across multiple benchmarks.</b></p><h3 class="heading" style="text-align:left;" id="apples-on-device-ai-how-small-model">Apple&#39;s On-Device AI: How Small Models Achieve Big Results</h3><p class="paragraph" style="text-align:left;">Aside from the news about Apple not using Nvidia chips, they also introduced <b>two</b> new <a class="link" href="https://arxiv.org/abs/2407.21075?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">foundation language models</a> designed to power Apple Intelligence features across iOS, iPadOS, and macOS. </p><p class="paragraph" style="text-align:left;">The models, AFM-on-device (~3 billion parameters) and AFM-server (a larger server-based model), aim to perform a wide range of tasks efficiently, accurately, and responsibly while addressing the challenges of running AI on <b>consumer devices and maintaining user privacy.</b></p><p class="paragraph" style="text-align:left;">To achieve this, Apple employed <b>several innovative techniques:</b></p><ul><li><p class="paragraph" style="text-align:left;"><b>Architecture optimization</b>: The models use a dense decoder-only architecture with improvements like grouped-query attention and RoPE positional embeddings for long-context support.</p></li><li><p class="paragraph" style="text-align:left;"><b>Efficient training</b>: A three-stage pre-training process (core, continued, and context-lengthening) using a diverse, high-quality data mixture and custom optimizer.</p></li><li><p class="paragraph" style="text-align:left;"><b>Adapter-based fine-tuning:</b> LoRA adapters for task-specific optimization without changing the base model.</p></li></ul><p class="paragraph" style="text-align:left;">AFM-on-device outperforms larger 
open-source models <b>like Mistral-7B</b> in instruction following and writing tasks, while AFM-server achieves competitive performance against GPT-3.5 and GPT-4 in various benchmarks. </p><p class="paragraph" style="text-align:left;">This means GPT-3.5 isn’t just facing competition from Google’s Gemma 2 2B, but also from <b>Apple’s AFM-on-device model.</b></p><h3 class="heading" style="text-align:left;" id="can-ai-mimic-human-cognitive-proces">Can AI Mimic Human Cognitive Processes for Complex Web Searches?</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfS2pdmKLyx6lJPJo7ZaKKlC6x5MANFl1Wrmi78tIBeM7-sP-HYSagVhQ8WFnm4BtnLOxMrl91_ltDJQACG4ZpCqcLVG-6YHooy29sO5tUZDQB3hOU0DOFoKCLLcQDn4Us6ZUc07sWGjosWuoVXKmFAhpw?key=i6VOPgWULlhs1KUjQCjlIA"/><div class="image__source"><span class="image__source_text"><p>Framework of MindSearch. <a class="link" href="https://arxiv.org/pdf/2407.20183v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2407.20183v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">MindSearch</a> aims to overcome the limitations of current AI search methods, which often struggle with accurately retrieving and integrating information for complex queries that require multi-step reasoning and in-depth analysis.</p><p class="paragraph" style="text-align:left;">To achieve this, the team developed a simple yet effective LLM-based multi-agent framework consisting of a <b>WebPlanner and WebSearcher. 
</b>The WebPlanner models the human mind&#39;s multi-step information seeking process as a dynamic graph construction, decomposing user queries into atomic sub-questions.</p><p class="paragraph" style="text-align:left;">Meanwhile, the WebSearcher performs <b>hierarchical information retrieval </b>with search engines and collects valuable information for the WebPlanner. </p><p class="paragraph" style="text-align:left;">Notably, responses from MindSearch based on InternLM2.5-7B are preferred by humans <b>over those from </b>GPT-4o and Perplexity. It might change how we approach complex information retrieval and integration tasks in various domains.</p><h3 class="heading" style="text-align:left;" id="feature-selection-and-aggregation-b">Feature Selection and Aggregation Boost Accuracy and Speed in Video Object Detection</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcl7Rd_JA3RusjAxRSqejLCAS6P1JadKAzVFFUocCcwVvAj0nqMJJvsp4x_LImuy9s6eWpoBfcEPJBIjZtgVMijT82z7gBCW6yck7xOWj57LsqXba7kX4J6M17IjiwUT-2qXeeKUWXfZVjAtpB115WFopOB?key=i6VOPgWULlhs1KUjQCjlIA"/><div class="image__source"><span class="image__source_text"><p>Schematic of the framework. 
<a class="link" href="https://arxiv.org/pdf/2407.19650v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers from Tianjin University have introduced a different approach to <a class="link" href="https://arxiv.org/pdf/2407.19650v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">video object detection (VOD)</a> that addresses the challenges of <b>high across-frame variation</b> in object appearance and diverse deterioration in video frames. </p><p class="paragraph" style="text-align:left;">To achieve this, the team developed <b>two key components: </b></p><ul><li><p class="paragraph" style="text-align:left;">A Feature Selection Module (FSM) to reject low-quality candidates and reduce computational expense</p></li><li><p class="paragraph" style="text-align:left;">A Feature Aggregation Module (FAM) that uses feature similarity measurements to guide the aggregation process.</p></li></ul><p class="paragraph" style="text-align:left;">The approach incorporates an <b>average pooling operator</b> on reference features to alleviate shortcomings of commonly-used cosine similarity.</p><p class="paragraph" style="text-align:left;">Results demonstrate that their model achieves a <b>new record performance</b> of 92.9% AP50 at over 30 FPS on the ImageNet VID dataset using a single 3090 GPU. 
It offers a practical solution for video object detection that balances accuracy and speed, making it suitable for large-scale or real-time applications.</p><h3 class="heading" style="text-align:left;" id="new-benchmark-for-ai-in-relational-">New Benchmark for AI in Relational Databases</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfNeWJL6_6m-pHpvGAJdWXqyKSPP3P5F2tfIs84h_li9l3_Mjnb4NkiToSR9w8pj6nKlTPgBMoP8fYzh7aKlL-3f74A-CvypKnMUuDNSWl1f_C9HEuI-IyTgSKCV11Vpkx1pjuaPD8DeRhD9Oodv9wruRU?key=i6VOPgWULlhs1KUjQCjlIA"/><div class="image__source"><span class="image__source_text"><p>RELBENCH overview. <a class="link" href="https://arxiv.org/pdf/2407.20060v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers from Stanford University and <a class="link" href="https://Kumo.AI?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Kumo.AI</a> have introduced <a class="link" href="https://arxiv.org/pdf/2407.20060v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">RELBENCH</a>, a comprehensive benchmark that aims to address the challenge of doing <b>machine learning over relational databases</b>, where building predictive models has traditionally required extensive manual feature engineering to flatten multi-table data.</p><p class="paragraph" style="text-align:left;">To achieve this, the benchmark is built around Relational Deep Learning (RDL), which represents a relational database as a temporal, heterogeneous graph so that graph neural networks can learn directly across tables. </p><p class="paragraph" style="text-align:left;">The benchmark introduces features such as: </p><ul><li><p class="paragraph" style="text-align:left;">Realistic databases spanning domains like e-commerce, Q&amp;A forums, medicine, and sports</p></li><li><p class="paragraph" style="text-align:left;">Predictive tasks covering entity classification, entity regression, and recommendation</p></li><li><p class="paragraph" style="text-align:left;">Temporal, leakage-free train/validation/test splits</p></li></ul><p class="paragraph" style="text-align:left;">Results showed that RDL models trained on RELBENCH outperform traditional feature engineering approaches, reducing human work hours by <b>96% and lines of code by 94% on average. </b></p><p class="paragraph" style="text-align:left;">It’s important because it provides a <b>foundational infrastructure</b> for future research into RDL, which could speed up the development and validation of machine learning models across the many domains that rely on relational databases.</p><h2 class="heading" style="text-align:left;" id="frameworks-we-love">Frameworks We Love</h2><p class="paragraph" style="text-align:left;">Some frameworks that caught our attention in the last week include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://pytorch.org/blog/torchchat-local-llm-inference/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Torchchat</a>: A library by Meta AI that enables running large language models like Llama 3 and 3.1 locally on laptops, desktops, and mobile devices with high performance</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" 
href="https://github.com/mckaywrigley/ai-router-chat?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">AI router chat</a>: Personal chatbot arena that adapts to the user</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/opendatalab/PDF-Extract-Kit?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">PDF Extract kit</a>: Comprehensive toolkit for high-quality content extraction from PDF documents.</p></li></ul><p class="paragraph" style="text-align:left;">If you want your framework to be featured here, reply to this email and say hi :) </p><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">Last week’s discussions gave us plenty to think about when it comes to building and using large-scale AI systems. From GPT-4o&#39;s <b>impressive 64K output </b>leap to a pharma CIO&#39;s candid take on Microsoft&#39;s Copilot, these discussions highlight how the AI world is grappling with both technical advances and real-world applications. </p><h3 class="heading" style="text-align:left;" id="gpt-4-os-64-k-leap">GPT-4o&#39;s 64K Leap</h3><p class="paragraph" style="text-align:left;">OpenAI&#39;s experimental release of GPT-4o with a 64K output capability has <a class="link" href="https://twitter.com/simonw/status/1818181750704804311?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">sparked discussion in the AI community</a> about the <b>future of large-scale</b> language model applications. 
</p><p class="paragraph" style="text-align:left;">The 64K output opens up new possibilities for tasks like full document translation, comprehensive structured data extraction, and long-form content generation. At <b>$6/$18 per million input/output tokens,</b> the new alpha model is slightly more expensive than standard GPT-4o ($5/$15 per million input/output tokens). But it’s a little difficult to tell if increased output capability is worth the additional cost for different use cases.</p><p class="paragraph" style="text-align:left;">It could lead to more <b>efficient workflows </b>in industries like translation, data analysis, and content creation. However, it also raises important questions about the cost-effectiveness of AI solutions and the need for careful consideration of use cases that truly benefit from such extended output capabilities.</p><h3 class="heading" style="text-align:left;" id="why-was-the-copilot-ai-deal-cancell">Why Was the Copilot AI Deal Cancellation a Big Deal?</h3><p class="paragraph" style="text-align:left;">A recent revelation from a <a class="link" href="https://www.businessinsider.com/pharma-cio-cancelled-microsoft-copilot-ai-tool-2024-7?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">pharmaceutical company&#39;s CIO</a> provided some food for thought about the real-world value of enterprise AI tools. 
The CIO&#39;s decision to cancel their <b>Microsoft 365 Copilot subscription</b> after a <b>six-month trial period</b> raises important questions about the gap between AI&#39;s promised potential and its current practical applications.</p><p class="paragraph" style="text-align:left;">The CIO&#39;s comparison of Copilot&#39;s slide-generation capabilities to &quot;middle school presentations&quot; highlights the need for AI tools to deliver <b>tangible, high-quality results </b>that justify their cost.</p><p class="paragraph" style="text-align:left;">With Copilot doubling the cost of Microsoft 365 licenses, there&#39;s a growing debate about how to price AI capabilities in a way that aligns with their perceived value. Microsoft&#39;s massive investments in AI infrastructure raise questions about how tech giants will recoup these costs and what it means for <b>future pricing and product strategies. </b></p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in AI</h2><p class="paragraph" style="text-align:left;">It’s a little unusual to be mentioning <b>Reddit </b>in an AI newsletter, but they acquired a company called Memorable AI last week. In terms of successful funding rounds, Aisles and Sybill secured $500 million and $11 million, respectively.</p><h3 class="heading" style="text-align:left;" id="aisles-secures-500-million-in-priva">Aisles Secures $500 Million in Private Equity Round and Introduces Aisles Enterprises</h3><p class="paragraph" style="text-align:left;">Aisles has secured <a class="link" href="https://www.accesswire.com/895388/aisles-unveils-aisles-enterprise-new-branch-raises-500m-in-private-equity-round-to-invest-in-tech-startups?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">$500 million</a> in a private equity round to fuel its expansion into tech startup investments. 
The company introduced <b>Aisles Enterprise</b>, a new branch dedicated to identifying and supporting promising AI and tech startups.</p><h3 class="heading" style="text-align:left;" id="sybil-secures-11-million-in-seed-fu">Sybill Secures $11 Million in Seed Funding Round</h3><p class="paragraph" style="text-align:left;">Sybill, a startup developing an AI assistant for salespeople, has secured <a class="link" href="https://techcrunch.com/2024/07/31/sybill-raises-11m-for-its-ai-assistant-that-helps-salespeople-reduce-administrative-burden/?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAACC3ha9owFhbSK5q8jfGifxti0wLrQFxoYArZ9AxOt9tBgKEa0OwuBUKDF3rDsvG34AOZYSqlT9vbqk6II4gZsmdouXb1i5cdJ-up4emvcNNFdOvjU8dfDLlqI8Dw3ualOkkNAbJ4yCEmPA3TrLU_CAiBngQvHlo_CXeGh0xSvGi&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">$11 million </a>in seed funding, led by Khosla Ventures. The company&#39;s AI tool aims to reduce the <b>administrative burden</b> on sales teams by automating tasks like note-taking, data entry, and follow-up scheduling.</p><h3 class="heading" style="text-align:left;" id="reddit-acquires-memorable-ai">Reddit Acquires Memorable AI</h3><p class="paragraph" style="text-align:left;">Reddit has made its first acquisition since going public in March, purchasing ad-optimization company <a class="link" href="https://c.newsnow.co.uk/A/1238698270?-42899%3A29419=&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=gemma-2-2b-gpt-3-5-open-source-flux-1-vs-midjourney-github-takes-on-hugging-face" target="_blank" rel="noopener noreferrer nofollow">Memorable AI </a>for an undisclosed amount. 
Memorable AI uses artificial intelligence to<b> analyze audience reactions</b> to content, helping to determine what resonates with specific groups.</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=1f4e8aec-38c7-4e46-8cde-885a70dcf367&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>Second GPT-4 Level Open Model, International Mathematical Olympiad vs DeepMind&#39;s LLM + RLHF</title>
  <description>Plus, Nvidia Makes Chips for China</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/35982600-d9d3-4ef5-96e8-44756635d06e/mikayelh_A_3d_isometric_white_and_orange_cubes_formin_the_let_5c2e3afc-e64e-4894-9bd6-4722c6e0c26c_2.png" length="974103" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/of-llms-and-imos</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/of-llms-and-imos</guid>
  <pubDate>Tue, 30 Jul 2024 13:33:42 +0000</pubDate>
  <atom:published>2024-07-30T13:33:42Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Before we start, share this week&#39;s news with a friend or a colleague:</p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/p/of-llms-and-imos?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf"><span class="button__text" style=""> Share the newsletter </span></a></div><h2 class="heading" style="text-align:left;" id="key-takeaways">Key Takeaways</h2><ul><li><p class="paragraph" style="text-align:left;">Meta&#39;s Llama 3.1 demonstrates <b>advanced reasoning capabilities</b>, solving complex math puzzles that have stumped other AI models, including GPT-4.</p></li><li><p class="paragraph" style="text-align:left;">Mistral releases Large 2, a <b>123 billion parameter model</b>, claiming performance on par with offerings from OpenAI and Meta in code generation, mathematics, and reasoning.</p></li><li><p class="paragraph" style="text-align:left;">Google DeepMind unveils AlphaGeometry, an AI system capable of solving International Mathematical Olympiad (IMO) geometry problems at a <b>silver medal level.</b></p></li><li><p class="paragraph" style="text-align:left;">RT-DETRv2 improves upon RT-DETR by offering greater flexibility in multi-scale feature extraction and <b>achieving enhanced performance</b> without speed loss across various detector sizes.</p></li><li><p class="paragraph" style="text-align:left;">LazyLLM is a dynamic token pruning method for<b> efficient long context LLM inference</b>, accelerating the pre-filling stage of Llama 2 7B by 2.34x while maintaining accuracy.</p></li></ul><p class="paragraph" style="text-align:left;"><i>Got forwarded this newsletter? 
Subscribe below👇</i></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/subscribe?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf"><span class="button__text" style=""> Subscribe </span></a></div><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">From Meta&#39;s Llama 3.1 solving complex math puzzles to Mistral&#39;s Large 2 challenging industry giants, we&#39;re seeing model capabilities and applications grow rapidly. Meanwhile, controversies around data usage and the race for market dominance remind us that the path to AI advancement isn’t <b>without its ethical and practical challenges</b>.</p><h3 class="heading" style="text-align:left;" id="llamas-math-prowess-mistrals-leap-a"><span style="color:rgb(67, 67, 67);">Llama&#39;s Math Prowess, Mistral&#39;s Leap, and NVIDIA&#39;s Miniature Marvels</span></h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf0jvHaARKvu401oKCeilhGooqRCFdY_inXqy2KeLyYwILyWcwj8x8sawsAnsdOiDJxGAMYkmG7lhoWDpXhedO5C1HJdoIuAk3EHy-dnuuKkkn45hQ4lDiup1QfvSytcM4w-C0Rt5RMdps1x5rcnXjozYra?key=y-BK4rgat6gAFF-1TTPy3A"/><div class="image__source"><span class="image__source_text"><p>Mistral Large 2 outperforms Llama 3.1 at code generation and math. 
<a class="link" href="https://mistral.ai/news/mistral-large-2407/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">We previously covered how <a class="link" href="https://genai360.beehiiv.com/p/of-llamas-and-slms?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">Llama 3.1</a> is attracting <b>considerable attention</b> due to its impressive benchmark performances and potential to challenge other flagship models.</p><p class="paragraph" style="text-align:left;">Last week, <a class="link" href="https://www.linkedin.com/posts/omarsar_this-is-wild-llama-31-405b-instruct-finally-activity-7221566292485378049-OV2z/?utm_source=share&utm_medium=member_ios" target="_blank" rel="noopener noreferrer nofollow">Llama 3.1</a> continued to gain attention with a notable demonstration. In it, Llama 3.1 successfully solved a <b>complex math puzzle</b> involving candle lengths, displaying advanced logical reasoning that has left other AI models stumped.</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/5f4108bf-400a-48c7-afb2-59edac9417a4/image.png?t=1722343745"/><div class="image__source"><span class="image__source_text"><p>A tough math problem that Llama 3.1 could handle while other models were unable to solve it. 
<a class="link" href="https://www.linkedin.com/posts/omarsar_this-is-wild-llama-31-405b-instruct-finally-activity-7221566292485378049-OV2z/?utm_source=share&utm_medium=member_ios" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">The model correctly deduced that the<b> longest remaining candle</b> (option 3 in the image above) was the first to be blown out, a task that even GPT-4 has struggled with in similar tests.</p><p class="paragraph" style="text-align:left;">But Meta wasn’t the only one to recently release a<b> state-of-the-art LLM. </b></p><p class="paragraph" style="text-align:left;">A day after Meta&#39;s Llama 3.1 announcement, Mistral released its new flagship model, Large 2, claiming performance <b>on par</b> with the latest offerings from OpenAI and Meta in code generation, mathematics, and reasoning. </p><p class="paragraph" style="text-align:left;">With 123 billion parameters, <a class="link" href="https://mistral.ai/news/mistral-large-2407/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">Large 2</a> reportedly outperforms Meta&#39;s recently released Llama 3.1 405B in <b>code generation and math tasks</b>, despite having less than a third of the parameters. The creators also claim SOTA for function calling.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfmu6dQuqwmBrnkJ5zW2Whx_XioSjbCXeW5r6XXN4FRk5AQh6tDSUeAa-iIXGZ1g1FsPFsWAjTIifZ_AmJQeMD-isV_lxkwu30BOWHv0G-SCeB2qeOzFelJ1RxrR04tTtCNTyHTxhZQhdRh3mw4uBb0ozXy?key=y-BK4rgat6gAFF-1TTPy3A"/><div class="image__source"><span class="image__source_text"><p>Large 2 and Llama 3.1 are both close to GPT-4o in terms of performance. 
<a class="link" href="https://twitter.com/emollick/status/1816138569292951776?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Mistral emphasizes <b>reduced hallucination issues</b> and improved multilingual support, covering 12 languages and 80 coding languages. The model features a 128,000 token context window and is available on major cloud platforms. </p><p class="paragraph" style="text-align:left;">Of course, you can’t have an AI news roundup without mentioning Nvidia somewhere. They introduced <a class="link" href="https://github.com/NVlabs/Minitron?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">Minitron</a>, a new family of small language models (SLMs) derived from their larger Nemotron-4 15B model. The Minitron models, available in<b> 8B and 4B parameter sizes</b>, are created through a combination of pruning and knowledge distillation techniques. </p><p class="paragraph" style="text-align:left;">This approach significantly<b> reduces the computational cost of training</b>, requiring up to 40x fewer tokens and resulting in a 1.8x reduction in overall compute costs for the model family. </p><p class="paragraph" style="text-align:left;">Despite their smaller size, Minitron models demonstrate competitive performance, with up to 16% improvement in MMLU scores compared to models trained from scratch, and <b>comparable results </b>to other community models like Mistral 7B, Gemma 7B, and Llama-3 8B. 
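</p><p class="paragraph" style="text-align:left;">The pruning-plus-distillation recipe behind these models can be sketched in a few lines of Python. This is a generic illustration of the technique, not Nvidia&#39;s actual Minitron code; the keep ratio, temperature, and toy weights below are invented for the example.</p>

```python
import numpy as np

def magnitude_prune(weights, keep_ratio=0.5):
    """Zero out the smallest-magnitude weights, keeping only `keep_ratio` of them."""
    threshold = np.quantile(np.abs(weights), 1.0 - keep_ratio)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9))))

# Toy example: prune half the weights of a tiny layer, then measure how far the
# pruned "student" has drifted from the original "teacher" on one input.
W = np.array([[0.9, -0.05], [0.1, -0.8]])
x = np.array([1.0, 1.0])
student = magnitude_prune(W, keep_ratio=0.5)
loss = distillation_loss(student @ x, W @ x)  # what distillation training minimizes
```

<p class="paragraph" style="text-align:left;">In practice the pruned student is then trained to minimize this loss against the teacher&#39;s outputs, which is why far fewer training tokens are needed than when training a model of the same size from scratch.</p><p class="paragraph" style="text-align:left;">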
</p><h3 class="heading" style="text-align:left;" id="deep-mind-advances-math">DeepMind Advances Math </h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdXwvCd1X5umsqiW5UBY8tR3v5UDZYSTYygLgZsRAxqWUPn_ZZ1MqecjjywvyiqQiFOJ0Yi4RjaINeaVJ1RTlF5yvVNykQDajSXjFv_sF9ZQuqcSNt6BSfxtWKGsO29qT77x2Oxu5TXeDW6cvWTgfayFpXE?key=y-BK4rgat6gAFF-1TTPy3A"/><div class="image__source"><span class="image__source_text"><p>AlphaProof and AlphaGeometry 2 together earned 28 out of 42 points, putting them on the same level as a silver medalist. <a class="link" href="https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/?utm_source=x&utm_medium=social&utm_campaign=&utm_content=" target="_blank" rel="noopener noreferrer nofollow">(Source) </a></p></span></div></div><p class="paragraph" style="text-align:left;">Google DeepMind unveiled <a class="link" href="https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">AlphaProof, a new reinforcement-learning based system for formal math reasoning, and AlphaGeometry 2, an improved version of its geometry-solving system</a>. Together, the two present an AI system capable of solving International Mathematical Olympiad (IMO) problems at a <b>silver medal level (4 out of 6 problems).</b></p><p class="paragraph" style="text-align:left;">Traditional AI systems have struggled with formal mathematical proofs due to limited training data, while language models, despite access to vast data, often produce plausible but incorrect proofs. </p><p class="paragraph" style="text-align:left;">AlphaProof solves this problem by bridging the gap, combining a fine-tuned Gemini model with reinforcement learning techniques similar to AlphaZero.
Using the formal language Lean, AlphaProof generates verifiable proofs and continuously improves by learning from its own verified solutions. Read more on what&#39;s special about AlphaProof and AlphaGeometry 2 <a class="link" href="https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">here</a>.</p><h3 class="heading" style="text-align:left;" id="the-middle-easts-4-billion-dollar-b">The Middle East&#39;s 4 Billion-Dollar Bet and Nvidia&#39;s Chinese Gambit</h3><p class="paragraph" style="text-align:left;">Abu Dhabi is set to create a <b>major player</b> in the AI and space technology sectors with the merger of Yahsat and Bayanat AI to form <a class="link" href="https://www.zawya.com/en/business/technology-and-telecom/uaes-4bln-space42-set-to-become-menas-biggest-ai-powered-space-tech-firm-wkphlckp?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">Space42. </a></p><p class="paragraph" style="text-align:left;">The new entity, <b>valued at $4 billion</b>, aims to become the Middle East and North Africa&#39;s largest AI-powered space technology company. Space42 will integrate satellite communications and business intelligence to position itself for both regional and global opportunities. </p><p class="paragraph" style="text-align:left;">Alongside the new Minitron models, Nvidia also made moves in the AI chip market. They are reportedly developing a new AI chip <b>specifically tailored</b> for the Chinese market, aiming to comply with U.S. export restrictions while maintaining their presence in this crucial market.
A couple of weeks ago, OpenAI was in talks to develop their <a class="link" href="https://genai360.beehiiv.com/p/of-llamas-and-slms?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">own AI chip.</a></p><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.cryptopolitan.com/nvidia-developing-version-ai-chip-china/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">Nvidia’s newest chip</a>, known internally as the H20, is a <b>scaled-down</b> version of Nvidia&#39;s flagship H100 AI accelerator, designed to meet the U.S. government&#39;s performance thresholds for exports to China. </p><p class="paragraph" style="text-align:left;">This move comes as Nvidia seeks to balance regulatory compliance with its business interests in China, which accounted for <b>21% of its revenue </b>in the most recent fiscal year. 
The H20 is expected to be part of a new product line that includes the L20 and L2 chips, all aimed at addressing the growing demand for AI hardware in China while navigating complex geopolitical challenges.</p><h3 class="heading" style="text-align:left;" id="runway-accused-of-allegedly-using-p">Runway Accused of Allegedly Using Publicly Available YouTube Videos</h3><p class="paragraph" style="text-align:left;">Video generation startup <a class="link" href="https://siliconangle.com/2024/07/25/latest-ai-training-drama-runway-accused-using-publicly-available-youtube-videos/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">Runway AI is facing controversy</a> over its AI training practices. According to a report by 404 Media, the company allegedly used thousands of publicly available YouTube videos to train its <b>Gen-3 Alpha model</b>, which generates 10-second videos. 
</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfyektyj37Cb4oqZNJJymNRJfDayfVeI9vU-ag_8GOEABETG4cFo9PmDZ4htrJaOWFJAiiUqe7QvYJZOveaTHzYIJl5iqGTp1Pc2uCtDRySrmtzGqbGlSEyNZBDtCysaxTohn9p11UeJpvO5h8_udT8jWvL?key=y-BK4rgat6gAFF-1TTPy3A"/><div class="image__source"><span class="image__source_text"><p>Tech reviewer MKBHD pointed out that many of his videos were used by Runway to train its video generator.<a class="link" href="https://twitter.com/MKBHD/status/1816487078265344313?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow"> (Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">The accusation is based on a leaked internal spreadsheet, which suggests Runway scraped content from popular YouTube creators, brands, and even pirated films. While this has sparked debate about the ethics and legality of using<b> publicly available content</b> for AI training, experts note that the legal landscape around such practices remains unclear.</p><p class="paragraph" style="text-align:left;">This isn’t the first time we’ve heard of this type of accusation, as <a class="link" href="https://genai360.beehiiv.com/p/new-chips-and-controversies?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">OpenAI reportedly used over a million hours</a> of YouTube videos to train <b>GPT-4 last April.</b></p><h3 class="heading" style="text-align:left;" id="how-machines-are-calling-the-bluff-">How Machines are Calling the Bluff on Human Poker Champions</h3><div class="image"><img alt="" class="image__image" style=""
src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXfDSPm2IX2Ff-Lks-OCoOyRv8esGXkgnxYUHt6WhhkYHO6vO3SAZmW7ajyY3DNfjq0q7KNoiFQO_RtOtGqVpIA-ysfHvniDZWtavxQJxDAt_7WUvrqg_UM0xAF2KWjPdPV2QNalB0epiLTntO_XABG-tRfC?key=y-BK4rgat6gAFF-1TTPy3A"/><div class="image__source"><span class="image__source_text"><p>How Pluribus’ blueprint strategy improves while training on a 64-core CPU. <a class="link" href="https://ai.meta.com/blog/pluribus-first-ai-to-beat-pros-in-6-player-poker/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Facebook AI and Carnegie Mellon University have developed <a class="link" href="https://ai.meta.com/blog/pluribus-first-ai-to-beat-pros-in-6-player-poker/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">Pluribus</a>, the first AI to consistently beat elite human professionals in<b> six-player no-limit Texas Hold&#39;em poker.</b></p><p class="paragraph" style="text-align:left;">We’ve seen AI win against the World Go champion before, but the fact that AI was able to win in poker is pretty impressive. It&#39;s the first time an AI system has outperformed humans in a complex game <b>with more than two players or teams. </b></p><p class="paragraph" style="text-align:left;">Pluribus won decisively against top poker professionals, including World Series of Poker champions, earning an <b>estimated $1,000 per hour</b> against five human players.</p><p class="paragraph" style="text-align:left;">The AI&#39;s success stems from its ability to handle hidden information and multiple players efficiently. 
Pluribus uses self-play to develop its strategy without human input and employs a <b>novel search algorithm </b>that looks only a few moves ahead rather than to the end of the game. </p><p class="paragraph" style="text-align:left;">Pluribus was trained using relatively modest computing resources - <b>less than $150</b> worth of cloud computing. Contrary to popular belief, you don’t need extensive computational resources to pull off impressive feats like this.</p><h3 class="heading" style="text-align:left;" id="open-a-is-search-gpt-met-with-skept">OpenAI&#39;s SearchGPT Met With Skepticism<span style="color:rgb(67, 67, 67);"> </span>and Microsoft’s AI-Powered Feature for Search Results</h3><p class="paragraph" style="text-align:left;">OpenAI has introduced <a class="link" href="https://techcrunch.com/2024/07/25/with-google-in-its-sights-openai-unveils-searchgpt/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">SearchGPT</a>, a<b> new search feature</b> designed to provide &quot;timely answers&quot; to questions using web sources. The prototype, powered by GPT-3.5, GPT-4, and GPT-4o models, is currently available to a limited group of users and publishers. </p><p class="paragraph" style="text-align:left;">While OpenAI positions SearchGPT as a more responsible AI search tool with clear attribution and publisher collaboration, the announcement was met with skepticism from industry observers. 
</p><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8ab17eb2-826d-4e16-8560-735659afb63a/image.png?t=1722343419"/><div class="image__source"><span class="image__source_text"><p>SearchGPT gets its own demo… wrong</p></span></div></div><p class="paragraph" style="text-align:left;">Funnily, just like <a class="link" href="https://www.theverge.com/2023/2/8/23590864/google-ai-chatbot-bard-mistake-error-exoplanet-demo?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">Bard by Google back in the day</a> (RIP), the first product demo shows <a class="link" href="https://www.theverge.com/2024/7/25/24206488/openais-searchgpt-demo-results-arent-actually-that-helpful?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">a factual error</a>. When a mock user types “music festivals in boone north carolina in august,” SearchGPT pulls up a list of festivals, the first being An Appalachian Summer Festival. The tool then tells the user the festival runs on dates when it is actually closed: the real dates are 6/29 to 7/27, but SearchGPT lists them as<b> 7/29 to 8/16</b>.
</p><p class="paragraph" style="text-align:left;">In the meantime, Microsoft also expanded its AI offerings by rolling out a beta of a <a class="link" href="https://www.engadget.com/microsoft-is-adding-ai-powered-summaries-to-bing-search-results-203053790.html?utm_source=tldrai&guccounter=1" target="_blank" rel="noopener noreferrer nofollow">new AI-powered feature </a>for Bing search results, which provides <b>concise summaries</b> of web pages directly in the search results.  </p><p class="paragraph" style="text-align:left;">This feature is powered by GPT-4 and provides users with key information from websites without the need to click through.</p><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">Multi-agent research saw further advancements last week, handling issues with scalability. Moreover, <b>autonomous driving</b> saw some important progress with real-time object detection and training data generation.</p><h3 class="heading" style="text-align:left;" id="scaling-multi-agent-simulations-to-">Scaling Multi-Agent Simulations to Millions With AgentScope</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXe5JV-jMhri1Rq3qaOoUbIWOwy-nR64esCV9NOC5F9HcjISi_pmwzZN7mEWaThzjQuYsbGboP6HofpCDbsxSNTJlvxFJ7PMnLlxgnHJ8Bqb0UKzYEgM5LYF7-OuUmMEDgzTqaq5pkXfLWmnXt5-Qv4jUWeF?key=y-BK4rgat6gAFF-1TTPy3A"/><div class="image__source"><span class="image__source_text"><p>The multi-layer environment structure. 
<a class="link" href="https://arxiv.org/pdf/2407.17789v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2407.17789v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">AgentScope</a> addresses key challenges in<b> conducting large-scale</b> multi-agent simulations, including limited scalability, insufficient agent diversity, and effort-intensive management processes. </p><p class="paragraph" style="text-align:left;">To tackle these issues, the authors:</p><ul><li><p class="paragraph" style="text-align:left;">Developed an actor-based distributed mechanism for improved scalability and efficiency</p></li><li><p class="paragraph" style="text-align:left;">Provided flexible environment support for various real-world scenarios</p></li><li><p class="paragraph" style="text-align:left;">Integrated tools for creating diverse agent backgrounds and managing large numbers of agents across multiple devices
</p></li></ul><p class="paragraph" style="text-align:left;">Their experiments demonstrate the ability to conduct simulations involving 1 million agents using only 4 devices, showing drastic improvements in <b>scalability and efficiency</b> compared to existing approaches.</p><p class="paragraph" style="text-align:left;">By providing a comprehensive framework that addresses<b> both technical and usability challenges</b>, AgentScope lets researchers and developers conduct more realistic and complex simulations involving a massive number of diverse agents.</p><p class="paragraph" style="text-align:left;">This could lead to <b>valuable insights </b>in fields such as social science, economics, and urban planning, where understanding the collective behavior of large populations is key.</p><h3 class="heading" style="text-align:left;" id="enhancing-real-time-object-detectio">Enhancing Real-Time Object Detection Performance with RT-DETRv2</h3><p class="paragraph" style="text-align:left;"></p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcg-JD-g21jC64KjmrtUQFKXdlRLM4SntZgEuzIsGx7z8NlkirpTQXv5u5R6zTLcdbsoBHehrdKorUyd46sy46imjVIdXYM_4jU1daWK14Bq85kL-611d533Wzvy0P-a0rfDJks4ZwIHSI2f6xog-rpJ1Ev?key=y-BK4rgat6gAFF-1TTPy3A"/><div class="image__source"><span class="image__source_text"><p>RT-DETRv2 shows notable improvements over its predecessor. 
<a class="link" href="https://arxiv.org/pdf/2407.17140v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2407.17140v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">RT-DETRv2</a> addresses key challenges in<b> real-time object detection</b>, including the need for greater flexibility in multi-scale feature extraction, deployment constraints associated with DETRs, and performance optimization without sacrificing speed.</p><p class="paragraph" style="text-align:left;">Researchers at Peking University proposed setting <b>distinct numbers of sampling points</b> for features at different scales in deformable attention, introducing an optional discrete sampling operator to replace the grid_sample operator, and implementing dynamic data augmentation.</p><p class="paragraph" style="text-align:left;">The results show that RT-DETRv2 provides an improved baseline for RT-DETR with increased flexibility and practicality. It achieves<b> enhanced performance</b> without speed loss across various detector sizes.</p><p class="paragraph" style="text-align:left;">By addressing deployment constraints and optimizing training strategies, RT-DETRv2 pushes the boundaries of what&#39;s possible in real-time object detection. 
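</p><p class="paragraph" style="text-align:left;">To make the deployment point concrete, here is a minimal NumPy sketch (ours, not the authors&#39; implementation) contrasting bilinear interpolation, the expensive part of the grid_sample operator, with the kind of nearest-cell discrete sampling RT-DETRv2 can optionally switch to:</p>

```python
import numpy as np

def bilinear_sample(feature, x, y):
    """Bilinear interpolation at a fractional location (what grid_sample does)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feature[y0, x0] +
            wx * (1 - wy) * feature[y0, x1] +
            (1 - wx) * wy * feature[y1, x0] +
            wx * wy * feature[y1, x1])

def discrete_sample(feature, x, y):
    """Round to the nearest cell and index directly -- no interpolation,
    so the operator is much easier to support in deployment runtimes."""
    return feature[int(round(y)), int(round(x))]

# A tiny 4x4 feature map: bilinear sampling blends four neighboring cells,
# while discrete sampling just picks one of them.
feat = np.arange(16, dtype=float).reshape(4, 4)
```

<p class="paragraph" style="text-align:left;">Rounding to a discrete cell trades a little sampling precision for an operator that nearly every inference runtime supports out of the box.</p><p class="paragraph" style="text-align:left;">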
It could certainly impact a wide range of applications, <b>from autonomous driving to video surveillance.</b></p><h3 class="heading" style="text-align:left;" id="generating-diverse-training-data-fo">Generating Diverse Training Data for Autonomous Driving</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXdNRE_mZnNsbOzVnE9xdGNJmy-c1rpOZ4WhRBp6TQ04LnRSyAzkX6xx8u0YS3IuMrXl-kXMlSytQ3zB_Rl6tiMI8bEIsCyMxJp9Mm_d4ne1aCUTueP21QNEfODt3Mnl7RYd5FW-y0CcNV5FnT9tWrZpqIzp?key=y-BK4rgat6gAFF-1TTPy3A"/><div class="image__source"><span class="image__source_text"><p>Applications of the software module. <a class="link" href="https://arxiv.org/pdf/2407.17409v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Speaking of autonomous driving, <a class="link" href="https://arxiv.org/pdf/2407.17409v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">this paper</a> addresses the challenge of generating large-scale, <b>high-quality training data</b> for machine learning tasks in autonomous driving - particularly for map perception and related applications.</p><p class="paragraph" style="text-align:left;">The authors propose an extension to the Lanelet2 framework called lanelet2_ml_converter. 
This extension enables the generation of diverse training labels directly from HD maps while maintaining compatibility with <b>existing automated driving functionalities.</b></p><p class="paragraph" style="text-align:left;">The extension introduces features such as:</p><ul><li><p class="paragraph" style="text-align:left;">Compound labels for independence from map annotation artifacts</p></li><li><p class="paragraph" style="text-align:left;">Traceability of labels to original map elements</p></li><li><p class="paragraph" style="text-align:left;">Support for varying local reference frame poses</p></li></ul><p class="paragraph" style="text-align:left;">The results are impressive, demonstrating the framework&#39;s flexibility and effectiveness in generating training data for <b>various map perception tasks</b>, including online HD map construction, topology inference, and map fusion.</p><p class="paragraph" style="text-align:left;">It bridges the gap between HD maps used in automated driving and the growing need for large-scale, <b>standardized training data</b> in AI-based mapping and perception tasks. </p><h3 class="heading" style="text-align:left;" id="using-lazy-llm-to-accelerate-llm-in">Using LazyLLM to Accelerate LLM Inference with Dynamic Token Pruning</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXf4JBRU8BE8i61OzEpx12FL6pCd-w6rlwa1U1B79rYayYifu1YpgoABl-OVg9HDHv1zOfG831mx-koyaROrnzaPwQnP6yigIK2OBVVYCIEKLe2j08kYFt_E9XwBlqLWsAIUFs4-QiWTO4ES469cbDyAVL6b?key=y-BK4rgat6gAFF-1TTPy3A"/><div class="image__source"><span class="image__source_text"><p>How LazyLLM differs from a standard LLM.
<a class="link" href="https://arxiv.org/pdf/2407.14057?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2407.14057?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">LazyLLM</a> addresses the challenge of <b>slow first token generation</b> in large language models when processing long prompts, which can significantly impact overall inference speed.</p><p class="paragraph" style="text-align:left;">To solve this, the authors introduce a<b> dynamic token pruning method </b>that selectively computes key-value (KV) caches only for tokens deemed important for next token prediction in both the prefilling and decoding stages, allowing the model to adapt its token selection at each generation step.</p><p class="paragraph" style="text-align:left;">Experiments across various tasks demonstrate that LazyLLM can <b>accelerate the prefilling stage</b> of the Llama 2 7B model by 2.34x while maintaining accuracy on multi-document question-answering tasks.</p><p class="paragraph" style="text-align:left;">Notably, LazyLLM offers<b> a generic method</b> to improve the efficiency of LLM inference, particularly for long-context scenarios, without requiring model fine-tuning.</p><p class="paragraph" style="text-align:left;">By focusing on optimizing the often-overlooked prefilling stage, LazyLLM provides a complementary approach to existing methods that <b>primarily target decoding efficiency, </b>potentially leading to more comprehensive improvements in LLM inference speed across various applications.</p><h2 class="heading" style="text-align:left;"
id="frameworks-we-love">Frameworks We Love</h2><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2407.18038?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">TiCoSS</a>: Tightens the coupling between semantic segmentation and stereo matching tasks for autonomous driving perception. </p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2407.17998?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">iNNspector</a>: Comprehensive system for systematic debugging of deep learning models, providing interactive visualizations and tools to explore model architectures.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://pub.towardsai.net/structured-financial-data-extraction-from-unstructured-data-ca2c8d166de6?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">Structured financial data extraction:</a> Extracts financial data from unstructured documents like balance sheets and financial statements using a combination of schema-based extraction with Pydantic, LLM, and the Indexify framework.</p></li></ul><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">Last week we saw discussions about the in-depth process that goes into <b>building generative AI platforms</b>, alongside the differences in the technicalities of terms like open-source and free models by using Llama 3 as an example.</p><h3 
class="heading" style="text-align:left;" id="deep-dive-into-building-a-generativ">Deep Dive Into Building a Generative AI Platform</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-rt.googleusercontent.com/docsz/AD_4nXcC-yBiN5J0ZNbA83zbUwM37vahcHiMo2fvae7DNHyD3v4tcym23bzAVebz2imz4sLnGRx_oR0EheJ3ZgSdMOLPhX7_eBobr_xQA6_2qwCTaTGxrSVgJsKOohkkvQluwACHnPOvjnswoPnmHDOjFr3RtHI?key=y-BK4rgat6gAFF-1TTPy3A"/><div class="image__source"><span class="image__source_text"><p>Huyen’s exploration of how to build a generative AI platform. <a class="link" href="https://huyenchip.com/2024/07/25/genai-platform.html?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Chip Huyen, a prominent figure in machine learning systems, shared a comprehensive overview of building generative AI platforms, <b>outlining common components and their implementations. </b></p><h3 class="heading" style="text-align:left;" id="is-llama-31-really-open-source">Is Llama 3.1 Really Open Source?</h3><p class="paragraph" style="text-align:left;">We don’t really think about it, but there are actually key distinctions between the terms <b>open-source, open weights, and free models</b>. Recently, <a class="link" href="https://www.linkedin.com/posts/eordax_llm-llama-ai-ugcPost-7222564549281910785-A-Xi?utm_source=share&utm_medium=member_desktop" target="_blank" rel="noopener noreferrer nofollow">Eduardo Ordax</a>, Generative AI lead at AWS, brought this issue to light using Llama 3.1 as an example.</p><p class="paragraph" style="text-align:left;">𝗢𝗽𝗲𝗻 𝗦𝗼𝘂𝗿𝗰𝗲: You get the whole shebang—source code, hyperparameters, the original dataset, and all the juicy documentation. 
It&#39;s like getting the keys to a candy store and being told, &quot;Go nuts!&quot;.</p><p class="paragraph" style="text-align:left;">𝗢𝗽𝗲𝗻 𝗪𝗲𝗶𝗴𝗵𝘁𝘀: You can use the pre-trained model and even fine-tune it, but you won&#39;t get the original code or training methods (just like for Llama v3.1 and Mistral Large 2).</p><p class="paragraph" style="text-align:left;">𝗟𝗶𝗰𝗲𝗻𝘀𝗶𝗻𝗴 𝗟𝗶𝗺𝗶𝘁𝗮𝘁𝗶𝗼𝗻𝘀: Llama 3.1 is released under the <a class="link" href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">Llama 3.1 Community License Agreement</a>. Commercial use is allowed, but with limitations. Conversely, Mistral Large 2 is allowed only for non-commercial use.</p><p class="paragraph" style="text-align:left;">While downloading the model is free, the post points out the <b>significant costs </b>associated with deploying and running inference, challenging the notion of Llama 3.1 being entirely &quot;free.&quot;</p><p class="paragraph" style="text-align:left;">This highlights how the AI community needs to be more accurate in its use of terms like &quot;open source&quot; and &quot;free,&quot; since these have specific implications <b>for model development and adoption.</b></p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in AI</h2><p class="paragraph" style="text-align:left;">Cowbell and Harvey had <b>successful Series C funding rounds</b>, raising $60 million and $100 million, respectively. 
Meanwhile, Cohere continues to move forward in its competition with OpenAI by raising $500 million.</p><h3 class="heading" style="text-align:left;" id="cohere-raises-500-million-while-tak">Cohere Raises $500 Million While Taking on OpenAI </h3><p class="paragraph" style="text-align:left;">AI startup Cohere has raised<a class="link" href="https://fortune.com/2024/07/23/after-500-million-funding-ai-startup-cohere-layoffs/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow"> $500 million</a> in a funding round that values the company at <b>over $5 billion</b>, signaling its ambition to compete with industry leaders like OpenAI. However, Cohere cut around 20 employees, about 5% of its workforce, the day after the funding round.</p><h3 class="heading" style="text-align:left;" id="harvey-secures-100-million-in-serie">Harvey Secures $100 Million in Series C Funding</h3><p class="paragraph" style="text-align:left;">Harvey, an AI-powered legal technology startup, has secured <a class="link" href="https://www.harvey.ai/blog/harvey-raises-series-c?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">$100 million</a> in Series C funding at a<b> $1.5 billion valuation,</b> led by Google Ventures.</p><p class="paragraph" style="text-align:left;">Harvey plans to use the new capital to expand its engineering and data capabilities, develop <b>domain-specific models</b>, and deepen partnerships with cloud and model providers to enhance its AI platform. 
We are curious how this will affect the company&#39;s spend with <a class="link" href="https://openai.com/index/harvey/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">OpenAI</a>, although since OpenAI is on the roster of the company&#39;s investors, the impact may be minor.</p><h3 class="heading" style="text-align:left;" id="cowbell-raises-60-million-funding-i">Cowbell Raises $60 Million in Series C Funding</h3><p class="paragraph" style="text-align:left;">Cowbell, a leading AI-powered cyber insurance provider, has raised <a class="link" href="https://www.morningstar.com/news/pr-newswire/20240726ny70414/cowbell-secures-60-million-series-c-funding-from-zurich-insurance-group-to-scale-up-operations-and-advance-global-sme-cyber-adoption?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=second-gpt-4-level-open-model-international-mathematical-olympiad-vs-deepmind-s-llm-rlhf" target="_blank" rel="noopener noreferrer nofollow">$60 million</a> in Series C funding from Zurich Insurance Group, bringing its total funding to <b>$160 million</b>.</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=ac60a834-9fa6-405c-a4f5-4f8ba1bc0d59&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

      <item>
  <title>A Small Language Model Week, GPT-4 Mini, Llama-3 405B Leaked</title>
  <description>Plus, Mistral&#39;s new Trifecta of Models, &amp; Lessons from Google Cloud’s Early Missteps</description>
      <enclosure url="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/01900e5a-3c64-46b1-9e9c-a8d7d8648bb3/mikayelh_A_3d_isometric_white_and_orange_llama_falling_through__b881c2e5-d890-4654-90c6-a2abfc793d841-ezgif.com-optipng.png" length="777621" type="image/png"/>
  <link>https://genai360.beehiiv.com/p/of-llamas-and-slms</link>
  <guid isPermaLink="true">https://genai360.beehiiv.com/p/of-llamas-and-slms</guid>
  <pubDate>Tue, 23 Jul 2024 13:20:51 +0000</pubDate>
  <atom:published>2024-07-23T13:20:51Z</atom:published>
  <content:encoded><![CDATA[
    <div class='beehiiv'><style>
  .bh__table, .bh__table_header, .bh__table_cell { border: 1px solid #C0C0C0; }
  .bh__table_cell { padding: 5px; background-color: #FFFFFF; }
  .bh__table_cell p { color: #2D2D2D; font-family: 'Helvetica',Arial,sans-serif !important; overflow-wrap: break-word; }
  .bh__table_header { padding: 5px; background-color:#F1F1F1; }
  .bh__table_header p { color: #2A2A2A; font-family:'Trebuchet MS','Lucida Grande',Tahoma,sans-serif !important; overflow-wrap: break-word; }
</style><div class='beehiiv__body'><p class="paragraph" style="text-align:left;">Before we start, share this week&#39;s news with a friend or a colleague:</p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/p/of-llamas-and-slms?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked"><span class="button__text" style=""> Share the newsletter </span></a></div><h2 class="heading" style="text-align:left;" id="key-takeaways">Key Takeaways</h2><ul><li><p class="paragraph" style="text-align:left;">On the theme of accidents, <b>Llama-3 405B was allegedly leaked on HuggingFace</b> (and made available for download on 4Chan) yesterday. Read below for what we know about the model ahead of today&#39;s release.</p></li><li><p class="paragraph" style="text-align:left;">OpenAI unveiled GPT-4o mini, a compact and cost-effective AI model for ChatGPT that outperforms leading small AI models on reasoning tasks while being <b>60% cheaper </b>to operate than GPT-3.5 Turbo.</p></li><li><p class="paragraph" style="text-align:left;">Salesforce released xLAM, a family of models for autonomous task planning and execution, with the <b>7B model achieving 88.24%</b> on the BFCL function calling leaderboard.</p></li><li><p class="paragraph" style="text-align:left;">HuggingFace’s SmolLM is a new line of efficient small language models designed for local devices, available in<b> 135M, 360M, and 1.7B</b> parameter sizes, outperforming similar-sized models like GPT-2 and MobileLM across various benchmarks.</p></li><li><p class="paragraph" style="text-align:left;">FlashAttention-3 achieves up to<b> 2× speedup</b> in attention mechanisms using producer-consumer asynchrony and hardware-accelerated low-precision operations.</p></li><li><p class="paragraph" style="text-align:left;">LMMs-Eval 
<b>proposes a unified evaluation framework</b> for multimodal AI, balancing task diversity, human alignment, and efficiency to enable standardized model comparisons.</p></li></ul><p class="paragraph" style="text-align:left;"><i>Got forwarded this newsletter? Subscribe below👇</i></p><div class="button" style="text-align:left;"><a target="_blank" rel="noopener nofollow noreferrer" class="button__link" style="background-color:#ff8a00;" href="https://genai360.beehiiv.com/subscribe?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked"><span class="button__text" style=""> Subscribe </span></a></div><h2 class="heading" style="text-align:left;" id="the-latest-ai-news">The Latest AI News</h2><p class="paragraph" style="text-align:left;">Sheesh, what a week. As an AI newsletter, we&#39;re glad we can be… unburdened by what has been happening over the past week in the political arena and on the international Blue Screen day. After all, so much has happened in AI, too!<br><br>Last week was big for small language models. AI developments showcased a push towards<b> efficient, compact models</b> like GPT-4o mini, Arcee-Nova, SmolLM, and xLAM, alongside AI titans pivoting to specialized ventures like Fei-Fei Li&#39;s World Labs. </p><p class="paragraph" style="text-align:left;">These advancements are occurring amidst growing regulatory scrutiny, as evidenced by Meta&#39;s EU decision and Altman&#39;s &quot;AI-client privilege&quot; proposal, highlighting the <b>tension</b> between innovation and ethical considerations in AI.</p><p class="paragraph" style="text-align:left;">But first… the talk of the day:</p><h2 class="heading" style="text-align:left;" id="llama-3-405-b-leaked-on-4-chan-what">Llama-3 405B Leaked on 4Chan. 
What Do We Know Ahead of Today&#39;s Release?</h2><p class="paragraph" style="text-align:left;">Weighing in at almost 820GB, the model was ‘accidentally’ leaked in a Hugging Face repository ahead of the release. (UPD: you can read <a class="link" href="https://ai.meta.com/blog/meta-llama-3-1/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">Meta&#39;s full announcement here</a>).</p><p class="paragraph" style="text-align:left;">We don&#39;t know where you can download more RAM to run this, but here&#39;s what we do know:</p><ol start="1"><li><p class="paragraph" style="text-align:left;">Outperforms GPT-4o and Claude Sonnet on more than 90% of benchmarks, but may fall short on some text-related tasks. Unclear how it fares against the newly released GPT-4o mini yet.</p></li><li><p class="paragraph" style="text-align:left;">128K-token context window</p></li><li><p class="paragraph" style="text-align:left;">Pre-trained on 15 trillion tokens 😱 (the number floating around for the OG GPT-4 was 13T)</p></li><li><p class="paragraph" style="text-align:left;">Multilingual, but not yet multi-modal</p></li><li><p class="paragraph" style="text-align:left;">Fine-tuned versions and quants available soon after the base release</p></li><li><p class="paragraph" style="text-align:left;">Possibly paywalled at a certain point as per this part of the code:</p></li></ol><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/ce3c45fb-38ba-4062-82e5-cc17375cb6fa/image.png?t=1721738848"/><div class="image__source"><span class="image__source_text"><p>Llama code repo contains upsell prompts</p></span></div></div><p class="paragraph" style="text-align:left;">Here are the Llama 3.1 405B benchmarks:</p><div class="image"><img alt="" class="image__image" style="" 
src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/af214a55-0a9d-4d51-a806-ebbb7c9164fb/image.png?t=1721749854"/><div class="image__source"><span class="image__source_text"><p>Llama 3.1 vs closed-source models</p></span></div></div><div class="image"><img alt="" class="image__image" style="" src="https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/38ccc6b2-4a1e-423d-815f-b2424f9765e5/image.png?t=1721749880"/><div class="image__source"><span class="image__source_text"><p>Llama 3.1 vs open-source models</p></span></div></div><h3 class="heading" style="text-align:left;" id="new-models-from-open-ai-and-mistral">New Models From OpenAI and Mistral Lead the Efficiency Race</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXdwW38h9L5vFsEFUYYSgbNxvDtQ-gkM1SClZPG33PYT4HUtbZehQYpar39tndU2Js2WIELdAxqqVdevM1POsEM6-xn0GjpK86gFf4q4VtqLBCJnLbNmUMtS04WeK3BpWj_e9Ag8SDq8Iv-WhVyo-91vASs?key=RUEzH523fmtjVYvsTmrKfg"/><div class="image__source"><span class="image__source_text"><p>GPT-4o mini is among the most cost-effective LLMs. 
<a class="link" href="https://techcrunch.com/2024/07/18/openai-unveils-gpt-4o-mini-a-small-ai-model-powering-chatgpt/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">(Source) </a></p></span></div></div><p class="paragraph" style="text-align:left;">OpenAI has introduced <a class="link" href="https://techcrunch.com/2024/07/18/openai-unveils-gpt-4o-mini-a-small-ai-model-powering-chatgpt/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">GPT-4o mini</a>, a new small-scale AI model designed to balance performance and efficiency. This model, which will power ChatGPT and be available to developers, <b>outperforms leading small AI models</b> on reasoning tasks while being significantly more cost-effective to run.</p><p class="paragraph" style="text-align:left;">GPT-4o mini scores<b> 82% on the MMLU benchmark</b>, surpassing Gemini 1.5 Flash and Claude 3 Haiku. It&#39;s also over 60% cheaper to operate than its predecessor, GPT-3.5 Turbo.</p><p class="paragraph" style="text-align:left;">Moreover, the model features a <b>128,000 token context window </b>and supports text and vision inputs.</p><p class="paragraph" style="text-align:left;">Meanwhile, the <a class="link" href="https://techcrunch.com/2024/07/17/ttt-models-might-be-the-next-frontier-in-generative-ai/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">TTT (test-time training) models</a> are looking to shake up the generative AI field. 
Researchers suggest these models could offer significant advantages over traditional transformers in terms of <b>efficiency and capabilities.</b></p><p class="paragraph" style="text-align:left;">TTT models use a different mathematical structure that allows for more efficient information processing. Early results are promising, as they show TTT models <b>performing competitively</b> with much larger transformer models while using fewer parameters.</p><p class="paragraph" style="text-align:left;">On the other hand, Mistral released a plethora of <b>new models,</b> including Codestral Mamba, Mathstral, and NeMo.</p><ol start="1"><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://mistral.ai/news/codestral-mamba/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">Codestral Mamba:</a> Combines the Mamba architecture with advanced code and reasoning capabilities. </p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://mistral.ai/news/mathstral/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">Mathstral:</a><b> </b>Specifically designed to tackle complex mathematical problems requiring multi-step logical reasoning.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://mistral.ai/news/mistral-nemo/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">NeMo:</a><b> </b>A 12B parameter model that offers high performance in reasoning, world knowledge, and coding accuracy for its size category. 
</p></li></ol><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXd_R2trEIUvFnS2-lrhMb5hrU5Gr2vbY0lQHxOm5XxkvjkusJNvQJCMhPv6F2AV9PjhUMFwwLQSjPmEexpwtvrMjiIvXnbCbcOIRyf-6iIsFnfWI5bEmiRFyqccS4M2dacLT1IDh0H9tdg4qAtQVoNS8Ebh?key=RUEzH523fmtjVYvsTmrKfg"/><div class="image__source"><span class="image__source_text"><p>NeMo outperforms similar-sized models like Gemma 2 and Llama 3 across various benchmarks. <a class="link" href="https://mistral.ai/news/mistral-nemo/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><h3 class="heading" style="text-align:left;" id="arcee-merges-sl-ms-to-close-in-on-g">Arcee Merges SLMs to Close in on GPT-4</h3><p class="paragraph" style="text-align:left;">Arcee released Arcee-Nova, a Small Language Model (SLM) developed as a local alternative to GPT-4 and Sonnet 3.5. It scores <b>9.17 on MT-Bench</b>, 0.01 points below GPT-4 (May 2023). </p><p class="paragraph" style="text-align:left;">On OpenLLM leaderboard tasks, it achieves the <b>highest average scores</b> among models tested (Llama-3-70B-Instruct, Tess-Qwen2, Qwen2-72B-Instruct, etc.). Arcee-Nova is designed for coding, mathematics, and creative writing.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXcxvgpkamNEtsMgHJ7yMtfxEDqXZy5OohDH1EXQ_xw8HWP58yEsv6Ws_BLgOmmrjvLxI6zPImXAitYJsicshenGB_BA7kYKAsDT7wXQ_vrtk_ovjc7dDgdFimclSnIsP4Y8uyeiwcZi3dH0Z5WOeVJ9ivxb?key=RUEzH523fmtjVYvsTmrKfg"/></div><p class="paragraph" style="text-align:left;">The model was trained for <b>3 epochs on 1.75 million samples</b> from 10 publicly available datasets. Dataset curation involved a custom reranker for instruction following and safety, with scores averaged with fineweb-edu classifier for educational value. 
</p><p class="paragraph" style="text-align:left;">Arcee has also publicly released the resulting dataset - <a class="link" href="https://huggingface.co/datasets/arcee-ai/The-Tome?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">arcee-ai/The-Tome</a>. Nova was<b> merged </b>with Qwen2-72B-Instruct, primarily using lower layers from Instruct and higher layers from Nova-Premerge. </p><p class="paragraph" style="text-align:left;">The merge underwent <b>further alignment</b> using DPO (Direct Preference Optimization - i.e., the model is optimized directly on human preference data rather than via a separately trained reward model and reinforcement learning).</p><h3 class="heading" style="text-align:left;" id="even-smaller-sl-ms-x-lam-from-sales">Even Smaller SLMs - xLAM From Salesforce and SmolLM From HuggingFace</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://github.com/SalesforceAIResearch/xLAM?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">Salesforce</a> has introduced xLAM, a new family of large action models (LAMs) designed to autonomously plan and execute tasks. These models, available in 1.35B and 7B parameter versions, demonstrate <b>impressive capabilities</b> despite their relatively compact size.</p><p class="paragraph" style="text-align:left;">The 7B model achieves<b> 88.24% on the BFCL</b> (Berkeley Function-Calling Leaderboard), while the 1.35B version scores 78.94%, outperforming many larger open-access models. 
xLAM models show competitive performance against GPT-4 and Claude 3.5 on function calling tasks.</p><p class="paragraph" style="text-align:left;">Meanwhile, Hugging Face released <a class="link" href="https://huggingface.co/blog/smollm?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">SmolLM,</a> a new family of state-of-the-art SLMs designed to operate <b>efficiently </b>on local devices. </p><p class="paragraph" style="text-align:left;">The SmolLM models come in <b>three sizes</b>: 135M, 360M, and 1.7B parameters. The models are trained on a meticulously curated high-quality dataset, SmolLM-Corpus, and demonstrate strong performance in common sense reasoning and world knowledge tasks.</p><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXcf5_seIOWaKfeUB9Z9vOpZyAxdIS0h4wwSSddNki5SISr6coMwRLT6e6LhI0vehmgCYOWwfpTvHu29s47hcnBpNePw4ODlz2RilbnrsMoX3HqD5KX7jfN35qqxHgfCR6-yBw9-EimsL8VoQNC_9-QRHgg?key=RUEzH523fmtjVYvsTmrKfg"/><div class="image__source"><span class="image__source_text"><p>SmolLM showed impressive performance across various benchmarks like MMLU and ARC. 
<a class="link" href="https://huggingface.co/blog/smollm?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">(Source) </a></p></span></div></div><p class="paragraph" style="text-align:left;">Both releases are a sign that we’re moving toward <a class="link" href="https://genai360.beehiiv.com/p/of-new-architectures?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">more efficient, compact models</a> that can perform complex tasks <b>with fewer parameters.</b></p><h3 class="heading" style="text-align:left;" id="li-and-karpathys-next-big-bets">Li and Karpathy’s Next Big Bets</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://observer.com/2024/07/ai-godmother-fei-fei-li-1b-spatial-intelligence-startup/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">Fei-Fei Li</a>, often dubbed the &quot;Godmother of AI,&quot; has launched World Labs, a startup focused on developing spatial intelligence in AI - the ability to understand and reason about <b>3D spaces</b> and physical environments.</p><p class="paragraph" style="text-align:left;">In just four months, the company has achieved a <b>$1 billion valuation</b> and secured backing from major venture capital firms, including Andreessen Horowitz and Radical Ventures.</p><p class="paragraph" style="text-align:left;">But Li wasn’t the only one who introduced a new company. 
Former Tesla AI director and OpenAI researcher <a class="link" href="https://techcrunch.com/2024/07/16/after-tesla-and-openai-andrej-karpathys-startup-aims-to-apply-ai-assistants-to-education/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">Andrej Karpathy</a> has unveiled his new venture, Eureka Labs, which aims to <b>completely change education through AI</b>. The startup plans to leverage AI teaching assistants to enhance learning experiences.</p><p class="paragraph" style="text-align:left;">Eureka Labs will use AI to create personalized, interactive learning experiences, with the platform&#39;s first offering being a course on<b> training LLMs from scratch. </b></p><p class="paragraph" style="text-align:left;">These developments might mean we’re in a <b>new phase of AI innovation</b> where industry leaders are targeting specific, high-impact AI applications rather than the broader applications we’re used to seeing.</p><h3 class="heading" style="text-align:left;" id="metas-eu-pause-and-altmans-call-to-">Meta’s EU Pause and Altman’s Call to Privacy</h3><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.silicon.co.uk/ai/meta-refuses-eu-release-of-multimodal-llama-ai-model-572200?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">Meta</a> announced it will not release its upcoming multimodal Llama AI model in the European Union, citing the &quot;unpredictable nature of the European regulatory environment.&quot; This decision highlights<b> growing tensions </b>between tech giants and EU regulators over AI governance.</p><p class="paragraph" style="text-align:left;">Meta&#39;s move follows <b>similar actions</b> by other tech 
companies, like Apple&#39;s decision to exclude the EU from certain AI features. The decision comes shortly after the EU finalized compliance deadlines for its <a class="link" href="https://genai360.beehiiv.com/p/of-sonnets-and-agents?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">new AI Act</a>, potentially impacting EU companies&#39; access to cutting-edge AI technologies.</p><p class="paragraph" style="text-align:left;">While Meta was busy making decisions about model releases in the EU, the debate about AI ethics continued. <a class="link" href="https://www.msn.com/en-xl/health/other/sam-altman-says-society-may-decide-we-need-ai-client-privilege-similar-to-confidentiality-with-lawyers-or-doctors/ar-BB1q5GSr?ocid=finance-verthp-feeds&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">OpenAI CEO Sam Altman</a> suggests that society may need to establish an &quot;AI-client privilege&quot; similar to the <b>confidentiality protections</b> that exist between lawyers or doctors and their clients. 
</p><p class="paragraph" style="text-align:left;">Establishing such a privilege could help build trust between users and AI systems, encouraging more <b>open and honest interactions</b> without fear of information misuse.</p><p class="paragraph" style="text-align:left;">However, it&#39;s important to consider that this would require <b>advanced security measures</b> and potentially new technological solutions to ensure the confidentiality of AI interactions.</p><h3 class="heading" style="text-align:left;" id="open-a-is-custom-chip-gambit">OpenAI’s Custom Chip Gambit</h3><p class="paragraph" style="text-align:left;">Since Nvidia has been dominating the AI chip market, OpenAI decided to throw its hat into the ring alongside <a class="link" href="https://genai360.beehiiv.com/p/the-trillion-dollar-cluster?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">its competitors</a>.</p><p class="paragraph" style="text-align:left;"><a class="link" href="https://www.datacenterdynamics.com/en/news/openai-in-talks-with-broadcom-to-develop-custom-ai-chip-altman-looks-to-fund-data-centers-report/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">OpenAI</a> is reportedly in talks with Broadcom to develop a custom AI chip, signaling a <b>potential shift</b> in the AI hardware landscape. This move comes as part of OpenAI&#39;s broader strategy to reduce dependence on existing GPU suppliers and address the increasing demand for AI computation.</p><p class="paragraph" style="text-align:left;">This could give OpenAI more control over its hardware supply chain and <b>reduce costs </b>in the long run. It may also lead to a competitive advantage, as developing proprietary hardware could give OpenAI an edge in the AI race. 
</p><p class="paragraph" style="text-align:left;">OpenAI’s initiative to develop custom chips <b>mirrors similar efforts</b> by other major tech companies to reduce dependence on external suppliers and optimize hardware for their needs. </p><p class="paragraph" style="text-align:left;">Apple introduced the <a class="link" href="https://www.apple.com/uk/newsroom/2024/05/apple-introduces-m4-chip/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">M4 chip</a> a couple of months ago, the latest example of its custom silicon strategy for Macs and other devices. Additionally, Google has been developing <a class="link" href="https://cloud.google.com/tpu?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">TPUs </a>specifically designed to <b>accelerate machine learning workloads. </b></p><h2 class="heading" style="text-align:left;" id="advancements-in-ai-research">Advancements in AI Research</h2><p class="paragraph" style="text-align:left;">Last week’s advancements showcase a convergence of efforts to enhance efficiency, evaluation, and application across <b>diverse domains</b>, from optimizing LLMs to improving multimodal capabilities and 3D perception. 
</p><p class="paragraph" style="text-align:left;">Breakthroughs like FlashAttention-3 and Q-Sparse are pushing the boundaries of model efficiency, while LMMs-Eval addresses the critical need for <b>standardized evaluation methods in multimodal AI.</b></p><h3 class="heading" style="text-align:left;" id="optimizing-control-for-multilingual">Optimizing Control for Multilingual Text-Image Generation</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXclYAxmLMIYHss6sABTP13LRGicrNSyxo5i2TXIb-k-IGXWynFGtJVZdgvuEkBOyEmevCkMNU-bQRX4e_Jr1uQOI9NioW9apkbAuYQ2eAKtUiScx6d3iHw1WspjCxXOF-_16mjDTEdaOBaOfJNN2mFgF5Ha?key=RUEzH523fmtjVYvsTmrKfg"/><div class="image__source"><span class="image__source_text"><p>Impact of control at different stages of denoising. <a class="link" href="https://arxiv.org/pdf/2407.11502v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers at the University of Science and Technology of China introduced <a class="link" href="https://arxiv.org/pdf/2407.11502v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">TextGen</a>, a framework that addresses a critical challenge in visual text generation: effectively utilizing <b>control information</b> throughout the diffusion process. 
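</p><p class="paragraph" style="text-align:left;">The paper’s central question, how much control to inject at each stage of denoising, can be pictured with a toy loop. Everything below is a hypothetical sketch: the U-shaped schedule, function names, and update rule are ours for illustration, not TextGen’s actual architecture.</p>

```python
# Hypothetical sketch of stage-dependent control in a denoising loop.
# The U-shaped schedule below is illustrative, not TextGen's actual
# weighting: it emphasizes early steps (global layout) and late steps
# (character detail) over the middle of the trajectory.

def control_weight(step: int, total_steps: int) -> float:
    """Weight for injecting control features at a given denoising step."""
    progress = step / (total_steps - 1)  # 0.0 = first step, 1.0 = last step
    # High at both ends, lower in the middle.
    return 0.5 + 0.5 * abs(2.0 * progress - 1.0)

def denoise(latent: list[float], control: list[float], total_steps: int = 10) -> list[float]:
    """Toy denoising loop: each step mixes in control scaled by the schedule."""
    for step in range(total_steps):
        w = control_weight(step, total_steps)
        # Stand-in for a U-Net update; a real model predicts noise here.
        latent = [0.9 * x + 0.1 * w * c for x, c in zip(latent, control)]
    return latent

print([round(control_weight(s, 10), 2) for s in range(10)])
# strongest at the first and last steps, weakest mid-trajectory
```

<p class="paragraph" style="text-align:left;">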
</p><p class="paragraph" style="text-align:left;">This work tackles the limitations of <b>current ControlNet-based approaches</b>, which often struggle with fine-grained character details and coherent text placement.</p><p class="paragraph" style="text-align:left;">Their findings reveal that control information has <b>unique characteristics </b>compared to conventional inputs like Canny edges or depth maps. Notably, control at both the early and late stages of denoising plays a crucial role: early control influences global coherence, while late-stage control refines textual details.</p><p class="paragraph" style="text-align:left;">The results were impressive. TextGen achieved state-of-the-art performance on the <b>AnyWord benchmark,</b> with significant gains in both English (73.36% accuracy, up from 64.26%) and Chinese (67.92% accuracy, up from 65.02%) text generation.</p><h3 class="heading" style="text-align:left;" id="navigating-the-evaluation-trilemma-">Navigating the Evaluation Trilemma for Multimodal AI</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXcYIwK4e4R6QGL2bTXBPgrIshdK0Vjb9RbWOsjMF8yhxsfyVtJAvRwmgEvaWdzRCynjspgAIDDXUtIDPuGTcsH_rt7xc2duaDqnfyqLnVZo9-B1N6VoLtbKSETqMgtC2PuvS3s-45PbfbARUtG2m3-AkW9T?key=RUEzH523fmtjVYvsTmrKfg"/><div class="image__source"><span class="image__source_text"><p>The three components developed to deal with the “evaluation trilemma.” <a class="link" href="https://arxiv.org/pdf/2407.12772v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers at the LMMs-Lab Team presented<a class="link" 
href="https://arxiv.org/pdf/2407.12772v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow"> LMMs-EVAL</a>, a framework that addresses the critical challenges in evaluating large multimodal models (LMMs). This work tackles the &quot;<b>evaluation trilemma</b>&quot; - the difficulty in simultaneously achieving wide coverage, low cost, and zero contamination in AI model assessment.</p><p class="paragraph" style="text-align:left;">They developed <b>three</b> key components:</p><ul><li><p class="paragraph" style="text-align:left;"><b>LMMs-EVAL:</b> A unified evaluation suite covering over 50 tasks and more than 10 models, ensuring standardized comparisons.</p></li><li><p class="paragraph" style="text-align:left;"><b>LMMs-EVAL LITE: </b>An efficient subset of benchmarks that maintains reliability while reducing evaluation costs.</p></li><li><p class="paragraph" style="text-align:left;"><b>LIVEBENCH:</b> A dynamic evaluation framework using continuously updated news and online content to assess models&#39; zero-shot generalization abilities.</p></li></ul><p class="paragraph" style="text-align:left;">Results show significant improvements in evaluation efficiency and reliability. LMMs-EVAL LITE achieved correlation scores <b>above 0.9</b> with full benchmark results while substantially reducing computation time. 
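</p><p class="paragraph" style="text-align:left;">The sanity check behind that kind of claim is straightforward: compare per-model scores on the lite subset against the full suite with a Pearson correlation. A minimal sketch with hypothetical scores (the numbers below are ours, not from the paper):</p>

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-model accuracies on the full suite vs. a lite subset.
full_scores = [62.1, 55.4, 70.8, 48.9, 66.3]
lite_scores = [61.5, 56.0, 69.9, 50.2, 65.1]

r = pearson(full_scores, lite_scores)
print(f"correlation: {r:.3f}")  # a trustworthy lite subset should give r > 0.9
```

<p class="paragraph" style="text-align:left;">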
</p><p class="paragraph" style="text-align:left;">LIVEBENCH also revealed performance gaps between open-source and commercial models, highlighting the need for<b> more robust evaluation methods.</b></p><h3 class="heading" style="text-align:left;" id="unlocking-full-activation-sparsity-">Unlocking Full Activation Sparsity in LLMs</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXctSN8SN7BU645oswcEYHACdpEPg_bptGlOADaSl3t1R45MvENmhWA13GjX4AYP8B3QB_AqJeo7_XNL8t2-d_0w3T4n03OG4ldykXrwifL0zDLtcBiX16PEQM-1e19f39VQu1Gq_pqXdQsYInvW9aAYAAtI?key=RUEzH523fmtjVYvsTmrKfg"/><div class="image__source"><span class="image__source_text"><p>How Q-Sparse achieves a better inference-scaling law than the dense models. <a class="link" href="https://arxiv.org/pdf/2407.10969?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">(Source) </a></p></span></div></div><p class="paragraph" style="text-align:left;">Researchers at Microsoft have introduced<a class="link" href="https://arxiv.org/pdf/2407.10969?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow"> Q-Sparse</a>, a novel approach to training sparsely-activated LLMs that significantly improves<b> inference efficiency </b>without compromising performance. </p><p class="paragraph" style="text-align:left;">This work addresses the critical challenge of <b>reducing computational costs </b>and memory footprint in LLMs - especially during inference.</p><p class="paragraph" style="text-align:left;">Q-Sparse can match the performance of dense baseline models while achieving up to 40% sparsity. 
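</p><p class="paragraph" style="text-align:left;">The core mechanism is top-K sparsification of activations: only the largest-magnitude entries enter each matrix multiplication. A minimal pure-Python sketch of that selection step (tensor shapes, quantization, and the straight-through estimator used in training are omitted):</p>

```python
def top_k_sparsify(activations: list[float], sparsity: float) -> list[float]:
    """Zero out all but the largest-magnitude activations.

    `sparsity` is the fraction of entries set to zero. Q-Sparse applies
    this top-K selection to the input of each linear projection so only
    the surviving entries take part in the matrix multiplication.
    """
    n = len(activations)
    k = max(1, round(n * (1.0 - sparsity)))  # number of entries to keep
    keep = set(sorted(range(n), key=lambda i: abs(activations[i]), reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]

acts = [0.1, -2.3, 0.05, 1.7, -0.4, 0.9, -0.02, 3.1]
print(top_k_sparsify(acts, sparsity=0.5))
# → [0.0, -2.3, 0.0, 1.7, 0.0, 0.9, 0.0, 3.1]
```

<p class="paragraph" style="text-align:left;">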
Notably, the method works effectively across <b>various settings,</b> including training from scratch, continuing pre-training of existing LLMs, and fine-tuning.</p><p class="paragraph" style="text-align:left;">They also derived an<b> inference-optimal scaling law</b> for sparsely-activated LLMs, finding that models with a sparsity ratio of 45.58% (for full-precision) and 61.25% (for 1.58-bit models) can achieve optimal performance given the same inference budget.</p><h3 class="heading" style="text-align:left;" id="pushing-the-boundaries-of-efficient">Pushing the Boundaries of Efficient Attention</h3><p class="paragraph" style="text-align:left;">Researchers at Colfax Research, Meta, NVIDIA, Princeton University, and Together AI have introduced <a class="link" href="https://arxiv.org/pdf/2407.08608v1?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">FlashAttention-3</a>, a new approach to optimize attention mechanisms in large language models. It addresses the <b>computational bottleneck of attention</b>, which has quadratic scaling in sequence length and limits the ability to handle long contexts.</p><p class="paragraph" style="text-align:left;">Results show significant performance gains with a <b>1.5-2.0×</b> speedup over FlashAttention-2 in the forward pass. 
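</p><p class="paragraph" style="text-align:left;">The algorithmic idea this line of work builds on is computing exact softmax attention incrementally, without ever materializing the full n×n score matrix. Below is a single-query, pure-Python sketch of that online-softmax rescaling trick; it is illustrative only, as the real kernels process tiles of keys and values in fast on-chip GPU memory.</p>

```python
import math

def naive_attention(q, keys, values):
    """Standard softmax attention for one query: materializes all n scores."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / z
            for j in range(len(values[0]))]

def streaming_attention(q, keys, values):
    """Same result via online softmax: one pass, no n-sized score buffer.

    This is the rescaling trick FlashAttention builds on to process
    keys/values block by block instead of all at once.
    """
    d = len(q)
    m = float("-inf")             # running max of scores
    z = 0.0                       # running softmax denominator
    acc = [0.0] * len(values[0])  # running weighted sum of values
    for k, v in zip(keys, values):
        s = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
        m_new = max(m, s)
        scale = math.exp(m - m_new)   # rescale old state to the new max
        w = math.exp(s - m_new)
        z = z * scale + w
        acc = [a * scale + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]

q = [0.3, -1.2, 0.7]
keys = [[0.1, 0.4, -0.2], [1.0, -0.5, 0.3], [-0.7, 0.2, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
a = naive_attention(q, keys, values)
b = streaming_attention(q, keys, values)
print(max(abs(x - y) for x, y in zip(a, b)))  # ~0: identical up to float error
```

<p class="paragraph" style="text-align:left;">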
The method can scale to much longer sequence lengths than previous approaches, which is crucial for improving the context understanding of large language models.</p><p class="paragraph" style="text-align:left;">This research not only improves the efficiency of <b>attention mechanisms</b> but also shows how leveraging hardware-specific features can lead to substantial performance gains in AI models.</p><h2 class="heading" style="text-align:left;" id="frameworks-we-love">Frameworks We Love</h2><p class="paragraph" style="text-align:left;">Some frameworks that caught our attention in the last week include:</p><ul><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2407.13766?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">Visual Haystacks</a>: A vision-centric &quot;needle-in-a-haystack&quot; benchmark that tests how well large multimodal models retrieve and reason over large collections of unrelated images.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2407.13759?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">StreetScapes</a>: Generates long, consistent sequences of street-level city views through synthesized urban environments.</p></li><li><p class="paragraph" style="text-align:left;"><a class="link" href="https://arxiv.org/pdf/2407.13761?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">SegPoint</a>: A 3D point cloud segmentation model that leverages a multimodal LLM to produce point-wise segmentation masks for various tasks.</p></li></ul><p class="paragraph" style="text-align:left;">If you want your framework to be featured here, reply to this email and say 
hi :)</p><h2 class="heading" style="text-align:left;" id="conversations-we-loved">Conversations We Loved</h2><p class="paragraph" style="text-align:left;">Renee Shah&#39;s insights on the rising prominence of <b>headless data architecture </b>shed light on a significant shift in how organizations approach data management and analysis. </p><p class="paragraph" style="text-align:left;">Meanwhile, Hemant Mohapatra&#39;s reflections on <b>Google Cloud Platform&#39;s early challenges</b> offer valuable lessons on the complexities of building and scaling cloud services in a highly competitive market.</p><h3 class="heading" style="text-align:left;" id="is-headless-the-future-of-data-syst">Is Headless the Future of Data Systems?</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXesTOTRLNCrXFRuZ-7IfnXpEoSVAo6MAM48ocW7lTmbwfvxp_aW-dCpdNunRKWJKzMnmnE_ztnzs3jZ8EY-qKin8I2SRGyVwzABLbaXQeAo-5-jB1buci6vRz6zURHCfBJv_ouUw98r7UsyfO9deCyVIHQ?key=RUEzH523fmtjVYvsTmrKfg"/><div class="image__source"><span class="image__source_text"><p>Shah’s discussion about how headless data architecture is becoming a common term these days. <a class="link" href="https://x.com/reneeshah123/status/1812944622890725395?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">(Source)</a></p></span></div></div><p class="paragraph" style="text-align:left;">Renee Shah brings up how <a class="link" href="https://x.com/reneeshah123/status/1812944622890725395?s=46&t=r_IyUhjxHPp6D-O3kuDbzQ&utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">headless data architecture</a> has recently gained significant traction in the data engineering world. 
This approach represents a massive shift towards <b>more flexible and scalable data systems</b>.</p><p class="paragraph" style="text-align:left;">One core benefit is that it decouples storage from compute, allowing for independent resource scaling. It’s also cost-effective because users only pay for the compute resources used during query execution. At Activeloop, we&#39;re huge <b>proponents of compute/storage separation</b> and have had this architecture from day one, six years ago!</p><p class="paragraph" style="text-align:left;">However, it presents a <b>challenge in performance optimization</b>, as ensuring efficient query performance across different engines may require additional tuning. Increased complexity is another issue, as managing multiple components can be more challenging than using a single, integrated system.</p><p class="paragraph" style="text-align:left;">The growing popularity of headless data architecture reflects a broader trend towards <b>modular, cloud-native data systems</b> that prioritize flexibility and scalability. </p><h3 class="heading" style="text-align:left;" id="the-untold-story-of-google-clouds-e">The Untold Story of Google Cloud’s Early Days</h3><div class="image"><img alt="" class="image__image" style="" src="https://lh7-us.googleusercontent.com/docsz/AD_4nXfU4Nj6lK4vnluNWix8B6hg-lv-7zrGWGV0qShRg6D37EPNw8yaQAgqA-QWxe382W8Z9qZUS46EOD0SQSaVDCJ6te7IzlCwRdQoLpvu0LVxBVJhoSLRqVdFtJipom69rD3_FtnpWZIvNBHMHyepgcTAL4A?key=RUEzH523fmtjVYvsTmrKfg"/><div class="image__source"><span class="image__source_text"><p>Mohapatra reflects on insights from the journey of GCP. 
<a class="link" href="https://x.com/MohapatraHemant/status/1812018704034496969?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">(Source) </a></p></span></div></div><p class="paragraph" style="text-align:left;">Hemant Mohapatra, a former Google Cloud employee, shared insights on Twitter about <a class="link" href="https://x.com/MohapatraHemant/status/1812018704034496969?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">Google Cloud Platform&#39;s (GCP)</a> <b>early challenges and lessons learned.</b></p><p class="paragraph" style="text-align:left;">While focusing on cloud-native clients like Snapchat, GCP <b>initially overlooked </b>the larger market of <b>hybrid cloud customers</b> like Goldman Sachs. Early problems included salespeople without quotas, inconsistent branding, title inflation, and weak partnership programs.</p><p class="paragraph" style="text-align:left;">It shows how even companies with superior technology <b>can struggle</b> if they don&#39;t have the right go-to-market strategy and organizational focus. 
This early misstep also illustrates the critical need for companies to understand and adapt to their full market potential.</p><p class="paragraph" style="text-align:left;">Regardless, Mohapatra emphasizes that while these observations reflect past challenges, GCP has since evolved into a <b>much stronger platform and team</b>.</p><h2 class="heading" style="text-align:left;" id="money-moving-in-ai">Money Moving in AI</h2><p class="paragraph" style="text-align:left;">Anthropic launched a $100 million fund for AI startups to speed up the <b>pace of AI development</b>, while Arcee and Eden successfully raised $24 million and $10 million respectively.</p><h3 class="heading" style="text-align:left;" id="anthropic-announces-100-million-fun">Anthropic Announces $100 Million Fund for AI Startups</h3><p class="paragraph" style="text-align:left;">Anthropic, in partnership with Menlo Ventures, has announced a <a class="link" href="https://www.itpro.com/technology/artificial-intelligence/anthropic-just-launched-a-100-million-investment-fund-for-ai-startups?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">$100 million fund</a> called the Anthology Fund to invest in early-stage AI startups. This initiative aims to support entrepreneurs in key areas of AI development and application. 
Startups will receive financial backing and <b>$25,000 in credits</b> to access Anthropic&#39;s advanced language models.</p><h3 class="heading" style="text-align:left;" id="arcee-secures-24-million-in-series-">Arcee Secures $24 Million in Series A Funding</h3><p class="paragraph" style="text-align:left;">Arcee AI, a Miami-based startup specializing in small language models (SLMs), has raised <a class="link" href="https://venturebeat.com/ai/small-language-models-rising-as-arcee-ai-lands-24m-series-a/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">$24 million in Series A funding</a> led by Emergence Capital. This investment comes just six months after their <b>$5.5 million seed round</b> in January 2024. </p><p class="paragraph" style="text-align:left;">Previously, <a class="link" href="https://www.activeloop.ai/resources/how-we-finetuned-a-large-language-model-to-search-patents-generate-new-patents/?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">we partnered with</a> them to co-develop PatentGPT, <b>a patent generation and search engine</b>, built on Arcee&#39;s models and our database for AI, so we&#39;re really happy to see them achieve this milestone!</p><h3 class="heading" style="text-align:left;" id="eden-raises-10-million">Eden Raises $10 Million</h3><p class="paragraph" style="text-align:left;">Eden, a Mexico City-based healthtech startup, has raised <a class="link" href="https://www.axios.com/pro/health-tech-deals/2024/07/17/eden-10-million-radiology?utm_source=genai360.beehiiv.com&utm_medium=newsletter&utm_campaign=a-small-language-model-week-gpt-4-mini-llama-3-405b-leaked" target="_blank" rel="noopener noreferrer nofollow">$10 million</a> in a funding round led by Sierra Ventures, with participation 
from Dalus Capital, Ali Capital, Liquid, and Endeavor. The company specializes in <b>generative AI</b> for medical imaging and diagnostics, aiming to improve radiology services across Latin America.</p></div><div class='beehiiv__footer'><br class='beehiiv__footer__break'><hr class='beehiiv__footer__line'><a target="_blank" class="beehiiv__footer_link" style="text-align: center;" href="https://www.beehiiv.com/?utm_campaign=a3312f2a-4056-4cf7-85d5-736cec7fee69&utm_medium=post_rss&utm_source=genai360_weekly_ai_news">Powered by beehiiv</a></div></div>
  ]]></content:encoded>
</item>

  </channel>
</rss>
