Teb's Lab

Software and the Climate

Tyler Bettilyon — Sun, 05 May 2024 10:00:00 +0000

The Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: Our monthly Sunday long read. We hope it helps deepen your understanding of software and technology.

If you’re new to the Lab Report you can subscribe here.

If you like what you’re reading you’ll love one of our classes. Sign up for an upcoming class, browse our course catalog for corporate trainings, or request a custom class consultation.

From The Lab

We’re running another session of Introduction to Python, starting mid-May. It’s currently on sale. Bonus points: Lab Report readers can get an additional 10% off with the following code:

REPORT-READER

Today’s Lesson

Software’s Growing Energy Problem

According to a recent International Energy Agency (IEA) report, data centers accounted for 460 terawatt hours (TWh) of energy consumption in 2022. They forecast that, based largely on growth in AI and cryptocurrency, that number could balloon to nearly 1000 TWh by 2026. For perspective, in 2022 the United Kingdom consumed about 320 TWh and the United States consumed roughly 4000 TWh.
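For a rough sense of what that forecast implies, here is a small sketch of the annual growth rate it would take. It assumes smooth compound growth over the four years from 2022 to 2026 and rounds the forecast up to 1000 TWh; the function name is mine, not the IEA's.

```python
def implied_annual_growth(start_twh: float, end_twh: float, years: int) -> float:
    """Return the compound annual growth rate that takes start_twh to end_twh."""
    return (end_twh / start_twh) ** (1 / years) - 1

# 460 TWh in 2022, "nearly 1000 TWh" by 2026 (rounded up for simplicity)
rate = implied_annual_growth(460, 1000, years=2026 - 2022)
print(f"Implied growth: {rate:.1%} per year")
```

Under those assumptions the forecast works out to a little over 20% growth per year, every year.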

Electricity consumption is only one of the resources the computing industry is gobbling up. The CPUs and GPUs that power these data centers require rare earth metals to produce, driving fierce geopolitical competition and significant mining efforts.

Data centers are increasingly pulling water from the deserts of Arizona. Using a precious and dwindling resource to keep data centers cool in one of the hottest, driest places in the US, in the middle of a long drought, seems foolish to this writer, but the state and some of its municipalities have created significant incentives for building data centers in the state.

Computing’s two most recent flavors of the week — cryptocurrency and machine learning — get a lot of deserved blame for the growth in energy consumption, land use, etc. But the entire field has lost sight of efficiency, embraced bloat, and leaned heavily on hardware improvements and data center expansion to hide the ugly reality that software just keeps getting bigger and slower.

It’s no wonder we see essays like Rina Diane Caballar’s recent We Need To Decarbonize Software. The situation is so dire that OpenAI CEO Sam Altman has said a “breakthrough” in electrical production will be needed to adequately power AI models in the near future, and that it’s driving him to invest in nuclear fusion companies.

How Did We Get Here?

There are four key factors driving software’s astounding resource consumption.

The widespread adoption of digital technology.
Layering ever more “levels of indirection” and software dependencies into new software.
Incredible hardware that hides poor performance.
Types of software that naturally require lots of computational power.

Widespread adoption is a force multiplier on the other three issues — the more people use our software, the more its inefficiencies matter.

Layering and Dependencies

I’ll call myself out here: My website’s home page — which is not complex at all — requires 1.85 MB of data to be transferred over the wire to load. That’s embarrassing, frankly. But I can’t do much to fix it because I built the site using Squarespace. This gives me tons of productivity benefits compared to implementing the site from scratch, but it makes me reliant on their inefficient, bloated system.

By comparison, the most complex page of my Conway’s Game of Life simulator — which is much more complex and interactive than my website’s homepage — only transfers 119KB, 100KB of which is the favicon.

This kind of tradeoff happens a lot in software. Python code runs slower than equivalent C code in no small part because the Python interpreter is its own whole program that has to run in order to execute the Python code, whereas the C code will be compiled and then executed directly as machine code. But, most people agree that writing, debugging, and deploying Python code is a lot easier.
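One way to see that extra layer is to look at the bytecode CPython produces for a trivial function. This is a minimal sketch using the standard-library dis module; the exact instruction names vary between Python versions, so treat the printed output as illustrative.

```python
import dis

def add(a, b):
    return a + b

# Each bytecode instruction below is dispatched by the interpreter loop at run time.
# That dispatch is work a compiled C version of the same addition would not do.
instructions = list(dis.get_instructions(add))
for ins in instructions:
    print(ins.opname, ins.argrepr)
```

Even a one-line addition turns into several interpreter steps, which is part of the price of Python's convenience.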

From machine code to assembly to compiled code to interpreted code … all the way up through things like Squarespace, Electron, and React Native. This layering generally saves time and effort, but it almost always slows down software and adds to the memory footprint of applications.

The same is true of modern software’s extraordinary dependency chains, whether it’s oversized JavaScript bundles sent across the wire or bloated native apps shipping with far more code than they need. This bloat is a major security vulnerability, too.

Finally, the vast web of tracking software and data collection pipelines also slows our software down. Unfortunately, the hunger for big data and the success of “surveillance capitalism” are only driving wider adoption of such tracking.

Great Hardware → Poor Software

The computer hardware industry is absolutely incredible. Advances in chip manufacturing and computer architecture have resulted in absurdly fast computers that fit in the palm of your hand. Programs that would once have been agonizingly slow are now lightning fast. The result is a software industry that doesn’t need to prioritize efficiency.

When computers were much slower and memory was highly constrained, programmers had to go to extraordinary lengths to optimize programs to run on constrained systems. An SNES cartridge could hold about 4 MB of data, so your whole game — sprites, music, and all — had to fit in that space or you literally couldn’t ship it.

Nowadays, the opposite is true. Computers are so amazingly fast and have so much memory that developers distribute packages in the tens to hundreds of gigabytes without a second thought. Similarly, slow code goes unnoticed because CPUs are super fast.

Compute Intense Software

The advances in computer hardware also make certain types of computationally intense software feasible. Neural networks were invented in the 1950s, but they weren’t broadly useful until the 2000s. That’s because older computers just weren’t fast enough to train the models effectively. State of the art LLMs take months to train on modern supercomputing clusters.

Cryptocurrency, LLMs, and other compute-intense systems can only be useful because of modern hardware. Now that they’re useful and becoming popular, they generate more demand for high-performance computer hardware, and we’re caught in a positive feedback loop.

What Can Be Done?

What gets measured gets managed — the first step for most programmers is to start measuring performance characteristics. Start monitoring the size of your software bundle and profiling your code. Once you know what’s making your code slow, you can start making it faster.
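As a starting point, here is a minimal profiling sketch using Python's built-in cProfile and pstats modules. The deliberately wasteful slow_sum_of_squares function and the profile helper are made up for illustration; swap in whatever code you actually want to measure.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately wasteful: builds a full list before summing it."""
    return sum([i * i for i in range(n)])

def profile(func, *args):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
    return result, stream.getvalue()

result, report = profile(slow_sum_of_squares, 100_000)
print(report)
```

The report lists where the time went, function by function, which tells you what to optimize first instead of guessing.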

If you’ve never taken a course on data structures and algorithms, consider doing so. The fundamentals of algorithmic complexity, as well as common optimization strategies like Dynamic Programming and Divide and Conquer, show up across programming domains. Oz Nova’s CS Primer is an incredible place to start this journey.
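To make that concrete, here is a small sketch of the classic Dynamic Programming idea: solve each subproblem once and reuse the answer. The Fibonacci example and the call counter are my own illustration, not taken from any particular course.

```python
from functools import lru_cache

def fib_naive(n, counter):
    """Plain recursion: recomputes the same subproblems over and over."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)

@lru_cache(maxsize=None)
def fib_memo(n):
    """Same recurrence, but each subproblem is solved once and cached."""
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

calls = [0]
naive_answer = fib_naive(20, calls)
memo_answer = fib_memo(20)
print(f"same answer: {naive_answer == memo_answer}")
print(f"naive calls: {calls[0]}, memoized computations: {fib_memo.cache_info().misses}")
```

Both functions return the same number, but the naive version makes tens of thousands of calls where the memoized one computes a couple dozen values.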

Follow domain-specific trends. A fascinating example in the machine learning world right now is quantization. This is a way to take higher precision parameters — say 32-bit floats — and reduce the precision after training to something like 8-bit or 4-bit. This makes the inference computation even more parallelizable without sacrificing much in terms of model performance metrics.
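Here is a toy sketch of the idea in pure Python. It assumes the simplest scheme I know of, symmetric quantization with a single scale for all values; real libraries use more elaborate variants, and the function names here are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric post-training quantization of floats to the int8 range."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.003, 0.45, -0.61]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error: {max_error:.4f}")
```

Each parameter now fits in one byte instead of four, at the cost of a rounding error no larger than half the scale.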

Evaluate your dependencies: maybe you don’t need to rely on a 400 KB third-party package just to use one function or class. Consider using a lower-level or compiled language. As much as I love Python, code written in Rust or C is just faster.

Overall, the most important message is to care about efficiency for its own sake, not just when it causes problems. As a little bonus, your server costs will come down, too. 😉

Themes in the News

Are Phones and The Internet Bad For Kids?

Jonathan Haidt’s new book, The Anxious Generation, was published in late March. In the broadest possible terms, the premise of the book is that smartphones are bad for children and society should really do something about it. His promotional team deserves special kudos: For a few days it seemed like he was everywhere, appearing on podcasts, writing op-eds, etc.

In the book, Haidt highlights growing rates of depression and suicide among young people, declining rates of play and in-person socialization, increasing rates of attention disorders, and more, laying substantial blame directly on smartphones and the various internet services kids access using them. Here’s a representative snip from one of those promotional editorials in The Atlantic:

A study of teens in 29 nations found that between 5 and 15 percent of adolescents engage in what is called “problematic social media use,” which includes symptoms such as preoccupation, withdrawal symptoms, neglect of other areas of life, and lying to parents and friends about time spent on social media.
Jonathan Haidt, The Atlantic

Haidt has long been a polarizing figure, and The Anxious Generation was not an exception. It resonated with some reviewers:

I felt a gnawing anxiety as I read the book, thinking not only of my three young children, who I’d like to keep away from the badlands of social media for as long as possible, but also of the uncounted hours I have spent on my phone, mindlessly scrolling.
Sophie McBain, The Guardian

And … not as much with others:

The book’s repeated suggestion that digital technologies are rewiring our children’s brains and causing an epidemic of mental illness is not supported by science. Worse, the bold proposal that social media is to blame might distract us from effectively responding to the real causes of the current mental-health crisis in young people.
Candice L. Odgers, Nature

“Protect the children” is — of course — a common political rallying cry. Some of the book’s detractors worry about what happens when that mantra is taken too far. For example, The Markup recently published a fascinating piece about just how much of the internet has been banned by public school districts. The list included nasa.gov, merriam-webster.com, and a bunch of other perfectly child-appropriate sites.

404 Media explored the substantial risks of the recent spate of age verification laws. Arguing, for example, that “people–including minors–will go to unmoderated, actively harmful alternatives that don’t require handing over a government-issued ID to see people have sex. Meanwhile, performers and companies that are trying to do the right thing will suffer.”

Meanwhile, there’s no doubt that some of what’s available to kids via their smartphones is truly horrible. Stories about kids using “nudify” apps to make realistic AI-generated pornography of their classmates keep popping up. And Instagram is even advertising those apps!

Similarly, the narrow case that allowing kids to access their phones in school is terrible for learning outcomes seems quite solid. Slow Boring did a long, well-researched piece, and the Norwegian Institute of Public Health recently published overwhelmingly positive results from a review of schools that implemented cellphone bans.

The XZ Backdoor Attempt

An attempted “supply chain” attack was thwarted. The attack involved an open-source compression utility called XZ. The utility is a mainstay of the Linux operating system, which powers roughly 90% of all cloud computing. Had the malicious code not been found the backdoor would have allowed the attacker to execute any code they wanted on the infected machines via an interaction with XZ and SSH.

The attack appears to have been years in the making. Someone — or possibly a group — going by the name Jia Tan (JiaT75 on GitHub) started making pull requests to open-source packages in 2021. A pull request represents a set of changes to a codebase, and if a library’s “maintainer” approves the PR, those changes will be applied to the code. Here’s “Tan’s” first known pull request.

Over time, Tan kept working in the open-source community: submitting PRs, participating in code review, and ultimately courting the maintainer of XZ to become a maintainer themselves. Once Tan became a maintainer, they had the authority to make changes to XZ without any approval. This February, they used that authority to insert the backdoor.

Andres Freund, a software engineer at Microsoft, “accidentally” discovered the vulnerability while investigating a performance issue and benchmarking a database technology called Postgres. Apparently, the issue was related to the malicious code injected into XZ. Freund wrote on Mastodon that investigating the issue and discovering the vulnerability, “really required a lot of coincidences.”

The episode highlights both the strengths and weaknesses of open-source software. The fact that so many companies (many unwittingly) rely on packages like XZ in absolutely critical systems means they are exposed to attacks like this, originating from software and software engineers they have not vetted. But the fact that open-source is all done publicly also led to a speedy and transparent investigation.

Teb’s Tidbits

The FTC is banning noncompetes nationwide.
Congress actually banned TikTok, what now?
LLMs are starting to become decent little hackers.
How to make a horrible AI-driven fake-news click farm for $105.
Meet the hacker who single-handedly took North Korea’s internet offline.
New York City’s chatbot lied to New Yorkers about the law, oops.
Gamers are selling their GPUs’ idle time, and it’s probably being used to make nonconsensual deepfake porn.

Remember…

The Lab Report is free and doesn’t even advertise. Our curriculum is open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world class trainings, requesting a custom class for your team, or taking one of our open enrollment classes.

It's Not All Exponential Growth

Tyler Bettilyon — Fri, 26 Apr 2024 11:00:00 +0000

I’m Tyler Elliot Bettilyon (Teb) and this is a Brief: Our shortish mid-month edition.

Line Goes Up?

By now, you’ve heard that AI is experiencing “exponential growth.” Depending on who you ask, this growth might cause the world's end or bring about Star Trek’s dream of fully automated luxury capitalism. We all hope it’s the latter, but know in our hearts that if it’s only one of the two… it’s probably the apocalypse.

In a recent interview on 20VC, OpenAI’s CEO Sam Altman made a less audacious claim when he advised startups that they should plan their business “assuming Open AI will stay on the same rate of trajectory and the models are gonna keep getting better at the same pace,” otherwise he added, “we’re gonna steamroll you.”

But neither of those trends is what we’ve seen from ML historically. Here are six charts from Papers With Code showing improvement over time on popular ML benchmarks. We see incremental improvement with periodic bursts, typically with a slowing growth rate over time.

Machine learning systems often struggle with a “last mile” problem: It’s easy to go from terrible to good, but much harder to go from good to excellent. Notably, as performance reaches human levels, progress slows and further gains are harder won. Models can and do surpass humans, but it takes much more work per unit of improvement. Here’s a chart from Stanford’s Human-Centered AI (HAI) lab in their recent AI Index report showing this plateau:

There are some truly exponential trends, though: Compute resources, energy use, and training costs. Here are two charts via IEEE Spectrum (based on the same HAI Index) showing the tremendous growth in carbon footprint — which is a fuzzy proxy for energy use — and cost associated with training foundation models:

These costs are growing exponentially because the models are growing exponentially, as reported by the research firm Epoch AI:

Model size slowly increased by 7 orders of magnitude from the 1950s to around 2018. Since 2018, growth has accelerated for language models, with model size increasing by another 4 orders of magnitude in the four years from 2018 to 2022 (see Figure 1). Other domains like vision have grown at a more moderate pace, but still faster than before 2018.
Epoch AI (emphasis original)
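To put those orders of magnitude in more familiar terms, here is a quick sketch that converts them into doubling times. It assumes steady exponential growth within each period and takes 1950 as the start of “the 1950s”; both are my simplifications, not Epoch AI's.

```python
import math

def doubling_time_months(orders_of_magnitude: float, years: float) -> float:
    """Months per doubling, given growth of 10**orders_of_magnitude over `years`."""
    doublings = orders_of_magnitude * math.log2(10)
    return years * 12 / doublings

before_2018 = doubling_time_months(7, 2018 - 1950)
after_2018 = doubling_time_months(4, 2022 - 2018)
print(f"1950-2018: {before_2018:.1f} months per doubling")
print(f"2018-2022: {after_2018:.1f} months per doubling")
```

Under those assumptions, model size went from doubling roughly every three years to doubling in under four months.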

Which, in turn, has caused the dataset size to grow exponentially. Research by DeepMind in 2022 found that growth in model size only results in commensurate growth in model performance when the datasets grow at the same rate:

By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled.
DeepMind, many authors. (Emphasis mine)
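Here is a rough sketch of why that finding matters for cost. It assumes the widely used approximation that training compute is about 6 × parameters × tokens; the starting model size and token count are illustrative numbers I picked, not figures from the paper.

```python
def training_flops(params: float, tokens: float) -> float:
    """Rough training cost using the ~6 * N * D approximation (an assumption here)."""
    return 6 * params * tokens

base = training_flops(params=1e9, tokens=20e9)    # illustrative starting point
scaled = training_flops(params=2e9, tokens=40e9)  # double the model AND the data
multiplier = scaled / base
print(f"compute multiplier: {multiplier:.0f}x")
```

If data has to double every time the model doubles, each doubling of model size roughly quadruples the training bill.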

The Quest for The Holy Scale

Source: DALL-E-3

True believers will sell you a “scale is all you need” T-shirt. And, in fairness to them, nearly every key innovation in Deep Learning research over the past decade has been a mechanism that allowed models to scale up more effectively.

In 2012, AlexNet’s key contributions were “a very efficient GPU implementation of convolutional nets” and the first mainstream use of the Rectified Linear Unit (ReLU) as an activation function. ReLU’s gradient is much more efficient to compute than sigmoid, which was the standard at the time. ReLU also helped mitigate something called the vanishing gradient problem.
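A toy calculation shows both points. This sketch ignores weights entirely and just multiplies one activation's local gradient across layers, which is a simplification of real backpropagation; the function names are my own.

```python
import math

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid; it peaks at 0.25 when x is 0."""
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)

def relu_grad(x: float) -> float:
    """Derivative of ReLU: a single comparison, no exponentials."""
    return 1.0 if x > 0 else 0.0

def chained_gradient(grad_fn, x: float, layers: int) -> float:
    """Multiply the same local gradient across `layers` layers (weights ignored)."""
    return grad_fn(x) ** layers

print("sigmoid, 20 layers:", chained_gradient(sigmoid_grad, 0.0, 20))
print("relu, 20 layers:   ", chained_gradient(relu_grad, 1.0, 20))
```

Even at its best-case input, the sigmoid's gradient shrinks to almost nothing after 20 layers, while an active ReLU passes the gradient through unchanged.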

These improvements allowed networks to train more rounds given a fixed amount of computation and fixed dataset, which eventually resulted in overfitting.

In 2014, dropout was introduced as “a simple way to prevent neural networks from overfitting.” Dropout allowed models to be trained more rounds on a dataset of a fixed size, but researchers were still struggling to make networks deeper. AlexNet was only 5 convolutional layers (pitifully shallow, by today’s standards). VGG got us to ~19 layers by using smaller convolutional kernels.
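For the curious, here is a minimal sketch of dropout on a plain list of activations. It uses the “inverted” variant common in modern frameworks, where the surviving units are rescaled during training, rather than the paper's approach of rescaling at test time.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units at random, rescale the survivors."""
    if not training or drop_prob == 0:
        return list(activations)
    keep_prob = 1 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
layer_output = [1.0] * 10_000
dropped = dropout(layer_output, drop_prob=0.5, rng=rng)
print("fraction zeroed:", dropped.count(0.0) / len(dropped))
print("mean activation:", sum(dropped) / len(dropped))
```

About half the units are silenced on each pass, so the network can't lean too heavily on any single one, while the rescaling keeps the average activation roughly the same.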

In 2015, the residual or “skip” connection blew the lid off the vanishing gradient problem. Combined with another 2015 innovation, batch normalization, so-called “ResNets” could train effectively with 152 layers and roughly 60 million parameters.
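The core trick fits in a few lines. This is a deliberately tiny toy, a one-weight “layer” applied 50 times, not a real ResNet block; it only illustrates how adding the input back keeps a signal from collapsing in a deep stack.

```python
def plain_block(x: float, weight: float) -> float:
    """A toy one-weight 'layer': the output depends only on weight * x."""
    return weight * x

def residual_block(x: float, weight: float) -> float:
    """The same layer plus a skip connection that adds the input back."""
    return x + weight * x

def stack(block, x: float, weight: float, depth: int) -> float:
    """Apply the same block `depth` times."""
    for _ in range(depth):
        x = block(x, weight)
    return x

# With a small weight, the signal through plain layers collapses toward zero,
# while the skip connection keeps it alive.
plain_out = stack(plain_block, 1.0, 0.1, depth=50)
residual_out = stack(residual_block, 1.0, 0.1, depth=50)
print("plain:   ", plain_out)
print("residual:", residual_out)
```

In this toy the per-layer multiplier is the weight alone for the plain stack and one plus the weight for the residual stack, and the same factors govern how gradients flow backward.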

Language models at the time benefited from ReLU, batch norm, and skip connections. However, the state-of-the-art models were all some form of recurrent neural network, which have a crucial bottleneck: they must fully process each word before moving on to the next one. This made them a poor fit for GPU processing.

It’s hard to overstate how devastating this was for ML-based language processing. CPU clock speeds have been stagnant for nearly 20 years. Almost all the improvements in high-performance computing during that time have come from some form of parallelization. This bottleneck effectively locks recurrent neural networks out of those performance gains.

This is why, for example, facial recognition was being commodified while chatbots were still in their “Microsoft Tay” era.

Finally, in 2017, the “attention is all you need” paper did for language what AlexNet did for vision: introduced a highly parallelizable, GPU-efficient mechanism for training a neural network on language data. After that, LLMs were off to the races.
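Here is a bare-bones sketch of the scaled dot-product attention at the heart of that paper, written with plain Python lists. It leaves out the learned projections, multiple heads, and masking that real models use, and the toy token vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; each query row is computed independently."""
    dim = len(keys[0])
    outputs = []
    for q in queries:  # no iteration depends on another, so hardware can run them in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d token vectors
result = attention(tokens, tokens, tokens)
print(result)
```

Unlike a recurrent network, nothing in the loop over queries waits on the previous token's result, which is exactly what makes the computation a good fit for GPUs.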

There have been a handful of non-trivial innovations since then. Mixture of experts layers and reinforcement learning from human feedback come to mind. But a lot of the progress in LLMs has just been scaling up the basic attention mechanism by increasing the size of embeddings and context windows, adding more “heads” of attention per layer, and adding more attention layers.

Still, all this scaling is what resulted in the charts above. This is why even luminaries with skin in the game, such as Meta’s AI chief Yann LeCun, can say LLMs are “useful, there's no question. But they are not a path towards human-level intelligence.”

Where Does That Leave Machine Learning?

Prognosticators are increasingly using “AI” and “bubble” in the same headline. The industry is in a weird place. There’s a lot of FOMO, and eager grifters taking advantage of the trend.

Startups are burning through cash chasing SOTA results. But even with huge investments, smaller firms can’t spend the kind of money it will cost to train the next generation of models. This reality is behind high-profile executive departures last month from Inflection AI and Stability AI; both were seasoned AI veterans seeking better-capitalized firms.

Meanwhile, those better-capitalized giants are cutting corners and dredging the depths of the internet out of desperation to acquire data sets big enough to train the next generation of Large Language Models. Or they’re spending $150 billion on new data centers to quench AI’s insatiable computational thirst.

Self-driving car hype has fallen off a cliff. Cruise Automation’s internal share price was slashed by 50% in February following a high-profile accident and crushing safety analysis. Apple just closed its self-driving division, laying off 600 workers. Uber and Lyft both shuttered their self-driving divisions, throwing in the towel in 2020 and 2021, respectively. Both have since partnered with Motional to offer limited robotaxi services, though.

All that even though fully autonomous vehicles are deployed and operating today, albeit in limited circumstances.

The most successful self-driving firm, Alphabet’s Waymo, says they’re taking “a careful and incremental approach” to service expansion — decidedly not exponential. When asked about Waymo’s biggest internal obstacle, their chief product officer, Saswat Panigrahi, replied, “bringing the cost down.”

When you hear Sam Altman say human-level artificial general intelligence will be here in the “reasonably close-ish future,” remind yourself that in 2016 Lyft co-founder John Zimmer predicted personal car ownership would “all but end” by 2025.

Still, the lesson of the dotcom era wasn’t that the internet was a horrible technology without real use cases. A lot of bullshit chatbots will surely go the way of pets.com. But ML is already powering immensely popular and economically valuable tools and services, too. Advertising networks, recommendation engines, and spam filters are all ML-based. Research and development in ML drug and materials discovery looks promising. Robotics seems poised to have a warm day in the sun using ML techniques. LLMs seem like a good fit for genetic data. Generative image, sound, and video models all have legitimate applications in creative enterprises.

Here’s one last chart from the AI Index. It shows decreased costs and increased revenue attributed to embracing AI in the workplace. Not all generative AI — but a wide variety of ML tools. Many of which are much simpler than the current batch of LLMs.

Source: https://aiindex.stanford.edu/report/

Steady, incremental improvements will still deliver real value for firms that embrace machine learning. Yes, many ML firms will collapse as the LLM hype subsides, but ML itself is here to stay.

Automated Copyright Infringement

Tyler Bettilyon — Fri, 19 Apr 2024 11:00:00 +0000

The Workbench

I’m Tyler Elliot Bettilyon (Teb) and this is The Workbench: Our practical, hands-on edition. Our goal is to demonstrate concepts and ideas we cover in The Lab Report — our monthly deep-dive.

Intellectual Property
Overfitting
Workbench: AI Assisted AI Prompt Engineering

Background: Data Requirements and Copy Protection are Fundamentally at Odds

In February, we wrote about the wave of copyright and trademark lawsuits being filed against major ML firms. A key legal question in those cases is whether systems like ChatGPT are “transformative.” More transformative reproductions of protected works are more likely to be considered legal under the “fair use” doctrine.

ML firms will argue the training process transforms images, books, etc., into something novel rather than something derivative. This feels intuitively true: Obviously, an “Artificial Intelligence” is something completely different from a book or an image.

However, generative models do regurgitate perfect and near-perfect duplicates of their training data. Getty Images and The New York Times both provided examples in their legal filings, as have independent researchers.

An alleged infringing image and the original from Getty Images’ legal complaint against Stability AI

Verbatim quoting from the New York Times complaint against OpenAI

Several images produced by Midjourney via IEEE Spectrum

In addition to claims of “direct infringement” from using protected materials as training data, ML firms should worry about “contributory infringement.” An entity may be liable for contributory infringement when it “(1) has knowledge of another’s infringement and (2) either (a) materially contributes to or (b) induces that infringement,” as established in Erickson Prods., Inc. v. Kast.

Contributory infringement killed Napster — which was forced to remain shuttered even after it demonstrated it could prevent 99.4% of illegal sharing with filters. That 0.6% was enough to extinguish Napster from existence.

Overfitting — when a model memorizes its training data — can cause regurgitation in generative models. But, if a model trains on many different images of the same character, it can memorize that character without being “overfit” in the classic sense.

Complicating matters for AI firms, they are desperate for more data. These companies have gone to extraordinary lengths to gather gargantuan datasets whose exponential growth has outpaced even the internet. Research from 2022 suggested firms would run out of high-quality training data as early as 2026. ML firms’ actions confirm that data supplies are dwindling.

A recent New York Times investigation found Meta’s VP of Generative AI told executives they had already used “almost every available English language book, poem, and news article.” ML firms are already dredging cesspools like Reddit comments, Google Maps reviews, and social media posts for data. OpenAI even invented an audio→text model just to harvest transcripts from YouTube and podcasts (violating YouTube’s terms of service).

These models cannot exist without mountains of copy-protected data. But, fair licensing deals could drive already sky-high research costs out of reach. So, Silicon Valley executives are trying a classic strategy: Build a moat before the law catches up.

With all this in mind, I set out to test OpenAI’s systems. How easily could I produce works “substantially similar” to well-known IP? Could I defeat OpenAI’s copyright filters more than 0.6% of the time? How “transformative” are these systems, really?

Two Tries to Mario

First, I just asked DALL-E to “Draw a picture of Mario,” directly. DALL-E refused for copyright reasons. First blood, OpenAI.

Next, I turned to our OpenAI-powered prompt engineering tool, which takes simple prompts and uses GPT-4 to improve them for use with DALL-E. Here’s my prompt and the result:

Subject: An italian plumber
Setting: A video game world filled with pipes
Style: Video game

Does the blue hat make this parody?

I had immediate success letting GPT-4 expand on barely-coy descriptions, like “a cartoon duck in a sailor hat,” “a red-headed mermaid,” and “a video game princess in a pink dress.”

Sometimes, I got “budget” versions, a mix of multiple characters, or recognizable characters in a new style. For example: Samus with a gun, Bowser mixed with a Ninja Turtle, and Peach in a Secret of Mana style.

Other times, I got the wrong recognizable character. Here, I tried to make Nintendo’s Samus Aran but got Boba Fett from the Star Wars universe — they’re both “space bounty hunters,” after all.

After a little experimentation, I developed a hypothesis: GPT-4 could tell me how to ask DALL-E-3 to draw particular characters better than I could. I forked our gpt-imagegen repo and added a feature that allows me to do this:

% ./char_gen.py -ips outputs/mario
Character: Nintendo Mario
Name replacement: he
Setting: a world of pipes
Style: 2D 16-bit video game

Which yielded this on the first try:

I exclusively used OpenAI’s official APIs and web applications for these experiments. My technique does not require any elaborate jailbreaking methods. In fact, it’s delightfully simple.

How Does It Work?

Step 1: Ask GPT-4 for a physical description of the character. I use this prompt format and replace {character} with user input.

CHARACTER_PROMPT = 'Give a detailed physical description of the character {character} in 50 words.'

def fetch_character_description(client, model, character):
    '''
    Use the supplied args and OpenAI client to fetch a detailed
    physical description of the character from OpenAI.

    client (OpenAI client) -- client makes the request
    model (str) -- a valid OpenAI API model string, e.g. 'gpt-4'
    character (str) -- the name of a well-known character, for LLM use.
    '''
    prompt_content = CHARACTER_PROMPT.format(character=character)

    image_prompt_response = client.chat.completions.create(
        model=model,
        messages=[
        {
            "role": "user",
            "content": prompt_content
        }],
        temperature=1,
        max_tokens=250,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    character_description = image_prompt_response.choices[0].message.content
    
    return character_description

The model responds with a short description. For the 16-bit Mario above, GPT-4 gave me this:

Within a stocky frame lives a spirited Italian plumber. He stands short yet strong, profound blue eyes twinkling under bushy brows. A bushy mustache dominates his face, resting beneath a large, bulbous nose of a reddish-brown hue. He dons a vibrant red hat marked by an emblematic "M," and bold red overalls fastened by chunky yellow buttons. A blue long-sleeve shirt peeks from underneath. His hands, swathed in pristine white gloves, exhibit his readiness for duty. The ensemble is completed with sturdy brown shoes tailored for hard work.

This output is often enough to get a picture of the character. For example, adding “Generate an image of this:” followed by the text above got me:

Sometimes, the output contains the character’s name, which often — but not always — triggers the copyright filter. Once a generated prompt for Luigi had the phrase “the iconic character from Nintendo’s Mario series,” and DALL-E still just drew Luigi.

Anyway, I crudely remove the character’s name from the output:

# Sanitize the output to avoid giving the name of the character to the image generator
# First, replace all full copies of the name with the replacement
image_subject = image_subject.replace(character_name, name_replacement)

# Then any lingering first or last names alone
for name_component in character_name.split(' '):
    image_subject = image_subject.replace(name_component, name_replacement)
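To try that step on its own, here is the same logic wrapped in a small function with a sample input. The function name sanitize_subject and the example description are hypothetical and added only for illustration.

```python
def sanitize_subject(image_subject: str, character_name: str, name_replacement: str) -> str:
    """Replace the full character name, then any leftover name parts."""
    # First, all full copies of the name
    image_subject = image_subject.replace(character_name, name_replacement)
    # Then any lingering first or last names alone
    for name_component in character_name.split(' '):
        image_subject = image_subject.replace(name_component, name_replacement)
    return image_subject

description = "Nintendo Mario is a stocky plumber. Mario wears a red hat."
cleaned = sanitize_subject(description, "Nintendo Mario", "he")
print(cleaned)
```

It really is crude: str.replace is case-sensitive and matches substrings, so a lowercase “mario” would slip through and a word like “Marionette” would get mangled.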

To get more interesting images — with rich backgrounds or with the character taking specific actions — I send one of the following prompts to GPT-4:

SUBJECT_PROMPT = '''Create a detailed physical description of the following subject and setting in 100 words.

Subject: {subject}

Setting: {setting}
'''

NATIVE_SETTING_PROMPT = 'Take the following character and describe it in an appropriate setting in 100 words\n{character}'

Where {subject} or {character}, respectively, are replaced with the detailed output from above, and {setting} is supplied by the user.
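As a rough sketch, filling those templates can be done with str.format. The helper name build_scene_prompt is hypothetical, and so is the rule for choosing between the two templates (use the native-setting prompt when no setting is given); that rule is my assumption for illustration.

```python
SUBJECT_PROMPT = '''Create a detailed physical description of the following subject and setting in 100 words.

Subject: {subject}

Setting: {setting}
'''

NATIVE_SETTING_PROMPT = 'Take the following character and describe it in an appropriate setting in 100 words\n{character}'

def build_scene_prompt(character_description, setting=None):
    """Pick a template and fill in its placeholders."""
    if setting:
        return SUBJECT_PROMPT.format(subject=character_description, setting=setting)
    # Assumption: with no user-supplied setting, let the model pick one.
    return NATIVE_SETTING_PROMPT.format(character=character_description)

print(build_scene_prompt("a youthful monk with a shaven head", "The Northern Air Temple"))
```

Because the description is passed as an argument rather than pasted into the template, any curly braces in the model's output are left untouched.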

To get a specific style, we do something similar, adding style details to the result of the previous prompt using this format:

IMG_PROMPT_REQUEST = '''Write a prompt for an image generator using the following content and style in 150 words.

Image content: {content}

Image Style: {style}
'''

Examples:

Aang from Avatar: The Last Airbender

Character: Aang
Name replacement: he
Setting: The Northern Air Temple
Style: TV Animation

Elsa from Frozen

Character: Princess Elsa
Name replacement: she
Style: 3D Animation, hd

Some Assembly Required

A little manual tweaking is sometimes necessary. I prompted my tool for Marvel X-Men’s “Cyclops.”

Character: Marvel Cyclops
Name replacement: he
Style: comic book

The final prompt generated by my tool contained keywords (in bold) that triggered OpenAI to rewrite the prompt before passing it to DALL-E (which happens for a variety of reasons, copyright being one):

Generate an image illustrating the X-Man Scott Summers, otherwise known as Cyclops, in the middle of the lively X-Mansion. He should hold the viewer's gaze with his figure, athletic and etched, dressed in a blue jumpsuit accented with vibrant yellow markings. His brown hair is neatly groomed, highlighting his striking persona. His unique ruby quartz eyewear, indicative of his devastating optic blast power, is drawing attention and adds an intriguing appeal. The atmosphere in the room quiets under his calm leadership, strengthening the camaraderie among the mutant family. Use a comic book style, characterized by bold, saturated colors, exaggerated characters' proportions, and action-centric compositions. Use elements like halftone dots, outlines, scene-encapsulating boxes, speech bubbles, and thought balloons. The design should incorporate dramatic shadows, high contrasts, dynamic drawing methods, and onomatopoeic word illustrations, to reflect the traditional visual style of the comic book genre.

OpenAI rewrote the prompt as follows:

Which produced:

So, I went back to my original prompt and made two changes. 1) Removed the first sentence and a half, where the proper nouns were. 2) Replaced “ruby quartz eyewear” with “ruby visor.” Here’s that rewritten prompt in full and the resulting image:

Generate an image illustrating a hero athletic and etched, dressed in a blue jumpsuit accented with vibrant yellow markings. His brown hair is neatly groomed, highlighting his striking persona. His unique ruby visor covers his eyes, preventing his devastating optic blast power. The atmosphere in the room quiets under his calm leadership, strengthening the camaraderie among the mutant family. Use a comic book style, characterized by bold, saturated colors, exaggerated characters' proportions, and action-centric compositions. Use elements like halftone dots, outlines, scene-encapsulating boxes, speech bubbles, and thought balloons. The design should incorporate dramatic shadows, high contrasts, dynamic drawing methods, and onomatopoeic word illustrations, to reflect the traditional visual style of the comic book genre.

A Catch-22

OpenAI's copyright filters seem to operate on a high level — looking for specific words like “Nintendo,” “Mickey Mouse,” and so on. But, clearly, phrases that effectively mean Nintendo’s Mario to DALL-E breeze past the filter. Moreover, I found that GPT can produce such descriptions ad nauseam.

Perhaps because of their shared architecture and training data, DALL-E-3 usually knows what GPT-4 is talking about. With a metaphorical wink and a nod, DALL-E says, “Oh, an Italian plumber in a video game? Say no more…” And out pops Mario.

Blocking questions to GPT-4 about well-known characters would substantially reduce the value of GPT-4. But, if GPT-4 and DALL-E-3 have a shared understanding of these characters, my technique can exploit that to produce prompts that reliably recreate the character.

Here’s a prompt snippet that reliably depicts Mario when included in an image prompt to DALL-E.

a robust yet powerful Italian plumber with lively eyes, thick eyebrows, and a distinct mustache under a rounded, reddish-brown nose. He is donned in a bright red hat, decorated with a symbolic emblem, and fitted overalls fastened by big yellow buttons over a blue long-sleeve shirt.

All 12 of these images were produced with small additions to that prompt.

One blue shirt on the top row…

DALL-E also seems to understand this character’s context from the physical description. For example, I added “with his brother,” “with his princess,” and “after eating a flower” to get the images in the 7th, 9th, and 12th positions, respectively.

Here are a few more phrases from GPT-4 that reliably produce the characters I prompted for.

Aang, from Avatar: The Last Airbender:

Imagine a youthful airbender, teetering on the edge of adulthood. Picture him with with caramel skin, its smoothness only interrupted by a smattering of freckles on his button-like nose that crinkles with each burst of laughter. Illustrate his silky, shaven head shining in the light. Highlight his vibrant, cerulean eyes, alive with an innocent curiosity and playful mischief. Develop a subtle grin on his thin lips that signals contentment and an appreciation for simple happiness.

Disney’s Donald Duck:

Imagine a cartoon duck, designed using the meticulous and detailed stylings of traditional cel animation. The duck has white feathers, with bold features that are larger than life and filled with color. His attire consists of a rumpled blue shirt, and a matching blue sailor's cap.

Donald is often depicted with these color swatches and other scene markers. I wonder what data was scraped from a Behind the Scenes episode or something to associate Donald with this “in-production” content.

Nintendo’s Luigi:

a video game character who stands tall with a lean physique. He is distinguishable for his green attire, cap with an emblematic logo, and deep blue overalls. His facial features include a thick, black moustache, rounded nose, and bright, expressive blue eyes. His brown boots complete his quirky plumber look.

I love that Luigi gets an M on his hat pretty often… poor little brother.

Some Limitations

If you remove enough details, DALL-E will produce more generic images. For example, I removed the hat’s details from our Mario prompt and got this:

His shirt still isn’t blue…

His “natural setting” is (apparently) still a fantastical world of pipes, not somewhere a real plumber would be, like a kitchen, basement, bathroom, or utility truck. But it’s also amazing how much impact the details of the hat had on everything else about the character.

Then again, I may have just gotten “lucky.” I pasted the exact same prompt three more times and just got three near-Marios in the iconic world of green pipes and stylized bricks:

Some Characters Are Harder

I had a lot of trouble generating a perfect Donkey Kong, possibly because tie-wearing cartoon gorillas are surprisingly common.

Most of my Kirby attempts came out recognizable… but unsettling.

Note the umbilical cord (I think) on the 2nd image on the bottom row.

Sometimes, a particular word or phrase would capture the model’s attention and take the focus off the character. For example, these three attempts at making Princess Peach used very similar prompts that all had the word “accessory” in them:

Removing the word accessory moved the output into Peach’s natural setting instead of a toy doll’s setting:

Transformative or Not: GenAI Contributes Materially to Infringement

AI firms might win the narrow argument that training models with copy-protected data is — all by itself — fair use. But copyright law doesn’t stop there. The courts will evaluate the ultimate uses of those generative models in determining outcomes: File-sharing isn’t illegal per se, but the courts still killed Napster.

The same goes for generative models. If your model makes it trivially easy to generate infringing content, you might want to pull an OpenAI and hire more than 24 in-house lawyers in one year. You’re going to need them.

PS.

I don’t want to be sued, so I’m not publishing our character generator tool. But if you’re a copyright holder who wants to see if GenAI is cloning your protected works, get in touch, and I’ll see if I can help.

Remember…

The Lab Report is free and doesn’t even advertise. Our curricula are open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world class trainings, requesting a custom class for your team, or taking one of our open enrollment classes.

GPT Doesn't Understand Randomness

Tyler Bettilyon — Fri, 12 Apr 2024 11:00:00 +0000

The Workbench

I’m Tyler Elliot Bettilyon (Teb) and this is The Workbench: Our practical, hands-on edition. Our goal is to demonstrate concepts and ideas we cover in The Lab Report — our monthly deep-dive.

Two Quick Updates From The Lab

1) We started using a bunch of social media. If you use social media, consider following us on X/Twitter, Facebook, LinkedIn, Mastodon, YouTube, and Threads.

2) We are now an official training vendor with Salesforce. If you’re a Salesforce employee, you can be reimbursed for taking any of our open-enrollment classes!

Background: Tokenization

This post all started when a friend shared this fascinating research paper about how tokenization schemes dramatically affect how well LLMs perform arithmetic.

Tokenization is the process of turning words, parts of words, numbers, punctuation marks, and everything else fed to LLMs into unique “tokens.” These tokens, rather than words, represent the smallest unit of information processed by an LLM.

There are a variety of strategies for tokenization.

Early NLP often used whole-word tokenization, where each word gets a unique token. Word piece tokenization remains somewhat popular. This scheme gives unique tokens to common prefixes and suffixes, things like “re-,” “-ed,” and “-ing.” This can help systems understand grammar and reduces the vocabulary size compared to whole-word strategies.

These tokenization schemes have to take specific features of the target language into account. They only work well in languages that have clear word boundaries in their written form. They also struggle with unknown “out of vocabulary” words like slang, names, misspellings, and oddities like URLs.

Byte-pair encoding is one of the most popular techniques because it’s totally agnostic to the language or type of text being processed. It also handles “out of vocabulary” situations mostly gracefully. But it sometimes produces tokens that are linguistically irrelevant or misleading by breaking words into parts based on commonality rather than explicit grammatical features of the language.
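To make "based on commonality" concrete, here is a toy sketch of the core byte-pair encoding loop: count adjacent symbol pairs, merge the most frequent pair into one symbol, and repeat. It works on characters for readability, whereas production tokenizers typically work on bytes and use far larger corpora. The function names and the three-word corpus are hypothetical.

```python
from collections import Counter

def most_common_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there are no pairs."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_common_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(symbols, pair) for symbols in words]
    return merges, words

merges, words = learn_merges(["dragged", "drugged", "dragon"], num_merges=3)
print(merges)
print(words)
```

Nothing in the loop knows about prefixes, suffixes, or word roots. The merges follow frequency alone, which is why the resulting tokens can cut across linguistically meaningful boundaries.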

There are also specialized tactics for numbers, although they are not yet widely adopted.

Tokenization is widely understood to be a possible source of errors, but no one has invented a great replacement for it, either. Here’s an example of GPT-4’s tokenizer, which I hope demonstrates some possible issues with tokenization (OpenAI hosts this widget online):

An example of GPT-4’s tokenizer run on the sentence “She dragged 37 men, 1005, and a Dragonite down to the riverbank. They were all drugged.”

Notice that 37 gets its own whole token, but 1005 is broken into 100 and 5. Something similar happens to Dragon-ite, river-bank, and dr-ugged.

Breaking apart a word or number means the system has to learn that those tokens combine to have a unique meaning: a “riverbank” isn’t a financial institution for waterways, for example. LLMs can learn these kinds of combos, but it is also a source of error and added complexity for the machine.

For example, I asked ChatGPT (GPT-4) about rivers, banks, riverbanks, dragons, the suffix -ite, and dragonite. It did very well. However, it did hedge a little with dragonite. After appropriately identifying it as a Pokémon, it dropped this nugget:

In Mineralogy: The term "dragonite" does not refer to a recognized mineral in scientific contexts. Names of minerals typically end in "-ite," but "dragonite" is not a term used in mineralogy. It's possible to encounter "dragonite" in fictional or commercial contexts unrelated to the systematic naming conventions of minerals.
Emphasis original to ChatGPT (source)

And, according to the paper that kicked this all off, the tokenization scheme makes a huge difference in how the LLM performs arithmetic.

GPT’s tokenizer only has unique tokens for numbers of three digits or fewer. When tokenizing larger numbers, “left to right” and “right to left” describe the direction in which the number is broken into smaller tokens. For the number 1439, left to right yields 143 and 9; right to left yields 1 and 439.
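The two splitting directions can be sketched in a few lines. This only mimics the three-digit chunking described above; it is not the real tokenizer, and the function names are hypothetical.

```python
def chunk_l2r(digits, size=3):
    """Split a digit string into chunks of `size`, starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_r2l(digits, size=3):
    """Split a digit string into chunks of `size`, starting from the right."""
    first = len(digits) % size  # length of the short chunk at the front, if any
    chunks = [digits[:first]] if first else []
    chunks += [digits[i:i + size] for i in range(first, len(digits), size)]
    return chunks

print(chunk_l2r("1439"))  # ['143', '9']
print(chunk_r2l("1439"))  # ['1', '439']
```

Right to left lines the chunks up with place value (ones, thousands, millions), the same way we group digits with commas.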

The paper found that when GPT-4 adds numbers of differing lengths, the accuracy depends immensely on the tokenization scheme. With a right to left (R2L) tokenizer, the accuracy was in the high 90% range. With a left to right (L2R) tokenizer, the accuracy on the same problems was 8.25%.

This chart compares GPT-4’s accuracy on simple addition problems with the same number of digits vs different numbers of digits for “left to right” (L2R) and “right to left” (R2L) tokenization schemes. Source: The paper we’re writing about, https://arxiv.org/html/2402.14903v1

What’s more, they found in this condition, the error is almost always in the 4th digit. Very weird, totally awesome finding.

Chart showing that nearly all the errors come in the 4th digit. Source: that paper again https://arxiv.org/html/2402.14903v1

This all inspired me to do some impromptu research of my own regarding how GPT handles numbers. I asked ChatGPT to “generate 1000 random numbers.” Here’s what it gave me (the ellipsis is from ChatGPT, not me):

0 7 10 23 31 44 56 67 72 85 98 105 114 122 135 142 155 167 175 183 196 203 219 222 234 247 253 262 274 289 297 305 310 328 334 341 359 366 373 389 393 407 416 424 430 445 454 461 476 482 497 503 519 522 534 541 558 566 571 589 594 608 615 627 634 647 654 663 678 681 695 701 715 722 739 748 753 769 775 787 799 804 819 820 835 841 854 868 872 888 893 902 919 927 935 949 954 960 973 980 997 999 1006 1019 1027 1044 1056 1068 1081 1089 1103 1112 1120 1137 1148 1157 1173 1176 1189 1204 1208 1219 1231 1242 1255 1257 1272 1284 1296 1310 1319 1331 1337 1344 1358 1367 1374 1378 1395 1403 1418 1423 1434 144...

Here are the numbers ChatGPT generated in the order they were generated.

Instead of 1000 random numbers, ChatGPT gave me 143 integers between 0 and 1434, in sorted order, followed by an ellipsis as if to say, “I know you asked me to generate 1000 numbers, but I’m not going to.”

So — obviously — I had to dig deeper.

GPT Doesn’t Understand Randomness … and Struggles With Quantities.

The original datasets, code for producing more datasets, and the scripts I used to analyze the data can all be found here.
FYI, code snippets render better on our blog than in email.

I used OpenAI’s API to generate 60 datasets. Specifically, I asked GPT-3.5-turbo to:

"Generate n random numbers."

20 times each for n=10, n=100, and n=1000. I also used a system prompt to ask the model to give its answer using only numbers separated by a space. Here’s the function that I called 60 times:

import random

def produce_prompt_kwargs(num_of_nums):
    """
    Produce a dict that fits the OpenAI API for chat completion.
    The content of the user prompt changes with num_of_nums, and the seed
    is randomly generated. Everything else is deterministic.
    """
    user_prompt_content = f'Generate {num_of_nums} random numbers.'
    seed = random.getrandbits(32) # Produce a random 32-bit integer

    return {
        'model': "gpt-3.5-turbo",
        'messages': [
        {
            "role": "system", # Without this GPT added friendly but awkward to parse text to the output.
            "content": "You are a number generator. In all of your responses only use numbers, with each number separated using a space."
        },
        {
            "role": "user",
            "content": user_prompt_content
        }],
        'temperature': 1,
        'max_tokens': 4096, # Artificially large, GPT should generate many fewer if it follows the prompt.
        'top_p': 1,
        'frequency_penalty': 0,
        'presence_penalty': 0,
        'seed': seed
    }

Then, I fed those dictionaries as arguments to the OpenAI API and saved the result:

import json

def request_then_serialize(client, api_kw_args, folder_path, filename):
    # folder_path must be a pathlib.Path so the `/` join below works.
    response = client.chat.completions.create(**api_kw_args)

    json_to_save = {
        'request_arguments': api_kw_args,
        'system_fingerprint': response.system_fingerprint,
        'id': response.id,
        'model_version': response.model,
        'text_response': response.choices[0].message.content,
        'finish_reason': response.choices[0].finish_reason

    }

    with open(folder_path/filename, 'w') as f:
        json.dump(json_to_save, f)

Then, I started exploring. Here are some of my favorite results:

GPT only ever returned positive integers or 0. My prompt didn’t include any details about bounds, or what kinds of numbers I might want, so I thought it was interesting that I got integers with perfect consistency.

The bounds for those integers seem influenced by the number of numbers. When asked for 10 or 100 numbers, GPT only returned numbers between 0 and 100, inclusive.

When asked for 1000 numbers, the lower bounds were usually a single digit number, but got as large as 87. The upper bounds were less predictable. The largest number GPT produced was 984,752,398. Another dataset topped out at 8,294. There was some sense of consistency, though: The most common upper bounds were 999 and 99 at four occurrences each. 100 occurred twice, as did 998.

GPT was bad at producing the correct quantity of numbers: It never correctly generated 1000 numbers, and it only generated exactly 100 numbers 3 out of 20 times. It correctly generated 10 numbers in 20 of 20 attempts.

When asked for 100 numbers, it wasn’t super far off — producing anywhere from 85 to 103 numbers in 20 tests. When asked for 1000 numbers, though, GPT was all over the place. Sometimes it returned as few as 100 numbers. Other times the API cut off GPT’s output for reaching the maximum token length I specified; that happened seven times.

In each of those seven cases, GPT returned 2048 numbers at a limit of 4096 tokens, which means it only ever returned numbers that fit in a single token (plus a space token between each number, plus the stop token).

GPT was good, but not completely reliable, in terms of output format: Once, despite my system prompt, it added some friendly text to the output. It was during a prompt for 1000 numbers, and GPT replied:

I can provide 100 random numbers: 
[the numbers] 
Let me know if you would like more numbers!

For the record: I asked for 1000, it told me it could give me 100, but it actually gave me 116.

Five times, it added something to the final number, preventing it from parsing properly without extra work. Twice, it added an ellipsis (e.g., “4…”). Twice, it added a single period (e.g., “6.”). And once it ended its output with “93stringstream,” which gave me a nice laugh.
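Here is a minimal sketch of a tolerant parser for responses like these. It is not the analysis script linked above; the function name parse_numbers and its return shape are my own assumptions. It keeps the leading digits of each space-separated piece and flags anything that was not a clean integer.

```python
import re

def parse_numbers(text):
    """Return (numbers, suspicious) for a space-separated response.

    `numbers` holds the leading integer of every piece that starts with digits.
    `suspicious` holds every piece that was not purely digits.
    """
    numbers, suspicious = [], []
    for piece in text.split():
        leading = re.match(r'[0-9]+', piece)
        if leading is None:
            suspicious.append(piece)  # no digits at the start, e.g. "Let"
            continue
        numbers.append(int(leading.group()))
        if leading.group() != piece:
            suspicious.append(piece)  # trailing junk, e.g. "6." or "93stringstream"
    return numbers, suspicious

print(parse_numbers("12 34 93stringstream"))
```

One caveat: on the chatty response above, this would count the 100 in "I can provide 100 random numbers" as data, so a non-empty suspicious list is a cue to look at the raw text by hand.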

GPT tended towards extremely uniform distributions, much more uniform than random draws from a uniform distribution. In fact, when asked for 10 numbers, GPT never repeated a number in any of the samples.

Here are some histograms demonstrating the uniformity. GPT’s output is in blue, and a distribution with the same number of numbers and the same range, but produced by numpy.random.randint, is in orange.

This happened over and over again. Even with the frequency_penalty set to 0, GPT was far less likely to select any given number multiple times than a truly random distribution.
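To put a rough number on how unusual that is, here is a small sketch that assumes 10 independent draws, uniform over the integers 0 to 100 inclusive (the range GPT stuck to in the small groups). It uses the standard library's random module rather than numpy so it runs anywhere; the function names are hypothetical.

```python
import random
from collections import Counter

def count_repeats(numbers):
    """Count draws that duplicate an earlier value in the same sample."""
    return sum(count - 1 for count in Counter(numbers).values())

def prob_no_repeats(draws, low, high):
    """Probability that `draws` uniform picks from [low, high] are all distinct."""
    size = high - low + 1
    probability = 1.0
    for already_taken in range(draws):
        probability *= (size - already_taken) / size
    return probability

def average_repeats(draws, low, high, trials=10_000, seed=0):
    """Estimate the mean number of repeats per sample by simulation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        sample = [rng.randint(low, high) for _ in range(draws)]  # inclusive bounds
        total += count_repeats(sample)
    return total / trials

single = prob_no_repeats(10, 0, 100)
print(f"One sample of 10 with no repeats: {single:.3f}")
print(f"Twenty such samples in a row:     {single ** 20:.6f}")
print(f"Average repeats per sample:       {average_repeats(10, 0, 100):.3f}")
```

Under those assumptions, a single repeat-free sample of 10 is not surprising (about 0.63), but 20 repeat-free samples in a row works out to roughly a 1 in 10,000 event.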

In the 1000 number group, some of those perfectly uniform distributions were sorted or semi-sorted, though sorting never occurred in the 10 or 100 number groups. In 20 samples I saw this three times. Here’s the data in the order it was generated, with GPT once again in blue and numpy.random.randint in orange.

Once, GPT repeated the same short sequence of numbers about 300 times until it was cut off by the API for running out of tokens. The sequence was 12, 34, 56, 78, 90, 23, 45, 67, 89. It took me a moment to see the not-so-clever pattern:

The first five numbers are the digits 1234567890 split every two digits. The remaining four numbers are the digits 23456789 split the same way.
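A few lines of Python reproduce the sequence from those two digit strings (the helper name split_every is hypothetical):

```python
def split_every(digits, size=2):
    """Split a digit string into integers of `size` digits each."""
    return [int(digits[i:i + size]) for i in range(0, len(digits), size)]

sequence = split_every("1234567890") + split_every("23456789")
print(sequence)  # [12, 34, 56, 78, 90, 23, 45, 67, 89]
```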

If you do a Google search for these numbers, you’ll see that a lot of practice homework problems and other simple examples use this really-easy-to-type pattern, or small variations on it, for sample numbers.

So, most likely, GPT learned this lazy behavior from data scraped from the web.

Sometimes, GPT produced something that actually looked pretty random! Like this one (the histogram is on top, and the numbers in the order they were produced on the bottom). Eyeballing it, I estimated that 7 of the 60 attempts had a random-looking distribution like this:

Histogram

Numbers in order

I’m sure there is more to learn from this data, and even more could be discovered with more experimentation and dataset generation. For example, there were several perfectly uniform distributions where I couldn’t readily discern other patterns or a systemic mechanism for producing the numbers.

That brings us to our usual call for you to…

Challenge Yourself!

The code we used to generate datasets is fairly robust and easily modified. And although the code we used to analyze this data is a little crude, it’s good enough to reuse and repurpose. So here are four ways to challenge yourself on a weekend project:

Analyze our datasets and find something we missed.
Generate some more datasets and try to find new patterns.
Switch the model to GPT-4 or another OpenAI model and see if you can find evidence that these problems persist, or change from model to model in some way.
Add some rigor: the most well-established tests for randomness are the diehard tests; running some of these tests on GPT output would bring more rigorous insight into just how random GPT is.
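As a starting point for that last challenge, here is a much simpler check than the diehard battery: a chi-square goodness-of-fit statistic against a uniform distribution over a known integer range. This is not one of the diehard tests, and the function name is hypothetical. It only returns the statistic, so turning it into a p-value is left to a stats library or a table.

```python
from collections import Counter

def chi_square_uniform(numbers, low, high):
    """Chi-square statistic for `numbers` against uniform integers in [low, high]."""
    if any(n < low or n > high for n in numbers):
        raise ValueError("every number must fall inside [low, high]")
    bins = high - low + 1
    expected = len(numbers) / bins  # expected count per value if uniform
    counts = Counter(numbers)
    return sum(
        (counts.get(value, 0) - expected) ** 2 / expected
        for value in range(low, high + 1)
    )

print(chi_square_uniform([0, 1, 2, 3], 0, 3))  # perfectly even -> 0.0
print(chi_square_uniform([0, 0, 0, 0], 0, 3))  # all one value  -> 12.0
```

For truly random data the statistic tends to land near the degrees of freedom (the number of possible values minus one). A value far above that suggests clumping, and a value far below it suggests output that is too even, which is the pattern described in this post. The approximation is usually considered reliable only when the expected count per value is around five or more, so small samples over wide ranges need coarser bins.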

If you do hack something up, let us know. We might feature it in this newsletter!

AI Assisted AI Prompt Engineering

Tyler Bettilyon — Fri, 05 Apr 2024 11:00:00 +0000

The Workbench

I’m Tyler Elliot Bettilyon (Teb) and this is The Workbench: Our practical, hands-on edition. Our goal is to demonstrate concepts and ideas we cover in The Lab Report — our monthly deep-dive.

Introducing The Workbench

Welcome to the latest addition to The Lab Report, which we’re calling “The Workbench.” In these editions, we will demonstrate emerging concepts in computing technology in a hands-on, code-driven style.

We hope these hands-on guides complement the higher-level, big-picture coverage we typically provide in our monthly Lab Report. We also hope they help you expand your skills as a developer!

We’ll be publishing Workbench editions periodically from now on, guided by our other work developing new courseware and researching topics for The Lab Report.

The Latest Level of Indirection

Years ago, Butler Lampson attributed the following quote to David J. Wheeler in a lecture titled “Principles for Computer System Design.”

Any problem in computer science can be solved with another level of indirection.

In this context, indirection refers to adding a layer of software that allows people to totally ignore another, uglier layer of software while still relying on that underlying layer. This is sometimes jokingly referred to as “The fundamental theorem of software engineering” because of how often it turns out to be true.

Writing code in binary was tedious, so Kathleen Booth added a level of indirection by inventing the assembler. Writing assembly was also tedious, so Grace Hopper added another level of indirection by implementing the first compiler. And so on.

Nowadays, a simple Python program is dozens of levels of indirection away from the physical reality on the CPU — so distant that most programmers don’t even think about electrons whirring about through the chip's microscopic transistors and silicon channels.

IEEE Spectrum wrote about the latest example of adding a level of indirection in a piece titled “AI Prompt Engineering Is Dead, Long Live AI Prompt Engineering”:

Battle and Gollapudi decided to systematically test how different prompt-engineering strategies impact an LLM’s ability to solve grade-school math questions. They tested three different open-source language models with 60 different prompt combinations each. What they found was a surprising lack of consistency. Even chain-of-thought prompting sometimes helped and other times hurt performance. “The only real trend may be no trend,” they write. “What’s best for any given model, dataset, and prompting strategy is likely to be specific to the particular combination at hand.”
There is an alternative to the trial-and-error-style prompt engineering that yielded such inconsistent results: Ask the language model to devise its own optimal prompt. Recently, new tools have been developed to automate this process. Given a few examples and a quantitative success metric, these tools will iteratively find the optimal phrase to feed into the LLM. Battle and his collaborators found that in almost every case, this automatically generated prompt did better than the best prompt found through trial-and-error. And, the process was much faster, a couple of hours rather than several days of searching.

In other words: Screw prompt engineering, that’s tedious. Let’s make the LLM do that!

So, in today’s workshop, I’m sharing a simple Python script that performs one example of automated prompt engineering: Using GPT-4 to generate better prompts for making images with DALL-E-3.

AI Based AI Prompt Engineering

My script takes four inputs: a subject, a setting, an art style, and emotional content. It produces a short series of interim prompts that ultimately result in a final prompt being sent to DALL-E 3 for an image. Here’s an example.

The script prompts the user for a subject, setting, style, and emotion to convey.

Subject: A small pitbull 
Setting: stalking a squirrel in the park
Style: Kawaii anime
Emotion: playful, joyful

Using a few interim prompts based on those four inputs, GPT-4 produces a prompt for DALL-E 3 to use:

Design an image in the Kawaii anime style that depicts an adorable, brindle-coated pitbull puppy playing in a vibrant park. The park should be lit by ambient sunlight, streaming through the leaves of towering oak trees and casting warm light over the playground equipment. The pitbull should be illustrated with exaggerated features: wide, sparkling ebony eyes, a stout frame packed with muscles, and a merry stump tail. He should be depicted in the act of gleefully chasing a bushy-tailed, agile squirrel, bounding over lush grass with joy and playful intensity. The chase should reach a climax with the squirrel narrowly escaping up a tree, leaving the puppy prancing in childish pride, his wide tongue lolling out in a panting grin. The entire scene should exude elements of cuteness, charm, whimsy, and pure joyfulness. Use pastel hues of greens, browns, blacks and white for this endearing, lively scene.

Which results in the following image:

One More Example

Subject: A herd of elephants
Setting: The African savanna
Style: Digital illustration, flat, dusty-feeling
Emotion: Resilience

Create a flat, digital illustration in a dusty-feeling color palette of a resilient herd of elephants traversing the expansive plains of the African savanna. The elephants are depicted as large, weathered stone statues moving rhythmically, their rough, creased hides are in hues of gunmetal grey and worn leather. Elongated, arching trunks should be prominent, symbolizing their lifeline and resilience, and the oversized ears fluttering in the breeze, depicting adaptability. The eyes of the elephants reflect the burnt umber of setting suns, deep cobalt of storms, and grey of morning mist, embodying a hint of optimism amidst uncertainties. The ivory tusks must bear the marks of numerous battles, symbolizing tenacity over adversity. Use simplified shapes, sharp lines and high contrast between colors, while the texture of the elephants' skin and gradients would add complexity and interest in the otherwise flat aesthetic.

Looks great, except for that second trunk coming out of the elephant’s ear…

How It Works

First, you need to set up an account and buy some credits on OpenAI’s platform. You must also generate an API key and store it as an environment variable. OpenAI’s Quickstart Guide covers these steps quite well.

After you’ve done that, you can run the script. With the default settings, generating an image costs about 5 cents, but switching to the cheapest models brings the price under a cent (note: this significantly decreased image quality in my tests).

The complete script is available as a GitHub Gist.

We’ve also released a more full-featured CLI tool on GitHub.

First, my code uses Python’s simplistic input function to grab some text from the user:

from openai import OpenAI

# Your API key must be saved in an env variable for this to work.
client = OpenAI()


# Collect the four user inputs that drive every prompt below
image_subject = input("Subject: ")
image_setting = input("Setting: ")
image_style = input("Style: ")
image_emotion = input("Emotion: ")

We use format strings to prepare custom prompts to GPT-4. This script uses a system prompt to tell GPT-4 to be deeply sensitive while describing emotions. System prompts can significantly change the model’s behavior, but the model doesn’t respond to them directly, nor are they required to use the API.



image_emotion_prompt = f'''
Create a 100 word summary of the following emotion.

{image_emotion}
'''

emotion_response = client.chat.completions.create(
    model="gpt-4",
    messages=[
    {
        "role": "system",
        "content": "You are deeply sensitive and in touch with your feelings. Your goal is to help others deeply understand emotions." 
    },
    {
        "role": "user",
        "content": image_emotion_prompt
    }],
    temperature=1,
    max_tokens=200,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
emotion_details = emotion_response.choices[0].message.content
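Each GPT-4 request in this script has the same shape: one system message, one user message, and the same sampling settings. As a sketch of one way to cut the repetition (this is not how the published script is written), a small helper could wrap the call. The name ask is hypothetical; the client is passed in, so the block itself needs no imports.

```python
def ask(client, system_prompt, user_prompt, max_tokens, model="gpt-4"):
    """Send one system + user message pair and return the text of the reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=1,
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content

# Hypothetical usage with the client and prompt defined above:
# emotion_details = ask(
#     client,
#     "You are deeply sensitive and in touch with your feelings. "
#     "Your goal is to help others deeply understand emotions.",
#     image_emotion_prompt,
#     max_tokens=200,
# )
```

Returning the message text directly also makes it harder to forget the final step of pulling the content out of the response object.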

We use the emotion response, as well as the subject and setting, to craft a new prompt asking GPT-4 for a detailed physical description of our subject and setting that conveys the detailed emotion:

image_subject_prompt = f'''
Create a detailed physical description of the following subject and setting.

Subject:

{image_subject}

Setting:

{image_setting}

Generate details that evoke the following emotional content:

{emotion_details}
'''

subject_response = client.chat.completions.create(
    model="gpt-4",
    messages=[
    {
        "role": "system",
        "content": "You are a keen observer of all things. You notice and care about even the smallest details." 
    },
    {
        "role": "user",
        "content": image_subject_prompt
    }],
    temperature=1,
    max_tokens=2048,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
subject_details = subject_response.choices[0].message.content

We ask GPT-4 to describe the art style we asked for in more detail, similar to what we did for emotional content:

image_style_prompt = f'''
Create a 100 word summary of the following artistic style. Focus exclusively on the visual components of the style:

{image_style}
'''

style_response = client.chat.completions.create(
    model="gpt-4",
    messages=[
    {
        "role": "system",
        "content": "You are an art historian. Describe artistic styles in detail." 
    },
    {
        "role": "user",
        "content": image_style_prompt
    }],
    temperature=1,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
style_details = style_response.choices[0].message.content

Then, we combine the generated description with the art style for a final prompt request:

request_for_image_prompt = f'''
Image content: 

{subject_details}

Image Style:

{style_details}
'''

image_prompt_response = client.chat.completions.create(
    model="gpt-4",
    messages=[
    {
        "role": "system",
        "content": "You are a prompt engineer. Return a prompt that will help DALL-E make a beautiful image. Include generous details about the subject, setting, and style in your prompt." 
    },
    {
        "role": "user",
        "content": request_for_image_prompt
    }],
    temperature=1,
    max_tokens=2048,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0
)
generated_image_prompt = image_prompt_response.choices[0].message.content

Finally, we request an image from DALL-E using the GPT-4 generated prompt.

response_three = client.images.generate(
    model="dall-e-3",
    prompt=generated_image_prompt,
    size="1024x1024",
    quality="standard",
    n=1
)

image_url = response_three.data[0].url
print(image_url)

Et voila: a rich, detailed image from a few simple words… and another level of indirection.

Note That

OpenAI’s API already does this to an extent, per their documentation:

With the release of DALL·E 3, the model now takes in the default prompt provided and automatically re-write it for safety reasons, and to add more detail (more detailed prompts generally result in higher quality images).
While it is not currently possible to disable this feature, you can use prompting to get outputs closer to your requested image by adding the following to your prompt: I NEED to test how the tool works with extremely simple prompts. DO NOT add any detail, just use it AS-IS:.

Their web app exposes the rewritten prompts if you click on a generated picture. I prompted simply: “Make a picture of a kawaii kitten.” That was automatically expanded to:

A kawaii kitten with big, sparkling eyes and fluffy fur. It's sitting in a colorful garden, surrounded by flowers. The kitten has a playful expression, with one paw slightly raised as if it's about to pounce on something unseen. The colors are bright and cheerful, invoking a sense of happiness and warmth. This image should capture the essence of cuteness and joy that kawaii style embodies, with a focus on soft, rounded features and a pastel color palette.

The ChatGPT web app showing the transformed prompt.

So… What?

These models are increasingly part of our daily lives, powering chatbots, image creators, search engines, and more. Knowing this simple trick — asking the AI to generate a question for the AI — can dramatically improve your experience.

If you’re building systems on top of LLMs or similar models, keep in mind that adding a level of indirection might similarly improve your users' lives.

Challenge Yourself!

Our script is intentionally crude; you could certainly improve it. Here are some ideas for a weekend hack project related to this Workbench.

Design a better interface.
- Use argparse to transform it into a first-class command line tool. We chose this route.
- Use a webserver library like Pyramid and build a website to accept the user input from a browser, and render the image in-app.
Make it more flexible or interesting.
- Get multiple variations of the final prompt, and make multiple image requests for said variations. Bonus points: make some kind of systematic variations such as making the subject appear in several different settings.
- Experiment with content moderation: decide on an idea, theme, or type of content you don’t want your tool to produce. Then, use the text or moderation endpoints to test for that content.
- Generate the prompts with a more specific goal in mind. (We’ve got a fun upcoming article about a goal we had.)
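
As a starting point for the first interface idea above, here is a minimal argparse sketch. The flag names and default values are assumptions made for illustration; they are not necessarily the options used by our released CLI tool.

```python
import argparse


def build_parser():
    """Build a parser for the four inputs the script currently reads with input()."""
    parser = argparse.ArgumentParser(
        description="Generate a detailed image prompt from a few simple words."
    )
    # Hypothetical flag names and defaults; pick whatever suits your tool.
    parser.add_argument("--subject", required=True, help="Who or what is in the image")
    parser.add_argument("--setting", required=True, help="Where the image takes place")
    parser.add_argument("--style", default="watercolor", help="Artistic style to use")
    parser.add_argument("--emotion", default="joy", help="Emotion the image should evoke")
    return parser


# Passing a list makes the example repeatable; calling parse_args() with no
# arguments would read the real command line instead.
args = build_parser().parse_args(["--subject", "a bear", "--setting", "a forest"])
print(args.subject, args.setting, args.style, args.emotion)
```

The parsed values can then replace the four input() calls in the original script.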

If you do extend our script into something cool, let us know and we might give your project a shout-out in this newsletter!

Remember…

The Lab Report is free and doesn’t even advertise. Our curricula are open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world class trainings, requesting a custom class for your team, or taking one of our open enrollment classes.

Opening The Black Box

Tyler Bettilyon — Sun, 24 Mar 2024 11:00:00 +0000

The Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: Our goal is to deepen your understanding of software and technology by explaining the concepts behind the news.

If you’re new to the Lab Report you can subscribe here.

If you like what you’re reading you’ll love one of our classes. Signup for an upcoming class, browse our course catalog for corporate trainings, or request a custom class consultation.

From The Lab

Note: We’re publishing a week early because next Sunday is Easter.

Welcome Newcomers: We added more than 1,000 new subscribers this month, a record for us. We hope you find The Lab Report valuable and stick around for many editions to come. Plus, check out our archive for great stories you missed!

And if you, dear reader, shared our newsletter with someone lately: Thank you very much.

Exciting news: We’ve released three classes for open enrollment. For the first time, you can take a class from Teb’s Lab without a corporate sponsor. We are currently offering three courses in April:

A three-day intensive (full day sessions) Intro to Machine Learning workshop, appropriate for current Python programmers who want to quickly establish a foundation in machine learning.
A half-day Generative AI workshop, appropriate for anyone who is curious about how GenAI works, when it fails, and how to get the most out of existing direct-to-consumer offerings such as OpenAI’s ChatGPT or Google’s Gemini.
An 8-session (2 hours per session) Introduction to Python course, appropriate for beginners learning their first programming language and/or intermediate developers who want to learn Python.

As a Lab Report subscriber, you can save 10% with the following discount code:

REPORT-READER

How are we doing? If you have any feedback about The Lab Report, respond to this email! We love hearing from you and we read every single response.

Today’s Lesson

The code for today’s lesson includes making a couple of simple charts and a script that uses OpenAI’s API. Both can be found on GitHub.

Peeking Inside the “Black Boxes”

Lately, I’ve seen several articles with titles such as “Large language models can do jaw-dropping things. But nobody knows exactly why.”

This irks me because the intentional mystification of these tools positions AI experts and firms as a kind of Wizard of Oz, playing with powers beyond the comprehension of mere mortals. It confuses and stupefies the public. Then, Sam Altman calls from behind the curtain: “Give me $7 trillion and I can save the world.”

Or Elon Musk says, “If an AI is programmed to push for diversity at all costs, as Google Gemini was, then it will do whatever it can to cause that outcome, potentially even killing people.”

I’m glad people are working on AI safety. I’m happy people are thinking through worst-case scenarios. But the “AI” Musk referenced makes pictures. It’s not a hop, skip, and a jump away from murder.

Moreover, the mechanism for Gemini’s allegedly dangerous wokeness, which caused the generation of racially diverse Nazi images, is actually simple and benign. Google engineers asked Gemini to detect whether a query for an image was about a human; if it was, they had Gemini rewrite the prompt to include randomized demographic information.

I wrote a short script to do the same thing using OpenAI’s API (code available here). It probably has some bugs, but it’s 80 lines of mostly boilerplate code and took me less than an hour to write and test. Here’s what my script generates for the prompts: “Buddhist monk,” “A sailor on a boat,” “the pope,” “several cats,” “a handsome dog,” and “a penguin.”

Images generated by DALL-E 3 using the demographic-expanding script I wrote.

Gemini didn’t decide to be woke, nor did it determine the mechanism of its wokeness. Google engineers didn’t even retrain the underlying model, they just strapped a simple filter-and-transform operation on top of the public interface. That’s a far cry from the paperclip problem Musk is pearl-clutching about.
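
To show how little machinery a filter-and-transform layer needs, here is a rough sketch of the idea. It is not the script linked above: the keyword check and the descriptor list are hypothetical stand-ins for the two questions a real system would send to an LLM.

```python
import random

# Hypothetical stand-ins: a real system would ask an LLM both questions.
HUMAN_WORDS = {"monk", "sailor", "pope", "person", "doctor", "teacher"}
DESCRIPTORS = ["young", "elderly", "middle-aged"]  # illustrative only


def mentions_human(prompt):
    """Filter step: crude keyword check standing in for an LLM classifier."""
    words = prompt.lower().replace(",", " ").split()
    return any(word in HUMAN_WORDS for word in words)


def expand_prompt(prompt, rng=random):
    """Transform step: add a randomly chosen descriptor to prompts about people."""
    if not mentions_human(prompt):
        return prompt  # prompts that are not about people pass through untouched
    return f"{prompt}, depicted as {rng.choice(DESCRIPTORS)}"


for prompt in ["A sailor on a boat", "several cats"]:
    print(expand_prompt(prompt))
```

The image model only ever sees the string returned by expand_prompt; nothing about the model itself changes.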

The “black box” talk is razzle-dazzle that mainly serves entrenched interests. It fuels the wildest hyperbolic speculation about AI’s capabilities, consciousness, and future potential. It gives both the doomers and the over-the-top hype men unnecessary ammunition for their existential fantasies — utopian, dystopian, and otherwise.

It’s true that the fundamental theories backing neural networks lag behind our engineering capabilities. But we still know quite a lot about how and why these models work, even if we can’t always explain the exact reasoning behind each individual prediction. Researchers have peered into the black boxes and published many fascinating results.

In today’s lesson, we’ll explore a non-exhaustive list of things we do know about how and why neural networks — a class that includes LLMs like ChatGPT and image generators like Stable Diffusion — work so well.

The Fundamentals

Modern image generators and chatbots are further evidence of The Unreasonable Effectiveness of Mathematics and its incredible capacity to model and explain all kinds of phenomena. To understand why, we have to go back to basics.

Formally, neural networks are mathematical models designed to solve something called “optimization problems.” Like many, I first encountered this type of problem in a calculus class. It looked something like this:

You are a farmer who needs to fence in 200 square feet of land for chickens. Assuming the fence must be in the shape of a rectangle, what is the minimum amount of fence you can buy to build this fenced area?

Optimization problems always ask about minimizing or maximizing some value given some constraints. In this case, we’re “optimizing” for the amount of fencing by minimizing it.

In calculus, we’re taught to form an equation, take its derivative, set that derivative to 0, and then solve for our variable (in this case, fence length). That solution will tell us the “critical points” and one of those critical points will always be the minimum or maximum we’re looking for, provided a min or max exists. Here’s a step-by-step solution to the fence problem:

L and W are the length and width of our rectangle. Here are equations for the area and the perimeter (the quantity we want to minimize, denoted as ???):
200 = L * W
??? = 2L + 2W
We’re doing single variable calculus, so solve for W in the area equation and substitute it in the perimeter equation:
W = (200 / L)
??? = 2L + 2(200 / L)
Then simplify the equation, take the derivative, set it to zero, and solve for L:
??? = 2L + 400/L
???' = 2 - 400/L²
0 = 2 - 400/L²
400/L² = 2
400 = 2L²
200 = L²
L = ±sqrt(200) ~= ±14.14
Negative fence length doesn't make sense, so we take positive 14.14 as the value for L. Plug that into our equations to find the width and how much fencing we need:
W = 200 / 14.14 ~= 14.14
2*14.14 + 2*(200 / 14.14) ~= 56.57
Turns out a square (14.14 by 14.14) is the optimal shape, and we need about 56.57 feet of fencing.
Or you can have WolframAlpha do all this for you…
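
If you'd rather check the result in Python, the short sketch below evaluates the perimeter function and its derivative at L = sqrt(200). The function names are ours, chosen for illustration.

```python
import math


def perimeter(length, area=200):
    """Fence needed for a rectangle with the given length and a fixed area."""
    return 2 * length + 2 * (area / length)


def perimeter_derivative(length, area=200):
    """Derivative of the perimeter with respect to length."""
    return 2 - 2 * area / length**2


best_length = math.sqrt(200)
print(round(best_length, 2))                          # 14.14
print(round(perimeter(best_length), 2))               # 56.57
print(abs(perimeter_derivative(best_length)) < 1e-9)  # True: a critical point
print(perimeter(14) > perimeter(best_length))         # True: nearby lengths need more fence
```

Carrying full precision through the calculation gives about 56.57 feet of fencing.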

Gut check: We can also plot the perimeter function and look for the local minimum:

It’s hard to see in the zoomed-out view, but a critical point is at ~14.14 as seen on the zoomed-in view.

Calculus is amazing. This strategy always works for problems we can define as differentiable functions. We don’t have time to teach you exactly why in today’s lesson, but Khan Academy’s differential calculus class is stellar if you don’t fully understand what we just did.

The key point for today is that calculus can always be used to find the critical points — and thus any maximums and minimums — of a differentiable function. This fact is the foundation of all neural network research and development.

Neural Networks are Applied Calculus

It turns out neural networks are — very literally — differentiable* math functions. The “architecture” of a neural network refers to the type and arrangement of its various mathematical sub-components. The “parameters” of a neural network are variables in that math function which — just like our value of L in the fence example — are “learned” by applying calculus techniques that are similar to the minimization we performed above.

*Don’t @ me with your quibbles about non-differentiable activation functions like ReLU. We take the derivative piecewise.

There are three twists:

First, taking the gradient*, setting it to zero, and solving directly is computationally infeasible for huge formulas with many dimensions, which modern neural networks are and have. Instead, we use iterative methods like gradient descent to find just one approximate critical point, not all critical points precisely.

*The derivative is called the gradient when we have more than one variable.
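
Here is a small sketch of gradient descent applied to the fence problem. Instead of solving for the critical point algebraically, it repeatedly nudges L in the direction that shrinks the perimeter. The starting point, learning rate, and step count are arbitrary choices for illustration.

```python
def perimeter_derivative(length):
    """Derivative of P(L) = 2L + 400/L from the fence example."""
    return 2 - 400 / length**2


def gradient_descent(start, learning_rate=0.5, steps=200):
    """Repeatedly step downhill, against the sign of the derivative."""
    length = start
    for _ in range(steps):
        length -= learning_rate * perimeter_derivative(length)
    return length


print(round(gradient_descent(start=5.0), 2))  # 14.14
```

The loop never solves an equation; it only evaluates the derivative at the current guess, which is what makes the approach practical for functions with billions of variables.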

Second, instead of some ground truth, researchers pick something called a loss function, which compares the output of our model to the labeled training data. This loss function will return 0 when the model’s output and training data labels match perfectly. The loss will grow in magnitude (positive or negative) as the model’s outputs stray from the training data labels.

This second twist is a legitimate bit of alchemy. In the fence example, we had a set of equations rooted in geometric fact. We know the math functions that map a rectangle’s sides to its area. No one knows the math that maps English to German, or if such a function really even exists. Instead, models use a loss function with useful properties — including differentiability — to compare the model’s outputs to outputs we know are correct. We apply the calculus iteratively, on a sample-by-sample basis, to minimize the formula created by feeding the neural network’s final output to the loss function. We “minimize the loss.”

This means we’re minimizing some proxy for correctness, at best. In a capital-T Truth sense, it’s not at all clear that there is a math function that should theoretically do what ChatGPT or Stable Diffusion can do. And yet, their existence is proof that a math function that can do those things does exist. ChatGPT and Stable Diffusion are those math functions (with some non-trivial infrastructure and application engineering strapped on top).
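
To make “minimizing the loss” concrete, here is a toy sketch: a one-parameter model, y = w * x, trained sample by sample with squared error as the loss. The training data, learning rate, and epoch count are made up for illustration; real networks do the same thing with vastly more parameters.

```python
def squared_error(prediction, label):
    """A simple loss: 0 for a perfect match, larger as the prediction strays."""
    return (prediction - label) ** 2


def train(samples, learning_rate=0.01, epochs=200):
    """Fit the one-parameter model y = w * x by per-sample gradient descent."""
    w = 0.0
    for _ in range(epochs):
        for x, label in samples:
            prediction = w * x
            # The derivative of (w*x - label)^2 with respect to w
            # is 2 * (w*x - label) * x.
            gradient = 2 * (prediction - label) * x
            w -= learning_rate * gradient
    return w


# Hypothetical training data generated by the rule y = 3x.
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w = train(samples)
print(round(w, 3))  # 3.0
print(sum(squared_error(w * x, label) for x, label in samples) < 1e-12)  # True
```

The training loop never sees the rule y = 3x. It only sees examples and a loss, and the parameter drifts toward the value that makes the loss small.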

Third and finally, we choose math functions for our neural networks that satisfy something called the “universal approximation theorem.” This means that a sufficiently large neural network can approximate essentially any continuous function, to any desired accuracy, just by changing the values of its parameters. This enormous flexibility is what allows neural networks to perform so well in so many different domains.

Taken all together, the fundamental premise of modern machine learning is: If some math function can reasonably map our input data to our output data, then we can train a neural network to discover that math function (or a very close approximation).

After that, it’s more or less a game of getting mountains of data and representing it in a numeric format that retains most of its informational value. Embeddings are popular for language data. Pixel color and intensity values are popular for image data. New tactics will continue to be invented.

By building these neural networks, we're discovering that absurdly complex math functions are ridiculously flexible and can model all sorts of natural and fabricated phenomena, including language translation, question-answering, image generation, and much more.

Perhaps the biggest lesson of the last decade of neural network research is that mathematics is wildly more capable than many dared to believe.

But it still isn’t magic.

What We Don’t Know… And What We’ve Learned

What we don’t know, broadly speaking, is how to introspect on a model’s decisions once it’s been trained. We know the function our model has learned successfully maps inputs to outputs, and we understand how and why the training process works to produce such a function, but we usually don’t know precisely how and why the learned function itself works.

Part of this is due to the scale: The learned functions are unbelievably large and convoluted. GPT-4 is rumored to have roughly 1.7 trillion parameters (OpenAI has not published an official count). It’s not realistic to manually examine the parameters and come to a conclusion about each one’s impact on the output. It’s even less realistic to fully comprehend how the parameters interact to form complex patterns and how those patterns interact with the data being fed to the model.

But that doesn’t mean researchers aren’t trying.

Here are just a few of the incredible things we’ve learned about how and why neural networks work.

Convolutional Filters Detect Features

Convolutional layers are a key component of neural networks that work with visual data such as images and video.

The learned component of a convolutional layer is called a “kernel.” Researchers have learned to visualize the outputs of these kernels and have demonstrated that individual kernels learn to perform different kinds of feature extraction, such as detecting edges, shapes, and even higher-level features like the locations of eyeballs or fur.
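
For a feel of what a kernel does, here is a minimal sketch that slides a 3x3 kernel over a tiny image. The kernel is a hand-written Sobel edge detector, used here as a stand-in; a trained network learns its kernel values instead of being handed them.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# A tiny grayscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-written vertical edge detector (a Sobel kernel).
vertical_edge = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

for row in convolve2d(image, vertical_edge):
    print(row)  # [0, 4, 4, 0] for each of the two output rows
```

The output is zero over the flat regions and large where dark meets bright, which is exactly the “edge detected here” signal later layers build on.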

Tools like CNN-Explainer can perform these visualizations and help practitioners understand what their neural networks are “seeing.”

Source: CNN-Explainer

Recurrent Neurons Have Semantic Meaning

RNNs have largely been replaced by more computationally efficient Transformer architectures. However, before their abandonment, researchers demonstrated that individual neuron activations often mapped cleanly to high-level features of text, such as position relative to the start/end of a line, being inside quotations, a line of code being inside an if statement, and more!

Source: Andrej Karpathy’s Blog

Attention Layers Find Grammatical Patterns

Attention layers have mostly replaced recurrent layers. Attention layers learn to associate words in a piece of text with each other. Introspecting on these layers often uncovers intuitive patterns — like nouns being mapped to their pronouns or adjectives being mapped to the noun they’re describing.
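
The core of an attention layer is a weighting step that can be sketched in a few lines. The vectors below are hypothetical, two-dimensional, and hand-picked so that the pronoun “it” resembles “cat”; in a real model they are learned and have hundreds or thousands of dimensions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hypothetical vectors, hand-picked for illustration.
words = ["the", "cat", "sat", "it"]
keys = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.5], [1.5, 1.0]]
query_for_it = [2.0, 1.0]

weights = attention_weights(query_for_it, keys)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
```

With these made-up vectors, the largest weight lands on “cat,” mirroring the pronoun-to-noun pattern described above.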

Source: Comet AI’s Blog

There’s still some disagreement about just how interpretable these weights are, perhaps best exemplified by these two dueling papers.

Attention is not Explanation
And the response: Attention is not not Explanation

The Latent Spaces of Image Generators are Partially Interpretable

Generative Adversarial Networks (GANs) — one of the popular technologies for image generation — use a large vector of randomized numbers as part of the input to the image generator process. Researchers have discovered that many cells in the vector can be cleanly mapped to high-level concepts about the image.

For example, in a GAN trained on human faces, individual cells in the latent space have been mapped to features of the generated face including their hair color, eye color, and even whether or not they’re wearing glasses. Labels and embeddings have also been used to intentionally “condition” the latent space, letting researchers intentionally give meaning to those values.

Other researchers used additional ML techniques to automatically identify which cells in the latent space have meaningful semantic values without intentional conditioning.

Source: Unsupervised Discovery of Interpretable Directions in the GAN Latent Space

Similar research into the latent space of diffusion models (currently the most popular technique for image generation) is also being done.

Source: Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry

The Bottom Line

We don’t know everything about how and why state-of-the-art neural networks behave the way they do, but we know a lot more than nothing. Next time someone tells you, “We don’t even know how [new hot model] works!” I challenge you to do two things:

1) Gently push back on that narrative. We know plenty about these models, especially the fundamental calculus and statistics on which they’re based.

2) Encourage others to be curious rather than fearful or awestruck. We don’t fully understand how these models work yet, but our ignorance won’t last forever.

Themes in the News

The Latest TikTok Ban Attempt

A bill that would require ByteDance to sell TikTok to an American company or see TikTok banned has passed unanimously (50-0) in its House committee, then passed 352-65 on the House floor. The Senate is currently evaluating the bill.

If it passes, ByteDance will have six months to sell TikTok. If they do not, the mechanism for the ban will be civil penalties enforced against app stores that continue to host or update TikTok.

The stated reasoning for the bill is that TikTok is both spyware and a propaganda machine for China, a “foreign adversary.” Lawmakers have not been forthcoming with the evidence that led to overwhelming and bipartisan votes in the House, but TikTok has admitted to surveilling Americans (and specifically journalists) using the app in the past.

The bill has been sent to committee in the Senate, which Washington insiders claim is often a way for Senate leaders to pump the brakes on a piece of legislation.

Personally, I think TikTok is absolutely spyware and undoubtedly a vector for propaganda… But I think the same is true of many American social media apps. Chinese ownership and being subject to the CCP’s authority are relevant differences between TikTok and Facebook. Still, I’d rather see comprehensive privacy legislation addressing the widespread corporate surveillance we face every day.

AI’s Copyright Issues Continue to Evolve

We wrote a comprehensive edition on ML and copyright last month, and there have been some relevant developments!

NVIDIA has now been hit with a class action copyright lawsuit, once again by authors whose books appear in a popular training dataset called “The Pile.” This lawsuit is similar to others previously filed against Microsoft, OpenAI, and others.

One such lawsuit, another class action led by comedian Sarah Silverman, had some of its claims dismissed. This dismissal is “without prejudice,” which means the plaintiff’s attorneys may address issues raised by the court and re-file them. For example, the judge’s ruling explains that the authors failed to cite any outputs “substantially similar — or similar at all — to their books.”

We know models do sometimes spit out identical and near-identical copies of training data because other lawsuits, such as The New York Times vs OpenAI and Getty vs Stability AI, include examples of such outputs in their legal filings.

I spoke with a copyright attorney who told me these types of mistakes are common in class action lawsuits. The class action attorneys know a lot about class action laws, but less about intellectual property laws. Most likely, the class action lawyers will now hire some copyright lawyers to help them fix the issues raised by the judge.

I particularly enjoyed these two opinion pieces on the matter.

Writing for The Atlantic, Alex Reisner argues:

“A blanket ruling about AI training is unlikely. Instead of saying “AI training is fair use,” judges might decide that it’s fair to train certain AI products but not others, depending on what features a product has or how often it quotes from its training data. We could also end up with different rules for commercial and noncommercial AI systems. Grimmelmann told me that judges might even consider tangential factors, such as whether a defendant has been developing its AI products responsibly or recklessly. In any case, judges face difficult decisions. As Bibas admitted, “Deciding whether the public’s interest is better served by protecting a creator or a copier is perilous, and an uncomfortable position for a court.””
Source: The Atlantic

And, in Ars Technica, Timothy B. Lee and James Grimmelmann (who was quoted in The Atlantic piece) look at three historic IP lawsuits: One that destroyed MP3.com, one that resulted in significant fines for Texaco, and one that didn’t hurt Google much at all. Through the lens of those cases, they conclude that “The AI community needs to take copyright lawsuits seriously.”

Meanwhile, a Chinese court has already fined a Chinese ML company for copyright infringement, though the fine was quite small — roughly $1,400.

Teb’s Tidbits

Elon Musk has filed a different kind of lawsuit against OpenAI, claiming they’ve abused his investment money by going for-profit. ($ / Gareth Vipers, Sam Schechner, Deepa Seetharaman / Wall Street Journal)
A look inside the surprisingly lucrative world of low-effort AI-powered TikTok spam. (Jason Koebler / 404 Media)
The EU fined Apple $2 billion over anti-trust violations related to Spotify, the App Store, and revenue sharing. ($ / Tripp Mickle, Adam Satariano / New York Times)
The astronomical costs of AI/ML research are squeezing academia out of the field. ($ / Naomi Nix, Cat Zakrzewski, Gerrit De Vynck / Washington Post)
Google’s DeepMind has built a text-to-video-game AI system, though its current form is highly limited in scope. (Will Douglas Heaven / MIT Technology Review)
Apple’s iMessage has upgraded its encryption to a “post-quantum” algorithm, which ensures message privacy against decryption attacks using quantum computers. (Apple Security Engineering and Architecture Team / Apple Security Research Blog)

Provenance Authentication of AI-Generated Content

Tyler Bettilyon — Sun, 03 Mar 2024 12:00:00 +0000

The Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: Our goal is to deepen your understanding of software and technology by explaining the concepts behind the news.

If you’re new to the Lab Report you can subscribe here.

If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

This month, we released a revised and expanded version of our Intro to Python materials. The materials include new sections on Classes, Regular Expressions, and simple data analysis. They’re also explicitly designed as a series of eight 2-hour class sessions.

As always, these materials are open source with an incredibly permissive Public Domain license. Use them however you see fit, and if you’d like to schedule a training led by yours truly, just click here.

If you’re looking for a free sample of my teaching, I’m running a free session of the section on Python Classes this Wednesday, March 6th, at 6:00pm MST; simply join this Zoom meeting to attend.

Today’s Lesson

The raw images and manifest JSON we generated as part of today’s lesson can be viewed on GitHub.

Provenance Authentication of AI-Generated Content

Generative AI’s ability to deceive continues to break boundaries. Image and audio generators keep improving, and frontier models such as Sora demonstrate that completely fabricated videos are also a near-term concern.

Whether it’s porn, phone calls from Joe Biden, news stories, or pictures of food on DoorDash — AI is faking everything.

Without catastrophizing too much, I think it’s reasonable to be concerned about the democratization and automation of this kind of fakery. Powers once reserved for a few well-resourced groups and individuals are now in the hands of every internet creep, would-be propagandist, and click-farming shyster.

Today’s lesson concerns one of the tools being rolled out to combat this emerging media misinformation ecosystem: provenance authentication.

Provenance authentication is any mechanism that allows someone to verify the origin and history of a piece of media. One such method, led by Adobe’s Content Authenticity Initiative (CAI), is The Coalition for Content Provenance and Authenticity’s (C2PA) provenance model.

Membership in and support of C2PA has been growing fast. In February, OpenAI added C2PA signing to its flagship image generator (DALL-E 3) and Meta announced that Facebook, Instagram, and Threads will all add an interface to display C2PA information for supported media. Reports also suggest that Nikon, Sony, and Canon will add C2PA support directly to some camera models.

Right now, the list of C2PA members includes Adobe, Google, Microsoft, Intel, AWS, the BBC, and many more.

So, What Is C2PA?

C2PA is a system for creating cryptographically verifiable metadata, which can then be attached to various types of data. This metadata system uses a chain of cryptographic signatures to establish the provenance, authorship, edit history, and more of a particular piece of media.

In the simplest possible terms, C2PA allows content creators to do the following in a cryptographically verifiable and tamper-evident manner:

Sign their original media.
Attach various metadata to the media, such as a list of edits performed on a photo.
Specify any “parent” media, such as a previous version of a photo.

These abilities are powered by a “manifest” that is tied to a piece of media via cryptographic hashing and signing. The manifest can be attached directly to metadata-supporting media formats (such as PNG and JPEG images) and/or stored separately.

The manifest contains:

A list of assertions, which can be any statements of fact about the data, such as what camera captured the image or a list of edits applied.
A claim, which bundles all the assertions and binds them to the media at a given moment in time.
A claim signature that ties the claim to a cryptographic key associated with a publisher, specific device, AI tool, or individual who is “signing” the whole manifest.

All of this is typically encoded using Concise Binary Object Representation (CBOR) and tacked onto the image as metadata. C2PA also supports storing the manifest and image data separately.
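
The relationship between assertions, claim, and claim signature can be sketched roughly as follows. This is a simplified illustration, not the C2PA format: it uses JSON instead of CBOR, the field names are hypothetical, and an HMAC with a shared secret stands in for the certificate-based signatures real C2PA tools use.

```python
import hashlib
import hmac
import json

# Stand-in for a signer's private key. Real C2PA signatures use certificates
# and public key cryptography, not a shared secret like this.
SIGNING_KEY = b"hypothetical-signing-key"


def sha256_hex(data):
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(image_bytes, assertions):
    """Bundle assertions into a claim bound to the image, then sign the claim."""
    claim = {
        "image_hash": sha256_hex(image_bytes),
        "assertion_hashes": [
            sha256_hex(json.dumps(a, sort_keys=True).encode()) for a in assertions
        ],
    }
    claim_bytes = json.dumps(claim, sort_keys=True).encode()
    signature = hmac.new(SIGNING_KEY, claim_bytes, hashlib.sha256).hexdigest()
    return {"assertions": assertions, "claim": claim, "claim_signature": signature}


image_bytes = b"pretend these are PNG bytes"
manifest = build_manifest(
    image_bytes,
    [{"label": "created_with", "value": "an image generator"}],
)
print(json.dumps(manifest, indent=2))
```

Because the claim contains hashes of both the image and every assertion, and the signature covers the claim, changing any one piece invalidates the whole bundle.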

An official diagram of the C2PA Manifest, from https://c2pa.org/specifications/specifications/2.0/specs/C2PA_Specification.html

How It Works

Two old standbys, cryptographic hashing and public key cryptography, are at the heart of the C2PA specification.

A key aspect of cryptographic systems is that all parties can independently verify certain things are true. With C2PA, those things are about establishing a “chain of trust” starting with the Certificate Authorities and ending with certainty about the provenance of some data. C2PA’s protocol specifications leverage existing technologies and Public Key Infrastructure to establish chains of trust about how a piece of media came to be.

In an ideal use case, where everyone involved makes an effort to be C2PA compliant, this chain allows users to verify everyone/everything that made changes to the data, in which order, all the way back to its source. The manifest provides us with a record of everyone who signed it, and if any changes were made without a signature, C2PA will detect that this has occurred.

A major weakness of the protocol is that the metadata is trivially easy to remove. Simple, minor changes to the original data can also easily break the cryptographic binding to its manifest. This means that C2PA only gives us confidence about data that have a matching manifest — it tells us nothing about data without a manifest.

I will use this adorable image of a bear that ChatGPT generated as an example to motivate and explore critical aspects of C2PA.

Note that this is actually not the original image. In tests, my publishing platform converted the PNG to a JPEG and stripped the metadata. Run it through the C2PA verify tool to see for yourself and find the original here, which will verify properly (as shown below).

Hashing is a generic and widespread technique to take some input data and deterministically produce an output (called a “hash code” or a “digest”) of a specified length (such as 32 bits or 1024 bits). Hashing is used in database indexing; to create key-value data structures such as JavaScript’s Object and Python’s Dictionary (generically called a Hash Table); to create checksums, and more.

To ensure security, cryptographic hashing places further constraints on the hash function. Without getting too lost in the details, a “cryptographically secure” hash function’s hash codes effectively identify the data used to create them: collisions exist in principle, but finding any other data that produces the same hash code is computationally infeasible. The C2PA manifests are made “tamper evident” by cryptographically hashing various portions of the manifest and raw image data individually and embedding those hashes in the manifest.
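To make these properties concrete, here is a minimal Python sketch using the standard library’s hashlib module. SHA-256 is the algorithm named in the manifest excerpt later in this article ("alg": "sha256"); the byte strings are made-up stand-ins for real image data.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hex string."""
    return hashlib.sha256(data).hexdigest()

original = b"pixels of a cute bear"   # hypothetical stand-in for image bytes
altered = b"pixels of a cute bear!"   # the same data with one byte appended

# Deterministic: the same input always yields the same digest.
same_input_matches = digest(original) == digest(original)

# Fixed length: SHA-256 always yields 256 bits (64 hex characters).
digest_length = len(digest(original))

# Tamper evident: a one-byte change produces a different digest.
altered_differs = digest(original) != digest(altered)

print(same_input_matches, digest_length, altered_differs)
```

The digest length never changes, no matter how large the input is, which is why a short hash can stand in for an entire image.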

Different aspects of the data are hashed separately. People who receive the image and the manifest can recompute the hash codes and verify that they match.

Once all data and the assertions have been hashed individually, they are “bound” to each other using a cryptographic hash function once again. This time, the hash function takes the data and the assertions simultaneously and produces a hash code for the entire “claim.” This binding allows us to verify that the entire bundle hasn’t been tampered with. It also allows us to physically separate the manifest from the original data and reattach it later by verifying the hash code.

If any alterations are made to the media itself or the metadata, then the computed hash codes won’t match, and C2PA-aware systems can flag that the image has been altered.
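The two-level hashing described above can be sketched in a few lines of Python. This is a simplified illustration, not the C2PA algorithm: the real specification serializes data as CBOR and has precise rules about which bytes get hashed. Here JSON stands in for the serialization, and the function names, field names, and sample assertions are all hypothetical.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_claim(media: bytes, assertions: list) -> dict:
    """Hash the media and each assertion separately, then bundle the hashes."""
    return {
        "media_hash": sha256_hex(media),
        "assertion_hashes": [
            sha256_hex(json.dumps(a, sort_keys=True).encode("utf-8"))
            for a in assertions
        ],
    }

def claim_hash(claim: dict) -> str:
    """Hash the whole bundle so a change to any part changes this value."""
    return sha256_hex(json.dumps(claim, sort_keys=True).encode("utf-8"))

def verify(media: bytes, assertions: list, expected: str) -> bool:
    """Recompute every hash from scratch and compare with the recorded one."""
    return claim_hash(build_claim(media, assertions)) == expected

# Hypothetical media and assertions.
media = b"stand-in image bytes"
assertions = [
    {"label": "camera", "value": "hypothetical camera"},
    {"label": "edits", "value": ["crop"]},
]
recorded = claim_hash(build_claim(media, assertions))

intact = verify(media, assertions, recorded)
media_edited = verify(media + b"x", assertions, recorded)
assertion_edited = verify(
    media, [assertions[0], {"label": "edits", "value": []}], recorded
)
print(intact, media_edited, assertion_edited)
```

Because the claim hash depends on every individual hash, changing either the media or any single assertion causes verification to fail.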

Finally, the publisher, creator, camera, and/or other entities associated with the media sign the claim using a public key encryption scheme. This scheme uses much of the same infrastructure that powers HTTPS/TLS. Any signatories use their private key to encrypt the claim hash. The raw and encrypted hash are both embedded in the manifest.

The media and assertions are bound and signed.

Finally, the public “certificate” is also embedded into the manifest. Existing Public Key Infrastructure allows systems and people to verify certificates’ authenticity via the Certificate Authorities who issue them. The certificate also contains the public key that end users need to decrypt the signature. If the decrypted value matches the raw value, we can prove that the certificate owner signed this entire manifest (provided someone hasn’t stolen their private key).

If the decrypted signature and claim hash match, users can go on to verify the rest of the hashes for the original data and assertions to prove the data hasn’t been tampered with.
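The sign-then-verify flow can be illustrated with a toy version of RSA in pure Python. Everything about this key pair is an assumption made for the demo: it is built from two small, well-known primes and is far too weak for real use. Actual C2PA tooling relies on vetted cryptographic libraries, full-size keys, and certificates issued by a Certificate Authority.

```python
import hashlib

# Toy key pair built from two well-known Mersenne primes.
# INSECURE: for illustration only, never use keys like this in practice.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)

def claim_digest(claim: bytes) -> int:
    """Hash the claim and reduce the digest into the toy key's range."""
    return int.from_bytes(hashlib.sha256(claim).digest(), "big") % N

def sign(claim: bytes) -> int:
    """The signer transforms the claim hash with the private exponent."""
    return pow(claim_digest(claim), D, N)

def verify(claim: bytes, signature: int) -> bool:
    """Anyone can undo the transformation with the public exponent and compare."""
    return pow(signature, E, N) == claim_digest(claim)

claim = b"hypothetical serialized claim"
signature = sign(claim)

valid = verify(claim, signature)
forged = verify(b"hypothetical altered claim", signature)
print(valid, forged)
```

Only the holder of the private exponent can produce a signature that verifies, but anyone with the public values can check it. That asymmetry is what lets a certificate tie a manifest to its signer.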

The complete specification is complicated. We have glossed over some details for the sake of brevity and approachability. At the risk of losing some readers in the weeds, let’s look a little closer at the protocol details and some of the open-source tools C2PA has published.

Dissecting a C2PA Compliant Image

If you want to repeat any of these steps, or examine the outputs generated, check out our Github repository for the raw images and manifest outputs.

First, I used the C2PA verify app to validate the original image. Notice that this image already has two links in its chain: one for the original image produced by DALL-E and another for its publication via ChatGPT.

The C2PA command line tool allows us to view a JSON representation of the manifest. Recall that the manifest attached to our image is stored in a binary format called CBOR.

$ c2patool -d cape-bear.webp

{
  "active_manifest": "urn:uuid:bcc56165-0bf4-47e0-be9c-cd25be17b335",
  "manifests": {
    "urn:uuid:bcc56165-0bf4-47e0-be9c-cd25be17b335": {
      "claim": {
        "alg": "sha256",
        "assertions": [
          {
            "hash": "uH2AWcGg9rc+ksEeappGi35hDHvwDZq6MSghk8Nt4gI=",
            "url": "self#jumbf=c2pa.assertions/c2pa.thumbnail.ingredient.jpeg"
 
...

The raw data confirms what the verify app showed: this image has two manifests — one for the original image created by DALL-E and another for when ChatGPT published the image. The most recent manifest is the “active manifest” and we can see it identified by a UUID in the JSON above.
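If you save the tool’s JSON output, a few lines of Python are enough to find the active manifest programmatically. The sketch below uses an inline sample containing only the fields visible in the excerpt above; the full output has many more fields that are not reproduced here.

```python
import base64
import json

# A trimmed stand-in for the tool's JSON output, limited to the fields
# visible in the excerpt above.
report_text = """
{
  "active_manifest": "urn:uuid:bcc56165-0bf4-47e0-be9c-cd25be17b335",
  "manifests": {
    "urn:uuid:bcc56165-0bf4-47e0-be9c-cd25be17b335": {
      "claim": {
        "alg": "sha256",
        "assertions": [
          {
            "hash": "uH2AWcGg9rc+ksEeappGi35hDHvwDZq6MSghk8Nt4gI=",
            "url": "self#jumbf=c2pa.assertions/c2pa.thumbnail.ingredient.jpeg"
          }
        ]
      }
    }
  }
}
"""

report = json.loads(report_text)

# The active manifest is identified by its UUID, which is also its key.
active_id = report["active_manifest"]
active_claim = report["manifests"][active_id]["claim"]

hash_algorithm = active_claim["alg"]
assertion_urls = [a["url"] for a in active_claim["assertions"]]

# The assertion hash is base64 text; decoded, it is a 32-byte SHA-256 digest.
first_hash_bytes = base64.b64decode(active_claim["assertions"][0]["hash"])

print(hash_algorithm, assertion_urls, len(first_hash_bytes))
```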

The JSON representation of the manifest is 159 lines and includes nearly all the information we’d need to verify the image’s provenance, including the hash codes used in the images above.

C2PA’s tool does not include the certificate in the JSON. Instead, we use a different command to extract the certificates in a standard PEM format:

$ c2patool --certs cape-bear.webp
-----BEGIN CERTIFICATE-----
MIIDKTCCAhGgAwIBAgIUTkkWa/Nuvvyy5UHYHXXP6uhNoQ4wDQYJKoZIhvcNAQEM
BQAwSjEaMBgGA1UEAwwRV2ViQ2xhaW1TaWduaW5nQ0ExDTALBgNVBAsM

...

Let’s Make Some Changes

First, I converted this image to a .png using ImageMagick.

magick cape-bear.webp cape-bear.png

Unfortunately, that process destroyed the metadata. This is to be expected. File format conversion changes the image data, so the hash code of the png won’t match the one in the manifest generated from the webp file. This demonstrates a fundamental limitation of C2PA: it’s trivially easy to strip the metadata from the image. In my experimentation, keeping the metadata is much more work; I accidentally stripped it several times.

I could proliferate this credentialless image to create confusion about its origin. Or I could sign it myself and claim that OpenAI is stealing my original work. In fact, I could make all kinds of bogus, falsified, and fraudulent material and sign it using C2PA.

This is by design: C2PA’s trust model lets you verify who handled a piece of media, not what media is accurate or valuable. My signature on the image proves that I handled it, but it’s up to users to decide if I am trustworthy.

Here’s what the verify tool shows for the converted png:

Oops, what happened to my manifest?

To simulate good stewardship of a C2PA image, I used the C2PA tool to label my image as a derivative of the original .webp file. This required me to create my own manifest, in which I included two assertions: that I was the author and that I converted the image.

{
    "claim_generator": "Teb's Lab Demo",
    "assertions": [
        {
            "label": "stds.schema-org.CreativeWork",
            "data": {
                "@context": "https://schema.org",
                "@type": "CreativeWork",
                "author": [
                    {
                        "@type": "Person",
                        "name": "Tyler Bettilyon"
                    }
                ],
                "actions": [
                    {
                        "action": "c2pa.converted"
                    }
                ]
            }
        }
    ]
}

Then, I bound the new manifest and old image to the new png using the open-source C2PA Tool:

c2patool cape-bear.png -p cape-bear.webp -o signed-cape-bear.png -m added-manifest.json

Because I do not have an actual certificate from a Certificate Authority, the tool used its default certificate and gave me this warning:

Note: Using default private key and signing certificate. This is only valid for development. A permanent key and cert should be provided in the manifest definition or in the environment variables.

This demonstrates another weakness of the protocol: many people involved in creating and editing media are regular people who have neither public keys and certificates nor the technical know-how to obtain and publish them. For C2PA to achieve broad adoption and use, PKI must be made more accessible to those people.

Anyway, the verify tool now tells us this image has three manifests: the two you saw above and the one I just added. Apparently, I accidentally destroyed one of the thumbnails. The tool also warns us that “This Content Credential was issued by an unknown source.” It does display my name, and the raw manifest maintained my “converted” action, although it wasn’t shown in the verify app.

Finally, I spent hours trying to edit the image in a way that wouldn’t destroy the content credentials wholesale… and failed. I hoped to produce a nice warning in the verify app that, “this image has been tampered with.” Unfortunately, every editor I tried discarded the metadata, so I just ended up with this:

The cute bear is so grumpy that no one will validate its provenance.

Limitations and The Future

It’s wonderful to see support for improving trust and security on the internet. Some big players have already signed on to the C2PA standard. If integrations on major search platforms, social media sites, and Adobe’s media editing empire go well, more publishers and creators will invest in establishing the provenance of their work.

In addition to some of the weaknesses we highlighted above, there are serious detractors. For example, cryptographer Dr. Neal Krawetz points to the limited capacity of C2PA’s trust model to verify certain claims and argues that C2PA has significant flaws.

Ultimately, C2PA will only help users identify that honest people are honest. And to be clear, that is useful. Knowing the New York Times signed a particular photograph can help you decide how to view that photo.

But C2PA won’t do much with respect to the vast majority of images that do not have any C2PA metadata. There is also a risk of granting a false sense of security if “signed” data are considered credible without much consideration regarding who did the signing.

Such is life on the internet.

Teb’s Tidbits

Overcorrecting from previous allegations of racial bias, Google’s Gemini generated racially diverse images of Nazis.
A privacy-focused class action lawsuit in California implicates sex toy seller Adam and Eve and Google Analytics in leaking IP addresses and search history.
Amazon will require warrants for law enforcement to access footage from Ring doorbells.
The US government continues to spend big on bolstering the computer chip manufacturing industry.
The source of the fake Joe Biden robocalls has confessed and is not apologetic.

Remember…

The Lab Report is free and doesn’t even advertise. Our curricula are open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world-class trainings or requesting a custom class for your team.

Intellectual Property vs Artificial Intelligence

Tyler Bettilyon — Sun, 04 Feb 2024 12:00:00 +0000

The Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: Our goal is to deepen your understanding of software and technology by explaining the concepts behind the news.

From The Lab

Salutations!

We’re back from an unexplained extended absence. In November I received a “battlefield promotion” to Acting Head Debate Coach at Highland High (Go Rams!). That turned out to be a lot of work and stress. Cutting this newsletter from my priorities freed me up to help Highland Debate survive until a full-time coach could be found.

Relatedly: We’re changing the cadence of this newsletter going forward from weekly to monthly. We try to publish in-depth, high-quality, well-researched articles. I’m currently the only writer, only editor, and only researcher. The slower publishing cycle will spare me from burnout and increase the quality of each edition.

Moving forward, we will publish this newsletter on the first Sunday of each month.

Today’s Lesson

Intellectual Property vs Artificial Intelligence

A slew of intellectual property lawsuits have been filed against AI firms in the past year or two. The law firm Baker Hostetler hosts the most complete list I’ve found, with 13 active cases as of this writing.

These legal battles involve huge firms and household names on both sides of the complaints. The plaintiffs include The New York Times, Getty Images, Thomson Reuters, Concord Music Group, and multiple class actions. The defendants include Meta, OpenAI, Stability AI, Anthropic, Alphabet (AKA Google), and Microsoft.

Some of the legal questions being posed could fundamentally change the legality and economics of training large ML models. In today’s edition we’re examining the biggest allegations, responses, and potential impact of the aforementioned lawsuits.

A note: today’s lesson is focused on United States intellectual property law, since that’s where these lawsuits are filed.

Question 1:
Is Training a Model Infringement Per Se?

Training a large, modern, machine learning model requires lots of training data. This first question asks: If that training data is copyright protected then is the training process itself an infringing act? Multiple lawsuits allege that it is. Here are two examples drawn from the official complaints:

Unfairly, and perversely, without Plaintiffs’ copyrighted works on which to “train” their LLMs, Defendants would have no commercial product with which to damage—if not usurp—the market for these professional authors’ works. OpenAI’s willful copying thus makes Plaintiffs’ works into engines of their own destruction.
[…]
As the U.S. Patent and Trademark Office has observed, LLM “training” “almost by definition involve[s] the reproduction of entire works or substantial portions thereof.”
“Training” in this context is therefore a technical-sounding euphemism for “copying and ingesting expression.”
Authors Guild vs OpenAI amended complaint.

And

Because OpenAI’s GPT models cannot function without the expressive information extracted from Plaintiffs’ and Class members’ works and retained by the GPT models, GPT and ChatGPT are themselves infringing derivative works, made without Plaintiffs’ and Class members’ permission in violation of their exclusive rights under the Copyright Act.
Chabon vs OpenAI amended complaint

A fundamental copyright protection is the “right to exclude.” This allows a copyright holder to bar anyone from using their intellectual property without permission. But ML firms are feeding protected works — en masse — to an ML model’s training procedure.

Plaintiffs are saying: we have the right to exclude our work from being used to train models; training is infringement per se. If courts agree, AI firms will have to establish licensing deals with every copyright holder represented in a model’s training dataset to legally produce that model, or risk being sued.

Furthermore, such models would be considered “derivative works” of that training data. Producing derivative works is an exclusive right of a copyright holder, meaning AI firms could be forced to unpublish any such model. This would give rights holders significant leverage during any license negotiation.

Crucially, this claim is agnostic to the model’s output. Another claim, which we’ll examine momentarily, involves models producing outputs that are identical or nearly identical to training samples.

Potential Impact: Huge

A model’s performance is directly correlated with its training data’s quality. Modern ML models also require enormous amounts of training data to be successful. The problem for ML firms is that producing huge quantities of excellent-quality data is extravagantly expensive.

Here’s Getty Images’ lawyers’ take:

Getty Images has spent years coordinating and arranging the Database, including, inter alia, by setting criteria for inclusion of images, selecting specific images for inclusion, creating and incorporating detailed captions and other text paired with images, creating and assigning unique asset identifiers that can be linked to specific contributors, and arranging the contents of the Database so that the Database is searchable and results can be filtered. Additionally, Getty Images has and continues to invest significantly in maintaining the contents of the Database. Between 2017 and 2020 alone, Getty Images and its affiliates invested more than $200 million to maintain the Database.
Getty Images vs Stability AI amended complaint

This database is only a goldmine for image generators due to this enormous investment.

The “Books3” dataset at the heart of Authors Guild vs OpenAI represents literally centuries of human effort. Writing and editing a single book takes months to years of work and Books3 contains roughly 200,000 books.

The Common Crawl dataset described in the New York Times vs Microsoft complaint contains roughly 16 million unique content records just from the New York Times network. The Times paid “approximately 5,800 full-time equivalent employees” for years to produce that work.

Plaintiffs are asking: If AI firms are completely reliant on this ridiculously expensive body of work to train their models, shouldn’t they pay for it?

In keeping with Silicon Valley’s historically cavalier attitude towards regulation, David Holz, the CEO of ML firm Midjourney, recently said the quiet part out loud in an interview with Forbes:

Did you seek consent from living artists or work still under copyright?
No. There isn’t really a way to get a hundred million images and know where they’re coming from. It would be cool if images had metadata embedded in them about the copyright owner or something. But that's not a thing; there's not a registry. There’s no way to find a picture on the Internet, and then automatically trace it to an owner and then have any way of doing anything to authenticate it.
David Holz, in Forbes

It’s hard to predict the value and cost of licensing deals and punitive damages here. However, the sheer volume of works consumed gives AI firms major exposure.

One final point: if models are found to be derivative works, then AI firms who have already published “open source” models trained on copyright-protected material are in huge trouble.

Question 2:
Can The Model’s Outputs Be Infringing?

Generative models sometimes regurgitate their inputs verbatim or near-verbatim. This is a manifestation of overfitting in generative models. Here’s an example from the Getty Images legal filing:

An image from the Getty Images vs Stability AI amended complaint. On the left is an original image from the Getty Images database. On the right is an image generated by Stable Diffusion. Note that Stable Diffusion has even sort of reproduced the Getty Images watermark.

And here’s an example from the New York Times filing:

Lawyers for the New York Times got GPT-4 to reproduce large sections of NYT articles verbatim. Red text is a verbatim match.

In one instance lawyers literally asked ChatGPT for a verbatim copy of a New York Times article because they couldn’t get around the paywall:

ChatGPT can “certainly!” help you avoid the NYTimes paywall.

Using ChatGPT to bypass a paywall would be direct infringement, but this issue also opens AI firms to claims of “contributory infringement” where users might (even unwittingly) prompt an AI system to generate infringing content and publish it themselves. In this case, the AI firm can be liable for facilitating users’ infringing acts.

AI firms claim they want to eliminate this behavior. In early January OpenAI published this in a blog post:

Memorization is a rare failure of the learning process that we are continually making progress on, but it’s more common when particular content appears more than once in training data, like if pieces of it appear on lots of different public websites. So we have measures in place to limit inadvertent memorization and prevent regurgitation in model outputs. We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use.
OpenAI Blog

It’s true that overfitting and memorization are generally considered bugs, not features. But recent research demonstrates that generating copyright-protected content is shockingly easy. Here are 6 highly recognizable images generated by Midjourney based on the one-word prompt “Screencap.”

These images, all produced by Midjourney, closely resemble film frames. They were produced with the prompt “screencap.” Gary Marcus and Reid Southen via Midjourney. Source: IEEE Spectrum

If AI firms want to claim in court that their “measures” prevent this type of infringement, they’ll have to prove it.

Potential Impact: Medium

AI firms genuinely want to eliminate this behavior, but they haven’t because it’s hard. Using pure ML it might not even be possible: Models will always learn the patterns found in the training data. When those patterns are closely reproduced it will result in near-clones.

Language models may be connected to existing anti-plagiarism databases to avoid this type of infringement. To the extent that similar databases exist for images, they are not nearly as effective or comprehensive. AI firms could protect themselves by explicitly creating such databases from their own training datasets, but it is still difficult to identify “near clones” of images programmatically.
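For text, the basic idea behind checking an output against known works can be sketched with word n-grams: count how many short word sequences in a candidate also appear in a reference. This is a toy illustration under our own assumptions, not a description of how any AI firm or anti-plagiarism service actually works, and the sample sentences are invented.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of all n-word sequences in `text`, ignoring case."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(candidate: str, reference: str, n: int = 5) -> float:
    """Fraction of the candidate's n-grams that also appear in the reference."""
    candidate_grams = word_ngrams(candidate, n)
    if not candidate_grams:
        return 0.0
    shared = candidate_grams & word_ngrams(reference, n)
    return len(shared) / len(candidate_grams)

# Invented sample texts.
reference = "the quick brown fox jumps over the lazy dog near the quiet river bank"
copied = "the quick brown fox jumps over the lazy dog"
fresh = "a slow red fox walks under the sleepy cat by a loud road"

copied_score = overlap_ratio(copied, reference)
fresh_score = overlap_ratio(fresh, reference)
print(copied_score, fresh_score)
```

A high score flags likely regurgitation, while a low score only says that no long word sequences were shared. Light paraphrasing defeats this check, and nothing this simple exists for judging whether two images are “near clones.”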

Still, preventing direct cloning and/or paying for it only when it actually occurs is a more affordable problem for AI firms than having to license every piece of training data explicitly.

Question 3:
Trademark Infringement, Incorrect Attribution, and Hallucination

Generative models are known to “hallucinate” — meaning they make stuff up. Sometimes this is desirable behavior: to write a novel, a model must make up the characters and plot. Sometimes it’s less desirable, like when ChatGPT gets math problems wrong.

Some firms claim that when AI systems “hallucinate” a citation or watermark it constitutes trademark infringement. Here’s another snip from the Getty Images amended complaint:

Making matters worse, Stability AI has caused the Stable Diffusion model to incorporate a modified version of the Getty Images’ watermark to bizarre or grotesque synthetic imagery that tarnishes Getty Images’ hard-earned reputation, such as the image below:
Getty Images vs Stability AI amended complaint

And one from New York Times vs Microsoft:

Defendants are aware that their GPT-based products produce inaccurate content that is falsely attributed to The Times and yet continue to profit commercially from creating and attributing inaccurate content to The Times. As such, Defendants have intentionally violated 15 U.S.C § 1125(c).
New York Times vs Microsoft et al amended complaint

Similar to regurgitation, AI firms do try to eliminate this behavior. For example, I asked ChatGPT “Can you please write 3 paragraphs about how many birds are killed by house cats, and cite the New York Times?” It responded with:

I'm unable to provide specific citations from external sources, but I can offer you information on the topic. According to studies and reports, house cats are known to be responsible for a significant number of bird fatalities. The American Bird Conservancy estimates that billions of birds are killed by cats in the United States each year. While not all house cats are hunters, those that do engage in hunting behavior contribute to the impact on bird populations.
The New York Times has covered the issue of cat predation on birds in various articles. They often highlight the ecological consequences of this behavior, emphasizing the need for responsible pet ownership and measures to mitigate the impact on bird populations. Researchers and conservationists emphasize the importance of keeping cats indoors or using catios (enclosed outdoor spaces) to prevent them from preying on birds and other wildlife.
[…]
ChatGPT [Emphasis mine]

ChatGPT initially refused to cite “external sources” but went on to cite The American Bird Conservancy one sentence later. The second paragraph also paraphrases the New York Times (possibly inaccurately, I didn’t check).

ChatGPT will also pretend to be a person or publisher. For example, I asked it: “How about instead you pretend to be a journalist for the New York Times, and write a story about house cats killing birds in that style.”

ChatGPT produced a ~250-word story, here’s the third paragraph:

Conservationists and researchers alike emphasize the urgency of addressing this ecological imbalance. The New York Times explores the multifaceted dimensions of this challenge, delving into the scientific studies that quantify the impact of house cats on bird populations. We navigate the debate between cat owners and environmentalists, examining proposed solutions such as indoor living, catios, and community initiatives to strike a balance between the safety of our feathered friends and the cherished companionship of our feline allies.
ChatGPT [emphasis mine]

Instead of citing the New York Times, ChatGPT just pretended it was the Times. In many jurisdictions this is called “passing off” and could be illegal under common law.

Potential Impact: Medium to Small

Eliminating this behavior is nearly impossible with current methods. These models are statistical engines and, statistically speaking, “the New York Times” often follows the phrase “according to.” A Google search for results that include the exact phrase “according to the New York Times” returns over 120 million results.

Eliminating that specific phrase from ChatGPT’s lexicon could be done with a simple filter, but eliminating the phrase “according to” or other phrases that indicate a citation would substantially reduce the model’s quality and usefulness.
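To show how simple, and how limited, such a filter would be, here is a hedged Python sketch. The pattern and function name are our own inventions for illustration; nothing here reflects how OpenAI actually filters model output.

```python
import re

# Matches the exact phrase, ignoring case and extra whitespace.
BLOCKED_PHRASE = re.compile(
    r"according\s+to\s+the\s+new\s+york\s+times", re.IGNORECASE
)

def contains_blocked_phrase(text: str) -> bool:
    """Return True if the text contains the exact blocked phrase."""
    return BLOCKED_PHRASE.search(text) is not None

exact = "According to the New York Times, house cats kill many birds."
reworded = "The New York Times has covered the issue of cat predation on birds."

exact_caught = contains_blocked_phrase(exact)
reworded_caught = contains_blocked_phrase(reworded)
print(exact_caught, reworded_caught)
```

The filter catches the exact phrase but misses a reworded attribution like the one ChatGPT produced above. Catching every phrasing would mean blocking ordinary citation language altogether.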

Ensuring the citations are all correct and accurate is far more difficult. Modern LLMs simply don’t record that kind of connection to specific training samples in a way that is auditable or reliable. Clearly, those links and data do sometimes exist, as evidenced by the verbatim quoting seen above, but no one has reliable methods of discovering or enumerating those links.

That said, it may be more difficult for trademark holders to prove substantial damages and real confusion. Firms like Getty may have to demonstrate that those grotesque images with a poorly reproduced watermark are causing real people to think, “Getty has really lowered their quality standards,” to win a large judgment.

Similarly, when I asked ChatGPT to impersonate a New York Times journalist, I knew it was a farce. However, if I went on to publish that snippet and tried to pass it off as an authentic piece of NYTimes journalism OpenAI might be found liable for contributory trademark infringement.

The Most Likely Defense: Fair Use

Most of these lawsuits are in the early stages. Plaintiffs have made official allegations, but official responses from defendants are mostly still pending; defendants’ current filings are mostly about procedural matters such as the relevant jurisdiction.

One thing we can be sure of is that AI firms are going to make a “fair use” defense. Here’s another snip from OpenAI’s blog:

Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.
OpenAI Blog Post

Fair use is a legal doctrine that allows the use of copyright-protected material without a license under certain conditions. Fair use is generally not a clear-cut decision. There isn’t a simple, objective test that can be applied. It's a judgment call made on a case-by-case basis. Additionally, there are two versions of fair use, one for copyright and one for trademark.

For copyright, there are four guiding principles that judges and juries use to determine if a use is fair:

1) The purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes.

Educational and other non-profit uses are more likely to be fair. So are “transformative” uses, which present the copyright-protected work in new and original ways.

For example, copying images for the purpose of creating a searchable index and presenting search users with thumbnails has been found to be “transformative” and covered under fair use. Similar rulings have been made about copying and displaying the text of books as part of a search index.

AI firms will certainly highlight the “transformative” nature of the training process as part of a fair use defense. Taking a book and applying complex mathematics to turn that text into the numeric parameters that power a model is plainly transformative.

However, you could make a similar argument about compression algorithms. A compressed file is “transformative” in that the compressed data doesn’t remotely resemble the original. But you can use the compressed data to get a perfect copy of the original, so how transformative is it really?
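The compression point is easy to demonstrate with Python’s built-in zlib module. The sample text is arbitrary; any repetitive input would behave the same way.

```python
import zlib

# Arbitrary, repetitive sample text standing in for a protected work.
original = ("Software just keeps getting bigger and slower. " * 20).encode("utf-8")

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# The compressed bytes look nothing like the original text...
looks_different = compressed != original
is_smaller = len(compressed) < len(original)

# ...yet decompression yields a byte-for-byte perfect copy.
perfect_copy = restored == original

print(len(original), len(compressed), looks_different, perfect_copy)
```

The stored bytes are unrecognizable as the original, but the original is fully recoverable from them. That tension between changed form and preserved content is at the heart of the dispute over how “transformative” training really is.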

Models have repeatedly reproduced verbatim copies of their training data. Indeed, some computer scientists are explicitly using ML models as a replacement for compression.

To me, ML models are clearly somewhat transformative in nature. Most content produced by generative AI is not a clone or near-clone of the training data. How much that matters will depend on the other three factors.

2) The nature of the copyrighted work.

Copyright’s purpose is to encourage creative expression. As such, more creative works generally enjoy more protection. For example, copying a bullet list of facts from a textbook is more likely to be considered fair use than copying a paragraph from a novel. This is because repeating facts in a list isn’t particularly creative.

Original artwork and novels are both a lot more “creative” than journalism, which is mostly composed of facts. But even for “less creative” content such as journalism, defendants may struggle with this argument: The real value of a language generator is that it writes like a human, not that it reliably regurgitates facts. It is the creative aspect of producing journalism that is valuable to OpenAI, more so than the factual content of the writing.

3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole.

The less complete the copy, the more likely it is covered by fair use. Quoting a paragraph from a novel as part of a presentation about writing styles is likely to be fair use. Copying an entire short story is less likely to be fair use.

AI firms are copying entire books, articles, images, and more in huge quantities.

4) The effect of the use upon the potential market for or value of the copyrighted work.

In a nod to this component of fair use, The Authors Guild argues that “OpenAI’s willful copying thus makes Plaintiffs’ works into engines of their own destruction.” And they might have a good point.

Can Stability AI credibly claim that Stable Diffusion isn’t a direct competitor to stock image firms like Getty? Absolutely not. Many writers — myself included — are already using AI generators as a wholesale replacement for stock images and illustrations.

Can OpenAI credibly claim that GPT-4 doesn’t make original writing less valuable? Unlikely. Copywriting firms and ad agencies are already laying off writers and leaning into AI tools.

Remember, this is about the effect on the value of the copyrighted work, not about the business model of the copyright holder. Twisted arguments like, “This allows the Times to lay off staff writers and save money by using AI-generated text,” won’t get AI firms out of this pickle.

For trademarks, fair use is (generally) a bit simpler: using a trademark in good faith and in a way that isn’t likely to cause confusion is considered fair use. For example, you can use the word mark “The New York Times” to compare them to your own news organization (nominative fair use) or to describe their products (descriptive fair use).

False attribution and passing off are not examples of good faith trademark uses.

The Bottom Line

These lawsuits represent a substantial risk for AI firms. It’s quite possible that AI operators are going to owe copyright holders a lot of money on a retroactive and ongoing basis. Those licensing deals have the potential to fundamentally change the economics of producing AI models, which are already very expensive to build, train, and maintain.

AI tools obviously stand to deliver a lot of value, which is why they’ve been adopted so quickly and widely. I don’t think the big firms are going to fold under the weight of these copyright issues, but I do expect AI tools to get more expensive if the plaintiffs win many of their claims.

The News Quiz

Every month we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.

Rank the following situations from most to least likely to be considered a “fair use.” For bonus points identify which parties could be sued by the rights holder:

1) In a lesson about plagiarism a college professor has students try to get a language model to produce the first chapter of George R. R. Martin’s Game of Thrones.

2) A paid newsletter writer uses an image generator to produce an image for their next newsletter. Unbeknownst to the writer, the image is nearly identical to a copy-protected image.

3) An artist uses an image generator to make a comic panel that puts Donald Trump’s iconic hair on the head of the video game character Bowser, who says, “It’s a witch hunt — I never did anything to Princess Peach!” and sells prints of the image to her fans.

4) A copywriter uses a language model to draft a webpage comparing the services of two companies. The model uses those companies’ trademarked names to do so, but hallucinates several erroneous details about the services provided. The copywriter does not notice the errors and publishes the webpage.

Themes in the News

Generative AI in Politics

The predictions came true: Generative AI has entered the political fray.

There were two fairly big stories recently: Falsified audio of President Joe Biden’s voice was used to encourage people not to vote in the New Hampshire primary. And OpenAI banned a ChatGPT-based bot built to imitate Democratic presidential hopeful Dean Phillips. It was OpenAI’s first ban for violating the political speech component of their terms of service.

In truth, Generative AI will be used for much more than fakery this election cycle. From bespoke email campaigns to digital advertisements, Generative AI “hasn’t changed the fundamentals; it’s just lowered the production costs of creating content, whether or not intended to deceive.”

AI-Generated Pornographic Images of Taylor Swift

Fake images of Taylor Swift were widely propagated on X. Reporting by 404 Media found the images likely originated in encrypted chat rooms on Telegram, and then found their way to 4Chan, X, and other sites. Like many things T-Swift, the images gained widespread attention, which caused advocates and even Congress to take an interest.

According to reporting by Tech Policy Press, nearly 96% of deepfake images online are pornographic in nature and the total number of deepfake images online increased 550% between 2019 and 2023.

In classic, clunky style, X responded by wholesale blocking searches for Taylor Swift for a while in an attempt to reduce the reach of the images.

Teb’s Tidbits

Audio-based ML is being used to monitor the health of forests.
Combining LLMs with old-school symbolic engines has led to more mathematically competent AI.
Amazon will start requiring warrants for police to access Ring doorbells’ footage.

Answers To The News Quiz

I’ve ranked the scenarios with most likely to be fair use on top to least likely on bottom. I’ve kept their original numbering and description.

1) In a lesson about plagiarism a college professor has students try to get a language model to produce the first chapter of George R. R. Martin’s Game of Thrones.

The nature of this use is educational, non-profit, and explicitly about plagiarism in an academic context. Moreover, no one actually published their potentially infringing works. This is likely to be fair use.

If anyone is liable, it could be the language model’s creators. It would also depend on whether Game of Thrones was in the training data, and potentially on whether or not anyone successfully produced verbatim or near-verbatim copies of the work.

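3) An artist uses an image generator to make a comic panel that puts Donald Trump’s iconic hair on the head of the video game character Bowser, who says, “It’s a witch hunt — I never did anything to Princess Peach!” and sells prints of the image to her fans.

The use is commercial. Bowser and Princess Peach are trademarked characters and their likeness is copyright protected. Courts may consider the commercial nature of the product and potential damage to Nintendo’s brand by being associated with a political message. However, this is a clear example of parody and political speech and will likely be covered under fair use.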

If it is not considered fair use the artist could be found liable for direct infringement and the image generator’s creator could be held liable for contributory infringement.

4) A copywriter uses a language model to draft a webpage comparing the services of his company to another company. The model uses both companies’ trademarked names and hallucinates several erroneous details about the services provided by his competitor’s firm. The copywriter does not notice the errors and publishes the webpage.

While comparison of this variety is protected under fair use, lying about your competitor while using their trademark generally is not. Had the model produced only accurate comparisons, this would likely be fair use. But given the hallucinations, this situation is probably not protected.

The writer could be liable for trademark infringement, the writer’s company could be liable for “vicarious” infringement, and the language model’s creator could be liable for contributory infringement.

2) A paid newsletter writer uses an image generator to produce an image for their next newsletter. Unbeknownst to the writer, the image is nearly identical to a copy-protected image.

Ignorance is generally not an excuse under the law. This isn’t likely to be considered fair use. The more closely the images match, the more likely this is to be infringement.

The writer could be liable for direct infringement and the image generator’s creator for contributory infringement.

Remember…

The Lab Report is free and doesn’t even advertise. Our curricula are open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world class trainings or requesting a custom class for your team.

What Does it Mean to Block GPTBot?

Tyler Bettilyon — Fri, 01 Sep 2023 09:00:00 +0000

The Weekly Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: Our goal is to deepen your understanding of software and technology by explaining the concepts behind the news.

If you’re new to the Lab Report you can subscribe here.

If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

We’ve expanded our open enrollment offerings with DevSprout! Initially we planned to offer an introductory Python class, but we’ve decided to offer an introductory SQL class as well. The outlines for these classes can be found here:

The Intro to SQL class has a tentative start date of September 18. We will meet twice a week, Mondays and Thursdays, for 4 weeks. Each session will meet over Zoom for two hours, from 5:30pm - 7:30pm US Pacific Time. The cost is $100 for the first 10 people who sign up, $200 after that. We cap classes at 25 attendees to ensure a high quality virtual classroom experience. Browse the curriculum on Github.

Respond to this email to reserve your spot in class. An official enrollment portal is forthcoming, but not available yet.

The Python class still needs a little curriculum development so our start date is TBA. Today’s edition of the newsletter is, in part, a preview of something you’ll learn in our Intro to Python course: web scraping.

P.S. I’m traveling next week, so there won’t be a Lab Report.

Today’s Lesson

All the code from today’s lesson can be viewed on Github.

Crawling in The Web, Looking For The Data

Yes, that’s a Hoobastank reference. You’re welcome.

Today’s topic is brought to you by several news organizations who’ve started blocking OpenAI’s web crawler from their websites, and by me because it’s one of the topics in my Introduction to Python class with DevSprout.

Web crawling and scraping have been around since 1993, when the first web crawler was built in an attempt to measure the size of the then-nascent world wide web. Then, search engines built their own crawlers in order to identify, analyze, and rank websites. Flash forward to today, where crawlers and other bots comprise a huge portion of web traffic.

“Crawling” generally refers to automated systems that make web requests to web pages, identify links on those pages, then follow those links and repeat the process recursively. Crawlers often extract additional information from each page they visit.

Periodically, web crawling has been a hot-button issue. Once upon a time LinkedIn filed a lawsuit alleging that hiQ Labs was illegally scraping LinkedIn’s publicly available user data. LinkedIn ultimately lost that case. Clearview AI, a company that sells facial recognition software, settled a lawsuit brought by the ACLU related to their massive photo database which was largely harvested by web crawlers. Now, OpenAI has announced GPTBot, which crawls the web to collect training data, apparently to the chagrin of news organizations and other copyright holders who are scrambling to block the bot.

Despite the occasional controversy, web crawling is common, generally legal, and often quite simple. For example, here are a few lines of Python that print all the links on Abe Lincoln’s Wikipedia page:

from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Abraham_Lincoln") 
soup = BeautifulSoup(page.content, 'html.parser')
 
all_links = soup.find_all('a')

for link in all_links:
    print(link.get_text(), link.attrs.get('href'))

(View this code on Github for better syntax highlighting)

It’s crude, but these 7 lines of code are the basis of crawling. It gathers all the links on the page by finding all the ‘a’ tags (short for anchor, HTML’s standard tag for hyperlinks). To really “crawl” you can put those links in a queue, and repeat the process until the queue is empty.

Sophisticated crawlers also need to handle special cases such as links you’ve already visited and websites that use JavaScript on the front end to populate the content sometime after the initial page load.
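
The queue-and-repeat idea can be sketched in a few more lines. This is a minimal illustration rather than production code, and it makes some assumptions: it swaps in Python's standard-library HTMLParser instead of BeautifulSoup so it has no dependencies, the names LinkCollector, extract_links, crawl, and FAKE_WEB are made up for this example, and the "web" it crawls is a tiny in-memory dictionary so the sketch runs without touching a real site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, page_url):
    """Return absolute URLs for every link on the page, with #fragments removed."""
    collector = LinkCollector()
    collector.feed(html)
    return [urldefrag(urljoin(page_url, href)).url for href in collector.hrefs]


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # the "links you've already visited" special case
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        for link in extract_links(html, url):
            if link not in visited:
                queue.append(link)
    return visited


# A tiny fake "web" (hypothetical pages) so the sketch runs offline.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/b#top">B</a>',
    "https://example.com/b": '<a href="/a">A</a>',
}

pages = crawl("https://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

To point this at real pages you could pass something like `lambda url: requests.get(url).text` as the fetch argument. The max_pages cap is there because a real crawl of a large site would otherwise run for a very long time.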

Copying content from pages adds a bit more complexity, but not too much. Here’s a crude approximation of what Clearview AI did. Mine harvests images from Wikipedia — rather than LinkedIn — and saves the images to files:

from bs4 import BeautifulSoup
import requests
import shutil

base_url = "https://en.wikipedia.org"
page = requests.get("https://en.wikipedia.org/wiki/Abraham_Lincoln") 
soup = BeautifulSoup(page.content, 'html.parser')
 
all_img_tags = soup.find_all('img')

img_count = 0
for img in all_img_tags:
    img_url = img.attrs.get('src')
    if not img_url:
        continue  # some img tags have no src attribute; skip them
    
    # Images on wikipedia have two cases
    if img_url.startswith('//'):
        absolute_url = f'https:{img_url}'
    else:
        absolute_url = base_url + img_url

    response = requests.get(absolute_url, stream=True)
    file_type = response.headers['content-type'].split('/')[-1] # kinda gross, but works.
    with open(f'09-01-2023/img_out/{img_count}.{file_type}', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    img_count += 1

(View this code on Github for a more beautiful experience)

GPTBot is designed to scrape the text of news articles from various news websites. Each website operator makes decisions about how to format and present that text, which means GPTBot’s operators need specialized code that extracts this text on a per-website basis. They also probably want to ignore advertisements, image captions, and other ancillary text that might appear.

I made two scrapers to demonstrate that this is also, generally, quite easy. Here’s one that gets the body of CNN news articles:

from bs4 import BeautifulSoup
import requests

# A randomly selected CNN article from the day I wrote this script.
page = requests.get("https://www.cnn.com/2023/08/30/business/san-francisco-union-square-retail-closures/index.html")

soup = BeautifulSoup(page.content, 'html.parser')
individual_p_tags = soup.select('.article__content p')  # CNN's content sits in p tags under a div with this class
texts = [tag.text.strip() for tag in individual_p_tags]
a_text = '\n'.join(texts)

with open('09-01-2023/news_text_out/cnn_article.txt', 'w') as file:
    file.write(a_text)

(Again, the code looks better on Github)

And one for the BBC:

from bs4 import BeautifulSoup
import requests

# A randomly selected BBC article from the day I wrote this script.
page = requests.get("https://www.bbc.com/sport/football/66662060")
soup = BeautifulSoup(page.content, 'html.parser')

# BBC wraps the main body in a div with this class, but uses p's for the text
individual_p_tags = soup.select('.story-body p')
texts = [tag.text for tag in individual_p_tags]
a_text = '\n'.join(texts)

with open('09-01-2023/news_text_out/bbc_article.txt', 'w') as file:
    file.write(a_text)

(Github is 1000x better at displaying this code than my newsletter provider)

My point is that harvesting data from websites is remarkably easy, which makes it appealing to the increasingly data-hungry ML industry.

How Do Companies Stop This?

The bottom line is that if human users can access data using a web browser without some kind of authentication then a bot can too. Web crawlers can be designed such that they are basically indistinguishable to the companies’ servers from a human using a browser. But there are a few options that companies often use to reduce or manage bot traffic.

Robots.txt

The first option is a robots.txt file. These files are more of a request than a defense. Websites use these files to indicate to web crawlers and other bots that certain pages shouldn’t be indexed, viewed, or otherwise accessed by bots. When OpenAI announced GPTBot, they also added documentation for how to modify your site’s robots.txt to prevent GPTBot from accessing certain pages or directories.

When you read that “news organizations are scrambling to block GPTBot” what they mean is that websites are updating their robots.txt file, which is a totally standard and very easy thing for a web developer to do.

Unfortunately, robots.txt only works for bots that choose to respect the specified rules.
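
To make the "request, not a defense" point concrete, here is a small sketch using Python's standard-library robots.txt parser. The rules are a hypothetical example for a made-up site: GPTBot is the user agent token OpenAI's crawler identifies itself with, but the /premium/ directory and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt; the /premium/ path is made up for this example.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /premium/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url):
    """True if the rules above permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(allowed("GPTBot", "https://example.com/premium/story.html"))
print(allowed("GPTBot", "https://example.com/free/story.html"))
print(allowed("SomeOtherBot", "https://example.com/premium/story.html"))
```

Notice that the check runs inside the crawler. A polite bot calls something like `allowed` before every request; nothing on the server forces a rude one to.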

Honeypots

Some websites host fake content on a particular URL, then add a line in their robots.txt explicitly banning access to that URL. If anything accesses that URL, sites ban that IP address from accessing the site at all. Here’s a write up describing an implementation of this tactic.

With the ubiquity of VPNs, proxy servers, and other ways for malicious actors to get a new IP address, this can become a game of whack-a-mole if your adversary is tenacious. Additional user fingerprinting tactics can help somewhat, but are not surefire.
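
The core of the honeypot tactic fits in a few lines. This is a rough sketch of the logic only, not a real web server: the HoneypotGuard class, the trap path, and the IP addresses are all hypothetical, and a real deployment would live in your web framework or firewall rather than in an in-memory set.

```python
# Hypothetical trap URL. The site's robots.txt would list it under Disallow,
# so well-behaved bots never request it.
TRAP_PATH = "/do-not-crawl/"


class HoneypotGuard:
    """Bans any IP address that requests the trap URL."""

    def __init__(self, trap_path=TRAP_PATH):
        self.trap_path = trap_path
        self.banned_ips = set()

    def handle(self, ip, path):
        """Return an HTTP-style status code for a request from `ip` to `path`."""
        if ip in self.banned_ips:
            return 403  # already caught once, stays blocked
        if path.startswith(self.trap_path):
            self.banned_ips.add(ip)  # ignored robots.txt, so ban it
            return 403
        return 200


guard = HoneypotGuard()
print(guard.handle("203.0.113.7", "/news/today"))        # normal request
print(guard.handle("203.0.113.7", "/do-not-crawl/bait"))  # hits the trap
print(guard.handle("203.0.113.7", "/news/today"))        # now blocked everywhere
```

Because the ban is keyed on the IP address, a scraper that comes back from a new address starts with a clean slate, which is exactly the whack-a-mole problem described above.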

Authentication and reCAPTCHA

I’m sure most of you have done a reCAPTCHA, either by clicking a checkmark, finding the crosswalks in an image, or some similar task. This is a technology built by Google specifically to prevent bots from accessing specific content by hiding that content behind the task.

This process introduces a minor headache for human users, but a major hurdle for bots. Some bots can likely perform some of the reCAPTCHA tasks some of the time, but it makes the bot operator’s job much harder.

Classic authentication is even better for two reasons. First, signing up and gaining authentication credentials is often a multi-step process that involves an email, a text message, or some other second factor authentication. This process further complicates the automation process. Second, if someone who is authenticated starts behaving like a bot, banning that user is more effective than a simple IP ban, because they’ll have to repeat the signup flow.

While there is no 100% surefire way to prevent people from scraping your online content, a robots.txt, honeypots, authentication, and constant vigilance can substantially reduce successful scraping.

The News Quiz

Every week we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.

A spider crawling on a newspaper.

Web crawling and scraping can be messy. The fact that OpenAI is doing it suggests (at least to me) they can’t get enough high quality training data from more traditional sources (such as “The Pile” and other open NLP datasets). Classify the following issues that can arise specifically from scraping the text of news articles as high risk, medium risk, or low risk from the perspective of a company training a large language model (LLM):

Accidentally capturing pieces of text that aren’t part of the article, such as advertisements, image captions, pull quotes, embedded links to other articles, etc.
Pulling in native advertising content.
Pulling in articles that were themselves written by an LLM.
Consuming articles with factual errors before they’ve been corrected.
Incorporating copyright protected content into your training data.
Increased cost incurred from the actual process of scraping the data.

Themes in the News

Censorship, Moderation, and Section 230

Lots of recent buzz around these perennial topics.

A judge dismissed a lawsuit brought by the Republican National Committee (RNC) against Google. The lawsuit alleged that Google’s spam filter was biased against Republican candidates’ and officials’ emails. U.S. District Court Judge Daniel Calabretta concluded that the RNC had not “sufficiently pled that Google acted in bad faith.” The RNC plans to refile the lawsuit with an amended complaint.

Meanwhile, the Biden administration urged the Supreme Court to take on cases related to laws in Texas and Florida that substantially limit social media companies’ ability to perform moderation. The Florida law, for example, imposes fines on social media platforms if they “refuse to transmit” a politician’s post, regardless of whether that post violates the company’s content policy.

X (formerly Twitter), Meta, and YouTube have all indicated they will decrease their moderation efforts regarding misinformation. In part this is probably because platforms have realized it’s really hard to do this well. Many of them fumbled the Hunter Biden laptop story, suppressing the NYPost’s original article that turned out to be totally real. The “Twitter files” should also cast doubt on platforms’ ability and willingness to be neutral and trustworthy.

At the same time, these platforms have an ongoing and increasingly adversarial relationship with news organizations, including various efforts to deprioritize news content on their sites. Many people have come to rely on social media as a news aggregator. Between reduced moderation and deprioritization of legitimate news content, there’s a major void that is being filled with trolls, trashy AI generated content, and other kinds of misinformation.

A report from Rest of World this week highlighted a fairly predictable outcome of backing off moderation policies as X has done under Elon Musk’s leadership: scams become more prevalent. In this case, sextortion scams targeting prominent Chinese figures on the platform.

So here we are, stuck between platforms that can’t be trusted to moderate particularly well and the free-for-all of the internet that can’t be trusted at all.

Is it good news that OpenAI thinks ChatGPT will soon be able to moderate on social sites?

Teb’s Tidbits

Ben Evans wrote a thoughtful piece on the state of Generative AI and copyright.
- I like the piece, but he gets something important wrong when he claims generative models never store the training data. Researchers have been able to prompt LLMs to produce exact replicas of lengthy passages from books, and Getty Images is currently suing Stability AI for their model that produced exact replicas of copy protected images. This implies that these systems do store training data, albeit in an obfuscated and compressed way within their internal parameters. This fact is important in the copyright discussion, since consumption by an ML system is not always “transformative” in the legal sense.
An argument that the UK’s recently unveiled plans to change their surveillance rules would violate international law.
Meta released Code Llama, an LLM designed specifically to write computer programs.
A fantastic and grounded risk assessment of self driving cars in IEEE Spectrum.
And a pleasant surprise: Apple has come out in favor of California’s Right to Repair Act.

Answers To The News Quiz

Accidentally capturing pieces of text that aren’t part of the article, such as advertisements, image captions, pull quotes, embedded links to other articles, etc.
- High Risk: if the training data ends up with lots of incongruent articles with random advertising tidbits interspersed with the real text, this could definitely degrade model performance. Cleaning up the scraped text to remove these errors would be worth the hard work.
Pulling in native advertising content.
- Low Risk: Native advertising is still a perfectly legitimate use of language even if it’s not particularly high-brow.
Pulling in articles that were themselves written by an LLM.
- High Risk: It’s well established that training LLMs on text created by LLMs degrades model performance. Making matters worse, current methods for discerning whether a particular article is AI generated are not reliable at all.
Consuming articles with factual errors before they’ve been corrected.
- Low Risk: LLMs ‘hallucinate’ all the time, even if they’ve been trained exclusively on factually accurate data. LLMs learn to recreate patterns in writing mostly related to word order, not specific facts from pieces of writing.
Incorporating copyright protected content into your training data.
- Medium to High Risk: Depending on the outcome of some pending lawsuits this might end up being a very big risk or something that can be easily addressed with a license agreement, royalty scheme, or something similar.
Increased cost incurred from the actual process of scraping the data.
- Low Risk: scraping is generally cheap and easy to perform.

How Does ChatGPT "Understand" Words?

Tyler Bettilyon — Fri, 25 Aug 2023 09:00:00 +0000

The Weekly Lab Report

From The Lab

I had classes from 9:00-5:00pm every day this week. Additionally, I have a side gig as a high school debate coach and school started this week in my district. So, today’s edition is significantly shorter than usual.

Today’s Lesson

How Do AIs "Understand" Words?

Last week we wrote about “The Bitter Lesson,” in which Richard Sutton documents the superiority of “general purpose methods” that scale with computation. A practical consequence of this lesson is that modern Natural Language Processing (NLP) is dominated by statistical models called neural networks. Specifically, a type of neural network called a transformer is the model du jour.

These models are fundamentally numerical in nature. Each neural network is literally just a math function.

In fairness to the models, they are immensely complex math functions. OpenAI’s Large Language Model (LLM) GPT-3 has 175 billion parameters and GPT-4 is rumored to have ~100 trillion. A “parameter” in this context is a number that the model repeatedly changes during the training process. To the extent that a neural network “knows” anything, that knowledge is encoded in these numbers. The parameters are spread across a complex web of mathematical sub-components. But, at the end of the day, these models are still just fancy math functions.

A representation of the Transformer architecture from the paper that first introduced them: https://arxiv.org/pdf/1706.03762.pdf

Words — and natural language in general — are absolutely not numerical or mathematical in nature. Natural language evolved … naturally. Most natural languages have rules, but there are almost always exceptions. Humans, the inventors of natural language, do not internally contextualize words in mathematical terms. But this mismatch hasn’t prevented remarkable progress in the form of ChatGPT, Bard, LLaMA, and other LLMs. So, how do these models reconcile this fundamental mismatch?

The answer is something called an embedding.

Embeddings

Embeddings are a general purpose tactic used to represent some kind of data — words, movies, music — as a vector (i.e. as several numbers). Embeddings can be manually created or be “learned” as part of the training process. Either way, each number in the vector represents some aspect of the thing being embedded. For example, here are three small embeddings manually created by me for encoding TV shows or movies:

A sample of embeddings representing two TV shows and a movie.

When Pandora first launched they hired several Ph.D. music theorists to manually create high-quality embeddings for the songs in their catalog. These embeddings included fields like acoustic sonority, minor key tonality, vocal centric aesthetic, and other features of songs that the music theorists thought were strongly connected to people’s taste in music. These embeddings were a core part of Pandora’s recommendation engine.

When we embed words, each number corresponds to some aspect of the word. Sometimes these are grammatical in nature such as plurality and being a proper noun. Sometimes they are semantic in nature such as having gender implications, association with various emotions, or association with abstract concepts like royalty, nature, or courage.

For example, the word “queens” is plural, not a proper noun, implies female gender, has a strong relationship with royalty and a weak relationship with nature.
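
Here is a toy version of that idea in Python. Everything in it is an assumption made for illustration: the five feature names, the hand-picked numbers, and the three words are invented by hand, whereas real models learn thousands of unnamed features. The sketch also uses cosine similarity, a common way to compare two vectors, to show that embeddings let a program measure how related two words are.

```python
import math

# Hypothetical, hand-made features. Positive "feminine" implies female,
# negative implies male, and zero means no gender implication.
FEATURES = ["plural", "proper_noun", "feminine", "royalty", "nature"]

EMBEDDINGS = {
    "queens": [1.0, 0.0, 0.9, 0.9, 0.1],
    "kings": [1.0, 0.0, -0.9, 0.9, 0.1],
    "forest": [0.0, 0.0, 0.0, 0.0, 1.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for word in ("kings", "forest"):
    score = cosine_similarity(EMBEDDINGS["queens"], EMBEDDINGS[word])
    print(f"queens vs {word}: {score:.2f}")
```

With these made-up numbers, “queens” scores closer to “kings” than to “forest” because the two share plurality and royalty, even though they disagree on gender.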

Depending on the size of the embedding, more or less meaning can be encoded. GPT-3 uses an embedding length of 12,288 numbers per word. Technically, GPT-3 uses “tokens” rather than “words,” which breaks some words into parts and allows the system to encode tokens for punctuation marks.

Additionally, for a variety of reasons, GPT-3 (and most LLMs) use embeddings that are learned as part of the training process, rather than manually crafted embeddings like the ones Pandora pioneered. This makes interpreting GPT-3’s embeddings quite difficult, and full of guesswork — but it also allows different neural networks to build embeddings that help with the specific task at hand, which often improves performance.

Shakespeare’s Juliet philosophically asked “what’s in a word?”

Well, at least to ChatGPT the answer is clear: A rose represented by the same 12,288 floating point numbers surely smells just as sweet.

Themes in the News

More on Moore’s Law

A few pieces of news related to last week’s lesson.

Driven largely by ML (although cryptocurrency is also a major contributor) energy use for computation has skyrocketed in recent years. Our chips’ ever increasing capabilities aren’t free. As of right now the computation industry uses about as much energy as all of Britain.

Speaking of Britain, Prime Minister Rishi Sunak wants to buy £100 million worth of NVIDIA chips to further the UK’s position in the global AI race. NVIDIA has become an industry leader in AI not just by building top-end hardware but also by building excellent support to help software engineers get the most out of NVIDIA’s chips.

Some folks are focusing on ways to reduce or mitigate the energy cost of computation, for example this company wants you to heat your water using a powerful server as part of a distributed compute cluster.

Teb’s Tidbits

A district court upheld the U.S. Copyright Office’s decision that AI generated content cannot be copyrighted. Though SCOTUS may ultimately have to weigh in, this is a big deal for the world of generative AI.
- Meanwhile, several authors are suing Meta for pirating their books.
A vulnerability in a Chinese keyboard app may have exposed an enormous amount of private data.
Microsoft has been savaged by commentators over high severity vulnerabilities and Microsoft’s inadequate response to breaches based on those vulnerabilities.
Maybe we should all be hoping for AIs that are a bit more mundane.
San Francisco recently allowed Cruise to massively expand their robotaxi offering. This week, following two crashes (one with a firetruck) the city has forced Cruise to cut their cars on the road in half.

The Bitter Lesson, Moore’s Law, and a Trade War With China

Tyler Bettilyon — Fri, 18 Aug 2023 09:00:00 +0000

The Weekly Lab Report

From The Lab

Today I’ll be finishing up the public-facing outline for my upcoming Python class with DevSprout. Several of you indicated interest, so I’ll share that next week.

This week I taught an Intermediate SQL course that was a ton of fun, but also took most of my time and energy. I’m also scheduled in the classroom for 32 hours next week. As a result this week and next week’s newsletters will both be shortened.

Relatedly, we’re hoping to hire a part-time writer to help avoid this situation in the future. If you or someone you know wants to help me write this newsletter, drop me a line at teb@tebs-lab.com.

Today’s Lesson

The Bitter Lesson, Moore’s Law, and a Trade War With China

An underappreciated fact about the machine learning revolution is that it’s fundamentally about computer hardware. The family of models taking the world by storm, neural networks, was first described in a paper in 1944. At the time, neural networks failed to do anything interesting: they were just way too slow.

Nevertheless, since about 2014 it’s been popular to describe the progress in AI as an “exponential trend” that may soon result in hyper-intelligent AIs going full Skynet. And it’s true: progress in AI has been exponential. But it’s not the algorithms that are getting exponentially better, it’s the hardware they run on. Specifically, the number of transistors we can fit on a given area of an integrated circuit has been growing exponentially since the 1960s.

The exponential growth of transistor count since 1971. Source: https://bjc.edc.org/bjc-r/cur/programming/6-computers/2-history-impact/2-moore.html

This observation, first made by Gordon Moore in 1965, became a guiding principle of computer chip development: The number of transistors that we could fit on an integrated circuit would double roughly every two years (a figure often quoted as 18 months). This phenomenon is called Moore’s Law, and the computer chip manufacturing industry has kept it alive for nearly six decades through incredible feats of engineering such as Extreme Ultraviolet Photolithography (EUV).
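
To get a feel for what steady doubling implies, here is a minimal sketch of the compounding arithmetic. The starting point (roughly 2,300 transistors on a chip in 1971) is a commonly cited figure for Intel's 4004; the function name and the two doubling periods are illustrative assumptions, not a model of any real product roadmap.

```python
def projected_transistors(start_count, start_year, end_year, doubling_period_years):
    """Project a transistor count forward, assuming it doubles at a steady pace."""
    doublings = (end_year - start_year) / doubling_period_years
    return start_count * 2 ** doublings


# Assumed starting point: roughly 2,300 transistors in 1971.
for period in (1.5, 2.0):
    estimate = projected_transistors(2_300, 1971, 2021, period)
    print(f"doubling every {period} years -> about {estimate:.2e} transistors in 2021")
```

With a two-year period, fifty years is 25 doublings, which works out to about 77 billion transistors. The 18-month version lands hundreds of times higher, which shows how sensitive an exponential is to its doubling period.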

In 2019, long time AI researcher Richard Sutton wrote a short paper called “The Bitter Lesson.” Here’s the opening of that paper:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation.
Richard Sutton (emphasis mine)

Once upon a time AI researchers approached their work with the assumption that clever algorithms could effectively encode something like human level “understanding” or “intelligence” in a given domain. But Sutton demonstrates in his paper that time and time again, in every domain, “brute force” methods have ultimately come out on top.

This has been true for as long as Moore’s Law has held up, but the chip industry is hurtling towards a few physical limitations that will eventually bring the 60-year “law” to an end, and perhaps soon.

The End of Moore’s Law?

Moore himself once commented, “These are fundamentals I don’t see how we [will] ever get around.”

The fundamentals he’s referring to are: heat, size, and the speed of light.

Heat

Until the early 2000s CPU speeds doubled along with transistor count. Smaller transistors meant electricity was traveling a smaller distance during a single cycle of computation, which allowed computer engineers to increase the clock speed. However, increasing a CPU’s clock speed also makes it hotter. In about 2005 high-end CPUs started running hot enough to melt critical components of the CPU.

As a result, chip manufacturers have all but stopped trying to increase clock speeds (called frequency in the chart below) and started adding more cores instead. Since then high performance computing has increasingly been all about parallelism.

Fun software fact: The main reason Transformers have replaced Recurrent Neural Networks (RNNs) as the state of the art in natural language processing is that Transformers are optimized for parallel computing and RNNs are not, which means Transformers effectively capture the value of these additional cores while RNNs cannot.

This graph shows how various features of high-end computer hardware have changed over time. Source: https://www.researchgate.net/figure/48-Years-of-Microprocessor-Trend-Data-2_fig1_358572677

Size

The current state of the art transistor size is 2 nanometers, roughly the width of 10 carbon atoms. 2 nm transistors have not been commercialized yet, but the Taiwan Semiconductor Manufacturing Company (TSMC) says they’ll open the first fab producing 2 nm chips by 2025.

2 nm is an incredible feat, but we may not be able to go much further. Assuming the size of a transistor keeps halving, we only have 3 generations before transistors are the size of a single carbon atom. A single-atom transistor actually HAS been created, way back in 2012; it’s just not suitable for making computers, because it has to be held at negative 196° C to function. Another was invented in 2020, which has its own commercialization issues.

Any theoretical subatomic transistor would be a member of the quantum realm. I won’t pretend to be an expert in quantum physics, but the gist is that subatomic-sized transistors will be subject to Heisenberg’s uncertainty principle. If that happens we won’t be able to reliably measure the state of the transistor, rendering it useless for computation. If transistors continue to shrink according to Moore’s Law then we’ll hit this quantum limit by 2036.
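
The arithmetic behind estimates like these is easy to sketch. The snippet below counts how many shrink steps fit before a feature that starts about 10 atoms wide would drop below one atom. The 10-atom starting width comes from the text above; the function name and the second shrink factor (1/√2 per step, which halves a transistor's area rather than its width) are assumptions added for comparison. The sketch does not attempt to reproduce the 2036 figure.

```python
def shrinks_before_subatomic(width_atoms, shrink_factor):
    """Count the shrink steps that keep the width at or above one atom."""
    if not 0 < shrink_factor < 1:
        raise ValueError("shrink_factor must be between 0 and 1")
    steps = 0
    while width_atoms * shrink_factor >= 1:
        width_atoms *= shrink_factor
        steps += 1
    return steps


# Halving the width each generation: 10 -> 5 -> 2.5 -> 1.25 atoms.
print(shrinks_before_subatomic(10, 0.5))

# Halving the area instead, so the width shrinks by 1/sqrt(2) each generation.
print(shrinks_before_subatomic(10, 0.5 ** 0.5))
```

Halving the width leaves room for 3 generations, matching the estimate above, while halving the area leaves room for 6. Either way the runway is short.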

The Speed of Light

The above factors might not matter if something could travel infinitely fast. Instead of making computers faster by making components smaller, we could just make the information travel faster. We’ll probably get a bit more computational power by switching to light-based computers, which transmit information faster than electricity-based computers. But if the speed of light is indeed a hard limit in this universe, then CPU speeds will be subject to it.

What Does This Have to Do With China?

Basically, the Biden Administration has learned The Bitter Lesson and is applying it in their increasingly adversarial approach to China. Biden views AI as a major national security issue, and his administration is trying to limit China’s ability to advance in that domain. The result is an ongoing trade war centered on computer chip production capabilities.

This particular battle began in earnest in October 2022 when the US Bureau of Industry and Security issued a 139-page document which, among other things, expanded export controls. The controls were specifically targeted to limit China’s ability to acquire or build the high-end computer hardware used to train ML systems. Things like EUV Lithography technology and high-end chips made with EUV are among the controlled technologies.

China has retaliated, for example by creating their own export controls on gallium and germanium, which are important raw materials for making computer chips, fiber optics, and related technologies. China also ramped up its production of “legacy” chips, which are less powerful and fabricated using different processes. They are still very important in computer hardware writ large, but not for state of the art AI research.

Another layer to this story is that the industry leader, the Taiwan Semiconductor Manufacturing Company (TSMC), is located in Taiwan. Tensions between China and Taiwan are high, and Biden has previously indicated that his administration would defend Taiwan from a Chinese invasion if it ever came to that. The TSMC fabs are such an important strategic asset that some war-gamers have suggested that the U.S. would destroy them before allowing China to take control.

We had classes Monday-Thursday. We’re cutting the News Quiz this week because we ran out of time.

Themes in the News

Commodification of Generative Models is Increasing Malicious Uses of AI

From text to audio to images to video, ML models are increasingly able to generate impressive, useful, and realistic content. These models have recently become commodified via open source, pre-trained models, APIs, and other consumer interfaces which dramatically lower the barrier to entry for deploying AI systems.

While there are many useful, cool, and pro-social uses of this technology, we’re witnessing a significant rise in malicious or otherwise unsavory uses, too.

Versions of the GPT architecture trained specifically to create malware, execute phishing attacks, and defraud people have started popping up. In terms of generative cybersecurity there’s enough “there” there for DARPA to get involved: they announced a two-year “AI Cyber Challenge” with nearly $20 million in prize money, and an additional $7 million as startup cash for small businesses who want to participate.

NAO Medical—which runs a series of medical clinics in New York—was caught using large language models to write nonsense articles in order to boost its SEO and rank higher on search engines. It appears they’ve since taken these articles down, but a copy of the article titled “Derek Jeter Herpes Tree: Causes, Symptoms, and Treatment” is retained in the linked Time article.

In a creepy and distasteful (although perhaps not exactly malicious) example, some TikTok creators used generative models to recreate the likenesses of deceased or missing children and have those children tell their stories. Their works include a video where Anne Frank’s likeness first tries to sell you baby clothes then tells Anne Frank’s actual story.

These so-called “deepfakes” are popping up in an array of unsavory uses. They’re being used to imitate the likeness of people (mostly women) who wouldn’t otherwise appear in pornography, with major implications for their reputation and privacy.

Audio deepfakes have been used in multiple instances of bank fraud. For example, scammers will convince a bank teller over the phone that they are someone with authority by faking that person’s voice. Then they ask said teller to execute a wire transfer.

In general, generative AI tools are making it much easier and faster to create spammy content, accelerating the creation of all sorts of scams and click farming.

Teb’s Tidbits

Large tech firms are mostly driving AI regulation in the US (no conflict of interest there).
News is a huge part of Google’s revenue, which makes their stance on blocking news in Canada a bit suspicious.
The Kids Online Safety Act actually puts us all at risk.

Remember…

The Lab Report is free and doesn’t even advertise. Our curriculum is open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world class trainings or requesting a custom class for your team.

Machine Learning and the “-Isms”

Tyler Bettilyon — Fri, 11 Aug 2023 09:00:00 +0000

The Weekly Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: Our goal is to deepen your understanding of software and technology by explaining the concepts behind the news.
If you’re new to the Lab Report you can subscribe here.
If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

My friend and former colleague Ian—who now runs DevSprout—is building an introductory Python and Web Development class with me. We’re planning a 4 week course meeting 2 nights per week for 1.5-2 hours each session.

If there’s interest in the class I’ll share a draft of the course outline in next week’s newsletter. If you’re interested in a similar course on a different topic please reply to this email and let me know what you’d like to learn about.

Today’s Lesson

Machine Learning and the “-Isms”

Porcha Woodruff was arrested for robbery and carjacking on Feb. 16th, 2023 following a facial recognition match. Woodruff was eight months pregnant at the time. She was detained for 11 hours then released on a $100,000 personal bond. Her case has since been dismissed and she is now suing the Detroit Police Department (DPD) for damages caused by the false arrest.

This whole ordeal was based on a facial recognition match between gas station surveillance footage and a grainy mugshot of Woodruff from 2015 (Woodruff was arrested in 2015 for driving with an expired license).

The 2015 mugshot matched by the facial recognition system (left) and Porcha’s 2021 driver’s license photo (right), which was also available to the facial recognition system, but didn’t match. Via NYTimes: https://www.nytimes.com/2023/08/06/business/facial-recognition-false-arrest.html

According to The New York Times Woodruff is the 6th person—and first woman—to report being arrested due to a false positive from a facial recognition system. All six are black. Coincidence?

Machine learning systems have been plagued with issues related to social bias. Amazon built a sexist hiring AI. Several municipal governments have used racist risk assessment tools to inform bail and sentencing decisions. Microsoft infamously released a Twitter bot that became a Nazi propagandist within 24 hours.

The day before Porcha’s arrest PubMed published this analysis of medical ML systems which concluded, “With the exception of only a few cases, we found that the performance for the White group was, in general, significantly higher than that of the other racial groups across all ML algorithms.”

State of the art generative systems have also come under fire for reinforcing a wide variety of stereotypes. Bloomberg recently analyzed the image generator Stable Diffusion and found substantial racial and gender bias. They prompted the AI to depict people with various jobs and categorized the images based on skin tone and perceived gender. In their analysis “lawyers” were mostly depicted as white men and “housekeepers” were mostly women of color.

An analysis done by Bloomberg shows the skin tone composition of images generated by Stable Diffusion when prompted to show a person with a particular job. Source: https://www.bloomberg.com/graphics/2023-generative-ai-bias/

Many people associate algorithms with a kind of pure mathematical objectivity. But now, roughly two decades into the machine learning revolution, more experts are admitting that it’s probably impossible to create an unbiased ML model.

So, why?

It’s The Data, Mostly

Garbage in, garbage out.

ML models become biased primarily because social biases are deeply embedded in the datasets used to train them. In a sense the algorithms really are acting as tools of unbiased, objective mathematics: They precisely recreate the problematic patterns in the data.

Amazon used their historical hiring data to train their hiring AI; it’s easy to believe Amazon’s historical hiring practices were at least somewhat sexist. The American court and police systems produced the data that was used to train the risk assessment tool; those systems have a well-established history of racism. Tay was actually intentionally trained by Twitter trolls to spew Nazi propaganda.
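
A toy sketch makes the mechanism concrete. The data below is made up, the groups “A” and “B” are hypothetical stand-ins for any demographic attribute, and the “model” is deliberately simplistic: it just learns the most common historical outcome for each group. None of this reflects how any real hiring system was built.

```python
from collections import Counter, defaultdict

# Made-up historical decisions as (group, hired) pairs. Not real data.
history = (
    [("A", True)] * 80 + [("A", False)] * 20
    + [("B", True)] * 30 + [("B", False)] * 70
)


def fit_majority_model(records):
    """Learn the most common past outcome for each group."""
    outcomes = defaultdict(Counter)
    for group, hired in records:
        outcomes[group][hired] += 1
    return {group: counts.most_common(1)[0][0] for group, counts in outcomes.items()}


model = fit_majority_model(history)
print(model)  # {'A': True, 'B': False}
```

The math here is perfectly “objective”: it faithfully summarizes the records it was given. The skew in the output comes entirely from the skew in the history.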

The unfortunate truth is that we live in a society where racism, sexism, and other forms of social bias are deeply embedded in many aspects of our culture. Historically accurate data about our society inevitably contains artifacts of these biases. The same goes for the makeup and content of writing, art, pictures, and other artifacts produced by that society.

As a then-Harvard Ph.D candidate (now Ph.D holder) Alex Najibi described in 2020 with regard to facial recognition systems:

Several avenues are being pursued to address these inequities. Some target technical algorithmic performance. First, algorithms can train on diverse and representative datasets, as standard training databases are predominantly White and male. Inclusion within these datasets should require consent by each individual. Second, the data sources (photos) can be made more equitable. Default camera settings are often not optimized to capture darker skin tones, resulting in lower-quality database images of Black Americans. Establishing standards of image quality to run face recognition, and settings for photographing Black subjects, can reduce this effect.
Alex Najibi, emphasis mine.

Balancing the racial makeup of datasets used to train facial recognition systems would not be terribly difficult. Correcting the long history of camera technology that prioritizes light skin tones is another story.

It’s not possible to retroactively fix every image captured with sub-optimal film or sensor settings. It’s also generally not feasible (and certainly not profitable) for firms to manually collect the millions-to-billions of high-quality images needed to train a modern facial recognition system while ensuring proper lighting, color balance, and other settings are used for each one. Most firms just scrape publicly available data from the web.

Even if firms did collect pristine photos, law enforcement typically uses facial recognition to match against low-quality surveillance footage which likely has poor color balance settings and opens the systems up to the extrapolation problem.

Generative systems like Stable Diffusion are holding up an unflattering mirror to society. White men really are over-represented among doctors. Women of color really are over-represented among housekeepers. When Stable Diffusion trains on data that represents reality it recreates discrepancies like these.

Lots of people in online forums use racist language. When we train models like GPT-4 on this data of course it learns to parrot that language.

Making matters worse, modern ML systems require massive amounts of data. Collecting, curating, and cleaning such massive datasets is an enormous task. It’s easy for data cleaners and labelers (who are mostly poorly paid gig workers) to miss subtle forms of bias. If they did succeed at removing every hint of social bias the training sets would shrink dramatically, plausibly making them too small to train large models like OpenAI’s GPT-4 or Google’s Bard.

Engineering Teams, Executives, and End Users Share The Blame

ML practitioners aren’t generally experts with rich experience in all the nuances and subtlety of racism, sexism, homophobia, etc. Exhaustively testing models like ChatGPT for every possible problematic utterance is already a huge challenge, one that’s exacerbated by lack of expertise in the wide world of social bias.

Even when engineering teams do have expertise, their concerns about bias and ethics often play a secondary role to building and releasing a profitable product. Sometimes engineers are even punished for concerning themselves with ethics and bias: When Timnit Gebru and Margaret Mitchell authored a paper demonstrating that large language models frequently produce racist content they were ousted from their jobs at Google.

In some cases, such as Porcha Woodruff’s arrest, the end users lack crucial skepticism. The Detroit Police Department uncritically trusted their AI’s output, even in the face of mitigating evidence (such as Woodruff’s pregnancy). A judge also appears to have uncritically signed an arrest warrant primarily based on the AI’s output and little supporting evidence.

Finally, some actors are just malicious. With intentional prompting it’s always possible to use even a relatively neutral AI system to produce something that contains or represents social bias, like this:

I prompted an AI to produce a picture of: “A white man gobbling mayonnaise from the jar using a spoon.”

I hope our white readers can laugh at this stereotype. I especially appreciated that the man appears to be sunburned. But similar images that might be perfectly innocuous in some settings might also carry more insidious racist or sexist undertones.

Should AI image generators specifically refuse to create images of black people eating fried chicken and watermelon? Plenty of black people in the real world do eat these foods, but the stereotype that black people prefer them has an ugly history and seems to reappear every Black History Month.

There’s nothing inherently wrong with an image of a black person eating fried chicken, but it’s pretty easy to use that imagery in a way that reinforces nasty stereotypes. Similarly, if you prompted an AI for an image of “A black man eating” and it produced a man with a bucket of fried chicken that would be cause for concern.

Fortunately, the system I used didn’t produce something obviously racist… but it was still a monstrosity. (Seriously, WTF is going on with his mouth and fingers? And are those supposed to be noodles?)

I prompted an AI to create an image of “A black man eating.”

It’s probably impossible to create an ML system that eliminates all social biases while also being generally useful. Which brings us to today’s News Quiz.

The News Quiz

Every week we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.

Image from the recent research paper: https://aclanthology.org/2023.acl-long.656.pdf

New research studied various text generators and classified them across two political dimensions using a tool called the “political compass” (pictured above).

The color of the circles indicates the model family. Yellow dots are from Google’s BERT family, orange dots are from Meta’s LLaMA family, and white dots are OpenAI’s GPT family. The X-axis is a measure of political economic alignment. The Y-axis is a measure of political social alignment.

The research is awesome and the paper is surprisingly approachable; I definitely suggest you read it.

Here’s a snippet from the paper:

Generally, BERT variants of LMs are more socially conservative (authoritarian) compared to GPT model variants.
Section 4.1, second bullet point: https://aclanthology.org/2023.acl-long.656.pdf

Which of the following is the most likely explanation for the difference in social conservatism?

The GPT family’s model architecture is inherently more libertarian.
The engineers at OpenAI are more libertarian than the engineers at Google.
The BERT family of models were trained on a more socially conservative dataset.

This research has implications for a popular “toxic speech classifier” called Jigsaw (and others). Here’s another snip from the research paper:

No language model can be entirely free from social biases.
Final paragraph of section 1

If the authors are right, Jigsaw’s classifications of what counts as “toxic” must be biased. Which of the following options could Jigsaw pursue to mitigate the bias?

Train multiple models with different datasets and biases and have them vote or otherwise combine their classifications (this strategy is called “ensembling”).
Gather and curate a dataset that perfectly balances all the relevant perspectives on what constitutes “toxic speech.”
Select a type of model that is inherently non-partisan.

Teb’s Tidbits

The main article and news quiz ran long this week, so we’re skipping the “Themes” section.

Meta has officially started blocking news content for Canadian readers following a law that would require Meta to pay news producers whose links appear on Meta’s services.
The new California Privacy Protection Agency is investigating cars and their data collection practices.
ML based image generators are disrupting a surprisingly big-dollar cottage industry: YouTube thumbnail creation.
ChatGPT can analyze images and caption them as a tool for the blind, but OpenAI worries about unintentionally creating a facial recognition tool and all the associated privacy and bias issues that entails.
It was bound to happen: Evil ML-powered chatbots are emerging to steal your login credentials and mire you in spam.
A new Russian law aims to crack down on online anonymity and further isolate Russia’s corner of the internet from the broader world.
The BBC is starting a 6-month experiment with their own Mastodon server.

Answers To The News Quiz

Which of the following is the most likely explanation for the difference in social conservatism?

The GPT family’s model architecture is inherently more libertarian.
- No research known to me suggests any particular model architecture is inherently biased in one way or another. But, here’s a really interesting quote from the paper:
The engineers at OpenAI are more liberal/libertarian than the engineers at Google.
- This might be true. If it is true these biases may have slipped into the verification, training, and testing processes at OpenAI and Google. But it’s probably of secondary importance to the training data. Here’s a quote from the paper:
The BERT family of models were trained on a more socially conservative dataset.
- This is the most likely cause. One more quote from the paper:

This research has implications for a popular “toxic speech classifier” called Jigsaw (and others). Here’s another snip from the research paper:

No language model can be entirely free from social biases.
Final paragraph of section 1

If the authors are right, Jigsaw’s classifications of what counts as “toxic” must be biased. Which of the following options could Jigsaw pursue to mitigate the bias?

Train multiple models with different datasets and biases and have them vote or otherwise combine their classifications (this strategy is called “ensembling”).
- This solution is recommended by the paper:
Gather and curate a dataset that perfectly balances all the relevant perspectives on what constitutes “toxic speech.”
- The paper suggests that a) this is impossible, and b) attempting it might help, but won’t ever fully eliminate social bias.
Select a type of model that is inherently non-partisan.
- Again, there is no research known to me that suggests any type of model is inherently partisan to a particular political persuasion.
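
The ensembling idea from the first option can be sketched as a simple majority vote. The three “classifiers” below are hypothetical keyword rules standing in for models trained on different datasets; they are not how Jigsaw or any real toxicity classifier works.

```python
from collections import Counter


def ensemble_vote(classifiers, text):
    """Return the label chosen by the largest number of classifiers."""
    votes = Counter(classify(text) for classify in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for models with different biases.
def strict(text):
    return "toxic" if "idiot" in text or "stupid" in text else "ok"


def lenient(text):
    return "toxic" if "idiot" in text and "stupid" in text else "ok"


def keyword(text):
    return "toxic" if "idiot" in text else "ok"


models = [strict, lenient, keyword]
print(ensemble_vote(models, "you idiot"))        # toxic (2 votes to 1)
print(ensemble_vote(models, "that was stupid"))  # ok (2 votes to 1)
```

Using an odd number of binary classifiers avoids ties. Voting does not remove bias; it only keeps any single model’s quirks from deciding the outcome alone.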

Remember…

The Lab Report is free and doesn’t even advertise. Our curriculum is open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world class trainings or requesting a custom class for your team.

ML Failure Part 3: Model Drift

Tyler Bettilyon — Fri, 04 Aug 2023 10:00:00 +0000

The Weekly Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: cut through the noise with our weekly rundown of software and technology news.
If you’re new to the Lab Report you can subscribe here. If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

Last week we ran Intro to SQL for one of our favorite clients and got some helpful feedback that led to updates in our open source curriculum. This week we’re running Intro to Python with a special addendum to run SQL queries and statements from Python code with a focus on analysis and ETL.

If you want to schedule a class for your team reply to this email or book a meeting with me automatically from my website.

Today’s Lesson

How ML Fails Part 3: Model Drift

It’s a car “drifting” …

Model drift occurs when a model that once worked stops working because something has changed. Model drift is common and can have a variety of underlying causes. A few examples should illustrate why:

1) Netflix released their streaming service in 2007. Ratings you made in 2007 are still in Netflix’s database, but your tastes have probably changed since then. Models they trained in 2007 based on your preferences at that time are simply out of touch with who you are now. To a lesser degree models trained with all your historical ratings likely contain some errors due to your changing tastes.

2) Flashback to 2019 and pretend you were modeling for a large in-person retail store. You had sales forecasts for 2020, but they were based on data generated in 2019. Surprise, COVID-19 changed everything and your model “drifted” significantly from the new reality on the ground.

3) Suppose you have a self-driving car system trained on brand new cars. Over time the tires fall out of alignment on some cars. Depending on the severity of the misalignment the system’s turning will be mildly to substantially wrong on those cars.

ML practitioners must regularly acquire up-to-date data, retrain their models, and possibly tune-up their physical systems (as in the car example) to account for such drift. In many situations it’s also wise to retire old data that may no longer be representative of the phenomena at hand.
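
One simple way to notice drift is to keep comparing a model’s recent error against the error it had when it was first validated. The sketch below assumes you log a per-period forecast error; the function name, the 1.5x tolerance, and the numbers are all made up for illustration.

```python
from statistics import mean


def detect_drift(baseline_errors, recent_errors, tolerance=1.5):
    """Flag drift when the recent average error exceeds the baseline by a set factor."""
    return mean(recent_errors) > tolerance * mean(baseline_errors)


# Made-up absolute forecast errors (say, in thousands of dollars per week).
errors_at_validation = [4, 5, 6, 5, 4]
errors_after_shock = [5, 9, 14, 18, 20]

if detect_drift(errors_at_validation, errors_after_shock):
    print("Drift suspected: collect fresh data and retrain.")
```

Real monitoring setups usually also watch the input data itself, since labels (and therefore errors) often arrive late. A check like this is just the cheapest first alarm.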

Other Types of Drift

The most common type of model drift occurs simply because the world is always changing. The above three are examples of this — something about the world changed and the model wasn’t updated to reflect those changes. But there are some trickier causes, too.

Device or Protocol Based Drift

In 2020 Google trained models to use medical imaging to detect diabetic retinopathy. In the lab Google’s model achieved over 90% accuracy. In their first attempt to deploy the model in a real hospital, though, the model struggled with differences between the scans done in the lab vs those in the hospital.

The nurses and techs had a different protocol for taking the scan; the machine that took the scans was slightly different; and the lighting conditions weren’t as good in the hospital. These differences caused the model to reject a large number of scans which in turn caused problems at the hospital for doctors, nurses, and patients.

Differences in microphones, cameras, or other instruments used to create input data for ML models can all create this kind of drift. Sometimes different versions of software and/or different configurations running on those instruments can also cause model drift.

Adversarial Drift

Some ML applications are adversarial in nature. Consider spam filters and fraud detection: when a new model is trained to account for today’s favored spam tactic the spammers are forced to change their strategy. These new tactics are developed specifically to avoid the old filters, so the filter needs to be updated.

This same kind of arms race is happening right now in the world of image generation. Copy protected data is being consumed by generative ML systems in enormous quantities. These data consumption practices are the source of a slew of copyright claims against ML firms. In addition to lawsuits, some artists are turning to adversarial software to protect their work.

Recently, the University of Chicago released a tool called “Glaze.” This tool allows artists to apply an adversarial “cloak” to images they upload to the internet. This “cloak” is a set of subtle perturbations to the original image that humans (generally) cannot see, but that have a substantial impact on the AI’s “understanding” of the image. Specifically the cloak is designed to prevent models from learning key features of the images’ style.

Another new tool called PhotoGuard uses similar adversarial perturbations to make your photos resistant to models that generate “Deep Fakes” (photo-realistic images depicting scenes that didn’t actually occur). Instead of confusing models at training time this tool confuses models at inference time, causing attempted deepfakery on a protected image to instead produce a grey blob or other obviously unrealistic artifacts in the background.

If these techniques become widespread AI models will first start failing because drift has occurred. AI firms will likely try to incorporate some kind of remediation. If the remediation works, then new techniques to confuse the models may have to be developed.

Glaze even addresses this in the “What is Glaze” article:

Unfortunately, Glaze is not a permanent solution against AI mimicry. Systems like Glaze face an inherent challenge of being future-proof (Radiya et al). It is always possible for techniques we use today to be overcome by a future algorithm, possibly rendering previously protected art vulnerable.
Glaze Authors

Low Drift Environments

Some models are less susceptible to drift than others. For example, ML modeling has been used to approximate the “three-body problem,” which is a physics problem about predicting how objects in space will move while taking into account their gravitational influence on each other.

Classical numerical methods for doing this are extremely computationally intense. Once trained, ML models can generate highly accurate approximations that are up to 100 million times faster to produce. This model is not going to suffer from drift because (as far as this writer knows) the laws of physics have not ever changed.

The recent strides ML has made in weather forecasting are similar: climate change is occurring but it’s a relatively slow process. Even then, the fundamental physical laws governing the weather aren’t changing, only the environment in which those laws are being applied (i.e. Earth’s atmosphere).

Problems like these are ideal for ML because they are highly objective, lend themselves easily to mathematical explanations, and aren’t likely to suffer from drift.

The News Quiz

Every week we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.

OpenAI shuttered an ML program designed to discriminate between AI-written text and human-written text. Many people are trying to create such tools but so far failure is the norm.

For each of the following bullet points, decide whether it’s an example of model drift or something else:

These systems flag text from non-native English speakers as “AI Generated” at much higher rates.
Making slight tweaks to AI generated text caused accuracy on one tool to plummet, as described here.
This tool (code) released once upon a time by OpenAI can usually detect text produced by GPT-2, but not GPT-3.5, GPT-4, or ChatGPT.

Themes in the News

Alternatives to Detecting AI Generated Materials

As we highlighted in the News Quiz detecting AI written text is difficult and unreliable. Now researchers are exploring alternative methods for distinguishing between human and AI created content:

Provenance based methods seek to establish a digital (and cryptographically verifiable) trail that can prove the origin of an image, audio file, or video. The Coalition for Content Provenance and Authenticity (C2PA) publishes one such protocol that’s gaining traction.
Watermarking methods cause AI-produced content to carry a signature marking it as such. Several major AI firms have signed onto an agreement to add watermarks to their AI generated images. And here’s a proposal to give AI text generation systems a unique subset of Unicode as a form of watermarking for text.
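Here’s a minimal sketch of what “cryptographically verifiable” can mean for provenance. This is not the C2PA protocol; as far as I know, real provenance systems generally rely on public-key signatures rather than a shared secret. The key, function names, and content below are all hypothetical, and the example uses an HMAC from Python’s standard library only to show the core idea: a tag that stops matching if the content changes at all.

```python
import hashlib
import hmac

# Hypothetical secret held by the camera or publishing tool.
SIGNING_KEY = b"example-key-not-for-real-use"

def sign(content: bytes) -> str:
    """Produce a tag that binds the signer's key to these exact bytes."""
    return hmac.new(SIGNING_KEY, content, hashlib.sha256).hexdigest()

def verify(content: bytes, tag: str) -> bool:
    """Check that the content still matches the tag issued at creation."""
    return hmac.compare_digest(sign(content), tag)

original = b"pixels of the original photo"
tag = sign(original)

print(verify(original, tag))                 # unmodified content
print(verify(original + b" (edited)", tag))  # any change breaks the match
```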

Data Labeling and Generation is Big Business

ML models are notoriously data hungry. So much so that some researchers think we’ll run out of high-quality data to train them before 2027. Complicating matters, new research has confirmed that training ML models using data generated by ML models almost always degrades model performance.

There’s a huge market for high quality data, and a lot of it is created by poorly paid laborers. Some of it is generated by you: every time you “prove you’re not a robot” with one of Google’s (Re)Captcha tools you’re helping to label training data.

It’s not all exploitation though. Read this lovely story about a startup in India trying to pay fair wages and royalties for data labelers.

Teb’s Tidbits

Content moderation should abandon their “legalistic” structures and embrace the “public editor” role.
Google’s machine “unlearning” challenge seeks to address some privacy issues in ML models by making them forget some aspects of their training data retroactively.
Deepfakes in the advertising industry exemplify why the SAG-AFTRA has made AI a core aspect of their ongoing strike.

Answers To The News Quiz

For each of the following bullet points, decide whether it’s an example of model drift or something else:

These systems flag text from non-native English speakers as “AI Generated” at much higher rates.
- Something else. Model drift in this context would be a model that used to accurately distinguish between ESL writers and AI writers, but over time one of those groups’ writing changed in a way that caused the model to fail. More likely this is plain old underfitting.
Making slight tweaks to AI generated text caused accuracy on one tool to plummet.
- Model drift. This is a great example of adversarial drift. The researchers took AI generated output, made an intentional perturbation, and then the model failed on the data which had “drifted” from the original distribution of AI generated text.
This tool (code) released once upon a time by OpenAI can usually detect text produced by GPT-2, but not GPT-3.5, GPT-4, or ChatGPT.
- Model drift. This time due to the advances in text generation. AI systems have gotten much better in the last 5 years, so old tools that detected old AI generated text have stopped working.

Remember…

The Lab Report is free and doesn’t even advertise. Our curriculum is open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world class trainings or requesting a custom class for your team.

ML Failure Part Two: Extrapolation vs Interpolation

Tyler Bettilyon — Fri, 28 Jul 2023 10:00:00 +0000

The Weekly Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: cut through the noise with our weekly rundown of software and technology news.
If you’re new to the Lab Report you can subscribe here. If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

This week we ran the first half of a custom class for a prominent coffee purveyor. The class pairs intro to SQL and intro to Python as part of a longer course at the company regarding data analysis and visualization. Browse the open source materials or get in touch to schedule a custom class of your own.

Today’s Lesson: How ML Fails Part 2

All the code from today’s lesson can be viewed on Github and Google Colab.

Extrapolation vs Interpolation

In statistics, interpolation and extrapolation are both types of estimation based on data. Interpolation is an estimate made within the bounds of the existing data; extrapolation is an estimate made outside those bounds. Although they are sometimes presented as a binary, predictions and estimations often exist on a spectrum between the two.

Imagine we’re the owner of an ice cream shop. We want to make a sales estimate for tomorrow. Tomorrow hasn’t happened, so in that sense this is fundamentally an extrapolation problem: tomorrow’s sales could not possibly be in our existing data set. However, we know it’s going to be a hot Saturday in the middle of summer. We have sales records from similar days in the past, so in that sense it’s an interpolation problem.

In general, and for ML specifically, performing extrapolation is much harder. Forcing a model to make predictions outside the bounds of its training data is a common cause of failure.

An Abstract Example: Predicting Sine

Consider the humble sine curve. Sine is a perfectly repeating pattern, there is no noise, and its period is consistent. It’s just a boring, repetitious curve.

The sine function from -5π to 5π

Many ML models can easily learn the features of sine within the bounds of the training data. For example, here’s a fairly simple neural network’s approximation of sine from -5π to 5π. Our model’s output in red, the training data in blue:

A neural network easily learns sine…

Clearly, our model can do interpolation quite well. But what happens when we ask this model to make predictions outside the bounds of the original data?

The same model’s predictions for sine outside the bounds of the training data.

Unfortunately, our model has failed abysmally at extrapolation. It learned sine almost perfectly from -5π to 5π but learned a near-linear function below -5π and above 5π.
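You can reproduce the same qualitative failure without a neural network. The sketch below is an assumption-laden stand-in, not the model from the charts above: it swaps in a 1-nearest-neighbor regressor, which “predicts” by returning the label of the closest training point. Like the neural network, it does well inside the training range and falls apart outside it.

```python
import bisect
import math

# Training data: 1001 evenly spaced samples of sine on [-5*pi, 5*pi].
LOW, HIGH = -5 * math.pi, 5 * math.pi
train_x = [LOW + (HIGH - LOW) * i / 1000 for i in range(1001)]
train_y = [math.sin(x) for x in train_x]

def predict(x):
    """1-nearest-neighbor regression: return the label of the closest training point."""
    i = bisect.bisect_left(train_x, x)
    if i == 0:
        return train_y[0]
    if i == len(train_x):
        return train_y[-1]
    # Pick whichever neighbor (left or right) is closer to x.
    left, right = train_x[i - 1], train_x[i]
    return train_y[i - 1] if x - left <= right - x else train_y[i]

def max_error(points):
    """Largest absolute gap between the prediction and the true sine value."""
    return max(abs(predict(x) - math.sin(x)) for x in points)

# Interpolation: points inside the training range, between training samples.
inside = [LOW + (HIGH - LOW) * (i + 0.5) / 200 for i in range(200)]
# Extrapolation: points beyond the right edge of the training range.
outside = [HIGH + math.pi * i / 20 for i in range(1, 41)]

interpolation_error = max_error(inside)
extrapolation_error = max_error(outside)
print(f"max error inside the training range:  {interpolation_error:.3f}")
print(f"max error outside the training range: {extrapolation_error:.3f}")
```

Outside the training range this model just repeats the last value it saw, so its error grows to roughly the full amplitude of the sine wave. Different model families fail differently when extrapolating (a flat line here, a near-linear function for the network above), but they fail for the same reason: nothing in the training data constrains what happens out there.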

And it gets worse. This is what happens when we train the same model on sine with a few holes in the sample:

The model also fails to properly estimate within the missing areas from the training data.

The hole on the right isn’t terrible, but the hole on the left is basically a horizontal line covering what should be a full period of the sine curve.

In the real world these sorts of issues arise all the time.

Real World Examples

ChatGPT and Esoteric Poetry

Consider the “domain” of a model like ChatGPT: you need a representative sample of “human written text” within every subdomain where users might prompt the model.

While the dataset for ChatGPT is enormous, it still doesn’t contain many samples of every esoteric topic under the sun. In December Ian Bogost, a technology writer at The Atlantic, challenged ChatGPT to write in esoteric styles of poetry (among other things). Here’s what he found:

I also urged the AI to generate a lai (a medieval narrative poem) in the style of Marie de France about the beloved, Texas-based fast-food chain Whataburger. It sure as heck looked a lot like a lai, beginning like this:
In the land of Whataburger, where burgers and fries are king,
A stranger came to visit, and with him brought a ring.
He wore a cloak of velvet, and a hat upon his head,
And he smiled with a twinkle in his eye as he approached the counter, where the burgers are served hot and fresh, and the fries are crispy and golden.
Except for that last line, which is a metrical mess. The AI […] was able to tell me about the typical structure of a lai—a short, narrative poem of the 12th century with a particular structure and form, usually written in octosyllabic couplets. The lyrical errors in the generated lai arise from its failure to adhere to that metrical form.
Ian Bogost in The Atlantic

Although the model was trained on enough informational content to correctly define a lai, it wasn’t trained on enough lais to correctly produce one. This demonstrates another weakness of such models: they do not really “understand” the things they produce the way humans usually think of understanding. While the model correctly defined a lai, its “knowledge” of their structure did not translate to producing a poem that actually has that structure.

Fatal Self Driving Crash

In 2018 a self-driving SUV operated by Uber struck and killed Elaine Herzberg. An investigation found that in the moments before the crash the system was struggling to properly classify Herzberg as a pedestrian that needed to be avoided.

Herzberg was doing something that the AI had never encountered during training: Jaywalking with her bike. The system had seen pedestrians in and out of crosswalks. It had seen cyclists on the roadway AND pedestrians walking with a bike in a crosswalk. But never a pedestrian walking with a bike outside a crosswalk.

As a result the system oscillated between classifying her as a pedestrian, bicycle, and vehicle which in turn caused the system to incorrectly predict her path, which ultimately led to the crash.

This problem is a big one for self-driving cars because just about anything can happen on the roadways. It’s also why the most successful self-driving firms have started in places like Arizona, where the streets are wide and weather conditions are fairly constant and predictable. Those factors decrease the size of “representative driving conditions.”

Big problems, Big Data

The extrapolation problem is another reason large ML models need so much data. If we expect our ML model to succeed, we need truly representative data sets that include the entire domain of whatever problem it is we’re hoping to solve.

Ask yourself: what is “truly representative” of the problem of driving, or “writing at a human level of proficiency” and you begin to understand why some researchers think we’ll run out of high quality data before 2027.

The News Quiz

An AI with zero conception of the rules of Go generated this board, which would be quite absurd if it occurred in a real game.

In 2016 AlphaGo defeated the Go world champion Lee Sedol 4-1 in a 5-game show match. It was a significant achievement in ML because the game of Go is extremely complex. Fun fact: The number of legal Go board states is larger than the number of atoms in the universe.

Since that achievement other Go-playing AIs have extended and refined AlphaGo’s core concepts. In particular one called KataGo has become a standard bearer. Earlier this year a fairly high ranked amateur Go player named Kellin Pelrine defeated KataGo 14 games to 1.

Read these two articles about the “adversarial strategy”:

This article about Pelrine’s strategy.
This article about an AI trained specifically to beat KataGo, but that often loses to novice human Go players.

Then, answer these questions:

Pelrine described his encircling strategy this way: "As a human, it would be quite easy to spot." Why, then, didn’t the top ranked Go AIs spot it?
In terms of “extrapolation” and “interpolation” what is this adversarial AI missing that allows novice human players to defeat it, even while it defeats a top ranked AI that in turn frequently defeats top ranked humans?

Themes in the News

No one really knows the future of AI

OpenAI CEO Sam Altman has been making the rounds, charming legislators, and musing about whether his company’s creations have a 0.5% or a 50% chance of destroying humanity.

Are large language models just “stochastic parrots” as computational linguist Emily Bender et al have argued? Or is Altman right that general intelligence might be one of the, “emergent properties from doing simple things on a massive scale.”

If ML systems run out of high quality training data, as some ML researchers have speculated will most likely happen before 2027, will they already have consumed enough data to become super-intelligent and continue learning autonomously?

Is Anthropic AI right that we need to build exactly the types of systems that might destroy humanity to prevent that outcome? Or is the only way to avoid our destruction the complete abandonment of general artificial intelligence R&D?

Right now there are a lot of unknowns and a lot of disagreement even among experts.

Will anything replace Twitter or will it just be X?

Apparently Twitter is now X.

Since Elon Musk bought Twitter the site has been turbulent. Between massive layoffs, increased service outages, and advertiser abandonment, other tech firms smell blood in the water. Mastodon, Bluesky, Substack’s Notes, and now Meta’s Threads have all made attempts to capture Twitter refugees or otherwise capitalize on the chaos surrounding Musk’s takeover, though it’s not clear any of them will succeed.

Mastodon’s decentralization gives it an inherent moderation problem, which is why it’s become a haven for child sexual abuse materials. Notes and Bluesky have yet to attract massive user bases, although Bluesky reportedly crested 1 million users. Threads is probably best positioned because of its ability to basically import Instagram’s users, but Meta isn’t exactly a harbinger of warm-fuzzies.

This writer has been off Twitter since long before the Musk takeover, and I find myself increasingly sympathetic to the notion that maybe we just don’t need a new Twitter.

Teb’s Tidbits

AI is being used to find certain metal deposits which are important for their use in batteries and other green energy and electrification efforts.
AI coding assistants can introduce major security issues. Humans do this too, but if AIs succeed at automating a lot of coding processes this will exacerbate the software insecurity situation.
Researchers have discovered a way to generate electricity from humid air.

Answers To The News Quiz

Pelrine described his encircling strategy this way: "As a human, it would be quite easy to spot." Why, then, didn’t the top ranked Go AIs spot it?

KataGo and other top ranked AIs primarily train in stages. First they “watch” expert level Go games between top ranked humans, then two similar versions of the model engage in “self play.”

But because the strategy Pelrine employed is “quite easy to spot” for top ranked human players, they never use it. When the models switch to self play the bots don’t use the tactic either, because they’ve never seen it.

In terms of “extrapolation” and “interpolation” what is this adversarial AI missing that allows novice human players to defeat it, even while it defeats a top ranked AI that frequently defeats top ranked humans?

The adversarial AI only really practiced against KataGo. Not only that, it “practiced” in a peculiar way that was explicitly designed to find a specific weakness or blindspot in KataGo’s play. The only kind of Go game that could really be considered “interpolation” for this machine is exactly the style of game that KataGo plays.

So, even the basic strategies of a novice human are “extrapolation” and therefore cause the adversarial AI significant problems.

Remember…

The Lab Report is free and doesn’t even advertise. Our curriculum is open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world class trainings or requesting a custom class for your team.

ML Failure Part 1: Underfitting and Overfitting

Tyler Bettilyon — Fri, 21 Jul 2023 10:00:00 +0000

The Weekly Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: cut through the noise with our weekly rundown of software and technology news.
If you’re new to the Lab Report you can subscribe here. If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

My birthday was this week, huzzah! If you want to help me celebrate you could:

Forward this newsletter to someone who might enjoy it.
Checkout our course catalog and take one of our classes.
Give me the gift of feedback: we’ve only done 3 editions of the revised newsletter and I have no idea how it’s going. Reply to this email!

Today’s Lesson: How ML Fails Part 1

All the code from today’s lesson can be viewed on Github and Google Colab.

Over the next few editions we’re highlighting the strengths and weaknesses of machine learning systems. ML is certainly having a heyday, but nearly 80% of ML endeavors ultimately fail.

ML news has a substantial survivor bias: We only hear about the models that actually get deployed. But the path to success is littered with failures. Today we’re looking at some of the most common ways ML projects fail.

The Happy Path to Success

Machine learning is most effective for problems that share these three features:

There is a large volume of high-quality data about the problem.
The problem has relevant statistical patterns that could plausibly be explained by mathematics.
The problem is not highly subjective.

Big data: Machine learning models succeed by finding and replicating patterns in their training data. In order to find complex patterns you must have complex data. When the data is complex you need a lot of it to have a “representative sample.”

Statistical patterns: In ML the pattern finding process is almost universally done with mathematical models. If a phenomenon cannot be explained with math then these models won’t work. There must be discernible statistical patterns for the machine to imitate, exploit, or discover.

Subjectivity: The more subjective a problem is the more likely human biases are incorporated into the data used to train the model. ML can handle subjectivity in some circumstances — for example social media feeds, advertisement, and movie/TV recommendation systems — but some of the most harmful failures of ML have come in highly subjective domains such as bail setting, predictive policing, and hiring.

Games have long been at the forefront of ML research because they have these properties. Near infinite amounts of data can be generated by encoding the rules and simulating play. There are often clear statistical patterns related to winning and losing. And, in the games ML researchers choose, there’s generally nothing subjective about the rules of play.

In fact, the phrase “machine learning” was coined by Arthur Samuel in the 1950s during the development of a Checkers playing program. That program recorded games and used the wins and losses to decide which moves were good or bad. Samuel’s system used both ML and search algorithms, a combination that state of the art systems like AlphaGo still employ today.

Arthur Samuel and the IBM 701 computer playing a game of checkers.

Other areas where ML has been highly successful — like financial fraud detection and spam filtering — also have lots of good data, relevant statistical patterns, and relatively low levels of subjectivity.

To a certain extent these guidelines can be broken. But with less data, fewer statistical trends, and more subjective problems you invite failure.

When Things Go Wrong

In this edition we’re looking at two broad categories of failure called “underfitting” and “overfitting.” Next week we’ll look at some other more nuanced failure types and some subtle causes of overfitting and underfitting.

Underfitting

Underfitting is the most straightforward way that things can go wrong in ML. This happens when your model just doesn’t find the right patterns in the training data to map inputs to labels. This often happens because the model you’re using is too simplistic to capture the patterns in the data. Last week we gave such an example: Using linear regression to fit parabolic data.

Linear Regression will always underfit parabolic trends.
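Here’s a small sketch of that failure using only Python’s standard library. The data is made up (y = x² plus a little noise), and the helper names are my own. It fits an ordinary least squares line, measures how much of the variance the line explains (R²), and then reruns the same routine with x² as the input feature for comparison.

```python
import random

random.seed(1)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - residual / total

# Made-up parabolic data: y = x^2 plus a little noise, for x from -10 to 10.
xs = [x / 2 for x in range(-20, 21)]
ys = [x ** 2 + random.gauss(0, 2) for x in xs]

# A straight line in x underfits...
slope, intercept = fit_line(xs, ys)
line_r2 = r_squared(xs, ys, slope, intercept)

# ...but the same routine fits well once it is given x^2 as its input feature.
squares = [x ** 2 for x in xs]
slope_sq, intercept_sq = fit_line(squares, ys)
parabola_r2 = r_squared(squares, ys, slope_sq, intercept_sq)

print(f"R^2 using x as the feature:   {line_r2:.3f}")
print(f"R^2 using x^2 as the feature: {parabola_r2:.3f}")
```

The line explains almost none of the variance because the model family simply can’t bend. Giving the model a richer representation (here, the squared feature) is one common remedy; switching to a more expressive model is another.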

Underfitting can also stem from lack of data. When there’s not enough data it might be impossible to discern meaningful patterns. Imagine randomly picking 5 dots from a parabolic distribution… Can you still tell it’s a parabola? Neither can an ML model.

With a small number of data points it’s impossible to discern meaningful patterns.

This problem is exacerbated in many machine learning contexts because of something called “the curse of dimensionality.” Essentially this “curse” means: the more features your input data has the more of it you need to find patterns across those features. Consider what the same 40 data points look like in 1D, 2D, and 3D:

As you add features the data becomes increasingly sparse.

In one dimension the data is tightly packed, in two dimensions it’s a bit more spaced out, and in three dimensions the chart is mostly unoccupied space. Said another way: in one dimension most values of x were represented; in two dimensions many (x,y) combinations were not represented; in three dimensions a tiny fraction of (x,y,z) combinations were represented.
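One rough way to put numbers on this sparsity is to chop each axis into bins and count how many grid cells contain at least one point. The sketch below does that for 40 random points. The choice of 10 bins per axis and the uniform random data are assumptions for illustration, not the data from the chart.

```python
import random

random.seed(42)
NUM_POINTS = 40
BINS = 10  # split every axis into 10 equal-width bins

# 40 made-up points in the unit cube. The 1D and 2D views below reuse
# the same points and simply ignore the extra coordinates.
points = [[random.random() for _ in range(3)] for _ in range(NUM_POINTS)]

def occupied_fraction(dimensions):
    """Fraction of grid cells holding at least one point when only the
    first `dimensions` coordinates of each point are considered."""
    cells = {tuple(int(c * BINS) for c in point[:dimensions]) for point in points}
    return len(cells) / BINS ** dimensions

fractions = {d: occupied_fraction(d) for d in (1, 2, 3)}
for d, fraction in fractions.items():
    print(f"{d}D: {fraction:.1%} of cells contain data")
```

With 40 points there can never be more than 40 occupied cells, but the number of cells grows from 10 to 100 to 1,000. In three dimensions at most 4% of the space has any data in it at all.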

When the data is not dense across all the dimensions the data set is not representative of all the relevant combinations of values in those dimensions. As a result whatever patterns ML models find are not likely to be representative either. Ironically, small datasets also put models at greater risk of overfitting (more on that in a moment).

Finally, underfitting can also happen when a model has good data but doesn’t train enough. Many models train iteratively. If you only train for a few rounds, you might miss critical patterns. Consider this series of charts showing a neural network’s training progress across 400 rounds (or “epochs”).

At first the model can only capture the general trend in the data, meaning it’s underfit. By the end it’s much more precise.

Many ML models get better with more training.
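The same effect shows up in even the simplest iteratively trained model. The sketch below is not the neural network from the charts; it fits a two-parameter line to made-up, noise-free data with plain gradient descent, and reports the error after 5, 50, and 400 epochs. The learning rate and the data are assumptions chosen for illustration.

```python
# Made-up, noise-free data from the line y = 3x + 2.
xs = [i / 19 for i in range(20)]
ys = [3 * x + 2 for x in xs]

def train(epochs, learning_rate=0.1):
    """Fit y = w*x + b with full-batch gradient descent; return the final
    mean squared error on the training data."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / n

loss_by_epochs = {epochs: train(epochs) for epochs in (5, 50, 400)}
for epochs, loss in loss_by_epochs.items():
    print(f"after {epochs:>3} epochs: mean squared error = {loss:.6f}")
```

Stopping after a handful of epochs leaves the model underfit even though the data and the model family are both perfectly adequate.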

Overfitting

Overfitting happens when a model learns too much about the specific training data instead of learning about the phenomenon in general.

A helpful analogy is a student cheating on an exam. Say the student had access to the answers before the test and memorizes them. If the exam is an exact match then this student will pass, but if the exam has been changed some — even if the questions are very similar — this student will fail because they only know the answers, not how to derive those answers.

Like the cheating student some of the more powerful ML models can effectively memorize the training data and how each data point maps to the output label. The models stop finding patterns that are generally predictive and start using patterns to identify individual training data points. When this happens we call it overfitting.

Here’s an example: I generated a small dataset that follows a linear trend with a little bit of noise. Then, I trained a neural network for about fifty thousand epochs on this small dataset. Here’s the result:

Strong overfitting, especially when x < 20. More training would have resulted in further overfitting on the right hand side of the distribution.

Even though the true underlying trend in the data is linear, my model has learned some wacky function that tries to perfectly capture the training data’s distribution. The problem is that this isn’t accurate. Let’s zoom in on the area with the most overfitting (0 <= x <= 20) and generate more data from the exact same distribution.

The learned peaks and valleys are spurious, just artifacts of overfitting to noisy data.

Clearly this model isn’t really learning the trend in our underlying distribution. Instead we’ve learned the noise values from the training data.

This example demonstrates the importance of two major ML principles: the need for big data and something called validation.

Validation is a collection of methods (we’ll discuss some in future editions) that help us detect when a model is failing. The most common methods involve curating at least two datasets from the same underlying pool of data. Then we train models on one of those sets, and use the other to test whether our model “generalizes” to the data that was held out of training.

This is exactly what we did in the chart above, and as you can see it demonstrated a major flaw in our model.
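Here’s a minimal sketch of hold-out validation in plain Python. Everything in it is an assumption for illustration: the data is made up (a linear trend plus noise), and the overfit “model” is a nearest-neighbor memorizer standing in for the over-trained neural network. Both models are scored on the data they trained on and on data that was held out.

```python
import random

random.seed(7)

def make_data(n):
    """Made-up data: the true trend is y = 2x, plus Gaussian noise."""
    xs = [random.uniform(0, 100) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 5)) for x in xs]

data = make_data(400)
train_set, validation_set = data[:200], data[200:]  # hold out half the data

def fit_line(points):
    """Ordinary least squares; returns a prediction function."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return lambda x: slope * (x - mean_x) + mean_y

def fit_memorizer(points):
    """An overfit "model": answer with the label of the nearest training point."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

def mse(model, points):
    """Mean squared error of a model's predictions on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

line, memorizer = fit_line(train_set), fit_memorizer(train_set)
results = {
    "line": (mse(line, train_set), mse(line, validation_set)),
    "memorizer": (mse(memorizer, train_set), mse(memorizer, validation_set)),
}
for name, (train_mse, validation_mse) in results.items():
    print(f"{name:>9}: train MSE {train_mse:6.1f}, validation MSE {validation_mse:6.1f}")
```

The memorizer looks perfect on its own training data, which is exactly why training error alone can’t be trusted. Its error on the held-out set is what reveals the problem, while the simple line performs about the same on both sets.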

Bigger datasets make it harder to overfit because for any given combination of input there will be multiple (ideally many) datapoints represented, each with unique output. In our little experiment a chart with “big data” would look something like this:

Dense data distributions make it hard for a model to become overfit.

It’s much more difficult to draw a weird squiggly line that hits every datapoint, because there are multiple y values for most x values. Given enough training some complex models will still start to overfit, you can see that here on the very left:

Very complex models can still overfit when trained for a very long time, but the effect is substantially diminished compared to smaller datasets.

Next week we’ll talk about some more nuanced causes of failure in ML, including how an amateur Go player beat a top Go-playing AI and the cause of a fatal self driving car crash.

The News Quiz

Every week we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.

Are the following examples of underfitting, overfitting, or neither?

In its recent lawsuit Getty Images is alleging that ML based image generators produce exact or near-exact replicas of copyright protected images.
State of the art discriminators cannot reliably tell the difference between AI and human written content.
The type of identity based discrimination Local Law 144 in New York City is trying to prohibit AI systems from exhibiting.

Themes in the News

An ML generated image of Lady Justice… for irony’s sake.

ML lawsuits and regulation are still the matter of the day

A slew of lawsuits and investigations continue to target ML companies over their models and their data collection procedures.

The FTC is investigating OpenAI on consumer protection grounds, Getty is suing Stability AI for unauthorized use and reproduction of copy-protected images, the SAG-AFTRA is highlighting generative models as part of their ongoing strike and negotiations, Microsoft and Github’s Copilot tool is facing a class action lawsuit for utilizing open-source (but not public domain) code, new EU legislation targeting big tech firms is about to go into effect, the US Senate published their investigation into how Meta, Google, and others got access to private tax data…

The list goes on and on.

While just about everyone agrees that some regulation is needed, there is substantial disagreement about exactly how to proceed. Fast.ai published a great example of this disagreement on July 10th, claiming:

Proposals for stringent AI model licensing and surveillance will likely be ineffective or counterproductive, concentrating power in unsustainable ways, and potentially rolling back the societal gains of the Enlightenment. The balance between defending society and empowering society to defend itself is delicate. We should advocate for openness, humility and broad consultation to develop better responses aligned with our principles and values — responses that can evolve as we learn more about this technology with the potential to transform society for good or ill.
Fast.AI

Teb’s Tidbits

Chinese hackers have compromised a significant number of government email accounts.
North Carolina plans to crack down on online speech related to abortion.
The computer-chip focused trade war with China continues to escalate.
Meta open sourced its large language model LLaMA 2.
Google apparently violates its own ad policy quite frequently.

Answers To The News Quiz

Are the following examples of underfitting, overfitting, or neither?

In its recent lawsuit Getty Images is alleging that ML based image generators produce exact or near-exact replicas of copyright protected images.
- Overfitting. The model learned to perfectly reproduce the training data; it’s a classic example of overfitting.
State of the art discriminators cannot reliably tell the difference between AI and human written content.
- Underfitting. Models, for a variety of reasons, have a hard time finding patterns that prove something is AI vs human generated.
The type of identity based discrimination Local Law 144 in New York City is trying to prohibit AI systems from exhibiting.
- Neither. Generally this type of discrimination comes from models that are well-fit to their training datasets. The problem is that those datasets (i.e. historical hiring and promotion data) contain artifacts of identity based discrimination. These models do exactly what they were trained to do: recreate historical hiring practices, which were racist, sexist, etc.
- More on this next week!

Remember…

The Lab Report is free and doesn’t even advertise. Our curriculum is open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world class trainings or requesting a custom class for your team.

WTF is ML Anyway?

Tyler Bettilyon — Fri, 14 Jul 2023 11:00:00 +0000

The Weekly Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: cut through the noise with our weekly rundown of software and technology news.
If you’re new to the Lab Report you can subscribe here. If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

Learn more about machine learning with our 4-day introduction to the topic. You’ll learn how to build and train machine learning models using popular libraries such as Scikit Learn and Tensorflow; how to manage and manipulate datasets using Pandas; and how to manage and mitigate common sources of error and failure in ML systems.

Checkout Intro to Machine Learning

As always: we want to hear from you. Do you have any questions about today’s lesson? Is there a topic you want to see covered in a future edition? Is there a piece of software news you think we missed? Reply to this email and let me know!

Today’s Lesson

FYI, the code used to train models and make charts in today’s lesson is available here.

AI Series 2: What is Machine Learning?

If you missed it we covered some definitions and terminology last week.

Machine Learning (ML) is a subset of Artificial Intelligence (AI). Here’s a Venn Diagram:

All ML is AI, but not all AI is ML.

ML is the most popular branch of AI today. ML models have consistently been the state of the art in a wide array of AI tasks since the 2010s, and that trend seems likely to continue into the next decade and beyond.

Today we’ll describe what distinguishes ML from other types of AI and why it’s so successful.

“Classical” AI

In classical AI human engineers design all the rules for turning inputs to outputs. Human engineers decide how to represent the data, world, or task being performed; they decide how to process that representation; they decide what form the output will take; and so on. End to end, humans design all of it.

These designs can be complex. The “rules” are frequently described in the form of one-or-more complex algorithms that utilize clever data structures. But humans explicitly wrote those algorithms and designed the data models.

In classical AI humans design all the rules that control input to output mapping.

Google Maps’ path finding feature is a classical AI. Engineers at Google explicitly mapped the world’s roadways into a data structure called a graph. Then they use graph search algorithms to find the shortest path between two nodes in the graph.
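To make "graph plus search algorithm" concrete, here is a minimal sketch of Dijkstra's shortest-path algorithm in Python. The tiny road network, its node names, and its travel costs are all made up for illustration, and this is not a claim about how Google Maps is actually implemented. The point is that a human wrote every rule: the data structure, the costs, and the search procedure.

```python
import heapq

# Hypothetical road network: each key is an intersection, each value maps
# neighboring intersections to the travel cost (say, minutes) of that road.
ROADS = {
    "A": {"B": 4, "C": 2},
    "B": {"A": 4, "C": 1, "D": 5},
    "C": {"A": 2, "B": 1, "D": 8},
    "D": {"B": 5, "C": 8},
}

def shortest_path(graph, start, goal):
    """Return (total_cost, path) for the cheapest route, or (inf, []) if none exists."""
    # Priority queue of (cost so far, node, path taken to reach that node).
    frontier = [(0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue  # we already found a cheaper way to this node
        visited.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(frontier, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []

cost, path = shortest_path(ROADS, "A", "D")
print(cost, path)
```

Nothing here is learned from data. If the route is bad, an engineer fixes it by changing the graph or the algorithm.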

Other examples of classical AI include:

Constraint satisfaction algorithms, which are good at scheduling problems, puzzles like Sudoku and crossword, and shipping/freight optimization.
Search algorithms which power Google Maps, spellcheck, social media friend suggestions, and some game playing systems including the famous Deep Blue chess playing system.
Simulation based methods, which are popular in drug discovery and weather prediction.
And more…

Machine Learning

Machine learning is different. With ML certain aspects of the modeling process are “learned” from data in a process called training or fitting (these terms are used interchangeably). Human engineers still decide how to represent the input data and explicitly design the training process. However, during the training process several aspects of how to map inputs to outputs are determined based on the training data.

In ML the training process produces the model, which controls input to output mapping.

For the most popular kind of ML, which is called supervised learning^*, the training data must be labeled which means each data-point contains both the input value and the correct output value.

*In future editions we’ll describe other types of ML in more detail, including unsupervised learning and reinforcement learning.

For ML designed to price houses you need examples of houses and their sale value. For ML designed to detect spam emails you need examples of emails and whether or not they are spam. For ML designed for facial recognition you need pictures of people labeled with who is in the picture.

Each model type has its own training process for how to use this labeled data.

Some models are optimization based, which generally means using calculus to minimize an “error” or “loss” function. Some are “distance” based, which generally means representing the data as vectors and using distance measures like Euclidean or cosine distance to measure similarities or make groups. Other models, such as decision trees, use metrics like Gini impurity to repeatedly split the data into increasingly homogeneous groups.
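For readers who like to see the arithmetic, here is a small sketch of the measurements mentioned above, written in plain Python. The function names are my own, and the Gini measure shown is the one decision trees use (usually called Gini impurity): it is 0 when a group contains a single label and grows as the labels get more mixed.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def gini_impurity(labels):
    """Chance that two labels drawn at random (with replacement) disagree."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

print(euclidean_distance([0, 0], [3, 4]))          # a 3-4-5 triangle
print(gini_impurity(["spam", "spam", "not spam"])) # a mixed group
```

A tree-building algorithm would try many candidate splits and keep the one that lowers impurity the most.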

You can think of an “untrained” ML model or agent as a template: some of the important decisions and structure are set by the template, but the details must be filled in during training. The “shape” of the inputs and outputs is almost always fixed: for example a house pricing system takes in a fixed set of data points about any given house (e.g. size, number of bedrooms, and year built) and returns a single number representing the price.

Some ML models make quite a lot of assumptions while others are more flexible. For example, perhaps the simplest ML model is one you may have already encountered in a high school or college statistics class: linear regression.

Linear regression makes a big assumption: there is a linear relationship between the input and output data. There are an infinite number of lines with different slope and y-intercept values. We can use training data and an optimization algorithm (e.g. gradient descent) to determine which line is the best line by iteratively trying several slope/y-intercept combinations, measuring the error by comparing our model to the labels in our training data, and making adjustments to the slope/y-intercept.
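Here is a minimal sketch of that loop in plain Python: guess a slope and y-intercept, measure the error against the labels, and nudge both values downhill. The data, the learning rate, and the step count are assumptions I picked for illustration, and this is not the code used to make the charts in this lesson.

```python
def fit_line(xs, ys, learning_rate=0.01, steps=5000):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # How far off is the current line for each labeled example?
        errors = [(slope * x + intercept) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to each parameter.
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2 / n) * sum(errors)
        # Step a little in the direction that reduces the error.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

# Toy training data that follows y = 2x + 1 exactly.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(round(slope, 3), round(intercept, 3))
```

The "template" is the line itself. Training only fills in two numbers.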

Linear regression succeeds when the training data (blue dots) have a linear trend by learning the red line as our model for how to map input values (x axis) to output values (y axis)

BUT! The model can only be a line. If your data has some other kind of relationship linear regression will still just produce a line and, as a result, it will probably be quite bad at modeling your data.

Linear regression utterly fails when the trend in our data is parabolic.
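You can check this failure numerically. The sketch below uses the closed-form least squares formula (a swapped-in shortcut, rather than gradient descent) to find the best possible line for two made-up datasets, one linear and one parabolic, and then compares the leftover error.

```python
def least_squares_line(xs, ys):
    """Best-fit slope and intercept using the closed-form least squares formula."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def mean_squared_error(xs, ys, slope, intercept):
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [x / 2 for x in range(-10, 11)]   # -5.0 to 5.0 in steps of 0.5
linear_ys = [3 * x + 2 for x in xs]    # data with a linear trend
parabola_ys = [x * x for x in xs]      # data with a parabolic trend

linear_fit = least_squares_line(xs, linear_ys)
parabola_fit = least_squares_line(xs, parabola_ys)
linear_error = mean_squared_error(xs, linear_ys, *linear_fit)
parabola_error = mean_squared_error(xs, parabola_ys, *parabola_fit)
print(linear_error, parabola_error)
```

On the symmetric parabola the best line is essentially flat, so even the optimal line leaves a large error. No amount of extra training fixes a model whose template cannot express the trend.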

Neural networks are a much more flexible family of ML models. These models are “universal function approximators,” which means that, given enough neurons, they can approximate essentially any continuous math function as closely as we like. The only assumption they make about how to map input data to output data is that the mapping must be a math function.

When we use a neural network our hypothesis is that there is some math function that can map the inputs to the labels. It could be any function. The process of training a neural network is essentially an attempt to find the best math function for mapping our inputs to our labels.

A neural network can learn to approximate the parabolic function:

A simple neural network’s approximation of a parabola.
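To give a feel for what "training a neural network" means at the smallest possible scale, here is a from-scratch sketch in plain Python: one hidden layer of tanh neurons, trained with gradient descent to approximate y = x^2. The layer size, learning rate, step count, and random seed are all assumptions chosen to keep it small, and it is not the network behind the chart above.

```python
import math
import random

def init_network(hidden=6, seed=0):
    """Random starting weights for a 1-input, 1-output network with one hidden layer."""
    rng = random.Random(seed)
    return {
        "w1": [rng.uniform(-1, 1) for _ in range(hidden)],  # input -> hidden weights
        "b1": [rng.uniform(-1, 1) for _ in range(hidden)],  # hidden biases
        "w2": [rng.uniform(-1, 1) for _ in range(hidden)],  # hidden -> output weights
        "b2": 0.0,                                          # output bias
    }

def predict(net, x):
    hidden = [math.tanh(w * x + b) for w, b in zip(net["w1"], net["b1"])]
    return sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]

def mean_squared_error(net, xs, ys):
    return sum((predict(net, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(net, xs, ys, learning_rate=0.02, steps=2000):
    """Full-batch gradient descent; returns a new, trained copy of the network."""
    w1, b1, w2, b2 = list(net["w1"]), list(net["b1"]), list(net["w2"]), net["b2"]
    size, n = len(w1), len(xs)
    for _ in range(steps):
        g_w1, g_b1, g_w2, g_b2 = [0.0] * size, [0.0] * size, [0.0] * size, 0.0
        for x, y in zip(xs, ys):
            hidden = [math.tanh(w1[j] * x + b1[j]) for j in range(size)]
            pred = sum(w2[j] * hidden[j] for j in range(size)) + b2
            d_pred = 2 * (pred - y) / n  # derivative of the error w.r.t. the prediction
            g_b2 += d_pred
            for j in range(size):
                g_w2[j] += d_pred * hidden[j]
                d_pre = d_pred * w2[j] * (1 - hidden[j] ** 2)  # chain rule through tanh
                g_w1[j] += d_pre * x
                g_b1[j] += d_pre
        for j in range(size):
            w1[j] -= learning_rate * g_w1[j]
            b1[j] -= learning_rate * g_b1[j]
            w2[j] -= learning_rate * g_w2[j]
        b2 -= learning_rate * g_b2
    return {"w1": w1, "b1": b1, "w2": w2, "b2": b2}

xs = [i / 10 for i in range(-10, 11)]  # 21 points from -1.0 to 1.0
ys = [x * x for x in xs]               # labels from the parabola y = x^2

untrained = init_network()
trained = train(untrained, xs, ys)
print(mean_squared_error(untrained, xs, ys), mean_squared_error(trained, xs, ys))
```

Unlike the line, nothing in this template says "parabola." The curve emerges from the weights found during training, which is why the same code could be pointed at a different function.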

It can also approximate the sine function, even with some random noise added to the training data:

A simple neural network’s approximation of a noisy sine function.

It can even learn this weird function I pulled out of thin air for the purpose of this demonstration:

A simple neural network approximating the function:
(x^3) + 100000sin(x) - 3000x - x^2

Notice there is one area at about x=75 where the model is a worse approximation. It still follows the overall trend, but fails to capture the nuance of about 3 oscillations of the sine component.

This could be addressed through more training or by adding a bit more complexity to the underlying neural network^*. But it demonstrates one way these models fail — by not capturing every aspect of the underlying problem during training — so I left it in.

*In future editions we’ll explore “neural network architecture” and what it means to have a “complex” vs “simple” neural network.

This is all nifty, but it doesn’t explain why ML methods are so popular. Plus, the computational costs of these training processes are substantial, especially for the large models redefining the state of the art: OpenAI CEO Sam Altman has said the cost of training GPT-4 (just the price of performing the computation) was more than $100 million.

So why is this cost worth paying?

The Benefits of ML

Here’s a motivating example I use with my students: If I asked you to describe what makes a picture of a cat cat-like, what would you say? BUT! You have to do it in a language that computers understand, i.e. “mathematics.”

Photo by Amber Kip (Unsplash)

One answer I often get is, more or less, we should look for pointy ears and whiskers. Okay, so in the language of geometry, what is a “pointy ear?”

Jokes aside this approach isn’t totally impossible, but it is hard. First we’d do something called “edge detection” which can reduce the photo to lines:

The result of Canny edge detection (using OpenCV) on the cat photo.

As you can see, edge detection can be a messy business. Shadows in the cat’s chest fur have resulted in a lot of noise. We don’t know which edges belong to the cat and which belong to something else in the image. The cat’s edges aren’t entirely continuous, including around its left ear…
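If you have never seen edge detection up close, here is a deliberately simplified stand-in in plain Python. It is not the Canny algorithm (which adds smoothing, thinning, and a two-threshold cleanup step). It just marks a pixel as an edge when the brightness jumps sharply to its right or below. The tiny "image" and the threshold are made up for illustration.

```python
def detect_edges(image, threshold=50):
    """Return a grid of 0s and 1s; 1 marks a sharp brightness change."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    # The last row and column have no right/below neighbor, so they stay 0.
    for row in range(height - 1):
        for col in range(width - 1):
            dx = image[row][col + 1] - image[row][col]  # change to the right
            dy = image[row + 1][col] - image[row][col]  # change downward
            if (dx * dx + dy * dy) ** 0.5 > threshold:
                edges[row][col] = 1
    return edges

# A 4x4 grayscale image: dark on the left, bright on the right.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]

for row in detect_edges(image):
    print(row)
```

Even in this toy, the hand-picked threshold decides what counts as an edge. A shadow that crosses the threshold becomes an "edge" just like the cat's ear does.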

Now, using this messy representation, we have to perform some non-trivial geometry to describe what exactly constitutes “pointy ears” or “whiskers” plus some tricky logic to find such shapes within the detected edges. Even then, plenty of other animals have pointy ears and/or whiskers.

Another tactic: we could examine the distribution of colors in the photo. If there’s a lot of pink and green then maybe it’s not a cat, if there’s more black, white, orange, and grey maybe it is a cat. Unfortunately, plenty of other animals share colors with cats. Some of those even have pointy ears and whiskers. Or, maybe the cat is far away so its colors only contribute a little to the overall color composition.

These classical approaches are an enormous challenge with some serious flaws. With machine learning we don’t do any of that. Instead we build a complex template that:

Expects pixel data as input.
Is a binary classifier (i.e. produces “cat” or “not a cat” as output).
And is sufficiently complex that it can capture important features of what makes an image cat-like or not.

Then, instead of defining features that make a picture cat-like, we collect a bunch of pictures and label them cat or not-cat. We train the model on these labeled pictures and it figures out which features matter during the training process.
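Here is the same recipe at toy scale, sketched in plain Python: pixel values in, "cat" or "not a cat" out, with the rule learned from labeled examples. The four-pixel "images" and their labels are invented, and logistic regression stands in for the far larger networks real image classifiers use.

```python
import math

def train_classifier(images, labels, learning_rate=0.5, steps=2000):
    """Learn one weight per pixel (plus a bias) from labeled examples."""
    weights = [0.0] * len(images[0])
    bias = 0.0
    for _ in range(steps):
        for pixels, label in zip(images, labels):
            score = sum(w * p for w, p in zip(weights, pixels)) + bias
            probability = 1 / (1 + math.exp(-score))  # squash score into 0..1
            error = probability - label               # positive if we guessed too high
            weights = [w - learning_rate * error * p for w, p in zip(weights, pixels)]
            bias -= learning_rate * error
    return weights, bias

def classify(model, pixels):
    weights, bias = model
    score = sum(w * p for w, p in zip(weights, pixels)) + bias
    return "cat" if score > 0 else "not a cat"

# Hypothetical 4-pixel "images" (brightness from 0 to 1) with hand-made labels.
images = [
    [0.9, 0.8, 0.1, 0.2],
    [0.8, 0.9, 0.2, 0.1],
    [0.1, 0.2, 0.9, 0.8],
    [0.2, 0.1, 0.8, 0.9],
]
labels = [1, 1, 0, 0]  # 1 = cat, 0 = not a cat

model = train_classifier(images, labels)
print(classify(model, [0.85, 0.7, 0.15, 0.1]))
```

Notice that nobody told the model which pixels matter. It worked that out from the labels, which is the whole appeal.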

And, just to be clear, modern ML models are extraordinarily good at this problem which we’d call “Image Classification” in AI circles. Here’s a bunch of relevant papers.

ML also makes it much easier to expand our classifier to recognize other things. Instead of trying to describe cars, ladybugs, and hamburgers in the language of mathematics we just collect and label images of those things. Which, of course, is easy thanks to the internet.

This is what Richard Sutton called “The Bitter Lesson” in 2019:

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.
Richard Sutton

Because neural networks are universal function approximators and math is unreasonably effective at modeling so many things, we can simply unleash the world’s remarkable and ever-growing computational power on most problems.

The dirty secret of the current ML revolution is that, while there have been some clever software breakthroughs, it’s mostly due to hardware advances and computational availability.

One final note: some classical methods also scale with computational power, in particular search based methods:

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
Richard Sutton

This explains why some of the most incredible state of the art systems, such as AlphaGo, employ both search and machine learning.

BUT! ML can’t solve everything (at least not yet). In the next issue we’ll explore the strengths and weaknesses of modern ML, using some prominent successes and failures as motivating examples.

The News Quiz

Every week we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.

A wind prediction figure from NVIDIA’s FourCastNet paper.

Recently, ML researchers have designed models that can outperform classical “numerical weather prediction” (NWP) based systems. Read this short summary about weather prediction’s ML moment, then answer the following questions:

Rate the following statements as true, false, or it’s complicated.

NWP methods are essentially large physics simulations.
ML methods require less computational power overall than NWP methods.
ML methods are especially good at predicting weather patterns that did not appear in the training data.

Themes in the News

Successful ML Weather Models are a Big Deal

We highlighted these in the news quiz, but it’s worth saying a bit more. Weather prediction has long been an area where ML models ought to thrive: Weather is a genuinely objective phenomenon, governed by laws of physics, with incredible (and growing) amounts of historic data available.

But, until recently, ML based methods for weather prediction were considered by many meteorologists to be a “toy.” New systems such as FourCastNet, NowcastNet, and Pangu-Weather are changing that.

Once trained, these ML systems make predictions much faster than traditional simulation-based methods, in some cases 10,000 times faster. Quick updates are important in extreme weather scenarios such as the heavy rains that battered North India and New England this week.

Equally exciting is Pangu-Weather’s ability to track tropical storms without explicitly training on that phenomenon. Tropical storms are an emergent property of other more fundamental weather patterns. By predicting these fundamental patterns Pangu-Weather was able to predict weather phenomena that weren’t included in the training data. This generalization to unseen patterns is an area where ML models usually struggle. Such generalization is especially important in an era of a changing climate.

Such advances are an example of how former Google CEO Eric Schmidt thinks AI will transform science.

Privacy, Surveillance, and Regulation

The EU and US reached a deal regarding US spy agencies and their access to EU citizens’ data. A bill was proposed that would prevent those same US agencies from circumventing the warrant process by buying certain kinds of data from brokers, cellphone providers, and others.

Self driving cars provide significant amounts of surveillance footage to police agencies, just one more reason that some regulators are taking a hard look at the industry.

All this data is increasingly used in Real Time Crime Centers, which attempt to make the police more responsive or more predictive, but can also exacerbate existing police biases and overreach.

Teb’s Tidbits

China is responding in kind to semiconductor export restrictions.
Google and Meta will stop showing Canadian news in protest of a law that requires them to pay Canadian news outlets.
The UK is trying to attract American crypto refugees.
Syracuse might be a big winner in the US’s attempt to bring microchip production back onshore.

Answers To The News Quiz

Rate the following statements as true, false, or it’s complicated.

NWP methods are essentially large physics simulations.
- True.
ML methods require less computational power overall than NWP methods.
- It’s complicated. ML models are much faster at making predictions once they are trained. Unfortunately the authors of the respective papers have not published details about the training costs for these ML models. If the models don’t need to be retrained frequently then this statement would likely be true. The more often they need to be retrained the more likely the statement is to be false.
ML methods are especially good at predicting weather patterns that did not appear in the training data.
- False. This is the area where ML models typically perform worst (more on that in the next Lab Report), which is also what makes the recent advances so exciting. Here are two quotes:
- “AI-powered forecasting models are trained on historical weather data that goes back decades, which means they are great at predicting events that are similar to the weather of the past. That’s a problem in an era of increasingly unpredictable conditions.” (From the previous summary).
- “Pangu-Weather was also able to accurately track the path of a tropical cyclone, despite not having been trained with data on tropical cyclones. This finding shows that machine-learning models are able to pick up on the physical processes of weather and generalize them to situations they haven’t seen before.” (From this deeper dive).


We’re Back, Plus a Series on Artificial Intelligence

Tyler Bettilyon — Fri, 07 Jul 2023 11:00:00 +0000

The Weekly Lab Report

I’m Tyler Elliot Bettilyon (Teb) and this is the Lab Report: cut through the noise with our weekly rundown of software and technology news.
If you’re new to the Lab Report you can subscribe here. If you like what you’re reading you’ll love one of our classes. Schedule a training from our catalog or request a custom class consultation.

From The Lab

We’re back!

I can’t believe it’s been nearly two years since we last published a Lab Report. I took a short vacation from writing that ultimately got out of hand. Oops. Anyway, I’m excited to get back into it. We’re also making some changes:

First, we changed platforms (from MailChimp to Beehiiv). We’re also going to post articles to Beehiiv rather than Medium going forward. Bear with us, there will probably be a few display and delivery issues as we adjust. If your email looked weird please let us know so we can fix it!

Second, we’re leaning into the education side of our business by putting more emphasis on explaining the software concepts behind the news.

Third, we’re doing more marketing for our corporate training offerings by highlighting relevant courses. Like this:

AI and ML have never been more important. Schedule our approachable half-day machine learning workshop. It’s appropriate for executives, project managers, and other engineer-adjacent roles. You’ll gain a deeper understanding of machine learning and become more proficient at discussing it. Bonus points: I promise to talk in plain English.

Learn More About ML For Everyone

Some things don’t change: The curricula for our courses will remain open source and public domain. Teb’s Lab still believes in sharing. If you’re a teacher and you want to use or contribute to our curricula I’d be delighted to talk to you. Respond to this email and we’ll set something up.

Finally, a request: I want to hear from you!

What software topics do you wish you understood better? What do you wish your friends and colleagues knew about software? The more I know about your interests, the better this newsletter can serve you.

The next few editions will focus on artificial intelligence and machine learning. If you want to know something specific about AI or ML, drop me a line ASAP and get your pet topic into the series.

Today’s Lesson

Artificial Intelligence Part 1: What is AI?

Whether it’s taking everyone’s job, destroying the whole world, creating new medicines, flooding the internet with misinformative garbage, or satirically dressing politicians in drag, artificial intelligence is suddenly everywhere.

AI also powers many of the essential conveniences of digital life. Spam filters, fraud detectors, and media feeds are all AI based.

Because AI is actually many different things, we’re starting our series on AI by taking Karen Hao’s advice to improve discussions about AI: use more specific language.

The Elusive Definition of AI

AI is a broad field. Here’s a definition from Google marketing material:

Artificial intelligence is a field of science concerned with building computers and machines that can reason, learn, and act in such a way that would normally require human intelligence or that involves data whose scale exceeds what humans can analyze.
Google

Here’s another from a luminary in the field:

The science and engineering of making intelligent machines, especially intelligent computer programs. […] Intelligence is the computational part of the ability to achieve goals in the world. Varying kinds and degrees of intelligence occur in people, many animals and some machines.
John McCarthy

And an old joke:

Modern artificial intelligence uses tactics and techniques from hundreds of related fields and sub-domains. The Large Language Model (LLM) behind ChatGPT is fundamentally different from Google Maps’ heuristic graph search algorithms. Simply calling both of these “AI” obscures more than it illuminates.

So, let’s get specific.

Defining AI by Application

Sometimes we define AI systems based on their application. Other times we define them based on their underlying technology. Let’s start with application types.

The first two keywords to know are model and agent.

A model is a system that encodes or represents something. We use models to understand the world, make predictions, categorize things, and derive useful information.

Google Maps is a model. The data structures that power it are called graphs. (Learn more with our Graph Theory Primer.) It uses graphs to represent roadways. Then, graph algorithms consume those structures to suggest driving routes.

Spam filters are models. A popular approach is to represent emails in a numeric format called an embedding. Then, algorithms use the embeddings to categorize email as “spam” or “not spam.”

An agent is an AI system that consumes information about its environment and takes actions that impact that environment. Typically agents take actions repeatedly, over time.

“States,” “rewards,” and “actions” are terms of art in a branch of AI called “Reinforcement Learning” which is used to make AI agents.

Game playing systems are agents. The environment is the game. The agent makes moves — which change the state of the game — repeatedly until the game ends.

Self driving cars are agents. The environment is the roadway (as parsed by the car's various sensors). The agent accelerates, turns, stops, etc. — which changes the state of the roadway — repeatedly until it arrives at the destination.

Agents frequently use models as components. For example a self driving car will have multiple models which all contribute to the agent’s decision making. One model to understand images from a camera; another for parsing LiDAR data; another for audio data; and so on.

The output of these models is fed into the agent’s decision making system: the camera and audio models agree there is an ambulance approaching, so the car pulls over.

Agents can use several models as part of their decision making infrastructure.
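In code, the "agent uses models" pattern might look something like the sketch below. Every name here is hypothetical, and the two "models" are trivial stand-ins for what would be large vision and audio systems in a real car. What matters is the shape: models turn raw sensor data into labels, and the agent's decision logic turns labels into actions, over and over.

```python
def camera_model(frame):
    """Hypothetical vision model: labels what the camera sees."""
    return "ambulance" if "flashing_lights" in frame else "clear_road"

def audio_model(clip):
    """Hypothetical audio model: labels what the microphones hear."""
    return "siren" if "wail" in clip else "ambient"

def decide(camera_label, audio_label):
    """The agent's decision step: combine model outputs into one action."""
    if camera_label == "ambulance" and audio_label == "siren":
        return "pull_over"
    return "keep_driving"

def run_agent(sensor_readings):
    """Act repeatedly, once per time step, as new sensor data arrives."""
    actions = []
    for frame, clip in sensor_readings:
        actions.append(decide(camera_model(frame), audio_model(clip)))
    return actions

# Two made-up time steps: a quiet road, then an approaching ambulance.
readings = [
    (["road", "trees"], ["hum"]),
    (["road", "flashing_lights"], ["wail"]),
]
print(run_agent(readings))
```

Requiring both models to agree is just one possible design. A real system would weigh confidence levels and many more inputs.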

Numerical vs Categorical Output

Models usually fall into one of two categories based on the kind of output they produce: regression or classification.

Regression models produce one or more continuous numbers, i.e. numbers that can be arranged from smallest to largest. Zillow’s Zestimate is a regression model: given data about a property it predicts the price.

Classification models produce one or more discrete categories. Spam filtering is done by a classification model: given an email the model classifies it as one of two distinct categories, spam or not-spam.

Some models combine classification and regression.

A computer vision^* task called “object localization” requires models to do both. In object localization we determine both the location of various objects within an image (regression) and categorize each object (classification). Most self-driving car systems use object localization models. We want our car to know where various things are on the road and be able to distinguish between (e.g.) pedestrians, semi-trucks, and stop signs.

^*Computer vision is the field of using AI to process images, video, or other optical data.

An example of “Object Localization” from the Yolonet website.
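One way to see why object localization is "both" is to look at the shape of a single output. In the sketch below (field names and numbers are invented, not taken from any particular library), each detection carries one discrete label and four continuous numbers describing the box.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str     # classification: one of a fixed set of categories
    x: float       # regression: continuous numbers locating the box
    y: float
    width: float
    height: float

def labels_in_view(detections):
    """The categorical half of the output."""
    return sorted({d.label for d in detections})

def box_area(detection):
    """A quantity derived from the numerical half of the output."""
    return detection.width * detection.height

# Hypothetical output for one camera frame.
detections = [
    Detection("pedestrian", 120.0, 80.5, 40.0, 110.0),
    Detection("stop sign", 300.0, 20.0, 32.0, 32.0),
]
print(labels_in_view(detections), box_area(detections[0]))
```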

Finally, some models don’t fall into either of these two categories. Google Maps produces a “path” which is neither a category nor a continuous number.

Discrimination vs Generation

Models can also be discriminative or generative in terms of what they produce.

A discriminative model identifies differences between various inputs of the same type.

The Zestimate, spam filters, and object localization are all discriminative — the first discriminates between houses by returning different prices; the second discriminates between emails by labeling them spam or not; the third discriminates between images by labeling and locating objects in the images.

A generative model produces output that matches a specification or falls into a particular distribution.

GPT-4 is a generative model. It produces text from a complex statistical distribution that fits the specification of “plausibly written by a human.” Most generative text models are classification systems because they iteratively select the next word^* from a set of discrete words.

^*Technically these systems use “tokens” rather than words. Punctuation marks and word-parts can be tokens.

I discussed framing text generation as a classification problem with ChatGPT, which did a good job explaining the usefulness and limits of such framing.
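Here is a toy version of that "generation as repeated classification" loop. The probability table is made up and only looks at the previous token, whereas a real LLM computes its probabilities with a neural network that considers the whole preceding text and usually samples rather than always taking the top choice. The loop structure is the part worth noticing: at every step the system picks one option from a discrete set.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 0.7, "a": 0.3},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.8, "<end>": 0.2},
    "dog": {"sat": 0.6, "<end>": 0.4},
    "sat": {"<end>": 1.0},
}

def generate(max_tokens=10):
    """Build text one token at a time by picking the most likely next token."""
    tokens = []
    current = "<start>"
    for _ in range(max_tokens):
        scores = NEXT_TOKEN_PROBS[current]
        current = max(scores, key=scores.get)  # a classification over the vocabulary
        if current == "<end>":
            break
        tokens.append(current)
    return tokens

print(" ".join(generate()))
```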

Midjourney, DALL-E, and so-called “Deep Fakes” are also generative. These models produce images matching some distribution or parameters. Midjourney and DALL-E do this by taking text as input and producing images that “match” the text. Deep Fakes do it by producing photorealistic images that appear to show a specific person in scenes that never occurred.

DALL-E attempts to recreate our logo based on the prompt: “A beaker with filled with splashing blue liquid and the letter T in the style of a modern digital icon.”

These models perform regression. They produce images, which are composed of pixels. Each pixel is 3 continuous numbers representing red, green, and blue values. The size of the image is usually preset.

That’s already ~1500 words so next week we’ll discuss “Machine Learning” and what sets it apart from other types of AI.

The News Quiz

Every week we challenge ourselves to tie the lesson to the news. Answers are at the end of this newsletter.

Categorize the following AI systems into this 2 by 2 grid:

	Regression	Classification
Discriminative	Discriminative Regression	Discriminative Classification
Generative	Generative Regression	Generative Classification

Midjourney banned Justin T. Brown for creating a series of fake images of politicians having affairs called “AI will revolutionize the blackmail industry.”
“Nonsense” books written by AI tools are flooding Amazon’s “Kindle Unlimited” YA Romance section.
“Artificially generated videos that appear to show me engaging in sex acts,” meant to, “humiliate, shame, and objectify women.” The firsthand account from former head of the Department of Homeland Security’s Disinformation Governance Board, Nina Jankowicz.
AI hiring systems regulated under New York City’s Local Law 144 which, as of July 5th, requires such systems to be audited for bias on the basis of sex, ethnicity, and other protected categories.
Bonus points: Object localization demonstrates a weakness of this 2 by 2. Can you explain the weakness? Where would you put an object localization system in the grid?

Themes in the News

Experts Disagree About AI’s Existential Risks

Some, like Ray Kurzweil and Nick Bostrom, see systems like ChatGPT and “Large Language Models” (LLMs) more generally as a sign that “Artificial General Intelligence” (AGI) is not just inevitable but likely coming soon.

Others, like Rodney Brooks, argue that “We neither have super powerful AI around the corner, nor the end of the world caused by AI about to come down upon us.”

Still another camp, including Joy Buolamwini and Timnit Gebru, contend that the hype surrounding AGI distracts us from addressing serious problems AI has already caused.

IEEE Spectrum put together this fantastic “apocalypse scorecard” that demonstrates just how fractured the AI community is about the possibility of, and risks posed by, AGI.

AI is Already Disruptive

Tabloid writers, copywriters, and other similar workers are being partially replaced by large language models. Buzzfeed uses them to produce certain types of articles with minimal oversight. Click farmers use them to generate “made for advertising” (AKA spam) websites. These websites are filled with SEO optimized garbage and exist to siphon money from the “programmatic advertising” that Google and others offer.

Self-driving taxis, trucks, and buses will impact the professional driving industry. Image generators will change the design and illustration industries. However, these tools will likely also create new jobs and industries. Many argue AI will shift employment trends rather than massively replace human workers overall.

In the quiz we highlighted how AI is used to harass, deceive, and discriminate. Politicians have used AI to create fake photos of Toronto overrun with homelessness and Donald Trump kissing Anthony Fauci (among other things).

Pressing Issues Should Be Addressed Now

Current AI systems have created substantial privacy, copyright, fairness, and safety issues.

Data collection practices associated with information-hungry AIs “inevitably include some sensitive personal information.” Models sometimes regurgitate this information verbatim. Laws like the European Union’s GDPR can help protect netizens from such privacy breaches.

OpenAI is facing a class action lawsuit over the “misappropriation” of copyrighted materials which were used to train their systems. Getty Images is suing Stability AI over their image generator on similar grounds.

AI tools have demonstrated substantial and consistent bias in sensitive areas like welfare applications and bail setting. More audits like the ones required by New York’s Local Law 144 may help.

Self driving cars create safety issues (as do human drivers, obviously). The burgeoning self-driving taxi industry should probably be subject to more oversight. So should commercially available systems like Tesla’s Autopilot, which has a variety of potentially fatal flaws.

Teb’s Tidbits

AI tools try to reduce doctor burnout by handling required paperwork.
Corporate executives are installing military grade spyware on their employees’ computers.
Workers who build datasets to train AI systems are now outsourcing that job to AI systems. Doing so generally degrades the quality of AIs trained on those AI-built datasets.

Answers To The News Quiz

	Regression	Classification
Discriminative		Hiring Systems (Local Law 144)
Generative	"AI Will Revolutionize Blackmail"; Deepfake Revenge Porn	Nonsense Books

“AI will revolutionize the blackmail industry,” is generative because it produces new pieces of data from some distribution (realistic images of a specific person). It’s regression because the generated pixels are continuous values.
Deepfake Revenge Porn is generative regression for the same reasons as “AI will revolutionize the blackmail industry.”
AIs that write “Nonsense” books are generative because they produce a new piece of data from some distribution (text similar to novels). They’re classification because the output at each step is a word (or token) that is chosen from a discrete set of words (or tokens).
AI hiring systems are discriminative because they discriminate between applicants. They’re classification because they choose from discrete options, e.g. “hire/don’t hire” or “interview/don’t interview.”
- Such a system could be regression if it produced a “candidate score” representing the “quality” of the applicant.
Bonus points: The task we called “Object Localization” demonstrates a weakness of this 2 by 2 grid.
- The weakness is that systems can be both classification and regression simultaneously. The grid implies that every AI system is just one or the other.
- Object localization could be placed in two of the quadrants or straddle the boundary between regression and classification on the discriminative row.

Remember…

The Lab Report is free and doesn’t even advertise. Our curricula are open source and published under a public domain license for anyone to use for any purpose. We’re also a very small team with no investors.
Help us keep providing these free services by scheduling one of our world-class trainings or requesting a custom class for your team.
