weekend ai reads for 2024-04-26

📰 ABOVE THE FOLD: BENCHMARKS

“The first thing to recognise is that it’s very hard to really properly evaluate models in the same way it’s very hard to properly evaluate humans,” said Mike Volpi, a partner at venture capital firm Index Ventures. “If you look at one thing like ‘can you jump high or run fast?’ it’s easy. But human intelligence? It’s almost an impossible task.”

Traditional benchmarks are often static or close-ended (e.g., MMLU multiple-choice QA) and do not meet the requirements outlined below. Meanwhile, models are evolving faster than ever, underscoring the need to build benchmarks with high separability.

We introduce Arena-Hard – a data pipeline to build high-quality benchmarks from live data in Chatbot Arena, which is a crowd-sourced platform for LLM evals. To measure its quality, we propose two key metrics:

1. Agreement with human preference: whether the benchmark score agrees closely with human preference.

2. Separability: whether the benchmark can confidently separate models.

  • detailed look into what it takes to develop a benchmark for a specific domain

    To fully utilize the power of LLMs in healthcare, it is crucial to develop and benchmark models using a setup specifically designed for the medical domain. This setup should take into account the unique characteristics and requirements of healthcare data and applications. The development of methods to evaluate medical LLMs is not just of academic interest but of practical importance, given the real-life risks they pose in the healthcare sector.

A.I. Has a Measurement Problem — Which A.I. system writes the best computer code or generates the most realistic image? Right now, there’s no easy way to answer those questions. / New York Times (8 minutes)

“All of these benchmarks are wrong, but some are useful,” he said. “Some of them can serve some utility for a fixed amount of time, but at some point, there’s so much pressure put on it that it reaches its breaking point.”

AI is getting so clever, so fast, that many of the benchmarks used up to this point are now obsolete. Indeed, researchers in this area are scrambling to develop new, more challenging benchmarks. To put it simply, AIs are getting so good at passing tests that we now need new tests – not to measure competence, but to highlight areas where humans and AIs still differ, and to find where we still have an advantage.

 

📻 QUOTE OF THE WEEK

I think that people don’t realize how much they expose by simply putting a picture out there.

Michal Kosinski, associate professor of organizational behavior at Stanford University’s Graduate School of Business (source) (the paper)

 

🏗️ FOUNDATIONS & CULTURE

When it comes to artificial intelligence, what are we actually creating? Even those closest to its development are struggling to describe exactly where things are headed, says Microsoft AI CEO Mustafa Suleyman, one of the primary architects of the AI models many of us use today. He offers an honest and compelling new vision for the future of AI, proposing an unignorable metaphor — a new digital species — to focus attention on this extraordinary moment.

What does it mean to make something totally new, fundamentally different to any invention that we’ve known before?

WHY AI Works (32:40) / YouTube

  • former Senior Vice President of Software Engineering at Apple

    Bertrand Serlet’s thoughts on WHY LLMs and AI in general work so well nowadays.

  • related (1), How does ChatGPT work? As explained by the ChatGPT team. / The Pragmatic Engineer (10 minutes)

  • related (2), What can LLMs never do? — On goal drift and lower reliability. Or, why can't LLMs play Conway's Game Of Life? / Strange Loop Canon (sorry) (30 minutes)

How I Created an AI Version of Myself / Keith McNulty, Medium (15 minutes)

Now that I have my documents of an appropriate length, I will need to load them into a vector database. A vector database stores text both in its original form and as embeddings: large arrays of floating point numbers that are fundamental to how large language models process language. Words, sentences, or documents that have ‘close’ embeddings in multidimensional space will be closely related to each other in content.
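“Closeness” between embeddings is usually measured with cosine similarity. A toy sketch of the idea, using tiny hand-made vectors (real embeddings would be high-dimensional outputs of an embedding model, not hand-written like these):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy 3-d "embeddings" keyed by document topic -- purely illustrative values.
docs = {
    "cats": [0.9, 0.1, 0.0],
    "kittens": [0.8, 0.2, 0.1],
    "tax law": [0.0, 0.1, 0.9],
}

query = [0.85, 0.15, 0.05]  # pretend embedding of the query "felines"
best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
```

A vector database performs essentially this nearest-neighbor search, just at scale and with indexing structures so it doesn’t have to compare the query against every stored vector.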

  • related, Why did I deepfake myself? To see if conversing with an AI-generated version of myself can lead to self-reflection, new insights into my thought patterns, and deep truths. (15:00) / Reid Hoffman, Twitter (sorry)

    • simultaneously bizarre and fascinating

    • Reid Hoffman has a 15-minute conversation with his AI avatar, trained on his own books, writings, talks, etc.

 

🎓 EDUCATION

ASU+GSV 2024 Conference Notes / On EdTech Newsletter (7 minutes)

What was encouraging for me is that we seem to be getting over that annoying moral panic phase of AI where everything seemed driven and dominated by a fear of cheating and the attempts to detect it. I believe we are in a new phase now where colleges and universities and vendors and investors are exploring what AI can do and how it might fit into education, but crucially we haven’t figured it out yet.

The AI Tools in Education Database / EdTech Insiders, Notion

This database is intended to be a community resource for educators, researchers, students, and other edtech specialists looking to stay up to date.

  • 320 products with a description and summary of features (e.g., “Designed for Learners”, “Homework help”)

All of the questions apply a culturally aware perspective rather than a traditional edtech adoption perspective (though the traditional perspective is another useful lens to evaluate this moment of rapid AI integration).

Kaiden AI – AI Teaching Assistant

Kai is your AI-powered teaching assistant, designed to save you time on lesson planning, content creation, and grading. It integrates chat, file uploads, and a robust knowledge base to streamline your workflow.

 

📊 DATA & TECHNOLOGY

Vana — The first network for user-owned data

  • aims to empower individuals to own and control their data by creating a market and infrastructure to sell aggregated data to model trainers, with the goal of preventing a centralized state where the power of AI is held by a few

  • as data starts getting real dollars assigned to it (see related), personal agency over data will be discussed more and more

  • related (1), Inside Big Tech's underground race to buy AI training data / Reuters (10 minutes)

    Many major market research firms say they have not even begun to estimate the size of the opaque AI data market, where companies often don't disclose agreements. Those researchers who do, such as Business Research Insights, put the market at roughly $2.5 billion now and forecast it could grow close to $30 billion within a decade.

  • related (2), Is there enough text to feed the AI beast? / Semafor (4 minutes)

    The amount of data on the internet is growing at a pace of about 7% per year, Epoch’s director, Jaime Sevilla, said, while the amount of data AI is being trained on is increasing at 200% per year. If the biggest models have ingested most of the content already, there won’t be much new information for them to learn from.

    “We’re less sure about how important it’s going to be to train only on high-quality data. We think that broader kinds of data might still be useful, perhaps not to the same degree, but they might still be enough to continue the pace of scaling so we have become a bit more optimistic,” Sevilla said.
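Taking the quoted growth rates at face value, you can back out how quickly training demand would overtake the supply of new text. The starting fraction used below (training sets currently covering 10% of available text) is purely illustrative, not a figure from the article:

```python
import math

def years_until_demand_exceeds_supply(start_fraction, supply_growth=0.07, demand_growth=2.00):
    """Years until demand overtakes supply, given demand starting at
    start_fraction of supply and growing 200%/yr (i.e., tripling annually),
    while supply grows 7%/yr. Solves f * 3^t = 1.07^t for t."""
    return math.log(1 / start_fraction) / math.log((1 + demand_growth) / (1 + supply_growth))
```

At the assumed 10% starting point this gives roughly 2.2 years; even starting from 1% it is only about 4.5 years, which is why the exact starting fraction barely matters against that kind of exponential gap.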

While the Phi-3 family of models knows some general knowledge, it cannot beat GPT-4 or another LLM in breadth — there’s a big difference in the kind of answers you can get from an LLM trained on the entirety of the internet versus a smaller model like Phi-3.

  • true; and small models will likely have a very important role in future AI-powered ecosystems

  • related (1), official statement, Introducing Phi-3: Redefining what's possible with SLMs / Microsoft Azure Blog

  • related (2), try it at HuggingChat (may require a free HuggingFace account)

  • unrelated (1), Apple releases eight small AI language models aimed at on-device use / Ars Technica (5 minutes)

    On Wednesday, Apple introduced a set of tiny source-available AI language models called OpenELM that are small enough to run directly on a smartphone. They're mostly proof-of-concept research models for now, but they could form the basis of future on-device AI offerings from Apple.

 

🎉 FUN and/or PRACTICAL THINGS

AI Image Generator — Free Text-to-Image Generator

  • select a style and enter a prompt, and it will extend the prompt accordingly

  • possibly a good way to learn image prompting techniques

  • related, updated diffusion model from Adobe Firefly

Limitless — Personalized AI powered by what you’ve seen, said, and heard

  • no

  • an AI-powered device that claims to enhance workplace productivity by automating tasks like meeting transcription and summarization

  • it seems like a privacy nightmare for non-consenting people since it is designed to continuously collect and analyze everything the wearer sees, says, and hears

  • related, The Ray-Ban Meta Smart Glasses have multimodal AI now / The Verge (6 minutes)

    • also no

MiniFigure AI — Turn Your Headshot Into a (Lego) MiniFigure

  • does what it advertises, while also generating some interesting glitches

 

🧿 AI-ADJACENT

function musicFor(task = 'programming') { return `A series of mixes intended for listening while ${task} to focus the brain and inspire the mind.`; }
  • good background music and great interface