- That AI Thing
weekend ai reads for 2024-04-26
📰 ABOVE THE FOLD: BENCHMARKS
Speed of AI development stretches risk assessments to breaking point / The Financial Times (6 minutes)
“The first thing to recognise is that it’s very hard to really properly evaluate models in the same way it’s very hard to properly evaluate humans,” said Mike Volpi, a partner at venture capital firm Index Ventures. “If you look at one thing like ‘can you jump high or run fast?’ it’s easy. But human intelligence? It’s almost an impossible task.”
From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline / LMSYS Org (10 minutes)
Traditional benchmarks are often static or close-ended (e.g., MMLU multi-choice QA), which do not satisfy the above requirements. On the other hand, models are evolving faster than ever, underscoring the need to build benchmarks with high separability.
We introduce Arena-Hard – a data pipeline to build high-quality benchmarks from live data in Chatbot Arena, which is a crowd-sourced platform for LLM evals. To measure its quality, we propose two key metrics:
1. Agreement to Human preference: whether the benchmark score has high agreement to human preference.
2. Separability: whether the benchmark can confidently separate models.
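a rough sketch of how the two metrics could be computed, assuming each model has a benchmark score with a confidence interval and a separate human-preference rating. The function names, the pairwise definitions, and all the numbers below are illustrative assumptions, not the Arena-Hard implementation:

```python
from itertools import combinations


def separability(intervals):
    """Fraction of model pairs whose score intervals do not overlap.

    intervals maps a model name to a (low, high) confidence interval.
    """
    pairs = list(combinations(intervals, 2))
    separated = sum(
        1
        for a, b in pairs
        if intervals[a][1] < intervals[b][0] or intervals[b][1] < intervals[a][0]
    )
    return separated / len(pairs)


def agreement(bench_scores, human_scores):
    """Fraction of model pairs that both score sets rank in the same order."""
    pairs = list(combinations(bench_scores, 2))
    same_order = sum(
        1
        for a, b in pairs
        if (bench_scores[a] - bench_scores[b]) * (human_scores[a] - human_scores[b]) > 0
    )
    return same_order / len(pairs)


# Made-up scores for three hypothetical models.
intervals = {"model_a": (70.0, 74.0), "model_b": (60.0, 65.0), "model_c": (63.0, 68.0)}
bench = {"model_a": 72.0, "model_b": 62.5, "model_c": 65.5}
human = {"model_a": 1250, "model_b": 1180, "model_c": 1150}

print(separability(intervals))
print(agreement(bench, human))
```

in this toy data, model_b and model_c have overlapping intervals and are ranked in opposite orders by the two score sets, so that pair counts against both metrics.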
The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare / HuggingFace Blog (11 minutes)
detailed look into what it takes to develop a benchmark for a specific domain
A.I. Has a Measurement Problem — Which A.I. system writes the best computer code or generates the most realistic image? Right now, there’s no easy way to answer those questions. / New York Times (8 minutes)
“All of these benchmarks are wrong, but some are useful,” he said. “Some of them can serve some utility for a fixed amount of time, but at some point, there’s so much pressure put on it that it reaches its breaking point.”
AI now surpasses humans in almost all performance benchmarks / New Atlas (7 minutes)
AI is getting so clever, so fast, that many of the benchmarks used to this point are now obsolete. Indeed, researchers in this area are scrambling to develop new, more challenging benchmarks. To put it simply, AIs are getting so good at passing tests that now we need new tests – not to measure competence, but to highlight areas where humans and AIs are still different, and find where we still have an advantage.
📻 QUOTE OF THE WEEK
I think that people don’t realize how much they expose by simply putting a picture out there.
Michal Kosinski, associate professor of organizational behavior at Stanford University’s Graduate School of Business (source) (the paper)
🏗️ FOUNDATIONS & CULTURE
Mustafa Suleyman: AI is turning into something totally new (22:00) / TED Talk
When it comes to artificial intelligence, what are we actually creating? Even those closest to its development are struggling to describe exactly where things are headed, says Microsoft AI CEO Mustafa Suleyman, one of the primary architects of the AI models many of us use today. He offers an honest and compelling new vision for the future of AI, proposing an unignorable metaphor — a new digital species — to focus attention on this extraordinary moment.
What does it mean to make something totally new, fundamentally different to any invention that we’ve known before?
WHY AI Works (32:40) / YouTube
former Senior Vice President of Software Engineering at Apple
related (1), How does ChatGPT work? As explained by the ChatGPT team. / The Pragmatic Engineer (10 minutes)
related (2), What can LLMs never do? — On goal drift and lower reliability. Or, why can't LLMs play Conway's Game Of Life? / Strange Loop Canon (sorry) (30 minutes)
How I Created an AI Version of Myself / Keith McNulty, Medium (15 minutes)
Now that I have my documents of an appropriate length, I will need to load them to a vector database. A vector database stores text in both its original form, but also as embeddings, which are large arrays of floating point numbers, fundamental to how large language models process language. Words, sentences or documents that have ‘close’ embeddings in multidimensional space will be closely related to each other in content.
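a minimal sketch of what 'close' embeddings means in practice, using cosine similarity as the closeness measure. The three-dimensional vectors and document names are made up for illustration (real embeddings have hundreds or thousands of dimensions and come from an embedding model), and this is not the author's actual setup:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def nearest(query, docs):
    """Name of the stored document whose embedding is closest to the query."""
    return max(docs, key=lambda name: cosine_similarity(query, docs[name]))


# Hypothetical toy embeddings standing in for a vector database.
docs = {
    "resume": [0.9, 0.1, 0.0],
    "cover letter": [0.8, 0.2, 0.1],
    "recipe": [0.0, 0.2, 0.9],
}
query = [0.7, 0.3, 0.1]

print(nearest(query, docs))
```

a vector database does essentially this lookup at scale, usually with an approximate index instead of comparing the query against every stored vector.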
related, Why did I deepfake myself? To see if conversing with an AI-generated version of myself can lead to self-reflection, new insights into my thought patterns, and deep truths. (15:00) / Reid Hoffman, Twitter (sorry)
simultaneously bizarre and fascinating
Reid Hoffman has a 15-minute conversation with his AI avatar, trained on his own books, writings, talks, etc.
🎓 EDUCATION
ASU+GSV 2024 Conference Notes / On EdTech Newsletter (7 minutes)
What was encouraging for me is that we seem to be getting over that annoying moral panic phase of AI where everything seemed driven and dominated by a fear of cheating and the attempts to detect it. I believe we are in a new phase now where colleges and universities and vendors and investors are exploring what AI can do and how it might fit into education, but crucially we haven’t figured it out yet.
The AI Tools in Education Database / EdTech Insiders, Notion
This database is intended to be a community resource for educators, researchers, students, and other edtech specialists looking to stay up to date.
320 products with a description and summary of features (e.g., “Designed for Learners”, “Homework help”)
Creating a Culture Around AI: Thoughts and Decision-Making / Educause Review (18 minutes)
All of the questions apply a culturally aware perspective rather than a traditional edtech adoption perspective (though the traditional perspective is another useful lens to evaluate this moment of rapid AI integration).
Kaiden AI – AI Teaching Assistant
Kai is your AI-powered teaching assistant, designed to save you time on lesson planning, content creation, and grading. It integrates chat, file uploads, and a robust knowledge base to streamline your workflow.
📊 DATA & TECHNOLOGY
Vana — The first network for user-owned data
aims to empower individuals to own and control their data by creating a market and infrastructure to sell aggregated data to model trainers, with the goal of preventing a centralized state where the power of AI is held by a few
as data start getting real dollar values assigned to them (see related), personal agency over data will be discussed more
related (1), Inside Big Tech's underground race to buy AI training data / Reuters (10 minutes)
related (2), Is there enough text to feed the AI beast? / Semafor (4 minutes)
Microsoft launches Phi-3, its smallest AI model yet / The Verge (3 minutes)
While the Phi-3 family of models knows some general knowledge, it cannot beat a GPT-4 or another LLM in breadth — there’s a big difference in the kind of answers you can get from a LLM trained on the entirety of the internet versus a smaller model like Phi-3.
true; and small models will likely have a very important role in future AI-powered ecosystems
related (1), official statement, Introducing Phi-3: Redefining what's possible with SLMs / Microsoft Azure Blog
related (2), try it at HuggingChat (may require a free HuggingFace account)
unrelated (1), Apple releases eight small AI language models aimed at on-device use / Ars Technica (5 minutes)
Mark Zuckerberg - Llama 3, Open Sourcing $10b Models, & Caesar Augustus (1:17:54) / Dwarkesh Podcast (sorry)
🎉 FUN and/or PRACTICAL THINGS
AI Image Generator — Free Text-to-Image Generator
select a style and enter a prompt and it will extend the prompt accordingly
possibly a good way to learn image prompting techniques
related, updated diffusion model from Adobe Firefly
Limitless — Personalized AI powered by what you’ve seen, said, and heard
no
an AI-powered device that claims to enhance workplace productivity by automating tasks like meeting transcription and summarization
it seems like a privacy nightmare for non-consenting people since it is designed to continuously collect and analyze everything the wearer sees, says, and hears
related, The Ray-Ban Meta Smart Glasses have multimodal AI now / The Verge (6 minutes)
also no
MiniFigure AI — Turn Your Headshot Into a (Lego) MiniFigure
does what it advertises, while also generating some interesting glitches
🧿 AI-ADJACENT
function musicFor(task = 'programming') { return `A series of mixes intended for listening while ${task} to focus the brain and inspire the mind.`; }
good background music and great interface
⋄