Imagine that you’re listening to a podcast and the guest says something particularly insightful. Maybe it relates to something you’re working on, thinking about, or just want to come back to. If only you could snap your fingers and have that insight saved for later.
This is an idea that I’ve kicked around for a while. It also turns out to be unoriginal; the Fathom podcast app does this via its clipping feature:
The basic utility of bookmarking audio segments is obvious. What interests me now is what happens after you save segments of audio.
Can this insight be automatically distilled, collated, and interlinked with everything you’ve saved before it? Does automating this reduce some of the effort involved in understanding new information?
I think the answer is no. The mental, menial task of truly understanding an idea or new concept and linking it to others can’t be outsourced, no matter how advanced the computer doing the work is or will become.
I used OpenAI’s Whisper to transcribe three podcast episode segments (across shows and genres) and GPT-4 (via ChatGPT and GPT Builder) to manipulate the resulting text. I also built out a podcast app prototype that takes the source audio and transcript, then handles the ‘stamping’: forming the distilled insight.
Here’s a simplified example of what Whisper produced from the source audio:
"text": "All right, so the so the famous mediumist message I find that quote really opaque I I never connected for me, but there's something else McLuhan says in that same book understanding media that that I love so he says...",
"text": "All right, so the so the famous mediumist message I find that quote really opaque"
"text": "I I never connected for me, but there's something else McLuhan says in that same book understanding media that that I love so he says..."
GPT-4 could then read this JSON file and distill insights within segments based on my prior instructions and a timestamp. This was surprisingly straightforward with GPT Builder.
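As a rough sketch of that cross-referencing step: Whisper’s segment-level output pairs each piece of text with `start`/`end` timestamps, so a stamp’s timestamp can be matched to nearby segments and those segments pasted into the distillation prompt. The segment data below mirrors the excerpt above; `segments_around`, the window size, and the prompt wording are all my hypothetical choices, not the app’s actual implementation.

```python
import json

# Whisper's segment-level output pairs text with start/end timestamps.
# Values here are illustrative, mirroring the excerpt above.
transcript = json.loads("""
{
  "segments": [
    {"start": 0.0, "end": 4.8,
     "text": "All right, so the so the famous mediumist message I find that quote really opaque"},
    {"start": 4.8, "end": 11.2,
     "text": "I I never connected for me, but there's something else McLuhan says in that same book understanding media that that I love so he says..."}
  ]
}
""")

def segments_around(segments, stamp, window=30.0):
    """Return the segments within `window` seconds of the stamped time,
    to be included in the distillation prompt as context."""
    return [s for s in segments
            if s["end"] >= stamp - window and s["start"] <= stamp + window]

# A stamp at 5 seconds pulls in both segments as context.
context = segments_around(transcript["segments"], stamp=5.0)
prompt = ("Distil the key insight from this excerpt:\n"
          + "\n".join(s["text"] for s in context))
```

The window is a blunt instrument; in practice the right context boundary is semantic, not temporal, which is part of why hand-merging (below) worked better for prototyping.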
Here’s what I found provided the best results:
And here’s what that looks like in practice:
This would work well in an API or production setting but was impractical for my immediate prototyping purposes. I instead manually merged related segments and their timestamps, and then generated the summaries of each of those clusters. That way, when the podcast player stamped at a particular timestamp, it could just cross-reference a prepared summary.
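That lookup can be sketched minimally as follows. The cluster boundaries and summaries are hypothetical stand-ins for the hand-merged ones; the point is only that a stamp’s timestamp resolves to a prepared summary via its time range.

```python
import bisect

# Hypothetical, hand-merged clusters: (start_sec, end_sec, prepared_summary).
clusters = [
    (0.0, 95.0,
     "McLuhan's 'the medium is the message' and why it reads as opaque."),
    (95.0, 210.0,
     "The passage from Understanding Media that the guest prefers."),
]

def summary_for(stamp):
    """Return the prepared summary whose time range contains the stamp."""
    starts = [c[0] for c in clusters]
    i = bisect.bisect_right(starts, stamp) - 1
    if i >= 0 and clusters[i][0] <= stamp < clusters[i][1]:
        return clusters[i][2]
    return None  # stamp falls outside any prepared cluster
```

Because the summaries are precomputed, the player does no model calls at stamp time; it only performs this range lookup.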
Assembling by hand also meant I could more easily iterate over alternate prompts for the creation of each summary. This was important because although the results were usable, they didn’t seem all that useful.
This lack of usefulness became even more pronounced when the summary types were placed in the Stamper app prototype.
I could never capture an insight in a way that would make sense to someone coming to it afresh, let alone to myself, having already heard the source material verbatim.
Comparing the AI-produced summaries against the source audio, you can see it has done exactly as asked. And AI will only get better at this type of task in time. But I think there’s a bigger issue at hand than the quality of summation.
In 2003, Dutch cognitive psychologist Christof van Nimwegen ran several experiments on the effects of computer-aided learning. In each experiment, two groups were given a problem to solve. One group received software assistance, the other did not.
The results are summarised nicely by Nicholas Carr in his book The Shallows:
No matter how good the AI-generated summation, collation, and interlinking gets, we still won’t understand the original ideas any better. In fact, we’ll probably get worse at comprehending them as our own underutilised mental muscles atrophy.
It sounds obvious in hindsight that doing the rote work of transcribing, summarising, and interlinking manually is a necessary step in understanding new material. Casey Newton wrote something similar a few months ago in his Platformer newsletter: just because computers can do our thinking for us doesn’t mean they should.
Computers are great. But we rely on them for everything at our peril.
A good rule of thumb might be to maintain human agency, relying on automation and AI assistance only when it is in service of that agency. We’ll know we’ve broken the rule when we find ourselves subservient to the automation or assistance (or driving into oceans).
Examples of ‘healthy’ AI that come to mind include Apple’s and Google’s intelligent photo recognition. I also think of Fathom, the podcast app I mentioned earlier, which allows natural-language search across all podcasts for topic, theme, and specific dialogue.
I suppose I’m conflating ‘healthy AI’ with assistance that is largely grunt work, done behind the scenes. This reminds me of the “ChatGPT as an intern” analogy, used to make the point that the current state of the technology is only so useful.
I wonder if that is the right way to treat virtual aids permanently, irrespective of how powerful they become. Artificial assistance should be used for additive work that we otherwise wouldn’t do, freeing us humans up for more agency, not less.
I’ve come back to this post to link to Jess Fong’s excellent Vox video ‘AI can do your homework. Now what?’. The science-of-learning chapter hits these same notes. It also more neatly proposes a healthy relationship with AI: using it to test thinking that happens elsewhere, manually.
I judged response quality on how well it could jog my memory of the point discussed. I also judged how well the summation might work for someone coming to it afresh: could they understand the point without having to listen to the source? ↩︎