notebooklm-nonsense.mp4

Speaker 2: [0:00:00.0] Okay, so today we're diving into something pretty critical

Speaker 2: [0:00:03.0] in the modern tech landscape.

Speaker 2: [0:00:06.1] How do you get really good AI transcription,

Speaker 2: [0:00:08.7] fast, accurate, but crucially,

Speaker 2: [0:00:11.1] without sending potentially sensitive stuff,

Speaker 2: [0:00:13.1] audio video up to some third party cloud service?

Speaker 2: [0:00:16.2] This is our deep dive into the documentation

Speaker 2: [0:00:18.6] for a tool set called Transcribe with Whisper.

Speaker 2: [0:00:21.3] It's all about self-hosting.

Speaker 1: [0:00:23.0] Exactly. It's aiming to solve that core data sovereignty problem, keeping your data completely

Speaker 1: [0:00:29.1] under your control. And what's really interesting here, the source material, the README,

Speaker 2: [0:00:27.8] Hmm.

Speaker 1: [0:00:33.6] States their goal pretty boldly. They want to make this advanced AI accessible,

Speaker 1: [0:00:38.0] even if your technical skill is basically just installing programs and clicking buttons.

Speaker 1: [0:00:42.5] That's their quote more or less.

Speaker 2: [0:00:44.0] Hmm, that is ambitious, especially with AI tools they often have these really complex

Speaker 2: [0:00:48.1] dependencies, right?

Speaker 2: [0:00:49.1] So does the documentation actually show they pulled it off, hiding the complexity while,

Speaker 2: [0:00:53.8] you know, keeping that privacy promise for the data?

Speaker 1: [0:00:56.0] It seems like they really have.

Speaker 1: [0:00:57.5] The unique part is how it all fits together.

Speaker 1: [0:00:59.5] It's not just using a powerful model

Speaker 1: [0:01:01.4] like say, faster whisper for the accuracy.

Speaker 1: [0:01:04.4] It intelligently bundles that with other vital things,

Speaker 1: [0:01:08.7] like speaker separation, knowing who spoke when

Speaker 1: [0:01:11.4] and making sure the output plays nice

Speaker 1: [0:01:13.3] with research software or analytical tools.

Speaker 1: [0:01:17.6] Think beyond just basic text.

Speaker 2: [0:01:19.3] Right, okay, let's translate that.

Speaker 2: [0:01:21.5] So the process, it sounds clean enough,

Speaker 2: [0:01:24.0] you feed it a video file MP4 MOV MKV,

Speaker 2: [0:01:26.7] it turns that into an audio only WAV file first.

Speaker 2: [0:01:30.0] Is that conversion step okay?

Speaker 2: [0:01:31.6] Like no quality loss?

Speaker 1: [0:01:33.0] Yeah, it's designed to be lossless, just stripping the video out.

Speaker 1: [0:01:35.8] But the real magic, the clever part happens next.

Speaker 1: [0:01:38.2] It's not just transcribing one long audio stream.

Speaker 1: [0:01:40.8] It figures out who was speaking.

Speaker 1: [0:01:42.2] That's called speaker diarization.

Speaker 2: [0:01:43.6] Direwasation. Okay.

Speaker 1: [0:01:44.5] That's handled by a specific free AI model from Huggieface,

Speaker 1: [0:01:48.2] it's called Piano Speaker Diarization.

Speaker 1: [0:01:50.2] This model runs first, slices up the audio based on who's talking,

Speaker 1: [0:01:53.5] then feeds those speaker-specific chunks to the transcription AI.

Speaker 2: [0:01:56.6] Ah, okay. That makes a huge difference compared to just getting a wall of text you have to sort

Speaker 2: [0:02:03.1] out later. So for you, the listener running this, what do you actually get on your computer at the end?

Speaker 1: [0:02:08.4] You get two main things, primarily.

Speaker 1: [0:02:10.8] First, there's this really neat interactive HTML file.

Speaker 1: [0:02:13.9] It shows the video player right next to the transcript text.

Speaker 2: [0:02:16.5] In the cool part, the documentation says you can click a word in the text.

Speaker 1: [0:02:19.8] Yeah, click any word and bam, the video jumps straight to that exact spot.

Speaker 1: [0:02:24.6] You do need both the HTML file and the original video file together for that to work, obviously.

Speaker 2: [0:02:24.3] Hmm.

Speaker 2: [0:02:29.6] That's incredibly useful for a review.

Speaker 2: [0:02:31.7] But what about for more serious analysis?

Speaker 1: [0:02:34.4] Right, that's the second output, an automatically generated

Speaker 1: [0:02:37.3] DOCX file, a Word document.

Speaker 1: [0:02:40.5] And this one is structured for analysis.

Speaker 1: [0:02:42.6] It includes the speaker label

Speaker 1: [0:02:44.2] and the precise timestamp for each segment.

Speaker 1: [0:02:46.5] So it's basically ready to go for import into research software

Speaker 1: [0:02:49.1] like MxQDA right out of the box.

Speaker 2: [0:02:51.3] Okay, that's the game changer, because usually, yeah, you get raw text and spend hours

Speaker 2: [0:02:55.5] syncing it up, marking speakers, having the speaker tracking and the timestamps built

Speaker 2: [0:02:59.4] in automatically, that saves so much time, huge value boost for actual analysis or even

Speaker 2: [0:03:05.1] just editing.

Speaker 2: [0:03:06.1] All right, so the output is powerful, but what about getting it installed?

Speaker 2: [0:03:10.8] You mentioned accessibility being key.

Speaker 2: [0:03:13.4] It's a Python tool, which can sometimes mean tricky setups.

Speaker 2: [0:03:16.5] How do they handle that?

Speaker 1: [0:03:17.5] Well, they were smart about it, they basically offered two doors, depending on your technical

Speaker 1: [0:03:20.9] comfort level.

Speaker 2: [0:03:21.5] Okay.

Speaker 1: [0:03:22.5] Path number one is the direct Python route.

Speaker 1: [0:03:25.3] If you're comfortable with command lines, terminals,

Speaker 1: [0:03:27.8] package managers, you just use PIP.

Speaker 1: [0:03:30.5] PIP three install, transcribe with whisper.

Speaker 1: [0:03:33.0] Simple enough for Python folks,

Speaker 1: [0:03:34.5] you do also need FFM Peg installed separately.

Speaker 1: [0:03:37.5] That's the standard tool for handling

Speaker 1: [0:03:38.9] audio video conversions, which this relies on.

Speaker 2: [0:03:41.2] Okay, standard for developers maybe, but what about the person from their mission statement,

Speaker 2: [0:03:45.5] the one who just clicked stuff?

Speaker 1: [0:03:47.6] Ah, right.

Speaker 1: [0:03:49.0] For them, the strongly recommended path is Docker,

Speaker 1: [0:03:52.7] either Docker Desktop or just the Docker Engine.

Speaker 1: [0:03:56.1] The documentation basically says if you're asking

Speaker 1: [0:03:58.4] what's Python, then Docker is probably your answer.

Speaker 2: [0:04:01.2] Docker. So you install this one application, Docker desktop, and it kind of sidesteps all the Python

Speaker 2: [0:04:06.7] version issues and dependencies. Is that the idea?

Speaker 1: [0:04:09.8] That's exactly it.

Speaker 1: [0:04:10.9] It's the secret weapon for ease of use here.

Speaker 1: [0:04:13.2] Instead of you managing all the bits and pieces,

Speaker 1: [0:04:15.3] you run a simple command line instruction.

Speaker 1: [0:04:17.7] That pulls down a pre-packaged Docker container.

Speaker 1: [0:04:20.9] Think of it like a mini computer

Speaker 1: [0:04:22.4] with everything already set up inside.

Speaker 1: [0:04:24.1] It's kept up to date by the developer.

Speaker 2: [0:04:25.9] And does that give you a graphical interface?

Speaker 1: [0:04:27.9] Often yes.

Speaker 1: [0:04:28.9] The setup usually includes a web user interface.

Speaker 1: [0:04:31.3] You just open your web browser to a local address

Speaker 1: [0:04:33.4] like http.logalhost.5001.

Speaker 1: [0:04:36.8] And right there in your browser,

Speaker 1: [0:04:37.8] you can upload your file, maybe type in the speaker names

Speaker 1: [0:04:40.1] if you know them and click go.

Speaker 1: [0:04:42.0] It hides all the command line stuff.

Speaker 2: [0:04:43.3] That sounds much more user-friendly, does there any catch?

Speaker 1: [0:04:46.3] It's a small one for Windows users.

Speaker 1: [0:04:48.5] If you're on Windows and using Docker, you need to install WSL first.

Speaker 1: [0:04:52.0] That's the Windows subsystem for Linux.

Speaker 1: [0:04:54.5] Docker on Windows relies on WSL to run properly.

Speaker 2: [0:04:56.9] Okay, WSEL first, then Docker, got it.

Speaker 2: [0:05:00.3] But you mentioned something absolutely required

Speaker 2: [0:05:02.3] regardless of the path.

Speaker 1: [0:05:03.7] Yes, whether you go Python or Docker, there's one critical step before you can transcribe

Speaker 1: [0:05:09.3] anything with speaker separation.

Speaker 1: [0:05:11.6] You need a hugging face off token.

Speaker 1: [0:05:13.9] It's non-negotiable for the diurization part.

Speaker 2: [0:05:16.4] hugging face.

Speaker 2: [0:05:17.3] Okay, they host a lot of AI models.

Speaker 2: [0:05:19.8] Why is a token needed for this specific tool

Speaker 2: [0:05:22.9] if it's running locally?

Speaker 1: [0:05:24.6] Good question.

Speaker 1: [0:05:25.7] It's not actually for the transcription part

Speaker 1: [0:05:27.5] using FastWisper.

Speaker 1: [0:05:28.7] That runs fine offline, no token needed.

Speaker 1: [0:05:31.4] The token is required specifically to download the AI models

Speaker 1: [0:05:34.6] that do the speaker diurization,

Speaker 1: [0:05:36.0] that speaker separation we talked about.

Speaker 2: [0:05:37.4] Ah, so the piano models need it.

Speaker 1: [0:05:39.0] Exactly. Those specific models, Piano Speaker Diarrhization 3.1 and Peno segmentation 3.0

Speaker 1: [0:05:45.4] are what they call gated models on hugging face. You need to prove you have permission to

Speaker 1: [0:05:48.9] download them even though they're free. The token acts as your key.

Speaker 2: [0:05:52.5] Okay, so this is crucial for anyone setting it up.

Speaker 2: [0:05:55.6] Let's walk through the steps based on the docs.

Speaker 2: [0:05:58.1] First, create a free hugging face account, simple enough.

Speaker 2: [0:06:02.6] Then, and this sounds like the step people might miss,

Speaker 2: [0:06:05.6] you have to go to the pages

Speaker 2: [0:06:06.9] for both of those P&O models.

Speaker 1: [0:06:08.2] Right, both speaker diurization 3.1 and segmentation 3.0.

Speaker 2: [0:06:12.4] And click the button that says something like, use this model or request access and agree

Speaker 2: [0:06:17.5] to their terms of service.

Speaker 2: [0:06:19.5] The docs warn that if you get a 401 or 403 error later, it means you probably skipped this

Speaker 2: [0:06:24.1] explicit acceptance step.

Speaker 1: [0:06:25.8] Precisely.

Speaker 1: [0:06:26.7] Once you've accepted the terms for both models,

Speaker 1: [0:06:29.0] you go to your hug and face account settings

Speaker 1: [0:06:31.0] and create an access token.

Speaker 1: [0:06:32.2] Make sure it's a read token.

Speaker 1: [0:06:33.5] It'll be a string of characters starting with HF.

Speaker 2: [0:06:35.5] And then you have to tell your computer about this token.

Speaker 1: [0:06:37.2] Yes, you set it as an environment variable.

Speaker 1: [0:06:39.4] The variable name needs to be exactly hugging face off token.

Speaker 1: [0:06:43.3] How you set that depends on your OS.

Speaker 1: [0:06:46.0] The dogs give a helpful tip for Windows users.

Speaker 1: [0:06:48.4] On Windows, using the standard set command in the command prompt

Speaker 1: [0:06:52.2] only makes the variable temporary for that session.

Speaker 1: [0:06:54.9] If you wanted to stick around permanently

Speaker 1: [0:06:56.5] so you don't have to set it every time,

Speaker 1: [0:06:58.2] you should use the set as command instead.

Speaker 2: [0:06:47.8] Hello.

Speaker 2: [0:06:59.9] Ah, good tip.

Speaker 2: [0:07:01.4] Set X for permanent on Windows.

Speaker 2: [0:07:03.4] Okay, so token set, permissions granted, then you run the tool.

Speaker 1: [0:07:08.4] than you were on the tool.

Speaker 1: [0:07:09.3] And just ahead of the very first time you run it,

Speaker 1: [0:07:11.2] my take a little while,

Speaker 1: [0:07:12.1] I have to download those fairly large P&O models

Speaker 1: [0:07:14.3] from hugging face using your token.

Speaker 1: [0:07:17.1] Could be a few minutes depending on your internet.

Speaker 2: [0:07:18.9] But after the first time, much.

Speaker 1: [0:07:19.8] much faster.

Speaker 1: [0:07:20.6] Once the models are downloaded, they're cached locally on your machine.

Speaker 1: [0:07:24.2] Subsequent runs just use the local copies, so they should be significantly quicker.

Speaker 2: [0:07:28.7] Okay, good to know.

Speaker 2: [0:07:30.8] So we covered the main outputs,

Speaker 2: [0:07:32.7] the Interactive HTML and the Analysis Ready DOCX.

Speaker 2: [0:07:36.3] Are there any other details or maybe caveats mentioned

Speaker 2: [0:07:40.2] in the documentation?

Speaker 1: [0:07:41.4] a few practical points.

Speaker 1: [0:07:43.2] They mentioned the HTML output isn't stuck as HTML.

Speaker 1: [0:07:47.0] There's actually an included helper script,

Speaker 1: [0:07:49.0] HTML to docs, that can convert that interactive HTML file

Speaker 1: [0:07:53.7] into other formats if you need them,

Speaker 1: [0:07:55.2] like ODTE for Libra Office or even PDF.

Speaker 2: [0:07:58.2] Oh, handy. Flexibility's good.

Speaker 1: [0:07:59.8] What?

Speaker 2: [0:07:59.9] What about other standard subtitle formats?

Speaker 1: [0:08:02.1] It does also generate VTT files, you know the standard web video text tracks format that lots of players use for captains.

Speaker 2: [0:08:08.0] Right, see your VTTs too.

Speaker 1: [0:08:09.7] You do, but this is a critical distinction they make.

Speaker 1: [0:08:12.4] The VTT files generated by this tool

Speaker 1: [0:08:14.9] do not contain the speaker information.

Speaker 1: [0:08:17.1] They're just the time text.

Speaker 1: [0:08:18.2] Find for basic subtitles, maybe.

Speaker 1: [0:08:20.1] But totally useless if you need to know who said what,

Speaker 1: [0:08:22.5] which is often the whole point for researcher detailed review.

Speaker 1: [0:08:25.4] For that, the DOC X is the way to go.

Speaker 2: [0:08:27.1] got it. VTT for basic captions, DOCX for the real analysis with speaker data

Speaker 2: [0:08:31.7] that makes sense. Anything else? Platform quirks.

Speaker 1: [0:08:34.5] Yeah, and note about Max, if you're running on Apple,

Speaker 1: [0:08:37.6] Silicon, M1, M2, M3, M4 chips,

Speaker 1: [0:08:41.3] the default setup works fine,

Speaker 1: [0:08:43.1] but the performance might be a bit slower

Speaker 1: [0:08:45.0] than you'd expect from such powerful hardware.

Speaker 1: [0:08:47.6] Apparently to really leverage the GPU

Speaker 1: [0:08:49.6] or the neural engine on those Max for faster processing,

Speaker 1: [0:08:52.6] you'd need to install some optional

Speaker 1: [0:08:54.0] more specialized Python packages with Core ML support.

Speaker 1: [0:08:57.3] The basic setup doesn't do that automatically.

Speaker 2: [0:08:59.3] Okay, so power users on Mac might need an extra step for peak speed.

Speaker 2: [0:09:03.0] What about Windows?

Speaker 2: [0:09:04.1] Aside from the WSL thing for Docker.

Speaker 1: [0:09:06.0] Yes, this part was quite amusing in the documentation. There's a section on Windows installation.

Speaker 1: [0:09:11.8] And the author includes a note stating very clearly that the Windows instructions were generated by

Speaker 1: [0:09:16.6] chatGPT.

Speaker 2: [0:09:17.7] Seriously, generated by chat GPT.

Speaker 1: [0:09:19.8] Yep, and followed by the disclaimer that they are not tested by the author personally.

Speaker 1: [0:09:24.3] Apparently the author's own experience with Windows ended, and I quote, around Windows

Speaker 2: [0:09:29.3] Wow, okay, points for honesty, I guess.

Speaker 2: [0:09:31.2] That's refreshingly transparent.

Speaker 1: [0:09:31.4] Yeah.

Speaker 1: [0:09:33.0] It really is.

Speaker 1: [0:09:33.9] It tells you a front.

Speaker 1: [0:09:35.1] Look, this is primarily built and tested on Linux and Mac.

Speaker 1: [0:09:38.1] The Windows stuff is provided as is.

Speaker 1: [0:09:40.1] Maybe it works, maybe it doesn't.

Speaker 1: [0:09:41.5] We didn't really check.

Speaker 1: [0:09:42.8] You don't often see that level of candor

Speaker 1: [0:09:45.1] in open-source project docs.

Speaker 1: [0:09:46.8] It manages expectations perfectly.

Speaker 2: [0:09:42.5] Yeah.

Speaker 2: [0:09:48.6] Absolutely. Okay, so let's try and synthesize this. What's the big picture here?

Speaker 1: [0:09:52.2] I think the key takeaway is how this tool manages to sort of democratize some pretty sophisticated

Speaker 1: [0:09:58.3] AI.

Speaker 1: [0:09:59.3] You get high quality transcription from Whisper, plus that really crucial speaker diurization

Speaker 1: [0:10:04.0] layer.

Speaker 1: [0:10:05.0] And it's packaged up especially via Docker in a way that lets people who aren't AI experts

Speaker 1: [0:10:10.2] or Python wizards actually use it.

Speaker 2: [0:10:12.8] Uh huh.

Speaker 1: [0:10:13.2] all while keeping the data completely private on their own machine.

Speaker 2: [0:10:16.5] Right, it bridges the gap between powerful AI and practical usability for privacy conscious

Speaker 1: [0:10:22.3] Exactly. It takes these complex open-source bits and pieces and integrates them into something genuinely

Speaker 1: [0:10:27.8] useful and accessible for, you know, researchers, journalists, anyone needing private transcription.

Speaker 2: [0:10:32.6] Okay, so for final thought, here's something interesting that this whole set of highlights.

Speaker 2: [0:10:36.6] You get this powerful, local, privacy-preserving AI tool, fantastic.

Speaker 2: [0:10:42.2] But you can only unlock its full potential, specifically that speaker separation, after you

Speaker 2: [0:10:47.2] perform this little digital handshake with hugging face.

Speaker 2: [0:10:49.7] You need that free account.

Speaker 2: [0:10:50.7] You need to accept terms.

Speaker 2: [0:10:51.7] You get that token.

Speaker 2: [0:10:52.7] So the only real gatekeeper to running this advanced self-hosted AI isn't technical skill

Speaker 2: [0:10:57.4] anymore.

Speaker 2: [0:10:58.4] Thanks to Docker.

Speaker 1: [0:10:58.7] Hmm.

Speaker 2: [0:10:59.0] Having this free online account and agreeing to the terms for the models, what does that imply for the future as more powerful specialized AI models become available, but maybe gated behind similar simple free but required accounts.

Speaker 2: [0:11:13.0] What does that mean for access, for control, even for privacy in the long run? That's definitely something interesting for you to mull over.

notebooklm-nonsense (DEMO MODE)