Speaker 2:
in the modern tech landscape.
Speaker 2:
How do you get really good AI transcription,
Speaker 2:
fast, accurate, but crucially,
Speaker 2:
without sending potentially sensitive stuff off your own machine?
Speaker 2:
This is our deep dive into the documentation
Speaker 2:
It's all about self-hosting.
Speaker 2:
Hmm.
Speaker 1:
That's their quote more or less.
Speaker 2:
So they've cut out the cloud dependencies, right?
Speaker 1:
It seems like they really have.
Speaker 1:
The unique part is how it all fits together.
Speaker 1:
It's not just using a powerful model
Speaker 1:
like, say, faster-whisper for the accuracy.
Speaker 1:
It's also about making sure the output plays nice
Speaker 1:
with research software or analytical tools.
Speaker 1:
Think beyond just basic text.
Speaker 2:
Right, okay, let's translate that.
Speaker 2:
So the process, it sounds clean enough,
Speaker 2:
you feed it a video file, MP4, MOV, MKV,
Speaker 2:
Is that conversion step okay?
Speaker 2:
Like no quality loss?
Speaker 1:
It figures out who was speaking.
Speaker 1:
That's called speaker diarization.
Speaker 2:
Diarization, okay.
Speaker 1:
it's called pyannote speaker diarization.
Speaker 1:
You get two main things, primarily.
Speaker 2:
Hmm.
Speaker 2:
That's incredibly useful for a review.
Speaker 2:
But what about for more serious analysis?
Speaker 1:
DOCX file, a Word document.
Speaker 1:
And this one is structured for analysis.
Speaker 1:
It includes the speaker label
Speaker 1:
and the precise timestamp for each segment.
Speaker 1:
like MAXQDA right out of the box.
Speaker 2:
So it's built for actual analysis, not just editing.
Speaker 2:
You mentioned accessibility being key.
Speaker 2:
How do they handle that?
Speaker 1:
Well, they were smart about it, they basically offered two
doors, depending on your technical
Speaker 1:
comfort level.
Speaker 2:
Okay.
Speaker 1:
Path number one is the direct Python route.
Speaker 1:
package managers, you just use pip.
Speaker 1:
pip3 install transcribe-with-whisper.
Speaker 1:
Simple enough for Python folks,
Speaker 1:
That's the standard tool for handling Python packages.
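[Reference sketch for the Python route just described. The package name is taken from the tool's name as spoken here, and the second line is a hypothetical invocation, so check the project's docs for the exact names:
    pip3 install transcribe-with-whisper
    transcribe-with-whisper interview.mp4]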
Speaker 2:
Okay, standard for developers maybe, but what about the person
from their mission statement,
Speaker 2:
the one who just clicked stuff?
Speaker 1:
Ah, right.
Speaker 1:
That's where path number two, Docker, comes in.
Speaker 1:
It's the secret weapon for ease of use here.
Speaker 1:
you run a simple command line instruction.
Speaker 1:
Think of it like a mini computer
Speaker 1:
with everything already set up inside.
Speaker 1:
It's kept up to date by the developer.
Speaker 1:
Often yes.
Speaker 1:
like http://localhost:5001.
Speaker 1:
And right there in your browser,
Speaker 1:
you just type in the speaker names, if you know them, and click go.
Speaker 1:
It hides all the command line stuff.
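[Reference sketch of the Docker route just described. The image name and token variable here are placeholders, not taken from the docs; only the port matches the localhost:5001 address mentioned above:
    docker run -p 5001:5001 -e HF_TOKEN=<your-hugging-face-token> <project-image>
    then open http://localhost:5001 in your browser]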
Speaker 1:
There's one small catch for Windows users, you need WSL first.
Speaker 1:
That's the Windows subsystem for Linux.
Speaker 2:
Okay, WSL first, then Docker, got it.
Speaker 2:
Now, is there anything you need regardless of the path?
Speaker 1:
Yes, whether you go Python or Docker, there's one critical step
before you can transcribe
Speaker 1:
anything with speaker separation.
Speaker 1:
You need a Hugging Face auth token.
Speaker 2:
Hugging Face.
Speaker 2:
Okay, they host a lot of AI models.
Speaker 2:
Why is a token needed for this specific tool
Speaker 2:
if it's running locally?
Speaker 1:
Good question.
Speaker 1:
It's not actually for the transcription part
Speaker 1:
using faster-whisper.
Speaker 1:
That runs fine offline, no token needed.
Speaker 1:
It's for the models that do the speaker diarization,
Speaker 1:
that speaker separation we talked about.
Speaker 2:
Ah, so the pyannote models need it.
Speaker 1:
Exactly. Those specific models, pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0
Speaker 2:
you have to go to the pages
Speaker 2:
for both of those pyannote models.
Speaker 2:
And click the button that says something like, use this model
or request access and agree
Speaker 2:
to their terms of service.
Speaker 2:
The docs warn that if you get a 401 or 403 error later, it
means you probably skipped this
Speaker 2:
explicit acceptance step.
Speaker 1:
Precisely.
Speaker 1:
you go to your Hugging Face account settings
Speaker 1:
and create an access token.
Speaker 1:
Make sure it's a read token.
Speaker 1:
Yes, you set it as an environment variable.
Speaker 1:
How you set that depends on your OS.
Speaker 1:
If you want it to stick around permanently
Speaker 1:
so you don't have to set it every time,
Speaker 1:
you should use the setx command instead.
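[Reference sketch for setting the token. The variable name HF_TOKEN is an assumption here, check the docs for the exact name the tool expects:
    macOS/Linux, current shell only:   export HF_TOKEN=hf_xxxxxxxx
    Windows, permanent (setx):         setx HF_TOKEN hf_xxxxxxxx]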
Speaker 2:
Ah, good tip.
Speaker 2:
setx for permanent on Windows.
Speaker 1:
Right. The first time you run the tool,
Speaker 1:
it might take a little while,
Speaker 1:
because it downloads the models from Hugging Face using your token.
Speaker 2:
But after the first time, much...
Speaker 1:
much faster.
Speaker 2:
Okay, good to know.
Speaker 2:
So we covered the main outputs,
Speaker 2:
anything else worth noting in the documentation?
Speaker 1:
Yeah, a few practical points.
Speaker 1:
There's actually an included helper script,
Speaker 1:
that converts the output into other formats if you need them,
Speaker 1:
like ODT for LibreOffice or even PDF.
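[Reference sketch only. The docs point to an included helper script for these conversions; as a generic illustration of the same idea, this is what it looks like with pandoc, which is not the tool's own script:
    pandoc transcript.docx -o transcript.odt
    pandoc transcript.docx -o transcript.pdf
    (PDF output additionally needs a PDF engine such as LaTeX installed)]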
Speaker 2:
Oh, handy. Flexibility's good.
Speaker 2:
What about other standard subtitle formats?
Speaker 2:
Right, like your VTTs, too.
Speaker 1:
The VTT files generated by this tool
Speaker 1:
do not contain the speaker information.
Speaker 1:
They're just the time text.
Speaker 1:
Fine for basic subtitles, maybe.
Speaker 1:
For that, the DOCX is the way to go.
Speaker 1:
Oh, and for Mac users on Apple Silicon, M1, M2, M3, M4 chips,
Speaker 1:
the default setup works fine,
Speaker 1:
but the performance might be a bit slower
Speaker 1:
Apparently to really leverage the GPU
Speaker 1:
you'd need to install some optional dependencies.
Speaker 2:
What about Windows?
Speaker 2:
Aside from the WSL thing for Docker.
Speaker 1:
Yes, this part was quite amusing in the documentation. There's
a section on Windows installation.
Speaker 1:
And the author includes a note stating very clearly that the
Windows instructions were generated by
Speaker 1:
ChatGPT.
Speaker 2:
Seriously, generated by ChatGPT.
Speaker 2:
Wow, okay, points for honesty, I guess.
Speaker 2:
That's refreshingly transparent.
Speaker 1:
Yeah.
Speaker 1:
It really is.
Speaker 1:
It tells you up front.
Speaker 1:
The Windows stuff is provided as is.
Speaker 1:
Maybe it works, maybe it doesn't.
Speaker 1:
We didn't really check.
Speaker 1:
You don't often see that level of candor
Speaker 1:
in open-source project docs.
Speaker 1:
It manages expectations perfectly.
Speaker 2:
Yeah.
Speaker 1:
I think the key takeaway is how this tool manages to sort of
democratize some pretty sophisticated
Speaker 1:
AI.
Speaker 1:
You get high quality transcription from Whisper, plus that
really crucial speaker diarization
Speaker 1:
layer.
Speaker 1:
And it's packaged up especially via Docker in a way that lets
people who aren't AI experts
Speaker 1:
or Python wizards actually use it.
Speaker 2:
Uh huh.
Speaker 2:
Okay, so for a final thought, here's something interesting that this whole setup highlights.
Speaker 2:
But you can only unlock its full potential, specifically that speaker separation, after you go through Hugging Face.
Speaker 2:
You need that free account.
Speaker 2:
You need to accept terms.
Speaker 2:
You get that token.
Speaker 2:
So the only real gatekeeper to running this advanced
self-hosted AI isn't technical skill
Speaker 2:
anymore.
Speaker 2:
Thanks to Docker.
Speaker 1:
Hmm.