{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Long-audio-transcription-Citrinet.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true,
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "view-in-github", | |
"colab_type": "text" | |
}, | |
"source": [ | |
"<a href=\"https://colab.research.google.com/gist/titu1994/a44fffd459236988ee52079ff8be1d2e/long-audio-transcription-citrinet.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | |
] | |
}, | |
    {
      "cell_type": "code",
      "metadata": {
        "id": "rZITgro3DC_v"
      },
      "source": [
        "## Install NeMo\n",
        "BRANCH = 'main'\n",
        "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n",
        "print(\"Finished installing NeMo!\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mQQMOq4uDMFI"
      },
      "source": [
        "import nemo.collections.asr as nemo_asr"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t3B0i9Tlaa2T"
      },
      "source": [
        "# Long-form audio transcription\n",
        "\n",
        "Long-form audio transcription is an interesting application of ASR. Generally, models are trained on short audio clips of 15-20 seconds. If an ASR model is compatible with streaming inference, it can then be evaluated on audio clips much longer than the training duration.\n",
        "\n",
        "Generally, streaming inference incurs a small increase in WER due to the loss of long-term context. Think of it this way: if a streaming model has a context window of only a few seconds of audio, then even when it streams an audio clip several minutes long, later transcriptions have lost some of the prior context.\n",
        "\n",
        "-------\n",
        "\n",
        "In this demo, we consider the naive case of long-form audio transcription, asking the question: in offline mode (i.e., when the model is given the entire audio sequence at once), what is the maximum duration of audio that it can transcribe?\n",
        "\n",
        "For the purposes of this demo, we will test the limits of Citrinet models [(arxiv)](https://arxiv.org/abs/2104.01721), which are purely convolutional ASR models.\n",
        "\n",
        "Unlike attention-based models, convolutional models don't incur a quadratic cost in the length of their context window, but they also miss out on the global context offered by the attention mechanism. Citrinet instead attains a relatively long context by replacing attention with [Squeeze-and-Excitation modules](https://arxiv.org/abs/1709.01507) between its blocks (see the illustrative sketch in the next cell)."
      ]
    },
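    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To make the Squeeze-and-Excitation idea concrete, the next cell contains a minimal, illustrative 1D SE block in PyTorch. This is only a sketch of the mechanism described in the paper linked above - the class name and the `reduction` factor are hypothetical choices, and this is not the exact module used inside NeMo's Citrinet."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Illustrative sketch only - not the NeMo/Citrinet implementation.\n",
        "import torch\n",
        "import torch.nn as nn\n",
        "\n",
        "class SqueezeExcite1d(nn.Module):\n",
        "    def __init__(self, channels: int, reduction: int = 8):\n",
        "        super().__init__()\n",
        "        self.fc = nn.Sequential(\n",
        "            nn.Linear(channels, channels // reduction),\n",
        "            nn.ReLU(inplace=True),\n",
        "            nn.Linear(channels // reduction, channels),\n",
        "            nn.Sigmoid(),\n",
        "        )\n",
        "\n",
        "    def forward(self, x):\n",
        "        # x: (batch, channels, time)\n",
        "        scale = x.mean(dim=-1)          # \"squeeze\": pool global context over the time axis\n",
        "        scale = self.fc(scale)          # \"excite\": per-channel gates in [0, 1]\n",
        "        return x * scale.unsqueeze(-1)  # rescale the feature map channel-wise\n",
        "\n",
        "# Sanity check on a random feature map\n",
        "se = SqueezeExcite1d(channels=64)\n",
        "print(se(torch.randn(2, 64, 100)).shape)  # torch.Size([2, 64, 100])"
      ],
      "execution_count": null,
      "outputs": []
    },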
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Gm13VSJ9FDc0"
      },
      "source": [
        "# Transcribing a podcast\n",
        "\n",
        "In order to make the task slightly more difficult, we will attempt to transcribe an entire podcast at once.\n",
        "\n",
        "Why a podcast? Podcasts are generally long verbal discussions between one or more people on a specific topic; the domain of discussion is unlikely to match the model's training corpus (unless the training corpus is vast); and they may include background audio or sponsorship segments in between the discussion.\n",
        "\n",
        "------"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lul3ZBlre27R"
      },
      "source": [
        "We refer to the post [Transcription API comparison: Google Speech-to-text, Amazon, Rev.ai](https://cloudcompiled.com/blog/transcription-api-comparison/amp/), which contains a thorough discussion of the streaming performance of various cloud providers. Of the three podcasts discussed there, \"The staying power of Kubernetes with Kelsey Hightower\" is a particularly good match for our use case.\n",
        "\n",
        "The podcast is somewhat long (nearly 42 minutes), discusses a technical topic (Kubernetes), and provides a detailed transcript of the discussion - including the sponsorship information. This allows a fair comparison between the model's transcription and the actual ground truth, since the model will simply transcribe the sponsorship segments along with the actual discussion.\n",
        "\n",
        "------\n",
        "The podcast is available at: https://www.lastweekinaws.com/podcast/screaming-in-the-cloud/the-staying-power-of-kubernetes-with-kelsey-hightower/"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_UpkxNg3kyLG"
      },
      "source": [
        "**Below, please give your permission to download the audio clip and the transcript from the Screaming in the Cloud podcast mentioned above.**"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "id": "SZuosVT8iZn9"
      },
      "source": [
        "#@title Execute cell to accept downloading of podcast and its transcript\n",
        "allow_download_of_podcast = \"Don't Accept\" #@param [\"Don't Accept\", \"Accept\"]\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aXC7zwiKF3Kh"
      },
      "source": [
        "# Download the podcast\n",
        "import os\n",
        "\n",
        "if not os.path.exists(\"data\"):\n",
        "    os.makedirs(\"data\")\n",
        "\n",
        "if allow_download_of_podcast != \"Accept\":\n",
        "    raise RuntimeError(\"Download of the podcast has been stopped as the user has not accepted downloading of the audio\")\n",
        "\n",
        "# Note: wget's -O flag fully determines the output path, so no -P (directory prefix) flag is needed\n",
        "if not os.path.exists(\"data/raw_audio.mp3\"):\n",
        "    !wget https://cdn.transistor.fm/file/transistor/m/shows/1494/62fad3e7a01de5beca8289352fe3a7bf.mp3 -O \"data/raw_audio.mp3\""
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wrn6ggIVG0vF"
      },
      "source": [
        "## Preprocess Audio\n",
        "\n",
        "We now have the raw audio file (in mp3 format) from the podcast. To make this audio file compatible with the model (mono-channel, 16 kHz audio), we will use FFmpeg to preprocess it. The next cells show the shape of the FFmpeg command on a short sample, then wrap it in a reusable helper."
      ]
    },
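    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a minimal sketch of the conversion, the one-liner below resamples the first few seconds of the downloaded mp3 to mono 16 kHz wav. The 5-second duration and the `/tmp` output path are arbitrary choices for illustration; the helper in the following cells performs the full conversion."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# -ar 16000 : resample to 16 kHz\n",
        "# -ac 1     : downmix to a single (mono) channel\n",
        "# -t 5      : only convert the first 5 seconds (illustration only)\n",
        "!ffmpeg -i data/raw_audio.mp3 -ar 16000 -ac 1 -t 5 -y /tmp/sample_processed.wav"
      ],
      "execution_count": null,
      "outputs": []
    },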
    {
      "cell_type": "code",
      "metadata": {
        "id": "fNwVEWhxGJ9j"
      },
      "source": [
        "import sys\n",
        "import glob\n",
        "import subprocess\n",
        "\n",
        "def transcode(input_dir, output_format, sample_rate, skip, duration):\n",
        "    files = glob.glob(os.path.join(input_dir, \"*.*\"))\n",
        "\n",
        "    # Filter out additional directories\n",
        "    files = [f for f in files if not os.path.isdir(f)]\n",
        "\n",
        "    output_dir = os.path.join(input_dir, \"processed\")\n",
        "\n",
        "    if not os.path.exists(output_dir):\n",
        "        print(f\"Output directory {output_dir} does not exist, creating ...\")\n",
        "        os.makedirs(output_dir)\n",
        "\n",
        "    for filepath in files:\n",
        "        output_filename = os.path.basename(filepath)\n",
        "        output_filename = os.path.splitext(output_filename)[0]\n",
        "\n",
        "        output_filename = f\"{output_filename}_processed.{output_format}\"\n",
        "\n",
        "        args = [\n",
        "            'ffmpeg',\n",
        "            '-i',\n",
        "            str(filepath),\n",
        "            '-ar',\n",
        "            str(sample_rate),\n",
        "            '-ac',\n",
        "            str(1),\n",
        "            '-y'\n",
        "        ]\n",
        "\n",
        "        if skip is not None:\n",
        "            args.extend(['-ss', str(skip)])\n",
        "\n",
        "        if duration is not None:\n",
        "            args.extend(['-to', str(duration)])\n",
        "\n",
        "        args.append(os.path.join(output_dir, output_filename))\n",
        "\n",
        "        # Note: this simple join assumes the file paths contain no spaces\n",
        "        command = \" \".join(args)\n",
        "        !{command}\n",
        "\n",
        "    print(\"\\n\")\n",
        "    print(f\"Finished transcoding {len(files)} audio files\")\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Q6WlUUKoHCNw"
      },
      "source": [
        "transcode(\n",
        "    input_dir=\"./data/\",\n",
        "    output_format=\"wav\",\n",
        "    sample_rate=16000,\n",
        "    skip=None,\n",
        "    duration=None,\n",
        ")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ucpmx9tYIOzn"
      },
      "source": [
        "## Prepare transcript\n",
        "\n",
        "The original transcript is provided below. We then preprocess this text to construct a \"ground truth\" corpus for this podcast."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "BUxczbkzISYv",
        "cellView": "form"
      },
      "source": [
        "#@title Execute cell to initialize raw transcript!\n",
        "\n",
        "if allow_download_of_podcast != \"Accept\":\n",
        "    raise RuntimeError(\"Request to download the podcast has been stopped as the user has not accepted downloading of the transcript\")\n",
        "\n",
        "raw_transcript = \"\"\"\n",
        "Announcer: Hello and welcome to Screaming in the Cloud, with your host Cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of Cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.\n",
        "\n",
        "\n",
        "Corey: This episode is sponsored by CloudZero. CloudZero wants you to know a few things. One: they realize it's not that original to put a random noun after the word cloud to name your company. Yet somehow they did it anyway. Two: most cost management solutions weren't designed for engineering teams, and three, spending $50,000 on something by accident would be bananas in virtually any other industry. That's why CloudZero built a product that helped engineers correlate their costs with the engineering activity that caused it, so you can quickly detect anomalous span, investigate the source of any cost, and slice up your AWS costs by the metrics that matter to you. Go to cloudzero.com to kick off a free trial that's cloud Z E R O.com and my thanks to them for sponsoring this episode.\n",
        "\n",
        "\n",
        "Corey: This episode is brought to you by DigitalOcean, the cloud provider that makes it easy for startups to deploy and scale modern web applications with, and this is important to me: No billing surprises. With simple, predictable pricing that's flat across 12 global data center regions, and a UX developers around the world love, you can control your cloud infrastructure costs and have more time for your team to focus on growing your business. See what businesses are building on DigitalOcean and get started for free at do.co/screaming. That's D-O-Dot-C-O-slash-screaming, and my thanks to DigitalOcean for their continuing support of this ridiculous podcast.\n",
        "\n",
        "\n",
        "Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. I'm joined this week by Kelsey Hightower, who claims to be a principal developer advocate at Google, but based upon various keynotes I've seen him in, he basically gets on stage and plays video games like Tetris in front of large audiences. So I assume he is somehow involved with e-sports. Kelsey, welcome to the show.\n",
        "\n",
        "\n",
        "Kelsey: You've outed me. Most people didn't know that I am a full-time e-sports Tetris champion at home. And the technology thing is just a side gig.\n",
        "\n",
        "\n",
        "Corey: Exactly. It's one of those things you do just to keep the lights on, like you're waiting to get discovered, but in the meantime, you're waiting table. Same type of thing. Some people wait tables you more or less a sling Kubernetes, for lack of a better term.\n",
        "\n",
        "\n",
        "Kelsey: Yes.\n",
        "\n",
        "\n",
        "Corey: So let's dive right into this. You've been a strong proponent for a long time of Kubernetes and all of its intricacies and all the power that it unlocks and I've been pretty much the exact opposite of that, as far as saying it tends to be over complicated, that it's hype-driven and a whole bunch of other, shall we say criticisms that are sometimes bounded in reality and sometimes just because I think it'll be funny when I put them on Twitter. Where do you stand on the state of Kubernetes in 2020?\n",
        "\n",
        "\n",
        "Kelsey: So, I want to make sure it's clear what I do. Because when I started talking about Kubernetes, I was not working at Google. I was actually working at CoreOS where we had a competitor Kubernetes called Fleet. And Kubernetes coming out kind of put this like fork in our roadmap, like where do we go from here? What people saw me doing with Kubernetes was basically learning in public. Like I was really excited about the technology because it's attempting to solve a very complex thing. I think most people will agree building a distributed system is what cloud providers typically do, right? With VMs and hypervisors. Those are very big, complex distributed systems. And before Kubernetes came out, the closest I'd gotten to a distributed system before working at CoreOS was just reading the various white papers on the subject and hearing stories about how Google has systems like Borg tools, like Mesa was being used by some of the largest hyperscalers in the world, but I was never going to have the chance to ever touch one of those unless I would go work at one of those companies.\n",
        "\n",
        "\n",
        "So when Kubernetes came out and the fact that it was open source and I could read the code to understand how it was implemented, to understand how schedulers actually work and then bonus points for being able to contribute to it. Those early years, what you saw me doing was just being so excited about systems that I attended to build on my own, becoming this new thing just like Linux came up. So I kind of agree with you that a lot of people look at it as a more of a hype thing. They're looking at it regardless of their own needs, regardless of understanding how it works and what problems is trying to solve that. My stance on it, it's a really, really cool tool for the level that it operates in, and in order for it to be successful, people can't know that it's there.\n",
        "\n",
        "\n",
        "Corey: And I think that might be where part of my disconnect from Kubernetes comes into play. I have a background in ops, more or less, the grumpy Unix sysadmin because it's not like there's a second kind of Unix sysadmin you're ever going to encounter. Where everything in development works in theory, but in practice things pan out a little differently. I always joke that ops is the difference between theory and practice. In theory, devs can do everything and there's no ops needed. In practice, well it's been a burgeoning career for a while. The challenge with this is Kubernetes at times exposes certain levels of abstraction that, sorry certain levels of detail that generally people would not want to have to think about or deal with, while papering over other things with other layers of abstraction on top of it. That obscure, valuable troubleshooting information from a running something in an operational context. It absolutely is a fascinating piece of technology, but it feels today like it is overly complicated for the use a lot of people are attempting to put it to. Is that a fair criticism from where you sit?\n",
        "\n",
        "\n",
        "Kelsey: So I think the reason why it's a fair criticism is because there are people attempting to run their own Kubernetes cluster, right? So when we think about the cloud, unless you're in OpenStack land, but for the people who look at the cloud and you say, \"Wow, this is much easier.\" There's an API for creating virtual machines and I don't see the distributed state store that's keeping all of that together. I don't see the farm of hypervisors. So we don't necessarily think about the inherent complexity into a system like that, because we just get to use it. So on one end, if you're just a user of a Kubernetes cluster, maybe using something fully managed or you have an ops team that's taking care of everything, your interface of the system becomes this Kubernetes configuration language where you say, \"Give me a load balancer, give me three copies of this container running.\" And if we do it well, then you'd think it's a fairly easy system to deal with because you say, \"kubectl, apply,\" and things seem to start running.\n",
        "\n",
        "\n",
        "Just like in the cloud where you say, \"AWS create this VM, or G cloud compute instance, create.\" You just submit API calls and things happen. I think the fact that Kubernetes is very transparent to most people is, now you can see the complexity, right? Imagine everyone driving with the hood off the car. You'd be looking at a lot of moving things, but we have hoods on cars to hide the complexity and all we expose is the steering wheel and the pedals. That car is super complex but we don't see it. So therefore we don't attribute as complexity to the driving experience.\n",
        "\n",
        "\n",
        "Corey: This to some extent feels it's on the same axis as serverless, with just a different level of abstraction piled onto it. And while I am a large proponent of serverless, I think it's fantastic for a lot of Greenfield projects. The constraints inherent to the model mean that it is almost completely non-tenable for a tremendous number of existing workloads. Some developers like to call it legacy, but when I hear the term legacy I hear, \"it makes actual money.\" So just treating it as, \"Oh, it's a science experiment we can throw into a new environment, spend a bunch of time rewriting it for minimal gains,\" is just not going to happen as companies undergo digital transformations, if you'll pardon the term.\n",
        "\n",
        "\n",
        "Kelsey: Yeah, so I think you're right. So let's take Amazon's Lambda for example, it's a very opinionated high-level platform that assumes you're going to build apps a certain way. And if that's you, look, go for it. Now, one or two levels below that there is this distributed system. Kubernetes decided to play in that space because everyone that's building other platforms needs a place to start. The analogy I like to think of is like in the mobile space, iOS and Android deal with the complexities of managing multiple applications on a mobile device, security aspects, app stores, that kind of thing. And then you as a developer, you build your thing on top of those platforms and APIs and frameworks. Now, it's debatable, someone would say, \"Why do we even need an open-source implementation of such a complex system? Why not just everyone moved to the cloud?\" And then everyone that's not in a cloud on-premise gets left behind.\n",
        "\n",
        "\n",
        "But typically that's not how open source typically works, right? The reason why we have Linux, the precursor to the cloud is because someone looked at the big proprietary Unix systems and decided to re-implement them in a way that anyone could run those systems. So when you look at Kubernetes, you have to look at it from that lens. It's the ability to democratize these platform layers in a way that other people can innovate on top. That doesn't necessarily mean that everyone needs to start with Kubernetes, just like not everyone needs to start with the Linux server, but it's there for you to build the next thing on top of, if that's the route you want to go.\n",
        "\n",
        "\n",
        "Corey: It's been almost a year now since I made an original tweet about this, that in five years, no one will care about Kubernetes. So now I guess I have four years running on that clock and that attracted a bit of, shall we say controversy. There were people who thought that I meant that it was going to be a flash in the pan and it would dry up and blow away. But my impression of it is that in, well four years now, it will have become more or less system D for the data center, in that there's a bunch of complexity under the hood. It does a bunch of things. No-one sensible wants to spend all their time mucking around with it in most companies. But it's not something that people have to think about in an ongoing basis the way it feels like we do today.\n",
        "\n",
        "\n",
        "Kelsey: Yeah, I mean to me, I kind of see this as the natural evolution, right? It's new, it gets a lot of attention and kind of the assumption you make in that statement is there's something better that should be able to arise, giving that checkpoint. If this is what people think is hot, within five years surely we should see something else that can be deserving of that attention, right? Docker comes out and almost four or five years later you have Kubernetes. So it's obvious that there should be a progression here that steals some of the attention away from Kubernetes, but I think where it's so new, right? It's only five years in, Linux is like over 20 years old now at this point, and it's still top of mind for a lot of people, right? Microsoft is still porting a lot of Windows only things into Linux, so we still discuss the differences between Windows and Linux.\n",
        "\n",
        "\n",
        "The idea that the cloud, for the most part, is driven by Linux virtual machines, that I think the majority of workloads run on virtual machines still to this day, so it's still front and center, especially if you're a system administrator managing BDMs, right? You're dealing with tools that target Linux, you know the Cisco interface and you're thinking about how to secure it and lock it down. Kubernetes is just at the very first part of that life cycle where it's new. We're all interested in even what it is and how it works, and now we're starting to move into that next phase, which is the distro phase. Like in Linux, you had Red Hat, Slackware, Ubuntu, special purpose distros.\n",
        "\n",
        "\n",
        "Some will consider Android a special purpose distribution of Linux for mobile devices. And now that we're in this distro phase, that's going to go on for another 5 to 10 years where people start to align themselves around, maybe it's OpenShift, maybe it's GKE, maybe it's Fargate for EKS. These are now distributions built on top of Kubernetes that start to add a little bit more opinionation about how Kubernetes should be pushed together. And then we'll enter another phase where you'll build a platform on top of Kubernetes, but it won't be worth mentioning that Kubernetes is underneath because people will be more interested on the thing above.\n",
        "\n",
        "\n",
        "Corey: I think we're already seeing that now, in terms of people no longer really care that much what operating system they're running, let alone with distribution of that operating system. The things that you have to care about slip below the surface of awareness and we've seen this for a long time now. Originally to install a web server, it wound up taking a few days and an intimate knowledge of GCC compiler flags, then RPM or D package and then yum on top of that, then ensure installed, once we had configuration management that was halfway decent.\n",
        "\n",
        "\n",
        "Then Docker run, whatever it is. And today feels like it's with serverless technologies being what they are, it's effectively a push a file to S3 or it's equivalent somewhere else and you're done. The things that people have to be aware of and the barrier to entry continually lowers. The downside to that of course, is that things that people specialize in today and effectively make very lucrative careers out of are going to be not front and center in 5 to 10 years the way that they are today. And that's always been the way of technology. It's a treadmill to some extent.\n",
        "\n",
        "\n",
        "Kelsey: And on the flip side of that, look at all of the new jobs that are centered around these cloud-native technologies, right? So you know, we're just going to make up some numbers here, imagine if there were only 10,000 jobs around just Linux system administration. Now when you look at this whole Kubernetes landscape where people are saying we can actually do a better job with metrics and monitoring. Observability is now a thing culturally that people assume you should have, because you're dealing with these distributed systems. The ability to start thinking about multi-regional deployments when I think that would've been infeasible with the previous tools or you'd have to build all those tools yourself. So I think now we're starting to see a lot more opportunities, where instead of 10,000 people, maybe you need 20,000 people because now you have the tools necessary to tackle bigger projects where you didn't see that before.\n",
        "\n",
        "\n",
        "Corey: That's what's going to be really neat to see. But the challenge is always to people who are steeped in existing technologies. What does this mean for them? I mean I spent a lot of time early in my career fighting against cloud because I thought that it was taking away a cornerstone of my identity. I was a large scale Unix administrator, specifically focusing on email. Well, it turns out that there aren't nearly as many companies that need to have that particular skill set in house as it did 10 years ago. And what we're seeing now is this sort of forced evolution of people's skillsets or they hunker down on a particular area of technology or particular application to try and make a bet that they can ride that out until retirement. It's challenging, but at some point it seems that some folks like to stop learning, and I don't fully pretend to understand that. I'm sure I will someday where, \"No, at this point technology come far enough. We're just going to stop here, and anything after this is garbage.\" I hope not, but I can see a world in which that happens.\n",
        "\n",
        "\n",
        "Kelsey: Yeah, and I also think one thing that we don't talk a lot about in the Kubernetes community, is that Kubernetes makes hyper-specialization worth doing because now you start to have a clear separation from concerns. Now the OS can be hyperfocused on security system calls and not necessarily packaging every programming language under the sun into a single distribution. So we can kind of move part of that layer out of the core OS and start to just think about the OS being a security boundary where we try to lock things down. And for some people that play at that layer, they have a lot of work ahead of them in locking down these system calls, improving the idea of containerization, whether that's something like Firecracker or some of the work that you see VMware doing, that's going to be a whole class of hyper-specialization. And the reason why they're going to be able to focus now is because we're starting to move into a world, whether that's serverless or the Kubernetes API.\n",
        "\n",
        "\n",
        "We're saying we should deploy applications that don't target machines. I mean just that step alone is going to allow for so much specialization at the various layers because even on the networking front, which arguably has been a specialization up until this point, can truly specialize because now the IP assignments, how networking fits together, has also abstracted a way one more step where you're not asking for interfaces or binding to a specific port or playing with port mappings. You can now let the platform do that. So I think for some of the people who may be not as interested as moving up the stack, they need to be aware that the number of people we need being hyper-specialized at Linux administration will definitely shrink. And a lot of that work will move up the stack, whether that's Kubernetes or managing a serverless deployment and all the configuration that goes with that. But if you are a Linux, like that is your bread and butter, I think there's going to be an opportunity to go super deep, but you may have to expand into things like security and not just things like configuration management.\n",
        "\n",
        "\n",
        "Corey: Let's call it the unfulfilled promise of Kubernetes. On paper, I love what it hints at being possible. Namely, if I build something that runs well on top of Kubernetes than we truly have a write once, run anywhere type of environment. Stop me if you've heard that one before, 50,000 times in our industry... or history. But in practice, as has happened before, it seems like it tends to fall down for one reason or another. Now, Amazon is famous because for many reasons, but the one that I like to pick on them for is, you can't say the word multi-cloud at their events. Right. That'll change people's perspective, good job. The people tend to see multi-cloud are a couple of different lenses.\n",
        "\n",
        "\n",
        "I've been rather anti multi-cloud from the perspective of the idea that you're setting out day one to build an application with the idea that it can be run on top of any cloud provider, or even on-premises if that's what you want to do, is generally not the way to proceed. You wind up having to make certain trade-offs along the way, you have to rebuild anything that isn't consistent between those providers, and it slows you down. Kubernetes on the other hand hints at if it works and fulfills this promise, you can suddenly abstract an awful lot beyond that and just write generic applications that can run anywhere. Where do you stand on the whole multi-cloud topic?\n",
        "\n",
        "\n",
        "Kelsey: So I think we have to make sure we talk about the different layers that are kind of ready for this thing. So for example, like multi-cloud networking, we just call that networking, right? What's the IP address over there? I can just hit it. So we don't make a big deal about multi-cloud networking. Now there's an area where people say, how do I configure the various cloud providers? And I think the healthy way to think about this is, in your own data centers, right, so we know a lot of people have investments on-premises. Now, if you were to take the mindset that you only need one provider, then you would try to buy everything from HP, right? You would buy HP store's devices, you buy HP racks, power. Maybe HP doesn't sell air conditioners. So you're going to have to buy an air conditioner from a vendor who specializes in making air conditioners, hopefully for a data center and not your house.\n",
        "\n",
        "\n",
        "So now you've entered this world where one vendor does it make every single piece that you need. Now in the data center, we don't say, \"Oh, I am multi-vendor in my data center.\" Typically, you just buy the switches that you need, you buy the power racks that you need, you buy the ethernet cables that you need, and they have common interfaces that allow them to connect together and they typically have different configuration languages and methods for configuring those components. The cloud on the other hand also represents the same kind of opportunity. There are some people who really love DynamoDB and S3, but then they may prefer something like BigQuery to analyze the data that they're uploading into S3. Now, if this was a data center, you would just buy all three of those things and put them in the same rack and call it good.\n",
        "\n",
        "\n",
        "But the cloud presents this other challenge. How do you authenticate to those systems? And then there's usually this additional networking costs, egress or ingress charges that make it prohibitive to say, \"I want to use two different products from two different vendors.\" And I think that's-\n",
        "\n",
        "\n",
        "Corey: ...winds up causing serious problems.\n",
        "\n",
        "\n",
        "Kelsey: Yes, so that data gravity, the associated cost becomes a little bit more in your face. Whereas, in a data center you kind of feel that the cost has already been paid. I already have a network switch with enough bandwidth, I have an extra port on my switch to plug this thing in and they're all standard interfaces. Why not? So I think the multi-cloud gets lost in the chew problem, which is the barrier to entry of leveraging things across two different providers because of networking and configuration practices.\n",
        "\n",
        "\n",
        "Corey: That's often the challenge, I think, that people get bogged down in. On an earlier episode of this show we had Mitchell Hashimoto on, and his entire theory around using Terraform to wind up configuring various bits of infrastructure, was not the idea of workload portability because that feels like the windmill we all keep tilting at and failing to hit. But instead the idea of workflow portability, where different things can wind up being interacted with in the same way. So if this one division is on one cloud provider, the others are on something else, then you at least can have some points of consistency in how you interact with those things. And in the event that you do need to move, you don't have to effectively redo all of your CICD process, all of your tooling, et cetera. And I thought that there was something compelling about that argument.\n",
        "\n",
        "\n",
        "Kelsey: And that's actually what Kubernetes does for a lot of people. For Kubernetes, if you think about it, when we start to talk about workflow consistency, if you want to deploy an application, queue CTL, apply, some config, you want the application to have a load balancer in front of it. Regardless of the cloud provider, because Kubernetes has an extension point we call the cloud provider. And that's where Amazon, Azure, Google Cloud, we do all the heavy lifting of mapping the high-level ingress object that specifies, \"I want a load balancer, maybe a few options,\" to the actual implementation detail. So maybe you don't have to use four or five different tools and that's where that kind of workload portability comes from. Like if you think about Linux, right? It has a set of system calls, for the most part, even if you're using a different distro at this point, Red Hat or Amazon Linux or Google's container optimized Linux.\n",
        "\n",
        "\n",
        "If I build a Go binary on my laptop, I can SCP it to any of those Linux machines and it's going to probably run. So you could call that multi-cloud, but that doesn't make a lot of sense because it's just because of the way Linux works. Kubernetes does something very similar because it sits right on top of Linux, so you get the portability just from the previous example and then you get the other portability and workload, like you just stated, where I'm calling kubectl apply, and I'm using the same workflow to get resources spun up on the various cloud providers. Even if that configuration isn't one-to-one identical.\n",
        "\n",
        "\n",
        "Corey: This episode is sponsored in part by DataStax. The NoSQL event of the year is DataStax Accelerate in San Diego this May from the 11th through the 13th. I've given a talk previously called the myth of multi-cloud, and it's time for me to revisit that with... A sequel! Which is funny given that it's a NoSQL conference, but there you have it. To learn more, visit datastax.com that's D-A-T-A-S-T-A-X.com and I hope to see you in San Diego. This May.\n",
        "\n",
        "\n",
        "Corey: One thing I'm curious about is you wind up walking through the world and seeing companies adopting Kubernetes in different ways. How are you finding the adoption of Kubernetes is looking like inside of big E enterprise style companies? I don't have as much insight into those environments as I probably should. That's sort of a focus area for the next year for me. But in startups, it seems that it's either someone goes in and rolls it out and suddenly it's fantastic, or they avoid it entirely and do something serverless. In large enterprises, I see a lot of Kubernetes and a lot of Kubernetes stories coming out of it, but what isn't usually told is, what's the tipping point where they say, \"Yeah, let's try this.\" Or, \"Here's the problem we're trying to solve for. Let's chase it.\"\n",
        "\n",
        "\n",
        "Kelsey: What I see is enterprises buy everything. If you're big enough and you have a big enough IT budget, most enterprises have a POC of everything that's for sale, period. There's some team in some pocket, maybe they came through via acquisition. Maybe they live in a different state. Maybe it's just a new project that came out. And what you tend to see, at least from my experiences, if I walk into a typical enterprise, they may tell me something like, \"Hey, we have a POC, a Pivotal Cloud Foundry, OpenShift, and we want some of that new thing that we just saw from you guys. How do we get a POC going?\" So there's always this appetite to evaluate what's for sale, right? So, that's one case. There's another case where, when you start to think about an enterprise there's a big range of skillsets. Sometimes I'll go to some companies like, \"Oh, my insurance is through that company, and there's ex-Googlers that work there.\" They used to work on things like Borg, or something else, and they kind of know how these systems work.\n",
        "\n",
        "\n",
        "And they have a slightly better edge at evaluating whether Kubernetes is any good for the problem at hand. And you'll see them bring it in. Now that same company, I could drive over to the other campus, maybe it's five miles away and that team doesn't even know what Kubernetes is. And for them, they're going to be chugging along with what they're currently doing. So then the challenge becomes if Kubernetes is a great fit, how wide of a fit it isn't? How many teams at that company should be using it? So what I'm currently seeing as there are some enterprises that have found a way to make Kubernetes the place where they do a lot of new work, because that makes sense. A lot of enterprises to my surprise though, are actually stepping back and saying, \"You know what? We've been stitching together our own platform for the last five years. We had the Netflix stack, we got some Spring Boot, we got Console, we got Vault, we got Docker. And now this whole thing is getting a little more fragile because we're doing all of this glue code.\"\n",
        "\n",
        "\n",
        "Kubernetes, We've been trying to build our own Kubernetes and now that we know what it is and we know what it isn't, we know that we can probably get rid of this kind of bespoke stack ourselves and just because of the ecosystem, right? If I go to HashiCorp's website, I would probably find the word Kubernetes as much as I find the word Nomad on their site because they've made things like Console and Vault become first-class offerings inside of the world of Kubernetes. So I think it's that momentum that you see across even People Oracle, Juniper, Palo Alto Networks, they're all have seem to have a Kubernetes story. And this is why you start to see the enterprise able to adopt it because it's so much in their face and it's where the ecosystem is going.\n",
        "\n",
        "\n",
        "Corey: It feels like a lot of the excitement and the promise and even the same problems that Kubernetes is aimed at today, could have just as easily been talked about half a decade ago in the context of OpenStack. And for better or worse, OpenStack is nowhere near where it once was. It would felt like it had such promise and such potential and when it didn't pan out, that left a lot of people feeling relatively sad, burnt out, depressed, et cetera. And I'm seeing a lot of parallels today, at least between what was said about OpenStack and what was said about Kubernetes. How do you see those two diverging?\n",
        "\n",
        "\n",
        "Kelsey: I will tell you the big difference that I saw, personally. Just for my personal journey outside of Google, just having that option. And I remember I was working at a company and we were like, \"We're going to roll our own OpenStack. We're going to buy a free BSD box and make it a file server. We're going all open sources,\" like do whatever you want to do. And that was just having so many issues in terms of first-class integrations, education, people with the skills to even do that. And I was like, \"You know what, let's just cut the check for VMware.\" We want virtualization. VMware, for the cost and when it does, it's good enough. Or we can just actually use a cloud provider. That space in many ways was a purely solved problem. Now, let's fast forward to Kubernetes, and also when you get OpenStack finished, you're just back where you started.\n",
        "\n",
        "\n",
        "You got a bunch of VMs and now you've got to go figure out how to build the real platform that people want to use because no one just wants a VM. If you think Kubernetes is low level, just having OpenStack, even OpenStack was perfect. You're still at square one for the most part. Maybe you can just say, \"Now I'm paying a little less money for my stack in terms of software licensing costs,\" but from an extraction and automation and API standpoint, I don't think OpenStack moved the needle in that regard. Now in the Kubernetes world, it's solving a huge gap.\n",
        "\n",
        "\n",
        "Lots of people have virtual machine sprawl than they had Docker sprawl, and when you bring in this thing by Kubernetes, it says, \"You know what? Let's reign all of that in. Let's build some first-class abstractions, assuming that the layer below us is a solved problem.\" You got to remember when Kubernetes came out, it wasn't trying to replace the hypervisor, it assumed it was there. It also assumed that the hypervisor had APIs for creating virtual machines and attaching disc and creating load balancers, so Kubernetes came out as a complementary technology, not one looking to replace. And I think that's why it was able to stick because it solved a problem at another layer where there was not a lot of competition.\n",
        "\n",
        "\n",
        "Corey: I think a more cynical take, at least one of the ones that I've heard articulated and I tend to agree with, was that OpenStack originally seemed super awesome because there were a lot of interesting people behind it, fascinating organizations, but then you wound up looking through the backers of the foundation behind it and the rest. And there were something like 500 companies behind it, an awful lot of them were these giant organizations that ... they were big e-corporate IT enterprise software vendors, and you take a look at that, I'm not going to name anyone because at that point, oh will we get letters.\n",
        "\n",
        "\n",
        "But at that point, you start seeing so many of the patterns being worked into it that it almost feels like it has to collapse under its own weight. I don't, for better or worse, get the sense that Kubernetes is succumbing to the same thing, despite the CNCF having an awful lot of those same backers behind it and as far as I can tell, significantly more money, they seem to have all the money to throw at these sorts of things. So I'm wondering how Kubernetes has managed to effectively sidestep I guess the open-source miasma that OpenStack didn't quite manage to avoid.\n",
        "\n",
        "\n",
        "Kelsey: Kubernetes gained its own identity before the foundation existed. Its purpose, if you think back from the Borg paper almost eight years prior, maybe even 10 years prior. It defined this problem really, really well. I think Mesos came out and also had a slightly different take on this problem. And you could just see at that time there was a real need, you had choices between Docker Swarm, Nomad. It seems like everybody was trying to fill in this gap because, across most verticals or industries, this was a true problem worth solving. What Kubernetes did was played in the exact same sandbox, but it kind of got put out with experience. It's not like, \"Oh, let's just copy this thing that already exists, but let's just make it open.\"\n",
        "\n",
        "\n",
        "And in that case, you don't really have your own identity. It's you versus Amazon, in the case of OpenStack, it's you versus VMware. And that's just really a hard place to be in because you don't have an identity that stands alone. Kubernetes itself had an identity that stood alone. It comes from this experience of running a system like this. It comes from research and white papers. It comes after previous attempts at solving this problem. So we agree that this problem needs to be solved. We know what layer it needs to be solved at. We just didn't get it right yet, so Kubernetes didn't necessarily try to get it right.\n",
        "\n",
        "\n",
        "It tried to start with only the primitives necessary to focus on the problem at hand. Now to your point, the extension interface of Kubernetes is what keeps it small. Years ago I remember plenty of meetings where we all got in rooms and said, \"This thing is done.\" It doesn't need to be a PaaS. It doesn't need to compete with serverless platforms. The core of Kubernetes, like Linux, is largely done. Here's the core objects, and we're going to make a very great extension interface. We're going to make one for the container run time level so that way people can swap that out if they really want to, and we're going to do one that makes other APIs as first-class as ones we have, and we don't need to try to boil the ocean in every Kubernetes release. Everyone else has the ability to deploy extensions just like Linux, and I think that's why we're avoiding some of this tension in the vendor world because you don't have to change the core to get something that feels like a native part of Kubernetes.\n",
        "\n",
        "\n",
        "Corey: What do you think is currently being the most misinterpreted or misunderstood aspect of Kubernetes in the ecosystem?\n",
        "\n",
        "\n",
        "Kelsey: I think the biggest thing that's misunderstood is what Kubernetes actually is. And the thing that made it click for me, especially when I was writing the tutorial Kubernetes The Hard Way. I had to sit down and ask myself, \"Where do you start trying to learn what Kubernetes is?\" So I start with the database, right? The configuration store isn't Postgres, it isn't MySQL, it's Etcd. Why? Because we're not trying to be this generic data stores platform. We just need to store configuration data. Great. Now, do we let all the components talk to Etcd? No. We have this API server and between the API server and the chosen data store, that's essentially what Kubernetes is. You can stop there. At that point, you have a valid Kubernetes cluster and it can understand a few things. Like I can say, using the Kubernetes command-line tool, create this configuration map that stores configuration data and I can read it back.\n",
        "\n",
        "\n",
        "Great. Now I can't do a lot of things that are interesting with that. Maybe I just use it as a configuration store, but then if I want to build a container platform, I can install the Kubernetes kubelet agent on a bunch of machines and have it talk to the API server looking for other objects you add in the scheduler, all the other components. So what that means is that Kubernetes most important component is its API because that's how the whole system is built. It's actually a very simple system when you think about just those two components in isolation. If you want a container management tool that you need a scheduler, controller, manager, cloud provider integrations, and now you have a container tool. But let's say you want a service mesh platform. Well in a service mesh you have a data plane that can be Nginx or Envoy and that's going to handle routing traffic. And you need a control plane. That's going to be something that takes in configuration and it uses that to configure all the things in a data plane.\n",
        "\n",
        "\n",
        "Well, guess what? Kubernetes is 90% there in terms of a control plane, with just those two components, the API server, and the data store. So now when you want to build control planes, if you start with the Kubernetes API, we call it the API machinery, you're going to be 95% there. And then what do you get? You get a distributed system that can handle kind of failures on the back end, thanks to Etcd. You're going to get our backs or you can have permission on top of your schemas, and there's a built-in framework, we call it custom resource definitions that allows you to articulate a schema and then your own control loops provide meaning to that schema. And once you do those two things, you can build any platform you want. And I think that's one thing that it takes a while for people to understand that part of Kubernetes, that the thing we talk about today, for the most part, is just the first system that we built on top of this.\n",
        "\n",
        "\n",
        "Corey: I think that's a very far-reaching story with implications that I'm not entirely sure I am able to wrap my head around. I hope to see it, I really do. I mean you mentioned about writing Learn Kubernetes the Hard Way and your tutorial, which I'll link to in the show notes. I mean my, of course, sarcastic response to that recently was to register the domain Kubernetes the Easy Way and just re-pointed to Amazon's ECS, which is in no way shape or form Kubernetes and basically has the effect of irritating absolutely everyone as is my typical pattern of behavior on Twitter. But I have been meaning to dive into Kubernetes on a deeper level and the stuff that you've written, not just the online tutorial, both the books have always been my first port of call when it comes to that. The hard part, of course, is there's just never enough hours in the day.\n",
        "\n",
        "\n",
        "Kelsey: And one thing that I think about too is like the web. We have the internet, there's webpages, there's web browsers. Web Browsers talk to web servers over HTTP. There's verbs, there's bodies, there's headers. And if you look at it, that's like a very big complex system. If I were to extract out the protocol pieces, this concept of HTTP verbs, get, put, post and delete, this idea that I can put stuff in a body and I can give it headers to give it other meaning and semantics. If I just take those pieces, I can bill restful API's.\n",
        "\n",
        "\n",
        "Hell, I can even bill graph QL and those are just different systems built on the same API machinery that we call the internet or the web today. But you have to really dig into the details and pull that part out and you can build all kind of other platforms and I think that's what Kubernetes is. It's going to probably take people a little while longer to see that piece, but it's hidden in there and that's that piece that's going to be, like you said, it's going to probably be the foundation for building more control planes. And when people build control planes, I think if you think about it, maybe Fargate for EKS represents another control plane for making a serverless platform that takes to Kubernetes API, even though the implementation isn't what you find on GitHub.\n",
        "\n",
        "\n",
        "Corey: That's the truth. Whenever you see something as broadly adopted as Kubernetes, there's always the question of, \"Okay, there's an awful lot of blog posts.\" Getting started to it, learn it in 10 minutes, I mean at some point, I'm sure there are some people still convince Kubernetes is, in fact, a breakfast cereal based upon what some of the stuff the CNCF has gotten up to. I wouldn't necessarily bet against it socks today, breakfast cereal tomorrow. But it's hard to find a decent level of quality, finding the certain quality bar of a trusted source to get started with is important. Some people believe in the hero's journey, story of a narrative building.\n",
        "\n",
        "\n",
        "I always prefer to go with the morons journey because I'm the moron. I touch technologies, I have no idea what they do and figure it out and go careening into edge and corner cases constantly. And by the end of it I have something that vaguely sort of works and my understanding's improved. But I've gone down so many terrible paths just by picking a bad point to get started. So everyone I've talked to who's actually good at things has pointed to your work in this space as being something that is authoritative and largely correct and given some of these people, that's high praise.\n",
        "\n",
        "\n",
        "Kelsey: Awesome. I'm going to put that on my next performance review as evidence of my success and impact.\n",
        "\n",
        "\n",
        "Corey: Absolutely. Grouchy people say, \"It's all right,\" you know, for the right people that counts. If people want to learn more about what you're up to and see what you have to say, where can they find you?\n",
        "\n",
        "\n",
        "Kelsey: I aggregate most of outward interactions on Twitter, so I'm @KelseyHightower and my DMs are open, so I'm happy to field any questions and I attempt to answer as many as I can.\n",
        "\n",
        "\n",
        "Corey: Excellent. Thank you so much for taking the time to speak with me today. I appreciate it.\n",
        "\n",
        "\n",
        "Kelsey: Awesome. I was happy to be here.\n",
        "\n",
        "\n",
        "Corey: Kelsey Hightower, Principal Developer Advocate at Google. I'm Corey Quinn. This is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on Apple podcasts. If you've hated this podcast, please leave a five-star review on Apple podcasts and then leave a funny comment. Thanks.\n",
        "\n",
        "\n",
        "Announcer: This has been this week's episode of Screaming in the Cloud. You can also find more Core at screaminginthecloud.com or wherever fine snark is sold.\n",
        "\n",
        "\n",
        "Announcer: This has been a HumblePod production. Stay humble.\n",
        "\"\"\""
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Ukci_4WJLppL"
      },
      "source": [
        "# write out the transcript into a file\n",
        "if not os.path.exists(\"transcripts\"):\n",
        "    os.makedirs(\"transcripts\")\n",
        "\n",
        "with open(\"transcripts/raw_transcript.txt\", 'w') as f:\n",
        "    f.write(raw_transcript)\n",
        "    f.write(\"\\n\")"
      ],
      "execution_count": null,
      "outputs": []
    },
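    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Once the model produces a transcription later in the notebook, the processed ground truth can be scored against it with word error rate (WER). The cell below is a minimal sketch with toy strings, assuming NeMo's `word_error_rate` helper from `nemo.collections.asr.metrics.wer`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "from nemo.collections.asr.metrics.wer import word_error_rate\n",
        "\n",
        "# Toy example: the hypothesis drops one of the six reference words -> WER = 1/6\n",
        "reference = [\"this is screaming in the cloud\"]\n",
        "hypothesis = [\"this is screaming in cloud\"]\n",
        "print(word_error_rate(hypotheses=hypothesis, references=reference))"
      ],
      "execution_count": null,
      "outputs": []
    },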
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JMyRaBwfEi4R"
      },
      "source": [
        "# Text Preprocessing pipeline\n",
        "\n",
        "Here, we create a basic preprocessing pipeline that converts the ground-truth transcript into a format that can easily be compared against the model's transcription. We won't focus on efficiency here, instead opting to implement a simple pipeline processor that offers some flexibility."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "x79dvHfZEdcT"
      },
      "source": [
        "import re\n",
        "from tqdm import tqdm\n",
        "from typing import List, Optional"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gA43asZLK5mW"
      },
      "source": [
        "### Preprocessing tasks\n",
        "\n",
        "First, we create the basic unit of work in the pipeline - a \"task\" that accepts some input and emits some output.\n",
        "\n",
        "Each task also maintains a dictionary of metadata, so that subsequent tasks can inspect or use the previous steps' metadata (if needed)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "A-w5Lz0hK2JO"
      },
      "source": [
        "class ProcessingTask:\n",
        "    def __init__(self):\n",
        "        self.metadata = {}\n",
        "\n",
        "    def __call__(self, *args):\n",
        "        raise NotImplementedError()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "s-tByXYJLOVP"
      },
      "source": [
        "### Text Preprocessor\n",
        "\n",
        "Next, we create the processor that executes a sequence of tasks. We impose a few restrictions - the input to each task must be positional arguments (no kwargs), and the output must likewise be positional values (no dicts).\n",
        "\n",
        "Each task can also read/update the global metadata registry, which accumulates the metadata of all the previous tasks. A short usage sketch of this processor appears after the task definitions below."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "erybK5s8LKAK"
      },
      "source": [
        "class TextPreprocessor:\n",
        "    def __init__(self, filepath):\n",
        "        self.filepath = filepath\n",
        "        self.metadata = {}\n",
        "\n",
        "    def process(self, tasks: List[ProcessingTask]):\n",
        "        # read the text file in its entirety\n",
        "        with open(self.filepath, 'r', encoding='utf-8') as f:\n",
        "            lines = f.readlines()\n",
        "\n",
        "        print(f\"Loaded {len(lines)} lines of text into memory ...\")\n",
        "        assert len(tasks) > 0\n",
        "\n",
        "        # Prepare the processing pipeline\n",
        "        processed_outputs = [lines]\n",
        "        metadata = self.metadata\n",
        "        metadata['task_pipeline'] = []  # keep track of the order of tasks executed\n",
        "\n",
        "        for task in tasks:\n",
        "            print(f\"Performing task : {task.__class__.__name__}\")\n",
        "            metadata['task_pipeline'].append(task.__class__.__name__)\n",
        "            task.metadata.update(metadata)  # update the task's metadata with that of the previous tasks\n",
        "\n",
        "            processed_outputs = task(*processed_outputs)  # execute the task\n",
        "            metadata = task.metadata\n",
        "\n",
        "            # if the output was a single value, pack it into a tuple\n",
        "            if type(processed_outputs) != tuple:\n",
        "                processed_outputs = (processed_outputs,)\n",
        "\n",
        "        self.metadata = metadata\n",
        "\n",
        "        return processed_outputs"
      ],
      "execution_count": null,
      "outputs": []
    },
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "By2TpBhgMdmw" | |
}, | |
"source": [ | |
"### Generic Processing Tasks\n", | |
"\n", | |
"Now that we have the basic task format and its executer, let's create some basic text preprocessing tasks." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "LJBHGwQJMc0F" | |
}, | |
"source": [ | |
"class TrimNewLinesTask(ProcessingTask):\n", | |
" \"\"\" Trim all new lines from the text \"\"\"\n", | |
" def __call__(self, texts, **kwargs):\n", | |
" for idx in tqdm(range(len(texts)), desc=self.__class__.__name__, total=len(texts)):\n", | |
" texts[idx] = texts[idx].replace(\"\\n\", \"\")\n", | |
"\n", | |
" return texts" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "RwqbyTgTM12v" | |
}, | |
"source": [ | |
"class IndexBlankLinesTask(ProcessingTask):\n", | |
" \"\"\" Get the index of blank lines within the text (after replacing newlines) \"\"\"\n", | |
" def __init__(self):\n", | |
" super(IndexBlankLinesTask, self).__init__()\n", | |
" self.metadata['blank_idx'] = []\n", | |
"\n", | |
" def __call__(self, texts, **kwargs):\n", | |
" for idx in tqdm(range(len(texts)), desc=self.__class__.__name__, total=len(texts)):\n", | |
" if len(texts[idx]) == 0: # was just a new line\n", | |
" self.metadata['blank_idx'].append(idx)\n", | |
"\n", | |
" return texts" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "lwaeSxhgM4f-" | |
}, | |
"source": [ | |
"class LowerCaseTask(ProcessingTask):\n", | |
" \"\"\" Lower case all of the text \"\"\"\n", | |
" def __call__(self, texts):\n", | |
" for idx in tqdm(range(len(texts)), desc=self.__class__.__name__, total=len(texts)):\n", | |
" texts[idx] = texts[idx].lower()\n", | |
"\n", | |
" return texts" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "-0u2r_zDM6EL" | |
}, | |
"source": [ | |
"class SpecialCharactersTask(ProcessingTask):\n", | |
" \"\"\" Replace a set of special characters with empty spaces \"\"\"\n", | |
" def __init__(self, special_characters: List[str], replacement: str = \"\"):\n", | |
" super(SpecialCharactersTask, self).__init__()\n", | |
" self.special_characters = special_characters\n", | |
" self.replacement = replacement\n", | |
"\n", | |
" def __call__(self, texts):\n", | |
" self.metadata['special_characters_idx'] = self.metadata.get('special_characters_idx', {})\n", | |
" self.metadata['special_characters_replacement_tokens'] = self.metadata.get('special_characters_replacement_tokens', [])\n", | |
" self.metadata['special_characters_replacement_tokens'].append(self.replacement)\n", | |
"\n", | |
" for idx in tqdm(range(len(texts)), desc=self.__class__.__name__, total=len(texts)):\n", | |
" for special_char in self.special_characters:\n", | |
" if special_char in texts[idx]:\n", | |
" texts[idx] = texts[idx].replace(special_char, self.replacement).strip()\n", | |
"\n", | |
" if special_char in self.metadata['special_characters_idx']:\n", | |
" self.metadata['special_characters_idx'][special_char].append(idx)\n", | |
" else:\n", | |
" self.metadata['special_characters_idx'][special_char] = [idx]\n", | |
"\n", | |
" return texts" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "z5S7jAcXNA6H" | |
}, | |
"source": [ | |
"class EmptyLineRemovalTask(ProcessingTask):\n", | |
" \"\"\" Remove all empty lines from the text list \"\"\"\n", | |
" def __call__(self, texts):\n", | |
" new_texts = []\n", | |
" for idx in tqdm(range(len(texts)), desc=self.__class__.__name__, total=len(texts)):\n", | |
" if len(texts[idx]) > 0:\n", | |
" new_texts.append(texts[idx])\n", | |
"\n", | |
" return new_texts" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Y1-AEeyzNCcg" | |
}, | |
"source": [ | |
"class MergeLinesTask(ProcessingTask):\n", | |
" \"\"\" Merge all of the text with spaces \"\"\"\n", | |
" def __call__(self, texts):\n", | |
" self.metadata['num_lines_merged'] = len(texts)\n", | |
"\n", | |
" texts = ' '.join(texts)\n", | |
" return [texts]" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "zmBjE-H0OASA" | |
}, | |
"source": [ | |
"### Specialized Preprocessing Tasks\n", | |
"\n", | |
"The podcast requires more processing as compared to the generic tasks above. We define them below. \n", | |
"\n", | |
"The `SpeakerHeaderRemovalTask` will remove the mention of the speaker - (whether the speaker was Corey, Kelsey etc).\n", | |
"\n", | |
"The `SeperateTokenTask` will take a unique word - say \"screaminginthecloud\" and replace it with spaces. While the original text is meant to represent an entity, the ASR model has no such notion and will probably transcribe it as separate words.\n", | |
"\n", | |
"Finally, the `WebsiteTransformTask` will replace the \".com\" in websites to \"dot com\" since the ASR model has no special characters." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "PaCz-OAJNDQw" | |
}, | |
"source": [ | |
"class SpeakerHeaderRemovalTask(ProcessingTask):\n", | |
" \"\"\" Remove the speaker header information (represented as the first name of the entity - Announcer, Corey, Kelsey etc) \"\"\"\n", | |
" def __call__(self, texts):\n", | |
" self.metadata['unique_speakers'] = {}\n", | |
"\n", | |
" for idx in tqdm(range(len(texts)), desc=self.__class__.__name__, total=len(texts)):\n", | |
" text = texts[idx]\n", | |
" if len(text) > 0:\n", | |
" search = re.search(r\"[a-zA-z]*:\", text)\n", | |
" if search:\n", | |
" speaker = search.group(0)\n", | |
" if speaker in self.metadata['unique_speakers']:\n", | |
" self.metadata['unique_speakers'][speaker].append(idx)\n", | |
" else:\n", | |
" self.metadata['unique_speakers'][speaker] = [idx]\n", | |
"\n", | |
" texts[idx] = texts[idx].replace(speaker, \"\").strip()\n", | |
"\n", | |
" return texts" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "v5gjUz_5OKqg" | |
}, | |
"source": [ | |
"class SeperateTokenTask(ProcessingTask):\n", | |
" \"\"\" there are several mentions of \"digital ocean\" - but as it is an entity, it is transcribed as one word. Split it.\"\"\"\n", | |
"\n", | |
" def __init__(self, token: str, split_token: str):\n", | |
" super().__init__()\n", | |
" self.token = token\n", | |
" self.split_token = split_token\n", | |
"\n", | |
" def __call__(self, texts):\n", | |
" self.metadata['separate_token'] = self.metadata.get('separate_token', {})\n", | |
" self.metadata['separate_token'][self.token] = self.split_token\n", | |
"\n", | |
" for idx in tqdm(range(len(texts)), desc=self.__class__.__name__, total=len(texts)):\n", | |
" if self.token in texts[idx]:\n", | |
" texts[idx] = texts[idx].replace(self.token, self.split_token)\n", | |
"\n", | |
" return texts" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "97R0aODeOLz-" | |
}, | |
"source": [ | |
"class WebsiteTransformTask(ProcessingTask):\n", | |
" \"\"\" There are multiple references to *.com - replace . with \"dot\" as model does not produce special characters like period and comma \"\"\"\n", | |
" def __call__(self, texts):\n", | |
" self.metadata[f'website_convert_idx'] = {}\n", | |
"\n", | |
" for idx in tqdm(range(len(texts)), desc=self.__class__.__name__, total=len(texts)):\n", | |
" search = re.search(r\"[a-zA-z-]*\\.com\", texts[idx])\n", | |
" if search:\n", | |
" matched_text = search.group(0)\n", | |
" texts[idx] = texts[idx].replace(\".\", \" dot\").strip()\n", | |
"\n", | |
" if matched_text in self.metadata[f'website_convert_idx']:\n", | |
" self.metadata[f'website_convert_idx'][matched_text].append(idx)\n", | |
" else:\n", | |
" self.metadata[f'website_convert_idx'][matched_text] = [idx]\n", | |
"\n", | |
" return texts" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "HdZxho_5PSzt" | |
}, | |
"source": [ | |
"## Preprocess the transcript\n", | |
"\n", | |
"Now that the pipeline is setup, let's execute it to clean up the text!\n", | |
"\n", | |
"**NOTE**: These preprocessing steps were arbitrarily chosen and may/may not be valid corrections. Feel free to comment out the tasks list below to change the pipeline, or to add even more tasks !" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "KXGxWQE1OM13" | |
}, | |
"source": [ | |
"raw_transcript_filepath = \"transcripts/raw_transcript.txt\"" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "LlP4JnX3Pbas" | |
}, | |
"source": [ | |
"tasks = [\n", | |
" TrimNewLinesTask(),\n", | |
" LowerCaseTask(),\n", | |
" SpeakerHeaderRemovalTask(),\n", | |
" WebsiteTransformTask(),\n", | |
" SeperateTokenTask(\"digitalocean\", \"digital ocean\"),\n", | |
" SeperateTokenTask(\"cloudzero\", \"cloud zero\"),\n", | |
" SeperateTokenTask(\"screaminginthecloud\", \"screaming in the cloud\"),\n", | |
" SpecialCharactersTask(special_characters=[\",\", \"%\", \"-\", \"?\", \"@\", '\"', \":\", \".\", \"!\"]),\n", | |
" SpecialCharactersTask(special_characters=[\"$\"], replacement=\"\"),\n", | |
" SpecialCharactersTask(special_characters=[\"/\"], replacement=\" slash \"),\n", | |
" IndexBlankLinesTask(),\n", | |
" EmptyLineRemovalTask(),\n", | |
" MergeLinesTask(),\n", | |
" ]\n", | |
"\n", | |
"processor = TextPreprocessor(raw_transcript_filepath)\n", | |
"result = processor.process(tasks)\n", | |
"\n", | |
"# remove the tuple packing\n", | |
"result = result[0]\n", | |
"# the output text has been merged into a single text, so no need for the list anymore\n", | |
"result= result[0] " | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "emKBridIPjO4" | |
}, | |
"source": [ | |
"print(\"Metadata\")\n", | |
"print(\"*\" * 80)\n", | |
"for key in processor.metadata.keys():\n", | |
" print(f\"{key} :\")\n", | |
" print(processor.metadata[key])\n", | |
" print()" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "H6FcBSUlQJhO" | |
}, | |
"source": [ | |
"---------\n", | |
"\n", | |
"The resultant text is now just a long string of text with no special characters and new lines" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "uiV7JbaaPrR1" | |
}, | |
"source": [ | |
"print(\"Result :\")\n", | |
"print(result)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "XNoeRVT9QVdg" | |
}, | |
"source": [ | |
"Write the normalized transcript into an output file" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Ot-Z47cFQHWt" | |
}, | |
"source": [ | |
"if not os.path.exists(\"transcripts/normalized/\"):\n", | |
" os.makedirs(\"transcripts/normalized/\")\n", | |
"\n", | |
"normalized_transcript_path = \"transcripts/normalized/ground_truth.txt\"\n", | |
"\n", | |
"with open(normalized_transcript_path, 'w', encoding='utf-8') as f:\n", | |
" f.write(f\"{result}\")\n", | |
" f.write(\"\\n\")" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "gZlnKflsXvXq" | |
}, | |
"source": [ | |
"# Transcribe the processed audio file\n", | |
"\n", | |
"Now that we have a \"ground truth\" text transcript we can compare against, let's actually transcribe the podcast with a model !" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "KXc5NQFcX01v" | |
}, | |
"source": [ | |
"## Helper methods\n", | |
"\n", | |
"We define a few helper methods to enable automatic mixed precision if it is available in the colab GPU (if a GPU is being used at all)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "19p2X0deX3Ku" | |
}, | |
"source": [ | |
"import contextlib\n", | |
"import torch" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Adg2CWbqX4Cf" | |
}, | |
"source": [ | |
"# Helper for torch amp autocast\n", | |
"if torch.cuda.is_available():\n", | |
" autocast = torch.cuda.amp.autocast\n", | |
"else:\n", | |
" @contextlib.contextmanager\n", | |
" def autocast():\n", | |
" print(\"AMP was not available, using FP32!\")\n", | |
" yield" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "74bEhFRuX99E" | |
}, | |
"source": [ | |
"device = 'cuda' if torch.cuda.is_available() else 'cpu'\n", | |
"device" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "1u_WXD8IYTpI" | |
}, | |
"source": [ | |
"## Instantiate a model\n", | |
"\n", | |
"We choose a small model - Citrinet 256 - since it offers good transcription accuracy but is just 10 M parameters.\n", | |
"\n", | |
"**Feel free to change to the medium and larger sized models !**\n", | |
"\n", | |
" - small = \"stt_en_citrinet_256\" (9.8 M parameters)\n", | |
" - medium = \"stt_en_citrinet_512\" (38 M parameters)\n", | |
" - large = \"stt_en_citrinet_1024\" (142 M parameters)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Nm-BveGxYNkP" | |
}, | |
"source": [ | |
"model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(\"stt_en_citrinet_256\", map_location=device)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "FoGVootFYqGW" | |
}, | |
"source": [ | |
"model = model.to(device)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "cpndoavWY27v" | |
}, | |
"source": [ | |
"## Transcribe audio\n", | |
"\n", | |
"Here, we simply call the model's \"transcribe()\" method, which does offline transcription of a provided list of audio clips." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "r7FyeM8BY0Ca" | |
}, | |
"source": [ | |
"%%time\n", | |
"\n", | |
"audio_path = \"data/processed/raw_audio_processed.wav\"\n", | |
"transcribed_filepath = f\"transcripts/normalized/transcribed_speech.txt\"\n", | |
"\n", | |
"if os.path.exists(transcribed_filepath):\n", | |
" print(f\"File already exists, delete {transcribed_filepath} manually before re-transcribing the audio !\")\n", | |
"\n", | |
"else:\n", | |
" with autocast():\n", | |
" transcript = model.transcribe([audio_path], batch_size=1)[0]" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "FIIiZ6GfZvkM" | |
}, | |
"source": [ | |
"## Write transcription" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "UkrDWKSuZD0k" | |
}, | |
"source": [ | |
"with open(transcribed_filepath, 'w', encoding='utf-8') as f:\n", | |
" f.write(f\"{transcript}\\n\")" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "797I6ITYaK0Y" | |
}, | |
"source": [ | |
"# Compute accuracy of transcription\n", | |
"\n", | |
"Now that we have a model's transcriped result, we compare the WER and CER against the \"ground truth\" transcription that we preprocessed earlier" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "ej7LDotcZ8oF" | |
}, | |
"source": [ | |
"from nemo.collections.asr.metrics.wer import word_error_rate" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "TUueKbEsaaZx" | |
}, | |
"source": [ | |
"ground_truth = normalized_transcript_path\n", | |
"model_transcript = transcribed_filepath" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "Dsev1XArapUz" | |
}, | |
"source": [ | |
"with open(ground_truth, 'r') as f:\n", | |
" ground_truth_txt = f.readlines()\n", | |
" ground_truth_txt = [text.replace(\"\\n\", \"\") for text in ground_truth_txt]\n", | |
"\n", | |
"with open(model_transcript, 'r') as f:\n", | |
" transcription_txt = f.readlines()\n", | |
" transcription_txt = [text.replace(\"\\n\", \"\") for text in transcription_txt]" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "uIFYpBwFtx6Z" | |
}, | |
"source": [ | |
"Compute both the character error rate and the word error rate (this might take a while on such large text transcripts)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "wnpJ8pXNavct" | |
}, | |
"source": [ | |
"cer = word_error_rate(transcription_txt, ground_truth_txt, use_cer=True)\n", | |
"wer = word_error_rate(transcription_txt, ground_truth_txt, use_cer=False)\n", | |
"print(\"Finished computing metrics !\")" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "c_IV5l01a38U" | |
}, | |
"source": [ | |
"print(f\"CER : {cer * 100:0.4f}%\")\n", | |
"print(f\"WER : {wer * 100:0.4f}%\")" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "_DPFoCSEuA2U" | |
}, | |
"source": [ | |
"-----\n", | |
"The model did fairly well, considering it wasn't trained on any corpus with technical terms (the train corpus is only publically available speech datasets). Furthermore, the ground truth preprocessing is not sufficient in some cases, but for a demonstration it's a reasonable effort." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "i45noyISmDBL" | |
}, | |
"source": [ | |
"# Diffing the generated transcript\n", | |
"\n", | |
"While a numeric score describes some context of how well the model did, it is much more useful to actually visualize the differences between the ground truth and the model transcription. \n", | |
"\n", | |
"We will partially port https://skeptric.com/python-diffs/ to suit our purposes. Note that we don't have any new lines so minor modifications will be necessary." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "8xh8pKGnbf74" | |
}, | |
"source": [ | |
"import difflib\n", | |
"from typing import List, Any, Callable, Tuple, Union\n", | |
"from itertools import zip_longest\n", | |
"import html\n", | |
"\n", | |
"Token = str\n", | |
"TokenList = List[Token]" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "6NZ6eugUmVUo" | |
}, | |
"source": [ | |
"whitespace = re.compile('\\s+')\n", | |
"# end_sentence = re.compile('[.!?]\\s+')\n", | |
"end_sentence = re.compile('[.]\\s+')\n", | |
"\n", | |
"def tokenize(s:str) -> TokenList:\n", | |
" '''Split a string into tokens'''\n", | |
" return whitespace.split(s)\n", | |
"\n", | |
"def untokenize(ts:TokenList) -> str:\n", | |
" '''Join a list of tokens into a string'''\n", | |
" return ' '.join(ts)\n", | |
"\n", | |
"def sentencize(s:str) -> TokenList:\n", | |
" '''Split a string into a list of sentences'''\n", | |
" return end_sentence.split(s)\n", | |
"\n", | |
"def unsentencise(ts:TokenList) -> str:\n", | |
" '''Join a list of sentences into a string'''\n", | |
" return '. '.join(ts)\n", | |
"\n", | |
"def html_unsentencise(ts:TokenList) -> str:\n", | |
" '''Joing a list of sentences into HTML for display'''\n", | |
" return ''.join(f'<p>{t}</p>' for t in ts)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "IiJezCeMmZWL" | |
}, | |
"source": [ | |
"def mark_text(text:str) -> str:\n", | |
" return f'<span style=\"color: red;\">{text}</span>'\n", | |
" \n", | |
"def mark_span(text:TokenList) -> TokenList:\n", | |
" if len(text) > 0:\n", | |
" text[0] = '<span style=\"background: #69E2FB;\">' + text[0]\n", | |
" text[-1] += '</span>'\n", | |
" return text\n", | |
"\n", | |
"# def mark_span(text:TokenList) -> TokenList:\n", | |
"# return [mark_text(token) for token in text]\n", | |
"\n", | |
"def markup_diff(a:TokenList, b:TokenList,\n", | |
" mark=mark_span,\n", | |
" default_mark = lambda x: x,\n", | |
" isjunk=None) -> Tuple[TokenList, TokenList]:\n", | |
" \"\"\"Returns a and b with any differences processed by mark\n", | |
"\n", | |
" Junk is ignored by the differ\n", | |
" \"\"\"\n", | |
" seqmatcher = difflib.SequenceMatcher(isjunk=isjunk, a=a, b=b, autojunk=False)\n", | |
" out_a, out_b = [], []\n", | |
" for tag, a0, a1, b0, b1 in seqmatcher.get_opcodes():\n", | |
" markup = default_mark if tag == 'equal' else mark\n", | |
" out_a += markup(a[a0:a1])\n", | |
" out_b += markup(b[b0:b1])\n", | |
" assert len(out_a) == len(a)\n", | |
" assert len(out_b) == len(b)\n", | |
" return out_a, out_b\n", | |
"\n", | |
"\n", | |
"def align_seqs(a: TokenList, b: TokenList, fill:Token='') -> Tuple[TokenList, TokenList]:\n", | |
" out_a, out_b = [], []\n", | |
" seqmatcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)\n", | |
" for tag, a0, a1, b0, b1 in seqmatcher.get_opcodes():\n", | |
" delta = (a1 - a0) - (b1 - b0)\n", | |
" out_a += a[a0:a1] + [fill] * max(-delta, 0)\n", | |
" out_b += b[b0:b1] + [fill] * max(delta, 0)\n", | |
" assert len(out_a) == len(out_b)\n", | |
" return out_a, out_b\n", | |
"\n", | |
"\n", | |
"def html_sidebyside(a, b):\n", | |
" # Set the panel display\n", | |
" out = '<div style=\"display: grid;grid-template-columns: 1fr 1fr;grid-gap: 20px;\">'\n", | |
" # There's some CSS in Jupyter notebooks that makes the first pair unalign. This is a workaround\n", | |
" out += '<p></p><p></p>'\n", | |
" for left, right in zip_longest(a, b, fillvalue=''):\n", | |
" out += f'<p>{left}</p>'\n", | |
" out += f'<p>{right}</p>'\n", | |
" out += '</div>'\n", | |
" return out\n", | |
"\n", | |
"def html_diffs(a, b):\n", | |
" a = html.escape(a)\n", | |
" b = html.escape(b)\n", | |
"\n", | |
" out_a, out_b = [], []\n", | |
" for sent_a, sent_b in zip(*align_seqs(sentencize(a), sentencize(b))):\n", | |
" mark_a, mark_b = markup_diff(tokenize(sent_a), tokenize(sent_b))\n", | |
" out_a.append(untokenize(mark_a))\n", | |
" out_b.append(untokenize(mark_b))\n", | |
"\n", | |
" return html_sidebyside(out_a, out_b)\n" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "sNlKUiEbnyEv" | |
}, | |
"source": [ | |
"from IPython.display import HTML, display\n", | |
"def show_diffs(a, b):\n", | |
" display(HTML(html_diffs(a,b)))" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "yxSv7MwAzKqj" | |
}, | |
"source": [ | |
"### Side by side comparison" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "sNV1IoB7n0Ui" | |
}, | |
"source": [ | |
"show_diffs(ground_truth_txt[0], transcription_txt[0])" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "W-9kz_im0chj" | |
}, | |
"source": [ | |
"### Inplace comparison" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "x1vSWfBMyOh4" | |
}, | |
"source": [ | |
"!pip install diff_match_patch" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "DmjVq9MlyY3l" | |
}, | |
"source": [ | |
"import diff_match_patch\n", | |
"\n", | |
"def show_inplace_diff(ground_truth, generated_transcript):\n", | |
" print(\"Showing inplace diff\")\n", | |
" print(\"Ground truth is in red, predicted word is in green\")\n", | |
" \n", | |
" print()\n", | |
" print(\"-\" * 160)\n", | |
" print()\n", | |
"\n", | |
" diff = diff_match_patch.diff_match_patch()\n", | |
" diff.Diff_Timeout = 0\n", | |
" orig_enc, pred_enc, enc = diff.diff_linesToChars(ground_truth.replace(\" \", \"\\n\"), generated_transcript.replace(\" \", \"\\n\"))\n", | |
" diffs = diff.diff_main(orig_enc, pred_enc, False)\n", | |
" diff.diff_charsToLines(diffs, enc)\n", | |
" diffs_post = []\n", | |
" for d in diffs:\n", | |
" diffs_post.append((d[0], d[1].replace('\\n', ' ')))\n", | |
"\n", | |
" diff_html = diff.diff_prettyHtml(diffs_post)\n", | |
" return HTML(diff_html)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "fOUNjkG_zcur" | |
}, | |
"source": [ | |
"show_inplace_diff(ground_truth_txt[0], transcription_txt[0])" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "v_lsxi5iw59m" | |
}, | |
"source": [ | |
"# [Extra] Seeking the upper limit of audio sequence length\n", | |
"\n", | |
"So we were able to transcribe a nearly 40 minute podcast and obtain a moderately accurate transcript. While this was great for a first effort, this raises the question - \n", | |
"\n", | |
"**Given 32 GB of memory on a GPU, what is the upper bound of audio duration that can be transcribed by a Citrinet model in a single forward pass?**" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "gg0YNG9Fb_D4" | |
}, | |
"source": [ | |
"import librosa\n", | |
"import datetime\n", | |
"import math\n", | |
"import gc\n", | |
"\n", | |
"original_duration = librosa.get_duration(filename=audio_path)\n", | |
"print(\"Original audio duration :\", datetime.timedelta(seconds=original_duration))" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "hJEPXWANvudq" | |
}, | |
"source": [ | |
"In order to extend the audio duration, we will concatenate the same audio clip multiple times, and then trim off any excess duration from the clip as needed.\n", | |
"\n", | |
"For convenience, we provide a scalar multiplier to the original audio duration." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "8s9MHlpXyJFC" | |
}, | |
"source": [ | |
"# concatenate the file multiple times\n", | |
"NUM_REPEATS = 3.5\n", | |
"new_duration = original_duration * NUM_REPEATS\n", | |
"\n", | |
"# write a temp file\n", | |
"with open('audio_repeat.txt', 'w') as f:\n", | |
" for _ in range(int(math.ceil(NUM_REPEATS))):\n", | |
" f.write(f\"file {audio_path}\\n\")" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "-niWHB0xzfZb" | |
}, | |
"source": [ | |
"Duplicate the audio several times, then trim off the required duration from the concatenated audio clip." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "RIAZCKh9zY0B" | |
}, | |
"source": [ | |
"repeated_audio_path = \"data/processed/concatenated_audio.wav\"\n", | |
"\n", | |
"!ffmpeg -t {new_duration} -f concat -i audio_repeat.txt -c copy -t {new_duration} {repeated_audio_path} -y\n", | |
"print(\"Finished repeating audio file!\")" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "cjBkE3M20OJR" | |
}, | |
"source": [ | |
"original_duration = librosa.get_duration(filename=audio_path)\n", | |
"repeated_duration = librosa.get_duration(filename=repeated_audio_path)\n", | |
"\n", | |
"print(\"Original audio duration :\", datetime.timedelta(seconds=original_duration))\n", | |
"print(\"Repeated audio duration :\", datetime.timedelta(seconds=repeated_duration))" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "xfIWxZ0y0hUP" | |
}, | |
"source": [ | |
"Attempt to transcribe it (Note this may OOM!)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "U3uweQm95y44" | |
}, | |
"source": [ | |
"# Clear up memory\n", | |
"torch.cuda.empty_cache()\n", | |
"gc.collect()\n", | |
"\n", | |
"device = 'cuda' if torch.cuda.is_available() else 'cpu'\n", | |
"# device = 'cpu' # You can transcribe even longer samples on the CPU, though it will take much longer !\n", | |
"model = model.to(device)" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "r39TLjDX0bVF" | |
}, | |
"source": [ | |
"%%time\n", | |
"\n", | |
"with autocast():\n", | |
" transcript_repeated = model.transcribe([repeated_audio_path], batch_size=1)[0]\n", | |
" del transcript_repeated" | |
], | |
"execution_count": null, | |
"outputs": [] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"id": "eSmes0rMwZpN" | |
}, | |
"source": [ | |
"Given a large amount of GPU memory, the Citrinet model can efficiently transcribe long audio segments with ease, without the need for streaming inference.\n", | |
"\n", | |
"This is possible due to a simple reason - no attention mechanism is used, and Squeeze-and-Excitation mechanism does not require quadratic memory requirements yet still provides reasonable global context information." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"metadata": { | |
"id": "I1i7-r7D0qDb" | |
}, | |
"source": [ | |
"" | |
], | |
"execution_count": null, | |
"outputs": [] | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment