
Auto-generated YouTube subtitles contain timestamps for every word and other content that hampers readability:

00:00:30.230 --> 00:00:33.900 align:start position:19%
you<00:00:31.230><c> think</c><c.colorE5E5E5><00:00:31.470><c> from</c><00:00:31.650><c> my</c><00:00:31.740><c> calm</c><00:00:31.980><c> demeanor</c><00:00:32.010><c> that</c></c><c.colorCCCCCC><00:00:32.430><c> I</c></c>


00:00:32.580 --> 00:00:36.180 align:start position:19%
haven't<c.colorE5E5E5><00:00:32.760><c> got</c><00:00:32.910><c> a</c><00:00:32.940><c> care</c><00:00:33.150><c> in</c><00:00:33.210><c> the</c><00:00:33.330><c> world</c><00:00:33.420><c> that</c></c>

00:00:33.900 --> 00:00:38.160 align:start position:19%
you'd<00:00:34.019><c> be</c><00:00:34.140><c> wrong</c><00:00:34.410><c> you</c><00:00:34.680><c> see</c><c.colorE5E5E5><00:00:35.000><c> hidden</c><00:00:36.000><c> within</c></c>

How can I save only the speech, with reasonable formatting? Some users' speech-centric videos run for hours rather than minutes, and by reading I could finish these "one-man talk shows" in a fraction of the time.

user198350

1 Answer


Do the following:

  1. Make a copy of the file.
  2. Open the file in a text editor that has Regex-based find and replace functionality, like Notepad++ or Visual Studio Code.
  3. Invoke the find and replace function (Ctrl+H in the examples I gave), find the following regular expression, and replace with nothing:

    <.*?>
    

    Do not forget to activate Regex mode. In Notepad++, select the "Regular expression" radio button; in Visual Studio Code, click the button labeled ".*" (or press Alt+R).

  4. Replace all instances.

Here is the result from Visual Studio Code:

00:00:30.230 --> 00:00:33.900 align:start position:19%
you think from my calm demeanor that I

00:00:32.580 --> 00:00:36.180 align:start position:19%
haven't got a care in the world that

00:00:33.900 --> 00:00:38.160 align:start position:19%
you'd be wrong you see hidden within
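
If there are many files, the same find-and-replace can be scripted. Here is a minimal sketch in Python, assuming the subtitle text has already been read into a string. The name `strip_inline_tags` is made up for this illustration, and the pattern is the same `<.*?>` as above:

```python
import re

# Same pattern as in the editor: "<", then as few characters as possible, then ">".
TAG_PATTERN = re.compile(r"<.*?>")


def strip_inline_tags(text: str) -> str:
    """Remove inline tags such as <00:00:31.230>, <c> and </c>."""
    return TAG_PATTERN.sub("", text)


sample = "you'd<00:00:34.019><c> be</c><00:00:34.140><c> wrong</c>"
print(strip_inline_tags(sample))  # prints: you'd be wrong
```

The timing lines are left alone because `-->` contains no `<` for the pattern to start on.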
  • Videos with official support for subtitles (for example https://www.youtube.com/watch?v=Ye8mB6VsUHw) already use this layout. I'd prefer to remove all timestamps, position markers (align:start position:) and superfluous line breaks. – user198350 Oct 16 '17 at 11:59
  • I can probably give you additional Regex-based solutions, but per the 80/20 rule, they won't help you as much as this one did. The easiest thing to do at this stage is to load the resulting text from the Regex I gave you into Subtitle Edit and start reading. –  Oct 16 '17 at 13:35
  • I know that superuser.com is not "a free scripting service", but I also wanted to know if there's a built-in command in youtube-dl. – user198350 Oct 16 '17 at 13:46
  • Note that besides the extraneous formatting (which can be removed with regex), auto-generated closed captions show up progressively, word by word, which means the source .vtt is full of duplicated content. Our use case is getting a searchable transcript of talks for later reference, and duplicated content breaks it. A solution needs to take this aspect into account as well. – Leeroy Feb 10 '19 at 15:18
  • I just found a script on gist.github.com that worked for me (on a .vtt file of a YouTube video's auto-captions, obtained with youtube-dl): https://gist.github.com/glasslion/b2fcad16bc8a9630dbd7a945ab5ebf5e . It adds some logic to remove the duplicate lines. – orangenarwhals Apr 07 '20 at 05:03
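
For the two follow-up requests in the comments (dropping the timing lines and the repeated text), a rough Python sketch could look like the one below. It is not the logic of the linked gist. It assumes that repeats show up as identical consecutive lines once the tags are removed, and that the file starts with `WEBVTT`, `Kind:` and `Language:` header lines. The name `vtt_to_text` is made up for this illustration.

```python
import re

TAG = re.compile(r"<.*?>")
# Cue timing line, e.g. "00:00:30.230 --> 00:00:33.900 align:start position:19%".
TIMING = re.compile(r"^(\d{2}:)?\d{2}:\d{2}\.\d{3} --> ")


def vtt_to_text(vtt: str) -> str:
    """Return the spoken text only, joined into a single line."""
    kept = []
    for raw in vtt.splitlines():
        if TIMING.match(raw):
            continue  # drop timestamps and position markers
        line = TAG.sub("", raw).strip()
        # Assumed header lines; adjust if your files differ.
        if not line or line == "WEBVTT" or line.startswith(("Kind:", "Language:")):
            continue
        if kept and kept[-1] == line:
            continue  # drop a line that merely repeats the previous one
        kept.append(line)
    return " ".join(kept)


example = """WEBVTT
Kind: captions
Language: en

00:00:30.230 --> 00:00:33.900 align:start position:19%
you<00:00:31.230><c> think</c>

00:00:33.900 --> 00:00:33.910 align:start position:19%
you think

00:00:33.910 --> 00:00:36.000 align:start position:19%
you think
haven't<00:00:32.760><c> got</c>
"""
print(vtt_to_text(example))  # prints: you think haven't got
```

Comparing only against the previous line keeps the sketch short, but it will also collapse a line the speaker genuinely says twice in a row.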