@Equim-chan
Last active March 17, 2024 11:16
(obsolete) Equim's SOP for Archiving YouTube Livestream

Equim’s SOP for Archiving YouTube Livestream

This document describes an SOP for archiving a YouTube livestream, either public or private.

The demonstrations below were run on Arch Linux, but they should work on other systems as well, including Windows with MSYS2.

This SOP was originally written for archiving Gawr Gura’s unarchived streams.

Overview

The result of the archive consists of:

  • an MPEG2-TS video file

  • a thumbnail file

  • a metadata JSON file

  • a streamlink trace log file

  • an integrity check log file

  • a raw livechat JSON file

  • a rendered livechat HTML file

The folder layout will be like:

.
├── m-Bq5CG_rGQ.html
├── m-Bq5CG_rGQ.json.gz
├── slchk.log
├── streamlink.log
├── [UNARCHIVED KARAOKE] Jazz Lounge!-m-Bq5CG_rGQ.info.json
├── [UNARCHIVED KARAOKE] Jazz Lounge!-m-Bq5CG_rGQ.jpg
└── [UNARCHIVED KARAOKE] Jazz Lounge!-m-Bq5CG_rGQ.ts

Preparation for Tools

Install the Tools

You will need

  • streamlink

  • youtube-dl

  • ffmpeg

  • virtualenv

  • pytchat

  • scripts in this gist

$ sudo pacman -S streamlink youtube-dl python-virtualenv ffmpeg
$ git clone https://github.com/taizan-hokuto/pytchat.git
$ cd pytchat
$ virtualenv venv
$ . venv/bin/activate
$ pip install -r requirements.txt
$ deactivate

Prepare cookies.txt

Note
This step is only needed if you are going to archive a private stream.

Export your cookies for the host youtube.com to a cookies.txt file, then test and sanitize the file with youtube-dl.

$ youtube-dl --cookies cookies.txt --skip-download "$any_youtube_video_url"

youtube-dl will actually rewrite your cookies.txt. Afterwards, you will see # This file is generated by youtube-dl. Do not edit. at the beginning of the file.
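As an optional extra check, the rewritten file can be loaded with Python's standard library to confirm it parses as a Netscape-format cookie file and actually holds youtube.com cookies. This is only a sketch: the helper name `youtube_cookie_names` is hypothetical and is not part of this gist or of youtube-dl.

```python
import http.cookiejar


def youtube_cookie_names(path="cookies.txt"):
    """Load a Netscape-format cookies.txt and list the cookie names for youtube.com.

    Hypothetical helper for illustration; raises http.cookiejar.LoadError
    if the file does not look like a Netscape cookie file.
    """
    jar = http.cookiejar.MozillaCookieJar(path)
    # Keep session cookies and already-expired entries so nothing is silently hidden.
    jar.load(ignore_discard=True, ignore_expires=True)
    names = []
    for cookie in jar:
        domain = cookie.domain.lstrip(".")
        if domain == "youtube.com" or domain.endswith(".youtube.com"):
            names.append(cookie.name)
    return sorted(names)
```

An empty list would suggest that the export did not include the youtube.com host.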

Before the Stream, After the Waiting Room is Available

Set Cookies for pytchat

Note
This step is only needed if you are going to archive a private stream.

You need to prepare the livechat archiver at this step.

Visit https://www.youtube.com/live_chat?v=$video_id in your browser. After the page has loaded, open devtools (usually F12), go to the "Network" tab, and look for any POST request whose URL starts with https://www.youtube.com/youtubei/v1/live_chat/get_live_chat. Copy the value of that request's Cookie header.

Edit pytchat/config/__init__.py, add a field with key cookie to headers, and paste the cookie value there.

Start Archiving the Chat

$ . venv/bin/activate
$ ./chat-archive.py fetch "$video_id"

2 Minutes Before the Scheduled Start Time

Start Archiving the Video

Run the streamlink-cookie.py script if you have prepared a cookies.txt; otherwise, replace it with vanilla streamlink.

$ env TZ=UTC ./streamlink-cookie.py \
  -o archive.ts \
  -l trace \
  --retry-streams 30 \
  --hls-live-restart \
  --hls-segment-threads 4 \
  --hls-segment-attempts 20 \
  --hls-playlist-reload-attempt 20 \
  "$video_url" \
  best \
  |& tee streamlink.log
Tip
The log file at trace level is very important as it is the source to check your archive’s integrity.

During the Stream

Get the Exact Start Time of the Stream

Note
If the stream is going to be unarchived, this will be your only chance to get this information.
$ ./get_start_time.sh "$video_id"
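
The last step of get_start_time.sh turns a Last-Modified header value into a microsecond Unix timestamp with GNU date, which is the format chat-archive.py render expects. On systems without GNU date, the same conversion could be sketched in Python as follows; the function name `last_modified_to_us` is hypothetical and not part of this gist.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def last_modified_to_us(header_value):
    """Convert an HTTP Last-Modified value to microseconds since the Unix epoch."""
    dt = parsedate_to_datetime(header_value)
    if dt.tzinfo is None:
        # Assumption: a value without a usable zone is treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    # Integer arithmetic avoids float rounding on large timestamps.
    return int(dt.timestamp()) * 1_000_000 + dt.microsecond
```

For example, `last_modified_to_us("Wed, 21 Oct 2015 07:28:00 GMT")` gives `1445412480000000`.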

Archive Metadata and Thumbnail

$ youtube-dl --write-thumbnail --write-info-json --skip-download "$video_url"
Note
The archived JSON may contain sensitive information about your archiving environment, such as your IP address. Make sure to erase it if you intend to share the file.
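
One way to scrub the address before sharing is a text-level substitution over the JSON file. The sketch below rests on an assumption: that the address appears inside stream URLs as an `ip=<addr>` query parameter or an `/ip/<addr>/` path segment. Other fields are not handled, so inspect the file yourself as well. The name `redact_ip` is hypothetical.

```python
import re

# Assumption: the address only shows up as `ip=<addr>` or `/ip/<addr>/` in URLs.
# Percent-encoded addresses (e.g. IPv6 with %3A) are not covered by this pattern.
_IP_FIELD = re.compile(r'(?<![A-Za-z0-9_])ip([=/])[0-9A-Fa-f.:]+(?![A-Za-z0-9])')


def redact_ip(text, placeholder='0.0.0.0'):
    """Replace IP values in `ip=` / `/ip/` URL fields with a placeholder."""
    return _IP_FIELD.sub(lambda m: f'ip{m.group(1)}{placeholder}', text)
```

Read the .info.json as text, pass it through `redact_ip`, and write the result to a new file so the untouched original stays in your private copy.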

After the Stream

Render the Chat HTML

$ env TZ=UTC ./chat-archive.py render "$video_id" "$exact_stream_start_timestamp"
$ # compress the raw json
$ gzip -9 "$video_id.json"

Check the Integrity of the Archive

$ ./slchk.py streamlink.log |& tee slchk.log

If you see missing segments count: 0, congrats.
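
The core of the check is finding gaps in the sequence of downloaded segment numbers. A minimal, standalone sketch of that idea follows; `find_gaps` is a hypothetical name, and unlike slchk.py it deduplicates IDs and works on a plain list of integers rather than on the log file.

```python
def find_gaps(segment_ids):
    """Return inclusive (first, last) ranges of IDs missing from the given segment numbers."""
    ids = sorted(set(segment_ids))
    gaps = []
    # Compare each ID with its successor; a step larger than 1 means lost segments.
    for prev, cur in zip(ids, ids[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


def missing_count(gaps):
    """Total number of missing segments across all gaps."""
    return sum(last - first + 1 for first, last in gaps)
```

With the IDs `[10, 11, 13, 17, 18]`, `find_gaps` returns `[(12, 12), (14, 16)]`, which is 4 missing segments in total.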

Organize Files

Rename your archive.ts to a proper title. It is recommended to use the same filename as the metadata JSON file you got from youtube-dl.
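
The rename can be scripted so the video always ends up with exactly the same base name as the metadata file. This is a sketch under the assumption that the working directory holds a single .info.json; the function name `rename_to_match_metadata` is hypothetical.

```python
from pathlib import Path


def rename_to_match_metadata(directory='.', video_name='archive.ts'):
    """Rename the recorded video so its base name matches the single .info.json file."""
    directory = Path(directory)
    candidates = sorted(directory.glob('*.info.json'))
    if len(candidates) != 1:
        raise RuntimeError(f'expected exactly one .info.json, found {len(candidates)}')
    # Strip the ".info.json" suffix and reuse the rest as the video's base name.
    stem = candidates[0].name[:-len('.info.json')]
    target = directory / (stem + '.ts')
    if target.exists():
        raise FileExistsError(str(target))
    (directory / video_name).rename(target)
    return target
```

Refusing to overwrite an existing target is a deliberate choice here, since an archive of an unarchived stream cannot be downloaded again.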


chat-archive.py

#!/usr/bin/env python
import sys
import time
import logging
import json
from datetime import datetime, timezone

import pytchat
from pytchat.processors.dummy_processor import DummyProcessor
from pytchat.processors.html_archiver import HTMLArchiver

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(pathname)s:%(lineno)s:\t%(msg)s')


def fetch(video_id, fallback_poll_interval=5):
    # Append raw chat actions to <video_id>.json, one JSON object per line.
    stream = pytchat.create(video_id=video_id, processor=DummyProcessor())
    total_len = 0
    with open(video_id + '.json', 'a') as json_out:
        logging.info(f'appending to {json_out.name}')
        while stream.is_alive():
            poll_interval = fallback_poll_interval
            chats = stream.get()
            if len(chats) != 1:
                logging.info(f'len(chats) != 1, sleep: {poll_interval}')
                time.sleep(poll_interval)
                continue
            chat = chats[0]
            if not chat:
                logging.info(f'chats[0] is empty, sleep: {poll_interval}')
                time.sleep(poll_interval)
                continue
            poll_interval = chat.get('timeout', poll_interval)
            chatdata = chat.get('chatdata') or []
            for item in chatdata:
                json_out.write(json.dumps(item, ensure_ascii=False, sort_keys=True, separators=(',', ':')) + '\n')
            logging.info(f'len: {total_len} + {len(chatdata)}, sleep: {poll_interval}')
            total_len += len(chatdata)
            time.sleep(poll_interval)


def render(video_id, start_us_utc):
    # Render <video_id>.json to <video_id>.html with times relative to the stream start.
    start = datetime.fromtimestamp(start_us_utc / 1e6, timezone.utc)
    ar = HTMLArchiver(video_id + '.html')
    with open(video_id + '.json') as json_in:
        logging.info(f'reading from {json_in.name}')
        batch = []
        for line in json_in:
            chat = json.loads(line)
            if 'addChatItemAction' not in chat:
                continue
            # write elapsed time
            for k, v in chat['addChatItemAction']['item'].items():
                if not v.get('timestampUsec'):
                    continue
                timestamp_us = float(v['timestampUsec'])
                timestamp = datetime.fromtimestamp(timestamp_us / 1e6, timezone.utc)
                if timestamp >= start:
                    elapsed = str(timestamp - start)
                else:
                    elapsed = '-' + str(start - timestamp)
                chat['addChatItemAction']['item'][k]['timestampText'] = {'simpleText': elapsed}
            batch.append(chat)
            if (len(batch)+1) % 32 == 0:
                ar.process([{'chatdata': batch}])
                batch.clear()
        if len(batch) > 0:
            ar.process([{'chatdata': batch}])
    ar.finalize()


if __name__ == '__main__':
    verb = sys.argv[1]
    if verb == 'fetch':
        video_id = sys.argv[2]
        fetch(video_id)
    elif verb == 'render':
        video_id = sys.argv[2]
        start_us_utc = float(sys.argv[3])
        render(video_id, start_us_utc)
    else:
        sys.exit(1)

get_start_time.sh

#!/usr/bin/env bash
# Usage: ./get_start_time.sh VIDEO_ID
youtube-dl --cookies cookies.txt -g "https://www.youtube.com/watch?v=$1" | \
  head -n 1 | \
  xargs curl -SsL | \
  tail -n 1 | \
  xargs curl -SsL -I | \
  grep -i 'last-modified' | \
  sed 's/last-modified: //i' | \
  xargs -d '\n' date +'%s%6N' --utc -d

slchk.py

#!/usr/bin/env python
import re
import sys

log_file = sys.argv[1]
segment_pat = re.compile(r'^.+ Segment')

# Collect (segment_id, timestamp) pairs from every log line mentioning a segment.
segments = []
with open(log_file) as f:
    for line in f:
        if 'Segment' not in line:
            continue
        segment_id = int(segment_pat.sub('', line.replace('complete', '')).strip())
        timestamp = line[1:line.index(']')]
        segments.append((segment_id, timestamp))

if not segments:
    sys.exit('no segment lines found in log')

segments.sort(key=lambda x: x[0])
print(f'start: {segments[0][0]} ({segments[0][1]})')
print(f'end: {segments[-1][0]} ({segments[-1][1]})')

# Walk the sorted IDs and report every gap between consecutive segments.
latest_id, latest_timestamp = segments[0]
missing_count = 0
for segment_id, timestamp in segments[1:]:
    if segment_id == latest_id + 1:
        latest_id = segment_id
        latest_timestamp = timestamp
        continue
    delta = segment_id - latest_id - 1
    if delta > 1:
        print(f'missing: {latest_id + 1}-{segment_id - 1} ({latest_timestamp} - {timestamp})')
    else:
        print(f'missing: {latest_id + 1} ({latest_timestamp} - {timestamp})')
    missing_count += delta
    latest_id = segment_id
    latest_timestamp = timestamp

print(f'missing segments count: {missing_count}')
sys.exit(1 if missing_count > 0 else 0)