Ideas for archiving web content in org-roam notes

I have come up with some ways to work with URLs in my org-roam notes by running a few commands on a schedule with cron. The first thing I do is save all the URLs found in the notes:

rg --only-matching --no-filename --no-line-number "(http|https)://[a-zA-Z0-9./?=_%:-]*" ~/org/roam/*.org | sort > ~/org/roam/all_urls.txt

I then create separate files for YouTube, SoundCloud and Mixcloud:

grep -E "youtu(\.be|be\.com/watch)" ~/org/roam/all_urls.txt > ~/org/roam/youtube_urls.txt
grep -E "soundcloud\.com" ~/org/roam/all_urls.txt > ~/org/roam/soundcloud_urls.txt
grep -E "mixcloud\.com" ~/org/roam/all_urls.txt > ~/org/roam/mixcloud_urls.txt
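
Since these steps run from cron, it can be handy to bundle them into one small script. Here is a sketch under my own assumptions: the paths, the function name, and grep -Eroh (as a portable stand-in for rg) are all placeholders to adjust to your layout.

```shell
#!/bin/sh
# refresh_note_urls.sh -- regenerate the URL lists from org-roam notes.
# Example cron entry (crontab -e): 0 * * * * /path/to/refresh_note_urls.sh

refresh_note_urls() {
    notes_dir="$1"   # where the .org files live
    out_dir="$2"     # where the URL lists go
    mkdir -p "$out_dir"

    # Extract every http(s) URL; grep -Eroh is a stand-in for the rg call above.
    grep -Eroh --include='*.org' '(http|https)://[a-zA-Z0-9./?=_%:-]*' "$notes_dir" \
        | sort > "$out_dir/all_urls.txt"

    # Split the master list per service; no matches just means an empty file.
    grep -E 'youtu(\.be|be\.com/watch)' "$out_dir/all_urls.txt" > "$out_dir/youtube_urls.txt" || true
    grep -E 'soundcloud\.com' "$out_dir/all_urls.txt" > "$out_dir/soundcloud_urls.txt" || true
    grep -E 'mixcloud\.com' "$out_dir/all_urls.txt" > "$out_dir/mixcloud_urls.txt" || true
}

refresh_note_urls "$HOME/org/roam" "$HOME/org/roam"
```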

Now you can use shuf to choose a random link:

shuf -n 1 youtube_urls.txt

You can use this to play a random YouTube video with catt:

catt cast "$(shuf -n 1 ~/org/roam/youtube_urls.txt)"

Or mpv:

mpv "$(shuf -n 1 ~/org/roam/youtube_urls.txt)"

You can also shuffle through the whole list with mpv --shuffle --playlist ~/org/roam/youtube_urls.txt!

Use youtube-dl to archive videos. I use --download-archive to keep track of
what has already been downloaded:

youtube-dl --batch-file ~/org/roam/youtube_urls.txt --download-archive ~/org/roam/youtube-dl_archive.txt

My actual youtube-dl config (~/.config/youtube-dl.conf) looks like this:

--output "~/sync/youtube-dl/%(uploader)s/%(title)s.%(ext)s"
--download-archive "~/sync/youtube-dl/download_archive.txt"
--sleep-interval 5
--max-sleep-interval 30

Just season to taste.

The ~/sync/ directory is synchronized with Syncthing, by the way.

With this configuration, I can play the downloaded videos like so:

mpv --shuffle ~/sync/youtube-dl/**
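
Note that ** relies on globstar (shopt -s globstar in bash; zsh enables it by default). A shell-agnostic alternative is to build the playlist with find; the extension list below is my assumption, so season to taste:

```shell
# Collect downloaded videos into a playlist file; works without globstar.
make_video_playlist() {
    video_dir="$1"
    playlist="$2"
    find "$video_dir" -type f \( -name '*.mp4' -o -name '*.mkv' -o -name '*.webm' \) > "$playlist"
}

make_video_playlist "$HOME/sync/youtube-dl" /tmp/videos.m3u 2>/dev/null || true
# Then: mpv --shuffle --playlist=/tmp/videos.m3u
```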

Another thing I have been experimenting with is using archivebox
(ArchiveBox/ArchiveBox) to archive absolutely everything:

archivebox add < ~/org/roam/all_urls.txt

Have a look at iipc/awesome-web-archiving for more archival options.

Does anyone else have experience with doing stuff like this? I’d love to hear your thoughts.