November 29, 2019

Patreon web scraper

Patreon has become a platform that supports folks who make amazing content, and I’m gladly donating in order to get high-quality content. Most creators use their Patreon page to post updates and provide access to paid content. Unfortunately, the Patreon page is slow, loads a bunch of ad trackers, and is hard to navigate. The pagination gives access to only a few posts at a time, and the performance of the page really makes you not want to load more… Additionally, when you stop your paid support for a channel, you also lose access to everything that was posted.

To remedy this situation somewhat, I created a small utility that uses a headless Chrome instance via puppeteer to download that data and provide it in the form of single-page HTML files that contain just the posted content and nothing more.

This is a great example of how powerful a programmable browser can be. If there is no good API available and content is only available as dynamically generated web pages, this approach continues to work pretty well. Plus, it’s easy to program live, as you can incrementally control Chrome via puppeteer or the Chrome DevTools Protocol. For the latter approach I really like the Clojure library tatut/clj-chrome-devtools. Anyway, glad that there are still ways around walled gardens…
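
To make the idea concrete, here is a minimal TypeScript sketch of the scrape-and-render approach. It is not the code of the utility itself. The names `Post`, `PageLike`, `scrapePosts`, `renderSinglePage` and `POST_SELECTOR` are made up for illustration, the selector is a placeholder rather than Patreon's real markup, and logging in (which you need for paid posts) is left out. `PageLike` only declares the two puppeteer `Page` methods the sketch calls, `goto` and `$$eval`.

```typescript
// One scraped post: a title plus the post body as an HTML string.
interface Post {
  title: string;
  html: string;
}

// The small part of puppeteer's Page API that this sketch relies on.
interface PageLike {
  goto(url: string, options?: { waitUntil?: "networkidle2" }): Promise<unknown>;
  $$eval(selector: string, fn: (elements: any[]) => Post[]): Promise<Post[]>;
}

// Hypothetical selector (an assumption); adjust it to the actual page markup.
const POST_SELECTOR = "[data-tag='post-card']";

// Escape plain text before placing it into HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Load the page, wait for network activity to settle, and collect all posts.
async function scrapePosts(page: PageLike, url: string): Promise<Post[]> {
  await page.goto(url, { waitUntil: "networkidle2" });
  // The callback runs inside the browser, so it may only touch the DOM
  // elements it is given, not variables from this file.
  return page.$$eval(POST_SELECTOR, (elements) =>
    elements.map((el) => {
      const heading = el.querySelector("h1, h2, h3");
      return {
        title: heading && heading.textContent ? heading.textContent.trim() : "Untitled",
        html: el.innerHTML,
      };
    })
  );
}

// Render the posts into one static HTML document with no scripts or trackers.
function renderSinglePage(creator: string, posts: Post[]): string {
  const articles = posts
    .map((p) => `<article>\n<h2>${escapeHtml(p.title)}</h2>\n${p.html}\n</article>`)
    .join("\n");
  return [
    "<!DOCTYPE html>",
    "<html>",
    `<head><meta charset="utf-8"><title>${escapeHtml(creator)}</title></head>`,
    "<body>",
    articles,
    "</body>",
    "</html>",
  ].join("\n");
}
```

With puppeteer installed, you would get a real page from `puppeteer.launch()` and `browser.newPage()`, pass it to `scrapePosts` (possibly with a type cast), write the result of `renderSinglePage` to a file, and finish with `browser.close()`. Keeping the rendering step free of any browser dependency means you can also try it out on hand-written data.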

© Robert Krahn 2021