August 11, 2018

I want to rate websites based on performance metrics, presence of tracking technology, and ratio of content to bullshit (complex CSS, useless JS, ads, images, and so on). I'm also interested in using a machine learning classifier to rate sites, but I need to learn about what kinds of inputs are appropriate for different machine learning algorithms. I'm assuming that I can't expect good results if I just use all of a page's sources, plus a ton of metrics, as the representation of a page to classify.

I plan to collect as much data as possible for each page, extract a few key metrics from that data, and then rate the pages using rules I create manually. There are already plenty of website auditing tools, such as Lighthouse. The difference is my opinion about what makes a site bullshit. Lighthouse doesn't penalize your site for serving tracking scripts and ads.
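As a rough illustration of what the manual rules might look like, here's a minimal sketch. Every metric name, weight, and threshold below is a hypothetical placeholder, not something I've settled on:

```python
# A sketch of hand-written rule-based scoring. All metric names,
# weights, and thresholds here are hypothetical placeholders.

def bullshit_score(metrics):
    """Return a 0-100 score; higher means more bullshit."""
    score = 0
    # Penalize tracking scripts heavily (cap the penalty at 3 trackers).
    score += 20 * min(metrics.get("tracker_count", 0), 3)
    # Penalize heavy JavaScript payloads (bytes transferred).
    if metrics.get("js_bytes", 0) > 500_000:
        score += 20
    # Penalize a low ratio of text content to total page weight.
    if metrics.get("content_ratio", 1.0) < 0.1:
        score += 20
    return min(score, 100)

clean_page = {"tracker_count": 0, "js_bytes": 50_000, "content_ratio": 0.4}
heavy_page = {"tracker_count": 5, "js_bytes": 2_000_000, "content_ratio": 0.02}
print(bullshit_score(clean_page))  # 0
print(bullshit_score(heavy_page))  # 100
```

The nice thing about this shape is that tweaking the rules is just editing weights and thresholds until the rankings match my own judgment.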

Next, I'll try sending the data to different algorithms. I'll have to figure out what kind of classifier suits my data, and gradually expand the number of inputs, starting with the handful of metrics I picked manually and working up to feeding in all of the raw data collected. I'm not sure whether I should keep picking inputs manually, or whether I can direct the algorithm to use only a fraction of the inputs and learn which ones matter. I understand the basic concepts of statistics, machine learning, and neural networks, but I have a lot to learn about concrete implementation and usage.

With only a handful of metrics, I'm pretty sure there is no point in using machine learning. I can just tweak the rules manually until I get results I like. Machine learning might be useful for discovering factors I haven't considered or can't predict. It will also be fun to learn more about using machine learning in my own code.


If you want to read more, subscribe to my personal newsletter.