We are nearing the end of the Hugging Face Build Small Hackathon. Time to write about my experiences: why I built Chorus, how I built it, and what I would change given more time.

The plan

A few months ago, I heard WSJ columnist Louise Perry on her podcast wishing she had a trusted assistant who could read her social media comments. These comments can contain valuable feedback, but they also contain abuse and personal attacks.

That seemed like a problem LLMs could fix. Even a fairly small model should be able to remove the abuse and find the genuinely interesting comments. This is what I set out to build.

I had never built an app around an LLM before, but I knew it was fairly trivial to run an open model on a powerful CPU using llama.cpp. To avoid being overwhelmed by too much unfamiliar technology at once, I decided to:

  • Initially run everything locally
  • First build the filtering mechanism as a command line app
  • Then write a Gradio UI
  • Finally, host it on Hugging Face and Modal

Filtering the comments

In my naive original implementation, I did foresee some filtering of duplicates and obvious spam, but most of the work would be done by the model. I sent the comments to the LLM one by one and asked it to score each on toxicity, noise level, originality and some other parameters. The plan was to then select the comments that scored best and ask the model to summarize them.
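A minimal sketch of this first approach, under stated assumptions: `score_comment` stands in for the per-comment LLM call, `fake_scorer` is a toy replacement so the example runs without a model, and the way the scores are combined is purely illustrative. None of these names come from the Chorus code.

```python
from typing import Callable, Dict, List

# Score names follow the post (toxicity, noise level, originality);
# the 0..1 range is an assumption made for this sketch.
Scores = Dict[str, float]


def rank_comments(
    comments: List[str],
    score_comment: Callable[[str], Scores],
    top_n: int = 10,
) -> List[str]:
    """Score every comment individually and return the top_n best ones."""
    ranked = []
    for text in comments:
        scores = score_comment(text)  # one model call per comment: the bottleneck
        # Assumed weighting: reward originality, penalise toxicity and noise.
        total = scores["originality"] - scores["toxicity"] - scores["noise"]
        ranked.append((total, text))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:top_n]]


def fake_scorer(text: str) -> Scores:
    """Toy stand-in for the LLM call, for illustration only."""
    words = text.lower().split()
    return {
        "toxicity": 1.0 if "idiot" in words else 0.0,
        "noise": 1.0 if len(words) < 3 else 0.0,
        "originality": min(len(set(words)) / 10, 1.0),
    }
```

The loop makes the cost visible: the number of model calls grows linearly with the number of comments, which is what makes this approach slow on a CPU.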

I was running this with Qwen3-8B, on a six-year-old AMD 5950X. It immediately became obvious that this was way too slow for anything more than a handful of comments.

And so I set out to optimize. I filtered much more aggressively. I downloaded comments for a large number of videos and, with the help of AI, tried to find patterns in the most useful comments that I could implement in Python as part of my filter pipeline. I also decided to cluster comments. I had initially planned two LLM passes: one to rate comments, and a second to summarize the top comments. If I first clustered on keywords, one LLM pass would do: I could rate whole clusters and summarize them in the same call. I would then display summaries for the top clusters.
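The single-pass idea can be sketched as one prompt per cluster that asks for a rating and a summary together. This is an illustration only: the prompt wording, the JSON reply format, the 0-10 scale and both function names are assumptions, not the prompts Chorus actually uses.

```python
import json
from typing import Dict, List, Union


def build_cluster_prompt(comments: List[str], max_comments: int = 20) -> str:
    """Build one prompt that asks for a rating and a summary of a whole cluster."""
    sample = comments[:max_comments]  # cap the prompt size for large clusters
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(sample, start=1))
    return (
        "The following comments discuss a similar topic.\n"
        f"{numbered}\n\n"
        "Reply with JSON only, in the form "
        '{"score": <0-10 usefulness>, "title": "<short title>", "summary": "<2-3 sentences>"}'
    )


def parse_cluster_reply(reply: str) -> Dict[str, Union[float, str]]:
    """Parse the model's JSON reply into score, title and summary."""
    data = json.loads(reply)
    return {
        "score": float(data["score"]),
        "title": str(data["title"]),
        "summary": str(data["summary"]),
    }
```

With this shape, the number of model calls depends on the number of clusters rather than the number of comments, and the parsed scores can be sorted to pick the top clusters to display.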

This worked fairly well. For videos with a few thousand comments, I went from processing times of over an hour to under 10 minutes on my hardware. That was promising.

Hosting and User Interface

I was a bit worried about postponing this part for too long. I had never even heard of Gradio before. The UI has to start a request in the background and poll its status, which could be tricky. But it turned out Gradio has support for this, and writing the UI was fairly trivial.

Hosting this project was also easy. Push the code to Hugging Face, and the Gradio app automatically appears. For Modal, it took me some time to find the correct llama.cpp image, just because I didn’t know where to look. But once I found it, hosting the model also became trivial. You can find my Modal App here.

Tweaking and testing

I hoped clustering would be trivial. I used BERTopic with the BAAI/bge-small-en-v1.5 embedding model. But depending on the video, there is usually a set of words that appear in many comments but are not useful for clustering. This would often result in huge, very diverse clusters. I tackled that by removing all words that appeared in more than 5% of comments (my original threshold was 60%, but testing eventually brought it down to 5%). This had the added bonus that I did not have to rely on fixed stop word lists, which are inevitably language dependent. Elsewhere, my filters still target English specifically, but some testing on other languages, and even on multilingual comment threads, showed that my clustering still gives useful results.
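The frequency cut-off described above can be sketched with the standard library alone. This is a simplified illustration, not the code from the repo: the function name `drop_frequent_words` and the regex tokenizer are assumptions, and only the 5% default comes from the post.

```python
import re
from collections import Counter
from typing import List


def drop_frequent_words(comments: List[str], max_doc_ratio: float = 0.05) -> List[str]:
    """Remove every word that appears in more than max_doc_ratio of the comments."""
    # Assumed tokenizer: lowercase runs of word characters (Unicode-aware).
    tokenized = [re.findall(r"\w+", comment.lower()) for comment in comments]

    # Document frequency: count each word at most once per comment.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    limit = max_doc_ratio * len(comments)
    too_common = {word for word, count in doc_freq.items() if count > limit}

    return [
        " ".join(token for token in tokens if token not in too_common)
        for tokens in tokenized
    ]
```

Because the cut-off is computed from the comments themselves, the words that get removed differ per video and per language, which is why no fixed stop word list is needed. scikit-learn's `CountVectorizer` exposes a similar cut-off through its `max_df` parameter.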

Another side effect of clustering, and of removing low-value comments before clustering, is that toxic comments usually disappeared automatically. Filtering toxic comments was originally one of the main goals, but the current version no longer has a toxicity filter, simply because it doesn’t seem to need one.

I also hoped to speed up processing with smaller models. A test script ran the same test data against qwen3-1.7b, qwen3-4b and qwen3-8b. On my hardware and test data, the 4B model was almost twice as fast as the 8B model, and the 1.7B model was again roughly twice as fast as the 4B model. Unfortunately, the resulting summaries and titles were also nowhere near as good, so I ended up sticking with qwen3-8b.
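A comparison like this can be run with a small timing harness. The sketch below is hypothetical: `time_models` is an invented name, and each entry in `models` is assumed to be a callable that sends a prompt to one model and returns its reply, however that model happens to be served.

```python
import time
from typing import Callable, Dict, List


def time_models(
    models: Dict[str, Callable[[str], str]],
    prompts: List[str],
) -> Dict[str, float]:
    """Run the same prompts through every model and return total seconds per model."""
    timings = {}
    for name, generate in models.items():
        start = time.perf_counter()
        for prompt in prompts:
            generate(prompt)  # reply is discarded here; quality is judged separately
        timings[name] = time.perf_counter() - start
    return timings
```

Timing alone only covers half of the comparison: the summaries and titles each model produces still have to be read and compared by hand.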

Demo

Here you can watch a short demo:

Results are cached. If you don’t want to wait for the system to process all comments, you can paste in a URL that has been processed before, for example this one: https://www.youtube.com/watch?v=GfH4QL4VqJ0

What’s next?

In retrospect, I may have focused too much on optimizing performance. The system works, but the quality of the non-LLM filtering is not really good enough. That, of course, then affects the clustering and the LLM-generated summary as well. Clustering gives us the most-discussed topics, but does not necessarily give us the most insightful feedback.

Given more time, I would probably have tried scoring in two passes, much closer to my original idea, but maybe with a smaller model for the first comment-scoring pass and the 8B model for summarizing. This would take longer, but it would probably yield much more useful results. The current, faster implementation is probably not good enough to be a useful tool for writers and other content creators.

The code, and more details on the implementation

The code is available at Hugging Face. While this blog post gives some background on the decisions made while building this, the readme in the code repo gives more detail on the current implementation, and on all the steps of the filtering process.