Serious question: how many of you are still training models on full datasets?

I was working on a text classifier for my local library's archives and kept hitting a wall with accuracy. My friend, who does this for a living, asked me how I was picking my training examples. I told him I was using everything, and he just said 'that's your problem right there.' He showed me a paper from a team at Stanford about active learning strategies. I switched to a simple uncertainty sampling method, picking only the 20% of data the model was least sure about for the next training loop. The accuracy jumped in two cycles. Has anyone else made a switch like this and seen similar results?
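The selection step described above (keeping only the examples the model is least sure about) can be sketched as a least-confidence filter. This is a minimal illustration, not the poster's actual code; the function name and the toy probabilities are made up for the example.

```python
import numpy as np

def least_confident_indices(probs, fraction=0.2):
    """Pick the `fraction` of samples the model is least confident about.

    probs: (n_samples, n_classes) array of predicted class probabilities.
    Uncertainty here is 1 - max class probability (least-confidence sampling).
    Returns indices sorted from most to least uncertain.
    """
    confidence = probs.max(axis=1)                  # model's top-class probability per sample
    n_pick = max(1, int(len(probs) * fraction))     # how many samples to route back into training
    return np.argsort(confidence)[:n_pick]          # lowest confidence first

# Toy run: 5 samples from a binary classifier.
probs = np.array([
    [0.95, 0.05],   # very confident
    [0.55, 0.45],   # uncertain
    [0.80, 0.20],
    [0.51, 0.49],   # most uncertain
    [0.99, 0.01],
])
picked = least_confident_indices(probs, fraction=0.4)
print(picked)  # -> [3 1]: the two samples nearest the 0.5 decision boundary
```

In a real loop you would label (or re-check the labels of) the picked samples, add them to the training set, retrain, and repeat; the post's "20%" corresponds to `fraction=0.2`.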
2 Comments
charlesb89
Sounds like you just got lucky with a clean dataset. I trained a spam filter on every single email in my company's ten-year archive, junk folder and all. The model learned the noise patterns so well it caught phishing attempts the active learning system missed. Sometimes brute force and more data beat a clever algorithm, especially when your "uncertain" examples are just badly labeled data. You might have just trimmed the confusing edge cases that actually mattered.
5
vera_lewis
Your spam filter example is interesting, @charlesb89, but that's a different problem. For a text classifier on a specific archive, you don't want the model to learn noise. The Stanford paper basically says you're wasting time on stuff the model already gets right. My library project had tons of old, similar meeting notes. Training on all of it just made the model lazy on the hard stuff, like telling apart fundraising reports from event summaries. Focusing on the confusing examples forced it to learn the real differences.
4