Most of the effort went into evaluation discipline, because that is where audio benchmarks quietly leak. Recordings are split by collector, so no contributor — and none of their overlapping segments — appears in both training and test; measured cross-split overlap is exactly zero. Labels from one to five annotators per file are reconciled by majority vote. Every feature is standardized using statistics fit on the training split only; fitting on the full dataset would shift per-feature scale by up to 11%, a leak I quantified rather than ignored. Performance is reported as macro-F1 and macro-AP against three reference points: an always-silent baseline (0.00), a class-prior baseline (0.06), and a human annotator-agreement ceiling (0.78).
On that footing, logistic-regression and random-forest classifiers are swept across regularization strength and tree depth, producing clean bias-variance curves; logistic regression settles at 0.38 macro-F1, comfortably above the prior. A case study then runs the model on unseen recordings, pairing mel-spectrograms with predicted-versus-true timelines and a per-class error-overlap matrix. It makes the structure of the errors legible: continuous sounds like running water and vacuum are easy, while brief transients — light switch, window, wardrobe — are genuinely hard.