Back to work
Audio ML · Sound-event detection · 2026

Multi-label sound-event detection

For every second of a recording, predict which of fifteen everyday sound events are happening — footsteps, running water, a microwave, a light switch. A sparse, imbalanced, multi-label problem where the entire challenge is honest evaluation. Built as a JKU machine-learning project, finished in a competitive challenge that reached 0.70 macro-F1.

Type University ML project · JKU
Role Data pipeline · Modeling · Evaluation
Stack scikit-learn · PyTorch · librosa · NumPy
Result 0.70 macro-F1 · 15 classes

Brief

The dataset is a crowd-collected library of everyday indoor sounds: 3,656 recordings, cut into roughly 168,000 one-second segments, each described by 960 precomputed audio features and labelled across fifteen event classes — footsteps, running water, keyboard typing, doors, cutlery and dishes, microwave, vacuum, light switch, and more. The task is multi-label and segment-level: any combination of events can overlap within a single second. Crucially, the labels are extremely sparse — 94% of all class slots are silent — and imbalanced, with the most common event roughly 29× more frequent than the rarest. A model that simply predicts "nothing, everywhere" scores 94% accuracy, so accuracy is meaningless. The real objective was a trustworthy evaluation and a model that genuinely beats trivial baselines.

Approach

Most of the effort went into evaluation discipline, because that is where audio benchmarks quietly leak. Recordings are split by collector, so no contributor — and none of their overlapping segments — appears in both training and test; measured cross-split overlap is exactly zero. Labels from one to five annotators per file are reconciled by majority vote. Every feature is standardized using statistics fit on the training split only; fitting on the full dataset would shift per-feature scale by up to 11%, a leak I quantified rather than ignored. Performance is reported as macro-F1 and macro-AP against three reference points: an always-silent baseline (0.00), a class-prior baseline (0.06), and a human annotator-agreement ceiling (0.78).

On that footing, logistic-regression and random-forest classifiers are swept across regularization strength and tree depth, producing clean bias-variance curves; logistic regression settles at 0.38 macro-F1, comfortably above the prior. A case study then runs the model on unseen recordings, pairing mel-spectrograms with predicted-versus-true timelines and a per-class error-overlap matrix. It makes the structure of the errors legible: continuous sounds like running water and vacuum are easy, while brief transients — light switch, window, wardrobe — are genuinely hard.

Outcome

A follow-on challenge pushed for the highest achievable macro-F1, and the progression was itself the lesson. Tuning per-class decision thresholds, instead of leaving them at the default 0.5, lifted logistic regression from 0.41 to 0.50; lightly median-filtering the prediction timeline added a little more. A CRNN trained from scratch on log-mel spectrograms matched the classical system at 0.51 — and no further.

The decisive move was transfer learning. Frozen embeddings from an AudioSet-pretrained model with a small BiGRU head reached 0.70 macro-F1 — and, tellingly, fine-tuning that same backbone end-to-end did worse (0.60), a textbook small-data result. Across both phases the consistent finding held: calibration and leakage control moved the score more than raw model capacity.

Report, figures, and code available on request.

The technical report, per-class evaluation, error-analysis figures, and the challenge pipeline can be shared for review.

Next project Satellite image classification