Predicting bird presence in audio clips

This model predicts whether birds are present in five-second clips of an audio file. It sits upstream of a larger modeling effort: identifying clips in weakly labeled audio files that contain bird sounds with high probability. This model is not optimal! The aim is not to predict bird presence perfectly (or to be well calibrated), but to confidently assign strong labels to a subset of audio clips in the Birdclef Kaggle competition datasets.

Features

The 24 features fall into three groups:

  1. Sound frequency percentiles
  2. Thirteen Mel-frequency cepstral coefficients (MFCCs)
    • Averaged over axis 1 (columns)
    • n_fft=2048, hop_length=512
  3. Summary statistics of zero crossing rates in 1-second segments
    • Mean, standard deviation, min, max
    • Zero crossing rate of the entire clip
    • Threshold of 0.02

These features summarize an entire clip, irrespective of position in the waveform or spectrogram, so technically the clip does not have to be 5 seconds long.
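
As a rough illustration of the zero crossing rate group, the sketch below computes the five summary values for one clip using only the Python standard library. It is not the code used to build the model. It assumes that the 0.02 threshold means samples with absolute value at or below 0.02 are treated as zero (and that zero counts as positive); the function names and the dictionary keys are hypothetical.

```python
import math
import statistics


def zero_crossing_rate(samples, threshold=0.02):
    """Fraction of adjacent sample pairs whose sign differs.

    Assumption: samples with |x| <= threshold are treated as zero, and zero
    counts as positive, so low-amplitude noise does not register as crossings.
    """
    if len(samples) < 2:
        return 0.0
    negative = [(x if abs(x) > threshold else 0.0) < 0 for x in samples]
    crossings = sum(a != b for a, b in zip(negative, negative[1:]))
    return crossings / (len(samples) - 1)


def zcr_summary(samples, sample_rate, segment_seconds=1.0, threshold=0.02):
    """Mean, std, min, max of per-segment ZCR, plus the whole-clip ZCR."""
    seg_len = int(sample_rate * segment_seconds)
    segments = [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]
    rates = [zero_crossing_rate(s, threshold) for s in segments if len(s) >= 2]
    return {
        "zcr_mean": statistics.fmean(rates),
        "zcr_std": statistics.pstdev(rates),
        "zcr_min": min(rates),
        "zcr_max": max(rates),
        "zcr_clip": zero_crossing_rate(samples, threshold),
    }


# Usage: a synthetic 5-second, 100 Hz sine sampled at 8 kHz.
sr = 8000
clip = [math.sin(2 * math.pi * 100 * n / sr) for n in range(5 * sr)]
features = zcr_summary(clip, sr)
```

Because the summary is taken over however many 1-second segments the clip contains, the same function works for clips that are shorter or longer than 5 seconds.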



Too long do not read



Data

I built the model iteratively with some publicly available datasets.

First model iterate

  1. Fit decision tree-based classifiers to Freefield and Warblrb10k
    • About 3/4 of the Warblrb10k clips contain a bird
    • About 1/4 of the Freefield clips do not contain a bird
    • No data augmentation
  2. Grid search with 25% test and 75% training splits (averaging over 5 randomizations)
    • RandomForestClassifier, GradientBoostingClassifier, XGBClassifier
    • n_estimators: [10, 20, 50,]
    • max_depth: [5, 10, 20,]
    • I saved the results in the following file
  3. I chose to use the XGBClassifier with n_estimators=20 and max_depth=5
    • This simpler model does not have too large a gap between training and test metrics
    • The test accuracy is 80.40%.
    • The test precision is 79.05%.
    • The test recall is 81.68%.
    • The test AUROC is 88.08%.
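
The evaluation procedure in step 2 (a hyperparameter grid, with metrics averaged over several random 75%/25% splits) can be sketched as follows. This is a minimal standard-library illustration, not the original training script. `ThresholdStub` is a hypothetical stand-in that ignores its hyperparameters; in practice, any classifier class with `fit` and `predict` methods that accepts `n_estimators` and `max_depth` would take its place. The synthetic data is invented for the example, and AUROC is omitted for brevity.

```python
import random
from itertools import product
from statistics import fmean


def split_indices(n, test_fraction, seed):
    """Shuffle indices 0..n-1 and split them into train and test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def grid_search(X, y, make_model, grid, n_repeats=5, test_fraction=0.25):
    """Average train and test metrics over repeated random splits per setting."""
    results = []
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        runs = []
        for seed in range(n_repeats):
            train, test = split_indices(len(y), test_fraction, seed)
            model = make_model(**params)
            model.fit([X[i] for i in train], [y[i] for i in train])
            row = {}
            for name, idx in (("train", train), ("test", test)):
                pred = model.predict([X[i] for i in idx])
                scores = binary_metrics([y[i] for i in idx], pred)
                for metric, value in scores.items():
                    row[f"{name}_{metric}"] = value
            runs.append(row)
        averaged = {m: fmean(r[m] for r in runs) for m in runs[0]}
        results.append({**params, **averaged})
    return results


class ThresholdStub:
    """Hypothetical placeholder model; it ignores both hyperparameters."""

    def __init__(self, n_estimators, max_depth):
        self.n_estimators, self.max_depth = n_estimators, max_depth

    def fit(self, X, y):
        pos = [x[0] for x, label in zip(X, y) if label == 1]
        neg = [x[0] for x, label in zip(X, y) if label == 0]
        self.cut = (fmean(pos) + fmean(neg)) / 2
        return self

    def predict(self, X):
        return [int(x[0] > self.cut) for x in X]


# Synthetic one-feature data, invented for this example.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [[label + rng.gauss(0, 0.5)] for label in y]
grid = {"n_estimators": [10, 20, 50], "max_depth": [5, 10, 20]}
results = grid_search(X, y, ThresholdStub, grid)
```

Each row of `results` holds both training and test averages, so the gap between them (for example `train_accuracy - test_accuracy`) can be compared across settings when choosing a simpler model.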

Second model iterate

  1. Fit XGBClassifier to all heretofore mentioned data
  2. Use first model iterate to predict "hasbird" in Birdclef data
    • Apply zero padding to the Birdclef data if the final clip is longer than 2 seconds
    • Subset Birdclef data to those with
    • Predicted presence > 0.75, or
    • Audio file duration <= 15 seconds, or
    • Amphibia, Insecta, Mammalia as 0 in 2025 data
  3. Five data-augmented instances for each file
    • Use OneOf in audiomentations
    • AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=1.)
    • AddGaussianSNR(min_snr_db=5.0, max_snr_db=40.0, p=1.)
    • AddColorNoise(min_snr_db=5.0, max_snr_db=40.0, n_fft=128, p=1.)
  4. Grid search with 25% test and 75% training splits (averaging over 10 randomizations)
    • n_estimators: [10, 20, 50,]
    • max_depth: [5, 10, 20,]
  5. I chose the XGBClassifier with 50 estimators and max depth 10
    • The training accuracy is.
    • The training precision is.
    • The training recall is.
    • The training AUROC is.
    • The test accuracy is 94.51%.
    • The test precision is 96.73%.
    • The test recall is 96.09%.
    • The test AUROC is 98.18%.
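
The clip preparation and pseudo-labeling rules in step 2 can be read as the sketch below. It is one interpretation, not the original code. It assumes that a final partial clip of 2 seconds or less is dropped rather than padded, that the three subset conditions are alternatives, and that the non-bird classes are labeled 0 regardless of the prediction. The record keys `class_name`, `p_bird`, and `file_duration` are hypothetical.

```python
NON_BIRD_CLASSES = {"Amphibia", "Insecta", "Mammalia"}


def split_into_clips(samples, sample_rate, clip_seconds=5, min_tail_seconds=2):
    """Cut a waveform into fixed-length clips, zero-padding a long-enough tail."""
    clip_len = clip_seconds * sample_rate
    clips = []
    for start in range(0, len(samples), clip_len):
        clip = list(samples[start:start + clip_len])
        if len(clip) < clip_len:
            if len(clip) <= min_tail_seconds * sample_rate:
                continue  # assumption: tails of 2 seconds or less are dropped
            clip += [0.0] * (clip_len - len(clip))
        clips.append(clip)
    return clips


def pseudo_label(record, threshold=0.75, max_duration=15.0):
    """Return 1 (bird), 0 (no bird), or None (excluded) for one clip record.

    `record` is a hypothetical dict with keys "class_name", "p_bird"
    (predicted presence), and "file_duration" (seconds).
    """
    if record["class_name"] in NON_BIRD_CLASSES:
        return 0
    if record["p_bird"] > threshold or record["file_duration"] <= max_duration:
        return 1
    return None


# Usage with a toy sample rate of 10 Hz: 12.8 s gives two full clips plus a
# 2.8 s tail that is padded; 11.5 s gives two full clips and a dropped tail.
padded = split_into_clips([0.1] * 128, sample_rate=10)
dropped = split_into_clips([0.1] * 115, sample_rate=10)
```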

Third model iterate

  1. Fit XGBClassifier to all heretofore mentioned data
  2. Use the second model iterate to predict "hasbird" in Birdclef data
  3. Subset Birdclef data to those with
    • Predicted presence > 0.90
    • Amphibia, Insecta, Mammalia as 0 in 2025 data
  4. I chose the XGBClassifier with 50 estimators and max depth 5
    • The training accuracy is.
    • The training precision is.
    • The training recall is.
    • The training AUROC is.
    • The test accuracy is 94.45%.
    • The test precision is 98.06%.
    • The test recall is 95.60%.
    • The test AUROC is 95.91%.

Non-2025 model

I fit a model like the third iterate but without the Birdclef 2025 data. The point is to evaluate whether the model predicts presence for birds not observed in the training data. The 2025 dataset contains some birds that do not appear in the 2022, 2023, and 2024 datasets.

Because initial model iterates used the 2025 data, there is some data leakage in how pseudo-present bird sounds were determined in the second model iterate.
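
One way to run the unseen-species check described above is sketched here. It is only an illustration: the species codes and probabilities are made up, the record keys are hypothetical, and the 0.5 decision threshold is an assumption rather than a value from this model card.

```python
def unseen_species(train_species_by_year, holdout_species):
    """Species in the holdout year that never appear in any training year."""
    seen = set().union(*train_species_by_year.values())
    return sorted(set(holdout_species) - seen)


def detection_rate(records, species, threshold=0.5):
    """Share of clips from the given species that the model flags as bird-present.

    `records` is a hypothetical list of dicts with keys "species" and "p_bird".
    Returns None when no clip belongs to the given species.
    """
    subset = [r for r in records if r["species"] in species]
    if not subset:
        return None
    return sum(r["p_bird"] > threshold for r in subset) / len(subset)


# Made-up species codes and predictions, for illustration only.
train_years = {2022: {"sp_a", "sp_b"}, 2023: {"sp_b", "sp_c"}, 2024: {"sp_a"}}
species_2025 = {"sp_a", "sp_d", "sp_e"}
new_species = unseen_species(train_years, species_2025)

clips_2025 = [
    {"species": "sp_a", "p_bird": 0.95},
    {"species": "sp_d", "p_bird": 0.91},
    {"species": "sp_d", "p_bird": 0.40},
    {"species": "sp_e", "p_bird": 0.88},
]
rate = detection_rate(clips_2025, set(new_species))
```

A detection rate on the new species that is close to the rate on previously seen species would suggest the model responds to bird sounds in general rather than to the specific species in its training data.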

I chose the XGBClassifier with 50 estimators and max depth 5.

  • The training accuracy is.
  • The training precision is.
  • The training recall is.
  • The training AUROC is.
  • The test accuracy is 94.85%.
  • The test precision is 97.56%.
  • The test recall is 96.32%.
  • The test AUROC is 97.54%.