Predicting bird presence in audio clips

This model predicts whether birds are present in five-second clips of an audio file. It sits upstream of a larger modeling effort: identifying clips in weakly labeled audio files that contain bird sounds with high probability. This model is not optimal! The aim is not to predict bird presence perfectly (or to be well calibrated), but to confidently assign strong labels to a subset of audio clips in the Birdclef Kaggle competition datasets.

Features

The 24 features fall into three groups:

Sound frequency percentiles
- https://www.tandfonline.com/doi/full/10.1080/09524622.2020.1730241
- quantiles = [0.8,0.9,0.925,0.95,0.975,0.99,]
Thirteen Mel-frequency cepstral coefficients (MFCCs)
- Averaged over axis 1 (columns)
- n_fft=2048, hop_length=512
Summary statistics of zero crossing rates in 1-second segments
- Mean, standard deviation, min, max
- Zero crossing rate of the entire clip
- Threshold of 0.02

These features summarize an entire clip, irrespective of position in waveform or spectrogram, and technically, the clip does not have to be 5 seconds long.
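As a rough illustration of the zero-crossing group, the sketch below computes the five zero-crossing features (mean, standard deviation, min, and max over 1-second segments, plus the whole-clip rate). It is written in plain Python so it runs without dependencies; it is not the code used to build the model. Two assumptions are made here: the threshold of 0.02 is read as an amplitude floor below which samples count as zero, and the function and key names (`zero_crossing_rate`, `zcr_features`, `zcr_mean`, and so on) are hypothetical.

```python
import math
import statistics


def zero_crossing_rate(samples, threshold=0.02):
    """Fraction of adjacent sample pairs whose sign differs.

    Assumption: samples with |x| <= threshold are clamped to zero first,
    and zero is treated as positive.
    """
    if len(samples) < 2:
        return 0.0
    signs = [(x if abs(x) > threshold else 0.0) >= 0.0 for x in samples]
    crossings = sum(a != b for a, b in zip(signs, signs[1:]))
    return crossings / (len(samples) - 1)


def zcr_features(samples, sample_rate, threshold=0.02):
    """Return the five zero-crossing features for one clip of any length."""
    # Cut the clip into consecutive 1-second segments.
    segments = [samples[i:i + sample_rate]
                for i in range(0, len(samples), sample_rate)]
    # Skip a trailing fragment too short to contain a sample pair.
    rates = [zero_crossing_rate(seg, threshold)
             for seg in segments if len(seg) >= 2]
    return {
        "zcr_mean": statistics.fmean(rates),
        "zcr_std": statistics.pstdev(rates),
        "zcr_min": min(rates),
        "zcr_max": max(rates),
        "zcr_clip": zero_crossing_rate(samples, threshold),
    }


# Synthetic example: 5 seconds of a 100 Hz sine wave sampled at 8 kHz.
sr = 8000
clip = [math.sin(2 * math.pi * 100 * n / sr) for n in range(5 * sr)]
features = zcr_features(clip, sr)
print(features)
```

Because the function only segments by sample count, it accepts clips of any length of at least two samples, which matches the note above that a clip does not have to be 5 seconds long.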

Too long do not read

Data

I built the model iteratively with some publicly available datasets.

Freefield
Warblrb 10k
- These contain 10-second audio files
- The files are classified as 1 or 0 "hasbird" https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13103
ESC50
- https://github.com/karolpiczak/ESC-50
- Most files are labeled 0 "hasbird", but some classes fall under 1 "hasbird"
- 5-second audio files
DBR (dog, bird, and rain sounds)
- https://zenodo.org/records/1069747#.Xlj0vi2ZN24
- Dog and rain sounds are 0 "hasbird"
Urban8k
- https://urbansounddataset.weebly.com/urbansound8k.html
- All audio files are labeled 0 "hasbird"
- Around 8k files, many of them 2- to 4-second clips
ARCA23K
- https://zenodo.org/records/5117901
- All audio files are labeled 0 "hasbird"
Birdclef Kaggle competitions 2022 - 2025
- These are long audio files with weakly labeled species
- They come from xeno-canto.org
- There can be multiple species in a file, which is not always annotated

First model iterate

Fit decision tree-based classifiers to Freefield and Warblrb10k
- About 3/4 of the Warblrb10k files contain birds
- About 1/4 of the Freefield files do not contain birds
- No data augmentation
Grid search with 25% test and 75% training splits (averaging over 5 randomizations)
- RandomForestClassifier, GradientBoostingClassifier, XGBClassifier
- n_estimators: [10, 20, 50,]
- max_depth: [5, 10, 20,]
- I saved the results in the following file
I chose to use the XGBClassifier with n_estimators=20 and max_depth=5
- This simpler model does not have too large a gap between training and test metrics
- The test accuracy is 80.40%.
- The test precision is 79.05%.
- The test recall is 81.68%.
- The test AUROC is 88.08%.
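The selection procedure above (a grid of settings, each scored on repeated random 75%/25% splits and then averaged) can be sketched as follows. This is a minimal, dependency-free illustration of the protocol, not the original code: a one-dimensional k-nearest-neighbours classifier on synthetic data stands in for the tree-based classifiers, and all names (`repeated_split_grid_search`, `make_knn_scorer`, the `k` parameter) are hypothetical. In practice `fit_and_score` would wrap the actual classifier and metrics.

```python
import itertools
import random
import statistics


def repeated_split_grid_search(n_samples, grid, fit_and_score,
                               n_repeats=5, test_fraction=0.25, seed=0):
    """Average (train, test) scores over random splits for every grid setting."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    # Draw the random splits once so each setting sees the same partitions.
    splits = []
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        splits.append((order[n_test:], order[:n_test]))  # (train, test)

    names = sorted(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, train, test) for train, test in splits]
        train_mean = statistics.fmean([s[0] for s in scores])
        test_mean = statistics.fmean([s[1] for s in scores])
        results.append({**params, "train": train_mean, "test": test_mean,
                        "gap": train_mean - test_mean})
    return results


def make_knn_scorer(xs, ys):
    """Stand-in model: 1-D k-nearest-neighbours, scored by accuracy."""
    def accuracy(k, train, eval_idx):
        correct = 0
        for i in eval_idx:
            nearest = sorted(train, key=lambda j: abs(xs[j] - xs[i]))[:k]
            predicted = int(2 * sum(ys[j] for j in nearest) > k)
            correct += int(predicted == ys[i])
        return correct / len(eval_idx)

    def fit_and_score(params, train, test):
        k = params["k"]
        return accuracy(k, train, train), accuracy(k, train, test)

    return fit_and_score


# Synthetic two-class data with overlapping classes (not the bird data).
data_rng = random.Random(1)
ys = [i % 2 for i in range(120)]
xs = [data_rng.gauss(y, 1.0) for y in ys]

results = repeated_split_grid_search(len(xs), {"k": [1, 5, 15]},
                                     make_knn_scorer(xs, ys))
for row in results:
    print(row)
```

Reporting the train/test gap next to the test score makes it easier to prefer a simpler setting whose training and test metrics stay close, which is the criterion used for the choice above.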

Second model iterate

Fit XGBClassifier to all of the data described above
Use first model iterate to predict "hasbird" in Birdclef data
- Apply zero padding to the Birdclef data if the final clip is longer than 2 seconds
- Subset Birdclef data to those with
  - Predicted presence > 0.75, or
  - Audio file duration <= 15 seconds, or
  - Amphibia, Insecta, Mammalia as 0 in 2025 data
Five data augmented instances for each file
- Use OneOf in audiomentations
- AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=1.)
- AddGaussianSNR(min_snr_db=5.0, max_snr_db=40.0, p=1.)
- AddColorNoise(min_snr_db=5.0, max_snr_db=40.0, n_fft=128, p=1.)
Grid search with 25% test and 75% training splits (averaging over 10 randomizations)
- n_estimators: [10, 20, 50,]
- max_depth: [5, 10, 20,]
I chose the XGBClassifier with 50 estimators and max depth 10
- The training accuracy is.
- The training precision is.
- The training recall is.
- The training AUROC is.
- The test accuracy is 94.51%.
- The test precision is 96.73%.
- The test recall is 96.09%.
- The test AUROC is 98.18%.
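The clip preparation and subsetting rules in this iterate can be sketched as below. This is a hedged illustration under stated assumptions, not the original pipeline: a final partial clip of 2 seconds or less is assumed to be dropped (the text only says longer ones are zero padded), records kept through the predicted-presence or short-duration rules are assumed to be labeled 1, and the record fields (`class_name`, `predicted_presence`, `duration_seconds`) and function names are hypothetical.

```python
NON_BIRD_CLASSES = {"Amphibia", "Insecta", "Mammalia"}


def split_into_clips(samples, sample_rate, clip_seconds=5, min_tail_seconds=2):
    """Cut a waveform into fixed-length clips, zero padding a long enough tail."""
    clip_len = clip_seconds * sample_rate
    clips = []
    for start in range(0, len(samples), clip_len):
        clip = list(samples[start:start + clip_len])
        if len(clip) < clip_len:
            # Final partial clip: keep it only if it is longer than the minimum.
            if len(clip) <= min_tail_seconds * sample_rate:
                break  # assumption: shorter tails are discarded
            clip += [0.0] * (clip_len - len(clip))
        clips.append(clip)
    return clips


def pseudo_label(record, threshold=0.75, max_short_duration=15.0):
    """Return the "hasbird" label for a record, or None if it is excluded."""
    if record.get("class_name") in NON_BIRD_CLASSES:
        return 0
    if (record["predicted_presence"] > threshold
            or record["duration_seconds"] <= max_short_duration):
        return 1
    return None


# Toy example with a sample rate of 4 Hz so the arithmetic is easy to follow.
waveform = [1.0] * 49                 # 12.25 seconds
clips = split_into_clips(waveform, sample_rate=4)
print(len(clips), clips[-1])

records = [
    {"class_name": "Aves", "predicted_presence": 0.91, "duration_seconds": 60.0},
    {"class_name": "Aves", "predicted_presence": 0.40, "duration_seconds": 12.0},
    {"class_name": "Aves", "predicted_presence": 0.40, "duration_seconds": 60.0},
    {"class_name": "Insecta", "predicted_presence": 0.80, "duration_seconds": 60.0},
]
labels = [pseudo_label(r) for r in records]
print(labels)
```

Raising `threshold` makes the positive subset smaller but more confident, which trades coverage for label quality.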

Third model iterate

Fit XGBClassifier to all of the data described above
Use the second model iterate to predict "hasbird" in Birdclef data
Subset Birdclef data to those with
- Predicted presence > 0.90
- Amphibia, Insecta, Mammalia as 0 in 2025 data
I chose the XGBClassifier with 50 estimators and max depth 5
- The training accuracy is.
- The training precision is.
- The training recall is.
- The training AUROC is.
- The test accuracy is 94.45%.
- The test precision is 98.06%.
- The test recall is 95.60%.
- The test AUROC is 95.91%.

Non-2025 model

I fit a model like the third iterate but without the Birdclef 2025 data. The point is to evaluate whether the model predicts presence for birds not observed in the training data. The 2025 dataset contains some birds that do not appear in the 2022, 2023, and 2024 datasets.
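One way to set up such a check is sketched below: find the species that appear only in the held-out year, then measure how often the model predicts presence for their clips. This is an assumption about how the evaluation could be organized, not a description of what was done; the species codes, prediction values, the 0.5 decision threshold, and the function names are all placeholders.

```python
def unseen_species(species_by_year, held_out_year, training_years):
    """Species present in the held-out year but in none of the training years."""
    seen = set().union(*(species_by_year[year] for year in training_years))
    return set(species_by_year[held_out_year]) - seen


def detection_rate(predictions, species_subset, threshold=0.5):
    """Share of clips from the given species with predicted presence > threshold."""
    scores = [score for species, score in predictions if species in species_subset]
    if not scores:
        return None
    return sum(score > threshold for score in scores) / len(scores)


# Placeholder species codes and scores, for illustration only.
species_by_year = {
    2022: {"sp_a", "sp_b"},
    2023: {"sp_b"},
    2024: {"sp_c"},
    2025: {"sp_a", "sp_d", "sp_e"},
}
new_in_2025 = unseen_species(species_by_year, 2025, [2022, 2023, 2024])

predictions = [("sp_d", 0.9), ("sp_e", 0.4), ("sp_a", 0.1), ("sp_d", 0.7)]
rate = detection_rate(predictions, new_in_2025)
print(sorted(new_in_2025), rate)
```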

Because initial model iterates used the 2025 data, there is some data leakage in how pseudo-present bird sounds were determined in the second model iterate.

I chose the XGBClassifier with 50 estimators and max depth 5.

The training accuracy is.
The training precision is.
The training recall is.
The training AUROC is.
The test accuracy is 94.85%.
The test precision is 97.56%.
The test recall is 96.32%.
The test AUROC is 97.54%.
