Analyzing the full eBird Dataset with DuckDB
Ever since I started birding, I’ve had an ever-growing list of questions. Questions about the species I’ve seen, the ones I haven’t, the habitats they’re found in, and about the activity and its participants themselves. When do Eastern Phoebes typically show up in New York? Where do they go when they leave? Where do they like to nest? How frequently are they seen in New York? Why does that frequency dip in the summertime?
The Journey
I love having this mounting list of questions, as it’s one of the key drivers of a “mid-life renaissance” of sorts in my curiosity. I’d love to write about that spark even more soon, but for now I’d like to talk about one of the avenues I’ve turned to both quench and deepen this curiosity. Soon after getting into birding, I started getting quite curious about the underlying data that birding both depends upon and adds to. Birding, more so than any other hobby I’ve encountered, focuses on both the reading and the contribution of data. Birders study weather forecasts, migration timelines and previous years’ records to answer questions like: “when are the first migrants of the year going to show up?”, “are the winds out of the South this weekend?” (a good sign here in New York in the spring that tropical migrants will be coming in, borne along by the favorable winds) or “I just saw an owl species in Brooklyn. Which species most likely matches the brief glimpse I got?” Just as interesting, and particularly compelling to many of us, is the fact that by recording our checklists, taking photos and recording audio, we’re also adding to the ever-growing dataset of birding observations (most commonly here in the US via Cornell’s eBird).
As I got started in birding I was participating in both of these, but I was also starting to get curious about what the underlying data might look like. Soon into my birding journey I found out that eBird had an API you could use to access this data. I’ve already built more than a few fun projects off of that, but one of the limitations I ran into is that an API like this is best suited for answering questions served by targeted, smaller datasets. It was great to see a list of the most recent species in an area or to get more information on hotspots around me. But for questions that stretched across larger areas or longer time periods, the API would take quite a long time (and more than likely hit some rate limits) and just wouldn’t be that efficient. APIs like this are great at serving small-to-medium amounts of data very quickly based on specific filters or conditions. They’re not so great at providing deeper or richer datasets spanning thousands or millions of entries. And while my questions weren’t of the “millions” variety (yet), I was running into the limitations of the API for many of them. Thankfully, eBird offers another route to accessing its data, via pre-packaged downloads of large swaths of it. This wouldn’t have the immediate, quick accessibility of the API, but it would more than make up for that with its comprehensive, rich dataset.
What’s funny to me is that it took me nearly a year before I actually downloaded my first segment of the dataset. I can’t say the exact reason why, but one of them surely had to do with my unfamiliarity with digging through data like this. This isn’t to say I wasn’t comfortable or even happy around data analysis. While working at Stripe I had the distinction (happy or not) of being one of the top users of our shared analytics system (being summoned on more than one occasion to explain why I insisted on running so many SQL queries every day). But it’s that last part, SQL, that maybe best explains my hesitancy with this form of data querying. See, the main recommendation for working with the eBird dataset was to use R. No shade meant to the language, it’s just not something I had really ever worked with before. So, while I had a little fun poking around at the dataset, I very quickly got frustrated by the slowness of asking relatively simple questions via R (and by slowness I mean my slowness in figuring out just how to get anything working with it). I put this aside and promised myself I’d return to it another day when maybe my curiosity was willing to drive me a bit further.
Sometime later, however, my curiosity actually led me in a different direction. If I was having such a hard time (and honestly not a fun time) trying to look at this data in this way… why not just try and do it the way that I was used to? To be fair to me, I think I assumed that using what I used at Stripe would be just too cost prohibitive and difficult for a small personal project (I mean it’s called a data warehouse for starters). On a whim, I asked if what I wanted was even possible on my own humble computer. Not only was I not discouraged, but right away I was pointed to a good solution.
Enter DuckDB. Okay, right from the get-go, it has to be perfect, right? It has Duck in the name. But before we go further, let’s take a brief aside to talk about what’s going on here in a bit more detail.
The problem is basically this: in its raw form, the eBird dataset is just lines and lines of text separated by tabs (the sister file format to the very commonly used CSV). In fact, you can probably read the sample below and get a sense of what the data includes!
URN:CornellLabOfOrnithology:EBIRD:OBS2919749158 2025-03-01 23:25:39.781016 21333 species avibase-69544B59 American Crow Corvus brachyrhynchos 2 United States US Alabama US-AL Tuscaloosa US-AL-125 27 Lakefield Drive, Tuscaloosa L10510092 H 33.1559565 -87.6336990 2025-03-01 06:21:00 obsr1073731 S215984764 Traveling Traveling P22 24 1.092 2 1 G14142220 0 1 0
URN:CornellLabOfOrnithology:EBIRD:OBS2933256785 2025-03-05 07:43:54.669907 21333 species avibase-69544B59 American Crow Corvus brachyrhynchos 1 United States US Alabama US-AL Tuscaloosa US-AL-125 27 23rd Street Area, Tuscaloosa L12228602 P 33.1882745 -87.5396724 2025-03-05 06:27:00 obsr451424 S216708215 Stationary Stationary P21 15 1 1 0 1 0
URN:CornellLabOfOrnithology:EBIRD:OBS3009418896 2025-03-30 17:15:29.683739 21333 species avibase-69544B59 American Crow Corvus brachyrhynchos 3 United States US Alabama US-AL Tuscaloosa US-AL-125 27 River Bend Turf L5039140 H 33.1335213 -87.6534175 2025-03-30 15:11:00 obsr451424 S221934152 Traveling Traveling P22 60 13.532 2 1 G14355008 0 1 0
That’s nice for general readability, and it also makes working with this data feasible across a broad audience with varied technical backgrounds. The downside is that, in this form, the file is prohibitively large and quite inefficient to use. I’m not going to go too deep into the technical reasons why, but suffice it to say that, by default, most any analysis of this file is going to require reading each and every line: all 1,809,934,873 of them for the full file. There are lots of different ways we could make this easier! Possibly the most common method would be to throw it into a general-purpose database like Postgres. This would certainly speed things up and reduce the size of the file. However, a database like this isn’t really suited for what I want to do here. I want to be able to ask near-arbitrary types of questions without spending time beforehand defining things like indexes, which basically tell the database which axes I want to analyze data along. I could define an index on the date of the observation, and then looking up only observations on a certain date would be blazingly fast! But this comes with two downsides: first, I’d need to wait a lengthy amount of time for the database to create that index (it needs to sort through all the data to see where it lives on that axis) and, moreover, I’m interested in so many more axes than that!
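For illustration, this is roughly what that index-first approach looks like in a row-oriented database like Postgres. The table and column names here are just hypothetical stand-ins, not anything from the actual eBird schema:

-- Hypothetical row-store approach: build an index first (a full pass over all ~1.8 billion rows)...
create index idx_observation_date on observations (observation_date);

-- ...then date lookups are fast,
select * from observations where observation_date = '2025-03-01';

-- but any question along a different axis is back to scanning every row.
select count(*) from observations where common_name = 'Eastern Phoebe';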
The thing I really wanted was a columnar database! Not only would a columnar database not really require indexes, it is perfectly suited for this kind of “ask me anything” style of approach. The thing I didn’t know was that columnar databases can be used in a wide variety of use cases: from large corporate behemoths to me poking around to see where the best place to see an Eastern Phoebe in March is. The exact reason why columnar databases are so good is honestly still beyond me. But I do know one of their biggest advantages is how they treat the data they store. See, in a traditional database, all the elements of one of the rows in that TSV are stored together. That’s quite useful when you’re trying to do something like look at one specific observation (or a few) to see all the relevant details about it: the species, the date, the observer, the location, etc. Columnar databases flip this idea on its head and treat everything by column and not by row. In an analytics use case, I don’t care as much about all the details of one observation; instead, I’m far more interested in a few details about a large list of observations. By storing data by column, they make it far more efficient to analyze along those axes. So, instead of thinking of these entries as one row (“I saw an Eastern Phoebe, on May 1st at Prospect Park, on this checklist”), it flips this to instead organize things by the species, or date, or location. (How do they do this so quickly? I still have no idea… I believe it has something to do with the fact that they’re storing all this data along an index by default?)
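This is the shape of query columnar storage loves: it only ever touches two columns out of the fifty-ish in the dataset, no matter how many rows there are. A sketch against the ebd.full table that shows up later in this post (assuming the date column was inferred as an actual DATE, which DuckDB did for me):

-- Only "COMMON NAME" and "OBSERVATION DATE" get read from disk;
-- every other column is simply skipped.
select
  month("OBSERVATION DATE") as obs_month,
  count(*) as observations
from ebd.full
where "COMMON NAME" = 'Eastern Phoebe'
group by 1
order by 1;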
In any case, we’ve got our data principles set: now we can talk about ducks again. DuckDB caught my interest right away based on the simple install instructions (geared toward a local user), the ease with which I could work with CSV files, the fact that they had just shipped a UI for running queries (I’m all for running my queries in a terminal, but in 2025 I don’t really wanna be interpreting tab and pipe characters in one) and, I mean, come on, they have Duck in the name! I loaded up the very small starter dataset I had and, glancing at the instructions, I had it working in less than 30 seconds. Here’s where I should start to really emphasize my unfamiliarity with what I was doing (if I haven’t laid it on thick already). I started trying to figure out if I needed to put this in S3 (a cloud-based storage system that probably backs nearly everything on the Internet in some way) or some other remote file storage system. I started playing around with just the right way to store this in Parquet (a newer file format far better suited to this problem than a TSV). I was game-planning the right way to stream the data from eBird’s server, through my machine as a TSV, up into the cloud as Parquet… when it dawned on me. This TSV was large, but not prohibitively so: 137GB. And while I didn’t have a spare cloud server sitting around… I did have quite a number of unused hard drives from my color editing work with my brother. Why not just throw the file on one of those?
I thought I’d go from that step to querying data in an hour or two… but boy was that over-confident. For one, the download would end up taking far longer than that. Not only because I didn’t have the best internet speed, but my drive also kept going to sleep halfway through it… In any case, it was actually for the best, since it meant I spent more time testing and playing around with a more modest 25GB dataset instead. This really paid off in the long run, since I figured out a good plan on a quicker dataset instead of trying and failing repeatedly on the full one. Here’s the first thing that threw me off: I had assumed that I needed to figure out my own specific method of data storage. It turned out that, while DuckDB will happily spit out Parquet (and other forms of data), that’s not really the “database” it’s working off of! Instead, the database is its own file… In any case, it was definitely fun to see how Parquet can drastically reduce the size of the download! With the 25GB NY dataset it got things down to just 3GB… and I’m not even sure I was doing any compression with Parquet either! (See the questions below for areas I still need to explore) [and an editor’s note; he certainly was doing compression].
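For reference, the export I was experimenting with looked roughly like this. The table name ebd.ny and the output file are just placeholders for my NY subset; DuckDB writes Parquet with Snappy compression by default, and ZSTD usually shrinks it a bit further:

-- Write the (already loaded) NY table out as Parquet.
copy (select * from ebd.ny)
to 'ebd_ny.parquet' (format parquet, compression zstd);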
So I didn’t need Parquet to store the data… but, as usual when I’m doing anything technical (or honestly, just anything), I was far ahead of myself. I was already thinking of how to store the data when I first needed to not only finish downloading it… but then figure out how I’d even get that huge TSV file decompressed and into DuckDB. The first part, decompression, on any modern computer is typically as easy as double-clicking the file, but that’s not as straightforward when you’re dealing with output that will end up exceeding 600GB (the native decompressor on my Mac never worked on this and was painfully slow). One obvious problem here was that I didn’t exactly have 600GB to spare on my local machine. However, if I moved this over to one of those aforementioned external drives, I ran into another issue: these drives get really bogged down if you’re trying to read from them while writing to them. Put another way, if I’m trying to decompress the data into another file, I need to tell the drive “Hey, take these chunks of data and tell me what they are. Oh, also, while you’re doing that, remember these other chunks.” For, again, complicated reasons, a lot of drives don’t love doing this. They’d prefer one operation or the other, but not both. A straightforward workaround, then, would be to read the file from my local drive and write to the external one. An even better find, however, was when I started looking for a tool to do this and realized that there were far quicker decompressors available than the default one that lives on my machine (they use multiple cores to accomplish this speedup). Enter pigz (continuing with the animal-named tools on this one, I guess):
@����n+Y�%���@>��Xdƺ�E�b�"���[���U�h$Bd���`ɭT���yn��6�l����…
Uhh, that’s not quite right… oh yeah, you need to tell this tool to decompress; by default it will just recompress things. Okay, adding the right arg to pigz, and we’ve got data! Loading this into DuckDB was a bit more involved, as I had to fiddle a bit more with the loading arguments to get everything right. The main issue was around how restricted characters in the TSV are quoted (isn’t that always the case when dealing with these?). But… we’ve got the data in! Again, this is with the more limited NY-only dataset, yet still, it only took 30s to create the DuckDB table for it. That’s way faster than any operation using R ever took! We can ask our first question… When was the first Eastern Phoebe observation in NY?
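The query looked roughly like this (ebd.ny is just my name here for the NY table; the column names are the EBD’s own, and the grisly note below presumably comes from the SPECIES COMMENTS column):

-- Earliest Eastern Phoebe record in the NY dataset.
select "OBSERVATION DATE", locality, "SPECIES COMMENTS"
from ebd.ny
where "COMMON NAME" = 'Eastern Phoebe'
order by "OBSERVATION DATE"
limit 1;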
“Mounted at Wards [$1.25]; destroyed February 1907”
Always a nice reminder about how ornithology used to be just a bit more grisly. We’re close to realizing the dream! Writing that question was quick and the result was even quicker: 58ms. Now, we just need to work on the final goal, loading up the full data set.
The key challenge here is still file size. The uncompressed full dataset is around 600GB. While I do have drives that can handle that, it’s a bit more difficult to figure out how best to decompress that file. What I ended up doing to avoid filling up my hard drive (which I had already mistakenly done twice) and to have this finish in a relatively speedy fashion was to keep the compressed file on my computer’s hard drive and then decompress it directly to the external hard drive. This ended up taking about 2 hours, which is far from great, but at least we have something that works! Something I’d like to figure out in the future is a better story around downloading -> decompression -> loading into DuckDB. Ideally one that’s quicker and involves less intermediate storage. You should be able to stream or pipe more of these processes into each other so that we don’t have to wait for the entire 600GB TSV file to be ready before loading it into DuckDB. The main reasons I didn’t spend much time on this yet are that a) it’s easier to reason about what I did, b) premature optimization is the death of so many of my projects and c) dealing with issues is a bit harder in the streaming case than it is when I have more discrete checkpoints along the way.
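One option I still want to test on this front: DuckDB’s CSV reader can read gzip-compressed files directly, which would skip the giant intermediate TSV altogether. I haven’t benchmarked it against the pigz route yet (decompressing a single .gz stream is single-threaded), so treat this as a sketch rather than a recommendation:

-- Let DuckDB decompress the gzip on the fly instead of materializing a 600GB TSV.
create table ebd.full as
select *
from read_csv(
  'ebd_relJan-2025.tsv.gz',
  store_rejects = true,
  quote = ''
);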
Setting that aside, I’m now one step away from the final dream: I need to get this data into DuckDB. One downside I discovered here with DuckDB is that it’s pretty hard to track the timing of larger tasks like this. There isn’t a way to get a progress bar, at least not one I was able to find. A strategy I used for this, and a few other tasks here, was to use times from the smaller NY dataset to estimate the time for the full dataset. This isn’t a perfect strategy, given the non-linear impact of compression and other things, like disk caches, that matter more at smaller file sizes. Still, it was actually pretty good at predicting things like the final row count of the DB (off by about 5%), but less good at predicting the disk size. That ended up being around 214GB, which is way bigger than I was planning for. It turns out DuckDB isn’t going to magically solve my disk size needs. Still, that’s a problem for future David (a fun problem, no doubt) and not the focus of the day.
The focus of the day is answering some burning questions, like: who has seen the most ducks in a day? It surprisingly isn’t a tie… it’s this checklist here with 11 ducks: https://ebird.org/checklist/S16191277. Wood Duck, Mottled Duck, Ruddy Duck… great work here. Plumed Whistling-Duck, Mandarin Duck, Ruddy Shelduck… what on earth is going on here? Oh, all of these are escapees… that doesn’t feel right in the spirit of “most ducks.” I think that should mean wild ducks. If we filter those pesky escapees out… we’re down to 10 ducks reported. Still, surprisingly, there’s no tie! Only one leader emerges, and it’s from 1996 in Ethiopia: https://ebird.org/checklist/S77656343. Here’s the query:
select
  "SAMPLING EVENT IDENTIFIER",
  country,
  state,
  "OBSERVATION DATE",
  count(distinct "SCIENTIFIC NAME") as no_of_ducks_seen,
  array_agg("COMMON NAME"),
  array_agg("OBSERVATION DATE")
from ebd.full
where
  "COMMON NAME" ilike '%duck%'
  and category not in ('slash', 'spuh')
  and APPROVED = 1 -- no escapees here!
  and (
    "EXOTIC CODE" is null
    or "EXOTIC CODE" = 'N'
  )
group by 1, 2, 3, 4
order by no_of_ducks_seen desc
limit 100
But another problem you say! That’s not the true definition of duck! You need to filter to the species in Anatidae… true that! But definitely another challenge for another day.
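For the curious, a sketch of how that might look, assuming you’ve also loaded the eBird taxonomy file (which has a FAMILY column) into a table I’m calling taxonomy here, joined on scientific name. The taxonomy table and its column names are assumptions on my part, not something from the EBD itself:

-- Hypothetical: count only true Anatidae per checklist.
-- FAMILY values look like 'Anatidae (Ducks, Geese, and Waterfowl)', hence the like.
select
  e."SAMPLING EVENT IDENTIFIER",
  count(distinct e."SCIENTIFIC NAME") as anatidae_seen
from ebd.full e
join taxonomy t on e."SCIENTIFIC NAME" = t.sci_name
where t.family like 'Anatidae%'
  and e.category not in ('slash', 'spuh')
  and e.approved = 1
group by 1
order by anatidae_seen desc
limit 10;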
Speaking of which, boy is there more work to do here. I’ve loved being able to dig into these data questions easily and quickly. But it’s quite impractical to have a 200+GB database sitting around on your hard drive. It’s also unwieldy to have to leave things running for hours to refresh the dataset each month. Beyond that, I’d love to be able to use this in more broadly reachable tools. DuckDB offers a hosted version of their service, with a free tier you can use if your dataset is below 10GB. Getting down to roughly 5% of the current size sounds really difficult, but I haven’t engaged much at all in reducing the dataset. For one, there are numerous columns here I don’t really need (comments, redundant taxonomic information, checklist IDs). In fact, it’d be a quite fun activity to see just what columns or portions of the DB actually contribute to its size. I also probably don’t need all the data to answer most of the questions I have.
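The most obvious first move is probably just building a slimmed-down table with only the columns I actually query and comparing footprints. A sketch, where the column list is an illustrative guess at what I’d keep (and for a fair comparison you’d probably want the slim table written to its own database file):

-- Keep only the handful of columns most of my questions touch.
create table ebd.slim as
select
  "COMMON NAME",
  "SCIENTIFIC NAME",
  "OBSERVATION COUNT",
  "OBSERVATION DATE",
  state,
  county,
  locality,
  latitude,
  longitude,
  "SAMPLING EVENT IDENTIFIER"
from ebd.full;

-- Check the on-disk footprint before and after.
pragma database_size;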
Here are some other outstanding tasks I’d love to look into:
- All the questions around performance here. Could I use ordering for better indexing? https://duckdb.org/docs/stable/guides/performance/indexing#the-effect-of-ordering-on-zonemaps
- Why is the first eBird checklist from Thailand on 1481-03-24?
- Is there a form of Parquet compression that would reduce the size even more?
- I should clean up data types more (move from integer to binary). This will reduce size more! Probably shifting things to enums could be massive too?
- Convert column names to underscore to make them easier to auto_complete.
- Figure out how to get x/y coords into the data to perform distance analysis more quickly!
One post-writing note that I figured out a few days later: at a few points I mention that I needed to change the file format or data types to take advantage of DuckDB’s compression. This isn’t necessarily true. I found out subsequently that DuckDB is already quite smart at inferring the right data type for a column all on its own. So there’s less tuning needed at the outset (great!), but it also means there aren’t as many quick wins as I imagined.
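If you’re curious what it inferred for your own load, a quick describe shows the column names and the types DuckDB picked:

-- List every column and its inferred type.
describe ebd.full;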
Step-by-step guide on how to load the EBD into DuckDB
Ingredients
- pigz or another tool for decompressing the EBD into a TSV. If you're not using a huge dataset, you can probably just use your native OS's decompression tool.
- duckdb: download instructions here
- the EBD! Sign up and request data here
Steps
Step 1: Download your Data
This can take a few minutes or a few hours. Note: Cornell's download speeds seem to be capped at around 8-10 MB/s, so plan accordingly!
Step 2: Decompress your data
You can do this with whatever tool you like. I liked using pigz for performance. Here's a handy snippet for running pigz with progress tracking:
pv ebd_relJan-2025.tsv.gz | pigz -d > ebd_relJan-2025.tsv
pv is used to monitor the progress of data coming through the pipe.
At the end of this step you should have the "raw" TSV, looking something like this:
URN:CornellLabOfOrnithology:EBIRD:OBS2919749158 2025-03-01 23:25:39.781016 21333 species avibase-69544B59 American Crow Corvus brachyrhynchos 2 United States US Alabama US-AL Tuscaloosa US-AL-125 27 Lakefield Drive, Tuscaloosa L10510092 H 33.1559565 -87.6336990 2025-03-01 06:21:00 obsr1073731 S215984764 Traveling Traveling P22 24 1.092 2 1 G14142220 0 1 0
URN:CornellLabOfOrnithology:EBIRD:OBS2933256785 2025-03-05 07:43:54.669907 21333 species avibase-69544B59 American Crow Corvus brachyrhynchos 1 United States US Alabama US-AL Tuscaloosa US-AL-125 27 23rd Street Area, Tuscaloosa L12228602 P 33.1882745 -87.5396724 2025-03-05 06:27:00 obsr451424 S216708215 Stationary Stationary P21 15 1 1 0 1 0
URN:CornellLabOfOrnithology:EBIRD:OBS3009418896 2025-03-30 17:15:29.683739 21333 species avibase-69544B59 American Crow Corvus brachyrhynchos 3 United States US Alabama US-AL Tuscaloosa US-AL-125 27 River Bend Turf L5039140 H 33.1335213 -87.6534175 2025-03-30 15:11:00 obsr451424 S221934152 Traveling Traveling P22 60 13.532 2 1 G14355008 0 1 0
Step 3: Load the data into DuckDB
There's a wide variety of ways to do this. I'm including the simplest version I found here:
create table ebd.full as
select *
from read_csv(
  '/path/to/decompressed/ebd.tsv',
  store_rejects = true,
  quote = ''
);
The three things I'd call out there are:
- We're using read_csv to, well, read the TSV. DuckDB is smart enough to detect the delimiter.
- We're storing the rejected lines. It's always good to review these to make sure everything went alright!
- The last, most important part is quote = ''. If you don't do this, most if not all lines will get rejected for some reason...
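To actually review those rejects afterwards, DuckDB stores them in temporary tables (reject_errors and reject_scans by default, if I'm reading the docs right):

-- Peek at any lines the CSV reader couldn't parse.
select * from reject_errors limit 10;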
Step 4: Write a query against your data!
The most fun part. Time to learn from the data!
Here's a query to see the first observation in the EBD:
select *
from ebd.full
order by "OBSERVATION DATE"
limit 1
