Databasing Deed Dish


Wed, Oct 8, 2025

As I’ve been building out Deed Dish I’ve iterated through a number of different database solutions to store and host my data. This blog post is a reflection on each of those solutions: why I used it and, in most cases, why I moved on to another.

Deed Dish

Deed Dish is, at its core, a data visualization project powered by data scraped from the web. The website allows the user to explore historic property ownership for parcels in Chicago. At the time of writing, the project covers a small section of Chicago amounting to about 50,000 parcels. Each of these parcels can have anywhere from 1 to 300 documents associated with it, and each of these documents is split out into multiple records that are loaded into the Deed Dish database. All in all, Deed Dish amounts to 3 tables and 7 million records. Not something you would necessarily call Big Data, but it isn’t pocket change either.

SQLite

When I started this project my goal was to quickly develop a pipeline to scrape, validate, and store the data. I had very little in the way of opinions on which database I wanted to use, and because Python ships with a built-in SQLite library I defaulted to using it. This solution worked wonderfully for iterating on the ETL process and developing my data model.
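As a rough sketch of what that early stage looked like, here is a minimal scrape-and-store loop built on the sqlite3 module; the table, columns, and validation below are illustrative placeholders rather than the actual Deed Dish schema.

```python
import sqlite3

# Minimal sketch of the early pipeline; the schema here is a placeholder,
# not the real Deed Dish data model.
conn = sqlite3.connect("deed_dish.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS documents (
        doc_number TEXT PRIMARY KEY,
        pin        TEXT NOT NULL,   -- parcel identification number
        doc_type   TEXT,
        recorded   TEXT
    )
    """
)

def store_document(record: dict) -> None:
    """Validate the bare minimum, then upsert a scraped document."""
    if not record.get("doc_number") or not record.get("pin"):
        raise ValueError("scraped record is missing required fields")
    conn.execute(
        "INSERT OR REPLACE INTO documents VALUES (:doc_number, :pin, :doc_type, :recorded)",
        record,
    )
    conn.commit()

store_document(
    {"doc_number": "0001234", "pin": "17-06-100-001-0000",
     "doc_type": "DEED", "recorded": "2001-03-14"}
)
```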

Eventually the project reached a point where I wanted to be able to run the pipeline on my home lab to avoid hogging my laptop’s resources. After researching the trade-offs of other database options I landed on migrating to a hosted PostgreSQL instance. Using a hosted database would allow the data to be loaded and accessed from anywhere, which would let me iterate on the front-end even while the scrape was ongoing. I was also more comfortable with Postgres than SQLite, as I had used it in a number of other projects.

This was also the point where I decided to transition the project to an Object-Relational Mapping (ORM) tool. I chose to do so because I still wasn’t sure what the final database would look like as I worked on the project. An ORM has some up-front development costs, but it makes it relatively easy to migrate between databases. I ended up using SQLAlchemy’s ORM tools, again because I was familiar with them from previous projects.
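A minimal sketch of that setup, using SQLAlchemy’s 2.0-style declarative mapping: the model, columns, and connection URLs below are placeholders, but they show how swapping SQLite for a hosted Postgres instance comes down to changing the engine URL.

```python
from typing import Optional

from sqlalchemy import String, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

# Illustrative model only; the real Deed Dish schema has three tables
# with different columns.
class Document(Base):
    __tablename__ = "documents"
    doc_number: Mapped[str] = mapped_column(String, primary_key=True)
    pin: Mapped[str] = mapped_column(String, index=True)
    doc_type: Mapped[Optional[str]]

# The ORM code stays the same; only the connection URL changes when
# migrating from SQLite to a hosted Postgres instance.
engine = create_engine("sqlite:///deed_dish.db")
# engine = create_engine("postgresql+psycopg2://user:password@host:5432/deed_dish")

Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Document(doc_number="0001234", pin="17-06-100-001-0000", doc_type="DEED"))
    session.commit()
```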

PostgreSQL

PostgreSQL is a robust database solution. It isn’t the fastest, nor is it the lightest database out there, but it has a large community, which makes debugging easier during development. I ran into a few hurdles while building out the ORM and updating the scraper, and most of the time I was able to find a solution by searching the docs. There are also a large number of hosted Postgres providers out there, which means you won’t be starved for choice if you decide that your ultimate storage solution is Postgres. I settled on the Xata free tier: “free” was one of my hosting requirements, and Xata’s free tier provides up to 15GB of storage, which is generous compared to alternatives like Supabase and which I anticipated needing for this project.

One other reason I opted to migrate to Postgres was that I had anticipated the project having a major geospatial component and I was familiar with the PostGIS extension. That turned out not to be the case: the scraped data remained tabular and was joined to separately hosted GIS data on the front-end.

There are a lot of upsides to hosted database solutions: the server your database runs on is managed by someone else, which means fewer moving parts to think about during development. Your data is stored elsewhere, which can be a risk, but it also means your data isn’t in danger of disappearing due to real-world problems like spilled drinks or falls. The downside, however, is that running your database on a remote server introduces latency to your queries. For some projects this is fine, but for a project like Deed Dish, where querying the database and getting results quickly is a high priority, that latency is a liability. Ultimately I decided that, with the initial scrape completed, I would migrate the data off of the hosted Postgres instance and try out a new database: DuckDB.

DuckDB

DuckDB is a database optimized for online analytical processing (OLAP) workloads. Instead of prioritizing transaction throughput, it is built to handle complex analytical queries quickly, which makes it a perfect fit for a project that amounts to a dashboard powered by millions of records. DuckDB also has some really nice quality-of-life tools, like the DuckDB UI, that made it easy to iterate on the queries powering the front-end.
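For a flavor of the workload, here is the kind of aggregate query that DuckDB chews through quickly from Python; the table and column names are placeholders rather than the actual Deed Dish tables.

```python
import duckdb

# Placeholder schema: the point is the shape of the query, not the names.
con = duckdb.connect("deed_dish.duckdb")

# Columnar, vectorized execution makes aggregations over millions of rows fast.
rows = con.execute(
    """
    SELECT pin,
           count(*)      AS document_count,
           min(recorded) AS earliest_document,
           max(recorded) AS latest_document
    FROM documents
    GROUP BY pin
    ORDER BY document_count DESC
    LIMIT 10
    """
).fetchall()

for pin, n, earliest, latest in rows:
    print(pin, n, earliest, latest)
```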

As front-end development progressed I began exploring hosting again. Xata was no longer an option because of its query latency, and I didn’t want to shell out any money to host such a small project. I wanted something that would let a user quickly load the data for whatever parcel interested them without waiting for a database connection to be established. A friend of mine suggested I experiment with generating JSON files and hosting them in a Cloudflare R2 bucket instead of having the front-end query the database directly, and after some more research I decided to give that a shot. DuckDB’s speed meant that I was able to quickly generate tens of thousands of JSON documents. These are the documents currently powering Deed Dish.
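Here is a rough sketch of what that export step could look like: query DuckDB once per parcel and write a JSON file for each, to be uploaded to R2 afterwards with an S3-compatible client. The schema and file naming are assumptions for illustration.

```python
import json
from pathlib import Path

import duckdb

# Placeholder schema and file naming; the real export differs in the details.
con = duckdb.connect("deed_dish.duckdb", read_only=True)
out_dir = Path("parcel_json")
out_dir.mkdir(exist_ok=True)

pins = [row[0] for row in con.execute("SELECT DISTINCT pin FROM documents").fetchall()]
for pin in pins:
    rows = con.execute(
        "SELECT doc_number, doc_type, recorded FROM documents WHERE pin = ? ORDER BY recorded",
        [pin],
    ).fetchall()
    payload = {
        "pin": pin,
        "documents": [
            {"doc_number": number, "doc_type": kind, "recorded": recorded}
            for number, kind, recorded in rows
        ],
    }
    # One JSON document per parcel, named after its PIN.
    (out_dir / f"{pin}.json").write_text(json.dumps(payload))
```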

50,000 JSON Files

Currently, Deed Dish is powered by about 50,000 JSON files hosted in that R2 bucket. When the user clicks on a parcel the client sends a GET request to the bucket and, if there is an associated file, receives a JSON response with the relevant data. This has turned out to be quite fast and perfectly adequate for the current state of the project.
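The request pattern looks roughly like this; I’ve sketched it in Python for readability, though the real front-end does it in the browser, and the bucket URL and file naming are placeholders.

```python
from typing import Optional

import requests

# Placeholder bucket URL and file naming convention.
R2_BASE_URL = "https://example-bucket.example.r2.dev"

def fetch_parcel(pin: str) -> Optional[dict]:
    """Return the parcel's JSON document, or None if no file exists for it."""
    resp = requests.get(f"{R2_BASE_URL}/{pin}.json", timeout=5)
    if resp.status_code == 404:
        return None
    resp.raise_for_status()
    return resp.json()

print(fetch_parcel("17-06-100-001-0000"))
```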

I do find myself wondering: Is 50,000 JSON files really a database? I’d say so, but if you disagree let me know. Maybe it’s just a CDN, but then isn’t a CDN a database at its core? These are all questions to come back to in the future after Deed Dish is more fleshed out. For now, go check out the project and if you like it (or don’t) let me know via email!

The Future?

I’m pretty happy with my current back-end storage solution. Having a locally hosted DuckDB means that I can generate the back-end data quickly, and having the data hosted as individual files in an R2 bucket makes the user experience snappy. However, there are limitations to this approach, the most notable being that Deed Dish seems to slow down significantly under load. If the project continues growing I could easily see myself paying for hosting on a platform like MotherDuck, or experimenting with hosting the database on R2 directly and having the client query that instead of individual files.
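As a sketch of that last idea, and assuming a recent DuckDB release with the httpfs extension: DuckDB can attach a remote database file read-only over HTTPS, so a client could in principle query one database file hosted on R2 instead of thousands of JSON files. The URL and table name below are placeholders.

```python
import duckdb

# Placeholder URL; assumes the database file is publicly readable on R2.
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute(
    "ATTACH 'https://example-bucket.example.r2.dev/deed_dish.duckdb' AS remote (READ_ONLY)"
)
print(con.execute("SELECT count(*) FROM remote.documents").fetchone())
```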