Harmonique

Play radio

Small-scale data engineering with Go and PostgreSQL: a few lessons learned

I just released dgtools, a command line utility to work with the Discogs data dumps. This little endeavor was supposed to be a quick side quest, but it transformed into a rabbit hole.

Discogs is the go-to service for record collectors. They might have one of the biggest databases for physical music releases. On a monthly basis, they release a compressed XML of a subset of their database under a CC0 license. Tools already exist to import them into a PostgreSQL database, but I wanted the flexibility of a custom-built solution. I started building something in a Ruby on Rails app but quickly diverged to Go as I didn't want to pay the ActiveRecord performance cost.

I have quite a bit of experience with Go. I started learning it in 2012, and some code of mine still runs on thousands of production servers worldwide. I haven't really touched the language for a few years though. So it was interesting to write a small project and learn a few things.

Go's XML parser is great to work with. With a mix of streaming and parsing into tagged structs, it is very convenient and performant. I wasn't expecting it would be SO easy to reliably parse XML.

Go's ecosystem is rich. I thought pq was the go-to choice when working with PostgreSQL and Go, but the development has stalled. I hit a bug where pq is unable to bulk load data into jsonb columns. pgx is now the right choice for that kind of workload. Similarly, there are tons of good libraries when it comes to structuring a CLI application. And don't get me started on the cool TUI libs!

I forgot how amazing Go's embed feature is. Embedding migration files into a binary is so easy and such a useful feature. The ability to distribute a single binary is my favorite feature of Go.

While Go correctly produces and consumes multi-line CSV, PostgreSQL cannot import them. Unless I missed something, having new lines in a CSV means that PostgreSQL cannot process them correctly with a COPY FROM 'filename' query.

Go is a framework as much as it is a language. The documentation, standard library, and bundled tools make it an amazing built-in developer experience for small-scale projects like the one I wrote.

posted by marc.in.space in
  • Discogs
  • Software Development
  • Golang
  • PostgreSQL