Harmonique

Notes

Making Discogs Data 13% Smaller with Parquet

Recently, I have been working with the Discogs data dumps. Discogs uploads monthly dumps of its database as gzipped XML files, one each for artists, labels, masters, and releases. I was curious about converting them to the Parquet file format. Parquet is a binary columnar file format heavily used in data engineering. It supports nested structures and lets each column use a different compression algorithm. It is also natively supported by databases such as ClickHouse and DuckDB. I mostly wanted to compare the size of a Parquet file with that of a gzipped XML file. Would the Parquet files be smaller? If so, by how much? And how fast would the conversion be?

Continue reading →

Small-scale data engineering with Go and PostgreSQL: a few lessons learned

I just released dgtools, a command-line utility to work with the Discogs data dumps. This little endeavor was supposed to be a quick side quest, but it turned into a rabbit hole.

Discogs is the go-to service for record collectors. They might have one of the biggest databases of physical music releases. Every month, they release compressed XML dumps of a subset of their database under a CC0 license. Tools already exist to import them into a PostgreSQL database, but I wanted the flexibility of a custom-built solution. I started building something in a Ruby on Rails app but quickly switched to Go, as I didn't want to pay the ActiveRecord performance cost.
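To give a rough idea of what reading a gzipped XML dump in Go can look like, here is a minimal sketch that streams records one at a time with the standard library. It is not the dgtools code: the `Artist` struct, the `streamArtists` and `gzipString` helpers, and the assumed `<artist>` layout with `<id>` and `<name>` children are all illustrative.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/xml"
	"fmt"
	"io"
	"log"
)

// Artist mirrors a small, assumed subset of an <artist> element.
type Artist struct {
	ID   int    `xml:"id"`
	Name string `xml:"name"`
}

// streamArtists decompresses r and decodes one <artist> element at a time,
// calling fn for each record so the whole dump never sits in memory.
func streamArtists(r io.Reader, fn func(Artist) error) error {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return err
	}
	defer zr.Close()

	dec := xml.NewDecoder(zr)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		// Skip everything that is not the start of an <artist> element.
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "artist" {
			continue
		}
		var a Artist
		if err := dec.DecodeElement(&a, &start); err != nil {
			return err
		}
		if err := fn(a); err != nil {
			return err
		}
	}
}

// gzipString compresses s in memory; it only exists to build sample input.
func gzipString(s string) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	// Made-up sample data standing in for a real dump file.
	sample := `<artists>` +
		`<artist><id>1</id><name>First</name></artist>` +
		`<artist><id>2</id><name>Second</name></artist>` +
		`</artists>`

	buf, err := gzipString(sample)
	if err != nil {
		log.Fatal(err)
	}
	err = streamArtists(buf, func(a Artist) error {
		fmt.Printf("%d %s\n", a.ID, a.Name)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

With a real dump, the in-memory buffer would be replaced by an `os.Open` file handle, and the callback would batch rows for insertion instead of printing them.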

Continue reading →

OpenSimplex noise

OpenSimplex noise is a gradient noise function designed to avoid patent issues with simplex noise while fixing the directional artifacts in Perlin noise. It uses a different grid structure with stretched hypercubic honeycombs and larger kernel sizes, making it smoother but slower than simplex noise.