Notes

Making Discogs Data 13% Smaller with Parquet

Recently, I have been working with the Discogs data dumps. Discogs uploads monthly dumps of their database in a gzipped XML format, covering artists, labels, masters, and releases. I was curious about converting them to the Parquet file format. Parquet is a binary columnar file format heavily used in data engineering. It supports a different compression algorithm per column as well as nested structures, and it is natively supported by databases such as ClickHouse and DuckDB. I was mostly curious about the size of a Parquet file versus a compressed XML file. Would Parquet files be smaller than gzipped XML? If so, by how much? And how fast would the conversion be?

Implementation

I implemented a Parquet conversion command in dgtools (a Go tool I recently built). Since the bigger work of modeling the data dumps as Go structs had already been done, the implementation was straightforward.

The parquet-go module makes good use of struct tags: they let you set the encoding and the compression of each column. The (now deprecated) parquet.Writer is perfect for quickly exporting data.

Parquet offers several compression algorithms. I went with zstd as the default for nearly all columns. For columns where cardinality is extremely low, I chose dictionary (dict) encoding instead.
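As a rough sketch of what this looks like (the struct, field names, and import path below are simplified placeholders, not the actual dgtools code), the struct tags carry both the compression and the encoding choices, and the deprecated writer takes one record at a time:

package main

import (
	"log"
	"os"

	"github.com/parquet-go/parquet-go"
)

// Label is a trimmed-down stand-in for a Discogs record type.
type Label struct {
	ID   int64  `parquet:"id,zstd"`
	Name string `parquet:"name,zstd"`
	// Very low cardinality: dictionary encoding instead of zstd.
	DataQuality string `parquet:"data_quality,dict"`
}

func main() {
	f, err := os.Create("labels.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// parquet.Writer is deprecated in favor of parquet.GenericWriter,
	// but it remains handy for a quick export.
	w := parquet.NewWriter(f)
	if err := w.Write(&Label{ID: 1, Name: "Example Label", DataQuality: "Correct"}); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}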

Once a Parquet file is created, it is quite easy to inspect with DuckDB. A query such as:

SELECT path_in_schema, type, encodings, compression, 
       (total_compressed_size / 1024) AS compressed_size, 
       (total_uncompressed_size / 1024) AS uncompressed_size 
FROM parquet_metadata('file.parquet');

will show each column's encoding, compression, and its compressed and uncompressed sizes (in KiB).

Results

Conversion speed

| Type     | Records    | Time   | Records / Second |
|----------|------------|--------|------------------|
| Labels   | 2,274,143  | 12.48s | 182,222          |
| Artists  | 9,174,834  | 63.44s | 144,713          |
| Masters  | 2,459,324  | 69.77s | 35,249           |
| Releases | 18,412,655 | 34m14s | 8,964            |

Conversion speed decreases with the complexity of the structure: releases, the most nested type, convert roughly 20 times slower per record than labels.

File size

| Type     | .xml.gz | Parquet | Difference |
|----------|---------|---------|------------|
| Labels   | 83M     | 72M     | -13.2%     |
| Artists  | 441M    | 397M    | -9.9%      |
| Masters  | 577M    | 537M    | -6.7%      |
| Releases | 10.74G  | 10.14G  | -5.5%      |

The bigger the file, the smaller the relative savings.

Better results might be possible with more optimization, but I have little interest in spending more time on this. The results are clear: Parquet files have a smaller footprint than large .xml.gz files, and they are easily queryable with embedded databases like DuckDB.

Maybe we should reconsider how structured data is exported. While text files are nice, they're also a chore to transform and load into different databases. For data analysis workflows, binary formats like Parquet offer real advantages in both storage efficiency and query performance.

posted by marc.in.space in
  • data engineering
  • Discogs
  • go
  • parquet
  • xml

Small-scale data engineering with Go and PostgreSQL: a few lessons learned

I just released dgtools, a command-line utility for working with the Discogs data dumps. This little endeavor was supposed to be a quick side quest, but it turned into a rabbit hole.

Discogs is the go-to service for record collectors; they might have one of the biggest databases of physical music releases. Every month, they release a compressed XML dump of a subset of their database under a CC0 license. Tools already exist to import these dumps into a PostgreSQL database, but I wanted the flexibility of a custom-built solution. I started building something in a Ruby on Rails app but quickly switched to Go, as I didn't want to pay the ActiveRecord performance cost.

I have quite a bit of experience with Go. I started learning it in 2012, and some code of mine still runs on thousands of production servers worldwide. I haven't really touched the language in a few years, though, so it was interesting to write a small project and learn a few things.

Go's XML parser is great to work with. With a mix of streaming and decoding into tagged structs, it is very convenient and performant. I wasn't expecting it to be SO easy to parse XML reliably.
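To illustrate the pattern (the element name and struct below are simplified placeholders, not the actual dgtools types), the decoder walks tokens and fully decodes each matching element into a struct, so memory stays bounded no matter how big the dump is:

package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

// Label is a trimmed-down stand-in for a Discogs record type.
type Label struct {
	ID   int64  `xml:"id"`
	Name string `xml:"name"`
}

func main() {
	f, err := os.Open("discogs_labels.xml")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	dec := xml.NewDecoder(f)
	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF ends the stream; real code should check for other errors
		}
		if start, ok := tok.(xml.StartElement); ok && start.Name.Local == "label" {
			var l Label
			// DecodeElement consumes the whole <label> element into the struct.
			if err := dec.DecodeElement(&l, &start); err != nil {
				panic(err)
			}
			fmt.Println(l.ID, l.Name)
		}
	}
}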

Go's ecosystem is rich. I thought pq was the go-to choice for working with PostgreSQL from Go, but its development has stalled, and I hit a bug where pq was unable to bulk-load data into jsonb columns. pgx is now the right choice for that kind of workload. Similarly, there are plenty of good libraries for structuring a CLI application. And don't get me started on the cool TUI libs!
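As a sketch of the pgx workflow (the table, columns, and connection string are hypothetical, not the actual dgtools schema), CopyFrom rides on the PostgreSQL COPY protocol and handled jsonb columns for me:

package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://localhost/discogs")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Each row matches the column list below; the second value is raw JSON
	// destined for a jsonb column.
	rows := [][]any{
		{int64(1), []byte(`{"title": "Some Release"}`)},
		{int64(2), []byte(`{"title": "Another Release"}`)},
	}

	copied, err := conn.CopyFrom(
		ctx,
		pgx.Identifier{"releases"},
		[]string{"id", "data"},
		pgx.CopyFromRows(rows),
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("bulk loaded %d rows", copied)
}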

I forgot how amazing Go's embed feature is. Embedding migration files into the binary is easy and incredibly useful, and the ability to distribute a single binary remains my favorite thing about Go.
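For reference, the whole pattern fits in a few lines (the package name and directory layout are placeholders):

// Package migrations embeds the SQL migration files into the binary.
package migrations

import "embed"

// Files holds every .sql file sitting next to this package at build time.
// It implements fs.FS, so a migration runner can read from it directly.
//
//go:embed *.sql
var Files embed.FS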

While Go correctly produces and consumes multi-line CSV, PostgreSQL would not import it for me. Unless I missed something, newlines inside CSV fields mean that PostgreSQL cannot process the file correctly with a COPY FROM 'filename' query.

Go is a framework as much as it is a language. The documentation, standard library, and bundled tools make for an amazing built-in developer experience for small-scale projects like the one I wrote.

posted by marc.in.space in
  • Discogs
  • Software Development
  • Golang
  • PostgreSQL

OpenSimplex noise

OpenSimplex noise is a gradient noise function designed to avoid patent issues with simplex noise while fixing the directional artifacts in Perlin noise. It uses a different grid structure with stretched hypercubic honeycombs and larger kernel sizes, making it smoother but slower than simplex noise.
posted by marc.in.space in
  • OpenSimplex noise
  • procedural generation
  • computer graphics
  • gradient noise
  • simplex noise
  • Perlin noise