Harmonique

Notes related to go

Making Discogs Data 13% Smaller with Parquet

Marc in Space · Builds software and draws with robots

Recently, I have been working with the Discogs data dumps. Discogs publishes monthly dumps of its database as gzipped XML, with separate dumps for artists, labels, masters, and releases. I was curious about converting them to Parquet, a binary columnar file format heavily used in data engineering. It supports nested structures and a different compression algorithm per column, and databases such as ClickHouse and DuckDB support it natively. Mostly, I wanted to compare the size of a Parquet file with that of the compressed XML: would the Parquet file be smaller than the gzipped XML and, if so, by how much? And how fast would the conversion be?
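To give an idea of the reading side of such a conversion, here is a minimal sketch in Go that streams a gzipped XML dump one record at a time using only the standard library. The `<artist>`, `<id>`, and `<name>` element names, the `Artist` struct, and the helpers `streamArtists` and `gzipBytes` are assumptions made for illustration, and the sample data is made up. Writing the records out as Parquet is left out, since that needs a third-party library.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/xml"
	"fmt"
	"io"
)

// Artist is a hypothetical subset of the fields of one <artist> record.
type Artist struct {
	ID   int64  `xml:"id"`
	Name string `xml:"name"`
}

// streamArtists reads a gzipped XML stream and calls fn once per <artist>
// element, so the whole document never has to be held in memory.
func streamArtists(r io.Reader, fn func(Artist) error) error {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return err
	}
	defer zr.Close()

	dec := xml.NewDecoder(zr)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		se, ok := tok.(xml.StartElement)
		if !ok || se.Name.Local != "artist" {
			continue
		}
		var a Artist
		// DecodeElement consumes everything up to the matching </artist>.
		if err := dec.DecodeElement(&a, &se); err != nil {
			return err
		}
		if err := fn(a); err != nil {
			return err
		}
	}
}

// gzipBytes compresses s in memory; it only exists to build demo input.
// Writes to a bytes.Buffer cannot fail, so the errors are ignored here.
func gzipBytes(s string) *bytes.Buffer {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(s))
	zw.Close()
	return &buf
}

func main() {
	// Made-up sample standing in for a real dump file.
	sample := `<artists>` +
		`<artist><id>1</id><name>First</name></artist>` +
		`<artist><id>2</id><name>Second</name></artist>` +
		`</artists>`

	err := streamArtists(gzipBytes(sample), func(a Artist) error {
		fmt.Println(a.ID, a.Name)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

In a real converter, the callback would hand each record to a Parquet writer instead of printing it, and the reader would be an opened dump file rather than an in-memory buffer.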

Continue reading →