Making Discogs Data 13% Smaller with Parquet
Recently, I have been working with the Discogs data dumps. Discogs publishes monthly dumps of its database as gzipped XML, split into four files: artists, labels, masters, and releases. I was curious about converting them to Parquet, a binary columnar file format heavily used in data engineering. Parquet allows a different compression algorithm per column, supports nested structures, and is natively supported by databases such as ClickHouse and DuckDB. What I mostly wanted to know was how a Parquet file compares in size to a gzipped XML file. Would Parquet files be smaller? If so, by how much? And how fast would the conversion be?
Implementation
I implemented a Parquet conversion command in dgtools (a Go tool I recently built). Since the bulk of the work, modeling the data dumps as Go structs, had already been done, the implementation was straightforward.
The parquet-go module makes good use of struct tags: they let you set the encoding and compression of each column. The (now deprecated) parquet.Writer is perfect for quickly exporting data.
Parquet offers several compression algorithms. I went with zstd as the default for nearly all columns; for the few columns with extremely low cardinality, I chose dictionary (dict) encoding.
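As a minimal sketch of what this looks like, here is a hypothetical, trimmed-down Label struct written out with the deprecated parquet.Writer. The field names and the labels.parquet file are illustrative, not taken from dgtools; the tags show per-column zstd compression and dict encoding on a low-cardinality column.

```go
package main

import (
	"log"
	"os"

	"github.com/parquet-go/parquet-go"
)

// Label is a hypothetical, simplified record; the real dgtools structs
// carry more fields. Struct tags select the compression codec (zstd)
// per column, and dict encoding for the low-cardinality data_quality column.
type Label struct {
	ID          int64  `parquet:"id,zstd"`
	Name        string `parquet:"name,zstd"`
	Profile     string `parquet:"profile,zstd"`
	DataQuality string `parquet:"data_quality,dict,zstd"`
}

func main() {
	f, err := os.Create("labels.parquet")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The deprecated parquet.Writer infers the schema from the struct
	// tags of the first row written.
	w := parquet.NewWriter(f)
	rows := []Label{
		{ID: 1, Name: "Example Label", DataQuality: "Correct"},
	}
	for _, row := range rows {
		if err := w.Write(row); err != nil {
			log.Fatal(err)
		}
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
}
```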
Once a Parquet file is created, it is easy to inspect with DuckDB. A query such as:
```sql
SELECT
    path_in_schema,
    type,
    encodings,
    compression,
    (total_compressed_size / 1024)   AS compressed_size,
    (total_uncompressed_size / 1024) AS uncompressed_size
FROM parquet_metadata('file.parquet');
```
will show the compressed and uncompressed size of each column, in kilobytes.
Results
Conversion speed
| Type     | Records    | Time   | Records / Second |
|----------|------------|--------|------------------|
| Labels   | 2,274,143  | 12.48s | 182,222          |
| Artists  | 9,174,834  | 63.44s | 144,713          |
| Masters  | 2,459,324  | 69.77s | 35,249           |
| Releases | 18,412,655 | 34m14s | 8,964            |
Conversion speed decreases with the complexity of the structure.
File size
| Type     | .xml.gz | Parquet | Difference |
|----------|---------|---------|------------|
| Labels   | 83M     | 72M     | -13.2%     |
| Artists  | 441M    | 397M    | -9.9%      |
| Masters  | 577M    | 537M    | -6.7%      |
| Releases | 10.74G  | 10.14G  | -5.5%      |
The bigger the file, the smaller the relative savings.
Better results might be possible with more optimization, but I have little interest in spending more time on this. The results are clear: Parquet files have a smaller footprint than large .xml.gz files, and they are easily queryable with embedded databases like DuckDB.
Maybe we should reconsider how structured data is exported. While text files are nice, they're also a chore to transform and load into different databases. For data analysis workflows, binary formats like Parquet offer real advantages in both storage efficiency and query performance.