Partitioned Writes
Examples
Write a table to a Hive partitioned data set of Parquet files:
COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month));
Write a table to a Hive partitioned data set of CSV files, allowing overwrites:
COPY orders TO 'orders' (FORMAT CSV, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE);
Write a table to a Hive partitioned data set of GZIP-compressed CSV files, setting an explicit extension for the data files:
COPY orders TO 'orders' (FORMAT CSV, PARTITION_BY (year, month), COMPRESSION GZIP, FILE_EXTENSION 'csv.gz');
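The statements above assume an orders table that has year and month columns to partition by. A minimal, purely illustrative setup could look like this (the order_id and amount columns are hypothetical and only stand in for the payload):
CREATE TABLE orders AS
    SELECT
        i AS order_id,               -- hypothetical payload column
        random() * 100 AS amount,    -- hypothetical payload column
        2021 + (i % 2) AS year,      -- partition column used by PARTITION_BY
        1 + (i % 12) AS month        -- partition column used by PARTITION_BY
    FROM range(1, 1001) AS t(i);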
Partitioned Writes
When the PARTITION_BY clause is specified for the COPY statement, the files are written in a Hive partitioned folder hierarchy. The target is the name of the root directory (in the example above: orders). The files are written in-order in the file hierarchy. Currently, one file is written per thread to each directory.
orders
├── year=2021
│   ├── month=1
│   │   ├── data_1.parquet
│   │   └── data_2.parquet
│   └── month=2
│       └── data_1.parquet
└── year=2022
    ├── month=11
    │   ├── data_1.parquet
    │   └── data_2.parquet
    └── month=12
        └── data_1.parquet
The values of the partitions are automatically extracted from the data. Note that it can be very expensive to write many partitions as many files will be created. The ideal partition count depends on how large your data set is.
Best practice: Writing data into many small partitions is expensive. It is generally recommended to have at least 100 MB of data per partition.
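Once written, the partition columns can be reconstructed from the directory names when the files are read back. A minimal sketch, assuming the Parquet data set written above and using a glob over the Hive directory layout:
SELECT year, month, count(*) AS rows_per_partition
FROM read_parquet('orders/*/*/*.parquet', hive_partitioning = true)
GROUP BY year, month
ORDER BY year, month;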
Overwriting
By default, the partitioned write will not allow overwriting existing directories. Use the OVERWRITE_OR_IGNORE option to allow overwriting an existing directory.
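For example, re-exporting to a target directory that already exists and contains data will error by default; adding OVERWRITE_OR_IGNORE lets the repeated COPY go through (a sketch reusing the orders table from above):
-- the first export creates the orders directory
COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month));
-- a repeated export into the existing directory needs OVERWRITE_OR_IGNORE
COPY orders TO 'orders' (FORMAT PARQUET, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE);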
Filename Pattern
By default, files will be named data_0.parquet or data_0.csv. With the flag FILENAME_PATTERN, a pattern with {i} or {uuid} can be defined to create specific filenames:
- {i} will be replaced by an index
- {uuid} will be replaced by a 128-bit UUID
Write a table to a Hive partitioned data set of .parquet files, with an index in the filename:
COPY orders TO 'orders'
(FORMAT PARQUET, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE, FILENAME_PATTERN 'orders_{i}');
Write a table to a Hive partitioned data set of .parquet files, with unique filenames:
COPY orders TO 'orders'
(FORMAT PARQUET, PARTITION_BY (year, month), OVERWRITE_OR_IGNORE, FILENAME_PATTERN 'file_{uuid}');