I have some Parquet files. Is it possible to delete rows from them without corrupting the files? For example, I have data in regular Parquet format, stored in HDFS, that contains records for different users. Can I loop through those files and delete the data for a specific user? I don't want to reprocess the data and filter out that user; I want to delete the data directly from the Parquet files. Will this corrupt them?
- I think [this](https://stackoverflow.com/a/43015779/8279585) answer appropriately answers your question. – samkart May 11 '22 at 04:59
- Does this answer your question? [Updating values in apache parquet file](https://stackoverflow.com/questions/28837456/updating-values-in-apache-parquet-file) – samkart May 11 '22 at 05:01
- http://aseigneurin.github.io/2017/03/14/incrementally-loaded-parquet-files.html should help. It explains how to add new Parquet files that hold the incremental data. So, let's say each of your rows has a timestamp column (e.g. filled from `System.nanoTime()`): all it takes is reading the latest row (based on nanos) to get the latest data. – chen May 11 '22 at 06:11
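To make the last comment concrete, here is a minimal sketch of the "latest row wins" idea, not a definitive implementation. It works on plain dictionaries so it runs without any Parquet library; in practice the rows would come from reading the original file plus every increment. The column names `user_id`, `nanos`, and `deleted` are assumptions made for illustration, and the `deleted` flag in particular is a hypothetical extension that is not part of the linked article.

```python
from typing import Any, Dict, Iterable, List


def latest_rows(rows: Iterable[Dict[str, Any]],
                key: str = "user_id",      # assumed key column
                ts: str = "nanos",         # assumed timestamp column
                deleted: str = "deleted",  # hypothetical tombstone flag
                ) -> List[Dict[str, Any]]:
    """Keep the newest row per key; hide keys whose newest row is flagged deleted."""
    newest: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        current = newest.get(row[key])
        # A row replaces the stored one only if its timestamp is strictly newer.
        if current is None or row[ts] > current[ts]:
            newest[row[key]] = row
    return [r for r in newest.values() if not r.get(deleted, False)]


# Rows as they might look after reading the original file and one increment.
base_file = [
    {"user_id": "alice", "nanos": 100, "city": "Oslo"},
    {"user_id": "bob", "nanos": 100, "city": "Lima"},
]
increment_file = [
    {"user_id": "alice", "nanos": 200, "city": "Rome"},  # newer version of alice
    {"user_id": "bob", "nanos": 200, "deleted": True},   # logical delete of bob
]

current_view = latest_rows(base_file + increment_file)
print(current_view)  # [{'user_id': 'alice', 'nanos': 200, 'city': 'Rome'}]
```

Note that this only hides the user's rows from readers that apply the same rule. The old rows are still physically present in the earlier files, so if the data really has to disappear, those files still need to be rewritten without it.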