
Spark buffer holder size limit issue


I am running an aggregation at the column level, like
df.groupBy("a").agg(collect_set("b"))
The aggregated column value grows beyond the default limit of 2 GB.
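A minimal sketch of the job (the SparkSession setup and the input/output paths below are placeholders, not my actual code):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("collect-set-repro").getOrCreate()

    # df has a grouping column "a" and a high-cardinality column "b";
    # for some groups the collected set grows past the 2 GB row-buffer limit.
    df = spark.read.parquet("/path/to/input")   # placeholder path

    result = df.groupBy("a").agg(F.collect_set("b").alias("b_set"))
    result.write.mode("overwrite").parquet("/path/to/output")   # fails with "Cannot grow BufferHolder"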

Error details:
Spark job fails with an IllegalArgumentException: Cannot grow BufferHolder error.
java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 95969 because the size after growing exceeds size limitation 2147483632

As is already known, BufferHolder has a maximum size of 2147483632 bytes (approximately 2 GB).
If a single column value exceeds this size, Spark throws the exception above.

I have removed all duplicate records, applied repartition(), increased the default number of partitions, and increased all the memory parameters, but none of this helps; the job still fails with the error above.
We have a huge volume of data in the column after applying the collect_set aggregation.
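Roughly what was tried, for reference (the parameter values below are examples, not my exact settings):

    # Illustrative sketch of the mitigations attempted; values are examples.
    deduped = df.dropDuplicates(["a", "b"])           # remove duplicate records
    deduped = deduped.repartition(2000, "a")          # spread the data over more partitions

    spark.conf.set("spark.sql.shuffle.partitions", "2000")   # raise default shuffle partitions
    # executor/driver memory were also raised via spark-submit flags

    result = deduped.groupBy("a").agg(F.collect_set("b").alias("b_set"))
    # Still fails: all distinct values for one key end up in a single row,
    # so that row's buffer still exceeds 2147483632 bytes.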

Is there any way to increase the BufferHolder maximum size beyond 2 GB during processing?

Can you please suggest a customization or a user-defined function that would work around this?
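To make the question concrete, this is the kind of workaround I have been considering (a rough sketch only; the salt count and column names are made up, and it does not actually solve the problem):

    # Two-stage collect_set over a salted key, to keep intermediate sets smaller.
    salted = df.withColumn("salt", (F.rand() * 16).cast("int"))

    partial = salted.groupBy("a", "salt").agg(F.collect_set("b").alias("b_part"))

    merged = (partial.groupBy("a")
              .agg(F.flatten(F.collect_list("b_part")).alias("b_all"))
              .withColumn("b_set", F.array_distinct("b_all"))
              .drop("b_all"))
    # Problem: the final merge still builds one row per key, so it can hit
    # the same 2 GB BufferHolder limit when a single key's distinct set is that large.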

Thanks



Source: https://stackoverflow.com/questions/70537890/spark-buffer-holder-size-limit-issue
