PySpark Interview Questions and Answers
Ques 26. What is the purpose of the 'accumulator' in PySpark?
An 'accumulator' is a shared variable used in parallel operations: tasks can only add to it, and its value is read back on the driver. It is typically used for implementing counters or sums in distributed computing. Updates are guaranteed exactly-once only when applied inside actions; updates inside transformations may be re-applied if a task is retried.
Example:
accumulator = spark.sparkContext.accumulator(0)
# Update inside an action; only the driver can read the value
rdd.foreach(lambda x: accumulator.add(1))
print(accumulator.value)
Ques 27. Explain the use of the 'broadcast' hint in PySpark.
The 'broadcast' hint explicitly instructs PySpark to use a broadcast join: the smaller DataFrame is copied to every executor, so the larger side never needs to be shuffled. This improves performance when one DataFrame is significantly smaller than the other.
Example:
from pyspark.sql.functions import broadcast
result = df1.join(broadcast(df2), 'key')
Ques 28. What is the purpose of the 'agg' method in PySpark?
The 'agg' method is used for aggregating data in a PySpark DataFrame. It allows you to perform various aggregate functions like sum, avg, max, min, etc., on specified columns.
Example:
result = df.agg({'Sales': 'sum', 'Quantity': 'avg'})
Ques 29. How can you handle data skewness in PySpark?
Data skewness can be mitigated with techniques like salting (appending a random suffix to hot keys so their rows spread across partitions), repartitioning or bucketing, or replacing a shuffle join with a broadcast join so the skewed side is never shuffled. On Spark 3+, Adaptive Query Execution can also split skewed join partitions automatically.
Example:
from pyspark.sql.functions import rand
df_salted = df.withColumn('salt', (rand() * 8).cast('int')).repartition('key', 'salt')
Ques 30. Explain the purpose of the 'coalesce' method in PySpark.
The 'coalesce' method reduces the number of partitions in a PySpark DataFrame. Unlike 'repartition', it merges existing partitions without a full shuffle, which makes it a cheap way to shrink an unnecessarily large partition count (for example, before writing output files). It can only decrease the partition count, never increase it.
Example:
df_coalesced = df.coalesce(5)