面接対策、オンラインテスト、チュートリアル、ライブ練習のための学習プラットフォーム

集中型学習パス、模擬テスト、面接向けコンテンツでスキルを伸ばしましょう。

WithoutBook は、分野別の面接質問、オンライン練習テスト、チュートリアル、比較ガイドをひとつのレスポンシブな学習空間にまとめています。

ライブラリを検索

面接準備

PySpark 面接の質問と回答

質問 26. What is the purpose of the 'accumulator' in PySpark?

An 'accumulator' is a variable that can be used in parallel operations and is updated by multiple tasks. It is typically used for implementing counters or sums in distributed computing.

Example:

accumulator = spark.sparkContext.accumulator(0)

# Inside a transformation or action
accumulator.add(1)

役に立ちましたか？はいいいえコメントを追加コメントを見る

質問 27. Explain the use of the 'broadcast' hint in PySpark.

The 'broadcast' hint is used to explicitly instruct PySpark to use a broadcast join strategy for better performance, especially when one DataFrame is significantly smaller than the other.

Example:

from pyspark.sql.functions import broadcast

result = df1.join(broadcast(df2), 'key')

役に立ちましたか？はいいいえコメントを追加コメントを見る

質問 28. What is the purpose of the 'agg' method in PySpark?

The 'agg' method is used for aggregating data in a PySpark DataFrame. It allows you to perform various aggregate functions like sum, avg, max, min, etc., on specified columns.

Example:

result = df.agg({'Sales': 'sum', 'Quantity': 'avg'})

役に立ちましたか？はいいいえコメントを追加コメントを見る

質問 29. How can you handle data skewness in PySpark?

Data skewness can be handled by using techniques like salting, bucketing, or using the 'broadcast' hint to distribute data more evenly across partitions.

Example:

df.write.option('skew_hint', 'true').parquet('output_path')

役に立ちましたか？はいいいえコメントを追加コメントを見る

質問 30. Explain the purpose of the 'coalesce' method in PySpark.

The 'coalesce' method is used to reduce the number of partitions in a PySpark DataFrame. It helps in optimizing the performance when the number of partitions is unnecessarily large.

Example:

df_coalesced = df.coalesce(5)

役に立ちましたか？はいいいえコメントを追加コメントを見る

ユーザー評価で最も役立つ内容: