PySpark Interview Questions and Answers

Ques 1. What is PySpark?

PySpark is the Python API for Apache Spark, a fast and general-purpose cluster computing system.

Example:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to the DataFrame API
spark = SparkSession.builder.appName('example').getOrCreate()


Ques 2. Explain the concept of Resilient Distributed Datasets (RDD) in PySpark.

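An RDD is Spark's fundamental low-level data structure: an immutable collection of objects split into partitions across the cluster. Partitions are processed in parallel, and a lost partition can be recomputed from the RDD's lineage (the recorded chain of transformations that produced it), which is what provides fault tolerance.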

Example:

# Assumes an active SparkSession named spark (see Ques 1)
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)


Ques 3. What is the difference between a DataFrame and an RDD in PySpark?

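A DataFrame is a higher-level abstraction built on top of RDDs that organizes data into named columns, like a table. Because a DataFrame has a schema, Spark can optimize queries on it (through the Catalyst optimizer) and it supports SQL-like operations. An RDD, by contrast, is a collection of arbitrary objects with no schema, so Spark runs the functions you pass to it as written.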

Example:

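# Build a DataFrame from a list of tuples plus column names (assumes an active SparkSession named spark)
df = spark.createDataFrame([(1, 'John'), (2, 'Jane')], ['ID', 'Name'])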


Ques 4. How can you perform the join operation in PySpark?

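Use the 'join' method on a DataFrame, passing the other DataFrame, the join condition, and the join type. For example, df1.join(df2, df1['key'] == df2['key'], 'inner') performs an inner join on 'key', keeping only rows whose key appears in both DataFrames.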

Example:

# Assumes df1 and df2 are DataFrames that each have a 'key' column
result = df1.join(df2, df1['key'] == df2['key'], 'inner')


Ques 5. Explain the purpose of the 'groupBy' operation in PySpark.

'groupBy' groups rows by the values of one or more columns. It is usually followed by an aggregation function such as count, sum, or mean, which is computed once for each group.

Example:

# Average Price per Category (assumes df has 'Category' and 'Price' columns)
grouped_data = df.groupBy('Category').agg({'Price': 'mean'})


Copyright © 2026, WithoutBook.