Apache Spark Interview Questions and Answers
Question 6. Explain the concept of partitions in Apache Spark.
Partitions are the basic unit of parallelism in Spark. Each partition is a logical chunk of a dataset, and the partitions of a dataset are spread across the nodes of the cluster. Spark runs one task per partition, so partitions are processed independently and in parallel.
Example:
// Create an RDD from a local collection, split into 2 partitions
val inputRDD = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)
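To make the idea of independent partitions more concrete, the following is a minimal sketch that inspects how the same five elements are laid out. It assumes local mode with two threads, and the application name is made up for illustration. `SparkContext.getOrCreate` reuses an existing context (for example the one in spark-shell) if there is one.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode with two worker threads, for illustration only
val conf = new SparkConf().setMaster("local[2]").setAppName("PartitionSketch")
val sc = SparkContext.getOrCreate(conf)

val inputRDD = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)
val numPartitions = inputRDD.getNumPartitions // 2, as requested above

// glom() gathers each partition into an array so its contents can be inspected
val perPartition = inputRDD.glom().collect()
perPartition.foreach(p => println(p.mkString("[", ", ", "]")))

// mapPartitions calls the function once per partition, not once per element
val partitionSums = inputRDD.mapPartitions(it => Iterator(it.sum)).collect()

// repartition() shuffles the data into a new number of partitions
val repartitioned = inputRDD.repartition(4)
```

Each entry of `partitionSums` is produced by a separate task, which is why the number of partitions bounds how much of the work can run in parallel.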
Question 7. What is a Spark Executor and what role does it play in Spark applications?
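A Spark Executor is a JVM process on a worker node that runs the tasks assigned to it by the driver and holds the application's cached data. Executors are launched at the start of a Spark application and, by default, stay alive for its whole lifetime. With dynamic allocation enabled, Spark can add and remove executors as the workload changes.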
Example:
spark-submit --master yarn --deploy-mode client --num-executors 3 mySparkApp.jar
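Executor settings can also be supplied as configuration properties instead of command-line flags. The sketch below uses the real property names `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory`, but the values are assumptions chosen for illustration, not recommendations.

```scala
import org.apache.spark.SparkConf

// Illustrative values only; suitable sizes depend on the cluster and workload
val conf = new SparkConf()
  .setAppName("mySparkApp")
  .set("spark.executor.instances", "3") // counterpart of --num-executors 3 on YARN
  .set("spark.executor.cores", "2")     // task slots per executor (assumed value)
  .set("spark.executor.memory", "4g")   // heap size per executor (assumed value)

// Rough upper bound on tasks running at once, assuming one CPU per task
val maxConcurrentTasks =
  conf.getInt("spark.executor.instances", 1) * conf.getInt("spark.executor.cores", 1)
```

With these numbers, at most six tasks run at the same time, so a stage with many more partitions is processed in several waves.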
Question 8. How does Spark handle fault tolerance in RDDs?
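Spark achieves fault tolerance through lineage: each RDD records the chain of transformations (a DAG) that produced it from its source data. If a partition of an RDD is lost, for example because an executor fails, Spark recomputes just that partition by replaying those transformations on the relevant input, rather than relying on replicated copies of the data.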
Example:
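val originalRDD = sc.parallelize(Seq(-1, 0, 1, 2)) // sample input, for illustration
val resilientRDD = originalRDD.filter(x => x > 0)  // lineage: parallelize -> filter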
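The lineage that Spark records can be inspected directly. The following is a small sketch, assuming local mode and made-up sample data, that prints the chain of parent RDDs Spark would replay to rebuild a lost partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses an existing context if present
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("LineageSketch"))

val originalRDD = sc.parallelize(Seq(-2, -1, 0, 1, 2, 3), 3)
val resilientRDD = originalRDD.filter(x => x > 0).map(x => x * 10)

// toDebugString describes the RDD and its ancestors, i.e. the recorded lineage
println(resilientRDD.toDebugString)

// dependencies exposes the direct parents of this RDD
val parentCount = resilientRDD.dependencies.size

val result = resilientRDD.collect()
```

Because `filter` and `map` are narrow transformations, a lost partition of `resilientRDD` can be rebuilt from the matching partition of `originalRDD` alone, without touching the others.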
Question 9. What is the Broadcast variable in Spark and when is it used?
A Broadcast variable is a read-only value that Spark ships to each executor once and caches there, instead of sending a copy with every task. It is used to efficiently distribute large read-only data structures, such as lookup tables, to all tasks in a Spark job.
Example:
val broadcastVar = sc.broadcast(Array(1, 2, 3))
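A typical use is a lookup table that every task needs to read. The sketch below assumes local mode, and the `countryNames` table is hypothetical. Tasks read the broadcast through `.value` rather than capturing the map in their closures.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses an existing context if present
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("BroadcastSketch"))

// Hypothetical lookup table, small here but typically much larger in practice
val countryNames = Map("DE" -> "Germany", "FR" -> "France", "JP" -> "Japan")
val lookup = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("FR", "JP", "XX"))

// Each task reads the cached copy on its executor via lookup.value
val resolved = codes.map(code => lookup.value.getOrElse(code, "unknown")).collect()

// Drop the cached copies on the executors once they are no longer needed
lookup.unpersist()
```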
Question 10. Explain the role of the Spark Driver in a Spark application.
The Spark Driver is the process that runs the application's main() function and creates the SparkContext. It turns the program into jobs, stages, and tasks, schedules those tasks on the Spark Executors, and collects results from them.
Example:
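import org.apache.spark.SparkContext

object MyApp {
  def main(args: Array[String]): Unit = {
    // Creating the SparkContext makes this process the driver
    val sc = new SparkContext("local", "MyApp")
    sc.stop() // release resources when the application is done
  }
}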
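The split between driver and executors shows up in when the work actually runs. The following is a minimal sketch, assuming local mode, in which the driver only builds a plan until an action forces it to schedule tasks and gather their results.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode; getOrCreate reuses an existing context if present
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("DriverSketch"))

// Transformations only build the plan in the driver; no tasks run yet
val squares = sc.parallelize(1 to 10, 2).map(n => n * n)

// Actions make the driver schedule tasks on the executors:
// each task reduces its own partition and the driver merges the partial results
val total = squares.reduce(_ + _)

// take(3) returns the first three elements to the driver as a local array
val firstThree = squares.take(3)
```

Results returned by actions such as `reduce`, `take`, and `collect` live in the driver's memory, so large outputs are usually written to storage from the executors instead.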
Rated most helpful by users:
- What is the purpose of the Spark SQL module?
- Explain the difference between narrow and wide transformations in Spark.