Top 30 Interview Questions and Answers (2024)

Question 1

Explain the concept of Resilient Distributed Datasets (RDD) in PySpark.

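An RDD is the fundamental data structure in Spark: an immutable collection of objects partitioned across the nodes of a cluster. The partitions are processed in parallel. Fault tolerance comes from lineage: Spark records the transformations that produced each RDD and recomputes lost partitions from them.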

Example:

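from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # reuse the active session or create one
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)  # distribute the list across partitions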
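
To make the ideas of partitioning and lineage concrete, here is a minimal plain-Python sketch that runs without Spark. It is only a model: ToyRDD and recompute_partition are hypothetical names, the slicing scheme is illustrative and may differ from how Spark splits the data, and the model evaluates eagerly where Spark is lazy.

```python
# Plain-Python model of an RDD: partitioned data plus a recorded lineage.
# ToyRDD and recompute_partition are hypothetical, not part of PySpark.

class ToyRDD:
    def __init__(self, partitions, lineage=()):
        # Tuples keep the model immutable, like a real RDD.
        self.partitions = tuple(tuple(p) for p in partitions)
        self.lineage = tuple(lineage)

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Cut the data into contiguous slices (illustrative scheme only).
        size = -(-len(data) // num_partitions)  # ceiling division
        parts = [data[i:i + size] for i in range(0, len(data), size)]
        return cls(parts)

    def map(self, fn):
        # A transformation returns a new ToyRDD and records fn in the lineage.
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(new_parts, self.lineage + (fn,))

    def collect(self):
        # Gather all partitions back into one list.
        return [x for p in self.partitions for x in p]


def recompute_partition(source_partition, lineage):
    # Rebuild one lost partition by replaying the recorded functions.
    result = list(source_partition)
    for fn in lineage:
        result = [fn(x) for x in result]
    return result


base = ToyRDD.parallelize([1, 2, 3, 4, 5], num_partitions=2)
doubled = base.map(lambda x: x * 2)

print(base.partitions)    # ((1, 2, 3), (4, 5))
print(doubled.collect())  # [2, 4, 6, 8, 10]
# Pretend the second partition of `doubled` was lost and rebuild it.
print(recompute_partition(base.partitions[1], doubled.lineage))  # [8, 10]
```

Note that only the lost partition is rebuilt, from its source slice and the recorded functions. That is the reason lineage makes replication of the data itself unnecessary.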

Question 2

What is the difference between a DataFrame and an RDD in PySpark?

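A DataFrame is a higher-level abstraction built on top of RDDs: a distributed table with named columns and a schema. Because the schema is known, Spark can optimize queries through the Catalyst optimizer, and you can use SQL-like operations such as select, filter, and groupBy. An RDD, by contrast, is a schema-less collection of arbitrary objects, and Spark runs its operations as written.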

Example:

df = spark.createDataFrame([(1, 'John'), (2, 'Jane')], ['ID', 'Name'])

Question 3

What is the purpose of the 'cache' operation in PySpark?

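The 'cache' operation marks a DataFrame or RDD to be kept in memory once it has been computed, so iterative algorithms and repeated operations reuse the stored data instead of recomputing it. It is lazy: nothing is stored until the first action runs.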

Example:

df.cache()
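
The effect of caching can be sketched in plain Python without Spark. This is only a model under stated assumptions: LazyDataset is a hypothetical class, and it counts how often the underlying data has to be recomputed when two actions run on it.

```python
# Plain-Python model of lazy evaluation with and without caching.
# LazyDataset is a hypothetical name, not a PySpark class.

class LazyDataset:
    def __init__(self, compute):
        self._compute = compute  # function that produces the data
        self._cached = None
        self._use_cache = False
        self.computations = 0    # how many times the data was computed

    def cache(self):
        # Only marks the dataset; nothing is computed yet (lazy).
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        data = self._compute()
        if self._use_cache:
            self._cached = data
        return data

    def count(self):  # an "action"
        return len(self._materialize())

    def total(self):  # another "action"
        return sum(self._materialize())


uncached = LazyDataset(lambda: [x * x for x in range(5)])
uncached.count()
uncached.total()
print(uncached.computations)  # 2: each action recomputed the data

cached = LazyDataset(lambda: [x * x for x in range(5)]).cache()
print(cached.computations)    # 0: cache() alone computes nothing
cached.count()
cached.total()
print(cached.computations)    # 1: the second action reused the stored data
```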

Question 4

How can you handle missing or null values in a PySpark DataFrame?

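Use the functions under 'df.na': 'drop' removes rows that contain nulls (optionally restricted to a subset of columns), and 'fill' replaces nulls with a given value. Both return a new DataFrame and leave the original unchanged.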

Example:

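clean_df = df.na.drop()    # drop rows that contain any null
filled_df = df.na.fill(0)  # or replace nulls in numeric columns with 0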
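
The row-level behavior can be modeled in plain Python, with rows as dictionaries and None standing in for null. This is a sketch, not the Spark implementation: drop_nulls and fill_nulls are hypothetical helpers, and unlike Spark's 'fill', this model does not check that the fill value matches the column type.

```python
# Plain-Python model of dropping or filling nulls (None stands in for null).
# drop_nulls and fill_nulls are hypothetical helpers, not PySpark functions.

def drop_nulls(rows, subset=None):
    # Keep rows with no None in the checked columns (all columns by default).
    def is_complete(row):
        cols = subset if subset is not None else row.keys()
        return all(row[c] is not None for c in cols)
    return [row for row in rows if is_complete(row)]


def fill_nulls(rows, value):
    # Replace every None with `value`, building new rows.
    return [{k: (value if v is None else v) for k, v in row.items()}
            for row in rows]


rows = [
    {"ID": 1, "Sales": 10},
    {"ID": 2, "Sales": None},
    {"ID": None, "Sales": 30},
]

print(drop_nulls(rows))                 # [{'ID': 1, 'Sales': 10}]
print(drop_nulls(rows, subset=["ID"]))  # keeps the rows with ID 1 and 2
print(fill_nulls(rows, 0))              # every None becomes 0
```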

Question 5

What is the purpose of the 'explode' function in PySpark?

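The 'explode' function turns each element of an array column (or each key-value pair of a map column) into its own row, repeating the values of the other columns. Rows whose array is null or empty produce no output; use 'explode_outer' to keep them.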

Example:

from pyspark.sql.functions import explode

exploded_df = df.select('ID', explode('items').alias('item'))
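
What 'explode' does to each row can be sketched in plain Python. This is a model only: explode_rows is a hypothetical helper, rows are dictionaries, and the sample data is made up for illustration.

```python
# Plain-Python model of explode: one output row per array element.
# explode_rows is a hypothetical helper, not a PySpark function.

def explode_rows(rows, array_col, alias):
    exploded = []
    for row in rows:
        items = row[array_col]
        if not items:  # None or empty list: the row yields no output
            continue
        for item in items:
            # Copy the other columns and add the single element under `alias`.
            new_row = {k: v for k, v in row.items() if k != array_col}
            new_row[alias] = item
            exploded.append(new_row)
    return exploded


rows = [
    {"ID": 1, "items": ["a", "b"]},
    {"ID": 2, "items": ["c"]},
    {"ID": 3, "items": []},
]

for row in explode_rows(rows, "items", "item"):
    print(row)
# {'ID': 1, 'item': 'a'}
# {'ID': 1, 'item': 'b'}
# {'ID': 2, 'item': 'c'}
```

The row with ID 3 disappears because its array is empty, which mirrors the behavior that often surprises people when row counts drop after an explode.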

Question 6

Explain the purpose of the 'persist' operation in PySpark.

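'persist' stores a DataFrame or RDD at a chosen storage level (in memory, on disk, or both) once it has been computed, so subsequent operations read the stored data instead of recomputing it. Call 'unpersist' to release the storage when it is no longer needed.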

Example:

df.persist()

Question 9

Explain the difference between 'cache' and 'persist' operations in PySpark.

'cache' is shorthand for 'persist()' with the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames), while 'persist' accepts an explicit StorageLevel such as MEMORY_ONLY, MEMORY_AND_DISK, or DISK_ONLY.

Example:

df.cache()

Question 10

What is the purpose of the 'agg' method in PySpark?

The 'agg' method computes aggregates over a PySpark DataFrame. It applies functions such as sum, avg, max, and min to the specified columns, either across the whole DataFrame or, after 'groupBy', within each group.

Example:

result = df.agg({'Sales': 'sum', 'Quantity': 'avg'})
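
The dictionary form of 'agg' maps a column name to the name of an aggregate function. A plain-Python sketch of that idea follows. It is a model only: the agg function here is a hypothetical stand-in, rows are dictionaries, and the sample figures are invented for illustration. The output keys follow the 'function(column)' naming that Spark gives the result columns.

```python
# Plain-Python model of df.agg({'column': 'function'}).
# This agg function is a hypothetical stand-in, not the PySpark method.

def agg(rows, spec):
    functions = {
        "sum": sum,
        "avg": lambda values: sum(values) / len(values),
        "max": max,
        "min": min,
    }
    result = {}
    for column, name in spec.items():
        # Aggregates skip nulls (None in this model).
        values = [row[column] for row in rows if row[column] is not None]
        result[f"{name}({column})"] = functions[name](values)
    return result


rows = [
    {"Sales": 100, "Quantity": 2},
    {"Sales": 250, "Quantity": 4},
    {"Sales": 50, "Quantity": 3},
]

print(agg(rows, {"Sales": "sum", "Quantity": "avg"}))
# {'sum(Sales)': 400, 'avg(Quantity)': 3.0}
```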

Question 11

Explain the purpose of the 'coalesce' method in PySpark.

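The 'coalesce' method reduces the number of partitions in a PySpark DataFrame by merging existing partitions, which avoids a full shuffle. It is useful when the number of partitions is unnecessarily large, for example after a filter that leaves many partitions nearly empty. It cannot increase the number of partitions; use 'repartition' for that.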

Example:

df_coalesced = df.coalesce(5)
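
The merging behavior can be sketched in plain Python, with partitions as lists. This is a model under stated assumptions: the coalesce function below is a hypothetical stand-in, and the way it groups neighboring partitions is illustrative; Spark chooses the groups itself.

```python
# Plain-Python model of coalesce: merge whole partitions, never split them.
# This coalesce function is a hypothetical stand-in, not the PySpark method.

def coalesce(partitions, num_partitions):
    if num_partitions >= len(partitions):
        # Asking for more partitions than exist changes nothing.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Send each existing partition, intact, to one target partition.
        merged[i * num_partitions // len(partitions)].extend(part)
    return merged


partitions = [[1, 2], [3], [4, 5], [6], [7, 8], [9]]

print(coalesce(partitions, 2))        # [[1, 2, 3, 4, 5], [6, 7, 8, 9]]
print(len(coalesce(partitions, 10)))  # 6: the count cannot grow
```

Because every source partition moves as a whole, no individual record has to be rerouted by key, which is what makes this cheaper than a full shuffle.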
