Top 50 PySpark Interview Questions and Answers
PySpark Basics
What is PySpark?
PySpark is the Python API for Apache Spark, enabling Python developers to write Spark applications.
What are the main features of PySpark?
- Distributed computing
- Fault tolerance
- Lazy evaluation
- In-memory processing
What are the core components of Spark?
- Spark Core
- Spark SQL
- Spark Streaming
- MLlib
- GraphX
What is RDD in PySpark?
Resilient Distributed Dataset (RDD) is the fundamental data structure in Spark, representing an immutable distributed collection of objects.
What is the difference between RDD and DataFrame in PySpark?
- RDD: Low-level API, no schema, supports functional programming.
- DataFrame: High-level API, schema support, optimized query execution.
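A minimal sketch contrasting the two APIs (assuming an existing SparkSession named spark; the data is invented for illustration):
rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])   # RDD: no schema, functional style
adults_rdd = rdd.filter(lambda row: row[1] >= 30)
df = spark.createDataFrame(rdd, ["name", "age"])                     # DataFrame: named columns, Catalyst-optimized
df.where(df.age >= 30).show()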
PySpark Architecture
Explain the execution model of PySpark.
- Driver: Orchestrates the execution.
- Executor: Executes tasks on worker nodes.
- Cluster Manager: Manages resources in the cluster.
What is the role of SparkContext in PySpark?
SparkContext is the entry point for any PySpark application, managing the connection to the Spark cluster.
What is DAG in PySpark?
Directed Acyclic Graph (DAG) represents the sequence of computations performed on data; Spark builds it from transformations to plan execution stages.
What is a SparkSession?
A unified entry point to Spark functionality, introduced in Spark 2.0; it wraps SparkContext and the older SQL contexts.
What is the role of the cluster manager in PySpark?
It allocates resources to Spark applications. Examples include YARN, Mesos, and Standalone.
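A minimal sketch of creating the entry points (the application name and local master are illustrative choices, not requirements):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("interview-demo")     # hypothetical application name
         .master("local[*]")            # local mode; on a real cluster the cluster manager (YARN, Mesos, Standalone) allocates resources
         .getOrCreate())
sc = spark.sparkContext                 # the underlying SparkContext is still accessible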
RDD Operations
What are transformations in PySpark?
Transformations create new RDDs from existing ones (e.g., map, filter).
What are actions in PySpark?
Actions trigger execution and return results to the driver (e.g., count, collect).
What is lazy evaluation in PySpark?
Transformations are not executed immediately; they are evaluated only when an action is invoked.
What is the difference between map() and flatMap()?
- map(): Transforms each element into exactly one element.
- flatMap(): Transforms each element into zero or more elements and flattens the result.
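A small illustration of both operations and of laziness (assuming an existing SparkSession named spark):
lines = spark.sparkContext.parallelize(["hello world", "lazy eval"])
mapped = lines.map(lambda s: s.split())        # transformation only - nothing runs yet
flat = lines.flatMap(lambda s: s.split())      # also lazy
mapped.collect()                               # [['hello', 'world'], ['lazy', 'eval']] - one output per input
flat.collect()                                 # ['hello', 'world', 'lazy', 'eval'] - outputs flattened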
What is the persist() method in PySpark?
It stores an RDD (or DataFrame) in memory and/or on disk so that subsequent actions reuse it instead of recomputing it.
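A minimal persistence sketch (the storage level is chosen for illustration):
from pyspark import StorageLevel

nums = spark.sparkContext.parallelize(range(1_000_000))
nums.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions around after the first action computes them
nums.count()                                 # materializes the persisted data
nums.sum()                                   # reuses the persisted partitions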
DataFrame and SQL
What is a DataFrame in PySpark?
A distributed collection of data organized into named columns.
How do you create a DataFrame in PySpark?
Using SparkSession.read for external files or SparkSession.createDataFrame for local data.
What is Spark SQL?
A module for structured data processing using SQL queries.
How do you perform filtering in PySpark DataFrames?
Using the .filter() or .where() methods.
What is Catalyst Optimizer in PySpark?
An internal query optimizer that improves query execution in Spark SQL.
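A short sketch tying these together (column names and values are invented for the example):
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.filter(df.age > 30).show()                 # .where() is an alias for .filter()
df.createOrReplaceTempView("people")          # expose the DataFrame to Spark SQL
spark.sql("SELECT name FROM people WHERE age > 30").show()   # both queries go through Catalyst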
PySpark Programming
What is a PySpark UDF?
A user-defined function used to apply custom transformations to DataFrame columns.
How do you handle null values in PySpark?
Using methods like .fillna(), .dropna(), or .replace().
What is the difference between join() and union() in PySpark?
- join(): Combines two DataFrames horizontally based on a join condition.
- union(): Combines two DataFrames vertically by stacking rows (schemas must match).
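A compact sketch of a UDF, null handling, and a join (all data and column names are illustrative):
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

upper_udf = F.udf(lambda s: s.upper() if s else None, StringType())
people = spark.createDataFrame([("alice", None), ("bob", 29)], ["name", "age"])
people = people.fillna({"age": 0}).withColumn("name_upper", upper_udf("name"))
cities = spark.createDataFrame([("alice", "Paris")], ["name", "city"])
people.join(cities, "name", "left").show()    # horizontal combine on the name column
people.union(people).show()                   # vertical combine; schemas must match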
What is the difference between narrow and wide transformations?
- Narrow: Each output partition depends on a single input partition, so no shuffle is needed (e.g., map, filter).
- Wide: Data must be shuffled between partitions (e.g., groupByKey, reduceByKey).
What is the role of partitioning in PySpark?
Partitioning distributes data across multiple nodes, optimizing parallelism and performance.
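A brief sketch of partition behaviour, covering narrow and wide transformations (assuming an existing SparkSession named spark):
rdd = spark.sparkContext.parallelize(range(100), numSlices=4)
rdd.getNumPartitions()                                                    # 4
doubled = rdd.map(lambda x: x * 2)                                        # narrow: stays inside each partition
totals = rdd.map(lambda x: (x % 3, x)).reduceByKey(lambda a, b: a + b)    # wide: shuffles across partitions
df = spark.range(1_000_000).repartition(8, "id")                          # redistribute a DataFrame into 8 partitions by key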
PySpark Streaming
What is Spark Streaming?
A Spark module for processing real-time data streams.
How do you create a DStream in PySpark?
By building a StreamingContext on top of the SparkContext and reading from sources such as Kafka or a socket (see the sketch at the end of this section).
What is a window operation in Spark Streaming?
It applies transformations over a sliding window of data.
What is the difference between batch and stream processing in PySpark?
- Batch: Processes stored data.
- Stream: Processes real-time data.
What is the role of checkpointing in Spark Streaming?
Checkpointing saves intermediate data for fault tolerance.
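A minimal DStream sketch under assumed inputs (the host, port, and checkpoint directory are placeholders); note that Structured Streaming is the newer streaming API:
from pyspark.streaming import StreamingContext

ssc = StreamingContext(spark.sparkContext, batchDuration=5)   # 5-second micro-batches
ssc.checkpoint("/tmp/checkpoints")                            # checkpointing is required for stateful/windowed operations
lines = ssc.socketTextStream("localhost", 9999)               # hypothetical socket source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKeyAndWindow(lambda a, b: a + b, lambda a, b: a - b, 30, 10))  # 30s window, sliding every 10s
counts.pprint()
ssc.start()
ssc.awaitTermination()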
Machine Learning
What is MLlib in PySpark?
Spark’s library for machine learning, offering scalable algorithms.
How do you build a pipeline in PySpark MLlib?
Using a Pipeline composed of PipelineStage objects (transformers and estimators); see the sketch at the end of this section.
What are transformers and estimators in PySpark MLlib?
- Transformer: Converts input data (e.g., Tokenizer).
- Estimator: Trains a model (e.g., LinearRegression).
How do you handle categorical data in PySpark?
Using StringIndexer or OneHotEncoder.
What is the role of VectorAssembler in PySpark MLlib?
It combines multiple columns into a single vector column.
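A small end-to-end pipeline sketch covering these pieces (the toy data and column names are invented):
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.regression import LinearRegression

train = spark.createDataFrame(
    [("red", 1.0, 10.0), ("blue", 2.0, 21.0), ("red", 3.0, 14.0), ("blue", 4.0, 25.0)],
    ["color", "x", "label"])
indexer = StringIndexer(inputCol="color", outputCol="color_idx")                  # categorical string -> numeric index
encoder = OneHotEncoder(inputCols=["color_idx"], outputCols=["color_vec"])        # index -> one-hot vector
assembler = VectorAssembler(inputCols=["color_vec", "x"], outputCol="features")   # multiple columns -> single vector
lr = LinearRegression(featuresCol="features", labelCol="label")                   # estimator
model = Pipeline(stages=[indexer, encoder, assembler, lr]).fit(train)
model.transform(train).select("features", "prediction").show()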
Performance Optimization
What is data serialization in PySpark?
Converting data into a byte stream for transmission or storage.
How do you optimize joins in PySpark?
- Use broadcast joins for smaller datasets.
- Ensure proper partitioning.
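A broadcast-join sketch (the table names and sizes are illustrative):
from pyspark.sql.functions import broadcast

orders = spark.range(1_000_000).withColumnRenamed("id", "customer_id")                  # stand-in for a large table
customers = spark.createDataFrame([(0, "alice"), (1, "bob")], ["customer_id", "name"])  # small lookup table
joined = orders.join(broadcast(customers), "customer_id")   # ship the small side to every executor instead of shuffling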
What is the role of caching in PySpark?
Caching stores data in memory for reuse in subsequent actions.
What is the Tungsten project in Spark?
An initiative to improve Spark’s performance using off-heap memory and optimized code generation.
How do you handle skewed data in PySpark?
- Use salting techniques (see the sketch below).
- Increase parallelism with repartition().
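A salting sketch under assumed inputs: facts is a hypothetical skewed DataFrame and dims a small one, both with a join column named key:
from pyspark.sql import functions as F

buckets = 10                                                                         # number of salt values (tuning knob)
salted_facts = facts.withColumn("salt", (F.rand() * buckets).cast("int"))            # spread each hot key over `buckets` values
salted_dims = dims.crossJoin(spark.range(buckets).withColumnRenamed("id", "salt"))   # replicate the small side once per salt
joined = salted_facts.join(salted_dims, ["key", "salt"])                             # the skewed key now lands in many partitions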
PySpark Advanced Concepts
What is a broadcast variable in PySpark?
A read-only variable distributed to all nodes.
What is an accumulator in PySpark?
A variable used for aggregating information across tasks.
What is a shared variable in PySpark?
Shared variables, namely broadcast variables and accumulators, are used for sharing data between tasks.
What is the difference between repartition() and coalesce()?
- repartition(): Increases or decreases the number of partitions with a full shuffle.
- coalesce(): Decreases the number of partitions without a full shuffle.
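A short sketch of both shared-variable types plus partition resizing (assuming an existing SparkSession named spark; the data is invented):
rates = spark.sparkContext.broadcast({"EUR": 1.1, "GBP": 1.3})   # read-only copy shipped to every executor
bad_rows = spark.sparkContext.accumulator(0)                     # counter that tasks can only add to

def to_usd(record):
    currency, amount = record
    if currency not in rates.value:
        bad_rows.add(1)
        return 0.0
    return amount * rates.value[currency]

payments = spark.sparkContext.parallelize([("EUR", 10.0), ("XXX", 5.0)])
payments.map(to_usd).collect()
print(bad_rows.value)                                            # 1, once the action has run

df = spark.range(100)
df.repartition(20)   # full shuffle; partition count can go up or down
df.coalesce(5)       # merges existing partitions without a full shuffle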
What are PySpark’s common file formats?
CSV, JSON, Parquet, ORC, Avro.
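A sketch of the common read/write paths (the file paths are placeholders):
df = spark.read.option("header", True).csv("/data/people.csv")    # hypothetical input path
df.write.mode("overwrite").parquet("/data/people_parquet")        # columnar, compressed output
events = spark.read.format("json").load("/data/events.json")      # the same reader API covers JSON, ORC, Avro, ...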
Scenario-Based Questions
How do you debug PySpark applications?
- Use logs and Spark UI.
- Run in local mode for debugging.
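A small local-debugging sketch (master and log level chosen for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()   # local mode: driver and executors in one JVM
spark.sparkContext.setLogLevel("INFO")                          # adjust log verbosity while investigating
# While the application runs, the Spark UI (http://localhost:4040 by default) shows jobs, stages, and storage.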
How do you integrate PySpark with Kafka?
Use the Structured Streaming Kafka connector (the spark-sql-kafka package) for consuming Kafka streams; a read sketch follows the next answer.
How do you process large datasets in PySpark?
- Optimize partitions.
- Use caching and serialization techniques.
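The Kafka read sketch referenced above (assumes the spark-sql-kafka connector is on the classpath; the broker address and topic name are placeholders):
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")   # hypothetical broker
          .option("subscribe", "events")                         # hypothetical topic
          .load())
messages = stream.selectExpr("CAST(value AS STRING) AS value")
query = messages.writeStream.format("console").outputMode("append").start()
query.awaitTermination()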
What happens if an executor fails in PySpark?
Spark re-executes the tasks on other executors, using RDD lineage to recompute lost data.
How do you deploy PySpark applications?
Package the application and submit it using spark-submit.