Top 50 PySpark Interview Questions and Answers

PySpark Basics

  1. What is PySpark?
    PySpark is the Python API for Apache Spark, enabling Python developers to write Spark applications.

  2. What are the main features of PySpark?

    • Distributed computing
    • Fault tolerance
    • Lazy evaluation
    • In-memory processing
  3. What are the core components of Spark?

    • Spark Core
    • Spark SQL
    • Spark Streaming
    • MLlib
    • GraphX
  4. What is RDD in PySpark?
    Resilient Distributed Dataset (RDD) is the fundamental data structure in Spark, representing an immutable distributed collection of objects.

  5. What is the difference between RDD and DataFrame in PySpark?

    • RDD: Low-level API, no schema, supports functional programming.
    • DataFrame: High-level API, schema support, optimized query execution.

PySpark Architecture

  1. Explain the execution model of PySpark.

    • Driver: Orchestrates the execution.
    • Executor: Executes tasks on worker nodes.
    • Cluster Manager: Manages resources in the cluster.
  2. What is the role of SparkContext in PySpark?
    SparkContext is the entry point for any PySpark application, managing the connection to the Spark cluster.

  3. What is DAG in PySpark?
    Directed Acyclic Graph (DAG) is the execution plan Spark builds from the chain of transformations; the DAG scheduler splits it into stages of tasks that run on executors.

  4. What is a SparkSession?
    A unified entry point to Spark functionality, introduced in Spark 2.0. It wraps SparkContext and replaces the older SQLContext and HiveContext entry points.
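
    A minimal sketch of creating a SparkSession (the application name is illustrative); the wrapped SparkContext remains accessible through it:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder \
          .appName("InterviewDemo") \
          .master("local[*]") \
          .getOrCreate()
      sc = spark.sparkContext  # the underlying SparkContext is still available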

  5. What is the role of the cluster manager in PySpark?
    It allocates resources to Spark applications. Examples include Standalone, YARN, Kubernetes, and Mesos.

RDD Operations

  1. What are transformations in PySpark?
    Transformations create new RDDs from existing ones (e.g., map, filter).

  2. What are actions in PySpark?
    Actions trigger execution and return results (e.g., count, collect).

  3. What is lazy evaluation in PySpark?
    Transformations are not executed immediately; they are evaluated when an action is invoked.
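
    A small sketch of lazy evaluation, assuming an existing SparkContext sc: the filter() transformation only records lineage, and nothing runs until count() is called.

      rdd = sc.parallelize(range(10))
      evens = rdd.filter(lambda x: x % 2 == 0)  # transformation: builds the plan only
      print(evens.count())                      # action: triggers execution, prints 5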

  4. What is the difference between map() and flatMap()?

    • map(): Transforms each element into exactly one output element.
    • flatMap(): Maps each element to zero or more output elements and flattens the result (see the sketch below).
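
    An illustrative comparison, assuming an existing SparkContext sc:

      words = sc.parallelize(["hello world", "spark"])
      print(words.map(lambda s: s.split(" ")).collect())      # [['hello', 'world'], ['spark']]
      print(words.flatMap(lambda s: s.split(" ")).collect())  # ['hello', 'world', 'spark']
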
  5. What is the persist() method in PySpark?
    It allows storing RDDs in memory or disk to optimize computations.
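
    A short sketch of persist(), assuming an existing SparkContext sc; the first action materializes the RDD, and later actions reuse the stored copy:

      from pyspark import StorageLevel

      squares = sc.parallelize(range(1000)).map(lambda x: x * x)
      squares.persist(StorageLevel.MEMORY_AND_DISK)  # keep in memory, spill to disk if needed
      squares.count()   # first action computes and stores the RDD
      squares.sum()     # reuses the persisted data
      squares.unpersist()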

DataFrame and SQL

  1. What is a DataFrame in PySpark?
    A distributed collection of data organized into named columns.

  2. How do you create a DataFrame in PySpark?
    Using SparkSession.read for external files or SparkSession.createDataFrame for local data.
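
    Two common ways to create a DataFrame, assuming an existing SparkSession spark (the file path is illustrative):

      df_local = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])
      df_csv = spark.read.csv("/path/to/people.csv", header=True, inferSchema=True)
      df_local.show()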

  3. What is Spark SQL?
    A module for structured data processing using SQL queries.

  4. How do you perform filtering in PySpark DataFrames?
    Using .filter() or .where() methods.
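
    Both forms are equivalent; a quick sketch assuming a DataFrame df with an age column:

      from pyspark.sql.functions import col

      df.filter(col("age") > 21).show()
      df.where("age > 21").show()   # SQL-expression string form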

  5. What is Catalyst Optimizer in PySpark?
    An internal query optimizer that improves query execution in Spark SQL.

PySpark Programming

  1. What is a PySpark UDF?
    A user-defined function to apply custom transformations on DataFrame columns.
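
    A minimal UDF sketch, assuming a DataFrame df with a name column:

      from pyspark.sql.functions import udf
      from pyspark.sql.types import StringType

      to_upper = udf(lambda s: s.upper() if s is not None else None, StringType())
      df.withColumn("name_upper", to_upper(df["name"])).show()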

  2. How do you handle null values in PySpark?
    Using methods like .fillna(), .dropna(), or .replace().
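
    Typical calls, assuming a DataFrame df with age and city columns:

      df.fillna({"age": 0, "city": "unknown"})       # fill nulls per column
      df.dropna(subset=["age"])                      # drop rows where age is null
      df.replace("N/A", "unknown", subset=["city"])  # normalize sentinel values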

  3. What is the difference between join() and union() in PySpark?

    • join(): Combines two DataFrames horizontally based on a join condition or key.
    • union(): Appends the rows of one DataFrame to another; the schemas must match (see the sketch below).
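
    A sketch contrasting the two, assuming an existing SparkSession spark:

      people = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
      salaries = spark.createDataFrame([(1, 50000), (2, 60000)], ["id", "salary"])
      people.join(salaries, on="id", how="inner").show()  # combine columns by key
      people.union(people).show()                         # stack rows; schemas must match
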
  4. What is the difference between narrow and wide transformations?

    • Narrow: Each output partition depends on a single input partition, so no shuffle is needed (e.g., map, filter).
    • Wide: Output partitions depend on multiple input partitions, requiring a shuffle (e.g., groupByKey, reduceByKey).
  5. What is the role of partitioning in PySpark?
    Partitioning distributes data across multiple nodes, optimizing parallelism and performance.

PySpark Streaming

  1. What is Spark Streaming?
    A Spark module for processing real-time data streams.

  2. How do you create a DStream in PySpark?
    Create a StreamingContext from the SparkContext, then attach a source such as a socket or Kafka (see the sketch below).
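
    A minimal DStream sketch using a socket source, assuming an existing SparkContext sc (host and port are illustrative):

      from pyspark.streaming import StreamingContext

      ssc = StreamingContext(sc, batchDuration=5)
      lines = ssc.socketTextStream("localhost", 9999)
      lines.flatMap(lambda l: l.split(" ")).countByValue().pprint()
      ssc.start()
      ssc.awaitTermination()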

  3. What is a window operation in Spark Streaming?
    It applies transformations over a sliding window of data.

  4. What is the difference between batch and stream processing in PySpark?

    • Batch: Processes a bounded, already-stored dataset in a single run.
    • Stream: Processes unbounded data continuously as it arrives.
  5. What is the role of checkpointing in Spark Streaming?
    Checkpointing saves intermediate data for fault tolerance.

Machine Learning

  1. What is MLlib in PySpark?
    Spark’s library for machine learning, offering scalable algorithms.

  2. How do you build a pipeline in PySpark MLlib?
    Using a Pipeline whose stages are an ordered list of transformers and estimators (see the sketch below).
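
    A minimal Pipeline sketch with one transformer and one estimator, assuming training and test DataFrames train_df and test_df with x1, x2, and label columns:

      from pyspark.ml import Pipeline
      from pyspark.ml.feature import VectorAssembler
      from pyspark.ml.regression import LinearRegression

      assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
      lr = LinearRegression(featuresCol="features", labelCol="label")
      pipeline = Pipeline(stages=[assembler, lr])
      model = pipeline.fit(train_df)          # fits every stage in order
      predictions = model.transform(test_df)  # applies the fitted stages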

  3. What are transformers and estimators in PySpark MLlib?

    • Transformer: Converts one DataFrame into another via transform() (e.g., Tokenizer).
    • Estimator: Learns from data via fit() and returns a fitted model (e.g., LinearRegression).
  4. How do you handle categorical data in PySpark?
    Using StringIndexer or OneHotEncoder.

  5. What is the role of VectorAssembler in PySpark MLlib?
    It combines multiple columns into a single vector column.
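
    A combined sketch for the two questions above (Spark 3.x API), assuming a DataFrame df with a categorical city column and numeric age and income columns:

      from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

      indexer = StringIndexer(inputCol="city", outputCol="city_idx")
      encoder = OneHotEncoder(inputCols=["city_idx"], outputCols=["city_vec"])
      assembler = VectorAssembler(inputCols=["age", "income", "city_vec"], outputCol="features")

      indexed = indexer.fit(df).transform(df)
      encoded = encoder.fit(indexed).transform(indexed)
      features = assembler.transform(encoded)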

Performance Optimization

  1. What is data serialization in PySpark?
    Converting data into a byte stream for transmission or storage.

  2. How do you optimize joins in PySpark?

    • Use broadcast joins when one side is small enough to fit in executor memory (see the sketch below).
    • Ensure proper partitioning on the join key.
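
    A broadcast-join sketch, assuming a large DataFrame large_df and a small lookup DataFrame small_df sharing an id column:

      from pyspark.sql.functions import broadcast

      joined = large_df.join(broadcast(small_df), on="id", how="left")
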
  3. What is the role of caching in PySpark?
    Caching stores data in memory for reuse in subsequent actions.

  4. What is the Tungsten project in Spark?
    An initiative to improve Spark’s performance using off-heap memory and optimized code generation.

  5. How do you handle skewed data in PySpark?

    • Use salting to spread hot keys across partitions (see the sketch below).
    • Increase parallelism with repartition().
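
    An illustrative two-stage salted aggregation, assuming a DataFrame df with a heavily skewed key column and a numeric value column:

      from pyspark.sql.functions import floor, rand, sum as _sum

      salted = df.withColumn("salt", floor(rand() * 10))  # spread each key over 10 buckets
      partial = salted.groupBy("key", "salt").agg(_sum("value").alias("partial_sum"))
      result = partial.groupBy("key").agg(_sum("partial_sum").alias("total"))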

PySpark Advanced Concepts

  1. What is a broadcast variable in PySpark?
    A read-only variable distributed to all nodes.
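
    A small sketch, assuming an existing SparkContext sc:

      lookup = sc.broadcast({"US": "United States", "IN": "India"})
      codes = sc.parallelize(["US", "IN", "US"])
      print(codes.map(lambda c: lookup.value.get(c, "unknown")).collect())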

  2. What is an accumulator in PySpark?
    A variable used for aggregating information across tasks.
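
    A small sketch, assuming an existing SparkContext sc; tasks only add to the accumulator, and the driver reads its value:

      blank_lines = sc.accumulator(0)

      def count_blank(line):
          if not line.strip():
              blank_lines.add(1)

      sc.parallelize(["a", "", "b"]).foreach(count_blank)
      print(blank_lines.value)  # 1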

  3. What is a shared variable in PySpark?
    Variables like broadcast and accumulators used for sharing data between tasks.

  4. What is the difference between repartition() and coalesce()?

    • repartition(): Performs a full shuffle and can increase or decrease the number of partitions.
    • coalesce(): Only decreases partitions by merging them, avoiding a full shuffle (see the sketch below).
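
    A quick sketch, assuming a DataFrame df:

      df200 = df.repartition(200)         # full shuffle; partition count can go up or down
      df10 = df200.coalesce(10)           # merges partitions; avoids a full shuffle
      print(df10.rdd.getNumPartitions())  # 10
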
  5. What are PySpark’s common file formats?
    CSV, JSON, Parquet, ORC, Avro.

Scenario-Based Questions

  1. How do you debug PySpark applications?

    • Use logs and Spark UI.
    • Run in local mode for debugging.
  2. How do you integrate PySpark with Kafka?
    Use the Structured Streaming Kafka connector (the spark-sql-kafka package) and read the topic with spark.readStream.format("kafka"), as sketched below.
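
    A Structured Streaming sketch (broker address and topic name are illustrative); it assumes the spark-sql-kafka connector package is on the classpath and a SparkSession spark exists:

      stream = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "events")
                .load())
      query = (stream.selectExpr("CAST(value AS STRING) AS value")
               .writeStream
               .format("console")
               .start())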

  3. How do you process large datasets in PySpark?

    • Optimize partitions.
    • Use caching and serialization techniques.
  4. What happens if an executor fails in PySpark?
    Spark re-executes the tasks on other executors using RDD lineage.

  5. How do you deploy PySpark applications?
    Package the application and submit it using spark-submit.