
Apache Spark Interview Questions

This post includes Big Data Spark interview questions and answers for both experienced candidates and beginners. If you are a beginner, don't worry, the answers are explained in detail. These are very frequently asked Data Engineer interview questions that will help you crack your big data job interview.


 

What is Apache Spark?


According to the Spark documentation, Apache Spark is a fast and general-purpose in-memory cluster computing system.

  • It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.

  • It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

In simple terms, Spark is a distributed data processing engine which supports programming languages like Java, Scala, Python and R. On top of its core engine, Spark has four built-in libraries: Spark SQL, MLlib for machine learning, Spark Streaming and GraphX.



What is Apache Spark used for?

  • Apache Spark is used for real-time data processing.

  • Implementing Extract, Transform, Load (ETL) processes.

  • Implementing machine learning algorithms and creating interactive dashboards for data analytics.

  • Apache Spark is also used to process petabytes of data distributed over a cluster with thousands of nodes.


How does Apache Spark work?


Spark uses a master-slave architecture to distribute data across worker nodes and process it in parallel. Just like MapReduce, Spark has a central coordinator called the driver, and the remaining worker nodes run executors. The driver communicates with the executors to process the data.
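
For illustration, here is a minimal Scala sketch (the application name, master URL and resource values are arbitrary assumptions) showing how a driver application requests executors from a cluster manager when it builds its SparkSession:

import org.apache.spark.sql.SparkSession

// The driver program starts here; the cluster manager (e.g. YARN) launches
// executors on worker nodes with the requested resources.
val spark = SparkSession.builder()
  .appName("DriverExecutorDemo")                 // arbitrary application name
  .master("yarn")                                // or "local[*]" for a single-machine test
  .config("spark.executor.instances", "4")       // number of executors (assumed value)
  .config("spark.executor.memory", "2g")         // memory per executor (assumed value)
  .config("spark.executor.cores", "2")           // cores per executor (assumed value)
  .getOrCreate()

// Jobs submitted through 'spark' are split into tasks that the driver
// schedules on the executors and whose results are sent back to the driver.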


 

Why is Spark faster than Hadoop MapReduce?


One of the drawbacks of Hadoop MapReduce is that it writes the full output of each map and reduce job back to HDFS. This is very expensive because it consumes a lot of disk I/O and network I/O. In Spark, on the other hand, operations are split into transformations and actions, and Spark doesn't materialize or write out data until an action is called, which reduces disk I/O and network I/O. Another innovation is in-memory caching: you can instruct Spark to hold input data in memory so that the program doesn't have to read the same data from disk again, further reducing disk I/O.
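
As a rough illustration, the following spark-shell sketch (the input path and filter condition are made-up examples) caches an RDD in memory so that the second action reuses it instead of re-reading the file from disk:

// 'sc' is the SparkContext that the spark-shell provides.
val logs = sc.textFile("hdfs:///data/sample-logs")            // hypothetical input path

// Nothing is read yet: textFile and filter are lazy transformations.
val errors = logs.filter(line => line.contains("ERROR")).cache()

// The first action reads from disk and materializes 'errors' in executor memory.
val errorCount = errors.count()

// The second action reuses the cached partitions, avoiding another disk read.
val firstTen = errors.take(10)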



Is Hadoop required for Spark?


No, the Hadoop file system is not required for Spark. However, Spark can run on top of HDFS and YARN if required, and this is a common production setup.


Is Spark built on top of Hadoop?


No. Spark is totally independent of Hadoop.


What is Spark API?


Apache Spark has three main sets of APIs (Application Programming Interfaces) - RDDs, DataFrames and Datasets - that allow developers to access data and run various functions across four different languages - Java, Scala, Python and R.
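
A minimal spark-shell sketch (the sample records and the Person class are made-up assumptions) showing the three APIs side by side:

// Assume this runs in the spark-shell, where 'spark' (a SparkSession) already exists.
import spark.implicits._

case class Person(name: String, age: Int)              // hypothetical record type

// RDD: a low-level distributed collection of objects.
val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 34), Person("Bob", 29)))

// DataFrame: the same data organized into rows with named columns.
val df = rdd.toDF()

// Dataset: like a DataFrame, but strongly typed as Person.
val ds = df.as[Person]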


What is Spark RDD?


Resilient Distributed Datasets (RDDs) are an immutable collection of elements used as the fundamental data structure in Apache Spark.

  • The data is logically partitioned across the nodes of your cluster so that it can be accessed and computed in parallel.

  • RDD has been the primary Spark API since the foundation of Apache Spark.


What are the methods to create an RDD in Spark?


There are mainly two methods to create an RDD (both are shown in the sketch below).

  • Parallelizing an existing collection - sc.parallelize()

  • Referencing an external dataset - sc.textFile()
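
A short spark-shell sketch of both methods (the file path is a made-up example):

// 'sc' is the SparkContext available in the spark-shell.

// 1. Parallelizing an existing in-memory collection:
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Referencing an external dataset, e.g. a text file on HDFS or local disk:
val lines = sc.textFile("hdfs:///data/input.txt")       // hypothetical path

println(numbers.count())   // 5
println(lines.count())     // number of lines in the file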


 

When would you use Spark RDD?

  • RDDs are used for unstructured data such as streams of media or text, when a schema and a columnar format are not required - that is, when you don't need to access data by column name or rely on other tabular attributes.

  • Secondly, RDDs are used when you want full, low-level control over how the data is physically distributed and processed.


 

What is SparkContext, SQLContext, SparkSession and SparkConf?

  • SparkContext tells the Spark driver application whether to access the cluster through a resource manager or to run locally in standalone mode. The resource manager can be YARN or Spark's own cluster manager.

  • SparkConf stores the configuration parameters that the Spark driver application passes to SparkContext. These parameters define properties of the driver application which are used by Spark to allocate resources on the cluster, such as the number of executors and the memory and cores used by the executors running on the worker nodes.

  • SQLContext is a class which is used to implement Spark SQL. You need to create SparkConf and SparkContext first in order to create a SQLContext. It is basically used for structured data, when you want to apply a schema and run SQL.

  • All three - SparkContext, SparkConf and SQLContext - are encapsulated within SparkSession. In newer versions (Spark 2.0+) you can work with Spark SQL directly through SparkSession, as in the sketch below.
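
A brief sketch (application names, master URL and file path are assumptions) contrasting the older SparkConf/SparkContext/SQLContext setup with the newer SparkSession entry point:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}

// Older style: build each object explicitly.
// (In a real application you would use either the old or the new style, not both.)
val conf = new SparkConf().setAppName("OldStyleApp").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)           // superseded by SparkSession in Spark 2.x

// Newer style (Spark 2.0+): SparkSession wraps all of the above.
val spark = SparkSession.builder()
  .appName("NewStyleApp")
  .master("local[*]")
  .getOrCreate()

val sameSc = spark.sparkContext               // the SparkContext is still accessible
val df = spark.read.json("people.json")       // hypothetical file, read via Spark SQL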


 

What is Spark checkpointing?


Spark checkpointing is a process that saves the actual intermediate RDD data to a reliable distributed file system. In other words, it saves an intermediate stage of an RDD lineage. You can do it by calling RDD.checkpoint() while developing the Spark driver application, and you need to set up a checkpoint directory where Spark can store these intermediate RDDs by calling sc.setCheckpointDir() on the SparkContext.
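
A short spark-shell sketch (the checkpoint directory and the computation are made-up examples):

// 'sc' is the SparkContext in the spark-shell.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // reliable storage for checkpoint files

val data = sc.parallelize(1 to 1000000)
val transformed = data.map(_ * 2).filter(_ % 3 == 0)

// Mark the RDD for checkpointing; its lineage is truncated once it is materialized.
transformed.checkpoint()

// The first action both computes the RDD and writes the checkpoint files.
println(transformed.count())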


 

What is an action in Spark and what happens when it's executed?


An action triggers execution of the RDD lineage graph: it loads the original data from disk, creates the intermediate RDDs, performs all transformations and either returns the final output to the Spark driver program or writes the data to the file system (based on the type of action). Actions listed in the Spark documentation include collect(), count(), first(), take(), reduce() and saveAsTextFile().
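
A spark-shell sketch (the input and output paths are made-up) showing transformations staying lazy until an action runs:

// 'sc' is the SparkContext in the spark-shell.
val words = sc.textFile("hdfs:///data/words.txt")        // transformation: nothing runs yet
val pairs = words.map(word => (word, 1))                  // transformation: still lazy
val counts = pairs.reduceByKey(_ + _)                     // transformation: still lazy

// Actions trigger execution of the whole lineage above:
val asMap = counts.collectAsMap()                          // returns the result to the driver
counts.saveAsTextFile("hdfs:///output/word-counts")        // writes the result to the file system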


 

What is Spark Streaming?


Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.
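
A minimal word-count sketch adapted from the Spark Streaming programming guide, listening on a local TCP socket (the host and port are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Two local threads: one to receive data, one to process it.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))           // 1-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)        // assumed host and port
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()                                         // print each batch's counts

ssc.start()                                                // start receiving and processing
ssc.awaitTermination()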


Reference: Apache Spark documentation


 

Login to see more:

  • What is the difference between RDDs, DataFrames and Datasets?

  • Why is Spark RDD immutable?

  • Are Spark DataFrames immutable?

  • Are Spark DataFrames distributed?

  • What is a Spark stage?

  • How does Spark SQL work?

  • What is a Spark executor and how does it work?

  • How will you improve Apache Spark performance?

  • What is the Spark SQL warehouse directory?

  • What is the Spark shell? How would you open and close it?

  • How will you clear the screen in the Spark shell?

  • What is parallelize in Spark?

  • Does Spark SQL require Hive?

  • What is the Metastore in Hive?

  • What do repartition and coalesce do in Spark?

  • What is Spark mapPartitions?

  • What is the difference between map and flatMap in Spark?

  • What is Spark reduceByKey?

  • What is lazy evaluation in Spark?

  • What is an accumulator in Spark?

  • Can RDDs be shared between SparkContexts?

  • What happens if an RDD partition is lost due to a worker node failure?

  • Which serialization libraries are supported in Spark?

  • What is a cluster manager in Spark?


Questions? Feel free to write in the comments section below. Thank you.

