Spark Interview Questions

Download PDF of Apache Spark Interview Questions

61. Which all cluster manager can be used with Spark?

Ans:

Apache Mesos, Hadoop YARN, Spark standalone and

Spark local: Local node or on single JVM. Drivers and executor runs in same JVM. In this case same node will be used for execution.

Premium Training : Spark Full Length Training : with Hands On Lab

62. What is a BlockManager?

Ans: Block Manager is a key-value store for blocks that acts as a cache. It runs on every node, i.e. a driver and executors, in a Spark runtime environment. It provides interfaces for putting and retrieving blocks both locally and remotely into various stores, i.e. memory, disk, and offheap.

A BlockManager manages the storage for most of the data in Spark, i.e. block that represent a cached RDD partition, intermediate shuffle data, and broadcast data.

Premium : Cloudera Hadoop and Spark Developer Certification Material

63. What is Data locality / placement?

Ans: Spark relies on data locality or data placement or proximity to data source, that makes Spark jobs sensitive to where the data is located. It is therefore important to have Spark running on Hadoop YARN cluster if the data comes from HDFS.

With HDFS the Spark driver contacts NameNode about the DataNodes (ideally local) containing the various blocks of a file or directory as well as their locations (represented as InputSplits ), and then schedules the work to the SparkWorkers. Spark’s compute nodes / workers should be running on storage nodes.

64. What is master URL in local mode?

Ans: You can run Spark in local mode using local , local[n] or the most general local[*].

The URL says how many threads can be used in total:

· local uses 1 thread only.

· local[n] uses n threads.

· local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).

Premium : Hortonworks Spark Developer Certification Material (HDPCD:Spark)

65. Define components of YARN?

Ans: YARN components are below

ResourceManager: runs as a master daemon and manages ApplicationMasters and NodeManagers.

ApplicationMaster: is a lightweight process that coordinates the execution of tasks of an application and asks the ResourceManager for resource containers for tasks. It monitors tasks, restarts failed ones, etc. It can run any type of tasks, be them MapReduce tasks or Giraph tasks, or Spark tasks.

NodeManager offers resources (memory and CPU) as resource containers.

NameNode

Container: can run tasks, including ApplicationMasters.

66. What is a Broadcast Variable?

Ans: Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

67. How can you define Spark Accumulators?

Ans: This are similar to counters in Hadoop MapReduce framework, which gives information regarding completion of tasks, or how much data is processed etc.

68. What all are the data sources Spark can process?

Ans:

· Hadoop File System (HDFS)

· Cassandra (NoSQL databases)

· HBase (NoSQL database)

· S3 (Amazon WebService Storage : AWS Cloud)

69. What is Apache Parquet format?

Ans: Apache Parquet is a columnar storage format

70. What is Apache Spark Streaming?

Ans: Spark Streaming helps to process live stream data. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.

Premium Training : Spark Full Length Training : with Hands On Lab


Previous Next

Home Spark Hadoop NiFi Java












Disclaimer :

1. Hortonworks® is a registered trademark of Hortonworks.

2. Cloudera® is a registered trademark of Cloudera Inc

3. Azure® is aregistered trademark of Microsoft Inc.

4. Oracle®, Java® are registered trademark of Oracle Inc

5. SAS® is a registered trademark of SAS Inc

6. IBM® is a registered trademark of IBM Inc

7. DataStax ® is a registered trademark of DataStax

8. MapR® is a registered trademark of MapR Inc.

2014-2017 © HadoopExam.com | Dont Copy , it's bad Karma |