QUESTION 1

_______ is the component of Spark that is responsible for assigning work that will be completed in parallel. In a single Databricks cluster, there will be only one of this component.

QUESTION 2

Select all different methods to run tasks in parallel.

Running tasks in multiple threads

Running tasks in multiple processes

Increasing the CPU clock speed of the node the task is running on

Adding additional nodes to run the tasks

Increasing the disk space of the node the task is running on

Increasing the memory of the node the task is running on
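As an aside on the two software-level options above, a minimal Python sketch (not part of the quiz) of running the same tasks in multiple threads and then in multiple processes:

```python
# Minimal sketch: run tasks in multiple threads, then in multiple
# processes, using Python's standard library.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def task(n):
    return n * n  # stand-in for real work

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:    # multiple threads
        print(list(pool.map(task, range(8))))
    with ProcessPoolExecutor(max_workers=4) as pool:   # multiple processes
        print(list(pool.map(task, range(8))))
```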

QUESTION 3

Select all the benefits of the MapReduce algorithm over others.

The MapReduce approach does not require a central data structure

Allows multiple tasks to run in parallel in the mapping, shuffling, and reducing steps

Tasks that require iterative processing over data work well with MapReduce

QUESTION 4

Select all correct information about Spark DataFrames.

A DataFrame is an immutable, distributed collection of data organized into named columns

A Spark DataFrame carries important metadata that allows Spark to optimize queries

Spark DataFrames are conceptually equivalent to a table in a relational database

Information stored in a Spark DataFrame is automatically saved into the database
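As an illustrative PySpark sketch (assuming a local SparkSession; not part of the quiz) of the named-column, immutable nature of DataFrames:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# Named columns, conceptually like a relational table
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.printSchema()   # schema metadata that Spark's optimizer can use

# DataFrames are immutable: a transformation returns a new DataFrame
upper = df.selectExpr("id", "upper(name) AS name")
upper.show()
```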

QUESTION 5

Actions are statements that are computed AND executed when they are encountered in the developer's code.

True

False
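For context on the transformation/action distinction behind Question 5, a minimal sketch (assuming a local SparkSession): transformations are recorded lazily, and computation happens only when an action is reached.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

df = spark.range(1000)              # DataFrame with a single "id" column
evens = df.filter(df.id % 2 == 0)   # transformation: recorded, not yet run
total = evens.count()               # action: triggers actual execution
print(total)                        # 500
```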

QUESTION 6

What are the five main components of the Apache Spark ecosystem?

Spark Core

Spark SQL

Spark Streaming and Structured Streaming

Machine Learning Library (MLlib)

Graph Computation (GraphX)

Spark HDFS

Spark YARN

QUESTION 7

______ are created by the driver and assigned a partition of data to process. Then, ______ are assigned to slots for parallel execution.

QUESTION 8

Spark can be configured in three different deployment modes: local, client, and cluster mode.

True

False
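For reference on the modes in Question 8, a sketch of a local-mode session (assumed setup, illustrative only); client and cluster mode are typically selected when submitting the application, e.g. with spark-submit's --deploy-mode flag:

```python
from pyspark.sql import SparkSession

# Local mode: driver and executors run on this one machine, here with
# 4 worker threads. Client vs. cluster mode differ in where the driver
# runs and are chosen at submission time, not in code.
spark = (SparkSession.builder
         .master("local[4]")
         .appName("deploy-demo")
         .getOrCreate())
print(spark.sparkContext.master)   # local[4]
```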

QUESTION 9

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

True

False

QUESTION 10

Apache Spark is a sophisticated distributed computation framework for executing code in parallel across many different machines.

True

False

QUESTION 11

How many driver programs can run in a single Spark Cluster?

QUESTION 12

Select all correct information about Spark Executors.

The executors are responsible for carrying out the work assigned by the driver

Execute code assigned by the driver

Report the state of the computation back to the driver

Maintain information about the Spark application
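To make the driver/executor split concrete, a hypothetical configuration sketch (the property names are standard Spark settings; the values are arbitrary):

```python
from pyspark.sql import SparkSession

# The driver (this program) plans and assigns work; executors carry it
# out and report state back. Each executor core acts as a slot that can
# run one task at a time. Illustrative sizing only.
spark = (SparkSession.builder
         .appName("executor-demo")
         .config("spark.executor.instances", "3")
         .config("spark.executor.cores", "3")
         .config("spark.executor.memory", "2g")
         .getOrCreate())
```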

QUESTION 13

What are the four main components of the Hadoop ecosystem?

Hadoop Distributed File System (HDFS)

Yet Another Resource Negotiator (YARN)

Hadoop MapReduce

Hadoop Common (Hadoop Core)

Hadoop Analytics

Hadoop Spark

QUESTION 14

Select all correct information about map reduce algorithm.

A data set is mapped into a collection of (key, value) pairs in the mapping step

Mapping step produces intermediate results and associates values with an output key

Shuffling step produces intermediate results and associates values with an output key

Shuffling step groups intermediate results associated with the same output key

Reducing step groups intermediate results associated with the same output key

Reducing step processes groups of intermediate results with the same output key

Mapping step processes groups of intermediate results with the same output key
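A tiny word-count sketch (plain Python, illustrative only) of the three steps named in these options:

```python
from collections import defaultdict

docs = ["spark is fast", "hadoop is reliable"]

# Mapping: emit intermediate (key, value) pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffling: group intermediate results that share an output key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reducing: process each group of values with the same key
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'spark': 1, 'is': 2, 'fast': 1, 'hadoop': 1, 'reliable': 1}
```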

QUESTION 15

Select all correct information about Transformations.

Transformations are at the core of how you express your business logic in Spark.

Transformations have lazy evaluation

There are 3 types of transformations: narrow, wide, and shuffler.

Narrow transformations mean that the work happens on the executor without changing the way data is partitioned over the system
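An illustrative sketch (assuming a local SparkSession) contrasting a narrow transformation with a wide one that redistributes data via a shuffle:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-demo").getOrCreate()
df = spark.range(100).withColumn("bucket", F.col("id") % 10)

narrow = df.filter(F.col("id") > 50)  # narrow: each partition processed in place
wide = df.groupBy("bucket").count()   # wide: data is shuffled across partitions

wide.show()   # lazy evaluation: nothing executes until this action
```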

QUESTION 16

RDDs use the Catalyst Optimizer to find an efficient plan for applying your transformations and actions.

True

False

QUESTION 17

What is the primary difference between Spark and Hadoop MapReduce?

Spark processes and retains data in memory for subsequent steps, whereas MapReduce processes data on disk.

Hadoop processes and retains data in memory for subsequent steps, whereas Spark processes data on disk.

Hadoop brings compute to datasets, whereas Spark brings data during compute.

Spark brings compute to datasets, whereas Hadoop brings data during compute.

QUESTION 18

A ______ is a collection of rows that sit on one physical machine in the cluster.
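An illustrative sketch (assuming a local SparkSession) of inspecting and changing how a DataFrame's rows are split across these physical chunks:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .appName("partition-demo")
         .getOrCreate())

df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())   # how many chunks the rows are split into

redistributed = df.repartition(8)  # shuffle the rows into 8 new chunks
print(redistributed.rdd.getNumPartitions())   # 8
```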

QUESTION 19

If you have 3 executors and each executor has 3 slots, what is the maximum number of tasks that can be executed at any one time?

QUESTION 20

Sort the transformational phases in the Catalyst Optimizer (rank each phase from 1 to 4).

Code generation

Analysis

Physical planning

Logical optimization
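To inspect the output of these phases for a real query, a sketch using DataFrame.explain (assuming a local SparkSession); with extended=True it prints the parsed and analyzed logical plans, the optimized logical plan, and the physical plan:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

query = (spark.range(1000)
         .filter(F.col("id") > 10)
         .groupBy((F.col("id") % 7).alias("k"))
         .count())

# Shows the plans Catalyst produces on the way to executable code
query.explain(extended=True)
```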
