Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Sample Questions Answers

Questions 4

Which of the following code blocks returns a DataFrame where columns predError and productId are removed from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

1.+-------------+---------+-----+-------+---------+----+

3.+-------------+---------+-----+-------+---------+----+

4.|1 |3 |4 |25 |1 |null|

5.|2 |6 |7 |2 |2 |null|

6.|3 |3 |null |25 |3 |null|

7.+-------------+---------+-----+-------+---------+----+

Options:

transactionsDf.withColumnRemoved("predError", "productId")

transactionsDf.drop(["predError", "productId", "associateId"])

transactionsDf.drop("predError", "productId", "associateId")

transactionsDf.dropColumns("predError", "productId", "associateId")

transactionsDf.drop(col("predError", "productId"))

Buy Now

Questions 5

The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to

accomplish this.

transactionsDf.__1__(__2__.__3__(__4__))

Options:

1. select

2. col("storeId")

3. cast

4. StringType

1. select

2. col("storeId")

3. as

4. StringType

1. cast

2. "storeId"

3. as

4. StringType()

1. select

2. col("storeId")

3. cast

4. StringType()

1. select

2. storeId

3. cast

4. StringType()

Buy Now

Questions 6

The code block displayed below contains an error. The code block should trigger Spark to cache DataFrame transactionsDf in executor memory where available, writing to disk where insufficient

executor memory is available, in a fault-tolerant way. Find the error.

Code block:

transactionsDf.persist(StorageLevel.MEMORY_AND_DISK)

Options:

Caching is not supported in Spark, data are always recomputed.

Data caching capabilities can be accessed through the spark object, but not through the DataFrame API.

The storage level is inappropriate for fault-tolerant storage.

The code block uses the wrong operator for caching.

The DataFrameWriter needs to be invoked.

Buy Now

Questions 7

The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as string

type like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))

Options:

1. withColumn

2. "transactionDateForm"

3. "MMM d (EEEE)"

4. "transactionDate"

1. select

2. "transactionDate"

3. "transactionDateForm"

4. "MMM d (EEEE)"

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MMM d (EEEE)"

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MM d (EEE)"

1. withColumnRenamed

2. "transactionDate"

3. "transactionDateForm"

4. "MM d (EEE)"

Buy Now

Questions 8

Which of the following describes characteristics of the Spark UI?

Options:

Via the Spark UI, workloads can be manually distributed across executors.

Via the Spark UI, stage execution speed can be modified.

The Scheduler tab shows how jobs that are run in parallel by multiple users are distributed across the cluster.

There is a place in the Spark UI that shows the property spark.executor.memory.

Some of the tabs in the Spark UI are named Jobs, Stages, Storage, DAGs, Executors, and SQL.

Buy Now

Questions 9

Which of the following code blocks returns a 2-column DataFrame that shows the distinct values in column productId and the number of rows with that productId in DataFrame transactionsDf?

Options:

transactionsDf.count("productId").distinct()

transactionsDf.groupBy("productId").agg(col("value").count())

transactionsDf.count("productId")

transactionsDf.groupBy("productId").count()

transactionsDf.groupBy("productId").select(count("value"))

Buy Now

Questions 10

Which of the following code blocks returns a DataFrame that has all columns of DataFrame transactionsDf and an additional column predErrorSquared which is the squared value of column

predError in DataFrame transactionsDf?

Options:

transactionsDf.withColumn("predError", pow(col("predErrorSquared"), 2))

transactionsDf.withColumnRenamed("predErrorSquared", pow(predError, 2))

transactionsDf.withColumn("predErrorSquared", pow(col("predError"), lit(2)))

transactionsDf.withColumn("predErrorSquared", pow(predError, lit(2)))

transactionsDf.withColumn("predErrorSquared", "predError"**2)

Buy Now

Questions 11

The code block displayed below contains an error. The code block should read the csv file located at path data/transactions.csv into DataFrame transactionsDf, using the first row as column header

and casting the columns in the most appropriate type. Find the error.

First 3 rows of transactions.csv:

1.transactionId;storeId;productId;name

2.1;23;12;green grass

3.2;35;31;yellow sun

4.3;23;12;green grass

Code block:

transactionsDf = spark.read.load("data/transactions.csv", sep=";", format="csv", header=True)

Options:

The DataFrameReader is not accessed correctly.

The transaction is evaluated lazily, so no file will be read.

Spark is unable to understand the file type.

The code block is unable to capture all columns.

The resulting DataFrame will not have the appropriate schema.

Buy Now

Questions 12

Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should

only be listed once.

Sample of DataFrame itemsDf:

1.+------+--------------------+--------------------+-------------------+

3.+------+--------------------+--------------------+-------------------+

7.+------+--------------------+--------------------+-------------------+

Options:

itemsDf.filter(col(supplier).not_contains('X')).select(supplier).distinct()

itemsDf.select(~col('supplier').contains('X')).distinct()

itemsDf.filter(not(col('supplier').contains('X'))).select('supplier').unique()

itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()

itemsDf.filter(!col('supplier').contains('X')).select(col('supplier')).unique()

Buy Now

Questions 13

Which of the following code blocks generally causes a great amount of network traffic?

Options:

DataFrame.select()

DataFrame.coalesce()

DataFrame.collect()

DataFrame.rdd.map()

DataFrame.count()

Buy Now

Questions 14

Which of the following describes Spark actions?

Options:

Writing data to disk is the primary purpose of actions.

Actions are Spark's way of exchanging data between executors.

The driver receives data upon request by actions.

Stage boundaries are commonly established by actions.

Actions are Spark's way of modifying RDDs.

Buy Now

Questions 15

The code block displayed below contains an error. The code block below is intended to add a column itemNameElements to DataFrame itemsDf that includes an array of all words in column

itemName. Find the error.

Sample of DataFrame itemsDf:

1.+------+----------------------------------+-------------------+

2.|itemId|itemName |supplier |

3.+------+----------------------------------+-------------------+

4.|1 |Thick Coat for Walking in the Snow|Sports Company Inc.|

5.|2 |Elegant Outdoors Summer Dress |YetiX |

6.|3 |Outdoors Backpack |Sports Company Inc.|

7.+------+----------------------------------+-------------------+

Code block:

itemsDf.withColumnRenamed("itemNameElements", split("itemName"))

Options:

All column names need to be wrapped in the col() operator.

Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument "," needs to be passed to the split method.

Operator withColumnRenamed needs to be replaced with operator withColumn and the split method needs to be replaced by the splitString method.

Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument " " needs to be passed to the split method.

The expressions "itemNameElements" and split("itemName") need to be swapped.

Buy Now

Questions 16

Which of the elements in the labeled panels represent the operation performed for broadcast variables?

Larger image

Options:

2, 5

2, 3

1, 2

1, 3, 4

Buy Now

Questions 17

The code block shown below should write DataFrame transactionsDf as a parquet file to path storeDir, using brotli compression and replacing any previously existing file. Choose the answer that

correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__.format("parquet").__2__(__3__).option(__4__, "brotli").__5__(storeDir)

Options:

1. save

2. mode

3. "ignore"

4. "compression"

5. path

1. store

2. with

3. "replacement"

4. "compression"

5. path

1. write

2. mode

3. "overwrite"

4. "compression"

5. save

(Correct)

1. save

2. mode

3. "replace"

4. "compression"

5. path

1. write

2. mode

3. "overwrite"

4. compression

5. parquet

Buy Now

Questions 18

Which of the following code blocks produces the following output, given DataFrame transactionsDf?

Output:

1.root

2. |-- transactionId: integer (nullable = true)

3. |-- predError: integer (nullable = true)

4. |-- value: integer (nullable = true)

5. |-- storeId: integer (nullable = true)

6. |-- productId: integer (nullable = true)

7. |-- f: integer (nullable = true)

DataFrame transactionsDf:

1.+-------------+---------+-----+-------+---------+----+

3.+-------------+---------+-----+-------+---------+----+

4.| 1| 3| 4| 25| 1|null|

5.| 2| 6| 7| 2| 2|null|

6.| 3| 3| null| 25| 3|null|

7.+-------------+---------+-----+-------+---------+----+

Options:

transactionsDf.schema.print()

transactionsDf.rdd.printSchema()

transactionsDf.rdd.formatSchema()

transactionsDf.printSchema()

print(transactionsDf.schema)

Buy Now

Questions 19

Which of the following statements about Spark's DataFrames is incorrect?

Options:

Spark's DataFrames are immutable.

Spark's DataFrames are equal to Python's DataFrames.

Data in DataFrames is organized into named columns.

RDDs are at the core of DataFrames.

The data in DataFrames may be split into multiple chunks.

Buy Now

Questions 20

Which is the highest level in Spark's execution hierarchy?

Options:

Task

Executor

Slot

Job

Stage

Buy Now

Questions 21

Which of the following code blocks reads in the JSON file stored at filePath, enforcing the schema expressed in JSON format in variable json_schema, shown in the code block below?

Code block:

1.json_schema = """

2.{"type": "struct",

3. "fields": [

4. {

5. "name": "itemId",

6. "type": "integer",

7. "nullable": true,

8. "metadata": {}

9. },

10. {

11. "name": "supplier",

12. "type": "string",

13. "nullable": true,

14. "metadata": {}

15. }

16. ]

17.}

18."""

Options:

spark.read.json(filePath, schema=json_schema)

spark.read.schema(json_schema).json(filePath)

1.schema = StructType.fromJson(json.loads(json_schema))

2.spark.read.json(filePath, schema=schema)

spark.read.json(filePath, schema=schema_of_json(json_schema))

spark.read.json(filePath, schema=spark.read.json(json_schema))

Buy Now

Answer:

Explanation:

Explanation

Spark provides a way to digest JSON-formatted strings as schema. However, it is not trivial to use. Although slightly above exam difficulty, this QUESTION NO: is beneficial to your exam

preparation, since

it helps you to familiarize yourself with the concept of enforcing schemas on data you are reading in - a topic within the scope of the exam.

The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the

operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be "a DDL-formatted string (For

example col0 INT, col1 DOUBLE)". Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.

With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type

pyspark.sql.types.StructType or "a DDL-formatted string (For example col0 INT, col1 DOUBLE)". We already know that json_schema does not follow this format, so we should focus on how we can

transform json_schema into pyspark.sql.types.StructType. Hereby, we also eliminate the option where schema=json_schema.

The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.

Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator's documentation (linked below) states that it "[p]arses a JSON string and infers its schema in DDL

format". This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to "infer" a schema from it. In the documentation you can see

an example use case which helps you understand the difference better. Here, you pass string '{a: 1}' to schema_of_json() and the method infers a DDL-format schema STRUCT from it.

In our case, we may end up with the output schema of schema_of_json() describing the schema of the JSON schema, instead of using the schema itself. This is not the right answer option.

Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType - exactly the type which the schema parameter of spark.read.json expects.

Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.

More info:

- pyspark.sql.DataFrameReader.schema — PySpark 3.1.2 documentation

- pyspark.sql.DataFrameReader.json — PySpark 3.1.2 documentation

- pyspark.sql.functions.schema_of_json — PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, QUESTION NO: 51 (Databricks import instructions)

Questions 22

Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

Options:

itemsDf.persist(StorageLevel.MEMORY_ONLY)

itemsDf.cache(StorageLevel.MEMORY_AND_DISK)

itemsDf.store()

itemsDf.cache()

itemsDf.write.option('destination', 'memory').save()

Buy Now

Questions 23

Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?

Options:

spark.mode("parquet").read("/FileStore/imports.parquet")

spark.read.path("/FileStore/imports.parquet", source="parquet")

spark.read().parquet("/FileStore/imports.parquet")

spark.read.parquet("/FileStore/imports.parquet")

spark.read().format('parquet').open("/FileStore/imports.parquet")

Buy Now

Questions 24

Which of the following describes a way for resizing a DataFrame from 16 to 8 partitions in the most efficient way?

Options:

Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.

Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.

Use a narrow transformation to reduce the number of partitions.

Use a wide transformation to reduce the number of partitions.

Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.

Buy Now

Questions 25

Which of the following code blocks returns a single-row DataFrame that only has a column corr which shows the Pearson correlation coefficient between columns predError and value in DataFrame

transactionsDf?

Options:

transactionsDf.select(corr(["predError", "value"]).alias("corr")).first()

transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first()

transactionsDf.select(corr(predError, value).alias("corr"))

transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))

(Correct)

transactionsDf.select(corr("predError", "value"))

Buy Now

Answer:

Explanation:

Explanation

In difficulty, this QUESTION NO: is above what you can expect from the exam. What this QUESTION NO: wants to teach you, however, is to pay attention to the useful details included in the

documentation.

pyspark.sql.corr is not a very common method, but it deals with Spark's data structure in an interesting way. The command takes two columns over multiple rows and returns a single row - similar to

an aggregation function. When examining the documentation (linked below), you will find this code example:

a = range(20)

b = [2 * x for x in range(20)]

df = spark.createDataFrame(zip(a, b), ["a", "b"])

df.agg(corr("a", "b").alias('c')).collect()

[Row(c=1.0)]

See how corr just returns a single row? Once you understand this, you should be suspicious about answers that include first(), since there is no need to just select a single row. A reason to eliminate

those answers is that DataFrame.first() returns an object of type Row, but not DataFrame, as requested in the question.

transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))

Correct! After calculating the Pearson correlation coefficient, the resulting column is correctly renamed to corr.

transactionsDf.select(corr(predError, value).alias("corr"))

No. In this answer, Python will interpret column names predError and value as variable names.

transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first()

Incorrect. first() returns a row, not a DataFrame (see above and linked documentation below).

transactionsDf.select(corr("predError", "value"))

Wrong. Whie this statement returns a DataFrame in the desired shape, the column will have the name corr(predError, value) and not corr.

transactionsDf.select(corr(["predError", "value"]).alias("corr")).first()

False. In addition to first() returning a row, this code block also uses the wrong call structure for command corr which takes two arguments (the two columns to correlate).

More info:

- pyspark.sql.functions.corr — PySpark 3.1.2 documentation

- pyspark.sql.DataFrame.first — PySpark 3.1.2 documentation

Static notebook | Dynamic notebook: See test 3, QUESTION NO: 53 (Databricks import instructions)

Questions 26

The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and

transactionDate (in this order). Find the error.

Code block:

transactionsDf.coalesce(14, ("storeId", "transactionDate"))

Options:

The parentheses around the column names need to be removed and .select() needs to be appended to the code block.

Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.

(Correct)

Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.

Operator coalesce needs to be replaced by repartition and the parentheses around the column names need to be replaced by square brackets.

Operator coalesce needs to be replaced by repartition.

Buy Now

Questions 27

Which of the following code blocks performs an inner join between DataFrame itemsDf and DataFrame transactionsDf, using columns itemId and transactionId as join keys, respectively?

Options:

itemsDf.join(transactionsDf, "inner", itemsDf.itemId == transactionsDf.transactionId)

itemsDf.join(transactionsDf, itemId == transactionId)

itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.transactionId, "inner")

itemsDf.join(transactionsDf, "itemsDf.itemId == transactionsDf.transactionId", "inner")

itemsDf.join(transactionsDf, col(itemsDf.itemId) == col(transactionsDf.transactionId))

Buy Now

Questions 28

Which of the following code blocks removes all rows in the 6-column DataFrame transactionsDf that have missing data in at least 3 columns?

Options:

transactionsDf.dropna("any")

transactionsDf.dropna(thresh=4)

transactionsDf.drop.na("",2)

transactionsDf.dropna(thresh=2)

transactionsDf.dropna("",4)

Buy Now

Questions 29

Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?

Options:

1.counter = 0

3.for index, row in itemsDf.iterrows():

4. if 'Inc.' in row['supplier']:

5. counter = counter + 1

7.print(counter)

1.counter = 0

3.def count(x):

4. if 'Inc.' in x['supplier']:

5. counter = counter + 1

7.itemsDf.foreach(count)

8.print(counter)

print(itemsDf.foreach(lambda x: 'Inc.' in x))

print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())

1.accum=sc.accumulator(0)

3.def check_if_inc_in_supplier(row):

4. if 'Inc.' in row['supplier']:

5. accum.add(1)

7.itemsDf.foreach(check_if_inc_in_supplier)

8.print(accum.value)

Buy Now

Questions 30

Which of the following code blocks returns a DataFrame with a single column in which all items in column attributes of DataFrame itemsDf are listed that contain the letter i?

Sample of DataFrame itemsDf:

1.+------+----------------------------------+-----------------------------+-------------------+

3.+------+----------------------------------+-----------------------------+-------------------+

7.+------+----------------------------------+-----------------------------+-------------------+

Options:

itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(attributes_exploded.contains("i"))

itemsDf.explode(attributes).alias("attributes_exploded").filter(col("attributes_exploded").contains("i"))

itemsDf.select(explode("attributes")).filter("attributes_exploded".contains("i"))

itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(col("attributes_exploded").contains("i"))

itemsDf.select(col("attributes").explode().alias("attributes_exploded")).filter(col("attributes_exploded").contains("i"))

Buy Now

Questions 31

Which of the following code blocks displays the 10 rows with the smallest values of column value in DataFrame transactionsDf in a nicely formatted way?

Options:

transactionsDf.sort(asc(value)).show(10)

transactionsDf.sort(col("value")).show(10)

transactionsDf.sort(col("value").desc()).head()

transactionsDf.sort(col("value").asc()).print(10)

transactionsDf.orderBy("value").asc().show(10)

Buy Now

Questions 32

The code block shown below should store DataFrame transactionsDf on two different executors, utilizing the executors' memory as much as possible, but not writing anything to disk. Choose the

answer that correctly fills the blanks in the code block to accomplish this.

1.from pyspark import StorageLevel

2.transactionsDf.__1__(StorageLevel.__2__).__3__

Options:

1. cache

2. MEMORY_ONLY_2

3. count()

1. persist

2. DISK_ONLY_2

3. count()

1. persist

2. MEMORY_ONLY_2

3. select()

1. cache

2. DISK_ONLY_2

3. count()

1. persist

2. MEMORY_ONLY_2

3. count()

Buy Now

Questions 33

The code block displayed below contains an error. The code block should return DataFrame transactionsDf, but with the column storeId renamed to storeNumber. Find the error.

Code block:

transactionsDf.withColumn("storeNumber", "storeId")

Options:

Instead of withColumn, the withColumnRenamed method should be used.

Arguments "storeNumber" and "storeId" each need to be wrapped in a col() operator.

Argument "storeId" should be the first and argument "storeNumber" should be the second argument to the withColumn method.

The withColumn operator should be replaced with the copyDataFrame operator.

Instead of withColumn, the withColumnRenamed method should be used and argument "storeId" should be the first and argument "storeNumber" should be the second argument to that method.

Buy Now

Questions 34

Which of the following code blocks returns a DataFrame that is an inner join of DataFrame itemsDf and DataFrame transactionsDf, on columns itemId and productId, respectively and in which every

itemId just appears once?

Options:

itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId").distinct("itemId")

itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates(["itemId"])

itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates("itemId")

itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId, how="inner").distinct(["itemId"])

itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId", how="inner").dropDuplicates(["itemId"])

Buy Now

Questions 35

Which of the following code blocks returns a copy of DataFrame transactionsDf where the column storeId has been converted to string type?

Options:

transactionsDf.withColumn("storeId", convert("storeId", "string"))

transactionsDf.withColumn("storeId", col("storeId", "string"))

transactionsDf.withColumn("storeId", col("storeId").convert("string"))

transactionsDf.withColumn("storeId", col("storeId").cast("string"))

transactionsDf.withColumn("storeId", convert("storeId").as("string"))

Buy Now

Questions 36

Which of the following code blocks applies the boolean-returning Python function evaluateTestSuccess to column storeId of DataFrame transactionsDf as a user-defined function?

Options:

1.from pyspark.sql import types as T

2.evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())

3.transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))

1.evaluateTestSuccessUDF = udf(evaluateTestSuccess)

2.transactionsDf.withColumn("result", evaluateTestSuccessUDF(storeId))

1.from pyspark.sql import types as T

2.evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.IntegerType())

3.transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))

1.evaluateTestSuccessUDF = udf(evaluateTestSuccess)

2.transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))

1.from pyspark.sql import types as T

2.evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())

3.transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))

Buy Now

Questions 37

Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate

format for this kind of data?

Options:

1.spark.read.schema(

2. StructType(

3. StructField("transactionId", IntegerType(), True),

4. StructField("predError", IntegerType(), True)

5. )).load(filePath)

1.spark.read.schema([

2. StructField("transactionId", NumberType(), True),

3. StructField("predError", IntegerType(), True)

4. ]).load(filePath)

1.spark.read.schema(

2. StructType([

3. StructField("transactionId", StringType(), True),

4. StructField("predError", IntegerType(), True)]

5. )).parquet(filePath)

1.spark.read.schema(

2. StructType([

3. StructField("transactionId", IntegerType(), True),

4. StructField("predError", IntegerType(), True)]

5. )).format("parquet").load(filePath)

1.spark.read.schema([

2. StructField("transactionId", IntegerType(), True),

3. StructField("predError", IntegerType(), True)

4. ]).load(filePath, format="parquet")

Buy Now

Questions 38

Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?

Options:

transactionsDf.sort("storeId", asc("productId"))

transactionsDf.sort(col(storeId)).desc(col(productId))

transactionsDf.order_by(col(storeId), desc(col(productId)))

transactionsDf.sort("storeId", desc("productId"))

transactionsDf.sort("storeId").sort(desc("productId"))

Buy Now

Questions 39

Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?

Options:

itemsDf.cache().count()

itemsDf.cache(eager=True)

cache(itemsDf)

itemsDf.cache().filter()

itemsDf.rdd.storeCopy()

Buy Now

Questions 40

Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned

DataFrame?

Options:

transactionsDf.resample(0.15, False, 3142)

transactionsDf.sample(0.15, False, 3142)

transactionsDf.sample(0.15)

transactionsDf.sample(0.85, 8429)

transactionsDf.sample(True, 0.15, 8261)

Buy Now

Questions 41

Which of the following describes a narrow transformation?

Options:

narrow transformation is an operation in which data is exchanged across partitions.

A narrow transformation is a process in which data from multiple RDDs is used.

A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.

A narrow transformation is an operation in which data is exchanged across the cluster.

A narrow transformation is an operation in which no data is exchanged across the cluster.

Buy Now

Questions 42

Which of the following statements about lazy evaluation is incorrect?

Options:

Predicate pushdown is a feature resulting from lazy evaluation.

Execution is triggered by transformations.

Spark will fail a job only during execution, but not during definition.

Accumulators do not change the lazy evaluation model of Spark.

Lineages allow Spark to coalesce transformations into stages

Buy Now

Questions 43

Which of the following code blocks returns all unique values of column storeId in DataFrame transactionsDf?

Options:

transactionsDf["storeId"].distinct()

transactionsDf.select("storeId").distinct()

(Correct)

transactionsDf.filter("storeId").distinct()

transactionsDf.select(col("storeId").distinct())

transactionsDf.distinct("storeId")

Buy Now

Questions 44

Which of the following code blocks reads JSON file imports.json into a DataFrame?

Options:

spark.read().mode("json").path("/FileStore/imports.json")

spark.read.format("json").path("/FileStore/imports.json")

spark.read("json", "/FileStore/imports.json")

spark.read.json("/FileStore/imports.json")

spark.read().json("/FileStore/imports.json")

Buy Now

Questions 45

The code block shown below should return a copy of DataFrame transactionsDf with an added column cos. This column should have the values in column value converted to degrees and having

the cosine of those converted values taken, rounded to two decimals. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__, round(__3__(__4__(__5__)),2))

Options:

1. withColumn

2. col("cos")

3. cos

4. degrees

5. transactionsDf.value

1. withColumnRenamed

2. "cos"

3. cos

4. degrees

5. "transactionsDf.value"

1. withColumn

2. "cos"

3. cos

4. degrees

5. transactionsDf.value

1. withColumn

2. col("cos")

3. cos

4. degrees

5. col("value")

. 1. withColumn

2. "cos"

3. degrees

4. cos

5. col("value")

Buy Now

Questions 46

Which of the following describes tasks?

Options:

A task is a command sent from the driver to the executors in response to a transformation.

Tasks transform jobs into DAGs.

A task is a collection of slots.

A task is a collection of rows.

Tasks get assigned to the executors by the driver.

Buy Now

Questions 47

Which of the following code blocks returns approximately 1000 rows, some of them potentially being duplicates, from the 2000-row DataFrame transactionsDf that only has unique rows?

Options:

transactionsDf.sample(True, 0.5)

transactionsDf.take(1000).distinct()

transactionsDf.sample(False, 0.5)

transactionsDf.take(1000)

transactionsDf.sample(True, 0.5, force=True)

Buy Now

Questions 48

The code block shown below should write DataFrame transactionsDf as a parquet file to path storeDir, using brotli compression and replacing any previously existing file. Choose the answer that

correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__.format("parquet").__2__(__3__).option(__4__, "brotli").__5__(storeDir)

Options:

1. save

2. mode

3. "ignore"

4. "compression"

5. path

1. store

2. with

3. "replacement"

4. "compression"

5. path

1. write

2. mode

3. "overwrite"

4. "compression"

5. save

(Correct)

1. save

2. mode

3. "replace"

4. "compression"

5. path

1. write

2. mode

3. "overwrite"

4. compression

5. parquet

Buy Now

Questions 49

Which of the following statements about executors is correct?

Options:

Executors are launched by the driver.

Executors stop upon application completion by default.

Each node hosts a single executor.

Executors store data in memory only.

An executor can serve multiple applications.

Buy Now

Questions 50

The code block displayed below contains at least one error. The code block should return a DataFrame with only one column, result. That column should include all values in column value from

DataFrame transactionsDf raised to the power of 5, and a null value for rows in which there is no value in column value. Find the error(s).

Code block:

1.from pyspark.sql.functions import udf

2.from pyspark.sql import types as T

4.transactionsDf.createOrReplaceTempView('transactions')

6.def pow_5(x):

7. return x**5

9.spark.udf.register(pow_5, 'power_5_udf', T.LongType())

10.spark.sql('SELECT power_5_udf(value) FROM transactions')

Options:

The pow_5 method is unable to handle empty values in column value and the name of the column in the returned DataFrame is not result.

The returned DataFrame includes multiple columns instead of just one column.

The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and the SparkSession cannot access the transactionsDf

DataFrame.

The pow_5 method is unable to handle empty values in column value, the name of the column in the returned DataFrame is not result, and Spark driver does not call the UDF function

appropriately.

The pow_5 method is unable to handle empty values in column value, the UDF function is not registered properly with the Spark driver, and the name of the column in the returned DataFrame is

not result.

Buy Now

Questions 51

Which of the following code blocks writes DataFrame itemsDf to disk at storage location filePath, making sure to substitute any existing data at that location?

Options:

itemsDf.write.mode("overwrite").parquet(filePath)

itemsDf.write.option("parquet").mode("overwrite").path(filePath)

itemsDf.write(filePath, mode="overwrite")

itemsDf.write.mode("overwrite").path(filePath)

itemsDf.write().parquet(filePath, mode="overwrite")

Buy Now

Questions 52

Which of the following DataFrame operators is never classified as a wide transformation?

Options:

DataFrame.sort()

DataFrame.aggregate()

DataFrame.repartition()

DataFrame.select()

DataFrame.join()

Buy Now

Answer:

Explanation:

Explanation

As a general rule: After having gone through the practice tests you probably have a good feeling for what classifies as a wide and what classifies as a narrow transformation. If you are unsure, feel

free to play around in Spark and display the explanation of the Spark execution plan via DataFrame.[operation, for example sort()].explain(). If repartitioning is involved, it would count as a wide

transformation.

DataFrame.select()

Correct! A wide transformation includes a shuffle, meaning that an input partition maps to one or more output partitions. This is expensive and causes traffic across the cluster. With the select()

operation however, you pass commands to Spark that tell Spark to perform an operation on a specific slice of any partition. For this, Spark does not need to exchange data across partitions, each

partition can be worked on independently. Thus, you do not cause a wide transformation.

DataFrame.repartition()

Incorrect. When you repartition a DataFrame, you redefine partition boundaries. Data will flow across your cluster and end up in different partitions after the repartitioning is completed. This is

known as a shuffle and, in turn, is classified as a wide transformation.

DataFrame.aggregate()

No. When you aggregate, you may compare and summarize data across partitions. In the process, data are exchanged across the cluster, and newly formed output partitions depend on one or more

input partitions. This is a typical characteristic of a shuffle, meaning that the aggregate operation may classify as a wide transformation.

DataFrame.join()

Wrong. Joining multiple DataFrames usually means that large amounts of data are exchanged across the cluster, as new partitions are formed. This is a shuffle and therefore DataFrame.join()

counts as a wide transformation.

DataFrame.sort()

False. When sorting, Spark needs to compare many rows across all partitions to each other. This is an expensive operation, since data is exchanged across the cluster and new partitions are

formed as data is reordered. This process classifies as a shuffle and, as a result, DataFrame.sort() counts as wide transformation.

More info: Understanding Apache Spark Shuffle | Philipp Brunenberg

Questions 53

The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.

A sample of DataFrame itemsDf is below.

Code block:

itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")

Options:

Since itemId is the index, it does not need to be an argument to the select() method.

The alias() method needs to be called after the select() method.

The explode() method expects a Column object rather than a string.

explode() is not a method of DataFrame. explode() should be used inside the select() method instead.

The split() method should be used inside the select() method instead of the explode() method.

Buy Now

Questions 54

Which of the following is the deepest level in Spark's execution hierarchy?

Options:

Job

Task

Executor

Slot

Stage

Buy Now

Exam Code: Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0

Exam Name: Databricks Certified Associate Developer for Apache Spark 3.0 Exam

Last Update: May 18, 2024

Questions: 180

PDF + Testing Engine

$64 ~~$159.99~~

Testing Engine (only)

$48 ~~$119.99~~

PDF (only)

$40 ~~$99.99~~

buy now Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0

Easter Special Sale - Limited Time 60% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: 575363r9

dumpspedia logo

Navigation:

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Sample Questions Answers

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options:

Answer:

Explanation:

Options: