
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Exam Dumps - Databricks Certified Associate Developer for Apache Spark 3.0 Exam

Question # 17

The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__.__3__(__4__))

A. 1. select  2. col("storeId")  3. cast  4. StringType

B. 1. select  2. col("storeId")  3. as  4. StringType

C. 1. cast  2. "storeId"  3. as  4. StringType()

D. 1. select  2. col("storeId")  3. cast  4. StringType()

E. 1. select  2. storeId  3. cast  4. StringType()
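For reference, a minimal sketch of the cast pattern this question tests, assuming a DataFrame transactionsDf with a storeId column already exists:

from pyspark.sql.functions import col
from pyspark.sql.types import StringType

# Column.cast() accepts a DataType instance such as StringType()
converted = transactionsDf.select(col("storeId").cast(StringType()))
# Equivalent shorthand: cast() also accepts a type name as a string
converted = transactionsDf.select(col("storeId").cast("string"))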

Question # 18

Which of the following code blocks returns a 2-column DataFrame that shows the distinct values in column productId and the number of rows with that productId in DataFrame transactionsDf?

A. transactionsDf.count("productId").distinct()

B. transactionsDf.groupBy("productId").agg(col("value").count())

C. transactionsDf.count("productId")

D. transactionsDf.groupBy("productId").count()

E. transactionsDf.groupBy("productId").select(count("value"))

Question # 19

Which of the following describes characteristics of the Spark UI?

A. Via the Spark UI, workloads can be manually distributed across executors.

B. Via the Spark UI, stage execution speed can be modified.

C. The Scheduler tab shows how jobs that are run in parallel by multiple users are distributed across the cluster.

D. There is a place in the Spark UI that shows the property spark.executor.memory.

E. Some of the tabs in the Spark UI are named Jobs, Stages, Storage, DAGs, Executors, and SQL.
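As an aside, the same configuration properties the UI displays can be read programmatically; a sketch, assuming an active SparkSession named spark (the property is only present if it was explicitly configured, hence the default):

# Returns the configured executor memory, or the fallback if unset
spark.conf.get("spark.executor.memory", "not set")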

Question # 20

Which of the following describes Spark's standalone deployment mode?

A. Standalone mode uses a single JVM to run Spark driver and executor processes.

B. Standalone mode means that the cluster does not contain the driver.

C. Standalone mode is how Spark runs on YARN and Mesos clusters.

D. Standalone mode uses only a single executor per worker per application.

E. Standalone mode is a viable solution for clusters that run multiple frameworks, not only Spark.
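For context, connecting to a standalone cluster from PySpark looks like this sketch; the master host name is a placeholder, and 7077 is the conventional standalone master port:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master-host:7077")  # hypothetical standalone master URL
         .appName("standalone-demo")
         .getOrCreate())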

Question # 21

Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?

A. transactionsDf.select("storeId").dropDuplicates().count()

B. transactionsDf.select(count("storeId")).dropDuplicates()

C. transactionsDf.select(distinct("storeId")).count()

D. transactionsDf.dropDuplicates().agg(count("storeId"))

E. transactionsDf.distinct().select("storeId").count()
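For reference, two equivalent sketches for counting unique values, assuming transactionsDf has a storeId column:

from pyspark.sql.functions import countDistinct

# Deduplicate the single column, then count the remaining rows
n = transactionsDf.select("storeId").distinct().count()
# Or express it as an aggregate and pull out the scalar
n = transactionsDf.select(countDistinct("storeId")).first()[0]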

Question # 22

The code block shown below should return a DataFrame with two columns, itemId and col. In this DataFrame, for each element in column attributes of DataFrame itemsDf there should be a separate row in which column itemId contains the associated itemId from DataFrame itemsDf. The new DataFrame should only contain rows derived from rows of DataFrame itemsDf whose attributes column contains the element cozy.

(The original question includes a sample of DataFrame itemsDf here; it is not reproduced.)

Code block:

itemsDf.__1__(__2__).__3__(__4__, __5__(__6__))

A. 1. filter  2. array_contains("cozy")  3. select  4. "itemId"  5. explode  6. "attributes"

B. 1. where  2. "array_contains(attributes, 'cozy')"  3. select  4. itemId  5. explode  6. attributes

C. 1. filter  2. "array_contains(attributes, 'cozy')"  3. select  4. "itemId"  5. map  6. "attributes"

D. 1. filter  2. "array_contains(attributes, cozy)"  3. select  4. "itemId"  5. explode  6. "attributes"

E. 1. filter  2. "array_contains(attributes, 'cozy')"  3. select  4. "itemId"  5. explode  6. "attributes"

Question # 23

Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively, excluding columns value and storeId from DataFrame transactionsDf and column attributes from DataFrame itemsDf?

A. transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)

B.
1.transactionsDf.createOrReplaceTempView('transactionsDf')
2.itemsDf.createOrReplaceTempView('itemsDf')
3.
4.spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")

C. transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")

D.
1.transactionsDf \
2. .drop(col('value'), col('storeId')) \
3. .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))

E.
1.transactionsDf.createOrReplaceTempView('transactionsDf')
2.itemsDf.createOrReplaceTempView('itemsDf')
3.
4.statement = """
5.SELECT * FROM transactionsDf
6.INNER JOIN itemsDf
7.ON transactionsDf.productId==itemsDf.itemId
8."""
9.spark.sql(statement).drop("value", "storeId", "attributes")
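For reference, the same join written directly with the DataFrame API; a sketch assuming the two DataFrames and column names from the question:

# Inner join on the equality condition, then drop the unwanted columns
joined = (transactionsDf
          .join(itemsDf, transactionsDf.productId == itemsDf.itemId, "inner")
          .drop("value", "storeId", "attributes"))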

Question # 24

The code block displayed below contains an error. The code block should use the Python method find_most_freq_letter to find the letter that occurs most frequently in column itemName of DataFrame itemsDf and return it in a new column most_frequent_letter. Find the error.

Code block:

1. find_most_freq_letter_udf = udf(find_most_freq_letter)

2. itemsDf.withColumn("most_frequent_letter", find_most_freq_letter("itemName"))

A. Spark is not using the UDF method correctly.

B. The UDF method is not registered correctly, since the return type is missing.

C. The "itemName" expression should be wrapped in col().

D. UDFs do not exist in PySpark.

E. Spark is not adding a column.
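For reference, a corrected sketch of the intended UDF usage, assuming find_most_freq_letter is a plain Python function that returns a string; note that line 2 of the question's code block calls the raw function instead of the registered UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Register the UDF (StringType is explicit here; it is also udf's default return type)
find_most_freq_letter_udf = udf(find_most_freq_letter, StringType())
# Apply the registered UDF, not the raw Python function
itemsDf = itemsDf.withColumn("most_frequent_letter", find_most_freq_letter_udf("itemName"))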
