Explanation:
This question offers a wide variety of answers for a seemingly simple task. That variety reflects the many ways a join can be expressed in PySpark, and you also need to understand some SQL syntax to get to the correct answer here.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
statement = """
SELECT * FROM transactionsDf
INNER JOIN itemsDf
ON transactionsDf.productId==itemsDf.itemId
"""
spark.sql(statement).drop("value", "storeId", "attributes")
Correct - this answer uses SQL to perform the inner join and afterwards drops the unwanted columns. This is totally fine. If you are unfamiliar with triple quotes (""") in Python: they let you write strings that span multiple lines.
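For comparison, here is a minimal sketch of the same result expressed with the DataFrame API instead of SQL, assuming transactionsDf and itemsDf are the DataFrames from the question (transactionsDf containing productId, value, and storeId; itemsDf containing itemId and attributes):
from pyspark.sql.functions import col
# Inner join on the product/item key, then drop the unwanted columns from
# the joined result - the DataFrame-API counterpart of the SQL answer.
joined = transactionsDf \
    .join(itemsDf, col('productId') == col('itemId'), 'inner') \
    .drop('value', 'storeId', 'attributes')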
transactionsDf \
    .drop(col('value'), col('storeId')) \
    .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))
No, this answer option is a trap: DataFrame.drop() does not accept multiple Column objects as arguments, only column names passed as strings. You could use transactionsDf.drop('value', 'storeId') instead.
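As a sketch, a corrected variant of this option that passes column names as strings to drop() (the join condition itself may keep using Column objects):
from pyspark.sql.functions import col
# drop() with plain column-name strings works; the Column-based join
# condition is unchanged.
transactionsDf \
    .drop('value', 'storeId') \
    .join(itemsDf.drop('attributes'), col('productId') == col('itemId'))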
transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")
Incorrect - Spark does not evaluate "transactionsDf.productId==itemsDf.itemId" as a join expression when it is passed as a string: a string argument to join() is interpreted as the name of a single column to join on, not as an expression to be parsed. This would work if the condition were passed as a Column expression instead of a string.
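To illustrate the difference, a short sketch (assuming the DataFrames from the question):
# Fails: the string is looked up as a column name, and no column is
# literally named "transactionsDf.productId==itemsDf.itemId".
# transactionsDf.join(itemsDf, "transactionsDf.productId==itemsDf.itemId")
# Works: the Column expression is evaluated as the join condition.
transactionsDf.join(itemsDf, transactionsDf.productId == itemsDf.itemId)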
transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)
Wrong, this statement incorrectly uses itemsDf.select instead of itemsDf.drop, so only the attributes column of itemsDf would be carried into the join instead of being removed.
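A quick sketch of why this matters: select('attributes') keeps only the attributes column, whereas drop('attributes') keeps every column except attributes.
# Keeps only 'attributes', so itemId is no longer available and the join
# output is not what the question asks for.
itemsDf.select('attributes').columns    # ['attributes']
# Keeps itemId and the remaining columns, removing only 'attributes'.
itemsDf.drop('attributes').columns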
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")
No, here the SQL expression syntax is incorrect. Prefixing a column with a minus sign, as in -columnName, does not drop the column; Spark interprets it as arithmetic negation of that column.
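As a hedged sketch of what would work instead (the columns listed below are illustrative, not the question's exact expected schema): either name the columns you want to keep explicitly, or select everything and drop the unwanted columns afterwards, as the correct answer does.
# Name the wanted columns explicitly ...
spark.sql("SELECT transactionsDf.productId, itemsDf.itemId FROM transactionsDf INNER JOIN itemsDf ON productId == itemId")
# ... or select everything and drop the unwanted columns afterwards.
spark.sql("SELECT * FROM transactionsDf INNER JOIN itemsDf ON productId == itemId") \
    .drop("value", "storeId", "attributes")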
More info: pyspark.sql.DataFrame.join — PySpark 3.1.2 documentation