Databricks-Certified-Data-Engineer-Associate Exam Dumps - Databricks Certified Data Engineer Associate Exam


Question # 17

Which of the following describes the relationship between Bronze tables and raw data?

Bronze tables contain less data than raw data files.

Bronze tables contain more truthful data than raw data.

Bronze tables contain aggregates while raw data is unaggregated.

Bronze tables contain a less refined view of data than raw data.

Bronze tables contain raw data with a schema applied.


Question # 18

Which of the following describes a scenario in which a data engineer will want to use a single-node cluster?

When they are working interactively with a small amount of data

When they are running automated reports to be refreshed as quickly as possible

When they are working with SQL within Databricks SQL

When they are concerned about the ability to automatically scale with larger data

When they are manually running reports with a large amount of data


Answer: When they are working interactively with a small amount of data

Explanation:

A data engineer will want to use a single-node cluster when working interactively with a small amount of data. A single-node cluster consists of an Apache Spark driver and no Spark workers [1]. It supports Spark jobs and all Spark data sources, including Delta Lake [1]. It is helpful for single-node machine learning workloads that use Spark to load and save data, and for lightweight exploratory data analysis [1]. It runs Spark locally, spawns one executor thread per logical core in the cluster, and saves all log output in the driver log [1]. A single-node cluster can be created by selecting the Single Node button when configuring a cluster [1].
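As a rough illustration of "a driver and no workers", the sketch below writes a single-node cluster specification as a Python dictionary. The field values are assumptions based on the single-node settings described in the Databricks cluster documentation. The cluster name is hypothetical, the runtime version and node type are placeholders, and the helper `is_single_node` is invented for this example. Check the exact keys against the current documentation before relying on them.

```python
# A minimal sketch of a single-node cluster spec (assumed field values).
single_node_cluster = {
    "cluster_name": "example-single-node",  # hypothetical name
    "spark_version": "<runtime-version>",   # placeholder, workspace-specific
    "node_type_id": "<node-type>",          # placeholder, cloud-specific
    "num_workers": 0,                       # driver only, no Spark workers
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",         # run Spark locally on the driver
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}


def is_single_node(spec):
    """Return True if a cluster spec has no workers and the single-node profile."""
    profile = spec.get("spark_conf", {}).get("spark.databricks.cluster.profile")
    return spec.get("num_workers") == 0 and profile == "singleNode"


print(is_single_node(single_node_cluster))
```

A spec with `num_workers` greater than zero, or one with an autoscale range, would describe the multi-node case instead.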

The other options do not call for a single-node cluster. When running automated reports that must refresh as quickly as possible, a data engineer will want a multi-node cluster that can scale up and down automatically with workload demand [2]. When working with SQL within Databricks SQL, a data engineer will want a SQL Endpoint that can execute SQL queries on a serverless pool or an existing cluster [3]. When concerned about the ability to scale automatically with larger data, a data engineer will want a multi-node cluster that can leverage the Databricks Lakehouse Platform and the Delta Engine to handle large-scale data processing efficiently and reliably [4]. When manually running reports with a large amount of data, a data engineer will want a multi-node cluster that distributes the computation across multiple workers, and can use the Spark UI to monitor performance and troubleshoot issues.

References:

1: Single Node clusters | Databricks on AWS
2: Autoscaling | Databricks on AWS
3: SQL Endpoints | Databricks on AWS
4: Databricks Lakehouse Platform | Databricks on AWS
Spark UI | Databricks on AWS

Question # 19

A single Job runs two notebooks as two separate tasks. A data engineer has noticed that one of the notebooks is running slowly in the Job's current run. The data engineer asks a tech lead for help in identifying why this might be the case.

Which of the following approaches can the tech lead use to identify why the notebook is running slowly as part of the Job?

They can navigate to the Runs tab in the Jobs UI to immediately review the processing notebook.

They can navigate to the Tasks tab in the Jobs UI and click on the active run to review the processing notebook.

They can navigate to the Runs tab in the Jobs UI and click on the active run to review the processing notebook.

There is no way to determine why a Job task is running slowly.

They can navigate to the Tasks tab in the Jobs UI to immediately review the processing notebook.


Question # 20

A new data engineering team has been assigned to work on a project. The team will need access to the database customers in order to see what tables already exist. The team has its own group, team.

Which of the following commands can be used to grant the necessary permission on the entire database to the new team?

GRANT VIEW ON CATALOG customers TO team;

GRANT CREATE ON DATABASE customers TO team;

GRANT USAGE ON CATALOG team TO customers;

GRANT CREATE ON DATABASE team TO customers;

GRANT USAGE ON DATABASE customers TO team;


Question # 21

A data engineer has a single-task Job that runs each morning before they begin working. After identifying an upstream data issue, they need to set up another task to run a new notebook prior to the original task.

Which of the following approaches can the data engineer use to set up the new task?

They can clone the existing task in the existing Job and update it to run the new notebook.

They can create a new task in the existing Job and then add it as a dependency of the original task.

They can create a new task in the existing Job and then add the original task as a dependency of the new task.

They can create a new job from scratch and add both tasks to run concurrently.

They can clone the existing task to a new Job and then edit it to run the new notebook.
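For readers who think in terms of job definitions rather than the UI, the scenario can be sketched as a multi-task job settings dictionary in which one task lists the other under `depends_on`. This is only an illustration. The job name, task keys, and notebook paths are hypothetical, and `run_order` is a helper invented here to show the resulting execution order. It is not part of any Databricks library.

```python
from graphlib import TopologicalSorter

# Hypothetical job settings: names and paths are for illustration only.
job_settings = {
    "name": "morning_job",
    "tasks": [
        {
            # The new task that must run first.
            "task_key": "fix_upstream_data",
            "notebook_task": {"notebook_path": "/Repos/example/fix_upstream_data"},
        },
        {
            # The original task, now waiting on the new task.
            "task_key": "original_task",
            "depends_on": [{"task_key": "fix_upstream_data"}],
            "notebook_task": {"notebook_path": "/Repos/example/original_notebook"},
        },
    ],
}


def run_order(settings):
    """Return task keys in an order that respects every depends_on entry."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in settings["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


print(run_order(job_settings))  # ['fix_upstream_data', 'original_task']
```

A task runs only after every task it depends on has finished, so the task that must run first is the one that appears in the other task's `depends_on` list.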


Question # 22

A data engineer has three tables in a Delta Live Tables (DLT) pipeline. They have configured the pipeline to drop invalid records at each table. They notice that some data is being dropped due to quality concerns at some point in the DLT pipeline. They would like to determine at which table in their pipeline the data is being dropped.

Which of the following approaches can the data engineer take to identify the table that is dropping the records?

They can set up separate expectations for each table when developing their DLT pipeline.

They cannot determine which table is dropping the records.

They can set up DLT to notify them via email when records are dropped.

They can navigate to the DLT pipeline page, click on each table, and view the data quality statistics.

They can navigate to the DLT pipeline page, click on the "Error" button, and review the present errors.


Question # 23

Which of the following is a benefit of the Databricks Lakehouse Platform embracing open source technologies?

Cloud-specific integrations

Simplified governance

Ability to scale storage

Ability to scale workloads

Avoiding vendor lock-in


Question # 24

A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table.

The code block used by the data engineer is below:

If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?

trigger("5 seconds")

trigger()

trigger(once="5 seconds")

trigger(processingTime="5 seconds")

trigger(continuous="5 seconds")
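As background on how a fixed-interval trigger fits into a PySpark streaming write, here is a minimal sketch. It is not the code block from the question. The function name, table names, and checkpoint path are hypothetical, and it assumes a Spark version where `DataStreamReader.table` and `DataStreamWriter.toTable` are available (3.1 or later).

```python
def start_five_second_stream(spark, source_table, target_table, checkpoint_path):
    """Start a streaming copy that runs one micro-batch every 5 seconds.

    `spark` is expected to be an active SparkSession; all other
    arguments are illustrative names supplied by the caller.
    """
    return (
        spark.readStream.table(source_table)            # streaming read from a table
        .writeStream
        .trigger(processingTime="5 seconds")            # fixed-interval micro-batches
        .option("checkpointLocation", checkpoint_path)  # progress tracking for restarts
        .outputMode("append")
        .toTable(target_table)                          # starts the streaming query
    )
```

In a notebook this might be called as `start_five_second_stream(spark, "sales", "new_sales", "/tmp/checkpoints/new_sales")`. The `trigger` method takes its setting as a keyword argument: `processingTime` expects an interval string, while `once` expects the boolean `True` rather than an interval.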
