Problem Scenario 93: You have to run your Spark application locally with 8 threads, i.e., locally on 8 cores. Replace XXX with the correct value.
spark-submit --class com.hadoopexam.MyTask XXX \
--deploy-mode cluster $SPARK_HOME/lib/hadoopexam.jar 10
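The standard way to request a local master with 8 worker threads is --master local[8], so one plausible completion is the following (note that --deploy-mode cluster is incompatible with a local master, so in practice that flag would have to be dropped, as shown here):
spark-submit --class com.hadoopexam.MyTask \
--master local[8] \
$SPARK_HOME/lib/hadoopexam.jar 10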
Problem Scenario 48: You have been given the Python code snippet below, with intermediate output.
We want to take a list of records about people, then sum their ages and count them.
For this example, each element in the RDD will be a dictionary of the form {'name': NAME, 'age': AGE, 'gender': GENDER}.
The result will be a tuple of the form (Sum of Ages, Count).
people = []
people.append({'name':'Amit', 'age':45,'gender':'M'})
people.append({'name':'Ganga', 'age':43,'gender':'F'})
people.append({'name':'John', 'age':28,'gender':'M'})
people.append({'name':'Lolita', 'age':33,'gender':'F'})
people.append({'name':'Dont Know', 'age':18,'gender':'T'})
peopleRdd = sc.parallelize(people)  # Create an RDD
peopleRdd.aggregate((0,0), seqOp, combOp)  # Output of the line above: (167, 5)
Now define two operations, seqOp and combOp, such that:
seqOp: sum the ages of all people, as well as count them, within each partition.
combOp: combine the results from all partitions.
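One pair of operators that produces (167, 5) for this data (a minimal Python sketch matching the snippet above):
seqOp = (lambda acc, person: (acc[0] + person['age'], acc[1] + 1))  # fold one record into the running (sum, count)
combOp = (lambda a, b: (a[0] + b[0], a[1] + b[1]))  # merge the per-partition (sum, count) pairs
peopleRdd.aggregate((0, 0), seqOp, combOp)  # (167, 5): 45+43+28+33+18 = 167 over 5 records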
Problem Scenario 41: You have been given the code snippet below.
val au1 = sc.parallelize(List(("a", Array(1, 2)), ("b", Array(1, 2))))
val au2 = sc.parallelize(List(("a", Array(3)), ("b", Array(2))))
Apply the Spark method that will generate the output below.
Array[(String, Array[Int])] = Array((a,Array(1, 2)), (b,Array(1, 2)), (a,Array(3)), (b,Array(2)))
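The output keeps all four pairs, duplicate keys included, so the method being asked for is union, which simply concatenates two RDDs without merging keys. A minimal sketch:
au1.union(au2).collect()
// Array((a,Array(1, 2)), (b,Array(1, 2)), (a,Array(3)), (b,Array(2)))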
Problem Scenario 14: You have been given the following MySQL database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.
1. Create a CSV file named updated_departments.csv in the local file system with the following contents.
updated_departments.csv
2,fitness
3,footwear
12,fathematics
13,fcience
14,engineering
1000,management
2. Upload this CSV file to the HDFS filesystem.
3. Now export this data from HDFS to the MySQL retail_db.departments table. During the export, make sure existing departments are just updated and new departments are inserted (see the command sketch after this list).
4. Now update the updated_departments.csv file with the content below.
2,Fitness
3,Footwear
12,Fathematics
13,Science
14,Engineering
1000,Management
2000,Quality Check
5. Now upload this file to HDFS.
6. Now export this data from HDFS to the MySQL retail_db.departments table. During the export, make sure existing departments are just updated and no new departments are inserted.
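A minimal sketch of the commands for steps 2 through 6, assuming the standard retail_db schema in which departments has the columns department_id and department_name, and assuming /user/cloudera as the HDFS target directory (both are assumptions, not given in the problem):
# Steps 2 and 5: upload the local file to HDFS (-f overwrites any earlier copy)
hdfs dfs -put -f updated_departments.csv /user/cloudera/updated_departments.csv
# Step 3: upsert -- update rows matching on department_id, insert the rest
sqoop export \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba --password cloudera \
--table departments \
--export-dir /user/cloudera/updated_departments.csv \
--update-key department_id \
--update-mode allowinsert \
--input-fields-terminated-by ','
Step 6 uses the same sqoop export command but with --update-mode updateonly, which updates matching rows and skips rows (such as 2000,Quality Check) that do not already exist in the table.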
Problem Scenario 33: You have been given the files below.
spark5/EmployeeName.csv (id,name)
spark5/EmployeeSalary.csv (id,salary)
Data is given below:
EmployeeName.csv
E01,Lokesh
E02,Bhupesh
E03,Amit
E04,Ratan
E05,Dinesh
E06,Pavan
E07,Tejas
E08,Sheela
E09,Kumar
E10,Venkat
EmployeeSalary.csv
E01,50000
E02,50000
E03,45000
E04,45000
E05,50000
E06,45000
E07,50000
E08,10000
E09,10000
E10,10000
Now write Spark code in Scala that loads these two files from HDFS, joins them, and produces (name, salary) values.
Then save the data in multiple files grouped by salary (meaning each file will contain the names of employees with the same salary). Make sure the file name includes the salary as well. One possible approach is sketched below.
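A minimal Scala sketch of one approach, assuming the HDFS paths spark5/EmployeeName.csv and spark5/EmployeeSalary.csv given above and an illustrative output directory spark5/output; writing one directory per key via collect() is only one of several options and suits small key sets like this one:
val names = sc.textFile("spark5/EmployeeName.csv")
  .map(_.split(","))
  .map(e => (e(0), e(1)))                  // (id, name)
val salaries = sc.textFile("spark5/EmployeeSalary.csv")
  .map(_.split(","))
  .map(e => (e(0), e(1)))                  // (id, salary)

// Join on id, reshape to (salary, name), and group the names by salary
val bySalary = names.join(salaries)        // (id, (name, salary))
  .map { case (_, (name, salary)) => (salary, name) }
  .groupByKey()

// Write one output directory per salary, embedding the salary in its name
bySalary.collect().foreach { case (salary, emps) =>
  sc.parallelize(emps.toSeq).saveAsTextFile(s"spark5/output/salary_$salary")
}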