
CCA175 Sample Questions Answers

Questions 4

Problem Scenario 76 : You have been given a MySQL DB with the following details.

user=retail_dba

password=cloudera

database=retail_db

table=retail_db.orders

table=retail_db.order_items

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Columns of the orders table : (order_id, order_date, order_customer_id, order_status)

.....

Please accomplish the following activities.

1. Copy "retail_db.orders" table to hdfs in a directory p91_orders.

2. Once data is copied to hdfs, using pyspark calculate the number of orders for each status.

3. Use all of the following methods to calculate the number of orders for each status (you need to know all of these functions and their behavior for the real exam); a sketch follows the list.

- countByKey()

- groupByKey()

- reduceByKey()

- aggregateByKey()

- combineByKey()
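
A minimal sketch of the five aggregations, assuming the orders were copied to a p91_orders directory with sqoop defaults (comma-delimited, order_status in the fourth column). The scenario asks for pyspark; the method names below are identical on pyspark RDDs, and Scala is used here only for consistency with the other snippets in this set.

val orders = sc.textFile("p91_orders")
val statusPairs = orders.map(line => (line.split(",")(3), 1))
// 1. countByKey() returns a local Map[status, count]
val byCount = statusPairs.countByKey()
// 2. groupByKey() gathers all the 1s per status; sum them afterwards
val byGroup = statusPairs.groupByKey().map { case (status, ones) => (status, ones.sum) }
// 3. reduceByKey() merges counts per status, combining within each partition first
val byReduce = statusPairs.reduceByKey(_ + _)
// 4. aggregateByKey(zeroValue)(seqOp, combOp)
val byAggregate = statusPairs.aggregateByKey(0)((acc, v) => acc + v, (a, b) => a + b)
// 5. combineByKey(createCombiner, mergeValue, mergeCombiners)
val byCombine = statusPairs.combineByKey((v: Int) => v, (acc: Int, v: Int) => acc + v, (a: Int, b: Int) => a + b)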

Questions 5

Problem Scenario 86 : In continuation of the previous question, please accomplish the following activities (a sketch follows the list).

1. Select the maximum, minimum, average, standard deviation, and total quantity.

2. Select the minimum and maximum price for each product code.

3. Select the maximum, minimum, average, standard deviation, and total quantity for each product code; however, make sure the average and standard deviation have at most two decimal places.

4. Select all the product codes and the average price only where the product count is more than or equal to 3.

5. Select the maximum, minimum, average and total of all the products for each code. Also produce the same across all the products.
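
A hedged sketch, assuming the product data from the earlier scenario is registered as a Hive table named products with columns (productid, code, name, quantity, price); item 1's statistics are taken over price, with sum(quantity) as the total quantity.

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// 1. Overall max, min, avg, stddev of price and total quantity
sqlContext.sql("SELECT max(price), min(price), avg(price), stddev(price), sum(quantity) FROM products").show()
// 2. Min and max price per product code
sqlContext.sql("SELECT code, min(price), max(price) FROM products GROUP BY code").show()
// 3. Per-code statistics with avg and stddev rounded to two decimals
sqlContext.sql("SELECT code, max(price), min(price), round(avg(price), 2), round(stddev(price), 2), sum(quantity) FROM products GROUP BY code").show()
// 4. Average price only for codes with at least 3 products
sqlContext.sql("SELECT code, avg(price) FROM products GROUP BY code HAVING count(*) >= 3").show()
// 5. Per-code figures plus an overall row can be produced with GROUP BY code WITH ROLLUP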

Questions 6

Problem Scenario 72 : You have been given a table named "employee2" with the following detail.

first_name string

last_name string

Write a spark script in python which reads this table and prints all the rows and individual column values.

Questions 7

Problem Scenario 7 : You have been given the following MySQL database details as well as other info.

user=retail_dba

password=cloudera

database=retail_db

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish the following (a sqoop sketch follows the list).

1. Import the departments table using your custom boundary query, which imports departments between 1 and 25.

2. Also make sure the table's data is split into 2 files, e.g. part-00000, part-00001.

3. Also make sure you have imported only two columns from the table: department_id and department_name.
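
A minimal sqoop sketch under the stated connection details; the target directory name is an assumption, and the boundary query simply fixes the split range to 1..25 on the primary key.

sqoop import \
  --connect jdbc:mysql://quickstart:3306/retail_db \
  --username retail_dba \
  --password cloudera \
  --table departments \
  --columns department_id,department_name \
  --boundary-query "SELECT 1, 25" \
  --num-mappers 2 \
  --target-dir /user/cloudera/departments_p7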

Questions 8

Problem Scenario 68 : You have been given a file as below.

spark75/file1.txt

The file contains some text, as given below:

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework

The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.

This approach takes advantage of data locality, where nodes manipulate the data they have access to, to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.

For a slightly more complicated task, let's look into splitting up sentences from our documents into word bigrams. A bigram is a pair of successive tokens in some sequence. We will look at building bigrams from the sequences of words in each sentence, and then try to find the most frequently occurring ones.

The first problem is that values in each partition of our initial RDD describe lines from the file rather than sentences. Sentences may be split over multiple lines. The glom() RDD method is used to create a single entry for each document containing the list of all lines; we can then join the lines up and re-split them into sentences using "." as the separator, using flatMap so that every object in our RDD is now a sentence.

A bigram is a pair of successive tokens in some sequence. Please build bigrams from the sequences of words in each sentence, and then try to find the most frequently occurring ones. A sketch follows.
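
A hedged sketch of the bigram count described above (the file path is taken from the scenario; the simple whitespace/"." splitting is an assumption).

val lines = sc.textFile("spark75/file1.txt")
// glom() gives one array of lines per partition; join them and re-split on "."
val sentences = lines.glom().map(_.mkString(" ")).flatMap(_.split("\\."))
// Build (bigram, 1) pairs from the words of each sentence, then count them
val bigrams = sentences
  .map(_.trim.split("\\s+").filter(_.nonEmpty))
  .flatMap(words => words.sliding(2).filter(_.length == 2).map(pair => ((pair(0), pair(1)), 1)))
val counts = bigrams.reduceByKey(_ + _)
// Most frequent bigrams first
counts.map { case (bigram, n) => (n, bigram) }.sortByKey(false).take(10)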

Questions 9

Problem Scenario 4: You have been given a MySQL DB with the following details.

user=retail_dba

password=cloudera

database=retail_db

table=retail_db.categories

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish the following activities.

Import the single table categories (subset of data) into a Hive managed table, where category_id is between 1 and 22.

Questions 10

Problem Scenario 39 : You have been given two files

spark16/file1.txt

1,9,5

2,7,4

3,8,3

spark16/file2.txt

1,g,h

2,i,j

3,k,l

Load these two files as Spark RDDs and join them to produce the below results.

(1,((9,5),(g,h)))

(2,((7,4),(i,j)))

(3,((8,3),(k,l)))

And write a code snippet which will sum the second column of the above joined results (5+4+3).
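
A minimal sketch under the stated file paths: key each line by its first field, join, then sum the second value of the first file's tuple.

val file1 = sc.textFile("spark16/file1.txt")
val file2 = sc.textFile("spark16/file2.txt")
val rdd1 = file1.map(_.split(",")).map(a => (a(0), (a(1), a(2))))
val rdd2 = file2.map(_.split(",")).map(a => (a(0), (a(1), a(2))))
// Join on the key -> (1,((9,5),(g,h))) etc.
val joined = rdd1.join(rdd2)
joined.collect().foreach(println)
// Sum 5 + 4 + 3 from the first file's second value
val total = joined.map { case (_, ((_, second), _)) => second.toInt }.sum()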

Questions 11

Problem Scenario 40 : You have been given sample data as below in a file called spark15/file1.txt

3070811,1963,1096,,"US","CA",,1,

3022811,1963,1096,,"US","CA",,1,56

3033811,1963,1096,,"US","CA",,1,23

Below is the code snippet to process this file.

val field = sc.textFile("spark15/file1.txt")

val mapper = field.map(x=> A)

mapper.map(x => x.map(x=> {B})).collect

Please fill in A and B so it can generate the below final output (a sketch follows the expected output).

Array(Array(3070811,1963,1096, 0, "US", "CA", 0,1, 0)

,Array(3022811,1963,1096, 0, "US", "CA", 0,1, 56)

,Array(3033811,1963,1096, 0, "US", "CA", 0,1, 23)

)
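
A hedged sketch of one way to fill in A and B: split each line while keeping empty trailing fields, then substitute 0 for empty values.

// A = x.split(",", -1)   (the -1 limit keeps empty fields)
// B = if (x.isEmpty) 0 else x
val field = sc.textFile("spark15/file1.txt")
val mapper = field.map(x => x.split(",", -1))
mapper.map(x => x.map(x => if (x.isEmpty) 0 else x)).collect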

Questions 12

Problem Scenario 79 : You have been given a MySQL DB with the following details.

user=retail_dba

password=cloudera

database=retail_db

table=retail_db.orders

table=retail_db.order_items

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Columns of the products table : (product_id | product_category_id | product_name | product_description | product_price | product_image)

Please accomplish the following activities.

1. Copy "retaildb.products" table to hdfs in a directory p93_products

2. Filter out all the empty prices

3. Sort all the products based on price in both ascending as well as descending order.

4. Sort all the products based on price as well as product_id in descending order.

5. Use the below functions to do data ordering or ranking and fetch the top 10 elements: top(), takeOrdered(), sortByKey(). (A sketch follows.)
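
A hedged sketch, assuming the products table was already copied to p93_products with sqoop defaults (comma-delimited, product_price as the fifth field).

val products = sc.textFile("p93_products")
// 2. Filter out records with an empty price
val nonEmpty = products.map(_.split(",", -1)).filter(f => f(4).nonEmpty)
// Key by price for ordering
val byPrice = nonEmpty.map(f => (f(4).toFloat, f.mkString(",")))
// 3. Ascending and descending by price
val ascending = byPrice.sortByKey(true)
val descending = byPrice.sortByKey(false)
// 4. Descending by price, then by product_id
val byPriceAndId = nonEmpty.map(f => ((f(4).toFloat, f(0).toInt), f.mkString(","))).sortByKey(false)
// 5. Top 10 elements with top(), takeOrdered() and sortByKey()
val top10 = byPrice.top(10)            // natural (descending) ordering on the key
val lowest10 = byPrice.takeOrdered(10) // ascending ordering
val top10BySort = byPrice.sortByKey(false).take(10)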

Questions 13

Problem Scenario 38 : You have been given an RDD as below,

val rdd: RDD[Array[Byte]]

Now you have to save this RDD as a SequenceFile. And below is the code snippet.

import org.apache.hadoop.io.compress.GzipCodec

rdd.map(bytesArray => (A.get(), new B(bytesArray))).saveAsSequenceFile("/output/path", classOf[GzipCodec])

What would be the correct replacement for A and B in the above snippet?
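
A hedged sketch: the usual pairing for byte-array records is NullWritable as the key and BytesWritable as the value; note that saveAsSequenceFile takes the codec as an Option.

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.io.compress.GzipCodec
// A = NullWritable (no meaningful key), B = BytesWritable (wraps the byte array);
// rdd is the RDD[Array[Byte]] given in the scenario
rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
  .saveAsSequenceFile("/output/path", Some(classOf[GzipCodec]))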

Questions 14

Problem Scenario 48 : You have been given the below Python code snippet, with intermediate output.

We want to take a list of records about people and then we want to sum up their ages and count them.

So for this example the type in the RDD will be a Dictionary in the format of {name: NAME, age:AGE, gender:GENDER}.

The result type will be a tuple that looks like so (Sum of Ages, Count)

people = []

people.append({'name':'Amit', 'age':45,'gender':'M'})

people.append({'name':'Ganga', 'age':43,'gender':'F'})

people.append({'name':'John', 'age':28,'gender':'M'})

people.append({'name':'Lolita', 'age':33,'gender':'F'})

people.append({'name':'Dont Know', 'age':18,'gender':'T'})

peopleRdd = sc.parallelize(people)  # Create an RDD

peopleRdd.aggregate((0,0), seqOp, combOp)  # Output of the above line: (167, 5)

Now define two operations, seqOp and combOp, such that:

seqOp : sums the ages of all the people and counts them, within each partition.

combOp : combines the results from all partitions.

Questions 15

Problem Scenario 34 : You have been given a file named spark6/user.csv.

Data is given below:

user.csv

id,topic,hits

Rahul,scala,120

Nikita,spark,80

Mithun,spark,1

myself,cca175,180

Now write Spark code in Scala which will remove the header part and create an RDD of values as below, for all rows. Also, if the id is "myself", then filter out that row. (A sketch follows the expected output.)

Map(id -> Rahul, topic -> scala, hits -> 120)
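
A minimal sketch under the stated file path: drop the header, filter the "myself" row, and zip the header's column names with each row's fields.

val users = sc.textFile("spark6/user.csv")
// Capture the header so we can both drop it and reuse its column names
val header = users.first()
val columns = header.split(",")   // Array(id, topic, hits)
val result = users
  .filter(_ != header)                       // remove the header row
  .map(_.split(","))
  .filter(fields => fields(0) != "myself")   // drop the row whose id is "myself"
  .map(fields => columns.zip(fields).toMap)  // Map(id -> ..., topic -> ..., hits -> ...)
result.collect().foreach(println)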

Questions 16

Problem Scenario 62 : You have been given the below code snippet.

val a = sc.parallelize(List("dogM, "tiger", "lion", "cat", "panther", "eagle"), 2)

val b = a.map(x => (x.length, x))

operation1

Write a correct code snippet for operation1 which will produce the desired output, shown below (a sketch follows).

Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))
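
A hedged sketch of one operation1 that yields this output: wrap each value in "x"..."x" with mapValues, keeping the length key unchanged.

b.mapValues("x" + _ + "x").collect
// Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))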

Questions 17

Problem Scenario 36 : You have been given a file named spark8/data.csv (type,name).

data.csv

1,Lokesh

2,Bhupesh

2,Amit

2,Ratan

2,Dinesh

1,Pavan

1,Tejas

2,Sheela

1,Kumar

1,Venkat

1. Load this file from hdfs and save it back as (id, (all names of same type)) in results directory. However, make sure while saving it should be

Questions 18

Problem Scenario 94 : You have to run your Spark application on YARN with 20 GB per executor and 50 executors. Please replace XXX, YYY, ZZZ in the snippet below (a filled-in sketch follows the snippet).

export HADOOP_CONF_DIR=XXX

./bin/spark-submit \

--class com.hadoopexam.MyTask \

XXX \

--deploy-mode cluster \  # can be client for client mode

YYY \

ZZZ \

/path/to/hadoopexam.jar \

1000
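
A hedged fill-in under the stated requirements; the Hadoop configuration path is an assumption (it just needs to point at the cluster's YARN/HDFS config directory).

export HADOOP_CONF_DIR=/etc/hadoop/conf   # assumption: wherever the cluster configs live
./bin/spark-submit \
  --class com.hadoopexam.MyTask \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/hadoopexam.jar \
  1000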

Questions 19

Problem Scenario 16 : You have been given the following MySQL database details as well as other info.

user=retail_dba

password=cloudera

database=retail_db

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Please accomplish the below assignment.

1. Create a table in hive as below.

create table departments_hive(department_id int, department_name string);

2. Now import data from the MySQL table departments to this Hive table. Please make sure that the data is visible using the below Hive command: select * from departments_hive

Questions 20

Problem Scenario 87 : You have been given the below three files.

product.csv (Create this file in hdfs)

productID,productCode,name,quantity,price,supplierid

1001,PEN,Pen Red,5000,1.23,501

1002,PEN,Pen Blue,8000,1.25,501

1003,PEN,Pen Black,2000,1.25,501

1004,PEC,Pencil 2B,10000,0.48,502

1005,PEC,Pencil 2H,8000,0.49,502

1006,PEC,Pencil HB,0,9999.99,502

2001,PEC,Pencil 3B,500,0.52,501

2002,PEC,Pencil 4B,200,0.62,501

2003,PEC,Pencil 5B,100,0.73,501

2004,PEC,Pencil 6B,500,0.47,502

supplier.csv

supplierid,name,phone

501,ABC Traders,88881111

502,XYZ Company,88882222

503,QQ Corp,88883333

products_suppliers.csv

productID,supplierID

2001,501

2002,501

2003,501

2004,502

2001,503

Now accomplish all the queries given in the solution.

Select the product name, its price, and its supplier name where the product price is less than 0.6, using SparkSQL (a sketch follows).
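
A hedged sketch, assuming product.csv and supplier.csv are readable at those paths and are registered as temporary tables named products and suppliers (the table names are assumptions).

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// Load product.csv, drop the header, and register it
val productsRaw = sc.textFile("product.csv")
val productsHeader = productsRaw.first()
val products = productsRaw.filter(_ != productsHeader).map(_.split(","))
  .map(p => (p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat, p(5).toInt))
  .toDF("productID", "productCode", "name", "quantity", "price", "supplierid")
products.registerTempTable("products")
// Load supplier.csv the same way
val suppliersRaw = sc.textFile("supplier.csv")
val suppliersHeader = suppliersRaw.first()
val suppliers = suppliersRaw.filter(_ != suppliersHeader).map(_.split(","))
  .map(s => (s(0).toInt, s(1), s(2)))
  .toDF("supplierid", "name", "phone")
suppliers.registerTempTable("suppliers")
// Products under 0.6 with their supplier's name
sqlContext.sql("SELECT p.name, p.price, s.name AS supplier_name FROM products p JOIN suppliers s ON p.supplierid = s.supplierid WHERE p.price < 0.6").show()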

Questions 21

Problem Scenario 6 : You have been given the following MySQL database details as well as other info.

user=retail_dba

password=cloudera

database=retail_db

jdbc URL = jdbc:mysql://quickstart:3306/retail_db

Compression Codec : org.apache.hadoop.io.compress.SnappyCodec

Please accomplish the following.

1. Import the entire database such that it can be used as Hive tables; they must be created in the default schema.

2. Also make sure each table's data is split into 3 files, e.g. part-00000, part-00001, part-00002.

3. Store all the generated Java files in a directory called java_output for further evaluation.

Questions 22

Problem Scenario 25 : You have been given the below comma separated employee information, which needs to be added to the /home/cloudera/flumetest/in.txt file (to use a tail source).

sex,name,city

1,alok,mumbai

1,jatin,chennai

1,yogesh,kolkata

2,ragini,delhi

2,jyotsana,pune

1,valmiki,banglore

Create a Flume conf file using the fastest non-durable channel, which writes data into the Hive warehouse directory, in two separate tables called flumemaleemployee1 and flumefemaleemployee1 (create the Hive tables as well for the given data). Please use a tail source with the /home/cloudera/flumetest/in.txt file.

flumemaleemployee1 : will contain only male employees' data.

flumefemaleemployee1 : will contain only female employees' data.

Questions 23

Problem Scenario 51 : You have been given the below code snippet.

val a = sc.parallelize(List(1, 2,1, 3), 1)

val b = a.map((_, "b"))

val c = a.map((_, "c"))

Operation_xyz

Write a correct code snippet for Operation_xyz which will produce the below output (a sketch follows the expected output).

Output:

Array[(Int, (Iterable[String], Iterable[String]))] = Array(

(2,(ArrayBuffer(b),ArrayBuffer(c))),

(3,(ArrayBuffer(b),ArrayBuffer(c))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))

)
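
A hedged sketch of one Operation_xyz that produces this output: cogroup the two keyed RDDs.

// cogroup collects, per key, the values from b and from c side by side
b.cogroup(c).collect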

Questions 24

Problem Scenario 33 : You have been given files as below.

spark5/EmployeeName.csv (id,name)

spark5/EmployeeSalary.csv (id,salary)

Data is given below:

EmployeeName.csv

E01,Lokesh

E02,Bhupesh

E03,Amit

E04,Ratan

E05,Dinesh

E06,Pavan

E07,Tejas

E08,Sheela

E09,Kumar

E10,Venkat

EmployeeSalary.csv

E01,50000

E02,50000

E03,45000

E04,45000

E05,50000

E06,45000

E07,50000

E08,10000

E09,10000

E10,10000

Now write Spark code in Scala which will load these two files from HDFS, join them, and produce the (name, salary) values.

And save the data in multiple files, grouped by salary (meaning each file will contain the names of the employees with the same salary). Make sure the file name includes the salary as well. A sketch follows.
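
A hedged sketch using a join on the employee id and saveAsHadoopFile with a salary-keyed MultipleTextOutputFormat; the output path and file-name pattern are assumptions.

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
// Emit the key (salary) as part of each output file's name
class SalaryOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "_" + name
}
val names    = sc.textFile("spark5/EmployeeName.csv").map(_.split(",")).map(e => (e(0), e(1)))
val salaries = sc.textFile("spark5/EmployeeSalary.csv").map(_.split(",")).map(e => (e(0), e(1)))
// Join on the employee id, then re-key by salary so records land in per-salary files
val bySalary = names.join(salaries)                       // (E01, (Lokesh, 50000))
  .map { case (_, (name, salary)) => (salary, name) }
bySalary.saveAsHadoopFile("spark5/output", classOf[String], classOf[String], classOf[SalaryOutputFormat])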

Questions 25

Problem Scenario 82 : You have been given a table in Hive with the following structure (which you created in a previous exercise).

productid int, code string, name string, quantity int, price float

Using SparkSQL, accomplish the following activities (a sketch follows the list).

1. Select all the products' name and quantity having quantity <= 2000

2. Select name and price of the product having code as 'PEN'

3. Select all the products whose name starts with PENCIL

4. Select all products whose "name" begins with 'P', followed by any two characters, followed by a space, followed by zero or more characters
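
A hedged sketch, assuming the Hive table is named products as in the earlier scenario.

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// 1. Name and quantity where quantity <= 2000
sqlContext.sql("SELECT name, quantity FROM products WHERE quantity <= 2000").show()
// 2. Name and price for code 'PEN'
sqlContext.sql("SELECT name, price FROM products WHERE code = 'PEN'").show()
// 3. Products whose name starts with PENCIL
sqlContext.sql("SELECT * FROM products WHERE name LIKE 'PENCIL%'").show()
// 4. Name beginning with 'P', any two characters, a space, then anything
sqlContext.sql("SELECT * FROM products WHERE name LIKE 'P__ %'").show()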

Questions 26

Problem Scenario 22 : You have been given the below comma separated employee information.

name,salary,sex,age

alok,100000,male,29

jatin,105000,male,32

yogesh,134000,male,39

ragini,112000,female,35

jyotsana,129000,female,39

valmiki,123000,male,29

Use the netcat service on port 44444, and send the above data line by line with nc. Please do the following activities.

1. Create a Flume conf file using the fastest channel, which writes data into the Hive warehouse directory, in a table called flumeemployee (create the Hive table as well for the given data).

2. Write a hive query to read average salary of all employees.

Questions 27

Problem Scenario 28 : You need to implement a near real-time solution for collecting information as it is submitted in files, as shown by the data below.

Data

echo "IBM,100,20160104" >> /tmp/spooldir2/.bb.txt

echo "IBM,103,20160105" >> /tmp/spooldir2/.bb.txt

mv /tmp/spooldir2/.bb.txt /tmp/spooldir2/bb.txt

After a few minutes

echo "IBM,100.2,20160104" >> /tmp/spooldir2/.dr.txt

echo "IBM,103.1,20160105" >> /tmp/spooldir2/.dr.txt

mv /tmp/spooldir2/.dr.txt /tmp/spooldir2/dr.txt

You have been given the below directory location (if not available then create it): /tmp/spooldir2 .

As soon as a file is committed in this directory, it needs to be available in HDFS in /tmp/flume/primary as well as in the /tmp/flume/secondary location.

However, note that /tmp/flume/secondary is optional; if a transaction writing to this directory fails, it need not be rolled back.

Write a flume configuration file named flumeS.conf and use it to load data into HDFS with the following additional properties (a partial conf sketch follows the list).

1. Spool /tmp/spooldir2 directory

2. File prefix in hdfs should be events

3. File suffix should be .log

4. If a file is not committed and is in use, then it should have _ as a prefix.

5. Data should be written as text to hdfs
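
A hedged partial sketch of such a conf file; the agent name, channel types and the use of an optional replicating channel for the secondary sink are assumptions about how the requirements could be met.

# agent1: spooldir source feeding a required (primary) and an optional (secondary) HDFS sink
agent1.sources  = src1
agent1.channels = ch1 ch2
agent1.sinks    = primarySink secondarySink
# 1. Spool /tmp/spooldir2
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /tmp/spooldir2
agent1.sources.src1.channels = ch1 ch2
# ch2 feeds the optional secondary location; failures there are not rolled back
agent1.sources.src1.selector.type = replicating
agent1.sources.src1.selector.optional = ch2
agent1.channels.ch1.type = memory
agent1.channels.ch2.type = memory
# Primary HDFS sink: events prefix, .log suffix, _ in-use prefix, plain text (2-5)
agent1.sinks.primarySink.type = hdfs
agent1.sinks.primarySink.channel = ch1
agent1.sinks.primarySink.hdfs.path = /tmp/flume/primary
agent1.sinks.primarySink.hdfs.filePrefix = events
agent1.sinks.primarySink.hdfs.fileSuffix = .log
agent1.sinks.primarySink.hdfs.inUsePrefix = _
agent1.sinks.primarySink.hdfs.fileType = DataStream
# Secondary HDFS sink mirrors the primary one into /tmp/flume/secondary
agent1.sinks.secondarySink.type = hdfs
agent1.sinks.secondarySink.channel = ch2
agent1.sinks.secondarySink.hdfs.path = /tmp/flume/secondary
agent1.sinks.secondarySink.hdfs.filePrefix = events
agent1.sinks.secondarySink.hdfs.fileSuffix = .log
agent1.sinks.secondarySink.hdfs.inUsePrefix = _
agent1.sinks.secondarySink.hdfs.fileType = DataStream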

Questions 28

Problem Scenario 31 : You have been given the following two files.

1. Content.txt: Contain a huge text file containing space separated words.

2. Remove.txt: Ignore/filter all the words given in this file (Comma Separated).

Write a Spark program which reads the Content.txt file and loads it as an RDD, removes all the words given in a broadcast variable (which is loaded as an RDD of words from Remove.txt), then counts the occurrences of each remaining word and saves the result as a text file in HDFS. A sketch follows the file contents below.

Content.txt

Hello this is ABCTech.com

This is TechABY.com

Apache Spark Training

This is Spark Learning Session

Spark is faster than MapReduce

Remove.txt

Hello, is, this, the
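
A hedged sketch; the exact HDFS paths and the output directory name are assumptions.

val content = sc.textFile("Content.txt")
// Remove.txt is comma separated; collect its words and ship them as a broadcast variable
val removeWords = sc.textFile("Remove.txt").flatMap(_.split(",")).map(_.trim).collect()
val removeBroadcast = sc.broadcast(removeWords)
// Split into words, drop the broadcast words, then count the rest
val counts = content
  .flatMap(_.split(" "))
  .filter(word => !removeBroadcast.value.contains(word))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("wordcount_output")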

Exam Code: CCA175
Exam Name: CCA Spark and Hadoop Developer Exam
Last Update: May 15, 2024
Questions: 96