Databricks-Certified-Professional-Data-Scientist Sample Questions Answers

Questions 4

Let A denote the event 'student is female' and let B denote the event 'student is French'. In a class of 100 students, suppose 60 are French and 10 of the French students are female. Find the probability that a randomly picked French student is female; that is, find P(A|B).

Options:

A.

1/3

B.

2/3

C.

1/6

D.

2/6
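
The conditional probability asked for here can be checked directly from the counts in the question. The following is a minimal sketch; the variable names are illustrative and not part of the original question.

```python
from fractions import Fraction

total_students = 100
french = 60              # number of students in event B
french_and_female = 10   # number of students in both A and B

p_b = Fraction(french, total_students)
p_a_and_b = Fraction(french_and_female, total_students)

# Definition of conditional probability: P(A|B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 1/6
```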

Questions 5

A user opens a website 3 times. The probability that the user clicks the advertisement 2 times is best calculated using which distribution?

Options:

A.

Binomial

B.

Poisson

C.

Normal

D.

Any of the above

Questions 6

Which of the following is a correct example of the target variable in regression (supervised learning)?

Options:

A.

Nominal values like true, false

B.

Reptile, fish, mammal, amphibian, plant, fungi

C.

Infinite number of numeric values, such as 0.100, 42.001, 1000.743..

D.

All of the above

Questions 7

Which of the following steps will you use in the discovery phase?

Options:

A.

What are all the data sources for the project?

B.

Analyze the raw data and its format and structure.

C.

What tools are required in the project?

D.

What network capacity is required?

E.

What Unix server capacity is required?

Questions 8

Which of the below best describes principal component analysis?

Options:

A.

Dimensionality reduction

B.

Collaborative filtering

C.

Classification

D.

Regression

E.

Clustering

Questions 9

You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters. What should you do?

Options:

A.

Identify additional measures to add to the analysis

B.

Remove one of the measures

C.

Decrease the number of clusters

D.

Increase the number of clusters

Questions 10

RMSE is a good measure of accuracy, but only to compare forecasting errors of different models for a ______, as it is scale-dependent.

Options:

A.

Between Variables

B.

Particular Variable

C.

Among all the variables

D.

All of the above are correct
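
The scale dependence mentioned in the question can be seen by computing RMSE for the same errors expressed in two different units. This is a small sketch with made-up height values; the data and the helper name `rmse` are assumptions for illustration only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical heights in meters and forecasts from some model.
actual_m = [1.60, 1.75, 1.82]
predicted_m = [1.65, 1.70, 1.80]

# The same values expressed in centimeters.
actual_cm = [100 * v for v in actual_m]
predicted_cm = [100 * v for v in predicted_m]

rmse_m = rmse(actual_m, predicted_m)
rmse_cm = rmse(actual_cm, predicted_cm)

# The error is about 100 times larger purely because the unit changed.
print(rmse_m, rmse_cm)
```

Because the value changes with the unit, RMSE figures are only comparable when they refer to the same quantity on the same scale.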

Questions 11

A problem statement is given below:

Hospital records show that of patients suffering from a certain disease, 75% die of it. What is the probability that of 6 randomly selected patients, 4 will recover?

Which of the following models will you use to solve it?

Options:

A.

Binomial

B.

Poisson

C.

Normal

D.

Any of the above
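
For reference, the probability in the problem statement can be worked out by treating each patient as an independent trial with a fixed recovery probability. This is a sketch under that assumption; `binomial_pmf` is an illustrative helper name.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# 75% die of the disease, so each patient recovers with probability 1/4.
p_recover = Fraction(1, 4)

prob = binomial_pmf(4, 6, p_recover)
print(prob, float(prob))  # 135/4096 0.032958984375
```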

Questions 12

In which of the following scenarios should you apply Bayes' Theorem?

Options:

A.

The sample space is partitioned into a set of mutually exclusive events {A1, A2, ..., An}.

B.

Within the sample space, there exists an event B, for which P(B) > 0.

C.

The analytical goal is to compute a conditional probability of the form: P(Ak | B).

D.

In all of the above cases

Questions 13

Refer to the Exhibit.

In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output attribute "class". Which decision tree is valid for the data?

Options:

A.

Tree A

B.

Tree B

C.

Tree C

D.

Tree D

Questions 14

You are designing a recommendation engine for a website. It generates more personalized recommendations by analyzing information from the past activity of a specific user, or the history of other users deemed to have similar taste to that user. These resources are used for user profiling and help the site recommend content on a user-by-user basis. The more a given user makes use of the system, the better the recommendations become, as the system gains data to improve its model of that user. What kind of recommendation engine is this?

Options:

A.

Naive Bayes classifier

B.

Collaborative filtering

C.

Logistic Regression

D.

Content-based filtering

Questions 15

While working with Netflix, the movie rating website, you have developed a recommender system whose rating predictions are consistently exactly 1 higher than the ratings given in the dataset for every user-item pair. There are n items in the dataset. What will be the calculated RMSE of your recommender system on the dataset?

Options:

A.

1

B.

2

C.

0

D.

n/2
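
The RMSE in this scenario can be verified numerically. The sketch below uses a small made-up list of ratings (an assumption; the question does not give the data) and shifts every prediction up by exactly 1.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    total = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(total / len(actual))

ratings = [3, 5, 1, 4, 2]                # hypothetical ratings
predictions = [r + 1 for r in ratings]   # always exactly 1 higher

print(rmse(ratings, predictions))  # 1.0
```

Every squared error is 1, so the mean squared error is 1 regardless of how many items there are, and its square root is also 1.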

Questions 16

Which of the following is not a correct application of classification?

Options:

A.

credit scoring

B.

tumor detection

C.

image recognition

D.

drug discovery

Questions 17

What is the best way to evaluate the quality of the model found by an unsupervised algorithm like k-means clustering, given metrics for the cost of the clustering (how well it fits the data) and its stability (how similar the clusters are across multiple runs over the same data)?

Options:

A.

The lowest cost clustering subject to a stability constraint

B.

The lowest cost clustering

C.

The most stable clustering subject to a minimal cost constraint

D.

The most stable clustering

Questions 18

Select the correct objectives of principal component analysis

Options:

A.

To reduce the dimensionality of the data set

B.

To identify new meaningful underlying variables

C.

To discover the dimensionality of the data set

D.

Only A and B

E.

All of A, B and C

Questions 19

The method based on principal component analysis (PCA) evaluates the features according to

Options:

A.

The projection of the largest eigenvector of the correlation matrix on the initial dimensions

B.

The magnitude of the components of the discriminant vector

C.

The projection of the smallest eigenvector of the correlation matrix on the initial dimensions

D.

None of the above
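
To make the idea of projecting the largest eigenvector onto the original dimensions concrete, here is a rough pure-Python sketch. It uses the covariance matrix rather than the correlation matrix for brevity, finds the dominant eigenvector by power iteration, and reads its components as per-feature weights. The data, function names, and the choice of power iteration are all assumptions for illustration, not a method stated in the question.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]

def dominant_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(row[j] * vec[j] for j in range(len(vec))) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical data: features 0 and 1 vary together, feature 2 barely varies.
rows = [(1.0, 2.0, 0.5), (2.0, 4.1, 0.4), (3.0, 5.9, 0.6), (4.0, 8.2, 0.5)]

cov = covariance_matrix(rows)
loadings = [abs(x) for x in dominant_eigenvector(cov)]
print(loadings)  # larger values mean the feature contributes more
```

In this toy data the nearly constant third feature receives the smallest weight on the first principal component.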

Questions 20

A data scientist is asked to implement an article recommendation feature for an on-line magazine.

The magazine does not want to use client tracking technologies such as cookies or reading history. Therefore, only the style and subject matter of the current article are available for making recommendations. All of the magazine's articles are stored in a database in a format suitable for analytics.

Which method should the data scientist try first?

Options:

A.

K Means Clustering

B.

Naive Bayesian

C.

Logistic Regression

D.

Association Rules

Questions 21

What type of output is generated by linear regression?

Options:

A.

Continuous variable

B.

Discrete Variable

C.

Either a continuous or a discrete variable

D.

Values between 0 and 1

Questions 22

Which of the following metrics are useful in measuring the accuracy and quality of a recommender system?

Options:

A.

Cluster Density

B.

Support Vector Count

C.

Mean Absolute Error

D.

Sum of Absolute Errors

Questions 23

Regularization is a very important technique in machine learning to prevent overfitting. Mathematically speaking, it adds a regularization term that keeps the coefficients from fitting the training data so closely that the model overfits. The difference between L1 and L2 is...

Options:

A.

L2 is the sum of the square of the weights, while L1 is just the sum of the weights

B.

L1 is the sum of the square of the weights, while L2 is just the sum of the weights

C.

L1 gives Non-sparse output while L2 gives sparse outputs

D.

None of the above
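
The two penalty terms can be written out in a few lines. Note that, strictly speaking, the L1 penalty sums the absolute values of the weights, not the raw weights. The weight vector below is hypothetical.

```python
def l1_penalty(weights):
    """L1 (lasso) penalty: sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """L2 (ridge) penalty: sum of squared weights."""
    return sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 3.0]  # hypothetical model coefficients

print(l1_penalty(weights))  # 5.5
print(l2_penalty(weights))  # 13.25
```

In practice each penalty is multiplied by a tuning constant and added to the training loss.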

Questions 24

Consider the following confusion matrix for a data set with 600 out of 11,100 instances positive:

In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%.

Select the correct statement

Options:

A.

Precision is low, which means the classifier is predicting positives best

B.

Precision is low, which means the classifier is predicting positives poorly

C.

The problem domain has a major impact on the measures that should be used to evaluate a classifier within it

D.

A and C

E.

B and C
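
The confusion matrix itself is not reproduced in this listing. The counts below are an assumption: they are one set of values consistent with the stated totals (600 positives out of 11,100) and the stated metrics, shown only to illustrate how each metric is computed.

```python
# Assumed counts, inferred from the stated metrics; not taken from the exhibit.
tp, fn = 500, 100     # the 600 actual positives
fp, tn = 500, 10000   # the 10,500 actual negatives

precision = tp / (tp + fp)            # of predicted positives, how many are right
recall = tp / (tp + fn)               # of actual positives, how many are found
specificity = tn / (tn + fp)          # of actual negatives, how many are found
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"{precision:.0%} {recall:.0%} {specificity:.0%} {accuracy:.0%}")
# 50% 83% 95% 95%
```

With so few positives, accuracy stays high even though half of the predicted positives are wrong.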

Questions 25

Select the correct statement which applies to K-Nearest Neighbors

Options:

A.

No Assumption about the data

B.

Computationally expensive

C.

Requires less memory

D.

Works with Numeric Values

Questions 26

Suppose you have built a model for a rating system that rates items from 1 to 5 stars, and you calculated an RMSE value of 1.0. Which of the following is correct?

Options:

A.

It means that your predictions are on average one star off of what people really think

B.

It means that your predictions are on average two stars off of what people really think

C.

It means that your predictions are on average three stars off of what people really think

D.

It means that your predictions are on average four stars off of what people really think

Questions 27

You are working on a classification model for a book written by HadoopExam Learning Resources and have decided to build a text classification model for determining whether this book is about Hadoop or cloud computing. To cut down on the size of the feature space, you use the mutual information of each word with the label (hadoop or cloud) to select the 1000 best features to use as input to a Naive Bayes model. When you compare the performance of a model built with the 250 best features to a model built with the 1000 best features, you notice that the model with only 250 features performs slightly better on your test data.

What would help you choose better features for your model?

Options:

A.

Include least mutual information with other selected features as a feature selection criterion

B.

Include the number of times each of the words appears in the book in your model

C.

Decrease the size of our training data

D.

Evaluate a model that only includes the top 100 words

Questions 28

Logistic regression is a model used for prediction of the probability of occurrence of an event. It makes use of several variables that may be......

Options:

A.

Numerical

B.

Categorical

C.

Both A and B are correct

D.

Neither A nor B is correct

Questions 29

In which of the following scenarios can you use a linear regression model?

Options:

A.

Predicting Home Price based on the location and house area

B.

Predicting demand of the goods and services based on the weather

C.

Predicting tumor size reduction based on the number of radiation treatments

D.

Predicting sales of a textbook based on the number of students in the state

Questions 30

If you are trying to predict or forecast a discrete target value, which is the correct option?

Options:

A.

Supervised Learning regression algorithms

B.

Supervised Learning classification algorithms

C.

Unsupervised Learning

D.

Density estimation algorithm

Questions 31

You are creating a regression model with the income, education and current debt of a customer as inputs. What could be a possible output of this model?

Options:

A.

Customer fit rated as good

B.

Customer fit rated as acceptable or average

C.

The probability, expressed as a percent, that the customer will default on a loan

D.

A and C are correct

E.

B and C are correct

Questions 32

You are studying the behavior of a population, and you are provided with multidimensional data at the individual level. You have identified four specific individuals who are valuable to your study, and would like to find all users who are most similar to each individual. Which algorithm is the most appropriate for this study?

Options:

A.

Association rules

B.

Decision trees

C.

Linear regression

D.

K-means clustering

Questions 33

Scenario: Suppose that Bob can decide to go to work by one of three modes of transportation: car, bus, or commuter train. Because of high traffic, if he decides to go by car, there is a 50% chance he will be late. If he goes by bus, which has special reserved lanes but is sometimes overcrowded, the probability of being late is only 20%. The commuter train is almost never late, with a probability of only 1%, but is more expensive than the bus.

Suppose that Bob is late one day, and his boss wishes to estimate the probability that he drove to work that day by car. Since he does not know which mode of transportation Bob usually uses, he gives a prior probability of 1/3 to each of the three possibilities. Which of the following methods will the boss use to estimate the probability that Bob drove to work?

Options:

A.

Naive Bayes

B.

Linear regression

C.

Random decision forests

D.

None of the above
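
The boss's estimate can be worked through with Bayes' rule using only the numbers in the scenario. A minimal sketch follows; the dictionary layout and names are illustrative.

```python
from fractions import Fraction

# Equal prior for each mode of transportation, as stated in the scenario.
prior = {"car": Fraction(1, 3), "bus": Fraction(1, 3), "train": Fraction(1, 3)}

# Probability of being late given each mode.
p_late = {"car": Fraction(1, 2), "bus": Fraction(1, 5), "train": Fraction(1, 100)}

# Total probability of being late (the evidence term).
evidence = sum(prior[m] * p_late[m] for m in prior)

# Bayes' rule: P(car | late) = P(late | car) * P(car) / P(late)
posterior_car = prior["car"] * p_late["car"] / evidence
print(posterior_car, float(posterior_car))  # 50/71, roughly 0.704
```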

Questions 34

Select the algorithm that is an unsupervised learning algorithm

Options:

A.

K-Nearest Neighbors

B.

K-Means

C.

Support Vector Machines

D.

Naive Bayes

Questions 35

Select the correct statement which applies to logistic regression

Options:

A.

Computationally inexpensive, easy to implement, and the knowledge representation is easy to interpret

B.

May have low accuracy

C.

Works with Numeric values

D.

Only A and C are correct

E.

All of A, B and C are correct

Questions 36

A bio-scientist is working on the analysis of cancer cells. To identify whether a cell is cancerous, hundreds of tests with small variations have been done to confirm it. Given the test results for a sample of healthy and cancerous cells, which of the following techniques will you use to determine whether a cell is healthy?

Options:

A.

Linear regression

B.

Collaborative filtering

C.

Naive Bayes

D.

Identification Test

Questions 37

Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign, the amount of time spent campaigning negatively and whether or not the candidate is an incumbent.

Above is an example of

Options:

A.

Linear Regression

B.

Logistic Regression

C.

Recommendation system

D.

Maximum likelihood estimation

E.

Hierarchical linear models
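
A model with a binary outcome like this one typically passes a weighted sum of the predictors through the logistic (sigmoid) function to get a probability between 0 and 1. The sketch below shows that mapping; the coefficients, intercept, and input values are invented for illustration and are not fitted to any data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def win_probability(money_spent, negative_time, is_incumbent, coef, intercept):
    """Probability of winning from a linear score passed through the sigmoid."""
    z = (intercept
         + coef[0] * money_spent
         + coef[1] * negative_time
         + coef[2] * is_incumbent)
    return sigmoid(z)

# Hypothetical coefficients; a real model would estimate these from data.
coef = (0.8, -0.5, 1.2)
intercept = -1.0

p = win_probability(money_spent=2.0, negative_time=1.0, is_incumbent=1,
                    coef=coef, intercept=intercept)
print(round(p, 3))
```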

Questions 38

In which phase of the data analytics lifecycle do Data Scientists spend the most time in a project?

Options:

A.

Discovery

B.

Data Preparation

C.

Model Building

D.

Communicate Results

Questions 39

What describes a true limitation of Logistic Regression method?

Options:

A.

It does not handle redundant variables well.

B.

It does not handle missing values well.

C.

It does not handle correlated variables well.

D.

It does not have explanatory values.

Questions 40

You have data for 1000 patients with height and age, where age is in years and height is in meters. You want to create clusters using these two attributes, with a near-equal effect from both age and height. What can you do?

Options:

A.

You will add the numeric value 100 to each height

B.

You will convert each height value to centimeters

C.

You will divide both age and height by their respective standard deviations

D.

You will take the square root of height
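
Dividing each attribute by its standard deviation puts attributes measured in very different units on a comparable scale before distances are computed. The sketch below uses a handful of made-up patients; the values and the helper name `scale_by_std` are assumptions.

```python
from statistics import pstdev

ages = [25, 40, 60, 35]              # years (hypothetical)
heights = [1.60, 1.75, 1.82, 1.68]   # meters (hypothetical)

def scale_by_std(values):
    """Divide every value by the population standard deviation."""
    sd = pstdev(values)
    return [v / sd for v in values]

scaled_ages = scale_by_std(ages)
scaled_heights = scale_by_std(heights)

# Before scaling the spreads differ by orders of magnitude;
# after scaling both attributes have a standard deviation of 1.
print(pstdev(ages), pstdev(heights))
print(pstdev(scaled_ages), pstdev(scaled_heights))
```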

Questions 41

The feature hashing approach is described as: "SGD-based classifiers avoid the need to predetermine vector size by simply picking a reasonable size and shoehorning the training data into vectors of that size." With large vectors, or with multiple locations per feature, what is the consequence of feature hashing?

Options:

A.

It is a problem with accuracy

B.

It is hard to understand what the classifier is doing

C.

It is easy to understand what the classifier is doing

D.

It is a problem with accuracy, and it is also hard to understand what the classifier is doing
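
For readers unfamiliar with the technique, here is a rough sketch of feature hashing with a fixed vector size and an optional number of locations ("probes") per feature. The function name, the use of MD5 as the hash, and the parameter names are assumptions chosen for illustration; real libraries differ in the details.

```python
import hashlib

def hashed_vector(tokens, size=8, probes=1):
    """Map tokens into a fixed-size count vector via hashing."""
    vec = [0] * size
    for token in tokens:
        for probe in range(probes):
            # A different prefix per probe gives each feature several slots.
            digest = hashlib.md5(f"{probe}:{token}".encode("utf-8")).hexdigest()
            vec[int(digest, 16) % size] += 1
    return vec

vec = hashed_vector(["hadoop", "cloud", "spark", "hadoop"], size=8, probes=2)
print(vec)
```

Several different tokens can land in the same slot, and a slot index by itself does not say which word produced it, so the mapping from vector positions back to features is not directly readable.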

Exam Code: Databricks-Certified-Professional-Data-Scientist
Exam Name: Databricks Certified Professional Data Scientist Exam
Last Update: May 18, 2024
Questions: 138