Question 1:

You are provided with four different datasets. Initial analysis of these datasets shows that they have identical mean, variance, and correlation values. What should your next step in the analysis be?

A. Visualize the data to further explore the characteristics of each data set

B. Select one of the four datasets and begin planning and building a model

C. Combine the data from all four of the datasets and begin planning and building a model

D. Recalculate the descriptive statistics since they are unlikely to be identical for each dataset

Correct Answer: A
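
A well-known illustration of why visualization comes next is Anscombe's quartet: four small datasets whose summary statistics nearly coincide even though their scatter plots look completely different. The sketch below is only an illustration; the question does not say which datasets are meant, and the helper name pearson is our own. It computes the mean and variance of y and the x-y correlation for each set using only the standard library.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet (assumed here as a stand-in for the "four datasets").
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# (mean of y, sample variance of y, correlation of x and y) per dataset.
summary = {
    name: (mean(ys), variance(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (m, v, r) in summary.items():
    print(f"{name}: mean(y)={m:.2f}  var(y)={v:.2f}  r={r:.2f}")
```

The printed rows agree to about two decimal places, yet plotting each set reveals a linear cloud, a curve, a line with one outlier, and a vertical stack with one leverage point. The statistics alone cannot distinguish these cases.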

Question 2:

If your intention is to show trends over time, which chart type is the most appropriate way to depict the data?

A. Line chart

B. Bar chart

C. Stacked bar chart

D. Histogram

Correct Answer: A

Question 3:

Refer to the exhibit.

You have created a density plot of purchase amounts from a retail website as shown. What should you do next?

A. Recreate the plot using the barplot() function

B. Use the rug() function to add elements to the plot

C. Recreate the density plot using a log normal distribution of the purchase amount data

D. Reduce the sample size of the purchase amount data used to create the plot

Correct Answer: C
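
The exhibit is not reproduced here, so the following is a sketch under the assumption that the density plot is heavily right-skewed, as purchase amounts often are and as answer C implies. It generates synthetic log-normal "purchases" (the parameters 3.0 and 1.0 are arbitrary) and compares skewness before and after a log10 transform. The skewness helper is our own, not a library call.

```python
import math
import random


def skewness(values):
    """Population skewness: the mean cubed z-score."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return sum(((v - m) / sd) ** 3 for v in values) / n


random.seed(0)  # fixed seed so the run is repeatable

# Synthetic, assumed data: log-normally distributed purchase amounts.
purchases = [random.lognormvariate(3.0, 1.0) for _ in range(5000)]
log_purchases = [math.log10(p) for p in purchases]

raw_skew = skewness(purchases)
log_skew = skewness(log_purchases)

print(f"skewness of raw amounts:   {raw_skew:.2f}")
print(f"skewness of log10 amounts: {log_skew:.2f}")
```

On the raw scale most of the mass is squeezed against the left edge and the long tail dominates the axis. On the log scale the distribution is roughly symmetric, so a density plot of the transformed values shows structure that the original plot hides.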

Question 4:

To ensure a successful analytic project, which key role can provide business domain expertise with a deep understanding of the data and key performance indicators?

A. Business Intelligence Analyst

B. Project Manager

C. Project Sponsor

D. Business User

Correct Answer: A

Question 5:

Refer to the Exhibit.

For effective visualization, what is the chart's primary flaw?

A. The use of 3 dimensions.

B. The slanting of axis labels.

C. The location of the legend.

D. The order of the columns.

Correct Answer: A

Question 6:

You have been assigned to do a study of the daily revenue effect of a pricing model of online transactions. You have tested all the theoretical models in the previous model planning stage, and all tests have yielded statistically insignificant results. What is your next step?

A. Report that the results are insignificant, and reevaluate the original business question.

B. Run all the models again against a larger sample, leveraging more historical data.

C. Move forward on the model with the highest significance scores relative to the others.

D. Modify samples used by the models and iterate until a significant result occurs.

Correct Answer: A

Question 7:

A disk drive manufacturer has a defect rate of less than 1.0% with 98% confidence. A quality assurance team samples 1000 disk drives and finds 14 defective units. Which action should the team recommend?

A. The manufacturing process should be inspected for problems.

B. A larger sample size should be taken to determine if the plant is functioning properly

C. A smaller sample size should be taken to determine if the plant is functioning properly

D. The manufacturing process is functioning properly and no further action is required.

Correct Answer: A

Question 8:

During a discussion, a colleague mentions that they want to perform K-means clustering on text file data stored in HDFS.

Which tool would you recommend to this colleague?

A. Mahout

B. HBase

C. Scribe

D. Sqoop

Correct Answer: A

Question 9:

In which lifecycle stage are initial hypotheses formed?

A. Discovery

B. Model planning

C. Model building

D. Data preparation

Correct Answer: A

Question 10:

Before building an ARMA model, how can you determine if the time series is weakly stationary?

A. Constant variance around a constant mean is apparent

B. Mean of the series is close to 0

C. Series is normally distributed

D. No trend component is apparent

Correct Answer: A
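
As a rough illustration of answer A, the sketch below splits a series into consecutive windows and checks whether the window means and variances stay roughly constant. It is only a heuristic under stated assumptions: the function names, the number of windows, and the tolerance are all our own choices, and full weak stationarity also requires the autocovariance to depend only on the lag, which this check does not test.

```python
import random
from statistics import mean, pvariance


def window_stats(series, n_windows=4):
    """Mean and variance of each of n_windows consecutive, equal-sized chunks."""
    size = len(series) // n_windows
    windows = [series[i * size:(i + 1) * size] for i in range(n_windows)]
    return [(mean(w), pvariance(w)) for w in windows]


def looks_weakly_stationary(series, n_windows=4, tol=0.5):
    """Heuristic check: window means and variances must stay close together.

    tol is an arbitrary threshold chosen for this sketch, not a standard value.
    """
    stats = window_stats(series, n_windows)
    means = [m for m, _ in stats]
    variances = [v for _, v in stats]
    overall_sd = pvariance(series) ** 0.5
    mean_drift = (max(means) - min(means)) / overall_sd
    variance_ratio = max(variances) / min(variances)
    return mean_drift < tol and variance_ratio < 1 + tol


random.seed(1)  # fixed seed so the run is repeatable
noise = [random.gauss(0.0, 1.0) for _ in range(2000)]
trending = [0.005 * i + e for i, e in enumerate(noise)]  # same noise plus a linear trend

print("white noise:", looks_weakly_stationary(noise))
print("with trend: ", looks_weakly_stationary(trending))
```

In practice a time plot together with the autocorrelation function serves the same purpose. A series that fails such a check is usually differenced or detrended before an ARMA model is fitted.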

Question 11:

Assume that you have a data frame in R. Which function would you use to display descriptive statistics about this variable?

A. summary

B. str

C. attributes

D. levels

Correct Answer: A

Question 12:

Based on the exhibit, what is a likely issue with the data?

A. Saturated data; indicating potential issues with data definitions

B. Incomplete data; indicating potential issues with data transmission

C. Mis-scaled data; indicating potential issues with data entry

D. No obvious concerns with the data are visible

Correct Answer: A

Question 13:

When is a Naïve Bayesian Classifier model for classification preferred versus a Logistic Regression model?

A. When using several categorical input variables with over 1000 possible values each

B. When an estimate of the probability of an outcome is needed, not just which class it is in

C. When all input variables are numerical

D. When some of the input variables might be correlated

Correct Answer: A
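
One way to see why answer A favors Naïve Bayes: the classifier only needs a table of per-class value counts for each categorical input, whereas logistic regression typically needs one indicator column (and one coefficient) per distinct level. The sketch below is a minimal categorical Naïve Bayes with add-one smoothing. The function names, the toy rows, and the labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class value frequencies for each feature."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            vocab[j].add(value)
    return class_counts, value_counts, vocab


def predict(model, row):
    """Return the class with the highest smoothed log-posterior score."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)  # log prior
        for j, value in enumerate(row):
            count = value_counts[(j, label)][value]
            # Add-one smoothing, with one extra slot reserved for unseen values.
            score += math.log((count + 1) / (n_label + len(vocab[j]) + 1))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data: two categorical inputs that could have thousands of levels.
rows = [
    ("zip_10001", "sku_17"),
    ("zip_10001", "sku_42"),
    ("zip_94105", "sku_17"),
    ("zip_94105", "sku_99"),
]
labels = ["churn", "churn", "stay", "stay"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("zip_10001", "sku_42")))
print(predict(model, ("zip_94105", "sku_99")))
```

Adding another thousand distinct values only adds entries to the count tables. Nothing has to be re-encoded, and no optimization over thousands of coefficients is needed.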

Question 14:

What is the primary bottleneck in text classification?

A. The availability of tagged training data.

B. The ability to parse unstructured text data.

C. The high dimensionality of text data.

D. The fact that text corpora are dynamic.

Correct Answer: A

Question 15:

Refer to the exhibit.

Which type of data issue would you suspect based on the exhibit?

A. “Saturated” data, indicating potential issues with data definitions

B. Incomplete data, indicating potential issues with data transmission

C. Mis-scaled data, indicating potential issues with data entry

D. The exhibit does not raise any obvious concerns with the data.

Correct Answer: A

