P2.T9.902. Big data techniques including machine learning (Varian)

Nicole Seaman

Chief Admin Officer
Learning objectives: Describe the issues unique to big datasets. Explain and assess different tools and techniques for manipulating and analyzing big data. Examine the areas for collaboration between econometrics and machine learning.


902.1. About the analysis of big data, Hal Varian says "Conventional statistical and econometric techniques such as regression often work well, but there are issues unique to big datasets that may require different tools. First, the sheer size of the data involved may require more powerful data manipulation tools. Second, we may have more potential predictors than appropriate for estimation, so we need to do some kind of variable selection. Third, large datasets may allow for more flexible relationships than simple linear models. Machine learning techniques such as decision trees, support vector machines, neural nets, deep learning, and so on may allow for more effective ways to model complex relationships."

Further, according to Hal Varian, which of the following statements is TRUE?

a. Excel remains the best database for Big Data because it contains fully 2^18 rows
b. The goal of machine learning is to develop good in-sample predictions but such methods are not helpful if the data is "too fat" or "too tall"
c. NoSQL databases are table-based relational databases that are more sophisticated than (i.e., "less primitive than") structured query language
d. Data analysis includes four categories: prediction (a primary concern of machine learning), summarization, estimation, and hypothesis testing

902.2. Hal Varian introduces "trees" as non-linear methods that are effective alternatives to linear or logistic regression for prediction. Classification trees such as binary trees (i.e., two branches at each node) are used for multiple outcomes, while regression trees handle continuous dependent variables. In regard to some of the different tools and techniques for manipulating and analyzing big data, each of the following statements is true EXCEPT which is inaccurate?

a. Random forests is a technique that uses multiple classification and/or regression trees
b. The primary drawback of trees is that, because they lack methods for coping with missing values, trees require all observations in the dataset to be complete cases
c. Trees sometimes do not work well when the underlying relationship is linear, but on the other hand they tend to thrive when there are important non-linear relationships and interactions
d. Elastic net regression adds a penalty term to the sum of squared residuals in a multivariate regression model such that it includes the special case of ordinary least squares (OLS) when the penalty term equals zero
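
For reference on the penalized objective that choice (d) describes, here is a minimal Python sketch of the quantity being minimized. The function name, the toy data, and the exact form of the penalty (lam times the sum of (1 - alpha)|b| + alpha b^2) are assumptions made for illustration; texts and libraries parametrize the mix between the absolute-value and squared penalties differently, so treat this as one possible convention and not a definitive implementation.

```python
def elastic_net_objective(X, y, intercept, coefs, lam, alpha):
    """Sum of squared residuals plus a penalty on the slope coefficients.

    Assumed parametrization (conventions vary across sources):
        penalty = lam * sum((1 - alpha) * |b| + alpha * b**2)
    so alpha = 0 is a lasso-style penalty and alpha = 1 a ridge-style one.
    """
    ssr = 0.0
    for row, target in zip(X, y):
        fitted = intercept + sum(b * x for b, x in zip(coefs, row))
        ssr += (target - fitted) ** 2
    penalty = lam * sum((1 - alpha) * abs(b) + alpha * b * b for b in coefs)
    return ssr + penalty


# Hypothetical toy data: three observations, two predictors.
X = [[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]]
y = [3.0, 2.0, 4.0]
coefs = [1.0, 0.5]

# With lam = 0 the penalty vanishes and only the sum of squared residuals remains.
ols_value = elastic_net_objective(X, y, 0.0, coefs, lam=0.0, alpha=0.5)
lasso_value = elastic_net_objective(X, y, 0.0, coefs, lam=2.0, alpha=0.0)
ridge_value = elastic_net_objective(X, y, 0.0, coefs, lam=2.0, alpha=1.0)
```

The sketch only evaluates the objective at fixed coefficients; actually fitting an elastic net means searching for the coefficients that minimize it, which is left out here.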

902.3. In regard to areas of potential collaboration between econometrics and machine learning, according to Hal Varian each of the following statements is true EXCEPT which is inaccurate?

a. In big datasets, model uncertainty tends to be small but sampling uncertainty tends to be quite large
b. Machine learning research often finds that averaging over many small models gives better out-of-sample prediction than choosing a single model
c. In order to model the average treatment effect as a function of other variables, we typically need to model both the observed difference in outcome and the selection bias
d. Prediction methods can assist with the thorny problem of estimating causation; for example, Bayesian Structural Time Series (BSTS) is a machine learning technique that can be used to forecast a counterfactual and estimate the causal effect of certain variables
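
To make the "averaging over many small models" idea in choice (b) concrete, the following is a minimal Python sketch of bootstrap aggregation (bagging) applied to one-split regression trees. It is not taken from Varian; the function names, the toy data, the seed, and the number of models are all hypothetical, and a full random forest would also sample predictors at each split, which this single-predictor sketch does not do.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing squared error."""
    best = None
    # Candidate thresholds skip the smallest x so both sides are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the overall mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def bagged_stumps(xs, ys, n_models=50, seed=0):
    """Fit one stump per bootstrap resample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Average the predictions of all the small models."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


# Hypothetical toy data with a step-like, non-linear relationship.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
stumps = bagged_stumps(xs, ys)
low, high = predict_bagged(stumps, 2), predict_bagged(stumps, 7)
```

Each stump on its own is a crude model; the averaged prediction varies less from one resample to the next than any single stump does, which is the intuition behind the out-of-sample claim. Whether averaging actually helps on a given dataset would need to be checked on held-out data.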

Answers here: