Advanced Statistics and Alternative Data Types
When learning about statistics, most students are exposed to cross-sectional data, i.e., numerical observations of many entities at a single point in time. However, most data also have a time dimension, and most of the data in the world is textual rather than numerical. As such, time series and text are important data types that are used frequently across industries, from finance and economics to HR and marketing.
In this course we will first cover important principles of standard regression analysis from both a statistical and a machine learning perspective, including distributional assumptions, bias-variance trade-offs, cross validation and resampling. We will then introduce you to working with time series and text data in this setting. Both types of data come with their own challenges compared to traditional cross-sectional data sets. For example, text data is unstructured, meaning that it cannot be handled directly by standard statistical and machine learning models. Accordingly, you will learn how to apply standard pre-processing tools so that text can be used both in standard machine learning models and in models tailored for natural language processing. Similarly, data with time dependence comes with its own set of challenges, and naïve application of statistical methods and machine learning techniques to time-dependent data is unlikely to work well.
The linear regression model (recap with cross-sectional data)
- Standard (distributional) assumptions
- Bias/variance trade-off
- Cross validation and in-sample/out-of-sample predictions
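
As a rough illustration of the in-sample versus out-of-sample distinction listed above, the following minimal sketch fits a one-variable linear regression by ordinary least squares and compares its in-sample error with a 5-fold cross-validated error. It is not course material: the simulated data, the true coefficients (intercept 1, slope 2) and the helper names `fit_ols`, `mse` and `k_fold_mse` are all assumptions made for this example.

```python
import random


def fit_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(xs, ys, intercept, slope):
    """Mean squared prediction error of the fitted line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_mse(xs, ys, k=5):
    """Average out-of-sample MSE over k folds.

    Folds are formed by taking every k-th observation, which is reasonable
    here only because the simulated observations are independent.
    """
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        test_x = [xs[i] for i in sorted(test_idx)]
        test_y = [ys[i] for i in sorted(test_idx)]
        intercept, slope = fit_ols(train_x, train_y)
        fold_errors.append(mse(test_x, test_y, intercept, slope))
    return sum(fold_errors) / k


# Simulated cross-sectional data (assumed for illustration): y = 1 + 2x + noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

intercept, slope = fit_ols(xs, ys)
in_sample_mse = mse(xs, ys, intercept, slope)
cv_mse = k_fold_mse(xs, ys, k=5)

print(f"intercept={intercept:.3f}, slope={slope:.3f}")
print(f"in-sample MSE={in_sample_mse:.3f}, 5-fold CV MSE={cv_mse:.3f}")
```

The in-sample error is computed on the same observations used for estimation, so it tends to be optimistic; the cross-validated error is computed only on observations held out from each fit.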
Time series data
- Stationarity and autocorrelation
- Unit roots and tests
- Fundamental time series processes
  - Random walk
  - Autoregressive model
- Estimation and HAC corrections
- Persistence and impulse response functions
- Out-of-sample forecasting
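
To make a few of the time series topics above concrete, here is a minimal sketch that simulates a stationary first-order autoregressive process and a random walk, estimates the autoregressive coefficient by least squares, and derives the implied impulse responses and multi-step forecasts. The coefficient 0.7, the sample size and the function names are assumptions chosen for illustration; HAC corrections and formal unit root tests are not shown.

```python
import random


def simulate_ar1(phi, n, rng, sigma=1.0):
    """Simulate y_t = phi * y_{t-1} + e_t with Gaussian shocks, starting at 0."""
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0, sigma))
    return y


def estimate_ar1(y):
    """Least-squares estimate of phi from regressing y_t on y_{t-1} (no intercept)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den


def autocorrelation(y, lag):
    """Sample autocorrelation of y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y)
    cov = sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, n))
    return cov / var


rng = random.Random(42)
stationary = simulate_ar1(0.7, 2000, rng)   # |phi| < 1: shocks die out
random_walk = simulate_ar1(1.0, 2000, rng)  # phi = 1: unit root, shocks persist

phi_hat = estimate_ar1(stationary)
rw_hat = estimate_ar1(random_walk)

# In an AR(1) model, the response at horizon h to a unit shock is phi ** h.
impulse_response = [phi_hat ** h for h in range(6)]

# Out-of-sample forecasts h steps ahead from the last observation (zero-mean model).
forecast = [phi_hat ** h * stationary[-1] for h in range(1, 4)]

print(f"estimated phi (stationary series): {phi_hat:.3f}")
print(f"estimated phi (random walk): {rw_hat:.3f}")
print(f"lag-1 autocorrelation (stationary series): {autocorrelation(stationary, 1):.3f}")
print("impulse responses:", [round(v, 3) for v in impulse_response])
```

For the stationary series the impulse responses decay geometrically towards zero, whereas an estimated coefficient close to one, as for the random walk, implies that shocks have near-permanent effects.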
Natural language processing
- Tokenization, converting text to numbers
  - An important step in using text as data is to convert raw text into numbers that can be used in statistical or machine learning models. You will learn several ways to do this.
- Textual pre-processing: how to extract signal rather than noise
  - You will learn how to use concepts like stop words, lemmatization, tf-idf weighting, and embeddings to extract meaningful content from text data.
- Simple models for text data (and the limits of linear regression)
  - Examples may include classifying review scores (books, movies, restaurants), separating spam from regular mail, extracting topics, or deriving sentiment scores.
- Distributional assumptions when working with generative text data models
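
As one possible illustration of converting text to numbers, the sketch below tokenizes a few short texts, removes stop words, and computes tf-idf weights using only the Python standard library. The example reviews, the tiny stop word list and the function names `tokenize` and `tf_idf` are hypothetical and chosen for this example; lemmatization and embeddings are not shown, and library implementations often use slightly different tf-idf formulas.

```python
import math
import re
from collections import Counter

# A tiny stop word list, assumed for illustration only.
STOP_WORDS = {"the", "a", "is", "was", "and", "this", "it"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Term frequency is the term's share of the document's tokens; inverse
    document frequency is log(number of documents / documents containing
    the term).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical review texts.
reviews = [
    "The movie was great and the acting was great",
    "The movie was boring",
    "This restaurant is great",
]

weights = tf_idf(reviews)
for text, doc_weights in zip(reviews, weights):
    ranked = sorted(doc_weights.items(), key=lambda item: -item[1])
    print(text, "->", [(term, round(w, 3)) for term, w in ranked])
```

In the second review, "boring" receives a higher weight than "movie" because it appears in only one of the three texts, which is the sense in which tf-idf emphasizes distinctive terms over common ones.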
This is an excerpt from the complete course description. If you are an active student at BI, you can find the complete course descriptions, with information on e.g. learning goals, learning process, curriculum and exam, at portal.bi.no. We reserve the right to make changes to this description.