Excerpt from course description

Advanced Statistics and Alternative Data Types

Introduction

This course is in three parts:

1) The linear regression model in the cross-sectional context
2) The statistical analysis of time series data
3) Analysis of text data

In this course we will first review standard regression analysis from a statistical perspective. The course will then introduce time series and text data, along with the standard models used to analyse them. Both types of data pose challenges that traditional cross-sectional data sets do not. For example, time series observations are typically dependent over time, so naïve application of statistical methods and machine learning techniques that assume independent observations is unlikely to work well. Text data is unstructured, meaning it cannot be handled directly by standard statistical and machine learning models. Accordingly, the course will cover how to apply standard pre-processing tools so that text can be used both in standard machine learning models and in models tailored for natural language processing.

Course content

The linear regression model (recap with cross-sectional data)

  • Standard (distributional) assumptions
  • Bias/variance trade-off
  • Regularization
  • Prediction

Time series data

  • Stationarity and autocorrelation
  • Fundamental time series processes
    • Random walk
    • ARMA models
  • Estimation and inference
  • Out-of-sample forecasting
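
To illustrate why dependence matters for the topics listed above, the following is a minimal sketch (not course material) that simulates three first-order processes and compares their sample lag-1 autocorrelation. The function names `simulate` and `lag1_autocorrelation`, the seed, the sample size, and the coefficient 0.5 are all assumptions chosen for illustration.

```python
import random


def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a list of numbers."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    return numer / denom


def simulate(phi, n, rng):
    """Simulate y_t = phi * y_{t-1} + e_t with standard normal shocks, starting from 0."""
    y, path = 0.0, []
    for _ in range(n):
        y = phi * y + rng.gauss(0.0, 1.0)
        path.append(y)
    return path


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
white_noise = simulate(0.0, 2000, rng)   # phi = 0: independent observations
ar1 = simulate(0.5, 2000, rng)           # |phi| < 1: stationary AR(1)
random_walk = simulate(1.0, 2000, rng)   # phi = 1: random walk (non-stationary)

for name, series in [("white noise", white_noise), ("AR(1)", ar1), ("random walk", random_walk)]:
    print(f"{name:12s} lag-1 autocorrelation: {lag1_autocorrelation(series):.3f}")
```

The white-noise series should show autocorrelation near zero, the AR(1) series a value near its coefficient, and the random walk a value close to one, which is one sign that methods assuming independent observations do not carry over directly.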

Natural language processing

  • Tokenization: converting text to numbers
    • An important step in using text as data is converting raw text into numbers that can be used in statistical or machine learning models. You will learn several ways to do this.
  • Textual pre-processing – how to extract signal rather than noise
    • You will learn how to use concepts such as stop words, lemmatization, tf-idf weighting, and embeddings to extract meaningful content from text data.
  • Simple models for text data (and the limits of linear regression)
    • Examples may include classifying review scores (books, movies, restaurants), separating spam from regular mail, extracting topics, or deriving sentiment scores.
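
As a rough illustration of turning text into numbers, the sketch below tokenizes a few made-up reviews, drops stop words, and computes tf-idf weights using only the Python standard library. The stop-word list, the example sentences, and the helper names `tokenize` and `tf_idf` are hypothetical. The weighting used here (term frequency times log(N / document frequency)) is one common textbook variant; libraries often apply different smoothing and normalization. Lemmatization and embeddings are not shown.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "is", "was", "and", "this"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tf_idf(documents):
    """Return (vocabulary, matrix) with one row of tf-idf weights per document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against documents with no tokens left
        rows.append(
            [(counts[t] / total) * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, rows


# Made-up example reviews.
reviews = [
    "The movie was great and the acting was great",
    "The movie was boring",
    "A great book",
]
vocab, weights = tf_idf(reviews)
print(vocab)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is a numeric representation of one review that could be passed to a standard regression or classification model. Terms that occur in few documents receive larger weights than terms that occur in many.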

Disclaimer

This is an excerpt from the complete course description. If you are an active student at BI, you can find the complete course descriptions, with information on e.g. learning goals, learning process, curriculum and exam, at portal.bi.no. We reserve the right to make changes to this description.