Winter School in Empirical Research Methods

Machine Learning with R – Advanced

Instructor: Brett Lantz


This course is a continuation of Introductory Machine Learning with R and assumes basic knowledge of several machine learning classification methods. Students with equivalent experience (from other ML courses or on-the-job work) are also welcome.


A laptop computer is required to complete the in-class exercises.


R and RStudio are available at no cost and are required for this course.


With machine learning, it is often difficult to make the leap from classroom examples to the real world. Real-world applications often present challenges that require more advanced approaches to preparing, exploring, modeling, and evaluating the data. The goal of this course is to prepare students to independently apply machine learning methods to their own tasks. We will cover practical techniques that are rarely found in textbooks and are usually learned through hands-on experience. We will practice these techniques by simulating a machine learning competition like those found on Kaggle. The target audience includes students who are interested in applying ML knowledge to more difficult problems and in learning more advanced techniques to improve the performance of traditional ML methods.


The course is designed to be interactive, with ample time for hands-on practice. Each day will include at least one lecture on the day’s topic as well as a hands-on “lab” section in which students apply what they have learned to a competition dataset (or their own data).

The tentative schedule is as follows:

Day 1: Handling messy data

  • Discussion: It is often said that 80% of the time spent on ML goes to data preparation. Why?
  • Lecture: Learning to explore data
  • Lecture: Missing values – imputation and other strategies
  • Lecture: The R data pipeline – tidyverse
  • Lab: Getting to know your data


Day 2: Understanding ML performance

  • Discussion: What makes a successful ML model?
  • Lecture: Getting beyond accuracy – other performance measures
  • Lecture: The “no free lunch” theorem
  • Lecture: Estimating future performance – sampling methods, model selection
  • Lab: Comparing models on your dataset with ROC curves


Day 3: Improving ML performance

  • Discussion: What factors keep ML models from perfect prediction?
  • Lecture: Tuning stock models – automated parameter tuning
  • Lecture: Meta-learning – ensembles, stacked models
  • Lab: Machine Learning Competition (Round 1)


Day 4: “Big data” problems

  • Discussion: Is more data always better? Why or why not?
  • Lecture: The curse of dimensionality – dimensionality reduction, t-SNE
  • Lecture: Imbalanced datasets – under- and over-sampling strategies
  • Lecture: Improving R’s performance on big data
  • Lab: Machine Learning Competition (Round 2)


Day 5: Next-generation “Black Box” methods

  • Discussion: What are the strengths and weaknesses of humans versus machines?
  • Lecture: Deep Learning – Keras, TensorFlow
  • Lecture: Text embeddings – word2vec
  • Lecture: Cluster computing – use cases of Hadoop, Spark, etc.
  • Discussion: Results of ML Competition – winners’ tips and tricks
  • Lab: Work on your final project



  • Machine Learning with R (2nd ed.) by Brett Lantz (2015). Packt Publishing

Mandatory readings before course start:

  • Students should have R and RStudio installed on their laptops before the first class. Be sure that both are working correctly and that external packages can be installed. Instructions for doing this are in the first chapter of Machine Learning with R.


80% of the course grade will be based on a project and final report (approximately 5-10 pages), to be delivered in R Notebook format within 2-3 weeks after the course. The report will be graded on its use of the methods covered in class and on whether it draws appropriate conclusions from the data. The project is intended to evaluate the ability to apply what was learned in class to a real-world topic of one’s own choosing. Students should feel free to use a project related to their career or field of study. For example, students may use this opportunity to advance their dissertation research or complete a task for their job. The exact scoring criteria for this assignment will be provided on the first day of class.

The remaining 20% of the course grade will be based on participation in the in-class discussions and in the machine learning competitions. The ML competition winner(s) will receive maximum points, while runners-up will receive a fraction of the points based on effort, innovation, and proximity to the winners’ performance. The performance metrics for the competition will be announced before it begins.

Supplementary aids

Students may reference any literature as needed when writing the final report.


The final report should illustrate an ability to apply advanced machine learning methods to a topic of the student’s choosing. The student should explain the methods applied and evaluate the machine learning model’s performance for the chosen task.