Winter School in Empirical Research Methods

Machine Learning with R – Introduction

Instructor: Brett Lantz

COURSE CONTENT

Machine learning, put simply, involves teaching computers to learn from experience, typically for the purpose of identifying and/or responding to patterns as well as making predictions about what may happen in the future. This course is intended to be an introduction to machine learning through the exploration of real-world examples. We will cover the basic math and statistical theory needed to begin applying many of the most common machine learning techniques, but no advanced math or programming skills are required. The target audience may include social scientists or practitioners who are interested in understanding more about these methods and applying them to their own work. Students with extensive programming or statistics experience may be better served by a more advanced course on the topic.

PREREQUISITES (KNOWLEDGE OF TOPIC)

This course assumes no prior experience with machine learning or R, though familiarity with introductory statistics and programming may be helpful.

HARDWARE

A laptop computer is required to complete the in-class exercises.

SOFTWARE

R and RStudio are required for this course. Both are available at no cost.

STRUCTURE

The course is designed to be highly interactive, with ample time for hands-on practice and group problem solving. Each day will include several 45-minute lecture periods, each built around a machine learning topic, followed by a longer hands-on “lab” section to conclude the day. Throughout the course, students will be encouraged to identify a project and dataset to work on individually or in groups during the lab sessions. Students will be asked to write a report on this project after the course ends.

A tentative schedule follows:

Day 1:  Introducing Machine Learning with R

  • How machines learn
  • Using R
  • kNN models
  • Afternoon: Lab section

Day 2:  Intermediate ML Models

  • Naïve Bayes
  • Decision Trees and Rules
  • Regression
  • Afternoon: Lab section

Day 3:  Advanced ML Models

  • Neural Networks
  • Support Vector Machines
  • Random Forests
  • Afternoon: Lab section

Day 4:  Other ML Methods

  • Association Rules
  • k-means Clustering
  • Afternoon: Lab section

Day 5:  Improving ML Performance

  • Model Evaluation
  • Parameter Tuning / Meta Learning / Ensembles
  • Random Forests
  • Afternoon: Lab section

LITERATURE

Mandatory:

  • Machine Learning with R (2nd ed.) by Brett Lantz (2015). Packt Publishing.

Supplementary / voluntary:

The following books may be useful for students looking to better understand the statistical and computational theory behind Machine Learning:

  • An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013). Springer.
  • The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009). Springer.
  • Pattern Recognition and Machine Learning by Christopher Bishop (2007). Springer.

Mandatory readings before course start:

  • Read Chapters 1 and 2 of Machine Learning with R (2nd ed.) before the course starts. Additionally, install R and RStudio on your laptop before the first class, and ensure that both are working correctly.

EXAMINATION

The student’s project and final report, to be delivered within 2-3 weeks after the course, will account for 100% of the course grade. The report will be graded on the use of the methods discussed and on whether appropriate conclusions are drawn from the data. At the beginning of each class session a short quiz will be given based on the assigned readings, but it will not count toward the final grade.

SUPPLEMENTARY AIDS

Students may reference literature as needed when writing the final report.

EXAMINATION CONTENT

The final report should illustrate an ability to apply an appropriate machine learning method to a topic of the student’s choosing. The student should explain the methods applied and evaluate the machine learning model’s performance for the chosen task.

WORK LOAD

At least 24 units of 45 minutes each, held over 5 consecutive days.