(Big) Data Curation, Pipelines and Management

English
GRA 4157
6 Credits

Introduction

To gain consistent benefits from machine learning models in business, it is essential to move data science projects from experimentation to production by building automated machine learning pipelines. A standard machine learning pipeline consists of data preparation, model training, model evaluation and validation. The tasks of the automated pipeline range from collecting real-time streaming data to model and output management.

In this course, you will learn the life cycle of a data science project and the responsibilities of the different roles in a data science team. You will also learn how to build an efficient end-to-end data science project and get hands-on experience programming in Python. Particular focus will be put on what is typically the most time-consuming part, namely data curation, cleaning, and management, including different database infrastructures and SQL-style queries.

Course content

The course moves towards advanced Python programming, with a particular focus on data science project life cycles. This includes:

Understanding business problems.
Under data curation, collection, and preprocessing, the course will cover:
- Reading and writing data “manually” to files in Python.
- Handling missing values, outliers, and other data anomalies.
- Database infrastructures (including SQL and MongoDB), and exploring and curating data with pandas.
- Writing analysis queries that answer business questions.
- Creating training data using queries.
Feature engineering.
Model building and deployment, including machine learning.
Roles and responsibilities in data science projects.
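
As a rough illustration of the data curation topics listed above, the following minimal sketch reads a small CSV "manually", treats one missing value and one outlier, and then uses an SQL query to select rows for a training set. It is not taken from the course material: the data, the column names, the age threshold, the median fill, and the helper functions read_rows, clean, and build_training_set are all assumptions made for this example, and it uses only Python's standard library (csv, statistics, sqlite3) rather than pandas.

```python
import csv
import io
import sqlite3
import statistics

# Hypothetical raw data: customer 2 has a missing age, customer 4 an implausible one.
RAW_CSV = """customer_id,age,spend
1,34,120.5
2,,80.0
3,29,95.25
4,230,60.0
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, turning empty age cells into None."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for rec in reader:
        rows.append({
            "customer_id": int(rec["customer_id"]),
            "age": int(rec["age"]) if rec["age"] else None,
            "spend": float(rec["spend"]),
        })
    return rows


def clean(rows, max_age=120):
    """Treat ages above max_age as missing, then fill missing ages with the median.

    The threshold and the median fill are illustrative choices, not a recommendation.
    """
    for row in rows:
        if row["age"] is not None and row["age"] > max_age:
            row["age"] = None
    median_age = statistics.median(r["age"] for r in rows if r["age"] is not None)
    for row in rows:
        if row["age"] is None:
            row["age"] = median_age
    return rows


def build_training_set(rows, min_spend=80):
    """Load the cleaned rows into an in-memory SQLite table and query a training set."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, age REAL, spend REAL)")
    conn.executemany(
        "INSERT INTO customers VALUES (:customer_id, :age, :spend)", rows
    )
    result = conn.execute(
        "SELECT age, spend FROM customers WHERE spend >= ? ORDER BY customer_id",
        (min_spend,),
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    cleaned = clean(read_rows(RAW_CSV))
    for age, spend in build_training_set(cleaned):
        print(age, spend)
```

The same steps could be written with pandas or against a MongoDB collection; the standard-library version is used here only to keep the sketch self-contained.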

Disclaimer

This is an excerpt from the complete course description. If you are an active student at BI, you can find the complete course description, with information on e.g. learning goals, learning process, curriculum and exam, at portal.bi.no. We reserve the right to make changes to this description.