Detailed information about the course

Title

Practical Crowdsourcing for Efficient Machine Learning

Dates

7 June – 25 June 2021

Activity leader

Philippe CUDRE-MAUROUX

Organizer(s)

Alisa Smirnova, PhD candidate, University of Fribourg

Prof. Philippe Cudré-Mauroux, University of Fribourg

Dr. Valerio Schiavoni, University of Neuchatel

Speakers

Olga Megorskaya, Yandex LLC, Russia

Alexey Drutsa, Yandex LLC, Russia

Dmitry Ustalov, University of Mannheim, Germany

Description

The course is dedicated to crowdsourcing as a tool for efficient and scalable data labeling. Large amounts of reliable data are essential for creating and training ML-based algorithms and models, both for industrial purposes and for scientific research. Accurate data makes it possible to train efficient models and to evaluate their quality. This is why efficient data labeling is an essential, in-demand skill for professionals and researchers working with ML.

 

Crowdsourcing helps establish robust and scalable data labeling processes by distributing tasks among a large crowd of workers. However, ensuring data quality requires tackling certain challenges, such as formulating the task simply and clearly, training crowd workers to complete the task correctly, and aggregating their results. At Yandex, we have been dealing with these challenges for over 10 years. In this course, we share our experience and look at how crowdsourcing can be applied to the needs of course participants. We will discuss the basic components of the crowdsourcing approach that allow us to turn manual data labeling into an engineering task. We will see how crowdsourcing helps collect and label data for common ML-based tasks (e.g., natural language processing or computer vision). In addition, the course participants will practice their new skills by designing and setting up a project for their research needs using a real-world crowdsourcing platform.
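To illustrate what "aggregating results" can mean in practice, here is a minimal sketch of majority voting, a common baseline for combining overlapping answers from several crowd workers. It is not necessarily the method taught in the course. The function name `majority_vote`, the `(task_id, worker_id, label)` record layout, and the sample data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def majority_vote(records):
    """Aggregate crowd answers by simple majority vote.

    `records` is an iterable of (task_id, worker_id, label) tuples
    (a hypothetical layout chosen for this sketch).
    Returns {task_id: (winning_label, share_of_votes)}.
    """
    votes = defaultdict(list)
    for task_id, _worker_id, label in records:
        votes[task_id].append(label)

    result = {}
    for task_id, task_votes in votes.items():
        counts = Counter(task_votes)
        # most_common(1) returns the label with the highest count;
        # ties go to the label that was seen first.
        label, count = counts.most_common(1)[0]
        result[task_id] = (label, count / len(task_votes))
    return result


# Illustrative data: three workers label two images.
raw = [
    ("img1", "w1", "cat"), ("img1", "w2", "cat"), ("img1", "w3", "dog"),
    ("img2", "w1", "dog"), ("img2", "w2", "dog"), ("img2", "w3", "dog"),
]
aggregated = majority_vote(raw)
for task_id, (label, share) in aggregated.items():
    print(f"{task_id}: {label} ({share:.0%} agreement)")
```

The share of votes behind the winning label gives a rough agreement signal: tasks with low agreement are candidates for extra overlap or for a clearer task description.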

 

Course Outcomes

At the end of the course, the participants will: 

  • understand the general principles of crowdsourcing; 
  • know the state of the current research related to crowdsourcing; 
  • understand how crowdsourcing can be applied to various research challenges; 
  • be able to design a pipeline for a data labeling task. 

 

Program

Introduction

Session 1: Introduction to crowdsourcing

Date: 8 June 10:00-11:30 UTC +2

Practical exercise: Run a simple data labeling task on a crowdsourcing platform

Case Studies

Session 2: Crowdsourcing for NLP tasks

Date: 10 June 10:00-11:30 UTC +2

Session 3: Crowdsourcing for Computer Vision tasks

Date: 15 June 10:00-11:30 UTC +2

Practical exercise: Run a pipeline of tasks for data collection

Crowdsourcing and Research

Session 4: Research challenges related to data labeling

Date: 17 June 10:00-11:30 UTC +2

Session 5: Application of crowdsourcing to participants' research needs

Date: 22 June 10:00-11:30 UTC +2

Session 6: Application of crowdsourcing to participants' research needs

Date: 24 June 10:00-11:30 UTC +2

Practical exercise: Run a personal data labeling project

Location

Online

Information

Estimated workload

Each session: 1.5 h

Each practical exercise: approximately 3 h 

Total workload: approximately 18 h

 

Prerequisites

General understanding of ML

Experience with HTML, CSS, JS and Python will be an advantage

 

Practical exercises and research grants

Practical exercises require participants to use a real-world crowdsourcing platform. In this course we'll be offering test accounts for Toloka, a crowdsourcing data labeling platform developed at Yandex, enabling the participants to complete two practical exercises. 

For participants who have their own research project that could benefit from crowdsourced data labeling, we'll be offering Toloka research grants of up to $500. Grants will be awarded in Week 3 of the course, when participants start working on their personal projects. Grant recipients should commit to the following:

  • Any publication that relies on the data collected using the awarded funds should acknowledge that the study was supported by the Toloka research grant.
  • The dataset collected in the experiment should be released publicly in the Toloka repository of datasets within 6 months after data collection ends.

 

 

 

Places

20

Deadline for registration 04.06.2021