Detailed information about the course
Title | Practical Crowdsourcing for Efficient Machine Learning |
Dates | 7 June – 25 June 2021 |
Responsible | Philippe CUDRE-MAUROUX |
Organizer(s) | Alisa Smirnova, PhD candidate, University of Fribourg; Prof. Philippe Cudré-Mauroux, University of Fribourg; Dr. Valerio Schiavoni, University of Neuchatel |
Speakers | Olga Megorskaya, Yandex LLC, Russia; Alexey Drutsa, Yandex LLC, Russia; Dmitry Ustalov, University of Mannheim, Germany |
Description | The course is dedicated to crowdsourcing as a tool for efficient and scalable data labeling. Large amounts of reliable data are essential for creating and training ML-based algorithms and models, both for industrial purposes and for scientific research. Accurate data makes it possible to train efficient models and to evaluate their quality. This is why efficient data labeling is an in-demand and essential skill for professionals and researchers working with ML.
Crowdsourcing helps establish robust and scalable data labeling processes by distributing tasks among a vast crowd of users. However, ensuring data quality requires tackling certain challenges, such as formulating the task simply and clearly, training crowd workers to accomplish the task correctly, and aggregating results. At Yandex, we have been dealing with these challenges for over 10 years. In this course, we share our experience and explore how crowdsourcing can be applied to the needs of course participants. We will discuss the basic components of the crowdsourcing approach that allow us to turn manual data labeling into an engineering task. We will see how crowdsourcing helps collect and label data for common ML-based tasks (e.g., natural language processing or computer vision). In addition, the course participants will practice their new skills by designing and setting up a project for their research needs using a real-world crowdsourcing platform.
Course Outcomes
At the end of the course, the participants will:
|
Program | Introduction
Session 1: Introduction to crowdsourcing
Date: 8 June 10:00-11:30 UTC +2
Practical exercise: Run a simple data labeling task on a crowdsourcing platform
Case Studies
Session 2: Crowdsourcing for NLP tasks
Date: 10 June 10:00-11:30 UTC +2
Session 3: Crowdsourcing for Computer Vision tasks
Date: 15 June 10:00-11:30 UTC +2
Practical exercise: Run a pipeline of tasks for data collection
Crowdsourcing and Research
Session 4: Research challenges related to data labeling
Date: 17 June 10:00-11:30 UTC +2
Session 5: Application of crowdsourcing to participants' research needs
Date: 22 June 10:00-11:30 UTC +2
Session 6: Application of crowdsourcing to participants' research needs
Date: 24 June 10:00-11:30 UTC +2
Practical exercise: Run a personal data labeling project |
Location | Online |
Information | Estimated workload
Each session: 1.5 h
Each practical exercise: approximately 3 h
Total workload: approximately 18 h
Prerequisites
General understanding of ML
Experience with HTML, CSS, JS, and Python will be an advantage
Practical exercises and research grants
Practical exercises require participants to use a real-world crowdsourcing platform. In this course we will offer test accounts for Toloka, a crowdsourcing data labeling platform developed at Yandex, enabling the participants to complete two practical exercises. For participants who have their own research project that could benefit from crowdsourced data labeling, we will offer Toloka research grants of up to $500. Grants will be awarded in Week 3 of the course, when participants start working on their personal projects. Grant recipients should commit to the following:
|
Places | 20 |
Deadline for registration | 04.06.2021 |

