Detailed information about the course

Title	Practical Crowdsourcing for Efficient Machine Learning
Dates	7 June – 25 June 2021
Activity lead	Philippe CUDRE-MAUROUX
Organizer(s)	Alisa Smirnova, PhD candidate, University of Fribourg; Prof. Philippe Cudré-Mauroux, University of Fribourg; Dr. Valerio Schiavoni, University of Neuchatel
Speakers	Olga Megorskaya, Yandex LLC, Russia; Alexey Drutsa, Yandex LLC, Russia; Dmitry Ustalov, University of Mannheim, Germany
Description	The course is dedicated to crowdsourcing as a tool for efficient and scalable data labeling. Large amounts of reliable data are essential for creating and training ML-based algorithms and models, both for industrial purposes and for scientific research. Accurate data makes it possible to train efficient models and to evaluate their quality. This is why efficient data labeling is an in-demand and essential skill for professionals and researchers working with ML.

Crowdsourcing helps establish robust and scalable data labeling processes by distributing tasks among a vast pool of users. However, ensuring data quality requires tackling certain challenges, such as describing the task simply and clearly, training crowd workers to accomplish the task correctly, and aggregating results. At Yandex, we have been dealing with these challenges for over 10 years. In this course, we share our experience and see how crowdsourcing can be applied to the needs of course participants.

We will discuss the basic components of the crowdsourcing approach that allow us to turn manual data labeling into an engineering task. We will see how crowdsourcing helps collect and label data for common ML-based tasks (e.g., natural language processing or computer vision). In addition, the course participants will practice their new skills by designing and setting up a project for their research needs using a real-world crowdsourcing platform.

Course Outcomes
At the end of the course, the participants will: understand the general principles of crowdsourcing; know the state of the current research related to crowdsourcing; understand how crowdsourcing can be applied to various research challenges; be able to design a pipeline for a data labeling task.
Program	Introduction
Session 1: Introduction to crowdsourcing
Date: 8 June, 10:00-11:30 UTC+2
Practical exercise: Run a simple data labeling task on a crowdsourcing platform

Case Studies
Session 2: Crowdsourcing for NLP tasks
Date: 10 June, 10:00-11:30 UTC+2
Session 3: Crowdsourcing for Computer Vision tasks
Date: 15 June, 10:00-11:30 UTC+2
Practical exercise: Run a pipeline of tasks for data collection

Crowdsourcing and Research
Session 4: Research challenges related to data labeling
Date: 17 June, 10:00-11:30 UTC+2
Session 5: Application of crowdsourcing to participants' research needs
Date: 22 June, 10:00-11:30 UTC+2
Session 6: Application of crowdsourcing to participants' research needs
Date: 24 June, 10:00-11:30 UTC+2
Practical exercise: Run a personal data labeling project
Location	Online
Information	Estimated workload
Each session: 1.5 h
Each practical exercise: approximately 3 h
Total workload: approximately 18 h

Prerequisites
General understanding of ML
Experience with HTML, CSS, JS and Python will be an advantage

Practical exercises and research grants
Practical exercises require participants to use a real-world crowdsourcing platform. In this course we'll be offering test accounts for Toloka, a crowdsourcing data labeling platform developed at Yandex, enabling the participants to complete two practical exercises. For the participants who have their own research project that could benefit from crowdsourced data labeling, we'll be offering Toloka research grants of up to $500. Grants will be awarded in Week 3 of the course, when participants start working on their personal projects.

Grant recipients should commit to the following:
Any publication that relies on the data collected using the awarded funds should acknowledge that the study was supported by the Toloka research grant.
The dataset collected in the experiment should be released publicly in the Toloka repository of datasets within 6 months after data collection ends.
Places	20
Deadline for registration	4 June 2021
