Two third-place awards for our students in the Eurostat competition “The Web Intelligence – Deduplication Challenge”

Students of our Informatics and Econometrics programme – Mikołaj Tym and Jakub Żerebecki (second-cycle studies, 1st year, speciality: Information Systems for Business and Administration) – took part in the competition “The Web Intelligence – Deduplication Challenge”, organized by Eurostat between December 2022 and April 2023. The task concerned data science and natural language processing.

The aim of the competition was to identify potential duplicate job offers collected from websites across the European Union. The dataset contained 112,000 job advertisements in various languages that had to be classified into one of the following categories:

  1. Unique ads.
  2. Full duplicates – offers with the same title and job description.
  3. Semantic duplicates – offers for the same job position, but expressed differently in natural language or in different languages.
  4. Temporal duplicates – semantic duplicates whose advertisements were collected on different dates.
  5. Partial duplicates – offers regarding the same professional position, but containing, for example, additional requirements for the candidate that are not included in the original offer.
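
To make the simplest of these categories concrete, the snippet below is a minimal sketch of how full duplicates (same title and job description) could be found by exact matching on normalized text. It is an illustration only, not the method used by the team or by Eurostat; the field names `id`, `title` and `description`, the normalization rule and the sample ads are all assumptions made for this example.

```python
import hashlib
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def full_duplicate_groups(ads):
    """Group the ids of ads whose normalized title and description are identical."""
    buckets = defaultdict(list)
    for ad in ads:
        # A separator character keeps the title and description from blending together.
        key_text = normalize(ad["title"]) + "\x1f" + normalize(ad["description"])
        key = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        buckets[key].append(ad["id"])
    # Only buckets with more than one ad contain duplicates.
    return [ids for ids in buckets.values() if len(ids) > 1]


# Hypothetical sample data, invented for illustration.
ads = [
    {"id": 1, "title": "Data Analyst", "description": "Analyse sales data."},
    {"id": 2, "title": "data analyst ", "description": "Analyse  sales data."},
    {"id": 3, "title": "Data Analyst", "description": "Analyse sales data; SQL required."},
]
groups = full_duplicate_groups(ads)
print(groups)  # [[1, 2]]
```

In this sketch ads 1 and 2 fall into the same group, while ad 3, which adds a requirement, would be closer to a partial duplicate in the sense described above. Exact matching cannot detect semantic, temporal or partial duplicates, which call for language-aware methods.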

Our students prepared a solution in Python that uses a large language model (LLM) together with other natural language processing methods to identify duplicate job offers.

A total of 69 teams from 17 countries took part in the competition, and our students (team IDA) took third place in two categories:

  1. Accuracy – identification of duplicates as precisely as possible (EUR 3,000).
  2. Reproducibility – development of an innovative and scalable methodology to produce European statistics (EUR 3,000).

Special thanks are due to Prof. Krzysztof Węcel, whose classes inspired the team members to develop their data science skills, for his invaluable help and support during the competition!

More info on the web page: