From science to practice: identifying important sources of information on Wikipedia

Wikipedia, a widely available source of information in the digital era, attaches great importance to the verifiability of its content, which is fundamental to its credibility and to the trust placed in it. The platform’s verifiability rules require that all information, especially controversial or disputed information, be supported by credible, published sources. This ensures that the content of Wikipedia articles is not based on personal opinion or original research. However, credibility is a subjective concept, and its assessment depends on many factors (including language version and topic), so selecting appropriate sources of information can be difficult for users editing Wikipedia.

Given the huge number of websites (currently over a billion), individually assessing the credibility of each source is a challenge for Wikipedia users. Although various language versions of Wikipedia have detailed guidelines that define what reliable sources are, there is no comprehensive list of websites or other sources of information that can be considered reliable for the various topics covered on Wikipedia. In addition, the credibility and reputation of websites may change over time, and evaluation criteria may vary with the language version of Wikipedia or the topic area, so any such list would require regular updates. A comprehensive and constantly updated list of reliable sources would therefore be very helpful not only to Wikipedia editors, but also to readers who are looking for accurate and reliable information.

An analysis of over 60 million Wikipedia articles made it possible to extract information about over 330 million references (footnotes citing information sources). This allowed the most important information sources of Wikipedia to be identified using different assessment models. The table below shows the results of reference extraction for selected language versions, including the number of unique websites, as of October 2023:

| Wiki | Language Version | Number of Articles | Number of References | Unique Websites |
|------|------------------|--------------------|----------------------|-----------------|
| ar | Arabic | 1,219,168 | 6,355,164 | 294,089 |
| ca | Catalan | 735,551 | 3,895,389 | 197,470 |
| cs | Czech | 532,602 | 2,752,877 | 119,313 |
| de | German | 2,839,878 | 14,473,501 | 622,551 |
| en | English | 6,722,214 | 79,687,819 | 1,942,579 |
| es | Spanish | 1,833,749 | 12,558,623 | 509,313 |
| fa | Persian | 975,931 | 2,477,763 | 133,634 |
| fi | Finnish | 559,931 | 3,371,084 | 138,320 |
| fr | French | 2,557,559 | 19,455,752 | 576,523 |
| he | Hebrew | 342,285 | 1,867,068 | 103,848 |
| hi | Hindi | 162,954 | 496,057 | 47,617 |
| hu | Hungarian | 530,977 | 2,545,152 | 124,536 |
| id | Indonesian | 661,844 | 2,672,604 | 162,924 |
| it | Italian | 1,829,095 | 8,856,574 | 278,232 |
| ja | Japanese | 1,388,532 | 14,684,917 | 359,446 |
| ko | Korean | 646,717 | 1,885,878 | 91,918 |
| nl | Dutch | 2,133,536 | 3,010,002 | 112,318 |
| no | Norwegian | 616,624 | 2,102,507 | 107,343 |
| pl | Polish | 1,583,919 | 8,847,928 | 242,835 |
| pt | Portuguese | 1,110,209 | 7,692,600 | 319,534 |
| ru | Russian | 1,940,113 | 15,461,960 | 454,351 |
| sv | Swedish | 2,572,575 | 11,791,609 | 134,081 |
| th | Thai | 158,905 | 1,010,438 | 70,395 |
| tr | Turkish | 533,201 | 2,773,455 | 146,854 |
| uk | Ukrainian | 1,289,727 | 5,455,954 | 217,787 |
| vi | Vietnamese | 1,288,093 | 3,796,577 | 147,041 |
| zh | Chinese | 1,379,496 | 8,130,187 | 283,516 |
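
To make the idea of reference extraction more concrete, the following minimal Python sketch pulls URLs out of `<ref>` tags in raw wikitext, reduces them to host names, and counts how often each website is cited. It is only an illustration under simplifying assumptions: the function names `reference_domains` and `rank_sources`, the regular expressions, and the sample wikitext are all hypothetical, and a plain citation count is just one conceivable assessment model, not necessarily any of the models used in the research described here.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches <ref>...</ref> and <ref name="x">...</ref>.
# Self-closing tags such as <ref name="x" /> are deliberately skipped.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)

# Rough URL pattern: stops at whitespace and common wikitext delimiters.
URL_RE = re.compile(r'https?://[^\s|\]<>}"]+')


def reference_domains(wikitext):
    """Return the host name of every URL found inside <ref> tags."""
    domains = []
    for ref_body in REF_RE.findall(wikitext):
        for url in URL_RE.findall(ref_body):
            host = urlsplit(url).hostname  # lower-cased by urlsplit
            if host:
                # Assumption: treat "www.example.org" and "example.org" as one site.
                domains.append(host.removeprefix("www."))
    return domains


def rank_sources(articles):
    """Count citations per website across an iterable of wikitext strings."""
    counts = Counter()
    for wikitext in articles:
        counts.update(reference_domains(wikitext))
    return counts


# Hypothetical sample data, not taken from real articles.
sample_articles = [
    'Text.<ref>{{cite web |url=https://www.example.org/a |title=A}}</ref> '
    'More.<ref name="b">[http://news.example.com/story Story]</ref>',
    'Other.<ref>https://example.org/b</ref><ref name="b" />',
]

counts = rank_sources(sample_articles)
print("unique websites:", len(counts))
print("most cited:", counts.most_common())
```

On real data, the number of keys in `counts` would correspond to the "Unique Websites" column above, although the exact figures depend on how URLs are normalized (for example, whether subdomains are merged), which this sketch does not attempt to reproduce.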

During the webinar, Dr. Włodzimierz Lewoniewski presented ways of identifying and automatically assessing the importance of information sources in Wikipedia articles across different language versions. The practical part demonstrated some of the capabilities of the BestRef tool, which contains the results of evaluating millions of Internet sources cited in Wikipedia articles from the point of view of individual language versions.

The webinar took place on November 23, 2023. The event was organized by Wikimedia Polska, which supports and promotes Wikipedia and its sister projects (such as Wikidata, Wiktionary, Wikinews, Wikisource and others).

More information about research on the analysis of information sources on Wikipedia can be found in scientific publications: