How accurate are gender detection tools in predicting the gender for Chinese names? A study with 20,000 given names in Pinyin format
Keywords: accuracy, Chinese, gender detection, misclassification, name, name-to-gender, performance
Objective: We recently showed that the gender detection tools NamSor, Gender API, and Wiki-Gendersort accurately predicted the gender of individuals with Western given names. Here, we aimed to evaluate the performance of these tools with Chinese given names in Pinyin format.
Methods: We constructed two datasets for the study. File #1 was created by randomly drawing 20,000 names from a gender-labeled database of 52,414 Chinese given names in Pinyin format. File #2, which contained 9,077 names, was created by removing from File #1 all unisex names that we were able to identify (i.e., those listed in the database as both male and female names). For both files, we recorded the number of correct classifications (correct gender assigned to a name), misclassifications (wrong gender assigned to a name), and nonclassifications (no gender assigned). We then calculated errorCoded, the proportion of names that were either misclassified or not classified.
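As an illustration of the error metric described in the Methods, the following minimal Python sketch counts the three outcomes and derives errorCoded as (misclassifications + nonclassifications) divided by the total number of names. The function name `classification_summary`, the label strings, the use of `None` for a nonclassification, and the toy data are all assumptions made for this example; they are not taken from the study's own code or from any of the tools' APIs.

```python
def classification_summary(true_genders, predicted_genders):
    """Count outcomes and compute errorCoded for paired gender labels.

    Assumption for this sketch: a prediction of None means the tool
    assigned no gender to the name (a nonclassification).
    """
    if len(true_genders) != len(predicted_genders):
        raise ValueError("both label lists must have the same length")
    if not true_genders:
        raise ValueError("at least one name is required")

    correct = misclassified = nonclassified = 0
    for truth, predicted in zip(true_genders, predicted_genders):
        if predicted is None:
            nonclassified += 1      # no gender assigned
        elif predicted == truth:
            correct += 1            # correct gender assigned
        else:
            misclassified += 1      # wrong gender assigned

    total = correct + misclassified + nonclassified
    return {
        "correct": correct,
        "misclassified": misclassified,
        "nonclassified": nonclassified,
        # Proportion of names that were misclassified or not classified.
        "error_coded": (misclassified + nonclassified) / total,
    }


# Hypothetical toy labels, not data from the study.
toy_truth = ["female", "male", "female", "male"]
toy_predicted = ["female", "female", None, "male"]
summary = classification_summary(toy_truth, toy_predicted)
print(summary)
```

Counting nonclassifications as errors means that a tool cannot lower its errorCoded simply by declining to classify difficult names.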
Results: For File #1, errorCoded was 53% for NamSor, 65% for Gender API, and 90% for Wiki-Gendersort. For File #2, errorCoded was 43% for NamSor, 66% for Gender API, and 94% for Wiki-Gendersort.
Conclusion: We found that all three gender detection tools inaccurately predicted the gender of individuals with Chinese given names in Pinyin format and therefore should not be used in this population.
Copyright (c) 2022 Paul Sebo
This work is licensed under a Creative Commons Attribution 4.0 International License.