Demystifying COVID-19 publications: institutions, journals, concepts, and topics
Keywords: COVID-19 pandemic, CORD-19 dataset, global research roadmap, data analytics, topic modeling
Objective: We analyzed the COVID-19 Open Research Dataset (CORD-19) to understand leading research institutions, collaborations among institutions, major publication venues, key research concepts, and topics covered by pandemic-related research.
Methods: We conducted a descriptive analysis of authors’ institutions and relationships, automatic content extraction of key words and phrases from titles and abstracts, and topic modeling and evolution. Data visualization techniques were applied to present the results of the analysis.
Results: We found that leading research institutions on COVID-19 included the Chinese Academy of Sciences, the US National Institutes of Health, and the University of California. Most research studies involved collaboration among different institutions at national and international levels. In addition to bioRxiv, major publication venues included journals such as The BMJ, PLOS One, Journal of Virology, and The Lancet. Key research concepts included the coronavirus, acute respiratory impairments, health care, and social distancing. The ten most popular topics were identified through topic modeling and included human metapneumovirus and livestock, clinical outcomes of patients with severe illness, and risk factors for higher mortality rates.
Conclusion: Data analytics is a powerful approach for quickly processing and understanding large-scale datasets like CORD-19. This approach could help medical librarians, researchers, and the public understand important characteristics of COVID-19 research and could be applied to the analysis of other large datasets.
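The kind of descriptive analysis summarized in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the records, field names, and stopword list below are hypothetical, and plain counting stands in for the network analysis and topic modeling named in the abstract. It counts how often pairs of institutions appear on the same paper and which terms recur in titles.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical toy records; real CORD-19 metadata uses different fields
# and would need affiliation cleaning before any counting.
papers = [
    {"title": "Clinical outcomes of severe coronavirus patients",
     "institutions": ["Chinese Academy of Sciences", "University of California"]},
    {"title": "Social distancing and coronavirus transmission",
     "institutions": ["University of California", "National Institutes of Health"]},
    {"title": "Risk factors for coronavirus mortality",
     "institutions": ["Chinese Academy of Sciences", "University of California",
                      "National Institutes of Health"]},
]

# Assumed, deliberately tiny stopword list for illustration only.
STOPWORDS = {"of", "and", "for", "the", "in", "a"}


def collaboration_edges(records):
    """Count papers shared by each unordered pair of institutions."""
    edges = Counter()
    for rec in records:
        # Deduplicate and sort so each pair has one canonical key.
        insts = sorted(set(rec["institutions"]))
        for a, b in combinations(insts, 2):
            edges[(a, b)] += 1
    return edges


def term_counts(records):
    """Count non-stopword title terms across all records."""
    counts = Counter()
    for rec in records:
        tokens = re.findall(r"[a-z]+", rec["title"].lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


edges = collaboration_edges(papers)
terms = term_counts(papers)

for pair, weight in edges.most_common():
    print(pair, weight)
print(terms.most_common(3))
```

The pair counts can serve as edge weights in a collaboration graph, and the term counts are a crude stand-in for key concept extraction; a study at CORD-19 scale would replace both with dedicated network and topic modeling tools.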
This work is licensed under a Creative Commons Attribution 4.0 International License.