A South Korean graduate student has built a dataset to help curb the spread of COVID-19 and shared it with the world. The data is being used to predict the number of confirmed cases and deaths.
“Dataset of COVID19 in South Korea” was registered on February 24 on Kaggle, a platform for sharing machine-learning-based forecasting models. It was created by Kim Ji-hoo, a graduate student in computer software at Hanyang University. Kim, who goes by the nickname “Data Artist” on Kaggle, is updating the dataset in real time. As of the 6th, it had been downloaded more than 5,000 times. Given that a dataset is only the raw material for developing an AI model, 5,000 downloads in such a short period is a very high number. It was also ranked the “most popular data” on Kaggle on the 3rd. More than 130,000 people currently use Kaggle.
A dataset is a collection of data needed for machine learning. Kim’s dataset is built from official information provided by the Korea Centers for Disease Control and Prevention (KCDC), together with geographic information such as the latitude and longitude of confirmed cases’ movements. It is more accurate than the dataset created by Johns Hopkins University, which is based only on confirmed cases and the number of deaths. Kim’s dataset includes more specific information, such as the main areas of activity of confirmed cases, their routes, the differences between confirmed cases, and the locations they visited. It organizes scattered COVID-19 data according to consistent rules. A unified dataset such as Kim’s is needed to create a forecasting model.
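To illustrate what organizing scattered data according to consistent rules can look like, here is a minimal Python sketch. The field names (case_id, visit_date, latitude, longitude), the raw keys, and the sample rows are hypothetical and are not taken from Kim’s dataset; the sketch only shows the general idea of passing every raw row through the same validation and formatting rules.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class VisitRecord:
    """One location visited by a confirmed case (hypothetical schema)."""
    case_id: int
    visit_date: date
    latitude: float
    longitude: float


def normalize(raw: dict) -> VisitRecord:
    """Apply the same rules to every raw row, whatever its source."""
    lat = float(raw["lat"])
    lon = float(raw["lon"])
    # Reject coordinates that cannot exist on Earth.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    return VisitRecord(
        case_id=int(raw["id"]),
        # Dates are expected as YYYY-MM-DD; stray whitespace is trimmed.
        visit_date=datetime.strptime(str(raw["date"]).strip(), "%Y-%m-%d").date(),
        latitude=round(lat, 6),
        longitude=round(lon, 6),
    )


# Made-up rows for illustration only, not real patient data.
raw_rows = [
    {"id": "1", "date": "2020-02-20 ", "lat": "37.5665", "lon": "126.9780"},
    {"id": 2, "date": "2020-02-21", "lat": 35.8714, "lon": 128.6014},
]
records = [normalize(row) for row in raw_rows]
```

Whether a row arrives with strings or numbers, it leaves in one shape, which is what makes later analysis and modeling straightforward.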
“Machine learning makes predictions after creating a forecasting model based on data,” said Professor Kim Hui-kang of Korea University’s School of Cybersecurity, whose research focuses on data. “No matter how advanced AI and machine learning are, they will be useless if the quality of the data is poor.” Professor Kim’s point is that good-quality data, and the sharing of such data, is important for developing the AI industry.

<Kim Ji-hoo’s dataset of COVID-19 on Kaggle>

Kim Ji-hoo created the dataset while looking for patterns in how COVID-19 spreads. Because there was no organized dataset for analyzing the spread of COVID-19, he created one himself. He labeled information from KCDC and local governments for data analysis. Because automation alone yields poor-quality data, he also has to do some of the work manually. He worked alone at first but now works with nine colleagues who are interested in data research. To share the dataset with data scientists around the world, they are also working on making it available in English.
“We expected our dataset to be used to predict the number of confirmed cases and the rate of release from isolation, and code based on our dataset is already being registered,” said Kim Ji-hoo. Code based on the dataset is shared again through “Kernel” on Kaggle. Using the dataset, one can create a model that predicts how many more confirmed cases there will be over the next few days. It is also possible to predict whether a confirmed patient will survive or die depending on age, sex, and condition, to cluster confirmed cases by their characteristics, or to single out unusual cases such as super-spreaders by detecting anomalies.
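As a rough illustration of the first use, predicting confirmed cases over the next few days, the Python sketch below fits a straight line to the logarithm of cumulative counts by least squares and extrapolates it. This is not the method used in any of the code registered on Kaggle. The counts are made up, the function names are hypothetical, and the assumption of steady exponential growth is a deliberate simplification that real outbreaks do not follow for long.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_cumulative(counts, days_ahead):
    """Extrapolate cumulative counts, assuming exponential growth.

    counts must hold at least two positive daily cumulative totals.
    """
    days = list(range(len(counts)))
    log_counts = [math.log(c) for c in counts]
    slope, intercept = fit_line(days, log_counts)
    last_day = len(counts) - 1
    return [
        math.exp(intercept + slope * (last_day + k))
        for k in range(1, days_ahead + 1)
    ]


# Synthetic cumulative totals that double every day (not real data).
synthetic_counts = [10, 20, 40, 80, 160]
forecast = forecast_cumulative(synthetic_counts, 2)
print([round(value) for value in forecast])
```

Because the synthetic series doubles exactly, the fitted trend continues the doubling; with real data the fit would be approximate and the forecast only as good as the growth assumption.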
Supervised machine learning is commonly divided into regression and classification. A large amount of code for both supervised and unsupervised learning has been registered using Kim’s dataset. In addition to forecasting models, there are also visualizations based on the dataset. Kim’s dataset is expected to be used, even after the COVID-19 situation ends, to investigate how the disease spread and to uncover hidden facts through various data analysis methods. Data analysis can reveal facts that epidemiological surveys cannot uncover.
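Anomaly detection of the kind mentioned above is a typical unsupervised task, since no case is labeled in advance as unusual. The Python sketch below is one simple way to do it, assuming a hypothetical list of contact counts per confirmed case: it flags values that sit far from the median, using the median absolute deviation (a modified z-score). The numbers, the function name, and the 3.5 threshold are illustrative choices, not values from Kim’s dataset.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value does not distort the
    baseline the way it would distort a mean and standard deviation.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All values are (nearly) identical; nothing stands out.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up contact counts per confirmed case (not real data).
contacts = [3, 5, 4, 6, 2, 5, 4, 120, 3]
suspects = flag_outliers(contacts)
print(suspects)
```

Here only the case with 120 contacts is flagged. A flagged case is a candidate for closer review, not proof that the person is a super-spreader.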
“Although systems and cultures differ between countries, our dataset can be used for other infectious diseases as well,” said Kim Ji-hoo. “We hope that South Korea will build and model its data well and use it in critical situations in the future.”
Staff Reporter Oh, Dain | ohdain@etnews.com