Data Analysis on Korean Triage and Acuity Scale

Mert Barbaros
Apr 4, 2021

Project Steps

Data Source

You can find the relevant data source at this link, under the Associated Data section, in .xlsx format. In this article, I will share the EDA part of my project.

I will explain every part of the EDA step by step, but first I would like to summarize my process.

Data Acquisition

In the data acquisition part, I imported the necessary libraries (numpy, pandas, matplotlib, seaborn) and loaded the Excel data document with the pd.read_excel() function. I also downloaded the data description Excel file and loaded it with the same function.

Data Exploration & Pre-processing

I started my data exploration with dtypes. In the main dataset, the HR, RR, BT, and Saturation columns were of object type. Those columns also contained unnecessary characters like 측불. I removed the unnecessary characters and converted the selected columns to int64 and float.

After the data type conversion, I moved on to missing data handling. First, I identified the NaN values with the .isna() function, and then I formulated a strategy for the columns that had NaN values:

- NRS_pain is the numeric rating scale of pain. We can fill it with the average for the related chief complaint.

- Saturation is the pulse oximeter saturation. We can fill it with the average for the related chief complaint.

- Diagnosis in ED is the diagnosis. I preferred to discard those rows, since it is not accurate to fill them based on any assumption.

After applying that strategy, I dropped the remaining NaN values. Some NaNs remained because certain chief complaints had no NRS_pain or Saturation values at all, so there was no group average to fill with.

After cleaning the null values, I created a function using the scipy library that identifies and removes outlier values from the numerical columns, and applied it to the dataset.

When the missing data and outlier operations were complete, I started decoding the categorical columns. Using the data description file, I created new, human-readable columns based on conditions and values, and inserted them into the main data frame, called data.

Data Visualization & Analysis

In the data visualization part, I created the following visualizations and tables with the seaborn and matplotlib libraries, together with groupby and pivot table operations:

- Mistriage Distribution among arrived patients

- Heart and respiration rate based on diagnosis and gender

- Gender based heart and respiration rate scatter plot

- Length of stay (minutes) based on diagnosis with bar plot

- Discharge numbers according to cases with sns count plot

- Age distribution plot

- Age bins and Mistriage status with sns heatmap

EDA

First, let's set up the necessary libraries in the Jupyter notebook.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

When we import our dataset, we can see the data with

data.head()

Our data has 24 columns; let's explore them together with their descriptions.

From the original paper, it's worth reading the method section:

This cross-sectional retrospective study was based on 1267 systematically selected records of adult patients admitted to two emergency departments between October 2016 and September 2017. Twenty-four variables were assessed, including chief complaints, vital signs according to the initial nursing records, and clinical outcomes. Three triage experts, a certified emergency nurse, a KTAS provider and instructor, and a nurse recommended based on excellent emergency department experience and competence determined the true KTAS. Triage accuracy was evaluated by inter-rater agreement between the expert and emergency nurse KTAS scores. The comments of the experts were analyzed to evaluate the cause of triage error. An independent sample t-test was conducted to compare the number of patient visits per hour in terms of the accuracy and inaccuracy of triage.

When we look at the Saturation column, we see the value 측불, for which I could not find any meaning. I will discard those rows and then convert the column to float type.

data = data[data['Saturation'] != '측불']
data['Saturation'] = data['Saturation'].astype('float')

Next, let's find the null values in each column. The NRS_pain, Saturation, and Diagnosis in ED columns have null values.
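The null-count check itself isn't shown in the article; on a toy stand-in frame (column names mirror the dataset, the values are invented) it can be sketched like this:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset: only the three columns that contain NaNs
toy = pd.DataFrame({
    'NRS_pain': [3.0, np.nan, 5.0],
    'Saturation': [98.0, 97.0, np.nan],
    'Diagnosis in ED': ['A', 'B', None],
})

null_counts = toy.isnull().sum()  # NaNs per column
print(null_counts)
```

Each column reports one missing value here; on the real data the same call tells us which columns need the filling strategy above.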


In the following code, I assign the mean value for NRS_pain and Saturation according to the chief complaint. However, some chief complaints have no NRS_pain or Saturation values at all, so no group mean exists; I dropped those rows from the dataset.

# NRS Pain
means = data.groupby('Chief_complain')['NRS_pain'].mean()
nulls = data['NRS_pain'].isnull()
fills = data['Chief_complain'][nulls].map(means)
data.loc[nulls, 'NRS_pain'] = fills

# Saturation
means = data.groupby('Chief_complain')['Saturation'].mean()
nulls = data['Saturation'].isnull()
fills = data['Chief_complain'][nulls].map(means)
data.loc[nulls, 'Saturation'] = fills

data.dropna(inplace = True)
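To sanity-check that logic, here is the same group-mean fill run on a toy frame (the complaint names and pain scores are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Chief_complain': ['headache', 'headache', 'fever'],
    'NRS_pain': [4.0, np.nan, np.nan],
})

# Mean NRS_pain per chief complaint, then map it onto the null rows
means = toy.groupby('Chief_complain')['NRS_pain'].mean()
nulls = toy['NRS_pain'].isnull()
toy.loc[nulls, 'NRS_pain'] = toy['Chief_complain'][nulls].map(means)

# 'headache' had an observed value, so its NaN is filled with 4.0;
# 'fever' never had one, so its NaN survives and dropna removes the row
toy.dropna(inplace=True)
```

This shows why a final dropna is still needed: groups with no observed value produce a NaN group mean, which cannot fill anything.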

The next step is identifying outlier values. I created a function using the scipy library.

from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    constrains = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh, result_type='reduce') \
        .all(axis=1)
    df.drop(df.index[~constrains], inplace=True)

drop_numerical_outliers(data)
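A quick self-contained check of the same idea on toy data: ten identical readings plus one wildly out-of-range value. With z_thresh=3 only the extreme row is dropped. Note that stats.zscore uses the population standard deviation, so on small samples a value has to be very extreme before its z-score crosses the threshold:

```python
import numpy as np
import pandas as pd
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # Keep rows whose z-score is below the threshold in every numeric column
    constrains = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    df.drop(df.index[~constrains], inplace=True)

toy = pd.DataFrame({'HR': [10.0] * 10 + [1000.0]})
drop_numerical_outliers(toy)  # the 1000.0 row (z ≈ 3.16) is removed
```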

Next, let’s make the necessary type conversions

data = data[data['SBP'] != '측불']
data = data[data['DBP'] != '측불']
data = data[data['HR'] != '측불']
data = data[data['RR'] != '측불']
data = data[data['BT'] != '측불']
data['SBP'] = data['SBP'].astype('int64')
data['DBP'] = data['DBP'].astype('float')
data['HR'] = data['HR'].astype('int64')
data['RR'] = data['RR'].astype('int64')
data['BT'] = data['BT'].astype('float')
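An alternative to filtering each column by hand: pd.to_numeric with errors='coerce' turns any non-numeric token (including 측불) into NaN in one pass, which can then be dropped. A sketch on toy data (column names match the dataset, the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    'SBP': ['120', '측불', '135'],
    'HR': ['80', '75', '측불'],
})

# Coerce every non-numeric token to NaN, then drop rows that had one
vital_cols = ['SBP', 'HR']
toy[vital_cols] = toy[vital_cols].apply(pd.to_numeric, errors='coerce')
toy = toy.dropna(subset=vital_cols)
```

This scales better if other unparseable tokens besides 측불 appear, at the cost of converting the columns to float rather than int64.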

Next, based on the columns and their descriptions, I decoded the numeric category codes into readable labels.

conditions = [(data['Sex'] == 1), (data['Sex'] == 2)]
values = ['Female', 'Male']
data['Gender'] = np.select(conditions, values)

conditions = [(data['Group'] == 1), (data['Group'] == 2)]
values = ['Local ED', 'Regional ED']
data['GroupName'] = np.select(conditions, values)

conditions = [(data['Arrival mode'] == 1), (data['Arrival mode'] == 2),
              (data['Arrival mode'] == 3), (data['Arrival mode'] == 4),
              (data['Arrival mode'] == 5), (data['Arrival mode'] == 6),
              (data['Arrival mode'] == 7)]
values = ['Walking', '119 Use', 'Private Car', 'Private Ambulance',
          'Public Transportation', 'Wheelchair', 'Others']
data['ArrivalMethod'] = np.select(conditions, values)

conditions = [(data['Injury'] == 1), (data['Injury'] == 2)]
values = ['Injury', 'Non Injury']
data['InjuryName'] = np.select(conditions, values)

conditions = [(data['Mental'] == 1), (data['Mental'] == 2),
              (data['Mental'] == 3), (data['Mental'] == 4)]
values = ['Alert', 'Verbal Response', 'Pain Response', 'Unconsciousness']
data['MentalName'] = np.select(conditions, values)

conditions = [(data['Pain'] == 1), (data['Pain'] == 2)]
values = ['Pain', 'Non Pain']
data['PainName'] = np.select(conditions, values)

conditions = [(data['Disposition'] == 1), (data['Disposition'] == 2),
              (data['Disposition'] == 3), (data['Disposition'] == 4),
              (data['Disposition'] == 5), (data['Disposition'] == 6),
              (data['Disposition'] == 7)]
values = ['Discharge', 'Ward admission', 'ICU admission', 'AMA discharge',
          'Transfer', 'Death', 'OP from ED']
data['DispositionName'] = np.select(conditions, values)

conditions = [(data['Error_group'] == 1), (data['Error_group'] == 2),
              (data['Error_group'] == 3), (data['Error_group'] == 4),
              (data['Error_group'] == 5), (data['Error_group'] == 6),
              (data['Error_group'] == 7), (data['Error_group'] == 8),
              (data['Error_group'] == 9)]
values = ['Vital sign', 'Physical exam', 'Psychiatric', 'Pain', 'Mental',
          'Underlying disease', 'Medical records of other ED', 'On set', 'Others']
data['ErrorName'] = np.select(conditions, values)

conditions = [(data['mistriage'] == 0), (data['mistriage'] == 1), (data['mistriage'] == 2)]
values = ['Correct', 'Over Triage', 'Under Triage']
data['MistriageName'] = np.select(conditions, values)
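To illustrate the pattern in isolation, here is the same np.select decoding on a toy frame, together with a terser dict-based alternative for simple one-to-one code lookups:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Sex': [1, 2, 2, 1]})

# np.select pairs each condition with a replacement label
conditions = [(toy['Sex'] == 1), (toy['Sex'] == 2)]
values = ['Female', 'Male']
toy['Gender'] = np.select(conditions, values)

# For plain code -> label tables, a dict with .map() does the same job
toy['Gender2'] = toy['Sex'].map({1: 'Female', 2: 'Male'})
```

np.select shines when the conditions are more complex than equality checks (ranges, combinations of columns); for a pure lookup table, .map() is shorter and arguably clearer.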

Now, our dataset is much more readable

With the decoded labels in place, let's jump into the visualization part. First, let's see the mistriage distribution among arrived patients.

plt.figure(figsize=(10, 10))
dep = data['MistriageName'].value_counts()
labels = np.array(dep.index)
sizes = np.array((dep / dep.sum()) * 100)
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=200)
plt.title("Mistriage Distribution", fontsize=15)
plt.show()

Next, let's analyze the heart rate and respiratory rate according to diagnosis and gender among correct triages.

kk = (data[data['MistriageName'] == 'Correct']
      .groupby(by=['Diagnosis in ED', 'Gender'])[['HR', 'RR']]
      .mean()
      .reset_index())

# Split by gender so each series can be plotted separately
female = kk[kk['Gender'] == 'Female']
male = kk[kk['Gender'] == 'Male']

fig, ax = plt.subplots()
ax.plot(female['HR'].values, '--b', label='Female HR')
ax.plot(male['HR'].values, '-r', label='Male HR')
ax.legend(framealpha=1, frameon=True)

fig, ay = plt.subplots()
ay.plot(female['RR'].values, '--b', label='Female RR')
ay.plot(male['RR'].values, '-r', label='Male RR')
ay.legend(framealpha=1, frameon=True);

Similarly, we can compare the length of stay with the diagnosis among correct triages. I will average the length of stay per diagnosis and select the top 20.

kk = data[data['MistriageName'] == 'Correct'].groupby(by=['Diagnosis in ED'])['Length of stay_min'].mean()
kk = pd.DataFrame(kk).reset_index()
kk = kk.sort_values(by='Length of stay_min', ascending=False).head(20)

plt.figure(figsize=(20, 20))
k = sns.barplot(x=kk['Length of stay_min'], y=kk['Diagnosis in ED'], ci=None)
for p in k.patches:
    k.annotate("%.f" % p.get_width(),
               xy=(p.get_width(), p.get_y() + p.get_height() / 2),
               xytext=(5, 0), textcoords='offset points',
               ha="left", va="center")
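The age-bin/mistriage heatmap listed at the top of the article can be sketched in the same spirit. This is a hedged reconstruction on randomly generated toy data, not the notebook's actual code: the bin edges and the pd.cut/crosstab combination are my assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy stand-in: random ages and mistriage labels in place of the real dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Age': rng.integers(18, 90, size=200),
    'MistriageName': rng.choice(['Correct', 'Over Triage', 'Under Triage'], size=200),
})

# Bin the ages (edges are illustrative), then cross-tabulate against mistriage
toy['AgeBin'] = pd.cut(toy['Age'], bins=[18, 30, 45, 60, 75, 90])
table = pd.crosstab(toy['AgeBin'], toy['MistriageName'])

sns.heatmap(table, annot=True, fmt='d', cmap='Blues')
plt.title('Mistriage by Age Bin')
```

Swapping the toy frame for the cleaned data and tuning the bin edges should reproduce the heatmap described in the visualization list.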

I will share the complete notebook on my GitHub page. Don't forget to subscribe to my newsletter for notifications:

http://eepurl.com/g12Drv
