# Part 2 - Data Creation for Free Doctor

In this section we create the dataset: we download the raw data, clean it, and build a data frame.

First, let us download the online datasets we will work with.

The English MedDialog dataset contains conversations between doctors and patients. It has 0.26 million dialogues. The data is continuously growing and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. All copyrights of the data belong to healthcaremagic.com and icliniq.com.

In [None]:
#!pip install gdown

In [6]:
import gdown

In [7]:
url="https://drive.google.com/drive/folders/1-5mQW2gNj_kcBobllL9EpbJcUcT5aFpE?usp=sharing"

In [8]:
gdown.download_folder(url, quiet=True, use_cookies=False)

['C:\\Users\\rusla\\Dropbox\\23-GITHUB\\Projects\\Free-Doctor-with-Artificial-Intelligence\\2-Data\\Medical-Dialogue-System\\dialogue_0.txt',
 'C:\\Users\\rusla\\Dropbox\\23-GITHUB\\Projects\\Free-Doctor-with-Artificial-Intelligence\\2-Data\\Medical-Dialogue-System\\dialogue_1.txt',
 'C:\\Users\\rusla\\Dropbox\\23-GITHUB\\Projects\\Free-Doctor-with-Artificial-Intelligence\\2-Data\\Medical-Dialogue-System\\dialogue_2.txt',
 'C:\\Users\\rusla\\Dropbox\\23-GITHUB\\Projects\\Free-Doctor-with-Artificial-Intelligence\\2-Data\\Medical-Dialogue-System\\dialogue_3.txt',
 'C:\\Users\\rusla\\Dropbox\\23-GITHUB\\Projects\\Free-Doctor-with-Artificial-Intelligence\\2-Data\\Medical-Dialogue-System\\dialogue_4.txt']

There are 5 raw dialogue files that we will process to create the working dataset.
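
Before processing, it can help to confirm that all five files actually arrived. The following is a minimal sketch, not part of the original notebook: `list_dialogue_files` is a hypothetical helper, and the temporary folder only stands in for the real download directory.

```python
import tempfile
from pathlib import Path

def list_dialogue_files(folder):
    """Return the sorted names of the dialogue_*.txt files found in folder."""
    return sorted(p.name for p in Path(folder).glob("dialogue_*.txt"))

# Demonstration on a temporary folder standing in for the download directory
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        (Path(tmp) / f"dialogue_{i}.txt").write_text("", encoding="utf-8")
    found = list_dialogue_files(tmp)

print(found)
```

In the notebook itself, the folder to check would presumably be `Medical-Dialogue-System` under the current working directory, since that is where the later cells read the files from.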

We are going to create a dataset with the following schema:

- Description - String
- Patient - String
- Doctor - String

First we convert each dialogue text to a JSON-like key-value structure.
Then we create the pandas dataframes.
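
As an illustration of the text-to-JSON step, here is a minimal sketch under stated assumptions. The `sample` record is hypothetical and only assumes that each section starts on its own line with one of the keywords `Description`, `Dialogue`, `Patient:` or `Doctor:`; `record_to_dict` is an illustrative helper, not a function from the notebook.

```python
import json
import re

# Hypothetical sample record, shaped like one entry of the raw dialogue files
sample = """Description
Q. What does abutment of the nerve root mean?
Dialogue
Patient:
Hi doctor, I am just wondering what abutting means.
Doctor:
Hi. I have gone through your query.
"""

def record_to_dict(text):
    """Split a record at its section keywords and map each keyword to its text."""
    # The lookahead keeps the keyword at the start of each resulting section
    sections = re.split(r"\n(?=Description|Dialogue|Patient|Doctor)", text.strip())
    record = {}
    for section in sections:
        parts = section.split(None, 1)
        if len(parts) == 2:  # skip bare keywords such as "Dialogue"
            key, value = parts
            record[key.rstrip(":")] = value.strip().replace("\n", " ")
    return record

record = record_to_dict(sample)
print(json.dumps(record, indent=2))
```

The resulting dictionary has exactly the three keys of the schema, so it can be passed to `pd.DataFrame` as a single row.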

In [57]:
# Importing modules
import os
from pathlib import Path
import pandas as pd
import json
import re

In [14]:
from tqdm import tqdm
from tools import timer
t = timer.Timer()

In [2]:
def split_content(filename):
    '''
    filename:  The filename must be txt format and stored in the 
               ./2-Data/Medical-Dialogue-System/ folder
    res: The output is the list of all dialogues separated in each file.
    '''
    #to get the current working directory
    path = os.getcwd()
    file = os.path.join(path, "Medical-Dialogue-System", filename)
    subdirectory=filename.replace(".txt","")
    #creating a new directory called data
    out_dir=os.path.join(path, "data",subdirectory)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    out_n = 0
    done = False
    try:   
        with open(file, encoding="utf-8") as in_file:
            while not done: #loop over output file names
                # Join various path components
                name=f"out{out_n}.txt"
                file_tmp=os.path.join(path, "data", subdirectory, name)
                #print(file_tmp)
                with open(file_tmp, "w", encoding="utf-8") as out_file: #generate an output file name
                    while not done: #loop over lines in the input file and write to the output file
                        try:
                            line = next(in_file).strip() #strip whitespace for consistency
                        except StopIteration:
                            done = True
                            break
                        if "id=" in line: #more robust than 'if line == "SPLIT\n":'
                            break
                        else:
                            out_file.write(line + '\n') #must add back in newline because we stripped it out earlier 
                    out_n += 1 #increment output file name integer
     
    except Exception as error:
        print("An error occurred while opening the dialogue file:", error)
    # List to store the names of the files written to out_dir
    res = []
    for (_, _, file_names) in os.walk(out_dir):
        res.extend(file_names)
    return res
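
The splitting logic can be sketched in memory, without touching the file system. This is a simplified illustration, assuming (as the function above does) that every record begins with a line containing `id=`; `split_records` and the `raw` lines are hypothetical.

```python
def split_records(lines):
    """Group lines into records, closing the current record at each 'id=' line."""
    records, current = [], []
    for line in lines:
        line = line.strip()
        if "id=" in line:
            # The separator line itself is dropped, as in split_content
            records.append(current)
            current = []
        else:
            current.append(line)
    records.append(current)
    return records

# Hypothetical input lines standing in for a raw dialogue file
raw = [
    "id=1", "Description", "Q. first question",
    "id=2", "Description", "Q. second question",
]
records = split_records(raw)
print(records)
```

Because the input starts with a separator line, the first record is empty. This would be consistent with the empty `out0.txt` row that appears in the dataframes later on.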

In [3]:

def findword(text, word):
    # Return a match object if word occurs in text, otherwise None
    m = re.search(word, text)
    return m

In [4]:
def create_dataframe(text_as_string, name_partial):
    # Remove URLs before parsing
    string = re.sub(r'https?://\S+', '', text_as_string)
    keywords = {'Description', 'Dialogue', 'Patient:', 'Doctor:'}
    # Split at each newline that is followed by a section keyword
    sections = re.split(r'\n(?=Description|Dialogue|Patient|Doctor)', string)
    updated_dic = {}
    for section in sections:
        # Skip sections that contain none of the keywords
        if not any(findword(section, word) for word in keywords):
            continue
        try:
            # The first token is the field name, the rest is its content
            command, content = section.strip().split(None, 1)
        except ValueError:
            # A lone keyword such as "Dialogue" has no content to store
            continue
        command = command.replace(":", "")
        content = content.strip().replace("\n", " ")
        updated_dic[command] = content
    # One row per dialogue, indexed by the name of the partial file
    df = pd.DataFrame(updated_dic, index=[name_partial])
    return df

In [5]:
def create(filename):
    '''
    filename:  The filename must be txt format and stored in the 
               ./2-Data/Medical-Dialogue-System/ folder
    df: The output is a dataframe
    '''
    # Get the current working directory
    path = os.getcwd()
    res = split_content(filename)
    subdirectory = filename.replace(".txt", "")
    # Collect one single-row dataframe per dialogue file
    frames = []
    for name_partial in res:
        file_partial = os.path.join(path, "data", subdirectory, name_partial)
        with open(file_partial, encoding="utf-8") as f:
            text_as_string = f.read()
        frames.append(create_dataframe(text_as_string, name_partial))
    # Concatenate once at the end; growing a dataframe inside the loop
    # copies all previous rows on every iteration
    df = pd.concat(frames) if frames else pd.DataFrame()
    return df

In [6]:
def create_csv(filename):
    print("Creating dataframe ...")
    dfa=create(filename)
    dfa=dfa.reset_index(names="Filename")
    file_name=filename.replace(".txt",".csv")
    path = os.getcwd()
    out_dir=os.path.join(path, "data", "csv")
    out_file=os.path.join(out_dir,file_name)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    dfa.to_csv(out_file, sep='\t', encoding='utf-8', index=False)
    df = pd.read_csv(out_file, sep = '\t')
    print("File created: ",out_file)
    return df

In [7]:
filename="test.txt"
#filename="dialogue_0.txt"
create_csv(filename)

Creating dataframe ...
File created:  C:\Users\rusla\Dropbox\23-GITHUB\Projects\Free-Doctor-with-Artificial-Intelligence\2-Data\data\csv\test.csv


Unnamed: 0,Filename,Description,Patient,Doctor
0,out0.txt,,,
1,out1.txt,Q. What does abutment of the nerve root mean?,"Hi doctor,I am just wondering what is abutting...",Hi. I have gone through your query with dilige...
2,out2.txt,"Q. Every time I eat spicy food, I poop blood. ...","Hi doctor, I am a 26 year old male. I am 5 fee...",Hello. I have gone through your information an...
3,out3.txt,Q. Will Nano-Leo give permanent solution for e...,"Hello doctor, I am 48 years old. I am experien...",Hi. For further doubts consult a sexologist on...


We select the list of raw files from which to create the dataframes.

In [17]:
filenames=["dialogue_0.txt",
           "dialogue_1.txt",
           "dialogue_2.txt",
           "dialogue_3.txt",
           "dialogue_4.txt"]
#filenames=[filename]

We create one dataframe, saved as a CSV file, per raw file.

In [18]:
t.start()
for filename in tqdm(filenames):
    create_csv(filename)
    print("Done")
t.stop()

  0%|                                                                                            | 0/5 [00:00<?, ?it/s]

Creating dataframe ...


 20%|████████████████▌                                                                  | 1/5 [03:48<15:13, 228.44s/it]

File created:  C:\Users\rusla\Dropbox\23-GITHUB\Projects\Free-Doctor-with-Artificial-Intelligence\2-Data\data\csv\dialogue_0.csv
Done
Creating dataframe ...


 40%|█████████████████████████████████▏                                                 | 2/5 [08:57<13:47, 275.77s/it]

File created:  C:\Users\rusla\Dropbox\23-GITHUB\Projects\Free-Doctor-with-Artificial-Intelligence\2-Data\data\csv\dialogue_1.csv
Done
Creating dataframe ...


 60%|█████████████████████████████████████████████████▊                                 | 3/5 [36:57<30:33, 916.88s/it]

File created:  C:\Users\rusla\Dropbox\23-GITHUB\Projects\Free-Doctor-with-Artificial-Intelligence\2-Data\data\csv\dialogue_2.csv
Done
Creating dataframe ...


 80%|████████████████████████████████████████████████████████████████                | 4/5 [1:00:39<18:36, 1116.54s/it]

File created:  C:\Users\rusla\Dropbox\23-GITHUB\Projects\Free-Doctor-with-Artificial-Intelligence\2-Data\data\csv\dialogue_3.csv
Done
Creating dataframe ...


100%|█████████████████████████████████████████████████████████████████████████████████| 5/5 [1:04:45<00:00, 777.07s/it]

File created:  C:\Users\rusla\Dropbox\23-GITHUB\Projects\Free-Doctor-with-Artificial-Intelligence\2-Data\data\csv\dialogue_4.csv
Done
Elapsed time: 3885.3336 seconds





In [61]:
def merge():
    print("Merging dataframes ...")
    path = os.getcwd()
    dir_path=os.path.join(path, "data", "csv")
    # Keep only the CSV files; this also skips folders such as .ipynb_checkpoints
    csvs = [f for f in os.listdir(dir_path) if f.endswith(".csv")]
    filepaths=[os.path.join(dir_path,s)  for s in csvs]
    df = pd.concat([pd.read_csv(f,  sep = '\t', encoding='utf-8') for f in  filepaths], ignore_index=True)
    #Saving final dataframe
    out_dir=os.path.join(path, "data", "final")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    print("Saving dataframe ...")
    out_file=os.path.join(path, "data", "final", "dialogues.csv")
    df.to_csv(out_file, sep='\t', encoding='utf-8', index=False)
    print(out_file)
    print("Done!")
    return df


In [62]:
df= merge()

Merging dataframes ...
Saving dataframe ...
C:\Users\rusla\Dropbox\23-GITHUB\Projects\Free-Doctor-with-Artificial-Intelligence\2-Data\data\final\dialogues.csv
Done!


In [63]:
dialogues_path=os.path.join(os.getcwd(), "data", "final", "dialogues.csv")

In [65]:
df=pd.read_csv(dialogues_path,  sep = '\t', encoding='utf-8')

In [66]:
df.shape

(257492, 4)

In [67]:
df.head()

Unnamed: 0,Filename,Description,Patient,Doctor
0,out0.txt,,,
1,out1.txt,Q. What does abutment of the nerve root mean?,"Hi doctor,I am just wondering what is abutting...",Hi. I have gone through your query with dilige...
2,out10.txt,Q. What should I do to reduce my weight gained...,"Hi doctor, I am a 22-year-old female who was d...",Hi. You have really done well with the hypothy...
3,out100.txt,Q. I have started to get lots of acne on my fa...,Hi doctor! I used to have clear skin but since...,Hi there Acne has multifactorial etiology. Onl...
4,out1000.txt,Q. Can vitamin D3 deficiency cause inflammatio...,Vitamin d3 deficiency (11 units).....consuming...,


In [68]:
df.tail(1)['Patient'].values

array(['iam having hairfall for a decade.. but fews weeks its getting worse.. recently taken blood test in which my iron and D3 are low... doctor has prescribed me with D3 60000iu once in a week and Livogen. i would like to know if biotin supplements are required to stop hair fall. if so pls recommned the brand names also.'],
      dtype=object)

In [69]:
df.tail(1)['Doctor'].values

array(["you did'nt mention about thyroid problem ...usually iron deficiency can cause hairloss ...also not mentioning about dandruff ...so keep your scalp clean ...avoid dandruff take iron tab ...takee mor iron rich foods like leafy vegetables..better reduce spicy and salty food ...take only soft food ..dont use hot water in hair...take less oil but maximum massage ...our oil neelibhringadi is good for growing hair ...do protein treatment also ...dont use hair colours ,regular use of shampoo avoid...thankyou"],
      dtype=object)

# Cleaning Dataframe


In this part we separate the rows that contain NaN values from the rest of the training dataset.

In [104]:
df.isnull().any(axis=1)

0          True
1         False
2         False
3         False
4          True
          ...  
257487    False
257488    False
257489    False
257490    False
257491    False
Length: 257492, dtype: bool

In [108]:
df2= df[df.isnull().any(axis=1)]

In [110]:
df2

Unnamed: 0,Filename,Description,Patient,Doctor
0,out0.txt,,,
4,out1000.txt,Q. Can vitamin D3 deficiency cause inflammatio...,Vitamin d3 deficiency (11 units).....consuming...,
225,out102.txt,Q. Why has my father's swollen ankle turned da...,"My father, Male, 77 years old with swollen ank...",
1214,out1109.txt,Q. I have run out of Seroflo 250 inhaler that ...,"Hi, firstly i would like to thank for this won...",
1292,out1116.txt,"Q. My mother has severe heart problem, and her...",Age: 62 years My mother has severe heart probl...,
...,...,...,...,...
255610,out8304.txt,Suggest ways to obtain a flawless skin,,Hello. Thank you for writing to usThis cream i...
255907,out8572.txt,Is Melas cream effective for acne scars?,,Hello and welcome to healthcaremagic.Melas cre...
255986,out8643.txt,,"Hi Doctor,I am taking Kaya's treatment for alm...","Hi, Welcome to HCM. you should have followed y..."
256061,out8710.txt,"Chicken pox scars on face, body. Taking Vitami...",,"hello and welcome to HCM forum dilusreni, I am..."


In [111]:
null_mask = df.isnull().any(axis=1)
null_rows = df[null_mask]

In [112]:
null_rows

Unnamed: 0,Filename,Description,Patient,Doctor
0,out0.txt,,,
4,out1000.txt,Q. Can vitamin D3 deficiency cause inflammatio...,Vitamin d3 deficiency (11 units).....consuming...,
225,out102.txt,Q. Why has my father's swollen ankle turned da...,"My father, Male, 77 years old with swollen ank...",
1214,out1109.txt,Q. I have run out of Seroflo 250 inhaler that ...,"Hi, firstly i would like to thank for this won...",
1292,out1116.txt,"Q. My mother has severe heart problem, and her...",Age: 62 years My mother has severe heart probl...,
...,...,...,...,...
255610,out8304.txt,Suggest ways to obtain a flawless skin,,Hello. Thank you for writing to usThis cream i...
255907,out8572.txt,Is Melas cream effective for acne scars?,,Hello and welcome to healthcaremagic.Melas cre...
255986,out8643.txt,,"Hi Doctor,I am taking Kaya's treatment for alm...","Hi, Welcome to HCM. you should have followed y..."
256061,out8710.txt,"Chicken pox scars on face, body. Taking Vitami...",,"hello and welcome to HCM forum dilusreni, I am..."


In [113]:
not_null_mask = df.notnull().all(axis=1)
not_null_rows = df[not_null_mask]
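
The two masks are complementary, which a toy example can make explicit. This is a minimal sketch with made-up rows; the names `toy`, `complete` and `incomplete` are illustrative only.

```python
import pandas as pd

# Toy frame with the same columns as the dialogue dataset (made-up values)
toy = pd.DataFrame({
    "Description": ["Q. first", None, "Q. third"],
    "Patient": ["text a", "text b", "text c"],
    "Doctor": ["reply a", "reply b", None],
})

# True for every row that has at least one missing value
toy_null_mask = toy.isnull().any(axis=1)

incomplete = toy[toy_null_mask]
# .copy() gives an independent frame that can be modified safely later
complete = toy[~toy_null_mask].copy()

print(len(complete), len(incomplete))
```

Selecting with the negated mask gives the same rows as `toy.dropna()`, so either form could be used for the complete rows.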

In [114]:
not_null_rows

Unnamed: 0,Filename,Description,Patient,Doctor
1,out1.txt,Q. What does abutment of the nerve root mean?,"Hi doctor,I am just wondering what is abutting...",Hi. I have gone through your query with dilige...
2,out10.txt,Q. What should I do to reduce my weight gained...,"Hi doctor, I am a 22-year-old female who was d...",Hi. You have really done well with the hypothy...
3,out100.txt,Q. I have started to get lots of acne on my fa...,Hi doctor! I used to have clear skin but since...,Hi there Acne has multifactorial etiology. Onl...
5,out10000.txt,Q. Why do I have uncomfortable feeling between...,"Hello doctor,I am having an uncomfortable feel...",Hello. The popping and discomfort what you fel...
6,out10001.txt,Q. My symptoms after intercourse threatns me e...,"Hello doctor,Before two years had sex with a c...",Hello. The HIV test uses a finger prick blood ...
...,...,...,...,...
257487,out9995.txt,Why is hair fall increasing while using Bontre...,I am suffering from excessive hairfall. My doc...,"Hello Dear Thanks for writing to us, we are he..."
257488,out9996.txt,Why was I asked to discontinue Androanagen whi...,"Hi Doctor, I have been having severe hair fall...","hello, hair4u is combination of minoxid..."
257489,out9997.txt,Can Mintop 5% Lotion be used by women for seve...,Hi..i hav sever hair loss problem so consulted...,HI I have evaluated your query thoroughly you...
257490,out9998.txt,Is Minoxin 5% lotion advisable instead of Foli...,"Hi, i am 25 year old girl, i am having massive...",Hello and Welcome to ‘Ask A Doctor’ service.I ...


In [115]:
not_null_rows = not_null_rows.drop(columns='Filename')


In [116]:
not_null_rows

Unnamed: 0,Description,Patient,Doctor
1,Q. What does abutment of the nerve root mean?,"Hi doctor,I am just wondering what is abutting...",Hi. I have gone through your query with dilige...
2,Q. What should I do to reduce my weight gained...,"Hi doctor, I am a 22-year-old female who was d...",Hi. You have really done well with the hypothy...
3,Q. I have started to get lots of acne on my fa...,Hi doctor! I used to have clear skin but since...,Hi there Acne has multifactorial etiology. Onl...
5,Q. Why do I have uncomfortable feeling between...,"Hello doctor,I am having an uncomfortable feel...",Hello. The popping and discomfort what you fel...
6,Q. My symptoms after intercourse threatns me e...,"Hello doctor,Before two years had sex with a c...",Hello. The HIV test uses a finger prick blood ...
...,...,...,...
257487,Why is hair fall increasing while using Bontre...,I am suffering from excessive hairfall. My doc...,"Hello Dear Thanks for writing to us, we are he..."
257488,Why was I asked to discontinue Androanagen whi...,"Hi Doctor, I have been having severe hair fall...","hello, hair4u is combination of minoxid..."
257489,Can Mintop 5% Lotion be used by women for seve...,Hi..i hav sever hair loss problem so consulted...,HI I have evaluated your query thoroughly you...
257490,Is Minoxin 5% lotion advisable instead of Foli...,"Hi, i am 25 year old girl, i am having massive...",Hello and Welcome to ‘Ask A Doctor’ service.I ...


We save the rows without null values for the third step, modeling.

In [117]:
not_null_rows.to_csv("dialogues.csv", sep='\t', encoding='utf-8', index=False)