The Challenge
Semantic extraction is a natural language processing technique that identifies and extracts entities (for example, people, information, locations, and companies from news articles), facts, attributes, concepts, and events to populate metadata fields. Its purpose is to enable the analysis of unstructured enterprise content, such as text documents, emails, images, reports, and other business-critical content. Modern scientific and scholarly content services look for extractable information such as keywords, clauses, sentences, scientific claims (the core finding of an article), or paragraphs taken from PDF, Word, or HTML files. Several methods have been applied to extract such information, including rule-based linguistic approaches, statistical approaches, machine learning approaches (supervised, unsupervised, and semi-supervised), and domain-specific approaches. The chosen method should extract as much relevant, meaningful information as possible.
Opportunity
Semantic extraction can be used to extract disease codes and names from recent research papers that have already been retrieved. There are two ways of doing this: rule-based matching and machine learning.
Rule-based matching compares a list of known disease names against the text of the research paper and reports the names it finds. It fails if even a single character differs between the listed name and the text, so it is not a reliable route to a solution. With machine learning, we instead train a model on data in which disease names are labeled as diseases. After training, we test the model by giving it only the research paper and checking whether it extracts the disease names correctly.
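The brittleness of rule-based matching described above can be illustrated with a minimal sketch. The lexicon, the example sentences, and the function name find_diseases are all hypothetical and chosen only for illustration; this is not the matching code used in the project.

```python
import re

# Hypothetical lexicon of disease names (illustration only).
DISEASE_NAMES = ["pneumoconiosis", "influenza", "pneumonia"]


def find_diseases(text, names=DISEASE_NAMES):
    """Return the lexicon names that occur as whole words in the text."""
    found = []
    for name in names:
        # Whole-word, case-insensitive match of the exact name.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


# Made-up sentences: the second one has a single character missing.
clean = "Coal miners had a high rate of pneumoconiosis and pneumonia."
typo = "Coal miners had a high rate of pneumoconosis."

print(find_diseases(clean))  # both correctly spelled names are found
print(find_diseases(typo))   # the misspelled name is missed entirely
```

A single missing letter is enough for the rule to miss the disease, which is the limitation that motivates the learned model.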
The International Classification of Diseases (ICD) is a medical classification list of codes for diagnoses and procedures. ICD codes have been widely adopted by physicians and other health care providers for reimbursement and for the storage and retrieval of diagnostic information. The process of assigning ICD codes to a patient is time-consuming and error-prone. Clinical coders need to extract key information from Electronic Medical Records (EMRs) and assign the correct codes based on category, anatomic site, laterality, and severity. The volume of information and the complex code hierarchy greatly increase the difficulty. By using an ICD code as input, the corresponding diseases can be extracted in the first stage. For instance, diseases of the respiratory system are coded J00-J99:
- J00-J06 Acute upper respiratory infections
- J09-J18 Influenza and pneumonia
- J20-J22 Other acute lower respiratory infections
- J30-J39 Other diseases of the upper respiratory tract
- J40-J47 Chronic lower respiratory diseases
- J60-J70 Lung diseases due to external agents
- J80-J84 Other respiratory diseases principally affecting the interstitium
- J85-J86 Suppurative and necrotic conditions of the lower respiratory tract
- J90-J94 Other diseases of the pleura
- J95-J95 Intraoperative and postprocedural complications and disorders of the respiratory system, not elsewhere classified
- J96-J99 Other diseases of the respiratory system
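As a rough sketch of how an ICD code can serve as input, the block ranges listed above can be stored in a small table and searched by code. The table and the function name find_block are assumptions made for illustration; a real system would load the full ICD code set rather than one chapter.

```python
import re

# Block ranges copied from the list above (respiratory chapter, J00-J99).
RESPIRATORY_BLOCKS = [
    ("J00", "J06", "Acute upper respiratory infections"),
    ("J09", "J18", "Influenza and pneumonia"),
    ("J20", "J22", "Other acute lower respiratory infections"),
    ("J30", "J39", "Other diseases of the upper respiratory tract"),
    ("J40", "J47", "Chronic lower respiratory diseases"),
    ("J60", "J70", "Lung diseases due to external agents"),
    ("J80", "J84", "Other respiratory diseases principally affecting the interstitium"),
    ("J85", "J86", "Suppurative and necrotic conditions of the lower respiratory tract"),
    ("J90", "J94", "Other diseases of the pleura"),
    ("J95", "J95", "Intraoperative and postprocedural complications and disorders "
                   "of the respiratory system, not elsewhere classified"),
    ("J96", "J99", "Other diseases of the respiratory system"),
]


def find_block(code, blocks=RESPIRATORY_BLOCKS):
    """Return (range, title) for a code such as 'J60' or 'J45.9', else None."""
    category = code.strip().upper()[:3]  # keep the three-character category
    if not re.fullmatch(r"[A-Z]\d{2}", category):
        return None
    for start, end, title in blocks:
        # Same letter and two digits, so plain string comparison is enough.
        if start <= category <= end:
            return (f"{start}-{end}", title)
    return None


print(find_block("J60"))
```

Codes that fall in a gap between blocks, or outside the chapter, return None in this sketch.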
For example, J60 gives coal worker's pneumoconiosis. Subsequently, we can code based on study design, followed by target population (such as coal miners) and risk factors. The next step is to search for recent research papers using Python and to build a model. A deep learning model based on a bidirectional LSTM was built; the workflow covered creating word and tag dictionaries, preparing training and test sets, extracting features, training the bidirectional LSTM model, and predicting on the test set.
Research papers were gathered from the PubMed website, which contains more than 30 million citations for biomedical literature from MEDLINE, life science journals, and online books. The information available for each paper includes the title, abstract, date of journal publication, author information, copyright, keywords from the journal, methods and results, DOI, and a link to the paper if it is open to all users.
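The word and tag dictionary step can be sketched in plain Python. Everything here is an assumption for illustration: the two tagged sentences are made up, the B-DIS / I-DIS / O labels are an assumed BIO tagging scheme, and the helper names build_index and encode are hypothetical. The bidirectional LSTM itself is omitted; in a framework such as Keras, the integer sequences produced here would typically feed an embedding layer followed by the recurrent layers.

```python
# Hypothetical tagged sentences; B-DIS / I-DIS / O is an assumed BIO scheme.
SENTENCES = [
    [("Coal", "B-DIS"), ("worker's", "I-DIS"), ("pneumoconiosis", "I-DIS"),
     ("affects", "O"), ("miners", "O")],
    [("Influenza", "B-DIS"), ("spreads", "O"), ("quickly", "O")],
]

PAD, UNK = "<PAD>", "<UNK>"


def build_index(items, specials):
    """Map each distinct item to an integer, reserving the first ids for specials."""
    index = {token: i for i, token in enumerate(specials)}
    for item in items:
        if item not in index:
            index[item] = len(index)
    return index


def encode(sentence, word2idx, tag2idx, max_len):
    """Turn one tagged sentence into two padded integer sequences."""
    words = [word2idx.get(w.lower(), word2idx[UNK]) for w, _ in sentence][:max_len]
    tags = [tag2idx[t] for _, t in sentence][:max_len]
    pad = max_len - len(words)
    return words + [word2idx[PAD]] * pad, tags + [tag2idx[PAD]] * pad


# Word dictionary (lower-cased) and tag dictionary built from the training data.
word2idx = build_index((w.lower() for s in SENTENCES for w, _ in s), [PAD, UNK])
tag2idx = build_index((t for s in SENTENCES for _, t in s), [PAD])

word_ids, tag_ids = encode(SENTENCES[1], word2idx, tag2idx, max_len=5)
print(word_ids)
print(tag_ids)
```

Reserving index 0 for padding keeps all sequences the same length, and the unknown-word index lets the model handle words that never appeared in the training set.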
Paper Extraction Process
Code and Output
We first connect to PubMed to retrieve information for our search, with the results sorted by most recent. The output of this search is an XML file.
From this output, we printed a snippet of the title and abstract of a recent research paper.
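The original code is not reproduced here, so the following is only a sketch of one way to make such a request, assuming the public NCBI E-utilities endpoints (esearch to find recent article IDs, efetch to download the records as XML). The function names and the example search term are hypothetical; no network call is made unless fetch_xml is invoked.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities service (assumed access route).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=5):
    """Build a PubMed search URL with results sorted by publication date."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": retmax,
        "sort": "pub_date",  # most recent publications first
        "retmode": "xml",
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Build a URL that fetches full records (as XML) for a list of PubMed IDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def fetch_xml(url, timeout=30):
    """Download a URL and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example search term is illustrative; call fetch_xml(...) on the URL to run it.
print(esearch_url("pneumoconiosis"))
```

Keeping URL construction separate from the download makes the query easy to inspect before sending it, and lets the search term be swapped for any disease name obtained from an ICD code.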
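Extracting the title and abstract from PubMed XML might look like the sketch below. The XML sample is a hand-written placeholder that follows the PubMed record layout (PubmedArticle, ArticleTitle, AbstractText); it is not real output, and the function name titles_and_abstracts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder record in the PubMed XML layout (not real data).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>Placeholder title about coal worker's pneumoconiosis</ArticleTitle>
        <Abstract>
          <AbstractText Label="METHODS">Placeholder methods text.</AbstractText>
          <AbstractText Label="RESULTS">Placeholder results text.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def titles_and_abstracts(xml_text):
    """Return a list of {'title', 'abstract'} dicts, one per article."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        title_el = article.find(".//ArticleTitle")
        # itertext() also collects text inside inline markup such as <i>.
        title = "".join(title_el.itertext()).strip() if title_el is not None else ""
        # Structured abstracts have several AbstractText sections; join them.
        parts = ["".join(el.itertext()).strip() for el in article.iter("AbstractText")]
        records.append({"title": title, "abstract": " ".join(parts)})
    return records


for record in titles_and_abstracts(SAMPLE_XML):
    print(record["title"])
    print(record["abstract"][:80])  # short snippet of the abstract
```

Articles without an abstract simply yield an empty string here, so downstream extraction can skip them.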
Why Guires
Guires Data Analytics' mission is to democratize AI for healthcare industries, mainly those involving modern scientific and scholarly content services such as Chemical Abstracts Service for chemistry-related articles, Web of Knowledge, CiteSeer.IST, and DBLP. Our team of data science experts uses the power of AI to solve business and social challenges, and we specialize in the automatic extraction of semantic information from a wide range of digital resources, using metadata extraction, document summarization, and keyword extraction techniques. We apply a wide range of techniques and approaches to extract single or multiple claims from a scientific article, using text mining methods such as text clustering, association rule extraction, the K-means algorithm, information visualization, and word clouds, followed by ML approaches such as least-squares support vector machines.
Guires offers innovative solutions:
- Guires' text mining and machine learning approach helps you build and deploy text mining to improve business processes.
- Guires deploys semantic data modeling as a layer in your knowledge-centric architecture by virtually integrating your enterprise data.
Get semantic extraction working for you. Contact a Guires expert.