Knowledge Extraction from Scientific Research Articles using Semantic Extraction: Explore a Use Case and Perspective

The Challenges

Semantic extraction is a natural language processing technique that identifies and extracts entities (for example, people, locations, and companies mentioned in news articles), facts, attributes, concepts, and events in order to populate metadata fields. Its purpose is to enable the analysis of unstructured enterprise content such as text documents, emails, images, reports, and other business-critical material. Modern scientific and scholarly content services look for extracted information such as keywords, clauses, sentences, scientific claims (the core findings of an article), or paragraphs taken from PDF, Word, or HTML files. Several methods have been applied to extract this information, including rule-based linguistic approaches, statistical approaches, machine learning approaches (supervised, unsupervised, and semi-supervised), and domain-specific approaches. The challenge is to choose a method that can extract a large amount of relevant, meaningful information.

Opportunity

The use case here is the semantic extraction of disease codes and names from recent research papers that have already been retrieved. There are two ways of doing this: rule-based matching and machine learning.

Rule-based matching compares a list of known disease names against the text of a research paper and reports the names that match. A strict match fails as soon as a single character differs, for example because of a typo or a spelling variant, so it is not the better option for our solution. With machine learning, we first train a model on text in which disease names are labelled as diseases. After training, we test the model by giving it only the research paper and checking whether it extracts the disease names correctly.
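
The weakness of strict matching can be illustrated with a short Python sketch. This is not the code used in the project: the disease list, the sample sentence (with a deliberate misspelling), the function names, and the 0.85 similarity cutoff are all assumptions chosen for illustration. It contrasts a verbatim substring match with an approximate match from the standard-library difflib module.

```python
import difflib

# Hypothetical mini-dictionary of disease names (illustration only).
DISEASE_NAMES = ["pneumoconiosis", "influenza", "pneumonia", "asthma"]


def exact_matches(text, names=DISEASE_NAMES):
    """Return the dictionary names that appear verbatim in the text."""
    lowered = text.lower()
    return [name for name in names if name in lowered]


def fuzzy_matches(text, names=DISEASE_NAMES, cutoff=0.85):
    """Return the dictionary names that approximately match a word in the text."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    found = []
    for word in words:
        # get_close_matches ranks candidates by SequenceMatcher similarity
        # and keeps only those at or above the cutoff.
        for hit in difflib.get_close_matches(word, names, n=1, cutoff=cutoff):
            if hit not in found:
                found.append(hit)
    return found


# "pneumoconisis" is misspelled on purpose (one character is missing).
sentence = "Coal miners showed a high rate of pneumoconisis and asthma."
print("exact:", exact_matches(sentence))
print("fuzzy:", fuzzy_matches(sentence))
```

The strict matcher misses the misspelled name, while the approximate matcher recovers it. Approximate matching only works word by word here, so multi-word disease names would need a different tokenisation, which is one reason a trained model may be preferable.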

The International Classification of Diseases (ICD) is a medical classification list of codes for diagnoses and procedures. ICD codes have been widely adopted by physicians and other health care providers for reimbursement and for the storage and retrieval of diagnostic information. Assigning ICD codes to a patient visit is time-consuming and error-prone: clinical coders need to extract key information from Electronic Medical Records (EMRs) and assign the correct codes based on category, anatomic site, laterality, and severity. The amount of information and the complex code hierarchy greatly increase the difficulty. In the first stage, an ICD code is supplied as input and the corresponding disease names are extracted. For instance, diseases of the respiratory system are coded J00-J99:

J00-J06 Acute upper respiratory infections
J09-J18 Influenza and pneumonia
J20-J22 Other acute lower respiratory infections
J30-J39 Other diseases of the upper respiratory tract
J40-J47 Chronic lower respiratory diseases
J60-J70 Lung diseases due to external agents
J80-J84 Other respiratory diseases principally affecting the interstitium
J85-J86 Suppurative and necrotic conditions of the lower respiratory tract
J90-J94 Other diseases of the pleura
J95-J95 Intraoperative and postprocedural complications and disorders of the respiratory system, not elsewhere classified
J96-J99 Other diseases of the respiratory system
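
The lookup from a code to its block in the list above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names RESPIRATORY_BLOCKS and find_block are ours, the table covers only the respiratory chapter shown above, and the pattern assumes codes of the form letter plus two digits with an optional decimal extension.

```python
import re

# Blocks copied from the list above (respiratory chapter, J00-J99).
RESPIRATORY_BLOCKS = [
    ("J00", "J06", "Acute upper respiratory infections"),
    ("J09", "J18", "Influenza and pneumonia"),
    ("J20", "J22", "Other acute lower respiratory infections"),
    ("J30", "J39", "Other diseases of the upper respiratory tract"),
    ("J40", "J47", "Chronic lower respiratory diseases"),
    ("J60", "J70", "Lung diseases due to external agents"),
    ("J80", "J84", "Other respiratory diseases principally affecting the interstitium"),
    ("J85", "J86", "Suppurative and necrotic conditions of the lower respiratory tract"),
    ("J90", "J94", "Other diseases of the pleura"),
    ("J95", "J95", "Intraoperative and postprocedural complications and disorders "
                   "of the respiratory system, not elsewhere classified"),
    ("J96", "J99", "Other diseases of the respiratory system"),
]


def find_block(code, blocks=RESPIRATORY_BLOCKS):
    """Return (range, title) for a code such as 'J60' or 'J45.909', or None."""
    match = re.fullmatch(r"([A-Z])(\d{2})(?:\.\w+)?", code.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognised code format: {code!r}")
    letter, number = match.group(1), int(match.group(2))
    for start, end, title in blocks:
        # Compare the letter and the two-digit category number with the block bounds.
        if letter == start[0] and int(start[1:]) <= number <= int(end[1:]):
            return f"{start}-{end}", title
    return None  # the code falls in a gap or in another chapter


print(find_block("J60"))
```

A code that falls between blocks, such as J50, returns None rather than raising an error, so the caller can decide how to treat unassigned ranges.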

For example, J60 gives coal worker's pneumoconiosis. Subsequently, we can code papers by study design, followed by target population (such as coal miners) and risk factors. The next step is to search for recent research papers using Python and to build a model. A deep learning model based on a bidirectional LSTM was built. The steps were creating the word and tag dictionaries, preparing the training and test sets, extracting features, training the bidirectional LSTM model, and predicting on the test set.
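
The first of these steps, creating the word and tag dictionaries and turning sentences into fixed-length index sequences, might look like the sketch below. It is an illustration only, not the project's code: the two labelled sentences, the BIO-style tag names (B-DISEASE, I-DISEASE, O), and the helper names are assumptions, and the bidirectional LSTM itself is left out because it depends on the deep learning library chosen.

```python
# Hypothetical token-level training data with BIO-style tags (illustration only).
TRAIN_SENTENCES = [
    [("Coal", "B-DISEASE"), ("worker's", "I-DISEASE"),
     ("pneumoconiosis", "I-DISEASE"), ("affects", "O"), ("miners", "O")],
    [("Influenza", "B-DISEASE"), ("spreads", "O"), ("quickly", "O")],
]

PAD, UNK = "<PAD>", "<UNK>"  # padding token and unknown-word token


def build_dictionaries(sentences):
    """Map every lower-cased word and every tag to an integer index."""
    word2idx = {PAD: 0, UNK: 1}
    tag2idx = {PAD: 0}
    for sentence in sentences:
        for word, tag in sentence:
            word2idx.setdefault(word.lower(), len(word2idx))
            tag2idx.setdefault(tag, len(tag2idx))
    return word2idx, tag2idx


def encode(tokens, word2idx, max_len):
    """Convert tokens to indices, truncating or padding to max_len."""
    ids = [word2idx.get(t.lower(), word2idx[UNK]) for t in tokens][:max_len]
    return ids + [word2idx[PAD]] * (max_len - len(ids))


word2idx, tag2idx = build_dictionaries(TRAIN_SENTENCES)
# "children" never appeared in training, so it maps to the unknown-word index.
print(encode(["Influenza", "affects", "children"], word2idx, max_len=5))
print(tag2idx)
```

The padded index sequences are the kind of input an embedding layer followed by a bidirectional LSTM would consume, with the tag indices serving as the per-token training targets.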

Research papers are scraped from the PubMed website, which contains more than 30 million citations for biomedical literature from MEDLINE, life science journals, and online books. The fields collected for each paper are the title, abstract, date of journal publication, author information, copyright, keywords, methods and results, DOI, and a link to the paper if it is open to all users.

Paper Extraction Process

Code and Output

We first need to connect to the PubMed site to retrieve information for our search. The search results are sorted by most recent, and the output of the search is an XML file.
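
One way to run such a search is through NCBI's public E-utilities ESearch endpoint rather than by scraping the web pages. The sketch below takes that route as an assumption; it is not the original code for this step. The function names and the example search term are ours, while db, term, retmax, sort, and retmode are documented ESearch parameters.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, max_results=20):
    """Build an ESearch URL that asks PubMed for the newest matching records as XML."""
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "sort": "pub_date",  # newest publications first
        "retmode": "xml",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def search_pubmed(term, max_results=20):
    """Run the search and return the raw XML response (needs network access)."""
    with urlopen(build_search_url(term, max_results), timeout=30) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call search_pubmed(...) to send the request.
print(build_search_url("pneumoconiosis", max_results=5))
```

The XML returned by ESearch contains PubMed identifiers, not the articles themselves. A second request to the EFetch endpoint with those identifiers is needed to obtain titles and abstracts.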

From this output, we print a snippet of the title and abstract of the recent research papers.
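
Pulling the title and abstract out of the XML could be sketched as follows. The sample record is hand-written and hypothetical: it only imitates the element names used in PubMed article XML (PubmedArticle, ArticleTitle, AbstractText), and the function name and snippet length are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written record shaped like PubMed article XML (not real data).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>Example title about pneumoconiosis</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_records(xml_text, snippet_len=200):
    """Return a list of {'title', 'abstract'} dicts, one per article in the XML."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        title_node = article.find(".//ArticleTitle")
        # itertext() also collects text nested inside inline markup such as <i>.
        title = "".join(title_node.itertext()).strip() if title_node is not None else ""
        parts = ["".join(node.itertext()).strip()
                 for node in article.iterfind(".//Abstract/AbstractText")]
        abstract = " ".join(p for p in parts if p)
        records.append({"title": title, "abstract": abstract[:snippet_len]})
    return records


for record in extract_records(SAMPLE_XML):
    print(record["title"])
    print(record["abstract"])
```

Structured abstracts are split across several AbstractText elements, so the sketch joins them before cutting the snippet. Articles without an abstract simply yield an empty string.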

Why Guires

The mission of Guires Data Analytics is to democratize AI for healthcare industries, mainly those that involve modern scientific and scholarly content services such as Chemical Abstracts Service for chemistry-related articles, Web of Knowledge, CiteSeer.IST, and DBLP. Our team of data science experts uses the power of AI to solve business and social challenges. We automatically extract semantic information from a wide range of digital resources using metadata extraction, document summarization, and keyword extraction techniques. We apply a wide range of techniques to extract single or multiple claims from a scientific article, using text mining methods such as text clustering, association rule extraction, the K-means algorithm, information visualization, and word clouds, followed by ML approaches such as least-squares support vector machines.

How can you make the most of semantic extraction? Let us help you get started.

Guires offers innovative solutions:

Guires' text mining and machine learning approach helps you build and deploy text mining to improve business processes.
Guires deploys semantic data modeling as a layer in your knowledge-centric architecture by virtually integrating your enterprise data.

Get semantic extraction working for you. Contact a Guires expert.
