What are the key challenges in analyzing unstructured data?
Unstructured data is everywhere. You can find unstructured text data in letters, memos, reports, contracts, emails, and so on. Now imagine that you have been asked to draw insights from hundreds of such documents. What can you do?
You could read all of them, which is quite a task. It would help to have some idea of which ones matter most, for example from metadata such as titles. Better still would be the ability to search through them, without relying on predefined metadata, and then focus only on the documents that show up in the results. A further improvement is semantic search instead of keyword search, so that you do not miss documents that happen not to use your exact search terms. Search technology, such as that behind Google Search, takes you this far.
From 2022 onward, a big improvement became possible through chatbots built on large language models (LLMs), such as ChatGPT. You could now interact with the document contents, but back then you could not fit the data from all documents into the model. Researchers later found that even when everything does fit, the sheer volume of text makes it hard for LLMs to give reliable answers. This is where Retrieval Augmented Generation (RAG) comes in: a technique that improves the accuracy of LLM results by searching through the documents and sending only the most relevant ones to the LLM. This technology is the basis of many document chatbots, such as NotebookLM.
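The retrieval step of RAG can be illustrated with a minimal sketch. It makes several assumptions: the three documents are invented, the function names (`retrieve`, `build_prompt`) are hypothetical, and simple word-overlap similarity stands in for the embedding-based semantic search that real systems typically use. No LLM is called; the sketch only shows how documents are selected before a prompt is built.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, documents, top_k=2):
    """Return the names of the top_k documents most similar to the question."""
    question_vector = Counter(tokenize(question))
    scored = [
        (cosine_similarity(question_vector, Counter(tokenize(text))), name)
        for name, text in documents.items()
    ]
    scored.sort(reverse=True)
    # Documents with no overlap at all are never passed on.
    return [name for score, name in scored[:top_k] if score > 0]


def build_prompt(question, documents, top_k=2):
    """Assemble a prompt that contains only the retrieved documents."""
    selected = retrieve(question, documents, top_k)
    context = "\n\n".join(f"[{name}]\n{documents[name]}" for name in selected)
    return (
        "Answer using only the context below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Hypothetical mini-corpus, invented for illustration.
documents = {
    "contract_a": "The supplier shall indemnify the buyer against all losses.",
    "memo_b": "Lunch will be served at noon in the main meeting room.",
    "contract_c": "Payment is due within thirty days of the invoice date.",
}

question = "Which contract says the supplier must indemnify the buyer?"
print(build_prompt(question, documents, top_k=1))
```

With `top_k=1`, only the best-matching document reaches the prompt, so the model never sees the other two. That selection is what keeps the answer focused, and it is also why such a pipeline cannot count something across the whole collection.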
By now you may start to see a pattern: every method mentioned so far selects documents first before giving us answers. This mirrors how humans work, since nobody can keep 100 documents in their head while answering a question. But what if you really wanted a non-trivial quantitative answer from the documents? What can you do?
How do you get quantitative insights from unstructured data, and why is it important?
What you need to do is something you have perhaps done intuitively at some point in your professional life: manually extract terms or note down interpretations, perhaps in an Excel table, as you read through the documents. Once this information is in tabular form, you can do quantitative analysis. This is immensely powerful!
Note how much confidence you will suddenly have in your decisions. Quantitative insights are usually far more actionable than qualitative ones. For example, knowing definitively that only 20% of your contracts have an indemnity clause is far superior to knowing that an unspecified minority of them do. Once you know the number, you can slice and dice it by other metadata, which can lead to real actions. Say the 20% of contracts with an indemnity clause are low-value ones. That may call for a different action than if they were high-value ones.
This is what Augmend does. It automates the extraction of structured data from unstructured documents so that you can get quantitative insights. And because the extraction is automated, you can run it again and again to uncover new facets of information from the same documents, should that be needed. Let us now look at the Complete Response Letter example.
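The indemnity clause example can be made concrete with a small sketch. Everything in it is assumed for illustration: the ten contract rows are invented, and the field names (`has_indemnity_clause`, `value_band`) are hypothetical stand-ins for whatever columns an extraction step would actually produce.

```python
from collections import defaultdict

# Hypothetical rows, one per contract, as they might look after extraction.
contracts = [
    {"id": "C-01", "has_indemnity_clause": True, "value_band": "low"},
    {"id": "C-02", "has_indemnity_clause": True, "value_band": "low"},
    {"id": "C-03", "has_indemnity_clause": False, "value_band": "low"},
    {"id": "C-04", "has_indemnity_clause": False, "value_band": "low"},
    {"id": "C-05", "has_indemnity_clause": False, "value_band": "low"},
    {"id": "C-06", "has_indemnity_clause": False, "value_band": "high"},
    {"id": "C-07", "has_indemnity_clause": False, "value_band": "high"},
    {"id": "C-08", "has_indemnity_clause": False, "value_band": "high"},
    {"id": "C-09", "has_indemnity_clause": False, "value_band": "high"},
    {"id": "C-10", "has_indemnity_clause": False, "value_band": "high"},
]


def share_with_clause(rows):
    """Fraction of rows whose contract has an indemnity clause."""
    if not rows:
        return 0.0
    return sum(row["has_indemnity_clause"] for row in rows) / len(rows)


def share_by(rows, key):
    """Slice the same fraction by a metadata field such as value_band."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {group: share_with_clause(members) for group, members in groups.items()}


print(f"Overall: {share_with_clause(contracts):.0%}")
for band, share in share_by(contracts, "value_band").items():
    print(f"{band}-value contracts: {share:.0%}")
```

In this invented dataset the overall share is 20%, but the slice shows that every indemnity clause sits in a low-value contract, which is the kind of detail that changes what you do next.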
What are Complete Response Letters and why are they interesting?
The US FDA issues Complete Response Letters (CRLs) to sponsors or drug approval applicants when it determines that it will not approve the drug application in its current form. The FDA withholds approval because it has identified certain deficiencies in the application, e.g., the efficacy data is not good enough. A CRL also states what resolutions to the deficiencies the FDA expects, and what additional information the sponsor should ideally include in a revised application. Prior to the second Trump administration, the FDA did not release to the public the CRLs of drug applications that remained unapproved. If a drug was ultimately approved after revisions, its CRLs were published together with the approval. The second Trump administration, however, decided to publish the CRLs of unapproved drugs as well, since they were deemed to provide transparency into the FDA's decision making. The availability of CRLs provides an unparalleled opportunity to gain insight into how the FDA looks at applications, what it focuses on, and what prospective applicants need to prepare well before submitting their own applications.
Analysis of CRLs
I have embedded an example CRL below, and you can download all of them from the openFDA website. Each letter starts with deficiencies and suggested resolutions for those deficiencies, followed by resubmission requirements and, finally, additional comments on the application. A significant portion of each CRL is redacted before release to prevent disclosure of trade secrets and sensitive commercial information. Although the flow of topics is consistent across letters, the number of topics varies a lot, and some topics may be omitted altogether. The terminology also varies from letter to letter. These factors make it difficult to do quantitative analysis directly on CRLs.
Example Complete Response Letter
How Augmend enabled quantitative analysis of CRLs
Augmend uses LLMs to extract data, but unlike document chatbots, it requires the documents to be processed before any quantitative analysis can be done. Augmend extracts information in a human-like, interpretative way: it does not look for exact keywords but interprets meaning in the context of the surrounding text, which leads to very reliable extraction.
We created standardized, detailed categories and sub-categories for deficiencies, resolutions, resubmission requirements, and additional comments. Augmend then combed through the documents and labelled them with the right categories to create a structured dataset. This is how non-quantitative data becomes quantitative. After that, we assembled and overlaid drug metadata such as therapeutic area (TA), modality, and drug format. Now you can slice and dice the data by the various (sub-)categories and drug metadata to obtain targeted quantitative insights and answer questions such as:
- What is the relative proportion of various deficiencies across all CRLs?
- Are certain drug modalities or formats susceptible to certain deficiencies?
- What resolution requirements are generally recommended for the most important deficiencies?
- Etc.
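Once every letter carries category labels and drug metadata, questions like those above reduce to simple counting. The sketch below assumes a hypothetical record layout; the category names, modalities, and values are invented for illustration and are neither Augmend's actual schema nor real CRL findings.

```python
from collections import Counter

# Hypothetical labelled records: illustrative values only, not real CRL findings.
labelled_crls = [
    {"crl_id": "CRL-001", "modality": "small molecule",
     "deficiencies": ["manufacturing", "clinical efficacy"]},
    {"crl_id": "CRL-002", "modality": "biologic",
     "deficiencies": ["manufacturing"]},
    {"crl_id": "CRL-003", "modality": "small molecule",
     "deficiencies": ["labeling"]},
    {"crl_id": "CRL-004", "modality": "biologic",
     "deficiencies": ["manufacturing", "labeling"]},
]


def deficiency_proportions(records):
    """Share of letters that mention each deficiency category at least once."""
    counts = Counter()
    for record in records:
        # set() ensures a category counts once per letter.
        counts.update(set(record["deficiencies"]))
    return {category: n / len(records) for category, n in counts.items()}


def deficiencies_by_modality(records):
    """Cross-tabulate deficiency categories against drug modality."""
    table = {}
    for record in records:
        row = table.setdefault(record["modality"], Counter())
        row.update(set(record["deficiencies"]))
    return table


for category, share in sorted(deficiency_proportions(labelled_crls).items()):
    print(f"{category}: {share:.0%} of letters")

for modality, row in deficiencies_by_modality(labelled_crls).items():
    print(modality, dict(row))
```

The same two functions would work for resolutions or resubmission requirements by swapping the field name, which is the practical benefit of putting every letter into one consistent structure first.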
You will find the data in the Power BI dashboard below. The publicly available data is partially blinded, but I hope you can appreciate the power of the quantitative insights you can gain after structuring unstructured data.
CRL Analysis Dashboard
Interested in the dataset behind the dashboard?
Contact us
This is just one example of where Augmend can help. It can be used to structure all kinds of complex information and provide quantitative insights that you could never have had before.
Do you want to see Augmend in action?
If you're looking to extract quantitative insights from unstructured data, Augmend might be the solution you've been waiting for. Click the "Tackle your challenges" button in the navigation bar to discuss your data challenges and check whether we could solve them using Augmend.