Azati designed and developed a semantic search engine powered by machine learning. It extracts the actual meaning from a search query and looks for the most relevant results across huge scientific datasets.
A US company focused on the development of in vitro diagnostic (IVD) and biopharmaceutical products. It provides products and services that support research and development activities and accelerate products' time to market.
The customer offers clinical trial management services, biological materials, central laboratory testing, and other solutions that enable product development and research in infectious diseases, oncology, rheumatology, endocrinology, cardiology, and genetic disorders.
Many companies suffer from the lack of an accurate and fast search engine that can handle substantial scientific datasets. Scientific datasets are known for their structural complexity and a vast number of interconnected terms and abbreviations that make data processing quite tricky.
The customer was looking for a partner who could overcome this challenge.
The customer wanted us to build an intelligent search engine to help with internal inventory search. The inventory included a considerable number of blood samples. Each blood sample was described by several tags, grouped into subcategories, which in turn were grouped into larger categories, and so on.
The customer's employees had to select many tags by hand to get the information they wanted. A single search took several minutes. Even more frustrating: if an employee made a single mistake or provided an inaccurate query, he or she got an empty result page.
The entire data lookup process was a constant source of frustration for the personnel and the customer. Several challenges had to be overcome to improve the customer's workflow.
Every blood sample was described by a textual description and specific tags, manually mapped by an external data-entry vendor according to the description.
It was common for a blood sample to have inconsistencies in both its description and its tags. This meant that any approach to improving search by tag or by description would fail due to inconsistent data.
The first thing we considered was cleansing the data. Here we faced two interconnected issues. The first was the lack of knowledge about all possible factors that can differentiate one blood sample from another.
The second was the lack of knowledge about all alternative disease names: for example, Hepatitis B, HBV DNA, Hepatitis B Virus, HBV PCR, and Hepatitis B Virus Genotype by Sequencing all mean essentially the same thing.
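The idea of collapsing alternative disease names into a single canonical term can be sketched as follows. The alias table below is hypothetical and purely illustrative; in the real project, synonym candidates came from a trained model rather than a hand-written list.

```python
# Hypothetical alias table: canonical disease name -> known alternative names.
CANONICAL_ALIASES = {
    "hepatitis b": [
        "hbv dna",
        "hepatitis b virus",
        "hbv pcr",
        "hepatitis b virus genotype by sequencing",
    ],
}

# Build a reverse lookup so any alias resolves to its canonical name.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in CANONICAL_ALIASES.items()
    for alias in aliases + [canonical]
}

def normalize_term(term: str) -> str:
    """Map a raw term to its canonical disease name; pass through unknowns."""
    return ALIAS_TO_CANONICAL.get(term.strip().lower(), term)

print(normalize_term("HBV PCR"))  # -> hepatitis b
```

Once every term resolves to one canonical name, a tag mismatch caused by naming variation stops producing empty result pages.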
Another challenge was the amount of data. There was a significant number of entries to process, and, more importantly, there was no sample data for training an algorithm to match the tags automatically.
From the very beginning, the customer provided a list of keywords describing blood samples. Very soon our team discovered that this list was incomplete and required additional research: it was not enough to complete the project. Still, issues like these could not stop our team from delivering the project on time.
So we decided to split the final solution into two pluggable modules: one for intelligent matching, which determined the confidence level while tagging a blood sample, and another to extract all possible tags from search queries. In effect, the second module transformed unstructured user input into structured data.
The first challenge our engineers overcame was the lack of sample data. We trained a custom model on a hundred thousand life-science documents related to blood samples, gathered from open data sources. Our data scientists used Word2vec to analyze the connections between the most common words from the thesaurus, find synonyms, and determine how these words relate to each other.
As a result, the model could automatically analyze the descriptions and tags of blood samples with a high confidence level, close to 98%.
The module responsible for entity detection in search queries was already partially ready: we had built a similar module while developing a platform for custom chatbot development. All that was left was to retrain the model on the relevant list of entities: sample types, geography, diseases, genders, etc.
To achieve a high confidence level, we analyzed a massive number of user search queries collected from open data sources. In the end, we compiled a collection of patterns used to form search queries.
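The entity-detection idea can be illustrated with a deliberately simplified sketch: spotting known entity types (sample type, disease, gender, geography) in a free-text query. The vocabularies below are hypothetical, and the production module used a retrained ML model rather than plain dictionary lookup.

```python
import re

# Hypothetical vocabularies, one per entity type.
ENTITY_VOCAB = {
    "sample_type": {"serum", "plasma", "whole blood"},
    "disease": {"hepatitis b", "hiv", "lupus"},
    "gender": {"male", "female"},
    "geography": {"usa", "germany", "japan"},
}

def extract_entities(query: str) -> dict:
    """Return {entity_type: [matched values]} for terms found in the query."""
    text = query.lower()
    found = {}
    for etype, terms in ENTITY_VOCAB.items():
        # Word-boundary matching so "male" does not match inside "female".
        hits = [t for t in terms if re.search(rf"\b{re.escape(t)}\b", text)]
        if hits:
            found[etype] = sorted(hits)
    return found

print(extract_entities("female serum samples, hepatitis b, USA"))
```

The structured output is what the downstream search module consumes instead of the raw query string.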
The final solution consists of three separate but interconnected modules hosted in the cloud. This approach lets us maintain the system remotely and avoid on-site personnel training. The cloud architecture also makes the application more flexible, cutting down development and maintenance costs.
The system consists of three modules:
We are proud to say that two of the modules are powered by machine learning. The Query Analysis module uses natural language processing algorithms to extract entities from search queries, while the Search Engine module matches the extracted entities with synonyms to perform an accurate and fast search.
The modules are built as independent RESTful microservices, which lets us scale the final solution to any size in the cloud.
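As a rough illustration of the microservice shape, one module might expose its model behind a single HTTP endpoint. This is a minimal sketch assuming Flask; the route name and payload format are assumptions, not the project's actual API.

```python
# A minimal sketch of one module as a RESTful microservice, assuming Flask.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/analyze", methods=["POST"])
def analyze():
    # In the real system this would invoke the query-analysis model;
    # here we just echo the query with an empty entity list.
    payload = request.get_json(silent=True) or {}
    query = payload.get("query", "")
    return jsonify({"query": query, "entities": []})

if __name__ == "__main__":
    app.run(port=8080)
```

Because each module owns one narrow HTTP contract like this, modules can be deployed, scaled, and retrained independently.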
We significantly optimized the traditional search algorithms. Instead of searching across the whole dataset, we pre-processed about 150,000 samples with about 100 tags each and performed the search among these tags. We cached all processed samples in Redis, which enabled in-memory data lookups and avoided the bottlenecks of reading and writing data to disk.
These optimizations allowed us to deliver outstanding search quality and blazing speed.
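The caching idea can be sketched as follows: each processed sample is stored once, keyed by its id, so lookups scan cached tags in memory instead of re-reading raw data from disk. The sketch assumes redis-py and a local Redis instance (the host, port, and key scheme are assumptions); it falls back to an in-process dict so it runs without a server.

```python
import json

try:
    import redis
    cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
    cache.ping()
except Exception:
    # Fallback stub so the sketch runs without a Redis server.
    class _DictCache(dict):
        def set(self, key, value):
            self[key] = value
        def get(self, key):
            return dict.get(self, key)
    cache = _DictCache()

def cache_sample(sample_id: str, tags: dict) -> None:
    """Store a processed sample's tags as a JSON blob keyed by sample id."""
    cache.set(f"sample:{sample_id}", json.dumps(tags))

def match_sample(sample_id: str, wanted: dict) -> bool:
    """Check whether a cached sample satisfies every requested tag."""
    raw = cache.get(f"sample:{sample_id}")
    if raw is None:
        return False
    tags = json.loads(raw)
    return all(tags.get(k) == v for k, v in wanted.items())

cache_sample("S-001", {"disease": "hepatitis b", "gender": "female"})
print(match_sample("S-001", {"disease": "hepatitis b"}))  # True
```

Matching against ~100 cached tags per sample is what keeps each query in memory and off the disk-bound path.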
We successfully implemented a commercial semantic search engine that can handle massive scientific datasets. We used modern natural language processing technologies to extract entities from search queries and categorize scientific texts by tags. The algorithms we built helped the customer eliminate ineffective search results and significantly improve employees' satisfaction with the data lookup process.
were analyzed to build a semantic search engine
A considerable number of samples helped us train the machine learning model effectively.
It takes to analyze a search query and return a result
Advanced caching and algorithm improvements helped us build a blazing-fast search engine.
It takes to retrain neural networks for the new dataset
Our engineers built a scalable system that can easily be retrained on any number of similar samples.
We successfully launched the semantic search engine in the middle of March. Now we are maintaining the application, processing new datasets, improving search quality, and scaling the system in the cloud.