Semantic Search Engine for Bioinformatics Company

Azati developed a machine learning-powered semantic search engine to improve the accuracy and speed of searches within vast and complex scientific datasets, specifically for a bioinformatics company.

27 ms: average time to process a search query and return results

3 min: time required to retrain neural networks on new datasets

150,000+: blood samples effectively analyzed and tagged

All Technologies Used

Python
TensorFlow
scikit-learn
Flask
Redis
Word2Vec

Motivation

The client’s existing search system was slow and inaccurate. The goal was an intelligent semantic search engine that eliminates manual tag selection; handles inconsistent descriptions, synonyms, and variations in blood sample data; cuts search times from minutes to milliseconds; and scales to new datasets while consistently returning relevant results.

Main Challenges

Challenge 01
Inconsistent Blood Sample Descriptions

Blood sample descriptions and manually assigned tags were inconsistent, leading to inaccurate search results. Azati addressed this by cleansing and standardizing the data and by training a custom Word2Vec model to capture synonyms and relationships between terms, so the search engine can interpret and match queries correctly despite the inconsistencies.

Challenge 02
Lack of Knowledge about Synonyms and Variations

The team faced challenges due to multiple naming conventions and variations in disease names, which hindered precise tagging and search accuracy. Azati solved this by analyzing hundreds of thousands of life sciences documents to build a comprehensive thesaurus and train the Word2Vec model to detect and map synonyms, enabling accurate semantic matching.
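
As a rough illustration of how a thesaurus can map naming variants onto one canonical term, here is a minimal Python sketch. The thesaurus entries and the function names (`build_variant_index`, `normalize_term`) are hypothetical and chosen for this example only; the source does not describe how the real thesaurus is stored or queried.

```python
import re

# Hypothetical thesaurus: each canonical disease name maps to known variants.
# The entries are illustrative only and are not taken from the client's data.
THESAURUS = {
    "type 2 diabetes": ["t2d", "type ii diabetes", "diabetes mellitus type 2"],
    "hepatitis c": ["hcv", "hep c"],
}


def build_variant_index(thesaurus):
    """Invert the thesaurus so every variant points to its canonical name."""
    index = {}
    for canonical, variants in thesaurus.items():
        index[canonical] = canonical
        for variant in variants:
            index[variant] = canonical
    return index


def normalize_term(term, index):
    """Lowercase, collapse whitespace, and map a known variant to its canonical name.

    Unknown terms are returned in their cleaned form, unchanged otherwise.
    """
    cleaned = re.sub(r"\s+", " ", term.strip().lower())
    return index.get(cleaned, cleaned)


VARIANT_INDEX = build_variant_index(THESAURUS)
```

Calling `normalize_term("HCV", VARIANT_INDEX)` and `normalize_term("hep c", VARIANT_INDEX)` would both resolve to the same canonical name, so tagging and search can compare like with like. A lookup table only covers variants that are already listed; detecting unseen synonyms is the part the case study attributes to the Word2Vec model.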

Challenge 03
Lack of Initial Sample Data

The project involved processing a vast number of entries without any pre-labeled sample data for algorithm training. Azati overcame this by leveraging open-source life sciences documents to create a training dataset, developing intelligent matching and query analysis modules, and implementing RESTful microservices with Redis caching for efficient, scalable search performance.


Our Approach

Intelligent Matching Module
Developed a pluggable module for automatic tagging of blood samples. The module analyzes sample descriptions and assigns tags with a high confidence score (around 98%), enabling accurate semantic searches even on inconsistent data.
Query Analysis Module
Built a module that converts unstructured user queries into structured entities. It extracts sample types, diseases, geography, and other relevant attributes, ensuring that searches match the dataset accurately and completely.
Custom Word2Vec Model
Trained a custom Word2Vec model on life sciences documents to identify synonyms and semantic relationships between terms. This allows the system to match different expressions of the same concept, such as alternative disease names or lab test variations.
Performance Optimization with Redis
Implemented caching for preprocessed samples using Redis, enabling in-memory lookups. Combined with optimized search algorithms, this reduced search query times from several minutes to under 30 milliseconds.
Scalable Microservices Architecture
All modules were implemented as RESTful microservices deployed in the cloud, allowing the system to scale horizontally and handle growing datasets without downtime or performance degradation.


Solution

01

Intelligent Matching Module

This module tags blood samples automatically by analyzing descriptions and related documents, ensuring high-confidence matches even with inconsistent or incomplete data.
Key capabilities:
  • Automatic tagging of blood samples
  • High-confidence semantic matching (~98%)
  • Handling inconsistent or incomplete data
  • Custom Word2Vec model trained on life sciences documents
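
One way to picture embedding-based tagging with a confidence score is the sketch below. It is an assumption-laden toy, not Azati's implementation: the three-dimensional vectors stand in for a trained Word2Vec model, the tag names and the `0.9` threshold are hypothetical, and the confidence is simply the cosine similarity between a tag vector and the best-matching word in the description.

```python
import math
import re

# Toy 3-dimensional word vectors standing in for a trained Word2Vec model.
# All values are made up for illustration.
TOY_VECTORS = {
    "hcv": [0.2, 0.8, 0.0],
    "hepatitis": [0.1, 0.9, 0.1],
    "diabetic": [0.8, 0.2, 0.1],
    "t2d": [0.9, 0.1, 0.0],
}

# Hypothetical tags, each represented by a vector in the same space.
TAG_VECTORS = {
    "hepatitis_c": [0.1, 0.9, 0.1],
    "diabetes": [0.9, 0.1, 0.0],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_tags(description, vectors, tag_vectors, threshold=0.9):
    """Return {tag: confidence} for tags whose best word match meets the threshold."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    known = [vectors[t] for t in tokens if t in vectors]
    tags = {}
    for tag, tag_vec in tag_vectors.items():
        # Confidence is the similarity of the closest known word to the tag.
        score = max((cosine(vec, tag_vec) for vec in known), default=0.0)
        if score >= threshold:
            tags[tag] = score
    return tags
```

With these toy vectors, `assign_tags("HCV positive donor", TOY_VECTORS, TAG_VECTORS)` keeps only the `hepatitis_c` tag, because "hcv" lies close to that tag vector and far from `diabetes`. Words missing from the vocabulary are skipped, which is one simple way to tolerate incomplete descriptions.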
02

Query Analysis Module

Processes unstructured user search queries, extracts relevant entities, and converts them into structured data for accurate semantic matching against the dataset.
Key capabilities:
  • Natural language processing for query analysis
  • Entity extraction (sample type, disease, geography, etc.)
  • Conversion of unstructured queries into structured data
  • Improved search precision and recall
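
The conversion from a free-text query to structured entities might look roughly like the following Python sketch. It uses plain vocabulary lookup rather than the NLP pipeline the case study mentions, and the vocabularies, the `StructuredQuery` type, and `parse_query` are all hypothetical names introduced for this example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical vocabularies; a real system would derive these from its dataset.
SAMPLE_TYPES = {"plasma", "serum", "whole blood"}
DISEASES = {"hepatitis c", "malaria", "type 2 diabetes"}
REGIONS = {"europe", "asia", "north america"}


@dataclass
class StructuredQuery:
    """Entities extracted from a free-text search query."""
    sample_types: list = field(default_factory=list)
    diseases: list = field(default_factory=list)
    regions: list = field(default_factory=list)


def _find_terms(text, vocabulary):
    """Return the vocabulary terms found in text as whole words or phrases, sorted."""
    return sorted(
        term for term in vocabulary
        if re.search(r"\b" + re.escape(term) + r"\b", text)
    )


def parse_query(query):
    """Lowercase the query, collapse whitespace, and extract known entities."""
    text = re.sub(r"\s+", " ", query.lower())
    return StructuredQuery(
        sample_types=_find_terms(text, SAMPLE_TYPES),
        diseases=_find_terms(text, DISEASES),
        regions=_find_terms(text, REGIONS),
    )
```

For example, `parse_query("Serum samples from Europe with Hepatitis C")` fills one entry in each field, and those fields can then be matched against the tags stored for each sample. Exact vocabulary matching misses synonyms and misspellings, which is where the thesaurus and embedding model described elsewhere in this case study would come in.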
03

RESTful Microservices Architecture

Modules are deployed as independent microservices, allowing scalability, easy maintenance, and efficient integration with cloud infrastructure.
Key capabilities:
  • Scalable cloud deployment
  • Independent module updates and maintenance
  • Integration with existing infrastructure
  • Flexible expansion for new datasets or modules
04

Performance Optimization with Redis

Caching and in-memory data storage dramatically reduce query response times and improve system throughput when handling large-scale datasets.
Key capabilities:
  • In-memory caching with Redis
  • Sub-30 millisecond query response
  • High-throughput data processing
  • Efficient handling of large scientific datasets
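
The caching idea can be sketched as a cache-aside lookup: check the cache first, and only preprocess a sample on a miss. The sketch below is illustrative and makes several assumptions: the key format, the `preprocess_sample` step, and the TTL are hypothetical, and `InMemoryCache` is a dict-backed stand-in so the example runs without a Redis server. A `redis.Redis` client from the redis-py package exposes `get(name)` and `set(name, value, ex=...)` and could be passed in its place.

```python
import json


class InMemoryCache:
    """Dict-backed stand-in exposing the get/set subset of a Redis client used below."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        # Like Redis GET, returns None when the key is absent.
        return self._store.get(key)

    def set(self, key, value, ex=None):
        # `ex` (expiry in seconds) is accepted for interface parity but ignored here.
        self._store[key] = value


def preprocess_sample(description):
    """Hypothetical preprocessing step: lowercase and tokenize a description."""
    return {"tokens": description.lower().split()}


def get_preprocessed(cache, sample_id, load_description, ttl_seconds=3600):
    """Cache-aside lookup: return the cached result, or compute and store it."""
    key = f"sample:{sample_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # Cache miss: load the raw description, preprocess it, and store the result.
    result = preprocess_sample(load_description(sample_id))
    cache.set(key, json.dumps(result), ex=ttl_seconds)
    return result
```

Here `load_description` is any callable that fetches the raw description for a sample ID, for instance from a database. After the first call for a given ID, later calls are served from memory and skip both the fetch and the preprocessing, which is the kind of saving the caching layer is meant to provide.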

Business Value

High-Accuracy Tagging: Enabled automatic analysis and tagging of blood samples with up to 98% confidence, reducing manual effort and errors.

Blazing Fast Query Response: Search queries return results in ~27 milliseconds, improving employee productivity and satisfaction.

Scalable Neural Network Retraining: New datasets can be incorporated in ~3 minutes, allowing the system to adapt quickly to expanding scientific data.

Improved Search Precision: Semantic matching of queries to datasets significantly reduced irrelevant results and enhanced data accessibility for researchers.
