Semantic Search Engine for Bioinformatics Company

Azati developed a machine learning-powered semantic search engine to improve the accuracy and speed of searches within vast and complex scientific datasets, specifically for a bioinformatics company.


All Technologies Used

Python
TensorFlow
scikit-learn
Flask
Redis
Word2Vec

Motivation

To design an intelligent search engine capable of accurately processing complex queries and delivering relevant results by analyzing and tagging scientific datasets.

Main Challenges

Challenge 1
Inconsistent Blood Sample Descriptions

Blood sample descriptions and tags were inconsistent, leading to inaccurate search results.

Challenge 2
Lack of Knowledge about Synonyms and Variations

Disease names appeared under many synonyms and spelling variations, and no existing vocabulary covered them, which hindered precise tagging.

Challenge 3
Lack of Initial Sample Data

The engine had to process a vast number of entries, yet no labeled sample data existed to train the algorithm.

Key Features

  • Natural language processing to extract entities from search queries
  • Semantic matching of queries to tagged datasets
  • RESTful microservices for scalability
  • In-memory caching with Redis for high-speed performance
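The features above combine into a simple pipeline: extract entities from the free-text query, then match them against the tagged dataset. A minimal sketch of that flow, with an illustrative lexicon and sample set (`SAMPLES`, `ENTITY_LEXICON`, and all IDs/tags are hypothetical, not the company's data):

```python
# Hypothetical tagged dataset: sample ID -> set of normalized tags.
SAMPLES = {
    "S-001": {"diabetes mellitus", "plasma", "fasting"},
    "S-002": {"hepatitis b", "serum"},
    "S-003": {"diabetes mellitus", "serum"},
}

# Toy entity lexicon standing in for the NLP extraction step.
ENTITY_LEXICON = {
    "diabetic": "diabetes mellitus",
    "diabetes": "diabetes mellitus",
    "hep b": "hepatitis b",
    "serum": "serum",
    "plasma": "plasma",
}

def extract_entities(query: str) -> set:
    """Map free-text query terms onto normalized dataset tags."""
    q = query.lower()
    return {tag for phrase, tag in ENTITY_LEXICON.items() if phrase in q}

def search(query: str) -> list:
    """Rank samples by how many query entities their tags cover."""
    entities = extract_entities(query)
    scored = [(sid, len(entities & tags)) for sid, tags in SAMPLES.items()]
    return [sid for sid, n in sorted(scored, key=lambda p: (-p[1], p[0])) if n > 0]

print(search("serum from diabetic patients"))  # S-003 covers both entities
```

In the real system the lexicon lookup is replaced by trained NLP models, but the extract-then-match shape stays the same.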

Our Approach

Intelligent Matching Module
The team developed a pluggable intelligent-matching module that tagged blood samples with a high confidence level.
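One way to picture confidence-scored tagging is token-set overlap between a sample description and each canonical tag, accepting a tag only above a threshold. This is an illustrative sketch, not the module's actual scoring function; the tag vocabulary and threshold are invented:

```python
def jaccard(a: set, b: set) -> float:
    """Overlap ratio between two token sets, in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical canonical tags with their descriptive tokens.
CANONICAL_TAGS = {
    "type 2 diabetes": {"type", "2", "diabetes", "t2dm"},
    "hypertension": {"hypertension", "high", "blood", "pressure"},
}

def tag_sample(description: str, threshold: float = 0.2):
    """Return (best_tag, confidence); assign a tag only above the threshold."""
    tokens = set(description.lower().split())
    best_tag, best_score = None, 0.0
    for tag, vocab in CANONICAL_TAGS.items():
        score = jaccard(tokens, vocab)
        if score > best_score:
            best_tag, best_score = tag, score
    return (best_tag, best_score) if best_score >= threshold else (None, best_score)
```

Returning the score alongside the tag lets low-confidence matches be routed to review instead of silently mis-tagged.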
Query Analysis Module
The team developed another pluggable module for query analysis to convert unstructured input into structured data.
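Converting unstructured input into structured data can be sketched as slot extraction: each field gets a pattern, and matches populate a structured record. The field names and patterns below are assumptions for illustration only:

```python
import re

# Hypothetical slot patterns for turning free text into structured fields.
PATTERNS = {
    "age": re.compile(r"\baged?\s*(\d{1,3})\b"),
    "sample_type": re.compile(r"\b(serum|plasma|whole blood)\b"),
    "disease": re.compile(r"\b(diabetes|hepatitis\s+[abc]|malaria)\b"),
}

def analyze_query(query: str) -> dict:
    """Convert an unstructured query into a structured field/value dict."""
    q = query.lower()
    structured = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(q)
        if match:
            structured[field] = match.group(1)
    return structured

print(analyze_query("plasma samples from patients aged 45 with diabetes"))
# {'age': '45', 'sample_type': 'plasma', 'disease': 'diabetes'}
```

The structured dict can then be matched directly against dataset tags instead of doing fuzzy full-text search.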
Custom Word2vec Model
Using Word2vec, the team trained a custom model on life science documents to understand synonyms and relations between terms.
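What the trained model enables is nearest-neighbour lookup in embedding space: terms used in similar contexts end up close together, so synonyms surface as top matches. The sketch below uses toy 3-d vectors in place of a real trained model (with gensim, `Word2Vec(...).wv.most_similar(term)` would play this role); all vectors and terms are illustrative:

```python
import math

# Toy 3-d embeddings standing in for the trained Word2vec model;
# real vectors would come from training on life-science documents.
EMBEDDINGS = {
    "hepatitis": [0.90, 0.10, 0.00],
    "hep":       [0.85, 0.15, 0.05],
    "diabetes":  [0.10, 0.90, 0.20],
    "t2dm":      [0.12, 0.88, 0.25],
    "plasma":    [0.00, 0.20, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(term: str, topn: int = 2):
    """Nearest neighbours in embedding space ~ synonyms/related terms."""
    query = EMBEDDINGS[term]
    scored = [(w, cosine(query, v)) for w, v in EMBEDDINGS.items() if w != term]
    return sorted(scored, key=lambda p: -p[1])[:topn]
```

Here `most_similar("diabetes")` ranks "t2dm" first, mirroring how the production model maps disease-name variants onto one another.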
Performance Optimization with Redis
Optimizations such as caching with Redis enabled fast in-memory data lookups.

Project Impact

150,000 samples: analyzed to build the semantic search engine.

27 milliseconds: required to analyze a search query and return a result, achieved through advanced caching and optimized algorithms.

3 minutes: needed to retrain neural networks for a new dataset, demonstrating system scalability and efficiency.
