Many biotechnology patents define the scope of their sequence claims through percent identity thresholds: 95%, 90%, 80%, 75%. These numbers determine what a patent covers and what it doesn't. Yet the primary tool used to search patent sequence databases, NCBI BLAST, was never designed to answer questions framed this way. Patent examiners and IP attorneys are left to filter results manually, a process that is slow, error-prone, and often incomplete.
At Azati, we've extended BLAST to speak the language of patent search. Here's why that matters and what it means for IP professionals.
The gap between biology and patent law
BLAST is the gold standard for biological sequence search. It's used by researchers worldwide to compare DNA, RNA, and protein sequences against massive databases. It's fast, reliable, and well-understood.
But BLAST was built for biologists. It ranks results by statistical significance using E-values and bit scores, exactly the right measures for evolutionary biology and genomics research.
They are the wrong metrics for patent search.
When a patent examiner or IP attorney needs to determine whether a sequence falls within the scope of an existing patent claim, the question is straightforward: "Is this sequence at least 95% identical to the patented one?" BLAST doesn't answer this question directly. Instead, it can return thousands of hits ranked by statistical significance, each with a percent identity computed only over the locally aligned region, leaving the searcher to work out which hits actually meet the claimed threshold.
What patent searchers actually need
Patent sequence search involves three distinct scenarios that standard BLAST handles poorly:
- Fragment detection. A patent may cover short peptide fragments (15 to 25 amino acid residues) derived from a larger protein. Finding these fragments requires filtering by sequence length and identity simultaneously. Standard BLAST has no mechanism for this.
- Full-sequence containment. An examiner may need to know whether a query sequence appears within any longer database sequence with high identity over most or all of the query. BLAST reports local alignments that may cover only a small portion of the query, making it difficult to assess full-sequence matches.
- Variant identification. Often the goal is to find sequences that are almost identical to the query across their entire length. BLAST's local alignment approach can miss these, reporting partial matches over short windows instead of surfacing near-identical full-length hits.
Our approach: filtering inside the algorithm
The Azati bioinformatics team extended the open-source NCBI BLAST+ engine with additional filtering parameters that operate inside the search algorithm itself. Instead of post-processing results externally, our version applies identity and coverage filters after alignment but before output, so only results that meet your criteria are returned.
This approach offers three key advantages:
- Precision. Results are filtered by the exact metrics patent claims use: percent identity and percent coverage relative to the query, the subject, or the alignment, as well as subject sequence length. These filters can be applied individually or in combination.
- Completeness. Because filtering happens inside the algorithm, BLAST processes every candidate alignment against your criteria before deciding what to keep. You get a complete set of results that meet your thresholds, rather than a statistically ranked subset.
- Speed. No separate post-processing step. For large databases like the WIPO patent sequence collection, this eliminates hours of manual filtering work.
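To make the metrics named above concrete, here is a minimal Python sketch of how identity and coverage "relative to the query, the subject, or the alignment" could be computed for a single alignment. The definitions are assumptions made for illustration (identical positions divided by query length, subject length, or alignment length); the function name `alignment_metrics` is hypothetical and this is not the code inside the extended BLAST engine.

```python
def alignment_metrics(aligned_query, aligned_subject, query_len, subject_len):
    """Identity and coverage percentages for one gapped alignment.

    aligned_query / aligned_subject are the two rows of the alignment,
    equal in length, with '-' marking gaps. query_len and subject_len are
    the full, unaligned sequence lengths.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned rows must have the same length")

    # Columns where both rows carry the same residue (gaps never count).
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    align_len = len(aligned_query)
    query_span = sum(1 for c in aligned_query if c != "-")
    subject_span = sum(1 for c in aligned_subject if c != "-")

    return {
        # Assumed definitions: same numerator, three different denominators.
        "align_identity": 100.0 * identical / align_len,
        "query_identity": 100.0 * identical / query_len,
        "subject_identity": 100.0 * identical / subject_len,
        "query_coverage": 100.0 * query_span / query_len,
        "subject_coverage": 100.0 * subject_span / subject_len,
    }


# A 20-residue query whose first 12 residues match a 12-residue subject exactly.
m = alignment_metrics("MKTAYIAKQRQI", "MKTAYIAKQRQI", query_len=20, subject_len=12)
print(m["align_identity"])    # 100.0
print(m["query_identity"])    # 60.0
print(m["subject_identity"])  # 100.0
```

The same alignment scores 100% or 60% depending on the denominator, which is why a threshold is only meaningful once it states which sequence it is measured against.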
Real-world search scenarios
Here's how our enhanced BLAST handles the three patent search scenarios described above.
Fragment search
Parameters: subject_length_min, subject_length_max, min_subject_identity
Search for short peptide fragments within a specific length range that match your query with high identity.
Identity threshold search
Parameters: min_align_identity, min_query_identity, min_query_coverage
Find all database sequences that contain your query above a specified percent identity threshold. This directly answers the question patent claims ask: "Is there anything in the database that is at least X% identical to my sequence?"
Variant detection
Parameters: min_subject_identity, min_subject_coverage
Find database sequences that are similar to your query across their entire length, not just in a local alignment window. By requiring both high coverage and high identity of the subject sequence, the search identifies true sequence variants while excluding partial or coincidental matches.
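The three parameter combinations above can be illustrated with a rough post-processing equivalent in Python. This sketch filters standard BLAST+ tabular output produced with `-outfmt "6 qseqid sseqid length nident qlen slen qstart qend sstart send"`. The parameter names are taken from this article, but their exact semantics, the 95% thresholds, and the sample hits are assumptions for illustration only. It is not the in-algorithm filter, and unlike that filter it only sees hits that BLAST already chose to report.

```python
# Column order must match the -outfmt string used when running BLAST.
FIELDS = ["qseqid", "sseqid", "length", "nident", "qlen", "slen",
          "qstart", "qend", "sstart", "send"]


def parse_hits(tsv_text):
    """Yield one dict per alignment from tab-separated BLAST output."""
    for line in tsv_text.splitlines():
        if line.strip():
            row = line.split("\t")
            yield {name: (value if name in ("qseqid", "sseqid") else int(value))
                   for name, value in zip(FIELDS, row)}


def metrics(hit):
    """Assumed definitions, keyed by the filter parameter they feed."""
    # abs() because nucleotide minus-strand hits report sstart > send.
    q_span = abs(hit["qend"] - hit["qstart"]) + 1
    s_span = abs(hit["send"] - hit["sstart"]) + 1
    return {
        "min_align_identity": 100.0 * hit["nident"] / hit["length"],
        "min_query_identity": 100.0 * hit["nident"] / hit["qlen"],
        "min_subject_identity": 100.0 * hit["nident"] / hit["slen"],
        "min_query_coverage": 100.0 * q_span / hit["qlen"],
        "min_subject_coverage": 100.0 * s_span / hit["slen"],
    }


def passes(hit, subject_length_min=None, subject_length_max=None, **minimums):
    """True if the hit satisfies every supplied length bound and minimum."""
    if subject_length_min is not None and hit["slen"] < subject_length_min:
        return False
    if subject_length_max is not None and hit["slen"] > subject_length_max:
        return False
    m = metrics(hit)
    return all(m[name] >= value for name, value in minimums.items())


# Illustrative presets for the three scenarios (threshold values are made up).
FRAGMENT = dict(subject_length_min=15, subject_length_max=25, min_subject_identity=95)
THRESHOLD = dict(min_align_identity=95, min_query_identity=95, min_query_coverage=95)
VARIANT = dict(min_subject_identity=95, min_subject_coverage=95)

# Fabricated sample hits for a 200-residue query, in FIELDS order.
SAMPLE = "\n".join("\t".join(map(str, row)) for row in [
    ("q1", "frag1", 20, 20, 200, 20, 50, 69, 1, 20),        # short exact fragment
    ("q1", "long1", 200, 196, 200, 1000, 1, 200, 301, 500),  # query inside long subject
    ("q1", "var1", 202, 197, 200, 202, 1, 200, 1, 202),      # near-identical full length
    ("q1", "part1", 40, 40, 200, 210, 10, 49, 100, 139),     # short local match only
])

hits = list(parse_hits(SAMPLE))


def select(preset):
    return [h["sseqid"] for h in hits if passes(h, **preset)]


print(select(FRAGMENT))   # ['frag1']
print(select(THRESHOLD))  # ['long1', 'var1']
print(select(VARIANT))    # ['frag1', 'var1']
```

Under these assumed definitions a fully matched short fragment also satisfies the variant preset, because its entire length is covered. Adding a lower bound on subject length would separate the two cases.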
Understanding the limitations
These enhancements have limitations worth understanding.
BLAST is a heuristic algorithm that uses short "seed" sequences to identify candidate regions before performing full alignment. If two sequences are similar but don't share a seed match, BLAST will miss the hit entirely. Our enhancements improve filtering, not candidate detection.
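A toy Python example shows how a seed requirement can hide a fairly similar pair. It uses exact shared k-mers as a simplified stand-in for seeding. Real BLAST seeding is more elaborate (protein searches, for instance, also accept similar "neighborhood" words), and the word sizes and sequences here are chosen purely for illustration.

```python
def kmers(seq, k):
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shares_exact_seed(a, b, k):
    """True if the two sequences have at least one identical k-mer."""
    return bool(kmers(a, k) & kmers(b, k))


def ungapped_identity(a, b):
    """Percent identity of two equal-length sequences, position by position."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


query = "MKTAYIAKQRQISFVKSHFS"
# Substitute every fourth residue, so no run of four residues survives intact.
subject = "".join("W" if i % 4 == 3 else c for i, c in enumerate(query))

print(subject)                               # MKTWYIAWQRQWSFVWSHFW
print(ungapped_identity(query, subject))     # 75.0
print(shares_exact_seed(query, subject, 3))  # True
print(shares_exact_seed(query, subject, 4))  # False
```

The pair is 75% identical, yet with an exact word size of 4 there is nothing to seed an alignment from. No output filter can recover a hit that was never generated as a candidate.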
Additionally, BLAST performs local alignment, finding the best-matching region between two sequences. When a patent says "95% identical," it typically implies identity over the entire sequence. Our identity and coverage filters closely approximate global identity, especially when combined, but they are not a substitute for formal global alignment tools in legally sensitive contexts.
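The difference between local and global identity can be seen with a small Needleman-Wunsch sketch in Python. This is a teaching example under simplifying assumptions: match/mismatch scoring, a linear gap penalty, and identity defined as matches divided by the number of alignment columns. Established global alignment tools such as EMBOSS needle use substitution matrices and affine gap penalties, and the appropriate identity definition for a given claim is a legal question this sketch does not settle.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity over a full-length (global) alignment of a and b."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back one optimal path, counting columns and identical pairs.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            if a[i - 1] == b[j - 1]:
                matches += 1
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return 100.0 * matches / columns if columns else 0.0


# The 5-residue sequence is an exact prefix of the 10-residue one, so a local
# alignment of the shared region is 100% identical.
print(global_identity("MKTAYIAKQR", "MKTAY"))  # 50.0
print(global_identity("MKTAY", "MKTAY"))       # 100.0
```

Here the global figure is 50% because the five unmatched residues count as gap columns, while a local alignment restricted to the shared region would report a perfect match.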
Work with us
The Azati bioinformatics team develops custom sequence search algorithms and extends established open-source tools like NCBI BLAST+ to meet the specific demands of patent sequence work.
If you're facing complex challenges in sequence search or analysis, contact the Azati team. We're happy to discuss your specific use case and run a complimentary test search on your sequences.