Biodiversity PMC customizable search API
Description
This API performs a fully customizable search for annotated citations in a given collection. SIBiLS relies on the efficiency of the Elasticsearch engine and on the rich Lucene query language, which supports a wide range of search strategies: basic searches with keywords or entity identifiers (“ZBED1” or “NP_NX_O96006”), searches in specified fields (“title: ZBED1” or “annotations_str: genes”), boosting of fields or query parts, Boolean combinations, and searches that exploit identified concepts or concept types. The input is thus a Lucene JSON query. The output is the Elasticsearch result set, ranked by relevance, in its native JSON format; for each retrieved citation (up to 10,000 per request), a relevance score and the indexed content are included.
API endpoint
URL: biodiversitypmc.sibils.org/api/search
Mandatory input: q OR jq: either a free-text query q, which is interpreted by the query analyzer, OR a Lucene JSON query jq
Mandatory input: one collection (&col=), amongst "medline", "pmc", "plazi" and "suppdata"
Optional input: the number of requested documents (&n=, default 10, max 1000)
Example: simple search for MEDLINE (&col=) documents containing (&q=) Rhinolophus and Pangolin.
biodiversitypmc.sibils.org/api/search?q=Rhinolophus%20and%20Pangolin&col=medline
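As a minimal sketch, the URL above could be assembled in Python with the standard library only. The helper name build_search_url is hypothetical; the parameter names (q, col, n), the list of collections and the upper limit for n are taken from the description above, while the https scheme and the lower bound of 1 for n are assumptions.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://biodiversitypmc.sibils.org/api/search"
COLLECTIONS = ("medline", "pmc", "plazi", "suppdata")


def build_search_url(q, col, n=10):
    """Hypothetical helper: build a GET URL for a free-text search."""
    if col not in COLLECTIONS:
        raise ValueError(f"unknown collection: {col!r}")
    if not 1 <= n <= 1000:  # lower bound of 1 is an assumption
        raise ValueError("n must be between 1 and 1000")
    # quote_via=quote encodes spaces as %20, as in the example URL
    params = urlencode({"q": q, "col": col, "n": n}, quote_via=quote)
    return BASE_URL + "?" + params


url = build_search_url("Rhinolophus and Pangolin", "medline")
print(url)
```

Validating col and n on the client side avoids sending a request that the service would presumably reject.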
Example: customizable search (&jq) with a Lucene style json query
{"query": {"bool" : {"must": {"match" : { "title" : "Digitoxin metabolism" }},"should" : {"match" : { "annotations_str" : "GO" }},"boost": 1}}}
Query language: JSON queries for Elasticsearch are described on elastic.co
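The JSON query in the example can also be built from plain Python dictionaries and serialized into the jq parameter. The following is only a sketch: the helper bool_query is hypothetical, and the choice of "medline" as collection is an assumption made for illustration.

```python
import json


def bool_query(must=None, should=None, must_not=None):
    """Hypothetical helper: assemble an Elasticsearch bool query."""
    clauses = {}
    if must:
        clauses["must"] = must          # all clauses must match (AND)
    if should:
        clauses["should"] = should      # optional clauses (OR), raise the score
    if must_not:
        clauses["must_not"] = must_not  # excluded clauses (NOT)
    return {"query": {"bool": clauses}}


query = bool_query(
    must=[{"match": {"title": "Digitoxin metabolism"}}],
    should=[{"match": {"annotations_str": "GO"}}],
)

jq = json.dumps(query)                  # JSON string for the jq parameter
params = {"jq": jq, "col": "medline"}   # collection chosen for illustration
print(params["jq"])
```

Building the query as a dictionary and calling json.dumps once avoids hand-written quoting errors in the JSON string.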
Code sample
A Python script demonstrating POST calls to the API, with several examples of Lucene-style queries:
import requests # third-party package, not in the standard library (pip install requests)
import json
query = {}
# queries must be formatted in Lucene ElasticSearch style
# https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html
# here are different query types :
# MULTI_MATCH queries
# use it for searching in specific fields, with a general OR or AND
search_fields = ["title","abstract","keywords","mesh_terms"]
# you can use "title^3" to boost matches in the title by a factor of 3
keywords = "BRCA2 cancer"
my_operator = "and" # default is "or"
my_type = "phrase" # use it for phrase matching (exact expression)
my_query = {
    "size": 20,  # maximum number of hits returned
    "from": 0,   # offset of the first result to fetch
    "query": {
        "multi_match": {
            "query": keywords,
            "fields": search_fields
            # ,"operator": my_operator
            # ,"type": my_type
        }
    }
}
# BOOLEAN queries
# The AND/OR/NOT operators can be used to fine tune the search queries.
# This is implemented in the search API as a bool query.
# The bool query accepts a must parameter (equivalent to AND),
# a must_not parameter (equivalent to NOT),
# and a should parameter (equivalent to OR).
# https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
"""
# ex: ("EML4" OR "ALK" in title) AND ("Lung Cancer" in journal)
my_query = {
    "query": {
        "bool": {
            "must": [  # AND
                {
                    "bool": {
                        "should": [  # OR: one dictionary per clause
                            {"match": {"title": "EML4"}},
                            {"match": {"title": "ALK"}}
                        ]
                    }
                },
                {
                    "match": {"journal": "Lung Cancer"}
                }
            ]
        }
    }
}
"""
# EXPLOITING ANNOTATIONS
# the annotations_str field can be exploited in order to search for
# concept types (drugs, genes, diseases) or concept ids (gene
# P51587).
"""
# ex: records with identified BRCA2 (NX_P51587) and identified drugs
my_ids = "NP_NX_P51587"
my_types = "drugs"
my_operator = "and"
my_query = {
    "query": {
        "multi_match": {
            "query": my_ids + " " + my_types,
            "fields": "annotations_str",
            "operator": my_operator
        }
    }
}
"""
# call
url_API = "https://biodiversitypmc.sibils.org/api/search"
my_json_query = json.dumps(my_query) # json to string
my_params = {"jq": my_json_query, "col": "pmc"} # parameters dictionary
r = requests.post(url = url_API, data = my_params)
# get the response and write it to an output file
response = r.text
with open("SIBiLS_search.json", "w", encoding="utf-8") as file:
    file.write(response)
Output
Output is a native Elasticsearch response (JSON formatted) that includes a retrieval score for each document.
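A minimal sketch of how the scores could be read from such a response. It assumes the standard Elasticsearch layout, in which the documents sit under hits.hits and each carries _id, _score and _source; the service may wrap or extend this layout. The helper top_hits is hypothetical and the sample response is invented for illustration, not real output of the API.

```python
import json


def top_hits(response_text):
    """Hypothetical helper: return (id, score) pairs from a response."""
    data = json.loads(response_text)
    # standard Elasticsearch layout (assumed): {"hits": {"hits": [...]}}
    hits = data.get("hits", {}).get("hits", [])
    return [(hit.get("_id"), hit.get("_score")) for hit in hits]


# invented sample, shaped like a standard Elasticsearch response
sample = json.dumps({
    "hits": {
        "hits": [
            {"_id": "doc-1", "_score": 12.5, "_source": {"title": "First"}},
            {"_id": "doc-2", "_score": 7.25, "_source": {"title": "Second"}},
        ]
    }
})

for doc_id, score in top_hits(sample):
    print(doc_id, score)
```

Using .get() with defaults keeps the helper from failing when a response contains no hits.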