A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
Pubmed Parser is a Python library for parsing the PubMed Open-Access (OA) subset, MEDLINE XML repositories, and Entrez Programming Utilities (E-utils). It uses the lxml library to parse this information into a Python dictionary that can be used directly in research workflows such as text mining and natural language processing pipelines.
For available APIs and details about the dataset, please see our wiki page or documentation page. Below, we list some of the core functionalities and code examples.
The path argument provided to a function can be the path to a compressed or uncompressed XML file. We provide example files in the data folder.
Below, we list the parsers available in pubmed_parser.
We created a simple parser for the PubMed Open Access Subset: pass an XML path or string to the function parse_pubmed_xml, which returns a dictionary with the following information:
full_title: article's title
abstract: abstract
journal: Journal name
pmid: PubMed ID
pmc: PubMed Central ID
doi: DOI of the article
publisher_id: publisher ID
author_list: list of authors with affiliation keys in the following format
[['last_name_1', 'first_name_1', 'aff_key_1'], ['last_name_1', 'first_name_1', 'aff_key_2'], ['last_name_2', 'first_name_2', 'aff_key_1'], ...]
affiliation_list: list of affiliation keys and affiliation strings in the following format
[['aff_key_1', 'affiliation_1'], ['aff_key_2', 'affiliation_2'], ...]
publication_year: publication year
subjects: subjects listed in the article, separated by semicolons. Sometimes it contains only the type of the article, such as a research article, review proceedings, etc.
import pubmed_parser as pp
dict_out = pp.parse_pubmed_xml(path)
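As a rough illustration of how the author_list and affiliation_list formats above fit together, the sketch below resolves each author's affiliation keys to affiliation strings. The helper name authors_with_affiliations and the shortened sample values are assumptions made for this example; they are not part of pubmed_parser.

```python
def authors_with_affiliations(article):
    """Map each author's full name to the affiliation strings they are linked to.

    `article` is assumed to be a dictionary in the format described above,
    with `author_list` and `affiliation_list` keys.
    """
    # Build a lookup from affiliation key to affiliation string.
    aff_lookup = {key: text for key, text in article["affiliation_list"]}
    authors = {}
    for last_name, first_name, aff_key in article["author_list"]:
        name = f"{first_name} {last_name}"
        # An author appears once per affiliation key, so collect them all.
        authors.setdefault(name, []).append(aff_lookup.get(aff_key))
    return authors


# Illustrative sample in the documented format (values shortened by hand).
sample_article = {
    "author_list": [
        ["Dennehy", "John J", "I1"],
        ["Dennehy", "John J", "I2"],
        ["Wang", "Ing-Nang", "I1"],
    ],
    "affiliation_list": [
        ["I1", "Department of Biological Sciences"],
        ["I2", "Biology Department, Queens College"],
    ],
}

resolved = authors_with_affiliations(sample_article)
```

In practice you would pass the dictionary returned by pp.parse_pubmed_xml(path) instead of the hand-written sample.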
The function parse_pubmed_references processes a PubMed Open Access XML file and returns a list of dictionaries, one for each reference the article cites. Each dictionary has the following keys:
pmid: PubMed ID of the article
pmc: PubMed Central ID of the article
article_title: title of cited article
journal: journal name
journal_type: type of journal
pmid_cited: PubMed ID of article that article cites
doi_cited: DOI of article that article cites
year: publication year as it appears in the reference (may include a letter suffix, e.g. 2007a)
dicts_out = pp.parse_pubmed_references(path) # returns a list of dictionaries
The function parse_pubmed_caption parses image captions from a given path to an XML file. It returns a reference index that you can use to refer back to the actual images. The function returns a list of dictionaries, each with the following keys:
pmid: PubMed ID
pmc: PubMed Central ID
fig_caption: string of caption
fig_id: reference id for figure (use to refer in XML article)
fig_label: label of the figure
graphic_ref: reference to image file name provided from Pubmed OA
dicts_out = pp.parse_pubmed_caption(path) # returns a list of dictionaries
If you are interested in parsing the text surrounding a citation, the library provides that functionality as well. You can use parse_pubmed_paragraph to parse text and reference PMIDs. This function returns a list of dictionaries, where each entry has the following keys:
pmid: PubMed ID
pmc: PubMed Central ID
text: full text of the paragraph
reference_ids: list of reference codes within that paragraph. These IDs can be merged with the output from parse_pubmed_references.
section: section of paragraph (e.g. Background, Discussion, Appendix, etc.)
dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)
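The paragraph dictionaries can be aggregated with plain Python. The sketch below counts paragraphs and reference codes per section; the helper name summarize_sections and the sample values (including the reference codes "B1" and "B2") are hypothetical and only follow the key layout listed above.

```python
from collections import defaultdict


def summarize_sections(paragraphs):
    """Count paragraphs and reference codes for each section name."""
    summary = defaultdict(lambda: {"n_paragraphs": 0, "n_references": 0})
    for paragraph in paragraphs:
        entry = summary[paragraph["section"]]
        entry["n_paragraphs"] += 1
        entry["n_references"] += len(paragraph["reference_ids"])
    return dict(summary)


# Hand-written sample that mimics the documented keys (not real parser output).
sample_paragraphs = [
    {"pmid": "21810267", "pmc": "3166277", "text": "First paragraph.",
     "reference_ids": ["B1", "B2"], "section": "Background"},
    {"pmid": "21810267", "pmc": "3166277", "text": "Second paragraph.",
     "reference_ids": ["B2"], "section": "Background"},
    {"pmid": "21810267", "pmc": "3166277", "text": "Third paragraph.",
     "reference_ids": [], "section": "Discussion"},
]

section_summary = summarize_sections(sample_paragraphs)
```

The same function should work on the list returned by pp.parse_pubmed_paragraph, provided each entry carries the section and reference_ids keys described above.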
You can use parse_pubmed_table to parse tables from an XML file. This function returns a list of dictionaries, each with the following keys:
pmid: PubMed ID
pmc: PubMed Central ID
caption: caption of the table
label: label of the table
table_columns: list of column names
table_values: list of values inside the table
table_xml: raw XML text of the table (returned if return_xml=True)
dicts_out = pp.parse_pubmed_table('data/6605965a.nxml', return_xml=False)
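To work with a parsed table row by row, it can help to pair each value with its column name. The sketch below does this under the assumption that table_values is a list of rows, each aligned with table_columns; that layout, the helper name table_to_records, and the sample table are assumptions for illustration and should be checked against real output.

```python
def table_to_records(table):
    """Turn one parsed table into a list of {column_name: value} dictionaries.

    Assumes `table["table_values"]` is a list of rows and that each row has
    the same order as `table["table_columns"]`.
    """
    columns = table["table_columns"]
    return [dict(zip(columns, row)) for row in table["table_values"]]


# Hypothetical table in the documented key layout (not real parser output).
sample_table = {
    "caption": "Example caption",
    "label": "Table 1",
    "table_columns": ["strain", "lysis_time"],
    "table_values": [["A", "45"], ["B", "52"]],
}

records = table_to_records(sample_table)
```

Note that zip stops at the shorter sequence, so a row with fewer values than columns silently yields a shorter record.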
MEDLINE XML has a different XML format than PubMed Open Access. The structure of the XML files is described in the MEDLINE/PubMed DTD. You can use the function parse_medline_xml to parse that format. This function returns a list of dictionaries, where each element contains:
pmid: PubMed ID
pmc: PubMed Central ID
doi: DOI
other_id: other IDs found, each separated by ;
title: title of the article
abstract: abstract of the article
authors: authors, each separated by ;
mesh_terms: list of MeSH terms with corresponding MeSH ID, each separated by ; e.g. 'D000161:Acoustic Stimulation; D000328:Adult; ...'
publication_types: list of publication types, each separated by ; e.g. 'D016428:Journal Article'
keywords: list of keywords, each separated by ;
chemical_list: list of chemical terms, each separated by ;
pubdate: Publication date. Defaults to year information only.
journal: journal of the given paper
medline_ta: abbreviation of the journal name
nlm_unique_id: NLM unique identification
issn_linking: ISSN linkage, typically used to link with the Web of Science dataset
country: Country extracted from journal information field
reference: string of PMIDs, each separated by ; or a list of references made to the article
delete: boolean. If False, the paper was updated, so you might have two XML records for the same paper. You can delete the record of a deleted paper because it was updated.
dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',
                                 year_info_only=False,
                                 nlm_category=False,
                                 author_list=False,
                                 reference_list=False) # returns a list of dictionaries
To extract month and day information from PubDate, set year_info_only=False. Structured abstracts are also parsed, and the nlm_category argument controls how each section or label is displayed.
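Several MEDLINE fields are returned as semicolon-separated strings. The sketch below splits a mesh_terms string of the form shown above into (MeSH ID, term) pairs; the helper name split_mesh_terms is hypothetical and not part of pubmed_parser.

```python
def split_mesh_terms(mesh_terms):
    """Split 'ID:Term; ID:Term' into a list of (mesh_id, term) tuples."""
    pairs = []
    for item in mesh_terms.split(";"):
        item = item.strip()
        if not item:
            continue  # skip empty pieces, e.g. from an empty input string
        # Split on the first colon only, in case the term itself contains one.
        mesh_id, _, term = item.partition(":")
        pairs.append((mesh_id.strip(), term.strip()))
    return pairs


mesh_pairs = split_mesh_terms("D000161:Acoustic Stimulation; D000328:Adult")
```

The same pattern applies to publication_types, which uses the same ID:Name layout.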
Use parse_medline_grant_id to parse MEDLINE grant IDs from an XML file. This returns a list of dictionaries, each containing:
pmid: PubMed ID
grant_id: Grant ID
grant_acronym: Acronym of grant
country: country the grant funding comes from
agency: Grant agency
If no grant ID is found, it returns None.
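Because the grant parser may return None, downstream code should guard against that case. The sketch below groups grant IDs by agency and treats None as an empty result; the helper name grants_by_agency and the placeholder sample values are made up for this example.

```python
from collections import defaultdict


def grants_by_agency(grants):
    """Group grant IDs by agency, treating a None result as 'no grants'."""
    grouped = defaultdict(list)
    for grant in grants or []:  # `grants` may be None when nothing is found
        grouped[grant["agency"]].append(grant["grant_id"])
    return dict(grouped)


# Placeholder records in the documented key layout (not real grant data).
sample_grants = [
    {"pmid": "1", "grant_id": "GRANT-A", "grant_acronym": "AA",
     "country": "Country X", "agency": "Agency One"},
    {"pmid": "1", "grant_id": "GRANT-B", "grant_acronym": "BB",
     "country": "Country X", "agency": "Agency One"},
    {"pmid": "2", "grant_id": "GRANT-C", "grant_acronym": "CC",
     "country": "Country Y", "agency": "Agency Two"},
]

grouped_grants = grants_by_agency(sample_grants)
no_grants = grants_by_agency(None)
```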
You can use Pubmed Parser to parse XML from E-Utilities with parse_xml_web. For this function, you provide a single pmid as input and get a dictionary with the following keys:
title: title
abstract: abstract
journal: journal
affiliation: affiliation of first author
authors: string of authors, separated by ;
year: Publication year
keywords: keywords or MeSH terms of the article
dict_out = pp.parse_xml_web(pmid, save_xml=False)
The function parse_citation_web lets you get the citations to a given PubMed ID or PubMed Central ID. It returns a dictionary with the following keys:
pmc: PubMed Central ID
pmid: PubMed ID
doi: DOI of the article
n_citations: number of citations for the given article
pmc_cited: list of PMCs that cite the given PMC
dict_out = pp.parse_citation_web(doc_id, id_type='PMC')
The function parse_outgoing_citation_web lets you get the articles that a given article cites, given a PubMed ID or PubMed Central ID. It returns a dictionary with the following keys:
n_citations: number of cited articles
doc_id: the document identifier given
id_type: the type of identifier given, either 'PMID' or 'PMC'
pmid_cited: list of PMIDs cited by the article
dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')
Identifiers should be passed as strings. PubMed Central IDs are the default and should be passed as strings without the 'PMC' prefix. If no citations are found, or if no article matching doc_id is found in the indicated database, it returns None.
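The identifier rules above (strings only, no 'PMC' prefix) are easy to enforce before calling the web functions. The sketch below is one possible way to do that; normalize_doc_id is a hypothetical helper, not a function provided by pubmed_parser.

```python
def normalize_doc_id(doc_id):
    """Return the identifier as a string without a leading 'PMC' prefix."""
    text = str(doc_id).strip()
    if text.upper().startswith("PMC"):
        text = text[3:]  # drop the three-character prefix
    return text


pmc_id = normalize_doc_id("PMC3166277")
pmid = normalize_doc_id(21810267)
```

The normalized value can then be passed as doc_id, and the result should still be checked for None before any keys are read.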
Install directly from the repository
pip install git+https://github.com/titipata/pubmed_parser.git
or clone the repository and install using pip
git clone https://github.com/titipata/pubmed_parser
pip install ./pubmed_parser
You can test your installation by running
pytest --cov=pubmed_parser tests/ --verbose
An example usage is shown as follows
import pubmed_parser as pp
path_xml = pp.list_xml_path('data') # list all xml paths under directory
pubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output
print(pubmed_dict)

{'abstract': u"Background Despite identical genotypes and ...",
 'affiliation_list': [['I1', 'Department of Biological Sciences, ...'],
                      ['I2', 'Biology Department, Queens College, and the Graduate Center ...']],
 'author_list': [['Dennehy', 'John J', 'I1'],
                 ['Dennehy', 'John J', 'I2'],
                 ['Wang', 'Ing-Nang', 'I1']],
 'full_title': u'Factors influencing lysis time stochasticity in bacteriophage \u03bb',
 'journal': 'BMC Microbiology',
 'pmc': '3166277',
 'pmid': '21810267',
 'publication_year': '2011',
 'publisher_id': '1471-2180-11-174',
 'subjects': 'Research Article'}
This snippet parses the entire PubMed Open Access subset using PySpark 2.1
import os
import pubmed_parser as pp
from pyspark.sql import Row

# assumes `spark` is an existing SparkSession
path_all = pp.list_xml_path('/path/to/xml/folder/')
path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)
parse_results_rdd = path_rdd.map(lambda x: Row(file_name=os.path.basename(x), **pp.parse_pubmed_xml(x)))
pubmed_oa_df = parse_results_rdd.toDF() # Spark dataframe
pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi', 'file_name', 'pmc', 'pmid', 'publication_year', 'publisher_id', 'journal', 'subjects']] # select columns
pubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite') # write dataframe
See the scripts folder for more information.
If you use Pubmed Parser, please cite it from JOSS as follows
Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979
or using BibTeX
@article{Achakulvisut2020,
  doi = {10.21105/joss.01979},
  url = {https://doi.org/10.21105/joss.01979},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {46},
  pages = {1979},
  author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording},
  title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset},
  journal = {Journal of Open Source Software}
}
We welcome contributions from anyone who would like to improve Pubmed Parser. You can create GitHub issues to discuss questions or problems relating to the repository. We suggest reading our Contributing Guidelines before creating issues, reporting bugs, or contributing to the repository.
This package is developed in Konrad Kording's Lab at the University of Pennsylvania. We would like to thank the reviewers and the editor from JOSS, including tleonardi, timClicks, and majensen. They made our repository much better!
MIT License Copyright (c) 2015-2020 Titipat Achakulvisut, Daniel E. Acuna