Knowledge transfer from science to economy
illustrated via patent - publication pairs:
reducing ambiguities with word embeddings and references

This is the code repository to reproduce the results and figures from [my paper](LINK MISSING).

Description
Data Sources
Synopsis
Remarks
File Description
Data Flow

Description

This repo builds up a docker container with all needed libaries and a Postgresql database for storing the intermediate stages and the results.

You can access the postgresql when the container is running, for the database credentials please see the output of the building script. The raw data files can be accessed anytime form outside the container.

└── home
    └── user        
        ├── code
        ├── data
        │   ├── patent                 persistent on local drive: stores raw patent data
        │   ├── postgresql             persistent on local drive: stores the DB
        │   └── publication            persistent on local drive: stores raw publication data
        └── figures

Data Sources

Publication data from the National Library of Medicine using the FTP server of NCBI
- homepage
- datasource
Patent data from the European Patent Office (EPO) using their htmls service
- homepage
- datasource
MeSH terms
- german and english
- french

Synopsis

to be used inside a UNIX BASH

install docker
clone this repo
run ./quamedfo_publication/build_image.sh -> you end up in a BASH inside the container
download the raw data manually. The EPO
run ```/home/user/code

remarks (Yes, i know, but ...)

YES, you need a heavy machine to run those scripts in a reasonable amount of time. This work was part of the QuaMedFo Project where ZB MED provides me with a heavy 40 CPU, 300 GB RAM VM. Reasonable time on this machine was still approx. 6 weeks.
YES, you're root inside the docker container and you've persistant drives. The requirement of this repo is not to deliver a production ready secure container.
YES, you need to be in the root group on your local machine. I could not solve the owner-ship issues of the persistant data folders and so you have to chown the folders to yourself each time you run the script again. If someone can solve those issues, please do a pull request, i would be very happy.
YES, the download of the raw data is manual and will stay manual. Be happy, that the script realizes that the EPO download limit is reached and will stop and not fill quietly your xml with useless data. Wait a day or two and start it again.
YES, the script uses the actual PubMed baseline data provided on the ftp-server. When you run it next year, you use different data for the publications. But in the observed time period 2000-2005 there should not be a lot of changes anymore (just guessing, I didn't check)
YES, this scripts only use a static snapshoot of the data. If you want to extend this repo: there are no continious delta-load functions. Additionally, you need intermediate steps, that you e.g. don't do the long lasting processing again and again, but also handle growing patent families. This functionality will be soon implemented in the ZB MED Search Portal for Live Sciences LIVIVO, where you'll get the patent-publication links as additional information to the latest publication data (and not only PubMed), using the latest patent data.

Data Flow

    graph TB
subgraph get data -> files
    EPO-->|funct02_01|file:/data/patent
    PubMed-->|funct01_01|file:/data/publication
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
subgraph files -> DB: stage
    file:/data/patent-->|funct02_02...funct02_05|patent.stage
    file:/data/publication-->|funct01_01|pubmed_baseline.stage
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
subgraph raw
    subgraph json/xml
        patent.stage-->|funct03_01|patent.raw
        pubmed_baseline.stage-->|funct03_01|pubmed_baseline.raw
    end
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    subgraph relational
        pubmed_baseline.raw-->|funct03_02|publ.raw_pubmed_baseline
        patent.raw-->|funct03_03|publ.raw_patente
    end
end    
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
subgraph 1. JOIN by names
    subgraph Process names pubmed
    publ.raw_pubmed_baseline-->|funct04_01|publ.names_pubmed_baseline
    publ.names_pubmed_baseline-->|funct04_02|publ.tmp_extract_country_iso
    publ.names_pubmed_baseline-->|funct04_03|publ.names_ext_pubmed_baseline
    publ.tmp_extract_country_iso-->|funct04_03|publ.names_ext_pubmed_baseline
    publ.names_ext_pubmed_baseline-->|funct04_04|publ.master_pubmed_baseline
    end
    %%%%%%%%%%%%%%%%%%%%%%%%%%%
    subgraph Process names patente
    publ.raw_patente-->|funct05_01|publ.names_raw_patente
    publ.names_raw_patente--->|funct05_02|publ.names_patente
    publ.names_patente-->|funct05_03|publ.master_patente
    end
    %%%%%%%%%%%%%%%%%%%%%%%%%%%
    publ.master_pubmed_baseline-->|funct06_01|publ.join_raw
    publ.master_patente-->|funct06_01|publ.join_raw
    publ.join_raw-->|funct06_02|publ.dist_publication
    publ.join_raw-->|funct06_03|publ.dist_patents
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
subgraph 2. content similarity
    publ.raw_pubmed_baseline-->|funct07_01|publ.mesh_pubmed_baseline
    publ.dist_publication-->publ.mesh_pubmed_baseline
    %%%%%%%%%%%%%%%%%%%%%%%
    publ.dist_patents-->publ.text_patente
    publ.raw_patente-->|funct08_01|publ.text_patente
    publ.text_patente-->|funct08_02|publ.text_patente_proc
    publ.text_patente_proc-->|funct08_03|publ.text_patente_proc
    publ.text_patente_proc-->|funct08_04|publ.mesh_patente
    %%%%%%%%%%%%%%%%%%%%%%%%
    publ.mesh_patente-->|funt09_01|publ.embed_bert_pubmed_baseline
    publ.mesh_pubmed_baseline-->|funt09_01|publ.embed_bert_patente
    %%%%%%%%%%%%%%%%%%%%%%%%%
    publ.embed_bert_pubmed_baseline-->|funct10_01|publ.join_cos_sim
    publ.embed_bert_patente-->|funct10_01|publ.join_cos_sim
    publ.join_raw-->publ.join_cos_sim
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
_[ ]-->|funct11_01|publ.journal_normalize
%%%%%%%%%%%%%%%%%%%%%%%%%%%
subgraph 3. common references
    pubmed_baseline.stage-->|funct11_02|publ.ref_pubmed_baseline
    publ.dist_publication-->publ.ref_pubmed_baseline
    %%%%%%%%%%%%%%%%%%%%%%%%%%
    publ.raw_patente-->|funct11_03|publ.ref_raw_patents
    publ.dist_patents-->publ.ref_raw_patents
    publ.journal_normalize-->publ.ref_raw_patents
    publ.ref_raw_patents-->|funct11_04|publ.ref_raw_patents
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%
    publ.join_raw-->publ.join_common_refs
    publ.ref_pubmed_baseline-->|funct11_05|publ.join_common_refs
    publ.ref_raw_patents-->|funct11_05|publ.join_common_refs
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
subgraph 4. ranking
    publ.raw_patente--->|funct12_01|publ.pat_ipc
    publ.join_raw--->|funct12_02|publ.join_all
    publ.join_common_refs--->|funct12_02|publ.join_all
    publ.join_cos_sim--->|funct12_02|publ.join_all
    publ.pat_ipc-->|funct12_02|publ.join_all

    publ.join_all-->|funct12_03|publ.join_all_ranked
end

Loading

File Description

	script in /home/user/code	purpose
	run.sh	main script to run the complete pipeline. Since there might be problems with the amount of data downloaded from EPO or PubMed, I recommend using the download scripts manually, comment them out in this file and run the rest automatically.
---	---	---
preparation	funct00_01_create_schemas_and_extensions.sql	creates needed schemas and extensions on the PostgresQL database
	funct00_02_SQL_functions.sql	create db-function for finding emailendings in text
	funct00_03_upload_MeSH_mainheadings.py	upload the MeSH terms
	funct00_04_countrycodes_for_emailendings.py	ISO countrycodes used by country-detection via emailending
	funct00_05_countrynames_for_keywordextraction.py	ISO countrynamens used for country-detection
---	---	---
get publication data	funct01_01_data_download_pubmed.py	downloads the publication data from PubMed. If you're a company. Pls check the legal rules for commercial / non-commercial use of this data yourself. better run this manually, because of download limit from PubMed
---	---	---
get patent data	funct02_01_data_download_epo.py	downloads the patent data from EPO as xml; better run this manually, because of download limit from EPO. When you're over the download limit, EPO will give you a valid xml with no useful data inside.
	funct02_02_create_patent_filelist.py	creates a list of all patentfiles, needed for upload to postgres
	funct02_03_create_table_patent_stage.sql	nomen est omen
	funct02_04_patent_xml_upload.sh	nomen est omen
	funct02_05_parse_pat_xml.py	helper script for funct02_04, dealing with enconding
---	---	---
create flat tables	funct03_01_create_pub_pat_raw.sql	creates postgres table for publication data
	funct03_02_create_unnested_pubmed_baseline.sql	unnest the publication xml into flat tables
	funct03_03_create_unnested_patente.sql	unnest the patente xml into flat tables
---	---	---
process publication data	funct04_01_proc_pubmed_names.sql	process publication author names
	funct04_02_country_extract.py	extract countries from authors affiliation
	funct04_03_normalize_pubmed_names.sql	normalize publication author names
	funct04_04_pubmed_master.sql
---	---	---
process patent data	funct05_01_patente_names.sql
	funct05_02_process_patent_names.py
	funct05_03_patent_master.sql
---	---	---
create raw publication patent pairs	funct06_01_join_patent_publication.py
	funct06_02_dist_publications.sql
	funct06_03_dist_patents.sql
---	---	---
extract MeSH from publications	funct07_01_pubmed_mesh.sql
---	---	---
extract MeSH from patents	funct08_01_patente_mesh.sql
	funct08_02_table_text_patente_proc.sql
	funct08_03_MeSH_extract.py
	funct08_04_patente_mainheadings.sql
---	---	---
create embeddings	funct09_01_bert_embed.py
---	---	---
calc cosine similarity	funct10_01_join_cosine_similarity.sql
---	---	---
common references	funct11_01_create_journalnorm_table.sql
	funct11_03_references_patents.sql
	funct11_02_references_publication.sql
	funct11_04_enrich_patents_w_crossref.py
	funct11_05_common_references.sql
---	---	---
combine and rank everything	funct12_01_patent_ipc.sql
	funct12_02_join_all.sql
	funct12_03_join_all_ranked.sql

script	what it does	comment
requirements.txt	for automatically installing all need Python libraries
my_pubmed_parser	pubmed parser from XXX with one small change by me

| country_name_iso_dict.p | data for normalizing country names to ISO codes|| | keyword_extraction.py | extracts keywords from text. here: extracts MeSH-Terms|| | med_mesh.sql | TODO: check if needed|| | LEGACY_04d_3_pubmed_names.sql ||| | LEGACY_05a_patente_names.sql ||| | LEGACY_05c_patent_master.sql ||| | LEGACY_pat_univ_applicants.sql ||| | db_credentials.sh ||| | PAT_get_data.py |||

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Knowledge transfer from science to economy
illustrated via patent - publication pairs:
reducing ambiguities with word embeddings and references

Description

Data Sources

Synopsis

remarks (Yes, i know, but ...)

Data Flow

File Description

Files

README.md

Latest commit

History

README.md

File metadata and controls

Knowledge transfer from science to economy illustrated via patent - publication pairs: reducing ambiguities with word embeddings and references

Description

Data Sources

Synopsis

remarks (Yes, i know, but ...)

Data Flow

File Description

Knowledge transfer from science to economy
illustrated via patent - publication pairs:
reducing ambiguities with word embeddings and references