GitHub - Jong-Sig/patents: Parse and cluster USPTO patent data. Includes applications, grants, assignments, and maintenance.

USPTO Patent Parser

Parse patent application, grant, assignment, and maintenance data from USPTO Bulk Data. The parsers handle all patent formats and output plain CSV. Firms are clustered by name: candidate pairs are first filtered using locality-sensitive hashing, and clusters are then taken as the connected components induced by a Levenshtein distance threshold.
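To make the clustering idea concrete, here is a minimal sketch of the two stages using only the standard library. It is not the code in firm_cluster.py: the function names, the MinHash banding scheme, the shingle size, and the default threshold are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(name, k=3):
    """Return the set of character k-grams of a lowercased, whitespace-normalized name."""
    s = " ".join(name.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def minhash_signature(name, num_hashes=16):
    """MinHash: for each seeded hash function, keep the smallest shingle hash."""
    grams = shingles(name)
    return [
        min(hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).hexdigest()
            for g in grams)
        for seed in range(num_hashes)
    ]


def candidate_pairs(names, num_hashes=16, band_size=2):
    """LSH filter: two names become candidates if any signature band matches."""
    buckets = defaultdict(set)
    for idx, name in enumerate(names):
        sig = minhash_signature(name, num_hashes)
        for b in range(0, num_hashes, band_size):
            buckets[(b, tuple(sig[b:b + band_size]))].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cluster_names(names, max_dist=2):
    """Union candidate pairs within max_dist; return the connected components."""
    parent = list(range(len(names)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in candidate_pairs(names):
        if levenshtein(names[i].lower(), names[j].lower()) <= max_dist:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for idx, name in enumerate(names):
        groups[find(idx)].append(name)
    return sorted(sorted(g) for g in groups.values())
```

The LSH stage only decides which pairs are worth an edit-distance check, so near-duplicates that never share a band are missed; more hash functions or smaller bands trade speed for recall. The xxhash, editdistance, and networkx packages listed under Requirements presumably fill these roles in the actual scripts.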

Requirements

For parsing, you'll need: numpy, pandas, and lxml. For firm clustering, you'll additionally need: xxhash, editdistance, networkx, and Cython. All of these are available through both pip and conda.

Usage

Below is the pipeline that you'll want to follow. The entire process is designed to be idempotent, meaning you can rerun any step to propagate updates.

Downloading Data

You can download and extract the raw XML data files using the fetch_*.py scripts. Each one reads the list of files to download from the respective meta/*_files.txt and stores them in the corresponding data/* directory. By default, the fetch scripts do not overwrite existing files; they download only those that are absent.
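The skip-if-present behavior is what makes rerunning a fetch step cheap. A minimal sketch of that pattern follows; it is not the repository's fetch code. The helper name fetch_missing is hypothetical, and the sketch assumes the list file holds one URL per line, which may not match the real format of meta/*_files.txt.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_missing(list_file, dest_dir, download=urlretrieve):
    """Download every listed file that is not already in dest_dir.

    Assumes one URL per line in list_file (an assumption, see above).
    Returns the names of the files fetched on this run.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    fetched = []
    for line in Path(list_file).read_text().splitlines():
        url = line.strip()
        if not url:
            continue  # ignore blank lines
        target = dest / url.rsplit("/", 1)[-1]
        if target.exists():
            continue  # never overwrite an existing file
        download(url, str(target))
        fetched.append(target.name)
    return fetched
```

Passing the downloader as a parameter keeps the function testable without network access; a second call with the same arguments returns an empty list because every target already exists.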

Parsing Data

Parsing is done using the parse_*.py scripts. For apply, grant, and assign, these write separate files to parsed/* so as to allow for multi-threaded parsing. Running load_data.py afterwards combines these intermediate files into unified files in tables. In contrast, parse_maint.py and parse_compu.py output directly to tables.
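The combine step can be pictured as concatenating per-file CSV outputs that share a header. The sketch below illustrates that idea with the standard csv module; it is not load_data.py itself, and the helper name combine_csv, the file naming pattern, and the column names in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def combine_csv(parts_dir, pattern, out_path):
    """Concatenate CSV parts that share a header into one file.

    The output is rewritten from scratch on every call, so rerunning
    after new parts appear simply regenerates it. Returns the number
    of data rows written.
    """
    parts = sorted(Path(parts_dir).glob(pattern))
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    header = None
    rows = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for part in parts:
            with open(part, newline="") as f:
                reader = csv.reader(f)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # empty part file
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"header mismatch in {part}")
                for row in reader:
                    writer.writerow(row)
                    rows += 1
    return rows
```

A call such as combine_csv("parsed/grant", "*.csv", "tables/grant.csv") would merge every part into a single table, assuming that directory layout. Keep the output outside the directory being globbed so a rerun does not read its own previous result.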

Firm Clustering

If you're parsing patent assignments, run firm_assign.py to flag assignments between the same entity. To cluster firms into common entities based on name similarity, run firm_cluster.py. You can use the sources option to control which names are included. Run firm_cites.py to aggregate citation information to the patent level and account for self-citations.
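Accounting for self-citations depends on the firm clusters: a citation counts as a self-citation when the citing and cited patents belong to the same firm. A minimal sketch of that aggregation follows. The function name, input shapes, and output keys are assumptions for illustration and do not reflect the actual schema used by firm_cites.py.

```python
from collections import Counter


def citation_counts(citations, patent_firm):
    """Aggregate citations received per patent, separating self-citations.

    citations:   iterable of (citing_patent, cited_patent) pairs.
    patent_firm: dict mapping patent id -> firm id (e.g. from clustering).
    Patents with no known firm are never counted as self-citing.
    """
    total = Counter()
    self_cites = Counter()
    for citing, cited in citations:
        total[cited] += 1
        src = patent_firm.get(citing)
        dst = patent_firm.get(cited)
        if src is not None and src == dst:
            self_cites[cited] += 1
    return {
        patent: {
            "cites": n,
            "self_cites": self_cites[patent],
            "external_cites": n - self_cites[patent],
        }
        for patent, n in total.items()
    }
```

For example, if patents A and B belong to one firm and C to another, a citation from B to A is a self-citation while a citation from C to A is external.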

Finally, run firm_merge.py to merge all of the above into a firm-year panel. This last step requires that applications, grants, assignments, and maintenance events have all been processed; Compustat data is optional.

Example

Suppose you just want to parse patent grants. To do this, you would go through the following steps:

1. Fetch the grant data with fetch_grant.py.
2. Parse the grant data with parse_grant.py.
3. Load the grant data with load_data.py grant.
4. Cluster firm names with firm_cluster.py grant.
5. Process citations with firm_cites.py.
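The steps above can be chained in a small driver script. The sketch below assumes it is run from the repository root and that each script needs only the arguments shown in the list; the driver itself, including the names GRANT_STEPS and run_grant_pipeline, is hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script names and arguments are taken from the steps listed above.
GRANT_STEPS = [
    ["fetch_grant.py"],
    ["parse_grant.py"],
    ["load_data.py", "grant"],
    ["firm_cluster.py", "grant"],
    ["firm_cites.py"],
]


def run_grant_pipeline(run=subprocess.run, python=sys.executable):
    """Run each step in order; check=True stops at the first failing step."""
    for step in GRANT_STEPS:
        run([python, *step], check=True)


if __name__ == "__main__":
    run_grant_pipeline()
```

Because the pipeline is designed to be idempotent, rerunning the driver after a failure or after new data is published should be safe.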

Folders and files

meta
tools
.gitignore
LICENSE
README.md
fetch_apply.py
fetch_assign.py
fetch_grant.py
fetch_maint.py
firm_assign.py
firm_cites.py
firm_cluster.py
firm_merge.py
load_data.py
parse_apply.py
parse_assign.py
parse_compu.py
parse_grant.py
parse_maint.py
requirements.txt
