GitHub - USPTO/TrademarkPublicData: Utilities which support the processing of XML based USPTO trademark bulk download files

TrademarkPublicData

Utilities which support the processing of XML based USPTO trademark bulk download files

Overview

The USPTO makes trademark data available to the public on both its own Bulk Data Storage System site and the external Reed Tech USPTO data portal. Trademark (TM) application data is published in XML format on both a daily and an annual basis. The collection of ZIP files on the Reed Tech site contains both the daily XML files (front files) and the annual XML files (back files). The daily XML files are created and uploaded each day and contain pending and registered trademark text data, including word mark, serial number, registration number, filing date, registration date, goods and services, classification number(s), status code(s) and design search code(s).

Reed Tech IP Services, "USPTO Data Portal: Trademark Daily + Annual XML Files - Applications"

The annual XML application files are available on both the USPTO Bulk Data Storage System site and the Reed Tech site and contain TM application data from April 7, 1884 through the end of 2018:

USPTO Bulkdata Daily + Annual Trademark XML(Front + Back Files)

Reed Tech Annual Trademark XML (Back files)

The TM annual bulk download application files are available from either the USPTO Bulk Data Storage System site or the Reed Tech site and consist of a series of ZIP files containing all trademark data from April 7, 1884 through the last day of the previous year. Once unzipped, the concatenated XML files range in size from roughly 400 MB to 3 GB and contain upwards of 80,000 trademark records per file. The files are too large to open in most standard text editors or IDEs. Some commercial XML tools, such as Oxygen XML Editor, can display files of this size, but they require a license. There are currently 55 TM annual application ZIP files in the series, containing all trademark data through the end of 2018. The annual TM files can be distinguished from the daily TM XML ZIP files on the site by their file names. Annual files are named using the start date, the last day of the year covered, and a two-digit series number (01-55):

e.g.
apc18840407-20181231-01.zip

apc18840407-20181231-55.zip

Daily XML application files are named using the date (YYMMDD) with no series number:
e.g.
apc190101.zip
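
The naming convention above can be checked programmatically. The following is a minimal sketch, not part of the repository: the function name `classify_tm_zip` and both patterns are illustrative assumptions derived only from the example file names shown here.

```python
import re

# Annual (back file) names: 8-digit start date, 8-digit end date and a
# two-digit series number, e.g. apc18840407-20181231-01.zip
ANNUAL_RE = re.compile(r"^apc(\d{8})-(\d{8})-(\d{2})\.zip$")
# Daily (front file) names: a 6-digit YYMMDD date and no series number,
# e.g. apc190101.zip
DAILY_RE = re.compile(r"^apc(\d{6})\.zip$")


def classify_tm_zip(file_name):
    """Return 'annual', 'daily' or 'unknown' for a TM application ZIP name."""
    if ANNUAL_RE.match(file_name):
        return "annual"
    if DAILY_RE.match(file_name):
        return "daily"
    return "unknown"


if __name__ == "__main__":
    for name in ("apc18840407-20181231-01.zip", "apc190101.zip"):
        print(name, "->", classify_tm_zip(name))
```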

The trademark splitter is a Python-based utility that separates out the trademarks contained within each bulk download application file and then builds a corpus using a directory structure based on the name and date of the ZIP file. The utility currently supports both the TM daily bulk download files (front files) and the TM annual bulk download files (back files). The splitter uses a buffered reader to read and process the large XML input file in chunks so that it does not run out of memory.
For each individual trademark extracted from the bulk application XML file, the TM splitter creates two files:

complete trademark containing all fields in USPTO standard trademark XML format
file containing a subset of fields in Solr ready XML format
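
The chunked reading described above might look roughly like the sketch below. It is an illustration of the general technique, not the code in tm_splitter.py. It assumes that each trademark record is wrapped in a `<case-file>` element; that tag name, the function name `iter_records` and the default chunk size are assumptions.

```python
import io

# Assumed record delimiters; adjust if the bulk XML uses different tags.
START_TAG = "<case-file>"
END_TAG = "</case-file>"


def iter_records(stream, chunk_size=1024 * 1024):
    """Yield each complete record from a text stream, one chunk at a time."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        while True:
            start = buffer.find(START_TAG)
            if start == -1:
                # No record starts here; keep only a tail long enough to
                # hold a start tag that was split across two chunks.
                buffer = buffer[-(len(START_TAG) - 1):]
                break
            end = buffer.find(END_TAG, start)
            if end == -1:
                # The record is incomplete; drop the text before it and
                # wait for the next chunk.
                buffer = buffer[start:]
                break
            end += len(END_TAG)
            yield buffer[start:end]
            buffer = buffer[end:]


if __name__ == "__main__":
    sample = io.StringIO("<root><case-file>A</case-file><case-file>B</case-file></root>")
    for record in iter_records(sample, chunk_size=8):
        print(record)
```

Because only the current buffer is held in memory, peak memory use depends on the chunk size and the largest single record rather than on the size of the input file. For a real file, the stream would come from `open()` in text mode with the file's actual encoding.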

The splitter uses regular expressions to match and extract the fields that are then exported to Solr ready XML format. Fields currently supported by the tool include the following:

trademark serial number used as the unique document id in Solr
mark name
mark drawing code
design codes
goods and services codes and descriptions
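
As a rough illustration of regex-based extraction into Solr's XML update format (`<add><doc><field name="...">`), consider the sketch below. It is not the splitter's actual code. The source tag names (`<serial-number>`, `<mark-identification>`, `<mark-drawing-code>`) and the Solr field names `mark_name` and `mark_drawing_code` are assumptions; only `id` as the unique key comes from the list above.

```python
import re

# Assumed tag names in the USPTO trademark XML; one pattern per field.
FIELD_PATTERNS = [
    ("id", re.compile(r"<serial-number>(.*?)</serial-number>", re.S)),
    ("mark_name", re.compile(r"<mark-identification>(.*?)</mark-identification>", re.S)),
    ("mark_drawing_code", re.compile(r"<mark-drawing-code>(.*?)</mark-drawing-code>", re.S)),
]


def to_solr_xml(record):
    """Build a Solr <add><doc> document from one trademark record string."""
    lines = ["<add>", "  <doc>"]
    for name, pattern in FIELD_PATTERNS:
        match = pattern.search(record)
        if match is None:
            continue  # field missing from this record
        # The captured text comes from XML and is already entity-escaped,
        # so it is written through unchanged.
        value = match.group(1).strip()
        lines.append('    <field name="{}">{}</field>'.format(name, value))
    lines += ["  </doc>", "</add>"]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = (
        "<case-file><serial-number>87275954</serial-number>"
        "<mark-identification>EXAMPLE MARK</mark-identification>"
        "<mark-drawing-code>4000</mark-drawing-code></case-file>"
    )
    print(to_solr_xml(sample))
```

Multi-valued fields such as design codes or goods and services entries could be handled the same way with `findall` and one `<field>` element per value.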

Required software

The tool was built and tested using Python 3.6, which can be downloaded from http://www.python.org.
Install Python 3.6 and run the following command from a command/terminal/shell window to confirm the version:
C:\TrademarkPublicData\TMProcessing>python -V
Python 3.6.0
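
The same check can be made from inside Python. This is a small sketch with a hypothetical helper name, `meets_minimum`. It only verifies that the interpreter is at least 3.6, since that is the version the tool was tested with; later versions are not confirmed by this README.

```python
import sys


def meets_minimum(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not meets_minimum():
        sys.exit("Python 3.6 or later is required")
    print("Python version OK:", sys.version.split()[0])
```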

Data download

The tool can be tested with the following trademark annual application XML input file:

apc18840407-20181231-55.zip

Unzip the file to a location with plenty of storage and confirm that it contains an XML file of the same name:
e.g. e:\TMData\Applications_XML\apc18840407-20181231-55.xml

A daily application file such as apc190329.zip can be used in the same way.
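
The unzip-and-confirm step can also be scripted with the standard library. The sketch below is illustrative only; the function name `extract_and_find_xml` is hypothetical, and it assumes the XML member sits at the top level of the archive under the same base name as the ZIP.

```python
import zipfile
from pathlib import Path


def extract_and_find_xml(zip_path, dest_dir):
    """Extract a TM ZIP and return the path of the same-named XML, or None."""
    zip_path = Path(zip_path)
    dest_dir = Path(dest_dir)
    with zipfile.ZipFile(str(zip_path)) as archive:
        archive.extractall(str(dest_dir))
    # e.g. apc190329.zip is expected to contain apc190329.xml
    expected = dest_dir / (zip_path.stem + ".xml")
    return expected if expected.is_file() else None


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        found = extract_and_find_xml(sys.argv[1], sys.argv[2])
        print(found if found else "No XML file with a matching name was found")
```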

Running the tool

Copy the Python script to the same directory as the XML test data:
e.g. C:\TrademarkPublicData\TMProcessing\tm_splitter.py

After the tool has completed processing, confirm that files were created for each trademark under the directory specified on the command line.
The utility creates a corpus structure using the input file name for the directory name and the trademark serial number for the file name:
e.g.

c:\tm_corpus\apc18840407-20181231-55-87275954\87275954.xml
c:\tm_corpus\apc18840407-20181231-55-87275954\solr\solr_87275954.xml
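
To spot-check the output for a given serial number, the expected paths can be derived and tested for existence. This sketch infers the layout solely from the two example paths above (a directory named after the input file and serial number, with the Solr file in a `solr` subdirectory). The function names are hypothetical, and the real layout should be confirmed against actual tool output.

```python
from pathlib import Path


def expected_outputs(corpus_root, input_name, serial):
    """Return (full_xml, solr_xml) paths for one trademark serial number."""
    record_dir = Path(corpus_root) / "{}-{}".format(input_name, serial)
    full_xml = record_dir / "{}.xml".format(serial)
    solr_xml = record_dir / "solr" / "solr_{}.xml".format(serial)
    return full_xml, solr_xml


def missing_outputs(corpus_root, input_name, serial):
    """Return the expected output files that do not exist on disk."""
    paths = expected_outputs(corpus_root, input_name, serial)
    return [path for path in paths if not path.is_file()]


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 4:
        for path in missing_outputs(*sys.argv[1:4]):
            print("missing:", path)
```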

Setting up PyDev with Eclipse

The following versions of PyDev are compatible with Eclipse:

Eclipse 4.6 (Neon), Java 8: PyDev 5.5
Eclipse 4.5, Java 8: PyDev 5.2.0
Eclipse 3.8, Java 7: PyDev 4.5.5
Eclipse 3.x, Java 6: PyDev 2.8.2

e.g. MyEclipse version 2016 CI 7 uses Eclipse 4.5
Configure Eclipse 4.5 with PyDev 5.2

Read the notes on the manual installation of PyDev with Eclipse: PyDev Install Manual 101

Download the PyDev zip file:
e.g. to download the PyDev 5.2 zip, use the following link:
PyDev 5.2
C:\Users\mdhen_000\Downloads\PyDev_5.2.0.zip

Copy the PyDev 5.2 zip to the Eclipse dropins directory and unzip it:
e.g.
C:\MyEclipse2016CI\dropins\PyDev5.2.0.zip
C:\MyEclipse2016CI\dropins\org.python.pydev.mylyn.feature_0.3.0.zip

Restart MyEclipse

Creating a Python project in Eclipse

Create an Eclipse project for tm_splitter.py.

Navigate to the Package Explorer tab in Eclipse:
Right mouse click:
Select New->Project->Other->PyDev->PyDev Project
Name: Python_TM_Splitter
Grammar: 3.0-3.5
Interpreter: C:\Python3.6\python.exe

Link to source code from Python project

Right mouse click in Package Explorer
New->Folder->Advanced
Select: Link to alternate location (Linked folder)->Browse
Select: C drive:
C:\TrademarkPublicData\TMProcessing

This will link to the external files and not create them in the workspace itself.

Configuring Preferences in Eclipse

Window->Preferences->PyDev->Interpreters:
Python 3.6:
C:\Python3.6\python.exe
Window->Preferences->PyDev->Editor:
Hover->PyDevDocstring Hover->Unchecked

Open PyDev Perspective

Right mouse click the Open Perspective icon (with the plus symbol) in the right-hand corner of the toolbar
Open Perspective->PyDev

Notice

This source code is a work in progress and has not been fully vetted for a production environment.

Other Information

The United States Department of Commerce (DOC) and the United States Patent and Trademark Office (USPTO) GitHub project code is provided on an ‘as is’ basis without any warranty of any kind, either expressed, implied or statutory, including but not limited to any warranty that the subject software will conform to specifications, any implied warranties of merchantability, fitness for a particular purpose, or freedom from infringement, or any warranty that the documentation, if provided, will conform to the subject software. DOC and USPTO disclaim all warranties and liabilities regarding third party software, if present in the original software, and distribute it as is. The user or recipient assumes responsibility for its use. DOC and USPTO have relinquished control of the information and no longer have responsibility to protect the integrity, confidentiality, or availability of the information.

User and recipient agree to waive any and all claims against the United States Government, its contractors and subcontractors as well as any prior recipient, if any. If user or recipient’s use of the subject software results in any liabilities, demands, damages, expenses or losses arising from such use, including any damages from products based on, or resulting from recipient’s use of the subject software, user or recipient shall indemnify and hold harmless the United States government, its contractors and subcontractors as well as any prior recipient, if any, to the extent permitted by law. User or recipient’s sole remedy for any such matter shall be immediate termination of the agreement. This agreement shall be subject to United States federal law for all purposes including but not limited to the validity of the readme or license files, the meaning of the provisions and rights and the obligations and remedies of the parties. Any claims against DOC or USPTO stemming from the use of its GitHub project will be governed by all applicable Federal law. “User” or “Recipient” means anyone who acquires or utilizes the subject code, including all contributors. “Contributors” means any entity that makes a modification.

This agreement or any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not in any manner constitute or imply their endorsement, recommendation or favoring by DOC or the USPTO, nor does it constitute an endorsement by DOC or USPTO or any prior recipient of any results, resulting designs, hardware, software products or any other applications resulting from the use of the subject software. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, including USPTO, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC, USPTO or the United States Government.

To the extent possible under law, https://github.com/USPTO/TrademarkPublicData has waived all copyright and related or neighboring rights to Trademark Public Data. This work is published from: United States.
