XML Publication Analyzer

This mini-assignment demonstrates how to aggregate and extract tag content from a large XML file efficiently. It parses the file with a SAX (Simple API for XML) handler class, which processes the document as a stream of events instead of loading it all into memory, so memory usage stays low even on large datasets. Once aggregation is complete, the resulting dictionary is visualized as a bar chart using matplotlib.
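The streaming aggregation described above can be sketched roughly as follows. This is a minimal illustration using Python's standard `xml.sax` module, not the code in `libs/xml_parser.py`; the names `TagCounter` and `count_tag_values` and the inline sample document are assumptions made for the example.

```python
import io
import xml.sax
from collections import Counter


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts the text found inside each <tag_name> element."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.counts = Counter()
        self._inside = False   # True while the parser is inside a matching element
        self._buffer = []      # text chunks collected for the current element

    def startElement(self, name, attrs):
        if name == self.tag_name:
            self._inside = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so accumulate them.
        if self._inside:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self.tag_name and self._inside:
            value = "".join(self._buffer).strip()
            if value:
                self.counts[value] += 1
            self._inside = False


def count_tag_values(source, tag_name):
    """Parse `source` (a path or file-like object) and return {value: count}."""
    handler = TagCounter(tag_name)
    xml.sax.parse(source, handler)
    return dict(handler.counts)


# Tiny made-up sample standing in for a large XML file.
sample = io.BytesIO(
    b"<dblp>"
    b"<article><year>2020</year></article>"
    b"<article><year>2021</year></article>"
    b"<article><year>2021</year></article>"
    b"</dblp>"
)
print(count_tag_values(sample, "year"))  # {'2020': 1, '2021': 2}
```

Because the handler keeps only the running counts and the text of the element currently being read, memory use depends on the number of distinct values rather than on the size of the file.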

Folder Structure

The project follows this structure:

.
├── code
│   └── main.py               # Main script to run the XML analyzer
├── dataset
│   └── file.xml              # Place your XML files here (e.g., file.xml)
├── libs
│   ├── xml_parser.py         # XML parsing logic
│   ├── logger_config.py      # Logger setup
│   ├── file_handler.py       # File handling utilities
│   └── plotter.py            # Plotting functions
├── .gitignore
├── README.md
└── requirements.txt

Getting Started

Prerequisites

Python 3.x
Required packages listed in requirements.txt

To install the necessary packages, run:

pip install -r requirements.txt

Dataset

To replicate the results, download the DBLP XML dataset from this link, unzip it, and place the .xml file in the dataset folder. Then either rename it to file.xml or update the path in the code.

Usage

Add your XML file to the dataset folder.
In main.py, set the tag_name variable to the tag you want to count. For example, to count occurrences of <year> tags, set:
```
tag_name = "year"
```
Run the main script:
```
python code/main.py
```

Output

The script outputs a dictionary that maps each distinct value of the chosen tag (for example, each year) to the number of times it occurs, and displays a bar chart of those counts.

Example

An example XML file containing <year> tags can be found in the DBLP dataset linked above. The dictionary output will look something like:

{"2020": 100, "2021": 150, "2022": 200}

The program will also display a bar chart with the counts per year.
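As a rough sketch of the charting step, the dictionary can be sorted by key and passed to matplotlib's bar plot. This is not the code in `libs/plotter.py`; the function names `sorted_series` and `plot_counts`, the axis labels, and the title are assumptions for illustration.

```python
def sorted_series(counts):
    """Return (labels, values) sorted by label so the bars appear in order."""
    # Keys are strings; plain string sorting is chronological for four-digit years.
    labels = sorted(counts)
    values = [counts[label] for label in labels]
    return labels, values


def plot_counts(counts, tag_name="year"):
    """Hypothetical helper: draw one bar per distinct tag value."""
    # Imported here so the sorting helper works without matplotlib installed.
    import matplotlib.pyplot as plt

    labels, values = sorted_series(counts)
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_xlabel(tag_name)
    ax.set_ylabel("count")
    ax.set_title(f"Occurrences per <{tag_name}> value")
    plt.show()
```

Calling `plot_counts({"2020": 100, "2021": 150, "2022": 200})` would open a window with three bars, one per year.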

License

This project is licensed under the MIT License. See the LICENSE file for details.

menychtak/SAX-XML_mini_assignment
