Tamandua is a framework for logfile analysis and aggregation developed at and for the IT Services Group D-PHYS (called ISG in this readme) at ETH Zurich.
The ISG runs a complex mail infrastructure. It consists of four different servers, each responsible for a different task. Every server generates logfiles about the actions it performs. These logs are consolidated on a fifth server, called the loghost. The goal of Tamandua is to take the consolidated logfile, parse it, extract information and aggregate this information into so-called mail-objects.
Each mail-object represents one mail processed by the infrastructure. This information serves two purposes: first, statistics are generated to provide the system administrators at the ISG with an overview of the current mail flow. Secondly, the mail objects can be searched in order to understand and resolve errors (for example false positives in the virus and spam filter). This search function is also used to help customers (mainly ETH internal) resolve problems, for example when a mail "went missing" or "wasn't delivered".
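Purely as an illustration, a mail-object could look roughly like the following (the field names are made up for this readme and do not reflect the exact schema Tamandua produces):

{
    # hypothetical fields, for illustration only
    "sender": "alice@example.org",
    "recipient": "bob@phys.ethz.ch",
    "hostname": "hostname1",
    "deliverytime": "Jan  5 12:34:56",
    "action": "delivered"
}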
Tamandua was specifically designed and developed for the ISG. This means that Tamandua contains domain-specific parts and will probably not work out of the box in another environment. However, it can easily be adapted to operate in other environments.
- Setup
- Usage
- Docs
- Deployment
- Configuration
- Extending Tamandua
- Authors
- License
- Appendix
Tamandua requires:
- python >= 3.5 (Because: type hints)
- MongoDB >= 3.4 (Because: $split)
First you need to install the dependencies. You can do this via a Linux package manager (eg. apt), via a virtual environment, or by installing the dependencies globally. If you want to use Linux packages, please refer to deployment/roles/tamandua/tasks/main.yml.
Using a virtualenv:
$ cd <tamandua-dir>
$ python3 -m venv ve
$ source ve/bin/activate
(ve) $ pip install -r requirements.txt
The manager encapsulates the parser and associated administrative tasks. In almost all cases you will use the manager.
Most commonly used options:
(ve) $ tamandua_manager --help # show all CLI options with description
run # run the parser pipeline:
# get logfile diff --> run parser --> cleanup --> remove local temp data
cleanup # remove data which is older than 30 days (default)
(ve) $ tamandua_parser <logfile>
--print-data # print data after aggregation (eg. mail objects)
--print-msgs # print system messages (eg. number of currently processed lines or aggregated objects)
--help # show all options
NOTE: For production, use the web-app in combination with wsgi, ssl, auth and apache2 (or something similar).
# Run web-app in debug mode
(ve) $ tamandua_web
# open your browser and navigate to: http://localhost:8080/
Generate ToC:
$ doctoc --title '## Table of contents' --github README.md
Note that you first need to install doctoc:
$ npm install -g doctoc
Ansible is used to automate the deployment.
Run the deployment:
$ cd deployment
$ ansible-playbook tamandua_deployment.yml
For more information, consider looking at the tamandua role.
$ <your-fav-editor> config.json
{
    # only gather data from the following hosts:
    "limit_hosts": [
        "hostname1",
        "hostname2"
    ],

    # regex which extracts month, day, time and hostname from every logline
    "preregex": "^(?P<month>[^\\s]*?)\\s{1,2}(?P<day>[^\\s]*?)\\s(?P<time>[^\\s]*?)\\s(?P<hostname>[^\/\\s]*?)[^\\w-]+?",

    # specify database type
    "database_type": "mongo",

    #
    # All options from this point on vary depending on the database_type.
    # The options here are for mongodb (currently only mongodb is supported).
    #

    # name of the database to store the data
    "database_name": "tamandua",

    # name of the collection containing the complete data
    "collection_complete": "complete",
    # and the incomplete data
    "collection_incomplete": "incomplete",
    # name of the collection of the metadata
    "collection_metadata": "metadata",

    # servername/ip
    "dbserver": "localhost",
    # port
    "dbport": 27017
}
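To illustrate what the preregex extracts, here is a small sketch; the logline is a made-up syslog-style example, not taken from a real log:

import re

# the preregex from config.json, written as a raw Python string
preregex = re.compile(
    r'^(?P<month>[^\s]*?)\s{1,2}(?P<day>[^\s]*?)\s(?P<time>[^\s]*?)'
    r'\s(?P<hostname>[^/\s]*?)[^\w-]+?'
)

# hypothetical logline
line = 'Jan  5 12:34:56 hostname1 postfix/smtpd[1234]: connect from example.org'

print(preregex.match(line).groupdict())
# {'month': 'Jan', 'day': '5', 'time': '12:34:56', 'hostname': 'hostname1'}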
The following sections describe how to extend Tamandua.
- src contains the Tamandua source code
- web contains all components of the web view
- deployment contains the Ansible deployment
- plugins-available contains all available plugins
- plugins-enabled contains files or folders symlinked to plugins-available; Tamandua will look in this directory for plugins
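To enable a plugin you create such a symlink, for example (the plugin filename is made up for illustration):

$ cd plugins-enabled
$ ln -s ../plugins-available/mail-container/example_plugin.py .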
"""Add docstrings to the module."""
# Class coding style
class UpperCamelCase():
"""Supply classes with docstrings."""
def __private_method() -> None:
pass
def _protected_method() -> None:
pass
def public_method() -> None:
pass
def __init__(self):
self.__privateField
self._protectedField
self.publicField
# Type annotations are used throughout the whole project.
# In order to keep Tamandua typesafe, _always_ use them.
# https://docs.python.org/3/library/typing.html
#
# eg.:
from typing import List, Union

def method(s: str, i: int) -> List[Union[str, int]]:
    """Always add docstrings."""
    return [s, i]
# Interface coding style
from abc import ABCMeta, abstractmethod

# All interfaces start with the letter 'I'
# in order to easily identify them.
class IUpperCamelCase(metaclass=ABCMeta):
    @abstractmethod
    def method_name(self) -> None:
        pass

    @property
    @abstractmethod
    def property(self):
        return 'foobar'
Modules and packages: (snake case)
package_name/module_name.py
Tamandua consists of two main components: the parser, which extracts and aggregates data from a given logfile, and the web-app, which presents this gathered data to the user. Each of these two components is further divided into internal components.
The parser stores the processed data in a database which is also used by the web-app. This database is abstracted using the repository pattern. (More on this later)
The parser consists of:
- PluginManager
- Plugins
- Containers
- Data Extraction
- Processors (pre/post/edge cases)
- Plugins
- Repository
- MongoDB
Data flow:
logfile --> logline --> pluginmanager --> data extraction --> containers --> pre/post/edge case processors --> repository --> store data to disk
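To make the aggregation step more concrete, here is a toy demonstration; using a queue-id as the aggregation key and these field names are assumptions made for this readme, the real MailContainer logic is more involved:

from collections import defaultdict

# loglines that belong to the same mail are merged into one mail-object
loglines = [
    {'qid': 'A1B2C3', 'hostname': 'hostname1', 'sender': 'alice@example.org'},
    {'qid': 'A1B2C3', 'hostname': 'hostname2', 'recipient': 'bob@phys.ethz.ch'},
]

mails = defaultdict(dict)
for data in loglines:
    mails[data['qid']].update(data)

print(list(mails.values()))
# [{'qid': 'A1B2C3', 'hostname': 'hostname2', 'sender': 'alice@example.org', 'recipient': 'bob@phys.ethz.ch'}]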
The web-app follows the MVC design principle:
- Controller and Model run on the server
- View runs in the browser of the user
User searches data:
view search form --> controller --> repository --> get data from database --> controller --> return results to view
Available plugins are found in the plugins-available folder and enabled plugins are symlinked into the plugins-enabled folder.
Data collection plugins are located in folders in plugins-available. They may inherit from different base classes or directly from the IPlugin interface. (It is recommended to inherit from a base class for generic plugin logic)
Let's create a new plugin:
First we have to decide which data container the data extracted by our plugin should be sent to. This is done by putting the plugin into the corresponding folder plugins-available/<IDataContainer.subscribedFolder>/.
At the time of writing this readme there are two data containers: MailContainer and Statistics, which subscribe to mail-container and statistics respectively. The plugins written for both data containers are almost the same:
class ExampleDataCollection(SimplePlugin):
    def check_subscription(self, line: str) -> bool:
        pass

    def gather_data(self, line: str, preRegexMatches: dict) -> dict:
        pass
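Purely as an illustration, a concrete plugin could look like this; the regex, the logline format and the class name are made up for this readme and are not part of Tamandua:

import re

# hypothetical regex, for illustration only
_DELIVERY_REGEX = re.compile(r'to=<(?P<recipient>[^>]+)>.*status=(?P<status>\w+)')

class DeliveryStatusCollection(SimplePlugin):
    """Hypothetical plugin collecting recipient and delivery status."""

    def check_subscription(self, line: str) -> bool:
        # only subscribe to loglines of the (hypothetical) smtp delivery agent
        return 'postfix/smtp' in line

    def gather_data(self, line: str, preRegexMatches: dict) -> dict:
        # merge the extracted fields with the fields from the preregex
        # (month, day, time, hostname)
        match = _DELIVERY_REGEX.search(line)
        data = match.groupdict() if match is not None else {}
        data.update(preRegexMatches)
        return data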
The statistics plugin is implemented the same way as the mail aggregation plugins; the only difference is that you will want to inherit from src.plugins.plugin_base.PluginBase instead of SimplePlugin. This is because the base class PluginBase offers convenient means to extend your regexp group names.
class ExampleDataCollection(PluginBase):
    def check_subscription(self, line: str) -> bool:
        pass

    def gather_data(self, line: str, preRegexMatches: dict) -> dict:
        pass
If you do not want to reuse any functionality of those base classes, you may inherit directly from the IPlugin interface and implement its members. The pluginManager will then know how to treat your "custom" plugin.
Eg.:
class CustomPlugin(IPlugin):
    def check_subscription(self, line: str) -> bool:
        pass

    def gather_data(self, line: str, preRegexMatches: dict) -> dict:
        pass
The most basic Processor inherits from IProcessorPlugin.
class ExampleProcessor(IProcessorPlugin):
    def process(self, obj: object) -> None:
        pass
Processors of a given type are assembled into a chain of responsibility by a concrete IPluginCollection. A responsibility (string) is then assigned to this chain. Clients can get those chains by their responsibility:
Client code:
chain = pluginManager.get_chain_with_responsibility('example_resp')
A client can then give an object to the chain which will in turn give this object to all processors in a given order.
chain.process(object)
Processor plugins are located in plugin folders under plugins-available (eg. postprocessors.d). The name of their responsibility is this foldername minus the .d at the end (eg. folder postprocessors.d --> responsibility postprocessors), although the .d suffix is not required.
For each of these responsibilities a Chain is created.
Each Chain sorts its processors by the filename of the file in which they were found.
For example:
.
└── postprocessors.d
├── 01_remover.py
├── 02_tag_delivery.py
├── 03_tag_action.py
├── 04_tag_maillinglist.py
├── 05_tag_spam.py
└── 06_verify_delivery.py
Plugins in this folder will be executed in the order shown. Note that plugins are classes inside those files; this means that if a file contains multiple plugins, their relative order is undefined.
You may want to trigger external logic from within a processor. This is done via a wrapper class (ProcessorData) which holds this metadata.
For example you may want to delete a given object; this can be done as follows:
class ExampleProcessor(IProcessorPlugin):
    def process(self, obj: ProcessorData) -> None:
        obj.action = ProcessorAction.DELETE
The Chain will detect that a given plugin has marked its data for deletion. It will then stop iterating over its processors and give control back to the caller immediately, who then performs the actual deletion.
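A minimal sketch of this behavior (the internals of the real Chain differ, this only illustrates the early exit):

class Chain():
    def __init__(self, processors: list):
        # processors are already sorted by the filename they were found in
        self._processors = processors

    def process(self, obj: 'ProcessorData') -> None:
        for processor in self._processors:
            processor.process(obj)
            # a processor marked the data for deletion:
            # stop immediately and let the caller do the actual deletion
            if obj.action == ProcessorAction.DELETE:
                return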
Container plugins are plugins which process the data extracted by the data collection plugins. Currently there are two container plugins: Statistics, which is not used, and MailContainer, which aggregates loglines into mail objects.
To implement a new container you need to implement the IDataContainer interface.
First we create a new container in src.containers.example_container:
class ExampleContainer(IDataContainer):
    ( ... )
Implement the IRequiresRepository interface and you will be given a repository by the framework. (Dependency injection)
class MailContainer(IDataContainer, IRequiresRepository):
    def set_repository(self, repository: IRepository) -> None:
        # called when this container is constructed;
        # store the given repository in a member field
        self._repository = repository
You may now use the repository to save/get data.
You may use processor plugins in your data container by inheriting from IRequiresPlugins and implementing the interface:
class ExampleContainer(IDataContainer, IRequiresPlugins):
    ( ... )

    def set_pluginmanager(self, pluginManager: 'PluginManager') -> None:
        """Assign the pluginManager to a member field for later use."""
        self._pluginManager = pluginManager

    def build_final(self) -> None:
        """
        Then you may request some processors by their responsibility from the pluginManager.
        This is discussed in more detail in the 'Plugins' section.
        """
        chain = self._pluginManager.get_chain_with_responsibility('postprocessors')

        # construct meta object
        pd = ProcessorData(mail)

        # give data to processors
        chain.process(pd)

        # evaluate action result
        if pd.action == ProcessorAction.DELETE:
            pass

        ( ... )
The framework will now set up your data container with the plugin manager. (Dependency Injection)
You may add a new type of plugin:
Every plugin has to inherit from the marker interface IAbstractPlugin in order to be detected as a plugin by the plugin manager.
A plugin is defined by its interface:
from abc import ABCMeta, abstractmethod

class IMyNewPlugin(IAbstractPlugin):
    """Describe the purpose of the new plugin type."""

    @abstractmethod
    def do_something(self):
        """Describe what implementations have to do."""
        pass
Then you need to implement a new plugin collection (src.plugins.plugin_collection):
class MyNewPluginCollection(BasePluginCollection):
    """Describe the purpose of the new plugin collection."""

    @property
    def subscribed_cls(self) -> type:
        # subscribe to your new type of plugin
        # in order to receive those plugins
        # from the PluginAssociator
        return IMyNewPlugin
And register this new plugin collection with the PluginAssociator (src.plugins.plugin_collection):
class PluginAssociator():
    """Associate a given plugin with a plugin collection."""

    def __init__(self, pluginManager: 'PluginManager'):
        """Constructor of PluginAssociator."""
        # List[IPluginCollection]
        self._pluginCollections = [
            ContainerPluginCollection(pluginManager),
            DataCollectionPlugins(),
            ProcessorPluginCollection(),
            # add your new collection here
            MyNewPluginCollection()
        ]
In order to abstract the query language of each concrete repository (eg. MongoDB or MySQL), Tamandua has its own intermediate query language (aka intermediate expression). This query language is used extensively throughout the project (View --> Controller --> Repository, Parser --> Repository). Each concrete repository then has to compile this intermediate expression to its specific query language.
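As a purely illustrative sketch of this compile step, a concrete repository could translate an expression into its native query roughly like this; the shape of the expression dict is an assumption made for this readme, check src.expression and the MongoRepository source for the real format:

import re

def compile_to_mongo(expression: dict) -> dict:
    """Translate a simple field -> (value, comparator) mapping into a MongoDB filter."""
    query = {}
    for field, (value, comparator) in expression.items():
        if comparator == 'regex_i':
            # case insensitive regex comparison
            query[field] = re.compile(value, re.IGNORECASE)
        else:
            # fall back to exact equality
            query[field] = value
    return query

print(compile_to_mongo({'field': ('value', 'regex_i')}))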
In order to abstract the expression itself and make building expressions easier, an ExpressionBuilder was implemented:
from src.expression.builder import ExpressionBuilder, ExpressionField, Comparator
builder = ExpressionBuilder()
builder.add_field(
    ExpressionField('field', 'value', Comparator(Comparator.regex_i))
)
# check source for more methods (eg. adding datetime)
As you can see, an ExpressionField and a Comparator are used to further abstract the components of an expression.
An ExpressionField represents a field you want to search for. A Comparator is, well, a comparator; check the source for all options.
See also builder pattern.
Inspired by: .NET's IEnumerable.Count
Tamandua abstracts the storage backend using the repository pattern. This allows the developer to easily extend Tamandua with a new storage backend. The creation of this repository is handled by a RepositoryFactory.
Let's assume that Tamandua should be changed so that it writes its data to a Couchbase DB. For this we will first create a new concrete repository:
Each repository has to implement the IRepository interface:
from src.repository.interfaces import IRepository

class CouchbaseRepository(IRepository):
    ( ... )
    # Look at the docstrings in the interface source for implementation details
Then we will need to add the new repository to the RepositoryFactory:
class RepositoryFactory():
    __repo_map = {
        # the existing mongo repository
        'mongo': {
            'config': MongoRepository.get_config_fields(),
            'cls': MongoRepository
        },

        # the new couchbase repository
        'couchbase': {
            'config': CouchbaseRepository.get_config_fields(),
            'cls': CouchbaseRepository
        }
    }

    ( ... )
Now it's time to change the database type in the config file in order for the framework to actually use the new repository:
{
    "database_type": "couchbase"
}
Note that the database_type field has to match the key in the __repo_map.
See also repository pattern, factory method.
- Anastassios Martakos
Tamandua - Logfile aggregation framework
Copyright (C) 2017 Anastassios Martakos
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
Tamandua is a genus of anteaters with two species. The analogy is as follows: anteaters eat and digest a lot of small animals, ants and termites. This application eats a lot of logfile lines and aggregates them into statistics.
This readme still lacks a lot of explanations. (Especially the description of components that are domain specific and how to rewrite them to a more generic state or to your environment.) However, docstrings are provided throughout the whole source code; please refer to them.
The documentation will hopefully be completed someday.
Tamandua is developed at the ETH Zurich Department of Physics in the ISG group. (IT Services Group D-PHYS) Its goal is to provide the system administrators there with an easy way to analyse the mail traffic of their mail infrastructure. This means that Tamandua contains some domain-specific parts like the hostnames of the mail servers. However, Tamandua can easily be rewritten to operate in a different environment.
Refer to the Performance Measurements (incomplete)
Generally speaking, the parser consumes a bit more RAM than the size of the logfile:
130MB logfile --> 160 MB RAM usage
Note that MongoDB can use up to several hundred MB RAM when it is building its indexes.
Tamandua has full Windows support.
The only thing you need to pay attention to on Windows is that Linux symlinks won't work out of the box. You need to install Git for Windows and either enable symlinks during the installation or enable them when you clone the repo:
$ git clone -c core.symlinks=true <URL>
Note that in order to create the actual symlinks you need admin rights.
Git GUIs like GitKraken also seem to have problems working with symlinks on Windows. (You may use PowerShell instead)