Table semantics and aggregation!
This project requires Python 3.6. We recommend you set up a conda environment:
conda create -n corvid python=3.6
source activate corvid
The dependencies are listed in the requirements.in file:
pip install -r requirements.in
After installing, you can run all the unit tests:
pytest tests/
|-- corvid/
| |-- table/
| | |-- table.py
| | |-- table_loader.py
| |-- semantic_table/
| | |-- semantic_table.py
| | |-- evaluate.py
| |-- table_aggregation/
| | |-- schema_matcher.py
| | |-- evaluate.py
|-- tests/
|-- requirements.in
A few important things:
- table.py contains the Table class, which is the data structure used to represent Tables. It's fine to think of Table as a wrapper around a 2D numpy array, where each [i,j] element represents a cell in the Table.
- semantic_table.py contains the SemanticTable class. It takes a Table object as input and learns a normalization of it, which can be accessed via .normalized_table.
- schema_matcher.py contains the SchemaMatcher class. The .map_tables() method takes a list of Table objects and finds alignments between columns. For example, a column "p" in Table 1 could be aligned with another column "precision" in Table 2. The .aggregate_tables() method uses these alignments to build a single aggregate Table.
- evaluate.py contains a function evaluate() which computes a suite of performance metrics on a given Gold Table and Predicted Table pair. The semantic_table and table_aggregation modules have their own respective evaluation methods.
First, instantiate a Table object:
from corvid.table.table import Cell, Table
cells = [
    Cell(tokens=['a'], index_topleft_row=0, index_topleft_col=0, rowspan=1, colspan=1),
    Cell(tokens=['b'], index_topleft_row=0, index_topleft_col=1, rowspan=1, colspan=1),
    Cell(tokens=['c'], index_topleft_row=1, index_topleft_col=0, rowspan=1, colspan=1),
    Cell(tokens=['d'], index_topleft_row=1, index_topleft_col=1, rowspan=1, colspan=1),
]
table = Table(cells=cells, nrow=2, ncol=2)
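To illustrate how cells with a top-left index and a row/column span can populate a 2D grid, here is a minimal, standard-library sketch. SketchCell and layout_grid are hypothetical names introduced for illustration; this is not corvid's Cell or Table implementation, only one plausible way such a layout could work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchCell:
    """Hypothetical stand-in for a cell; not corvid's Cell class."""
    tokens: List[str]
    index_topleft_row: int
    index_topleft_col: int
    rowspan: int = 1
    colspan: int = 1


def layout_grid(cells: List[SketchCell], nrow: int, ncol: int) -> List[List[Optional[SketchCell]]]:
    """Place each cell into every grid position covered by its span."""
    grid: List[List[Optional[SketchCell]]] = [[None] * ncol for _ in range(nrow)]
    for cell in cells:
        for i in range(cell.index_topleft_row, cell.index_topleft_row + cell.rowspan):
            for j in range(cell.index_topleft_col, cell.index_topleft_col + cell.colspan):
                if grid[i][j] is not None:
                    raise ValueError(f'Position ({i}, {j}) is covered by two cells')
                grid[i][j] = cell
    return grid


# A header cell spanning two columns, followed by two ordinary cells.
sketch_cells = [
    SketchCell(tokens=['header'], index_topleft_row=0, index_topleft_col=0, colspan=2),
    SketchCell(tokens=['c'], index_topleft_row=1, index_topleft_col=0),
    SketchCell(tokens=['d'], index_topleft_row=1, index_topleft_col=1),
]
sketch_grid = layout_grid(sketch_cells, nrow=2, ncol=2)
```

In this sketch, both positions in the first row refer to the same cell object, which is one way a multi-column cell can appear in a rectangular grid.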
You can access certain elements by indexing like you would a 2D array:
# visualize
print(table)
# shape
table.nrow; table.ncol; table.dim
# indexing via grid
first_row = table[0,:]
first_col = table[:,0]
bottom_right_element = table[-1, -1]
# indexing via cells
first_cell = table[0]
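The two indexing styles above (a [row, col] pair versus a single integer) could be supported by one __getitem__ method. The following is a sketch under stated assumptions: GridSketch is a hypothetical class, and treating a single integer as a row-major position is an assumption made here, not something confirmed about corvid's Table.

```python
from typing import Any, List, Tuple, Union


class GridSketch:
    """Hypothetical row-major grid; illustrates indexing only, not corvid's Table."""

    def __init__(self, rows: List[List[Any]]):
        self.rows = rows

    def __getitem__(self, index: Union[int, Tuple[Any, Any]]):
        if isinstance(index, int):
            # Single integer: treat the grid as a flat, row-major list of cells.
            flat = [value for row in self.rows for value in row]
            return flat[index]
        row_index, col_index = index
        if isinstance(row_index, int):
            # One row selected: col_index may be an int (one cell) or a slice (a list).
            return self.rows[row_index][col_index]
        # Row slice: apply col_index to every selected row.
        return [row[col_index] for row in self.rows[row_index]]


grid = GridSketch([['a', 'b'], ['c', 'd']])
first_row_sketch = grid[0, :]
first_col_sketch = grid[:, 0]
bottom_right_sketch = grid[-1, -1]
first_cell_sketch = grid[0]
```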
You can serialize this object to JSON:
import json
with open('myfilename', 'w') as f:
    json.dump(table.to_json(), f)
You can load it back in from JSON using the Loader classes:
from corvid.table.table_loader import CellLoader, TableLoader
cell_loader = CellLoader(cell_type=Cell)
table_loader = TableLoader(table_type=Table, cell_loader=cell_loader)
with open('myfilename', 'r') as f:
    table = table_loader.from_json(json.load(f))
You can extend all of these classes to contain augmented information:
# `...` stands for the remaining Cell / Table constructor arguments
class ColorfulCell(Cell):
    def __init__(self, color: str, ...):
        super().__init__(...)
        self.color = color

class ColorfulTableWithCaption(Table):
    def __init__(self, caption: str, ...):
        super().__init__(...)
        self.caption = caption

cells = [ColorfulCell(color='red', ...), ColorfulCell(color='blue', ...), ...]
table = ColorfulTableWithCaption(cells=cells, nrow=2, ncol=2, caption='red and blue cells')
Serialization of these objects is similar, but requires specification of the correct Cell and Table types:
with open('myfilename', 'w') as f:
    json.dump(table.to_json(), f)

cell_loader = CellLoader(cell_type=ColorfulCell)
table_loader = TableLoader(table_type=ColorfulTableWithCaption, cell_loader=cell_loader)
with open('myfilename', 'r') as f:
    table = table_loader.from_json(json.load(f))
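One reason a loader might need to be told the concrete type: JSON carries only plain data, so the class to rebuild has to come from somewhere else. Here is a minimal sketch of that pattern. All class names ending in Sketch are hypothetical, and the assumption that JSON keys match constructor keyword arguments is made here for illustration; corvid's actual loaders may work differently.

```python
import json
from typing import Any, Dict, List, Type


class BaseCellSketch:
    """Hypothetical cell; illustrates the pattern only."""

    def __init__(self, tokens: List[str]):
        self.tokens = tokens

    def to_json(self) -> Dict[str, Any]:
        return {'tokens': self.tokens}


class ColorfulCellSketch(BaseCellSketch):
    """Subclass carrying one extra field that must survive the round trip."""

    def __init__(self, tokens: List[str], color: str):
        super().__init__(tokens=tokens)
        self.color = color

    def to_json(self) -> Dict[str, Any]:
        data = super().to_json()
        data['color'] = self.color
        return data


class CellLoaderSketch:
    """Rebuilds cells of a caller-chosen class from plain dictionaries."""

    def __init__(self, cell_type: Type[BaseCellSketch]):
        self.cell_type = cell_type

    def from_json(self, data: Dict[str, Any]) -> BaseCellSketch:
        # Assumption: JSON keys match the constructor's keyword arguments.
        return self.cell_type(**data)


original = ColorfulCellSketch(tokens=['a'], color='red')
serialized = json.dumps(original.to_json())
loader = CellLoaderSketch(cell_type=ColorfulCellSketch)
restored = loader.from_json(json.loads(serialized))
```

In this sketch, a loader built with cell_type=BaseCellSketch would fail on the same payload, because the base constructor does not accept the extra color key.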
Normalize an existing Table object by creating a SemanticTable object:
from corvid.semantic_table.semantic_table import SemanticTable
semantic_table = SemanticTable(raw_table=table)
print(semantic_table.normalized_table)
Aggregate Table objects using a SchemaMatcher:
from corvid.table_aggregation.schema_matcher import ColNameSchemaMatcher
schema_matcher = ColNameSchemaMatcher()
First, construct a list of Tables. For best results, use the normalized_table from each SemanticTable, but everything works on raw Tables as well.
normalized_source_tables = [SemanticTable(raw_table=t).normalized_table for t in tables]
Second, build a "Schema" by initializing a Table object, which only has a single row containing column header strings. For example:
schema_cells = [Cell(tokens=['header1'], ...), Cell(tokens=['header2'], ...)]
schema_table = Table(cells=schema_cells, nrow=1, ncol=2)
Third, build a list of PairwiseMappings, which indicate the column alignments between pairs of Tables.
pairwise_mappings = schema_matcher.map_tables(
    tables=normalized_source_tables,
    target_schema=schema_table
)
Finally, use these PairwiseMappings to build a single Table object that has the columns specified by the "Schema" Table.
aggregate_table = schema_matcher.aggregate_tables(
    pairwise_mappings=pairwise_mappings,
    target_schema=schema_table
)
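The map-then-aggregate flow above can be pictured with a toy version that works on plain lists of strings. This is a sketch under stated assumptions, not the algorithm used by ColNameSchemaMatcher: map_columns and aggregate are hypothetical helpers, columns are matched only when their lowercased, stripped names are identical (so "p" would not be aligned with "precision" here), and unmatched target columns are filled with empty strings.

```python
from typing import Dict, List, Optional

Row = List[str]


def map_columns(source_header: Row, target_header: Row) -> Dict[int, Optional[int]]:
    """For each target column index, find the source column with the same normalized name."""
    normalized = {name.strip().lower(): j for j, name in enumerate(source_header)}
    return {i: normalized.get(name.strip().lower()) for i, name in enumerate(target_header)}


def aggregate(tables: List[List[Row]], target_header: Row) -> List[Row]:
    """Stack the body rows of every table under the target header."""
    result = [list(target_header)]
    for table in tables:
        header, body = table[0], table[1:]
        mapping = map_columns(header, target_header)
        for row in body:
            # Unmatched target columns are left empty (an assumption of this sketch).
            result.append([row[mapping[i]] if mapping[i] is not None else ''
                           for i in range(len(target_header))])
    return result


# Each toy table is a list of rows whose first row is the header.
table_1 = [['model', 'precision'], ['A', '0.9']]
table_2 = [['Precision', 'recall', 'model'], ['0.8', '0.7', 'B']]
combined = aggregate([table_1, table_2], target_header=['model', 'precision', 'recall'])
```

Keeping the mapping step separate from the aggregation step mirrors the two-call structure above: alignments can be inspected or corrected before any rows are merged.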
To evaluate this aggregation, use:
from corvid.table_aggregation.evaluate import evaluate
evaluate(gold_table=gold_table, pred_table=aggregate_table)
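The metrics computed by evaluate() are not listed here. As one example of what a gold-versus-predicted comparison might look like, the sketch below computes exact-match accuracy over cells for two same-shaped tables represented as lists of strings. cell_accuracy is a hypothetical helper and is not claimed to be part of corvid's evaluation suite.

```python
from typing import List


def cell_accuracy(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Fraction of positions where the predicted cell text equals the gold cell text."""
    if len(gold) != len(pred) or any(len(g) != len(p) for g, p in zip(gold, pred)):
        raise ValueError('gold and pred must have the same shape')
    total = sum(len(row) for row in gold)
    if total == 0:
        raise ValueError('tables must contain at least one cell')
    matches = sum(g == p
                  for gold_row, pred_row in zip(gold, pred)
                  for g, p in zip(gold_row, pred_row))
    return matches / total


gold_rows = [['model', 'precision'], ['A', '0.9']]
pred_rows = [['model', 'precision'], ['A', '0.8']]
accuracy = cell_accuracy(gold_rows, pred_rows)
```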
- cell-wise classification of raw_table Cells
- evaluation for semantic table
- latex source to table (for training/evaluation)