cell extraction #3

@hampsterx

Your package works great, but I had to modify one line slightly:

self._insert(row_ind, col_ind, row_span, col_span, self._transformer(cell.get_text()))

This is fine if the cell content is plain text, but it is a problem if the cell contains links you want to keep, since get_text() discards them.

I have modified it to:

class Extractor(object):
    def __init__(self, table, id_=None, cell_transformer=None):
        ...
        self._cell_transformer = cell_transformer if cell_transformer else lambda x: x.get_text()

    def parse(self):
        ...
        self._insert(row_ind, col_ind, row_span, col_span, self._cell_transformer(cell))

This allows the caller to implement the cell extraction if required.
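
As an illustration of what a caller-supplied transformer might look like, here is a minimal sketch. It assumes the cell passed in behaves like a BeautifulSoup Tag (get_text, find_all, get); the name text_with_links and the returned dict shape are hypothetical, not part of the package.

```python
def text_with_links(cell):
    """Return the cell text together with its link targets.

    Assumes `cell` is a BeautifulSoup-like Tag. The dict layout is an
    arbitrary choice for this sketch.
    """
    links = [
        (a.get_text(strip=True), a.get("href"))
        for a in cell.find_all("a")
        if a.get("href")  # skip anchors that have no href attribute
    ]
    return {"text": cell.get_text(strip=True), "links": links}


# With the modified constructor above, it would be passed in as:
#   ext = Extractor(table, cell_transformer=text_with_links)
```

Cells without links would then come back with an empty "links" list, so plain-text tables keep working.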

Also, having to write three lines:

ext = Extractor(html)
ext.parse()
print ext.return_list()

It would be nicer to just do:

result = Extractor().parse(html)
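
Without changing the class itself, the three calls could also be folded into a small helper. This is only a sketch: extract_table is a hypothetical name, and it assumes the extractor class keeps the constructor, parse() and return_list() shown above.

```python
def extract_table(extractor_cls, html, **kwargs):
    """Build an extractor, parse the table, and return the rows in one call.

    `extractor_cls` is expected to be the Extractor class from this issue;
    extra keyword arguments (e.g. cell_transformer) are passed through.
    """
    extractor = extractor_cls(html, **kwargs)
    extractor.parse()
    return extractor.return_list()


# Usage, assuming Extractor is imported from the package:
#   result = extract_table(Extractor, html)
```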

Thanks, this package is small but useful :)
