@ksassnowski

This PR adds the ability to define custom item classes and an option for item processors to process only certain items.

Custom Items

Custom items are simple PHP objects which extend the RoachPHP\ItemPipeline\AbstractItem class. The AbstractItem class implements all necessary interfaces in order to stand in for any other kind of item.

final class Team extends AbstractItem
{
    public function __construct(
        public readonly UuidInterface $id,
        public readonly string $name,
    ) {
    }
}

To yield a custom item from a spider, we can continue to use the spider's item method. Instead of passing in an array, however, we pass in an instance of our custom item class.

public function parse(Response $response): Generator
{
    // Do some processing...
    yield $this->item(new MyCustomItem($id, $name));
}

This can already be nice on its own if we want to structure our scraped data a little more instead of passing around raw arrays. The real value comes from combining this with the next feature, however: custom item processors.

Custom Item Processors

Up until now, every processor of a spider would run for each yielded item. This can become problematic if our spider yields multiple different types of data from the same parse callback.

Say we're parsing a match summary of a football match. We might want to yield a Team item for both the home and the away team. The teams should get saved to the database so we can reference them later when we store the matches themselves. However, we also want to yield a FootballMatch item containing the information about the match itself.

The issue now is that we probably want to process the two types of items completely differently. The only way to deal with this at the moment is to add if-else blocks to all of our processors to manually check which kind of item we're dealing with. This is really cumbersome because it often requires us to add additional metadata to our items for the sole purpose of being able to tell which kind of item we're dealing with in our processor.

This PR introduces a new ConditionalItemProcessor interface as well as a CustomItemProcessor base class. The ConditionalItemProcessor interface describes a processor which may not run for each yielded item.

interface ConditionalItemProcessor
{
    /**
     * Check if the yielded item should get handled by this item processor.
     */
    public function shouldHandle(ItemInterface $item): bool;
}

The item pipeline will call the shouldHandle method of each processor that implements the ConditionalItemProcessor interface to check if this processor should handle the item.
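
To make that flow concrete, here is a minimal sketch of how such a pipeline loop could look. It is not the actual pipeline code from this PR: the interfaces are local stand-ins so the snippet runs without the library, and `ItemProcessorInterface`, `CollectTeams` and `runPipeline` are names made up for illustration.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-ins so the sketch runs without the library installed.
interface ItemInterface
{
}

interface ItemProcessorInterface
{
    public function processItem(ItemInterface $item): ItemInterface;
}

interface ConditionalItemProcessor
{
    public function shouldHandle(ItemInterface $item): bool;
}

final class Team implements ItemInterface
{
    public function __construct(public string $name)
    {
    }
}

final class FootballMatch implements ItemInterface
{
    public function __construct(public string $summary)
    {
    }
}

// Records the teams it sees and opts out of every other item type.
final class CollectTeams implements ItemProcessorInterface, ConditionalItemProcessor
{
    /** @var list<string> */
    public array $seen = [];

    public function shouldHandle(ItemInterface $item): bool
    {
        return $item instanceof Team;
    }

    public function processItem(ItemInterface $item): ItemInterface
    {
        if ($item instanceof Team) {
            $this->seen[] = $item->name;
        }

        return $item;
    }
}

/**
 * Assumed pipeline loop: conditional processors are asked first,
 * unconditional processors always run.
 *
 * @param list<ItemProcessorInterface> $processors
 */
function runPipeline(ItemInterface $item, array $processors): ItemInterface
{
    foreach ($processors as $processor) {
        if ($processor instanceof ConditionalItemProcessor && !$processor->shouldHandle($item)) {
            continue; // Skip this processor but keep the item in the pipeline.
        }

        $item = $processor->processItem($item);
    }

    return $item;
}

$collector = new CollectTeams();
runPipeline(new Team('Home FC'), [$collector]);
runPipeline(new FootballMatch('2:1'), [$collector]);
// $collector->seen now holds only the team name.
```

In this sketch a processor that declines an item is simply skipped, so the item still reaches every later processor unchanged.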

The CustomItemProcessor base class is a convenience for one of the most common reasons to do this: handling only certain types of items. To create such a processor, we extend the CustomItemProcessor class and implement the getHandledItemClasses method as well as the usual processItem method.

final class SaveTeamToDatabaseProcessor extends CustomItemProcessor
{
    /**
     * @param Team $item
     */
    public function processItem(ItemInterface $item): ItemInterface
    {
        // Process the item. Note how we can now narrow the type hint for `$item`
        // since we know that this processor will never get called for a different
        // type of item.
        return $item;
    }

    /**
     * @return array<int, class-string<ItemInterface>>
     */
    protected function getHandledItemClasses(): array
    {
        // Team is the custom item we defined above.
        return [Team::class];
    }
}
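
For intuition, the base class can be thought of as answering shouldHandle from the list returned by getHandledItemClasses. The following is only a sketch under that assumption, not the class shipped in this PR: `CustomItemProcessorSketch` and `SaveTeamSketch` are hypothetical names, the item types are local stand-ins, and whether the real class also matches subclasses (as `instanceof` does here) or only exact classes is not stated above.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-ins so the sketch runs without the library installed.
interface ItemInterface
{
}

final class Team implements ItemInterface
{
}

final class FootballMatch implements ItemInterface
{
}

// Assumed shape of the base class; the name is made up for this sketch.
abstract class CustomItemProcessorSketch
{
    // Answer the pipeline's question by checking the item against the
    // classes the concrete processor declared.
    final public function shouldHandle(ItemInterface $item): bool
    {
        foreach ($this->getHandledItemClasses() as $class) {
            // Assumption: subclasses of a handled class also match.
            if ($item instanceof $class) {
                return true;
            }
        }

        return false;
    }

    abstract public function processItem(ItemInterface $item): ItemInterface;

    /**
     * @return array<int, class-string<ItemInterface>>
     */
    abstract protected function getHandledItemClasses(): array;
}

final class SaveTeamSketch extends CustomItemProcessorSketch
{
    public function processItem(ItemInterface $item): ItemInterface
    {
        return $item; // A real processor would persist the team here.
    }

    protected function getHandledItemClasses(): array
    {
        return [Team::class];
    }
}

$processor = new SaveTeamSketch();
var_dump($processor->shouldHandle(new Team()));          // bool(true)
var_dump($processor->shouldHandle(new FootballMatch())); // bool(false)
```

Keeping the class list in one protected method means a concrete processor never has to write its own type check.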

We can then define a separate item processor to only process FootballMatch items. We register these processors just like any other processor.

class MySpider extends BasicSpider
{
    public array $itemProcessors = [
        SaveTeamToDatabaseProcessor::class,
        SaveMatchToDatabaseProcessor::class,
    ];
}

Note: A processor that extends CustomItemProcessor only processes the item types returned by its getHandledItemClasses method. This means that all other items, including non-custom items, are not processed by that processor.

@ksassnowski ksassnowski force-pushed the custom_item_classes branch from 3b5d9be to f48f826 Compare June 22, 2022 07:24
@ksassnowski ksassnowski force-pushed the custom_item_classes branch from 23e498a to 7052c68 Compare June 22, 2022 07:38
@ksassnowski ksassnowski merged commit 20c6dc6 into main Jun 22, 2022