Epub exporter does not include embedded attachments; proposal for an output agnostic mechanism

Images are embedded in notebooks as attachments, stored as base64-encoded strings.

Right now the epub exporter seems to receive a link to an attachment-like structure, which pandoc then treats as some kind of file system lookup, e.g.:

```bash
[NbConvertApp] Converting notebook hide_cells_based_on_tags.ipynb to epub
[NbConvertApp] Writing 9663 bytes to notebook.md
[NbConvertApp] Building Epub
pandoc: Could not find media 'attachment://ScreenShot2016-10-12at19.20.34.png', skipping…
[NbConvertApp] Epub successfully created
[NbConvertApp] Writing 7101 bytes to hide_cells_based_on_tags.epub
```

(NB: in that ↑ I changed the ` to a ' and the ... to a … for better highlighting)

This makes me think that something similar might be happening (or not happening) elsewhere, specifically in https://github.com/jupyter/nbconvert/issues/328. Some of the discussion there partially inspires what follows below.

I think there may be an output-agnostic way to approach this, as a three-step process: gather, tap, and (optionally) clean. First, we gather and organise all of the relevant resources into a single location with a known relative directory structure. Second, we use format-specific mechanisms to include these images. Third, we optionally clean up everything to return it to the state that it was in (if we want it to be a single file per https://github.com/jupyter/nbconvert/issues/328#issuecomment-237777128).

To encapsulate these steps, creating a new directory in which to work will be useful. We can treat the events as happening from the root level of that directory and build up the structure, which means that we can give things canonical locations in a known structure. Then, because locations are specified as relative paths within the canonical structure, the code that stores and finds files can rely on a common file path function. That takes care of step 1. Format-specific handling can then be developed on top of these common locations, which will take a while but will take care of step 2. And by optionally using temporary directories, we get easy cleanup, which takes care of step 3.
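The gather, tap, and clean steps could be sketched roughly as follows. This is only an illustration using the standard library, not nbconvert's actual machinery: the names `ATTACHMENT_DIR`, `resource_path`, `gather`, and `convert` are hypothetical, and `tap` stands in for whatever format-specific step (e.g. calling pandoc) would consume the gathered files.

```python
import base64
import shutil
import tempfile
from pathlib import Path

# Hypothetical canonical layout, relative to the working directory root.
ATTACHMENT_DIR = Path("attachment")


def resource_path(root, name):
    """Common file path function: canonical location of an attachment."""
    return Path(root) / ATTACHMENT_DIR / name


def gather(root, attachments):
    """Step 1: write each base64-encoded attachment to its canonical location.

    `attachments` maps a file name to its base64-encoded contents.
    Returns the written paths, relative to `root`.
    """
    written = []
    for name, b64_data in attachments.items():
        target = resource_path(root, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(base64.b64decode(b64_data))
        written.append(target.relative_to(root))
    return written


def convert(attachments, tap, keep_dir=None):
    """Run gather and tap inside a temporary directory, then clean up."""
    with tempfile.TemporaryDirectory() as root:
        paths = gather(root, attachments)
        # Step 2: format-specific work, expressed in terms of relative paths.
        result = tap(Path(root), paths)
        if keep_dir is not None:
            # Optionally keep the gathered files instead of discarding them.
            shutil.copytree(root, keep_dir, dirs_exist_ok=True)
        return result
    # Step 3: the temporary directory is removed when the `with` block exits.
```

Because `tap` only ever sees a root and relative paths, the same gathering code could in principle serve several output formats.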

For example, the epub exporter uses the markdown exporter as an intermediate step, producing the file in a temporary directory. This is because the markdown exporter writes out any media files produced as cell outputs so that they can be referenced. Pandoc's epub exporter can find these files and include them in its native format. However, we do not do this for attached files; instead those are referenced as `![ScreenShot2016-10-12at19.20.34.png](attachment://ScreenShot2016-10-12at19.20.34.png)`, which does not point to a file system location. If we treat input attachments as we do outputs, the markdown exporter will be able to make attachments visible as easily as it does the output images. Changing the link to a more appropriate location such as `![ScreenShot2016-10-12at19.20.34.png](./attachment/ScreenShot2016-10-12at19.20.34.png)` would be sufficient to find the attached files.
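A minimal sketch of that link rewrite, assuming a markdown cell is a dict with a string `source` and an `attachments` mapping of file name to `{mime type: base64 data}` (as in the notebook format). The function name `extract_attachments` and the `attachment` subdirectory are hypothetical choices for illustration, not an existing nbconvert API; the pattern accepts the link both with and without the `//`.

```python
import base64
import re
from pathlib import Path

# Matches "(attachment:name)" and "(attachment://name)" link targets.
ATTACHMENT_LINK = re.compile(r"\(attachment:(?://)?([^)\s]+)\)")


def extract_attachments(cell, root, subdir="attachment"):
    """Write a cell's attachments under root/subdir and rewrite its links.

    Returns the cell source with attachment links pointing at the files.
    """
    out_dir = Path(root) / subdir
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, bundle in cell.get("attachments", {}).items():
        # The bundle maps MIME type to base64 data; take the first entry.
        b64_data = next(iter(bundle.values()))
        (out_dir / name).write_bytes(base64.b64decode(b64_data))
    return ATTACHMENT_LINK.sub(
        lambda m: f"(./{subdir}/{m.group(1)})", cell["source"]
    )
```

With the files on disk and the links rewritten to relative paths, pandoc should be able to pick them up the same way it picks up output images, provided it runs from the same working directory.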

We can likely use similar means so that the markdown to html conversion can include these either as embedded images or as separate files. In one case you just include the data URI; in the other case you maintain the same mechanism as described above for epub. The same machinery can support both versions. Then, instead of trying to figure out how to pass them in independently, we create them as separate files and then read them back in. Yes, it will be less efficient, but then we will have a common mechanism for achieving all of this.
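To illustrate how one mechanism could serve both variants, here is a rough sketch that takes a file already written to the working directory and returns either a relative link or a data URI read back from disk. The function name `image_reference` and its parameters are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_reference(root, rel_path, embed):
    """Return a link target for an image stored at root/rel_path.

    If `embed` is false, return the relative path for use as a separate file.
    If `embed` is true, read the file back in and return a data URI.
    """
    rel_path = Path(rel_path)
    if not embed:
        return rel_path.as_posix()
    mime, _ = mimetypes.guess_type(rel_path.name)
    mime = mime or "application/octet-stream"  # fallback for unknown types
    data = base64.b64encode((Path(root) / rel_path).read_bytes())
    return f"data:{mime};base64,{data.decode('ascii')}"
```

The round trip through the file system is the inefficiency mentioned above: the data is decoded to a file and then re-encoded when embedding.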

From there we can work backwards and figure out ways to solve the problem in a more efficient manner. But this should be doable without a postprocessor, as a standard default option built on somewhat common output-agnostic machinery.

I'm going to try to make this work for the epub exporter regardless, because we're already using a TemporaryWorkingDirectory, so it'll make for a good test case. The way I'll approach it is by adding a hook for this in the markdown exporter itself: since it's already handling the correct file placement for the outputs, I figure I can mirror that for the attachments.

Tips on how to make it generalisable are extremely welcome; however, I may pursue a local optimum for the epub solution and then try to abstract away from that, rather than transform every piece of advice on proper generalisation into code from the get-go. If in a week that hasn't gone anywhere, then I'll know I'm barking up the wrong tree, because as far as I can tell this shouldn't be too hard a modification to make.

Relates to #467.


Epub exporter does not include embedded attachments; proposal for an output agnostic mechanism #473

mpacer
opened on Nov 18, 2016
