Skip to content

Txt2tags writer #6551

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 22 commits into
base: main
Choose a base branch
from
Open

Txt2tags writer #6551

wants to merge 22 commits into from

Conversation

farvardin
Copy link

hello,

I've made a txt2tags writer, adapted from Dokuwiki writer. It works rather ok but will be improved. Could you integrate this in pandoc please?

thank you

@@ -67,6 +67,7 @@ module Text.Pandoc.Writers
, writeTEI
, writeTexinfo
, writeTextile
, writeTxt2Tags
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should line up with the other entries

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's strange because I got some compilation warnings into other places, which I fixed, and I believed it was clean when I've submitted the PR. I'll check this soon.

@jgm
Copy link
Owner

jgm commented Jul 21, 2020

Thank you! This seems a good start, but it still needs some work before it's merge-ready.

At the very least, we need automated tests. The recommended (and easiest) approach would be to add to test/Tests/Old.hs (this requires creating a golden file test/writer.t2t).

Another thing you might try is doing round-trip tests with the t2t reader. This will help identify bugs and missing things.

added test case
@tarleb
Copy link
Collaborator

tarleb commented Dec 28, 2020

Thanks @farvardin, I think it would be great if we could get this moving again. I see you already added an "old-style" golden file test, but no code to use it. Are you still interested in advancing this? Please let us know how we can help.

I had quick read through the writer source, I think it looks good. Would you be interested in an in-depth review, or do you rather just want to get it merged soon?

@farvardin
Copy link
Author

@tarleb if you could help me how to improve it, it would be great. I don't really understand the test-thing.

@tarleb
Copy link
Collaborator

tarleb commented Jan 2, 2021

Gladly.

You can enable the tests by adding the appropriate commands to test/Tests/Old.hs, lines 162-164. Check the other formats to see what's required.

It would also be great if you could take a look at the doclayout package. It is used in plain text writers to avoid overlong lines. If I understand correctly, then txt2tags supports breaking long paragraphs into shorter lines by inserting newline characters. This is what doclayout is designed to do, see also pandoc's --columns option.

LMK if you run into problems.

@farvardin
Copy link
Author

@tarleb Thank you, I understand better the test process now. I've added the txt2tags writer to the file and it seems to work now. About the Lint error the missing file is in my pull request (I've never used lint or Haskell before and I'm not really a developper) and therefore I don't know why it's complaining.

I've had a look at doclayout but even with the example I can't figure out what I could do with this, sorry. I've seen all the writers are importing it, but I don't even find where they really use it in the code.

@tarleb
Copy link
Collaborator

tarleb commented Jan 6, 2021

The linting issue is caused by a missing entry in pandoc.cabal. All test files must be listed under extra-source-files; otherwise they won't be included in the source distribution, causing test failure for people building from source. I'm guilty of doing that (more than once 😔), so John added this nice check. See here.

The Doc Text type from doclayout behaves mostlyl just like Text, so most of the work should be updating the function type signatures. If you'd like to discuss this with less latency, feel free to reach out in the fp-chat slack chat.

@farvardin
Copy link
Author

@tarleb thanks again for the explanation. I should have figured it out myself by looking for the reference to "writer.rtf" for example, since it can only be found in pandoc.cabal.

lint is no longer complaining, so I suppose it could be merged officially!

I've just converted an epub file into txt2tags, using the command "pandoc -f epub -t txt2tags doc.epub > doc.t2t" and it works fine!

Copy link
Collaborator

@tarleb tarleb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are good news, thanks! I went through the code and it looks good; I made inline comments whenever I was unsure about something.

I'm not sure if @jgm considers it essential, but it would be best if we could switch the result type of blockToTxt2Tags from Text to Doc Text so pandoc's --columns option would work better with this writer. Do you have time to give it a try?

"javascript", "latex", "lisp", "lua", "matlab", "mirc", "mpasm", "mysql", "nsis", "objc",
"ocaml", "ocaml-brief", "oobas", "oracle8", "pascal", "perl", "php", "php-brief", "plsql",
"python", "qbasic", "rails", "reg", "robots", "ruby", "sas", "scheme", "sdlbasic",
"smalltalk", "smarty", "sql", "tcl", "", "thinbasic", "tsql", "vb", "vbnet", "vhdl",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The empty string in this list strikes me as weird. Is it intentional?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'm pretty sure I've just copied this part from the DokuWiki.hs without modification. I'm not sure how this part should work. The new code in DokuWiki is quite different now:

image

I'll remove the empty string, though I'm sure most of this part could be removed. But if I change it myself, I might break it!


inlineToTxt2Tags _ il@(RawInline f str)
| f == Format "Txt2Tags" = return str
| f == Format "html" = return $ "<html>" <> str <> "</html>"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most other <html> tags are uppercase, but here they are lower case. Is there a difference?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there is no difference. Again, this part is straight from the DokuWiki.hs file:

image

Those are good news, thanks! I went through the code and it looks good; I made inline comments whenever I was unsure about something.

I'm not sure if @jgm considers it essential, but it would be best if we could switch the result type of blockToTxt2Tags from Text to Doc Text so pandoc's --columns option would work better with this writer. Do you have time to give it a try?

I don't know how to do this, sorry. I've just hacked the existing dokuwiki code so it's outputting valid txt2tags markup (most of the difference is with the headings)

case txt of
[Str s] | "mailto:" `T.isPrefixOf` src -> return $ "<" <> s <> ">"
| escapeURI s == src -> return src
_ -> if isURI src
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this if statement necessary? I'm not aware of a string with a leading / that would be recognized as valid URI.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just copied from Dokuwiki... most txt2tags syntax is similar to dokuwiki, so I suppose it's legit

image

probably the code should be refined in the future

, stUseTags = False
, stBackSlashLB = False }

type Txt2Tags m = ReaderT WriterEnvironment (StateT WriterState m)
Copy link
Collaborator

@tarleb tarleb Jan 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we don't need the WriterState, so we could shorten this to ReaderT WriterEnvironment m. The function runTxt2Tags would then become runTxt2Tags = flip runReaderT def.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know anything about haskell, so I don't understand the difference, sorry :(

image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copy-pasted together without you knowing Haskell?

I'm not even mad. That's amazing!

Copy link
Author

@farvardin farvardin Jan 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main stumbling block for us will be maintenance. We have to ensure that this will be maintained after it's merged. Not sure how to proceed there, I'm not too keen to up to take up an additional responsibility. @jgm, what do you think?

the txt2tags language is not evolving, for 2 reasons: it's stable for many years, and also it's possible to add custom markup with some preproc and postprocessors, so the core doesn't need changing. So the only maintenance would be to adapt the writer code in the case pandoc change something in its internal code. But it would require the same maintenance as for the dokuwiki writer.

@tarleb
Copy link
Collaborator

tarleb commented Jan 18, 2021

Oh, I forgot: I think we should also add a test/tables.t2t file, as otherwise table output would remain untested.

@farvardin
Copy link
Author

Oh, I forgot: I think we should also add a test/tables.t2t file, as otherwise table output would remain untested.

done

@tarleb
Copy link
Collaborator

tarleb commented Jan 19, 2021

I think the main stumbling block for us will be maintenance. We have to ensure that this will be maintained after it's merged. Not sure how to proceed there, I'm not too keen to up to take up an additional responsibility. @jgm, what do you think?

@jgm
Copy link
Owner

jgm commented Jan 19, 2021

Have you considered putting this through its paces with round trip tests?
pandoc -f native -t txt2tags -s | pandoc -f txt2tags -t native -s should ideally give you what you started from. (You could try it with testsuite.native and txt2tags.native in the test directory.)

Pandoc doesn't generally offer a round-tripping guarantee, so we usually don't put things like this in the test suite, but doing this is a good way to identify bugs.

I'd also echo @tarleb's comment that doclayout should be used. It's not used in some of the older writers, and in formats where newlines cause paragraph breaks, but I'd rather have it supported in new writers. This is a matter of

  • importing Text.DocLayout
  • changing the writer function types to produce Doc rather than Text
  • using doclayout's functions to create the Doc values. e.g. where t is a Text, literal t is a Doc. For breakable spaces, use space. There are also functions for nesting, hanging indents, etc. -- see doclayout documentation on hackage.

Anyway, it's good to know that you don't need to know Haskell to contribute to pandoc.

@farvardin
Copy link
Author

I've done it, and it's rather faithful (I've written in the header of the writer which parts are not correctly working, but it's not very important, like definition lists)

image

while testing your command, I've noticed the writer is using the "txt2tags" tag, while the reader is using "t2t". I don't know which one is better.

pandoc -f html sample.html -t txt2tags -s | pandoc -f t2t -t html -s > test.html

about doclayout it's totally beyond my skills, I'm sorry.

@jgm
Copy link
Owner

jgm commented Jan 21, 2021

What are you using for sample.html?

@jgm
Copy link
Owner

jgm commented Jan 21, 2021

Can you try with test/testsuite.native? That exercises most of the standard pandoc features, and you wouldn't get irrelevant differences due to HTML formatting.

@jgm
Copy link
Owner

jgm commented Jan 21, 2021

As for the name, we'd better stick with t2t just for consistency's sake.

@farvardin
Copy link
Author

What are you using for sample.html?

it was a sample document from the txt2tags website (https://txt2tags.org/sample.html)

Can you try with test/testsuite.native? That exercises most of the standard pandoc features, and you wouldn't get irrelevant differences due to HTML formatting.

I still get several differences because pandoc creates id and adds some < p > markup in the original export (or maybe I've missed an option to avoid this) but yet it's still acceptable as result.

image

As for the name, we'd better stick with t2t just for consistency's sake.

I suppose it would be more logical to rename the reader part to txt2tags, because markdown and dokuwiki are written respectively "markdown" and "dokuwiki", but I understand if you prefer not breaking the previous behavior. So I've just changed all referencies of txt2tags to "t2t" in the writer part option, which will be a new addition, so it wouldn't confuse people.

@@ -20,8 +20,7 @@ Txt2Tags: <https://www.Txt2Tags.org/>
- Definition lists are broken
- Tables could be improved
- Some formats make better results than others (html is ok, md is somehow broken)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this comment?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember, it was 6 months ago. I suppose when writing from markdown, some elements were not correctly rendered

@@ -20,8 +20,7 @@ Txt2Tags: <https://www.Txt2Tags.org/>
- Definition lists are broken
- Tables could be improved
- Some formats make better results than others (html is ok, md is somehow broken)
- code related to dokuwiki only could be removed
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there still dokuwiki-specific code? IF so, let's remove it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there are still some code, yes

@jgm
Copy link
Owner

jgm commented Jan 21, 2021

Yes, I agree, txt2tags would make more sense, but I think it's more important not to break existing behavior, and to remain consistent.

Comment on lines +1 to +6
$pagetitle$
$author-meta$
$date-meta$



Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why you're getting all that blank space at the beginning. Do you want the 3 blank lines if there's a title? If there's no title block, would it be better to omit them? You can use conditionals like

$if(pagetitle)$
$pagetitle$
$author-meta$
$date-meta$


$endif$

@jgm
Copy link
Owner

jgm commented Jan 21, 2021

I ran writer.t2t through txt2tags, and it appears that some of the syntax is invalid. (e.g. the <HTML> bits).
Have you tried this experiment? (Also with tables.t2t.)
I think this writer still needs some work before it's ready.

@farvardin
Copy link
Author

I ran writer.t2t through txt2tags, and it appears that some of the syntax is invalid. (e.g. the <HTML> bits).
Have you tried this experiment? (Also with tables.t2t.)
I think this writer still needs some work before it's ready.

I've removed the html part, and now the test no longer works. There are some parts I'll never be able to fix (def lists), so I suppose I'll give up then, I'll use the generated binary for myself. There are fewer and fewer people who are still using txt2tags today, so it's not a big deal :(

@jgm
Copy link
Owner

jgm commented Jan 22, 2021

You can update the tests, no?
Or by "no longer works" do you mean that the new output is invalid t2t?

@jgm
Copy link
Owner

jgm commented Jan 22, 2021

Note: dokuwiki falls back to HTML for tables and lists whose contents are too complex to represent using the native syntax. I don't know if t2t offers any fallback possibility like that, so there's a question what to do if some of the things pandoc can represent can't be renedred in t2t. We could simply skip them and issue a warning, I suppose.

@farvardin
Copy link
Author

You can update the tests, no?
Or by "no longer works" do you mean that the new output is invalid t2t?

I don't know, I've updated the code, and thought I've also updated the test, but I still get:

      ------------------------------------------------------------------------
      --- writer.t2t
      +++ /temp/github/pandoc/.stack-work/dist/x86_64-linux-tinfo6/Cabal-3.0.1.0/build/pandoc/pandoc --data-dir=../data --quiet testsuite.native -r native -w t2t --columns=78 --variable pandoc-version= -s
      +   1 
      +   2 
      +   3 
      +   4 
      +   5 
      +   6 
      ------------------------------------------------------------------------

so I don't know what is wrong.

@farvardin
Copy link
Author

Note: dokuwiki falls back to HTML for tables and lists whose contents are too complex to represent using the native syntax. I don't know if t2t offers any fallback possibility like that, so there's a question what to do if some of the things pandoc can represent can't be renedred in t2t. We could simply skip them and issue a warning, I suppose.

it's still possible to start a line with % which means it's a comment. It won't be interpreted, but user could see it and check and correct it.

@jgm
Copy link
Owner

jgm commented Jan 23, 2021

make TESTARGS=--accept will regenerate all the tests with the current output of the writer.
(Then it's very important to inspect all the changes before committing.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants