kinow
Member

@kinow kinow commented Mar 30, 2022

Closes #395

Description

A script that, given a JSON file created by the dump.sh script, uses Pandas + tqdm to process the JSON and write a CSV file. The CSV is in a format supported by PostgreSQL's COPY command, so that DB dumps can be loaded into tables.
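
As a rough illustration of the JSON-to-CSV step (not the script in this pull request, which uses Pandas + tqdm), here is a minimal standard-library sketch. It assumes, hypothetically, that the dump is a JSON object whose "content" key holds the list of workflow records; the function name and the column names in the usage note are made up for the example.

```python
import csv
import io
import json


def dump_to_csv(dump_text, columns):
    """Convert a JSON dump into CSV text with a header row.

    Assumption: the dump is a JSON object with a "content" list of
    workflow records. The real dump layout may differ.
    """
    records = json.loads(dump_text)["content"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {}
        for name in columns:
            value = record.get(name)
            # Nested values are serialized back to JSON text so that each
            # one fits into a single CSV cell.
            if isinstance(value, (dict, list)):
                value = json.dumps(value)
            row[name] = value
        writer.writerow(row)
    return buffer.getvalue()
```

Calling dump_to_csv(text, ["id", "cwlVersion"]) returns CSV text whose first line is the header, which is what a COPY ... CSV HEADER import expects to skip.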

Motivation and Context

See #395

How Has This Been Tested?

The script successfully parses the JSON and creates a CSV that PostgreSQL can import. Further tests are pending to confirm that at least the initial batch of 2000 workflows doesn't show any errors after the data import.

Screenshots (if appropriate):

See #395

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@kinow
Member Author

kinow commented Mar 31, 2022

The only issues found so far were related to the graph image not being displayed.

I found this one, which is displayed neither on my local copy of the latest version nor on view.commonwl.org:

Screenshot from 2022-03-31 18-23-19

Screenshot from 2022-03-31 18-26-50

However, this other one is showing correctly on view.commonwl.org, but not on my notebook:

Screenshot from 2022-03-31 18-30-10

@kinow
Member Author

kinow commented Mar 31, 2022

Looking at the Docker resources, everything looks good.

Screenshot from 2022-03-31 17-58-01

But the logs… found the error 🤓

Error
spring_1_7d49152b30f2 | 2022-03-31 05:31:36,390 ERROR [http-nio-8080-exec-10] org.thymeleaf.TemplateEngine: [THYMELEAF][http-nio-8080-exec-10] Exception processing template "workflow": Instantiation of new objects and access to static classes is forbidden in this context (template: "workflow" - line 261, col 75)
spring_1_7d49152b30f2 | org.thymeleaf.exceptions.TemplateProcessingException: Instantiation of new objects and access to static classes is forbidden in this context (template: "workflow" - line 261, col 75)
spring_1_7d49152b30f2 | 	at org.thymeleaf.spring5.expression.SPELVariableExpressionEvaluator.getExpression(SPELVariableExpressionEvaluator.java:368)
spring_1_7d49152b30f2 | 	at org.thymeleaf.spring5.expression.SPELVariableExpressionEvaluator.obtainComputedSpelExpression(SPELVariableExpressionEvaluator.java:315)
spring_1_7d49152b30f2 | 	at org.thymeleaf.spring5.expression.SPELVariableExpressionEvaluator.evaluate(SPELVariableExpressionEvaluator.java:182)
spring_1_7d49152b30f2 | 	at org.thymeleaf.standard.expression.VariableExpression.executeVariableExpression(VariableExpression.java:166)
spring_1_7d49152b30f2 | 	at org.thymeleaf.standard.expression.SimpleExpression.executeSimple(SimpleExpression.java:66)
spring_1_7d49152b30f2 | 	at org.thymeleaf.standard.expression.Expression.execute(Expression.java:109)
spring_1_7d49152b30f2 | 	at org.thymeleaf.standard.expression.Expression.execute(Expression.java:138)
spring_1_7d49152b30f2 | 	at org.thymeleaf.standard.processor.StandardUtextTagProcessor.doProcess(StandardUtextTagProcessor.java:87)
spring_1_7d49152b30f2 | 	at org.thymeleaf.processor.element.AbstractAttributeTagProcessor.doProcess(AbstractAttributeTagProcessor.java:74)
spring_1_7d49152b30f2 | 	at org.thymeleaf.processor.element.AbstractElementTagProcessor.process(AbstractElementTagProcessor.java:95)
spring_1_7d49152b30f2 | 	at org.thymeleaf.util.ProcessorConfigurationUtils$ElementTagProcessorWrapper.process(ProcessorConfigurationUtils.java:633)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.ProcessorTemplateHandler.handleOpenElement(ProcessorTemplateHandler.java:1314)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.OpenElementTag.beHandled(OpenElementTag.java:205)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.Model.process(Model.java:282)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.Model.process(Model.java:290)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.IteratedGatheringModelProcessable.processIterationModel(IteratedGatheringModelProcessable.java:367)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.IteratedGatheringModelProcessable.process(IteratedGatheringModelProcessable.java:221)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.ProcessorTemplateHandler.handleCloseElement(ProcessorTemplateHandler.java:1640)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.CloseElementTag.beHandled(CloseElementTag.java:139)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.TemplateModel.process(TemplateModel.java:136)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.TemplateManager.parseAndProcess(TemplateManager.java:592)
spring_1_7d49152b30f2 | 	at org.thymeleaf.TemplateEngine.process(TemplateEngine.java:1098)
spring_1_7d49152b30f2 | 	at org.thymeleaf.TemplateEngine.process(TemplateEngine.java:1072)
spring_1_7d49152b30f2 | 	at org.thymeleaf.spring5.view.ThymeleafView.renderFragment(ThymeleafView.java:366)
spring_1_7d49152b30f2 | 	at org.thymeleaf.spring5.view.ThymeleafView.render(ThymeleafView.java:190)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1400)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1145)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1084)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:963)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898)
spring_1_7d49152b30f2 | 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:655)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883)
spring_1_7d49152b30f2 | 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:764)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:227)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:197)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97)
spring_1_7d49152b30f2 | 	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:540)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135)
spring_1_7d49152b30f2 | 	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78)
spring_1_7d49152b30f2 | 	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:357)
spring_1_7d49152b30f2 | 	at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:382)
spring_1_7d49152b30f2 | 	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
spring_1_7d49152b30f2 | 	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:895)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1722)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
spring_1_7d49152b30f2 | 	at java.lang.Thread.run(Thread.java:748)
spring_1_7d49152b30f2 | 2022-03-31 05:31:36,390 ERROR [http-nio-8080-exec-10] org.apache.juli.logging.DirectJDKLog: Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.thymeleaf.exceptions.TemplateProcessingException: Instantiation of new objects and access to static classes is forbidden in this context (template: "workflow" - line 261, col 75)] with root cause
spring_1_7d49152b30f2 | org.thymeleaf.exceptions.TemplateProcessingException: Instantiation of new objects and access to static classes is forbidden in this context (template: "workflow" - line 261, col 75)
spring_1_7d49152b30f2 | 	at org.thymeleaf.spring5.expression.SPELVariableExpressionEvaluator.getExpression(SPELVariableExpressionEvaluator.java:368)
spring_1_7d49152b30f2 | 	at org.thymeleaf.spring5.expression.SPELVariableExpressionEvaluator.obtainComputedSpelExpression(SPELVariableExpressionEvaluator.java:315)
spring_1_7d49152b30f2 | 	at org.thymeleaf.spring5.expression.SPELVariableExpressionEvaluator.evaluate(SPELVariableExpressionEvaluator.java:182)
spring_1_7d49152b30f2 | 	at org.thymeleaf.standard.expression.VariableExpression.executeVariableExpression(VariableExpression.java:166)
spring_1_7d49152b30f2 | 	at org.thymeleaf.standard.expression.SimpleExpression.executeSimple(SimpleExpression.java:66)
spring_1_7d49152b30f2 | 	at org.thymeleaf.standard.expression.Expression.execute(Expression.java:109)
spring_1_7d49152b30f2 | 	at org.thymeleaf.standard.expression.Expression.execute(Expression.java:138)
spring_1_7d49152b30f2 | 	at org.thymeleaf.standard.processor.StandardUtextTagProcessor.doProcess(StandardUtextTagProcessor.java:87)
spring_1_7d49152b30f2 | 	at org.thymeleaf.processor.element.AbstractAttributeTagProcessor.doProcess(AbstractAttributeTagProcessor.java:74)
spring_1_7d49152b30f2 | 	at org.thymeleaf.processor.element.AbstractElementTagProcessor.process(AbstractElementTagProcessor.java:95)
spring_1_7d49152b30f2 | 	at org.thymeleaf.util.ProcessorConfigurationUtils$ElementTagProcessorWrapper.process(ProcessorConfigurationUtils.java:633)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.ProcessorTemplateHandler.handleOpenElement(ProcessorTemplateHandler.java:1314)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.OpenElementTag.beHandled(OpenElementTag.java:205)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.Model.process(Model.java:282)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.Model.process(Model.java:290)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.IteratedGatheringModelProcessable.processIterationModel(IteratedGatheringModelProcessable.java:367)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.IteratedGatheringModelProcessable.process(IteratedGatheringModelProcessable.java:221)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.ProcessorTemplateHandler.handleCloseElement(ProcessorTemplateHandler.java:1640)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.CloseElementTag.beHandled(CloseElementTag.java:139)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.TemplateModel.process(TemplateModel.java:136)
spring_1_7d49152b30f2 | 	at org.thymeleaf.engine.TemplateManager.parseAndProcess(TemplateManager.java:592)
spring_1_7d49152b30f2 | 	at org.thymeleaf.TemplateEngine.process(TemplateEngine.java:1098)
spring_1_7d49152b30f2 | 	at org.thymeleaf.TemplateEngine.process(TemplateEngine.java:1072)
spring_1_7d49152b30f2 | 	at org.thymeleaf.spring5.view.ThymeleafView.renderFragment(ThymeleafView.java:366)
spring_1_7d49152b30f2 | 	at org.thymeleaf.spring5.view.ThymeleafView.render(ThymeleafView.java:190)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1400)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1145)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1084)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:963)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1006)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:898)
spring_1_7d49152b30f2 | 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:655)
spring_1_7d49152b30f2 | 	at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:883)
spring_1_7d49152b30f2 | 	at javax.servlet.http.HttpServlet.service(HttpServlet.java:764)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:227)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
spring_1_7d49152b30f2 | 	at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:119)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:197)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97)
spring_1_7d49152b30f2 | 	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:540)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135)
spring_1_7d49152b30f2 | 	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92)
spring_1_7d49152b30f2 | 	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78)
spring_1_7d49152b30f2 | 	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:357)
spring_1_7d49152b30f2 | 	at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:382)
spring_1_7d49152b30f2 | 	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
spring_1_7d49152b30f2 | 	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:895)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1722)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
spring_1_7d49152b30f2 | 	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
spring_1_7d49152b30f2 | 	at java.lang.Thread.run(Thread.java:748)
spring_1_7d49152b30f2 | 2022-03-31 05:31:36,394 ERROR [http-nio-8080-exec-10] org.springframework.boot.autoconfigure.web.servlet.error.ErrorMvcAutoConfiguration$StaticView: Cannot render error page for request [/workflows/github.com/mr-c/h3agatk/blob/cwl_v1_0/workflows/GATK/GATK-complete-WES-Workflow-h3abionet.cwl] as the response has already been committed. As a result, the response may have the wrong status code.

@kinow
Member Author

kinow commented Mar 31, 2022

Ok, looks like it's not an issue with the data itself, or the script 🎉 🥳

thymeleaf/thymeleaf#809

Thymeleaf, the template engine used with Spring Boot, became stricter about access to static methods. I think the solution will be to move the call to the controller and pass the result down to the template. Related to #378

@kinow kinow marked this pull request as ready for review March 31, 2022 06:23
@kinow
Member Author

kinow commented Mar 31, 2022

To review this pull request, I think one could:

  • look at the README.md file in this pull request
  • take a look at the mongo_to_pg.py script, and try it if you have Pandas + Numpy and a dump of CWL Viewer; the CSV should contain the information for the workflow table (similar to the Mongo collection, just with different column names, e.g. cwlVersion vs. cwl_version)
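
The column renaming mentioned in the last bullet (Mongo field cwlVersion, PostgreSQL column cwl_version) follows the usual camelCase to snake_case pattern. A small sketch of that conversion is below; it is only an assumption that every column follows this rule, and mongo_to_pg.py may well use an explicit mapping instead. The function name is hypothetical.

```python
import re


def camel_to_snake(name):
    """Turn a camelCase field name into a snake_case column name."""
    # Insert an underscore before each capital letter that follows a
    # lowercase letter or digit, then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


print(camel_to_snake("cwlVersion"))  # cwl_version
```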

@kinow kinow marked this pull request as draft March 31, 2022 18:57
@kinow
Member Author

kinow commented Mar 31, 2022

Making it a draft just until I update the dump.sh to go over the entries 2000 at a time, so that we can get the complete Mongo database dump with that script. Will update the PR, and post the time that it took to dump & load on my machine.

@kinow kinow force-pushed the mongo-pg-script branch from 9eb10d3 to 54ccbed Compare April 1, 2022 21:23
@kinow
Member Author

kinow commented Apr 1, 2022

I've replaced the old dump.sh with dump.py. It has the same behavior as dump.sh, with similar syntax, e.g.:

(venv) kinow@ranma:~/Development/java/workspace/cwlviewer$ python dump.py
usage: dump.py [-h] [-v VIEWER] -o OUTPUT [-p PAGE] [-s SIZE] [-a] [-d]
dump.py: error: the following arguments are required: -o/--output

(venv) kinow@ranma:~/Development/java/workspace/cwlviewer$ python dump.py -h
usage: dump.py [-h] [-v VIEWER] -o OUTPUT [-p PAGE] [-s SIZE] [-a] [-d]

optional arguments:
  -h, --help            show this help message and exit
  -v VIEWER, --viewer VIEWER
                        server base URL
  -o OUTPUT, --output OUTPUT
                        output directory
  -p PAGE, --page PAGE  what workflows page to retrieve
  -s SIZE, --size SIZE  how many workflows to retrieve (capped at 2000)
  -a, --all             dump all the workflows
  -d, --debug           set logging level to debug

The script raises an error if you use --all together with --page or --size. Internally, --all first sends one small request to discover how many elements are in the DB, calculates the number of pages, and then sends separate requests of 2000 workflows each. On my computer (i7, 16GB DDR3, SSD, old Thinkpad T530) it took < 5 minutes to retrieve all the workflows.
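
For readers who want to see how such a command line could be wired up, here is a minimal argparse sketch that mirrors the usage text shown above. It is a reconstruction for illustration, not the code of dump.py: the function names and the error message are hypothetical, and the default viewer URL is taken from the log output of a run without -v.

```python
import argparse


def build_parser():
    """Build a parser with the same options as the usage text above."""
    parser = argparse.ArgumentParser(prog="dump.py")
    parser.add_argument("-v", "--viewer", default="https://view.commonwl.org/",
                        help="server base URL")
    parser.add_argument("-o", "--output", required=True,
                        help="output directory")
    parser.add_argument("-p", "--page", type=int,
                        help="what workflows page to retrieve")
    parser.add_argument("-s", "--size", type=int,
                        help="how many workflows to retrieve (capped at 2000)")
    parser.add_argument("-a", "--all", action="store_true",
                        help="dump all the workflows")
    parser.add_argument("-d", "--debug", action="store_true",
                        help="set logging level to debug")
    return parser


def parse_args(argv):
    """Parse argv and reject --all combined with --page or --size."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.all and (args.page is not None or args.size is not None):
        # parser.error() prints the usage line and exits with status 2.
        parser.error("--all cannot be combined with --page or --size")
    return args
```

Checking the combination after parsing keeps the usage line identical to the one shown above; a mutually exclusive group would express only part of the rule, since --page and --size are allowed together.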

(venv) kinow@ranma:~/Development/java/workspace/cwlviewer$ python dump.py -o /tmp/backups -a
INFO:Viewer URL: https://view.commonwl.org/
INFO:Output: /tmp/backups
INFO:Dumping all the workflows from https://view.commonwl.org/ to /tmp/backups
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [04:39<00:00, 17.49s/it]
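
The 16 iterations in the progress bar follow from the paging arithmetic described above. A small sketch of that calculation is below, assuming zero-based page numbers (as in the page=0 requests later in this thread); the function and constant names are hypothetical, and 31273 is the workflow count reported further down.

```python
import math

MAX_PAGE_SIZE = 2000  # the cap mentioned in the --size help text


def page_requests(total_elements, page_size=MAX_PAGE_SIZE):
    """Return the (page, size) parameters needed to fetch every element."""
    pages = math.ceil(total_elements / page_size)
    return [{"page": page, "size": page_size} for page in range(pages)]


print(len(page_requests(31273)))  # 16
```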

Updated the docs too in the last commit.

@kinow
Member Author

kinow commented Apr 1, 2022

And in under 20 seconds, the JSON dump files are converted to CSV with the script in this pull request.

(venv) kinow@ranma:~/Development/java/workspace/cwlviewer/docs/mongo-to-postgres$ BKP_DIR="/home/kinow/Desktop/backups/"; for F in $(ls ${BKP_DIR}); do python mongo_to_pg.py -i ${BKP_DIR}${F} -o ${BKP_DIR}${F}.csv; done
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 198.22it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 224.22it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 194.57it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 222.76it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 222.55it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 175.33it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 239.19it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 236.46it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 225.65it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 204.07it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 195.28it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 234.74it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 253.18it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 237.69it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 225.56it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 281.71it/s]

@kinow kinow force-pushed the mongo-pg-script branch from 54ccbed to e5738d5 Compare April 1, 2022 21:37
@kinow
Member Author

kinow commented Apr 1, 2022

Stopped the Docker Compose containers with docker-compose down -v (the -v flag removes the volumes too), and started them again with docker-compose up --force-recreate. After a few minutes, the development environment was ready for the final test.

kinow@ranma:~/Development/java/workspace/cwlviewer$ cd ~/Desktop/backups/
kinow@ranma:~/Desktop/backups$ ls
2022-04-02T100812.json.csv  2022-04-02T100925.json.csv  2022-04-02T101032.json.csv  2022-04-02T101145.json.csv
2022-04-02T100830.json.csv  2022-04-02T100946.json.csv  2022-04-02T101054.json.csv  2022-04-02T101158.json.csv
2022-04-02T100850.json.csv  2022-04-02T101001.json.csv  2022-04-02T101115.json.csv  2022-04-02T101213.json.csv
2022-04-02T100908.json.csv  2022-04-02T101017.json.csv  2022-04-02T101132.json.csv  2022-04-02T101223.json.csv
kinow@ranma:~/Desktop/backups$ docker ps
CONTAINER ID        IMAGE                                     COMMAND                  CREATED             STATUS              PORTS                    NAMES
9061dfd7afbf        commonworkflowlanguage/cwlviewer:v1.4.2   "/usr/local/bin/mvn-…"   5 minutes ago       Up 5 minutes        0.0.0.0:8080->8080/tcp   cwlviewer_spring_1_806c3a8430c4
af561db25bcb        stain/jena-fuseki:3.4.0                   "/docker-entrypoint.…"   5 minutes ago       Up 5 minutes        0.0.0.0:3030->3030/tcp   cwlviewer_sparql_1_19e73735e7a1
6abce175c7b6        postgres:14-alpine                        "docker-entrypoint.s…"   5 minutes ago       Up 5 minutes        0.0.0.0:5432->5432/tcp   cwlviewer_postgres_1_3947782daac4
kinow@ranma:~/Desktop/backups$ for F in $(ls *.csv); do docker cp ${F} 6abce175c7b6:/; done
kinow@ranma:~/Desktop/backups$ docker exec -ti 6abce175c7b6 /bin/bash
bash-5.1# ls /
2022-04-02T100812.json.csv  2022-04-02T101017.json.csv  2022-04-02T101213.json.csv  lib                         sbin
2022-04-02T100830.json.csv  2022-04-02T101032.json.csv  2022-04-02T101223.json.csv  media                       srv
2022-04-02T100850.json.csv  2022-04-02T101054.json.csv  bin                         mnt                         sys
2022-04-02T100908.json.csv  2022-04-02T101115.json.csv  dev                         opt                         tmp
2022-04-02T100925.json.csv  2022-04-02T101132.json.csv  docker-entrypoint-initdb.d  proc                        usr
2022-04-02T100946.json.csv  2022-04-02T101145.json.csv  etc                         root                        var
2022-04-02T101001.json.csv  2022-04-02T101158.json.csv  home                        run
bash-5.1# for F in $(ls *.csv); do echo "\copy workflow ($(head -1 /${F} | sed 's/|/,/g')) FROM '${F}' DELIMITER ',' CSV HEADER;" | psql -U sa cwlviewer; done
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 2000
COPY 1273
bash-5.1# 
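
The shell loop above builds one \copy statement per file, taking the column list from the first line of each CSV. The same idea in Python, as a sketch only: the function name is hypothetical, and it assumes a comma-delimited header with column names that need no quoting.

```python
import csv
import io


def copy_statement(table, csv_text, csv_path):
    """Build a psql \\copy statement whose column list comes from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = ", ".join(header)
    return f"\\copy {table} ({columns}) FROM '{csv_path}' DELIMITER ',' CSV HEADER;"


print(copy_statement("workflow", "id,cwl_version\n1,v1.0\n", "/example.csv"))
```

Naming the columns explicitly means the CSV column order does not have to match the table definition, which is presumably why the loop reads the header instead of relying on the default column order.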

The order of workflows displayed appears to be different, but the content appears to be the same :-) the cwltool and Fetched Date, both got updated too to the version of cwltool in the Docker Compose setup and the date to today.

image

The PostgreSQL COPY command outputs how many entries it copied, and the counts above add up to 31273 loaded workflows (15 × 2000 + 1273). Exactly what we have (at the moment of writing) in production. Finally, we can also use the API to fetch a small batch in both prod and locally to confirm the number of entries.
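
The total can be double-checked by summing the "COPY n" lines printed by psql. A short sketch, with a hypothetical helper name:

```python
def total_copied(psql_output):
    """Sum the row counts from lines of the form 'COPY <n>'."""
    total = 0
    for line in psql_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "COPY" and parts[1].isdigit():
            total += int(parts[1])
    return total


# Fifteen full batches plus the final partial one, as in the output above.
output = "\n".join(["COPY 2000"] * 15 + ["COPY 1273"])
print(total_copied(output))  # 31273
```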

The command curl -XGET -H "accept: application/json" https://view.commonwl.org/workflows?page=0&size=0 returns a JSON that ends with ... "totalPages":3127,"totalElements":31267,"last":false,"sort":null,"numberOfElements":10,"first":true,"size":10,"number":0}.

The same command against the local Docker Compose setup, curl -XGET -H "accept: application/json" http://localhost:8080/workflows?page=0&size=0, returns "totalPages":3128,"last":false,"totalElements":31273,"sort":{"unsorted":true,"empty":true,"sorted":false},"numberOfElements":10,"first":true,"size":10,"number":0,"empty":false} (the Spring Boot upgrade resulted in different pagination attributes, but those are used by Spring Boot internally for pagination; ignore the differences in the JSON here related to sorting/paging).
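
To compare the paging totals without reading the raw JSON, something like the following sketch could be used. It only parses a response body that has already been fetched; the helper names are hypothetical, and the sample body contains just the paging fields quoted above for the local setup (the real response also carries the workflow content).

```python
import json


def paging_summary(response_text):
    """Extract the paging totals from a /workflows JSON response body."""
    page = json.loads(response_text)
    return {key: page[key] for key in ("totalElements", "totalPages", "size")}


def expected_pages(total_elements, size):
    """Number of pages needed for total_elements at the given page size."""
    return -(-total_elements // size)  # ceiling division


# Paging fields as quoted above for the local setup.
local_body = '{"totalPages": 3128, "totalElements": 31273, "size": 10, "number": 0}'
summary = paging_summary(local_body)
print(summary["totalPages"] == expected_pages(summary["totalElements"], summary["size"]))
```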

In both responses we have the "totalPages":3128 🎉 🥳 I also browsed a few pages, selecting some workflows and they were mostly rendered fine (see #391 for a pending change to fix rendering in some workflows after the SpringBoot upgrade, not a blocker for the DB migration 👍 )

Et voilà! That's that, I think we can use this to migrate the DB now 🤞

Bruno

@kinow kinow marked this pull request as ready for review April 1, 2022 22:07
@cure
Contributor

cure commented Apr 14, 2022

Tested this by importing from the production installation into the staging site, which worked great. LGTM!

@kinow kinow force-pushed the mongo-pg-script branch from 9ee39d1 to b87e0ac Compare April 14, 2022 20:40
@kinow
Member Author

kinow commented Apr 14, 2022

Rebased, and left two commits only, one with the notebook, one with the script changes.

@kinow kinow force-pushed the mongo-pg-script branch from b87e0ac to 07cff88 Compare April 19, 2022 08:09
@kinow
Member Author

kinow commented Apr 19, 2022

Rebased.

@kinow kinow merged commit 047bd7b into main Apr 19, 2022
@kinow kinow deleted the mongo-pg-script branch April 19, 2022 08:15


Development

Successfully merging this pull request may close these issues.

Create script to process a backup/dump from MongoDB and load into PostgreSQL

2 participants