Add Description2Code to request-for-research #5
Conversation
Given a CS/ML paper describing a model, generate the model's source code. Somewhat insane, but high impact.
At present the problem is too hard, and success is impossible. Can you narrow down the problem? For example, can you find a similar problem where the program descriptions are only a couple of sentences long, and the programs do not exceed 20 lines of code? Using GitXiv as a source of training data is cool, but it is too hard, even for a sequence-to-sequence model with attention.
GitXiv as a source of training data for paper2code is too hard for 2016's machine learning. Try to find a source of programs that do not exceed 20 lines in length. Otherwise the project is pretty much guaranteed to be unsolvable barring significant breakthroughs in ML.
I've altered the research request file to meet the specifications you outlined in your previous two responses. I'm currently working on scraping a dataset of short program descriptions and short (~20 line) programs. I'll probably have the dataset by the end of the week. Also, you might be able to ask Sam Altman to get one of Y Combinator's companies (such as Triplebyte/HackerRank) to allow access to its (~20 line) description-code pair (programming interview) dataset(s).
I think the most important task is the preparation of the dataset (which will become a valuable resource in the community). For the purposes of this research request, do not assume that we will get Triplebyte's question set. You may be right that the best way to get such a dataset is to partner with an organization that is involved in programming education or testing. But if there is a way to get such data online, that would be great. BTW, I suggest that you paste a few example programs here before investing a big effort in scraping the data; I may have useful feedback. Once you have the dataset, simply ask the user to apply a sequence-to-sequence model with attention on this input-output example set, and to experiment with all kinds of architectural designs that might improve performance.
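For concreteness, here is a minimal sketch of the kind of sequence-to-sequence model with attention being suggested. It is written in PyTorch with made-up vocabulary sizes and random placeholder tensors standing in for tokenized description-solution pairs; it is an illustrative baseline under those assumptions, not an architecture prescribed anywhere in this thread.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embed(src))
        return outputs, hidden                     # outputs: (batch, src_len, hidden)

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size * 2, vocab_size)

    def forward(self, tgt, hidden, enc_outputs):
        dec_out, hidden = self.rnn(self.embed(tgt), hidden)
        # Dot-product (Luong-style) attention over the encoder states.
        weights = torch.softmax(torch.bmm(dec_out, enc_outputs.transpose(1, 2)), dim=-1)
        context = torch.bmm(weights, enc_outputs)
        logits = self.out(torch.cat([dec_out, context], dim=-1))
        return logits, hidden

# Toy usage with hypothetical vocabulary size and random stand-in batches.
VOCAB = 128
enc, dec = Encoder(VOCAB, 256), AttnDecoder(VOCAB, 256)
desc = torch.randint(0, VOCAB, (4, 300))           # tokenized descriptions
code = torch.randint(0, VOCAB, (4, 200))           # tokenized solutions
enc_out, h = enc(desc)
logits, _ = dec(code[:, :-1], h, enc_out)          # teacher forcing
loss = F.cross_entropy(logits.reshape(-1, VOCAB), code[:, 1:].reshape(-1))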
Here's an input-output (Description-Solution, respectively) example from the dataset; the sections labelled Input/Output/Sample Input/Sample Output are all part of the Description:
Description:
Input:
Output:
Sample Input:
Sample Output:
Solution:
def left_rotate(s):
    s = s[-1]+s[:-1]
    s = s.lstrip('0')
    return s

def right_rotate(s):
    s = s[1:]+s[0]
    s = s.lstrip('0')
    return s

t = int(raw_input())
while t :
    t=t-1
    n = raw_input()
    ans = max(int(left_rotate(right_rotate(n))),int(right_rotate(left_rotate(n))))
    temp = n[:]
    for i in range(len(n)) :
        temp = left_rotate(temp)
        ans = max(ans,int(temp))
    temp = n[:]
    for i in range(len(n)) :
        temp = right_rotate(temp)
        ans = max(ans,int(temp))
    print ans
The practice problems section on CodeChef.com is where I'm currently scraping description-solution pairs from; click on some of the practice problems that the link leads to to see what other pairs will look like. During the testing phase of the ML model, the Sample Input and Sample Output can be used to verify whether or not each solution that the model generates is correct. I've also tried scraping other programming-challenge websites, but have found that CodeChef is the easiest website to scrape that contains a large number of description-solution pairs. The only possible downside to CodeChef is that roughly half of the challenges have a goofy ~2-sentence character-driven backstory in the description that might introduce undesirable noise. Also, I originally was scraping HackerRank as well, but decided to switch to just CodeChef when I realized all the math equations in the English translation of HackerRank's problem descriptions are rendered as SVG images.
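One way the Sample Input/Sample Output could be used at test time is to run each generated program as a standalone script and compare its stdout with the expected output. A rough sketch only; the interpreter name, timeout, and whitespace handling below are illustrative assumptions, not part of any existing evaluation code.

import subprocess

def passes_sample(solution_path, sample_input, sample_output, timeout=5):
    """Run a candidate solution on the sample input and compare its stdout."""
    try:
        result = subprocess.run(
            ["python", solution_path],
            input=sample_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    # Compare outputs while ignoring trailing-whitespace differences.
    return result.stdout.strip() == sample_output.strip()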
This is a great example. The problem is definitely ambitious, so don't expect a solution to pop up in the near term. The supervised learning techniques of today are probably inadequate, and will require real advances. On the other hand, this is a good stimulant of research. How many input-output examples of this kind do you think you'd be able to collect?
Somewhere between 1000 and 5000, and I can augment the data by collecting multiple (~10 to 20?) solutions for every description.
This is a great dataset and it's worth collecting. However, it's a legitimately small dataset by modern ML standards, especially for a problem as difficult as this. Expect it to be solved only once we get transfer learning so good that we could train a system on lots of other data, so that it'll have an easy time "getting" programming from these examples. However, I'll gladly accept the pull request (with a few tweaks) once the dataset gets into a good shape.
I should set the expectation: the problem will probably remain unsolved for a significant (by ML standards) amount of time.
I just discovered an easy way to scrape CodeForces and HackerEarth as well, so I think I can collect ~9000 examples now. |
Great!
How is data collection going?
Approximately half of CodeChef (which contains ~4000) is scraped as of now. I'm working on parallelizing the scraping code I'm using because it's pretty slow in its current serial form. After CodeChef is done, I have to finish scraping HackerRank, CodeForces, and HackerEarth to get to ~9000. Here's a link to a sample of ~500 description-solution code pairs from the CodeChef scrape and the code I've been using for scraping: https://github.com/ethancaballero/description2code-data-sample The file structure/formatting is the same as that which will be used for the finished dataset. If you see any problems with the dataset sample, let me know.
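Since the scraping is network-bound, one common way to parallelize it is a thread pool. This is a sketch only; fetch_problem and problem_urls are hypothetical placeholders for whatever the real scraper does.

import concurrent.futures
import requests

def fetch_problem(url):
    # Placeholder: download one problem page. Real scraping code would also
    # parse out the description and its accepted solutions.
    return url, requests.get(url, timeout=30).text

problem_urls = []  # fill with the practice-problem URLs being scraped

# Network-bound work, so a thread pool is usually enough to hide latency.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for url, html in pool.map(fetch_problem, problem_urls):
        pass  # save the HTML / parsed pairs to disk here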
The dataset looks good. Extremely ambitious. Will push online once the data is ready.
Any updates?
I currently have 5300 scraped. I'm working on scraping the remaining sites. Also, I'm working on normalizing the formatting of the problem descriptions so that all problems have descriptions of a similar format. Here's a link to the current 5300 with a README.md describing ways to use the dataset for curriculum learning and ways to benchmark which types of algorithms one's model is capable of learning/generating:
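A small sketch of the kind of description normalization being described, assuming the goal is one canonical spelling for the section headers plus collapsed whitespace; the header names are guesses taken from the example pair earlier in the thread.

import re

# Hypothetical canonical section headers, based on the example pair above.
SECTION_ALIASES = {
    r"^\s*input\s*:?\s*$": "Input:",
    r"^\s*output\s*:?\s*$": "Output:",
    r"^\s*sample\s+input\s*:?\s*$": "Sample Input:",
    r"^\s*sample\s+output\s*:?\s*$": "Sample Output:",
}

def normalize_description(text):
    """Collapse stray whitespace and map header variants to one spelling."""
    lines = []
    for line in text.splitlines():
        line = line.rstrip()
        for pattern, canonical in SECTION_ALIASES.items():
            if re.match(pattern, line, re.IGNORECASE):
                line = canonical
                break
        lines.append(line)
    # Squeeze runs of blank lines down to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()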
Great work, thanks for collecting this data! This problem is extremely hard for today's ML, but there's no harm putting it out there :)
Also, submit a PR to the problem description once you finish scraping the remainder of the problems.
Do you want me to link to your github username, stating that you collected the data?
Linking to my github username is fine. When I'm done with all the scraping, switch the username link to a link to a repo that I'll post with the scraping code (similar to how im2latex contains a link to Miffyli's dataset creation tools code repo).
The descriptions seem to lose newlines inside the example input and output.
Ok, working on fixing the newline error(s). Let me know if anyone finds any other errors or has any suggestions.
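If the newlines are being lost while converting the problem pages' HTML to text, one possible fix, assuming BeautifulSoup is the parser in use (which is only a guess), is to extract text with an explicit separator so block elements keep their line breaks.

from bs4 import BeautifulSoup

def extract_description(html):
    soup = BeautifulSoup(html, "html.parser")
    # get_text with separator="\n" keeps line breaks between block elements,
    # so example input/output blocks are not flattened onto one line.
    return soup.get_text(separator="\n").strip()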
What about generating Python functions from their docstrings? It should be easy to collect hundreds of thousands, if not millions, of examples.
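Harvesting (docstring, function) pairs from Python source is mostly mechanical with the standard ast module; a rough sketch (requires Python 3.9+ for ast.unparse):

import ast

def docstring_function_pairs(source):
    """Yield (docstring, function_source) pairs from a Python module's source."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                yield doc, ast.unparse(node)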
That's a cool idea. Here's a github scraper & preprocessor: We would also need to use the collected gold code as oracles to create multiple test cases to test whether (and reward the model if) generated code satisfies the functional specifications that the docstring describes. To create test cases, random input arguments would be passed to the oracle to get its returned outputs, which are then paired with the corresponding input arguments. An alternative would be to find URLs that already contain test cases corresponding to a code snippet and its docstring/description (similar to the way most programming-competition websites provide dozens of scrapable corresponding test cases for each code sample).
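A sketch of the oracle idea: call the collected gold function on randomly generated arguments and record the resulting (input, output) pairs as test cases. The argument generator below is a stand-in; real code would need per-signature input generation.

import random

def make_test_cases(oracle, arg_generator, n_cases=20):
    """Record (args, expected_output) pairs by running the gold code."""
    cases = []
    for _ in range(n_cases):
        args = arg_generator()
        try:
            expected = oracle(*args)
        except Exception:
            continue  # skip inputs the gold code itself rejects
        cases.append((args, expected))
    return cases

# Example with a toy oracle and a hypothetical argument generator.
cases = make_test_cases(
    oracle=lambda s: s.lstrip("0"),
    arg_generator=lambda: ("".join(random.choice("0123456789") for _ in range(8)),),
)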
The GitHub scraper seems well-suited for the task. However, I'm afraid that automatically generating sensible test cases for arbitrary code snippets, especially for a dynamic language like Python, would be very hard. It is in fact a research problem of its own. Maybe it is better if we initially evaluate on textual similarity (maybe up to alpha transformations or something similar) and then move to functional similarity at a later stage.
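A first cut at "textual similarity up to alpha transformations" could canonicalize identifier names before comparing token streams. A sketch using the standard tokenize module; it is naive in that it also renames builtins, which a real metric would probably want to avoid.

import io
import keyword
import token
import tokenize

def alpha_normalized_tokens(source):
    """Token stream with user identifiers renamed to VAR_0, VAR_1, ..."""
    rename = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME and not keyword.iskeyword(tok.string):
            rename.setdefault(tok.string, f"VAR_{len(rename)}")
            out.append(rename[tok.string])
        elif tok.type in (token.NAME, token.NUMBER, token.STRING, token.OP):
            out.append(tok.string)
    return out

# Two snippets that differ only in variable names normalize identically.
assert alpha_normalized_tokens("x = x + 1\n") == alpha_normalized_tokens("y = y + 1\n")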
@Avmb That would be neat, though in terms of learning end-to-end it becomes a hard task, since docstrings are usually programmer-friendly, very brief, and tend to overfit to their length. In order to generalize enough over the test cases of outputting working code, going down the route of having code parsed as… That being said, there are some resources in this area that might be useful; here is the list:
Another neat trick that can be compared to your doc-string approach is using class-level docstrings like the one in this Django model:
from django.db import models

# === Models for Todos app ===
class ListItem(models.Model):
    """
    The ListItem class defines the main storage point for todos.
    Each todo has two fields:
        text - stores the text of the todo
        is_visible - used to control if the todo is displayed on screen
    """
    text = models.CharField(max_length=300)
    is_visible = models.BooleanField()
re: your second comment :)
@ilyasu123 This suggestion might sound like a cheat, but honestly, why not have a standardized representation of network models and training parameters that comes with a paper? This could be something like UML or XML. Then the problem reduces to creating a parser for the file format, and each framework (Keras/Theano/TensorFlow/Torch) could have its own code generator to create appropriate code. I believe this would greatly reduce the time needed to recreate models published in papers, and it could also detect if the authors are intentionally trying to avoid mentioning some hyperparameters. One could argue: what if new models come up that our markup cannot represent (like what happened with Caffe and RNNs)? That's fine; we release a new version of the markup that is backward compatible with the older versions.
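A toy version of this proposal, using a small declarative spec (JSON-like Python dicts rather than UML/XML, for brevity) and one framework-specific generator targeting Keras; every field name below is made up for illustration.

from tensorflow import keras

# Hypothetical declarative model spec a paper could ship alongside its text.
spec = {
    "input_shape": [784],
    "layers": [
        {"type": "Dense", "units": 256, "activation": "relu"},
        {"type": "Dense", "units": 10, "activation": "softmax"},
    ],
    "optimizer": "adam",
    "loss": "categorical_crossentropy",
}

def build_from_spec(spec):
    """One framework-specific 'code generator': spec -> compiled Keras model."""
    model = keras.Sequential()
    model.add(keras.Input(shape=tuple(spec["input_shape"])))
    for layer in spec["layers"]:
        if layer["type"] == "Dense":
            model.add(keras.layers.Dense(layer["units"], activation=layer["activation"]))
        else:
            raise ValueError(f"unsupported layer type: {layer['type']}")
    model.compile(optimizer=spec["optimizer"], loss=spec["loss"])
    return model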
This set is really helpful.
Given a CS/ML paper describing a model, generate the model's source code. Somewhat insane, but solvable/impactful. [This request has since been altered as described above.]