
Add Description2Code to request-for-research #5

Merged: 12 commits merged into openai:master on Aug 3, 2016

Conversation

ethancaballero
Contributor

@ethancaballero ethancaballero commented Jun 9, 2016

Given a CS/ML paper describing a model, generate the model's source code. Somewhat insane, but solvable/impactful. [this request has since been altered as described below]

@ethancaballero ethancaballero changed the title Add Paper2Code for request-for-research Add Paper2Code to request-for-research Jun 9, 2016
@ilyasu123
Contributor

At present the problem is too hard, and success is impossible. Can you narrow down the problem? For example, can you find a similar problem where the program descriptions are only a couple of sentences long, and the programs do not exceed 20 lines of code? Using GitXiv as a source of training data is cool, but it is too hard, even for a sequence to sequence model with attention.

@ilyasu123
Contributor

GitXiv as a source of training data for paper2code is too hard for 2016's machine learning. Try to find a source of programs that do not exceed 20 lines in length. Otherwise, the project is pretty much guaranteed to be unsolvable barring significant breakthroughs in ML.

@ethancaballero
Contributor Author

ethancaballero commented Jun 13, 2016

I've altered the research request file to meet the specifications you outlined in your previous two responses. I'm currently working on scraping a dataset of short program descriptions and short (~20 line) programs. I'll probably have the dataset by the end of the week. Also, you might be able to ask Sam Altman to get one of YCombinator's companies (such as Triplebyte/HackerRank) to allow access to its (~20 line) description-code pair (programming interview) dataset(s).

@ethancaballero ethancaballero changed the title Add Paper2Code to request-for-research Add Description2Code to request-for-research Jun 14, 2016
@ilyasu123
Contributor

I think the most important task is the preparation of the dataset (which will become a valuable resource in the community). For the purposes of this research request, do not assume that we will get TripleByte's question set. You may be right that the best way to get such a dataset is to partner with an organization that is involved in programming education or testing. But if there is a way to get such data online, that would be great.

BTW, I suggest that you paste a few example programs here before investing a big effort in scraping the data. I may have useful feedback.

Once you have the dataset, simply ask the user to apply a sequence to sequence model with attention to this input-output example set, and to experiment with all kinds of architectural designs that might improve performance.
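
For concreteness, here is a minimal sketch of such a sequence to sequence model with attention (a hypothetical PyTorch sketch; the dimensions, tokenization, and training details are illustrative assumptions, not a reference implementation):

import torch
import torch.nn as nn

class Seq2SeqAttention(nn.Module):
    # Character-level encoder-decoder with dot-product attention.
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, desc, code):
        # desc: (B, S) description token ids; code: (B, T) code token ids
        enc, h = self.encoder(self.embed(desc))             # (B, S, H)
        dec, _ = self.decoder(self.embed(code), h)          # (B, T, H)
        scores = torch.bmm(dec, enc.transpose(1, 2))        # (B, T, S)
        context = torch.bmm(scores.softmax(dim=-1), enc)    # (B, T, H)
        return self.out(torch.cat([dec, context], dim=-1))  # (B, T, vocab)

The model would be trained with cross-entropy on the next code token under teacher forcing; everything beyond that (beam search, architectural variations) is the experimentation mentioned above.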

@ethancaballero
Contributor Author

ethancaballero commented Jun 18, 2016

Here's an input-output (Description and Solution, respectively) example from the dataset (the sections labelled Input/Output/Sample_Input/Sample_Output are all part of the Description):

Description:
For any positive integer, we define a digit rotation as either moving the first digit to the end of the number (left digit rotation), or the last digit to the front of the number (right digit rotation). For example, the number 12345 could be left digit rotated to 23451, or right digit rotated to 51234. If there are any leading zeros after digit rotation, they must be removed. So 10203 could be left digit rotated to 2031, then left digit rotated again to 312. Given an integer N, determine the largest integer that can result from performing a series of one or more digit rotations on N.

Input:
Input will begin with an integer T (at most 1000), the number of test cases. Each test case consists of a positive integer N<100000000 (10^8) on a line by itself.

Output:
For each test case, print the largest integer that can result from performing one or more digit rotations on N.

Sample Input:
6
12345
54321
10901
211011
7
90

Sample Output:
51234
54321
11090
211011
7
9

Solution:

def left_rotate(s):
    s = s[-1]+s[:-1]
    s = s.lstrip('0')
    return s

def right_rotate(s):
    s = s[1:]+s[0]
    s = s.lstrip('0')
    return s

t = int(raw_input())
while t :
    t=t-1
    n = raw_input()
    ans = max(int(left_rotate(right_rotate(n))),int(right_rotate(left_rotate(n))))
    temp = n[:]
    for i in range(len(n)) :
        temp = left_rotate(temp)
        ans = max(ans,int(temp))
    temp = n[:]
    for i in range(len(n)) :
        temp = right_rotate(temp)
        ans = max(ans,int(temp))
    print ans

@ethancaballero
Contributor Author

ethancaballero commented Jun 18, 2016

The practice problems section on Codechef.com is where I'm currently scraping description-solution pairs from; click through some of the practice problems at that link to see what other pairs will look like.

During the testing phase of the ML model, the Sample Input and Sample Output can be used to verify whether each solution that the model generates is correct.
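
A sketch of that check, assuming each generated solution is saved as a standalone Python script (the helper name and file handling here are hypothetical):

import subprocess

def passes_sample(solution_path, sample_input, sample_output):
    # Run the candidate solution on the sample input and compare stdout.
    try:
        result = subprocess.run(
            ["python", solution_path],
            input=sample_input,
            capture_output=True,
            text=True,
            timeout=5,  # guard against non-terminating generated code
        )
    except subprocess.TimeoutExpired:
        return False
    return result.stdout.strip() == sample_output.strip()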

I've also tried scraping other programming-challenge websites, but have found that CodeChef is the easiest website to scrape that contains a large number of description-solution pairs. The only possible downside to CodeChef is that roughly half of the challenges have a goofy ~2-sentence character-driven backstory in the description that might introduce undesirable noise.

Also, I was originally scraping HackerRank as well, but decided to switch to just CodeChef when I realized all the math equations in the English translation of HackerRank's problem descriptions are rendered as SVG images.

@ilyasu123
Contributor

This is a great example. The problem is definitely ambitious, so don't expect a solution to pop up in the near term. The supervised learning techniques of today are probably inadequate and will require real advances. On the other hand, this is a good stimulus for research.

How many input-output examples of this kind do you think you'd be able to collect?

@ethancaballero
Contributor Author

ethancaballero commented Jun 18, 2016

Somewhere between 1000 and 5000, and I can augment the data by collecting multiple (~10 to 20?) solutions for every description.

@ilyasu123
Contributor

This is a great dataset and it's worth collecting. However, it's a legitimately small dataset by modern ML standards, especially for a problem as difficult as this. Expect it to be solved only once we get transfer learning so good that we can train a system on lots of other data, so that it'll have an easy time "getting" programming from these examples. That said, I'll gladly accept the pull request (with a few tweaks) once the dataset gets into good shape.

@ilyasu123
Contributor

I should set the expectation: the problem will probably remain unsolved for a significant (by ML standards) amount of time.

@ethancaballero
Contributor Author

I just discovered an easy way to scrape CodeForces and HackerEarth as well, so I think I can collect ~9000 examples now.

@ilyasu123
Contributor

Great!

@ilyasu123
Contributor

How is data collection going?

@ethancaballero
Contributor Author

Approximately half of CodeChef (which contains ~4000 problems) is scraped as of now. I'm working on parallelizing the scraping code I'm using because it's pretty slow in its current serial form. After CodeChef is done, I have to finish scraping HackerRank, CodeForces, and HackerEarth to get to ~9000.

Here's a link to a sample of ~500 description-solution code pairs from the codechef scrape and the code I've been using for scraping: https://github.com/ethancaballero/description2code-data-sample

The file_structure/formatting is the same as that which will be used for the finished dataset. If you see any problems with the dataset sample let me know.

@ilyasu123
Contributor

The dataset looks good. Extremely ambitious. Will push online once the data is ready.

@ilyasu123
Contributor

Any updates?

@ethancaballero
Contributor Author

ethancaballero commented Jul 29, 2016

I currently have 5300 scraped. I'm working on scraping the remaining sites. Also, I'm working on normalizing the formatting of the problem descriptions so that all problems have descriptions in a similar format.

Here's a link to the current 5300, with a README.md describing ways to use the dataset for curriculum learning and ways to benchmark which types of algorithms one's model is capable of learning/generating:
https://docs.google.com/uc?id=0Bz3fihKG133ceWNFQTQ5S0xhZUk&export=download

@ilyasu123
Contributor

Great work, thanks for collecting this data! This problem is extremely hard for today's ML, but there's no harm putting it out there :)

@ilyasu123 ilyasu123 merged commit 90bddaa into openai:master Aug 3, 2016
@ilyasu123
Contributor

Also, submit a PR updating the problem description once you finish scraping the remainder of the problems.

@ilyasu123
Contributor

Do you want me to link to your github username, stating that you collected the data?

@ethancaballero
Contributor Author

ethancaballero commented Aug 4, 2016

Linking to my GitHub username is fine. When I'm done with all the scraping, switch the username link to a link to a repo that I'll post with the scraping code (similar to how im2latex contains a link to Miffyli's dataset-creation-tools repo).

@tlbtlbtlb

The descriptions seem to lose newlines inside the example input and output, which are important in most of these problems. For example, in d2c_current/codeforces/280_C/description/description.txt (compare to http://codeforces.com/problemset/problem/280/C) the two lines "2" and "1 2" are merged into a single line "21 2".
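
For what it's worth, this kind of merging usually comes from stripping tags without inserting separators; with BeautifulSoup, passing a separator to get_text avoids it. A toy sketch, assuming the example lines are rendered as sibling elements (an assumption about the page markup):

from bs4 import BeautifulSoup

# Toy stand-in for an example-input block rendered as sibling elements.
soup = BeautifulSoup("<div>2</div><div>1 2</div>", "html.parser")

print(soup.get_text())                # "21 2"   <- lines merged, as in the bug
print(soup.get_text(separator="\n"))  # "2\n1 2" <- line structure preserved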


@ethancaballero
Contributor Author

ethancaballero commented Aug 4, 2016

Ok, working on fixing newline error(s). Let me know if anyone finds any other errors or has any suggestions.

@Avmb

Avmb commented Dec 5, 2016

What about generating Python functions from their docstrings? It should be easy to collect hundreds of thousands, if not millions, of examples.
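
A sketch of mining such pairs with the standard ast module (the file path is hypothetical, and ast.unparse requires Python 3.9+):

import ast

def docstring_code_pairs(source):
    # Yield (docstring, function source) pairs from a Python module.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node)
            if doc:
                yield doc, ast.unparse(node)

with open("some_scraped_file.py") as f:  # hypothetical scraped file
    for doc, code in docstring_code_pairs(f.read()):
        print(doc.splitlines()[0], "->", len(code), "chars of code")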

@ethancaballero
Contributor Author

ethancaballero commented Dec 5, 2016

That's a cool idea. Here's a GitHub scraper & preprocessor:
https://github.com/uclmr/pycodesuggest/tree/master/github-scraper
https://github.com/uclmr/pycodesuggest#step-1-cloning-the-repos

We would also need to use the collected gold code as oracles to create multiple test cases, to test whether generated code satisfies the functional specification the docstring describes (and to reward the model when it does). To create test cases, random input arguments would be passed to the oracle, and its returned outputs would be paired with the corresponding inputs.
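
A sketch of that oracle idea (make_test_cases and sample_args are illustrative names, not part of any existing tool):

import random

def make_test_cases(oracle, sample_args, n=20):
    # Build (input, expected output) pairs by querying the gold code.
    cases = []
    for _ in range(n):
        args = sample_args()  # draw random input arguments
        cases.append((args, oracle(*args)))
    return cases

# e.g., for a gold two-argument integer function:
cases = make_test_cases(lambda a, b: a + b,
                        lambda: (random.randint(0, 99), random.randint(0, 99)))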

An alternative would be to find URLs that already contain test cases corresponding to a piece of code and its docstring/description (similar to the way most programming-competition websites provide dozens of scrapable test cases for each problem).

@Avmb

Avmb commented Dec 5, 2016

The Github scraper seems well-suited for the task.

However, I'm afraid that automatically generating sensible test cases for arbitrary code snippets, especially for a dynamic language like Python, would be very hard. It is in fact a research problem in its own right.

Maybe it is better if we initially evaluate on textual similarity (perhaps up to alpha transformations or something similar) and then move to functional similarity at a later stage.
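
A sketch of comparing two snippets up to alpha-renaming, by canonicalizing variable names before comparing ASTs (this naive version renames every Name node, builtins included; real alpha-equivalence checking is subtler):

import ast

class Canonicalize(ast.NodeTransformer):
    # Rename variables to v0, v1, ... in order of first appearance.
    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        new_id = self.names.setdefault(node.id, "v%d" % len(self.names))
        return ast.copy_location(ast.Name(id=new_id, ctx=node.ctx), node)

def alpha_equal(a, b):
    dump = lambda src: ast.dump(Canonicalize().visit(ast.parse(src)))
    return dump(a) == dump(b)

print(alpha_equal("x = 1\ny = x + 1", "a = 1\nb = a + 1"))  # True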

@0bserver07

0bserver07 commented Dec 5, 2016

@Avmb That would be neat, though in terms of learning end-to-end it becomes a hard task, since docstrings are usually programmer-friendly and very brief, and models tend to overfit to such short inputs.

To generalize well enough over the test cases to output working code, parsing code into an AST, rather than treating it as human-readable text, can be considered the easier route.

That being said, there are some resources in this area that might be useful, here is the list:

Another neat trick, comparable to your docstring approach, is using Pycco: generate documentation from already-written code, then learn to generate the code back:

  • pycco todo/todos/__init__.py -> docs/todo/todos/__init__.html

from django.db import models

# === Models for Todos app ===


class ListItem(models.Model):
    """
    The ListItem class defines the main storage point for todos.
    Each todo has two fields:
    text - stores the text of the todo
    is_visible - used to control if the todo is displayed on screen
    """

    text = models.CharField(max_length=300)
    is_visible = models.BooleanField()

Re: your second comment :) Yes, that is an initial approach using existing NN architectures, then moving into contextual (or, in this scenario, functional) similarity through some type of reinforcement learning.

@gokul-uf

gokul-uf commented Dec 5, 2016

@ilyasu123 This suggestion might sound like a cheat, but honestly, why not have a standardized representation of network models and training parameters that comes with a paper? This could be something like UML or XML. Then the problem reduces to creating a parser for the file format, and each framework (Keras / Theano / TensorFlow / Torch) could have its own code generator to create appropriate code.
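
As a toy illustration, a sketch where the "markup" is a plain Python dict standing in for such a file, and the generator emits Keras-style code (the spec schema is invented for this example):

# Hypothetical model spec, standing in for an XML/UML file shipped with a paper.
spec = {
    "layers": [
        {"type": "Dense", "units": 128, "activation": "relu"},
        {"type": "Dense", "units": 10, "activation": "softmax"},
    ],
    "optimizer": "adam",
    "loss": "categorical_crossentropy",
}

def to_keras(spec):
    # Emit Keras code for a sequential model described by the spec.
    lines = ["model = keras.Sequential(["]
    for layer in spec["layers"]:
        lines.append('    keras.layers.%s(%d, activation="%s"),'
                     % (layer["type"], layer["units"], layer["activation"]))
    lines.append("])")
    lines.append('model.compile(optimizer="%s", loss="%s")'
                 % (spec["optimizer"], spec["loss"]))
    return "\n".join(lines)

print(to_keras(spec))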

I believe that this would greatly reduce the time needed to recreate models published in papers, and it would also make it apparent when authors have avoided mentioning some hyperparameters.

One could argue: what if new models come along that our markup cannot represent (like what happened with Caffe and RNNs)? That's fine; we release a new version of the markup that is backward compatible with the older versions.

@ethancaballero
Contributor Author

ethancaballero commented Apr 7, 2017

@Jeremy-sense

Jeremy-sense commented Mar 20, 2022

This dataset is really helpful.

  1. Is there an update for the link https://docs.google.com/uc?id=0Bz3fihKG133ceWNFQTQ5S0xhZUk&export=download ? That link no longer works.
  2. Are all the solutions included correct? (I am guessing not, based on the discussion of unit tests.) I am interested in finding sets where there are mistakes, so we can train a model to recognize (common) mistakes and help students.
    [if this should be listed as an issue and not a comment here, I am sorry and happy to move it]
    Thanks.
