Add Description2Code to request-for-research #5
Conversation
Given a CS/ML paper describing a model, generate the model's source code. Somewhat insane, but high impact.
At present the problem is too hard, and success is impossible. Can you narrow down the problem? For example, can you find a similar problem where the program descriptions are only a couple of sentences long, and the programs do not exceed 20 lines of code? Using GitXiv as a source of training data is cool, but it is too hard, even for a sequence-to-sequence model with attention.
GitXiv as a source of training data for paper2code is too hard for 2016's machine learning. Try to find a source of programs that do not exceed 20 lines in length. Otherwise the project is pretty much guaranteed to be unsolvable barring significant breakthroughs in ML.
I've altered the research request file to meet the specifications you outlined in your previous two responses. I'm currently working on scraping a dataset of short program descriptions and short (~20 line) programs. I'll probably have the dataset by the end of the week. Also, you might be able to ask Sam Altman to get one of Y Combinator's companies (such as Triplebyte/HackerRank) to allow access to its (~20 line) description-code pair (programming interview) dataset(s).
I think the most important task is the preparation of the dataset (which will become a valuable resource in the community). For the purposes of this research request, do not assume that we will get Triplebyte's question set. You may be right that the best way to get such a dataset is to partner with an organization that is involved in programming education or testing. But if there is a way to get such data online, that would be great. BTW, I suggest that you paste a few example programs here before investing a big effort in scraping the data; I may have useful feedback. Once you have the dataset, simply ask the user to apply a sequence-to-sequence model with attention on this input-output example set, and to experiment with all kinds of architectural designs that might improve performance.
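For concreteness, here is a minimal sketch of the kind of sequence-to-sequence model with attention being suggested. It is written in PyTorch with made-up vocabulary sizes and random placeholder tensors standing in for tokenized description-solution pairs; it is an illustrative baseline under those assumptions, not an architecture prescribed anywhere in this thread.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embed(src))
        return outputs, hidden                     # outputs: (batch, src_len, hidden)

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size * 2, vocab_size)

    def forward(self, tgt, hidden, enc_outputs):
        dec_out, hidden = self.rnn(self.embed(tgt), hidden)
        # Dot-product (Luong-style) attention over the encoder states.
        weights = torch.softmax(torch.bmm(dec_out, enc_outputs.transpose(1, 2)), dim=-1)
        context = torch.bmm(weights, enc_outputs)
        logits = self.out(torch.cat([dec_out, context], dim=-1))
        return logits, hidden

# Toy usage with hypothetical vocabulary size and random stand-in batches.
VOCAB = 128
enc, dec = Encoder(VOCAB, 256), AttnDecoder(VOCAB, 256)
desc = torch.randint(0, VOCAB, (4, 300))           # tokenized descriptions
code = torch.randint(0, VOCAB, (4, 200))           # tokenized solutions
enc_out, h = enc(desc)
logits, _ = dec(code[:, :-1], h, enc_out)          # teacher forcing
loss = F.cross_entropy(logits.reshape(-1, VOCAB), code[:, 1:].reshape(-1))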
Here's an input-output (Description-Solution, respectively) example from the dataset; the sections labelled Input/Output/Sample Input/Sample Output are all part of the Description:
Description:
Input:
Output:
Sample Input:
Sample Output:
Solution:
def left_rotate(s):
    s = s[-1]+s[:-1]
    s = s.lstrip('0')
    return s

def right_rotate(s):
    s = s[1:]+s[0]
    s = s.lstrip('0')
    return s

t = int(raw_input())
while t :
    t=t-1
    n = raw_input()
    ans = max(int(left_rotate(right_rotate(n))),int(right_rotate(left_rotate(n))))
    temp = n[:]
    for i in range(len(n)) :
        temp = left_rotate(temp)
        ans = max(ans,int(temp))
    temp = n[:]
    for i in range(len(n)) :
        temp = right_rotate(temp)
        ans = max(ans,int(temp))
    print ans
The practice problems section on CodeChef.com is where I'm currently scraping description-solution pairs from; click on some of the practice problems that the link leads to to see what other pairs will look like. During the testing phase of the ML model, the Sample Input and Sample Output can be used to verify whether or not each solution that the model generates is correct. I've also tried scraping other programming-challenge websites, but have found that CodeChef is the easiest website to scrape that contains a large number of description-solution pairs. The only possible downside to CodeChef is that roughly half of the challenges have a goofy ~2-sentence character-driven backstory in the description that might introduce undesirable noise. Also, I originally was scraping HackerRank as well, but decided to switch to just CodeChef when I realized all the math equations in the English translation of HackerRank's problem descriptions are rendered as SVG images.
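One way the Sample Input/Sample Output could be used at test time is to run each generated program as a standalone script and compare its stdout with the expected output. A rough sketch only; the interpreter name, timeout, and whitespace handling below are illustrative assumptions, not part of any existing evaluation code.

import subprocess

def passes_sample(solution_path, sample_input, sample_output, timeout=5):
    """Run a candidate solution on the sample input and compare its stdout."""
    try:
        result = subprocess.run(
            ["python", solution_path],
            input=sample_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    # Compare outputs while ignoring trailing-whitespace differences.
    return result.stdout.strip() == sample_output.strip()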
This is a great example. The problem is definitely ambitious, so don't expect a solution to pop up in the near term. The supervised learning techniques of today are probably inadequate, and will require real advances. On the other hand, this is a good stimulant of research. How many input-output examples of this kind do you think you'd be able to collect?
Somewhere between 1000 and 5000, and I can augment the data by collecting multiple (~10 to 20?) solutions for every description.
This is a great dataset and it's worth collecting. However, it's a legitimately small dataset by modern ML standards, especially for a problem as difficult as this. Expect it to be solved only once we get transfer learning so good that we could train a system on lots of other data, so that it'll have an easy time "getting" programming from these examples. However, I'll gladly accept the pull request (with a few tweaks) once the dataset gets into a good shape.
I should set the expectation: the problem will probably remain unsolved for a significant (by ML standards) amount of time.
I just discovered an easy way to scrape CodeForces and HackerEarth as well, so I think I can collect ~9000 examples now. |
Great!
How is data collection going?
Approximately half of CodeChef (which contains ~4000) is scraped as of now. I'm working on parallelizing the scraping code I'm using because it's pretty slow in its current serial form. After CodeChef is done, I have to finish scraping HackerRank, CodeForces, and HackerEarth to get to ~9000. Here's a link to a sample of ~500 description-solution code pairs from the CodeChef scrape and the code I've been using for scraping: https://github.com/ethancaballero/description2code-data-sample The file structure/formatting is the same as that which will be used for the finished dataset. If you see any problems with the dataset sample, let me know.
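Since the scraping is network-bound, one common way to parallelize it is a thread pool. This is a sketch only; fetch_problem and problem_urls are hypothetical placeholders for whatever the real scraper does.

import concurrent.futures
import requests

def fetch_problem(url):
    # Placeholder: download one problem page. Real scraping code would also
    # parse out the description and its accepted solutions.
    return url, requests.get(url, timeout=30).text

problem_urls = []  # fill with the practice-problem URLs being scraped

# Network-bound work, so a thread pool is usually enough to hide latency.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for url, html in pool.map(fetch_problem, problem_urls):
        pass  # save the HTML / parsed pairs to disk here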
The dataset looks good. Extremely ambitious. Will push online once the data is ready.
Any updates?
I currently have 5300 scraped. I'm working on scraping the remaining sites. Also, I'm working on normalizing the formatting of the problem descriptions so that all problems have descriptions of a similar format. Here's a link to the current 5300 with a README.md describing ways to use the dataset for curriculum learning and ways to benchmark which types of algorithms one's model is capable of learning/generating:
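A small sketch of the kind of description normalization being described, assuming the goal is one canonical spelling for the section headers plus collapsed whitespace; the header names are guesses taken from the example pair earlier in the thread.

import re

# Hypothetical canonical section headers, based on the example pair above.
SECTION_ALIASES = {
    r"^\s*input\s*:?\s*$": "Input:",
    r"^\s*output\s*:?\s*$": "Output:",
    r"^\s*sample\s+input\s*:?\s*$": "Sample Input:",
    r"^\s*sample\s+output\s*:?\s*$": "Sample Output:",
}

def normalize_description(text):
    """Collapse stray whitespace and map header variants to one spelling."""
    lines = []
    for line in text.splitlines():
        line = line.rstrip()
        for pattern, canonical in SECTION_ALIASES.items():
            if re.match(pattern, line, re.IGNORECASE):
                line = canonical
                break
        lines.append(line)
    # Squeeze runs of blank lines down to a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()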
Great work, thanks for collecting this data! This problem is extremely hard for today's ML, but there's no harm putting it out there :)
Also, submit a PR to the problem description once you finish scraping the remainder of the problems.
Do you want me to link to your github username, stating that you collected the data?
Linking to my github username is fine. When I'm done with all the scraping, switch the username link to a link to a repo that I'll post with the scraping code (similar to how im2latex contains a link to Miffyli's dataset creation tools code repo).
The descriptions seem to lose newlines inside the example input and output.
Ok, working on fixing the newline error(s). Let me know if anyone finds any other errors or has any suggestions.
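If the newlines are being lost while converting the problem pages' HTML to text, one possible fix, assuming BeautifulSoup is the parser in use (which is only a guess), is to extract text with an explicit separator so block elements keep their line breaks.

from bs4 import BeautifulSoup

def extract_description(html):
    soup = BeautifulSoup(html, "html.parser")
    # get_text with separator="\n" keeps line breaks between block elements,
    # so example input/output blocks are not flattened onto one line.
    return soup.get_text(separator="\n").strip()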
What about generating Python functions from their docstrings? It should be easy to collect hundreds of thousands, if not millions, of examples.
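Harvesting (docstring, function) pairs from Python source is mostly mechanical with the standard ast module; a rough sketch (requires Python 3.9+ for ast.unparse):

import ast

def docstring_function_pairs(source):
    """Yield (docstring, function_source) pairs from a Python module's source."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                yield doc, ast.unparse(node)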
That's a cool idea. Here's a github scraper & preprocessor: We would also need to use the collected gold code as oracles to create multiple test cases to test whether (and reward the model if) generated code satisfies the functional specifications that the docstring describes. To create test cases, random input arguments would be passed to the oracle to get its returned outputs, which are then paired with the corresponding input arguments. An alternative would be to find URLs that already contain test cases corresponding to a code snippet and its docstring/description (similar to the way most programming-competition websites provide dozens of scrapable corresponding test cases for each code sample).
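A sketch of the oracle idea: call the collected gold function on randomly generated arguments and record the resulting (input, output) pairs as test cases. The argument generator below is a stand-in; real code would need per-signature input generation.

import random

def make_test_cases(oracle, arg_generator, n_cases=20):
    """Record (args, expected_output) pairs by running the gold code."""
    cases = []
    for _ in range(n_cases):
        args = arg_generator()
        try:
            expected = oracle(*args)
        except Exception:
            continue  # skip inputs the gold code itself rejects
        cases.append((args, expected))
    return cases

# Example with a toy oracle and a hypothetical argument generator.
cases = make_test_cases(
    oracle=lambda s: s.lstrip("0"),
    arg_generator=lambda: ("".join(random.choice("0123456789") for _ in range(8)),),
)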
The GitHub scraper seems well-suited for the task. However, I'm afraid that automatically generating sensible test cases for arbitrary code snippets, especially for a dynamic language like Python, would be very hard. It is in fact a research problem of its own. Maybe it is better if we initially evaluate on textual similarity (maybe up to alpha transformations or something similar) and then move to functional similarity at a later stage.
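A first cut at "textual similarity up to alpha transformations" could canonicalize identifier names before comparing token streams. A sketch using the standard tokenize module; it is naive in that it also renames builtins, which a real metric would probably want to avoid.

import io
import keyword
import token
import tokenize

def alpha_normalized_tokens(source):
    """Token stream with user identifiers renamed to VAR_0, VAR_1, ..."""
    rename = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == token.NAME and not keyword.iskeyword(tok.string):
            rename.setdefault(tok.string, f"VAR_{len(rename)}")
            out.append(rename[tok.string])
        elif tok.type in (token.NAME, token.NUMBER, token.STRING, token.OP):
            out.append(tok.string)
    return out

# Two snippets that differ only in variable names normalize identically.
assert alpha_normalized_tokens("x = x + 1\n") == alpha_normalized_tokens("y = y + 1\n")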
@Avmb That would be neat, though in terms of learning end-to-end it becomes a hard task, since docstrings are usually programmer-friendly, very brief, and tend to overfit to their length. In order to generalize enough over the test cases of outputting working code, going down the route of having code parsed as… That being said, there are some resources in this area that might be useful; here is the list:
Another neat trick that can be compared to your doc-string approach is using class-level docstrings like the one in this Django model:
from django.db import models

# === Models for Todos app ===
class ListItem(models.Model):
    """
    The ListItem class defines the main storage point for todos.
    Each todo has two fields:
        text - stores the text of the todo
        is_visible - used to control if the todo is displayed on screen
    """
    text = models.CharField(max_length=300)
    is_visible = models.BooleanField()
re: your second comment :)
@ilyasu123 This suggestion might sound like a cheat, but honestly, why not have a standardized representation of network models and training parameters that comes with a paper? This could be something like UML or XML. Then the problem reduces to creating a parser for the file format, and each framework (Keras/Theano/TensorFlow/Torch) could have its own code generator to create appropriate code. I believe this would greatly reduce the time needed to recreate models published in papers, and it could also detect if the authors are intentionally trying to avoid mentioning some hyperparameters. One could argue: what if new models come up that our markup cannot represent (like what happened with Caffe and RNNs)? That's fine; we release a new version of the markup that is backward compatible with the older versions.
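A toy version of this proposal, using a small declarative spec (JSON-like Python dicts rather than UML/XML, for brevity) and one framework-specific generator targeting Keras; every field name below is made up for illustration.

from tensorflow import keras

# Hypothetical declarative model spec a paper could ship alongside its text.
spec = {
    "input_shape": [784],
    "layers": [
        {"type": "Dense", "units": 256, "activation": "relu"},
        {"type": "Dense", "units": 10, "activation": "softmax"},
    ],
    "optimizer": "adam",
    "loss": "categorical_crossentropy",
}

def build_from_spec(spec):
    """One framework-specific 'code generator': spec -> compiled Keras model."""
    model = keras.Sequential()
    model.add(keras.Input(shape=tuple(spec["input_shape"])))
    for layer in spec["layers"]:
        if layer["type"] == "Dense":
            model.add(keras.layers.Dense(layer["units"], activation=layer["activation"]))
        else:
            raise ValueError(f"unsupported layer type: {layer['type']}")
    model.compile(optimizer=spec["optimizer"], loss=spec["loss"])
    return model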
This set is really helpful.
Given a CS/ML paper describing a model, generate the model's source code. Somewhat insane, but solvable/impactful. [This request has since been altered as described above.]