arxiv:1803.09371

StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow

Published on Mar 26, 2018

Authors:

Ziyu Yao, Daniel S. Weld

Abstract

A novel Bi-View Hierarchical Neural Network is proposed to systematically mine high-quality question-code pairs from Stack Overflow, achieving better performance than heuristic methods and facilitating the development of data-hungry models in natural language and programming language association.

AI-generated summary

Stack Overflow (SO) has been a great source of natural language questions and their code solutions (i.e., question-code pairs), which are critical for many tasks including code retrieval and annotation. In most existing research, question-code pairs were collected heuristically and tend to be of low quality. In this paper, we investigate a new problem: systematically mining question-code pairs from Stack Overflow, in contrast to collecting them heuristically. The problem is formulated as predicting whether or not a code snippet is a standalone solution to a question. We propose a novel Bi-View Hierarchical Neural Network that captures both the programming content and the textual context of a code snippet (i.e., two views) to make a prediction. On two manually annotated datasets, one in the Python domain and one in the SQL domain, our framework substantially outperforms heuristic methods, with at least 15% higher F1 and accuracy. Furthermore, we present StaQC (Stack Overflow Question-Code pairs), the largest dataset to date, with ~148K Python and ~120K SQL question-code pairs automatically mined from SO using our framework. Through various case studies, we demonstrate that StaQC can greatly help develop data-hungry models for associating natural language with programming language.
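To make the task formulation concrete, the following is a minimal sketch of the binary decision described in the abstract (is a given code snippet a standalone solution to its question?) and of how F1 and accuracy could be computed for a heuristic baseline. It does not reproduce the Bi-View Hierarchical Neural Network. The `Candidate` type, the two heuristics `select_first` and `select_all`, and the toy labels are all illustrative assumptions, not taken from the paper or from StaQC.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Candidate:
    """One code snippet from an answer post, paired with its question.

    All fields are hypothetical; they only illustrate the task setup.
    """
    question: str
    code: str
    position: int      # index of the snippet within its answer post
    is_solution: bool  # manual label: standalone solution or not


def select_first(c: Candidate) -> bool:
    """Assumed heuristic: accept only the first snippet of an answer."""
    return c.position == 0


def select_all(c: Candidate) -> bool:
    """Assumed heuristic: accept every snippet."""
    return True


def evaluate(predict: Callable[[Candidate], bool],
             data: List[Candidate]) -> Dict[str, float]:
    """Compute precision, recall, F1 and accuracy for a binary predictor."""
    tp = fp = fn = tn = 0
    for c in data:
        pred = predict(c)
        if pred and c.is_solution:
            tp += 1
        elif pred and not c.is_solution:
            fp += 1
        elif not pred and c.is_solution:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(data) if data else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy, hand-made examples (not real StaQC data).
TOY = [
    Candidate("How to reverse a list?", "xs[::-1]", 0, True),
    Candidate("How to reverse a list?", "list(reversed(xs))", 1, True),
    Candidate("How to read a file?", "import os", 0, False),
    Candidate("How to read a file?", "print('done')", 1, False),
]

if __name__ == "__main__":
    for name, heuristic in [("select_first", select_first),
                            ("select_all", select_all)]:
        print(name, evaluate(heuristic, TOY))
```

A learned model would slot in as another `predict` function with the same signature, which keeps the comparison against heuristics on equal footing.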

