CMPUT 660 Machine Learning Applied: Software Engineering, Software Analtyics, and Software Energy Consumption (2022)

Course Outline

General Information

Was CMPUT 663.

Term: Fall, 2022, F22 Date and Time: Friday 2PM til 4:50PM
Location: Online (see eclass) Number of credits: 3 credits

See the contact information for details about your instructor.

Recording of lectures or lab sessions is permitted only with the prior written consent of the professor or if recording is part of an approved accommodation plan.

Overview

Machine Learning was built to solve many of our problems using available data. Applying machine learning to a field such as software engineering requires data. A field called mining software repositories can be used to help triage bug-reports to experts, to improve development processes and to aide debugging. This course will introduce the methods and tools (especially machine learning) of mining software repositories and artifacts used by software developers and researchers. Students will learn to extract and abstract data from software artifacts and repositories such as source code, version control systems and revisions, mailing-lists and discussions, and issue-trackers and issues. Students will also learn about techniques to analyze this data in order to infer intent, recover behaviours and software development processes from evidence, or to empirically test hypotheses about software development.

Objectives

Students will learn how to apply various forms of machine learning in order to extract and analyze information from multiple software repositories in order to reason about existing software systems and development processes, as well to validate hypotheses about software development using data extracted from existing software systems.

Pre-requisites

There are no official prerequisites but I would prefer if you had experience making software, writing software, maybe even maintaining software: a systems or software engineering course with a project
would suffice. Prior knowledge of software engineering, machine learning, statistics and natural language processing would help, but is not required.

Course Topics

Some topics we will touch in this course:

Software repositories and their associated data:

Source code: how to apply Software Metrics.
Version Control Systems
Issue trackers
Discussions: Mailing-lists and forums
Documentation
Extraction
Version Control
Issue Trackers
Mailing Lists
People
Analysis and Inference and Machine Learning
Summary Statistics
Distributions and comparisons
Relevant statistical tests
Software Metrics
Machine Learning
Natural Language Processing
Topic
Regression
Social Networking
Deep learning (brief)
Software Energy Measurement
Applications
Bug Prediction
Software Development Process Recovery
Pattern Mining
Expertise tracking
Topic Analysis
Concept Location
Coupling

Course Work and Evaluation

Note the last column, it shows collaboration models for each unit of work. These styles are better described on the CS course policy page:

https://www.cs.ualberta.ca/resources-services/policy-information/department-course-policies

Course Work	Weight
Partcipation	8%
Assignment 1	10%
Assignment 2	30%
Paper Presentations	12%
Project	30%
Project Proposal	1%
Project Presentation	6%
Project Proposal Presentation	3%

The conversion of your total numeric coursework score to a final grade will be based on interpreting the guidelines of the descriptors, letter grading system, and four-point scale as defined in section 23.4 of the University Calendar. That is, grades are assigned on what we judge to be “failure”, “minimal pass”, “poor”, “satisfactory”, “good”, or “excellent” performance in the context of the class.

We do not use a particular distribution to do the conversion, but instead use our judgment of how your score reflects mastery of the course material. That said, you generally need to be above the median to earn at least a 3.0 or B. For instance, in CMPUT 301, marked the same way, in Fall 2016 A+ was 97+, A was 95+, A- was 92+, B+ was 88+, B was 85, B- was 80, C+ was 75, C was 73, C- was 72, D+ was 71, D was 50+ and F was 0+. The numbers will change based on the perception of what is excellent, good, satisfactory and poor.

Excused Absences and Missed Work

No make-ups, alternatives, or supplementals will be given for missed course components. If a student misses a midterm exam, a zero is given.

A student who cannot write the midterm exam or complete an assignment due to incapacitating illness, severe domestic affliction, or other compelling reason can apply to the instructor (within 2 working days) for an Excused Absence. In the case of an excused absence, the weight of the missed assignments and midterm will be added to the final exam weight. In the case of an excused absence, the weight of the missed project work will be calculated from the rest of the project work. In the case of an excused absence, the weight of the missed lab work will be calculated from the rest of the lab work. Missed participation work will not be excused.

Excused absences are a privilege and not a right. Misrepresentation of facts to gain an excused absence is a breach of the Code of Student Behavior.

Missed Term Exams and Assignments:

For an excused absence where the cause is religious belief, a student must contact the instructor(s) within two weeks of the start of Fall or Winter classes to request accommodation for the term (including the final exam, where relevant). Instructors may request adequate documentation to substantiate the student request.

A student who cannot write a term examination or complete a term assignment due to incapacitating illness, severe domestic affliction or other compelling reasons can apply for deferral of the exam or assignment. In all cases, instructors may request adequate documentation to substantiate the reason for the absence at their discretion. Deferral of term work is a privilege and not a right; there is no guarantee that a deferral will be granted. Misrepresentation of Facts to gain a deferral is a serious breach of the Code of Student Behaviour.

Students must apply for deferral in written form, via email. The reasons the student is requesting the deferral must be clearly listed.

Automated Analysis

We reserve the right the automatically analyze your course work, archive your course work, and use it to analyze other course work, especially in the case of detecting plagiarism.

Coursework License

You may assume that any code examples we provide to you are public domain and free for you take without attribution unless they are licensed. Furthermore, source code associated with assignments must be licensed under an OSI-approved open-source license (GPL-2, GPL-3, Apache2, BSD, ISC), preferably a license that is GPL compatible.

Re-evaluation

Any questions or concerns about marks on course work must be brought to the attention of the TA or instructor within 5 business days after the mark has been posted. After that, we will not consider re-marking or re-evaluating the work. The re-evaluation of work may remove unwarranted marks as well as add marks that may have been missed.

Course Materials

This course does not have a required textbook. There are a number of excellent resources for this course, available as electronic books or through open access on the Web. See the course eClass site for links.

Images reproduced in lecture slides have been included under section 29 of the Copyright Act, as fair dealing for research, private study, criticism, or review. Further distribution or uses may infringe copyright on these images.

In addition to fair dealing, the Copyright Act specifically exempts projected displays by educational institutions for the purposes of education or training on the premises of the education institution.

The new copyright regulations of the University of Alberta (effective January 1, 2011), however, prohibit me from distributing complete copies of the lecture slides on the course eClass site.

Assignments that are code should be licensed for use under an OSI-approved Open Source License. This allows you, your team-members to leverage your work, as well grants us permission to try to compile and mark your work.

Student Responsibilities

Ask permission beforehand if you intend to recycle your work from another course in this course.

Regardless of the collaboration method allowed, you must always properly acknowledge the sources you used and people you worked with. Your professors reserve the right to give you an exam (oral, written, or both) to determine the degree that you participated in the making of the deliverable, and how well you understand what was submitted. For example, you may be asked to explain any code that was submitted and why you choose to write it that way. This may impact the mark that you receive for the deliverable. So, whenever you submit a deliverable, especially if you collaborate, you should be prepared for an individual inspection/walkthrough in which you explain what every line of your code, assignment, design, documentation etc. does and why you choose to write it that way.

Academic Integrity

The University of Alberta is committed to the highest standards of academic integrity and honesty. Students are expected to be familiar with these standards regarding academic honesty and to uphold the policies of the University in this respect. Students are particularly urged to familiarize themselves with the provisions of the Code of Student Behaviour (online at www.governance.ualberta.ca) and avoid any behaviour which could potentially result in suspicions of cheating, plagiarism, misrepresentation of facts and/or participation in an offence. Academic dishonesty is a serious offence and can result in suspension or expulsion from the University.”

All forms of dishonesty are unacceptable at the University. Any offence will be reported to the Associate Dean of Science who will determine the disciplinary action to be taken. Cheating, plagiarism and misrepresentation of facts are serious offences. Anyone who engages in these practices will receive at minimum a grade of zero for the exam or paper in question and no opportunity will be given to replace the grade or redistribute the weights. As well, in the Faculty of Science the sanction for cheating on any examination will include a disciplinary failing grade (NO EXCEPTIONS) and senior students should expect a period of suspension or expulsion from the University of Alberta.

Academic Integrity

The University of Alberta is committed to the highest standards of academic integrity and honesty. Students are expected to be familiar with these standards regarding academic honesty and to uphold the policies of the University in this respect. Students are particularly urged to familiarize themselves with the provisions of the Code of Student Behaviour and avoid any behaviour which could potentially result in suspicions of cheating, plagiarism, misrepresentation of facts and/or participation in an offence. Academic dishonesty is a serious offence and can result in suspension or expulsion from the University.

General Course Policies

https://www.cs.ualberta.ca/resources-services/policy-information/department-course-policies

Cell Phones

Cell phones are to be turned off during lectures, labs and seminars. Cell phones are not to be brought to exams.

In Class Laptop Use

Laptop use is permitted in class. Please remain professional in terms of laptop use. Please read:http://www.cbc.ca/news/technology/story/2013/08/14/technology-laptop-grades.html

Eligible students have both rights and responsibilities with regard to accessibility-related accommodations. Consequently, scheduling exam accommodations in accordance with SAS deadlines and procedures is essential. Please note adherence to procedures and deadlines is required for U of A to provide accommodations. Contact SAS (www.ssds.ualberta.ca) for further information.

Student Success Centre

Students who require additional help in developing strategies for better time management, study skills or examination skills should contact the Student Success Centre (2-300 Students’ Union Building).

Recording and/or Distribution of Course Materials

Audio or video recording, digital or otherwise, of lectures, labs, seminars or any other teaching environment by students is allowed only with the prior written consent of the instructor or as a part of an approved accommodation plan. Student or instructor content, digital or otherwise, created and/or used within the context of the course is to be used solely for personal study, and is not to be used or distributed for any other purpose

without prior written consent from the content author(s).

Department Policies

Most Course policies are determined by the department, please see the following page for information on CS course policies.

https://uofa.ualberta.ca/computing-science/links-and-resources/policy-information/department-course-policies

With respect to the models described in the Collaboration Policy section of

these policies, the coursework is:

“consultation” for the assignment, and
“teamwork” for the project.

There are assignment and project-specific policies on how much source code from publicly available sources may be used. Always give proper credit to the original developers in your source code and documentation.

Ask permission beforehand if you intend to recycle your work from another course in this course.

University Policies

Relevant University of Alberta policies can be found in§23.4(2) of the University Calendar.

Code of Student Behavior (Section 30 of the GFC Policy Manual)
Examinations (Section 23.5 of the University Calendar)
Electronic Communications Policy

Disclaimer

Any typographical errors in this Course Outline are subject to change and will be announced in class. The date of the final examination is set by the Registrar and takes precedence over the final examination date reported in this syllabus.

Copyright: Dr. Abram Hindle, Joshua Campbell, Department of Computing Science, Faculty of Science, University of Alberta (2017)

Last modified: Friday, 4 January 2019, 4:36 PM

Welcome

Assignment 1

1. Assignment 1
- 1.1. When: January 23
- 1.2. Who: Just you, with some consultation
- 1.3. Why: Get an introduction to sifting through Software Repositories
- 1.4. What: We will use the MSR 2019 mining challenge data and tools
- 1.5. Questions
  - 1.5.1. Briefly describe the schemas of the data held within SOTorrent?
  - 1.5.2. Size metrics
  - 1.5.3. Traceability
- 1.6. Notes
  - 1.6.1. What do you mean by summary statistics
  - 1.6.2. What format do you want?
  - 1.6.3. How?

1 Assignment 1

1.1 When: January 23

1.2 Who: Just you, with some consultation

1.3 Why: Get an introduction to sifting through Software Repositories

I want you to become comfortable with the data and the repositories that exist, in particular the data set for the assignment 2 and project. You may use some of your results from here in the project.

1.4 What: We will use the MSR 2019 mining challenge data and tools

This is a data-set is of StackOverflow posts https://2019.msrconf.org/track/msr-2019-Mining-Challenge#Call-for-Mining-Challenge-Papers

https://zenodo.org/record/2273117 You will have to convert this data to some format that you are comfortable with and then walk through it.

Scripting languages with regexes are great for this, but respectably parsing XML is good too.

You may share parsed databases. That is if you have translated the challenge data into a RDBMs like PostgreSQL or SQLite you are free to share the raw translated data (please cite each other).

1.5 Questions

1.5.1 Briefly describe the schemas of the data held within SOTorrent?

Question: What is not in the schema that you might expect from a QA systyem?
Very briefly describe how you would get this information.

1.5.2 Size metrics

Question: what is the size of the SOTorrent Data?
- number of entities
- number of authors
- summary statistics of the entities and their sizes
  - number of lines
  - number of blocks
  - number of entities
- Essentially for each dataset and database can you give me summary statistics about them.

1.5.3 Traceability

Question: How many foriegn URLs are there?
Question: How many Posts are there without answers?
Question: How many questions or answers have python snippets?
Question: What are the rare badges?
Question: Please plot questions versus answers over time.
Question: Please plot popular tags over time.

1.6 Notes

1.6.1 What do you mean by summary statistics

At the bare minimum we need Mean, Median, Standard Deviation, Variance.
- Skew, Kurtosis and other statistical moments are great too.
- A boxplot is even better.
  - Or a CDF/PDF plot

1.6.2 What format do you want?

I want a terse PDF document from you.

1.6.3 How?

Submit it to moodle and email it to me as well
Use the subject like “[CMPUT663] Assignment 1 submission” in the email.

Author: Abram Hindle

Created: 2019-01-04 Fri 16:58

Assignment 2

1 Assignment 2

1.1 Warning: Assignment 2 definition is not complete!

1.2 When: February 1 (Abstract) & February 6th (Submission) & Presentations on Feb 11 & 13

1.3 Who: Your team (2 people+)

Collaboration with other teams is to follow the consultation style of collaboration illustrated on the CS course policy website. https://www.cs.ualberta.ca/resources-services/policy-information/department-course-policies

1.4 What: A Mining Software Repositories 2019 Mining Challenge Submission

The easiest method is to take a challenge submission from the previous year and to replicate it on with this data. You can also replicate it on the updated versions of the software that the report initially used.
Or please talk to me (the instructor) if you want to do something different and propose a new angle.
The resulting paper should be submitted to the mining challenge. https://2019.msrconf.org/track/msr-2019-Mining-Challenge#Call-for-Mining-Challenge-Papers https://zenodo.org/record/2273117
You may use parsed data with other teams (such as transforming a file to a database). Note collaboration.
Deliverable is a 4 page mining challenge report and a 7 minute presentation.

1.5 Why:

To challenge you to produce a MSR challenge paper to describe findings with SOTOrrent.
To gain experience writing papers for conferences.
To motivate you to find interesting MSR methodologies and tools

1.6 Details

1.6.1 Write Up

The write up must conform to the MSR Mining Challenge guidelines as described at: https://2019.msrconf.org/track/msr-2019-Mining-Challenge#Call-for-Mining-Challenge-Papers
From the mining challenge page by The Challenge Chairs:

How to Participate in the Challenge
First, familiarize yourself with the SOTorrent dataset:
- Read our MSR 2018 paper about SOTorrent and the preprint of our mining challenge proposal, which contains exemplary queries.
- Study the project page of SOTorrent, which includes the most recent database layout and links to the online and download versions of the dataset.
- Create a new issue here in case you have problems with the dataset or want to suggest ideas for improvements.
Then, use the dataset to answer your research questions, report your findings in a four-page challenge paper (see information below), submit your abstract before February 1, 2019, and your final paper before February 6, 2019. If your paper is accepted, present your results at MSR 2019 in Montreal, Canada!
Submission (To Challenge Track)
A challenge paper should describe the results of your work by providing an introduction to the problem you address and why it is worth studying, the version of the dataset you used, the approach and tools you used, your results and their implications, and conclusions. Make sure your report highlights the contributions and the importance of your work. See also our open science policy regarding the publication of software and additional data you used for the challenge.

Challenge papers must not exceed 4 pages plus 1 additional page only with references and must conform to the MSR 2019 format and submission guidelines. Each submission will be reviewed by at least three members of the program committee. Submissions should follow the IEEE Conference Proceedings Formatting Guidelines, with title in 24pt font and full text in 10pt type. LaTEX users must use \documentclass[10pt,conference]{IEEEtran} without including the compsoc or compsocconf option.

1.6.2 Submission

An abstract must be submitted to MSR by Feb 1st.

A draft submission to be emailed to me by Midnight Feb 4 Then I will critique it for last minute edits and on Feb 6 we will submit it to the 2019 MSR Mining Challenge if I am OK with it. If you don’t submit don’t worry. If you don’t want to submit please make up a really great excuse.

Author: Abram Hindle

Created: 2019-01-07 Mon 16:50

Project Proposal

1 page project proposal. You should cite the relevant literature.

Replication – which paper
New research – or what’s your idea?

1 Project

1.1 Warning: Project definition is not complete!

1.2 When: April 10 (Final Report)

Proposal: February 13th
Proposal Presentation: Feb 25th and 27th. Be ready!
Final report: April 10
Presentations: April 8th, April 10th. Be ready!

1.3 Who: Your team (2-3 people)

4 people requires permission from instructor.
Collaboration with other teams is to follow the consultation style of collaboration illustrated on the CS course policy website. https://www.cs.ualberta.ca/resources-services/policy-information/department-course-policies

1.4 What: A project relevant to Mining Software Repositories that could be conceivably be submitted to a conference.

3 Styles of Project
- Challenge extension (instructor permission required)
  - the challenge was rushed but there is an interesting idea there.
  - most challenges do not lend to extension
- Original Research (instructor permission required)
  - do a new MSR study.
  - You should consult with the instructor if it has been done before.
    - only after googling it yourself.
- Replication with data-set time slicing
  - Hypothesis: Many MSR studies do not stable conclusions. If you evaluate the conclusions on more data from the same project, or different time-ranges of data from the same project you will get different results.
  - Importance: If shown in a majority of cases it will demonstrate that the community overclaims conclusions.
  - The easiest method is to take a MSR Paper from the previous years and to replicate it using its original data or also replicate it on the updated versions of the software that the report initially used.
  - The trick is that you’re going to evaluate the conclusions of the replicated paper across time.
    - E.g., given a paper who uses data from 2009 to 2015 are their conclusions stable or the same for 2009-2010, 2009-2012, 2009-2013, 2009-2015?
  - Limitation: don’t replicate the same paper as another team.

1 Papers

1.1 Full Papers

1.1.1 A Contextual Approach towards More Accurate Duplicate Bug Report Detection

Anahita Alipour, Abram Hindle, and Eleni Stroulia (University of Alberta, Canada)

1.1.2 An Empirical Study of End-user Programmers in the Computer Music Community

Gregory Burlet and Abram Hindle (University of Alberta, Canada) http://webdocs.cs.ualberta.ca/~gburlet/files/MSR2015_musiccoders.pdf] 1]

1.1.3 An empirical study on the evolution of design patterns Aversano, L., Canfora, G., Cerulo, L., Del Grosso, C., and Di Penta, M. . In Proceedings of the the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering

1.1.4 Analysing Software Repositories to Understand Software Evolution, Marco D’Ambros, Harald Gall, Michele Lanza, and Martin Pinzger

1.1.5 Automatic identification of bug-introducing changes by Sunghun Kim, Thomas Zimmermann, Kai Pan, E., and James Whitehead, Jr.

1.1.6 Beyond Lines of Code: Do We Need More Complexity Metrics?, by Israel Herraiz and Ahmed E. Hassan

1.1.7 BugCache for Inspections : Hit or Miss?

1.1.8 Bugs as Inconsistent Behavior: A General Approach to Inferring Errors in Systems Code.

Dawson R. Engler, David Yu Chen, Andy Chou SOSP 2001: 57-72 http://doi.acm.org/10.1145/502034.502042] 2]

1.1.9 Change Impact Graphs: Determining the Impact of Prior Code Changes German, D.M., Robles, G, and Hassan, A. , Journal of Information and Software Technology (INFSOF), Volume 51, Number 10, pages 1394–1408, Oct 2009.

1.1.10 Characteristics of Useful Code Reviews: An Empirical Study at Microsoft

Amiangshu Bosu, Michaela Greiler and Christian Bird (University of Alabama, United States, Microsoft Research, United States) http://www.amiangshu.com/papers/CodeReview-MSR-2015.pdf] 3]

1.1.11 Clones: What is that smell?

Foyzur Rahman, Christian Bird, Premkumar T. Devanbu MSR 2010:72-81 http://dx.doi.org/10.1109/MSR.2010.5463343] 4]

1.1.12 Copy-Paste as a Principled Engineering Tool, by Michael Godfrey and Cory Kapser + ‘Cloning Considered Harmful’ Considered Harmful, by Cory J. Kapser and Michael W. Godfrey. Proc. of the 2006 Working Conference on Reverse Engineering (WCRE-06), 23-28 October, Benevento, Italy.

1.1.13 Cross versus Within-Company Cost Estimation Studies: A Systematic Review.

Barbara A. Kitchenham, Emilia Mendes, Guilherme Horta Travassos IEEE Trans. Software Eng. 33(5): 316-329 (2007) http://dx.doi.org/10.1109/TSE.2007.1001] 5]

1.1.14 Evidence-Based Failure Prediction, by Nachi Nagappan and Thomas Ball + A Validation of Object-Oriented Design Metrics as Quality Indicators, by Victor R. Basili, Lionel C. Briand, and Walcelio L. Melo, IEEE Trans. on Software Engineering, 22(10, October 1996.

1.1.15 Gerrit Software Code Review Data from Android

Murtuza Mukadam, Christian Bird, and Peter C. Rigby (Concordia University, Canada; Microsoft Research, USA)

1.1.16 GreenMiner: A Hardware Based Mining Software Repositories Software Energy Consumption Framework

Abram Hindle, Alex Wilson, Kent Rasmussen, Jed Barlow, Joshua Campbell and Stephen Romansky (University of Alberta, Canada) http://webdocs.cs.ualberta.ca/~hindle1/2014/gm.pdf] 6]

1.1.17 Hipikat: recommending pertinent software development artifacts, by Davor Cubranic and Gail C. Murphy

1.1.18 How Well do Experienced Software Developers Predict Software Change?, by Mikael Lindvall and Kristian Sandahl, Journal of Systems and Software, 43(1), Jan 1998.

1.1.19 Identifying Changed Source Code Lines from Version Repositories by Gerardo Canfora, Luigi Cerulo, Massimiliano Di Penta. Proceedings of the Fourth International Workshop on Mining Software Repositories, 2007 (best paper award).

1.1.20 Identifying reasons for software change using historic databases by Audris Mockus and Larry G. Votta

1.1.21 Improving the Effectiveness of Test Suite Through Mining Historical Data

Jeff Anderson, Saeed Salem and Hyunsook Do (Microsoft, USA; North Dakota State University, USA)

1.1.22 Macro-level software evolution: a case study of a large software compilation Jesus M. Gonzalez-Barahona, Gregorio Robles, et al Journal of mpirical Software Engineering, Volume 14, Number 3 / June, 2009. Extended version of best paper award.

1.1.23 Measuring the Progress of Projects Using the Time Dependence of Code Changes, by Omar Alam, Bram Adams and Ahmed E. Hassan.

1.1.24 Mining Android App Usages for Generating Actionable GUI-based Execution Scenarios

Mario Linares-Vásquez, Martin White, Carlos Eduardo Bernal Cardenas, Kevin Moran and Denys Poshyvanyk (The College of William and Mary, United States) http://www.cs.wm.edu/~denys/pubs/MSR'15-MonkeyLab-CRC.pdf] 7]

Christian Bird, Alex Gourley, Premkumar T. Devanbu, Michael Gertz, Anand Swaminathan MSR 2006:137-143 http://doi.acm.org/10.1145/1137983.1138016] 8]

1.1.26 Mining Energy-Aware Commits

Irineu Moura, Gustavo Pinto, Felipe Ebert and Fernando Castor (Federal University of Pernambuco, Brazil) http://gustavopinto.org/lost+found/msr2015.pdf] 9]

1.1.27 Mining Energy-Greedy API Usage Patterns in Android Apps: an Empirical Study

Mario Linares-Vásquez, Gabriele Bavota, Carlos Eduardo Bernal Cardenas, Rocco Oliveto, Massimiliano Di Penta and Denys Poshyvanyk (College of William and Mary, USA; University of Sannio, Italy; Universidad Nacional de Colombia, Colombia; University of Molise, Italy) http://www.cs.wm.edu/~denys/pubs/MSR14-Android-energy-CRC.pdf] 10]

1.1.28 Mining Questions About Software Energy Consumption

Gustavo Pinto, Fernando Castor and Yu David Liu (Federal University of Pernambuco, Brazil; SUNY Binghamton, USA) http://gustavopinto.github.io/lost+found/msr.pdf] 11]

1.1.30 Mining version histories to guide software changes, Thomas Zimmermann, Peter Weißgerber, Stephan Diehl, Andreas Zeller

1.1.31 Novel applications of Machine Learning in Testing by Lionel Briand

1.1.32 Open Borders? Immigration in Open Source Projects.

Christian Bird, Alex Gourley, Premkumar T. Devanbu, Anand Swaminathan, Greta Hsu MSR 2007:6 http://doi.ieeecomputersociety.org/10.1109/MSR.2007.23] 12]

1.1.33 Scalable statistical bug isolation by Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan

1.1.35 Software Bertillonage: Finding the provenance of an entity, by Julius Davies, Abram J. Hindle, Daniel M. German, Michael W. Godfrey.

1.1.36 Studying Developers Copy and Paste Behavior

Tarek Ahmed, Weiyi Shang and Ahmed Hassan (Queen’s University, Canada)

1.1.37 Syntax Errors Just Aren’t Natural: Improving Error Reporting with Language Models

Joshua Campbell, Abram Hindle and José Nelson Amaral (University of Alberta, Canada) http://webdocs.cs.ualberta.ca/~joshua2/syntax.pdf] 13]

1.1.38 The Evidence for Design Patterns, by Walter Tichy + Design Pattern Detection Using Similarity Scoring, N. Tsantalis, A. Chatzigeorgiou, G. Stephanides, and S. T. Halkidis, IEEE Trans. on Software Engineering, November 2006.

1.1.39 The Impact of Code Review Coverage and Code Review Participation on Software Quality: A Case Study of the Qt, VTK, and ITK Projects

Shane Mcintosh, Yasutaka Kamei, Bram Adams and Ahmed E. Hassan (Queen’s University, Canada; Kyushu University, Japan; Polytechnique Montréal, Canada) http://sail.cs.queensu.ca/publications/pubs/msr2014-mcintosh.pdf] 14]

1.1.40 The Past, Present, and Future of Software Evolution, Michael W. Godfrey and Daniel M. German. Invited paper in Proc. of Frontiers of Software Maintenace track at the 2008 IEEE Intl. Conf. on Software Maintenance (ICSM-08), October 2008, Beijing, China.

1.1.41 The Promises and Perils of Mining Git. In Proceedings of the Sixth Working Conference on Mining Software Repositories (MSR 09), Vancouver, Canada, 2009. Christian Bird, Peter C. Rigby, Earl T. Barr, David J. Hamilton, Daniel M. German, Prem Devanbu.

Christian Bird, Peter C. Rigby, Earl T. Barr, David J. Hamilton, Daniel M. Germán, Premkumar T. Devanbu MSR 2009:1-10 http://dx.doi.org/10.1109/MSR.2009.5069475] 15]

1.1.42 The Promises and Perils of Mining GitHub

Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel German and Daniela Damian (University of Victoria, Canada; Delft University of Technology, Netherlands)

1.1.43 The secret life of bugs: Going past the errors and omissions in software repositories, by Jorge Aranda and Gina Venolia, Proc. of the 2009 Intl. Conf. on Software Engineering (ICSE-09), Vancouver, May 2009.

1.1.44 The Top Ten List: Dynamic Fault Prediction, by Ahmed E. Hassan and Richard C. Holt, Proc. of the 2005 IEEE Intl. Conf. on Software Maintenance (ICSM-05), Budapest, Hungary, Sept. 2005.

1.1.45 Toward Deep Learning Software Repositories

Martin White, Christopher Vendome, Mario Linares-Vásquez and Denys Poshyvanyk (College of William and Mary, United States) http://www.cs.wm.edu/~denys/pubs/MSR'15-DeepLearning-CRC.pdf 16

1.1.46 Towards Building a Universal Defect Prediction Model

Feng Zhang, Audris Mockus, Iman Keivanloo and Ying Zou (Queen’s University, Canada; Avaya Labs Research, USA) http://post.queensu.ca/~zouy/files/msr2014.pdf 17

1.1.47 Understanding the impact of code and process metrics on post-release defects: A case study on the Eclipse project, Emad Shihab, Zhen Ming Jiang, Walid M. Ibrahim, Bram Adams, Ahmed E. Hassan, Proc. of the 2010 ACM-IEEE Intl. Symposium on Empirical Software Engineering and Measurement (ESEM-10), Bolzano-Bolzen, Italy, Sept 2010.

1.1.48 Using information fragments to answer the questions developers ask.

Thomas Fritz, Gail C. Murphy ICSE 2010: 175-184 http://doi.acm.org/10.1145/1806799.1806828 18

1.1.49 Using Software Dependencies and Churn Metrics to Predict Field Failures: An Empirical Case Study Nachiappan Nagappan, Thomas Ball

1.1.50 Visualizing software changes, Stephen G. Eick, Todd L. Graves, Alan F. Karr, Audris Mockus, and Paul Schuster

1.1.51 What is the Gist? Understanding the Use of Public Gists on GitHub

Weiliang Wang, Germán Poo-Caamaño, Evan Wilde and Daniel German (University of Victoria, Canada)

1.1.52 What’s Hot and What’s Not: Windowing Developer Topic Analysis? by Abram J. Hindle, Michael W. Godfrey, Richard C. Holt.

1.1.53 When do changes induce fixes?: Jacek Śliwerski International Max Planck Research School, Saarbrücken, Germany

Thomas Zimmermann Saarland University, Saarbrücken, Germany Andreas Zeller Saarland University, Saarbrücken, Germany

1.1.54 Who Should Fix This Bug?, by John Anvik, Lyndon Hiew and Gail C. Murphy, Proc. of the 2006 Intl. Conference on Software Engineering (ICSE-06), Shanghai, May 2006.

1.1.55 Will My Patch Make It? And How Fast?: Case Study on the Linux Kernel

Yujuan Jiang, Bram Adams, and Daniel M. German (Polytechnique Montréal, Canada; University of Victoria, Canada) http://mcis.polymtl.ca/publications/2013/msr_jojo.pdf 19

1.1.56 Yesterday’s Weather: Guiding Early Reverse Engineering Efforts by Summarizing the Evolution of Changes, Tudor Girba, Stephane Ducasse, Michele Lanza, Proc. 20th IEEE Int’l Conference on Software Maintenance (ICSM'04), September 2004, pp. 40-49.

1.1.57 An Evaluation of Open-Source Software Microbenchmark Suites for Continuous Performance Assessment

Christoph Laaber and Philipp Leitner. MSR 2018

1.1.58 SOTorrent: Reconstructing and Analyzing the Evolution Stack Overflow Posts

Sebastian Baltes, Lorik Dumani, Christoph Treude and Stephan Diehl. MSR 2018

1.1.59 Data-Driven Search-based Software Engineering

Vivek Nair, Amritanshu Agrawal, Jianfeng Chen, Wei Fu, George Mathew, Tim Menzies, Leandro Minku, Markus Wagner and Zhe Yu. MSR 2018

1.1.60 CLEVER: Combining Code Metrics with Clone Detection for Just-In-Time Fault Prevention and Resolution in Large Industrial Projects

Mathieu Nayrolles and Abdelwahab Hamou-Lhadj. MSR 2018

Mario Linares-Vasquez, Gabriele Bavota and Camilo Escobar-Velasquez Universidad de los Andes, Università della Svizzera italiana (USI) MSR 2017

1.1.62 Extracting Code Segments and Their Descriptions from Research Articles preprint

Preetha Chatterjee, Benjamin Gause, Hunter Hedinger and Lori Pollock University of Delaware Structure and Evolution of Package Dependency Networks preprint Riivo Kikas, Georgios Gousios, Marlon Dumas and Dietmar Pfahl University of Tartu, Delft University of Technology MSR 2017

1.1.63 The Impact Of Using Regression Models to Build Defect Classifiers preprint

Gopi Krishnan Rajbahadur, Shaowei Wang, Yasutaka Kamei and Ahmed E. Hassan Queen’s University, Kyushu University MSR 2017

1.1.64 Choosing an NLP Library for Analyzing Software Documentation: A Systematic Literature Review and a Series of Experiments preprint

Fouad Nasser A. Al Omran and Christoph Treude University of Adelaide MSR 2017

1.1.65 GreenOracle: Estimating Software Energy Consumption with Energy Measurement Corpora

Shaiful Chowdhury and Abram Hindle University of Alberta MSR 2016

1.1.66 Mining Performance Regression Inducing Code Changes in Evolving Software

Qi Luo, Denys Poshyvanyk and Mark Grechanik The College of William and Mary, University of Illinois at Chicago MSR 2016

1.1.67 An Empirical Study on the Practice of Maintaining Object-Relational Mapping Code in Java Systems

Tse-Hsun Chen, Weiyi Shang, Jinqiu Yang, Ahmed E. Hassan, Michael W. Godfrey, Mohamed Nasser and Parminder Flora Queen’s University, Concordia University, University of Waterloo, BlackBerry

1.1.68 Software Ingredients: Detection of Third-party Component Reuse in Java Software Release

Takashi Ishio, Raula Gaikovina Kula, Tetsuya Kanda, Daniel German and Katsuro Inoue Osaka University, University of Victoria

1.1.69 A Look at the Dynamics of the JavaScript Package Ecosystem

Erik Wittern, Philippe Suter and Shriram Rajagopalan IBM T.J. Watson Research Center

1.1.70 A Large-Scale Study On Repetitiveness, Containment, and Composability of Routines in Source Code

Anh Nguyen, Hoan Nguyen and Tien Nguyen Iowa State University

1.1.71 A survey of machine learning for big code and naturalness

M Allamanis, ET Barr, P Devanbu, C Sutton ACM Computing Surveys (CSUR) 51 (4), 81

1.1.72 Are deep neural networks the best choice for modeling source code?

VJ Hellendoorn, P Devanbu Proceedings of the 2017 11th Joint Meeting on Foundations of Software …

1.1.73 Towards Accurate Duplicate Bug Retrieval Using Deep Learning Techniques

DOI: 10.1109/ICSME.2017.69 Conference: 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME)

1.2 Challenge Papers

1.2.1 Data A Repository with 44 Years of Unix Evolution

Diomidis Spinellis (Athens University of Economics and Business, Greece)

http://www.dmst.aueb.gr/dds/pubs/conf/2015-MSR-Unix-History/html/Spi15c.html 20

1.2.2 A comparative exploration of FreeBSD bug lifetimes.

Gargi Bougie, Christoph Treude, Daniel M. Germán, Margaret-Anne D. Storey

1.2.3 A newbie’s guide to eclipse APIs.

Reid Holmes, Robert J. Walker

1.2.4 A Tale of Two Browsers.

Olga Baysal, Ian J. Davis, Michael W. Godfrey

1.2.5 An initial study of the growth of eclipse defects.

Hongyu Zhang

1.2.6 Analyzing the evolution of eclipse plugins.

Michel Wermelinger, Yijun Yu

1.2.7 Apples vs. oranges?: an exploration of the challenges of comparing the source code of two software systems.

Daniel M. Germán, Julius Davies

1.2.8 Assessment of issue handling efficiency.

Bart Luijten, Joost Visser, Andy Zaidman

1.2.9 Author entropy vs. file size in the gnome suite of applications.

Jason R. Casebolt, Jonathan L. Krein, Alexander C. MacLean, Charles D. Knutson, Daniel P. Delorey

1.2.10 Cloning and copying between GNOME projects.

Jens Krinke, Nicolas Gold, Yue Jia, David Binkley

1.2.11 Co-Evolution of Project Documentation and Popularity within Github

Karan Aggarwal, Abram Hindle and Eleni Stroulia (University of Alberta, Canada) http://webdocs.cs.ualberta.ca/~hindle1/2016/msr14-Documentation.pdf 21

1.2.12 Do comments explain codes adequately?: investigation by text filtering.

Yukinao Hirata, Osamu Mizuno

1.2.13 Evaluating process quality in GNOME based on change request data.

Holger Schackmann, Horst Lichter

1.2.14 Finding file clones in FreeBSD Ports Collection.

Yusuke Sasaki, Tetsuo Yamamoto, Yasuhiro Hayase, Katsuro Inoue

1.2.15 Forecasting the Number of Changes in Eclipse Using Time Series Analysis.

Israel Herraiz, Jesús M. González-Barahona, Gregorio Robles

Haroon Malik, Peng Zhao and Michael Godfrey (University of Waterloo, Canada)

1.2.17 Impact of the Creation of the Mozilla Foundation in the Activity of Developers.

Jesús M. González-Barahona, Gregorio Robles, Israel Herraiz

1.2.18 Local and Global Recency Weighting Approach to Bug Prediction.

Hemant Joshi, Chuanlei Zhang, Srini Ramaswamy, Coskun Bayrak

1.2.19 Mining Eclipse Developer Contributions via Author-Topic Models.

Erik Linstead, Paul Rigor, Sushil Krishna Bajracharya, Cristina Videira Lopes, Pierre Baldi

1.2.20 Mining security changes in FreeBSD.

Andreas Mauczka, Christian Schanes, Florian Fankhauser, Mario Bernhart, Thomas Grechenig

1.2.21 Mining StackOverflow to Filter out Off-topic IRC Discussion

Shaiful Chowdhury and Abram Hindle (University of Alberta, Canada) http://webdocs.cs.ualberta.ca/~hindle1/2015/shaiful-mining_so.pdf 22

1.2.22 Mining the coherence of GNOME bug reports with statistical topic models.

Erik Linstead, Pierre Baldi

1.2.23 On the use of Internet Relay Chat (IRC) meetings by developers of the GNOME GTK+ project.

Emad Shihab, Zhen Ming Jiang, Ahmed E. Hassan

1.2.24 Perspectives on bugs in the Debian bug tracking system.

Julius Davies, Hanyu Zhang, Lucas Nussbaum, Daniel M. Germán

1.2.25 Predicting Defects and Changes with Import Relations.

Adrian Schröter

1.2.26 Predicting Eclipse Bug Lifetimes.

Lucas D. Panjer

1.2.27 Security and Emotion: Sentiment Analysis of Security Discussions on GitHub

Daniel Pletea, Bogdan Vasilescu and Alexander Serebrenik (Eindhoven University of Technology, Netherlands)

1.2.28 Summarizing developer work history using time series segmentation: challenge report.

Harvey P. Siy, Parvathi Chundi, Mahadevan Subramaniam

1.2.29 System compatibility analysis of Eclipse and Netbeans based on bug data.

Xinlei (Oscar) Wang, Eilwoo Baik, Premkumar T. Devanbu

1.2.30 Towards a simplification of the bug report form in eclipse.

Israel Herraiz, Daniel M. Germán, Jesús M. González-Barahona, Gregorio Robles

1.2.31 Visualizing Gnome with the Small Project Observatory.

Mircea Lungu, Jacopo Malnati, Michele Lanza

1.2.32 What topics do Firefox and Chrome contributors discuss?

Mario Luca Bernardi, Carmine Sementa, Quirino Zagarese, Damiano Distante, Massimiliano Di Penta

1.2.33 Which Non-functional Requirements do Developers Focus on? An Empirical Study on Stack Overflow using Topic Analysis

Jie Zou, Ling Xu, Weikang Guo, Meng Yan, Dan Yang and Xiaohong Zhang (Chongqing University, China)

1.2.34 On the Differences between Unit and Integration Testing in the TravisTorrent Dataset preprint

Manuel Gerardo Orellana Cordero, Gulsher Laghari, Alessandro Murgia and Serge Demeyer University of Antwerp

1.2.35 Cost-effective Build Outcome Prediction Using Cascaded Classifiers

Ansong Ni and Ming Li Nanjing University

1.2.36 Sentiment Analysis of Travis CI Builds

Rodrigo Souza and Bruno Silva Salvador University - UNIFACS, Federal University of Bahia

1.2.37 A Time Series Analysis of TravisTorrent: To Everything There is a Season

Abigail Atchison, Christina Berardi, Natalie Best, Elizabeth Stevens and Erik Linstead Chapman University

1.2.38 On the Interplay between Non-Functional Requirements and Builds on Continuous Integration

Klérisson Paixão, Crícia Z. Felício, Fernanda Delfim and Marcelo Maia Instituto Federal do Triângulo Mineiro, Universidade Federal de Uberlândia, UFU

1.2.39 The Impact of the Adoption of Continuous Integration on Developer Attraction and Retention

Yusaira Khan, Yash Gupta, Keheliya Gallaba and Shane McIntosh McGill University

1.2.40 The Hidden Cost of Code Completion: Understanding the Impact of the Recommendation-list Length on its Efficiency

Ariel Rodriguez, Fumiya Tanaka, Yasutaka Kamei.

1.2.41 Do Practitioners Use Autocompletion Features Differently Than Non-Practitioners?

John Wilkie, Ziad Al Halabi, Alperen Karaoglu, Jiafeng Liao, George Ndungu, Chaiyong Ragkhitwetsagul, Matheus Paixão, Jens Krinke.

1.2.42 Who’s this? Developer identification using IDE event data

Agnieszka Ciborowska, Nicholas A. Kraft and Kostadin Damevski.

1.2.43 Revisiting “Programmers’ Build Errors” in the Visual Studio Context: A Replication Study using IDE Interaction Traces

Mauricio Soto and Claire Le Goues.

1.2.44 Common Statement Kind Changes to Inform Automatic Program Repair

Christopher Bellman, Ahmad Seet, Olga Baysal.

1.2.45 Examining Programmer Practices for Locally Handling Exceptions

Mary Beth Kery, Claire Le Goues and Brad Myers Carnegie Mellon University

1.2.46 QualBoa: Reusability-aware Recommendations of Source Code Components

Themistoklis Diamantopoulos, Klearchos Thomopoulos and Andreas Symeonidis Aristotle University of Thessaloniki

1.2.47 The Dispersion of Build Maintenance Activity across Maven Lifecycle Phases

Casimir Desarmeaux, Andrea Pecatikov and Shane McIntosh McGill University

1.2.48 The Relationship between Commit Message Detail and Defect Proneness in Java Projects on GitHub

Jacob Barnett, Charles Gathuru, Luke Soldano and Shane McIntosh McGill University

1.2.49 Analysis of Exception Handling Patterns in Java Projects: An Empirical Study

Suman Nakshatri, Maithri Hegde and Sahithi Thandra University of Waterloo

1.2.50 Judging a commit by its cover: Correlating commit message entropy with build status on Travis-CI

Eddie Antonio Santos and Abram Hindle University of Alberta

1.2.51 Characterizing Energy-Aware Software Projects: Are They Different?

Shaiful Chowdhury and Abram Hindle University of Alberta

1.2.52 A deeper look into bug fixes: Patterns, replacements, deletions, and additions

Mauricio Soto, Ferdian Thung, Chu-Pan Wong, Claire Le Goues and David Lo Carnegie Mellon University, Singapore Management University

1.2.53 How Developers Use Exception Handling in Java?

Muhammad Asaduzzaman, Muhammad Ahasanuzzaman, Chanchal K. Roy and Kevin Schneider University of Saskatchewan, University of Dhaka

1.2.54 Analyzing Developer Sentiment in Commit Logs

Vinayak Sinha, Alina Lazar and Bonita Sharif Youngstown State University

CMPUT660: Machine Learning Applied: Software Engineering, Software Analtyics, and Software Energy Consumption (2022)

2022/07/18

CMPUT 660 Machine Learning Applied: Software Engineering, Software Analtyics, and Software Energy Consumption (2022)

Course Outline

General Information

Overview

Objectives

Pre-requisites

Course Topics

Course Work and Evaluation

Excused Absences and Missed Work

Missed Term Exams and Assignments:

Automated Analysis

Coursework License

Re-evaluation

Course Materials

Student Responsibilities

Academic Integrity

Academic Integrity

General Course Policies

Cell Phones

In Class Laptop Use

Students Eligible For Accessibility-Related Accommodations

Student Success Centre

Recording and/or Distribution of Course Materials

Department Policies

With respect to the models described in the Collaboration Policy section of

University Policies

Disclaimer

Assignment 1

Table of Contents

1 Assignment 1

1.1 When: January 23

1.2 Who: Just you, with some consultation

1.3 Why: Get an introduction to sifting through Software Repositories

1.4 What: We will use the MSR 2019 mining challenge data and tools

1.5 Questions

1.5.1 Briefly describe the schemas of the data held within SOTorrent?

1.5.2 Size metrics

1.5.3 Traceability

1.6 Notes

1.6.1 What do you mean by summary statistics

1.6.2 What format do you want?

1.6.3 How?

Assignment 2

1 Assignment 2

1.1 Warning: Assignment 2 definition is not complete!

1.2 When: February 1 (Abstract) & February 6th (Submission) & Presentations on Feb 11 & 13

1.3 Who: Your team (2 people+)

1.4 What: A Mining Software Repositories 2019 Mining Challenge Submission

1.5 Why:

1.6 Details

1.6.1 Write Up

1.6.2 Submission

Project Proposal

Project Proposal

1 Project

1.1 Warning: Project definition is not complete!

1.2 When: April 10 (Final Report)

1.3 Who: Your team (2-3 people)

1.4 What: A project relevant to Mining Software Repositories that could be conceivably be submitted to a conference.

1 Papers

1.1 Full Papers

1.1.1 A Contextual Approach towards More Accurate Duplicate Bug Report Detection

1.1.2 An Empirical Study of End-user Programmers in the Computer Music Community

1.1.3 An empirical study on the evolution of design patterns Aversano, L., Canfora, G., Cerulo, L., Del Grosso, C., and Di Penta, M. . In Proceedings of the the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering

1.1.4 Analysing Software Repositories to Understand Software Evolution, Marco D’Ambros, Harald Gall, Michele Lanza, and Martin Pinzger

1.1.5 Automatic identification of bug-introducing changes by Sunghun Kim, Thomas Zimmermann, Kai Pan, E., and James Whitehead, Jr.

1.1.6 Beyond Lines of Code: Do We Need More Complexity Metrics?, by Israel Herraiz and Ahmed E. Hassan

1.1.7 BugCache for Inspections : Hit or Miss?

1.1.8 Bugs as Inconsistent Behavior: A General Approach to Inferring Errors in Systems Code.

1.1.9 Change Impact Graphs: Determining the Impact of Prior Code Changes German, D.M., Robles, G, and Hassan, A. , Journal of Information and Software Technology (INFSOF), Volume 51, Number 10, pages 1394–1408, Oct 2009.

1.1.10 Characteristics of Useful Code Reviews: An Empirical Study at Microsoft

1.1.11 Clones: What is that smell?

1.1.12 Copy-Paste as a Principled Engineering Tool, by Michael Godfrey and Cory Kapser + ‘Cloning Considered Harmful’ Considered Harmful, by Cory J. Kapser and Michael W. Godfrey. Proc. of the 2006 Working Conference on Reverse Engineering (WCRE-06), 23-28 October, Benevento, Italy.

1.1.13 Cross versus Within-Company Cost Estimation Studies: A Systematic Review.

1.1.14 Evidence-Based Failure Prediction, by Nachi Nagappan and Thomas Ball + A Validation of Object-Oriented Design Metrics as Quality Indicators, by Victor R. Basili, Lionel C. Briand, and Walcelio L. Melo, IEEE Trans. on Software Engineering, 22(10, October 1996.

1.1.15 Gerrit Software Code Review Data from Android

1.1.16 GreenMiner: A Hardware Based Mining Software Repositories Software Energy Consumption Framework

1.1.17 Hipikat: recommending pertinent software development artifacts, by Davor Cubranic and Gail C. Murphy