CSER 2021 Fall, hosted by University of Calgary and University of Toronto
Date: The event will be online from 9 a.m. to 4 p.m. (Eastern time) on November 21st (Sunday).
Registration: Please register or indicate your intention to participate by completing this Google doc form.
The first CSER meeting was in 1996, so 2021 is the 25th anniversary!
Over 235 people are registered, from 9 Canadian provinces, 5 other countries, and 45 universities and colleges.
CSER meetings seek to motivate engaging discussions among faculty, graduate students and industry participants about software engineering research in a broad sense, as well as the intersection between software engineering, other areas of computer science, and other disciplines.
Abstract: Brian Randell described software engineering as “the multi-person development of multi-version programs”. David Parnas has expressed that this “pithy phrase implies everything that differentiates software engineering from other programming”. How does current software engineering research compare against this definition? Is there currently too much focus on research into problems and techniques more associated with programming than software engineering? Are there opportunities to use Randell’s description of software engineering to guide the community to new research directions? In this talk, I will explore these questions and discuss how a consideration of the development streams used by multiple individuals to produce multiple versions of software opens up new avenues for impactful software engineering research.
Bio: Gail C. Murphy is a Professor of Computer Science and Vice-President Research and Innovation at the University of British Columbia. She is a Fellow of the Royal Society of Canada and a Fellow of the Association for Computing Machinery (ACM), as well as co-founder of Tasktop Technologies Incorporated.
After completing her B.Sc. at the University of Alberta in 1987, she worked for five years as a software engineer in the Lower Mainland. She later pursued graduate studies in computer science at the University of Washington, earning first a M.Sc. (1994) and then a Ph.D. (1996) before joining UBC.
Dr. Murphy’s research focuses on improving the productivity of software developers and knowledge workers by providing the necessary tools to identify, manage and coordinate the information that matters most for their work. She also maintains an active research group with post-doctoral and graduate students.
Abstract: AI software is software. We are software engineers. We tame software. Where there is danger, we add safeguards. Where there is confusion, we add clarity. So, it is high time for SE people to add clarity. We live in an age when algorithms rule. But who rules the algorithms? Surely, it must be software engineers. Algorithms make choices. Choices have consequences. Many choices are ethical, and not choosing is unethical. Algorithms, once written, have to be wrangled. Do you know how to reason with your algorithms? If not, then please come to this talk.
This talk will discuss multi-objective semi-supervised learning and optimization methods for reducing complexity to simplicity (with applications to software fairness testing, requirements engineering, configuring cloud computing, software defect and effort estimation, predicting issue close time and technical debt, recognizing software security issues, interactive search-based SE, etc.).
Bio: Tim Menzies (IEEE Fellow, Ph.D., UNSW, 1995) is a full Professor in CS at North Carolina State University, where he teaches software engineering, automated software engineering, and foundations of software science. He is the director of the RAISE lab (real world AI for SE), which explores SE, data mining, AI, search-based SE, and open access science.
He is the author of over 280 refereed publications and editor of three recent books summarizing the state of the art in software analytics. In his career, he has been a lead researcher on projects for NSF, NIJ, DoD, NASA, and USDA (funding totalling over 13 million dollars), as well as joint research work with private companies. From 2002 to 2004, he was the software engineering research chair at NASA's software Independent Verification and Validation Facility.
Prof. Menzies is the co-founder of the PROMISE conference series devoted to reproducible experiments in software engineering (http://tiny.cc/seacraft). He is editor-in-chief of the Automated Software Engineering Journal, an associate editor of Communications of the ACM, IEEE Software, and the Software Quality Journal, and an Advisory Board member for the Empirical Software Engineering Journal. In 2015, he served as co-chair for the ICSE'15 NIER track. He has served as co-general chair of ICSME'16 and co-PC-chair of SSBSE'17 and ASE'12. For more, see his vita, his list of publications, or his home page.
The world is going mobile. Android has surpassed its counterparts and become the most popular operating system all over the world. The openness and fast evolution of Android are the key factors that lead to its rapid growth. However, these characteristics have also created a notorious problem: Android fragmentation. There are numerous different Android device models and operating system versions in use, making it difficult for app developers to exhaustively test their apps on these devices. An Android app can behave differently on different device models, inducing various compatibility issues that reduce software reliability.
Such fragmentation-induced compatibility issues (compatibility issues for short) have been well-recognized as a prominent problem in Android app development. In this talk, I will introduce my work helping developers identify compatibility issues in Android apps. I have conducted the first empirical study that analyzed 220 real-world compatibility issues down to the source code level. The study revealed that compatibility issues share common triggering and fixing patterns. Based on the identified patterns, I further proposed techniques that can (a) automatically mine the knowledge of compatibility issues from large app corpora, and (b) automatically detect compatibility issues in Android apps. Cumulatively, my proposed techniques have detected 185 true warnings of compatibility issues, and 89 of them have already been fixed by the original app developers. In this talk, I will unveil the key characteristics of compatibility issues and introduce how my proposed techniques automatically detect compatibility issues in the wild. [Related papers: 1, 2, 3, 4, 5]
Code smells are indicators of quality issues in a software system and are classified based on their granularity, scope, and impact. In this talk, I would like to provide an overview of code smells and present my recent explorations to detect them and to understand various aspects associated with them. Also, I would like to use this opportunity to explore potential synergies with fellow software engineering researchers from Canada.
I will talk about 5 dimensions of Equity, Diversity and Inclusion (EDI) in software engineering research. My thoughts may be controversial, and I welcome debate. I will use the term 'equity-challenged' for people who find themselves the subject of bias (systemic or case-specific). My focus will be on three of the 5 dimensions. The first is EDI in peer review, where I will explain why double-blind reviewing might have unintended side-effects for certain equity-challenged researchers, so should be encouraged but not mandated. The second is EDI in the software engineering products we create (research software, databases), including ensuring maximum availability through what I call fully-elaborated open source (i.e. with metadata, tests, documentation, comprehensive examples as well as low-footprint, low-dependency execution and analysis capability, not just pure code or data) so that people with fewer resources can learn about and build on work. The third is the need to permanently mandate conferences to allow online attendance, even after the pandemic. Now that we know how to do it well and cheaply, we can ensure the inclusion of everybody who cannot or prefers not to travel for health, mental-health, family, financial, political, environmental or other reasons, by making all conferences hybrid. Sponsors such as the ACM and IEEE should mandate this. The other dimensions include EDI in selection of students, and selection of research subjects, but I will not discuss these to save time to focus on the above. [Slides: CSERFall2021TalkLethbridge-EDI-PeerRev-HybridConf.pdf]
Software engineering is a decision-centric discipline. Decisions have to be made about people, processes, and artifacts at all stages of the life-cycle. Optimization is an established discipline aiming to find the best solutions under given constraints and objectives. Optimizing software engineering seems to be tempting and is the content of many publications (including some of my own), but how much does it matter in the presence of endless uncertainties? The talk explores the problem and tries to propose answers.
Regression testing is an important quality assurance practice widely adopted today. Optimizing regression testing is important. Test parallelization has the potential to leverage the power of multi-core architectures to accelerate regression testing. Unfortunately, it is not possible to directly use parallelization options available in build systems and testing frameworks without introducing test flakiness. Tests can fail because of data races or broken test dependencies. Although it is possible to safely circumvent those problems with the assistance of an automated tool to collect test dependencies (e.g., PRADET), the cost of that solution is prohibitive, defeating the purpose of test parallelization. This paper proposes PASTE, an approach to automatically parallelize the execution of test suites. PASTE alternates parallel and sequential execution of test cases and test classes to circumvent provoked test failures. PASTE does not provide the safety guarantee that flakiness will not be manifested, but our results indicate that the strategy is sufficient to avoid it. We evaluated PASTE on 25 projects mined from GitHub using objective selection criteria. Results show that (i) PASTE could circumvent flakiness introduced with parallelization in all projects that manifested it and (ii) 52% of the projects benefited from test parallelization with a median speedup of 1.59x (best: 2.28x, average: 1.47x, worst: 0.93x). [Paper]
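The idea of alternating parallel and sequential execution can be illustrated with a minimal Python sketch. This is not the PASTE implementation: the function name `run_with_fallback`, the use of a thread pool, and the policy of rerunning every parallel failure one at a time are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_fallback(tests, workers=4):
    """Run tests in parallel, then rerun any failures sequentially.

    `tests` maps a test name to a zero-argument callable that raises on
    failure. Hypothetical helper for illustration, not part of PASTE.
    """
    def attempt(item):
        name, fn = item
        try:
            fn()
            return name, True
        except Exception:
            return name, False

    # Phase 1: optimistic parallel execution of the whole suite.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        first_pass = dict(pool.map(attempt, tests.items()))

    # Phase 2: a parallel failure may be caused by interference between
    # tests, so each failing test is retried on its own.
    results = {}
    for name, passed in first_pass.items():
        if passed:
            results[name] = "passed"
        else:
            _, ok = attempt((name, tests[name]))
            results[name] = "passed-sequentially" if ok else "failed"
    return results
```

A test that fails only in the parallel phase is reported as `passed-sequentially`, while a test that fails in both phases is reported as `failed`. As with the approach described above, a sketch like this gives no guarantee that flakiness is avoided.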
To better utilize cloud computing resources, an interesting proposal involves using the available machines for distributed computing in an ad-hoc manner during regular and off-peak hours. Our proposed framework, named AHA, implements resource availability considering (RAC) task speculation within the latest version of Apache Hadoop. The resource availability history of each worker node is stored locally and considered during scheduling of MapReduce (MR) workloads. In addition, a fuzzy-rule based self-tuning solution is also proposed to alleviate the need for manual tuning regarding resource availability consideration. Our preliminary evaluations indicate that AHA is able to decrease execution time by up to 20.2% for certain MR workloads in ad-hoc cloud settings. Overall, the approach shows potential in addressing this real-world issue as our results are also on average upper-bounded by Apache Hadoop with respect to workload execution time in a simulated ad-hoc cloud environment.
Software bugs are pervasive in modern software. As software is integrated into increasingly many aspects of our lives, these bugs have increasingly severe consequences, both from a security (e.g. Cloudbleed, Heartbleed, Shellshock) and cost standpoint. Fuzz testing or simply fuzzing refers to a set of techniques that automatically find bug-triggering inputs by sending many random-looking inputs to the program under test. In this talk, I will discuss how, by identifying core under-generalized components of modern fuzzing algorithms, and building algorithms that generalize or tune these components, I have expanded the application domains of fuzzing. First, by building a general feedback-directed fuzzing algorithm, I enabled fuzzing to consistently find performance and resource consumption errors. Second, by developing techniques to maintain structure during mutation, I brought fuzzing exploration to "deeper" program states. Third, by decoupling the user-facing abstraction of random input generators from their sampling distributions, I built faster validity fuzzing and even tackled program synthesis. Finally, I will discuss the key research problems that must be tackled to make fuzzing readily-available and useful to all developers. [Slides]
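For readers new to the area, the basic loop behind feedback-directed fuzzing can be sketched in a few lines of Python. This is a generic, textbook-style illustration and not the speaker's algorithms: `toy_target`, its branch ids, the single-byte mutation, and the use of branch coverage as the only feedback signal are all assumptions.

```python
import random


def toy_target(data: bytes) -> set:
    """Hypothetical program under test; returns the ids of the branches it ran."""
    covered = {0}
    if data[:1] == b"F":
        covered.add(1)
        if data[1:2] == b"U":
            covered.add(2)
            if data[2:3] == b"Z":
                raise ValueError("bug reached")
    return covered


def fuzz(target, seed_input=b"AAA", iterations=500_000, rng_seed=0):
    """Mutate saved inputs at random, keeping those that reach new branches."""
    rng = random.Random(rng_seed)
    corpus = [seed_input]   # inputs worth mutating further
    seen = set()            # branch ids observed so far
    for _ in range(iterations):
        parent = rng.choice(corpus)
        child = bytearray(parent)
        # Mutation: overwrite one randomly chosen byte.
        child[rng.randrange(len(child))] = rng.randrange(256)
        child = bytes(child)
        try:
            coverage = target(child)
        except ValueError:
            return child    # bug-triggering input found
        # Feedback: save the input only if it exercised something new.
        if not coverage <= seen:
            seen |= coverage
            corpus.append(child)
    return None             # budget exhausted without a crash
```

Because inputs that reach a new branch are saved and mutated again, the search can build the crashing input one byte at a time instead of having to guess all three bytes at once.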
Nowadays, software developers create entire systems by reusing existing source code glued together with additional custom code to suit their specific needs. These platforms that contain a massive number of reusable source code components are referred to as software ecosystems. A good example of software ecosystems is the node package manager platform that contains over a million reusable packages. In this talk, I will highlight how developers rely on software ecosystems to reuse code and the main challenges. Then, I will present some of the solutions and actionable techniques that I built to help developers cope with the problems they face when using software ecosystems. Finally, I will give an overview of my future research agenda on “the evolution of software ecosystems at a large scale”.
From traditional client-based software to the more modern software as a service, software must shift and evolve to meet the needs of its users. However, creating and modifying software involves potential pitfalls such as bugs, documentation mistakes, version migrations, unknown usage constraints, configuration errors, performance and security issues, and more. Software developers are themselves software users, yet their usage data, a rich source of information, is often unexploited. This talk will cover my prior work and future vision for analyzing and leveraging how software is used by developers. I will concentrate on issues like API migration, API workarounds, code review recommenders, and build configuration management and how we can improve the software engineering cycle by incorporating feedback loops that leverage developer usage data.
The use of third-party packages can create thousands of dependencies within software projects that need to be continuously maintained by developers. The problem is that not all of the dependencies are released to production (i.e., dependencies that are used and not used in code that reaches the end product), so it is difficult for developers to know which ones to prioritize. Hence, the goal of our study is to quantify how many dependencies are shipped and not shipped to production. In other words, we want to investigate how often some dependencies can be prioritized in maintenance-related tasks over the ones that are not as critical. Our preliminary results show that only 49% of the projects in our dataset have dependencies released in production. This has major implications on dependency maintenance since a significant amount of the developers’ valuable time is allocated for managing dependencies that do not affect the end product. Our findings also have implications for security because developers tend to pay less attention to their dependencies when they have too many maintenance tasks. This can open the door for vulnerabilities and bugs to be left without proper fixes.
Chatbots are envisioned to dramatically change the future of Software Engineering, allowing practitioners to chat and inquire about their software projects and interact with different services using natural language. At the heart of every chatbot is a Natural Language Understanding (NLU) component that enables the chatbot to understand natural language input. Recently, many NLU platforms were provided to serve as an off-the-shelf NLU component for chatbots. However, selecting the best NLU for Software Engineering chatbots remains an open challenge. Therefore, in this paper, we evaluate four of the most commonly used NLUs, namely IBM Watson, Google Dialogflow, Rasa, and Microsoft LUIS, to shed light on which NLU should be used in Software Engineering based chatbots. Specifically, we examine the NLUs' performance in classifying intents, confidence score stability, and extracting entities. To evaluate the NLUs, we use two datasets that reflect two common tasks performed by Software Engineering practitioners: (1) the task of chatting with the chatbot to ask questions about software repositories, and (2) the task of asking development questions on Q&A forums (e.g., Stack Overflow). According to our findings, IBM Watson is the best performing NLU when considering the three aspects (intent classification, confidence scores, and entity extraction). However, the results from each individual aspect show that, in intent classification, IBM Watson performs the best with an F1-measure > 84%, but in confidence scores, Rasa comes out on top with a median confidence score higher than 0.91. Our results also show that all NLUs, except for Dialogflow, generally provide trustworthy confidence scores. For entity extraction, Microsoft LUIS and IBM Watson outperform other NLUs in the two SE tasks. Our results provide guidance to software engineering practitioners when deciding which NLU to use in their chatbots. [Paper][Dataset]
With the recent increase in the computational power of modern mobile devices, machine learning-based heavy tasks such as face detection and speech recognition are now integral parts of such devices. This requires frameworks to execute machine learning models (e.g., Deep Neural Networks) on mobile devices. Although there exist studies on the accuracy and performance of these frameworks, the quality of on-device deep learning frameworks, in terms of their robustness, has not been systematically studied yet. In this paper, we empirically compare two on-device deep learning frameworks with three adversarial attacks on three different model architectures. We also use both the quantized and unquantized variants for each architecture. The results show that, in general, neither of the deep learning frameworks is better than the other in terms of robustness, and there is not a significant difference between the PC and mobile frameworks either. However, in cases such as the Boundary attack, the mobile version is more robust than the PC version. In addition, quantization improves robustness in all cases when moving from PC to mobile.
Software debugging and program repair are among the most time-consuming and labor-intensive tasks in software engineering and would benefit greatly from automation. In this paper, we propose a novel automated program repair approach based on CodeBERT, which is a transformer-based neural architecture pre-trained on a large corpus of source code. We fine-tune our model on the ManySStuBs4J small and large datasets to automatically generate the fix code. The results show that our technique accurately predicts the fixed code implemented by the developers in 19-72% of the cases, depending on the type of dataset, in less than a second per bug. We also observe that our method can generate varied-length fixes (short and long) and can fix different types of bugs, even if only a few instances of those types of bugs exist in the training dataset. [Paper]
Diversity and inclusion are necessary prerequisites when it comes to shaping technological innovation for the benefit of society as a whole. A common indicator of diversity consideration is the representation of different social groups among software engineering (SE) researchers, developers, and students. However, this does not necessarily entail that diversity is considered in the SE research itself. In our novel study, we examine how diversity is embedded in SE research, particularly research that involves participant studies. To this end, we have selected 25 research papers from the ICSE 2020 technical track. In a content analytical approach, we identified specific words and phrases that referred to the characterization of participants, defined coding categories, and developed a coding scheme. To identify how SE researchers report the various diversity categories and how these diversity categories play a role in their research, we divided the research papers into sections, which allowed for a better analysis of the reporting of diversity across the sample. Our preliminary results show that SE research often focuses on reporting professional diversity data such as occupation and work experience, overlooking social diversity data such as gender or location of the participants. With this novel study we hope to shed light on a new approach to tackling the diversity and inclusion crisis in the SE world.
Software testing is an essential software quality assurance practice. Testing helps expose faults earlier, allowing developers to repair the code and reduce future maintenance costs. However, repairing (i.e., making failing tests pass) may not always be done immediately. Bugs may require multiple rounds of repairs and even remain unfixed due to the difficulty of bug-fixing tasks. To help test maintenance, along with code comments, the majority of testing frameworks (e.g., JUnit and TestNG) have also introduced annotations such as @Ignore to disable failing tests temporarily. Although disabling tests may help alleviate maintenance difficulties, it may also introduce technical debt. With the faster release of applications in modern software development, disabling tests may become the salvation for many developers to meet project deliverables. In the end, disabled tests may become outdated and a source of technical debt, harming long-term maintenance. Despite its harmful implications, there is little empirical research evidence on the prevalence, evolution, and maintenance of disabling tests in practice. To fill this gap, we perform the first empirical study on test disabling practice. We develop a tool to mine 122K commits and detect 3,111 changes that disable tests from 15 open-source Java systems. Our main findings are: (1) Test disabling changes are 19% more common than regular test refactorings, such as renames and type changes. (2) Our life-cycle analysis shows that 41% of disabled tests are never brought back to evaluate software quality, and most disabled tests stay disabled for several years. (3) We unveil the motivations behind test disabling practice and the associated technical debt by manually studying evolutions of 349 unique disabled tests, achieving a 95% confidence level and a 5% confidence interval. Finally, we present some actionable implications for researchers and developers. [Paper]
To analyze large-scale data efficiently, developers have created various big data processing frameworks (e.g., Apache Spark). These big data processing frameworks provide abstractions to developers so that they can focus on implementing the data analysis logic. In traditional software systems, developers leverage logging to monitor applications and record intermediate states to assist workload understanding and issue diagnosis. However, due to the abstraction and the peculiarity of big data frameworks, there is currently no effective monitoring approach for big data applications. In this paper, we first manually study 1,000 randomly sampled Spark-related questions on Stack Overflow to study their root causes and the type of information, if recorded, that can assist developers with monitoring and diagnosis. Then, we design an approach, DPLOG, which assists developers with monitoring Spark applications. DPLOG leverages statistical sampling to minimize performance overhead and provides intermediate information and hint/warning messages for each data processing step of a chained method pipeline. We evaluate DPLOG on six benchmarking programs and find that DPLOG has a relatively small overhead (i.e., less than 10% increase in response time when processing 5GB data) compared to without using DPLOG, and reduces the overhead by over 500% compared to the baseline. Our user study with 20 developers shows that DPLOG can reduce the needed time to debug big data applications by 63% and the participants give DPLOG an average of 4.85/5 for its usefulness. The idea of DPLOG may be applied to other big data processing frameworks, and our study sheds light on future research opportunities in assisting developers with monitoring big data applications. [Paper]
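The general idea of sampling-based logging for a processing step can be sketched in plain Python. This is not DPLOG and does not use Spark: the helper `sampled_step`, its parameters, and the log format are hypothetical, chosen only to show how logging a random fraction of records keeps overhead low while still exposing intermediate values.

```python
import random


def sampled_step(name, fn, records, rate=0.01, rng=None, log=print):
    """Apply `fn` to every record, logging only a random sample of them.

    Hypothetical helper for illustration; `rate` is the fraction of
    records whose input and output are logged.
    """
    rng = rng or random.Random()
    results = []
    for record in records:
        result = fn(record)
        # Log this record with probability `rate`.
        if rng.random() < rate:
            log(f"[{name}] {record!r} -> {result!r}")
        results.append(result)
    return results
```

Steps can be chained by passing the output of one call as the `records` of the next, for example `sampled_step("filter", check, sampled_step("parse", parse, lines))`, where `check`, `parse`, and `lines` stand for the user's own functions and data.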
What is the sentiment like within newcomer spaces of the Linux Kernel development community? Answering this question is crucial for the Linux Kernel development community overall, because sentiment can affect the efficiency and synergy within any software development team, especially in the Linux community, which is mainly made up of volunteers with no financial incentive to stay. This undergraduate study analyzed question-and-answer data from both Stack Overflow and the Newbie Linux Kernel Mailing list. It attempts to capture the true sentiment within the newbie community and to outline possible implications of a sentiment analysis within the Linux Kernel development community.
This literature review and evidence-based study of collaborative technology investigates the features of the technology used during COVID-19 and how those features have enabled organizations to discard/forget and preserve/remember aspects of office procedures, hierarchies, and accountability in Scrum/Agile organizational cultures. [Research Blog]
Most automated software testing tasks, such as test case generation, selection, and prioritization, require an abstract representation of test cases. The representation can vary from the set of code statements covered by the test case, to the sequences of visited states in a UML state machine, modeling the behavior of the class under test. Given that specification-based testing is often not cost-effective, most current automated testing techniques rely on the code-driven representation of test cases. We hypothesize that execution traces of the test cases, as an abstraction of their behavior, can be leveraged to better represent test cases compared to code-driven methods.
Automated unit test case generation is one of the main software development challenges. Existing approaches such as EvoSuite or Randoop use search-based algorithms. These approaches use coverage as a cost function. Usually, the output test set is hard for developers to read and understand. In this study, we try to use machine translation for test case generation. We train different sequential models such as transformers, BERT and CodeBERT on a large set of methods and developer-written test cases. The main goal of this research is to leverage the large publicly available data and generate better test cases compared to available automated test case generation platforms.
We try to explain and interpret blackbox DL models used for software engineering tasks. [Group Webpage]
UniSankey is a data visualization tool designed specifically to communicate the characteristics, connections, and results in industry-academic collaborations. UniSankey can help researchers and decision-makers in industry and academia better understand how the characteristics of industry-academic collaborations, such as funding and researcher experience, affect the project’s results and commercialization. In this poster, we present an extension to the UniSankey visualization tool that makes UniSankey more broadly applicable. This poster will describe the original UniSankey design, the extension, and future work.
CSER does not publish proceedings in order to keep presentations informal and speculative.
Ottawa is (as you all know) the capital of Canada. Carleton University is a short taxi-ride from the airport, and is on the O-Train Line 2, providing quick access to downtown hotels. There are also many hotels near the airport.
Registration is online via this Google doc form (CSER Fall 2021 registration link). It will only be active until about 8 a.m. (EST) on November 14th, 2021, since the organizers will be preparing to start the meeting and will not be able to respond to new registrations after that time. You will be asked to provide your name, email address, position, and university or company. Registered attendees will be sent a Zoom link to attend the meeting. Feel free to invite colleagues, but ask them to register here; please don't send them the Zoom link you were given.