Welcome to CSER'23 Spring!

CSER 2023 Spring will be held on the campus of McGill University, Montréal on June 07, as a CS-CAN|INFO-CAN co-located event.

CSER meetings seek to motivate engaging discussions among faculty, graduate students and industry participants about software engineering research in a broad sense, as well as the intersection between software engineering, other areas of computer science, and other disciplines.

Advantages of co-locating with CS-CAN|INFO-CAN events

This year, the Spring CSER Workshop will be held alongside other Canadian computer science communities at a co-located conference. This arrangement has several advantages: you can register for just the CSER day or for multiple days (to attend other conferences or workshops), connect with colleagues from other CS communities at lunch and breaks while attending CSER, see our poster session in a space shared with other communities’ posters, and attend the awards banquet on the evening of the CSER workshop (note: the awards banquet is not included in the daily registration fee).


CSER brings together (primarily) Canadian-based software engineering researchers, including faculty, graduate students, industry participants, and any others who are interested.


Keynote by Martin Robillard (McGill University)

Sustainable Software Artifacts

Abstract: Best practices for software engineering research call for the release of software artifacts: prototype tools, scripts, or other types of executable research outcomes. At the same time, documenting and maintaining public-facing software requires time and expertise. Drawing on my experience maintaining an academic open-source project for over 8 years, and recent research on turnover-induced knowledge loss, I will explore why we should keep our research tools alive and what this can entail in practice.

Bio: Martin Robillard is a Professor in the School of Computer Science at McGill University. He received his Ph.D. in Computer Science from the University of British Columbia. His research is in the area of software engineering, with an emphasis on the human-centric aspects of software development. His current focus is on documentation generation, test suite quality, and information privacy. Martin is the author of the book Introduction to Software Design with Java and the architect and maintainer of the JetUML software modeling tool. He has served as program co-chair for both of the flagship conferences in software engineering (FSE 2012 and ICSE 2017), as well as ICSME 2015, and is currently on the editorial board of IEEE Transactions on Software Engineering.

Keynote by Sarra Habchi (Ubisoft)

Quality Engineering in the Video Game Industry: Challenges and Opportunities

Abstract: Developing high-budget video games is a complex process that demands significant effort and meticulous organization. These games consist of an extensive amount of code, estimated to be in the tens of millions of lines, distributed across hundreds of thousands of files, and requiring tens of thousands of code changes. The development of modern AAA games poses additional challenges due to the need to manage multidisciplinary teams, including developers, artists, and sound engineers, which is markedly different from typical code-only projects. In addition, these games must be designed to be compatible and scalable across multiple platforms, including various consoles, PCs, and mobile devices. In this environment, unique engineering challenges emerge, requiring novel research and development solutions from our community. In this talk, I will provide a brief overview of predicaments encountered by game development teams, covering topics such as performance analysis, automated testing, and cross-artifact build systems. Then, I will present some of the initiatives conducted by our research group to address these challenges. Finally, I will conclude with a discussion of open challenges and potential research opportunities.

Bio: Sarra Habchi is a research and development scientist at Ubisoft. She received her PhD in computer science from Inria with a thesis on the identification and tracking of mobile-specific code smells, a work that was awarded the accessit of Gilles Kahn 2020. Her research interests lie primarily in the area of software quality and reliability, with an emphasis on software testing, continuous integration systems, and automation tools. Sarra is currently leading the software engineering research group at Ubisoft La Forge, which focuses on quality engineering challenges in the video game industry.

Important Dates

Presentation abstract submission May 15th
Notification May 19th
Registration May 31st (Please register early to help our planning)

CSER Spring 2023 Program

Slides are available for some talks (in the Detailed Talk Information section below).

Vote for the best posters: https://forms.gle/nuofgzr82QDRCKkB7

Detailed Talk Information

9:10 AM [Keynote] Martin Robillard (McGill U): Sustainable Software Artifacts
Abstract: Best practices for software engineering research call for the release of software artifacts: prototype tools, scripts, or other types of executable research outcomes. At the same time, documenting and maintaining public-facing software requires time and expertise. Drawing on my experience maintaining an academic open-source project for over 8 years, and recent research on turnover-induced knowledge loss, I will explore why we should keep our research tools alive and what this can entail in practice.
9:50 AM [Poster Lightning Talks]
Misheelt Munkhjargal (Concordia U): A Deep Learning Approach to Binary Code Similarity Detection for Obfuscated Java Android Applications
Abstract: Binary Code Similarity Detection (BCSD) has many useful applications, such as intellectual property protection, vulnerability search, and malware detection. An effective and modern approach to this problem is using a deep learning model to train and detect similarities from the assembly code of the binary. Existing approaches typically focus on binary code of languages that compile to assembly, such as C and C++. There is currently not enough research conducted on binary similarity detection in bytecode. In particular, we aim to focus our binary code similarity detection model on Java bytecode in Android applications. In our research, we introduce a deep learning approach to detect clones in Java binary code in Android applications. Our approach is effective for both normally built binary code and obfuscated binary code in Java. Additionally, we investigate which instruction embedding method performs the best for our Binary Code Similarity Detection model to understand the potential performance improvement by different embeddings.
Zerui Wang (Concordia U): Design Explanation Microservices and Provenance: A Case Study of Explaining Cloud AI Service
Abstract: This paper designs an Explainable AI (XAI) service that aims to provide feature contribution explanations for cloud AI services. Cloud AI services have broad usage to develop domain-specific applications but lack transparency and explainability. The AI services provide general model evaluation metrics to demonstrate learning precision. However, the AI models behind the cloud services remain opaque on how the prediction is produced. Post hoc XAI methods based on feature contribution approximate the correlation between input features and prediction to gain explainability. The challenge is how to devise XAI that allows the approximation without unfolding the network structure of the learning model. We consider XAI operations to be at the same stage as learning performance evaluation. We advocate that XAI operations be accessible as services to enable the consolidation of the XAI operations into the lifecycle of AI development. We propose an XAI as a service that provides feature contribution explanations for cloud AI services where the learning models remain "black-box". The XAI service is designed using a microservice architecture to integrate AI models and XAI methods. Additionally, provenance metadata is collected from XAI operations to provide traceability of the XAI service. In our case studies, we offer insights into the influential features that contribute to various published cloud AI services. Furthermore, we evaluate the explanation results and performance of the XAI services. We employ deep learning models as approximations to cloud AI services and incorporate a variety of CAM-based XAI techniques. We showcase the results that offer explanations for cloud AI services, while also assessing these services using the consistency evaluation metrics. Finally, we present the performance and cost associated with the XAI services.
Divya Madhav Kamath (Queen's U): On using lightweight dynamic batching techniques to avoid redundant CI practices
Abstract: Continuous Integration (CI) is a widely adopted process in software engineering that can merge developers’ code together and perform essential builds and tests on the software code. CI, however, is also an expensive process, due to the large number of commits that are pushed in by developers on a daily basis. To reduce the cost of CI, many companies adopt techniques to combine builds and tests for several commits at a time by using batching algorithms. Existing approaches range from straightforward to elaborate: from combining several commits into a single build to predicting build outcomes in advance to avoid unnecessary builds. While the simpler heuristics are limited in flexibility and complex heuristics are heavy in processing, in this paper we propose a middle ground with a lightweight dynamic batching technique. In contrast to an existing dynamic batching technique that relies on intricate computations, we propose a simpler, more flexible technique that can update batch sizes based on the outcome of the previous batch build. In this paper, we study 286,848 from 50 open-source projects using TravisCI. From our study, we find that our lightweight batching technique can perform similarly to existing complicated dynamic batching techniques and can outperform static batching techniques by a median of 10.01%.
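To make the idea of updating batch sizes from the previous build outcome concrete, here is a minimal Python sketch. It is not the authors' algorithm: the doubling and halving rule, the size bounds, and the names run_dynamic_batches and build_passes are all assumptions chosen for illustration. The sketch also leaves out the step of locating the culprit commit inside a failed batch.

```python
from typing import Callable, List, Sequence


def run_dynamic_batches(
    commits: Sequence[str],
    build_passes: Callable[[Sequence[str]], bool],
    start_size: int = 4,
    min_size: int = 1,
    max_size: int = 16,
) -> List[int]:
    """Walk through `commits` in batches and return the batch sizes used.

    Assumed update rule (hypothetical, for illustration only):
    double the batch size after a passing batch build,
    halve it after a failing one.
    """
    sizes_used: List[int] = []
    size = start_size
    i = 0
    while i < len(commits):
        batch = commits[i : i + size]
        sizes_used.append(len(batch))
        if build_passes(batch):
            # Previous batch passed: be more aggressive next time.
            size = min(size * 2, max_size)
        else:
            # Previous batch failed: fall back to smaller batches.
            size = max(size // 2, min_size)
        i += len(batch)
    return sizes_used


# Toy usage: ten commits, one of which ("c5") breaks the build.
commits = [f"c{n}" for n in range(10)]
bad = {"c5"}
sizes = run_dynamic_batches(
    commits, lambda batch: not any(c in bad for c in batch), start_size=2
)
```

In this toy run the batch grows after the first passing build and shrinks again once the batch containing c5 fails, which is the kind of feedback loop the abstract describes.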
Harsh Patel (Queen's U): Post-Deployment Model Recycling
Abstract: ML model selection is performed before model training during the development phase. Once a model is deployed, the same model is retrained on a scheduled interval to make sure the model reflects current data and environment trends to prevent issues such as dataset shifts. This study applies the idea of post-deployment techniques to model selection. It proposes a post-deployment model recycling technique to address problems such as model retraining costs and dataset shifts, which are often described as maintenance challenges of ML models. We look at the performance of the post-deployment model recycling approach on a bug prediction task using a Just-In-Time defect prediction model trained on datasets from multiple Apache projects. We evaluate the proposed technique against the traditional approach on different model evaluation metrics, evaluate different clustering techniques for model selection, evaluate the impact of different methods (online vs. scratch) for training clustering algorithms on model selection, and perform a sensitivity analysis on different parameters.
Edward Wang (MIT): Work-in-Progress: SMT-powered agile, correct-by-construction layout synthesis
Abstract: Modern integrated circuit (chip) design relies extensively on software CAD systems, yet existing software engineering methodologies for building CAD systems pose significant reliability, security, and correctness challenges. This work-in-progress presents a novel approach to designing CAD systems based on formal methods, constraint/declarative programming as opposed to traditional hand-optimized C/C++ implementations and statistical/heuristic-based approaches. We hypothesize that this approach can yield significant benefits in software engineering effort, reliability/security, and circuit quality. To validate our hypothesis, we have built a compiler from netlist to SMT and used it to perform place-and-route, a critical and large step in CAD for chip design. While there are significant performance challenges with this approach, preliminary results indicate that our method could potentially yield significant benefits.
Akshat Malik (Queen's U): Bug Prediction Using Graph-Anonymised JIT metrics
Abstract: As the usage of models for understanding different organisational practices becomes prevalent, it is important that data for these models is shared across different organisations to build a common understanding of software systems. One such area which is in need of more data is Just-In-Time (JIT) defect prediction. Organisations are hesitant to share data with one another as models can leak information. To prevent data leaks we propose anonymising the data these models are trained on. In this study, we use graph-anonymising techniques to anonymise the graph representation of the JIT defect prediction metrics. Graphs contain information about different entities, but also how they are linked together. Graph-anonymising techniques are an effective way to hide information about the nodes and the way they are linked to one another. Using this approach, we perform the study on four different repositories (OpenStack, Apache Flink, Apache Ignite and Apache Cassandra) using four different graph-anonymising techniques (Random Add/Delete, Random Switch, K-DA Anonymisation and Generalisation). To measure the privacy of the data we use two privacy metrics: Increased Privacy Ratio (IPR) and Found Percentage (FP). In the case study, we find that graph-anonymising techniques are effective in providing anonymity to the JIT metrics. We find that JIT metrics for the three datasets are able to gain a privacy score of more than 90% and more than 80% by one or more techniques. Even small amounts of graph anonymisation are able to provide a large amount of privacy to the metrics. This privacy does not always come at the cost of performance. We find that it is possible for all datasets to gain high privacy while staying within 90% of baseline performance. In three datasets, we even observe a small amount of performance improvement when graph-anonymisation techniques are used.
When we look closer at the model's training, we understand that different anonymisation techniques change the model's behaviour differently. Random Add/Delete and Random Switch change the features' importance as they are applied in higher amounts, thereby causing the model to perform poorly. On the other hand, Generalisation and K-DA anonymity are able to retain feature importance even when the data gain privacy, indicating that they do not alter the system's performance greatly.
Mithila Sivakumar, Kimya Khakzad Shahandashti, Oluwafemi Odu, Alvine Boaye Belle (York U): To assure or not to assure, that must be the question!
Abstract: Prior to their deployment, systems developed in safety-critical domains (e.g., avionics, automotive) require a strong justification that they will effectively support the critical requirements (e.g., safety, security, reliability) for which they were designed. Thus, it is usually mandatory that the design authority develops compelling assurance cases to support that justification and allow regulatory bodies to certify such systems. In such contexts, detecting assurance deficits, relying on patterns to improve the structure of assurance cases, improving existing assurance case notations, and (semi-)automating the generation of assurance cases are key to developing compelling assurance cases and fostering consumer acceptance.
Jaskirat Singh (Queen's U): Deployment Strategies for Edge AI
Abstract: The rise of AI use cases catered towards the Edge, where devices have limited computation power and storage capabilities, motivates the need for optimized AI deployment strategies in production environments to provide faster inference, optimal accuracy, and enhanced data privacy. This study aims to empirically assess the impact of different Edge AI deployment strategies on the inference time and accuracy for Computer Vision tasks. In this paper, we conduct Inference experiments with 1) Traditional Deployment Strategies: Mobile, Edge, and Cloud Deployment and 2) (Hybrid) Edge AI Deployment Strategies: Model Partitioning across mobile-edge, edge-cloud, and mobile-cloud nodes of the Edge environment, Model Quantization, and Model Early Exiting for investigating the optimal inference approaches from the point of view of AI developers.
Nader Trabelsi, Cristiano Politowski, Ghizlane ElBoussaidi (ÉTS): Event-Driven Architecture: Between Theory and Practice
Abstract: Due to their applicability in different domains, Internet of Things software solutions have gained significant attention in many industries. Suitable architectures are needed in order to satisfy the needs of such applications. Event-Driven Architectures (EDA) ease the exchange of data between IoT devices while making each one of them completely decoupled from the others. EDA has been described in many studies and some industrial solutions are proposed to support such a paradigm. However, no study has evaluated the gap between the developed theory and the proposed industrial solutions. In order to study this gap, we selected and analyzed three industrial platforms according to what is described in the literature. Our results show that there is a lack of a unified definition of the elements of an EDA in academic studies. We also found that event-based communication is implemented differently in each platform, serving different use cases, which indicates the need to establish a commonly accepted method for the design process of this type of architecture.
10:00 AM [New Faculty Talk] Caroline Lemieux (UBC): Expanding the Reach of Fuzz Testing (Slides)
Abstract: Software bugs are pervasive in modern software. As software is integrated into increasingly many aspects of our lives, these bugs have increasingly severe consequences, both from a security (e.g. Cloudbleed, Heartbleed, Shellshock) and cost standpoint. Fuzzing refers to a set of techniques that automatically find bug-triggering inputs by sending many random-looking inputs to the program under test. In this talk, I will discuss how, by identifying core under-generalized components of modern fuzzing algorithms, and building algorithms that generalize or tune these components, I have expanded the application domains of fuzzing. Finally, I will discuss the problems _around_ fuzzing that must be solved in order to increase its impact.
11:00 AM [Poster Lightning Talks]
Vahid Majdinasab (Polytechnique): Mutation Testing of Deep Reinforcement Learning Based on Real Faults
Abstract: Testing Deep Learning (DL) systems is a complex task as they do not behave like traditional systems would, notably because of their stochastic nature. Nonetheless, being able to adapt existing testing techniques such as Mutation Testing (MT) to DL settings would greatly improve their potential verifiability. While some efforts have been made to extend MT to the Supervised Learning paradigm, little work has gone into extending it to Reinforcement Learning (RL) which is also an important component of the DL ecosystem but behaves very differently from SL. This paper builds on the existing approach of MT in order to propose a framework, RLMutation, for MT applied to RL. Notably, we use existing taxonomies of faults to build a set of mutation operators relevant to RL and use a simple heuristic to generate test cases for RL. This allows us to compare different mutation killing definitions based on existing approaches, as well as to analyze the behavior of the obtained mutation operators and their potential combinations called Higher Order Mutation(s) (HOM). We show that the design choice of the mutation killing definition can affect whether or not a mutation is killed as well as the generated test cases. Moreover, we found that even with a relatively small number of test cases and operators we manage to generate HOM with interesting properties which can enhance testing capability in RL systems.
Jiawen Liu (Queen's U): Understanding Contributors Profiles of Popular ML Libraries
Abstract: With the increasing popularity of machine learning (ML), a growing number of software developers have been attracted to developing and adopting ML approaches. Establishing an understanding of ML contributors is critical to the success of the ML software development and maintenance process. Research efforts to study ML contributors are limited to the difficulties and challenges perceived by ML contributors, gathered through user surveys, interviews, or analysis of posts on Q&A systems. There is a lack of understanding of the characteristics of ML contributors based on their behaviors observed from the software repository history. In this paper, we aim to identify contributor profiles in ML library projects, and to study their contribution patterns and the reasons for the acceptance or rejection of their contributions. By investigating 7640 contributors from 6 popular ML libraries (i.e., TensorFlow, PyTorch, Keras, MXNet, Theano and ONNX), we identify four ML contributor profiles, namely casual, active, productive, and collaborative contributors.
Peter Yefi (Concordia U): Building IoT Systems Modeling: An Object-oriented Metamodeling Approach
Abstract: Creating human and computer-readable representations of built environments and their systems is necessary for efficiently managing buildings. The efficient management of buildings, and by extension, building representations and models, helps reduce energy consumption, increase occupant comfort, and, consequently, reduce costs and increase productivity. Most commercial buildings come with a Building Energy Management System (BEMS) with readable representations of some aspects of the building and its systems. The primary function of a BEMS is control of the heating, ventilation, air-conditioning, and lighting. The representations in a BEMS do not wholly model a building, are vendor-specific, and consequently inhibit interoperability. There are several approaches to modeling built environments and providing readable representations of buildings. However, these approaches have some challenges and limitations that we identify and overcome by introducing an object-oriented metamodel validated with BEMS data from three buildings, focused on the mechanical, electrical, and plumbing entities of a building.
Yiping Jia (Queen's U): Ranking Evolutionary Couplings in Software Systems using Learning to Rank Algorithms
Abstract: Software systems are subject to continuous changes over time by adding new features, fixing defects, improving performance, and adapting software systems to new technologies. During the software evolution, some artifacts (e.g., methods) in the same software are frequently changed together even though the artifacts have no explicit dependencies. Such a co-change relationship is known as evolutionary coupling. Ignoring evolutionary couplings in software maintenance and software testing could lead to defects. In this talk, we will discuss our proposed approach that detects evolutionary couplings using historical data (e.g., commit logs and code metrics), and applies machine learning algorithms (i.e., learning to rank algorithms) to recommend methods that need the developers' attention for code changes and testing. To examine the effectiveness of our approach, we conduct experiments on 25 open-source projects. We find that the Random Forest model outperforms other machine learning models and reaches 0.874 average NDCG@5. Our approach can effectively predict the co-change candidates and outperform the baselines by 0.204 average NDCG@5. Moreover, our approach only requires 30 days of training data to achieve a reasonable prediction performance. Our approach can help developers uncover hidden dependencies to enrich their test cases.
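For readers unfamiliar with the NDCG@5 metric reported above, the following Python sketch shows one common way to compute it for a single ranked list. The abstract does not say which DCG variant the authors use; this sketch assumes the linear-gain form (relevance divided by log2 of rank + 1), and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (linear gain variant)."""
    # enumerate() is 0-based, so rank 1 gets discount log2(2) = 1.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 5) -> float:
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# `relevances` lists the true relevance of each recommended method,
# in the order the model ranked them (1 = really co-changed, 0 = not).
perfect = ndcg_at_k([1, 1, 0, 0, 0])   # relevant items ranked first
swapped = ndcg_at_k([0, 1])            # the only relevant item ranked second
```

A ranking that places every truly co-changed method first scores 1.0, and the score drops as relevant methods slide down the list.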
Ait Abderrahim Hamou (ÉTS): Towards a Catalog of Patterns for Migrating Legacy Applications to IoT
Abstract: Enterprises in the modern world aim to improve their decisions by gathering more data about the environments of their products and their services. They aspire to achieve that by evolving their legacy information systems with IoT. In this context, we aim to develop an approach to integrate IoT devices into existing legacy information systems. My research poster will highlight the problem associated with this integration, the objectives we set, the methodology we will follow to attain these objectives and the expected outcomes of our research, at both the theoretical and practical levels.
Md Nakhla Rafi (Concordia U): Employing Code Change Information for Enhanced Fault Localization
Abstract: The efficacy and portability of coverage-based fault localization have led to a significant amount of study in the field. However, current methods sometimes oversimplify coverage by condensing it into a collection of tests or boolean vectors, neglecting code change information. This limits the practical use of these methods. In this paper, we show a new fault localization method that combines learning from graph-based representations and code change information with extensive coverage data. By treating tests and coverage as nodes and edges respectively, we present a graph-based representation of code that accurately represents fine-grained code structures with code change information. Following that, we study how these code change metrics can help boost the performance of fault localization techniques. To evaluate the effectiveness of our method, which utilizes these metrics, we applied it to a leading-edge graph-based fault localization model and used the defects4j and Fonte benchmarks for assessment. Compared to previous approaches, our method shows a slight improvement in terms of Top-1, Top-3, and Top-5 results, indicating future possibilities for improvement.
Doriane Olewicki (Queen's U): Towards Lifelong Learning for Software Analytics Models: Empirical Study on Brown Build and Risk Prediction
Abstract: Nowadays, software analytics tools using machine learning (ML) models are well established to, for example, predict the risk of a code change. However, as the goals of a project shift over time, and developers and their habits change, the performance of said models tends to degrade (drift) over time, until a model is retrained using new data. Current retraining practices typically are an afterthought (and hence costly), requiring a new model to be retrained from scratch on a large, updated data set at random points in time; also, there is no continuity between the old and new model. In this presentation, we propose to use lifelong learning (LL) to continuously build and maintain ML-based software analytics tools using an incremental learner that progressively updates the old model using new data.
Arghavan Moradi Dakhel (Polytechnique): Improving the Effectiveness of Tests Generation by LLMs Using Mutation Testing
Abstract: One of the critical phases in the software development life cycle is software testing. Testing helps with identifying potential bugs and reducing maintenance costs. The goal of automated test generation tools is to ease the development of tests by suggesting efficient bug-revealing tests. Recently, researchers have leveraged Large Language Models (LLMs) of code, such as OpenAI’s Codex, to generate unit tests. While the code coverage of generated tests was assessed, it has been acknowledged in the literature that the coverage is weakly correlated with bug detection. Hence, to improve over this limitation, in this paper, we introduce MutAP (Mutation Test case generation using Augmented Prompt) for improving the effectiveness of test cases generated by LLMs in bug detection. This is achieved by augmenting prompts with surviving mutants, as those mutants highlight the limitations of test cases in detecting bugs. MutAP is capable of generating effective test cases in the absence of natural language descriptions of the Programs Under Test (PUTs). We employ Codex as the LLM component within MutAP and evaluate its performance on two different datasets of synthetic and real buggy programs. Our results show that our proposed method outperforms both the current state-of-the-art fully automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning approaches on LLMs. Specifically, our method achieves a Mutation Score of 92.02% on synthetic bugs, outperforming all other approaches. Furthermore, MutAP is able to detect 79 faulty human-written code snippets that are undetected by any other comparable method in our evaluation. Our findings suggest that although LLMs can be a useful tool for developers to generate test cases, caution should be exercised regarding the effectiveness of the generated tests, which may suffer from syntactic or functional errors and may be ineffective in detecting certain types of bugs and testing corner cases in PUTs.
However, based on our results, these LLM-based techniques could be augmented with traditional automated test generation techniques to generate effective and efficient tests.
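As a rough illustration of the mutation score mentioned in this abstract, the Python sketch below counts how many mutants of a tiny function a test suite kills, and lists the survivors (the kind of mutants the abstract says are added to the prompt). This is not MutAP itself: the function under test, the hand-written mutants, and all names here are hypothetical, and real tools generate mutants automatically.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[int, int]  # (input, expected output)


def is_killed(mutant: Callable[[int], int], tests: List[TestCase]) -> bool:
    """A mutant is killed if at least one test fails (or crashes) on it."""
    for arg, expected in tests:
        try:
            if mutant(arg) != expected:
                return True
        except Exception:
            return True
    return False


def surviving_mutants(mutants, tests):
    """Mutants that no test detects; these expose gaps in the test suite."""
    return [m for m in mutants if not is_killed(m, tests)]


def mutation_score(mutants, tests) -> float:
    """Fraction of mutants killed by the test suite."""
    if not mutants:
        return 0.0
    return sum(is_killed(m, tests) for m in mutants) / len(mutants)


# Hypothetical program under test: clamp negative numbers to zero,
# i.e. the correct behaviour is max(x, 0).
# Hand-written mutants of that behaviour:
MUTANTS = [
    lambda x: min(x, 0),  # max replaced by min
    lambda x: max(x, 1),  # constant 0 replaced by 1
    lambda x: x,          # clamping removed
]

weak_tests = [(5, 5)]
strong_tests = [(5, 5), (-2, 0)]

weak_score = mutation_score(MUTANTS, weak_tests)
strong_score = mutation_score(MUTANTS, strong_tests)
```

With only the (5, 5) test, two of the three mutants survive; adding a negative input kills them all, which is the sort of improvement a surviving mutant can suggest.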
Fangjian Lei (Queen's U): Studying the Evolution Patterns of Self-admitted Technical Debts in Software Systems
Abstract: Self-Admitted Technical Debt (SATD) refers to technical debt that developers themselves acknowledge as an intentional decision, taken to meet a specific goal such as a deadline or to work around a lack of knowledge or resources. SATD is a common phenomenon in software development that can significantly impact the maintainability and stability of software systems. Code cloning, which refers to identical or structurally similar code fragments, can further damage the maintainability and stability of software systems if not managed properly. Despite extensive research on code clones and SATD to understand their individual impacts on systems, the correlation between code clones and SATD remains underexplored. This paper aims to perform an empirical study to investigate the differences in SATD between cloned and non-cloned code, the patterns of SATD evolution in software history, and the semantic topics of SATD. We address four research questions, focusing on the accurate classification of SATD and non-SATD using a fine-tuned BERT model, patterns of SATD evolution, differences in evolution patterns between cloned and non-cloned code, and the differences in removed SATDs between cloned and non-cloned code. Our contributions include empirical findings that can help developers and code reviewers prioritize SATD resolution, insights into the linkage between code clones and SATD, a larger SATD-categorized dataset, and a methodology for extracting the evolution pattern of SATD. Our findings reveal significant differences in the distribution of SATD evolution patterns between cloned and non-cloned code and suggest that developers should be aware of these distinctions to establish effective maintenance strategies and improve the software.
Jean Baptiste MINANI (Concordia U): Exploring the Landscape of IoT System Testing: A Systematic Review
Abstract: Internet of Things (IoT) systems are becoming prevalent in various domains. Testing IoT systems before their deployment is critical in ensuring their reliability. Many studies have discussed different tools, approaches, quality attributes, and challenges in the context of IoT systems testing. This paper presents a systematic literature review that aggregates, synthesizes, and discusses the results of 83 relevant primary studies (PSs) concerning IoT testing tools, approaches, quality attributes, and challenges. We followed the 2020 Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA) to report our findings. The included PSs were selected using inclusion and exclusion criteria applied to relevant studies published between 2012 and 2022. We extracted and analyzed the data from PSs to understand how IoT systems are tested. The results reveal a focus on testing approaches and tools, with an emphasis on the device layer. However, there is a need for more comprehensive tools and approaches that cover various aspects of IoT systems. Additionally, IoT systems require additional quality attributes beyond those of traditional software. Our research offers valuable insights into the testing challenges, tools, approaches, and quality attributes associated with IoT systems. It provides practical guidance for IoT practitioners by highlighting existing tools and approaches, while also identifying new research opportunities for interested researchers.
11:10 AM [Research Talks] Testing
Md. Ashraf Uddin (U of Manitoba): Demystifying Early Test Termination Due to Assertion Failure
Abstract: In software testing, an assertion is commonly used to validate a program's expected behavior (e.g., whether the value returned by a method equals an expected value). Although it is a recommended practice to use only one assertion in a single test to avoid code smells (e.g., Assertion Roulette), we observe that it is common to have multiple assertions in a single test. One issue with tests that have multiple assertions is that when the test fails at an early assertion, the test terminates at that point, and the remaining test code is not executed. This, in turn, can potentially reduce the code coverage and the performance of techniques that rely on code coverage information (e.g., spectrum-based fault localization, SBFL). We refer to such a scenario as early test termination. Understanding the impact of early test termination on test coverage is important for software testing and debugging, particularly for the techniques that rely on coverage information obtained from testing. In this study, we investigated 207 versions of 6 open-source projects. We found that a non-negligible portion of the failed tests (19.1%) terminate early due to assertion failure, which leads to the skipping of 15.3% to 60.5% of the test code on average, and has a negative impact on test coverage. To mitigate early test termination, we propose two approaches, i.e., Trycatch (adding a try-catch block surrounding an assertion) and Slicing (slicing a test into a set of independent sub-tests, each of which contains only one assertion and its dependent code). After applying our approaches, line/branch coverage improves in 55% of the studied versions. Moreover, Slicing improves the performance of SBFL (i.e., Ochiai) by 15.1% in terms of Mean First Rank (MFR). We also provide actionable suggestions for developers to prevent early test termination, and approaches to mitigate early test termination if it already exists in their projects.
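The Trycatch idea mentioned in this abstract can be illustrated with a minimal Python sketch. This is not the authors' tool: the `SoftAsserts` helper and the `run_test` function are hypothetical names introduced here, and the sketch only assumes that each assertion is wrapped so that a failure is recorded instead of ending the test.

```python
# Illustrative sketch only; SoftAsserts and run_test are hypothetical names,
# not part of the approach's actual implementation.
class SoftAsserts:
    """Records assertion failures instead of stopping the test at the first one."""

    def __init__(self):
        self.failures = []

    def check(self, condition, message):
        # Wrap the assertion in try/except so later test code still runs.
        try:
            assert condition, message
        except AssertionError as exc:
            self.failures.append(str(exc))


def run_test():
    soft = SoftAsserts()
    executed = []  # tracks which parts of the test body were reached

    value = 2 + 2
    soft.check(value == 5, "first assertion fails")
    executed.append("after first assertion")  # would be skipped by a plain assert

    soft.check(value == 4, "second assertion passes")
    executed.append("after second assertion")

    return soft.failures, executed
```

With a plain `assert`, the test body would stop at the first failing check; here both statements after the checks are reached, and the failure is still reported at the end.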
Nima Shiri Harzevili (York U): History-Driven Fuzzing for Deep Learning Libraries
Abstract: Many deep learning (DL) fuzzers have been proposed to generate malformed inputs from the valid input domain of an API function that is mined from API documentation or open-source code examples to trigger incorrect behaviors of the API. Despite their promising results, existing DL fuzzers suffer from the following limitations. First, most DL fuzzers focus only on high-level APIs that are used by end-users, which leaves a large number of APIs used by library developers untested, thus missing opportunities to find more bugs. Second, existing DL fuzzers use general input generation rules, such as random value generation and boundary-input generation, which are ineffective at generating DL-specific malformed inputs. To fill this gap, we first conduct an empirical study regarding root cause analysis on 447 historical security vulnerabilities of two of the most popular DL libraries, i.e., PyTorch and TensorFlow, with the goal of characterizing and understanding their malicious inputs. As a result, we categorize 18 rules regarding the construction of malicious inputs, which we believe can be used to generate effective malformed inputs for testing DL libraries. We further design and implement Orion, a new fuzzer that tests DL libraries by utilizing our malformed input generation rules mined from real-world deep learning security vulnerabilities. Specifically, Orion first collects API invocation code from various sources such as API documentation, source code, developer tests, and publicly available repositories on GitHub. Then Orion instruments these code snippets to dynamically trace execution information for each API, such as parameters' types, shapes, and values. After that, Orion combines the malformed input generation rules and the dynamic execution information to create inputs to test DL libraries. Our evaluation on TensorFlow and PyTorch shows that Orion reports 143 bugs, 68 of which are previously unknown.
Among the 68 new bugs, 58 have been fixed or confirmed by developers after we reported them, and the rest are awaiting confirmation. Compared to the state-of-the-art DL fuzzers (i.e., FreeFuzz and DocTer), Orion detects 21% and 34% more bugs, respectively.
11:40 AM [Poster Lightning Talks]
Yu Shi (Queen's U): Planning task executions in distributed machine learning systems
Abstract: Distributed machine learning (DML) systems are becoming increasingly popular due to their ability to process large amounts of data and perform complex computations. However, managing the training process in such systems can be challenging, as it involves coordinating multiple heterogeneous computing devices (i.e., CPUs, GPUs, TPUs, mobile devices, and IoT devices). This project proposes using automated planning techniques to manage the training process of distributed machine learning systems. Specifically, we use the Planning Domain Definition Language (PDDL) to formalize the training process and generate task plans that maximize training performance. We aim to demonstrate that automated planning can effectively manage task executions in distributed machine learning systems.
Shenyu Zheng (Queen's U): How Do Developers Use Bazel To Build Their Projects?
Abstract: Bazel provides promising features to users, aiming to achieve the efficiency of the build process while maintaining correctness. However, little is known about the usage of these features in open-source projects and to what extent developers utilize them. This paper reports our findings on the usage of Bazel features and build rules, and we compare these findings with Maven projects. After analyzing a set of 383 Bazel projects, we discovered: (1) Fewer Bazel projects use the build system in CI than Maven projects. (2) Bazel projects employ more custom and native build rules than Maven projects. (3) The impact of increasing parallelism on the speedup of build time in projects of different sizes is not significantly different. However, projects with shorter build durations demonstrate a greater improvement in build time when using parallelism levels of 2, 4, and 8. (4) Around 37% of Bazel projects leverage parallelism at degrees 2 and 4, with a 95% confidence interval. This percentage decreases to 28.76% at parallelism level 8 and 7.84% at parallelism level 16.
Amin Ghadesi (Polytechnique): What Causes Exceptions in Machine Learning Applications? Mining Machine Learning-Related Stack Traces on Stack Overflow
Abstract: Machine learning (ML), including deep learning, has recently gained tremendous popularity in a wide range of applications. However, like traditional software, ML applications are not immune to the bugs that result from programming errors. Explicit programming errors usually manifest through error messages and stack traces. These stack traces describe the chain of function calls that lead to an anomalous situation, or exception. Indeed, these exceptions may cross the entire software stack (including applications and libraries). Thus, studying the patterns in stack traces can help practitioners and researchers understand the causes of exceptions in ML applications and the challenges faced by ML developers. To that end, we mine Stack Overflow (SO) and study 11,449 stack traces related to seven popular Python ML libraries. First, we observe that ML questions that contain stack traces gain more popularity than questions without stack traces; however, they are less likely to get accepted answers. Second, we observe that recurrent patterns exist in ML stack traces, even across different ML libraries, with a small portion of patterns covering many stack traces. Third, we derive five high-level categories and 25 low-level types from the stack trace patterns: most patterns are related to Python basic syntax, model training, parallelization, data transformation, and subprocess invocation. Furthermore, the patterns related to subprocess invocation, external module execution, and remote API calls are among the least likely to get accepted answers on SO. Our findings provide insights for researchers, ML library providers, and ML application developers to improve the quality of ML libraries and their applications.
PAULINA STEVIA NOUWOU MINDOM (Polytechnique): A Comparison of Reinforcement Learning Frameworks for Software Testing Tasks
Abstract: Software testing activities scrutinize the artifacts and the behavior of a software product to find possible defects and ensure that the product meets its expected requirements. Although various software testing approaches have been shown to be very promising in revealing defects, some of them lack automation or are only partly automated, which increases the testing time, the manpower needed, and overall software testing costs. Recently, Deep Reinforcement Learning (DRL) has been successfully employed in complex testing tasks such as game testing, regression testing, and test case prioritization to automate the process and provide continuous adaptation. Practitioners can employ DRL by implementing a DRL algorithm from scratch or by using a DRL framework. DRL frameworks offer well-maintained implementations of state-of-the-art DRL algorithms to facilitate and speed up the development of DRL applications. Developers have widely used these frameworks to solve problems in various domains including software testing. However, to the best of our knowledge, there is no study that empirically evaluates the effectiveness and performance of the algorithms implemented in DRL frameworks. Moreover, the literature lacks guidelines that would help practitioners choose one DRL framework over another. In this paper, therefore, we empirically investigate the applications of carefully selected DRL algorithms (based on the characteristics of algorithms and environments) on two important software testing tasks: test case prioritization in the context of Continuous Integration (CI) and game testing. For the game testing task, we conduct experiments on a simple game and use DRL algorithms to explore the game to detect bugs. Results show that some of the selected DRL frameworks such as Tensorforce outperform recent approaches in the literature.
To prioritize test cases, we run extensive experiments on a CI environment where DRL algorithms from different frameworks are used to rank the test cases. We find some cases where our DRL configurations outperform the implementation of the baseline. Our results show that the performance difference between implemented algorithms in some cases is considerable, motivating further investigation. Moreover, empirical evaluations on some benchmark problems are recommended for researchers looking to select DRL frameworks, to make sure that DRL algorithms perform as intended.
Ikram DARIF (ÉTS): A Model-driven and Template-based Approach for Requirements Specification
Abstract: Requirements specification and verification play an important role in the certification of safety-critical software (SCS). These activities are costly and error-prone due to two main problems: 1) SCS include a high number of requirements and 2) most SCS manufacturers use natural language to specify requirements. On the one hand, natural language introduces ambiguity and inconsistency. On the other hand, formal languages add an overhead to the requirements specification because of their complexity. Controlled Natural Languages (CNLs) fill these gaps by offering a middle-ground solution, although they are not yet well adopted by the industry. We propose an approach that combines CNLs and model-driven engineering (MDE) for requirements specification. The approach was proposed to support an industrial partner in the certification process of an SCS. The approach uses templates and relies on two types of models: 1) models that specify the templates, and 2) a model of the domain of the system at hand. Using models of the templates enables the automation of some requirements analysis tasks, while using a domain model allows auto-completion and verification of the requirements specified using the templates. The approach is implemented and validated through three case studies including more than a thousand requirements. Through these case studies, we validated that 1) the templates are applicable across domains and 2) our approach yields requirements of better quality in terms of necessity, ambiguity, completeness, singularity, and verifiability.
Kaveh Shahedi (Polytechnique): Tracing Optimization for Performance Modelling and Regression Detection
Abstract: Although software performance modeling through application tracing is a precise approach, the trade-off between accuracy and the overhead generated in the system is crucial. This research aims to identify the functions that are insignificant for performance by monitoring their behavior at runtime and evaluating the information gained from their executions. Next, the performance-insensitive functions are removed from tracing, and an optimized performance model is trained on the reduced collected data. The accuracy of the new performance model is assessed with different performance regressions injected into the programs to determine whether it is capable of detecting the regressions in the systems.
Mohammad Mahdi Mohajer (York U): Exploring the Fairness of Code Reviewer Recommendation
Abstract: The poster discusses the fairness issue in machine learning-based code reviewer recommendation systems. The study aims to evaluate and mitigate any unfairness in these systems. We found that such systems exhibit biased behavior on different datasets, which is primarily due to imbalanced data or popularity bias. Existing mitigation approaches can be effective if the dataset has similar distributions of protected and privileged groups, but their effectiveness can be limited otherwise. Future research will explore fairness issues in code reviewer recommendation systems with different architectures and sensitive attributes. The study highlights the need for fairness considerations in machine learning-based recommendation systems, especially those that interact with human entities. It emphasizes the importance of evaluating and mitigating any biases to ensure equitable outcomes and underscores the relevance of such research in contemporary software engineering practices.
Mohammadmehdi Morovati (Polytechnique): Bug Characterization in Machine Learning-based Systems
Abstract: The rapid growth of Machine Learning (ML) applications in different domains, especially in safety-critical areas, increases the value of having reliable ML components, i.e., software components operating based on ML. Since corrective maintenance is a key task to deliver reliable software components, it is necessary to investigate the employment of ML components from the software maintenance perspective. Understanding ML bugs can help developers determine where to focus development and testing efforts. Hence, in this paper, we study how employing ML components affects software maintenance in ML-based software systems. We extracted 447,948 GitHub repositories that used one of the three most popular ML frameworks, i.e., TensorFlow, Keras, and PyTorch. After multiple steps of filtering, we check the top 100 repositories with the highest number of closed issues. We carefully inspect them to make sure that an ML component is utilized to interact with other software components in the system. Then, we manually inspect 386 sampled issues raised in the selected ML-based systems to determine whether they are bugs related to ML components or not. Next, we review 109 identified ML bugs to determine their root cause, symptom, and effort-to-fix. We show that nearly half of the issues are related to ML components. The results also reveal that ML bugs have significantly different characteristics compared to non-ML ones, in terms of the complexity of bug-fixing (number of commits, changed files, and changed lines of code). We also show that fixing ML bugs consumes as many resources (time and expertise level) as non-ML bugs. Thus, paying significant attention to the reliability of the ML components is crucial since ML issues are more costly and their impacts can be propagated to other software components.
These results deepen the understanding of ML bugs and we hope that our findings help shed light on opportunities to design effective tools for testing and debugging ML-based systems.
Pouya Fathollahzadeh (Queen's U): Techniques for Creating a dataset of Automated Generated Code
Abstract: In recent years, auto-generated code has been gaining popularity in the field of software engineering. Despite the growing use of this technology, there is still a need for further research to investigate its quality and reliability. While previous studies have examined auto-generated code, there is a lack of literature that compares it with human-written code. In this study, we use regular expressions to identify automatically generated code based on its comments. Then, we extract the automatically generated code. We aim to collect a dataset of projects that use automatically generated code for future studies.
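As a rough illustration of the comment-based detection described in this abstract, the sketch below flags a file as generated when one of its leading comment lines contains a typical marker phrase. The marker phrases, the comment prefixes, and the `looks_generated` function are assumptions made for this example; the abstract does not state which patterns the study uses.

```python
import re

# Hypothetical marker phrases; the study's actual patterns are not given in the abstract.
GENERATED_MARKER = re.compile(
    r"(auto-?generated|automatically generated|generated by|do not edit)",
    re.IGNORECASE,
)

# Lines that start with a common comment prefix (//, #, /*, or a leading *).
COMMENT_LINE = re.compile(r"^\s*(//|#|/\*|\*)")


def looks_generated(source, header_lines=20):
    """Return True if a comment among the first lines carries a generated-code marker."""
    for line in source.splitlines()[:header_lines]:
        if COMMENT_LINE.match(line) and GENERATED_MARKER.search(line):
            return True
    return False
```

For instance, `looks_generated("// Code generated by protoc-gen-go. DO NOT EDIT.\npackage x")` returns `True`, while a file that only mentions "generated by" inside a string literal is not flagged, because the check is restricted to comment lines.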
Sristy Sumana Nath (U of Sask.): Recovering Traceability Links between Release Notes and Related Software Artifacts
Abstract: A release note is a document that summarizes all the changes made to a software product, including bug fixes, feature additions, and security updates, since its previous release. To create a release note, producers often reuse content from software artifacts such as pull requests (PRs), issues, and commits. Maintaining traceability links between releases and their corresponding artifacts is crucial for tracking software development and documenting knowledge for continuous delivery/integration (CD/CI) practices. However, this is usually done manually, leading to missing links and mislabelling problems. Our study proposes an automated link recovery approach using a sequence-to-sequence (seq2seq) model to predict three pairs of traceability links for release notes: release note contents-issues, release note contents-commits, and release note contents-PRs. We utilize both textual and non-textual information from GitHub, including release note contents, commit messages, issue titles and PR titles, published date, closed date, and labels, to improve the accuracy of our model. Our proposed seq2seq model achieves F1-scores of 76%, 83%, and 88% for release note contents-issues, release note contents-commits, and release note contents-PRs, respectively. To improve the usability of automated techniques, we conducted semi-structured interviews with practitioners to understand their opinions about the automated traceability links for release notes.
Maximilian Schiedermeier (McGill U): The horsemen of controlled experiment apocalypse
Abstract: Controlled experiments with humans come with a delicate tradeoff: adjusting the level of control. Context and conduct must be sufficiently strict to ensure the acquisition of comparable and representative samples. At the same time, too much control can easily jeopardize the experiment's significance through contrived boundaries. This poster illustrates lessons learnt from a controlled experiment with humans, regarding a meaningful control tradeoff: we present “The horsemen of controlled experiment apocalypse” that came to visit us, and provide hands-on recommendations on how to combat them with meaningful control parameters.
Forough Majidi (Polytechnique): Investigate the Usage of Automated Machine Learning Tools in Practice
Abstract: The popularity of automated machine learning (AutoML) tools in different domains has increased over the past few years. Machine learning (ML) practitioners use AutoML tools to automate and optimize processes such as feature engineering, model training, and hyperparameter optimization. Recent work performed qualitative studies on practitioners' experiences of using AutoML tools and compared different AutoML tools based on their performance and provided features, but none of the existing work studied the practices of using AutoML tools in real-world projects at a large scale. Therefore, we conducted an empirical study to understand how ML practitioners use AutoML tools in their projects. To this end, we examined the top 10 most used AutoML tools and their respective usages in a large number of open-source project repositories hosted on GitHub. The results of our study show 1) which AutoML tools are most used by ML practitioners and 2) the characteristics of the repositories that use these AutoML tools. Also, we identified the purposes of using AutoML tools (e.g., model parameter sampling, search space management, model evaluation/error analysis, data/feature transformation, and data labeling) and the stages of the ML pipeline (e.g., feature engineering) where AutoML tools are used. Finally, we report how often AutoML tools are used together in the same source code files. We hope our results can help ML practitioners learn about different AutoML tools and their usages, so that they can pick the right tool for their purposes. In addition, AutoML tool developers can benefit from our findings to gain insight into the usages of their tools and improve them to better fit users' needs.
1:00 PM [Keynote] Sarra Habchi (Ubisoft): Quality Engineering in the Video Game Industry: Challenges and Opportunities
Abstract: Developing high-budget video games is a complex process that demands significant effort and meticulous organization. These games consist of an extensive amount of code, estimated to be in the tens of millions of lines, distributed across hundreds of thousands of files, and requiring tens of thousands of code changes. The development of modern AAA games poses additional challenges due to the need to manage multidisciplinary teams, including developers, artists, and sound engineers, which is markedly different from typical code-only projects. In addition, these games must be designed to be compatible and scalable across multiple platforms, including various consoles, PCs, and mobile devices. In this environment, unique engineering challenges emerge, requiring novel research and development solutions from our community. In this talk, I will provide a brief overview of predicaments encountered by game development teams, covering topics such as performance analysis, automated testing, and cross-artifact build systems. Then, I will present some of the initiatives conducted by our research group to address these challenges. Finally, I will conclude with a discussion of open challenges and potential research opportunities.
1:40 PM [New Faculty Talk] Quentin Stiévenart (UQÀM): Program Analysis of WebAssembly Applications (Slides)
Abstract: The recently introduced WebAssembly standard aims to form a portable compilation target, enabling the cross-platform distribution of programs written in a variety of languages. The research community has quickly addressed the need to perform automated program analysis on WebAssembly applications. In this presentation, we will go over some of the approaches that have been designed to support the analysis of WebAssembly programs. Of particular interest will be the Wassail framework (Stiévenart et al., 2021), and our work on program slicing for WebAssembly (Stiévenart et al., 2022).
2:10 PM [Research Talks] Code Quality
Chunli Yu (Queen's U): An Empirical Study on the Characteristics of Reusable Code Clones (Slides)
Abstract: Copying and pasting code is a common practice in software development, especially with the availability of myriad open-source projects. By reusing existing code fragments, code cloning may accelerate the development process and ensure better software quality if high-quality code fragments are reused. Code cloning therefore cannot always be avoided, even though it can increase software maintenance cost, as bugs can be propagated inadvertently by code cloning and go beyond the boundary of a system. Hence, code should be reused judiciously rather than through indiscriminate copy-and-paste. In this study, we aim to help developers determine whether code clones are reusable or not. Generally speaking, reusable code clones stand the test of time, and thus exist in the system for a long time, and have fewer bugs. Moreover, highly reusable code can be measured by its prevalence. We provide an approach that can automatically identify reusable code clones from three perspectives using Machine Learning (ML) classifiers: clone prevalence (i.e., the number of clone siblings), clone longevity (i.e., the clone genealogy length), and clone fault-resiliency (i.e., the percentage of non-buggy commits versus the buggy commits). In experiments on 26 open-source Java projects, our approach achieves a median AUC of 0.79. The results show that the number of followers, the number of contributors, and the depth of common paths provide the most explanatory power for correctly identifying reusable code clones. Hence, practitioners can utilize our classifiers and insights from our findings to make more reliable use of code clones and prioritize the use of high-quality clones in their clone management activities.
Shayan Noei (Queen's U): An Empirical Study of Refactoring Rhythms and Tactics in the Software Development Process
Abstract: It is critical for developers to develop high-quality software to reduce maintenance cost. Developers often apply refactoring practices to make source code readable and maintainable without impacting the software functionality. Existing studies identify development rhythms (i.e., weekly development patterns) and their relationship with various metrics, such as productivity. However, existing studies focus entirely on development rhythms. There is no study on refactoring rhythms and their relationship with code quality. Moreover, the existing studies categorize the refactoring tactics (i.e., long-term refactoring patterns) into two general concepts of consistent and inconsistent refactoring. Nevertheless, the existence of other tactics and their relationship with code quality is not explored. In this paper, we conduct an empirical study on the refactoring practices of 196 Apache projects in the early, middle, and late stages of development. We aim to identify (1) existing refactoring rhythms, (2) further refactoring tactics, and (3) the relationship between the identified tactics and rhythms and code quality. The recognition of existing refactoring strategies and their relationship with code quality can assist practitioners in recognizing and applying the appropriate and high-quality refactoring rhythms or tactics to deliver higher-quality software. We find two frequently used refactoring rhythms: work-day refactoring and all-day refactoring. We also identify two deviations of floss and root canal refactoring tactics: intermittent root canal, intermittent spiked floss, frequent spiked floss, and frequent root canal. We find that root canal-based tactics are correlated with a smaller increase in code smells (i.e., higher quality code) compared to floss-based tactics. Moreover, we find that refactoring rhythms are not significantly correlated with the quality of the code.
Furthermore, we provide detailed information on the relationship of each refactoring tactic to each code smell type.
Ding Li (Concordia U): Assessment of Software Vulnerability using eXplainable AI Techniques (Slides)
Abstract: Software vulnerability detection is a proactive way to reduce the risk to software security and reliability. Despite advancements in deep learning-based detection, a semantic gap still remains between learned deep learning features and human-understandable vulnerability semantics. Hence, systematically explaining the importance of semantic-based features is one way to assess how the two relate. This paper presents eXplainable AI (XAI)-based techniques to assess the contributing factors of feature representations and their impacts on classifying software code into different Common Weakness Enumeration (CWE) types. Our XAI explanation techniques are agnostic to the deep learning model and are therefore applied to both transformer-based and graph neural network-based deep learning models. Our study identifies ten kinds of syntax constructs as contributing features, in addition to those reported in existing work. Based on the identified semantic-based features, we obtain Top-1 and Top-5 CWE classification similarity rates of approximately 77.8% and 89%, respectively, compared to the collective expert knowledge base from the open community.
Maram Assi (Queen's U): Predicting the Change Impact of Resolving Defects by Leveraging the Topics of Issue Reports in Open Source Software Systems
Abstract: Upon receiving a new issue report, practitioners start by investigating the defect type, the potential fixing effort needed to resolve the defect, and the change impact. Moreover, issue reports contain valuable information, such as the title, description, and severity, and researchers leverage the topics of issue reports as a collective metric portraying similar characteristics of a defect. Nonetheless, none of the existing studies leverage the defect topic, i.e., a semantic cluster of defects of the same nature, such as Performance, GUI, and Database, to estimate the change impact, which represents the amount of change needed in terms of code churn and the number of files changed. To this end, in this paper, we conduct an empirical study on 298,548 issue reports belonging to three large-scale open-source systems, i.e., Mozilla, Apache, and Eclipse, to estimate the change impact in terms of code churn or the number of files changed while leveraging the topics of issue reports. First, we adopt the Embedded Topic Model (ETM), a state-of-the-art topic modelling algorithm, to identify the topics. Second, we investigate the feasibility of predicting the change impact using the identified topics and other information extracted from the issue reports by building eight prediction models that classify issue reports as requiring a small or large change impact along two dimensions, i.e., the code churn size and the number of files changed. Our results suggest that XGBoost is the best-performing algorithm for predicting the change impact, with an AUC of 0.84, 0.76, and 0.73 for the code churn and 0.82, 0.71, and 0.73 for the number of files changed metric for Mozilla, Apache, and Eclipse, respectively. Our results also demonstrate that the topics of issue reports improve the recall of the prediction model by up to 45%.
3:30 PM [Research Talks] Program Analysis
Md Ahasanuzzaman (Queen's U): Using Knowledge Units of Programming Languages to Recommend Reviewers for Pull Requests: An Empirical Study (Slides)
Abstract: Code review is a key element of quality assurance in software development. Determining the right reviewer for a given code change requires understanding the characteristics of the changed code, identifying the skills of each potential reviewer (expertise profile), and finding a good match between the two. To facilitate this task, we design a code reviewer recommender that operates on the knowledge units (KUs) of a programming language. We define a KU as a cohesive set of key capabilities that are offered by one or more building blocks of a given programming language. We operationalize our KUs using certification exams for the Java programming language. We detect KUs from 10 actively maintained Java projects from GitHub, spanning 290K commits and 65K pull requests (PRs). Next, we generate developer expertise profiles based on the detected KUs. Finally, these KU-based expertise profiles are used to build a code reviewer recommender (KUREC). The key assumption of KUREC is that the code reviewers of a given PR should be experts in the KUs that appear in the changed files of that PR. In RQ1, we compare KUREC’s performance to that of four baseline recommenders: (i) a commit-frequency-based recommender (CF), (ii) a review-frequency-based recommender (RF), (iii) a modification-expertise-based recommender (ER), and (iv) a review-history-based recommender (CHREV). We observe that KUREC performs as well as the top-performing baseline recommender (RF). From a practical standpoint, we highlight that KUREC’s performance is more stable (lower interquartile range) than that of RF, thus making it more consistent and potentially more trustworthy. Next, in RQ2 we design three new recommenders by combining KUREC with our baseline recommenders. These new combined recommenders outperform both KUREC and the individual baselines.
Finally, in RQ3 we evaluate how reasonable the recommendations from KUREC and the combined recommenders are when those deviate from the ground truth. KUREC is the recommender with the highest percentage of reasonable recommendations (63.4%). One of our combined recommenders (AD_FREQ) strikes the best balance between sticking to the ground truth (best recommender from RQ2) and issuing reasonable recommendations when those deviate from that ground truth (59.4% reasonable recommendations, third best in this RQ). Taking together the results from all RQs, we conclude that KUREC and AD_FREQ are overall superior to the baseline recommenders that we studied. Future work in the area should thus (i) consider KU-based recommenders as baselines and (ii) experiment with combined recommenders.
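To make the key assumption of the abstract above concrete (reviewers of a PR should be experts in the KUs that appear in its changed files), here is a minimal sketch. It is not the authors' KUREC implementation: the function names, the KU labels, the toy history, and the simple count-based scoring are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def build_profiles(history):
    """Build one expertise profile per developer.

    history: iterable of (developer, kus) pairs, one per past change,
    where kus lists the knowledge units touched by that change.
    (Hypothetical input format, assumed for this sketch.)
    """
    profiles = defaultdict(Counter)
    for developer, kus in history:
        profiles[developer].update(kus)
    return profiles


def recommend(profiles, pr_kus, top_n=3):
    """Rank developers by how often they worked with the KUs in the PR.

    Scoring is a plain sum of counts (an assumption of this sketch).
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = {
        dev: sum(counts[ku] for ku in pr_kus)
        for dev, counts in profiles.items()
    }
    ranked = sorted(scores, key=lambda dev: (-scores[dev], dev))
    return [dev for dev in ranked if scores[dev] > 0][:top_n]


# Toy data with made-up developers and KU labels.
history = [
    ("alice", ["generics", "concurrency"]),
    ("alice", ["concurrency"]),
    ("bob", ["io", "generics"]),
    ("carol", ["io"]),
]
profiles = build_profiles(history)
print(recommend(profiles, ["concurrency", "generics"]))  # ['alice', 'bob']
```

A combined recommender in the spirit of RQ2 could, under the same assumptions, add a baseline score such as review frequency to the KU score before ranking.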
Jiho Shin (York U): The Good, the Bad, and the Missing: Neural Code Generation for Machine Learning Tasks
Abstract: Machine learning (ML) has been increasingly used in a variety of domains, while solving ML programming tasks poses unique challenges because of the fundamentally different nature and construction from general programming tasks, especially for developers who do not have ML backgrounds. Automatic code generation that produces a code snippet from a natural language description can be a promising technique to accelerate ML programming tasks. In recent years, although many deep learning-based neural code generation models have been proposed with high accuracy, the fact that most of them are mainly evaluated on general programming tasks calls into question their effectiveness and usefulness in ML programming tasks. In this paper, we set out to investigate the effectiveness of existing neural code generation models on ML programming tasks. For our analysis, we select six state-of-the-art neural code generation models, and evaluate their performance on four widely used ML libraries, with newly-created 83K pairs of natural-language described ML programming tasks. Our empirical study reveals some good, bad, and missing aspects of neural code generation models on ML tasks, with a few major ones listed below. (Good) Neural code generation models perform significantly better on ML tasks than on non-ML tasks. (Bad) Most of the generated code is semantically incorrect. (Bad) Code generation models cannot significantly improve developers’ completion time. (Good) The generated code can help developers write more correct code by providing developers with clues for using correct APIs. (Missing) The observation from our user study reveals the missing aspects of code generation for ML tasks, e.g., decomposing code generation for divide-and-conquer into two tasks: API sequence identification and API usage generation.
Amir Ebrahimi (Queen's U): A Large-Scale Exploratory Study on the Proxy Pattern in Ethereum (Slides)
Abstract: The proxy pattern is a well-known design pattern with numerous use cases in several sectors of the software industry (e.g., network applications, microservices, and IoT). As such, the use of the proxy pattern is also a common approach in the development of complex decentralized applications (DApps) on the Ethereum blockchain. A contract that implements the proxy pattern (proxy contract) acts as a layer between the clients and the target contract, enabling greater flexibility (e.g., data validation checks) and upgradability (e.g., online smart contract replacement with zero downtime) in DApp development. Despite the importance of proxy contracts, little is known about (i) how their prevalence changed over time, (ii) the ways in which developers integrate proxies in the design of DApps, and (iii) what proxy types are being most commonly leveraged by developers. In this paper, we present a large-scale exploratory study on the use of the proxy pattern in Ethereum. We analyze a dataset of all Ethereum smart contracts as of Sep. 2022 containing 50M smart contracts and 1.6B transactions, and apply both quantitative and qualitative methods in order to (i) determine the prevalence of proxy contracts, (ii) understand the ways they are deployed and integrated into applications, and (iii) uncover the prevalence of different types of proxy contracts. Our findings reveal that 14.2% of all deployed smart contracts are proxy contracts. The deployment of various proxy contracts, transactions involving proxy contracts, and adoption of proxy contracts by contract deployers have shown an upward trend over time, peaking at 7%, 20%, and 70%, respectively, at the end of our study period. We show that proxy contracts are either deployed through off-chain scripts or on-chain factory contracts, with the former being the most employed style among practitioners and the latter being more flexible in terms of deploying proxies in a transparent, systematic fashion. 
We found that while the majority (86%) of proxies act as an interceptor, 14% enable upgradeability. Proxy contracts are typically (60%) implemented based on known reference implementations with 41% being of type Minimal Proxies, a class of proxies that aims to cheaply reuse and clone contracts' functionality. Our evaluation shows that our proposed behavioral proxy detection method has a precision and recall of 100% in detecting active proxies. Finally, we derive a set of practical recommendations for developers and introduce open research questions to guide future research on the topic.
4:15 PM [Research Talks] AI Engineering
Zerui Wang (Concordia U): Design Explanation Microservices and Provenance: A Case Study of Explaining Cloud AI Service
Abstract: This work designs an Explainable AI (XAI) service that aims to provide feature contribution explanations for cloud AI services. Cloud AI services are broadly used to develop domain-specific applications but lack transparency and explainability. The AI services provide general model evaluation metrics to demonstrate learning precision. However, the AI models behind the cloud services remain opaque about how predictions are produced. Post hoc XAI methods based on feature contribution approximate the correlation between input features and prediction to gain explainability. The challenge is how to devise XAI that allows the approximation without unfolding the network structure of the learning model. We consider XAI operations to be at the same stage as learning performance evaluation. We advocate that XAI operations be accessible as services to enable the consolidation of the XAI operations into the lifecycle of AI development. We propose an XAI as a service that provides feature contribution explanations for cloud AI services where the learning models remain "black-box". The XAI service is designed using a microservice architecture to integrate AI models and XAI methods. Additionally, provenance metadata is collected from XAI operations to provide traceability of the XAI service. In our case studies, we offer insights into the influential features that contribute to various published cloud AI services. Furthermore, we evaluate the explanation results and performance of the Explainable AI (XAI) services. We employ deep learning models as approximations to cloud AI services and incorporate a variety of CAM-based XAI techniques. We showcase the results that offer explanations for cloud AI services, while also assessing these services using the consistency evaluation metrics. Furthermore, we present the performance and cost associated with the XAI services.
Jun Huang (Concordia U): The Analysis and Development of an XAI Process on Feature Contribution Explanation
Abstract: Explainable Artificial Intelligence (XAI) research focuses on effective explanation techniques to understand and build AI models with trust, reliability, safety, and fairness. Feature importance explanation summarizes feature contributions for end-users to make model decisions. However, XAI methods may produce varied summaries, which calls for further analysis to evaluate the consistency across multiple XAI methods on the same model and data set. This paper defines metrics to measure the consistency of feature contribution explanation summaries under feature importance order and saliency map. Driven by these consistency metrics, we develop an XAI process oriented toward the XAI criterion of feature importance, which performs a systematic selection of XAI techniques and evaluation of explanation consistency. We demonstrate the process development involving twelve XAI methods on three topics, including a search ranking system, code vulnerability detection and image classification. Our contribution is a practical and systematic process with defined consistency metrics to produce rigorous feature contribution explanations.
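As an illustration of what measuring consistency under feature importance order can look like, the sketch below compares the feature rankings produced by two XAI methods using Spearman's rank correlation. This is an assumption made for illustration: the abstract does not state which metrics the authors define, and the function names and scores here are hypothetical. The formula used assumes no tied scores and at least two features.

```python
def ranks(scores):
    """Return the rank of each feature (0 = most important).

    Assumes all scores are distinct (no ties).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation between two importance vectors.

    Uses 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is valid only
    when there are no ties and n >= 2.
    """
    n = len(scores_a)
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up importance scores for three features from two XAI methods.
method_a = [0.5, 0.3, 0.2]
method_b = [0.6, 0.3, 0.1]
print(spearman(method_a, method_b))  # 1.0: both methods order the features identically
```

A value near 1 means the two methods agree on the importance order, while a value near -1 means they rank the features in opposite order.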
Elyas Rashno (Queen's U): Image Caption Generation Using Deep Reinforcement Learning
Abstract: Efficient and accurate caption generation is crucial for image captioning applications. In our paper, we employed a deep reinforcement learning-based actor-critic model to address this requirement. The model comprises an actor trained to predict captions by making sequential decisions based on the image, with each decision corresponding to a word in the caption. Additionally, the model includes a critic that estimates the value of each state as the reward the actor would receive for generating the current word and producing subsequent outputs. The predicted values from the critic are used to train the actor, which is implemented as a policy network, while the critic acts as a value network. Both networks are jointly trained using deep reinforcement learning. We introduced modifications to the learning pattern in the actor-critic model, training the value network at each step rather than after generating each word. To evaluate the effectiveness of our approach, we conducted experiments using the MS COCO dataset. The results demonstrate improved processing time and enhanced efficiency of our image captioning method.

Call for Submissions

We invite you to contribute by sending your proposals for:

  • New faculty talks (30 minutes including Q&A)
  • Regular research talks (15 minutes including Q&A)
  • Poster/demo + lightning talk presentations
The deadline to send in your proposals is Monday, May 15th, 2023, 23:59 (EDT). Please visit the following form (https://forms.gle/hzQ7XYwrF55PLcy6A) to submit your proposal. Include the following to allow us to prepare a great program:
  • Presentation title
  • Author name(s) (with the presenter(s) highlighted) and affiliation
  • Presentation abstract (up to 150 words)
  • Type of talk (e.g., new faculty, regular, or poster/demo)


CSER does not publish proceedings in order to keep presentations informal and speculative.

CSER Steering Committee

McGill University, Montréal, June 7th, 2023

Address: The Strathcona Dentistry building, Room 2/36, 3640 Rue University, Montréal, QC H3A 2B3

Montréal is the largest city in the Canadian province of Québec and the second-largest city in Canada. It is the second-largest primarily French-speaking city in the world, after Paris. The official language of Montréal is French, as defined by the city’s charter, yet most of its citizens are bilingual and it is always possible to study, work, and enjoy Montréal in English.

CSER will be held at McGill University, co-locating with CS-CAN|INFO-CAN events. Please check the CS-CAN|INFO-CAN website for updated information about the venue.

Should you need accommodation, you may consider the many hotels in downtown Montréal (within walking distance or reachable by metro).


Registration method and fees:

CSER 2023 Spring registration is handled by the CS-CAN|INFO-CAN co-located conference. You can visit the co-located conference website or directly follow the link below to register.

Registration link (registration deadline: May 31st). Please select "Software Engineering" under the Category selector.


Type/Deadline     Early-Bird     After May 12th     After May 29th
Student           $100           $125               $150
Non-student       $150           $175               $200

Attending other CS-CAN|INFO-CAN co-located events:

Optionally, you can also register for other events in the co-located conference. The registration rate is per day.

You also have the option to register for the CS-CAN|INFO-CAN awards banquet. The banquet will take place on the evening of Wednesday, June 7th, and will include a three-course dinner with chicken, salmon and vegetarian options, accompanied by wine. Entertainment will include presentations from each society, including the announcement of several major national awards. The price of the banquet ticket is $110.

Attending both CSER and SEMLA:

The Software Engineering for Machine Learning Applications (SEMLA 2023) international symposium will be held at Polytechnique Montréal, Montréal on June 9th and 10th, right after the CSER 2023 Spring meeting. If you register for CSER 2023, you can receive a 30% discount when registering for SEMLA 2023 by applying the promo code “CSER2023S”.