I am a data scientist among software engineers. Or the other way around; it is sometimes hard to tell.
I am currently a substitute professor at TU Clausthal, Germany.
Persona “Data Scientist”
The data scientist in me is mainly interested in three topics.
- Methods and infrastructures for data science. I love machine learning and statistical analysis of data, as well as working with new and ideally large-scale, state-of-the-art infrastructures. For example, I am currently working a lot with MongoDB and Apache Spark.
- Application of data science methods to projects. Working with data science methods and infrastructures makes no sense if you do not have interesting use cases to work on. Since I work among software engineers and have a background in software testing, I currently apply data science mainly to software engineering problems. In recent years, I have focused on cross-project defect prediction, an interesting and very challenging transfer learning problem. However, I have also worked on projects that used Hidden Markov Models for modelling developer involvement, Markov models for software usage profiles and their use in software testing, as well as clustering of project phases. This does not mean I am limited to software engineering. For example, I am currently working on collaborations with bioinformatics researchers, both on what infrastructures could look like and on how methods can be scaled.
- Reproducible empirical research. Data science is by definition empirical. However, this does not mean that results are reproducible, e.g., due to lack of data sharing, missing implementations, and so on. In the long term, this leads to a lack of comparability between research results and problems with the external validity of studies. I am working on providing solutions to this problem, e.g., through cloud platforms, sharing replication kits, and other strategies for sharing information within a research community in addition to the publications themselves.
Persona “Software Engineer”
The software engineer in me makes sure that what I do is actually usable, extensible, and meets a certain standard of quality. He is also the reason why I apply data science mainly to software engineering problems. However, he also has some interests of his own:
- Probabilistic Model-based Testing. Since I like building statistical models of software, I also want to use them in practice. One application is to use them in combination with model-based testing. In this setting, someone defines a model of the software under test, and then the statistical model is used to derive test cases.
- Software engineering practices for data science. Since I come from both fields, I combine my software quality background with practical experience from many data analysis projects to define best practices for data science projects, so that the resulting analytics software meets a certain standard of quality.
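
To make the probabilistic model-based testing idea above a bit more concrete, here is a minimal sketch in Python. It assumes the statistical model is a first-order Markov chain over usage states. The state names, the probabilities, and the function `generate_test_case` are hypothetical and only serve as an illustration; they are not taken from any of my tools.

```python
import random

# Hypothetical usage profile: every state maps to its possible
# successor states together with the transition probabilities.
USAGE_PROFILE = {
    "start": [("login", 1.0)],
    "login": [("search", 0.7), ("logout", 0.3)],
    "search": [("view_item", 0.6), ("search", 0.2), ("logout", 0.2)],
    "view_item": [("search", 0.5), ("logout", 0.5)],
    "logout": [("end", 1.0)],
}


def generate_test_case(profile, rng, max_steps=50):
    """Derive one test case as a random walk from 'start' towards 'end'.

    The returned list contains the visited states without 'start' and 'end'.
    If 'end' is not reached within max_steps, the walk is cut off and the
    partial sequence is returned.
    """
    state = "start"
    path = []
    for _ in range(max_steps):
        successors, weights = zip(*profile[state])
        # Draw the next state according to the transition probabilities.
        state = rng.choices(successors, weights=weights, k=1)[0]
        if state == "end":
            break
        path.append(state)
    return path


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so that the test suite is repeatable
    for _ in range(5):
        print(" -> ".join(generate_test_case(USAGE_PROFILE, rng)))
```

Each generated sequence would then be mapped to concrete actions on the software under test. Because the walk follows the transition probabilities, frequently used paths are tested more often than rare ones.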