Dyr og Data

Data science

Gavin Simpson

Aarhus University

Mona Larsen

Aarhus University

2025-08-25

Data science

Learning objectives

At the end of this topic you should be able to

  • Articulate what data science is

  • Understand at a high level the steps involved in doing data science

  • Describe the roles and skills of a data scientist

What is data science?

Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets

Kelleher & Tierney, p. 1

Related fields

  • Machine learning
  • Data mining

Data science is broader, borrowing from these fields and many others

What is data science?

Figure: the data science Venn diagram, http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Actionable insight

Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets

Kelleher & Tierney, p. 1

Data science outputs are only useful if we or others can make use of them

Insight

Does data science provide us with information that wasn’t obvious?

Actionable

Can we do something useful with the new information?

Example data science problems

Customer segmentation ⟶ clustering
  • find groups of individuals exhibiting similar behaviour
Association rule mining
  • find groups of things that co-occur
  • animals with similar sets of symptoms
Anomaly or outlier detection
  • identifying strange or abnormal events; e.g. fraudulent billing, disease, behaviour
Classification ⟶ prediction
  • develop models to predict some outcome, i.e. a missing piece of data
  • predict disease from risk factors & test results
  • predict disease from a CT scan
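
As an illustration of the anomaly detection problem above, here is a minimal sketch that flags unusual values with a robust z-score built from the median and the median absolute deviation (MAD). The milk yield figures, the function names, and the 3.5 cut-off (a common rule of thumb) are assumptions made for this example; they are not taken from the course material.

```python
from statistics import median


def robust_z_scores(values):
    """Return a robust z-score for each value, based on the median and MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    # 0.6745 rescales the MAD so scores are comparable to ordinary z-scores
    # when the data are roughly normally distributed.
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(values, threshold=3.5):
    """Return the positions of values whose robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Made-up daily milk yields (kg) for one cow; day 5 is deliberately unusual.
daily_yield_kg = [28.1, 27.6, 28.4, 27.9, 28.0, 19.2, 28.3, 27.8]
anomalous_days = flag_anomalies(daily_yield_kg)
print(anomalous_days)
```

A flagged day is only a prompt to look closer: whether a sudden drop reflects disease, a sensor fault, or a recording error is a question for domain expertise.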

Four “A”s of Data Science

  1. Data Architecture

  2. Data Acquisition

  3. Data Analysis

  4. Data Archiving

Data Architecture

Provide input on how data need to be routed and organized to support the

  • analysis,
  • visualization, and
  • presentation of data

How would you organise an Excel worksheet?

Data Acquisition

How should the data be collected and represented prior to analysis?

Important tasks that need to happen before data can be profitably analysed are

  • representing data

  • transforming data

  • grouping

  • linking
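
The tasks listed above can be sketched with plain Python data structures. In this hypothetical example two small tables are linked on a shared key, a unit is transformed, and the result is grouped by herd. All table contents and field names (`animal_id`, `herd`, `weight_g`) are invented for illustration.

```python
from collections import defaultdict

# Representing data: each table is a list of records (dictionaries).
animals = [
    {"animal_id": "A1", "herd": "north"},
    {"animal_id": "A2", "herd": "north"},
    {"animal_id": "A3", "herd": "south"},
]
weighings = [
    {"animal_id": "A1", "weight_g": 612000},
    {"animal_id": "A2", "weight_g": 588000},
    {"animal_id": "A3", "weight_g": 640000},
]

# Linking: look up each animal's herd through the shared animal_id key.
herd_by_animal = {row["animal_id"]: row["herd"] for row in animals}

linked = [
    {
        "animal_id": w["animal_id"],
        "herd": herd_by_animal[w["animal_id"]],
        "weight_kg": w["weight_g"] / 1000,  # Transforming: grams to kilograms.
    }
    for w in weighings
]

# Grouping: collect the weights per herd, then average within each group.
weights_by_herd = defaultdict(list)
for row in linked:
    weights_by_herd[row["herd"]].append(row["weight_kg"])

mean_weight_by_herd = {
    herd: sum(weights) / len(weights) for herd, weights in weights_by_herd.items()
}
print(mean_weight_by_herd)
```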

Data Analysis

How can we summarise data?

Use samples of data to make inferences about the larger context or population

Visualize data and analysis outputs in graphs, tables, animations, dashboards

Communicate the results of the analysis

Data Archiving

How should we preserve data that has been collected?

What forms of the data need to be preserved?

Difficult to anticipate future uses of data
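
One possible approach, sketched here under assumptions, is to archive data in a plain, widely readable format together with metadata that describes the columns and a checksum that lets a future user verify the file is unchanged. The records, column descriptions, and metadata layout below are hypothetical, not a prescribed standard.

```python
import csv
import hashlib
import io
import json

# Made-up records to archive.
records = [
    {"animal_id": "A1", "weight_kg": 612.0},
    {"animal_id": "A2", "weight_kg": 588.0},
]

# Serialise to CSV text; an in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["animal_id", "weight_kg"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Metadata: what each column means, how many rows there are, and a checksum.
metadata = {
    "columns": {
        "animal_id": "unique animal identifier",
        "weight_kg": "body weight in kilograms",
    },
    "n_rows": len(records),
    "sha256": hashlib.sha256(csv_text.encode("utf-8")).hexdigest(),
}
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```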

Skills

Data science skills

Domain expertise

Important to learn the application domain

Need to know enough to

  • understand the problem

  • understand why the problem is important

  • understand how data science might address the problem

Ethics and regulation

If data are important enough to collect, they’re important enough to affect people’s lives

Need to understand ethical issues

  • privacy, personal data-protection

  • biases in data & models

  • limitations of the data

  • prevent misuse

Data wrangling & databases

Working with data, files, & databases is an essential skill

  • understand how data are stored

  • transform data

  • generate metadata

  • link data

  • query databases with SQL
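
To make the last two points concrete, the following sketch builds a small in-memory SQLite database with Python's built-in `sqlite3` module, links two tables with a `JOIN`, and counts visits per species. The table names, columns, and rows are invented for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE animals (
        animal_id TEXT PRIMARY KEY,
        species   TEXT NOT NULL
    );
    CREATE TABLE visits (
        visit_id  INTEGER PRIMARY KEY,
        animal_id TEXT NOT NULL REFERENCES animals(animal_id),
        diagnosis TEXT NOT NULL
    );
    INSERT INTO animals VALUES ('A1', 'cow'), ('A2', 'cow'), ('A3', 'pig');
    INSERT INTO visits (animal_id, diagnosis) VALUES
        ('A1', 'mastitis'), ('A2', 'lameness'),
        ('A1', 'lameness'), ('A3', 'pneumonia');
    """
)

# Link visits to animals on the shared key, then count visits per species.
visits_per_species = conn.execute(
    """
    SELECT a.species, COUNT(*) AS n_visits
    FROM visits AS v
    JOIN animals AS a ON a.animal_id = v.animal_id
    GROUP BY a.species
    ORDER BY a.species
    """
).fetchall()
conn.close()
print(visits_per_species)
```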

Computer science & HPC

Computer science & high-performance computing (HPC) provide algorithms & data structures to tackle increasingly large amounts of data

  • algorithms

  • distributed computing & MapReduce

  • use computer clusters to parallelise operations
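
The map and reduce idea can be sketched on a single machine. In this minimal example each chunk of records is counted independently (the map step) and the partial counts are then merged (the reduce step). A thread pool stands in for the nodes of a cluster, which is an assumption made purely for illustration; the diagnosis records and function names are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(records):
    """Map step: count the diagnoses within one chunk of records."""
    return Counter(records)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Made-up diagnosis records, already split into chunks (one per worker).
chunks = [
    ["mastitis", "lameness", "mastitis"],
    ["lameness", "pneumonia"],
    ["mastitis"],
]

# Each chunk is processed independently, so the map step can run in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(map_chunk, chunks))

total_counts = reduce(reduce_counts, partial_counts, Counter())
print(total_counts)
```

The design relies on the reduce step being associative: partial counts can be merged in any grouping, which is what allows the work to be spread across many machines.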

Data Visualization

Know how to present data in forms that are suitable and that aid decision making

  • theory behind perception

  • encoding data graphically

  • appropriate plots

  • grammar of graphics

  • create infographics

  • dashboards

Statistics & probability

Statistics is the field of science concerned with making inferences from samples of data drawn from larger populations

  • exploratory data analysis

  • summarize data

  • use statistical methods to make inferences

  • communicate results of statistical models
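
As a small worked example of inference from a sample, this sketch estimates a population mean and an approximate 95% confidence interval. It uses a normal approximation, which is an assumption of the sketch; for a sample as small as the one shown, a t-based interval would be more appropriate. The weights and the function name are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, level=0.95):
    """Return (lower, estimate, upper) for the mean, using a normal approximation."""
    n = len(sample)
    estimate = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    # Two-sided critical value, roughly 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * standard_error, estimate, estimate + z * standard_error


# Made-up body weights (kg) for a sample of eight animals from a larger herd.
weights_kg = [612.0, 588.0, 640.0, 601.0, 595.0, 623.0, 610.0, 599.0]
lower, estimate, upper = mean_confidence_interval(weights_kg)
print(f"mean = {estimate:.1f} kg, 95% CI = ({lower:.1f}, {upper:.1f})")
```

Reporting the interval alongside the estimate is one simple way to communicate the uncertainty in a result.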

Machine learning

An offshoot from statistics (statistical learning) & computer science

  • underlying principles of machine learning methods

  • model assessment

  • variable importance

  • neural networks

  • tree-based models

  • prediction vs explanation
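
Model assessment can be illustrated with a hold-out split: fit on one part of the data and measure accuracy on the part the model has not seen. The sketch below uses a one-nearest-neighbour classifier chosen only because it fits in a few lines; the temperature data, labels, and function names are hypothetical.

```python
import random


def nearest_neighbour_predict(train, x):
    """Predict the label of x as the label of the closest training point (1-NN)."""
    closest = min(train, key=lambda row: abs(row[0] - x))
    return closest[1]


def holdout_accuracy(data, test_fraction=0.3, seed=1):
    """Shuffle, hold out a test set, and return accuracy on the held-out rows."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    correct = sum(nearest_neighbour_predict(train, x) == y for x, y in test)
    return correct / len(test)


# Made-up body temperatures (degrees Celsius) with a health label.
data = [
    (38.4, "healthy"), (38.6, "healthy"), (38.5, "healthy"),
    (38.7, "healthy"), (38.3, "healthy"),
    (39.8, "sick"), (40.1, "sick"), (39.9, "sick"),
    (40.3, "sick"), (40.0, "sick"),
]
accuracy = holdout_accuracy(data)
print(accuracy)
```

The two groups in this toy data are cleanly separated, so the held-out accuracy is perfect; real data are rarely this tidy, and a single small test set gives only a rough estimate of performance.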

Communication

Communicating with end users, data generators, etc. is an essential component of any applied science

Need to translate the technical language of animal science, computer science, statistics, and machine learning into the language used in specific domains

  • communicate with specialists

  • communicate with end users

  • aid decision making

  • communicate uncertainty