Data science
2025-08-25
At the end of this topic you should be able to
Articulate what data science is
Understand at a high level the steps involved in doing data science
Describe the roles and skills of a data scientist
Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting non-obvious and useful patterns from large data sets
Kelleher & Tierney, p. 1
Related fields
Data science is broader, borrowing from these fields and many others
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Data science outputs are only useful if we or others can make use of them
Does data science provide us with information that wasn’t obvious?
Can we do something useful with the new information?
Data Architecture
Data Acquisition
Data Analysis
Data Archiving
Provide input on how data need to be routed and organized to support the
–
How would you organise an Excel worksheet?
How should the data be collected and represented prior to analysis?
Important tasks that need to happen before data can be profitably analysed are
representing data
transforming data
grouping
linking
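The preparation tasks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended workflow: the two tables, the `animal_id` key, and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a spreadsheet export.
raw_weights = [
    {"animal_id": "A1", "weight_kg": "412.5"},
    {"animal_id": "A2", "weight_kg": "388.0"},
    {"animal_id": "A3", "weight_kg": "430.5"},
]

# A second hypothetical table describing the same animals.
animals = {
    "A1": {"breed": "Angus"},
    "A2": {"breed": "Hereford"},
    "A3": {"breed": "Angus"},
}

# Transform: convert text to numbers so arithmetic is possible.
weights = [
    {"animal_id": r["animal_id"], "weight_kg": float(r["weight_kg"])}
    for r in raw_weights
]

# Link: attach the breed from the second table using the shared key.
linked = [{**w, "breed": animals[w["animal_id"]]["breed"]} for w in weights]

# Group: collect weights by breed, then summarise each group.
by_breed = defaultdict(list)
for row in linked:
    by_breed[row["breed"]].append(row["weight_kg"])

mean_by_breed = {breed: sum(v) / len(v) for breed, v in by_breed.items()}
print(mean_by_breed)  # {'Angus': 421.5, 'Hereford': 388.0}
```

How the data are represented (here, one record per animal with a shared key) determines how easy the later steps are.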
How can we summarise data?
Use samples of data to make inferences about the larger context or population
Visualize data and analysis outputs in graphs, tables, animations, dashboards
Communicate the results of the analysis
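As a rough sketch of using a sample to make an inference about a population, the example below simulates a population, draws a sample, and reports an approximate 95% interval for the mean. The population, the sample size, and the use of the normal critical value 1.96 are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated measurements.
population = [random.gauss(50, 10) for _ in range(10_000)]

# Draw a simple random sample; in practice this is all we would observe.
sample = random.sample(population, 100)

# Summarise the sample.
mean = statistics.mean(sample)
sd = statistics.stdev(sample)
se = sd / math.sqrt(len(sample))  # standard error of the mean

# Approximate 95% interval using the normal critical value 1.96.
low, high = mean - 1.96 * se, mean + 1.96 * se
print(f"sample mean = {mean:.1f}, 95% interval = ({low:.1f}, {high:.1f})")
```

Reporting the interval alongside the mean is one simple way to communicate how uncertain the estimate is.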
How should we preserve data that has been collected?
What forms of the data need to be preserved?
Difficult to anticipate future uses of data
Important to learn the application domain
Need to know enough to
understand the problem
understand why the problem is important
see how data science might address the problem
If data are important enough to collect, they’re important enough to affect people’s lives
Need to understand ethical issues
privacy and personal data protection
biases in data & models
limitations of the data
prevent misuse
Working with data, files, & databases is an essential skill
understand how data are stored
transform data
generate metadata
link data
query databases with SQL
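To make linking and querying concrete, here is a small sketch using an in-memory SQLite database from Python's standard library. The `herds` and `milk_records` tables, their columns, and the values are hypothetical; a real project would connect to an existing database instead.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two hypothetical tables that share the key herd_id.
conn.executescript("""
    CREATE TABLE herds (herd_id TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE milk_records (herd_id TEXT, litres REAL);
    INSERT INTO herds VALUES ('H1', 'North'), ('H2', 'South');
    INSERT INTO milk_records VALUES ('H1', 20.0), ('H1', 24.0), ('H2', 18.5);
""")

# Link the tables with a JOIN, then group and summarise by region.
query = """
    SELECT h.region, COUNT(*) AS n, AVG(m.litres) AS mean_litres
    FROM milk_records AS m
    JOIN herds AS h ON h.herd_id = m.herd_id
    GROUP BY h.region
    ORDER BY h.region
"""
rows = conn.execute(query).fetchall()
conn.close()

for region, n, mean_litres in rows:
    print(region, n, mean_litres)
```

The column names and types in the `CREATE TABLE` statements are a simple form of metadata: they record what each field means and how it is stored.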
Computer science & HPC provide algorithms & data structures to tackle increasingly large amounts of data
algorithms
distributed computing & MapReduce
use computer clusters to parallelise operations
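The map and reduce idea can be illustrated on a single machine. In this sketch, which assumes a toy word-counting task and hypothetical text chunks, each chunk is processed independently (the map step) and the partial results are then merged (the reduce step). On a cluster the map step would run on separate workers; here it runs sequentially.

```python
from collections import Counter
from functools import reduce

# Hypothetical chunks of text; on a cluster each chunk would sit on a different node.
chunks = [
    "data science uses data",
    "science needs data",
    "clusters process data in parallel",
]

def map_chunk(text):
    """Map step: count words within one chunk, independently of the others."""
    return Counter(text.split())

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Each call to map_chunk needs only its own chunk, so the calls could run in parallel.
partial_counts = map(map_chunk, chunks)
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts.most_common(2))  # [('data', 4), ('science', 2)]
```

Because the map step never looks at other chunks, adding more workers lets the same code handle more data.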
Know how to present data in forms that are suitable and that aid decision making
theory behind perception
encoding data graphically
appropriate plots
grammar of graphics
create infographics
dashboards
Statistics is the field of science concerned with making inferences from samples of data drawn from larger populations
exploratory data analysis
summarize data
use statistical methods to make inferences
communicate results of statistical models
Machine learning is an offshoot of statistics (statistical learning) & computer science
underlying principles of machine learning methods
model assessment
variable importance
neural networks
tree-based models
prediction vs explanation
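Model assessment can be sketched by fitting a model on one part of the data and measuring its prediction error on a held-out part. The example assumes simulated data with a known linear relationship and uses a hand-written least-squares fit; the split proportions and helper names (`fit_line`, `rmse`) are illustrative choices.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: y depends linearly on x, plus random noise.
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 2.0 + 3.0 * x + random.gauss(0, 1)) for x in xs]

# Hold out a quarter of the observations for assessment.
random.shuffle(data)
train, test = data[:150], data[150:]

def fit_line(points):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(points, intercept, slope):
    """Root mean squared prediction error on a set of points."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(points))

intercept, slope = fit_line(train)
train_rmse = rmse(train, intercept, slope)
test_rmse = rmse(test, intercept, slope)

print(f"slope = {slope:.2f}")
print(f"train RMSE = {train_rmse:.2f}, test RMSE = {test_rmse:.2f}")
```

The test error estimates how well the model predicts new observations, while the fitted slope is what one would interpret when the goal is explanation rather than prediction.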
Communicating with end users, data generators, etc. is an essential component of any applied science
Need to translate the technical language of animal science, computer science, statistics, and machine learning into the language used in specific domains
communicate with specialists
communicate with end users
aid decision making
communicate uncertainty