- The data science process typically follows a cycle: define the problem, collect data, clean and explore data, model, evaluate, and deploy.
- Data cleaning, also known as data wrangling, often accounts for the majority of time spent on a data science project.
- Exploratory data analysis uses visualisation and summary statistics to understand the structure, distribution, and relationships within a dataset.
- Machine learning models must be validated carefully to ensure they generalise well to new data and are not simply overfitting the training set.
- Communicating the business value of data science results clearly is as important as the technical quality of the analysis itself.
Alex: Today we're looking at how data science techniques are applied to real business problems. Sam, where does the typical data science project start?
Sam: With a business problem, not a dataset. The most common mistake in data science is starting with data and looking for a use for it, rather than starting with a specific business question and then finding or collecting the data needed to answer it. Grounding your work in a genuine business question keeps the effort focused and ensures the output will be actionable.
Alex: Can you walk us through the stages of a data science project?
Sam: The standard framework is CRISP-DM: Cross-Industry Standard Process for Data Mining. It has six phases. Business understanding: what is the question, what does success look like? Data understanding: what data is available, what does it look like, what quality issues does it have? Data preparation: cleaning, transforming, and engineering features. Modelling: selecting and applying appropriate analytical or machine learning techniques. Evaluation: assessing whether the model actually answers the business question. And deployment: making the results available to decision-makers.
Alex: Data cleaning takes up a disproportionate amount of time, doesn't it?
Sam: It does, and this surprises new data scientists who expect to spend most of their time on the exciting modelling work. In practice, real-world data is messy: it has missing values, inconsistent formats, outliers, duplicates, and encoding errors. Cleaning it properly is painstaking but essential. A model trained on poor quality data will produce poor quality predictions, and no amount of sophisticated modelling will compensate for bad data.
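The kinds of issues Sam lists here can be handled in a few lines of pandas. The snippet below is a minimal sketch using a small hypothetical dataset (the column names and values are illustrative, not from the episode): it removes duplicate rows, normalises inconsistent text formats, coerces a badly encoded numeric column, and imputes a missing value.

```python
import pandas as pd

# Hypothetical messy data illustrating common quality issues:
# a duplicate row, inconsistent region formatting, and "n/a" in a numeric column.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "region": ["North", "north ", "north ", "SOUTH", None],
    "revenue": ["1200", "950", "950", "n/a", "15000"],
})

clean = (
    raw
    .drop_duplicates()  # remove exact duplicate rows
    .assign(
        # strip whitespace and normalise casing for consistent categories
        region=lambda d: d["region"].str.strip().str.title(),
        # coerce text to numbers; unparseable values like "n/a" become NaN
        revenue=lambda d: pd.to_numeric(d["revenue"], errors="coerce"),
    )
)

# Impute missing revenue with the median rather than dropping the row.
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].median())
print(clean)
```

Each step here is a judgement call, not a mechanical fix: whether to drop duplicates, how to impute, and which values count as "missing" all depend on the business context.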
Alex: What does exploratory data analysis involve?
Sam: EDA is the investigative phase where you look at the data to understand its structure, distribution, and relationships before applying any models. You calculate summary statistics, plot distributions, look for correlations between variables, and identify anything unusual or unexpected. This phase often generates insights on its own and helps you decide which modelling approaches are likely to be fruitful.
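The EDA steps Sam describes map directly onto a few pandas operations. This sketch generates a small synthetic dataset (the `spend`/`sales` columns are hypothetical) and shows the three basic moves: summary statistics, correlations, and a simple outlier check.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical dataset: marketing spend and resulting sales,
# with a known linear relationship plus noise.
df = pd.DataFrame({"spend": rng.normal(100, 20, 200)})
df["sales"] = 3 * df["spend"] + rng.normal(0, 30, 200)

print(df.describe())  # summary statistics: mean, std, quartiles per column
print(df.corr())      # pairwise correlations between variables

# A crude outlier check: values more than 3 standard deviations from the mean.
outliers = df[(df["spend"] - df["spend"].mean()).abs() > 3 * df["spend"].std()]
print(f"{len(outliers)} potential outliers")
```

In a real project you would pair these numbers with plots (histograms, scatter plots) since summary statistics alone can hide structure in the data.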
Alex: How do you choose which machine learning technique to apply?
Sam: Start with the type of problem. Classification problems, where you want to predict which category something belongs to, suit techniques like logistic regression, decision trees, random forests, and support vector machines. Regression problems, where you want to predict a numerical value, suit linear regression, decision trees, and gradient boosting methods. Clustering problems, where you want to find natural groupings in data, suit algorithms like K-means and DBSCAN. And the choice within each category depends on the size of the data, the interpretability requirements, and the performance trade-offs.
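The three problem types above can be sketched side by side with scikit-learn, using synthetic data in place of a real business dataset. This is an illustrative sketch, not a recommendation of these specific models for any particular problem.

```python
from sklearn.datasets import make_classification, make_regression, make_blobs
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

# Classification: predict which category an observation belongs to.
Xc, yc = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)

# Regression: predict a numerical value.
Xr, yr = make_regression(n_samples=200, noise=10, random_state=0)
reg = LinearRegression().fit(Xr, yr)

# Clustering: find natural groupings without any labels.
Xb, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xb)

print("classification score:", clf.score(Xc, yc))
print("regression R-squared:", reg.score(Xr, yr))
print("clusters found:", len(set(km.labels_)))
```

Note that the classifier and regressor are supervised (they learn from labels `yc` and `yr`), while K-means is unsupervised and sees only the features, which is exactly the distinction Sam draws between the problem types.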
Alex: How do you know if your model is actually good?
Sam: Model evaluation is critical. You never evaluate a model on the same data it was trained on, because that tells you nothing about how it will perform on new data. Instead you split your data into training and test sets, or use cross-validation, and measure performance on data the model hasn't seen. Common metrics include accuracy, precision, recall, and F1 score for classification problems, and mean squared error and R-squared for regression problems. The right metric depends on the business problem.
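The evaluation workflow Sam outlines, holding out a test set, scoring on unseen data, and cross-validating, looks like this in scikit-learn (again on synthetic data, with a random forest standing in for whatever model the project calls for):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# Hold out a test set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

# Score only on the held-out data.
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("f1       :", f1_score(y_test, pred))

# Cross-validation: average performance across several train/test splits,
# which gives a more stable estimate than a single split.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```

Which of these numbers matters most depends on the business problem: a fraud detector where missed cases are costly would weight recall heavily, while a system that triggers expensive manual reviews would weight precision.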
Alex: Thanks Sam. Next we look at data visualisation and storytelling.