learndirect Pathways

Data Science Tools and Technologies: An Applied Overview

Podcast episode 52: Data Science Tools and Technologies: An Applied Overview. Alex and Sam explore key concepts from the Pearson BTEC Higher Nationals in Computing. Full transcript included.

Series: HTQ Computing: The Study Podcast  |  Module: Unit 10: Business Process Support  |  Episode 52 of 80  |  Hosts: Alex with Sam, Computing Specialist
Key Takeaways
  • Python has become the dominant language for data science due to its readability, versatility, and rich ecosystem of libraries.
  • Pandas provides powerful data manipulation capabilities including reading, cleaning, transforming, and aggregating structured data.
  • Scikit-learn offers a consistent interface for applying a wide range of machine learning algorithms to classification, regression, and clustering problems.
  • Cloud platforms including AWS, Google Cloud, and Microsoft Azure provide scalable infrastructure and managed data science services that reduce setup overhead.
  • Jupyter Notebooks are the standard environment for exploratory data analysis, combining code, visualisation, and narrative text in a single, shareable document.
Full Transcript

Alex: Today we're exploring data science tools and technologies. Sam, the field seems to move so fast. Where should students focus?

Sam: Focus on the fundamentals that have lasting value rather than chasing every new tool. Python is the clear choice as a primary language for data science; it has the richest ecosystem, the largest community, and is used in both academia and industry. Alongside Python, understanding SQL for working with databases remains essential. Beyond that, the specific libraries and platforms you choose depend on what you're doing.
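The pairing of SQL and Python that Sam describes can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module with a hypothetical `sales` table; the table name and data are invented for the example.

```python
import sqlite3
import pandas as pd

# Build a small in-memory SQLite database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 95.5)],
)

# SQL does the aggregation; pandas receives the result as a DataFrame.
totals = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region",
    conn,
)
print(totals)
```

The division of labour here is typical: the database aggregates, and pandas takes over for further analysis and visualisation once the result set is small enough to hold in memory.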

Alex: Let's talk about the Python data science ecosystem.

Sam: The core libraries are NumPy, which provides efficient numerical computation; pandas, which provides data structures like the DataFrame for working with tabular data; and matplotlib and seaborn for visualisation. For machine learning, scikit-learn is the standard library for classical machine learning algorithms. For deep learning, TensorFlow and PyTorch are the dominant frameworks. And Jupyter Notebooks provide the standard interactive environment where you can combine code, output, visualisations, and narrative text.
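The libraries Sam lists slot together in a typical workflow: pandas holds the tabular data, and scikit-learn's uniform fit/predict interface trains a model on it. A minimal sketch, using invented hours-studied versus exam-score data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical tabular data: hours studied vs. exam score.
df = pd.DataFrame({"hours": [1, 2, 3, 4, 5], "score": [52, 55, 61, 64, 70]})

# pandas for inspection; NumPy arrays sit underneath the columns.
print(df.describe())

# scikit-learn's consistent interface: construct, fit, predict.
model = LinearRegression()
model.fit(df[["hours"]], df["score"])
pred = model.predict(pd.DataFrame({"hours": [6]}))
print(round(float(pred[0]), 1))  # predicted score for 6 hours of study
```

Every scikit-learn estimator, from linear regression to random forests, follows this same construct/fit/predict pattern, which is what makes it easy to swap algorithms while keeping the surrounding code unchanged.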

Alex: How does cloud computing fit in?

Sam: Increasingly, data science work is done in the cloud. AWS, Google Cloud, and Microsoft Azure all offer managed data science services: storage for large datasets, managed Jupyter environments like AWS SageMaker and Google Colab, data processing services for large-scale ETL work, and machine learning model training and deployment platforms. Cloud computing removes the barrier of having to manage your own hardware and makes it practical to work with much larger datasets than would fit on a laptop.

Alex: What about data engineering? How does that relate to data science?

Sam: Data engineering is the discipline of building and maintaining the infrastructure and pipelines that make data available for analysis. A data engineer builds the ETL processes, the data warehouses, and the streaming infrastructure that data scientists use. Without good data engineering, data scientists spend the majority of their time cleaning and preparing data rather than doing analysis. The two disciplines are closely related and often overlap.
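The extract-transform-load pattern Sam mentions can be sketched as three small functions. Everything here is in-memory and hypothetical (the order data, the duplicate-and-missing-value problems, the dictionary standing in for a warehouse); a real pipeline would read from source systems and write to warehouse tables.

```python
import pandas as pd

def extract():
    # Raw records from a hypothetical source, with a duplicate
    # order and a missing amount to clean up downstream.
    return pd.DataFrame({
        "order_id": [1, 2, 2, 3],
        "amount": [10.0, None, 20.0, 30.0],
    })

def transform(raw):
    # Deduplicate, fill any remaining missing amounts, derive a column.
    tidy = raw.drop_duplicates(subset="order_id", keep="last")
    tidy = tidy.fillna({"amount": 0.0})
    tidy["amount_inc_vat"] = tidy["amount"] * 1.2
    return tidy

def load(tidy, store):
    # Stand-in for writing to a warehouse table.
    store["orders"] = tidy

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse["orders"])
```

The point of the structure is separation of concerns: each stage can be tested, scheduled, and scaled independently, which is exactly what orchestration tools in real data engineering stacks manage at scale.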

Alex: What skills should a computing student prioritise to become competitive in this space?

Sam: Python proficiency, particularly with pandas and scikit-learn, is the most immediately valuable. SQL competence for database querying. Basic statistical understanding, including probability, distributions, and hypothesis testing. And an understanding of how to evaluate and communicate the results of analysis. The glamorous parts of data science, the machine learning models, are only useful if you can clean the data, understand the results, and communicate them to people who can act on them.
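The hypothesis-testing skill Sam mentions is often exercised through SciPy. A minimal sketch, with invented A/B data (load times for two page variants), asking whether the difference in means is plausibly due to chance:

```python
from scipy import stats

# Hypothetical A/B results: page load times (seconds) for two variants.
variant_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
variant_b = [12.9, 13.1, 12.8, 13.0, 13.2, 12.7]

# Two-sample t-test: a small p-value suggests the difference
# in means is unlikely to be random noise.
t_stat, p_value = stats.ttest_ind(variant_a, variant_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Running and interpreting a test like this, then explaining to a non-technical stakeholder what the p-value does and does not mean, is precisely the evaluate-and-communicate skill the transcript describes.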

Alex: Any tools outside of programming that are worth knowing?

Sam: Business intelligence tools like Tableau and Power BI are widely used for reporting and dashboards and are worth understanding even if they're not primary tools for a computing specialist. Excel remains ubiquitous in business settings and sophisticated Excel skills are genuinely valued. And version control with Git is as important in data science work as it is in software development, particularly as teams have moved toward reproducible, code-based analytical workflows.

Alex: Brilliant. Thanks Sam. Next we look at applying data science techniques to business problems.