Data science is an interdisciplinary field that uses scientific methods, algorithms, and software to draw informed conclusions from data. This guide explains in more depth what data science is and why it has so many people so excited, and it outlines some resources for taking the next steps into the data science field.
Throughout the 2010s, there was an undeniable revolution in the scale at which data influenced business and society. Around the world, our perception of the power of data moved from some numbers on an Excel spreadsheet to digital gold, technological warfare, a force to save lives. So why is data so much more important now than it was 100, 50, or even 20 years ago? To understand the expanding influence of data, you need to understand the genesis and rising popularity of those who manage it: data scientists.
Data Science is Everywhere
While being a data scientist has been touted as the sexiest job of the 21st century, it wasn’t long ago that this term hardly existed in the mainstream.
The data scientist’s “role” in the 20th century was held mainly by statisticians, primarily in academic and medical research areas. For example, if a group of researchers wanted to see the impact of a specific medication on a group of patients, they could manually recruit 300 patients. Of those, 100 would be given the drug, 100 would get a placebo medication, and 100 patients would get nothing.
By quantifying each person’s response to the medication (or non-medication), a statistician could take these results and run a statistical test, commonly known as a hypothesis test, to assess whether the drug made a statistically significant impact on the patients who took it. If the test showed less than a 5 percent probability that a difference this large would appear by chance alone (the 95 percent rule of thumb), the results were accepted, and if those results were favorable, they could potentially be proposed to the FDA for mass retail.
These calculations in and of themselves are not too complex, but because separate calculations had to be made for each of the 300 patients, they could certainly get tedious. Still, with dedicated manual work or by leveraging some of the budding statistical software of the late 20th century (SPSS and SAS, among others), creating statistically reliable and actionable results was doable, albeit not scalable.
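To make the idea of a hypothesis test concrete, here is a minimal sketch that compares a drug group against a placebo group at the conventional 5 percent level. It uses a permutation test, which is one simple kind of hypothesis test and not necessarily the one a statistician of that era would have chosen. The scores, the function name `permutation_p_value`, and the two-group setup (the no-treatment group is left out) are all assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_treated]) - mean(pooled[n_treated:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical improvement scores (higher is better), invented for this sketch.
drug = [7.1, 6.8, 8.0, 7.5, 6.9, 7.7, 8.2, 7.3]
placebo = [5.9, 6.2, 6.0, 6.6, 5.5, 6.4, 6.1, 5.8]

p_value = permutation_p_value(drug, placebo)
print(f"p-value: {p_value:.4f}")
print("significant at the 5% level" if p_value < 0.05 else "not significant")
```

A small p-value means a gap this large would rarely show up if the drug did nothing, which is the sense in which a result is called statistically significant.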
Because of the constraints on computing power, or the speed at which computers could make these calculations, expanding this software to a level that businesses could leverage in real-time was out of the question. Sure, a company could perform some basic analysis on its different customer demographics, but something on par with Amazon or Netflix’s modern-day ability to take millions of users’ unique interests and make live recommendations was not even close to doable, given technological constraints.
If you’ve seen any visuals on the exponential growth of computing power over time, you’ll know how quickly processing speed increases. Historical estimates suggest that the maximum operations per second achievable by a high-speed computer increased by roughly 500x between 1997 and 2007 and by another 100x or more from 2007 to today.
The potential of data had always existed, but it’s only over the last 10-20 years that it’s been usable at scale. With modern technology and computing power, a company like Netflix can easily generate recommendations for millions of users in real-time.
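As a rough back-of-the-envelope check, the 500x figure for 1997 to 2007 can be turned into an implied yearly growth rate and doubling time. This sketch assumes the growth was smooth and exponential over the decade, which is a simplification; the variable names are illustrative.

```python
import math

growth_1997_2007 = 500  # overall speed-up quoted in the text
years = 10

# Constant yearly factor that compounds to 500x over ten years.
annual_factor = growth_1997_2007 ** (1 / years)

# Time needed for speed to double at that yearly factor.
doubling_time_years = math.log(2) / math.log(annual_factor)

print(f"implied annual growth factor: {annual_factor:.2f}x")
print(f"implied doubling time: {doubling_time_years * 12:.1f} months")
```

Under that assumption, speed nearly doubled every year, which is why an analysis that was impractical in 1997 could become routine a decade later.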
Data Science Definition
So why the name change from statistician to data scientist? Creating the technical infrastructure behind something like the Netflix recommendation system, for example, requires so much more than understanding statistics. Namely, there is a high degree of software development, both backend (logic) and frontend (aesthetics), that creates the framework around a program of that complexity and usability.
But why call them data scientists instead of software developers? The required understanding of the statistical algorithms that go into creating a recommendation system is far beyond the toolbox of your typical software developer. And thus, we have the modern data scientist’s skillset: It is the joining of statistical acumen and software development skills to drive business value.
What Do Data Scientists Do?
The issue with the data scientist umbrella is that it is extremely general and often buzzwordy. Some data scientists have little to no overlap in their jobs; one could be more heavily weighted toward programming, another toward statistics, and another toward business analysis. Let’s cover the broad spectrum of roles within “data science.”
Machine learning engineer
Machine learning closely matches the recommendation system I outlined above: it uses historical data to make future predictions. Netflix can take the millions of people who watched The Office and say, “Oh, it looks like 88 percent of these people also watched and enjoyed Parks and Recreation, which is a higher percentage than any other show! So for anyone who has watched The Office but has never watched Parks and Recreation, we can make a proper recommendation with a certain degree of confidence that the user will be interested.”
A machine learning engineer’s role is to build a system like this that works for any show a user may watch and that makes the suggestions with the highest probability of generating user interest. This requires programming ability and an understanding of common machine learning algorithms (linear models, tree-based models, neural networks, etc.).
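The co-watching logic in The Office example can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Netflix actually builds recommendations: the viewing histories, user names, and the functions `co_watch_shares` and `recommend` are all invented here, and the sketch only tracks what was watched, not whether it was enjoyed.

```python
from collections import Counter


def co_watch_shares(history, show):
    """Share of `show`'s viewers who also watched each other show."""
    viewers = [shows for shows in history.values() if show in shows]
    counts = Counter(
        other for shows in viewers for other in shows if other != show
    )
    return {other: n / len(viewers) for other, n in counts.items()}


def recommend(history, user, show):
    """Most co-watched show the user has not seen yet, or None."""
    shares = co_watch_shares(history, show)
    unseen = {s: share for s, share in shares.items() if s not in history[user]}
    return max(unseen, key=unseen.get) if unseen else None


# Hypothetical viewing histories: user -> set of shows watched.
history = {
    "ana": {"The Office", "Parks and Recreation"},
    "ben": {"The Office", "Parks and Recreation", "Community"},
    "cam": {"The Office", "Parks and Recreation"},
    "dee": {"The Office", "Community"},
    "eli": {"The Office"},
}

print(co_watch_shares(history, "The Office"))
print(recommend(history, "eli", "The Office"))
```

A production system would weigh far more signals than this, but the core step is the same: turn historical behavior into a score and recommend the highest-scoring item the user has not seen.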
Data analyst
In today’s landscape, a data analyst is often someone experienced with querying and organizing data in SQL so that it can be passed off to a machine learning engineer. The term can also refer to someone proficient in working with data in Excel or creating elegant dashboards/visualizations in Tableau or Power BI.
There is a substantial amount of overlap between data analysts, business analysts, and data visualization analysts. Generally speaking, these roles are more client-focused and people-oriented than that of a machine learning engineer, who is typically more heads-down in code and development.
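For a sense of what querying and organizing data in SQL looks like, here is a small sketch that runs a summary query through Python’s built-in `sqlite3` module. The `views` table, its columns, and its rows are made up for illustration; a real analyst would query an existing company database rather than build one in memory.

```python
import sqlite3

# In-memory database holding a hypothetical table of viewing sessions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (user_id TEXT, title TEXT, minutes INTEGER)")
conn.executemany(
    "INSERT INTO views VALUES (?, ?, ?)",
    [
        ("u1", "The Office", 120),
        ("u2", "The Office", 45),
        ("u1", "Parks and Recreation", 60),
        ("u3", "Parks and Recreation", 30),
    ],
)

# Summarize viewers and total watch time per title, most watched first.
rows = conn.execute(
    """
    SELECT title, COUNT(DISTINCT user_id) AS viewers, SUM(minutes) AS total_minutes
    FROM views
    GROUP BY title
    ORDER BY total_minutes DESC
    """
).fetchall()
conn.close()

for title, viewers, total_minutes in rows:
    print(f"{title}: {viewers} viewers, {total_minutes} minutes")
```

The output of a query like this is the kind of tidy summary table that feeds a dashboard or gets handed to a machine learning engineer.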
Data engineer
On the more developer-heavy side of the equation, a data engineer is often directly involved in the logistics behind the deployment of algorithms. In general, the data pipelines that connect different data-related elements of a business are built out and overseen by data engineers. In the Netflix example, this would be the person who takes the machine learning engineer’s model and deploys it live on the Netflix application. That includes structuring the model to read users’ data, make the necessary calculations, and generate a visible recommendation for each user, which is not a simple task!
Data architect
With rapidly increasing computing power comes a growing amount of data throughout the tech ecosystem, and storing all that information is not as easy as it may seem. Data architects are responsible for designing the databases in which a company’s data is stored and from which it is extracted. Companies can also use those databases to communicate. This is extremely important for any scaling business that wants to operate without major headaches.
Even with all these specifications, you will undoubtedly see roles labeled “Data Scientist.” So what types of skills may these roles call for? All of the skills listed above. As a data science consultant, I’ve gotten requests ranging from creating business intelligence dashboards to building out a database’s infrastructure to creating voice recognition software. Do I, or most other data scientists, have the skills to fulfill all these requests? No. It is so easy for a data scientist to get bogged down by the countless skills that exist within the space.
Time and time again, I see this pressure to learn everything as an instigator of imposter syndrome in many data scientists. Rather than trying to conquer all the skills within the data science umbrella, finding and developing the specific skills that fit best with your interests and strengths can be much less painful. Once you have the skills you like, it doesn’t matter how someone classifies a job title, just that you enjoy and thrive within the work itself!
Data Science Tools
Data science is software-intensive.
There is a lot of variety in coding languages and software that a data scientist (or someone within that umbrella of jobs) may utilize. That said, Python appears to be the consensus favorite for most data-related tasks, particularly as it relates to analysis, machine learning, deep learning, data cleaning, and other tasks.
The packages within Python that typically get the most love are Pandas, NumPy, scikit-learn, TensorFlow, and Keras. Outside of Python, R is often used for comparable tasks, though it tends to be more heavily leveraged by those who favor statistics over programming. Python and R don’t scale particularly well to massive datasets (millions of rows, multiple GB or TB of data, etc.), so you’ll often see many of these tasks being run in SQL or in a cloud-based application that leverages SQL or a combination of the above languages.
Where Do Data Scientists Work?
As data science has expanded into the mainstream over the last ten years, its applications emerged where you’d expect them to be the most valuable: finance (particularly market forecasting and trading), advertising (using a user’s interests to recommend the advertisements they are most likely to be interested in), and tech in general (see the above Netflix example). However, as we move into a more data-centric era of society, we’ve started to see data science positions popping up in virtually every industry.
As a freelancer, I’ve completed jobs for healthcare nonprofits, major sports leagues, environmental researchers, logistics coordinators, and fitness applications. Companies across the board are starting to realize the value of using information to derive objective conclusions that can help them improve what they do, whether it’s their bottom-line metrics, user experience, or otherwise. This has led to a substantial increase in the demand for data-related skills.
Whereas a Ph.D. was very much the understood requirement to be a “real” data scientist ten years ago, demand has knocked down the barrier to entry to include those with a bachelor’s degree. We’re even seeing a growing movement to remove the emphasis on degrees altogether; employers increasingly want to find people with a knack and curiosity for problem-solving, understanding that concrete technical skills can be taught once they find the right minds.
In terms of the potential for the future of data science positions, we are still largely at the mercy of computing power constraints. While we’ve reached the capacity to make proper recommendation systems in real-time, imagine the remarkable volume of calculations needed to build out an advanced self-driving car or a human-like robot. Processing speeds are still a ways away from the next big technological leaps in data science and deep learning, but those leaps will undoubtedly be embraced when microchips can handle them.
How to Get Started in Data Science
If all of this sounds exciting to you and in line with where you want your career to go, then there are plenty of things you can do to make the pivot to data science. The checklist typically goes as follows (with the understanding that each data scientist has a different skillset, and there’s no one way to get there):
1. Solid proficiency in Python and Pandas. Pandas is the data analysis tool you will use within Python to work with data.
2. A working understanding of linear and tree-based machine learning algorithms, the statistical theory that comes with them, and the ability to use them to make predictions in Python using scikit-learn. You could potentially get away with skipping this step, but understanding machine learning will go a long way in your project flexibility and help when job hunting.
3. Build a portfolio of projects that use real-world data.
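To give a feel for the linear side of the second item, here is a minimal sketch of ordinary least squares with a single input, written with only the standard library so it runs anywhere. In practice you would more likely reach for scikit-learn’s `LinearRegression`, which handles many inputs at once. The data and the names `fit_line` and `predict` are invented for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def predict(slope, intercept, x):
    """Predicted output for a new input under the fitted line."""
    return intercept + slope * x


# Hypothetical historical data: hours of practice vs. test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
print(f"score = {intercept:.1f} + {slope:.1f} * hours")
print(f"predicted score after 6 hours: {predict(slope, intercept, 6):.1f}")
```

Working through a fit like this by hand once makes it easier to understand what a library is doing when you later call it on a real dataset.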
If step 3 sounds intimidating, try reframing it to something as simple as “what question is interesting to me?” If there’s data out there that’s relevant to your question, then try working with it once you gain proficiency in Python and Pandas.
A great place to start if you’re looking for projects to work on is the Kaggle dataset library, which has a myriad of rich, clean datasets that could inspire potential projects. I once did an entire project on Yelp reviews in New York City because the ratings seemed a lot lower than in the Midwest, where I grew up. I did a full regional breakdown of Yelp reviews among different demographics and locations. I found some cool stuff, all because of some simple curiosity!
Above all else, that curiosity and desire to answer questions is the top requirement to thrive in this genuinely limitless field. Given the abundance of free resources to learn these skills (YouTube, Coursera, etc.), anyone with internet access is moments away from starting their data science journey.