The story of how data science became such an interesting subject is really a story about how statistics, a mature discipline, complements a very young one: computer science. The term data science came into use rather recently, primarily to outline a new profession that is expected to make sense of the great amount of data available to us. However, making sense of data, that is, extracting knowledge and drawing conclusions from it, has its own history, in which the paths of many scientists, statisticians, librarians and computer scientists have crossed. Since the term data science is often used interchangeably with the term big data, I would like to emphasize the difference, or better yet, the connection between them.
Collecting (big data) does not mean discovering (data science).
In other words, big data deals with the methods and technologies for collecting and managing tremendous quantities of structured and unstructured data, whereas data science creates models that uncover hidden behavioral patterns in complex systems and data, and puts these models to work in live applications.
What do data scientists do?
A data scientist’s job can be explained briefly with the following example. Imagine a special kind of diver who has one week to dive through and map a murky, choppy sea the size of the Mediterranean, with a muddy bed, low visibility, sharp rocks and sharks that are sometimes fake and sometimes real. This diver’s goal is to learn something no one else has discovered yet, although he is not actually certain what he is looking for, or whether anything (interesting enough) can be found at all. A data scientist is guided by his previous experience and by what he has just seen, learned or concluded, but his primary task is to explore. He is well aware of the technical limits he constantly faces, but he doesn’t let them get in the way of his work. As soon as he confirms a new discovery, he shares it with others and suggests a way of using it to make a decision or to improve a natural, business or research process. He is satisfied and fulfilled when his discovery is confirmed in practice.
What’s the catch?
Okay, it’s not as if the diver from the previous story only has an oxygen tank, an underwater flashlight and a pair of fins with him. He also has ultra-modern gadgets that help him automatically capture, process, categorize and correlate the data he gathers, and he moves around in a kind of super-fast underwater vehicle, so it is all more manageable.
This technological moment is the turning point for data science. Probably the liveliest field in technology today is the development of tools that can collect and process huge amounts of data in real time. On the infrastructure side, there are technologies such as Hadoop, Spark, NoSQL databases, graph databases, MPP appliances, integration tools, frameworks, security concepts, languages (R, Python, Scala), machine learning algorithms and real-time solutions, in the cloud or on-premises… Picture 2 might help, with a warning that it might cause dizziness.
Looking at how things have developed over the last few years, and with the picture above in mind, it is clear that we are in the middle of a small but rapid technological revolution. It is not limited to this area (there is also artificial intelligence, IoT and many others), and it still has no clear winners, no proclaimed official standards and no significant number of successful users.
Unicorns – illusion or reality?
Because of these circumstances, and primarily because of the exponential growth of data science as a discipline, a rather remarkable confusion has arisen. People are increasingly asking who a data scientist actually is, that is, which specific set of skills he or she has to possess. And then comes the list… Well, of course, he has to know how to work with SQL, NoSQL, Hadoop/HDFS, MapReduce, Hive, Spark, Storm and other platforms. He also has to know how to program in R and Python, and preferably even in Java and Scala. He has to know statistics, predictive modelling and machine learning well. He has to be excellent in the business domain he analyzes. Furthermore, he has to master visualization techniques. He also has to be a good communicator, able to sell his story and findings to management; in short, he has to have good soft skills. When you add everything up, it seems as if everybody is looking for a unicorn. And those who wish to have a go in this area often lose their determination once they see the list of skills they’d have to master.
Future prospects
This revolution is expected to continue over the next couple of years, and it will surely bring progress that affects the approach to data science. The fields in which I expect the greatest progress are:
New data sources
When you hear someone talking about the Internet of Things (IoT), this is the area the story belongs to. Whereas everyone used to concentrate on traditional data sets, such as sales transactions, data scientists are now going to try to extract value from data generated by product lines, vehicles, roads… Most of this data will take the form of time series, each of which will bring its own set of challenges.
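One such challenge is that sensor readings rarely arrive at regular intervals, and sensors sometimes drop out. Below is a minimal sketch, using only the Python standard library, of how irregular readings could be averaged into fixed one-minute buckets while keeping gaps visible. The function name `resample_mean` and the sample readings are hypothetical and serve only as an illustration.

```python
from datetime import datetime, timedelta
from statistics import mean


def resample_mean(readings, start, step, n_buckets):
    """Average (timestamp, value) readings into fixed-width time buckets.

    Buckets without any reading are returned as None so gaps stay visible.
    """
    buckets = [[] for _ in range(n_buckets)]
    for ts, value in readings:
        index = (ts - start) // step  # timedelta // timedelta gives an int
        if 0 <= index < n_buckets:
            buckets[index].append(value)
    return [mean(b) if b else None for b in buckets]


# Hypothetical, irregularly spaced temperature readings from a vehicle sensor.
start = datetime(2024, 1, 1, 12, 0)
readings = [
    (start + timedelta(seconds=5), 20.0),
    (start + timedelta(seconds=40), 22.0),
    (start + timedelta(seconds=70), 21.0),
    # nothing between 12:02 and 12:03: the sensor dropped out
    (start + timedelta(seconds=185), 25.0),
]

series = resample_mean(readings, start, timedelta(minutes=1), 4)
print(series)  # [21.0, 21.0, None, 25.0]
```

Leaving the empty bucket as `None` instead of silently filling it in is a deliberate choice here: how to treat a gap (interpolate, carry forward, discard) depends on the sensor and on the question being asked.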
Tools and technologies that will come in handy
At this moment, open source is a great driver of progress. The number of open-source libraries written in R or Python that become available every day keeps increasing. Machine learning algorithms for regression or classification problems, which you’d have had to write from scratch 5 or 10 years ago, are now accurate, tested and available with a single import from the scikit-learn Python package.
Separation of skills into different roles
This is closely linked to the previous point. If tools appear that offer the power of Python or Spark but are as simple as Excel, in time more and more people in sales, production and other departments will begin using them and doing jobs similar to those data scientists do today. Thus, the artificial bubble in which people see data scientists will deflate. I also believe that the infrastructural (architectural) aspect will separate from the pure data or programming one. Infrastructure still defines the way in which someone will carry out data retrieval and processing.
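As an illustration of how little code this takes today, here is a minimal sketch of a classification model built with scikit-learn. It assumes the `scikit-learn` package is installed and uses the small Iris data set that ships with the library; the choice of logistic regression and of the split parameters is arbitrary and only for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in data set: flower measurements (X) and species labels (y).
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to check the model on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The classifier itself comes ready-made; no algorithm is written from scratch.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Swapping in a different algorithm is mostly a matter of changing the import and the line that creates `model`, which is exactly what makes these libraries so accessible.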
Soft skills
Data scientists have to be good at selling their ideas to management. They have to know how to convince management that their discovery is valuable and that it is worth continuing the work and further research. Here, visualization is half the job, while the other half is pure marketing. We do know that data scientists, being primarily technical people, are better at using R or Python than at giving presentations in front of an audience. It would therefore do no harm for universities to devote more attention to teaching soft skills.
Where do I, an average Croatian IT professional, stand?
Bearing in mind the still relatively traditional and inert policy of our educational institutions, which supply the IT labor market with fresh young people suited to what are still the most popular, classic projects (ERP, basic web applications, document management systems and the like), I fear the Croatian market will remain in the data science recognition and acceptance phase for some time. The whole data science field in Croatia still relies mainly on the enthusiasm of a few individuals, startups, meetup groups and a few conferences.
Conclusion
The future of data is clear and inevitable. We are all invited to contribute to its development, and I believe there is room, and soon, I hope, also a need, for anyone interested. While companies abroad are literally fighting over data scientists, in Croatia we are still a few years behind. It is up to companies to be the first to recognize the benefits and to dare to invest in something that certainly won’t go to waste. Data mining should become a standard, continuous process for every respectable company, because data contains all the information needed for strategy development, decision-making and business progress.
DATA SCIENCE IN PRACTICE
Practical applications vary, as they depend on the industry or the process being analyzed.
Example 1: UPS, the package delivery company, analyzed information about its vehicles’ movements and compared it with fuel consumption data, and then decided to introduce a so-called no-left-turn policy, i.e. its drivers avoid turning left. Based on the available information, the company concluded that turning left is not only far riskier from a safety standpoint but also consumes more fuel! As a result, it has managed to save about 10 million liters of fuel per year and to reduce harmful gas emissions (the equivalent of 5,300 cars).
Example 2: Telecoms are starting to use so-called location-based marketing, which takes information about a user’s location in real time and combines it with historical information and the user’s habits in order to create a unique offer for every user at a given moment. An example of location-based marketing: the weather is hot, you are walking towards a café, and you receive a coupon for iced coffee via SMS.
Example 3: Nike used the big data story to enter a whole new market. In classic loyalty programs, the company occasionally sends discount brochures to customers who are willing to share their basic information (address, age, …). Now, with the growing popularity of wearable devices, Nike has started, with prior customer consent, to gather and process the information those devices transmit (performance, monitoring of health functions). The company is now able to give its buyers, for example, health-related advice and to warn them about potential anomalies in their data, thus strengthening the bond with its customers, to mutual satisfaction.
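The location-based marketing idea from Example 2 can be sketched as a simple rule that combines real-time location, current weather and purchase history. This is only an illustrative sketch: the `User` class, the `pick_offer` function, the thresholds and the coordinates are all hypothetical, and a real telecom system would be far more elaborate. The distance is computed with the standard haversine formula.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


@dataclass
class User:
    lat: float             # real-time location
    lon: float
    past_purchases: tuple  # historical habits (hypothetical representation)


def pick_offer(user, cafe_lat, cafe_lon, temperature_c,
               max_distance_km=0.3, hot_threshold_c=28.0):
    """Return a coupon only if the user is nearby, it is hot, and they buy coffee."""
    near = haversine_km(user.lat, user.lon, cafe_lat, cafe_lon) <= max_distance_km
    hot = temperature_c >= hot_threshold_c
    likes_coffee = "coffee" in user.past_purchases
    if near and hot and likes_coffee:
        return "iced coffee coupon"
    return None


# Hypothetical user walking roughly 120 m from a cafe on a hot day.
user = User(lat=45.8150, lon=15.9819, past_purchases=("coffee", "sandwich"))
print(pick_offer(user, 45.8160, 15.9825, temperature_c=32.0))  # iced coffee coupon
print(pick_offer(user, 45.8160, 15.9825, temperature_c=15.0))  # None
```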