When people first hear about Big Data, it sounds like a wonderful new, complex technology that can solve the various data problems industries face these days. While that is partly true, from a technical point of view you can simplify the term: it is really just a set of frameworks that help us solve problems with very big data. Confused?
I’ve been living in the telecom world for almost 10 years now. Telecom is one industry that definitely needs big data: a huge amount of data circulates through a telecom network every second. Each network element generates tons of data, and the problem arises when you want to analyze it. Say you are an operations engineer working in a telecom network, and a complaint comes in from a customer about a problem that happened at a specific time. In order to troubleshoot the issue, you must collect various symptoms. You can start with syslogs, CDR logs, and even network trace logs. Assuming all the logs are available, you then need to analyze each one manually, downloading some of them to your local computer or laptop.
Have you ever tried to load a huge Wireshark trace on a laptop? How long does it take? If you have a very nice laptop, you will get the output after maybe 5–10 minutes, and as a bonus your laptop will freeze during that time. If you’re unlucky, it will hang completely. That is one of the problems big data deals with. Oh yes, that huge trace counts as big data (for your laptop).
If you know how to program, your day will be better than any other engineer’s. You can save time: write a program or script in a few minutes or hours, then reuse it in the future. Let your script do the analysis by itself while you’re making a cup of coffee, leave it running while you’re out for lunch with friends or doing some other relaxing stuff. When you’re back, the script will present the results of the analysis for you. Another benefit of programming skills: if your laptop is not powerful enough, you can run the script on a server. Preferably a non-utilized one, of course, e.g. a testbed system, an installation server, or maybe even an under-utilized production server. In my experience, however, even a powerful server may not be enough. Some of my scripts showed that the processing time on my personal laptop (a MacBook Pro with 8 GB of memory) was not much different from the processing time on a high-end server (multiple processors, more than 32 GB of memory). If you’re smart, you will reach for techniques such as multithreading, or a low-level programming language that gives you better processing speed. One day, though, you will decide the effort is too expensive: just to analyze a bunch of strings, you have to write multithreaded code?!
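To make the single-machine approach concrete, here is a rough sketch (my own illustration, not from any specific tool) of that kind of parallel log analysis in Python, using the `multiprocessing` module. The log format and the `ERROR` keyword are illustrative assumptions.

```python
# Sketch: scan a big log for error lines in parallel on one machine.
# The "ERROR" keyword and log lines are hypothetical examples.
from multiprocessing import Pool

def count_errors(lines):
    """Count lines containing the keyword in one chunk of the log."""
    return sum(1 for line in lines if "ERROR" in line)

def parallel_error_count(lines, workers=4):
    """Split the log into chunks and scan the chunks in parallel."""
    chunk_size = max(1, len(lines) // workers)
    chunks = [lines[i:i + chunk_size]
              for i in range(0, len(lines), chunk_size)]
    with Pool(workers) as pool:
        return sum(pool.map(count_errors, chunks))

if __name__ == "__main__":
    sample = ["INFO ok", "ERROR timeout", "ERROR drop", "INFO ok"] * 1000
    print(parallel_error_count(sample))
```

Even with this, you are still limited to the cores and memory of one box, which is exactly the wall the article describes.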
You will need parallel computing that can cut the processing time from, say, 30 minutes to only 5 minutes, depending of course on how much data there is and how complex it is. That’s where big data technologies come in. Your data is distributed across several machines, and you can instruct those machines to perform the analysis in much the same way your script would on a single laptop or server, but in parallel. Of course, you can’t simply make 5 machines work in parallel without any programming effort; there has to be a framework underneath designed to make those machines cooperate. Again, the effort is too expensive while you have other important daily operations tasks to do, right?
This is where the popular technology called Hadoop comes in. It is not the only option, of course, but it is the most actively supported open-source project for big data. You will need to learn and set up Hadoop and its ecosystem, but once that is done you can treat those parallel machines running Hadoop as your magic data store. You have huge data? Throw it into Hadoop and start doing whatever you like with it. You can throw in all your data or only specific data, anything. The best part is that you don’t need high-end servers: you can use any unused servers, low-end servers, or even old PCs.
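The jobs Hadoop distributes follow a map/reduce shape, and with Hadoop Streaming you can even write them as plain scripts. Below is a minimal word-count-style sketch in Python; the functions run locally on plain iterables so the idea is easy to try, whereas on a real cluster the mapper and reducer would read stdin and write stdout. The sample "call log" lines are invented for illustration.

```python
# Sketch of a map/reduce job in the Hadoop Streaming style:
# the mapper emits (key, 1) pairs, the reducer sums counts per key.
# Run locally here; on Hadoop each function would be a separate
# script reading stdin and writing "key<TAB>count" lines.
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Sum the counts per word (Hadoop delivers pairs sorted by key)."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    logs = ["call drop", "call ok", "call drop"]  # hypothetical sample
    for word, total in reducer(mapper(logs)):
        print(f"{word}\t{total}")
```

The point is that the per-record logic stays as simple as a laptop script; Hadoop handles splitting the data, shipping the code to the machines, and merging the results.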
“Ecosystem” in the Hadoop context means you can set up various pieces of software to help you analyze the data saved in Hadoop’s file system (HDFS). Even if you don’t know how to program, knowing database syntax such as SQL is enough to do data analysis or ETL in Hadoop. If you know a specific programming language, you can use your favorite language to interact with your data. Whichever option you choose to communicate with your data in Hadoop, one of the various ecosystem tools will help you accomplish the task.
The example above is taken from the operations perspective of the telecom industry. There are also many use cases for data with the three V’s (Volume, Variety, and Velocity) in telecom for business and marketing purposes.
That’s all for the intro. In the second part I will give some examples of how to use open-source big data frameworks and ecosystems to do data analysis in the telecom industry.