Rethink Big Data

Reflection on and doubts about big data have never stopped. They fall mainly into two areas: the limitations of big data itself and the ethical problems it creates.

In the last post, I discussed the problems caused by big data itself. Big data has some inherent defects: it is used to infer the whole from a part, or the future from the past, and such predictions always carry some deviation.

Besides, a new digital divide may emerge. Big data can indeed improve the efficiency of decision making. At the same time, however, challenges such as privacy, interoperability between systems and imperfect algorithms can pile up in developing countries. Big data needs matching infrastructure, such as facilities designed for large-scale, distributed, data-intensive work, highly efficient storage, and networks that can import large data sets quickly.


(Google’s secret data center)

Another problem that cannot be ignored is ethics, mainly privacy, which I have discussed in several previous posts. Consider the most classic case of the big data era. As reported in early 2012, an American man stormed into the Target store near his home and angrily asked the manager: “How could you send coupons for baby diapers and bassinets to my daughter? She’s only 17!” The manager apologized immediately. A month later, however, the angry father called back to apologize: his daughter really was pregnant. Not only Target, but also Google, Yahoo, Apple, Twitter, advertising companies, data analysis companies, software companies and others are collecting users’ private data. How to protect the public’s privacy from violation is a big challenge for the future.

Big data is a product of its time and will have a significant impact on society. Improving the accuracy of decision making and facing the problems caused by big data will take effort from all sectors of society.



View Big Data Rationally

Big data is a hot topic, but we should stay rational when trying to understand it. Big data is not perfect.

Sampling bias always exists, and big data does not go beyond statistics. First, what is sampling bias? A good example comes from the Second World War. The British Royal Air Force wanted to strengthen aircraft armor against German anti-aircraft fire. Because of limits on carrying capacity, only part of each plane could be reinforced, so they asked a statistician for help. After carefully studying the bullet holes on planes that returned to base, he reached a surprising conclusion: reinforce the parts with no bullet holes. He explained that planes hit in those parts had crashed and never returned. Statistics uses a part to infer the whole, or the past to predict the future.
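The statistician’s reasoning can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the four plane sections, the loss probabilities and the function name `simulate` are all hypothetical and are not taken from any historical data.

```python
import random

SECTIONS = ["engine", "fuselage", "wings", "tail"]
# Hypothetical chance that a hit in each section brings the plane down.
LOSS_PROBABILITY = {"engine": 0.8, "fuselage": 0.1, "wings": 0.1, "tail": 0.1}


def simulate(num_planes=10_000, seed=42):
    """Return the share of engine hits among all planes and among returned planes."""
    rng = random.Random(seed)
    all_hits = []
    returned_hits = []
    for _ in range(num_planes):
        section = rng.choice(SECTIONS)  # each plane takes one hit, placed uniformly
        all_hits.append(section)
        if rng.random() >= LOSS_PROBABILITY[section]:
            returned_hits.append(section)  # only survivors can be inspected
    share_all = all_hits.count("engine") / len(all_hits)
    share_returned = returned_hits.count("engine") / len(returned_hits)
    return share_all, share_returned


share_all, share_returned = simulate()
print(f"engine hits among all planes:      {share_all:.1%}")
print(f"engine hits among returned planes: {share_returned:.1%}")
```

Under these assumptions, engine hits are much rarer among the planes that return than among all planes, which is exactly why the undamaged parts of the survivors are the ones worth reinforcing.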

However, sampling bias is also the biggest weakness of statistics, because it can invalidate a conclusion when a part is used to infer the whole. In the era of big data, sampling bias still affects the accuracy of conclusions. For technical and commercial reasons, the data collected cannot cover every scenario. Besides, even if we know the past data, the world is always changing.

A conclusion drawn from big data is a conclusion about the whole, not about any individual. Even if the accuracy is 99%, there are still one million mismatches when the number of samples is 100 million.
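The arithmetic behind the 99% example is easy to check. The helper name `expected_mismatches` is just an illustration.

```python
def expected_mismatches(num_samples, accuracy_percent):
    """Number of wrong results, using integer arithmetic to avoid float rounding."""
    return num_samples * (100 - accuracy_percent) // 100


print(expected_mismatches(100_000_000, 99))  # 1000000
```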

Big data yields correlations, not causal relationships. For example, as ice cream sales rise, the number of drownings rises too. The two are positively correlated. So do ice cream sales cause drowning? Of course not. Hot weather increases both ice cream sales and the likelihood that people play in the water.
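The ice cream example can be sketched with synthetic data. Everything here is an assumption made for illustration: the temperature range, the coefficients and the noise levels are invented, and `pearson` and `residuals` are small hand-written helpers rather than any library’s API. Temperature drives both series, and once its effect is removed the correlation between them should largely disappear.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(xs, ys):
    """What is left of ys after removing the best straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(0)
# Hypothetical daily data: hot weather pushes up both series independently.
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream_sales = [20 * t + rng.gauss(0, 50) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream_sales, drownings)
controlled = pearson(
    residuals(temperature, ice_cream_sales),
    residuals(temperature, drownings),
)
print(f"correlation of sales and drownings:    {raw:.2f}")
print(f"after removing the temperature effect: {controlled:.2f}")
```

The first number is strongly positive even though neither series influences the other in the simulation; the second is close to zero because the shared cause has been taken out.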

In conclusion, big data has its own limitations. Data analysis alone cannot guarantee a correct conclusion. Most of the time, traditional analysis methods and experience should be considered together with big data.

More Possibilities for the Future

In the previous post, I described some future possibilities of big data: uncovering the reasons behind results, being applied in traditional fields and reshaping labor relations around market demand. But that is not all. Big data can change the world in more industries and fields.

The model of love and marriage can be transformed. Individuals can find an accurate match based on big data analysis. A couple’s hobbies, special talents, financial situation, profession and so on can be mined in depth and matched accurately.

The traditional family model may be reshaped. People will be grouped by data instead of by region. People with similar data traits can live together to pool resources and live more efficiently.

Big data may even create the next Lady Gaga. Social media has a great impact on sales of songs and albums. People comment on and share their favorite music on Twitter, Facebook and YouTube. By tracking this data online, we can learn what people care about, what is currently popular and which singers are gradually gaining recognition. By considering all these characteristics, the next Lady Gaga can be predicted.

If you can identify how much energy a person or a building uses, you can reduce that consumption. A massive amount of data is suddenly emerging from sensors, devices and the web, and tapping into this energy data gives it a whole new meaning. As a result, the tools of big data may some day be a fundamental way to help the world curb energy consumption.

For more possibilities, check the following video:



Future of Big Data

Big data has been a new and popular term in recent years. Given the trend toward global digitization and networking, some people even call it “the fifth wave of science and technology”. However, big data is still young, and it has strong potential for development.

The way of analyzing data will become smarter and deeper. The result will be not only the relationship between targets, but also the reason behind it. Think of the diaper market. To know what sells best and what ranks best or worst, thousands of reviews are collected and analyzed. Through text analytics, three questions can be answered: why did it sell so well, why did people not like it, and what do they want? For example, if words like “price”, “special” and “value” are the most frequently mentioned and are then analyzed further, this may tell managers that customers buy diapers not because of quality or features, but because of price. Managers can then make a new strategy to increase diaper sales volume and profit.
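The first step of this kind of text analytics, counting which words come up most often in reviews, can be sketched in a few lines. The reviews, the stop-word list and the function name `keyword_counts` are all invented for illustration; a real analysis would use thousands of reviews and a proper stop-word list.

```python
import re
from collections import Counter

# Hypothetical diaper reviews, made up for this example.
REVIEWS = [
    "Great price and real value for a big pack.",
    "Bought it on special, the price was hard to beat.",
    "Soft enough, but I mainly chose it for the price and value.",
    "Decent value. The special offer made the price even better.",
]
# A tiny hand-picked stop-word list covering only the words used above.
STOP_WORDS = {"a", "and", "but", "for", "i", "it", "on", "the", "to", "was"}


def keyword_counts(reviews):
    """Count how often each non-stop word appears across all reviews."""
    words = []
    for review in reviews:
        words.extend(re.findall(r"[a-z]+", review.lower()))
    return Counter(word for word in words if word not in STOP_WORDS)


counts = keyword_counts(REVIEWS)
print(counts.most_common(3))  # [('price', 4), ('value', 3), ('special', 2)]
```

Word counts alone only show what customers talk about; deciding why they buy still needs the further analysis the paragraph above mentions.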

Sensors may be everywhere. With technological breakthroughs, sensors are becoming more and more miniaturized. They can even be placed in the human body to detect the chemical environment and subtle changes in organs. As a result, the sources of data are more diverse and big data can be applied more widely.


Big data will be used not only in new areas; in the future, it will also be applied to more traditional areas. It can analyze soil conditions to help improve farming, keep supply and demand coordinated, uncover new growth points, make traffic systems smarter to avoid traffic jams and reduce accidents, and so on.

Traditional labor relations may be reshaped. Through big data platforms, human resources and customer demand can be matched more accurately, so that individual potential is put to better use and the barriers of region, language and culture are broken down.

Thanks to big data, I believe society will become much more efficient and people’s lives will be much easier and more comfortable.



Traditional Data Protection Principles

To protect private data from excessive violation, some data protection principles already exist.

Consent principle: the data subject (the identified or identifiable natural person to whom the personal data relates) has given his or her unambiguous consent. However, the effect is not as good as expected, as discussed in the last post.

Fair and lawful processing principle: in general, processing shall not be contrary to any law (Legislation). This is a broad principle, and there should be no doubt about it.

Purpose (specification and use) limitation principle: personal data may be collected only for specified, explicit and legitimate purposes and may only be processed (used, shared, re-used) in ways compatible with those purposes (OECD). The challenge is how to apply the purpose limitation principle when many uses of data are not known at the time of collection.

Data minimization and data quality principles: the personal data processed must be adequate, relevant and not excessive in relation to the purposes for which it is collected and processed (OECD). The data collected, the choice of data sources, the processing and so on must be appropriate and not excessive in relation to those purposes. Personal data must also be accurate and kept up to date, and must not be kept longer than necessary for the purpose for which it was collected and processed. There are two challenges, however. One is how to limit data collection when the technology relies on inferences, and therefore on the potential of massive databases. The other is that the concept of adequate data for the purpose of processing cannot be accurately defined in such a context.

Transparency principle: the controller is obliged to inform the data subject and notify the national Data Protection Authority prior to any processing activity (Julia M. Fromholz, 2000). For example, many social media websites use a profile settings dashboard to inform users about how their data is used.

Confidentiality and security of processing principle: appropriate technical and organizational measures, such as access control, logging and encryption, must be implemented to protect personal data from accidental or unlawful destruction, loss, alteration, unauthorized disclosure or access, or other forms of unlawful processing (OECD).

As explained above, traditional data protection principles work well in many situations, but some challenges remain. Frequent communication between data subject and data controller is needed, so that the data subject’s private data is not violated and the data controller has adequate data to improve its service.



The Fact of Privacy Protection

To understand privacy in big data, we first have to know what personal data is. “‘Personal data’ means data relating to a living individual who is or can be identified either from the data or from the data in conjunction with other information that is in, or is likely to come into, the possession of the data controller.” (Data Protection Commissioner) The principle for handling data is that any processing must be based on and covered by one of the legal grounds.




What are the legal grounds? One is that “the data subject has given his or her unambiguous consent” to the processing. The consent should be given before processing starts and should contain an active indication of the data subject’s wishes. It should be specific and informed.

The question is: does everything work fine? Have we been informed? Yes, we all receive a document before we download an app or use a service from Google or Microsoft. But the fact is that “It would take the average person about 250 working hours every year, or about 30 full working days to actually read the privacy policies of the websites they visit in a year.” (World Economic Forum Report 2013). Has everyone been reached? Think about Google Street View. I have heard many reports of lawsuits from people who were photographed by a Google Street View car without permission. Imagine a photo of your naked body being accessible to anybody who uses the Google Street View service. Do we really understand the meaning of a privacy agreement? Actually, people don’t always make rational decisions in their own best interests.

In the next post, I will discuss some principles for dealing with this situation.



Privacy issues of Big Data

Before reading this post, it is worth watching the video below first.


Believe it or not, large amounts of our private information are already online, and the amount will keep growing with the fast development of technology. If someone can get access to all your data online, you are naked to that person. He or she can take advantage of your data for personal benefit, and most people are not aware of this situation.

Sometimes it’s even worse. We all know that recently more than one hundred Hollywood celebrities who use iOS fell victim to photo leaks. Don’t think that it’s none of our business. Hackers can hack the accounts of Hollywood celebrities, but they can hack you and me as well. Can you tolerate such a privacy violation? Of course not.

To understand the importance of data privacy, check the following video.


The question is “would you sacrifice privacy to get more convenience?”, or “to what extent would you sacrifice your privacy?”. It is predictable that your private data will become more accessible in the future, and the horrible scenario could come true some day if there are no laws or regulations to stop data holders from abusing our private information.



Is Big Data Useful

Needless to say, big data is a buzzword in the IT and internet industries. However, not many people know what benefits big data can offer them, and many think it has nothing to do with them.

No company can live without data. Every company generates data, and generates it faster and faster. Whether a company can extract insight from the data it collects and make fast, correct decisions can determine success or failure in competition.

T-Mobile USA used big data to predict customer defections by combining customer transaction and interaction data, and claims to have cut customer defections in half in a single quarter. US Xpress saves millions of dollars in operating costs by collecting and analyzing a thousand data elements for optimal fleet management and to drive productivity. (Kotadia, 2012)

How can data influence our lives? The famous “beer and diapers” story reveals the close relationship. The story is set in a Wal-Mart supermarket in the 1990s. While analyzing sales data, a supermarket manager noticed that beer and diapers, two products that seem unrelated, were often bought together, and that this happened more often among young men. The explanation is that mothers often stay home to take care of the kids while fathers go to the supermarket to buy diapers, and when fathers buy diapers they usually pick up a few bottles of beer as well. After Wal-Mart found this pattern, it placed beer and diapers in the same area, and as a result beer sales rose. (Mark Whitehorn, 2006)
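The kind of pattern in the beer and diapers story is usually measured with support, confidence and lift. The sketch below uses eight invented shopping baskets, not real Wal-Mart data, and the function name `association` is hypothetical.

```python
# Hypothetical shopping baskets, made up for this example.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "wipes"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "chips"},
    {"bread", "milk", "wipes"},
    {"beer"},
]


def association(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    total = len(baskets)
    with_antecedent = sum(1 for b in baskets if antecedent in b)
    with_consequent = sum(1 for b in baskets if consequent in b)
    with_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = with_both / total                 # share of baskets containing both items
    confidence = with_both / with_antecedent    # share of antecedent baskets that also have the consequent
    lift = confidence / (with_consequent / total)  # above 1 means bought together more than chance
    return support, confidence, lift


support, confidence, lift = association(BASKETS, "diapers", "beer")
print(f"support={support:.3f} confidence={confidence:.2f} lift={lift:.2f}")
```

In this toy data, three of the four baskets with diapers also contain beer, and the lift is above 1, which is the signal a manager would look for before moving the two products closer together.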

The video below shows how big data changes our lives.



Big data properties

“Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on hand database management tools or traditional data processing applications” -Wikipedia




To pick a few examples from the picture above: Google processes 2,000,000 search queries per minute, Facebook users share 684,478 pieces of content per minute, and 3,600 new photos are shared on Instagram per minute. Data volume is expected to increase 44-fold from 2009 to 2020, from 0.8 zettabytes to 35 zettabytes (the units run gigabytes, terabytes, petabytes, exabytes, zettabytes). This sheer volume of data is the literal meaning of Big Data. (Spencer.S., 2012)
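The 44-fold figure follows directly from the two volumes quoted above. The short calculation below also derives the yearly growth rate those numbers imply, assuming steady compound growth over the eleven years; that assumption is mine and not part of the cited source.

```python
ZETTABYTE = 10**21  # bytes, decimal (SI) definition

start_zb, end_zb = 0.8, 35.0   # volumes quoted for 2009 and 2020
years = 2020 - 2009

growth_factor = end_zb / start_zb                 # 43.75, usually rounded to 44
annual_growth = growth_factor ** (1 / years) - 1  # implied compound yearly growth

print(f"growth factor: {growth_factor:.2f}x")
print(f"implied annual growth: {annual_growth:.1%}")
print(f"35 ZB in bytes: {35 * ZETTABYTE:.2e}")
```

Under the steady-growth assumption, the data volume would have to grow by roughly 41% every year to get from 0.8 to 35 zettabytes.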

Information is exploding. Digital devices are becoming more and more capable while their prices fall. Between 1990 and 2005, more than one billion people worldwide entered the middle class. As they get richer, they engage more with the digital world. There are 4.6 billion mobile phone subscriptions worldwide, and more and more people have more than one mobile phone. This is the source of the data explosion. (The Economist, 2010)

The types of data are varied: for example, relational data such as transactions and electronic health records, text data such as documents, semi-structured data such as XML, Wikipedia and Amazon, graph data from social networks and disease networks, and so on.

Because of the volume and variety of big data, traditional database management systems cannot cope well, and so NoSQL has arisen.

Data is also generated very quickly and needs to be processed quickly. Opportunities can be missed because of late decisions. Real-time analysis is necessary. For example, the marketing effectiveness of a promotion can be improved while it is still running, and analyzing users’ feedback on a product in real time can improve the product and give companies an advantage over competitors.



