Author: yafeishi2012

[Note]Chapter6 Geographic information visualization

Map projection

  • equal angle
    • Mercator Projection
  • equal area
    • Albers Projection
  • equal distance
    • Azimuthal Projection

Point data visualization


Discrete points of geographic space including longitude and latitude coordinate, no size property.

Vis approaches

  • Mark on the map directly
  • Indicate numeric properties by color and size
  • Apply different icons and symbols to indicate different data types
    • symbols should be easily perceived
    • the number of symbols shouldn’t be too large
    • use legend to explain the meaning of symbol
  • visualize vectorial data by icons like arrow


  • many overlaps when quantities of data.
    • Specially, overlaps occur in intensive place, whereas, blank occurs in sparse place.
  • unreasonable of use of screen space
    • a great deal of information is concentrated in small area, however, other areas are completely empty


  • split map to various blocks. Visualize the statistic data of each block.
    • showing data in discrete way, one method is to split map area into grids. Visualization can be 3D histogram with drawback of covering each other and can use color without drawback of covering.
    • showing data in continuous way by heatmap. Compared with discrete block, the granularity is finer.
  • Apply appropriate layout algorithm and utilize the rendering and integrating method to show each data object. This can express more details compared with the above method.

line data visualization


Line data with the property of length, is the data to connect two or more points.

Vis approaches

  • basic visualization methods such as color, line type, width and mark.
  • To avoid visual confusion, cluster a large quantity of data to bunches.
    • this kind of line binding technology is called flow map. There are some algorithm to improve vis performance.

Regional data


Regional data contains more information than point data and line data. It can express regional properties like population intensity and per-capita data.


  • Choropleth map
    • Color matching suggestion tool: ColorBrewer
    • Problem: data distribution and geographic region are asymmetric. Normally large data set is concentrated in limited population intensive region.
  • Cartogram map
    • Method: reshape each region according to the property of geographic region.
    • Problem: hard to guarantee the relative position of different region.
    • Solution: continuous Cartogram. This algorithm can keep the relative position of neighboring regions.
  • Map of regular shape
    • Rectangle or circle
  • Multi relational map
    • different geographic regions can have associated relation.
      • Approach: Bubble Sets. Problem: visual confusion when in quantity. Solution: LinSets which connecting a set of points by one curve.

spatio-temporal graphic data

  • frequently used approach: sequential animation.

Complex geographic data vis analysis

Map can carry various types of complex information, which can be formed as multi semantic map.

  • Approach: multiple views to help analyze multiple variables.


  • map labeling
  • map generalization which has the aim to filter and draw map information according to the showing size.
  • online map – limited bandwidth vs real-time interaction



陈为 沈则潜 陶煜波. (2013). 数据可视化. 电子工业出版社. (第六章)

Theory Behind NoSQL

1.1 CAP

CAP is first published by Eric Brewer in 2000 and is the basic theorem that describes any distributed system. (Brewer, 2000)  Figure10 explains what CAP stands for.



Figure 1. CAP stands for Consistency, Availability and Tolerance to network Partitions.

  • Consistency – Any node in the system has the same data with other nodes.
  • Availability – A request will always be responded even though there is a node going down.
  • Partition Tolerance – The system continues to operate even there is a partition or communication break between two nodes. (Brewer, 2000)

The truth is that it’s impossible to meet all the three guarantees.  If consistency is focused, the failure of write operation caused by system unavailability may happen. If availability is focused,  It’s possible that read operation can’t get the latest value of write operation. It’s the focus of system who determines the strategy and normally a combination of two guarantees are chosen.

1.2 Election of CAP guarantee


Figure 2. Databases pick up  two guarantees from CAP

Figure shows how different databases discussed in this paper make their choice. Caution should be taken that each database picks up two guarantee, but it does not mean it losing everything in the third guarantee.

The explanation of possible selections is discussed as following.

  • CA – Data among all nodes is same (consistency) and in order to meet this condition, read or write operation can be accessed in any node (availability). Nevertheless, if there is a communication break occurred in the system, the data won’t be synchronized and won’t be re-synchronized even when the partition is resolved.
  • CP – Consistency and Partition Tolerance are guaranteed. If partition occurs in the system,  in order to ensure consistency, nodes will be regarded as going down (losing availability).
  • AP – Nodes can still be reached even though there is communication failure. Meanwhile, the data can be re-synchronized if the communication break is resolved.

1.3 ACID and BASE

To guarantee the database transactions, a set of properties of ACID is applied.

ACID stands for:

  • Atomicity – Everything in a transaction works successfully or failed.
  • Consistency – The transaction should ensure the database transfer from one consistent state to another consistent state. The consistent state means data in database should meet the integrity constraint.
  • Isolation – Transactions are isolated which means different transactions won’t affect each other.
  • Durability – If data is committed, it remains its state even though there are errors, crashes or power loss.(Gray,& Reuter, 2007)

In order to handle transactions and keep data safe, traditional RDBMS works following the principle of ACID. NoSQL databases on the other hand need to guarantee the availability and scalability for storing lots more data and using a distributed set of servers working together. ACID in this case cannot guarantee this property and then BASE comes up.

BASE stands for Basically Available, Soft State and Eventually Consistency.  (Mapanga & Kadebu, 2013)

  • Basically Available – The system gives up the high availability as regards CAP theorem which means the data can be failed to be responded when it’s in an inconsistent or changing state. But basically, availability is guaranteed.
  • Soft State – The state of the system can change over time due to eventual consistency property. Partition tolerance is supported, but the state can be asynchronous for some time.

evntual consis.png

Figure 3. Eventual consistency model

  • Eventual Consistency – Observing figure12, suppose Node A is written with DataN, the system should guarantee that any read operation obtains the latest value DataN if there are no other write operations to Node A, B and C before the read operations of Node B and C are completed. However, the inconsistency problem could occur due to the interactive delay, system load or the number of replications.

strong consis.png

Figure 4. Strong consistency model

On the contrary to eventual consistency, strong consistency is quite different.

  • Strong consistency – Observing figure13, if Node A is written with DataN, the latest value DataN is always returned from Node B and C. However,  in order to ensure consistency, any read operation is blocked until the update of replications.

BASE model is against to ACID model, as it sacrifices high consistency to achieve availability or reliability.  Relational database is designed based on strong consistency guarantee. To some extent, it loses scalability and performance. In current modern companies like Google, Amazon, Twitter etc., scalability availablity and performance are concerned more, as a result, most NoSQL databases follows BASE model.

1.4 Concurrency control

Assuming a data record is accessed by two users, there is no problem when they read the data record at the same time, but what will happen if they update the data record simultaneously? Collision happens then. Concurrency control is used to deal with the issue which allows multiple users to access the same data simultaneously.

Concurrency control in databases is normally working with transactions. A Transaction is a series of data operations that have to follow ACID guarantee. Most relational management systems like MySQL support transaction, as users of RDBMS consider consistency and integrity of data  as high preference.  However, not all modern new databases support transaction such as MongoDB. If MongoDB operates on a single document, operations are always atomic. But operations on multiple documents are not atomic. In this case, multi-document transactions can not be executed. Fortunately, MongoDB can apply two-phase commits to offer transaction-like semantics. (MongoDB online manual – Perform Two Phase Commits). Whereas, NoSQL databases discussed in this paper Redis, HBASE and Neo4j support transaction.

If interleaving operations are allowed, without proper concurrency control, problems such as the lost update problem, the dirty read problem and the incorrect summary problem can occur.

  • The lost update problem – from figure 14, normally transaction C wanted to update value3, but it reads 8 actually.

lost update.png

Figure 5. The lost update problem

  • The dirty read – transaction B reads a value from transaction A. But transact A has aborted the value at the same time. Transaction A does not want transaction B to read that value actually.
  • The incorrect summary – – transaction A contains data item X(10) and data item Y(20). Both two values add 10. At the same time transaction is reading the result form transaction A. After the calculation, the data items in transaction A are X(20) and Y(30). But transaction may get the result X(20) and Y(20) which Y should be 30.

The mechanism of concurrency control categories (Bernstein, Hadzilacos & Goodman, 1987):

  • Pessimistic lock blocks every operation violating data integrity. It supposes that the probability of other users who attempt to visit or change the object which you are visiting or changing is so high that it will lock the object until you have committed the update operation. Definitely pessimistic lock enhances the operational integrity, but the drawback is that the delay may be too long which can restrict the access of other users.
  • Optimistic lock will check whether data integrity has been violated only when committing the operations. Contrary to pessimistic lock, it assumes that the probability of other users who attempt to visit or change the object which you are visiting or changing is low. Therefore, it will keep unlocking when you read or change the object and lock the object when you are ready to commit the change of object. The concurrent performance is improved, but it cannot solve the dirty read problem. If the second user reads the object before the first user commits the change, the database detects the change before the first user finishes the change and wants to commit, as a result, the first user has to re-read the object and then make the change.

The selection of two lock categories depends on the application requirements. If concurrency doesn’t happen often and dirty read problem is not allowed, pessimistic lock is preferred.  If there are a lot of concurrency issues, pessimistic lock can degrade the performance significantly, as a result, optimistic lock is preferred.

There are a lot of specific concurrency methods (Bernstein, Hadzilacos & Goodman, 1987).

  • Locking methods

Read lock – If one user applies read lock on one object, other users can also read the object. But any write operation of the object is blocked. Figure 15 illustrates how read lock works.

read lock.png

Figure 6. A flow chart to explain how read lock is performed.

Write lock – If one user applies write lock, other users can not read nor write on the object. Figure16 illustrates how write lock works.

write clock.png

Figure 7. A flow chart to explain how write lock performance

  • Multiversion concurrency control (MVCC) generates a new version of  a database object when the object is written and read the most recent version of the object based on timestamp order or increasing transaction ID.

Consider two concurrent writes shown in figure 17.

concurr write.png

Figure 8. Two writes to the same row

The performance of MVCC based on row lock operation is illustrated in the following figure 18.


Figure 9. A flow char to explain how MVCC is performed.

Considering the features of different databases, the methods of concurrent control is varied.

  • Mysql supports MVCC in InnoDB store engine and Row-level locking and MVCC in InnoDB store engine and locking the whole table in MyISAM store engine. For the two lock levels: table lock (All rows in the locked table are locked) and row lock (Some rows of a table are locked, but other rows are not locked).
  • As Redis is in single-process and single-thread mode, it adopts queue mode to replace concurrency access into serial access. There is no locking concept in Redis.
  • MongoDB supports read lock and write lock for concurrency control. However, read and write operations can yield their locks. It uses adaptive algorithms based predicted disk access patterns to resolve this problem. (MongoDB FAQ)
  • Hbase guarantees high effective concurrency control through row level lock and MVCC methods.
  • Neo4j applies default locking behavior. When there are changes on properties, nodes and relationships, write locks will be taken.  (Neo4j Manual)


Bernstein, P. A., Hadzilacos, V., & Goodman, N. (1987). Concurrency control and recovery in database systems (Vol. 370). New York: Addison-wesley.

Brewer, E. A. (2000, July). Towards robust distributed systems. In PODC (p. 7).

Gray, J., & Reuter, A. (2007). Transaction processing: concepts and techniques, 1993.

Mapanga, I., & Kadebu, P. (2013). Database Management Systems: A NoSQL Analysis. International journal of Modern Communication Technologies and Research, 1(7).


Database is an organized collection of data. To make operations like definition, querying, update, and administration of database, the special designed software application called Database Management System (DBMS) is necessary. DBMS helps the user to capture and analyze data. It’s classified by database model, for example the most famous model called relational model as the data model is relational. The database model is used to determine the logical structure of database and which manner data can be stored, analyzed and manipulated.

Relational database model is based on first-order predict logic which data is represented by tuples and grouped by relations.

However, when data is not structured and relational, relational DBMS is not capable to manage such kind of data. But relational database model is not good at adapting the change. Because of the various data formats such as hierarchies, cubes, linked-lists and unstructured data, it’s not capable to organize data into tables.

As a solution, NoSQL (not only SQL) comes up. NoSQL database management systems enable data to be stored in a variety of formats like key-value store, column store, graph store and document store. NoSQL called not only SQL is to emphasize that SQL-like query languages may also be supported. But it does not guarantee the true ACID (atomicity, consistency, integrity, and durability) principle. NoSQL database management systems remove hard constraints, such as tabular row store and strict data definition, and have distributed architectures to support high performance throughput. NoSQL databases are widely used in big data and real-time web applications.


Relational databases

Relational databases are the most popular and widely used databases. The data model  organizes data as tables or relations. Each table consists of rows and columns which is illustrated in figure1. Each row has an unique key.


Figure 1. An example of relational database model

Different from NoSQL databases, the data model of relational databases is fixed and the data is structured. It supports transaction management and guarantees true ACID principle. Relational database management system (RDBMS) which is based on relational model has been developed for several decades and still dominates current database market. It’s widely deployed in banks, schools, hospitals, governments and so on due to its properties.


Key-value store database

Key-value store is one of the most simple database management system. Data is stored by key and value illustrated in figure2, and can be retrieved when the key is known so that the complex querying and management functionality of RDBMS is not needed.

A string can represent the key and the actual data can be represented by value. The data can be any kind of data types in programming language such as string, integer, array and so on, or an abstract object which bindings to the key.  The data model is flexible so that the requirement for the formatted data is less strict.


Figure 2. An example of Key-Value database model

Compared to common SQL databases, it contains the advantage of fast speed in storing and retrieving data.  This happens when relations, correlations or collations of data are not necessary. An SQL table can be organized into two columns, a key and a value. In this case, for querying, just find the value and return it. This is very fast.

Specifically speaking, SQL language has great advantages of dealing with structured data and allows highly dynamic queries. However, for current web applications, it’s another case. It’s an object oriented way of thinking such as the back-end database of MVC (Model-View-Control) pattern, instead of a highly dynamic range of queries which are full of outer and inner joins, unions and complex calculations over large tables. Meanwhile, it will result in complex hierarchies of tables if relational data models are transferred into object oriented models because of large amounts of normalization. For key-value store databases, the data model is schema less and an object can be just represented by a value with a key to identify the object. Therefore, the storage of arbitrary data indexed using a single key to allow retrieval  is allowed. That’s why key-vale store is also called simple store.

The code tends to look clean and simple compared with embedded SQL strings in the programming language. As for object-relational mapping frameworks, a lot of complex code between an SQL database and an object-oriented programming language will be added.


Document store databases 

Data is stored in the document store databases with the data format  such as XML, PDF, JSON etc. The document contains a unique key “ID” to identify a document explicitly and a collection of documents. The example of the data model is illustrated in figure3. Documents of document store databases are similar to records in relational databases. The difference is that the data model in document oriented databases is more flexible as its property of schema-less. New documents, no matter which kinds of attribute are contained, can be stored as adding new attributes  in existing documents at runtime.


Figure 3. The left figure is an example of document format of JSON and the right figure is an example of document format of XML.

Unlike relational databases whose records inside the same database have same data fields, document databases have the property that document may have similar as well as dissimilar data. The data model of document databases is slightly more complex than that of key-value databases, which instead of key-value store, the data model of document store can be represented as key-document pairs. If the database has a lot of relations and normalization, it’s not appropriate to use document database. Instead document stores are used for content management system, blog software etc.


Column store databases

Column store databases support the standard relational logical data model. Databases consist of a collection of tables and each table has a named collection of attributes which are columns instead of rows for relational data model. Attributes can form a unique primary key or foreign key referring to another primary key in another table.

The most different point of the two kinds of databases is that the data model of the relational databases is row oriented, however, on the contrast, the data model of column store databases are column oriented.  Figure4 can illustrate the difference simply and clearly.


Figure 4. The upper figure illustrates the row- oriented data model and the lower figure illustrates the column-oriented data model.

From figure4, in a row-oriented database management system, the data would be stored as “1, John Smith, 19; 2, Jim Green, 18; 3, Lucy King, 16; 4, Freda Ford, 15”. Whereas in a column-oriented database management system, the data would be store as “1, 2, 3, 4; John Smith, Jim Green, Lucy King, Freda Ford; 19, 18 ,16, 15”.

The column store databases store the data with the way to be aggregated rapidly with less I/O activities and offer high scalability in data storage. They are efficient in applications including customer relationship management (CRM) systems, electronic library car catalogs, data warehousing and other ad-hoc query systems.


Graph store databases

Graph store databases are common used to handle relationships due to their efficient management of heavily linked data. (Neo4j)  The data model of graph contains nodes representing entities which hold proper types and  numbers of properties like key-value pairs. Figure5 is an example of the graph data model. The connection between two nodes are revealed by directed, named semantic relationships. The relationships also have properties such as know, own, like etc. Two nodes can have not only one number or type of relationships , but many if there are. As relationships are stored efficiently, this won’t sacrifice performance.


Figure 5. An example of graph data model. The ellipses represent nodes. Each node is a data entity with types and values.  The arrows represent connections with  relationships and their properties.



MySQL vs Redis vs MongoDB vs HBase vs Neo4j

1.1 MySQL

MySQL is the second most widely used relational database management system. As it’s open source, nothing needs to be paid for using it. It uses a standard form of the well-known SQL data language and can run on many operating systems and embedded in many programming languages.(MySQL online manual)

1.2 Redis

Redis is one of  the most popular key-value store database.  The data type of keys can be not only simple strings, but also hashes, lists, sets, sorted sets and bitmaps, with a number of server-side atomic operations associated to these data types, which atomic operation means that the processing of read or write operation is not disturbed by other read or write operations until it is completed.

The data sets of Redis are kept in memory so that it can achieve very high writing and reading speed with the limitation that data sets can’t be larger than memory. Meanwhile, the memory representation of complex data structures is much simpler to manipulate compared to the same data structure on disk, as internal complexity is reduced. The complex data sets are simply stored as a value if they are not necessary to be mapped into relational data structure.  (Redis FAQ)

Redis supports master-slave replication which is simple to use and configure. This makes Redis servers be exact copies of master servers so that it won’t be an issue if even few lost records are not acceptable in the application. Data is written asynchronously. If there is system crash, the last few queries can get lost. However, this is acceptable in many applications. (Redis Manual) Redis is recommended to run on Linux.

1.3 MongoDB

MongoDB is a document-oriented database. MongoDB supports document data with the format BSON (binary JSON).  The data model of JSON provides seamless mapping to native programming language types. As the schema is dynamic, it’s flexible to evolve the data model.

For simple query, the speed in MongoDB where the relational data sets are simply kept in a single document is faster than that in a relational database where relational data is separated into multiple tables. Join operations in relational databases are eliminated in MongoDB.

MongoDB supports indexes. Indexes can make the execution of queries more efficient. MongoDB has to scan every document in a collection to meet the match of query statement without indexes. All indexes in MongoDB are B-tree (Comer, 1979) indexes. In this case,  only the smallest possible number of documents is scanned, which optimizes the performance of query. Figure6 simply illustrates how indexes work in MongoDB.


Figure 1. Illustration of how MongoDB indexes work

MongoDB also has property of ease of use with features of easy installation, configuration, maintenance.  MongoDB can run on Linux, Windows and OS X. (MongoDB online manual)


HBase known as Hadoop database, is a column-store oriented and scalable distributed store system with high reliability and high performance. HBase can be used to set up large-scale structured storage clusters on cheap PC servers.

HBase is an open source after Google’s BigTable. The reason that HBase is called Hadoop database, is because HBase relies on Hadoop HDFS as its file storage system, Hadoop MapReduce to realize massive data processing and Zookeeper to do coordination. Figure7 illustrates the role of  HBase in Hadoop ecosystem.(Taylor, 2010)

hadoop ecosys

Figure 2. Diagram of Hadoop ecosystem

As shown in table 1, the data model of HBase contains “table” and “column family”.


Table 1. Data model of HBase

Table – Collection of rows. When table becomes larger and larger with the increasing number of records, it will spit into multiple parts called regions.  RowKey is the unique primary key of the table.

Column family – Collection of columns. One table can have one or multiple column families and one column family consists of any number of columns. Column family supports dynamic scaling without predefined number or types of columns.

1.5 Neo4j

Neo4j is a graph-store database. Neo4j reveals the relationship of data sets. The data model of Neo4j is also schema less.

Neo4j expresses queries as traversals. A graph is traversed by visiting the nodes following relationships under some rules. In reality, not the whole graph is visited, as not all nodes and relationships are interested in the graph. Neo4j supports fast deep traversal instead of slow SQL queries. Figure8 illustrates how traversal works.


Figure 3. Illustration of how traversal works.

Neo4j also supports indexes. It’s quite useful when a specific node or relationship can be found based on its property and the speed is faster than traversal which needs to loop up the entire graph. The figure9 illustrates how index look up works in Neo4j.


Figure 4. Illustration of how index look up works in Neo4j.

Rethink Big Data

Reflection and doubt to big data have never stopped. They are mainly in two sides: big data itself and ethic problems created by big data.

In the last post, I discussed the problems causing by big data itself. There are some congenital defects in big data. As big data is used to predict the whole by part or the future by past. However, there is always deviation existed in this prediction.

Besides, new digital divide may occur. Indeed, big data can improve the decision-making efficiency. But in the meantime, challenges of privacy, interoperability between the systems, not perfect algorithm and so on can be accumulated in developing countries. Big data needs matched infrastructure such as facilities designed for large scale distributed data intensive work, high efficient store facilities, network facilities for fast large data set importing etc.


(google’s secret data center)

Another non-ignorable problem is ethic issues which is mainly about the privacy problems which I have discussed several posts previously. Rethink it by the most classic case in big data era. In the early 2012, one American burst into the Target shop near his home and angrily questioned the manager:” how could you send baby diapers and discount coupon of bassinet to my daughter? She’s only 17!.” The manager apologized immediately. However, after one month, this angry father called back and apologized as his daughter was really pregnant. Not only Target, but also google, yahoo, apple,twitter, advertisement companies, data analysis companies,software companies etc. are collecting users’ private data. How to protect public privacy from violation is a big challenge in the future.

Big data is a creature of the age and will have significant impact on current society. How to improve the accuracy of decision making and face on problems causing by big data needs efforts from all sectors of society.



The New York Times. 2012 The Age of Big Data FROM

The Economist, 2010, Data, data everywhere. FROM

Capella.J., Horne.A., & Shah.S. (2012) Good Data Won’t Guarantee Good Decisions. From

Danah.B.. 2010. Privacy and Publicity in the Context of Big Data.WWW. Raleigh, North Carolina

View Big Data Rationally

Big data is a hot topic. But we should keep rationality to understand big data. Big data is not perfect.

Sample deviation always exists and big data does not exceed statistics. First what is sample deviation? One good example is from the second world war. The British royal air force wanted to strengthen aircraft armor to defend against anti-aircraft fire of German army. As the limit of carrying capacity, they could only strengthen part of armor. So they asked a statistician for help. After careful observation of ammunition mark from planes returning to airport, he gave a surprising conclusion that strengthen the part without ammunition mark. He explained that the planes who had ammunition mark in that part had crashed. Statistics is using part to speculate the whole or past to predict future.

However the biggest weakness is also sample deviation, as it can cause conclusion failure when part speculates the whole. In the era of big data, sample deviation still effects the accuracy of conclusion. Due to the reason of technology and benefit, the collected data of big data cannot cover every scene link. Besides, even though we know the past data, the world is always changing.

The conclusion of big data is an overall conclusion, not individual conclusion. Even though the accuracy is 99%, there are still million mismatches if the number of samples is 100 million.

Big data can get correlated conclusion instead of causal relationship. For example, with the rise of sales volume of ice cream, the number of drowning people rises. They are in positive correlation relationship. So can sales of ice cream cause drown? Of course not. That’s because the hot weather can increase the sales of ice cream and the possibility that people play in water at the same time.

As a conclusion, big data has its own limitations. Pure data analysis cannot guarantee the correct conclusion. Most times, traditional analysis approaches and experience should be considered together with big data.

More Possibilities of Future

In the previous post, big data has the future possibilities which can excavate the reason behind, be applied in traditional region and reshape labor relation with market demand, but that’s not all of big data. Big data can change the world in more industries and regions.

Love and marriage model can be transformed. Individual can get accurate match based on big data analysis. Couple’s hobby, special talents, financial situation, profession and so on can be excavated deeply and matched accurately.

Traditional family model may be reshaped. People will be grouped by data instead of region. People with similar data trait can live together to realize resource integration and high efficiency of lifestyle.

Big data may even create the next lady gaga. Social media has great impact on sales of songs and albums. Peoples make comments and share their favorite music on twitter, facebook and youtube. By tracking this data online, we can know people’s concern, present popular points and which singer’s awareness is promoting gradually. And by considering all the characteristics, the next lady gaga can be predicted.

If you can identify how much energy a person or a building uses, you can reduce its consumption. From sensors, devices and the web, a massive amount of data is suddenly emerging, which taps into energy data and then results into a whole new meaning. As a result, the tools of big data can some day be a fundamental way to help the world curb energy consumption.

For more possibilities, check the following video:



Fehrenbacher.K. (2012)How big data can curb the world’s energy consumption. From

Higginbotham. S. (2012) Can gigabytes predict the next Lady Gaga? From

Staff. E. (2012) 10 ways big data changes everything From

Future of Big Data

Big data is a new and hot word in recent years. As the trend of global digitization and networked characteristics, some people even call this “ the fifth wave of science and technology”. However, big data is still a young kid, there is strong potential of development.

The way of analyzing data will be smarter and deeper. The result will not only the relationship of targets, but the reason why it’s so. Think of the market of diapers. To know what sells best and what ranks best/worst, thousands of reviews are collected and analyzed. By text analytics, three questions can be answered: why did it sell so sell, why people did not like it and what do they want. For example, if words like “price”, “special” and “value” are mostly frequently mentioned and then are further analyzed, this may tell manager the reason customers buy diapers not because of the quality or features, but price. And managers can make new strategy to sell diaper to  increase sales volume and get more profit.

Sensors may be existed everywhere. As the break through of technology, sensors are more and more miniaturized. They can even put into human body to detect chemical environment and subtle changes of organs. As a result, the source collection of data is more diversified and Big Data can be applied more widely.


Big data will not only be used in new areas, but in the future, it will be applied to more traditional areas. It can analyze the status of soil to help improve agriculture farming, keep the coordination of supply and demand and mine the new growth point, make traffic system smarter avoiding traffic jams and reducing accidents, etc.

Traditional labor relation may be reshaped. Through big data platform, human resource and customer demand can be matched more accurately so that the individual potential can be put more to good use and break the barriers of region, language and culture.

Thank for the existence of big data, I believe society efficiency will be improved much more and people’s life will be much easier and more comfortable.



Spiegel. B. (2014) The future of big data, from

Park. H. (2014) Bigger, Better, Faster, Stronger: The Future of Big Data. From

Ohio state university, Big Data Future, from