Big Data: what big data systems are, how the technologies developed, and what big data means in the modern world

Big data is a set of methods for working with huge volumes of structured or unstructured information. Big data specialists process and analyze this information to obtain results that people can perceive visually. Look At Me talked to professionals and found out what the situation with big data processing is in Russia, and where those who want to work in this field can best study.

Alexey Ryvkin on the main trends in the field of big data, communicating with customers, and the world of numbers

I studied at the Moscow Institute of Electronic Technology. The main thing I took away from there was fundamental knowledge in physics and mathematics. Simultaneously with my studies, I worked at an R&D center, where I was involved in the development and implementation of noise-resistant coding algorithms for secure data transmission. After finishing my bachelor's degree, I entered the master's program in business informatics at the Higher School of Economics. After that I wanted to work at IBS. I was lucky that at the time, due to a large number of projects, there was an additional recruitment of interns, and after several interviews I started working at IBS, one of the largest Russian companies in this field. In three years, I went from intern to enterprise solutions architect. Currently I am developing expertise in Big Data technologies for customer companies from the financial and telecommunications sectors.

There are two main specializations for people who want to work with big data: analysts and IT consultants who create technologies for working with big data. Beyond that, we can also talk about the profession of the Big Data analyst, that is, people who work directly with the data on the customer's IT platform. Previously, these were ordinary mathematical analysts who knew statistics and mathematics and used statistical software to solve data analysis problems. Today, in addition to knowledge of statistics and mathematics, an understanding of technology and of the data life cycle is also necessary. This, in my opinion, is what distinguishes modern data analysts from the analysts who came before.

My specialization is IT consulting, that is, I come up with and offer clients ways to solve business problems using IT. People with different backgrounds come to consulting, but the most important qualities for this profession are the ability to understand the client's needs, the desire to help people and organizations, good communication and teamwork skills (since the work is always done with a client and in a team), and good analytical skills. Internal motivation is very important: we work in a competitive environment, and the customer expects unusual solutions and genuine interest in the work.

Most of my time is spent communicating with customers, formalizing their business needs and helping them develop the most suitable technology architecture. The selection criteria here have their own peculiarity: in addition to functionality and TCO (total cost of ownership), non-functional requirements for the system are very important, most often response time and information processing time. To convince the customer, we often use a proof-of-concept approach: we offer to "test" the technology for free on some task, on a narrow set of data, to make sure the technology works. The solution must create a competitive advantage for the customer, either by delivering additional benefits (for example, cross-selling, or x-sell) or by solving some business problem, say, reducing a high level of loan fraud.

It would be much easier if clients came with a ready-made task, but so far they do not understand that a revolutionary technology has appeared that can change the market in a couple of years

What problems do you face? The market is not yet ready to use big data technologies. It would be much easier if clients came with a ready-made task, but so far they do not understand that a revolutionary technology has appeared that can change the market in a couple of years. This is why we essentially work in startup mode: we don't just sell technologies, but every time we have to convince clients that they need to invest in these solutions. This is the position of visionaries: we show customers how they can change their business using data and IT. We are creating this new market, the market for commercial IT consulting in the field of Big Data.

If a person wants to work in data analysis or IT consulting in the field of Big Data, the first important thing is a mathematical or technical education with solid mathematical training. It is also useful to master specific technologies, for example SAS, Hadoop, the R language, or IBM solutions. In addition, you need to take an active interest in applications of Big Data, for example, how it can be used for improved credit scoring in a bank or for customer lifecycle management. This and other knowledge can be obtained from available sources: for example, Coursera and Big Data University. There is also the Customer Analytics Initiative at the Wharton School of the University of Pennsylvania, where a lot of interesting material has been published.

A major problem for those who want to work in our field is the clear lack of information about Big Data. You cannot go to a bookstore or some website and get, for example, a comprehensive collection of cases on all applications of Big Data technologies in banks. There are no such directories. Some of the information is in books, some is collected at conferences, and some you have to figure out on your own.

Another problem is that analysts are comfortable in the world of numbers but not always comfortable in business. Such people are often introverted and find communication difficult, which makes it hard for them to present research findings to clients convincingly. To develop these skills, I would recommend books such as The Pyramid Principle and Speak the Language of Diagrams. They help develop presentation skills and teach you to express your thoughts concisely and clearly.

Participating in various case championships while studying at the National Research University Higher School of Economics helped me a lot. Case championships are intellectual competitions for students who have to study business problems and propose solutions to them. There are two types: championships run by consulting firms such as McKinsey, BCG, and Accenture, and independent ones such as Changellenge. While participating in them, I learned to see and solve complex problems, from identifying a problem and structuring it to defending recommendations for its solution.

Oleg Mikhalsky on the Russian market and the specifics of creating a new product in the field of big data

Before joining Acronis, I was already involved in launching new products to market at other companies. It’s always interesting and challenging at the same time, so I was immediately interested in the opportunity to work on cloud services and data storage solutions. All my previous experience in the IT industry, including my own startup project I-accelerator, came in handy in this area. Having a business education (MBA) in addition to a basic engineering degree also helped.

In Russia, large companies such as banks and mobile operators have a need for big data analysis, so there are prospects in our country for those who want to work in this area. True, many projects today are integration projects, that is, they are built on foreign developments or open-source technologies. In such projects, fundamentally new approaches and technologies are not created; rather, existing developments are adapted. At Acronis, we took a different path and, after analyzing the available alternatives, decided to invest in our own development. The result is a reliable storage system for big data that is comparable in cost to, for example, Amazon S3, but works reliably and efficiently even at a significantly smaller scale. Large Internet companies also have their own big data developments, but they are focused more on internal needs than on meeting the needs of external clients.

It is important to understand the trends and economic forces that influence the field of big data. To do this, you need to read a lot, listen to speeches by authoritative experts in the IT industry, and attend thematic conferences. Now almost every conference has a section on Big Data, but they all talk about it from a different angle: from a technology, business or marketing point of view. You can go for project work or an internship at a company that is already leading projects on this topic. If you are confident in your abilities, then it is not too late to organize a startup in the field of Big Data.

Without constant contact with the market, a new development risks going unclaimed

True, when you are responsible for a new product, a lot of time is spent on market analytics and communication with potential clients, partners, and professional analysts who know a lot about clients and their needs. Without constant contact with the market, a new development risks going unclaimed. There are always many uncertainties: you have to figure out who the early adopters will be, what you have to offer them, and how to then attract a mass audience. The second most important task is to formulate and convey to developers a clear and holistic vision of the final product, in order to motivate them to work in conditions where some requirements may still change and priorities depend on feedback from the first clients. Therefore, an important task is managing the expectations of clients on the one hand and developers on the other, so that neither side loses interest and the project is carried through to completion. After the first successful project it becomes easier, and the main challenge becomes finding the right growth model for the new business.

What is Big Data (literally, "big data")? Let's look first at the Oxford dictionary:

Data: quantities, characters, or symbols on which a computer performs operations and which can be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical media.

The term Big Data is used to describe a large data set that grows exponentially over time. To process such a large amount of data, machine learning is indispensable.

The benefits that Big Data provides:

  1. Collecting data from various sources.
  2. Improving business processes through real-time analytics.
  3. Storing huge amounts of data.
  4. Insights: Big Data helps uncover hidden information in structured and semi-structured data.
  5. Risk reduction: big data helps you reduce risks and make smart decisions with the right risk analytics.

Big Data Examples

The New York Stock Exchange generates about 1 terabyte of data on the past trading session every day.

Social media: statistics show that about 500 terabytes of new data are uploaded to Facebook's servers every day, mainly photos and videos, messages, and comments under posts.

A jet engine generates 10 terabytes of data for every 30 minutes of flight. Since thousands of flights are made every day, the total volume of data reaches petabytes.
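To get a feel for why the volume "reaches petabytes", here is a rough back-of-the-envelope calculation in Python. Only the 10 TB per 30 minutes figure comes from the text; the flight count, flight length, and engine count are assumptions for illustration.

```python
# Back-of-the-envelope check of the jet-engine figure above.
tb_per_half_hour = 10        # rate stated in the text, per engine
flight_hours = 2             # assumed average flight duration
engines = 2                  # assumed twin-engine aircraft
flights_per_day = 25_000     # assumed number of commercial flights per day

tb_per_flight = tb_per_half_hour * (flight_hours * 2) * engines   # 80 TB
daily_volume_pb = tb_per_flight * flights_per_day / 1024          # ~1,950 PB
print(f"{tb_per_flight} TB per flight, about {daily_volume_pb:,.0f} PB per day")
```

Even with conservative assumptions, the daily total lands in the petabyte range, which is exactly the point of the example.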

Big Data classification

Big data forms:

  • Structured
  • Unstructured
  • Semi-structured

Structured form

Data that can be stored, accessed, and processed in a fixed format is called structured. Over time, computer science has made great strides in improving techniques for working with this type of data (where the format is known in advance) and has learned how to benefit from it. However, problems are already emerging as volumes grow to sizes measured in zettabytes.

1 zettabyte equals a billion terabytes

Looking at these numbers, it is easy to see why the term Big Data arose, as well as the difficulties associated with processing and storing data at this scale.

Data stored in a relational database is structured; it looks like, for example, a table of company employees.
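As a minimal sketch of what "structured" means here, the example below stores a small, hypothetical employees table with a fixed schema in SQLite and queries it; the column names and values are invented for illustration.

```python
import sqlite3

# Structured data: a fixed, known-in-advance schema in a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employees (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        dept   TEXT NOT NULL,
        salary REAL
    )
""")
conn.executemany(
    "INSERT INTO employees (name, dept, salary) VALUES (?, ?, ?)",
    [("Anna Petrova", "Finance", 95000.0),
     ("Ivan Sidorov", "IT", 110000.0)],
)
for row in conn.execute("SELECT name, dept, salary FROM employees ORDER BY salary DESC"):
    print(row)   # every row has exactly the same, predictable structure
conn.close()
```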

Unstructured form

Data of unknown structure is classified as unstructured. In addition to its large size, this form is characterized by difficulties in processing it and extracting useful information. A typical example of unstructured data is a heterogeneous source containing a combination of simple text files, images, and videos. Today, organizations have access to large amounts of raw or unstructured data but do not know how to extract value from it.

Semi-structured form

This category combines features of the two forms described above: semi-structured data has some structure, but it is not defined by tables in relational databases. An example of this category is personal data presented in an XML file.

For example, records of the form: Prashant Rao (male, 35), Seema R. (female, 41), Satish Mane (male, 29), Subrato Roy (male, 26), Jeremiah J. (male, 35).
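As a sketch of how such semi-structured records might look and be read in practice, here is a hedged Python example using the standard-library XML parser; the tag names are assumptions, since the original file layout is not shown.

```python
import xml.etree.ElementTree as ET

# Semi-structured data: there is some structure (tags), but no fixed relational schema.
xml_data = """
<records>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
  <rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
  <rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
</records>
"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))
```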

Characteristics of Big Data

Big Data growth over time:

In the typical chart, blue represents structured data (enterprise data) stored in relational databases, while the other colors indicate unstructured data from various sources (IP telephony, devices and sensors, social networks, and web applications).

According to Gartner, big data varies in volume, rate of generation, variety, and variability. Let's take a closer look at these characteristics.

  1. Volume. The term Big Data itself is associated with large size. Data size is a critical metric in determining the potential value to be extracted. Every day, 6 million people use digital media, generating an estimated 2.5 quintillion bytes of data. Therefore, volume is the first characteristic to consider.
  2. Variety is the next aspect. It refers to heterogeneous sources and the nature of the data, which can be either structured or unstructured. Previously, spreadsheets and databases were the only sources of information considered in most applications. Today, data in the form of emails, photos, videos, PDF files, and audio is also considered in analytical applications. This variety of unstructured data leads to problems in storage, mining, and analysis: 27% of companies are not confident that they are working with the right data.
  3. Generation speed. How quickly data is accumulated and processed to meet requirements determines its potential. Velocity refers to the speed of information flow from sources: business processes, application logs, social networking and media sites, sensors, and mobile devices. The flow of data is huge and continuous over time.
  4. Variability describes how data can change at certain points in time, which complicates processing and management. For example, most data is unstructured in nature.

Big Data analytics: what are the benefits of big data

Promotion of goods and services: Access to data from search engines and sites like Facebook and Twitter allows businesses to more accurately develop marketing strategies.

Improving service for customers: Traditional customer feedback systems are being replaced by new ones that use Big Data and Natural Language Processing to read and evaluate customer feedback.

Risk calculation: assessing the risks associated with the release of a new product or service.

Operational efficiency: big data is structured so that the necessary information can be extracted quickly and accurate results produced promptly. This combination of Big Data and storage technologies helps organizations optimize their work with rarely used information.

The constant acceleration of data growth is an integral element of modern realities. Social networks, mobile devices, data from measuring devices, business information are just a few types of sources that can generate gigantic amounts of data.

Currently, the term Big Data has become quite common. Not everyone is yet aware of how quickly and deeply technologies for processing large amounts of data are changing the most diverse aspects of society. Changes are taking place in various areas, giving rise to new problems and challenges, including in the field of information security, where its most important aspects, such as confidentiality, integrity, and availability, should be in the foreground.

Unfortunately, many modern companies adopt Big Data technology without creating the proper infrastructure to ensure reliable storage of the huge amounts of data they collect and keep. On the other hand, blockchain technology, which is designed to solve this and many other problems, is currently developing rapidly.

What is Big Data?

In fact, the definition of the term is straightforward: “big data” means the management of very large volumes of data, as well as their analysis. If we look more broadly, this is information that cannot be processed by classical methods due to its large volumes.

The term Big Data itself appeared relatively recently. According to Google Trends, the active growth in popularity of the term occurred at the end of 2011:

In 2010, the first products and solutions directly related to big data processing began to appear. By 2011, most of the largest IT companies, including IBM, Oracle, Microsoft, and Hewlett-Packard, were actively using the term Big Data in their business strategies. Gradually, information technology market analysts began active research into the concept.

Currently, this term has gained significant popularity and is actively used in a variety of fields. However, it cannot be said with certainty that Big Data is some kind of fundamentally new phenomenon - on the contrary, big data sources have existed for many years. In marketing, these include databases of customer purchases, credit histories, lifestyles, and so on. Over the years, analysts have used this data to help companies predict future customer needs, assess risks, shape consumer preferences, and more.

Currently, the situation has changed in two aspects:

— more sophisticated tools and methods have emerged for analyzing and comparing different data sets;
— analysis tools have been supplemented with many new data sources, which is due to the widespread transition to digital technologies, as well as new methods of data collection and measurement.

Researchers predict that Big Data technologies will be most actively used in manufacturing, healthcare, trade, government administration and in other diverse areas and industries.

Big Data is not a specific array of data, but a set of methods for processing it. The defining characteristic of big data is not only its volume, but also other categories that characterize labor-intensive data processing and analysis processes.

The initial data for processing can be, for example:

— logs of Internet user behavior;
— Internet of Things;
— social media;
— meteorological data;
— digitized books from major libraries;
— GPS signals from vehicles;
— information about transactions of bank clients;
— data on the location of mobile network subscribers;
— information about purchases in large retail chains, etc.

Over time, the volume of data and the number of its sources keep growing, and against this background new methods of information processing are emerging and existing ones are being improved.

Basic principles of Big Data:

— Horizontal scalability – data sets can be huge, and this means that the big data processing system must expand dynamically as their volumes increase.
— Fault tolerance – even if some equipment elements fail, the entire system must remain operational.
— Data locality. In large distributed systems, data is typically distributed across a significant number of machines. However, whenever possible and to save resources, data is often processed on the same server where it is stored.

For all three principles to hold, and thus for big data to be stored and processed efficiently, new breakthrough technologies are needed, such as blockchain.

Why do we need big data?

The scope of Big Data is constantly expanding:

— Big data can be used in medicine. Thus, a diagnosis can be made for a patient not only based on data from an analysis of the patient’s medical history, but also taking into account the experience of other doctors, information about the environmental situation of the patient’s area of ​​residence, and many other factors.
— Big Data technologies can be used to organize the movement of unmanned vehicles.
— By processing large amounts of data, you can recognize faces in photos and videos.
— Big Data technologies can be used by retailers - trading companies can actively use data sets from social networks to effectively customize their advertising campaigns, which can be maximally targeted to a particular consumer segment.
— This technology is actively used in organizing election campaigns, including for analyzing political preferences in society.
— The use of Big Data technologies is relevant for solutions of the revenue assurance (RA) class, which include tools for detecting inconsistencies and for in-depth data analysis, allowing timely identification of probable losses or distortions of information that could lead to lower financial results.
— Telecommunications providers can aggregate big data, including geolocation; in turn, this information may be of commercial interest to advertising agencies, which can use it to display targeted and local advertising, as well as to retailers and banks.
— Big data can play an important role in deciding to open a retail outlet in a certain location based on data about the presence of a powerful targeted flow of people.

Thus, the most obvious practical application of Big Data technology lies in the field of marketing. Thanks to the development of the Internet and the proliferation of all kinds of communication devices, behavioral data (such as the number of calls, shopping habits and purchases) is becoming available in real time.

Big data technologies can also be effectively used in finance, for sociological research and in many other areas. Experts argue that all these opportunities for using big data are only the visible part of the iceberg, since these technologies are used in much larger volumes in intelligence and counterintelligence, in military affairs, as well as in everything that is commonly called information warfare.

In general terms, the sequence of working with Big Data consists of collecting data, structuring the information received using reports and dashboards, and then formulating recommendations for action.

Let's briefly consider the possibilities of using Big Data technologies in marketing. As you know, for a marketer, information is the main tool for forecasting and strategy development. Big data analysis has long been successfully used to determine the target audience, interests, demand and activity of consumers. Big data analysis, in particular, makes it possible to display advertising (based on the RTB auction model - Real Time Bidding) only to those consumers who are interested in a product or service.

The use of Big Data in marketing allows businessmen to:

— get to know their consumers better and attract a similar audience on the Internet;
— assess the degree of customer satisfaction;
— understand whether the proposed service meets expectations and needs;
— find and implement new ways to increase customer trust;
— create projects that are in demand, etc.

For example, the Google Trends service can give a marketer a forecast of seasonal demand for a specific product, along with fluctuations and the geography of clicks. If you compare this information with the statistics collected by the corresponding plugin on your own website, you can draw up a plan for distributing the advertising budget, indicating the month, region, and other parameters.

According to many researchers, the success of the Trump election campaign lies in the segmentation and use of Big Data. The team of the future US President was able to correctly divide the audience, understand its desires and show exactly the message that voters want to see and hear. Thus, according to Irina Belysheva from the Data-Centric Alliance, Trump’s victory was largely possible thanks to a non-standard approach to Internet marketing, which was based on Big Data, psychological and behavioral analysis and personalized advertising.

Trump's political strategists and marketers used a specially developed mathematical model, which made it possible to deeply analyze and systematize the data of all US voters, enabling ultra-precise targeting not only by geographic characteristics but also by voters' intentions, interests, psychotype, behavioral characteristics, and so on. Based on this, marketers organized personalized communication with each group of citizens according to their needs, moods, political views, psychological characteristics, and even skin color, using a different message for almost every individual voter.

As for Hillary Clinton, her campaign used "time-tested" methods based on sociological data and standard marketing, dividing the electorate only into formally homogeneous groups (men, women, African Americans, Latin Americans, the poor, the rich, and so on).

As a result, the winner was the one who appreciated the potential of new technologies and methods of analysis. It is noteworthy that Hillary Clinton's campaign expenses were twice as much as her opponent's:

Data: Pew Research

Main problems of using Big Data

In addition to the high cost, one of the main factors hindering the implementation of Big Data in various areas is the problem of choosing the data to be processed: that is, determining which data needs to be retrieved, stored and analyzed, and which should not be taken into account.

Another problem with Big Data is ethical. In other words, a logical question arises: can such data collection (especially without the user’s knowledge) be considered a violation of privacy?

It is no secret that the information collected by the Google and Yandex search engines allows these IT giants to constantly improve their services, make them user-friendly, and create new interactive applications. To do this, search engines collect data about user activity on the Internet, IP addresses, geolocation data, interests and online purchases, personal data, email messages, and so on. All this makes it possible to display contextual advertising that matches user behavior on the Internet. Users are usually not asked for consent, nor are they given a choice about what information about themselves to provide. That is, by default, everything is collected into Big Data, which is then stored on the sites' data servers.

This leads to the next important problem: the security of data storage and use. For example, how safe is a particular analytics platform to which consumers automatically transmit their data? In addition, many business representatives note a shortage of highly qualified analysts and marketers who can handle large volumes of data effectively and use them to solve specific business problems.

Despite all the difficulties with implementing Big Data, businesses intend to increase their investment in this area. According to Gartner research, the industries leading in Big Data investment are media, retail, telecom, banking, and service companies.

Prospects for interaction between blockchain and Big Data technologies

The integration of blockchain with Big Data has a synergistic effect and opens up a wide range of new opportunities for business, including allowing companies to:

— gain access to detailed information about consumer preferences, on the basis of which you can build detailed analytical profiles for specific suppliers, products and product components;
— integrate detailed transaction data and statistics on the consumption of certain groups of products by different categories of users;
— receive detailed analytical data on supply and consumption chains, control product losses during transportation (for example, weight loss due to drying and evaporation of certain types of goods);
— counteract product counterfeiting, increase the effectiveness of the fight against money laundering and fraud, etc.

Access to detailed data on the use and consumption of goods will help unlock the potential of Big Data technology for optimizing key business processes, reducing regulatory risks, and revealing new opportunities for monetization and for creating products that best meet current consumer preferences.

As is known, representatives of the largest financial institutions are already showing significant interest in blockchain technology. According to Oliver Bussmann, CIO of the Swiss financial holding UBS, blockchain technology can "reduce transaction processing time from several days to several minutes."

The potential for analysis from the blockchain using Big Data technology is enormous. Distributed ledger technology ensures the integrity of information, as well as reliable and transparent storage of the entire transaction history. Big Data, in turn, provides new tools for effective analysis, forecasting, economic modeling and, accordingly, opens up new opportunities for making more informed management decisions.

The tandem of blockchain and Big Data can be successfully used in healthcare. As is known, imperfect and incomplete data on a patient’s health greatly increases the risk of an incorrect diagnosis and incorrectly prescribed treatment. Critical data about the health of clients of medical institutions should be maximally protected, have the properties of immutability, be verifiable and should not be subject to any manipulation.

The information in the blockchain meets all of the above requirements and can serve as high-quality and reliable source data for in-depth analysis using new Big Data technologies. In addition, with the help of blockchain, medical institutions could exchange reliable data with insurance companies, justice authorities, employers, scientific institutions and other organizations in need of medical information.

Big Data and information security

In a broad sense, information security is the protection of information and supporting infrastructure from accidental or intentional negative impacts of a natural or artificial nature.

In the area of information security, Big Data faces the following challenges:

— problems of data protection and ensuring their integrity;
— the risk of outside interference and leakage of confidential information;
— improper storage of confidential information;
— the risk of information loss, for example, due to someone’s malicious actions;
— risk of misuse of personal data by third parties, etc.

One of the main big data problems that blockchain is designed to solve lies in the area of information security. By ensuring compliance with all of its basic principles, distributed ledger technology can guarantee the integrity and reliability of data, and thanks to the absence of a single point of failure, blockchain makes information systems more stable. Distributed ledger technology can help solve the problem of trust in data and enable universal data sharing.

Information is a valuable asset, which means that ensuring the basic aspects of information security must be at the forefront. In order to survive the competition, companies must keep up with the times, which means that they cannot ignore the potential opportunities and advantages that blockchain technology and Big Data tools contain.

You know this famous joke, right? Big Data is like sex before 18:

  • everyone thinks about it;
  • everyone talks about it;
  • everyone thinks their friends do it;
  • almost no one does this;
  • whoever does it does it badly;
  • everyone thinks it will work out better next time;
  • no one takes security measures;
  • everyone is ashamed to admit that they don't know something;
  • if someone succeeds at something, there is always a lot of noise about it.

But let's be honest: with any hype there is always ordinary curiosity. What is all the fuss about, and is there really something important there? In short, yes, there is. Details are below. We have selected for you the most amazing and interesting applications of Big Data technologies. This small market survey, with clear examples, confronts us with a simple fact: the future is not something we need to wait another n years for before the magic becomes reality. No, it has already arrived, but it is still invisible to the eye, which is why the singularity has not yet scorched the labor market to a noticeable degree. Let's go.

1 How Big Data technologies are applied where they originated

Large IT companies are where data science originated, so their internal knowledge in this area is the most interesting. Google, the company where the MapReduce paradigm was born, runs an internal program whose sole purpose is to train its programmers in machine learning technologies. This is where its competitive advantage lies: after acquiring new knowledge, employees introduce new methods into the Google projects they work on every day. Imagine how huge the list of areas is in which the company could make a revolution. One example is the use of neural networks.

Apple implements machine learning in all of its products. Its advantage is a large ecosystem that includes all the digital devices used in everyday life. This allows the company to reach a level inaccessible to others: it has more user data than anyone else. At the same time, its privacy policy is very strict: the corporation has always boasted that it does not use customer data for advertising purposes. Accordingly, user information is encrypted so that Apple's lawyers, or even the FBI with a warrant, cannot read it.

2 Big Data on 4 wheels

A modern car is a store of information: it accumulates data about the driver, the environment, connected devices, and itself. Soon a single connected vehicle will generate up to 25 GB of data per hour.

Vehicle telematics has been used by automakers for many years, but there is now lobbying for a more sophisticated data collection method that takes full advantage of Big Data. This means that technology can now alert the driver to poor road conditions by automatically activating the anti-lock braking and traction control systems.

Other companies, including BMW, are using Big Data technology, combined with information collected from prototypes being tested, in-vehicle error memory systems, and customer complaints, to identify model weaknesses early in production. Now, instead of manually evaluating data, which takes months, a modern algorithm is used. Errors and troubleshooting costs are reduced, which speeds up information analysis workflows at BMW.

According to expert estimates, by 2019 the market turnover of connected cars will reach $130 billion. This is not surprising, given the pace of integration by automakers of technologies that are an integral part of the vehicle.

Using Big Data helps make cars safer and more functional. Toyota, for example, is doing this by integrating data communication modules (DCM). This Big Data tool processes and analyzes the data collected by the DCM in order to extract further value from it.

3 Application of Big Data in medicine


The implementation of Big Data technologies in the medical field allows doctors to study the disease more thoroughly and choose an effective course of treatment for a particular case. Thanks to the analysis of information, it becomes easier for health workers to predict relapses and take preventive measures. The result is a more accurate diagnosis and improved treatment methods.

The new approach allowed doctors to look at patients' problems from a different angle, which led to the discovery of previously unknown sources of the problem. For example, some ethnic groups are genetically more predisposed to heart disease than others. Now, when a patient complains of a certain disease, doctors take into account data about members of his ethnic group who have complained of the same problem. The collection and analysis of data makes it possible to learn much more about patients: from food preferences and lifestyle to the genetic structure of DNA and the metabolites of cells, tissues, and organs. For example, the Center for Children's Genomic Medicine in Kansas City works with patients and analyzes the mutations in the genetic code that cause cancer. An individual approach to each patient that takes his DNA into account will raise the effectiveness of treatment to a qualitatively different level.

Understanding how Big Data is used is the first and very important change in the medical field. When a patient undergoes treatment, a hospital or other healthcare facility can obtain a lot of relevant information about the person. The collected information is used to predict disease recurrences with a certain degree of accuracy. For example, if a patient has suffered a stroke, doctors study information about the time of the cerebrovascular accident and analyze the intervals between previous episodes (if any), paying special attention to stressful situations and heavy physical activity in the patient's life. Based on this data, hospitals give the patient a clear action plan to prevent the possibility of a stroke in the future.

Wearable devices also play a role, helping to identify health problems even if a person does not have obvious symptoms of a particular disease. Instead of assessing the patient’s condition through a long course of examinations, the doctor can draw conclusions based on the information collected by a fitness tracker or smart watch.

One recent example: while a man was being examined for a new seizure caused by a missed medication, doctors discovered that he had a much more serious health problem, atrial fibrillation. The diagnosis was made because the department's staff gained access to the patient's phone, specifically to the application linked to his fitness tracker. Data from the application turned out to be the key factor in making the diagnosis, because no cardiac abnormalities were detected in the man at the time of the examination.

This is just one of the cases that shows why the use of big data plays such a significant role in medicine today.

4 Data analysis has already become the core of retail

Understanding user queries and targeting is one of the largest and most publicized areas of application of Big Data tools. Big Data helps analyze customer habits in order to better understand consumer needs in the future. Companies are looking to expand the traditional data set with information from social networks and browser search history in order to create the most complete customer picture possible. Sometimes large organizations choose to create their own predictive model as a global goal.

For example, the Target store chain, using in-depth data analysis and its own forecasting system, manages to determine with high accuracy whether a customer is expecting a child. Each customer is assigned an ID, which in turn is linked to a credit card, name, or email address. The identifier serves as a kind of shopping basket that stores information about everything the person has ever purchased. The chain's specialists found that expectant mothers actively purchase unscented products before the second trimester of pregnancy, and that during the first 20 weeks they rely on calcium, zinc, and magnesium supplements. Based on this data, Target sends customers coupons for baby products. The discounts on children's goods are "diluted" with coupons for other products so that offers to buy a crib or diapers do not look too intrusive.

Even government departments have found a way to use Big Data technologies to optimize election campaigns. Some believe that Barack Obama's victory in the 2012 US presidential election was due to the excellent work of his team of analysts, who processed huge amounts of data in the right way.

5 Big Data protects law and order


Over the past few years, law enforcement agencies have been able to figure out how and when to use Big Data. It is a well-known fact that the National Security Agency uses Big Data technologies to prevent terrorist attacks. Other departments are using advanced methodology to prevent smaller crimes.

The Los Angeles Police Department uses a predictive algorithm that does what is commonly called proactive policing. Using crime reports over a period of time, the algorithm identifies areas where crime is most likely to occur. The system marks such areas on the city map with small red squares, and this data is immediately transmitted to patrol cars.

Chicago police use Big Data technologies in a slightly different way. Law enforcement officers in the Windy City use a similar approach, but theirs is aimed at outlining a "risk circle" of people who could become the victim of, or a participant in, an armed attack. According to The New York Times, the algorithm assigns a vulnerability rating to a person based on his criminal history (arrests, participation in shootings, membership in criminal groups). The system's developer says that although the system examines a person's criminal history, it does not take into account secondary factors such as race, gender, ethnicity, and location.

6 How Big Data technologies help cities develop


Veniam CEO Joao Barros shows a map of tracking Wi-Fi routers on Porto buses

Data analysis is also used to improve many aspects of life in cities and countries. For example, if you know exactly how and when to use Big Data technologies, you can optimize traffic flows. To do this, vehicle movements are tracked online, and social media and meteorological data are analyzed. Today a number of cities have committed to using data analytics to combine transport infrastructure with other types of utilities into a single whole. This is the concept of the "smart" city, in which buses wait for late trains and traffic lights are able to predict congestion to minimize traffic jams.

Based on Big Data technologies, the city of Long Beach operates smart water meters that are used to stop illegal watering. Previously, they were used to reduce water consumption by private households (the maximum result was a reduction of 80%). Saving fresh water is always a pressing issue. Especially when the state is experiencing the worst drought ever recorded.

Representatives of the Los Angeles Department of Transportation have also joined those who use Big Data. Based on data received from traffic camera sensors, the authorities monitor the operation of traffic lights, which in turn allows them to regulate traffic flow. The computerized system controls about 4,500 traffic lights throughout the city. According to official data, the new algorithm has helped reduce congestion by 16%.

7 The engine of progress in marketing and sales


In marketing, Big Data tools make it possible to identify which ideas are most effective in promotion at a particular stage of the sales cycle. Data analysis determines how investments can improve customer relationship management, what strategy should be chosen to increase conversion rates, and how to optimize the customer life cycle. In businesses built around cloud technologies, Big Data algorithms are used to figure out how to minimize customer acquisition costs and extend the customer life cycle.

Differentiating pricing strategies depending on a customer's level within the system is perhaps the main use of Big Data in marketing. McKinsey found that about 75% of the average firm's revenue comes from core products, 30% of which are mispriced. A 1% increase in price results in an 8.7% increase in operating profit.

The Forrester research team found that data analytics allows marketers to focus on how to make customer relationships more successful. By examining the direction of customer development, specialists can assess the level of their loyalty, as well as extend the life cycle in the context of a specific company.

The optimization of sales strategies and of the stages of entering new markets using geo-analytics can be seen in the biopharmaceutical industry. According to McKinsey, drug manufacturers spend on average 20 to 30% of profits on administration and sales. If enterprises use Big Data more actively to identify the most profitable and fastest-growing markets, costs will be reduced immediately.

Data analytics is a means for companies to gain a complete picture of key aspects of their business. Increasing revenue, reducing costs and reducing working capital are the three objectives that modern business tries to solve using analytical tools.

Finally, 58% of marketing directors claim that the implementation of Big Data technologies can be seen in search engine optimization (SEO), e-mail and mobile marketing, where data analysis plays the most significant role in the formation of marketing programs. And only 4% fewer respondents are confident that Big Data will play a significant role in all marketing strategies for many years to come.

8 Global data analysis

No less curious is the use of Big Data to analyze the global climate. It is possible that machine learning will ultimately be the only force capable of maintaining the delicate balance. The topic of human influence on global warming still causes a lot of controversy, so only reliable predictive models based on the analysis of large amounts of data can give an accurate answer. Ultimately, reducing emissions will help us all: we will spend less on energy.

Now Big Data is not an abstract concept that may find its application in a couple of years. This is a completely working set of technologies that can be useful in almost all areas of human activity: from medicine and public order to marketing and sales. The stage of active integration of Big Data into our daily life has just begun, and who knows what the role of Big Data will be in a few years?

Big data is a broad term for the unconventional strategies and technologies needed to collect, organize, and process information from large data sets. While the challenge of working with data that exceeds the processing or storage capacity of a single computer is not new, the scope and value of this type of computing has expanded significantly in recent years.

This article will walk you through the basic concepts you might encounter while exploring big data. It also discusses some of the processes and technologies that are currently used in this area.

What is big data?

A precise definition of “big data” is difficult to articulate because projects, vendors, practitioners, and business professionals use it in very different ways. With this in mind, big data can be defined as:

  • Large data sets.
  • A category of computing strategies and technologies that are used to process large data sets.

In this context, "large data set" means a set of data that is too large to be processed or stored using traditional tools or on a single computer. This means that the overall scale of large data sets is constantly changing and can vary significantly from case to case.

Big Data Systems

The basic requirements for working with big data are the same as for any other data set. However, the massive scale, the speed of processing, and the characteristics of the data encountered at every step of the process present significant new challenges for tool development. The goal of most big data systems is to extract insights and connections from large volumes of heterogeneous data that would not be possible using conventional methods.

In 2001, Gartner's Doug Laney introduced the "three V's of big data" to describe some of the characteristics that distinguish big data processing from other types of data processing:

  1. Volume (data volume).
  2. Velocity (speed of data accumulation and processing).
  3. Variety (variety of types of data processed).

Data volume

The sheer scale of information processed helps define big data systems. These data sets can be orders of magnitude larger than traditional data sets, requiring greater attention at every stage of processing and storage.

Because demands exceed the capabilities of a single computer, the problem of pooling, distributing, and coordinating resources from groups of computers often arises. Cluster management and algorithms that can break tasks into smaller parts are becoming increasingly important in this area.

Accumulation and processing speed

The second characteristic that significantly distinguishes big data from other data systems is the speed at which information moves through the system. Data often enters a system from multiple sources and must be processed in real time to update the current state of the system.

This emphasis on instant feedback has led many practitioners to abandon the batch-oriented approach in favor of a real-time streaming system. Data is constantly being added, processed and analyzed to keep up with the influx of new information and provide valuable insights early, when it is most relevant. This requires robust systems with highly available components to protect against failures along the data pipeline.

Variety of data types processed

There are many unique challenges in big data due to the wide range of sources processed and their relative quality.

Data may come from internal systems, such as application and server logs, from social media channels and other external APIs, from physical device sensors and other sources. The goal of big data systems is to process potentially useful data, regardless of origin, by combining all information into a single system.

Media formats and types can also vary significantly. Media files (images, video, and audio) are combined with text files, structured logs, and so on. More traditional data processing systems expect data to enter the pipeline already labeled, formatted, and organized, but big data systems typically ingest and store data while trying to preserve its original raw state. Ideally, any transformations or changes to the raw data occur in memory during processing.

Other characteristics

Over time, practitioners and organizations have proposed expansions of the original “three Vs,” although these innovations tend to describe the problems rather than the characteristics of big data.

  • Veracity: The variety of sources and complexity of processing can lead to problems in assessing the quality of the data (and therefore the quality of the resulting analysis).
  • Variability: Changes in data lead to wide variations in quality. Additional resources may be required to identify, process, or filter low-quality data to improve data quality.
  • Value: The ultimate goal of big data is value. Sometimes systems and processes are so complex that it is difficult to use the data and extract actual value from it.

Big Data Lifecycle

So how is big data actually processed? There are several different approaches to implementation, but there are commonalities in the strategies and software.

  • Entering data into the system
  • Saving data to storage
  • Data Computing and Analysis
  • Visualization of results

Before we look at these four categories of workflows in detail, let's talk about cluster computing, an important strategy used by many big data tools. Setting up a computing cluster is the core technology used at each stage of the lifecycle.

Cluster computing

Because of the characteristics of big data, individual computers are not suitable for processing it. Clusters are better suited to the task, since they can handle the storage and computing needs of big data.

Big data clustering software combines the resources of many small machines, aiming to provide a number of benefits:

  • Resource Pooling: Processing large data sets requires large amounts of CPU and memory resources, as well as a lot of available storage space.
  • High Availability: Clusters can provide varying levels of fault tolerance and availability so that hardware or software failures do not impact data access and processing. This is especially important for real-time analytics.
  • Scalability: clusters support fast horizontal scaling (adding new machines to the cluster).

To work in a cluster, you need tools to manage cluster membership, coordinate resource distribution, and schedule work with individual nodes. Cluster membership and resource allocation can be handled using programs like Hadoop YARN (Yet Another Resource Negotiator) or Apache Mesos.

An assembled computing cluster often acts as a foundation with which other software interacts to process the data. The machines participating in the computing cluster are also typically involved in managing a distributed storage system.

Receiving data

Data ingestion is the process of adding raw data to the system. The complexity of this operation largely depends on the format and quality of the data sources and the extent to which the data meets the requirements for processing.

You can add big data to the system using special tools. Technologies such as Apache Sqoop can take existing data from relational databases and add it to a big data system. You can also use Apache Flume and Apache Chukwa - projects designed for aggregating and importing application and server logs. Message brokers such as Apache Kafka can be used as an interface between various data generators and a big data system. Frameworks like Gobblin can combine and optimize the output of all tools at the end of the pipeline.
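As a hedged sketch of the "message broker as interface" idea, here is how a data generator might publish raw events to Kafka using the third-party kafka-python package; the broker address and the topic name raw-events are assumptions.

```python
import json
from kafka import KafkaProducer  # third-party package: kafka-python

# A data generator pushes raw events into the broker; the big data system
# consumes them from the same topic on the other side.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",   # assumed broker address
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {"source": "web-app", "user_id": 42, "action": "page_view"}
producer.send("raw-events", value=event)
producer.flush()  # make sure the message actually leaves the producer's buffer
```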

During data ingestion, some amount of analysis, sorting, and labeling usually takes place. This process is sometimes called ETL, which stands for extract, transform, and load. Although the term usually refers to legacy data warehousing processes, it is sometimes applied to big data systems as well. Typical operations include modifying incoming data for formatting, categorizing and labeling it, filtering it, or checking it for compliance with requirements. A minimal, purely illustrative example of such an ingestion-time transformation is sketched below.
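The sketch parses an assumed log-line format, labels each record, and drops records that fail validation; the format and category rules are invented for the example.

```python
from datetime import datetime

def transform(raw_line: str):
    """Parse, label, and validate one raw log line (the format is an assumption)."""
    parts = raw_line.strip().split(" ", 2)
    if len(parts) != 3:
        return None  # filtering: drop malformed records
    timestamp, level, message = parts
    return {
        "ts": datetime.fromisoformat(timestamp),                         # formatting
        "level": level,
        "message": message,
        "category": "error" if level in ("ERROR", "FATAL") else "info",  # labeling
    }

print(transform("2024-05-01T12:00:00 ERROR disk quota exceeded"))
```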

Ideally, the received data undergoes minimal formatting.

Data storage

Once received, the data moves to the components that manage the storage.

Typically, distributed file systems are used to store raw data. Solutions such as HDFS from Apache Hadoop allow large amounts of data to be written to multiple nodes in a cluster. Such a system gives compute resources access to the data, can load data into the cluster's RAM for in-memory operations, and handles component failures. Other distributed file systems, including Ceph and GlusterFS, can be used in place of HDFS.
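To make the idea concrete, here is a hedged PySpark sketch that reads raw text from HDFS and writes a filtered copy back to the cluster; the namenode address and the paths are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-storage-sketch").getOrCreate()

# Read raw logs that were landed on the distributed file system.
logs = spark.read.text("hdfs://namenode:9000/raw/app-logs/2024-05-01/")

# Persist a filtered copy; the data stays distributed across the cluster's nodes.
logs.filter(logs.value.contains("ERROR")) \
    .write.mode("overwrite") \
    .text("hdfs://namenode:9000/curated/error-logs/2024-05-01/")
```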

Data can also be imported into other distributed systems for more structured access. Distributed databases, especially NoSQL databases, are well suited for this role because they can handle heterogeneous data. There are many different types of distributed databases; the choice depends on how you want to organize and present the data.

Data Computing and Analysis

Once the data is available, the system can begin processing it. The computing layer is perhaps the least constrained part of the system, since the requirements and approaches here can differ significantly depending on the type of information. Data is often processed repeatedly, either with a single tool or with a number of tools for processing different types of data.

Batch processing is one of the methods for computing on large data sets. This process involves breaking the data into smaller parts, scheduling each part to be processed on a separate machine, rearranging the data based on intermediate results, and then computing and collecting the final result. Apache Hadoop's MapReduce uses this strategy. Batch processing is most useful when working with very large data sets that require quite a lot of computation.
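The classic illustration of this pattern is a word count. The hedged PySpark sketch below splits the input across the cluster, maps lines to words, and reduces per key, which is exactly the map/shuffle/reduce sequence described above; the input and output paths are assumptions.

```python
from pyspark import SparkContext

sc = SparkContext(appName="batch-wordcount-sketch")

counts = (
    sc.textFile("hdfs://namenode:9000/raw/books/")   # split the input across the cluster
      .flatMap(lambda line: line.split())            # map: emit individual words
      .map(lambda word: (word.lower(), 1))
      .reduceByKey(lambda a, b: a + b)               # shuffle + reduce: sum counts per word
)

counts.saveAsTextFile("hdfs://namenode:9000/output/wordcount/")
```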

Other workloads require real-time processing. In this case, information must be processed and prepared immediately, and the system must respond in a timely manner as new information becomes available. One way to implement real-time processing is to process a continuous stream of data consisting of individual elements. Another common characteristic of real-time processors is that they compute data in cluster memory, avoiding the need to write to disk.

Apache Storm, Apache Flink, and Apache Spark offer different ways to implement real-time processing. These flexible technologies allow you to choose the best approach for each individual problem. In general, real-time processing is best suited for analyzing small pieces of data that change or are quickly added to the system.
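For comparison with the batch example, here is a hedged sketch of the same word count done as a stream with Spark Structured Streaming; the socket source on localhost:9999 is an assumption, and Flink or Storm could play a similar role.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-wordcount-sketch").getOrCreate()

# Read an unbounded stream of lines from a socket source.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()   # running totals kept in memory

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```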

All of these programs are frameworks. However, there are many other ways to compute or analyze data in a big data system. These tools often plug into the above frameworks and provide additional interfaces for interacting with the lower levels. For example, Apache Hive provides a data warehouse interface for Hadoop, Apache Pig provides a query interface, and SQL-based interaction with the data is provided by Apache Drill, Apache Impala, Apache Spark SQL, and Presto. For machine learning there are Apache SystemML, Apache Mahout, and MLlib from Apache Spark. For direct analytical programming, which is widely supported by the data ecosystem, R and Python are used.
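As one hedged example of such a higher-level interface, the sketch below registers a tiny, invented purchases table and queries it with Spark SQL; Hive, Impala, Drill, or Presto expose similar SQL access over cluster data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-interface-sketch").getOrCreate()

# Toy data with an assumed schema, standing in for a much larger distributed table.
purchases = spark.createDataFrame(
    [("anna", 120.0), ("ivan", 80.0), ("anna", 45.0)],
    ["customer", "amount"],
)
purchases.createOrReplaceTempView("purchases")

spark.sql("""
    SELECT customer, SUM(amount) AS total
    FROM purchases
    GROUP BY customer
    ORDER BY total DESC
""").show()
```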

Visualization of results

Often, recognizing trends or changes in data over time is more important than the resulting values. Data visualization is one of the most useful ways to identify trends and organize large numbers of data points.

Real-time processing is used to visualize application and server metrics. Data changes frequently, and large variations in metrics usually indicate a significant impact on the health of systems or organizations. Projects like Prometheus can be used to process data streams and time series and visualize this information.

One popular way to visualize data is the Elastic stack, formerly known as the ELK stack. Logstash is used for data collection, Elasticsearch for data indexing, and Kibana for visualization. The Elastic stack can work with big data, visualize the results of calculations, or interact with raw metrics. A similar stack can be obtained by combining Apache Solr for indexing with a fork of Kibana called Banana for visualization. This stack is called Silk.
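As a hedged sketch of the indexing half of this stack, the snippet below sends one log document to Elasticsearch with the official Python client and then searches for it; the cluster URL and the app-logs index name are assumptions, and Kibana would sit on top of this index for visualization.

```python
from elasticsearch import Elasticsearch  # official Python client (v8-style API)

es = Elasticsearch("http://localhost:9200")   # assumed local cluster

# Index a single log document (Logstash usually performs this step in the Elastic stack).
es.index(index="app-logs", document={
    "timestamp": "2024-05-01T12:00:00",
    "level": "ERROR",
    "message": "disk quota exceeded",
})
es.indices.refresh(index="app-logs")  # make the new document searchable immediately

# Full-text search over the indexed documents.
response = es.search(index="app-logs", query={"match": {"level": "ERROR"}})
for hit in response["hits"]["hits"]:
    print(hit["_source"]["message"])
```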

Another technology for interacting with data is the interactive notebook. Such projects allow for interactive exploration and visualization of data in a format that is convenient for sharing and presenting. Popular examples of this type of interface are Jupyter Notebook and Apache Zeppelin.

Big Data Glossary

  • Big data is a broad term for sets of data that cannot be properly processed by conventional computers or tools due to their volume, velocity, and variety. The term is also commonly applied to technologies and strategies for working with such data.
  • Batch processing is a computing strategy that involves processing data in large sets. Typically, this method is ideal for working with non-urgent data.
  • Clustered computing is the practice of pooling the resources of multiple machines and managing their shared capabilities to perform tasks. In this case, a cluster management layer is required that handles communication between individual nodes.
  • A data lake is a large repository of collected data in a relatively raw state. The term is often used to refer to unstructured and frequently changing big data.
  • Data mining is a broad term for different practices of finding patterns in large data sets. It is an attempt to organize a mass of data into a more understandable and coherent set of information.
  • A data warehouse is a large, organized repository for analysis and reporting. Unlike a data lake, a warehouse consists of formatted and well-organized data that is integrated with other sources. Data warehouses are often mentioned in relation to big data, but they are often components of conventional data processing systems.
  • ETL (extract, transform, and load) is the process of obtaining raw data and preparing it for use. It is associated with data warehouses, but characteristics of this process are also found in the pipelines of big data systems.
  • Hadoop is an open-source Apache project for big data. It consists of a distributed file system called HDFS and a cluster and resource scheduler called YARN. Batch processing capabilities are provided by the MapReduce computation engine. Modern Hadoop deployments can run other computing and analytics systems alongside MapReduce.
  • In-memory computing is a strategy that involves moving entire working datasets into cluster memory. Intermediate calculations are not written to disk; instead, they are stored in memory. This gives the systems a huge speed advantage over I/O-bound systems.
  • Machine learning is the study and practice of designing systems that can learn, adjust, and improve based on data fed to them. This usually means the implementation of predictive and statistical algorithms.
  • Map reduce (not to be confused with Hadoop's MapReduce) is a general model for distributing a computation across a cluster. The process involves splitting a task between nodes, producing intermediate results, shuffling them, and then outputting a single value for each set.
  • NoSQL is a broad term that refers to databases designed outside of the traditional relational model. NoSQL databases are well suited for big data due to their flexibility and distributed architecture.
  • Stream processing is the practice of computing individual pieces of data as they move through a system. This allows real-time data analysis and is suitable for processing time-sensitive transactions using high-speed metrics.