If you are new to big data, the field can be hard to make sense of, and it is hard to know where to start. Recently, Ramesh Dontha published two articles on DataConomy that concisely yet comprehensively introduce 75 core big data terms. They are good introductory material for beginners, and they also help experienced practitioners check for gaps in their knowledge.
This article covers the first 25 basic big data terms, to help you review what you already know and pick up what you don't. Let's get started.
01 Algorithm
An algorithm can be understood as a mathematical formula or statistical process used to analyze data. So why is "algorithm" related to big data? Although the word is a generic term, in this era of widespread big data analytics, algorithms are mentioned more often and have become more popular.
02 Analytics
Let us imagine a very likely situation. Your credit card company sends you an email summarizing the transactions on your card over the whole year. If you take this statement and start working out what percentage of your spending went to food, clothing, entertainment, and so on, you are doing analytics. You are extracting useful information from your raw data (information that can help you decide how to spend in the coming year).
So what if you applied a similar method to the posts that people across an entire city make on Twitter and Facebook? In that case we can call it big data analytics. Big data analytics means reasoning over a large amount of data and drawing useful information from it. There are three different types of analytics, and the next three entries cover them one by one.
03 Descriptive Analytics
If you simply report that your credit card spending last year was 25% on food, 35% on clothing, 20% on entertainment, and 20% on miscellaneous expenses, that is descriptive analytics. Of course, you can also drill into more detail.
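As a rough illustration of descriptive analytics, the sketch below totals a list of transactions by category and reports each category's share of spending. The transaction amounts and the function name `spending_shares` are invented for this example; they are chosen only so that the result matches the percentages quoted above.

```python
from collections import defaultdict

# Hypothetical transactions as (category, amount) pairs.
# The amounts are made up so the shares match the 25/35/20/20 split in the text.
transactions = [
    ("food", 1500.0), ("clothing", 2100.0),
    ("entertainment", 1200.0), ("miscellaneous", 1200.0),
    ("food", 1000.0), ("clothing", 1400.0),
    ("entertainment", 800.0), ("miscellaneous", 800.0),
]


def spending_shares(rows):
    """Return each category's share of total spending, in percent."""
    totals = defaultdict(float)
    for category, amount in rows:
        totals[category] += amount
    grand_total = sum(totals.values())
    return {c: round(100 * t / grand_total, 1) for c, t in totals.items()}


shares = spending_shares(transactions)
for category, percent in shares.items():
    print(f"{category}: {percent}%")
```

Nothing here predicts or recommends anything; it only summarizes what already happened, which is what makes it descriptive.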
04 Predictive Analytics
If you analyze your credit card history over the past 5 years and find that annual spending follows a fairly consistent trend, you can predict with high probability that next year's spending will be similar to the past. This does not mean we are foretelling the future. Rather, we are "predicting with probability" what might happen. In big data predictive analytics, data scientists may use advanced techniques such as machine learning and advanced statistical methods (we will talk about these later) to predict weather conditions, economic changes, and so on.
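One simple way to turn a five-year trend into a forecast is an ordinary least-squares line. The sketch below is only an illustration: the yearly totals are invented, `fit_line` is a hypothetical helper, and real predictive work would use richer models and proper uncertainty estimates.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical annual card spending for years 1 to 5 (made-up numbers).
years = [1, 2, 3, 4, 5]
spend = [10000.0, 10400.0, 10900.0, 11300.0, 11900.0]

slope, intercept = fit_line(years, spend)
forecast = slope * 6 + intercept  # point forecast for year 6

# Largest miss on the historical data: a crude hint at how far off
# the forecast could be, not a formal confidence interval.
max_error = max(abs(y - (slope * x + intercept)) for x, y in zip(years, spend))

print(f"trend: +{slope:.0f} per year, year-6 forecast: {forecast:.0f} (+/- about {max_error:.0f})")
```

The forecast is a probable value, not a certainty, which is the sense of "predicting with probability" used above.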
05 Prescriptive Analytics
Here we again use the credit card example. Suppose you want to find out which type of spending (food, entertainment, clothing, and so on) has the biggest impact on your overall spending. Prescriptive analytics builds on predictive analytics by adding "actions" (such as cutting back on food, clothing, or entertainment) and analyzing the resulting outcomes to determine the best category to target in order to reduce your overall expenses. You can extend this to big data and imagine how an executive can make so-called "data-driven" decisions by observing the impact of the various actions in front of them.
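The idea of trying out "actions" and comparing their outcomes can be sketched as a small what-if calculation. Everything below is an assumption made for illustration: the forecast figures, the achievable cut per category, and the helper `evaluate_actions` are all hypothetical.

```python
# Hypothetical forecast of next year's spending by category.
forecast = {
    "food": 2500.0,
    "clothing": 3500.0,
    "entertainment": 2000.0,
    "miscellaneous": 2000.0,
}

# Assumed fraction of each category that could realistically be cut.
achievable_cut = {
    "food": 0.10,
    "clothing": 0.30,
    "entertainment": 0.25,
    "miscellaneous": 0.05,
}


def evaluate_actions(predicted, cuts):
    """Return the predicted total spend for each single-category action."""
    total = sum(predicted.values())
    return {category: total - predicted[category] * cut
            for category, cut in cuts.items()}


outcomes = evaluate_actions(forecast, achievable_cut)
best_action = min(outcomes, key=outcomes.get)  # action with the lowest total

print(f"Cut {best_action} first: total falls to {outcomes[best_action]:.0f}")
```

The prediction supplies the baseline, and the prescription is whichever action leads to the best predicted outcome.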
06 Batch processing
Although batch data processing has existed since the mainframe era, it has gained more significance in the big data era, where large amounts of data must be processed. Batch processing is an efficient way to handle high volumes of data (such as a set of transactions collected over a period of time). Distributed computing (Hadoop), which is discussed later, is one approach to processing batch data.
07 Cassandra
Cassandra is a very popular open source database management system maintained by the Apache Software Foundation. Apache is home to many big data technologies, and Cassandra is specifically designed to handle large volumes of data across distributed servers.
08 Cloud computing
The term cloud computing is now a household name and hardly needs explaining, but the author has included it for completeness. Essentially, when software or data is hosted and processed on remote servers, and those resources can be accessed from anywhere over the network, it can be called cloud computing.
09 Cluster computing
This is a fancy term for computing with a "cluster" that pools the resources of multiple servers. At a more technical level, a discussion of cluster computing involves nodes, the cluster management layer, load balancing, parallel processing, and so on.
10 Dark data
This is a coined term. In the author's opinion, it exists mainly to scare senior management. Basically, dark data refers to all the data that companies gather and process but never put to any meaningful use. That is why it is called "dark" data, and it may never be analyzed at all. It can include social network feeds, call center logs, meeting notes, and so on. Many estimates put dark data at 60% to 90% of all company data, but no one actually knows.
11 Data lake
When the author first heard this term, it sounded like an April Fools' Day joke. But it is a real term. A data lake is a large repository of company-wide data stored in its raw format. While we are here, let us also introduce the data warehouse. A data warehouse is a similar concept, but the difference is that it stores structured data that has been cleaned and integrated with other sources.
Data warehouses are typically used for conventional data (but not exclusively). It is generally believed that a data lake makes it easier to access the data you really need, and to process and use it more conveniently.
12 Data mining
Data mining is the process of finding meaningful patterns in large data sets using sophisticated pattern recognition techniques, and deriving insights from them. It is closely related to the "analytics" discussed above: you first mine the data and then analyze the results. To find meaningful patterns, data miners use statistics (the classic old method), machine learning algorithms, and artificial intelligence.
13 Data Scientist
Data scientist is a very sexy career nowadays. It refers to people who can extract raw data (what we called the data lake earlier), make sense of it, process it, and derive insights from it. Some of the skills a data scientist needs sound like those of a superhuman: analytics, statistics, computer science, creativity, storytelling, and an understanding of business context. No wonder these people are so highly paid.
14 Distributed File System
Big data is too large to be stored on a single system. A distributed file system is a file system that stores large amounts of data across multiple storage devices, which can reduce the cost and complexity of storing that data.
15 ETL
ETL stands for Extract, Transform, and Load. It refers to the process of "extracting" raw data, "transforming" it into a form that is "fit for use" by cleaning and enriching it, and "loading" it into an appropriate repository for the system to use. Although ETL originated with data warehouses, the same process is used when ingesting data, for example when bringing data from external sources into a big data system.
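The three steps can be sketched with nothing but the Python standard library. This is a toy illustration, not a production pipeline: the CSV content, the `spending` table, and the function names are all made up for the example, and an in-memory SQLite database stands in for the target repository.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with messy categories and one bad amount.
RAW_CSV = """date,category,amount
2023-01-05, Food ,12.50
2023-01-06,clothing,80
2023-01-07,FOOD,7.25
2023-01-08,entertainment,not-a-number
"""


def extract(text):
    """Extract: read the raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize categories and drop rows with unparsable amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows that cannot be cleaned
        clean.append((row["date"], row["category"].strip().lower(), amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spending (date TEXT, category TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO spending VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

row_count = conn.execute("SELECT COUNT(*) FROM spending").fetchone()[0]
food_total = conn.execute(
    "SELECT SUM(amount) FROM spending WHERE category = 'food'"
).fetchone()[0]
print(row_count, food_total)
```

After loading, the two differently spelled "food" rows can be queried as one category, which is the point of the transform step.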
16 Hadoop
When people think about big data, they immediately think of Hadoop. Hadoop is an open source software framework (the logo is a cute elephant) built around the Hadoop Distributed File System (HDFS), which allows big data to be stored, retrieved, and analyzed on distributed hardware. If you really want to impress someone, tell them about YARN (Yet Another Resource Negotiator), which, as the name suggests, is yet another resource scheduler. I am really impressed by the people who come up with these names. The Apache Foundation, which is behind Hadoop, is also responsible for Pig, Hive, and Spark (yes, these are all names of software). Aren't you amazed by these names?
17 In-memory computing
18 Internet of Things (IoT)
The latest buzzword is the Internet of Things (IoT). IoT is the interconnection, via the Internet, of computing devices embedded in everyday objects (such as sensors, wearable devices, cars, and refrigerators), which lets them send and receive data. The Internet of Things generates massive amounts of data and creates many opportunities for big data analytics.
19 Machine Learning
Machine learning is a method of designing systems that can learn, adjust, and improve based on the data fed to them. Using predictive and statistical algorithms, these systems continually home in on the "correct" behavior and insights, and they keep improving as more data flows into the system.
20 MapReduce
MapReduce can be a bit difficult to understand, but let me try to explain it. MapReduce is a programming model, and the best way to understand it is to note that Map and Reduce are two separate steps. In the Map step, the programming model first breaks the big data set into small pieces (technically called "tuples", though I try to avoid obscure jargon) and distributes those pieces to different computers in different locations (that is, the cluster described earlier). The model then collects each result and "reduces" them into one. MapReduce's data processing model goes hand in hand with the Hadoop Distributed File System.
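A word count is the usual way to illustrate the two steps. The sketch below simulates them in a single Python process: there is no cluster and no Hadoop here, and the function names (`map_phase`, `shuffle`, `reduce_phase`) and input chunks are made up for the example. In a real deployment each chunk would be mapped on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the mapped values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands in for a piece of the data set sent to a different node.
chunks = ["big data is big", "data about data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each chunk is mapped independently and each key is reduced independently, both steps can be spread across many machines.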
21 Non-relational database (NoSQL)
This term sounds almost like the opposite of "SQL" (Structured Query Language). SQL is essential for traditional relational database management systems (RDBMS), but NoSQL actually stands for "Not only SQL."
NoSQL refers to database management systems that are designed to handle large volumes of data that has no structure (or no "schema"). NoSQL suits big data systems because large unstructured data sets need the flexibility and distributed-first design that NoSQL offers.
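To get a feel for what "no schema" means, the sketch below keeps records as plain Python dictionaries, each with its own set of fields, and filters them with a small helper. This only mimics the document model in memory. It is not the API of any real NoSQL product, and the records and the `find` helper are invented for the example.

```python
# Hypothetical records: each document carries only the fields it needs,
# so no shared table definition (schema) is required.
documents = [
    {"_id": 1, "type": "tweet", "text": "Loving the new cafe", "hashtags": ["coffee"]},
    {"_id": 2, "type": "call_log", "duration_s": 312, "agent": "A17"},
    {"_id": 3, "type": "tweet", "text": "Traffic again", "geo": {"lat": 40.7, "lon": -74.0}},
]


def find(docs, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [d for d in docs
            if all(d.get(field) == value for field, value in criteria.items())]


tweets = find(documents, type="tweet")
print([d["_id"] for d in tweets])
```

In a relational table, the `hashtags`, `geo`, and `duration_s` fields would each need a column (mostly empty) or a separate table. Here they simply appear where they apply.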
22 R language
Can anyone think of a worse name for a programming language? Yet R is exactly that language. Still, R works very well for statistical computing. If you don't know R, don't call yourself a data scientist, because R is one of the most popular programming languages in data science.
23 Spark (Apache Spark)
Apache Spark is a fast in-memory data processing engine that can efficiently run streaming, machine learning, and SQL workloads that require fast iterative access to data sets. Spark is usually much faster than the MapReduce we discussed earlier.
24 Stream processing
Stream processing is designed to process streaming data continuously as it arrives. Combined with streaming analytics (the ability to continuously compute numerical and statistical results), stream processing is particularly well suited to handling large-scale data in real time.
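The contrast with batch processing is that results are updated as each value arrives, without waiting for (or storing) the whole data set. A minimal sketch, assuming a made-up stream of sensor readings and a hypothetical helper `running_mean`:

```python
def running_mean(stream):
    """Yield the mean of all values seen so far, updated per element.

    Only the count and the current mean are kept in memory,
    so the stream can be arbitrarily long.
    """
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


# An iterator stands in for a live feed of readings (made-up numbers).
readings = iter([10.0, 20.0, 30.0, 40.0])
means = list(running_mean(readings))
print(means)
```

A real streaming system adds windowing, fault tolerance, and distribution on top of this basic idea of incremental computation.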
25 Structured v Unstructured Data
This is one of the key contrasts in big data. Structured data is basically any data that can be placed in a relational database, organized so that it can be related to other data through tables. Unstructured data is any data that cannot be placed in a relational database, such as email messages, social media status updates, and human speech.
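The practical difference shows up when you try to answer the same question from both kinds of data. In the sketch below, the table, the email text, and the regular expression are all invented for illustration. The structured side is answered with one SQL query, while the unstructured side first needs ad hoc parsing.

```python
import re
import sqlite3

# Structured: rows in a relational table can be queried directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ana", 40.0), ("ben", 15.5), ("ana", 9.5)],
)
structured_total = conn.execute(
    "SELECT SUM(amount) FROM purchases WHERE customer = 'ana'"
).fetchone()[0]

# Unstructured: the same facts buried in free text (a hypothetical email).
email_body = (
    "Hi, I was charged $40.00 on Monday and then $9.50 again today. "
    "Please check. Ana"
)
# A regular expression that pulls out dollar amounts; fragile by nature.
amounts = [float(m) for m in re.findall(r"\$(\d+(?:\.\d+)?)", email_body)]
unstructured_total = sum(amounts)

print(structured_total, unstructured_total)
```

The pattern works for this one email but would miss amounts written as "forty dollars", which hints at why unstructured data is harder to analyze at scale.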