Data wrangling and exploratory data analysis are simple in concept, but they are hard to get right. Data that has not been cleaned, or has been cleaned badly, is garbage, and the GIGO principle (garbage in, garbage out) also applies to modeling and analysis.
What is Data Wrangling?
Data rarely arrives in a ready-to-use form. It is often contaminated with errors and omissions, rarely has the desired structure, and usually lacks context. Data wrangling is the process of discovering the data, cleaning it, validating it, structuring it, enriching its content (for example by adding information from public data such as weather and economic conditions), and in some cases aggregating and transforming it.
Exactly what data wrangling involves varies from case to case. When the data comes from instruments or IoT devices, data transfer can be a major part of the process. When the data will be used for machine learning, the transformations can include normalization or standardization as well as dimensionality reduction.
When exploratory data analysis is performed on a personal computer with limited memory and storage, the wrangling process may include extracting a subset of the data. When the data comes from multiple sources, field names and units of measurement may need to be consolidated through mapping and transformation.
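As a rough illustration of consolidating field names and units from two sources, the sketch below uses pandas. The two data frames, their column names, and the helper harmonize_source_b are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical readings from two sources that use different field names and units.
source_a = pd.DataFrame({"station": ["A1", "A2"], "temp_c": [20.0, 25.0]})
source_b = pd.DataFrame({"StationID": ["B1", "B2"], "TempF": [68.0, 77.0]})


def harmonize_source_b(df: pd.DataFrame) -> pd.DataFrame:
    """Map source B's field names and units onto source A's conventions."""
    out = df.rename(columns={"StationID": "station", "TempF": "temp_c"})
    # Convert Fahrenheit to Celsius so both sources share one unit.
    out["temp_c"] = (out["temp_c"] - 32.0) * 5.0 / 9.0
    return out


# Stack the harmonized sources into one table with a single schema.
combined = pd.concat([source_a, harmonize_source_b(source_b)], ignore_index=True)
```

Keeping the mapping in one small function makes it easy to rerun the same transformation when either source changes.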
What is Exploratory Data Analysis?
Exploratory data analysis is closely associated with John Tukey of Princeton University and Bell Labs. Tukey proposed exploratory data analysis in 1961 and wrote a book about it in 1977. His interest in exploratory data analysis influenced the development of the S statistical language at Bell Labs, which later led to S-Plus and R.
Exploratory data analysis was Tukey’s reaction to what he saw as an overemphasis on statistical hypothesis testing, also known as confirmatory data analysis. The difference between the two is that in exploratory data analysis you examine the data first and use it to suggest hypotheses, rather than jumping straight to a hypothesis and fitting lines and curves to the data.
In practice, exploratory data analysis combines graphics and descriptive statistics. In one example, Tukey used R to explore the Vietnamese economy of the 1990s with histograms, kernel density estimates, box plots, means and standard deviations, and illustrative graphs.
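The descriptive-statistics side of this can be sketched in a few lines of pandas. The sample values below are made up for illustration, and the plotting calls are left as comments because they assume matplotlib is installed.

```python
import pandas as pd

# Hypothetical sample values; in practice this would be a column of real data.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], name="value")

summary = values.describe()        # count, mean, std, min, quartiles, max
mean = values.mean()
std = values.std(ddof=0)           # population standard deviation
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                      # height of the box in a box plot

# With matplotlib installed, the matching graphics would be along the lines of:
# values.plot.hist(); values.plot.box()
```

Looking at these numbers and plots before fitting anything is the point of the exploratory step: the data suggests the hypotheses.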
ETL and ELT for data analysis
In traditional database usage, ETL (Extract, Transform, Load) is the process of capturing data from a data source (primarily a transactional database), converting it into an analytical structure, and loading it into a data warehouse.
ELT (Extract, Load, Transform) is a more modern process. The data is first moved in raw form into a data lake or data warehouse, and the data warehouse then performs the necessary transformations.
Whether you have a data lake, a data warehouse, or both, the ELT process is better suited to data analysis, and machine learning in particular, than the ETL process. The underlying reason is that machine learning often requires iterating on data transformations in the service of feature engineering, which is very important for making accurate predictions.
Screen scraping for data mining
Sometimes the data is available in a form your analysis programs can read, either as a file or through an API. But what if the data is only available as the output of another program, such as a website that presents it in tables?
Parsing and collecting web data with a program that mimics a web browser is not very difficult. This process is called screen scraping, web scraping, or data scraping. Screen scraping originally meant reading text data from a computer terminal screen, but these days it is far more common for the data to be displayed in HTML web pages.
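A minimal sketch of the parsing half of screen scraping, using only Python's standard library, might look like the following. The HTML snippet and the TableScraper class are hypothetical. A real scraper would first download the page and would have to cope with messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30000</td></tr>
  <tr><td>Shelbyville</td><td>25000</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
header, *records = scraper.rows
```

The scraped cells are still strings, so converting them to numbers and checking them is part of the wrangling that follows.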
Cleaning data and imputing missing values for data analysis
Most real-world raw datasets have missing or obviously wrong data values. The simple steps for cleaning data include dropping columns and rows that have a high percentage of missing values. You may also want to remove outliers later in the process.
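One way to express this simple cleaning step in pandas is sketched below. The 50% cutoff, the sample data, and the helper name drop_sparse are assumptions made for the example. The right threshold depends on the dataset.

```python
import numpy as np
import pandas as pd


def drop_sparse(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns, then rows, whose share of missing values exceeds max_missing."""
    cols = df.loc[:, df.isna().mean(axis=0) <= max_missing]
    return cols.loc[cols.isna().mean(axis=1) <= max_missing]


# Hypothetical raw data with gaps.
raw = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],   # 75% missing, so the column is dropped
    "c": [1.0, 2.0, np.nan, 4.0],
    "d": [5.0, 6.0, 7.0, 8.0],
})
cleaned = drop_sparse(raw)
```

Columns are filtered before rows here so that a mostly empty column does not cause otherwise usable rows to be discarded.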
Sometimes you lose too much data if you follow these rules. Another way to deal with missing values is to impute them, which essentially means estimating what the values should be. This is easy to implement with standard Python libraries.
The pandas data import functions, such as read_csv(), can replace a placeholder symbol such as ‘?’ with ‘NaN’. The scikit-learn class SimpleImputer() can replace ‘NaN’ values using one of four strategies: column mean, column median, column mode, and constant. For a constant replacement value, the default is ‘0’ for numeric fields and ‘missing_value’ for string or object fields. You can override the default by setting fill_value.
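The two steps described above can be sketched together as follows. The CSV text is invented for the example. In scikit-learn the four strategies are spelled "mean", "median", "most_frequent", and "constant".

```python
import io

import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical CSV content in which '?' marks a missing value.
CSV_TEXT = "age,income\n25,50000\n?,60000\n35,?\n45,70000\n"

# na_values tells read_csv to turn the placeholder into NaN on import.
df = pd.read_csv(io.StringIO(CSV_TEXT), na_values="?")

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Switching to strategy="constant" together with fill_value would substitute a fixed value instead of a column statistic.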
Which imputation strategy is best? It depends on the data and the model, so the only way to find out is to try them all and see which strategy yields the fitted model with the best validation accuracy score.
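A sketch of that try-them-all comparison, assuming scikit-learn, is shown below. The synthetic dataset, the 10% missing rate, and the choice of logistic regression with 5-fold cross-validation are all assumptions for illustration. Substitute your own data and model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with roughly 10% of the values knocked out at random.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

scores = {}
for strategy in ("mean", "median", "most_frequent", "constant"):
    # Imputation sits inside the pipeline so it is refit on each training fold.
    model = make_pipeline(
        SimpleImputer(strategy=strategy),
        LogisticRegression(max_iter=1000),
    )
    scores[strategy] = cross_val_score(model, X, y, cv=5).mean()

best_strategy = max(scores, key=scores.get)
```

Putting the imputer inside the pipeline keeps statistics from the validation folds out of the training step, so the scores stay comparable.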
Feature engineering for predictive modeling
A feature is an individual measurable property or characteristic of an observed phenomenon. Feature engineering is the construction of a minimum set of independent variables that explain the problem. If two variables are highly correlated, either they should be combined into a single feature or one of them should be dropped. Sometimes principal component analysis (PCA) is performed to convert correlated variables into a set of linearly uncorrelated variables.
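The effect of PCA on two highly correlated variables can be checked with a small sketch like this one, which assumes NumPy and scikit-learn. The synthetic variables x1 and x2 are invented for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=500)   # strongly correlated with x1
X = np.column_stack([x1, x2])

# Correlation between the two original variables.
corr_before = np.corrcoef(X, rowvar=False)[0, 1]

# PCA rotates the data onto axes that are linearly uncorrelated.
components = PCA(n_components=2).fit_transform(X)
corr_after = np.corrcoef(components, rowvar=False)[0, 1]
```

In a case like this almost all of the variance lands in the first component, which is why the second one can often be dropped.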
Categorical variables, usually in text form, must be encoded as numbers before they can be used for machine learning. Assigning an integer to each category (label encoding) may seem obvious and easy, but unfortunately some machine learning models mistake the integers for ordinals. A popular alternative is one-hot encoding, in which each category is assigned its own column (or vector dimension) that is coded as either 1 or 0.
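The difference between the two encodings is easy to see on a tiny example. The sketch below uses pandas, and the color values are made up.

```python
import pandas as pd

# Hypothetical categorical column.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: one integer per category (categories are sorted alphabetically).
label_codes = colors.astype("category").cat.codes

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)
```

With label encoding a model could read "red" (2) as greater than "blue" (0). With one-hot encoding no such ordering is implied, at the cost of one extra column per category.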
Feature generation is the process of constructing new features from raw observations. For example, subtracting the year of birth from the year of death yields the age at death, which is a primary independent variable for lifetime and mortality analysis. The Deep Feature Synthesis algorithm is useful for automating feature generation, and it is implemented in the open source Featuretools framework.
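The age-at-death example takes one line in pandas. The column names and sample years below are hypothetical.

```python
import pandas as pd

# Hypothetical raw observations.
people = pd.DataFrame({
    "year_of_birth": [1900, 1925, 1950],
    "year_of_death": [1975, 2010, 2020],
})

# Derived feature: approximate age at death (months and days are ignored).
people["age_at_death"] = people["year_of_death"] - people["year_of_birth"]
```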
Feature selection is the process of removing unnecessary features from the analysis to avoid the ‘curse of dimensionality’ and overfitting. Dimensionality reduction algorithms can do this automatically. Techniques include removing variables with many missing values, removing variables with low variance, decision trees, random forests, removing or combining highly correlated variables, Backward Feature Elimination (BFE), Forward Feature Selection (FFS), factor analysis, and PCA.
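Two of the techniques listed above, removing low-variance variables and removing highly correlated ones, can be sketched as follows with pandas and scikit-learn. The synthetic features and the 0.95 correlation cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
base = rng.normal(size=200)
features = pd.DataFrame({
    "signal": base,
    "near_copy": base + rng.normal(scale=0.01, size=200),  # almost identical to signal
    "constant": np.ones(200),                               # zero variance
    "independent": rng.normal(size=200),
})

# Step 1: drop features with zero variance.
selector = VarianceThreshold(threshold=0.0)
selector.fit(features)
kept = features.loc[:, selector.get_support()]

# Step 2: for each highly correlated pair, drop the later column.
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected = kept.drop(columns=to_drop)
```

Only the upper triangle of the correlation matrix is inspected so that each pair is considered once and one member of the pair survives.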
Data normalization for machine learning (ML)
To use numeric data for machine regression, you usually need to normalize the data. Otherwise, the numbers with larger ranges tend to dominate the Euclidean distance between feature vectors, their effects may be magnified at the expense of other fields, and steepest descent optimization may have difficulty converging. There are several ways to normalize and standardize data for machine learning, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling.
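The four scaling methods named above reduce to short formulas. The NumPy sketch below applies them to a made-up vector x. Standardization here uses the population standard deviation, and the unit-length example treats x as a single feature vector.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical values

# Min-max normalization: rescale to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Mean normalization: center on the mean, divide by the range.
mean_norm = (x - x.mean()) / (x.max() - x.min())

# Standardization: zero mean and unit variance.
standardized = (x - x.mean()) / x.std()

# Scaling to unit length: divide by the Euclidean norm.
unit_length = x / np.linalg.norm(x)
```

Whichever method is used, the scaling parameters should be computed on the training data and then reused unchanged on validation and test data.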
Data analysis life cycle
There may be many variations on the data analysis life cycle, but it can be broken down into seven or eight steps, depending on how you count.
1. Identify the questions to be answered to understand the business and the variables that need to be predicted.
2. Acquire the data (also called data mining).
3. Clean the data and account for missing data, either by discarding rows or by imputing values.
4. Explore the data.
5. Perform feature engineering.
6. Perform predictive modeling, including machine learning, validation, and statistical methods and tests.
7. Data visualization.
8. Return to step 1 (business understanding) and continue the cycle.
Steps 2 and 3 are often considered data wrangling, but it is important to establish the context for data wrangling by identifying the business questions to be answered (step 1). It is also important to do exploratory data analysis (step 4) before modeling to avoid introducing bias into the predictions. It is common to iterate on steps 5 through 7 to find the best model and feature set.
The life cycle almost always starts again just when you think you are done, because conditions change, the data drifts, or the business needs additional questions answered.
* Martin Heller is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from 1986 to 2010. Since then, he has been vice president of technology and education at Alpha Software and chairman and CEO of Tubifi. firstname.lastname@example.org