Definition
- Non-trivial extraction of implicit, previously unknown and potentially useful information from data
- The process of automatically discovering useful information in large data repositories
- Data mining techniques also provide capabilities to predict the outcome of a future observation
In other words, data mining means finding usable or applicable information within large amounts of data.
KDD (Knowledge Discovery in Databases)
Preprocessing
- Transforms the input data into a form that is usable for analysis
Postprocessing
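- Ensures that only valid and useful results are incorporated into the decision support system
Scalability
- The algorithms used for mining must scale well to very large data sets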
High Dimensionality
- Algorithms must cope with data sets that have up to tens of thousands of attributes
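Let's look at how the attributes fed into the process are classified.
Types of Attributes:
Nominal:
Values are only labels that can be tested for equality (=, !=); no arithmetic or ordering applies. Examples: ID numbers, eye color, zip codes.
Ordinal:
Values can be ordered (<, >), but the differences between them have no numeric meaning. Examples: rankings, height (tall, medium, short), sweetness level.
Interval:
Values whose differences are meaningful, so addition and subtraction apply (+, -). Examples: temperature (in Celsius or Fahrenheit), calendar dates.
Ratio:
Values whose ratios are also meaningful, so multiplication and division apply (*, /). Examples: length, weight.
Attributes can also be classified as discrete or continuous.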
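The four attribute types above can be summarized by the operations each one supports. The following is a minimal sketch, not taken from the original notes: the names `LEVELS`, `NEW_OPS`, `allowed_ops` and `is_meaningful` are hypothetical, and it assumes the usual convention that each level also inherits the operations of the levels before it.

```python
# Measurement levels, ordered from least to most expressive.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Operations that each level adds on top of the previous ones
# (assumption: the levels are cumulative, as in the usual convention).
NEW_OPS = {
    "nominal": {"==", "!="},
    "ordinal": {"<", ">"},
    "interval": {"+", "-"},
    "ratio": {"*", "/"},
}


def allowed_ops(level):
    """Return every operation that is meaningful at the given level."""
    ops = set()
    for name in LEVELS[: LEVELS.index(level) + 1]:
        ops |= NEW_OPS[name]
    return ops


def is_meaningful(level, op):
    """Check whether applying `op` makes sense for an attribute of `level`."""
    return op in allowed_ops(level)


# Eye color is nominal: equality is fine, ordering is not.
eye_color_can_compare = is_meaningful("nominal", "<")
# Weight is a ratio attribute: division is fine.
weight_can_divide = is_meaningful("ratio", "/")
```

A check like this can be useful before choosing a statistic: for instance, a mean only makes sense when `+` is meaningful, which rules out nominal and ordinal attributes.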
Data Preprocessing
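- Aggregation
- Combines two or more attributes into a single attribute
- Purpose: broaden the scope of analysis (change of scale), reduce the number of attributes, and obtain more stable data, since aggregated data tends to have less variability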
- Sampling
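- Sampling is the main technique used for data selection
- It is often used for both the preliminary investigation of the data and the final data analysis
- Analyzing only a sample instead of the full data set reduces the cost in time and money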
- Dimensionality Reduction
- Reduces the time and memory required by data mining algorithms
- Makes the data easier to visualize
- May help eliminate irrelevant features
- Feature subset selection
- Redundant features
- duplicate much or all of the information contained in one or more other attributes
- Example: purchase price of a product and the amount of sales tax paid
- Irrelevant features
- contain no information that is useful for the data mining task at hand
- Example: a student's ID is often irrelevant to the task of predicting that student's GPA
- Feature creation
- Create new attributes that can capture the important information from the original attributes
- Three general methodologies:
- Feature Extraction
- Highly domain-specific
- e.g. edge detection from image data
- Mapping Data to New Space
- E.g. Fourier transform
- Feature Construction
- Combining features, e.g. density = mass / volume
- Discretization and Binarization
- Binarization
- Transform the continuous and discrete attributes into one or more binary attributes
- Discretization
- Transform a continuous attribute into a categorical attribute
- Variable Transformation
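
Several of the preprocessing steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the toy records are hypothetical, aggregation is shown as summing the values of records that share a key, discretization uses equal-width bins, and binarization uses one binary attribute per category, which is just one of several possible encodings.

```python
import random
from collections import defaultdict


def aggregate(records, key, value):
    """Aggregation: sum `value` over all records that share the same `key`."""
    totals = defaultdict(float)
    for row in records:
        totals[row[key]] += row[value]
    return dict(totals)


def simple_random_sample(records, n, seed=0):
    """Sampling: draw n records without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(records, n)


def add_density(row):
    """Feature construction: derive density = mass / volume as a new attribute."""
    new_row = dict(row)
    new_row["density"] = row["mass"] / row["volume"]
    return new_row


def discretize(value, low, high, labels):
    """Discretization: map a continuous value to a label using equal-width bins."""
    width = (high - low) / len(labels)
    index = int((value - low) / width)
    index = min(max(index, 0), len(labels) - 1)  # clamp values at or beyond the edges
    return labels[index]


def binarize(value, categories):
    """Binarization: one binary attribute per category."""
    return [1 if value == c else 0 for c in categories]


# Toy data, invented purely for illustration.
sales = [
    {"month": "Jan", "amount": 10.0},
    {"month": "Jan", "amount": 5.0},
    {"month": "Feb", "amount": 7.0},
]

monthly_totals = aggregate(sales, "month", "amount")
two_records = simple_random_sample(sales, 2)
with_density = add_density({"mass": 10.0, "volume": 4.0})
height_label = discretize(172.0, 140.0, 200.0, ["short", "medium", "tall"])
eye_color_bits = binarize("blue", ["brown", "blue", "green"])
```

Note how discretization turns a ratio attribute (height in centimeters) into an ordinal one, while binarization turns a nominal attribute (eye color) into several binary ones, so both steps change the attribute type that later algorithms will see.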