Sunday, October 4, 2015

Data Mining Week 1

Data Mining

Definition
  1. Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  2. The process of automatically discovering useful information in large data repositories
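  3. Data mining techniques also provide the capability to predict the outcome of a future observation
In other words, data mining means finding usable or applicable information within large amounts of data.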

KDD (Knowledge Discovery in Databases)
Preprocessing
- Turns the raw input data into a usable form

Postprocessing
- Ensures that only valid and useful results are incorporated into the decision support system

Scalability
- The algorithms used for mining must be highly scalable

High Dimensionality
- The algorithms must cope with data that can have tens of thousands of attributes

Let's take a look at how the types of input attributes are classified~


Type of Attributes:

Nominal:
Attributes that cannot be used in arithmetic; they only act as labels (=, !=). Examples: ID, eye color, zip codes
Ordinal:
Values can be ordered (>, <) but carry no real numeric magnitude. Examples: rankings, height (tall, medium, short), sweetness level...
Interval:
Values that support addition and subtraction (+, -), so differences are meaningful. Examples: temperature, calendar dates.
Ratio:
Values that support multiplication and division (*, /), so ratios are meaningful. Examples: length, weight...

Attributes can also be classified as Discrete or Continuous.
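
To make the four attribute types more concrete, here is a minimal Python sketch of which operations are meaningful for each. All variable names and sample values are made up for illustration; they are not from the lecture.

```python
# Hypothetical sample values, chosen only to illustrate the four attribute types.
eye_color_a, eye_color_b = "brown", "blue"   # nominal
HEIGHT_ORDER = ["short", "medium", "tall"]   # ordinal: rank is the list position
temp_c_a, temp_c_b = 10.0, 20.0              # interval (degrees Celsius)
length_a, length_b = 2.0, 4.0                # ratio (meters)

# Nominal: only equality tests (=, !=) are meaningful.
same_color = eye_color_a == eye_color_b

# Ordinal: order comparisons (>, <) are meaningful, differences are not.
taller = HEIGHT_ORDER.index("tall") > HEIGHT_ORDER.index("medium")

# Interval: differences (+, -) are meaningful, ratios are not
# (20 degrees Celsius is not "twice as hot" as 10 degrees Celsius).
temp_diff = temp_c_b - temp_c_a

# Ratio: there is a true zero, so ratios (*, /) are meaningful.
length_ratio = length_b / length_a

print(same_color, taller, temp_diff, length_ratio)
```

Each type supports the operations of the types listed before it, which is why the order Nominal, Ordinal, Interval, Ratio matters.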

Data Preprocessing

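  • Aggregation
    • Combines two or more attributes into a single one
    • Purpose: widen the scale of analysis, reduce the number of attributes, and get more stable data (aggregated data tends to have less variability)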
  • Sampling
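    • Sampling is the main technique used for data selection
    • It is often used for both the preliminary investigation of the data and the final data analysis
    • Often only a sample is examined, which reduces time and monetary cost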
  • Dimensionality Reduction
    • Reduces the time and memory required by data mining algorithms
    • Makes the data easier to visualize
    • May eliminate irrelevant features
  • Feature subset selection
    • Redundant features 
      • duplicate much or all of the information contained in one or more other attributes
      • Example: purchase price of a product and the amount of sales tax paid
    • Irrelevant features
      • contain no information that is useful for the data mining task at hand
      • Example: students' ID is often irrelevant to the task of predicting students' GPA
  • Feature creation
    • Create new attributes 
      • that can capture the important information from the original attributes
    • Three general methodologies:
      • Feature Extraction
        • Highly domain-specific
        • E.g. edge detection from image data
      • Mapping Data to New Space
        • E.g. Fourier transform
      • Feature Construction
        • Combining features, e.g. density = mass / volume
  • Discretization and Binarization
    • Binarization
      • Transform the continuous and discrete attributes into one or more binary attributes
    • Discretization
      • Transform a continuous attribute into a categorical attribute
  • Variable Transformation
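
Several of the preprocessing steps listed above can be sketched in a few lines of plain Python. This is only a rough illustration under made-up assumptions: the sales records, the cut points 50 and 100, and the helper names `density`, `discretize`, and `binarize` are all hypothetical and not part of the lecture material.

```python
import random
from collections import defaultdict

# Hypothetical daily sales records: (store, day, amount).
records = [
    ("A", 1, 120.0), ("A", 2, 80.0), ("A", 3, 100.0),
    ("B", 1, 30.0), ("B", 2, 50.0), ("B", 3, 40.0),
]

# Aggregation: collapse the per-day rows into one total per store.
totals = defaultdict(float)
for store, _day, amount in records:
    totals[store] += amount

# Sampling: simple random sampling without replacement.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.sample(records, k=3)

# Feature construction: combine two attributes into a new one.
def density(mass, volume):
    return mass / volume

# Discretization: map a continuous value to a category.
# The cut points 50 and 100 are arbitrary assumptions.
def discretize(amount):
    if amount < 50:
        return "low"
    if amount < 100:
        return "medium"
    return "high"

# Binarization: one binary attribute per category (one-hot encoding).
CATEGORIES = ["low", "medium", "high"]

def binarize(category):
    return [1 if category == c else 0 for c in CATEGORIES]

levels = [discretize(amount) for _, _, amount in records]
encoded = [binarize(level) for level in levels]

print(dict(totals))
print(levels)
print(encoded)
```

In practice the cut points for discretization would be chosen from the data (for example equal-width or equal-frequency bins) rather than fixed by hand as they are here.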


