Patent Classification via Textual Analysis: Which Sections to be Included?

Yucesoy S., Dereli T., DURMUŞOĞLU A.

2018 International Conference on Artificial Intelligence and Data Processing, IDAP 2018, Malatya, Türkiye, 28-30 September 2018 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI Number: 10.1109/idap.2018.8620929
City of Publication: Malatya
Country of Publication: Türkiye
Keywords: document classification, k-NN, patent classification, SVM, text mining
Affiliated with Samsun University: No

Abstract

Accurate classification of patent documents is crucial for creating universal, exchangeable information across patent offices. To classify applications according to their type of invention, each is assigned to one or more classes from an expansive classification system. Growing technological variety makes this task more complex, since the right classes must be selected from more than 250,000 groups (in the Cooperative Patent Classification). In a patent office with a heavy workload, a significant amount of time is spent performing these processes manually. Accordingly, there have been efforts to develop automated systems capable of detecting the class of an application via software. Textual analysis may yield important clues about the appropriate class of an application. However, the length of applications makes it computationally difficult to analyze whole documents. Therefore, this study investigates which parts of a patent play the most significant role in determining its right class(es). To this end, the paper covers several trials aimed at increasing the accuracy of patent classification. These trials are carried out systematically, and in each case different parts of the patent are added or removed. For this purpose, the WIPO-alpha dataset (D class only) is textually analyzed via support vector machine (SVM) and k-nearest neighbour (k-NN) methods. It is concluded that the combination of abstract, title and description gives results similar to those obtained using all textual content.
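The section-selection trials described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain bag-of-words representation, cosine similarity, a 1-nearest-neighbour classifier and leave-one-out evaluation, and it covers only the k-NN side (no SVM). The four toy records, their field names and the helper functions are hypothetical and stand in for WIPO-alpha documents, which are far longer.

```python
import math
from collections import Counter

# Hypothetical toy records (not taken from WIPO-alpha); labels are used
# here only as two distinct class names.
PATENTS = [
    {"title": "yarn spinning apparatus",
     "abstract": "a device for spinning fibre into yarn",
     "description": "rollers draw the fibre and twist the yarn",
     "label": "D01"},
    {"title": "fibre drawing method",
     "abstract": "method for drawing fibre before spinning yarn",
     "description": "the fibre passes through rollers and is twisted into yarn",
     "label": "D01"},
    {"title": "paper pulp refiner",
     "abstract": "a refiner for paper pulp",
     "description": "pulp is beaten and formed into a paper sheet",
     "label": "D21"},
    {"title": "paper sheet forming",
     "abstract": "forming a paper sheet from pulp",
     "description": "wet pulp is pressed into a paper sheet",
     "label": "D21"},
]


def build_text(patent, sections):
    """Concatenate only the chosen sections of one patent record."""
    return " ".join(patent[s] for s in sections)


def vectorize(text):
    """Bag-of-words term counts (assumed representation, no TF-IDF)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_predict(query_vec, train, k=1):
    """Majority label among the k training vectors most similar to the query."""
    ranked = sorted(train, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(patents, sections, k=1):
    """Classify each patent using all the others; return the hit rate."""
    vecs = [(vectorize(build_text(p, sections)), p["label"]) for p in patents]
    correct = 0
    for i, (vec, label) in enumerate(vecs):
        train = vecs[:i] + vecs[i + 1:]
        if knn_predict(vec, train, k) == label:
            correct += 1
    return correct / len(vecs)


if __name__ == "__main__":
    # Each trial adds or removes sections, mirroring the study design.
    trials = [
        ["title"],
        ["abstract"],
        ["title", "abstract"],
        ["title", "abstract", "description"],
    ]
    for sections in trials:
        acc = leave_one_out_accuracy(PATENTS, sections)
        print("+".join(sections), round(acc, 2))
```

With only four short records the accuracies say nothing about real patents; the sketch only shows how one evaluation routine can be reused while the set of included sections changes between trials.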