Welcome to Kanopy
Kanopy analyses text documents and links them to DBpedia in order to identify and label the topics of each document. Concepts that are relevant to the text are extracted both from the text itself and from DBpedia, and those with high graph centrality are recommended as labels for the topic. The concepts are extracted together with the relations between them, forming so-called topic graphs.
Kanopy's process revolves around topics. For generality, we consider a topic to be a bag of words. The only assumption is that the words in a topic are related; for example, they can be grouped by co-occurrence metrics. One of the main aims of Kanopy is to automatically label each topic with a short, linguistically coherent phrase. The main idea behind our approach is that a concept suitable as a topic label is central, from a semantic-graph perspective, with respect to all the topic words. To quantify this semantic centrality, we analyse the semantic network that interconnects the concepts behind the words.
To identify the topics, Kanopy extracts all noun phrases from the text and clusters them based on their co-occurrence in the Wikipedia full-text corpus. The similarity metric is Positive Pointwise Mutual Information (PPMI). Kanopy uses hierarchical agglomerative clustering with complete linkage. Average linkage is also available and can be selected in the user interface.
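The clustering step can be illustrated with a minimal sketch. It is not Kanopy's implementation: the co-occurrence counts below are invented toy numbers rather than Wikipedia statistics, and the function names, the similarity threshold and the stopping rule are assumptions made for this example only.

```python
import math
from itertools import combinations

def ppmi(pair_count, count_a, count_b, total):
    """Positive PMI from raw counts: max(0, log2(p(a,b) / (p(a) * p(b))))."""
    if pair_count == 0 or count_a == 0 or count_b == 0:
        return 0.0
    return max(0.0, math.log2((pair_count * total) / (count_a * count_b)))

def complete_linkage(items, similarity, threshold):
    """Agglomerative clustering with complete linkage.

    The similarity of two clusters is that of their least similar
    cross pair. The most similar two clusters are merged repeatedly,
    until no pair of clusters reaches the (assumed) threshold.
    """
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best_score, best_pair = None, None
        for i, j in combinations(range(len(clusters)), 2):
            score = min(similarity(a, b)
                        for a in clusters[i] for b in clusters[j])
            if best_score is None or score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical toy counts over 1000 text windows (not real corpus data).
TOTAL = 1000
COUNTS = {"neural network": 50, "deep learning": 40,
          "interest rate": 30, "central bank": 20}
PAIR_COUNTS = {
    frozenset(["neural network", "deep learning"]): 20,
    frozenset(["interest rate", "central bank"]): 10,
    frozenset(["neural network", "interest rate"]): 1,
}

def phrase_similarity(a, b):
    pair = PAIR_COUNTS.get(frozenset([a, b]), 0)
    return ppmi(pair, COUNTS[a], COUNTS[b], TOTAL)

topics = complete_linkage(list(COUNTS), phrase_similarity, threshold=1.0)
print(topics)
```

With these toy counts, the two machine-learning phrases end up in one topic and the two finance phrases in another. Replacing `min` with an average over the cross pairs would give average linkage instead.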
Once the clusters (topics) are formed, the concepts are linked and disambiguated with respect to DBpedia. Two-hop neighbourhood graphs are then extracted, starting from the DBpedia concepts that were found. For each topic, the subgraphs that overlap to form a connected component are merged, and the resulting graph is analysed. To identify the concepts in this graph that are most relevant to the seed nodes, we apply focused centrality measures. This step ranks the nodes in the topic graph by their semantic centrality with respect to the seed concepts. Currently, Kanopy uses focused random walk betweenness (fRWB) and focused information centrality (fIC), but additional measures can be plugged in. The two measures follow different principles for weighting the nodes. fRWB favours "broker" nodes, which are crucial for the connectivity between the seed nodes. fIC, on the other hand, gives more importance to nodes that are close to the seed nodes.
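The graph steps can be sketched as follows. This is an illustration under stated assumptions, not Kanopy's code: the graph is a hypothetical hand-made toy rather than real DBpedia data, all function names are invented for this example, and the ranking uses harmonic closeness to the seed nodes as a much simpler stand-in. It is neither fRWB nor fIC; it only mimics the intuition that nodes close to the seeds score higher.

```python
from collections import deque

# Hypothetical undirected toy graph; not real DBpedia data.
GRAPH = {
    "Neural_network": {"Machine_learning"},
    "Deep_learning": {"Machine_learning"},
    "Machine_learning": {"Neural_network", "Deep_learning",
                         "Artificial_intelligence"},
    "Artificial_intelligence": {"Machine_learning", "Computer_science"},
    "Computer_science": {"Artificial_intelligence"},
    "Interest_rate": {"Monetary_policy"},
    "Monetary_policy": {"Interest_rate"},
}
SEEDS = ["Neural_network", "Deep_learning", "Interest_rate"]

def distances_from(graph, source, max_hops=None):
    """Breadth-first hop distances from source, optionally cut at max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if max_hops is not None and dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def merge_overlapping(node_sets):
    """Merge node sets that share at least one node."""
    merged = []
    for nodes in node_sets:
        nodes = set(nodes)
        for other in [m for m in merged if m & nodes]:
            nodes |= other
            merged.remove(other)
        merged.append(nodes)
    return merged

def seed_harmonic_closeness(graph, nodes, seeds):
    """Stand-in score: sum of 1/distance to each seed inside the subgraph."""
    sub = {n: graph.get(n, set()) & nodes for n in nodes}
    scores = dict.fromkeys(nodes, 0.0)
    for seed in seeds:
        for node, d in distances_from(sub, seed).items():
            if d > 0:
                scores[node] += 1.0 / d
    return scores

# Two-hop neighbourhood of every seed, merged where they overlap.
components = merge_overlapping(
    distances_from(GRAPH, seed, max_hops=2) for seed in SEEDS
)
topic_nodes = components[0]
topic_seeds = [s for s in SEEDS if s in topic_nodes]
scores = seed_harmonic_closeness(GRAPH, topic_nodes, topic_seeds)
ranking = sorted(scores, key=lambda n: (-scores[n], n))
print(ranking)
```

In this toy graph, the neighbourhoods of the two machine-learning seeds overlap and are merged, while the finance seed stays in its own component. Within the merged component, the node that sits between both seeds ranks first.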
To use Kanopy, the user only needs to enter the text they want to analyse. Kanopy then processes the text and displays the results as a list of topics. Each topic consists of three parts: ''Extracted Concepts'', ''Categories'' and ''Topic Graph''.
Extracted Concepts represent the result of text analysis, concept linking and disambiguation. They are grouped based on hierarchical clustering.
Categories represent the DBpedia concepts selected for labelling the topic.
The Topic Graph is the compressed graph that connects the extracted concepts and the categories. In the user interface, this graph is initially collapsed to avoid overloading the display. It can be expanded by clicking the Topic Graph bar. Besides the extracted concepts and categories, the topic graph also shows the concepts that connect them. Hovering over a graph node highlights the corresponding concept, as well as its occurrences in the text.
The user interface offers several configuration options that help in understanding the effects of changing parameters throughout the Kanopy process. The section 'Usage Scenario' explains the user interface and recommends some steps that can be followed to explore the system.