SearchFill

Master's degree project: Using Searches to fill Concepts with Content

Background / Problem

A context-map is a set of concepts and conceptual relations that are presented together and that can be associated with various types of content (components). The KMR-group at CID has developed a concept-browser called Conzilla that supports navigation in an atlas of context-maps and makes it possible to present the content of a concept (or concept-relation) in various ways. Capturing the full range of meanings of a concept or conceptual relation is in general impossible. We try to narrow the scope by presenting a set of concepts in a specific context, i.e. a context-map, together with short descriptions, keywords, etc., all of which is commonly called meta-data. In this case, however, we have meta-data about a concept as well as meta-data about a content component such as a web page. This meta-data is used when presenting information about concepts in a context-map, as pop-ups, inline comments, separate listings, etc.

Concepts can also be understood through examples. The creator of a concept can therefore link it to a few examples, which can be web content such as web pages, pictures and videos, but also real-world objects (we call them content components below). In fact, anything that can be referenced can be added as content on a concept. These content components can carry meta-data as well, which helps the user choose which one to look into further.

It doesn't take long to realize that any such small set of content components will be incomplete as a description of a concept. In the present version of Conzilla, adding even a small set of content components generally takes too much time. Consequently, the concepts in our current context-maps are, with few exceptions, empty of content. Even though we have tried to simplify the procedure for adding new content components, it is still a rather cumbersome task. So the problem is twofold: how do we find the content we want to include, and how do we simplify the procedure for linking it to our concepts? If we restrict the first question to web content only, the answer naturally involves various search engines. The problem then becomes how to treat the search results in the concept-browser.

The Project

We want to be able to perform searches against various search engines such as Google, AltaVista, Netscape's Open Directory Project (dmoz.org), Lycos, etc. The result of such a search should be parsed by the concept-browser and presented as a list, just like any other set of content components listed on a concept. Furthermore, it should be possible to carry out context-searches, in the sense that when a new search is performed on a concept, the meta-data on this concept can be used to help specify the search. A typical scenario is that a new search (with e.g. Google) on a specific concept is initiated with the keywords on the concept, and after inspecting the result the search is refined by adding more search criteria. When the result is specific enough, the search (note: not the results) is saved on the concept. Every time someone views the content on this concept the search is performed again, returning not necessarily the same but hopefully similar results. If a specific search result is of great interest, it can of course be saved on the concept as well.
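The scenario above can be sketched in Java as follows. This is only an illustration under stated assumptions, not Conzilla code: the class names Concept and SavedSearch, the method names, and the engine name "example-engine" are all hypothetical, and the search backend is a stub function instead of a real search engine call. The point it shows is that the concept stores the query (seeded from its keywords and then refined), and that the query is executed again each time the content is viewed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class ContextSearchSketch {

    /** Hypothetical stand-in for a concept with keyword meta-data. */
    static final class Concept {
        final String title;
        final List<String> keywords;
        // Saved searches live on the concept; their results do not.
        final List<SavedSearch> savedSearches = new ArrayList<>();

        Concept(String title, List<String> keywords) {
            this.title = title;
            this.keywords = new ArrayList<>(keywords);
        }
    }

    /** A stored query: the engine name and the search terms, never the hits. */
    static final class SavedSearch {
        final String engine;
        final List<String> terms;

        SavedSearch(String engine, List<String> terms) {
            this.engine = engine;
            this.terms = Collections.unmodifiableList(new ArrayList<>(terms));
        }

        /** Joins the terms into a plain free-text query. */
        String queryString() {
            return String.join(" ", terms);
        }

        /** Returns a new, narrower search; the original is left unchanged. */
        SavedSearch refine(String... extraTerms) {
            List<String> all = new ArrayList<>(terms);
            all.addAll(Arrays.asList(extraTerms));
            return new SavedSearch(engine, all);
        }

        /** Executes the query against a backend supplied by the caller. */
        List<String> run(Function<String, List<String>> backend) {
            return backend.apply(queryString());
        }
    }

    /** Context-search: the initial query is taken from the concept's keywords. */
    static SavedSearch fromConcept(String engine, Concept concept) {
        return new SavedSearch(engine, concept.keywords);
    }

    public static void main(String[] args) {
        Concept concept = new Concept("Fractal", Arrays.asList("fractal", "geometry"));

        // Start from the keywords, refine, then save the search on the concept.
        SavedSearch search = fromConcept("example-engine", concept).refine("Mandelbrot");
        concept.savedSearches.add(search);

        // Stub backend (assumption); a real one would query a search engine.
        Function<String, List<String>> stub = q -> Arrays.asList("stub hit for: " + q);

        // Viewing the content re-runs every saved search.
        for (SavedSearch s : concept.savedSearches) {
            System.out.println(s.engine + " -> " + s.run(stub));
        }
    }
}
```

Keeping SavedSearch immutable means each refinement step produces a new candidate query, so the user can go back to an earlier, broader search before deciding which one to save.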

With each search result, most search engines produce some extra information that helps determine which result is best. This information should be extracted and represented as meta-data about each search result. Which kind of information to store in which meta-data field has to be decided depending on the style of each search engine. Search engines that are aware of meta-data standards such as Dublin Core will of course help immensely. To our knowledge there is no common format for how search engines return their results, hence specific code will be needed for each search engine to be searched against. However, there are projects that already extract information from search results. Mozilla's sidebar search-plugins are examples of such projects. Some of the corresponding code exists in the public domain and can serve as an inspiration and maybe even be reused. Some legal aspects of how search engines can be used will have to be looked into, but that responsibility will fall on the project coordinators.
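One way to isolate the engine-specific code is a small wrapper interface with one implementation per search engine. The Java sketch below is an assumption made for illustration: the interface SearchEngineWrapper, the class names, and the tab-separated response format of "example-engine" are invented and do not correspond to any real search engine. The field names dc:title and dc:description follow the Dublin Core elements title and description; which engine field goes into which meta-data field remains a per-engine decision, as noted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SearchWrapperSketch {

    /** One hit, with the extra information kept as meta-data fields. */
    static final class SearchResult {
        final String uri;
        final Map<String, String> metadata = new LinkedHashMap<>();

        SearchResult(String uri) {
            this.uri = uri;
        }
    }

    /** Engine-specific parsing hides behind this (hypothetical) interface. */
    interface SearchEngineWrapper {
        String name();

        List<SearchResult> parse(String rawResponse);
    }

    /**
     * Wrapper for an invented engine whose response has one hit per line,
     * formatted as url TAB title TAB snippet.
     */
    static final class TabSeparatedWrapper implements SearchEngineWrapper {
        @Override
        public String name() {
            return "example-engine";
        }

        @Override
        public List<SearchResult> parse(String rawResponse) {
            List<SearchResult> results = new ArrayList<>();
            for (String line : rawResponse.split("\n")) {
                String[] fields = line.split("\t", -1);
                if (fields.length < 3 || fields[0].isEmpty()) {
                    continue; // skip lines that do not match the assumed format
                }
                SearchResult result = new SearchResult(fields[0]);
                result.metadata.put("dc:title", fields[1]);
                result.metadata.put("dc:description", fields[2]);
                results.add(result);
            }
            return results;
        }
    }

    public static void main(String[] args) {
        SearchEngineWrapper wrapper = new TabSeparatedWrapper();
        String raw = "http://example.org/a\tTitle A\tSnippet A\n"
                   + "http://example.org/b\tTitle B\tSnippet B";
        for (SearchResult r : wrapper.parse(raw)) {
            System.out.println(r.uri + " " + r.metadata);
        }
    }
}
```

With this split, the presentation code only depends on SearchResult and its meta-data map, so supporting a further engine means adding one more wrapper class.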

The project in short:

A way to specify searches against various search engines from within Conzilla, covering both free-text searches and context-based searches as described above.
Presentation of the results inside Conzilla.
Extraction and inclusion of meta-data in the presentation of the search-results.
Saving of the searches and search results as content in Conzilla.

The proposed project is the start of a very promising direction in the development of Conzilla, and we foresee several possible continuations. We are involved in an international collaboration project called Edutella, which is a P2P search system for meta-data that will allow very precise searches with graph-like user interfaces. From the perspective of the proposed project, Edutella is just another search engine to be wrapped.

Project Environment

We strongly believe in the open source concept, and we would therefore like the produced code to be licensed under the GNU General Public License (GPL).
For reasons of compatibility with earlier work, we prefer that the code is written in Java.
