Providing an effective search service is a game of two halves. The first, and arguably most important, task is to understand how content will be discovered and indexed. This will usually involve web spiders or applications designed to "feed" your search engine with documents and websites that it can then navigate, read and store.
The second concern is how the relevancy and quality of those documents will be measured against the search terms your users are entering.
The mechanism employed to calculate a document's relevancy will more than likely be the major "unique selling point" of your chosen search platform. Some platforms rely heavily on the use of faceted queries, where key attributes (colour, author, and subject matter, for instance) are extracted from the content at the point where those documents are indexed. Others use complex algorithms and various linguistic techniques, such as stemming, synonym expansion, and lemmatisation, in an attempt to effectively "understand" whether the document is relevant to the search query being used. Most enterprise-scale search platforms, such as Microsoft Fast ESP, will use a combination of both.
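To give a feel for what stemming and synonym expansion mean in practice, here is a deliberately toy sketch of a query expander. The suffix list and synonym table are invented for the example; real platforms use far more sophisticated linguistic analysis than this.

```python
# Toy illustration of stemming and synonym expansion.
# The word lists here are invented for the example, not taken from any product.
SYNONYMS = {"tv": ["television"], "colour": ["color"]}

def stem(word):
    # Crude suffix stripping, standing in for a real stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def expand_query(query):
    # Collect the stems of each query term plus the stems of its synonyms,
    # so a search for "colour tv" can also match "color" and "television".
    terms = set()
    for word in query.lower().split():
        terms.add(stem(word))
        for synonym in SYNONYMS.get(word, []):
            terms.add(stem(synonym))
    return sorted(terms)

print(expand_query("searching colour tv"))
# ['color', 'colour', 'search', 'television', 'tv']
```

A document mentioning "color televisions" can then match a query for "colour TV", even though neither word appears verbatim in both.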
Symphony made light work of applying our taxonomy to the documents we fed to Fast ESP.
It's amazing how much storage we used to waste indexing pages that were irrelevant, but with Symphony we are now filtering out content such as "about us" and "contact" pages.
It was essential that our users were able to filter results by the drug mentioned in the pages, and Symphony made it easy to identify and extract this information.
Only index content that is relevant to your audience, and don't rely solely on your search algorithms to filter out the rubbish from their search results.
So how can Symphony help? Our crawling technology provides a number of options for controlling which links, and therefore which pages, are visited on a site. The initial pages (website URLs) that are visited can either be fixed, or made to vary based on the contents of an external system such as a web service or spreadsheet. Then, for each page that is visited, a number of rules can be configured on a site-by-site basis that tell the crawler how to identify links for it to follow, and which to ignore. These rules can be based on a number of factors, including the URLs of those links, the content of the pages being navigated from, the content of the pages being linked to, or any combination of these. In addition, any aspect of the crawler's link-following mechanism can be altered and customised using scripts or external applications to establish whether documents should be included in the searchable content or not.
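The per-site follow/ignore rules described above can be sketched roughly as follows. Symphony's actual configuration format is not shown here; the rule structure, site name, and URL patterns below are invented purely to illustrate the idea.

```python
# Illustrative sketch of per-site link-following rules.
# The configuration shape and patterns are hypothetical.
import re

SITE_RULES = {
    "example.com": {
        # Patterns for links the crawler should follow.
        "follow": [r"/products/", r"/conditions/"],
        # Patterns for links it should skip (e.g. "about us" and contact pages).
        "ignore": [r"/about", r"/contact", r"\?print=1"],
    }
}

def should_follow(site, url):
    # Ignore rules win over follow rules; if no follow rules are
    # configured for a site, every non-ignored link is followed.
    rules = SITE_RULES.get(site, {})
    if any(re.search(pattern, url) for pattern in rules.get("ignore", [])):
        return False
    follow = rules.get("follow", [])
    return not follow or any(re.search(pattern, url) for pattern in follow)

print(should_follow("example.com", "/products/tv-42"))  # followed
print(should_follow("example.com", "/about-us"))        # ignored
```

In a real deployment these decisions could equally be delegated to a script or external application, as the paragraph above notes, rather than expressed as static patterns.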
Once Symphony has established that it has arrived at a document you would like to include as searchable content, it then knows which fields of data to extract from the content of those pages. We train it through configuration to enable it to retrieve and store metadata about the page that can then be used either as part of an enhanced full-text search, or as search filters that will enable your users to drill down into the search results they are interested in.
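Configuration-driven field extraction of this kind can be pictured as a mapping from field names to patterns that locate each value in the page. The field names, CSS classes, and patterns below are hypothetical; Symphony's real mechanism is configured rather than coded like this.

```python
# Sketch of configuration-driven metadata extraction.
# Field names and patterns are invented for illustration.
import re

FIELD_RULES = {
    "drug_name": r'<span class="drug">([^<]+)</span>',
    "last_updated": r'<meta name="updated" content="([^"]+)"',
}

def extract_fields(html):
    # Apply each configured pattern to the page and keep whatever matches.
    doc = {}
    for field, pattern in FIELD_RULES.items():
        match = re.search(pattern, html)
        if match:
            doc[field] = match.group(1)
    return doc

page = ('<meta name="updated" content="2011-03-01">'
        '<span class="drug">Ibuprofen</span>')
print(extract_fields(page))
# {'drug_name': 'Ibuprofen', 'last_updated': '2011-03-01'}
```

The extracted fields are exactly what the medical site quoted earlier needed: a "drug" attribute that users can filter on, alongside the full text of the page.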
What's more, the format and structure of the output provided for each document can be altered to be compatible with most of the enterprise search platforms available today.
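As a rough illustration, the same extracted document can be reshaped for different targets, for example as an XML feed for one platform and JSON for another. The element and key names below are invented; each real platform (Fast ESP, Solr, and so on) defines its own feed format.

```python
# Sketch of reshaping one extracted document for different search platforms.
# The output shapes here are hypothetical, not any platform's real format.
import json
import xml.etree.ElementTree as ET

def to_xml_feed(doc):
    # Emit each field as <field name="...">value</field> under <document>.
    root = ET.Element("document")
    for name, value in doc.items():
        field = ET.SubElement(root, "field", attrib={"name": name})
        field.text = value
    return ET.tostring(root, encoding="unicode")

def to_json_feed(doc):
    return json.dumps(doc)

doc = {"drug_name": "Ibuprofen", "last_updated": "2011-03-01"}
print(to_xml_feed(doc))
print(to_json_feed(doc))
```

The point is that the crawl and extraction stages stay the same; only the final serialisation step changes per platform.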
Properties such as the author and the date a page was last updated can often be extracted automatically, but how will the search engine know to extract the size of a television, or the symptoms of a medical condition?