?What is Symphony?
Symphony, MarabouStork's cloud-based web crawler platform, is an extremely customisable, on-demand enterprise website crawler.
It is used by businesses of all sizes for automatically mining data from pages in one or more websites.
?What is "Software-As-A-Service"?
Although there are options to host and run Symphony yourself, most of our customers take advantage of the fact that we provide it as a fully managed service.
This means there is nothing for you to install, nothing for you to manage and nothing for you to fix. We configure the product to your exact requirements, then host it, run it, maintain it and provide you with the output.
?What Will the Service Cost?
The Symphony platform adapts closely to each customer's requirements. For this reason we always speak with potential customers to understand exactly what is required, rather than trying to place them into predefined pricing bands.
Factors that affect the price of our service include the amount of data you want to receive and the frequency with which you want to receive it. You may want images or screenshots. The sites may be easy or complicated to configure, and may be subject to frequent change.
We also find that other opportunities may arise from our relationship with you, in which case it may prove mutually beneficial for us to discount the cost of providing the service.
Rest assured, though, that you will not get the level of service we offer cheaper anywhere else.
?Can Symphony gather images and screenshots?
Symphony can not only retrieve and provide images from the sites it crawls, but can also take screenshots of those pages if you need additional proof that the data collected is accurate. This is especially useful for legal clients who may need to prove what a certain webpage looked like at a certain point in time.
Please contact us to discuss your specific requirements for gathering images or screenshots.
?Can you cleanse the data for us?
Yes, the software has a number of built-in functions that ensure the data you receive is as clean and well formatted as possible. We are also unique in that we can resolve discrepancies that may exist between different sites, and fill gaps in data records by sourcing the missing information from alternative sources. View our features page for additional information.
?How does Symphony deal with websites trying to block access?
Symphony is less likely to be blocked by websites due to the following unique features:
- Because the servers come online in the cloud and are only online while they are in use, they are given a new identity every time they are used. This means that the traditional approach of blocking access from certain IP addresses tends to be ineffective.
- We crawl website content responsibly, throttling our servers so that they do not adversely affect the target websites.
- The way links are followed on the site is similar to the usage you would see if 10 people were navigating the site simultaneously, rather than a single visitor working through the site systematically at superhuman speed.
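The throttling described above can be sketched roughly as follows. This is only an illustration, not Symphony's actual implementation: the class name `PoliteThrottle`, the delay values and the per-host bookkeeping are all assumptions made for the example.

```python
import random
import time


class PoliteThrottle:
    """Enforce a randomised pause between requests to the same host.

    Hypothetical sketch: the delay range is illustrative only and does not
    reflect how Symphony is really configured.
    """

    def __init__(self, min_delay=2.0, max_delay=6.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep      # injectable so the class can be tested
        self._clock = clock      # without really waiting
        self._last_request = {}  # host -> time of the previous request

    def wait(self, host):
        """Pause if needed before the next request; return seconds paused."""
        delay = random.uniform(self.min_delay, self.max_delay)
        last = self._last_request.get(host)
        pause = 0.0
        if last is not None:
            # Only sleep for the part of the delay that has not yet elapsed.
            pause = max(0.0, last + delay - self._clock())
            if pause > 0:
                self._sleep(pause)
        self._last_request[host] = self._clock()
        return pause
```

A crawler would call `throttle.wait(host)` immediately before each request to that host. The randomised gap is what makes the traffic look like several people browsing rather than one machine requesting pages at a fixed, rapid rate.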
?So does this mean that Symphony cannot be blocked from crawling?
No. We use our own user-agent in the header of all of our HTTP requests. This means that target sites can identify that we are visiting them and, should they wish, limit or block our access.
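As an illustration of how a user-agent makes a crawler identifiable, here is a minimal sketch using Python's standard library. The user-agent string `ExampleCrawler/1.0` and both function names are hypothetical; Symphony's real user-agent is not given here, and the blocking check stands in for whatever rule a site operator might choose to apply.

```python
from urllib.request import Request

# Hypothetical user-agent string, used purely for illustration.
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot-info)"


def build_request(url):
    """Build (but do not send) a request that identifies the crawler."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def site_would_block(request, blocked_tokens):
    """Mimic a site-side rule that rejects listed user-agents."""
    # urllib stores header names capitalised as "User-agent".
    agent = request.get_header("User-agent", "").lower()
    return any(token.lower() in agent for token in blocked_tokens)
```

Because every request carries the same identifying header, a site operator needs only one rule to recognise the crawler, regardless of which IP address the request came from.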
?There are multiple ways the same product can be identified across different websites - can Symphony deal with this?
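There are a number of approaches that can be used in this scenario.
Symphony supports the use of a dictionary of preferred and non-preferred terms to remove discrepancies in the terminology used across multiple websites. In the example below, the first entry is the preferred term and the entries marked "np:" are non-preferred terms that map to it.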
- - Samsung G50
- - np: Samsung G-500
- - np: Samsung Gee-50
- - np: Samsung 50 G
In this instance, irrespective of whether the website contains "Samsung G50", "Samsung G-500", "Samsung Gee-50" or "Samsung 50 G", the data Symphony sends to the customer will always contain the preferred term (which in this case is "Samsung G50").
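The dictionary idea can be sketched in a few lines of Python. This is an illustrative sketch only: the data structure, the function names and the case-insensitive matching are assumptions, not a description of how Symphony stores or applies its dictionary.

```python
# Preferred term -> list of non-preferred ("np") variants, mirroring the
# example above. The layout is an assumption made for this sketch.
PREFERRED_TERMS = {
    "Samsung G50": ["Samsung G-500", "Samsung Gee-50", "Samsung 50 G"],
}


def build_lookup(terms):
    """Flatten the dictionary into a variant -> preferred-term lookup."""
    lookup = {}
    for preferred, variants in terms.items():
        for term in [preferred, *variants]:
            # casefold() makes matching case-insensitive (an assumption).
            lookup[term.casefold()] = preferred
    return lookup


LOOKUP = build_lookup(PREFERRED_TERMS)


def normalise(term):
    """Return the preferred term, or the input unchanged if it is unknown."""
    return LOOKUP.get(term.strip().casefold(), term)
```

With this lookup, `normalise("Samsung Gee-50")` and `normalise("Samsung 50 G")` both return `"Samsung G50"`, while a term that is not in the dictionary passes through untouched.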
?Can you produce the data with our own IDs in it?
Of course. The built-in dictionary mechanism allows you to maintain the terms or IDs that control the preferred and non-preferred terms we feed back to you in the data output.
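As a rough sketch of the same mechanism with customer IDs as the preferred values: the ID `SKU-1001`, the dictionary layout and the function name below are hypothetical and exist only for illustration.

```python
# Customer-supplied ID -> product names seen on crawled sites.
# "SKU-1001" is a made-up ID used only for this example.
ID_DICTIONARY = {
    "SKU-1001": ["Samsung G50", "Samsung G-500", "Samsung Gee-50", "Samsung 50 G"],
}

# Invert the dictionary once so each crawled name points at its ID.
NAME_TO_ID = {
    name.casefold(): customer_id
    for customer_id, names in ID_DICTIONARY.items()
    for name in names
}


def to_customer_id(name):
    """Return the customer's ID for a crawled name, or None if unmapped."""
    return NAME_TO_ID.get(name.strip().casefold())
```

Returning `None` for unmapped names, rather than guessing, makes it easy to spot products that still need an entry in the dictionary.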
?Are there legalities to consider?
Our software is fundamentally the same as the web spiders used by every major search engine to discover new and updated content on the web. The only difference is that we configure ours for a small set of sites rather than for any site on the internet. Google extracts and resells data gathered from web pages every day; however, it is important to remember that certain content, images especially, may be covered by copyright laws.
We always access each site we collect data from in a responsible way, and never in a way that could have an adverse effect on the performance of those sites.
We also identify ourselves to the sites we crawl using our own user-agent identification, should they wish to limit or throttle our access.
We cannot, however, be responsible for the way our customers use the data that we provide to them; this is covered in our user agreement.
?Some of the sites are complicated and require selection of network and working order. Can Symphony deal with this scenario?
Yes, we have the ability to interact with websites in the same way a human being would. We prefer to extract the data we need without doing this, either by reading the HTML of the page directly or by calling the services that the pages themselves use to obtain XML or JSON data, but Symphony can always be configured to complete the forms on a website.
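The two preferred extraction routes can be illustrated with a small standard-library sketch. The sample field name `price`, the `span class="price"` markup and the function names are assumptions made for the example; real sites vary, and this is not Symphony's own extraction code.

```python
import json
from html.parser import HTMLParser


def price_from_json(payload):
    """Read the price from the JSON a page's backing service returns."""
    return json.loads(payload)["price"]


class PriceParser(HTMLParser):
    """Collect the text of the first <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price and self.price is None:
            self.price = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


def price_from_html(html):
    """Read the price directly from the page's HTML."""
    parser = PriceParser()
    parser.feed(html)
    return parser.price
```

The JSON route is a single lookup, while the HTML route depends on the page's markup staying the same, which is one reason calling the underlying service is usually the simpler option when it is available.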
?Can you provide a sample?
Yes. You just need to tell us which websites you need to gather data from and what format you want the data to be in when you receive it.