
Geek Culture / Search engine

Plystire
22
Years of Service
User Offline
Joined: 18th Feb 2003
Location: Staring into the digital ether
Posted: 19th Aug 2011 03:02
Creating a search site isn't terribly difficult nowadays, what with the plethora of search APIs provided by the well-known engine companies. However, making your own engine is just as difficult as it has always been.

I wanted to ask the community for any insight on such an endeavor.

Using a third-party API to execute an internet search would, of course, sound like the best option for getting a search site going with minimal fuss, but the opposite has been the case thus far.

Google seems to have dropped support for web searches via their API... which is a shame since there's no question that they are the leader in favorable search results.
I have used the Yahoo BOSS V1 API with minimal issues; however, they have discontinued support for it in favor of their new BOSS V2 API, brought about by their partnership with Bing, which has been little more than a hassle. While setting it up is easy enough, their support has been lacking, letting portions of the service go offline far too frequently and failing to live up to their claimed 99.9% up-time. On top of this, results from their API are far from matching results obtained directly from Yahoo, which throws into question why they want me to pay for a service that is obviously lacking.
I have also had experience using Bing's own API, which was less favorable due to the generally poor results from their engine. That was quite a while ago, though, and revisiting them is something on my mind.

All that aside, there's no question that an in-house engine would be best: up-time and result favorability would be controlled directly, not reliant on third-party support teams that are no doubt in over their heads with tickets from the various other sites using their technology.

The first step to setting up an engine, as far as I can tell, would be to start up a crawler that would venture across the internet, saving pertinent information and accumulating a vast database that would be used later by the engine itself.
I don't have any experience with this personally, which is why I've come to ask if anyone here has any tips for me as far as what the crawler should be looking for and what the best way to set one up would be... and more importantly, whether anyone here with experience as an SEO could share some bits of wisdom on the nitty-gritty of how a search engine is designed and functions. Any articles on the topic (not discussion about the technology, but about how the technology works, in-depth if at all possible) would be appreciated as well.
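To make the crawler idea concrete, here is a minimal sketch of the loop described above: keep a frontier of URLs, fetch each page, save it, and queue the links you find. This is stdlib-only Python and deliberately simplified (no politeness delays, no robots.txt handling, and `fetch` is passed in so it can be anything from a real HTTP GET to a test stub); it illustrates the concept, not a production crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects the href targets of anchor tags, resolved against the page URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns the page HTML (or None);
    in a real crawler it would be an HTTP GET that honours robots.txt.
    Returns {url: html}, the 'pertinent information' to index later."""
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    saved = {}
    while frontier and len(saved) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        saved[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return saved
```

The `seen` set is what stops the crawler from fetching the same page twice when sites link back to each other.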

This is a planning phase: hashing out the scope of such an endeavor and finding out as much as I can about the technology, so that I can make an educated decision on whether or not this route is one worth taking at this point in time.


~Plystire

A rose is only a rose until it is held and cherished -- then it becomes a treasure.
bitJericho
22
Years of Service
User Offline
Joined: 9th Oct 2002
Location: United States
Posted: 19th Aug 2011 11:48
I can't speak as an SEO, but as a user of websites I can say I avoid site searches when they are Google or Yahoo. Those engines are great for finding sites in general, but searching a specific site with them blows.

In-house solutions are best. You don't necessarily need a crawler, though. If you have a database with all the content already stored in there, just come up with some interesting SQL statements and dump them out to the end-user.

If you *require* a crawler, there are a lot of open source solutions that would make setting one up for an in-house solution extremely easy.
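The "interesting SQL statements" approach can be as small as this: if the site's pages already live in a database table, a keyword search is one query. A naive sketch in Python with SQLite (the `pages` schema and sample rows are made up for illustration; a real site would rank and paginate properly):

```python
import sqlite3

# Hypothetical content table, already populated by the site itself.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("/intro", "Getting started", "How to install the engine"),
        ("/faq", "FAQ", "Common questions about search engines"),
    ],
)

def site_search(conn, term):
    """Naive keyword search: match the term anywhere in title or body,
    ranking title hits above body-only hits. SQLite's LIKE is
    case-insensitive for ASCII, so 'Search' matches 'search'."""
    pattern = f"%{term}%"
    return conn.execute(
        """SELECT url, title FROM pages
           WHERE title LIKE ? OR body LIKE ?
           ORDER BY (title LIKE ?) DESC""",
        (pattern, pattern, pattern),
    ).fetchall()
```

For bigger sites you would swap the `LIKE` scan for the database's full-text index, but the shape of the solution stays the same: no crawler, just queries over content you already have.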


MrValentine
AGK Backer
14
Years of Service
User Offline
Joined: 5th Dec 2010
Playing: FFVII
Posted: 19th Aug 2011 20:24
I found this last week; it should answer a lot of questions for you:

http://www.seomoz.org/google-algorithm-change#2011

Personally, my web design package comes with a built-in search engine for my sites. You can see an example on my website under /computers; try typing something like 'intel' in the search field.

Hope this helped.

Though I am slightly confused about your aim... do you want to create a search engine site such as Google.com, or just one for your own website? {like on my website}

Ask me for more info anytime; happy to help.

BatVink
Moderator
22
Years of Service
User Offline
Joined: 4th Apr 2003
Location: Gods own County, UK
Posted: 19th Aug 2011 22:58 Edited at: 19th Aug 2011 22:59
Creating your own search engine is a mammoth task. For it to be effective, the first thing you need is 3 copies of everything on the internet, just to make it reliable: 1 online, 1 in a failed state, and 1 to repair the failed copy (the live copy is too busy serving results). All this gives you just one entry point for users, and you haven't even started to consider scaling it yet.

Now you've got your bare minimum three copies of everything on the internet, you're going to need a few thousand machines to index it. You need to cross-reference all keywords with all other keywords to be able to produce results in any bearable amount of time.
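The cross-referencing described above is essentially an inverted index: instead of scanning every page per query, you precompute, for each keyword, the set of pages containing it, so a query becomes a lookup plus a set intersection. A toy sketch (whitespace tokenization and in-memory sets stand in for the sharded, on-disk structures a real engine needs):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: set of doc_ids}, so a query
    term maps straight to the pages containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-query: the pages containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

The "few thousand machines" come in because the real index is far too large for one box: the term space gets partitioned across machines, and every query fans out and merges the partial results.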

The next layer in your hardware is the query servers. These need to access the logical views on your data to turn them into results. And at the same time they need to rank them into some kind of useful order.

Next layer - presentation to the user. Putting all that useful information into readable HTML and sending it back.

Right, so now you have the most basic of search engines, producing barely useful results. To give you an insight into what Google would do next:

1. Monitor how much lag causes you to abandon your search, and make sure the threshold never gets breached. They have data that just 0.4 seconds over their threshold loses a huge percentage of users. They can measure that in ad revenue loss.

2. Refine the search algorithm every time somebody selects anything other than the number 1 result. If you don't pick result 1, the engine failed in its search accuracy.

3. Measure any statistics around selecting a result, then returning to select an alternative (long clicks are those where you stay on the result, short clicks are those where you return and try again). Redefine the algorithm.

4. Measure multiple searches on similar phrases - redefine.

5. Do the same for around 200 or so indicators of success.
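Point 3 above, the long-click versus short-click signal, is simple to sketch: log how long a user dwells on each clicked result, treat quick returns as failures, and feed the per-result success rate back into ranking. A toy version (the 30-second threshold and the log format are illustrative assumptions, not how Google does it):

```python
def long_click_rate(click_log, dwell_threshold=30.0):
    """click_log: list of (result_url, dwell_seconds). A 'long click'
    (the user stayed on the result) counts as success; a 'short click'
    (a quick return to the results page) counts as failure.
    Returns {url: fraction of clicks that were long}."""
    totals, longs = {}, {}
    for url, dwell in click_log:
        totals[url] = totals.get(url, 0) + 1
        longs[url] = longs.get(url, 0) + (1 if dwell >= dwell_threshold else 0)
    return {url: longs[url] / totals[url] for url in totals}
```

A ranker could then boost results with a high long-click rate for a query and demote those users keep bouncing off, which is one of the "200 or so indicators" in miniature.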

Good luck, it will be fun if nothing else!

Plystire
22
Years of Service
User Offline
Joined: 18th Feb 2003
Location: Staring into the digital ether
Posted: 20th Aug 2011 00:33
Thanks for the info, guys!

Quote: "If you *require* a crawler, there are a lot of open source solutions that would make setting one up for an in-house solution extremely easy."


Would you happen to have a link or two to these open source crawlers? Borrowing a database from someone is still costly.

Quote: "you want to create a search engine site such as Google.com or just one for your own website?"


A search across the internet, not just my own site.

@BatVink:

Quote: "Creating your own search engine is a mammoth task."


After everything you've said, it seems that it's not really a mammoth task... just costs a lot of money to boot it up and jet out the door, serving results to people worldwide.


I've read a few articles on the subject since posting this thread and while the concept itself seems easy enough, there are some missing points in the big picture.

1) How to make a crawler that won't piss off webmasters, and preferably won't land me in a lawsuit.

2) Database design. While the simplest I've seen consists of indexing keywords, and saving page URLs... the most complex goes as far as saving all displayed text on the indexed page, as well as the title of said page. This... would take a ton of storage space for an Internet-wide search engine.

As far as SQL queries go, I'm familiar enough with SQL to write those myself. I guess I'm just having problems conceiving how the crawler works, how it doesn't get itself into trouble, and how to design the storage of the database to make it as memory efficient as I can. Obviously the Internet is a big place, and it will take tons of storage, but there must be some way to store the necessary information that can lighten the load. What I'd want to give to the users are the typical "Title", "Brief description with keywords in bold", "URL".
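One storage-lightening option implied above: rather than keeping all displayed text, the index can keep just the title, URL, and enough body text to build the "brief description with keywords in bold" at query time. A sketch of that snippet step (the excerpt radius and `<b>` markup are illustrative choices):

```python
import re

def make_snippet(text, term, radius=40):
    """Return a short excerpt around the first hit of `term` in `text`,
    with every occurrence of the keyword in the excerpt wrapped in
    <b> tags (case-insensitive). Falls back to the opening of the text
    when the term is absent."""
    m = re.search(re.escape(term), text, re.IGNORECASE)
    if not m:
        return text[: 2 * radius]
    start = max(0, m.start() - radius)
    end = min(len(text), m.end() + radius)
    excerpt = text[start:end]
    return re.sub(re.escape(term), lambda h: f"<b>{h.group(0)}</b>",
                  excerpt, flags=re.IGNORECASE)
```

Storing only title + URL + a trimmed body excerpt per page, instead of the full rendered text, is one of the bigger wins for keeping an internet-scale database manageable.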

I've been through a few more articles as far as ranking goes, and I think I have the concept down well enough to come up with a decent ranking system. Still reading it, of course!


~Plystire

A rose is only a rose until it is held and cherished -- then it becomes a treasure.
bitJericho
22
Years of Service
User Offline
Joined: 9th Oct 2002
Location: United States
Posted: 22nd Aug 2011 16:38 Edited at: 22nd Aug 2011 16:39
Quote: "Would you happen to have a link or two to these open source crawlers? Borrowing a database from someone is still costly."


Something like this:

http://crawler.archive.org/

Quote: "How to make a crawler that won't piss off webmasters, and preferably won't land me in a lawsuit."


Respect robots.txt. Respect copyrights (don't cache pages the way Google's cache does, and honor the noarchive tag).
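Python's standard library already handles the robots.txt part, for what it's worth. A sketch using `urllib.robotparser` (the robots.txt content and crawler name here are hypothetical; a real crawler would fetch the file from the site's root before requesting any page):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, shown inline for illustration; a real crawler
# would load it from http://example.com/robots.txt via rp.set_url() + rp.read().
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def allowed(url, agent="MyCrawler"):
    """Check a URL against the site's robots rules before fetching it."""
    return rp.can_fetch(agent, url)
```

Checking `can_fetch` before every request (and throttling how fast you hit any one host) covers most of the "don't piss off webmasters" problem.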

Quote: "After everything you've said, it seems that it's not really a mammoth task... just costs a lot of money to boot it up and jet out the door, serving results to people worldwide."


The threshold is probably in the hundreds of thousands of dollars at this point in time.

However, you can always go niche and be useful for very specific searches!

I find myself craving a good way to find software. Perhaps an index search site rather than a keyword search site, pulling info from popular code storage places.


Plystire
22
Years of Service
User Offline
Joined: 18th Feb 2003
Location: Staring into the digital ether
Posted: 23rd Aug 2011 01:46
Thanks, Jericho! I'll take a look at that crawler, if not for use, then at least for research.

Quote: "I find myself craving to find a good way to find software. Perhaps an index search site rather than keyword search site and you pull info from popular code storage places."


So are you looking for code or software?


~Plystire

A rose is only a rose until it is held and cherished -- then it becomes a treasure.
bitJericho
22
Years of Service
User Offline
Joined: 9th Oct 2002
Location: United States
Posted: 23rd Aug 2011 11:43
Usually both. I'm an open-source kind of guy.

