Free Republic
Browse · Search
News/Activism
Topics · Post Article

New Search Technology Created for Free Republic
FR Exclusive | 2/23/2005 | Dave Taylor

Posted on 02/23/2005 12:16:01 PM PST by Technocrat

I have created a new search technology that will make it much easier to find related articles and avoid duplicates. It is NOT keyword-based - instead, it uses sense and context to determine what an article is actually about, both in major and minor themes. I had about 500 FR articles lying around, so I shoved them into the indexer to see how well it would do, and the results are noteworthy :)

Some notes - everything is in lower case, to speed indexing. No content is actually stored in the database except the title, URL, and a 250 character snippet of internal text - instead, semantic compression reduces the entire contents of the article and first three replies to a 150-dimensional vector in domainspace. As a result, comparisons are blindingly fast (although initial indexing takes about half a second on my creaky old 1999 box). Also, to respect the robots policy, I can't automatically retrieve text from FR, so for now, you have to copy and paste it from the articles you want to match.
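
To make the "fixed-length vector" idea concrete, here is a minimal sketch in Python. It is not the engine's method: the engine is described as working from sense and context, while this stand-in uses a plain word lookup purely to show how per-domain counts can be laid out in a fixed order so that any two articles become comparable. The lexicon, the three-entry domain list, and the names `fingerprint` and `to_vector` are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical word-to-domain lexicon. The real engine's mapping is not published,
# and it is described as NOT keyword-based, so treat this as a toy stand-in.
LEXICON = {
    "hospital": "medical",
    "coma": "medical",
    "doctor": "medical",
    "troops": "military",
    "jeddah": "Middle East",
}

# Fixed domain order. The post mentions 150 dimensions; three are enough for a sketch.
DOMAINS = ["Middle East", "medical", "military"]


def fingerprint(text, lexicon=LEXICON):
    """Count how many words of the text fall into each domain."""
    counts = Counter()
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'()")
        if word in lexicon:
            counts[lexicon[word]] += 1
    return counts


def to_vector(counts, domains=DOMAINS):
    """Lay the counts out in a fixed order, one slot per domain."""
    return [counts.get(domain, 0) for domain in domains]


fp = fingerprint("The doctor said he is in a coma at a hospital in Jeddah.")
vector = to_vector(fp)
```

Because every article ends up as a list of the same length, only the vector (plus title, URL, and snippet) needs to be kept, which fits the storage description above.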

When you run a new article into the engine for comparison, its domain fingerprint is taken and stored for future comparisons. You get to see what that fingerprint is, and then you get three tiers of results: full matches, partial matches, and peripheral matches. Full matches will contain duplicates, articles about the same thing from different sources, and occasionally different articles about very similar things. For fun, you can press the back button and delete or change some text, submit it again, and see how much change it takes for your original submission to drop out of the Primary Match tier. (Don't change the URL, so that it doesn't get inserted into the DB again.) Secondary or partial matches contain closely related articles (although you may see a few of these in Tertiary or peripheral matches, especially the first few entries). The number in parentheses is the match cost, or how far away the given reference is from the one you submitted in terms of domain space.
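
The match cost and the three tiers can be sketched as follows. This is only an illustration under stated assumptions: the thread never says which distance the engine uses, so Euclidean distance is a guess, and the primary and secondary cutoffs are invented. The one number taken from the thread is the duplicate threshold, since the results page quoted further down says a match cost under 10 counts as a duplicate.

```python
import math

DUPLICATE_COST = 10   # from the results page: "match cost under 10 is duplicate"
PRIMARY_MAX = 50      # assumed cutoff, not stated anywhere in the thread
SECONDARY_MAX = 200   # assumed cutoff, not stated anywhere in the thread


def match_cost(a, b):
    """Distance between two domain vectors (Euclidean is an assumed metric)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_duplicate(cost):
    """A cost under the duplicate threshold is treated as the same article."""
    return cost < DUPLICATE_COST


def tier(cost):
    """Map a match cost to one of the three result tiers."""
    if cost <= PRIMARY_MAX:
        return "primary"
    if cost <= SECONDARY_MAX:
        return "secondary"
    return "tertiary"
```

With these guessed cutoffs, a cost of 89 lands in the secondary tier and 383 in the tertiary tier, which is where those costs show up in the results posted later in the thread. The engine's real boundaries are unknown.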

Tertiary matches let you go on a random walk through domain space. Sure, there's some definite relation to what you posted, but you will definitely be venturing afield in much of the linked article. If you want to see a really good example that works well against the current database, post the text from to the search engine.

You can try it by going here (or better yet, open a new browser window and point it at http://www.neurogy.com/sense/compare.html) so you can copy and paste articles from this side to see similar articles that are already in my database.

Since this is something completely different, please post requests for format and capability, and I'll see what I can do.

Jim and John, once you see this, feel free to use it forever - I didn't have enough cash to contribute in the last fundraiser :( You can look at my previous donation history to get contact information if you want. I can provide you with simple CGI interface calls to make this part of the posting process, to show posters potential duplicates before they post. If you want to index a significant portion of the recent articles, let me know and I'll make it easier.


TOPICS: Your Opinion/Questions
KEYWORDS: donotduplicate; fr; search
Navigation: use the links below to view more comments.
first previous 1-2021-4041-54 last
To: Technocrat

Now, if you could find a way to eliminate those pesky duplicate REPLIES, ... ;O)


41 posted on 02/24/2005 9:00:34 AM PST by newgeezer (Just my opinion, of course. Your mileage may vary.)
[ Post Reply | Private Reply | To 39 | View Replies]

To: Technocrat
It sounds like something that would be great, but as for me, well...


42 posted on 02/24/2005 9:01:41 AM PST by Petronski (Zebras: Free Range Bar Codes of the Serengeti)
[ Post Reply | Private Reply | To 1 | View Replies]

To: Technocrat

Ah. Anyway, it didn't return any primary or secondary matches at all, and the tertiary matches were pretty weak, which I suppose is to be expected - I assume this is a problem that will resolve itself as the indexes expand, though.


43 posted on 02/24/2005 9:04:20 AM PST by general_re ("Frantic orthodoxy is never rooted in faith, but in doubt." - Reinhold Niebuhr)
[ Post Reply | Private Reply | To 40 | View Replies]

To: newgeezer

That's right! The article you posted was one that I inserted into the search engine earlier, and provided earlier in this thread as a good example of how you can get related results because the domain set of that article was so rich. However, I indexed the entire source of the article, which was quite extensive compared to the piece you posted, so you will probably see the original copy in the secondary match list.

The fact that you got a 0 match cost on the primary tier means that the engine thinks it has a precise duplicate of something already in the engine (which it does, since you tried to insert it twice).

Unfortunately, I haven't indexed all of FR on this server, because I would need permission from the mighty ones to do that. So far I have a deafening silence, and I'm not quite sure what to do about that.


44 posted on 02/24/2005 9:04:24 AM PST by Technocrat
[ Post Reply | Private Reply | To 38 | View Replies]

To: newgeezer

Ouch. Really. Ouch. :)


45 posted on 02/24/2005 9:05:08 AM PST by Technocrat
[ Post Reply | Private Reply | To 41 | View Replies]

To: newgeezer

Hmmm. I just did exactly the same thing, and the original article (which has a maximum magnitude of over 150) doesn't even appear in the list. I may need to change this so the first paragraphs are indexed the strongest, and succeeding paragraphs are done with less and less emphasis. Another project for this afternoon!
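
One way the "less and less emphasis" idea could look, as a minimal sketch: weight each paragraph's domain counts by a factor that shrinks with its position. The geometric decay, the default factor of 0.5, and the function names are assumptions for illustration. The thread does not say how the engine weights paragraphs.

```python
def paragraph_weights(n_paragraphs, decay=0.5):
    """Weight 1.0 for the first paragraph, multiplied by `decay` for each later one."""
    return [decay ** i for i in range(n_paragraphs)]


def weighted_fingerprint(paragraph_counts, decay=0.5):
    """Combine per-paragraph domain counts, emphasising the early paragraphs.

    `paragraph_counts` is a list of dicts, one per paragraph, mapping a domain
    name to the number of hits found in that paragraph.
    """
    totals = {}
    weights = paragraph_weights(len(paragraph_counts), decay)
    for weight, counts in zip(weights, paragraph_counts):
        for domain, hits in counts.items():
            totals[domain] = totals.get(domain, 0.0) + weight * hits
    return totals


combined = weighted_fingerprint([
    {"medical": 2},                 # first paragraph, full weight
    {"medical": 2, "military": 4},  # second paragraph, half weight
])
```

Here the military hits in the second paragraph count for half as much as they would in the opening paragraph, so a long article with a short, focused lead keeps a fingerprint close to that lead.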


46 posted on 02/24/2005 9:09:02 AM PST by Technocrat
[ Post Reply | Private Reply | To 38 | View Replies]

To: general_re

Yes - it would be really cool if we could get permission to get the last 100,000 articles in the DB. I'll work on that once I feel like the quality is up to snuff.


47 posted on 02/24/2005 9:10:24 AM PST by Technocrat
[ Post Reply | Private Reply | To 43 | View Replies]

To: Technocrat
So far I have a deafening silence, and I'm not quite sure what to do about that.

Perhaps you should email Jim and John privately - make it clear that you're willing to sign over your code to them gratis (assuming you're willing to do that, of course), and that the code is in fact your own original work that you are free to give to them. And then wait, I guess.

Anyway, for all we know, John was three days away from deploying his own search engine, so this isn't really useful to them. Or perhaps there are legal issues that they need to check out before accepting it from you. Who knows? In any case, it seems likely to me that even if da boss couldn't use this for some reason, he'd let you know rather than just leave you hanging....

48 posted on 02/24/2005 9:13:28 AM PST by general_re ("Frantic orthodoxy is never rooted in faith, but in doubt." - Reinhold Niebuhr)
[ Post Reply | Private Reply | To 44 | View Replies]

To: Technocrat
I may need to change this so the first paragraphs are indexed the strongest

I've always envisioned that the 'perfect duplicate-avoiding FR search algorithm' would do exactly that. In fact, I'd have it look for phrase matches (e.g. words 1 thru 6, 2 thru 7, 3 thru 8, etc.) and only look at the first 100 or so words. Seems like that'd expose any potential duplicates for sure. But my thoughts were always centered on duplicate avoidance, never on finding loosely related articles.
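
The phrase-match idea above (words 1 thru 6, 2 thru 7, and so on, over roughly the first 100 words) can be sketched like this. The window size and word limit come from the post. Scoring the result as the shared fraction of phrases (Jaccard overlap) is an added assumption, as are the function names.

```python
def shingles(text, size=6, limit=100):
    """Return the set of overlapping `size`-word phrases in the first `limit` words."""
    words = text.lower().split()[:limit]
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def overlap(a, b, size=6, limit=100):
    """Shared phrases divided by all distinct phrases: 1.0 identical, 0.0 nothing shared."""
    phrases_a = shingles(a, size, limit)
    phrases_b = shingles(b, size, limit)
    if not phrases_a or not phrases_b:
        return 0.0  # too short to form even one phrase
    return len(phrases_a & phrases_b) / len(phrases_a | phrases_b)


original = "one two three four five six seven eight"
reposted = "zero one two three four five six seven"
score = overlap(original, reposted)
```

A high score on the opening words would flag a likely duplicate even when the headline or source differs, though it would not find loosely related articles.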

49 posted on 02/24/2005 9:14:36 AM PST by newgeezer (Just my opinion, of course. Your mileage may vary.)
[ Post Reply | Private Reply | To 46 | View Replies]

To: Technocrat
Maybe I am confused as to the purpose of this.

I submitted and got a nice list of terms with numbers of appearances.

So....

None of the results had any links, nor article titles, ... nothing other than a list of terms followed by numbers.

===

Similar Article Analysis Results

Number of articles in database: 517

Domain fingerprint:
--------------------------------
Central domains
--------------------------------
Middle East 10
medical 8
organization - people 5
--------------------------------
Peripheral Domains:
--------------------------------
military 3
finance/economy 2
government 2
law/truth 2
process 2
travel 2
Africa 2
Iraq 2
education 1
factory 1
flaws 1
information 1
books/mags/print media 1
physics 1
politics/elections 1
relationship 1
science 1
sports 1
Western US 1
Primary Matches: (strong match on most subjects, match cost under 10 is duplicate) (Groundhog Day? Here's another article on Punxsutawney Phil.)
Secondary Matches: (significant shared subject matter) (Groundhog Day? Here's an article on rodent-centered holidays in India.)
Tertiary Matches: (somewhat related on one or two points) (Groundhog Day? Here's some articles on weather, and some groundhog recipes.)



Also, the results are about Groundhog Day and Punxsutawney Phil????

The article I submitted is: Idi Amin is overweight
50 posted on 02/24/2005 9:52:59 AM PST by TomGuy (America: Best friend or worst enemy. Choose wisely.)
[ Post Reply | Private Reply | To 1 | View Replies]

To: TomGuy
Hmm. It looks like you might not have put the entire source in, because when I did the same thing, I got this:
Similar Article Analysis Results

Number of articles in database: 518

This article has already been posted to the database.
Domain fingerprint:
--------------------------------
Central domains
--------------------------------
Middle East 10
medical 8
organization - people 5
International rel 5
--------------------------------
Peripheral Domains:
--------------------------------
military 4
finance/economy 2
geography 2
government 2
law/truth 2
mathematics 2
process 2
travel 2
Africa 2
Iraq 2
education 1
factory 1
flaws 1
information 1
books/mags/print media 1
physics 1
politics/elections 1
relationship 1
religion 1
science 1
sports 1
weapons 1
Western US 1
Primary Matches: (strong match on most subjects, match cost under 10 is duplicate) (Groundhog Day? Here's another article on Punxsutawney Phil.)
Secondary Matches: (significant shared subject matter) (Groundhog Day? Here's an article on rodent-centered holidays in India.)
Idi Amin is overweight (89)
ms madina amin yesterday said that her husband is ailing because of advanced age and overweight. she said that mr idi amin dada, 80, who was yesterday still in coma, weighs about 220 kilogrammes.“at his age and then the fact that he is overweight th

Tertiary Matches: (somewhat related on one or two points) (Groundhog Day? Here's some articles on weather, and some groundhog recipes.)
dozens of cia operatives killed (383)
the shocking story surfaced on feb. 2, when former pentagon adviser richard n. perle told the house intelligence committee about what he called the terrible setback that we suffered in iran a few years ago when, in a display of unbelievable, careles

cnn news chief quits over iraq remarks (islam onlines take, al-barf alert) (391)
“after 23 years at cnn, i have decided to resign in an effort to prevent cnn from being unfairly tarnished by the controversy over conflicting accounts of my recent remarks regarding the alarming number of journalists killed in iraq, jordan said in a

syria, lebanon and flying pigs (424)
skip to comments. syria, lebanon and flying pigs jinsa ^ | february 22, 2005 3:09:36 pm pst by ooh-ah the times of london reports that syria has agreed to remove its 14,000 troops from lebanon“soon as it had agreed to do in the 1989 taif accord, pr

iranian alert - october 19, 2004 [est]- iran live thread - americans for regime change in iran (429)
he us media still largely ignores news regarding the islamic republic of iran. as tony snow of the fox news network has put it, this is probably the most under-reported news story of the year. as a result, most americans are unaware that the islamic

navy medicine will protect service members, patients despite flu vaccine shortages (451)
department of defense officials said their supplies for all the services are less than expected about 1.5 million fewer doses than projected. however, those who most need the vaccine will get it, according to capt. edward m. kilbane, an infectious d

nra bashs u.n. on small-arms reduction (524)
wayne lapierre, executive vice president of the national rifle association of america, has used the conservative political actions conference as a forum to harshly criticize u.n. efforts to reduce the amount of small arms in the hand of civilians. a

open letter to the citizens of the united states of america (mega-triple-barf alert!!) (525)
as a journalist who has the good fortune to write for an international journal... (snip) ...i write this letter as a citizen of this international community and as a journalist for a newspaper whose name is pravda (truth), i have the obligation to te

(614)
seismic changes: studies indicate major land shifts bankok and phuket moved) the nation (bangkok) ^ | february 23, 2005 12:04:12 pm pst by nickcarraway sumatra quake pushed phuket 32cm southwest and moved bangkok: expert phuket shifted 32 centimetr



This shows me two things: (1) there is a significant usability problem with the current interface, which could be fixed by simply retrieving the URL submitted by the user (I will ask permission to do this today), and (2) the text you submitted was indeed different (as shown by the partial match of the original submission on tier 2). Finally, I really need to get rid of those stupid references to Groundhog Day :)
51 posted on 02/24/2005 10:01:40 AM PST by Technocrat
[ Post Reply | Private Reply | To 50 | View Replies]

To: TomGuy; newgeezer; general_re; Alia; fish hawk; Interesting Times; Nick Danger; Wiz; ...
Second round upgrades are now completed. There are three new features:


52 posted on 02/25/2005 6:59:57 AM PST by Technocrat
[ Post Reply | Private Reply | To 51 | View Replies]

To: elfman2

It sounds like it just means there are 150 different variables that they define for cataloging it all.


53 posted on 03/19/2005 7:25:37 PM PST by perfect stranger (I could be wrong. I've been wrong before.)
[ Post Reply | Private Reply | To 18 | View Replies]

To: perfect stranger

That makes more sense.


54 posted on 03/20/2005 9:15:53 AM PST by elfman2
[ Post Reply | Private Reply | To 53 | View Replies]



Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.


FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson