Free Republic
Browse · Search
News/Activism
Topics · Post Article

New Search Technology Created for Free Republic
FR Exclusive | 2/23/2005 | Dave Taylor

Posted on 02/23/2005 12:16:01 PM PST by Technocrat

I have created a new search technology that will make it much easier to find related articles and avoid duplicates. It is NOT keyword-based - instead, it uses sense and context to determine what an article is actually about, both in major and minor themes. I had about 500 FR articles lying around, so I shoved them into the indexer to see how well it would do, and the results are noteworthy :)

Some notes - everything is in lower case, to speed indexing. No content is actually stored in the database except the title, URL, and a 250-character snippet of internal text - instead, semantic compression reduces the entire contents of the article and first three replies to a 150-dimensional vector in domainspace. As a result, comparisons are blindingly fast (although initial indexing takes about a half second on my creaky old 1999 box). Also, to respect the robots policy, I can't automatically retrieve text from FR, so for now you have to copy and paste it from the articles you want to match.
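
To illustrate why comparing fixed-length fingerprints is cheap, here is a minimal sketch. It assumes the match cost is a plain Euclidean distance between two 150-number vectors; the post does not say how the engine actually measures distance, and `match_cost` and `DIMENSIONS` are names made up for this example.

```python
import math

DIMENSIONS = 150  # vector length from the post; everything else here is assumed


def match_cost(a, b):
    """Euclidean distance between two equal-length fingerprints (an assumption)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two made-up fingerprints that differ in only two dimensions.
first = [0.0] * DIMENSIONS
second = [0.0] * DIMENSIONS
second[0] = 3.0
second[1] = 4.0

print(match_cost(first, second))  # 5.0
print(match_cost(first, first))   # 0.0, identical fingerprints
```

Whatever the real measure is, each comparison touches only 150 numbers per article instead of the full text, which fits the speed described above.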

When you run a new article into the engine for comparison, its domain fingerprint is taken and stored for future comparisons. You get to see what that fingerprint is, and then you get three tiers of results: full (Primary) matches, partial (Secondary) matches, and peripheral (Tertiary) matches. Full matches will contain duplicates, articles about the same thing from different sources, and occasionally different articles about very similar things. For fun, you can press the back button and delete or change some text, submit it again, and see how much change it takes for your original submission to drop out of the Primary Match tier. (Don't change the URL, so it doesn't get inserted into the DB again.) Secondary or partial matches contain closely related articles (although you may see a few of these in Tertiary or peripheral matches, especially the first few entries). The number in parentheses is the match cost, or how far away the given reference is from the one you submitted in terms of domain space.

Tertiary matches let you go on a random walk through domain space. Sure, there's some definite relation to what you posted, but you will definitely be venturing afield in much of the linked article. If you want to see a really good example that works well against the current database, post the text from to the search engine.
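
The three tiers described above could be sketched as a simple bucketing of results by match cost. The cutoff values below are hypothetical (the post gives no tier boundaries), and `tier_for` and `group_by_tier` are illustrative names, not part of the actual engine.

```python
# Hypothetical cutoffs; the post does not give the real tier boundaries.
FULL_MAX = 1.0
PARTIAL_MAX = 3.0


def tier_for(cost):
    """Map a match cost to one of the three tiers named in the post."""
    if cost <= FULL_MAX:
        return "full"
    if cost <= PARTIAL_MAX:
        return "partial"
    return "peripheral"


def group_by_tier(matches):
    """Group (title, cost) pairs by tier, cheapest match first within each tier."""
    tiers = {"full": [], "partial": [], "peripheral": []}
    for title, cost in sorted(matches, key=lambda m: m[1]):
        tiers[tier_for(cost)].append((title, cost))
    return tiers


# Made-up results for illustration only.
sample = [("article a", 2.0), ("article b", 0.5), ("article c", 9.0), ("article d", 0.1)]
for name, hits in group_by_tier(sample).items():
    print(name, hits)
```

Sorting by cost before bucketing keeps the closest references at the top of each tier, which matches the remark that the first few entries of a tier tend to be the most closely related.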

You can try it by going here (or better yet, open a new browser window and point it at http://www.neurogy.com/sense/compare.html) so you can copy and paste articles from this side to see similar articles that are already in my database.

Since this is something completely different, please post requests for format and capability, and I'll see what I can do.

Jim and John, once you see this, feel free to use it forever - I didn't have enough cash to contribute in the last fundraiser :( You can look at my previous donation history to get contact information if you want. I can provide you with simple CGI interface calls to make this part of the posting process, to show posters potential duplicates before they post. If you want to index a significant portion of the recent articles, let me know and I'll make it easier.


TOPICS: Your Opinion/Questions
KEYWORDS: donotduplicate; fr; search

1 posted on 02/23/2005 12:16:02 PM PST by Technocrat
[ Post Reply | Private Reply | View Replies]

To: Technocrat

Oops - forgot the good example URL. Try http://www.freerepublic.com/focus/f-news/1319996/posts


2 posted on 02/23/2005 12:17:58 PM PST by Technocrat
[ Post Reply | Private Reply | To 1 | View Replies]

To: Technocrat; John Robinson; scripter

Bump & Ping


3 posted on 02/23/2005 12:21:26 PM PST by EdReform (Free Republic - helping to keep our country a free republic. Thank you for your financial support!)
[ Post Reply | Private Reply | To 2 | View Replies]

To: Technocrat
Maybe they'll incorporate your technology into the Post Article process, and we will no longer have to put up with duplicates and the replies about searching to avoid duplicates!

;O)

4 posted on 02/23/2005 12:22:58 PM PST by newgeezer (Just my opinion, of course. Your mileage may vary.)
[ Post Reply | Private Reply | To 1 | View Replies]

To: Technocrat
I didn't have enough cash to contribute in the last fundraiser :(

Thanks to you and every FReeper who helps in one way or another.

5 posted on 02/23/2005 12:24:12 PM PST by Drango (NPR/PBS is the propaganda wing of the DNC.)
[ Post Reply | Private Reply | To 1 | View Replies]

To: newgeezer

That's the idea - they could add a call to the indexer to check for zero (or low) match cost primary tier hits, and if any come up, they could add an "is this a duplicate?" box.


6 posted on 02/23/2005 12:24:58 PM PST by Technocrat
[ Post Reply | Private Reply | To 4 | View Replies]
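
The posting-time check suggested in reply 6 might look something like the sketch below. It assumes the indexer hands back primary tier hits as (URL, match cost) pairs; the 0.5 cutoff and both function names are invented for illustration.

```python
def likely_duplicates(primary_hits, max_cost=0.5):
    """Return primary tier hits whose match cost is at or below max_cost."""
    return [(url, cost) for url, cost in primary_hits if cost <= max_cost]


def should_ask_about_duplicate(primary_hits, max_cost=0.5):
    """True when the posting form should show an 'is this a duplicate?' box."""
    return bool(likely_duplicates(primary_hits, max_cost))


# Made-up hits: one exact match (cost 0) and one looser match.
hits = [("http://example.invalid/a/posts", 0.0), ("http://example.invalid/b/posts", 0.9)]
print(should_ask_about_duplicate(hits))  # True
print(likely_duplicates(hits))
```

Returning the low-cost hits themselves, not just a yes or no, would let the form link to the possible duplicates so the poster can check them before submitting.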

To: Technocrat

Awesome. While you're at it, add a date range filter. Also, can you set up a search engine for replies, as well-- using text from the reply or author (screenname) or date range or all or some of those (like they have at other message boards)?


7 posted on 02/23/2005 12:25:53 PM PST by GraniteStateConservative (...He had committed no crime against America so I did not bring him here...-- Worst.President.Ever.)
[ Post Reply | Private Reply | To 1 | View Replies]

To: Technocrat

Kudos to you sir!


8 posted on 02/23/2005 12:27:43 PM PST by 4CJ (Laissez les bon FReeps rouler - "Accurately quoting Lincoln is a bannable offense.")
[ Post Reply | Private Reply | To 1 | View Replies]

To: Technocrat
I just knew it would only be a matter of time before someone figured out how to do this. I simply miscalculated in thinking it would have been John Robinson. ;)
9 posted on 02/23/2005 12:28:17 PM PST by newgeezer (Just my opinion, of course. Your mileage may vary.)
[ Post Reply | Private Reply | To 1 | View Replies]

To: GraniteStateConservative; Technocrat
Hey, hey - let's let the guy bask in the spotlight for a few minutes before adding new features on him ;)
10 posted on 02/23/2005 12:28:23 PM PST by general_re ("Frantic orthodoxy is never rooted in faith, but in doubt." - Reinhold Niebuhr)
[ Post Reply | Private Reply | To 7 | View Replies]

To: GraniteStateConservative

Date range - no problem. The replies might be more problematic, as this is a sense-based engine and the entire article with replies gets shrunk down to 150 bytes or so (that's what makes it so stinking fast). Although, if John feels like integrating, I could extend it to do searches like "where did member X post about subject Y?"


11 posted on 02/23/2005 12:30:13 PM PST by Technocrat
[ Post Reply | Private Reply | To 7 | View Replies]
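
A date range filter of the kind requested in reply 7 could be applied to the results after matching. This sketch assumes each indexed record also carries a posting date, which the thread does not confirm (the only stored fields mentioned are title, URL, and snippet); the `posted` key and `in_date_range` are hypothetical.

```python
from datetime import date


def in_date_range(matches, start=None, end=None):
    """Keep matches posted between start and end, inclusive; None means open-ended."""
    kept = []
    for match in matches:
        posted = match["posted"]
        if start is not None and posted < start:
            continue
        if end is not None and posted > end:
            continue
        kept.append(match)
    return kept


# Made-up records for illustration only.
records = [
    {"title": "older article", "posted": date(2004, 11, 2)},
    {"title": "recent article", "posted": date(2005, 2, 23)},
]
print(in_date_range(records, start=date(2005, 1, 1)))
print(in_date_range(records, end=date(2004, 12, 31)))
```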

To: newgeezer

Thanks for the vote of confidence, but this is an Alpha engine, and I bet we'll find a lot of holes in it before everyone is happy. That's OK - I needed something to do in the evenings anyway :)


12 posted on 02/23/2005 12:32:13 PM PST by Technocrat
[ Post Reply | Private Reply | To 9 | View Replies]

To: Technocrat

Fascinating


13 posted on 02/23/2005 12:38:16 PM PST by RightWhale (Please correct if cosmic balance requires.)
[ Post Reply | Private Reply | To 1 | View Replies]

To: Technocrat
I submitted this...
Article Title: Canseco's lack of remorse for steroid use damages kids

Article URL: http://www.chron.com/cs/CDA/ssistory.mpl/features/3052472

Article Text: This week I tallied up the number of years I've spent on or near a baseball field. Nineteen. ADVERTISEMENT Yep, for almost two decades I've sat in the bleachers, by the dugout or on the sidelines to watch one of my five kids play the Great American Pastime. I figure that, with my youngest at 11, I probably have five or more good seasons left in me before I retire. Which makes me, if not an expert in the game, at least a knowledgeable observer. In other words, I can — with some authority — tell my son he's hitting the ball late. I can also counsel from personal experience that losing 13-zip is not the end of the world. Like most parents of sports-obsessed children, I've survived the wax and wane of major-league dreams. At some point in his Little League career, each of my four sons thought he would one day make it to the pros.

...and got this back (I wonder if you can trap this and provide a layman's explanation) ...

CGI Error

The specified CGI application misbehaved by not returning a complete set of HTTP headers. The headers it did return are:
(That's all I got back. Are you checking for the minimum 1000 characters?)
14 posted on 02/23/2005 12:39:08 PM PST by newgeezer (Just my opinion, of course. Your mileage may vary.)
[ Post Reply | Private Reply | To 1 | View Replies]

To: newgeezer

I expect to see /posts on the end of the URL, and right now it will freak out if you try to post anything that isn't on Free Republic (mainly because of a naming convention I used to cut some corners in the initial test). I can change that tonight so you can post from anywhere.


15 posted on 02/23/2005 12:44:02 PM PST by Technocrat
[ Post Reply | Private Reply | To 14 | View Replies]
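
The "CGI Error" in reply 14 is what a web server reports when a CGI program exits before printing a complete header block. One way to trap it, sketched here under assumptions, is to validate the URL first and always return a header line, a blank line, and a plain-language message. The engine's real language and code are not shown in the thread; `is_fr_article_url` and `build_response` are hypothetical names, and the host prefix is taken from the example URL in reply 2.

```python
FR_PREFIX = "http://www.freerepublic.com/"  # from the example URL in reply 2


def is_fr_article_url(url):
    """Check the naming convention described in reply 15: an FR URL ending in /posts."""
    url = url.strip()
    return url.startswith(FR_PREFIX) and url.endswith("/posts")


def build_response(url):
    """Return a complete CGI response: header line, blank line, then body."""
    if is_fr_article_url(url):
        body = "OK: article accepted for comparison."
    else:
        body = ("Sorry, only Free Republic article URLs ending in /posts "
                "are accepted right now.")
    return "Content-Type: text/plain\r\n\r\n" + body


print(build_response("http://www.chron.com/cs/CDA/ssistory.mpl/features/3052472"))
```

Because the header is emitted on both paths, a rejected URL produces a readable message instead of the server's generic error page.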

To: Technocrat

Wow. Will try this later. Thank you.


16 posted on 02/23/2005 12:47:27 PM PST by andyandval
[ Post Reply | Private Reply | To 15 | View Replies]

To: Technocrat

Suggestion: Put the 'Submit Article' button at the top of the big text box so that I don't have to scroll all the way to the bottom of the page just to click the button (assuming I didn't know, or neglected, to paste the article text first and save the title or URL 'til last, in which case I could just hit the Enter key to submit the form).


17 posted on 02/23/2005 12:54:06 PM PST by newgeezer (Just my opinion, of course. Your mileage may vary.)
[ Post Reply | Private Reply | To 15 | View Replies]

To: Technocrat
"150-dimensional vector in domainspace"

I followed you up to there. I can guess, but better not. What’s that mean?

18 posted on 02/23/2005 12:54:09 PM PST by elfman2
[ Post Reply | Private Reply | To 1 | View Replies]

To: newgeezer

Also, the more text you post, the better your results will be (higher domain dimensionality).


19 posted on 02/23/2005 12:55:14 PM PST by Technocrat
[ Post Reply | Private Reply | To 14 | View Replies]

To: Technocrat

If there were a way to search around dates, or between a date range, that would help.

That is a failing of most major search engines I've tried--no way to date search. Most show the most recent dated articles. But when researching, many times I want matches older than today, or last month, or even last year. To get to the older matches, one has to wade through mounds of the latest date.


20 posted on 02/23/2005 12:58:58 PM PST by TomGuy (America: Best friend or worst enemy. Choose wisely.)
[ Post Reply | Private Reply | To 1 | View Replies]


Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.

FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson