comp.lang.python

Open source English dictionary to use programmatically w/ python

dgoldsmith_89

1/7/2008 10:37:00 PM

Can anyone point me to a downloadable open source English dictionary
suitable for programmatic use with python: I'm programming a puzzle
generator, and I need to be able to generate more or less complete
lists of English words, alphabetized. Thanks! DG
10 Answers

rpdooling

1/7/2008 10:46:00 PM

On Jan 7, 4:37 pm, dgoldsmith_89 <d.l.goldsm...@gmail.com> wrote:
> Can anyone point me to a downloadable open source English dictionary
> suitable for programmatic use with python: I'm programming a puzzle
> generator, and I need to be able to generate more or less complete
> lists of English words, alphabetized. Thanks! DG

On Linux? WordNet and Dict and many others.

On Windows, maybe try WordWeb?

rd

Fredrik Lundh

1/7/2008 10:48:00 PM

dgoldsmith_89 wrote:

> Can anyone point me to a downloadable open source English dictionary
> suitable for programmatic use with python: I'm programming a puzzle
> generator, and I need to be able to generate more or less complete
> lists of English words, alphabetized. Thanks! DG

here's one:

http://www.dcs.shef.ac.uk/research/i...

</F>

dgoldsmith_89

1/7/2008 10:50:00 PM

On Jan 7, 2:46 pm, Rick Dooling <rpdool...@gmail.com> wrote:
> On Jan 7, 4:37 pm, dgoldsmith_89 <d.l.goldsm...@gmail.com> wrote:
>
> > Can anyone point me to a downloadable open source English dictionary
> > suitable for programmatic use with python: I'm programming a puzzle
> > generator, and I need to be able to generate more or less complete
> > lists of English words, alphabetized. Thanks! DG
>
> On Linux? WordNet and Dict and many others.
>
> On Windows, maybe try WordWeb?
>
> rd

Sorry, didn't know it would make a difference: on Mac, actually.

DG

Mensanator

1/7/2008 10:54:00 PM

On Jan 7, 4:37 pm, dgoldsmith_89 <d.l.goldsm...@gmail.com> wrote:
> Can anyone point me to a downloadable open source English dictionary
> suitable for programmatic use with python: I'm programming a puzzle
> generator, and I need to be able to generate more or less complete
> lists of English words, alphabetized.  Thanks!  DG


www.puzzlers.org has numerous word lists & dictionaries in text
format that can be downloaded. I recommend you insert them into
some form of database. I have most of them in an Access db and
it's 95 MB. That's a worst case, as I also have some value-added
stuff; the OSPD alone would be a lot smaller.

<http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists...

Tobiah

1/7/2008 10:59:00 PM

dgoldsmith_89 wrote:
> Can anyone point me to a downloadable open source English dictionary
> suitable for programmatic use with python: I'm programming a puzzle
> generator, and I need to be able to generate more or less complete
> lists of English words, alphabetized. Thanks! DG

If all you want are the words themselves, then any Linux box
has a fairly complete list. I put mine here:

http://tobiah.org...
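A minimal sketch of turning such a plain-text word list into an alphabetized Python list, written for current Python 3. The path `/usr/share/dict/words` is an assumption about where a Linux or Mac system keeps its list, and `clean_words` is a hypothetical helper name, not something from this thread.

```python
from pathlib import Path


def clean_words(lines):
    """Return a sorted list of unique, lowercase, purely alphabetic words."""
    words = set()
    for line in lines:
        word = line.strip().lower()
        # Skip blank lines and entries containing apostrophes, hyphens, digits, etc.
        if word.isascii() and word.isalpha():
            words.add(word)
    return sorted(words)


# Assumed location of the system word list; adjust for your machine.
WORD_FILE = Path("/usr/share/dict/words")

if WORD_FILE.exists():
    with open(WORD_FILE, encoding="utf-8", errors="replace") as handle:
        words = clean_words(handle)
    print(len(words), "words; first few:", words[:5])
```

Filtering to ASCII letters is a design choice for puzzle generation; drop that test if you want accented or hyphenated entries.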


dgoldsmith_89

1/7/2008 11:08:00 PM

On Jan 7, 2:47 pm, Fredrik Lundh <fred...@pythonware.com> wrote:
> dgoldsmith_89 wrote:
> > Can anyone point me to a downloadable open source English dictionary
> > suitable for programmatic use with python: I'm programming a puzzle
> > generator, and I need to be able to generate more or less complete
> > lists of English words, alphabetized. Thanks! DG
>
> here's one:
>
> http://www.dcs.shef.ac.uk/research/i...
>
> </F>

Excellent, that'll do nicely! Thanks!!!

DG

dgoldsmith_89

1/7/2008 11:11:00 PM

On Jan 7, 2:54 pm, "mensana...@aol.com" <mensana...@aol.com> wrote:
> On Jan 7, 4:37 pm, dgoldsmith_89 <d.l.goldsm...@gmail.com> wrote:
>
> > Can anyone point me to a downloadable open source English dictionary
> > suitable for programmatic use with python: I'm programming a puzzle
> > generator, and I need to be able to generate more or less complete
> > lists of English words, alphabetized. Thanks! DG
>
> www.puzzlers.org has numerous word lists & dictionaries in text
> format that can be downloaded. I recommend you insert them into
> some form of database. I have most of them in an Access db and
> it's 95 MB. That's a worst case, as I also have some value-added
> stuff; the OSPD alone would be a lot smaller.
>
> <http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists...

Sorry for my ignorance: I can query an Access DB w/ standard SQL
queries (and this is how I would access it w/ Python)?

DG

Paul McGuire

1/7/2008 11:16:00 PM

On Jan 7, 5:10 pm, dgoldsmith_89 <d.l.goldsm...@gmail.com> wrote:
>
> Sorry for my ignorance: I can query an Access DB w/ standard SQL
> queries (and this is how I would access it w/ Python)?
>
> DG

If you are running on a Mac, just use sqlite, it's built-in to Python
as of v2.5 and you will find more help, documentation, and fellow
Python+sqlite users than you will Python+Access.

-- Paul
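
A minimal sketch of the sqlite suggestion, written for current Python 3 using the standard-library `sqlite3` module. The table name `words`, the helper `build_word_db`, and the sample words are assumptions made for illustration.

```python
import sqlite3


def build_word_db(words, db_path=":memory:"):
    """Create (or open) a database holding one row per unique word."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")
    # INSERT OR IGNORE silently skips duplicates instead of raising an error.
    con.executemany(
        "INSERT OR IGNORE INTO words (word) VALUES (?)",
        ((w,) for w in words),
    )
    con.commit()
    return con


con = build_word_db(["pear", "apple", "reap", "plum", "pear"])

# Parameterized query: all four-letter words, alphabetized by the database.
rows = con.execute(
    "SELECT word FROM words WHERE length(word) = ? ORDER BY word", (4,)
).fetchall()
four_letter = [row[0] for row in rows]
print(four_letter)
```

Pass a file name instead of `":memory:"` to keep the database on disk between runs.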

Mensanator

1/7/2008 11:51:00 PM

On Jan 7, 5:10 pm, dgoldsmith_89 <d.l.goldsm...@gmail.com> wrote:
> On Jan 7, 2:54 pm, "mensana...@aol.com" <mensana...@aol.com> wrote:
>
> > On Jan 7, 4:37 pm, dgoldsmith_89 <d.l.goldsm...@gmail.com> wrote:
>
> > > Can anyone point me to a downloadable open source English dictionary
> > > suitable for programmatic use with python: I'm programming a puzzle
> > > generator, and I need to be able to generate more or less complete
> > > lists of English words, alphabetized.  Thanks!  DG
>
> > www.puzzlers.org has numerous word lists & dictionaries in text
> > format that can be downloaded. I recommend you insert them into
> > some form of database. I have most of them in an Access db and
> > it's 95 MB. That's a worst case, as I also have some value-added
> > stuff; the OSPD alone would be a lot smaller.
>
> > <http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists...
>
> Sorry for my ignorance: I can query an Access DB w/ standard SQL
> queries (and this is how I would access it w/ Python)?

Yes, if you have the appropriate way to link to the DB.
I use Windows and ODBC from Win32. I don't know what you
would use on a Mac.

As Paul McGuire said, you could easily do this with SQLite3.
Personally, I always use Access since my job requires it
and I find it much more convenient. I often use crosstab
tables, which I think SQLite3 doesn't support. Typically,
I'll write complex queries in Access and simple SELECT
statements in Python to grab them.

Here's my anagram locator. (the [signature] is an example
of the value-added I mentioned).

## I took a somewhat different approach. Instead of in a file,
## I've got my word list (562456 words) in an MS-Access database.
## And instead of calculating the signature on the fly, I did it
## once and added the signature as a second field:
##
## TABLE CONS_alpha_only_signature_unique
## --------------------------------------
## CONS text 75
## signature text 26
##
## The signature is a 26 character string where each character is
## the count of occurrences of the matching letter. Luckily, in
## only a single case were there more than 9 occurrences of any
## given letter, which turned out not to be a word but a series of
## words concatenated, so I just deleted it from the database
## (lots of crap in the original word list I used).
##
## Example:
##
## CONS      signature
## aah       20000001000000000000000000   # 'a' occurs twice & 'h' once
## aahed     20011001000000000000000000
## aahing    20000011100001000000000000
## aahs      20000001000000000010000000
## aaii      20000000200000000000000000
## aaker     20001000001000000100000000
## aal       20000000000100000000000000
## aalborg   21000010000100100100000000
## aalesund  20011000000101000010100000
##
## Any words with identical signatures must be anagrams.
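
The signature described above can be sketched in a few lines of current Python 3. The function name `signature` and the decision to raise an error when a letter occurs more than 9 times are assumptions; the post only says the one offending entry was deleted.

```python
import string


def signature(word):
    """Return a 26-character string of per-letter counts, 'a' first."""
    counts = [0] * 26
    for ch in word.lower():
        if ch in string.ascii_lowercase:
            counts[ord(ch) - ord("a")] += 1
    # One digit per letter only works while every count stays below 10.
    if max(counts) > 9:
        raise ValueError(f"a letter occurs more than 9 times in {word!r}")
    return "".join(str(c) for c in counts)


print(signature("aah"))
print(signature("reactions") == signature("creations"))
```

Two words are anagrams exactly when their signatures compare equal, so the signature can serve as a dictionary key or an indexed database column.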
##
## Once this was set up, I wrote a whole bunch of queries
## to use this table. I use the normal Access drag and drop
## design, but the SQL can be extracted from each, so I can
## simply open the query from Python or I can grab the SQL
## and build it inside the program. The example
##
## signatures_anagrams_select_signature
##
## is hard coded for criteria 9 & 10 and should be built inside
## Python so the criteria can be changed dynamically.
##
##
## QUERY signatures_anagrams_longest
## ---------------------------------
## SELECT Len([CONS]) AS Expr1,
##        Count(Cons_alpha_only_signature_unique.CONS) AS CountOfCONS,
##        Cons_alpha_only_signature_unique.signature
## FROM Cons_alpha_only_signature_unique
## GROUP BY Len([CONS]),
##          Cons_alpha_only_signature_unique.signature
## HAVING (((Count(Cons_alpha_only_signature_unique.CONS))>1))
## ORDER BY Len([CONS]) DESC ,
##          Count(Cons_alpha_only_signature_unique.CONS) DESC;
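
For readers without Access, roughly the same grouping can be sketched in plain Python 3. This is not the query above: it swaps the 26-digit signature for the sorted letters of each word (an equivalent anagram key), and `anagram_sets` is a hypothetical name.

```python
from collections import defaultdict


def anagram_sets(words):
    """Group words into anagram sets, longest words and largest sets first."""
    groups = defaultdict(set)
    for word in words:
        word = word.lower()
        # Sorted letters play the role of the signature column.
        groups["".join(sorted(word))].add(word)
    # Keep only signatures shared by more than one word (the HAVING clause).
    sets_ = [sorted(members) for members in groups.values() if len(members) > 1]
    # Mirror ORDER BY word length DESC, set size DESC; ties broken alphabetically.
    sets_.sort(key=lambda g: (-len(g[0]), -len(g), g[0]))
    return sets_


sample = ["reactions", "creations", "pear", "reap", "rape", "plum"]
for group in anagram_sets(sample):
    print(len(group[0]), len(group), " ".join(group))
```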
##
## This is why I don't use SQLite3, must have crosstab queries.
##
## QUERY signatures_anagram_summary
## --------------------------------
## TRANSFORM Count(signatures_anagrams_longest.signature) AS CountOfsignature
## SELECT signatures_anagrams_longest.Expr1 AS [length of word]
## FROM signatures_anagrams_longest
## GROUP BY signatures_anagrams_longest.Expr1
## PIVOT signatures_anagrams_longest.CountOfCONS;
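
As far as I know SQLite has no TRANSFORM/PIVOT syntax, but the same crosstab can be built on the Python side. The following is a sketch in current Python 3 under that assumption; `anagram_crosstab` and `print_crosstab` are hypothetical names, and sorted letters stand in for the signature.

```python
from collections import Counter, defaultdict


def anagram_crosstab(words):
    """Count anagram sets per (word length, set size) pair."""
    groups = defaultdict(set)
    for word in words:
        word = word.lower()
        groups["".join(sorted(word))].add(word)
    table = Counter()
    for key, members in groups.items():
        if len(members) > 1:
            table[(len(key), len(members))] += 1
    return table


def print_crosstab(table):
    """Print word lengths as rows and anagram set sizes as columns."""
    lengths = sorted({length for length, _ in table})
    sizes = sorted({size for _, size in table})
    print("    " + "".join(f"{size:>5}" for size in sizes))
    for length in lengths:
        # Empty cells are left blank rather than printed as None.
        cells = "".join(f"{table.get((length, size), ''):>5}" for size in sizes)
        print(f"{length:>4}" + cells)


sample = ["pear", "reap", "rape", "post", "stop", "plum", "reactions", "creations"]
table = anagram_crosstab(sample)
print_crosstab(table)
```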
##
##
## QUERY signatures_anagrams_select_signature
## ------------------------------------------
## SELECT Len([CONS]) AS Expr1,
##        Count(Cons_alpha_only_signature_unique.CONS) AS CountOfCONS,
##        Cons_alpha_only_signature_unique.signature
## FROM Cons_alpha_only_signature_unique
## GROUP BY Len([CONS]),
##          Cons_alpha_only_signature_unique.signature
## HAVING (((Len([CONS]))=9) AND
##         ((Count(Cons_alpha_only_signature_unique.CONS))=10))
## ORDER BY Len([CONS]) DESC ,
##          Count(Cons_alpha_only_signature_unique.CONS) DESC;
##
## QUERY signatures_lookup_by_anagram_select_signature
## ---------------------------------------------------
## SELECT signatures_anagrams_select_signature.Expr1,
##        signatures_anagrams_select_signature.CountOfCONS,
##        Cons_alpha_only_signature_unique.CONS,
##        Cons_alpha_only_signature_unique.signature
## FROM signatures_anagrams_select_signature
## INNER JOIN Cons_alpha_only_signature_unique
##    ON signatures_anagrams_select_signature.signature
##     = Cons_alpha_only_signature_unique.signature;
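
One way to make the two hard-coded criteria dynamic is a parameterized query. The sketch below uses `sqlite3` in current Python 3 rather than Access; the table `words(word, signature)`, the helper names, and the sample data are all assumptions made for illustration.

```python
import sqlite3

# Select the signatures that match both criteria, then join back to the words.
QUERY = """
SELECT w.word
FROM words AS w
JOIN (
    SELECT signature
    FROM words
    GROUP BY length(word), signature
    HAVING length(word) = ? AND COUNT(*) = ?
) AS s ON s.signature = w.signature
ORDER BY w.word
"""


def signature(word):
    """26 per-letter counts as one string (assumes no letter occurs 10+ times)."""
    return "".join(str(word.count(c)) for c in "abcdefghijklmnopqrstuvwxyz")


def words_in_anagram_sets(con, word_length, set_size):
    """Return all words of the given length whose anagram set has set_size members."""
    return [row[0] for row in con.execute(QUERY, (word_length, set_size))]


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE words (word TEXT PRIMARY KEY, signature TEXT)")
sample = ["pear", "reap", "rape", "post", "stop", "plum"]
con.executemany("INSERT INTO words VALUES (?, ?)", [(w, signature(w)) for w in sample])

print(words_in_anagram_sets(con, 4, 3))
print(words_in_anagram_sets(con, 4, 2))
```

Passing the values as parameters, rather than formatting them into the SQL string, avoids quoting problems and lets the same statement be reused for any length and set size.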
##
##
## Now it's a simple matter to use the ODBC from Win32 to extract
## the query output into Python.

import dbi
import odbc

con = odbc.odbc("words")
cursor = con.cursor()

## This first section grabs the anagram summary. Note that
## queries act just like tables (as long as they don't have
## internal dependencies). I read somewhere you can get the
## field names, but here I put them in by hand.

##cursor.execute("SELECT * FROM signatures_anagram_summary")
##
##results = cursor.fetchall()
##
##for i in results:
##    for j in i:
##        print '%4s' % (str(j)),
##    print
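
On getting the field names: the Python DB-API 2.0 specification (PEP 249) defines a `cursor.description` attribute whose entries start with the column name. The sketch below shows it with `sqlite3` in current Python 3; whether the `odbc` module used in this post fills it in the same way is not verified here, and the `summary` table is made up for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE summary (word_length INTEGER, set_size INTEGER, n INTEGER)")
con.execute("INSERT INTO summary VALUES (9, 10, 1)")

cursor = con.execute("SELECT word_length, set_size, n FROM summary")

# Each entry of cursor.description is a 7-item sequence; item 0 is the column name.
field_names = [column[0] for column in cursor.description]
print(field_names)

for row in cursor.fetchall():
    # Pair each value with its column name instead of hard-coding headers.
    print(dict(zip(field_names, row)))
```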

##         2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   23
##    2  259 None None None None None None None None None None None None None None None None None
##    3  487  348  218  150  102 None None None None None None None None None None None None None
##    4 1343  718  398  236  142  101   51   26   25    9    8    3    2 None None None None None
##    5 3182 1424  777  419  274  163  106   83   53   23   20   10    6    4    5    1    3    1
##    6 5887 2314 1051  545  302  170  114   54   43   21   15    6    5    4    4    2 None None
##    7 7321 2251  886  390  151   76   49   37   14    7    5    1    1    1 None None None None
##    8 6993 1505  452  166   47   23    8    6    4    2    2 None None None None None None None
##    9 5127  830  197   47   17    6 None None    1 None None None None None None None None None
##   10 2975  328   66    8    2 None None None None None None None None None None None None None
##   11 1579  100    5    4    2 None None None None None None None None None None None None None
##   12  781   39    2    1 None None None None None None None None None None None None None None
##   13  326   11    2 None None None None None None None None None None None None None None None
##   14  166    2 None None None None None None None None None None None None None None None None
##   15   91 None    1 None None None None None None None None None None None None None None None
##   16   60 None None None None None None None None None None None None None None None None None
##   17   35 None None None None None None None None None None None None None None None None None
##   18   24 None None None None None None None None None None None None None None None None None
##   19   11 None None None None None None None None None None None None None None None None None
##   20    6 None None None None None None None None None None None None None None None None None
##   21    6 None None None None None None None None None None None None None None None None None
##   22    4 None None None None None None None None None None None None None None None None None

## From the query we have the word size as row header and size of
## anagram set as column header. The data value is the count of
## how many different anagram sets match the row/column header.
##
## For example, there are 7321 different 7-letter signatures that
## each have a 2-member anagram set. There is one 5-letter signature
## with a 23-member anagram set.
##
## We can then pick any of these, say the single 10-member anagram
## set of 9-letter words, and query out the anagrams:


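## Run the stored lookup query and print one row per anagram
## (Python 2 print statements, as in the rest of this post).
cursor.execute("SELECT * FROM signatures_lookup_by_anagram_select_signature")
results = cursor.fetchall()
for i in results:
    for j in i:
        print j,
    print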

## 9 10 anoretics 10101000100001100111000000
## 9 10 atroscine 10101000100001100111000000
## 9 10 certosina 10101000100001100111000000
## 9 10 creations 10101000100001100111000000
## 9 10 narcotise 10101000100001100111000000
## 9 10 ostracine 10101000100001100111000000
## 9 10 reactions 10101000100001100111000000
## 9 10 secration 10101000100001100111000000
## 9 10 tinoceras 10101000100001100111000000
## 9 10 tricosane 10101000100001100111000000

## Nifty, eh?


>
> DG

dgoldsmith_89

1/8/2008 6:26:00 PM

On Jan 7, 3:50 pm, "mensana...@aol.com" <mensana...@aol.com> wrote:
> [snip: full quote of the anagram locator post]
>
> ## Nifty, eh?

Yes, nifty. Thanks for all the help, all!

DG