Jim Newton
9/28/2015 10:49:00 AM
I'd like to parse a file in a certain way, and I think cl-ppcre might be suitable for it,
but I'm not sure of all the consequences.
I'd like to parse the content of a file into "tokens" such that
a "word" is a multi-character string token while each "punctuation" character is a single-character string token.
Further, I'd like a given unary function to be called on each token. I don't care in which order the function gets called, but it should not be called on whitespace. I.e., whitespace should delimit tokens but should not itself be considered a token.
For example, if the file contains the following line:
abc.ddeeff->ghi jk(l[mnop])+
I'd like the given function, F, to be called as follows (in any order):
(F "abc")
(F ".")
(F "ddeeff")
(F "-")
(F ">")
(F "ghi")
(F "jk")
(F "(")
(F "l")
(F "[")
(F "mnop")
(F "]")
(F ")")
(F "+")
Can I do this with cl-ppcre?
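Here is a minimal sketch of what I have in mind, assuming the cl-ppcre system is loaded (e.g. via (ql:quickload "cl-ppcre")) and ASCII text:

```lisp
;; Sketch only.  The alternation tries "[a-zA-Z0-9]+" first, so a
;; maximal run of alphanumerics becomes one "word" token; otherwise
;; "\\S" matches a single non-whitespace character, so each
;; punctuation character becomes its own one-character token.
;; Whitespace is never matched, so it delimits tokens without
;; producing one.
(defun map-tokens (function string)
  "Call FUNCTION on each token of STRING, skipping whitespace."
  (cl-ppcre:do-matches-as-strings (token "[a-zA-Z0-9]+|\\S" string)
    (funcall function token)))
```

With F bound to a suitable function, (map-tokens F "abc.ddeeff->ghi jk(l[mnop])+") would call it on "abc", ".", "ddeeff", "-", ">", "ghi", "jk", "(", "l", "[", "mnop", "]", ")", "+", in left-to-right order.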
What part of my spec is unclear/contradictory?
For example, what should I do about character encoding? What should be considered a word and what should be considered punctuation? Are there good ways to specify this?
Perhaps it would be even easier if I had the predicate functions is-punctuation?, is-whitespace?, and is-alpha-numeric?.
Are there good ways to write those functions so that I won't be debugging them for the next 6 months?
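For concreteness, here is a sketch of those three predicates in plain Common Lisp, assuming ASCII input:

```lisp
;; Sketch only, assuming ASCII.  Note that ALPHANUMERICP may accept
;; characters beyond [a-zA-Z0-9] (e.g. accented letters), depending on
;; the implementation's character repertoire.
(defun is-whitespace? (ch)
  "True if CH is a whitespace character."
  (member ch '(#\Space #\Tab #\Newline #\Return #\Page) :test #'char=))

(defun is-alpha-numeric? (ch)
  "True if CH is a letter or a digit."
  (alphanumericp ch))

(defun is-punctuation? (ch)
  "True if CH is neither whitespace nor alphanumeric."
  (not (or (is-whitespace? ch) (is-alpha-numeric? ch))))
```

Defining punctuation negatively, as "everything that is neither whitespace nor alphanumeric", keeps the three predicates mutually exclusive and exhaustive, which seems easier to test than three independent character lists.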
The end goal is that I want to build histograms of "occurrences" of words and punctuation in texts from various programming languages and human languages. I.e., I want to run the program on APL code, R code, Lisp code, C++ code, and any other language for which I can find a corpus. And also run it on English, Spanish, Finnish, cropa as well.
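For the histogram side, the unary function F could simply close over a hash table and count each token it is handed. A sketch in plain Common Lisp (MAKE-COUNTER is a name invented here for illustration):

```lisp
;; Sketch only.  Returns two closures over one shared hash table:
;; the first counts a token, the second reports the accumulated
;; counts as an alist sorted by decreasing frequency.
(defun make-counter ()
  "Return (VALUES COUNT-TOKEN REPORT): COUNT-TOKEN increments the
count of the token it is given; REPORT returns an alist of
\(token . count) pairs sorted by decreasing count."
  (let ((counts (make-hash-table :test #'equal)))
    (values
     (lambda (token)
       (incf (gethash token counts 0)))
     (lambda ()
       (let ((result '()))
         (maphash (lambda (token count) (push (cons token count) result))
                  counts)
         (sort result #'> :key #'cdr))))))
```

Passing the first closure as F to the tokenizer, then calling the second after the whole corpus has been processed, would yield the histogram for one text.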
If someone can point me in the right direction, or warn me about gotchas, I'd appreciate it.
Jim