Jim Newton
9/28/2015 10:49:00 AM
I'd like to parse a file in a certain way, and I think cl-ppcre might be suitable for it,
but I'm not sure of all the consequences.
I'd like to parse the content of a file into "tokens" such that
a "word" is a multi-character string token while each "punctuation" character is a single-character string token.
Further, I'd like a given unary function to be called on each token. I don't care in which order the function gets called, but it should not be called on whitespace. I.e., whitespace should delimit tokens but should not itself be considered a token.
For example, if the file contains the following line:
abc.ddeeff->ghi jk(l[mnop])+
I'd like the given function, F, to be called as follows (in any order):
(F "abc")
(F ".")
(F "ddeeff")
(F "-")
(F ">")
(F "ghi")
(F "jk")
(F "(")
(F "l")
(F "[")
(F "mnop")
(F "]")
(F ")")
(F "+")
Can I do this with cl-ppcre?
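Here is a minimal sketch of what I have in mind, assuming the cl-ppcre system is loaded (e.g. via (ql:quickload "cl-ppcre")) and ASCII text:

```lisp
;; Sketch only.  The alternation tries "[a-zA-Z0-9]+" first, so a
;; maximal run of alphanumerics becomes one "word" token; otherwise
;; "\\S" matches a single non-whitespace character, so each
;; punctuation character becomes its own one-character token.
;; Whitespace is never matched, so it delimits tokens without
;; producing one.
(defun map-tokens (function string)
  "Call FUNCTION on each token of STRING, skipping whitespace."
  (cl-ppcre:do-matches-as-strings (token "[a-zA-Z0-9]+|\\S" string)
    (funcall function token)))
```

With F bound to a suitable function, (map-tokens F "abc.ddeeff->ghi jk(l[mnop])+") would call it on "abc", ".", "ddeeff", "-", ">", "ghi", "jk", "(", "l", "[", "mnop", "]", ")", "+", in left-to-right order.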
What part of my spec is unclear/contradictory?
For example, what should I do about character encoding? What should be considered a word and what should be considered punctuation? Are there good ways to specify this?
Perhaps it would be even easier if I had the predicate functions is-punctuation?, is-whitespace?, and is-alpha-numeric?.
Are there good ways to write those functions so that I won't be debugging them for the next 6 months?
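For concreteness, here is a sketch of those three predicates in plain Common Lisp, assuming ASCII input:

```lisp
;; Sketch only, assuming ASCII.  Note that ALPHANUMERICP may accept
;; characters beyond [a-zA-Z0-9] (e.g. accented letters), depending on
;; the implementation's character repertoire.
(defun is-whitespace? (ch)
  "True if CH is a whitespace character."
  (member ch '(#\Space #\Tab #\Newline #\Return #\Page) :test #'char=))

(defun is-alpha-numeric? (ch)
  "True if CH is a letter or a digit."
  (alphanumericp ch))

(defun is-punctuation? (ch)
  "True if CH is neither whitespace nor alphanumeric."
  (not (or (is-whitespace? ch) (is-alpha-numeric? ch))))
```

Defining punctuation negatively, as "everything that is neither whitespace nor alphanumeric", keeps the three predicates mutually exclusive and exhaustive, which seems easier to test than three independent character lists.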
The end goal is that I want to build histograms of "occurrences" of words and punctuation in texts from various programming languages and human languages. I.e., I want to run the program on APL code, R code, Lisp code, C++ code, and any other language for which I can find a corpus. And also run it on English, Spanish, Finnish, cropa as well.
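For the histogram side, the unary function F could simply close over a hash table and count each token it is handed. A sketch in plain Common Lisp (MAKE-COUNTER is a name invented here for illustration):

```lisp
;; Sketch only.  Returns two closures over one shared hash table:
;; the first counts a token, the second reports the accumulated
;; counts as an alist sorted by decreasing frequency.
(defun make-counter ()
  "Return (VALUES COUNT-TOKEN REPORT): COUNT-TOKEN increments the
count of the token it is given; REPORT returns an alist of
\(token . count) pairs sorted by decreasing count."
  (let ((counts (make-hash-table :test #'equal)))
    (values
     (lambda (token)
       (incf (gethash token counts 0)))
     (lambda ()
       (let ((result '()))
         (maphash (lambda (token count) (push (cons token count) result))
                  counts)
         (sort result #'> :key #'cdr))))))
```

Passing the first closure as F to the tokenizer, then calling the second after the whole corpus has been processed, would yield the histogram for one text.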
If someone can point me in the right direction, or warn me about gotchas, I'd appreciate it.
Jim