comp.lang.python

Trouble with quotes

Stephen Nelson-Smith

3/8/2010 5:06:00 PM

Hi,

I've written some (primitive) code to parse some Apache logfiles and
establish whether Apache has appended a session cookie to the end.
We're finding that some browsers don't send one, and Apache doesn't
just append a "-" in that case; it simply omits the field.

It's working fine, but for an edge case:

Couldn't match 192.168.1.107 - - [24/Feb/2010:20:30:44 +0100] "GET
http://sekrit.com/n... HTTP/1.1" 200 -
"http://sekrit.com/searc..."3%2B2%20course&q... "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:31:15 +0100] "GET
http://sekrit.com/n... HTTP/1.1" 200 -
"http://sekrit.com/searc..."3%2B2%20course&q... "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:32:07 +0100] "GET
http://sekrit.com/n... HTTP/1.1" 200 -
"http://sekrit.com/searc..."3%2B2%20course&q... "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:32:33 +0100] "GET
http://sekrit.com/n... HTTP/1.1" 200 -
"http://sekrit.com/searc..."3%2B2%20course&q... "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:33:01 +0100] "GET
http://sekrit.com/n... HTTP/1.1" 200 -
"http://sekrit.com/searc..."3%2B2%20course&q... "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [25/Feb/2010:17:01:54 +0100] "GET
http://sekrit.com/searc... HTTP/1.0" 200 -
"http://sekrit.com/searc..."guideline%20grids"&page=1"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)"
Couldn't match 192.168.1.107 - - [25/Feb/2010:17:02:15 +0100] "GET
http://sekrit.com/searc... HTTP/1.0" 200 -
"http://sekrit.com/searc..."guideline%20grids"&page=1"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)"

If there are " " inside the request string, my regex breaks.

Here's the code:

#!/usr/bin/env python
import re

# The pattern was a single long line in the original post; the line
# wraps are assumed to have been single spaces and are restored here.
pattern = (
    r'(?P<ForwardedFor>^(-|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
    r'(, [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})*){1}) '
    r'(?P<RemoteLogname>(\S*)) (?P<RemoteUser>(\S*)) '
    r'(?P<Timestamp>(\[[^\]]+\])) '
    r'(?P<FirstLineOfRequest>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) '
    r'(?P<Status>(\S*)) (?P<Size>(\S*)) '
    r'(?P<Referrer>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) '
    r'(?P<UserAgent>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)'
    r'( )?(?P<SiteIntelligenceCookie>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)'
)

regex = re.compile(pattern)

lines = 0
no_cookies = 0
unmatched = 0

for line in open('/home/stephen/scratch/test-data.txt'):
    lines += 1
    line = line.strip()
    match = regex.match(line)

    if match:
        data = match.groupdict()
        # The cookie group always participates; it is '' when absent.
        if data['SiteIntelligenceCookie'] == '':
            no_cookies += 1
    else:
        print("Couldn't match %s" % (line,))
        unmatched += 1

print("I analysed %s lines." % (lines,))
print("There were %s lines with missing Site Intelligence cookies." %
      (no_cookies,))
print("I was unable to process %s lines." % (unmatched,))

How can I make the regex a bit more resilient so it doesn't break when
" " is embedded?

--
Stephen Nelson-Smith
Technical Director
Atalanta Systems Ltd
www.atalanta-systems.com
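
One possible way to make the quoted fields in the pattern above more tolerant, offered only as a sketch and not as anything from the thread: treat a double quote as the end of a field only when it is followed by a space or the end of the line. The helper `quoted`, the name `LINE`, the simplified non-quoted fields (a comma-separated forwarded-for list is not handled) and the cookie value `abc123` are all assumptions made for illustration. The sample is one of the unmatched lines from the post, with the wrapped lines joined by single spaces.

```python
import re

def quoted(name):
    # Hypothetical helper: a quoted field whose closing quote must be
    # followed by a space or the end of the line (lookahead). The lazy
    # .*? lets embedded quotes such as ..."guideline... pass through.
    return r'"(?P<%s>.*?)"(?= |$)' % name

# Simplified stand-in for the pattern in the post (assumption: one
# address in the first field, no comma-separated list).
LINE = re.compile(
    r'^(?P<ForwardedFor>\S+) (?P<RemoteLogname>\S+) (?P<RemoteUser>\S+) '
    r'(?P<Timestamp>\[[^\]]+\]) ' + quoted('FirstLineOfRequest') +
    r' (?P<Status>\S+) (?P<Size>\S+) ' + quoted('Referrer') +
    r' ' + quoted('UserAgent') +
    r'(?: ' + quoted('SiteIntelligenceCookie') + r')?$'
)

sample = ('192.168.1.107 - - [25/Feb/2010:17:01:54 +0100] '
          '"GET http://sekrit.com/searc... HTTP/1.0" 200 - '
          '"http://sekrit.com/searc..."guideline%20grids"&page=1" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
          '.NET CLR 1.1.4322; .NET CLR 2.0.50727)"')

m = LINE.match(sample)
# The optional cookie group did not take part in the match, so it is None.
missing_cookie = m is not None and m.group('SiteIntelligenceCookie') is None

# Same line with a made-up cookie appended (the value is hypothetical).
m_cookie = LINE.match(sample + ' "abc123"')
```

This is a heuristic: a field that itself contains a quote followed by a space would still be split in the wrong place. Note also that a missing cookie shows up as `None` here rather than as the empty string tested for in the original script.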
1 Answer

Martin P. Hellwig

3/8/2010 5:40:00 PM


On 03/08/10 17:06, Stephen Nelson-Smith wrote:
> Hi,
>
> I've written some (primitive) code to parse some apache logfiles and
> establish if apache has appended a session cookie to the end. We're
> finding that some browsers don't and apache doesn't just append a "-"
> - it just omits it.
>
> It's working fine, but for an edge case:
>
> Couldn't match 192.168.1.107 - - [24/Feb/2010:20:30:44 +0100] "GET
> http://sekrit.com/n... HTTP/1.1" 200 -
> "http://sekrit.com/search/results/"3%2B2%20course&q... "Mozilla/4.0
> (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
<cut rest>
I didn't try to mentally parse the regex pattern (I like to keep
reasonably sane). However, from the sound of it the script barfs when
there is a quoted part inside the second URL (the referrer). So how
about doing a simple string.replace('/"','') & string.replace('" ','')
before doing your re foo?
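
A minimal sketch of the clean-up idea above, under stated assumptions. Taken literally, replacing '" ' would also remove the closing quote of every field, so this version narrows the idea to quotes that have a non-space character on both sides. The names `EMBEDDED_QUOTE` and `strip_embedded_quotes` are made up for illustration, and the sample is one of the unmatched lines from the original post with the wrapped lines joined by single spaces.

```python
import re

# Hypothetical helper, not from the thread: a double quote with a
# non-space character on both sides, i.e. one embedded inside a field.
EMBEDDED_QUOTE = re.compile(r'(?<=\S)"(?=\S)')

def strip_embedded_quotes(line):
    """Remove quotes embedded in a field; field delimiters are kept."""
    return EMBEDDED_QUOTE.sub('', line)

sample = ('192.168.1.107 - - [25/Feb/2010:17:01:54 +0100] '
          '"GET http://sekrit.com/searc... HTTP/1.0" 200 - '
          '"http://sekrit.com/searc..."guideline%20grids"&page=1" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
          '.NET CLR 1.1.4322; .NET CLR 2.0.50727)"')

cleaned = strip_embedded_quotes(sample)

# The quoted-field subpattern from the original post now finds three
# well-formed fields: request line, referrer and user agent.
fields = re.findall(r'"([^"\\]*(?:\\.[^"\\]*)*)"', cleaned)
```

Calling `strip_embedded_quotes(line)` just before `regex.match(line)` in the original loop would be the obvious place to try it. The trade-off is that the referrer is altered, which is probably acceptable when the goal is only to count lines without a cookie.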

--
mph