comp.lang.ruby

reformatting a text file that has some binary in it

Adam Akhtar

4/15/2009 1:20:00 PM

I have never worked with binary before, and after trying to solve this
problem for 3 hours I'm turning to the community for help.

I have a text file whose entries consist of a key written in
binary and its values written as strings (you can see an excerpt below).

I need to parse the binary and transform it into human-readable hex and
parse its associated info. My regexps don't seem to be behaving, and I'm
wondering if it's me or if it's this binary text that is causing mischief
somehow. Here's a sample item:

20: 琮祺ア��キ�G�まd8:completei7e10:downloadedi2046e10:incompletei1ee


Binary parts are always enclosed between "20:" and "d8:complete", where
the 8 can be any integer, e.g. 5 or 23.

str = File.open('textfile.txt' , 'r').readlines.join
str.gsub!(/(20:)(.*?)(d\d+:)/m) do |x|
$1 + $2.unpack('H*').join + $3
end

The above works for some but not all of the text. It seems to go beyond
the "d8:complete" marker.

Here's a bigger sample set if need be. Any tips or pointers would be
greatly appreciated.


d8:completei2e10:downloadedi770e10:incompletei1ee20:
琮祺ア��キ�G�まd8:completei7e10:downloadedi2046e10:incompletei1ee20:
6ç??ï¾?ã?»ï£³ æ¶?jュã?»wã?»d8:completei0e10:downloadedi72602e10:incompletei1ee20:
}tェスh>・�����d8:completei3e10:downloadedi7718e10:incompletei2ee20: �C
ï¾?ï½³J<ィFï¾?0ï¾?ç??Wd8:completei2e10:downloadedi617e10:incompletei0ee20:
incï¾?ã?»U]~鼡鐐å??< d8:completei3e10:downloadedi533e10:incompletei0ee20:
ゥ<Z�<0ケ_!Y/�3d8:completei1e10:downloadedi281e10:incompletei0ee20:
6ï¾?î??iå?ã?»ç?¤ï½¯ã?»ï½µGd8:completei0e10:downloadedi216e10:incompletei1ee20:
I�l�ォヲ7Z&��K�建�Ed8:completei4e10:downloadedi262e10:incompletei3ee20:
Smï¾?ソï¾?æ?¯ï½¿ï¾?ィh7rã?»é??d8:completei3e10:downloadedi787e10:incompletei0ee20:
Yjスゥ�d � ��]�ud8:completei0e10:downloadedi154e10:incompletei1ee20:
bj�]VF�w仭鱧卍�d8:completei10e10:downloadedi505e10:incompletei16ee20:
hã?»ï¾æ£\MDaî??3ヘコd8:completei2e10:downloadedi1050e10:incompletei2ee20:
hヘ���=ゥ�ソ��「�d8:completei1e10:downloadedi57e10:incompletei2ee20:
mbâ??<GSï½·bオゥqTã?»?d8:completei1e10:downloadedi3860e10:incompletei1ee20:
u �axB�z<�3d8:completei3e10:downloadedi700e10:incompletei7ee20:
u[���泱@2 ウ4ァ�d8:completei0e10:downloadedi658e10:incompletei3ee20:
��a�$ン#�3!COd8:completei3e10:downloadedi304e10:incompletei0ee20:
槍�G�_廰偶Z、7}d8:completei3e10:downloadedi2285e10:incompletei2ee20:
�サ�「�Gaヲ`q�vBd8:completei6e10:downloadedi1061e10:incompletei5ee20:
ç??ï½´?eî??ゥケソ母ã?»ï½©ï¾?jã?»d8:completei3e10:downloadedi2902e10:incompletei1ee20:
セオッ�Mエ_L���ーミld8:completei1e10:downloadedi147e10:incompletei1ee20:
ィï¾?Y#å??Uï¾? "ï½°Fçµ?bï¾?awd8:completei6e10:downloadedi39010e10:incompletei2ee20:
ゥ�ォ�2、�1�tス�8:completei7e10:downloadedi1835e10:incompletei0ee20:
ョ3��Lサ エ)サ�.G�8:completei2e10:downloadedi474e10:incompletei0ee20:
ï½´)POé??
オクヲ&ã?»%Lç??è?§d8:completei4e10:downloadedi3674e10:incompletei0ee20:
チ�ュヘE\GT���翡ェイョd8:completei2e10:downloadedi328e10:incompletei0ee20:
ï¾?è£?ã?»(ã?»é??Kç ¡ï½¾ï¾?ï¾?ィï¾?d8:completei43e10:downloadedi9665e10:incompletei31ee20:
��篠�殱 � 0h�qgd8:completei3e10:downloadedi17686e10:incompletei0ee20:
�Bコ�g�ェ��-/T�d8:completei5e10:downloadedi801e10:incompletei2ee20:
ヘヘ� 4�I�{-u)マア�bd8:completei3e10:downloadedi4878e10:incompletei2ee20:
�����&、Obシ�$�d8:completei4e10:downloadedi1499e10:incompletei0ee20:
ï¾?@iæ??ゥtã?»aKï¾?fç®?.ï¾?ï¾?d8:completei7e10:downloadedi1745e10:incompletei3ee20:
��<シエマ�B�i」\,ェEd8:completei0e10:downloadedi745e10:incompletei1ee20:
ã?»æ©¡æ??Pæ?¦Zï½·ï¾?î?エ吹2ï¾?d8:completei1e10:downloadedi11865e10:incompletei7ee20:
�。B<�ケ����D
=ï½²ã?»ï½µé??7è??î??d8:completei2e10:downloadedi9246e10:incompletei2ee20:G[fJã?»Y*d8:completei15e10:downloadedi3649e10:incompletei11ee20:ã?»/Xrï¾?ï¾?J
ã?»XA
å®?ã?»d8:completei3e10:downloadedi323e10:incompletei0ee20:å??ï½®ã?»ï¾?ッア|
jetei4e10:downloadedi12601e10:incompletei0ee20:aソ���� \�?@�。�8:completei3e10:downloadedi1005e10:incompletei0ee20:ctオ@訣+@�
�ァe5lマd8:completei0e10:downloadedi166e10:incompletei2ee20:cカGlusエBn、�]糾ィ�d8:completei1e10:downloadedi110e10:incompletei0ee20:j+s「�x」�iシ4!�mG~d8:completei5e10:downloadedi6427e10:incompletei0ee20:|1S�M�i贐��
æ?­/~d8:completei5e10:downloadedi865e10:incompletei2ee20:}lカî??ァ/ã?»2k+ï½·Bã?»å?8:completei4e10:downloadedi1032e10:incompletei0ee20:恵シセu碣Pî?ï¾?ã?»knå?8:completei1e10:downloadedi95e10:incompletei1ee20:å­¤ï¾?gHマã?»ï½°Ï?ï½¢Kヘ6ç¶?dd8:completei6e10:downloadedi14810e10:incompletei3ee20:è¢?ï¾?kï¾?ï½²Lpã?»ï¾?î?«Tォ%î??8:completei3e10:downloadedi430e10:incompletei1ee20:æ??W
ュウL��ッ|��8$d8:completei0e10:downloadedi69e10:incompletei1ee20:��b"mィウ��ウ�ケ�8:completei8e10:downloadedi9526e10:incompletei0ee20:ュ�キ4|コェ��屮
l�Ead8:completei15e10:downloadedi1775e10:incompletei9ee20:イ�A�q�kF�{D�ヘ
d8:completei5e10:downloadedi4154e10:incompletei1ee20:ウ03�ゥ8e10:incompletei0ee20:ゥ
漸î??ç©?ï¾?+î?½ï¿¤ï¾?hã?»!d8:completei3e10:downloadedi2874e10:incompletei1ee20:ゥï¾?l,8Hç?ºï¾?溝椿kャ{é¹½8:completei55e10:downloadedi10735e10:incompletei82ee20:ォï¾?~ pヘã?»(Qã??8?uL^4d8:completei2e10:downloadedi140e10:incompletei0ee20:ッî?ィç??ï½¢æ??。Uî??、fã?»8:completei0e10:downloadedi368e10:incompletei3ee20:ï½µï¾?Eî??ã?»d1カスã?»qQBWd8:completei6e10:downloadedi7221e10:incompletei9ee20:ï½»コï¾?,ï¾?tQ_ï¾?マ`ï¾?(ï¾?d8:completei3e10:downloadedi1536e10:incompletei21ee20:ï¾?Sã?»ï¾el%~シュ,yï¾?î??bd8:completei1e10:downloadedi111e10:incompletei1ee20:ï¾?Bd
ã?»ï½¹î²|e]ï½¥"、vTd8:completei2e10:downloadedi1096e10:incompletei1ee20:ï¾?ã?»ï½¾ã?»$î?®î??d8:completei1e10:downloadedi701e10:incompletei0ee20:ï¾?zンィ/ï¾?@g.3å??=ã?»ï¾?d8:completei0e10:downloadedi512e10:incompletei1ee20:ï¾?鏝å?·i15e10:downloadedi2161e10:incompletei8ee20:~è??Bヲ|ï¾?î?µï¾?0篷î??`hd8:completei0e10:downloadedi86e10:incompletei1ee20:話ã?»!ï¾?\é???ã?»Â?チï½°rd8:completei1e10:downloadedi36732e10:incompletei0ee20:ä¿?>(ã?»lå·?=å¡?Â?ã?»é¶?ç§?8:completei5e10:downloadedi8917e10:incompletei1ee20:ã?»ï½»Jï¾?磨gî?»è®F2é ?@kd8:completei7e10:downloadedi1644e10:incompletei22ee20:、"&cB:TRï¾?}ta禰シ0+å½­8:completei3e10:downloadedi418e10:incompletei0ee20:ï½¹ã?»æ?Â?ï¾?$NォcッPç?·{ï½½d8:completei8e10:downloadedi10297e10:incompletei10ee20:ï¾?æ??>Tã?»ゥ
ï¾?ンï¾?>ï¾?î?µæ?­8:completei0e10:downloadedi323e10:incompletei1ee20:ï¾?誨vï½½.s鍬â? ï½ªe1+|<ï½¼Gd8:completei0e10:downloadedi1412e10:incompletei1ee20:ï¾?ャç?®è??n、c
ã?»ォ_è »d8:completei2e10:downloadedi477e10:incompletei3ee20:禳""ï½»Rï½µç´?î?¼ã?»ã?»@î??d8:completei0e10:downloadedi175e10:incompletei4ee20:ã?»uå??æ?Zミ8wæ?¡>{ヲå??8:completei1e10:downloadedi5929e10:incompletei0ee20:î??é??î?°ï½¯i7MYã?»vÂ?Yd8:completei1e10:downloadedi212e10:incompletei1ee20:î??ï¾?ï½·fè?ºã?»QfCæ¸?+ï¾?ï½®d8:completei1e10:downloadedi159e10:incompletei2ee20:ã?»ï½¨7ゥã?»è??ï¾?ï¾?ï¾?ァ竫

Can anyone help?
--
Posted via http://www.ruby-....

19 Answers

James Gray

4/15/2009 9:09:00 PM

On Apr 15, 2009, at 8:19 AM, Adam Akhtar wrote:

> I have a text file whose entries consist of a key written in
> binary and its values written as strings (you can see an excerpt
> below).
>
> I need to parse the binary and transform it into human-readable hex
> and parse its associated info. My regexps don't seem to be behaving,
> and I'm wondering if it's me or if it's this binary text that is
> causing mischief somehow. Here's a sample item:
>
> 20: 琮祺ア��キ�G�まd8:completei7e10:downloadedi2046e10:incompletei1ee
>
>
> binary parts are always enclosed between "20:" and "d8:complete" where
> the 8 can be any integer(s) e.g. 5 or 23.
>
> str = File.open('textfile.txt' , 'r').readlines.join
> str.gsub!(/(20:)(.*?)(d\d+:)/m) do |x|
> $1 + $2.unpack('H*').join + $3
> end
>
> The above works for some but not all of the text. It seems to go
> beyond the "d8:complete" marker

I suspect this is an encoding issue. If your data is UTF-8, this code
may work for you:

data = File.read('textfile.txt')
data.scan(/(20:)(.*?)(d\d+:)/um) do |start, bin, finish|
  p start + bin.unpack('H*').join + finish
end

I'm guessing though.

If you want to read more about what I believe is causing you problems,
you may find my m17n series of blog posts helpful:

http://blog.grayproductions.net/articles/understa...

James Edward Gray II
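
A sketch of the same scan on Ruby 1.9 or later, where strings carry an encoding: instead of treating the data as UTF-8, it treats everything as raw bytes and uses the /n regexp flag so that "." matches exactly one byte. The sample blob below is made up for illustration; with a real file the string would come from something like File.binread('textfile.txt'). Whether this cures the overshoot depends on the data, so take it as a starting point only.

```ruby
# Hypothetical 20-byte key containing high bytes and a line feed.
blob = [0xE7, 0x90, 0xAE, 0x0A, 0xFF].pack("C*") + ("A" * 15)
data = "20:".b + blob + "d8:completei7e10:downloadedi2046e10:incompletei1ee".b

converted = []
# /n = match byte by byte, /m = let "." match line feeds too.
data.scan(/(20:)(.*?)(d\d+:)/mn) do |start, bin, finish|
  # unpack1("H*") turns the raw bytes into a hex string, high nibble first.
  converted << start + bin.unpack1("H*") + finish
end

puts converted.first
```

One possible reason for an overshoot under a multibyte encoding is that a byte just before the "d" gets read as the start of a multibyte character and swallows the "d", so the lazy match runs on to the next marker. Matching bytes rather than characters avoids that particular trap.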


Adam Akhtar

4/15/2009 10:29:00 PM

Ahh, I didn't know you could use scan like that with blocks and
variables... that's going to come in very handy indeed.

I'll give that a go - many thanks James!

Adam Akhtar

4/15/2009 10:31:00 PM

Oh and your blog post looks good too, just started reading it.


James Gray

4/15/2009 10:34:00 PM

On Apr 15, 2009, at 5:30 PM, Adam Akhtar wrote:

> Oh and your blog post looks good too, just started reading it.

Great. I hope it helps.

James Edward Gray II

Adam Akhtar

4/21/2009 12:53:00 PM

I'm back again and pretty confused as to why my regexp is still
overshooting the mark.

I want my regexp /(20:)(.*?)(d\d+:complete.+?incomplete.+?ee)/ium

to get everything between and including 20: and ee, i.e. from the first
line of the sample at the bottom of this message I'd want this:

20: �0� �aュ�:$ ゥD��d8:completei0e10:downloadedi772e10:incompletei1ee

but sometimes it overshoots and does something like this
20: Â?0ï¾? ã?»aï½­ï¾?:$
ゥD��d8:completei0e10:downloadedi772e10:incompletei1ee20:
琮祺ア��キ�G�まd8:completei9e10:downloadedi2064e10:incompletei2ee

and I can't figure out why. In my Notepad++ editor I have it set to
display line feeds and carriage returns. Sometimes in the binary parts
it displays an LF symbol. In binary, does LF represent a new line, or is
it just used to represent data (bytes etc.)? Could it be that that's
tripping up Ruby's regexp engine?

I load the data text file like so:
data = File.open("text.txt", "rb").readlines

Is there something I'm doing wrong?
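
On the LF question: inside a binary blob, 0x0A is just another byte value, but readlines still splits on it. A minimal sketch of the effect, using a made-up 20-byte key and StringIO in place of the real file (both are assumptions for illustration): once the data is split into an array of lines, a record whose key contains a line feed is spread over two elements, and the pattern cannot match either one.

```ruby
require "stringio"

# Hypothetical 20-byte key whose second byte happens to be LF (0x0A).
blob = [0x01, 0x0A, 0x02].pack("C*") + ("B" * 17)
data = "20:".b + blob + "d8:completei0ee".b

lines = StringIO.new(data).readlines   # splits on the LF inside the key
whole = StringIO.new(data).read        # one string, nothing split

# Shortened version of the pattern from the post, matching bytes (/n).
pattern = /(20:)(.*?)(d\d+:complete.+?ee)/mn

line_hits  = lines.count { |line| line =~ pattern }
whole_hits = whole.scan(pattern).length

puts "lines: #{lines.length}, hits per line: #{line_hits}, hits on whole: #{whole_hits}"
```

With a real file, File.binread (or the "rb" open followed by read rather than readlines) keeps the record in one piece.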


sample from the data text file

20: Â?0ï¾? ã?»aï½­ï¾?:$
ゥD��d8:completei0e10:downloadedi772e10:incompletei1ee20:
琮祺ア��キ�G�まd8:completei9e10:downloadedi2064e10:incompletei2ee20:
}tェスh>・�����d8:completei4e10:downloadedi7724e10:incompletei5ee20: �C
ï¾?ï½³J<ィFï¾?0ï¾?ç??Wd8:completei4e10:downloadedi632e10:incompletei2ee20:
incï¾?ã?»U]~鼡ã?»`å??< d8:completei5e10:downloadedi536e10:incompletei0ee20:
シ�q�!p�ォス-��58Td8:completei1e10:downloadedi520e10:incompletei0ee20:
G*﨨î??ェ
�4T�オソk�d8:completei0e10:downloadedi1061e10:incompletei2ee20:
I�l�ォヲ7Z&��K�建�Ed8:completei5e10:downloadedi268e10:incompletei0ee20:
Smï¾?ソï¾?æ?¯ï½¿ï¾?ィh7rã?»é??d8:completei5e10:downloadedi798e10:incompletei0ee20:
bj�]VF�w仭鱧卍�d8:completei8e10:downloadedi523e10:incompletei11ee20:
hヘ���=ゥ�ソ��「�d8:completei0e10:downloadedi57e10:incompletei3ee20:
mbâ??<GSï½·bオゥqTã?»?d8:completei2e10:downloadedi3864e10:incompletei0ee20:
u �axB�z<�3d8:completei4e10:downloadedi713e10:incompletei7ee20:
u[���泱@2 ウ4ァ�d8:completei2e10:downloadedi659e10:incompletei5ee20:
å??ã?»-|ï¾?ï¾?-ï¾?ï¾?6â?¶ï¾?ï¾?ェã?»8:completei0e10:downloadedi108e10:incompletei2ee20:
��a�$ン#�3!COd8:completei3e10:downloadedi306e10:incompletei0ee20:
槍�G�_廰偶Z、7}d8:completei1e10:downloadedi2293e10:incompletei1ee




Adam Akhtar

4/22/2009 1:27:00 PM

I'm thoroughly confused and have spent a good 10 hours getting nowhere
fast. I'm going to throw my monitor against the wall!

I have a file with text like the stuff in the posts above. I don't create
the file; it's given to me as a standard text file. I don't know how it is
encoded. I'm assuming UTF-8. There is your standard readable English
lower-128 ASCII, and then there are bits of garbled crap that are
supposed to be binary.

I do the following:

$KCODE = "UTF8"

then I do

data_a = File.read('mn-scrape.txt')
data_b = File.open("mn-scrape.txt", "rb").readlines.join("")
data_a.scan(/./m).length   # => 170799
data_b.scan(/./m).length   # => 767702

Why are they different?
When I look in Notepad++, viewing the file under the UTF-8 encoding, it
says the number of characters is 767702, which is roughly 4.5 times
bigger than the .read version.

Why is this happening?

What is the correct way to open this type of file? Any help whatsoever
will be a great great great help!
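
One possible explanation for the mismatch, assuming the script ran on Windows (Notepad++ suggests so): a file opened in text mode there can be cut short at the first Ctrl-Z byte (0x1A) and has CR LF pairs translated, whereas "rb" returns every byte. The sketch below is only a diagnostic for checking that theory. The sample bytes, the temp file, and the variable names are made up for illustration; on the real data the path would be 'mn-scrape.txt'.

```ruby
require "tempfile"

# Hypothetical stand-in for the real file: a key containing 0x1A and CR LF.
sample = "20:".b + [0x41, 0x1A, 0x0D, 0x0A, 0xFF].pack("C*") + "d8:completei1ee".b

size_on_disk = binary_bytes = text_bytes = first_ctrl_z = nil

Tempfile.create("scrape") do |f|
  f.binmode
  f.write(sample)
  f.flush

  size_on_disk = File.size(f.path)              # what the OS reports
  binary_bytes = File.binread(f.path).bytesize  # should equal size_on_disk
  text_bytes   = File.read(f.path).bytesize     # may be smaller on Windows
  first_ctrl_z = File.binread(f.path).index("\x1A".b)
end

puts "on disk: #{size_on_disk}, binread: #{binary_bytes}, read: #{text_bytes}"
puts "first 0x1A at byte offset #{first_ctrl_z.inspect}"
```

If the text-mode count comes out smaller than the size on disk, the file should be read in binary mode before any regexp work is attempted.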


Adam Akhtar

4/23/2009 9:06:00 AM

Anyone? I'm begging ;-)

If I'm not being clear, please say so and I'll answer any questions you have.

t3ch.dude

4/23/2009 1:07:00 PM

On Apr 23, 5:05 am, Adam Akhtar <adamtempor...@gmail.com> wrote:
> anyone, im begging ;-)
>
> if im not being clear please say and ill answer any questions you have
> --
> Posted viahttp://www.ruby-....

Adam,

Forum and e-mail cut & paste is iffy... is there somewhere you could
post all or part of one of these source files? Is it possible that
these inline binary blobs are actually all the same number of bytes?

-t3ch.dude
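
If the blobs really are a fixed size, the regexp can be dropped altogether. The sample appears to be bencoded data, where "20:" would be a length prefix meaning "the next 20 bytes are the string"; that reading is an assumption, not something confirmed in this thread. Under it, a sketch like the following steps over exactly 20 bytes after each "20:" and never has to guess where the key ends. The method name hexify_keys is hypothetical, and the sketch assumes "20:" does not also occur elsewhere in the text.

```ruby
# Replace each 20-byte key that follows "20:" with its hex form.
# Assumes data is a binary string and "20:" only ever introduces a key.
def hexify_keys(data)
  out = "".b
  pos = 0
  while (idx = data.index("20:", pos))
    blob = data.byteslice(idx + 3, 20)
    break if blob.nil? || blob.bytesize < 20   # truncated record, stop
    out << data.byteslice(pos, idx - pos) << "20:" << blob.unpack1("H*")
    pos = idx + 3 + 20                         # jump past the raw key
  end
  out << data.byteslice(pos, data.bytesize - pos)
end

# Made-up key that contains both "d8:" and a line feed.
key  = "d8:".b + "\n".b + ([0xFF] * 16).pack("C*")
data = "d8:completei2ee".b + "20:".b + key + "d8:completei7ee".b

puts hexify_keys(data)
```

Because the key is skipped by length, bytes inside it that happen to look like "d8:" or a line feed cannot confuse the parser, which is the weak spot of the pattern-based approach.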

Adam Akhtar

4/23/2009 10:51:00 PM

Ahh, should have thought about that. Here is a source file:

Attachments:
http://www.ruby-...attachment/3615/mini-...


Martin DeMello

4/23/2009 11:17:00 PM

On Thu, Apr 16, 2009 at 3:59 AM, Adam Akhtar <adamtemporary@gmail.com> wrote:
> Ahh i didnt know you could use scan like that with blocks and
> variables...thats going to come in very handy indeed.

You probably realise this, but for the benefit of newbies, there are
three different things going on there. Firstly, if the regexp passed
to scan has groups, the returned values are arrays with one element
per group (corresponding to $1, $2, ...). Secondly, if you pass a
block to scan, it yields its return values one by one, rather than
just accumulating them into an array. Thirdly, if you yield multiple
values to a block, the block can capture them either as an array, or
in multiple parameters. The beauty of ruby is how well all these
different features fit together to give the elegant scan syntax.

martin
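
The three behaviours described above can be seen side by side in a small sketch. The input string and variable names are made up for illustration.

```ruby
text = "8:complete 10:downloaded 10:incomplete"

# 1. Groups in the regexp: each result is an array, one element per group.
all = text.scan(/(\d+):(\w+)/)

# 2 and 3. With a block, scan yields each result in turn, and the block
# can take the groups as separate parameters...
labels = []
text.scan(/(\d+):(\w+)/) { |len, name| labels << "#{name}(#{len})" }

# ...or as a single array parameter.
arrays = []
text.scan(/(\d+):(\w+)/) { |groups| arrays << groups }

p all
p labels
p arrays
```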