Asp Forum - UTF-8 to ANSI - the A with caret dilemma - microsoft.public.vb.general.discussion

Mayayana

6/14/2016 12:49:00 PM

For Saucer Man or anyone else interested in this....
I didn't post to the former thread because it's old
and may be lost to some....

With HTML 5, UTF-8 is the default text encoding
for webpages and is becoming popular. On the other
hand, some editors don't deal with UTF-8 and browsers
don't always recognize it, depending on HTML tags
used in a page.

I thought it would be handy to have a drag/drop
script to just convert a UTF-8 file to ANSI, since
that's usually a lossless conversion -- and therefore
UTF-8 is largely superfluous -- for Western European
language speakers.

I came up with some interesting tidbits after searching,
which I've put into 2 VBScripts:

http://www.jsware.net/jsware/scrfile...

First, there's the popular API method: convert
UTF-8 to 16-bit unicode with MultiByteToWideChar,
then go from there to ANSI with WideCharToMultiByte.
Since I wanted a VBScript I didn't look at that, but
I did come up with two methods that can be adapted
to VB.

One of the scripts in this download uses a simple
routine to just walk the UTF-8 string and translate,
but it's limited to characters in the windows-1252
codepage. That is, it's fine for English/French/Spanish/
German but not with other languages that don't
use the same basic alphabet.
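
A minimal sketch of that kind of walk, for illustration only. It is
not the code from the download: the function name Utf8WalkTo1252 is
hypothetical, and it assumes the file was read as ANSI on a
windows-1252 system, so each character of the input string carries
exactly one byte of the UTF-8 data. It only translates the two-byte
sequences that land in U+00A0..U+00FF, where Unicode and windows-1252
agree, and writes "?" for everything else.

```vbscript
Option Explicit

' Hypothetical helper, not the script from the download.
' Assumption: s holds the raw UTF-8 bytes, one byte per character
' (file read as ANSI on a windows-1252 system).
Function Utf8WalkTo1252(s)
    Dim i, b1, b2, cp, out
    out = ""
    i = 1
    Do While i <= Len(s)
        b1 = Asc(Mid(s, i, 1))
        If b1 < &H80 Then
            ' Plain ASCII: copy through unchanged.
            out = out & Chr(b1)
            i = i + 1
        ElseIf (b1 = &HC2 Or b1 = &HC3) And i < Len(s) Then
            ' Lead byte of a two-byte sequence: 110xxxxx 10yyyyyy.
            b2 = Asc(Mid(s, i + 1, 1))
            If (b2 And &HC0) = &H80 Then
                cp = (b1 And &H1F) * 64 + (b2 And &H3F)
                ' U+00A0..U+00FF have the same values in windows-1252.
                If cp >= &HA0 Then out = out & Chr(cp) Else out = out & "?"
                i = i + 2
            Else
                ' Lead byte without a continuation byte: malformed.
                out = out & "?"
                i = i + 1
            End If
        Else
            ' Longer sequences and stray bytes: one "?" per sequence.
            out = out & "?"
            i = i + 1
            Do While i <= Len(s)
                If (Asc(Mid(s, i, 1)) And &HC0) <> &H80 Then Exit Do
                i = i + 1
            Loop
        End If
    Loop
    Utf8WalkTo1252 = out
End Function
```

Under those assumptions, Utf8WalkTo1252("caf" & Chr(&HC3) & Chr(&HA9))
should give "café". The windows-1252 characters that sit at three-byte
UTF-8 positions, such as the euro sign and curly quotes, would need a
lookup table on top of this.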

The other script uses a simple routine to convert
UTF-8 to 16-bit unicode, then uses the Textstream
object to convert that to ANSI. In VB it could be
done far more efficiently by pointing a SafeArray
structure at the UTF-8 string in a tokenizing routine
and then applying StrConv to the resulting string.
(VBS doesn't have StrConv. Thus the Textstream
file writing method.)
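
The first half of that, UTF-8 to a 16-bit unicode string, might be
sketched as below. Again this is not the code from the download: the
name Utf8ToWide is hypothetical, and the input is assumed to hold the
raw UTF-8 bytes one per character (file read as ANSI on a
windows-1252 system). It handles one-, two- and three-byte sequences
and writes "?" for anything else, including each byte of a four-byte
sequence. It does not reject overlong or surrogate encodings.

```vbscript
Option Explicit

' Hypothetical helper, not the script from the download.
' Assumption: s holds the raw UTF-8 bytes, one byte per character.
Function Utf8ToWide(s)
    Dim i, n, b, cp, need, k, ok, out
    out = ""
    n = Len(s)
    i = 1
    Do While i <= n
        b = Asc(Mid(s, i, 1))
        If b < &H80 Then
            cp = b: need = 0                 ' 0xxxxxxx
        ElseIf b >= &HC2 And b <= &HDF Then
            cp = b And &H1F: need = 1        ' 110xxxxx + 1 more byte
        ElseIf b >= &HE0 And b <= &HEF Then
            cp = b And &H0F: need = 2        ' 1110xxxx + 2 more bytes
        Else
            cp = -1: need = 0                ' stray or 4-byte lead
        End If
        i = i + 1
        ' Fold in the continuation bytes (10xxxxxx), six bits each.
        For k = 1 To need
            ok = False
            If i <= n Then
                b = Asc(Mid(s, i, 1))
                If (b And &HC0) = &H80 Then ok = True
            End If
            If Not ok Then
                cp = -1                      ' truncated or malformed
                Exit For
            End If
            cp = cp * 64 + (b And &H3F)
            i = i + 1
        Next
        If cp >= 0 Then
            out = out & ChrW(cp)
        Else
            out = out & "?"
        End If
    Loop
    Utf8ToWide = out
End Function
```

As a check on the bit arithmetic: the bytes C5 9B decode to
(&H05 * 64) + &H1B = &H15B, which is the "s" with an acute accent.
The resulting string is what would then be handed to the Textstream
object for the unicode-to-ANSI step.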

The nice thing about the second routine is that it
does an "intelligent" translation. I had a test file
with the Sanskrit words "samsara" and "nirvana",
written with numerous diacritical marks that won't
translate to windows-1252, such as an "s" with an
accent mark. When Windows coerces that string to
ANSI it translates them as shown above: The English
characters with the diacritical marks dropped. That
won't help, of course, with something like Russian or
Arabic that uses different characters, but it does
extend the usable range of ANSI Windows-1252 text,
insofar as the English adaptation of "nirvana" is
just that: nirvana. The diacritical marks are of
relevance only to academics who actually work in
Sanskrit and need to render it in an English
transliteration.

5 Answers

Blue Planet

6/14/2016 8:59:00 PM

"Mayayana" <mayayana@invalid.nospam> wrote in message
news:njouh1$god$1@dont-email.me...

You can always use the ADO Stream object for such things.

<job>
<reference object="ADODB.Stream" />
<object id="stmIn" progId="ADODB.Stream" />
<script language="VBScript">
Option Explicit

With stmIn
.Open
.Type = adTypeText
.Charset = "utf-8"
.LoadFromFile "sample.txt"
MsgBox .ReadText() 'Text read is Unicode.
.Close
End With
</script>
</job>

Pretty trivial to make it transcode to windows-1252 and write that as a new
file too.
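
A minimal, untested sketch of what that transcoding step might look
like, assuming a second ADODB.Stream is used for output. The Sub name
Utf8FileTo1252 and the file paths are hypothetical. The constants are
declared by hand so the code also runs as a plain .vbs file, without
the <reference> element. What happens to characters that have no
windows-1252 equivalent is not something this sketch controls.

```vbscript
Option Explicit

Const adTypeText = 2             ' StreamTypeEnum
Const adSaveCreateOverWrite = 2  ' SaveOptionsEnum

' Hypothetical helper: read srcPath as UTF-8, write dstPath as windows-1252.
Sub Utf8FileTo1252(srcPath, dstPath)
    Dim stmIn, stmOut, text

    ' Decode the UTF-8 file into a unicode string.
    Set stmIn = CreateObject("ADODB.Stream")
    With stmIn
        .Open
        .Type = adTypeText
        .Charset = "utf-8"
        .LoadFromFile srcPath
        text = .ReadText()
        .Close
    End With

    ' Encode the string as windows-1252 and save it.
    Set stmOut = CreateObject("ADODB.Stream")
    With stmOut
        .Open
        .Type = adTypeText
        .Charset = "windows-1252"
        .WriteText text
        .SaveToFile dstPath, adSaveCreateOverWrite
        .Close
    End With
End Sub

' Example call with hypothetical file names:
' Utf8FileTo1252 "sample.txt", "sample-ansi.txt"
```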

Mayayana

6/15/2016 12:16:00 AM

| You can always use the ADO Stream object for such things.
|

I started out with ADODB.Stream, but it doesn't
seem to work the way that most people think it does.
I found it was only converting UTF-8 to ANSI when
I first ran it through IE and only processed the displayed
text. I couldn't get it to go from UTF-8 to ANSI in
any configuration, despite many people showing
sample text that they claimed works.

In any case, the methods I came up with require
nothing extra, and work well. The Textstream idea
was actually from Microsoft. WiToANSI.vbs is in the
SDK and works on 16-bit unicode. Not surprisingly, it
only works exactly the way that MS wrote it. Textstream
is a very funky object that tries to make dealing with
text files transparent, but in doing so it does weird things.
For instance, ReadAll and Write fail when they encounter
Chr(0), but Read(number) and WriteLine don't.

I'm guessing the problem may be that 16-bit unicode
<-> ANSI is workable in a number of ways, but not so
with UTF-8.

Saucer Man

6/16/2016 8:58:00 PM

On 6/14/2016 8:49 AM, Mayayana wrote:
> For Saucer Man or anyone else interested in this....
> I didn't post to the former thread because it's old
> and may be lost to some....
>
> With HTML 5, UTF-8 is the default text encoding
> for webpages and is becoming popular. On the other
> hand, some editors don't deal with UTF-8 and browsers
> don't always recognize it, depending on HTML tags
> used in a page.
>
> I thought it would be handy to have a drag/drop
> script to just convert a UTF-8 file to ANSI, since
> that's usually a lossless conversion -- and therefore
> UTF-8 is largely superfluous -- for Western European
> language speakers.
>
> I came up with some interesting tidbits after searching,
> which I've put into 2 VBScripts:
>
> http://www.jsware.net/jsware/scrfile...
>

Thanks for this Mayayana. I see that the scripts individually address
each character code above 127. I have been playing with XML and SAX to
see what I can do with it. I know... I said I didn't care for XML and
it is easier for me to parse reading it as a text file one line at a
time but I want to see what I can come up with.

ObiWan

6/17/2016 8:19:00 AM

:: On Tue, 14 Jun 2016 08:49:16 -0400
:: (microsoft.public.vb.general.discussion)
:: <njouh1$god$1@dont-email.me>
:: "Mayayana" <mayayana@invalid.nospam> wrote:

[...]
> The nice thing about the second routine is that it
> does an "intelligent" translation. I had a test file
> with the Sanskrit words "samsara" and "nirvana",
> written with numerous diacritical marks that won't
[...]

nice and thanks for the code; now the only thing one will have to deal
with are the HTML entities, I'm referring to this stuff

https://en.wikipedia.org/wiki/Character_encodin...

I think that a proper "decoder" should first handle the HTML entities,
since they may encode "special chars" and then do a second sweep to
decode the UTF-8 chars; as a note if my brain still serves me, I think
there should be some APIs exposed by the "IE" libraries which should
make it possible to do all the decoding easily enough

Mayayana

6/17/2016 1:01:00 PM

> https://en.wikipedia.org/wiki/Character_encodin...
>
> I think that a proper "decoder" should first handle the HTML entities,
> since they may encode "special chars" and then do a second sweep to
> decode the UTF-8 chars; as a note if my brain still serves me, I think
> there should be some APIs exposed by the "IE" libraries which should
> make it possible to do all the decoding easily enough
>

The initial discussion was about that, too. I posted
a method using IE to load a webpage, then copy
the displayed text. I also wrote a version to wrap
XML files in an HTML shell. Both methods were
aimed at getting plain, readable ANSI text from HTML
or XML. But Saucer Man decided he wanted to keep
the XML, and I only wanted this code for converting
whole files -- usually HTML files that I'll be viewing
in a browser but want to convert to ANSI. So there
was no need for translating HTML encoding.

Though I suppose the two methods could be combined,
in either script, with an IE.application object, or in
VB, with an InternetExplorer object or a WebBrowser
control: Convert the page, load it into IE, wait for
documentComplete, then:

IE.ExecWB 17, 2 ' OLECMDID_SELECTALL, don't prompt user
IE.ExecWB 12, 2 ' OLECMDID_COPY, don't prompt user
s2 = IE.document.parentWindow.clipboardData.getData("Text")

I suppose that's a bit "hacky", but translating
all encoding does seem to require loading the page.
I'm not aware of anything like a shdocvw function
to do the job. It's hard to imagine any practical
use for it.
