Schmidt
5/28/2012 7:08:00 PM
On 28.05.2012 19:44, Saucer Man wrote:
> <?xml version="1.0"?>
> <wclass>
> <item id="69" stock="on hand" ordered="756">
> <productid>VB6</productid>
> <quantity>2</quantity>
> <color="blue" size="large" />
> </item>
>
> <item id="70" stock="on hand">
> <productid>C++</productid>
> <quantity>3</quantity>
> <color="red" size="small"/>
> </item>
> </wclass>
>
> 3) Since the xml is not standard, the lines similar to..
> <color="red" size="small"/>
> are still giving me grief.
Is the above snippet really (without a doubt) an exact example
of the content of your <item> nodes (or "records")?
I mean especially the lines you also identified as
"giving you grief":
<color="red" size="small"/>
Please take another look at your original data
(maybe copy and paste an *exact* duplicate of one
single <item> node from a text editor), so that we
can look at this "<color.../>" thingy ourselves.
Maybe you didn't reproduce it exactly here because of
some nesting, or maybe the color="..." snippet
is part of a CDATA section, or some such thing.
I suspect that you didn't give us the "full story",
because you report that the original 90MB file
"takes a long time to load" into the MS DOMDocument.
If that is the case, then it apparently *is* loading
your file, which it would not do if the "<color" node
in question really looked as it does in your smaller
example snippet. In that case the MS parser would stop
parsing (loading) the file and hand out a parser error
after only a very short time.
So please check that again, or upload the file
to some web space so that we can take a look ourselves.
But if it is indeed as you wrote, then this "subnode"
of your <item> parent node (the one starting with
<color...) is simply invalid XML, and you would
need a *very* tolerant XML parser to not choke on
that entry.
E.g. the MS XML parser hands out the error message:
"A name contains an invalid character...".
That's because an opening tag needs to provide a node name
(or tag name) immediately after the '<', and this name
must not contain special characters such as
spaces or the equals sign.
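
To see that parser error for yourself, here is a minimal sketch. It assumes MSXML 6.0 is installed and uses late binding, so no project reference is needed. The test string is a cut-down version of the quoted snippet, not your real data.

```vb
'***Into a Form
Private Sub Form_Click()
    Dim Doc As Object, Xml As String

    'a cut-down <item> with the nameless node from the quoted snippet
    Xml = "<wclass><item id=""70"" stock=""on hand"">" & _
          "<productid>C++</productid>" & _
          "<color=""red"" size=""small""/>" & _
          "</item></wclass>"

    Set Doc = CreateObject("MSXML2.DOMDocument.6.0")
    Doc.async = False

    'loadXML returns False when the text is not well-formed
    If Doc.loadXML(Xml) Then
        Debug.Print "parsed OK"
    Else
        Debug.Print "Error: " & Doc.parseError.reason
        Debug.Print "Line " & Doc.parseError.Line & _
                    ", position " & Doc.parseError.linepos
    End If
End Sub
```

With the nameless node in place, loadXML should take the Else branch. The same check after Doc.load on your 90MB file would tell you right away whether the file really contains such lines.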
I could imagine that the "producer" of this color line
meant the color part (since it contains an equals sign)
as an attribute, but in that case a descriptive tag name
for this node is missing. For example, this would be valid:
<itemextras color="red" size="small" />
In the above node description a valid node name was given
(itemextras), separated by a space character from the
attribute descriptions that follow within that 'itemextras' node
(so that the attributes color="red" and size="small"
are identifiable at all).
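
If it turns out that the nameless nodes really are in the file, the same correction could be sketched as a plain text replacement before parsing. This is only an illustration under assumptions: RepairColorNodes is a hypothetical helper, 'itemextras' is just the placeholder name from the example above, MSXML 6.0 is assumed to be installed, and the replacement would also hit any "<color=" text inside CDATA sections or text content.

```vb
'Hypothetical helper: gives every nameless <color=...> node
'the placeholder tag name "itemextras"
Private Function RepairColorNodes(ByVal Xml As String) As String
    RepairColorNodes = Replace(Xml, "<color=", "<itemextras color=")
End Function

'***Into a Form
Private Sub Form_Click()
    Dim Doc As Object, Xml As String

    Xml = "<item id=""70""><color=""red"" size=""small""/></item>"

    Set Doc = CreateObject("MSXML2.DOMDocument.6.0")
    Doc.async = False

    If Doc.loadXML(RepairColorNodes(Xml)) Then
        'the repaired node is the first child of <item>
        Debug.Print Doc.documentElement.firstChild.getAttribute("color")
    Else
        Debug.Print Doc.parseError.reason
    End If
End Sub
```

For a 90MB file the whole text has to fit into one String first, so this is a stopgap at best. The proper fix is still at the producer's end.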
So, if that color line really is "as it is", then you
would either need to inform the "producer" of that XML
that this line of his "hand-concatenated XML" needs to
be corrected (as in my 'itemextras' example),
since only then could it be fed into e.g. the MS XML parser,
or you're entirely on your own and have to write a fast "special"
XML parser which tolerates this invalid node line.
But writing such a parser (with decent speed) is not
a trivial task. You might check out my (free) libraries,
which contain such a "more tolerant" (simpler) XML parser
(independent of MS-XML) that would tolerate your
faulty color node and simply parse it as the tag name
in its entirety, so the tag name of the faulty node
would become e.g.:
'color="red"'
or
'color="blue"'
and this node element then contains only a single attribute
(size=) instead of two attributes (color= and size=).
In case you want to give that a try, here's the
download link:
www.datenhaus.de/Downloads/vbRC4BaseDlls.zip
(copy the three DLLs into a folder on your dev machine,
and then register only the file vbRichClient4.dll).
The XML parser is provided by the class cSimpleDOM.
Here's a small example:
'***Into a Form, after checking in: vbRichClient4 into
'   your Project-References
Private Sub Form_Click()
    Dim T!, DOM As cSimpleDOM, Elmt As cElement

    AutoRedraw = True
    Cls

    T = Timer
    Set DOM = New cSimpleDOM
    On Error Resume Next
        DOM.OpenFromFile "C:\Users\os\Desktop\Benchmark\Data\chrbig.xml"
        If Err Then MsgBox Err.Description: Exit Sub 'no tree to enumerate
    On Error GoTo 0

    Print "DOM parsing done after: "; Format(Timer - T, "0.00sec"); _
          " ... total ElementCount in the Tree: "; DOM.ElementsTotal

    'enumeration of all the ChildElements below the Root-Node
    'in your case, this should be all the <item>Elements below: <wclass>
    For Each Elmt In DOM.Root.ChildElements
        Debug.Print Elmt.tagName
    Next Elmt
End Sub
The libraries also give you (as GS already mentioned) a
superfast cSortedDictionary class, which you could use
to store all your items (with their XML item subcontent, or
an SHA1 or MD5 value of that subcontent) under the
item ID as the sorted dictionary key.
With one cSortedDictionary per XML file, you would then
have all items of both files sorted by their ID.
You could then do your item comparison simply step by step,
enumerating these two sorted dictionaries in parallel and
comparing their item ID keys. Wherever there are "gaps" in this
"synchronous ID enumeration loop", you have already identified
items which do not exist in both dictionaries.
For all others, which do exist in both dictionaries,
you would need to do a simple string comparison of
the dictionaries' *item value* (either the complete XML
subcontent, or the MD5 or SHA1 hash of this
content) to identify items which share the same
item ID but differ in their content.
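
The synchronous enumeration described above can be sketched with plain arrays. This deliberately does not use cSortedDictionary, so no assumptions about its methods are needed. CompareSorted is a hypothetical helper, the ID arrays must already be sorted ascending, and the Vals arrays hold the item content (or its hash) for the ID at the same index.

```vb
'Hypothetical helper: walks two ID-sorted lists in step
Private Sub CompareSorted(IdsA() As Long, ValsA() As String, _
                          IdsB() As Long, ValsB() As String)
    Dim i As Long, j As Long
    i = LBound(IdsA): j = LBound(IdsB)

    Do While i <= UBound(IdsA) And j <= UBound(IdsB)
        If IdsA(i) < IdsB(j) Then        'gap: ID is missing in B
            Debug.Print IdsA(i); "only in file A"
            i = i + 1
        ElseIf IdsA(i) > IdsB(j) Then    'gap: ID is missing in A
            Debug.Print IdsB(j); "only in file B"
            j = j + 1
        Else                             'same ID: compare the content
            If ValsA(i) <> ValsB(j) Then Debug.Print IdsA(i); "differs"
            i = i + 1
            j = j + 1
        End If
    Loop

    'whatever is left over exists in one file only
    For i = i To UBound(IdsA)
        Debug.Print IdsA(i); "only in file A"
    Next i
    For j = j To UBound(IdsB)
        Debug.Print IdsB(j); "only in file B"
    Next j
End Sub

'***Into a Form (the sample values below are made up)
Private Sub Form_Click()
    Dim IdsA(0 To 2) As Long, ValsA(0 To 2) As String
    Dim IdsB(0 To 1) As Long, ValsB(0 To 1) As String

    IdsA(0) = 69: ValsA(0) = "VB6|2"
    IdsA(1) = 70: ValsA(1) = "C++|3"
    IdsA(2) = 71: ValsA(2) = "C#|1"
    IdsB(0) = 69: ValsB(0) = "VB6|2"
    IdsB(1) = 70: ValsB(1) = "C++|5"

    CompareSorted IdsA, ValsA, IdsB, ValsB
End Sub
```

With this sample data the loop reports item 70 as differing and item 71 as present only in file A. Each list is visited exactly once, which is why the sorting by ID matters.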
Also possible (and supported in these libs) is
a "translation whilst SAX-parsing" into a small
in-memory DB (these libs come with SQLite), so that
after such an "in-memory DB import" you can also
compare the two XML files with normal SQL
(using the "Distinct" SQL keyword, for example).
Olaf