HPCC-30365 Add XREF Sasha service to K8s #18798

Draft: jackdelv wants to merge 23 commits into base branch candidate-9.6.x.
Changes shown below are from 1 commit.

Commits (23):
62d12fa  HPCC-30365 Add XREF Sasha service to K8s (jackdelv, Apr 25, 2024)
1f2b539  Add constructPartFilename to CLogicalNameEntry (jackdelv, Jul 8, 2024)
4cceb25  Check if running on localhost before setting Dafs port to 0. (jackdelv, Jul 8, 2024)
525f982  Change error message to report missing storage planes rather than clu… (jackdelv, Jul 8, 2024)
757a618  Move group combination to BM only. (jackdelv, Jul 9, 2024)
4b0b9c7  Common up XRefNodes code with containerized case (jackdelv, Jul 9, 2024)
18d94ab  Add call to calcStripeNumber and use makePhysucalPartName (jackdelv, Jul 10, 2024)
f340ccc  Use lfnHash from file metadata instead of logical filename (jackdelv, Jul 11, 2024)
0f9ba2e  Remove ifdef _CONTAINERIZED from constructPartFilename (jackdelv, Aug 6, 2024)
b0561b2  Fix dir-per-part and stripNum masks in parseFileName. (jackdelv, Aug 6, 2024)
6ec8287  Remove Containerized DBGLOG message from generalized BM code in onDFU… (jackdelv, Aug 8, 2024)
43d9d89  Co-locate fullname closer to makePhysicalPartName (jackdelv, Aug 8, 2024)
15eaf5d  Properly check for dirPerPart in file flags. (jackdelv, Aug 8, 2024)
e0b61ca  Add comment specifying meaning of localhost in xrefRemoteDirectories (jackdelv, Aug 8, 2024)
7bc4e0d  Remove changes to runXref and add directory getting code to CXRefMana… (jackdelv, Aug 9, 2024)
3af5870  Merge branch 'candidate-9.6.x' into xrefContainerized (jackdelv, Aug 20, 2024)
72fe732  Remove use of addUniqueXrefNode (jackdelv, Nov 6, 2024)
33bb654  Add example options to xref (jackdelv, Nov 6, 2024)
44fd255  Move constructPartFilename funtion to dadfs.cpp (jackdelv, Nov 7, 2024)
2a463a8  Change to makeStringException (jackdelv, Nov 7, 2024)
ded8478  Treat paths as case sensitive (jackdelv, Nov 7, 2024)
5aa25a9  Match file prefix to storage plane definition (jackdelv, Nov 8, 2024)
7fe148a  Add isContainerized() checks (jackdelv, Nov 8, 2024)
4 changes: 4 additions & 0 deletions dali/base/dadfs.cpp
@@ -246,6 +246,10 @@ RemoteFilename &constructPartFilename(IGroup *grp,unsigned partno,unsigned partm
StringBuffer fullname;
if (findPathSepChar(name)==NULL)
addPathSepChar(fullname.append(partdir));
#ifdef _CONTAINERIZED
Member:
in general, using isContainerized() vs the #define is preferable (so that the code is scanned/compiled by developers routinely etc.)
And, if both can be avoided and the code made general (even if the use case is not currently used in bare-metal), that is most preferable.

This is a case in point: we should cope with this even in the bare-metal case in general utility classes/methods, even though it is not configurable in BM at the moment.

But also, it looks like this exposes a wider problem(!). This function is used not just by XREF; it's also used by removePhysicalFiles in dadfs.cpp.
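A minimal sketch of the trade-off being described (the function names here are placeholders, not code from this PR):

#ifdef _CONTAINERIZED
    doContainerSpecificWork();    // placeholder: with the #define, BM builds never even compile this branch
#endif

    if (isContainerized())        // runtime check: both branches are compiled and scanned in every build
        doContainerSpecificWork();
    else
        doBareMetalWork();        // placeholder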

Member:
Ah, just noticed that removePhysicalFiles is not used by anything - we should delete it to avoid confusion.

Member:
I suspect constructPartFilename should be deprecated (and probably removed too).
In XREF's case, it needs a way to get a physical part + endpoint given the metadata it has collected.

The 'lnentry' (CLogicalNameEntry) it has built from metadata should perhaps have a method to build it; it should know (from the metadata) whether the file is striped and whether it has dirPerPart.

Btw, dirPerPart (a flag in the metadata) means the file has a directory extension that includes the file part number (i.e. each part is in its own directory).
'Striping' is different: an additional directory, not per part, but enumerated over N striped disks. The stripe number is calculated from a hash (@lfnHash) stored in the file metadata, originally calculated and set there by hashing the logical filename.

See makePhysicalPartName and its callers (and the uses of lfnHash and calcStripeNumber).
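A rough sketch of how such a metadata-driven method might compose the physical path (a standalone illustration; the helper names and the exact stripe formula are assumptions, not the HPCC API):

#include <cstdio>
#include <string>

// Assumed: the stripe number is derived from the stored @lfnHash, 1-based across numDevices.
static unsigned calcStripe(unsigned lfnHash, unsigned numDevices)
{
    return numDevices > 1 ? (lfnHash % numDevices) + 1 : 0;
}

static std::string buildPartPath(const std::string &planePrefix,  // e.g. /var/lib/HPCCSystems/hpcc-data
                                 const std::string &scopeDirs,    // e.g. somescope/otherscope
                                 const std::string &tailName,     // e.g. thelfnname
                                 unsigned partNo, unsigned numParts,
                                 unsigned lfnHash, unsigned numDevices,
                                 bool dirPerPart)
{
    std::string path(planePrefix);
    if (unsigned stripe = calcStripe(lfnHash, numDevices))
        path += "/d" + std::to_string(stripe);     // stripe sub-directory, shared by all parts on that device
    path += "/" + scopeDirs;
    if (dirPerPart)
        path += "/" + std::to_string(partNo);      // dir-per-part sub-directory named after the part number
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), "._%u_of_%u", partNo, numParts);
    return path + "/" + tailName + suffix;         // e.g. thelfnname._35_of_400
}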

Contributor Author:
Removed removePhysicalFiles and added a constructPartFilename function to CLogicalNameEntry. I had some trouble finding the dirPerPart flag in the metadata. The only examples I was able to find were using queryDirPerPart() on a StoragePlane, but when I tried that it seemed the metadata was missing and it was defaulting to the return value of isContainerized().

Added lfnHash to the CLogicalNameEntry and used calcStripeNumber on the part number.

Contributor Author:
The old constructPartFilename function had some tests for it, so I couldn't delete the function without also removing the tests. Should there now be tests for the new constructPartFilename if it does get added to CLogicalNameEntry?

Member:
Yes, we should remove the old constructPartFilename, which is now only used by tests, and, if not covered already, add tests that exercise makePhysicalPartName.

But for now, for this PR, I would just add a comment that it is deprecated but still used by test code, and open a separate JIRA to revisit sometime.

Member:
Given the new CLogicalNameEntry::constructPartFilename, I think you can remove the changes here, right? i.e.:

#ifdef _CONTAINERIZED
    if (partmax>1)
        addPathSepChar(fullname.append(partno+1)); // If there are multiple files they will be striped by part number
#endif

Contributor Author:
Removed the _CONTAINERIZED changes and added a comment to the old constructPartFilename.

if (partmax>1)
addPathSepChar(fullname.append(partno+1)); // If there are multiple files they will be striped by part number
#endif
fullname.append(name);
unsigned n;
unsigned d;
58 changes: 48 additions & 10 deletions dali/dfuXRefLib/dfuxreflib.cpp
@@ -32,6 +32,7 @@
#include "rmtfile.hpp"
#include "dautils.hpp"
#include "jptree.hpp"
#include "jcontainerized.hpp"

#include "XRefNodeManager.hpp"

@@ -942,6 +943,17 @@ static bool parseFileName(const char *name,StringBuffer &mname,unsigned &num,uns
s++;
}
if ((mn!=0)&&((*s==0)||(*s=='.'))&&(mn>=pn)) { // NB allow trailing extension
#ifdef _CONTAINERIZED
if (mn>1)
{
StringBuffer stripeDir;
stripeDir.append(getPathSepChar(mname)).append(pn).append(getPathSepChar(mname));
StringBuffer stripeMask;
stripeMask.append(getPathSepChar(mname)).append("$P$").append(getPathSepChar(mname));
// Replace part number in stripe directory with mask
mname.replaceString(stripeDir, stripeMask);
Member:
I am not quite sure what this is for? (let's discuss)

Contributor Author:
When the XREF code is trying to match all the parts of an orphan file and each part is in its own directory, it incorrectly creates separate entries for each part when they should be combined. Adding a part mask to the directory for each part allows them to be matched properly. Sorry for any confusion I caused by using the term 'striped' incorrectly.
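A concrete illustration (paths invented for the example) of what the masking in the diff above achieves:

// Before masking, each orphan part resolves to a distinct name:
//   /data/scope/1/name._1_of_2
//   /data/scope/2/name._2_of_2
// After replacing the per-part directory and the part number with $P$, both reduce to:
//   /data/scope/$P$/name._$P$_of_2
// so the two parts are matched as one orphan file instead of two separate entries.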

Member:
A typical path on striped storage that also has dirPerPart=true will look like:
/var/lib/HPCCSystems/hpcc-data/d8/somescope/otherscope/35/thelfnname._35_of_400

where 'd8' is the stripe sub-directory and '35' is the dir-per-part sub-directory.
When parsing a filename like this, the dirPerPart format can be deduced if the directory before the filename is numeric only (which is invalid for a scope/filename, which must begin with an alpha character) and matches the part # of the filename.
The fact that it's striped is less obvious, but could be deduced from the fact that the leading part of the path '/var/lib/HPCCSystems/hpcc-data' is a prefix of one of the planes, and the plane spec says numDevices>1.

From the code here, it looks like it's dealing with the dir-per-part #, but the code refers to 'stripe'.
I suspect this needs looking at closely, and testing with files that are on striped storage, with dirPerPart, and ones that aren't - it would be good to add some CPPUNIT tests of various paths here to make sure it gives the expected results.

I haven't looked in detail at how the info is being used by the callers of parseFileName, but I suspect it may need to pass back the deduced plane, and perhaps the stripe and/or flags like dirPerPart.. not sure.
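A sketch of that deduction rule as standalone logic (not the actual parseFileName code; plane lookup and part-number extraction are assumed to have happened already):

#include <cctype>
#include <string>

// True if 'dir' (the directory component immediately before the filename) is numeric only
// and matches the part number parsed from the filename, i.e. it looks like dir-per-part.
static bool looksLikeDirPerPart(const std::string &dir, unsigned partNo)
{
    if (dir.empty())
        return false;
    for (char c : dir)
        if (!std::isdigit(static_cast<unsigned char>(c)))
            return false;              // scopes must start with an alpha, so numeric-only is the giveaway
    return std::stoul(dir) == partNo;
}

// Striping is less obvious: it can only be deduced by matching the leading path against a
// configured plane prefix and checking that the plane spec has numDevices > 1 (not shown here).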

Contributor Author:
I have restructured how it gets the dir-per-part number and added checks for the stripe number. I was able to test by changing the string before passing it into the function, but I am not sure how to add to the CPPUNIT tests.
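For reference, such a test could take roughly this shape (a sketch using generic CppUnit conventions; maskedNameFor is a hypothetical wrapper around the parsing logic, not an existing HPCC function):

#include <cppunit/extensions/HelperMacros.h>
#include <string>

std::string maskedNameFor(const char *physicalPath);   // hypothetical wrapper around parseFileName

class ParsePartPathTest : public CppUnit::TestFixture
{
    CPPUNIT_TEST_SUITE(ParsePartPathTest);
    CPPUNIT_TEST(testDirPerPart);
    CPPUNIT_TEST_SUITE_END();
public:
    void testDirPerPart()
    {
        // Both dir-per-part parts should collapse to the same masked name.
        CPPUNIT_ASSERT_EQUAL(std::string("/data/scope/$P$/name._$P$_of_4"),
                             maskedNameFor("/data/scope/2/name._2_of_4"));
        CPPUNIT_ASSERT_EQUAL(std::string("/data/scope/$P$/name._$P$_of_4"),
                             maskedNameFor("/data/scope/3/name._3_of_4"));
    }
};
CPPUNIT_TEST_SUITE_REGISTRATION(ParsePartPathTest);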

}
#endif
mname.append("._$P$_of_").append(mn);
if (*s)
mname.append(s);
@@ -1867,7 +1879,11 @@ class CXRefManager: public CXRefManagerBase
Owned<IPropertyTree> results;
{
CriticalUnblock unblock(manager.logsect);
#ifdef _CONTAINERIZED
unsigned short port = 0; // No need to connect to Dafs, data is available through mounted storage planes
#else
unsigned short port = getDafsPort(node.endpoint(),numfails,&crit);
#endif
results.setown(getDirectory(dirlist,&node,port));
}
manager.log("Crossreferencing %s",msg.str());
@@ -2600,7 +2616,11 @@ class CXRefManager: public CXRefManagerBase
Owned<IGroup> g;
unsigned j;
if (!nclusters) {
#ifdef _CONTAINERIZED
Member:
preferable to use isContainerized()

In fact, I think it would be worth just changing the message to the planes version.
In BM, internally, we generate planes for all clusters, so that methods like getStoragePlane/getPlanesIterator work with clusters.

Contributor Author:
Changed to planes message.

error("XREF","No storage planes specified\n");
#else
error("XREF","No clusters specified\n");
#endif
return NULL;
}
if (!numdirs) {
@@ -2618,18 +2638,29 @@
else
g.setown(g->combine(gsub.get()));
}
totalSizeOrphans =0;
totalSizeOrphans = 0;
totalNumOrphans = 0;


logicalnamelist.kill();
dirlist.kill();
orphanlist.kill();


#ifdef _CONTAINERIZED
const char *storageDir[1];
for (int i = 0; i < nclusters; i++)
{
const char *storagePlane = clusters[i]; // clusters holds a list of storage plane names
storageDir[0] = dirbaselist[i]; // dirbaselist holds the storage plane directories

loadFromDFS(*this,g,1,storageDir,storagePlane);
xrefRemoteDirectories(g,numdirs,storageDir,numthreads);
}
#else
const char* cluster = clusters[0];
loadFromDFS(*this,g,numdirs,dirbaselist,cluster);

xrefRemoteDirectories(g,numdirs,dirbaselist,numthreads);
#endif
StringBuffer filename;
filename.clear().append("xrefrpt");
addFileTimestamp(filename, true);
@@ -2669,24 +2700,31 @@ IPropertyTree * runXRef(unsigned nclusters,const char **clusters,IXRefProgressC
if (nclusters==0)
return NULL;
CXRefManager xrefmanager;
const char *dirs[2];
unsigned numdirs;
#ifdef _WIN32
bool islinux = false;
#else
bool islinux = true;
#endif
// assume all nodes same OS
Owned<IGroup> group = queryNamedGroupStore().lookup(clusters[0]);
#ifdef _CONTAINERIZED
WARNLOG("CONTAINERIZED(runXRef calls queryOS())");
DBGLOG("CONTAINERIZED(runXRef)");
const char *dirs[nclusters]; // nclusters is the number of storage planes
unsigned numdirs = nclusters;

for (int i = 0; i < numdirs; i++)
{
Owned<IPropertyTree> storagePlane = getStoragePlane(clusters[i]);
dirs[i] = storagePlane->queryProp("@prefix");
Member:
I am not sure it makes sense to be enumerating the directories into 'dirs' and passing them to xrefmanager.process.

In BM, the dirs are not different cluster dirs, but the primary/replicate dirs, which are consistent across all clusters.

I think we should keep it simple in the cloud version and assert that we are only dealing with 1 plane at a time (at least initially); dirs should contain a single directory, the prefix of the plane being processed.

Contributor Author:
Changed to only pass in a single directory through dirs.

}
#else
const char *dirs[2];
unsigned numdirs = 2;
// assume all nodes same OS
Owned<IGroup> group = queryNamedGroupStore().lookup(clusters[0]);
if (group)
islinux = queryOS(group->queryNode(0).endpoint())==MachineOsLinux;
#endif
dirs[0] = queryBaseDirectory(grp_unknown, 0,islinux?DFD_OSunix:DFD_OSwindows); // MORE - should use the info from the group store
dirs[1] = queryBaseDirectory(grp_unknown, 1,islinux?DFD_OSunix:DFD_OSwindows);
numdirs = 2;
#endif
IPropertyTree *ret=NULL;
try {
ret = xrefmanager.process(nclusters,clusters,numdirs,dirs,PMtreeoutput,callback,numthreads);
4 changes: 2 additions & 2 deletions dali/sasha/saserver.cpp
@@ -435,8 +435,8 @@ int main(int argc, const char* argv[])
servers.append(*createSashaFileExpiryServer());
else if (strieq(service, "thor-qmon"))
servers.append(*createSashaQMonitorServer());
//else if (strieq(service, "xref")) // TODO
// servers.append(*createSashaXrefServer());
else if (strieq(service, "xref"))
servers.append(*createSashaXrefServer());
else
throw makeStringExceptionV(0, "Unrecognised 'service': %s", service);
#else
20 changes: 17 additions & 3 deletions esp/services/ws_dfu/ws_dfuXRefService.cpp
@@ -599,11 +599,25 @@ bool CWsDfuXRefEx::onDFUXRefList(IEspContext &context, IEspDFUXRefListRequest &r
{
try
{
#ifdef _CONTAINERIZED
IERRLOG("CONTAINERIZED(CWsDfuXRefEx::onDFUXRefList)");
#else
context.ensureFeatureAccess(XREF_FEATURE_URL, SecAccess_Read, ECLWATCH_DFU_XREF_ACCESS_DENIED, "WsDfuXRef::DFUXRefList: Permission denied.");

#ifdef _CONTAINERIZED
DBGLOG("CONTAINERIZED(CWsDfuXRefEx::onDFUXRefList)");
Owned<IPropertyTreeIterator> planesIter = getPlanesIterator("data", nullptr);
Member:
This k8s code may work for BM too.. would be good to unify if so.
BM has "data" planes automatically generated from the clusters, so it should in theory..

Contributor Author:
I tested the k8s code on BM and there were additional storage planes that didn't work. In BM, when iterating through the available storage planes, there were mythor and myroxie (which both worked normally), plus mythor_mirror, hthor_hthor, and hthor_myeclagent (which all returned "Not Found" when trying to generate the results). Additionally, there was no storage plane for SuperFiles. Should this still be added manually?

Member:
I may be missing something: did you manage to consolidate it such that BM works using common code (e.g. getPlanesIterator("data", ...))?

Additionally, there was no storage plane for SuperFiles. Should this still be added manually?

hm not sure - superfiles don't have their own plane, they could even contain files that were on different planes.
Not sure how XREF handles them tbh - will have to look more closely.

Contributor Author:
Yes, the BM code now iterates through the 'clusters' using getPlanesIterator("data", nullptr). It partially worked: the Thor and Roxie clusters showed up properly and could be run, but additional 'clusters' (mythor_mirror, hthor_hthor, etc.) were found that failed when trying to generate results.

I have removed the DBGLOG with a containerized message from the generalized code to avoid confusion.

Member:
mythor_mirror, hthor_hthor, etc

I think the (BM only) mirror planes should probably be ignored here. It would be best if they were ignored not by name but by 'copy', which currently isn't recorded when the plane IPropertyTree is created, but could be.

The hthor_hthor plane .. what was the error? (and can you check where "hthor_hthor" is being constructed?)
Hthor wasn't available as an option before in BM - I don't know why specifically it wasn't supported, but I think we can ignore this plane type for now, with a big fat comment saying we should revisit this functionality.
(but like ignoring the mirror, we should ignore by type, not by name, which may need a flag recorded in the IPT).
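If such flags were recorded on the plane IPropertyTree, the listing loop could filter on them instead of on names. A sketch, where "@copy" is the proposed (not yet existing) marker and appendXRefNode is a stand-in helper:

Owned<IPropertyTree> xrefNodeTree = createPTree("XRefNodes");
Owned<IPropertyTreeIterator> planesIter = getPlanesIterator("data", nullptr);
ForEach(*planesIter)
{
    IPropertyTree &plane = planesIter->query();
    if (plane.getPropBool("@copy", false))      // proposed marker for BM mirror/replica planes
        continue;                               // XREF should only scan primary copies
    // TODO: similarly skip hthor-style planes by a type flag once one is recorded; revisit in a follow-up JIRA
    appendXRefNode(plane.queryProp("@name"), xrefNodeTree);
}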

Contributor Author:
I was unable to find where the hthor_hthor plane is created. I don't even see it in the list of planes anymore, so I'm really not sure what happened with that. I was able to see the error when hthor_hthor did show up: when I clicked generate, the status returned "Not Found".


BoolHash uniquePlanes;
Member:
addUniqueXRefNode and its use of a HT to track uniqueness don't seem useful now (and I'm not sure they ever were),
i.e. each plane name is guaranteed to be unique anyway..

Contributor Author:
Removed the use of addUniqueXRefNode. Should the function be deleted entirely if it isn't useful anymore?

Member:
probably, but either way, we need to be careful this unification on planes doesn't break XREF in BM.

Owned<IPropertyTree> xrefNodeTree = createPTree("XRefNodes");

ForEach(*planesIter)
{
IPropertyTree &item = planesIter->query();

addUniqueXRefNode(item.queryProp("@name"), uniquePlanes, xrefNodeTree);
}

StringBuffer buf;
resp.setDFUXRefListResult(formatResult(context, xrefNodeTree, buf));
#else
CConstWUClusterInfoArray clusters;
getEnvironmentClusterInfo(clusters);

22 changes: 15 additions & 7 deletions esp/src/src-react/components/Xrefs.tsx
@@ -66,13 +66,21 @@ export const Xrefs: React.FunctionComponent<XrefsProps> = ({
.then(({ DFUXRefListResponse }) => {
const xrefNodes = DFUXRefListResponse?.DFUXRefListResult?.XRefNode;
if (xrefNodes) {
setData(xrefNodes.map((item, idx) => {
return {
name: item.Name,
modified: item.Modified,
status: item.Status
};
}));
if (Array.isArray(xrefNodes)) {
setData(xrefNodes.map((item, idx) => {
return {
name: item.Name,
modified: item.Modified,
status: item.Status
};
}));
} else {
setData([{
name: xrefNodes.Name,
modified: xrefNodes.Modified,
status: xrefNodes.Status
}]);
}
}
})
.catch(err => logger.error(err))
2 changes: 2 additions & 0 deletions helm/examples/vault-pki-remote/values-hpcc1.yaml
@@ -52,6 +52,8 @@ sasha:
disabled: true
file-expiry:
disabled: true
xref:
disabled: true

esp:
- name: eclwatch
4 changes: 3 additions & 1 deletion helm/hpcc/templates/_helpers.tpl
@@ -1590,7 +1590,7 @@ prometheusMetricsReporter: "yes"
{{- end -}}

{{/*
Return access permssions for a given service
Return access permissions for a given service
*/}}
{{- define "hpcc.getSashaServiceAccess" }}
{{- if (eq "coalescer" .name) -}}
@@ -1605,6 +1605,8 @@
dali data
{{- else if (eq "thor-qmon" .name) -}}
dali queues
{{- else if (eq "xref" .name) -}}
dali dalidata
{{- else -}}
{{- $_ := fail (printf "Unknown sasha service:" .name ) -}}
{{- end -}}
11 changes: 11 additions & 0 deletions helm/hpcc/values.schema.json
@@ -2946,6 +2946,14 @@
},
"additionalProperties": false
},
"sasha-xref": {
"type": "object",
"allOf": [{ "$ref": "#/definitions/sashacommon" }],
"properties": {
"disabled": {}
},
"additionalProperties": false
},
"sashaservice": {
"oneOf": [
{
@@ -2970,6 +2978,9 @@
"thor-qmon": {
"$ref": "#/definitions/sasha-thor-qmon"
},
"xref": {
"$ref": "#/definitions/sasha-xref"
},
"disabled": {
"type": "boolean"
}
3 changes: 2 additions & 1 deletion helm/hpcc/values.yaml
@@ -508,7 +508,8 @@ sasha:
#disabled: true
#switchMinTime: 1
#queues: "*"


xref: {} # NB: no properties defined, use defaults
Member:
to be consistent with others, it would be good to list (commented out) common xref config options here.

Contributor Author (@jackdelv, Nov 8, 2024):
I added some options that don't really mean anything. Should they be usable options or just placeholders? I don't think any would be used even if they were defined right now?

Member:
They should be..
saserver should be loading the component configuration (line 331 of saserver.cpp) into 'serverConfig',
and various points in saxref.cpp look for properties in it.
Some may not be relevant for containerized.
Some will need reworking, e.g. xrefMaxMemory can probably remain as an option, but it should by default use all available memory (as resourced for the container), e.g. 90% of container memory. Probably similarly with maxThreads.
The schedule ones are still relevant.
As the code stands, it's going to be trying to pick up properties from "DFUXRef/" because in BM the sasha service configs are subtrees within one config. In containerized it should assume the root is the component config
(e.g. see createSashaFileExpiryServer's way of conditionally treating the config depending on whether containerized or not).

It may be okay as it is for now; it should fall back to semi-OK defaults, and then we can open a separate JIRA to revisit.
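A rough sketch of that conditional handling, assuming serverConfig is the loaded sasha/component configuration tree (the property names and defaults here are illustrative; the real keys are whatever saxref.cpp reads):

// Containerized: the component config is the root. BM: XREF settings live in a
// "DFUXRef" subtree of the single sasha configuration.
IPropertyTree *xrefProps = serverConfig;
if (!isContainerized())
    xrefProps = serverConfig->queryPropTree("DFUXRef");
unsigned maxThreads = xrefProps ? xrefProps->getPropInt("@maxThreads", 4) : 4;        // illustrative key/default
unsigned maxMemoryMB = xrefProps ? xrefProps->getPropInt("@xrefMaxMemory", 0) : 0;    // 0 could mean "derive from container limit"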


dfuserver:
- name: dfuserver