Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-30365 Add XREF Sasha service to K8s #18798

Draft
wants to merge 23 commits into
base: candidate-9.6.x
Choose a base branch
from

Conversation

jackdelv
Copy link
Contributor

@jackdelv jackdelv commented Jun 21, 2024

These changes add Xref functionality to the containerized version of the platform. There were a couple changes because thor file parts were striped across multiple directories.

I'm not sure that I correctly added the new XRef pod because everything still runs on the ECL Watch pod.

What is the best way to make this toggleable? An additional #ifdef? Should I still be using _CONTAINERIZED?

Type of change:

  • This change is a bug fix (non-breaking change which fixes an issue).
  • This change is a new feature (non-breaking change which adds functionality).
  • This change improves the code (refactor or other change that does not change the functionality)
  • This change fixes warnings (the fix does not alter the functionality or the generated code)
  • This change is a breaking change (fix or feature that will cause existing behavior to change).
  • This change alters the query API (existing queries will have to be recompiled)

Checklist:

  • My code follows the code style of this project.
    • My code does not create any new warnings from compiler, build system, or lint.
  • The commit message is properly formatted and free of typos.
    • The commit message title makes sense in a changelog, by itself.
    • The commit is signed.
  • My change requires a change to the documentation.
    • I have updated the documentation accordingly, or...
    • I have created a JIRA ticket to update the documentation.
    • Any new interfaces or exported functions are appropriately commented.
  • I have read the CONTRIBUTORS document.
  • The change has been fully tested:
    • I have added tests to cover my changes.
    • All new and existing tests passed.
    • I have checked that this change does not introduce memory leaks.
    • I have used Valgrind or similar tools to check for potential issues.
  • I have given due consideration to all of the following potential concerns:
    • Scalability
    • Performance
    • Security
    • Thread-safety
    • Cloud-compatibility
    • Premature optimization
    • Existing deployed queries will not be broken
    • This change fixes the problem, not just the symptom
    • The target branch of this pull request is appropriate for such a change.
  • There are no similar instances of the same problem that should be addressed
    • I have addressed them here
    • I have raised JIRA issues to address them separately
  • This is a user interface / front-end modification
    • I have tested my changes in multiple modern browsers
    • The component(s) render as expected

Smoketest:

  • Send notifications about my Pull Request position in Smoketest queue.
  • Test my draft Pull Request.

Testing:

@jackdelv jackdelv requested a review from jakesmith June 21, 2024 11:46
Copy link

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-30365

Jirabot Action Result:
Workflow Transition: Merge Pending
Updated PR

Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jackdelv, it's looking good, I've given it a first pass, and made some comments. Please take a look.

@@ -246,6 +246,10 @@ RemoteFilename &constructPartFilename(IGroup *grp,unsigned partno,unsigned partm
StringBuffer fullname;
if (findPathSepChar(name)==NULL)
addPathSepChar(fullname.append(partdir));
#ifdef _CONTAINERIZED
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general, using isContainerized() vs the #define is preferrable (so that the code is scanned/compiled by developers routinely etc.)
And, if both can be avoided and the code made general (even if the use case is not currently used in bare-metal), that is most preferrable.

This is a case in point, we should cope with this even in the bare-metal case in general utility classes/methods, even though it is not configurable in BM at the moment.

But also, it looks like this exposes a wide problem(!). This function is used not just by XREF, it's also used by dadfs.cpp removePhysicalFiles

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, just noticed that removePhysicalFiles is not used by anything - we should delete it avoid confusion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect constructPartFilename should be deprecated (probably removed too).
In XREF's case, it needs a way to get a physical part + endpoint given the meta data it's collected.

The 'lnentry' (CLogicalNameEntry) it has built from meta data, should perhaps have a method to built it, it should know (from the meta data) if the file is striped and if it has dirPerPart.

Btw, dirPerPart (a flag in the meta data) means it has a directory extension that includes the file part number (i.e. each part is in it's own directory)
'striping' is different, and an additional directory not per part, but enumerated over N striped disks. The stripNum is calculated from a hash (@lfnHash) stored in the file meta. Originally calculated and set there by hashing the logicalfilename.

See makePhysicalPartName and it's callers (and uses of lfnHash and calcStripeNumber).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed remotePhysicalFiles and added a constructPartFilename function to CLogicalNameEntry. I had some trouble finding the dirPerPart flag in the metadata. The only examples I was able to find was using queryDirPerPart() on a StoragePlane, but when trying that it seemed the metadata was missing and it was defaulting to the return value of isContainerized().

Added lfnHash to the CLogicalNameEntry and used calcStripeNumber on the part number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old constructPartFilename function had some tests for it, so I couldn't delete the function without also removing the tests. Should there now be tests for the new constructPartFilename if it does get added to CLogicalNameEntry?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should remove the old constructPartFilename, that is now only used by tests, and if not covered already, add tests that exercise makePhysicalPartName.

But for now, for this PR, I would just add comments that it is deprecated, but test code uses it - and open a separate JIRA to revisit sometime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the new CLogicalNameEntry::constructPartFilename, I think you can remove the changes here right? i.e.:

#ifdef _CONTAINERIZED
    if (partmax>1)
        addPathSepChar(fullname.append(partno+1)); // If there are multiple files they will be striped by part number
#endif

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the _Containerized changes and added a comment to the old constructPartFilename.

StringBuffer stripeMask;
stripeMask.append(getPathSepChar(mname)).append("$P$").append(getPathSepChar(mname));
// Replace part number in stripe directory with mask
mname.replaceString(stripeDir, stripeMask);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure what this is for? (let's discuss)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the XREF code is trying to match all the parts of an orphan file and each part is in its own directory, it incorrectly creates separate entries for each part when they should be combined. Adding a part mask to the directory for each part allows them to be matched properly. Sorry for the confusion if I caused any because I was using the term striped incorrectly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A typical path on a striped storage that also has dirPerPart=true, will look like:
/var/lib/HPCCSystems/hpcc-data/d8/somescope/otherscope/35/thelfnname._35_of_400

where 'd8' is the stripe sub-directory, and '35' is the dir-per-part sub-directory.
When parsing a filename like this, the dirPerPart format can be deduced if the directory before the filename, is a numeric only (which is invalid for a scope/filename, which must begin with an alpha character), and that it matches the part # of the filename.
The fact that it's striped is less obvious, but could be deduced by the fact that the leading part of the filename '/var/lib/HPCCSystems/hpcc-data' is a prefix of one of the planes, and the plane specs will say numDevices>1

From the code here, it looks like it's dealing with the dir-per-part #, but code referencing 'stripe'.
I suspect this needs looking at closely, and testing with files that are on striped storage, and dirPerPlane, and ones that aren't - would be good to add some CPPUNIT tests here of various paths to make sure it is giving expected results.

I haven't looked in detail how the info is being used by the callers of parseFileName, but I suspect it may need to pass back, the deduced plane, and perhaps the stripe and/or flags like dirPerPart.. not sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have restructured how it gets the dir-per-part number and added checks for the stripe number. I was able to test by changing the string before passing into the function, but I am not sure how to add to the CPPUNIT tests.

dali/dfuXRefLib/dfuxreflib.cpp Outdated Show resolved Hide resolved
@@ -2600,7 +2616,11 @@ class CXRefManager: public CXRefManagerBase
Owned<IGroup> g;
unsigned j;
if (!nclusters) {
#ifdef _CONTAINERIZED
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

preferable to use isCotainerized()

In fact, I think it would be worth just changing the message to the planes version.
In BM, internally, we generate planes for all clusters, such that methods like getStoragePlane/getPlanesIterator work with clusters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to planes message.

dali/dfuXRefLib/dfuxreflib.cpp Show resolved Hide resolved
for (int i = 0; i < numdirs; i++)
{
Owned<IPropertyTree> storagePlane = getStoragePlane(clusters[i]);
dirs[i] = storagePlane->queryProp("@prefix");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it makes sense to be enumerating the directories into 'dirs' and passing them xrefmanager.process.

In BM, the dirs are not different cluster dirs, but the primary/replicate dirs, which are consistent across all clusters.

I think we should keep it simple in cloud version, and assert that we are only dealing with 1 plane at a time (at least initially), the dirs should contain a single directory of the plane prefix being processes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to only pass in a single directory through dirs.

context.ensureFeatureAccess(XREF_FEATURE_URL, SecAccess_Read, ECLWATCH_DFU_XREF_ACCESS_DENIED, "WsDfuXRef::DFUXRefList: Permission denied.");

#ifdef _CONTAINERIZED
DBGLOG("CONTAINERIZED(CWsDfuXRefEx::onDFUXRefList)");
Owned<IPropertyTreeIterator> planesIter = getPlanesIterator("data", nullptr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This k8s code may work for BM too.. would be good to unify if so.
BM has "data" planes automatically generated from the clusters, so it should in theory..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the k8s code on BM and there were additional storage planes that didn't work. In BM when iterating through the available storage planes there were mythor, myroxie (which both worked normally), mythor_mirror, hthor_hthor, and hthor_myeclagent (which all returned "Not Found" when trying to generate the results). Additionally, there was no storage plane for SuperFiles. Should this still be added manually?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, did you manage to consolidate it such that BM works using common code (e.g. getPlanesIterator("data". .) ?

Additionally, there was no storage plane for SuperFiles. Should this still be added manually?

hm not sure - superfiles don't have their own plane, they could even contain files that were on different planes.
Not sure how XREF handles them tbh - will have to look more closely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the BM code now iterates through the 'clusters' using getPlanesIterator("data", nullptr). It partially worked where the Thor and Roxie clusters showed up properly and could be run, but additional 'clusters' (mythor_mirror, hthor_hthor, etc) were found that failed when trying to generate results.

I have removed the DBGLOG with a containerized message from the generalized code to avoid confusion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mythor_mirror, hthor_hthor, etc

I think the (BM only) mirror planes should probably be ignored here. It would best if they were ignored not by name, but by 'copy', which currently isn't recorded when the plane IPropertyTree is created, but could be.

The hthor_hthor plane .. what was the error? (and can you check where "hthor_hthor" is being constructed?)
Hthor wasn't available as an option before in BM - I don't know why specifically it wasn't supported, but I think we can ignore for now this plane type, with a big fat comment saying we should revisit this functionality.
(but like ignoring the mirror, we should ignore by type, not by name, which may need a flag recording in the IPT).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unable to find where the hthor_hthor plane is created. I don't even see it in the list of planes anymore. I'm really not sure what happened with that. I was able to see the error when hthor__hthor did show up. When I clicked generate, the status returned "Not Found".

@jakesmith
Copy link
Member

I'm not sure that I correctly added the new XRef pod because everything still runs on the ECL Watch pod.

that's fine for now. We can move it to a separate esp service pod if we really want to.

Copy link
Contributor Author

@jackdelv jackdelv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith I left some responses to your comments and questions. Back to you.

context.ensureFeatureAccess(XREF_FEATURE_URL, SecAccess_Read, ECLWATCH_DFU_XREF_ACCESS_DENIED, "WsDfuXRef::DFUXRefList: Permission denied.");

#ifdef _CONTAINERIZED
DBGLOG("CONTAINERIZED(CWsDfuXRefEx::onDFUXRefList)");
Owned<IPropertyTreeIterator> planesIter = getPlanesIterator("data", nullptr);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the k8s code on BM and there were additional storage planes that didn't work. In BM when iterating through the available storage planes there were mythor, myroxie (which both worked normally), mythor_mirror, hthor_hthor, and hthor_myeclagent (which all returned "Not Found" when trying to generate the results). Additionally, there was no storage plane for SuperFiles. Should this still be added manually?

dali/dfuXRefLib/dfuxreflib.cpp Show resolved Hide resolved
StringBuffer stripeMask;
stripeMask.append(getPathSepChar(mname)).append("$P$").append(getPathSepChar(mname));
// Replace part number in stripe directory with mask
mname.replaceString(stripeDir, stripeMask);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the XREF code is trying to match all the parts of an orphan file and each part is in its own directory, it incorrectly creates separate entries for each part when they should be combined. Adding a part mask to the directory for each part allows them to be matched properly. Sorry for the confusion if I caused any because I was using the term striped incorrectly.

@@ -246,6 +246,10 @@ RemoteFilename &constructPartFilename(IGroup *grp,unsigned partno,unsigned partm
StringBuffer fullname;
if (findPathSepChar(name)==NULL)
addPathSepChar(fullname.append(partdir));
#ifdef _CONTAINERIZED
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed remotePhysicalFiles and added a constructPartFilename function to CLogicalNameEntry. I had some trouble finding the dirPerPart flag in the metadata. The only examples I was able to find was using queryDirPerPart() on a StoragePlane, but when trying that it seemed the metadata was missing and it was defaulting to the return value of isContainerized().

Added lfnHash to the CLogicalNameEntry and used calcStripeNumber on the part number.

@@ -246,6 +246,10 @@ RemoteFilename &constructPartFilename(IGroup *grp,unsigned partno,unsigned partm
StringBuffer fullname;
if (findPathSepChar(name)==NULL)
addPathSepChar(fullname.append(partdir));
#ifdef _CONTAINERIZED
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old constructPartFilename function had some tests for it, so I couldn't delete the function without also removing the tests. Should there now be tests for the new constructPartFilename if it does get added to CLogicalNameEntry?

@jackdelv jackdelv requested a review from jakesmith July 12, 2024 14:15
Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackdelv - looks good overall.
Please see comments and questions.

@@ -246,6 +246,10 @@ RemoteFilename &constructPartFilename(IGroup *grp,unsigned partno,unsigned partm
StringBuffer fullname;
if (findPathSepChar(name)==NULL)
addPathSepChar(fullname.append(partdir));
#ifdef _CONTAINERIZED
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should remove the old constructPartFilename, that is now only used by tests, and if not covered already, add tests that exercise makePhysicalPartName.

But for now, for this PR, I would just add comments that it is deprecated, but test code uses it - and open a separate JIRA to revisit sometime.

@@ -246,6 +246,10 @@ RemoteFilename &constructPartFilename(IGroup *grp,unsigned partno,unsigned partm
StringBuffer fullname;
if (findPathSepChar(name)==NULL)
addPathSepChar(fullname.append(partdir));
#ifdef _CONTAINERIZED
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the new CLogicalNameEntry::constructPartFilename, I think you can remove the changes here right? i.e.:

#ifdef _CONTAINERIZED
    if (partmax>1)
        addPathSepChar(fullname.append(partno+1)); // If there are multiple files they will be striped by part number
#endif

Owned<IStoragePlane> plane = getDataStoragePlane(grpname, true);
const char * prefix = plane->queryPrefix();
unsigned stripeNum = calcStripeNumber(partNo+1, lfnHash, plane->numDevices());
bool dirPerPart = max>1?plane->queryDirPerPart():false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not make any practical difference - but whether a file is dirPerPart is a flag in the logical file meta data.
In theory, a logical file may have this flag set, but the plane doesn't or vice versa. That could only happen if the plane spec. had changed since files were created.

You can determine it from the file from "@flags". See CFileDescriptor::getFlags()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added file to new constructPartFilename parameters to get dirPerPart from the flags.

}
expandMask(partName, pmask, partNo, max);

StringBuffer fullname;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trivial/style: better to co-locate closer to where used ~(i.e. just before makePhysicalPartName) for clarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

StringBuffer stripeMask;
stripeMask.append(getPathSepChar(mname)).append("$P$").append(getPathSepChar(mname));
// Replace part number in stripe directory with mask
mname.replaceString(stripeDir, stripeMask);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A typical path on a striped storage that also has dirPerPart=true, will look like:
/var/lib/HPCCSystems/hpcc-data/d8/somescope/otherscope/35/thelfnname._35_of_400

where 'd8' is the stripe sub-directory, and '35' is the dir-per-part sub-directory.
When parsing a filename like this, the dirPerPart format can be deduced if the directory before the filename, is a numeric only (which is invalid for a scope/filename, which must begin with an alpha character), and that it matches the part # of the filename.
The fact that it's striped is less obvious, but could be deduced by the fact that the leading part of the filename '/var/lib/HPCCSystems/hpcc-data' is a prefix of one of the planes, and the plane specs will say numDevices>1

From the code here, it looks like it's dealing with the dir-per-part #, but code referencing 'stripe'.
I suspect this needs looking at closely, and testing with files that are on striped storage, and dirPerPlane, and ones that aren't - would be good to add some CPPUNIT tests here of various paths to make sure it is giving expected results.

I haven't looked in detail how the info is being used by the callers of parseFileName, but I suspect it may need to pass back, the deduced plane, and perhaps the stripe and/or flags like dirPerPart.. not sure.

dali/dfuXRefLib/dfuxreflib.cpp Show resolved Hide resolved
{
DBGLOG("CONTAINERIZED(runXRef)");
numdirs = 1;
Owned<IPropertyTree> storagePlane = getStoragePlane(clusters[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think xrefmanger.process() will now (in containerized) handle >1 cluster, processing them in series,
but getting the dir (prefix) from the 1st here, won't be correct if they different.

I think this directory getting code would probably make more sense inside the process() method, where in BM, it can do as it does now and assume same dir for all, but in containerized, it can get the correct dir for each 'cluster' (plane) each time in the loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it makes more sense this way. I have moved the directory getting code into process.

context.ensureFeatureAccess(XREF_FEATURE_URL, SecAccess_Read, ECLWATCH_DFU_XREF_ACCESS_DENIED, "WsDfuXRef::DFUXRefList: Permission denied.");

#ifdef _CONTAINERIZED
DBGLOG("CONTAINERIZED(CWsDfuXRefEx::onDFUXRefList)");
Owned<IPropertyTreeIterator> planesIter = getPlanesIterator("data", nullptr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, did you manage to consolidate it such that BM works using common code (e.g. getPlanesIterator("data". .) ?

Additionally, there was no storage plane for SuperFiles. Should this still be added manually?

hm not sure - superfiles don't have their own plane, they could even contain files that were on different planes.
Not sure how XREF handles them tbh - will have to look more closely.

Copy link
Contributor Author

@jackdelv jackdelv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith I have responded to your comments and questions.

I think the new constructPartFilename function should be added to the CPPUNIT tests, but I am not sure how they work or the best way to extend them.

@@ -246,6 +246,10 @@ RemoteFilename &constructPartFilename(IGroup *grp,unsigned partno,unsigned partm
StringBuffer fullname;
if (findPathSepChar(name)==NULL)
addPathSepChar(fullname.append(partdir));
#ifdef _CONTAINERIZED
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the _Containerized changes and added a comment to the old constructPartFilename.

}
expandMask(partName, pmask, partNo, max);

StringBuffer fullname;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Owned<IStoragePlane> plane = getDataStoragePlane(grpname, true);
const char * prefix = plane->queryPrefix();
unsigned stripeNum = calcStripeNumber(partNo+1, lfnHash, plane->numDevices());
bool dirPerPart = max>1?plane->queryDirPerPart():false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added file to new constructPartFilename parameters to get dirPerPart from the flags.

StringBuffer stripeMask;
stripeMask.append(getPathSepChar(mname)).append("$P$").append(getPathSepChar(mname));
// Replace part number in stripe directory with mask
mname.replaceString(stripeDir, stripeMask);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have restructured how it gets the dir-per-part number and added checks for the stripe number. I was able to test by changing the string before passing into the function, but I am not sure how to add to the CPPUNIT tests.

dali/dfuXRefLib/dfuxreflib.cpp Show resolved Hide resolved
{
DBGLOG("CONTAINERIZED(runXRef)");
numdirs = 1;
Owned<IPropertyTree> storagePlane = getStoragePlane(clusters[0]);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it makes more sense this way. I have moved the directory getting code into process.

context.ensureFeatureAccess(XREF_FEATURE_URL, SecAccess_Read, ECLWATCH_DFU_XREF_ACCESS_DENIED, "WsDfuXRef::DFUXRefList: Permission denied.");

#ifdef _CONTAINERIZED
DBGLOG("CONTAINERIZED(CWsDfuXRefEx::onDFUXRefList)");
Owned<IPropertyTreeIterator> planesIter = getPlanesIterator("data", nullptr);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the BM code now iterates through the 'clusters' using getPlanesIterator("data", nullptr). It partially worked where the Thor and Roxie clusters showed up properly and could be run, but additional 'clusters' (mythor_mirror, hthor_hthor, etc) were found that failed when trying to generate results.

I have removed the DBGLOG with a containerized message from the generalized code to avoid confusion.

@jackdelv jackdelv requested a review from jakesmith August 9, 2024 18:19
unsigned stripeNum = calcStripeNumber(partNo+1, lfnHash, plane->numDevices());

// Get dir-per-part # from file metadata or storage plan info
FileDescriptorFlags flags = static_cast<FileDescriptorFlags>(file.getPropInt("Attr/@flags"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith I added file to the parameters for constructPartFilename, but I think that may be incorrect since it uses file in the CLogicalNameEntry constructor. Which is considered best practice, initializing CLogicalNameEntry with the dir-per-part/stripe-number and adding additional members to the class or passing file into constructPartFilename and getting the information right when we need it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think best to follow same pattern as for other info. extracted from the IPT file now, and pull them out and store them in CLogicalNameEntry as members. So, Attr/@flags can be read there once and used here. See previous command above. Otherwise, there is a lot of repeated work re-calling things for every part of the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved IPT file extraction to the CLogicalNameEntry constructor to avoid re-calling things too many times. Added info as members of CLogicalNameEntry.

Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackdelv - apologies for taking an age to get back to this.

Please see comments.
It's definitely getting closer.

@@ -508,7 +508,8 @@ sasha:
#disabled: true
#switchMinTime: 1
#queues: "*"


xref: {} # NB: no properties defined, use defaults
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be consistent with others, it would be good to list (commented out) common xref config options here.

Copy link
Contributor Author

@jackdelv jackdelv Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some options that don't really mean anything. Should they be usable options or just placeholders? I don't think any would be used even if they were defined right now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should be..
saserver should be loading the component configuration (line 331 of saserver.cpp) into 'serverConfig',
and various points in saxref.cpp looks for properties in it.
Some may not be relevant for containerized.
Some will need reworking, e.g. xrefMaxMemory can prob. remain as an option, but it should by default use all available memory (as resourced for the container), e.g. 90% of container memory. Probably similarly with maxThreads.
The schedule ones are still relevant.
As the code stands, it's going to be trying to pick up properties from "DFUXRef/" because in BM, the sasha service configs are subtrees within 1 config. In Containerized it should assume the root is the component config
( e.g. see createSashaFileExpiryServer's way of conditionally treating the config depending on whether containerized or not ).

It may be okay as it is for now, it should default to semi ok default, and then we can open a separate JIRA to revisit.

esp/src/src-react/components/Xrefs.tsx Show resolved Hide resolved
context.ensureFeatureAccess(XREF_FEATURE_URL, SecAccess_Read, ECLWATCH_DFU_XREF_ACCESS_DENIED, "WsDfuXRef::DFUXRefList: Permission denied.");

#ifdef _CONTAINERIZED
DBGLOG("CONTAINERIZED(CWsDfuXRefEx::onDFUXRefList)");
Owned<IPropertyTreeIterator> planesIter = getPlanesIterator("data", nullptr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mythor_mirror, hthor_hthor, etc

I think the (BM only) mirror planes should probably be ignored here. It would best if they were ignored not by name, but by 'copy', which currently isn't recorded when the plane IPropertyTree is created, but could be.

The hthor_hthor plane .. what was the error? (and can you check where "hthor_hthor" is being constructed?)
Hthor wasn't available as an option before in BM - I don't know why specifically it wasn't supported, but I think we can ignore for now this plane type, with a big fat comment saying we should revisit this functionality.
(but like ignoring the mirror, we should ignore by type, not by name, which may need a flag recording in the IPT).


BoolHash uniqueProcesses;
BoolHash uniquePlanes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addUniqueXRefNode and it's use of a HT to track unique, doesn't seem useful now (and not sure it was).
i.e. each plane name is guarantee to be unique anyway..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the use of addUniqueXRef node. Should the function be deleted entirely if it isn't useful anymore?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably, but either way, we need to be careful this unification on planes doesn't break XREF in BM.

expandMask(partName, pmask, partNo, max);

// Get stripeNum from storage plane
Owned<IStoragePlane> plane = getDataStoragePlane(grpname, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially the plane may not exist.

This an other errors at this point should abort processing this file further I think, and somehow be part of the report of logical files which are 'bad'.

in fact it's calling this per file part.. we should minimize the work..
it looks like this and prefix (and anything else that is at the file level), should have been called/extracted once, probably in processFile, when the lnentry is created.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think constructPartFilename should be moved outside of xref and into dadfs.cpp (where the deprecated version is), and extended to take any necessary parameters it needs.

To avoid confusion with the deprecated, but not yet completely deleted old method, we should rename the old method to e.g. deprecatedConstructPartFilename for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added check for plane. When an error occurs at this point, it is logged with CXRefManagerBase::error.

Moved file level info to lnentry creation and move constructPartFilename to dadfs.cpp

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should pass in required=false, otherwise getDataStoragePlane will throw an exception, instead of the more contextual error message below.

for (;;) {
char c=*name;
if (!c)
break;
if ((c=='d')&&(isdigit(name[1])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This routine should really parse the filename, looking for a plane prefix match first.
Reject if no match. All files should exist in a plane.

Then check next segment for striping, and check if corresponding plane has striping, if it hasn't then complain/reject.
Then parse remainder of path, and check penultimate segment for dir-per-part pattern (i.e. only check path segment before the filename).

Because this is called a lot, it does need to be efficient.
So I would collect the data plane prefixes upfront, then initially check to see if the name has a prefix of one of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to check prefixes of data storage planes at the beginning. If no match is found throw an error. If a storage plane prefix is matched then it checks numDevices>1 and looks for striping in the next segment.

Skips what has already been added, parses the remainder of the path, and avoids copying until the filename is found. Once the file name is found it checks the last segment for a dir-per-part number. If a number is found and matches the part number of the path then the mask is added to mname.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay, but given how many times (maybe a billion times perhaps on prod), we may want to revisit, to cache the plane prefixes, to avoid iterating over the IPT each time.

Also, given it's fairly opaque functionality and critical, it would be good to add a CPPUNIT test with a list of paths to validate it behaves as expected for each. Could you add ?

dali/dfuXRefLib/dfuxreflib.cpp Outdated Show resolved Hide resolved
dali/dfuXRefLib/dfuxreflib.cpp Outdated Show resolved Hide resolved
dali/dfuXRefLib/dfuxreflib.cpp Show resolved Hide resolved
dali/dfuXRefLib/dfuxreflib.cpp Show resolved Hide resolved
- Not sure if any options are used
- Change deprecated constructPartFilename to deprecatedConstructPartFilename
- Keeps the constructPartFilename method simply for passing members to the function
- Move file level info to creation of lnentry rather than get it for each part
- Put guards on variables that can be null/empty
- Find match first and throw an error if one is not found
- Check next segment for striping if numDevices>1
- Only check last segment before filename for dir-per-part number
- avoid copying single chars and copy chunks after we look for dir-per-part
Copy link
Contributor Author

@jackdelv jackdelv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakesmith I responded to all your comments and left a couple small questions. Back to you.

expandMask(partName, pmask, partNo, max);

// Get stripeNum from storage plane
Owned<IStoragePlane> plane = getDataStoragePlane(grpname, true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added check for plane. When an error occurs at this point, it is logged with CXRefManagerBase::error.

Moved file level info to lnentry creation and move constructPartFilename to dadfs.cpp

dali/dfuXRefLib/dfuxreflib.cpp Outdated Show resolved Hide resolved
unsigned stripeNum = calcStripeNumber(partNo+1, lfnHash, plane->numDevices());

// Get dir-per-part # from file metadata or storage plan info
FileDescriptorFlags flags = static_cast<FileDescriptorFlags>(file.getPropInt("Attr/@flags"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved IPT file extraction to the CLogicalNameEntry constructor to avoid re-calling things too many times. Added info as members of CLogicalNameEntry.

unsigned n;
unsigned d;
mspec.calcPartLocation(partNo, max, copy, grp?grp->ordinality():max, n, d);
setReplicateFilename(fullname, d);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added !isContainerized().

@@ -923,17 +964,60 @@ static bool parseFileName(const char *name,StringBuffer &mname,unsigned &num,uns
name = nonrepdir.str();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added !isContainerized().

dali/dfuXRefLib/dfuxreflib.cpp Show resolved Hide resolved
context.ensureFeatureAccess(XREF_FEATURE_URL, SecAccess_Read, ECLWATCH_DFU_XREF_ACCESS_DENIED, "WsDfuXRef::DFUXRefList: Permission denied.");

#ifdef _CONTAINERIZED
DBGLOG("CONTAINERIZED(CWsDfuXRefEx::onDFUXRefList)");
Owned<IPropertyTreeIterator> planesIter = getPlanesIterator("data", nullptr);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unable to find where the hthor_hthor plane is created. I don't even see it in the list of planes anymore. I'm really not sure what happened with that. I was able to see the error when hthor__hthor did show up. When I clicked generate, the status returned "Not Found".


BoolHash uniqueProcesses;
BoolHash uniquePlanes;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the use of addUniqueXRef node. Should the function be deleted entirely if it isn't useful anymore?

esp/src/src-react/components/Xrefs.tsx Show resolved Hide resolved
@@ -508,7 +508,8 @@ sasha:
#disabled: true
#switchMinTime: 1
#queues: "*"


xref: {} # NB: no properties defined, use defaults
Copy link
Contributor Author

@jackdelv jackdelv Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some options that don't really mean anything. Should they be usable options or just placeholders? I don't think any would be used even if they were defined right now?

Copy link
Member

@jakesmith jakesmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jackdelv.
See follow up comments/replies.

Is the mirror plane being handled correctly now (being ignored) in BM?

@@ -508,7 +508,8 @@ sasha:
#disabled: true
#switchMinTime: 1
#queues: "*"


xref: {} # NB: no properties defined, use defaults
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should be..
saserver should be loading the component configuration (line 331 of saserver.cpp) into 'serverConfig',
and various points in saxref.cpp looks for properties in it.
Some may not be relevant for containerized.
Some will need reworking, e.g. xrefMaxMemory can prob. remain as an option, but it should by default use all available memory (as resourced for the container), e.g. 90% of container memory. Probably similarly with maxThreads.
The schedule ones are still relevant.
As the code stands, it's going to be trying to pick up properties from "DFUXRef/" because in BM, the sasha service configs are subtrees within 1 config. In Containerized it should assume the root is the component config
( e.g. see createSashaFileExpiryServer's way of conditionally treating the config depending on whether containerized or not ).

It may be okay as it is for now, it should default to semi ok default, and then we can open a separate JIRA to revisit.


BoolHash uniqueProcesses;
BoolHash uniquePlanes;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably, but either way, we need to be careful this unification on planes doesn't break XREF in BM.

expandMask(partName, pmask, partNo, max);

// Get stripeNum from storage plane
Owned<IStoragePlane> plane = getDataStoragePlane(grpname, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should pass in required=false, otherwise getDataStoragePlane will throw an exception, instead of the more contextual error message below.

makePhysicalPartName(lname, partNo+1, max, fullname, 0, DFD_OSdefault, prefix, dirPerPart, stripeNum);

unsigned n = 0;
if (!isContainerized())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: #18798 (comment)

can you add a comment here to something like:
"revisit: constructPartFilename should probably be refactored not to deal with replicate directories, by pre-determining the alternate prefix if copy>0"

IGroup *grp = lnentry->queryGroup();
if (!grp) {
manager.warn(lnentry->lname.get(),"No group found, ignoring logical file");
return;
}
constructPartFilename(grp,partno,numparts,partname,partmask,partdir,replicate,replicateoffset,rfn);
lnentry->constructPartFilename(partno, replicate, rfn, file);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added replicate?1:0 because that was how it was handled in the deprecated version, but is that correct? What does copy mean?

I think good enough for now. It is assuming 2 copies (in BM), but theoretically there could be more.

@@ -1581,33 +1648,32 @@ void loadFromDFS(CXRefManagerBase &manager,IGroup *grp,unsigned numdirs,const ch
if (lnentry->outsidenodes.find(rep)==NotFound)
lnentry->outsidenodes.append(rep);
}
if (isContainerized())
{
UWARNLOG("Replication not supported in containerized version");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is going to generate a very large number of of logged warnings.
replication should really be a property of the plane.
For now, let's just turn this into a comment. In containerized we not expect planes to be replicated.

IGroup *grp = lnentry->queryGroup();
if (!grp) {
manager.warn(lnentry->lname.get(),"No group found, ignoring logical file");
return;
}
constructPartFilename(grp,partno,numparts,partname,partmask,partdir,replicate,replicateoffset,rfn);
lnentry->constructPartFilename(partno, replicate, rfn, file);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added if (isContainerized()) to break out of loop and warn the user. Is the user the correct audience?

this will be hit very frequently. Should just be a comment. See other PR comment near code.

for (;;) {
char c=*name;
if (!c)
break;
if ((c=='d')&&(isdigit(name[1])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay, but given how many times (maybe a billion times perhaps on prod), we may want to revisit, to cache the plane prefixes, to avoid iterating over the IPT each time.

Also, given it's fairly opaque functionality and critical, it would be good to add a CPPUNIT test with a list of paths to validate it behaves as expected for each. Could you add ?

name++;
while (*name&&isdigit(*name))
name++;
mname.append("d$P$");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why d$P$?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants