
Drive failure tests #69

Merged
adityadani merged 5 commits into master from drive_failure_test on Dec 21, 2017
Conversation

adityadani (Member):

Add utilities in the node driver and volume driver to fail and recover drives.

  • Add two new APIs in the node driver - YankDrive and RecoverDrive.
  • Add implementations for the ssh driver.
  • Add two new APIs in the volume driver - GetStorageDevices and RecoverDriver.

Add a Ginkgo test to induce a drive failure and check the apps.

Drive Failure implementation for AWS coming soon.
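
For orientation, a minimal sketch of what the new driver interfaces could look like; only the method names (YankDrive, RecoverDrive, GetStorageDevices, RecoverDriver) come from this PR, while the Node type and the exact signatures are illustrative assumptions.

```go
// Sketch only: the method names come from this PR; the Node type and the
// exact signatures are illustrative assumptions.
package drivers

// Node identifies a cluster node that the drivers operate on (assumed shape).
type Node struct {
	Name string
}

// NodeDriver gains two drive-failure helpers.
type NodeDriver interface {
	// YankDrive forcibly offlines the given drive on the node.
	YankDrive(n Node, driveID string) error
	// RecoverDrive brings a previously yanked drive back online.
	RecoverDrive(n Node, driveID string) error
}

// VolumeDriver gains helpers to discover devices and recover the driver.
type VolumeDriver interface {
	// GetStorageDevices returns the device paths backing storage on the node.
	GetStorageDevices(n Node) ([]string, error)
	// RecoverDriver recovers the volume driver on the node after a drive failure.
	RecoverDriver(n Node) error
}
```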

}

driveID = strings.TrimRight(driveID, "\n")
driveNameToFail = strings.TrimRight(strings.TrimLeft(driveNameToFail, "/"), "/")
Contributor:

You could use just strings.Trim to remove both the leading and trailing characters in one call.
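
For reference, a small runnable sketch of that suggestion (the example value of driveNameToFail is made up):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	driveNameToFail := "/dev/sdb/" // assumed example value
	// strings.Trim removes the cutset from both ends in one call,
	// replacing the TrimLeft + TrimRight pair used in the diff.
	driveNameToFail = strings.Trim(driveNameToFail, "/")
	fmt.Println(driveNameToFail) // Output: dev/sdb
}
```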

}

// Enable the drive by rescanning
recoverCmd := "echo \" - - -\" > /sys/class/scsi_host/host" + driveUUIDToRecover + "/scan"
Contributor:

Should this have a '/' after "host" and before driveUUIDToRecover?

adityadani (Member, Author), Nov 9, 2017:

No. It is not a subdirectory under host; the ID is appended to "host" to build the correct directory name.
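
To illustrate the reply, a small sketch of how the command string is assembled (the host number "0" is an assumed example; on Linux the SCSI host entries are /sys/class/scsi_host/host0, host1, and so on):

```go
package main

import "fmt"

func main() {
	// driveUUIDToRecover holds the SCSI host number, e.g. "0"; it is not a
	// subdirectory but is appended to "host" to form host0, host1, ...
	driveUUIDToRecover := "0" // assumed example value
	recoverCmd := "echo \" - - -\" > /sys/class/scsi_host/host" + driveUUIDToRecover + "/scan"
	fmt.Println(recoverCmd)
	// Output: echo " - - -" > /sys/class/scsi_host/host0/scan
}
```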

resourcesKey = "Resources"
pathKey = "path"
)
pxNode, err := d.getClusterManager().Inspect(n.Name)
Contributor:

You will have to use VolDriverNodeId instead of Name after my change, because Name is now the scheduler's ID/name for the node.

adityadani (Member, Author):

Good catch. I didn't run the tests after the master rebase.

)
pxNode, err := d.getClusterManager().Inspect(n.Name)
if err != nil {
return []string{}, err
Contributor:

could just be nil, right?

adityadani (Member, Author):

done
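
The resolution in a nutshell, as a small runnable sketch (getDevices is a hypothetical helper, not the PR's function):

```go
package main

import (
	"errors"
	"fmt"
)

// getDevices returns nil instead of an empty slice on error, as suggested.
func getDevices(fail bool) ([]string, error) {
	if fail {
		return nil, errors.New("inspect failed")
	}
	return []string{"/dev/sdb"}, nil
}

func main() {
	devs, err := getDevices(true)
	fmt.Println(devs, err) // prints: [] inspect failed
}
```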


resourcesMapIntf, ok := storageInfoMap[resourcesKey]
if !ok || resourcesMapIntf == nil {
return []string{}, fmt.Errorf("Unable to find resource info for node: %v", n.Name)
Contributor:

Same here: can we return nil?

adityadani (Member, Author):

done

}
resourcesMap := resourcesMapIntf.(map[string]interface{})

for _, v := range resourcesMap {
Contributor:

nit: can we move the devPaths slice declaration (currently at line 167) above this for loop? It seems out of place there.

adityadani (Member, Author):

moved devPaths down in the function

}

var _ = Describe("Induce drive failure on of the nodes", func() {
driveFailureTest("drivefailure")
Contributor:

nit: instead of a separate method we could just put the It block inside this Describe. In the reboot test it is done that way because two Describes share the same method with different params.

adityadani (Member, Author):

Removed the wrapper function.
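
The resolved layout is roughly the following Ginkgo sketch; the It description and step body are placeholders, not the PR's actual test code:

```go
package tests

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

// The It block sits directly inside the Describe instead of being wrapped
// in a separate driveFailureTest("drivefailure") helper.
var _ = Describe("Induce drive failure on one of the nodes", func() {
	It("has to schedule apps, fail a drive, and validate the apps", func() {
		// Placeholder step; the real test schedules apps, yanks a drive on a
		// node, recovers it, and validates that the apps are still healthy.
		Expect(true).To(BeTrue())
	})
})
```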

InitInstance()
})

var _ = Describe("Induce drive failure on of the nodes", func() {
Collaborator:

Typo in the description: "on of the nodes" should be "on one of the nodes".

appNodes, err = Inst().S.GetNodesForApp(ctx)
Expect(err).NotTo(HaveOccurred())
Expect(appNodes).NotTo(BeEmpty())
nodeWithDrive = appNodes[0]
Collaborator:

Instead of picking just the first one, we could do this on a quorum of nodes.

adityadani (Member, Author):

I don't see a reason why we need to fail drives on more than one node.
What I can check is whether the node whose drive I am failing has a replica of the volume being used.

Collaborator:

Yeah. Otherwise we could end up offlining drives on a node that doesn't have a replica of the volume.

adityadani (Member, Author):

Currently it assumes that the volume replication factor is 3 and the cluster size is 3 (the default settings).
I have created issue #83 to add a new API that handles the generic case where a replica may not exist on every node.

drives, err = Inst().V.GetStorageDevices(nodeWithDrive)
Expect(err).NotTo(HaveOccurred())
Expect(drives).NotTo(BeEmpty())
driveToFail = drives[0]
Collaborator:

Why just the first drive and not all?

adityadani (Member, Author):

Will add code to fail all the drives on one node.
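
A sketch of what failing every drive on the chosen node could look like; only the method names GetStorageDevices and YankDrive come from this PR, and the small interfaces below are stand-ins for the real drivers:

```go
package drivers

import "fmt"

// Stand-in interfaces for illustration; the real drivers live elsewhere.
type volumeDriver interface {
	GetStorageDevices(nodeName string) ([]string, error)
}

type nodeDriver interface {
	YankDrive(nodeName, drive string) error
}

// failAllDrives yanks every drive reported on the node instead of only drives[0].
func failAllDrives(v volumeDriver, n nodeDriver, nodeName string) error {
	drives, err := v.GetStorageDevices(nodeName)
	if err != nil {
		return err
	}
	for _, drive := range drives {
		if err := n.YankDrive(nodeName, drive); err != nil {
			return fmt.Errorf("failed to yank drive %s on node %s: %v", drive, nodeName, err)
		}
	}
	return nil
}
```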

…ives.

- Add two new APIs in node driver - YankDrive and RecoverDrive
- Add implementations for ssh driver.
- Add two new APIs in volume driver - GetStorageDevices and RecoverDriver.
adityadani merged commit 593f1a0 into master on Dec 21, 2017.
harsh-px deleted the drive_failure_test branch on January 17, 2018 at 02:47.