Friday, December 12, 2014

The process for disposing of digital information

The shredding of information is technically complicated but doable. What is really challenging is figuring out what should actually be disposed of. EDRM published a paper on the topic called Disposing of digital debris.

The paper cites a 2012 CGOC survey about the state of information in enterprises. Apparently, 69% of retained information has no legal or business value. Low value information includes: short-term reference files; orphaned files; outdated or superseded files; upgrades, safety, and litigation copies; outdated storage technology; and technical duplicates (that may or may not be deletable).

The benefits of eliminating this information include: reduced litigation and production costs ($18K per GB); reduced employee costs; lower storage/infrastructure TCO; reduced litigation and compliance risk; and better predictive capability.

Not surprisingly, the EDRM heralds its own IGRM reference model. Basically, it provides an information lifecycle model that moves information through create/use to retain/archive to dispose. Movement ultimately depends on issues like business/profit; privacy and security; IT efficiency; RIM; and legal.

EDRM notes that people are a huge part of the digital disposal issue. Different groups provide different perspectives:

  • Senior leadership provides support for an information governance program.
  • Business units can identify what is important to them on an ongoing basis.
  • Information security/privacy/compliance needs to have input.
  • Legal must provide insight on the legal hold expectations and processes.
  • Records management maintains the retention schedule and provides guidance on naming conventions, deduplication, etc.
  • IT must maintain all infrastructure and police actual retention and disposal.
Review all policies and procedures concerning the disposition of data. Determine what actually works within the organization and what doesn't. Identify common classification and data retention models. Identify the relevant workflows. Address particular demands (e.g., prior art, patents, contracts). It might also be of some benefit to explore the use of auto-classification and put other technologies in place such as automated legal holds, records retention, deduplication, storage tiers, etc.

Each group will have specific responsibilities:

Records management
  • provide master retention schedule
  • provide taxonomy
  • IT can then interpret how to deal with un-managed or legacy data
  • identify data for legal hold
  • define process for collection
Line of business
  • identify the information it creates
  • identify the information it uses
  • classify retained information
Privacy and security
  • ensure that industry and regulatory obligations are met.
  • manage information based upon business value and duty.

This model also exists as a maturity model which is quite interesting:

Deleting information is also hard to do

The deletion of digital information is also hard to do. The most basic challenge is ensuring that when something is deleted, it is actually deleted. To destroy digital information it needs to be overwritten. NIST 800-88 provides the gory details. This deletion must also happen according to a defined process and should be auditable. Specialist firms like Blancco provide software support for sanitizing files or entire devices. Blancco also provides samples of confirmation reports for LUNs, servers, files, and even mobile devices.
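
The overwrite-then-delete idea can be sketched in a few lines. This is a minimal illustration only, assuming a plain file on a conventional hard disk and a filesystem that writes new data over the old blocks. The function name `overwrite_and_delete` and the shape of the returned audit record are my own inventions, and this is not an implementation of NIST 800-88. On SSDs, journaling or copy-on-write filesystems, and SANs, the old blocks may well survive an in-place overwrite.

```python
import os
import secrets
from datetime import datetime, timezone


def overwrite_and_delete(path, passes=1, chunk_size=1024 * 1024):
    """Overwrite a file with random bytes, then remove it.

    Returns a small dict that could feed an audit log.
    Assumption: the filesystem writes new data over the old blocks.
    """
    size = os.path.getsize(path)
    for _ in range(passes):
        with open(path, "r+b") as handle:
            remaining = size
            while remaining > 0:
                block = min(chunk_size, remaining)
                handle.write(secrets.token_bytes(block))
                remaining -= block
            handle.flush()
            os.fsync(handle.fileno())  # push the overwrite to the device
    os.remove(path)
    return {
        "path": path,
        "bytes_overwritten": size,
        "passes": passes,
        "completed_utc": datetime.now(timezone.utc).isoformat(),
    }
```

The returned record is a nod to the auditability requirement: something like it would need to be written to a tamper-resistant log to count as evidence.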

The process of securely deleting information is typically called data erasure, data clearing, data wiping, or -- more colloquially -- data shredding.

Data erasure can get a little bit more challenging when we look at the realities of IT infrastructure. Backup tapes, for example, will contain copies of information which can cause eDiscovery challenges (e.g., Coleman Holdings). High reliability infrastructures might also cause challenges due to mirroring, as might local site caching. The biggest issues might be challenges resulting from SAN architectures and virtualization, where the actual storage site is a bit mysterious. Blancco provides some details on virtualization challenges. The SAN issue is even more difficult. It might be possible to sanitize entire LUNs or disks, but individual files will be challenging since the SAN abstracts access to the underlying storage blocks. In these situations, it might be necessary to take a proactive approach by encrypting files stored on the SAN and then shredding those files by destroying the keys. Securely deleting email can also be challenging due to the underlying database structure. Secure delete at the server level might require eliminating archive/recycle copies, compressing/defragging the database, and overwriting the empty space on the drive.
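
The encrypt-then-destroy-the-key approach mentioned above for SANs can be illustrated with a toy example. Everything here is an assumption for illustration: the `KeyStore` class is hypothetical, the cipher is a one-time pad built from the standard library so the sketch stays self-contained, and deleting a Python dictionary entry does not securely wipe memory. A real deployment would use a vetted encryption library and a proper key management system.

```python
import secrets


def xor_bytes(data, key):
    """XOR two equal-length byte strings (a one-time pad)."""
    if len(data) != len(key):
        raise ValueError("key must be exactly as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


class KeyStore:
    """Hypothetical in-memory key store, one random key per file."""

    def __init__(self):
        self._keys = {}

    def encrypt(self, file_id, plaintext):
        key = secrets.token_bytes(len(plaintext))
        self._keys[file_id] = key
        return xor_bytes(plaintext, key)

    def decrypt(self, file_id, ciphertext):
        # Raises KeyError once the key has been shredded.
        return xor_bytes(ciphertext, self._keys[file_id])

    def shred(self, file_id):
        """Destroy the key; the ciphertext left on storage is now unreadable."""
        del self._keys[file_id]
```

The point of the design is that the ciphertext can sit on any number of mirrors, caches, and backup tapes; only the key store has to be sanitized.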

The bucket model of information management

I've been beating around the bush on a few different issues here. So what does my model look like? We could talk about information lifecycles but I really want to talk about buckets, specifically this one:

Basically, we need to do one thing: stop the bucket from overflowing. To do this, we have three different options:

Reduce the inflow
This goal might be difficult to attain. In general, humans are information pack rats. We keep more than we know what to do with because we have a hard time determining the long-term value of information... and then we forget about it.

There might be localized opportunities for improvement. With email, for example, we can ensure that we have targeted spam filtering, etc. Managing email might involve a layered approach:

  • People. Convince people to use email effectively and collect less crap. For example, "reply alls" with large attachments might not be such a good thing. It might also be valuable to assist people with actually getting stuff done. Email, for example, might not be the best way to manage certain types of documents or records.
  • Process. Maintain appropriate email hygiene with blacklists, etc. Another process could be to communicate back with users to tell them how much email they're getting, and how to get rid of the email that they don't want. This exercise could be tied to general ediscovery for acceptable use exercises or it could rely on reporting (e.g., who has a lot of unopened email in their inbox).
  • Technology. Get better spam filters or general email security gateway tools. Over 70% of inbound email is illegitimate so managing that inflow is important.
As for regular files, the inflow problem is certainly a challenge. It could be a matter of shifting users' perceptions of what should be kept and what shouldn't be kept. For example, can you help users clearly identify records where there is a retention requirement? Another issue might be personal information management. For example, do we really want people clogging up shares with their personal stuff? Do we advocate for something a bit more personal? For example, users could have a personal drive that syncs with a local drive. It could be the place we put all documentation. Public shares are then the location for information with a clear retention period (e.g., project documentation). Another strategy could be the use of cloud services. For example, Evernote does a great job of off-loading the storage requirement for personal documents of uncertain value. It also, however, introduces the possibility of data leakage or loss.

Increase the size of the bucket

It's the brute force approach. Gear up: more storage, more servers, etc. A bigger bucket will take a longer time to fill up... but it will fill up. There are a few different things that are required to increase the bucket. The first is really the effectiveness of our capacity management strategy. Ad hoc IT shops might not really have anything in place for capacity management so any expansion will come as a surprise. COBIT's BAI04 gives us a run-down on what we should, ideally, have in place for capacity management:
  • Manage availability and capacity
    • Assess current availability, performance and capacity and create a baseline
      • Availability, performance and capacity baselines
      • Evaluations against SLAs
    • Assess business impact
      • Availability, performance and capacity scenarios
      • Availability, performance and capacity business impact assessments
    • Plan for new or changed service requirements
      • Prioritized improvements
      • Performance and capacity plans
    • Monitor and review availability and capacity
      • Availability, performance and capacity reports
    • Investigate and address availability, performance, and capacity issues
      • Performance and capacity gaps
      • Corrective actions
      • Emergency escalation procedure

That all seems a bit onerous.
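
Even without the full BAI04 apparatus, the bucket itself is easy to put numbers on. The sketch below is a back-of-the-envelope projection, assuming constant daily inflow and outflow (which real storage never has). The function name and parameters are hypothetical.

```python
def days_until_full(capacity_gb, used_gb, inflow_gb_per_day, outflow_gb_per_day=0.0):
    """Return the days until the bucket overflows, or None if it never does."""
    net_growth = inflow_gb_per_day - outflow_gb_per_day
    if net_growth <= 0:
        return None  # outflow keeps pace with inflow
    free = capacity_gb - used_gb
    if free <= 0:
        return 0.0  # already full
    return free / net_growth
```

With 2,000 GB of capacity, 1,500 GB used, and 5 GB arriving per day, `days_until_full(2000, 1500, 5)` gives 100 days. Doubling the bucket to 4,000 GB stretches that to 500 days, while disposing of 3 GB per day on the original bucket gives 250 days. All three levers move the same number.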

The one area where we can immediately increase the size of the bucket is through appropriate management of the resources that we have. For example, has everything been routinely purged and defragmented (ideally, with the white space overwritten)? Maintaining an Exchange database, for example, can provide some additional lifespan. Challenges can emerge from having to dismount databases, etc.

Another approach is to not necessarily make the bucket bigger but to get another bucket. For example, one could implement email archiving or use Exchange archiving to improve performance and capacity.

These same challenges apply to file-based storage. Again, proper maintenance might be appropriate and administrators can be proactive in culling collections without necessarily deleting necessary documents and records. Storage reports certainly help (e.g., Windows Server's FSRM reports like Duplicate Files, Large Files, Least Recently Accessed Files, and the File Screening Audit to eliminate files that contravene acceptable use policies -- MP3s, etc.). These documents either shouldn't be on the drives in the first place or are overly resource intensive.

Increase the outflow

Getting rid of stuff is hard. Humans really are terrible information hoarders. People keep more than they need because they tend to over-estimate the potential value of information; they then forget where they stored that information.

So, how can we help? We can encourage our users to clean out their inboxes and empty deleted items. Unfortunately, people lack strategies for actually accomplishing this goal and need some instruction. General guidance could be:

  1. Sort email by sender. Delete junk. You will probably end up with a list of senders that you actually know because they are colleagues, partners, etc.
  2. Sort by subject. Delete long meandering threads that have little persistent value.
  3. Sort by date. Deal with everything older than four months.
In step 3 I used the expression "deal with". What do we do with information that might have value but might not? Some users may elect to export that information to a personal information management tool such as Evernote. Users might also be required to maintain some of that email as records and could export or forward those messages to a records management system. More likely, they will maintain the email in folders.
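
The three sorting passes above amount to grouping by sender, grouping by subject, and filtering by age. A minimal sketch, assuming each message is a plain dictionary with hypothetical `sender`, `subject`, and `received` keys, and treating "four months" as roughly 120 days:

```python
from collections import defaultdict
from datetime import timedelta


def group_by(messages, field):
    """Group messages by a field such as 'sender' or 'subject'."""
    groups = defaultdict(list)
    for message in messages:
        groups[message[field]].append(message)
    return dict(groups)


def older_than(messages, today, days=120):
    """Return the messages received more than `days` days before `today`."""
    cutoff = today - timedelta(days=days)
    return [m for m in messages if m["received"] < cutoff]
```

Grouping by sender surfaces the junk in bulk, grouping by subject surfaces the long threads, and `older_than` produces the pile that needs to be "dealt with".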

Determining a folder structure is inherently difficult due to the nature of categorization -- it is personal and idiosyncratic. The other issue is that people keep information for its use as a reminder. It's perhaps better to keep everything relatively flat and build in some sort of function for reminding. 

David Allen's Getting Things Done (GTD) productivity technique offers a variety of suggestions for email management but basically adheres to an inbox zero philosophy. It also suggests a minimal filing system:
  • Inbox
  • @ 1. Action. Anything requiring action goes into this folder. It's basically a to-do list.
  • @ 2. Waiting. Anything for which one is waiting for a response goes in here. In some cases, the associated action may have been completed but we can't dispose of it.
  • @ 3. A-Z. Information that must be kept goes into this folder. Users can create sub-folders for particular processes or projects. Encourage those users to take a functional approach, that is, file things according to the necessary action. Some users will certainly create elaborate structures that will be largely empty.
  • @ 4. Archive. Information that might be necessary goes into the archive. Users can get access details via search. In old-school filing systems, this kind of collection would typically be organized by the name of the correspondence sender. Email clients do this for us automatically.
What's with the weird prefixes? Putting special characters at the beginning of a folder name enables us to group them and put them into some sort of logical order. Otherwise, the folders would be listed alphabetically which may -- or may not -- be of value.
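
The effect of the prefixes is easy to check. This is a small sketch that assumes plain code-point ordering; actual mail clients may use locale-aware collation or pin special folders like the Inbox, so the result there could differ. The extra "Projects" folder is made up for the example.

```python
folders = ["Projects", "@ 3. A-Z", "Inbox", "@ 1. Action", "@ 4. Archive", "@ 2. Waiting"]

# "@" (U+0040) sorts before every ASCII letter, so the prefixed folders
# group together at the top, and the digits then fix their relative order.
ordered = sorted(folders)
```

Here `ordered` comes out as `['@ 1. Action', '@ 2. Waiting', '@ 3. A-Z', '@ 4. Archive', 'Inbox', 'Projects']`.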

Individual folders can also be associated with specific retention periods. Exchange, for example, now enables the use of specific retention tags to automate disposition. The challenge, of course, is getting the users to actually put email in the right locations!

Digital files also pose problems. Earlier this year, Mimi Dionne wrote a pair of articles about cleaning up file shares for CMSwire (article 1, article 2).

The first part of a file share clean up is statistical analysis. Dionne recommends getting the following for each file:

Must haves

  • file name
  • file extension
  • file type
  • volume
  • size
  • creation time
  • owner user name
  • last access time
  • last modified time
  • days since last access
  • days since last modify
  • days since file creation
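
Most of the must-haves can be pulled straight from the filesystem. The sketch below is one way to do it with the Python standard library; the function name `inventory` and the dictionary keys are my own. Creation time and owner are left out on purpose because they are platform-specific (on Unix, for instance, `st_ctime` is the metadata change time rather than the creation time).

```python
import os
import time


def inventory(root, now=None):
    """Walk `root` and yield one metadata dict per file."""
    now = time.time() if now is None else now
    seconds_per_day = 86400
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # unreadable or vanished file; skip it
            yield {
                "file_name": name,
                "file_extension": os.path.splitext(name)[1].lower(),
                "volume": os.path.splitdrive(os.path.abspath(path))[0],
                "size": info.st_size,
                "last_access_time": info.st_atime,
                "last_modified_time": info.st_mtime,
                "days_since_last_access": int((now - info.st_atime) // seconds_per_day),
                "days_since_last_modify": int((now - info.st_mtime) // seconds_per_day),
            }
```

The rows can be written to a CSV or loaded into a spreadsheet for the statistical analysis. Note that many systems disable or coarsen access-time updates, so "last access" should be treated with some suspicion.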

Nice to have features include:

  • attributes
  • read only
  • hidden
  • system flag
  • encrypted flag
  • non content indexed
  • duplicate key string

You can then start mapping the retention schedule to various combinations of keywords and extensions. Typically you will get:

  • miscellaneous files (79%)
  • container files (10%)
  • data files (4.6%)
  • text files (4.4%)
  • temporary/backup files (1.3%)
  • graphic files (0.2%)
  • system files (0.2%)
  • virtualization files (0.2%)
  • database files
  • office files and documents
  • program files
  • internet files
  • software dev files
  • video files
  • configuration files
  • mail files
  • audio files
  • help files
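
A breakdown like the one above can be approximated by mapping extensions to categories. In this sketch the extension table is entirely hypothetical (a real one would be derived from the retention schedule), and the shares are computed by file count, which is an assumption since the percentages above could equally be by volume.

```python
import os
from collections import Counter

# Hypothetical mapping; a real one would come from the retention schedule.
CATEGORY_BY_EXTENSION = {
    ".zip": "container files",
    ".csv": "data files",
    ".txt": "text files",
    ".tmp": "temporary/backup files",
    ".bak": "temporary/backup files",
    ".jpg": "graphic files",
    ".pst": "mail files",
}


def category_shares(file_names):
    """Return each category's share of the file count, as a percentage."""
    counts = Counter(
        CATEGORY_BY_EXTENSION.get(os.path.splitext(name)[1].lower(), "miscellaneous files")
        for name in file_names
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}
```

Anything without a recognized extension falls into "miscellaneous files", which is presumably how that bucket ends up so large.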

This information can also be used to determine the relative age of documents and the impact of aggressive file retention periods. For example, what would a shorter retention period do to storage requirements, based on historical data?
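
The "what would a shorter retention period do" question is simple arithmetic once the file ages are known. A rough sketch, assuming each file is reduced to a (size, days since last modify) pair and that age since last modification is the retention trigger; both function names are hypothetical.

```python
def reclaimable_bytes(files, retention_days):
    """Sum the size of files not modified within `retention_days`.

    `files` is an iterable of (size_in_bytes, days_since_last_modify) pairs.
    """
    return sum(size for size, age in files if age > retention_days)


def retention_scenarios(files, candidate_periods):
    """Map each candidate retention period to the bytes it would free."""
    files = list(files)
    return {days: reclaimable_bytes(files, days) for days in candidate_periods}
```

For example, with files of 100, 200, and 300 bytes aged 30, 400, and 800 days, a one-year period (365 days) would free 500 bytes while a two-year period (730 days) would free only 300.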

Study the patterns over time. These observations might encourage better conversations with end users about what should -- and what shouldn't -- be in the file share.

The content categories should identify "easy deletes", objects that are redundant, obsolete, or transitory (ROT). You could win back a quick 1%. Removing duplicates might get you another 2%.
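
Exact duplicates are the easiest kind of ROT to find mechanically. The sketch below groups files by a SHA-256 hash of their contents; it is my own illustration rather than Dionne's method, and it only catches byte-identical copies, not near-duplicates or renamed revisions.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under `root` by content hash; return groups of two or more."""
    by_hash = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            by_hash[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Which copy to keep is still a human decision; the hash only says the contents match.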

It's also helpful to divide data between administrative and technical functions.

Beyond that, you might need a more sophisticated approach.

Thursday, December 11, 2014

Destroying information is hard to do

There is a season for everything. Every document or record created must, inevitably, be destroyed.

Let's start with physical documents. It's not as easy as just dumping the documents in a dumpster. We need controlled processes to ensure that everything is appropriately managed. There are any number of standards stating that you have to ensure secure disposal, but how do you actually do it?

The National Association for Information Destruction (NAID) -- who knew there was such a thing? -- advocates for that particular industry and offers the Certified Secure Destruction Specialist (CSDS) Accreditation Program. It also certifies specific service providers. The certification program manual is incredibly detailed in the controls required to minimize the inherent risk of a document destruction provider.

Validation of the destruction process is particularly important. A document issued in conjunction with Ontario's Privacy Commissioner -- who is, incidentally, the sister of children's entertainer Raffi -- lists what should be contained in the destruction authorization document:
  • date of destruction
  • name, title, contact info, and department of person submitting the authorization
  • description of the information or media being destroyed
  • retention schedule reference number
  • relevant serial or tracking numbers
  • quantity being destroyed
  • origination or acquisition year (range)
  • retention expiration date
  • location of the records
  • reason for destruction
  • method of destruction
  • whether destruction is to be performed in-house
  • approved contract and vendor number, if relevant
  • approved destruction method 
Actual destruction should be accompanied by a Certificate of Destruction, including:
  • company name
  • unique serialized transaction number
  • transfer of custody
  • reference to the terms and conditions
  • acceptance of fiduciary responsibility
  • the date and time the information ceased to exist
  • location of the destruction
  • witness to the destruction
  • method of destruction
  • reference to compliance with the contract
  • signature
Internally generated certificates might also include:
  • who conducted the destruction
  • when collection began
  • type of media collected 
  • specific containers targeted
  • time at which collection was completed
  • start time of destruction
  • location of destruction
  • equipment used
  • quantity destroyed
  • destruction completion time
Relevant logs might include the destruction of records not subject to retention (incidental or duplicates) and the results of random sampling audits.
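
The Certificate of Destruction fields listed above map naturally onto a simple record type. This is a sketch only: the class, the field names, and the choice of plain strings are my own assumptions, not a NAID or Privacy Commissioner format.

```python
from dataclasses import dataclass, fields


@dataclass
class CertificateOfDestruction:
    company_name: str
    transaction_number: str
    transfer_of_custody: str
    terms_reference: str
    fiduciary_acceptance: str
    ceased_to_exist_at: str
    location: str
    witness: str
    method: str
    contract_compliance_reference: str
    signature: str

    def missing_fields(self):
        """Return the names of any fields left blank."""
        return [f.name for f in fields(self) if not str(getattr(self, f.name)).strip()]
```

A certificate with a non-empty `missing_fields()` result should probably be bounced back to the vendor before it is filed.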

What about digital documents and records?

We can certainly follow a similar process when we're dealing with well-defined sets of records or documents. Things might, however, get a bit more challenging when we're working with the type of digital documents in email and file shares.

Tuesday, December 09, 2014

Tags or folders?

I like tags. I often dream of a future where we have no folders and I can access everything in post-coordinated fashion using search... but I'm, apparently, not typical.

Bergman and co-authors explored this issue in the JASIST paper "Folder vs. tag preference in personal information management."

Tags are apparently preferred because they enable "multiple classification" and a lack of "hierarchical location" so that information doesn't get hidden.

The reminding function of information piles is an important concept. As long as information is on the desktop people know that they have to do something with it.

As an aside, how do we tag messages in email? Not easily, apparently. There are Outlook add-ons like Email Tags from Outlook4Business. It seems to work via an archiving process using Outlook categories.

U of Washington's Records Management Guidance

Some more academic perspective. It is largely consistent with DIRKS but gives us a valuable list of how to analyze existing records:

  • who creates the records
  • who uses the records
  • how are records requested
  • how often are various types of records requested
  • how long do records remain current
  • how many people need access to the records
  • how much equipment is available to store the records
  • how much space is available for equipment/growth
  • which records are confidential
  • are there legal requirements for retaining records
  • which are vital records

The guidance also gives us an overview of the types of file classifications that one typically finds:

  • Administrative files -- internal admin and operation
  • Organizational files -- the relationship of an office with other offices
  • Program files -- documentation of activities and programs
  • Case files -- documentation of a specific event, project, person, or transaction

Categorization and Classification, revisited

An incredibly ad hoc search led me to a 2004 Library Trends article by Elin Jacob, entitled "Classification and Categorization: A Difference that Makes a Difference."

Some insights:

"Categorization divides the world of experience into groups or categories whose members share some perceptible similarity within a given context. That this context may vary and with it the composition of the category is the very basis for both the flexibility and the power of cognitive categorization." (p. 518)

"As experimentally-based categories evolve into well-defined, domain-specific classes that facilitate sharing of knowledge without loss of information, they lose their original flexibility and plasticity as well as the ability to respond to new patterns of similarity." (p. 519)

Classical theory of categorization -- rigid hierarchies of shared features. Empirical research indicated that people are very good at assigning a graded structure for category membership. For example, consider the set |robin, pigeon, ostrich, butterfly, chair|. People can assign each of these a score for how well they belong to the category of "bird". This observation challenges the idea "that there is an explicit inclusion/exclusion relationship between an entity and a category."

"Classification" refers to the use of a "representational tool used to organize a collection of information resources." Jacob explains:

"Classification as process involves the orderly and systematic assignment of each entity to one and only one class within a system of mutually exclusive and non-overlapping classes. This process is lawful and systematic: lawful because it is carried out in accordance with an established set of principles that governs the structure of classes and class relationships; and systematic because it mandates consistent application of these principles within the framework of a prescribed order of reality. The scheme itself is artificial and arbitrary." (p. 522)

Jacob explores different types of classification starting with the most rigid: taxonomic classification, as exemplified by the Linnaean system. These hierarchies are incredibly valuable for stabilizing nomenclature and facilitating knowledge transmission. They are also limited in that they constrain "the information context by limiting the identification of knowledge-bearing associations to hierarchical relationships between classes."

Classification Schemes represent another approach. Jacob cites Shera who noted that all classification schemes rely on four assumptions: universal order, unity of knowledge, similarity of class members, and intrinsic essence. Bibliographic Classification Schemes have traditionally been a deductive approach. Faceted schemes are inductive, requiring an analysis of the "universe of knowledge" to identify appropriate properties and features. These terms can then be grouped into hierarchies. Jacob notes:

"The result is not a classification scheme but a controlled vocabulary of concepts and their associated labels that can be used, in association with a notation and prescribed citation order, to synthesize the classes that will populate the classification scheme."

Jacob goes back to Shera to describe the seven properties of a bibliographic classification scheme:

  • linearity
  • inclusivity of all knowledge within the classification's universe
  • well-defined, specific, and meaningful class labels
  • an arrangement of classes that establishes relationships between them
  • distinctions between classes that are meaningful
  • a mutually exclusive and nonoverlapping class structure
  • an infinite hospitality that can accommodate every entity in the bibliographic universe

Classification and categorization are related concepts but they are not the same thing:

"While traditional classification is rigorous in that it mandates that an entity either is or is not a member of a particular class, the process of categorization is flexible and creative and draws nonbinding associations between entities." (p. 527)

Basically, classification is the process of forcing entities into an arbitrary system based on specific rules while categorization drives definition based on context. Borges's taxonomy, for example, seems like a terrible classification because the underlying rules are impossible to divine but it might be an effective categorization based on the context in which it was created and for the relevant epistemic community.

Jacob's table describing the differences between classification and categorization is sufficiently interesting that I will include it completely:

Interestingly, Jacob notes that categories don't necessarily provide organization. The categories might be relevant to a particular group member but they might not actually demonstrate any hierarchical structure, thereby introducing challenges of access and navigation.

Free-text search, for example, represents "a very elementary mechanism for grouping." This limitation means that "a free-text retrieval system cannot contribute to an information environment that will support or enhance the value of system output through the establishment of meaningful context." More nuanced controls are post-coordinated systems, pre-coordinated systems, and classification systems. Subject headings, for example, enable multiple access points while a classification system enables only one (e.g., shelving location).

It's with subject headings that we get some challenges:

"Unlike the systematic and principled structure of a classification system, the structure of a subject heading system is frequently unprincipled, unsystematic, and poly-hierarchical. And, unlike the relationships established between well-defined and mutually exclusive classes in a classification, any relationships created between the categories of a subject heading system cannot be assumed to be either meaningful or information-bearing." (p. 536)

Post-coordinated systems also introduce some challenges. According to Jacob, they are "simply mechanisms for grouping, not systems of organization."

Whew... I'm not sure if I'm any further ahead after reading that. Basically, categorization is dependent on context and may -- or may not -- actually lead to any sort of organization. Classification, however, is a "lawful and arbitrary" system of forcing things into an organizational system. This system enforces the vocabulary and conventions of a discipline. Pre-coordinated and post-coordinated subject headings provide some middle ground.

But do users actually care?

Monday, December 08, 2014

Training 2014/12/5 #002

Defense from roundhouse kick

Position: standing; defense; face-to-opponent

A good chunk of training today was really about how to kick properly and some of the differences between karate and other arts. The power of the kick comes from chambering (i.e., bringing your knee high), rotating, and extending. Not surprisingly, the power comes from the hip.

UPDATE -- think of kicking across and down.

I'm not great at kicking!

The defense is straightforward. Recognize that the kick will probably be coming at your left side. Ideally, block it by keeping your elbow tight to your side. As you block, grab the leg. Step towards your opponent with your right foot, grabbing and shoving through their right shoulder. Inside trip their standing leg. They will go down. Try to keep your right arm straight (i.e., don't leak energy through the elbow).

They're down and you still control one leg. Join your hands in a guillotine grip and go for the achilles lock. My training partner suggested that I start relatively high on the calf, maintain pressure until my hand locks into the achilles position. Maintain good posture; extend your hips, etc.

The other option is to step over with your right leg and go for the knee bar... which will require more explanation at some other time.

UPDATE -- really keep the posture on the achilles lock. No pooping dog stance!
